Introduction to Regression


What is Regression Analysis?

From the first paragraph of the main Wikipedia article on Regression Analysis (with some minor changes in language based upon our preferences):

Regression analysis is a statistical process for estimating the relationships among variables.

  • It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable (we will usually use the term ‘outcome’ or ‘response’) and one or more independent variables (or ‘predictors’).
  • More specifically, regression analysis helps one understand how the typical value of the response variable changes when any one of the predictors is varied, while the other predictors are held fixed.
  • Most commonly, regression analysis estimates the conditional expectation of the response variable given the predictors – that is, the average value of the response variable when the predictors are fixed.

By the end of the course, the more complex parts of this definition, such as what we mean by conditional expectation, will become clear.

One main aspect of regression analysis is the ability to include several predictors in our model simultaneously. There are a few reasons we may want to create a regression model from several predictors. We may wish to

  • Find a model which will primarily be used for prediction.
    • Prediction error is of primary concern.
    • Estimation of the effects of individual predictors is not usually important.
  • Evaluate the effect of a predictor of primary interest in the population.
    • This is an inferential method in which confounding is a concern for observational data.
    • Some predictors are likely necessary, as standard variables such as age, gender, etc. are often included in the model automatically.
    • In randomized experiments confounding is not usually an issue, but other predictors can be included for a variety of reasons.
  • Identify the important independent predictors of a response.
    • This is very difficult!
    • Both causal interpretation and statistical inference can be problematic.
    • Possible issues: false-positive associations, potential complexity of causal pathways, difficulty of determining best model.

Many standard comparative analyses can also be considered as subsets of the general framework of regression models, including

  • Comparing two groups: t-test
  • Comparing more than two groups: one-way ANOVA or chi-square
  • Investigating two factors with or without an interaction: two-way ANOVA – with or without interaction
  • Investigating one or more factors with a continuous covariate: ANCOVA – with or without interaction
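To make this connection concrete, here is a small sketch in Python with NumPy (simulated data with invented group means, not part of the original notes) showing that a simple linear regression on a 0/1 group indicator estimates exactly the difference in group means that a two-sample t-test compares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups coded as a 0/1 indicator variable (hypothetical data)
g = np.repeat([0, 1], 50)
y = 5 + 2 * g + rng.normal(0, 1, size=100)

# The quantity a two-sample t-test compares: the difference in group means
mean_diff = y[g == 1].mean() - y[g == 0].mean()

# The slope from a simple linear regression of y on the indicator
slope, intercept = np.polyfit(g, y, 1)

# The two estimates are algebraically identical
print(f"mean difference: {mean_diff:.4f}, regression slope: {slope:.4f}")
```

The same idea extends to the other rows of the list: ANOVA corresponds to regression on a multi-level categorical predictor, and ANCOVA to a model mixing categorical and continuous predictors.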

Steps in Regression Analysis

We will go through our own version of the steps in regression during the semester, but we won’t necessarily have a good chance to put everything together, and some tasks involved in a typical regression analysis we will not get to practice.

So … before we begin, please carefully review the following reading, which provides a good summary of how a regression analysis proceeds in practice from start to finish. You may wish to return to this reading later in the semester or in the future as you begin to implement more complete analyses.

For our purposes, we will focus on what happens after we have the data in hand. If you conduct an analysis in the future which starts at the data collection phase, you should review the full steps carefully and pay particular attention to what happens prior to collecting your data.

Once you have your data … you generally cannot solve any problems created in the data collection phase. Plus … it is ALWAYS a good idea to consult a statistician when planning any study which will involve data collection.

Here is a summary of the steps we will cover in detail in this course: Exploratory Analysis - Tentative Models - Evaluate & Refine - Identify Best - Make Inferences

  • In regression analysis, once we have our data, we begin with exploratory data analysis
    • Here we check for errors, summarize our sample, and compare this to what we would expect for the population if possible.
    • There will usually be some sort of descriptive summary of the sample provided with any regression analysis.
    • Often, statistical tests and confidence intervals for effects are provided relating the outcome to each important predictor individually, using two-variable methods such as t-tests, ANOVA, simple linear regression, correlation, or chi-square tests as appropriate.
  • Then we develop tentative regression models taking into consideration our primary goal and the variable selection process.
  • We evaluate the suitability of those models and revise as needed to refine our model (or models).
  • Once we identify the best model (or maybe a few models) we can make inferences from our model.
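As a rough sketch of how these steps flow together (Python with NumPy and simulated data; the variables and numbers are invented for illustration, and a real analysis would use fuller diagnostics and formal inference):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a continuous predictor and a continuous outcome
x = rng.uniform(20, 70, 80)                 # e.g., age
y = 100 + 0.8 * x + rng.normal(0, 8, 80)    # e.g., systolic blood pressure

# Step 1 - exploratory analysis: descriptive summaries of the sample
print(f"x: mean={x.mean():.1f}, sd={x.std(ddof=1):.1f}")
print(f"y: mean={y.mean():.1f}, sd={y.std(ddof=1):.1f}")
print(f"correlation: {np.corrcoef(x, y)[0, 1]:.2f}")

# Step 2 - tentative model: a simple linear regression
slope, intercept = np.polyfit(x, y, 1)

# Step 3 - evaluate: residuals should scatter randomly around zero
resid = y - (intercept + slope * x)
print(f"residual sd: {resid.std(ddof=1):.1f}")

# Step 4 - inference (informal here): the fitted slope estimates the
# change in the mean of y per one-unit increase in x
print(f"estimated slope: {slope:.2f}")
```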

We will spend the rest of the course learning the details of this process.

Types of Regression Models

There are a vast number of possible regression analyses. We will only cover a very small number of these and yet the methods we will cover allow for investigation of a wide range of questions using data.

Some Important Assumptions for Our Course

  • We are assuming a dataset which can be considered a random sample from the population of interest.
  • Each observation should be independent of every other observation – we say we have independent observations.
  • Important: We will not cover methods to handle multiple observations of the same variables on a given subject in this course, such as the type of data which would be common in a longitudinal study.

In general we can classify the methods covered in this course as well as many others using this diagram.

Models broken into Simple or Multiple - each of these broken into linear or non-linear.

Simple models involve only one predictor, although we will see with multi-level categorical variables that this definition is not exactly clear-cut.

  • Simple models are often called UNADJUSTED models and the estimates from them UNADJUSTED estimates.

Multiple Regression models involve multiple predictors, possibly with complex interactions.

  • Multiple regression models are often called ADJUSTED models and the estimates from them ADJUSTED estimates.
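A small simulated sketch (Python/NumPy, with hypothetical coefficients) of why adjusted and unadjusted estimates can differ. When the predictor of interest is correlated with a confounder, the unadjusted (simple) slope mixes the two effects; the adjusted (multiple) slope isolates the predictor’s own effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

x2 = rng.normal(0, 1, n)                        # a confounder
x1 = 0.7 * x2 + rng.normal(0, 1, n)             # predictor of interest, correlated with x2
y = 1.0 * x1 + 2.0 * x2 + rng.normal(0, 1, n)   # true effect of x1 is 1.0

# UNADJUSTED estimate: simple regression of y on x1 alone
unadj_slope, _ = np.polyfit(x1, y, 1)

# ADJUSTED estimate: multiple regression of y on x1 and x2 (least squares)
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
adj_slope = beta[1]

# The unadjusted slope absorbs part of x2's effect; the adjusted
# slope is close to the true value of 1.0
print(f"unadjusted: {unadj_slope:.2f}, adjusted: {adj_slope:.2f}")
```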

Within both simple and multiple regression models, we can classify methods as either:

  • Linear – we will cover simple and multiple linear regression extensively

OR

  • non-linear – among these, we will cover logistic regression and briefly discuss poisson regression

However, the “right side” of our model in this course will in fact always express a “linear” combination of variables, and so much of what we learn for linear models will translate to non-linear models.
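A brief numerical sketch of this point (Python/NumPy; the coefficient values are hypothetical): both model types begin with the same linear predictor b0 + b1*x, and logistic regression simply passes it through the logistic function so the result is a probability:

```python
import numpy as np

b0, b1 = -1.0, 0.5              # hypothetical coefficients
x = np.array([0.0, 2.0, 4.0])

# The same linear combination appears on the "right side" of both models
eta = b0 + b1 * x               # linear predictor

# Linear regression models the mean of Y as eta directly.
# Logistic regression transforms eta to a probability in (0, 1).
p = 1 / (1 + np.exp(-eta))

print(eta)            # [-1.  0.  1.]
print(p.round(3))
```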

Regression for a Continuous Outcome

We will start this course with regression models which focus on a continuous outcome (or an outcome which, although discrete, has enough possible values to be considered continuous).

When we look at relationships between one quantitative variable and one categorical variable such as in a two-sample t-test or ANOVA, we compare the means of the quantitative outcome variable within the levels of the categorical explanatory variable.

Regression, in fact, is no different. And even methods involving categorical outcome variables can often be framed in terms of some type of average.

EXAMPLE:

Here we have a scatterplot with an overlay of boxplots for a categorized version of the continuous predictor.

A scatterplot showing age on the x-axis and systolic blood pressure on the y-axis, with boxplots overlaid to illustrate the trend we will estimate, which goes through the mean value at each x.

Our goal in regression is to produce a model which approximates the trend in the average Y for a given X. 

PRINCIPLE: In linear regression with a continuous outcome, we are asking how the average Y changes as the value of one predictor increases (holding other predictors constant, if any).

Regression models have two essential ingredients:

  • A tendency of the response variable Y to vary with the predictor variable X in a systematic fashion (deterministic component)
  • A stochastic scattering of points around the curve of statistical relationship (random component)

The following images illustrate this idea using the specific assumptions for linear regression using a quantitative outcome, where we will assume that

  • At each X (for simple linear regression) or each combination of X’s (for multiple linear regression) there is an underlying normal distribution in the population – this is mostly needed for inference using confidence intervals and hypothesis tests, particularly for smaller samples.
  • This normal distribution is always the same except for the mean, which varies linearly with the X’s included in our model.

illustration of a normal distribution centered at the regression line to visualize the assumption of normal errors in regression models

Remember as we develop more complex models that the idea is the same as these illustrations.

  • How does the mean of Y change with the values of X?
  • Our resulting models will predict the MEAN response.
  • We will be able to discuss and make inferences about changes in the MEAN response as values of X vary.
  • But, we will have to manage and account for the variation around the mean response.
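To make these last points concrete, here is a small simulation (Python/NumPy; the intercept, slope, and error SD are hypothetical values loosely mimicking the blood-pressure example). At a fixed x, individual responses scatter normally around the line, and what the model predicts is the mean of that scatter:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical model: mean of Y is linear in X (deterministic component),
# with normal scatter around the line (random component)
b0, b1, sigma = 110, 0.6, 5

# Simulate many responses at one fixed value of x
x0 = 50
y_at_x0 = b0 + b1 * x0 + rng.normal(0, sigma, 10_000)

# The model's prediction at x0 is the MEAN of this distribution
print(f"model-predicted mean at x={x0}: {b0 + b1 * x0}")
print(f"average of simulated Y values:  {y_at_x0.mean():.1f}")
print(f"SD around the mean (sigma):     {y_at_x0.std(ddof=1):.1f}")
```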