Introduction to Regression
- What is Regression Analysis?
- Steps in Regression Analysis
- Types of Regression Models
- Regression for a Continuous Outcome
One main aspect of regression analysis is the ability to include several predictors in our model simultaneously. There are a few reasons we may want to create a regression model from several predictors. We may wish to
- Find a model that will primarily be used for prediction.
- Prediction error is of primary concern.
- Estimation of the effects of individual predictors is not usually important.
- Evaluate the effect of a predictor of primary interest in the population.
- This is an inferential method in which confounding is a concern for observational data.
- Some predictors are likely necessary as standard covariates; variables such as age and gender are often included in the model automatically.
- In randomized experiments, confounding is not usually an issue, but other predictors can be included for a variety of reasons.
- Identify the important independent predictors of a response.
- This is very difficult!
- Both causal interpretation and statistical inference can be problematic.
- Possible issues: false-positive associations, potential complexity of causal pathways, difficulty of determining best model.
Many standard comparative analyses can also be considered as subsets of the general framework of regression models, including
- Comparing two groups: t-test
- Comparing more than two groups: one-way ANOVA or chi-square
- Investigating two factors with or without an interaction: two-way ANOVA – with or without interaction
- Investigating one or more factors with one or more continuous covariates: ANCOVA – with or without interaction
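To make the first connection above concrete, here is a minimal sketch (with hypothetical data) showing that a two-sample comparison is the same model as a simple linear regression on a 0/1 group indicator: the least-squares slope is exactly the difference in group means.

```python
# Sketch: a two-group comparison (as in a t-test) expressed as simple
# linear regression on a 0/1 group indicator. Data are hypothetical.
from statistics import mean

group_a = [4.1, 3.8, 5.0, 4.6, 4.3]   # group coded x = 0
group_b = [5.2, 5.9, 5.5, 6.1, 5.6]   # group coded x = 1

x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

# Least-squares estimates for y = b0 + b1 * x
xbar, ybar = mean(x), mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

# The slope equals the difference in group means, and the intercept
# equals the mean of the x = 0 group.
assert abs(b1 - (mean(group_b) - mean(group_a))) < 1e-9
assert abs(b0 - mean(group_a)) < 1e-9
```

The same pattern extends to the other rows of the list: ANOVA uses several indicator variables, and ANCOVA simply adds a continuous covariate to them.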
We will be going through our own version of the steps in regression during the semester, but we won’t necessarily have a good chance to put everything together, and some tasks involved in a typical regression analysis we will not get to practice.
So … before we begin, please carefully review the following reading, which provides a good summary of how a regression analysis proceeds in practice from start to finish. You may wish to return to this reading later in the semester, or in the future as you begin to implement more complete analyses.
For our purposes, we will focus on what happens after we have the data in hand. If you conduct an analysis in the future that starts at the data-collection phase, you should review the full steps carefully and pay particular attention to what happens prior to collecting your data.
Here is a summary of the steps we will cover in detail in this course.
- In regression analysis, once we have our data, we begin with exploratory data analysis.
- Here we check for errors, summarize our sample, and compare this to what we would expect for the population if possible.
- There will usually be some sort of descriptive summary of the sample provided with any regression analysis.
- Often, statistical tests and confidence intervals for effects are provided relating the outcome to each important predictor individually using two-variable methods such as t-tests, ANOVA, simple linear regression, correlation, or chi-square tests as appropriate.
- Then we develop tentative regression models taking into consideration our primary goal and the variable selection process.
- We evaluate the suitability of those models and revise as needed to refine our model (or models).
- Once we identify the best model (or maybe a few models) we can make inferences from our model.
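The steps above can be sketched in miniature. This is a minimal illustration with hypothetical data: summarize the sample, fit a tentative simple linear regression, and check the residuals before trusting the model.

```python
# Minimal sketch of the workflow: summarize, fit a tentative model,
# then evaluate it via residuals. Data are hypothetical.
from statistics import mean, stdev

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.6, 4.8, 5.1, 6.2]

# 1. Exploratory data analysis: descriptive summary of the sample
print(f"x: mean={mean(x):.2f}, sd={stdev(x):.2f}")
print(f"y: mean={mean(y):.2f}, sd={stdev(y):.2f}")

# 2. Tentative model: simple linear regression by least squares
xbar, ybar = mean(x), mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

# 3. Evaluate suitability: residuals should show no systematic pattern
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
assert abs(sum(residuals)) < 1e-9  # least squares forces residuals to sum to 0
```

In a real analysis, step 3 would involve residual plots and diagnostics, and steps 2–3 would be repeated as the model is revised before any inference is made.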
We will spend the rest of the course on learning the details of this process.
There are a vast number of possible regression analyses. We will cover only a small number of these, and yet the methods we cover allow for investigation of a wide range of questions using data.
In general, we can classify the methods covered in this course, as well as many others, using this diagram.
We will start this course with regression models which focus on a continuous outcome (or an outcome which, although discrete, has enough possible values to be considered continuous).
When we look at relationships between one quantitative variable and one categorical variable such as in a two-sample t-test or ANOVA, we compare the means of the quantitative outcome variable within the levels of the categorical explanatory variable.
Regression, in fact, is no different. And even methods involving categorical outcome variables can often be framed in terms of some type of average.
The following images illustrate this idea using the specific assumptions for linear regression using a quantitative outcome, where we will assume that
- At each X (for simple linear regression) or each combination of X’s (for multiple linear regression) there is an underlying normal distribution in the population – this is mostly needed for inference using confidence intervals and hypothesis tests, particularly for smaller samples.
- This normal distribution is always the same except for the mean, which varies linearly with the X’s included in our model.
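These two assumptions can be simulated directly. The sketch below (with assumed, hypothetical values for the intercept, slope, and standard deviation) draws responses at two fixed values of X: the means follow the line, while the spread is the same at every X.

```python
# Sketch of the linear regression assumptions: at each fixed x the
# population of responses is normal with the same sd, and only the
# mean shifts linearly with x. Parameters below are hypothetical.
import random
from statistics import mean, stdev

random.seed(1)
beta0, beta1, sigma = 2.0, 0.5, 1.0  # assumed true intercept, slope, sd

def draw_y(x, n=2000):
    """Draw n responses at a fixed x from the assumed population."""
    mu = beta0 + beta1 * x           # mean varies linearly with x
    return [random.gauss(mu, sigma) for _ in range(n)]

ys_at_2 = draw_y(2.0)
ys_at_6 = draw_y(6.0)

print(mean(ys_at_2), mean(ys_at_6))   # means near 3.0 and 5.0, on the line
print(stdev(ys_at_2), stdev(ys_at_6)) # spread near 1.0 at both x values
```

The illustrations in the images show exactly this picture: a sequence of identical normal curves whose centers slide along the regression line.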
Remember as we develop more complex models that the idea is the same as these illustrations.
- How does the mean of Y change with the values of X?
- Our resulting models will predict the MEAN response.
- We will be able to discuss and make inferences about changes in the MEAN response as values of X vary.
- But, we will have to manage and account for the variation around the mean response.
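These points can be sketched numerically. With hypothetical data, the fitted slope is exactly the change in the estimated MEAN response per one-unit increase in X, and the residual standard deviation measures the variation around that mean which we must account for.

```python
# Sketch: the slope describes how the MEAN of y changes per unit of x;
# the residual standard deviation measures the variation around that
# mean response. Data are hypothetical.
from statistics import mean

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.8, 3.9, 4.1, 5.2, 5.4, 6.6, 6.9]

xbar, ybar = mean(x), mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

# Change in the estimated mean response for a one-unit increase in x
delta_mean = (b0 + b1 * 5) - (b0 + b1 * 4)
assert abs(delta_mean - b1) < 1e-9

# Variation around the mean response: residual sd with n - 2 df
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = (sse / (len(x) - 2)) ** 0.5
```

Inference about the slope is inference about changes in the mean response; the residual standard deviation `s` is what drives the width of the resulting confidence intervals.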