# Unit 3: Multiple Linear Regression

## Learning Objectives

**Introductory Example and The Multiple Linear Regression Model**

- Formulate a multiple linear regression model based upon the predictors and effects specified.
- Be able to interpret the coefficients for quantitative predictors in a multiple linear regression model.
- Explain how to calculate a confidence interval for a single slope parameter in the multiple linear regression setting.
- Conduct hypothesis tests for a single slope parameter in the multiple linear regression setting.
- Interpret *R*² in a multiple linear regression setting.
- Explain the calculation and use of adjusted *R*² in a multiple linear regression setting.
- Be able to obtain the parameter estimates from output along with associated confidence intervals and p-values.
- Recognize the distinction between a population regression model and the estimated regression model.
- Write regression models for the population in two forms: for individual values of the response or for the mean response.
- Write estimated regression models using output.
- Summarize the assumptions (conditions) that comprise the multiple linear regression model.
- Explain what the unknown population variance *σ*² quantifies in the multiple linear regression setting.
- Be able to obtain the estimate *MSE* of the unknown population variance *σ*² from output.
- Explain how each element of the analysis of variance table is calculated and be able to find the values of components given a partially complete ANOVA table.
- Distinguish between estimating a mean response (confidence interval) and predicting a new observation (prediction interval).
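
As a rough illustration of the objectives on *R*², adjusted *R*², and *MSE*, the sketch below computes all three from the error and total sums of squares in an ANOVA table. The function name `fit_summary` and the input values are hypothetical, and `p` is assumed to count every regression parameter, including the intercept.

```python
def fit_summary(sse, ssto, n, p):
    """Return R^2, adjusted R^2, and MSE from ANOVA sums of squares.

    sse  : error (residual) sum of squares
    ssto : total sum of squares
    n    : number of observations
    p    : number of regression parameters, including the intercept (assumed)
    """
    r2 = 1 - sse / ssto
    mse = sse / (n - p)                          # estimate of sigma^2
    adj_r2 = 1 - (n - 1) / (n - p) * sse / ssto  # penalizes extra parameters
    return r2, adj_r2, mse


# Hypothetical values, not taken from any course data set.
r2, adj_r2, mse = fit_summary(sse=40.0, ssto=100.0, n=25, p=5)
print(r2, adj_r2, mse)
```

Adjusted *R*² is never larger than *R*², because the factor (n − 1)/(n − p) is at least 1 whenever the model includes an intercept.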

**Categorical Predictors**

- Formulate a regression model that contains one categorical predictor (binary or multi-level).
- Properly code a categorical variable (binary or multi-level) so that it can be incorporated into a multiple linear regression model.
- Be able to determine the impact of using different coding schemes for categorical predictors and to properly request the choice of reference group using software.
- Be able to interpret the coefficients of a multiple linear regression model for quantitative predictors, binary predictors, and multi-level categorical predictors.
- Determine the different mean response functions for different levels of a categorical predictor.
- Explain the two advantages of fitting one regression function rather than separate regression functions — one for each level of the multi-level (categorical) predictor.
- Translate research questions involving slope parameters into the appropriate hypotheses for testing.
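
One common way to code a categorical predictor is with 0/1 indicator variables, leaving one level out as the reference group. The sketch below shows that scheme in plain Python. The helper `dummy_code` and the level names are hypothetical, and statistical software may order or name the indicators differently.

```python
def dummy_code(values, reference):
    """Indicator (0/1) coding for a categorical variable.

    Returns the non-reference levels (sorted alphabetically) and one row of
    indicators per observation; the reference level is coded as all zeros.
    """
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical three-level predictor with "low" as the reference group.
group = ["low", "medium", "high", "low", "high"]
levels, rows = dummy_code(group, reference="low")
for g, row in zip(group, rows):
    print(g, dict(zip(levels, row)))
```

Changing the `reference` argument changes which comparisons the coefficients represent, but not the fitted means for each group.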

**Advanced Concepts and Additional Example for Multi-Level Categorical Predictors**

- Explain how contrasts of parameters are created from the theoretical regression model.
- Use contrasts to estimate and test for effects/comparisons not immediately available in the parameter estimates table.
- Explain the linear trend test for multi-level categorical predictors.
- Translate research questions involving slope parameters into the appropriate hypotheses for testing.

**Confounding, Mediation, and Multicollinearity**

- Explain the difference between unadjusted models and adjusted models.
- Be able to calculate the percent change in a parameter estimate (relative to an adjusted model).
- Explain the theoretical conditions for confounding and the reason we adjust for confounding variables in regression.
- Be able to investigate potential confounders using data.
- Explain the limitations of regression for estimating causal effects in observational studies.
- Explain the conditions needed to fully control for confounding.
- Explain the similarities and differences between mediators and confounding variables.
- Distinguish between structural multicollinearity and data-based multicollinearity.
- Explain what multicollinearity means.
- Explain the effects of multicollinearity on various aspects of regression analyses.
- Explain the effects of uncorrelated predictors on various aspects of regression analyses.
- Explain variance inflation factors, and how to use them to help detect multicollinearity.
- Explain the two ways of reducing data-based multicollinearity.
- Explain how centering the predictors in a polynomial regression model helps to reduce structural multicollinearity.
- Explain the main issues surrounding other regression pitfalls, including extrapolation, nonconstant variance, autocorrelation, overfitting, excluding important predictor variables, missing data, and power and sample size.
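
Two of the calculations named above are short enough to sketch directly: the percent change in a parameter estimate (taken relative to the adjusted model, as the objective states) and the variance inflation factor. The function names and numbers are hypothetical, and `r2_k` is assumed to be the *R*² obtained by regressing predictor *k* on the remaining predictors.

```python
def percent_change(unadjusted, adjusted):
    """Percent change in an estimate, relative to the adjusted model."""
    return 100 * (unadjusted - adjusted) / adjusted


def vif(r2_k):
    """Variance inflation factor for predictor k.

    r2_k is assumed to be the R^2 from regressing predictor k on the
    other predictors in the model.
    """
    return 1 / (1 - r2_k)


# Hypothetical slope estimates before and after adjusting for a covariate.
print(percent_change(unadjusted=2.4, adjusted=2.0))

# A predictor that the other predictors explain well has a large VIF.
print(vif(0.9))
```

A predictor that is uncorrelated with the others has `r2_k = 0` and a VIF of 1, meaning the variance of its estimate is not inflated.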

**Interactions**

- Explain the distinction between additive effects and interaction effects.
- Explain the impact of including an interaction term in a regression model.
- Be able to use a formulated model to determine how to test whether there is an interaction between a qualitative (categorical) predictor and a quantitative predictor.
- Be able to answer various research questions for models with interaction terms.
- Explain the impact of leaving a necessary interaction term out of the model.
- Translate research questions involving slope parameters into the appropriate hypotheses for testing.
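
To make the impact of an interaction term concrete, the sketch below assumes a model with one quantitative predictor `x`, one binary indicator `z`, and their product. The coefficient values are hypothetical. With the interaction included, each group gets its own intercept and its own slope in `x`.

```python
def mean_response(x, z, b0, b1, b2, b3):
    """E[y] = b0 + b1*x + b2*z + b3*x*z, with z a 0/1 indicator."""
    return b0 + b1 * x + b2 * z + b3 * x * z


def line_for_group(z, b0, b1, b2, b3):
    """Intercept and slope in x for the group with indicator value z."""
    return b0 + b2 * z, b1 + b3 * z


# Hypothetical coefficients, for illustration only.
coef = dict(b0=10.0, b1=2.0, b2=5.0, b3=-1.5)
print(line_for_group(0, **coef))  # reference group
print(line_for_group(1, **coef))  # comparison group
```

If `b3` were zero, the two lines would be parallel (additive effects), so testing for an interaction in this model amounts to testing whether `b3` equals zero.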

**Model Validation and Potential Solutions**

- Explain why we need to check the assumptions of our model.
- Explain potential issues with the linear regression model.
- Be able to detect various problems with the model using a residuals vs. fits plot.
- Be able to detect various problems with the model using a residuals vs. predictor plot.
- Be able to detect a certain kind of dependent error terms using a residuals vs. order plot. *(It is somewhat rare that we can do this analysis in practice.)*
- Be able to detect non-normal error terms using a normal probability plot.
- Apply some numerical tests for assessing model assumptions.
- Explain when transforming predictor variables might help and when transforming the response variable might help (or when it might be necessary to do both).
- Explain the concept of an influential data point.
- Be able to detect outlying *y* values by way of studentized residuals or studentized deleted residuals.
- Explain leverage, and know how to detect outlying *x* values using leverages.
- Be able to detect potentially influential data points by way of *DFFITS* and Cook’s distance measure.
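
As a simplified illustration of leverage, the sketch below uses the closed-form expression for a model with an intercept and a single predictor. With several predictors the leverages come from the diagonal of the hat matrix instead. The function names, the data, and the cutoff of 3·p/n are assumptions for illustration; other rules of thumb, such as 2·p/n, are also in use.

```python
def leverages(x):
    """Leverages h_ii for a regression with an intercept and one predictor."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]


def flag_high_leverage(x, p=2, multiplier=3):
    """Indices whose leverage exceeds multiplier * p / n (a rule of thumb)."""
    n = len(x)
    cutoff = multiplier * p / n
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]


# Hypothetical predictor values; the last one sits far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print([round(h, 3) for h in leverages(x)])
print(flag_high_leverage(x))
```

The leverages always sum to `p`, the number of parameters, so a single large value means that one observation carries a disproportionate share of the fit.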

The following lessons from **Penn State STAT 501** are linked in the materials as your textbook material or support materials for Unit 3.

- Lesson 5: Multiple Linear Regression (Printer-friendly version) – SKIP 5.4
- Lesson 8: Categorical Predictors (Printer-friendly version) – SKIP 8.8
- Lesson 9: Data Transformations (Printer-friendly version)
- Lesson 11: Influential Points (Printer-friendly version)
- Lesson 12: Multicollinearity & Other Regression Pitfalls (Printer-friendly version)

Optional PowerPoint: PHC6937-SAS-Mult_Reg.pdf (from PHC 6937 – Biostatistical Computing Using SAS for our MS Biostatistics Students)