Unit 3: Multiple Linear Regression

Learning Objectives

Introductory Example and The Multiple Linear Regression Model

  • Formulate a multiple linear regression model based upon the predictors and effects specified.
  • Be able to interpret the coefficients for quantitative predictors in a multiple linear regression model.
  • Explain how to calculate a confidence interval for a single slope parameter in the multiple linear regression setting.
  • Conduct hypothesis tests for a single slope parameter in the multiple linear regression setting.
  • Interpret R² in a multiple linear regression setting.
  • Explain the calculation and use of adjusted R² in a multiple linear regression setting.
  • Be able to obtain the parameter estimates from output along with associated confidence intervals and p-values.
  • Recognize the distinction between a population regression model and the estimated regression model.
  • Write regression models for the population in two forms: for individual values of the response or for the mean response.
  • Write estimated regression models using output.
  • Summarize the assumptions (conditions) that comprise the multiple linear regression model.
  • Explain what the unknown population variance σ² quantifies in the multiple linear regression setting.
  • Be able to obtain the estimate MSE of the unknown population variance σ² from output.
  • Explain how each element of the analysis of variance table is calculated and be able to find the values of components given a partially complete ANOVA table.
  • Distinguish between estimating a mean response (confidence interval) and predicting a new observation (prediction interval).
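
As an illustration of the ANOVA table and adjusted R² objectives above, here is a minimal sketch of how the remaining entries follow from the regression and error sums of squares. The function name `complete_anova`, its arguments, and the numbers are hypothetical and chosen only for the example; `k` is assumed to be the number of predictors, not counting the intercept.

```python
def complete_anova(ssr, sse, n, k):
    """Fill in a regression ANOVA table from SSR, SSE, n observations, k predictors."""
    sst = ssr + sse                      # total sum of squares
    df_reg, df_err, df_tot = k, n - k - 1, n - 1
    msr = ssr / df_reg                   # mean square for regression
    mse = sse / df_err                   # mean square error, the estimate of sigma^2
    return {
        "df": (df_reg, df_err, df_tot),
        "SST": sst,
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,                  # overall F statistic
        "R2": ssr / sst,
        "adj_R2": 1 - mse / (sst / df_tot),
    }

# Made-up values: SSR = 80, SSE = 20, n = 25 observations, k = 4 predictors.
table = complete_anova(ssr=80.0, sse=20.0, n=25, k=4)
print(table)
```

Because adjusted R² divides each sum of squares by its degrees of freedom, it can never exceed R² and can fall when a predictor that explains little is added.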

Categorical Predictors

  • Formulate a regression model that contains one categorical predictor (binary or multi-level).
  • Properly code a categorical variable (binary or multi-level) so that it can be incorporated into a multiple linear regression model.
  • Be able to determine the impact of using different coding schemes for categorical predictors and to properly request the choice of reference group using software.
  • Be able to interpret the coefficients of a multiple linear regression model for quantitative predictors, binary predictors, and multi-level categorical predictors.
  • Determine the different mean response functions for different levels of a categorical predictor.
  • Explain the two advantages of fitting one regression function rather than separate regression functions — one for each level of the multi-level (categorical) predictor.
  • Translate research questions involving slope parameters into the appropriate hypotheses for testing.
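
One way to picture the coding objectives above is reference-cell (dummy) coding, sketched below in plain Python. The helper `dummy_code` and the level names are hypothetical. Statistical software typically does this coding for you once a reference group is requested.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per non-reference level."""
    if reference not in values:
        raise ValueError("reference level does not appear in the data")
    levels = sorted(set(values) - {reference})
    # A categorical predictor with c levels needs c - 1 indicator columns.
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Made-up three-level predictor with "A" as the reference group.
group = ["A", "B", "C", "A"]
indicators = dummy_code(group, reference="A")
print(indicators)
```

Under this coding, each indicator's coefficient compares the mean response at that level with the mean response in the reference group, holding the other predictors fixed. Changing the reference group changes which comparisons appear in the parameter estimates table, not the fitted values.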

Advanced Concepts and Additional Example for Multi-Level Categorical Predictors

  • Explain how contrasts of parameters are created from the theoretical regression model.
  • Use contrasts to estimate and test for effects/comparisons not immediately available in the parameter estimates table.
  • Explain the linear trend test for multi-level categorical predictors.
  • Translate research questions involving slope parameters into the appropriate hypotheses for testing.

Confounding, Mediation, and Multicollinearity

  • Explain the difference between unadjusted models and adjusted models.
  • Be able to calculate the percent change in a parameter estimate (relative to an adjusted model).
  • Explain the theoretical conditions for confounding and the reason we adjust for confounding variables in regression.
  • Be able to investigate potential confounders using data.
  • Explain the limitations of regression for estimating causal effects in observational studies.
  • Explain the conditions needed to fully control for confounding.
  • Explain the similarities and differences between mediators and confounding variables.
  • Distinguish between structural multicollinearity and data-based multicollinearity.
  • Explain what multicollinearity means.
  • Explain the effects of multicollinearity on various aspects of regression analyses.
  • Explain the effects of uncorrelated predictors on various aspects of regression analyses.
  • Explain variance inflation factors, and how to use them to help detect multicollinearity.
  • Explain the two ways of reducing data-based multicollinearity.
  • Explain how centering the predictors in a polynomial regression model helps to reduce structural multicollinearity.
  • Explain the main issues surrounding other regression pitfalls, including extrapolation, nonconstant variance, autocorrelation, overfitting, excluding important predictor variables, missing data, and power and sample size.
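
Two of the calculations named above are short enough to sketch directly. The function names and inputs below are hypothetical. The percent change is taken relative to the adjusted estimate, as the objective states, and the variance inflation factor uses the standard definition 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on the other predictors.

```python
def percent_change(unadjusted, adjusted):
    """Percent change in a coefficient estimate, relative to the adjusted model."""
    return 100.0 * (unadjusted - adjusted) / adjusted

def vif(r2_j):
    """Variance inflation factor for predictor j, given its R^2 on the other predictors."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("R^2 must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)

# Made-up numbers: the slope is 1.2 before adjustment and 1.0 after adjustment.
change = percent_change(unadjusted=1.2, adjusted=1.0)
# A predictor that is 90% explained by the other predictors.
inflation = vif(0.9)
print(round(change, 1), round(inflation, 1))
```

A VIF of 1 corresponds to a predictor that is uncorrelated with the others. Larger values indicate that the variance of that coefficient estimate is inflated by multicollinearity.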

Interactions

  • Explain the distinction between additive effects and interaction effects.
  • Explain the impact of including an interaction term in a regression model.
  • Be able to use a formulated model to determine how to test whether there is an interaction between a qualitative (categorical) predictor and a quantitative predictor.
  • Be able to answer various research questions for models with interaction terms.
  • Explain the impact of leaving a necessary interaction term out of the model.
  • Translate research questions involving slope parameters into the appropriate hypotheses for testing.
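
As a sketch of the interaction objectives above, consider a model with one quantitative predictor x and one binary predictor g coded 0/1. The symbols are assumptions introduced only for this illustration.

```latex
\begin{align*}
E(Y) &= \beta_0 + \beta_1 x + \beta_2 g + \beta_3 (x \cdot g) \\
g = 0:\quad E(Y) &= \beta_0 + \beta_1 x \\
g = 1:\quad E(Y) &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\, x \\
H_0 &: \beta_3 = 0 \quad \text{versus} \quad H_A : \beta_3 \neq 0
\end{align*}
```

Here the slopes for the two groups differ by β₃, so testing for an interaction amounts to testing whether β₃ is zero. With β₃ = 0 the effects are additive and the two mean response functions are parallel lines.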

Model Validation and Potential Solutions

  • Explain why we need to check the assumptions of our model.
  • Explain potential issues with the linear regression model.
  • Be able to detect various problems with the model using a residuals vs. fits plot.
  • Be able to detect various problems with the model using a residuals vs. predictor plot.
  • Be able to detect a certain kind of dependent error terms using a residuals vs. order plot. (It is somewhat rare that we can do this analysis in practice.)
  • Be able to detect non-normal error terms using a normal probability plot.
  • Apply some numerical tests for assessing model assumptions.
  • Explain when transforming predictor variables might help and when transforming the response variable might help (or when it might be necessary to do both).
  • Explain the concept of an influential data point.
  • Be able to detect outlying y values by way of studentized residuals or studentized deleted residuals.
  • Explain leverage, and know how to detect outlying x values using leverages.
  • Be able to detect potentially influential data points by way of DFFITS and Cook’s distance measure.
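
To make the leverage objective above concrete, here is a minimal sketch for the special case of one predictor, where the leverage of observation i is 1/n + (xᵢ − x̄)² / Σ(xⱼ − x̄)². The function names and data are hypothetical, and the cutoff multiplier is an assumption: rules of thumb of 2 or 3 times the average leverage p/n are both in common use.

```python
def leverages(x):
    """Leverages h_ii for a simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

def flag_high_leverage(x, multiplier=3.0):
    """Indices whose leverage exceeds multiplier * p / n (p = 2: intercept and slope)."""
    p = 2
    cutoff = multiplier * p / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]

# Made-up predictor values; the last point sits far from the others.
x = [1, 2, 3, 4, 10]
h = leverages(x)
flagged = flag_high_leverage(x, multiplier=2.0)
print([round(v, 2) for v in h], flagged)
```

The leverages depend only on the x values and sum to the number of parameters p. A high-leverage point is not necessarily influential, which is why measures such as DFFITS and Cook's distance also take the response into account.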

The following lessons from Penn State STAT 501 are linked in the materials as your textbook material or support materials for Unit 3.

Optional PowerPoint: PHC6937-SAS-Mult_Reg.pdf (from PHC 6937 – Biostatistical Computing Using SAS for our MS Biostatistics Students)