SLR – Diagnostics for Assumptions

NOTE: Except in cases of complex calculations, we use brackets [ ] to indicate “functions of” and parentheses ( ) to indicate “multiplication.”

SE[Beta_1-hat] = the standard error of the estimated slope Beta_1-hat. There is no multiplication!

Beta_1-hat(AGE) = the multiplication of estimated slope Beta_1-hat and the variable AGE.


Introduction and Links to Materials

In this unit, we discuss simple linear regression in more detail than we did in the prerequisite course and begin using PROC GLM instead of PROC REG.

The output and code are extremely similar, so please continue to review the following materials and tutorials from PHC 6052 as needed.

Review from 6052 Materials: 

SAS Tutorials:

Useful SAS Procedures

  • PROC GLM
  • PROC SGPLOT
  • PROC SGSCATTER
  • PROC REG
  • PROC CORR

Consider the following materials from Penn State STAT 501 as your textbook content for this unit.

Review Penn State materials with a focus on the definitions, concepts, and interpretations. You do not need to understand the mathematical details or be able to calculate regression models by hand (although you are expected to be able to work with models and ANOVA tables requiring simple mathematical calculations).

PROC GLM

We will be using PROC GLM for simple and multiple linear regression, so let’s look more closely at the PROC GLM documentation, including its examples.

SAS Documentation: Review the following links to documentation including common statements.


PHC 6053 Video (2:21)


Examples and Learn by Doing Activities

Now look at our set of examples again with additional output related to diagnostics. Answer the questions we pose by reviewing the output provided.

The first uses a random sample of 500 observations from the NHANES data.

Here our primary goal is to understand the predictors of Systolic Blood Pressure, which is a broad and difficult task! Secondarily, we would like to find a model to predict Systolic Blood Pressure, but for now our focus is on interpreting the parameter estimates, identifying potential confounding variables, etc., as we begin working with this data for regression modeling.

EXAMPLE: NHANES DATA – Simple Linear Regression – SBP vs. AGE within each SEX.

LEARN BY DOING

NOTE: We used AGE_C50 as the predictor, which is the AGE variable CENTERED at 50 (a new variable, AGE_C50 = AGE - 50).
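
As a brief illustration of what centering does, the mean-response model with the centered predictor can be written as follows, using the bracket and parenthesis convention from the NOTE at the top of this page. The symbols Beta_0 and Beta_1 are generic notation here, not values taken from the output.

```latex
\begin{align*}
E[\text{SBP}] &= \beta_0 + \beta_1(\text{AGE\_C50}) \\
              &= \beta_0 + \beta_1(\text{AGE} - 50)
\end{align*}
% When AGE = 50, AGE_C50 = 0, so E[SBP] = \beta_0.
```

Centering leaves the slope unchanged and shifts only the intercept, which becomes the mean SBP at age 50 rather than at age 0.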

Answer the following using the regression results on pages 2 and 3 of the output.

  • For the model for FEMALES
    • Discuss the validity of assumptions for this analysis. (You do not need to say anything about independent errors.)
    • Using the print of the OUTPUT dataset requested, interpret the confidence and prediction limits for one observation.
  • For the model for MALES
    • Discuss the validity of assumptions for this analysis. (You do not need to say anything about independent errors.)
    • Using the print of the OUTPUT dataset requested, interpret the confidence and prediction limits for one observation.
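
For reference when reading the OUTPUT dataset, the standard textbook forms of the two intervals at a predictor value x_h are sketched below. The notation (x_h, S_xx, y-hat_h) is introduced here for illustration and is not taken from the course output. MSE is the mean squared error from the ANOVA table and n is the number of observations used in the model.

```latex
\begin{align*}
\text{Confidence limits for the mean response:}\quad
  & \hat{y}_h \pm t_{1-\alpha/2,\,n-2}\,
    \sqrt{\text{MSE}\left(\frac{1}{n} + \frac{(x_h-\bar{x})^2}{S_{xx}}\right)} \\
\text{Prediction limits for an individual response:}\quad
  & \hat{y}_h \pm t_{1-\alpha/2,\,n-2}\,
    \sqrt{\text{MSE}\left(1 + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{S_{xx}}\right)}
\end{align*}
% S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The extra 1 inside the second square root accounts for the variability of an individual around the mean, so the prediction limits are always wider than the confidence limits at the same x_h.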

Solution: Unit2-SLR-04-NHANES-SBP-AGE-BY-SEX-DIAGNOSTICS-SOLUTION.pdf

Now let’s look at the BEAR data. Here the primary goal is to develop a simple model that predicts the weight of the bear as accurately as possible. Because the predictors are interrelated, it will be difficult to include many of them in our model.

First we will look at some models where simple linear regression is generally reasonable.

EXAMPLE: BEAR DATA – Predicting Weight with Transformations

LEARN BY DOING

For each model:

  • Find the following values in the output: slope and its p-value and confidence interval, intercept, MSE, RootMSE, R-squared, F-Value from the overall ANOVA table.
  • Calculate the value of Pearson’s correlation coefficient using the information available in the output provided.
  • Write the theoretical regression model for the mean response.
  • Write the theoretical regression model for an individual response.
  • Write the estimated regression model for the mean response.
  • Interpret R-squared in context.
  • What can we say about the relationship from the value of Pearson’s correlation coefficient?
  • Looking at the Fit Plot and the value for RootMSE, explain what RootMSE measures.
  • Verify the calculation of R-squared from the sum of squares.
  • Verify the calculation of the F-value in the overall ANOVA table from the sum of squares and/or mean square values.
  • Verify the calculation of the t-statistic (for the slope) from other values in the output.
  • Interpret the slope of the estimated regression model in context along with its confidence interval.
  • Is the intercept meaningful here? If so, interpret it in context along with its confidence interval.
  • Discuss the validity of assumptions for this analysis. (You do not need to say anything about independent errors.)
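
The "verify the calculation" items above rely on a few identities that hold in any simple linear regression. The following is a minimal sketch of those identities in Python, using a small made-up dataset (not the bear data) and variable names chosen only for this illustration. It is meant as a way to check your arithmetic, not as a replacement for the SAS output.

```python
import math

# Made-up illustration data (NOT the bear data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares estimates of the slope and intercept.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Sums of squares as laid out in the overall ANOVA table.
fitted = [intercept + slope * xi for xi in x]
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_model = ss_total - ss_error

# Mean squares: 1 model df and n - 2 error df with one predictor.
ms_model = ss_model / 1
mse = ss_error / (n - 2)
root_mse = math.sqrt(mse)

# Quantities to verify against the output.
r_squared = ss_model / ss_total            # R-squared from sums of squares
f_value = ms_model / mse                   # overall F from mean squares
se_slope = math.sqrt(mse / s_xx)           # standard error of the slope
t_value = slope / se_slope                 # t-statistic for the slope
pearson_r = math.copysign(math.sqrt(r_squared), slope)  # r has the slope's sign

print(f"R-squared = {r_squared:.4f}, r = {pearson_r:.4f}")
print(f"F = {f_value:.4f}, t = {t_value:.4f}, t^2 = {t_value ** 2:.4f}")
print(f"MSE = {mse:.4f}, RootMSE = {root_mse:.4f}")
```

With one predictor, the squared t-statistic for the slope equals the overall F-value, and Pearson's correlation coefficient is the square root of R-squared with the sign of the slope.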

Solution: Unit2-SLR-05-BEARS-TRANSFORMATIONS-REASONABLE-MODELS-SOLUTION.pdf

NOTE: There was a typo in the original: I used R-squared instead of r in the discussion of the correlation for LOG(WEIGHT) vs. LOG(AGE).

Now let’s look at some problem models.

EXAMPLE: BEAR DATA – Predicting Weight with Transformations

LEARN BY DOING

For each model:

  • Find the following values in the output: slope and its p-value and confidence interval, intercept, MSE, RootMSE, R-squared, F-Value from the overall ANOVA table.
  • Calculate the value of Pearson’s correlation coefficient using the information available in the output provided.
  • Write the theoretical regression model for the mean response.
  • Write the theoretical regression model for an individual response.
  • Write the estimated regression model for the mean response.
  • Interpret R-squared in context.
  • What can we say about the relationship from the value of Pearson’s correlation coefficient?
  • Looking at the Fit Plot and the value for RootMSE, explain what RootMSE measures.
  • Which seems to be more of a problem: non-constant variance or outlier(s)?
  • Verify the calculation of R-squared from the sum of squares.
  • Verify the calculation of the F-value in the overall ANOVA table from the sum of squares and/or mean square values.
  • Verify the calculation of the t-statistic (for the slope) from other values in the output.
  • Interpret the slope of the estimated regression model in context along with its confidence interval.
  • Is the intercept meaningful here? If so, interpret it in context along with its confidence interval.
  • Discuss the validity of assumptions for this analysis. (You do not need to say anything about independent errors.)
  • Using the print of the OUTPUT dataset requested, interpret the confidence and prediction limits for one observation.
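
When deciding whether an outlier is driving a problem model, one common aid is a residual scaled by its own standard error. The sketch below computes internally studentized residuals for a simple linear regression in Python. The data are made up (not the bear data), the function name is hypothetical, and the cutoff of 2 is only a rule of thumb assumed for illustration. The residual types shown in the course output may differ.

```python
import math

def studentized_residuals(x, y):
    """Return (residuals, leverages, internally studentized residuals)
    for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    # Raw residuals: observed minus fitted.
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - 2)
    # Leverage of each observation in simple linear regression.
    lev = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
    # Each residual divided by its estimated standard error.
    stud = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, lev)]
    return resid, lev, stud

# Made-up data: roughly linear, with one unusually large response at x = 6.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 20.0, 15.1, 16.9]

resid, lev, stud = studentized_residuals(x, y)

# Flag observations whose studentized residual exceeds the assumed cutoff.
flagged = [i for i, r in enumerate(stud) if abs(r) > 2]
for i in flagged:
    print(f"Observation {i} (x = {x[i]}): studentized residual = {stud[i]:.2f}")
```

This addresses only the outlier half of the question. Non-constant variance is usually judged from the pattern in a plot of residuals against fitted values, for example a fan shape, rather than from any single observation.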

Solution: Unit2-SLR-06-BEARS-TRANSFORMATIONS-PROBLEM-MODELS-SOLUTION.pdf