Simple Linear Regression Model

NOTE: Except in cases of complex calculations, we use brackets [ ] to indicate “functions of” and parentheses ( ) to indicate “multiplication.”

SE[Beta_1-hat] = the standard error of the estimated slope Beta_1-hat. There is no multiplication!

Beta_1-hat(AGE) = the product of the estimated slope Beta_1-hat and the variable AGE.


Deterministic vs. Statistical Relationships

Before beginning, let’s briefly discuss two types of relationships.

  • A deterministic relationship is one where the value of one variable exactly determines the value of the other, with no error. An example is the relationship between temperature in Celsius and temperature in Fahrenheit (from Penn State STAT 501, 1.1 – What is Simple Linear Regression?). The points line up exactly with no error. We are not interested in these types of relationships in this course.

  • We are interested in statistical relationships, like those we are used to seeing in scatterplots, for example one relating X = driver age to Y = sign legibility distance.
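The contrast between the two kinds of relationships can be sketched in a few lines of Python. This is only an illustration: the Celsius to Fahrenheit conversion is exact, while the age and distance numbers are made up for this sketch (they are not the course data) and the line 600 - 3(AGE) is an assumed mean relationship.

```python
import random

# Deterministic relationship: Fahrenheit is an exact function of Celsius.
celsius = [0, 10, 20, 30, 40]
fahrenheit = [9 / 5 * c + 32 for c in celsius]

# Statistical relationship (hypothetical numbers, NOT the course data):
# the mean of Y follows a line in X, but individual values scatter around it.
random.seed(1)
age = [20, 30, 40, 50, 60, 70, 80]
distance = [600 - 3 * a + random.gauss(0, 40) for a in age]


def deviations_from_line(x, y, intercept, slope):
    """Vertical distance of each point from the line intercept + slope * x."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


det_dev = deviations_from_line(celsius, fahrenheit, 32, 9 / 5)
stat_dev = deviations_from_line(age, distance, 600, -3)

print("Deterministic deviations:", det_dev)
print("Statistical deviations:  ", [round(d, 1) for d in stat_dev])
```

Every deviation in the deterministic case is zero, while the statistical case leaves scatter around the line. That scatter is what the error term in a regression model describes.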

Introduction and Links to Materials

In this unit, we will cover simple linear regression in more detail than we did in the prerequisite course, and we will begin using PROC GLM instead of PROC REG.

The output and code will be extremely similar, so please start by reviewing the following materials and tutorials from PHC 6052.

Review from 6052 Materials: 

SAS Tutorials:

Useful SAS Procedures

  • PROC GLM
  • PROC SGPLOT
  • PROC SGSCATTER
  • PROC REG
  • PROC CORR

And consider the following materials from Penn State STAT 501 as your textbook content for this material.

Review Penn State materials with a focus on the definitions, concepts, and interpretations. You do not need to understand the mathematical details or be able to calculate regression models by hand (although you are expected to be able to work with models and ANOVA tables requiring simple mathematical calculations).
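As a rough idea of the kind of simple ANOVA-table arithmetic meant here, the following Python sketch computes the usual summary quantities for a simple linear regression from the model and error sums of squares. The function name `anova_summary` and the input numbers are hypothetical, chosen only to make the arithmetic easy to follow; they do not come from any course output.

```python
import math


def anova_summary(ss_model, ss_error, n, slope_sign=1):
    """Summary quantities for a simple linear regression ANOVA table.

    ss_model   : sum of squares for the model (regression)
    ss_error   : sum of squares for error (residual)
    n          : number of observations
    slope_sign : +1 or -1, the sign of the estimated slope
    """
    df_model = 1          # one predictor in simple linear regression
    df_error = n - 2      # two estimated parameters: intercept and slope
    ss_total = ss_model + ss_error
    ms_model = ss_model / df_model
    mse = ss_error / df_error
    r_squared = ss_model / ss_total
    return {
        "MSE": mse,
        "RootMSE": math.sqrt(mse),
        "F": ms_model / mse,
        "R-squared": r_squared,
        # In simple linear regression, r takes the sign of the slope.
        "r": slope_sign * math.sqrt(r_squared),
    }


# Hypothetical sums of squares (not from the NHANES or BEAR output).
result = anova_summary(ss_model=400.0, ss_error=1600.0, n=102)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Note that the square-root relationship between R-squared and Pearson's correlation coefficient holds for simple linear regression with one predictor; the sign has to be taken from the estimated slope.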

PENN STATE STAT 501 Materials – required textbook reading for this material

Note: If you click on “Printer Friendly Version” in the main lesson page it will show all pages in that lesson like this. The only downside is that interactive applets will not work in the printer friendly version.

PROC GLM

We will be using PROC GLM for simple and multiple linear regression, so let’s look at some PROC GLM documentation, including examples.

SAS Documentation: Both of the examples are more complex than we will initially discuss but will teach you more about PROC GLM and get you thinking about regression modeling!


PHC 6053 Videos (40:14)

Now review the following videos we have put together for this course.

SLR Introductory Example (14:44)

  • View Lecture Slides with Transcript
  • This video is long but I decided not to split it into sub-parts. If you can’t sit through it all at once, come back and finish it later :-)

Introduction to Regression (4:40)


Developing Theoretical Model (7:25)


Estimating the SLR Model (9:28)


Using the SLR Model (3:57)


When is it Meaningful to Interpret the Intercept?

One common issue for students regarding this material is the interpretation of the intercept. Please review this discussion:

Examples and Learn by Doing Activities

Now look at a few examples and try to answer the questions we pose by reviewing the output provided.

The first two use a random sample of 500 observations from the NHANES data.

Here our primary goal is to understand the predictors of Systolic Blood Pressure, which is a broad and difficult task! Secondarily, we would like to find a model to predict Systolic Blood Pressure, but for now our goal is to interpret the parameter estimates, identify potential confounding variables, and so on as we begin working with this data for regression modeling.

EXAMPLE: NHANES DATA – Simple Linear Regression – SBP vs. AGE

  • Dataset: nh_500c.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
  • SAS Code and Output: Unit2-SLR-01-NHANES-SBP-AGE.pdf

LEARN BY DOING

Answer the following using the scatterplots on page 1 of the output.

  • Do you see any major concerns about the assumption of linearity for the two lines illustrated on these plots? Explain.
  • For models using the original AGE as the predictor, for which gender will the slope be larger? the intercept? Can you visualize the intercept on the scatterplot?
  • For models using the centered AGE predictor, for which gender will the slope be larger? the intercept? Can you visualize the intercept on the scatterplot?
  • Is there any change in the slopes when we use centered AGE as the predictor as opposed to the original AGE variable?
  • At approximately what AGE will the MEAN Systolic Blood Pressure estimated from our model be the same for males and females?

Answer the following using the regression results on pages 2 and 3 of the output.

  • For the model using AGE
    • Find the following values in the output: slope and its p-value, intercept, MSE, RootMSE, R-squared, F-Value from the overall ANOVA table.
    • Calculate the value of Pearson’s correlation coefficient using the information available in the output provided.
    • Write the theoretical regression model for the mean response.
    • Write the theoretical regression model for an individual response.
    • Write the estimated regression model for the mean response.
    • Interpret the slope of the estimated regression model in context.
    • Is the intercept meaningful here? If so interpret in context.
    • Interpret R-squared in context.
    • What can we say about the relationship from the value of Pearson’s correlation coefficient?
    • Looking at the Fit Plot and the value for RootMSE, explain what RootMSE measures.
  • For the model using AGE_C50, which is our CENTERED AGE variable (recall that we centered at 50 by creating a new variable equal to AGE - 50)
    • Find the following values in the output: slope and its p-value and confidence interval, intercept, MSE, RootMSE, R-squared, F-Value from the overall ANOVA table.
    • Calculate the value of Pearson’s correlation coefficient using the information available in the output provided.
    • Write the theoretical regression model for the mean response.
    • Write the theoretical regression model for an individual response.
    • Write the estimated regression model for the mean response.
    • Interpret the slope of the estimated regression model in context.
    • Is the intercept meaningful here? If so interpret in context.
    • Interpret R-squared in context.
    • What can we say about the relationship from the value of Pearson’s correlation coefficient?
    • Looking at the Fit Plot and the value for RootMSE, explain what RootMSE measures.
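To see what centering a predictor does to the fitted line, here is a small Python sketch using ordinary least-squares formulas. The ages and blood pressures are hypothetical values invented for this illustration (not the NHANES sample), and `fit_line` is a helper written for this sketch rather than anything from SAS.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical ages and systolic blood pressures (NOT the NHANES sample).
age = [25, 35, 45, 55, 65, 75]
sbp = [112, 118, 121, 127, 135, 139]

# Centered predictor: AGE - 50, mirroring the AGE_C50 variable.
age_c50 = [a - 50 for a in age]

b0, b1 = fit_line(age, sbp)
b0_c, b1_c = fit_line(age_c50, sbp)

print(f"AGE:     intercept = {b0:.2f}, slope = {b1:.3f}")
print(f"AGE_C50: intercept = {b0_c:.2f}, slope = {b1_c:.3f}")
print(f"Estimated mean SBP at AGE 50 from the AGE model: {b0 + b1 * 50:.2f}")
```

In this sketch the two slopes agree, and the intercept of the centered model equals the estimated mean response at AGE = 50 from the uncentered model. Compare that pattern with what you find in the course output.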

Solution: Unit2-SLR-01-NHANES-SBP-AGE-SOLUTION.pdf

Now we will look at the regression lines by gender.

EXAMPLE: NHANES DATA – Simple Linear Regression – SBP vs. AGE within each SEX.

  • Dataset: nh_500c.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
  • SAS Code and Output: Unit2-SLR-02-NHANES-SBP-AGE-BY-SEX.pdf

LEARN BY DOING

NOTE: We used AGE_C50 as the predictor, which is the AGE variable CENTERED at 50 (a new variable = AGE - 50).

Answer the following using the scatterplots on page 1 of the output.

  • Do you see any major concerns about the assumption of linearity for the two lines illustrated on these plots? Explain.
  • For which gender will the slope be larger? the intercept? Can you visualize the intercept on the scatterplot?
  • For each gender, is there evidence of non-constant variance?
  • For each gender, are there any outliers?

Answer the following using the regression results on pages 2 and 3 of the output.

  • For the model for FEMALES
    • Find the following values in the output: slope and its p-value and confidence interval, intercept, MSE, RootMSE, R-squared, F-Value from the overall ANOVA table.
    • Calculate the value of Pearson’s correlation coefficient using the information available in the output provided.
    • Write the theoretical regression model for the mean response.
    • Write the theoretical regression model for an individual response.
    • Write the estimated regression model for the mean response.
    • Interpret the slope of the estimated regression model in context.
    • Is the intercept meaningful here? If so interpret in context.
    • Interpret R-squared in context.
    • What can we say about the relationship from the value of Pearson’s correlation coefficient?
    • Looking at the Fit Plot and the value for RootMSE, explain what RootMSE measures.
    • Which seems to be more of a problem: non-constant variance or outlier(s)?
  • For the model for MALES
    • Find the following values in the output: slope and its p-value and confidence interval, intercept, MSE, RootMSE, R-squared, F-Value from the overall ANOVA table.
    • Calculate the value of Pearson’s correlation coefficient using the information available in the output provided.
    • Write the theoretical regression model for the mean response.
    • Write the theoretical regression model for an individual response.
    • Write the estimated regression model for the mean response.
    • Interpret the slope of the estimated regression model in context.
    • Is the intercept meaningful here? If so interpret in context.
    • Interpret R-squared in context.
    • What can we say about the relationship from the value of Pearson’s correlation coefficient?
    • Looking at the Fit Plot and the value for RootMSE, explain what RootMSE measures.
    • Which seems to be more of a problem: non-constant variance or outlier(s)?
  • Find the AGE_C50 value at which the two lines cross by setting the two estimated regression equations equal to each other and solving for AGE_C50.
  • Were your conclusions from the previous activity correct regarding the values of the intercept and slope for the regression lines within each gender? Reflect :-)
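The algebra for finding where two estimated lines cross can be sketched in Python as follows. Setting b0_a + b1_a(x) equal to b0_b + b1_b(x) and solving gives x = (b0_b - b0_a) / (b1_a - b1_b). The coefficient values below are hypothetical placeholders, not the estimates from the course output, and `crossing_point` is a helper written for this sketch.

```python
def crossing_point(b0_a, b1_a, b0_b, b1_b):
    """Return x where b0_a + b1_a * x equals b0_b + b1_b * x."""
    if b1_a == b1_b:
        raise ValueError("Lines with equal slopes are parallel and do not cross.")
    return (b0_b - b0_a) / (b1_a - b1_b)


# Hypothetical estimates for two groups (NOT the values in the course output).
b0_f, b1_f = 120.0, 0.60   # group A: intercept and slope on AGE_C50
b0_m, b1_m = 124.0, 0.40   # group B: intercept and slope on AGE_C50

x_cross = crossing_point(b0_f, b1_f, b0_m, b1_m)

# AGE_C50 = AGE - 50, so add 50 to get back to the original AGE scale.
age_cross = x_cross + 50

print(f"Lines cross at AGE_C50 = {x_cross:.2f}, that is AGE = {age_cross:.2f}")
print(f"Estimated mean response there: {b0_f + b1_f * x_cross:.2f}")
```

Because the predictor is centered, remember to convert the solution back to the AGE scale before comparing it with your visual estimate from the scatterplots.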

Solution: Unit2-SLR-02-NHANES-SBP-AGE-BY-SEX-SOLUTION.pdf

NOTE: There was a typo in the original posting for the interpretation of R-squared. We used the correlation coefficient, r, instead of the correct value.

Now let’s look at the BEAR data. Here the primary goal is to develop a simple model that predicts the weight of the bear as accurately as possible. The inter-related nature of the predictors will make it difficult to include many predictors in our model.

EXAMPLE: BEAR DATA – Predicting Weight with Transformations

  • Dataset: bear.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
  • SAS Code and Output: Unit2-SLR-03-BEARS-TRANSFORMATIONS-PLOTS.pdf

LEARN BY DOING

The output provides scatterplots with regression lines and LOESS curves for our main response WEIGHT and a possible alternative transformation LOGWT = NATURAL LOG of WEIGHT. In addition, some predictors are presented in both their ORIGINAL form and a LOG-TRANSFORMED version.

The goal is to find the best one-variable model to predict weight from our set of possible predictors. In practice, we might also consider which measurements are easiest to take in the field; for example, it is easier to measure the overall length than to measure the chest diameter, regardless of whether the bear is awake or asleep! :-)

  • For each predictor, which model would be best for simple linear regression: both original variables, transformed response vs. original predictor, original response vs. transformed predictor, or both transformed? Consider any remaining concerns with simple linear regression with this “best choice.”
  • Which good model(s) seem the strongest?
  • Which model do you feel will be best?
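The four combinations of original and log-transformed variables can be explored numerically as well as visually. The Python sketch below uses invented length and weight values that roughly follow a power law (an assumption made for this illustration, not the BEAR data) and fits a least-squares line to each combination. The helper `slope_and_r_squared` and the variable names are hypothetical.

```python
import math


def slope_and_r_squared(x, y):
    """Least-squares slope and R-squared for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Hypothetical data (NOT the BEAR dataset): weight is roughly
# 0.001 * length^3, with a small multiplicative wobble on each value.
length = [40, 45, 50, 55, 60, 65, 70, 75]
wobble = [1.05, 0.96, 1.02, 0.98, 1.04, 0.97, 1.01, 0.99]
weight = [0.001 * ln ** 3 * w for ln, w in zip(length, wobble)]

log_length = [math.log(ln) for ln in length]   # natural log of the predictor
logwt = [math.log(w) for w in weight]          # natural log of the response

fits = {
    "WEIGHT vs LENGTH": slope_and_r_squared(length, weight),
    "LOGWT vs LENGTH": slope_and_r_squared(length, logwt),
    "WEIGHT vs LOGLENGTH": slope_and_r_squared(log_length, weight),
    "LOGWT vs LOGLENGTH": slope_and_r_squared(log_length, logwt),
}

for name, (slope, r2) in fits.items():
    print(f"{name:20s} slope = {slope:10.4f}  R-squared = {r2:.4f}")
```

When the underlying relationship is a power law, the log-log slope estimates the exponent (close to 3 in this made-up example). Keep in mind that R-squared values for WEIGHT and LOGWT describe variation on different scales, so the scatterplots and LOESS curves remain the main tool for judging linearity and constant variance.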

Solution: Unit2-SLR-03-BEARS-TRANSFORMATIONS-PLOTS-SOLUTION.pdf