This document is linked from Linear Relationships – Regression.
]]>
This document is linked from Linear Relationships – Linear Regression.
]]>Optional: Create your own solutions using your software for extra practice.
Use the following output to answer the questions that follow.
The modern Olympic Games have changed dramatically since their inception in 1896. For example, many commentators have remarked on the change in the quality of athletic performances from year to year. Regression will allow us to investigate the change in winning times for one event — the 1,500 meter race.
Here is a summary of the variables in our dataset:
Answer the following questions using the output. In this exercise you will:
Use the linear regression on the full data to answer the following question.
Use the linear regression after removing the outlier to answer the next two questions.
This document is linked from Linear Relationships – Linear Regression.
]]>To see the effect of outliers on a regression equation, use the applet introduced earlier. Draw points on the graph, add the regression line and then add an outlier or move an observation to see how the regression line changes.
Here is another similar applet that can be used to illustrate outliers and guessing lines of best fit.
Here is an interactive demonstration from the Rosman/Chance collection which has extensive options and illustrates many ideas about linear regression and correlation.
And, remember the twovariable calculator we introduced earlier.
This document is linked from Linear Relationships – Linear Regression.
]]>This document is linked from Linear Relationships – Linear Regression.
]]>A line is described by a set of points (X,Y) that obey a particular relationship between X and Y. That relationship is called the equation of the line, which we will express in the following form: Y = a + bX In this equation, a and b are constants that can be either negative or positive. The reason to write the line in this form is that the constants a and b tell us what the line looks like, as follows:
The slope and intercept are indicated with arrows on the following diagram:
The technique that specifies the dependence of the response variable on the explanatory variable is called regression. When that dependence is linear (which is the case in our examples in this section), the technique is called linear regression. Linear regression is therefore the technique of finding the line that best fits the pattern of the linear relationship (or in other words, the line that best describes how the response variable linearly depends on the explanatory variable).
To understand how such a line is chosen, consider the following very simplified version of the agedistance example (we left just 6 of the drivers on the scatterplot):
Consider the line:
The intercept is 1. The slope is 1/3, and the graph of this line is, therefore:
Consider the line:
The intercept is 1. The slope is 1/3, and the graph of this line is, therefore:
This document is linked from Linear Relationships – Linear Regression.
]]>Add points to the scatterplot, then draw your guess at the regression line, and then check your answer.
This document is linked from Linear Relationships – Linear Regression.
]]>Related SAS Tutorials
Related SPSS Tutorials
So far we’ve used the scatterplot to describe the relationship between two quantitative variables, and in the special case of a linear relationship, we have supplemented the scatterplot with the correlation (r).
The correlation, however, doesn’t fully characterize the linear relationship between two quantitative variables — it only measures the strength and direction. We often want to describe more precisely how one variable changes with the other (by “more precisely,” we mean more than just the direction), or predict the value of the response variable for a given value of the explanatory variable.
In order to be able to do that, we need to summarize the linear relationship with a line that best fits the linear pattern of the data. In the remainder of this section, we will introduce a way to find such a line, learn how to interpret it, and use it (cautiously) to make predictions.
Again, let’s start with a motivating example:
Earlier, we examined the linear relationship between the age of a driver and the maximum distance at which a highway sign was legible, using both a scatterplot and the correlation coefficient. Suppose a government agency wanted to predict the maximum distance at which the sign would be legible for 60yearold drivers, and thus make sure that the sign could be used safely and effectively.
How would we make this prediction?
It would be useful if we could find a line (such as the one that is presented on the scatterplot) that represents the general pattern of the data, because then,
and predict that 60yearold drivers could see the sign from a distance of just under 400 feet we would simply use this line to find the distance that corresponds to an age of 60 like this:
How and why did we pick this particular line (the one shown in red in the above walkthrough) to describe the dependence of the maximum distance at which a sign is legible upon the age of a driver? What line exactly did we choose? We will return to this example once we can answer that question with a bit more precision.
The technique that specifies the dependence of the response variable on the explanatory variable is called regression. When that dependence is linear (which is the case in our examples in this section), the technique is called linear regression. Linear regression is therefore the technique of finding the line that best fits the pattern of the linear relationship (or in other words, the line that best describes how the response variable linearly depends on the explanatory variable).
To understand how such a line is chosen, consider the following very simplified version of the agedistance example (we left just 6 of the drivers on the scatterplot):
There are many lines that look like they would be good candidates to be the line that best fits the data:
It is doubtful that everyone would select the same line in the plot above. We need to agree on what we mean by “best fits the data”; in other words, we need to agree on a criterion by which we would select this line. We want the line we choose to be close to the data points. In other words, whatever criterion we choose, it had better somehow take into account the vertical deviations of the data points from the line, which are marked with blue arrows in the plot below:
The most commonly used criterion is called the least squares criterion. This criterion says: Among all the lines that look good on your data, choose the one that has the smallest sum of squared vertical deviations. Visually, each squared deviation is represented by the area of one of the squares in the plot below. Therefore, we are looking for the line that will have the smallest total yellow area.
This line is called the leastsquares regression line, and, as we’ll see, it fits the linear pattern of the data very well.
For the remainder of this lesson, you’ll need to feel comfortable with the algebra of a straight line. In particular you’ll need to be familiar with the slope and the intercept in the equation of a line, and their interpretation.
Like any other line, the equation of the leastsquares regression line for summarizing the linear relationship between the response variable (Y) and the explanatory variable (X) has the form: Y = a + bX
All we need to do is calculate the intercept a, and the slope b, which we will learn to do using software.
Let’s revisit our agedistance example, and find the leastsquares regression line. The following output will be helpful in getting the 5 values we need:
As we can see, the regression line fits the linear pattern of the data quite well.
Let’s go back now to our motivating example, in which we wanted to predict the maximum distance at which a sign is legible for a 60yearold. Now that we have found the least squares regression line, this prediction becomes quite easy:
Practically, what the figure tells us is that in order to find the predicted legibility distance for a 60yearold, we plug Age = 60 into the regression line equation, to find that:
Predicted distance = 576 + ( 3 * 60) = 396
396 feet is our best prediction for the maximum distance at which a sign is legible for a 60yearold.
Comment About Predictions:
(The green segment of the line is the region of ages beyond 82, the age of the oldest individual in the study.)
Question: Is our prediction for 90yearold drivers reliable?
Answer: Our original age data ranged from 18 (youngest driver) to 82 (oldest driver), and our regression line is therefore a summary of the linear relationship in that age range only. When we plug the value 90 into the regression line equation, we are assuming that the same linear relationship extends beyond the range of our age data (1882) into the green segment. There is no justification for such an assumption. It might be the case that the vision of drivers older than 82 falls off more rapidly than it does for younger drivers. (i.e., the slope changes from 3 to something more negative). Our prediction for age = 90 is therefore not reliable.
Prediction for ranges of the explanatory variable that are not in the data is called extrapolation. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided. In our example, like most others, extrapolation can lead to very poor or illogical predictions.
Part A (10:53)
This document linked from Case Q→Q
]]>Review: From UNIT 1
Related SAS Tutorials
Related SPSS Tutorials
In inference for relationships, so far we have learned inference procedures for both cases C→Q and C→C from the role/type classification table below.
The last case to be considered in this course is case Q→Q, where both the explanatory and response variables are quantitative. (Case Q→C requires statistical methods that go beyond the scope of this course, one of which is logistic regression).
For case Q→Q, we will learn the following tests:
Dependent Samples  Independent Samples  
Standard Test(s) 


NonParametric Test(s) 

In the Exploratory Data Analysis section, we examined the relationship between sample values for two quantitative variables by looking at a scatterplot and if the relationship was linear, we supplemented the scatterplot with the correlation coefficient r and the linear regression equation. We discussed the regression equation but made no attempt to claim that the relationship observed in the sample necessarily held for the larger population from which the sample originated.
Now that we have a better understanding of the process of statistical inference, we will discuss a few methods for inferring something about the relationship between two quantitative variables in an entire population, based on the relationship seen in the sample.
In particular, we will focus on linear relationships and will answer the following questions:
If we satisfy the assumptions and conditions to use the methods, we can estimate the slope and correlation coefficient for our population and conduct hypothesis tests about these parameters.
For the standard tests, the tests for the slope and the correlation coefficient are equivalent; they will always produce the same pvalue and conclusion. This is because they are directly related to each other.
In this section, we can state our null and alternative hypotheses as:
Ho: There is no relationship between the two quantitative variables X and Y.
Ha: There is a relationship between the two quantitative variables X and Y.
What we know from Unit 1:
r = 0 implies no relationship between X and Y (note this is our null hypothesis!!)
r > 0 implies a positive relationship between X and Y (as X increases, Y also increases)
r < 0 implies a negative relationship between X and Y (as X increases, Y decreases)
Now here are the steps for hypothesis testing for Pearson’s Correlation Coefficient:
Step 1: State the hypothesesIf we consider the above information and our null hypothesis,
Ho: There is no relationship between the two quantitative variables X and Y,
Before we can write this using correlation, we must define the population correlation coefficient. In statistics, we use the greek letter ρ (rho) to denote the population correlation coefficient. Thus if there is no relationship between the two quantitative variables X and Y in our population, we can see that this hypothesis is equivalent to
Ho: ρ = 0 (rho = 0).
The alternative hypothesis will be
Ha: ρ ≠ 0 (rho is not equal to zero).
however, one sided tests are possible.
Step 2: Obtain data, check conditions, and summarize data
(i) The sample should be random with independent observations (all observations are independent of all other observations).
(ii) The relationship should be reasonably linear which we can check using a scatterplot. Any clearly nonlinear relationship should not be analyzed using this method.
(iii) To conduct this test, both variables should be normally distributed which we can check using histograms and QQplots. Outliers can cause problems.
Although there is an intermediate test statistic, in effect, the value of r itself serves as our test statistic.
Step 3: Find the pvalue of the test by using the test statistic as follows
We will rely on software to obtain the pvalue for this test. We have seen this pvalue already when we calculated correlation in Unit 1.
Step 4: Conclusion
As usual, we use the magnitude of the pvalue to draw our conclusions. A small pvalue indicates that the evidence provided by the data is strong enough to reject Ho and conclude (beyond a reasonable doubt) that the two variables are related (ρ ≠ 0). In particular, if a significance level of 0.05 is used, we will reject Ho if the pvalue is less than 0.05.
Confidence intervals can be obtained to estimate the true population correlation coefficient, ρ (rho), however, we will not compute these intervals in this course. You could be asked to interpret or use a confidence interval which has been provided to you.
We will look at one nonparametric test in case Q→Q. Spearman’s rank correlation uses the same calculations as for Pearson’s correlation coefficient except that it uses the ranks instead of the original data. This test is useful when there are outliers or when the variables do not appear to be normally distributed.
This measure behaves similarly to r in that:
Now an example:
A method for predicting IQ as soon as possible after birth could be important for early intervention in cases such as brain abnormalities or learning disabilities. It has been thought that greater infant vocalization (for instance, more crying) is associated with higher IQ. In 1964, a study was undertaken to see if IQ at 3 years of age is associated with amount of crying at newborn age. In the study, 38 newborns were made to cry after being tapped on the foot and the number of distinct cry vocalizations within 20 seconds was counted. The subjects were followed up at 3 years of age and their IQs were measured.
Data: SPSS format, SAS format, Excel format
Response Variable:
Explanatory Variable:
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no relationship between newborn cry count and IQ at three years of age
Ha: There is a relationship between newborn cry count and IQ at three years of age
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the pvalue
(i) To the best of our knowledge the subjects are independent.
(ii) The scatterplot shows a relationship that is reasonably linear although not very strong.
(iii) The histograms and QQplots for both variables are slightly skewed right. We would prefer more symmetric distributions; however, the skewness is not extreme so we will proceed with caution.
Pearson’s correlation coefficient is 0.402 with a pvalue of 0.012.
Spearman’s rank correlation is 0.354 with a pvalue of 0.029.
Step 4: Conclusion
Based upon the scatterplot and correlation results, there is a statistically significant, but somewhat weak, positive correlation between newborn cry count and IQ at age 3.
In Unit 1, we discussed the least squares method for estimating the regression line and used software to obtain the slope and intercept of the linear regression equation. These estimates can be considered as the sample statistics which estimate the true population slope and intercept.
Now we will formalize simple linear regression which will require some additional notation.
A regression model expresses two essential ingredients:
Regression is a vast subject which handles a wide variety of possible relationships.
All regression methods begin with a theoretical model which specifies the form of the relationship and includes any needed assumptions or conditions. Now we will introduce a more “statistical” definition of the regression model and define the parameters in the population.
We will use a different notation here than in the beginning of the semester. Now we use regression model style notation.
We assume the relationship in the population is linear and therefore our regression model can be written as:
where
The following picture illustrates the components of this model.
Each orange dot represents an individual observation in the scatterplot. Each observed value is modeled using the previous equation.
The red line is the true linear regression line. The blue dot represents the predicted value for a particular X value and illustrates that our predicted value only estimates the mean, average, or expected value of Y at that X value.
The error for an individual is expected and is due to the variation in our data. In the previous illustration, it is labeled with ε_{i} (epsilon_i) and denoted by a bracket which gives the distance between the orange dot for the observed value and the blue dot for the predicted value for a particular value of X. In practice, we cannot observe the true error for an individual but we will be able to estimate them using the residuals, which we will soon define mathematically.
The regression line represents the average Y for a given X and can be expressed as in symbols as the expected value of Y for a given X, E(YX) or Yhat.
In Unit 1, we used a to represent the intercept and b to represent the slope that we estimated from our data.
In formal regression procedures, we commonly use beta to represent the population parameter and betahat to represent the parameter estimate.
These parameter estimates, which are sample statistics estimated from our data, are also sometimes referred to as the coefficients using algebra terminology.
For each observation in our dataset, we also have a residual which is defined as the difference between the observed value and the predicted value for that observation.
The residuals are used to check our assumptions of normality and constant variance.
In effect, since we have a quantitative response variable, we are still comparing population means. However, now we must do so for EVERY possible value of X. We want to know if the distribution of Y is the same or different over our range of X values.
This idea is illustrated (including our assumption of normality) in the following picture which shows a case where the distribution of Y is changing as the values of the explanatory variable X change. This change is reflected by only a shift in means since we assume normality and constant variation of Y for all X.
The method used is mathematically equivalent to ANOVA but our interpretations are different due to the quantitative nature of our explanatory variable.
This image shows a scatterplot and regression line on the XY plane – as if flat on a table. Then standing up – in the vertical axis – we draw normal curves centered at the regression line for four different Xvalues – with X increasing for each.
The center of the distributions of the normal distributions which are displayed shows an increase in the mean but constant variation.
The idea is that the model assumes a normal distribution is a good approximation for how the Yvalues will vary around the regression line for a particular value of X.
There is one additional measure which is often of interest in linear regression, the coefficient of determination, R^{2} which, for simple linear regression is simply the square of the correlation coefficient, r.
The value of R^{2} is interpreted as the proportion of variation in our response variable Y, which can be explained by the linear regression model using our explanatory variable X.
Important Properties of R^{2}
A large R^{2} may or MAY NOT mean that the model fits our data well.
The image below illustrates data with a fairly large R^{2} yet the model does not fit the data well.
A small R^{2} may or MAY NOT mean that there is no relationship between X and Y – we must be careful as the relationship that exists may simply not be specified in our model – currently a simple linear model.
The image below illustrates data with a very small R^{2} yet the true relationship is very strong.
Now we move into our formal test procedure for simple linear regression.
A small R2 may or MAY NOT mean that there is no relationship between X and Y – we must be careful as the relationship that exists may simply not be specified in our model – currently a simple linear model. The image below illustrates data with a very small R2 yet the true relationship is very strong.
Step 1: State the hypothesesIn order to test the hypothesis that
Ho: There is no relationship between the two quantitative variables X and Y,
assuming our model is correct (a linear model is sufficient), we can write the above hypothesis as
Ho: β_{1} = 0 (Beta_1 = 0, the slope of our linear equation = 0 in the population).
The alternative hypothesis will be
Ha: β_{1 }≠ 0 (Beta_1 is not equal to zero).
Step 2: Obtain data, check conditions, and summarize data
(i) The sample should be random with independent observations (all observations are independent of all other observations).
(ii) The relationship should be linear which we can check using a scatterplot.
(iii) The residuals should be reasonably normally distributed with constant variance which we can check using the methods discussed below.
Normality: Histogram and QQplot of the residuals.
Constant Variance: Scatterplot of Y vs. X and/or a scatterplot of the residuals vs. the predicted values (Yhat). We would like to see random scatter with no pattern and approximately the same spread for all values of X.
Large outliers which fall outside the pattern of the data can cause problems and exert undue influence on our estimates. We saw in Unit 1 that one observation which is far away on the xaxis can have an large impact on the values of the correlation and slope.
Here are two examples each using the two plots mentioned above.
Example 1: Has constant variance (homoscedasticity)
Scatterplot of Y vs. X (above)
Scatterplot of residuals vs. predicted values (above)
Example 2: Does not have constant variance (heteroscedasticity)
Scatterplot of Y vs. X (above)
Scatterplot of residuals vs. predicted values (above)
The test statistic is similar to those we have studied for other ttests:
Where
Both of these values, along with the test statistic, are provided in the output from the software.
Step 3: Find the pvalue of the test by using the test statistic as follows
Under the null hypothesis, the test statistic follows a tdistribution with n2 degrees of freedom. We will rely on software to obtain the pvalue for this test.
Step 4: Conclusion
As usual, we use the magnitude of the pvalue to draw our conclusions. A small pvalue indicates that the evidence provided by the data is strong enough to reject Ho and we would conclude there is enough evidence that hat slope in the population is not zero and therefore the two variables are related. In particular, if a significance level of 0.05 is used, we will reject Ho if the pvalue is less than 0.05.
Confidence intervals will also be obtained in the software to estimate the true population slope, β_{1} (beta_1).
A method for predicting IQ as soon as possible after birth could be important for early intervention in cases such as brain abnormalities or learning disabilities. It has been thought that greater infant vocalization (for instance, more crying) is associated with higher IQ. In 1964, a study was undertaken to see if IQ at 3 years of age is associated with amount of crying at newborn age. In the study, 38 newborns were made to cry after being tapped on the foot and the number of distinct cry vocalizations within 20 seconds was counted. The subjects were followed up at 3 years of age and their IQs were measured.
Data: SPSS format, SAS format, Excel format
Response Variable:
Explanatory Variable:
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no (linear) relationship between newborn cry count and IQ at three years of age
Ha: There is a (linear) relationship between newborn cry count and IQ at three years of age
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the pvalue
(i) To the best of our knowledge the subjects are independent.
(ii) The scatterplot shows a relationship that is reasonably linear although not very strong.
(iii) The histogram and QQplot of the residuals are both reasonably normally distributed. The scatterplots of Y vs. X and the residuals vs. the predicted values both show no evidence of nonconstant variance.
The estimated regression equation is
The parameter estimate of the slope is 1.54 which means that for each 1unit increase in cry count, the average IQ is expected to increase by 1.54 points.
The standard error of the estimate of the slope is 0.584 which give a test statistic of 2.63 in the output and using unrounded values from the output and the formula:
The pvalue is found to be 0.0124. Notice this exactly the same as we obtained for this data for our test of Pearson’s correlation coefficient. These two methods are equivalent and will always produce the same conclusion about the statistical significance of the linear relationship between X and Y.
The 95% confidence interval for β_{1} (beta_1) given in the output is (0.353, 2.720).
This regression model has coefficient of determination of R^{2} = 0.161 which means that 16.1% of the variation in IQ score at age three can be explained by our linear regression model using newborn cry count. This confirms a relatively weak relationship as we found in our previous example using correlations (Pearson’s correlation coefficient and Spearmans’ rank correlation).
Step 4: Conclusion
Conclusion of the test for the slope: Based upon the scatterplot and linear regression analysis, since the relationship is linear and the pvalue = 0.0124, there is a statistically significant positive linear relationship between newborn cry count and IQ at age 3.
Interpretation of Rsquared: Based upon our R^{2}and scatterplot, the relationship is somewhat weak with only 16.1% of the variation in IQ score at age three being explained by our linear regression model using newborn cry count.
Interpretation of the slope: For each 1unit increase in cry count, the population mean IQ is expected to increase by 1.54 points, however, the 95% confidence interval suggests this value could be as low as 0.35 points to as high as 2.72 points.
We return to the data from an earlier activity (Learn By Doing – Correlation and Outliers (Software)). The average gestation period, or time of pregnancy, of an animal is closely related to its longevity, the length of its lifespan. Data on the average gestation period and longevity (in captivity) of 40 different species of animals have been recorded. Here is a summary of the variables in our dataset:
In this case, whether we include the outlier or not, there is a problem of nonconstant variance. You can clearly see that, in general, as longevity increases, the variation of gestation increases.
This data is not a particularly good candidate for simple linear regression analysis (without further modification such as transformations or the use of alternative methods).
Pearson’s correlation coefficient (or Spearman’s rank correlation), may still provide a reasonable measure of the strength of the relationship, which is clearly a positive relationship from the scatterplot and our previous measure of correlation.
Output – Contains scatterplots with linear equations and LOESS curves (running average) for the dataset with and without the outlier. Pay particular attention to the problem with nonconstant variance seen in these scatterplots.
The data used in the analysis provided below contains the monthly premiums, driving experience, and gender for a random sample of drivers.
To analyze this data, we have looked at males and females as two separate groups and estimated the correlation and linear regression equation for each gender. We wish to predict the monthly premium using years of driving experience.
Use this output for additional practice with these concepts. For each gender consider the following: