This document is linked from Linear Relationships – Correlation.
From the online version of Little Handbook of Statistical Practice, this reading contains a detailed discussion of correlation.
Optional: Create your own solutions using your software for extra practice.
Use the following output to answer the questions that follow.
The average gestation period, or time of pregnancy, of an animal is closely related to its longevity — the length of its lifespan. Data on the average gestation period and longevity (in captivity) of 40 different species of animals have been recorded.
Here is a summary of the variables in our dataset:
Remember that the correlation is only an appropriate measure of the linear relationship between two quantitative variables. First produce a scatterplot to verify that gestation and longevity are nearly linear in their relationship.
Answer the following questions using the output obtained. In this exercise we will:
(Optional) SPSS Steps:
Here is another interactive demonstration from the Rosman/Chance collection which has extensive options and illustrates many ideas about linear regression and correlation.
And, remember the two-variable calculator we introduced earlier.
This document linked from Case Q→Q
So far we have visualized relationships between two quantitative variables using scatterplots, and described the overall pattern of a relationship by considering its direction, form, and strength. We noted that assessing the strength of a relationship just by looking at the scatterplot is quite difficult, and therefore we need to supplement the scatterplot with some kind of numerical measure that will help us assess the strength.
In this part, we will restrict our attention to the special case of relationships that have a linear form, since they are quite common and relatively simple to detect. More importantly, there exists a numerical measure that assesses the strength of the linear relationship between two quantitative variables with which we can supplement the scatterplot. We will introduce this numerical measure here and discuss it in detail.
Even though from this point on we are going to focus only on linear relationships, it is important to remember that not every relationship between two quantitative variables has a linear form. We have actually seen several examples of relationships that are not linear. The statistical tools that will be introduced here are appropriate only for examining linear relationships, and as we will see, when they are used in nonlinear situations, these tools can lead to errors in reasoning.
Let’s start with a motivating example. Consider the following two scatterplots.
We can see that in both cases, the direction of the relationship is positive and the form of the relationship is linear. What about the strength? Recall that the strength of a relationship is the extent to which the data follow its form.
The purpose of this example was to illustrate how assessing the strength of the linear relationship from a scatterplot alone is problematic, since our judgment might be affected by the scale on which the values are plotted. This example, therefore, provides a motivation for the need to supplement the scatterplot with a numerical measure that will measure the strength of the linear relationship between two quantitative variables.
The numerical measure that assesses the strength of a linear relationship is called the correlation coefficient, and is denoted by r. We will:
Calculation: r is calculated using the following formula:
However, the calculation of the correlation (r) is not the focus of this course. We will use a statistics package to calculate r for us, and the emphasis of this course will be on the interpretation of its value.
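Although software does the work, it can help to see the calculation once. The sketch below assumes the common textbook form of r (the sum of the products of the standardized X and Y values, divided by n - 1). The function name `pearson_r` and the small dataset are invented for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data, not taken from the course datasets.
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

r = pearson_r(hours, score)
print(round(r, 3))  # a value near 1: strong positive linear relationship
```

In practice a statistics package would report the same number; the point of the sketch is only that r is built from standardized values, which is why it has no units.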
Once we obtain the value of r, its interpretation with respect to the strength of linear relationships is quite simple, as these images illustrate:
In order to get a better sense for how the value of r relates to the strength of the linear relationship, take a look at the following applets.
If you will be using correlation often in your research, I highly urge you to read the following more detailed discussion of correlation.
Now that we understand the use of r as a numerical measure for assessing the direction and strength of linear relationships between quantitative variables, we will look at a few examples.
Earlier, we used the scatterplot below to find a negative linear relationship between the age of a driver and the maximum distance at which a highway sign was legible. What about the strength of the relationship? It turns out that the correlation between the two variables is r = -0.793.
Since r < 0, it confirms that the direction of the relationship is negative (although we really didn’t need r to tell us that). Since r is relatively close to -1, it suggests that the relationship is moderately strong. In context, the negative correlation confirms that the maximum distance at which a sign is legible generally decreases with age. Since the value of r indicates that the linear relationship is moderately strong, but not perfect, we can expect the maximum distance to vary somewhat, even among drivers of the same age.
A statistics department is interested in tracking the progress of its students from entry until graduation. As part of the study, the department tabulates the performance of 10 students in an introductory course and in an upper-level course required for graduation. What is the relationship between the students’ course averages in the two courses? Here is the scatterplot for the data:
The scatterplot suggests a relationship that is positive in direction, linear in form, and seems quite strong. The value of the correlation that we find between the two variables is r = 0.931, which is very close to 1, and thus confirms that indeed the linear relationship is very strong.
Comments:
We will now discuss and illustrate several important properties of the correlation coefficient as a numerical measure of the strength of a linear relationship.
To illustrate this, below are two versions of the scatterplot of the relationship between sign legibility distance and driver’s age:
The top scatterplot displays the original data where the maximum distances are measured in feet. The bottom scatterplot displays the same relationship, but with maximum distances changed to meters. Notice that the Y-values have changed, but the correlations are the same. This is an example of how changing the units of measurement of the response variable has no effect on r, but as we indicated above, the same is true for changing the units of the explanatory variable, or of both variables.
This might be a good place to comment that the correlation (r) is “unitless”. It is just a number.
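The unit-invariance property can be checked numerically. The following is a minimal sketch with made-up ages and distances (not the actual sign legibility data); it converts feet to meters and compares the two correlations.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values chosen only to mimic a negative linear pattern.
age = [20, 30, 40, 50, 60, 70, 80]
distance_feet = [560, 510, 490, 450, 420, 400, 340]
distance_meters = [d * 0.3048 for d in distance_feet]  # 1 foot = 0.3048 m

r_feet = pearson_r(age, distance_feet)
r_meters = pearson_r(age, distance_meters)
print(round(r_feet, 4), round(r_meters, 4))  # the two values agree
```

Rescaling multiplies every deviation from the mean by the same constant, which cancels between the numerator and the denominator of r.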
Our data describe a fairly simple nonlinear (sometimes called curvilinear) relationship: the amount of fuel consumed decreases rapidly to a minimum for a car driving 60 kilometers per hour, and then increases gradually for speeds exceeding 60 kilometers per hour. The relationship is very strong, as the observations seem to perfectly fit the curve.
Although the relationship is strong, the correlation r = 0.172 indicates a weak linear relationship. This makes sense considering that the data fails to adhere closely to a linear form:
The relationship is nonlinear (sometimes called curvilinear), yet the correlation r = 0.876 is quite close to 1.
In the last two examples we have seen two very strong nonlinear (sometimes called curvilinear) relationships, one with a correlation close to 0, and one with a correlation close to 1. Therefore, the correlation alone does not indicate whether a relationship is linear or not. The important principle here is:
Always look at the data!
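To see why the plot matters, here is a small sketch with two invented curved relationships (neither is the course's fuel data). The first is a symmetric U-shape, where the points follow the curve perfectly yet r is essentially 0. The second is a steadily increasing curve, where r is close to 1 even though the form is not linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical U-shaped curve with its minimum at 60 (symmetric by design).
speed = [20, 30, 40, 50, 60, 70, 80, 90, 100]
fuel = [(s - 60) ** 2 / 100 + 5 for s in speed]
r_u_shape = pearson_r(speed, fuel)

# Hypothetical increasing curve: y = x squared.
x = list(range(1, 11))
y = [value ** 2 for value in x]
r_increasing_curve = pearson_r(x, y)

print(round(r_u_shape, 3), round(r_increasing_curve, 3))
```

Neither number reveals the curvature; only the scatterplot does.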
Hopefully, you’ve noticed the correlation decreasing when you created this kind of outlier, which is not consistent with the pattern of the relationship.
The next activity will show you how an outlier that is consistent with the direction of the linear relationship actually strengthens it.
In the previous activity, we saw an example where there was a positive linear relationship between the two variables, and including the outlier just “strengthened” it. Consider the hypothetical data displayed by the following scatterplot:
In this case, the low outlier gives an “illusion” of a positive linear relationship, whereas in reality, there is no linear relationship between X and Y.
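The same effect is easy to reproduce. In this sketch, five invented points have a correlation of exactly 0, and adding a single low outlier produces a correlation above 0.9. The data are hypothetical and are not the values behind the scatterplot above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A cluster with no linear relationship at all (hypothetical).
cluster_x = [8, 8, 10, 10, 9]
cluster_y = [8, 10, 8, 10, 9]
r_without_outlier = pearson_r(cluster_x, cluster_y)

# The same cluster plus one low outlier at (1, 1).
r_with_outlier = pearson_r(cluster_x + [1], cluster_y + [1])

print(round(r_without_outlier, 3), round(r_with_outlier, 3))
```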
In inference for relationships, so far we have learned inference procedures for both cases C→Q and C→C from the role/type classification table below.
The last case to be considered in this course is case Q→Q, where both the explanatory and response variables are quantitative. (Case Q→C requires statistical methods that go beyond the scope of this course, one of which is logistic regression).
For case Q→Q, we will learn the following tests:
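Standard Test(s): the test for Pearson’s correlation coefficient and the test for the slope in simple linear regression
Non-Parametric Test(s): the test for Spearman’s rank correlation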
In the Exploratory Data Analysis section, we examined the relationship between sample values for two quantitative variables by looking at a scatterplot and, if the relationship was linear, supplementing the scatterplot with the correlation coefficient r and the linear regression equation. We discussed the regression equation but made no attempt to claim that the relationship observed in the sample necessarily held for the larger population from which the sample originated.
Now that we have a better understanding of the process of statistical inference, we will discuss a few methods for inferring something about the relationship between two quantitative variables in an entire population, based on the relationship seen in the sample.
In particular, we will focus on linear relationships and will answer the following questions:
If we satisfy the assumptions and conditions to use the methods, we can estimate the slope and correlation coefficient for our population and conduct hypothesis tests about these parameters.
For the standard tests, the tests for the slope and the correlation coefficient are equivalent; they will always produce the same p-value and conclusion. This is because they are directly related to each other.
In this section, we can state our null and alternative hypotheses as:
Ho: There is no relationship between the two quantitative variables X and Y.
Ha: There is a relationship between the two quantitative variables X and Y.
What we know from Unit 1:
r = 0 implies no linear relationship between X and Y (note this is our null hypothesis!!)
r > 0 implies a positive relationship between X and Y (as X increases, Y also increases)
r < 0 implies a negative relationship between X and Y (as X increases, Y decreases)
Now here are the steps for hypothesis testing for Pearson’s Correlation Coefficient:
Step 1: State the hypotheses
If we consider the above information and our null hypothesis,
Ho: There is no relationship between the two quantitative variables X and Y,
Before we can write this using correlation, we must define the population correlation coefficient. In statistics, we use the Greek letter ρ (rho) to denote the population correlation coefficient. Thus if there is no relationship between the two quantitative variables X and Y in our population, we can see that this hypothesis is equivalent to
Ho: ρ = 0 (rho = 0).
The alternative hypothesis will be
Ha: ρ ≠ 0 (rho is not equal to zero).
However, one-sided tests are also possible.
Step 2: Obtain data, check conditions, and summarize data
(i) The sample should be random with independent observations (all observations are independent of all other observations).
(ii) The relationship should be reasonably linear which we can check using a scatterplot. Any clearly nonlinear relationship should not be analyzed using this method.
(iii) To conduct this test, both variables should be normally distributed, which we can check using histograms and QQ-plots. Outliers can cause problems.
Although there is an intermediate test statistic, in effect, the value of r itself serves as our test statistic.
Step 3: Find the p-value of the test by using the test statistic as follows
We will rely on software to obtain the p-value for this test. We have seen this p-value already when we calculated correlation in Unit 1.
Step 4: Conclusion
As usual, we use the magnitude of the p-value to draw our conclusions. A small p-value indicates that the evidence provided by the data is strong enough to reject Ho and conclude (beyond a reasonable doubt) that the two variables are related (ρ ≠ 0). In particular, if a significance level of 0.05 is used, we will reject Ho if the p-value is less than 0.05.
Confidence intervals can be obtained to estimate the true population correlation coefficient, ρ (rho); however, we will not compute these intervals in this course. You could be asked to interpret or use a confidence interval that has been provided to you.
We will look at one nonparametric test in case Q→Q. Spearman’s rank correlation uses the same calculations as for Pearson’s correlation coefficient except that it uses the ranks instead of the original data. This test is useful when there are outliers or when the variables do not appear to be normally distributed.
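As a rough illustration of "Pearson's calculation applied to ranks," the sketch below ranks each variable (tied values share the average of their positions, a common convention that is assumed here) and then computes the ordinary correlation of the ranks. The helper names and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: always increasing, but with one extreme Y value.
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 400]

print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
```

Because the ranks ignore how far the extreme value sits from the rest, the rank correlation here reflects the perfectly increasing pattern, while Pearson's r is pulled down by the outlier.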
This measure behaves similarly to r in that:
Now an example:
A method for predicting IQ as soon as possible after birth could be important for early intervention in cases such as brain abnormalities or learning disabilities. It has been thought that greater infant vocalization (for instance, more crying) is associated with higher IQ. In 1964, a study was undertaken to see if IQ at 3 years of age is associated with amount of crying at newborn age. In the study, 38 newborns were made to cry after being tapped on the foot and the number of distinct cry vocalizations within 20 seconds was counted. The subjects were followed up at 3 years of age and their IQs were measured.
Data: SPSS format, SAS format, Excel format
Response Variable: IQ at three years of age
Explanatory Variable: newborn cry count
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no relationship between newborn cry count and IQ at three years of age
Ha: There is a relationship between newborn cry count and IQ at three years of age
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the p-value
(i) To the best of our knowledge the subjects are independent.
(ii) The scatterplot shows a relationship that is reasonably linear although not very strong.
(iii) The histograms and QQ-plots show that both variables are slightly skewed right. We would prefer more symmetric distributions; however, the skewness is not extreme so we will proceed with caution.
Pearson’s correlation coefficient is 0.402 with a p-value of 0.012.
Spearman’s rank correlation is 0.354 with a p-value of 0.029.
Step 4: Conclusion
Based upon the scatterplot and correlation results, there is a statistically significant, but somewhat weak, positive correlation between newborn cry count and IQ at age 3.
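The "intermediate test statistic" mentioned in Step 2 is usually a t statistic with n - 2 degrees of freedom computed from r and the sample size. The sketch below assumes that standard form and plugs in this example's values (r = 0.402, n = 38). The function name `t_from_r` is made up for illustration, and the p-value itself is left to software.

```python
import math

def t_from_r(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.402   # Pearson's correlation reported in the example
n = 38      # number of newborns in the study

t_statistic = t_from_r(r, n)
print(round(t_statistic, 2), "on", n - 2, "degrees of freedom")
```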
In Unit 1, we discussed the least squares method for estimating the regression line and used software to obtain the slope and intercept of the linear regression equation. These estimates can be considered as the sample statistics which estimate the true population slope and intercept.
Now we will formalize simple linear regression which will require some additional notation.
A regression model expresses two essential ingredients:
Regression is a vast subject which handles a wide variety of possible relationships.
All regression methods begin with a theoretical model which specifies the form of the relationship and includes any needed assumptions or conditions. Now we will introduce a more “statistical” definition of the regression model and define the parameters in the population.
We will use a different notation here than in the beginning of the semester. Now we use regression model style notation.
We assume the relationship in the population is linear and therefore our regression model can be written as:
where
The following picture illustrates the components of this model.
Each orange dot represents an individual observation in the scatterplot. Each observed value is modeled using the previous equation.
The red line is the true linear regression line. The blue dot represents the predicted value for a particular X value and illustrates that our predicted value only estimates the mean, average, or expected value of Y at that X value.
The error for an individual is expected and is due to the variation in our data. In the previous illustration, it is labeled with ε_{i} (epsilon_i) and denoted by a bracket which gives the distance between the orange dot for the observed value and the blue dot for the predicted value for a particular value of X. In practice, we cannot observe the true error for an individual, but we will be able to estimate it using the residuals, which we will soon define mathematically.
The regression line represents the average Y for a given X and can be expressed in symbols as the expected value of Y for a given X, E(Y | X), or Y-hat.
In Unit 1, we used a to represent the intercept and b to represent the slope that we estimated from our data.
In formal regression procedures, we commonly use beta to represent the population parameter and beta-hat to represent the parameter estimate.
These parameter estimates, which are sample statistics estimated from our data, are also sometimes referred to as the coefficients using algebra terminology.
For each observation in our dataset, we also have a residual which is defined as the difference between the observed value and the predicted value for that observation.
The residuals are used to check our assumptions of normality and constant variance.
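For reference, the pieces described above can be collected in symbols. This is a sketch in the standard notation for simple linear regression; the symbols β_0 (intercept) and e_i (residual) are assumed here, since only β_1 and ε_i are named explicitly in the text.

```latex
% Population model for observation i: straight line plus individual error
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

% True regression line: the mean of Y at a given X
E(Y \mid X) = \beta_0 + \beta_1 X

% Estimated line from the sample, and the residual for observation i
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i,
\qquad
e_i = Y_i - \hat{Y}_i
```

The residual e_i is the observable stand-in for the unobservable error ε_i.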
In effect, since we have a quantitative response variable, we are still comparing population means. However, now we must do so for EVERY possible value of X. We want to know if the distribution of Y is the same or different over our range of X values.
This idea is illustrated (including our assumption of normality) in the following picture which shows a case where the distribution of Y is changing as the values of the explanatory variable X change. This change is reflected by only a shift in means since we assume normality and constant variation of Y for all X.
The method used is mathematically equivalent to ANOVA but our interpretations are different due to the quantitative nature of our explanatory variable.
This image shows a scatterplot and regression line on the XY plane, as if flat on a table. Then, standing up along the vertical axis, we draw normal curves centered at the regression line for four different X-values, with X increasing for each.
The centers of the displayed normal distributions show an increase in the mean but constant variation.
The idea is that the model assumes a normal distribution is a good approximation for how the Y-values will vary around the regression line for a particular value of X.
There is one additional measure which is often of interest in linear regression: the coefficient of determination, R^{2}, which for simple linear regression is simply the square of the correlation coefficient, r.
The value of R^{2} is interpreted as the proportion of variation in our response variable Y, which can be explained by the linear regression model using our explanatory variable X.
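The two descriptions of R^{2} (the square of r, and the proportion of variation explained) can be compared directly. The sketch below fits a least squares line to a small invented dataset, computes the residuals, and checks that 1 - SSE/SST equals r squared. All names and data are hypothetical.

```python
import math

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [52, 60, 61, 70, 77]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in Y (SST)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least squares estimates of the slope and intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed value minus predicted value.
predicted = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

sse = sum(e ** 2 for e in residuals)       # variation left unexplained
r_squared = 1 - sse / syy                  # proportion of variation explained
r = sxy / math.sqrt(sxx * syy)             # Pearson correlation

print(round(r_squared, 4), round(r ** 2, 4))  # the two values agree
```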
Important Properties of R^{2}
A large R^{2} may or MAY NOT mean that the model fits our data well.
The image below illustrates data with a fairly large R^{2} yet the model does not fit the data well.
A small R^{2} may or MAY NOT mean that there is no relationship between X and Y – we must be careful as the relationship that exists may simply not be specified in our model – currently a simple linear model.
The image below illustrates data with a very small R^{2} yet the true relationship is very strong.
Now we move into our formal test procedure for simple linear regression.
Step 1: State the hypotheses
In order to test the hypothesis that
Ho: There is no relationship between the two quantitative variables X and Y,
assuming our model is correct (a linear model is sufficient), we can write the above hypothesis as
Ho: β_{1} = 0 (Beta_1 = 0, the slope of our linear equation = 0 in the population).
The alternative hypothesis will be
Ha: β_{1} ≠ 0 (Beta_1 is not equal to zero).
Step 2: Obtain data, check conditions, and summarize data
(i) The sample should be random with independent observations (all observations are independent of all other observations).
(ii) The relationship should be linear which we can check using a scatterplot.
(iii) The residuals should be reasonably normally distributed with constant variance which we can check using the methods discussed below.
Normality: Histogram and QQ-plot of the residuals.
Constant Variance: Scatterplot of Y vs. X and/or a scatterplot of the residuals vs. the predicted values (Yhat). We would like to see random scatter with no pattern and approximately the same spread for all values of X.
Large outliers which fall outside the pattern of the data can cause problems and exert undue influence on our estimates. We saw in Unit 1 that one observation which is far away on the x-axis can have a large impact on the values of the correlation and slope.
Here are two examples each using the two plots mentioned above.
Example 1: Has constant variance (homoscedasticity)
Scatterplot of Y vs. X (above)
Scatterplot of residuals vs. predicted values (above)
Example 2: Does not have constant variance (heteroscedasticity)
Scatterplot of Y vs. X (above)
Scatterplot of residuals vs. predicted values (above)
The test statistic is similar to those we have studied for other t-tests:
Where
Both of these values, along with the test statistic, are provided in the output from the software.
Step 3: Find the p-value of the test by using the test statistic as follows
Under the null hypothesis, the test statistic follows a t-distribution with n - 2 degrees of freedom. We will rely on software to obtain the p-value for this test.
Step 4: Conclusion
As usual, we use the magnitude of the p-value to draw our conclusions. A small p-value indicates that the evidence provided by the data is strong enough to reject Ho, and we would conclude that there is enough evidence that the slope in the population is not zero and therefore the two variables are related. In particular, if a significance level of 0.05 is used, we will reject Ho if the p-value is less than 0.05.
Confidence intervals will also be obtained in the software to estimate the true population slope, β_{1} (beta_1).
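The slope test can be sketched by hand on a small invented dataset. The code assumes the usual formulas: the standard error of the slope is the square root of MSE divided by the sum of squared deviations of X, and the test statistic is the slope estimate divided by its standard error. It also computes the t statistic based on r, to illustrate that the two standard tests give the same statistic. The data and variable names are hypothetical.

```python
import math

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [52, 60, 61, 70, 77]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Mean squared error uses n - 2 degrees of freedom.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
mse = sse / (n - 2)

se_slope = math.sqrt(mse / sxx)   # standard error of the slope estimate
t_slope = slope / se_slope        # tests Ho: beta_1 = 0

# Equivalent statistic computed from the correlation coefficient.
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(t_slope, 3), round(t_corr, 3))  # the two values agree
```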
A method for predicting IQ as soon as possible after birth could be important for early intervention in cases such as brain abnormalities or learning disabilities. It has been thought that greater infant vocalization (for instance, more crying) is associated with higher IQ. In 1964, a study was undertaken to see if IQ at 3 years of age is associated with amount of crying at newborn age. In the study, 38 newborns were made to cry after being tapped on the foot and the number of distinct cry vocalizations within 20 seconds was counted. The subjects were followed up at 3 years of age and their IQs were measured.
Data: SPSS format, SAS format, Excel format
Response Variable: IQ at three years of age
Explanatory Variable: newborn cry count
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no (linear) relationship between newborn cry count and IQ at three years of age
Ha: There is a (linear) relationship between newborn cry count and IQ at three years of age
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the p-value
(i) To the best of our knowledge the subjects are independent.
(ii) The scatterplot shows a relationship that is reasonably linear although not very strong.
(iii) The histogram and QQ-plot of the residuals both indicate that the residuals are reasonably normally distributed. The scatterplots of Y vs. X and the residuals vs. the predicted values both show no evidence of nonconstant variance.
The estimated regression equation is
The parameter estimate of the slope is 1.54, which means that for each 1-unit increase in cry count, the average IQ is expected to increase by 1.54 points.
The standard error of the estimate of the slope is 0.584, which gives a test statistic of 2.63 in the output. The output computes this from unrounded values as the slope estimate divided by its standard error.
The p-value is found to be 0.0124. Notice this is exactly the same p-value we obtained for this data in our test of Pearson’s correlation coefficient. These two methods are equivalent and will always produce the same conclusion about the statistical significance of the linear relationship between X and Y.
The 95% confidence interval for β_{1} (beta_1) given in the output is (0.353, 2.720).
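The reported test statistic and interval can be approximately reproduced from the rounded estimates. This sketch assumes the usual form of the interval (estimate plus or minus a critical t value times the standard error) and uses 2.028 as the approximate table value for 95% confidence with 36 degrees of freedom. That critical value is an assumption taken from a t-table, not from the output.

```python
# Rounded values reported in the example.
slope_estimate = 1.54
standard_error = 0.584
n = 38

# Assumed critical value: approximate t-table entry for 95% confidence, n - 2 = 36 df.
t_critical = 2.028

t_statistic = slope_estimate / standard_error
margin = t_critical * standard_error
lower = slope_estimate - margin
upper = slope_estimate + margin

print(round(t_statistic, 2), (round(lower, 2), round(upper, 2)))
```

Small differences from the output, which reports 2.63 and (0.353, 2.720), come from using rounded inputs.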
This regression model has a coefficient of determination of R^{2} = 0.161, which means that 16.1% of the variation in IQ score at age three can be explained by our linear regression model using newborn cry count. This confirms a relatively weak relationship, as we found in our previous example using correlations (Pearson’s correlation coefficient and Spearman’s rank correlation).
Step 4: Conclusion
Conclusion of the test for the slope: Based upon the scatterplot and linear regression analysis, since the relationship is linear and the p-value = 0.0124, there is a statistically significant positive linear relationship between newborn cry count and IQ at age 3.
Interpretation of R-squared: Based upon our R^{2} and scatterplot, the relationship is somewhat weak with only 16.1% of the variation in IQ score at age three being explained by our linear regression model using newborn cry count.
Interpretation of the slope: For each 1-unit increase in cry count, the population mean IQ is expected to increase by 1.54 points; however, the 95% confidence interval suggests this value could be as low as 0.35 points or as high as 2.72 points.
We return to the data from an earlier activity (Learn By Doing – Correlation and Outliers (Software)). The average gestation period, or time of pregnancy, of an animal is closely related to its longevity, the length of its lifespan. Data on the average gestation period and longevity (in captivity) of 40 different species of animals have been recorded. Here is a summary of the variables in our dataset:
In this case, whether we include the outlier or not, there is a problem of nonconstant variance. You can clearly see that, in general, as longevity increases, the variation of gestation increases.
This data is not a particularly good candidate for simple linear regression analysis (without further modification such as transformations or the use of alternative methods).
Pearson’s correlation coefficient (or Spearman’s rank correlation) may still provide a reasonable measure of the strength of the relationship, which is clearly a positive relationship from the scatterplot and our previous measure of correlation.
Output – Contains scatterplots with linear equations and LOESS curves (running average) for the dataset with and without the outlier. Pay particular attention to the problem with nonconstant variance seen in these scatterplots.
The data used in the analysis provided below contains the monthly premiums, driving experience, and gender for a random sample of drivers.
To analyze this data, we have looked at males and females as two separate groups and estimated the correlation and linear regression equation for each gender. We wish to predict the monthly premium using years of driving experience.
Use this output for additional practice with these concepts. For each gender consider the following: