Review: From UNIT 1
Related SAS Tutorials
Related SPSS Tutorials
In inference for relationships, so far we have learned inference procedures for both cases C→Q and C→C from the role/type classification table below.
The last case to be considered in this course is case Q→Q, where both the explanatory and response variables are quantitative. (Case Q→C requires statistical methods that go beyond the scope of this course, one of which is logistic regression.)
For case Q→Q, we will learn the following tests:
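Standard Test(s): test for the significance of Pearson’s correlation coefficient; test for the significance of the slope in simple linear regression
NonParametric Test(s): test for the significance of Spearman’s rank correlation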
In the Exploratory Data Analysis section, we examined the relationship between sample values for two quantitative variables by looking at a scatterplot and, if the relationship was linear, supplementing the scatterplot with the correlation coefficient r and the linear regression equation. We discussed the regression equation but made no attempt to claim that the relationship observed in the sample necessarily held for the larger population from which the sample originated.
Now that we have a better understanding of the process of statistical inference, we will discuss a few methods for inferring something about the relationship between two quantitative variables in an entire population, based on the relationship seen in the sample.
In particular, we will focus on linear relationships and will answer the following questions:
If we satisfy the assumptions and conditions to use the methods, we can estimate the slope and correlation coefficient for our population and conduct hypothesis tests about these parameters.
For the standard tests, the tests for the slope and the correlation coefficient are equivalent; they will always produce the same p-value and conclusion. This is because both tests are based on the same t statistic with n - 2 degrees of freedom.
In this section, we can state our null and alternative hypotheses as:
Ho: There is no relationship between the two quantitative variables X and Y.
Ha: There is a relationship between the two quantitative variables X and Y.
What we know from Unit 1:
r = 0 implies no (linear) relationship between X and Y (note that this is our null hypothesis!)
r > 0 implies a positive relationship between X and Y (as X increases, Y also increases)
r < 0 implies a negative relationship between X and Y (as X increases, Y decreases)
Now here are the steps for hypothesis testing for Pearson’s Correlation Coefficient:
Step 1: State the hypotheses
Consider the above information and our null hypothesis,
Ho: There is no relationship between the two quantitative variables X and Y.
Before we can write this using correlation, we must define the population correlation coefficient. In statistics, we use the Greek letter ρ (rho) to denote the population correlation coefficient. Thus, if there is no relationship between the two quantitative variables X and Y in our population, this hypothesis is equivalent to
Ho: ρ = 0 (rho = 0).
The alternative hypothesis will be
Ha: ρ ≠ 0 (rho is not equal to zero).
However, one-sided tests are possible.
Step 2: Obtain data, check conditions, and summarize data
(i) The sample should be random with independent observations (all observations are independent of all other observations).
(ii) The relationship should be reasonably linear which we can check using a scatterplot. Any clearly nonlinear relationship should not be analyzed using this method.
(iii) To conduct this test, both variables should be normally distributed, which we can check using histograms and QQ-plots. Outliers can cause problems.
Although there is an intermediate test statistic, in effect, the value of r itself serves as our test statistic.
Step 3: Find the p-value of the test by using the test statistic as follows
We will rely on software to obtain the p-value for this test. We have seen this p-value already when we calculated correlation in Unit 1.
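The intermediate test statistic mentioned in Step 2 is a t statistic computed from r and the sample size n, which is compared with a t-distribution with n - 2 degrees of freedom. The sketch below only illustrates what the software does behind the scenes. The function name `pearson_t_test` is made up for this illustration, SciPy is assumed to be available, and the inputs are the rounded values reported in the cry count and IQ example in this section.

```python
import math
from scipy import stats

def pearson_t_test(r, n):
    """Two-sided test of Ho: rho = 0 given a sample correlation r and sample size n."""
    df = n - 2
    # t statistic built from r; larger |r| or larger n gives larger |t|
    t = r * math.sqrt(df) / math.sqrt(1 - r**2)
    # two-sided p-value from the t-distribution with n - 2 degrees of freedom
    p = 2 * stats.t.sf(abs(t), df)
    return t, p

# r = 0.402 and n = 38 are the rounded values from the cry count example
t_stat, p_value = pearson_t_test(0.402, 38)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```

Because r is rounded here, the result may differ slightly in the last digit from the p-value shown in software output.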
Step 4: Conclusion
As usual, we use the magnitude of the p-value to draw our conclusions. A small p-value indicates that the evidence provided by the data is strong enough to reject Ho and conclude (beyond a reasonable doubt) that the two variables are related (ρ ≠ 0). In particular, if a significance level of 0.05 is used, we will reject Ho if the p-value is less than 0.05.
Confidence intervals can be obtained to estimate the true population correlation coefficient, ρ (rho); however, we will not compute these intervals in this course. You could be asked to interpret or use a confidence interval that has been provided to you.
We will look at one nonparametric test in case Q→Q. Spearman’s rank correlation uses the same calculations as for Pearson’s correlation coefficient except that it uses the ranks instead of the original data. This test is useful when there are outliers or when the variables do not appear to be normally distributed.
This measure behaves similarly to r in that:
Now an example:
A method for predicting IQ as soon as possible after birth could be important for early intervention in cases such as brain abnormalities or learning disabilities. It has been thought that greater infant vocalization (for instance, more crying) is associated with higher IQ. In 1964, a study was undertaken to see if IQ at 3 years of age is associated with amount of crying at newborn age. In the study, 38 newborns were made to cry after being tapped on the foot and the number of distinct cry vocalizations within 20 seconds was counted. The subjects were followed up at 3 years of age and their IQs were measured.
Data: SPSS format, SAS format, Excel format
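Response Variable: IQ at three years of age
Explanatory Variable: newborn cry count (number of distinct cry vocalizations within 20 seconds)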
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no relationship between newborn cry count and IQ at three years of age
Ha: There is a relationship between newborn cry count and IQ at three years of age
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the p-value
(i) To the best of our knowledge the subjects are independent.
(ii) The scatterplot shows a relationship that is reasonably linear although not very strong.
(iii) The histograms and QQ-plots show that both variables are slightly skewed right. We would prefer more symmetric distributions; however, the skewness is not extreme, so we will proceed with caution.
Pearson’s correlation coefficient is 0.402 with a p-value of 0.012.
Spearman’s rank correlation is 0.354 with a p-value of 0.029.
Step 4: Conclusion
Based upon the scatterplot and correlation results, there is a statistically significant, but somewhat weak, positive correlation between newborn cry count and IQ at age 3.
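The statement that Spearman’s rank correlation applies Pearson’s calculation to the ranks can be checked numerically. The following is a minimal sketch that uses synthetic data (not the study data) and assumes NumPy and SciPy are available. All variable names and the simulated values are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # synthetic data, NOT the cry count study
x = rng.normal(50, 10, size=38)
y = 2 + 0.5 * x + rng.normal(0, 8, size=38)
y[0] = 200                                # one artificial outlier in Y

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

# Spearman's coefficient is Pearson's r computed on the ranks of X and Y
rank_r, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(f"Pearson r    = {pearson_r:.3f} (p-value {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p-value {spearman_p:.3f})")
print(f"Pearson r on ranks = {rank_r:.3f}")
```

Because only the ranks enter the calculation, the artificial outlier affects Spearman’s coefficient less than it affects Pearson’s r.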
In Unit 1, we discussed the least squares method for estimating the regression line and used software to obtain the slope and intercept of the linear regression equation. These estimates can be considered as the sample statistics which estimate the true population slope and intercept.
Now we will formalize simple linear regression which will require some additional notation.
A regression model expresses two essential ingredients:
Regression is a vast subject which handles a wide variety of possible relationships.
All regression methods begin with a theoretical model which specifies the form of the relationship and includes any needed assumptions or conditions. Now we will introduce a more “statistical” definition of the regression model and define the parameters in the population.
We will use a different notation here than in the beginning of the semester. Now we use regression model style notation.
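We assume the relationship in the population is linear and therefore our regression model can be written as:
Y_{i} = β_{0} + β_{1}X_{i} + ε_{i}
where Y_{i} is the response for the i-th individual, X_{i} is the value of the explanatory variable for that individual, β_{0} (beta_0) is the population intercept, β_{1} (beta_1) is the population slope, and ε_{i} (epsilon_i) is the error for that individual.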
The following picture illustrates the components of this model.
Each orange dot represents an individual observation in the scatterplot. Each observed value is modeled using the previous equation.
The red line is the true linear regression line. The blue dot represents the predicted value for a particular X value and illustrates that our predicted value only estimates the mean, average, or expected value of Y at that X value.
The error for an individual is expected and is due to the variation in our data. In the previous illustration, it is labeled with ε_{i} (epsilon_i) and denoted by a bracket which gives the distance between the orange dot for the observed value and the blue dot for the predicted value for a particular value of X. In practice, we cannot observe the true errors for individuals, but we will be able to estimate them using the residuals, which we will soon define mathematically.
The regression line represents the average Y for a given X and can be expressed in symbols as the expected value of Y for a given X, E(Y|X), or Y-hat.
In Unit 1, we used a to represent the intercept and b to represent the slope that we estimated from our data.
In formal regression procedures, we commonly use beta to represent the population parameter and beta-hat to represent the parameter estimate.
These parameter estimates, which are sample statistics estimated from our data, are also sometimes referred to as the coefficients using algebra terminology.
For each observation in our dataset, we also have a residual which is defined as the difference between the observed value and the predicted value for that observation.
The residuals are used to check our assumptions of normality and constant variance.
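To make the distinction between the unobservable errors and the residuals concrete, here is a minimal sketch that simulates data from a linear model with known parameters, fits the least squares line, and computes the residuals as observed minus predicted. The parameter values and variable names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 10.0, 2.0, 3.0       # hypothetical population parameters
x = rng.uniform(0, 10, size=50)
errors = rng.normal(0, sigma, size=50)     # true errors: unobservable in practice
y = beta0 + beta1 * x + errors             # observed responses

# Least squares estimates; np.polyfit returns the slope first, then the intercept
b1_hat, b0_hat = np.polyfit(x, y, deg=1)

y_hat = b0_hat + b1_hat * x                # predicted values (Y-hat)
residuals = y - y_hat                      # residual = observed - predicted

print(f"estimated intercept = {b0_hat:.2f}, estimated slope = {b1_hat:.2f}")
print(f"sum of residuals = {residuals.sum():.2e}")
```

The residuals track the true errors closely but are not identical to them, because the fitted line only estimates the true line. With an intercept in the model, least squares residuals sum to zero up to rounding error.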
In effect, since we have a quantitative response variable, we are still comparing population means. However, now we must do so for EVERY possible value of X. We want to know if the distribution of Y is the same or different over our range of X values.
This idea is illustrated (including our assumption of normality) in the following picture which shows a case where the distribution of Y is changing as the values of the explanatory variable X change. This change is reflected by only a shift in means since we assume normality and constant variation of Y for all X.
The method used is mathematically equivalent to ANOVA but our interpretations are different due to the quantitative nature of our explanatory variable.
This image shows a scatterplot and regression line on the XY plane, as if flat on a table. Then, standing up along the vertical axis, we draw normal curves centered at the regression line for four different X-values, with X increasing for each.
The centers of the displayed normal distributions show an increase in the mean, while the variation stays constant.
The idea is that the model assumes a normal distribution is a good approximation for how the Y-values will vary around the regression line for a particular value of X.
There is one additional measure which is often of interest in linear regression: the coefficient of determination, R^{2}, which, for simple linear regression, is simply the square of the correlation coefficient, r.
The value of R^{2} is interpreted as the proportion of variation in our response variable Y, which can be explained by the linear regression model using our explanatory variable X.
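The "proportion of variation explained" interpretation corresponds to the usual computation R^{2} = 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares of Y about its mean. The sketch below, on synthetic data with invented names and values, checks that this quantity matches the square of r for a simple linear regression. NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(2)             # synthetic data for illustration only
x = rng.uniform(0, 10, size=40)
y = 5 + 1.2 * x + rng.normal(0, 4, size=40)

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sse = np.sum((y - y_hat) ** 2)             # variation left unexplained by the line
sst = np.sum((y - y.mean()) ** 2)          # total variation in Y
r_squared = 1 - sse / sst

r = np.corrcoef(x, y)[0, 1]                # Pearson's correlation coefficient
print(f"R^2 = {r_squared:.4f}, r^2 = {r**2:.4f}")
```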
Important Properties of R^{2}
A large R^{2} may or MAY NOT mean that the model fits our data well.
The image below illustrates data with a fairly large R^{2} yet the model does not fit the data well.
A small R^{2} may or MAY NOT mean that there is no relationship between X and Y – we must be careful as the relationship that exists may simply not be specified in our model – currently a simple linear model.
The image below illustrates data with a very small R^{2} yet the true relationship is very strong.
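A simple numerical illustration of this second property, using made-up data: if Y is exactly the square of X and the X values are symmetric about zero, Y is completely determined by X, yet the linear correlation (and therefore R^{2} for a straight-line model) is essentially zero. NumPy is assumed to be available.

```python
import numpy as np

x = np.linspace(-3, 3, 61)     # X values symmetric about zero
y = x ** 2                     # Y is a perfect (but nonlinear) function of X

r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2
print(f"r = {r:.6f}, R^2 = {r_squared:.6f}")
```

A scatterplot of these data would show the curved pattern immediately, which is why we always look at the plot before relying on r or R^{2}.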
Now we move into our formal test procedure for simple linear regression.
Step 1: State the hypotheses
In order to test the hypothesis that
Ho: There is no relationship between the two quantitative variables X and Y,
assuming our model is correct (a linear model is sufficient), we can write the above hypothesis as
Ho: β_{1} = 0 (Beta_1 = 0, the slope of our linear equation = 0 in the population).
The alternative hypothesis will be
Ha: β_{1} ≠ 0 (Beta_1 is not equal to zero).
Step 2: Obtain data, check conditions, and summarize data
(i) The sample should be random with independent observations (all observations are independent of all other observations).
(ii) The relationship should be linear which we can check using a scatterplot.
(iii) The residuals should be reasonably normally distributed with constant variance which we can check using the methods discussed below.
Normality: Histogram and QQ-plot of the residuals.
Constant Variance: Scatterplot of Y vs. X and/or a scatterplot of the residuals vs. the predicted values (Yhat). We would like to see random scatter with no pattern and approximately the same spread for all values of X.
Large outliers which fall outside the pattern of the data can cause problems and exert undue influence on our estimates. We saw in Unit 1 that one observation which is far away on the x-axis can have a large impact on the values of the correlation and slope.
Here are two examples each using the two plots mentioned above.
Example 1: Has constant variance (homoscedasticity)
Scatterplot of Y vs. X (above)
Scatterplot of residuals vs. predicted values (above)
Example 2: Does not have constant variance (heteroscedasticity)
Scatterplot of Y vs. X (above)
Scatterplot of residuals vs. predicted values (above)
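The two situations above can be imitated with simulated data. The sketch below is a rough numerical companion to the plots, not a formal test. It fits a line, then compares the standard deviation of the residuals for the larger predicted values with that for the smaller predicted values. The helper `spread_ratio`, the data, and all parameter values are invented for this illustration, and NumPy is assumed to be available.

```python
import numpy as np

def spread_ratio(x, y):
    """SD of residuals in the upper half of predicted values divided by SD in the lower half."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    resid = y - fitted
    upper = fitted > np.median(fitted)
    return resid[upper].std(ddof=1) / resid[~upper].std(ddof=1)

rng = np.random.default_rng(3)             # synthetic data for illustration only
x = rng.uniform(1, 10, size=400)

# Constant variance: the error SD is the same for every X
y_constant = 2 + 3 * x + rng.normal(0, 2, size=400)

# Nonconstant variance: the error SD grows with X (a "fan" shape in the plots)
y_fanning = 2 + 3 * x + rng.normal(0, 0.5 * x)

ratio_constant = spread_ratio(x, y_constant)
ratio_fanning = spread_ratio(x, y_fanning)
print(f"constant variance: ratio = {ratio_constant:.2f}")
print(f"nonconstant variance: ratio = {ratio_fanning:.2f}")
```

A ratio near 1 is consistent with constant variance, while a ratio well above or below 1 suggests the spread changes with X. The residual plots remain the primary tool, since they also reveal curvature and outliers.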
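The test statistic is similar to those we have studied for other t-tests:
t = (β̂_{1} - 0) / SE(β̂_{1})
where β̂_{1} (beta-hat_1) is the estimate of the slope from our sample and SE(β̂_{1}) is the standard error of that estimate.
Both of these values, along with the test statistic, are provided in the output from the software.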
Step 3: Find the p-value of the test by using the test statistic as follows
Under the null hypothesis, the test statistic follows a t-distribution with n - 2 degrees of freedom. We will rely on software to obtain the p-value for this test.
Step 4: Conclusion
As usual, we use the magnitude of the p-value to draw our conclusions. A small p-value indicates that the evidence provided by the data is strong enough to reject Ho, and we would conclude there is enough evidence that the slope in the population is not zero and therefore the two variables are related. In particular, if a significance level of 0.05 is used, we will reject Ho if the p-value is less than 0.05.
Confidence intervals will also be obtained in the software to estimate the true population slope, β_{1} (beta_1).
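As one illustration of how software carries out these steps, the sketch below uses `scipy.stats.linregress` on synthetic data (all values and names are invented). It recomputes the test statistic as the estimated slope divided by its standard error, builds a 95% confidence interval for the slope from the t-distribution with n - 2 degrees of freedom, and compares the p-value with the one from the test of Pearson’s correlation coefficient. NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)             # synthetic data for illustration only
n = 30
x = rng.uniform(0, 20, size=n)
y = 50 + 0.8 * x + rng.normal(0, 6, size=n)

fit = stats.linregress(x, y)               # slope, intercept, rvalue, pvalue, stderr

# Test statistic: estimated slope divided by its standard error
t_stat = fit.slope / fit.stderr
p_slope = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# 95% confidence interval for the population slope
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (fit.slope - t_crit * fit.stderr, fit.slope + t_crit * fit.stderr)

# The test for Pearson's correlation gives the same p-value
r, p_corr = stats.pearsonr(x, y)

print(f"slope = {fit.slope:.3f}, SE = {fit.stderr:.3f}, t = {t_stat:.2f}")
print(f"p-value (slope) = {p_slope:.4g}, p-value (correlation) = {p_corr:.4g}")
print(f"95% CI for slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```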
A method for predicting IQ as soon as possible after birth could be important for early intervention in cases such as brain abnormalities or learning disabilities. It has been thought that greater infant vocalization (for instance, more crying) is associated with higher IQ. In 1964, a study was undertaken to see if IQ at 3 years of age is associated with amount of crying at newborn age. In the study, 38 newborns were made to cry after being tapped on the foot and the number of distinct cry vocalizations within 20 seconds was counted. The subjects were followed up at 3 years of age and their IQs were measured.
Data: SPSS format, SAS format, Excel format
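Response Variable: IQ at three years of age
Explanatory Variable: newborn cry count (number of distinct cry vocalizations within 20 seconds)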
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no (linear) relationship between newborn cry count and IQ at three years of age
Ha: There is a (linear) relationship between newborn cry count and IQ at three years of age
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the p-value
(i) To the best of our knowledge the subjects are independent.
(ii) The scatterplot shows a relationship that is reasonably linear although not very strong.
(iii) The histogram and QQ-plot of the residuals both suggest that the residuals are reasonably normally distributed. The scatterplots of Y vs. X and the residuals vs. the predicted values both show no evidence of nonconstant variance.
The estimated regression equation is
The parameter estimate of the slope is 1.54, which means that for each 1-unit increase in cry count, the average IQ is expected to increase by 1.54 points.
The standard error of the estimate of the slope is 0.584, which gives a test statistic of 2.63 in the output. The software computes this from unrounded values as the estimated slope divided by its standard error; with the rounded values shown here, 1.54 / 0.584 ≈ 2.64.
The p-value is found to be 0.0124. Notice this is exactly the same as we obtained for this data for our test of Pearson’s correlation coefficient. These two methods are equivalent and will always produce the same conclusion about the statistical significance of the linear relationship between X and Y.
The 95% confidence interval for β_{1} (beta_1) given in the output is (0.353, 2.720).
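The reported results can be approximately reproduced from the rounded slope and standard error. This is only a sketch of the arithmetic, assuming SciPy is available. Because the inputs are rounded, the answers differ slightly from the software output in the last digits.

```python
from scipy import stats

# Rounded values reported in the example above
slope, se, n = 1.54, 0.584, 38
df = n - 2                                  # 36 degrees of freedom

t_stat = slope / se                         # test statistic for Ho: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-sided p-value

t_crit = stats.t.ppf(0.975, df)             # multiplier for a 95% confidence interval
ci_low = slope - t_crit * se
ci_high = slope + t_crit * se

print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```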
This regression model has a coefficient of determination of R^{2} = 0.161, which means that 16.1% of the variation in IQ score at age three can be explained by our linear regression model using newborn cry count. This confirms a relatively weak relationship, as we found in our previous example using correlations (Pearson’s correlation coefficient and Spearman’s rank correlation).
Step 4: Conclusion
Conclusion of the test for the slope: Based upon the scatterplot and linear regression analysis, since the relationship is linear and the p-value = 0.0124, there is a statistically significant positive linear relationship between newborn cry count and IQ at age 3.
Interpretation of R-squared: Based upon our R^{2} and scatterplot, the relationship is somewhat weak, with only 16.1% of the variation in IQ score at age three being explained by our linear regression model using newborn cry count.
Interpretation of the slope: For each 1-unit increase in cry count, the population mean IQ is expected to increase by 1.54 points; however, the 95% confidence interval suggests this value could be as low as 0.35 points or as high as 2.72 points.
We return to the data from an earlier activity (Learn By Doing – Correlation and Outliers (Software)). The average gestation period, or time of pregnancy, of an animal is closely related to its longevity, the length of its lifespan. Data on the average gestation period and longevity (in captivity) of 40 different species of animals have been recorded. Here is a summary of the variables in our dataset:
In this case, whether we include the outlier or not, there is a problem of nonconstant variance. You can clearly see that, in general, as longevity increases, the variation of gestation increases.
This data is not a particularly good candidate for simple linear regression analysis (without further modification such as transformations or the use of alternative methods).
Pearson’s correlation coefficient (or Spearman’s rank correlation) may still provide a reasonable measure of the strength of the relationship, which is clearly a positive relationship from the scatterplot and our previous measure of correlation.
Output – Contains scatterplots with linear equations and LOESS curves (running average) for the dataset with and without the outlier. Pay particular attention to the problem with nonconstant variance seen in these scatterplots.
The data used in the analysis provided below contains the monthly premiums, driving experience, and gender for a random sample of drivers.
To analyze this data, we have looked at males and females as two separate groups and estimated the correlation and linear regression equation for each gender. We wish to predict the monthly premium using years of driving experience.
Use this output for additional practice with these concepts. For each gender consider the following: