Related SAS Tutorials
Related SPSS Tutorials
In this part, we continue to handle situations involving one categorical explanatory variable and one quantitative response variable, which is case C→Q.
Here is a summary of the tests we have covered for the case where k = 2. Methods in BOLD are our main focus in this unit.
So far we have discussed the two samples and matched pairs designs, in which the categorical explanatory variable is two-valued. As we saw, in these cases, examining the relationship between the explanatory and the response variables amounts to comparing the mean of the response variable (Y) in two populations, which are defined by the two values of the explanatory variable (X). The difference between the two samples and matched pairs designs is that in the former, the two samples are independent, and in the latter, the samples are dependent.
Independent Samples (More Emphasis): Standard Tests; Non-Parametric Test
Dependent Samples (Less Emphasis): Standard Test; Non-Parametric Tests
We now move on to the case where k > 2 when we have independent samples. Here is a summary of the tests we will learn for the case where k > 2. Notice we will not cover the dependent samples case in this course.
Independent Samples (Only Emphasis): Standard Tests; Non-Parametric Test
Dependent Samples (Not Discussed): Standard Test
Here, as in the two-valued case, making inferences about the relationship between the explanatory (X) and the response (Y) variables amounts to comparing the means of the response variable in the populations defined by the values of the explanatory variable, where the number of means we are comparing depends, of course, on the number of values of X.
Unlike the two-valued case, where we looked at two subcases, (1) when the samples are independent (two samples design) and (2) when the samples are dependent (matched pairs design), here we are just going to discuss the case where the samples are independent. In other words, we are just going to extend the two samples design to more than two independent samples.
The inferential method for comparing more than two means that we will introduce in this part is called ANalysis Of VAriance (abbreviated as ANOVA), and the test associated with this method is called the ANOVA F-test.
In most software, the data need to be arranged so that each row contains one observation with one variable recording X and another variable recording Y for each observation.
As we mentioned earlier, the test that we will present is called the ANOVA F-test, and as you’ll see, this test is different from all the tests we have presented so far: it has not only a different name but a different structure, one that captures the essence of the F-test and clarifies where the name “analysis of variance” comes from.
The question we need to answer is: Are the differences among the sample means due to true differences among the μ’s (alternative hypothesis), or merely due to sampling variability or random chance (null hypothesis)?
Here are two sets of boxplots representing two possible scenarios:
Scenario #1
Scenario #2
Thus, in the language of hypothesis tests, we would say that if the data were configured as they are in scenario 1, we would not reject the null hypothesis that population means were equal for the k groups.
If the data were configured as they are in scenario 2, we would reject the null hypothesis, and we would conclude that not all population means are the same for the k groups.
Let’s summarize what we learned from this.
In order to answer this question using data, we need to look at the variation among the sample means, but this alone is not enough.
We need to look at the variation among the sample means relative to the variation within the groups. In other words, we need to look at the quantity:

(variation among the sample means) / (variation within the groups)

which measures to what extent the difference among the sample means for our groups dominates over the usual variation within sampled groups (which reflects differences in individuals that are typical in random samples).
When the variation within groups is large (like in scenario 1), the variation (differences) among the sample means may become negligible resulting in data which provide very little evidence against Ho. When the variation within groups is small (like in scenario 2), the variation among the sample means dominates over it, and the data have stronger evidence against Ho.
It has a different structure from all the test statistics we’ve looked at so far, but it is similar in that it is still a measure of the evidence against H_{0}. The larger F is (which happens when the denominator, the variation within groups, is small relative to the numerator, the variation among the sample means), the more evidence we have against H_{0}.
Looking at this ratio of variations is the idea behind comparing more than two means; hence the name analysis of variance (ANOVA).
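The two scenarios can be made concrete with a small sketch. The function below computes the one-way ANOVA F statistic from scratch; the group data are hypothetical, chosen so that both scenarios have the same three sample means but very different within-group spread.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic:
    (variation among the sample means) / (variation within the groups)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    # Mean square between groups: weighted spread of the group means
    ms_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / (k - 1)
    # Mean square within groups: pooled spread around each group's own mean
    ms_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups) / (n - k)
    return ms_between / ms_within

# Scenario 1: sample means 5, 7, 9 but large spread within each group
scenario1 = [[1, 5, 9], [3, 7, 11], [5, 9, 13]]
# Scenario 2: the same sample means, small spread within each group
scenario2 = [[4.5, 5, 5.5], [6.5, 7, 7.5], [8.5, 9, 9.5]]

print(anova_f(scenario1))  # 0.75 -- little evidence against Ho
print(anova_f(scenario2))  # 48.0 -- strong evidence against Ho
```

The numerator is identical in both scenarios (the sample means are the same); only the denominator changes, which is exactly why scenario 2 provides much stronger evidence against Ho.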
Now test your understanding of this idea.
Comments
Here is a full statement of the process for the ANOVA F-Test:
Step 1: State the hypotheses
The null hypothesis claims that there is no relationship between X and Y. Since the relationship is examined by comparing the means of Y in the populations defined by the values of X (μ_{1}, μ_{2}, …, μ_{k}), no relationship would mean that all the means are equal.
Therefore the null hypothesis of the F-test is:
As we mentioned earlier, here we have just one alternative hypothesis, which claims that there is a relationship between X and Y. In terms of the means μ_{1}, μ_{2}, …, μ_{k}, it simply says the opposite of the null hypothesis, that not all the means are equal, and we simply write:
Comments:
Step 2: Obtain data, check conditions, and summarize data
The ANOVA F-test can be safely used as long as the following conditions are met:
(i) Each of the populations is normal, or more specifically, the distribution of the response Y in each population is normal, and the samples are random (or at least can be considered as such). In practice, checking normality in the populations is done by looking at each of the samples using a histogram and checking whether there are any signs that the populations are not normal. Such signs could be extreme skewness and/or extreme outliers.
(ii) The populations are known or discovered not to be normal, but the sample size of each of the random samples is large enough (we can use the rule of thumb that a sample size greater than 30 is considered large enough).
(iii) The population standard deviations are equal. We can check this condition using the rule of thumb that the ratio between the largest sample standard deviation and the smallest is less than 2. If that is the case, this condition is considered to be satisfied. We can also check this condition using a formal test similar to that used in the two-sample t-test, although we will not cover any formal tests.
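The rule of thumb is easy to automate; this tiny sketch (the function name and the sample standard deviations are hypothetical) just compares the largest and smallest sample SDs.

```python
def equal_spread_ok(sample_stdevs):
    """Rule of thumb: the largest sample SD is less than twice the smallest."""
    return max(sample_stdevs) / min(sample_stdevs) < 2

# Hypothetical sample SDs for three groups
print(equal_spread_ok([3.1, 2.6, 2.1]))  # True: 3.1 / 2.1 < 2
print(equal_spread_ok([5.0, 2.6, 2.1]))  # False: 5.0 / 2.1 > 2
```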
Test Statistic
Step 3: Find the p-value of the test by using the test statistic as follows
Step 4: Conclusion
As usual, we base our conclusion on the p-value.
Final Comment
Note that when we reject Ho in the ANOVA Ftest, all we can conclude is that
However, the ANOVA F-test does not provide any immediate insight into why Ho was rejected, or in other words, it does not tell us in what way the population means of the groups are different. As an exploratory (or visual) aid to get that insight, we may take a look at the confidence intervals for group population means. More specifically, we can look at which of the confidence intervals overlap and which do not.
Multiple Comparisons:
Now let’s look at some examples using real data.
A college dean believes that students with different majors may experience different levels of academic frustration. Random samples of size 35 of Business, English, Mathematics, and Psychology majors are asked to rate their level of academic frustration on a scale of 1 (lowest) to 20 (highest).
The figure highlights what we have already mentioned: examining the relationship between major (X) and frustration level (Y) amounts to comparing the mean frustration levels among the four majors defined by X. Also, the figure reminds us that we are dealing with a case where the samples are independent.
Step 1: State the hypotheses
The correct hypotheses are:
Step 2: Obtain data, check conditions, and summarize data
Data: SPSS format, SAS format, Excel format, CSV format
In our example all the conditions are satisfied:
The rule of thumb is satisfied since 3.082 / 2.088 < 2. We will look at the formal test in the software.
Test statistic: (Minitab output)
Step 3: Find the p-value of the test by using the test statistic as follows
Step 4: Conclusion
As a followup, we can construct confidence intervals (or conduct multiple comparisons as we will do in the software). This allows us to understand better which population means are likely to be different.
In this case, the business majors are clearly lower on the frustration scale than other majors. It is also possible that English majors are lower than psychology majors based upon the individual 95% confidence intervals in each group.
SAS Output and SAS Code (Includes Non-Parametric Test)
Here is another example
Do advertisers alter the reading level of their ads based on the target audience of the magazine they advertise in?
In 1981, a study of magazine advertisements was conducted (F.K. Shuptrine and D.D. McVicker, “Readability Levels of Magazine Ads,” Journal of Advertising Research, 21:5, October 1981). Researchers selected random samples of advertisements from each of three groups of magazines:
The measure that the researchers used to assess the level of the ads was the number of words in the ad. 18 ads were randomly selected from each of the magazine groups, and the number of words per ad was recorded.
The following figure summarizes this problem:
Our question of interest is whether the number of words in ads (Y) is related to the educational level of the magazine (X). To answer this question, we need to compare μ_{1}, μ_{2}, and μ_{3}, the mean number of words in ads of the three magazine groups. Note in the figure that the sample means are provided. It seems that what the data suggest makes sense; the magazines in group 1 have the largest number of words per ad (on average) followed by group 2, and then group 3.
The question is whether these differences between the sample means are significant. In other words, are the differences among the observed sample means due to true differences among the μ’s or merely due to sampling variability? To answer this question, we need to carry out the ANOVA F-test.
Step 1: Stating the hypotheses.
We are testing:
Conceptually, the null hypothesis claims that the number of words in ads is not related to the educational level of the magazine, and the alternative hypothesis claims that there is a relationship.
Step 2: Checking conditions and summarizing the data.
In order to check the next two conditions, we’ll need to look at the data (condition ii), and calculate the sample standard deviations of the three samples (condition iii).
Using the above, we can address conditions (ii) and (iii)
Before we move on, let’s look again at the graph. It is easy to see the trend of the sample means (indicated by red circles).
However, there is so much variation within each of the groups that there is almost a complete overlap between the three boxplots, and the differences between the means are overshadowed and seem like something that could have happened just by chance.
Let’s move on and see whether the ANOVA F-test will support this observation.
Step 3. Finding the p-value.
Step 4: Making conclusions in context.
Now try one for yourself.
The ANOVA F-test does not provide any insight into why H_{0} was rejected; it does not tell us in what way μ_{1}, μ_{2}, μ_{3}, …, μ_{k} are not all equal. We would like to know which pairs of μ’s are not equal. As an exploratory (or visual) aid to get that insight, we may take a look at the confidence intervals for the group population means μ_{1}, μ_{2}, μ_{3}, …, μ_{k} that appear in the output. More specifically, we should look at the position of the confidence intervals and the overlap (or lack of overlap) between them.
* If the confidence interval for, say, μ_{i} overlaps with the confidence interval for μ_{j}, then μ_{i} and μ_{j} share some plausible values, which means that based on the data we have no evidence that these two μ’s are different.
* If the confidence interval for μ_{i} does not overlap with the confidence interval for μ_{j}, then μ_{i} and μ_{j} do not share plausible values, which means that the data suggest that these two μ’s are different.
Furthermore, if, as in the figure above, the confidence interval (set of plausible values) for μ_{i} lies entirely below the confidence interval (set of plausible values) for μ_{j}, then the data suggest that μ_{i} is smaller than μ_{j}.
Consider our first example on the level of academic frustration.
Based on the small p-value, we rejected H_{o} and concluded that not all four frustration level means are equal, or in other words that frustration level is related to the student’s major. To get more insight into that relationship, we can look at the confidence intervals above (marked in red). The top confidence interval is the set of plausible values for μ_{1}, the mean frustration level of business students. The confidence interval below it is the set of plausible values for μ_{2}, the mean frustration level of English students, etc.
What we see is that the business confidence interval is way below the other three (it doesn’t overlap with any of them). The math confidence interval overlaps with both the English and the psychology confidence intervals; however, there is no overlap between the English and psychology confidence intervals.
This gives us the impression that the mean frustration level of business students is lower than the mean in the other three majors. Within the other three majors, we get the impression that the mean frustration of math students may not differ much from the mean of both English and psychology students, however the mean frustration of English students may be lower than the mean of psychology students.
Note that this is only an exploratory/visual way of getting an impression of why H_{o} was rejected, not a formal one. There is a formal way of doing it that is called “multiple comparisons,” which is beyond the scope of this course. An extension to this course will include this topic in the future.
We will look at one non-parametric test in the k > 2 independent samples setting. We will cover more details later (Details for Non-Parametric Alternatives).
The Kruskal-Wallis test is a general test to compare multiple distributions in independent samples and is a common alternative to the one-way ANOVA.
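The idea behind the test is to pool all observations, rank them, and compare the average ranks across groups. A from-scratch sketch of the Kruskal-Wallis H statistic (omitting the tie correction that software applies; the function name and data are hypothetical) might look like:

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction).
    Pools all observations, ranks them (tied values share an average rank),
    and measures how far each group's mean rank is from the overall mean rank."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)

    def avg_rank(value):
        first = pooled.index(value) + 1  # 1-based rank of first occurrence
        count = pooled.count(value)      # tied values share an average rank
        return first + (count - 1) / 2

    h = sum(
        len(g) * (sum(avg_rank(x) for x in g) / len(g) - (n + 1) / 2) ** 2
        for g in groups
    )
    return 12 * h / (n * (n + 1))

# Three hypothetical groups whose values do not overlap at all,
# so the mean ranks are as far apart as these sample sizes allow
print(kruskal_wallis_h([[1, 2], [3, 4], [5, 6]]))
```

Software compares H to a chi-square distribution (with k – 1 degrees of freedom) to obtain the p-value.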
Related SAS Tutorials
Related SPSS Tutorials
Here is a summary of the tests we will learn for the scenario where k = 2. Methods in BOLD will be our main focus.
We have completed our discussion on dependent samples (2nd column) and now we move on to independent samples (1st column).
Independent Samples (More Emphasis): Standard Tests; Non-Parametric Test
Dependent Samples (Less Emphasis): Standard Test; Non-Parametric Tests
We have discussed the dependent sample case where observations are matched/paired/linked between the two samples. Recall that in that scenario observations can be the same individual or two individuals who are matched between samples. To analyze data from dependent samples, we simply took the differences and analyzed them using one-sample techniques.
Now we will discuss the independent sample case. In this case, all individuals are independent of all other individuals in their sample as well as all individuals in the other sample. This is most often accomplished by either:
Recall that here we are interested in the effect of a two-valued (k = 2) categorical variable (X) on a quantitative response (Y). Random samples from the two subpopulations (defined by the two categories of X) are obtained and we need to evaluate whether or not the data provide enough evidence for us to believe that the two subpopulation means are different.
In other words, our goal is to test whether the means μ_{1} and μ_{2} (which are the means of the variable of interest in the two subpopulations) are equal or not, and in order to do that we have two samples, one from each subpopulation, which were chosen independently of each other.
The test that we will learn here is commonly known as the two-sample t-test. As the name suggests, this is a t-test, which as we know means that the p-values for this test are calculated under some t-distribution.
Here are figures that illustrate some of the examples we will cover. Notice how the original variables X (categorical variable with two levels) and Y (quantitative variable) are represented. Think about the fact that we are in case C → Q!
As in our discussion of dependent samples, we will often simplify our terminology and simply use the terms “population 1” and “population 2” instead of referring to these as subpopulations. Either terminology is fine.
Question: Does it matter which population we label as population 1 and which as population 2?
Answer: No, it does not matter as long as you are consistent, meaning that you do not switch labels in the middle.
Recall that our goal is to compare the means μ_{1} and μ_{2} based on the two independent samples.
The hypotheses represent our goal to compare μ_{1}and μ_{2}.
The null hypothesis is always:
Ho: μ_{1} – μ_{2} = 0 (which is the same as μ_{1} = μ_{2})
(There IS NO association between the categorical explanatory variable and the quantitative response variable)
We will focus on the two-sided alternative hypothesis of the form:
Ha: μ_{1} – μ_{2} ≠ 0 (which is the same as μ_{1} ≠ μ_{2}) (two-sided)
(There IS AN association between the categorical explanatory variable and the quantitative response variable)
Note that the null hypothesis claims that there is no difference between the means. Conceptually, Ho claims that there is no relationship between the two relevant variables (X and Y).
Our parameter of interest in this case (the parameter about which we are making an inference) is the difference between the means (μ_{1} – μ_{2}) and the null value is 0. The alternative hypothesis claims that there is a difference between the means.
The two-sample t-test can be safely used as long as the following conditions are met:
The two samples are indeed independent.
We are in one of the following two scenarios:
(i) Both populations are normal, or more specifically, the distribution of the response Y in both populations is normal, and both samples are random (or at least can be considered as such). In practice, checking normality in the populations is done by looking at each of the samples using a histogram and checking whether there are any signs that the populations are not normal. Such signs could be extreme skewness and/or extreme outliers.
(ii) The populations are known or discovered not to be normal, but the sample size of each of the random samples is large enough (we can use the rule of thumb that a sample size greater than 30 is considered large enough).
Assuming that we can safely use the two-sample t-test, we need to summarize the data, and in particular, calculate our data summary, the test statistic.
Test Statistic for Two-Sample T-test:
There are two choices for our test statistic, and we must choose the appropriate one to summarize our data. We will see how to choose between the two test statistics in the next section. The two options are as follows:
We use the following notation to describe our samples:
Here are the two cases for our test statistic.
(A) Equal Variances: If it is safe to assume that the two populations have equal standard deviations, we can pool our estimates of this common population standard deviation and use the following test statistic.
where
(B) Unequal Variances: If it is NOT safe to assume that the two populations have equal standard deviations, we have unequal standard deviations and must use the following test statistic.
Comments:
Each of these tests relies on a particular t-distribution under which the p-values are calculated. In the case where equal variances are assumed, the degrees of freedom are simply n_{1} + n_{2} – 2, whereas in the case of unequal variances, the formula for the degrees of freedom is more complex. We will rely on the software to obtain the degrees of freedom in both cases and provide us with the correct p-value (usually this will be a two-sided p-value).
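Both versions of the statistic can be sketched directly from the definitions above. The sample data below are hypothetical, and the degrees-of-freedom formula in the unequal-variances branch is the Welch-Satterthwaite approximation commonly used by software.

```python
from math import sqrt
from statistics import mean, stdev

def two_sample_t(x, y, equal_var=True):
    """Two-sample t statistic and its degrees of freedom:
    pooled version (equal variances) or Welch version (unequal variances)."""
    n1, n2 = len(x), len(y)
    s1, s2 = stdev(x), stdev(y)
    if equal_var:
        # Pooled estimate of the common variance
        sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
        se = sqrt(sp2 * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
        # Welch-Satterthwaite approximation for the degrees of freedom
        df = (s1 ** 2 / n1 + s2 ** 2 / n2) ** 2 / (
            (s1 ** 2 / n1) ** 2 / (n1 - 1) + (s2 ** 2 / n2) ** 2 / (n2 - 1)
        )
    return (mean(x) - mean(y)) / se, df
```

With equal sample sizes and equal sample standard deviations the two versions agree exactly; they diverge as the sample SDs or sample sizes become more unequal.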
As usual, we draw our conclusion based on the p-value. Be sure to write your conclusions in context by specifying your current variables and/or precisely describing the difference in population means in terms of the current variables.
If the p-value is small, there is a statistically significant difference between what was observed in the sample and what was claimed in Ho, so we reject Ho.
Conclusion: There is enough evidence that the categorical explanatory variable is related to (or associated with) the quantitative response variable. More specifically, there is enough evidence that the difference in population means is not equal to zero.
If the p-value is not small, we do not have enough statistical evidence to reject Ho.
Conclusion: There is NOT enough evidence that the categorical explanatory variable is related to (or associated with) the quantitative response variable. More specifically, there is not enough evidence that the difference in population means is different from zero.
In particular, if a cutoff probability, α (significance level), is specified, we reject Ho if the p-value is less than α. Otherwise, we do not reject Ho.
As in previous methods, we can follow up with a confidence interval for the difference between population means, μ_{1} – μ_{2}, and interpret this interval in the context of the problem.
Interpretation: We are 95% confident that the population mean for (one group) is between __________________ compared to the population mean for (the other group).
Confidence intervals can also be used to determine whether or not to reject the null hypothesis of the test based upon whether or not the null value of zero falls outside the interval or inside.
If the null value, 0, falls outside the confidence interval, Ho is rejected. (Zero is NOT a plausible value based upon the confidence interval)
If the null value, 0, falls inside the confidence interval, Ho is not rejected. (Zero IS a plausible value based upon the confidence interval)
NOTE: Be careful to choose the correct confidence interval about the difference between population means using the same assumption (variances equal or variances unequal) and not the individual confidence intervals for the means in the groups themselves.
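The decision rule above can be sketched with a simplified interval for the difference in means. The function below is a stand-in for the t-based intervals software reports: it uses the normal critical value (reasonable when both samples are large) with the unequal-variances standard error, and the summary numbers in the demo are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(m1, s1, n1, m2, s2, n2, level=0.95):
    """Approximate confidence interval for mu1 - mu2, using the normal
    critical value as a large-sample stand-in for the t critical value."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = m1 - m2
    return diff - z * se, diff + z * se

# Hypothetical summaries: mean 10 (SD 2, n = 100) vs. mean 9 (SD 2, n = 100)
lo, hi = diff_ci(10, 2, 100, 9, 2, 100)
# Here 0 falls outside (lo, hi), so Ho: mu1 - mu2 = 0 would be rejected
# at the 5% significance level
```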
Since we have two possible tests we can conduct, based upon whether or not we can assume the population standard deviations (or variances) are equal, we need a method to determine which test to use.
Although you can make a reasonable guess using information from the data (i.e., look at the distributions and estimates of the standard deviations and see if you feel they are reasonably equal), we have a test which can help us here, called the test for Equality of Variances. This output is automatically displayed in many software packages when a two-sample t-test is requested, although the particular test used may vary. The hypotheses of this test are:
Ho: σ_{1} = σ_{2} (the standard deviations in the two populations are the same)
Ha: σ_{1} ≠ σ_{2} (the standard deviations in the two populations are not the same)
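The particular test varies by package (SAS reports a folded F test, for example, while SPSS reports Levene’s test). As a minimal sketch of the folded-F idea, the statistic is simply the ratio of the larger sample variance to the smaller; the function name is hypothetical.

```python
def folded_f(s1, s2):
    """Folded F statistic for Ho: sigma1 = sigma2.
    The ratio of the larger sample variance to the smaller; software
    compares it to an F distribution to obtain the p-value."""
    v1, v2 = s1 ** 2, s2 ** 2
    return max(v1, v2) / min(v1, v2)

print(folded_f(2.0, 3.0))  # 2.25: the larger variance (9) over the smaller (4)
```

A ratio near 1 is consistent with equal population standard deviations; a ratio far above 1 is evidence against Ho.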
Now let’s look at a complete example of conducting a twosample ttest, including the embedded test for equality of variances.
This question was asked of a random sample of 239 college students, who were to answer on a scale of 1 to 25. An answer of 1 means personality has maximum importance and looks no importance at all, whereas an answer of 25 means looks have maximum importance and personality no importance at all. The purpose of this survey was to examine whether males and females differ with respect to the importance of looks vs. personality.
Note that the data have the following format:
Score (Y)  Gender (X) 
15  Male 
13  Female 
10  Female 
12  Male 
14  Female 
14  Male 
6  Male 
17  Male 
etc. 
The format of the data reminds us that we are essentially examining the relationship between the two-valued categorical variable, gender, and the quantitative response, score. The two values of the categorical explanatory variable (k = 2) define the two populations that we are comparing — males and females. The comparison is with respect to the response variable score. Here is a figure that summarizes the example:
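With data in this long format, a first step in any software is to split the response by the values of X. A minimal Python sketch, using the rows shown above (the variable names are hypothetical):

```python
# Each row is one observation: (score Y, gender X)
rows = [(15, "Male"), (13, "Female"), (10, "Female"), (12, "Male"),
        (14, "Female"), (14, "Male"), (6, "Male"), (17, "Male")]

# Split the quantitative response into one sample per value of X
scores = {}
for score, gender in rows:
    scores.setdefault(gender, []).append(score)

# scores["Male"] and scores["Female"] are now the two independent samples
# to be compared with the two-sample t-test
```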
Comments:
Step 1: State the hypotheses
Recall that the purpose of this survey was to examine whether the opinions of females and males differ with respect to the importance of looks vs. personality. The hypotheses in this case are therefore:
Ho: μ_{1} – μ_{2} = 0 (which is the same as μ_{1} = μ_{2})
Ha: μ_{1} – μ_{2} ≠ 0 (which is the same as μ_{1} ≠ μ_{2})
where μ_{1} represents the mean “looks vs personality score” for females and μ_{2} represents the mean “looks vs personality score” for males.
It is important to understand that conceptually, the two hypotheses claim:
Ho: Score (of looks vs. personality) is not related to gender
Ha: Score (of looks vs. personality) is related to gender
Step 2: Obtain data, check conditions, and summarize data
The output might also be broken up if you export or copy the items in certain ways. The results are the same but it can be more difficult to read.
Step 3: Find the p-value of the test by using the test statistic as follows
Step 4: Conclusion
As usual, a small p-value provides evidence against Ho. In our case our p-value is practically 0 (which is smaller than any level of significance that we will choose). The data therefore provide very strong evidence against Ho, so we reject it.
As a follow-up to this conclusion, we can construct a confidence interval for the difference between population means. In this case we will construct a confidence interval for μ_{1} – μ_{2}, the population mean “looks vs. personality score” for females minus the population mean “looks vs. personality score” for males.
Practical Significance:
We should definitely ask ourselves if this difference is practically significant.
SPSS Output for this example (Non-Parametric Output for Examples 1 and 2)
SAS Output and SAS Code (Includes Non-Parametric Test)
Here is another example.
A study was conducted which enrolled and followed heart attack patients in a certain metropolitan area. In this example we are interested in determining if there is a relationship between Body Mass Index (BMI) and gender. Individuals presenting to the hospital with a heart attack were randomly selected to participate in the study.
Step 1: State the hypotheses
Ho: μ_{1} – μ_{2} = 0 (which is the same as μ_{1} = μ_{2})
Ha: μ_{1} – μ_{2} ≠ 0 (which is the same as μ_{1} ≠ μ_{2})
where μ_{1} represents the mean BMI for males and μ_{2} represents the mean BMI for females.
It is important to understand that conceptually, the two hypotheses claim:
Ho: BMI is not related to gender in heart attack patients
Ha: BMI is related to gender in heart attack patients
Step 2: Obtain data, check conditions, and summarize data
Step 3: Find the p-value of the test by using the test statistic as follows
Step 4: Conclusion
As usual, a small p-value provides evidence against Ho. In our case our p-value is 0.001 (which is smaller than any significance level that we would typically choose). The data therefore provide very strong evidence against Ho, so we reject it.
As a follow-up to this conclusion, we can construct a confidence interval for the difference between population means. In this case we will construct a confidence interval for μ_{1} – μ_{2}, the population mean BMI for males minus the population mean BMI for females.
Practical Significance:
SPSS Output for this example (Non-Parametric Output for Examples 1 and 2)
SAS Output and SAS Code (Includes Non-Parametric Test)
Note: In the SAS output the variable gender is not formatted; in this case Males = 0 and Females = 1.
Comments:
You might ask yourself: “Where do we use the test statistic?”
It is true that for all practical purposes all we have to do is check that the conditions which allow us to use the two-sample t-test are met, lift the p-value from the output, and draw our conclusions accordingly.
However, we feel that it is important to mention the test statistic for two reasons:
Now try some more activities for yourself.
We will look at one non-parametric test in the two independent samples setting. More details will be discussed later (Details for Non-Parametric Alternatives).
Related SAS Tutorials
Related SPSS Tutorials
We are in Case C→Q of inference about relationships, where the explanatory variable is categorical and the response variable is quantitative.
As we mentioned in the summary of the introduction to Case C→Q, the first case that we will deal with is that involving matched pairs. In this case:
Notice from this point forward we will use the terms population 1 and population 2 instead of subpopulation 1 and subpopulation 2. Either terminology is correct.
One of the most common cases where dependent samples occur is when both samples have the same subjects and they are “paired by subject.” In other words, each subject is measured twice on the response variable, typically before and then after some kind of treatment/intervention in order to assess its effectiveness.
Suppose you want to assess the effectiveness of an SAT prep class.
It would make sense to use the matched pairs design and record each sampled student’s SAT score before and after the SAT prep classes are attended:
Recall that the two populations represent the two values of the explanatory variable. In this situation, those two values come from a single set of subjects.
This, however, is not the only case where the paired design is used. Other cases are when the pairs are “natural pairs,” such as siblings, twins, or couples.
Notes about graphical summaries for paired data in Case C→Q:
The idea behind the paired t-test is to reduce this two-sample situation, where we are comparing two means, to a single sample situation where we are doing inference on a single mean, and then use a simple t-test that we introduced in the previous module.
In this setting, we can easily reduce the raw data to a set of differences and conduct a one-sample t-test.
In other words, by reducing the two samples to one sample of differences, we are essentially reducing the problem from a problem where we’re comparing two means (i.e., doing inference on μ_{1}−μ_{2}) to a problem in which we are studying one mean.
In general, in every matched pairs problem, our data consist of 2 samples which are organized in n pairs:
We reduce the two samples to only one by calculating the difference between the two observations for each pair.
For example, think of Sample 1 as “before” and Sample 2 as “after”. We can find the difference between the before and after results for each participant, which gives us only one sample, namely “before – after”. We label this difference as “d” in the illustration below.
The paired t-test is based on this one sample of n differences,
and it uses those differences as data for a one-sample t-test on a single mean — the mean of the differences.
This is the general idea behind the paired t-test; it is nothing more than a regular one-sample t-test for the mean of the differences!
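This reduction can be sketched in a few lines. The function below forms the differences and computes the one-sample t statistic on them; the before/after scores in the demo are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(sample1, sample2):
    """Paired t statistic: a one-sample t-test on the differences d = x1 - x2,
    with null value 0."""
    d = [a - b for a, b in zip(sample1, sample2)]
    n = len(d)
    # t = (mean of differences - 0) / (SD of differences / sqrt(n))
    return mean(d) / (stdev(d) / sqrt(n))

# Hypothetical "before" and "after" scores for four subjects
before = [5, 6, 7, 8]
after = [4, 4, 6, 6]
t = paired_t(before, after)  # differences [1, 2, 1, 2], mean difference 1.5
```

Software then compares t to a t-distribution with n – 1 degrees of freedom to obtain the p-value.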
We will now go through the 4-step process of the paired t-test.
Recall that in the t-test for a single mean our null hypothesis was: Ho: μ = μ_{0} and the alternative was one of Ha: μ < μ_{0} or μ > μ_{0} or μ ≠ μ_{0}. Since the paired t-test is a special case of the one-sample t-test, the hypotheses are the same except that:
Instead of simply μ we use the notation μ_{d} to denote that the parameter of interest is the mean of the differences.
In this course our null value μ_{0} is always 0. In other words, going back to our original paired samples, our null hypothesis claims that there is no difference between the two means. (Technically, it does not have to be zero if you are interested in a more specific difference – for example, you might be interested in showing that there is a reduction in blood pressure of more than 10 points, but we will not specifically look at such situations).
Therefore, in the paired ttest: The null hypothesis is always:
Ho: μ_{d} = 0
(There IS NO association between the categorical explanatory variable and the quantitative response variable)
We will focus on the twosided alternative hypothesis of the form:
Ha: μ_{d} ≠ 0
(There IS AN association between the categorical explanatory variable and the quantitative response variable)
Some students find it helpful to know that it turns out that μ_{d} = μ_{1} – μ_{2} (in other words, the difference between the means is the same as the mean of the differences). You may find it easier to first think about the hypotheses in terms of μ_{1} – μ_{2} and then represent it in terms of μ_{d}.
The paired ttest, as a special case of a onesample ttest, can be safely used as long as:
The sample of differences is random (or at least can be considered random in context).
The distribution of the differences in the population should vary normally if you have small samples. If the sample size is large, it is safe to use the paired ttest regardless of whether the differences vary normally or not. This condition is satisfied in the three situations marked by a green check mark in the table below.
Note: normality is checked by looking at the histogram of differences, and as long as no clear violation of normality (such as extreme skewness and/or outliers) is apparent, the normality assumption is reasonable.
Assuming that we can safely use the paired ttest, the data are summarized by a test statistic:
where
This test statistic measures (in standard errors) how far our data are (represented by the sample mean of the differences) from the null hypothesis (represented by the null value, 0).
Notice this test statistic has the same general form as those discussed earlier:
As a special case of the onesample ttest, the null distribution of the paired ttest statistic is a t distribution (with n – 1 degrees of freedom), which is the distribution under which the pvalues are calculated. We will use software to find the pvalue for us.
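The calculation the software performs can be sketched by hand. The code below (simulated differences, illustrative only) computes the test statistic t = (d̄ − 0)/(s_d/√n), looks up the two-sided p-value from a t distribution with n − 1 degrees of freedom, and confirms it matches the software's one-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d = rng.normal(loc=-0.4, scale=1.0, size=20)  # hypothetical differences, n = 20

n = len(d)
# t = (dbar - 0) / (s_d / sqrt(n)): how far the sample mean of the
# differences is from the null value 0, measured in standard errors
t = (d.mean() - 0) / (d.std(ddof=1) / np.sqrt(n))

# two-sided p-value from a t distribution with n - 1 degrees of freedom
p = 2 * stats.t.sf(abs(t), df=n - 1)

# matches the software's built-in one-sample t-test
t_sw, p_sw = stats.ttest_1samp(d, 0.0)
```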
As usual, we draw our conclusion based on the pvalue. Be sure to write your conclusions in context by specifying your current variables and/or precisely describing the population mean difference in terms of the current variables.
In particular, if a cutoff probability, α (significance level), is specified, we reject Ho if the pvalue is less than α. Otherwise, we fail to reject Ho.
If the pvalue is small, there is a statistically significant difference between what was observed in the sample and what was claimed in Ho, so we reject Ho.
Conclusion: There is enough evidence that the categorical explanatory variable is associated with the quantitative response variable. More specifically, there is enough evidence that the population mean difference is not equal to zero.
Remember: a small pvalue tells us that there is very little chance of getting data like those observed (or even more extreme) if the null hypothesis were true. Therefore, a small pvalue indicates that we should reject the null hypothesis.
If the pvalue is not small, we do not have enough statistical evidence to reject Ho.
Conclusion: There is NOT enough evidence that the categorical explanatory variable is associated with the quantitative response variable. More specifically, there is NOT enough evidence that the population mean difference is not equal to zero.
Notice how much better the first sentence sounds! It can get difficult to correctly phrase these conclusions in terms of the mean difference without confusing double negatives.
As in previous methods, we can followup with a confidence interval for the mean difference, μ_{d} and interpret this interval in the context of the problem.
Interpretation: We are 95% confident that the population mean difference (described in context) is between (lower bound) and (upper bound).
Confidence intervals can also be used to determine whether or not to reject the null hypothesis of the test based upon whether or not the null value of zero falls outside the interval or inside.
If the null value, 0, falls outside the confidence interval, Ho is rejected. (Zero is NOT a plausible value based upon the confidence interval)
If the null value, 0, falls inside the confidence interval, Ho is not rejected. (Zero IS a plausible value based upon the confidence interval)
NOTE: Be careful to choose the correct confidence interval about the population mean difference and not the individual confidence intervals for the means in the groups themselves.
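The agreement between the confidence-interval decision and the p-value decision can be checked directly. This sketch (simulated differences) builds the 95% t-interval for the mean difference and verifies that rejecting when 0 falls outside the interval matches rejecting when the two-sided p-value is below 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d = rng.normal(loc=-0.4, scale=0.8, size=20)  # hypothetical differences

n = len(d)
se = d.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # critical value for 95% confidence

lower = d.mean() - t_crit * se
upper = d.mean() + t_crit * se

# Reject Ho at the 0.05 level exactly when 0 falls outside the interval
reject_by_ci = not (lower <= 0.0 <= upper)
_, p = stats.ttest_1samp(d, 0.0)
reject_by_p = p < 0.05
```

For a two-sided t-test and the matching t-interval this equivalence is exact, not approximate.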
Now let’s look at an example.
Note: In some of the videos presented in the course materials, we do conduct the onesided test for this data instead of the twosided test we conduct below. In Unit 4B we are going to restrict our attention to twosided tests supplemented by confidence intervals as needed to provide more information about the effect of interest.
Drunk driving is one of the main causes of car accidents. Interviews with drunk drivers who were involved in accidents and survived revealed that one of the main problems is that drivers do not realize that they are impaired, thinking "I only had 1-2 drinks … I am OK to drive."
A sample of 20 drivers was chosen, and their reaction times in an obstacle course were measured before and after drinking two beers. The purpose of this study was to check whether drivers are impaired after drinking two beers. Here is a figure summarizing this study:
Since the measurements are paired, we can easily reduce the raw data to a set of differences and conduct a onesample ttest.
Here are some of the results for this data:
Step 1: State the hypotheses
We define μ_{d }= the population mean difference in reaction times (Before – After).
As we mentioned, the null hypothesis is:
The null hypothesis claims that the differences in reaction times are centered at (or around) 0, indicating that drinking two beers has no real impact on reaction times. In other words, drivers are not impaired after drinking two beers.
Although we really want to know whether their reaction times are longer after the two beers, we will still focus on conducting twosided hypothesis tests. We will be able to address whether the reaction times are longer after two beers when we look at the confidence interval.
Therefore, we will use the twosided alternative:
Step 2: Obtain data, check conditions, and summarize data
Let’s first check whether we can safely proceed with the paired ttest, by checking the two conditions.
We can see from the histogram above that there is no evidence of violation of the normality assumption (on the contrary, the histogram looks quite normal).
Also note that the vast majority of the differences are negative (i.e., the total reaction times for most of the drivers are larger after the two beers), suggesting that the data provide evidence against the null hypothesis.
The question (which the pvalue will answer) is whether these data provide strong enough evidence or not against the null hypothesis. We can safely proceed to calculate the test statistic (which in practice we leave to the software to calculate for us).
Test Statistic: We will use software to calculate the test statistic, which is t = -2.58.
Step 3: Find the pvalue of the test by using the test statistic as follows
As a special case of the onesample ttest, the null distribution of the paired ttest statistic is a t distribution (with n – 1 degrees of freedom), which is the distribution under which the pvalues are calculated.
We will let the software find the p-value for us; in this case, it gives a p-value of 0.0183 (SAS) or 0.018 (SPSS).
The small p-value tells us that there is very little chance of getting data like those observed (or even more extreme) if the null hypothesis were true. More specifically, there is less than a 2% chance (0.018 = 1.8%) of obtaining a test statistic of -2.58 (or lower) or 2.58 (or higher), assuming that 2 beers have no impact on reaction times.
Step 4: Conclusion
In our example, the pvalue is 0.018, indicating that the data provide enough evidence to reject Ho.
Followup Confidence Interval:
As a followup to this conclusion, we quantify the effect that two beers have on the driver, using the 95% confidence interval for μ_{d}.
Using statistical software, we find that the 95% confidence interval for μ_{d}, the mean of the differences (before – after), is roughly (-0.9, -0.1).
Note: Since the differences were calculated before – after, longer reaction times after the beers translate into negative differences.
Since the confidence interval does not contain the null value of zero, we can use it to decide to reject the null hypothesis. Zero is not a plausible value of the population mean difference based upon the confidence interval. Notice that using this method is not always practical as often we still need to provide the pvalue in clinical research. (Note: this is NOT the interpretation of the confidence interval but a method of using the confidence interval to conduct a hypothesis test.)
Practical Significance:
We should definitely ask ourselves if this is practically significant and I would argue that it is.
In the output, we are generally provided the two-sided p-value, and we must be very careful when converting this to a one-sided p-value (if it is not provided by the software). If the observed effect is in the direction specified by Ha, the one-sided p-value is half of the two-sided p-value; otherwise, it is 1 minus half of the two-sided p-value.
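As a sketch of that conversion, using the values reported for the beer example (the differences before – after were mostly negative, so the test statistic is negative) and assuming the one-sided alternative Ha: μ_d < 0:

```python
# Values reported by the software for the beer example
t_stat = -2.58       # negative: most before - after differences were negative
p_two_sided = 0.018  # two-sided p-value

# One-sided alternative Ha: mu_d < 0 (reaction times longer after two beers)
if t_stat < 0:
    # observed effect is in the direction of Ha: halve the two-sided p-value
    p_one_sided = p_two_sided / 2
else:
    # observed effect is in the opposite direction
    p_one_sided = 1 - p_two_sided / 2
```

Here the one-sided p-value would be 0.009, half of the reported two-sided value.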
The “driving after having 2 beers” example is a case in which observations are paired by subject. In other words, the two samples consist of the same subjects, so that each subject is measured twice. Typically, as in our example, one of the measurements occurs before a treatment/intervention (2 beers in our case), and the other measurement after the treatment/intervention.
Our next example is another typical type of study where the matched pairs design is used—it is a study involving twins.
Researchers have long been interested in the extent to which intelligence, as measured by IQ score, is affected by “nurture” as opposed to “nature”: that is, are people’s IQ scores mainly a result of their upbringing and environment, or are they mainly an inherited trait?
A study was designed to measure the effect of home environment on intelligence, or more specifically, the study was designed to address the question: “Are there statistically significant differences in IQ scores between people who were raised by their birth parents, and those who were raised by someone else?”
In order to be able to answer this question, the researchers needed to get two groups of subjects (one from the population of people who were raised by their birth parents, and one from the population of people who were raised by someone else) who are as similar as possible in all other respects. In particular, since genetic differences may also affect intelligence, the researchers wanted to control for this confounding factor.
We know from our discussion on study design (in the Producing Data unit of the course) that one way to (at least theoretically) control for all confounding factors is randomization—randomizing subjects to the different treatment groups. In this case, however, this is not possible. This is an observational study; you cannot randomize children to either be raised by their birth parents or to be raised by someone else. How else can we eliminate the genetics factor? We can conduct a “twin study.”
Because identical twins are genetically the same, a good design for obtaining information to answer this question would be to compare IQ scores for identical twins, one of whom is raised by birth parents and the other by someone else. Such a design (matched pairs) is an excellent way of making a comparison between individuals who only differ with respect to the explanatory variable of interest (upbringing) but are as alike as they can possibly be in all other important aspects (inborn intelligence). Identical twins raised apart were studied by Susan Farber, who published her studies in the book “Identical Twins Reared Apart” (1981, Basic Books).
In this problem, we are going to use the data that appear in Farber’s book in table E6, of the IQ scores of 32 pairs of identical twins who were reared apart.
Here is a figure that will help you understand this study:
Here are the important things to note in the figure:
Each of the 32 rows represents one pair of twins. Keeping the notation that we used above, twin 1 is the twin that was raised by his/her birth parents, and twin 2 is the twin that was raised by someone else. Let’s carry out the analysis.
Step 1: State the hypotheses
Recall that in matched pairs, we reduce the data from two samples to one sample of differences:
The hypotheses are stated in terms of the mean of the difference where, μ_{d} = population mean difference in IQ scores (Birth Parents – Someone Else):
Step 2: Obtain data, check conditions, and summarize data
Is it safe to use the paired ttest in this case?
The data don’t reveal anything that we should be worried about (like very extreme skewness or outliers), so we can safely proceed. Looking at the histogram, we note that most of the differences are negative, indicating that in most of the 32 pairs of twins, twin 2 (raised by someone else) has a higher IQ.
From this point we rely on statistical software, and find that:
Our test statistic is t = -1.85.
Our data (represented by the sample mean of the differences) are 1.85 standard errors below the null hypothesis (represented by the null value 0).
Step 3: Find the pvalue of the test by using the test statistic as follows
The pvalue is 0.074, indicating that there is a 7.4% chance of obtaining data like those observed (or even more extreme) assuming that H_{o} is true (i.e., assuming that there are no differences in IQ scores between people who were raised by their natural parents and those who weren’t).
Step 4: Conclusion
Using the conventional significance level (cutoff probability) of .05, our pvalue is not small enough, and we therefore cannot reject H_{o}.
Confidence Interval:
The 95% confidence interval for the population mean difference is (-6.11322, 0.30072).
Interpretation:
This confidence interval does contain zero and thus results in the same conclusion to the hypothesis test. Zero IS a plausible value of the population mean difference and thus we cannot reject the null hypothesis.
Practical Significance:
It is very important to pay attention to whether the two-sample t-test or the paired t-test is appropriate. In other words, being aware of the study design is extremely important. Consider our drunk driving example: if we had not “caught” that it uses a matched pairs design, and had analyzed the data as if the two samples were independent using the two-sample t-test, we would have obtained a p-value of 0.114.
Note that using this (wrong) method to analyze the data, and a significance level of 0.05, we would conclude that the data do not provide enough evidence for us to conclude that reaction times differed after drinking two beers. This is an example of how using the wrong statistical method can lead you to wrong conclusions, which in this context can have very serious implications.
Comments:
Now try a complete example for yourself.
Here are two other datasets with paired samples.
The statistical tests we have previously discussed (and many we will discuss) require assumptions about the distribution in the population or about the requirements to use a certain approximation as the sampling distribution. These methods are called parametric.
When these assumptions are not valid, alternative methods often exist to test similar hypotheses. Tests which require only minimal distributional assumptions, if any, are called nonparametric or distributionfree tests.
At the end of this section we will provide some details (see Details for Non-Parametric Alternatives); for now we simply want to mention that there are two common non-parametric alternatives to the paired t-test: the sign test and the Wilcoxon signed-rank test.
The fact that both of these tests have the word “sign” in their names is not a coincidence: in both, we are interested in whether each difference has a positive sign or a negative sign. This can help you remember that they correspond to paired methods, where we are often interested in whether there was an increase (positive sign) or a decrease (negative sign).
Review: From UNIT 1
Related SAS Tutorials
Related SPSS Tutorials
In inference for relationships, so far we have learned inference procedures for both cases C→Q and C→C from the role/type classification table below.
The last case to be considered in this course is case Q→Q, where both the explanatory and response variables are quantitative. (Case Q→C requires statistical methods, such as logistic regression, that go beyond the scope of this course.)
For case Q→Q, we will learn the following tests:
Independent Samples (the only case we cover for Q→Q)
Standard Test(s): the t-test for the slope of the regression line and the t-test for Pearson's correlation coefficient
Non-Parametric Test(s): Spearman's rank correlation
In the Exploratory Data Analysis section, we examined the relationship between sample values for two quantitative variables by looking at a scatterplot; if the relationship was linear, we supplemented the scatterplot with the correlation coefficient r and the linear regression equation. We discussed the regression equation, but made no attempt to claim that the relationship observed in the sample necessarily held for the larger population from which the sample originated.
Now that we have a better understanding of the process of statistical inference, we will discuss a few methods for inferring something about the relationship between two quantitative variables in an entire population, based on the relationship seen in the sample.
In particular, we will focus on linear relationships and will answer the following questions:
If we satisfy the assumptions and conditions to use the methods, we can estimate the slope and correlation coefficient for our population and conduct hypothesis tests about these parameters.
For the standard tests, the tests for the slope and the correlation coefficient are equivalent; they will always produce the same pvalue and conclusion. This is because they are directly related to each other.
In this section, we can state our null and alternative hypotheses as:
Ho: There is no relationship between the two quantitative variables X and Y.
Ha: There is a relationship between the two quantitative variables X and Y.
What we know from Unit 1:
r = 0 implies no linear relationship between X and Y (note this is our null hypothesis!!)
r > 0 implies a positive linear relationship between X and Y (as X increases, Y also increases)
r < 0 implies a negative linear relationship between X and Y (as X increases, Y decreases)
Now here are the steps for hypothesis testing for Pearson’s Correlation Coefficient:
Step 1: State the hypotheses
If we consider the information above and our null hypothesis,
Ho: There is no relationship between the two quantitative variables X and Y,
then before we can write this using correlation, we must define the population correlation coefficient. In statistics, we use the Greek letter ρ (rho) to denote the population correlation coefficient. Thus, if there is no linear relationship between the two quantitative variables X and Y in our population, this hypothesis is equivalent to
Ho: ρ = 0 (rho = 0).
The alternative hypothesis will be
Ha: ρ ≠ 0 (rho is not equal to zero).
However, one-sided tests are also possible.
Step 2: Obtain data, check conditions, and summarize data
(i) The sample should be random with independent observations (all observations are independent of all other observations).
(ii) The relationship should be reasonably linear which we can check using a scatterplot. Any clearly nonlinear relationship should not be analyzed using this method.
(iii) To conduct this test, both variables should be normally distributed which we can check using histograms and QQplots. Outliers can cause problems.
Although there is an intermediate test statistic, in effect, the value of r itself serves as our test statistic.
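That intermediate test statistic has a known form: t = r√(n − 2)/√(1 − r²), which follows a t distribution with n − 2 degrees of freedom under Ho. The sketch below (simulated data, loosely mimicking a sample of size 38) verifies that this statistic reproduces the p-value the software reports for Pearson's r:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=38)
y = 0.4 * x + rng.normal(size=38)  # simulated data with a weak linear trend

r, p_software = stats.pearsonr(x, y)

# intermediate test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2)
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# two-sided p-value from a t distribution with n - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)
```

The manually computed p-value agrees with the one `pearsonr` reports, which is why we can say r itself effectively serves as the test statistic.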
Step 3: Find the pvalue of the test by using the test statistic as follows
We will rely on software to obtain the pvalue for this test. We have seen this pvalue already when we calculated correlation in Unit 1.
Step 4: Conclusion
As usual, we use the magnitude of the pvalue to draw our conclusions. A small pvalue indicates that the evidence provided by the data is strong enough to reject Ho and conclude (beyond a reasonable doubt) that the two variables are related (ρ ≠ 0). In particular, if a significance level of 0.05 is used, we will reject Ho if the pvalue is less than 0.05.
Confidence intervals can be obtained to estimate the true population correlation coefficient, ρ (rho), however, we will not compute these intervals in this course. You could be asked to interpret or use a confidence interval which has been provided to you.
We will look at one nonparametric test in case Q→Q. Spearman’s rank correlation uses the same calculations as for Pearson’s correlation coefficient except that it uses the ranks instead of the original data. This test is useful when there are outliers or when the variables do not appear to be normally distributed.
This measure behaves similarly to r in that it takes values between -1 and 1, with values near 0 indicating no association, positive values indicating an increasing relationship, and negative values indicating a decreasing relationship.
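The defining feature, that Spearman's correlation is simply Pearson's correlation computed on the ranks, can be checked directly. This sketch uses simulated data with an injected (hypothetical) outlier:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=30)
y = 2 * x + rng.normal(0, 2, size=30)
y[0] = 80.0  # inject a hypothetical outlier

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)

# Spearman's rank correlation is Pearson's r computed on the ranks
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
```

Because ranks cap the influence of extreme values, the Spearman version is far less affected by the outlier than Pearson's r.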
Now an example:
A method for predicting IQ as soon as possible after birth could be important for early intervention in cases such as brain abnormalities or learning disabilities. It has been thought that greater infant vocalization (for instance, more crying) is associated with higher IQ. In 1964, a study was undertaken to see if IQ at 3 years of age is associated with amount of crying at newborn age. In the study, 38 newborns were made to cry after being tapped on the foot and the number of distinct cry vocalizations within 20 seconds was counted. The subjects were followed up at 3 years of age and their IQs were measured.
Data: SPSS format, SAS format, Excel format
Response Variable:
Explanatory Variable:
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no relationship between newborn cry count and IQ at three years of age
Ha: There is a relationship between newborn cry count and IQ at three years of age
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the pvalue
(i) To the best of our knowledge the subjects are independent.
(ii) The scatterplot shows a relationship that is reasonably linear although not very strong.
(iii) The histograms and QQplots for both variables are slightly skewed right. We would prefer more symmetric distributions; however, the skewness is not extreme so we will proceed with caution.
Pearson’s correlation coefficient is 0.402 with a pvalue of 0.012.
Spearman’s rank correlation is 0.354 with a pvalue of 0.029.
Step 4: Conclusion
Based upon the scatterplot and correlation results, there is a statistically significant, but somewhat weak, positive correlation between newborn cry count and IQ at age 3.
In Unit 1, we discussed the least squares method for estimating the regression line and used software to obtain the slope and intercept of the linear regression equation. These estimates can be considered as the sample statistics which estimate the true population slope and intercept.
Now we will formalize simple linear regression which will require some additional notation.
A regression model expresses two essential ingredients:
Regression is a vast subject which handles a wide variety of possible relationships.
All regression methods begin with a theoretical model which specifies the form of the relationship and includes any needed assumptions or conditions. Now we will introduce a more “statistical” definition of the regression model and define the parameters in the population.
We will use a different notation here than in the beginning of the semester. Now we use regression model style notation.
We assume the relationship in the population is linear, and therefore our regression model can be written as:
Y = β_0 + β_1 X + ε
where β_0 (beta_0) is the population intercept, β_1 (beta_1) is the population slope, and ε (epsilon) is the random error term.
The following picture illustrates the components of this model.
Each orange dot represents an individual observation in the scatterplot. Each observed value is modeled using the previous equation.
The red line is the true linear regression line. The blue dot represents the predicted value for a particular X value and illustrates that our predicted value only estimates the mean, average, or expected value of Y at that X value.
The error for an individual is expected and is due to the variation in our data. In the previous illustration, it is labeled with ε_{i} (epsilon_i) and denoted by a bracket which gives the distance between the orange dot for the observed value and the blue dot for the predicted value for a particular value of X. In practice, we cannot observe the true error for an individual but we will be able to estimate them using the residuals, which we will soon define mathematically.
The regression line represents the average Y for a given X, and can be expressed in symbols as the expected value of Y for a given X, E(Y|X), which we estimate by Y-hat.
In Unit 1, we used a to represent the intercept and b to represent the slope that we estimated from our data.
In formal regression procedures, we commonly use beta to represent the population parameter and betahat to represent the parameter estimate.
These parameter estimates, which are sample statistics estimated from our data, are also sometimes referred to as the coefficients using algebra terminology.
For each observation in our dataset, we also have a residual which is defined as the difference between the observed value and the predicted value for that observation.
The residuals are used to check our assumptions of normality and constant variance.
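These definitions can be sketched numerically. The code below (simulated data, assumed variable names) fits the least squares line, forms the residuals as observed minus predicted, and checks a standard property of least squares: the residuals sum to zero (up to rounding):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=40)
y = 3.0 + 1.5 * x + rng.normal(0, 2.0, size=40)  # simulated linear data

# least squares estimates: betahat_1 (slope) and betahat_0 (intercept)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x    # predicted values
residuals = y - y_hat  # observed minus predicted, one per observation
```

Histograms and scatterplots of `residuals` are exactly what the normality and constant-variance checks below are based on.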
In effect, since we have a quantitative response variable, we are still comparing population means. However, now we must do so for EVERY possible value of X. We want to know if the distribution of Y is the same or different over our range of X values.
This idea is illustrated (including our assumption of normality) in the following picture which shows a case where the distribution of Y is changing as the values of the explanatory variable X change. This change is reflected by only a shift in means since we assume normality and constant variation of Y for all X.
The method used is mathematically equivalent to ANOVA but our interpretations are different due to the quantitative nature of our explanatory variable.
This image shows a scatterplot and regression line on the XY plane – as if flat on a table. Then standing up – in the vertical axis – we draw normal curves centered at the regression line for four different Xvalues – with X increasing for each.
The center of the distributions of the normal distributions which are displayed shows an increase in the mean but constant variation.
The idea is that the model assumes a normal distribution is a good approximation for how the Yvalues will vary around the regression line for a particular value of X.
There is one additional measure which is often of interest in linear regression: the coefficient of determination, R^{2}, which, for simple linear regression, is simply the square of the correlation coefficient, r.
The value of R^{2} is interpreted as the proportion of variation in our response variable Y, which can be explained by the linear regression model using our explanatory variable X.
Important Properties of R^{2}
A large R^{2} may or MAY NOT mean that the model fits our data well.
The image below illustrates data with a fairly large R^{2} yet the model does not fit the data well.
A small R^{2} may or MAY NOT mean that there is no relationship between X and Y – we must be careful as the relationship that exists may simply not be specified in our model – currently a simple linear model.
The image below illustrates data with a very small R^{2} yet the true relationship is very strong.
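The two views of R^{2}, as the squared correlation and as the proportion of variation explained (1 − SSE/SST), agree exactly in simple linear regression. A short sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=40)
y = 3.0 + 1.5 * x + rng.normal(0, 2.0, size=40)  # simulated linear data

r = np.corrcoef(x, y)[0, 1]

# fit the least squares line and compute residuals
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# R^2 two ways: squared correlation, and 1 - SSE/SST
r_squared = r**2
r_squared_alt = 1 - (resid**2).sum() / ((y - y.mean()) ** 2).sum()
```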
Now we move into our formal test procedure for simple linear regression.
Step 1: State the hypotheses
In order to test the hypothesis that
Ho: There is no relationship between the two quantitative variables X and Y,
assuming our model is correct (a linear model is sufficient), we can write the above hypothesis as
Ho: β_{1} = 0 (Beta_1 = 0, the slope of our linear equation = 0 in the population).
The alternative hypothesis will be
Ha: β_{1 }≠ 0 (Beta_1 is not equal to zero).
Step 2: Obtain data, check conditions, and summarize data
(i) The sample should be random with independent observations (all observations are independent of all other observations).
(ii) The relationship should be linear which we can check using a scatterplot.
(iii) The residuals should be reasonably normally distributed with constant variance which we can check using the methods discussed below.
Normality: Histogram and QQplot of the residuals.
Constant Variance: Scatterplot of Y vs. X and/or a scatterplot of the residuals vs. the predicted values (Yhat). We would like to see random scatter with no pattern and approximately the same spread for all values of X.
Large outliers which fall outside the pattern of the data can cause problems and exert undue influence on our estimates. We saw in Unit 1 that one observation which is far away on the x-axis can have a large impact on the values of the correlation and slope.
Here are two examples each using the two plots mentioned above.
Example 1: Has constant variance (homoscedasticity)
Scatterplot of Y vs. X (above)
Scatterplot of residuals vs. predicted values (above)
Example 2: Does not have constant variance (heteroscedasticity)
Scatterplot of Y vs. X (above)
Scatterplot of residuals vs. predicted values (above)
The test statistic is similar to those we have studied for other t-tests:
t = betahat_1 / SE(betahat_1)
where betahat_1 is the estimated slope and SE(betahat_1) is the standard error of that estimate.
Both of these values, along with the test statistic, are provided in the output from the software.
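The slope, its standard error, and the resulting p-value can be pulled from standard routines and reassembled by hand. This sketch (simulated data, illustrative names) confirms that t = betahat_1 / SE(betahat_1) with n − 2 degrees of freedom reproduces the software's reported p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=38)
y = 60.0 + 1.5 * x + rng.normal(0, 10.0, size=38)  # simulated linear data

res = stats.linregress(x, y)  # provides slope, stderr, pvalue, ...

# t = betahat_1 / SE(betahat_1), with n - 2 degrees of freedom
n = len(x)
t = res.slope / res.stderr
p = 2 * stats.t.sf(abs(t), df=n - 2)
```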
Step 3: Find the pvalue of the test by using the test statistic as follows
Under the null hypothesis, the test statistic follows a t-distribution with n − 2 degrees of freedom. We will rely on software to obtain the p-value for this test.
Step 4: Conclusion
As usual, we use the magnitude of the p-value to draw our conclusions. A small p-value indicates that the evidence provided by the data is strong enough to reject Ho, and we would conclude there is enough evidence that the slope in the population is not zero and therefore that the two variables are related. In particular, if a significance level of 0.05 is used, we will reject Ho if the p-value is less than 0.05.
Confidence intervals will also be obtained in the software to estimate the true population slope, β_{1} (beta_1).
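The interval the software reports has the familiar "estimate plus or minus t* standard errors" form. A sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=38)
y = 60.0 + 1.5 * x + rng.normal(0, 10.0, size=38)  # simulated linear data

res = stats.linregress(x, y)
n = len(x)

# 95% CI for the slope: betahat_1 +/- t* SE(betahat_1), df = n - 2
t_crit = stats.t.ppf(0.975, df=n - 2)
lower = res.slope - t_crit * res.stderr
upper = res.slope + t_crit * res.stderr
```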
A method for predicting IQ as soon as possible after birth could be important for early intervention in cases such as brain abnormalities or learning disabilities. It has been thought that greater infant vocalization (for instance, more crying) is associated with higher IQ. In 1964, a study was undertaken to see if IQ at 3 years of age is associated with amount of crying at newborn age. In the study, 38 newborns were made to cry after being tapped on the foot and the number of distinct cry vocalizations within 20 seconds was counted. The subjects were followed up at 3 years of age and their IQs were measured.
Data: SPSS format, SAS format, Excel format
Response Variable:
Explanatory Variable:
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no (linear) relationship between newborn cry count and IQ at three years of age
Ha: There is a (linear) relationship between newborn cry count and IQ at three years of age
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the pvalue
(i) To the best of our knowledge the subjects are independent.
(ii) The scatterplot shows a relationship that is reasonably linear although not very strong.
(iii) The histogram and QQplot of the residuals are both reasonably normally distributed. The scatterplots of Y vs. X and the residuals vs. the predicted values both show no evidence of nonconstant variance.
The estimated regression equation is
The parameter estimate of the slope is 1.54, which means that for each 1-unit increase in cry count, the average IQ is expected to increase by 1.54 points.
The standard error of the estimate of the slope is 0.584, which gives a test statistic of t = 1.54/0.584 ≈ 2.63, matching the output (which uses unrounded values in the calculation).
The p-value is found to be 0.0124. Notice this is exactly the same as we obtained for this data with our test of Pearson’s correlation coefficient. These two methods are equivalent and will always produce the same conclusion about the statistical significance of the linear relationship between X and Y.
The 95% confidence interval for β_{1} (beta_1) given in the output is (0.353, 2.720).
This regression model has a coefficient of determination of R^{2} = 0.161, which means that 16.1% of the variation in IQ score at age three can be explained by our linear regression model using newborn cry count. This confirms a relatively weak relationship, as we found in our previous example using correlations (Pearson’s correlation coefficient and Spearman’s rank correlation).
Step 4: Conclusion
Conclusion of the test for the slope: Based upon the scatterplot and linear regression analysis, since the relationship is linear and the p-value = 0.0124, there is a statistically significant positive linear relationship between newborn cry count and IQ at age 3.
Interpretation of R-squared: Based upon our R^{2} and scatterplot, the relationship is somewhat weak, with only 16.1% of the variation in IQ score at age three being explained by our linear regression model using newborn cry count.
Interpretation of the slope: For each 1-unit increase in cry count, the population mean IQ is expected to increase by 1.54 points; however, the 95% confidence interval suggests this value could be as low as 0.35 points or as high as 2.72 points.
We return to the data from an earlier activity (Learn By Doing – Correlation and Outliers (Software)). The average gestation period, or time of pregnancy, of an animal is closely related to its longevity, the length of its lifespan. Data on the average gestation period and longevity (in captivity) of 40 different species of animals have been recorded. Here is a summary of the variables in our dataset:
In this case, whether we include the outlier or not, there is a problem of nonconstant variance. You can clearly see that, in general, as longevity increases, the variation of gestation increases.
This data is not a particularly good candidate for simple linear regression analysis (without further modification such as transformations or the use of alternative methods).
Pearson’s correlation coefficient (or Spearman’s rank correlation), may still provide a reasonable measure of the strength of the relationship, which is clearly a positive relationship from the scatterplot and our previous measure of correlation.
Output – Contains scatterplots with linear equations and LOESS curves (running average) for the dataset with and without the outlier. Pay particular attention to the problem with nonconstant variance seen in these scatterplots.
The data used in the analysis provided below contains the monthly premiums, driving experience, and gender for a random sample of drivers.
To analyze this data, we have looked at males and females as two separate groups and estimated the correlation and linear regression equation for each gender. We wish to predict the monthly premium using years of driving experience.
Use this output for additional practice with these concepts. For each gender consider the following:
Related SAS Tutorials
Related SPSS Tutorials
The last procedures we studied (two-sample t, paired t, ANOVA, and their non-parametric alternatives) all involve the relationship between a categorical explanatory variable and a quantitative response variable (case C→Q). In all of these procedures, the result is a comparison of the quantitative response variable (Y) among the groups defined by the categorical explanatory variable (X). The standard tests result in a comparison of the population means of Y within each group defined by X.
Next, we will consider inferences about the relationships between two categorical variables, corresponding to case C→C.
For case C→C, we will learn the following tests:
Independent Samples (Only Emphasis): Standard Tests, Non-Parametric Test
Dependent Samples (Not Discussed): Standard Test
In the Exploratory Data Analysis unit of the course, we summarized the relationship between two categorical variables for a given data set (using a two-way table and conditional percents), without trying to generalize beyond the sample data.
Now we will perform statistical inference for two categorical variables, using the sample data to draw conclusions about whether or not we have evidence that the variables are related in the larger population from which the sample was drawn.
In other words, we would like to assess whether the relationship between X and Y that we observed in the data is due to a real relationship between X and Y in the population, or if it is something that could have happened just by chance due to sampling variability.
Before moving into the statistical tests, let’s look at a few (fake) examples.
Suppose our explanatory variable X has r levels and our response variable Y has c levels. We usually arrange our table with the explanatory variable in the rows and the response variable in the columns.
Suppose we have the following partial (fake) data summarized in a two-way table using X = BMI category (r = 4 levels) and Y = Diabetes Status (c = 3 levels).
No Diabetes  Pre-Diabetes  Diabetes  Total 
Underweight  100  
Normal  400  
Overweight  300  
Obese  200  
Total  700  200  100  1000 
From our study of probability we can determine:
In the test we are going to use, our null hypothesis will be:
Ho: There is no relationship between X and Y.
Which in this case would be:
Ho: There is no relationship between BMI category (X) and diabetes status (Y).
If there were no relationship between X and Y, this would imply that the distribution of diabetes status is the same for each BMI category.
In this case (C→C), the distribution of diabetes status consists of the probability of each diabetes status group and the null hypothesis becomes:
Ho: BMI category (X) and diabetes status (Y) are INDEPENDENT.
Since the probability of “No Diabetes” is 0.7 in the entire dataset, if there were no differences in the distribution of diabetes status between BMI categories, we would obtain the same proportion in each row. Using the row totals we can find the EXPECTED counts as follows.
Notice the formula used below is simply the formula for the mean or expected value of a binomial random variable with n “trials” and probability of “success” p which was μ = E(X) = np where X = number of successes for a sample of size n.
No Diabetes  Pre-Diabetes  Diabetes  Total 
Underweight  100(0.7) = 70  100  
Normal  400(0.7) = 280  400  
Overweight  300(0.7) = 210  300  
Obese  200(0.7) = 140  200  
Total  700  200  100  1000 
Notice that these do indeed add to 700.
Similarly we can determine the EXPECTED counts for the remaining two columns since 20% of our sample were classified as having prediabetes and 10% were classified as having diabetes.
No Diabetes  Pre-Diabetes  Diabetes  Total 
Underweight  70  100(0.2) = 20  100(0.1) = 10  100 
Normal  280  400(0.2) = 80  400(0.1) = 40  400 
Overweight  210  300(0.2) = 60  300(0.1) = 30  300 
Obese  140  200(0.2) = 40  200(0.1) = 20  200 
Total  700  200  100  1000 
What we have created, using only the row totals, column totals, and column percents, is a table of what we would expect to happen if the null hypothesis of no relationship between X and Y were true. Here is the final result.
No Diabetes  Pre-Diabetes  Diabetes  Total 
Underweight  70  20  10  100 
Normal  280  80  40  400 
Overweight  210  60  30  300 
Obese  140  40  20  200 
Total  700  200  100  1000 
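The whole expected-count table can be generated in one step from its margins, since each expected count is (row total × column total) / grand total. A quick sketch (in Python, not the course software):

```python
# Expected counts under independence: E = (row total x column total) / grand total.
# Reproduces the BMI/diabetes expected-count table from its margins alone.
import numpy as np

row_totals = np.array([100, 400, 300, 200])  # Underweight, Normal, Overweight, Obese
col_totals = np.array([700, 200, 100])       # No Diabetes, Pre-Diabetes, Diabetes
n = row_totals.sum()                         # 1000

expected = np.outer(row_totals, col_totals) / n
print(expected)   # first row: 70, 20, 10 -- matching the table above
```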
Suppose we gather data and find the following (expected counts are in parentheses for easy comparison):
No Diabetes  Pre-Diabetes  Diabetes  Total 
Underweight  65 (70)  22 (20)  13 (10)  100 
Normal  285 (280)  78 (80)  37 (40)  400 
Overweight  216 (210)  53 (60)  31 (30)  300 
Obese  134 (140)  47 (40)  19 (20)  200 
Total  700  200  100  1000 
If we compare our counts to the expected counts they are fairly close. This data would not give much evidence of a difference in the distribution of diabetes status among the levels of BMI categories. In other words, this data would not give much evidence of a relationship (or association) between BMI categories and diabetes status.
The standard test we will learn in case C→C is based upon comparing the OBSERVED cell counts (our data) to the EXPECTED cell counts (using the method discussed above).
We want you to see how the expected cell counts are created so that you will understand what kind of evidence is being used to reject the null hypothesis in case C→C.
Suppose instead that we gather data and we obtain the following counts (expected counts are in parentheses and row percentages are provided):
No Diabetes  Pre-Diabetes  Diabetes  Total 
Underweight  90 (70) 90%  7 (20) 7%  3 (10) 3%  100 
Normal  340 (280) 85%  40 (80) 10%  20 (40) 5%  400 
Overweight  180 (210) 60%  90 (60) 30%  30 (30) 10%  300 
Obese  90 (140) 45%  63 (40) 31.5%  47 (20) 23.5%  200 
Total  700  200  100  1000 
In this case, most of the differences are drastic and there seems to be clear evidence that the distribution of diabetes status is not the same among the four BMI categories.
Although this data is entirely fabricated, it illustrates the kind of evidence we need to reject the null hypothesis in case C→C.
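The chi-square test for independence (covered later in this section) turns this visual comparison into a p-value. As an illustrative sketch (Python with scipy, which computes the expected counts from the margins itself), the first fabricated table, which is close to its expected counts, yields a large p-value, while the second, drastically different table yields a tiny one:

```python
# Chi-square test for independence on the two fabricated tables above.
import numpy as np
from scipy.stats import chi2_contingency

close_to_expected = np.array([[ 65,  22, 13],
                              [285,  78, 37],
                              [216,  53, 31],
                              [134,  47, 19]])
far_from_expected = np.array([[ 90,   7,  3],
                              [340,  40, 20],
                              [180,  90, 30],
                              [ 90,  63, 47]])

chi2_a, p_a, df_a, _ = chi2_contingency(close_to_expected)
chi2_b, p_b, df_b, _ = chi2_contingency(far_from_expected)

print(df_a)   # (4-1)(3-1) = 6 degrees of freedom
print(p_a)    # large p-value: little evidence against Ho
print(p_b)    # tiny p-value: strong evidence against Ho
```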
One special case occurs when we have two categorical variables where both variables have two levels. Two-level categorical variables are often called binary variables or dichotomous variables and, when possible, are usually coded as 1 for “Yes” or “Success” and 0 for “No” or “Failure.”
Here is another (fake) example.
Suppose we have the following partial (fake) data summarized in a two-way table using X = treatment and Y = significant improvement in symptoms.
No Improvement  Improvement  Total  
Control  100  
Treatment  100  
Total  120  80  200 
From our study of probability we can determine:
Since the probability of “No Improvement” is 0.6 in the entire dataset and the probability of “Improvement” is 0.4, if there were no difference between the groups we would obtain the same proportions in each row. Using the row totals we can find the EXPECTED counts as follows.
No Improvement  Improvement  Total  
Control  100(0.6) = 60  100(0.4) = 40  100 
Treatment  100(0.6) = 60  100(0.4) = 40  100 
Total  120  80  200 
Suppose we obtain the following data:
No Improvement  Improvement  Total  
Control  80  20  100 
Treatment  40  60  100 
Total  120  80  200 
In this example we are interested in the probability of improvement and the above data seem to indicate the treatment provides a greater chance for improvement than the control.
We use this example to mention two ways of comparing probability (sometimes “risk”) in 2×2 tables. Many of you may remember these topics from Epidemiology or may see these topics again in Epidemiology courses in the future!
Risk Difference:
For this data, a larger proportion of subjects in the treatment group showed improvement compared to the control group. In fact, the estimated probability of improvement is 0.4 higher for the treatment group than the control group.
This value (0.4) is called a risk difference and is one common measure in 2×2 tables. Estimates and confidence intervals can be obtained.
For a fixed sample size, the larger this difference, the more evidence against our null hypothesis (no relationship between X and Y).
The population risk difference is often denoted p_{1} – p_{2}, and is the difference between two population proportions. We estimate these proportions in the same manner as in Unit 1, once for each sample.
For the current example, we obtain an estimated probability of improvement of 60/100 = 0.6 for the treatment group and 20/100 = 0.2 for the control group, from which we find the risk difference 0.6 – 0.2 = 0.4.
Odds Ratio:
Another common measure in 2×2 tables is the odds ratio, which is defined as the odds of the event occurring in one group divided by the odds of the event occurring in another group.
In this case, the odds of improvement in the treatment group is 60/40 = 1.5, and the odds of improvement in the control group is 20/80 = 0.25, so the odds ratio comparing the treatment group to the control group is 1.5/0.25 = 6.
This value means that the odds of improvement are 6 times higher in the treatment group than in the control group.
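Both measures can be computed directly from the four cell counts. Here is a sketch of the arithmetic for the table above (plain Python, for illustration only):

```python
# Risk difference and odds ratio for the 2x2 treatment/improvement table.
improvement = {"treatment": 60, "control": 20}
no_improvement = {"treatment": 40, "control": 80}

n_t = improvement["treatment"] + no_improvement["treatment"]   # 100
n_c = improvement["control"] + no_improvement["control"]       # 100

p_t = improvement["treatment"] / n_t   # 0.6 = estimated P(improve | treatment)
p_c = improvement["control"] / n_c     # 0.2 = estimated P(improve | control)
risk_difference = p_t - p_c            # 0.4

odds_t = p_t / (1 - p_t)               # 1.5
odds_c = p_c / (1 - p_c)               # 0.25
odds_ratio = odds_t / odds_c           # 6.0

print(risk_difference, odds_ratio)
```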
Properties of Odds Ratios:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no relationship between the two categorical variables. (They are independent.)
Ha: There is a relationship between the two categorical variables. (They are not independent.)
Note: for 2×2 tables, these hypotheses can be formulated in terms of two population proportions (Ho: p_{1} = p_{2}), analogous to how we stated hypotheses for population means. This can be done for RxC tables as well, but it is not common since it requires more notation to compare multiple group proportions.
Step 2: Obtain data, check conditions, and summarize data
(i) The sample should be random with independent observations (all observations are independent of all other observations).
(ii) In general, the larger the sample, the more precise and reliable the test results are. There are different versions of what the conditions are that will ensure reliable use of the test, all of which involve the expected counts. One version of the conditions says that all expected counts need to be greater than 1, and at least 80% of expected counts need to be greater than 5. A more conservative version requires that all expected counts are larger than 5. Some software packages will provide a warning if the sample size is “too small.”
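The sample-size condition can be checked mechanically from the expected-count table. A sketch of the first (less conservative) version of the rule, in Python:

```python
# Check the chi-square sample-size conditions on a table of expected counts.
import numpy as np

def checks_pass(expected):
    """One common version of the rule: all expected counts must exceed 1,
    and at least 80% of expected counts must exceed 5."""
    expected = np.asarray(expected, dtype=float)
    all_above_one = (expected > 1).all()
    share_above_five = (expected > 5).mean()
    return bool(all_above_one and share_above_five >= 0.80)

# Expected counts from the BMI/diabetes example above:
expected = [[70, 20, 10], [280, 80, 40], [210, 60, 30], [140, 40, 20]]
print(checks_pass(expected))   # True: every expected count exceeds 5
```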
Test Statistic of the Chi-square Test for Independence:
The single number that summarizes the overall difference between observed and expected counts is the chisquare statistic, which tells us in a standardized way how far what we observed (data) is from what would be expected if Ho were true.
Here it is:
χ^{2} = Σ (Observed count – Expected count)^{2} / (Expected count)
where the sum is taken over all cells in the table.
Step 3: Find the p-value of the test by using the test statistic as follows
We will rely on software to obtain this value for us. We can also request the expected counts using software.
The p-value is calculated using a chi-square distribution with (r – 1)(c – 1) degrees of freedom (where r = number of levels of the row variable and c = number of levels of the column variable). We will rely on software to obtain the p-value for this test.
Step 4: Conclusion
As usual, we use the magnitude of the p-value to draw our conclusions. A small p-value indicates that the evidence provided by the data is strong enough to reject Ho and conclude (beyond a reasonable doubt) that the two variables are related. In particular, if a significance level of 0.05 is used, we will reject Ho if the p-value is less than 0.05.
We will look at one non-parametric test in case C→C. Fisher’s exact test is an exact method of obtaining a p-value for the hypotheses tested in a standard chi-square test for independence. This test is often used when the sample size requirement of the chi-square test is not satisfied, and it can be used for both 2×2 and RxC tables.
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no relationship between the two categorical variables. (They are independent.)
Ha: There is a relationship between the two categorical variables. (They are not independent, they are dependent.)
Step 2: Obtain data, check conditions, and summarize data
The sample should be random with independent observations (all observations are independent of all other observations).
Step 3: Find the p-value of the test as follows
The p-value is calculated using a distribution specific to this test, and we will rely on software to obtain it. The p-value measures the chance of obtaining a table as extreme as, or more extreme than, our table (in the direction against the null hypothesis).
Step 4: Conclusion
As usual, we use the magnitude of the p-value to draw our conclusions. A small p-value indicates that the evidence provided by the data is strong enough to reject Ho and conclude (beyond a reasonable doubt) that the two variables are related. In particular, if a significance level of 0.05 is used, we will reject Ho if the p-value is less than 0.05.
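As an illustration (Python with scipy, not the SAS/SPSS output used in the course), Fisher's exact test applied to the earlier treatment/improvement table:

```python
# Fisher's exact test on the earlier 2x2 treatment/improvement table.
# Appropriate when expected counts are too small for the chi-square test;
# here the counts are large, so the table serves purely as an illustration.
from scipy.stats import fisher_exact

table = [[80, 20],    # Control:   no improvement, improvement
         [40, 60]]    # Treatment: no improvement, improvement

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio)   # sample odds ratio (80*60)/(20*40) = 6.0
print(p_value)      # very small p-value: reject Ho
```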
Now let’s look at some examples with real data.
Low birth weight is an outcome of concern due to the fact that infant mortality rates and birth defect rates are very high for babies with low birth weight. A woman’s behavior during pregnancy (including diet, smoking habits, and obtaining prenatal care) can greatly alter her chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.
In this example, we will use a 1986 study (Hosmer and Lemeshow (2000), Applied Logistic Regression: Second Edition) in which data were collected from 189 women (of whom 59 had low birth weight infants) at the Baystate Medical Center in Springfield, MA. The goal of the study was to identify risk factors associated with giving birth to a low birth weight baby.
Data: SPSS format, SAS format, Excel format
Response Variable:
Possible Explanatory Variables (variables we will use in this example are in bold):
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: There is no relationship between the categorical explanatory variable and presence of low birth weight. (They are independent.)
Ha: There is a relationship between the categorical explanatory variable and presence of low birth weight. (They are not independent, they are dependent.)
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the p-value
Explanatory Variable  Which Test is Appropriate?  P-value  Decision 
RACE  Min. expected count = 8.12; 3×2 table; use Pearson chi-square (since RxC)  0.0819 (chi-square, SAS); 0.082 (chi-square, SPSS)  Fail to Reject Ho 
SMOKE  Min. expected count = 23.1; 2×2 table; use continuity correction (since 2×2)  0.040 (continuity correction, SPSS); 0.0396 (continuity adj., SAS)  Reject Ho 
PTL  Min. expected count = 0.31; 4×2 table; Fisher’s exact test is more appropriate  3.106E-04 = 0.0003106 (Fisher’s, SAS); 0.000 (Fisher’s, SPSS); 0.0008 (chi-square, SAS); 0.001 (chi-square, SPSS)  Reject Ho 
HT  Min. expected count = 3.75; 2×2 table; Fisher’s exact test may be more appropriate  0.0516 (Fisher’s, SAS); 0.052 (Fisher’s, SPSS)  Fail to Reject Ho (barely) 
UI  Min. expected count = 8.74; 2×2 table; use continuity correction  0.0355 (continuity adj., SAS); 0.035 (continuity correction, SPSS)  Reject Ho 
Step 4: Conclusion
When considered individually, presence of uterine irritability, history of premature labor, and smoking during pregnancy are all significantly associated (p-value < 0.05) with the presence/absence of a low birth weight infant, whereas history of hypertension and race were only marginally significant (0.05 ≤ p-value < 0.10).
Practical Significance:
Explanatory Variable  Comparison of Conditional Percentages of Low Birth Weight 
RACE  Race = White: 23.96% Race = Black: 42.31% Race = Other: 37.31% 
SMOKE  Smoke = No: 25.22% Smoke = Yes: 40.54% 
PTL  History of Premature Labor = 0: 25.79% History of Premature Labor = 1: 66.67% History of Premature Labor = 2: 40.00% (Note small sample size of 5 for this row) History of Premature Labor = 3: 0.00% (Note small sample size of 1 for this row) 
HT  Hypertension = No: 29.38% Hypertension = Yes: 58.33% (Note small sample size of 12 for this row) 
UI  Presence of uterine irritability = No: 27.95% Presence of uterine irritability = Yes: 50.00% 
If, instead of simply analyzing the “looks vs. personality” rating scale, we categorized the responses into groups, then we would be in case C→C instead of case C→Q (see the previous example in Case C→Q for Two Independent Samples).
Recall the rating score was from 1 to 25, with 1 = personality most important (looks not important at all) and 25 = looks most important (personality not important at all). A score of 13 would indicate that looks and personality are equally important, and scores near 13 indicate they are nearly equal in importance.
For our purposes we will use a rating of 16 or larger to indicate that looks were indeed more important than personality (by enough to matter).
Data: SPSS format, SAS format
Response Variable:
Results:
Step 1: State the hypotheses
The hypotheses are:
Ho: The proportion of college students who find looks more important than personality is the same for males and females. (The two variables are independent)
Ha: The proportion of college students who find looks more important than personality is different for males and females. (The two variables are dependent)
Steps 2 & 3: Obtain data, check conditions, summarize data, and find the p-value
The minimum expected cell count is 13.38. This is a 2×2 table, so we will use the continuity-corrected chi-square statistic.
The p-value is found to be 0.001 (SPSS) or 0.0007 (SAS).
Step 4: Conclusion
There is a significant association between gender and whether or not the individual rated looks more important than personality.
Among males, 27.1% rated looks higher than personality while among females this value was only 9.3%.
For fun: The odds ratio here is (0.271/0.729) / (0.093/0.907) ≈ 3.6, which means, based upon our data, we estimate that the odds of rating looks more important than personality are 3.6 times higher among males than among females.
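The reported odds ratio can be reconstructed from the two conditional percentages alone. A quick check (plain Python, for illustration):

```python
# Reconstructing the reported odds ratio from the conditional percentages:
# 27.1% of males vs. 9.3% of females rated looks more important than personality.
p_male = 0.271
p_female = 0.093

odds_male = p_male / (1 - p_male)         # about 0.372
odds_female = p_female / (1 - p_female)   # about 0.103
odds_ratio = odds_male / odds_female

print(round(odds_ratio, 1))   # -> 3.6
```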
Practical Significance:
It seems clear that the difference between 27.1% and 9.3% is practically significant as well as statistically significant. This difference is large and likely represents a meaningful difference in the views of males and females regarding the importance of looks compared to personality.