As we mentioned at the
end of the Introduction to Unit 4B,
we will focus only on two-sided tests for the remainder of this course. One-sided tests are often possible but rarely used in clinical research.
CO4: Distinguish among different measurement scales, choose the appropriate descriptive and inferential statistical methods based on these distinctions, and interpret the results
LO 4.35: For a data analysis situation involving two variables, choose the appropriate inferential method for examining the relationship between the variables and justify the choice.
LO 4.36: For a data analysis situation involving two variables, carry out the appropriate inferential method for examining relationships between the variables and draw the correct conclusions in context.
CO5: Determine preferred methodological alternatives to commonly used statistical methods when assumptions are not met.
Related SAS Tutorials
Related SPSS Tutorials
Introduction
Here is a summary of the tests we will learn for the scenario where k = 2. Methods in BOLD will be our main focus.
We have completed our discussion on dependent samples (2nd column) and now we move on to independent samples (1st column).
Independent Samples (More Emphasis)

Standard Tests
 Two-Sample T-Test Assuming Equal Variances
 Two-Sample T-Test Assuming Unequal Variances
Non-Parametric Test
 Mann-Whitney U (or Wilcoxon Rank-Sum) Test

Dependent Samples (Less Emphasis)

Standard Test
 Paired T-Test (analyze the differences using one-sample techniques)
Non-Parametric Tests
 Sign Test
 Wilcoxon Signed-Rank Test
Dependent vs. Independent Samples
LO 4.37: Identify and distinguish between independent and dependent samples.
We have discussed the dependent sample case where observations are matched/paired/linked between the two samples. Recall that in that scenario observations can be the same individual or two individuals who are matched between samples. To analyze data from dependent samples, we simply took the differences and analyzed them using one-sample techniques.
Now we will discuss the independent sample case. In this case, all individuals are independent of all other individuals in their sample as well as all individuals in the other sample. This is most often accomplished by either:
 Taking a random sample from each of the two groups under study. For example, to compare heights of males and females, we could take a random sample of 100 females and another random sample of 100 males. The result would be two samples which are independent of each other.
 Taking a random sample from the entire population and then dividing it into two subsamples based upon the grouping variable of interest. For example, we take a random sample of U.S. adults and then split them into two samples based upon gender. This results in a subsample of females and a subsample of males which are independent of each other.
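As a sketch of the second design, here is how a single random sample might be split into two independent subsamples in Python. The data are simulated and the variable names are illustrative only:

```python
import random

random.seed(0)

# Simulated random sample from the whole population: (height_cm, gender)
# pairs. The heights and labels are made up purely for illustration.
sample = [(random.gauss(170, 10), random.choice(["Male", "Female"]))
          for _ in range(200)]

# Split into two subsamples by the grouping variable of interest.
males = [height for height, gender in sample if gender == "Male"]
females = [height for height, gender in sample if gender == "Female"]

# Every observation lands in exactly one subsample, and the two
# subsamples are independent of each other.
```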
Comparing Two Means – Two Independent Samples T-test
LO 4.38: In a given context, determine the appropriate standard method for comparing groups and provide the correct conclusions given the appropriate software output.
LO 4.39: In a given context, set up the appropriate null and alternative hypotheses for comparing groups.
Recall that here we are interested in the effect of a two-valued (k = 2) categorical variable (X) on a quantitative response (Y). Random samples from the two subpopulations (defined by the two categories of X) are obtained and we need to evaluate whether or not the data provide enough evidence for us to believe that the two subpopulation means are different.
In other words, our goal is to test whether the means μ_{1} and μ_{2} (which are the means of the variable of interest in the two subpopulations) are equal or not, and in order to do that we have two samples, one from each subpopulation, which were chosen independently of each other.
The test that we will learn here is commonly known as the two-sample t-test. As the name suggests, this is a t-test, which as we know means that the p-values for this test are calculated under some t-distribution.
Here are figures that illustrate some of the examples we will cover. Notice how the original variables X (categorical variable with two levels) and Y (quantitative variable) are represented. Think about the fact that we are in case C → Q!
As in our discussion of dependent samples, we will often simplify our terminology and simply use the terms “population 1” and “population 2” instead of referring to these as subpopulations. Either terminology is fine.
Many Students Wonder: Two Independent Samples
Question: Does it matter which population we label as population 1 and which as population 2?
Answer: No, it does not matter as long as you are consistent, meaning that you do not switch labels in the middle.
 BUT… considering how you label the populations is important in stating the hypotheses and in the interpretation of the results.
Steps for the Two-Sample T-test
Recall that our goal is to compare the means μ_{1} and μ_{2} based on the two independent samples.
 Step 1: State the hypotheses
The hypotheses represent our goal to compare μ_{1} and μ_{2}.
The null hypothesis is always:
Ho: μ_{1} – μ_{2} = 0 (which is the same as μ_{1} = μ_{2})
(There IS NO association between the categorical explanatory variable and the quantitative response variable)
We will focus on the two-sided alternative hypothesis of the form:
Ha: μ_{1} – μ_{2} ≠ 0 (which is the same as μ_{1} ≠ μ_{2}) (two-sided)
(There IS AN association between the categorical explanatory variable and the quantitative response variable)
Note that the null hypothesis claims that there is no difference between the means. Conceptually, Ho claims that there is no relationship between the two relevant variables (X and Y).
Our parameter of interest in this case (the parameter about which we are making an inference) is the difference between the means (μ_{1} – μ_{2}) and the null value is 0. The alternative hypothesis claims that there is a difference between the means.
 Step 2: Obtain data, check conditions, and summarize data
The two-sample t-test can be safely used as long as the following conditions are met:
The two samples are indeed independent.
We are in one of the following two scenarios:
(i) Both populations are normal, or more specifically, the distribution of the response Y in both populations is normal, and both samples are random (or at least can be considered as such). In practice, checking normality in the populations is done by looking at each of the samples using a histogram and checking whether there are any signs that the populations are not normal. Such signs could be extreme skewness and/or extreme outliers.
(ii) The populations are known or discovered not to be normal, but the sample size of each of the random samples is large enough (we can use the rule of thumb that a sample size greater than 30 is considered large enough).
Assuming that we can safely use the two-sample t-test, we need to summarize the data, and in particular, calculate our data summary, the test statistic.
Test Statistic for Two-Sample T-test:
There are two choices for our test statistic, and we must choose the appropriate one to summarize our data. We will see how to choose between the two test statistics in the next section. The two options are as follows:
We use the following notation to describe our samples: n_{1}, x̄_{1}, and s_{1} are the sample size, sample mean, and sample standard deviation in sample 1, and n_{2}, x̄_{2}, and s_{2} are the corresponding quantities in sample 2.
Here are the two cases for our test statistic.
(A) Equal Variances: If it is safe to assume that the two populations have equal standard deviations, we can pool our estimates of this common population standard deviation and use the following test statistic:
t = (x̄_{1} – x̄_{2}) / ( s_{p} · √(1/n_{1} + 1/n_{2}) )
where
s_{p} = √( ((n_{1} – 1)s_{1}² + (n_{2} – 1)s_{2}²) / (n_{1} + n_{2} – 2) )
(B) Unequal Variances: If it is NOT safe to assume that the two populations have equal standard deviations, we must use the following test statistic:
t = (x̄_{1} – x̄_{2}) / √( s_{1}²/n_{1} + s_{2}²/n_{2} )
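Although we will rely on software for the calculation, the two test statistics described above can be computed from the six summary statistics. The function name and the example numbers below are illustrative only, not from the course data:

```python
from math import sqrt

def two_sample_t(n1, xbar1, s1, n2, xbar2, s2, equal_var=True):
    """Two-sample t statistic from summary statistics (sketch)."""
    if equal_var:
        # (A) Pool the two sample variances into one estimate s_p
        sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
        return (xbar1 - xbar2) / (sp * sqrt(1 / n1 + 1 / n2))
    # (B) Unequal variances: each sample keeps its own variance
    return (xbar1 - xbar2) / sqrt(s1**2 / n1 + s2**2 / n2)

# Illustrative numbers (not from the course data):
t_equal = two_sample_t(10, 5.0, 2.0, 10, 3.0, 2.0, equal_var=True)
```

When the two sample sizes and sample standard deviations happen to be equal, the two versions give the same value; in general they differ.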
Comments:
 Some analysts never assume equal variances; however, when the assumption of equal variances is satisfied, the equal-variances t-test has greater power to detect the difference of interest.
 We will not be calculating the values of these test statistics by hand in this course. We will instead rely on software to obtain the value for us.
 Both of these test statistics measure (in standard errors) how far our data are (represented by the difference of the sample means) from the null hypothesis (represented by the null value, 0).
 These test statistics have the same general form as others we have discussed. We will not discuss the derivation of the standard errors in each case but you should understand this general form and be able to identify each component for a specific test statistic.
 Step 3: Find the p-value of the test by using the test statistic as follows
Each of these tests relies on a particular t-distribution under which the p-values are calculated. In the case where equal variances are assumed, the degrees of freedom are simply
df = n_{1} + n_{2} – 2
whereas in the case of unequal variances, the formula for the degrees of freedom is more complex. We will rely on the software to obtain the degrees of freedom in both cases and to provide us with the correct p-value (usually this will be a two-sided p-value).
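As a sketch of what the software does at this step, using scipy's t-distribution: the numbers below are made up, and the unequal-variances degrees of freedom use the standard Welch-Satterthwaite approximation:

```python
from scipy import stats

def welch_df(n1, s1, n2, s2):
    """Welch-Satterthwaite degrees of freedom for the unequal-variances test."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Equal-variances case: df is simply n1 + n2 - 2.
t_stat, n1, n2 = 2.5, 30, 40
df = n1 + n2 - 2
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value
```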
As usual, we draw our conclusion based on the p-value. Be sure to write your conclusions in context by specifying your current variables and/or precisely describing the difference in population means in terms of the current variables.
If the p-value is small, there is a statistically significant difference between what was observed in the sample and what was claimed in Ho, so we reject Ho.
Conclusion: There is enough evidence that the categorical explanatory variable is related to (or associated with) the quantitative response variable. More specifically, there is enough evidence that the difference in population means is not equal to zero.
If the p-value is not small, we do not have enough statistical evidence to reject Ho.
Conclusion: There is NOT enough evidence that the categorical explanatory variable is related to (or associated with) the quantitative response variable. More specifically, there is NOT enough evidence that the difference in population means differs from zero.
In particular, if a cutoff probability, α (significance level), is specified, we reject Ho if the p-value is less than α. Otherwise, we do not reject Ho.
LO 4.41: Based upon the output for a two-sample t-test, correctly interpret in context the appropriate confidence interval for the difference between population means.
As in previous methods, we can follow up with a confidence interval for the difference between population means, μ_{1} – μ_{2}, and interpret this interval in the context of the problem.
Interpretation: We are 95% confident that the population mean for (one group) is between __________________ compared to the population mean for (the other group).
Confidence intervals can also be used to determine whether or not to reject the null hypothesis of the test based upon whether or not the null value of zero falls outside the interval or inside.
If the null value, 0, falls outside the confidence interval, Ho is rejected. (Zero is NOT a plausible value based upon the confidence interval)
If the null value, 0, falls inside the confidence interval, Ho is not rejected. (Zero IS a plausible value based upon the confidence interval)
NOTE: Be careful to choose the correct confidence interval about the difference between population means using the same assumption (variances equal or variances unequal) and not the individual confidence intervals for the means in the groups themselves.
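The interval-based decision rule can be captured in a tiny helper (the function name is hypothetical):

```python
def reject_h0_from_ci(ci_lower, ci_upper):
    """Reject Ho: mu1 - mu2 = 0 exactly when the null value 0
    falls outside the confidence interval."""
    return not (ci_lower <= 0 <= ci_upper)

# An all-negative (or all-positive) interval excludes 0, so Ho is rejected.
```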
Test for Equality of Variances (or Standard Deviations)
LO 4.42: Based upon the output for a two-sample t-test, determine whether to use the results assuming equal variances or those assuming unequal variances.
Since we have two possible tests we can conduct, based upon whether or not we can assume the population standard deviations (or variances) are equal, we need a method to determine which test to use.
Although you can make a reasonable guess using information from the data (i.e., look at the distributions and the estimates of the standard deviations and see whether they seem reasonably equal), we have a test which can help us here, called the test for Equality of Variances. This output is automatically displayed in many software packages when a two-sample t-test is requested, although the particular test used may vary. The hypotheses of this test are:
Ho: σ_{1} = σ_{2} (the standard deviations in the two populations are the same)
Ha: σ_{1} ≠ σ_{2} (the standard deviations in the two populations are not the same)
 If the p-value of this test for equal variances is small, there is enough evidence that the standard deviations in the two populations are different and we cannot assume equal variances.
 IMPORTANT! In this case, when we conduct the two-sample t-test to compare the population means, we use the test statistic for unequal variances.
 If the p-value of this test is large, there is not enough evidence that the standard deviations in the two populations are different. In this case we will assume equal variances since we have no clear evidence to the contrary.
 IMPORTANT! In this case, when we conduct the two-sample t-test to compare the population means, we use the test statistic for equal variances.
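The two-stage procedure (equality-of-variances test first, then the matching t-test) can be sketched with scipy. Levene's test, used here, is the SPSS default; SAS uses a folded-F test by default, which is why the variance-test p-values differ between packages even though the workflow is the same:

```python
from scipy import stats

def compare_means(sample1, sample2, alpha=0.05):
    """Run an equality-of-variances test, then the matching two-sample t-test."""
    _, p_levene = stats.levene(sample1, sample2)
    equal_var = p_levene >= alpha  # large p-value: no evidence variances differ
    t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=equal_var)
    return equal_var, t_stat, p_value

# Illustrative (made-up) data:
eq, t_stat, p_value = compare_means([15, 13, 10, 12, 14], [9, 11, 8, 10, 12])
```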
Now let’s look at a complete example of conducting a two-sample t-test, including the embedded test for equality of variances.
EXAMPLE: What is more important — personality or looks?
This question was asked of a random sample of 239 college students, who were to answer on a scale of 1 to 25. An answer of 1 means personality has maximum importance and looks no importance at all, whereas an answer of 25 means looks have maximum importance and personality no importance at all. The purpose of this survey was to examine whether males and females differ with respect to the importance of looks vs. personality.
Note that the data have the following format:
Score (Y)   Gender (X)
15          Male
13          Female
10          Female
12          Male
14          Female
14          Male
6           Male
17          Male
etc.
The format of the data reminds us that we are essentially examining the relationship between the two-valued categorical variable, gender, and the quantitative response, score. The two values of the categorical explanatory variable (k = 2) define the two populations that we are comparing — males and females. The comparison is with respect to the response variable score. Here is a figure that summarizes the example:
Comments:
 Note that this figure emphasizes how the fact that our explanatory variable is a two-valued categorical variable means that in practice we are comparing two populations (defined by these two values) with respect to our response Y.
 Note that even though the problem description just says that we had 239 students, the figure tells us that there were 85 males in the sample, and 150 females.
 Following up on comment 2, note that 85 + 150 = 235 and not 239. In these data (which are real) there are four “missing observations”: 4 students for whom we do not have the value of the response variable, “importance.” This could be due to a number of reasons, such as recording error or nonresponse. The bottom line is that even though data were collected from 239 students, effectively we have data from only 235. (Recommended: Go through the data file and note that there are 4 cases of missing observations: students 34, 138, 179, and 183.)
Step 1: State the hypotheses
Recall that the purpose of this survey was to examine whether the opinions of females and males differ with respect to the importance of looks vs. personality. The hypotheses in this case are therefore:
Ho: μ_{1} – μ_{2} = 0 (which is the same as μ_{1} = μ_{2})
Ha: μ_{1} – μ_{2} ≠ 0 (which is the same as μ_{1} ≠ μ_{2})
where μ_{1} represents the mean “looks vs personality score” for females and μ_{2} represents the mean “looks vs personality score” for males.
It is important to understand that conceptually, the two hypotheses claim:
Ho: Score (of looks vs. personality) is not related to gender
Ha: Score (of looks vs. personality) is related to gender
Step 2: Obtain data, check conditions, and summarize data
 Data: Looks SPSS format, SAS format, Excel format, CSV format
 Let’s first check whether the conditions that allow us to safely use the two-sample t-test are met.
 Here, 239 students were chosen and were naturally divided into a sample of females and a sample of males. Since the students were chosen at random, the sample of females is independent of the sample of males.
 Here we are in the second scenario — the sample sizes (150 and 85), are definitely large enough, and so we can proceed regardless of whether the populations are normal or not.
 In the output below we first look at the test for equality of variances (outlined in orange). The two-sample t-test results we will use are outlined in blue.
 There are TWO TESTS represented in this output and we must make the correct decision for BOTH of these tests to correctly proceed.
 SOFTWARE OUTPUT In SPSS:
 The p-value for the test of equality of variances is reported as 0.849 in the Sig. column under Levene’s test for equality of variances. (Note that this differs from the p-value found using SAS, because the two programs use different tests by default.)
 So we fail to reject the null hypothesis that the variances, or equivalently the standard deviations, are equal (Ho: σ_{1} = σ_{2}).
 Conclusion to test for equality of variances: We cannot conclude there is a difference in the variance of looks vs. personality score between males and females.
 This results in using the row for Equal variances assumed to find the ttest results including the test statistic, pvalue, and confidence interval for the difference. (Outlined in BLUE)
The output might also be broken up if you export or copy the items in certain ways. The results are the same but it can be more difficult to read.
 SOFTWARE OUTPUT In SAS:
 The p-value for the test of equality of variances is reported as 0.5698 in the Pr > F column under equality of variances. (Note that this differs from the p-value found using SPSS, because the two programs use different tests by default.)
 So we fail to reject the null hypothesis that the variances, or equivalently the standard deviations, are equal (Ho: σ_{1} = σ_{2}).
 Conclusion to test for equality of variances: We cannot conclude there is a difference in the variance of looks vs. personality score between males and females.
 This results in using the row for POOLED method where equal variances are assumed to find the ttest results including the test statistic, pvalue, and confidence interval for the difference. (Outlined in BLUE)
 TEST STATISTIC for Two-Sample T-test: In all of the results above, we determine that we will use the test which assumes the variances are EQUAL, and we find our test statistic of t = –4.58 (the sign depends on which group is listed first).
Step 3: Find the p-value of the test by using the test statistic as follows
 We will let the software find the p-value for us, and in this case, the p-value is less than our significance level of 0.05; in fact, it is practically 0.
 This is found in SPSS in the equal variances assumed row under t-test in the Sig. (2-tailed) column, given as 0.000, and in SAS in the POOLED row under the Pr > |t| column, given as <0.0001.
 A p-value which is practically 0 means that it would be almost impossible to get data like those observed (or even more extreme) had the null hypothesis been true.
 More specifically, in our example, if there were no differences between females and males with respect to whether they value looks vs. personality, it would be almost impossible (probability approximately 0) to get data where the difference between the sample means of females and males is –2.6 (that difference is 10.73 – 13.33 = –2.6) or more extreme.
 Comment: Note that the output tells us that the estimated difference (x̄_{1} – x̄_{2}) is approximately –2.6. But more importantly, we want to know if this difference is statistically significant. To answer this, we use the fact that this difference is 4.58 standard errors below the null value.
Step 4: Conclusion
As usual, a small p-value provides evidence against Ho. In our case our p-value is practically 0 (which is smaller than any level of significance that we will choose). The data therefore provide very strong evidence against Ho, so we reject it.
 Conclusion: There is enough evidence that the mean Importance score (of looks vs personality) of males differs from that of females. In other words, males and females differ with respect to how they value looks vs. personality.
As a follow-up to this conclusion, we can construct a confidence interval for the difference between population means. In this case we will construct a confidence interval for μ_{1} – μ_{2}, the population mean “looks vs. personality score” for females minus the population mean “looks vs. personality score” for males.
 Using statistical software, we find that the 95% confidence interval for μ_{1} – μ_{2} is roughly (–3.7, –1.5).
 This is found in SPSS in the equal variances assumed row under the 95% confidence interval columns, given as –3.712 to –1.480, and in SAS in the POOLED row under the 95% CL Mean column, given as –3.7118 to –1.4804 (be careful NOT to choose the confidence interval for the standard deviation in the last column, 95% CL Std Dev).
 Interpretation:
 We are 95% confident that the population mean “looks vs. personality score” for females is between 1.5 and 3.7 points lower than that of males.
 OR
 We are 95% confident that the population mean “looks vs. personality score” for males is between 1.5 and 3.7 points higher than that of females.
 The confidence interval therefore quantifies the effect that the explanatory variable (gender) has on the response (looks vs personality score).
 Since low values correspond to personality being more important and high values correspond to looks being more important, the result of our investigation suggests that, on average, females place personality higher than do males. Alternatively we could say that males place looks higher than do females.
 Note: The confidence interval does not contain zero (both values are negative based upon how we chose our groups) and thus using the confidence interval we can reject the null hypothesis here.
Practical Significance:
We should definitely ask ourselves if this result is practically significant.
 Is a true difference in population means as represented by our estimate from this data meaningful here? I will let you consider and answer for yourself.
SPSS Output for this example (Non-Parametric Output for Examples 1 and 2)
SAS Output and SAS Code (Includes Non-Parametric Test)
Here is another example.
EXAMPLE: BMI vs. Gender in Heart Attack Patients
A study was conducted which enrolled and followed heart attack patients in a certain metropolitan area. In this example we are interested in determining if there is a relationship between Body Mass Index (BMI) and gender. Individuals presenting to the hospital with a heart attack were randomly selected to participate in the study.
Step 1: State the hypotheses
Ho: μ_{1} – μ_{2} = 0 (which is the same as μ_{1} = μ_{2})
Ha: μ_{1} – μ_{2} ≠ 0 (which is the same as μ_{1} ≠ μ_{2})
where μ_{1} represents the mean BMI for males and μ_{2} represents the mean BMI for females.
It is important to understand that conceptually, the two hypotheses claim:
Ho: BMI is not related to gender in heart attack patients
Ha: BMI is related to gender in heart attack patients
Step 2: Obtain data, check conditions, and summarize data
 Data: WHAS500 SPSS format, SAS format
 Let’s first check whether the conditions that allow us to safely use the two-sample t-test are met.
 Here, subjects were chosen and were naturally divided into a sample of females and a sample of males. Since the subjects were chosen at random, the sample of females is independent of the sample of males.
 Here, we are in the second scenario — the sample sizes are extremely large, and so we can proceed regardless of whether the populations are normal or not.
 In the output below we first look at the test for equality of variances (outlined in orange). The two-sample t-test results we will use are outlined in blue.
 There are TWO TESTS represented in this output and we must make the correct decision for BOTH of these tests to correctly proceed.
 SOFTWARE OUTPUT In SPSS:
 The p-value for the test of equality of variances is reported as 0.001 in the Sig. column under Levene’s test for equality of variances.
 So we reject the null hypothesis that the variances, or equivalently the standard deviations, are equal (Ho: σ_{1} = σ_{2}).
 Conclusion to test for equality of variances: We conclude there is enough evidence of a difference in the variance of BMI between males and females.
 This results in using the row for Equal variances NOT assumed to find the ttest results including the test statistic, pvalue, and confidence interval for the difference. (Outlined in BLUE)
 SOFTWARE OUTPUT In SAS:
 The p-value for the test of equality of variances is reported as 0.0004 in the Pr > F column under equality of variances.
 So we reject the null hypothesis that the variances, or equivalently the standard deviations, are equal (Ho: σ_{1} = σ_{2}).
 Conclusion to test for equality of variances: We conclude there is enough evidence of a difference in the variance of BMI between males and females.
 This results in using the row for SATTERTHWAITE method where UNEQUAL variances are assumed to find the ttest results including the test statistic, pvalue, and confidence interval for the difference. (Outlined in BLUE)
 TEST STATISTIC for Two-Sample T-test: In all of the results above, we determine that we will use the test which assumes the variances are UNEQUAL, and we find our test statistic of t = 3.21.
Step 3: Find the p-value of the test by using the test statistic as follows
 We will let the software find the p-value for us, and in this case, the p-value is less than our significance level of 0.05.
 This is found in SPSS in the equal variances NOT assumed row under t-test in the Sig. (2-tailed) column, given as 0.001, and in SAS in the SATTERTHWAITE row under the Pr > |t| column, given as 0.0015.
 This p-value means that it would be extremely rare to get data like those observed (or even more extreme) had the null hypothesis been true.
 More specifically, in our example, if there were no differences between females and males with respect to BMI, it would be highly unlikely (probability about 0.001) to get data where the difference between the sample mean BMIs of males and females is 1.64 or more extreme.
 Comment: Note that the output tells us that the estimated difference (x̄_{1} – x̄_{2}) is approximately 1.64. But more importantly, we want to know if this difference is statistically significant. To answer this, we use the fact that this difference is 3.21 standard errors above the null value.
Step 4: Conclusion
As usual, a small p-value provides evidence against Ho. In our case our p-value is 0.001 (which is smaller than any level of significance that we will choose). The data therefore provide very strong evidence against Ho, so we reject it.
 Conclusion: There is enough evidence that the mean BMI of males differs from that of females. In other words, males and females differ with respect to BMI among heart attack patients.
As a follow-up to this conclusion, we can construct a confidence interval for the difference between population means. In this case we will construct a confidence interval for μ_{1} – μ_{2}, the population mean BMI for males minus the population mean BMI for females.
 Using statistical software, we find that the 95% confidence interval for μ_{1} – μ_{2} is roughly (0.63, 2.64).
 This is found in SPSS in the UNEQUAL variances assumed row under 95% confidence interval columns and in SAS in the SATTERTHWAITE ROW under 95% CL MEAN column.
 Interpretation:
 We are 95% confident that the population mean BMI for males is between 0.63 and 2.64 units larger than that of females.
 OR
 We are 95% confident that the population mean BMI for females is between 0.63 and 2.64 units smaller than that of males.
 The confidence interval therefore quantifies the effect of the explanatory variable (gender) on the response (BMI). Notice that we cannot imply a causal effect of gender on BMI based upon this result alone as there could be many lurking variables, unaccounted for in this analysis, which might be partially or even completely responsible for this difference.
 Note: The confidence interval does not contain zero (both values are positive based upon how we chose our groups) and thus using the confidence interval we can reject the null hypothesis here.
Practical Significance:
 We should definitely ask ourselves if this result is practically significant.
 Is a true difference in population means as represented by our estimate from this data meaningful here? Is a difference in BMI of between 0.63 and 2.64 of interest?
 I will let you consider and answer for yourself.
SPSS Output for this example (Non-Parametric Output for Examples 1 and 2)
SAS Output and SAS Code (Includes Non-Parametric Test)
Note: In the SAS output the variable gender is not formatted, in this case Males = 0 and Females = 1.
Comments:
You might ask yourself: “Where do we use the test statistic?”
It is true that for all practical purposes all we have to do is check that the conditions which allow us to use the two-sample t-test are met, lift the p-value from the output, and draw our conclusions accordingly.
However, we feel that it is important to mention the test statistic for two reasons:
 The test statistic is what’s behind the scenes; based on its null distribution and its value, the p-value is calculated.
 Apart from being the key for calculating the p-value, the test statistic is also itself a measure of the evidence stored in the data against Ho. As we mentioned, it measures (in standard errors) how far our data are from what is claimed in the null hypothesis.
Now try some more activities for yourself.
Non-Parametric Alternative: Wilcoxon Rank-Sum Test (Mann-Whitney U)
LO 5.1: For a data analysis situation involving two variables, determine the appropriate alternative (non-parametric) method when assumptions of our standard methods are not met.
We will look at one non-parametric test in the two-independent-samples setting. More details will be discussed later (Details for Non-Parametric Alternatives).
 The Wilcoxon rank-sum test (Mann-Whitney U test) is a general test to compare two distributions in independent samples. It is a commonly used alternative to the two-sample t-test when the assumptions are not met.
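As a sketch of how this test is run in practice with scipy (the scores below are made up; the function reports the Mann-Whitney U statistic and a p-value):

```python
from scipy import stats

# Made-up scores for two independent groups.
group1 = [15, 13, 10, 12, 14, 14, 6, 17]
group2 = [9, 11, 8, 10, 12, 7, 13, 10]

u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")
```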