k > 2 Independent Samples

This page is basically complete except that the PDF activities only have the non-interactive versions for now.
As we mentioned at the end of the Introduction to Unit 4B, we will focus only on two-sided tests for the remainder of this course. One-sided tests are often possible but rarely used in clinical research.
CO-4: Distinguish among different measurement scales, choose the appropriate descriptive and inferential statistical methods based on these distinctions, and interpret the results.
LO 4.35: For a data analysis situation involving two variables, choose the appropriate inferential method for examining the relationship between the variables and justify the choice.
LO 4.36: For a data analysis situation involving two variables, carry out the appropriate inferential method for examining relationships between the variables and draw the correct conclusions in context.
CO-5: Determine preferred methodological alternatives to commonly used statistical methods when assumptions are not met.
REVIEW: Unit 1 Case C-Q
Video: k > 2 Independent Samples (21:15)

Related SAS Tutorials

Related SPSS Tutorials

Introduction

In this part, we continue to handle situations involving one categorical explanatory variable and one quantitative response variable, which is case C→Q.

Here is a summary of the tests we have covered for the case where k = 2. The methods marked for more emphasis are our main focus in this unit.

So far we have discussed the two samples and matched pairs designs, in which the categorical explanatory variable is two-valued. As we saw, in these cases, examining the relationship between the explanatory and the response variables amounts to comparing the mean of the response variable (Y) in two populations, which are defined by the two values of the explanatory variable (X). The difference between the two samples and matched pairs designs is that in the former, the two samples are independent, and in the latter, the samples are dependent.

Independent Samples (More Emphasis)

Standard Tests

  • Two Sample T-Test Assuming Equal Variances
  • Two Sample T-Test Assuming Unequal Variances

Non-Parametric Test

  • Mann-Whitney U (or Wilcoxon Rank-Sum) Test

Dependent Samples (Less Emphasis)

Standard Test

  • Paired T-Test

Non-Parametric Tests

  • Sign Test
  • Wilcoxon Signed-Rank Test

We now move on to the case where k > 2 when we have independent samples. Here is a summary of the tests we will learn for the case where k > 2. Notice we will not cover the dependent samples case in this course.

Independent Samples (Only Emphasis)

Standard Test

  • One-way ANOVA (Analysis of Variance)

Non-Parametric Test

  • Kruskal–Wallis One-way ANOVA

Dependent Samples (Not Discussed)

Standard Test

  • Repeated Measures ANOVA (or similar)

Here, as in the two-valued case, making inferences about the relationship between the explanatory (X) and the response (Y) variables amounts to comparing the means of the response variable in the populations defined by the values of the explanatory variable, where the number of means we are comparing depends, of course, on the number of values of X.

Unlike the two-valued case, where we looked at two sub-cases, (1) when the samples are independent (two samples design) and (2) when the samples are dependent (matched pairs design), here we are going to discuss only the case where the samples are independent. In other words, we are simply going to extend the two samples design to more than two independent samples.

The explanatory variable (X) has k values. This means we have k populations, each with its own mean of the response Y: μ1, μ2, …, μk. From each of these populations we take a sample, each with its own size, and we end up with k independent samples.

The inferential method for comparing more than two means that we will introduce in this part is called ANalysis Of VAriance (abbreviated as ANOVA), and the test associated with this method is called the ANOVA F-test.

In most software, the data need to be arranged so that each row contains one observation with one variable recording X and another variable recording Y for each observation.
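For instance, here is a minimal sketch of this layout in Python, assuming pandas is available (the column names and values are hypothetical):

```python
import pandas as pd

# One row per observation: one column records X (the group) and
# another records Y (the response). Values are hypothetical.
df = pd.DataFrame({
    "major":       ["Business", "Business", "English",
                    "Mathematics", "Psychology", "Psychology"],
    "frustration": [7, 9, 12, 13, 14, 15],
})

# Summarize Y within each level of X.
print(df.groupby("major")["frustration"].mean())
```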

Comparing Two or More Means – The ANOVA F-test

LO 4.38: In a given context, determine the appropriate standard method for comparing groups and provide the correct conclusions given the appropriate software output.
LO 4.39: In a given context, set up the appropriate null and alternative hypotheses for comparing groups.

As we mentioned earlier, the test that we will present is called the ANOVA F-test, and as you’ll see, this test is different in two ways from all the tests we have presented so far:

  • Unlike the previous tests, where we had three possible alternative hypotheses to choose from (depending on the context of the problem), in the ANOVA F-test there is only one alternative, which actually makes life simpler.
  • The test statistic will not have the same structure as the test statistics we’ve seen so far. In other words, it will not have the form:

$$\frac{\text{sample estimate} - \text{null value}}{\text{standard error of the estimate}}$$

but a different structure that captures the essence of the F-test and clarifies where the name “analysis of variance” comes from.

What is the idea behind comparing more than two means?

The question we need to answer is: Are the differences among the sample means due to true differences among the μ’s (alternative hypothesis), or merely due to sampling variability or random chance (null hypothesis)?


Here are two sets of boxplots representing two possible scenarios:

Scenario #1

 

In this set of boxplots, we see that for each population, the interval between the first and third quartiles is very large - for Business majors, the first quartile is at about 2, and the third quartile is at about 17. The rest of the majors have smaller ranges, but they are still large, covering 10 or more frustration points. The mean for Business is about 9, for English about 12, for Mathematics about 13, and for Psychology about 13. Every major's mean lies within the interval between every other major's first and third quartiles.

  • Because of the large amount of spread within the groups, the boxplots show plenty of overlap.
  • One could imagine the data arising from 4 random samples taken from 4 populations, all having the same mean of about 11 or 12.
  • The first group of values may have been a bit on the low side, and the other three a bit on the high side, but such differences could conceivably have come about by chance.
  • This would be the case if the null hypothesis, claiming equal population means, were true.

Scenario #2

In this set of boxplots, the interval between the first and third quartiles for each major is much narrower - the widest is about 7 frustration points. In addition, while the means for each major are the same as in the first set of boxplots, the first-to-third-quartile interval for Business does not contain any other major's mean, and no other major's interval includes the Business mean. In fact, the other intervals do not even include Business's third quartile.

  • Because of the small amount of spread within the groups, the boxplots show very little overlap.
  • It would be very hard to believe that we are sampling from four groups that have equal population means.
  • This would be the case if the null hypothesis, claiming equal population means, were false.

Thus, in the language of hypothesis tests, we would say that if the data were configured as they are in scenario 1, we would not reject the null hypothesis that population means were equal for the k groups.

If the data were configured as they are in scenario 2, we would reject the null hypothesis, and we would conclude that not all population means are the same for the k groups.

Let’s summarize what we learned from this.

  • The question we need to answer is: Are the differences among the sample means due to true differences among the μ’s (alternative hypothesis), or merely due to sampling variability (null hypothesis)?

In order to answer this question using data, we need to look at the variation among the sample means, but this alone is not enough.

We need to look at the variation among the sample means relative to the variation within the groups. In other words, we need to look at the quantity:

$$\frac{\text{variation among the sample means}}{\text{variation within the groups}}$$

which measures to what extent the difference among the sample means for our groups dominates over the usual variation within sampled groups (which reflects differences in individuals that are typical in random samples).

When the variation within groups is large (like in scenario 1), the variation (differences) among the sample means may become negligible resulting in data which provide very little evidence against Ho. When the variation within groups is small (like in scenario 2), the variation among the sample means dominates over it, and the data have stronger evidence against Ho.

This ratio is the test statistic of the ANOVA F-test, denoted F. It has a different structure from all the test statistics we’ve looked at so far, but it is similar in that it is still a measure of the evidence against Ho. The larger F is (which happens when the denominator, the variation within groups, is small relative to the numerator, the variation among the sample means), the more evidence we have against Ho.

Looking at this ratio of variations is the idea behind comparing more than two means; hence the name analysis of variance (ANOVA).

Now test your understanding of this idea.

Learn By Doing: Idea of One-Way ANOVA
(Non-Interactive Version – Spoiler Alert)

Comments

  • The focus here is for you to understand the idea behind this test statistic, so we do not go into detail about how the two variations are measured. We instead rely on software output to obtain the F-statistic.
  • This test is called the ANOVA F-test.
    • So far, we have explained the ANOVA part of the name.
    • Based on the previous tests we introduced, it should not be surprising that the “F-test” part comes from the fact that the null distribution of the test statistic, under which the p-values are calculated, is called an F-distribution.
    • We will say very little about the F-distribution in this course, which will essentially be limited to this comment and the next one.
  • It is fairly straightforward to decide if a z-statistic is large. Even without tables, we should realize by now that a z-statistic of 0.8 is not especially large, whereas a z-statistic of 2.5 is large.
    • In the case of the t-statistic, it is less straightforward, because there is a different t-distribution for every sample size n (and degrees of freedom n − 1).
    • However, the fact that a t-distribution with a large number of degrees of freedom is very close to the z (standard normal) distribution can help to assess the magnitude of the t-test statistic.
    • When the size of the F-statistic must be assessed, the task is even more complicated, because there is a different F-distribution for every combination of the number of groups we are comparing and the total sample size.
    • We will nevertheless say that for most situations, an F-statistic greater than 4 would be considered rather large, but tables or software are needed to get a truly accurate assessment.
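To get a feel for that last rule of thumb, here is a minimal sketch in Python, assuming scipy is available, that computes the upper-tail probability P(F ≥ 4) for a few degrees-of-freedom combinations (chosen purely for illustration):

```python
from scipy import stats

# Upper-tail probability P(F >= 4) under several F-distributions.
for dfn, dfd in [(2, 27), (3, 136), (5, 54)]:
    p = stats.f.sf(4, dfn, dfd)  # survival function = upper-tail probability
    print(f"df = ({dfn}, {dfd}): P(F >= 4) = {p:.4f}")
```

In each case the tail probability comes out below 0.05, which is why an F-statistic greater than 4 is usually considered rather large.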

Steps for One-Way ANOVA

Here is a full statement of the process for the ANOVA F-Test:

Step 1: State the hypotheses

The null hypothesis claims that there is no relationship between X and Y. Since the relationship is examined by comparing the means of Y in the populations defined by the values of X (μ1, μ2, …, μk), no relationship would mean that all the means are equal.

Therefore the null hypothesis of the F-test is:

  • Ho: μ1 = μ2 = … = μk. (There is no relationship between X and Y.)

As we mentioned earlier, here we have just one alternative hypothesis, which claims that there is a relationship between X and Y. In terms of the means μ1, μ2, …, μk, it simply says the opposite of the null hypothesis, that not all the means are equal, and we simply write:

  • Ha: not all μ’s are equal. (There is a relationship between X and Y.)
Learn By Doing: One-Way ANOVA – STEP 1
(Non-Interactive Version – Spoiler Alert)

Comments:

  • The alternative of the ANOVA F-test simply states that not all of the means are equal, and is not specific about the way in which they are different.
  • Another way to phrase the alternative is
    • Ha: at least two μ’s are different
  • Warning: It is incorrect to say that the alternative is μ1 ≠ μ2 ≠ … ≠ μk. This statement is MUCH stronger than our alternative hypothesis and says that ALL the means are different from ALL the other means.
  • Note that there are many ways for μ1, μ2, μ3, μ4 not to be all equal, and μ1 ≠ μ2 ≠ μ3 ≠ μ4 is just one of them. Another possibility is μ1 = μ2 = μ3 ≠ μ4, or μ1 = μ2 ≠ μ3 = μ4.

Step 2: Obtain data, check conditions, and summarize data

The ANOVA F-test can be safely used as long as the following conditions are met:

  • The samples drawn from each of the populations we’re comparing are independent.
  • We are in one of the following two scenarios:

(i) Each of the populations is normal; more specifically, the distribution of the response Y in each population is normal, and the samples are random (or at least can be considered as such). In practice, we check normality in the populations by looking at each of the samples using a histogram and checking for signs that the population is not normal, such as extreme skewness and/or extreme outliers.

(ii) The populations are known or discovered not to be normal, but the sample size of each of the random samples is large enough (we can use the rule of thumb that a sample size greater than 30 is considered large enough).

  • The populations all have the same standard deviation.

We can check this condition using the rule of thumb that the ratio of the largest sample standard deviation to the smallest should be less than 2. If that is the case, this condition is considered satisfied (a sketch of this check follows below).

We can also check this condition using a formal test, similar to the one used for the two-sample t-test, although we will not cover formal tests in this course.
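Here is a minimal sketch of the rule-of-thumb check in Python, assuming numpy is available (the sample values are hypothetical):

```python
import numpy as np

# Hypothetical samples from three groups.
samples = [
    np.array([7.0, 9.0, 6.0, 8.0, 10.0]),
    np.array([12.0, 11.0, 14.0, 10.0, 13.0]),
    np.array([13.0, 15.0, 12.0, 14.0, 16.0]),
]

sds = [s.std(ddof=1) for s in samples]  # sample standard deviations
ratio = max(sds) / min(sds)
print(f"largest SD / smallest SD = {ratio:.2f}")
print("rule of thumb satisfied" if ratio < 2 else "rule of thumb violated")
```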

Learn By Doing: One-Way ANOVA – STEP 2
(Non-Interactive Version – Spoiler Alert)

Test Statistic

  • If our conditions are satisfied, the test statistic is:

$$F = \frac{\text{variation among the sample means}}{\text{variation within the groups}}$$

  • Under Ho, the statistic follows an F-distribution with k − 1 numerator degrees of freedom and n − k denominator degrees of freedom, where n is the total (combined) sample size and k is the number of groups being compared.
  • We will rely on software to calculate the test statistic and p-value for us (see the sketch below).
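As one way to obtain these values outside of SAS or SPSS, here is a minimal sketch in Python, assuming scipy is available (the group data are hypothetical stand-ins, not the actual course data):

```python
from scipy import stats

# Hypothetical frustration scores for four majors.
business    = [7, 9, 6, 8, 5, 10, 7]
english     = [12, 11, 14, 10, 13, 12, 11]
mathematics = [13, 15, 12, 14, 13, 12, 14]
psychology  = [14, 13, 16, 15, 12, 14, 15]

# One-way ANOVA F-test: returns the F-statistic and its p-value.
f_stat, p_value = stats.f_oneway(business, english, mathematics, psychology)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")
```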

Step 3: Find the p-value of the test by using the test statistic as follows

  • The p-value of the ANOVA F-test is the probability of getting an F statistic as large as we obtained (or even larger), had Ho been true (all k population means are equal).
  • In other words, it tells us how surprising it is to find data like those observed, assuming that there is no difference among the population means μ1, μ2, …, μk.
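In symbols, writing F_obs for the observed value of the statistic, the p-value described above is:

$$p\text{-value} = P\left(F_{k-1,\,n-k} \geq F_{\text{obs}}\right)$$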

Step 4: Conclusion

As usual, we base our conclusion on the p-value.

  • A small p-value tells us that our data contain a lot of evidence against Ho. More specifically, a small p-value tells us that the differences between the sample means are statistically significant (unlikely to have happened by chance), and therefore we reject Ho.
    • Conclusion: There is enough evidence that the categorical explanatory variable is related to (or associated with) the quantitative response variable. More specifically, there is enough evidence that there are differences between at least two of the population means (there are some differences in the population means).
  • If the p-value is not small, we do not have enough statistical evidence to reject Ho.
    • Conclusion: There is NOT enough evidence that the categorical explanatory variable is related to (or associated with) the quantitative response variable. More specifically, there is NOT enough evidence that there are differences between at least two of the population means.
  • A significance level (cut-off probability) of 0.05 can help determine what is considered a small p-value.

Final Comment

Note that when we reject Ho in the ANOVA F-test, all we can conclude is that

  • not all the means are equal, or
  • there are some differences between the means, or
  • the response Y is related to explanatory X.

However, the ANOVA F-test does not provide any immediate insight into why Ho was rejected, or in other words, it does not tell us in what way the population means of the groups are different. As an exploratory (or visual) aid to get that insight, we may take a look at the confidence intervals for group population means. More specifically, we can look at which of the confidence intervals overlap and which do not.

Multiple Comparisons:

  • When we compare standard 95% confidence intervals in this way, we have an increased chance of making a type I error, since each interval individually carries a 5% error rate and these errors accumulate across comparisons.
  • There are many multiple comparison procedures, all of which propose alternative methods for determining which pairs of means are different.
  • We will look at a few of these in the software just to show you a little about this topic, but we will not cover this officially in this course (one common procedure is sketched below).
  • The goal is to provide an overall type I error rate no larger than 5% for all comparisons made.
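One widely used multiple comparison procedure is Tukey's HSD. Here is a minimal sketch, assuming the statsmodels package is available (the scores and group labels are hypothetical):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores with their group labels.
scores = np.array([7, 9, 6, 8, 12, 11, 14, 10, 13, 15, 12, 14])
groups = np.array(["Business"] * 4 + ["English"] * 4 + ["Math"] * 4)

# Tukey's HSD controls the overall type I error rate at alpha
# across all pairwise comparisons of the group means.
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```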

Now let’s look at some examples using real data.

EXAMPLE: Is “academic frustration” related to major?

A college dean believes that students with different majors may experience different levels of academic frustration. Random samples of size 35 of Business, English, Mathematics, and Psychology majors are asked to rate their level of academic frustration on a scale of 1 (lowest) to 20 (highest).

The X variable is major, and it has four categories: Business, English, Mathematics, and Psychology. We have four populations, one for each category. We are interested in the mean frustration level (Y) in each population, so we have four means, μ1, μ2, μ3, μ4, one for each population. From each population we take a sample of size 35, resulting in four independent samples.

The figure highlights what we have already mentioned: examining the relationship between major (X) and frustration level (Y) amounts to comparing the mean frustration levels among the four majors defined by X. Also, the figure reminds us that we are dealing with a case where the samples are independent.

Step 1: State the hypotheses

The correct hypotheses are:

  • Ho: μ1 = μ2 = μ3 = μ4.
    (There is NO relationship between major and academic frustration level.)
  • Ha: not all μ’s are equal.
    (There IS a relationship between major and academic frustration level.)

Step 2: Obtain data, check conditions, and summarize data

Data: SPSS format, SAS format, Excel format, CSV format

In our example all the conditions are satisfied:

  • All the samples were chosen randomly, and are therefore independent.
  • The sample sizes are large enough (n = 35 per group) that we really don’t have to worry about normality; however, let’s look at the data using side-by-side boxplots, just to get a sense of it:

[Figure: side-by-side boxplots of frustration level by major]

  • The data suggest that the frustration level of the business students is generally lower than that of students from the other three majors. The ANOVA F-test will tell us whether these differences are significant.

The rule of thumb for equal standard deviations is satisfied, since the ratio of the largest to the smallest sample standard deviation is 3.082 / 2.088 < 2. We will look at the formal test in the software.

[StatCrunch output for the frustration data]

Test statistic: (Minitab output)

[Minitab output: one-way ANOVA for frustration by major]

  • The parts of the output that we will focus on here have been highlighted. In particular, note that the F-statistic is 46.60, which is very large, indicating that the data provide a lot of evidence against Ho (we can also see that the p-value is so small that it is reported to be 0, which supports that conclusion as well).

Step 3: Find the p-value of the test by using the test statistic as follows

  • As we already noticed before, the p-value in our example is so small that it is reported to be 0.000, telling us that it would be next to impossible to get data like those observed had the mean frustration level of the four majors been the same (as the null hypothesis claims).

Step 4: Conclusion

  • In our example, the p-value is extremely small – close to 0 – indicating that our data provide extremely strong evidence to reject Ho.
  • Conclusion: There is enough evidence that the population mean frustration levels of the four majors are not all the same, or in other words, that major does have an effect on students’ academic frustration levels at the school where the test was conducted.

As a follow-up, we can construct confidence intervals (or conduct multiple comparisons as we will do in the software). This allows us to understand better which population means are likely to be different.

[Minitab output: individual 95% confidence intervals for mean frustration by major]

In this case, the business majors are clearly lower on the frustration scale than other majors. It is also possible that English majors are lower than psychology majors based upon the individual 95% confidence intervals in each group.
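To see where such intervals come from, here is a minimal sketch, assuming scipy, of a t-based 95% confidence interval for a single group mean, using the summary statistics for Business majors (n = 35, mean 7.314, standard deviation 2.898) from the software output. Note that one-way ANOVA output often bases individual intervals on the pooled standard deviation, so the intervals printed by software may differ slightly from this one.

```python
import numpy as np
from scipy import stats

# Business majors: n = 35, sample mean 7.314, sample SD 2.898.
n, mean, sd = 35, 7.314, 2.898
se = sd / np.sqrt(n)                   # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value
print(f"95% CI: ({mean - t_crit * se:.2f}, {mean + t_crit * se:.2f})")
```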

SPSS Output

SAS Output and SAS Code (Includes Non-Parametric Test)

Here is another example.

EXAMPLE: Reading Level in Advertising

Do advertisers alter the reading level of their ads based on the target audience of the magazine they advertise in?

In 1981, a study of magazine advertisements was conducted (F.K. Shuptrine and D.D. McVicker, “Readability Levels of Magazine Ads,” Journal of Advertising Research, 21:5, October 1981). Researchers selected random samples of advertisements from each of three groups of magazines:

  • Group 1—highest educational level magazines (such as Scientific American, Fortune, The New Yorker)
  • Group 2—middle educational level magazines (such as Sports Illustrated, Newsweek, People)
  • Group 3—lowest educational level magazines (such as National Enquirer, Grit, True Confessions)

The measure that the researchers used to assess the level of the ads was the number of words in the ad. Eighteen ads were randomly selected from each of the magazine groups, and the number of words per ad was recorded.

The following figure summarizes this problem:

The variable education level (X) has three categories: high, medium, and low. These categories define our three populations: high, medium, and low education level magazines. Each population has its own mean number of words per ad (Y). From each population we take a sample of size 18. We find that for the high-level magazines the sample mean is ȳ1 = 140.0, for the medium-level magazines ȳ2 = 121.4, and for the low-level magazines ȳ3 = 106.5.

Our question of interest is whether the number of words in ads (Y) is related to the educational level of the magazine (X). To answer this question, we need to compare μ1, μ2, and μ3, the mean number of words in ads of the three magazine groups. Note in the figure that the sample means are provided. It seems that what the data suggest makes sense; the magazines in group 1 have the largest number of words per ad (on average) followed by group 2, and then group 3.

The question is whether these differences between the sample means are significant. In other words, are the differences among the observed sample means due to true differences among the μ’s or merely due to sampling variability? To answer this question, we need to carry out the ANOVA F-test.

Step 1: Stating the hypotheses.

We are testing:

  • Ho: μ1 = μ2 = μ3.
    (There is NO relationship between educational level and number of words in ads.)
  • Ha: not all μ’s are equal.
    (There IS a relationship between educational level and number of words in ads.)

Conceptually, the null hypothesis claims that the number of words in ads is not related to the educational level of the magazine, and the alternative hypothesis claims that there is a relationship.

Step 2: Checking conditions and summarizing the data.

  • (i) The ads were selected at random from each magazine group, so the three samples are independent.

In order to check the next two conditions, we’ll need to look at the data (condition ii), and calculate the sample standard deviations of the three samples (condition iii).

  • Here are the side-by-side boxplots of the data:

Three boxplots titled "Boxplot of Words vs GROUP." The vertical axis is labeled "Words," and the horizontal axis is labeled "GROUP." We see that Group 3 has the smallest Q1 to Q3 interval, and that Q2 is far below the mean. Group 2 has a larger Q1 to Q3 interval, and Group 1 has the largest Q1 to Q3 interval, and a Q2 above its mean. Since the box plots are all centered at about the same word count, the larger Q1 to Q3 intervals cover the smaller ones.

  • And the standard deviations:
    • Group 1 StDev: 74.0
    • Group 2 StDev: 64.3
    • Group 3 StDev: 57.6

Using the above, we can address conditions (ii) and (iii):

  • (ii) The graph does not display any alarming violations of the normality assumption. It seems like there is some skewness in groups 2 and 3, but not extremely so, and there are no outliers in the data.
  • (iii) We can assume that the equal standard deviation assumption is met since the rule of thumb is satisfied: the largest sample standard deviation of the three is 74 (group 1), the smallest one is 57.6 (group 3), and 74/57.6 < 2.

Before we move on, let’s look again at the graph. It is easy to see the trend of the sample means (indicated by red circles).

However, there is so much variation within each of the groups that there is almost complete overlap between the three boxplots, and the differences between the means are overshadowed and seem like something that could have happened just by chance.

Let’s move on and see whether the ANOVA F-test will support this observation.

  • Test Statistic: Using statistical software to conduct the ANOVA F-test, we find that the F statistic is 1.18, which is not very large. We also find that the p-value is 0.317.

Step 3. Finding the p-value.

  • The p-value is 0.317, which tells us that getting data like those observed is not very surprising assuming that there are no differences between the three magazine groups with respect to the mean number of words in ads (which is what Ho claims).
  • In other words, the large p-value tells us that it is quite reasonable that the differences between the observed sample means could have happened just by chance (i.e., due to sampling variability) and not because of true differences between the means.

Step 4: Making conclusions in context.

  • The large p-value indicates that the results are not statistically significant, and that we cannot reject Ho.
  • Conclusion: The study does not provide evidence that the mean number of words in ads is related to the educational level of the magazine. In other words, the study does not provide evidence that advertisers alter the reading level of their ads (as measured by the number of words) based on the educational level of the target audience of the magazine.

Now try one for yourself.

Learn By Doing: One-Way ANOVA –  Flicker Frequency
(Non-Interactive Version – Spoiler Alert)

Confidence Intervals

The ANOVA F-test does not provide any insight into why Ho was rejected; it does not tell us in what way μ1, μ2, μ3, …, μk are not all equal. We would like to know which pairs of μ's are not equal. As an exploratory (or visual) aid to get that insight, we may take a look at the confidence intervals for the group population means μ1, μ2, μ3, …, μk that appear in the output. More specifically, we should look at the position of the confidence intervals and the overlap (or lack of overlap) between them.

  • If the confidence interval for, say, μi overlaps with the confidence interval for μj, then μi and μj share some plausible values, which means that based on the data we have no evidence that these two μ's are different.

Illustrated are the confidence intervals for μ_i and μ_j on a number line. We see that they overlap, so there is an overlap in plausible values.

  • If the confidence interval for μi does not overlap with the confidence interval for μj, then μi and μj do not share plausible values, which means that the data suggest that these two μ's are different.

Illustrated are the confidence intervals for μ_i and μ_j on a number line. We see that they do not overlap, so there is no overlap in plausible values.

Furthermore, if, as in the figure above, the confidence interval (set of plausible values) for μi lies entirely below the confidence interval (set of plausible values) for μj, then the data suggest that μi is smaller than μj.

EXAMPLE

Consider our first example on the level of academic frustration.

  • Business: mean = 7.314, StDev = 2.898; 95% confidence interval about (6.5, 8.5)
  • English: mean = 11.771, StDev = 2.088; 95% confidence interval about (11, 13)
  • Mathematics: mean = 13.2, StDev = 2.153; 95% confidence interval about (12.5, 14.5)
  • Psychology: mean = 14.029, StDev = 3.082; 95% confidence interval about (13, 15)

Based on the small p-value, we rejected Ho and concluded that not all four frustration level means are equal, or in other words that frustration level is related to the student’s major. To get more insight into that relationship, we can look at the confidence intervals above (marked in red). The top confidence interval is the set of plausible values for μ1, the mean frustration level of business students. The confidence interval below it is the set of plausible values for μ2, the mean frustration level of English students, etc.

What we see is that the business confidence interval is way below the other three (it doesn’t overlap with any of them). The math confidence interval overlaps with both the English and the psychology confidence intervals; however, there is no overlap between the English and psychology confidence intervals.

This gives us the impression that the mean frustration level of business students is lower than the mean in the other three majors. Within the other three majors, we get the impression that the mean frustration of math students may not differ much from the means of both English and psychology students; however, the mean frustration of English students may be lower than the mean of psychology students.

Note that this is only an exploratory/visual way of getting an impression of why Ho was rejected, not a formal one. There is a formal way of doing it that is called “multiple comparisons,” which is beyond the scope of this course. An extension to this course will include this topic in the future.

Non-Parametric Alternative: Kruskal-Wallis Test

LO 5.1: For a data analysis situation involving two variables, determine the appropriate alternative (non-parametric) method when assumptions of our standard methods are not met.

We will look at one non-parametric test in the k > 2 independent sample setting. We will cover more details later (Details for Non-Parametric Alternatives).

The Kruskal-Wallis test is a general test to compare multiple distributions in independent samples and is a common alternative to the one-way ANOVA.
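For a sense of how this test is run in practice, here is a minimal sketch in Python, assuming scipy is available, applied to the same hypothetical stand-in data used in the ANOVA sketch earlier:

```python
from scipy import stats

# Kruskal-Wallis compares the groups using ranks rather than means,
# so it does not require the normality condition of the ANOVA F-test.
business    = [7, 9, 6, 8, 5, 10, 7]
english     = [12, 11, 14, 10, 13, 12, 11]
mathematics = [13, 15, 12, 14, 13, 12, 14]
psychology  = [14, 13, 16, 15, 12, 14, 15]

h_stat, p_value = stats.kruskal(business, english, mathematics, psychology)
print(f"H = {h_stat:.2f}, p-value = {p_value:.4f}")
```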