More about Hypothesis Testing
- 1. The effect of sample size on hypothesis testing
- 2. Statistical significance vs. practical importance
- 3. Hypothesis testing and confidence intervals
- Let’s summarize
The issues regarding hypothesis testing that we will discuss are:
- The effect of sample size on hypothesis testing.
- Statistical significance vs. practical importance.
- Hypothesis testing and confidence intervals—how are they related?
We have already seen the effect that sample size has on inference when we discussed point and interval estimation for the population mean (μ) and the population proportion (p). Intuitively …
Larger sample sizes give us more information with which to pin down the true nature of the population. We can therefore expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. As a result, for the same level of confidence, we can report a smaller margin of error and get a narrower confidence interval. What we’ve seen, then, is that a larger sample size boosts how much we trust our sample results.
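This shrinking of the margin of error can be seen numerically. Here is a minimal sketch, using a hypothetical sample proportion of 0.60 (the standard error formula for a proportion and the 95% multiplier z = 1.96 are standard; the sample sizes are chosen only for illustration):

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """95% margin of error for a sample proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical sample proportion of 0.60 observed at several sample sizes:
for n in (100, 400, 1600):
    m = margin_of_error(0.60, n)
    print(f"n = {n:4d}: 0.60 +/- {m:.3f}")
# n =  100: 0.60 +/- 0.096
# n =  400: 0.60 +/- 0.048
# n = 1600: 0.60 +/- 0.024
```

Note that each time the sample size is quadrupled, the margin of error is cut in half, since the standard error scales as 1/√n.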
In hypothesis testing, larger sample sizes have a similar effect. We have also seen that the power of a test increases when the sample size increases, all else remaining the same. This means that with a larger sample we have a better chance of detecting a difference between the true value and the null value.
The following two examples will illustrate that a larger sample size provides more convincing evidence (the test has greater power), and how the evidence manifests itself in hypothesis testing. Let’s go back to our example 2 (marijuana use at a certain liberal arts college).
What do we learn from this?
We see that sample results that are based on a larger sample carry more weight (have greater power).
In example 2, we saw that a sample proportion of 0.19 based on a sample of size 100 was not enough evidence that the proportion of marijuana users in the college is higher than 0.157. Recall, from our general overview of hypothesis testing, that this conclusion (not having enough evidence to reject the null hypothesis) doesn’t mean the null hypothesis is necessarily true (so, we never “accept” the null); it only means that the particular study didn’t yield sufficient evidence to reject the null. It might be that the sample size was simply too small to detect a statistically significant difference.
However, in example 2*, we saw that when the sample proportion of 0.19 is obtained from a sample of size 400, it carries much more weight, and in particular, provides enough evidence that the proportion of marijuana users in the college is higher than 0.157 (the national figure). In this case, the sample size of 400 was large enough to detect a statistically significant difference.
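To see how the same sample proportion of 0.19 carries different weight at the two sample sizes, here is a sketch of the one-sided z-test calculation (using the standard identity Φ(z) = (1 + erf(z/√2))/2 for the normal CDF):

```python
import math

def one_sided_p_value(p_hat, p0, n):
    """z-test for Ho: p = p0 vs Ha: p > p0; returns (z, p-value)."""
    se = math.sqrt(p0 * (1 - p0) / n)               # standard error under Ho
    z = (p_hat - p0) / se
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z > z)
    return z, p_value

for n in (100, 400):
    z, p = one_sided_p_value(0.19, 0.157, n)
    print(f"n = {n}: z = {z:.2f}, p-value = {p:.3f}")
# n = 100: z ≈ 0.91, p-value ≈ 0.182 (not significant at 0.05)
# n = 400: z ≈ 1.81, p-value ≈ 0.035 (significant at 0.05)
```

Quadrupling the sample size halves the standard error, so the test statistic exactly doubles, and the p-value drops from well above 0.05 to below it.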
The following activity will allow you to practice the ideas and terminology used in hypothesis testing when a result is not statistically significant.
Now, we will address the issue of statistical significance versus practical importance (which also involves issues of sample size).
The following activity will let you explore for yourself the effect of sample size on the statistical significance of the results and, more importantly, will discuss issue 2: statistical significance vs. practical importance.
This suggests that when interpreting the results of a test, you should always think not only about the statistical significance of the results but also about their practical importance.
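The distinction can be seen numerically: with a large enough sample, even a trivially small difference becomes statistically significant. Here is a sketch with a hypothetical sample proportion of 0.51 tested against a null value of 0.50 (both numbers chosen only for illustration):

```python
import math

def two_sided_p_value(p_hat, p0, n):
    """Two-sided z-test for Ho: p = p0; returns the p-value."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # 2 * P(Z > |z|)

# A practically tiny effect: p-hat = 0.51 vs. p0 = 0.50.
for n in (1_000, 100_000):
    print(f"n = {n:6d}: p-value = {two_sided_p_value(0.51, 0.50, n):.3f}")
# n =   1000: p-value ≈ 0.527 (not significant)
# n = 100000: p-value ≈ 0.000 (highly significant)
```

The 0.01 difference in proportion is the same in both cases; only the sample size changed. Whether a difference of 0.01 matters is a question about the context, not about the p-value.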
The last topic we want to discuss is the relationship between hypothesis testing and confidence intervals. Even though the flavor of these two forms of inference is different (confidence intervals estimate a parameter, and hypothesis testing assesses the evidence in the data against one claim and in favor of another), there is a strong link between them.
We will explain this link (using the z-test and confidence interval for the population proportion), and then explain how confidence intervals can be used after a test has been carried out.
Recall that a confidence interval gives us a set of plausible values for the unknown population parameter. We may therefore examine a confidence interval to informally decide if a proposed value of population proportion seems plausible.
For example, if a 95% confidence interval for p, the proportion of all U.S. adults already familiar with Viagra in May 1998, was (0.61, 0.67), then it seems clear that we should be able to reject a claim that only 50% of all U.S. adults were familiar with the drug, since based on the confidence interval, 0.50 is not one of the plausible values for p.
In fact, the information provided by a confidence interval can be formally related to the information provided by a hypothesis test. (Comment: The relationship is more straightforward for two-sided alternatives, and so we will not present results for the one-sided cases.)
Suppose we want to carry out the two-sided test:
- Ho: p = p0
- Ha: p ≠ p0
using a significance level of 0.05.
An alternative way to perform this test is to find a 95% confidence interval for p and check:
- If p0 falls outside the confidence interval, reject Ho.
- If p0 falls inside the confidence interval, do not reject Ho.
In other words,
- If p0 is not one of the plausible values for p, we reject Ho.
- If p0 is a plausible value for p, we cannot reject Ho.
(Comment: Similarly, the results of a test using a significance level of 0.01 can be related to the 99% confidence interval.)
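Here is a sketch of this confidence-interval approach to testing. The sample figures (p-hat = 0.64 from a sample of n = 1000) are illustrative assumptions chosen to be roughly consistent with the (0.61, 0.67) interval quoted above:

```python
import math

def ci_for_proportion(p_hat, n, z=1.96):
    """95% confidence interval for a population proportion."""
    m = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - m, p_hat + m

# Hypothetical figures: p-hat = 0.64 from a sample of n = 1000.
low, high = ci_for_proportion(0.64, 1000)
print(f"95% CI: ({low:.3f}, {high:.3f})")   # ≈ (0.610, 0.670)

# Two-sided test of Ho: p = 0.50 at the 0.05 level, via the interval:
p0 = 0.50
print("reject Ho" if not (low <= p0 <= high) else "do not reject Ho")
```

Since 0.50 falls outside the interval, it is not a plausible value for p, and Ho is rejected at the 0.05 level.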
Let’s look at an example:
The context of the last example is a good opportunity to bring up an important point that was discussed earlier.
Even though we use 0.05 as a cutoff to guide our decision about whether the results are statistically significant, we should not treat it as inviolable and we should always add our own judgment. Let’s look at the last example again.
It turns out that the p-value of this test is 0.0734. In other words, it is maybe not extremely unlikely, but it is quite unlikely (probability of 0.0734) that when you toss a fair coin 80 times you’ll get a sample proportion of heads of 48/80 = 0.6 (or even more extreme). It is true that using the 0.05 significance level (cutoff), 0.0734 is not considered small enough to conclude that the coin is not fair. However, if you really don’t want to clean the apartment, the p-value might be small enough for you to ask your roommate to use a different coin, or to provide one yourself!
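The quoted p-value can be checked with the two-sided z-test calculation (the normal approximation reproduces it to within rounding):

```python
import math

# Two-sided z-test for the coin example: 48 heads in 80 tosses, Ho: p = 0.5.
p_hat, p0, n = 48 / 80, 0.5, 80
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # 2 * P(Z > z)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")   # z ≈ 1.79, p-value ≈ 0.0736
```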
Here is our final point on this subject:
When the data provide enough evidence to reject Ho, we can conclude (depending on the alternative hypothesis) that the population proportion is either less than, greater than, or not equal to the null value p0. However, we do not get a more informative statement about its actual value. It might be of interest, then, to follow the test with a 95% confidence interval that will give us more insight into the actual value of p.
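As a sketch of this follow-up step, suppose a hypothetical two-sided test of Ho: p = 0.50 rejected Ho based on p-hat = 0.56 from a sample of n = 2500 (numbers chosen only for illustration):

```python
import math

# Hypothetical follow-up: Ho: p = 0.50 was rejected based on
# p-hat = 0.56 from a sample of n = 2500.
p_hat, n = 0.56, 2500

# A 95% confidence interval tells us where p plausibly lies:
m = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"95% CI: ({p_hat - m:.3f}, {p_hat + m:.3f})")   # ≈ (0.541, 0.579)
```

The test alone says only "p is not 0.50"; the interval adds the more informative statement that p is plausibly somewhere between roughly 0.54 and 0.58.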
Even though this portion of the current section is about the z-test for population proportion, it is loaded with very important ideas that apply to hypothesis testing in general. We’ve already summarized the details that are specific to the z-test for proportions, so the purpose of this summary is to highlight the general ideas.
The process of hypothesis testing has four steps:
I. Stating the null and alternative hypotheses (Ho and Ha).
II. Obtaining a random sample (or at least one that can be considered random) and collecting data. Using the data:
Check that the conditions under which the test can be reliably used are met.
Summarize the data using a test statistic.
- The test statistic is a measure of the evidence in the data against Ho. The larger the test statistic is in magnitude, the more evidence the data present against Ho.
III. Finding the p-value of the test. The p-value is the probability of getting data like those observed (or even more extreme) assuming that the null hypothesis is true, and is calculated using the null distribution of the test statistic. The p-value is a measure of the evidence against Ho. The smaller the p-value, the more evidence the data present against Ho.
IV. Making conclusions.
Conclusions about the statistical significance of the results:
If the p-value is small, the data present enough evidence to reject Ho (and accept Ha).
If the p-value is not small, the data do not provide enough evidence to reject Ho.
To help guide our decision, we use the significance level as a cutoff for what is considered a small p-value. The significance cutoff is usually set at 0.05.
Conclusions should then be provided in the context of the problem.
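The four steps can be collected into one sketch of the z-test for a proportion (two-sided alternative; the condition check and normal-CDF identity are standard, the function itself is just an illustration):

```python
import math

def z_test_for_proportion(p_hat, p0, n, alpha=0.05):
    """Sketch of the four-step z-test for a population proportion,
    for the two-sided alternative Ha: p != p0."""
    # Step II: check the conditions for using the normal approximation.
    assert n * p0 >= 10 and n * (1 - p0) >= 10, "sample too small"
    # Step II: summarize the data with the test statistic.
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    # Step III: find the p-value from the null distribution of z.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Step IV: compare the p-value with the significance cutoff.
    decision = "reject Ho" if p_value < alpha else "do not reject Ho"
    return z, p_value, decision

# The coin example from above: 48 heads in 80 tosses, Ho: p = 0.5.
print(z_test_for_proportion(48 / 80, 0.5, 80))
```

Step I (stating Ho and Ha) happens before any computation, and the conclusion returned here must still be translated into the context of the problem.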
Additional Important Ideas about Hypothesis Testing
- Results that are based on a larger sample carry more weight; for the same observed effect, a larger sample size makes the results more likely to be statistically significant.
- Even a very small and practically unimportant effect becomes statistically significant with a large enough sample size. The distinction between statistical significance and practical importance should therefore always be considered.
- Confidence intervals can be used in order to carry out two-sided tests (95% confidence for the 0.05 significance level). If the null value is not included in the confidence interval (i.e., is not one of the plausible values for the parameter), we have enough evidence to reject Ho. Otherwise, we cannot reject Ho.
- If the results are statistically significant, it might be of interest to follow up the tests with a confidence interval in order to get insight into the actual value of the parameter of interest.
- It is important to be aware that there are two types of errors in hypothesis testing (Type I and Type II) and that the power of a statistical test is an important measure of how likely we are to be able to detect a difference of interest to us in a particular problem.