The issues regarding hypothesis testing that we will discuss are:
Let’s begin.
We have already seen the effect that the sample size has on inference, when we discussed point and interval estimation for the population mean (μ, mu) and population proportion (p). Intuitively …
Larger sample sizes give us more information to pin down the true nature of the population. We can therefore expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. As a result, for the same level of confidence, we can report a smaller margin of error, and get a narrower confidence interval. What we’ve seen, then, is that larger sample size gives a boost to how much we trust our sample results.
In hypothesis testing, larger sample sizes have a similar effect. We have also discussed that the power of a test increases when the sample size increases, all else remaining the same. This means we have a better chance of detecting the difference between the true value and the null value with larger samples.
The following two examples will illustrate that a larger sample size provides more convincing evidence (the test has greater power), and how the evidence manifests itself in hypothesis testing. Let’s go back to our example 2 (marijuana use at a certain liberal arts college).
Is the proportion of marijuana users in the college higher than the national figure?
We do not have enough evidence to conclude that the proportion of students at the college who use marijuana is higher than the national figure.
Now, let’s increase the sample size.
There are rumors that students in a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of 400 students from the college, 76 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is 0.157? (Reported by the Harvard School of Public Health).
Our results here are statistically significant. In other words, in example 2* the data provide enough evidence to reject Ho.
What do we learn from this?
We see that sample results that are based on a larger sample carry more weight (have greater power).
In example 2, we saw that a sample proportion of 0.19 based on a sample of size 100 was not enough evidence that the proportion of marijuana users in the college is higher than 0.157. Recall, from our general overview of hypothesis testing, that this conclusion (not having enough evidence to reject the null hypothesis) doesn’t mean the null hypothesis is necessarily true (so, we never “accept” the null); it only means that the particular study didn’t yield sufficient evidence to reject the null. It might be that the sample size was simply too small to detect a statistically significant difference.
However, in example 2*, we saw that when the sample proportion of 0.19 is obtained from a sample of size 400, it carries much more weight, and in particular, provides enough evidence that the proportion of marijuana users in the college is higher than 0.157 (the national figure). In this case, the sample size of 400 was large enough to detect a statistically significant difference.
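The effect of sample size can be checked numerically. Here is a minimal standard-library sketch (the function name and signature are our own, not from the course) that computes the one-proportion z statistic and one-sided p-value for the same sample proportion at the two sample sizes discussed above:

```python
from math import sqrt
from statistics import NormalDist

def one_prop_ztest(phat, p0, n):
    """One-proportion z-test for Ha: p > p0 (one-sided)."""
    se = sqrt(p0 * (1 - p0) / n)      # standard error under Ho
    z = (phat - p0) / se              # standardized test statistic
    p_value = 1 - NormalDist().cdf(z) # right-tail probability
    return z, p_value

# Same sample proportion (0.19), different sample sizes:
z100, p100 = one_prop_ztest(0.19, 0.157, 100)  # z about 0.91, p about 0.182
z400, p400 = one_prop_ztest(0.19, 0.157, 400)  # z about 1.81, p about 0.035
```

With n = 100 the p-value (about 0.182) is not small enough to reject Ho, while with n = 400 the same sample proportion yields a p-value below 0.05.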
The following activity will allow you to practice the ideas and terminology used in hypothesis testing when a result is not statistically significant.
Now, we will address the issue of statistical significance versus practical importance (which also involves issues of sample size).
The following activity will let you explore the effect of the sample size on the statistical significance of the results yourself, and more importantly will discuss issue 2: Statistical significance vs. practical importance.
This suggests that when interpreting the results of a test, you should always think not only about the statistical significance of the results but also about their practical importance.
The last topic we want to discuss is the relationship between hypothesis testing and confidence intervals. Even though the flavor of these two forms of inference is different (confidence intervals estimate a parameter, and hypothesis testing assesses the evidence in the data against one claim and in favor of another), there is a strong link between them.
We will explain this link (using the z-test and confidence interval for the population proportion), and then explain how confidence intervals can be used after a test has been carried out.
Recall that a confidence interval gives us a set of plausible values for the unknown population parameter. We may therefore examine a confidence interval to informally decide if a proposed value of population proportion seems plausible.
For example, if a 95% confidence interval for p, the proportion of all U.S. adults already familiar with Viagra in May 1998, was (0.61, 0.67), then it seems clear that we should be able to reject a claim that only 50% of all U.S. adults were familiar with the drug, since based on the confidence interval, 0.50 is not one of the plausible values for p.
In fact, the information provided by a confidence interval can be formally related to the information provided by a hypothesis test. (Comment: The relationship is more straightforward for two-sided alternatives, and so we will not present results for the one-sided cases.)
Suppose we want to carry out the two-sided test:
using a significance level of 0.05.
An alternative way to perform this test is to find a 95% confidence interval for p and check:
In other words,
(Comment: Similarly, the results of a test using a significance level of 0.01 can be related to the 99% confidence interval.)
Let’s look at an example:
Recall example 3, where we wanted to know whether the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, when it was 0.64.
We are testing:
and as the figure reminds us, we took a sample of 1,000 U.S. adults, and the data told us that 675 supported the death penalty for convicted murderers (phat = 0.675).
A 95% confidence interval for p, the proportion of all U.S. adults who support the death penalty, is:
Since the 95% confidence interval for p does not include 0.64 as a plausible value for p, we can reject Ho and conclude (as we did before) that there is enough evidence that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003.
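The duality just described can be sketched in a few lines of code (the helper name is ours, for illustration only), using the death penalty example's numbers:

```python
from math import sqrt

def prop_ci_95(phat, n):
    """95% confidence interval for a population proportion."""
    se = sqrt(phat * (1 - phat) / n)
    return phat - 1.96 * se, phat + 1.96 * se

lo, hi = prop_ci_95(675 / 1000, 1000)  # about (0.646, 0.704)
reject_ho = not (lo <= 0.64 <= hi)     # 0.64 falls outside the interval
```

Since 0.64 is not inside the 95% interval, the two-sided test at the 0.05 level rejects Ho: p = 0.64, matching the conclusion above.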
You and your roommate are arguing about whose turn it is to clean the apartment. Your roommate suggests that you settle this by tossing a coin and takes one out of a locked box he has on the shelf. Suspecting that the coin might not be fair, you decide to test it first. You toss the coin 80 times, thinking to yourself that if, indeed, the coin is fair, you should get around 40 heads. Instead you get 48 heads. You are puzzled. You are not sure whether getting 48 heads out of 80 is enough evidence to conclude that the coin is unbalanced, or whether this is a result that could have happened just by chance when the coin is fair.
Statistics can help you answer this question.
Let p be the true proportion (probability) of heads. We want to test whether the coin is fair or not.
We are testing:
The data we have are that out of n = 80 tosses, we got 48 heads, or that the sample proportion of heads is phat = 48/80 = 0.6.
A 95% confidence interval for p, the true proportion of heads for this coin, is:
Since in this case 0.5 is one of the plausible values for p, we cannot reject Ho. In other words, the data do not provide enough evidence to conclude that the coin is not fair.
Comment
The context of the last example is a good opportunity to bring up an important point that was discussed earlier.
Even though we use 0.05 as a cutoff to guide our decision about whether the results are statistically significant, we should not treat it as inviolable and we should always add our own judgment. Let’s look at the last example again.
It turns out that the p-value of this test is 0.0734. In other words, while perhaps not extremely unlikely, it is quite unlikely (probability of 0.0734) that when you toss a fair coin 80 times you’ll get a sample proportion of heads of 48/80 = 0.6 (or one even more extreme). It is true that using the 0.05 significance level (cutoff), 0.0734 is not considered small enough to conclude that the coin is not fair. However, if you really don’t want to clean the apartment, the p-value might be small enough for you to ask your roommate to use a different coin, or to provide one yourself!
Here is our final point on this subject:
When the data provide enough evidence to reject Ho, we can conclude (depending on the alternative hypothesis) that the population proportion is either less than, greater than, or not equal to the null value p_{0}. However, we do not get a more informative statement about its actual value. It might be of interest, then, to follow the test with a 95% confidence interval that will give us more insight into the actual value of p.
In our example 3,
we concluded that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, when it was 0.64. It is probably of interest not only to know that the proportion has changed, but also to estimate what it has changed to. We’ve calculated the 95% confidence interval for p on the previous page and found that it is (0.646, 0.704).
We can combine our conclusions from the test and the confidence interval and say:
Data provide evidence that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, and we are 95% confident that it is now between 0.646 and 0.704. (i.e. between 64.6% and 70.4%).
Let’s look at our example 1 to see how a confidence interval following a test might be insightful in a different way.
Here is a summary of example 1:
We conclude that as a result of the repair, the proportion of defective products has been reduced to below 0.20 (which was the proportion prior to the repair). It is probably of great interest to the company not only to know that the proportion of defective products has been reduced, but also to estimate what it has been reduced to, to get a better sense of how effective the repair was. A 95% confidence interval for p in this case is:
We can therefore say that the data provide evidence that the proportion of defective products has been reduced, and we are 95% confident that it has been reduced to somewhere between 12.4% and 19.6%. This is very useful information, since it tells us that even though the results were significant (i.e., the repair reduced the number of defective products), the repair might not have been effective enough, if it managed to reduce the number of defective products only to the range provided by the confidence interval. This, of course, ties back in to the idea of statistical significance vs. practical importance that we discussed earlier. Even though the results are statistically significant (Ho was rejected), practically speaking, the repair might still be considered ineffective.
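The interval quoted above can be reproduced with the same style of sketch (phat = 0.16 and n = 400 are the example’s figures; the variable names are ours):

```python
from math import sqrt

phat, n = 0.16, 400                       # sample proportion and size from example 1
se = sqrt(phat * (1 - phat) / n)          # standard error based on phat
ci = (phat - 1.96 * se, phat + 1.96 * se) # about (0.124, 0.196)
```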
Even though this portion of the current section is about the z-test for population proportion, it is loaded with very important ideas that apply to hypothesis testing in general. We’ve already summarized the details that are specific to the z-test for proportions, so the purpose of this summary is to highlight the general ideas.
The process of hypothesis testing has four steps:
I. Stating the null and alternative hypotheses (Ho and Ha).
II. Obtaining a random sample (or at least one that can be considered random) and collecting data. Using the data:
Check that the conditions under which the test can be reliably used are met.
Summarize the data using a test statistic.
III. Finding the p-value of the test. The p-value is the probability of getting data like those observed (or even more extreme) assuming that the null hypothesis is true, and is calculated using the null distribution of the test statistic. The p-value is a measure of the evidence against Ho. The smaller the p-value, the more evidence the data present against Ho.
IV. Making conclusions.
Conclusions about the statistical significance of the results:
If the p-value is small, the data present enough evidence to reject Ho (and accept Ha).
If the p-value is not small, the data do not provide enough evidence to reject Ho.
To help guide our decision, we use the significance level as a cutoff for what is considered a small p-value. The significance cutoff is usually set at 0.05.
Conclusions should then be provided in the context of the problem.
Additional Important Ideas about Hypothesis Testing
This last part of the four-step process of hypothesis testing is the same across all statistical tests, and actually, we’ve already said basically everything there is to say about it, but it can’t hurt to say it again.
The p-value is a measure of how much evidence the data present against Ho. The smaller the p-value, the more evidence the data present against Ho.
We already mentioned that what determines what constitutes enough evidence against Ho is the significance level (α, alpha), a cutoff point below which the p-value is considered small enough to reject Ho in favor of Ha. The most commonly used significance level is 0.05.
where, instead of “Ha is True,” we write what this means in the words of the problem (in other words, in the context of the current scenario).
It is important to mention again that this step has essentially two substeps:
Note: We must always also consider whether the results have any practical significance, particularly when they are statistically significant, since a statistically significant result that has no practical use is essentially meaningless!
Let’s go back to our three examples and draw conclusions.
Has the proportion of defective products been reduced as a result of the repair?
We found that the p-value for this test was 0.023.
Since 0.023 is small (in particular, 0.023 < 0.05), the data provide enough evidence to reject Ho.
Conclusion:
The following figure is the complete story of this example, and includes all the steps we went through, starting from stating the hypotheses and ending with our conclusions:
Is the proportion of marijuana users in the college higher than the national figure?
We found that the p-value for this test was 0.182.
Since 0.182 is not small (in particular, 0.182 > 0.05), the data do not provide enough evidence to reject Ho.
Conclusion:
Here is the complete story of this example:
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
We found that the p-value for this test was 0.021.
Since 0.021 is small (in particular, 0.021 < 0.05), the data provide enough evidence to reject Ho.
Conclusion:
Here is the complete story of this example:
Many students wonder why 5% is so often selected as the significance level in hypothesis testing, and why 1% is the next most typical level. This is largely a matter of convenience and tradition.
When Ronald Fisher (one of the founders of modern statistics) published one of his tables, he used a mathematically convenient scale that included 5% and 1%. Later, these same 5% and 1% levels were used by other people, in part just because Fisher was so highly esteemed. But mostly these are arbitrary levels.
The idea of selecting some relatively small cutoff was historically important in the development of statistics; but it’s important to remember that there is really a continuous range of increasing confidence toward the alternative hypothesis, not a single all-or-nothing value. There isn’t much meaningful difference, for instance, between a p-value of 0.049 and one of 0.051, and it would be foolish to declare the first case definitely a “real” effect and the second definitely a “random” one. In either case, the study results were roughly 5% likely to occur by chance if there is no actual effect.
Whether such a p-value is sufficient for us to reject a particular null hypothesis ultimately depends on the risk of making the wrong decision, and the extent to which the hypothesized effect might contradict our prior experience or previous studies.
We have now completed going through the four steps of hypothesis testing, and in particular we learned how they are applied to the ztest for the population proportion. Here is a brief summary:
State the null hypothesis:
Ho: p = p_{0}
State the alternative hypothesis:
Ha: p < p_{0} (one-sided)
Ha: p > p_{0} (one-sided)
Ha: p ≠ p_{0} (two-sided)
where the choice of the appropriate alternative (out of the three) is usually quite clear from the context of the problem. If you feel it is not clear, it is most likely a two-sided problem. Students are usually good at recognizing the “more than” and “less than” terminology, but a two-sided “difference” can sometimes be more difficult to spot, often because you have preconceived ideas of how you think it should be! Use only the information given in the problem.
Obtain data from a sample and:
(i) Check whether the data satisfy the conditions which allow you to use this test.
random sample (or at least a sample that can be considered random in context)
the conditions under which the sampling distribution of phat is normal are met
(ii) Calculate the sample proportion phat, and summarize the data using the test statistic:
(Recall: This standardized test statistic represents how many standard deviations above or below p_{0} our sample proportion phat is.)
When the alternative hypothesis is “less than,” the p-value is the probability of observing a test statistic as small as that observed or smaller, assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a left-tailed test. We shaded to the left of the test statistic, since “less than” is to the left.
When the alternative hypothesis is “greater than,” the p-value is the probability of observing a test statistic as large as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a right-tailed test. We shaded to the right of the test statistic, since “greater than” is to the right.
When the alternative hypothesis is “not equal to,” the p-value is the probability of observing a test statistic which is as large in magnitude as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution.
This is often referred to as a two-tailed test, since we shaded in both directions.
Reach a conclusion first regarding the statistical significance of the results, and then determine what it means in the context of the problem.
If the p-value ≤ 0.05, then WE REJECT Ho.
Conclusion: There IS enough evidence that Ha is True.
If the p-value > 0.05, then WE FAIL TO REJECT Ho.
Conclusion: There IS NOT enough evidence that Ha is True.
Recall that: If the p-value is small (in particular, smaller than the significance level, which is usually 0.05), the results are statistically significant (in the sense that there is a statistically significant difference between what was observed in the sample and what was claimed in Ho), and so we reject Ho.
If the p-value is not small, we do not have enough statistical evidence to reject Ho, and so we continue to believe that Ho may be true. (Remember: In hypothesis testing we never “accept” Ho.)
Finally, in practice, we should always consider the practical significance of the results as well as the statistical significance.
Before we move on to the next test, we are going to use the ztest for proportions to bring up and illustrate a few more very important issues regarding hypothesis testing. This might also be a good time to review the concepts of Type I error, Type II error, and Power before continuing on.
So far we’ve talked about the p-value at the intuitive level: understanding what it is (or what it measures) and how we use it to draw conclusions about the statistical significance of our results. We will now go more deeply into how the p-value is calculated.
It should be mentioned that eventually we will rely on technology to calculate the p-value for us (as well as the test statistic), but in order to make intelligent use of the output, it is important to first understand the details, and only then let the computer do the calculations for us. Again, our goal is to use this simple example to give you the tools you need to understand the process entirely. Let’s start.
Recall that so far we have said that the p-value is the probability of obtaining data like those observed assuming that Ho is true. Like the test statistic, the p-value is, therefore, a measure of the evidence against Ho. In the case of the test statistic, the larger it is in magnitude (positive or negative), the further phat is from p_{0}, and the more evidence we have against Ho. In the case of the p-value, it is the opposite; the smaller it is, the more unlikely it is to get data like those observed when Ho is true, and the more evidence there is against Ho. One can actually draw conclusions in hypothesis testing using just the test statistic, and as we’ll see, the p-value is, in a sense, just another way of looking at the test statistic. The reason that we take the extra step in this course and derive the p-value from the test statistic is that even though in this case (the test about the population proportion) and some other tests the value of the test statistic has a very clear and intuitive interpretation, there are some tests where its value is not as easy to interpret. On the other hand, the p-value keeps its intuitive appeal across all statistical tests.
How is the p-value calculated?
Intuitively, the p-value is the probability of observing data like those observed assuming that Ho is true. Let’s be a bit more formal:
Putting it all together, we get that in general:
By “extreme” we mean extreme in the direction(s) of the alternative hypothesis.
Specifically, for the z-test for the population proportion:
OK, hopefully that makes (some) sense. But how do we actually calculate it?
Recall the important comment from our discussion about our test statistic,
which said that when the null hypothesis is true (i.e., when p = p_{0}), the possible values of our test statistic follow a standard normal (N(0,1), denoted by Z) distribution. Therefore, the p-value calculations (which assume that Ho is true) are simply standard normal distribution calculations for the three possible alternative hypotheses.
When the alternative is “less than,” the p-value is the probability of observing a test statistic as small as that observed or smaller, assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a left-tailed test. We shaded to the left of the test statistic, since “less than” is to the left.
When the alternative is “greater than,” the p-value is the probability of observing a test statistic as large as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a right-tailed test. We shaded to the right of the test statistic, since “greater than” is to the right.
When the alternative is “not equal to,” the p-value is the probability of observing a test statistic which is as large in magnitude as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution.
This is often referred to as a two-tailed test, since we shaded in both directions.
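The three cases just described can be collected into one small helper (an illustrative sketch using only the standard library; the function name and labels are ours):

```python
from statistics import NormalDist

def p_value(z, alternative):
    """p-value for a z test statistic under the standard normal null distribution."""
    cdf = NormalDist().cdf
    if alternative == "less":        # left-tailed:  P(Z <= z)
        return cdf(z)
    if alternative == "greater":     # right-tailed: P(Z >= z)
        return 1 - cdf(z)
    return 2 * (1 - cdf(abs(z)))     # two-tailed:   2 * P(Z >= |z|)

# The test statistics from the three examples in this section:
p1 = p_value(-2.0, "less")       # about 0.023 (example 1)
p2 = p_value(0.91, "greater")    # about 0.182 (example 2)
p3 = p_value(2.31, "not equal")  # about 0.021 (example 3)
```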
Next, we will apply this to our three examples. But first, work through the following activities, which should help your understanding.
Has the proportion of defective products been reduced as a result of the repair?
The p-value in this case is:
OR (recalling what the test statistic actually means in this case),
OR, more specifically,
In either case, the p-value is found as shown in the following figure:
To find P(Z ≤ −2) we can use either the calculator or the table we learned to use in the probability unit for normal random variables. Eventually, after we understand the details, we will use software to run the test for us, and the output will give us all the information we need. The p-value that the statistical software provides for this specific example is 0.023. The p-value tells us that it is pretty unlikely (probability of 0.023) to get data like those observed (a test statistic of −2 or less) assuming that Ho is true.
Is the proportion of marijuana users in the college higher than the national figure?
The p-value in this case is:
OR (recalling what the test statistic actually means in this case),
OR, more specifically,
In either case, the p-value is found as shown in the following figure:
Again, at this point we can use either the calculator or the table to find that the p-value is 0.182; this is P(Z ≥ 0.91).
The p-value tells us that it is not very surprising (probability of 0.182) to get data like those observed (which yield a test statistic of 0.91 or higher) assuming that the null hypothesis is true.
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
The p-value in this case is:
OR (recalling what the test statistic actually means in this case),
OR, more specifically,
In either case, the p-value is found as shown in the following figure:
Again, at this point we can use either the calculator or the table to find that the p-value is 0.021; this is P(Z ≤ −2.31) + P(Z ≥ 2.31) = 2 * P(Z ≥ 2.31).
The p-value tells us that it is pretty unlikely (probability of 0.021) to get data like those observed (a test statistic as high as 2.31 or higher, or as low as −2.31 or lower) assuming that Ho is true.
Comment:
Similarly, in any test, p-values are found using the sampling distribution of the test statistic when the null hypothesis is true (also known as the “null distribution” of the test statistic). In this case, it was relatively easy to argue that the null distribution of our test statistic is N(0,1). As we’ll see, in other tests, other distributions come up (like the t-distribution and the F-distribution), which we will just mention briefly, and we will rely heavily on the output of our statistical package for obtaining the p-values.
We’ve just completed our discussion about the p-value, and how it is calculated both in general and more specifically for the z-test for the population proportion. Let’s go back to the four-step process of hypothesis testing and see what we’ve covered and what still needs to be discussed.
With respect to the z-test for the population proportion:
Step 1: Completed
Step 2: Completed
Step 3: Completed
Step 4. This is what we will work on next.
After the hypotheses have been stated, the next step is to obtain a sample (on which the inference will be based), collect relevant data, and summarize them.
It is extremely important that our sample is representative of the population about which we want to draw conclusions. This is ensured when the sample is chosen at random. Beyond the practical issue of ensuring representativeness, choosing a random sample has theoretical importance that we will mention later.
In the case of hypothesis testing for the population proportion (p), we will collect data on the relevant categorical variable from the individuals in the sample and start by calculating the sample proportion phat (the natural quantity to calculate when the parameter of interest is p).
Let’s go back to our three examples and add this step to our figures.
Has the proportion of defective products been reduced as a result of the repair?
Is the proportion of marijuana users in the college higher than the national figure?
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
As we mentioned earlier without going into details, when we summarize the data in hypothesis testing, we go a step beyond calculating the sample statistic and summarize the data with a test statistic. Every test has a test statistic, which to some degree captures the essence of the test. In fact, the p-value, which so far we have looked upon as “the king” (in the sense that everything is determined by it), is actually determined by (or derived from) the test statistic. We will now introduce the test statistic.
The test statistic is a measure of how far the sample proportion phat is from the null value p_{0}, the value that the null hypothesis claims is the value of p. In other words, since phat is what the data estimates p to be, the test statistic can be viewed as a measure of the “distance” between what the data tells us about p and what the null hypothesis claims p to be.
Let’s use our examples to understand this:
Has the proportion of defective products been reduced as a result of the repair?
The parameter of interest is p, the proportion of defective products following the repair.
The data estimate p to be phat = 0.16
The null hypothesis claims that p = 0.20
The data are therefore 0.04 (or 4 percentage points) below the null hypothesis value.
It is hard to evaluate whether this difference of 4% in defective products is enough evidence to say that the repair was effective at reducing the proportion of defective products, but clearly, the larger the difference, the more evidence it is against the null hypothesis. So if, for example, our sample proportion of defective products had been, say, 0.10 instead of 0.16, then I think you would all agree that cutting the proportion of defective products in half (from 20% to 10%) would be extremely strong evidence that the repair was effective at reducing the proportion of defective products.
Is the proportion of marijuana users in the college higher than the national figure?
The parameter of interest is p, the proportion of students in a college who use marijuana.
The data estimate p to be phat = 0.19
The null hypothesis claims that p = 0.157
The data are therefore 0.033 (or 3.3 percentage points) above the null hypothesis value.
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
The parameter of interest is p, the proportion of U.S. adults who support the death penalty for convicted murderers.
The data estimate p to be phat = 0.675
The null hypothesis claims that p = 0.64
There is a difference of 0.035 (or 3.5 percentage points) between the data and the null hypothesis value.
The problem with looking only at the difference between the sample proportion, phat, and the null value, p_{0} is that we have not taken into account the variability of our estimator phat which, as we know from our study of sampling distributions, depends on the sample size.
For this reason, the test statistic cannot simply be the difference between phat and p_{0}, but must be some form of that formula that accounts for the sample size. In other words, we need to somehow standardize the difference so that comparison between different situations will be possible. We are very close to revealing the test statistic, but before we construct it, let’s be reminded of the following two facts from probability:
Fact 1: When we take a random sample of size n from a population with population proportion p, then
Fact 2: The z-score of any normal value (a value that comes from a normal distribution) is calculated by finding the difference between the value and the mean and then dividing that difference by the standard deviation (of the normal distribution associated with the value). The z-score represents how many standard deviations below or above the mean the value is.
Thus, our test statistic should be a measure of how far the sample proportion phat is from the null value p_{0} relative to the variation of phat (as measured by the standard error of phat).
Recall that the standard error is the standard deviation of the sampling distribution for a given statistic. For phat, we know the following:
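As a quick illustration, the standard error of phat under the null hypothesis can be computed directly from the formula sqrt(p_{0}(1 − p_{0})/n). Here is a minimal Python sketch (the function name `standard_error` is mine, chosen for illustration):

```python
from math import sqrt

def standard_error(p0, n):
    """Standard error of the sample proportion phat when the true proportion is p0."""
    return sqrt(p0 * (1 - p0) / n)

# Defective-products example: p0 = 0.20, n = 400
print(round(standard_error(0.20, 400), 4))  # 0.02
```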
To find the p-value, we will need to determine how surprising our value is assuming the null hypothesis is true. We already have the tools needed for this process from our study of sampling distributions as represented in the table above.
Has the proportion of defective products been reduced as a result of the repair?
If we assume the null hypothesis is true, we can specify that the center of the distribution of all possible values of phat from samples of size 400 would be 0.20 (our null value).
We can calculate the standard error, assuming p = 0.20 as
The following picture represents the sampling distribution of all possible values of phat of samples of size 400, assuming the true proportion p is 0.20 and our other requirements for the sampling distribution to be normal are met (we will review these during the next step).
In order to calculate probabilities for the picture above, we would need to find the z-score associated with our result.
This z-score is the test statistic! In this example, the numerator of our z-score is the difference between phat (0.16) and the null value (0.20), which is −0.04. The denominator of our z-score is the standard error calculated above (0.02), and thus we quickly find the z-score, our test statistic, to be −0.04/0.02 = −2.
The sample proportion based upon this data is 2 standard errors below the null value.
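The whole calculation for this example can be sketched in a few lines of Python (variable names are mine, for illustration):

```python
from math import sqrt

# Defective-products example: 64 defectives out of 400 sampled products
p0, n = 0.20, 400                 # null value and sample size
phat = 64 / n                     # observed sample proportion, 0.16
se = sqrt(p0 * (1 - p0) / n)      # standard error under Ho, 0.02
z = (phat - p0) / se              # the test statistic
print(round(z, 2))  # -2.0
```

The negative sign records the direction: the sample proportion fell below the null value.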
Hopefully you now understand more about the reasons we need probability in statistics!!
Now we will formalize the definition and look at our remaining examples before moving on to the next step, which will be to determine if a normal distribution applies and calculate the p-value.
Test Statistic for Hypothesis Tests for One Proportion is:

z = (phat − p_{0}) / sqrt(p_{0}(1 − p_{0})/n)
It represents the difference between the sample proportion and the null value, measured in standard deviations (standard error of phat).
The picture above is a representation of the sampling distribution of phat assuming p = p_{0}. In other words, this is a model of how phat behaves if we are drawing random samples from a population for which Ho is true.
Notice the center of the sampling distribution is at p_{0}, which is the hypothesized proportion given in the null hypothesis (Ho: p = p_{0}.) We could also mark the axis in standard error units,
For example, if our null hypothesis claims that the proportion of U.S. adults supporting the death penalty is 0.64, then the sampling distribution is drawn as if the null is true. We draw a normal distribution centered at 0.64 (p_{0}) with a standard error dependent on sample size,
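For the death-penalty example, that standard error can be computed directly; a short sketch (assuming the sample size of 1,000 given in the example):

```python
from math import sqrt

# Null distribution for the death-penalty example: centered at p0 = 0.64, n = 1000
p0, n = 0.64, 1000
se = sqrt(p0 * (1 - p0) / n)
print(round(se, 4))  # 0.0152
```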
Important Comment:
By “null distribution,” we mean the distribution under the assumption that Ho is true. As we’ll see and stress again later, the null distribution of the test statistic is what the calculation of the p-value is based on.
Let’s go back to our remaining two examples and find the test statistic in each case:
Is the proportion of marijuana users in the college higher than the national figure?
Since the null hypothesis is Ho: p = 0.157, the standardized (z) score of phat = 0.19 is
This is the value of the test statistic for this example.
We interpret this to mean that, assuming that Ho is true, the sample proportion phat = 0.19 is 0.91 standard errors above the null value (0.157).
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
Since the null hypothesis is Ho: p = 0.64, the standardized (z) score of phat = 0.675 is
This is the value of the test statistic for this example.
We interpret this to mean that, assuming that Ho is true, the sample proportion phat = 0.675 is 2.31 standard errors above the null value (0.64).
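Both of these test statistics can be verified with a short helper function (the name `z_statistic` is mine, for illustration):

```python
from math import sqrt

def z_statistic(phat, p0, n):
    """Standardized distance of phat from the null value p0, in standard-error units."""
    return (phat - p0) / sqrt(p0 * (1 - p0) / n)

# Example 2 (marijuana use): phat = 19/100 = 0.19, p0 = 0.157
print(round(z_statistic(0.19, 0.157, 100), 2))   # 0.91
# Example 3 (death penalty): phat = 675/1000 = 0.675, p0 = 0.64
print(round(z_statistic(0.675, 0.64, 1000), 2))  # 2.31
```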
Comments about the Test Statistic:
Comments:
When we take a random sample of size n from a population with population proportion p_{0}, the possible values of the sample proportion phat (when certain conditions are met) have approximately a normal distribution with a mean of p_{0}… and a standard deviation of
This result provides the theoretical justification for constructing the test statistic the way we did, and therefore the assumptions under which this result holds (in bold, above) are the conditions that our data need to satisfy so that we can use this test. These two conditions are:
i. The sample has to be random.
ii. The conditions under which the sampling distribution of phat is normal are met. In other words:
Let’s check the conditions in our three examples.
Has the proportion of defective products been reduced as a result of the repair?
i. The 400 products were chosen at random.
ii. n = 400, p_{0} = 0.2 and therefore:
Is the proportion of marijuana users in the college higher than the national figure?
i. The 100 students were chosen at random.
ii. n = 100, p_{0} = 0.157 and therefore:
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
i. The 1000 adults were chosen at random.
ii. n = 1000, p_{0} = 0.64 and therefore:
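These condition checks are mechanical, so they are easy to script. The sketch below uses one common version of the normality condition, n·p_{0} ≥ 10 and n·(1 − p_{0}) ≥ 10 (courses vary in the exact cutoff, so treat the 10 as an assumption):

```python
def conditions_met(n, p0):
    """Common form of the normality condition: n*p0 >= 10 and n*(1 - p0) >= 10."""
    return n * p0 >= 10 and n * (1 - p0) >= 10

for label, n, p0 in [("defective products", 400, 0.20),
                     ("marijuana use", 100, 0.157),
                     ("death penalty", 1000, 0.64)]:
    print(label, round(n * p0, 1), round(n * (1 - p0), 1), conditions_met(n, p0))
```

All three examples pass comfortably, so the z-test can be used in each case.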
Checking that our data satisfy the conditions under which the test can be reliably used is a very important part of the hypothesis testing process. Be sure to consider this for every hypothesis test you conduct in this course and certainly in practice.
With respect to the z-test for the population proportion that we are currently discussing, we have:
Step 1: Completed
Step 2: Completed
Step 3: This is what we will work on next.
Now that we understand the process of hypothesis testing and the logic behind it, we are ready to start learning about specific statistical tests (also known as significance tests).
The first test we are going to learn is the test about the population proportion (p).
We will understand later where the “z-test” part is coming from.
This will be the only type of problem you will complete entirely “by hand” in this course. Our goal is to use this example to give you the tools you need to understand how this process works. After working a few problems, you should review the earlier material again. You will likely need to review the terminology and concepts a few times before you fully understand the process.
In reality, you will often be conducting more complex statistical tests and allowing software to provide the pvalue. In these settings it will be important to know what test to apply for a given situation and to be able to explain the results in context.
When we conduct a test about a population proportion, we are working with a categorical variable. Later in the course, after we have learned a variety of hypothesis tests, we will need to be able to identify which test is appropriate for which situation. Identifying the variable as categorical or quantitative is an important component of choosing an appropriate hypothesis test.
In this part of our discussion on hypothesis testing, we will go into details that we did not go into before. More specifically, we will use this test to introduce the idea of a test statistic, and details about how pvalues are calculated.
Let’s start by introducing the three examples, which will be the leading examples in our discussion. Each example is followed by a figure illustrating the information provided, as well as the question of interest.
A machine is known to produce 20% defective products, and is therefore sent for repair. After the machine is repaired, 400 products produced by the machine are chosen at random and 64 of them are found to be defective. Do the data provide enough evidence that the proportion of defective products produced by the machine (p) has been reduced as a result of the repair?
The following figure displays the information, as well as the question of interest:
The question of interest helps us formulate the null and alternative hypotheses in terms of p, the proportion of defective products produced by the machine following the repair:
Ho: p = 0.20 (No change; the repair did not help).
Ha: p < 0.20 (The repair was effective at reducing the proportion of defective parts).
There are rumors that students at a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of 100 students from the college, 19 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is 0.157? (This number is reported by the Harvard School of Public Health.)
Again, the following figure displays the information as well as the question of interest:
As before, we can formulate the null and alternative hypotheses in terms of p, the proportion of students in the college who use marijuana:
Ho: p = 0.157 (same as among all college students in the country).
Ha: p > 0.157 (higher than the national figure).
Polls on certain topics are conducted routinely in order to monitor changes in the public’s opinions over time. One such topic is the death penalty. In 2003 a poll estimated that 64% of U.S. adults support the death penalty for a person convicted of murder. In a more recent poll, 675 out of 1,000 U.S. adults chosen at random were in favor of the death penalty for convicted murderers. Do the results of this poll provide evidence that the proportion of U.S. adults who support the death penalty for convicted murderers (p) changed between 2003 and the later poll?
Here is a figure that displays the information, as well as the question of interest:
Again, we can formulate the null and alternative hypotheses in terms of p, the proportion of U.S. adults who support the death penalty for convicted murderers:
Ho: p = 0.64 (No change from 2003).
Ha: p ≠ 0.64 (Some change since 2003).
Recall that there are basically 4 steps in the process of hypothesis testing:
We are now going to go through these steps as they apply to the hypothesis testing for the population proportion p. It should be noted that even though the details will be specific to this particular test, some of the ideas that we will add apply to hypothesis testing in general.
Here again are the three sets of hypotheses that are being tested in each of our three examples:
Has the proportion of defective products been reduced as a result of the repair?
Is the proportion of marijuana users in the college higher than the national figure?
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
The null hypothesis always takes the form:
and the alternative hypothesis takes one of the following three forms:
Note that it was quite clear from the context which form of the alternative hypothesis would be appropriate. The value that is specified in the null hypothesis is called the null value, and is generally denoted by p_{0}. We can say, therefore, that in general the null hypothesis about the population proportion (p) would take the form:
We write Ho: p = p_{0} to say that we are making the hypothesis that the population proportion has the value of p_{0}. In other words, p is the unknown population proportion and p_{0} is the number we think p might be for the given situation.
The alternative hypothesis takes one of the following three forms (depending on the context):
The first two possible forms of the alternatives (where the = sign in Ho is challenged by < or >) are called one-sided alternatives, and the third form of alternative (where the = sign in Ho is challenged by ≠) is called a two-sided alternative. To understand the intuition behind these names let’s go back to our examples.
Example 3 (death penalty) is a case where we have a two-sided alternative:
In this case, in order to reject Ho and accept Ha we will need to get a sample proportion of death penalty supporters which is very different from 0.64 in either direction, either much larger or much smaller than 0.64.
In example 2 (marijuana use) we have a one-sided alternative:
Here, in order to reject Ho and accept Ha we will need to get a sample proportion of marijuana users which is much higher than 0.157.
Similarly, in example 1 (defective products), where we are testing:
in order to reject Ho and accept Ha, we will need to get a sample proportion of defective products which is much smaller than 0.20.
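Although p-values are computed in a later step, the practical difference between one-sided and two-sided alternatives is exactly which tail (or tails) of the null distribution counts as evidence. A hedged sketch, using the standard normal distribution via Python's error function (the function names are mine, for illustration):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative probability, computed via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_value(z, alternative):
    """Tail area matching the form of Ha; alternative is '<', '>', or '!='."""
    if alternative == "<":                  # Ha: p < p0 -> left tail only
        return normal_cdf(z)
    if alternative == ">":                  # Ha: p > p0 -> right tail only
        return 1 - normal_cdf(z)
    return 2 * (1 - normal_cdf(abs(z)))     # Ha: p != p0 -> both tails
```

Note that the two-sided p-value counts extreme results in either direction, which is why a two-sided test needs a sample proportion far from the null value in either direction to reject Ho.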
Now that we understand the general idea of how statistical hypothesis testing works, let’s go back to each of the steps and delve slightly deeper, getting more details and learning some terminology.
In all three examples, our aim is to decide between two opposing points of view, Claim 1 and Claim 2. In hypothesis testing, Claim 1 is called the null hypothesis (denoted “Ho“), and Claim 2 plays the role of the alternative hypothesis (denoted “Ha“). As we saw in the three examples, the null hypothesis suggests nothing special is going on; in other words, there is no change from the status quo, no difference from the traditional state of affairs, no relationship. In contrast, the alternative hypothesis disagrees with this, stating that something is going on, or there is a change from the status quo, or there is a difference from the traditional state of affairs. The alternative hypothesis, Ha, usually represents what we want to check or what we suspect is really going on.
Let’s go back to our three examples and apply the new notation:
In example 1:
In example 2:
In example 3:
This step is pretty obvious. This is what inference is all about. You look at sampled data in order to draw conclusions about the entire population. In the case of hypothesis testing, based on the data, you draw conclusions about whether or not there is enough evidence to reject Ho.
There is, however, one detail that we would like to add here. In this step we collect data and summarize it. Go back and look at the second step in our three examples. Note that in order to summarize the data we used simple sample statistics such as the sample proportion (phat), sample mean (xbar) and the sample standard deviation (s).
In practice, you go a step further and use these sample statistics to summarize the data with what’s called a test statistic. We are not going to go into any details right now, but we will discuss test statistics when we go through the specific tests.
This step will also involve checking any conditions or assumptions required to use the test.
As we saw, this is the step where we calculate how likely it is to get data like those observed (or more extreme) when Ho is true. In a sense, this is the heart of the process, since we draw our conclusions based on this probability.
In our three examples, the p-values were given to you (and you were reassured that you didn’t need to worry about how these were derived yet):
Obviously, the smaller the p-value, the more surprising it is to get data like ours (or more extreme) when Ho is true, and therefore, the stronger the evidence the data provide against Ho.
Looking at the p-values of our three examples, we see that the data we observed in example 2 provide the strongest evidence against the null hypothesis, followed by example 1, while the data in example 3 provide the least evidence against Ho.
Comment:
Since our statistical conclusion is based on how small the p-value is, or in other words, how surprising our data are when Ho is true, it would be nice to have some kind of guideline or cutoff that will help determine how small the p-value must be, or how “rare” (unlikely) our data must be when Ho is true, for us to conclude that we have enough evidence to reject Ho.
This cutoff exists, and because it is so important, it has a special name. It is called the significance level of the test and is usually denoted by the Greek letter α (alpha). The most commonly used significance level is α (alpha) = 0.05 (or 5%). This means that:
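The decision rule is simple enough to express in code. This sketch uses one common convention (reject Ho when the p-value is at or below α); the helper name and the sample p-values passed to it are hypothetical:

```python
def decision(p_value, alpha=0.05):
    """Compare a p-value to the significance level alpha (common convention: reject when p <= alpha)."""
    if p_value <= alpha:
        return "reject Ho in favor of Ha"
    return "do not reject Ho (not enough evidence)"

print(decision(0.002))  # reject Ho in favor of Ha
print(decision(0.125))  # do not reject Ho (not enough evidence)
```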
Now that we have a cutoff to use, here are the appropriate conclusions for each of our examples based upon the p-values we were given.
In Example 1:
In Example 2:
In Example 3:
Notice that all of the above conclusions are written in terms of the alternative hypothesis and are given in the context of the situation. In no situation have we claimed the null hypothesis is true. Be very careful of this and other issues discussed in the following comments.
Comments:
As we mentioned earlier, note that the second conclusion does not imply that I accept Ho, but just that I don’t have enough evidence to reject it. Saying (by mistake) “I don’t have enough evidence to reject Ho so I accept it” indicates that the data provide evidence that Ho is true, which is not necessarily the case. Consider the following slightly artificial yet effective example:
An employer claims to subscribe to an “equal opportunity” policy, not hiring men any more often than women for managerial positions. Is this credible? You’re not sure, so you want to test the following two hypotheses:
Data: You choose at random three of the new managers who were hired in the last 5 years and find that all 3 are men.
Assessing Evidence: If the proportion of male managers hired is really 0.5 (Ho is true), then, by the multiplication rule for independent events, the probability that the random selection of three managers will yield three males is 0.5 * 0.5 * 0.5 = 0.125. This is the p-value.
Conclusion: Using 0.05 as the significance level, you conclude that since the p-value = 0.125 > 0.05, the fact that the three randomly selected managers were all males is not enough evidence to reject the employer’s claim of subscribing to an equal opportunity policy (Ho).
However, the data (all three selected are males) definitely do NOT provide evidence to accept the employer’s claim (Ho).
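The arithmetic in this example is short enough to check directly:

```python
# Employer "equal opportunity" example: probability that all 3 randomly
# selected managers are male when Ho (p = 0.5) is true -- the p-value,
# by the multiplication rule for independent events.
p_value = 0.5 ** 3
print(p_value)          # 0.125
print(p_value > 0.05)   # True -> not enough evidence to reject Ho
```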
Comment about wording: Another common wording in scientific journals is:
Often you will see significance levels reported with additional description to indicate the degree of statistical significance. A general guideline (although not required in our course) is:
We learned quite a lot about hypothesis testing. We learned the logic behind it, what the key elements are, and what types of conclusions we can and cannot draw in hypothesis testing. Here is a quick recap:
Here are a few more activities if you need some additional practice.
Comments:
In this setting, if the p-value is very small, this implies, assuming the null hypothesis is true, that it is extremely unlikely that the results we have obtained would have happened due to random error alone, and thus our assumption (Ho) is rejected in favor of the alternative hypothesis (Ha).
Remember:
We are in the middle of the part of the course that has to do with inference for one variable.
So far, we talked about point estimation and learned how interval estimation enhances it by quantifying the magnitude of the estimation error (with a certain level of confidence) in the form of the margin of error. The result is the confidence interval — an interval that, with a certain confidence, we believe captures the unknown parameter.
We are now moving to the other kind of inference, hypothesis testing. We say that hypothesis testing is “the other kind” because, unlike the inferential methods we presented so far, where the goal was estimating the unknown parameter, the idea, logic and goal of hypothesis testing are quite different.
In the first two parts of this section we will discuss the idea behind hypothesis testing, explain how it works, and introduce new terminology that emerges in this form of inference. The final two parts will be more specific and will discuss hypothesis testing for the population proportion (p) and the population mean (μ, mu).
If this is your first statistics course, you will need to spend considerable time on this topic as there are many new ideas. Many students find this process and its logic difficult to understand in the beginning.
In this section, we will use the hypothesis test for a population proportion to motivate our understanding of the process. We will conduct these tests manually. For all future hypothesis test procedures, including problems involving means, we will use software to obtain the results and focus on interpreting them in the context of our scenario.
The purpose of this section is to gradually build your understanding about how statistical hypothesis testing works. We start by explaining the general logic behind the process of hypothesis testing. Once we are confident that you understand this logic, we will add some more details and terminology.
To start our discussion about the idea behind statistical hypothesis testing, consider the following example:
A case of suspected cheating on an exam is brought in front of the disciplinary committee at a certain university.
There are two opposing claims in this case:
Adhering to the principle “innocent until proven guilty,” the committee asks the instructor for evidence to support his claim. The instructor explains that the exam had two versions, and shows the committee members that on three separate exam questions, the student used in his solution numbers that were given in the other version of the exam.
The committee members all agree that it would be extremely unlikely to get evidence like that if the student’s claim of not cheating had been true. In other words, the committee members all agree that the instructor brought forward strong enough evidence to reject the student’s claim, and conclude that the student did cheat on the exam.
What does this example have to do with statistics?
While it is true that this story seems unrelated to statistics, it captures all the elements of hypothesis testing and the logic behind it. Before you read on to understand why, it would be useful to read the example again. Please do so now.
Statistical hypothesis testing is defined as:
Here is how the process of statistical hypothesis testing works:
In our story, the committee decided that it would be extremely unlikely to find the evidence that the instructor provided had the student’s claim of not cheating been true. In other words, the members felt that it is extremely unlikely that it is just a coincidence (random chance) that the student used the numbers from the other version of the exam on three separate problems. The committee members therefore decided to reject the student’s claim and concluded that the student had, indeed, cheated on the exam. (Wouldn’t you conclude the same?)
Hopefully this example helped you understand the logic behind hypothesis testing.
To strengthen your understanding of the process of hypothesis testing and the logic behind it, let’s look at three statistical examples.
A recent study estimated that 20% of all college students in the United States smoke. The head of Health Services at Goodheart University (GU) suspects that the proportion of smokers may be lower at GU. In hopes of confirming her claim, the head of Health Services chooses a random sample of 400 Goodheart students, and finds that 70 of them are smokers.
Let’s analyze this example using the 4 steps outlined above:
Claim 1 basically says “nothing special goes on at Goodheart University; the proportion of smokers there is no different from the proportion in the entire country.” This claim is challenged by the head of Health Services, who suspects that the proportion of smokers at Goodheart is lower.
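A quick check of the numbers in this example (variable names are mine, for illustration):

```python
# Goodheart University smokers example: summarizing the sample
n, smokers = 400, 70
phat = smokers / n
print(phat)  # 0.175
# The sample proportion 0.175 is below the national figure of 0.20, in the
# direction the head of Health Services suspects; whether it is convincingly
# below is what the hypothesis test must decide.
```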
A certain prescription allergy medicine is supposed to contain an average of 245 parts per million (ppm) of a certain chemical. If the concentration is higher than 245 ppm, the drug will likely cause unpleasant side effects, and if the concentration is below 245 ppm, the drug may be ineffective. The manufacturer wants to check whether the mean concentration in a large shipment is the required 245 ppm or not. To this end, a random sample of 64 portions from the large shipment is tested, and it is found that the sample mean concentration is 250 ppm with a sample standard deviation of 12 ppm.
Note that again, claim 1 basically says: “There is nothing unusual about this shipment, the mean concentration is the required 245 ppm.” This claim is challenged by the manufacturer, who wants to check whether that is, indeed, the case or not.
Do you think that you’re getting it? Let’s make sure, and look at another example.
Is there a relationship between gender and combined scores (Math + Verbal) on the SAT exam?
Following a report on the College Board website, which showed that in 2003, males scored generally higher than females on the SAT exam, an educational researcher wanted to check whether this was also the case in her school district. The researcher chose random samples of 150 males and 150 females from her school district, collected data on their SAT performance and found the following:
[Table: summary statistics of combined SAT scores (Math + Verbal) for the samples of 150 females and 150 males]
Again, let’s see how the process of hypothesis testing works for this example:
Note that again, claim 1 basically says: “There is nothing going on between the variables SAT and gender.” Claim 2 represents what the researcher wants to check, or suspects might actually be the case.
Comment:
In particular, note that in the second type of conclusion we did not say: “I accept claim 1,” but only “I don’t have enough evidence to reject claim 1.” We will come back to this issue later, but this is a good place to make you aware of this subtle difference.
Hopefully by now, you understand the logic behind the statistical hypothesis testing process. Here is a summary: