This last part of the four-step process of hypothesis testing is the same across all statistical tests, and actually, we’ve already said basically everything there is to say about it, but it can’t hurt to say it again.
The p-value is a measure of how much evidence the data present against Ho. The smaller the p-value, the more evidence the data present against Ho.
We already mentioned that what determines what constitutes enough evidence against Ho is the significance level (α, alpha), a cutoff point below which the p-value is considered small enough to reject Ho in favor of Ha. The most commonly used significance level is 0.05.
Here, instead of writing “Ha is true,” we state what this means in the words of the problem; in other words, in the context of the current scenario.
It is important to mention again that this step has essentially two substeps:
Note: We must still always consider whether the results have any practical significance, particularly if they are statistically significant, since a statistically significant result that has no practical use is essentially meaningless!
Let’s go back to our three examples and draw conclusions.
Has the proportion of defective products been reduced as a result of the repair?
We found that the p-value for this test was 0.023.
Since 0.023 is small (in particular, 0.023 < 0.05), the data provide enough evidence to reject Ho.
Conclusion:
The following figure is the complete story of this example, and includes all the steps we went through, starting from stating the hypotheses and ending with our conclusions:
Is the proportion of marijuana users in the college higher than the national figure?
We found that the p-value for this test was 0.182.
Since 0.182 is not small (in particular, 0.182 > 0.05), the data do not provide enough evidence to reject Ho.
Conclusion:
Here is the complete story of this example:
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
We found that the p-value for this test was 0.021.
Since 0.021 is small (in particular, 0.021 < 0.05), the data provide enough evidence to reject Ho.
Conclusion:
Here is the complete story of this example:
Many students wonder why 5% is often selected as the significance level in hypothesis testing, and why 1% is the next most typical level. This is largely a matter of convenience and tradition.
When Ronald Fisher (one of the founders of modern statistics) published one of his tables, he used a mathematically convenient scale that included 5% and 1%. Later, these same 5% and 1% levels were used by other people, in part just because Fisher was so highly esteemed. But mostly these are arbitrary levels.
The idea of selecting some sort of relatively small cutoff was historically important in the development of statistics; but it’s important to remember that there is really a continuous range of increasing confidence towards the alternative hypothesis, not a single all-or-nothing value. There isn’t much meaningful difference, for instance, between a p-value of 0.049 and one of 0.051, and it would be foolish to declare one case definitely a “real” effect and the other case definitely a “random” effect. In either case, the study results were roughly 5% likely by chance if there’s no actual effect.
Whether such a pvalue is sufficient for us to reject a particular null hypothesis ultimately depends on the risk of making the wrong decision, and the extent to which the hypothesized effect might contradict our prior experience or previous studies.
We have now completed going through the four steps of hypothesis testing, and in particular we learned how they are applied to the z-test for the population proportion. Here is a brief summary:
State the null hypothesis:
Ho: p = p_{0}
State the alternative hypothesis:
Ha: p < p_{0} (one-sided)
Ha: p > p_{0} (one-sided)
Ha: p ≠ p_{0} (two-sided)
where the choice of the appropriate alternative (out of the three) is usually quite clear from the context of the problem. If you feel it is not clear, it is most likely a two-sided problem. Students are usually good at recognizing the “more than” and “less than” terminology, but differences can sometimes be more difficult to spot; sometimes this is because you have preconceived ideas of how you think it should be! Use only the information given in the problem.
Obtain data from a sample and:
(i) Check whether the data satisfy the conditions which allow you to use this test.
random sample (or at least a sample that can be considered random in context)
the conditions under which the sampling distribution of p-hat is normal are met
(ii) Calculate the sample proportion p-hat, and summarize the data using the test statistic:
z = (p-hat − p_{0}) / √(p_{0}(1 − p_{0}) / n)
(Recall: This standardized test statistic represents how many standard deviations above or below p_{0} our sample proportion p-hat is.)
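The test statistic described above is easy to compute directly. Here is a minimal Python sketch (an illustration, not part of the original course materials; the sample numbers in the usage line are hypothetical):

```python
from math import sqrt

def z_statistic(p_hat, p0, n):
    """Standardized test statistic for the z-test for a population proportion:
    how many (null) standard errors p_hat lies above or below p0."""
    standard_error = sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_error

# Hypothetical example: p0 = 0.5, n = 100, observed p_hat = 0.60
print(round(z_statistic(0.60, 0.5, 100), 2))  # 2.0
```

Note that the standard error uses p_{0}, not p-hat, because the calculation assumes Ho is true.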
When the alternative hypothesis is “less than,” the p-value is the probability of observing a test statistic as small as that observed or smaller, assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a left-tailed test. We shaded to the left of the test statistic, since “less than” is to the left.
When the alternative hypothesis is “greater than,” the p-value is the probability of observing a test statistic as large as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a right-tailed test. We shaded to the right of the test statistic, since “greater than” is to the right.
When the alternative hypothesis is “not equal to,” the p-value is the probability of observing a test statistic which is as large in magnitude as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution.
This is often referred to as a two-tailed test, since we shaded in both directions.
Reach a conclusion first regarding the statistical significance of the results, and then determine what it means in the context of the problem.
If p-value ≤ 0.05, then WE REJECT Ho
Conclusion: There IS enough evidence that Ha is True
If p-value > 0.05, then WE FAIL TO REJECT Ho
Conclusion: There IS NOT enough evidence that Ha is True
Recall that: If the p-value is small (in particular, smaller than the significance level, which is usually 0.05), the results are statistically significant (in the sense that there is a statistically significant difference between what was observed in the sample and what was claimed in Ho), and so we reject Ho.
If the p-value is not small, we do not have enough statistical evidence to reject Ho, and so we continue to believe that Ho may be true. (Remember: In hypothesis testing we never “accept” Ho).
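The two-branch decision rule above can be expressed as a short Python helper (a sketch for illustration; the wording of the returned conclusions is ours):

```python
def decide(p_value, alpha=0.05):
    """Step 4 of hypothesis testing: compare the p-value to the significance level."""
    if p_value <= alpha:
        return "reject Ho: there IS enough evidence that Ha is true"
    return "fail to reject Ho: there is NOT enough evidence that Ha is true"

print(decide(0.023))  # example 1: statistically significant
print(decide(0.182))  # example 2: not statistically significant
```

Note that the second branch deliberately says “fail to reject,” not “accept”: failing to reject Ho is not the same as concluding that Ho is true.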
Finally, in practice, we should always consider the practical significance of the results as well as the statistical significance.
Before we move on to the next test, we are going to use the z-test for proportions to bring up and illustrate a few more very important issues regarding hypothesis testing. This might also be a good time to review the concepts of Type I error, Type II error, and Power before continuing on.
So far we’ve talked about the p-value at the intuitive level: understanding what it is (or what it measures) and how we use it to draw conclusions about the statistical significance of our results. We will now go more deeply into how the p-value is calculated.
It should be mentioned that eventually we will rely on technology to calculate the p-value for us (as well as the test statistic), but in order to make intelligent use of the output, it is important to first understand the details, and only then let the computer do the calculations for us. Again, our goal is to use this simple example to give you the tools you need to understand the process entirely. Let’s start.
Recall that so far we have said that the p-value is the probability of obtaining data like those observed assuming that Ho is true. Like the test statistic, the p-value is, therefore, a measure of the evidence against Ho. In the case of the test statistic, the larger it is in magnitude (positive or negative), the further p-hat is from p_{0}, and the more evidence we have against Ho. In the case of the p-value, it is the opposite: the smaller it is, the more unlikely it is to get data like those observed when Ho is true, and the more evidence there is against Ho. One can actually draw conclusions in hypothesis testing just using the test statistic, and as we’ll see, the p-value is, in a sense, just another way of looking at the test statistic. The reason that we take the extra step in this course and derive the p-value from the test statistic is that even though in this case (the test about the population proportion) and some other tests the value of the test statistic has a very clear and intuitive interpretation, there are some tests where its value is not as easy to interpret. The p-value, on the other hand, keeps its intuitive appeal across all statistical tests.
How is the p-value calculated?
Intuitively, the p-value is the probability of observing data like those observed assuming that Ho is true. Let’s be a bit more formal:
Putting it all together, we get that in general:
By “extreme” we mean extreme in the direction(s) of the alternative hypothesis.
Specifically, for the z-test for the population proportion:
OK, hopefully that makes (some) sense. But how do we actually calculate it?
Recall the important comment from our discussion about our test statistic,
which said that when the null hypothesis is true (i.e., when p = p_{0}), the possible values of our test statistic follow a standard normal (N(0,1), denoted by Z) distribution. Therefore, the p-value calculations (which assume that Ho is true) are simply standard normal distribution calculations for the 3 possible alternative hypotheses.
The probability of observing a test statistic as small as that observed or smaller, assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a left-tailed test. We shaded to the left of the test statistic, since “less than” is to the left.
The probability of observing a test statistic as large as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a right-tailed test. We shaded to the right of the test statistic, since “greater than” is to the right.
The probability of observing a test statistic which is as large in magnitude as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution.
This is often referred to as a two-tailed test, since we shaded in both directions.
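The three cases above can be collected into one small Python function using the standard normal CDF from the standard library (a sketch for illustration; the string labels for the alternatives are our own, not course notation):

```python
from statistics import NormalDist

Z = NormalDist()  # the standard normal N(0,1) null distribution

def p_value(z, alternative):
    """p-value of the z-test for a proportion, given the observed
    test statistic z and the direction of Ha."""
    if alternative == "less":        # Ha: p < p0 (left-tailed)
        return Z.cdf(z)
    if alternative == "greater":     # Ha: p > p0 (right-tailed)
        return 1 - Z.cdf(z)
    return 2 * (1 - Z.cdf(abs(z)))   # Ha: p ≠ p0 (two-tailed)

# Test statistics from the three examples that follow: -2, 0.91, and ±2.31
print(round(p_value(-2.0, "less"), 3))
print(round(p_value(0.91, "greater"), 3))
print(round(p_value(2.31, "two-sided"), 3))
```

Applied to the examples below, this reproduces the quoted p-values of roughly 0.023, 0.18, and 0.021.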
Next, we will apply this to our three examples. But first, work through the following activities, which should help your understanding.
Has the proportion of defective products been reduced as a result of the repair?
The p-value in this case is:
OR (recalling what the test statistic actually means in this case),
OR, more specifically,
In either case, the p-value is found as shown in the following figure:
To find P(Z ≤ −2) we can use either the calculator or the table we learned to use in the probability unit for normal random variables. Eventually, after we understand the details, we will use software to run the test for us and the output will give us all the information we need. The p-value that the statistical software provides for this specific example is 0.023. The p-value tells us that it is pretty unlikely (probability of 0.023) to get data like those observed (a test statistic of −2 or less) assuming that Ho is true.
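For reference, this left-tail calculation is easy to reproduce with Python’s standard library (an illustration; the course itself uses a calculator, table, or statistical software):

```python
from statistics import NormalDist

# P(Z <= -2): area in the left tail of the standard normal distribution
p = NormalDist().cdf(-2.0)
print(round(p, 3))  # 0.023, matching the software output quoted above
```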
Is the proportion of marijuana users in the college higher than the national figure?
The p-value in this case is:
OR (recalling what the test statistic actually means in this case),
OR, more specifically,
In either case, the p-value is found as shown in the following figure:
Again, at this point we can use either the calculator or the table to find that the p-value, P(Z ≥ 0.91), is 0.182.
The p-value tells us that it is not very surprising (probability of 0.182) to get data like those observed (which yield a test statistic of 0.91 or higher) assuming that the null hypothesis is true.
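Again for reference, the right-tail area can be reproduced in Python (an illustration; note that with the test statistic rounded to 0.91 this gives about 0.181, while the 0.182 above presumably comes from the unrounded statistic):

```python
from statistics import NormalDist

# P(Z >= 0.91): area in the right tail of the standard normal distribution
p = 1 - NormalDist().cdf(0.91)
print(round(p, 3))
```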
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
The p-value in this case is:
OR (recalling what the test statistic actually means in this case),
OR, more specifically,
In either case, the p-value is found as shown in the following figure:
Again, at this point we can use either the calculator or the table to find that the p-value is 0.021; this is P(Z ≤ −2.31) + P(Z ≥ 2.31) = 2 * P(Z ≥ 2.31).
The p-value tells us that it is pretty unlikely (probability of 0.021) to get data like those observed (a test statistic as high as 2.31 or higher, or as low as −2.31 or lower) assuming that Ho is true.
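The two-tailed calculation can likewise be reproduced in Python (an illustration):

```python
from statistics import NormalDist

Z = NormalDist()
# P(Z <= -2.31) + P(Z >= 2.31); by symmetry this equals 2 * P(Z >= 2.31)
p = Z.cdf(-2.31) + (1 - Z.cdf(2.31))
print(round(p, 3))  # 0.021
```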
Comment:
Similarly, in any test, p-values are found using the sampling distribution of the test statistic when the null hypothesis is true (also known as the “null distribution” of the test statistic). In this case, it was relatively easy to argue that the null distribution of our test statistic is N(0,1). As we’ll see, in other tests, other distributions come up (like the t-distribution and the F-distribution), which we will just mention briefly, and rely heavily on the output of our statistical package for obtaining the p-values.
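The claim that the null distribution of our test statistic is (approximately) N(0,1) can be checked with a small simulation (a sketch; the p0 = 0.6 and n = 400 used here are arbitrary choices of ours):

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)
p0, n = 0.6, 400
se = sqrt(p0 * (1 - p0) / n)  # null standard error of the sample proportion

# Draw many samples with Ho true (p = p0) and compute the test statistic each time
zs = []
for _ in range(2000):
    count = sum(1 for _ in range(n) if random.random() < p0)
    zs.append((count / n - p0) / se)

# The simulated z values should look like draws from N(0,1):
# mean close to 0, standard deviation close to 1
print(round(mean(zs), 2), round(stdev(zs), 2))
```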
We’ve just completed our discussion about the p-value, and how it is calculated both in general and more specifically for the z-test for the population proportion. Let’s go back to the four-step process of hypothesis testing and see what we’ve covered and what still needs to be discussed.
With respect to the z-test for the population proportion:
Step 1: Completed
Step 2: Completed
Step 3: Completed
Step 4: This is what we will work on next.
Now that we understand the general idea of how statistical hypothesis testing works, let’s go back to each of the steps and delve slightly deeper, getting more details and learning some terminology.
In all three examples, our aim is to decide between two opposing points of view, Claim 1 and Claim 2. In hypothesis testing, Claim 1 is called the null hypothesis (denoted “Ho”), and Claim 2 plays the role of the alternative hypothesis (denoted “Ha”). As we saw in the three examples, the null hypothesis suggests nothing special is going on; in other words, there is no change from the status quo, no difference from the traditional state of affairs, no relationship. In contrast, the alternative hypothesis disagrees with this, stating that something is going on, or there is a change from the status quo, or there is a difference from the traditional state of affairs. The alternative hypothesis, Ha, usually represents what we want to check or what we suspect is really going on.
Let’s go back to our three examples and apply the new notation:
In example 1:
In example 2:
In example 3:
This step is pretty obvious. This is what inference is all about. You look at sampled data in order to draw conclusions about the entire population. In the case of hypothesis testing, based on the data, you draw conclusions about whether or not there is enough evidence to reject Ho.
There is, however, one detail that we would like to add here. In this step we collect data and summarize it. Go back and look at the second step in our three examples. Note that in order to summarize the data we used simple sample statistics such as the sample proportion (p-hat), the sample mean (x-bar), and the sample standard deviation (s).
In practice, you go a step further and use these sample statistics to summarize the data with what’s called a test statistic. We are not going to go into any details right now, but we will discuss test statistics when we go through the specific tests.
This step will also involve checking any conditions or assumptions required to use the test.
As we saw, this is the step where we calculate how likely it is to get data like that observed (or more extreme) when Ho is true. In a sense, this is the heart of the process, since we draw our conclusions based on this probability.
In our three examples, the p-values were given to you (and you were reassured that you didn’t need to worry about how these were derived yet):
Obviously, the smaller the p-value, the more surprising it is to get data like ours (or more extreme) when Ho is true, and therefore, the stronger the evidence the data provide against Ho.
Looking at the three p-values of our three examples, we see that the data that we observed in example 2 provide the strongest evidence against the null hypothesis, followed by example 1, while the data in example 3 provide the least evidence against Ho.
Comment:
Since our statistical conclusion is based on how small the p-value is, or in other words, how surprising our data are when Ho is true, it would be nice to have some kind of guideline or cutoff that will help determine how small the p-value must be, or how “rare” (unlikely) our data must be when Ho is true, for us to conclude that we have enough evidence to reject Ho.
This cutoff exists, and because it is so important, it has a special name. It is called the significance level of the test and is usually denoted by the Greek letter α (alpha). The most commonly used significance level is α (alpha) = 0.05 (or 5%). This means that:
Now that we have a cutoff to use, here are the appropriate conclusions for each of our examples based upon the p-values we were given.
In Example 1:
In Example 2:
In Example 3:
Notice that all of the above conclusions are written in terms of the alternative hypothesis and are given in the context of the situation. In no situation have we claimed the null hypothesis is true. Be very careful of this and other issues discussed in the following comments.
Comments:
As we mentioned earlier, note that the second conclusion does not imply that I accept Ho, but just that I don’t have enough evidence to reject it. Saying (by mistake) “I don’t have enough evidence to reject Ho so I accept it” indicates that the data provide evidence that Ho is true, which is not necessarily the case. Consider the following slightly artificial yet effective example:
An employer claims to subscribe to an “equal opportunity” policy, not hiring men any more often than women for managerial positions. Is this credible? You’re not sure, so you want to test the following two hypotheses:
Data: You choose at random three of the new managers who were hired in the last 5 years and find that all 3 are men.
Assessing Evidence: If the proportion of male managers hired is really 0.5 (Ho is true), then the probability that the random selection of three managers will yield three males is 0.5 * 0.5 * 0.5 = 0.125 (using the multiplication rule for independent events). This is the p-value.
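The arithmetic here is just the multiplication rule, which a couple of lines of Python confirm:

```python
# Multiplication rule for independent events: P(all three hires are male)
# when each hire is male with probability 0.5 (i.e., when Ho is true)
p_value = 0.5 * 0.5 * 0.5
print(p_value)         # 0.125
print(p_value > 0.05)  # True: not small enough to reject Ho at the 0.05 level
```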
Conclusion: Using 0.05 as the significance level, you conclude that since the p-value = 0.125 > 0.05, the fact that the three randomly selected managers were all males is not enough evidence to reject the employer’s claim of subscribing to an equal opportunity policy (Ho).
However, the data (all three selected are males) definitely do NOT provide evidence to accept the employer’s claim (Ho).
Comment about wording: Another common wording in scientific journals is:
Often you will see significance levels reported with additional description to indicate the degree of statistical significance. A general guideline (although not required in our course) is:
We learned quite a lot about hypothesis testing. We learned the logic behind it, what the key elements are, and what types of conclusions we can and cannot draw in hypothesis testing. Here is a quick recap:
Here are a few more activities if you need some additional practice.
Comments:
In this setting, if the p-value is very small, this implies, assuming the null hypothesis is true, that it is extremely unlikely that the results we have obtained would have happened due to random error alone, and thus our assumption (Ho) is rejected in favor of the alternative hypothesis (Ha).
Remember:
We are in the middle of the part of the course that has to do with inference for one variable.
So far, we talked about point estimation and learned how interval estimation enhances it by quantifying the magnitude of the estimation error (with a certain level of confidence) in the form of the margin of error. The result is the confidence interval — an interval that, with a certain confidence, we believe captures the unknown parameter.
We are now moving to the other kind of inference, hypothesis testing. We say that hypothesis testing is “the other kind” because, unlike the inferential methods we presented so far, where the goal was estimating the unknown parameter, the idea, logic and goal of hypothesis testing are quite different.
In the first two parts of this section we will discuss the idea behind hypothesis testing, explain how it works, and introduce new terminology that emerges in this form of inference. The final two parts will be more specific and will discuss hypothesis testing for the population proportion (p) and the population mean (μ, mu).
If this is your first statistics course, you will need to spend considerable time on this topic as there are many new ideas. Many students find this process and its logic difficult to understand in the beginning.
In this section, we will use the hypothesis test for a population proportion to motivate our understanding of the process. We will conduct these tests manually. For all future hypothesis test procedures, including problems involving means, we will use software to obtain the results and focus on interpreting them in the context of our scenario.
The purpose of this section is to gradually build your understanding about how statistical hypothesis testing works. We start by explaining the general logic behind the process of hypothesis testing. Once we are confident that you understand this logic, we will add some more details and terminology.
To start our discussion about the idea behind statistical hypothesis testing, consider the following example:
A case of suspected cheating on an exam is brought in front of the disciplinary committee at a certain university.
There are two opposing claims in this case:
Adhering to the principle “innocent until proven guilty,” the committee asks the instructor for evidence to support his claim. The instructor explains that the exam had two versions, and shows the committee members that on three separate exam questions, the student used in his solution numbers that were given in the other version of the exam.
The committee members all agree that it would be extremely unlikely to get evidence like that if the student’s claim of not cheating had been true. In other words, the committee members all agree that the instructor brought forward strong enough evidence to reject the student’s claim, and conclude that the student did cheat on the exam.
What does this example have to do with statistics?
While it is true that this story seems unrelated to statistics, it captures all the elements of hypothesis testing and the logic behind it. Before you read on to understand why, it would be useful to read the example again. Please do so now.
Statistical hypothesis testing is defined as:
Here is how the process of statistical hypothesis testing works:
In our story, the committee decided that it would be extremely unlikely to find the evidence that the instructor provided had the student’s claim of not cheating been true. In other words, the members felt that it is extremely unlikely that it is just a coincidence (random chance) that the student used the numbers from the other version of the exam on three separate problems. The committee members therefore decided to reject the student’s claim and concluded that the student had, indeed, cheated on the exam. (Wouldn’t you conclude the same?)
Hopefully this example helped you understand the logic behind hypothesis testing.
To strengthen your understanding of the process of hypothesis testing and the logic behind it, let’s look at three statistical examples.
A recent study estimated that 20% of all college students in the United States smoke. The head of Health Services at Goodheart University (GU) suspects that the proportion of smokers may be lower at GU. In hopes of confirming her claim, the head of Health Services chooses a random sample of 400 Goodheart students, and finds that 70 of them are smokers.
Let’s analyze this example using the 4 steps outlined above:
Claim 1 basically says “nothing special goes on at Goodheart University; the proportion of smokers there is no different from the proportion in the entire country.” This claim is challenged by the head of Health Services, who suspects that the proportion of smokers at Goodheart is lower.
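Since the numerical result of this example is not worked out above, here is a sketch of what the full z-test calculation would look like in Python, using the data given (70 smokers in a random sample of 400, national figure 20%); the outputs are our own computation, not quoted from the course:

```python
from math import sqrt
from statistics import NormalDist

# Ho: p = 0.20 vs. Ha: p < 0.20
p0, n, smokers = 0.20, 400, 70
p_hat = smokers / n                          # 0.175
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # about -1.25
p_value = NormalDist().cdf(z)                # left-tailed: P(Z <= z)
print(round(z, 2), round(p_value, 3))
```

A p-value of roughly 0.106 would not be small enough to reject Ho at the 0.05 level, so by the logic of the four steps, these data alone would not confirm the head of Health Services’ suspicion.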
A certain prescription allergy medicine is supposed to contain an average of 245 parts per million (ppm) of a certain chemical. If the concentration is higher than 245 ppm, the drug will likely cause unpleasant side effects, and if the concentration is below 245 ppm, the drug may be ineffective. The manufacturer wants to check whether the mean concentration in a large shipment is the required 245 ppm or not. To this end, a random sample of 64 portions from the large shipment is tested, and it is found that the sample mean concentration is 250 ppm with a sample standard deviation of 12 ppm.
Note that again, claim 1 basically says: “There is nothing unusual about this shipment, the mean concentration is the required 245 ppm.” This claim is challenged by the manufacturer, who wants to check whether that is, indeed, the case or not.
Do you think that you’re getting it? Let’s make sure, and look at another example.
Is there a relationship between gender and combined scores (Math + Verbal) on the SAT exam?
Following a report on the College Board website, which showed that in 2003, males scored generally higher than females on the SAT exam, an educational researcher wanted to check whether this was also the case in her school district. The researcher chose random samples of 150 males and 150 females from her school district, collected data on their SAT performance and found the following:
(Table: combined SAT score summary statistics for the samples of females and males.)
Again, let’s see how the process of hypothesis testing works for this example:
Note that again, claim 1 basically says: “There is nothing going on between the variables SAT and gender.” Claim 2 represents what the researcher wants to check, or suspects might actually be the case.
Comment:
In particular, note that in the second type of conclusion we did not say: “I accept claim 1,” but only “I don’t have enough evidence to reject claim 1.” We will come back to this issue later, but this is a good place to make you aware of this subtle difference.
Hopefully by now, you understand the logic behind the statistical hypothesis testing process. Here is a summary: