We have almost reached the end of our discussion of probability. We were introduced to the important concept of **random variables**, which are quantitative variables whose value is determined by the outcome of a random experiment.

We discussed discrete and continuous random variables.

We saw that all the information about a **discrete random variable** is packed into its probability distribution. Using that, we can answer probability questions about the random variable and find its **mean and standard deviation**. We ended the part on discrete random variables by presenting a special class of discrete random variables – **binomial random variables.**

As we dove into **continuous random variables**, we saw how calculations can get complicated very quickly, when probabilities associated with a continuous random variable are found by calculating **areas under its density curve**.

As an example of a continuous random variable, we presented the **normal random variable**, and discussed it at length. The normal distribution is extremely important, not just because many variables in real life follow the normal distribution, but mainly because of the important role it plays in statistical inference, the ultimate goal of this course.

We learned how we can avoid calculus by using the **standard normal calculator or table** to find probabilities associated with the normal distribution, and learned how it can be used as an **approximation to the binomial** distribution under certain conditions.
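The table lookup described above can be reproduced in a few lines of code. In the sketch below, the population (mean 100, standard deviation 15) is a hypothetical choice of ours, and Python's standard library stands in for the calculator or table.

```python
from statistics import NormalDist

# Hypothetical normal population: mu = 100, sigma = 15 (our choice, not from the text)
mu, sigma = 100, 15
x = 130

z = (x - mu) / sigma        # z-score: how many standard deviations x lies above the mean
prob = NormalDist().cdf(z)  # area to the left under the standard normal curve

print(z, round(prob, 4))    # 2.0 0.9772
```

The z-score of 2 means the value lies two standard deviations above the mean; the area to its left is the probability of observing a smaller value.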

A random variable is a variable whose values are numerical results of a random experiment.

- A **discrete random variable** is summarized by its probability distribution — a list of its possible values and their corresponding probabilities.

The sum of the probabilities of all possible values must be 1.

The probability distribution can be represented by a table, histogram, or sometimes a formula.

- The **probability distribution** of a random variable can be supplemented with numerical measures of the center and spread of the random variable.

**Center:** The center of a random variable is measured by its mean (which is sometimes also referred to as the **expected value**).

The mean of a random variable can be interpreted as its long run average.

The mean is a weighted average of the possible values of the random variable weighted by their corresponding probabilities.

**Spread:** The spread of a random variable is measured by its variance, or more typically by its standard deviation (the square root of the variance).

The standard deviation of a random variable can be interpreted as the typical (or long-run average) distance between the value that the random variable assumes and its mean.
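Both measures can be computed directly from a probability distribution. The two-coin-toss distribution below is a hypothetical illustration of ours, not an example from the text.

```python
# Hypothetical discrete random variable: X = number of heads in two fair coin tosses
values = [0, 1, 2]
probs = [0.25, 0.50, 0.25]

assert abs(sum(probs) - 1.0) < 1e-9  # probabilities must sum to 1

# Mean: weighted average of the possible values, weighted by their probabilities
mean = sum(x * p for x, p in zip(values, probs))

# Variance: weighted average of squared distances from the mean; sd is its square root
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
sd = variance ** 0.5

print(mean, sd)  # 1.0 0.7071067811865476
```

In the long run we average one head per pair of tosses, with a typical deviation of about 0.71 heads from that mean.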

- The binomial random variable is a type of discrete random variable that is quite common.

- The binomial random variable is defined in a random experiment that consists of n independent trials, each having two possible outcomes (called “success” and “failure”), and each having the same probability of success: p. Such a random experiment is called the binomial random experiment.

- The binomial random variable represents the number of successes (out of n) in a binomial experiment. It can therefore have values as low as 0 (if none of the n trials was a success) and as high as n (if all n trials were successes).

- There are “many” binomial random variables, depending on the number of trials (n) and the probability of success (p).

- The probability distribution of the binomial random variable is given in the form of a formula and can be used to find probabilities. Technology can be used as well.

- The mean and standard deviation of a binomial random variable can be easily found using short-cut formulas.
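The short-cut formulas are mean = np and standard deviation = √(np(1 − p)). A quick sketch (the fair-coin setting is our illustrative choice):

```python
import math

def binomial_mean_sd(n, p):
    """Short-cut formulas for a binomial random variable:
    mean = n*p, standard deviation = sqrt(n*p*(1-p))."""
    return n * p, math.sqrt(n * p * (1 - p))

# Hypothetical example: number of heads in 80 tosses of a fair coin
mean, sd = binomial_mean_sd(80, 0.5)
print(mean, round(sd, 2))  # 40.0 4.47
```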

The probability distribution of a continuous random variable is represented by a probability density curve. The probability that the random variable takes a value in any interval of interest is the area above this interval and below the density curve.

An important example of a continuous random variable is the **normal random variable**, whose probability density curve is symmetric (bell-shaped), bulging in the middle and tapering at the ends.

- There are “many” normal random variables, each determined by its mean *μ* (mu) (which determines where the density curve is centered) and standard deviation σ (sigma) (which determines how spread out (wide) the normal density curve is).

- Any normal random variable follows the Standard Deviation Rule, which can help us find probabilities associated with the normal random variable.

- Another way to find probabilities associated with the normal random variable is using the standard normal table. This process involves finding the z-score of values, which tells us how many standard deviations below or above the mean the value is.

- An important application of the normal random variable is that it can be used as an approximation of the binomial random variable (under certain conditions). A continuity correction can improve this approximation.
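To see the approximation at work, the sketch below compares an exact binomial probability with its continuity-corrected normal approximation; the Binomial(80, 0.5) setting is our illustrative choice.

```python
from math import comb, sqrt
from statistics import NormalDist

# Hypothetical count: X ~ Binomial(n = 80, p = 0.5); we want P(X >= 48)
n, p, k = 80, 0.5, 48

# Exact binomial probability, summed from the binomial formula
exact = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Normal approximation with continuity correction: P(X >= 48) is approximated
# by P(Y >= 47.5), where Y is normal with mean n*p and sd sqrt(n*p*(1-p))
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = 1 - NormalDist(mu, sigma).cdf(k - 0.5)

print(round(exact, 4), round(approx, 4))
```

With the continuity correction, the two answers agree closely; without it (using 48 instead of 47.5), the approximation is noticeably worse.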

The issues regarding hypothesis testing that we will discuss are:

- The effect of sample size on hypothesis testing.
- Statistical significance vs. practical importance.
- Hypothesis testing and confidence intervals—how are they related?

Let’s begin.

We have already seen the effect that the sample size has on inference, when we discussed point and interval estimation for the population mean (μ, mu) and population proportion (p). Intuitively …

Larger sample sizes give us more information to pin down the true nature of the population. We can therefore expect the **sample** mean and **sample** proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. As a result, for the same level of confidence, we can report a smaller margin of error, and get a narrower confidence interval. What we’ve seen, then, is that larger sample size gives a boost to how much we trust our sample results.

In hypothesis testing, larger sample sizes have a similar effect. We have also discussed that the power of our test increases when the sample size increases, all else remaining the same. This means we have a better chance to detect the difference between the true value and the null value for larger samples.

The following two examples will illustrate that a larger sample size provides more convincing evidence (the test has greater power), and how the evidence manifests itself in hypothesis testing. Let’s go back to our example 2 (marijuana use at a certain liberal arts college).

Is the proportion of marijuana users in the college higher than the national figure?

We do **not** have enough evidence to conclude that the proportion of students at the college who use marijuana is higher than the national figure.

**Now, let’s increase the sample size.**

There are rumors that students in a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that **in a simple random sample of 400 students from the college, 76 admitted to marijuana use**. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is **higher** than the national proportion, which is 0.157? (Reported by the Harvard School of Public Health).

Our results here are statistically **significant**. In other words, in example 2* the data provide enough evidence to reject Ho.

**Conclusion:** There is enough evidence that the proportion of marijuana users at the college is higher than among all U.S. students.

What do we learn from this?

We see that sample results that are based on a larger sample carry more weight (have greater power).

In example 2, we saw that a sample proportion of 0.19 based on a sample of size of 100 was not enough evidence that the proportion of marijuana users in the college is higher than 0.157. Recall, from our general overview of hypothesis testing, that this conclusion (not having enough evidence to reject the null hypothesis) **doesn’t** mean the null hypothesis is necessarily true (so, we never “accept” the null); it only means that the particular study didn’t yield sufficient evidence to reject the null. It **might** be that the sample size was simply too small to detect a statistically significant difference.

However, in example 2*, we saw that when the sample proportion of 0.19 is obtained from a sample of size 400, it carries much more weight, and in particular, provides enough evidence that the proportion of marijuana users in the college is higher than 0.157 (the national figure). In **this** case, the sample size of 400 **was** large enough to detect a statistically significant difference.
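This contrast is easy to reproduce numerically: the same sample proportion (0.19) tested against the same null value (0.157) gives very different p-values for n = 100 and n = 400. A minimal sketch of this right-tailed z-test:

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(p_hat, p0, n):
    """Right-tailed z-test for a proportion (Ha: p > p0)."""
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return 1 - NormalDist().cdf(z)

p_hat, p0 = 0.19, 0.157  # sample proportion and national figure from the text

print(round(one_sided_p_value(p_hat, p0, 100), 3))  # 0.182 -> not significant
print(round(one_sided_p_value(p_hat, p0, 400), 3))  # 0.035 -> significant
```

Quadrupling the sample size shrinks the standard error by half, pushing the same observed difference from "not enough evidence" to "statistically significant."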

The following activity will allow you to practice the ideas and terminology used in hypothesis testing when a result is not statistically significant.

Now, we will address the issue of statistical significance versus practical importance (which also involves issues of sample size).

The following activity will let you explore the effect of the sample size on the statistical significance of the results yourself, and more importantly will discuss issue **2: Statistical significance vs. practical importance.**

This suggests that when interpreting the results of a test, you should always think not only about the statistical significance of the results but also about their practical importance.

The last topic we want to discuss is the relationship between hypothesis testing and confidence intervals. Even though the flavor of these two forms of inference is different (confidence intervals estimate a parameter, and hypothesis testing assesses the evidence in the data against one claim and in favor of another), there is a strong link between them.

We will explain this link (using the z-test and confidence interval for the population proportion), and then explain how confidence intervals can be used after a test has been carried out.

Recall that a confidence interval gives us a set of plausible values for the unknown population parameter. We may therefore examine a confidence interval to informally decide if a proposed value of population proportion seems plausible.

For example, if a 95% confidence interval for p, the proportion of all U.S. adults already familiar with Viagra in May 1998, was (0.61, 0.67), then it seems clear that we should be able to reject a claim that only 50% of all U.S. adults were familiar with the drug, since based on the confidence interval, 0.50 is not one of the plausible values for p.

In fact, the information provided by a confidence interval can be formally related to the information provided by a hypothesis test. (**Comment:** The relationship is more straightforward for two-sided alternatives, and so we will not present results for the one-sided cases.)

Suppose we want to carry out the **two-sided test:**

- Ho: p = p_{0}
- Ha: p ≠ p_{0}

using a significance level of 0.05.

An alternative way to perform this test is to find a 95% **confidence interval** for p and check:

- If p_{0} falls **outside** the confidence interval, **reject** Ho.
- If p_{0} falls **inside** the confidence interval, **do not reject** Ho.

In other words,

- If p_{0} is not one of the plausible values for p, we reject Ho.
- If p_{0} is a plausible value for p, we cannot reject Ho.

(**Comment:** Similarly, the results of a test using a significance level of 0.01 can be related to the 99% confidence interval.)
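The decision rule can be written as a small helper function (the function name is our own, for illustration):

```python
def two_sided_test_via_ci(p0, ci_lower, ci_upper):
    """Two-sided test at the 0.05 level via a 95% confidence interval:
    reject Ho exactly when the null value falls outside the interval."""
    if ci_lower <= p0 <= ci_upper:
        return "do not reject Ho"
    return "reject Ho"

# Viagra example from the text: 95% CI (0.61, 0.67), claimed value p = 0.50
print(two_sided_test_via_ci(0.50, 0.61, 0.67))  # reject Ho
```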

Let’s look at an example:

Recall example 3, where we wanted to know whether the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, when it was 0.64.

We are testing:

**Ho:** p = 0.64 (No change from 2003).
**Ha:** p ≠ 0.64 (Some change since 2003).

and as the figure reminds us, we took a sample of 1,000 U.S. adults, and the data told us that 675 supported the death penalty for convicted murderers (p-hat = 0.675).

A 95% confidence interval for p, the proportion of **all** U.S. adults who support the death penalty, is (0.646, 0.704).

Since the 95% confidence interval for p does not include 0.64 as a plausible value for p, we can reject Ho and conclude (as we did before) that there is enough evidence that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003.
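This interval can be reproduced from the usual formula, p-hat ± z* √(p-hat(1 − p-hat)/n). A minimal sketch:

```python
from math import sqrt
from statistics import NormalDist

n, p_hat = 1000, 675 / 1000              # sample size and sample proportion
z_star = NormalDist().inv_cdf(0.975)     # critical value, about 1.96
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)

print(round(ci[0], 3), round(ci[1], 3))  # 0.646 0.704
```

Since 0.64 lies below the interval, it is not a plausible value for p, and Ho is rejected.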

You and your roommate are arguing about whose turn it is to clean the apartment. Your roommate suggests that you settle this by tossing a coin and takes one out of a locked box he has on the shelf. Suspecting that the coin might not be fair, you decide to test it first. You toss the coin 80 times, thinking to yourself that if, indeed, the coin is fair, you should get around 40 heads. Instead you get 48 heads. You are puzzled. You are not sure whether getting 48 heads out of 80 is enough evidence to conclude that the coin is unbalanced, or whether this a result that could have happened just by chance when the coin is fair.

Statistics can help you answer this question.

Let p be the true proportion (probability) of heads. We want to test whether the coin is fair or not.

We are testing:

**Ho:** p = 0.5 (the coin is fair).
**Ha:** p ≠ 0.5 (the coin is not fair).

The data we have are that out of n = 80 tosses, we got 48 heads, or that the sample proportion of heads is p-hat = 48/80 = 0.6.

A 95% confidence interval for p, the true proportion of heads for this coin, is:

Since in this case 0.5 is one of the plausible values for p, we cannot reject Ho. In other words, the data do not provide enough evidence to conclude that the coin is not fair.
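The same confidence-interval computation applied to the coin data confirms that 0.5 is plausible:

```python
from math import sqrt
from statistics import NormalDist

n, p_hat = 80, 48 / 80                   # 80 tosses, 48 heads
z_star = NormalDist().inv_cdf(0.975)     # critical value, about 1.96
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)

print(round(ci[0], 3), round(ci[1], 3))  # 0.493 0.707
assert ci[0] < 0.5 < ci[1]               # 0.5 is plausible: cannot reject Ho
```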

**Comment**

The context of the last example is a good opportunity to bring up an important point that was discussed earlier.

Even though we use 0.05 as a cutoff to guide our decision about whether the results are statistically significant, we should not treat it as inviolable and we should always add our own judgment. Let’s look at the last example again.

It turns out that the p-value of this test is 0.0734. In other words, it is maybe not extremely unlikely, but it is quite unlikely (probability of 0.0734) that when you toss a fair coin 80 times you’ll get a sample proportion of heads of 48/80 = 0.6 (or even more extreme). It is true that using the 0.05 significance level (cutoff), 0.0734 is not considered small enough to conclude that the coin is not fair. However, if you really don’t want to clean the apartment, the p-value might be small enough for you to ask your roommate to use a different coin, or to provide one yourself!
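This p-value is easy to reproduce; the last decimal differs slightly from the text's 0.0734, which likely reflects rounding the test statistic to z = 1.79 before the table lookup.

```python
from math import sqrt
from statistics import NormalDist

p0, p_hat, n = 0.5, 0.6, 80
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)    # about 1.79
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails of the z distribution

print(round(z, 2), round(p_value, 4))         # 1.79 0.0736
```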

**Here is our final point on this subject:**

When the data provide enough evidence to reject Ho, we can conclude (depending on the alternative hypothesis) that the population proportion is either less than, greater than, or not equal to the null value p_{0}. However, we do not get a more informative statement about its actual value. It might be of interest, then, to follow the test with a 95% confidence interval that will give us more insight into the actual value of p.

In our example 3,

we concluded that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, when it was 0.64. It is probably of interest not only to know that the proportion has changed, but also to estimate what it has changed to. We’ve calculated the 95% confidence interval for p on the previous page and found that it is (0.646, 0.704).

We can combine our conclusions from the test and the confidence interval and say:

The data provide evidence that the proportion of U.S. adults who support the death penalty for convicted murderers has changed since 2003, and we are 95% confident that it is now between 0.646 and 0.704 (i.e., between 64.6% and 70.4%).

Let’s look at our example 1 to see how a confidence interval following a test might be insightful in a different way.

Here is a summary of example 1:

We conclude that as a result of the repair, the proportion of defective products has been reduced to below 0.20 (which was the proportion prior to the repair). It is probably of great interest to the company not only to know that the proportion of defective products has been reduced, but also to estimate what it has been reduced to, to get a better sense of how effective the repair was. A 95% confidence interval for p in this case is (0.124, 0.196).

We can therefore say that the data provide evidence that the proportion of defective products has been reduced, and we are 95% confident that it has been reduced to somewhere between 12.4% and 19.6%. This is very useful information, since it tells us that even though the results were significant (i.e., the repair reduced the number of defective products), the repair might not have been effective enough, if it managed to reduce the number of defective products only to the range provided by the confidence interval. This, of course, ties back in to the idea of statistical significance vs. practical importance that we discussed earlier. Even though the results are statistically significant (Ho was rejected), practically speaking, the repair might still be considered ineffective.

Even though this portion of the current section is about the z-test for population proportion, it is loaded with very important ideas that apply to hypothesis testing in general. We’ve already summarized the details that are specific to the z-test for proportions, so the purpose of this summary is to highlight the general ideas.

The process of hypothesis testing has **four steps**:

**I. Stating the null and alternative hypotheses (Ho and Ha).**

**II. Obtaining a random sample (or at least one that can be considered random) and collecting data. Using the data:**

**Check that the conditions** under which the test can be reliably used are met.

**Summarize the data using a test statistic. **

- The test statistic is a measure of the evidence in the data against Ho. The larger the test statistic is in magnitude, the more evidence the data present against Ho.

**III. Finding the p-value of the test. **The p-value is the probability of getting data like those observed (or even more extreme) assuming that the null hypothesis is true, and is calculated using the null distribution of the test statistic. The p-value is a measure of the evidence against Ho. The smaller the p-value, the more evidence the data present against Ho.

**IV. Making conclusions. **

Conclusions about the statistical **significance of the results:**

If the p-value is small, the data present enough evidence to reject Ho (and accept Ha).

If the p-value is not small, the data do not provide enough evidence to reject Ho.

To help guide our decision, we use the significance level as a cutoff for what is considered a small p-value. The significance cutoff is usually set at 0.05.

Conclusions should then be provided **in the context** of the problem.

**Additional Important Ideas about Hypothesis Testing**

- Results that are based on a larger sample carry more weight, and therefore **as the sample size increases, results become more statistically significant.**

- Even a very small and practically unimportant effect becomes statistically significant with a large enough sample size. The **distinction between statistical significance and practical importance** should therefore always be considered.

- **Confidence intervals can be used in order to carry out two-sided tests** (95% confidence for the 0.05 significance level). If the null value is not included in the confidence interval (i.e., is not one of the plausible values for the parameter), we have enough evidence to reject Ho. Otherwise, we cannot reject Ho.

- If the results are statistically significant, it might be of interest to **follow up the tests with a confidence interval** in order to get insight into the actual value of the parameter of interest.

- It is important to be aware that there are two types of errors in hypothesis testing (**Type I and Type II**) and that the **power** of a statistical test is an important measure of how likely we are to be able to detect a difference of interest to us in a particular problem.

This last part of the four-step process of hypothesis testing is the same across all statistical tests, and actually, we’ve already said basically everything there is to say about it, but it can’t hurt to say it again.

The p-value is a measure of how much evidence the data present against Ho. The smaller the p-value, the more evidence the data present against Ho.

We already mentioned that what determines what constitutes enough evidence against Ho is the significance level (α, alpha), a cutoff point below which the p-value is considered small enough to reject Ho in favor of Ha. The most commonly used significance level is 0.05.

- If p-value ≤ 0.05 then **WE REJECT** Ho.
    - Conclusion: There **IS** enough evidence that *Ha is True*.
- If p-value > 0.05 then **WE FAIL TO REJECT** Ho.
    - Conclusion: There **IS NOT** enough evidence that *Ha is True*.

Here, instead of *Ha is True*, we write what this means in the words of the problem; in other words, in the context of the current scenario.

It is important to mention again that this step has essentially two sub-steps:

- (i) Based on the p-value, determine whether or not the results are statistically significant (i.e., the data present enough evidence to reject Ho).
- (ii) State your conclusions in the context of the problem.

**Note:** We must always still consider whether the results have any practical significance, particularly if they are statistically significant, since a statistically significant result that has no practical use is essentially meaningless!

Let’s go back to our three examples and draw conclusions.

Has the proportion of defective products been reduced as a result of the repair?

We found that the p-value for this test was 0.023.

Since 0.023 is small (in particular, 0.023 < 0.05), the data provide enough evidence to reject Ho.

**Conclusion:**

- There **IS** enough evidence that *the proportion of defective products is less than 20% after the repair*.

The following figure is the complete story of this example, and includes all the steps we went through, starting from stating the hypotheses and ending with our conclusions:

Is the proportion of marijuana users in the college higher than the national figure?

We found that the p-value for this test was 0.182.

Since 0.182 is *not* small (in particular, 0.182 > 0.05), the data do not provide enough evidence to reject Ho.

**Conclusion:**

- There **IS NOT** enough evidence that *the proportion of students at the college who use marijuana is higher than the national figure*.

Here is the complete story of this example:

Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?

We found that the p-value for this test was 0.021.

Since 0.021 is small (in particular, 0.021 < 0.05), the data provide enough evidence to reject Ho.

**Conclusion:**

- There **IS** enough evidence that *the proportion of adults who support the death penalty for convicted murderers has changed since 2003*.

Here is the complete story of this example:

Many students wonder why 5% is so often selected as the significance level in hypothesis testing, and why 1% is the next most typical level. This is largely a matter of convenience and tradition.

When Ronald Fisher (one of the founders of modern statistics) published one of his tables, he used a mathematically convenient scale that included 5% and 1%. Later, these same 5% and 1% levels were used by other people, in part just because Fisher was so highly esteemed. But mostly these are arbitrary levels.

The idea of selecting some sort of relatively small cutoff was historically important in the development of statistics; but it’s important to remember that there is really a continuous range of increasing confidence towards the alternative hypothesis, not a single all-or-nothing value. There isn’t much meaningful difference, for instance, between a p-value of 0.049 and one of 0.051, and it would be foolish to declare one case definitely a “real” effect and the other definitely a “random” effect. In either case, the study results were roughly 5% likely by chance if there is no actual effect.

Whether such a p-value is sufficient for us to reject a particular null hypothesis ultimately depends on the risk of making the wrong decision, and the extent to which the hypothesized effect might contradict our prior experience or previous studies.

We have now completed going through the four steps of hypothesis testing, and in particular we learned how they are applied to the z-test for the population proportion. Here is a brief summary:

**Step 1: State the hypotheses**

State the null hypothesis:

Ho: p = p_{0}

State the alternative hypothesis:

Ha: p < p_{0} **(one-sided)**

Ha: p > p_{0} **(one-sided)**

Ha: p ≠ p_{0} **(two-sided)**

where the choice of the appropriate alternative (out of the three) is usually quite clear from the context of the problem. If you feel it is not clear, it is most likely a two-sided problem. Students are usually good at recognizing the “more than” and “less than” terminology, but “different from” alternatives can sometimes be harder to spot, often because you have preconceived ideas of how you think the answer should come out. Use only the information given in the problem.

**Step 2: Obtain data, check conditions, and summarize data**

Obtain data from a sample and:

(i) Check whether the data satisfy the conditions which allow you to use this test.

- a random sample (or at least a sample that can be considered random in context)
- the conditions under which the sampling distribution of p-hat is normal are met

(ii) Calculate the sample proportion p-hat, and summarize the data using the test statistic:

z = (p-hat − p_{0}) / √(p_{0}(1 − p_{0})/n)

(**Recall:** This standardized test statistic represents how many standard deviations above or below p_{0} our sample proportion p-hat is.)
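In code, the test statistic is a one-line computation. Applying it to the marijuana example from earlier (76 users in a sample of 400, null value 0.157):

```python
from math import sqrt

def z_statistic(p_hat, p0, n):
    """Standardized test statistic: how many standard deviations
    the sample proportion p-hat lies above or below the null value p0."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Marijuana example from the text: 76 users out of 400, null value 0.157
print(round(z_statistic(76 / 400, 0.157, 400), 2))  # 1.81
```

The sample proportion of 0.19 lies about 1.81 standard deviations above the null value.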

**Step 3: Find the p-value of the test by using the test statistic as follows**

**When the alternative hypothesis is “less than,”** the p-value is the probability of observing a test statistic as **small as that observed or smaller**, assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution.

Looking at the shaded region, you can see why this is often referred to as a **left-tailed** test. We shaded to the left of the test statistic, since less than is to the left.

**When the alternative hypothesis is “greater than,”** the p-value is the probability of observing a test statistic as **large as that observed or larger**, assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution.

Looking at the shaded region, you can see why this is often referred to as a **right-tailed** test. We shaded to the right of the test statistic, since greater than is to the right.

**When the alternative hypothesis is “not equal to,”** the p-value is the probability of observing a test statistic which is as large in **magnitude** as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution.

This is often referred to as a **two-tailed** test, since we shaded in both directions.

**Step 4: Conclusion**

Reach a conclusion first regarding the statistical significance of the results, and then determine what it means in the context of the problem.

**If p-value ≤ 0.05 then WE REJECT Ho**

**Conclusion: There IS enough evidence that Ha is True**

**If p-value > 0.05 then WE FAIL TO REJECT Ho**

**Conclusion: There IS NOT enough evidence that Ha is True**

Recall that: If the p-value is small (in particular, smaller than the significance level, which is usually 0.05), the results are statistically significant (in the sense that there is a statistically significant difference between what was observed in the sample and what was claimed in Ho), and so we reject Ho.

If the p-value is not small, we do not have enough statistical evidence to reject Ho, and so we continue to believe that Ho **may** be true. (**Remember: In hypothesis testing we never “accept” Ho**).

Finally, in practice, we should always consider the **practical significance** of the results as well as the statistical significance.

Before we move on to the next test, we are going to use the z-test for proportions to bring up and illustrate a few more very important issues regarding hypothesis testing. This might also be a good time to review the concepts of Type I error, Type II error, and Power before continuing on.

So far we’ve talked about the p-value at the intuitive level: understanding what it is (or what it measures) and how we use it to draw conclusions about the statistical significance of our results. We will now go more deeply into how the p-value is calculated.

It should be mentioned that eventually we will rely on technology to calculate the p-value for us (as well as the test statistic), but in order to make intelligent use of the output, it is important to first **understand** the details, and only then let the computer do the calculations for us. Again, our goal is to use this simple example to give you the tools you need to understand the process entirely. Let’s start.

Recall that so far we have said that the p-value is the probability of obtaining data like those observed assuming that Ho is true. Like the test statistic, the p-value is, therefore, a measure of the evidence against Ho. In the case of the **test statistic,** the **larger** it is in magnitude (positive or negative), the further p-hat is from p_{0}, and the **more evidence we have against Ho.** In the case of the **p-value,** it is the opposite; the **smaller** it is, the more unlikely it is to get data like those observed when Ho is true, and the **more evidence it is against Ho.**

One can actually draw conclusions in hypothesis testing just using the test statistic, and as we’ll see, the p-value is, in a sense, just another way of looking at the test statistic. The reason we take the extra step in this course and derive the p-value from the test statistic is that even though in this case (the test about the population proportion) and some other tests the value of the test statistic has a very clear and intuitive interpretation, there are some tests where its value is not as easy to interpret. On the other hand, the p-value keeps its intuitive appeal across **all** statistical tests.

**How is the p-value calculated?**

Intuitively, the p-value is the **probability** of observing **data like those observed** assuming that Ho is true. Let’s be a bit more formal:

- Since this is a probability question about the **data**, it makes sense that the calculation will involve the data summary, the **test statistic.**
- What do we mean by **“like”** those observed? By “like” we mean **“as extreme or even more extreme.”**

Putting it all together, we get that in **general:**

The** p-value **is the** probability of observing a test statistic as extreme as that observed (or even more extreme) assuming that the null hypothesis is true.**

By **“extreme”** we mean extreme **in the direction(s) of the alternative** hypothesis.

**Specifically**, for the z-test for the population proportion:

- If the alternative hypothesis is Ha: p < p_{0} **(less than)**, then “extreme” means **small**, and the p-value is the probability of observing a test statistic **as small as that observed or smaller** if the null hypothesis is true.
- If the alternative hypothesis is Ha: p > p_{0} **(greater than)**, then “extreme” means **large**, and the p-value is the probability of observing a test statistic **as large as that observed or larger** if the null hypothesis is true.
- If the alternative hypothesis is Ha: p ≠ p_{0} **(different from)**, then “extreme” means extreme in either direction, **either small or large (i.e., large in magnitude)**, and the p-value is the probability of observing a test statistic **as large in magnitude as that observed or larger** if the null hypothesis is true. (Examples: If z = -2.5, the p-value is the probability of observing a test statistic as small as -2.5 or smaller, or as large as 2.5 or larger. If z = 1.5, the p-value is the probability of observing a test statistic as large as 1.5 or larger, or as small as -1.5 or smaller.)
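These three cases translate directly into a short function (the name and interface are our own, for illustration):

```python
from statistics import NormalDist

def p_value(z, alternative):
    """p-value of the z-test for each of the three alternatives."""
    Z = NormalDist()
    if alternative == "less":       # left tail: as small as z or smaller
        return Z.cdf(z)
    if alternative == "greater":    # right tail: as large as z or larger
        return 1 - Z.cdf(z)
    # two-sided: as large in magnitude as z or larger (both tails)
    return 2 * (1 - Z.cdf(abs(z)))

print(round(p_value(-2.5, "two-sided"), 4))  # 0.0124
print(round(p_value(1.5, "greater"), 4))     # 0.0668
```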

**OK, hopefully that makes (some) sense. But how do we actually calculate it?**

Recall the important comment from our discussion about our test statistic,

which said that when the null hypothesis is true (i.e., when p = p_{0}), the possible values of our test statistic follow a standard normal (N(0,1), denoted by Z) distribution. Therefore, the p-value calculations (which assume that Ho is true) are simply standard normal distribution calculations for the 3 possible alternative hypotheses.

The probability of observing a test statistic as **small as that observed or smaller**, assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution.

Looking at the shaded region, you can see why this is often referred to as a **left-tailed** test. We shaded to the left of the test statistic, since less than is to the left.

The probability of observing a test statistic as **large as that observed or larger**, assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution.

Looking at the shaded region, you can see why this is often referred to as a **right-tailed** test. We shaded to the right of the test statistic, since greater than is to the right.

The probability of observing a test statistic which is as large in **magnitude** as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution.

This is often referred to as a **two-tailed** test, since we shaded in both directions.

Next, we will apply this to our three examples. But first, work through the following activities, which should help your understanding.

Has the proportion of defective products been reduced as a result of the repair?

The p-value in this case is:

- The probability of observing a test statistic as small as -2 or smaller, assuming that Ho is true.

**OR (recalling what the test statistic actually means in this case),**

- The probability of observing a sample proportion that is 2 standard deviations or more below the null value (p_{0} = 0.20), assuming that p_{0} is the true population proportion.

**OR, more specifically,**

- The probability of observing a sample proportion of 0.16 or lower in a random sample of size 400, when the true population proportion is p_{0} = 0.20.

In either case, the p-value is found as shown in the following figure:

To find P(Z ≤ -2) we can use either the calculator or the table we learned to use in the probability unit for normal random variables. Eventually, after we understand the details, we will use software to run the test for us, and the output will give us all the information we need. The p-value that statistical software provides for this specific example is 0.023. The p-value tells us that it is pretty unlikely (probability 0.023) to get data like those observed (a test statistic of -2 or less) assuming that Ho is true.
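As a check on the value quoted above, P(Z ≤ -2) can be computed directly. This is a sketch using only the Python standard library (the erf-based normal CDF is a standard identity, not something from the course materials):

```python
from math import erf, sqrt

def normal_cdf(z):
    # P(Z <= z) for a standard normal Z
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Left-tailed p-value for the defective-products example (test statistic z = -2)
p_value = normal_cdf(-2.0)
print(round(p_value, 3))  # -> 0.023
```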

Is the proportion of marijuana users in the college higher than the national figure?

The p-value in this case is:

- The probability of observing a test statistic as large as 0.91 or larger, assuming that Ho is true.

**OR (recalling what the test statistic actually means in this case),**

- The probability of observing a sample proportion that is 0.91 standard deviations or more above the null value (p_{0} = 0.157), assuming that p_{0} is the true population proportion.

**OR, more specifically,**

- The probability of observing a sample proportion of 0.19 or higher in a random sample of size 100, when the true population proportion is p_{0} = 0.157.

In either case, the p-value is found as shown in the following figure:

Again, at this point we can use either the calculator or the table to find the p-value: P(Z ≥ 0.91) = 0.182.

The p-value tells us that it is not very surprising (probability of 0.182) to get data like those observed (which yield a test statistic of 0.91 or higher) assuming that the null hypothesis is true.

Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?

The p-value in this case is:

- The probability of observing a test statistic as large as 2.31 (or larger) or as small as -2.31 (or smaller), assuming that Ho is true.

**OR (recalling what the test statistic actually means in this case),**

- The probability of observing a sample proportion that is 2.31 standard deviations or more away from the null value (p_{0} = 0.64), assuming that p_{0} is the true population proportion.

**OR, more specifically,**

- The probability of observing a sample proportion as different from 0.64 as 0.675 is, or even more different (i.e., as high as 0.675 or higher, or as low as 0.605 or lower), in a random sample of size 1,000, when the true population proportion is p_{0} = 0.64.

In either case, the p-value is found as shown in the following figure:

Again, at this point we can use either the calculator or the table to find the p-value: P(Z ≤ -2.31) + P(Z ≥ 2.31) = 2 * P(Z ≥ 2.31) = 0.021.

The p-value tells us that it is pretty unlikely (probability of 0.021) to get data like those observed (test statistic as high as 2.31 or higher or as low as -2.31 or lower) assuming that Ho is true.
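The two-tailed calculation can also be verified with a short stdlib-only sketch (the erf-based normal CDF below is a standard identity, not part of the course materials):

```python
from math import erf, sqrt

def normal_cdf(z):
    # P(Z <= z) for a standard normal Z
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

z = 2.31  # test statistic for the death-penalty example
# P(Z <= -2.31) + P(Z >= 2.31); by symmetry this equals 2 * P(Z >= 2.31)
p_value = normal_cdf(-z) + (1.0 - normal_cdf(z))
print(round(p_value, 3))  # -> 0.021
```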

**Comment:**

- We’ve just seen that finding p-values involves probability calculations about the value of the test statistic assuming that Ho is true. In this case, when Ho is true, the values of the test statistic follow a standard normal distribution (i.e., the sampling distribution of the test statistic when the null hypothesis is true is N(0,1)). Therefore, p-values correspond to areas (probabilities) under the standard normal curve.

Similarly, in **any test**, p-values are found using the sampling distribution of the test statistic when the null hypothesis is true (also known as the “null distribution” of the test statistic). In this case, it was relatively easy to argue that the null distribution of our test statistic is N(0,1). As we’ll see, in other tests, other distributions come up (like the t-distribution and the F-distribution), which we will just mention briefly, and rely heavily on the output of our statistical package for obtaining the p-values.

We’ve just completed our discussion about the p-value, and how it is calculated both in general and more specifically for the z-test for the population proportion. Let’s go back to the four-step process of hypothesis testing and see what we’ve covered and what still needs to be discussed.

**STEP 1:** State the appropriate null and alternative hypotheses, Ho and Ha.

**STEP 2:** Obtain a random sample, collect relevant data, and **check whether the data meet the conditions under which the test can be used**. If the conditions are met, summarize the data using a test statistic.

**STEP 3:** Find the p-value of the test.

**STEP 4:** Based on the p-value, decide whether or not the results are statistically significant and **draw your conclusions in context.**

**Note:** In practice, we should always consider the practical significance of the results as well as the statistical significance.

With respect to the z-test for the population proportion:

Step 1: Completed

Step 2: Completed

Step 3: Completed

Step 4: This is what we will work on next.

After the hypotheses have been stated, the next step is to obtain a **sample** (on which the inference will be based), **collect relevant data**, and **summarize** them.

It is extremely important that our sample is representative of the population about which we want to draw conclusions. This is ensured when the sample is chosen at **random.** Beyond the practical issue of ensuring representativeness, choosing a random sample has theoretical importance that we will mention later.

In the case of hypothesis testing for the population proportion (p), we will collect data on the relevant categorical variable from the individuals in the sample and start by calculating the sample proportion p-hat (the natural quantity to calculate when the parameter of interest is p).

Let’s go back to our three examples and add this step to our figures.

Has the proportion of defective products been reduced as a result of the repair?

Is the proportion of marijuana users in the college higher than the national figure?

Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
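As a quick sketch, here are the sample proportions for the three examples, using the counts given when the examples are introduced (64 of 400, 19 of 100, and 675 of 1,000):

```python
# p-hat = (count with the trait of interest) / (sample size), for each example
examples = {
    "defective products": (64, 400),    # tested against p0 = 0.20
    "marijuana use": (19, 100),         # tested against p0 = 0.157
    "death penalty": (675, 1000),       # tested against p0 = 0.64
}
p_hats = {name: count / n for name, (count, n) in examples.items()}
print(p_hats)  # -> {'defective products': 0.16, 'marijuana use': 0.19, 'death penalty': 0.675}
```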

As we mentioned earlier without going into details, when we summarize the data in hypothesis testing, we go a step beyond calculating the sample statistic and summarize the data with a **test statistic**. Every test has a test statistic, which to some degree captures the essence of the test. In fact, the p-value, which so far we have looked upon as “the king” (in the sense that everything is determined by it), is actually determined by (or derived from) the test statistic. We will now introduce the test statistic.

The test statistic is a measure of how far the sample proportion p-hat is from the null value p_{0}, the value that the null hypothesis claims is the value of p. In other words, since p-hat is what the data estimates p to be, the test statistic can be viewed as a measure of the “distance” between what the data tells us about p and what the null hypothesis claims p to be.

Let’s use our examples to understand this:

Has the proportion of defective products been reduced as a result of the repair?

The parameter of interest is p, the proportion of defective products following the repair.

The data estimate p to be p-hat = 0.16

The null hypothesis claims that p = 0.20

The data are therefore 0.04 (or 4 percentage points) below the null hypothesis value.

It is hard to evaluate whether this difference of 4 percentage points in defective products is enough evidence to say that the repair was effective at reducing the proportion of defective products, but clearly, the larger the difference, the more evidence it provides against the null hypothesis. So if, for example, our sample proportion of defective products had been 0.10 instead of 0.16, then I think you would all agree that cutting the proportion of defective products in half (from 20% to 10%) would be extremely strong evidence that the repair was effective.

Is the proportion of marijuana users in the college higher than the national figure?

The parameter of interest is p, the proportion of students in a college who use marijuana.

The data estimate p to be p-hat = 0.19

The null hypothesis claims that p = 0.157

The data are therefore 0.033 (or 3.3 percentage points) above the null hypothesis value.

The parameter of interest is p, the proportion of U.S. adults who support the death penalty for convicted murderers.

The data estimate p to be p-hat = 0.675

The null hypothesis claims that p = 0.64

There is a difference of 0.035 (or 3.5 percentage points) between the data and the null hypothesis value.

The problem with looking only at the difference between the sample proportion, p-hat, and the null value, p_{0}, is that we have not taken into account the variability of our estimator p-hat, which, as we know from our study of sampling distributions, depends on the sample size.

For this reason, the test statistic cannot simply be the difference between p-hat and p_{0}; it must also account for the sample size. In other words, we need to somehow standardize the difference so that comparison between different situations will be possible. We are very close to revealing the test statistic, but before we construct it, let's be reminded of the following two facts from probability:

**Fact 1:** When we take a random sample of size n from a population with population proportion p, then (when certain conditions are met) the possible values of the sample proportion p-hat have approximately a normal distribution, with a mean of p and a standard deviation of sqrt(p(1 − p)/n).

**Fact 2: **The z-score of any normal value (a value that comes from a normal distribution) is calculated by finding the difference between the value and the mean and then dividing that difference by the standard deviation (of the normal distribution associated with the value). The z-score represents how many standard deviations below or above the mean the value is.

Thus, our test statistic should be **a measure** of how far the sample proportion p-hat is from the null value p_{0} **relative** to the variation of p-hat (as measured by the standard error of p-hat).

Recall that the **standard error** is the **standard deviation of the sampling distribution** for a given statistic. For p-hat, we know the following:

To find the p-value, we will need to determine how surprising our value is assuming the null hypothesis is true. We already have the tools needed for this process from our study of sampling distributions as represented in the table above.

Has the proportion of defective products been reduced as a result of the repair?

If we assume the null hypothesis is true, we can specify that the center of the distribution of all possible values of p-hat from samples of size 400 would be 0.20 (our null value).

We can calculate the standard error, assuming p = 0.20, as sqrt(0.2(1 − 0.2)/400) = sqrt(0.0004) = 0.02.

The following picture represents the sampling distribution of all possible values of p-hat of samples of size 400, assuming the true proportion p is 0.20 and our other requirements for the sampling distribution to be normal are met (we will review these during the next step).

In order to calculate probabilities for the picture above, we would need to find the z-score associated with our result.

This z-score is the **test statistic**! In this example, the numerator of our z-score is the difference between p-hat (0.16) and null value (0.20) which we found earlier to be -0.04. The denominator of our z-score is the standard error calculated above (0.02) and thus quickly we find the z-score, our test statistic, to be -2.

The sample proportion based upon this data is 2 standard errors below the null value.
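The calculation above can be sketched in a few lines of Python (a minimal illustration; the variable names are ours, not from the course materials):

```python
from math import sqrt

# Defective-products example: Ho: p = 0.20, n = 400, 64 defectives observed
p0, n = 0.20, 400
p_hat = 64 / n                      # 0.16
se = sqrt(p0 * (1 - p0) / n)        # standard error of p-hat, assuming Ho is true
z = (p_hat - p0) / se               # the test statistic
print(round(se, 2), round(z, 2))    # -> 0.02 -2.0
```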

Hopefully you now understand more about the reasons we need probability in statistics!

Now we will formalize the definition and look at our remaining examples before moving on to the next step, which will be to determine if a normal distribution applies and calculate the p-value.

**Test Statistic for Hypothesis Tests for One Proportion** is:

z = (p-hat − p_{0}) / sqrt(p_{0}(1 − p_{0}) / n)

It represents the difference between the sample proportion and the null value, measured in standard deviations (the standard error of p-hat).

The picture above is a representation of the sampling distribution of p-hat assuming p = p_{0}. In other words, this is a model of how p-hat behaves if we are drawing random samples from a population for which Ho is true.

Notice the center of the sampling distribution is at p_{0}, which is the hypothesized proportion given in the null hypothesis (Ho: p = p_{0}.) We could also mark the axis in standard error units,

For example, if our null hypothesis claims that the proportion of U.S. adults supporting the death penalty is 0.64, then the sampling distribution is drawn as if the null is true. We draw a normal distribution centered at 0.64 (p_{0}) with a standard error dependent on sample size,

**Important Comment:**

- Note that under the assumption that Ho is true (and if the conditions for the sampling distribution to be normal are satisfied), the test statistic follows a N(0,1) (standard normal) distribution. Another common way to say the same thing is: "The null distribution of the test statistic is N(0,1)."

By “null distribution,” we mean the distribution under the assumption that Ho is true. As we’ll see and stress again later, the null distribution of the test statistic is what the calculation of the p-value is based on.

Let’s go back to our remaining two examples and find the test statistic in each case:

Is the proportion of marijuana users in the college higher than the national figure?

Since the null hypothesis is Ho: p = 0.157, the standardized (z) score of p-hat = 0.19 is z = (0.19 − 0.157) / sqrt(0.157(1 − 0.157)/100) ≈ 0.91

This is the value of the test statistic for this example.

We interpret this to mean that, assuming that Ho is true, the sample proportion p-hat = 0.19 is 0.91 standard errors above the null value (0.157).

Since the null hypothesis is Ho: p = 0.64, the standardized (z) score of p-hat = 0.675 is z = (0.675 − 0.64) / sqrt(0.64(1 − 0.64)/1000) ≈ 2.31

This is the value of the test statistic for this example.

We interpret this to mean that, assuming that Ho is true, the sample proportion p-hat = 0.675 is 2.31 standard errors above the null value (0.64).
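Both of these test statistics can be reproduced with a small helper (a sketch; the function name is ours, not from the course materials):

```python
from math import sqrt

def z_statistic(p_hat, p0, n):
    # z = (p-hat - p0) / sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

print(round(z_statistic(0.19, 0.157, 100), 2))   # marijuana example -> 0.91
print(round(z_statistic(0.675, 0.64, 1000), 2))  # death-penalty example -> 2.31
```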

**Comments about the Test Statistic:**

- We mentioned earlier that, to some degree, the test statistic captures the essence of the test. In this case, the test statistic measures the difference between p-hat and p_{0} in standard errors. This is exactly what this test is about: get data, and look at the discrepancy between what the data estimate p to be (represented by p-hat) and what Ho claims about p (represented by p_{0}).

- You can think about this test statistic as a measure of evidence in the data against Ho. The larger the test statistic is in magnitude, the "further the data are from Ho," and therefore the more evidence the data provide against Ho.

**Comments:**

- It should now be clear why this test is commonly known as **the z-test for the population proportion**. The name comes from the fact that it is based on a test statistic that is a *z-score*.

- Recall fact 1 that we used for constructing the z-test statistic. Here is part of it again:

When we take a **random** sample of size n from a population with population proportion p_{0}, the possible values of the sample proportion p-hat (**when certain conditions are met**) have approximately a normal distribution with a mean of p_{0}… and a standard deviation of sqrt(p_{0}(1 − p_{0})/n).

This result provides the theoretical justification for constructing the test statistic the way we did, and therefore the assumptions under which this result holds (in bold, above) are the conditions that our data need to satisfy so that we can use this test. These two conditions are:

i. The sample has to be random.

ii. The conditions under which the sampling distribution of p-hat is normal are met. In other words, n · p_{0} and n · (1 − p_{0}) are both large enough (a common rule of thumb is that each should be at least 10).

- Here we will pause to say more about condition (i.) above, the need for a random sample. In the Probability Unit we discussed sampling plans based on probability (such as a simple random sample, cluster, or stratified sampling) that produce a non-biased sample, which can be safely used in order to make inferences about a population. We noted in the Probability Unit that, in practice, other (non-random) sampling techniques are sometimes used when random sampling is not feasible. It is important though, when these techniques are used, to be aware of the type of bias that they introduce, and thus the limitations of the conclusions that can be drawn from them. For our purpose here, we will focus on one such practice, the situation in which a sample is not really chosen randomly, but in the context of the categorical variable that is being studied, the sample is regarded as random. For example, say that you are interested in the proportion of students at a certain college who suffer from seasonal allergies. For that purpose, the students in a large engineering class could be considered as a random sample, since there is nothing about being in an engineering class that makes you more or less likely to suffer from seasonal allergies. Technically, the engineering class is a convenience sample, but it is treated as a random sample in the context of this categorical variable. On the other hand, if you are interested in the proportion of students in the college who have math anxiety, then the class of engineering students clearly could not possibly be viewed as a random sample, since engineering students probably have a much lower incidence of math anxiety than the college population overall.

Let’s check the conditions in our three examples.

Has the proportion of defective products been reduced as a result of the repair?

i. The 400 products were chosen at random.

ii. n = 400, p_{0} = 0.2, and therefore: n · p_{0} = 400(0.2) = 80 and n · (1 − p_{0}) = 400(0.8) = 320, both large enough for the normal approximation.

Is the proportion of marijuana users in the college higher than the national figure?

i. The 100 students were chosen at random.

ii. n = 100, p_{0} = 0.157, and therefore: n · p_{0} = 100(0.157) = 15.7 and n · (1 − p_{0}) = 100(0.843) = 84.3, both large enough for the normal approximation.

i. The 1000 adults were chosen at random.

ii. n = 1000, p_{0} = 0.64, and therefore: n · p_{0} = 1000(0.64) = 640 and n · (1 − p_{0}) = 1000(0.36) = 360, both large enough for the normal approximation.

Checking that our data satisfy the conditions under which the test can be reliably used is a very important part of the hypothesis testing process. Be sure to consider this for every hypothesis test you conduct in this course and certainly in practice.
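The condition check can be sketched as a small helper. Note that the cutoff of 10 used below is a common textbook rule of thumb, not something fixed by the test itself; other texts use different cutoffs.

```python
def conditions_met(n, p0, threshold=10):
    # Both expected counts under Ho (successes n*p0 and failures n*(1-p0))
    # should reach the threshold. The cutoff of 10 is a common textbook
    # rule of thumb; some texts use 5 or 15 instead.
    return n * p0 >= threshold and n * (1 - p0) >= threshold

print(conditions_met(400, 0.20))   # 80 and 320 -> True
print(conditions_met(100, 0.157))  # 15.7 and 84.3 -> True
print(conditions_met(1000, 0.64))  # 640 and 360 -> True
```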

**STEP 1:** State the appropriate null and alternative hypotheses, Ho and Ha.

**STEP 2:** Obtain a random sample, collect relevant data, and **check whether the data meet the conditions under which the test can be used**. If the conditions are met, summarize the data using a test statistic.

**STEP 3:** Find the p-value of the test.

**STEP 4:** Based on the p-value, decide whether or not the results are statistically significant and **draw your conclusions in context.**

**Note:** In practice, we should always consider the practical significance of the results as well as the statistical significance.

With respect to the z-test for the population proportion that we are currently discussing, we have:

Step 1: Completed

Step 2: Completed

Step 3: This is what we will work on next.

Now that we understand the process of hypothesis testing and the logic behind it, we are ready to start learning about specific statistical tests (also known as significance tests).

The first test we are going to learn is the test about the population proportion (p).

This test is widely known as the **“z-test for the population proportion (p).”**

We will understand later where the “z-test” part is coming from.

This will be the only type of problem you will complete entirely “by hand” in this course. Our goal is to use this example to give you the tools you need to understand how this process works. After working a few problems, you should review the earlier material again. You will likely need to review the terminology and concepts a few times before you fully understand the process.

In reality, you will often be conducting more complex statistical tests and allowing software to provide the p-value. In these settings it will be important to know what test to apply for a given situation and to be able to explain the results in context.

When we conduct a test about a population proportion, we are working with a categorical variable. Later in the course, after we have learned a variety of hypothesis tests, we will need to be able to identify which test is appropriate for which situation. Identifying the variable as categorical or quantitative is an important component of choosing an appropriate hypothesis test.

In this part of our discussion on hypothesis testing, we will go into details that we did not go into before. More specifically, we will use this test to introduce the idea of a **test statistic**, and details about **how p-values are calculated**.

Let’s start by introducing the three examples, which will be the leading examples in our discussion. Each example is followed by a figure illustrating the information provided, as well as the question of interest.

A machine is known to produce 20% defective products, and is therefore sent for repair. After the machine is repaired, 400 products produced by the machine are chosen at random and 64 of them are found to be defective. Do the data provide enough evidence that the proportion of defective products produced by the machine (p) has been **reduced** as a result of the repair?

The following figure displays the information, as well as the question of interest:

The question of interest helps us formulate the null and alternative hypotheses in terms of p, the proportion of defective products produced by the machine following the repair:

**Ho:** p = 0.20 (No change; the repair did not help).

**Ha:** p < 0.20 (The repair was effective at reducing the proportion of defective parts).

There are rumors that students at a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of 100 students from the college, 19 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is **higher** than the national proportion, which is 0.157? (This number is reported by the Harvard School of Public Health.)

Again, the following figure displays the information as well as the question of interest:

As before, we can formulate the null and alternative hypotheses in terms of p, the proportion of students in the college who use marijuana:

**Ho:** p = 0.157 (same as among all college students in the country).

**Ha:** p > 0.157 (higher than the national figure).

Polls on certain topics are conducted routinely in order to monitor changes in the public’s opinions over time. One such topic is the death penalty. In 2003 a poll estimated that 64% of U.S. adults support the death penalty for a person convicted of murder. In a more recent poll, 675 out of 1,000 U.S. adults chosen at random were in favor of the death penalty for convicted murderers. Do the results of this poll provide evidence that the proportion of U.S. adults who support the death penalty for convicted murderers (p) **changed** between 2003 and the later poll?

Here is a figure that displays the information, as well as the question of interest:

Again, we can formulate the null and alternative hypotheses in terms of p, the proportion of U.S. adults who support the death penalty for convicted murderers.

**Ho:** p = 0.64 (No change from 2003).

**Ha:** p ≠ 0.64 (Some change since 2003).

Recall that there are basically 4 steps in the process of hypothesis testing:

**STEP 1:** State the appropriate null and alternative hypotheses, Ho and Ha.

**STEP 2:** Obtain a random sample, collect relevant data, and **check whether the data meet the conditions under which the test can be used**. If the conditions are met, summarize the data using a test statistic.

**STEP 3:** Find the p-value of the test.

**STEP 4:** Based on the p-value, decide whether or not the results are statistically significant and **draw your conclusions in context.**

**Note:** In practice, we should always consider the practical significance of the results as well as the statistical significance.

We are now going to go through these steps as they apply to the hypothesis testing for the population proportion p. It should be noted that even though the details will be specific to this particular test, some of the ideas that we will add apply to hypothesis testing in general.

Here again are the three sets of hypotheses that are being tested in each of our three examples:

Has the proportion of defective products been reduced as a result of the repair?

**Ho:** p = 0.20 (No change; the repair did not help).

**Ha:** p < 0.20 (The repair was effective at reducing the proportion of defective parts).

Is the proportion of marijuana users in the college higher than the national figure?

**Ho:** p = 0.157 (same as among all college students in the country).

**Ha:** p > 0.157 (higher than the national figure).

**Ho:** p = 0.64 (No change from 2003).

**Ha:** p ≠ 0.64 (Some change since 2003).

The null hypothesis always takes the form:

- Ho: p = some value

and the alternative hypothesis takes one of the following three forms:

- Ha: p < that value (like in example 1), **or**

- Ha: p > that value (like in example 2), **or**

- Ha: p ≠ that value (like in example 3).

Note that it was quite clear from the context which form of the alternative hypothesis would be appropriate. The value that is specified in the null hypothesis is called the **null value**, and is generally denoted by p_{0}. We can say, therefore, that in general the null hypothesis about the population proportion (p) would take the form:

- Ho: p = p_{0}

We write Ho: p = p_{0} to say that we are making the hypothesis that the population proportion has the value of p_{0}. In other words, p is the unknown population proportion and p_{0} is the number we think p might be for the given situation.

The alternative hypothesis takes one of the following three forms (depending on the context):

- Ha: p < p_{0} **(one-sided)**

- Ha: p > p_{0} **(one-sided)**

- Ha: p ≠ p_{0} **(two-sided)**

The first two possible forms of the alternatives (where the = sign in Ho is challenged by < or >) are called **one-sided alternatives**, and the third form of alternative (where the = sign in Ho is challenged by ≠) is called a **two-sided alternative.** To understand the intuition behind these names let’s go back to our examples.

Example 3 (death penalty) is a case where we have a two-sided alternative:

**Ho:** p = 0.64 (No change from 2003).

**Ha:** p ≠ 0.64 (Some change since 2003).

In this case, in order to reject Ho and accept Ha we will need to get a sample proportion of death penalty supporters which is very different from 0.64 **in either direction,** either much larger or much smaller than 0.64.

In example 2 (marijuana use) we have a one-sided alternative:

**Ho:** p = 0.157 (same as among all college students in the country).

**Ha:** p > 0.157 (higher than the national figure).

Here, in order to reject Ho and accept Ha we will need to get a sample proportion of marijuana users which is much **higher** than 0.157.

Similarly, in example 1 (defective products), where we are testing:

**Ho:** p = 0.20 (No change; the repair did not help).

**Ha:** p < 0.20 (The repair was effective at reducing the proportion of defective parts).

in order to reject Ho and accept Ha, we will need to get a sample proportion of defective products which is much **smaller** than 0.20.

We have not yet discussed the fact that we are not guaranteed to make the correct decision by this process of hypothesis testing. Maybe you are beginning to see that there is always some level of uncertainty in statistics.

Let’s think about what we know already and define the possible errors we can make in hypothesis testing. When we conduct a hypothesis test, we choose one of two possible conclusions based upon our data.

If the **p-value is smaller than your pre-specified significance level (α, alpha), you reject the null hypothesis** and either

- You have made the correct decision since the null hypothesis is false

OR

- You have made an error (**Type I**) and rejected Ho when in fact Ho is true (your data happened to be a RARE EVENT under Ho)

If the **p-value is greater than (or equal to) your chosen significance level (α, alpha), you fail to reject the null hypothesis** and either

- You have made the correct decision since the null hypothesis is true

OR

- You have made an error (**Type II**) and failed to reject Ho when in fact Ho is false (the alternative hypothesis, Ha, is true)

The following summarizes the four possible results which can be obtained from a hypothesis test. Notice the rows represent the decision made in the hypothesis test and the columns represent the (usually unknown) truth in reality.

Although the truth is unknown in practice – or we would not be conducting the test – we know it must be the case that either the null hypothesis is true or the null hypothesis is false. It is also the case that** either decision we make in a hypothesis test can result in an incorrect conclusion!**

A **TYPE I Error** occurs when we Reject Ho when, in fact, Ho is True. In this case, **we mistakenly reject a true null hypothesis.**

- P(TYPE I Error) = P(Reject Ho | Ho is True) = α = alpha = **Significance Level**

A **TYPE II Error** occurs when we fail to Reject Ho when, in fact, Ho is False. In this case, **we fail to reject a false null hypothesis.**

- P(TYPE II Error) = P(Fail to Reject Ho | Ho is False) = β = beta

When our significance level is 5%, we are saying that we will allow ourselves to make a Type I error at most 5% of the time. In the long run, if we repeat the process when the null hypothesis is in fact true, about 5% of the time we will find a p-value < 0.05 and wrongly reject Ho.

In this case, our data represent a rare occurrence which is unlikely to happen but is still possible. For example, suppose we toss a coin 10 times and obtain 10 heads; this is unlikely for a fair coin but not impossible. We might conclude the coin is unfair when in fact we simply saw a very rare event for this fair coin.

**Our testing procedure CONTROLS for the Type I error when we set a pre-determined value for the significance level.**
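This long-run control of the Type I error rate can be checked by simulation. The sketch below is not from the course materials: it repeatedly draws samples from a population in which Ho is true and runs a two-sided z-test, so roughly 5% of the tests reject. The values μ = 100, σ = 16, and n = 10 anticipate the IQ example discussed below; the seed and trial count are arbitrary choices.

```python
import random
from statistics import NormalDist

# Repeatedly test a TRUE null hypothesis (mu = 100) and count
# how often we (wrongly) reject at alpha = 0.05.
random.seed(1)
mu0, sigma, n, alpha = 100, 16, 10, 0.05
se = sigma / n ** 0.5
z = NormalDist()  # standard normal

rejections = 0
trials = 10_000
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]  # Ho is true
    xbar = sum(sample) / n
    p_value = 2 * (1 - z.cdf(abs((xbar - mu0) / se)))      # two-sided z-test
    if p_value < alpha:
        rejections += 1                                    # a Type I error

print(rejections / trials)  # roughly 0.05
```

Each rejection here is a Type I error, since the null hypothesis really is true in every trial.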

Notice that these probabilities are conditional probabilities. This is one more reason why conditional probability is an important concept in statistics.

Unfortunately, calculating the probability of a Type II error requires us to know the truth about the population. In practice we can only calculate this probability using a series of “what if” calculations which depend upon the type of problem.

**Comment:** As you initially read through the examples below, focus on the broad concepts instead of the small details. It is not important to understand how to calculate these values yourself at this point.

- Try to understand the pictures we present. Which pictures represent an assumed null hypothesis and which represent an alternative?
- It may be useful to come back to this page (and the activities here) after you have reviewed the rest of the section on hypothesis testing and have worked a few problems yourself.

Here are two examples of using an older version of this applet. It looks slightly different but the same settings and options are available in the version above.

In both cases we will consider IQ scores.

Our null hypothesis is that the true mean is 100. Assume the standard deviation is 16 and we will specify a significance level of 5%.

In this example we will specify that the true mean is indeed 100 so that the null hypothesis is true. Most of the time (95%), when we generate a sample, we should fail to reject the null hypothesis since the null hypothesis is indeed true.

Here is one sample that results in a correct decision:

In the sample above, we obtain an x-bar of 105, which is drawn on the distribution which assumes μ (mu) = 100 (the null hypothesis is true). Notice the sample is shown as blue dots along the x-axis and the shaded region shows for which values of x-bar we would reject the null hypothesis. In other words, we would reject Ho whenever the x-bar falls in the shaded region.

Enter the same values and generate samples until you obtain a Type I error (you falsely reject the null hypothesis). You should see something like this:

If you were to generate 100 samples, you should have around 5% where you rejected Ho. These would be samples which would result in a Type I error.

The previous example illustrates a correct decision and a Type I error when the null hypothesis is true. The next example illustrates a correct decision and Type II error when the null hypothesis is false. In this case, we must specify the true population mean.

Let’s suppose we are sampling from an honors program and that the true mean IQ for this population is 110. We do not know the probability of a Type II error without more detailed calculations.

Let’s start with a sample which results in a correct decision.

In the sample above, we obtain an x-bar of 111, which is drawn on the distribution which assumes μ (mu) = 100 (the null hypothesis is true).

Enter the same values and generate samples until you obtain a Type II error (you fail to reject the null hypothesis). You should see something like this:

You should notice that in this case (when Ho is false), it is easier to obtain an incorrect decision (a Type II error) than it was in the case where Ho is true. If you generate 100 samples, you can approximate the probability of a Type II error.

We can find the probability of a Type II error by visualizing both the assumed distribution and the true distribution together. The image below is adapted from an applet we will use when we discuss the power of a statistical test.

There is a 37.4% chance that, in the long run, we will make a Type II error and fail to reject the null hypothesis when in fact the true mean IQ is 110 in the population from which we sample our 10 individuals.
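That probability can be checked with a direct normal-distribution calculation. The sketch below assumes a one-sided z-test of Ho: μ = 100 vs. Ha: μ > 100 with σ = 16, n = 10, α = 0.05 (the one-sided form is our assumption about the applet's settings); it gives β ≈ 0.37, essentially the applet's 37.4% up to rounding and implementation details.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal
mu0, mu_true, sigma, n, alpha = 100, 110, 16, 10, 0.05
se = sigma / n ** 0.5                     # standard error of x-bar

# Rejection cutoff under Ho for a one-sided test at level alpha
cutoff = mu0 + z.inv_cdf(1 - alpha) * se  # about 108.3

# Type II error: x-bar falls below the cutoff even though mu is really 110
beta = z.cdf((cutoff - mu_true) / se)
print(round(beta, 3))                     # about 0.37
```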

Can you visualize what will happen if the true population mean is really 115 or 108? When will the Type II error increase? When will it decrease? We will look at this idea again when we discuss the concept of power in hypothesis tests.

**Comments:**

- It is important to note that there is a trade-off between the probability of a Type I and a Type II error. If we decrease the probability of one of these errors, the probability of the other will increase! The practical result of this is that if we require stronger evidence to reject the null hypothesis (smaller significance level = probability of a Type I error), we will increase the chance that we will be unable to reject the null hypothesis when in fact Ho is false (increases the probability of a Type II error).

- When α (alpha) = 0.05 we obtained a Type II error probability of 0.374 = β = beta

- When α (alpha) = 0.01 (smaller than before) we obtain a Type II error probability of 0.644 = β = beta (larger than before)

- As the blue line in the picture moves farther right, the significance level (α, alpha) is decreasing and the Type II error probability is increasing.

- As the blue line in the picture moves farther left, the significance level (α, alpha) is increasing and the Type II error probability is decreasing.

Let’s return to our very first example and define these two errors in context.

A case of suspected cheating on an exam is brought in front of the disciplinary committee at a certain university.

There are **two** opposing **claims** in this case:

- Ho = The **student’s claim:** I did not cheat on the exam.

- Ha = The **instructor’s claim:** The student did cheat on the exam.

Adhering to the principle **“innocent until proven guilty,”** the committee asks the instructor for **evidence** to support his claim.

There are four possible outcomes of this process. There are two possible correct decisions:

- The student did cheat on the exam and the instructor brings enough evidence to reject Ho and conclude the student did cheat on the exam. This is a CORRECT decision!

- The student did not cheat on the exam and the instructor fails to provide enough evidence that the student did cheat on the exam. This is a CORRECT decision!

Both the correct decisions and the possible errors are fairly easy to understand but with the errors, you must be careful to identify and define the two types correctly.

**TYPE I Error:** Reject Ho when Ho is True

- The student did not cheat on the exam but the instructor brings enough evidence to reject Ho and conclude the student cheated on the exam. This is a Type I Error.

**TYPE II Error:** Fail to Reject Ho when Ho is False

- The student did cheat on the exam but the instructor fails to provide enough evidence that the student cheated on the exam. This is a Type II Error.

In most situations, including this one, it is more “acceptable” to have a Type II error than a Type I error. Although allowing a student who cheats to go unpunished might be considered a very bad problem, punishing a student for something he or she did not do is usually considered to be a more severe error. This is one reason we control for our Type I error in the process of hypothesis testing.

**Comment:**

- The probabilities of Type I and Type II errors are closely related to the concepts of sensitivity and specificity that we discussed previously. Consider the following hypotheses:

**Ho:** The individual does not have diabetes (status quo, nothing special happening)

**Ha:** The individual does have diabetes (something is going on here)

In this setting:

When someone tests positive for diabetes we would reject the null hypothesis and conclude the person has diabetes (we may or may not be correct!).

When someone tests negative for diabetes we would fail to reject the null hypothesis so that we fail to conclude the person has diabetes (we may or may not be correct!).

Let’s take it one step further:

Sensitivity = P(Test + | Have Disease) which in this setting equals

P(Reject Ho | Ho is False) = 1 – P(Fail to Reject Ho | Ho is False) = 1 – β = 1 – beta

Specificity = P(Test – | No Disease) which in this setting equals

P(Fail to Reject Ho | Ho is True) = 1 – P(Reject Ho | Ho is True) = 1 – α = 1 – alpha

Notice that sensitivity and specificity relate to the probability of making a correct decision whereas α (alpha) and β (beta) relate to the probability of making an incorrect decision.

Usually α (alpha) = 0.05 so that the specificity listed above is 0.95 or 95%.
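Stated as a tiny calculation (a sketch, not part of the original text, using the α from this section and the β = 0.374 reported for the IQ example):

```python
alpha = 0.05   # P(Type I error)  = P(Reject Ho | Ho is True)
beta = 0.374   # P(Type II error) = P(Fail to Reject Ho | Ho is False)

specificity = 1 - alpha   # P(Fail to Reject Ho | Ho is True)
sensitivity = 1 - beta    # P(Reject Ho | Ho is False) = power of the test
print(round(specificity, 3), round(sensitivity, 3))  # 0.95 0.626
```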

Next, we will see that the sensitivity listed above is the **power** of the hypothesis test!

Assuming that you have obtained a quality sample:

- The reason for a Type I error is random chance.
- When a Type I error occurs, our observed data represented a rare event which indicated evidence in favor of the alternative hypothesis even though the null hypothesis was actually true.

Again, assuming that you have obtained a quality sample, now we have a few possibilities depending upon the true difference that exists.

- The sample size is too small to detect an important difference. This is the worst case; you should have obtained a larger sample. In this situation, you may notice that the effect seen in the sample seems PRACTICALLY significant and yet the p-value is not small enough to reject the null hypothesis.

- The sample size is reasonable for the important difference but the true difference (which might be somewhat meaningful or interesting) is smaller than your test was capable of detecting. This is tolerable as you were not interested in being able to detect this difference when you began your study. In this situation, you may notice that the effect seen in the sample seems to have some potential for practical significance.

- The sample size is more than adequate; the difference that was not detected is meaningless in practice. This is not a problem at all and is in effect a “correct decision” since the difference you did not detect would have no practical meaning.
- Note: We will discuss the idea of practical significance later in more detail.

It is often the case that we truly wish to prove the alternative hypothesis. It is reasonable that we would be interested in the probability of correctly rejecting the null hypothesis. In other words, the probability of rejecting the null hypothesis, when in fact the null hypothesis is false. This can also be thought of as the probability of being able to detect a (pre-specified) difference of interest to the researcher.

Let’s begin with a realistic example of how power can be described in a study.

In a clinical trial to study two medications for weight loss, we have an 80% chance to detect a difference in the weight loss between the two medications of 10 pounds. In other words, the power of the hypothesis test we will conduct is 80%.

In other words, if one medication comes from a population with an average weight loss of 25 pounds and the other comes from a population with an average weight loss of 15 pounds, we will have an 80% chance to detect that difference using the sample we have in our trial.

If we were to repeat this trial many times, 80% of the time we will be able to reject the null hypothesis (that there is no difference between the medications) and 20% of the time we will fail to reject the null hypothesis (and make a Type II error!).

The difference of 10 pounds in the previous example, is often called the **effect size**. The measure of the effect differs depending on the particular test you are conducting but is always some measure related to the true effect in the population. In this example, it is the difference between two population means.

Recall the definition of a Type II error:

A **TYPE II Error** occurs when we fail to Reject Ho when, in fact, Ho is False. In this case, **we fail to reject a false null hypothesis.**

P(TYPE II Error) = P(Fail to Reject Ho | Ho is False) = β = beta

Notice that P(Reject Ho | Ho is False) = 1 – P(Fail to Reject Ho | Ho is False) = 1 – β = 1 – beta.

The **POWER** of a hypothesis test is the **probability of rejecting the null hypothesis when the null hypothesis is false**. This can also be stated as the **probability of correctly rejecting the null hypothesis**.

**POWER** = P(Reject Ho | Ho is False) = 1 – β = 1 – beta

Power is the test’s ability to correctly reject the null hypothesis. **A test with high power has a good chance of being able to detect the difference of interest to us, if it exists**.

As we mentioned on the bottom of the previous page, this can be thought of as the sensitivity of the hypothesis test if you imagine Ho = No disease and Ha = Disease.

The power of a hypothesis test is affected by numerous quantities (similar to the margin of error in a confidence interval).

Assume that the null hypothesis is false for a given hypothesis test. All else being equal, we have the following:

- Larger samples result in a greater chance to reject the null hypothesis which means an increase in the power of the hypothesis test.

- If the **effect size** is larger, it will become easier for us to detect. This results in a greater chance to reject the null hypothesis, which means an increase in the power of the hypothesis test. The effect size varies for each test and is usually closely related to the difference between the hypothesized value and the true value of the parameter under study.

- From the relationship between the probability of a Type I and a Type II error (as α (alpha) decreases, β (beta) increases), we can see that as α (alpha) decreases, Power = 1 – β = 1 – beta also decreases.

- There are other mathematical ways to change the power of a hypothesis test, such as changing the population standard deviation; however, these are not quantities that we can usually control so we will not discuss them here.
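These effects can be made concrete with the IQ example (Ho: μ = 100, σ = 16). The sketch below assumes a one-sided z-test; the helper function, sample sizes, and true means are illustrative choices, not from the text.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

def power(mu_true, n, mu0=100, sigma=16, alpha=0.05):
    """Power of a one-sided z-test of Ho: mu = mu0 vs Ha: mu > mu0."""
    se = sigma / n ** 0.5
    cutoff = mu0 + z.inv_cdf(1 - alpha) * se  # reject Ho when x-bar > cutoff
    return 1 - z.cdf((cutoff - mu_true) / se)

# Larger sample -> higher power (true mean fixed at 110)
for n in (10, 20, 40):
    print(n, round(power(110, n), 3))

# Larger effect size -> higher power (n fixed at 10)
for mu in (105, 110, 115):
    print(mu, round(power(mu, 10), 3))
```

With n = 10 and a true mean of 110, the power is about 0.63, which is 1 minus the β ≈ 0.37 found earlier.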

For most grants involving statistical analysis, power calculations must be completed to illustrate that the study will have a reasonable chance to detect an important effect. Otherwise, the money spent on the study could be wasted. The goal is usually to have a power close to 80%.

For example, if there is only a 5% chance to detect an important difference between two treatments in a clinical trial, this would result in a waste of time, effort, and money on the study since, when the alternative hypothesis is true, the chance a treatment effect can be found is very small.

**Comment:**

- In order to calculate the power of a hypothesis test, we must specify the “truth.” As we mentioned previously when discussing Type II errors, in practice we can only calculate this probability using a series of “what if” calculations which depend upon the type of problem.

The following activity involves working with an interactive applet to study power more carefully.

The following reading is an excellent discussion about Type I and Type II errors.

We will not be asking you to perform power calculations manually. You may be asked to use online calculators and applets. Most statistical software packages offer some ability to complete power calculations. There are also many online calculators for power and sample size on the internet, for example, Russ Lenth’s power and sample-size page.

Now that we understand the general idea of how statistical hypothesis testing works, let’s go back to each of the steps and delve slightly deeper, getting more details and learning some terminology.

In all three examples, our aim is to decide between two opposing points of view, Claim 1 and Claim 2. In hypothesis testing, **Claim 1** is called the **null hypothesis** (denoted “**Ho**“), and **Claim 2** plays the role of the **alternative hypothesis** (denoted “**Ha**“). As we saw in the three examples, the null hypothesis suggests nothing special is going on; in other words, there is no change from the status quo, no difference from the traditional state of affairs, no relationship. In contrast, the alternative hypothesis disagrees with this, stating that something is going on, or there is a change from the status quo, or there is a difference from the traditional state of affairs. The alternative hypothesis, Ha, usually represents what we want to check or what we suspect is really going on.

Let’s go back to our three examples and apply the new notation:

**In example 1:**

**Ho:** The proportion of smokers at GU is 0.20.

**Ha:** The proportion of smokers at GU is less than 0.20.

**In example 2:**

**Ho:** The mean concentration in the shipment is the required 245 ppm.

**Ha:** The mean concentration in the shipment is not the required 245 ppm.

**In example 3:**

**Ho:** Performance on the SAT is not related to gender (males and females score the same).

**Ha:** Performance on the SAT is related to gender – males score higher.

This step is pretty obvious. This is what inference is all about. You look at sampled data in order to draw conclusions about the entire population. In the case of hypothesis testing, based on the data, you draw conclusions about whether or not there is enough evidence to reject Ho.

There is, however, one detail that we would like to add here. In this step we collect data and **summarize** it. Go back and look at the second step in our three examples. Note that in order to summarize the data we used simple sample statistics such as the sample proportion (*p*-hat), sample mean (x-bar) and the sample standard deviation (s).

In practice, you go a step further and use these sample statistics to summarize the data with what’s called a **test statistic**. We are not going to go into any details right now, but we will discuss test statistics when we go through the specific tests.

This step will also involve checking any conditions or assumptions required to use the test.

As we saw, this is the step where we calculate how likely is it to get data like that observed (or more extreme) when Ho is true. In a sense, this is the heart of the process, since we draw our conclusions based on this probability.

- If this probability is very small (see example 2), then it would be very surprising to get data like that observed (or more extreme) if Ho were true. The fact that we **did** observe such data is therefore evidence against Ho, and we should reject it.
- On the other hand, if this probability is not very small (see example 3), then observing data like that observed (or more extreme) is not very surprising if Ho were true. The fact that we observed such data does not provide evidence against Ho.

This crucial probability, therefore, has a special name. It is called the **p-value** of the test.

In our three examples, the p-values were given to you (and you were reassured that you didn’t need to worry about how these were derived yet):

- Example 1: p-value = 0.106
- Example 2: p-value = 0.0007
- Example 3: p-value = 0.29

Obviously, the smaller the p-value, the more surprising it is to get data like ours (or more extreme) when Ho is true, and therefore, the stronger the evidence the data provide against Ho.

Looking at the three p-values of our three examples, we see that the data that we observed in example 2 provide the strongest evidence against the null hypothesis, followed by example 1, while the data in example 3 provide the least evidence against Ho.

**Comment:**

- Right now we will not go into specific details about p-value calculations, but just mention that since the p-value is the probability of getting **data** like those observed (or more extreme) when Ho is true, it makes sense that the calculation of the p-value is based on the data summary, which, as we mentioned, is the test statistic. Indeed, this is the case. In practice, we will mostly use software to provide the p-value for us.

Since our statistical conclusion is based on how small the p-value is, or in other words, how surprising our data are when Ho is true, it would be nice to have some kind of guideline or cutoff that will help determine how small the p-value must be, or how “rare” (unlikely) our data must be when Ho is true, for us to conclude that we have enough evidence to reject Ho.

This cutoff exists, and because it is so important, it has a special name. It is called the **significance level of the test** and is usually denoted by the Greek letter α (alpha). The most commonly used significance level is α (alpha) = 0.05 (or 5%). This means that:

- if the p-value < α (alpha) (usually 0.05), then the data we obtained are considered to be “rare (or surprising) enough” under the assumption that Ho is true, and we say that the data provide statistically significant evidence against Ho, so we reject Ho and thus accept Ha.
- if the p-value ≥ α (alpha) (usually 0.05), then our data are not considered to be “surprising enough” under the assumption that Ho is true, and we say that our data do not provide enough evidence to reject Ho (or, equivalently, that the data do not provide enough evidence to accept Ha).

Now that we have a cutoff to use, here are the appropriate conclusions for each of our examples based upon the p-values we were given.

**In Example 1:**

- Using our cutoff of 0.05, we fail to reject Ho. **Conclusion**: There **IS NOT** enough evidence that the proportion of smokers at GU is less than 0.20. **Still we should consider:** Does the evidence seen in the data provide any practical evidence towards our alternative hypothesis?

**In Example 2:**

- Using our cutoff of 0.05, we reject Ho. **Conclusion**: There **IS** enough evidence that the mean concentration in the shipment is not the required 245 ppm. **Still we should consider:** Does the evidence seen in the data provide any practical evidence towards our alternative hypothesis?

**In Example 3:**

- Using our cutoff of 0.05, we fail to reject Ho. **Conclusion**: There **IS NOT** enough evidence that males score higher on average than females on the SAT. **Still we should consider:** Does the evidence seen in the data provide any practical evidence towards our alternative hypothesis?

Notice that all of the above conclusions are written in terms of the alternative hypothesis and are given in the context of the situation. In no situation have we claimed the null hypothesis is true. Be very careful of this and other issues discussed in the following comments.

**Comments:**

- Although the significance level provides a good guideline for drawing our conclusions, it should not be treated as an incontrovertible truth. There is a lot of room for personal interpretation. What if your p-value is 0.052? You might want to stick to the rules and say “0.052 > 0.05 and therefore I don’t have enough evidence to reject Ho,” but you might decide that 0.052 is small enough for you to believe that Ho should be rejected. It should be noted that scientific journals generally consider 0.05 to be the cutoff point: any p-value below the cutoff indicates enough evidence against Ho, and any p-value above it, **or even equal to it**, indicates there is not enough evidence against Ho. That said, a p-value between 0.05 and 0.10 is often reported as marginally statistically significant.

- It is important to draw your conclusions **in context**. It is **never enough** to say: **“p-value = …, and therefore I have enough evidence to reject Ho at the 0.05 significance level.”** You **should always word your conclusion in terms of the data.** Although we will use the terminology of “rejecting Ho” or “failing to reject Ho,” this is mostly because we are instructing you in these concepts; in practice, this language is rarely used. We also suggest writing your conclusion in terms of the alternative hypothesis: is there or is there not enough evidence that the alternative hypothesis is true?

- Let’s go back to the issue of the nature of the two types of conclusions that I can make.

*Either* **I reject Ho (when the p-value is smaller than the significance level)** *or* **I cannot reject Ho (when the p-value is larger than the significance level).**

As we mentioned earlier, note that the second conclusion does not imply that I accept Ho, but just that I don’t have enough evidence to reject it. Saying (by mistake) “I don’t have enough evidence to reject Ho so I accept it” indicates that the data provide evidence that Ho is true, which is **not necessarily the case**. Consider the following slightly artificial yet effective example:

An employer claims to subscribe to an “equal opportunity” policy, not hiring men any more often than women for managerial positions. Is this credible? You’re not sure, so you want to test the following **two hypotheses:**

**Ho:** The proportion of male managers hired is 0.5

**Ha:** The proportion of male managers hired is more than 0.5

**Data:** You choose at random three of the new managers who were hired in the last 5 years and find that all 3 are men.

**Assessing Evidence:** If the proportion of male managers hired is really 0.5 (Ho is true), then the probability that the random selection of three managers will yield three males is therefore 0.5 * 0.5 * 0.5 = 0.125. This is the p-value (using the multiplication rule for independent events).
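A quick check of this arithmetic, and of the conclusion that follows:

```python
# P(all three randomly selected managers are male | true proportion = 0.5),
# by the multiplication rule for independent events
p_value = 0.5 * 0.5 * 0.5
print(p_value)          # 0.125

# At the 0.05 significance level, this is not small enough to reject Ho
print(p_value < 0.05)   # False
```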

**Conclusion:** Using 0.05 as the significance level, you conclude that since the p-value = 0.125 > 0.05, the fact that the three randomly selected managers were all males is not enough evidence to reject the employer’s claim of subscribing to an equal opportunity policy (Ho).

However, **the data (all three selected are males) definitely does NOT provide evidence to accept the employer’s claim (Ho).**

**Comment about wording:** Another common wording in scientific journals is:

- “The results are statistically significant” – when the p-value < α (alpha).
- “The results are not statistically significant” – when the p-value > α (alpha).

Often you will see significance levels reported with additional description to indicate the degree of statistical significance. A general guideline (although not required in our course) is:

- If 0.01 ≤ p-value < 0.05, then the results are (statistically) *significant*.
- If 0.001 ≤ p-value < 0.01, then the results are *highly statistically significant*.
- If p-value < 0.001, then the results are *very highly statistically significant*.
- If p-value > 0.05, then the results are *not statistically significant* (NS).
- If 0.05 ≤ p-value < 0.10, then the results are *marginally statistically significant*.

We learned quite a lot about hypothesis testing. We learned the logic behind it, what the key elements are, and what types of conclusions we can and cannot draw in hypothesis testing. Here is a quick recap:

Here are a few more activities if you need some additional practice.

**Comments:**

- Notice that **the p-value is an example of a conditional probability**. We calculate the probability of obtaining results like those of our data (or more extreme) GIVEN the null hypothesis is true. We could write P(Obtaining results like ours or more extreme | Ho is True).

- Another common phrase used to define the p-value is: “**The probability of obtaining a statistic as or more extreme than your result given the null hypothesis is TRUE**.”
- We could write P(Obtaining a test statistic as or more extreme than ours | Ho is True).
- In this case we are asking “Assuming the null hypothesis is true, how rare is it to observe something as or more extreme than what I have found in my data?”
- If, after assuming the null hypothesis is true, what we have found in our data is extremely rare (small p-value), this provides evidence to reject our assumption that Ho is true in favor of Ha.

- The **p-value can also be thought of as the probability, assuming the null hypothesis is true, that the result we have seen is solely due to random error (or random chance).** We have already seen that statistics from samples collected from a population vary. There is random error or random chance involved when we sample from populations.

In this setting, if the p-value is very small, this implies, assuming the null hypothesis is true, that it is extremely unlikely that the results we have obtained would have happened due to random error alone, and thus our assumption (Ho) is rejected in favor of the alternative hypothesis (Ha).

**It is EXTREMELY important that you find a definition of the p-value which makes sense to you. New students often need to contemplate this idea repeatedly through a variety of examples and explanations before becoming comfortable with this idea. It is one of the two most important concepts in statistics (the other being confidence intervals).**

**Remember:**

- We infer that the alternative hypothesis is true ONLY by rejecting the null hypothesis.
- A statistically significant result is one that has a very low probability of occurring if the null hypothesis is true.
- Results which are **statistically** significant may or may not have **practical** significance and vice versa.

As we mentioned in the introduction to Unit 4A, when the variable that we’re interested in studying in the population is **categorical**, the parameter we are trying to infer about is the **population proportion (p)** associated with that variable. We also learned that the point estimator for the population proportion p is the sample proportion p-hat.

To refresh your memory, here is a picture that summarizes an example we looked at.

We are now moving on to interval estimation of p. In other words, we would like to develop a set of intervals that, with different levels of confidence, will capture the value of p. We’ve actually done all the groundwork and discussed all the big ideas of interval estimation when we talked about interval estimation for μ (mu), so we’ll be able to go through it much faster. Let’s begin.

Recall that the general form of any confidence interval for an unknown parameter is:

*estimate* ± *margin of error*

Since the unknown parameter here is the population proportion p, the point estimator (as I reminded you above) is the sample proportion p-hat. The confidence interval for p, therefore, has the form:

p-hat ± m

(Recall that m is the notation for the margin of error.) The margin of error (m) gives us the maximum estimation error with a certain confidence. In this case it tells us that p-hat is different from p (the parameter it estimates) by no more than m units.

From our previous discussion on confidence intervals, we also know that the margin of error is the product of two components: a confidence multiplier and the standard deviation (standard error) of the estimator.

To figure out what these two components are, we need to go back to a result we obtained in the Sampling Distributions section of the Probability unit about the sampling distribution of p-hat. We found that under certain conditions (which we’ll come back to later), p-hat has a normal distribution with mean p and a standard deviation of √(p(1 − p)/n).

This result makes things very simple for us, because it reveals what the two components are that the margin of error is made of:

- Since, like the sampling distribution of x-bar, the sampling distribution of p-hat is normal, the confidence multipliers that we’ll use in the confidence interval for p will be the same z* multipliers we use for the confidence interval for μ (mu) when σ (sigma) is known (using **exactly** the same reasoning and the same probability results). The multipliers we’ll use, then, are: **1.645, 1.96, and 2.576 at the 90%, 95%, and 99% confidence levels, respectively.**

- The standard deviation of our estimator p-hat is √(p(1 − p)/n).

Putting it all together, we find that the confidence interval for p should be:

p-hat ± z* √(p(1 − p)/n)

We just have to solve one practical problem and we’re done. We’re trying to estimate the **unknown** population proportion *p*, so having it appear in the confidence interval doesn’t make any sense. To overcome this problem, we’ll do the obvious thing …

We’ll replace p with its sample counterpart, p-hat, and work with the **estimated standard error of p-hat**, √(p-hat(1 − p-hat)/n).

Now we’re done. The **confidence interval for the population proportion p** is:

p-hat ± z* sqrt(p-hat(1 - p-hat)/n)

The drug Viagra became available in the U.S. in May, 1998, in the wake of an advertising campaign that was unprecedented in scope and intensity. A Gallup poll found that by the end of the first week in May, 643 out of a random sample of 1,005 adults were aware that Viagra was an impotency medication (based on “Viagra A Popular Hit,” a Gallup poll analysis by Lydia Saad, May 1998).

Let’s estimate the proportion p of all adults in the U.S. who by the end of the first week of May 1998 were already aware of Viagra and its purpose by setting up a 95% confidence interval for p.

We first need to calculate the sample proportion p-hat. Out of 1,005 sampled adults, 643 knew what Viagra is used for, so p-hat = 643/1005 = 0.64

Therefore, a 95% confidence interval for p is

0.64 ± 1.96 sqrt(0.64(1 - 0.64)/1005) = 0.64 ± 0.03 = (0.61, 0.67)

We can be 95% confident that the proportion of all U.S. adults who were already familiar with Viagra by that time was between 0.61 and 0.67 (or 61% and 67%).

The fact that the margin of error equals 0.03 says we can be 95% confident that the unknown population proportion p is within 0.03 (3%) of the observed sample proportion 0.64 (64%). In other words, we are 95% confident that 64% is “off” by no more than 3%.
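This calculation is easy to script. Here is a minimal Python sketch (the function name is our own) that reproduces the Viagra-poll interval from the sample counts:

```python
from math import sqrt

def proportion_ci(p_hat, n, z_star=1.96):
    """Confidence interval for a population proportion:
    p-hat +/- z* sqrt(p-hat(1 - p-hat)/n)."""
    se = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p-hat
    m = z_star * se                     # margin of error
    return p_hat - m, p_hat + m

# Viagra poll: 643 of 1,005 sampled adults
p_hat = 643 / 1005
low, high = proportion_ci(p_hat, 1005)  # roughly (0.61, 0.67)
```

Changing `z_star` to 1.645 or 2.576 gives the 90% and 99% intervals instead.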

**Comment:**

- We would like to share with you the methodology portion of the official poll release for the Viagra example. We hope you see that you now have the tools to understand how poll results are analyzed:

“The results are based on telephone interviews with a randomly selected national sample of 1,005 adults, 18 years and older, conducted May 8-10, 1998. For results based on samples of this size, one can say with 95 percent confidence that the error attributable to sampling and other random effects could be plus or minus 3 percentage points. In addition to sampling error, question wording and practical difficulties in conducting surveys can introduce error or bias into the findings of public opinion polls.”

The purpose of the next activity is to provide guided practice in calculating and interpreting the confidence interval for the population proportion p, and drawing conclusions from it.

Two important results that we discussed at length when we talked about the confidence interval for μ (mu) also apply here:

1. There is a trade-off between level of confidence and the width (or precision) of the confidence interval. The more precision you would like the confidence interval for p to have, the more you have to pay by having a lower level of confidence.

2. Since n appears in the denominator of the margin of error of the confidence interval for p, for a fixed level of confidence, the larger the sample, the narrower, or more precise, the confidence interval is. This brings us naturally to our next point.

Just as we did for means, when we have some level of flexibility in determining the sample size, we can set a desired margin of error for estimating the population proportion and find the sample size that will achieve that.

For example, a final poll on the day before an election would want the margin of error to be quite small (with a high level of confidence) in order to be able to predict the election results with the most precision. This is particularly relevant when it is a close race between the candidates. The polling company needs to figure out how many eligible voters it needs to include in its sample in order to achieve that.

Let’s see how we do that.

(**Comment:** For our discussion here we will focus on a 95% confidence level (z* = 1.96), since this is the most commonly used level of confidence.)

The confidence interval for p is

p-hat ± 1.96 sqrt(p-hat(1 - p-hat)/n)

The margin of error, then, is

m = 1.96 sqrt(p-hat(1 - p-hat)/n)

Now we isolate n (i.e., express it as a function of m):

n = 1.96^2 p-hat(1 - p-hat) / m^2

There is a practical problem with this expression that we need to overcome.

Practically, you first determine the sample size, then you choose a random sample of that size, and then use the collected data to find p-hat.

So the fact that the expression above for determining the sample size depends on p-hat is problematic.

The way to overcome this problem is to take the conservative approach by setting p-hat = 1/2 = 0.5.

Why do we call this approach conservative?

It is conservative because the expression that appears in the numerator, p-hat(1 - p-hat), is maximized when p-hat = 1/2 = 0.5.

That way, the n we get will work in giving us the desired margin of error regardless of what the value of p-hat is. This is a “worst case scenario” approach. So when we do that we get:

n = (1.96^2 × 0.25) / m^2 = 0.9604 / m^2

In general, for any confidence level we have

- If we know a reasonable estimate of the proportion, we can use: n = z*^2 p-hat(1 - p-hat) / m^2

- If we choose the conservative estimate, assuming we know nothing about the true proportion, we use: n = z*^2 × 0.25 / m^2

It seems like media polls usually use a sample size of 1,000 to 1,200. This could be puzzling.

How could the results obtained from, say, 1,100 U.S. adults give us information about the entire population of U.S. adults? 1,100 is such a tiny fraction of the actual population. Here is the answer:

What sample size n is needed if a margin of error m = 0.03 is desired?

n = 0.9604 / 0.03^2 = 1,067.1, which we round up to n = 1,068

(remember, always round up). In fact, 0.03 is a very commonly used margin of error, especially for media polls. For this reason, most media polls work with a sample of around 1,100 people.
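The sample-size formula can be sketched in a few lines of Python (the function name is our own). The default `p_guess = 0.5` is the conservative choice, since p(1 - p) is largest at p = 0.5:

```python
from math import ceil

def required_n(m, p_guess=0.5, z_star=1.96):
    """Sample size needed so the margin of error of a CI for a
    proportion is at most m; always round up to the next integer."""
    return ceil(z_star**2 * p_guess * (1 - p_guess) / m**2)

# Conservative sample size for a 3-percentage-point margin of error:
n = required_n(0.03)  # 1068 -- the typical media-poll figure
```

Supplying a reasonable `p_guess` from a prior study gives a (usually smaller) non-conservative answer.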

As we mentioned before, one of the most important things to learn with any inference method is the conditions under which it is safe to use it.

As we did for the mean, the assumption we made in order to develop the methods in this unit was that the sampling distribution of the sample proportion, p-hat, is roughly normal. Recall from the Probability unit that the conditions under which this happens are that np ≥ 10 and n(1 - p) ≥ 10.

Since p is unknown, we will replace it with its estimate, the sample proportion, and set

n · p-hat ≥ 10 and n · (1 - p-hat) ≥ 10

to be the conditions under which it is safe to use the methods we developed in this section.
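Note that n · p-hat and n · (1 - p-hat) are simply the observed counts of successes and failures, so the check is easy to automate. A minimal sketch (the function name is our own):

```python
def conditions_met(successes, n):
    """Safe-to-use check for the proportion methods: both the number of
    successes (n * p-hat) and failures (n * (1 - p-hat)) must be >= 10."""
    return successes >= 10 and (n - successes) >= 10

# Viagra poll: 643 successes and 1005 - 643 = 362 failures, both well above 10
ok = conditions_met(643, 1005)
```

The sample must also be random; no software check can verify that part.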

Here is one final practice for these confidence intervals!

In general, a confidence interval for the unknown population proportion (p) is

p-hat ± z* sqrt(p-hat(1 - p-hat)/n)

where z* is 1.645 for 90% confidence, 1.96 for 95% confidence, and 2.576 for 99% confidence.

To obtain a desired margin of error (m) in a confidence interval for an unknown population proportion, a conservative sample size is

n = z*^2 × 0.25 / m^2

If a reasonable estimate of the true proportion is known, the sample size can be calculated using

n = z*^2 p-hat(1 - p-hat) / m^2

The methods developed in this unit are safe to use as long as the sample is random, n · p-hat ≥ 10, and n · (1 - p-hat) ≥ 10.

As we just learned, for a given level of confidence, the sample size determines the size of the margin of error and thus the width, or precision, of our interval estimation. This process can be reversed.

In situations where a researcher has some flexibility as to the sample size, the researcher can calculate in advance the sample size needed in order to report a confidence interval with a certain level of confidence and a certain margin of error. Let’s look at an example.

Recall the example about the SAT-M scores of community college students.

An educational researcher is interested in estimating μ (mu), the mean score on the math part of the SAT (SAT-M) of all community college students in his state. To this end, the researcher has chosen a random sample of 650 community college students from his state, and found that their average SAT-M score is 475. Based on a large body of research that was done on the SAT, it is known that the scores roughly follow a normal distribution, with the standard deviation σ (sigma) =100.

The 95% confidence interval for μ (mu) is

475 ± 1.96 × 100/sqrt(650)

which is roughly 475 ± 8, or (467, 483). For a sample size of n = 650, our margin of error is 8.
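The SAT-M interval can be reproduced with a short Python sketch (the function name is our own):

```python
from math import sqrt

def mean_ci(x_bar, sigma, n, z_star=1.96):
    """Confidence interval for mu when sigma is known:
    x-bar +/- z* sigma/sqrt(n)."""
    m = z_star * sigma / sqrt(n)  # margin of error
    return x_bar - m, x_bar + m

# SAT-M example: n = 650 students, x-bar = 475, sigma = 100
low, high = mean_ci(475, 100, 650)  # roughly (467, 483)
```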

Now, let’s think about this problem in a slightly different way:

An educational researcher is interested in estimating μ (mu), the mean score on the math part of the SAT (SAT-M) of all community college students in his state with a margin of error of (only) 5, at the 95% confidence level. What is the sample size needed to achieve this? σ (sigma), of course, is still assumed to be 100.

To solve this, we set (rounding the 95% multiplier 1.96 up to 2 for simplicity):

2 × 100/sqrt(n) = 5, so sqrt(n) = 40 and n = 1,600

So, for a sample size of 1,600 community college students, the researcher will be able to estimate μ (mu) with a margin of error of 5, at the 95% level. In this example, we can also imagine that the researcher has some flexibility in choosing the sample size, since there is a minimal cost (if any) involved in recording students’ SAT-M scores, and there are many more than 1,600 community college students in each state.

Rather than take the same steps to isolate n every time we solve such a problem, we may obtain a general expression for the required n for a desired margin of error m and a certain level of confidence.

Since

m = z* σ/sqrt(n)

is the formula to determine m for a given n, we can use simple algebra to express n in terms of m (multiply both sides by the square root of n, divide both sides by m, and square both sides) to get

n = (z* σ / m)^2

**Comment:**

- Clearly, the **sample size n must be an integer**.
- In the previous example we got n = 1,600, but in other situations, the calculation may give us a non-integer result.
- In these cases, we should always **round up to the next highest integer.**
- Using this “conservative approach,” we’ll achieve an interval at least as narrow as the one desired.

IQ scores are known to vary normally with a standard deviation of 15. How many students should be sampled if we want to estimate the population mean IQ at 99% confidence with a margin of error equal to 2?

n = (2.576 × 15 / 2)^2 = 373.3

Round up to be safe, and take a sample of 374 students.
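The same rounding-up rule is built into this Python sketch of the sample-size formula for a mean (the function name is our own):

```python
from math import ceil

def required_n_for_mean(sigma, m, z_star):
    """Sample size so that z* sigma/sqrt(n) <= m, i.e. n = (z* sigma/m)^2,
    rounded up to the next integer to be safe."""
    return ceil((z_star * sigma / m) ** 2)

# IQ example: sigma = 15, 99% confidence (z* = 2.576), margin of error m = 2
n = required_n_for_mean(15, 2, 2.576)  # 374 students
```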

The purpose of the next activity is to give you guided practice in sample size calculations for obtaining confidence intervals with a desired margin of error, at a certain confidence level. Consider the example from the previous Learn By Doing activity:

**Comment:**

- In the preceding activity, you saw that in order to calculate the sample size when planning a study, you needed to know the population standard deviation, sigma (σ). In practice, sigma is usually not known, because it is a parameter. (The rare exceptions are certain variables like IQ score or standardized tests that might be constructed to have a particular known sigma.)

Therefore, when researchers wish to compute the required sample size in preparation for a study, they use an **estimate** of sigma. Usually, sigma is estimated based on the standard deviation obtained in prior studies.

However, in some cases, there might not be any prior studies on the topic. In such instances, a researcher still needs to get a rough estimate of the standard deviation of the (yet-to-be-measured) variable, in order to determine the required sample size for the study. One way to get such a rough estimate is with the “range rule of thumb.” We will not cover this topic in depth but mention here that a very rough estimate of the standard deviation of a population is the range/4.

There are a few more things we need to discuss:

- Is it always OK to use the confidence interval we developed for μ (mu) when σ (sigma) is known?

- What if σ (sigma) is unknown?

- How can we use statistical software to calculate confidence intervals for us?

One of the most important things to learn with any inference method is the conditions under which it is safe to use it. It is very tempting to apply a certain method, but if the conditions under which this method was developed are not met, then using this method will lead to unreliable results, which can then lead to wrong and/or misleading conclusions. As you’ll see throughout this section, we will always discuss the conditions under which each method can be safely used.

In particular, the confidence interval for μ (mu), when σ (sigma) is known:

x-bar ± z* σ/sqrt(n)

was developed assuming that the sampling distribution of x-bar is normal; in other words, that the Central Limit Theorem applies. In particular, this allowed us to determine the values of z*, the confidence multiplier, for different levels of confidence.

First, **the sample must be random.** Assuming that the sample is random, recall from the Probability unit that the Central Limit Theorem works when the **sample size is large** (a common rule of thumb for “large” is n > 30), or, for **smaller sample sizes**, if it is known that the quantitative **variable** of interest is **distributed normally** in the population. The only situation when we cannot use the confidence interval, then, is when the sample size is small and the variable of interest is not known to have a normal distribution. In that case, other methods, called non-parametric methods, which are beyond the scope of this course, need to be used. This can be summarized in the following table:

| | Large sample (n > 30) | Small sample |
|---|---|---|
| Variable varies normally | safe to use | safe to use |
| Variable not known to vary normally | safe to use | not safe — use non-parametric methods |

In the following activity, you have the opportunity to use software to summarize the raw data provided.

As we discussed earlier, when variables have been well-researched in different populations it is reasonable to assume that the population standard deviation (σ, sigma) is known. However, this is rarely the case. What if σ (sigma) is unknown?

Well, there is some good news and some bad news.

The good news is that we can easily replace the population standard deviation, σ (sigma), with the **sample** standard deviation, s.

The bad news is that once σ (sigma) has been replaced by s, we lose the Central Limit Theorem, together with the normality of x-bar, and therefore the confidence multipliers z* for the different levels of confidence (1.645, 1.96, 2.576) are (generally) not correct any more. The new multipliers come from a different distribution called the “t distribution” and are therefore denoted by t* (instead of z*). We will discuss the t distribution in more detail when we talk about hypothesis testing.

The confidence interval for the population mean (μ, mu) when (σ, sigma) is unknown is therefore:

x-bar ± t* s/sqrt(n)

(Note that this interval is very similar to the one when σ (sigma) is known, with the obvious changes: s replaces σ (sigma), and t* replaces z* as discussed above.)

There is an important difference between the confidence multipliers we have used so far (z*) and those needed for the case when σ (sigma) is unknown (t*). Unlike the confidence multipliers we have used so far (z*), which depend only on the level of confidence, the new multipliers (t*) have the **added complexity** that they **depend on both the level of confidence and on the sample size** (for example: the t* used in a 95% confidence when n = 10 is different from the t* used when n = 40). Due to this added complexity in determining the appropriate t*, we will rely heavily on software in this case.
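Software makes looking up t* painless. A minimal sketch, assuming SciPy is available (the function name is our own), that returns the two-sided t multiplier for a given confidence level and sample size:

```python
from scipy import stats

def t_star(confidence, n):
    """Two-sided t multiplier for a one-sample t interval
    based on n observations (n - 1 degrees of freedom)."""
    return stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)

# t* depends on n, unlike z*: it is larger for small samples
# and approaches z* = 1.96 as n grows.
small = t_star(0.95, 10)    # about 2.26
large = t_star(0.95, 1000)  # about 1.96
```

This also illustrates the point made below: for large n, the t* multipliers are nearly the same as the z* multipliers.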

**Comments:**

- Since it is quite rare that σ (sigma) is known, this interval (sometimes called a “one-sample t confidence interval”) is more commonly used as the confidence interval for estimating μ (mu). (Nevertheless, we could not have presented it without our extended discussion up to this point, which also provided you with a solid understanding of confidence intervals.)

- The quantity s/sqrt(n) is called the **estimated standard error** of x-bar. The Central Limit Theorem tells us that σ/sqrt(n) = sigma/sqrt(n) is the **standard deviation** of x-bar (and this is the quantity used in the confidence interval when σ (sigma) is known). In general, the **standard error** is the **standard deviation of the sampling distribution of a statistic**. When we substitute s for σ (sigma), we are estimating the true standard error. You may see the term “standard error” used for both the true standard error and the estimated standard error, depending on the author and audience. What is important to understand about the standard error is that it measures the variation of a statistic calculated from a sample of a specified sample size (not the variation of the original population).

- As before, to safely use this confidence interval (one-sample t confidence interval), the sample
**must be random**, and the only case when this interval cannot be used is when the sample size is small and the variable is not known to vary normally.

**Final Comment:**

- It turns out that for large values of n, the t* multipliers are not that different from the z* multipliers, and therefore using the interval formula:

x-bar ± z* s/sqrt(n)

for μ (mu) when σ (sigma) is unknown provides a pretty good approximation.
