This document linked from Proportions (Introduction & Step 1)

- Slides 1-4: Introduction to Steps and Motivating Examples

- Slides 5-12: Steps for Motivating Examples

- Slides 13-18: Final Comments

This document linked from Steps in Hypothesis Testing

**Scenario 1:** When shirts are made, there can occasionally be defects (such as improper stitching). But too many such defective shirts can be a sign of substandard manufacturing.

Suppose, in the past, your favorite department store has had only one defective shirt per 200 shirts (a prior defective rate of only 0.005). But you suspect that the store has recently switched to a substandard manufacturer. So you decide to test to see if their overall proportion of defective shirts today is higher.

Suppose that, in a random sample of 200 shirts from the store, you find that 27 of them are defective, for a sample proportion of defective shirts of 0.135. You want to test whether this is evidence that the store is “guilty” of substandard manufacturing, compared to their prior rate of defective shirts.

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2013/03/qz-DIG-12020.swf

**Scenario 2:** It is a known medical fact that just slightly fewer females than males are born (although the reasons are not completely understood); the known “proper” baseline female birthrate is about 49% females.

In some cultures, male children are traditionally looked on more favorably than female children, and there is concern that the increasing availability of ultrasound may lead to pregnant mothers deciding to abort the fetus if it’s not the culturally “desired” gender. If this is happening, then the proportion of females in those nations would be significantly lower than the proper baseline rate.

To test whether the proportion of females born in India is lower than the proper baseline female birthrate, a study investigates a random sample of 6,500 births from hospital files in India, and finds 44.8% females born among the sample.

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2013/03/qz-DIG-12021.swf

**Scenario 3:** A properly-balanced 6-sided game die should give a 1 in exactly 1/6 (16.7%) of all rolls. A casino wants to test its game die. If the die is not properly balanced one way or another, it could give either too many 1’s or too few 1’s, either of which could be bad.

The casino wants to use the proportion of 1’s to test whether the die is out of balance. So the casino test-rolls the die 60 times and gets a 1 in 9 of the rolls (15%).

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2013/03/qz-DIG-12022.swf


**Scenario 1:** The UCLA Internet Report (February 2003) estimated that roughly 8.7% of Internet users are extremely concerned about credit card fraud when buying online. Has that figure changed since? To test this, a random sample of 100 Internet users was chosen, and when interviewed, 10 said that they were extremely worried about credit card fraud when buying online. Let p be the proportion of all Internet users who are concerned about credit card fraud.

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2013/03/qz-LBD-12017.swf

**Scenario 2:** The UCLA Internet Report (February 2003) estimated that a proportion of roughly 0.75 of online homes are still using dial-up access, but claimed that the use of dial-up is declining. Is that really the case? To examine this, a follow-up study was conducted a year later in which out of a random sample of 1,308 households that had Internet access, 804 were connecting using a dial-up modem. Let p be the proportion of all U.S. Internet-using households that have dial-up access.

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2013/03/qz-LBD-12018.swf

**Scenario 3:** According to the UCLA Internet Report (February 2003) the use of the Internet at home is growing steadily and it is estimated that roughly 59.3% of households in the United States have Internet access at home. Has that trend continued since the report was released? To study this, a random sample of 1,200 households from a big metropolitan area was chosen for a more recent study, and it was found that 972 had an Internet connection. Let p be the proportion of U.S. households that have internet access.

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2013/03/qz-LBD-12019.swf


http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2012/12/DIG_12016_186.swf


http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2012/12/LBD_12015_186.swf


Now that we understand the process of hypothesis testing and the logic behind it, we are ready to start learning about specific statistical tests (also known as significance tests).

The first test we are going to learn is the test about the population proportion (p).

This test is widely known as the **“z-test for the population proportion (p).”**

We will understand later where the “z-test” part is coming from.

This will be the only type of problem you will complete entirely “by-hand” in this course. Our goal is to use this example to give you the tools you need to understand how this process works. After working a few problems, you should review the earlier material again. You will likely need to review the terminology and concepts a few times before you fully understand the process.

In reality, you will often be conducting more complex statistical tests and allowing software to provide the p-value. In these settings it will be important to know what test to apply for a given situation and to be able to explain the results in context.

When we conduct a test about a population proportion, we are working with a categorical variable. Later in the course, after we have learned a variety of hypothesis tests, we will need to be able to identify which test is appropriate for which situation. Identifying the variable as categorical or quantitative is an important component of choosing an appropriate hypothesis test.

In this part of our discussion on hypothesis testing, we will go into details that we did not go into before. More specifically, we will use this test to introduce the idea of a **test statistic**, and details about **how p-values are calculated**.

Let’s start by introducing the three examples, which will be the leading examples in our discussion. Each example is followed by a figure illustrating the information provided, as well as the question of interest.

A machine is known to produce 20% defective products, and is therefore sent for repair. After the machine is repaired, 400 products produced by the machine are chosen at random and 64 of them are found to be defective. Do the data provide enough evidence that the proportion of defective products produced by the machine (p) has been **reduced** as a result of the repair?

The following figure displays the information, as well as the question of interest:

The question of interest helps us formulate the null and alternative hypotheses in terms of p, the proportion of defective products produced by the machine following the repair:

**Ho:** p = 0.20 (No change; the repair did not help).

**Ha:** p < 0.20 (The repair was effective at reducing the proportion of defective parts).

There are rumors that students at a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of 100 students from the college, 19 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is **higher** than the national proportion, which is 0.157? (This number is reported by the Harvard School of Public Health.)

Again, the following figure displays the information as well as the question of interest:

As before, we can formulate the null and alternative hypotheses in terms of p, the proportion of students in the college who use marijuana:

**Ho:** p = 0.157 (same as among all college students in the country).

**Ha:** p > 0.157 (higher than the national figure).

Polls on certain topics are conducted routinely in order to monitor changes in the public’s opinions over time. One such topic is the death penalty. In 2003 a poll estimated that 64% of U.S. adults support the death penalty for a person convicted of murder. In a more recent poll, 675 out of 1,000 U.S. adults chosen at random were in favor of the death penalty for convicted murderers. Do the results of this poll provide evidence that the proportion of U.S. adults who support the death penalty for convicted murderers (p) **changed** between 2003 and the later poll?

Here is a figure that displays the information, as well as the question of interest:

Again, we can formulate the null and alternative hypotheses in terms of p, the proportion of U.S. adults who support the death penalty for convicted murderers.

**Ho:** p = 0.64 (No change from 2003).

**Ha:** p ≠ 0.64 (Some change since 2003).
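Before going through the steps, it can help to see the sample proportion in each example next to its null value. The following is a minimal sketch using only the counts stated in the three examples; the dictionary layout and the function name `sample_proportion` are illustrative choices, not part of the course materials.

```python
# Counts and null values taken from the three examples above.
examples = {
    "defective products": {"successes": 64, "n": 400, "null_value": 0.20},
    "marijuana use": {"successes": 19, "n": 100, "null_value": 0.157},
    "death penalty": {"successes": 675, "n": 1000, "null_value": 0.64},
}


def sample_proportion(successes, n):
    """Return the sample proportion (p-hat): successes divided by sample size."""
    return successes / n


for name, ex in examples.items():
    p_hat = sample_proportion(ex["successes"], ex["n"])
    print(f"{name}: p-hat = {p_hat:.3f}, null value = {ex['null_value']}")
```

The sample proportions come out to 0.160, 0.190, and 0.675. Each differs from its null value in the direction the alternative suggests, but that alone does not tell us whether the difference is large enough to count as evidence against Ho.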

Recall that there are basically 4 steps in the process of hypothesis testing:

- **STEP 1:** State the appropriate null and alternative hypotheses, Ho and Ha.
- **STEP 2:** Obtain a random sample, collect relevant data, and **check whether the data meet the conditions under which the test can be used**. If the conditions are met, summarize the data using a test statistic.
- **STEP 3:** Find the p-value of the test.
- **STEP 4:** Based on the p-value, decide whether or not the results are statistically significant and **draw your conclusions in context.**

**Note:** In practice, we should always consider the practical significance of the results as well as the statistical significance.

We are now going to go through these steps as they apply to the hypothesis testing for the population proportion p. It should be noted that even though the details will be specific to this particular test, some of the ideas that we will add apply to hypothesis testing in general.

Here again are the three sets of hypotheses that are being tested in each of our three examples:

Has the proportion of defective products been reduced as a result of the repair?

**Ho:** p = 0.20 (No change; the repair did not help).

**Ha:** p < 0.20 (The repair was effective at reducing the proportion of defective parts).

Is the proportion of marijuana users in the college higher than the national figure?

**Ho:** p = 0.157 (same as among all college students in the country).

**Ha:** p > 0.157 (higher than the national figure).

Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?

**Ho:** p = 0.64 (No change from 2003).

**Ha:** p ≠ 0.64 (Some change since 2003).

The null hypothesis always takes the form:

- Ho: p = some value

and the alternative hypothesis takes one of the following three forms:

- Ha: p < that value (like in example 1)
**or**

- Ha: p > that value (like in example 2)
**or**

- Ha: p ≠ that value (like in example 3).

Note that it was quite clear from the context which form of the alternative hypothesis would be appropriate. The value that is specified in the null hypothesis is called the **null value**, and is generally denoted by p_{0}. We can say, therefore, that in general the null hypothesis about the population proportion (p) would take the form:

- Ho: p = p_{0}

We write Ho: p = p_{0} to say that we are making the hypothesis that the population proportion has the value of p_{0}. In other words, p is the unknown population proportion and p_{0} is the number we think p might be for the given situation.

The alternative hypothesis takes one of the following three forms (depending on the context):

- Ha: p < p_{0} **(one-sided)**

- Ha: p > p_{0} **(one-sided)**

- Ha: p ≠ p_{0} **(two-sided)**

The first two possible forms of the alternatives (where the = sign in Ho is challenged by < or >) are called **one-sided alternatives**, and the third form of alternative (where the = sign in Ho is challenged by ≠) is called a **two-sided alternative.** To understand the intuition behind these names let’s go back to our examples.

Example 3 (death penalty) is a case where we have a two-sided alternative:

**Ho:** p = 0.64 (No change from 2003).

**Ha:** p ≠ 0.64 (Some change since 2003).

In this case, in order to reject Ho and accept Ha we will need to get a sample proportion of death penalty supporters which is very different from 0.64 **in either direction,** either much larger or much smaller than 0.64.

In example 2 (marijuana use) we have a one-sided alternative:

**Ho:** p = 0.157 (same as among all college students in the country).

**Ha:** p > 0.157 (higher than the national figure).

Here, in order to reject Ho and accept Ha we will need to get a sample proportion of marijuana users which is much **higher** than 0.157.

Similarly, in example 1 (defective products), where we are testing:

**Ho:** p = 0.20 (No change; the repair did not help).

**Ha:** p < 0.20 (The repair was effective at reducing the proportion of defective parts).

in order to reject Ho and accept Ha, we will need to get a sample proportion of defective products which is much **smaller** than 0.20.
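The role of the alternative's direction can be sketched in a few lines. The function below checks only whether the sample proportion falls on the side of the null value that Ha points to; it says nothing about whether the difference is large enough to reject Ho. The labels `"less"`, `"greater"`, and `"two-sided"` and the function name are assumptions made for this illustration.

```python
def supports_alternative(p_hat, null_value, alternative):
    """Check whether p-hat lies on the side of the null value that Ha points to.

    alternative is one of "less", "greater", or "two-sided" (illustrative labels).
    This is a direction check only, not a significance test.
    """
    if alternative == "less":
        return p_hat < null_value
    if alternative == "greater":
        return p_hat > null_value
    if alternative == "two-sided":
        return p_hat != null_value
    raise ValueError(f"unknown alternative: {alternative}")


print(supports_alternative(64 / 400, 0.20, "less"))         # example 1
print(supports_alternative(19 / 100, 0.157, "greater"))     # example 2
print(supports_alternative(675 / 1000, 0.64, "two-sided"))  # example 3
```

In all three examples the sample proportion lies in the direction of Ha, so the remaining question is how surprising a difference of that size would be if Ho were true.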

Data were collected in order to determine whether there is a relationship between a person’s level of education and whether or not the person is a smoker.

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2012/12/DIG_12002_179.swf

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2012/12/DIG_12003_179.swf


According to the Centers for Disease Control and Prevention, the proportion of U.S. adults age 25 or older who smoke is 0.22. A researcher suspects that the rate is lower among U.S. adults 25 or older who have a bachelor’s degree or higher education level.

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2012/12/LBD_12003_179.swf

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2012/12/LBD_12004_179.swf

A study investigated whether there are differences between the mean IQ level of people who were reared by their biological parents and those who were reared by someone else.

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2012/12/LBD_12005_179.swf

http://phhp-faculty-cantrell.sites.medinfo.ufl.edu/files/2012/12/LBD_12006_179.swf


Now that we understand the general idea of how statistical hypothesis testing works, let’s go back to each of the steps and delve slightly deeper, getting more details and learning some terminology.

In all three examples, our aim is to decide between two opposing points of view, Claim 1 and Claim 2. In hypothesis testing, **Claim 1** is called the **null hypothesis** (denoted “**Ho**”), and **Claim 2** plays the role of the **alternative hypothesis** (denoted “**Ha**”). As we saw in the three examples, the null hypothesis suggests nothing special is going on; in other words, there is no change from the status quo, no difference from the traditional state of affairs, no relationship. In contrast, the alternative hypothesis disagrees with this, stating that something is going on, or there is a change from the status quo, or there is a difference from the traditional state of affairs. The alternative hypothesis, Ha, usually represents what we want to check or what we suspect is really going on.

Let’s go back to our three examples and apply the new notation:

**In example 1:**

**Ho:** The proportion of smokers at GU is 0.20.

**Ha:** The proportion of smokers at GU is less than 0.20.

**In example 2:**

**Ho:** The mean concentration in the shipment is the required 245 ppm.

**Ha:** The mean concentration in the shipment is not the required 245 ppm.

**In example 3:**

**Ho:** Performance on the SAT is not related to gender (males and females score the same).

**Ha:** Performance on the SAT is related to gender (males score higher).

This step is pretty obvious. This is what inference is all about. You look at sampled data in order to draw conclusions about the entire population. In the case of hypothesis testing, based on the data, you draw conclusions about whether or not there is enough evidence to reject Ho.

There is, however, one detail that we would like to add here. In this step we collect data and **summarize** it. Go back and look at the second step in our three examples. Note that in order to summarize the data we used simple sample statistics such as the sample proportion (*p*-hat), sample mean (x-bar) and the sample standard deviation (s).

In practice, you go a step further and use these sample statistics to summarize the data with what’s called a **test statistic**. We are not going to go into any details right now, but we will discuss test statistics when we go through the specific tests.

This step will also involve checking any conditions or assumptions required to use the test.

As we saw, this is the step where we calculate how likely it is to get data like that observed (or more extreme) when Ho is true. In a sense, this is the heart of the process, since we draw our conclusions based on this probability.

- If this probability is very small (see example 2), then that means that it would be very surprising to get data like that observed (or more extreme) if Ho were true. The fact that we **did** observe such data is therefore evidence against Ho, and we should reject it.
- On the other hand, if this probability is not very small (see example 3) this means that observing data like that observed (or more extreme) is not very surprising if Ho were true. The fact that we observed such data does not provide evidence against Ho.

This crucial probability, therefore, has a special name. It is called the **p-value** of the test.

In our three examples, the p-values were given to you (and you were reassured that you didn’t need to worry about how these were derived yet):

- Example 1: p-value = 0.106
- Example 2: p-value = 0.0007
- Example 3: p-value = 0.29

Obviously, the smaller the p-value, the more surprising it is to get data like ours (or more extreme) when Ho is true, and therefore, the stronger the evidence the data provide against Ho.

Looking at the three p-values of our three examples, we see that the data that we observed in example 2 provide the strongest evidence against the null hypothesis, followed by example 1, while the data in example 3 provide the least evidence against Ho.
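The ordering described here is simply a sort by p-value, smallest first. As a small sketch (the variable names are illustrative; the p-values are the ones given above):

```python
# p-values as given for the three introductory examples.
p_values = {"Example 1": 0.106, "Example 2": 0.0007, "Example 3": 0.29}

# Smaller p-value means stronger evidence against Ho, so sort ascending.
ranked = sorted(p_values, key=p_values.get)

for rank, name in enumerate(ranked, start=1):
    print(f"{rank}. {name}: p-value = {p_values[name]}")
```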

**Comment:**

- Right now we will not go into specific details about p-value calculations, but just mention that since the p-value is the probability of getting **data** like those observed (or more extreme) when Ho is true, it would make sense that the calculation of the p-value will be based on the data summary, which, as we mentioned, is the test statistic. Indeed, this is the case. In practice, we will mostly use software to provide the p-value for us.

Since our statistical conclusion is based on how small the p-value is, or in other words, how surprising our data are when Ho is true, it would be nice to have some kind of guideline or cutoff that will help determine how small the p-value must be, or how “rare” (unlikely) our data must be when Ho is true, for us to conclude that we have enough evidence to reject Ho.

This cutoff exists, and because it is so important, it has a special name. It is called the **significance level of the test** and is usually denoted by the Greek letter α (alpha). The most commonly used significance level is α (alpha) = 0.05 (or 5%). This means that:

- if the p-value < α (alpha) (usually 0.05), then the data we obtained are considered to be “rare (or surprising) enough” under the assumption that Ho is true, and we say that the data provide statistically significant evidence against Ho, so we reject Ho and thus accept Ha.
- if the p-value > α (alpha) (usually 0.05), then our data are not considered to be “surprising enough” under the assumption that Ho is true, and we say that our data do not provide enough evidence to reject Ho (or, equivalently, that the data do not provide enough evidence to accept Ha).

Now that we have a cutoff to use, here are the appropriate conclusions for each of our examples based upon the p-values we were given.

**In Example 1:**

- Using our cutoff of 0.05, we fail to reject Ho.
- **Conclusion**: There **IS NOT** enough evidence that the proportion of smokers at GU is less than 0.20.
- **Still we should consider:** Does the evidence seen in the data provide any practical evidence towards our alternative hypothesis?

**In Example 2:**

- Using our cutoff of 0.05, we reject Ho.
- **Conclusion**: There **IS** enough evidence that the mean concentration in the shipment is not the required 245 ppm.
- **Still we should consider:** Does the evidence seen in the data provide any practical evidence towards our alternative hypothesis?

**In Example 3:**

- Using our cutoff of 0.05, we fail to reject Ho.
- **Conclusion**: There **IS NOT** enough evidence that males score higher on average than females on the SAT.
- **Still we should consider:** Does the evidence seen in the data provide any practical evidence towards our alternative hypothesis?
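The decision rule applied in these three conclusions can be written as a short function. This is a sketch of the cutoff rule only, with the function name `decide` chosen for illustration; it uses a strict inequality, so a p-value exactly equal to α is treated as not enough evidence to reject Ho.

```python
def decide(p_value, alpha=0.05):
    """Apply the significance-level cutoff to a p-value.

    Returns "reject Ho" when p_value < alpha, otherwise "fail to reject Ho".
    The function never returns "accept Ho": failing to reject is not acceptance.
    """
    if p_value < alpha:
        return "reject Ho"
    return "fail to reject Ho"


# p-values given for the three introductory examples.
for name, p in [("Example 1", 0.106), ("Example 2", 0.0007), ("Example 3", 0.29)]:
    print(f"{name}: p-value = {p} -> {decide(p)}")
```

The statistical decision is only the first half of each conclusion. The wording in context, and the question of practical significance, still have to be supplied by the analyst.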

Notice that all of the above conclusions are written in terms of the alternative hypothesis and are given in the context of the situation. In no situation have we claimed the null hypothesis is true. Be very careful of this and other issues discussed in the following comments.

**Comments:**

- Although the significance level provides a good guideline for drawing our conclusions, it should not be treated as an incontrovertible truth. There is a lot of room for personal interpretation. What if your p-value is 0.052? You might want to stick to the rules and say “0.052 > 0.05 and therefore I don’t have enough evidence to reject Ho”, but you might decide that 0.052 is small enough for you to believe that Ho should be rejected. It should be noted that scientific journals do consider 0.05 to be the cutoff point for which any p-value below the cutoff indicates enough evidence against Ho, and any p-value above it, **or even equal to it**, indicates there is not enough evidence against Ho. That said, a p-value between 0.05 and 0.10 is often reported as marginally statistically significant.

- It is important to draw your conclusions **in context**. It is **never enough** to say: **“p-value = …, and therefore I have enough evidence to reject Ho at the 0.05 significance level.”** You **should always word your conclusion in terms of the data.** Although we will use the terminology of “rejecting Ho” or “failing to reject Ho”, this is mostly because we are instructing you in these concepts. In practice, this language is rarely used. We also suggest writing your conclusion in terms of the alternative hypothesis. Is there or is there not enough evidence that the alternative hypothesis is true?

- Let’s go back to the issue of the nature of the two types of conclusions that I can make.

*Either* **I reject Ho (when the p-value is smaller than the significance level)** *or* **I cannot reject Ho (when the p-value is larger than the significance level).**

As we mentioned earlier, note that the second conclusion does not imply that I accept Ho, but just that I don’t have enough evidence to reject it. Saying (by mistake) “I don’t have enough evidence to reject Ho so I accept it” indicates that the data provide evidence that Ho is true, which is **not necessarily the case**. Consider the following slightly artificial yet effective example:

An employer claims to subscribe to an “equal opportunity” policy, not hiring men any more often than women for managerial positions. Is this credible? You’re not sure, so you want to test the following **two hypotheses:**

**Ho:** The proportion of male managers hired is 0.5.

**Ha:** The proportion of male managers hired is more than 0.5.

**Data:** You choose at random three of the new managers who were hired in the last 5 years and find that all 3 are men.

**Assessing Evidence:** If the proportion of male managers hired is really 0.5 (Ho is true), then the probability that the random selection of three managers will yield three males is 0.5 * 0.5 * 0.5 = 0.125 (using the multiplication rule for independent events). This is the p-value.

**Conclusion:** Using 0.05 as the significance level, you conclude that since the p-value = 0.125 > 0.05, the fact that the three randomly selected managers were all males is not enough evidence to reject the employer’s claim of subscribing to an equal opportunity policy (Ho).

However, **the data (all three selected are males) definitely do NOT provide evidence to accept the employer’s claim (Ho).**
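The arithmetic in this example is easy to check, and extending it slightly shows why such a small sample could never have led to rejecting Ho. The sketch below assumes, as the example does, that selections are independent with probability 0.5 of a man under Ho; the function name `p_value_all_male` is hypothetical.

```python
def p_value_all_male(sample_size, p_male=0.5):
    """Probability that every one of sample_size selected managers is a man,
    assuming independent selections with P(man) = p_male (Ho true)."""
    return p_male ** sample_size


# The sample of 3 from the example, plus slightly larger all-male samples.
for n in range(3, 7):
    p = p_value_all_male(n)
    verdict = "reject Ho" if p < 0.05 else "fail to reject Ho"
    print(f"n = {n}: p-value = {p:.5f} -> {verdict}")
```

With three managers, even the most extreme possible outcome (all men) has a p-value of 0.125, so no result from this sample could be significant at the 0.05 level. Under these assumptions an all-male sample first drops below 0.05 at five managers (0.03125). Failing to reject Ho here reflects how little data there is, not support for the employer's claim.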

**Comment about wording:** Another common wording in scientific journals is:

- “The results are statistically significant” – when the p-value < α (alpha).
- “The results are not statistically significant” – when the p-value > α (alpha).

Often you will see significance levels reported with additional description to indicate the degree of statistical significance. A general guideline (although not required in our course) is:

- If 0.01 ≤ p-value < 0.05, then the results are (statistically) *significant*.
- If 0.001 ≤ p-value < 0.01, then the results are *highly statistically significant*.
- If p-value < 0.001, then the results are *very highly statistically significant*.
- If p-value > 0.05, then the results are *not statistically significant* (NS).
- If 0.05 ≤ p-value < 0.10, then the results are *marginally statistically significant*.
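These guideline ranges can be sketched as a lookup function. Note one assumption made here: the guideline's "not statistically significant" range (p-value > 0.05) overlaps with the "marginally significant" range (0.05 to 0.10), and this sketch resolves the overlap by reporting the marginal label for p-values in [0.05, 0.10). The function name is illustrative.

```python
def significance_label(p_value):
    """Map a p-value to the descriptive label from the guideline above.

    Assumption: p-values in [0.05, 0.10) are labelled "marginally
    statistically significant" rather than "not statistically significant".
    """
    if p_value < 0.001:
        return "very highly statistically significant"
    if p_value < 0.01:
        return "highly statistically significant"
    if p_value < 0.05:
        return "statistically significant"
    if p_value < 0.10:
        return "marginally statistically significant"
    return "not statistically significant"


for p in [0.0007, 0.004, 0.03, 0.052, 0.106, 0.29]:
    print(f"p-value = {p}: {significance_label(p)}")
```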

We learned quite a lot about hypothesis testing. We learned the logic behind it, what the key elements are, and what types of conclusions we can and cannot draw in hypothesis testing. Here is a quick recap:

Here are a few more activities if you need some additional practice.

**Comments:**

- Notice that **the p-value is an example of a conditional probability**. We calculate the probability of obtaining results like those of our data (or more extreme) GIVEN the null hypothesis is true. We could write P(Obtaining results like ours or more extreme | Ho is True).

- Another common phrase used to define the p-value is: “**The probability of obtaining a statistic as or more extreme than your result given the null hypothesis is TRUE**”.
    - We could write P(Obtaining a test statistic as or more extreme than ours | Ho is True).
    - In this case we are asking “Assuming the null hypothesis is true, how rare is it to observe something as or more extreme than what I have found in my data?”
    - If after assuming the null hypothesis is true, what we have found in our data is extremely rare (small p-value), this provides evidence to reject our assumption that Ho is true in favor of Ha.

- The **p-value can also be thought of as the probability, assuming the null hypothesis is true, that the result we have seen is solely due to random error (or random chance).** We have already seen that statistics from samples collected from a population vary. There is random error or random chance involved when we sample from populations.

In this setting, if the p-value is very small, this implies, assuming the null hypothesis is true, that it is extremely unlikely that the results we have obtained would have happened due to random error alone, and thus our assumption (Ho) is rejected in favor of the alternative hypothesis (Ha).
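One way to make the "assuming Ho is true" condition concrete is to simulate it. The sketch below revisits the equal-opportunity hiring example: it repeatedly draws three managers in a world where Ho is true (each is a man with probability 0.5) and counts how often the result is as extreme as the one observed (all three men). The function name, the number of trials, and the fixed seed are assumptions made for this illustration; the course itself relies on formulas and software rather than simulation.

```python
import random


def simulate_p_value(n_trials=100_000, sample_size=3, p_null=0.5, seed=0):
    """Estimate P(all selected managers are men | Ho is true) by simulation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    as_extreme = 0
    for _ in range(n_trials):
        # Generate one sample in a world where Ho is true.
        men = sum(rng.random() < p_null for _ in range(sample_size))
        # "As or more extreme" than the observed data: every selection is a man.
        if men >= sample_size:
            as_extreme += 1
    return as_extreme / n_trials


estimate = simulate_p_value()
print(f"simulated p-value: {estimate:.3f} (exact value: 0.125)")
```

Every simulated sample is generated with Ho true, so the proportion of samples as extreme as ours is an estimate of the conditional probability described above. It should land close to the exact value of 0.125, with small differences due to random chance in the simulation itself.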

**It is EXTREMELY important that you find a definition of the p-value which makes sense to you. New students often need to contemplate this idea repeatedly through a variety of examples and explanations before becoming comfortable with this idea. It is one of the two most important concepts in statistics (the other being confidence intervals).**

**Remember:**

- We infer that the alternative hypothesis is true ONLY by rejecting the null hypothesis.
- A statistically significant result is one that has a very low probability of occurring if the null hypothesis is true.
- Results which are **statistically** significant may or may not have **practical** significance and vice versa.