Related SAS Tutorials
Related SPSS Tutorials
Although not a required aspect of describing distributions of one quantitative variable, we are often interested in where a particular value falls in the distribution. Is the value unusually low or high or about what we would expect?
Answers to these questions rely on measures of position (or location). These measures give information about the distribution but also give information about how individual values relate to the overall distribution.
A common measure of position is the percentile. Although there are some mathematical considerations involved with calculating percentiles, which we will not discuss, you should have a basic understanding of their interpretation.
The quartiles Q1 and Q3 are special cases of percentiles and thus are measures of position.
The combination of the five numbers (min, Q1, M, Q3, max) is called the five-number summary, and provides a quick numerical description of both the center and spread of a distribution.
Each of the values represents a measure of position in the dataset.
The min and max provide the boundaries, while the quartiles and median provide information about the 25th, 50th, and 75th percentiles.
Standardized scores, also called z-scores, use the mean and standard deviation as the primary measures of center and spread and are therefore most useful when the mean and standard deviation are appropriate, i.e., when the distribution is reasonably symmetric with no extreme outliers.
For any individual, the z-score tells us how many standard deviations the raw score for that individual deviates from the mean and in what direction. A positive z-score indicates the individual is above average and a negative z-score indicates the individual is below average.
To calculate a z-score, we take the individual value, subtract the mean, and then divide this difference by the standard deviation.
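As a quick illustration, the calculation can be carried out in a few lines of Python; the exam scores below are hypothetical values used only to show the arithmetic:

```python
# z-score: how many standard deviations a value lies from the mean,
# and in which direction (positive = above average, negative = below).
scores = [55, 60, 70, 70, 75, 80, 90]  # hypothetical exam scores

n = len(scores)
mean = sum(scores) / n
# Sample standard deviation (divide by n - 1).
sd = (sum((x - mean) ** 2 for x in scores) / (n - 1)) ** 0.5

def z_score(value, mean, sd):
    """Subtract the mean, then divide the difference by the standard deviation."""
    return (value - mean) / sd

print(round(z_score(90, mean, sd), 2))  # positive: 90 is above the mean
print(round(z_score(55, mean, sd), 2))  # negative: 55 is below the mean
```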
Measures of position also allow us to compare values from different distributions. For example, we can present the percentiles or zscores of an individual’s height and weight. These two measures together would provide a better picture of how the individual fits in the overall population than either would alone.
Although measures of position are not stressed in this course as much as measures of center and spread, we have seen (and will see) many measures of position used in various aspects of examining the distribution of one variable, and it is good to recognize them as measures of position when they appear.
Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern.
More specifically, we should consider the following features of the Distribution for One Quantitative Variable:
When describing the shape of a distribution, we should consider:
We distinguish between:
A distribution is called symmetric if, as in the histograms above, the distribution forms an approximate mirror image with respect to the center of the distribution.
The center of the distribution is easy to locate and both tails of the distribution are approximately the same length.
Note that all three distributions are symmetric, but are different in their modality (peakedness).
Note that in a skewed right distribution, the bulk of the observations are small/medium, with a few observations that are much larger than the rest.
Note that in a skewed left distribution, the bulk of the observations are medium/large, with a few observations that are much smaller than the rest.
Comments:
Here is an example. A medium-size neighborhood 24-hour convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the data.
Note that the overall shape of the distribution is skewed to the right with a clear mode around $25. In addition, it has another (smaller) “peak” (mode) around $50–55.
The majority of the customers spend around $25, but there is a cluster of customers who enter the store and spend around $50–55.
One way to define the center is as the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.
Another common way to measure the center of a distribution is to use the average value.
From looking at the histogram we can get only a rough estimate for the center of the distribution. More exact ways of finding measures of center will be discussed in the next section.
From looking at the histogram, we can approximate the smallest observation (min), and the largest observation (max), and thus approximate the range. (More exact ways of finding measures of spread will be discussed soon.)
For example, the following histogram represents a distribution with a highly probable outlier:
As you can see from the histogram, the grades distribution is roughly symmetric and unimodal with no outliers.
The center of the grades distribution is roughly 70 (7 students scored below 70, and 8 students scored above 70).
approximate min:  45 (the middle of the lowest interval of scores) 
approximate max:  95 (the middle of the highest interval of scores) 
approximate range:  95 − 45 = 50 
Let’s look at a new example.
To provide an example of a histogram applied to actual data, we will look at the ages of Best Actress Oscar winners from 1970 to 2001.
The histogram for the data is shown below. (Link to the Best Actress Oscar Winners data).
We will now summarize the main features of the distribution of ages as it appears from the histogram:
Shape: The distribution of ages is skewed right. We have a concentration of data among the younger ages and a long tail to the right. The vast majority of the “best actress” awards are given to young actresses, with very few awards given to actresses who are older.
Center: The data seem to be centered around 35 or 36 years old. Note that this implies that roughly half the awards are given to actresses who are less than 35 years old.
Spread: The data range from about 20 to about 80, so the approximate range equals 80 – 20 = 60.
Outliers: There seem to be two probable outliers to the far right and possibly a third around 62 years old.
You can see how informative it is to know “what to look at” in a histogram.
The following exercises provide more practice with shapes of distributions for one quantitative variable.
Variables can be broadly classified into one of two types:
Below we define these two main types of variables and provide further subclassifications for each type.
Categorical variables take category or label values, and place an individual into one of several groups.
Categorical variables are often further classified as either:
Common examples would be gender, eye color, or ethnicity.
However, ordinal variables are still categorical and do not provide precise measurements.
Differences are not precisely meaningful. For example, if one student scores an A and another a B on an assignment, we cannot say precisely what the difference in their scores is, only that an A is higher than a B.
Quantitative variables take numerical values, and represent some kind of measurement.
Quantitative variables are often further classified as either:
Most often these variables indeed represent some kind of count such as the number of prescriptions an individual takes daily.
Our precision in measuring these variables is often limited by our instruments.
Units should be provided.
Common examples would be height (inches), weight (pounds), or time to recovery (days).
One special variable type occurs when a variable has only two possible values.
A variable is said to be binary or dichotomous when there are only two possible levels.
These variables can usually be phrased as a “yes/no” question. Gender is an example of a binary variable.
Currently we are primarily concerned with classifying variables as either categorical or quantitative.
Sometimes, however, we will need to consider further and subclassify these variables as defined above.
These concepts will be discussed and reviewed as needed but here is a quick practice on subclassifying categorical and quantitative variables.
Let’s revisit the dataset showing medical records for a sample of patients.
In our example of medical records, there are several variables of each type:
Comments:
It is quite common to code the values of a categorical variable as numbers, but you should remember that these are just codes.
They have no arithmetic meaning (i.e., it does not make sense to add, subtract, multiply, divide, or compare the magnitude of such values).
Usually, if such a coding is used, all categorical variables will be coded, and we will tend to use this type of coding for datasets in this course.
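A short pandas sketch makes the point; the smoking-status variable and its 1/2/3 coding scheme are hypothetical:

```python
import pandas as pd

# Hypothetical coding: 1 = never smoked, 2 = former smoker, 3 = current smoker.
# The numbers are labels only; arithmetic on them has no meaning.
codes = pd.Series([1, 3, 2, 1, 1, 3])
labels = {1: "never", 2: "former", 3: "current"}

status = codes.map(labels)
print(status.value_counts())

# Legal Python, but a "mean smoking status" of about 1.83 is meaningless.
print(codes.mean())
```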
A common example is to provide information about an individual’s Body Mass Index by stating whether the individual is underweight, normal, overweight, or obese.
This categorized BMI is an example of an ordinal categorical variable.
The types of variables you are analyzing directly relate to the available descriptive and inferential statistical methods.
It is important to:
As we proceed in this course, we will continually emphasize the types of variables that are appropriate for each method we discuss.
For example:
To compare the number of polio cases in the two treatment arms of the Salk Polio vaccine trial, you could use
To compare blood pressures in a clinical trial evaluating two blood pressure-lowering medications, you could use
In this part, we continue to handle situations involving one categorical explanatory variable and one quantitative response variable, which is case C→Q.
Here is a summary of the tests we have covered for the case where k = 2. Methods in BOLD are our main focus in this unit.
So far we have discussed the two samples and matched pairs designs, in which the categorical explanatory variable is two-valued. As we saw, in these cases, examining the relationship between the explanatory and the response variables amounts to comparing the mean of the response variable (Y) in two populations, which are defined by the two values of the explanatory variable (X). The difference between the two samples and matched pairs designs is that in the former, the two samples are independent, and in the latter, the samples are dependent.
Independent Samples (More Emphasis) 
Dependent Samples (Less Emphasis) 
Standard Tests
Non-Parametric Test

Standard Test
Non-Parametric Tests

We now move on to the case where k > 2 when we have independent samples. Here is a summary of the tests we will learn for the case where k > 2. Notice we will not cover the dependent samples case in this course.
Independent Samples (Only Emphasis) 
Dependent Samples (Not Discussed) 
Standard Tests
Non-Parametric Test

Standard Test

Here, as in the twovalued case, making inferences about the relationship between the explanatory (X) and the response (Y) variables amounts to comparing the means of the response variable in the populations defined by the values of the explanatory variable, where the number of means we are comparing depends, of course, on the number of values of X.
Unlike the two-valued case, where we looked at two subcases, (1) when the samples are independent (two samples design) and (2) when the samples are dependent (matched pairs design), here we are just going to discuss the case where the samples are independent. In other words, we are just going to extend the two samples design to more than two independent samples.
The inferential method for comparing more than two means that we will introduce in this part is called ANalysis Of VAriance (abbreviated as ANOVA), and the test associated with this method is called the ANOVA F-test.
In most software, the data need to be arranged so that each row contains one observation with one variable recording X and another variable recording Y for each observation.
As we mentioned earlier, the test that we will present is called the ANOVA F-test, and as you’ll see, this test is different in two ways from all the tests we have presented so far:
but a different structure that captures the essence of the Ftest, and clarifies where the name “analysis of variance” is coming from.
The question we need to answer is: Are the differences among the sample means due to true differences among the μ’s (alternative hypothesis), or merely due to sampling variability or random chance (null hypothesis)?
Here are two sets of boxplots representing two possible scenarios:
Scenario #1
Scenario #2
Thus, in the language of hypothesis tests, we would say that if the data were configured as they are in scenario 1, we would not reject the null hypothesis that population means were equal for the k groups.
If the data were configured as they are in scenario 2, we would reject the null hypothesis, and we would conclude that not all population means are the same for the k groups.
Let’s summarize what we learned from this.
In order to answer this question using data, we need to look at the variation among the sample means, but this alone is not enough.
We need to look at the variation among the sample means relative to the variation within the groups. In other words, we need to look at the quantity:
which measures to what extent the difference among the sample means for our groups dominates over the usual variation within sampled groups (which reflects differences in individuals that are typical in random samples).
When the variation within groups is large (like in scenario 1), the variation (differences) among the sample means may become negligible resulting in data which provide very little evidence against Ho. When the variation within groups is small (like in scenario 2), the variation among the sample means dominates over it, and the data have stronger evidence against Ho.
It has a different structure from all the test statistics we’ve looked at so far, but it is similar in that it is still a measure of the evidence against H_{0}. The larger F is (which happens when the denominator, the variation within groups, is small relative to the numerator, the variation among the sample means), the more evidence we have against H_{0}.
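The two scenarios can be imitated with simulated data; the group means, spreads, and sample sizes below are made up for illustration, and `scipy.stats.f_oneway` computes the F statistic just described:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
means = [10, 12, 14]  # the same hypothetical population means in both scenarios

# Scenario 1: large variation within groups.
scenario_1 = [rng.normal(m, 10.0, size=20) for m in means]
# Scenario 2: small variation within groups.
scenario_2 = [rng.normal(m, 1.0, size=20) for m in means]

f1, p1 = stats.f_oneway(*scenario_1)
f2, p2 = stats.f_oneway(*scenario_2)

# When the within-group variation is small, the variation among the sample
# means dominates, producing a larger F and stronger evidence against Ho.
print("scenario 1:", f1, p1)
print("scenario 2:", f2, p2)
```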
Looking at this ratio of variations is the idea behind comparing more than two means; hence the name analysis of variance (ANOVA).
Now test your understanding of this idea.
Comments
Here is a full statement of the process for the ANOVA F-test:
Step 1: State the hypotheses
The null hypothesis claims that there is no relationship between X and Y. Since the relationship is examined by comparing the means of Y in the populations defined by the values of X (μ_{1}, μ_{2}, …, μ_{k}), no relationship would mean that all the means are equal.
Therefore the null hypothesis of the F-test is:
As we mentioned earlier, here we have just one alternative hypothesis, which claims that there is a relationship between X and Y. In terms of the means μ_{1}, μ_{2}, …, μ_{k}, it simply says the opposite of the null hypothesis, that not all the means are equal, and we simply write:
Comments:
Step 2: Obtain data, check conditions, and summarize data
The ANOVA F-test can be safely used as long as the following conditions are met:
(i) Each of the populations is normal, or more specifically, the distribution of the response Y in each population is normal, and the samples are random (or at least can be considered as such). In practice, checking normality in the populations is done by looking at each of the samples using a histogram and checking whether there are any signs that the populations are not normal. Such signs could be extreme skewness and/or extreme outliers.
(ii) The populations are known or discovered not to be normal, but the sample size of each of the random samples is large enough (we can use the rule of thumb that a sample size greater than 30 is considered large enough).
We can check this condition using the rule of thumb that the ratio of the largest sample standard deviation to the smallest is less than 2. If that is the case, this condition is considered to be satisfied.
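This rule of thumb is a one-line check; the two standard deviations here are the largest and smallest from the frustration-level example later in this section:

```python
# Equal-spread condition for the ANOVA F-test: the ratio of the largest
# sample standard deviation to the smallest should be less than 2.
sample_sds = [3.082, 2.088]  # largest and smallest group sds

ratio = max(sample_sds) / min(sample_sds)
print(round(ratio, 3))  # 1.476
print(ratio < 2)        # True: the condition is satisfied
```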
We can also check this condition using a formal test, similar to the one used in the two-sample t-test, although we will not cover any formal tests.
Test Statistic
Step 3: Find the p-value of the test by using the test statistic as follows
Step 4: Conclusion
As usual, we base our conclusion on the p-value.
Final Comment
Note that when we reject Ho in the ANOVA Ftest, all we can conclude is that
However, the ANOVA F-test does not provide any immediate insight into why Ho was rejected, or in other words, it does not tell us in what way the population means of the groups are different. As an exploratory (or visual) aid to get that insight, we may take a look at the confidence intervals for group population means. More specifically, we can look at which of the confidence intervals overlap and which do not.
Multiple Comparisons:
Now let’s look at some examples using real data.
A college dean believes that students with different majors may experience different levels of academic frustration. Random samples of size 35 of Business, English, Mathematics, and Psychology majors are asked to rate their level of academic frustration on a scale of 1 (lowest) to 20 (highest).
The figure highlights what we have already mentioned: examining the relationship between major (X) and frustration level (Y) amounts to comparing the mean frustration levels among the four majors defined by X. Also, the figure reminds us that we are dealing with a case where the samples are independent.
Step 1: State the hypotheses
The correct hypotheses are:
Step 2: Obtain data, check conditions, and summarize data
Data: SPSS format, SAS format, Excel format, CSV format
In our example all the conditions are satisfied:
The rule of thumb is satisfied since 3.082 / 2.088 < 2. We will look at the formal test in the software.
Test statistic: (Minitab output)
Step 3: Find the p-value of the test by using the test statistic as follows
Step 4: Conclusion
As a followup, we can construct confidence intervals (or conduct multiple comparisons as we will do in the software). This allows us to understand better which population means are likely to be different.
In this case, the business majors are clearly lower on the frustration scale than other majors. It is also possible that English majors are lower than psychology majors based upon the individual 95% confidence intervals in each group.
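The same analysis can be sketched in Python. Since the linked data files are not reproduced here, the four samples below are simulated with made-up means and spreads; only the mechanics of `scipy.stats.f_oneway` are the point:

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for the four samples of 35 frustration scores
# (the means and sds below are hypothetical, not the study's values).
rng = np.random.default_rng(42)
business = rng.normal(7, 3, 35)
english = rng.normal(11, 3, 35)
math_majors = rng.normal(12, 3, 35)
psychology = rng.normal(13, 3, 35)

f_stat, p_value = stats.f_oneway(business, english, math_majors, psychology)
print(f_stat, p_value)
# A small p-value leads us to reject Ho: not all four mean
# frustration levels are equal.
```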
SAS Output and SAS Code (Includes Non-Parametric Test)
Here is another example.
Do advertisers alter the reading level of their ads based on the target audience of the magazine they advertise in?
In 1981, a study of magazine advertisements was conducted (F.K. Shuptrine and D.D. McVicker, “Readability Levels of Magazine Ads,” Journal of Advertising Research, 21:5, October 1981). Researchers selected random samples of advertisements from each of three groups of magazines:
The measure that the researchers used to assess the level of the ads was the number of words in the ad. Eighteen ads were randomly selected from each of the magazine groups, and the number of words per ad was recorded.
The following figure summarizes this problem:
Our question of interest is whether the number of words in ads (Y) is related to the educational level of the magazine (X). To answer this question, we need to compare μ_{1}, μ_{2}, and μ_{3}, the mean number of words in ads of the three magazine groups. Note in the figure that the sample means are provided. It seems that what the data suggest makes sense; the magazines in group 1 have the largest number of words per ad (on average) followed by group 2, and then group 3.
The question is whether these differences between the sample means are significant. In other words, are the differences among the observed sample means due to true differences among the μ’s or merely due to sampling variability? To answer this question, we need to carry out the ANOVA F-test.
Step 1: Stating the hypotheses.
We are testing:
Conceptually, the null hypothesis claims that the number of words in ads is not related to the educational level of the magazine, and the alternative hypothesis claims that there is a relationship.
Step 2: Checking conditions and summarizing the data.
In order to check the next two conditions, we’ll need to look at the data (condition ii), and calculate the sample standard deviations of the three samples (condition iii).
Using the above, we can address conditions (ii) and (iii)
Before we move on, let’s look again at the graph. It is easy to see the trend of the sample means (indicated by red circles).
However, there is so much variation within each of the groups that there is almost a complete overlap between the three boxplots, and the differences between the means are overshadowed and seem like something that could have happened just by chance.
Let’s move on and see whether the ANOVA F-test will support this observation.
Step 3. Finding the p-value.
Step 4: Making conclusions in context.
Now try one for yourself.
The ANOVA F-test does not provide any insight into why H_{0} was rejected; it does not tell us in what way μ_{1}, μ_{2}, …, μ_{k} are not all equal. We would like to know which pairs of μ’s are not equal. As an exploratory (or visual) aid to get that insight, we may take a look at the confidence intervals for the group population means μ_{1}, μ_{2}, …, μ_{k} that appear in the output. More specifically, we should look at the position of the confidence intervals and the overlap (or lack of overlap) between them.
* If the confidence interval for, say, μ_{i} overlaps with the confidence interval for μ_{j}, then μ_{i} and μ_{j} share some plausible values, which means that based on the data we have no evidence that these two means are different.
* If the confidence interval for μ_{i} does not overlap with the confidence interval for μ_{j}, then μ_{i} and μ_{j} do not share plausible values, which means that the data suggest that these two means are different.
Furthermore, if, as in the figure above, the confidence interval (set of plausible values) for μ_{i} lies entirely below the confidence interval (set of plausible values) for μ_{j}, then the data suggest that μ_{i} is smaller than μ_{j}.
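The overlap check itself is mechanical; here is a sketch with two hypothetical groups whose intervals clearly do not overlap:

```python
import numpy as np
from scipy import stats

def mean_ci(sample, conf=0.95):
    """t-based confidence interval for a group's population mean."""
    sample = np.asarray(sample, dtype=float)
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))
    half = stats.t.ppf((1 + conf) / 2, df=len(sample) - 1) * se
    return m - half, m + half

def intervals_overlap(ci_a, ci_b):
    """True if the two intervals share at least one plausible value."""
    return bool(ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1])

rng = np.random.default_rng(1)
group_i = rng.normal(10, 2, 30)  # hypothetical sample for group i
group_j = rng.normal(15, 2, 30)  # hypothetical sample for group j

ci_i, ci_j = mean_ci(group_i), mean_ci(group_j)
print(ci_i, ci_j)
# No overlap: the data suggest mu_i is smaller than mu_j.
print(intervals_overlap(ci_i, ci_j))
```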
Consider our first example on the level of academic frustration.
Based on the small p-value, we rejected H_{o} and concluded that not all four frustration level means are equal, or in other words that frustration level is related to the student’s major. To get more insight into that relationship, we can look at the confidence intervals above (marked in red). The top confidence interval is the set of plausible values for μ_{1}, the mean frustration level of business students. The confidence interval below it is the set of plausible values for μ_{2}, the mean frustration level of English students, etc.
What we see is that the business confidence interval is way below the other three (it doesn’t overlap with any of them). The math confidence interval overlaps with both the English and the psychology confidence intervals; however, there is no overlap between the English and psychology confidence intervals.
This gives us the impression that the mean frustration level of business students is lower than the mean in the other three majors. Within the other three majors, we get the impression that the mean frustration of math students may not differ much from the mean of both English and psychology students, however the mean frustration of English students may be lower than the mean of psychology students.
Note that this is only an exploratory/visual way of getting an impression of why H_{o} was rejected, not a formal one. There is a formal way of doing it that is called “multiple comparisons,” which is beyond the scope of this course. An extension to this course will include this topic in the future.
We will look at one non-parametric test in the k > 2 independent sample setting. We will cover more details later (Details for Non-Parametric Alternatives).
The Kruskal-Wallis test is a general test to compare multiple distributions in independent samples and is a common alternative to the one-way ANOVA.
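A minimal sketch of the test, using `scipy.stats.kruskal` on three small hypothetical samples:

```python
from scipy import stats

# Three hypothetical independent samples; the Kruskal-Wallis test compares
# their distributions using ranks, without assuming normality.
group_a = [27, 2, 4, 18, 7, 9]
group_b = [20, 8, 14, 36, 21, 22]
group_c = [34, 31, 3, 23, 30, 6]

h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(h_stat, p_value)
```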
Here is a summary of the tests we will learn for the scenario where k = 2. Methods in BOLD will be our main focus.
We have completed our discussion on dependent samples (2nd column) and now we move on to independent samples (1st column).
Independent Samples (More Emphasis) 
Dependent Samples (Less Emphasis) 
Standard Tests
Non-Parametric Test

Standard Test
Non-Parametric Tests

We have discussed the dependent sample case where observations are matched/paired/linked between the two samples. Recall that in that scenario observations can be the same individual or two individuals who are matched between samples. To analyze data from dependent samples, we simply took the differences and analyzed the differences using one-sample techniques.
Now we will discuss the independent sample case. In this case, all individuals are independent of all other individuals in their sample as well as all individuals in the other sample. This is most often accomplished by either:
Recall that here we are interested in the effect of a two-valued (k = 2) categorical variable (X) on a quantitative response (Y). Random samples from the two subpopulations (defined by the two categories of X) are obtained and we need to evaluate whether or not the data provide enough evidence for us to believe that the two subpopulation means are different.
In other words, our goal is to test whether the means μ_{1} and μ_{2} (which are the means of the variable of interest in the two subpopulations) are equal or not, and in order to do that we have two samples, one from each subpopulation, which were chosen independently of each other.
The test that we will learn here is commonly known as the two-sample t-test. As the name suggests, this is a t-test, which as we know means that the p-values for this test are calculated under some t-distribution.
Here are figures that illustrate some of the examples we will cover. Notice how the original variables X (categorical variable with two levels) and Y (quantitative variable) are represented. Think about the fact that we are in case C → Q!
As in our discussion of dependent samples, we will often simplify our terminology and simply use the terms “population 1” and “population 2” instead of referring to these as subpopulations. Either terminology is fine.
Question: Does it matter which population we label as population 1 and which as population 2?
Answer: No, it does not matter as long as you are consistent, meaning that you do not switch labels in the middle.
Recall that our goal is to compare the means μ_{1} and μ_{2} based on the two independent samples.
The hypotheses represent our goal to compare μ_{1} and μ_{2}.
The null hypothesis is always:
Ho: μ_{1} – μ_{2} = 0 (which is the same as μ_{1} = μ_{2})
(There IS NO association between the categorical explanatory variable and the quantitative response variable)
We will focus on the two-sided alternative hypothesis of the form:
Ha: μ_{1} – μ_{2} ≠ 0 (which is the same as μ_{1} ≠ μ_{2}) (two-sided)
(There IS AN association between the categorical explanatory variable and the quantitative response variable)
Note that the null hypothesis claims that there is no difference between the means. Conceptually, Ho claims that there is no relationship between the two relevant variables (X and Y).
Our parameter of interest in this case (the parameter about which we are making an inference) is the difference between the means (μ_{1} – μ_{2}) and the null value is 0. The alternative hypothesis claims that there is a difference between the means.
The two-sample t-test can be safely used as long as the following conditions are met:
The two samples are indeed independent.
We are in one of the following two scenarios:
(i) Both populations are normal, or more specifically, the distribution of the response Y in both populations is normal, and both samples are random (or at least can be considered as such). In practice, checking normality in the populations is done by looking at each of the samples using a histogram and checking whether there are any signs that the populations are not normal. Such signs could be extreme skewness and/or extreme outliers.
(ii) The populations are known or discovered not to be normal, but the sample size of each of the random samples is large enough (we can use the rule of thumb that a sample size greater than 30 is considered large enough).
Assuming that we can safely use the two-sample t-test, we need to summarize the data, and in particular, calculate our data summary: the test statistic.
Test Statistic for the Two-Sample T-test:
There are two choices for our test statistic, and we must choose the appropriate one to summarize our data. We will see how to choose between the two test statistics in the next section. The two options are as follows:
We use the following notation to describe our samples:
Here are the two cases for our test statistic.
(A) Equal Variances: If it is safe to assume that the two populations have equal standard deviations, we can pool our estimates of this common population standard deviation and use the following test statistic.
where
(B) Unequal Variances: If it is NOT safe to assume that the two populations have equal standard deviations, we have unequal standard deviations and must use the following test statistic.
Comments:
Each of these tests relies on a particular t-distribution under which the p-values are calculated. In the case where equal variances are assumed, the degrees of freedom are simply:
whereas in the case of unequal variances, the formula for the degrees of freedom is more complex. We will rely on the software to obtain the degrees of freedom in both cases and provide us with the correct p-value (usually this will be a two-sided p-value).
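Both versions of the test are available in standard software; in `scipy.stats.ttest_ind` the choice is the `equal_var` flag. The two samples below are simulated with made-up means and spreads purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_1 = rng.normal(12, 4, 40)  # hypothetical scores, population 1
group_2 = rng.normal(10, 4, 45)  # hypothetical scores, population 2

# (A) Equal variances assumed: pooled two-sample t-test,
#     with df = n1 + n2 - 2.
t_pooled, p_pooled = stats.ttest_ind(group_1, group_2, equal_var=True)

# (B) Equal variances NOT assumed: the software computes the more
#     complex (Welch) degrees of freedom for us.
t_welch, p_welch = stats.ttest_ind(group_1, group_2, equal_var=False)

print(t_pooled, p_pooled)
print(t_welch, p_welch)
```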
As usual, we draw our conclusion based on the p-value. Be sure to write your conclusions in context by specifying your current variables and/or precisely describing the difference in population means in terms of the current variables.
If the p-value is small, there is a statistically significant difference between what was observed in the sample and what was claimed in Ho, so we reject Ho.
Conclusion: There is enough evidence that the categorical explanatory variable is related to (or associated with) the quantitative response variable. More specifically, there is enough evidence that the difference in population means is not equal to zero.
If the p-value is not small, we do not have enough statistical evidence to reject Ho.
Conclusion: There is NOT enough evidence that the categorical explanatory variable is related to (or associated with) the quantitative response variable. More specifically, there is NOT enough evidence that the difference in population means is different from zero.
In particular, if a cutoff probability, α (significance level), is specified, we reject Ho if the p-value is less than α. Otherwise, we do not reject Ho.
As in previous methods, we can follow up with a confidence interval for the difference between population means, μ_{1} – μ_{2}, and interpret this interval in the context of the problem.
Interpretation: We are 95% confident that the population mean for (one group) is between __________________ compared to the population mean for (the other group).
Confidence intervals can also be used to determine whether or not to reject the null hypothesis of the test based upon whether or not the null value of zero falls outside the interval or inside.
If the null value, 0, falls outside the confidence interval, Ho is rejected. (Zero is NOT a plausible value based upon the confidence interval)
If the null value, 0, falls inside the confidence interval, Ho is not rejected. (Zero IS a plausible value based upon the confidence interval)
NOTE: Be careful to choose the correct confidence interval about the difference between population means using the same assumption (variances equal or variances unequal) and not the individual confidence intervals for the means in the groups themselves.
Since we have two possible tests we can conduct, based upon whether or not we can assume the population standard deviations (or variances) are equal, we need a method to determine which test to use.
Although you can make a reasonable guess using information from the data (i.e., look at the distributions and estimates of the standard deviations and see if you feel they are reasonably equal), we have a test which can help us here, called the test for Equality of Variances. This output is automatically displayed in many software packages when a two-sample t-test is requested, although the particular test used may vary. The hypotheses of this test are:
Ho: σ_{1} = σ_{2} (the standard deviations in the two populations are the same)
Ha: σ_{1} ≠ σ_{2} (the standard deviations in the two populations are not the same)
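As a sketch of how this decision might be carried out, here is a hedged Python example using scipy.stats. Levene's test is one common version of the equality-of-variances test (SAS and SPSS may use other variants, such as the folded F test), and the scores below are made up for illustration, not data from the course:

```python
# Sketch: checking equality of variances before choosing a two-sample t-test.
# The two groups below are hypothetical samples, not the survey from the text.
from scipy import stats

group1 = [15, 12, 14, 17, 13, 16, 14, 15]   # hypothetical scores, group 1
group2 = [13, 10, 14, 6, 12, 9, 11, 8]      # hypothetical scores, group 2

# Levene's test: Ho says the population variances (standard deviations) are equal.
stat, p_equal_var = stats.levene(group1, group2)

# A common (somewhat arbitrary) rule: if this test does not reject equal
# variances, use the pooled t-test; otherwise use the unpooled (Welch) version.
use_pooled = p_equal_var > 0.05
res = stats.ttest_ind(group1, group2, equal_var=use_pooled)
print(p_equal_var, res.pvalue)
```

The key design point is that the variance test only chooses *which* two-sample t-test to run; the hypotheses about the means are tested afterward.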
Now let’s look at a complete example of conducting a two-sample t-test, including the embedded test for equality of variances.
This question was asked of a random sample of 239 college students, who were to answer on a scale of 1 to 25. An answer of 1 means personality has maximum importance and looks no importance at all, whereas an answer of 25 means looks have maximum importance and personality no importance at all. The purpose of this survey was to examine whether males and females differ with respect to the importance of looks vs. personality.
Note that the data have the following format:
Score (Y)  Gender (X) 
15  Male 
13  Female 
10  Female 
12  Male 
14  Female 
14  Male 
6  Male 
17  Male 
etc. 
The format of the data reminds us that we are essentially examining the relationship between the two-valued categorical variable, gender, and the quantitative response, score. The two values of the categorical explanatory variable (k = 2) define the two populations that we are comparing — males and females. The comparison is with respect to the response variable score. Here is a figure that summarizes the example:
Comments:
Step 1: State the hypotheses
Recall that the purpose of this survey was to examine whether the opinions of females and males differ with respect to the importance of looks vs. personality. The hypotheses in this case are therefore:
Ho: μ_{1} – μ_{2} = 0 (which is the same as μ_{1} = μ_{2})
Ha: μ_{1} – μ_{2} ≠ 0 (which is the same as μ_{1} ≠ μ_{2})
where μ_{1} represents the mean “looks vs personality score” for females and μ_{2} represents the mean “looks vs personality score” for males.
It is important to understand that conceptually, the two hypotheses claim:
Ho: Score (of looks vs. personality) is not related to gender
Ha: Score (of looks vs. personality) is related to gender
Step 2: Obtain data, check conditions, and summarize data
The output might also be broken up if you export or copy the items in certain ways. The results are the same but it can be more difficult to read.
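Step 2 typically begins by reorganizing the long-format records into one sample per group. A minimal pure-Python sketch (the records below are hypothetical rows in the format shown above, not the actual survey data):

```python
# Sketch: turning long-format records (score, gender) into one sample per group.
# The records are hypothetical rows matching the table format in the text.
from collections import defaultdict
from statistics import mean, stdev

records = [(15, "Male"), (13, "Female"), (10, "Female"), (12, "Male"),
           (14, "Female"), (14, "Male"), (6, "Male"), (17, "Male")]

groups = defaultdict(list)
for score, gender in records:
    groups[gender].append(score)

# Descriptive summaries by group: the usual first look in Step 2.
for gender, scores in groups.items():
    print(gender, len(scores), round(mean(scores), 2), round(stdev(scores), 2))
```

With the data in this shape, the two lists can be passed directly to whatever two-sample procedure the software provides.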
Step 3: Find the p-value of the test by using the test statistic as follows
Step 4: Conclusion
As usual, a small p-value provides evidence against Ho. In our case our p-value is practically 0 (which is smaller than any level of significance that we will choose). The data therefore provide very strong evidence against Ho, so we reject it.
As a follow-up to this conclusion, we can construct a confidence interval for the difference between population means. In this case we will construct a confidence interval for μ_{1} – μ_{2}, the population mean “looks vs personality score” for females minus the population mean “looks vs personality score” for males.
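The test and the follow-up interval can be sketched numerically as follows. The scores are hypothetical stand-ins (not the actual survey data); the unequal-variances (Welch) version is computed by hand and then cross-checked against scipy:

```python
# Sketch: Welch two-sample t-test with a follow-up 95% CI for mu1 - mu2,
# computed by hand on hypothetical data and cross-checked against SciPy.
from math import sqrt
from statistics import mean, variance
from scipy import stats

females = [13, 10, 14, 16, 12, 15, 11, 14, 13, 12]   # hypothetical scores
males   = [15, 12, 14, 6, 17, 18, 16, 15, 19, 14]    # hypothetical scores

n1, n2 = len(females), len(males)
diff = mean(females) - mean(males)
v1, v2 = variance(females) / n1, variance(males) / n2
se = sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom for the unequal-variances test.
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

t_stat = diff / se
p_value = 2 * stats.t.sf(abs(t_stat), df)

# 95% CI for mu1 - mu2; if 0 falls outside it, Ho would be rejected.
t_crit = stats.t.ppf(0.975, df)
ci = (diff - t_crit * se, diff + t_crit * se)

# Cross-check against SciPy's built-in Welch test.
res = stats.ttest_ind(females, males, equal_var=False)
print(t_stat, p_value, ci)
```

The hand computation and `ttest_ind(..., equal_var=False)` agree because both use the Welch statistic and degrees of freedom.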
Practical Significance:
We should definitely ask ourselves whether this difference is practically significant.
SPSS Output for this example (Non-Parametric Output for Examples 1 and 2)
SAS Output and SAS Code (Includes Non-Parametric Test)
Here is another example.
A study was conducted which enrolled and followed heart attack patients in a certain metropolitan area. In this example we are interested in determining if there is a relationship between Body Mass Index (BMI) and gender. Individuals presenting to the hospital with a heart attack were randomly selected to participate in the study.
Step 1: State the hypotheses
Ho: μ_{1} – μ_{2} = 0 (which is the same as μ_{1} = μ_{2})
Ha: μ_{1} – μ_{2} ≠ 0 (which is the same as μ_{1} ≠ μ_{2})
where μ_{1} represents the mean BMI for males and μ_{2} represents the mean BMI for females.
It is important to understand that conceptually, the two hypotheses claim:
Ho: BMI is not related to gender in heart attack patients
Ha: BMI is related to gender in heart attack patients
Step 2: Obtain data, check conditions, and summarize data
Step 3: Find the p-value of the test by using the test statistic as follows
Step 4: Conclusion
As usual, a small p-value provides evidence against Ho. In our case our p-value is 0.001 (which is smaller than any level of significance that we will choose). The data therefore provide very strong evidence against Ho, so we reject it.
As a follow-up to this conclusion, we can construct a confidence interval for the difference between population means. In this case we will construct a confidence interval for μ_{1} – μ_{2}, the population mean BMI for males minus the population mean BMI for females.
Practical Significance:
SPSS Output for this example (Non-Parametric Output for Examples 1 and 2)
SAS Output and SAS Code (Includes Non-Parametric Test)
Note: In the SAS output the variable gender is not formatted; in this case, Males = 0 and Females = 1.
Comments:
You might ask yourself: “Where do we use the test statistic?”
It is true that for all practical purposes all we have to do is check that the conditions which allow us to use the two-sample t-test are met, lift the p-value from the output, and draw our conclusions accordingly.
However, we feel that it is important to mention the test statistic for two reasons:
Now try some more activities for yourself.
We will look at one nonparametric test in the two independent samples setting. More details will be discussed later (Details for Non-Parametric Alternatives).
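For reference, the usual nonparametric alternative in this setting is the Wilcoxon rank-sum (Mann-Whitney U) test, which compares the two groups using ranks rather than means. A sketch with hypothetical data:

```python
# Sketch: Mann-Whitney U (Wilcoxon rank-sum) test, a nonparametric
# alternative to the two-sample t-test. Data below are hypothetical.
from scipy import stats

group1 = [15, 12, 14, 17, 13, 16, 14, 15]
group2 = [13, 10, 14, 6, 12, 9, 11, 8]

# Ho: the two population distributions are the same.
res = stats.mannwhitneyu(group1, group2, alternative="two-sided")
print(res.statistic, res.pvalue)
```

Because the test works on ranks, it requires no normality assumption, which is exactly why it serves as the fallback when the t-test conditions fail.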
Related SAS Tutorials
Related SPSS Tutorials
We are in Case C→Q of inference about relationships, where the explanatory variable is categorical and the response variable is quantitative.
As we mentioned in the summary of the introduction to Case C→Q, the first case that we will deal with is that involving matched pairs. In this case:
Notice from this point forward we will use the terms population 1 and population 2 instead of subpopulation 1 and subpopulation 2. Either terminology is correct.
One of the most common cases where dependent samples occur is when both samples have the same subjects and they are “paired by subject.” In other words, each subject is measured twice on the response variable, typically before and then after some kind of treatment/intervention in order to assess its effectiveness.
Suppose you want to assess the effectiveness of an SAT prep class.
It would make sense to use the matched pairs design and record each sampled student’s SAT score before and after the SAT prep classes are attended:
Recall that the two populations represent the two values of the explanatory variable. In this situation, those two values come from a single set of subjects.
This, however, is not the only case where the paired design is used. Other cases are when the pairs are “natural pairs,” such as siblings, twins, or couples.
Notes about graphical summaries for paired data in Case C→Q:
The idea behind the paired t-test is to reduce this two-sample situation, where we are comparing two means, to a single-sample situation where we are doing inference on a single mean, and then use a simple t-test that we introduced in the previous module.
In this setting, we can easily reduce the raw data to a set of differences and conduct a one-sample t-test.
In other words, by reducing the two samples to one sample of differences, we are essentially reducing the problem from a problem where we’re comparing two means (i.e., doing inference on μ_{1}−μ_{2}) to a problem in which we are studying one mean.
In general, in every matched pairs problem, our data consist of 2 samples which are organized in n pairs:
We reduce the two samples to only one by calculating the difference between the two observations for each pair.
For example, think of Sample 1 as “before” and Sample 2 as “after”. We can find the difference between the before and after results for each participant, which gives us only one sample, namely “before – after”. We label this difference as “d” in the illustration below.
The paired t-test is based on this one sample of n differences,
and it uses those differences as data for a one-sample t-test on a single mean — the mean of the differences.
This is the general idea behind the paired t-test; it is nothing more than a regular one-sample t-test for the mean of the differences!
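This reduction is easy to verify numerically. In the sketch below (the before/after measurements are hypothetical), running a one-sample t-test on the differences gives exactly the same statistic and p-value as the built-in paired test:

```python
# Sketch: the paired t-test is just a one-sample t-test on the differences.
# "before"/"after" are hypothetical paired measurements, not course data.
from scipy import stats

before = [6.2, 5.8, 7.1, 6.5, 5.9, 6.8, 7.0, 6.1]
after  = [6.9, 6.1, 7.4, 6.4, 6.6, 7.2, 7.5, 6.3]

# Reduce the two samples to one sample of differences d = before - after ...
diffs = [b - a for b, a in zip(before, after)]
one_sample = stats.ttest_1samp(diffs, popmean=0)

# ... which gives exactly the same result as the built-in paired test.
paired = stats.ttest_rel(before, after)
print(one_sample.statistic, paired.statistic)
```

The two calls agree because `ttest_rel` is defined as a one-sample t-test on the pairwise differences.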
We will now go through the 4-step process of the paired t-test.
Recall that in the t-test for a single mean our null hypothesis was: Ho: μ = μ_{0} and the alternative was one of Ha: μ < μ_{0} or μ > μ_{0} or μ ≠ μ_{0}. Since the paired t-test is a special case of the one-sample t-test, the hypotheses are the same except that:
Instead of simply μ we use the notation μ_{d} to denote that the parameter of interest is the mean of the differences.
In this course our null value μ_{0} is always 0. In other words, going back to our original paired samples, our null hypothesis claims that there is no difference between the two means. (Technically, it does not have to be zero if you are interested in a more specific difference – for example, you might be interested in showing that there is a reduction in blood pressure of more than 10 points – but we will not specifically look at such situations.)
Therefore, in the paired t-test: The null hypothesis is always:
Ho: μ_{d} = 0
(There IS NO association between the categorical explanatory variable and the quantitative response variable)
We will focus on the two-sided alternative hypothesis of the form:
Ha: μ_{d} ≠ 0
(There IS AN association between the categorical explanatory variable and the quantitative response variable)
Some students find it helpful to know that it turns out that μ_{d} = μ_{1} – μ_{2} (in other words, the difference between the means is the same as the mean of the differences). You may find it easier to first think about the hypotheses in terms of μ_{1} – μ_{2} and then represent it in terms of μ_{d}.
The paired t-test, as a special case of a one-sample t-test, can be safely used as long as:
The sample of differences is random (or at least can be considered random in context).
The differences should vary normally in the population if the sample size is small. If the sample size is large, it is safe to use the paired t-test regardless of whether the differences vary normally or not. This condition is satisfied in the three situations marked by a green check mark in the table below.
Note: normality is checked by looking at the histogram of differences, and as long as no clear violation of normality (such as extreme skewness and/or outliers) is apparent, the normality assumption is reasonable.
Assuming that we can safely use the paired t-test, the data are summarized by a test statistic:
t = (x̄_{d} – 0) / (s_{d} / √n)
where x̄_{d} is the sample mean of the differences, s_{d} is the sample standard deviation of the differences, and n is the number of pairs.
This test statistic measures (in standard errors) how far our data are (represented by the sample mean of the differences) from the null hypothesis (represented by the null value, 0).
Notice this test statistic has the same general form as those discussed earlier: (sample estimate – null value) divided by the standard error of the estimate.
As a special case of the one-sample t-test, the null distribution of the paired t-test statistic is a t distribution (with n – 1 degrees of freedom), which is the distribution under which the p-values are calculated. We will use software to find the p-value for us.
As usual, we draw our conclusion based on the p-value. Be sure to write your conclusions in context by specifying your current variables and/or precisely describing the population mean difference in terms of the current variables.
In particular, if a cutoff probability, α (significance level), is specified, we reject Ho if the p-value is less than α. Otherwise, we fail to reject Ho.
If the p-value is small, there is a statistically significant difference between what was observed in the sample and what was claimed in Ho, so we reject Ho.
Conclusion: There is enough evidence that the categorical explanatory variable is associated with the quantitative response variable. More specifically, there is enough evidence that the population mean difference is not equal to zero.
Remember: a small p-value tells us that there is very little chance of getting data like those observed (or even more extreme) if the null hypothesis were true. Therefore, a small p-value indicates that we should reject the null hypothesis.
If the p-value is not small, we do not have enough statistical evidence to reject Ho.
Conclusion: There is NOT enough evidence that the categorical explanatory variable is associated with the quantitative response variable. More specifically, there is NOT enough evidence that the population mean difference is not equal to zero.
Notice how much better the first sentence sounds! It can get difficult to correctly phrase these conclusions in terms of the mean difference without confusing double negatives.
As in previous methods, we can follow up with a confidence interval for the mean difference, μ_{d}, and interpret this interval in the context of the problem.
Interpretation: We are 95% confident that the population mean difference (described in context) is between (lower bound) and (upper bound).
Confidence intervals can also be used to determine whether or not to reject the null hypothesis of the test based upon whether or not the null value of zero falls outside the interval or inside.
If the null value, 0, falls outside the confidence interval, Ho is rejected. (Zero is NOT a plausible value based upon the confidence interval)
If the null value, 0, falls inside the confidence interval, Ho is not rejected. (Zero IS a plausible value based upon the confidence interval)
NOTE: Be careful to choose the correct confidence interval about the population mean difference and not the individual confidence intervals for the means in the groups themselves.
Now let’s look at an example.
Note: In some of the videos presented in the course materials, we do conduct the one-sided test for these data instead of the two-sided test we conduct below. In Unit 4B we are going to restrict our attention to two-sided tests supplemented by confidence intervals as needed to provide more information about the effect of interest.
Drunk driving is one of the main causes of car accidents. Interviews with drunk drivers who were involved in accidents and survived revealed that one of the main problems is that drivers do not realize that they are impaired, thinking “I only had 1–2 drinks … I am OK to drive.”
A sample of 20 drivers was chosen, and their reaction times in an obstacle course were measured before and after drinking two beers. The purpose of this study was to check whether drivers are impaired after drinking two beers. Here is a figure summarizing this study:
Since the measurements are paired, we can easily reduce the raw data to a set of differences and conduct a one-sample t-test.
Here are some of the results for this data:
Step 1: State the hypotheses
We define μ_{d} = the population mean difference in reaction times (Before – After).
As we mentioned, the null hypothesis is:
The null hypothesis claims that the differences in reaction times are centered at (or around) 0, indicating that drinking two beers has no real impact on reaction times. In other words, drivers are not impaired after drinking two beers.
Although we really want to know whether their reaction times are longer after the two beers, we will still focus on conducting two-sided hypothesis tests. We will be able to address whether the reaction times are longer after two beers when we look at the confidence interval.
Therefore, we will use the two-sided alternative:
Step 2: Obtain data, check conditions, and summarize data
Let’s first check whether we can safely proceed with the paired t-test, by checking the two conditions.
We can see from the histogram above that there is no evidence of violation of the normality assumption (on the contrary, the histogram looks quite normal).
Also note that the vast majority of the differences are negative (i.e., the total reaction times for most of the drivers are larger after the two beers), suggesting that the data provide evidence against the null hypothesis.
The question (which the p-value will answer) is whether these data provide strong enough evidence or not against the null hypothesis. We can safely proceed to calculate the test statistic (which in practice we leave to the software to calculate for us).
Test Statistic: We will use software to calculate the test statistic, which is t = -2.58.
Step 3: Find the p-value of the test by using the test statistic as follows
As a special case of the one-sample t-test, the null distribution of the paired t-test statistic is a t distribution (with n – 1 degrees of freedom), which is the distribution under which the p-values are calculated.
We will let the software find the p-value for us, and in this case, it gives us a p-value of 0.0183 (SAS) or 0.018 (SPSS).
The small p-value tells us that there is very little chance of getting data like those observed (or even more extreme) if the null hypothesis were true. More specifically, there is less than a 2% chance (0.018 = 1.8%) of obtaining a test statistic of -2.58 (or lower) or 2.58 (or higher), assuming that 2 beers have no impact on reaction times.
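This p-value is easy to verify directly from the test statistic: with n = 20 pairs, the null distribution is a t distribution with 19 degrees of freedom, and the two-sided p-value doubles the tail area beyond the magnitude of the observed statistic:

```python
# Verifying the reported two-sided p-value from the test statistic.
from scipy import stats

t_mag = 2.58   # magnitude of the observed test statistic
df = 20 - 1    # n - 1 degrees of freedom for n = 20 pairs

p_value = 2 * stats.t.sf(t_mag, df)   # two-sided: both tails
print(round(p_value, 4))  # close to the 0.0183 reported by the software
```

This is exactly the calculation the software performs behind the scenes in Step 3.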
Step 4: Conclusion
In our example, the p-value is 0.018, indicating that the data provide enough evidence to reject Ho.
Follow-up Confidence Interval:
As a follow-up to this conclusion, we quantify the effect that two beers have on the driver, using the 95% confidence interval for μ_{d}.
Using statistical software, we find that the 95% confidence interval for μ_{d}, the mean of the differences (before – after), is roughly (-0.9, -0.1).
Note: Since the differences were calculated before – after, longer reaction times after the beers would translate into negative differences.
Since the confidence interval does not contain the null value of zero, we can use it to decide to reject the null hypothesis. Zero is not a plausible value of the population mean difference based upon the confidence interval. Notice that using this method is not always practical, as often we still need to provide the p-value in clinical research. (Note: this is NOT the interpretation of the confidence interval but a method of using the confidence interval to conduct a hypothesis test.)
Practical Significance:
We should definitely ask ourselves if this is practically significant and I would argue that it is.
In the output, we are generally provided the two-sided p-value. We must be very careful when converting this to a one-sided p-value (if this is not provided by the software).
The “driving after having 2 beers” example is a case in which observations are paired by subject. In other words, both samples have the same subject, so that each subject is measured twice. Typically, as in our example, one of the measurements occurs before a treatment/intervention (2 beers in our case), and the other measurement after the treatment/intervention.
Our next example is another typical type of study where the matched pairs design is used—it is a study involving twins.
Researchers have long been interested in the extent to which intelligence, as measured by IQ score, is affected by “nurture” as opposed to “nature”: that is, are people’s IQ scores mainly a result of their upbringing and environment, or are they mainly an inherited trait?
A study was designed to measure the effect of home environment on intelligence, or more specifically, the study was designed to address the question: “Are there statistically significant differences in IQ scores between people who were raised by their birth parents, and those who were raised by someone else?”
In order to be able to answer this question, the researchers needed to get two groups of subjects (one from the population of people who were raised by their birth parents, and one from the population of people who were raised by someone else) who are as similar as possible in all other respects. In particular, since genetic differences may also affect intelligence, the researchers wanted to control for this confounding factor.
We know from our discussion on study design (in the Producing Data unit of the course) that one way to (at least theoretically) control for all confounding factors is randomization—randomizing subjects to the different treatment groups. In this case, however, this is not possible. This is an observational study; you cannot randomize children to either be raised by their birth parents or to be raised by someone else. How else can we eliminate the genetics factor? We can conduct a “twin study.”
Because identical twins are genetically the same, a good design for obtaining information to answer this question would be to compare IQ scores for identical twins, one of whom is raised by birth parents and the other by someone else. Such a design (matched pairs) is an excellent way of making a comparison between individuals who only differ with respect to the explanatory variable of interest (upbringing) but are as alike as they can possibly be in all other important aspects (inborn intelligence). Identical twins raised apart were studied by Susan Farber, who published her studies in the book “Identical Twins Reared Apart” (1981, Basic Books).
In this problem, we are going to use the data that appear in Farber’s book in table E6, of the IQ scores of 32 pairs of identical twins who were reared apart.
Here is a figure that will help you understand this study:
Here are the important things to note in the figure:
Each of the 32 rows represents one pair of twins. Keeping the notation that we used above, twin 1 is the twin that was raised by his/her birth parents, and twin 2 is the twin that was raised by someone else. Let’s carry out the analysis.
Step 1: State the hypotheses
Recall that in matched pairs, we reduce the data from two samples to one sample of differences:
The hypotheses are stated in terms of the mean of the differences, where μ_{d} = population mean difference in IQ scores (Birth Parents – Someone Else):
Step 2: Obtain data, check conditions, and summarize data
Is it safe to use the paired t-test in this case?
The data don’t reveal anything that we should be worried about (like very extreme skewness or outliers), so we can safely proceed. Looking at the histogram, we note that most of the differences are negative, indicating that in most of the 32 pairs of twins, twin 2 (raised by someone else) has a higher IQ.
From this point we rely on statistical software, and find that:
Our test statistic is t = -1.85.
Our data (represented by the sample mean of the differences) are 1.85 standard errors below the null hypothesis (represented by the null value 0).
Step 3: Find the p-value of the test by using the test statistic as follows
The p-value is 0.074, indicating that there is a 7.4% chance of obtaining data like those observed (or even more extreme) assuming that H_{o} is true (i.e., assuming that there are no differences in IQ scores between people who were raised by their natural parents and those who weren’t).
Step 4: Conclusion
Using the conventional significance level (cutoff probability) of 0.05, our p-value is not small enough, and we therefore cannot reject H_{o}.
Confidence Interval:
The 95% confidence interval for the population mean difference is (-6.11322, 0.30072).
Interpretation:
This confidence interval does contain zero and thus results in the same conclusion to the hypothesis test. Zero IS a plausible value of the population mean difference and thus we cannot reject the null hypothesis.
Practical Significance:
It is very important to pay attention to whether the two-sample t-test or the paired t-test is appropriate. In other words, being aware of the study design is extremely important. Consider the “two beers” example: if we had not “caught” that this is a matched pairs design, and had analyzed the data as if the two samples were independent using the two-sample t-test, we would have obtained a p-value of 0.114.
Note that using this (wrong) method to analyze the data, and a significance level of 0.05, we would conclude that the data do not provide enough evidence for us to conclude that reaction times differed after drinking two beers. This is an example of how using the wrong statistical method can lead you to wrong conclusions, which in this context can have very serious implications.
Comments:
Now try a complete example for yourself.
Here are two other datasets with paired samples.
The statistical tests we have previously discussed (and many we will discuss) require assumptions about the distribution in the population or about the requirements to use a certain approximation as the sampling distribution. These methods are called parametric.
When these assumptions are not valid, alternative methods often exist to test similar hypotheses. Tests which require only minimal distributional assumptions, if any, are called nonparametric or distribution-free tests.
At the end of this section we will provide some details (see Details for Non-Parametric Alternatives); for now we simply want to mention that there are two common nonparametric alternatives to the paired t-test. They are:
The fact that both of these tests have the word “sign” in their names is not a coincidence. It comes from the fact that we are interested in whether each difference has a positive sign or a negative sign, and it can help you remember that these are paired methods, where we are often interested in whether there was an increase (positive sign) or a decrease (negative sign).
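As a rough sketch of how these two alternatives work in practice (the paired data below are hypothetical, and the sign test is run here through a binomial test on the number of positive differences):

```python
# Sketch: two nonparametric alternatives to the paired t-test, on
# hypothetical paired data. The sign test uses only the sign of each
# difference; the Wilcoxon signed-rank test also uses the ranks.
from scipy import stats

before = [6.2, 5.8, 7.1, 6.5, 5.9, 6.8, 7.0, 6.1, 6.4, 5.7]
after  = [6.9, 6.1, 7.4, 6.4, 6.6, 7.2, 7.5, 6.3, 6.8, 6.0]

diffs = [b - a for b, a in zip(before, after) if b != a]  # drop zero diffs
n_pos = sum(d > 0 for d in diffs)

# Sign test: under Ho the signs behave like fair coin flips, so the number
# of positive differences is Binomial(n, 0.5).
sign_test = stats.binomtest(n_pos, n=len(diffs), p=0.5)

# Wilcoxon signed-rank test on the paired samples.
wilcoxon = stats.wilcoxon(before, after)
print(sign_test.pvalue, wilcoxon.pvalue)
```

Because the signed-rank test uses more of the information in the differences, it is typically the more powerful of the two when its (milder) symmetry assumption holds.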
This last part of the four-step process of hypothesis testing is the same across all statistical tests, and actually, we’ve already said basically everything there is to say about it, but it can’t hurt to say it again.
The p-value is a measure of how much evidence the data present against Ho. The smaller the p-value, the more evidence the data present against Ho.
We already mentioned that what determines what constitutes enough evidence against Ho is the significance level (α, alpha), a cutoff point below which the p-value is considered small enough to reject Ho in favor of Ha. The most commonly used significance level is 0.05.
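The decision rule itself is mechanical, as this tiny sketch shows (the two p-values used in the usage lines are arbitrary illustrative numbers):

```python
# The generic Step 4 decision rule: compare the p-value to alpha.
def decision(p_value, alpha=0.05):
    if p_value <= alpha:
        return "reject Ho"       # statistically significant result
    return "fail to reject Ho"   # not enough evidence (we never "accept" Ho)

print(decision(0.023))  # reject Ho
print(decision(0.182))  # fail to reject Ho
```

What is *not* mechanical is the second sub-step: translating "reject Ho" or "fail to reject Ho" into a conclusion stated in the context of the problem.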
where, instead of “Ha is True,” we write what this means in the words of the problem (in other words, in the context of the current scenario).
It is important to mention again that this step has essentially two substeps:
Note: We always still must consider whether the results have any practical significance, particularly if they are statistically significant, as a statistically significant result which has no practical use is essentially meaningless!
Let’s go back to our three examples and draw conclusions.
Has the proportion of defective products been reduced as a result of the repair?
We found that the p-value for this test was 0.023.
Since 0.023 is small (in particular, 0.023 < 0.05), the data provide enough evidence to reject Ho.
Conclusion:
The following figure is the complete story of this example, and includes all the steps we went through, starting from stating the hypotheses and ending with our conclusions:
Is the proportion of marijuana users in the college higher than the national figure?
We found that the p-value for this test was 0.182.
Since 0.182 is not small (in particular, 0.182 > 0.05), the data do not provide enough evidence to reject Ho.
Conclusion:
Here is the complete story of this example:
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
We found that the p-value for this test was 0.021.
Since 0.021 is small (in particular, 0.021 < 0.05), the data provide enough evidence to reject Ho.
Conclusion:
Here is the complete story of this example:
Many students wonder why 5% is often selected as the significance level in hypothesis testing, and why 1% is the next most typical level. This is largely due to just convenience and tradition.
When Ronald Fisher (one of the founders of modern statistics) published one of his tables, he used a mathematically convenient scale that included 5% and 1%. Later, these same 5% and 1% levels were used by other people, in part just because Fisher was so highly esteemed. But mostly these are arbitrary levels.
The idea of selecting some sort of relatively small cutoff was historically important in the development of statistics; but it’s important to remember that there is really a continuous range of increasing confidence towards the alternative hypothesis, not a single all-or-nothing value. There isn’t much meaningful difference, for instance, between a p-value of 0.049 or 0.051, and it would be foolish to declare one case definitely a “real” effect and to declare the other case definitely a “random” effect. In either case, the study results were roughly 5% likely by chance if there’s no actual effect.
Whether such a p-value is sufficient for us to reject a particular null hypothesis ultimately depends on the risk of making the wrong decision, and the extent to which the hypothesized effect might contradict our prior experience or previous studies.
We have now completed going through the four steps of hypothesis testing, and in particular we learned how they are applied to the z-test for the population proportion. Here is a brief summary:
State the null hypothesis:
Ho: p = p_{0}
State the alternative hypothesis:
Ha: p < p_{0} (one-sided)
Ha: p > p_{0} (one-sided)
Ha: p ≠ p_{0} (two-sided)
where the choice of the appropriate alternative (out of the three) is usually quite clear from the context of the problem. If you feel it is not clear, it is most likely a two-sided problem. Students are usually good at recognizing the “more than” and “less than” terminology, but differences can sometimes be more difficult to spot; sometimes this is because you have preconceived ideas of how you think it should be! Use only the information given in the problem.
Obtain data from a sample and:
(i) Check whether the data satisfy the conditions which allow you to use this test.
random sample (or at least a sample that can be considered random in context)
the conditions under which the sampling distribution of p̂ is normal are met
(ii) Calculate the sample proportion p̂, and summarize the data using the test statistic:
z = (p̂ – p_{0}) / √(p_{0}(1 – p_{0}) / n)
(Recall: This standardized test statistic represents how many standard deviations above or below p_{0} our sample proportion p̂ is.)
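The test statistic calculation can be sketched as a small function (the counts below are made-up numbers, not from any of the course examples):

```python
# Sketch: the one-proportion z test statistic.
from math import sqrt

def z_statistic(count, n, p0):
    """z = (p-hat - p0) / sqrt(p0 * (1 - p0) / n)."""
    p_hat = count / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# e.g. 64 successes in a sample of 400, tested against the null value p0 = 0.20:
z = z_statistic(64, 400, 0.20)
print(round(z, 2))  # -2.0: p-hat lies 2 standard deviations below p0
```

Note that the standard error in the denominator uses the null value p0, not p-hat, because the statistic is standardized under the assumption that Ho is true.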
When the alternative hypothesis is “less than,” the p-value is the probability of observing a test statistic as small as that observed or smaller, assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a left-tailed test. We shaded to the left of the test statistic, since “less than” is to the left.
When the alternative hypothesis is “greater than,” the p-value is the probability of observing a test statistic as large as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a right-tailed test. We shaded to the right of the test statistic, since “greater than” is to the right.
When the alternative hypothesis is “not equal to,” the p-value is the probability of observing a test statistic which is as large in magnitude as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution.
This is often referred to as a two-tailed test, since we shaded in both directions.
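The three shaded areas correspond to three simple calculations under the standard normal distribution; a sketch (z = -2.0 is an arbitrary example value):

```python
# Sketch: turning a z test statistic into a p-value under each alternative,
# using the standard normal null distribution.
from scipy.stats import norm

z = -2.0  # hypothetical observed test statistic

p_left  = norm.cdf(z)          # Ha: p < p0, area to the left (left-tailed)
p_right = norm.sf(z)           # Ha: p > p0, area to the right (right-tailed)
p_two   = 2 * norm.sf(abs(z))  # Ha: p != p0, both tails (two-tailed)
print(round(p_left, 4), round(p_right, 4), round(p_two, 4))
```

Notice that the two-tailed p-value is simply twice the smaller one-tailed area, which is why converting between one-sided and two-sided p-values requires care about the direction of the observed statistic.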
Reach a conclusion first regarding the statistical significance of the results, and then determine what it means in the context of the problem.
If the p-value ≤ 0.05, then WE REJECT Ho.
Conclusion: There IS enough evidence that Ha is true.
If the p-value > 0.05, then WE FAIL TO REJECT Ho.
Conclusion: There IS NOT enough evidence that Ha is true.
Recall that: If the p-value is small (in particular, smaller than the significance level, which is usually 0.05), the results are statistically significant (in the sense that there is a statistically significant difference between what was observed in the sample and what was claimed in Ho), and so we reject Ho.
If the p-value is not small, we do not have enough statistical evidence to reject Ho, and so we continue to believe that Ho may be true. (Remember: In hypothesis testing we never “accept” Ho.)
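The decision rule described above can be sketched as a tiny function. This is our own illustration, not part of the course materials; the 0.05 cutoff is the conventional significance level mentioned above.

```python
def conclusion(p_value, alpha=0.05):
    """Apply the decision rule: reject Ho when the p-value is at most alpha."""
    if p_value <= alpha:
        return "Reject Ho: there IS enough evidence that Ha is true"
    return "Fail to reject Ho: there is NOT enough evidence that Ha is true"

print(conclusion(0.023))   # a small p-value: statistically significant
print(conclusion(0.182))   # a large p-value: not statistically significant
```

The p-values 0.023 and 0.182 are the ones that come up in the defective-products and marijuana-use examples later in this section.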
Finally, in practice, we should always consider the practical significance of the results as well as the statistical significance.
Before we move on to the next test, we are going to use the z-test for proportions to bring up and illustrate a few more very important issues regarding hypothesis testing. This might also be a good time to review the concepts of Type I error, Type II error, and power before continuing on.
So far we’ve talked about the p-value at the intuitive level: understanding what it is (or what it measures) and how we use it to draw conclusions about the statistical significance of our results. We will now go more deeply into how the p-value is calculated.
It should be mentioned that eventually we will rely on technology to calculate the p-value for us (as well as the test statistic), but in order to make intelligent use of the output, it is important to first understand the details, and only then let the computer do the calculations for us. Again, our goal is to use this simple example to give you the tools you need to understand the process entirely. Let’s start.
Recall that so far we have said that the p-value is the probability of obtaining data like those observed assuming that Ho is true. Like the test statistic, the p-value is, therefore, a measure of the evidence against Ho. In the case of the test statistic, the larger it is in magnitude (positive or negative), the further phat is from p_{0}, and the more evidence we have against Ho. In the case of the p-value, it is the opposite; the smaller it is, the more unlikely it is to get data like those observed when Ho is true, and the more evidence we have against Ho. One can actually draw conclusions in hypothesis testing just using the test statistic, and as we’ll see, the p-value is, in a sense, just another way of looking at the test statistic. The reason that we actually take the extra step in this course and derive the p-value from the test statistic is that even though in this case (the test about the population proportion) and some other tests, the value of the test statistic has a very clear and intuitive interpretation, there are some tests where its value is not as easy to interpret. On the other hand, the p-value keeps its intuitive appeal across all statistical tests.
How is the p-value calculated?
Intuitively, the p-value is the probability of observing data like those observed assuming that Ho is true. Let’s be a bit more formal:
Putting it all together, we get that in general:
By “extreme” we mean extreme in the direction(s) of the alternative hypothesis.
Specifically, for the z-test for the population proportion:
OK, hopefully that makes (some) sense. But how do we actually calculate it?
Recall the important comment from our discussion about our test statistic,
which said that when the null hypothesis is true (i.e., when p = p_{0}), the possible values of our test statistic follow a standard normal (N(0,1), denoted by Z) distribution. Therefore, the p-value calculations (which assume that Ho is true) are simply standard normal distribution calculations for the 3 possible alternative hypotheses.
The p-value is the probability of observing a test statistic as small as that observed or smaller, assuming that the values of the test statistic follow a standard normal distribution. We will now represent this probability in symbols and also using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a left-tailed test. We shaded to the left of the test statistic, since “less than” is to the left.
The p-value is the probability of observing a test statistic as large as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution. Again, we will represent this probability in symbols and using the normal distribution.
Looking at the shaded region, you can see why this is often referred to as a right-tailed test. We shaded to the right of the test statistic, since “greater than” is to the right.
The p-value is the probability of observing a test statistic which is as large in magnitude as that observed or larger, assuming that the values of the test statistic follow a standard normal distribution.
This is often referred to as a two-tailed test, since we shaded in both directions.
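These three cases can be sketched as a small helper using Python’s standard-library normal distribution. The function and argument names here are our own illustration, not software the course requires.

```python
from statistics import NormalDist

Z = NormalDist()  # the standard normal distribution, N(0, 1)

def p_value(z, alternative):
    """p-value for a z test statistic under the three alternatives."""
    if alternative == "less than":      # left-tailed: P(Z <= z)
        return Z.cdf(z)
    if alternative == "greater than":   # right-tailed: P(Z >= z)
        return 1 - Z.cdf(z)
    # "not equal to" (two-tailed): P(|Z| >= |z|), i.e., both shaded tails
    return 2 * (1 - Z.cdf(abs(z)))
```

Note how the two-tailed case doubles one tail: since the standard normal distribution is symmetric, the two shaded regions have equal area.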
Next, we will apply this to our three examples. But first, work through the following activities, which should help your understanding.
Has the proportion of defective products been reduced as a result of the repair?
The p-value in this case is:
OR (recalling what the test statistic actually means in this case),
OR, more specifically,
In either case, the p-value is found as shown in the following figure:
To find P(Z ≤ −2) we can use the calculator or the table we learned to use in the probability unit for normal random variables. Eventually, after we understand the details, we will use software to run the test for us, and the output will give us all the information we need. The p-value that the statistical software provides for this specific example is 0.023. The p-value tells us that it is pretty unlikely (probability 0.023) to get data like those observed (a test statistic of −2 or less) assuming that Ho is true.
Is the proportion of marijuana users in the college higher than the national figure?
The p-value in this case is:
OR (recalling what the test statistic actually means in this case),
OR, more specifically,
In either case, the p-value is found as shown in the following figure:
Again, at this point we can use the calculator or table to find that the p-value, P(Z ≥ 0.91), is approximately 0.182.
The p-value tells us that it is not very surprising (probability 0.182) to get data like those observed (which yield a test statistic of 0.91 or higher) assuming that the null hypothesis is true.
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
The p-value in this case is:
OR (recalling what the test statistic actually means in this case),
OR, more specifically,
In either case, the p-value is found as shown in the following figure:
Again, at this point we can use the calculator or table to find that the p-value is 0.021; this is P(Z ≤ −2.31) + P(Z ≥ 2.31) = 2*P(Z ≥ 2.31).
The p-value tells us that it is pretty unlikely (probability 0.021) to get data like those observed (a test statistic as high as 2.31 or higher, or as low as −2.31 or lower) assuming that Ho is true.
Comment:
Similarly, in any test, p-values are found using the sampling distribution of the test statistic when the null hypothesis is true (also known as the “null distribution” of the test statistic). In this case, it was relatively easy to argue that the null distribution of our test statistic is N(0,1). As we’ll see, in other tests, other distributions come up (like the t-distribution and the F-distribution), which we will just mention briefly, and we will rely heavily on the output of our statistical package for obtaining the p-values.
We’ve just completed our discussion about the p-value and how it is calculated, both in general and more specifically for the z-test for the population proportion. Let’s go back to the four-step process of hypothesis testing and see what we’ve covered and what still needs to be discussed.
With respect to the z-test for the population proportion:
Step 1: Completed
Step 2: Completed
Step 3: Completed
Step 4. This is what we will work on next.
After the hypotheses have been stated, the next step is to obtain a sample (on which the inference will be based), collect relevant data, and summarize them.
It is extremely important that our sample is representative of the population about which we want to draw conclusions. This is ensured when the sample is chosen at random. Beyond the practical issue of ensuring representativeness, choosing a random sample has theoretical importance that we will mention later.
In the case of hypothesis testing for the population proportion (p), we will collect data on the relevant categorical variable from the individuals in the sample and start by calculating the sample proportion phat (the natural quantity to calculate when the parameter of interest is p).
Let’s go back to our three examples and add this step to our figures.
Has the proportion of defective products been reduced as a result of the repair?
Is the proportion of marijuana users in the college higher than the national figure?
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
As we mentioned earlier without going into details, when we summarize the data in hypothesis testing, we go a step beyond calculating the sample statistic and summarize the data with a test statistic. Every test has a test statistic, which to some degree captures the essence of the test. In fact, the p-value, which so far we have looked upon as “the king” (in the sense that everything is determined by it), is actually determined by (or derived from) the test statistic. We will now introduce the test statistic.
The test statistic is a measure of how far the sample proportion phat is from the null value p_{0}, the value that the null hypothesis claims is the value of p. In other words, since phat is what the data estimate p to be, the test statistic can be viewed as a measure of the “distance” between what the data tell us about p and what the null hypothesis claims p to be.
Let’s use our examples to understand this:
Has the proportion of defective products been reduced as a result of the repair?
The parameter of interest is p, the proportion of defective products following the repair.
The data estimate p to be phat = 0.16
The null hypothesis claims that p = 0.20
The data are therefore 0.04 (or 4 percentage points) below the null hypothesis value.
It is hard to evaluate whether this difference of 4 percentage points in defective products is enough evidence to say that the repair was effective, but clearly, the larger the difference, the more evidence we have against the null hypothesis. So if, for example, our sample proportion of defective products had been 0.10 instead of 0.16, then I think you would all agree that cutting the proportion of defective products in half (from 20% to 10%) would be extremely strong evidence that the repair was effective at reducing the proportion of defective products.
Is the proportion of marijuana users in the college higher than the national figure?
The parameter of interest is p, the proportion of students in a college who use marijuana.
The data estimate p to be phat = 0.19
The null hypothesis claims that p = 0.157
The data are therefore 0.033 (or 3.3 percentage points) above the null hypothesis value.
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
The parameter of interest is p, the proportion of U.S. adults who support the death penalty for convicted murderers.
The data estimate p to be phat = 0.675
The null hypothesis claims that p = 0.64
There is a difference of 0.035 (or 3.5 percentage points) between the data and the null hypothesis value.
The problem with looking only at the difference between the sample proportion phat and the null value p_{0} is that we have not taken into account the variability of our estimator phat, which, as we know from our study of sampling distributions, depends on the sample size.
For this reason, the test statistic cannot simply be the difference between phat and p_{0}, but must be some form of that formula that accounts for the sample size. In other words, we need to somehow standardize the difference so that comparison between different situations will be possible. We are very close to revealing the test statistic, but before we construct it, let’s be reminded of the following two facts from probability:
Fact 1: When we take a random sample of size n from a population with population proportion p, then
Fact 2: The z-score of any normal value (a value that comes from a normal distribution) is calculated by finding the difference between the value and the mean and then dividing that difference by the standard deviation (of the normal distribution associated with the value). The z-score represents how many standard deviations below or above the mean the value is.
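Fact 2 in code form, using made-up numbers for illustration (a value of 185 from a distribution with mean 170 and standard deviation 10; these numbers are our own, not from the examples):

```python
def z_score(value, mean, sd):
    """How many standard deviations the value is above (+) or below (-) the mean."""
    return (value - mean) / sd

print(z_score(185, 170, 10))   # 1.5: one and a half standard deviations above the mean
```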
Thus, our test statistic should be a measure of how far the sample proportion phat is from the null value p_{0} relative to the variation of phat (as measured by the standard error of phat).
Recall that the standard error is the standard deviation of the sampling distribution for a given statistic. For phat, we know the following:
To find the p-value, we will need to determine how surprising our value is assuming the null hypothesis is true. We already have the tools needed for this process from our study of sampling distributions as represented in the table above.
Has the proportion of defective products been reduced as a result of the repair?
If we assume the null hypothesis is true, we can specify that the center of the distribution of all possible values of phat from samples of size 400 would be 0.20 (our null value).
We can calculate the standard error, assuming p = 0.20 as
The following picture represents the sampling distribution of all possible values of phat of samples of size 400, assuming the true proportion p is 0.20 and our other requirements for the sampling distribution to be normal are met (we will review these during the next step).
In order to calculate probabilities for the picture above, we would need to find the z-score associated with our result.
This z-score is the test statistic! In this example, the numerator of our z-score is the difference between phat (0.16) and the null value (0.20), which we found earlier to be 0.04 below the null value, i.e., −0.04. The denominator of our z-score is the standard error calculated above (0.02), and thus we quickly find the z-score, our test statistic, to be −2.
The sample proportion based upon this data is 2 standard errors below the null value.
Hopefully you now understand more about why we need probability in statistics!
Now we will formalize the definition and look at our remaining examples before moving on to the next step, which will be to determine whether a normal distribution applies and to calculate the p-value.
Test Statistic for Hypothesis Tests for One Proportion is:
It represents the difference between the sample proportion and the null value, measured in standard deviations (standard error of phat).
The picture above is a representation of the sampling distribution of phat assuming p = p_{0}. In other words, this is a model of how phat behaves if we are drawing random samples from a population for which Ho is true.
Notice the center of the sampling distribution is at p_{0}, which is the hypothesized proportion given in the null hypothesis (Ho: p = p_{0}.) We could also mark the axis in standard error units,
For example, if our null hypothesis claims that the proportion of U.S. adults supporting the death penalty is 0.64, then the sampling distribution is drawn as if the null is true. We draw a normal distribution centered at 0.64 (p_{0}) with a standard error dependent on sample size,
Important Comment:
By “null distribution,” we mean the distribution under the assumption that Ho is true. As we’ll see and stress again later, the null distribution of the test statistic is what the calculation of the p-value is based on.
Let’s go back to our remaining two examples and find the test statistic in each case:
Is the proportion of marijuana users in the college higher than the national figure?
Since the null hypothesis is Ho: p = 0.157, the standardized (z) score of phat = 0.19 is
This is the value of the test statistic for this example.
We interpret this to mean that, assuming that Ho is true, the sample proportion phat = 0.19 is 0.91 standard errors above the null value (0.157).
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
Since the null hypothesis is Ho: p = 0.64, the standardized (z) score of phat = 0.675 is
This is the value of the test statistic for this example.
We interpret this to mean that, assuming that Ho is true, the sample proportion phat = 0.675 is 2.31 standard errors above the null value (0.64).
Comments about the Test Statistic:
Comments:
When we take a random sample of size n from a population with population proportion p_{0}, the possible values of the sample proportion phat (when certain conditions are met) have approximately a normal distribution with a mean of p_{0} and a standard deviation of
This result provides the theoretical justification for constructing the test statistic the way we did, and therefore the assumptions under which this result holds (in bold, above) are the conditions that our data need to satisfy so that we can use this test. These two conditions are:
i. The sample has to be random.
ii. The conditions under which the sampling distribution of phat is normal are met. In other words:
Let’s check the conditions in our three examples.
Has the proportion of defective products been reduced as a result of the repair?
i. The 400 products were chosen at random.
ii. n = 400, p_{0} = 0.2 and therefore:
Is the proportion of marijuana users in the college higher than the national figure?
i. The 100 students were chosen at random.
ii. n = 100, p_{0} = 0.157 and therefore:
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
i. The 1000 adults were chosen at random.
ii. n = 1000, p_{0} = 0.64 and therefore:
Checking that our data satisfy the conditions under which the test can be reliably used is a very important part of the hypothesis testing process. Be sure to consider this for every hypothesis test you conduct in this course and certainly in practice.
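These condition checks are easy to script. The rule of thumb coded below (both n·p_{0} and n·(1 − p_{0}) at least 10) is a common version of the normality condition; some texts use 5 instead, so treat the exact cutoff as an assumption of this sketch.

```python
examples = [
    ("defective products", 400, 0.20),
    ("marijuana use", 100, 0.157),
    ("death penalty", 1000, 0.64),
]

for name, n, p0 in examples:
    # both expected counts must meet the cutoff for phat to be
    # approximately normal under Ho
    ok = n * p0 >= 10 and n * (1 - p0) >= 10
    print(name, round(n * p0, 1), round(n * (1 - p0), 1), ok)
```

All three examples comfortably satisfy both conditions (the smallest count is 15.7, for the marijuana-use example).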
With respect to the z-test for the population proportion that we are currently discussing, we have:
Step 1: Completed
Step 2: Completed
Step 3: This is what we will work on next.
Now that we understand the process of hypothesis testing and the logic behind it, we are ready to start learning about specific statistical tests (also known as significance tests).
The first test we are going to learn is the test about the population proportion (p).
We will understand later where the “z-test” part of the name comes from.
This will be the only type of problem you will complete entirely “by hand” in this course. Our goal is to use this example to give you the tools you need to understand how this process works. After working a few problems, you should review the earlier material again. You will likely need to review the terminology and concepts a few times before you fully understand the process.
In reality, you will often be conducting more complex statistical tests and allowing software to provide the p-value. In these settings it will be important to know which test to apply for a given situation and to be able to explain the results in context.
When we conduct a test about a population proportion, we are working with a categorical variable. Later in the course, after we have learned a variety of hypothesis tests, we will need to be able to identify which test is appropriate for which situation. Identifying the variable as categorical or quantitative is an important component of choosing an appropriate hypothesis test.
In this part of our discussion on hypothesis testing, we will go into details that we did not go into before. More specifically, we will use this test to introduce the idea of a test statistic and details about how p-values are calculated.
Let’s start by introducing the three examples, which will be the leading examples in our discussion. Each example is followed by a figure illustrating the information provided, as well as the question of interest.
A machine is known to produce 20% defective products, and is therefore sent for repair. After the machine is repaired, 400 products produced by the machine are chosen at random and 64 of them are found to be defective. Do the data provide enough evidence that the proportion of defective products produced by the machine (p) has been reduced as a result of the repair?
The following figure displays the information, as well as the question of interest:
The question of interest helps us formulate the null and alternative hypotheses in terms of p, the proportion of defective products produced by the machine following the repair:
Ho: p = 0.20 (No change; the repair did not help).
Ha: p < 0.20 (The repair was effective at reducing the proportion of defective parts).
There are rumors that students at a certain liberal arts college are more inclined to use drugs than U.S. college students in general. Suppose that in a simple random sample of 100 students from the college, 19 admitted to marijuana use. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students in the college (p) is higher than the national proportion, which is 0.157? (This number is reported by the Harvard School of Public Health.)
Again, the following figure displays the information as well as the question of interest:
As before, we can formulate the null and alternative hypotheses in terms of p, the proportion of students in the college who use marijuana:
Ho: p = 0.157 (same as among all college students in the country).
Ha: p > 0.157 (higher than the national figure).
Polls on certain topics are conducted routinely in order to monitor changes in the public’s opinions over time. One such topic is the death penalty. In 2003 a poll estimated that 64% of U.S. adults support the death penalty for a person convicted of murder. In a more recent poll, 675 out of 1,000 U.S. adults chosen at random were in favor of the death penalty for convicted murderers. Do the results of this poll provide evidence that the proportion of U.S. adults who support the death penalty for convicted murderers (p) changed between 2003 and the later poll?
Here is a figure that displays the information, as well as the question of interest:
Again, we can formulate the null and alternative hypotheses in terms of p, the proportion of U.S. adults who support the death penalty for convicted murderers.
Ho: p = 0.64 (No change from 2003).
Ha: p ≠ 0.64 (Some change since 2003).
Recall that there are basically 4 steps in the process of hypothesis testing:
We are now going to go through these steps as they apply to the hypothesis testing for the population proportion p. It should be noted that even though the details will be specific to this particular test, some of the ideas that we will add apply to hypothesis testing in general.
Here again are the three sets of hypotheses that are being tested in each of our three examples:
Has the proportion of defective products been reduced as a result of the repair?
Is the proportion of marijuana users in the college higher than the national figure?
Did the proportion of U.S. adults who support the death penalty change between 2003 and a later poll?
The null hypothesis always takes the form:
and the alternative hypothesis takes one of the following three forms:
Note that it was quite clear from the context which form of the alternative hypothesis would be appropriate. The value that is specified in the null hypothesis is called the null value, and is generally denoted by p_{0}. We can say, therefore, that in general the null hypothesis about the population proportion (p) would take the form:
We write Ho: p = p_{0} to say that we are making the hypothesis that the population proportion has the value of p_{0}. In other words, p is the unknown population proportion and p_{0} is the number we think p might be for the given situation.
The alternative hypothesis takes one of the following three forms (depending on the context):
The first two possible forms of the alternatives (where the = sign in Ho is challenged by < or >) are called one-sided alternatives, and the third form (where the = sign in Ho is challenged by ≠) is called a two-sided alternative. To understand the intuition behind these names, let’s go back to our examples.
Example 3 (death penalty) is a case where we have a two-sided alternative:
In this case, in order to reject Ho and accept Ha we will need to get a sample proportion of death penalty supporters which is very different from 0.64 in either direction, either much larger or much smaller than 0.64.
In example 2 (marijuana use) we have a one-sided alternative:
Here, in order to reject Ho and accept Ha we will need to get a sample proportion of marijuana users which is much higher than 0.157.
Similarly, in example 1 (defective products), where we are testing:
in order to reject Ho and accept Ha, we will need to get a sample proportion of defective products which is much smaller than 0.20.
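Putting the whole procedure together for example 1, here is a sketch of the complete test in Python. This is our own end-to-end illustration of the steps described in this section; in practice the course will eventually have you read these numbers off statistical software output.

```python
from math import sqrt
from statistics import NormalDist

# Example 1 (defective products)
# Ho: p = 0.20   Ha: p < 0.20  (left-tailed)
n, defective, p0 = 400, 64, 0.20

phat = defective / n                        # sample proportion
z = (phat - p0) / sqrt(p0 * (1 - p0) / n)   # test statistic
p_value = NormalDist().cdf(z)               # left tail: P(Z <= z)

print(round(phat, 2))     # 0.16
print(round(z, 2))        # -2.0
print(round(p_value, 3))  # 0.023
print("reject Ho" if p_value <= 0.05 else "fail to reject Ho")
```

Since the p-value (0.023) is below the 0.05 significance level, we reject Ho and conclude there is enough evidence that the repair reduced the proportion of defective products.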