# Paired Samples

As we mentioned at the end of the Introduction to Unit 4B, we will focus only on two-sided tests for the remainder of this course. One-sided tests are often possible but rarely used in clinical research.
CO-4: Distinguish among different measurement scales, choose the appropriate descriptive and inferential statistical methods based on these distinctions, and interpret the results.
LO 4.35: For a data analysis situation involving two variables, choose the appropriate inferential method for examining the relationship between the variables and justify the choice.
LO 4.36: For a data analysis situation involving two variables, carry out the appropriate inferential method for examining relationships between the variables and draw the correct conclusions in context.
CO-5: Determine preferred methodological alternatives to commonly used statistical methods when assumptions are not met.
Video: Paired Samples (27:19)

Related SAS Tutorials

Related SPSS Tutorials

## Introduction – Matched Pairs (Paired t-test)

LO 4.37: Identify and distinguish between independent and dependent samples.
LO 4.38: In a given context, determine the appropriate standard method for comparing groups and provide the correct conclusions given the appropriate software output.
LO 4.39: In a given context, set up the appropriate null and alternative hypotheses for comparing groups.

We are in Case CQ of inference about relationships, where the explanatory variable is categorical and the response variable is quantitative.

As we mentioned in the summary of the introduction to Case C→Q, the first case that we will deal with is that involving matched pairs. In this case:

• The samples are paired or matched. Every observation in one sample is linked with an observation in the other sample.
• In other words, the samples are dependent. Notice from this point forward we will use the terms population 1 and population 2 instead of sub-population 1 and sub-population 2. Either terminology is correct.

One of the most common cases where dependent samples occur is when both samples have the same subjects and they are “paired by subject.” In other words, each subject is measured twice on the response variable, typically before and then after some kind of treatment/intervention in order to assess its effectiveness.

## EXAMPLE: SAT Prep Class

Suppose you want to assess the effectiveness of an SAT prep class.

It would make sense to use the matched pairs design and record each sampled student’s SAT score before and after the SAT prep classes are attended: Recall that the two populations represent the two values of the explanatory variable. In this situation, those two values come from a single set of subjects.

• In other words, both populations really have the same students.
• However, each population has a different value of the explanatory variable. Those values are: no prep class, prep class.

This, however, is not the only case where the paired design is used. Other cases are when the pairs are “natural pairs,” such as siblings, twins, or couples.

Notes about graphical summaries for paired data in Case CQ:

• Due to the paired nature of this type of data, we cannot really use side-by-side boxplots to visualize this data as the information contained in the pairing is completely lost.
• We will need to provide graphical summaries of the differences themselves in order to explore this type of data.

## The Idea Behind Paired t-Test

The idea behind the paired t-test is to reduce this two-sample situation, where we are comparing two means, to a single sample situation where we are doing inference on a single mean, and then use a simple t-test that we introduced in the previous module.

In this setting, we can easily reduce the raw data to a set of differences and conduct a one-sample t-test.

• Thus we simplify our inference procedure to a problem where we are making an inference about a single mean:  the mean of the differences.

In other words, by reducing the two samples to one sample of differences, we are essentially reducing the problem from a problem where we’re comparing two means (i.e., doing inference on μ1−μ2) to a problem in which we are studying one mean.

In general, in every matched pairs problem, our data consist of 2 samples which are organized in n pairs: We reduce the two samples to only one by calculating the difference between the two observations for each pair.

For example, think of Sample 1 as “before” and Sample 2 as “after”. We can find the difference between the before and after results for each participant, which gives us only one sample, namely “before – after”. We label this difference as “d” in the illustration below. The paired t-test is based on this one sample of n differences, and it uses those differences as data for a one-sample t-test on a single mean — the mean of the differences.

This is the general idea behind the paired t-test; it is nothing more than a regular one-sample t-test for the mean of the differences!

## Test Procedure for Paired T-Test

We will now go through the 4-step process of the paired t-test.

• Step 1: State the hypotheses

Recall that in the t-test for a single mean our null hypothesis was: Ho: μ = μ0 and the alternative was one of Ha: μ < μ0 or μ > μ0 or μ ≠ μ0. Since the paired t-test is a special case of the one-sample t-test, the hypotheses are the same except that:

Instead of simply μ we use the notation μd to denote that the parameter of interest is the mean of the differences.

In this course our null value μ0 is always 0. In other words, going back to our original paired samples our null hypothesis claims that that there is no difference between the two means. (Technically, it does not have to be zero if you are interested in a more specific difference – for example, you might be interested in showing that there is a reduction in blood pressure of more than 10 points but we will not specifically look at such situations).

Therefore, in the paired t-test: The null hypothesis is always:

Ho: μd = 0
(There IS NO association between the categorical explanatory variable and the quantitative response variable)

We will focus on the two-sided alternative hypothesis of the form:

Ha: μd ≠ 0
(There IS AN association between the categorical explanatory variable and the quantitative response variable)

Some students find it helpful to know that it turns out that μd = μ1 – μ2 (in other words, the difference between the means is the same as the mean of the differences). You may find it easier to first think about the hypotheses in terms of μ1 – μ2 and then represent it in terms of μd.

• Step 2: Obtain data, check conditions, and summarize data

The paired t-test, as a special case of a one-sample t-test, can be safely used as long as:

The sample of differences is random (or at least can be considered random in context).

The distribution of the differences in the population should vary normally if you have small samples. If the sample size is large, it is safe to use the paired t-test regardless of whether the differences vary normally or not. This condition is satisfied in the three situations marked by a green check mark in the table below.

Note: normality is checked by looking at the histogram of differences, and as long as no clear violation of normality (such as extreme skewness and/or outliers) is apparent, the normality assumption is reasonable. Assuming that we can safely use the paired t-test, the data are summarized by a test statistic: where This test statistic measures (in standard errors) how far our data are (represented by the sample mean of the differences) from the null hypothesis (represented by the null value, 0).

Notice this test statistic has the same general form as those discussed earlier: • Step 3: Find the p-value of the test by using the test statistic as follows

As a special case of the one-sample t-test, the null distribution of the paired t-test statistic is a t distribution (with n – 1 degrees of freedom), which is the distribution under which the p-values are calculated. We will use software to find the p-value for us.

• Step 4: Conclusion

As usual, we draw our conclusion based on the p-value. Be sure to write your conclusions in context by specifying your current variables and/or precisely describing the population mean difference in terms of the current variables.

In particular, if a cutoff probability, α (significance level), is specified, we reject Ho if the p-value is less than α. Otherwise, we fail to reject Ho.

If the p-value is small, there is a statistically significant difference between what was observed in the sample and what was claimed in Ho, so we reject Ho.

Conclusion: There is enough evidence that the categorical explanatory variable is associated with the quantitative response variable. More specifically, there is enough evidence that the population mean difference is not equal to zero.

Remember: a small p-value tells us that there is very little chance of getting data like those observed (or even more extreme) if the null hypothesis were true. Therefore, a small p-value indicates that we should reject the null hypothesis.

If the p-value is not small, we do not have enough statistical evidence to reject Ho.

Conclusion: There is NOT enough evidence that the categorical explanatory variable is associated with the quantitative response variable. More specifically, there is NOT enough evidence that the population mean difference is not equal to zero.

Notice how much better the first sentence sounds! It can get difficult to correctly phrase these conclusions in terms of the mean difference without confusing double negatives.

LO 4.40: Based upon the output for a paired t-test, correctly interpret in context the appropriate confidence interval for the population mean-difference.

As in previous methods, we can follow-up with a confidence interval for the mean difference, μd and interpret this interval in the context of the problem.

Interpretation: We are 95% confident that the population mean difference (described in context) is between (lower bound) and (upper bound).

Confidence intervals can also be used to determine whether or not to reject the null hypothesis of the test based upon whether or not the null value of zero falls outside the interval or inside.

If the null value, 0, falls outside the confidence interval, Ho is rejected. (Zero is NOT a plausible value based upon the confidence interval)

If the null value, 0, falls inside the confidence interval, Ho is not rejected. (Zero IS a plausible value based upon the confidence interval)

NOTE: Be careful to choose the correct confidence interval about the population mean difference and not the individual confidence intervals for the means in the groups themselves.

Now let’s look at an example.

## EXAMPLE: Drinking and Driving

Note: In some of the videos presented in the course materials, we do conduct the one-sided test for this data instead of the two-sided test we conduct below. In Unit 4B we are going to restrict our attention to two-sided tests supplemented by confidence intervals as needed to provide more information about the effect of interest.

Drunk driving is one of the main causes of car accidents. Interviews with drunk drivers who were involved in accidents and survived revealed that one of the main problems is that drivers do not realize that they are impaired, thinking “I only had 1-2 drinks … I am OK to drive.”

A sample of 20 drivers was chosen, and their reaction times in an obstacle course were measured before and after drinking two beers. The purpose of this study was to check whether drivers are impaired after drinking two beers. Here is a figure summarizing this study: • Note that the categorical explanatory variable here is “drinking 2 beers (Yes/No)”, and the quantitative response variable is the reaction time.
• By using the matched pairs design in this study (i.e., by measuring each driver twice), the researchers isolated the effect of the two beers on the drivers and eliminated any other confounding factors that might influence the reaction times (such as the driver’s experience, age, etc.).
• For each driver, the two measurements are the total reaction time before drinking two beers, and after. You can see the data by following the links in Step 2 below.

Since the measurements are paired, we can easily reduce the raw data to a set of differences and conduct a one-sample t-test. Here are some of the results for this data: Step 1: State the hypotheses

We define μ= the population mean difference in reaction times (Before – After).

As we mentioned, the null hypothesis is:

• Ho: μd = 0 (indicating that the population of the differences are centered at a number that IS ZERO)

The null hypothesis claims that the differences in reaction times are centered at (or around) 0, indicating that drinking two beers has no real impact on reaction times. In other words, drivers are not impaired after drinking two beers.

Although we really want to know whether their reaction times are longer after the two beers, we will still focus on conducting two-sided hypothesis tests. We will be able to address whether the reaction times are longer after two beers when we look at the confidence interval.

Therefore, we will use the two-sided alternative:

• Ha: μd  ≠ 0 (indicating that the population of the differences are centered at a number that is NOT ZERO)

Step 2: Obtain data, check conditions, and summarize data

Let’s first check whether we can safely proceed with the paired t-test, by checking the two conditions.

• The sample of drivers was chosen at random.
• The sample size is not large (n = 20), so in order to proceed, we need to look at the histogram or QQ-plot of the differences and make sure there is no evidence that the normality assumption is not met. We can see from the histogram above that there is no evidence of violation of the normality assumption (on the contrary, the histogram looks quite normal).

Also note that the vast majority of the differences are negative (i.e., the total reaction times for most of the drivers are larger after the two beers), suggesting that the data provide evidence against the null hypothesis.

The question (which the p-value will answer) is whether these data provide strong enough evidence or not against the null hypothesis. We can safely proceed to calculate the test statistic (which in practice we leave to the software to calculate for us).

Test Statistic: We will use software to calculate the test statistic which is t = -2.58.

• Recall: This indicates that the data (represented by the sample mean of the differences) are 2.58 standard errors below the null hypothesis (represented by the null value, 0).

Step 3: Find the p-value of the test by using the test statistic as follows

As a special case of the one-sample t-test, the null distribution of the paired t-test statistic is a t distribution (with n – 1 degrees of freedom), which is the distribution under which the p-values are calculated.

We will let the software find the p-value for us, and in this case, gives us a p-value of 0.0183 (SAS) or 0.018 (SPSS).

The small p-value tells us that there is very little chance of getting data like those observed (or even more extreme) if the null hypothesis were true. More specifically, there is less than a 2% chance (0.018=1.8%) of obtaining a test statistic of -2.58 (or lower) or 2.58 (or higher), assuming that 2 beers have no impact on reaction times.

Step 4: Conclusion

In our example, the p-value is 0.018, indicating that the data provide enough evidence to reject Ho.

• Conclusion: There is enough evidence that drinking two beers is associated with differences in reaction times of drivers.

Follow-up Confidence Interval:

As a follow-up to this conclusion, we quantify the effect that two beers have on the driver, using the 95% confidence interval for μd.

Using statistical software, we find that the 95% confidence interval for μd, the mean of the differences (before – after), is roughly (-0.9, -0.1).

Note: Since the differences were calculated before-after, longer reaction times after the beers would translate into negative differences.

• Interpretation: We are 95% confident that after drinking two beers, the true mean increase in total reaction time of drivers is between 0.1 and 0.9 of a second.
• Thus, the results of the study do indicate impairment of drivers (longer reaction times) not the other way around!

Since the confidence interval does not contain the null value of zero, we can use it to decide to reject the null hypothesis. Zero is not a plausible value of the population mean difference based upon the confidence interval. Notice that using this method is not always practical as often we still need to provide the p-value in clinical research. (Note: this is NOT the interpretation of the confidence interval but a method of using the confidence interval to conduct a hypothesis test.)

Practical Significance:

We should definitely ask ourselves if this is practically significant and I would argue that it is.

• Although a difference in the mean reaction time of 0.1 second might not be too bad, a difference of 0.9 seconds is likely a problem.
• Even at a difference in reaction time of 0.4 seconds, if you were traveling 60 miles per hour, this would translate into a distance traveled of around 35 feet.

## Many Students Wonder: One-sided vs. Two-sided P-values

In the output, we are generally provided the two-sided p-value. We must be very careful when converting this to a one-sided p-value (if this is not provided by the software)

• IF the data are in the direction of our alternative hypothesis then we can simply take half of the two-sided p-value.
• IF, however, the data are NOT in the direction of the alternative, the correct p-value is VERY LARGE and is the complement of (one minus) half the two-sided p-value.

The “driving after having 2 beers” example is a case in which observations are paired by subject. In other words, both samples have the same subject, so that each subject is measured twice. Typically, as in our example, one of the measurements occurs before a treatment/intervention (2 beers in our case), and the other measurement after the treatment/intervention.

Our next example is another typical type of study where the matched pairs design is used—it is a study involving twins.

## EXAMPLE: IQ Scores

Researchers have long been interested in the extent to which intelligence, as measured by IQ score, is affected by “nurture” as opposed to “nature”: that is, are people’s IQ scores mainly a result of their upbringing and environment, or are they mainly an inherited trait?

A study was designed to measure the effect of home environment on intelligence, or more specifically, the study was designed to address the question: “Are there statistically significant differences in IQ scores between people who were raised by their birth parents, and those who were raised by someone else?”

In order to be able to answer this question, the researchers needed to get two groups of subjects (one from the population of people who were raised by their birth parents, and one from the population of people who were raised by someone else) who are as similar as possible in all other respects. In particular, since genetic differences may also affect intelligence, the researchers wanted to control for this confounding factor.

We know from our discussion on study design (in the Producing Data unit of the course) that one way to (at least theoretically) control for all confounding factors is randomization—randomizing subjects to the different treatment groups. In this case, however, this is not possible. This is an observational study; you cannot randomize children to either be raised by their birth parents or to be raised by someone else. How else can we eliminate the genetics factor? We can conduct a “twin study.”

Because identical twins are genetically the same, a good design for obtaining information to answer this question would be to compare IQ scores for identical twins, one of whom is raised by birth parents and the other by someone else. Such a design (matched pairs) is an excellent way of making a comparison between individuals who only differ with respect to the explanatory variable of interest (upbringing) but are as alike as they can possibly be in all other important aspects (inborn intelligence). Identical twins raised apart were studied by Susan Farber, who published her studies in the book “Identical Twins Reared Apart” (1981, Basic Books).

In this problem, we are going to use the data that appear in Farber’s book in table E6, of the IQ scores of 32 pairs of identical twins who were reared apart. Here are the important things to note in the figure:

• We are essentially comparing the mean IQ scores in two populations that are defined by our (two-valued categorical) explanatory variableupbringing (X), whose two values are: raised by birth parents, raised by someone else.
• This is a matched pairs design (as opposed to a two independent samples design), since each observation in one sample is linked (matched) with an observation in the second sample. The observations are paired by twins.

Each of the 32 rows represents one pair of twins. Keeping the notation that we used above, twin 1 is the twin that was raised by his/her birth parents, and twin 2 is the twin that was raised by someone else. Let’s carry out the analysis.

Step 1: State the hypotheses

Recall that in matched pairs, we reduce the data from two samples to one sample of differences: The hypotheses are stated in terms of the mean of the difference where, μd = population mean difference in IQ scores (Birth Parents – Someone Else):

• Ho: μd = 0 (indicating that the population of the differences are centered at a number that IS ZERO)
• Ha: μd  ≠ 0 (indicating that the population of the differences are centered at a number that is NOT ZERO)

Step 2: Obtain data, check conditions, and summarize data

Is it safe to use the paired t-test in this case?

• Clearly, the samples of twins are not random samples from the two populations. However, in this context, they can be considered as random, assuming that there is nothing special about the IQ of a person just because he/she has an identical twin.
• The sample size here is n = 32. Even though it’s the case that if we use the n > 30 rule of thumb our sample can be considered large, it is sort of a borderline case, so just to be on the safe side, we should look at the histogram of the differences just to make sure that we do not see anything extreme. (Comment: Looking at the histogram of differences in every case is useful even if the sample is very large, just in order to get a sense of the data. Recall: “Always look at the data.”) The data don’t reveal anything that we should be worried about (like very extreme skewness or outliers), so we can safely proceed. Looking at the histogram, we note that most of the differences are negative, indicating that in most of the 32 pairs of twins, twin 2 (raised by someone else) has a higher IQ.

From this point we rely on statistical software, and find that:

• t-value = -1.85
• p-value = 0.074

Our test statistic is -1.85.

Our data (represented by the sample mean of the differences) are 1.85 standard errors below the null hypothesis (represented by the null value 0).

Step 3: Find the p-value of the test by using the test statistic as follows

The p-value is 0.074, indicating that there is a 7.4% chance of obtaining data like those observed (or even more extreme) assuming that Ho is true (i.e., assuming that there are no differences in IQ scores between people who were raised by their natural parents and those who weren’t).

Step 4: Conclusion

Using the conventional significance level (cut-off probability) of .05, our p-value is not small enough, and we therefore cannot reject Ho.

• Conclusion: Our data do not provide enough evidence to conclude that whether a person was raised by his/her natural parents has an impact on the person’s intelligence (as measured by IQ scores).

Confidence Interval:

The 95% confidence interval for the population mean difference is (-6.11322, 0.30072).

Interpretation:

• We are 95% confident that the population mean IQ for twins raised by someone else is between 6.11 greater to 0.3 lower than that for twins raised by their birth parents.
• OR … We are 95% confident that the population mean IQ for twins raised by their birth parents is between 6.11 lower to 0.3 greater than that for twins raised by someone else.
• Note: The order of the groups as well as the numbers provided in the interval can vary, what is important is to get the “lower” and “greater” with the correct value based upon the group order being used.
• Here we used Birth Parents – Someone Else and thus a positive number for our population mean difference indicates that birth parents group is higher (someone else gorup is lower) and a negative number indicates the someone else group is higher (birth parents group is lower).

This confidence interval does contain zero and thus results in the same conclusion to the hypothesis test. Zero IS a plausible value of the population mean difference and thus we cannot reject the null hypothesis.

Practical Significance:

• The confidence interval does “lean” towards the difference being negative, indicating that in most of the 32 pairs of twins, twin 2 (raised by someone else) has a higher IQ. The sample mean difference is -2.9 so we would need to consider whether this value and range of plausible values have any real practical significance.
• In this case, I don’t think I would consider a difference in IQ score of around 3 points to be very important in practice (but others could reasonably disagree).

It is very important to pay attention to whether the two-sample t-test or the paired t-test is appropriate. In other words, being aware of the study design is extremely important. Consider our example, if we had not “caught” that this is a matched pairs design, and had analyzed the data as if the two samples were independent using the two-sample t-test, we would have obtained a p-value of 0.114.

Note that using this (wrong) method to analyze the data, and a significance level of 0.05, we would conclude that the data do not provide enough evidence for us to conclude that reaction times differed after drinking two beers. This is an example of how using the wrong statistical method can lead you to wrong conclusions, which in this context can have very serious implications.

• The 95% confidence interval for μ can be used here in the same way as for proportions to conduct the two-sided test (checking whether the null value falls inside or outside the confidence interval) or following a t-test where Ho was rejected to get insight into the value of μ.
• In most situations in practice we use two-sided hypothesis tests, followed by confidence intervals to gain more insight.

Now try a complete example for yourself.

Here are two other datasets with paired samples.

## Non-Parametric Alternatives for Matched Pair Data

LO 5.1: For a data analysis situation involving two variables, determine the appropriate alternative (non-parametric) method when assumptions of our standard methods are not met.

The statistical tests we have previously discussed (and many we will discuss) require assumptions about the distribution in the population or about the requirements to use a certain approximation as the sampling distribution. These methods are called parametric.

When these assumptions are not valid, alternative methods often exist to test similar hypotheses. Tests which require only minimal distributional assumptions, if any, are called non-parametric or distribution-free tests.

At the end of this section we will provide some details (see Details for Non-Parametric Alternatives), for now we simply want to mention that there are two common non-parametric alternatives to the paired t-test. They are:

• Sign Test
• Wilcoxon Signed-Rank Test

The fact that both of these tests have the word “sign” in them is not a coincidence – it is due to the fact that we will be interested in whether the differences have a positive sign or a negative sign – and the fact that this word appears in both of these tests can help you to remember that they correspond to paired methods where we are often interested in whether there was an increase (positive sign) or a decrease (negative sign).

## Let’s Summarize

• The paired t-test is used to compare two population means when the two samples (drawn from the two populations) are dependent in the sense that every observation in one sample can be linked to an observation in the other sample. Such a design is called “matched pairs.”
• The most common case in which the matched pairs design is used is when the same subjects are measured twice, usually before and then after some kind of treatment and/or intervention. Another classic case are studies involving twins.
• In the background, we have a two-valued categorical explanatory whose categories define the two populations we are comparing and whose effect on the response variable we are trying to assess.
• The idea behind the paired t-test is to reduce the data from two samples to just one sample of the differences, and use these observed differences as data for inference about a single mean — the mean of the differences, μd.
• The paired t-test is therefore simply a one-sample t-test for the mean of the differences μd, where the null value is 0.
• Once we verify that we can safely proceed with the paired t-test, we use software output to carry it out.
• A 95% confidence interval for μd can be very insightful after a test has rejected the null hypothesis, and can also be used for testing in the two-sided case.
• Two non-parametric alternatives to the paired t-test are the sign test and the Wilcoxon signedrank test. (See Details for Non-Parametric Alternatives.)