In the previous unit, we learned to perform inference for a **single** categorical or quantitative **variable** in the form of **point estimation**, **confidence** **intervals** or **hypothesis** **testing**.

The inference was actually

- about the **population proportion** (when the variable of interest was **categorical**), and
- about the **population mean** (when the variable of interest was **quantitative**).

Our next (and final) goal for this course is to perform **inference** about **relationships** between **two** **variables** in a population, based on an observed relationship between variables in a sample. Here is what the process looks like:

We are interested in studying whether a **relationship** exists **between** the **variables** X and Y **in a population of interest**. We choose a random sample and collect data on both variables from the subjects.

Our goal is to determine whether these data provide strong enough evidence for us to **generalize** the **observed** **relationship** in the **sample** and **conclude** (with some acceptable and agreed-upon level of uncertainty) that a **relationship** between X and Y **exists** in the entire **population**.

The primary form of inference that we will use in this unit is **hypothesis testing** but we will discuss **confidence** **intervals** both to estimate unknown parameters of interest involving two variables and as an alternative way of determining the conclusion to our hypothesis test.

Conceptually, across all the inferential methods that we will learn, we’ll test some form of:

**Ho: There is no relationship between X and Y**

**Ha: There is a relationship between X and Y**

(We will also discuss point and interval estimation, but our discussion about these forms of inference will be framed around the test.)

Recall that when we discussed examining the relationship between two variables in the **Exploratory Data Analysis** unit, our discussion was framed around the **role-type classification**. This part of the course will be structured exactly in the same way.

In other words, we will look at hypothesis testing in the 3 sections corresponding to cases C→Q, C→C, and Q→Q in the table below.

Recall that case Q→C is not specifically addressed in this course other than that we may investigate the association between these variables using the same methods as case C→Q.

It is also important to remember what we learned about lurking variables and causation.

- If our explanatory variable was part of a **well-designed experiment**, then it may be **possible** for us to claim a **causal effect**.

- But if it was based upon an **observational study**, we must be **cautious** to **imply only** a **relationship** or **association** between the two variables, **not** a direct **causal link** between the explanatory and response variables.

Unlike the previous part of the course on Inference for One Variable, where we discussed in some detail the theory behind the machinery of the test (such as the null distribution of the test statistic, under which the p-values are calculated), in the inferential procedures that we will introduce in Inference for Relationships, we will discuss much less of that kind of detail.

The principles are the same, but the details behind the null distribution of the test statistic (under which the p-value is calculated) become more complicated and require knowledge of theoretical results that are beyond the scope of this course.

Instead, **within each of the inferential methods we will focus on:**

- When the inferential method is appropriate for use.

- Under what conditions the procedure can safely be used.

- The conceptual idea behind the test (as it is usually captured by the test statistic).

- How to use software to carry out the procedure in order to get the p-value of the test.

- Interpreting the results in the context of the problem.

- Also, we will continue to introduce each test according to the four-step process of hypothesis testing.

From this point forward, we will generally focus on **TWO-SIDED tests** and **supplement** them with **confidence intervals** for the **effect of interest** to give further information.

Using two-sided tests is **standard practice in clinical research** EVEN when there is a direction of interest for the research hypothesis, such as the desire to prove a new treatment is better than the current treatment.

Here are a few comments:

- Although fewer participants are required for one-sided tests, we are **unable to draw appropriate conclusions** if the study demonstrates the new treatment is worse. (See “Defending the Rationale for the Two-Tailed Test in Clinical Research” for a detailed discussion of this and other issues.)

- Using a one-sided test for the purpose of gaining statistical significance is **NOT A VALID APPROACH**. (See “What are the differences between one-tailed and two-tailed tests?” for more on this as well as a general overview of both types of tests.)

We are now ready to start with Case C→Q.

**Related SAS Tutorials**

- 9A – (3:53) Basic Scatterplots
- 9B – (2:29) Grouped Scatterplots
- 9C – (3:46) Pearson’s Correlation Coefficient
- 9D – (3:00) Simple Linear Regression – EDA

**Related SPSS Tutorials**

- 9A – (2:38) Basic Scatterplots
- 9B – (2:54) Grouped Scatterplots
- 9C – (3:35) Pearson’s Correlation Coefficient
- 9D – (2:53) Simple Linear Regression – EDA

Here again is the role-type classification table for framing our discussion about the relationship between two variables:

Before reading further, try this interactive online data analysis applet.

We are done with cases C→Q and C→C, and now we will move on to case Q→Q, where we examine the relationship between two quantitative variables.

In this section we will discuss scatterplots, which are the appropriate visual display in this case, along with numerical methods for linear relationships, including correlation and linear regression.

**Related SAS Tutorials**

- 6A – (3:07) Two-Way (Contingency) Tables – EDA

**Related SPSS Tutorials**

- 6A – (7:57) Two-Way (Contingency) Tables – EDA

Recall the role-type classification table for framing our discussion about the relationship between two variables:

We are done with case C→Q, and will now move on to case C→C, where we examine the relationship between two categorical variables.

Earlier in the course (when we discussed the distribution of a **single** categorical variable), we examined the data obtained when a random sample of 1,200 U.S. college students was asked about their body image (underweight, overweight, or about right). We are now returning to this example to address the following question:

If we had separated our sample of 1,200 U.S. college students by gender and looked at **males and females separately**, would we have found a similar distribution across body-image categories? More specifically, are men and women just as likely to think their weight is about right? Among those students who do not think their weight is about right, is there a difference between the genders in feelings about body image?

Answering these questions requires us to **examine the relationship between two categorical variables**, gender and body image. Because the question of interest is whether there is a gender effect on body image,

- the **explanatory** variable is **gender**, and
- the **response** variable is **body image**.

Here is what the raw data look like when we include the gender of each student:

Once again, the raw data are a long list of 1,200 genders and responses, and thus not very useful in that form.

To start our exploration of how body image is related to gender, we need an informative display that summarizes the data. In order to summarize the relationship between two categorical variables, we create a display called a **two-way table** or **contingency table**.

Here is the two-way table for our example:

The table has the possible genders in the rows, and the possible responses regarding body image in the columns. At each intersection between row and column, we put the counts for how many times that combination of gender and body image occurred in the data. We sum across the rows to fill in the Total column, and we sum across the columns to fill in the Total row.
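The construction described above (count each row/column combination, then sum to fill the Total margins) can be sketched in a few lines of Python. This is only an illustration: the eight records below are made up for the example and are not the survey data, and `two_way_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical raw records: (explanatory value, response value).
# These eight rows are invented for illustration; they are NOT the survey data.
records = [
    ("female", "about right"), ("female", "overweight"),
    ("female", "about right"), ("female", "underweight"),
    ("male", "about right"), ("male", "underweight"),
    ("male", "about right"), ("male", "overweight"),
]

def two_way_table(pairs):
    """Count each (row, column) combination and add the Total margins."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = {}
    for r in rows:
        # Cell counts for this row (a missing combination counts as 0).
        table[r] = {c: counts[(r, c)] for c in cols}
        # Sum across the row to fill in the Total column.
        table[r]["Total"] = sum(table[r][c] for c in cols)
    # Sum down each column to fill in the Total row.
    table["Total"] = {c: sum(table[r][c] for r in rows) for c in cols}
    table["Total"]["Total"] = len(pairs)
    return table

table = two_way_table(records)
for label, row in table.items():
    print(label, row)
```

With the real data, the same function would take the 1,200 (gender, body image) pairs as input.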

Complete the following activities related to this data.

**Comments:**

Note that from the way the two-way table is constructed, the Total row or column is a summary of one of the two categorical variables, ignoring the other. In our example:

- The Total row gives the summary of the categorical variable body image. (These are the same counts we found earlier in the course when we looked at the single categorical variable body image, and did not consider gender.)

- The Total column gives the summary of the categorical variable gender.

So far we have organized the raw data in a much more informative display — the two-way table:

Remember, though, that our primary goal is to explore how body image is related to gender. Exploring the relationship between two categorical variables (in this case body image and gender) amounts to comparing the distributions of the response variable (in this case body image) across the different values of the explanatory variable (in this case males and females):

Note that it doesn’t make sense to compare raw counts, because there are more females than males overall. So for example, it is not very informative to say “there are 560 females who responded ‘about right’ compared to only 295 males,” since the 560 females are out of a total of 760, and the 295 males are out of a total of only 440.

We need to supplement our display, the two-way table, with some numerical measures that will allow us to compare the distributions. These numerical measures are found by simply **converting the counts to percents within (or restricted to) each value of the explanatory variable separately.**

In our example: We look at each gender separately, and convert the counts to percents **within that gender.** Let’s start with females:

Note that each count is converted to percents by dividing by the total number of females, 760. These numerical measures are called **conditional percents**, since we find them by “conditioning” on one of the genders.
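As a small worked illustration, here is the conversion for the one category whose counts are quoted in the text (“about right”: 560 of 760 females and 295 of 440 males). The function name `conditional_percent` is just a label chosen for this sketch.

```python
def conditional_percent(count, group_total):
    """Percent of one group (one value of the explanatory variable) in a category."""
    return 100 * count / group_total

# Counts quoted in the text: "about right" responses and the gender totals.
female_about_right = conditional_percent(560, 760)
male_about_right = conditional_percent(295, 440)

print(f"Females 'about right': {female_about_right:.1f}%")
print(f"Males 'about right':   {male_about_right:.1f}%")
```

Unlike the raw counts 560 and 295, these two percents are on the same scale, so they can be compared directly.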

Now complete the following activities to calculate the row percentages for males.

**Comments:**

- In our example, we chose to organize the data with the explanatory variable gender in rows and the response variable body image in columns, and thus our conditional percents were **row percents**, calculated within each row separately. Similarly, if the explanatory variable happens to sit in columns and the response variable in rows, our conditional percents will be **column percents**, calculated within each column separately. For an example, see the “Did I Get This?” exercises below.

- Another way to visualize the conditional percents, instead of a table, is the **double bar chart**. This display is quite common in newspapers.

Now that we have summarized the relationship between the categorical variables gender and body image, let’s go back and interpret the results in the context of the questions that we posed.

For additional practice complete the following activities.

- The relationship between two categorical variables is summarized using:
  - **Data display:** two-way table, supplemented by
  - **Numerical measures:** conditional percentages.

- Conditional percentages are calculated for each value of the explanatory variable separately. They can be row percents, if the explanatory variable “sits” in the rows, or column percents, if the explanatory variable “sits” in the columns.
- When we try to understand the relationship between two categorical variables, we compare the distributions of the response variable for values of the explanatory variable. In particular, we look at how the pattern of conditional percentages differs between the values of the explanatory variable.

**Related SAS Tutorials**

- 7A (2:32) Numeric Summaries by Groups
- 7B (3:03) Side-By-Side Boxplots

**Related SPSS Tutorials**

- 7A (3:29) Numeric Summaries by Groups
- 7B (1:59) Side-By-Side Boxplots

Recall the role-type classification table for framing our discussion about the relationship between two variables:

We are now ready to start with Case C→Q, exploring the relationship between two variables where the explanatory variable is categorical, and the response variable is quantitative. As you’ll discover, exploring relationships of this type is something we’ve already discussed in this course, but we didn’t frame the discussion this way.

**Background:** People who are concerned about their health may prefer hot dogs that are low in calories. A study was conducted by a concerned health group in which 54 major hot dog brands were examined, and their calorie contents recorded. In addition, each brand was classified by type: beef, poultry, and meat (mostly pork and beef, but up to 15% poultry meat). The purpose of the study was to examine whether the **number of calories** a hot dog has is related to (or affected by) its **type**. (Reference: Moore, David S., and George P. McCabe (1989). Introduction to the Practice of Statistics. Original source: Consumer Reports, June 1986, pp. 366-367.)

Answering this question requires us to examine the relationship between the categorical variable Type and the quantitative variable Calories. Because the question of interest is whether the type of hot dog affects calorie content,

- the **explanatory** variable is **Type**, and
- the **response** variable is **Calories**.

Here is what the raw data look like:

The raw data are a list of types and calorie contents, and are not very useful in that form. To explore how the number of calories is related to the type of hot dog, we need an informative visual display of the data that will compare the three types of hot dogs with respect to their calorie content.

The visual display that we’ll use is **side-by-side boxplots** (which we’ve seen before). The side-by-side boxplots will allow us to **compare the distribution** of calorie counts within each category of the explanatory variable, hot dog type:

As before, we supplement the side-by-side boxplots with the descriptive statistics of the calorie content (response) for each type of hot dog separately (i.e., for each level of the explanatory variable separately):

Let’s summarize the results we obtained and interpret them in the context of the question we posed:

| Statistic | Beef | Meat | Poultry |
|---|---|---|---|
| Min | 111 | 107 | 86 |
| Q1 | 139.5 | 138.5 | 100.5 |
| Median | 152.5 | 153 | 113 |
| Q3 | 179.75 | 180.5 | 142.5 |
| Max | 190 | 195 | 152 |

By examining the three side-by-side boxplots and the numerical measures, we see at once that poultry hot dogs, as a group, contain fewer calories than those made of beef or meat. The median number of calories in poultry hot dogs (113) is less than the median (and even the first quartile) of either of the other two distributions (medians 152.5 and 153). The spread of the three distributions is about the same if the IQR is considered (all slightly above 40), but the (full) ranges vary slightly more (beef: 79, meat: 88, poultry: 66). The general recommendation to the health-conscious consumer is to eat poultry hot dogs. It should be noted, though, that since each of the three types of hot dogs shows quite a large spread among brands, simply buying a poultry hot dog does not guarantee a low-calorie food.
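The spread measures used in this comparison can be recomputed directly from the five-number summaries in the table. The sketch below simply applies IQR = Q3 − Q1 and range = max − min to each hot dog type; the helper name `spread` is chosen for this illustration only.

```python
# Five-number summaries copied from the table: (min, Q1, median, Q3, max).
summaries = {
    "Beef": (111, 139.5, 152.5, 179.75, 190),
    "Meat": (107, 138.5, 153, 180.5, 195),
    "Poultry": (86, 100.5, 113, 142.5, 152),
}

def spread(five_numbers):
    """Return (IQR, range) for a five-number summary."""
    minimum, q1, _median, q3, maximum = five_numbers
    return q3 - q1, maximum - minimum

spreads = {kind: spread(s) for kind, s in summaries.items()}
for kind, (iqr, full_range) in spreads.items():
    print(f"{kind}: IQR = {iqr}, range = {full_range}")
```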

What we learn from this example is that when exploring the relationship between a categorical explanatory variable and a quantitative response (Case C→Q), we essentially **compare the distributions of the quantitative response for each category of the explanatory variable** using side-by-side boxplots supplemented by descriptive statistics. Recall that we have actually done this before when we talked about the boxplot and argued that boxplots are most useful when presented side by side for comparing distributions of two or more groups. This is exactly what we are doing here!

Here is another example:

**Background:** The Survey of Study Habits and Attitudes (SSHA) is a psychological test designed to measure the motivation, study habits, and attitudes toward learning of college students. Is there a relationship between **gender** and **SSHA** scores? In other words, is there a “gender effect” on SSHA scores? Data were collected from 40 randomly selected college students, and here is what the raw data look like:

(Reference: Moore and McCabe. (2003). Introduction to the Practice of Statistics)

Side-by-side boxplots supplemented by descriptive statistics allow us to compare the distribution of SSHA scores within each category of the explanatory variable—gender:

| Statistic | Female | Male |
|---|---|---|
| Min | 103 | 70 |
| Q1 | 128.75 | 95 |
| Median | 153 | 114.5 |
| Q3 | 163.75 | 144.5 |
| Max | 200 | 187 |

Let’s summarize our results and interpret them:

By examining the side-by-side boxplots and the numerical measures, we see that in general females perform better on the SSHA than males. The median SSHA score of females is higher than the median score for males (153 vs. 114.5), and in fact, it is even higher than the third quartile of the males’ distribution (144.5). On the other hand, the males’ scores display more variability, both in terms of IQR (49.5 vs. 35) and in terms of the full range of scores (117 vs. 97). Based on these results, it seems that there is a gender effect on SSHA score. It should be noted, though, that our sample consists of only 20 males and 20 females, so we should be cautious about making any kind of generalizations beyond this study. One interesting question that comes to mind is, “Why did we observe this relationship between gender and SSHA scores?” In other words, is there maybe an explanation for why females score higher on the SSHA? Let’s leave it to the psychologists to try and answer that one.
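The comparisons in this interpretation (center and spread) can be checked against the five-number summaries in the SSHA table. The following is a minimal sketch; the variable and function names are chosen only for this illustration.

```python
# Five-number summaries from the SSHA table: (min, Q1, median, Q3, max).
female = (103, 128.75, 153, 163.75, 200)
male = (70, 95, 114.5, 144.5, 187)

def iqr(s):
    """Interquartile range: Q3 minus Q1."""
    return s[3] - s[1]

def full_range(s):
    """Full range: max minus min."""
    return s[4] - s[0]

# Center: is the female median above even the male third quartile?
female_median_above_male_q3 = female[2] > male[3]

print("IQR   (male vs. female):", iqr(male), "vs.", iqr(female))
print("Range (male vs. female):", full_range(male), "vs.", full_range(female))
print("Female median above male Q3:", female_median_above_male_q3)
```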

- The relationship between a categorical explanatory variable and a quantitative response variable is summarized using:
  - **Visual display:** side-by-side boxplots
  - **Numerical measures:** descriptive statistics used for one quantitative variable calculated in each group

- Exploring the relationship between a categorical explanatory variable and a quantitative response variable amounts to comparing the distributions of the quantitative response for each category of the explanatory variable. In particular, we look at how the distribution of the response variable differs between the values of the explanatory variable.

While it is fundamentally important to know how to describe the distribution of a single variable, most studies pose research questions that involve exploring the relationship between **two** (or more) variables. These research questions are investigated using a sample from the population of interest.

Here are a few examples of such research questions with the two variables highlighted:

- Is there a relationship between **gender** and **test scores** on a particular standardized test? Other ways of phrasing the same research question:
  - Is performance on the test related to gender?
  - Is there a gender effect on test scores?
  - Are there differences in test scores between males and females?

- How is the **number of calories** in a hot dog related to (or affected by) the **type of hot dog** (beef, meat or poultry)? In other words, are there differences in the number of calories among the three types of hot dogs?

- Is there a relationship between the **type of light** a baby sleeps with (no light, night-light, lamp) and whether or not the child develops **nearsightedness**?

- Are the **smoking habits** of a person (yes, no) related to the person’s **gender**?

- How well can we predict a student’s freshman year **GPA** from his/her **SAT score**?

- What is the relationship between driver’s **age** and sign legibility **distance** (the maximum distance at which the driver can read a sign)?

- Is there a relationship between the **time** a person has practiced driving while having a learner’s permit, and **whether or not this person passed the driving test**?

- Can you predict a person’s **favorite type of music** (classical, rock, jazz) based on his/her **IQ level**?

In most studies involving two variables, each of the variables has a role. We distinguish between:

- the **response** variable: the outcome of the study; and
- the **explanatory** variable: the variable that claims to explain, predict or affect the response.

As we mentioned earlier, the variable we wish to predict is commonly called the **dependent variable**, the **outcome** variable, or the **response** variable. Any variable we are using to predict (or explain differences in) the outcome is commonly called an **explanatory variable**, an **independent variable**, a **predictor** variable, or a **covariate**.

**Comment:**

- Typically the **explanatory** variable is denoted by X, and the **response** variable by Y.

Now let’s go back to some of the examples and classify the two relevant variables according to their roles in the study:

Is there a relationship between **gender** and **test scores** on a particular standardized test? Other ways of phrasing the same research question:

- Is performance on the test related to gender?
- Is there a gender effect on test scores?
- Are there differences in test scores between males and females?

We want to explore whether the outcome of the study — the score on a test — is affected by the test-taker’s gender. Therefore:

**Gender** is the **explanatory** variable

**Test score** is the **response** variable

Is there a relationship between the **type of light** a baby sleeps with (no light, night-light, lamp) and whether or not the child develops **nearsightedness**?

In this study we explore whether the nearsightedness of a person can be explained by the type of light that person slept with as a baby. Therefore:

**Light type** is the **explanatory** variable

**Nearsightedness** is the **response** variable

How well can we predict a student’s freshman year **GPA** from his/her **SAT score**?

Here we are examining whether a student’s SAT score is a good predictor for the student’s GPA freshman year. Therefore:

**SAT score** is the **explanatory** variable

**GPA of freshman year** is the **response** variable

Is there a relationship between the **time** a person has practiced driving while having a learner’s permit, and **whether or not this person passed the driving test**?

Here we are examining whether a person’s outcome on the driving test (pass/fail) can be explained by the length of time this person has practiced driving prior to the test. Therefore:

**Time** is the **explanatory** variable

**Driving test outcome** is the **response** variable

Now, using the same reasoning, the following exercise will help you to classify the two variables in the other examples.

**Question:** Is the role classification of variables always clear? In other words, is it always clear which of the variables is the explanatory and which is the response?

**Answer:** No. There are studies in which the role classification is not really clear. This mainly happens in cases when both variables are categorical or both are quantitative. An example is a study that explores the relationship between students’ SAT Math and SAT Verbal scores. In cases like this, any classification choice would be fine (as long as it is consistent throughout the analysis).

If we further classify each of the two relevant variables according to **type** (categorical or quantitative), we get the following 4 possibilities for **“role-type classification”**:

- Categorical explanatory and quantitative response (Case C→Q)
- Categorical explanatory and categorical response (Case C→C)
- Quantitative explanatory and quantitative response (Case Q→Q)
- Quantitative explanatory and categorical response (Case Q→C)

This role-type classification can be summarized and easily visualized in the following table (note that the explanatory variable is always listed first):

This role-type classification serves as the infrastructure for this entire section. In each of the 4 cases, different statistical tools (displays and numerical measures) should be used in order to explore the relationship between the two variables.

This suggests the following important principle:

**PRINCIPLE:** When confronted with a research question that involves exploring the relationship between two variables, the first and most crucial step is to determine which of the 4 cases represents the data structure of the problem. In other words, the first step should be classifying the two relevant variables according to their role and type, and only then can we determine what statistical tools should be used to analyze them.
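The classification step can be expressed as a tiny lookup: once the role (explanatory or response) and the type (categorical or quantitative) of each variable are known, the case label follows mechanically. The sketch below is only an illustration; `role_type_case` is a hypothetical helper, and the four entries are the examples worked through in this section.

```python
# Map each variable type to the letter used in the case labels.
TYPE_LETTER = {"categorical": "C", "quantitative": "Q"}

def role_type_case(explanatory_type, response_type):
    """Return the case label, explanatory variable first (e.g. 'C→Q')."""
    try:
        x = TYPE_LETTER[explanatory_type]
        y = TYPE_LETTER[response_type]
    except KeyError as err:
        raise ValueError(f"unknown variable type: {err}") from None
    return f"{x}→{y}"

# (explanatory type, response type) for four examples from this section.
examples = {
    "gender and test score": ("categorical", "quantitative"),
    "light type and nearsightedness": ("categorical", "categorical"),
    "SAT score and freshman GPA": ("quantitative", "quantitative"),
    "practice time and driving test outcome": ("quantitative", "categorical"),
}

for name, (x_type, y_type) in examples.items():
    print(f"{name}: case {role_type_case(x_type, y_type)}")
```

The hard part in practice is deciding the roles and types from the research question; the lookup itself only records that decision.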

Now let’s go back to our 8 examples and determine which of the 4 cases represents the data structure of each:

Is there a relationship between **gender** and **test scores** on a particular standardized test? Other ways of phrasing the same research question:

- Is performance on the test related to gender?
- Is there a gender effect on test scores?
- Are there differences in test scores between males and females?

We want to explore whether the outcome of the study — the score on a test — is affected by the test-taker’s gender.

**Gender** is the **explanatory** variable and it is **categorical**.

**Test score** is the **response** variable and it is **quantitative**.

Therefore this is an example of **case C→Q**.

Is there a relationship between the **type of light** a baby sleeps with (no light, night-light, lamp) and whether or not the child develops **nearsightedness**?

In this study we explore whether the nearsightedness of a person can be explained by the type of light that person slept with as a baby.

**Light type** is the **explanatory** variable and it is **categorical**.

**Nearsightedness** is the **response** variable and it is **categorical**.

Therefore this is an example of **case C→C**.

How well can we predict a student’s freshman year **GPA** from his/her **SAT score**?

Here we are examining whether a student’s SAT score is a good predictor for the student’s GPA freshman year.

**SAT score** is the **explanatory** variable and it is **quantitative**.

**GPA of freshman year** is the **response** variable and it is **quantitative**.

Therefore this is an example of **case Q→Q**.

Is there a relationship between the **time** a person has practiced driving while having a learner’s permit, and **whether or not this person passed the driving test**?

Here we are examining whether a person’s outcome on the driving test (pass/fail) can be explained by the length of time this person has practiced driving prior to the test.

**Time** is the **explanatory** variable and it is **quantitative**.

**Driving test outcome** is the **response** variable and it is **categorical**.

Therefore this is an example of **case Q→C**.

Now you complete the rest…

The remainder of this section on exploring relationships will be guided by this role-type classification. In the next three parts we will elaborate on cases C→Q, C→C, and Q→Q. More specifically, we will learn the appropriate statistical tools (visual display and numerical measures) that will allow us to explore the relationship between the two variables in each of the cases. Case Q→C will **not** be discussed in this course, and is typically covered in more advanced courses. The section will conclude with a discussion on causal relationships.