This document is linked from Role-Type Classification.

- Examples of Variables
- Definition and Examples of Categorical and Quantitative Variables

- Definitions and Examples of Sub-Classifications
- Categorical into Nominal or Ordinal
- Quantitative into Discrete or Continuous

- A quick but complete review of the examples and definitions in Parts A and B
- An optional discussion on Time to Event data (survival data).
- An example dataset from a heart attack study. Its variables are classified as Categorical or Quantitative as additional practice.

This document is linked from Types of Variables.


Variables can be broadly classified into one of two **types**:

- Quantitative

- Categorical

Below we define these two main types of variables and provide further sub-classifications for each type.

**Categorical variables** take **category** or **label** values, and place an individual into one of several **groups**.

Categorical variables are often further classified as either:

**Nominal**, when there **is no natural ordering among the categories**.

Common examples would be gender, eye color, or ethnicity.

**Ordinal**, when there **is a natural order among the categories**, such as ranking scales or letter grades.

However, ordinal variables are still categorical and do not provide precise measurements.

Differences between categories are not precisely meaningful. For example, if one student scores an A and another a B on an assignment, we cannot say precisely how far apart their scores are, only that an A is higher than a B.
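The distinction can be sketched in a few lines of Python. This is only an illustration: the five-letter grade scale and the helper names `grade_rank` and `is_higher` are assumptions made for the example, not part of the course material.

```python
# Hypothetical letter-grade scale, listed from lowest to highest.
GRADE_ORDER = ["F", "D", "C", "B", "A"]


def grade_rank(grade: str) -> int:
    """Return the position of a grade on the ordered scale (0 = lowest)."""
    return GRADE_ORDER.index(grade)


def is_higher(grade_a: str, grade_b: str) -> bool:
    """Order comparisons are meaningful for an ordinal variable."""
    return grade_rank(grade_a) > grade_rank(grade_b)


# Ordering the categories is a legitimate operation.
print(is_higher("A", "B"))
print(sorted(["B", "A", "C"], key=grade_rank))

# Subtracting ranks is NOT a measurement: the gap between an A and a B
# need not equal the gap between a B and a C in actual scores.
```

The ranks exist only to encode the order; treating them as measurements would claim more precision than an ordinal variable provides.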

**Quantitative variables** take **numerical** values, and represent some kind of **measurement**.

Quantitative variables are often further classified as either:

**Discrete**, when the variable takes on a **countable** number of values.

Most often these variables indeed represent some kind of **count** such as the number of prescriptions an individual takes daily.

**Continuous**, when the variable **can take on any value in some range of values**.

Our precision in measuring these variables is often limited by our instruments.

Units should be provided.

Common examples would be height (inches), weight (pounds), or time to recovery (days).

One special variable type occurs when a variable has only two possible values.

A variable is said to be **binary** or **dichotomous** when there are only two possible levels.

These variables can usually be phrased in a “yes/no” question. Gender is an example of a binary variable.

Currently we are primarily concerned with classifying variables as either categorical or quantitative.

Sometimes, however, we will need to consider further and sub-classify these variables as defined above.

These concepts will be discussed and reviewed as needed but here is a quick practice on sub-classifying categorical and quantitative variables.

Let’s revisit the dataset showing medical records for a sample of patients.

In our example of medical records, there are several variables of each type:

- Age, Weight, and Height are **quantitative** variables.

- Race, Gender, and Smoking are **categorical** variables.
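As a minimal sketch, the classification above can be recorded alongside a dataset so that later analysis steps can look it up. The dictionary `VARIABLE_TYPES` and the helper `group_by_type` are hypothetical names introduced for this illustration.

```python
from collections import defaultdict

# Types of the six variables in the medical-records example.
VARIABLE_TYPES = {
    "Age": "quantitative",
    "Weight": "quantitative",
    "Height": "quantitative",
    "Race": "categorical",
    "Gender": "categorical",
    "Smoking": "categorical",
}


def group_by_type(variable_types: dict) -> dict:
    """Group variable names by their type, preserving the input order."""
    groups = defaultdict(list)
    for name, var_type in variable_types.items():
        groups[var_type].append(name)
    return dict(groups)


print(group_by_type(VARIABLE_TYPES))
```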

**Comments:**

- Notice that the values of the **categorical** variable Smoking have been **coded** as the numbers 0 or 1.

It is quite common to code the values of a categorical variable as numbers, but you should remember that these are just codes.

They have no arithmetic meaning (i.e., it does not make sense to add, subtract, multiply, divide, or compare the magnitude of such values).

Usually, if such a coding is used, all categorical variables will be coded and we will tend to do this type of coding for datasets in this course.

- Sometimes, **quantitative** variables are **divided into groups** for analysis. In such a situation, although the original variable was quantitative, the variable analyzed is categorical.

A common example is to provide information about an individual’s Body Mass Index by stating whether the individual is underweight, normal, overweight, or obese.

This categorized BMI is an example of an ordinal categorical variable.
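The grouping step can be sketched as a small function. The cutoffs used here (18.5, 25, and 30) are the commonly used adult BMI thresholds and are an assumption of this example; the text above does not state which cutoffs the course uses, and `bmi_category` is a hypothetical helper name.

```python
def bmi_category(bmi: float) -> str:
    """Map a quantitative BMI value to an ordinal category.

    Cutoffs are assumed (commonly used adult thresholds), not taken
    from the course text.
    """
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


# Hypothetical BMI values for illustration only.
for value in (17.0, 22.4, 27.9, 33.1):
    print(value, bmi_category(value))
```

After this step the analysis sees only the four ordered labels, so the methods that apply are those for a categorical variable, not a quantitative one.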

**Categorical** variables are sometimes called qualitative variables, but in this course we’ll use the term “categorical.”
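To illustrate the earlier comment about numeric codes, the sketch below treats coded Smoking values as labels and summarizes them with counts. The sample values and the mapping of 0 to "non-smoker" and 1 to "smoker" are assumptions made for this example; the text does not say which code stands for which category.

```python
from collections import Counter

# Assumed code book: which number stands for which category.
SMOKING_LABELS = {0: "non-smoker", 1: "smoker"}

# Hypothetical coded values for six patients.
smoking_codes = [0, 1, 0, 0, 1, 0]


def frequency_table(codes, labels):
    """Translate codes back to labels and count each category."""
    return dict(Counter(labels[code] for code in codes))


print(frequency_table(smoking_codes, SMOKING_LABELS))
```

Translating codes back to labels before summarizing makes it harder to treat them as measurements by accident.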

The **types of variables** you are analyzing **directly relate to the available** descriptive and inferential **statistical methods**.

It is important to:

**assess how you will measure the effect of interest** and **know how this determines the statistical methods you can use.**

As we proceed in this course, we will continually emphasize the **types of variables** that are **appropriate for each method we discuss**.

For example:

To compare the number of polio cases in the two treatment arms of the Salk Polio vaccine trial, you could use

- Fisher’s Exact Test
- Chi-Square Test

To compare blood pressures in a clinical trial evaluating two blood pressure-lowering medications, you could use

- Two-sample t-Test
- Wilcoxon Rank-Sum Test

For each scenario, identify the variable as either **quantitative** or **categorical**.

This document is linked from Proportions (Introduction & Step 1).

In each of the following three problems, you are presented with a brief description of a study involving two variables. Based on the role-type classification of the two variables, you’ll be asked to determine which of the four cases represents the data structure of the problem.


2. How is the **number of calories** in a hot dog related to (or affected by) the **type of hot dog** (beef, meat or poultry)? In other words, are there differences in the number of calories among the three types of hot dogs?

4. Are the **smoking habits** of a person (yes, no) related to the person’s **gender**?

6. What is the relationship between driver’s **age** and sign legibility **distance** (the maximum distance at which the driver can read a sign)?

8. Can you predict a person’s **favorite type of music** (classical, rock, jazz) based on his/her **IQ level**?


While it is fundamentally important to know how to describe the distribution of a single variable, most studies pose research questions that involve exploring the relationship between **two** (or more) variables. These research questions are investigated using a sample from the population of interest.

Here are a few examples of such research questions with the two variables highlighted:

- Is there a relationship between **gender** and **test scores** on a particular standardized test? Other ways of phrasing the same research question:
  - Is performance on the test related to gender?
  - Is there a gender effect on test scores?
  - Are there differences in test scores between males and females?

- How is the **number of calories** in a hot dog related to (or affected by) the **type of hot dog** (beef, meat or poultry)? In other words, are there differences in the number of calories among the three types of hot dogs?

- Is there a relationship between the **type of light** a baby sleeps with (no light, night-light, lamp) and whether or not the child develops **nearsightedness**?

- Are the **smoking habits** of a person (yes, no) related to the person’s **gender**?

- How well can we predict a student’s freshman year **GPA** from his/her **SAT score**?

- What is the relationship between driver’s **age** and sign legibility **distance** (the maximum distance at which the driver can read a sign)?

- Is there a relationship between the **time** a person has practiced driving while having a learner’s permit, and **whether or not this person passed the driving test**?

- Can you predict a person’s **favorite type of music** (classical, rock, jazz) based on his/her **IQ level**?

In most studies involving two variables, each of the variables has a role. We distinguish between:

- the **response** variable: the outcome of the study; and
- the **explanatory** variable: the variable that is claimed to explain, predict or affect the response.

As we mentioned earlier, the variable we wish to predict is commonly called the **dependent variable**, the **outcome** variable, or the **response** variable. Any variable we are using to predict (or explain differences in) the outcome is commonly called an **explanatory variable**, an **independent variable**, a **predictor** variable, or a **covariate**.

**Comment:**

- Typically the **explanatory** variable is denoted by X, and the **response** variable by Y.

Now let’s go back to some of the examples and classify the two relevant variables according to their roles in the study:

Is there a relationship between **gender** and **test scores** on a particular standardized test? Other ways of phrasing the same research question:

- Is performance on the test related to gender?
- Is there a gender effect on test scores?
- Are there differences in test scores between males and females?

We want to explore whether the outcome of the study — the score on a test — is affected by the test-taker’s gender. Therefore:

**Gender** is the **explanatory** variable

**Test score** is the **response** variable

Is there a relationship between the **type of light** a baby sleeps with (no light, night-light, lamp) and whether or not the child develops **nearsightedness**?

In this study we explore whether the nearsightedness of a person can be explained by the type of light that person slept with as a baby. Therefore:

**Light type** is the **explanatory** variable

**Nearsightedness** is the **response** variable

How well can we predict a student’s freshman year **GPA** from his/her **SAT score**?

Here we are examining whether a student’s SAT score is a good predictor for the student’s GPA freshman year. Therefore:

**SAT score** is the **explanatory** variable

**GPA of freshman year** is the **response** variable

Is there a relationship between the **time** a person has practiced driving while having a learner’s permit, and **whether or not this person passed the driving test**?

Here we are examining whether a person’s outcome on the driving test (pass/fail) can be explained by the length of time this person has practiced driving prior to the test. Therefore:

**Time** is the **explanatory** variable

**Driving test outcome** is the **response** variable

Now, using the same reasoning, the following exercise will help you to classify the two variables in the other examples.

**Question:** Is the role classification of variables always clear? In other words, is it always clear which of the variables is the explanatory and which is the response?

**Answer:** No. There are studies in which the role classification is not really clear. This mainly happens in cases when both variables are categorical or both are quantitative. An example is a study that explores the relationship between students’ SAT Math and SAT Verbal scores. In cases like this, any classification choice would be fine (as long as it is consistent throughout the analysis).

If we further classify each of the two relevant variables according to **type** (categorical or quantitative), we get the following 4 possibilities for **“role-type classification”**:

- Categorical explanatory and quantitative response (Case C→Q)
- Categorical explanatory and categorical response (Case C→C)
- Quantitative explanatory and quantitative response (Case Q→Q)
- Quantitative explanatory and categorical response (Case Q→C)

This role-type classification can be summarized and easily visualized in the following table (note that the explanatory variable is always listed first):

This role-type classification serves as the infrastructure for this entire section. In each of the 4 cases, different statistical tools (displays and numerical measures) should be used in order to explore the relationship between the two variables.

This suggests the following important principle:

**PRINCIPLE:** When confronted with a research question that involves exploring the relationship between two variables, the first and most crucial step is to determine which of the 4 cases represents the data structure of the problem. In other words, the first step should be classifying the two relevant variables according to their role and type, and only then can we determine what statistical tools should be used to analyze them.
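The first step described in the principle can be sketched as a small lookup. This is only an illustration: the function `role_type_case` and the table `TYPE_CODES` are hypothetical names, and the label is written with the explanatory variable first.

```python
# Short codes for the two variable types.
TYPE_CODES = {"categorical": "C", "quantitative": "Q"}


def role_type_case(explanatory_type: str, response_type: str) -> str:
    """Return the role-type case label, explanatory variable first."""
    for var_type in (explanatory_type, response_type):
        if var_type not in TYPE_CODES:
            raise ValueError(f"unknown variable type: {var_type!r}")
    return f"{TYPE_CODES[explanatory_type]}→{TYPE_CODES[response_type]}"


# A categorical explanatory variable with a quantitative response.
print(role_type_case("categorical", "quantitative"))

# A quantitative explanatory variable with a categorical response.
print(role_type_case("quantitative", "categorical"))
```

The order of the arguments matters: swapping the roles of the two variables changes the case whenever their types differ.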

Now let’s go back to our 8 examples and determine which of the 4 cases represents the data structure of each:

Is there a relationship between **gender** and **test scores** on a particular standardized test? Other ways of phrasing the same research question:

- Is performance on the test related to gender?
- Is there a gender effect on test scores?
- Are there differences in test scores between males and females?

We want to explore whether the outcome of the study — the score on a test — is affected by the test-taker’s gender.

**Gender** is the **explanatory** variable and it is **categorical**.

**Test score** is the **response** variable and it is **quantitative**.

Therefore this is an example of **case C→Q**.

Is there a relationship between the **type of light** a baby sleeps with (no light, night-light, lamp) and whether or not the child develops **nearsightedness**?

In this study we explore whether the nearsightedness of a person can be explained by the type of light that person slept with as a baby.

**Light type** is the **explanatory** variable and it is **categorical**.

**Nearsightedness** is the **response** variable and it is **categorical**.

Therefore this is an example of **case C→C**.

How well can we predict a student’s freshman year **GPA** from his/her **SAT score**?

Here we are examining whether a student’s SAT score is a good predictor for the student’s GPA freshman year.

**SAT score** is the **explanatory** variable and it is **quantitative**.

**GPA of freshman year** is the **response** variable and it is **quantitative**.

Therefore this is an example of **case Q→Q**.

Is there a relationship between the **time** a person has practiced driving while having a learner’s permit, and **whether or not this person passed the driving test**?

Here we are examining whether a person’s outcome on the driving test (pass/fail) can be explained by the length of time this person has practiced driving prior to the test.

**Time** is the **explanatory** variable and it is **quantitative**.

**Driving test outcome** is the **response** variable and it is **categorical**.

Therefore this is an example of **case Q→C**.

Now you complete the rest…

The remainder of this section on exploring relationships will be guided by this role-type classification. In the next three parts we will elaborate on cases C→Q, C→C, and Q→Q. More specifically, we will learn the appropriate statistical tools (visual display and numerical measures) that will allow us to explore the relationship between the two variables in each of the cases. Case Q→C will **not** be discussed in this course, and is typically covered in more advanced courses. The section will conclude with a discussion on causal relationships.
