This document is linked from Linear Relationships – Correlation.

This document is linked from Case Q-Q.

This short video contains an overview of calculating conditional percentages.

The original slides are not available.

This document is linked from Case C-C.

This document is linked from Case C-Q.

This document is linked from One Quantitative Variable.

This short video contains an overview of exploratory data analysis as well as a few comments related to how exploratory data analysis is useful.

The original slides are not available.

Transcript – Live Introduction to Exploratory Data Analysis

This document is linked from Unit 1: Exploratory Data Analysis.

**Related SAS Tutorials**

- 5A – (3:01) Numeric Measures using PROC MEANS

**Related SPSS Tutorials**

- 5A – (8:00) Numeric Measures using EXPLORE

Although not a required aspect of describing distributions of one quantitative variable, we are often interested in where a particular value falls in the distribution. Is the value unusually low or high or about what we would expect?

Answers to these questions rely on measures of position (or location). These measures give information about the distribution but also give information about how individual values relate to the overall distribution.

A common measure of position is the percentile. Although there are some mathematical considerations involved with calculating percentiles which we will not discuss, you should have a basic understanding of their interpretation.

In general, the *P*-th percentile can be interpreted as a location in the data for which approximately *P*% of the other values in the distribution fall below it and (100 – *P*)% fall above it.
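As a rough illustration of this interpretation, the sketch below counts what percentage of a dataset falls below a given value. The function name `percent_below` and the exam scores are hypothetical, and statistical packages use more refined percentile rules than this simple count.

```python
def percent_below(values, x):
    """Return the percentage of values that fall strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Hypothetical exam scores, used only for illustration.
scores = [52, 61, 64, 70, 73, 75, 78, 81, 88, 95]

# 7 of the 10 scores are below 81, so 81 sits at roughly the 70th percentile.
print(percent_below(scores, 81))  # 70.0
```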

The quartiles Q1 and Q3 are special cases of percentiles and thus are measures of position.

The combination of the five numbers (min, Q1, M, Q3, Max) is called the **five number summary**, and provides a quick numerical description of both the center and spread of a distribution.

Each of the values represents a measure of position in the dataset.

The min and max provide the boundaries, while Q1, the median, and Q3 provide information about the 25th, 50th, and 75th percentiles, respectively.
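The five number summary can be sketched with Python's standard library as follows. The helper name `five_number_summary` and the data are hypothetical. Software packages differ in how they compute quartiles, so the `"inclusive"` method used here is only one of several conventions and may not match SAS or SPSS output exactly.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical data, used only for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(data))
```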

Standardized scores, also called z-scores, use the mean and standard deviation as the primary measures of center and spread and are therefore most useful when the mean and standard deviation are appropriate, i.e., when the distribution is reasonably symmetric with no extreme outliers.

For any individual, the **z-score** tells us how many standard deviations the raw score for that individual deviates from the mean and in what direction. A positive z-score indicates the individual is above average and a negative z-score indicates the individual is below average.

To calculate a z-score, we take the individual value and subtract the mean and then divide this difference by the standard deviation.
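A minimal sketch of this calculation is shown below. The function name `z_score` and the heights are hypothetical, and we assume the sample standard deviation (dividing by n - 1) is the one intended.

```python
import statistics


def z_score(x, values):
    """Return how many standard deviations x lies from the mean of values."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return (x - mean) / sd


# Hypothetical heights in inches, used only for illustration.
heights = [60, 62, 64, 66, 68]

print(z_score(68, heights))  # positive: above average
print(z_score(60, heights))  # negative: below average
print(z_score(64, heights))  # zero: exactly at the mean
```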

Measures of position also allow us to compare values from different distributions. For example, we can present the percentiles or z-scores of an individual’s height and weight. These two measures together would provide a better picture of how the individual fits in the overall population than either would alone.

Although measures of position are not stressed in this course as much as measures of center and spread, we have seen and will see many measures of position used in various aspects of examining the distribution of one variable and it is good to recognize them as measures of position when they appear.

(Link to Best Actor Oscar Data).

The results below were obtained using SPSS. Use the output to answer the following questions.

This document is linked from Boxplots.

**Related SAS Tutorials**

- 9A – (3:53) Basic Scatterplots
- 9B – (2:29) Grouped Scatterplots
- 9C – (3:46) Pearson’s Correlation Coefficient
- 9D – (3:00) Simple Linear Regression – EDA

**Related SPSS Tutorials**

- 9A – (2:38) Basic Scatterplots
- 9B – (2:54) Grouped Scatterplots
- 9C – (3:35) Pearson’s Correlation Coefficient
- 9D – (2:53) Simple Linear Regression – EDA

So far we have visualized relationships between two quantitative variables using scatterplots, and described the overall pattern of a relationship by considering its direction, form, and strength. We noted that assessing the strength of a relationship just by looking at the scatterplot is quite difficult, and therefore we need to supplement the scatterplot with some kind of numerical measure that will help us assess the strength.

In this part, we will restrict our attention to the **special case of relationships that have a linear form**, since they are quite common and relatively simple to detect. More importantly, there exists a numerical measure that assesses the strength of the **linear** relationship between two quantitative variables with which we can supplement the scatterplot. We will introduce this numerical measure here and discuss it in detail.

Even though from this point on we are going to focus only on **linear** relationships, it is important to remember that **not every relationship between two quantitative variables has a linear form.** We have actually seen several examples of relationships that are not linear. The statistical tools that will be introduced here are **appropriate only for examining linear relationships,** and as we will see, when they are used in nonlinear situations, these tools can lead to errors in reasoning.

Let’s start with a motivating example. Consider the following two scatterplots.

We can see that in both cases, the direction of the relationship is **positive** and the form of the relationship is **linear**. What about the strength? Recall that the strength of a relationship is the extent to which the data follow its form.

The purpose of this example was to illustrate how assessing the strength of the **linear** relationship from a scatterplot alone is problematic, since our judgment might be affected by the scale on which the values are plotted. This example, therefore, provides a motivation for the **need** to supplement the scatterplot with a **numerical measure** that will **measure the strength** of the **linear** relationship between two quantitative variables.

The numerical measure that assesses the strength of a **linear** relationship is called the **correlation coefficient**, and is denoted by r. We will:

- give a definition of the correlation r,
- discuss the calculation of r,
- explain how to interpret the value of r, and
- talk about some of the properties of r.

**Calculation:** r is calculated using the following formula:
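In its standard sample form, Pearson's correlation coefficient is usually written as below. The notation is an assumption on our part: x̄ and ȳ denote the sample means, s_x and s_y the sample standard deviations, and n the number of paired observations.

$$
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
$$

In words, r is approximately the average product of the z-scores of the two variables.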

However, the calculation of the correlation (r) is not the focus of this course. We will use a statistics package to calculate r for us, and the **emphasis** of this course will be on the **interpretation** of its value.
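For readers curious about what the software does behind the scenes, here is a minimal sketch that computes Pearson's r as the sum of products of z-scores divided by n - 1. The function name `pearson_r` and the data are hypothetical, and this is not the SAS or SPSS implementation.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson's r: sum of products of paired z-scores, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    total = sum(
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    )
    return total / (n - 1)


# Hypothetical paired observations, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(round(pearson_r(x, y), 3))  # a positive, fairly strong linear relationship
```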

Once we obtain the value of r, its interpretation with respect to the strength of **linear** relationships is quite simple, as these images illustrate:

In order to get a better sense for how the value of *r* relates to the strength of the **linear** relationship, take a look at the following applets.

If you will be using correlation often in your research, I strongly urge you to read the following more detailed discussion of correlation.

Now that we understand the use of *r* as a numerical measure for assessing the direction and strength of **linear** relationships between quantitative variables, we will look at a few examples.

Earlier, we used the scatterplot below to find a **negative linear** relationship between the age of a driver and the maximum distance at which a highway sign was legible. What about the strength of the relationship? It turns out that the correlation between the two variables is r = -0.793.

Since r < 0, it confirms that the direction of the relationship is negative (although we really didn’t need r to tell us that). Since r is relatively close to -1, it suggests that the relationship is moderately strong. In context, the negative correlation confirms that the maximum distance at which a sign is legible generally decreases with age. Since the value of r indicates that the **linear** relationship is moderately strong, but not perfect, we can expect the maximum distance to vary somewhat, even among drivers of the same age.

A statistics department is interested in tracking the progress of its students from entry until graduation. As part of the study, the department tabulates the performance of 10 students in an introductory course and in an upper-level course required for graduation. What is the relationship between the students’ course averages in the two courses? Here is the scatterplot for the data:

The scatterplot suggests a relationship that is **positive** in direction, **linear** in form, and seems quite strong. The value of the correlation that we find between the two variables is r = 0.931, which is very close to 1, and thus confirms that indeed the **linear** relationship is very strong.

**Comments:**

- Note that in both examples we supplemented the scatterplot with the correlation (r). Now that we have the correlation (r), why do we still need to look at a scatterplot when examining the relationship between two quantitative variables?

- The **correlation** coefficient can **only** be interpreted as the **measure of the strength of a linear relationship**, so we need the scatterplot to verify that the relationship indeed looks **linear**. This point and its importance will be clearer after we examine a few properties of r.

We will now discuss and illustrate several important properties of the correlation coefficient as a numerical measure of the strength of a **linear** relationship.

- The correlation does not change when the units of measurement of either one of the variables change. In other words, if we **change the units of measurement** of the explanatory variable and/or the response variable, this has **no effect on the correlation (r)**.

To illustrate this, below are two versions of the scatterplot of the relationship between sign legibility distance and driver’s age:

The top scatterplot displays the original data where the maximum distances are measured **in feet**. The bottom scatterplot displays the same relationship, but with maximum distances changed to **meters**. Notice that the Y-values have changed, but the correlations are the same. This is an example of how changing the units of measurement of the response variable has no effect on r, but as we indicated above, the same is true for changing the units of the explanatory variable, or of both variables.

This might be a good place to comment that the correlation (r) is **“unitless”**. It is just a number.
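The sketch below checks this property numerically. The ages and distances are hypothetical values invented for illustration (they are not the study's data), and `pearson_r` is our own helper rather than a library function.

```python
import statistics


def pearson_r(xs, ys):
    """Pearson's r: sum of products of paired z-scores, divided by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    )
    return total / (len(xs) - 1)


# Hypothetical driver ages (years) and legibility distances (feet).
ages = [20, 30, 40, 50, 60, 70, 80]
distance_ft = [560, 510, 500, 440, 420, 390, 350]

# Convert the response to meters (1 foot = 0.3048 meters).
distance_m = [d * 0.3048 for d in distance_ft]

r_feet = pearson_r(ages, distance_ft)
r_meters = pearson_r(ages, distance_m)

# The two correlations agree up to floating-point rounding.
print(r_feet, r_meters)
```

Because each value is converted to a z-score before the products are summed, rescaling a variable changes its mean and standard deviation by the same factor and leaves the z-scores, and therefore r, unchanged.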

- The correlation **only measures the strength of a linear relationship** between two variables. **It ignores any other type of relationship, no matter how strong it is.** For example, consider the relationship between the average fuel usage of driving a fixed distance in a car, and the speed at which the car drives:

Our data describe a fairly simple non-linear (sometimes called curvilinear) relationship: the amount of fuel consumed decreases rapidly to a minimum for a car driving 60 kilometers per hour, and then increases gradually for speeds exceeding 60 kilometers per hour. The relationship is very strong, as the observations seem to perfectly fit the curve.

Although the relationship is strong, the correlation r = -0.172 indicates a weak **linear** relationship. This makes sense considering that the data fail to adhere closely to a linear form:

- The correlation by itself is **not** enough to determine whether or not a relationship is linear. To see this, let’s consider the study that examined the effect of monetary incentives on the return rate of questionnaires. Below is the scatterplot relating the percentage of participants who completed a survey to the monetary incentive that researchers promised to participants, in which we find a **strong non-linear (sometimes called curvilinear) relationship:**

The relationship is non-linear (sometimes called curvilinear), yet the correlation r = 0.876 is quite close to 1.

In the last two examples we have seen two very strong non-linear (sometimes called curvilinear) relationships, one with a correlation close to 0, and one with a correlation close to 1. Therefore, the correlation alone does not indicate whether a relationship is **linear** or not. The important principle here is:

**Always look at the data!**
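Both situations can be reproduced with a small sketch. The data below are hypothetical curves invented for illustration (they are not the fuel or survey data), and `pearson_r` is our own helper.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson's r: sum of products of paired z-scores, divided by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    )
    return total / (len(xs) - 1)


# Hypothetical U-shaped curve: a perfect, but non-linear, relationship.
speeds = list(range(20, 101, 10))
fuel = [(s - 60) ** 2 / 100 + 5 for s in speeds]
r_u_shape = pearson_r(speeds, fuel)

# Hypothetical curve that rises quickly and then levels off.
incentive = list(range(1, 11))
completed = [20 + 25 * math.log(x) for x in incentive]
r_leveling = pearson_r(incentive, completed)

print(r_u_shape)   # essentially 0, despite the perfect curve
print(r_leveling)  # close to 1, despite the clear curvature
```

In both cases the points lie exactly on a curve, yet r alone would suggest "no relationship" in the first case and "nearly linear" in the second. Only the scatterplot reveals the actual form.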

- The correlation is heavily influenced by outliers. As you will learn in the next two activities, the way in which the outlier influences the correlation depends upon whether or not the outlier is consistent with the pattern of the **linear** relationship.

Hopefully, you’ve noticed the correlation decreasing when you created this kind of outlier, which **is not consistent** with the pattern of the relationship.

The next activity will show you how an outlier that **is consistent** with the direction of the linear relationship actually strengthens it.

In the previous activity, we saw an example where there was a positive **linear** relationship between the two variables, and including the outlier just “strengthened” it. Consider the hypothetical data displayed by the following scatterplot:

In this case, the low outlier gives an “illusion” of a positive **linear** relationship, whereas in reality, there is no **linear** relationship between X and Y.
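A numerical sketch of this effect follows. The points are hypothetical: a 3-by-3 grid with no linear relationship at all, plus one low outlier. They are not the data from the scatterplot, and `pearson_r` is our own helper.

```python
import statistics


def pearson_r(xs, ys):
    """Pearson's r: sum of products of paired z-scores, divided by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    total = sum(
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    )
    return total / (len(xs) - 1)


# Hypothetical cluster: every x value is paired with every y value,
# so there is no linear relationship within the cluster.
x = [7, 8, 9, 7, 8, 9, 7, 8, 9]
y = [7, 7, 7, 8, 8, 8, 9, 9, 9]
r_without = pearson_r(x, y)

# Add a single low outlier at (1, 1).
r_with = pearson_r(x + [1], y + [1])

print(r_without)  # essentially 0
print(r_with)     # strongly positive, driven entirely by one point
```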

Here again is the role-type classification table for framing our discussion about the relationship between two variables:

Before reading further, try this interactive online data analysis applet.

We are done with cases C→Q and C→C, and now we will move on to case Q→Q, where we examine the relationship between two quantitative variables.

In this section we will discuss scatterplots, which are the appropriate visual display in this case, along with numerical methods for linear relationships, including correlation and linear regression.
