This document is linked from Scatterplots.

This document is linked from Outliers.

This short video elaborates upon the information displayed in a boxplot.

The original slides are not available.

This document is linked from Boxplots.

This short video contains some additional discussion about shapes of distribution including symmetry and modality.

The original slides are not available.

Transcript – Live Describing Distributions

This document is linked from Describing Distributions.

**Related SAS Tutorials**

- 5A – (3:01) Numeric Measures using PROC MEANS
- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT
- 5C – (5:41) Creating QQ-Plots and other plots using UNIVARIATE

**Related SPSS Tutorials**

- 5A – (8:00) Numeric Measures using EXPLORE
- 5B – (2:29) Creating Histograms and Boxplots
- 5C – (2:31) Creating QQ-Plots and PP-Plots

Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern.

More specifically, we should consider the following features of the Distribution for One Quantitative Variable:

When describing the shape of a distribution, we should consider:

**Symmetry/skewness** of the distribution.

**Peakedness (modality)**: the number of peaks (modes) the distribution has.

We distinguish between:

A distribution is called **symmetric** if, as in the histograms above, the distribution forms an approximate mirror image with respect to the center of the distribution.

The center of the distribution is easy to locate and both tails of the distribution are approximately the same length.

Note that all three distributions are symmetric, but are different in their **modality** (peakedness).

- The first distribution is **unimodal**: it has one mode (roughly at 10) around which the observations are concentrated.
- The second distribution is **bimodal**: it has two modes (roughly at 10 and 20) around which the observations are concentrated.
- The third distribution is roughly flat, or **uniform**. The distribution has no modes, or no value around which the observations are concentrated. Rather, we see that the observations are roughly uniformly distributed among the different values.

A distribution is called **skewed right** if, as in the histogram above, the right tail (larger values) is much longer than the left tail (small values).

Note that in a skewed right distribution, the bulk of the observations are small/medium, with a few observations that are much larger than the rest.

- An example of a real-life variable that has a skewed right distribution is salary. Most people earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes, etc.) that are distributed along a large range (long “tail”) of higher values.

A distribution is called **skewed left** if, as in the histogram above, the left tail (smaller values) is much longer than the right tail (larger values).

Note that in a skewed left distribution, the bulk of the observations are medium/large, with a few observations that are much smaller than the rest.

- An example of a real-life variable that has a skewed left distribution is age of death from natural causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases happening at younger ages.
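The verbal definitions of skewed right and skewed left can also be checked numerically. The following is a minimal sketch, assuming a moment-based skewness coefficient (a measure this section does not itself define) and an arbitrary cutoff of 0.5. All three data sets are hypothetical and chosen only to mimic the salary and age-at-death examples.

```python
from statistics import mean

def sample_skewness(values):
    """Moment-based skewness: the average cubed deviation from the mean,
    divided by the cube of the (population) standard deviation."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def describe_skew(values, tolerance=0.5):
    """Label the shape; the 0.5 cutoff is an illustrative assumption."""
    g = sample_skewness(values)
    if g > tolerance:
        return "skewed right"
    if g < -tolerance:
        return "skewed left"
    return "roughly symmetric"

# Hypothetical salaries (in $1000s): mostly low/medium, a few very large.
salaries = [30, 32, 35, 38, 40, 41, 45, 48, 52, 60, 150, 400]
# Hypothetical ages at death: mostly old, a few much younger.
ages_at_death = [35, 52, 68, 72, 75, 78, 80, 81, 83, 85, 86, 88]
# Hypothetical symmetric data centered at 10.
symmetric = [8, 9, 9, 10, 10, 10, 11, 11, 12]

print(describe_skew(salaries))
print(describe_skew(ages_at_death))
print(describe_skew(symmetric))
```

A positive coefficient corresponds to a long right tail and a negative one to a long left tail. The histogram remains the primary tool for judging shape.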

**Comments:**

- Distributions with more than two peaks are generally called **multimodal**.

- Bimodal or multimodal distributions can be evidence that two distinct groups are represented.

- Unimodal, bimodal, and multimodal distributions may or may not be symmetric.

Here is an example. A medium-sized neighborhood 24-hour convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the data.

Note that the overall shape of the distribution is skewed to the right with a clear mode around $25. In addition, it has another (smaller) “peak” (mode) around $50-55.

The majority of the customers spend around $25 but there is a cluster of customers who enter the store and spend around $50-55.

The **center** of the distribution is often used to represent a typical value.

One way to define the center is as the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.

Another common way to measure the center of a distribution is to use the average value.

From looking at the histogram we can get only a rough estimate for the center of the distribution. More exact ways of finding measures of center will be discussed in the next section.
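The two notions of center described above can be illustrated with a short sketch. It assumes that the "value that divides the distribution in half" is computed as the median and the "average value" as the mean, using Python's standard `statistics` module. The amounts are hypothetical and are not the convenience store data.

```python
from statistics import mean, median

# Hypothetical amounts spent (in dollars) by nine customers.
amounts = [12, 18, 22, 24, 25, 27, 30, 52, 55]

middle_value = median(amounts)   # half the observations fall on each side
average_value = mean(amounts)    # sum of the observations divided by their count

# Count how many observations lie strictly below and above the middle value.
smaller = sum(1 for a in amounts if a < middle_value)
larger = sum(1 for a in amounts if a > middle_value)

print(middle_value, round(average_value, 2), smaller, larger)
```

In this hypothetical sample the two large amounts pull the average above the middle value. The two measures can disagree in this way when a distribution is skewed.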

One way to measure the **spread** (also called **variability** or **variation**) of the distribution is to use the approximate range covered by the data.

From looking at the histogram, we can approximate the smallest observation (**min**), and the largest observation (**max**), and thus approximate the **range**. (More exact ways of finding measures of spread will be discussed soon.)

For example, the following histogram represents a distribution with a highly probable outlier:

As you can see from the histogram, the grades distribution is roughly **symmetric** and **unimodal** with **no outliers**.

The **center** of the grades distribution is roughly **70** (7 students scored below 70, and 8 students scored above 70).

| Measure | Approximate value |
| --- | --- |
| Min | 45 (the middle of the lowest interval of scores) |
| Max | 95 (the middle of the highest interval of scores) |
| Range | 95 − 45 = 50 |
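The range arithmetic for the grades example can be reproduced from histogram intervals. This is a sketch under stated assumptions: the text gives only the midpoints 45 and 95 and the 7/8 split around 70, so the interval edges and the individual counts below are hypothetical values chosen to be consistent with those facts.

```python
def approximate_range(bin_edges, counts):
    """Approximate min, max and range using the midpoints of the lowest
    and highest non-empty histogram intervals."""
    midpoints = [(lo + hi) / 2 for lo, hi in zip(bin_edges, bin_edges[1:])]
    occupied = [m for m, c in zip(midpoints, counts) if c > 0]
    approx_min, approx_max = occupied[0], occupied[-1]
    return approx_min, approx_max, approx_max - approx_min

# Hypothetical intervals 40-50, 50-60, ..., 90-100 and counts for 15 students
# (7 below 70 and 8 above 70, matching the description in the text).
bin_edges = [40, 50, 60, 70, 80, 90, 100]
counts = [1, 2, 4, 5, 2, 1]

print(approximate_range(bin_edges, counts))  # (45.0, 95.0, 50.0)
```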

Let’s look at a new example.

To provide an example of a histogram applied to actual data, we will look at the ages of Best Actress Oscar winners from 1970 to 2001.

The histogram for the data is shown below. (Link to the Best Actress Oscar Winners data).

We will now summarize the main features of the distribution of ages as it appears from the histogram:

**Shape:** The distribution of ages is skewed right. We have a concentration of data among the younger ages and a long tail to the right. The vast majority of the “best actress” awards are given to young actresses, with very few awards given to actresses who are older.

**Center:** The data seem to be centered around 35 or 36 years old. Note that this implies that roughly half the awards are given to actresses who are less than 35 years old.

**Spread:** The data range from about 20 to about 80, so the approximate range equals 80 – 20 = 60.

**Outliers:** There seem to be two probable outliers to the far right and possibly a third around 62 years old.

You can see how informative it is to know “what to look at” in a histogram.

The following exercises provide more practice with shapes of distributions for one quantitative variable.

- When examining the distribution of a quantitative variable, one should describe the overall pattern of the data (shape, center, spread), and any deviations from the pattern (outliers).

- When describing the shape of a distribution, one should consider:
  - Symmetry/skewness of the distribution
  - Peakedness (modality): the number of peaks (modes) the distribution has.
- Not all distributions have a simple, recognizable shape.

- Outliers are data points that fall outside the overall pattern of the distribution and need further research before continuing the analysis.

- It is always important to interpret what the features of the distribution mean in the context of the data.

To see the effect of outliers on a regression equation, use the applet introduced earlier. Draw points on the graph, add the regression line and then add an outlier or move an observation to see how the regression line changes.
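For readers without access to the applet, the same experiment can be sketched in code. This is a minimal illustration, assuming the usual least-squares formulas for the slope and intercept. The points are hypothetical, and the function name `least_squares_line` is introduced here only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
slope_before, intercept_before = least_squares_line(xs, ys)

# Add one outlier at (6, 0), far below the pattern of the other points.
slope_after, intercept_after = least_squares_line(xs + [6], ys + [0])

print(slope_before, intercept_before)
print(round(slope_after, 3), round(intercept_after, 3))
```

In this hypothetical case a single added point flattens the fitted line considerably, which is the kind of change the applet lets you explore interactively.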

Here is another similar applet that can be used to illustrate outliers and guessing lines of best fit.

Here is an interactive demonstration from the Rossman/Chance collection which has extensive options and illustrates many ideas about linear regression and correlation.

And, remember the two-variable calculator we introduced earlier.

This document is linked from Linear Relationships – Linear Regression.

- Fill the scatterplot with a hypothetical positive linear relationship between X and Y (by clicking on the graph about a dozen times starting at lower left and going up diagonally to the top right). Pay attention to the correlation coefficient calculated at the top right of the applet. (Clicking on the garbage can will let you start over.)

- Once you are satisfied with your hypothetical data, create an outlier by clicking on one of the data points in the upper right of the graph, and dragging it down along the right side of the graph. Again, pay attention to what happens to the value of the correlation.
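The two applet steps above can be imitated in code. This is a sketch, assuming the standard Pearson correlation formula. The dozen points are hypothetical, and "dragging a point down" is modeled by replacing the y-value of the upper-right point.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Step 1: a dozen hypothetical points rising from lower left to upper right.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ys = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2, 9.1, 9.7, 11.2, 11.9]
r_before = pearson_r(xs, ys)

# Step 2: drag the upper-right point straight down to create an outlier.
ys_with_outlier = ys[:-1] + [1.0]
r_after = pearson_r(xs, ys_with_outlier)

print(round(r_before, 3), round(r_after, 3))
```

With these hypothetical values the correlation drops markedly after one point is moved, even though the other eleven points are unchanged.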

This document is linked from Linear Relationships – Correlation.

From the online version of Little Handbook of Statistical Practice, this reading contains a classic paper discussing how to handle outliers (wild observations).

This document is linked from Outliers.

**Related SAS Tutorials**

- 9A – (3:53) Basic Scatterplots
- 9B – (2:29) Grouped Scatterplots
- 9C – (3:46) Pearson’s Correlation Coefficient
- 9D – (3:00) Simple Linear Regression – EDA

**Related SPSS Tutorials**

- 9A – (2:38) Basic Scatterplots
- 9B – (2:54) Grouped Scatterplots
- 9C – (3:35) Pearson’s Correlation Coefficient
- 9D – (2:53) Simple Linear Regression – EDA

In the previous two cases we had a categorical explanatory variable, and therefore exploring the relationship between the two variables was done by comparing the distribution of the response variable for each category of the explanatory variable:

- In case C→Q we compared distributions of the quantitative response.
- In case C→C we compared distributions of the categorical response.

Case Q→Q is different in the sense that both variables (in particular the explanatory variable) are quantitative. As you will discover, although we are still in essence comparing the distribution of one variable for different values of the other, this case will require a different kind of treatment and tools.

Let’s start with an example:

A Pennsylvania research firm conducted a study in which 30 drivers (of ages 18 to 82 years old) were sampled, and for each one, the maximum distance (in feet) at which he/she could read a newly designed sign was determined. The goal of this study was to explore the relationship between a driver’s **age** and the **maximum distance** at which signs were legible, and then use the study’s findings to improve safety for older drivers. (Reference: Utts and Heckard, *Mind on Statistics* (2002). Original source: Data collected by Last Resource, Inc, Bellfonte, PA.)

Since the purpose of this study is to explore the effect of age on maximum legibility distance,

- the **explanatory** variable is **Age**, and
- the **response** variable is **Distance**.

Here is what the raw data look like:

Note that the data structure is such that for each individual (in this case driver 1, …, driver 30) we have a pair of values (in this case representing the driver’s age and distance). We can therefore think about these data as 30 pairs of values: (18, 510), (32, 410), (55, 420), … , (82, 360).

The first step in exploring the relationship between driver age and sign legibility distance is to create an appropriate and informative graphical display. The appropriate graphical display for examining the relationship between two quantitative variables is the **scatterplot**. Here is how a scatterplot is constructed for our example:

To create a scatterplot, each pair of values is plotted, so that the value of the explanatory variable (X) is plotted on the horizontal axis, and the value of the response variable (Y) is plotted on the vertical axis. In other words, each individual (driver, in our example) appears on the scatterplot as a single point whose X-coordinate is the value of the explanatory variable for that individual, and whose Y-coordinate is the value of the response variable. Here is an illustration:

And here is the completed scatterplot:

**Comment:**

- It is important to mention again that when creating a scatterplot, the explanatory variable should always be plotted on the horizontal X-axis, and the response variable should be plotted on the vertical Y-axis. If in a specific example we do not have a clear distinction between explanatory and response variables, each of the variables can be plotted on either axis.
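The construction described above (explanatory variable on the horizontal axis, response on the vertical axis, one point per individual) can be sketched with a simple character grid. This is an illustrative stand-in for a real plotting tool. It uses only the four (age, distance) pairs quoted in the text, not the full set of 30 drivers, and it assumes the x-values and the y-values are not all identical.

```python
def ascii_scatter(points, width=40, height=10):
    """Draw (x, y) pairs on a character grid: x runs left to right,
    y runs bottom to top. Assumes x and y each take at least two values."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    return "\n".join("".join(line) for line in grid)

# X = age (explanatory), Y = maximum legibility distance in feet (response).
drivers = [(18, 510), (32, 410), (55, 420), (82, 360)]
print(ascii_scatter(drivers))
```

The youngest driver appears in the top-left corner and the oldest in the bottom-right corner, because each point's horizontal position is set by age and its vertical position by distance.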

How do we explore the relationship between two quantitative variables using the scatterplot? What should we look at, or pay attention to?

Recall that when we described the distribution of a single quantitative variable with a histogram, we described the overall pattern of the distribution (shape, center, spread) and any deviations from that pattern (outliers). **We do the same thing with the scatterplot.** The following figure summarizes this point:

As the figure explains, when describing the **overall pattern** of the relationship we look at its direction, form and strength.

- The **direction** of the relationship can be positive, negative, or neither:

A **positive (or increasing) relationship** means that an increase in one of the variables is associated with an increase in the other.

A **negative (or decreasing) relationship** means that an increase in one of the variables is associated with a decrease in the other.

Not all relationships can be classified as either positive or negative.
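One way to make the idea of direction concrete is to compare every pair of individuals and count how often the two variables move together versus in opposite directions. This pair-counting approach is an assumption made for illustration, not a method introduced in this section. The function name `direction` is hypothetical, and the first data set is the four driver pairs quoted earlier.

```python
from itertools import combinations

def direction(points):
    """Classify direction by counting pairs of points in which x and y
    change in the same direction (concordant) or opposite (discordant)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        product = (x2 - x1) * (y2 - y1)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant > discordant:
        return "positive"
    if discordant > concordant:
        return "negative"
    return "neither"

drivers = [(18, 510), (32, 410), (55, 420), (82, 360)]  # pairs quoted in the text
rising = [(1, 1), (2, 3), (3, 4)]                        # hypothetical
u_shaped = [(1, 4), (2, 1), (3, 4)]                      # hypothetical

print(direction(drivers))   # negative
print(direction(rising))    # positive
print(direction(u_shaped))  # neither
```

The U-shaped example shows why some relationships cannot be classified as positive or negative: the increases and decreases cancel out.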

- The **form** of the relationship is its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatterplot. There are many possible forms. Here are a couple that are quite common:

Relationships with a **linear** form are most simply described as points scattered about a line:

Relationships with a **non-linear (sometimes called curvilinear)** form are most simply described as points dispersed around the same curved line:

There are many other possible forms for the relationship between two quantitative variables, but linear and curvilinear forms are quite common and easy to identify. Another form-related pattern that we should be aware of is clusters in the data:

- The **strength** of the relationship is determined by how closely the data follow the form of the relationship. Let’s look, for example, at the following two scatterplots displaying positive, linear relationships:

We can see that in the left scatterplot the data points follow the linear pattern quite closely. This is an example of a strong relationship. In the right scatterplot, the points also follow the linear pattern, but much less closely, and therefore we can say that the relationship is weaker. In general, though, assessing the strength of a relationship just by looking at the scatterplot is quite problematic, and we need a numerical measure to help us with that. We will discuss that later in this section.

- Data points that **deviate from the pattern** of the relationship are called **outliers**. We will see several examples of outliers during this section. Two outliers are illustrated in the scatterplot below:

Let’s go back now to our example, and use the scatterplot to examine the relationship between the age of the driver and the maximum sign legibility distance.

Here is the scatterplot:

The direction of the relationship is **negative**, which makes sense in context, since as you get older your eyesight weakens, and in particular older drivers tend to be able to read signs only at lesser distances. An arrow drawn over the scatterplot illustrates the negative direction of this relationship:

The form of the relationship seems to be **linear**. Notice how the points tend to be scattered about the line. Although, as we mentioned earlier, it is problematic to assess the strength without a numerical measure, the relationship appears to be **moderately strong**, as the data are fairly tightly scattered about the line. Finally, all the data points seem to “obey” the pattern: there **do not appear to be any outliers**.

We will now look at two more examples:

The average gestation period, or time of pregnancy, of an animal is closely related to its longevity (the length of its lifespan). Data on the average gestation period and longevity (in captivity) of 40 different species of animals have been examined, with the purpose of examining how the gestation period of an animal is related to (or can be predicted from) its longevity. (Source: Rossman and Chance. (2001). Workshop statistics: Discovery with data and Minitab. Original source: The 1993 world almanac and book of facts).

Here is the scatterplot of the data.

What can we learn about the relationship from the scatterplot? The direction of the relationship is **positive**, which means that animals with longer life spans tend to have longer times of pregnancy (this makes intuitive sense). An arrow drawn over the scatterplot below illustrates this:

The form of the relationship is again essentially **linear**. There appears to be **one outlier**, indicating an animal with exceptionally long longevity and gestation period. (This animal happens to be the elephant.) Note that while this outlier definitely deviates from the rest of the data in terms of its magnitude, it **does** follow the direction of the data.

**Comment:**

- Another feature of the scatterplot that is worth observing is how the variation in gestation increases as longevity increases. This fact is illustrated by the two red vertical lines at the bottom left part of the graph. Note that the gestation periods for animals that live 5 years range from about 30 days up to about 120 days. On the other hand, the gestation periods of animals that live 12 years vary much more, and range from about 60 days up to more than 400 days.

As a third example, consider the relationship between the average amount of fuel used (in liters) to drive a fixed distance in a car (100 kilometers), and the speed at which the car is driven (in kilometers per hour). (Source: Moore and McCabe, (2003). Introduction to the practice of statistics. Original source: T.N. Lam. (1985). “Estimating fuel consumption for engine size,” Journal of Transportation Engineering, vol. 111)

The data describe a relationship that decreases and then increases — the amount of fuel consumed decreases rapidly to a minimum for a car driving 60 kilometers per hour, and then increases gradually for speeds exceeding 60 kilometers per hour. This suggests that the speed at which a car economizes on fuel the most is about 60 km/h. This forms a non-linear (curvilinear) relationship that seems to be very strong, as the observations seem to perfectly fit the curve. Finally, there do not appear to be any outliers.

The example in the last activity provides a great opportunity for interpretation of the form of the relationship in context. Recall that the example examined how the percentage of participants who completed a survey is affected by the monetary incentive that researchers promised to participants. Here again is the scatterplot that displays the relationship:

The positive relationship definitely makes sense in context, but what is the interpretation of the non-linear (curvilinear) form in the context of the problem? How can we explain (in context) the fact that the relationship seems at first to be increasing very rapidly, but then slows down? The following graph will help us:

Note that when the monetary incentive increases from $0 to $10, the percentage of returned surveys increases sharply: an increase of 27 percentage points (from 16% to 43%). However, the same increase of $10 from $30 to $40 doesn’t result in the same dramatic increase in the percentage of returned surveys; it results in an increase of only 3 percentage points (from 54% to 57%). The form displays the phenomenon of “diminishing returns”: a return rate that after a certain point fails to increase proportionately to additional outlays of investment. $10 is worth more to people relative to $0 than $40 is relative to $30.
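The diminishing-returns comparison can be written out as a short calculation. This sketch uses only the four return rates quoted in the paragraph above. The dictionary and function names are introduced here for illustration.

```python
# Return rates (percent of surveys returned) quoted in the text,
# keyed by the monetary incentive in dollars.
return_rate = {0: 16, 10: 43, 30: 54, 40: 57}

def gain(low, high):
    """Change in return rate, in percentage points, between two incentives."""
    return return_rate[high] - return_rate[low]

first_ten_dollars = gain(0, 10)   # effect of going from $0 to $10
later_ten_dollars = gain(30, 40)  # effect of going from $30 to $40

print(first_ten_dollars)  # 27
print(later_ten_dollars)  # 3
```

The same $10 increase buys 27 percentage points at the low end of the scale but only 3 at the high end, which is what gives the scatterplot its curved form.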

In certain circumstances, it may be reasonable to indicate different subgroups or categories within the data on the scatterplot, by labeling each subgroup differently. The result is sometimes called a **labeled scatterplot** or **grouped scatterplot**, and can provide further insight about the relationship we are exploring. Here is an example.

The scatterplot below displays the relationship between the sodium and calorie content of 54 brands of hot dogs. Note that in this example there is no clear explanatory-response distinction, and we decided to have sodium content as the explanatory variable, and calorie content as the response variable.

The scatterplot displays a positive relationship, which means that hot dogs containing more sodium tend to be higher in calories.

The form of the relationship, however, is somewhat hard to determine. Maybe if we label the scatterplot, indicating the type of hot dogs, we will get a better understanding of the form.

Here is the labeled scatterplot, with the three different colors representing the three types of hot dogs, as indicated.

The display does give us more insight about the form of the relationship between sodium and calorie content.

It appears that there is a positive relationship within all three types. In other words, we can generally expect hot dogs that are higher in sodium to be higher in calories, no matter what type of hot dog we consider. In addition, we can see that hot dogs made of poultry (indicated in blue) are generally lower in calories. This is a result we have seen before.

Interestingly, it appears that the form of the relationship specifically for poultry is further clustered, and we can only speculate about whether there is another categorical variable that describes these apparent sub-categories of poultry hot dogs.
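The idea behind a labeled scatterplot, splitting the points by a categorical variable before examining them, can be sketched as follows. The records are entirely hypothetical and are not the 54-brand hot dog data. The labels `type_a` and `type_b` are placeholders for the two types other than poultry.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sodium in mg, calories, type) records.
records = [
    (300, 110, "poultry"), (380, 130, "poultry"), (520, 150, "poultry"),
    (320, 140, "type_a"), (420, 160, "type_a"), (500, 185, "type_a"),
    (330, 145, "type_b"), (450, 170, "type_b"), (540, 190, "type_b"),
]

# Group the (sodium, calories) points by type, as a labeled scatterplot
# does visually with colors.
groups = defaultdict(list)
for sodium, calories, kind in records:
    groups[kind].append((sodium, calories))

# Summarize each group separately.
mean_calories = {kind: mean(c for _, c in pts) for kind, pts in groups.items()}
for kind, value in mean_calories.items():
    print(kind, round(value, 1))
```

Once the points are grouped, each subgroup can be described on its own (direction, form, strength), which is how the labeled plot reveals patterns hidden in the combined display.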

- The relationship between two quantitative variables is visually displayed using the **scatterplot**, where each point represents an individual. We always plot the explanatory variable on the horizontal X axis, and the response variable on the vertical Y axis.
- When we explore a relationship using the scatterplot we should describe the **overall pattern** of the relationship and any **deviations** from that pattern. To describe the overall pattern consider the **direction**, **form** and **strength** of the relationship. Assessing the strength just by looking at the scatterplot can be problematic; using a numerical measure to determine strength will be discussed later in this course.
- Adding labels to the scatterplot that indicate different groups or categories within the data might help us get more insight about the relationship we are exploring.