# Describing Distributions

**CO-4:** Distinguish among different measurement scales, choose the appropriate descriptive and inferential statistical methods based on these distinctions, and interpret the results.

**LO 4.4:** Using appropriate graphical displays and/or numerical measures, describe the distribution of a quantitative variable in context: a) describe the overall pattern, b) describe striking deviations from the pattern.

**Video:** Describing Distributions (2 videos, 7:38 total)

**Related SAS Tutorials**

- 5A – (3:01) Numeric Measures using PROC MEANS
- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT
- 5C – (5:41) Creating QQ-Plots and other plots using UNIVARIATE

**Related SPSS Tutorials**

- 5A – (8:00) Numeric Measures using EXPLORE
- 5B – (2:29) Creating Histograms and Boxplots
- 5C – (2:31) Creating QQ-Plots and PP-Plots

## Features of Distributions of Quantitative Variables

**LO 4.7:** Define and describe the features of the distribution of one quantitative variable (shape, center, spread, outliers).

Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern.

More specifically, we should consider the following features of the distribution of one quantitative variable:

## Shape

When describing the shape of a distribution, we should consider:

- **Symmetry/skewness** of the distribution.
- **Peakedness (modality)**: the number of peaks (modes) the distribution has.

We distinguish between:

## Symmetric Distributions

A distribution is called **symmetric** if, as in the histograms above, the distribution forms an approximate mirror image with respect to the center of the distribution.

The center of the distribution is easy to locate, and both tails of the distribution are approximately the same length.

Note that all three distributions are symmetric, but are different in their **modality** (peakedness).

- The first distribution is **unimodal**: it has one mode (roughly at 10) around which the observations are concentrated.
- The second distribution is **bimodal**: it has two modes (roughly at 10 and 20) around which the observations are concentrated.
- The third distribution is roughly flat, or **uniform**. The distribution has no modes, that is, no value around which the observations are concentrated. Rather, the observations are roughly uniformly distributed among the different values.
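The distinction between unimodal, bimodal, and uniform shapes can be sketched in code by counting the peaks among the bar heights of a histogram. This is a minimal illustration that assumes clean, idealized bar heights; the function names, the `flat_tolerance` threshold, and the example counts are all hypothetical and not part of the course materials. Real histograms are noisier, so judging modality by eye remains the primary approach.

```python
def count_peaks(counts):
    """Count local maxima among histogram bar heights."""
    # Collapse runs of equal heights so a flat-topped peak counts once.
    heights = [c for i, c in enumerate(counts) if i == 0 or c != counts[i - 1]]
    # Pad with zeros so the first and last bars can also be peaks.
    padded = [0] + heights + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )


def describe_modality(counts, flat_tolerance=0.2):
    """Label idealized bar heights as uniform, unimodal, bimodal, or multimodal."""
    tallest, shortest = max(counts), min(counts)
    if tallest <= 0:
        raise ValueError("histogram has no observations")
    # Hypothetical rule of thumb: bars of nearly equal height count as flat.
    if (tallest - shortest) / tallest <= flat_tolerance:
        return "uniform"
    peaks = count_peaks(counts)
    return {1: "unimodal", 2: "bimodal"}.get(peaks, "multimodal")


# Made-up bar heights, chosen only to illustrate the three shapes.
print(describe_modality([1, 3, 6, 9, 6, 3, 1]))        # one peak
print(describe_modality([2, 6, 9, 6, 2, 6, 9, 6, 2]))  # two peaks
print(describe_modality([5, 5, 6, 5, 5, 5]))           # nearly flat
```

All three example sequences are symmetric, which mirrors the point that symmetry and modality are separate features of shape.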

## Skewed Right Distributions

A distribution is called **skewed right** if, as in the histogram above, the right tail (larger values) is much longer than the left tail (smaller values).

Note that in a skewed right distribution, the bulk of the observations are small/medium, with a few observations that are much larger than the rest.

- An example of a real-life variable that has a skewed right distribution is salary. Most people earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes etc.) that are distributed along a large range (long “tail”) of higher values.

## Skewed Left Distributions

A distribution is called **skewed left** if, as in the histogram above, the left tail (smaller values) is much longer than the right tail (larger values).

Note that in a skewed left distribution, the bulk of the observations are medium/large, with a few observations that are much smaller than the rest.

- An example of a real-life variable that has a skewed left distribution is age of death from natural causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases happening at younger ages.
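The idea of comparing tail lengths can be sketched numerically. The example below measures how far the smallest and largest observations lie from the middle value of the sorted data and labels the shape accordingly. The function names, the `ratio` threshold of 2, and both data sets are hypothetical, chosen only to mimic the salary and age-at-death examples above.

```python
import statistics


def tail_lengths(values):
    """Return (left, right): distances from the middle value to the min and max."""
    middle = statistics.median(values)
    return middle - min(values), max(values) - middle


def describe_skew(values, ratio=2.0):
    """Label a data set by comparing the lengths of its two tails."""
    left, right = tail_lengths(values)
    # Hypothetical threshold: one tail must be more than `ratio` times the other.
    if right > ratio * left:
        return "skewed right"
    if left > ratio * right:
        return "skewed left"
    return "roughly symmetric"


# Made-up salaries (in thousands): most are low/medium, one is very large.
salaries = [30, 35, 38, 40, 42, 45, 50, 55, 60, 250]
# Made-up ages at death: most are old, one is much younger.
ages_at_death = [35, 70, 75, 78, 80, 82, 84, 85, 88, 90]

print(describe_skew(salaries))
print(describe_skew(ages_at_death))
```

Because this check looks only at the extremes, a single outlier can mimic a long tail. The histogram remains the better tool for judging whether the bulk of the data trails off gradually on one side.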

**Comments:**

- Distributions with more than two peaks are generally called **multimodal**.

- Bimodal or multimodal distributions can be evidence that two distinct groups are represented.

- Unimodal, bimodal, and multimodal distributions may or may not be symmetric.

Here is an example. A medium-sized, 24-hour neighborhood convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the data.

Note that the overall shape of the distribution is skewed to the right with a clear mode around $25. In addition, it has another (smaller) “peak” (mode) around $50-55.

The majority of the customers spend around $25 but there is a cluster of customers who enter the store and spend around $50-55.

## Center

The **center** of the distribution is often used to represent a typical value.

One way to define the center is as the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.

Another common way to measure the center of a distribution is to use the average value.

From looking at the histogram we can get only a rough estimate for the center of the distribution. More exact ways of finding measures of center will be discussed in the next section.
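The two ways of defining the center described above can be previewed with a short sketch: the value that splits the observations in half, and the average. The scores below are made up for illustration, and `center_summary` is a hypothetical helper name. The sketch uses Python's standard `statistics` module.

```python
import statistics


def center_summary(values):
    """Return the middle value, the average, and counts on each side of the middle."""
    median = statistics.median(values)
    mean = statistics.mean(values)
    below = sum(1 for v in values if v < median)
    above = sum(1 for v in values if v > median)
    return {"median": median, "mean": mean, "below": below, "above": above}


# Hypothetical exam scores, not taken from the course data.
scores = [52, 58, 61, 65, 68, 70, 72, 75, 79, 83, 91]
summary = center_summary(scores)

print(summary["median"])                  # value with half the scores on each side
print(round(summary["mean"], 2))          # average score
print(summary["below"], summary["above"])
```

For these made-up scores the two measures land close together, which is typical of roughly symmetric data.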

## Spread

One way to measure the **spread** (also called **variability** or **variation**) of the distribution is to use the approximate range covered by the data.

From looking at the histogram, we can approximate the smallest observation (**min**), and the largest observation (**max**), and thus approximate the **range**. (More exact ways of finding measures of spread will be discussed soon.)
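Reading the approximate range off a histogram can be sketched as follows. The sketch assumes one reasonable convention: the smallest and largest observations are approximated by the midpoints of the lowest and highest occupied intervals. The function name, bin edges, and counts are hypothetical.

```python
def approximate_range(bin_edges, counts):
    """Approximate (min, max, range) from histogram intervals and their counts.

    bin_edges has one more entry than counts; interval i spans
    bin_edges[i] to bin_edges[i + 1].
    """
    occupied = [i for i, c in enumerate(counts) if c > 0]
    if not occupied:
        raise ValueError("histogram has no observations")
    lowest, highest = occupied[0], occupied[-1]
    # Use the midpoint of the lowest and highest occupied intervals.
    approx_min = (bin_edges[lowest] + bin_edges[lowest + 1]) / 2
    approx_max = (bin_edges[highest] + bin_edges[highest + 1]) / 2
    return approx_min, approx_max, approx_max - approx_min


# Made-up histogram: intervals 0-10, 10-20, ..., 40-50; the first is empty.
edges = [0, 10, 20, 30, 40, 50]
counts = [0, 3, 8, 5, 2]

approx_min, approx_max, approx_rng = approximate_range(edges, counts)
print(approx_min, approx_max, approx_rng)
```

With the raw data in hand, the exact range is simply `max(values) - min(values)`; the midpoint approximation is only needed when the histogram is all that is available.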

**Outliers** are observations that fall outside the overall pattern.

For example, the following histogram represents a distribution with a highly probable outlier:

## Example: Exam Grades

As you can see from the histogram, the grades distribution is roughly **symmetric** and **unimodal** with **no outliers**.

The **center** of the grades distribution is roughly **70** (7 students scored below 70, and 8 students scored above 70).

| Measure | Value |
| --- | --- |
| Approximate min | 45 (the middle of the lowest interval of scores) |
| Approximate max | 95 (the middle of the highest interval of scores) |
| Approximate range | 95 - 45 = 50 |

Let’s look at a new example.

## Example: Best Actress Oscar Winners

To provide an example of a histogram applied to actual data, we will look at the ages of Best Actress Oscar winners from 1970 to 2001.

The histogram for the data is shown below. (Link to the Best Actress Oscar Winners data).

We will now summarize the main features of the distribution of ages as it appears from the histogram:

**Shape:** The distribution of ages is skewed right. We have a concentration of data among the younger ages and a long tail to the right. The vast majority of the “best actress” awards are given to young actresses, with very few awards given to actresses who are older.

**Center:** The data seem to be centered around 35 or 36 years old. Note that this implies that roughly half the awards are given to actresses who are less than 35 years old.

**Spread:** The data range from about 20 to about 80, so the approximate range equals 80 – 20 = 60.

**Outliers:** There seem to be two probable outliers to the far right and possibly a third around 62 years old.

You can see how informative it is to know “what to look at” in a histogram.

**Learn By Doing:** Shapes of Distributions (Best Actor Oscar Winners)

The following exercises provide more practice with shapes of distributions for one quantitative variable.

**Did I Get This?:** Shapes of Distributions

## Let’s Summarize

- When examining the distribution of a quantitative variable, one should describe the overall pattern of the data (shape, center, spread), and any deviations from the pattern (outliers).

- When describing the shape of a distribution, one should consider:
  - Symmetry/skewness of the distribution.
  - Peakedness (modality): the number of peaks (modes) the distribution has.

- Not all distributions have a simple, recognizable shape.

- Outliers are data points that fall outside the overall pattern of the distribution and need further research before continuing the analysis.

- It is always important to interpret what the features of the distribution mean in the context of the data.