Describing Distributions

CO-4: Distinguish among different measurement scales, choose the appropriate descriptive and inferential statistical methods based on these distinctions, and interpret the results.
LO 4.4: Using appropriate graphical displays and/or numerical measures, describe the distribution of a quantitative variable in context: a) describe the overall pattern, b) describe striking deviations from the pattern
Video: Describing Distributions (2 videos, 7:38 total)

Features of Distributions of Quantitative Variables

LO 4.7: Define and describe the features of the distribution of one quantitative variable (shape, center, spread, outliers).

Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern.

More specifically, we should consider the following features of the distribution of one quantitative variable: shape, center, spread, and outliers.

Shape, center, and spread make up the overall pattern. Outliers represent deviations from that overall pattern.

Shape

When describing the shape of a distribution, we should consider:

  • Symmetry/skewness of the distribution.
  • Peakedness (modality) — the number of peaks (modes) the distribution has.

We distinguish between:

Symmetric Distributions

A distribution is called symmetric if, as in the histograms described below, the distribution forms an approximate mirror image with respect to the center of the distribution.

The center of the distribution is easy to locate, and both tails of the distribution are approximately the same length.

A symmetric, Single-peaked (Unimodal) distribution. The histogram's bars start at low values close to 0 on the left and rise to a peak where the x-axis is labeled 10. Then, the values decrease as we go right, back down to nearly 0.

A symmetric, Double-peaked (Bimodal) distribution. The histogram's bars start at low values close to 0 on the left and rise to the first peak where the x-axis is labeled 10. Then, the values decrease as we go right, back down to nearly 0 at roughly where x=15. The values increase again and peak at x=20, and then, continuing right, decrease to nearly 0.

A symmetric, Uniform distribution. Throughout the entire range of the x-axis the bars are roughly the same height, meaning the values occur with roughly the same frequency.

Note that all three distributions are symmetric, but are different in their modality (peakedness).

  • The first distribution is unimodal — it has one mode (roughly at 10) around which the observations are concentrated.
  • The second distribution is bimodal — it has two modes (roughly at 10 and 20) around which the observations are concentrated.
  • The third distribution is roughly flat, or uniform. It has no mode, that is, no value around which the observations are concentrated. Rather, we see that the observations are roughly uniformly distributed among the different values.
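To make the idea of modality concrete, here is a minimal Python sketch that counts the peaks in a list of bar heights. The bar heights are made up to resemble the three histograms described above, and `count_peaks` and `is_mirror_image` are hypothetical helpers written for this illustration, not standard functions. Real histograms are noisy, so a strict count like this tends to overcount peaks, and an exact mirror test is too demanding; treat both as rough aids to what the eye sees.

```python
def count_peaks(heights):
    """Count bars that are strictly taller than both neighbors.

    Positions beyond either end are treated as bars of height 0.
    """
    padded = [0] + list(heights) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


def is_mirror_image(heights):
    """Return True if the bar heights read the same in both directions."""
    heights = list(heights)
    return heights == heights[::-1]


# Made-up bar heights (assumptions, not data from the course).
unimodal = [1, 3, 6, 9, 6, 3, 1]
bimodal = [1, 4, 8, 4, 1, 4, 8, 4, 1]
uniform = [5, 5, 5, 5, 5, 5]

for name, bars in [("unimodal", unimodal), ("bimodal", bimodal), ("uniform", uniform)]:
    print(name, "peaks:", count_peaks(bars), "symmetric:", is_mirror_image(bars))
```

With these made-up heights the three lists have 1, 2, and 0 peaks, and each list reads the same forwards and backwards.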

Skewed Right Distributions

A Skewed-right histogram. As we proceed from left to right across the x-axis, the bars rapidly increase to the peak of the histogram, located at roughly x=33. From there, the values slowly decrease, and the last measurement is at x=200. The bars of the histogram are barely visible above the x-axis starting at about x=150.

A distribution is called skewed right if, as in the histogram above, the right tail (larger values) is much longer than the left tail (small values).

Note that in a skewed right distribution, the bulk of the observations are small/medium, with a few observations that are much larger than the rest.

  • An example of a real-life variable that has a skewed right distribution is salary. Most people earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes, etc.) that are distributed along a large range (long “tail”) of higher values.

Skewed Left Distributions

A Skewed-Left histogram. As we proceed from left to right across the x-axis, the bars rise slowly to the peak of the histogram, located at roughly x=78. From there, the values rapidly decrease, and the last measurement is at x=90. Since the X-axis starts at 0, the peak is offset to the right of the center of the histogram.

A distribution is called skewed left if, as in the histogram above, the left tail (smaller values) is much longer than the right tail (larger values).

Note that in a skewed left distribution, the bulk of the observations are medium/large, with a few observations that are much smaller than the rest.

  • An example of a real-life variable that has a skewed left distribution is age of death from natural causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases happening at younger ages.
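The definitions of skewed right and skewed left compare the lengths of the two tails. As a rough sketch of that comparison, the Python code below measures each tail from the tallest bar to the farthest occupied bar on that side. The helper names, the made-up bar heights, and the cutoff `ratio=2.0` (how much longer one tail must be before we call the distribution skewed) are all assumptions chosen for illustration; there is no single agreed-upon cutoff.

```python
def tail_lengths(centers, heights):
    """Return (left_tail, right_tail), measured from the tallest bar."""
    occupied = [c for c, h in zip(centers, heights) if h > 0]
    peak = centers[heights.index(max(heights))]
    return peak - min(occupied), max(occupied) - peak


def describe_skew(centers, heights, ratio=2.0):
    """Label the shape by comparing tail lengths (ratio is an arbitrary cutoff)."""
    left, right = tail_lengths(centers, heights)
    if right > ratio * left:
        return "skewed right"
    if left > ratio * right:
        return "skewed left"
    return "roughly symmetric"


# Made-up histograms (assumptions, not data from the course).
centers = [10, 20, 30, 40, 50, 60, 70, 80]
right_heights = [5, 20, 12, 6, 3, 2, 1, 1]   # long tail of large values
left_heights = right_heights[::-1]           # long tail of small values
sym_centers = [10, 20, 30, 40, 50, 60, 70]
sym_heights = [1, 3, 6, 9, 6, 3, 1]

print(describe_skew(centers, right_heights))
print(describe_skew(centers, left_heights))
print(describe_skew(sym_centers, sym_heights))
```

A rule like this only mimics the visual judgment; the histogram itself remains the better guide.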

Comments:

  1. Distributions with more than two peaks are generally called multimodal.

  2. Bimodal or multimodal distributions can be evidence that two distinct groups are represented.

  3. Unimodal, bimodal, and multimodal distributions may or may not be symmetric.

Here is an example. A medium-size neighborhood 24-hour convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the data.

A histogram in which the Y-axis is labeled with units in Frequency, from 0 to 70. The X-axis is labeled in Dollars Spent, from 0 to 105. Going from left to right on the X-axis, the bars of the histogram increase to a peak at x=25, where y=70. Then, the bars decrease, but at x=45 they begin to increase again, reaching a second peak at x=50, where y=37. Then, the values decrease until the end of the histogram.

Note that the overall shape of the distribution is skewed to the right with a clear mode around $25. In addition, it has another (smaller) “peak” (mode) around $50-55.

The majority of the customers spend around $25 but there is a cluster of customers who enter the store and spend around $50-55.

Center

The center of the distribution is often used to represent a typical value.

One way to define the center is as the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.

Another common way to measure the center of a distribution is to use the average value.

From looking at the histogram we can get only a rough estimate for the center of the distribution. More exact ways of finding measures of center will be discussed in the next section.
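When the raw observations are available, both notions of center can be computed directly. The sketch below uses Python's standard `statistics` module on a small made-up list of amounts spent (the numbers are assumptions for illustration only). The median is the value that splits the observations in half, and the mean is the average.

```python
from statistics import mean, median

# Hypothetical amounts spent (in dollars) by nine customers.
spent = [12, 18, 22, 24, 25, 27, 31, 48, 54]

middle = median(spent)    # the value with half the observations on each side
average = mean(spent)     # the sum of the values divided by their count

# Check the "half below, half above" property of the median.
below = sum(1 for x in spent if x < middle)
above = sum(1 for x in spent if x > middle)

print("median:", middle, "mean:", average)
print("observations below the median:", below, "above:", above)
```

Here four observations fall below the median and four above it, while the mean is pulled somewhat higher by the two largest amounts.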

Spread

One way to measure the spread (also called variability or variation) of the distribution is to use the approximate range covered by the data.

From looking at the histogram, we can approximate the smallest observation (min), and the largest observation (max), and thus approximate the range. (More exact ways of finding measures of spread will be discussed soon.)

Outliers

Outliers are observations that fall outside the overall pattern.

For example, the following histogram represents a distribution with a highly probable outlier:

A histogram with frequency on the Y-axis. As we go from left to right on the x-axis, the frequency increases to a peak at x=5, then decreases. Eventually, we reach 0 at x=11. All values of x > 10 have a frequency of 0, except for x=15, which has a frequency greater than zero. This is an outlier.
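One informal way to spot a candidate outlier in a histogram is to look for an occupied bar that is separated from the main body by a stretch of empty bars. The Python sketch below does this for bar heights made up to match the description above (peak at x=5, no observations from x=11 to x=14, and a single occupied bar at x=15). The helper `bars_beyond_gap` is hypothetical, and a gap only flags an observation for a closer look; it does not prove that the observation is an outlier.

```python
def bars_beyond_gap(xs, heights):
    """Return x positions of occupied bars cut off from the peak by empty bars."""
    peak = heights.index(max(heights))
    flagged = []

    # Walk right from the peak; after the first empty bar, flag occupied bars.
    seen_gap = False
    for i in range(peak + 1, len(heights)):
        if heights[i] == 0:
            seen_gap = True
        elif seen_gap:
            flagged.append(xs[i])

    # Do the same walking left from the peak.
    seen_gap = False
    for i in range(peak - 1, -1, -1):
        if heights[i] == 0:
            seen_gap = True
        elif seen_gap:
            flagged.append(xs[i])

    return sorted(flagged)


# Made-up heights for x = 1, 2, ..., 15 (assumptions, not course data).
xs = list(range(1, 16))
heights = [2, 5, 9, 12, 14, 11, 8, 5, 3, 1, 0, 0, 0, 0, 1]

print(bars_beyond_gap(xs, heights))
```

For these heights the only flagged position is x=15.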

Example: Exam Grades

A histogram of the exam grade data where 1 student scored between 40 and 50, 2 students scored between 50 and 60, 4 students scored between 60 and 70, 5 students scored between 70 and 80, 2 students scored between 80 and 90, and 1 student scored between 90 and 100.

As you can see from the histogram, the grades distribution is roughly symmetric and unimodal with no outliers.

The center of the grades distribution is roughly 70 (7 students scored below 70, and 8 students scored above 70).

approximate min: 45 (the middle of the lowest interval of scores)
approximate max: 95 (the middle of the highest interval of scores)
approximate range: 95-45=50
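The arithmetic above can be reproduced from the histogram counts alone. The following Python sketch uses the intervals and counts given in the histogram description; the variable names are our own, and the minimum and maximum are approximated by interval midpoints, exactly as in the text.

```python
# Score intervals and student counts read from the exam grade histogram.
bins = [(40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
counts = [1, 2, 4, 5, 2, 1]

total = sum(counts)

# Students in intervals entirely below 70, and in intervals starting at 70 or more.
below_70 = sum(c for (lo, hi), c in zip(bins, counts) if hi <= 70)
at_or_above_70 = sum(c for (lo, hi), c in zip(bins, counts) if lo >= 70)

# Approximate min and max by the midpoints of the lowest and highest occupied intervals.
midpoints = [(lo + hi) / 2 for lo, hi in bins]
occupied = [m for m, c in zip(midpoints, counts) if c > 0]
approx_min, approx_max = min(occupied), max(occupied)
approx_range = approx_max - approx_min

print("students:", total, "below 70:", below_70, "70 or above:", at_or_above_70)
print("approx. min:", approx_min, "max:", approx_max, "range:", approx_range)
```

The counts split 7 and 8 around 70 out of 15 students, and the approximate range is 95 - 45 = 50, matching the values above.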

Let’s look at a new example.

Example: Best Actress Oscar Winners

To provide an example of a histogram applied to actual data, we will look at the ages of Best Actress Oscar winners from 1970 to 2001.

The histogram for the data is shown below. (Link to the Best Actress Oscar Winners data).

A histogram with Frequency on the Y-axis and Age on the X-axis. The following list describes for each age, the frequency with which actresses of that age won. x=20, y=1; x=26,y=4; x=32,y=10; x=38,y=6; x=44,y=6; x=50,y=2; x=56,y=0; x=62,y=1; x=68,y=0; x=74,y=1; x=80,y=1

We will now summarize the main features of the distribution of ages as it appears from the histogram:

Shape: The distribution of ages is skewed right. We have a concentration of data among the younger ages and a long tail to the right. The vast majority of the “best actress” awards are given to young actresses, with very few awards given to actresses who are older.

Center: The data seem to be centered around 35 or 36 years old. This implies that roughly half the awards were given to actresses younger than about 35.

Spread: The data range from about 20 to about 80, so the approximate range equals 80 – 20 = 60.

Outliers: There seem to be two probable outliers to the far right and possibly a third around 62 years old.
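As a rough check on the center and spread, we can work directly from the bar labels and frequencies in the histogram description. In the Python sketch below, `bar_holding_middle` is a hypothetical helper that walks through the bars from left to right until half of the observations have been counted. How each label relates to the ages its bar covers is not stated in the description, so the result identifies a bar, not an exact age.

```python
# Bar labels (ages) and frequencies read from the histogram description.
ages = [20, 26, 32, 38, 44, 50, 56, 62, 68, 74, 80]
freqs = [1, 4, 10, 6, 6, 2, 0, 1, 0, 1, 1]

total = sum(freqs)

# Approximate range from the lowest and highest occupied bars.
occupied = [a for a, f in zip(ages, freqs) if f > 0]
approx_range = max(occupied) - min(occupied)


def bar_holding_middle(labels, counts):
    """Return the label of the first bar at which the running count reaches half."""
    half = sum(counts) / 2
    running = 0
    for label, count in zip(labels, counts):
        running += count
        if running >= half:
            return label
    return None


center_bar = bar_holding_middle(ages, freqs)
tallest_bar = ages[freqs.index(max(freqs))]

print("awards:", total, "approx. range:", approx_range)
print("tallest bar:", tallest_bar, "bar holding the middle:", center_bar)
```

The frequencies add up to 32 awards, one for each year from 1970 to 2001, and the approximate range is 60. The tallest bar is the one labeled 32, and the halfway point is reached in the bar labeled 38. If each label marks the middle of a 6-year-wide bar (an assumption), that bar covers roughly ages 35 to 41, which agrees with a center of about 35 or 36.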

You can see how informative it is to know “what to look at” in a histogram.

The following exercises provide more practice with shapes of distributions for one quantitative variable.

Did I Get This?: Shapes of Distributions

Let’s Summarize

  • When examining the distribution of a quantitative variable, one should describe the overall pattern of the data (shape, center, spread), and any deviations from the pattern (outliers).
  • When describing the shape of a distribution, one should consider:
    • Symmetry/skewness of the distribution
    • Peakedness (modality) — the number of peaks (modes) the distribution has.
  • Not all distributions have a simple, recognizable shape.
  • Outliers are data points that fall outside the overall pattern of the distribution and need further research before continuing the analysis.
  • It is always important to interpret what the features of the distribution mean in the context of the data.