Describing Distributions
Related SAS Tutorials
- 5A – (3:01) Numeric Measures using PROC MEANS
- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT
- 5C – (5:41) Creating QQ-Plots and other plots using UNIVARIATE
Related SPSS Tutorials
- 5A – (8:00) Numeric Measures using EXPLORE
- 5B – (2:29) Creating Histograms and Boxplots
- 5C – (2:31) Creating QQ-Plots and PP-Plots
Features of Distributions of Quantitative Variables
Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern.
More specifically, we should consider the following features of the distribution of one quantitative variable:
Shape
When describing the shape of a distribution, we should consider:
- Symmetry/skewness of the distribution.
- Peakedness (modality) — the number of peaks (modes) the distribution has.
We distinguish between:
Symmetric Distributions
A distribution is called symmetric if, as in the histograms above, the distribution forms an approximate mirror image with respect to the center of the distribution.
The center of the distribution is easy to locate, and both tails of the distribution are approximately the same length.
Note that all three distributions are symmetric, but are different in their modality (peakedness).
- The first distribution is unimodal — it has one mode (roughly at 10) around which the observations are concentrated.
- The second distribution is bimodal — it has two modes (roughly at 10 and 20) around which the observations are concentrated.
- The third distribution is roughly flat, or uniform. It has no modes, that is, no value around which the observations are concentrated. Instead, the observations are spread roughly evenly across the different values.
Skewed Right Distributions
Note that in a skewed right distribution, the bulk of the observations are small/medium, with a few observations that are much larger than the rest.
- An example of a real-life variable that has a skewed right distribution is salary. Most people earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes, etc.) whose salaries are spread along a large range (long “tail”) of higher values.
Skewed Left Distributions
Note that in a skewed left distribution, the bulk of the observations are medium/large, with a few observations that are much smaller than the rest.
- An example of a real-life variable that has a skewed left distribution is age of death from natural causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases happening at younger ages.
Comments:
- Distributions with more than two peaks are generally called multimodal.
- Bimodal or multimodal distributions can be evidence that two distinct groups are represented.
- Unimodal, bimodal, and multimodal distributions may or may not be symmetric.
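Returning to symmetry and skewness, the direction of skew can also be checked numerically. The sketch below uses the moment coefficient of skewness, a common summary that this section does not itself define. The function name `skewness` and all data values are made up for illustration. A positive result suggests a longer right tail, a negative result a longer left tail, and a value near zero suggests rough symmetry.

```python
from statistics import fmean


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical data, not taken from the text.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]  # a few much larger values
left_skewed = [-x for x in right_skewed]         # mirror image of the above
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3))
```

A single number like this is only a complement to the histogram. It says nothing about modality, so a plot is still needed to describe the shape fully.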
Here is an example. A medium-sized neighborhood 24-hour convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the data.
Note that the overall shape of the distribution is skewed to the right with a clear mode around $25. In addition, it has another (smaller) “peak” (mode) around $50-55.
The majority of the customers spend around $25 but there is a cluster of customers who enter the store and spend around $50-55.
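Counting peaks by eye, as in the example above, can be imitated on the bin counts of a histogram. The following is a minimal sketch, assuming the counts are already available as a list. The function name `count_peaks` and the counts are hypothetical and are not the convenience store data. A bin counts as a peak only if it is strictly taller than both of its neighbors, so a perfectly flat histogram has no peaks.

```python
def count_peaks(counts):
    """Count bins that are strictly taller than both neighboring bins."""
    # Pad both ends so the first and last bins can be compared like any other.
    padded = [float("-inf"), *counts, float("-inf")]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Hypothetical bin counts for three shapes.
unimodal = [1, 4, 9, 4, 1]
bimodal = [2, 8, 3, 7, 2]
uniform = [5, 5, 5, 5, 5]

print(count_peaks(unimodal), count_peaks(bimodal), count_peaks(uniform))
```

Real histograms are noisy, so a strict rule like this tends to report many small bumps as peaks. In practice, whether a bump counts as a mode remains a judgment made from the plot.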
Center
One way to define the center is as the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.
Another common way to measure the center of a distribution is to use the average value.
From looking at the histogram we can get only a rough estimate for the center of the distribution. More exact ways of finding measures of center will be discussed in the next section.
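The two notions of center described above correspond to the median (the value that splits the observations in half) and the mean (the average). A minimal sketch using Python's standard `statistics` module follows. The data values are made up for illustration and include one unusually large observation.

```python
from statistics import fmean, median

# Hypothetical observations with one much larger value.
values = [2, 3, 3, 4, 5, 6, 30]

center_median = median(values)  # middle value of the sorted data
center_mean = fmean(values)     # arithmetic average

# Check the "half below, half above" description of the median.
below = sum(1 for v in values if v < center_median)
above = sum(1 for v in values if v > center_median)

print("median:", center_median)
print("mean:", round(center_mean, 2))
print("below / above median:", below, above)
```

In this sketch the single large value pulls the mean above the median, while the median still has the same number of observations on each side.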
Spread
From looking at the histogram, we can approximate the smallest observation (min), and the largest observation (max), and thus approximate the range. (More exact ways of finding measures of spread will be discussed soon.)
Outliers
Outliers are observations that fall outside the overall pattern of the distribution. For example, the following histogram represents a distribution with a highly probable outlier:
Example: Exam Grades
As you can see from the histogram, the grades distribution is roughly symmetric and unimodal with no outliers.
The center of the grades distribution is roughly 70 (7 students scored below 70, and 8 students scored above 70).
- Approximate min: 45 (the middle of the lowest interval of scores)
- Approximate max: 95 (the middle of the highest interval of scores)
- Approximate range: 95 - 45 = 50
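The arithmetic behind these approximations can be sketched as follows. The interval endpoints (40 to 50 and 90 to 100) are an assumption chosen to be consistent with the stated midpoints of 45 and 95. The histogram's actual bin edges are not given in the text, and the helper name `midpoint` is hypothetical.

```python
def midpoint(interval):
    """Return the middle of a (low, high) histogram interval."""
    low, high = interval
    return (low + high) / 2


# Assumed bin edges, consistent with the midpoints quoted in the text.
lowest_interval = (40, 50)
highest_interval = (90, 100)

approx_min = midpoint(lowest_interval)
approx_max = midpoint(highest_interval)
approx_range = approx_max - approx_min

print(approx_min, approx_max, approx_range)  # 45.0 95.0 50.0
```

Using the midpoints is a convention for reading a histogram. The true minimum and maximum could lie anywhere inside the lowest and highest intervals, so the exact range may differ from this estimate.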
Let’s look at a new example.
Example: Best Actress Oscar Winners
To provide an example of a histogram applied to actual data, we will look at the ages of Best Actress Oscar winners from 1970 to 2001.
The histogram for the data is shown below. (Link to the Best Actress Oscar Winners data).
We will now summarize the main features of the distribution of ages as it appears from the histogram:
Shape: The distribution of ages is skewed right. We have a concentration of data among the younger ages and a long tail to the right. The vast majority of the “best actress” awards are given to young actresses, with very few awards given to actresses who are older.
Center: The data seem to be centered around 35 or 36 years old. Note that this implies that roughly half the awards are given to actresses who are younger than 35.
Spread: The data range from about 20 to about 80, so the approximate range equals 80 – 20 = 60.
Outliers: There seem to be two probable outliers to the far right and possibly a third around 62 years old.
You can see how informative it is to know “what to look at” in a histogram.
Let’s Summarize
- When examining the distribution of a quantitative variable, one should describe the overall pattern of the data (shape, center, spread), and any deviations from the pattern (outliers).
- When describing the shape of a distribution, one should consider:
  - Symmetry/skewness of the distribution.
  - Peakedness (modality): the number of peaks (modes) the distribution has.
- Not all distributions have a simple, recognizable shape.
- Outliers are data points that fall outside the overall pattern of the distribution and need further research before continuing the analysis.
- It is always important to interpret what the features of the distribution mean in the context of the data.