The “Normal” Shape

CO-4: Distinguish among different measurement scales, choose the appropriate descriptive and inferential statistical methods based on these distinctions, and interpret the results.
CO-6: Apply basic concepts of probability, random variation, and commonly used statistical probability distributions.
LO 4.4: Using appropriate graphical displays and/or numerical measures, describe the distribution of a quantitative variable in context: a) describe the overall pattern, b) describe striking deviations from the pattern
LO 4.7: Define and describe the features of the distribution of one quantitative variable (shape, center, spread, outliers).
Video: The Normal Shape (5:34)

Related SAS Tutorials

Related SPSS Tutorials

 The Standard Deviation Rule

LO 6.2: Apply the standard deviation rule to the special case of distributions having the “normal” shape.

In the previous activity we tried to help you develop better intuition about the concept of standard deviation. The rule that we are about to present, called “The Standard Deviation Rule” (also known as “The Empirical Rule”) will hopefully also contribute to building your intuition about this concept.

Consider a symmetric mound-shaped distribution:

A symmetric, mound shaped histogram

For distributions having this shape (later we will define this shape as “normally distributed”), the following rule applies:

The Standard Deviation Rule:

  • Approximately 68% of the observations fall within 1 standard deviation of the mean.
  • Approximately 95% of the observations fall within 2 standard deviations of the mean.
  • Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.

The following picture illustrates this rule:

A symmetric, mound-shaped histogram. The mean is located at the mode of the histogram (right in the middle). The middle 68% of observations fall within 1 standard deviation of the mean: the bars representing these observations have values at most 1 standard deviation from the mean. 95% of the observations fall within 2 standard deviations of the mean, which encompasses additional bars farther from the center than the middle 68%. Lastly, 99.7% of the observations fall within 3 standard deviations of the mean, taking in still more bars.

This rule provides another way to interpret the standard deviation of a distribution, and thus also provides a bit more intuition about it.
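For readers who like to see the rule in action by computer, here is a short simulation sketch (Python standard library only; the mean of 100 and standard deviation of 15 are arbitrary choices for illustration):

```python
import random
from statistics import mean, stdev

# Simulate draws from a normal distribution and check the
# Standard Deviation Rule (68% / 95% / 99.7%) empirically.
random.seed(42)  # fixed seed so the result is reproducible
data = [random.gauss(100, 15) for _ in range(100_000)]

m, s = mean(data), stdev(data)
for k, expected in [(1, 68), (2, 95), (3, 99.7)]:
    # Count observations within k standard deviations of the mean.
    within = sum(m - k * s <= x <= m + k * s for x in data)
    pct = 100 * within / len(data)
    print(f"within {k} SD: {pct:.1f}%  (rule says ~{expected}%)")
```

With a sample this large, the empirical percentages land very close to 68%, 95%, and 99.7%.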

Interactive Applet: The Standard Deviation Rule 

To see how this rule works in practice, consider the following example:

EXAMPLE: Male Height

The following histogram represents height (in inches) of 50 males. Note that the data are roughly normal, so we would like to see how the Standard Deviation Rule works for this example.

A symmetric histogram. The vertical axis is labeled "Frequency" and ranges from 0 to 7. The horizontal axis is labeled "Height" and ranges from 64 to 72. The mode of the histogram is at around x=71, y=7.

Below are the actual data, and the numerical measures of the distribution. Note that the key players here, the mean and standard deviation, have been highlighted.

Actual data, with brackets marking intervals around the mean: "[" and "]" enclose the observations within 1 standard deviation of the mean, and "(" and ")" enclose the observations within 2 standard deviations. Data: 64 (66 66 67 67 67 67 [68 68 68 68 68 68 69 69 69 69 69 70 70 70 70 70 70 70 71 71 71 71 71 71 71 72 72 72 72 72 72 73 73 73] 74 74 74 74 74 75 76 76) 77

Statistic Height
N 50
Mean 70.58
StDev 2.858
min 64
Q1 68
Median 70.5
Q3 72
Max 77

To see how well the Standard Deviation Rule works for this case, we will find what percentage of the observations falls within 1, 2, and 3 standard deviations from the mean, and compare it to what the Standard Deviation Rule tells us this percentage should be.

  • mean − SD = 67.7 and mean + SD = 73.4: this interval captures 34 of the 50 observations, or 68%. The rule predicts 68%.
  • mean − 2(SD) = 64.9 and mean + 2(SD) = 76.3: this interval captures 48 of the 50 observations, or 96%. The rule predicts 95%.
  • mean − 3(SD) = 62.0 and mean + 3(SD) = 79.2: this interval captures all 50 observations, or 100%. The rule predicts 99.7%.

It turns out the Standard Deviation Rule works very well in this example.
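The tally above can be reproduced in a few lines of Python (a sketch using only the standard library; the heights are transcribed from the data listing above):

```python
from statistics import mean, stdev

# Heights (in inches) of the 50 males, transcribed from the listing above.
heights = [64, 66, 66, 67, 67, 67, 67,
           68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69,
           70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71,
           72, 72, 72, 72, 72, 72, 73, 73, 73,
           74, 74, 74, 74, 74, 75, 76, 76, 77]

m, s = mean(heights), stdev(heights)   # 70.58 and about 2.858
for k in (1, 2, 3):
    # Count heights within k standard deviations of the mean.
    within = sum(m - k * s <= h <= m + k * s for h in heights)
    print(f"within {k} SD: {within}/50 = {100 * within / 50:.0f}%")
```

Running this reproduces the 68%, 96%, and 100% figures found above.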

The following example illustrates how we can apply the Standard Deviation Rule to variables whose distribution is known to be approximately normal.

EXAMPLE: Length of Human Pregnancy

The length of the human pregnancy is not fixed. It is known that it varies according to a distribution which is roughly normal, with a mean of 266 days, and a standard deviation of 16 days. (Source: Figures are from Moore and McCabe, Introduction to the Practice of Statistics).

First, let’s apply the Standard Deviation Rule to this case by drawing a picture:

A histogram. The X-axis is labeled "Length (days)", and it ranges from about 214 to 314 days. The mode and mean of the histogram is at x=266. The 1st Standard Deviation, or the middle 68%, spans the range [250,282]. The 2nd Standard Deviation (middle 95%) spans the range [234,298]. The 3rd Standard Deviation (middle 99.7%) spans the range[218,314].

We can now use the information provided by the Standard Deviation Rule about the distribution of the length of human pregnancy to answer some questions. For example:

  • Question: How long do the middle 95% of human pregnancies last?
    • Answer: The middle 95% of pregnancies last within 2 standard deviations of the mean, or in this case 234-298 days.
  • Question: What percent of pregnancies last more than 298 days?
    • Answer: To answer this, consider the following picture:

The area outside of the middle 95% has been shaded red. There are two red areas, one on either side of the middle 95%. Together they make up the remaining 5%, and because the normal distribution is symmetric, each is 2.5%.

Since 95% of pregnancies last between 234 and 298 days, the remaining 5% of pregnancies last either less than 234 days or more than 298 days. Since the normal distribution is symmetric, these 5% of pregnancies are divided evenly between the two tails, and therefore 2.5% of pregnancies last more than 298 days.

  • Question: How short are the shortest 2.5% of pregnancies?
    • Answer: Using the same reasoning as in the previous question, the shortest 2.5% of human pregnancies last less than 234 days.
  • Question: What percent of human pregnancies last more than 266 days?
    • Answer: Since 266 days is the mean, approximately 50% of pregnancies last more than 266 days.
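The reasoning above is just arithmetic with the mean and standard deviation; a small Python sketch makes it explicit:

```python
# Length of human pregnancy: roughly normal, mean 266 days, SD 16 days.
mean_days, sd_days = 266, 16

# Middle 95% of pregnancies: within 2 SDs of the mean.
low, high = mean_days - 2 * sd_days, mean_days + 2 * sd_days
print(f"middle 95%: {low} to {high} days")             # 234 to 298

# By symmetry, the 5% outside splits evenly between the two tails.
tail_pct = (100 - 95) / 2
print(f"longer than {high} days: about {tail_pct}%")   # about 2.5%
print(f"shorter than {low} days: about {tail_pct}%")   # about 2.5%
print(f"longer than the mean ({mean_days} days): about 50%")
```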

Here is a complete picture of the information provided by the standard deviation rule.

Standard Deviation Rule

Did I Get This?: Standard Deviation Rule

Visual Methods of Assessing Normality

LO 6.3: Use histograms and QQ-plots (or Normal Probability Plots) to visually assess the normality of distributions of quantitative variables.

The normal distribution exists in theory but rarely, if ever, in real life. Histograms provide an excellent graphical display for assessing normality. We can add a “normal curve” to the histogram showing the normal distribution with the same mean and standard deviation as our sample. The more closely the histogram follows this curve, the more closely the sample approximates a normal distribution.

In the examples below, the graph on the top is approximately normally distributed whereas the graph on the bottom is clearly skewed right.

Although there is a lot of variation, this histogram does seem to follow the overall pattern of the normal distribution which is drawn over the histogram

This graph is clearly skewed right and does not follow the general pattern of the normal curve displayed over the histogram.

Unfortunately, this method does not let us quantify how closely a distribution matches the normal distribution, but it is helpful for making qualitative judgments about whether the data approximate the normal curve.

Another common graph for assessing normality is the QQ-plot (or Normal Probability Plot). In these graphs, the percentiles or quantiles of the theoretical distribution (in this case the standard normal distribution) are plotted against those from the data. If the data match the theoretical distribution, the points will fall along a straight line. The graph below shows a distribution which closely follows a normal model.

Note: QQ-plots are not scatterplots (which we will discuss soon); they display information about only one quantitative variable, graphed against the theoretical or expected values from a normal distribution with the same mean and standard deviation as our data. Other distributions can also be used.

In this QQ-plot, the points associated with the data fit the target line very closely

In most cases the distributions that you encounter will only be approximations of the normal curve, or they will not resemble the normal distribution at all! However, it can be important to consider how well the data being analyzed approximates the normal curve since this distribution is a key assumption of many statistical analyses.
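To make the idea concrete, here is a rough Python sketch (standard library only) of what a QQ-plot measures: the sorted data values are paired with standard-normal quantiles, and points hugging a straight line indicate approximate normality. The function names here are our own illustrative choices, not standard statistical functions:

```python
import random
from statistics import NormalDist

def normal_qq_pairs(data):
    """Pair each sorted data value with the matching standard-normal
    quantile, using the common plotting positions (i + 0.5) / n."""
    n = len(data)
    theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theo, sorted(data)

def straightness(xs, ys):
    """Pearson correlation of the QQ pairs: a value near 1 means the
    points hug a straight line, i.e. the data look roughly normal."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
normal_data = [random.gauss(70, 3) for _ in range(500)]
skewed_data = [random.expovariate(1) for _ in range(500)]  # skewed right

r_normal = straightness(*normal_qq_pairs(normal_data))
r_skewed = straightness(*normal_qq_pairs(skewed_data))
print(r_normal, r_skewed)  # the normal sample sits much closer to 1
```

Statistical software draws the actual plot, but this correlation is the essence of what “the points fall along a straight line” means.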

Here are a few more examples:

EXAMPLE: Some Real Data

The following gives the QQ-plot, histogram and boxplot for variables from a dataset from a population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, who were tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.


 

Body Mass Index is definitely unimodal and symmetric and could easily have come from a population which is normally distributed.

Graph1_Symmetric


The Diabetes Pedigree Function scores were unimodal and skewed right. This data does not seem to have come from a population which is normally distributed.

Graph1_SkewedRight


The Triceps Skin Fold Thickness is basically symmetric with one extreme outlier (and one potential but mild outlier).

Be careful not to call such a distribution “skewed right,” as it is only the single outlier which really shows that pattern here. At a minimum, remove the outlier and recreate the graphs to see how skewed the rest of the data might be.

Graph1_Outlier

EXAMPLE: Randomly Generated Data

Since there were no skewed left examples in the real data, here are two randomly generated skewed left distributions. Notice that the first is less skewed left than the second and this is indicated clearly in all three plots.

Graph1_SkewedLeft

Comments:

  • Even if the population is exactly normally distributed, samples from this population can appear non-normal, especially for small sample sizes. See this document containing 21 samples of size n = 50 from a normal distribution with a mean of 200 and a standard deviation of 30. The samples that produce results which are skewed or otherwise seemingly non-normal are highlighted, but even among those not highlighted, notice the variation in shapes: Normal Samples
  • The standard deviation rule can also help in assessing normality: the closer the percentages of data points within 1, 2, and 3 standard deviations are to 68%, 95%, and 99.7%, the more closely the data fit a normal distribution.
  • In our example of male heights, we see that the histogram resembles a normal distribution and the sample percentages are very close to that predicted by the standard deviation rule.
Did I Get This?: Assessing Normality
(Optional) Reading: The Normal Distribution (≈ 500 words)

Standardized Scores (Z-Scores)

LO 4.14: Define and interpret measures of position (percentiles, quartiles, the five-number summary, z-scores).

We have already learned the standard deviation rule, which for normally distributed data, provides approximations for the proportion of data values within 1, 2, and 3 standard deviations. From this we know that approximately 5% of the data values would be expected to fall OUTSIDE 2 standard deviations.

If we calculate the standardized scores (or z-scores) for our data, it is easy to identify these unusually large or small values. To calculate a z-score, we take the individual value, subtract the mean, and then divide this difference by the standard deviation.

z = (x − x̄) / s, where x is the specified value, x̄ is the sample mean, and s is the sample standard deviation.

For any individual, the z-score tells us how many standard deviations the raw score for that individual deviates from the mean and in what direction. A positive z-score indicates the individual is above average and a negative z-score indicates the individual is below average.

Comments:

  • Standardized scores can be used to help identify potential outliers
    • For approximately normal distributions, z-scores greater than 2 or less than -2 are rare (will happen approximately 5% of the time).
    • For any distribution, z-scores greater than 4 or less than -4 are rare (will happen less than 6.25% of the time).
  • Standardized scores, along with other measures of position, are useful when comparing individuals in different datasets since the comparison takes into account the relative position of the individuals in their dataset. With z-scores, we can tell which individual has a relatively higher or lower position in their respective dataset.
  • Later in the course, we will see that this idea of standardizing is used often in statistical analyses.

EXAMPLE: Best Actress Oscar Winners

We will continue with the Best Actress Oscar winners example (Link to the Best Actress Oscar Winners data).

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

In previous examples, we identified three observations as outliers, two of which were classified as extreme outliers (ages of 61, 74, and 80).

The mean of this sample is 38.5 and the standard deviation is 12.95.

  • The z-score for the actress with age = 80 is: z = (80 − 38.5)/12.95 = 3.20

Thus, among the female Oscar winners in our sample, this actress is 3.20 standard deviations older than average.
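As a check, the z-score and the |z| > 2 outlier flags can be computed directly from the ages listed above (a Python sketch):

```python
from statistics import mean, stdev

# Ages of the Best Actress Oscar winners from the example above.
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

m, s = mean(ages), stdev(ages)          # about 38.5 and 12.95
z = (80 - m) / s
print(f"z-score for age 80: {z:.2f}")   # about 3.20

# Flag potential outliers: |z| > 2 is rare for roughly normal data.
flagged = [a for a in ages if abs((a - m) / s) > 2]
print("flagged ages:", flagged)         # [74, 80]
```

Note that the age of 61 is not flagged by this criterion (its z-score is about 1.7), even though it was identified as an outlier by the 1.5(IQR) method in earlier examples; the two rules do not always agree.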

Did I Get This?: Z-Scores