This document is linked from Outliers.

This document is linked from The Normal Shape.

From the online version of Little Handbook of Statistical Practice, this reading contains a classic paper discussing how to handle outliers (wild observations).

**Related SAS Tutorials**

- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT
- 5C – (5:41) Creating QQ-Plots and other plots using UNIVARIATE

**Related SPSS Tutorials**

- 5B – (2:29) Creating Histograms and Boxplots
- 5C – (2:31) Creating QQ-Plots and PP-Plots

In the previous activity we tried to help you develop better intuition about the concept of standard deviation. The rule that we are about to present, called “The Standard Deviation Rule” (also known as “The Empirical Rule”) will hopefully also contribute to building your intuition about this concept.

Consider a symmetric mound-shaped distribution:

For distributions having this shape (later we will define this shape as “normally distributed”), the following rule applies:

**The Standard Deviation Rule:**

- Approximately 68% of the observations fall within 1 standard deviation of the mean.

- Approximately 95% of the observations fall within 2 standard deviations of the mean.

- Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.

The following picture illustrates this rule:

This rule provides another way to interpret the standard deviation of a distribution, and thus also provides a bit more intuition about it.

To see how this rule works in practice, consider the following example:

The following histogram represents height (in inches) of 50 males. Note that the data are roughly normal, so we would like to see how the Standard Deviation Rule works for this example.

Below are the actual data, and the numerical measures of the distribution. Note that the key players here, the mean and standard deviation, have been highlighted.

Statistic | Height |
---|---|
N | 50 |
Mean | 70.58 |
StDev | 2.858 |
Min | 64 |
Q1 | 68 |
Median | 70.5 |
Q3 | 72 |
Max | 77 |

To see how well the Standard Deviation Rule works for this case, we will find what percentage of the observations falls within 1, 2, and 3 standard deviations from the mean, and compare it to what the Standard Deviation Rule tells us this percentage should be.

It turns out the Standard Deviation Rule works **very well** in this example.
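The check above can be sketched in Python. Since the 50 raw heights are not reproduced here, the sketch simulates a sample with the same mean (70.58) and standard deviation (2.858) purely to illustrate the computation; the exact percentages will vary from sample to sample.

```python
# Sketch: count what fraction of a sample falls within 1, 2, and 3 standard
# deviations of its mean, as the Standard Deviation Rule check does.
# The heights are simulated (same mean/SD as the example), not the real data.
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible
heights = [random.gauss(70.58, 2.858) for _ in range(50)]

m, s = mean(heights), stdev(heights)
pct_within = {}
for k in (1, 2, 3):
    pct_within[k] = sum(1 for h in heights if abs(h - m) <= k * s) / len(heights)
    print(f"within {k} SD: {pct_within[k]:.0%}")  # compare to 68%, 95%, 99.7%
```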

The following example illustrates how we can apply the Standard Deviation Rule to variables whose distribution is known to be approximately normal.

The length of the human pregnancy is not fixed. It is known that it varies according to a distribution which is roughly normal, with a mean of 266 days, and a standard deviation of 16 days. (Source: Figures are from Moore and McCabe, *Introduction to the Practice of Statistics*).

First, let’s apply the Standard Deviation Rule to this case by drawing a picture:

We can now use the information provided by the Standard Deviation Rule about the distribution of the length of human pregnancy to answer some questions. For example:

- Question: How long do the middle 95% of human pregnancies last?
- Answer: The middle 95% of pregnancies last within 2 standard deviations of the mean, or in this case 234-298 days.

- Question: What percent of pregnancies last more than 298 days?
- Answer: To answer this, consider the following picture. Since 95% of pregnancies last between 234 and 298 days, the remaining 5% of pregnancies last either less than 234 days or more than 298 days. Since the normal distribution is symmetric, these 5% of pregnancies are divided evenly between the two tails, and therefore 2.5% of pregnancies last more than 298 days.

- Question: How short are the shortest 2.5% of pregnancies?
- Answer: Using the same reasoning as in the previous question, the shortest 2.5% of human pregnancies last less than 234 days.

- Question: What percent of human pregnancies last more than 266 days?
- Answer: Since 266 days is the mean, approximately 50% of pregnancies last more than 266 days.

Here is a complete picture of the information provided by the standard deviation rule.
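The pregnancy calculations above reduce to simple arithmetic; here is a minimal sketch (the mean of 266 days and SD of 16 days are the figures quoted in the text):

```python
# Sketch: Standard Deviation Rule intervals for pregnancy length.
mean_days, sd_days = 266, 16

# Intervals covering the middle ~68%, ~95%, and ~99.7% of pregnancies:
intervals = {k: (mean_days - k * sd_days, mean_days + k * sd_days)
             for k in (1, 2, 3)}
print(intervals[2])           # (234, 298): the middle ~95%

# The remaining ~5% is split evenly between the two tails by symmetry,
# so ~2.5% of pregnancies last more than 298 days.
pct_longer_than_298 = (100 - 95) / 2
print(pct_longer_than_298)    # 2.5
```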

The normal distribution exists in theory but rarely, if ever, in real life. Histograms provide an excellent graphical display to help us assess normality. We can add a “normal curve” to the histogram, showing the normal distribution with the same mean and standard deviation as our sample. The closer the histogram fits this curve, the closer the sample is to normal.

In the examples below, the graph on the top is approximately normally distributed whereas the graph on the bottom is clearly skewed right.

Unfortunately, this method does not let us quantify how closely the distribution follows a normal model, but it is helpful for making qualitative judgments about whether the data approximate the normal curve.

Another common graph to assess normality is the **Q-Q plot** (or **Normal Probability Plot**). In these graphs, the percentiles or quantiles of the theoretical distribution (in this case the standard normal distribution) are plotted against those from the data. If the data matches the theoretical distribution, the graph will result in a straight line. The graph below shows a distribution which closely follows a normal model.

**Note:** QQ-plots are not scatterplots (which we will discuss soon); they display information about only one quantitative variable, graphed against the theoretical or expected values from a normal distribution with the same mean and standard deviation as our data. Other distributions can also be used.
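To make the QQ-plot idea concrete without a plotting library, here is a sketch that pairs sorted (simulated) data with standard-normal quantiles and measures how straight the resulting plot would be. The simulated data and the (i + 0.5)/n plotting positions are illustrative choices; statistical packages use several slightly different conventions.

```python
# Sketch: the computation behind a QQ-plot. If the (theoretical quantile,
# observed value) pairs lie near a straight line, the data are roughly normal.
import random
from statistics import NormalDist, mean

random.seed(1)
data = sorted(random.gauss(70, 3) for _ in range(50))   # simulated heights
n = len(data)

# Theoretical standard-normal quantiles at plotting positions (i + 0.5)/n:
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Correlation between theoretical and observed quantiles:
# values near 1 indicate a nearly straight QQ-plot.
m_t, m_d = mean(theoretical), mean(data)
num = sum((t - m_t) * (d - m_d) for t, d in zip(theoretical, data))
den = (sum((t - m_t) ** 2 for t in theoretical)
       * sum((d - m_d) ** 2 for d in data)) ** 0.5
r = num / den
print(round(r, 3))   # close to 1 for near-normal data
```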

In most cases the distributions that you encounter will only be approximations of the normal curve, or they will not resemble the normal distribution at all! However, it can be important to consider how well the data being analyzed approximates the normal curve since this distribution is a key assumption of many statistical analyses.

Here are a few more examples:

The following gives the QQ-plot, histogram and boxplot for variables from a dataset from a population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, who were tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.

Body Mass Index is definitely **unimodal** and **symmetric** and could easily have come from a population which is **normally distributed**.

The Diabetes Pedigree Function scores were unimodal and skewed right. This data does not seem to have come from a population which is normally distributed.

The Triceps Skin Fold Thickness is **basically symmetric with one extreme outlier** (and one potential but mild outlier).

**Be careful not to call such a distribution “skewed right”** as it is only the single outlier which really shows that pattern here. At a minimum remove the outlier and recreate the graphs to see how skewed the rest of the data might be.

Since there were no skewed left examples in the real data, here are two randomly generated skewed left distributions. Notice that the first is less skewed left than the second and this is indicated clearly in all three plots.

**Comments:**

- Even if the population is exactly normally distributed, samples from this population can appear non-normal, especially for small sample sizes. See this document containing 21 samples of size n = 50 from a normal distribution with a mean of 200 and a standard deviation of 30. The samples that produce results which are skewed or otherwise seemingly non-normal are highlighted, but even among those not highlighted, notice the variation in shapes: Normal Samples

- The standard deviation rule can also help in assessing normality in that the closer the percentage of data points within 1, 2, and 3 standard deviations is to that of the rule, the closer the data itself fits a normal distribution.

- In our example of male heights, we see that the histogram resembles a normal distribution and the sample percentages are very close to that predicted by the standard deviation rule.

We have already learned the Standard Deviation Rule, which, for normally distributed data, provides approximations for the proportion of data values within 1, 2, and 3 standard deviations of the mean. From this we know that approximately 5% of the data values would be expected to fall OUTSIDE 2 standard deviations.

If we calculate the standardized scores (or z-scores) for our data, it is easy to identify these unusually large or small values. To calculate a z-score, we take the individual value, subtract the mean, and then divide the difference by the standard deviation: z = (value - mean) / standard deviation.

For any individual, the z-score tells us how many standard deviations the raw score for that individual deviates from the mean and in what direction. A positive z-score indicates the individual is above average and a negative z-score indicates the individual is below average.

**Comments:**

- Standardized scores can be used to help identify potential outliers
- For approximately normal distributions, z-scores greater than 2 or less than -2 are rare (will happen approximately 5% of the time).
- For any distribution, z-scores greater than 4 or less than -4 are rare (will happen less than 6.25% of the time).

- Standardized scores, along with other measures of position, are useful when comparing individuals in different datasets since the comparison takes into account the relative position of the individuals in their dataset. With z-scores, we can tell which individual has a relatively higher or lower position in their respective dataset.

- Later in the course, we will see that this idea of standardizing is used often in statistical analyses.

We will continue with the Best Actress Oscar winners example (Link to the Best Actress Oscar Winners data).

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

In previous examples, we identified three observations as outliers (ages 61, 74, and 80), two of which (74 and 80) were classified as extreme outliers.

The mean of this sample is 38.5 and the standard deviation is 12.95.

- The z-score for the actress with age = 80 is (80 - 38.5) / 12.95 ≈ 3.20.

Thus, among our female Oscar winners from our sample, this actress is 3.20 standard deviations older than average.
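The computation behind this z-score can be sketched as follows (the mean of 38.5 and SD of 12.95 are the sample figures quoted above):

```python
# Sketch: z-score for the age-80 Best Actress winner.
def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean (signed)."""
    return (x - mean) / sd

z = z_score(80, 38.5, 12.95)
print(round(z, 2))   # 3.2: about 3.20 SDs above the mean
```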

So far we have quantified the idea of center, and we are in the middle of the discussion about measuring spread, but we haven’t really talked about a method or rule that will help us classify extreme observations as outliers. The IQR is commonly used as the basis for a rule of thumb for identifying outliers.

An observation is considered a **suspected** **outlier** or **potential outlier** if it is:

- below Q1 – 1.5(IQR) or
- above Q3 + 1.5(IQR)

The following picture (not to scale) illustrates this rule:

We will continue with the Best Actress Oscar winners example (Link to the Best Actress Oscar Winners data).

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

Recall that when we first looked at the histogram of ages of Best Actress Oscar winners, there were three observations that looked like possible outliers:

We can now use the 1.5(IQR) criterion to check whether the three highest ages should indeed be classified as potential outliers:

- For this example, we found Q1 = 32 and Q3 = 41.5 which give an IQR = 9.5
- Q1 – 1.5 (IQR) = 32 – (1.5)(9.5) = 17.75
- Q3 + 1.5 (IQR) = 41.5 + (1.5)(9.5) = 55.75

The 1.5(IQR) criterion tells us that any observation with an age that is below 17.75 or above 55.75 is considered a suspected outlier.

We therefore conclude that the observations with ages of 61, 74 and 80 should be flagged as suspected outliers in the distribution of ages. Note that since the smallest observation is 21, there are no suspected low outliers in this distribution.

An observation is considered an **EXTREME outlier** if it is:

- below Q1 – 3(IQR) or
- above Q3 + 3(IQR)

We can now use the 3(IQR) criterion to check whether any of the three suspected outliers can be classified as extreme outliers:

- For this example, we found Q1 = 32 and Q3 = 41.5 which give an IQR = 9.5
- Q1 – 3 (IQR) = 32 – (3)(9.5) = 3.5
- Q3 + 3 (IQR) = 41.5 + (3)(9.5) = 70

The 3(IQR) criterion tells us that any observation that is below 3.5 or above 70 is considered an extreme outlier.

We therefore conclude that the observations with ages 74 and 80 should be flagged as extreme outliers in the distribution of ages.

Note that since there were no suspected outliers on the low end there can be no extreme outliers on the low end of the distribution. Thus there was no real need for us to calculate the low cutoff for extreme outliers, i.e. Q1 – 3(IQR) = 3.5.
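Putting both criteria together, here is a sketch applied to the Best Actress ages. Quartiles are computed as medians of the lower and upper halves of the sorted data, which reproduces the Q1 = 32 and Q3 = 41.5 used above; other quartile conventions can give slightly different fences.

```python
# Sketch: flag suspected (1.5 IQR) and extreme (3 IQR) outliers.
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    return xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

s = sorted(ages)
half = len(s) // 2
q1, q3 = median(s[:half]), median(s[-half:])   # medians of the two halves
iqr = q3 - q1                                  # 41.5 - 32 = 9.5

suspected = [a for a in s if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]
extreme = [a for a in s if a < q1 - 3 * iqr or a > q3 + 3 * iqr]
print(suspected)   # [61, 74, 80]
print(extreme)     # [74, 80]
```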

See the histogram below, and consider the outliers individually.

- The observation with age 61 is visually much closer to the center of the data. We might have a difficult time deciding if this value is really an outlier using this graph alone.
- However, the ages of 74 and 80 are clearly far from the bulk of the distribution. We might feel very comfortable deciding these values are outliers based only on the graph.

We just practiced one way to ‘flag’ possible outliers. Why is it important to identify possible outliers, and how should they be dealt with? The answers to these questions depend on the reasons for the outlying values. Here are several possibilities:

- Even though it is an extreme value, if an outlier can be understood to have been produced by **essentially the same sort of physical or biological process** as the rest of the data, and if such extreme values are expected to **eventually occur again**, then such an outlier indicates something important and interesting about the process you’re investigating, and it **should be kept** in the data.

- If an outlier can be explained to have been produced under fundamentally **different** conditions from the rest of the data (or by a fundamentally different process), such an outlier **can be removed** from the data if your goal is to investigate only the process that produced the rest of the data.

- An outlier might indicate a **mistake** in the data (like a typo, or a measuring error), in which case it **should be corrected if possible or else removed** from the data before calculating summary statistics or making inferences from the data (and the reason for the mistake should be investigated).

**Here are examples of each of these types of outliers:**

- The following histogram displays the magnitude of 460 earthquakes in California, occurring in the year 2000, between August 28 and September 9:

**Identifying the outlier:** On the very far right edge of the display (beyond 4.8), we see a low bar; this represents one earthquake (because the bar has a height of 1) that was much more severe than the others in the data.

**Understanding the outlier:** In this case, the outlier represents a much stronger earthquake, which is relatively rarer than the smaller quakes that happen more frequently in California.

**How to handle the outlier:** For many purposes, the relatively severe quakes represented by the outlier might be the most important (because, for instance, that sort of quake has the potential to do more damage to people and infrastructure). The smaller-magnitude quakes might not do any damage, or even be felt at all. So, for many purposes it could be important to keep this outlier in the data.

- The following histogram displays the monthly percent return on the stock of Phillip Morris (a large tobacco company) from July 1990 to May 1997:

**Identifying the outlier:** On the display, we see a low bar far to the left of the others; this represents one month’s return (because the bar has a height of 1), where the value of Phillip Morris stock was unusually low.

**Understanding the outlier:** The explanation for this particular outlier is that, in the early 1990s, there were highly publicized federal hearings being conducted regarding the addictiveness of smoking, and there was growing public sentiment against the tobacco companies. The unusually low monthly value in the Phillip Morris dataset was due to public pressure against smoking, which negatively affected the company’s stock for that particular month.

**How to handle the outlier:** In this case, the outlier was due to unusual conditions during one particular month that aren’t expected to be repeated, and that were fundamentally different from the conditions that produced the values in all the other months. So in this case, it would be reasonable to remove the outlier, if we wanted to characterize the “typical” monthly return on Phillip Morris stock.

- When archaeologists dig up objects such as pieces of ancient pottery, chemical analysis can be performed on the artifacts. The chemical content of pottery can vary depending on the type of clay as well as the particular manufacturing technique. The following histogram displays the results of one such actual chemical analysis, performed on 48 ancient Roman pottery artifacts from archaeological sites in Britain:

*As appeared in Tubb, et al. (1980). “The analysis of Romano-British pottery by atomic absorption spectrophotometry.” Archaeometry, vol. 22, reprinted in Statistics in Archaeology by Michael Baxter, p. 21.*

**Identifying the outlier:** On the display, we see a low bar far to the right of the others; this represents one piece of pottery (because the bar has a height of 1), which has a suspiciously high manganous oxide value.

**Understanding the outlier:** Based on comparison with other pieces of pottery found at the same site, and based on expert understanding of the typical content of this particular compound, it was concluded that the unusually high value was most likely a typo made when the data were published in the original 1980 paper (it was typed as “.394” but it was probably meant to be “.094”).

**How to handle the outlier:** In this case, since the outlier was judged to be a mistake, it should be removed from the data before further analysis. In fact, removing the outlier is useful not only because it’s a mistake, but also because doing so reveals important structure that was otherwise hidden. This feature is evident on the next display: when the outlier is removed, the display is re-scaled so that we can now see the set of 10 pottery pieces that had almost no manganous oxide. These 10 pieces might have been made with a different potting technique, so identifying them as different from the rest is historically useful. This feature was only evident after the outlier was removed.