
**Related SAS Tutorials**

- 5A – (3:01) Numeric Measures using PROC MEANS
- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT
- 5C – (5:41) Creating QQ-Plots and other plots using UNIVARIATE

**Related SPSS Tutorials**

- 5A – (8:00) Numeric Measures using EXPLORE
- 5B – (2:29) Creating Histograms and Boxplots
- 5C – (2:31) Creating QQ-Plots and PP-Plots

Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern.

More specifically, we should consider the following features of the Distribution for One Quantitative Variable:

When describing the shape of a distribution, we should consider:

**Symmetry/skewness** of the distribution.

**Peakedness (modality)**— the number of peaks (modes) the distribution has.

We distinguish between:

A distribution is called **symmetric** if, as in the histograms above, the distribution forms an approximate mirror image with respect to the center of the distribution.

The center of the distribution is easy to locate, and both tails of the distribution are approximately the same length.

Note that all three distributions are symmetric, but are different in their **modality** (peakedness).

- The first distribution is **unimodal** — it has one mode (roughly at 10) around which the observations are concentrated.
- The second distribution is **bimodal** — it has two modes (roughly at 10 and 20) around which the observations are concentrated.
- The third distribution is kind of flat, or **uniform**. The distribution has no modes, no value around which the observations are concentrated. Rather, we see that the observations are roughly uniformly distributed among the different values.

A distribution is called **skewed right** if, as in the histogram above, the right tail (larger values) is much longer than the left tail (small values).

Note that in a skewed right distribution, the bulk of the observations are small/medium, with a few observations that are much larger than the rest.

- An example of a real-life variable that has a skewed right distribution is salary. Most people earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes etc.) that are distributed along a large range (long “tail”) of higher values.

A distribution is called **skewed left** if, as in the histogram above, the left tail (smaller values) is much longer than the right tail (larger values).

Note that in a skewed left distribution, the bulk of the observations are medium/large, with a few observations that are much smaller than the rest.

- An example of a real-life variable that has a skewed left distribution is age of death from natural causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases happening at younger ages.
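Skewness can also be checked numerically: a long right tail pulls the mean above the median, while a long left tail pulls it below. Here is a small Python sketch using a simulated salary-like variable; the exponential shape and the 40,000 scale are illustrative assumptions, not part of the examples above.

```python
import random
import statistics

random.seed(1)

# Simulate a skewed-right variable, e.g. salaries: most values are
# low/medium with a long tail of much larger values.
skewed_right = [random.expovariate(1 / 40_000) for _ in range(10_000)]

mean = statistics.mean(skewed_right)
median = statistics.median(skewed_right)

# In a skewed-right distribution the long right tail pulls the mean
# above the median; in a skewed-left distribution the opposite holds.
print(f"mean = {mean:.0f}, median = {median:.0f}")
```

A quick mean-versus-median comparison like this is only a rough check; the histogram remains the primary tool for judging shape.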

**Comments:**

- Distributions with more than two peaks are generally called **multimodal**.

- Bimodal or multimodal distributions can be evidence that two distinct groups are represented.

- Unimodal, bimodal, and multimodal distributions may or may not be symmetric.

Here is an example. A medium-sized neighborhood 24-hour convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the data.

Note that the overall shape of the distribution is skewed to the right with a clear mode around $25. In addition, it has another (smaller) “peak” (mode) around $50-55.

The majority of the customers spend around $25, but there is a cluster of customers who enter the store and spend around $50-55.

The **center** of the distribution is often used to represent a typical value.

One way to define the center is as the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.

Another common way to measure the center of a distribution is to use the average value.

From looking at the histogram we can get only a rough estimate for the center of the distribution. More exact ways of finding measures of center will be discussed in the next section.

One way to measure the **spread** (also called **variability** or **variation**) of the distribution is to use the approximate range covered by the data.

From looking at the histogram, we can approximate the smallest observation (**min**), and the largest observation (**max**), and thus approximate the **range**. (More exact ways of finding measures of spread will be discussed soon.)

For example, the following histogram represents a distribution with a highly probable outlier:

As you can see from the histogram, the grades distribution is roughly **symmetric** and **unimodal** with **no outliers**.

The **center** of the grades distribution is roughly **70** (7 students scored below 70, and 8 students scored above 70).

Measure | Approximate Value |
---|---|
min | 45 (the middle of the lowest interval of scores) |
max | 95 (the middle of the highest interval of scores) |
range | 95 − 45 = 50 |

Let’s look at a new example.

To provide an example of a histogram applied to actual data, we will look at the ages of Best Actress Oscar winners from 1970 to 2001.

The histogram for the data is shown below. (Link to the Best Actress Oscar Winners data).

We will now summarize the main features of the distribution of ages as it appears from the histogram:

**Shape:** The distribution of ages is skewed right. We have a concentration of data among the younger ages and a long tail to the right. The vast majority of the “best actress” awards are given to young actresses, with very few awards given to actresses who are older.

**Center:** The data seem to be centered around 35 or 36 years old. Note that this implies that roughly half the awards are given to actresses who are less than 35 years old.

**Spread:** The data range from about 20 to about 80, so the approximate range equals 80 – 20 = 60.

**Outliers:** There seem to be two probable outliers to the far right and possibly a third around 62 years old.

You can see how informative it is to know “what to look at” in a histogram.

The following exercises provide more practice with shapes of distributions for one quantitative variable.

- When examining the distribution of a quantitative variable, one should describe the overall pattern of the data (shape, center, spread), and any deviations from the pattern (outliers).

- When describing the shape of a distribution, one should consider:
- Symmetry/skewness of the distribution
- Peakedness (modality) — the number of peaks (modes) the distribution has.
- Not all distributions have a simple, recognizable shape.

- Outliers are data points that fall outside the overall pattern of the distribution and need further research before continuing the analysis.

- It is always important to interpret what the features of the distribution mean in the context of the data.

From the online version of Little Handbook of Statistical Practice, this reading contains examples of numerous exploratory graphical displays.


**BACKGROUND INFORMATION**

A study was conducted in order to find out whether pamphlets containing information for cancer patients are written at a level that the cancer patients can understand.

Tests were administered to measure the reading levels of 63 cancer patients, and the readability levels of 30 cancer pamphlets were evaluated based on such factors as the lengths of the sentences and the number of polysyllabic words.

Both the reading and readability levels correspond to grade levels, but patients’ reading levels of less than grade 3 and above grade 12 cannot be determined exactly. (Source: Short, Moriarty, and Cooly. (1995). “Readability of Educational Materials for Cancer Patients.” Journal of Statistics Education, v.3, n.2)

The following tables indicate the number of patients at each reading level and the number of pamphlets at each readability level.

**Comment:**

- Note that the data are presented in a grouped form; the actual readability data, for example, are: 6 6 6 7 7 7 8 8 8 8 8 8 8 8 9 9 9 9, etc.

Answer the following questions:


**Related SAS Tutorials**

- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT
- 5C – (5:41) Creating QQ-Plots and other plots using UNIVARIATE

**Related SPSS Tutorials**

- 5B – (2:29) Creating Histograms and Boxplots
- 5C – (2:31) Creating QQ-Plots and PP-Plots

In the previous activity we tried to help you develop better intuition about the concept of standard deviation. The rule that we are about to present, called “The Standard Deviation Rule” (also known as “The Empirical Rule”) will hopefully also contribute to building your intuition about this concept.

Consider a symmetric mound-shaped distribution:

For distributions having this shape (later we will define this shape as “normally distributed”), the following rule applies:

**The Standard Deviation Rule:**

- Approximately 68% of the observations fall within 1 standard deviation of the mean.

- Approximately 95% of the observations fall within 2 standard deviations of the mean.

- Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.

The following picture illustrates this rule:

This rule provides another way to interpret the standard deviation of a distribution, and thus also provides a bit more intuition about it.
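As a quick sanity check of the rule, one can simulate a large normal sample and count how many observations fall within 1, 2, and 3 standard deviations of the mean. This Python sketch uses an arbitrary illustrative mean of 100 and standard deviation of 15:

```python
import random

random.seed(0)

# Draw a large sample from a normal distribution (mean 100 and SD 15
# are arbitrary illustrative choices).
mean, sd = 100, 15
sample = [random.gauss(mean, sd) for _ in range(100_000)]

# For k = 1, 2, 3, count the proportion of observations within
# k standard deviations of the mean.
proportions = {}
for k in (1, 2, 3):
    within = sum(mean - k * sd <= x <= mean + k * sd for x in sample)
    proportions[k] = within / len(sample)

print(proportions)  # roughly 0.68, 0.95, and 0.997, as the rule predicts
```

The simulated proportions land very close to 68%, 95%, and 99.7%, which is exactly what the Standard Deviation Rule claims for normally shaped data.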

To see how this rule works in practice, consider the following example:

The following histogram represents height (in inches) of 50 males. Note that the data are roughly normal, so we would like to see how the Standard Deviation Rule works for this example.

Below are the actual data, and the numerical measures of the distribution. Note that the key players here, the mean and standard deviation, have been highlighted.

Statistic | Height |
---|---|
N | 50 |
Mean | 70.58 |
StDev | 2.858 |
Min | 64 |
Q1 | 68 |
Median | 70.5 |
Q3 | 72 |
Max | 77 |

To see how well the Standard Deviation Rule works for this case, we will find what percentage of the observations falls within 1, 2, and 3 standard deviations from the mean, and compare it to what the Standard Deviation Rule tells us this percentage should be.

It turns out the Standard Deviation Rule works **very well** in this example.

The following example illustrates how we can apply the Standard Deviation Rule to variables whose distribution is known to be approximately normal.

The length of the human pregnancy is not fixed. It is known that it varies according to a distribution which is roughly normal, with a mean of 266 days, and a standard deviation of 16 days. (Source: Figures are from Moore and McCabe, *Introduction to the Practice of Statistics*).

First, let’s apply the Standard Deviation Rule to this case by drawing a picture:

We can now use the information provided by the Standard Deviation Rule about the distribution of the length of human pregnancy to answer some questions. For example:

- Question: How long do the middle 95% of human pregnancies last?
- Answer: The middle 95% of pregnancies last within 2 standard deviations of the mean, or in this case 234-298 days.

- Question: What percent of pregnancies last more than 298 days?
- Answer: To answer this, consider the following picture. Since 95% of the pregnancies last between 234 and 298 days, the remaining 5% of pregnancies last either less than 234 days or more than 298 days. Since the normal distribution is symmetric, these 5% of pregnancies are divided evenly between the two tails, and therefore 2.5% of pregnancies last more than 298 days.

- Question: How short are the shortest 2.5% of pregnancies?
- Answer: Using the same reasoning as in the previous question, the shortest 2.5% of human pregnancies last less than 234 days.

- Question: What percent of human pregnancies last more than 266 days?
- Answer: Since 266 days is the mean, approximately 50% of pregnancies last more than 266 days.

Here is a complete picture of the information provided by the standard deviation rule.
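The arithmetic behind these answers is simple enough to sketch in a few lines of Python, using the mean of 266 days and standard deviation of 16 days from the example:

```python
# Standard Deviation Rule applied to pregnancy length:
# roughly normal with mean 266 days and SD 16 days.
mean, sd = 266, 16

# Middle 95% of pregnancies: within 2 SDs of the mean.
low, high = mean - 2 * sd, mean + 2 * sd
print(f"middle 95%: {low} to {high} days")   # 234 to 298 days

# The remaining 5% is split evenly between the two tails, so 2.5%
# last more than 298 days and 2.5% last less than 234 days.
tail_percent = (100 - 95) / 2
print(f"each tail: {tail_percent}%")         # 2.5%
```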

The normal distribution exists in theory but rarely, if ever, in real life. Histograms provide an excellent graphical display to help us assess normality. We can add a “normal curve” to the histogram which shows the normal distribution having the same mean and standard deviation as our sample. The closer the histogram fits this curve, the more nearly normal the sample.

In the examples below, the graph on the top is approximately normally distributed whereas the graph on the bottom is clearly skewed right.

Unfortunately, this method does not give us a quantitative measure of how close the distribution is to normal, but it is helpful for making qualitative judgments about whether the data approximate the normal curve.

Another common graph to assess normality is the **Q-Q plot** (or **Normal Probability Plot**). In these graphs, the percentiles or quantiles of the theoretical distribution (in this case the standard normal distribution) are plotted against those from the data. If the data matches the theoretical distribution, the graph will result in a straight line. The graph below shows a distribution which closely follows a normal model.

**Note:** QQ-plots are not scatterplots (which we will discuss soon); they display information about only one quantitative variable, graphed against the theoretical or expected values from a normal distribution with the same mean and standard deviation as our data. Other distributions can also be used.

In most cases the distributions that you encounter will only be approximations of the normal curve, or they will not resemble the normal distribution at all! However, it can be important to consider how well the data being analyzed approximates the normal curve since this distribution is a key assumption of many statistical analyses.

Here are a few more examples:

The following gives the QQ-plot, histogram and boxplot for variables from a dataset from a population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, who were tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.

Body Mass Index is definitely **unimodal** and **symmetric** and could easily have come from a population which is **normally distributed**.

The Diabetes Pedigree Function scores were unimodal and skewed right. These data do not seem to have come from a population which is normally distributed.

The Triceps Skin Fold Thickness is **basically symmetric with one extreme outlier** (and one potential but mild outlier).

**Be careful not to call such a distribution “skewed right”** as it is only the single outlier which really shows that pattern here. At a minimum, remove the outlier and recreate the graphs to see how skewed the rest of the data might be.

Since there were no skewed left examples in the real data, here are two randomly generated skewed left distributions. Notice that the first is less skewed left than the second and this is indicated clearly in all three plots.

**Comments:**

- Even if the population is exactly normally distributed, samples from this population can appear non-normal, especially for small sample sizes. See this document containing 21 samples of size n = 50 from a normal distribution with a mean of 200 and a standard deviation of 30. The samples that produce results which are skewed or otherwise seemingly non-normal are highlighted, but even among those not highlighted, notice the variation in shapes seen: Normal Samples

- The standard deviation rule can also help in assessing normality in that the closer the percentage of data points within 1, 2, and 3 standard deviations is to that of the rule, the closer the data itself fits a normal distribution.

- In our example of male heights, we see that the histogram resembles a normal distribution and the sample percentages are very close to that predicted by the standard deviation rule.

We have already learned the standard deviation rule, which, for normally distributed data, provides approximations for the proportion of data values within 1, 2, and 3 standard deviations. From this we know that approximately 5% of the data values would be expected to fall OUTSIDE 2 standard deviations.

If we calculate the standardized scores (or z-scores) for our data, it is easy to identify these unusually large or small values. To calculate a z-score, we take the individual value, subtract the mean, and then divide the difference by the standard deviation.

For any individual, the z-score tells us how many standard deviations the raw score for that individual deviates from the mean and in what direction. A positive z-score indicates the individual is above average and a negative z-score indicates the individual is below average.
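The calculation can be sketched as a one-line helper, applied here to the male-height summary statistics from the table above (mean 70.58 inches, SD 2.858 inches):

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean, and in which direction."""
    return (x - mean) / sd

# Tallest man in the sample (77 in): positive z-score, above average.
print(round(z_score(77, 70.58, 2.858), 2))   # 2.25

# Shortest man in the sample (64 in): negative z-score, below average.
print(round(z_score(64, 70.58, 2.858), 2))   # -2.3
```

Both men sit a bit beyond 2 standard deviations from the mean, which the standard deviation rule tells us should happen for only about 5% of observations in normally shaped data.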

**Comments:**

- Standardized scores can be used to help identify potential outliers
- For approximately normal distributions, z-scores greater than 2 or less than -2 are rare (will happen approximately 5% of the time).
- For any distribution, z-scores greater than 4 or less than -4 are rare (will happen less than 6.25% of the time).

- Standardized scores, along with other measures of position, are useful when comparing individuals in different datasets since the comparison takes into account the relative position of the individuals in their dataset. With z-scores, we can tell which individual has a relatively higher or lower position in their respective dataset.

- Later in the course, we will see that this idea of standardizing is used often in statistical analyses.

We will continue with the Best Actress Oscar winners example (Link to the Best Actress Oscar Winners data).

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

In previous examples, we identified three observations as outliers, two of which were classified as extreme outliers (ages of 61, 74, and 80).

The mean of this sample is 38.5 and the standard deviation is 12.95.

- The z-score for the actress with age = 80 is (80 − 38.5)/12.95 ≈ 3.20.

Thus, among our female Oscar winners from our sample, this actress is 3.20 standard deviations older than average.
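The full calculation can be reproduced from the listed ages using Python's `statistics` module; the sample standard deviation matches the 12.95 quoted above:

```python
import statistics

# Ages of the Best Actress Oscar winners from the example above.
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

mean = statistics.mean(ages)    # about 38.5
sd = statistics.stdev(ages)     # sample standard deviation, about 12.95

# z-score for the oldest winner (age 80).
z_80 = (80 - mean) / sd
print(f"mean = {mean:.1f}, SD = {sd:.2f}, z for age 80 = {z_80:.2f}")
```

A z-score above 3 is well past the "rare beyond 2" threshold discussed above, confirming that this winner is an outlier in the sample.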