This short video elaborates upon the information displayed in a boxplot.

The original slides are not available.

This document is linked from Boxplots.

In slide 7, there is an extra “the” in the third bullet. “If we standardize an entire **the** variable, the new variable will…”

This extremely short video contains an overview of the five-number summary.

The original slides are not available.

Transcript – Live Five-Number Summary

This document is linked from Measures of Position.

This short video contains an overview of the range and IQR.

The original slides are not available.

Transcript – Live Measures of Spread Range IQR

This short video contains a discussion of the importance of considering the variation in addition to the center.

The original slides are not available.

Transcript – Live Importance of Variation

This document is linked from Measures of Spread.

**Related SAS Tutorials**

- 5A – (3:01) Numeric Measures using PROC MEANS

**Related SPSS Tutorials**

- 5A – (8:00) Numeric Measures using EXPLORE

Although not a required aspect of describing distributions of one quantitative variable, we are often interested in where a particular value falls in the distribution. Is the value unusually low or high or about what we would expect?

Answers to these questions rely on measures of position (or location). These measures give information about the distribution but also give information about how individual values relate to the overall distribution.

A common measure of position is the percentile. Although there are some mathematical considerations involved with calculating percentiles which we will not discuss, you should have a basic understanding of their interpretation.

In general, the *P*-th percentile can be interpreted as a location in the data for which approximately *P*% of the other values in the distribution fall below it and (100 – *P*)% fall above it.
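As a rough illustration of this interpretation, the sketch below counts the share of values that fall below a given score. The `scores` list and the helper `percent_below` are hypothetical and used only for illustration; this is not the formula any particular software package uses to compute percentiles.

```python
# Hypothetical exam scores, used only for illustration.
scores = [55, 60, 62, 65, 70, 71, 75, 80, 85, 90]

def percent_below(values, x):
    """Return the percentage of values that fall strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

print(percent_below(scores, 80))  # 70.0
```

Here 70% of the scores fall below 80, so a score of 80 sits at roughly the 70th percentile of this small dataset.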

The quartiles Q1 and Q3 are special cases of percentiles and thus are measures of position.

The combination of the five numbers (min, Q1, M, Q3, Max) is called the **five number summary**, and provides a quick numerical description of both the center and spread of a distribution.

Each of the values represents a measure of position in the dataset.

The min and Max provide the boundaries, and the quartiles and median provide information about the 25th, 50th, and 75th percentiles.

Standardized scores, also called z-scores, use the mean and standard deviation as the primary measures of center and spread and are therefore most useful when the mean and standard deviation are appropriate, i.e. when the distribution is reasonably symmetric with no extreme outliers.

For any individual, the **z-score** tells us how many standard deviations the raw score for that individual deviates from the mean and in what direction. A positive z-score indicates the individual is above average and a negative z-score indicates the individual is below average.

To calculate a z-score, we take the individual value and subtract the mean and then divide this difference by the standard deviation.
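A minimal sketch of this calculation follows. The function name `z_score` and the numbers passed to it (a mean of 100 and a standard deviation of 15) are assumptions chosen for illustration, not values from the course data.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Hypothetical distribution with mean 100 and standard deviation 15.
print(z_score(130, 100, 15))  # 2.0  (two standard deviations above the mean)
print(z_score(85, 100, 15))   # -1.0 (one standard deviation below the mean)
```

The sign gives the direction (above or below average) and the magnitude gives the distance in standard deviations.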

Measures of position also allow us to compare values from different distributions. For example, we can present the percentiles or z-scores of an individual’s height and weight. These two measures together would provide a better picture of how the individual fits in the overall population than either would alone.

Although measures of position are not stressed in this course as much as measures of center and spread, we have seen and will see many measures of position used in various aspects of examining the distribution of one variable and it is good to recognize them as measures of position when they appear.

**Related SAS Tutorials**

- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT

**Related SPSS Tutorials**

- 5B – (2:29) Creating Histograms and Boxplots

Now we introduce another graphical display of the distribution of a quantitative variable, the **boxplot**.

So far, in our discussion about measures of spread, some key players were:

- the extremes (min and Max), which provide the range covered by all the data; and
- the quartiles (Q1, M and Q3), which together provide the IQR, the range covered by the middle 50% of the data.

Recall that the combination of all five numbers (min, Q1, M, Q3, Max) is called the **five number summary**, and provides a quick numerical description of both the center and spread of a distribution.

We will continue with the Best Actress Oscar winners example (Link to the Best Actress Oscar Winners data).

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

The five number summary of the age of Best Actress Oscar winners (1970-2001) is:

min = 21, Q1 = 32, M = 35, Q3 = 41.5, Max = 80
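These values can be reproduced with a short sketch that takes the quartiles as the medians of the lower and upper halves of the ordered data. The helper name `five_number_summary` is an assumption for illustration, and statistical software may use slightly different quartile formulas.

```python
from statistics import median

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

def five_number_summary(data):
    """Return (min, Q1, M, Q3, Max) using the median-of-halves method.

    Assumes at least two observations.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # observations to the left of the median
    upper = ordered[-half:]   # observations to the right of the median
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

print(five_number_summary(ages))  # (21, 32.0, 35.0, 41.5, 80)
```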

To sketch the boxplot we will need to know the 5-number summary as well as identify any outliers. We will also need to locate the largest and smallest values which are not outliers. The stemplot below might be helpful as it displays the data in order.

Now that you understand what each of the five numbers means, you can appreciate how much information about the distribution is packed into the five-number summary. All this information can also be represented visually by using the boxplot.

The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five-number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion.


- The central box spans from Q1 to Q3. In our example, the box spans from 32 to 41.5. Note that the width of the box has no meaning.

- A line in the box marks the median M, which in our case is 35.

- Lines extend from the edges of the box to the smallest and largest observations that were not classified as suspected outliers (using the 1.5(IQR) criterion). In our example, we have no low outliers, so the bottom line goes down to the smallest observation, which is 21. Since we have three high outliers (61, 74, and 80), the top line extends only up to 49, which is the largest observation that has not been flagged as an outlier.

- Outliers are marked with asterisks (*).
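The pieces of the boxplot described above can be checked numerically. The sketch below assumes the usual form of the 1.5(IQR) criterion, which flags observations below Q1 – 1.5(IQR) or above Q3 + 1.5(IQR); the variable names are illustrative.

```python
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

q1, q3 = 32, 41.5                 # quartiles from the five-number summary
iqr = q3 - q1                     # 9.5
low_fence = q1 - 1.5 * iqr        # 17.75
high_fence = q3 + 1.5 * iqr       # 55.75

# Suspected outliers lie outside the fences.
outliers = sorted(a for a in ages if a < low_fence or a > high_fence)
# The lines from the box end at the most extreme non-outliers.
inside = [a for a in ages if low_fence <= a <= high_fence]

print(outliers)                   # [61, 74, 80]
print(min(inside), max(inside))   # 21 49
```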

To summarize: the following information is visually depicted in the boxplot:

- the five number summary (blue)
- the range and IQR (red)
- outliers (green)

As we learned earlier, the distribution of a quantitative variable is best represented graphically by a histogram. Boxplots are most useful when presented side-by-side for comparing and contrasting distributions from two or more groups.

So far we have examined the age distributions of Oscar winners for males and females separately. It will be interesting to compare the age distributions of actors and actresses who won best acting Oscars. To do that we will look at side-by-side boxplots of the age distributions by gender.

Recall also that we found the five-number summary and means for both distributions. For the Best Actress dataset, we did the calculations by hand. For the Best Actor dataset, we used statistical software, and here are the results:

- Actors: min = 31, Q1 = 37.25, M = 42.5, Q3 = 50.25, Max = 76
- Actresses: min = 21, Q1 = 32, M = 35, Q3 = 41.5, Max = 80

Based on the graph and numerical measures, we can make the following comparison between the two distributions:

**Center:** The graph reveals that the age distribution of the males is higher than the females’ age distribution. This is supported by the numerical measures. The median age for females (35) is lower than for males (42.5). Actually, it should be noted that even the third quartile of the females’ distribution (41.5) is lower than the median age for males. We therefore conclude that in general, actresses win the Best Actress Oscar at a younger age than actors do.

**Spread:** Judging by the range of the data, there is much more variability in the females’ distribution (range = 59) than there is in the males’ distribution (range = 45). On the other hand, if we look at the IQR, which measures the variability only among the middle 50% of the distribution, we see more spread in the ages of males (IQR = 13) than females (IQR = 9.5). We conclude that among all the winners, the actors’ ages are more alike than the actresses’ ages. However, the middle 50% of the age distribution of actresses is more homogeneous than the actors’ age distribution.

**Outliers:** We see that we have outliers in both distributions. There is only one high outlier in the actors’ distribution (76, Henry Fonda, On Golden Pond), compared with three high outliers in the actresses’ distribution.

In order to compare the average high temperatures of Pittsburgh to those in San Francisco we will look at the following side-by-side boxplots, and supplement the graph with the descriptive statistics of each of the two distributions.

Statistic | Pittsburgh | San Francisco |
---|---|---|
min | 33.7 | 56.3 |
Q1 | 41.2 | 60.2 |
Median | 61.4 | 62.7 |
Q3 | 77.75 | 65.35 |
Max | 82.6 | 68.7 |

When looking at the graph, the similarities and differences between the two distributions are striking. Both distributions have roughly the same center (medians are 61.4 for Pittsburgh, and 62.7 for San Francisco). However, the temperatures in Pittsburgh have a much larger variability than the temperatures in San Francisco (range: about 49 vs. 12; IQR: about 36.5 vs. 5).
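The range and IQR quoted for the two cities follow directly from the descriptive statistics. A small sketch, with the dictionary layout and the helper `spread` chosen only for illustration:

```python
summaries = {
    "Pittsburgh":    {"min": 33.7, "Q1": 41.2, "Q3": 77.75, "Max": 82.6},
    "San Francisco": {"min": 56.3, "Q1": 60.2, "Q3": 65.35, "Max": 68.7},
}

def spread(summary):
    """Return (range, IQR) from a five-number summary, rounded to 2 decimals."""
    data_range = summary["Max"] - summary["min"]
    iqr = summary["Q3"] - summary["Q1"]
    return round(data_range, 2), round(iqr, 2)

for city, summary in summaries.items():
    print(city, spread(summary))
```

The unrounded differences are 48.9 and 36.55 for Pittsburgh and 12.4 and 5.15 for San Francisco, which the text reports as approximate values.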

The practical interpretation of the results we obtained is that the weather in San Francisco is much more consistent than the weather in Pittsburgh, which varies a lot during the year. Also, because the temperatures in San Francisco vary so little during the year, knowing that the median temperature is around 63 is actually very informative. On the other hand, knowing that the median temperature in Pittsburgh is around 61 is practically useless, since temperatures vary so much during the year, and can get much warmer or much colder.

Note that this example provides more intuition about variability by interpreting small variability as consistency, and large variability as lack of consistency. Also, through this example we learned that the center of the distribution is more meaningful as a typical value for the distribution when there is little variability (or, as statisticians say, little “noise”) around it. When there is large variability, the center loses its practical meaning as a typical value.

- The five-number summary of a distribution consists of the median (M), the two quartiles (Q1, Q3) and the extremes (min, Max).

- The five-number summary provides a complete numerical description of a distribution. The median describes the center, and the extremes (which give the range) and the quartiles (which give the IQR) describe the spread.

- The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion. (Some software packages indicate extreme outliers with a different symbol)

- Boxplots are most useful when presented side-by-side to compare and contrast distributions from two or more groups.

**Related SAS Tutorials**

- 5A – (3:01) Numeric Measures using PROC MEANS

**Related SPSS Tutorials**

- 5A – (8:00) Numeric Measures using EXPLORE

So far we have learned about different ways to quantify the center of a distribution. A measure of center by itself is not enough, though, to describe a distribution.

Consider the following two distributions of exam scores. Both distributions are centered at 70 (the median of both distributions is approximately 70), but the distributions are quite different.

The first distribution has a much larger variability in scores compared to the second one.

In order to describe the distribution, we therefore need to supplement the graphical display not only with a measure of center, but also with a measure of the variability (or spread) of the distribution.

In this section, we will discuss the three most commonly used measures of spread:

- Range
- Inter-quartile range (IQR)
- Standard deviation

Although the **measures of center** approach the question differently, they **attempt to measure the same point in the distribution** and thus are comparable.

However, the three **measures of spread** provide very different ways to quantify the variability of the distribution and **do not try to estimate the same quantity**.

In fact, the three **measures of spread** **provide information about three different aspects of the spread** of the distribution which, together, give a more complete picture of the spread of the distribution.

The **range** covered by the data is the most intuitive measure of variability. The range is exactly the distance between the smallest data point (min) and the largest one (Max).

- Range = Max – min

**Note:** When we first looked at the histogram, and tried to get a first feel for the spread of the data, we were actually approximating the range, rather than calculating the exact range.

Here we have the Best Actress Oscar winners’ data:

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

In this example:

- min = 21 (Marlee Matlin for *Children of a Lesser God*, 1986)
- Max = 80 (Jessica Tandy for *Driving Miss Daisy*, 1989)

The range covered by all the data is 80 – 21 = 59 years.

While the range quantifies the variability by looking at the range covered by ALL the data, the **Inter-Quartile Range** or **IQR** measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data.

- **IQR** = Q3 – Q1
- **Q3** = 3rd Quartile = 75th Percentile
- **Q1** = 1st Quartile = 25th Percentile

The following picture illustrates this idea: (Think about the horizontal line as the data ranging from the min to the Max). **IMPORTANT NOTE:** **The “lines” in the following illustrations are not to scale. The equal distances indicate equal amounts of data NOT equal distance between the numeric values.**

Although we will use software to calculate the quartiles and IQR, we will illustrate the basic process to help you fully understand.

To calculate the IQR:

- Arrange the data in increasing order, and find the median M. Recall that the median divides the data, so that 50% of the data points are below the median, and 50% of the data points are above the median.

- Find the median of the lower 50% of the data. This is called the first quartile of the distribution, and the point is denoted by Q1. Note from the picture that Q1 divides the lower 50% of the data into two halves, containing 25% of the data points in each half. Q1 is called the first quartile, since one quarter of the data points fall below it.

- Repeat this again for the top 50% of the data. Find the median of the top 50% of the data. This point is called the third quartile of the distribution, and is denoted by Q3. Note from the picture that Q3 divides the top 50% of the data into two halves, with 25% of the data points in each. Q3 is called the third quartile, since three quarters of the data points fall below it.

- The middle 50% of the data falls between Q1 and Q3, and therefore: IQR = Q3 – Q1

**Comments:**

- The last picture shows that Q1, M, and Q3 divide the data into four quarters with 25% of the data points in each, where the median is essentially the second quartile. The use of IQR = Q3 – Q1 as a measure of spread is therefore particularly appropriate when the median M is used as a measure of center.

- We can define a bit more precisely what is considered the bottom or top 50% of the data. The bottom (top) 50% of the data is all the observations whose position in the ordered list is to the left (right) of the location of the overall median M. The following picture will visually illustrate this for the simple cases of n = 7 and n = 8.

Note that when n is **odd** (as in n = 7 above), the median is **not** included in either the bottom or top half of the data; when n is **even** (as in n = 8 above), the data are naturally divided into two halves.
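The rule for splitting the data into halves can be sketched for the two simple cases. The datasets 1 through 7 and 1 through 8 are hypothetical stand-ins for the n = 7 and n = 8 illustrations, and the helper `halves` is named only for this example.

```python
from statistics import median

def halves(data):
    """Split ordered data into bottom and top halves around the median.

    For odd n the median itself is left out of both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    return ordered[:half], ordered[-half:]

for data in ([1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8]):
    bottom, top = halves(data)
    q1, q3 = median(bottom), median(top)
    print(bottom, top, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```

With n = 7 the halves are 1, 2, 3 and 5, 6, 7 (the median 4 belongs to neither), giving Q1 = 2 and Q3 = 6. With n = 8 the halves are 1 to 4 and 5 to 8, giving Q1 = 2.5 and Q3 = 6.5.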

To find the IQR of the Best Actress Oscar winners’ distribution, it will be convenient to use the stemplot.

Q1 is the median of the bottom half of the data. Since there are 16 observations in that half, Q1 is the mean of the 8th and 9th ranked observations in that half:

Q1 = (31 + 33) / 2 = 32

Similarly, Q3 is the median of the top half of the data, and since there are 16 observations in that half, Q3 is the mean of the 8th and 9th ranked observations in that half:

Q3 = (41 + 42) / 2 = 41.5

IQR = 41.5 – 32 = 9.5

Note that in this example, the range covered by all the ages is 59 years, while the range covered by the middle 50% of the ages is only 9.5 years. While the whole dataset is spread over a range of 59 years, the middle 50% of the data is packed into only 9.5 years. Looking again at the histogram will illustrate this:

**Comment:**

- Software packages use different formulas to calculate the quartiles Q1 and Q3. This should not worry you, as long as you understand the idea behind these concepts. For example, here are the quartile values provided by three different software packages for the age of best actress Oscar winners:

(The quartile output from R, Minitab, and Excel is not reproduced here.)

Q1 and Q3 as reported by the various software packages differ from each other and are also slightly different from the ones we found here. This should not worry you.

There are different acceptable ways to find the median and the quartiles. These can give different results occasionally, especially for datasets where n (the number of observations) is fairly small.

As long as you know what the numbers mean, and how to interpret them in context, it doesn’t really matter much what method you use to find them, since the differences are negligible.
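To see how the choice of formula matters, the sketch below applies two conventions available in Python's standard library (`statistics.quantiles` with `method="exclusive"` and `method="inclusive"`) to the Best Actress ages. This is only an illustration of the general point; it makes no claim about which convention any other package uses.

```python
from statistics import quantiles

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

# quantiles(..., n=4) returns the three cut points Q1, M, Q3.
for method in ("exclusive", "inclusive"):
    q1, m, q3 = quantiles(ages, n=4, method=method)
    print(method, q1, m, q3)
# exclusive 31.5 35.0 41.75
# inclusive 32.5 35.0 41.25
```

Both results differ slightly from the values found by hand (Q1 = 32, Q3 = 41.5), while the median is 35 in every case.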

So far, we have introduced two measures of spread: the range (covered by all the data) and the inter-quartile range (IQR), which looks at the range covered by the middle 50% of the distribution. We also noted that the IQR should be paired as a measure of spread with the median as a measure of center.

We now move on to another measure of spread, the **standard deviation**, which quantifies the spread of a distribution in a completely different way.

The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from their mean. The standard deviation gives the average (or typical) distance between a data point and the mean.

There are many notations for the standard deviation: SD, s, Sd, StDev. Here, we’ll use **SD** as an abbreviation for standard deviation, and use s as the symbol.

The **sample standard deviation formula** is:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

where,

s = sample standard deviation

n = number of scores in sample

$\sum$ = sum of…

and

$\bar{x}$ = sample mean

In order to get a better understanding of the standard deviation, it would be useful to see an example of how it is calculated. In practice, we will use a computer to do the calculation.

The following are the number of customers who entered a video store in 8 consecutive hours:

7, 9, 5, 13, 3, 11, 15, 9

To find the standard deviation of the number of hourly customers:

**Find the mean, x-bar, of your data:**

(7 + 9 + 5 + 13 + 3 + 11 + 15 + 9)/8 = 9

**Find the deviations from the mean:**

- The differences between each observation and the mean here are

(7 – 9), (9 – 9), (5 – 9), (13 – 9), (3 – 9), (11 – 9), (15 – 9), (9 – 9)

-2, 0, -4, 4, -6, 2, 6, 0

- Since the standard deviation attempts to measure the average (typical) distance between the data points and their mean, it would make sense to average the deviations we obtained.
- **Note, however, that the sum of the deviations is zero.**
- This is always the case, and is the reason why we need a more complex calculation.
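A short derivation shows why the deviations always sum to zero. It uses only the definition of the sample mean, $\bar{x} = \sum x / n$, which means $\sum x = n\bar{x}$:

$$\sum (x - \bar{x}) = \sum x - n\bar{x} = n\bar{x} - n\bar{x} = 0$$

For the video store data, $\sum x = 72$ and $n\bar{x} = 8 \times 9 = 72$, so the deviations sum to $72 - 72 = 0$.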

**To solve the previous problem, in our calculation, we square each of the deviations.**

(-2)^{2}, (0)^{2}, (-4)^{2}, (4)^{2}, (-6)^{2}, (2)^{2}, (6)^{2}, (0)^{2}

4, 0, 16, 16, 36, 4, 36, 0

**Sum the squared deviations and divide by *n* – 1:**

(4 + 0 + 16 + 16 + 36 + 4 + 36 + 0)/(8 – 1)

(112)/(7) = 16

- The reason we divide by *n* – 1 will be discussed later.
- This value, the sum of the squared deviations divided by *n* – 1, is called the **variance**. However, the variance is not used as a measure of spread directly as the units are the square of the units of the original data.

**The standard deviation of the data is the square root of the variance calculated in step 4:**

- In this case, we have the square root of 16 which is 4. We will use the lower case letter *s* to represent the standard deviation.

*s* = 4

- We take the square root to obtain a measure which is in the original units of the data. The units of the variance of 16 are in “squared customers” which is difficult to interpret.
- The units of the standard deviation are in “customers” which makes this measure of variation more useful in practice than the variance.
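The whole calculation can be followed in a few lines of code. This is a sketch of the hand calculation above, with `statistics.stdev` from Python's standard library included only as a cross-check; the variable names are illustrative.

```python
import math
from statistics import stdev

customers = [7, 9, 5, 13, 3, 11, 15, 9]

n = len(customers)
mean = sum(customers) / n                    # 72 / 8 = 9.0
deviations = [x - mean for x in customers]   # these sum to 0
squared = [d ** 2 for d in deviations]       # 4, 0, 16, 16, 36, 4, 36, 0
variance = sum(squared) / (n - 1)            # 112 / 7 = 16.0
s = math.sqrt(variance)                      # 4.0

print(mean, sum(deviations), variance, s)    # 9.0 0.0 16.0 4.0
print(stdev(customers))                      # library value for comparison
```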

Recall that the average of the number of customers who enter the store in an hour is 9.

**The interpretation of the standard deviation is that on average, the actual number of customers who enter the store each hour is 4 away from 9.**

**Comment:** The importance of the numerical figure that we found in #4 above, called the variance (= 16 in our example), will be discussed much later in the course when we get to the inference part.

- It should be clear from the discussion thus far that the SD should be paired as a measure of spread with the mean as a measure of center.

- Note that the only way, mathematically, in which the SD = 0, is when all the observations have the same value (Ex: 5, 5, 5, … , 5), in which case, the deviations from the mean (which is also 5) are all 0. This is intuitive, since if all the data points have the same value, we have no variability (spread) in the data, and expect the measure of spread (like the SD) to be 0. Indeed, in this case, not only is the SD equal to 0, but the range and the IQR are also equal to 0. Do you understand why?

- Like the mean, the SD is strongly influenced by outliers in the data. Consider the example concerning video store customers: 3, 5, 7, 9, 9, 11, 13, 15 (data ordered). If the largest observation was wrongly recorded as 150, then the average would jump up to 25.9, and the standard deviation would jump up to SD = 50.3. Note that in this simple example, it is easy to see that while the standard deviation is strongly influenced by outliers, the IQR is not. The IQR would be the same in both cases, since, like the median, the calculation of the quartiles depends only on the order of the data rather than the actual values.
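The effect of the miscoded value in the last comment can be verified with a short sketch. The helper `iqr` uses the median-of-halves method described earlier and is named only for this illustration.

```python
from statistics import mean, median, stdev

def iqr(data):
    """IQR using the median-of-halves method (assumes at least two values)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])

correct = [3, 5, 7, 9, 9, 11, 13, 15]
miscoded = [3, 5, 7, 9, 9, 11, 13, 150]   # 15 wrongly recorded as 150

for label, data in (("correct", correct), ("miscoded", miscoded)):
    print(label, round(mean(data), 1), round(stdev(data), 1), iqr(data))
```

The mean moves from 9 to about 25.9 and the standard deviation from 4 to about 50.3, while the IQR is 6 for both datasets.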

The last comment leads to the following very important conclusion:

- Use the **mean and the standard deviation** as measures of center and spread for **reasonably symmetric distributions with no extreme outliers.**

- **For all other cases**, use **the five-number summary = min, Q1, Median, Q3, Max** (which gives the median, and easy access to the IQR and range). We will discuss the five-number summary in the next section in more detail.

- The **range** covered by the data is the most intuitive measure of spread and is exactly the distance between the smallest data point (min) and the largest one (Max).

- Another measure of spread is the **inter-quartile range (IQR)**, which is the range covered by the middle 50% of the data.

  - IQR = Q3 – Q1, the difference between the third and first quartiles.
  - The **first quartile (Q1)** is the value such that one quarter (25%) of the data points fall below it, or the median of the bottom half of the data.
  - The **third quartile (Q3)** is the value such that three quarters (75%) of the data points fall below it, or the median of the top half of the data.

- The **IQR** is generally used as a measure of spread of a distribution when the **median** is used as a measure of center.

- The **standard deviation** measures the spread by reporting **a typical (average) distance between the data points and their mean.**

- It is appropriate to use the **standard deviation** as a measure of spread with the **mean** as the measure of center.

- Since the **mean and standard deviation are highly influenced by extreme observations**, they should be used as numerical descriptions of the center and spread **only for distributions that are roughly symmetric, and have no extreme outliers. In all other situations, we prefer the 5-number summary.**