**Related SAS Tutorials**

- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT

**Related SPSS Tutorials**

- 5B – (2:29) Creating Histograms and Boxplots

The idea is to break the range of values into intervals and count how many observations fall into each interval.

Here are the exam grades of 15 students:

**88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73**

We first need to break the range of values into intervals (also called “bins” or “classes”).

In this case, since our dataset consists of exam scores, it makes sense to choose intervals that correspond to the range of a letter grade, 10 points wide: [40, 50), [50, 60), …, [90, 100).

By counting how many of the 15 observations fall in each of the intervals, we get the following table:

Score | Count |
---|---|
[40-50) | 1 |
[50-60) | 2 |
[60-70) | 4 |
[70-80) | 5 |
[80-90) | 2 |
[90-100) | 1 |

Note: The observation 60 was counted in the 60-70 interval. See comment 1 below.
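The counting step can also be sketched in a few lines of Python. This is only an illustration, not the method used by the course software; the function name `count_in_bins` and its parameters are made up for this example.

```python
# Exam grades from the text.
grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]

def count_in_bins(values, start, stop, width):
    """Count values in half-open intervals [lo, lo + width) covering [start, stop)."""
    counts = {(lo, lo + width): 0 for lo in range(start, stop, width)}
    for v in values:
        if not start <= v < stop:
            raise ValueError(f"{v} is outside [{start}, {stop})")
        # Integer division finds the lower endpoint of the interval holding v.
        lo = start + ((v - start) // width) * width
        counts[(lo, lo + width)] += 1
    return counts

freq = count_in_bins(grades, 40, 100, 10)
for (lo, hi), n in freq.items():
    print(f"[{lo}-{hi}) | {n}")
```

Because the lower endpoint is found with integer division, a score of exactly 60 lands in [60-70), matching the note above.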

To construct the histogram from this table we plot the intervals on the X-axis, and show the number of observations in each interval (frequency of the interval) on the Y-axis, which is represented by the height of a rectangle located above the interval:

**The previous table can also be turned into a relative frequency table using the following steps:**

- Add a row on the bottom and include the total number of observations in the dataset that are represented in the table.
- Add a column, at the end of the table, and calculate the relative frequency for each interval, by dividing the number of observations in each row by the total number of observations.
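As a rough sketch, the same two steps can be carried out in Python. The variable names are illustrative, and rounding to two decimal places is an assumption made here for display.

```python
# Frequency table from the exam-grade example.
counts = {
    "[40-50)": 1,
    "[50-60)": 2,
    "[60-70)": 4,
    "[70-80)": 5,
    "[80-90)": 2,
    "[90-100)": 1,
}

# Step 1: total number of observations (the added bottom row).
total = sum(counts.values())

# Step 2: relative frequency = count / total (the added column).
rel_freq = {interval: round(n / total, 2) for interval, n in counts.items()}

for interval, n in counts.items():
    print(f"{interval} | {n} | {rel_freq[interval]:.2f}")
print(f"Total | {total} |")
```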

These two steps are illustrated in red in the following frequency distribution table:

**It is also possible to determine the number of scores in an interval if you have the total number of observations and the relative frequency for that interval.**

- For instance, suppose there are 15 scores (or observations) in a set of data and the relative frequency for an interval is 0.13.
- To determine the number of scores in that interval, multiply the total number of observations by the relative frequency and round to the nearest whole number: 15*0.13 = 1.95, which rounds to 2 observations.
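The same calculation can be sketched for every interval at once. We assume here that the relative frequencies were rounded to two decimals, and we use Python's built-in `round`, which rounds to the nearest whole number (so 1.95 becomes 2).

```python
total = 15
# Relative frequencies for [40-50), [50-60), ..., [90-100), rounded to 2 decimals.
rel_freq = [0.07, 0.13, 0.27, 0.33, 0.13, 0.07]

# Multiply by the total and round to the nearest whole number.
recovered_counts = [round(total * rf) for rf in rel_freq]
print(recovered_counts)
```

The recovered counts add up to the original 15 observations, which is a quick check that the rounding did not lose or invent an observation.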

**A relative frequency table, like the one above, can be used to determine the frequency of scores occurring at or across intervals.**

Here are some examples, using this frequency table:

**What is the percentage of exam scores that were 70 and up to, but not including, 80?**

- To determine the answer, we look at the relative frequency associated with the [70-80) interval.
- The relative frequency is 0.33; to convert to a percentage, multiply by 100: 0.33*100 = 33, or 33%.

**What is the percentage of exam scores that are at least 70? To determine the answer, we need to:**

- Add together the relative frequencies for the intervals that contain scores of 70 or above.
- Thus, we need to add together the relative frequencies for [70-80), [80-90), and [90-100): 0.33 + 0.13 + 0.07 = 0.53.
- To get the percentage, multiply the calculated relative frequency by 100.
- In this case, it would be 0.53*100 = 53, or 53%.
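Both questions can be checked with a short sketch. The dictionary below simply restates the relative frequencies from the exam example, keyed by each interval's (lower, upper) endpoints; the names are illustrative.

```python
# Relative frequencies keyed by (lower, upper) endpoints of each [lower, upper) interval.
rel_freq = {
    (40, 50): 0.07,
    (50, 60): 0.13,
    (60, 70): 0.27,
    (70, 80): 0.33,
    (80, 90): 0.13,
    (90, 100): 0.07,
}

# Percentage of scores from 70 up to, but not including, 80.
pct_70_to_80 = round(rel_freq[(70, 80)] * 100)

# Percentage of scores that are at least 70: add every interval starting at 70 or higher.
at_least_70 = sum(rf for (lo, hi), rf in rel_freq.items() if lo >= 70)
pct_at_least_70 = round(at_least_70 * 100)

print(pct_70_to_80, pct_at_least_70)
```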

**Study the histogram and the table again and answer the following question.**

**Comments:**

- It is very important that each observation be counted only in one interval. For the most part, it is clear which interval an observation falls in. However, in our example, we needed to decide whether to include 60 in the interval 50-60, or the interval 60-70, and we chose to count it in the latter.
- In fact, this decision is captured by the way we wrote the intervals. If you scroll up and look at the table, you'll see that we wrote the intervals in a peculiar way: [40-50), [50-60), [60-70), etc.
- The square bracket means “including” and the parenthesis means “not including”. For example, [50,60) is the interval from 50 to 60, including 50 and not including 60; [60,70) is the interval from 60 to 70, including 60, and not including 70, etc.
- It really does not matter how you decide to set up your intervals, as long as you are consistent.
- When you look at a histogram such as the one above it is important to know that values falling on the border are only counted in one interval, even if you do not know which way this was done for a particular graph.
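To make the two possible border conventions concrete, here is a small sketch. Both helper functions are hypothetical and written only for this illustration; either convention is fine as long as it is applied consistently.

```python
import math

def left_closed_bin(x, width=10):
    """Interval [lo, hi): includes the lower endpoint, excludes the upper one."""
    lo = (x // width) * width
    return (lo, lo + width)

def right_closed_bin(x, width=10):
    """Interval (lo, hi]: excludes the lower endpoint, includes the upper one."""
    hi = math.ceil(x / width) * width
    return (hi - width, hi)

# A border value such as 60 is counted once under either convention,
# but in a different interval.
print(left_closed_bin(60))   # the convention used in our table
print(right_closed_bin(60))  # the alternative convention
# A value away from the border lands in the same interval either way.
print(left_closed_bin(65), right_closed_bin(65))
```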

- When data are displayed in a histogram, some information is lost. Note that by looking at the histogram:
  - we *can* answer: “How many students scored 70 or above?” (5+2+1=8)
  - but we *cannot* answer: “What was the lowest score?” All we can say is that the lowest score is somewhere between 40 and 50.

- Obviously, we could have chosen to break the data into intervals differently — for example: [45, 50), [50, 55), [55, 60) etc.
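As an illustration of this alternative choice, the sketch below re-counts the same 15 grades in intervals 5 points wide starting at 45 and prints a rough text histogram. The starting point, width, and variable names are just the values from this example.

```python
from collections import Counter

grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]
start, stop, width = 45, 100, 5

# Lower endpoint of the half-open interval [lo, lo + width) containing each grade.
by_lower_endpoint = Counter(start + ((g - start) // width) * width for g in grades)

for lo in range(start, stop, width):
    n = by_lower_endpoint[lo]  # Counter returns 0 for empty intervals
    print(f"[{lo}, {lo + width}) {'*' * n}")
```

With narrower intervals, two of them ([80, 85) and [90, 95)) turn out to be empty, a detail that the 10-point intervals hide.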

To see how our choice of bins or intervals affects a histogram, you can use the applet linked below, which lets you change the intervals dynamically.

**Question:** How do I know what interval width to choose?

**Answer:** There are many valid choices for interval widths and starting points. There are a few rules of thumb used by software packages to find optimal values. In this course, we will rely on a statistical package to produce the histogram for us, and we will focus instead on describing and summarizing the distribution as it appears from the histogram.
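For the curious, here is a sketch of one such rule of thumb, Sturges' rule, which suggests about log2(n) + 1 intervals for n observations. This text does not say which rule any particular package uses, so treat the choice of Sturges' rule here as an assumption made purely for illustration.

```python
import math

grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]
n = len(grades)

# Sturges' rule: number of intervals k = ceil(log2(n)) + 1.
k = math.ceil(math.log2(n)) + 1

# Equal-width intervals spanning the observed range of the data.
width = (max(grades) - min(grades)) / k

print(f"{k} intervals, each about {width:.1f} points wide")
```

For the 15 exam grades this suggests intervals close to 10 points wide, not far from the letter-grade intervals we chose by hand.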

The following exercises provide more practice working with histograms created from a single quantitative variable.

The **stemplot** (also called stem and leaf plot) is another graphical display of the distribution of a quantitative variable.

To create a **stemplot**, the idea is to separate each data point into a stem and leaf, as follows:

- The leaf is the right-most digit.
- The stem is everything except the right-most digit.
- So, if the data point is 34, then 3 is the stem and 4 is the leaf.
- If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.
- Note: For this to work, ALL data points should be rounded to the same number of decimal places.
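The splitting rule can be sketched as follows. The helper `split_stem_leaf` and its `decimals` parameter are hypothetical names used only here; formatting every value with the same number of decimal places is what the note above requires.

```python
def split_stem_leaf(x, decimals=0):
    """Split a number into (stem, leaf) strings after fixing its decimal places."""
    text = f"{x:.{decimals}f}"      # e.g. 3.41 with decimals=2 -> "3.41"
    stem, leaf = text[:-1], text[-1]  # leaf is the right-most digit
    # A single-digit value such as 7 has no digits left for the stem; use "0".
    return (stem or "0"), leaf

print(split_stem_leaf(34))        # stem "3", leaf "4"
print(split_stem_leaf(3.41, 2))   # stem "3.4", leaf "1"
```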

We will continue with the Best Actress Oscar winners example (Link to the Best Actress Oscar Winners data).

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

To make a stemplot:

- Separate each observation into a stem and a leaf.
- Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
- Go through the data points, and write each leaf in the row to the right of its stem.
- Rearrange the leaves in an increasing order.
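These four steps can be sketched in Python for the Best Actress ages. The function name `stemplot` is illustrative, and keeping the empty stem 5 in the output (so that the gap in the data stays visible) is a choice made for this sketch.

```python
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

def stemplot(values):
    """Return {stem: sorted leaves} for whole-number data, stems in increasing order."""
    # Stems in a column, smallest first (empty stems are kept).
    stems = {s: [] for s in range(min(values) // 10, max(values) // 10 + 1)}
    # Separate each observation into stem and leaf and record the leaf.
    for v in values:
        stems[v // 10].append(v % 10)
    # Rearrange the leaves in increasing order.
    for leaves in stems.values():
        leaves.sort()
    return stems

plot = stemplot(ages)
for stem, leaves in plot.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```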

Note: When some of the stems hold a large number of leaves, we can split each stem into two: one holding the leaves 0-4, and the other holding the leaves 5-9. A statistical software package will often do the splitting for you, when appropriate.
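A minimal sketch of splitting stems for the same ages is shown below. The function name and the row labels "0-4" and "5-9" are made up for this example, and software packages may lay out split stems differently.

```python
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

def split_stemplot(values):
    """Return {(stem, half): sorted leaves}, where half is "0-4" or "5-9"."""
    rows = {}
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        rows[(stem, "0-4")] = []
        rows[(stem, "5-9")] = []
    # Going through the sorted data keeps each row's leaves in increasing order.
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = "0-4" if leaf <= 4 else "5-9"
        rows[(stem, half)].append(leaf)
    return rows

rows = split_stemplot(ages)
for (stem, half), leaves in rows.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```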

Note that when rotated 90 degrees counterclockwise, the stemplot visually resembles a histogram:

The stemplot has additional unique features:

- It preserves the original data.
- It sorts the data (which will become very useful in the next section).

**You will not need to create these plots by hand but you may need to be able to discuss the information they contain.**

To see more stemplots, use the interactive applet we introduced earlier.

In particular, notice how the raw data are rounded and look at the stemplot with and without split stems.

**Comments: ABOUT DOTPLOTS**

- There is another type of display that we can use to summarize a quantitative variable graphically — the dotplot.
- The dotplot, like the stemplot, shows each observation, but displays it with a dot rather than with its actual value.
- We will not use these in this course, but you may see them occasionally in practice, and they are relatively easy to create by hand.
- Here is the dotplot for the ages of Best Actress Oscar winners.
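A rough text version of such a dotplot can be sketched in Python. This is only an approximation for illustration (one column per age from the smallest to the largest, one dot per observation), not a reproduction of the course figure.

```python
from collections import Counter

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

counts = Counter(ages)
lo, hi = min(ages), max(ages)
tallest = max(counts.values())

# Build the plot from the top row down: a dot appears in a column
# when that age occurs at least `level` times.
rows = []
for level in range(tallest, 0, -1):
    rows.append("".join("." if counts[a] >= level else " " for a in range(lo, hi + 1)))

print("\n".join(rows))
# Axis labels: smallest age under the first column, largest under the last.
print(str(lo) + str(hi).rjust(hi - lo + 1 - len(str(lo))))
```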

**Question:** How do we know which graph to use: the histogram, stemplot, or dotplot?

**Answer:** Since for the most part we are not going to deal with very small data sets in this course, we will generally display the distribution of a quantitative variable using a histogram generated by a statistical software package.

- The histogram is a graphical display of the distribution of a quantitative variable. It plots the number (count) of observations that fall in intervals of values.

- The stemplot is a simple but useful visual display of a quantitative variable. Its principal virtues are:
  - Easy and quick to construct for small, simple datasets.
  - Retains the actual data.
  - Sorts (ranks) the data.