**Related SAS Tutorials**

- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT

**Related SPSS Tutorials**

- 5B – (2:29) Creating Histograms and Boxplots

The idea is to break the range of values into intervals and count how many observations fall into each interval.

Here are the exam grades of 15 students:

**88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73**

We first need to break the range of values into intervals (also called “bins” or “classes”).

In this case, since our dataset consists of exam scores, it will make sense to choose intervals that typically correspond to the range of a letter grade, 10 points wide: [40,50), [50, 60), … [90, 100).

By counting how many of the 15 observations fall in each of the intervals, we get the following table:

Score | Count |
---|---|
[40-50) | 1 |
[50-60) | 2 |
[60-70) | 4 |
[70-80) | 5 |
[80-90) | 2 |
[90-100) | 1 |

Note: The observation 60 was counted in the 60-70 interval. See comment 1 below.
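The counting step can be sketched in a few lines of Python. This is a minimal illustration, not part of the course materials; the function name `count_by_interval` and its parameters are assumptions made for this example.

```python
# Exam grades from the example above.
scores = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]


def count_by_interval(values, start, stop, width):
    """Count values in half-open intervals [low, low + width)."""
    counts = {}
    for low in range(start, stop, width):
        high = low + width
        # low <= v < high: the left endpoint is included, the right one is not,
        # so a score of 60 is counted in [60-70) and never in [50-60).
        counts[(low, high)] = sum(1 for v in values if low <= v < high)
    return counts


counts = count_by_interval(scores, start=40, stop=100, width=10)
for (low, high), n in counts.items():
    print(f"[{low}-{high}) | {n}")
```

Because each comparison is `low <= v < high`, every observation lands in exactly one interval, and the counts add up to the 15 observations.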

To construct the histogram from this table, we plot the intervals on the X-axis and draw a rectangle above each interval. The height of the rectangle, read on the Y-axis, is the number of observations in that interval (the frequency of the interval).
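As a rough text-only stand-in for the plot, the sketch below draws one bar per interval from the frequency table. It is only an illustration: the bars run horizontally rather than vertically, and the helper name `text_histogram` is an assumption made for this example.

```python
# Frequency table from the example above: interval label -> count.
frequency = {
    "[40-50)": 1,
    "[50-60)": 2,
    "[60-70)": 4,
    "[70-80)": 5,
    "[80-90)": 2,
    "[90-100)": 1,
}


def text_histogram(table, symbol="#"):
    """Return one line per interval; the bar length equals the frequency."""
    return [f"{label:>8} | {symbol * count}" for label, count in table.items()]


bars = text_histogram(frequency)
print("\n".join(bars))
```

The longest bar belongs to [70-80), the interval with the most observations.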

**The previous table can also be turned into a relative frequency table using the following steps:**

- Add a row on the bottom and include the total number of observations in the dataset that are represented in the table.
- Add a column, at the end of the table, and calculate the relative frequency for each interval, by dividing the number of observations in each row by the total number of observations.

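Applying these two steps gives the following relative frequency table (relative frequencies are rounded to two decimal places):

Score | Count | Relative frequency |
---|---|---|
[40-50) | 1 | 1/15 = 0.07 |
[50-60) | 2 | 2/15 = 0.13 |
[60-70) | 4 | 4/15 = 0.27 |
[70-80) | 5 | 5/15 = 0.33 |
[80-90) | 2 | 2/15 = 0.13 |
[90-100) | 1 | 1/15 = 0.07 |
Total | 15 | 1.00 |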
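The two steps can also be sketched in Python. This is a minimal illustration that assumes the counts from the frequency table and rounds each relative frequency to two decimal places; the variable names are chosen for this example only.

```python
# Counts from the frequency table above: interval label -> count.
counts = {
    "[40-50)": 1,
    "[50-60)": 2,
    "[60-70)": 4,
    "[70-80)": 5,
    "[80-90)": 2,
    "[90-100)": 1,
}

# Step 1: total number of observations represented in the table.
total = sum(counts.values())

# Step 2: relative frequency = count in the row / total number of observations.
relative = {label: round(n / total, 2) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:>8} | {n:>2} | {relative[label]:.2f}")
print(f"{'Total':>8} | {total:>2} |")
```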

**It is also possible to determine the number of scores for an interval if you have the total number of observations and the relative frequency for that interval.**

- For instance, suppose there are 15 scores (or observations) in a set of data and the relative frequency for an interval is 0.13.
- To determine the number of scores in that interval, multiply the total number of observations by the relative frequency and round to the nearest whole number: 15 × 0.13 = 1.95, which rounds to 2 observations. (Rounding is needed only because the relative frequency itself was rounded.)
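A short sketch of this calculation, assuming the two-decimal relative frequencies of the exam example (0.07, 0.13, 0.27, 0.33, 0.13, 0.07) and rounding each product to the nearest whole number:

```python
total_observations = 15

# Relative frequencies of the six intervals, rounded to two decimal places.
relative_frequencies = [0.07, 0.13, 0.27, 0.33, 0.13, 0.07]

# Multiply by the total and round to the nearest whole number,
# e.g. 15 * 0.13 = 1.95, which becomes 2.
recovered_counts = [round(total_observations * rf) for rf in relative_frequencies]

print(recovered_counts)
```

Rounding to the nearest whole number matters here: 15 × 0.07 = 1.05 corresponds to 1 observation, not 2.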

**A relative frequency table, like the one above, can be used to determine the relative frequency of scores within one interval or across several intervals.**

Here are some examples, using this frequency table:

**What is the percentage of exam scores that were at least 70 but less than 80?**

- To determine the answer, we look at the relative frequency associated with the [70-80) interval.
- The relative frequency is 0.33; to convert to a percentage, multiply by 100: 0.33 × 100 = 33, or 33%.

**What is the percentage of exam scores that are at least 70?** To determine the answer, we need to:

- Add together the relative frequencies for the intervals that contain scores of 70 or above.
- Thus, we add together the relative frequencies for [70-80), [80-90), and [90-100): 0.33 + 0.13 + 0.07 = 0.53.
- To get the percentage, multiply the calculated relative frequency by 100.
- In this case, 0.53 × 100 = 53, or 53%.
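The same answer can be checked directly from the counts. The sketch below is an illustration only; it works from the unrounded counts, so it avoids adding up rounded relative frequencies.

```python
# Counts from the frequency table: (low, high) -> count for [low, high).
counts = {(40, 50): 1, (50, 60): 2, (60, 70): 4, (70, 80): 5, (80, 90): 2, (90, 100): 1}

total = sum(counts.values())

# Scores of at least 70 fall in the intervals whose lower bound is 70 or more.
at_least_70 = sum(n for (low, high), n in counts.items() if low >= 70)

percent = 100 * at_least_70 / total
print(f"{at_least_70} of {total} scores, about {percent:.0f}%")
```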


**Comments:**

- It is very important that each observation be counted only in one interval. For the most part, it is clear which interval an observation falls in. However, in our example, we needed to decide whether to include 60 in the interval 50-60, or the interval 60-70, and we chose to count it in the latter.
- In fact, this decision is captured by the way we wrote the intervals. If you’ll scroll up and look at the table, you’ll see that we wrote the intervals in a peculiar way: [40-50), [50,60), [60,70) etc.
- The square bracket means “including” and the parenthesis means “not including”. For example, [50,60) is the interval from 50 to 60, including 50 and not including 60; [60,70) is the interval from 60 to 70, including 60, and not including 70, etc.
- It really does not matter how you decide to set up your intervals, as long as you are consistent.
- When you look at a histogram such as the one above it is important to know that values falling on the border are only counted in one interval, even if you do not know which way this was done for a particular graph.

- When data are displayed in a histogram, some information is lost. Note that by looking at the histogram
  - we *can* answer: “How many students scored 70 or above?” (5+2+1=8)
  - But we *cannot* answer: “What was the lowest score?” All we can say is that the lowest score is somewhere between 40 and 50.

- Obviously, we could have chosen to break the data into intervals differently — for example: [45, 50), [50, 55), [55, 60) etc.
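To see what a different choice of intervals does to the counts, the sketch below re-bins the same 15 exam scores into intervals 5 points wide, starting at 45. The helper name `count_by_interval` is an assumption made for this example.

```python
scores = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]


def count_by_interval(values, start, stop, width):
    """Count values in half-open intervals [low, low + width)."""
    return {
        (low, low + width): sum(1 for v in values if low <= v < low + width)
        for low in range(start, stop, width)
    }


# Intervals [45, 50), [50, 55), [55, 60), ..., [95, 100).
narrow_counts = count_by_interval(scores, start=45, stop=100, width=5)
for (low, high), n in narrow_counts.items():
    print(f"[{low}, {high}) | {n}")
```

With the narrower intervals, some bins such as [80, 85) are empty, so the resulting histogram would look more ragged than the one built from 10-point intervals.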

To see how our choice of bins or intervals affects a histogram, you can use the applet linked below, which lets you change the intervals dynamically.

**Question:** How do I know what interval width to choose?

**Answer:** There are many valid choices for interval widths and starting points. There are a few rules of thumb used by software packages to find optimal values. In this course, we will rely on a statistical package to produce the histogram for us, and we will focus instead on describing and summarizing the distribution as it appears from the histogram.
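As one illustration of such a rule of thumb, the sketch below applies Sturges' rule, which suggests about log2(n) + 1 intervals for n observations. The text does not say which rule any particular package uses, so treat this purely as an example.

```python
import math

scores = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]


def sturges_bin_count(n):
    """Sturges' rule of thumb: ceil(log2(n) + 1) intervals for n observations."""
    return math.ceil(math.log2(n) + 1)


n_bins = sturges_bin_count(len(scores))

# Spread the range of the data evenly over the suggested number of intervals.
suggested_width = (max(scores) - min(scores)) / n_bins

print(n_bins, suggested_width)
```

For the 15 exam scores, the suggested width comes out close to the 10-point intervals chosen by hand above.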

The following exercises provide more practice working with histograms created from a single quantitative variable.

The **stemplot** (also called stem and leaf plot) is another graphical display of the distribution of a quantitative variable.

To create a **stemplot**, the idea is to separate each data point into a stem and leaf, as follows:

- The leaf is the right-most digit.
- The stem is everything except the right-most digit.
- So, if the data point is 34, then 3 is the stem and 4 is the leaf.
- If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.
- Note: For this to work, ALL data points should be rounded to the same number of decimal places.
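The stem and leaf rule can be sketched as a small Python function. This is a minimal illustration that assumes each data point is given as text, already rounded to a common number of decimal places; the function name `split_stem_leaf` is an assumption made for this example.

```python
def split_stem_leaf(text):
    """Split a number written as text into (stem, leaf).

    The leaf is the right-most digit; the stem is everything before it.
    A single-digit value gets the stem "0".
    """
    text = text.strip()
    stem, leaf = text[:-1], text[-1]
    return (stem or "0", leaf)


print(split_stem_leaf("34"))    # stem 3, leaf 4
print(split_stem_leaf("3.41"))  # stem 3.4, leaf 1
```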

We will continue with the Best Actress Oscar winners example. Here are the ages of the winners:

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

To make a stemplot:

- Separate each observation into a stem and a leaf.
- Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
- Go through the data points, and write each leaf in the row to the right of its stem.
- Rearrange the leaves in an increasing order.

* When some of the stems hold a large number of leaves, we can split each stem into two: one holding the leaves 0-4, and the other holding the leaves 5-9. A statistical software package will often do the splitting for you, when appropriate.
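The steps for making a stemplot can be sketched in Python for the Oscar winners data. This is a minimal illustration that assumes non-negative whole-number data and does not split stems; the function name `stemplot` is an assumption made for this example.

```python
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]


def stemplot(values):
    """Map each stem (tens and above) to its sorted list of leaves (last digit)."""
    # Write every stem from smallest to largest, including empty ones.
    stems = {stem: [] for stem in range(min(values) // 10, max(values) // 10 + 1)}
    # Write each leaf in the row of its stem.
    for value in values:
        stems[value // 10].append(value % 10)
    # Rearrange the leaves in increasing order.
    return {stem: sorted(leaves) for stem, leaves in stems.items()}


plot = stemplot(ages)
for stem, leaves in plot.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

The stem 5 is kept even though it has no leaves, so the gap between the forties and the sixties stays visible in the display.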

Note that when rotated 90 degrees counterclockwise, the stemplot visually resembles a histogram:

The stemplot has additional unique features:

- It preserves the original data.
- It sorts the data (which will become very useful in the next section).

**You will not need to create these plots by hand but you may need to be able to discuss the information they contain.**

To see more stemplots, use the interactive applet we introduced earlier.

In particular, notice how the raw data are rounded and look at the stemplot with and without split stems.

**Comments: ABOUT DOTPLOTS**

- There is another type of display that we can use to summarize a quantitative variable graphically — the dotplot.
- The dotplot, like the stemplot, shows each observation, but displays it with a dot rather than with its actual value.
- We will not use these in this course, but you may see them occasionally in practice, and they are relatively easy to create by hand.
- Here is the dotplot for the ages of Best Actress Oscar winners.
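A rough text version of a dotplot can be sketched by printing one dot per observation next to each distinct age. This is only an illustration of the idea: a real dotplot stacks the dots above a number line.

```python
from collections import Counter

ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

# Number of observations at each distinct age.
dots = Counter(ages)

for age in sorted(dots):
    print(f"{age:>2} {'.' * dots[age]}")
```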

**Question:** How do we know which graph to use: the histogram, stemplot, or dotplot?

**Answer:** Since for the most part we are not going to deal with very small data sets in this course, we will generally display the distribution of a quantitative variable using a histogram generated by a statistical software package.

- The histogram is a graphical display of the distribution of a quantitative variable. It plots the number (count) of observations that fall in intervals of values.

- The stemplot is a simple but useful visual display of a quantitative variable. Its principal virtues are:
  - Easy and quick to construct for small, simple datasets.
  - Retains the actual data.
  - Sorts (ranks) the data.

**Related SAS Tutorials**

- 5A – (3:01) Numeric Measures using PROC MEANS
- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT
- 5C – (5:41) Creating QQ-Plots and other plots using UNIVARIATE

**Related SPSS Tutorials**

- 5A – (8:00) Numeric Measures using EXPLORE
- 5B – (2:29) Creating Histograms and Boxplots
- 5C – (2:31) Creating QQ-Plots and PP-Plots

In the previous section, we explored the distribution of a categorical variable using graphs (pie chart, bar chart) supplemented by numerical measures (percent of observations in each category).

In this section, we will explore the data collected from a **quantitative** variable, and learn how to describe and summarize the important features of its distribution.

We will learn how to display the **distribution** using **graphs** and discuss a variety of **numerical measures**.

An introduction to each of these topics follows.

To display data from one quantitative variable graphically, we can use either a **histogram** or **boxplot**.

We will also present several “by-hand” displays such as the **stemplot** and **dotplot** (although we will not rely on these in this course).

The overall pattern of the **distribution** of a quantitative variable is described by its **shape**, **center**, and **spread**.

By inspecting the histogram or boxplot, we can describe the shape of the distribution, but we can only get a rough estimate for the center and spread.

A description of the distribution of a quantitative variable must include, in addition to the **graphical display**, a more precise **numerical description** of the center and spread of the distribution.

In this section we will learn:

- how to display the **distribution of one quantitative variable** using various graphs;
- how to quantify the **center** and **spread** of the **distribution of one quantitative variable** with various numerical measures;
- some of the **properties** of those **numerical measures**;
- how to choose the **appropriate numerical measures** of **center** and **spread** to supplement the graph(s); and
- how to identify potential outliers in the **distribution of one quantitative variable**.
- We will also discuss a few **measures of position** (also called **measures of location**). These measures
  - allow us to quantify where a particular value is relative to the **distribution** of all values
  - do provide information about the distribution itself
  - also use the information **about the distribution** to learn more about an **INDIVIDUAL**

Before reading further, try this interactive applet which will give you a preview of some of the topics we will be learning about in this section on exploratory data analysis for one quantitative variable.