We have almost reached the end of our discussion of probability. We were introduced to the important concept of **random variables**, which are quantitative variables whose value is determined by the outcome of a random experiment.

We discussed discrete and continuous random variables.

We saw that all the information about a **discrete random variable** is packed into its probability distribution. Using that, we can answer probability questions about the random variable and find its **mean and standard deviation**. We ended the part on discrete random variables by presenting a special class of discrete random variables – **binomial random variables.**

As we dove into **continuous random variables**, we saw how calculations can get complicated very quickly, since probabilities associated with a continuous random variable are found by calculating **areas under its density curve**.

As an example of a continuous random variable, we presented the **normal random variable** and discussed it at length. The normal distribution is extremely important, not just because many variables in real life follow the normal distribution, but mainly because of the important role it plays in statistical inference, the ultimate goal of this course.

We learned how we can avoid calculus by using the **standard normal calculator or table** to find probabilities associated with the normal distribution, and learned how it can be used as an **approximation to the binomial** distribution under certain conditions.

A random variable is a variable whose values are numerical results of a random experiment.

- A **discrete random variable** is summarized by its probability distribution — a list of its possible values and their corresponding probabilities.

The sum of the probabilities of all possible values must be 1.

The probability distribution can be represented by a table, histogram, or sometimes a formula.

- The **probability distribution** of a random variable can be supplemented with numerical measures of the center and spread of the random variable.

**Center:** The center of a random variable is measured by its mean (which is sometimes also referred to as the **expected value**).

The mean of a random variable can be interpreted as its long run average.

The mean is a weighted average of the possible values of the random variable weighted by their corresponding probabilities.

**Spread:** The spread of a random variable is measured by its variance, or more typically by its standard deviation (the square root of the variance).

The standard deviation of a random variable can be interpreted as the typical (or long-run average) distance between the values that the random variable takes and its mean.
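The mean and standard deviation described above can be computed directly from a probability distribution. Here is a minimal sketch in Python, using a hypothetical distribution (the values and probabilities are made up for illustration):

```python
import math

# Hypothetical probability distribution of a discrete random variable X:
# its possible values and their corresponding probabilities.
values = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.4, 0.3]

# The probabilities of all possible values must sum to 1.
assert abs(sum(probs) - 1.0) < 1e-9

# Mean: a weighted average of the possible values,
# weighted by their corresponding probabilities.
mean = sum(x * p for x, p in zip(values, probs))          # ≈ 1.9

# Variance: weighted average of squared deviations from the mean;
# the standard deviation is its square root.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
std_dev = math.sqrt(variance)                             # ≈ 0.94
```

The same two lines of arithmetic apply to any discrete distribution given as a table of values and probabilities.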

- The binomial random variable is a type of discrete random variable that is quite common.

- The binomial random variable is defined in a random experiment that consists of n independent trials, each having two possible outcomes (called “success” and “failure”), and each having the same probability of success: p. Such a random experiment is called the binomial random experiment.

- The binomial random variable represents the number of successes (out of n) in a binomial experiment. It can therefore have values as low as 0 (if none of the n trials was a success) and as high as n (if all n trials were successes).

- There are “many” binomial random variables, depending on the number of trials (n) and the probability of success (p).

- The probability distribution of the binomial random variable is given in the form of a formula and can be used to find probabilities. Technology can be used as well.

- The mean and standard deviation of a binomial random variable can be easily found using short-cut formulas.
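The binomial probability formula and the short-cut formulas for the mean and standard deviation can be sketched in a few lines of Python. The values of n and p below are hypothetical:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and
    success probability p: C(n, k) * p**k * (1 - p)**(n - k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3  # hypothetical binomial experiment

# Short-cut formulas for the mean and standard deviation:
mean = n * p                           # mu = n * p
std_dev = math.sqrt(n * p * (1 - p))   # sigma = sqrt(n * p * (1 - p))

# Sanity check: the probabilities of all possible values (0 through n)
# sum to 1, as any probability distribution must.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
```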

The probability distribution of a continuous random variable is represented by a probability density curve. The probability that the random variable takes a value in any interval of interest is the area above this interval and below the density curve.

An important example of a continuous random variable is the **normal random variable**, whose probability density curve is symmetric (bell-shaped), bulging in the middle and tapering at the ends.

- There are “many” normal random variables, each determined by its mean
*μ*(mu) (which determines where the density curve is centered) and standard deviation σ (sigma) (which determines how spread out (wide) the normal density curve is).

- Any normal random variable follows the Standard Deviation Rule, which can help us find probabilities associated with the normal random variable.

- Another way to find probabilities associated with the normal random variable is using the standard normal table. This process involves finding the z-score of values, which tells us how many standard deviations below or above the mean the value is.

- An important application of the normal random variable is that it can be used as an approximation of the binomial random variable (under certain conditions). A continuity correction can improve this approximation.
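The normal approximation with a continuity correction can be checked numerically against the exact binomial probability. A sketch using Python's standard-library `statistics.NormalDist`, with hypothetical values of n and p:

```python
import math
from statistics import NormalDist

n, p = 100, 0.5  # hypothetical binomial experiment

# Exact binomial probability P(X <= 45), summed from the pmf.
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(46))

# Normal approximation: a normal with mean n*p and sd sqrt(n*p*(1-p)),
# with the continuity correction P(X <= 45) ≈ P(Y <= 45.5).
normal = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
approx = normal.cdf(45.5)

# With n = 100 and p = 0.5 the two agree to about three decimal places.
```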


In the Exploratory Data Analysis unit of this course, we encountered data sets, **such as lengths of human pregnancies**, whose distributions naturally followed a symmetric unimodal bell shape, bulging in the middle and tapering off at the ends.

Many variables, such as pregnancy lengths, shoe sizes, foot lengths, and other human physical characteristics exhibit these properties: symmetry indicates that the variable is just as likely to take a value a certain distance below its mean as it is to take a value that same distance above its mean; the bell-shape indicates that values closer to the mean are more likely, and it becomes increasingly unlikely to take values far from the mean in either direction.

The particular shape exhibited by these variables has been studied since the early part of the nineteenth century, when they were first called “normal” as a way of suggesting their depiction of a common, natural pattern.

There are many normal distributions. Even though all of them have the bell-shape, they vary in their center and spread.

More specifically, the center of the distribution is determined by its **mean** (mu, μ) and its spread is determined by its **standard deviation** (sigma, σ).

Some observations we can make as we look at this graph are:

- The black and the red normal curves have means or centers at μ = 10. However, the red curve is more spread out and thus has a larger standard deviation. As you look at these two normal curves, notice that as the red curve is squished down, its spread gets larger, which keeps the total area under the curve the same.
- The black and the green normal curves have the same standard deviation or spread (the black curve ranges over roughly 6.5-13.5, and the green curve over roughly 10.5-17.5).

Even more important than the fact that many variables themselves follow the normal curve is the role played by the normal curve in sampling theory, as we’ll see in the next section in our unit on probability.

Understanding the normal distribution is an important step in the direction of our overall goal, which is to relate sample means or proportions to population means or proportions. The goal of this section is to better understand normal random variables and their distributions.

We began to get a feel for normal distributions in the Exploratory Data Analysis (EDA) section, when we introduced the Standard Deviation Rule (or the **68-95-99.7** rule) for how values in a normally-shaped **sample data set** behave relative to their sample mean (x-bar) and sample standard deviation (s).

This is the same rule that dictates how the distribution of a normal **random variable** behaves relative to its mean (mu, μ) and standard deviation (sigma, σ). Now we use probability language and notation to describe the random variable’s behavior.

For example, in the EDA section, we would have said “68% of pregnancies in our data set fall within 1 standard deviation (s) of their mean (x-bar).” The analogous statement now would be “If X, the length of a randomly chosen pregnancy, is normal with mean (mu, μ) and standard deviation (sigma, σ), then P(μ − σ < X < μ + σ) = 0.68.”

In general, if X is a normal random variable, then the probability is

- 68% that X falls within 1 standard deviation (sigma, σ) of the mean (mu, μ)
- 95% that X falls within 2 standard deviations (sigma, σ) of the mean (mu, μ)
- 99.7% that X falls within 3 standard deviations (sigma, σ) of the mean (mu, μ).

Using probability notation, we may write:

- P(μ − σ < X < μ + σ) = 0.68
- P(μ − 2σ < X < μ + 2σ) = 0.95
- P(μ − 3σ < X < μ + 3σ) = 0.997

**Comment**

- Notice that the information from the rule can be interpreted from the perspective of the tails of the normal curve:
- Since 0.68 is the probability of being within 1 standard deviation of the mean, (1 – 0.68) / 2 = 0.16 is the probability of being further than 1 standard deviation below the mean (or further than 1 standard deviation above the mean).
- Likewise, (1 – 0.95) / 2 = 0.025 is the probability of being more than 2 standard deviations below (or above) the mean.
- And (1 – 0.997) / 2 = 0.0015 is the probability of being more than 3 standard deviations below (or above) the mean.

- The three figures below illustrate this.
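The values given by the rule, and the tail probabilities computed above, can be checked against the exact normal CDF. A quick sketch using Python's standard-library `statistics.NormalDist`:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within 1, 2, and 3 standard deviations
# of the mean: approximately 0.68, 0.95, and 0.997.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

# The corresponding tail probabilities, as in the comment above:
tail_1 = (1 - within[1]) / 2  # ≈ 0.16
tail_2 = (1 - within[2]) / 2  # ≈ 0.025
tail_3 = (1 - within[3]) / 2  # ≈ 0.0015
```

The exact figures (0.6827, 0.9545, 0.9973) show that the rule's round numbers are approximations, which is why it is also called the Empirical Rule.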

Suppose that the foot length of a randomly chosen adult male is a normal random variable with mean μ = 11 (inches) and standard deviation σ = 1.5 (inches). Then the Standard Deviation Rule lets us sketch the probability distribution of X as follows:

**(a)** What is the probability that a randomly chosen adult male will have a foot length between 8 and 14 inches?

**0.95, or 95%.**

**(b)** An adult male is almost guaranteed (.997 probability) to have a foot length between what two values?

**6.5 and 15.5 inches.**

**(c)** The probability is only 2.5% that an adult male will have a foot length greater than how many inches?

**14.** (See image below)
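The answers to parts (a)-(c) can be verified against the exact normal distribution; a sketch with Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

X = NormalDist(mu=11, sigma=1.5)  # foot length, in inches

# (a) P(8 < X < 14): within 2 standard deviations of the mean.
p_a = X.cdf(14) - X.cdf(8)               # ≈ 0.9545, about 95%

# (b) Central 99.7% interval: within 3 standard deviations.
low, high = 11 - 3 * 1.5, 11 + 3 * 1.5   # 6.5 and 15.5 inches

# (c) P(X > 14): the upper tail beyond 2 standard deviations.
p_c = 1 - X.cdf(14)                      # ≈ 0.023, about 2.5%
```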

Now you should try a few. (Use the figure that is just before **part (a)** to help you.)

**Comment**

- Notice that there are two types of problems we may want to solve: those like **(a)**, **(d)** and **(e)**, in which a particular interval of values of a normal random variable is given and we are asked to find a probability, and those like **(b)**, **(c)** and **(f)**, in which a probability is given and we are asked to identify the normal random variable’s values.

Let’s go back to our example of foot length:

How likely or unlikely is it for a male’s foot length to be more than 13 inches?

Since 13 inches doesn’t happen to be exactly 1, 2, or 3 standard deviations away from the mean, we would only be able to give a very rough estimate of the probability at this point.

Clearly, the Standard Deviation Rule only describes the tip of the iceberg, and while it serves well as an introduction to the normal curve, and gives us a good sense of what would be considered likely and unlikely values, it is very limited in the probability questions it can help us answer.

Here is another familiar normal distribution:

Suppose we are interested in knowing the probability that a randomly selected student will score 633 or more on the math portion of his or her SAT (this is represented by the red area). Again, 633 does not fall exactly 1, 2, or 3 standard deviations above the mean.

Notice, however, that an SAT score of 633 and a foot length of 13 are both about 1/3 of the way between 1 and 2 standard deviations. As you continue to read, you’ll realize that this positioning relative to the mean is the key to finding probabilities.
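This positioning can be made concrete with z-scores. The foot-length parameters come from the example above; the SAT parameters are not stated in the text, so a mean of 500 and standard deviation of 100 are assumed here purely for illustration:

```python
from statistics import NormalDist

# Foot length: normal with mean 11 and sd 1.5 (from the example above).
z_foot = (13 - 11) / 1.5        # ≈ 1.33

# SAT math score: mean 500 and sd 100 are ASSUMED for illustration;
# the parameters of the original figure are not reproduced in the text.
z_sat = (633 - 500) / 100       # 1.33

# Both values sit about a third of the way between 1 and 2 standard
# deviations above the mean, so their upper-tail probabilities
# are nearly identical:
p_foot = 1 - NormalDist(11, 1.5).cdf(13)
p_sat = 1 - NormalDist(500, 100).cdf(633)
```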

**Related SAS Tutorials**

- 5B – (4:05) Creating Histograms and Boxplots using SGPLOT
- 5C – (5:41) Creating QQ-Plots and other plots using UNIVARIATE

**Related SPSS Tutorials**

- 5B – (2:29) Creating Histograms and Boxplots
- 5C – (2:31) Creating QQ-Plots and PP-Plots

In the previous activity we tried to help you develop better intuition about the concept of standard deviation. The rule that we are about to present, called “The Standard Deviation Rule” (also known as “The Empirical Rule”) will hopefully also contribute to building your intuition about this concept.

Consider a symmetric mound-shaped distribution:

For distributions having this shape (later we will define this shape as “normally distributed”), the following rule applies:

**The Standard Deviation Rule:**

- Approximately 68% of the observations fall within 1 standard deviation of the mean.

- Approximately 95% of the observations fall within 2 standard deviations of the mean.

- Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.

The following picture illustrates this rule:

This rule provides another way to interpret the standard deviation of a distribution, and thus also provides a bit more intuition about it.

To see how this rule works in practice, consider the following example:

The following histogram represents height (in inches) of 50 males. Note that the data are roughly normal, so we would like to see how the Standard Deviation Rule works for this example.

Below are the actual data, and the numerical measures of the distribution. Note that the key players here, the mean and standard deviation, have been highlighted.

| Statistic | Height |
| --- | --- |
| N | 50 |
| Mean | 70.58 |
| StDev | 2.858 |
| Min | 64 |
| Q1 | 68 |
| Median | 70.5 |
| Q3 | 72 |
| Max | 77 |

To see how well the Standard Deviation Rule works for this case, we will find what percentage of the observations falls within 1, 2, and 3 standard deviations from the mean, and compare it to what the Standard Deviation Rule tells us this percentage should be.

It turns out the Standard Deviation Rule works **very well** in this example.
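The check described above is easy to automate. The 50 heights themselves are not reproduced in the text, so the sample below is a hypothetical stand-in used only to illustrate the computation:

```python
from statistics import mean, stdev

def pct_within(data, k):
    """Percentage of observations within k sample standard deviations
    of the sample mean."""
    m, s = mean(data), stdev(data)
    return 100 * sum(abs(x - m) <= k * s for x in data) / len(data)

# Hypothetical sample of 20 heights (inches), for illustration only:
heights = [64, 66, 67, 68, 68, 69, 69, 70, 70, 70,
           71, 71, 71, 72, 72, 73, 73, 74, 75, 77]

within_1 = pct_within(heights, 1)  # compare to the rule's 68%
within_2 = pct_within(heights, 2)  # compare to the rule's 95%
within_3 = pct_within(heights, 3)  # compare to the rule's 99.7%
```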

The following example illustrates how we can apply the Standard Deviation Rule to variables whose distribution is known to be approximately normal.

The length of the human pregnancy is not fixed. It is known that it varies according to a distribution which is roughly normal, with a mean of 266 days, and a standard deviation of 16 days. (Source: Figures are from Moore and McCabe, *Introduction to the Practice of Statistics*).

First, let’s apply the Standard Deviation Rule to this case by drawing a picture:

We can now use the information provided by the Standard Deviation Rule about the distribution of the length of human pregnancy to answer some questions. For example:

- Question: How long do the middle 95% of human pregnancies last?
- Answer: The middle 95% of pregnancies last within 2 standard deviations of the mean, or in this case 234-298 days.

- Question: What percent of pregnancies last more than 298 days?
- Answer: To answer this, consider the following picture: since 95% of the pregnancies last between 234 and 298 days, the remaining 5% of pregnancies last either less than 234 days or more than 298 days. Since the normal distribution is symmetric, these 5% of pregnancies are divided evenly between the two tails, and therefore 2.5% of pregnancies last more than 298 days.

- Question: How short are the shortest 2.5% of pregnancies?
- Answer: Using the same reasoning as in the previous question, the shortest 2.5% of human pregnancies last less than 234 days.

- Question: What percent of human pregnancies last more than 266 days?
- Answer: Since 266 days is the mean, approximately 50% of pregnancies last more than 266 days.

Here is a complete picture of the information provided by the standard deviation rule.
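The pregnancy-length answers above can be reproduced with a few lines of Python, using the exact normal CDF instead of the rule's rounded figures:

```python
from statistics import NormalDist

X = NormalDist(mu=266, sigma=16)  # pregnancy length, in days

# Middle 95%: within 2 standard deviations of the mean.
low, high = 266 - 2 * 16, 266 + 2 * 16   # 234 and 298 days

# P(X > 298): the upper tail beyond 2 standard deviations,
# about 2.5% (exactly about 2.3% from the normal CDF).
p_over_298 = 1 - X.cdf(298)

# P(X > 266): the mean splits a normal distribution in half.
p_over_mean = 1 - X.cdf(266)             # 0.5
```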

The normal distribution exists in theory but rarely, if ever, in real life. Histograms provide an excellent graphical display to help us assess normality. We can add a “normal curve” to the histogram, showing the normal distribution that has the same mean and standard deviation as our sample. The closer the histogram follows this curve, the more nearly normal the sample.

In the examples below, the graph on the top is approximately normally distributed whereas the graph on the bottom is clearly skewed right.

Unfortunately, this method does not let us quantify how far the distribution departs from normality, but it can be helpful for making qualitative judgments about whether the data approximate the normal curve.

Another common graph to assess normality is the **Q-Q plot** (or **Normal Probability Plot**). In these graphs, the percentiles or quantiles of the theoretical distribution (in this case the standard normal distribution) are plotted against those from the data. If the data matches the theoretical distribution, the graph will result in a straight line. The graph below shows a distribution which closely follows a normal model.

**Note:** QQ-plots are not scatterplots (which we will discuss soon); they display information about only one quantitative variable, graphing it against the theoretical or expected values from a normal distribution with the same mean and standard deviation as our data. Other distributions can also be used.
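The computation behind a Q-Q plot can be sketched without any plotting library: pair each sorted observation with the corresponding quantile of a reference normal distribution. This is a sketch; statistical software may use slightly different plotting-position conventions.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Pair each sorted observation with the matching theoretical
    quantile of a normal distribution having the data's mean and sd.
    If the data are roughly normal, the pairs fall near a straight line."""
    xs = sorted(data)
    n = len(xs)
    ref = NormalDist(mean(data), stdev(data))
    # Plotting positions (i + 0.5) / n; other conventions exist.
    theo = [ref.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theo, xs))

# Hypothetical sample, for illustration only:
points = qq_points([2.1, 2.9, 3.0, 3.2, 3.3, 3.5, 3.6, 3.8, 4.1, 4.9])
```

Plotting the resulting pairs (theoretical quantile on one axis, observed value on the other) gives the Q-Q plot described above.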

In most cases the distributions that you encounter will only be approximations of the normal curve, or they will not resemble the normal distribution at all! However, it can be important to consider how well the data being analyzed approximates the normal curve since this distribution is a key assumption of many statistical analyses.

Here are a few more examples:

The following gives the QQ-plot, histogram and boxplot for variables from a dataset from a population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, who were tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.

Body Mass Index is definitely **unimodal** and **symmetric** and could easily have come from a population which is **normally distributed**.

The Diabetes Pedigree Function scores were unimodal and skewed right. This data does not seem to have come from a population which is normally distributed.

The Triceps Skin Fold Thickness is **basically symmetric with one extreme outlier** (and one potential but mild outlier).

**Be careful not to call such a distribution “skewed right,”** as it is only the single outlier which really shows that pattern here. At a minimum, remove the outlier and recreate the graphs to see how skewed the rest of the data might be.

Since there were no skewed left examples in the real data, here are two randomly generated skewed left distributions. Notice that the first is less skewed left than the second and this is indicated clearly in all three plots.

**Comments:**

- Even if the population is exactly normally distributed, samples from this population can appear non-normal especially for small sample sizes. See this document containing 21 samples of size n = 50 from a normal distribution with a mean of 200 and a standard deviation of 30. The samples that produce results which are skewed or otherwise seemingly not-normal are highlighted but even among those not highlighted, notice the variation in shapes seen: Normal Samples

- The standard deviation rule can also help in assessing normality in that the closer the percentage of data points within 1, 2, and 3 standard deviations is to that of the rule, the closer the data itself fits a normal distribution.

- In our example of male heights, we see that the histogram resembles a normal distribution and the sample percentages are very close to that predicted by the standard deviation rule.

We have already learned the Standard Deviation Rule, which, for normally distributed data, provides approximations for the proportion of data values within 1, 2, and 3 standard deviations of the mean. From this we know that approximately 5% of the data values would be expected to fall OUTSIDE 2 standard deviations.

If we calculate the standardized scores (or z-scores) for our data, it would be easy to identify these unusually large or small values in our data. To calculate a z-score, recall that we take the individual value and subtract the mean and then divide this difference by the standard deviation.

For any individual, the z-score tells us how many standard deviations the raw score for that individual deviates from the mean and in what direction. A positive z-score indicates the individual is above average and a negative z-score indicates the individual is below average.

**Comments:**

- Standardized scores can be used to help identify potential outliers
- For approximately normal distributions, z-scores greater than 2 or less than -2 are rare (will happen approximately 5% of the time).
- For any distribution, z-scores greater than 4 or less than -4 are rare (will happen less than 6.25% of the time).

- Standardized scores, along with other measures of position, are useful when comparing individuals in different datasets since the comparison takes into account the relative position of the individuals in their dataset. With z-scores, we can tell which individual has a relatively higher or lower position in their respective dataset.

- Later in the course, we will see that this idea of standardizing is used often in statistical analyses.

We will continue with the Best Actress Oscar winners example (Link to the Best Actress Oscar Winners data).

34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33

In previous examples, we identified three observations as outliers, two of which were classified as extreme outliers (ages of 61, 74, and 80).

The mean of this sample is 38.5 and the standard deviation is 12.95.

- The z-score for the actress with age = 80 is (80 − 38.5) / 12.95 ≈ 3.20.

Thus, among our female Oscar winners from our sample, this actress is 3.20 standard deviations older than average.
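As a check on the arithmetic, the z-scores of the three outlying ages can be computed directly from the sample mean and standard deviation reported above:

```python
mean_age = 38.5   # sample mean reported in the text
sd_age = 12.95    # sample standard deviation reported in the text

def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# z-scores for the three ages previously flagged as outliers:
z_61 = z_score(61, mean_age, sd_age)  # ≈ 1.74
z_74 = z_score(74, mean_age, sd_age)  # ≈ 2.74
z_80 = z_score(80, mean_age, sd_age)  # ≈ 3.20
```

Only the age of 80 exceeds the z = 3 mark; by the comments above, values beyond ±2 are already rare for approximately normal data.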