Population Means (Part 3)

CO-4: Distinguish among different measurement scales, choose the appropriate descriptive and inferential statistical methods based on these distinctions, and interpret the results.
LO 4.30: Interpret confidence intervals for population parameters in context.
LO 4.31: Find confidence intervals for the population mean using the normal distribution (Z) based confidence interval formula (when required conditions are met) and perform sample size calculations.
CO-6: Apply basic concepts of probability, random variation, and commonly used statistical probability distributions.
LO 6.24: Explain the connection between the sampling distribution of a statistic, and its properties as a point estimator.
LO 6.25: Explain what a confidence interval represents and determine how changes in sample size and confidence level affect the precision of the confidence interval.

Sample Size Calculations

As we just learned, for a given level of confidence, the sample size determines the margin of error and thus the width, or precision, of our interval estimate. This process can be reversed.

In situations where a researcher has some flexibility in choosing the sample size, the researcher can calculate in advance the sample size needed to report a confidence interval with a given level of confidence and a given margin of error. Let’s look at an example.

EXAMPLE:

Recall the example about the SAT-M scores of community college students.

An educational researcher is interested in estimating μ (mu), the mean score on the math part of the SAT (SAT-M) of all community college students in his state. To this end, the researcher has chosen a random sample of 650 community college students from his state, and found that their average SAT-M score is 475. Based on a large body of research that was done on the SAT, it is known that the scores roughly follow a normal distribution, with the standard deviation σ (sigma) = 100.

The 95% confidence interval for μ (mu) is

\[ \bar{x} \pm 2 \cdot \frac{\sigma}{\sqrt{n}} = 475 \pm 2 \cdot \frac{100}{\sqrt{650}} \]

which is roughly 475 ± 8, or (467, 483). For a sample size of n = 650, our margin of error is 8.
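
The arithmetic behind this interval can be checked with a few lines of Python. This is only a minimal sketch: the helper name `z_confidence_interval` is hypothetical, and the multiplier z* = 2 is assumed to be the rounded 95% value used in this example.

```python
import math

def z_confidence_interval(x_bar, sigma, n, z_star):
    """Return (margin, lower, upper) for x_bar ± z* · sigma / sqrt(n)."""
    margin = z_star * sigma / math.sqrt(n)
    return margin, x_bar - margin, x_bar + margin

# SAT-M example: x-bar = 475, sigma = 100, n = 650.
# z* = 2 is an assumption (the rounded 95% multiplier).
margin, lower, upper = z_confidence_interval(475, 100, 650, 2)
print(round(margin), round(lower), round(upper))  # prints: 8 467 483
```

The unrounded margin of error is about 7.8, which is why the text reports it as roughly 8.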

Now, let’s think about this problem in a slightly different way:

An educational researcher is interested in estimating μ (mu), the mean score on the math part of the SAT (SAT-M) of all community college students in his state with a margin of error of (only) 5, at the 95% confidence level. What is the sample size needed to achieve this? σ (sigma), of course, is still assumed to be 100.

To solve this, we set:

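\[ m = 2 \cdot \frac{100}{\sqrt{n}} = 5 \quad\Longrightarrow\quad \sqrt{n} = \frac{2 \cdot 100}{5} = 40 \quad\Longrightarrow\quad n = 40^2 = 1600 \]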

So, for a sample size of 1,600 community college students, the researcher will be able to estimate μ (mu) with a margin of error of 5, at the 95% level. In this example, we can also imagine that the researcher has some flexibility in choosing the sample size, since there is a minimal cost (if any) involved in recording students’ SAT-M scores, and there are many more than 1,600 community college students in each state.

Rather than take the same steps to isolate n every time we solve such a problem, we may obtain a general expression for the required n for a desired margin of error m and a certain level of confidence.

Since

\[ m = z^* \cdot \frac{\sigma}{\sqrt{n}} \]

is the formula to determine m for a given n, we can use simple algebra to express n in terms of m (multiply both sides by the square root of n, divide both sides by m, and square both sides) to get

\[ n = \left( \frac{z^* \sigma}{m} \right)^2 \]

Comment:

  • Clearly, the sample size n must be an integer.
  • In the previous example we got n = 1,600, but in other situations, the calculation may give us a non-integer result.
  • In these cases, we should always round up to the next highest integer. 
  • Using this “conservative approach,” we’ll achieve an interval at least as narrow as the one desired.

EXAMPLE:

IQ scores are known to vary normally with a standard deviation of 15. How many students should be sampled if we want to estimate the population mean IQ at 99% confidence with a margin of error equal to 2?

\[ n = \left( \frac{2.576 \cdot 15}{2} \right)^2 = 19.32^2 \approx 373.26 \]

Round up to be safe, and take a sample of 374 students.
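
Both sample size examples so far can be reproduced with a short Python function. This is a sketch under stated assumptions rather than a prescribed method: `required_sample_size` is a hypothetical name, and the multipliers (2 for the SAT-M example, 2.576 for 99% confidence) are passed in by hand.

```python
import math

def required_sample_size(sigma, margin, z_star):
    """Smallest integer n for which z* · sigma / sqrt(n) is at most `margin`."""
    n_exact = (z_star * sigma / margin) ** 2
    return math.ceil(n_exact)  # always round up, never to the nearest integer

# SAT-M example: sigma = 100, margin of error 5, z* = 2
print(required_sample_size(sigma=100, margin=5, z_star=2))     # prints: 1600

# IQ example: sigma = 15, margin of error 2, z* = 2.576 (99% confidence)
print(required_sample_size(sigma=15, margin=2, z_star=2.576))  # prints: 374
```

Using `math.ceil` rather than `round` mirrors the conservative approach described above: rounding 373.26 down to 373 would give a margin of error slightly larger than 2.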

The purpose of the next activity is to give you guided practice in sample size calculations for obtaining confidence intervals with a desired margin of error, at a certain confidence level.

Learn by Doing: Sample Size

Comment:

  • In the preceding activity, you saw that in order to calculate the sample size when planning a study, you needed to know the population standard deviation, sigma (σ). In practice, sigma is usually not known, because it is a parameter. (The rare exceptions are certain variables like IQ score or standardized tests that might be constructed to have a particular known sigma.)

Therefore, when researchers wish to compute the required sample size in preparation for a study, they use an estimate of sigma. Usually, sigma is estimated based on the standard deviation obtained in prior studies.

However, in some cases, there might not be any prior studies on the topic. In such instances, a researcher still needs to get a rough estimate of the standard deviation of the (yet-to-be-measured) variable, in order to determine the required sample size for the study. One way to get such a rough estimate is with the “range rule of thumb.” We will not cover this topic in depth but mention here that a very rough estimate of the standard deviation of a population is the range/4.
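
As a rough illustration of how the range rule of thumb feeds into a sample size calculation, consider the Python sketch below. Every number in it is hypothetical (a variable assumed to range from about 40 to 100, a desired margin of error of 3, and the rounded 95% multiplier z* = 2), and `range_rule_sigma` is an illustrative helper name.

```python
import math

def range_rule_sigma(minimum, maximum):
    """Very rough estimate of the standard deviation: range / 4."""
    return (maximum - minimum) / 4

# Hypothetical planning values, not taken from any real study
sigma_guess = range_rule_sigma(minimum=40, maximum=100)  # (100 - 40) / 4 = 15.0

# Plug the rough sigma into n = (z* · sigma / m)^2 with z* = 2 and m = 3
n_needed = math.ceil((2 * sigma_guess / 3) ** 2)
print(sigma_guess, n_needed)  # prints: 15.0 100
```

Because the estimate of sigma is so rough, the resulting n is best read as a planning figure rather than an exact requirement.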

There are a few more things we need to discuss:

  • Is it always OK to use the confidence interval we developed for μ (mu) when σ (sigma) is known?
  • What if σ (sigma) is unknown?
  • How can we use statistical software to calculate confidence intervals for us?

When is it safe to use the confidence interval we developed?

One of the most important things to learn with any inference method is the conditions under which it is safe to use it. It is very tempting to apply a certain method, but if the conditions under which this method was developed are not met, then using this method will lead to unreliable results, which can then lead to wrong and/or misleading conclusions. As you’ll see throughout this section, we will always discuss the conditions under which each method can be safely used.

In particular, the confidence interval for μ (mu), when σ (sigma) is known:

\[ \bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}} \]

was developed assuming that the sampling distribution of x-bar is normal; in other words, that the Central Limit Theorem applies. In particular, this allowed us to determine the values of z*, the confidence multiplier, for different levels of confidence.
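
To see where the z* multipliers come from, the sketch below recovers them from the standard normal distribution using Python's standard library. The function name `z_multiplier` is hypothetical. The idea is that a confidence level C leaves (1 − C)/2 in each tail, so z* is the point with 1 − (1 − C)/2 of the distribution below it.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """z* such that the middle `confidence` of the standard normal lies within ±z*."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_multiplier(level), 3))
# prints:
# 0.9 1.645
# 0.95 1.96
# 0.99 2.576
```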

First, the sample must be random. Assuming that the sample is random, recall from the Probability unit that the Central Limit Theorem works when the sample size is large (a common rule of thumb for “large” is n > 30), or, for smaller sample sizes, if it is known that the quantitative variable of interest is distributed normally in the population. The only situation when we cannot use the confidence interval, then, is when the sample size is small and the variable of interest is not known to have a normal distribution. In that case, other methods, called non-parametric methods, which are beyond the scope of this course, need to be used. This can be summarized in the following table:

                                 Small Sample Size    Large Sample Size
Variable varies normally         OK                   OK
Variable doesn't vary normally   NOT OK               OK
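
The decision rule in the table can also be written as a small Python check. This is a sketch only: the function and parameter names are hypothetical, and the cutoff of 30 is the rule of thumb quoted above, not a hard boundary.

```python
def z_interval_is_safe(n, population_is_normal, is_random_sample=True, large_n=30):
    """Return True when the table marks the situation as OK.

    The interval is not safe without a random sample. Otherwise it is safe
    when the variable varies normally or when the sample is large (n > large_n).
    """
    if not is_random_sample:
        return False
    return population_is_normal or n > large_n

print(z_interval_is_safe(n=12, population_is_normal=True))    # prints: True
print(z_interval_is_safe(n=12, population_is_normal=False))   # prints: False
print(z_interval_is_safe(n=200, population_is_normal=False))  # prints: True
```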

In the following activity, you have the opportunity to use software to summarize the raw data provided.

What if σ (sigma) is unknown?

As we discussed earlier, when variables have been well-researched in different populations it is reasonable to assume that the population standard deviation (σ, sigma) is known. However, this is rarely the case. What if σ (sigma) is unknown?

Well, there is some good news and some bad news.

The good news is that we can easily replace the population standard deviation, σ (sigma), with the sample standard deviation, s.

A large circle represents the population of interest, for which both μ and σ are unknown. From the population we draw a simple random sample (SRS) of size n, represented by a smaller circle. We can find x-bar for this SRS, and we can also obtain s, the sample standard deviation. We use s instead of the unknown σ.

The bad news is that once σ (sigma) has been replaced by s, the standardized quantity (x-bar − μ)/(s/sqrt(n)) no longer follows the standard normal distribution, because s itself varies from sample to sample. As a result, the confidence multipliers z* for the different levels of confidence (1.645, 1.96, 2.576) are (generally) not correct any more. The new multipliers come from a different distribution called the “t distribution” and are therefore denoted by t* (instead of z*). We will discuss the t distribution in more detail when we talk about hypothesis testing.

The confidence interval for the population mean (μ, mu) when (σ, sigma) is unknown is therefore:

\[ \bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}} \]

(Note that this interval is very similar to the one when σ (sigma) is known, with the obvious changes: s replaces σ (sigma), and t* replaces z* as discussed above.)

There is an important difference between the confidence multipliers we have used so far (z*) and those needed for the case when σ (sigma) is unknown (t*). Unlike the z* multipliers, which depend only on the level of confidence, the t* multipliers have the added complexity that they depend on both the level of confidence and the sample size, through the degrees of freedom n − 1 (for example: the t* used in a 95% confidence interval when n = 10 is different from the t* used when n = 40). Due to this added complexity in determining the appropriate t*, we will rely heavily on software in this case.
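
As one illustration of what such software does, the sketch below uses the third-party SciPy library (assumed to be installed) to look up t* and build the interval. The helper names `t_multiplier` and `t_confidence_interval` are hypothetical, and the code relies on the standard fact that the one-sample t interval uses n − 1 degrees of freedom.

```python
import math
from scipy import stats  # third-party library; assumed to be available

def t_multiplier(confidence, n):
    """t* for a two-sided interval, using n - 1 degrees of freedom."""
    tail = (1 - confidence) / 2
    return stats.t.ppf(1 - tail, df=n - 1)

def t_confidence_interval(x_bar, s, n, confidence=0.95):
    """Return (lower, upper) for x_bar ± t* · s / sqrt(n)."""
    margin = t_multiplier(confidence, n) * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# t* shrinks toward z* as the sample size grows
for n in (10, 40):
    print(n, round(float(t_multiplier(0.95, n)), 3))
```

For n = 10 and n = 40 this gives t* of about 2.262 and 2.023, both larger than the corresponding z* of 1.96, so the t interval is somewhat wider for the same data.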

Comments:

  • Since it is quite rare that σ (sigma) is known, this interval (sometimes called a “one-sample t confidence interval”) is more commonly used as the confidence interval for estimating μ (mu). (Nevertheless, we could not have presented it without our extended discussion up to this point, which also provided you with a solid understanding of confidence intervals.)
  • The quantity s/sqrt(n) is called the estimated standard error of x-bar. The sampling distribution of x-bar has standard deviation σ/sqrt(n) (and this is the quantity used in the confidence interval when σ (sigma) is known). In general, the standard error is the standard deviation of the sampling distribution of a statistic. When we substitute s for σ (sigma) we are estimating the true standard error. You may see the term “standard error” used for both the true standard error and the estimated standard error depending on the author and audience. What is important to understand about the standard error is that it measures the variation of a statistic calculated from a sample of a specified sample size (not the variation of the original population).
  • As before, to safely use this confidence interval (one-sample t confidence interval), the sample must be random, and the only case when this interval cannot be used is when the sample size is small and the variable is not known to vary normally.
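
The distinction between s and the estimated standard error s/sqrt(n) can be made concrete with a short Python sketch. The data values are hypothetical and chosen only so the arithmetic is easy to follow; `estimated_standard_error` is an illustrative helper name.

```python
import math
import statistics

def estimated_standard_error(sample):
    """s / sqrt(n), where s is the sample standard deviation (n - 1 divisor)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical measurements, for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]
s = statistics.stdev(sample)           # variation of individual observations
se = estimated_standard_error(sample)  # variation of x-bar across samples of size 8
print(round(s, 3), round(se, 3))       # prints: 2.138 0.756
```

The standard error is smaller than s by a factor of sqrt(n), reflecting that a sample mean varies less than individual observations do.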

Final Comment:

  • It turns out that for large values of n, the t* multipliers are not that different from the z* multipliers, and therefore using the interval formula:

\[ \bar{x} \pm z^* \cdot \frac{s}{\sqrt{n}} \]

for μ (mu) when σ (sigma) is unknown provides a pretty good approximation.