Unit 4A: Introduction to Statistical Inference
- Types of Inference (Point Estimation, Interval Estimation, Hypothesis Testing)
- Inference for One Variable
- Outline of the Process
- Applied Steps (What do Researchers Do?)
- Theoretical Steps (What do Statisticians Do?)
- Standard Error of a Statistic
Recall again the Big Picture, the four-step process that encompasses statistics: data production, exploratory data analysis, probability, and inference.
We are about to start the fourth and final unit of this course, where we draw on principles learned in the other units (Exploratory Data Analysis, Producing Data, and Probability) in order to accomplish what has been our ultimate goal all along: use a sample to infer (or draw conclusions) about the population from which it was drawn.
As you will see in the introduction, the specific form of inference called for depends on the type of variables involved — either a single categorical or quantitative variable, or a combination of two variables whose relationship is of interest.
Statistical inference, the subject of this fourth and final part of the course, means drawing conclusions about a population based on the data obtained from a sample chosen from it.
The purpose of this introduction is to review how we got here and how the previous units fit together to allow us to make reliable inferences. Also, we will introduce the various forms of statistical inference that will be discussed in this unit, and give a general outline of how this unit is organized.
In the Exploratory Data Analysis unit, we learned to display and summarize data that were obtained from a sample. Regardless of whether we had one variable and we examined its distribution, or whether we had two variables and we examined the relationship between them, it was always understood that these summaries applied only to the data at hand; we did not attempt to make claims about the larger population from which the data were obtained.
Such generalizations were, however, a long-term goal from the very beginning of the course. For this reason, in the unit on Producing Data, we took care to establish the principles of sampling and study design that are essential if we are to claim that, to some extent, what is true for the sample should also be true for the larger population from which the sample originated.
These principles should be kept in mind throughout this unit on statistical inference, since the results that we will obtain will not hold if there was bias in the sampling process, or flaws in the study design under which variables’ values were measured.
Perhaps the most important principle stressed in the Producing Data unit was that of randomization. Randomization is essential, not only because it prevents bias, but also because it permits us to rely on the laws of probability, the scientific study of random behavior.
In the Probability unit, we established basic laws for the behavior of random variables. We ultimately focused on two random variables of particular relevance: the sample mean (x-bar) and the sample proportion (p-hat), and the last section of the Probability unit was devoted to exploring their sampling distributions.
We learned what probability theory tells us to expect from the values of the sample mean and the sample proportion, given that the corresponding population parameters — the population mean (mu, μ) and the population proportion (p) — are known.
As we mentioned in that section, the value of such results is more theoretical than practical, since in real-life situations we seldom know what is true for the entire population. All we know is what we see in the sample, and we want to use this information to say something concrete about the larger population.
Probability theory has set the stage to accomplish this: learning what to expect from the value of the sample mean, given that the population mean takes a certain value, teaches us (as we’ll soon learn) what to expect from the value of the unknown population mean, given that a particular value of the sample mean has been observed.
Similarly, since we have established how the sample proportion behaves relative to the population proportion, we will now be able to turn this around and say something about the value of the population proportion, based on an observed sample proportion. This process of inferring something about the population based on what is measured in the sample is (as you know) called statistical inference.
We will introduce three forms of statistical inference in this unit, each one representing a different way of using the information obtained in the sample to draw conclusions about the population. These forms are:
- Point Estimation
- Interval Estimation
- Hypothesis Testing
Each of these forms of inference will be discussed at length in this unit, but it is useful to start with at least an intuitive sense of the nature of each form and of how they differ in the types of conclusions they draw about the population based on the sample results.
In terms of organization, the Inference unit consists of two main parts: Inference for One Variable and Inference for Relationships between Two Variables. The organization of each of these parts will be discussed further as we proceed through the unit.
The next two topics in the inference unit will deal with inference for one variable. Recall that in the Exploratory Data Analysis (EDA) unit, when we learned to summarize the data obtained from one variable by examining its distribution, we distinguished between two cases: categorical data and quantitative data.
The following outlines describe some of the important points about the process of statistical inference, and compare and contrast how researchers and statisticians approach this process.
Here is another restatement of the big picture of statistical inference as it pertains to the two simple examples we will discuss first.
- A simple random sample is taken from a population of interest.
- In order to estimate a population parameter, a statistic is calculated from the sample. For example:
  - Sample mean (x-bar)
  - Sample proportion (p-hat)
- We then learn about the DISTRIBUTION of this statistic in repeated sampling (theoretically). We now know these are called sampling distributions!
- Using THIS sampling distribution we can make inferences about our population parameter based upon our sample statistic.
It is this last step of statistical inference that we are interested in discussing now.
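The steps listed above can be sketched as a small simulation. This is only an illustration under stated assumptions: the population (100,000 individuals, 30% of whom have some trait), the sample size of 200, and the helper name `sample_proportion` are all hypothetical and are not taken from the course materials. In practice we observe only one sample; the repeated sampling here is possible only because we invented the population ourselves.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1 = has the trait, 0 = does not (true p = 0.30).
population = [1] * 30_000 + [0] * 70_000


def sample_proportion(pop, n):
    """Draw a simple random sample of size n (without replacement) and return p-hat."""
    sample = random.sample(pop, n)
    return sum(sample) / n


# Steps 1 and 2: take one simple random sample and calculate the statistic.
p_hat = sample_proportion(population, 200)

# Step 3: repeat the sampling many times to approximate the sampling
# distribution of p-hat for samples of size 200.
p_hats = [sample_proportion(population, 200) for _ in range(2_000)]
center = statistics.mean(p_hats)   # should be close to the true p
spread = statistics.stdev(p_hats)  # variability of p-hat across samples

print(f"one observed p-hat: {p_hat:.3f}")
print(f"mean of simulated p-hats: {center:.3f}")
print(f"standard deviation of simulated p-hats: {spread:.3f}")
```

The simulated p-hats cluster around the true proportion, and their spread tells us how far a single observed p-hat is likely to fall from it. That knowledge is what the last step, inference, relies on.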
One issue for students is that the theoretical process of statistical inference is only a small part of the applied steps in a research project. Previously, in our discussion of the role of biostatistics, we defined these steps to be:
- Planning/design of study
- Data collection
- Data analysis
Among researchers, the following represent some of the important questions to address when conducting a study.
- What is the population of interest?
- What is the question or statistical problem?
- How should we sample to best address the question, given the available resources?
- How should we analyze the data?
- How should we report the results?
Statisticians, on the other hand, need to ask questions like these:
- What assumptions can be reasonably made about the population?
- What parameter(s) in the population do we need to estimate in order to address the research question?
- What statistic(s) from our sample data can be used to estimate the unknown parameter(s)?
- How does each statistic behave?
  - Is it unbiased?
  - How variable will it be for the planned sample size?
  - What is the distribution of this statistic? (Sampling Distribution)
Then, we will see that we can use the sampling distribution of a statistic to:
- Provide confidence interval estimates for the corresponding parameter.
- Conduct hypothesis tests about the corresponding parameter.
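As a preview of these two uses, here is a minimal sketch for a single proportion. Everything specific in it is an assumption made for illustration: the data (120 successes in a sample of 400), the 95% confidence level, the null value 0.25, and the use of the large-sample normal approximation to the sampling distribution of p-hat. The methods actually used in this unit are developed in the sections that follow.

```python
import math
from statistics import NormalDist

# Hypothetical data: 120 "successes" in a simple random sample of 400.
n, successes = 400, 120

# Point estimate of the population proportion p.
p_hat = successes / n

# Interval estimate: approximate 95% confidence interval for p,
# assuming the sampling distribution of p-hat is roughly normal.
z_star = NormalDist().inv_cdf(0.975)           # about 1.96
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error of p-hat
ci = (p_hat - z_star * se_hat, p_hat + z_star * se_hat)

# Hypothesis test: H0: p = 0.25 versus a two-sided alternative.
p0 = 0.25                                      # hypothetical null value
se_null = math.sqrt(p0 * (1 - p0) / n)         # standard error if H0 were true
z = (p_hat - p0) / se_null                     # standardized distance from p0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value

print(f"point estimate: {p_hat:.3f}")
print(f"approximate 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

Both calculations use the same ingredient, the center and spread of the sampling distribution of p-hat. The interval asks which values of p are plausible given the data, while the test asks whether one particular value of p is plausible.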
In our discussion of sampling distributions, we discussed the variability of sample statistics; here is a quick review of this general concept and a formal definition of the standard error of a statistic.
- All statistics calculated from samples are random variables.
- The distribution of a statistic (from a sample of a given sample size) is called the sampling distribution of the statistic.
- The standard deviation of the sampling distribution of a particular statistic is called the standard error of the statistic; it measures the variability of the statistic for a particular sample size.
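The definition above can be checked with a short simulation. The sketch below assumes a hypothetical normal population with mean 50 and standard deviation 10 (values chosen only for illustration) and compares the standard deviation of many simulated sample means with the theoretical standard error of x-bar, sigma divided by the square root of n, from the Probability unit. The function name `simulate_sample_means` is likewise hypothetical.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

MU, SIGMA = 50.0, 10.0  # hypothetical population mean and standard deviation


def simulate_sample_means(n, reps=5_000):
    """Return the sample means of `reps` independent random samples of size n."""
    return [
        statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(reps)
    ]


results = {}
for n in (25, 100):
    means = simulate_sample_means(n)
    simulated_center = statistics.fmean(means)  # near MU if x-bar is unbiased
    simulated_se = statistics.stdev(means)      # SD of the simulated sampling distribution
    theoretical_se = SIGMA / math.sqrt(n)       # sigma / sqrt(n)
    results[n] = (simulated_center, simulated_se, theoretical_se)
    print(
        f"n={n}: mean of x-bars={simulated_center:.2f}, "
        f"simulated SE={simulated_se:.2f}, theoretical SE={theoretical_se:.2f}"
    )
```

Quadrupling the sample size from 25 to 100 should roughly halve the standard error, which is why the standard error is always stated for a particular sample size.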