Unit 3A: Probability
Recall the Big Picture — the four-step process that encompasses statistics (as it is presented in this course):
So far, we’ve discussed the first two steps:
- Producing data: how data are obtained, and what considerations affect the data production process.
- Exploratory data analysis: tools that help us get a first feel for the data by exposing their features through visual displays and numerical summaries, which help us explore distributions, compare distributions, and investigate relationships.
(Recall that the structure of this course is such that Exploratory Data Analysis was covered first, followed by Producing Data.)
Our eventual goal is Inference — drawing reliable conclusions about the population based on what we’ve discovered in our sample.
In order to really understand how inference works, though, we first need to talk about Probability, because it is the underlying foundation for the methods of statistical inference.
The probability unit starts with an introduction, which will give you some motivating examples and an intuitive and informal perspective on probability.
Why do we need to understand probability?
- We often want to estimate the chance that an event (of interest to us) will occur.
- Many values of interest are probabilities or are derived from probabilities, for example, prevalence rates, incidence rates, and sensitivity/specificity of tests for disease.
- Plus, inferential statistics relies on probability to:
  - test hypotheses
  - estimate population values, such as the population mean or population proportion.
We will use an example to try to explain why probability is so essential to inference.
First, here is the general idea:
Statistics works by using a sample to learn about the population from which it was drawn. Ideally, the sample should be random so that it represents the population well.
Recall from the discussion about sampling that when we say that a random sample represents the population well, we mean that there is no inherent bias in this sampling technique.
It is important to acknowledge, though, that this does not mean that all random samples are necessarily “perfect.” Random samples are still random, so two random samples drawn from the same population will almost never be exactly alike.
One random sample may give a fairly accurate representation of the population, while another random sample might be “off,” purely due to chance.
Unfortunately, when looking at a particular sample (which is what happens in practice), we will never know how much it differs from the population.
This uncertainty is where probability comes into the picture. Probability gives us a way to draw conclusions about the population in the face of the uncertainty that is generated by the use of a random sample.
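To make the idea of chance variation concrete, here is a minimal simulation sketch. It is an illustration, not part of the course materials: the population proportion (0.30), the sample size (100), and the function names `sample_proportion` and `simulate_samples` are all assumptions chosen for the example. The sketch also assumes the population is large enough that each sampled individual can be treated as an independent draw.

```python
import random


def sample_proportion(population_proportion, sample_size, rng):
    """Draw one random sample and return the fraction that has the trait.

    Each individual is modeled as an independent draw that has the trait
    with probability `population_proportion` (an assumption that is
    reasonable when the population is much larger than the sample).
    """
    hits = sum(rng.random() < population_proportion for _ in range(sample_size))
    return hits / sample_size


def simulate_samples(population_proportion=0.30, sample_size=100,
                     n_samples=5, seed=1):
    """Return the sample proportions from several independent random samples."""
    rng = random.Random(seed)  # fixed seed so the run can be reproduced
    return [
        sample_proportion(population_proportion, sample_size, rng)
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    true_p = 0.30  # hypothetical population proportion
    for i, p_hat in enumerate(simulate_samples(true_p), start=1):
        # The error column shows how far each sample lands from the truth.
        print(f"sample {i}: proportion = {p_hat:.2f}, error = {p_hat - true_p:+.2f}")
```

Every sample here is drawn by the same unbiased procedure, yet the sample proportions generally differ from one another and from the population value. In practice we see only one such sample and do not know its error, which is the uncertainty that probability helps us quantify.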
The following example will illustrate this important point.
In the health sciences, a situation comparable to the death penalty example arises when we wish to determine the prevalence of a certain disease or condition.
As we will see, prevalence is a form of probability.
In practice, we will need to estimate the prevalence using a sample, and in order to make inferences about the population from that sample, we will need to understand probability.
Other common probabilities used in the health sciences are:
- (Cumulative) Incidence: the probability that a person with no prior disease will develop disease over some specified time period
- Sensitivity of a diagnostic or screening test: the probability that a person tests positive, given that the person has the disease
- Specificity of a diagnostic or screening test: the probability that a person tests negative, given that the person does not have the disease
- Related measures for diagnostic and screening tests: predictive value positive, predictive value negative, false positive rate, and false negative rate
- Survival probability: the probability an individual survives beyond a certain time
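As a rough illustration of how several of these probabilities can be estimated from data, the sketch below computes them from a 2×2 table of test results against true disease status. The counts are invented for the example, and the function name `diagnostic_measures` and its parameter names are assumptions, not part of the course materials. The prevalence figure is only a sensible estimate of the population prevalence if the people in the table are a random sample from that population.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Estimate common test-related probabilities from a 2x2 table.

    tp: has the disease, tests positive   (true positives)
    fn: has the disease, tests negative   (false negatives)
    fp: no disease, tests positive        (false positives)
    tn: no disease, tests negative        (true negatives)
    """
    diseased = tp + fn
    not_diseased = fp + tn
    total = diseased + not_diseased
    return {
        # Proportion of the sample that has the disease.
        "prevalence": diseased / total,
        # P(test positive | disease)
        "sensitivity": tp / diseased,
        # P(test negative | no disease)
        "specificity": tn / not_diseased,
        # P(test positive | no disease) = 1 - specificity
        "false_positive_rate": fp / not_diseased,
        # P(test negative | disease) = 1 - sensitivity
        "false_negative_rate": fn / diseased,
        # P(disease | test positive)
        "predictive_value_positive": tp / (tp + fp),
        # P(no disease | test negative)
        "predictive_value_negative": tn / (tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts for 1,000 people; not real study data.
    measures = diagnostic_measures(tp=90, fp=45, fn=10, tn=855)
    for name, value in measures.items():
        print(f"{name}: {value:.3f}")
```

Each measure is a conditional proportion: the denominator is the group named after "given" in the definition, which is why sensitivity divides by everyone with the disease while predictive value positive divides by everyone who tested positive.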