Variables can be broadly classified into one of two **types**:

- Quantitative

- Categorical

Below we define these two main types of variables and provide further sub-classifications for each type.

**Categorical variables** take **category** or **label** values, and place an individual into one of several **groups**.

Categorical variables are often further classified as either:

**Nominal**, when there **is no natural ordering among the categories**.

Common examples would be gender, eye color, or ethnicity.

**Ordinal**, when there **is a natural order among the categories**, such as ranking scales or letter grades.

However, ordinal variables are still categorical and do not provide precise measurements.

Differences between categories are not precisely meaningful. For example, if one student scores an A and another a B on an assignment, we cannot say precisely how different their scores are, only that an A is higher than a B.

**Quantitative variables** take **numerical** values, and represent some kind of **measurement**.

Quantitative variables are often further classified as either:

**Discrete**, when the variable takes on a **countable** number of values.

Most often these variables indeed represent some kind of **count** such as the number of prescriptions an individual takes daily.

**Continuous**, when the variable **can take on any value in some range of values**.

Our precision in measuring these variables is often limited by our instruments.

Units should be provided.

Common examples would be height (inches), weight (pounds), or time to recovery (days).

One special variable type occurs when a variable has only two possible values.

A variable is said to be **Binary** or **Dichotomous** when there are only two possible levels.

These variables can usually be phrased as a “yes/no” question. Gender is an example of a binary variable.

Currently we are primarily concerned with classifying variables as either categorical or quantitative.

Sometimes, however, we will need to consider further and sub-classify these variables as defined above.

These concepts will be discussed and reviewed as needed but here is a quick practice on sub-classifying categorical and quantitative variables.

Let’s revisit the dataset showing medical records for a sample of patients.

In our example of medical records, there are several variables of each type:

- Age, Weight, and Height are **quantitative** variables.

- Race, Gender, and Smoking are **categorical** variables.

**Comments:**

- Notice that the values of the **categorical** variable Smoking have been **coded** as the numbers 0 or 1.

It is quite common to code the values of a categorical variable as numbers, but you should remember that these are just codes.

They have no arithmetic meaning (i.e., it does not make sense to add, subtract, multiply, divide, or compare the magnitude of such values).

Usually, if such a coding is used, all categorical variables will be coded; we will tend to use this type of coding for the datasets in this course.

- Sometimes, **quantitative** variables are **divided into groups** for analysis. In such a situation, although the original variable was quantitative, the variable analyzed is categorical.

A common example is to provide information about an individual’s Body Mass Index by stating whether the individual is underweight, normal, overweight, or obese.

This categorized BMI is an example of an ordinal categorical variable.
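The two comments above can be illustrated with a short sketch (using pandas, with hypothetical patient values; the BMI cut points of 18.5, 25, and 30 are standard WHO-style thresholds, an assumption not taken from the text): the coded Smoking variable is stored as a category so its 0/1 codes carry no arithmetic meaning, and the quantitative BMI variable is divided into ordered groups, yielding an ordinal categorical variable.

```python
import pandas as pd

# Hypothetical patient records; column names follow the example in the text.
df = pd.DataFrame({
    "Smoking": [0, 1, 0, 0, 1],                 # coded categorical: 0 = non-smoker, 1 = smoker
    "BMI":     [17.5, 22.0, 27.3, 31.8, 24.9],  # quantitative (continuous)
})

# Treat the coded Smoking variable as a category, not a number:
# its mean or sum has no arithmetic meaning.
df["Smoking"] = df["Smoking"].astype("category")

# Divide the quantitative BMI variable into ordered groups, producing
# an ordinal categorical variable (assumed WHO-style cut points).
df["BMI_group"] = pd.cut(
    df["BMI"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
)

print(df["BMI_group"].tolist())
```

Note that `pd.cut` produces *ordered* categories, matching the ordinal nature of the grouped BMI variable.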

**Categorical** variables are sometimes called qualitative variables, but in this course we’ll use the term “categorical.”

The **types of variables** you are analyzing **directly relate to the available** descriptive and inferential **statistical methods**.

It is important to:

**assess how you will measure the effect of interest** and **know how this determines the statistical methods you can use.**

As we proceed in this course, we will continually emphasize the **types of variables** that are **appropriate for each method we discuss**.

For example:

To compare the number of polio cases in the two treatment arms of the Salk Polio vaccine trial, you could use:

- Fisher’s Exact Test
- Chi-Square Test

To compare blood pressures in a clinical trial evaluating two blood pressure-lowering medications, you could use:

- Two-sample t-Test
- Wilcoxon Rank-Sum Test
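As a rough sketch of how these four tests are invoked in practice (using `scipy.stats`, with made-up counts and blood-pressure values rather than any actual trial data):

```python
from scipy.stats import chi2_contingency, fisher_exact, ranksums, ttest_ind

# Hypothetical 2x2 table of counts (NOT the real Salk trial numbers):
# rows are the two treatment arms, columns are polio / no polio.
table = [[10, 990],
         [30, 970]]

odds_ratio, p_fisher = fisher_exact(table)             # Fisher's Exact Test
chi2, p_chi2, dof, expected = chi2_contingency(table)  # Chi-Square Test

# Hypothetical systolic blood pressures (mmHg) for two medication groups.
drug_a = [128, 131, 125, 136, 129, 133]
drug_b = [140, 138, 135, 142, 139, 144]

t_stat, p_t = ttest_ind(drug_a, drug_b)  # Two-sample t-Test
w_stat, p_w = ranksums(drug_a, drug_b)   # Wilcoxon Rank-Sum Test
```

Each call returns a test statistic and a p-value; which pair of tests applies depends on whether the outcome variable is categorical (counts in a table) or quantitative (measurements in two groups).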

Before we jump into Exploratory Data Analysis and really appreciate its importance in the process of statistical analysis, let’s take a step back for a minute and ask: what are data?

**Data** are pieces of information about **individuals** organized into **variables**.

- By an **individual**, we mean a particular person or object.
- By a **variable**, we mean a particular characteristic of the individual.

A **dataset** is a set of data identified with a particular experiment, scenario, or circumstance.

Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables.

The following dataset shows medical records for a sample of patients.

In this example,

- the **individuals** are patients,
- and the **variables** are Gender, Age, Weight, Height, Smoking, and Race.

Each **row**, then, gives us all of the information about a particular **individual** (in this case, patient), and each **column** gives us information about a particular **characteristic** of all of the patients.
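A tiny, hypothetical version of such a dataset can be built directly as a table (here with pandas; the specific values are invented for illustration):

```python
import pandas as pd

# Each row is one patient (an individual); each column is one variable.
patients = pd.DataFrame({
    "Gender":  ["F", "M", "F"],
    "Age":     [54, 47, 62],
    "Weight":  [140, 185, 152],   # pounds
    "Height":  [64, 70, 66],      # inches
    "Smoking": [0, 1, 0],         # coded categorical: 0 = no, 1 = yes
    "Race":    ["White", "Black", "Asian"],
})

print(patients.shape)   # → (3, 6): 3 individuals, 6 variables
```

Selecting a row returns one patient’s full record; selecting a column returns one characteristic across all patients.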

The rows in a dataset (representing **individuals**) might also be called **observations**, **cases**, or a description that is specific to the individuals and the scenario.

For example, if we were interested in studying flu vaccinations in school children across the U.S., we could collect data where each observation was a

- student
- school
- school district
- city
- county
- state

Each of these would result in a different way to investigate questions about flu vaccinations in school children.

In our course, we will present methods which can be used when the **observations** being analyzed are **independent of each other**. If the observations (rows in our dataset) are not independent, a more complex analysis is needed. Clear violations of independence occur when:

- we have more than one row for a given individual, such as when we gather the same measurements at many different times for individuals in our study, or
- individuals are paired or matched in some way.

As we begin this course, you should start with an awareness of the types of data we will be working with and learn to recognize situations which are more complex than those covered in this course.

The columns in a dataset (representing **variables**) are often grouped and labeled by their role in our analysis.

For example, in many studies involving people, we often collect **demographic** variables such as gender, age, race, ethnicity, socioeconomic status, marital status, and many more.

The **role** a variable plays in our analysis must also be considered.

- In studies where we wish to predict one variable using one or more of the remaining variables, the variable we wish to predict is commonly called the **response** variable, the **outcome** variable, or the **dependent variable**.

- Any variable we are using to predict or explain differences in the outcome is commonly called an **explanatory variable**, an **independent variable**, a **predictor** variable, or a **covariate**.

**Note:** The word “**independent**” is used in statistics in numerous ways. Be careful to understand in what way the words “independent” or “independence” (as well as dependent or dependence) are used when you see them used in the materials.

- Here we have discussed **independent observations** (also called cases, individuals, or subjects).
- We have also used the term **independent variable** as another term for our explanatory variables.
- Later we will learn the formal probability definitions of **independent events** and **dependent events**.
- And when comparing groups we will define **independent samples** and **dependent samples**.

Our first course objective will be addressed throughout the semester in that you will be adding to your understanding of biostatistics in an ongoing manner during the course.

**Biostatistics** is the application of **statistics** to a variety of topics in biology. In this course, we tend to focus on biological topics in the health sciences as we learn about statistics.

In an introductory course such as ours, there is essentially no difference between “biostatistics” and “statistics” and thus you will notice that we focus on learning “statistics” in general but use as many examples from and applications to the health sciences as possible.

**Statistics** is all about **converting data into useful information**. Statistics is therefore a process where we are:

- collecting data,
- summarizing data, and
- interpreting data.

The following video adapted from material available from Johns Hopkins – Introduction to Biostatistics provides a few examples of statistics in use.

The following reading from the online version of Little Handbook of Statistical Practice contains excellent comments about common reasons why many people feel that “statistics is hard” and how to overcome them! We will suggest returning to and reviewing this document as we cover some of the topics mentioned in the reading.

In practice, every **research project** or study involves the following **steps**.

- Planning/design of study
- Data collection
- Data analysis
- Presentation
- Interpretation

The following video adapted from material available at Johns Hopkins – Introduction to Biostatistics provides an overview of the steps in a research project and the role biostatistics and biostatisticians play in each step.

Throughout the course, we will add to our understanding of the definitions, concepts, and processes which are introduced here. You are not expected to gain a full understanding of this process until much later in the course!

To really understand how this process works, we need to put it in a context. We will do that by introducing one of the central ideas of this course, the **Big Picture of Statistics**.

We will introduce the Big Picture by building it gradually and explaining each component.

At the end of the introductory explanation, once you have the full Big Picture in front of you, we will show it again using a concrete example.

The process of statistics starts when we identify what group we want to study or learn something about. We call this group the **population**.

Note that the word “population” here (and throughout the entire course) does not refer only to people; it is used in the broader statistical sense, where a population can consist of people, animals, things, etc. For example, we might be interested in:

- the opinions of the population of U.S. adults about the death penalty; or
- how the population of mice reacts to a certain chemical; or
- the average price of the population of all one-bedroom apartments in a certain city.

The **population**, then, is the entire group that is the target of our interest.

In most cases, the population is so large that as much as we might want to, there is absolutely no way that we can study all of it (imagine trying to get the opinions of all U.S. adults about the death penalty…).

A more practical approach would be to examine and collect data only from a sub-group of the population, which we call a **sample**. We call this first component, which involves choosing a sample and collecting data from it, **Producing Data**.

A **sample** is a subset of the population from which we collect data.

Since, for practical reasons, we must compromise and examine only a sub-group of the population rather than the whole population, we should make an effort to choose a sample that represents the population well.

For example, if we choose a sample from the population of U.S. adults, and ask their opinions about a particular federal health care program, we do not want our sample to consist of only Republicans or only Democrats.

Once the data have been collected, what we have is a long list of answers to questions, or numbers, and in order to explore and make sense of the data, we need to summarize that list in a meaningful way.

This second component, which consists of summarizing the collected data, is called **Exploratory Data Analysis** or **Descriptive Statistics**.

Now we’ve obtained the sample results and summarized them, but we are not done. Remember that our goal is to study the population, so what we want is to be able to draw conclusions about the population based on the sample results.

Before we can do so, we need to look at how the sample we’re using may differ from the population as a whole, so that we can factor that into our analysis. To examine this difference, we use **Probability**, which is the third component in the big picture.

The third component in the Big Picture of Statistics, **probability** is in essence the “machinery” that allows us to draw conclusions about the population based on the data collected in the sample.

Finally, we can use what we’ve discovered about our sample to draw conclusions about our population.

We call this final component in the process **Inference**.

This is the **Big Picture of Statistics**.

At the end of April 2005, a poll was conducted (by ABC News and the Washington Post), for the purpose of learning the opinions of U.S. adults about the death penalty.

**1. Producing Data:** A (representative) sample of 1,082 U.S. adults was chosen, and each adult was asked whether he or she favored or opposed the death penalty.

**2. Exploratory Data Analysis (EDA):** The collected data were summarized, and it was found that 65% of the sampled adults favor the death penalty for persons convicted of murder.

**3 and 4. Probability and Inference:** Based on the sample result (of 65% favoring the death penalty) and our knowledge of probability, it was concluded (with 95% confidence) that the percentage of those who favor the death penalty in the population is within 3% of what was obtained in the sample (i.e., between 62% and 68%). The following figure summarizes the example:
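The quoted 3% margin follows from the standard large-sample 95% confidence interval for a proportion, p̂ ± 1.96·√(p̂(1 − p̂)/n); a quick check with the numbers from the example:

```python
import math

p_hat = 0.65   # sample proportion favoring the death penalty
n = 1082       # sample size

# 95% margin of error for a proportion (large-sample normal approximation)
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

print(round(margin, 3))  # → 0.028, i.e. roughly 3 percentage points
print(round(p_hat - margin, 2), round(p_hat + margin, 2))  # → 0.62 0.68
```

This reproduces the interval of 62% to 68% stated above.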

The structure of this entire course is based on the big picture.

The course will have four units, one for each of the components in the big picture.

As the figure below shows, even though it is second in the process of statistics, we will start this course with exploratory data analysis (EDA), continue to discuss producing data, then go on to probability, so that at the end we will be able to discuss inference.

The main reason we begin with EDA is that we need to understand enough about what we want to do with our data before we can discuss the issues related to how to collect it!

This also allows us to introduce many important concepts early in the course so that you will have ample time to master them before we return to inference at the end of the course.

The following figure summarizes the structure of the course.

As you will see, the Big Picture is the basis upon which the entire course is built, both conceptually and structurally.

We will refer to it often, and having it in mind will help you as you go through the course.
