Exploratory Data Analysis


Introduction

We begin with a review of exploratory data analysis (also commonly called descriptive statistics) and consider why these methods remain important in regression modeling.

When we look at one variable at a time, we are examining distributions. When we have two variables, we are examining the relationship between them.

The two types of exploratory data analysis are:

  • Visual displays including graphs and tables
  • Numerical measures including frequencies, percentages, means, standard deviations, etc.

Exploratory Data Analysis (EDA) is particularly useful for

  • describing the distribution of one variable
  • investigating relationships between two variables
  • checking our data for errors, and
  • investigating the validity of assumptions

Missing Data

Although we will generally not ask you to work with missing data in this course, it is good to be exposed to the issues related to missing data.

Here is a short reading with some information about missing data in SAS.

Here is a FAQ from UCLA Statistical Computing. Their site likely has much more information about missing data.

Finally, here is a video illustrating how to handle missing data in SAS.

Data Checking

In complex statistical problems, exploratory data analysis is often used to check our data for inconsistencies, errors, or other problems.

Data entry programs can be set up to automatically screen for many errors to catch problems before the analysis stage. This is especially useful for checking large datasets and for logical checks involving two or more variables.

We can manually detect many problems with exploratory data analysis using:

  • Frequency distributions for categorical variables
  • Numerical summaries including the min and max for quantitative variables
  • Appropriate graphical displays
    • For one quantitative variable: histograms or boxplots
    • For two quantitative variables: scatterplots
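As a rough illustration of the first two items, the sketch below builds a frequency distribution for a categorical variable and a numerical summary (including the min and max) for a quantitative variable. It uses only the Python standard library. The records, the variable names `blood_type` and `age`, and the values are all made up for this example and are not course data.

```python
from collections import Counter
import statistics

# Hypothetical records; names and values are invented for illustration.
records = [
    {"blood_type": "A", "age": 34},
    {"blood_type": "O", "age": 29},
    {"blood_type": "O", "age": 41},
    {"blood_type": "AB", "age": 25},
    {"blood_type": "B", "age": 999},  # suspicious value we hope to notice
]

# Frequency distribution for a categorical variable.
blood_type_counts = Counter(r["blood_type"] for r in records)

# Numerical summary for a quantitative variable.
ages = [r["age"] for r in records]
age_summary = {
    "n": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
}

print(blood_type_counts)
print(age_summary)
```

In this toy data the maximum age of 999 stands out immediately, and the large gap between the mean and the median is a second hint that something is wrong.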

Some of the types of problems we might find are:

  • Values outside of the expected range
  • Values of the wrong type
  • Impossible values
  • Missing data coded as 999
  • Not applicable, blank, or missing data coded as 0
  • Data entry errors
  • Data for one column was entered in an adjacent column
  • Coding, recording, or measurement errors
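For instance, missing data coded as 999 will distort a mean unless the code is converted to a true missing value first. The following is a minimal sketch of that recoding step, assuming a hypothetical set of sentinel codes (`999` and `-9`) and made-up systolic pressure values; real codebooks differ, so the codes should always be taken from the study documentation.

```python
# Hypothetical sentinel codes; check the study codebook for the real ones.
MISSING_CODES = {999, -9}

def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace sentinel codes with None so they are not treated as real data."""
    return [None if v in missing_codes else v for v in values]

# Made-up systolic blood pressure readings with two sentinel codes.
systolic = [118, 999, 131, 124, -9, 140]

cleaned = recode_missing(systolic)
observed = [v for v in cleaned if v is not None]

naive_mean = sum(systolic) / len(systolic)   # includes 999 and -9
clean_mean = sum(observed) / len(observed)   # observed values only

print(cleaned)
print(naive_mean, clean_mean)
```

A value of 0 is harder to handle this way because 0 is often a legitimate measurement, so it usually needs a variable-by-variable decision rather than a blanket rule.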

Here are a few specific examples.

EXAMPLE:

  • Proportion entered as “percentage” (e.g., 0.5 entered as 50; so the value is 100 times too large!)
  • Fifth blood type beyond A, B, AB, and O
  • Values for age outside 20-40 for a study which only enrolled patients between age 20 and 40
  • Invalid dates, for example April 31 or February 30
  • Number of previous pregnancies should be missing or NA for men
  • Implausible combinations: it is unlikely that a subject is at the 5th percentile of the distribution of systolic pressure but at the 95th percentile for diastolic pressure
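Several of the examples above can be written as simple rules and applied to every record. The sketch below is one possible way to do this with the Python standard library. The field names (`proportion`, `blood_type`, `age`, `visit_date`, `sex`, `previous_pregnancies`), the coding of sex as "M"/"F", and both records are assumptions made for illustration only.

```python
from datetime import date

VALID_BLOOD_TYPES = {"A", "B", "AB", "O"}

def is_valid_date(year, month, day):
    """Return True if the calendar date exists (April 31 does not)."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

def check_record(rec):
    """Return a list of problems found in one hypothetical record."""
    problems = []
    if not 0 <= rec["proportion"] <= 1:
        problems.append("proportion outside [0, 1]")
    if rec["blood_type"] not in VALID_BLOOD_TYPES:
        problems.append("unknown blood type")
    if not 20 <= rec["age"] <= 40:
        problems.append("age outside enrollment range 20-40")
    if not is_valid_date(*rec["visit_date"]):
        problems.append("invalid date")
    # Logical check involving two variables.
    if rec["sex"] == "M" and rec["previous_pregnancies"] is not None:
        problems.append("pregnancy count recorded for a man")
    return problems

good = {"proportion": 0.5, "blood_type": "AB", "age": 33,
        "visit_date": (2023, 4, 30), "sex": "F", "previous_pregnancies": 2}
bad = {"proportion": 50, "blood_type": "C", "age": 57,
       "visit_date": (2023, 4, 31), "sex": "M", "previous_pregnancies": 0}

print(check_record(good))
print(check_record(bad))
```

Rules like these catch impossible values, but the last example in the list (an implausible combination of percentiles) generally needs a plot or a judgment call rather than a hard rule.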

You can see that some of these problems could be very difficult to find, especially for large datasets.

Data checking and preparation are extremely important and can often be the most time-consuming part of any real-world data analysis project!

If we have errors in our data, the conclusions from our subsequent analysis can be entirely incorrect.

PRINCIPLE: In order to obtain a useful regression model, it is essential to have good data that has been well-checked and cleaned as needed.

Types of Variables

How the information we gather is recorded into variables determines the methods we can apply.

There are two main types of variables: Categorical and Quantitative.

(Note: some texts use “numeric” or “numerical” instead of Quantitative. We will stick with Quantitative for these materials.)

Quantitative variables represent a measurement or count.

  • We can sub-classify quantitative variables as
    • discrete (gaps between possible values) or
    • continuous (can take on any value in an interval).
  • Calculations such as the mean and standard deviation make sense for these variables.

Categorical variables classify individuals into different groups.

  • Categorical variables can be sub-classified as
    • nominal (no natural ordering) or
    • ordinal (natural ordering)
  • Binary variables are categorical variables with only two levels.

Quantitative variables are sometimes categorized and used as categorical variables in our analysis, for example:

  • Age groups (20-29, 30-39, 40-49, …)
  • BMI categories (Underweight, Normal, Overweight, Obese)
  • High blood pressure (Yes/No)
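To make the idea of categorizing a quantitative variable concrete, here is a small sketch using the Python standard library. The BMI cutoffs (18.5, 25, and 30) are the commonly used adult thresholds and are an assumption here, since these notes do not state them. The function names are hypothetical.

```python
from bisect import bisect_right

# Commonly used adult BMI cutoffs (an assumption; not given in these notes).
BMI_CUTOFFS = [18.5, 25.0, 30.0]
BMI_LABELS = ["Underweight", "Normal", "Overweight", "Obese"]

def bmi_category(bmi):
    """Map a quantitative BMI value to an ordinal category."""
    # bisect_right counts how many cutoffs are <= bmi, which is the label index.
    return BMI_LABELS[bisect_right(BMI_CUTOFFS, bmi)]

def age_group(age):
    """Map an age in whole years to a ten-year group such as '20-29'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

print([bmi_category(b) for b in (17.0, 22.4, 27.9, 31.2)])
print([age_group(a) for a in (20, 29, 30, 45)])
```

Note that categorizing discards information: two people with BMI values of 25.1 and 29.9 land in the same group even though their measurements differ considerably.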

The mathematics of our underlying statistical methods and interpretations of the results are determined by the types of variables used in the analysis.

  • For quantitative outcome variables, we often work with the mean response whereas for categorical outcomes, we work with percentages, probabilities, risks, and odds.
  • For quantitative predictor variables, we are interested in how the response variable changes for each 1-unit increase in our predictor whereas for categorical predictors, we are interested in a comparison of the response variable between categories.
  • Although mathematically similar in that we want to understand how the response changes, the interpretations are different and those differences are important for being able to make sense of what our analysis means in practice.
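The contrast between the two interpretations can be seen in a tiny numerical sketch. Below, simple least squares is fit by hand, once with a quantitative predictor and once with a binary predictor coded 0/1. All data values and variable names are invented for illustration. With a 0/1 predictor, the fitted slope equals the difference between the two group means, which is why it is read as a comparison between categories rather than a change per unit.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the simple least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Quantitative predictor: made-up ages and systolic pressures.
age = [20, 30, 40, 50]
sbp_by_age = [110, 115, 120, 125]
b0_age, b1_age = least_squares(age, sbp_by_age)
# b1_age: estimated change in mean response per 1-year increase in age.

# Binary predictor coded 0/1: made-up treatment indicator.
treated = [0, 0, 1, 1]
sbp_by_group = [120, 124, 110, 114]
b0_trt, b1_trt = least_squares(treated, sbp_by_group)
# b1_trt: difference in mean response, treated minus untreated.

mean_untreated = sum(y for t, y in zip(treated, sbp_by_group) if t == 0) / 2
mean_treated = sum(y for t, y in zip(treated, sbp_by_group) if t == 1) / 2

print(b0_age, b1_age)
print(b0_trt, b1_trt, mean_treated - mean_untreated)
```

The arithmetic is the same in both fits; only the meaning of a "1-unit increase" differs.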

Summary

In this course we will learn to model relationships between more than two variables. However, we will still use exploratory data analysis methods for checking data, investigating assumptions, summarizing our data, and investigating which variables are related.

When we look at one variable at a time, we are examining distributions. When we have two variables, we are examining the relationship between them.

The two types of exploratory data analysis are:

  • Visual displays including graphs and tables
  • Numerical measures including frequencies, percentages, means, standard deviations, etc.

We must also consider the types of variables we are using, as this affects which methods we can use and how we interpret the results.