# Exploratory Data Analysis

- Introduction
- Missing Data with link to Video (8:00)
- Data Checking
- Types of Variables
- Summary

## Introduction

We begin with a review of **exploratory data analysis** (also commonly called **descriptive statistics**) and considerations for why these methods are still important in regression modeling.

When we look at one variable at a time, we are **examining distributions**. When we have two variables, we are **examining** the **relationship** between them.

The two types of exploratory data analysis are:

- **Visual displays** including graphs and tables
- **Numerical measures** including frequencies, percentages, means, standard deviations, etc.

**Exploratory Data Analysis (EDA)** is particularly useful for

- describing the distribution of one variable
- investigating relationships between two variables
- checking our data for errors and
- investigating the validity of assumptions

## Missing Data

Although we will generally not ask you to work with missing data in this course, it is good to be exposed to the issues related to missing data.

Here is a short reading with some information about missing data in SAS.

**Reading:** Missing values in SAS

Here is a FAQ from UCLA Statistical Computing. Their site likely has much more information about missing data.

**UCLA Statistical Computing:** SAS FAQ: See number of missing values and patterns of missing values

Finally, here is a video illustrating how to handle missing data in SAS.

**Video (8:00):** SAS Missing Data – Stanford University Libraries

## Data Checking

In complex statistical problems, **exploratory data analysis** is often used to **check our data** for inconsistencies, errors, or other problems.

Data entry programs can be set up to **automatically screen** for many errors to catch problems before the analysis stage. This is especially useful for checking large datasets and for logical checks involving two or more variables.

We can manually detect many problems with exploratory data analysis using:

- Frequency distributions for categorical variables
- Numerical summaries including the min and max for quantitative variables
- Appropriate graphical displays
  - For One Quantitative: Histograms/Boxplots
  - For Two Quantitative: Scatterplots
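The first two checks in this list can be sketched in a few lines. The course materials use SAS; the example below uses Python's standard library purely for illustration, and the `records` list with its `blood_type` and `age` fields is hypothetical data invented for this sketch.

```python
from collections import Counter

# Hypothetical records; the field names and values are invented for illustration.
records = [
    {"blood_type": "A", "age": 25},
    {"blood_type": "O", "age": 31},
    {"blood_type": "AB", "age": 38},
    {"blood_type": "C", "age": 29},   # not a valid blood type
    {"blood_type": "B", "age": 240},  # likely a data entry error
    {"blood_type": "O", "age": 22},
]

# Frequency distribution for a categorical variable.
blood_type_counts = Counter(r["blood_type"] for r in records)
for level, count in sorted(blood_type_counts.items()):
    print(f"{level:>2}: {count} ({100 * count / len(records):.1f}%)")

# Levels that are not among the expected categories.
expected_levels = {"A", "B", "AB", "O"}
unexpected_levels = sorted(set(blood_type_counts) - expected_levels)

# Min and max for a quantitative variable.
ages = [r["age"] for r in records]
age_min, age_max = min(ages), max(ages)

print("Unexpected blood types:", unexpected_levels)
print("Age range:", age_min, "to", age_max)
```

A frequency table makes an unexpected category visible at a glance, and the min and max are often enough to reveal a value that cannot be right, even before any graph is drawn.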

Some of the **types of problems** we might find are:

- Values outside of the expected range
- Values of the wrong type
- Impossible values
- Missing data coded as 999
- Not applicable, blank, or missing data coded as 0
- Data entry errors
- Data for one column was entered in an adjacent column
- Coding, recording, or measurement errors
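To see why a missing-value code such as 999 matters, consider the following minimal sketch. It is written in Python rather than SAS purely for illustration, and the blood pressure readings and the `MISSING_CODE` name are assumptions made up for this example.

```python
from statistics import mean

# Hypothetical systolic blood pressure readings; 999 is assumed to be the
# code that was used for "missing" when the data were entered.
MISSING_CODE = 999
systolic = [118, 124, 999, 131, 110, 999, 127]

# Treating the code as a real measurement inflates the mean.
naive_mean = mean(systolic)

# Recode the sentinel to None, then summarize only the observed values.
recoded = [None if value == MISSING_CODE else value for value in systolic]
observed = [value for value in recoded if value is not None]
n_missing = len(recoded) - len(observed)
clean_mean = mean(observed)

print(f"Mean with 999 left in: {naive_mean:.1f}")
print(f"Mean after recoding:   {clean_mean:.1f} ({n_missing} missing)")
```

If the code is left in place, every numerical summary of the variable is distorted, which is why the max of a quantitative variable is worth inspecting before any analysis.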

Here are a few specific examples.

## EXAMPLE:

- Proportion entered as “percentage” (e.g., 0.5 entered as 50; so the value is 100 times too large!)
- Fifth blood type beyond A, B, AB, and O
- Values for age outside 20-40 for a study which only enrolled patients between age 20 and 40
- Invalid dates, for example April 31 or February 30
- Number of previous pregnancies should be missing or NA for men
- It is unlikely that a subject is at the 5th percentile of the distribution of systolic pressure but at the 95th percentile for diastolic pressure

You can see that some of these problems could be very difficult to find, especially for large datasets.
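Several of the examples above can be turned into automatic screening rules. The sketch below is one possible way to do this, written in Python for illustration (the course itself uses SAS). The record layout, the field names, and the `check_record` helper are all hypothetical.

```python
from datetime import date

def is_valid_date(year, month, day):
    """Return True if the year/month/day combination is a real calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

def check_record(record):
    """Return a list of problems found in one (hypothetical) patient record."""
    problems = []
    if not is_valid_date(*record["visit_date"]):
        problems.append("invalid visit date")
    if not 20 <= record["age"] <= 40:
        problems.append("age outside enrollment range 20-40")
    if not 0 <= record["proportion"] <= 1:
        problems.append("proportion outside 0-1 (entered as a percentage?)")
    if record["sex"] == "M" and record["previous_pregnancies"] is not None:
        problems.append("previous pregnancies recorded for a man")
    return problems

# Invented example records: the first is clean, the second breaks every rule.
records = [
    {"id": 1, "visit_date": (2023, 4, 30), "age": 27, "proportion": 0.5,
     "sex": "F", "previous_pregnancies": 2},
    {"id": 2, "visit_date": (2023, 4, 31), "age": 45, "proportion": 50,
     "sex": "M", "previous_pregnancies": 0},
]

report = {r["id"]: check_record(r) for r in records}
for record_id, problems in report.items():
    print(record_id, problems if problems else "no problems found")
```

Rules like these catch impossible values and logical inconsistencies between variables, but they cannot catch a value that is wrong yet plausible, so they complement rather than replace a careful look at the data.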

**Data checking and preparation is extremely important** and can often be the most time-consuming part of any real-world data analysis project!

If we have errors in our data, the conclusions from our subsequent analysis can be entirely incorrect.

**PRINCIPLE:** In order to obtain a **useful regression model**, it is essential to have **good data** that has been **well-checked** and **cleaned** as needed.

## Types of Variables:

How the information we gather is recorded into variables determines the methods we can apply.

We distinguish two main types of variables: **Categorical** and **Quantitative**.

(Note – some texts use “numeric” or “numerical” instead of Quantitative. We will stick with Quantitative for these materials).

**Quantitative** variables represent a measurement or count.

- We can sub-classify quantitative variables as **discrete** (gaps between possible values) or **continuous** (can take on any value in an interval).

- Calculations such as the mean and standard deviation make sense for these variables.

**Categorical** variables classify individuals into different groups.

- Categorical variables can be sub-classified as **nominal** (no natural ordering) or **ordinal** (natural ordering).

**Binary** variables are categorical variables with only **two levels**.

**Quantitative** variables are sometimes **categorized** and **used as categorical variables** in our analysis, for example:

- Age groups (20-29, 30-39, 40-49, …)
- BMI categories (Underweight, Normal, Overweight, Obese)
- High blood pressure (Yes/No)
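As a minimal sketch of how such categorizations might be coded (in Python for illustration; the course itself uses SAS), the functions below map quantitative values to categories. The function names are hypothetical, and the cutoffs are assumptions not taken from these materials: 18.5, 25, and 30 are commonly cited BMI thresholds, and 140/90 is used only as an illustrative blood pressure threshold.

```python
def age_group(age):
    """Map an age in whole years to a decade label such as '30-39'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

def bmi_category(bmi):
    """Classify BMI using cutoffs of 18.5, 25, and 30 (assumed for this sketch)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"

def high_blood_pressure(systolic, diastolic, sys_cutoff=140, dia_cutoff=90):
    """Binary indicator; the 140/90 defaults are illustrative, not from the course."""
    return "Yes" if systolic >= sys_cutoff or diastolic >= dia_cutoff else "No"

print(age_group(34))
print(bmi_category(27.3))
print(high_blood_pressure(150, 85))
```

Note that categorizing discards information: once age is recorded only as a group, two subjects aged 30 and 39 become indistinguishable, and the analysis then treats the variable as ordinal (or binary) rather than quantitative.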

The mathematics of our underlying statistical methods and interpretations of the results are determined by the types of variables used in the analysis.

- For quantitative outcome variables, we often work with the mean response whereas for categorical outcomes, we work with percentages, probabilities, risks, and odds.
- For quantitative predictor variables, we are interested in how the response variable changes for each 1-unit increase in our predictor whereas for categorical predictors, we are interested in a comparison of the response variable between categories.
- Although mathematically similar in that we want to understand how the response changes, the interpretations are different and those differences are important for being able to make sense of what our analysis means in practice.
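The distinctions above can be illustrated with a small numerical sketch. It uses Python for illustration only (the course uses SAS), and all of the data values and variable names below are invented. The slope is computed with the usual least-squares formula for a single predictor.

```python
from statistics import mean

# Hypothetical data, invented for illustration only.
ages = [25, 30, 35, 40, 45]                # quantitative predictor
systolic = [116, 119, 122, 125, 128]       # quantitative outcome
smoker = ["No", "No", "Yes", "No", "Yes"]  # categorical predictor
hypertension = [0, 0, 1, 0, 1]             # binary (categorical) outcome

# Quantitative outcome: summarize with the mean response.
mean_systolic = mean(systolic)

# Categorical outcome: summarize with a proportion (risk) and the odds.
risk = sum(hypertension) / len(hypertension)
odds = risk / (1 - risk)

# Quantitative predictor: least-squares slope, the estimated change in the
# mean response for each 1-unit increase in the predictor.
x_bar, y_bar = mean(ages), mean(systolic)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, systolic))
         / sum((x - x_bar) ** 2 for x in ages))

# Categorical predictor: compare the mean response between categories.
mean_yes = mean([y for y, s in zip(systolic, smoker) if s == "Yes"])
mean_no = mean([y for y, s in zip(systolic, smoker) if s == "No"])
difference = mean_yes - mean_no

print(f"Mean systolic pressure: {mean_systolic}")
print(f"Risk of hypertension: {risk:.2f}, odds: {odds:.2f}")
print(f"Change in systolic pressure per 1-year increase in age: {slope:.2f}")
print(f"Difference in mean systolic pressure, smokers vs. non-smokers: {difference}")
```

The slope is read "per 1-unit increase" in the predictor, while the difference is read as a comparison between two groups. Both describe how the response changes, but they answer differently worded questions.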

## Summary

In this course we will learn to model **relationships between more than two variables**; however, we will still use **exploratory data analysis** methods for data checking, investigating assumptions, summarizing our data, and investigating which variables are related.

When we look at one variable at a time, we are **examining distributions**. When we have two variables, we are **examining** the **relationship** between them.

The two types of exploratory data analysis are:

- **Visual displays** including graphs and tables
- **Numerical measures** including frequencies, percentages, means, standard deviations, etc.

We must also consider the **types of variables** we are using as this will impact **which methods** we can use and the **interpretation** of the results.