# EDA for Two Variables

## Introduction and Links to Materials

Now let’s look at two-variable methods. The materials from 6052 linked below provide more details.

Review from 6052 Materials:

SAS Tutorials:

Useful SAS Procedures

• PROC CORR
• PROC SGPLOT
• PROC MEANS
• PROC UNIVARIATE
• PROC FREQ

In regression analysis, we study relationships among variables and distinguish the roles of the variables as follows:

• Outcome or Response variable: to be predicted from remaining variables (dependent variable)
• Predictor or Explanatory variable: variable used to make prediction (covariate, independent variable)

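When there are only two variables, we can classify the pair into one of four cases according to whether the predictor and the outcome are each categorical (C) or quantitative (Q). The case labels used below list the predictor first and the outcome second: Case CQ, Case CC, Case QQ, and Case QC.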

PRINCIPLE: In regression analysis, our methods are determined by the outcome variable and whether it is continuous, a discrete count, binary categorical, or multi-level categorical.

We will focus on continuous and binary outcome variables in this course but will touch on discrete counts and build a foundation for you to explore other regression models on your own.

Luckily, in regression analysis, the outcome variable is clearly indicated by the goal of the regression itself.

Let’s review methods for investigating the relationship between two variables starting with a continuous outcome variable.

## Continuous Outcome with Continuous Predictor

When we have a continuous outcome and a continuous predictor (Case QQ, not to be confused with “QQ-plots”) we use

• Pearson’s correlation coefficient, r (with confidence interval and/or p-value)
• Scatterplot
• Regression!!

We will come back to a full discussion of simple linear regression soon and cover that topic in detail. For now, we will discuss correlation and scatterplots.

## Pearson’s correlation coefficient

Review from 6052 SAS Tutorials:  Topic 9C – (3:46) Pearson’s Correlation Coefficient
• Scale-free measure of association. For example, we obtain the same value of Pearson’s correlation coefficient whether we represent height in feet, inches, or centimeters. Note that this is NOT true for the slope in a regression equation.
• Between -1 and 1
• A value of 0 indicates the line of best fit through the data is flat; there is no linear association between the two variables (a nonlinear association may still exist)
• A value of 1 or -1 indicates the line of best fit is a perfect fit of the data: every point falls exactly on the line
• Positive value indicates an increasing relationship
• Negative value indicates a decreasing relationship
• Only measures the strength and direction of the linear relationship or association between two variables.
• If the variables are not linearly related, the correlation only expresses information about the best line through the data, which may not be at all helpful!!
• **We must verify linearity using a scatterplot before interpreting Pearson’s correlation.**
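
The properties listed above can be checked numerically. The course itself uses SAS (PROC CORR); the following is only a minimal sketch in Python's standard library, using made-up toy values rather than the NHANES data. The helper names `pearson_r` and `ls_slope` are introduced here for illustration. It shows that r is unchanged when height is converted from inches to centimeters while the regression slope changes, and that a clearly curved relationship can still give r = 0.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ls_slope(x, y):
    """Least-squares slope from regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Made-up toy data (NOT from NHANES): height in inches, weight in pounds.
height_in = [60, 62, 65, 67, 70, 72]
weight_lb = [115, 120, 140, 150, 165, 180]
height_cm = [h * 2.54 for h in height_in]

r_in = pearson_r(height_in, weight_lb)
r_cm = pearson_r(height_cm, weight_lb)
slope_in = ls_slope(height_in, weight_lb)
slope_cm = ls_slope(height_cm, weight_lb)

print(f"r using inches = {r_in:.4f}, r using cm = {r_cm:.4f}")
print(f"slope per inch = {slope_in:.3f}, slope per cm = {slope_cm:.3f}")

# A curved relationship (y = x squared on symmetric x values) gives r = 0
# even though y is completely determined by x.
x_sym = [-2, -1, 0, 1, 2]
y_sq = [v ** 2 for v in x_sym]
r_curve = pearson_r(x_sym, y_sq)
print(f"r for the curved example = {r_curve:.4f}")
```

The two correlations agree, while the slope per inch is 2.54 times the slope per centimeter. The curved example is the reason a scatterplot should come first.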

## EXAMPLE: NHANES DATA – Correlation

• Dataset: nh_2000a.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
• SAS Code and Output: Unit1-TwoVariables-Correlation.pdf

From the output we have the correlation between systolic blood pressure and weight along with a confidence interval.

• The estimated correlation is 0.0986 with 95% confidence interval (0.055, 0.142).

Note: With PROC CORR, there are other graph options and other statistics available (such as Spearman’s Rank Correlation).
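
One common way to build a confidence interval for a correlation is Fisher's z transformation. Whether the course output was produced in exactly this way (for example, with or without a bias adjustment) is an assumption here, so treat the following as a sketch of the general idea rather than a reproduction of the SAS output. It is written in Python's standard library; the function name `fisher_ci` and the inputs r = 0.5 and n = 103 are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z.

    r: sample Pearson correlation, n: number of paired observations.
    """
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error of z
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = math.tanh(z - z_crit * se)  # back-transform to the r scale
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Hypothetical inputs, not taken from the NHANES output.
lo, hi = fisher_ci(r=0.5, n=103)
print(f"95% CI for r: ({lo:.3f}, {hi:.3f})")
```

Because the interval is built on the z scale and then back-transformed, it always stays inside (-1, 1) and is generally not symmetric around r.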

## LEARN BY DOING

Complete the following using the output provided above for this example.

• The estimated correlation between systolic blood pressure and age is ____ with 95% confidence interval ( ____ , ____ ).
• Discuss your preferences regarding the graphs which we use in PROC CORR.

## Scatterplots

• Plot outcome variable (vertical axis) vs. predictor variable (horizontal axis)
• Can use LOWESS smoother to determine if relationship is approximately linear
• LOWESS: LOcally WEighted Scatterplot Smoother
• Draws a smooth curve that expresses the average value of the outcome as a function of the predictor

Review from 6052 SAS Tutorials:  Topic 9A – (3:53) Basic Scatterplots
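
To make the idea of a LOWESS smoother concrete, here is a bare-bones sketch in Python's standard library. It is a simplified version (tricube weights, local straight-line fits, no robustness iterations) and is not claimed to match what SAS computes. The function name `lowess`, the `frac` parameter, and the toy data are assumptions made for illustration.

```python
import math


def lowess(x, y, frac=0.5):
    """Basic LOWESS: at each x value, fit a tricube-weighted straight line
    to the nearest points and return the fitted value at that x."""
    n = len(x)
    k = max(3, math.ceil(frac * n))  # number of neighbors per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[min(k, n) - 1]  # bandwidth: distance to k-th nearest point
        if h == 0:
            # All nearest points share the same x value: average them.
            w = [1.0 if xi == x0 else 0.0 for xi in x]
        else:
            w = [(1 - (abs(xi - x0) / h) ** 3) ** 3 if abs(xi - x0) < h else 0.0
                 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Made-up toy data (NOT from NHANES).
x_vals = list(range(11))
y_line = [2 + 3 * v for v in x_vals]   # exactly linear
y_curve = [v ** 2 for v in x_vals]     # clearly curved

line_fit = lowess(x_vals, y_line)
curve_fit = lowess(x_vals, y_curve)

for xv, a, b in zip(x_vals, line_fit, curve_fit):
    print(f"x={xv:2d}  smooth(linear y)={a:7.2f}  smooth(curved y)={b:7.2f}")
```

When the relationship is truly linear the smoother simply reproduces the line. When the smoothed curve bends, as in the second column, that is a signal that Pearson's correlation and a straight-line fit may be misleading.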

## EXAMPLE: NHANES DATA – Scatterplots

• Dataset: nh_2000a.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
• SAS Code and Output: Unit1-TwoVariables-Scatterplots.pdf

## LEARN BY DOING

Answer the following using the output provided above for this example.

• Compare the 6 scatterplots on page 3 using Y = SBP to those on page 4 using Y = LOGSBP.  What are the main differences in the overall patterns seen in these 6 plots when using Y=LOGSBP as compared to using Y = SBP?
• Compare the 6 scatterplots on page 3 using Y = SBP to those on page 5 using Y = SBP_INV.  What are the main differences in the overall patterns seen in these 6 plots when using Y=SBP_INV as compared to using Y = SBP?
• There are at least two suspicious points. Can you identify them?

## Continuous Outcome with Categorical Predictor

Review from 6052 SAS Tutorials:

When we have a continuous outcome and a categorical predictor (Case CQ) we use

• Descriptive statistics in each category
• Side-by-side boxplots
• For inferential methods we use:
• T-tests, One-Way ANOVA
• Wilcoxon Rank-Sum (equivalently, Mann-Whitney U)
• (Regression!) We will also see that linear regression with a categorical predictor is equivalent to One-Way ANOVA and, for two groups, to the Two-Sample T-Test assuming equal variances.

In essence, for exploratory data analysis in this scenario with a continuous outcome and a categorical predictor, we will separately apply numerical or graphical methods described before for one quantitative variable in each category of the predictor.  Results can then be compared to investigate differences in the outcome variable across categories of the predictor.

We can use side-by-side boxplots to give a graphical view of the distribution of the outcome variable within each category of the predictor variable. This plot allows for easy comparisons across categories.
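
The following sketch illustrates the Case CQ workflow numerically: descriptive statistics within each category, followed by a check that regressing the outcome on a 0/1 group indicator gives the same answer as the two-sample t-test with equal variances. It is written in Python's standard library rather than SAS, and the data, the group labels, and the helper names `pooled_t` and `regression_t` are all made up for illustration.

```python
import math
from statistics import mean, stdev

# Made-up toy data (NOT from NHANES): a continuous outcome in two groups.
group_a = [118, 122, 125, 130, 121, 127]
group_b = [128, 135, 131, 140, 126, 138]

# Descriptive statistics within each category of the predictor.
for name, values in (("group A", group_a), ("group B", group_b)):
    print(f"{name}: n={len(values)}, mean={mean(values):.2f}, sd={stdev(values):.2f}")


def pooled_t(a, b):
    """Two-sample t statistic (b minus a) assuming equal variances."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(sp2 * (1 / na + 1 / nb))


def regression_t(x, y):
    """Slope and its t statistic from simple linear regression of y on x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / se_slope


# Code group membership as a 0/1 indicator and regress the outcome on it.
indicator = [0] * len(group_a) + [1] * len(group_b)
outcome = group_a + group_b

slope, t_reg = regression_t(indicator, outcome)
t_pooled = pooled_t(group_a, group_b)
mean_diff = mean(group_b) - mean(group_a)

print(f"difference in means = {mean_diff:.4f}, regression slope = {slope:.4f}")
print(f"pooled t = {t_pooled:.4f}, regression t = {t_reg:.4f}")
```

The slope on the indicator equals the difference in group means, and the two t statistics coincide.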

## EXAMPLE: NHANES DATA – Case CQ

• Dataset: nh_2000a.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
• SAS Code and Output: Unit1-TwoVariables-CaseCQ.pdf

## LEARN BY DOING

Answer the following using the output provided above for this example.

• For which quantitative variable are the differences between smoking status groups most extreme? Describe the pattern you see, including the group means in your discussion.
• For which quantitative variable are the differences between smoking status groups least extreme? Describe the pattern you see, including the group means in your discussion.

## Categorical Outcome with Categorical Predictor

Review from 6052 SAS Tutorials:  Topic 6A – (3:07) Two-Way (Contingency) Tables – EDA

When we have a categorical outcome and a categorical predictor (Case CC) we use:

• A contingency table or two-way table
• Row and/or column percentages (conditional percentages)
• These percentages let us compare the distribution of the outcome among the levels of the predictor variable.
• Chi-square Test, Fisher’s Exact Test
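
As a rough illustration of these tools, the sketch below builds row percentages and the Pearson chi-square statistic for a small two-way table. The course does this with PROC FREQ in SAS; this is a Python standard-library sketch, and the counts, labels, and function names are made up for illustration only.

```python
# Made-up toy counts (NOT from NHANES).
# Rows are levels of the predictor, columns are levels of the outcome.
table = {
    "exposed": {"yes": 30, "no": 70},
    "unexposed": {"yes": 15, "no": 85},
}


def row_percentages(tab):
    """Conditional percentages of the outcome within each predictor level."""
    result = {}
    for row, cells in tab.items():
        total = sum(cells.values())
        result[row] = {col: 100 * count / total for col, count in cells.items()}
    return result


def chi_square(tab):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    rows = list(tab)
    cols = list(next(iter(tab.values())))
    row_tot = {r: sum(tab[r].values()) for r in rows}
    col_tot = {c: sum(tab[r][c] for r in rows) for c in cols}
    n = sum(row_tot.values())
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / n
            stat += (tab[r][c] - expected) ** 2 / expected
    return stat


pct = row_percentages(table)
stat = chi_square(table)

for row, cells in pct.items():
    print(row, {col: f"{p:.1f}%" for col, p in cells.items()})
print(f"chi-square statistic = {stat:.4f}")
```

Row percentages are used here because the predictor defines the rows. If the predictor were placed in the columns, column percentages would be the relevant comparison.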

## EXAMPLE: NHANES DATA – Case CC

• Dataset: nh_2000a.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
• SAS Code and Output: Unit1-TwoVariables-CaseCC.pdf

## LEARN BY DOING

Review the output provided above for this example and provide a discussion of any obvious associations you see in these tables.

## Categorical Outcome with Continuous Predictor

When we have a categorical outcome and a continuous predictor (Case QC) we can:

• Break continuous predictor into categories and make contingency table (Chi-square Test, Fisher’s Exact)
• Avoid too few observations in some categories
• In some cases, cutoff points can be determined on a non-statistical basis; in other situations, they can be chosen based on percentiles of the predictor
• Or turn the situation around and use side-by-side boxplots with ANOVA or a T-test as in Case CQ.
• We will learn Logistic Regression for binary categorical outcomes later in this course.
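
The first approach above, cutting the continuous predictor at percentiles and tabulating the outcome within each group, can be sketched as follows. This uses Python's standard library rather than SAS, and the ages, the 0/1 outcome values, and the helper name `bin_index` are made up for illustration.

```python
from statistics import quantiles

# Made-up toy data (NOT from NHANES): a continuous predictor and a 0/1 outcome.
ages = [22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58, 61, 64, 67]
outcome = [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1]

# Quartile cutpoints: three cut values define four groups.
cuts = quantiles(ages, n=4)


def bin_index(value, cutpoints):
    """Index of the group a value falls into (0 = lowest group)."""
    return sum(value > c for c in cutpoints)


n_bins = len(cuts) + 1
counts = [0] * n_bins   # observations per group
events = [0] * n_bins   # observations with outcome = 1 per group
for age, y in zip(ages, outcome):
    b = bin_index(age, cuts)
    counts[b] += 1
    events[b] += y

proportions = [e / c for e, c in zip(events, counts)]

print("cutpoints:", cuts)
for b in range(n_bins):
    print(f"group {b + 1}: n={counts[b]}, outcome=1 in {events[b]}, "
          f"proportion={proportions[b]:.2f}")
```

Percentile-based cutpoints keep the groups roughly equal in size, which helps avoid the sparse categories warned about above.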

## EXAMPLE: NHANES DATA – Case QC

• Dataset: nh_2000a.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
• SAS Code and Output: Unit1-TwoVariables-CaseQC.pdf

## LEARN BY DOING

Review the output provided above for this example and complete the following:

• The contingency table shows us that, as age increases, the chance of having high blood pressure tends to ____ and gives us _____ of this probability for each age group.
• The boxplot shows us that individuals with high blood pressure tend to be ___.

## Summary

In this section we reviewed exploratory data analysis methods for two variables. Be certain you can correctly classify types of variables and choose the correct methods based upon the types of variables you have.