EDA for Multiple Variables


Introduction

We will learn to use multiple regression to handle multi-predictor situations during this course. Here we present a few descriptive methods for looking deeper into multi-variable relationships. Many of these methods and the associated SAS code may be new to you, so be certain to review them carefully.

Useful SAS Procedures

  • PROC CORR
  • PROC SGPLOT
  • PROC SGPANEL
  • PROC SGSCATTER
  • PROC FREQ
  • PROC SORT

We will start with methods for a continuous outcome.

Continuous Outcome – Correlations and Scatterplots

Review the following example from the SAS documentation, which illustrates a correlation matrix and a scatterplot matrix. These analyses do not include a third variable but do allow us to investigate many bivariate associations in our dataset with relative ease.

SAS Documentation Reading: Creating Scatter Plots

From the example, we have the correlation matrix below which provides information about the strength and direction of the best LINEAR trend through the data.

  • All of these variables, when considered in pairs, are strongly positively correlated. These associations are highly statistically significant.

We also have the scatterplot matrix which shows that all of these relationships are reasonably linear and thus the correlations provide an accurate summary of the strength and direction of these pair-wise associations.
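As a minimal sketch of how a correlation matrix and scatterplot matrix like these might be requested, the code below uses PROC CORR with ODS Graphics. The SASHELP.IRIS sample dataset and its variable names are stand-ins chosen for illustration (an assumption); they are not necessarily the data used in the documentation example.

```sas
/* Sketch only: SASHELP.IRIS is a stand-in, not the documentation dataset */
ods graphics on;

proc corr data=sashelp.iris pearson plots=matrix(histogram);
   /* Pairwise Pearson correlations with p-values, plus a scatterplot matrix */
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Checking the scatterplot matrix alongside the printed correlations is what tells us whether each relationship is linear enough for the correlation to be a fair summary.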

Now review the following SAS tutorial on creating grouped scatterplots. A grouped scatterplot lets us investigate three variables simultaneously: it plots a continuous outcome against a continuous predictor, and the points are identified according to a categorical predictor.

Video: SAS Tutorial Topic 9B – (2:29) Grouped Scatterplots
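A grouped scatterplot of this kind can be sketched with PROC SGPLOT and the GROUP= option. The dataset SASHELP.CLASS and the variables Height, Weight, and Sex are stand-ins for illustration (an assumption); the tutorial may use different data and options.

```sas
/* Sketch only: SASHELP.CLASS and its variables are stand-ins */
proc sgplot data=sashelp.class;
   /* Continuous outcome vs. continuous predictor; markers distinguished by Sex */
   scatter x=Height y=Weight / group=Sex;
run;
```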

Next review two more examples from the SAS documentation. The first creates a panel of scatterplots with spline curves.

  • Here we are only looking at two variables at a time in each plot but we could group points in this display as well.
  • We will often use a LOESS or REG option with this type of plot.
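A panel of scatterplots with a smoother in each cell might look roughly like the following, using the PLOT statement of PROC SGSCATTER. SASHELP.IRIS and the chosen variable pairs are illustrative assumptions, not the documentation example itself.

```sas
/* Sketch only: stand-in dataset and variables */
proc sgscatter data=sashelp.iris;
   /* One cell per Y*X pair, each with a LOESS smoother;
      replacing LOESS with REG or PBSPLINE requests a different fit */
   plot (SepalLength SepalWidth)*(PetalLength PetalWidth) / loess;
run;
```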

The second example creates a grouped scatterplot matrix with points identifying the species.

SAS Documentation Reading: Creating a (Grouped) Scatter Plot Matrix
  • There are clearly species differences here!
  • This is a famous dataset. Here is the associated Wikipedia article.
  • Notice that such a plot without a group specification would only investigate pair-wise associations each involving only two variables as discussed above.
  • Note that we could also calculate correlations within levels of a categorical variable using the “BY” statement in SAS, but we have not illustrated this in our materials.
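As a rough sketch of the two ideas in this list, the code below draws a grouped scatterplot matrix and then computes correlations separately within each group using a BY statement. It assumes the SASHELP.IRIS sample dataset; the output dataset name iris_sorted is hypothetical. BY-group processing expects the data to be sorted by the BY variable, so a PROC SORT step comes first.

```sas
/* Sketch only: assumes SASHELP.IRIS; work.iris_sorted is a hypothetical name */
proc sgscatter data=sashelp.iris;
   /* Scatterplot matrix with points identified by species */
   matrix SepalLength SepalWidth PetalLength PetalWidth / group=Species;
run;

/* BY-group processing requires data sorted by the BY variable */
proc sort data=sashelp.iris out=work.iris_sorted;
   by Species;
run;

proc corr data=work.iris_sorted;
   by Species;   /* one correlation matrix per species */
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```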

Let’s look at an example using a dataset collected on bears where we examine many bivariate relationships.

EXAMPLE: BEAR DATA – Many Bivariate Relationships

The goal of the researchers was to use a small sample of wild bears whose weight was actually measured as a basis for a model that predicts weight from easier-to-obtain quantities such as length and circumference measures.

Therefore the variable WEIGHT is the primary outcome of interest.

In the following SAS code, we look at a number of correlation and scatterplot-based approaches to investigating associations between multiple variables simultaneously.

  • Dataset: bear.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
  • SAS Code and Output: Unit1-MultipleVariables-Bear-StillTwo.pdf

LEARN BY DOING

Complete the following using the output provided above for this example.

  • The shape of the distribution of the variable WEIGHT is ____.
  • It is clear that all of the variables displayed in the scatterplot matrices are highly positively related but many of the relationships have some degree of non-linearity. When looking at the scatterplots on page 4 (along with the corresponding graphs in the scatterplot matrix on page 2 or 3), which variable has the MOST NON-LINEAR relationship with WEIGHT?
  • What is the correlation between WEIGHT and HEADLTH?
  • What is the correlation between WEIGHT and NECK?
  • What is the correlation between NECK and HEADLTH?
  • Which variable seems to be most STRONGLY ASSOCIATED with WEIGHT? Explain your answer.

Solution: Unit1-MultipleVariables-Bear-StillTwo-Solution.pdf

Now let’s look at adding additional variables to the plots to really start to get into multi-variable exploratory methods.

EXAMPLE: BEAR DATA – Multi-Variable Relationships

In the following SAS code, we look at how to add information about other variables to our scatterplots.

  • Dataset: bear.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
  • SAS Code and Output: Unit1-MultipleVariables-Bear.pdf

Often it is difficult to make much sense of these plots unless there are groupings as obvious as those in the IRIS data illustrated in the first SAS documentation example above.

All we can say from these plots is that based upon these bears, males tended to be the largest bears in all categories but there are very few differences between the overall trends seen in the LOESS curves in the plots on page 2.

NOTE: We use PROC SORT to order the months correctly in the plots. Sometimes categories/groups are ordered in ways that are not the way you wish. There is usually a way to fix this… but maybe not always.
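The general pattern of sorting first and then plotting with groups and LOESS curves can be sketched as follows. This is not the course code: it uses SASHELP.IRIS as a stand-in, and the dataset name iris_sorted is hypothetical. It also assumes that PROC SGPLOT assigns group order by the order in which group values appear in the data, which is worth confirming against your own output.

```sas
/* Sketch only: stand-in data; work.iris_sorted is a hypothetical name */
proc sort data=sashelp.iris out=work.iris_sorted;
   by descending Species;   /* changes the order in which groups are encountered */
run;

proc sgplot data=work.iris_sorted;
   /* Points and a separate LOESS curve for each group */
   scatter x=PetalLength y=SepalLength / group=Species;
   loess   x=PetalLength y=SepalLength / group=Species nomarkers;
run;
```

If sorting does not give the order you want, the GROUPORDER= option on the plot statements is another place to look.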

Now let’s look at our NHANES data. With such a large dataset it can be very difficult to get good scatterplots so we will illustrate a few tricks to help produce readable graphs.

EXAMPLE: NHANES DATA – Multi-Variable Relationships

  • Dataset: nh_2000a.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
  • SAS Code and Output: Unit1-MultipleVariables-NHANES.pdf
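One common way to keep a scatterplot readable with thousands of points is to shrink the markers and make them transparent, then overlay a fitted line per group. The sketch below illustrates that idea only; it is not necessarily one of the tricks used in the course code. It assumes the SASHELP.HEART sample dataset as a stand-in for the NHANES data, with AgeAtStart, Systolic, and Sex as illustrative variables.

```sas
/* Sketch only: SASHELP.HEART stands in for the NHANES dataset */
proc sgplot data=sashelp.heart;
   /* Small, mostly transparent markers reduce overplotting */
   scatter x=AgeAtStart y=Systolic / group=Sex transparency=0.8
           markerattrs=(size=4);
   /* One regression line per group, drawn without replotting the points */
   reg x=AgeAtStart y=Systolic / group=Sex nomarkers;
run;
```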

LEARN BY DOING

Answer the following using the output provided above for this example.

  • Based upon the correlation matrix, which variable is most strongly correlated with SBP? What is the value of the correlation?
  • For the variable chosen above, which other variable (besides SBP) is it most correlated with? What is the value of the correlation?
  • When looking at the plot on page 4:
    • Which group, males or females, has the largest slope for the line of best fit based upon the LOESS curves provided?
    • What is the approximate value of the age at which the two lines cross?
  • Discuss the similarities and differences in the pattern seen in the plot on page 7.

Solution: Unit1-MultipleVariables-NHANES-Solution.pdf

NOTE: We use PROC SORT to order categorical variables in a specific way in the plots. Sometimes categories/groups are ordered in ways that are not the way you wish. There is usually a way to fix this… but maybe not always.

Continuous Outcome – Boxplots

Now we will look at methods of investigating multiple variables simultaneously using boxplots. Here we can look at two or three categorical predictors versus a continuous outcome.
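A minimal sketch of side-by-side boxplots with two categorical predictors uses PROC SGPANEL: one predictor defines the panels and the other defines the categories within each panel. SASHELP.HEART and the variables Systolic, Smoking_Status, and Sex are stand-ins (an assumption); the course example uses the NHANES data with its own variable names.

```sas
/* Sketch only: SASHELP.HEART stands in for the NHANES dataset */
proc sgpanel data=sashelp.heart;
   panelby Sex / columns=2;                    /* one panel per sex */
   vbox Systolic / category=Smoking_Status;    /* one box per smoking category */
run;
```

A third categorical predictor could be added either as a second PANELBY variable or through the GROUP= option on VBOX, at the cost of a busier display.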

EXAMPLE: NHANES DATA – More Complex Boxplots

  • Dataset: nh_2000a.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
  • SAS Code and Output: Unit1-MultipleVariables-NHANES-BOXPLOTS.pdf

LEARN BY DOING

Answer the following using the output provided above for this example.

  • Is the pattern seen in the graph on page 1 between systolic blood pressure and smoking status similar for males and females?
  • Is the pattern seen in the graph on page 3 between age and smoking status similar for males and females?
  • Comment on the usefulness of these graphs in practice.

Solution: Unit1-MultipleVariables-NHANES-BOXPLOTS-Solutions.pdf

NOTE: We use PROC SORT to order categorical variables in a specific way in the plots. Sometimes categories/groups are ordered in ways that are not the way you wish. There is usually a way to fix this… but maybe not always.

We could also calculate numeric summaries within groups defined by as many categorical variables as we wish, although this may require creating a new variable that contains the groups of interest in order to get exactly what we want.
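One way to get such grouped summaries, sketched under the assumption that PROC MEANS with a CLASS statement is acceptable here, is shown below. PROC MEANS is not in the list of procedures above, and SASHELP.HEART with its variables is a stand-in for the course data.

```sas
/* Sketch only: stand-in dataset and variables */
proc means data=sashelp.heart n mean std median maxdec=1;
   class Sex Smoking_Status;   /* summaries for each combination of levels */
   var Systolic;
run;
```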

Categorical Outcome – Contingency Tables

We will return to contingency table methods when we begin regression for binary outcomes using logistic regression later in the course. For now, we want to illustrate multi-way contingency tables where we create separate contingency tables based upon the levels of additional categorical variables.
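A multi-way table request in PROC FREQ can be sketched as follows. In a request of the form A*B*C, the last two variables form the rows and columns, and a separate table is produced for each level of the variables listed before them. SASHELP.HEART and its variables are stand-ins (an assumption); the course example uses the NHANES data.

```sas
/* Sketch only: SASHELP.HEART stands in for the NHANES dataset */
proc freq data=sashelp.heart;
   /* One Smoking_Status-by-BP_Status table for each level of Sex;
      NOCOL and NOPERCENT leave the counts and row percentages */
   tables Sex*Smoking_Status*BP_Status / nocol nopercent;
run;
```

The row percentages then answer questions of the form "among people in this smoking category, within this stratum, what percent fall in each blood pressure category?" Listing another variable in front of Sex would stratify the tables further.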

EXAMPLE: NHANES DATA – Multi-way Contingency Tables

LEARN BY DOING

Answer the following using the output provided above for this example.

  • Among female current smokers ___% have high blood pressure.
  • Among male current smokers ___% have high blood pressure.
  • Among white female current smokers ___% have high blood pressure.
  • Among white male current smokers ___% have high blood pressure.
  • Among black female current smokers ___% have high blood pressure.
  • Among black male current smokers ___% have high blood pressure.

Solution: Unit1-MultipleVariables-NHANES-ContingencyTables-Solution.pdf

Summary

In this section we illustrated numerous exploratory data analysis methods for multiple variables.

It is easy to see that investigating multiple variables simultaneously using these methods has serious limitations. It is very difficult to fully account for multiple variables simultaneously or to draw any clear conclusions.

Multiple regression will help us delve deeper into the multi-variable relationships in the data.