EDA for Two Variables
- Introduction and Links to Materials
- Continuous Outcome with Continuous Predictor
- Continuous Outcome with Categorical Predictor
- LEARN BY DOING: Case CQ
- Categorical Outcome with Categorical Predictor
- LEARN BY DOING: Case CC
- Categorical Outcome with Continuous Predictor
- LEARN BY DOING: Case QC
Now let’s look at two-variable methods. The materials from 6052 linked below provide more details.
In regression analysis, we study relationships among variables and will distinguish the role of variables as
- Outcome or Response variable: to be predicted from remaining variables (dependent variable)
- Predictor or Explanatory variable: variable used to make prediction (covariate, independent variable)
When there are only two variables, we can classify them into these four cases.
We will focus on continuous and binary outcome variables in this course but will touch on discrete counts and build a foundation for you to explore other regression models on your own.
Luckily, in regression analysis the outcome variable is clearly indicated by the goal of the analysis itself.
Let’s review methods for investigating the relationship between two variables starting with a continuous outcome variable.
When we have a continuous outcome and a continuous predictor (Case QQ, not to be confused with “QQ-plots”) we use
- Scatterplots (with a LOWESS smoother to check linearity)
- Pearson’s correlation coefficient, r (with confidence interval and/or p-value)
We will come back to a full discussion of simple linear regression soon and cover that topic in detail; for now we will discuss correlation and scatterplots.
Pearson’s correlation coefficient
- Scale-free measure of association: for example, we obtain the same value of Pearson’s correlation coefficient whether height is measured in feet, inches, or centimeters. Note that this is NOT true for the slope in a regression equation.
- Between -1 and 1
- A value of 0 indicates the line of best fit through the data is flat – there is no linear association between the two variables
- A value of 1 or -1 indicates the line of best fit is a perfect fit of the data
- Positive value indicates an increasing relationship
- Negative value indicates a decreasing relationship
- Only measures the strength and direction of the linear relationship or association between two variables.
- If the variables are not linearly related, the correlation still describes the best-fitting line through the data, which may not be at all helpful!!
- ** We must verify linearity using a scatterplot before interpreting Pearson’s correlation. **
- Plot outcome variable (vertical axis) vs. predictor variable (horizontal axis)
- Can use LOWESS smoother to determine if relationship is approximately linear
- LOWESS: LOcally WEighted Scatterplot Smoother
- Draw smooth line to express the average value of outcome as a function of predictor
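As a quick illustration, here is a minimal sketch (using made-up height/weight data) that computes Pearson's r directly from its definition and confirms the scale-free property above: converting height from inches to centimeters leaves r unchanged.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient from its definition:
    sum of cross-deviations divided by the product of the
    root sums of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cross = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cross / (sx * sy)

# Hypothetical data: height (inches) and weight (pounds)
height_in = [63, 65, 66, 68, 70, 72, 74]
weight_lb = [120, 135, 140, 155, 165, 180, 195]

r = pearson_r(height_in, weight_lb)

# Scale-free: re-expressing height in centimeters gives the same r
height_cm = [h * 2.54 for h in height_in]
r_cm = pearson_r(height_cm, weight_lb)
```

Remember that this single number summarizes only the linear relationship; a scatterplot must still be examined before interpreting it.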
When we have a continuous outcome and a categorical predictor (Case CQ) we use
- Descriptive statistics in each category
- Side-by-side boxplots
- For inferential methods we use:
- T-tests, One-Way ANOVA
- Wilcoxon Rank-Sum Test (equivalent to the Mann-Whitney U Test)
- (Regression!) We will also see that linear regression is equivalent to one-way ANOVA and to the two-sample t-test assuming equal variances.
In essence, for exploratory data analysis in this scenario with a continuous outcome and a categorical predictor, we will separately apply numerical or graphical methods described before for one quantitative variable in each category of the predictor. Results can then be compared to investigate differences in the outcome variable across categories of the predictor.
We can use side-by-side boxplots to give a graphical view of the distribution of the outcome variable within each category of the predictor variable. This plot allows for easy comparisons across categories.
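The per-category summaries described above can be sketched as follows, using hypothetical systolic blood pressure values (continuous outcome) grouped by smoking status (categorical predictor):

```python
from statistics import mean, median, stdev

# Hypothetical data: systolic blood pressure by smoking status
data = {
    "smoker":     [138, 145, 150, 142, 155, 148],
    "non-smoker": [120, 126, 131, 118, 125, 129],
}

# Case CQ EDA: apply one-variable numerical summaries within each
# category of the predictor, then compare across categories
summaries = {
    group: {
        "n": len(values),
        "mean": round(mean(values), 1),
        "median": median(values),
        "sd": round(stdev(values), 1),
    }
    for group, values in data.items()
}
```

Side-by-side boxplots of the same grouped data would give the corresponding graphical comparison.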
When we have a categorical outcome and a categorical predictor (Case CC) we use:
- A contingency table or two-way table
- Row and/or column percentages (conditional percentages)
- ** To compare the distribution of the outcome among the levels of the predictor variable. **
- Chi-square Test, Fisher’s Exact Test
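A minimal sketch of a Case CC analysis, using a hypothetical 2x2 table of treatment group (predictor, rows) by recovery status (outcome, columns); it computes row (conditional) percentages and the chi-square statistic from its definition:

```python
# Hypothetical 2x2 contingency table
observed = [[30, 20],   # treatment: 30 recovered, 20 did not
            [18, 32]]   # control:   18 recovered, 32 did not

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Row (conditional) percentages: distribution of the outcome
# within each level of the predictor
row_pct = [[100 * cell / rt for cell in row]
           for row, rt in zip(observed, row_totals)]

# Chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected = (row total * column total) / grand total
chi_sq = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
    / (row_totals[i] * col_totals[j] / grand)
    for i in range(2) for j in range(2)
)
```

For small expected cell counts, Fisher's Exact Test would be preferred over the chi-square approximation.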
When we have a categorical outcome and a continuous predictor (Case QC):
- Break continuous predictor into categories and make contingency table (Chi-square Test, Fisher’s Exact)
- Avoid too few observations in some categories
- In some cases cutoff points can be determined on a non-statistical basis; in other situations they can be chosen based on percentiles of the predictor
- Or turn the situation around: treat the categorical outcome as the grouping variable and use side-by-side boxplots with ANOVA or a t-test, as in Case CQ.
- We will learn Logistic Regression for binary categorical outcomes later in this course.
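The percentile-based binning idea above can be sketched as follows, using hypothetical age (continuous predictor) and disease status (categorical outcome) data, with the median of the predictor as the cutoff:

```python
from statistics import median

# Hypothetical data: disease status by age
age =     [34, 45, 52, 61, 70, 29, 48, 66, 73, 38]
disease = ["no", "no", "yes", "yes", "yes",
           "no", "yes", "no", "yes", "no"]

# Break the continuous predictor into categories at a
# percentile-based cutoff (here the median), then tabulate
cutoff = median(age)
table = {("<= median", "no"): 0, ("<= median", "yes"): 0,
         ("> median", "no"): 0, ("> median", "yes"): 0}
for a, d in zip(age, disease):
    group = "<= median" if a <= cutoff else "> median"
    table[(group, d)] += 1
```

The resulting contingency table can then be analyzed exactly as in Case CC (chi-square or Fisher's Exact Test), keeping an eye on categories with too few observations.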
In this section we reviewed exploratory data analysis methods for two variables. Be certain you can correctly classify types of variables and choose the correct methods based upon the types of variables you have.