EDA for One Variable
- Introduction and Links to Materials
- One Categorical Variable
- LEARN BY DOING: One Categorical Variable
- One Quantitative Variable
- LEARN BY DOING: One Quantitative Variable
Let’s review exploratory data analysis methods for one variable. The materials from 6052 linked below provide more details.
Please carefully review the page from the materials below on The “Normal” Shape with a particular focus on Quantile-Quantile Plots (QQ-Plots). You might also wish to review the following additional sources about these important plots:
We begin with one categorical variable. For these variables, we cannot perform calculations with the values as they only represent categories or groups.
Categorical variables are summarized with percentages (or proportions) in each category. These results can be displayed in a frequency table containing the frequency (count) in each category along with the percentage. You will obtain these in software as needed.
For categorical variables, the frequency table usually represents both the visual display and numerical measures. We can determine everything about one categorical variable from this table.
We can, however, use bar charts or pie charts for a graphical representation. These can be used to visualize the distribution and emphasize certain points of importance about a particular variable. It is NOT necessary to be able to obtain these graphs in software for this course.
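As a minimal sketch of the frequency table described above, the following Python computes the count and percentage for each category. The function name `frequency_table` and the blood-type sample are hypothetical, chosen only for illustration; any statistical package will produce the same table.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(category, count, 100 * count / total)
            for category, count in counts.most_common()]

# Hypothetical sample of one categorical variable (blood type).
blood_types = ["O", "A", "A", "B", "O", "O", "AB", "A", "O", "B"]

for category, count, percent in frequency_table(blood_types):
    print(f"{category:<4}{count:>4}{percent:>8.1f}%")
```

The percentages always sum to 100, which is why the table alone describes the whole distribution of a categorical variable.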
Now let’s look at one quantitative variable. To summarize the distribution of one quantitative variable, we are interested in:
- Shape: symmetric, skewed left, skewed right, unimodal, bimodal, multi-modal, normal, etc.
- Center/Location: approximate location of the center of the distribution
- Spread/Variation: some measure or visual representation of how much the data vary around the center
- Outliers: any unusually large or small values in the dataset.
One Quantitative Variable: Numeric Summaries
To summarize the distribution of one quantitative variable we usually look at the following numeric summaries:
- mean, median (location, center) and compare to investigate skewness
- standard deviation, IQR, range (spread, variation)
- min, max (useful to help screen for outlying or incorrect values)
- quartiles (Q1 = 25th percentile, Q3 = 75th percentile)
- other percentiles
- confidence intervals or standard errors for quantities of particular interest such as the mean.
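The summaries listed above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the course's software output: the function name `numeric_summary` and the sample are hypothetical, the quartiles use the "inclusive" method (packages differ in how they define quartiles), and the interval for the mean uses the large-sample multiplier 1.96, where a t multiplier would be more appropriate for a small sample.

```python
import math
import statistics

def numeric_summary(data):
    """Collect common numeric summaries for one quantitative variable."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    # Quartile definitions vary between packages; "inclusive" is one choice.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    se = sd / math.sqrt(n)       # standard error of the mean
    return {
        "n": n, "mean": mean, "median": median, "sd": sd,
        "min": min(data), "max": max(data), "range": max(data) - min(data),
        "q1": q1, "q3": q3, "iqr": q3 - q1,
        "se_mean": se,
        # Assumption: rough large-sample 95% interval with z = 1.96.
        "ci95_mean": (mean - 1.96 * se, mean + 1.96 * se),
    }

values = [2, 3, 3, 4, 5, 6, 7, 9, 30]  # hypothetical right-skewed sample
for name, value in numeric_summary(values).items():
    print(name, value)
```

In this sample the mean is larger than the median, which is the pattern we expect when the data are skewed right.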
One Quantitative Variable: Graphical Summaries
For graphical displays for one quantitative variable we usually look at:
- Histogram and Boxplot: information about center, spread, and shape of the distribution of the data
- Normal quantile-quantile plot (Q-Q plot): compare distribution of data to normal distribution
Histograms:
- Illustrate shape, center, and spread, and give some insight into outliers.
- Display the frequencies or proportions of the data values in defined intervals (bins), shown as bars.
- The larger the sample, the more bins are needed to display the shape of the distribution clearly without losing too much information.
- One rule of thumb: the number of bins is around 1 + 3.3 log10(n), where n is the sample size.
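To make the rule of thumb concrete, here is a small sketch that evaluates 1 + 3.3 log10(n) for a few sample sizes. Rounding to the nearest integer is an assumption made here; the rule only gives an approximate starting point, and the function name is hypothetical.

```python
import math

def rule_of_thumb_bins(n):
    """Approximate number of histogram bins: 1 + 3.3 * log10(n), rounded."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return round(1 + 3.3 * math.log10(n))

for n in (10, 100, 1000, 10000):
    print(n, rule_of_thumb_bins(n))
```

The bin count grows slowly with the sample size: a sample of 100 suggests about 8 bins, while a sample of 10,000 suggests about 14.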
Boxplots display the distribution of one quantitative variable using the 5-number summary:
- Min = smallest observation
- lower quartile (Q1)
- median (Q2)
- upper quartile (Q3)
- Max = largest observation
Information conveyed by the boxplot:
- Location/Center – measured by the median (or mean if shown as in SAS).
- height of the box = IQR
- height of the entire plot = RANGE = Max – min
- Presence of outliers
- Shape of the distribution
  - For right-skewed data:
    - median is located toward the bottom of the box
    - upper whisker is longer than the lower whisker
    - more outliers in the upper range
  - For left-skewed data:
    - median is located toward the top of the box
    - lower whisker is longer than the upper whisker
    - more outliers in the lower range
This display is a compromise between a histogram and a numerical summary. We lose some information, such as the modality (high points) and the details provided by the bin frequencies or percentages. However, we gain information about outliers and specifics regarding the five-number summary.
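The five-number summary behind a boxplot can be sketched as follows. Two assumptions are made that the text above does not state: quartiles use the "inclusive" method (software packages differ), and potential outliers are flagged with the common convention of values more than 1.5 × IQR beyond the quartiles. The function names and the sample are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max); quartile conventions vary by package."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

def flag_outliers(data, k=1.5):
    """Flag values more than k * IQR beyond Q1 or Q3 (an assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

values = [2, 3, 3, 4, 5, 6, 7, 9, 30]  # hypothetical right-skewed sample
print(five_number_summary(values))
print(flag_outliers(values))
```

Here the only flagged value lies in the upper range, consistent with the description of right-skewed data above.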
QUANTILE-QUANTILE PLOTS (QQ-PLOTS)
QQ-plots are used to compare the distribution of the data to a normal distribution. When the data are from a normal distribution, the points in the QQ-plot fall along a straight line. This is not a scatterplot between two variables but a comparison of the observed to the expected quantiles for one quantitative variable.
As we mentioned at the beginning of this section, please carefully review the page from the 6052 materials on The “Normal” Shape with a particular focus on Quantile-Quantile Plots (QQ-Plots). You might also wish to review the following additional sources about these important plots:
Systematic departures from a straight line indicate that the data are not from a normal distribution.
Upward or downward curvature indicates skewed data, and an s-shaped pattern indicates either heavy- or light-tailed data.
Some packages may swap the x and y axes of the plot. In SAS, the observed values are on the y-axis and the normal distribution z-scores are on the x-axis. A package that swaps the axes gives the opposite result from that obtained in SAS (i.e., instead of curving upward for a skewed-right distribution, the QQ-plot would curve downward).
For a particular plot orientation, we can identify whether the data are skewed left, skewed right, heavy tailed, or light tailed. However, it is usually best to combine all three plots (histogram, boxplot, and QQ-plot) to get a complete picture, in which case the QQ-plot is used more to determine the severity of any problems than to identify the type of problem directly.
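To show what "observed versus expected quantiles" means, the sketch below pairs each sorted observation with the standard normal quantile expected at its position. The plotting positions (i - 0.5) / n are an assumption; they are one common choice, and software packages may use slightly different formulas. The function name and the sample are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(data):
    """Pair each sorted observation with its expected standard normal quantile.

    Returns (z, x) pairs: z for the x-axis and the observed value x for the
    y-axis. Plotting positions (i - 0.5) / n are an assumed convention.
    """
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

values = [2, 3, 3, 4, 5, 6, 7, 9, 30]  # hypothetical right-skewed sample
for z, x in normal_qq_points(values):
    print(f"{z:6.2f}  {x}")
```

Plotting x against z gives the orientation with observed values on the y-axis. For roughly normal data the pairs fall near a straight line; in this sample the largest value sits far above the trend of the other points, the upward bend expected for right-skewed data in this orientation.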
Here are two sets of plots illustrating symmetric distributions which result in s-shaped QQ-plots.
The top graphs are from a heavy tailed distribution. It is hard to see from the histogram, but the dotted line for the best guess distribution is ABOVE the normal distribution in the tails and in the center but BELOW the normal distribution in the middle ranges between these two areas. The t-distribution is an example of a common statistical distribution which has heavy tails.
The bottom graphs are from a light tailed distribution. Here it is easier to see that the dotted line is BELOW the normal curve in the tails (i.e. “light” tails with less chance of values happening here than expected) and is BELOW the normal curve in the center but ABOVE the normal curve in the middle ranges between these two areas. The uniform distribution is an example of a common statistical distribution which has light tails.
Note: This distinction becomes important in regression analysis as we will have assumptions that involve normality and we must distinguish between symmetric distributions which are reasonably normal and those which are not.
In regression, we will consider transformations to correct certain violations of our assumptions. In addition, sometimes the transformed variable is simply the variable of interest.
Some advantages we will learn about later in the course are that transformations can
- simplify relationships between variables
- remove interactions
- stabilize variances.
Some disadvantages are that the results
- may be less interpretable
- may emphasize differences in certain ranges of the data but de-emphasize differences in others.
It is always best if the chosen transformation is both statistically useful and practically meaningful. For example, if a value represents area or volume it can make sense to take the square root or cube root to get back to the original units of measurement (feet instead of square feet or cubic feet).
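As an illustration of how a transformation can change the shape of a distribution, the sketch below compares a moment-based skewness measure before and after a natural log transformation. The skewness formula, the function name, and the sample are assumptions chosen for illustration; the log transformation requires positive values.

```python
import math
import statistics

def sample_skewness(data):
    """Moment-based skewness: mean cubed deviation over the cubed population SD."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum((x - mean) ** 3 for x in data) / (len(data) * sd ** 3)

values = [2, 3, 3, 4, 5, 6, 7, 9, 30]    # hypothetical right-skewed sample
logged = [math.log(x) for x in values]   # natural log; values must be positive

print(round(sample_skewness(values), 2))
print(round(sample_skewness(logged), 2))
```

For this sample the skewness is positive on the original scale and closer to zero after the log transformation. The tradeoff is the one noted above: results on the log scale may be less interpretable than results in the original units.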
When transformations are not desired, we can consider the following alternatives:
- Use a generalized linear model instead of a linear model
- Use non-parametric methods (e.g., the Wilcoxon rank-sum test)
- Rely on the robustness of normality-based techniques (when the sample size is large enough)
Now let’s look at one final example where we will completely summarize one quantitative variable.
In this section we reviewed exploratory data analysis methods for one variable.
- For categorical variables, we primarily use a frequency table with percentages.
- For quantitative variables, we need to obtain a variety of numeric measures and graphical displays in order to address the main aspects of the distribution (Shape, Center, Spread, and Outliers) such as
- mean, median, range, IQR, standard deviation, Q1, Q3, min, max
- histograms, boxplots, and QQ-plots.
We also introduced the concept of transformations for a quantitative variable and presented some examples of log-transformed variables.