EDA for One Variable


Introduction and Links to Materials

Let’s review exploratory data analysis methods for one variable. The materials from 6052 linked below provide more details.

Please carefully review the page from the materials below on The “Normal” Shape with a particular focus on Quantile-Quantile Plots (QQ-Plots). You might also wish to review the following additional sources about these important plots:

Review from 6052 Materials: 

SAS Tutorials:

Useful SAS Procedures for one categorical variable

  • PROC FREQ
  • PROC SGPLOT

Useful SAS Procedures for numeric summaries for one quantitative variable:

  • PROC UNIVARIATE
  • PROC MEANS
  • Higher level procedures often give some summary statistics

Useful SAS Procedures for Histograms, Boxplots, and QQ-plots:

  • PROC SGPLOT
  • PROC UNIVARIATE
  • Higher level procedures may give histograms/boxplots/QQ-plots as default such as PROC TTEST
  • We will also see these plots provided for our residuals in regression analysis using PROC REG and PROC GLM.

One Categorical Variable

We begin with one categorical variable. For these variables, we cannot perform calculations with the values as they only represent categories or groups.

Categorical variables are summarized with percentages (or proportions) in each category. These results can be displayed in a frequency table containing the frequency or count in each category along with the percentage. You will obtain these in software as needed.

For categorical variables, the frequency table usually represents both the visual display and numerical measures.  We can determine everything about one categorical variable from this table.
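As a rough illustration of what a frequency table contains, here is a minimal Python sketch (Python rather than SAS, purely to show the arithmetic). The `frequency_table` helper and the smoking-status values are hypothetical and are not taken from the NHANES data.

```python
from collections import Counter


def frequency_table(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, 100 * n / total) for cat, n in counts.most_common()]


# Hypothetical sample of 20 smoking-status values (not the NHANES data).
smoking = ["never"] * 10 + ["former"] * 6 + ["current"] * 4

table = frequency_table(smoking)
for category, count, percent in table:
    print(f"{category:<8} {count:>3} {percent:6.2f}%")
```

The percentages always sum to 100, which is why the table alone describes the whole distribution of one categorical variable.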

We can, however, use bar charts or pie charts for a graphical representation. These can be used to visualize the distribution and emphasize certain points of importance about a particular variable. It is NOT necessary to be able to obtain these graphs in software for this course.

EXAMPLE: NHANES DATA – One Categorical Variable

  • Dataset: nhanes.sas7bdat  – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name (nhanes).
  • SAS Code and Output: Unit1-OneCategorical.pdf
  • SAS Code for Formats: Unit1-NHANES-Formats.pdf

Notice that the RACE variable has a category (OTHER) with a fairly small percentage. Overall, 68.55% were classified as WHITE and 28.10% as BLACK, but only 3.35% were classified as OTHER.

This can cause problems if we wish to use this variable in our future regression models as we may not have enough data to include OTHER as a separate category in our analysis.

LEARN BY DOING

Complete the following using the output provided above for this example.

  • In this sample, __% were female and __% were male and overall __% were classified as having high blood pressure (systolic blood pressure > 140).
  • When classified according to smoking status, __% were classified as never smokers, __% as former smokers, and __% as current smokers.

Solution: Unit1-OneCategorical-Solution.pdf

One Quantitative Variable

Now let’s look at one quantitative variable. To summarize the distribution of one quantitative variable, we are interested in

  • Shape: symmetric, skewed left, skewed right, unimodal, bimodal, multi-modal, normal, etc.
  • Center/Location: approximate location of the center of the distribution
  • Spread/Variation: some measure or visual representation of how much the data vary around the center
  • Outliers: any unusually large or small values in the dataset.

One Quantitative Variable: Numeric Summaries

To summarize the distribution of one quantitative variable we usually look at the following numeric summaries:

  • mean, median (location, center) and compare to investigate skewness
  • standard deviation, IQR, range (spread, variation)
  • min, max (useful to help screen for outlying or incorrect values)
  • quartiles (Q1 = 25th percentile, Q3 = 75th percentile)
  • other percentiles
  • confidence intervals or standard errors for quantities of particular interest such as the mean.
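The summaries listed above can be sketched with Python's standard library (again Python rather than SAS, only to make the calculations concrete). The readings are hypothetical, quartile definitions differ between packages, and the interval uses a normal critical value as a simplifying assumption; with a sample this small a t critical value would be the usual choice.

```python
import math
import statistics

# Hypothetical systolic blood pressure readings in mmHg (not the NHANES data).
sbp = [104, 110, 112, 118, 120, 122, 126, 130, 134, 142, 150, 176]

n = len(sbp)
mean = statistics.mean(sbp)
median = statistics.median(sbp)
sd = statistics.stdev(sbp)                    # sample standard deviation (n - 1 divisor)
q1, q2, q3 = statistics.quantiles(sbp, n=4)   # one of several quartile definitions
iqr = q3 - q1
data_range = max(sbp) - min(sbp)

# Standard error and an approximate 95% interval for the mean
# (normal critical value used here as a simplifying assumption).
se = sd / math.sqrt(n)
z = statistics.NormalDist().inv_cdf(0.975)
ci = (mean - z * se, mean + z * se)

print(f"mean={mean:.2f} median={median} sd={sd:.2f}")
print(f"Q1={q1} Q3={q3} IQR={iqr} range={data_range}")
print(f"min={min(sbp)} max={max(sbp)}")
print(f"approx. 95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

In this made-up sample the mean is larger than the median, which is the pattern expected when a distribution is skewed right.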

EXAMPLE: SYSTOLIC BLOOD PRESSURE

One Quantitative Variable: Graphical Summaries

For graphical displays for one quantitative variable we usually look at:

  • Histogram and Boxplot: information about center, spread, and shape of the distribution of the data
  • Normal quantile-quantile plot (Q-Q plot): compare distribution of data to normal distribution

HISTOGRAMS

  • Illustrate shape, center, and spread and give some insight into outliers.
  • Display frequencies or proportion of the data values in defined intervals, shown as bars
  • The larger the sample, the more bins are needed to display shape of distribution more clearly without losing too much information
    • One rule of thumb: the number of bins is around 1 + 3.3 log10(n), where n = sample size
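To get a feel for the rule of thumb, the short Python sketch below evaluates it for a few sample sizes. The function name is hypothetical, and rounding to the nearest whole number is an assumption, since the rule only says "around".

```python
import math


def rule_of_thumb_bins(n):
    """Approximate number of histogram bins from 1 + 3.3 * log10(n)."""
    return round(1 + 3.3 * math.log10(n))


for n in (50, 500, 3000):
    print(f"n = {n:>4}: about {rule_of_thumb_bins(n)} bins")
```

The suggested number of bins grows slowly: going from 50 to 3000 observations adds only a handful of bins.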

EXAMPLE: SYSTOLIC BLOOD PRESSURE

Here we have histograms for systolic blood pressure

  • A histogram with too few bins

  • A good histogram with an appropriate number of bins

  • A good histogram with a normal distribution and “best guess” kernel distribution overlaid on the plot

The distribution of systolic blood pressure is skewed right, with the possibility of outliers on the high end, which will be clarified in the boxplot.

BOXPLOTS

Boxplots display the distribution of one quantitative variable using the 5-number summary:

  • Min = smallest observation
  • lower quartile (Q1)
  • median (Q2)
  • upper quartile (Q3)
  • Max = largest observation

Information conveyed by the boxplot:

  • Location/Center – measured by the median (or mean if shown as in SAS).
  • Spread/Variation
    • height of the box = IQR
    • height of the entire plot = RANGE = Max – min
  • Presence of outliers
  • Shape of the distribution
    • For Right skewed data:
      • median is located toward bottom of box
      • upper whisker is longer than lower whisker
      • more outliers in upper range
    • For Left skewed data:
      • median is located toward top of box
      • lower whisker is longer than upper whisker
      • more outliers in lower range

This display is a compromise between a histogram and a numerical summary.  We lose some information, such as the modality (high points) and the details provided by the bin frequencies or percentages.  However, we gain information about outliers and specifics regarding the five-number summary.
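The quantities behind a boxplot can be sketched in Python as follows. The data are hypothetical, the quartile definition is one of several in use, and flagging points more than 1.5 × IQR beyond the quartiles is a common boxplot convention that is assumed here rather than stated in this section.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)


def flag_outliers(values, k=1.5):
    """Values more than k * IQR beyond the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical systolic blood pressure readings in mmHg (not the NHANES data).
sbp = [104, 110, 112, 118, 120, 122, 126, 130, 134, 142, 150, 210]

print(five_number_summary(sbp))
print(flag_outliers(sbp))
```

In this made-up sample the median sits closer to Q1 than to Q3 and the only flagged value is on the high end, both of which match the right-skewed pattern described above.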

EXAMPLE: SYSTOLIC BLOOD PRESSURE

Here is the boxplot for our systolic blood pressure data.

The boxplot confirms what we found in the histogram in the previous example: the basic shape of the distribution of systolic blood pressure is skewed right, although, ignoring the many outliers, the remaining distribution is only slightly skewed.

This is a large sample of over 3000 observations, so it isn’t too surprising to have such a seemingly large number of outliers.

QUANTILE-QUANTILE PLOTS (QQ-PLOTS)

QQ-plots are used to compare the distribution of the data to a normal distribution.  When data are from a normal distribution, the points in the QQ-plot fall along a straight line. This is not a scatterplot between two variables but a comparison of the observed to expected quantiles for one quantitative variable.
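The comparison of observed to expected quantiles can be sketched in Python as below. The data are hypothetical, and the plotting positions (i - 0.375) / (n + 0.25) are one common convention, assumed here; software packages may use a different one.

```python
import statistics


def normal_qq_pairs(values):
    """Pair each sorted observation with its expected standard normal quantile."""
    data = sorted(values)
    n = len(data)
    std_normal = statistics.NormalDist()
    # Plotting positions (i - 0.375) / (n + 0.25): an assumed convention.
    return [
        (std_normal.inv_cdf((i - 0.375) / (n + 0.25)), x)
        for i, x in enumerate(data, start=1)
    ]


# Hypothetical systolic blood pressure readings in mmHg (not the NHANES data).
sbp = [104, 110, 112, 118, 120, 122, 126, 130, 134, 142, 150, 176]

pairs = normal_qq_pairs(sbp)
for z, x in pairs:
    print(f"expected z = {z:6.2f}   observed = {x}")
```

Plotting the observed values against the expected z-scores gives the QQ-plot; each point uses the same single variable, once as data and once as the position it would occupy under normality.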

As we mentioned at the beginning of this section, please carefully review the page from the 6052 materials on The “Normal” Shape with a particular focus on Quantile-Quantile Plots (QQ-Plots). You might also wish to review the following additional sources about these important plots:

EXAMPLE: SYSTOLIC BLOOD PRESSURE

Here is the QQ-plot for our systolic blood pressure data.

We see a clear upward curvature that deviates significantly from the reference line at Y = X.

This confirms that the distribution of systolic blood pressure is skewed and clearly not normally distributed.

Systematic departures from a straight line indicate data are not from a normal distribution.

Upward or downward curvature indicates skewed data, and an s-shape indicates either heavy- or light-tailed data.

Some packages may swap the X and Y axes of the plot. In SAS, the observed values are on the y-axis and the normal distribution z-scores are on the x-axis. Swapping the axes gives the opposite result from that obtained in SAS (i.e., instead of curving upward for a skewed right distribution, the QQ-plot would curve downward).

For a particular plot orientation, we can identify whether the data are skewed left, skewed right, heavy tailed, or light tailed. However, it is usually best to combine all three plots (histogram, boxplot, and QQ-plot) to get a complete picture, in which case the QQ-plot is used more to determine the severity of any problems than to identify the type of problem directly.

Here are two sets of plots illustrating symmetric distributions which result in s-shaped QQ-plots.

The top graphs are from a heavy tailed distribution.  It is hard to see from the histogram, but the dotted line for the best guess distribution is ABOVE the normal distribution in the tails and in the center but BELOW the normal distribution in the middle ranges between these two areas. The t-distribution is an example of a common statistical distribution which has heavy tails.

The bottom graphs are from a light tailed distribution.  Here it is easier to see that the dotted line is BELOW the normal curve in the tails (i.e. “light” tails with less chance of values happening here than expected) and is BELOW the normal curve in the center but ABOVE the normal curve in the middle ranges between these two areas. The uniform distribution is an example of a common statistical distribution which has light tails.

Note: This distinction becomes important in regression analysis as we will have assumptions that involve normality and we must distinguish between symmetric distributions which are reasonably normal and those which are not.

EXAMPLE: t-distribution

Here we have an image which illustrates the fundamental difference between the normal distribution and the t-distribution:

A standard normal curve modeling the Z-distribution and a curve modeling the t-distribution. Both have been scaled so that the area under the curve is 1. The standard normal curve has less spread than the t-distribution curve: its tails are lower and its peak is taller. Near the center, the t-distribution curve lies below the standard normal curve, while in the tails it lies above it, so the curves intersect once on each side of the center.

You can see in the picture that the t-distribution has slightly less area near the expected central value than the normal distribution does, and you can see that the t distribution has correspondingly more area in the “tails” than the normal distribution does. (It’s often said that the t-distribution has “fatter tails” or “heavier tails” than the normal distribution.)

The following picture illustrates this idea with just a couple of t-distributions (note that “degrees of freedom” is abbreviated “d.f.” on the picture):

The standard normal z-distribution curve overlaid with a t-distribution with 5 d.f., and a t-distribution with 2 d.f. The distribution with 2 d.f. is shorter and has more spread than the t-distribution with 5 d.f., which in turn is shorter and wider than the standard normal distribution.
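The heights of these curves can be checked numerically. The Python sketch below evaluates the standard normal density and the t density (using the usual closed-form expression with the gamma function) at the center and in the tail; the function names are hypothetical.

```python
import math


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


for x in (0.0, 1.0, 3.0):
    print(
        f"x = {x:3.1f}: normal = {normal_pdf(x):.4f}, "
        f"t(5 d.f.) = {t_pdf(x, 5):.4f}, t(2 d.f.) = {t_pdf(x, 2):.4f}"
    )
```

At the center the normal curve is the tallest and the t-distribution with 2 d.f. the shortest, while at x = 3 the order is reversed, which is what "heavier tails" means.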

One Quantitative Variable – Transformations

In regression, we will consider transformations to correct certain violations of our assumptions. In addition, sometimes the transformed variable is simply the variable of interest.

Some advantages we will learn about later in the course are that transformations can

  • simplify relationships between variables
  • remove interactions
  • stabilize variances.

Some disadvantages are that the results

  • may be less interpretable
  • emphasize differences in certain ranges of the data, but de-emphasize differences in others.

EXAMPLE: SYSTOLIC BLOOD PRESSURE

Here is a comparison of the distribution for our original systolic blood pressure variable and a log-transformation using the natural logarithm.

Remember: in statistics we use LOG or log to denote the natural logarithm, although we may still sometimes use LN or ln.

Here we can see that the distribution of the Natural Logarithm transformation of systolic blood pressure is much more symmetric than the original variable.

EXAMPLE: EFFECT OF LOG-TRANSFORM

Here we illustrate the effect of the natural logarithm transformation on original systolic blood pressure values from 100 to 200, equally spaced by 20 units. The new values are shown in the table along with the corresponding differences.

The differences in the resulting log-transformed values range from 0.182 on the low end to 0.095 on the high end. The original differences of 20 units are no longer equal: the last difference (0.095) is nearly half the first (0.182).

Thus the effect is that, when compared to the original systolic blood pressure scale, differences on the lower end are now given more emphasis than differences on the higher end.
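The shrinking differences can be reproduced with a few lines of Python. The grid below runs to 220, which is an assumption: the quoted 0.095 equals ln(220) - ln(200), whereas the step from 180 to 200 works out to about 0.105.

```python
import math

# Equally spaced values on the original scale; the upper end of 220 is an assumption.
values = list(range(100, 221, 20))          # 100, 120, ..., 220
logs = [math.log(v) for v in values]        # natural logarithm
diffs = [b - a for a, b in zip(logs, logs[1:])]

print(f"{values[0]:>5}  log = {logs[0]:.3f}")
for value, log_value, diff in zip(values[1:], logs[1:], diffs):
    print(f"{value:>5}  log = {log_value:.3f}  difference from previous = {diff:.3f}")
```

Each step of 20 units on the original scale becomes a smaller step on the log scale as the values increase.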

It is always best if the chosen transformation is both statistically useful and practically meaningful. For example, if a value represents area or volume it can make sense to take the square root or cube root to get back to the original units of measurement (feet instead of square feet or cubic feet).

When transformations are not desired, we can consider the following alternatives:

  • Instead of a linear model, use a generalized linear model
  • Use non-parametric methods (e.g., Wilcoxon rank-sum test)
  • Rely on the robustness of normality-based techniques (when you have a large enough sample size)

Now let’s look at one final example where we will completely summarize one quantitative variable.

EXAMPLE: NHANES DATA – One Quantitative Variable

  • Dataset: nhanes.sas7bdat – To use the dataset, save the file into the folder on your computer which is associated with a SAS library. Once you do this, open SAS and you should be able to immediately access the file using that library and the file name.
  • SAS Code and Output: Unit1-OneQuntitative.pdf
  • SAS Code for Formats: Unit1-NHANES-Formats.pdf

LEARN BY DOING

Answer the following using the output provided above for this example.

  • Summarize the distribution of body mass index using information from both numeric summaries and graphical displays.
  • What is the effect of the log-transformation on the distribution of body mass index?
  • Consider the QQ-plots for the remaining variables (including the three transformations of systolic blood pressure, which are LOG(SBP) = LOGSBP, 1/SBP = SBP_INV, and SBP squared = SBP_SQ).
    • List any variables whose distributions are reasonably normal.
    • List any variables whose distributions are light-tailed.
    • List any variables whose distributions are heavy-tailed.
    • List any variables whose distributions are skewed right.

Summary

In this section we reviewed exploratory data analysis methods for one variable.

  • For categorical variables, we primarily use a frequency table with percentages.
  • For quantitative variables, we need to obtain a variety of numeric measures and graphical displays in order to address the main aspects of the distribution (Shape, Center, Spread, and Outliers) such as
    • mean, median, range, IQR, standard deviation, Q1, Q3, min, Max
    • histograms, boxplots, and QQ-plots.

We also introduced the concept of transformations for a quantitative variable and presented some examples of log-transformed variables.