# Learn By Doing – Correlation and Outliers (Software)

Published: December 24th, 2012

Use the solutions provided to complete the practice questions for Case Q-Q.

## Objectives:

• Observe how an outlier can affect the correlation coefficient by comparing the value using data with and without an outlier.

## Solutions:

Use the following output to answer the questions that follow.

## Background Information for Dataset

The average gestation period, or time of pregnancy, of an animal is closely related to its longevity — the length of its lifespan. Data on the average gestation period and longevity (in captivity) of 40 different species of animals have been recorded.

Here is a summary of the variables in our dataset:

• animal: the name of the animal species.
• gestation: the average gestation period of the species, in days.
• longevity: the average longevity of the species, in years.

Remember that the correlation is only an appropriate measure of the linear relationship between two quantitative variables. First produce a scatterplot to verify that the relationship between gestation and longevity is approximately linear.
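To see why the linearity check matters, here is a minimal Python sketch (Python is not part of the SPSS or SAS steps in this activity, and the function name `pearson_r` and both small datasets are made up for illustration). It computes the correlation coefficient from its usual definition for a perfectly linear relationship and for a perfectly curved one.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, not taken from the animals dataset.
linear_x = [1, 2, 3, 4, 5]
linear_y = [3, 5, 7, 9, 11]   # y = 2x + 1, an exact straight line
curved_x = [-2, -1, 0, 1, 2]
curved_y = [4, 1, 0, 1, 4]    # y = x^2, an exact but non-linear relationship

r_linear = pearson_r(linear_x, linear_y)
r_curved = pearson_r(curved_x, curved_y)
print(r_linear, r_curved)
```

The straight-line data give a correlation of 1, while the curved data give a correlation of 0 even though y is completely determined by x. A scatterplot reveals the curved pattern that the correlation alone would miss.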

## Learn By Doing

Answer the following questions using the output obtained. In this exercise we will:

• use the scatterplot to examine the relationship between two quantitative variables.
• use the labeled scatterplot to better understand the form of a relationship.

## (Optional) SPSS Steps:

• Label Variables and Define Variable Properties
• Create Scatterplot: GRAPHS > CHART BUILDER, create a simple scatterplot relating X = longevity to Y = gestation
• Calculate Correlation: ANALYZE > CORRELATE > BIVARIATE, calculate the correlation between longevity and gestation as illustrated
• Remove Outlier and Save New Data: select the row containing the outlier, right-click on the row number and choose CUT
• Re-create Scatterplot: GRAPHS > CHART BUILDER, create a simple scatterplot relating X = longevity to Y = gestation using the new dataset
• Re-calculate Correlation: ANALYZE > CORRELATE > BIVARIATE, calculate the correlation on the new dataset

## (Optional) SAS Steps:

• Label Variables: Using a DATA step, create a new dataset (animals2) in which you label the variables longevity and gestation as Longevity (years) and Gestation (days) using a LABEL statement.
• View Dataset Information in SAS: Use PROC CONTENTS to view the information about the new dataset.
• Create Basic Scatterplot: Use PROC SGPLOT and the SCATTER statement to create a scatterplot of X=longevity by Y=gestation.
• Calculate Correlation Coefficient: Use PROC CORR to calculate the correlation coefficient between X=longevity and Y=gestation. In SAS 9.3 you will likely get the scatterplot matrix automatically. In SAS 9.2 you must request it by using ODS GRAPHICS ON before the procedure and ODS GRAPHICS OFF after the procedure (or whenever you wish to stop producing ODS graphics output).
• Delete Outlier: Using a DATA step, create a new dataset (animals3) and use an IF-THEN statement to delete the observation corresponding to the outlier. The outlier is the elephant, with an average longevity of 40 years and an average gestation of 645 days.
• View Dataset Information in SAS: Use PROC CONTENTS to view the information about the new dataset where you have removed the outlier.
• Create Basic Scatterplot: Use PROC SGPLOT and the SCATTER statement to create a scatterplot of X=longevity by Y=gestation on the dataset with the outlier removed.
• Calculate Correlation Coefficient: Use PROC CORR to calculate the correlation coefficient between X=longevity and Y=gestation on the dataset with the outlier removed.

This document is linked from Linear Relationships – Correlation.