This document is linked from Linear Relationships – Regression.


This document is linked from Linear Relationships – Linear Regression.

Optional: Create your own solutions using your software for extra practice.

- Find a regression line and plot it on the scatterplot
- Examine the effect of outliers on the regression line

Use the following output to answer the questions that follow.

The modern Olympic Games have changed dramatically since their inception in 1896. For example, many commentators have remarked on the change in the quality of athletic performances from year to year. Regression will allow us to investigate the change in winning times for one event — the 1,500 meter race.

Here is a summary of the variables in our dataset:

- **Year:** the year of the Olympic Games, from 1896 to 2000.
- **Time:** the winning time for the 1,500 meter race, in seconds.

Answer the following questions using the output. In this exercise you will:

- use the regression line to make predictions
- evaluate how reliable these predictions are

Use the linear regression on the full data to answer the following question.

Use the linear regression after removing the outlier to answer the next two questions.

- **Import Data:** FILE > OPEN > DATA, choose Excel file from the pull-down, find the file, continue
- **Edit Data:** DATA > DEFINE VARIABLE PROPERTIES
- **Scatterplot:** GRAPHS > CHART BUILDER, create a simple scatterplot relating X = Year to Y = Time, double click on created scatterplot to add trend-line
- **Regression Equation:** ANALYZE > REGRESSION > LINEAR
- **Remove Outlier and Save New Data:** select the row containing the outlier, right-click on the row number and choose CUT
- **Scatterplot:** GRAPHS > CHART BUILDER, create a simple scatterplot relating X = Year to Y = Time using the new dataset, double click on created scatterplot to add trend-line
- **Regression Equation:** ANALYZE > REGRESSION > LINEAR

- **View Dataset Information in SAS:** Use PROC CONTENTS to view the information about the dataset.
- **Create Regression Analysis with Fit Plot:** Use PROC REG to obtain the simple linear regression analysis for Y = time using X = year as the predictor. In SAS 9.3 (if you have ODS GRAPHICS enabled) you should obtain the fit plot by default in your HTML output. In SAS 9.2 you must use ODS GRAPHICS ON to obtain these results.
  Note: In SAS 9.2, I tend to use ODS GRAPHICS OFF immediately following the procedure. This is not necessary; however, you will keep receiving ODS GRAPHICS output until you turn it off with this command or exit SAS 9.2. In SAS 9.3, ODS GRAPHICS are enabled by default but can be enabled/disabled under TOOLS > OPTIONS > PREFERENCES in the RESULTS tab.
- **Delete Outlier:** Using a DATA step, create a new dataset (olympics2) and use an IF-THEN statement to delete the observation corresponding to the outlier. This outlier is the first observation, for year = 1896.
- **Create Regression Analysis with Fit Plot:** Use PROC REG to obtain the simple linear regression analysis for Y = time using X = year as the predictor, using your dataset with the outlier removed. In SAS 9.3 (if you have ODS GRAPHICS enabled) you should obtain the fit plot by default in your HTML output. In SAS 9.2 you must use ODS GRAPHICS ON to obtain these results.


To see the effect of outliers on a regression equation, use the applet introduced earlier. Draw points on the graph, add the regression line and then add an outlier or move an observation to see how the regression line changes.
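If the applet is not available, the same experiment can be sketched in a few lines of Python. This is only an illustration: the data points are made up (they are not from any dataset in this course), and `least_squares_line` is a hypothetical helper written for this example.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared X-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up points lying exactly on Y = 10 - 0.5X (assumption for illustration)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [10.0 - 0.5 * x for x in xs]
a_clean, b_clean = least_squares_line(xs, ys)

# Add a single outlier far above the pattern and refit
a_out, b_out = least_squares_line(xs + [7.0], ys + [20.0])

print(f"without outlier: Y = {a_clean:.2f} + ({b_clean:.2f})X")
print(f"with outlier:    Y = {a_out:.2f} + ({b_out:.2f})X")
```

With these made-up points, one added observation is enough to turn a negative slope into a positive one, which is the kind of change the applet lets you explore by dragging points.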

Here is another similar applet that can be used to illustrate outliers and guessing lines of best fit.

Here is an interactive demonstration from the Rosman/Chance collection which has extensive options and illustrates many ideas about linear regression and correlation.

And, remember the two-variable calculator we introduced earlier.


A line is described by a set of points **(X,Y)** that obey a particular relationship between **X** and **Y**. That relationship is called the equation of the line, which we will express in the following form: **Y = a + bX**. In this equation, **a** and **b** are constants that can be either negative or positive. The reason to write the line in this form is that the constants **a** and **b** tell us what the line looks like, as follows:

- The **intercept (a)** is the value that **Y** takes when **X** = 0.
- The **slope (b)** is the change in **Y** for every increase of 1 unit in **X**.

The slope and intercept are indicated with arrows on the following diagram:

Consider the line **Y = 1 + (1/3)X**:

The intercept is 1. The slope is 1/3, and the graph of this line is, therefore:

Consider the line **Y = 1 - (1/3)X**:

The intercept is 1. The slope is -1/3, and the graph of this line is, therefore:
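To check the meaning of the intercept and slope numerically, the two example lines (intercept 1, slopes 1/3 and -1/3) can be evaluated at a few values of X. This is a small sketch; the helper `line` and the names `rising` and `falling` are introduced here only for illustration.

```python
from fractions import Fraction

def line(a, b):
    """Return a function that computes Y = a + b*X."""
    return lambda x: a + b * x

rising = line(1, Fraction(1, 3))    # intercept 1, slope 1/3
falling = line(1, Fraction(-1, 3))  # intercept 1, slope -1/3

for x in (0, 3, 6):
    print(f"X = {x}: rising Y = {rising(x)}, falling Y = {falling(x)}")
```

At X = 0 both lines give Y = 1 (the intercept), and each 1-unit increase in X changes Y by exactly the slope.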


Add points to the scatterplot, then draw your guess at the regression line, and then check your answer.


**Related SAS Tutorials**

- 9A – (3:53) Basic Scatterplots
- 9B – (2:29) Grouped Scatterplots
- 9C – (3:46) Pearson’s Correlation Coefficient
- 9D – (3:00) Simple Linear Regression – EDA

**Related SPSS Tutorials**

- 9A – (2:38) Basic Scatterplots
- 9B – (2:54) Grouped Scatterplots
- 9C – (3:35) Pearson’s Correlation Coefficient
- 9D – (2:53) Simple Linear Regression – EDA

So far we’ve used the scatterplot to describe the relationship between two quantitative variables, and in the special case of a linear relationship, we have supplemented the scatterplot with the correlation (r).

The correlation, however, doesn’t fully characterize the linear relationship between two quantitative variables — it only measures the strength and direction. We often want to describe more precisely how one variable changes with the other (by “more precisely,” we mean more than just the direction), or predict the value of the response variable for a given value of the explanatory variable.

In order to be able to do that, we need to summarize the linear relationship with a line that best fits the linear pattern of the data. In the remainder of this section, we will introduce a way to find such a line, learn how to interpret it, and use it (cautiously) to make predictions.

Again, let’s start with a motivating example:

Earlier, we examined the linear relationship between the age of a driver and the maximum distance at which a highway sign was legible, using both a scatterplot and the correlation coefficient. Suppose a government agency wanted to predict the maximum distance at which the sign would be legible for 60-year-old drivers, and thus make sure that the sign could be used safely and effectively.

How would we make this prediction?

It would be useful if we could find a line (such as the one that is presented on the scatterplot) that represents the general pattern of the data, because then we would simply use this line to find the distance that corresponds to an age of 60, and predict that 60-year-old drivers could see the sign from a distance of just under 400 feet, like this:

How and why did we pick this particular line (the one shown in red in the above walkthrough) to describe the dependence of the maximum distance at which a sign is legible upon the age of a driver? What line exactly did we choose? We will return to this example once we can answer that question with a bit more precision.

The technique that specifies the dependence of the response variable on the explanatory variable is called **regression**. When that dependence is linear (which is the case in our examples in this section), the technique is called **linear regression**. Linear regression is therefore the technique of finding the line that best fits the pattern of the linear relationship (or in other words, the line that best describes how the response variable linearly depends on the explanatory variable).

To understand how such a line is chosen, consider the following very simplified version of the age-distance example (we left just 6 of the drivers on the scatterplot):

There are many lines that look like they would be good candidates to be the line that best fits the data:

It is doubtful that everyone would select the same line in the plot above. We need to agree on what we mean by “best fits the data”; in other words, we need to agree on a criterion by which we would select this line. We want the line we choose to be close to the data points. In other words, whatever criterion we choose, it had better somehow take into account the vertical deviations of the data points from the line, which are marked with blue arrows in the plot below:

The most commonly used criterion is called the **least squares** criterion. This criterion says: Among all the lines that look good on your data, choose the one that has the smallest sum of squared vertical deviations. Visually, each squared deviation is represented by the area of one of the squares in the plot below. Therefore, we are looking for the line that will have the smallest total yellow area.

This line is called the **least-squares regression line**, and, as we’ll see, it fits the linear pattern of the data very well.
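The least squares criterion can be made concrete with a short Python sketch that computes the sum of squared vertical deviations for a few candidate lines and picks the smallest. The six data points and the three candidate lines below are made up for illustration; they are not the drivers from the study.

```python
def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up (age, distance) pairs, assumed only for this example
ages = [20, 30, 40, 50, 60, 70]
distances = [520, 480, 470, 420, 400, 360]

# Hypothetical candidate lines, given as (intercept, slope)
candidates = {
    "line A": (576, -3),
    "line B": (560, -2.5),
    "line C": (600, -3.5),
}

for name, (a, b) in candidates.items():
    total = sum_squared_deviations(a, b, ages, distances)
    print(f"{name}: Y = {a} + ({b})X, sum of squares = {total}")

best = min(
    candidates,
    key=lambda name: sum_squared_deviations(*candidates[name], ages, distances),
)
print("smallest sum of squared deviations:", best)
```

This only compares three hand-picked lines. The least squares regression line is the one that would win this comparison against every possible line, not just the candidates listed.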

For the remainder of this lesson, you’ll need to feel comfortable with the algebra of a straight line. In particular you’ll need to be familiar with the **slope** and the **intercept** in the equation of a line, and their interpretation.

Like any other line, the equation of the least-squares regression line for summarizing the linear relationship between the response variable (**Y**) and the explanatory variable (**X**) has the form: **Y = a + bX**

All we need to do is calculate the intercept **a** and the slope **b**.

The **slope** of the least squares regression line can be interpreted as the estimated (or predicted) **change in the mean (or average) value of the response variable when the explanatory variable increases by 1 unit.**
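In practice software does this calculation, but it can be sketched by hand. A standard textbook result is that the least squares slope and intercept follow from five summary values: the means and standard deviations of X and Y, and the correlation r, through b = r(S_Y / S_X) and a = Ȳ - bX̄. The sketch below assumes those formulas and uses made-up data; the function names are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson correlation coefficient r, computed from its definition."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / ((len(xs) - 1) * stdev(xs) * stdev(ys))

def line_from_summaries(x_bar, s_x, y_bar, s_y, r):
    """Least squares intercept and slope from five summary values."""
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Made-up (age, distance) pairs, assumed only for this example
xs = [20, 30, 40, 50, 60, 70]
ys = [520, 480, 470, 420, 400, 360]

r = correlation(xs, ys)
a, b = line_from_summaries(mean(xs), stdev(xs), mean(ys), stdev(ys), r)
print(f"r = {r:.4f}, least squares line: Y = {a:.2f} + ({b:.2f})X")
```

Because the standard deviations are always positive, the slope has the same sign as r, which matches the idea that the correlation gives the direction of the linear relationship.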

Let’s revisit our age-distance example, and find the **least-squares regression line**. The following output will be helpful in getting the 5 values we need:

- Dependent Variable: Distance
- Independent Variable: Age
- Correlation Coefficient (**r**) = -0.7929
- The **least squares regression line** for this example is: **Distance = 576 + (- 3 * Age)**

- This means that for every 1-unit increase of the explanatory variable, there is, on average, a 3-unit decrease in the response variable. The interpretation **in context** of the slope (-3) is, therefore: In this dataset, when age increases by 1 year, the **average** maximum distance at which subjects can read a sign is expected to **decrease by 3 feet.**
- Here is the regression line plotted on the scatterplot:

As we can see, the regression line fits the linear pattern of the data quite well.

Let’s go back now to our motivating example, in which we wanted to predict the maximum distance at which a sign is legible for a 60-year-old. Now that we have found the least squares regression line, this prediction becomes quite easy:

Practically, what the figure tells us is that in order to find the predicted legibility distance for a 60-year-old, we plug Age = 60 into the regression line equation, to find that:

**Predicted distance = 576 + (- 3 * 60) = 396**

396 feet is our best prediction for the maximum distance at which a sign is legible for a 60-year-old.
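The same calculation can be written as a small Python function. The intercept (576) and slope (-3) are the values used in the prediction above; the function name `predict_distance` is just a label chosen for this sketch.

```python
def predict_distance(age, intercept=576, slope=-3):
    """Predicted maximum legibility distance (feet) for a driver of a given age."""
    return intercept + slope * age

print(predict_distance(60))  # 576 + (-3 * 60) = 396
```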

**Comment About Predictions:**

- Suppose a government agency wanted to design a sign appropriate for an even wider range of drivers than were present in the original study. They want to predict the maximum distance at which the sign would be legible for a 90-year-old. Using the least squares regression line again as our summary of the linear dependence of the distances upon the drivers’ ages, the agency predicts that 90-year-old drivers can see the sign at no more than 576 + (- 3 * 90) = 306 feet:

(The green segment of the line is the region of ages beyond 82, the age of the oldest individual in the study.)

**Question:** Is our prediction for 90-year-old drivers reliable?

**Answer:** Our original age data ranged from 18 (youngest driver) to 82 (oldest driver), and our regression line is therefore a summary of the linear relationship in that age range only.

Prediction for ranges of the explanatory variable that are not in the data is called **extrapolation**. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided. In our example, like most others, extrapolation can lead to very poor or illogical predictions.
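One way to guard against accidental extrapolation is to have the prediction code report whether the requested age lies inside the observed range. The sketch below uses the line and the age range (18 to 82) from this example; the function name and the returned flag are illustrative choices, not part of any statistical package.

```python
AGE_MIN, AGE_MAX = 18, 82  # youngest and oldest drivers in the study

def predict_distance_checked(age):
    """Return (predicted distance in feet, True if this is an extrapolation)."""
    predicted = 576 + (-3 * age)
    is_extrapolation = not (AGE_MIN <= age <= AGE_MAX)
    return predicted, is_extrapolation

for age in (60, 90):
    distance, flagged = predict_distance_checked(age)
    note = "extrapolation, not reliable" if flagged else "within the data range"
    print(f"age {age}: {distance} feet ({note})")
```

The arithmetic still produces 306 feet for a 90-year-old; the flag is a reminder that the data give no support for trusting that number.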

- A special case of the relationship between two quantitative variables is the **linear** relationship. In this case, a straight line simply and adequately summarizes the relationship.

- When the scatterplot displays a linear relationship, we supplement it with the **correlation coefficient (r)**, which measures the **strength** and direction of a linear relationship between two quantitative variables. The correlation ranges between -1 and 1. Values near -1 indicate a strong negative linear relationship, values near 0 indicate a weak linear relationship, and values near 1 indicate a strong positive linear relationship.

- The correlation is only an appropriate numerical measure for linear relationships, and is sensitive to outliers. Therefore, the correlation should only be used as a supplement to a scatterplot (after we look at the data).

- The most commonly used criterion for finding a line that summarizes the pattern of a linear relationship is “least squares.” The **least squares regression line** has the smallest sum of squared vertical deviations of the data points from the line.

- The **slope** of the least squares regression line can be interpreted as the estimated (or predicted) **change in the mean (or average) value of the response variable when the explanatory variable increases by 1 unit.**

- The **intercept** of the least squares regression line is the estimated average value of the response variable when the explanatory variable is zero. Thus, this is only of interest if it makes sense for the explanatory variable to be zero AND we have observed data in that range (explanatory variable around zero) in our sample.

- The least squares regression line predicts the value of the response variable for a given value of the explanatory variable. **Extrapolation** is prediction of the response variable for values of the explanatory variable that fall outside the range of the data. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided.

Choose one of the datasets in the list and click through the tabs at the top to see the data and results!
