# Learn By Doing – Sampling (Software)

Published: December 28th, 2012

Use the solutions provided to answer the questions below.

## Objective:

The purpose of this activity is to show how simple random sampling produces a sample that is not subject to selection bias and therefore tends to be representative of the population from which it was drawn. We will also see how a nonrandom sample can introduce bias.

## Solutions:

Use the following output to answer the questions that follow.

## Background Information for Dataset

Consider the population of all students at a large university taking introductory statistics courses (1,129 students taking statistics for business, social sciences, or natural sciences).

Suppose we are interested in the values of four specific variables for this population: handedness (right-handed or left-handed), sex, SAT Verbal score, and age. If we were unable to determine the values of those variables for the entire population, we may be able to take a random sample from that population, and use the sample summaries as estimates for population summaries. Would the random sample provide unbiased estimates for the population values?
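One way to build intuition for "unbiased" is to simulate repeated sampling. The sketch below is illustrative only: the population is a made-up list of 1,129 SAT Verbal scores (the center of 590 and spread of 70 are assumptions, not values from the course dataset). It draws many simple random samples of 192 and compares the average of the sample means with the population mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,129 made-up SAT Verbal scores (NOT the course data).
# Scores are clamped to the 200-800 range.
population = [
    min(800, max(200, round(random.gauss(590, 70))))
    for _ in range(1129)
]
population_mean = statistics.mean(population)

# Draw many simple random samples of 192 students (without replacement)
# and record the mean of each one.
sample_means = [
    statistics.mean(random.sample(population, 192))
    for _ in range(2000)
]

average_of_sample_means = statistics.mean(sample_means)
print(f"population mean:         {population_mean:.1f}")
print(f"average of sample means: {average_of_sample_means:.1f}")
```

Any single sample mean will miss the population mean by a little, but the misses fall on both sides, so their average lands very close to the population value. That is what it means for the estimate to be unbiased.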

Next, what if, instead of taking a random sample, we sampled the 192 students who happen to be enrolled in the business statistics course? First we will intuit, then check, whether they would form a representative sample with respect to each of the four variables: handedness, sex, SAT Verbal score, and age. It may be helpful for you to know that, at this university, all students have comparable options in terms of when they take introductory statistics. You should also know that women, on the whole, tend to do somewhat better than men on the verbal portion of the SAT, and that business is a major that tends to interest males more than females.

The dataset includes the following variables:

• Course: natural science, social science, or business
• Handed: right-handed or left-handed
• Sex: female or male
• Verbal: SAT Verbal scores up to 800
• Age: in years

## Learn By Doing

The goals for this activity are to:

A. Verify that the distributions of the variables handedness, sex, SAT Verbal score, and age are roughly the same for the random sample as they are for the population.

B. Intuit whether the distributions of each of the four variables in the (nonrandom) sample of business students would be roughly the same as those for the population, or whether there is a reason to expect any of them to be biased.

C. Check our intuition by comparing the distributions of each of the four variables for the sample of business students with those for the population, and determine whether they are roughly the same or if the sample values for any of the variables appear to be biased.

Answer the following questions using the output.

A. Determine whether the four variables’ behavior for the random sample is comparable to their behavior for the population.

1. To compare the proportion of right-handed students in the random sample to that in the population, use the frequency tables created for handedness (one using the population and one using the random sample). Consider the distributions to be comparable if the sample proportion comes within about 5% of the population proportion. Does it? (Use the text box in the first Learn By Doing exercise below to record your answer.)

2. To compare the proportion of female students in the random sample to the proportion in the population, use the frequency tables created for sex (one using the population and one using the random sample). Consider the distributions to be comparable if the sample proportion comes within about 5% of the population proportion. Does it? (Use the text box in the first Learn By Doing exercise below to record your answers.)

3. To compare the distribution of verbal scores in the random sample to that in the population, use the descriptive statistics summary tables for SAT Verbal score (one using the population and one using the random sample). Since SAT scores tend to follow a normal (symmetric) distribution, you can focus on means to make a comparison. Consider the distributions to be comparable if the sample mean SAT Verbal score comes within about 10 points of the population mean. Does it? (Use the text box in the first Learn By Doing exercise below to record your answers.)

4. To compare the distribution of ages in the random sample to that in the population, use the descriptive statistics summary tables for age (one using the population and one using the random sample). Since age tends to follow a right-skewed distribution, you should focus on medians to make a comparison. Consider the distributions to be comparable if the sample median age comes within about 0.5 years of the population median. Does it?
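The comparison rules in questions 1 to 4 all have the same shape: is the sample summary within a stated tolerance of the population summary? A minimal sketch of that check follows. The helper name `is_comparable` is hypothetical, "within about 5%" is read here as 5 percentage points (an assumption), and the numbers passed in are placeholders rather than values from the output.

```python
def is_comparable(sample_value, population_value, tolerance):
    """Return True when the sample summary is within `tolerance`
    of the population summary."""
    return abs(sample_value - population_value) <= tolerance


# Tolerances taken from the questions above.
PROPORTION_TOLERANCE = 0.05   # "within about 5%", read as 5 percentage points
VERBAL_MEAN_TOLERANCE = 10    # SAT Verbal points
AGE_MEDIAN_TOLERANCE = 0.5    # years

# Placeholder numbers for illustration only; substitute the values
# you read from the population and sample tables.
checks = {
    "proportion right-handed": is_comparable(0.91, 0.89, PROPORTION_TOLERANCE),
    "mean SAT Verbal": is_comparable(600, 585, VERBAL_MEAN_TOLERANCE),
    "median age": is_comparable(19.5, 19.0, AGE_MEDIAN_TOLERANCE),
}
for name, ok in checks.items():
    print(f"{name}: {'comparable' if ok else 'not comparable'}")
```

The same check applies unchanged when the sample summaries come from the business students instead of the random sample.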

Corrected Solutions

B. Intuit whether the distributions of each of the four variables in the (nonrandom) sample of business students would be roughly the same as those for the population, or whether there is a reason to expect any of them to be biased.

C. Determine whether the four variables’ behavior for the non-random sample of business students is comparable to their behavior for the population.

1. To compare the proportion of right-handed students in the non-random sample to that in the population, use the frequency tables created for handedness (one using the population and one using the non-random sample). Consider the distributions to be comparable if the sample proportion comes within about 5% of the population proportion. Does it? (Use the text box in the first Learn By Doing exercise below to record your answer.)

2. To compare the proportion of female students in the non-random sample to the proportion in the population, use the frequency tables created for sex (one using the population and one using the non-random sample). Consider the distributions to be comparable if the sample proportion comes within about 5% of the population proportion. Does it? (Use the text box in the first Learn By Doing exercise below to record your answers.)

3. To compare the distribution of verbal scores in the non-random sample to that in the population, use the descriptive statistics summary tables for SAT Verbal score (one using the population and one using the non-random sample). Since SAT scores tend to follow a normal (symmetric) distribution, you can focus on means to make a comparison. Consider the distributions to be comparable if the sample mean SAT Verbal score comes within about 10 points of the population mean. Does it? (Use the text box in the first Learn By Doing exercise below to record your answers.)

4. To compare the distribution of ages in the non-random sample to that in the population, use the descriptive statistics summary tables for age (one using the population and one using the non-random sample). Since age tends to follow a right-skewed distribution, you should focus on medians to make a comparison. Consider the distributions to be comparable if the sample median age comes within about 0.5 years of the population median. Does it?

Corrected Solutions

## (Optional) SPSS Steps:

• Import Data: FILE > OPEN > DATA, choose Excel file from the pull-down, find the file, continue
• Edit Data: DATA > DEFINE VARIABLE PROPERTIES
• Create Frequency Table for Handedness and Sex in the Population: ANALYZE > DESCRIPTIVE STATISTICS > FREQUENCIES
• Create Summaries for Verbal and Age in the Population: ANALYZE > DESCRIPTIVE STATISTICS > FREQUENCIES
• Create Random Sample: DATA > SELECT CASES > Random Sample, choose “Exactly 192 cases from the first 1129 cases”. Look at the data to see how this works using the default method and notice our full dataset still exists (we will learn a different method in the next activity).
• Recreate the tables and summaries: exactly as before using this random sample
• Create Non-Random Sample: DATA > SELECT CASES > choose “if condition is satisfied”, pull in the Course variable, and use the condition Course = “Business”
• Recreate the tables and summaries: exactly as before using this non-random sample

## (Optional) SAS Steps:

• Create Frequency Table for Handedness and Sex in the Population: Use PROC FREQ to obtain these tables.
• Create Summaries for Verbal and Age in the Population: Use PROC MEANS to obtain these tables, which should contain the mean, standard deviation, and five-number summary.
• Create Random Sample: Use PROC SURVEYSELECT to create a simple random sample of 192 observations from the current population. Name the output dataset students_srs.
• Recreate the tables and summaries: exactly as before using this simple random sample
• Create Non-Random Sample: Use a DATA step and an IF-THEN statement to create a non-random sample containing only business students. Name this dataset students_bus.
• Recreate the tables and summaries: exactly as before using this non-random sample
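
For readers without SPSS or SAS, the same sequence of steps can be sketched in plain Python. This is an illustration under stated assumptions, not the course's procedure. The records are randomly generated stand-ins for the real dataset, the split between social science and natural science students is invented, and the helper names `make_student`, `frequency_table`, and `summarize` are hypothetical. Only the population size (1,129), the sample size (192), and the dataset names `students_srs` and `students_bus` come from the text above.

```python
import random
import statistics
from collections import Counter

random.seed(1)  # fixed seed so the sketch is reproducible


def make_student(course):
    """Build one made-up record with the five variables in the dataset."""
    return {
        "Course": course,
        "Handed": random.choice(["right"] * 9 + ["left"]),
        "Sex": random.choice(["female", "male"]),
        "Verbal": random.randint(400, 800),
        "Age": random.randint(17, 30),
    }


# 192 business students as in the text; the other two counts are assumed.
courses = ["business"] * 192 + ["social science"] * 500 + ["natural science"] * 437
students = [make_student(course) for course in courses]


def frequency_table(rows, variable):
    """Proportion of rows at each level of a categorical variable."""
    counts = Counter(row[variable] for row in rows)
    return {level: count / len(rows) for level, count in counts.items()}


def summarize(rows, variable):
    """Mean, standard deviation, and five-number summary."""
    values = sorted(row[variable] for row in rows)
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": values[0], "q1": q1, "median": median,
        "q3": q3, "max": values[-1],
    }


# Simple random sample of 192 students, drawn without replacement.
students_srs = random.sample(students, 192)
# Non-random sample: only the business students.
students_bus = [s for s in students if s["Course"] == "business"]

for label, rows in [("population", students),
                    ("random sample", students_srs),
                    ("business sample", students_bus)]:
    print(f"{label} (n = {len(rows)})")
    for variable in ("Handed", "Sex"):
        print(f"  {variable}: {frequency_table(rows, variable)}")
    for variable in ("Verbal", "Age"):
        print(f"  {variable}: {summarize(rows, variable)}")
```

Because every variable here is generated independently of course, the business subset will not show the biases the activity asks about; with the real data it may. Quartile definitions also vary between packages, so `q1` and `q3` can differ slightly from SPSS or SAS output.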