A key focus of Chapter 10 is how to make inferences about populations based on samples. The essential logic lies in comparing a single instance of a statistic, such as a sample mean, to a distribution of such values. The comparison can lead to one of two conclusions – the sample statistic is either extreme or not extreme. But what are the thresholds for making this kind of judgment call (i.e., whether a value is extreme or not)? This activity explores that question.

The problem is this: You receive a sample containing the ages of 30 students. You are wondering whether this sample is a group of undergraduates (mean age = 20 years) or graduates (mean age = 25 years). To answer this question, you must compare the mean of the sample you receive to a distribution of means from the population. The following fragment of R code begins the solution:

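set.seed(56)
sampleSize <- 30
studentPop <- rnorm(20000,mean=20,sd=3)
undergrads <- sample(studentPop,size=sampleSize,replace=TRUE)
grads <- rnorm(sampleSize,mean=25,sd=3)
if (runif(1)>0.5) { testSample <- grads } else { testSample <- undergrads }
mean(testSample)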
## [1] 24.91001

1. Annotate the code above with line-by-line commentary. To get full credit on this assignment, you must demonstrate a clear understanding of what the six lines of code actually do! You will have to look up the meaning of some commands.

2. The next line of code should generate a list of sample means from the population called “studentPop”. Very similar code to accomplish this appears right in Chapter 7. How many sample means should you generate? You can create any number that you want – hundreds, thousands, whatever – but I suggest that you generate just 100 means for ease of inspection. That is a pretty small number, but it makes it easy to think about percentiles and ranks.

3. Once you have your list of sample means generated from studentPop, the trick is to compare mean(testSample) to that list of sample means and see where it falls. Is it in the middle of the pack? Far out toward one end? Here is one hint that will help you: In Chapter 10 (p.90), the quantile() command is used to generate percentiles based on thresholds of 2.5% and 97.5%. Those are the thresholds we want, and the quantile() command will help you create them.

4. Your code should end with a print() statement that could say either, “Sample mean is extreme,” or “Sample mean is not extreme.”

5. Please submit both the output of your runs and the R code.

# ----------------------------------------------------------------
# 1. Annotate the code above with line-by-line commentary.
# Specify a seed so that we can reproduce the same result later if we use the same seed.
set.seed(2)
# assign the value 30 to the variable "sampleSize"
sampleSize <- 30
# Randomly generate 20000 numbers from a normal distribution with mean 20 and standard deviation 3,
# then assign those numbers to the vector "studentPop".
studentPop <- rnorm(20000, mean=20, sd=3)
# Draw a sample of 30 values from studentPop and assign the 30 numbers to "undergrads".
# Assumption: sampling with replacement, matching the sample() call used in step 2.
undergrads <- sample(studentPop, size=sampleSize, replace=TRUE)

# Randomly generate 30 numbers from a normal distribution with mean 25 and standard deviation 3,
# and assign the numbers to the vector "grads".
grads <- rnorm(sampleSize, mean=25, sd=3)

# Randomly assign either the grads sample or the undergrads sample to testSample, depending on the value generated by runif(1).
# "runif(1)" generates one random number drawn uniformly between 0 and 1.
# If the number is greater than 0.5, assign grads sample to testSample. Otherwise, assign undergrads sample to testSample.
if (runif(1)>0.5) { testSample <- grads } else { testSample <- undergrads }

# calculate the mean of "testSample"
mean(testSample)
## [1] 25.54158
# The result is 25.54158.
# This is much closer to the mean age of graduates (25) than to the mean age of undergraduates (20),
# which suggests that testSample is the graduate sample.

# ----------------------------------------------------------------
# 2. Generate 100 sample means from studentPop
mySample <- replicate(100, mean(sample(studentPop, size=sampleSize, replace=TRUE)))
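
A quick sanity check on step 2, offered as a sketch rather than as part of the required answer: the standard deviation of the sample means should land near the theoretical standard error, sd / sqrt(n) = 3 / sqrt(30), roughly 0.55. The block rebuilds the population so that it runs on its own; the names expectedSE and observedSE are introduced here for illustration only.

```r
# Rebuild the population and the 100 sample means so this block is self-contained.
set.seed(2)
sampleSize <- 30
studentPop <- rnorm(20000, mean = 20, sd = 3)
mySample <- replicate(100, mean(sample(studentPop, size = sampleSize, replace = TRUE)))

# Theoretical standard error of the mean for samples of size 30 (population sd = 3).
expectedSE <- 3 / sqrt(sampleSize)
# Spread actually observed among the 100 simulated sample means.
observedSE <- sd(mySample)

summary(mySample)
c(expected = expectedSE, observed = observedSE)
```

If the two numbers are close, the list of sample means is behaving like a sampling distribution centered near 20, which is what the later comparison relies on.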

# ----------------------------------------------------------------
# 3. Compare mean(testSample) to that list of sample means and see where it falls.
mean(testSample)
## [1] 25.54158
# quantile() returns the values below which a given fraction of the data falls.
# Compute the cutoffs at the 2.5% and 97.5% thresholds: only about 2.5 percent of the
# sample means fall below 19.15 and only about 2.5 percent fall above 21.11.
quantile(mySample, probs = c(0.025, 0.975))
##     2.5%    97.5%
## 19.14567 21.10799
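
Another way to see where the test mean falls is to compute the fraction of simulated sample means at or below it. The following is a minimal sketch under stated assumptions: observedMean is copied from the mean(testSample) output shown in this script, and rankFraction is a name introduced here for illustration.

```r
# Rebuild the 100 sample means so this block is self-contained.
set.seed(2)
sampleSize <- 30
studentPop <- rnorm(20000, mean = 20, sd = 3)
mySample <- replicate(100, mean(sample(studentPop, size = sampleSize, replace = TRUE)))

# Assumption: value copied from the mean(testSample) output shown in this script.
observedMean <- 25.54158

# Fraction of the simulated sample means that are at or below the observed mean.
# Values near 0 or 1 indicate a mean far out toward one end of the distribution.
rankFraction <- mean(mySample <= observedMean)
rankFraction
```

A fraction between 0.025 and 0.975 corresponds to "in the middle of the pack"; anything outside that band matches the extreme region defined by the two quantile cutoffs.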
# ----------------------------------------------------------------
# 4. If the sample mean is below the 2.5% quantile or above the 97.5% quantile,
# then it can be defined as extreme. Otherwise it is not extreme (107 CH 10 different)
if (mean(testSample) < quantile(mySample, probs=0.025) || mean(testSample) > quantile(mySample, probs=0.975)) {
  print("Sample mean is extreme")
} else {
  print("Sample mean is not extreme")
}
## [1] "Sample mean is extreme"
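
The extreme / not extreme decision can also be packaged as a small function so that it can be reused with other samples or other thresholds. This is only a sketch: isExtremeMean and its parameters lower and upper are hypothetical names that do not come from the assignment or the textbook.

```r
# Hypothetical helper (not part of the assignment): returns TRUE when testMean
# lies outside the band between the lower and upper quantiles of sampleMeans.
isExtremeMean <- function(testMean, sampleMeans, lower = 0.025, upper = 0.975) {
  cutoffs <- quantile(sampleMeans, probs = c(lower, upper), names = FALSE)
  testMean < cutoffs[1] || testMean > cutoffs[2]
}

# Rebuild the distribution of sample means so this block is self-contained.
set.seed(2)
sampleSize <- 30
studentPop <- rnorm(20000, mean = 20, sd = 3)
mySample <- replicate(100, mean(sample(studentPop, size = sampleSize, replace = TRUE)))

# A sample drawn from the graduate distribution (mean 25) should be flagged.
gradLike <- rnorm(sampleSize, mean = 25, sd = 3)
gradResult <- isExtremeMean(mean(gradLike), mySample)

# The median of the sample means sits in the middle and should not be flagged.
middleResult <- isExtremeMean(median(mySample), mySample)

print(c(gradLike = gradResult, middle = middleResult))
```

Note that a sample genuinely drawn from studentPop would still be flagged as extreme roughly 5% of the time, because the two cutoffs exclude about 5% of the sampling distribution by construction.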