This is an R Markdown Notebook containing review questions for the IST 687 week 8 exam. The exam is worth 25 points and represents 30 percent of your grade for the course.

Logistics

  1. The exam consists of 21 questions. The distribution of questions by week and topic can be found in the table below.
  2. You will need a password to begin the exam. I’ll release this during the live session.
  3. Only one attempt is allowed. After clicking “Attempt Quiz,” you have one hour to complete the exam (I cannot extend the time).
  4. There are three types of questions you can expect to encounter:
    • Open-ended: You will be asked to define concepts, e.g., mean, measures of central tendency
    • Writing R code: You will be asked to write code to render some output
    • Evaluating R code: You will be asked to read code blocks and write what will be rendered in the console

| Week | # Questions |
|---|---|
| 2 - Using R to manipulate data | 8 |
| 3 - Descriptive Statistics & Functions | 5 |
| 4 - Inferential statistics | 4 |
| 6 - Introduction to visualization | 1 |
| 7 - Working with map data | 1 |
| 8 - Linear modeling | 2 |

Sample Questions

  1. What are the dependent and independent variables in the code below? Conceptually describe what the following R code does and what you would expect to see as output from this code:

    summary(lm(formula = fawn~adult+precipitation, data=df))
# Dependent: fawn
# Independent: adult and precipitation
 
# This code fits a linear model that predicts fawn from adult and precipitation, using the df data frame, and prints a summary of the fit. The output gives the coefficient estimates, their standard errors, t-values, and significance values (p-values). The Pr(>|t|) column gives the probability of observing a t-value at least as extreme as |t| if the true coefficient were zero. A small p-value indicates that the relationship between a predictor (adult or precipitation) and the response (fawn) is unlikely to be due to chance alone. Typically, a p-value of 5% or less is used as the cut-off point.

# Consequently, a small p-value for a slope means that we can reject the null hypothesis that the coefficient is zero, which allows us to conclude that there is a relationship between that predictor and fawn.
 
# Finally, R-squared is used to determine how much of the variance in the dependent variable is accounted for by the model. An R-squared value of 1.0 would mean that the x variable(s), the independent variable(s), perfectly predicted the y, or dependent, variable. An R-squared value of zero would indicate that the x variable(s) did not predict the y variable at all.
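
To see where these pieces appear in actual output, here is a minimal sketch that fits the same formula to a small made-up data frame. The values are invented for illustration and are not the course data; only the column names match the question.

```r
# Hypothetical data: invented values, only the column names match the question
df <- data.frame(
  fawn          = c(1.9, 2.4, 2.2, 2.9, 3.1, 3.0, 3.6, 3.8),
  adult         = c(7.5, 8.0, 8.3, 8.9, 9.1, 9.6, 10.2, 10.8),
  precipitation = c(11.2, 12.9, 10.8, 13.1, 12.4, 11.5, 13.9, 12.7)
)

model <- lm(formula = fawn ~ adult + precipitation, data = df)
model_summary <- summary(model)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(model_summary)
print(coef_table)

# Proportion of variance in fawn accounted for by the model (between 0 and 1)
r_squared <- model_summary$r.squared
print(r_squared)
```

Each row of the coefficient table corresponds to the intercept or one independent variable, so there is one p-value per predictor to interpret.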
  1. Write the R code that produces the following two outputs for a data frame with the name df
    height          weight   
 Min.   :58.00   Min.   :130  
 1st Qu.:59.50   1st Qu.:140  
 Median :61.00   Median :150  
 Mean   :63.86   Mean   :160  
 3rd Qu.:68.50   3rd Qu.:170  
 Max.   :72.00   Max.   :220

and…

'data.frame':   7 obs. of  2 variables:  
$ height: num  59 60 61 58 67 72 70  
$ weight: num  150 140 180 220 160 140 130  
summary(df)

str(df)
  1. Conceptually describe what the following R code does and what you would expect to see as output from this code:

    dfAir <- data.frame(air$Ozone, air$Solar.R, air$scaleWind, air$Temp, air$Date)
    dfAir <- melt(dfAir, id=c("air.Date"))
    ggplot(dfAir, aes(x=air.Date, y=value, color=variable)) + geom_line()
# melt() comes from the reshape2 package and ggplot() from the ggplot2 package
library(reshape2)
library(ggplot2)

# Creates a new data frame with the ozone, solar, wind, temp, and date columns from the air data frame
dfAir <- data.frame(air$Ozone, air$Solar.R, air$scaleWind, air$Temp, air$Date)

# Converts the data frame dfAir from wide to long format, keeping air.Date as the id. Each value from a column other than air.Date becomes its own row, with the original column name stored in a column called variable
dfAir <- melt(dfAir, id=c("air.Date"))

# Creates a line plot of value against date, with one colored line per variable, e.g., ozone, solar
ggplot(dfAir, aes(x=air.Date, y=value, color=variable)) + geom_line()
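
The wide-to-long step is usually the hardest part to picture. As a rough sketch, the base R code below builds by hand the same three-column shape that melt() returns (id column, variable, value). The tiny data frame is only a stand-in for dfAir, and this is not how melt() is implemented; it just shows the resulting layout.

```r
# Hypothetical wide data: one row per date, one column per measurement
wide <- data.frame(
  air.Date  = as.Date(c("1973-05-01", "1973-05-02", "1973-05-03")),
  air.Ozone = c(41, 36, 12),
  air.Temp  = c(67, 72, 74)
)

# Every column except the id column is a measurement to be stacked
measure_cols <- setdiff(names(wide), "air.Date")

# Long format: one row per (date, variable) pair
long <- data.frame(
  air.Date = rep(wide$air.Date, times = length(measure_cols)),
  variable = rep(measure_cols, each = nrow(wide)),
  value    = unlist(wide[measure_cols], use.names = FALSE)
)
print(long)
```

Three dates and two measurement columns give six rows, which is why ggplot can then map the variable column to color and draw one line per measurement.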
  1. Conceptually describe what the following R code does and what you would expect to see as output from this code:

    mean(replicate(400,mean(sample(USstatePops$april10census, size=16, replace=TRUE)),  simplify=TRUE))
# It helps to read this code from the inside out, starting with the bottom comment


# takes the mean of the 400 sample means, so the output is a single number that should be close to the population mean
mean(
  
# repeats the draw-and-average step 400 times, returning a vector of 400 sample means
replicate(400,
          
# computes the mean of one sample
mean(
  
# draws a random sample of 16 values, with replacement, from a column called april10census in the USstatePops data frame
sample(USstatePops$april10census, size=16, replace=TRUE)
),  simplify=TRUE))
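
The same steps can be run one at a time to check each intermediate result. This is a minimal sketch under an assumption: the population vector below is invented and only stands in for USstatePops$april10census.

```r
set.seed(687)  # fixed seed so the run is repeatable

# Hypothetical stand-in for USstatePops$april10census (invented values)
population <- seq(from = 500000, to = 25000000, length.out = 51)

# 400 sample means, each from a sample of 16 drawn with replacement
sample_means <- replicate(400, mean(sample(population, size = 16, replace = TRUE)))

length(sample_means)   # one value per replication
mean(sample_means)     # a single number, close to the population mean
mean(population)       # the population mean, for comparison
```

Storing the replicate() result in a variable first makes it easy to inspect the 400 sample means (for example with hist()) before averaging them.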
  1. Conceptually describe what the following R code does and what you would expect to see as output from this code:

    sample(USstatePops$april10census, size=16, replace=TRUE)
# The code draws a random sample of 16 values from a column called april10census in the USstatePops data frame. Sampling is done with replacement, so the same value can be drawn more than once. The output is a vector of 16 values from that column.

# When sampling you should consider (1) whether the sample is representative of the population, (2) whether to sample with or without replacement, and (3) an appropriate sample size

# Comparing two samples can be used to determine whether they are sufficiently different from one another (p. 92). If we get a new sample mean, and we find that it is in the extreme zone defined by our cut points, we can tentatively conclude that the sample that made that mean is a different kind of thing than the samples that made the sampling distribution.
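
The cut-point idea can be sketched in a few lines. The population values are invented, and the choice of the central 95% (cut points at the 2.5% and 97.5% quantiles) and the helper name is_extreme are assumptions made for this illustration.

```r
set.seed(687)  # fixed seed so the run is repeatable

# Hypothetical population (invented values), standing in for the census column
population <- seq(from = 500000, to = 25000000, length.out = 51)

# Sampling distribution of the mean for samples of size 16
sample_means <- replicate(400, mean(sample(population, size = 16, replace = TRUE)))

# Cut points that bound the central 95% of the sample means
cut_points <- quantile(sample_means, probs = c(0.025, 0.975))

# Hypothetical helper: TRUE when a new sample mean falls outside the cut points
is_extreme <- function(new_mean) {
  unname(new_mean < cut_points[1] | new_mean > cut_points[2])
}

is_extreme(mean(population))  # the population mean sits in the central zone
is_extreme(max(population))   # far above the upper cut point
```

A new sample mean flagged as extreme is only tentative evidence that the sample came from a different population, since about 5% of ordinary samples land outside these cut points by chance.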
  1. What information can we gain by examining the data distribution e.g., normal, skewed?
# Based on visual inspection of the graph we can see all the possible values (or intervals) of the data and how often they occur. 
  1. Write R code that would order a column with the name prices in a data frame named food in ascending order. Place the results in a data frame with the name ordered_food.
ordered_food <- food[order(food$prices),] 
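
To see why indexing with order() works, here is a short sketch on a made-up food data frame. The items and prices are invented for illustration.

```r
# Hypothetical food data frame (invented items and prices)
food <- data.frame(
  item   = c("bread", "cheese", "apples", "coffee"),
  prices = c(3.50, 7.25, 2.10, 9.00)
)

# order() returns the row positions that would sort prices from low to high
order(food$prices)

# Indexing rows by those positions reorders the whole data frame
ordered_food <- food[order(food$prices), ]
ordered_food

# For descending order, pass decreasing = TRUE
food[order(food$prices, decreasing = TRUE), ]
```

Note the comma inside the square brackets: the positions go in the row slot, and leaving the column slot empty keeps every column.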
  1. What phenomenon does the law of large numbers describe?
# As we draw more and more samples, the mean of all of the sample means gets really close to the actual population mean. We also find that the distribution of sample means starts to form a bell-shaped or normal distribution centered on that mean; the bell shape itself is described by the central limit theorem. (p. 88)
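
A quick simulation makes the convergence visible. This is a minimal sketch, assuming a uniform distribution on 0 to 1 (true mean 0.5) as the population; it shows the law of large numbers in its simplest form, where the running mean of random draws settles near the true mean.

```r
set.seed(687)  # fixed seed so the run is repeatable

# 100,000 random draws from a uniform distribution on [0, 1], true mean 0.5
draws <- runif(100000)

# Running mean after 1, 2, 3, ... draws
running_mean <- cumsum(draws) / seq_along(draws)

# Running mean after 10, 100, 1,000, and 100,000 draws
running_mean[c(10, 100, 1000, 100000)]
```

Early values can be well off 0.5, while the value after all 100,000 draws should be very close to it.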
  1. Write the R code that:
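# Creates a vector named exam_scores holding four scores
exam_scores <- c(100, 85, 96, 91)
# Adds 5 to every score (the result is printed, not stored)
exam_scores + 5
# Computes the standard deviation of the original scores
sd(exam_scores)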
  1. Write an SQL statement that computes the mean score from a data frame with the name exam_score.
select avg(score) from exam_score
  1. Use the following to answer the questions below:

    height <- c(59,60,61,58,67,72,70)
    weight <- c(150,140,180,220,160,140,130)
    df <- data.frame(height, weight)
# Write the R code that outputs the element at the sixth row and first column of df.
df[6,1]

# Add the numbers 64 and 72 to the vector height and 110 and 150 to the vector weight
height <- c(height, 64, 72)
weight <- c(weight, 110, 150)

# Update df with the new height and weight
df <- data.frame(height, weight)

# Write R code that computes the mean, max, and min of height and stores the results in variables with the prefix df_
df_mean <- mean(df$height)
df_max <- max(df$height)
df_min <- min(df$height)

# Write R code that computes weight/height for each person, using the updated weight vector, and stores the results in a new column in df
df$new <- df$weight/df$height


# Write an if-else statement that evaluates whether the value produced in the previous question is greater than the mean of that value and places yes or no in a new column in the data frame.
df$evaluate <- ifelse(df$new > mean(df$new), "yes", "no")

Exam Tips

  1. Be mindful of the time. You will not get a warning when the exam is about to end.
  2. There are several questions towards the end of the exam that are worth 2 points. From my experience, it is impossible to earn a passing grade if you do not complete these questions.
