# IST 687 Inferential statistics

2019-10-23 16:54:47

## Today’s Agenda

• Announcements
• Review of Week 3 (Async; Chapters 4-6)
• Breakout (Complete Lab 4)
• Homework 4 Tips
• Next week’s agenda

## Announcements

• Office Hours: After class and by appointment
• Submitting homework
• Final project deliverables for project update II
• Questions/concerns?

## Overview of Week 3: (Descriptive Statistics & Functions)

• Importing data
• Cleaning dataframes
• Exploring distributions
• Functions

## Week 3: Importing data

• Most common formats: txt, csv, Excel, json (Week 5)
• Several functions for importing data from these formats: read_delim(), read_tsv(), read_csv(), read_csv2(), read.xlsx()
• These require the readr package (read.xlsx() requires the xlsx package).

• Automatic import using R
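As a minimal sketch of these import functions (the file names are hypothetical stand-ins):

```r
library(readr)

# Assumes a hypothetical file "survey.csv" in the working directory
survey <- read_csv("survey.csv")           # comma-separated values
# read_csv2("survey.csv")                  # semicolon-separated (decimal comma locales)
# read_delim("survey.txt", delim = "\t")   # any delimiter; a tab here, like read_tsv()
```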

## Week 3: Cleaning dataframes

• Renaming columns: names() or colnames()
• renaming one column: colnames(dataframe)[1] <- "new_name"
• Removing columns (scenario: remove columns 1 and 7):
• dataframe <- dataframe[,-c(1,7)] or dataframe <- dataframe[,c(2:6)]
• Creating new columns: dataframe$new_column <- code
• Coercing datatypes
• Converting columns in data frames: state$population <- as.numeric(state$population)
• Most of the work is done in the background, but note as.numeric(x) vs as.integer(x)

## Week 3: Exploring distributions

• Descriptive statistics: (1) central tendency, e.g., mean(), and (2) dispersion, e.g., sd(), give us the properties of distributions
• Distributions: (1) help you understand your data and (2) help determine modeling techniques (e.g., non-parametric modeling)
• Simulating distributions in R using, e.g., rnorm(), rpareto()

Note: Simulation is helpful when you have no data or limited data. Unlikely to be true for most data science work.

## Week 3: Functions

• Basic components of functions: body and arguments

name <- function(arg) {
BODY
}

• Functions can have many arguments (separated by ,)
• Variables can be defined inside or outside a function (R looks inside first)

function(arg1, arg2, arg3)

## Week 3: Functions

• A tip for writing functions… start with pseudo-code

Distribution <- function(vector, number) {
# only keep the elements within the vector that are less than the number,
# and store the number of eligible elements into the variable "count"
# calculate the percentage and return the results
}

## Week 3: Functions

• Stepwise coding with functions

vec <- c(1, 2, 3, 4, 5)
val <- 2

• only keep the elements within the vector that are less than the number, and store the number of eligible elements into the variable “count”

Start simple and add complexity

1. Return the elements in the vector less than the number

vec < val

## [1]  TRUE FALSE FALSE FALSE FALSE

2. Count the number of elements in the vector

sum(vec < val)

## [1] 1

## Week 3: Functions

• Example using length

vec[vec < val]

## [1] 1

length(vec[vec < val])

## [1] 1

Distribution <- function(vector, number) {
# only keep the elements within the vector that are less than the number,
# and store the number of eligible elements into the variable "count"
count <- length(vector[vector < number])
# calculate the percentage of eligible elements and return the result
count / length(vector) * 100
}

## Exploratory Data Analysis (EDA)

## Exploratory Data Analysis (EDA): Summarizing data

str(myCars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

## Exploratory Data Analysis (EDA): Summarizing data using dplyr

• The dplyr package is powerful for munging and summarizing data.

Select certain columns of data.
Filter your data to select specific rows.
Arrange the rows of your data into an order.
Mutate your data frame to contain new columns.
Summarize chunks of your data in some way.

It also has functions like sample_n(), group_by(), and the pipe operator (%>%).

More on dplyr here: Exploratory Data Analysis with R
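The five verbs above can be sketched in one pipeline (using the built-in mtcars data for illustration):

```r
library(dplyr)

mtcars %>%
  select(mpg, cyl, hp) %>%            # Select certain columns
  filter(cyl > 4) %>%                 # Filter to specific rows
  arrange(desc(mpg)) %>%              # Arrange the rows into an order
  mutate(hp_per_cyl = hp / cyl) %>%   # Mutate: add a new column
  summarize(mean_mpg = mean(mpg))     # Summarize the remaining data
```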

## Exploratory Data Analysis (EDA): Summarizing data using dplyr

Problem: get the mean hp and mpg by cylinder

myCars %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_hp = mean(hp)
  )
## # A tibble: 3 x 3
##   cyl   mean_mpg mean_hp
##   <fct>    <dbl>   <dbl>
## 1 4         26.7    82.6
## 2 6         19.7   122.
## 3 8         15.1   209.

## Exploratory Data Analysis (EDA): Summarizing data using dplyr

Problem: get the mean hp and mpg by cyl and gear

myCars %>%
  group_by(cyl, gear) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_hp = mean(hp)
  )

## Exploratory Data Analysis (EDA): Summarizing data using dplyr

## # A tibble: 8 x 4
## # Groups:   cyl [3]
##   cyl   gear  mean_mpg mean_hp
##   <fct> <fct>    <dbl>   <dbl>
## 1 4     3         21.5     97
## 2 4     4         26.9     76
## 3 4     5         28.2    102
## 4 6     3         19.8    108.
## 5 6     4         19.8    116.
## 6 6     5         19.7    175
## 7 8     3         15.0    194.
## 8 8     5         15.4    300.

## Useful resources for EDA

1. R for Data Science (Chapter 7)
2. Exploratory Data Analysis

## Useful packages/functions for the future

Here are a few links to sites with useful packages/functions for doing data science:

## A brief overview of Week 4: Sampling

Sampling data

• Allows us to make inferences about the underlying truth (i.e., the population).
• In R: sample(x = , size = , replace = )

## A brief overview of Week 4: Sampling

Obtain a sample of size 5 from the PopA vector, with replacement

PopA <- rnorm(1000, 80, 10)

sample(PopA, 5, replace = TRUE)

## [1] 77.10141 89.02092 76.44148 89.98562 81.71340
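Because sampling is random, the values above will differ on every run; set.seed() makes results reproducible (the seed value 123 is arbitrary):

```r
set.seed(123)                      # fix the random-number generator state
PopA <- rnorm(1000, mean = 80, sd = 10)
sample(PopA, 5, replace = TRUE)    # same five values on every run with this seed
```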

## A brief overview of Week 4: Evaluating two distributions

Comparing two distributions

• Helpful for evaluating whether two datasets are the “same”, i.e., come from the same distribution.
• To make this determination we can compare the sample statistics from the “unknown” population to the known population parameters.

## A brief overview of Week 4: Evaluating two distributions

A scenario:

1. You have the parameters of Pop A, and you want to know whether Pop B, with a single sample mean of 70.1608428, comes from the same population as Pop A, which has a mean of 80.5812465.

2. We can compare the sample mean of Pop B (70.1608428) to the distribution of Pop A to determine whether it falls within an acceptable range.

## A brief overview of Week 4 for lab: Evaluating two distributions

Is the mean value for Pop B within an acceptable range?

## A brief overview of Week 4 for lab: Evaluating two distributions

We can determine whether the mean for PopB is within our range of truth by: (1) setting a threshold and (2) comparing the mean of PopB to the threshold. If it’s outside the threshold, it’s not likely from the same population.

Our acceptable threshold is between the 5th and 95th percentiles of the data in PopA.

quantile(PopA, probs = c(0.05, 0.95))

##       5%      95%
## 63.47763 96.42094
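Putting the two steps together, a minimal sketch of the threshold check (using the sample mean of Pop B from the scenario):

```r
PopA <- rnorm(1000, 80, 10)
meanB <- 70.1608428                            # sample mean of Pop B
bounds <- quantile(PopA, probs = c(0.05, 0.95))
meanB >= bounds[1] & meanB <= bounds[2]        # TRUE if within the acceptable range
```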

## Lab 4: Descriptive Stats & Functions

Lab Goals:

• Investigating new functions
• Create samples from a population
• Making inferences about the population based on the sample mean

Groups for Pair Programming

## Lab 4: Descriptive Stats & Functions

• New functions for the week (set.seed(), runif(), sample()). Describe what each function does by adding comments in the code:

• A commented line of code uses #

# Computes the mean of a vector
mean(vector)

Note: Explore the purpose of each function using R’s built-in help (e.g., ?set.seed or ??uniform) or a search engine
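As a sketch of what the commented code might look like (the argument values are arbitrary examples):

```r
set.seed(42)      # fixes the random-number generator state so results are reproducible
runif(3, 0, 1)    # draws 3 random values from a uniform distribution on [0, 1]
sample(1:10, 4)   # draws 4 values from 1:10 without replacement
```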

## Homework 4 Tips

• Printing from functions
• Replicating samples
• Working with missing data

Be sure to install and load the moments package.

## Homework 4 Tips: Printing in Functions (Step 1)

• cat(): takes many arguments, but the last argument should be a newline “\n”

nameprinter <- function(names){
  cat("My name is:", names, "\n")
  cat("There are", nchar(names), "letters in", names)
}

nameprinter("Corey")

## My name is: Corey
## There are 5 letters in Corey

## Homework 4 Tips: Replicating samples (Step 2)

• Repeating a sequence programmatically
• Two functions: replicate(n, expression) or rep(value, times)

replicate(4,"Corey")

## [1] "Corey" "Corey" "Corey" "Corey"

rep(mean(c(10,43,10,46,5)),4)

## [1] 22.8 22.8 22.8 22.8
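Note the difference: replicate() re-evaluates its expression on every repetition, while rep() repeats a value that was computed once. This matters when the process is random:

```r
set.seed(1)  # arbitrary seed for reproducibility
replicate(3, mean(sample(1:100, 10)))  # three different sample means
rep(mean(sample(1:100, 10)), 3)        # one sample mean, repeated three times
```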

## Homework 4 Tips: “Getting data” (Step 2)

• Counting things in a vector that match some criteria, using grep() or which()

A vector of names stored in people

## [1] "Corey"  "Corey"  "Corey"  "Marsha"

grep("Corey", people)

## [1] 1 2 3

which(people=="Corey")

## [1] 1 2 3
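To go from matching to counting, wrap either result in length(), or sum a logical comparison directly:

```r
people <- c("Corey", "Corey", "Corey", "Marsha")

length(which(people == "Corey"))  # 3
sum(people == "Corey")            # 3: each TRUE counts as 1
```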

## Homework 4 Tips: Missing data (Step 3)

• Working with missing values: a matter of informed choice
• Use summary() to investigate missing values
• Choices: ignore, replace, delete

• na.omit() or complete.cases() removes observations with NAs in any column
##   score1 score2 score3
## 1      9     NA      1
## 2      6      5      3
## 3     NA      2      5

data[complete.cases(data), ] or na.omit(data)

##   score1 score2 score3
## 2      6      5      3
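The “replace” choice can be sketched like this, filling NAs in one column with that column’s mean (the data frame mirrors the example above):

```r
data <- data.frame(score1 = c(9, 6, NA),
                   score2 = c(NA, 5, 2),
                   score3 = c(1, 3, 5))

# replace missing score1 values with the mean of the observed ones
data$score1[is.na(data$score1)] <- mean(data$score1, na.rm = TRUE)
data$score1  # 9.0 6.0 7.5
```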

## Homework 4 Tips: Missing data (Step 3)

• Computing on columns with missing values na.rm = TRUE

mean(data$score1)

## [1] NA

mean(data$score1, na.rm = TRUE)

## [1] 7.5

## Homework 4 Tips

• Use the results in question 6 as a starting point for step 7. Remember, functions can take other functions as arguments

e.g.,

replicate(times,process)
replicate(100, mean(sample(studentPop, size=sampleSize)))
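A runnable sketch of that pattern (studentPop and sampleSize are hypothetical stand-ins for the homework’s objects):

```r
studentPop <- rnorm(500, mean = 70, sd = 8)  # hypothetical population
sampleSize <- 30

sampleMeans <- replicate(100, mean(sample(studentPop, size = sampleSize)))
summary(sampleMeans)  # distribution of the 100 sample means
```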

## Next Week

• Asynchronous
• Week 5 Connecting with external data sources; Chapter 11
• Submit HW 4 and Lab 4 Monday
• Continue collaborating on your final project
• Live Session
• Lab 5: Storage Wars