IST 687 Inferential statistics

Corey Jackson

2019-10-23 16:54:47

Today’s Agenda

Announcements

Overview of Week 3: (Descriptive Statistics & Functions)

Week 3: Importing data

Week 3: Cleaning dataframes

Week 3: Exploring distributions

Note: Simulation is helpful when you have no actual data or only limited data. That is unlikely to be the case for most data science work.

Week 3: Functions

name <- function(arg)  
{  
   BODY  
}  
function(arg1,arg2,arg3) 
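
As a concrete instance of the template above, here is a minimal sketch of a function with two arguments. The name addTax and its arguments price and rate are made up for illustration and are not part of the course material.

```r
# Hypothetical example: add a sales tax to a price.
# `rate` has a default value, so it can be left out of the call.
addTax <- function(price, rate = 0.08)
{
  total <- price * (1 + rate)  # BODY: compute the result
  return(total)                # value handed back to the caller
}

addTax(100)        # uses the default rate
addTax(100, 0.10)  # overrides the default rate
```

Arguments are matched by position unless they are named in the call, so addTax(rate = 0.10, price = 100) is equivalent to the second call.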

Week 3: Functions

 Distribution <- function(vector,number)
 {
   # only keep the elements within the vector that are less than the number, and store the number of eligible elements into the variable "count"

   # calculate the percentage and return the results

 }

Week 3: Functions

Start simple and add complexity

  1. Flag the elements in the vector that are less than the number

vec < val
## [1]  TRUE FALSE FALSE FALSE FALSE

  2. Count the number of elements in the vector that are less than the number

sum(vec < val)
## [1] 1
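
The slides do not show how vec and val were defined. The sketch below uses assumed values, chosen only so that the comparison gives one TRUE followed by four FALSE as in the output above, to show why sum() works as a counter.

```r
# Assumed values; the actual vec and val used in the slides are not shown
vec <- c(1, 5, 8, 3, 9)
val <- 2

isBelow <- vec < val   # logical vector, one TRUE/FALSE per element
isBelow

vec[isBelow]           # logical indexing keeps only the elements flagged TRUE

# In arithmetic, TRUE counts as 1 and FALSE as 0,
# so summing the logical vector counts the TRUE values
sum(isBelow)
```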

Week 3: Functions

length(vec[vec < val])

## [1] 1
 Distribution <- function(vector,number)
 {
   # only keep the elements within the vector that are less than the number, and store the number of eligible elements into the variable "count"

   count <- length(vector[vector < number])

   # calculate the percentage and return the results

 }
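
One possible way to finish the skeleton, offered as a sketch rather than the official solution. The name percentBelow is made up here so that it does not overwrite Distribution, and it assumes "percentage" means the share of elements below the number, on a 0 to 100 scale.

```r
# Hypothetical completion of the skeleton above
percentBelow <- function(vector, number)
{
  # count the elements that are less than the number
  count <- length(vector[vector < number])

  # share of all elements, expressed as a percentage
  percentage <- count / length(vector) * 100
  return(percentage)
}

percentBelow(c(1, 5, 8, 3, 9), 2)  # one of the five elements is below 2
```

Note that the body uses only the argument names (vector, number), never variables from the global environment, so the function gives the same answer wherever it is called.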

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA): Summarizing data

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
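
The structure above is consistent with the built-in mtcars data set with cyl and gear converted to factors. The slides do not show how myCars was created, so the following is a sketch under that assumption.

```r
# Assumption: myCars is a copy of the built-in mtcars data
myCars <- mtcars

# Treat cylinder and gear counts as categories rather than numbers
myCars$cyl  <- as.factor(myCars$cyl)
myCars$gear <- as.factor(myCars$gear)

str(myCars)          # column types and the first few values
summary(myCars$mpg)  # min, quartiles, mean, and max of a numeric column
table(myCars$cyl)    # number of cars at each factor level
```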

Exploratory Data Analysis (EDA): Summarizing data using dplyr

Select certain columns of data.
Filter your data to select specific rows.
Arrange the rows of your data into an order.
Mutate your data frame to contain new columns.
Summarize chunks of your data in some way.

It also provides sampling functions such as sample_n(), group_by() for grouped operations, and the pipe operator %>%.

More on dplyr here: Exploratory Data Analysis with R
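
The five verbs listed above can be chained with the pipe. This is a minimal sketch on the built-in mtcars data; the column hp_per_wt and the cutoff of 100 hp are made up for illustration.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, hp, wt) %>%      # Select: keep four columns
  filter(hp > 100) %>%              # Filter: keep rows with more than 100 hp
  mutate(hp_per_wt = hp / wt) %>%   # Mutate: add a new column
  arrange(desc(mpg))                # Arrange: sort by mpg, highest first

head(result, 3)

# Summarize: collapse the remaining rows into one row of statistics
result %>%
  summarize(mean_mpg = mean(mpg), n_cars = n())
```

Each step receives the data frame produced by the previous step, which is why the data argument never has to be repeated.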

Exploratory Data Analysis (EDA): Summarizing data using dplyr

Problem: get the mean hp and mpg by cylinder

myCars %>% 
  group_by(cyl) %>% 
  summarize(
    mean_mpg=mean(mpg),                                                      
    mean_hp=mean(hp)
    )
## # A tibble: 3 x 3
##   cyl   mean_mpg mean_hp
##   <fct>    <dbl>   <dbl>
## 1 4         26.7    82.6
## 2 6         19.7   122. 
## 3 8         15.1   209.

Exploratory Data Analysis (EDA): Summarizing data using dplyr

Problem: get the mean hp and mpg by cyl and gear

myCars %>% 
  group_by(cyl,gear) %>% 
  summarize(
    mean_mpg=mean(mpg),                                                      
    mean_hp=mean(hp)
    )

Exploratory Data Analysis (EDA): Summarizing data using dplyr

## # A tibble: 8 x 4
## # Groups:   cyl [3]
##   cyl   gear  mean_mpg mean_hp
##   <fct> <fct>    <dbl>   <dbl>
## 1 4     3         21.5     97 
## 2 4     4         26.9     76 
## 3 4     5         28.2    102 
## 4 6     3         19.8    108.
## 5 6     4         19.8    116.
## 6 6     5         19.7    175 
## 7 8     3         15.0    194.
## 8 8     5         15.4    300.

Useful resources for EDA

  1. R for Data Science (Chapter 7)
  2. Exploratory Data Analysis

Useful packages/functions for the future

Here are a few links to sites with useful packages/functions for doing data science:

  1. Top R libraries for data science

  2. Quick list of useful R packages

Week 4: Inferential stats for Lab

A brief overview of Week 4: Sampling

Sampling data

A brief overview of Week 4: Sampling

Obtain a sample of size 5 from the Distribution vector with replacement

Distribution <- rnorm(1000,80,10)

PopA <- rnorm(1000,80,10)

sample(PopA,5,replace = TRUE)

## [1] 77.10141 89.02092 76.44148 89.98562 81.71340
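
Because rnorm() and sample() are random, the values above change on every run. The sketch below fixes the random seed (687 is an arbitrary choice) and contrasts sampling with and without replacement.

```r
set.seed(687)  # arbitrary seed so the draws can be repeated
PopA <- rnorm(1000, mean = 80, sd = 10)

# With replacement: the same element may be drawn more than once
withReplacement <- sample(PopA, 5, replace = TRUE)

# Without replacement: each element can be drawn at most once
withoutReplacement <- sample(PopA, 5, replace = FALSE)

mean(withReplacement)  # sample mean, a noisy estimate of the population mean
mean(PopA)             # population mean, close to 80
```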

A brief overview of Week 4: Evaluating two distributions

Comparing two distributions

A brief overview of Week 4: Evaluating two distributions

A scenario:

  1. You have the parameters of Pop A, which has a mean of 80.5812465, and you want to know whether a sample from Pop B with a sample mean of 70.1608428 could have come from the same population as Pop A.

  2. We can compare the sample mean of Pop B (70.1608428) to the distribution of Pop A to determine whether it falls within an acceptable range.

A brief overview of Week 4 for lab: Evaluating two distributions

Is the mean value for Pop B within an acceptable range?

A brief overview of Week 4 for lab: Evaluating two distributions

We can determine whether the mean for PopB is within our acceptable range by: (1) setting a threshold and (2) comparing the mean of PopB to the threshold. If it's outside of the threshold, it's not likely from the same population.

Our acceptable threshold is between the 5th and 95th percentiles of the data in PopA.

quantile(PopA,probs = c(0.05,0.95))

##       5%      95% 
## 63.47763 96.42094
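
With the bounds above, the Pop B sample mean of 70.1608428 lies between 63.47763 and 96.42094, so it is inside the acceptable range. The comparison can be wrapped in a small function; withinRange is a made-up helper name, and PopA is regenerated here with an arbitrary seed, so its quantiles will differ slightly from the ones printed above.

```r
# Hypothetical helper: is `value` between the lower and upper quantiles of `population`?
withinRange <- function(value, population, lower = 0.05, upper = 0.95)
{
  bounds <- quantile(population, probs = c(lower, upper))
  # unname() drops the "5%" label that the quantile names would otherwise carry over
  return(unname(value >= bounds[1] & value <= bounds[2]))
}

set.seed(687)  # arbitrary seed
PopA <- rnorm(1000, mean = 80, sd = 10)

withinRange(70.1608428, PopA)  # the Pop B sample mean from the scenario
withinRange(50, PopA)          # a value far below the bulk of PopA
```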

Lab 4: Descriptive Stats & Functions

Lab Goals:

Groups for Pair Programming

Lab 4: Descriptive Stats & Functions

# Computes the mean of a vector
mean(vector)

Note: Explore the purpose of each function using ?name (for example, ?mean), ??keyword, or a search engine
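
Base R has several other descriptive-statistics functions that follow the same calling pattern as mean(). A short sketch on a small vector; the particular selection of functions is an assumption about what the lab covers.

```r
scores <- c(10, 43, 10, 46, 5)

mean(scores)      # arithmetic average
median(scores)    # middle value after sorting
sd(scores)        # sample standard deviation
range(scores)     # smallest and largest value
quantile(scores, probs = c(0.05, 0.95))  # 5th and 95th percentiles
```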

Homework 4 Tips

Be sure to install and load the moments package.

Homework 4 Tips: Printing in Functions (Step 1)

nameptinter <- function(names){
  cat("My name is:",names,"\n")
  cat("There are",nchar(names),"letters in", names,"\n")
}

nameptinter("Corey")

## My name is: Corey 
## There are 5 letters in Corey

Homework 4 Tips: Replicating samples (Step 2)

replicate(4,"Corey")

## [1] "Corey" "Corey" "Corey" "Corey"

rep(mean(c(10,43,10,46,5)),4)

## [1] 22.8 22.8 22.8 22.8
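
The two functions look alike when the input is a constant, but they behave differently when the expression is random. A minimal sketch, with an arbitrary seed and a made-up sample size of 3:

```r
set.seed(687)  # arbitrary seed
scores <- c(10, 43, 10, 46, 5)

# rep() evaluates its first argument once, then repeats that single value
repeated <- rep(mean(sample(scores, 3)), 4)

# replicate() re-evaluates the expression each time,
# so every mean comes from a fresh random sample
resampled <- replicate(4, mean(sample(scores, 3)))

repeated
resampled
```

This is why replicate(), not rep(), is the tool for repeating a sampling process.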

Homework 4 Tips: “Getting data” (Step 2)

A vector of names stored in people

## [1] "Corey"  "Corey"  "Corey"  "Marsha"

grep("Corey", people)

## [1] 1 2 3

which(people=="Corey")

## [1] 1 2 3
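
Both calls return positions, not values, and they agree here only because every match is exact. The sketch below rebuilds people from the output shown above and adds a made-up vector, morePeople, where the two approaches differ.

```r
people <- c("Corey", "Corey", "Corey", "Marsha")

grep("Corey", people)      # positions where the pattern "Corey" occurs
which(people == "Corey")   # positions where the element equals "Corey" exactly

# Hypothetical vector to show the difference
morePeople <- c("Corey", "Coreyanne", "Marsha")
grep("Corey", morePeople)      # positions 1 and 2: the pattern also occurs inside "Coreyanne"
which(morePeople == "Corey")   # position 1 only

# Use the positions as an index to get the data itself
people[which(people == "Corey")]
```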

Homework 4 Tips: Missing data (Step 3)

##   score1 score2 score3
## 1      9     NA      1
## 2      6      5      3
## 3     NA      2      5

data[complete.cases(data), ] or na.omit(data)

##   score1 score2 score3
## 2      6      5      3
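
The data frame printed above can be rebuilt as follows to try both approaches. Only the values come from the slide; the construction code is a sketch.

```r
data <- data.frame(score1 = c(9, 6, NA),
                   score2 = c(NA, 5, 2),
                   score3 = c(1, 3, 5))

complete.cases(data)           # one TRUE/FALSE per row: TRUE if the row has no NA
data[complete.cases(data), ]   # keep only the complete rows
na.omit(data)                  # same rows, dropped in one call

colSums(is.na(data))           # how many values are missing in each column
```

Dropping incomplete rows discards the non-missing values in those rows too, so it is worth checking how many rows remain.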

Homework 4 Tips: Missing data (Step 3)

mean(data$score1)

## [1] NA

mean(data$score1, na.rm=TRUE)

## [1] 7.5
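
mean() works on a vector, so na.rm = TRUE is applied to one column at a time. To average every column of a data frame in one call, base R provides colMeans(). A sketch using the same three-row data frame:

```r
data <- data.frame(score1 = c(9, 6, NA),
                   score2 = c(NA, 5, 2),
                   score3 = c(1, 3, 5))

mean(data$score1)                 # NA: one missing value makes the result missing
mean(data$score1, na.rm = TRUE)   # 7.5: the NA is dropped before averaging

colMeans(data, na.rm = TRUE)      # per-column means, each ignoring its own NAs
```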

Homework 4 Tips

e.g.,

replicate(times,process)
replicate(100, mean(sample(studentPop, size=sampleSize)))
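
A runnable version of the pattern above. The slides do not define studentPop or sampleSize, so the population, the sample size of 10, and the seed below are all made-up values for illustration.

```r
set.seed(687)  # arbitrary seed
studentPop <- rnorm(1000, mean = 80, sd = 10)  # hypothetical population
sampleSize <- 10                               # hypothetical sample size

# Draw a sample, take its mean, and repeat the whole process 100 times
sampleMeans <- replicate(100, mean(sample(studentPop, size = sampleSize)))

length(sampleMeans)  # one mean per replication
mean(sampleMeans)    # centered near the population mean
sd(sampleMeans)      # spread of the sample means
sd(studentPop)       # spread of the individual values, for comparison
```

The sample means vary less than the individual values do, which is what makes a single sample mean informative about the population.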

Next Week