Corey Jackson

2019-10-23 16:54:47

- Announcements
- Review of Week 3 (Async; Chapters 4-6)
- Breakout (Complete Lab 4)
- Homework 4 Tips
- Next week’s agenda

- Office Hours: After class and by appointment
- Submitting homework
- Grades/Feedback from HW 2 & 3 on LMS. Please check my comments
- Final project deliverables for project update II
- Questions/concerns?

- Importing data
- Cleaning dataframes
- Exploring distributions
- Functions

- Most common formats: txt, csv, Excel, json (Week 5)

- Several functions for importing data from these formats: `read_delim()`, `read_tsv()`, `read_csv()`, `read_csv2()`, `read.xlsx()`. These require the `readr` or `xlsx` packages.
- Automatic import using R
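
As a minimal sketch of importing delimited text, the example below writes a small CSV to a temporary file and reads it back. It uses base R's `read.csv()` and `read.delim()` so that it runs without extra packages; the `readr` functions are called in much the same way. The file contents and column names are made up for illustration.

```r
# Hypothetical data written to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("state,population", "NY,19450000", "VT,624000"), csv_path)

# Base R reader for comma-separated files
states <- read.csv(csv_path, stringsAsFactors = FALSE)
str(states)

# Tab-separated variant of the same data
tsv_path <- tempfile(fileext = ".tsv")
write.table(states, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
states_tsv <- read.delim(tsv_path, stringsAsFactors = FALSE)
str(states_tsv)
```

Checking the result with `str()` right after import is a quick way to catch columns that were read with an unexpected type.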

- Renaming columns: `names()` or `colnames()`
  - renaming one column: `colnames(dataframe)[1] <- "new_name"`
- Removing columns (scenario: remove columns 1 and 7 of a 7-column data frame): `dataframe <- dataframe[,-c(1,7)]` or `dataframe <- dataframe[,c(2:6)]`
- Creating new columns: `dataframe$new_column <- code`
- Coercing datatypes
  - converting columns in data frames: `state$population <- as.numeric(state$population)`
  - Most of the work is done in the background, but note the difference between `as.numeric(x)` (keeps decimals) and `as.integer(x)` (drops the fractional part)
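
The cleaning steps above can be tried on a toy data frame. This is only a sketch: the `state` data frame, its values, and the `pop_millions` column are invented for illustration.

```r
# Hypothetical data frame; population is stored as text on purpose
state <- data.frame(
  id         = 1:3,
  name       = c("NY", "VT", "CA"),
  population = c("19450000", "624000", "39510000"),
  stringsAsFactors = FALSE
)

colnames(state)[2] <- "abbreviation"               # rename one column
state <- state[, -1]                               # remove column 1
state$population <- as.numeric(state$population)   # coerce text to numeric
state$pop_millions <- state$population / 1e6       # create a new column

as.numeric(3.7)   # 3.7, keeps the decimal part
as.integer(3.7)   # 3, drops the fractional part

str(state)
```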

- Descriptive statistics give us the properties of distributions: (1) central tendency, e.g., `mean()`, and (2) dispersion, e.g., `sd()`
- Distributions: (1) help you understand your data and (2) help determine modeling techniques (e.g., non-parametric modeling)
- Simulating *some* distributions in R using, e.g., `rnorm()`, `rpareto()` (the latter is not in base R and comes from an add-on package)

Note: Simulation is helpful when you have no actual data or only limited data, which is unlikely to be the case for most data science work.
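
A short sketch of simulating a distribution and describing it. The mean of 80 and standard deviation of 10 are arbitrary choices for illustration, and the seed is there only to make the run repeatable.

```r
set.seed(123)                              # make the simulation repeatable
scores <- rnorm(1000, mean = 80, sd = 10)  # 1000 draws from a normal distribution

mean(scores)     # central tendency
median(scores)   # central tendency, less sensitive to outliers
sd(scores)       # dispersion
range(scores)    # smallest and largest simulated values
```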

- Basic components of functions: body and arguments

```
name <- function(arg)
{
BODY
}
```

- Functions can have many arguments (separated by `,`)
- Variables can be defined inside or outside a function (R looks inside the function first, then in the enclosing environment)

`function(arg1,arg2,arg3) `
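
To illustrate the lookup order for variables, here is a small sketch. The function names `use_outer` and `use_inner` are made up for this example.

```r
x <- 10                      # defined outside any function

use_outer <- function() {
  x + 1                      # no local x, so R finds the one outside
}

use_inner <- function() {
  x <- 100                   # local x masks the outer one inside this function
  x + 1
}

use_outer()   # 11
use_inner()   # 101
x             # still 10: the local assignment did not change the outer x
```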

- A tip for writing functions… start with pseudo-code

```
Distribution <- function(vector,number)
{
# only keep the elements within the vector that are less than the number, and store the number of eligible elements into the variable "count"
# calculate the percentage and return the results
}
```

Stepwise coding with functions

`vec <- c(1,2,3,4,5); val <- 2`

Goal: only keep the elements within the vector that are less than the number, and store the number of eligible elements into the variable “count”

Start simple and add complexity.

1. Test which elements in the vector are less than the number

`vec < val`

`## [1] TRUE FALSE FALSE FALSE FALSE`

2. Count the number of elements in the vector that are less than the number

`sum(vec < val)`

`## [1] 1`

- Example using length
`vec[vec < val]`

`## [1] 1`

`length(vec[vec < val])`

`## [1] 1`

```
Distribution <- function(vector,number)
{
# only keep the elements within the vector that are less than the number, and store the number of eligible elements into the variable "count"
count <- length(vector[vector < number])
# calculate the percentage and return the results
}
```
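
One possible completion of the pseudo-code, assuming "percentage" means the share of elements below `number`, expressed out of 100. This is a sketch under that assumption, not necessarily the expected homework answer.

```r
Distribution <- function(vector, number) {
  # number of elements strictly less than the threshold
  count <- length(vector[vector < number])
  # share of all elements, expressed as a percentage
  100 * count / length(vector)
}

Distribution(c(1, 2, 3, 4, 5), 2)   # 1 of 5 elements is below 2, so 20
```

Using the argument names (`vector`, `number`) inside the body, rather than global variables such as `vec` and `val`, keeps the function independent of whatever happens to exist in the workspace.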

```
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
```
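
The structure above looks like the built-in `mtcars` data with `cyl` and `gear` converted to factors. The slides do not show how `myCars` was created, so the following is only one plausible reconstruction.

```r
myCars <- mtcars                       # built-in data set: 32 rows, 11 columns
myCars$cyl  <- as.factor(myCars$cyl)   # 4, 6, 8 become factor levels
myCars$gear <- as.factor(myCars$gear)  # 3, 4, 5 become factor levels
str(myCars)
```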

`dplyr`

- The dplyr package is powerful for munging and summarizing data.

*Select* certain columns of data.

*Filter* your data to select specific rows.

*Arrange* the rows of your data into an order.

*Mutate* your data frame to contain new columns.

*Summarize* chunks of your data in some way.

It also has functions like *sample*, *group by* and *pipe*.

More on dplyr here: Exploratory Data Analysis with R
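
A brief sketch of the verbs chained together on the built-in `mtcars` data, assuming the `dplyr` package is installed. The derived column `hp_per_mpg` is invented for illustration.

```r
library(dplyr)

four_cyl <- mtcars %>%
  select(mpg, cyl, hp) %>%          # keep three columns
  filter(cyl == 4) %>%              # keep only the 4-cylinder cars
  mutate(hp_per_mpg = hp / mpg) %>% # add a derived column
  arrange(desc(mpg))                # sort from most to least efficient

head(four_cyl, 3)
```

Each verb takes a data frame and returns a data frame, which is what lets the pipe pass the result from one step to the next.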


Problem: get the mean hp and mpg by cylinder

```
myCars %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_hp = mean(hp)
  )
```

```
## # A tibble: 3 x 3
## cyl mean_mpg mean_hp
## <fct> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
```


Problem: get the mean hp and mpg by cyl and gear

```
myCars %>%
  group_by(cyl, gear) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_hp = mean(hp)
  )
```


```
## # A tibble: 8 x 4
## # Groups: cyl [3]
## cyl gear mean_mpg mean_hp
## <fct> <fct> <dbl> <dbl>
## 1 4 3 21.5 97
## 2 4 4 26.9 76
## 3 4 5 28.2 102
## 4 6 3 19.8 108.
## 5 6 4 19.8 116.
## 6 6 5 19.7 175
## 7 8 3 15.0 194.
## 8 8 5 15.4 300.
```
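
As a cross-check that needs no extra packages, the same grouped means can be sketched with base R's `aggregate()`. This uses `mtcars` directly instead of `myCars`; the grouping works the same way whether or not `cyl` and `gear` are factors.

```r
# Mean mpg and hp by number of cylinders
by_cyl <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)

# Mean mpg and hp by cylinders and gears
by_cyl_gear <- aggregate(cbind(mpg, hp) ~ cyl + gear, data = mtcars, FUN = mean)

by_cyl
by_cyl_gear
```

Only combinations that occur in the data appear in the result, which is why there are 8 rows rather than 9.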

- R for Data Science (Chapter 7)

- Exploratory Data Analysis

Here are a few links to sites with useful packages/functions for doing data science:

- Helpful when we don’t have access to the full population

- Allows us to make assumptions about the underlying truth (i.e., population).

- in R: `sample(x = , size = , replace = )`

Obtain a sample of size 5, with replacement, from a simulated distribution vector

`Distribution <- rnorm(1000,80,10)`

`PopA <- rnorm(1000,80,10)`

`sample(PopA,5,replace = TRUE)`

`## [1] 77.10141 89.02092 76.44148 89.98562 81.71340`
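
Because `rnorm()` and `sample()` are random, results change from run to run. A sketch of making them repeatable with `set.seed()` follows; the seed value is arbitrary.

```r
set.seed(2019)                          # fixes the random number stream
PopA <- rnorm(1000, mean = 80, sd = 10)
s1 <- sample(PopA, size = 5, replace = TRUE)

set.seed(2019)                          # same seed, same population and sample
PopA_again <- rnorm(1000, mean = 80, sd = 10)
s2 <- sample(PopA_again, size = 5, replace = TRUE)

identical(s1, s2)                       # TRUE
mean(s1)                                # sample mean, an estimate of mean(PopA)
```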

Comparing two distributions

- Helpful for evaluating whether two datasets are the “same”, i.e., come from the same distribution.

- To make this determination we can compare the sample statistics from the “unknown” population to the known population parameters.

A scenario:

You have the parameters of Pop A and you want to know whether Pop B, with a single sample mean of 70.1608428, comes from the same data as Pop A, which has a mean of 80.5812465.

We can compare the sample mean of Pop B (70.1608428) against Pop A to determine whether it falls within the acceptable distribution of Pop A.

Is the mean value for Pop B within an acceptable range?

We can determine whether the mean for Pop B is within our range of truth by (1) setting a threshold and (2) comparing the threshold to the mean of Pop B. If it's outside of the threshold, it's not likely from the same population.

Our acceptable threshold is between the 5th and 95th percentiles of the data in Pop A.

`quantile(PopA,probs = c(0.05,0.95))`

```
## 5% 95%
## 63.47763 96.42094
```
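
The comparison can be sketched in code as follows. `PopA` here is a freshly simulated stand-in, so its quantiles will differ slightly from the output shown above; the variable names `popB_mean`, `bounds` and `within_range` are made up for this example.

```r
set.seed(42)
PopA <- rnorm(1000, mean = 80, sd = 10)   # stand-in for the known population
popB_mean <- 70.1608428                   # sample mean from the scenario

# Lower and upper thresholds: 5th and 95th percentiles of Pop A
bounds <- quantile(PopA, probs = c(0.05, 0.95))

# TRUE when the Pop B mean lies between the two thresholds
within_range <- popB_mean >= bounds[["5%"]] && popB_mean <= bounds[["95%"]]

bounds
within_range
```

A result of `TRUE` means the Pop B mean is not unusual for Pop A under this threshold; it does not prove the two populations are identical.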

**Lab Goals**:

- Investigating new functions
- Create samples from a population
- Making inferences about the population based on the sample mean

New functions for the week (`set.seed()`, `runif()`, `sample()`). Describe what each function does by adding comments in the code. A commented line of code uses `#`:

```
# Computes the mean of a vector
mean(vector)
```

Note: Explore the purpose of each using R's built-in help (e.g., `?sample` or `??sample`) or a search engine

- Printing from functions
- Replicating samples
- Working with missing data

Be sure to install and load the `moments` package.

`cat()`: takes many arguments, but the last argument should be a new line `"\n"`

```
nameptinter <- function(names){
  # cat() joins its arguments with a space; "\n" ends the line
  cat("My name is:", names, "\n")
  cat("There are", nchar(names), "letters in", names, "\n")
}
```

`nameptinter("Corey")`

```
## My name is: Corey
## There are 5 letters in Corey
```

- Repeating a sequence programmatically
- Two functions: `replicate(times,process)` or `rep(process,times)`

`replicate(4,"Corey")`

`## [1] "Corey" "Corey" "Corey" "Corey"`

`rep(mean(c(10,43,10,46,5)),4)`

`## [1] 22.8 22.8 22.8 22.8`
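
The two functions differ when the process is random: `rep()` evaluates its argument once and repeats the value, while `replicate()` re-runs the expression each time. A small sketch, with made-up variable names:

```r
set.seed(1)

# Evaluated once, then the single result is repeated 4 times
repeated_value <- rep(mean(sample(1:100, 10)), 4)

# Evaluated 4 separate times, so each element comes from a new sample
repeated_process <- replicate(4, mean(sample(1:100, 10)))

repeated_value
repeated_process
```

This is why `replicate()` is the one to use for repeated sampling.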

- Counting things in a vector that match some criteria, using `grep()` or `which()`

A vector of names stored in `people`

`## [1] "Corey" "Corey" "Corey" "Marsha"`

`grep("Corey", people)`

`## [1] 1 2 3`

`which(people=="Corey")`

`## [1] 1 2 3`
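
Both calls return positions, not counts, and they do not always agree: `grep()` matches a pattern anywhere in the string, while `==` requires an exact match. A sketch with one extra, made-up name to show the difference:

```r
people <- c("Corey", "Corey", "Corey", "Marsha", "Coreyanne")  # last name is made up

grep("Corey", people)          # pattern match: 1 2 3 5 (also hits "Coreyanne")
which(people == "Corey")       # exact match:   1 2 3

length(grep("Corey", people))  # count of pattern matches: 4
sum(people == "Corey")         # count of exact matches:   3
grep("^Corey$", people)        # anchored pattern behaves like the exact match
```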

- Working with missing values. *A matter of informed choice?*
- Use `summary()` to investigate missing values
- Choices: ignore, replace, delete
- `na.omit()` removes observations with NAs in any column; `complete.cases()` flags the rows without NAs so they can be kept by subsetting

```
## score1 score2 score3
## 1 9 NA 1
## 2 6 5 3
## 3 NA 2 5
```

`data[complete.cases(data), ]`

or `na.omit(data)`

```
## score1 score2 score3
## 2 6 5 3
```

- Computing on columns with missing values: use `na.rm = TRUE`

`mean(data$score1)`

`## [1] NA`

`mean(data$score1, na.rm=TRUE)`

`## [1] 7.5`
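
The missing-value options can be tried end to end on the small score table shown earlier. The mean-fill step at the end is only one possible "replace" strategy and is an assumption for illustration, not a recommendation from the lab.

```r
data <- data.frame(
  score1 = c(9, 6, NA),
  score2 = c(NA, 5, 2),
  score3 = c(1, 3, 5)
)

summary(data)                      # reports the NA count per column
complete.cases(data)               # FALSE TRUE FALSE
complete_rows <- data[complete.cases(data), ]  # delete: keeps row 2 only
mean(data$score1, na.rm = TRUE)    # ignore: 7.5
colMeans(data, na.rm = TRUE)       # column means, ignoring NAs

# Replace: fill each missing score1 with the column mean
data$score1[is.na(data$score1)] <- mean(data$score1, na.rm = TRUE)
data
```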

- Use the results in question 6 as a starting point for step 7. Remember, functions can take other functions as arguments

e.g.,

```
replicate(times,process)
replicate(100, mean(sample(studentPop, size=sampleSize)))
```
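
A runnable sketch of the same pattern. The lab's actual `studentPop` and `sampleSize` are not shown in these slides, so the values below are hypothetical stand-ins.

```r
set.seed(4)
studentPop <- rnorm(20000, mean = 20, sd = 3)  # hypothetical population
sampleSize <- 30                               # hypothetical sample size

# Draw 100 samples and keep the mean of each one
sample_means <- replicate(100, mean(sample(studentPop, size = sampleSize)))

length(sample_means)   # 100, one mean per replication
mean(sample_means)     # should sit close to mean(studentPop)
sd(sample_means)       # spread of the sample means
```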

**Asynchronous**

- Week 5: Connecting with external data sources; Chapter 11
- Submit HW 4 and Lab 4 Monday
- Continue collaborating on your final project

**Live Session**

- Lab 5: Storage Wars