Corey Jackson

2020-02-05 18:15:44

- Announcements
- Breakout I (Complete Lab 5)
- Homework 5 Tips
- Next week’s agenda
- Breakout II (Group Project Meeting)

- Plan for catching up with Week 4
- Office Hours: After class and by appointment
- R Cheetsheets:
- Homework grading (TA)
- Questions/concerns?

- Sampling from a population
- Replicating expressions
- Making inferences about a population statistic (is it extreme?)

- It may be necessary to draw samples from a population (or dataset).
- Samples can be drawn using
`sample(x, n, replace = FALSE)`

(values must be integers)*x*is the population,*size*is the number of samples, and*replace*is whether to put a value back in the X once it’t been drawn from the population.

- We can make inferences about populations using single statistics (e.g., a sample mean) and comparing them to a distribution of values.

*Law of large numbers*: As the size of a sample drawn from a random variable increases, the mean of more samples gets closer and closer to the true population mean.

*Central limit theorem*: Given a dataset with unknown distribution (it could be uniform, binomial or completely random), the sample means will approximate the normal distribution.

*Possible test questions

- Replicating processes using
`rep(expr, n)`

and`replicate(n, expr)`

`rep(c("Corey","Home"),3)`

or `replicate(3,c("Corey","Home"))`

`## [1] "Corey" "Home" "Corey" "Home" "Corey" "Home"`

**Lab Goals**:

- Using Structured Query Language (SQL)
- Subsetting data using SQL and R
- Subsetting data using SQL and R with conditionals

SQL is a standard language for storing, manipulating and retrieving data in databases.

The basics of SQL:

SELECT *column1, column2, …*

FROM *table-name*

- SQL resources/examples: dept.stat.lsa.umich.edu | jasminedaly | sqldf CRAN documentation and sqldf examples

- Return only mpg and cyl fields from
`mtcars`

- SELECT mpg,hp FROM mtcars

```
## mpg hp
## 1 21 110
## 2 21 110
```

- Return all data from
`mtcars`

- SELECT * FROM mtcars

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4
```

- SQL also takes conditionals using the WHERE clause
- Return mpg,disp,cyl from mtcars where the number of cylinders is 6

`"SELECT mpg,disp,cyl FROM mtcars WHERE cyl = 6"`

```
## mpg disp cyl
## 1 21.0 160.0 6
## 2 21.0 160.0 6
## 3 21.4 258.0 6
## 4 18.1 225.0 6
## 5 19.2 167.6 6
## 6 17.8 167.6 6
## 7 19.7 145.0 6
```

- Applying functions over columns
- Get the minimum value in the mpg field in
`mtcars`

- SELECT min(mpg) FROM mtcars

```
## min(mpg)
## 1 10.4
```

Note: You’ll need to find out the appropirate functions for SQL queries for today’s lab. Check w3schools.com .

You can supply the results of one query as the conditional of another.

*Scenario*: Return the mpg, disp, and cyl for cars whose cylinder (cyl) match the minimum cylinder value in the data.

`SELECT min(mpg) FROM mtcar`

```
## min(cyl)
## 1 4
```

`SELECT mpg,disp,cyl FROM mtcars WHERE cyl = (select min(cyl) from mtcars)`

```
## mpg disp cyl
## 1 22.8 108.0 4
## 2 24.4 146.7 4
## 3 22.8 140.8 4
## 4 32.4 78.7 4
## 5 30.4 75.7 4
```

An R package for SQL queries

`install.packages("sqldf")`

and the function`sqldf()`

to write SQL statements- Using SQL statements in R requires packages to translate SQL to R language.
In a relational database:

`SELECT mpg,disp,cyl FROM mtcars WHERE cyl = (select min(cyl) from mtcars)`

In R:

`sqldf(" SELECT mpg,disp,cyl FROM mtcars WHERE cyl = (select min(cyl) from mtcars) ")`

- Working with JSON data
- Aggregating data using
`tapply()`

- Errors in the data

- Many systems rely on non-SQL data stores e.g., MongoDB which outputs JSON document.

- Explore the data before importing, use a JavaScript Object Notation (JSON) converter

An example: Maryland DOT

Importing the data in R requires the use of a package called

`RJSONIO`

.

Data are imported in a lists that must be

**unlisted**I imported the data from HW 5 and stored it in an object called

`mydata`

. Lets take a look…

`summary(mydata)`

```
## Length Class Mode
## meta 1 -none- list
## data 18638 -none- list
```

`unlist()`

– takes a list and returns a simple vector- You need to unlist the second element in
`mydata`

and place it in a`data.frame`

(p. 118)

`tapply()`

Data may need to be aggregated to get quick summaries e.g., mean score per student, time spent on a website per day. The

`tapply()`

function can be used.- Three important arguments for
`tapply(X, INDEX, FUNCTION)`

- Think of X as the variable you want to compute, INDEX as the grouping variable, and function as the summary statistic you want to apply to X.

An example using

`mtcars()`

. Get the mean mpg by cylinder. What would the`tapply`

expression look like for our example?

```
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
```

`tapply()`

`tapply(mtcars$mpg,mtcars$cyl,mean)`

```
## 4 6 8
## 26.66364 19.74286 15.10000
```

Note: `na.rm = TRUE`

can also be used to ignore columns containing `NAs`

e.g., `tapply(mtcars$mpg,mtcars$cyl,mean, na.rm=TRUE)`

- Counting the number of characters in an object. What’s the difference here?

“Corey”

`nchar("Corey")`

`## [1] 5`

“Corey ”

nchar(“Corey ”)

`## [1] 6`

- Hint: Explore the data using
`str()`

. The`TRIM()`

function for SQL and`gsub()`

for R may be useful.

**Asynchronous**- Week 6 Introduction to visualization; Chapter 12
- Submit HW 5 and Lab 5
- Project update 2 Wednesday (in-class)

**Live Session**- Lab 6: Data Viz
- Project Updates: Come prepared to discuss your dataset, decision maker, research questions, and preliminary visualizations