# IST 687

2020-02-05 18:15:44

## Today’s Agenda

• Announcements
• Breakout I (Complete Lab 5)
• Homework 5 Tips
• Next week’s agenda
• Breakout II (Group Project Meeting)

## Announcements

• Plan for catching up with Week 4
• Office Hours: After class and by appointment
• R Cheetsheets:
• Questions/concerns?

# Week 4 Inferential statistics (Review)

## Week 4 Inferential statistics (Review)

• Sampling from a population
• Replicating expressions
• Making inferences about a population statistic (is it extreme?)

## Week 4: Sampling and Inference (Review)

• It may be necessary to draw samples from a population (or dataset).
• Samples can be drawn using sample(x, n, replace = FALSE) (values must be integers)
• x is the population, size is the number of samples, and replace is whether to put a value back in the X once it’t been drawn from the population.

## Week 4: Sampling and Inference (Review)

• We can make inferences about populations using single statistics (e.g., a sample mean) and comparing them to a distribution of values.

## Week 4: Important terminology/code (Review)

Law of large numbers: As the size of a sample drawn from a random variable increases, the mean of more samples gets closer and closer to the true population mean.

Central limit theorem: Given a dataset with unknown distribution (it could be uniform, binomial or completely random), the sample means will approximate the normal distribution.

*Possible test questions

## Week 4: Replication (Review)

• Replicating processes using rep(expr, n) and replicate(n, expr)

rep(c("Corey","Home"),3) or replicate(3,c("Corey","Home"))

## [1] "Corey" "Home"  "Corey" "Home"  "Corey" "Home"

# Lab 5: Storage Wars

## Lab 5: Storage Wars

Lab Goals:

• Using Structured Query Language (SQL)
• Subsetting data using SQL and R
• Subsetting data using SQL and R with conditionals

Groups for Pair Programming

## Lab 5: Working with SQL

• SQL is a standard language for storing, manipulating and retrieving data in databases.

• The basics of SQL:

SELECT column1, column2, …
FROM table-name

## Lab 5: Working with SQL

• Return only mpg and cyl fields from mtcars
• SELECT mpg,hp FROM mtcars
##   mpg  hp
## 1  21 110
## 2  21 110
• Return all data from mtcars
• SELECT * FROM mtcars
##   mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1  21   6  160 110  3.9 2.620 16.46  0  1    4    4
## 2  21   6  160 110  3.9 2.875 17.02  0  1    4    4

## Lab 5: Working with SQL

• SQL also takes conditionals using the WHERE clause
• Return mpg,disp,cyl from mtcars where the number of cylinders is 6

"SELECT mpg,disp,cyl FROM mtcars WHERE cyl = 6"

##    mpg  disp cyl
## 1 21.0 160.0   6
## 2 21.0 160.0   6
## 3 21.4 258.0   6
## 4 18.1 225.0   6
## 5 19.2 167.6   6
## 6 17.8 167.6   6
## 7 19.7 145.0   6

## Lab 5: Working with SQL

• Applying functions over columns
• Get the minimum value in the mpg field in mtcars
• SELECT min(mpg) FROM mtcars
##   min(mpg)
## 1     10.4

Note: You’ll need to find out the appropirate functions for SQL queries for today’s lab. Check w3schools.com .

## Lab 5: Working with SQL (subqueries)

• You can supply the results of one query as the conditional of another.

• Scenario: Return the mpg, disp, and cyl for cars whose cylinder (cyl) match the minimum cylinder value in the data.

SELECT min(mpg) FROM mtcar

##   min(cyl)
## 1        4

SELECT mpg,disp,cyl FROM mtcars WHERE cyl = (select min(cyl) from mtcars)

##    mpg  disp cyl
## 1 22.8 108.0   4
## 2 24.4 146.7   4
## 3 22.8 140.8   4
## 4 32.4  78.7   4
## 5 30.4  75.7   4

## Lab 5: sqldf in R

• An R package for SQL queries install.packages("sqldf") and the function sqldf() to write SQL statements

• Using SQL statements in R requires packages to translate SQL to R language.
• In a relational database: SELECT mpg,disp,cyl FROM mtcars WHERE cyl = (select min(cyl) from mtcars)

• In R: sqldf(" SELECT mpg,disp,cyl FROM mtcars WHERE cyl = (select min(cyl) from mtcars) ")

# Homework 5

## Homework 5 Tips

• Working with JSON data
• Aggregating data using tapply()
• Errors in the data

## Homework 5 Tips: About JSON

• Many systems rely on non-SQL data stores e.g., MongoDB which outputs JSON document.

## Homework 5 Tips: About JSON

• Explore the data before importing, use a JavaScript Object Notation (JSON) converter

## Homework 5 Tips: Working with JSON (Step 1)

• An example: Maryland DOT

• Importing the data in R requires the use of a package called RJSONIO.

## Homework 5 Tips: Working with JSON (Step 1)

• Data are imported in a lists that must be unlisted

• I imported the data from HW 5 and stored it in an object called mydata. Lets take a look…
summary(mydata)

##      Length Class  Mode
## meta     1  -none- list
## data 18638  -none- list
• unlist() – takes a list and returns a simple vector
• You need to unlist the second element in mydata and place it in a data.frame (p. 118)

## Homework 5 Tips: Aggregating using tapply()

• Data may need to be aggregated to get quick summaries e.g., mean score per student, time spent on a website per day. The tapply() function can be used.

• Three important arguments for tapply(X, INDEX, FUNCTION)
• Think of X as the variable you want to compute, INDEX as the grouping variable, and function as the summary statistic you want to apply to X.
• An example using mtcars(). Get the mean mpg by cylinder. What would the tapply expression look like for our example?

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

## Homework 5 Tips: Aggregating using tapply()

tapply(mtcars$mpg,mtcars$cyl,mean)

##        4        6        8
## 26.66364 19.74286 15.10000

Note: na.rm = TRUE can also be used to ignore columns containing NAs

e.g., tapply(mtcars$mpg,mtcars$cyl,mean, na.rm=TRUE)

## Homework 5 Tips: Users errors (Step 3 and 4)

• Counting the number of characters in an object. What’s the difference here?

“Corey”
nchar("Corey")

## [1] 5

“Corey ”
nchar(“Corey ”)

## [1] 6
• Hint: Explore the data using str(). The TRIM() function for SQL and gsub() for R may be useful.

## Next Week

• Asynchronous
• Week 6 Introduction to visualization; Chapter 12
• Submit HW 5 and Lab 5
• Project update 2 Wednesday (in-class)
• Live Session
• Lab 6: Data Viz
• Project Updates: Come prepared to discuss your dataset, decision maker, research questions, and preliminary visualizations