IST 687 Descriptive Statistics & Functions

Corey Jackson

2020-01-22 18:40:49

Today’s Agenda

Announcements

Overview of Week 2: (Using R to manipulate data)

Week 2: Working with Vectors

weather <- c("hot","cold","cold","cold")

which(weather=="cold")

## [1] 2 3 4

Week 2: Working with Vectors

## [1] 3
## [1] 9

Week 2: Working with Accessors

mtcars$mpg (use $ to access a single column; first six values shown)

## [1] 21.0 21.0 22.8 21.4 18.7 18.1

mtcars[1:2,] (use [rows, columns] to subset; here, the first two rows)

##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Week 2: Dataframes

rownames(mtcars)

##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

Week 2: Dataframes

carnames <- rownames(mtcars)
MyCars$cars <- carnames

##    qsec vs am gear carb              cars
## 1 16.46  0  1    4    4         Mazda RX4
## 2 17.02  0  1    4    4     Mazda RX4 Wag
## 3 18.61  1  1    4    1        Datsun 710
## 4 19.44  1  0    3    1    Hornet 4 Drive
## 5 17.02  0  0    3    2 Hornet Sportabout
## 6 20.22  1  0    3    1           Valiant
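The slide does not show how MyCars was created. A minimal sketch, assuming MyCars began as a copy of the built-in mtcars data set and that its row names were reset afterwards (which would explain the 1, 2, 3, ... row labels in the output):

```r
# Assumption: MyCars starts as a copy of the built-in mtcars data set.
MyCars <- mtcars

# Store the car names as a regular column.
carnames <- rownames(mtcars)
MyCars$cars <- carnames

# Assumption: row names were reset so rows are labelled 1, 2, 3, ...
rownames(MyCars) <- NULL

head(MyCars[, c("mpg", "cars")])
```

Keeping the names in a column rather than in the row names lets you filter and sort on them like any other variable.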

Week 2: Operating on Dataframes

MyCars2 <- MyCars[which(MyCars$mpg > 20), ]
MyCars2

##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb           cars
## 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
## 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
## 3 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1     Datsun 710
## 4 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 Hornet 4 Drive
## 8 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      Merc 240D
## 9 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       Merc 230
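For comparison, the same rows can be selected with a logical vector directly, without which(). The sketch below assumes MyCars is a copy of mtcars, as above; the variable names by_position and by_logical are illustrative only.

```r
# Assumption: MyCars is a copy of the built-in mtcars data set.
MyCars <- mtcars

# which() turns the logical test into row positions ...
by_position <- MyCars[which(MyCars$mpg > 20), ]

# ... while a logical vector can index the rows directly.
by_logical <- MyCars[MyCars$mpg > 20, ]

nrow(by_position)
nrow(by_logical)
```

The two forms differ when the column contains NA: which() silently drops those rows, while logical indexing returns a row of NAs for each one.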

Week 3: Descriptive Stats & Functions

The goal for this module is to introduce you to descriptive statistics, which summarize your data, and inferential statistics, which draw conclusions about a population from a sample.

Descriptive Statistics

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of the data collected.

Two primary ways of describing data:
1. Central tendency: a central or typical value for a distribution
2. Spread or variance: the extent to which a distribution is stretched or squeezed

Descriptive Statistics: Central tendency

Central tendency is a central or typical value for a distribution. It is also called the center or location.

The most common measures of central tendency are:
- arithmetic mean: the sum of all values divided by the number of values
- median: the middle value of the sorted data set (the average of the two middle values when the count is even)
- mode: the most frequent value in the data set
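A small sketch of these three measures in R, using a made-up vector x. Note that base R has no built-in function for the statistical mode (mode() reports the storage type instead), so the example counts values with table().

```r
# Made-up example data.
x <- c(2, 4, 4, 5, 7, 9, 4)

mean(x)     # sum is 35, 7 values, so the mean is 5
median(x)   # sorted: 2 4 4 4 5 7 9, middle value is 4

# Statistical mode: tabulate the values and keep the most frequent one(s).
counts <- table(x)
most_frequent <- as.numeric(names(counts)[counts == max(counts)])
most_frequent   # 4 appears three times
```

If several values tie for the highest count, most_frequent holds all of them.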

Descriptive Statistics: Spread or Variance

Spread (also called dispersion or variability) is the extent to which a distribution is stretched or squeezed.

The most common measures of statistical dispersion are:
- variance: the average of the squared differences from the mean
- standard deviation: the square root of the variance
- inter-quartile range (IQR): the distance between the 1st and 3rd quartiles, which gives the range of the middle 50% of the data
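A short sketch of these measures in R on a made-up vector x. One detail worth knowing: var() and sd() compute the sample versions, dividing by n - 1 rather than n, so they differ slightly from the plain "average of squared differences". The name pop_var is illustrative only.

```r
# Made-up example data.
x <- c(2, 4, 4, 5, 7, 9, 4)

var(x)   # sample variance: sum of squared differences (32) divided by n - 1 (6)
sd(x)    # square root of the sample variance
IQR(x)   # 3rd quartile (6) minus 1st quartile (4), using R's default quantile method

# Dividing by n instead gives the literal average of squared differences.
pop_var <- mean((x - mean(x))^2)
pop_var
```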

Data distributions

A distribution describes how likely each value (or range of values) in the data is to occur.

Example: Simulating a normal distribution in R

R can simulate values from many distributions; each simulation function takes the distribution's parameters as arguments.

Task: Generate 1000 values from a normal distribution with a mean of 85 (the standard deviation is left at its default of 1)

testdatasim <- rnorm(n = 1000, mean = 85)

## [1] 83.74365 84.18087 84.86929 84.68837 85.30112 85.88301

mean(testdatasim)

## [1] 84.94085
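Because rnorm() draws random values, the numbers above will differ on every run. A minimal sketch of making the simulation repeatable with set.seed(); the seed value 687 and the name sim are arbitrary choices for illustration, and sd = 1 is written out only to make the default explicit.

```r
# Fix the random number generator so the same values are drawn each time.
set.seed(687)   # arbitrary seed, chosen for illustration
sim <- rnorm(n = 1000, mean = 85, sd = 1)

length(sim)   # 1000 values
mean(sim)     # close to 85, but not exactly 85
sd(sim)       # close to 1

# Re-seeding and re-drawing reproduces the identical vector.
set.seed(687)
sim_again <- rnorm(n = 1000, mean = 85, sd = 1)
identical(sim, sim_again)
```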

Example: Visualizing a normal distribution in R

hist(testdatasim)

Lab 3: Simulating and visualizing a Pareto distribution

In lab you need to simulate a Pareto distribution: rpareto(n, m, s)

  1. Install VGAM: install.packages("VGAM")
  2. Read about rpareto using help: ??rpareto
  3. Set m to 560000 (about the population size of Wyoming) and experiment with the s parameter
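To build intuition for what the two parameters do, here is a base-R sketch that draws Pareto values by inverse transform sampling instead of calling VGAM. It assumes m is the scale (the smallest possible value) and s is the shape; check the rpareto help page for how your installed VGAM version names these arguments. The function rpareto_sketch is a hypothetical helper, not part of any package.

```r
# Hypothetical helper, not part of VGAM.
# Assumption: m is the scale (minimum possible value) and s is the shape.
rpareto_sketch <- function(n, m, s) {
  u <- runif(n)    # uniform draws on (0, 1)
  m / u^(1 / s)    # inverts the Pareto CDF 1 - (m / x)^s, using that 1 - u is also uniform
}

set.seed(687)      # arbitrary seed, chosen for illustration
sim_pareto <- rpareto_sketch(1000, m = 560000, s = 1)

min(sim_pareto) >= 560000   # no draw falls below the scale m
summary(sim_pareto)         # the mean sits well above the median: a long right tail
```

Smaller values of s produce a heavier tail, so a histogram of the raw values is usually dominated by a few extreme draws.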

Functions

function_name <- function(arg_1, arg_2, ...) {
   # function body
}
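As a concrete instance of this template, the sketch below defines a small function with two required arguments and one optional argument. The name add_and_scale and its arguments are made up for illustration.

```r
# Hypothetical example: the name and arguments are illustrative only.
add_and_scale <- function(x, y, scale = 1) {
  total <- x + y
  total * scale   # the last evaluated expression is the return value
}

add_and_scale(2, 3)               # 5, using the default scale of 1
add_and_scale(2, 3, scale = 10)   # 50
```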

Function Example

Write a function that takes two arguments, a vector of numbers (v) and a random number (w), and returns the count of numbers in v greater than w.

function_name <- function(arg_1, arg_2, ...) {
   # function body
}

Function Example: Step-wise function writing

## [1] 112  54  10   3 152  55
## [1] 25

Function Example: Step-wise function writing

Which elements in v are greater than w? v > w

## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

Return the positions of the elements in v that are greater than w: which(v > w)

## [1] 1 2 5 6

Return the count of the elements in v that are greater than w: length(which(v > w))

## [1] 4
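The three steps can be run together as below. The slides do not show how v and w were generated, so this sketch simply types in the six values of v and the value of w printed earlier.

```r
# Values copied from the printed output; how they were generated is not shown.
v <- c(112, 54, 10, 3, 152, 55)
w <- 25

v > w                  # logical vector: TRUE where the element exceeds w
which(v > w)           # positions of the TRUE values: 1 2 5 6
length(which(v > w))   # how many positions there are: 4

# Equivalent shortcut: TRUE counts as 1 and FALSE as 0 when summed.
sum(v > w)
```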

Function Example: Step-wise function writing

myfirstfunction <- function(arg_1, arg_2, ...)
{
   # BODY
}

Write a function that takes two arguments, a vector of numbers (v) and a random number (w), and returns the count of numbers in v greater than w.

myfirstfunction <- function(v, w)
{
   # positions of the elements in v that are greater than w
   greater_numbers <- which(v > w)
   # how many such elements there are
   count_numbers <- length(greater_numbers)
   return(count_numbers)
}

Function Example: myfirstfunction()

## [1] 2
## [1] 4
## [1] 5
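The calls that produced these three results are not shown on the slide. As an illustration only, assuming v is the six-element vector from the step-wise example and using thresholds of 100, 25, and 5 (hypothetical choices), the function returns the same counts:

```r
# Function as defined on the previous slide.
myfirstfunction <- function(v, w) {
  greater_numbers <- which(v > w)
  count_numbers <- length(greater_numbers)
  return(count_numbers)
}

# Assumption: v is the vector from the step-wise example.
v <- c(112, 54, 10, 3, 152, 55)

# Hypothetical thresholds; the original calls are not shown.
myfirstfunction(v, 100)   # 112 and 152 exceed 100, so 2
myfirstfunction(v, 25)    # 112, 54, 152 and 55 exceed 25, so 4
myfirstfunction(v, 5)     # every value except 3 exceeds 5, so 5
```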

Lab 3: Writing a function

Write a function that takes three arguments (a vector, a min, and a max) and returns the percentage of elements in the vector that are between the min and max (including the min and max).

Build in a stepwise manner

  1. Compute the number of elements in the vector that are greater than or equal to min and less than or equal to max.
  2. Divide that count by the total number of elements in the vector, then multiply by 100 to express it as a percentage.

Code hints: which() and length() or sum() and logical operators from Week 1
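One possible shape for such a function, offered as a sketch rather than the expected lab solution. It takes the sum() route from the hints; the name percent_between and the argument names lower and upper are illustrative (chosen to avoid shadowing R's min() and max()).

```r
# Hypothetical sketch: names are illustrative, not prescribed by the lab.
percent_between <- function(x, lower, upper) {
  # TRUE where an element lies in [lower, upper], endpoints included
  in_range <- x >= lower & x <= upper
  # share of TRUE values, expressed as a percentage
  100 * sum(in_range) / length(x)
}

percent_between(c(1, 5, 10, 15, 20), 5, 15)   # 5, 10 and 15 qualify: 3 of 5 is 60
```

Using & (not &&) matters here, because the comparison has to be applied to every element of the vector.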

Project Update I

Homework 3 Tips

Next Week