Chapter 11, “Storage Wars,” describes a variety of ways in which R can connect to data sources. Given that SQL – structured query language – is one of the most fundamental and widely used tools for manipulating data, understand how to use SQL in the context of R is very important. One of the basic data building blocks in R is the data frame and this object bears a very strong resemblance to the concept of a table in SQL. In fact, there is a package in R called sqldf that allows for the manipulation of a data frame as an SQL table. This feat is accomplished thanks to SQLite, a fantastic, lightweight, open source implementation of SQL. Working with sqldf and SQLite is so convenient that under normal circumstances you can do your work completely within R, with no software installations needed on your computer (other than running the install.packages() command in R).

As we are about halfway through the course, this activity description does not provide the same level of code prompts as previous labs – it is assumed that you remember or can lookup the necessary code. The overall goal of this activity is to use SQL to produce a subset of the built-in airquality R dataset that contains only those records where the concentration of ozone is higher than the mean level of ozone. These are the conceptual steps you will need to follow:

1. Review sqldf

You’ll need to learn how to write subqueries.

2. Install and activate the sqldf package in R-Studio

 # install.packages("sqldf")
 library(sqldf)

3. Make sure the built-in airquality dataset is available for use in subsequent commands.

 # Assign the built-in "airquality" dataset into a new variable called "air"
 air <- airquality
 # reveal the first few records of "air" dataset
 head(air)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

4. run an SQL to calculate the average level of ozone and assign the result to a variable.

 # select the average value of "Ozone" from "air" dataset
 average_ozone <- sqldf("SELECT avg(Ozone) from air")
 # print the results
 print(average_ozone)
##   avg(Ozone)
## 1   42.12931

5. select all records where the value of Ozone is higher than the average.

 # select the average value of Ozone first, and then select the Ozone values that are higher than the average
 sqldf("SELECT * from air where Ozone > (SELECT avg(Ozone) from air)")
##    Ozone Solar.R Wind Temp Month Day
## 1     45     252 14.9   81     5  29
## 2    115     223  5.7   79     5  30
## 3     71     291 13.8   90     6   9
## 4    135     269  4.1   84     7   1
## 5     49     248  9.2   85     7   2
## 6     64     175  4.6   83     7   5
## 7     77     276  5.1   88     7   7
## 8     97     267  6.3   92     7   8
## 9     97     272  5.7   92     7   9
## 10    85     175  7.4   89     7  10
## 11    48     260  6.9   81     7  16
## 12    61     285  6.3   84     7  18
## 13    79     187  5.1   87     7  19
## 14    63     220 11.5   85     7  20
## 15    80     294  8.6   86     7  24
## 16   108     223  8.0   85     7  25
## 17    52      82 12.0   86     7  27
## 18    82     213  7.4   88     7  28
## 19    50     275  7.4   86     7  29
## 20    64     253  7.4   83     7  30
## 21    59     254  9.2   81     7  31
## 22    78      NA  6.9   86     8   4
## 23    66      NA  4.6   87     8   6
## 24   122     255  4.0   89     8   7
## 25    89     229 10.3   90     8   8
## 26   110     207  8.0   90     8   9
## 27    44     192 11.5   86     8  12
## 28    65     157  9.7   80     8  14
## 29    59      51  6.3   79     8  17
## 30    44     190 10.3   78     8  20
## 31    45     212  9.7   79     8  24
## 32   168     238  3.4   81     8  25
## 33    73     215  8.0   86     8  26
## 34    76     203  9.7   97     8  28
## 35   118     225  2.3   94     8  29
## 36    84     237  6.3   96     8  30
## 37    85     188  6.3   94     8  31
## 38    96     167  6.9   91     9   1
## 39    78     197  5.1   92     9   2
## 40    73     183  2.8   93     9   3
## 41    91     189  4.6   93     9   4
## 42    47      95  7.4   87     9   5
## 43    44     236 14.9   81     9  11
## 44    46     237  6.9   78     9  16

6. refine step 5 to write the result table into a new R data object called “newAQ”.

 newAQ <- sqldf("SELECT * from air where Ozone>(SELECT avg(Ozone) from air)")
 # reveal what type of object newAQ is
 str(newAQ)
## 'data.frame':    44 obs. of  6 variables:
##  $ Ozone  : int  45 115 71 135 49 64 77 97 97 85 ...
##  $ Solar.R: int  252 223 291 269 248 175 276 267 272 175 ...
##  $ Wind   : num  14.9 5.7 13.8 4.1 9.2 4.6 5.1 6.3 5.7 7.4 ...
##  $ Temp   : int  81 79 90 84 85 83 88 92 92 89 ...
##  $ Month  : int  5 5 6 7 7 7 7 7 7 7 ...
##  $ Day    : int  29 30 9 1 2 5 7 8 9 10 ...
 # reveal what its dimensions are (how many rows and columns)
 dim(newAQ)
## [1] 44  6
 # show the first few rows of "newAQ"
 head(newAQ)
##   Ozone Solar.R Wind Temp Month Day
## 1    45     252 14.9   81     5  29
## 2   115     223  5.7   79     5  30
## 3    71     291 13.8   90     6   9
## 4   135     269  4.1   84     7   1
## 5    49     248  9.2   85     7   2
## 6    64     175  4.6   83     7   5

7. Repeat step 4

 # Exclude Missing Values from calculating "Ozone" mean and assign the result to "average_ozone"
 average_ozone <- mean(air$Ozone, na.rm=TRUE)
 # print the result
 print(average_ozone)
## [1] 42.12931

Repeat step 5

 air[which(air$Ozone > average_ozone), ]
##     Ozone Solar.R Wind Temp Month Day
## 29     45     252 14.9   81     5  29
## 30    115     223  5.7   79     5  30
## 40     71     291 13.8   90     6   9
## 62    135     269  4.1   84     7   1
## 63     49     248  9.2   85     7   2
## 66     64     175  4.6   83     7   5
## 68     77     276  5.1   88     7   7
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 71     85     175  7.4   89     7  10
## 77     48     260  6.9   81     7  16
## 79     61     285  6.3   84     7  18
## 80     79     187  5.1   87     7  19
## 81     63     220 11.5   85     7  20
## 85     80     294  8.6   86     7  24
## 86    108     223  8.0   85     7  25
## 88     52      82 12.0   86     7  27
## 89     82     213  7.4   88     7  28
## 90     50     275  7.4   86     7  29
## 91     64     253  7.4   83     7  30
## 92     59     254  9.2   81     7  31
## 96     78      NA  6.9   86     8   4
## 98     66      NA  4.6   87     8   6
## 99    122     255  4.0   89     8   7
## 100    89     229 10.3   90     8   8
## 101   110     207  8.0   90     8   9
## 104    44     192 11.5   86     8  12
## 106    65     157  9.7   80     8  14
## 109    59      51  6.3   79     8  17
## 112    44     190 10.3   78     8  20
## 116    45     212  9.7   79     8  24
## 117   168     238  3.4   81     8  25
## 118    73     215  8.0   86     8  26
## 120    76     203  9.7   97     8  28
## 121   118     225  2.3   94     8  29
## 122    84     237  6.3   96     8  30
## 123    85     188  6.3   94     8  31
## 124    96     167  6.9   91     9   1
## 125    78     197  5.1   92     9   2
## 126    73     183  2.8   93     9   3
## 127    91     189  4.6   93     9   4
## 128    47      95  7.4   87     9   5
## 134    44     236 14.9   81     9  11
## 139    46     237  6.9   78     9  16

Repeat step 6

 # only keep the rows in which the Ozone values are higher than the average, and write the result table into a new R data object called “newAQ2”
 newAQ2 <- air[which(air$Ozone > average_ozone), ]
 # reveal what type of object newAQ2 is
 str(newAQ2)
## 'data.frame':    44 obs. of  6 variables:
##  $ Ozone  : int  45 115 71 135 49 64 77 97 97 85 ...
##  $ Solar.R: int  252 223 291 269 248 175 276 267 272 175 ...
##  $ Wind   : num  14.9 5.7 13.8 4.1 9.2 4.6 5.1 6.3 5.7 7.4 ...
##  $ Temp   : int  81 79 90 84 85 83 88 92 92 89 ...
##  $ Month  : int  5 5 6 7 7 7 7 7 7 7 ...
##  $ Day    : int  29 30 9 1 2 5 7 8 9 10 ...
 # reveal the number of rows
 nrow(newAQ2)
## [1] 44
 # reveal the number of columns
 ncol(newAQ2)
## [1] 6
 # show the first few rows of "newAQ2"
 head(newAQ2)
##    Ozone Solar.R Wind Temp Month Day
## 29    45     252 14.9   81     5  29
## 30   115     223  5.7   79     5  30
## 40    71     291 13.8   90     6   9
## 62   135     269  4.1   84     7   1
## 63    49     248  9.2   85     7   2
## 66    64     175  4.6   83     7   5