Step 1: Load the data

# Let's go back and analyze the air quality dataset (if you remember, we used it previously, in the visualization lab). Remember to think about how to deal with the NAs in the data.
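One possible way to start is sketched below. It assumes the dataset meant here is R's built-in `airquality` data frame, and it shows two common NA strategies (dropping incomplete rows, or filling with the column mean). The names `aq`, `aq_complete`, and `aq_imputed` are illustrative only.

```r
# Load the built-in airquality dataset (assumption: this is the dataset the lab used).
data(airquality)
aq <- airquality

# Count missing values per column to see where the NAs are.
na_counts <- colSums(is.na(aq))
print(na_counts)

# Option 1: drop every row that contains at least one NA.
aq_complete <- na.omit(aq)

# Option 2: replace each NA with the mean of its column.
aq_imputed <- aq
for (col in names(aq_imputed)) {
  missing <- is.na(aq_imputed[[col]])
  aq_imputed[[col]][missing] <- mean(aq_imputed[[col]], na.rm = TRUE)
}
```

Dropping rows is simpler but discards data; mean imputation keeps every row but flattens the variance of the imputed columns. Which one is appropriate is a judgment call for the analysis.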

Step 2: Create train and test data sets

# Using techniques discussed in class, create two datasets – one for training and one for testing.
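A minimal sketch of a random split follows. The two-thirds/one-third ratio and the fixed seed are assumptions, not requirements stated in the assignment, and `train_data` / `test_data` are illustrative names.

```r
# Start from the NA-free rows of the built-in airquality data (assumed dataset).
data(airquality)
aq <- na.omit(airquality)

# Fix the seed so the split is reproducible (the value 42 is arbitrary).
set.seed(42)

# Randomly pick two thirds of the row positions for training.
n <- nrow(aq)
train_idx <- sample(n, size = floor(2 / 3 * n))

train_data <- aq[train_idx, ]   # rows used to fit models
test_data  <- aq[-train_idx, ]  # held-out rows used to evaluate models
```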

Step 3: Build a Model using KSVM & visualize the results

# Build a model (using the ‘ksvm’ function, trying to predict ozone). You can use all the possible attributes, or select the attributes that you think would be the most helpful.
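As a sketch, a regression model could be fit with `ksvm` from the kernlab package as shown here. The RBF kernel, `C = 10`, and `cross = 3` are illustrative choices rather than values given in the assignment, and the data preparation assumes the built-in `airquality` data with incomplete rows dropped.

```r
library(kernlab)

# Prepare training and test sets (assumed: NA rows dropped, 2/3 training split).
data(airquality)
aq <- na.omit(airquality)
set.seed(42)
train_idx  <- sample(nrow(aq), size = floor(2 / 3 * nrow(aq)))
train_data <- aq[train_idx, ]
test_data  <- aq[-train_idx, ]

# Ozone is numeric, so ksvm fits a support vector regression.
# "Ozone ~ ." uses every other column as a predictor.
ksvm_model <- ksvm(Ozone ~ ., data = train_data,
                   kernel = "rbfdot", kpar = "automatic",
                   C = 10, cross = 3)
print(ksvm_model)

# Predictions on the held-out rows (returned as a one-column matrix).
ksvm_pred <- as.numeric(predict(ksvm_model, test_data))
```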


# Test the model on the testing dataset, and compute the Root Mean Squared Error (RMSE).

# Plot the results. Use a scatter plot. Have the x-axis represent temperature, the y-axis represent wind, and the point size and color represent the error, defined as the actual ozone level minus the predicted ozone level.
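The RMSE is the square root of the mean squared difference between actual and predicted values. A hedged sketch of the computation and the requested scatter plot, using ggplot2 (an assumed plotting package) and the same illustrative data preparation and model settings as a typical `ksvm` fit:

```r
library(kernlab)
library(ggplot2)

# Assumed preparation: NA rows dropped, 2/3 training split.
data(airquality)
aq <- na.omit(airquality)
set.seed(42)
train_idx  <- sample(nrow(aq), size = floor(2 / 3 * nrow(aq)))
train_data <- aq[train_idx, ]
test_data  <- aq[-train_idx, ]

ksvm_model <- ksvm(Ozone ~ ., data = train_data,
                   kernel = "rbfdot", kpar = "automatic", C = 10)

# Error as defined in the prompt: actual ozone minus predicted ozone.
test_data$error <- test_data$Ozone - as.numeric(predict(ksvm_model, test_data))

# Root Mean Squared Error over the test rows.
ksvm_rmse <- sqrt(mean(test_data$error^2))
print(ksvm_rmse)

# Temperature on x, wind on y, error shown by both point size and color.
ksvm_plot <- ggplot(test_data, aes(x = Temp, y = Wind)) +
  geom_point(aes(size = error, color = error)) +
  ggtitle(sprintf("ksvm (RMSE = %.2f)", ksvm_rmse))
print(ksvm_plot)
```

Because the error is signed, small points are over-predictions and large points are under-predictions; mapping `abs(error)` to size instead is a reasonable alternative if only the magnitude matters.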

# Compute models and plot the results for ‘svm’ (in the e1071 package) and ‘lm’. Generate similar charts for each model.
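The same train/test procedure can be repeated with `svm` and `lm`. The sketch below only fits the two models and compares their RMSE; default `svm` settings are used as an assumption, and `rmse` is a hypothetical helper defined here.

```r
library(e1071)

# Assumed preparation: NA rows dropped, 2/3 training split.
data(airquality)
aq <- na.omit(airquality)
set.seed(42)
train_idx  <- sample(nrow(aq), size = floor(2 / 3 * nrow(aq)))
train_data <- aq[train_idx, ]
test_data  <- aq[-train_idx, ]

# Hypothetical helper: root mean squared error of two numeric vectors.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# e1071::svm performs regression when the response is numeric.
svm_model <- svm(Ozone ~ ., data = train_data)
# Ordinary least squares linear regression from base R.
lm_model  <- lm(Ozone ~ ., data = train_data)

svm_rmse <- rmse(test_data$Ozone, as.numeric(predict(svm_model, test_data)))
lm_rmse  <- rmse(test_data$Ozone, as.numeric(predict(lm_model, test_data)))
print(c(svm = svm_rmse, lm = lm_rmse))
```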

# Show all three results (charts) in one window, using the grid.arrange function 
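One way to lay out the three charts side by side is sketched here, with `grid.arrange` from the gridExtra package. `error_plot` is a hypothetical helper introduced only for this example, and the model settings are illustrative defaults.

```r
library(kernlab)
library(e1071)
library(ggplot2)
library(gridExtra)

# Assumed preparation: NA rows dropped, 2/3 training split.
data(airquality)
aq <- na.omit(airquality)
set.seed(42)
train_idx  <- sample(nrow(aq), size = floor(2 / 3 * nrow(aq)))
train_data <- aq[train_idx, ]
test_data  <- aq[-train_idx, ]

# Hypothetical helper: scatter plot of Temp vs Wind, with size and color
# showing the error (actual minus predicted ozone).
error_plot <- function(df, predicted, title) {
  df$error <- df$Ozone - as.numeric(predicted)
  ggplot(df, aes(x = Temp, y = Wind)) +
    geom_point(aes(size = error, color = error)) +
    ggtitle(title)
}

ksvm_model <- ksvm(Ozone ~ ., data = train_data, kernel = "rbfdot", C = 10)
svm_model  <- svm(Ozone ~ ., data = train_data)
lm_model   <- lm(Ozone ~ ., data = train_data)

p_ksvm <- error_plot(test_data, predict(ksvm_model, test_data), "ksvm")
p_svm  <- error_plot(test_data, predict(svm_model, test_data), "svm")
p_lm   <- error_plot(test_data, predict(lm_model, test_data), "lm")

# Draw all three charts in a single window, one row of three.
grid.arrange(p_ksvm, p_svm, p_lm, ncol = 3)
```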

Step 4: Create a ‘goodOzone’ variable

# This variable should be either 0 or 1. It should be 0 if the ozone is below the average for all the data observations, and 1 if it is equal to or above the average ozone observed. 
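A minimal sketch of the rule above. It assumes the average is taken over the rows that remain after dropping NAs; computing the mean with `na.rm = TRUE` on the full data would be an equally defensible reading.

```r
# Assumed dataset: built-in airquality with incomplete rows dropped.
data(airquality)
aq <- na.omit(airquality)

# Average ozone across all remaining observations.
mean_ozone <- mean(aq$Ozone)

# 0 when ozone is below the average, 1 when it is equal to or above it.
aq$goodOzone <- ifelse(aq$Ozone < mean_ozone, 0, 1)

# Quick check of how many observations fall in each class.
print(table(aq$goodOzone))
```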

Step 5: See if we can do a better job predicting ‘good’ and ‘bad’ days

# Build a model (using the ‘ksvm’ function, trying to predict ‘goodOzone’). You can use all the possible attributes, or select the attributes that you think would be the most helpful. 

# Test the model on the testing dataset, and compute the percent of ‘goodOzone’ that was correctly predicted.
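The two prompts above can be sketched together as follows. Converting `goodOzone` to a factor makes `ksvm` fit a classifier rather than a regression. Dropping the raw `Ozone` column is an assumption made here so the label cannot be read directly from a predictor; the kernel and `C` value are illustrative.

```r
library(kernlab)

# Assumed preparation: NA rows dropped, goodOzone derived from the mean.
data(airquality)
aq <- na.omit(airquality)
aq$goodOzone <- factor(ifelse(aq$Ozone < mean(aq$Ozone), 0, 1))
aq$Ozone <- NULL  # assumption: remove the column the label was derived from

set.seed(42)
train_idx  <- sample(nrow(aq), size = floor(2 / 3 * nrow(aq)))
train_data <- aq[train_idx, ]
test_data  <- aq[-train_idx, ]

# A factor response makes ksvm perform classification.
ksvm_class <- ksvm(goodOzone ~ ., data = train_data,
                   kernel = "rbfdot", kpar = "automatic", C = 10)

# Predicted class labels for the held-out rows.
class_pred <- predict(ksvm_class, test_data)

# Percent of test rows whose predicted label matches the actual label.
pct_correct <- 100 * mean(as.character(class_pred) ==
                          as.character(test_data$goodOzone))
print(pct_correct)
```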

# Plot the results. Use a scatter plot. Have the x-axis represent temperature, the y-axis represent wind, the shape represent what was predicted (good or bad day), the color represent the actual value of ‘goodOzone’ (i.e., whether the actual ozone level was good), and the size represent whether the prediction was correct (larger symbols should be the observations the model got wrong).
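A possible ggplot2 rendering of this chart is sketched below. The point sizes (2 for correct, 5 for wrong) are arbitrary, and the data preparation repeats the assumptions of a typical `ksvm` classifier fit (NA rows dropped, raw `Ozone` removed from the predictors).

```r
library(kernlab)
library(ggplot2)

data(airquality)
aq <- na.omit(airquality)
aq$goodOzone <- factor(ifelse(aq$Ozone < mean(aq$Ozone), 0, 1))
aq$Ozone <- NULL  # assumption: avoid leaking the label through a predictor

set.seed(42)
train_idx  <- sample(nrow(aq), size = floor(2 / 3 * nrow(aq)))
train_data <- aq[train_idx, ]
test_data  <- aq[-train_idx, ]

ksvm_class <- ksvm(goodOzone ~ ., data = train_data, kernel = "rbfdot", C = 10)

# Store the prediction and whether it matched the actual label.
test_data$predicted <- factor(as.character(predict(ksvm_class, test_data)),
                              levels = c("0", "1"))
test_data$correct <- ifelse(as.character(test_data$predicted) ==
                              as.character(test_data$goodOzone),
                            "correct", "wrong")

# Shape = predicted class, color = actual class, size = correctness
# (wrong predictions are drawn larger).
class_plot <- ggplot(test_data, aes(x = Temp, y = Wind)) +
  geom_point(aes(shape = predicted, color = goodOzone, size = correct)) +
  scale_size_manual(values = c(correct = 2, wrong = 5)) +
  ggtitle("ksvm: predicted vs. actual goodOzone")
print(class_plot)
```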

# Compute models and plot the results for ‘svm’ (in the e1071 package) and ‘nb’ (Naive Bayes, also in the e1071 package). 
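For comparison, the same classification task can be run through `svm` and `naiveBayes` from e1071. This sketch covers only fitting and accuracy, with default settings assumed; `pct_correct` is a hypothetical helper defined here.

```r
library(e1071)

data(airquality)
aq <- na.omit(airquality)
aq$goodOzone <- factor(ifelse(aq$Ozone < mean(aq$Ozone), 0, 1))
aq$Ozone <- NULL  # assumption: avoid leaking the label through a predictor

set.seed(42)
train_idx  <- sample(nrow(aq), size = floor(2 / 3 * nrow(aq)))
train_data <- aq[train_idx, ]
test_data  <- aq[-train_idx, ]

# Hypothetical helper: percent of predictions that match the actual labels.
pct_correct <- function(actual, predicted) {
  100 * mean(as.character(actual) == as.character(predicted))
}

# A factor response makes svm perform classification.
svm_class <- svm(goodOzone ~ ., data = train_data)
# Naive Bayes treats numeric predictors as Gaussian within each class.
nb_class  <- naiveBayes(goodOzone ~ ., data = train_data)

svm_pct <- pct_correct(test_data$goodOzone, predict(svm_class, test_data))
nb_pct  <- pct_correct(test_data$goodOzone, predict(nb_class, test_data))
print(c(svm = svm_pct, nb = nb_pct))
```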

# Show all three results (charts) in one window, using the grid.arrange function (have two charts in one row).
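The three classification charts could be combined as sketched below, using `ncol = 2` so that two charts share the first row and the third sits on the second row. `result_plot` is a hypothetical helper, and all model settings are illustrative.

```r
library(kernlab)
library(e1071)
library(ggplot2)
library(gridExtra)

data(airquality)
aq <- na.omit(airquality)
aq$goodOzone <- factor(ifelse(aq$Ozone < mean(aq$Ozone), 0, 1))
aq$Ozone <- NULL  # assumption: avoid leaking the label through a predictor

set.seed(42)
train_idx  <- sample(nrow(aq), size = floor(2 / 3 * nrow(aq)))
train_data <- aq[train_idx, ]
test_data  <- aq[-train_idx, ]

# Hypothetical helper: shape = predicted, color = actual, size = correctness.
result_plot <- function(df, predicted, title) {
  df$predicted <- factor(as.character(predicted), levels = c("0", "1"))
  df$correct <- ifelse(as.character(df$predicted) == as.character(df$goodOzone),
                       "correct", "wrong")
  ggplot(df, aes(x = Temp, y = Wind)) +
    geom_point(aes(shape = predicted, color = goodOzone, size = correct)) +
    scale_size_manual(values = c(correct = 2, wrong = 5)) +
    ggtitle(title)
}

ksvm_class <- ksvm(goodOzone ~ ., data = train_data, kernel = "rbfdot", C = 10)
svm_class  <- svm(goodOzone ~ ., data = train_data)
nb_class   <- naiveBayes(goodOzone ~ ., data = train_data)

plots <- list(
  result_plot(test_data, predict(ksvm_class, test_data), "ksvm"),
  result_plot(test_data, predict(svm_class, test_data), "svm"),
  result_plot(test_data, predict(nb_class, test_data), "nb")
)

# Two columns: two charts in the first row, the third chart below them.
grid.arrange(grobs = plots, ncol = 2)
```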

Step 6: Which are the best models for this data?

Review what you have done and state which model is the best and why.
