IST 687: Text Mining

Corey Jackson

2020-03-11 16:58:19

Agenda

Announcements

Mid-term scores

Final Project

Final Project: In-class presentation

Presentation tips: Define the audience, provide context and motivation for the problem and research questions, describe the dataset, describe the methods, report the results, and close with the major takeaways for the audience.

Final Project: Project Report

Summary document tips: Write concisely, use a text editor with spell check, make research questions explicit, label figures with captions and reference figures in text.

Week 9

Homework 9 Overview: Support Vector Machines (SVM)

Week 9 Homework: Computing error

lm <- lm(formula = Ozone~.,data=trainData_Corey)
predLm <- predict(lm, testData_Corey)
compTable2 <- data.frame(testData_Corey[,1], predLm)
colnames(compTable2) <- c("test","Pred")

##          test     Pred
## 105  28.00000 45.59680
## 83   42.12931 53.89082
## 117 168.00000 69.99365

Computing the Root Mean Squared Error (RMSE)

##          test     Pred      diff
## 105  28.00000 45.59680  309.6472
## 83   42.12931 53.89082  138.3332
## 117 168.00000 69.99365 9605.2439

sqrt(mean((compTable2$test-compTable2$Pred)^2))

## [1] 24.33016
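As a small worked sketch of the RMSE calculation, the block below rebuilds only the three rows printed above and squares the differences, which is what the diff column appears to hold. The data frame name compTable is an assumption made for illustration. Because it uses three rows rather than the whole test set, its result will not match the 24.33016 reported above.

```r
# Rebuild the three rows shown in the slide output (illustration only).
compTable <- data.frame(
  test = c(28.00000, 42.12931, 168.00000),
  Pred = c(45.59680, 53.89082, 69.99365),
  row.names = c("105", "83", "117")
)

# Squared difference between the observed and predicted Ozone values.
compTable$diff <- (compTable$test - compTable$Pred)^2

# RMSE: square root of the mean squared difference.
rmse <- sqrt(mean(compTable$diff))

print(compTable)
print(rmse)
```

A single large miss, such as row 117, dominates the RMSE because the errors are squared before averaging.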

Modeling a discrete outcome

Step 1. Convert the continuous Ozone variable to a discrete outcome variable: 0 if Ozone is below the mean, 1 otherwise

air$goodOzone <- ifelse(air$Ozone < mean(air$Ozone), 0, 1)
air$goodOzone <- as.factor(air$goodOzone) # convert from numeric to factor
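To see what the recoding does, here is a minimal sketch on a toy data frame. The five Ozone values are hypothetical and chosen only so the mean is easy to check by hand; they are not taken from the course dataset.

```r
# Hypothetical Ozone readings (assumption: illustration only, mean is 62).
air <- data.frame(Ozone = c(28, 42, 168, 12, 60))

# 0 when the reading is below the mean, 1 when it is at or above the mean.
air$goodOzone <- ifelse(air$Ozone < mean(air$Ozone), 0, 1)

# Store the outcome as a factor so that models treat it as a class label.
air$goodOzone <- as.factor(air$goodOzone)

print(air)
print(table(air$goodOzone))
```

Note that with this coding a value of 1 marks the readings at or above the mean, so it is worth checking which level counts as "good" before interpreting a model.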

Step 2. Again, create test and training datasets
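One possible way to build the two datasets is a random split by row index. The sketch below uses R's built-in airquality data as a stand-in for air, and both the two-thirds training share and the seed are assumptions rather than values given in the slides. The names trainData and testData are likewise illustrative.

```r
# Stand-in data (assumption: the built-in airquality dataset).
air <- airquality

# Fix the random seed so that the split is reproducible.
set.seed(42)

# Draw two thirds of the row indices at random for training.
n <- nrow(air)
trainIndex <- sample(n, size = floor(n * 2 / 3))

# Rows drawn go to training; the remaining rows go to testing.
trainData <- air[trainIndex, ]
testData <- air[-trainIndex, ]

print(dim(trainData))
print(dim(testData))
```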

Step 3. Train model (same as above, with new dependent variable goodOzone)

Computing a confusion matrix for discrete outcomes

Step 4. Evaluate the model using predict and compute model accuracy

goodPred <- predict(nb, testData_Corey)
compGood1 <- data.frame(testData_Corey[,6], goodPred)
colnames(compGood1) <- c("test","Pred")

##   test Pred
## 1    0    1
## 2    1    1
## 3    1    1
## 4    0    0
## 5    1    1
## 6    1    1

Computing classification rate

compGood1$result <- ifelse(compGood1$test==compGood1$Pred,1,0)

##   test Pred result
## 1    0    1      0
## 2    1    1      1
## 3    1    1      1
## 4    0    0      1
## 5    1    1      1
## 6    1    1      1
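The result column can be summarized into a confusion matrix and a single classification rate. This is a minimal sketch that rebuilds only the six rows printed above, so the numbers describe those rows and not the full test set. The names compGood, confMatrix, and accuracy are assumptions for illustration.

```r
# Rebuild the six rows shown in the slide output (illustration only).
compGood <- data.frame(
  test = factor(c(0, 1, 1, 0, 1, 1), levels = c(0, 1)),
  Pred = factor(c(1, 1, 1, 0, 1, 1), levels = c(0, 1))
)

# Confusion matrix: rows are the actual classes, columns the predictions.
confMatrix <- table(Actual = compGood$test, Predicted = compGood$Pred)

# 1 when the prediction matches the actual class, 0 otherwise.
compGood$result <- ifelse(compGood$test == compGood$Pred, 1, 0)

# Classification rate: the share of correct predictions.
accuracy <- mean(compGood$result)

print(confMatrix)
print(accuracy)
```

The diagonal of the confusion matrix counts the correct predictions, so sum(diag(confMatrix)) / sum(confMatrix) gives the same rate.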

Week 10

Week 10: Text mining

Lab 10 Overview

Lab 10 Overview II

Lab 10 Overview III

##  [1] "dream"        "president"    "anger"        "allude"      
##  [5] "school"       "Washington"   "shop"         "mischief"    
##  [9] "capitol"      "constitution" "black"        "children"

cutpoint <- round(length(words)/4)

## [1] 3

words[1:cutpoint]

## [1] "dream"     "president" "anger"

Homework 10 Tips: Text Mining

##         Word Score
## 1    abandon    -2
## 2  abandoned    -2
## 3   abandons    -2
## 4   abducted    -2
## 5  abduction    -2
## 6 abductions    -2

Datasets: MLK Speech and the AFINN word list

Packages needed: readr, tm
More about AFINN
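A rough sketch of how a word list with scores can be applied to text: split the text into lowercase words, look each word up in the list, and add up the matched scores. The lexicon below contains only the six entries printed above, the sentence is a made-up example rather than a line from the speech, and the names afinn, matched, and totalScore are assumptions. The sketch uses base R only, so it does not show how readr or tm would be used in the homework.

```r
# The six word-list entries shown in the slide output (not the full list).
afinn <- data.frame(
  Word = c("abandon", "abandoned", "abandons",
           "abducted", "abduction", "abductions"),
  Score = c(-2, -2, -2, -2, -2, -2),
  stringsAsFactors = FALSE
)

# Made-up example sentence (assumption: illustration only).
text <- "They abandoned the plan but would not abandon hope."

# Lowercase the text and split on anything that is not a letter.
tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Look up each token in the word list; unmatched tokens give NA.
idx <- match(tokens, afinn$Word)
found <- !is.na(idx)

# Keep only the matched words together with their scores.
matched <- data.frame(
  Word = tokens[found],
  Score = afinn$Score[idx[found]],
  stringsAsFactors = FALSE
)

# Overall sentiment: the sum of the matched scores.
totalScore <- sum(matched$Score)

print(matched)
print(totalScore)
```

Words that do not appear in the list contribute nothing to the total, so coverage of the word list matters as much as the scores themselves.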

Next week

Asynchronous Materials

Live Session