Now that we are doing text mining, we will be creating our own termDocMatrix.

This was also done in class, when we analyzed the structure of the “I have a dream” speech in terms of its use of positive and negative words. However, in that effort we treated all positive words the same (e.g., “good” counted the same as “great”). This might not be appropriate: strongly positive (or negative) words should perhaps count for more than mildly positive (or negative) ones. For example, “I loved the movie” might be stronger than “I liked the movie”.

There is a different word file that ranks each word on a scale of -5 to 5 (negative to positive). It is known as the AFINN word list.
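
To make the difference concrete, here is a minimal sketch in base R that compares plain counting with weighted scoring. The lexicon `toy_lexicon`, its scores, and the helper `score_sentence` are assumptions made up for illustration; they are not read from the actual AFINN file.

```r
# Toy lexicon: the scores are illustrative assumptions, not read from the AFINN file
toy_lexicon <- c(loved = 3, liked = 2)

# Hypothetical helper: counts lexicon hits (unweighted) and sums their scores (weighted)
score_sentence <- function(sentence, lexicon) {
  # lowercase the text and split on anything that is not a letter
  words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  words <- words[words != ""]
  # keep only the words that appear in the lexicon
  hits <- words[words %in% names(lexicon)]
  c(unweighted = length(hits), weighted = sum(lexicon[hits]))
}

score_sentence("I loved the movie", toy_lexicon)
score_sentence("I liked the movie", toy_lexicon)
```

Both sentences contain exactly one sentiment word, so the unweighted counts tie; only the weighted score separates them.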

Your task for this homework is to adapt the lab that we did in class, to compute the score for the MLK speech using the AFINN word list (as opposed to the positive and negative word lists).

# read in the data using read.delim() 

# change column names to "Word" and "Score"
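
The two steps above might look like the following sketch. It assumes the word list is a tab-separated file with one word and one score per line and no header row; the temporary file and its four rows are a hypothetical stand-in so the example runs on its own, and in the homework you would point `read.delim()` at the real AFINN file instead.

```r
# Hypothetical stand-in for the AFINN file (tab-separated, no header);
# the rows are illustrative assumptions, not copied from the real list
afinn_path <- tempfile(fileext = ".txt")
writeLines(c("abandon\t-2", "good\t3", "great\t3", "love\t3"), afinn_path)

# read in the data using read.delim(); header = FALSE because the file has no header row
AFINN <- read.delim(afinn_path, header = FALSE, stringsAsFactors = FALSE)

# change column names to "Word" and "Score"
colnames(AFINN) <- c("Word", "Score")

str(AFINN)
```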
  # read in the text file MLK.txt

  # interpret each element of "mlk" as a document and create a vector source

  # create a Corpus, a "Bag of Words"

  # first step transformation: make all of the letters in "words.corpus" lowercase

  # second step transformation: remove the punctuation in "words.corpus"

  # third step transformation: remove numbers in "words.corpus"

  # final step transformation: take out the "stop" words, such as "the", "a" and "at"

  # create a term-document matrix "tdm"

  # convert tdm into a matrix called "m"

  # create a list of counts for each word named "wordCounts"

  # sort words in "wordCounts" by frequency

  # check the first ten items in "wordCounts" to see if it is built correctly

  # calculate the total number of words
  
  # create a vector that contains all the words in "wordCounts"

  # locate which words in the mlk speech appear in the AFINN word list
  # (match() returns 0 for an "mlk" word that does not appear in the AFINN list)

  # calculate the matched words counts

  # create a new dataframe that contains matched words and their counts, set ordinal numbers as row names

  # change column names to "word" and "counts"

  # join the dataframe "match" with "AFINN" by "word" column in match and "Word" column in AFINN

  # calculate the overall score

  # The overall score is 0.1343639
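
The matching and scoring steps at the end of this chunk can be sketched in base R on toy inputs. Everything below is an assumption for illustration: the three-row `AFINN` table, the `wordCounts` vector, and the name `matchTable` (standing in for the data frame the comments call "match") are made up, so the resulting number says nothing about the speech itself.

```r
# Toy inputs (assumptions for illustration, not taken from the speech or the real list)
AFINN <- data.frame(Word = c("dream", "free", "injustice"),
                    Score = c(1, 1, -2),
                    stringsAsFactors = FALSE)
wordCounts <- c(will = 4, dream = 3, freedom = 2, injustice = 1)

# total number of words and the vector of words
totalWords <- sum(wordCounts)
words <- names(wordCounts)

# position of each word in the AFINN list; 0 means "not in the list"
matched <- match(words, AFINN$Word, nomatch = 0)

# counts of the matched words only
mCounts <- wordCounts[matched != 0]

# data frame of matched words and counts, with ordinal row names
matchTable <- data.frame(word = names(mCounts), counts = mCounts,
                         row.names = seq_along(mCounts),
                         stringsAsFactors = FALSE)

# join with the scores and compute the count-weighted score per word in the text
mergedTable <- merge(matchTable, AFINN, by.x = "word", by.y = "Word")
overallScore <- sum(mergedTable$counts * mergedTable$Score) / totalWords
overallScore
```

Dividing by `totalWords` rather than by the number of matched words means that unmatched words pull the score toward zero, which is the same normalization the quarter function below uses.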
  
  # create a function to calculate the score for each quarter
  myfunction <- function(q){
    # interpret each element of "mlk" as a document and create a vector source
    words.vec <- VectorSource(mlk)
    # create a Corpus, which is a "Bag of Words"
    words.corpus <- Corpus(words.vec)
    # define "cutpoint_l" as the first cut point; round the number to get an integer
    cutpoint_l <- round(length(words.corpus)*(q-1)/4) + 1
    # define "cutpoint_r" as the second cut point; round the number to get an integer
    cutpoint_r <- round(length(words.corpus)*q/4)
    # create a word corpus for each quarter (cut at the cut points)
    words.corpus <- words.corpus[cutpoint_l: cutpoint_r]
    # word corpora transformation
    words.corpus <- tm_map(words.corpus, content_transformer(tolower))
    words.corpus <- tm_map(words.corpus, removePunctuation)
    words.corpus <- tm_map(words.corpus, removeNumbers)
    words.corpus <- tm_map(words.corpus, removeWords, stopwords("english"))
    # create term document matrix
    tdm <- TermDocumentMatrix(words.corpus)
    m <- as.matrix(tdm)
    # calculate a list of counts for each word
    wordCounts <- rowSums(m)
    wordCounts <- sort(wordCounts, decreasing=TRUE)
    # calculate total words
    totalWords <- sum(wordCounts)
    # locate the mlk words that appear in the AFINN list
    words <- names(wordCounts)
    matched <- match(words, AFINN$Word, nomatch = 0)
    mCounts <- wordCounts[which(matched != 0)]
    match <- data.frame(names(mCounts), mCounts, row.names = seq_along(mCounts))
    colnames(match) <- c("word", "counts")
    # merge matched words with AFINN scores
    mergedTable <- merge(match, AFINN, by.x = "word" ,by.y = "Word")
    # calculate the total score
    Score <- sum(mergedTable$counts * mergedTable$Score)/totalWords
    # return the results
    return(Score)
  }
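
As a check on the cut-point arithmetic used in the function, the sketch below applies the same two formulas to a made-up corpus length. The value `n_lines = 10` and the helper `quarter_bounds` are assumptions for illustration only; the real length depends on how MLK.txt is read in.

```r
# n_lines is a hypothetical corpus length chosen for illustration
n_lines <- 10

# same formulas as cutpoint_l and cutpoint_r in the function above
quarter_bounds <- function(q, n) {
  left  <- round(n * (q - 1) / 4) + 1
  right <- round(n * q / 4)
  c(left = left, right = right)
}

# one row per quarter, columns "left" and "right"
bounds <- t(sapply(1:4, quarter_bounds, n = n_lines))
bounds

# every document index should fall in exactly one quarter
covered <- unlist(lapply(1:4, function(q) {
  b <- quarter_bounds(q, n_lines)
  seq(b[["left"]], b[["right"]])
}))
covered
```

Because each left cut point is the previous right cut point plus one, the four slices do not overlap and together cover the whole corpus. This relies on the corpus having enough documents that no quarter is empty.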
  # apply function to first quarter

  # apply function to second quarter

  # apply function to third quarter

  # apply function to fourth quarter
  # combine scores of four quarters into one dataframe

  # create a bar plot for the four scores
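
A minimal sketch of these last two steps using base R graphics follows. The four numbers are placeholders, not results from the speech; in the homework they would be replaced by the values returned from calling `myfunction()` on quarters 1 through 4. The data frame name `scores` and its column names are likewise assumptions.

```r
# Placeholder values only (assumptions for illustration, NOT results from the speech);
# in practice use c(myfunction(1), myfunction(2), myfunction(3), myfunction(4))
scores <- data.frame(quarter = c("Q1", "Q2", "Q3", "Q4"),
                     score = c(0.05, 0.10, 0.15, 0.20),
                     stringsAsFactors = FALSE)

# bar plot of the four scores, one bar per quarter
barplot(scores$score,
        names.arg = scores$quarter,
        xlab = "Quarter of the speech",
        ylab = "AFINN score per word",
        main = "Sentiment score by quarter")
```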