15.071 | Spring 2017 | Graduate

The Analytics Edge

5 Text Analytics

5.2 Turning Tweets into Knowledge: An Introduction to Text Analytics

Quick Question

Which of these problems is the least likely to be a good application of natural language processing?

 
 
 
 

Explanation Judging the winner of a poetry contest requires a deep level of human understanding and emotion. Perhaps someday a computer will be able to accurately judge the winner of a poetry contest, but currently the other three tasks are much better suited for natural language processing.

Quick Question

For each tweet, we computed an overall score by averaging all five scores assigned by the Amazon Mechanical Turk workers. However, Amazon Mechanical Turk workers might make significant mistakes when labeling a tweet, and the mean could be highly affected by such mistakes.

Which of the three alternative metrics below would best capture the typical opinion of the five Amazon Mechanical Turk workers, would be less affected by mistakes, and is well-defined regardless of the five labels?

Explanation The correct answer is the first one - the median would capture the typical opinion of the workers and tends to be less affected by significant mistakes. The majority score might not have given a score to all tweets because they might not all have a majority score (consider a tweet with scores 0, 0, 1, 1, and 2). The minimum score does not necessarily capture the typical opinion and could be highly affected by mistakes (consider a tweet with scores -2, 1, 1, 1, 1).
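As a quick sketch of why the median is more robust, consider the mistaken-label example above in R (the variable name scores is just illustrative):

scores = c(-2, 1, 1, 1, 1)  # four workers say 1, one mistakenly says -2

mean(scores)    # 0.4 - pulled down by the single mistake

median(scores)  # 1 - still reflects the typical opinion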

Quick Question

For each of the following questions, pick the preprocessing task that we discussed in the previous video that would change the sentence “Data is useful AND powerful!” to the new sentence listed in the question.

New sentence: Data useful powerful!

 
 
 


New sentence: data is useful and powerful

 
 
 


New sentence: Data is use AND power!

 
 
 

Explanation The first new sentence has the stop words "is" and "and" removed. The second new sentence has the irregularities removed (no capital letters or punctuation). The third new sentence has the words stemmed - the "ful" is removed from "useful" and "powerful".
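As an illustrative sketch, the three preprocessing tasks can be reproduced with the tm package (assuming tm and SnowballC are installed; the functions below are one reasonable choice, and stop words are removed after lowercasing so that "AND" matches "and"):

library(tm)

library(SnowballC)

corpus = VCorpus(VectorSource("Data is useful AND powerful!"))

corpus = tm_map(corpus, content_transformer(tolower))       # remove capital letters

corpus = tm_map(corpus, removePunctuation)                  # remove punctuation

corpus = tm_map(corpus, removeWords, stopwords("english"))  # remove stop words "is" and "and"

corpus = tm_map(corpus, stemDocument)                       # stem "useful" and "powerful"

as.character(corpus[[1]])  # roughly "data use power" (extra spaces may remain)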

Quick Question

 

Given a corpus in R, how many commands do you need to run to clean up the irregularities (removing capital letters and punctuation)?

Exercise 1


 

How many commands do you need to run to stem the document?

Exercise 2


 

Explanation

In R, you can clean up the irregularities with two lines:

corpus = tm_map(corpus, tolower)

corpus = tm_map(corpus, removePunctuation)

And you can stem the document with one line:

corpus = tm_map(corpus, stemDocument)
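Depending on your version of the tm package, the tolower step may need to be wrapped in content_transformer so that the corpus stays in the format tm expects:

corpus = tm_map(corpus, content_transformer(tolower))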


 

Quick Question

In the previous video, we showed a list of all words that appear at least 20 times in our tweets. Which of the following words appear at least 100 times? Select all that apply. (HINT: use the findFreqTerms function)

Explanation

To answer this question, you need to run the following command in R:

findFreqTerms(frequencies, lowfreq=100)

This outputs the words that appear at least 100 times in our tweets. They are "iphon", "itun", and "new".
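For reference (a sketch, assuming corpus is the preprocessed corpus from the video), the frequencies object is the document-term matrix built from it:

frequencies = DocumentTermMatrix(corpus)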

Quick Question

 

In the previous video, we used CART and Random Forest to predict sentiment. Let’s see how well logistic regression does. Build a logistic regression model (using the training set) to predict “Negative” using all of the independent variables. You may get a warning message after building your model - don’t worry (we explain what it means in the explanation).

Now, make predictions using the logistic regression model:

predictions = predict(tweetLog, newdata=testSparse, type="response")

where “tweetLog” should be the name of your logistic regression model. You might also get a warning message after this command, but don’t worry, it is due to the same problem as the previous warning message.

Build a confusion matrix (with a threshold of 0.5) and compute the accuracy of the model. What is the accuracy?

Exercise 1


 

Explanation

You can build a logistic regression model in R by using the command:

tweetLog = glm(Negative ~ ., data=trainSparse, family="binomial")

Then you can make predictions and build a confusion matrix with the following commands:

predictLog = predict(tweetLog, newdata=testSparse, type="response")

table(testSparse$Negative, predictLog > 0.5)

The accuracy is (254+37)/(254+46+18+37) = 0.8197183, which is worse than the baseline. If you were to compute the accuracy on the training set instead, you would see that the model does really well on the training set - this is an example of overfitting. The model fits the training set really well, but does not perform well on the test set. A logistic regression model with a large number of variables is particularly at risk for overfitting.
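As a sketch, the same accuracy can be computed directly from the confusion matrix (confMat is just an illustrative name):

confMat = table(testSparse$Negative, predictLog > 0.5)

sum(diag(confMat)) / sum(confMat)  # (254+37)/(254+46+18+37) = 0.8197183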

Note that you might have gotten a different answer than us, because the glm function struggles with this many variables. The warning messages that you might have seen in this problem have to do with the number of variables, and the fact that the model is overfitting to the training set. We’ll discuss this in more detail in the Homework Assignment.

Is this worse or better than the baseline model accuracy of 84.5%? Think about the properties of logistic regression that might make this the case!


Video 5: Pre-Processing in R

Note: the dataset “tweets.csv” used in the rest of this lecture is not available to OCW users.

In the following video, we ask you to install the “tm” package to perform the pre-processing steps. Due to function changes that occurred after this video was recorded, you will need to run the following command immediately after converting all of the words to lowercase letters (it converts all documents in the corpus to the PlainTextDocument type):

corpus = tm_map(corpus, PlainTextDocument)

Then you can continue with the R commands as they are in the video.

Non-Standard Stop Word Lists

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in the stopwords (TXT) file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).

Language Settings

If you downloaded and installed R in a location other than the United States, you might encounter some issues when using the bag of words approach (since the pre-processing tasks used here depend on the English language). To fix this, you will need to type in your R console:

Sys.setlocale("LC_ALL", "C")

This will only change the locale for your current R session, so please make a note to run this command when you are working on any lectures or exercises that might depend on the English language (for example, removing stop words).
