15.071 | Spring 2017 | Graduate

The Analytics Edge

5 Text Analytics

5.5 Assignment 5

Separating Spam From Ham

Nearly every email user has at some point encountered a “spam” email, which is an unsolicited message often advertising a product, containing links to malware, or attempting to scam the recipient. Roughly 80-90% of the more than 100 billion emails sent each day are spam, most of them sent from botnets of malware-infected computers. The remainder of emails are called “ham” emails.

As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.

In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes — Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.

The dataset contains just two fields:

 

  • text: The text of the email.
  • spam: A binary variable indicating if the email was spam.

Problem 1.1 - Loading the Dataset

Begin by loading the dataset emails (CSV - 8.5MB) into a data frame called emails (don’t open the file with Excel; import into R directly to avoid errors). Remember to pass the stringsAsFactors=FALSE option when loading the data.

How many emails are in the dataset?

Exercise 1

 Numerical Response 

 

Explanation

You can load the dataset with:

emails = read.csv("emails.csv", stringsAsFactors=FALSE)

The number of emails can be read from str(emails) or nrow(emails).


Problem 1.2 - Loading the Dataset

How many of the emails are spam?

Exercise 2

 Numerical Response 

 

Explanation

This can be read from table(emails$spam).


Problem 1.3 - Loading the Dataset

Which word appears at the beginning of every email in the dataset? Respond as a lower-case word with punctuation removed.

Exercise 3

 Text Response  Answer: subject

Explanation

You can review emails with, for instance, emails$text[1] or emails$text[1000]. Every email begins with the word “Subject:”.


Problem 1.4 - Loading the Dataset

Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?

Exercise 4

 No – the word appears in every email so this variable would not help us differentiate spam from ham. 

 Yes – the number of times the word appears might help us differentiate spam from ham. 

Explanation

We know that each email has the word “subject” appear at least once, but the frequency with which it appears might help us differentiate spam from ham. For instance, a long email chain would have the word “subject” appear a number of times, and this higher frequency might be indicative of a ham message.
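As an informal check (not something the problem requires), you could count how many times "subject" appears in each email with base R's gregexpr, which returns -1 when there is no match:

subjectCounts = sapply(gregexpr("subject", tolower(emails$text), fixed=TRUE), function(m) sum(m != -1))

table(subjectCounts)

Emails that are part of long chains should show larger counts.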


Problem 1.5 - Loading the Dataset

The nchar() function counts the number of characters in a piece of text. How many characters are in the longest email in the dataset (where longest is measured in terms of the maximum number of characters)?

Exercise 5

 Numerical Response 

 

Explanation

The maximum length can be obtained with max(nchar(emails$text)).


Problem 2.1 - Preparing the Corpus

Follow the standard steps to build and pre-process the corpus:

  1. Build a new corpus variable called corpus.

  2. Using tm_map, convert the text to lowercase.

  3. Using tm_map, remove all punctuation from the corpus.

  4. Using tm_map, remove all English stopwords from the corpus.

  5. Using tm_map, stem the words in the corpus.

  6. Build a document term matrix from the corpus, called dtm.

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in the stopwords (TXT) file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).

How many terms are in dtm?

Exercise 6

 Numerical Response 

 

Explanation

These steps can be accomplished by running:

library(tm)

corpus = Corpus(VectorSource(emails$text))

corpus = tm_map(corpus, tolower)

corpus = tm_map(corpus, PlainTextDocument)

corpus = tm_map(corpus, removePunctuation)

corpus = tm_map(corpus, removeWords, stopwords("english"))

corpus = tm_map(corpus, stemDocument)

dtm = DocumentTermMatrix(corpus)

dtm


Problem 2.2 - Preparing the Corpus

To obtain a more reasonable number of terms, limit dtm to contain terms appearing in at least 5% of documents, and store this result as spdtm (don’t overwrite dtm, because we will use it in a later step of this homework). How many terms are in spdtm?

Exercise 7

 Numerical Response 

 

Explanation

This can be accomplished with:

spdtm = removeSparseTerms(dtm, 0.95)

spdtm

From the spdtm summary output, it contains 330 terms.


Problem 3.1 - Building machine learning models

First, create a variable called “emailsSparse” from “spdtm” using the command “emailsSparse = as.data.frame(as.matrix(spdtm))” and ensure it has legal variable names with “names(emailsSparse) = make.names(names(emailsSparse))”. Then, copy the dependent variable from the original data frame called “emails” to “emailsSparse” using the command “emailsSparse$spam = as.factor(emails$spam)”.

Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called “train” and a testing set called “test”. Make sure to perform this step on emailsSparse instead of emails.

Explanation

These steps can be accomplished with:

emailsSparse = as.data.frame(as.matrix(spdtm))

names(emailsSparse) = make.names(names(emailsSparse))

emailsSparse$spam = as.factor(emails$spam)

set.seed(123)

library(caTools)

spl = sample.split(emailsSparse$spam, 0.7)

library(dplyr)

train = emailsSparse %>% filter(spl == TRUE)

test = emailsSparse %>% filter(spl == FALSE)

Using the training set, train the following two machine learning models. The models should predict the dependent variable “spam”, using all other available variables as independent variables. Please be patient, as these models may take a few minutes to train.

  1. A CART model called spamCART, using the default parameters to train the model (don’t worry about adding minbucket or cp or specifying losses for false positives and false negatives). Remember to add the argument method="class" since this is a binary classification problem.

  2. A random forest model called spamRF, using the default parameters to train the model (don’t worry about specifying mtry or ntree or nodesize or the losses for false positives and false negatives). Directly before training the random forest model, set the random seed to 123 (even though we’ve already done this earlier in the problem, it’s important to set the seed right before training the model so we all obtain the same results. Keep in mind though that on certain operating systems, your results might still be slightly different).

Explanation

These models can be trained with the following code:

library(rpart)

spamCART = rpart(spam~., data=train, method="class")

set.seed(123)

library(randomForest)

spamRF = randomForest(spam~., data=train)

Similar to logistic regression, CART and random forest can be told to give you predicted probabilities for classification problems. CART does this by returning the proportion of observations in the bucket of interest that fall into the relevant category, and random forest does this by returning the proportion of the trees that predict the relevant category. The following commands can be used to obtain predicted probabilities for the two fitted models on the training set:

predTrainCART = predict(spamCART)[,2]

predTrainRF = predict(spamRF, type="prob")[,2]

What is the training set accuracy of spamCART, using a threshold of 0.5 for predictions?

Exercise 8

 Numerical Response 

 

Explanation

This can be obtained with:

table(train$spam, predTrainCART > 0.5)

Then the accuracy is (2885+894)/nrow(train)

What is the training set accuracy of spamRF, using a threshold of 0.5 for predictions? (Remember that your answer might not match ours exactly, due to random behavior in the random forest algorithm on different operating systems.)

Exercise 9

 Numerical Response 

 

Explanation

This can be obtained with:

table(train$spam, predTrainRF > 0.5)

And then the accuracy is (3013+914)/nrow(train)


Problem 3.2 - Building Machine Learning Models

How many of the word stems “enron”, “hou”, “vinc”, and “kaminski” appear in the CART tree? These are word stems likely specific to the inbox of Vincent Kaminski, whose email we used as the ham observations in our dataset.

Exercise 10

 Numerical Response 

 

Explanation

After loading the necessary package with “library(rpart.plot)”, we see from “prp(spamCART)” that “vinc” and “enron” appear in the CART tree as the top two branches, but that “hou” and “kaminski” do not appear.
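In code, the commands referenced above are:

library(rpart.plot)

prp(spamCART)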


Problem 3.3 - Building Machine Learning Models

What is the training set AUC of spamCART? Note that now that we have predicted probabilities from the CART model, we can compute AUC with the ROCR package just as we have for logistic regression.

Exercise 11

 Numerical Response 

 

Explanation

This can be obtained with:

library(ROCR)

predictionTrainCART = prediction(predTrainCART, train$spam)

as.numeric(performance(predictionTrainCART, "auc")@y.values)


Problem 3.4 - Building Machine Learning Models

What is the training set AUC of spamRF?

Exercise 12

 Numerical Response 

 

Explanation

This can be obtained with:

predictionTrainRF = prediction(predTrainRF, train$spam)

as.numeric(performance(predictionTrainRF, "auc")@y.values)


Problem 4.1 - Evaluating on the Test Set

Obtain predicted probabilities for the testing set for each of the models, again ensuring that probabilities instead of classes are obtained. This can be achieved with the following code:

predTestCART = predict(spamCART, newdata=test)[,2]

predTestRF = predict(spamRF, newdata=test, type="prob")[,2]

What is the testing set accuracy of spamCART, using a threshold of 0.5 for predictions?

Exercise 13

 Numerical Response 

 

Explanation

This can be obtained with:

table(test$spam, predTestCART > 0.5)

Then the accuracy is (1228+386)/nrow(test)


Problem 4.2 - Evaluating on the Test Set

What is the testing set AUC of spamCART?

Exercise 14

 Numerical Response 

 

Explanation

This can be obtained with:

predictionTestCART = prediction(predTestCART, test$spam)

as.numeric(performance(predictionTestCART, "auc")@y.values)


Problem 4.3 - Evaluating on the Test Set

What is the testing set accuracy of spamRF, using a threshold of 0.5 for predictions?

Exercise 15

 Numerical Response 

 

Explanation

This can be obtained with:

table(test$spam, predTestRF > 0.5)

Then the accuracy is (1290+385)/nrow(test)


Problem 4.4 - Evaluating on the Test Set

What is the testing set AUC of spamRF?

Exercise 16

 Numerical Response 

 

Explanation

This can be obtained with:

predictionTestRF = prediction(predTestRF, test$spam)

as.numeric(performance(predictionTestRF, "auc")@y.values)


Problem 4.5 - Evaluating on the Test Set

Which model had the best testing set performance, in terms of accuracy and AUC?

Exercise 17

 CART 

 Random forest 

Explanation

The random forest outperformed CART in both measures, obtaining an impressive AUC of 0.998 on the test set.


 

You may note that we did not ask you to fit a logistic regression model to predict whether an email was spam or ham. This is in contrast to our usual approach of comparing all three models. If you in fact tried to train a logistic regression model in R using this dataset, you would get the following warning:

glm.fit: algorithm did not converge

This warning indicates that R’s logistic regression solution procedure has failed.
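For reference, the warning comes from a call along the following lines (a sketch using the train data frame defined above; we do not use a logistic regression model anywhere else in this problem):

spamLog = glm(spam ~ ., data=train, family="binomial")

Non-convergence like this is often a symptom of near-perfect separation: some word-frequency variables split spam from ham almost exactly, which makes the coefficient estimates unstable.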

Automating Reviews in Medicine

The medical literature is enormous. Pubmed, a database of medical publications maintained by the U.S. National Library of Medicine, has indexed over 23 million medical publications. Further, the rate of medical publication has increased over time, and now there are nearly 1 million new publications in the field each year, or more than one per minute.

The large size and fast-changing nature of the medical literature has increased the need for reviews, which search databases like Pubmed for papers on a particular topic and then report results from the papers found. While such reviews are often performed manually, with multiple people reviewing each search result, this is tedious and time consuming. In this problem, we will see how text analytics can be used to automate the process of information retrieval.

The dataset consists of the titles (variable title) and abstracts (variable abstract) of papers retrieved in a Pubmed search. Each search result is labeled with whether the paper is a clinical trial testing a drug therapy for cancer (variable trial). These labels were obtained by two people reviewing each search result and accessing the actual paper if necessary, as part of a literature review of clinical trials testing drug therapies for advanced and metastatic breast cancer.

 

Problem 1.1 - Loading the Data

Load clinical_trial (CSV - 2.9MB) into a data frame called trials (remembering to add the argument stringsAsFactors=FALSE), and investigate the data frame with summary() and str().

Important Note: Some students have been getting errors like “invalid multibyte string” when performing certain parts of this homework question. If this is happening to you, use the argument fileEncoding="latin1" when reading in the file with read.csv. This should cause those errors to go away.

We can use R’s string functions to learn more about the titles and abstracts of the located papers. The nchar() function counts the number of characters in a piece of text. Using the nchar() function on the variables in the data frame, answer the following questions:

How many characters are there in the longest abstract? (Longest here is defined as the abstract with the largest number of characters.)

Exercise 1

 Numerical Response 

 

Explanation

You can load the data set into R with the following command:

trials = read.csv("clinical_trial.csv", stringsAsFactors=FALSE)

From summary(nchar(trials$abstract)) or max(nchar(trials$abstract)), we can read the maximum length.


Problem 1.2 - Loading the Data

How many search results provided no abstract? (HINT: A search result provided no abstract if the number of characters in the abstract field is zero.)

Exercise 2

 Numerical Response 

 

Explanation

From table(nchar(trials$abstract) == 0) or sum(nchar(trials$abstract) == 0), we can find the number of missing abstracts.


Problem 1.3 - Loading the Data

Find the observation with the minimum number of characters in the title (the variable “title”) out of all of the observations in this dataset. What is the text of the title of this article? Include capitalization and punctuation in your response, but don’t include the quotes.

Exercise 3

 Text Response  Answer: A decade of letrozole: FACE.

Explanation

To identify which title is the shortest, we can use

which.min(nchar(trials$title))

From this, we know the 1258th title is the shortest. We can access this title with trials$title[1258].


Problem 2.1 - Preparing the Corpus

Because we have both title and abstract information for trials, we need to build two corpora instead of one. Name them corpusTitle and corpusAbstract.

Following the commands from lecture, perform the following tasks (you might need to load the “tm” package first if it isn’t already loaded). Make sure to perform them in this order.

  1. Convert the title variable to corpusTitle and the abstract variable to corpusAbstract.

  2. Convert corpusTitle and corpusAbstract to lowercase. After performing this step, remember to run the lines:

corpusTitle = tm_map(corpusTitle, PlainTextDocument)

corpusAbstract = tm_map(corpusAbstract, PlainTextDocument)

  3. Remove the punctuation in corpusTitle and corpusAbstract.

  4. Remove the English language stop words from corpusTitle and corpusAbstract.

  5. Stem the words in corpusTitle and corpusAbstract (each stemming might take a few minutes).

  6. Build a document term matrix called dtmTitle from corpusTitle and dtmAbstract from corpusAbstract.

  7. Limit dtmTitle and dtmAbstract to terms with sparseness of at most 95% (aka terms that appear in at least 5% of documents).

  8. Convert dtmTitle and dtmAbstract to data frames (keep the names dtmTitle and dtmAbstract).

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in the stopwords (TXT) file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpusTitle, removeWords, sw) and tm_map(corpusAbstract, removeWords, sw) instead of tm_map(corpusTitle, removeWords, stopwords("english")) and tm_map(corpusAbstract, removeWords, stopwords("english")).

Explanation

Below we provide the code for corpusTitle; only minor modifications are needed to build corpusAbstract.

corpusTitle = Corpus(VectorSource(trials$title))

corpusTitle = tm_map(corpusTitle, tolower)

corpusTitle = tm_map(corpusTitle, PlainTextDocument)

corpusTitle = tm_map(corpusTitle, removePunctuation)

corpusTitle = tm_map(corpusTitle, removeWords, stopwords("english"))

corpusTitle = tm_map(corpusTitle, stemDocument)

dtmTitle = DocumentTermMatrix(corpusTitle)

dtmTitle = removeSparseTerms(dtmTitle, 0.95)

dtmTitle = as.data.frame(as.matrix(dtmTitle))

How many terms remain in dtmTitle after removing sparse terms (aka how many columns does it have)?

Exercise 4

 Numerical Response 

 

How many terms remain in dtmAbstract?

Exercise 5

 Numerical Response 

 

Explanation

These can be read from str(dtmTitle) and str(dtmAbstract). Other than str(), the dim() or ncol() functions could have been used. If you used fileEncoding="latin1" when reading in the datafile, you’ll have a few extra terms in dtmAbstract, but you should get the answer correct.


Problem 2.2 - Preparing the Corpus

What is the most likely reason why dtmAbstract has so many more terms than dtmTitle?

Exercise 6

 Abstracts tend to have many more words than titles 

 Abstracts tend to have a much wider vocabulary than titles 

 More papers have abstracts than titles 

Explanation

Because titles are so short, a word needs to be very common to appear in 5% of titles. Because abstracts have many more words, a word can be much less common and still appear in 5% of abstracts.

While abstracts may have wider vocabulary, this is a secondary effect. As we saw in the previous subsection, all papers have titles, but not all have abstracts.
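A quick, informal way to see the size difference (not part of the graded questions) is to compare rough word counts for the two fields:

summary(sapply(strsplit(trials$title, " "), length))

summary(sapply(strsplit(trials$abstract, " "), length))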


Problem 2.3 - Preparing the Corpus

What is the most frequent word stem across all the abstracts? Hint: you can use colSums() to compute the frequency of a word across all the abstracts.

Exercise 7

 Text Response  Answer: patient

Explanation

We can compute the column sums and then identify the most common one with:

csAbstract = colSums(dtmAbstract)

which.max(csAbstract)


Problem 3.1 - Building a model

We want to combine dtmTitle and dtmAbstract into a single data frame to make predictions. However, some of the variables in these data frames have the same names. To fix this issue, run the following commands:

colnames(dtmTitle) = paste0("T", colnames(dtmTitle))

colnames(dtmAbstract) = paste0("A", colnames(dtmAbstract))

What was the effect of these functions?

Exercise 8

 Removing the words that are in common between the titles and the abstracts. 

 Adding the letter T in front of all the title variable names and adding the letter A in front of all the abstract variable names. 

 Adding the letter T in front of all the title variable names that also appear in the abstract data frame, and adding an A in front of all the abstract variable names that appear in the title data frame. 

Explanation

The first line pastes a T at the beginning of each column name for dtmTitle, which are the variable names. The second line does something similar for the Abstract variables - it pastes an A at the beginning of each column name for dtmAbstract, which are the variable names.


Problem 3.2 - Building a Model

Using cbind(), combine dtmTitle and dtmAbstract into a single data frame called dtm:

dtm = cbind(dtmTitle, dtmAbstract)

As we did in class, add the dependent variable “trial” to dtm, copying it from the original data frame called trials. How many columns are in this combined data frame?

Exercise 9

 Numerical Response 

 

Explanation

The combination can be accomplished with:

dtm = cbind(dtmTitle, dtmAbstract)

dtm$trial = trials$trial

The number of variables in the combined data frame can be read from str(dtm) or ncol(dtm). If you used fileEncoding="latin1" when reading in the file, you should have 5 extra variables (but the answer should be graded as correct).


Problem 3.3 - Building a Model

Now that we have prepared our data frame, it’s time to split it into a training and testing set and to build models. Set the random seed to 144 and use the sample.split function from the caTools package to split dtm into data frames named “train” and “test”, putting 70% of the data in the training set.

Explanation

This can be accomplished with:

set.seed(144)

spl = sample.split(dtm$trial, 0.7)

train = subset(dtm, spl == TRUE)

test = subset(dtm, spl == FALSE)

What is the accuracy of the baseline model on the training set? (Remember that the baseline model predicts the most frequent outcome in the training set for all observations.)

Exercise 10

 Numerical Response 

 

Explanation

Just as in any binary classification problem, the naive baseline always predicts the most common class. From table(train$trial), we see 730 training set results were not trials, and 572 were trials. Therefore, the naive baseline always predicts a result is not a trial, yielding accuracy of 730/(730+572).
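In code, using the counts above:

table(train$trial)

max(table(train$trial)) / nrow(train)

which gives 730/1302, or about 0.561.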


Problem 3.4 - Building a Model

Build a CART model called trialCART, using all the independent variables in the training set to train the model, and then plot the CART model. Just use the default parameters to build the model (don’t add a minbucket or cp value). Remember to add the method="class" argument, since this is a classification problem.

What is the name of the first variable the model split on?

Exercise 11

 Text Response  Answer: Tphase

Explanation

This can be accomplished with:

trialCART = rpart(trial~., data=train, method="class")

prp(trialCART)

The first split checks whether or not Tphase is less than 0.5


Problem 3.5 - Building a Model

Obtain the training set predictions for the model (do not yet predict on the test set). Extract the predicted probability of a result being a trial (recall that this involves not setting a type argument, and keeping only the second column of the predict output). What is the maximum predicted probability for any result?

Exercise 12

 Numerical Response 

 

Explanation

The training set predictions can be obtained and summarized with the following commands:

predTrain = predict(trialCART)[,2]

summary(predTrain)


Problem 3.6 - Building a Model

Without running the analysis, how do you expect the maximum predicted probability to differ in the testing set?

Exercise 13

 The maximum predicted probability will likely be smaller in the testing set. 

 The maximum predicted probability will likely be exactly the same in the testing set. 

 The maximum predicted probability will likely be larger in the testing set. 

Explanation

Because the CART tree assigns the same predicted probability to each leaf node and there are a small number of leaf nodes compared to data points, we expect exactly the same maximum predicted probability.


Problem 3.7 - Building a Model

For these questions, use a threshold probability of 0.5 to predict that an observation is a clinical trial.

What is the training set accuracy of the CART model?

Exercise 14

 Numerical Response 

 

What is the training set sensitivity of the CART model?

Exercise 15

 Numerical Response 

 

What is the training set specificity of the CART model?

Exercise 16

 Numerical Response 

 

Explanation

We can compare the predictions with threshold 0.5 to the true results in the training set with:

table(train$trial, predTrain >= 0.5)

From this, we read the following confusion matrix (rows are true outcome, columns are predicted outcomes):

        FALSE  TRUE
  0       631    99
  1       131   441

 

We conclude that the model has training set accuracy (631+441)/(631+441+99+131), sensitivity 441/(441+131) and specificity 631/(631+99).
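These quantities can also be pulled directly from the table object; a sketch, assuming predTrain holds the training-set probabilities from Problem 3.5:

confMat = table(train$trial, predTrain >= 0.5)

sum(diag(confMat)) / sum(confMat) # accuracy

confMat["1", "TRUE"] / sum(confMat["1", ]) # sensitivity (true positive rate)

confMat["0", "FALSE"] / sum(confMat["0", ]) # specificity (true negative rate)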


Problem 4.1 - Evaluating the model on the testing set

Evaluate the CART model on the testing set using the predict function and creating a vector of predicted probabilities predTest.

What is the testing set accuracy, assuming a probability threshold of 0.5 for predicting that a result is a clinical trial?

Exercise 17

 Numerical Response 

 

Explanation

The testing set predictions can be obtained and compared to the true outcomes with:

predTest = predict(trialCART, newdata=test)[,2]

table(test$trial, predTest >= 0.5)

This yields the following confusion matrix:

        FALSE  TRUE
  0       261    52
  1        83   162

 

From this, we read that the testing set accuracy is (261+162)/(261+162+83+52).


Problem 4.2 - Evaluating the Model on the Testing Set

Using the ROCR package, what is the testing set AUC of the prediction model?

Exercise 18

 Numerical Response 

 

Explanation

The AUC can be determined using the following code:

library(ROCR)

pred = prediction(predTest, test$trial)

as.numeric(performance(pred, "auc")@y.values)


Part 5: Decision-maker Tradeoffs

The decision maker for this problem, a researcher performing a review of the medical literature, would use a model (like the CART one we built here) in the following workflow:

  1. For all of the papers retrieved in the PubMed search, predict which papers are clinical trials using the model. This yields some initial Set A of papers predicted to be trials, and some Set B of papers predicted not to be trials. (See the figure below.)

  2. Then, the decision maker manually reviews all papers in Set A, verifying that each paper meets the study’s detailed inclusion criteria (for the purposes of this analysis, we assume this manual review is 100% accurate at identifying whether a paper in Set A is relevant to the study). This yields a more limited set of papers to be included in the study, which would ideally be all papers in the medical literature meeting the detailed inclusion criteria for the study.

  3. Perform the study-specific analysis, using data extracted from the limited set of papers identified in step 2.

This process is shown in the figure below.

Figure: workflow for using the CART model to automate information retrieval in reviews of the medical literature.

Problem 5.1 - Decision-Maker Tradeoffs

What is the cost associated with the model in Step 1 making a false negative prediction?

Exercise 19

 A paper will be mistakenly added to Set A, yielding additional work in Step 2 of the process but not affecting the quality of the results of Step 3. 

 A paper will be mistakenly added to Set A, definitely affecting the quality of the results of Step 3. 

 A paper that should have been included in Set A will be missed, affecting the quality of the results of Step 3. 

 There is no cost associated with a false negative prediction. 

Explanation

By definition, a false negative is a paper that should have been included in Set A but was missed by the model. This means a study that should have been included in Step 3 was missed, affecting the results.


Problem 5.2 - Decision-Maker Tradeoffs

What is the cost associated with the model in Step 1 making a false positive prediction?

Exercise 20

 A paper will be mistakenly added to Set A, yielding additional work in Step 2 of the process but not affecting the quality of the results of Step 3. 

 A paper will be mistakenly added to Set A, definitely affecting the quality of the results of Step 3. 

 A paper that should have been included in Set A will be missed, affecting the quality of the results of Step 3. 

 There is no cost associated with a false positive prediction. 

Explanation

By definition, a false positive is a paper that should not have been included in Set A but that was actually included. However, because the manual review in Step 2 is assumed to be 100% effective, this extra paper will not make it into the more limited set of papers, and therefore this mistake will not affect the analysis in Step 3.


Problem 5.3 - Decision-Maker Tradeoffs

Given the costs associated with false positives and false negatives, which of the following is most accurate?

Exercise 21

 A false positive is more costly than a false negative; the decision maker should use a probability threshold greater than 0.5 for the machine learning model. 

 A false positive is more costly than a false negative; the decision maker should use a probability threshold less than 0.5 for the machine learning model. 

 A false negative is more costly than a false positive; the decision maker should use a probability threshold greater than 0.5 for the machine learning model. 

 A false negative is more costly than a false positive; the decision maker should use a probability threshold less than 0.5 for the machine learning model. 

Explanation

A false negative might negatively affect the results of the literature review and analysis, while a false positive is a nuisance (one additional paper that needs to be manually checked). As a result, the cost of a false negative is much higher than the cost of a false positive, so much so that many studies actually use no machine learning (aka no Step 1) and have two people manually review each search result in Step 2. As always, we prefer a lower threshold in cases where false negatives are more costly than false positives, since we will make fewer negative predictions.



Detecting Vandalism on Wikipedia

Wikipedia is a free online encyclopedia that anyone can edit and contribute to. It is available in many languages and is growing all the time. On the English language version of Wikipedia alone there are millions of pages, and many thousands of edits are made each day (figures as of 15.071x course publication).

One of the consequences of being editable by anyone is that some people vandalize pages. This can take the form of removing content, adding promotional or inappropriate content, or more subtle shifts that change the meaning of the article. With this many articles and edits per day it is difficult for humans to detect all instances of vandalism and revert (undo) them. As a result, Wikipedia uses bots - computer programs that automatically revert edits that look like vandalism. In this assignment we will attempt to develop a vandalism detector that uses machine learning to distinguish between a valid edit and vandalism.

The data for this problem is based on the revision history of the page Language. Wikipedia provides a history for each page that consists of the state of the page at each revision. Rather than manually considering each revision, a script was run that checked whether edits stayed or were reverted. If a change was eventually reverted then that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.

As a result of this preprocessing, some common processing tasks have already been done, including lower-casing and punctuation removal. The columns in the dataset are:

  • Vandal = 1 if this edit was vandalism, 0 if not.
  • Minor = 1 if the user marked this edit as a “minor edit”, 0 if not.
  • Loggedin = 1 if the user made this edit while using a Wikipedia account, 0 if they did not.
  • Added = The unique words added.
  • Removed = The unique words removed.

Notice the repeated use of unique. The data we have available is not the traditional bag of words - rather it is the set of words that were removed or added. For example, if a word was removed multiple times in a revision it will only appear one time in the “Removed” column.

Problem 1.1 - Bags of Words

Load the data wiki (CSV) with the option stringsAsFactors=FALSE, calling the data frame “wiki”. Convert the “Vandal” column to a factor using the command wiki$Vandal = as.factor(wiki$Vandal).

How many cases of vandalism were detected in the history of this page?

Exercise 1

 Numerical Response 

 

Explanation

You can load the data using the command:

wiki = read.csv("wiki.csv", stringsAsFactors=FALSE)

And then convert Vandal to a factor with the command:

wiki$Vandal = as.factor(wiki$Vandal)

You can then use the table command to see how many cases of Vandalism there are:

table(wiki$Vandal)

There are 1815 observations with value 1, which denotes vandalism.


Problem 1.2 - Bags of Words

We will now use the bag of words approach to build a model. We have two columns of textual data, with different meanings. For example, adding rude words has a different meaning than removing rude words. We’ll start like we did in class by building a document term matrix from the Added column. The text already is lowercase and stripped of punctuation. So to pre-process the data, just complete the following four steps:

  1. Create the corpus for the Added column, and call it “corpusAdded”.

  2. Remove the English-language stopwords.

  3. Stem the words.

  4. Build the DocumentTermMatrix, and call it dtmAdded.

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in the stopwords (TXT) file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpusAdded, removeWords, sw) instead of tm_map(corpusAdded, removeWords, stopwords("english")).

How many terms appear in dtmAdded?

Exercise 2

 Numerical Response 

 

Explanation

The following are the commands needed to execute these four steps:

corpusAdded = Corpus(VectorSource(wiki$Added))

corpusAdded = tm_map(corpusAdded, removeWords, stopwords("english"))

corpusAdded = tm_map(corpusAdded, stemDocument)

dtmAdded = DocumentTermMatrix(corpusAdded)

If you type dtmAdded, you can see that there are 6675 terms.


Problem 1.3 - Bags of Words

Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions, and call the new matrix sparseAdded. How many terms appear in sparseAdded?

Exercise 3

 Numerical Response 

 

Explanation

You can create the sparse matrix with the follow line:

sparseAdded = removeSparseTerms(dtmAdded, 0.997)

If you type sparseAdded, you can see that there are 166 terms.


Problem 1.4 - Bags of Words

Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the command:

colnames(wordsAdded) = paste("A", colnames(wordsAdded))

Explanation

You need to type the following two commands:

wordsAdded = as.data.frame(as.matrix(sparseAdded))

colnames(wordsAdded) = paste("A", colnames(wordsAdded))

Now repeat all of the steps we’ve done so far (create a corpus, remove stop words, stem the document, create a sparse document term matrix, and convert it to a data frame) to create a Removed bag-of-words dataframe, called wordsRemoved, except this time, prepend all of the words with the letter R:

colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))

How many words are in the wordsRemoved data frame?

Exercise 4

 Numerical Response 

 

Explanation

To repeat the steps for the Removed column, use the following commands:

corpusRemoved = Corpus(VectorSource(wiki$Removed))

corpusRemoved = tm_map(corpusRemoved, removeWords, stopwords("english"))

corpusRemoved = tm_map(corpusRemoved, stemDocument)

dtmRemoved = DocumentTermMatrix(corpusRemoved)

sparseRemoved = removeSparseTerms(dtmRemoved, 0.997)

wordsRemoved = as.data.frame(as.matrix(sparseRemoved))

colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))

To see that there are 162 words in the wordsRemoved data frame, you can type ncol(wordsRemoved) in your R console.


Problem 1.5 - Bags of Words

Combine the two data frames into a data frame called wikiWords with the following line of code:

wikiWords = cbind(wordsAdded, wordsRemoved)

The cbind function combines two sets of variables for the same observations into one data frame. Then add the Vandal column (HINT: remember how we added the dependent variable back into our data frame in the Twitter lecture). Set the random seed to 123 and then split the data set using sample.split from the “caTools” package to put 70% in the training set.

Explanation

You can combine the two data frames by using the command:

wikiWords = cbind(wordsAdded, wordsRemoved)

And then add the Vandal variable by using the command:

wikiWords$Vandal = wiki$Vandal

To split the data, you can use the following commands:

library(caTools)

set.seed(123)

spl = sample.split(wikiWords$Vandal, SplitRatio = 0.7)

wikiTrain = subset(wikiWords, spl==TRUE)

wikiTest = subset(wikiWords, spl==FALSE)

What is the accuracy on the test set of a baseline method that always predicts “not vandalism” (the most frequent outcome)?

Exercise 5

 Numerical Response 

 

Explanation

You can compute this number using the table command:

table(wikiTest$Vandal)

It outputs that there are 618 observations with value 0, and 545 observations with value 1. The accuracy of the baseline method would be 618/(618+545) = 0.531.


Problem 1.6 - Bags of Words

Build a CART model to predict Vandal, using all of the other variables as independent variables. Use the training set to build the model and the default parameters (don’t set values for minbucket or cp).

What is the accuracy of the model on the test set, using a threshold of 0.5? (Remember that if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.)

Exercise 6

 Numerical Response 

 

Explanation

You can build the CART model with the following command:

wikiCART = rpart(Vandal ~ ., data=wikiTrain, method="class")

And then make predictions on the test set:

testPredictCART = predict(wikiCART, newdata=wikiTest, type="class")

And compute the accuracy by comparing the actual values to the predicted values:

table(wikiTest$Vandal, testPredictCART)

The accuracy is (618+12)/(618+533+12) = 0.5417.


Problem 1.7 - Bags of Words

Plot the CART tree. How many word stems does the CART model use?

Exercise 7

 Numerical Response 

 

Explanation

If you plot the tree with prp(wikiCART), you can see that the tree uses two words: “R arbitr” and “R thousa”.


Problem 1.8 - Bags of Words

Given the performance of the CART model relative to the baseline, what is the best explanation of these results?

Exercise 8

 We have a bad testing/training split. 

 The CART model overfits to the training set. 

 Although it beats the baseline, bag of words is not very predictive for this problem. 

 We over-sparsified the document-term matrix. 

Explanation

There is no reason to think there was anything wrong with the split. CART did not overfit, which you can check by computing the accuracy of the model on the training set. Over-sparsification is plausible but unlikely, since we selected a very high sparsity parameter. The only conclusion left is simply that bag of words didn’t work very well in this case.
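For example, one way to confirm that the CART model is not overfitting (not required by the problem) is to compute its training-set accuracy and compare it with the test-set accuracy above; a sketch, assuming wikiCART and wikiTrain from Problems 1.5-1.6:

trainPredictCART = predict(wikiCART, newdata=wikiTrain, type="class")

trainConf = table(wikiTrain$Vandal, trainPredictCART)

sum(diag(trainConf)) / sum(trainConf)

A training accuracy close to the test accuracy, rather than far above it, indicates the model is not overfitting.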


Problem 2.1 - Problem-specific Knowledge

We weren’t able to improve on the baseline using the raw textual information. More specifically, the words themselves were not useful. There are other options though, and in this section we will try two techniques - identifying a key class of words, and counting words.

The key class of words we will use are website addresses. “Website addresses” (also known as URLs - Uniform Resource Locators) are comprised of two main parts. An example would be “http://www.google.com”. The first part is the protocol, which is usually “http” (HyperText Transfer Protocol). The second part is the address of the site, e.g. “www.google.com”. We have stripped all punctuation so links to websites appear in the data as one word, e.g. “httpwwwgooglecom”. We hypothesize that given that a lot of vandalism seems to be adding links to promotional or irrelevant websites, the presence of a web address is a sign of vandalism.

We can search for the presence of a web address in the words added by searching for “http” in the Added column. The grepl function returns TRUE if a string is found in another string, e.g.

grepl("cat","dogs and cats",fixed=TRUE) # TRUE

grepl("cat","dogs and rats",fixed=TRUE) # FALSE

Create a copy of your dataframe from the previous question:

wikiWords2 = wikiWords

Make a new column in wikiWords2 that is 1 if “http” was in Added:

wikiWords2$HTTP = ifelse(grepl("http",wiki$Added,fixed=TRUE), 1, 0)

Based on this new column, how many revisions added a link?

Exercise 9

 Numerical Response 

 

Explanation

You can find this number by typing table(wikiWords2$HTTP), and seeing that there are 217 observations with value 1.


Problem 2.2 - Problem-Specific Knowledge

In problem 1.5, you computed a vector called “spl” that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets:

wikiTrain2 = subset(wikiWords2, spl==TRUE)

wikiTest2 = subset(wikiWords2, spl==FALSE)

Then create a new CART model using this new variable as one of the independent variables.

What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

Exercise 10

 Numerical Response 

 

Explanation

You can compute this by running the following commands:

wikiCART2 = rpart(Vandal ~ ., data=wikiTrain2, method="class")

testPredictCART2 = predict(wikiCART2, newdata=wikiTest2, type="class")

table(wikiTest2$Vandal, testPredictCART2)

Then the accuracy is (609+57)/(609+9+488+57) = 0.5726569.


Problem 2.3 - Problem-Specific Knowledge

Another possibility is that the number of words added and removed is predictive, perhaps more so than the actual words themselves. We already have a word count available in the form of the document-term matrices (DTMs).

Sum the rows of dtmAdded and dtmRemoved and add them as new variables in your data frame wikiWords2 (called NumWordsAdded and NumWordsRemoved) by using the following commands:

wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))

wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtmRemoved))

What is the average number of words added?

Exercise 11

 Numerical Response 

 

Explanation

You can get this answer with mean(wikiWords2$NumWordsAdded).


Problem 2.4 - Problem-Specific Knowledge

In problem 1.5, you computed a vector called “spl” that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords2. Create the CART model again (using the training set and the default parameters).

What is the new accuracy of the CART model on the test set?

Exercise 12

 Numerical Response 

 

Explanation

To split the data again, use the following commands:

wikiTrain3 = subset(wikiWords2, spl==TRUE)

wikiTest3 = subset(wikiWords2, spl==FALSE)

You can compute the accuracy of the new CART model with the following commands:

wikiCART3 = rpart(Vandal ~ ., data=wikiTrain3, method="class")

testPredictCART3 = predict(wikiCART3, newdata=wikiTest3, type="class")

table(wikiTest3$Vandal, testPredictCART3)

The accuracy is (514+248)/(514+104+297+248) = 0.6552021.


Problem 3.1 - Using Non-Textual Data

We have two pieces of “metadata” (data about data) that we haven’t yet used. Make a copy of wikiWords2, and call it wikiWords3:

wikiWords3 = wikiWords2

Then add the two original variables Minor and Loggedin to this new data frame:

wikiWords3$Minor = wiki$Minor

wikiWords3$Loggedin = wiki$Loggedin

In problem 1.5, you computed a vector called “spl” that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords3.

Explanation

This can be done with the following two commands:

wikiTrain4 = subset(wikiWords3, spl==TRUE)

wikiTest4 = subset(wikiWords3, spl==FALSE)

Build a CART model using all the training data. What is the accuracy of the model on the test set?

Exercise 13

 Numerical Response 

 

Explanation

This model can be built and evaluated using the following commands:

wikiCART4 = rpart(Vandal ~ ., data=wikiTrain4, method="class")

predictTestCART4 = predict(wikiCART4, newdata=wikiTest4, type="class")

table(wikiTest4$Vandal, predictTestCART4)

The accuracy of the model is (595+241)/(595+23+304+241) = 0.7188306.


Problem 3.2 - Using Non-Textual Data

There is a substantial difference in the accuracy of the model using the meta data. Is this because we made a more complicated model?

Plot the CART tree. How many splits are there in the tree?

Exercise 14

 Numerical Response 

 

Explanation

You can plot the tree with prp(wikiCART4). The first split is on the variable “Loggedin”, the second split is on the number of words added, and the third split is on the number of words removed.

By adding new independent variables, we were able to significantly improve our accuracy without making the model more complicated!

