15.071 | Spring 2017 | Graduate

The Analytics Edge

5 Text Analytics

5.4 Predictive Coding: Bringing Text Analytics to the Courtroom (Recitation)

Video 2: The Data

In this recitation, we’ll be using the dataset energy_bids (CSV - 2.0MB). Please download and save this dataset to your computer so that you can follow along. This data comes from the 2010 TREC Legal Track.
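Once the file is saved, loading it into R might look like the minimal sketch below. It assumes the file is named energy_bids.csv and sits in the current working directory; the variable name emails is chosen here for illustration and may differ from the one used in the video.

```r
# Assumed file name and location: energy_bids.csv in the current working directory
data_file = "energy_bids.csv"

if (file.exists(data_file)) {
  # Keep text columns as character strings rather than factors
  emails = read.csv(data_file, stringsAsFactors = FALSE)
  # Show the structure of the data frame (column names and types)
  str(emails)
} else {
  message("Could not find ", data_file, "; download it or change the working directory with setwd().")
}
```

Setting stringsAsFactors = FALSE matters for text analytics because the text should stay as character strings.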

An R script file with all of the commands we will be using in this recitation is available for download: Unit5_Recitation (R).

Video 3: Pre-Processing

Important Note: In the following video, we ask you to use the “tm” package to perform the pre-processing steps. Due to function changes that occurred after this video was recorded, you will need to run the following command immediately after converting all of the words to lowercase letters (it converts all documents in the corpus to the PlainTextDocument type):

corpus = tm_map(corpus, PlainTextDocument)

Then you can continue with the R commands as they are in the video.

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in the stopwords (TXT) file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).
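The stop-word check described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the video: the two toy documents are hypothetical and already lowercase, the corpus is built with VCorpus, and words_to_remove is a name introduced here. The variable sw is assumed to exist only if you have already run the line from the stopwords (TXT) file.

```r
library(tm)

# Hypothetical, already-lowercase toy documents (not taken from energy_bids)
docs = c("the bids were sent to the energy desk",
         "please see the schedule for this week")
corpus = VCorpus(VectorSource(docs))

# Use tm's built-in list if it has the expected 174 words;
# otherwise fall back to sw, created by running the stopwords (TXT) file
if (length(stopwords("english")) == 174) {
  words_to_remove = stopwords("english")
} else if (exists("sw")) {
  words_to_remove = sw
} else {
  stop("Run the line of code in the stopwords (TXT) file to create sw first.")
}

# Remove the chosen stop words from every document in the corpus
corpus = tm_map(corpus, removeWords, words_to_remove)

# Inspect the first document after stop-word removal
print(as.character(corpus[[1]]))
```

Removed words leave extra spaces behind, so the printed text will contain gaps where the stop words used to be.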
