5.2 Turning Tweets into Knowledge: An Introduction to Text Analytics

Video 5: Pre-Processing in R

Note: the dataset “tweets.csv” used in the rest of this lecture is not available to OCW users.

In the following video, we ask you to install the “tm” package to perform the pre-processing steps. Because the package's functions changed after the video was recorded, you will need to run the following command immediately after converting all of the words to lowercase (it converts every document in the corpus to the PlainTextDocument type):

corpus = tm_map(corpus, PlainTextDocument)

Then you can continue with the R commands as they are in the video.
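
As a rough sketch, and not the exact code from the video, recent tm releases are often used with VCorpus and content_transformer(tolower). This wrapper keeps each document as a PlainTextDocument, so under that assumption the separate conversion step should not be needed. The tweets vector below is a hypothetical stand-in for the unavailable tweets.csv data.

```r
library(tm)

# Hypothetical example tweets; the lecture's tweets.csv is not reproduced here.
tweets <- c("I LOVE my new phone!", "The battery is the WORST...")

# Build a corpus with one document per tweet.
corpus <- VCorpus(VectorSource(tweets))

# Lowercase the text; content_transformer() preserves the document type.
corpus <- tm_map(corpus, content_transformer(tolower))

# Remove punctuation, then the standard English stop words.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Inspect the first pre-processed document.
print(content(corpus[[1]]))
```

If you follow the video's commands instead, keep the tm_map(corpus, PlainTextDocument) line described above.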

Non-Standard Stop Word Lists

If length(stopwords("english")) does not return 174 for you, please run the line of code in the stopwords (TXT) file, which stores the standard stop words in a variable called sw. Then, when removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).
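
The check above can be sketched as follows. This is only an illustration: the name stop_words is hypothetical, and sw is assumed to have been created already by running the line from the stopwords (TXT) file, which is not reproduced here.

```r
library(tm)

standard <- stopwords("english")

if (length(standard) == 174) {
  # The installed list matches the one used in the lecture.
  stop_words <- standard
} else if (exists("sw")) {
  # Fall back to the list loaded from the course's stopwords (TXT) file.
  stop_words <- sw
} else {
  stop("Run the line in the stopwords (TXT) file first so that `sw` exists.")
}

# Number of stop words that will be removed.
print(length(stop_words))
```

You could then pass stop_words to tm_map(corpus, removeWords, stop_words) regardless of which branch was taken.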

Language Settings

If you downloaded and installed R in a location other than the United States, you might encounter some issues when using the bag of words approach (since the pre-processing tasks used here depend on the English language). To fix this, you will need to type in your R console:

Sys.setlocale("LC_ALL", "C")

This changes the locale only for your current R session, so remember to run this command whenever you work on lectures or exercises that depend on the English language (for example, removing stop words).
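
To confirm that the setting took effect, one option is to print the locale before and after the change. This is a minimal sketch using only base R; the variable name old_locale is hypothetical.

```r
# Record the locale the session started with.
old_locale <- Sys.getlocale()
print(old_locale)

# Switch every locale category to the plain "C" locale for this session.
Sys.setlocale("LC_ALL", "C")

# Check one category to confirm the change.
print(Sys.getlocale("LC_COLLATE"))
```

Because the change does not persist, a new R session starts with its original locale again.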
