WEBVTT

00:00:04.500 --> 00:00:07.130
Pre-processing the
data can be difficult,

00:00:07.130 --> 00:00:11.280
but, luckily, R's packages
provide easy-to-use functions

00:00:11.280 --> 00:00:13.400
for the most common tasks.

00:00:13.400 --> 00:00:16.530
In this video, we'll
load and process our data

00:00:16.530 --> 00:00:20.300
in R. In your R
console, let's load

00:00:20.300 --> 00:00:25.560
the data set tweets.csv
with the read.csv function.

00:00:25.560 --> 00:00:28.030
But since we're working
with text data here,

00:00:28.030 --> 00:00:29.740
we need one extra
argument, which is

00:00:29.740 --> 00:00:30.700
stringsAsFactors=FALSE.

00:00:33.390 --> 00:00:36.110
So we'll call our
data set tweets.

00:00:36.110 --> 00:00:39.440
And we'll use the read.csv
function to read in the data

00:00:39.440 --> 00:00:44.020
file tweets.csv, but then
we'll add the extra argument

00:00:44.020 --> 00:00:44.980
stringsAsFactors=FALSE.

00:00:51.530 --> 00:00:54.150
You'll always need to add
this extra argument when

00:00:54.150 --> 00:00:56.660
working on a text
analytics problem

00:00:56.660 --> 00:01:00.040
so that the text is
read in properly.

00:01:00.040 --> 00:01:01.670
Now let's take a
look at the structure

00:01:01.670 --> 00:01:03.620
of our data with
the str function.

00:01:06.450 --> 00:01:12.250
We can see that we have 1,181
observations of two variables,

00:01:12.250 --> 00:01:14.960
the text of the
tweet, called Tweet,

00:01:14.960 --> 00:01:19.500
and the average sentiment
score, called Avg for average.

00:01:19.500 --> 00:01:21.610
The tweet texts are
real tweets that we

00:01:21.610 --> 00:01:25.000
found on the internet directed
to Apple with a few cleaned up

00:01:25.000 --> 00:01:26.390
words.

00:01:26.390 --> 00:01:29.250
We're more interested
in being able to detect

00:01:29.250 --> 00:01:32.220
the tweets with clear
negative sentiment,

00:01:32.220 --> 00:01:34.910
so let's define a new
variable in our data

00:01:34.910 --> 00:01:38.509
set tweets called Negative.

00:01:38.509 --> 00:01:40.810
And we'll set this equal to
as.factor(tweets$Avg = -1).

00:01:51.000 --> 00:01:55.180
This will set tweets$Negative
equal to true if the average

00:01:55.180 --> 00:01:58.560
sentiment score is less than
or equal to negative 1 and will

00:01:58.560 --> 00:02:02.460
set tweets$Negative equal to
false if the average sentiment

00:02:02.460 --> 00:02:05.050
score is greater
than negative 1.

00:02:05.050 --> 00:02:07.610
Let's look at a table of
this new variable, Negative.

00:02:12.360 --> 00:02:19.320
We can see that 182 of the
1,181 tweets, or about 15%,

00:02:19.320 --> 00:02:19.870
are negative.

00:02:22.630 --> 00:02:24.829
Now to pre-process
our text data so

00:02:24.829 --> 00:02:27.250
that we can use the
bag of words approach,

00:02:27.250 --> 00:02:30.850
we'll be using the tm
text mining package.

00:02:30.850 --> 00:02:35.250
We'll need to install and
load two packages to do this.

00:02:35.250 --> 00:02:41.590
First, let's install the
package tm, and go ahead

00:02:41.590 --> 00:02:43.620
and select a CRAN
mirror near you.

00:02:47.380 --> 00:02:49.910
As soon as that package
is done installing

00:02:49.910 --> 00:02:51.990
and you're back at
the blinking cursor,

00:02:51.990 --> 00:02:56.640
go ahead and load that package
with the library command.

00:02:56.640 --> 00:03:00.990
Then we also need to install
the package snowballC.

00:03:04.230 --> 00:03:07.530
This package helps us
use the tm package.

00:03:07.530 --> 00:03:10.330
And go ahead and load the
snowball package as well.

00:03:13.280 --> 00:03:16.610
One of the concepts
introduced by the tm package

00:03:16.610 --> 00:03:18.660
is that of a corpus.

00:03:18.660 --> 00:03:21.329
A corpus is a
collection of documents.

00:03:21.329 --> 00:03:26.100
We'll need to convert our tweets
to a corpus for pre-processing.

00:03:26.100 --> 00:03:29.160
tm can create a corpus
in many different ways,

00:03:29.160 --> 00:03:32.420
but we'll create it from the
tweet column of our data frame

00:03:32.420 --> 00:03:36.420
using two functions,
corpus and vector source.

00:03:36.420 --> 00:03:39.310
We'll call our corpus
"corpus" and then

00:03:39.310 --> 00:03:44.030
use the corpus and the
vector source functions

00:03:44.030 --> 00:03:48.840
called on our tweets variable
of our tweets data set.

00:03:48.840 --> 00:03:49.800
So that's tweets$Tweet.

00:03:54.140 --> 00:03:55.710
We can check that
this has worked

00:03:55.710 --> 00:04:01.210
by typing corpus and seeing
that our corpus has 1,181 text

00:04:01.210 --> 00:04:03.040
documents.

00:04:03.040 --> 00:04:06.070
And we can check that the
documents match our tweets

00:04:06.070 --> 00:04:07.430
by using double brackets.

00:04:07.430 --> 00:04:08.270
So type corpus[[1]].

00:04:13.860 --> 00:04:18.130
This shows us the first
tweet in our corpus.

00:04:18.130 --> 00:04:21.660
Now we're ready to start
pre-processing our data.

00:04:21.660 --> 00:04:24.470
Pre-processing is easy in tm.

00:04:24.470 --> 00:04:28.010
Each operation, like stemming
or removing stop words,

00:04:28.010 --> 00:04:30.180
can be done with
one line in R, where

00:04:30.180 --> 00:04:33.010
we use the tm_map function.

00:04:33.010 --> 00:04:36.190
Let's try it out by changing
all of the text in our tweets

00:04:36.190 --> 00:04:37.840
to lowercase.

00:04:37.840 --> 00:04:40.540
To do that, we'll
replace our corpus

00:04:40.540 --> 00:04:45.290
with the output of the
tm_map function, where

00:04:45.290 --> 00:04:48.430
the first argument is
the name of our corpus

00:04:48.430 --> 00:04:50.780
and the second argument
is what we want to do.

00:04:50.780 --> 00:04:54.440
In this case, tolower.

00:04:54.440 --> 00:04:57.320
tolower is a standard
function in R,

00:04:57.320 --> 00:05:00.850
and this is like when we pass
mean to the tapply function.

00:05:00.850 --> 00:05:03.780
We're passing the
tm_map function

00:05:03.780 --> 00:05:07.620
a function to use on our corpus.

00:05:07.620 --> 00:05:10.180
Let's see what that did by
looking at our first tweet

00:05:10.180 --> 00:05:10.870
again.

00:05:10.870 --> 00:05:13.140
Go ahead and hit the up
arrow twice to get back

00:05:13.140 --> 00:05:17.410
to corpuscorpus{[1] and now we can
see that all of our letters are

00:05:17.410 --> 00:05:17.910
lowercase.

00:05:20.980 --> 00:05:23.950
Now let's remove
all punctuation.

00:05:23.950 --> 00:05:26.070
This is done in a
very similar way,

00:05:26.070 --> 00:05:28.370
except this time we
give the argument

00:05:28.370 --> 00:05:31.640
removePunctuation
instead of tolower.

00:05:31.640 --> 00:05:35.120
Hit the up arrow twice,
and in the tm_map function,

00:05:35.120 --> 00:05:37.990
delete tolower, and
type removePunctuation.

00:05:41.210 --> 00:05:44.100
Let's see what this did
to our first tweet again.

00:05:44.100 --> 00:05:47.540
Now the comma after "say",
the exclamation point after

00:05:47.540 --> 00:05:52.990
"received", and the @ symbols
before "Apple" are all gone.

00:05:52.990 --> 00:05:56.860
Now we want to remove the
stop words in our tweets.

00:05:56.860 --> 00:06:00.070
tm provides a list of stop
words for the English language.

00:06:00.070 --> 00:06:02.490
We can check it out by typing
stopwords("english") [1:10].

00:06:12.300 --> 00:06:14.430
We see that these
are words like "I,"

00:06:14.430 --> 00:06:18.490
"me," "my," "myself," et cetera.

00:06:18.490 --> 00:06:21.610
Removing words can be done
with the removeWords argument

00:06:21.610 --> 00:06:24.740
to the tm_map function, but
we need one extra argument

00:06:24.740 --> 00:06:28.830
this time-- what the stop words
are that we want to remove.

00:06:28.830 --> 00:06:31.370
We'll remove all of
these English stop words,

00:06:31.370 --> 00:06:33.750
but we'll also remove
the word "apple"

00:06:33.750 --> 00:06:36.310
since all of these tweets
have the word "apple"

00:06:36.310 --> 00:06:39.040
and it probably won't be
very useful in our prediction

00:06:39.040 --> 00:06:40.600
problem.

00:06:40.600 --> 00:06:42.659
So go ahead and hit the
up arrow to get back

00:06:42.659 --> 00:06:47.730
to the tm_map function, delete
removePunctuation and, instead,

00:06:47.730 --> 00:06:48.440
type removeWords.

00:06:52.210 --> 00:06:54.760
Then we need to add one
extra argument, c("apple").

00:06:59.230 --> 00:07:01.730
This is us removing
the word "apple."

00:07:01.730 --> 00:07:02.980
And then stopwords("english").

00:07:08.020 --> 00:07:10.260
So this will remove
the word "apple"

00:07:10.260 --> 00:07:14.060
and all of the
English stop words.

00:07:14.060 --> 00:07:15.560
Let's take a look
at our first tweet

00:07:15.560 --> 00:07:16.680
again to see what happened.

00:07:20.470 --> 00:07:23.590
Now we can see that we have
significantly fewer words, only

00:07:23.590 --> 00:07:26.730
the words that are
not stop words.

00:07:26.730 --> 00:07:30.230
Lastly, we want to stem our
document with the stem document

00:07:30.230 --> 00:07:31.220
argument.

00:07:31.220 --> 00:07:34.850
Go ahead and scroll back up
to the removePunctuation,

00:07:34.850 --> 00:07:40.090
delete removePunctuation,
and type stemDocument.

00:07:40.090 --> 00:07:43.830
If you hit Enter and then
look at the first tweet again,

00:07:43.830 --> 00:07:48.540
we can see that this took
off the ending of "customer,"

00:07:48.540 --> 00:07:52.260
"service," "received,"
and "appstore."

00:07:52.260 --> 00:07:55.360
In the next video, we'll
investigate our corpus

00:07:55.360 --> 00:07:58.510
and prepare it for our
prediction problem.