WEBVTT

00:00:09.500 --> 00:00:12.730
In this lecture, we'll use a
technique called Bag of Words

00:00:12.730 --> 00:00:15.640
to build text analytics models.

00:00:15.640 --> 00:00:19.570
Fully understanding text is
difficult, but Bag of Words

00:00:19.570 --> 00:00:22.130
provides a very simple approach.

00:00:22.130 --> 00:00:24.360
It just counts the
number of times

00:00:24.360 --> 00:00:28.320
each word appears in the
text and uses these counts

00:00:28.320 --> 00:00:31.080
as the independent variables.

00:00:31.080 --> 00:00:34.880
For example, in the sentence,
"This course is great.

00:00:34.880 --> 00:00:37.440
I would recommend this
course my friends,"

00:00:37.440 --> 00:00:43.180
the word this is seen twice,
the word course is seen twice,

00:00:43.180 --> 00:00:48.000
the word great is
seen once, et cetera.

00:00:48.000 --> 00:00:51.680
In Bag of Words, there's
one feature for each word.

00:00:51.680 --> 00:00:55.160
This is a very simple approach,
but is often very effective,

00:00:55.160 --> 00:00:56.330
too.

00:00:56.330 --> 00:00:59.680
It's used as a baseline
in text analytics projects

00:00:59.680 --> 00:01:02.390
and for natural
language processing.

00:01:02.390 --> 00:01:04.360
This isn't the
whole story, though.

00:01:04.360 --> 00:01:06.940
Preprocessing the
text can dramatically

00:01:06.940 --> 00:01:11.890
improve the performance of
the Bag of Words method.

00:01:11.890 --> 00:01:14.090
One part of
preprocessing the text

00:01:14.090 --> 00:01:16.620
is to clean up irregularities.

00:01:16.620 --> 00:01:20.000
Text data often as many
inconsistencies that will cause

00:01:20.000 --> 00:01:21.940
algorithms trouble.

00:01:21.940 --> 00:01:24.850
Computers are very
literal by default.

00:01:24.850 --> 00:01:29.710
Apple with just an uppercase A,
APPLE all in uppercase letters,

00:01:29.710 --> 00:01:33.370
or ApPLe with a mixture of
uppercase and lowercase letters

00:01:33.370 --> 00:01:35.990
will all be counted separately.

00:01:35.990 --> 00:01:38.610
We want to change the
text so that all three

00:01:38.610 --> 00:01:42.450
versions of Apple here will
be counted as the same word,

00:01:42.450 --> 00:01:46.930
by either changing all words
to uppercase or to lower case.

00:01:46.930 --> 00:01:50.350
We'll typically change all
the letters to lowercase,

00:01:50.350 --> 00:01:52.500
so these three versions
of Apple will all

00:01:52.500 --> 00:01:55.030
become Apple with
lower case letters

00:01:55.030 --> 00:01:57.000
and will be counted
as the same word.

00:01:59.940 --> 00:02:02.900
Punctuation can
also cause problems.

00:02:02.900 --> 00:02:04.910
The basic approach
is to deal with this

00:02:04.910 --> 00:02:07.510
is to remove everything
that isn't a standard number

00:02:07.510 --> 00:02:08.650
or letter.

00:02:08.650 --> 00:02:12.480
However, sometimes
punctuation is meaningful.

00:02:12.480 --> 00:02:16.990
In the case of Twitter, @Apple
denotes a message to Apple,

00:02:16.990 --> 00:02:20.690
and #Apple is a
message about Apple.

00:02:20.690 --> 00:02:22.790
For web addresses,
the punctuation

00:02:22.790 --> 00:02:25.420
often defines the web address.

00:02:25.420 --> 00:02:28.190
For these reasons, the
removal of punctuation

00:02:28.190 --> 00:02:31.370
should be tailored to
the specific problem.

00:02:31.370 --> 00:02:36.420
In our case, we will remove
all punctuation, so @Apple,

00:02:36.420 --> 00:02:40.020
Apple with an exclamation
point, Apple with dashes

00:02:40.020 --> 00:02:42.010
will all count as just Apple.

00:02:44.880 --> 00:02:47.490
Another preprocessing
task we want to do

00:02:47.490 --> 00:02:50.680
is to remove unhelpful terms.

00:02:50.680 --> 00:02:52.820
Many words are
frequently used but are

00:02:52.820 --> 00:02:54.990
only meaningful in a sentence.

00:02:54.990 --> 00:02:58.110
These are called stop words.

00:02:58.110 --> 00:03:02.940
Examples are the,
is, at, and which.

00:03:02.940 --> 00:03:05.440
It's unlikely that
these words will improve

00:03:05.440 --> 00:03:07.660
the machine learning
prediction quality,

00:03:07.660 --> 00:03:11.680
so we want to remove them to
reduce the size of the data.

00:03:11.680 --> 00:03:14.560
There are some potential
problems with this approach.

00:03:14.560 --> 00:03:17.329
Sometimes, two stop
words taken together

00:03:17.329 --> 00:03:19.560
have a very important meaning.

00:03:19.560 --> 00:03:23.579
For example, "The Who"-- which
is a combination of two stop

00:03:23.579 --> 00:03:27.100
words-- is actually the name
of the band we see on the right

00:03:27.100 --> 00:03:28.800
here.

00:03:28.800 --> 00:03:32.960
By removing the stop words,
we remove both of these words,

00:03:32.960 --> 00:03:35.720
but The Who might actually
have a significant meaning

00:03:35.720 --> 00:03:37.850
for our prediction task.

00:03:37.850 --> 00:03:40.940
Another example is the
phrase, "Take That".

00:03:40.940 --> 00:03:42.700
If we remove the
stop words, we'll

00:03:42.700 --> 00:03:47.000
remove the word "that," so the
phrase would just say, "take."

00:03:47.000 --> 00:03:50.620
It no longer has the
same meaning as before.

00:03:50.620 --> 00:03:54.150
So while removing stop words
sometimes is not helpful,

00:03:54.150 --> 00:03:59.770
it generally is a very
helpful preprocessing step.

00:03:59.770 --> 00:04:02.430
Lastly, an important
preprocessing step

00:04:02.430 --> 00:04:04.300
is called stemming.

00:04:04.300 --> 00:04:06.550
This step is motivated
by the desire

00:04:06.550 --> 00:04:08.760
to represent words
with different endings

00:04:08.760 --> 00:04:10.380
as the same word.

00:04:10.380 --> 00:04:14.020
We probably do not need to draw
a distinction between argue,

00:04:14.020 --> 00:04:16.910
argued, argues, and arguing.

00:04:16.910 --> 00:04:21.370
They could all be represented
by a common stem, argue.

00:04:21.370 --> 00:04:24.290
The algorithmic process of
performing this reduction

00:04:24.290 --> 00:04:26.170
is called stemming.

00:04:26.170 --> 00:04:29.440
There are many ways to
approach the problem.

00:04:29.440 --> 00:04:31.750
One approach is to
build a database

00:04:31.750 --> 00:04:33.890
of words and their stems.

00:04:33.890 --> 00:04:38.130
A pro is that this approach
handles exceptions very nicely,

00:04:38.130 --> 00:04:40.970
since we have to find
all of the stems.

00:04:40.970 --> 00:04:43.980
However, it won't
handle new words at all,

00:04:43.980 --> 00:04:45.860
since they are not
in the database.

00:04:45.860 --> 00:04:48.030
This is especially
bad for problems

00:04:48.030 --> 00:04:50.070
where we're using data
from the internet,

00:04:50.070 --> 00:04:53.480
since we have no idea
what words will be used.

00:04:53.480 --> 00:04:56.909
A different approach is to
write a rule-based algorithm.

00:04:56.909 --> 00:05:02.460
In this approach, if a word ends
in things like ed, ing, or ly,

00:05:02.460 --> 00:05:04.300
we would remove the ending.

00:05:04.300 --> 00:05:08.010
A pro of this approach is that
it handles new or unknown words

00:05:08.010 --> 00:05:08.910
well.

00:05:08.910 --> 00:05:11.080
However, there are
many exceptions,

00:05:11.080 --> 00:05:13.320
and this approach would
miss all of these.

00:05:13.320 --> 00:05:17.480
Words like child and children
would be considered different,

00:05:17.480 --> 00:05:22.830
but it would get other
plurals, like dog and dogs.

00:05:22.830 --> 00:05:24.940
This second approach
is widely popular

00:05:24.940 --> 00:05:26.910
and is called the
Porter Stemmer, designed

00:05:26.910 --> 00:05:31.520
by Martin Porter in 1980,
and it's still used today.

00:05:31.520 --> 00:05:35.820
Stemmers like this one have
been written for many languages.

00:05:35.820 --> 00:05:37.720
Other options for
stemming include

00:05:37.720 --> 00:05:39.570
machine learning,
where algorithms

00:05:39.570 --> 00:05:43.420
are trained to recognize the
roots of words and combinations

00:05:43.420 --> 00:05:45.750
of the approaches
explained here.

00:05:45.750 --> 00:05:48.190
As a real example
from our data set,

00:05:48.190 --> 00:05:51.710
the phrase "by far the best
customer care service I

00:05:51.710 --> 00:05:54.290
have ever received"
has three words

00:05:54.290 --> 00:05:57.659
that would be stemmed--
customer, service,

00:05:57.659 --> 00:05:59.050
and received.

00:05:59.050 --> 00:06:02.110
The "er" would be
removed in customer,

00:06:02.110 --> 00:06:04.550
the "e" would be
removed in service,

00:06:04.550 --> 00:06:08.740
and the "ed" would be
removed in received.

00:06:08.740 --> 00:06:12.180
In the next video, we'll see
how to run these preprocessing

00:06:12.180 --> 00:06:14.150
steps in R.