WEBVTT

00:00:04.490 --> 00:00:06.640
Now let's build the
document-term matrix

00:00:06.640 --> 00:00:07.740
for our corpus.

00:00:07.740 --> 00:00:09.120
So we'll create
a variable called

00:00:09.120 --> 00:00:11.270
dtm that contains the
DocumentTermMatrix(corpus).

00:00:15.510 --> 00:00:19.240
The corpus has already had all
the pre-processing run on it.

00:00:19.240 --> 00:00:22.870
So to get the summary statistics
about the document-term matrix,

00:00:22.870 --> 00:00:25.840
we'll just type in the
name of our variable, dtm.

00:00:25.840 --> 00:00:28.270
And what we can see
is that even though we

00:00:28.270 --> 00:00:31.860
have only 855 emails
in the corpus,

00:00:31.860 --> 00:00:35.890
we have over 22,000 terms
that showed up at least once,

00:00:35.890 --> 00:00:38.290
which is clearly
too many variables

00:00:38.290 --> 00:00:40.550
for the number of
observations we have.

00:00:40.550 --> 00:00:42.280
So we want to remove
the terms that

00:00:42.280 --> 00:00:45.030
don't appear too
often in our data set,

00:00:45.030 --> 00:00:50.590
and we'll do that using the
remove sparse terms function.

00:00:50.590 --> 00:00:52.680
And we're going to have
to determine the sparsity,

00:00:52.680 --> 00:00:55.920
so we'll say that we'll remove
any term that doesn't appear

00:00:55.920 --> 00:00:58.320
in at least 3% of the documents.

00:00:58.320 --> 00:01:04.720
To do that, we'll pass 0.97
to remove sparse terms.

00:01:04.720 --> 00:01:07.190
Now we can take a look
at the summary statistics

00:01:07.190 --> 00:01:09.270
for the document-term
matrix, and we

00:01:09.270 --> 00:01:11.270
can see that we've decreased
the number of terms

00:01:11.270 --> 00:01:15.380
to 788, which is a much
more reasonable number.

00:01:15.380 --> 00:01:18.970
So let's build a data frame
called labeledTerms out

00:01:18.970 --> 00:01:20.860
of this document-term matrix.

00:01:20.860 --> 00:01:24.320
So to do this, we'll
use as.data.fram

00:01:24.320 --> 00:01:30.080
of as.matrix applied to dtm,
the document-term matrix.

00:01:30.080 --> 00:01:33.330
So this data frame is
only including right now

00:01:33.330 --> 00:01:36.380
the frequencies of the words
that appeared in at least 3%

00:01:36.380 --> 00:01:40.050
of the documents, but in order
to run our text analytics

00:01:40.050 --> 00:01:43.670
models, we're also going to
have the outcome variable, which

00:01:43.670 --> 00:01:46.650
is whether or not each
email was responsive.

00:01:46.650 --> 00:01:49.280
So we need to add in
this outcome variable.

00:01:49.280 --> 00:01:53.740
So we'll create
labeledTerms$responsive,

00:01:53.740 --> 00:01:56.770
and we'll simply copy over
the responsive variable from

00:01:56.770 --> 00:01:59.240
the original emails
data frame so it's equal

00:01:59.240 --> 00:02:00.370
to emails$responsive.

00:02:04.480 --> 00:02:07.400
So finally let's take a look
at our newly constructed data

00:02:07.400 --> 00:02:09.509
frame with the str function.

00:02:12.580 --> 00:02:18.850
So as we expect, turn off a
lot of variables, 789 in total.

00:02:18.850 --> 00:02:21.630
788 of those variables
are the frequencies

00:02:21.630 --> 00:02:25.860
of various words in the emails,
and the last one is responsive,

00:02:25.860 --> 00:02:27.970
the outcome variable.