WEBVTT

00:00:04.490 --> 00:00:10.220
The cp parameter-- cp stands
for complexity parameter.

00:00:10.220 --> 00:00:12.180
Recall that the
first tree we made

00:00:12.180 --> 00:00:15.630
using latitude and longitude
only had many splits,

00:00:15.630 --> 00:00:19.370
but we were able to trim it
without losing much accuracy.

00:00:19.370 --> 00:00:22.010
The intuition we gain is,
having too many splits

00:00:22.010 --> 00:00:25.680
is bad for generalization--
that is, performance on the test

00:00:25.680 --> 00:00:27.930
set-- so we should
penalize the complexity.

00:00:30.710 --> 00:00:35.800
Let us define RSS to be the
residual sum of squares, also

00:00:35.800 --> 00:00:39.930
known as the sum of
square differences.

00:00:39.930 --> 00:00:41.740
Our goal when
building the tree is

00:00:41.740 --> 00:00:44.630
to minimize the RSS
by making splits,

00:00:44.630 --> 00:00:48.780
but we want to penalize
having too many splits now.

00:00:48.780 --> 00:00:51.210
Define S to be the
number of splits,

00:00:51.210 --> 00:00:53.900
and lambda to be our penalty.

00:00:53.900 --> 00:00:56.050
Our new goal is to
find a tree that

00:00:56.050 --> 00:01:00.080
minimizes the sum of
the RSS at each leaf,

00:01:00.080 --> 00:01:04.730
plus lambda, times S,
for the number of splits.

00:01:04.730 --> 00:01:08.289
Let us consider this
following example.

00:01:08.289 --> 00:01:12.280
Here we have set lambda
to be equal to 0.5.

00:01:12.280 --> 00:01:14.840
Initially, we have a
tree with no splits.

00:01:14.840 --> 00:01:17.360
We simply take the
average of the data.

00:01:17.360 --> 00:01:23.190
The RSS in this case is 5, thus
our total penalty is also 5.

00:01:23.190 --> 00:01:27.150
If we make one split,
we now have two leaves.

00:01:27.150 --> 00:01:32.600
At each of these leaves, say,
we have an error, or RSS of 2.

00:01:32.600 --> 00:01:37.039
The total RSS error
is then 2+2=4.

00:01:37.039 --> 00:01:43.370
And the total penalty is
4+0.5*1, the number of splits.

00:01:43.370 --> 00:01:47.410
Our total penalty
in this case is 4.5.

00:01:47.410 --> 00:01:50.190
If we split again on
one of our leaves,

00:01:50.190 --> 00:01:54.100
we now have a total of
three leaves for two splits.

00:01:54.100 --> 00:01:56.940
The error at our
left-most leaf is 1.

00:01:56.940 --> 00:01:59.600
The next leaf has
an error of 0.8.

00:01:59.600 --> 00:02:04.340
And the next leaf has an error
of 2, for a total error of 3.8.

00:02:04.340 --> 00:02:09.630
The total penalty
is thus 3.8+0.5*2,

00:02:09.630 --> 00:02:14.220
for a total penalty of 4.8.

00:02:14.220 --> 00:02:16.950
Notice that if we pick
a large value of lambda,

00:02:16.950 --> 00:02:18.970
we won't make many
splits, because you

00:02:18.970 --> 00:02:21.380
pay a big price for every
additional split that

00:02:21.380 --> 00:02:24.470
will outweigh the
decrease in error.

00:02:24.470 --> 00:02:27.040
If we pick a small,
or 0 value of lambda,

00:02:27.040 --> 00:02:29.960
it will make splits until it
no longer decreases the error.

00:02:32.650 --> 00:02:35.690
You may be wondering at this
point, the definition of cp

00:02:35.690 --> 00:02:37.750
is what, exactly?

00:02:37.750 --> 00:02:41.200
Well, it's very closely
related to lambda.

00:02:41.200 --> 00:02:44.020
Considering a tree
with no splits,

00:02:44.020 --> 00:02:46.740
we simply take the
average of our data,

00:02:46.740 --> 00:02:48.890
calculate RSS for
that so-called tree,

00:02:48.890 --> 00:02:52.540
and let us call that
RSS for no splits.

00:02:52.540 --> 00:02:54.370
Then we can define
cp=lambda/RSS(no splits).

00:02:58.950 --> 00:03:01.880
When you're actually
using cp in your R code,

00:03:01.880 --> 00:03:05.000
you don't need to think
exactly what it means-- just

00:03:05.000 --> 00:03:08.420
that small numbers of cp
encourage large trees,

00:03:08.420 --> 00:03:12.400
and large values of cp
encourage small trees.

00:03:12.400 --> 00:03:15.450
Let's go back to R now,
and apply cross-validation

00:03:15.450 --> 00:03:17.720
to our training data.