WEBVTT

00:00:05.040 --> 00:00:07.920
Let's see how
regression trees do.

00:00:07.920 --> 00:00:11.450
We'll first load
the rpart library

00:00:11.450 --> 00:00:16.550
and also load the
rpart plotting library.

00:00:16.550 --> 00:00:18.800
We build a regression
tree in the same way

00:00:18.800 --> 00:00:22.400
we would build a classification
tree, using the rpart command.

00:00:25.390 --> 00:00:29.570
We predict MEDV as a function
of latitude and longitude,

00:00:29.570 --> 00:00:30.620
using the Boston dataset.

00:00:33.950 --> 00:00:37.870
If we now plot the tree
using the prp command, which

00:00:37.870 --> 00:00:43.620
is to find an rpart.plot, we
can see it makes a lot of splits

00:00:43.620 --> 00:00:45.850
and is a little bit
hard to interpret.

00:00:45.850 --> 00:00:49.340
But the important thing
is look at the leaves.

00:00:49.340 --> 00:00:51.000
In a classification
tree, the leaves

00:00:51.000 --> 00:00:54.680
would be the
classification we assign

00:00:54.680 --> 00:00:58.000
that these splits
would apply to.

00:00:58.000 --> 00:01:02.930
But in regression trees, we
instead predict a number.

00:01:02.930 --> 00:01:06.230
That number is the average
of the median house

00:01:06.230 --> 00:01:11.080
prices in that bucket or leaf.

00:01:11.080 --> 00:01:14.160
So let's see what that
means in practice.

00:01:14.160 --> 00:01:18.460
So we'll plot again the
latitude-- the points.

00:01:22.910 --> 00:01:29.590
And we'll again plot the points
with above median prices.

00:01:29.590 --> 00:01:33.180
I just scrolled up from my
command history to do that.

00:01:33.180 --> 00:01:35.650
Now we want to predict
what the tree thinks

00:01:35.650 --> 00:01:38.310
is above median, just like we
did with linear regression.

00:01:38.310 --> 00:01:42.030
So we'll say the
fitted values we

00:01:42.030 --> 00:01:45.460
can get from using the predict
command on the tree we just

00:01:45.460 --> 00:01:45.960
built.

00:01:49.070 --> 00:01:51.400
And we can do another
points command,

00:01:51.400 --> 00:01:53.710
just like we did before.

00:01:53.710 --> 00:02:09.830
The fitted values are greater
than 21.2 The color is blue.

00:02:09.830 --> 00:02:14.090
And the character
is a dollar sign.

00:02:17.210 --> 00:02:19.730
Now we see that we've
done a much better job

00:02:19.730 --> 00:02:21.930
than linear regression
was able to do.

00:02:21.930 --> 00:02:25.840
We've correctly left the
low value area in Boston

00:02:25.840 --> 00:02:29.110
and below out, and
we've correctly

00:02:29.110 --> 00:02:30.780
managed to classify
some of those points

00:02:30.780 --> 00:02:33.600
in the bottom right
and top right.

00:02:33.600 --> 00:02:36.270
We're still making
mistakes, but we're

00:02:36.270 --> 00:02:38.740
able to make a
nonlinear prediction

00:02:38.740 --> 00:02:41.600
on latitude and longitude.

00:02:41.600 --> 00:02:45.840
So that's interesting, but
the tree was very complicated.

00:02:45.840 --> 00:02:49.420
So maybe it's
drastically overfitting.

00:02:49.420 --> 00:02:53.230
Can we get most of this effect
with a much simpler tree?

00:02:53.230 --> 00:02:53.940
We can.

00:02:53.940 --> 00:02:56.170
We would just change
the minbucket size.

00:02:56.170 --> 00:03:01.930
So let's build a new tree
using the rpart command again:

00:03:01.930 --> 00:03:08.800
MEDV as a function of LAT
and LON, the data=boston.

00:03:08.800 --> 00:03:11.840
But this time we'll say the
minbucket size must be 50.

00:03:14.620 --> 00:03:20.270
We'll use the other way
of plotting trees, plot,

00:03:20.270 --> 00:03:22.110
and we'll add text
to the text command.

00:03:25.820 --> 00:03:27.710
And we see we have
far fewer splits,

00:03:27.710 --> 00:03:30.040
and it's far more interpretable.

00:03:30.040 --> 00:03:32.590
The first split says
if the longitude

00:03:32.590 --> 00:03:34.970
is greater than or equal
to negative 71.07--

00:03:34.970 --> 00:03:36.890
so if you're on the right
side of the picture.

00:03:39.510 --> 00:03:41.510
So the left-hand branch
is on the left-hand side

00:03:41.510 --> 00:03:42.960
of the picture and
the right-hand--

00:03:42.960 --> 00:03:44.810
So the left-hand
side of the tree

00:03:44.810 --> 00:03:48.079
corresponds to the
right-hand side of the map.

00:03:48.079 --> 00:03:49.829
And the right side of
the tree corresponds

00:03:49.829 --> 00:03:51.770
to the left side of the map.

00:03:51.770 --> 00:03:54.340
That's a little
bit of a mouthful.

00:03:54.340 --> 00:03:55.990
Let's see what it
means visually.

00:03:55.990 --> 00:04:01.820
So we'll remember
these values, and we'll

00:04:01.820 --> 00:04:07.460
plot the longitude
and latitude again.

00:04:07.460 --> 00:04:08.510
So here's our map.

00:04:08.510 --> 00:04:09.120
OK.

00:04:09.120 --> 00:04:12.680
So the first split
was on longitude,

00:04:12.680 --> 00:04:16.000
and it was negative 71.07.

00:04:16.000 --> 00:04:19.570
So there's a very handy
command, "abline,"

00:04:19.570 --> 00:04:23.670
which can put horizontal
or vertical lines easily.

00:04:23.670 --> 00:04:27.070
So we're going to put
a vertical line, so v,

00:04:27.070 --> 00:04:32.070
and we wanted to plot
it at negative 71.07.

00:04:32.070 --> 00:04:32.850
OK.

00:04:32.850 --> 00:04:34.760
So that's that first
split from the tree.

00:04:34.760 --> 00:04:37.159
It corresponds to being on
either the left or right-hand

00:04:37.159 --> 00:04:40.909
side of this tree.

00:04:40.909 --> 00:04:48.240
We'll plot the-- what we want to
do is, we'll focus on one area.

00:04:48.240 --> 00:04:53.040
We'll focus on the lowest
price prediction, which

00:04:53.040 --> 00:04:55.050
is in the bottom left
corner of the tree,

00:04:55.050 --> 00:04:57.140
right down the bottom left
after all those splits.

00:04:57.140 --> 00:04:58.550
So that's where
we want to get to.

00:04:58.550 --> 00:05:01.640
So let's plot again the points.

00:05:01.640 --> 00:05:03.400
Plot a vertical line.

00:05:03.400 --> 00:05:06.340
The next split down towards
that bottom left corner

00:05:06.340 --> 00:05:12.900
was a horizontal line at 42.21.

00:05:12.900 --> 00:05:14.970
So I put that in.

00:05:14.970 --> 00:05:15.770
That's interesting.

00:05:15.770 --> 00:05:18.000
So that line corresponds
pretty much to where

00:05:18.000 --> 00:05:20.940
the Charles River
was from before.

00:05:20.940 --> 00:05:23.600
The final split you need to get
to that bottom left corner I

00:05:23.600 --> 00:05:27.840
was pointing out is 42.17.

00:05:27.840 --> 00:05:31.260
It was above this line.

00:05:31.260 --> 00:05:32.810
And now that's interesting.

00:05:32.810 --> 00:05:35.960
If we look at the right side
of the middle of the three

00:05:35.960 --> 00:05:37.970
rectangles on the
right side, that

00:05:37.970 --> 00:05:39.340
is the bucket we
were predicting.

00:05:39.340 --> 00:05:42.820
And it corresponds to that
rectangle, those areas.

00:05:42.820 --> 00:05:47.890
That's the South Boston low
price area we saw before.

00:05:47.890 --> 00:05:52.360
So maybe we can make that
more clear by plotting, now,

00:05:52.360 --> 00:05:53.900
the high value prices.

00:05:53.900 --> 00:05:59.210
So let's go back up to where
we plotted all the red dots

00:05:59.210 --> 00:06:00.700
and overlay it.

00:06:00.700 --> 00:06:02.470
So this makes it
even more clear.

00:06:02.470 --> 00:06:06.510
We've correctly shown how the
regression tree carves out

00:06:06.510 --> 00:06:09.120
that rectangle in
the bottom of Boston

00:06:09.120 --> 00:06:13.460
and says that is
a low value area.

00:06:13.460 --> 00:06:15.640
So that's actually
very interesting.

00:06:15.640 --> 00:06:17.690
It's shown us something
that regression trees can

00:06:17.690 --> 00:06:19.890
do that we would never expect
linear regression to be

00:06:19.890 --> 00:06:20.880
able to do.

00:06:20.880 --> 00:06:23.180
So the question we're going
to answer in the next video

00:06:23.180 --> 00:06:26.600
is given that regression trees
can do these fancy things

00:06:26.600 --> 00:06:28.020
with latitude and
longitude, is it

00:06:28.020 --> 00:06:29.890
actually going to help
us to be able to build

00:06:29.890 --> 00:06:32.220
predictive models,
predicting house prices?

00:06:32.220 --> 00:06:34.380
Well, we'll have to see.