WEBVTT

00:00:04.520 --> 00:00:08.650
Before we jump into R,
let's understand the data.

00:00:08.650 --> 00:00:10.900
Each entry of this
data set corresponds

00:00:10.900 --> 00:00:14.560
to a census tract, a statistical
division of the area that

00:00:14.560 --> 00:00:18.250
is used by researchers to
break down towns and cities.

00:00:18.250 --> 00:00:21.250
As a result, there will usually
be multiple census tracts

00:00:21.250 --> 00:00:23.020
per town.

00:00:23.020 --> 00:00:25.620
LON and LAT are the
longitude and latitude

00:00:25.620 --> 00:00:28.740
of the center of
the census tract.

00:00:28.740 --> 00:00:33.240
MEDV is the median value of
owner-occupied homes, measured

00:00:33.240 --> 00:00:36.280
in thousands of dollars.

00:00:36.280 --> 00:00:39.650
CRIM is the per
capita crime rate.

00:00:39.650 --> 00:00:42.220
ZN is related to
how much of the land

00:00:42.220 --> 00:00:45.560
is zoned for large
residential properties.

00:00:45.560 --> 00:00:50.270
INDUS is the proportion of
the area used for industry.

00:00:50.270 --> 00:00:53.920
CHAS is 1 if a census tract
is next to the Charles

00:00:53.920 --> 00:00:56.870
River, which I drew before.

00:00:56.870 --> 00:00:59.860
NOX is the concentration
of nitrous oxides

00:00:59.860 --> 00:01:03.290
in the air, a measure
of air pollution.

00:01:03.290 --> 00:01:07.830
RM is the average number
of rooms per dwelling.

00:01:07.830 --> 00:01:10.530
AGE is the proportion
of owner-occupied units

00:01:10.530 --> 00:01:13.550
built before 1940.

00:01:13.550 --> 00:01:15.950
DIS is a measure of
how far the tract is

00:01:15.950 --> 00:01:18.690
from centers of
employment in Boston.

00:01:18.690 --> 00:01:22.510
RAD is a measure of closeness
to important highways.

00:01:22.510 --> 00:01:26.560
TAX is the property tax
per $10,000 of value.

00:01:26.560 --> 00:01:31.070
And PTRATIO is the pupil
to teacher ratio by town.

00:01:31.070 --> 00:01:34.039
Let's switch over to R now.

00:01:34.039 --> 00:01:38.509
So let's begin to analyze our
data set with R. First of all,

00:01:38.509 --> 00:01:41.880
we'll load the data set
into the Boston variable.

00:01:46.770 --> 00:01:49.300
If we look at the structure
of the Boston data set,

00:01:49.300 --> 00:01:52.880
we can see all the variables
we talked about before.

00:01:52.880 --> 00:01:55.580
There are 506
observations corresponding

00:01:55.580 --> 00:02:01.050
to 506 census tracts in
the Greater Boston area.

00:02:01.050 --> 00:02:03.640
We are interested in
building a model initially

00:02:03.640 --> 00:02:07.280
of how prices vary by
location across a region.

00:02:07.280 --> 00:02:10.460
So let's first see how
the points are laid out.

00:02:10.460 --> 00:02:16.190
Using the plot commands, we can
plot the latitude and longitude

00:02:16.190 --> 00:02:19.100
of each of our census tracts.

00:02:21.850 --> 00:02:24.060
This picture might be a
little bit meaningless to you

00:02:24.060 --> 00:02:28.180
if you're not familiar with
the Massachusetts-Boston area,

00:02:28.180 --> 00:02:31.050
but I can tell you that the
dense central core of points

00:02:31.050 --> 00:02:33.550
corresponds to Boston
city, Cambridge

00:02:33.550 --> 00:02:38.700
city, and other
close urban cities.

00:02:38.700 --> 00:02:41.180
Still, let's try and relate
it back to that picture

00:02:41.180 --> 00:02:44.030
we saw in the first video,
where I showed you the river

00:02:44.030 --> 00:02:44.980
and where MIT was.

00:02:44.980 --> 00:02:49.040
So we want to show all
the points that lie along

00:02:49.040 --> 00:02:50.860
the Charles River in
a different color.

00:02:50.860 --> 00:02:53.070
We have a variable,
CHAS, that tells us

00:02:53.070 --> 00:02:55.860
if a point is on the
Charles River or not.

00:02:55.860 --> 00:02:58.760
So to put points on an
already-existing plot,

00:02:58.760 --> 00:03:00.920
we can use the
points command, which

00:03:00.920 --> 00:03:04.560
looks very similar
to the plot command,

00:03:04.560 --> 00:03:07.560
except it operates in a
plot that already exists.

00:03:07.560 --> 00:03:12.520
So let's plot just the points
where the Charles River

00:03:12.520 --> 00:03:14.880
variable is set to one.

00:03:27.880 --> 00:03:30.090
Up to now it looks pretty
much like the plot command,

00:03:30.090 --> 00:03:32.240
but here's where it's
about to get interesting.

00:03:32.240 --> 00:03:35.240
We can pass a
color, such as blue,

00:03:35.240 --> 00:03:37.050
to plot all these
points in blue.

00:03:37.050 --> 00:03:39.329
And this would plot
blue hollow circles

00:03:39.329 --> 00:03:41.030
on top of the black
hollow circles.

00:03:41.030 --> 00:03:42.870
Which would look
all right, but I

00:03:42.870 --> 00:03:45.880
think I'd much prefer
to have solid blue dots.

00:03:45.880 --> 00:03:47.940
To control how the
points are plotted,

00:03:47.940 --> 00:03:52.210
we use a PCH option, which you
can read about more in the help

00:03:52.210 --> 00:03:54.760
documentation for
the points command.

00:03:54.760 --> 00:03:57.520
But I'm going to
use PCH 19, which

00:03:57.520 --> 00:04:01.620
is a solid version of the dots
we already have on our plot.

00:04:01.620 --> 00:04:03.290
So by running this
command, you see

00:04:03.290 --> 00:04:06.470
we have some blue
dots in our plot now.

00:04:06.470 --> 00:04:10.910
These are the census tracts that
lie along the Charles River.

00:04:10.910 --> 00:04:12.750
But maybe it's still a
little bit confusing,

00:04:12.750 --> 00:04:15.820
and you'd like to know where
MIT is in this picture.

00:04:15.820 --> 00:04:17.470
So we can do that too.

00:04:17.470 --> 00:04:22.660
I looked up which
census tract MIT is in,

00:04:22.660 --> 00:04:27.520
and it's census tract 3531.

00:04:27.520 --> 00:04:29.000
So let's plot that.

00:04:29.000 --> 00:04:36.780
We add another point,
the longitude of MIT,

00:04:36.780 --> 00:04:42.520
which is in tract 3531,
and the latitude of MIT,

00:04:42.520 --> 00:04:51.040
which is in census tract 3531.

00:04:51.040 --> 00:04:52.990
I'm going to plot
this one in red,

00:04:52.990 --> 00:04:57.280
so we can tell it apart from
the other Charles River dots.

00:04:57.280 --> 00:05:01.540
And again, I'm going to
use a solid dot to do it.

00:05:01.540 --> 00:05:03.090
Can you see it on
the little picture?

00:05:03.090 --> 00:05:05.420
This little red dot,
right in the middle.

00:05:05.420 --> 00:05:06.880
That's exactly what
we were looking

00:05:06.880 --> 00:05:11.980
at from the picture in Video One

00:05:11.980 --> 00:05:13.410
What other things can we do?

00:05:13.410 --> 00:05:17.220
Well, this data set was
originally constructed

00:05:17.220 --> 00:05:19.000
to investigate
questions about how

00:05:19.000 --> 00:05:21.230
air pollution affects prices.

00:05:21.230 --> 00:05:24.360
So the air pollution variable
is this NOX variable.

00:05:24.360 --> 00:05:28.020
Let's have a look at
a distribution of NOX.

00:05:31.190 --> 00:05:32.840
boston$NOX.

00:05:32.840 --> 00:05:37.260
So we see that the
minimum value is 0.385,

00:05:37.260 --> 00:05:41.280
the maximum value is
0.87 and the median

00:05:41.280 --> 00:05:46.350
and the mean are
about 0.53, 0.55.

00:05:46.350 --> 00:05:49.790
So let's just use
the value of 0.55,

00:05:49.790 --> 00:05:51.950
it's kind of in the middle.

00:05:51.950 --> 00:05:53.810
And we'll look at
just the census

00:05:53.810 --> 00:05:56.970
tracts that have
above-average pollution.

00:05:56.970 --> 00:06:00.200
So we'll use the
points command again

00:06:00.200 --> 00:06:01.550
to plot just those points.

00:06:04.100 --> 00:06:11.110
So, points, the latitude--no
the longitude first.

00:06:11.110 --> 00:06:15.710
So we want the census
tracts with NOX levels

00:06:15.710 --> 00:06:19.540
greater than or equal to 0.55.

00:06:19.540 --> 00:06:24.580
We want the latitude of
those same census tracks.

00:06:24.580 --> 00:06:29.210
Again, only if the NOX
is greater than 0.55.

00:06:29.210 --> 00:06:32.600
And I guess a suitable
color for nasty pollution

00:06:32.600 --> 00:06:35.140
would be a bright green.

00:06:35.140 --> 00:06:40.280
And again, we'll
use the solid dots.

00:06:40.280 --> 00:06:44.530
So you can see it is pretty much
the same as the other commands.

00:06:44.530 --> 00:06:46.490
Wow okay.

00:06:46.490 --> 00:06:49.360
So those are all the points have
got above-average pollution.

00:06:49.360 --> 00:06:51.200
Looks like my office
is right in the middle.

00:06:53.620 --> 00:06:55.080
Now it kind of
makes sense, though,

00:06:55.080 --> 00:06:58.750
because that's the dense
urban core of Boston.

00:06:58.750 --> 00:07:00.870
If you think of anywhere
where pollution would be,

00:07:00.870 --> 00:07:03.830
you'd think it'd be where the
most cars and the most people

00:07:03.830 --> 00:07:04.330
are.

00:07:06.920 --> 00:07:10.090
So that's kind of interesting.

00:07:10.090 --> 00:07:12.240
Now, before we do
anything more, we

00:07:12.240 --> 00:07:14.450
should probably look at how
prices vary over the area

00:07:14.450 --> 00:07:16.430
as well.

00:07:16.430 --> 00:07:18.540
So let's make a new plot.

00:07:18.540 --> 00:07:20.340
This one's got a few
too many things on it.

00:07:20.340 --> 00:07:24.240
So we'll just plot
again the longitude

00:07:24.240 --> 00:07:26.910
and the latitude for
all census tracts.

00:07:26.910 --> 00:07:30.370
That kind of resets our plot.

00:07:30.370 --> 00:07:34.050
If we look at the distribution
of the housing prices (Boston

00:07:34.050 --> 00:07:40.460
MEDV), we see that
the minimum price

00:07:40.460 --> 00:07:43.500
(remember units are
thousands of dollars,

00:07:43.500 --> 00:07:45.740
so the median value of
owner-occupied homes

00:07:45.740 --> 00:07:51.350
is in thousands of
dollars) is around five,

00:07:51.350 --> 00:07:53.730
the maximum is around 50.

00:07:53.730 --> 00:07:59.750
So let's plot again only the
above-average price points.

00:07:59.750 --> 00:08:01.880
So we'll go:
points(boston$LON[boston$MEDV>=21.2].

00:08:01.880 --> 00:08:24.720
We can also plot the latitude:
boston$LATboston$LAT[boston$MEDV>=21.2].

00:08:24.720 --> 00:08:26.610
We'll reuse that red
color we used for MIT.

00:08:29.310 --> 00:08:32.120
And one more time, with
we'll do the solid dots.

00:08:34.760 --> 00:08:38.650
So what we see now are
all the census tracts

00:08:38.650 --> 00:08:43.110
with above-average
housing prices.

00:08:43.110 --> 00:08:46.140
As you can see, it's
definitely not simple.

00:08:46.140 --> 00:08:50.510
There's census tracts of
above-average and below-average

00:08:50.510 --> 00:08:52.820
mixed in between each other.

00:08:52.820 --> 00:08:54.640
But there are some patterns.

00:08:54.640 --> 00:08:59.130
For example, look at that
dense black bit in the middle.

00:08:59.130 --> 00:09:01.710
That corresponds to most
of the city of Boston,

00:09:01.710 --> 00:09:04.670
especially the southern
parts of the city.

00:09:04.670 --> 00:09:06.810
Also, on the Cambridge
side of the river,

00:09:06.810 --> 00:09:09.580
there's a big chunk there
of dots that are black,

00:09:09.580 --> 00:09:13.930
that are not red, that are
also presumably below average.

00:09:13.930 --> 00:09:16.770
So there's definitely
some structure to it,

00:09:16.770 --> 00:09:18.970
but it's certainly
not simple in relation

00:09:18.970 --> 00:09:21.420
to latitude and
longitude at least.

00:09:21.420 --> 00:09:24.450
We will explore this
more in the next video.