WEBVTT
00:00:04.500 --> 00:00:08.690
In our R Console, let's start
by loading our data set.
00:00:08.690 --> 00:00:11.430
Don't forget to make sure you're
in the directory containing
00:00:11.430 --> 00:00:15.340
the file wine.csv first.
00:00:15.340 --> 00:00:18.440
We'll call our data
frame wine, and we'll
00:00:18.440 --> 00:00:23.790
use the read.csv function to
read in the data file wine.csv.
00:00:27.930 --> 00:00:30.030
We can look at the
structure of our data
00:00:30.030 --> 00:00:31.840
by using the str function.
00:00:35.300 --> 00:00:37.120
We can see that we
have a data frame
00:00:37.120 --> 00:00:42.130
with 25 observations of
seven different variables.
00:00:42.130 --> 00:00:45.600
Year gives the year
the wine was produced,
00:00:45.600 --> 00:00:47.340
and it's just a
unique identifier
00:00:47.340 --> 00:00:49.530
for each observation.
00:00:49.530 --> 00:00:53.690
Price is the dependent variable
we're trying to predict.
00:00:53.690 --> 00:01:01.130
And WinterRain, AGST,
HarvestRain, Age, and FrancePop
00:01:01.130 --> 00:01:05.830
are the independent variables
we'll use to predict Price.
00:01:05.830 --> 00:01:09.510
We can also look at the
statistical summary of our data
00:01:09.510 --> 00:01:10.730
using the summary function.
00:01:15.160 --> 00:01:17.590
This gives us information
about the range
00:01:17.590 --> 00:01:22.510
of values for each
variable in our data set.
00:01:22.510 --> 00:01:26.780
Let's now create a one-variable
linear regression equation
00:01:26.780 --> 00:01:30.910
using AGST to predict Price.
00:01:30.910 --> 00:01:35.530
We'll call our
regression model model1,
00:01:35.530 --> 00:01:40.090
and we'll use the lm function,
which stands for linear model.
00:01:40.090 --> 00:01:42.120
This is the function
we'll use whenever
00:01:42.120 --> 00:01:45.470
we want to build a
linear regression model.
00:01:45.470 --> 00:01:51.060
Then inside parentheses, type
Price, our dependent variable,
00:01:51.060 --> 00:01:53.420
and then a tilde
symbol, and then
00:01:53.420 --> 00:01:58.780
AGST, the independent variable
we'll use in this model.
00:01:58.780 --> 00:02:04.550
Then after a comma, we
need to add data = wine
00:02:04.550 --> 00:02:06.910
to tell the lm
function what data
00:02:06.910 --> 00:02:10.729
set to use to build the model.
00:02:10.729 --> 00:02:13.310
We're saving the output
of the lm function
00:02:13.310 --> 00:02:15.790
to the variable named model1.
00:02:15.790 --> 00:02:18.810
So when we hit Enter,
we didn't see any output
00:02:18.810 --> 00:02:22.620
because it's been saved
to the variable model1.
00:02:22.620 --> 00:02:25.250
Let's take a look at
the summary of model1.
00:02:28.790 --> 00:02:31.040
The first thing we
see is a description
00:02:31.040 --> 00:02:34.370
of the function we used
to build the model.
00:02:34.370 --> 00:02:39.070
Then we see a summary of the
residuals or error terms.
00:02:39.070 --> 00:02:42.010
Following that is a
description of the coefficients
00:02:42.010 --> 00:02:43.320
of our model.
00:02:43.320 --> 00:02:46.850
The first row corresponds
to the intercept term,
00:02:46.850 --> 00:02:50.700
and the second row corresponds
to our independent variable,
00:02:50.700 --> 00:02:53.250
AGST.
00:02:53.250 --> 00:02:55.840
The Estimate column
gives estimates
00:02:55.840 --> 00:02:58.170
of the beta values
for our model.
00:02:58.170 --> 00:03:00.870
So here beta 0,
or the coefficient
00:03:00.870 --> 00:03:05.820
for the intercept term,
is estimated to be -3.4.
00:03:05.820 --> 00:03:10.140
And beta 1, or the coefficient
for our independent variable,
00:03:10.140 --> 00:03:13.770
is estimated to be 0.635.
00:03:13.770 --> 00:03:16.120
There's additional
information in this table
00:03:16.120 --> 00:03:19.470
that we'll discuss
in the next video.
00:03:19.470 --> 00:03:21.370
Towards the bottom
of the output,
00:03:21.370 --> 00:03:26.320
you can see Multiple
R-squared, 0.435,
00:03:26.320 --> 00:03:28.250
which is the R-squared
value that we
00:03:28.250 --> 00:03:30.840
discussed in the previous video.
00:03:30.840 --> 00:03:35.350
Beside it is a number
labeled Adjusted R-squared.
00:03:35.350 --> 00:03:38.290
In this case, it's 0.41.
00:03:38.290 --> 00:03:40.940
This number adjusts
the R-squared value
00:03:40.940 --> 00:03:44.780
to account for the number of
independent variables used
00:03:44.780 --> 00:03:48.020
relative to the
number of data points.
00:03:48.020 --> 00:03:50.860
Multiple R-squared
will always increase
00:03:50.860 --> 00:03:53.340
if you add more
independent variables.
00:03:53.340 --> 00:03:55.860
But Adjusted R-squared
will decrease
00:03:55.860 --> 00:03:58.350
if you add an
independent variable that
00:03:58.350 --> 00:04:00.450
doesn't help the model.
00:04:00.450 --> 00:04:03.930
This is a good way to determine
if an additional variable
00:04:03.930 --> 00:04:06.790
should even be
included in the model.
00:04:06.790 --> 00:04:09.140
We'll also discuss
other ways to select
00:04:09.140 --> 00:04:13.640
important independent
variables in the next video.
00:04:13.640 --> 00:04:17.200
Let's also compute the sum
of squared errors, or SSE,
00:04:17.200 --> 00:04:18.940
for our model.
00:04:18.940 --> 00:04:22.600
Our residuals, or error terms,
are stored in the vector
00:04:22.600 --> 00:04:23.300
model1$residuals.
00:04:29.050 --> 00:04:33.810
By hitting Enter, we can see the
values of all of our residuals.
00:04:33.810 --> 00:04:37.990
We can compute the Sum of
Squared Errors, or SSE,
00:04:37.990 --> 00:04:40.790
by taking the
sum(model1$residuals^2).
00:04:47.190 --> 00:04:50.350
If we type SSE and
hit Enter, we can
00:04:50.350 --> 00:04:56.659
see that our sum of
squared errors is 5.73.
00:04:56.659 --> 00:04:59.360
Now let's add another
variable to our regression
00:04:59.360 --> 00:05:01.570
model, HarvestRain.
00:05:01.570 --> 00:05:03.870
We'll call our new model model2.
00:05:07.060 --> 00:05:11.370
And again, we'll use the lm
function to predict Price,
00:05:11.370 --> 00:05:16.660
but this time using
AGST and HarvestRain.
00:05:20.050 --> 00:05:23.170
When you want to use more
than one independent variable,
00:05:23.170 --> 00:05:27.030
you can just separate them with
a plus sign like we did here.
00:05:27.030 --> 00:05:29.960
Then we again need to
indicate that the data that
00:05:29.960 --> 00:05:31.910
should be used is wine.
00:05:35.170 --> 00:05:38.190
Let's take a look at the
summary of our new model
00:05:38.190 --> 00:05:39.380
using the summary function.
00:05:43.900 --> 00:05:46.230
We have a third row in
our Coefficients table
00:05:46.230 --> 00:05:48.940
now corresponding
to HarvestRain.
00:05:48.940 --> 00:05:52.630
The coefficient estimate for
this new independent variable
00:05:52.630 --> 00:05:56.560
is negative 0.00457.
00:05:56.560 --> 00:05:59.960
And if you look at the R-squared
near the bottom of the output,
00:05:59.960 --> 00:06:03.390
you can see that this variable
really helped our model.
00:06:03.390 --> 00:06:06.550
Our Multiple R-squared
and Adjusted R-squared
00:06:06.550 --> 00:06:11.180
both increased significantly
compared to the previous model.
00:06:11.180 --> 00:06:13.790
This indicates that this
new model is probably
00:06:13.790 --> 00:06:16.870
better than the previous model.
00:06:16.870 --> 00:06:19.310
Let's now also compute
the sum of squared errors
00:06:19.310 --> 00:06:21.200
for this new model.
00:06:21.200 --> 00:06:24.570
So SSE equals, and then
sum(model2$residuals^2).
00:06:31.960 --> 00:06:35.800
If we type SSE, we can see
that the sum of squared errors
00:06:35.800 --> 00:06:39.500
for model2 is
2.97, which is much
00:06:39.500 --> 00:06:43.250
better than the sum of
squared errors for model1.
00:06:43.250 --> 00:06:45.850
Now let's build a
third model, this time
00:06:45.850 --> 00:06:48.510
with all of our
independent variables.
00:06:48.510 --> 00:06:51.700
We'll call this one model3.
00:06:51.700 --> 00:07:01.430
And again, use the lm function
to predict Price using AGST
00:07:01.430 --> 00:07:19.160
and HarvestRain and WinterRain
and Age and FrancePop.
00:07:22.170 --> 00:07:24.420
Again, we need to
tell the lm function
00:07:24.420 --> 00:07:28.340
to use the data set wine.
00:07:28.340 --> 00:07:30.290
Let's take a look at
the summary of model3.
00:07:35.810 --> 00:07:40.650
Now the Coefficients table has
six rows, one for the intercept
00:07:40.650 --> 00:07:44.159
and one for each of the
five independent variables.
00:07:44.159 --> 00:07:46.290
If we look at the
bottom of the output,
00:07:46.290 --> 00:07:49.750
we can again see that the
Multiple R-squared and Adjusted
00:07:49.750 --> 00:07:53.310
R-squared have both increased.
00:07:53.310 --> 00:07:55.460
Let's now compute the
sum of squared errors
00:07:55.460 --> 00:07:57.140
for this new model.
00:07:57.140 --> 00:08:00.240
SSE equals the
sum(model3$residuals^2).
00:08:09.350 --> 00:08:13.430
And if we type SSE, we can see
that the sum of squared errors
00:08:13.430 --> 00:08:18.830
for model3 is 1.7, even
better than before.
00:08:18.830 --> 00:08:21.040
In the next video,
we'll determine
00:08:21.040 --> 00:08:25.160
if we should keep all of these
variables in our final model.