1
00:00:04,500 --> 00:00:08,690
In our R Console, let's start
by loading our data set.
2
00:00:08,690 --> 00:00:11,430
Don't forget to make sure you're
in the directory containing
3
00:00:11,430 --> 00:00:15,340
the file wine.csv first.
4
00:00:15,340 --> 00:00:18,440
We'll call our data
frame wine, and we'll
5
00:00:18,440 --> 00:00:23,790
use the read.csv function to
read in the data file wine.csv.
6
00:00:27,930 --> 00:00:30,030
We can look at the
structure of our data
7
00:00:30,030 --> 00:00:31,840
by using the str function.
8
00:00:35,300 --> 00:00:37,120
We can see that we
have a data frame
9
00:00:37,120 --> 00:00:42,130
with 25 observations of
seven different variables.
10
00:00:42,130 --> 00:00:45,600
Year gives the year
the wine was produced,
11
00:00:45,600 --> 00:00:47,340
and it's just a
unique identifier
12
00:00:47,340 --> 00:00:49,530
for each observation.
13
00:00:49,530 --> 00:00:53,690
Price is the dependent variable
we're trying to predict.
14
00:00:53,690 --> 00:01:01,130
And WinterRain, AGST,
HarvestRain, Age, and FrancePop
15
00:01:01,130 --> 00:01:05,830
are the independent variables
we'll use to predict Price.
16
00:01:05,830 --> 00:01:09,510
We can also look at the
statistical summary of our data
17
00:01:09,510 --> 00:01:10,730
using the summary function.
18
00:01:15,160 --> 00:01:17,590
This gives us information
about the range
19
00:01:17,590 --> 00:01:22,510
of values for each
variable in our data set.
20
00:01:22,510 --> 00:01:26,780
Let's now create a one-variable
linear regression equation
21
00:01:26,780 --> 00:01:30,910
using AGST to predict Price.
22
00:01:30,910 --> 00:01:35,530
We'll call our
regression model model1,
23
00:01:35,530 --> 00:01:40,090
and we'll use the lm function,
which stands for linear model.
24
00:01:40,090 --> 00:01:42,120
This is the function
we'll use whenever
25
00:01:42,120 --> 00:01:45,470
we want to build a
linear regression model.
26
00:01:45,470 --> 00:01:51,060
Then inside parentheses, type
Price, our dependent variable,
27
00:01:51,060 --> 00:01:53,420
and then a tilde
symbol, and then
28
00:01:53,420 --> 00:01:58,780
AGST, the independent variable
we'll use in this model.
29
00:01:58,780 --> 00:02:04,550
Then after a comma, we
need to add data = wine
30
00:02:04,550 --> 00:02:06,910
to tell the lm
function what data
31
00:02:06,910 --> 00:02:10,729
set to use to build the model.
32
00:02:10,729 --> 00:02:13,310
We're saving the output
of the lm function
33
00:02:13,310 --> 00:02:15,790
to the variable named model1.
34
00:02:15,790 --> 00:02:18,810
So when we hit Enter,
we didn't see any output
35
00:02:18,810 --> 00:02:22,620
because it's been saved
to the variable model1.
36
00:02:22,620 --> 00:02:25,250
Let's take a look at
the summary of model1.
37
00:02:28,790 --> 00:02:31,040
The first thing we
see is a description
38
00:02:31,040 --> 00:02:34,370
of the function we used
to build the model.
39
00:02:34,370 --> 00:02:39,070
Then we see a summary of the
residuals or error terms.
40
00:02:39,070 --> 00:02:42,010
Following that is a
description of the coefficients
41
00:02:42,010 --> 00:02:43,320
of our model.
42
00:02:43,320 --> 00:02:46,850
The first row corresponds
to the intercept term,
43
00:02:46,850 --> 00:02:50,700
and the second row corresponds
to our independent variable,
44
00:02:50,700 --> 00:02:53,250
AGST.
45
00:02:53,250 --> 00:02:55,840
The Estimate column
gives estimates
46
00:02:55,840 --> 00:02:58,170
of the beta values
for our model.
47
00:02:58,170 --> 00:03:00,870
So here beta 0,
or the coefficient
48
00:03:00,870 --> 00:03:05,820
for the intercept term,
is estimated to be -3.4.
49
00:03:05,820 --> 00:03:10,140
And beta 1, or the coefficient
for our independent variable,
50
00:03:10,140 --> 00:03:13,770
is estimated to be 0.635.
51
00:03:13,770 --> 00:03:16,120
There's additional
information in this table
52
00:03:16,120 --> 00:03:19,470
that we'll discuss
in the next video.
53
00:03:19,470 --> 00:03:21,370
Towards the bottom
of the output,
54
00:03:21,370 --> 00:03:26,320
you can see Multiple
R-squared, 0.435,
55
00:03:26,320 --> 00:03:28,250
which is the R-squared
value that we
56
00:03:28,250 --> 00:03:30,840
discussed in the previous video.
57
00:03:30,840 --> 00:03:35,350
Beside it is a number
labeled Adjusted R-squared.
58
00:03:35,350 --> 00:03:38,290
In this case, it's 0.41.
59
00:03:38,290 --> 00:03:40,940
This number adjusts
the R-squared value
60
00:03:40,940 --> 00:03:44,780
to account for the number of
independent variables used
61
00:03:44,780 --> 00:03:48,020
relative to the
number of data points.
62
00:03:48,020 --> 00:03:50,860
Multiple R-squared
will always increase
63
00:03:50,860 --> 00:03:53,340
if you add more
independent variables.
64
00:03:53,340 --> 00:03:55,860
But Adjusted R-squared
will decrease
65
00:03:55,860 --> 00:03:58,350
if you add an
independent variable that
66
00:03:58,350 --> 00:04:00,450
doesn't help the model.
67
00:04:00,450 --> 00:04:03,930
This is a good way to determine
if an additional variable
68
00:04:03,930 --> 00:04:06,790
should even be
included in the model.
69
00:04:06,790 --> 00:04:09,140
We'll also discuss
other ways to select
70
00:04:09,140 --> 00:04:13,640
important independent
variables in the next video.
71
00:04:13,640 --> 00:04:17,200
Let's also compute the sum
of squared errors, or SSE,
72
00:04:17,200 --> 00:04:18,940
for our model.
73
00:04:18,940 --> 00:04:22,600
Our residuals, or error terms,
are stored in the vector
74
00:04:22,600 --> 00:04:23,300
model1$residuals.
75
00:04:29,050 --> 00:04:33,810
By hitting Enter, we can see the
values of all of our residuals.
76
00:04:33,810 --> 00:04:37,990
We can compute the Sum of
Squared Errors, or SSE,
77
00:04:37,990 --> 00:04:40,790
by taking the
sum(model1$residuals^2).
78
00:04:47,190 --> 00:04:50,350
If we type SSE and
hit Enter, we can
79
00:04:50,350 --> 00:04:56,659
see that our sum of
squared errors is 5.73.
80
00:04:56,659 --> 00:04:59,360
Now let's add another
variable to our regression
81
00:04:59,360 --> 00:05:01,570
model, HarvestRain.
82
00:05:01,570 --> 00:05:03,870
We'll call our new model model2.
83
00:05:07,060 --> 00:05:11,370
And again, we'll use the lm
function to predict Price,
84
00:05:11,370 --> 00:05:16,660
but this time using
AGST and HarvestRain.
85
00:05:20,050 --> 00:05:23,170
When you want to use more
than one independent variable,
86
00:05:23,170 --> 00:05:27,030
you can just separate them with
a plus sign like we did here.
87
00:05:27,030 --> 00:05:29,960
Then we again need to
indicate that the data that
88
00:05:29,960 --> 00:05:31,910
should be used is wine.
89
00:05:35,170 --> 00:05:38,190
Let's take a look at the
summary of our new model
90
00:05:38,190 --> 00:05:39,380
using the summary function.
91
00:05:43,900 --> 00:05:46,230
We have a third row in
our Coefficients table
92
00:05:46,230 --> 00:05:48,940
now corresponding
to HarvestRain.
93
00:05:48,940 --> 00:05:52,630
The coefficient estimate for
this new independent variable
94
00:05:52,630 --> 00:05:56,560
is negative 0.00457.
95
00:05:56,560 --> 00:05:59,960
And if you look at the R-squared
near the bottom of the output,
96
00:05:59,960 --> 00:06:03,390
you can see that this variable
really helped our model.
97
00:06:03,390 --> 00:06:06,550
Our Multiple R-squared
and Adjusted R-squared
98
00:06:06,550 --> 00:06:11,180
both increased significantly
compared to the previous model.
99
00:06:11,180 --> 00:06:13,790
This indicates that this
new model is probably
100
00:06:13,790 --> 00:06:16,870
better than the previous model.
101
00:06:16,870 --> 00:06:19,310
Let's now also compute
the sum of squared errors
102
00:06:19,310 --> 00:06:21,200
for this new model.
103
00:06:21,200 --> 00:06:24,570
So SSE equals, and then
sum(model2$residuals^2).
104
00:06:31,960 --> 00:06:35,800
If we type SSE, we can see
that the sum of squared errors
105
00:06:35,800 --> 00:06:39,500
for model2 is
2.97, which is much
106
00:06:39,500 --> 00:06:43,250
better than the sum of
squared errors for model1.
107
00:06:43,250 --> 00:06:45,850
Now let's build a
third model, this time
108
00:06:45,850 --> 00:06:48,510
with all of our
independent variables.
109
00:06:48,510 --> 00:06:51,700
We'll call this one model3.
110
00:06:51,700 --> 00:07:01,430
And again, use the lm function
to predict Price using AGST
111
00:07:01,430 --> 00:07:19,160
and HarvestRain and WinterRain
and Age and FrancePop.
112
00:07:22,170 --> 00:07:24,420
Again, we need to
tell the lm function
113
00:07:24,420 --> 00:07:28,340
to use the data set wine.
114
00:07:28,340 --> 00:07:30,290
Let's take a look at
the summary of model3.
115
00:07:35,810 --> 00:07:40,650
Now the Coefficients table has
six rows, one for the intercept
116
00:07:40,650 --> 00:07:44,159
and one for each of the
five independent variables.
117
00:07:44,159 --> 00:07:46,290
If we look at the
bottom of the output,
118
00:07:46,290 --> 00:07:49,750
we can again see that the
Multiple R-squared and Adjusted
119
00:07:49,750 --> 00:07:53,310
R-squared have both increased.
120
00:07:53,310 --> 00:07:55,460
Let's now compute the
sum of squared errors
121
00:07:55,460 --> 00:07:57,140
for this new model.
122
00:07:57,140 --> 00:08:00,240
SSE equals the
sum(model3$residuals^2).
123
00:08:09,350 --> 00:08:13,430
And if we type SSE, we can see
that the sum of squared errors
124
00:08:13,430 --> 00:08:18,830
for model3 is 1.7, even
better than before.
125
00:08:18,830 --> 00:08:21,040
In the next video,
we'll determine
126
00:08:21,040 --> 00:08:25,160
if we should keep all of these
variables in our final model.