WEBVTT
00:00:08.019 --> 00:00:13.759
Let's discuss the method Ashenfelter used
to build his model, linear regression.
00:00:13.759 --> 00:00:17.520
We'll start with one-variable linear regression,
which
00:00:17.520 --> 00:00:23.630
just uses one independent variable to predict
the dependent variable.
00:00:23.630 --> 00:00:27.200
This figure shows a plot of one of the independent
variables,
00:00:27.200 --> 00:00:32.800
average growing season temperature,
and the dependent variable, wine price.
00:00:32.800 --> 00:00:36.140
The goal of linear regression is to create
a predictive line
00:00:36.140 --> 00:00:37.890
through the data.
00:00:37.890 --> 00:00:41.940
There are many different lines that
could be drawn to predict wine price using
00:00:41.940 --> 00:00:44.560
average
growing season temperature.
00:00:44.560 --> 00:00:49.260
A simple option would be a flat line at the
average price,
00:00:49.260 --> 00:00:52.460
in this case 7.07.
00:00:52.460 --> 00:00:58.740
The equation for this line is y equals 7.07.
00:00:58.740 --> 00:01:03.719
This linear regression model would predict
7.07 regardless
00:01:03.719 --> 00:01:05.969
of the temperature.
00:01:05.969 --> 00:01:11.280
But it looks like a better line would
have a positive slope, such as this line in
00:01:11.280 --> 00:01:13.530
blue.
00:01:13.530 --> 00:01:25.539
The equation for this line is y equals 0.5*(AGST)
-1.25.
00:01:25.539 --> 00:01:28.729
This linear regression model would predict
a higher price
00:01:28.729 --> 00:01:31.340
when the temperature is higher.
00:01:31.340 --> 00:01:34.329
Let's make this idea a little more formal.
00:01:34.329 --> 00:01:38.319
In general form a one-variable linear regression
model
00:01:38.319 --> 00:01:42.810
is a linear equation to predict the dependent
variable, y,
00:01:42.810 --> 00:01:45.619
using the independent variable, x.
00:01:45.619 --> 00:01:50.439
Beta 0 is the intercept term or intercept
coefficient,
00:01:50.439 --> 00:01:56.759
and Beta 1 is the slope of the line or coefficient
for the independent variable, x.
00:01:56.759 --> 00:02:03.060
For each observation, i, we have data
for the dependent variable Yi and data
00:02:03.060 --> 00:02:06.849
for the independent variable, Xi.
00:02:06.849 --> 00:02:11.569
Using this equation we make a prediction beta
0 plus Beta
00:02:11.569 --> 00:02:15.730
1 times Xi for each data point, i.
00:02:15.730 --> 00:02:20.820
This prediction is hopefully close to the
true outcome, Yi.
00:02:20.820 --> 00:02:24.670
But since the coefficients have to be the
same for all data
00:02:24.670 --> 00:02:30.810
points, i, we often make a small error,
which we'll call epsilon i.
00:02:30.810 --> 00:02:35.480
This error term is also often called a residual.
00:02:35.480 --> 00:02:39.860
Our errors will only all be 0 if all our points
lie perfectly
00:02:39.860 --> 00:02:41.720
on the same line.
00:02:41.720 --> 00:02:44.990
This rarely happens, so we know that our model
will probably
00:02:44.990 --> 00:02:47.040
make some errors.
00:02:47.040 --> 00:02:52.240
The best model or best choice of coefficients
Beta 0 and Beta 1
00:02:52.240 --> 00:03:04.220
has the smallest error terms or smallest residuals.
00:03:04.220 --> 00:03:08.130
This figure shows the blue line that we drew
in the beginning.
00:03:08.130 --> 00:03:13.610
We can compute the residuals or errors
of this line for each data point.
00:03:13.610 --> 00:03:19.760
For example, for this point the actual value
is about 6.2.
00:03:19.760 --> 00:03:24.420
Using our regression model we predict about
6.5.
00:03:24.420 --> 00:03:28.640
So the error for this data point is negative
0.3,
00:03:28.640 --> 00:03:32.770
which is the actual value minus our prediction.
00:03:32.770 --> 00:03:39.180
As another example for this point,
the actual value is about 8.
00:03:39.180 --> 00:03:44.200
Using our regression model we predict about
7.5.
00:03:44.200 --> 00:03:48.560
So the error for this data point is about
0.5.
00:03:48.560 --> 00:03:53.820
Again the actual value minus our prediction.
00:03:53.820 --> 00:03:56.790
One measure of the quality of a regression
line
00:03:56.790 --> 00:04:01.110
is the sum of squared errors, or SSE.
00:04:01.110 --> 00:04:05.850
This is the sum of the squared residuals or
error terms.
00:04:05.850 --> 00:04:09.740
Let n equal the number of data points that
we have in our data
00:04:09.740 --> 00:04:12.400
set.
00:04:12.400 --> 00:04:17.339
Then the sum of squared errors is
equal to the error we make on the first data
00:04:17.339 --> 00:04:21.370
point squared
plus the error we make on the second data
00:04:21.370 --> 00:04:24.800
point squared
plus the errors that you make on all data
00:04:24.800 --> 00:04:32.979
points
up to the n-th data point squared.
00:04:32.979 --> 00:04:38.210
We can compute the sum of squared errors
for both the red line and the blue line.
00:04:38.210 --> 00:04:42.139
As expected the blue line is a better fit
than the red line
00:04:42.139 --> 00:04:45.930
since it has a smaller sum of squared errors.
00:04:45.930 --> 00:04:48.900
The line that gives the minimum sum of squared
errors
00:04:48.900 --> 00:04:50.680
is shown in green.
00:04:50.680 --> 00:04:54.830
This is the line that our regression model
will find.
00:04:54.830 --> 00:04:58.210
Although sum of squared errors allows us to
compare lines
00:04:58.210 --> 00:05:03.289
on the same data set, it's hard to interpret
for two reasons.
00:05:03.289 --> 00:05:07.849
The first is that it scales with n, the number
of data points.
00:05:07.849 --> 00:05:11.449
If we built the same model with twice as much
data,
00:05:11.449 --> 00:05:14.180
the sum of squared errors might be twice as
big.
00:05:14.180 --> 00:05:17.419
But this doesn't mean it's a worse model.
00:05:17.419 --> 00:05:20.270
The second is that the units are hard to understand.
00:05:20.270 --> 00:05:26.270
Some of squared errors is in squared units
of the dependent variable.
00:05:26.270 --> 00:05:31.039
Because of these problems, Root Means Squared
Error, or RMSE,
00:05:31.039 --> 00:05:32.819
is often used.
00:05:32.819 --> 00:05:37.699
This divides sum of squared errors by n
and then takes a square root.
00:05:37.699 --> 00:05:41.180
So it's normalized by n and is in the same
units
00:05:41.180 --> 00:05:44.129
as the dependent variable.
00:05:44.129 --> 00:05:48.759
Another common error measure for linear regression
is R squared.
00:05:48.759 --> 00:05:51.699
This error measure is nice because it compares
the best
00:05:51.699 --> 00:05:55.308
model to a baseline model, the model that
does not
00:05:55.308 --> 00:05:59.479
use any variables, or the red line from before.
00:05:59.479 --> 00:06:05.370
The baseline model predicts the average value
of the dependent variable regardless
00:06:05.370 --> 00:06:08.979
of the value of the independent variable.
00:06:08.979 --> 00:06:12.279
We can compute that the sum of squared errors
for the best fit
00:06:12.279 --> 00:06:16.949
line or the green line is 5.73.
00:06:16.949 --> 00:06:23.080
And the sum of squared errors for the baseline
or the red line is 10.15.
00:06:23.080 --> 00:06:25.819
The sum of squared errors for the baseline
model
00:06:25.819 --> 00:06:29.880
is also known as the total sum of squares,
commonly referred
00:06:29.880 --> 00:06:32.590
to as SST.
00:06:32.590 --> 00:06:40.860
Then the formula for R squared is
R squared equals 1 minus sum of squared errors
00:06:40.860 --> 00:06:44.599
divided
by total sum of squares.
00:06:44.599 --> 00:06:56.400
In this case it equals 1 minus 5.73
divided by 10.15 which equals 0.44.
00:06:56.400 --> 00:07:01.719
R squared is nice because it captures
the value added from using a linear regression
00:07:01.719 --> 00:07:06.129
model over just predicting the average outcome
for every data
00:07:06.129 --> 00:07:07.319
point.
00:07:07.319 --> 00:07:10.610
So what values do we expect to see for R squared?
00:07:10.610 --> 00:07:16.520
Well both the sum of squared errors
and the total sum of squares have
00:07:16.520 --> 00:07:19.430
to be greater than or equal to zero because
they're
00:07:19.430 --> 00:07:23.680
the sum of squared terms so they can't be
negative.
00:07:23.680 --> 00:07:27.809
Additionally the sum of squared errors has
to be less than
00:07:27.809 --> 00:07:31.270
or equal to the total sum of squares.
00:07:31.270 --> 00:07:34.460
This is because our linear regression model
could just
00:07:34.460 --> 00:07:38.749
set the coefficient for the independent variable
to 0
00:07:38.749 --> 00:07:41.720
and then we would have the baseline model.
00:07:41.720 --> 00:07:46.110
So our linear regression model will never
be worse than the baseline model.
00:07:46.110 --> 00:07:52.960
So in the worst case the sum of squares errors
equals the total sum of squares, and our R
00:07:52.960 --> 00:07:55.449
squared is equal to 0.
00:07:55.449 --> 00:07:59.529
So this means no improvement over the baseline.
00:07:59.529 --> 00:08:05.699
In the best case our linear regression model
makes no errors, and the sum of squared errors
00:08:05.699 --> 00:08:07.809
is equal to 0.
00:08:07.809 --> 00:08:11.029
And then our R squared is equal to 1.
00:08:11.029 --> 00:08:17.689
So an R squared equal to 1 or close to 1
means a perfect or almost perfect predictive
00:08:17.689 --> 00:08:20.089
model.
00:08:20.089 --> 00:08:23.069
R squared is nice because it's unitless and
therefore
00:08:23.069 --> 00:08:26.469
universally interpretable between problems.
00:08:26.469 --> 00:08:31.060
However, it can still be hard to compare between
problems.
00:08:31.060 --> 00:08:35.549
Good models for easy problems will
have an R squared close to 1.
00:08:35.549 --> 00:08:41.570
But good models for hard problems
can still have an R squared close to zero.
00:08:41.570 --> 00:08:45.660
Throughout this course we will see
examples of both types of problems.