WEBVTT
00:00:00.040 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.870
Commons license.
00:00:03.870 --> 00:00:06.910
Your support will help MIT
OpenCourseWare continue to
00:00:06.910 --> 00:00:10.560
offer high quality educational
resources for free.
00:00:10.560 --> 00:00:13.460
To make a donation or view
additional materials from
00:00:13.460 --> 00:00:19.290
hundreds of MIT courses, visit
MIT OpenCourseWare at
00:00:19.290 --> 00:00:21.732
ocw.mit.edu.
00:00:21.732 --> 00:00:24.170
JOHN TSITSIKLIS: And we're going
to continue today with
00:00:24.170 --> 00:00:26.820
our discussion of classical
statistics.
00:00:26.820 --> 00:00:29.290
We'll start with a quick review
of what we discussed
00:00:29.290 --> 00:00:34.680
last time, and then talk about
two topics that cover a lot of
00:00:34.680 --> 00:00:37.740
statistics that are happening
in the real world.
00:00:37.740 --> 00:00:39.510
So two basic methods.
00:00:39.510 --> 00:00:43.730
One is the method of linear
regression, and the other one
00:00:43.730 --> 00:00:46.500
is the set of basic methods and
tools used to
00:00:46.500 --> 00:00:49.540
do hypothesis testing.
00:00:49.540 --> 00:00:53.970
OK, so these two are topics
that any scientifically
00:00:53.970 --> 00:00:57.170
literate person should
know something about.
00:00:57.170 --> 00:00:59.570
So we're going to introduce
the basic ideas
00:00:59.570 --> 00:01:01.860
and concepts involved.
00:01:01.860 --> 00:01:07.580
So in classical statistics we
essentially have a
00:01:07.580 --> 00:01:11.250
family of possible models
about the world.
00:01:11.250 --> 00:01:15.190
So the world is the random
variable that we observe, and
00:01:15.190 --> 00:01:19.370
we have a model for it, but
actually not just one model,
00:01:19.370 --> 00:01:20.960
several candidate models.
00:01:20.960 --> 00:01:24.380
And each candidate model
corresponds to a different
00:01:24.380 --> 00:01:28.070
value of a parameter theta
that we do not know.
00:01:28.070 --> 00:01:32.275
So in contrast to Bayesian
statistics, this theta is
00:01:32.275 --> 00:01:35.540
assumed to be a constant
that we do not know.
00:01:35.540 --> 00:01:38.190
It is not modeled as a random
variable; there are no
00:01:38.190 --> 00:01:40.480
probabilities associated
with theta.
00:01:40.480 --> 00:01:43.380
We only have probabilities
about the X's.
00:01:43.380 --> 00:01:47.320
So in this context what is a
reasonable way of choosing a
00:01:47.320 --> 00:01:49.350
value for the parameter?
00:01:49.350 --> 00:01:53.470
One general approach is the
maximum likelihood approach,
00:01:53.470 --> 00:01:56.090
which chooses the
theta for which
00:01:56.090 --> 00:01:58.630
this quantity is largest.
00:01:58.630 --> 00:02:00.690
So what does that mean
intuitively?
00:02:00.690 --> 00:02:04.550
I'm trying to find the value of
theta under which the data
00:02:04.550 --> 00:02:08.970
that I observe are most likely
to have occurred.
00:02:08.970 --> 00:02:11.470
So the thinking is
essentially as follows.
00:02:11.470 --> 00:02:13.970
Let's say I have to choose
between two choices of theta.
00:02:13.970 --> 00:02:16.520
Under this theta the
X that I observed
00:02:16.520 --> 00:02:17.940
would be very unlikely.
00:02:17.940 --> 00:02:21.350
Under that theta the X that I
observed would have a decent
00:02:21.350 --> 00:02:22.830
probability of occurring.
00:02:22.830 --> 00:02:28.340
So I choose the latter as
my estimate of theta.
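The comparison just described can be sketched in code. This is a toy coin-flip example of my own, not from the lecture: theta is the probability of heads, and we keep the candidate theta under which the observed flips are most likely.

```python
# Toy illustration (my own example, not from the lecture): theta is the
# probability of heads for a coin, and we observe a sequence of flips.
# We compare two candidate values of theta and keep the one under which
# the observed flips are most likely.

def likelihood(theta, flips):
    """Probability of observing this exact sequence of flips (1 = heads)."""
    p = 1.0
    for x in flips:
        p *= theta if x == 1 else (1.0 - theta)
    return p

flips = [1, 1, 0, 1, 1, 0, 1, 1]                    # 6 heads out of 8
candidates = [0.25, 0.75]
theta_hat = max(candidates, key=lambda t: likelihood(t, flips))
print(theta_hat)
```

Here the data would be quite unlikely under theta equal to 0.25, so the procedure picks 0.75.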
00:02:28.340 --> 00:02:31.200
It's interesting to do the
comparison with the Bayesian
00:02:31.200 --> 00:02:34.110
approach, which we discussed
last time. In the Bayesian
00:02:34.110 --> 00:02:38.430
approach we also maximize over
theta, but we maximize a
00:02:38.430 --> 00:02:43.220
quantity in which the relation
between the X's and thetas runs the
00:02:43.220 --> 00:02:44.520
opposite way.
00:02:44.520 --> 00:02:47.500
Here in the Bayesian world,
Theta is a random variable.
00:02:47.500 --> 00:02:48.980
So it has a distribution.
00:02:48.980 --> 00:02:53.030
Once we observe the data, it has
a posterior distribution,
00:02:53.030 --> 00:02:56.480
and we find the value of Theta,
which is most likely
00:02:56.480 --> 00:02:59.250
under the posterior
distribution.
00:02:59.250 --> 00:03:03.090
As we discussed last time when
you do this maximization now
00:03:03.090 --> 00:03:05.750
the posterior distribution is
given by this expression.
00:03:05.750 --> 00:03:09.760
The denominator doesn't matter,
and if you were to
00:03:09.760 --> 00:03:12.790
take a prior, which is flat--
00:03:12.790 --> 00:03:16.210
that is a constant independent
of Theta, then that
00:03:16.210 --> 00:03:17.640
term would go away.
00:03:17.640 --> 00:03:19.360
And syntactically,
at least, the two
00:03:19.360 --> 00:03:21.970
approaches look the same.
00:03:21.970 --> 00:03:28.170
So syntactically, or formally,
maximum likelihood estimation
00:03:28.170 --> 00:03:32.225
is the same as Bayesian
estimation in which you assume
00:03:32.225 --> 00:03:36.090
a prior which is flat, so that
all possible values of Theta
00:03:36.090 --> 00:03:37.570
are equally likely.
00:03:37.570 --> 00:03:40.790
Philosophically, however,
they're very different things.
00:03:40.790 --> 00:03:44.150
Here I'm picking the most
likely value of Theta.
00:03:44.150 --> 00:03:47.140
Here I'm picking the value of
Theta under which the observed
00:03:47.140 --> 00:03:51.050
data would have been more
likely to occur.
00:03:51.050 --> 00:03:53.590
So maximum likelihood estimation
is a general
00:03:53.590 --> 00:03:57.820
purpose method, so it's applied
all over the place in
00:03:57.820 --> 00:04:02.220
many, many different types
of estimation problems.
00:04:02.220 --> 00:04:05.100
There is a special kind of
estimation problem in which
00:04:05.100 --> 00:04:08.040
you may forget about maximum
likelihood estimation, and
00:04:08.040 --> 00:04:12.700
come up with an estimate in
a straightforward way.
00:04:12.700 --> 00:04:15.680
And this is the case where
you're trying to estimate the
00:04:15.680 --> 00:04:22.390
mean of the distribution of X,
where X is a random variable.
00:04:22.390 --> 00:04:25.140
You observe several independent
identically
00:04:25.140 --> 00:04:30.020
distributed random variables
X1 up to Xn.
00:04:30.020 --> 00:04:32.880
All of them have the same
distribution as this X.
00:04:32.880 --> 00:04:34.600
So they have a common mean.
00:04:34.600 --> 00:04:37.020
We do not know the mean;
we want to estimate it.
00:04:37.020 --> 00:04:40.560
What is more natural than just
taking the average of the
00:04:40.560 --> 00:04:42.470
values that we have observed?
00:04:42.470 --> 00:04:46.150
So you generate lots of X's,
take the average of them, and
00:04:46.150 --> 00:04:50.560
you expect that this is going to
be a reasonable estimate of
00:04:50.560 --> 00:04:53.420
the true mean of that
random variable.
00:04:53.420 --> 00:04:56.290
And indeed we know from the weak
law of large numbers that
00:04:56.290 --> 00:05:00.790
this estimate converges in
probability to the true mean
00:05:00.790 --> 00:05:02.680
of the random variable.
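A minimal sketch of this, with numbers of my own: averaging i.i.d. samples gives an estimate that approaches the true mean as n grows.

```python
import random

# Minimal sketch (numbers are mine, not from the lecture): draw i.i.d.
# samples with a known true mean and average them; the sample mean is
# the natural estimate of the unknown mean.

random.seed(0)
true_mean = 5.0
n = 100_000
samples = [random.gauss(true_mean, 2.0) for _ in range(n)]

sample_mean = sum(samples) / n
print(sample_mean)   # close to 5.0 for large n, as the weak law of large numbers suggests
```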
00:05:02.680 --> 00:05:04.870
The other thing that we talked
about last time is that
00:05:04.870 --> 00:05:07.770
besides giving a point estimate
we may want to also
00:05:07.770 --> 00:05:13.530
give an interval that tells us
something about where we might
00:05:13.530 --> 00:05:16.170
believe theta to lie.
00:05:16.170 --> 00:05:21.950
A 1-alpha confidence
interval is an interval
00:05:21.950 --> 00:05:24.200
generated based on the data.
00:05:24.200 --> 00:05:26.860
So it's an interval from this
value to that value.
00:05:26.860 --> 00:05:30.120
These values are written with
capital letters because
00:05:30.120 --> 00:05:32.390
they're random, because they
depend on the data
00:05:32.390 --> 00:05:33.870
that we have seen.
00:05:33.870 --> 00:05:36.740
And this gives us an interval,
and we would like this
00:05:36.740 --> 00:05:40.600
interval to have the property
that theta is inside that
00:05:40.600 --> 00:05:42.830
interval with high
probability.
00:05:42.830 --> 00:05:46.805
So typically we would take
1-alpha to be a quantity such
00:05:46.805 --> 00:05:49.780
as 95% for example.
00:05:49.780 --> 00:05:54.340
In which case we have a 95%
confidence interval.
00:05:54.340 --> 00:05:56.980
As we discussed last time it's
important to have the right
00:05:56.980 --> 00:06:00.730
interpretation of what
95% means.
00:06:00.730 --> 00:06:04.640
What it does not mean
is the following--
00:06:04.640 --> 00:06:09.800
the unknown value has a 95%
probability of being
00:06:09.800 --> 00:06:12.450
in the interval that
we have generated.
00:06:12.450 --> 00:06:14.550
That's because the unknown
value is not a random
00:06:14.550 --> 00:06:15.910
variable; it's a constant.
00:06:15.910 --> 00:06:18.930
Once we generate the interval,
either theta is inside it or it's
00:06:18.930 --> 00:06:22.500
outside, but there are no
probabilities involved.
00:06:22.500 --> 00:06:26.415
Rather the probabilities are
to be interpreted over the
00:06:26.415 --> 00:06:28.590
random interval itself.
00:06:28.590 --> 00:06:31.730
What a statement like this
says is that if I have a
00:06:31.730 --> 00:06:37.060
procedure for generating 95%
confidence intervals, then
00:06:37.060 --> 00:06:40.800
whenever I use that procedure
I'm going to get a random
00:06:40.800 --> 00:06:44.260
interval, and it's going to
have 95% probability of
00:06:44.260 --> 00:06:48.270
capturing the true
value of theta.
00:06:48.270 --> 00:06:53.010
So most of the time when I use
this particular procedure for
00:06:53.010 --> 00:06:56.170
generating confidence intervals
the true theta will
00:06:56.170 --> 00:06:59.440
happen to lie inside that
confidence interval with
00:06:59.440 --> 00:07:01.190
probability 95%.
00:07:01.190 --> 00:07:04.230
So the randomness in this
statement is with respect to
00:07:04.230 --> 00:07:09.190
my confidence interval, it's
not with respect to theta,
00:07:09.190 --> 00:07:11.880
because theta is not random.
00:07:11.880 --> 00:07:14.710
How does one construct
confidence intervals?
00:07:14.710 --> 00:07:17.500
There's various ways of going
about it, but in the case
00:07:17.500 --> 00:07:20.330
where we're dealing with the
estimation of the mean of a
00:07:20.330 --> 00:07:23.790
random variable doing this is
straightforward using the
00:07:23.790 --> 00:07:25.680
central limit theorem.
00:07:25.680 --> 00:07:31.440
Basically we take our estimated
mean, that's the
00:07:31.440 --> 00:07:35.910
sample mean, and we take a
symmetric interval to the left
00:07:35.910 --> 00:07:38.220
and to the right of
the sample mean.
00:07:38.220 --> 00:07:42.340
And we choose the width of that
interval by looking at
00:07:42.340 --> 00:07:43.680
the normal tables.
00:07:43.680 --> 00:07:50.180
So if this quantity, 1-alpha, is
95%, we're going to
00:07:50.180 --> 00:07:55.790
look at the 97.5 percentile of
the normal distribution.
00:07:55.790 --> 00:07:59.910
Find the constant number that
corresponds to that value from
00:07:59.910 --> 00:08:02.790
the normal tables, and construct
the confidence
00:08:02.790 --> 00:08:07.350
intervals according
to this formula.
00:08:07.350 --> 00:08:10.810
So that gives you a pretty
mechanical way of going about
00:08:10.810 --> 00:08:13.250
constructing confidence
intervals when you're
00:08:13.250 --> 00:08:15.270
estimating the sample mean.
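A minimal sketch of that mechanical recipe, where the data and the assumed known sigma are mine: pretend the sample mean is normal, and build a symmetric interval using the 97.5th percentile of the standard normal.

```python
import math
import random

# Sketch of the recipe above (data and the assumed known sigma are mine):
# a 95% confidence interval for the mean, pretending the sample mean is
# normal by the central limit theorem. 1.96 is the 97.5th percentile of
# the standard normal, read off the normal tables.

random.seed(1)
sigma = 2.0                                  # assumed known standard deviation
n = 400
xs = [random.gauss(10.0, sigma) for _ in range(n)]
sample_mean = sum(xs) / n

z = 1.96                                     # z such that Phi(z) = 0.975
half_width = z * sigma / math.sqrt(n)
lo, hi = sample_mean - half_width, sample_mean + half_width
print((lo, hi))
```

The interval is random; over repeated use of the procedure it captures the true mean about 95% of the time.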
00:08:15.270 --> 00:08:18.650
So constructing confidence
intervals in this way involves
00:08:18.650 --> 00:08:19.630
an approximation.
00:08:19.630 --> 00:08:22.230
The approximation is the
central limit theorem.
00:08:22.230 --> 00:08:24.490
We are pretending that
the sample mean is a
00:08:24.490 --> 00:08:26.400
normal random variable.
00:08:26.400 --> 00:08:30.110
Which is, more or less,
right when n is large.
00:08:30.110 --> 00:08:32.780
That's what the central limit
theorem tells us.
00:08:32.780 --> 00:08:36.429
And sometimes we may need to
do some extra approximation
00:08:36.429 --> 00:08:39.480
work, because quite often
we do not know the
00:08:39.480 --> 00:08:41.030
true value of sigma.
00:08:41.030 --> 00:08:43.559
So we need to do some work
to estimate
00:08:43.559 --> 00:08:45.360
sigma from the data.
00:08:45.360 --> 00:08:48.520
So sigma is, of course, the
standard deviation of the X's.
00:08:48.520 --> 00:08:51.410
We may want to estimate it from
the data, or we may have
00:08:51.410 --> 00:08:54.450
an upper bound on sigma, and we
just use that upper bound.
00:08:57.430 --> 00:09:02.520
So now let's move on
to a new topic.
00:09:02.520 --> 00:09:09.420
A lot of the statistics done in the
real world is of the
00:09:09.420 --> 00:09:12.540
following flavor.
00:09:12.540 --> 00:09:16.820
So suppose that X is the SAT
score of a student in high
00:09:16.820 --> 00:09:23.620
school, and Y is the MIT GPA
of that same student.
00:09:23.620 --> 00:09:27.570
So you expect that there is a
relation between these two.
00:09:27.570 --> 00:09:31.240
So you go and collect data for
different students, and you
00:09:31.240 --> 00:09:35.470
record for a typical student
this would be their SAT score,
00:09:35.470 --> 00:09:37.700
that could be their MIT GPA.
00:09:37.700 --> 00:09:43.720
And you plot all this data
on an (X,Y) diagram.
00:09:43.720 --> 00:09:48.240
Now it's reasonable to believe
that there is some systematic
00:09:48.240 --> 00:09:49.940
relation between the two.
00:09:49.940 --> 00:09:54.650
So people who had higher SAT
scores in high school may have
00:09:54.650 --> 00:09:57.110
higher GPA in college.
00:09:57.110 --> 00:10:00.310
Well that may or may
not be true.
00:10:00.310 --> 00:10:05.270
You want to construct a model of
this kind, and see to what
00:10:05.270 --> 00:10:08.330
extent a relation of
this type is true.
00:10:08.330 --> 00:10:15.560
So you might hypothesize that
the real world is described by
00:10:15.560 --> 00:10:17.390
a model of this kind.
00:10:17.390 --> 00:10:22.730
That there is a linear relation
between the SAT
00:10:22.730 --> 00:10:27.710
score, and the college GPA.
00:10:27.710 --> 00:10:30.560
So it's a linear relation with
some parameters, theta0 and
00:10:30.560 --> 00:10:33.060
theta1 that we do not know.
00:10:33.060 --> 00:10:37.460
So we assume a linear relation
for the data, and depending on
00:10:37.460 --> 00:10:41.690
the choices of theta0 and theta1
it could be a different
00:10:41.690 --> 00:10:43.530
line through those data.
00:10:43.530 --> 00:10:47.670
Now we would like to find the
best model of this kind to
00:10:47.670 --> 00:10:49.230
explain the data.
00:10:49.230 --> 00:10:52.260
Of course there's going
to be some randomness.
00:10:52.260 --> 00:10:55.370
So in general it's going to be
impossible to find a line that
00:10:55.370 --> 00:10:57.780
goes through all of
the data points.
00:10:57.780 --> 00:11:04.020
So let's try to find the best
line that comes closest to
00:11:04.020 --> 00:11:05.810
explaining those data.
00:11:05.810 --> 00:11:08.520
And here's how we go about it.
00:11:08.520 --> 00:11:13.100
Suppose we try some particular
values of theta0 and theta1.
00:11:13.100 --> 00:11:15.750
These give us a certain line.
00:11:15.750 --> 00:11:20.760
Given that line, we can
make predictions.
00:11:20.760 --> 00:11:24.470
For a student who had this x,
the model that we have would
00:11:24.470 --> 00:11:27.580
predict that y would
be this value.
00:11:27.580 --> 00:11:32.150
The actual y is something else,
and so this quantity is
00:11:32.150 --> 00:11:37.660
the error that our model would
make in predicting the y of
00:11:37.660 --> 00:11:39.580
that particular student.
00:11:39.580 --> 00:11:43.350
We would like to choose a line
for which the predictions are
00:11:43.350 --> 00:11:45.110
as good as possible.
00:11:45.110 --> 00:11:47.790
And what do we mean by
as good as possible?
00:11:47.790 --> 00:11:51.150
As our criteria we're going
to take the following.
00:11:51.150 --> 00:11:54.070
We are going to look at the
prediction error that our
00:11:54.070 --> 00:11:56.310
model makes for each
particular student.
00:11:56.310 --> 00:12:01.050
Take the square of that, and
then add them up over all of
00:12:01.050 --> 00:12:02.580
our data points.
00:12:02.580 --> 00:12:06.140
So what we're looking at is
the sum of this quantity
00:12:06.140 --> 00:12:08.270
squared, that quantity squared,
that quantity
00:12:08.270 --> 00:12:09.570
squared, and so on.
00:12:09.570 --> 00:12:13.220
We add all of these squares, and
we would like to find the
00:12:13.220 --> 00:12:17.500
line for which the sum of
these squared prediction
00:12:17.500 --> 00:12:20.910
errors is as small
as possible.
00:12:20.910 --> 00:12:23.950
So that's the procedure.
00:12:23.950 --> 00:12:27.100
We have our data, the
X's and the Y's.
00:12:27.100 --> 00:12:31.340
And we're going to find the thetas
for the best model of this
00:12:31.340 --> 00:12:35.580
type, the best possible model,
by minimizing this sum of
00:12:35.580 --> 00:12:38.010
squared errors.
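The criterion just described can be sketched as follows, with data points of my own: for a candidate line y = theta0 + theta1 * x, add up the squared prediction errors over all data points.

```python
# Sketch of the least-squares criterion (data points are mine): for a
# candidate line y = theta0 + theta1 * x, sum the squared prediction
# errors over all data points.

def sum_squared_errors(theta0, theta1, xs, ys):
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

good_line = sum_squared_errors(0.0, 2.0, xs, ys)   # a line close to the data
bad_line = sum_squared_errors(5.0, -1.0, xs, ys)   # a poor line
print(good_line, bad_line)
```

The line close to the data has a much smaller criterion value, and that is exactly what the minimization exploits.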
00:12:38.010 --> 00:12:41.020
So that's a method that one
could pull out of the hat and
00:12:41.020 --> 00:12:44.120
say OK, that's how I'm going
to build my model.
00:12:44.120 --> 00:12:46.730
And it sounds pretty
reasonable.
00:12:46.730 --> 00:12:49.530
And it sounds pretty reasonable
even if you don't
00:12:49.530 --> 00:12:51.660
know anything about
probability.
00:12:51.660 --> 00:12:55.340
But does it have some
probabilistic justification?
00:12:55.340 --> 00:12:59.280
It turns out that yes, you can
motivate this method with
00:12:59.280 --> 00:13:03.100
probabilistic considerations
under certain assumptions.
00:13:03.100 --> 00:13:07.360
So let's make a probabilistic
model that's going to lead us
00:13:07.360 --> 00:13:10.600
to this particular way of
estimating the parameters.
00:13:10.600 --> 00:13:12.920
So here's a probabilistic
model.
00:13:12.920 --> 00:13:18.090
I pick a student who had
a specific SAT score.
00:13:18.090 --> 00:13:21.190
And that could be done at
random, but also could be done
00:13:21.190 --> 00:13:22.330
in a systematic way.
00:13:22.330 --> 00:13:25.240
That is, I pick a student who
had an SAT of 600, a student
00:13:25.240 --> 00:13:33.170
of 610 all the way to 1,400
or 1,600, whatever the
00:13:33.170 --> 00:13:34.670
right number is.
00:13:34.670 --> 00:13:36.320
I pick all those students.
00:13:36.320 --> 00:13:40.370
And I assume that for a student
of this kind there's a
00:13:40.370 --> 00:13:44.500
true model that tells me that
their GPA is going to be a
00:13:44.500 --> 00:13:48.580
random variable, which is
something predicted by their
00:13:48.580 --> 00:13:52.690
SAT score plus some randomness,
some random noise.
00:13:52.690 --> 00:13:56.400
And I model that random noise
by independent normal random
00:13:56.400 --> 00:14:00.710
variables with 0 mean and
a certain variance.
00:14:00.710 --> 00:14:04.470
So this is a specific
probabilistic model, and now I
00:14:04.470 --> 00:14:09.010
can think about doing maximum
likelihood estimation for this
00:14:09.010 --> 00:14:10.980
particular model.
00:14:10.980 --> 00:14:14.490
So to do maximum likelihood
estimation here I need to
00:14:14.490 --> 00:14:19.830
write down the likelihood of the
y's that I have observed.
00:14:19.830 --> 00:14:23.380
What's the likelihood of the
y's that I have observed?
00:14:23.380 --> 00:14:28.425
Well, a particular w has a
likelihood of the form e to
00:14:28.425 --> 00:14:33.030
the minus w squared over
(2 sigma-squared).
00:14:33.030 --> 00:14:37.070
That's the likelihood
of a particular w.
00:14:37.070 --> 00:14:40.310
The probability, or the
likelihood of observing a
00:14:40.310 --> 00:14:43.990
particular value of y, that's
the same as the likelihood
00:14:43.990 --> 00:14:49.020
that w takes a value of y
minus this, minus that.
00:14:49.020 --> 00:14:52.850
So the likelihood of the
y's is of this form.
00:14:52.850 --> 00:14:57.360
Think of this as just being
the w_i-squared.
00:14:57.360 --> 00:15:01.370
So this is the density --
00:15:01.370 --> 00:15:06.060
and if we have multiple data you
multiply the likelihoods
00:15:06.060 --> 00:15:07.660
of the different y's.
00:15:07.660 --> 00:15:12.090
So you have to write something
like this.
00:15:12.090 --> 00:15:16.390
Since the w's are independent
that means that the y's are
00:15:16.390 --> 00:15:17.910
also independent.
00:15:17.910 --> 00:15:21.410
The likelihood of a y vector
is the product of the
00:15:21.410 --> 00:15:24.240
likelihoods of the
individual y's.
00:15:24.240 --> 00:15:27.800
The likelihood of every
individual y is of this form.
00:15:27.800 --> 00:15:33.050
Where w is y_i minus these
two quantities.
00:15:33.050 --> 00:15:36.000
So this is the form that the
likelihood function is going
00:15:36.000 --> 00:15:38.880
to take under this
particular model.
00:15:38.880 --> 00:15:42.260
And under the maximum likelihood
methodology we want
00:15:42.260 --> 00:15:49.170
to maximize this quantity with
respect to theta0 and theta1.
00:15:49.170 --> 00:15:56.930
Now to do this maximization you
might as well consider the
00:15:56.930 --> 00:16:00.990
logarithm and maximize the
logarithm, which is just the
00:16:00.990 --> 00:16:02.730
exponent up here.
00:16:02.730 --> 00:16:05.750
Maximizing this exponent because
we have a minus sign
00:16:05.750 --> 00:16:08.900
is the same as minimizing
the exponent
00:16:08.900 --> 00:16:10.840
without the minus sign.
00:16:10.840 --> 00:16:12.840
Sigma squared is a constant.
00:16:12.840 --> 00:16:17.970
So what you end up doing is
minimizing this quantity here,
00:16:17.970 --> 00:16:20.120
which is the same as
what we had in our
00:16:20.120 --> 00:16:23.640
linear regression methods.
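The chain of equivalences just described can be written out as follows (my notation; c stands for the normalizing constant of the normal density):

```latex
\max_{\theta_0,\theta_1}\ \prod_{i=1}^{n} c\, e^{-(y_i-\theta_0-\theta_1 x_i)^2/(2\sigma^2)}
\iff
\max_{\theta_0,\theta_1}\ -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\theta_0-\theta_1 x_i)^2
\iff
\min_{\theta_0,\theta_1}\ \sum_{i=1}^{n}(y_i-\theta_0-\theta_1 x_i)^2
```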
00:16:23.640 --> 00:16:29.400
So in conclusion you might
choose to do linear regression
00:16:29.400 --> 00:16:34.490
in this particular way,
just because it looks
00:16:34.490 --> 00:16:36.210
reasonable or plausible.
00:16:36.210 --> 00:16:41.050
Or you might interpret what
you're doing as maximum
00:16:41.050 --> 00:16:45.220
likelihood estimation, in which
you assume a model of
00:16:45.220 --> 00:16:49.520
this kind where the noise
terms are normal random
00:16:49.520 --> 00:16:51.970
variables with the same
distribution --
00:16:51.970 --> 00:16:54.540
independent identically
distributed.
00:16:54.540 --> 00:17:01.320
So linear regression implicitly
makes an assumption
00:17:01.320 --> 00:17:02.840
of this kind.
00:17:02.840 --> 00:17:07.380
It's doing maximum likelihood
estimation as if the world was
00:17:07.380 --> 00:17:11.000
really described by a model of
this form, and with the W's
00:17:11.000 --> 00:17:12.560
being random variables.
00:17:12.560 --> 00:17:17.920
So this gives us at least some
justification that this
00:17:17.920 --> 00:17:21.800
particular approach to fitting
lines to data is not so
00:17:21.800 --> 00:17:25.579
arbitrary, but it has
a sound footing.
00:17:25.579 --> 00:17:30.530
OK so then once you accept this
formulation as being a
00:17:30.530 --> 00:17:32.920
reasonable one, what's
the next step?
00:17:32.920 --> 00:17:37.760
The next step is to see how to
carry out this minimization.
00:17:37.760 --> 00:17:42.220
This is not a very difficult
minimization to do.
00:17:42.220 --> 00:17:48.260
The way it's done is by setting
the derivatives of
00:17:48.260 --> 00:17:50.930
this expression to 0.
00:17:50.930 --> 00:17:54.500
Now because this is a quadratic
function of theta0
00:17:54.500 --> 00:17:55.410
and theta1--
00:17:55.410 --> 00:17:57.270
when you take the derivatives
with respect
00:17:57.270 --> 00:17:58.940
to theta0 and theta1--
00:17:58.940 --> 00:18:03.250
you get linear functions
of theta0 and theta1.
00:18:03.250 --> 00:18:08.010
And you end up solving a system
of linear equations in
00:18:08.010 --> 00:18:09.630
theta0 and theta1.
00:18:09.630 --> 00:18:15.660
And it turns out that there are
very nice and simple formulas
00:18:15.660 --> 00:18:18.950
for the optimal estimates
of the parameters in
00:18:18.950 --> 00:18:20.510
terms of the data.
00:18:20.510 --> 00:18:23.910
And the formulas
are these ones.
00:18:23.910 --> 00:18:28.130
I said that these are nice
and simple formulas.
00:18:28.130 --> 00:18:29.800
Let's see why.
00:18:29.800 --> 00:18:31.270
How can we interpret them?
00:18:34.050 --> 00:18:42.250
So suppose that the world is
described by a model of this
00:18:42.250 --> 00:18:48.990
kind, where the X's and Y's
are random variables.
00:18:48.990 --> 00:18:53.920
And where W is a noise term
that's independent of X. So
00:18:53.920 --> 00:18:57.750
we're assuming that a linear
model is indeed true, but not
00:18:57.750 --> 00:18:58.530
exactly true.
00:18:58.530 --> 00:19:01.790
There's always some noise
associated with any particular
00:19:01.790 --> 00:19:04.980
data point that we obtain.
00:19:04.980 --> 00:19:10.880
So if a model of this kind is
true, and the W's have 0 mean
00:19:10.880 --> 00:19:15.370
then we have that the expected
value of Y would be theta0
00:19:15.370 --> 00:19:23.570
plus theta1 expected value of
X. And because W has 0 mean
00:19:23.570 --> 00:19:26.200
there's no extra term.
00:19:26.200 --> 00:19:31.660
So in particular, theta0 would
be equal to expected value of
00:19:31.660 --> 00:19:37.380
Y minus theta1 expected
value of X.
00:19:37.380 --> 00:19:40.660
So let's use this equation
to try to come up with a
00:19:40.660 --> 00:19:44.060
reasonable estimate of theta0.
00:19:44.060 --> 00:19:47.220
I do not know the expected
value of Y, but I
00:19:47.220 --> 00:19:48.430
can estimate it.
00:19:48.430 --> 00:19:49.820
How do I estimate it?
00:19:49.820 --> 00:19:53.460
I look at the average of all the
y's that I have obtained.
00:19:53.460 --> 00:19:57.320
So I replace this; I estimate
it with the average of the
00:19:57.320 --> 00:19:59.940
data I have seen.
00:19:59.940 --> 00:20:02.430
Here, similarly with the X's.
00:20:02.430 --> 00:20:06.820
I might not know the expected
value of X's, but I have data
00:20:06.820 --> 00:20:08.520
points for the x's.
00:20:08.520 --> 00:20:13.070
I look at the average of all my
data points, I come up with
00:20:13.070 --> 00:20:16.380
an estimate of this
expectation.
00:20:16.380 --> 00:20:21.390
Now I don't know what theta1 is,
but my procedure is going
00:20:21.390 --> 00:20:25.320
to generate an estimate of
theta1 called theta1 hat.
00:20:25.320 --> 00:20:29.230
And once I have this estimate,
then a reasonable person would
00:20:29.230 --> 00:20:33.400
estimate theta0 in this
particular way.
00:20:33.400 --> 00:20:37.320
So that's how my estimate
of theta0 is going to be
00:20:37.320 --> 00:20:38.490
constructed.
00:20:38.490 --> 00:20:41.420
It's this formula here.
00:20:41.420 --> 00:20:44.700
We have not yet addressed the
harder question, which is how
00:20:44.700 --> 00:20:47.490
to estimate theta1 in
the first place.
00:20:47.490 --> 00:20:50.830
So to estimate theta0 I assumed
that I already had an
00:20:50.830 --> 00:20:52.180
estimate for a theta1.
00:20:55.090 --> 00:21:02.060
OK, the right formula for the
estimate of theta1 happens to
00:21:02.060 --> 00:21:03.140
be this one.
00:21:03.140 --> 00:21:08.632
It looks messy, but let's
try to interpret it.
00:21:08.632 --> 00:21:12.970
What I'm going to do is
take this model, and for
00:21:12.970 --> 00:21:18.340
simplicity let's assume that
the random variables
00:21:18.340 --> 00:21:19.590
have 0 means.
00:21:22.940 --> 00:21:28.800
And see how we might
00:21:28.800 --> 00:21:30.960
try to estimate theta1.
00:21:30.960 --> 00:21:36.270
Let's multiply both sides of
this equation by X. So we get
00:21:36.270 --> 00:21:48.470
Y times X equals
theta0 times X plus theta1
00:21:48.470 --> 00:21:54.530
times X-squared, plus X times
W. And now take expectations
00:21:54.530 --> 00:21:56.420
of both sides.
00:21:56.420 --> 00:22:00.160
If I have 0 mean random
variables the expected value
00:22:00.160 --> 00:22:07.210
of Y times X is just the
covariance of X with Y.
00:22:07.210 --> 00:22:10.640
I have assumed that my random
variables have 0 means, so the
00:22:10.640 --> 00:22:13.680
expectation of this is 0.
00:22:13.680 --> 00:22:17.970
This one is going to be the
variance of X, so I have
00:22:17.970 --> 00:22:23.260
theta1 times variance of X. And
since I'm assuming that my
00:22:23.260 --> 00:22:26.990
random variables have 0 mean,
and I'm also assuming that W
00:22:26.990 --> 00:22:32.250
is independent of X this last
term also has 0 mean.
00:22:32.250 --> 00:22:39.280
So under such a probabilistic
model this equation is true.
00:22:39.280 --> 00:22:43.620
If we knew the variance and the
covariance then we would
00:22:43.620 --> 00:22:45.930
know the value of theta1.
00:22:45.930 --> 00:22:49.080
But we only have data, we do
not necessarily know the
00:22:49.080 --> 00:22:53.070
variance and the covariance,
but we can estimate it.
00:22:53.070 --> 00:22:55.885
What's a reasonable estimate
of the variance?
00:22:55.885 --> 00:22:59.390
The reasonable estimate of the
variance is this quantity here
00:22:59.390 --> 00:23:03.195
divided by n, and the reasonable
estimate of the
00:23:03.195 --> 00:23:06.730
covariance is that numerator
divided by n.
00:23:09.410 --> 00:23:11.510
So this is my estimate
of the mean.
00:23:11.510 --> 00:23:15.390
I'm looking at the squared
distances from the mean, and I
00:23:15.390 --> 00:23:18.740
average them over lots
and lots of data.
00:23:18.740 --> 00:23:23.990
This is the most reasonable way
of estimating the variance
00:23:23.990 --> 00:23:26.070
of our distribution.
00:23:26.070 --> 00:23:31.400
And similarly the expected value
of this quantity is the
00:23:31.400 --> 00:23:35.020
covariance of X with Y, and then
we have lots and lots of
00:23:35.020 --> 00:23:35.830
data points.
00:23:35.830 --> 00:23:38.895
This quantity here is going to
be a very good estimate of the
00:23:38.895 --> 00:23:40.140
covariance.
00:23:40.140 --> 00:23:44.820
So basically what this
formula does is--
00:23:44.820 --> 00:23:46.520
one way of thinking about it--
00:23:46.520 --> 00:23:50.870
is that it starts from this
relation which is true
00:23:50.870 --> 00:23:57.230
exactly, but estimates the
covariance and the variance on
00:23:57.230 --> 00:24:00.820
the basis of the data, and then
uses these estimates to
00:24:00.820 --> 00:24:05.770
come up with an estimate
of theta1.
00:24:05.770 --> 00:24:09.890
So this gives us a probabilistic
interpretation
00:24:09.890 --> 00:24:13.620
of the formulas that we have for
the way that the estimates
00:24:13.620 --> 00:24:14.990
are constructed.
00:24:14.990 --> 00:24:19.560
If you're willing to assume that
this is the true model of
00:24:19.560 --> 00:24:22.640
the world, the structure of the
true model of the world,
00:24:22.640 --> 00:24:24.460
except that you do not
know the means and
00:24:24.460 --> 00:24:27.590
variances, and covariances,
00:24:27.590 --> 00:24:33.010
then this is a natural way of
estimating those unknown
00:24:33.010 --> 00:24:34.260
parameters.
00:24:36.770 --> 00:24:39.800
All right, so we have a
closed-form formula, we can
00:24:39.800 --> 00:24:43.620
apply it whenever
we have data.
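The closed-form estimates, as interpreted above, can be sketched like this (the data are mine): estimate the covariance and the variance from the data, take their ratio for theta1 hat, and then back out theta0 hat from the sample means.

```python
# Sketch of the closed-form estimates as interpreted above (data are mine):
# theta1_hat = (sample covariance of X and Y) / (sample variance of X),
# theta0_hat = ybar - theta1_hat * xbar.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n

cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
var_x = sum((x - xbar) ** 2 for x in xs) / n

theta1_hat = cov_xy / var_x
theta0_hat = ybar - theta1_hat * xbar
print(theta0_hat, theta1_hat)
```

For this nearly linear data the estimates come out close to an intercept of 0 and a slope of 2.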
00:24:43.620 --> 00:24:47.810
Now linear regression is a
subject on which there are
00:24:47.810 --> 00:24:51.520
whole courses given and whole
books written.
00:24:51.520 --> 00:24:54.560
And the reason for that is that
there's a lot more that
00:24:54.560 --> 00:24:58.840
you can bring into the topic,
and many ways that you can
00:24:58.840 --> 00:25:02.350
elaborate on the simple solution
that we got for the
00:25:02.350 --> 00:25:05.880
case of two parameters and only
two random variables.
00:25:05.880 --> 00:25:09.550
So let me give you a little bit
of flavor of what are the
00:25:09.550 --> 00:25:12.950
topics that come up when you
start looking into linear
00:25:12.950 --> 00:25:14.200
regression in more depth.
00:25:16.840 --> 00:25:24.390
So in our discussions so far
we used a linear model in
00:25:24.390 --> 00:25:28.370
which we're trying to explain
the values of one variable in
00:25:28.370 --> 00:25:30.860
terms of the values of
another variable.
00:25:30.860 --> 00:25:35.010
We're trying to explain GPAs
in terms of SAT scores, or
00:25:35.010 --> 00:25:39.640
we're trying to predict GPAs
in terms of SAT scores.
00:25:39.640 --> 00:25:47.910
But maybe your GPA is affected
by several factors.
00:25:47.910 --> 00:25:56.380
For example maybe your GPA is
affected by your SAT score,
00:25:56.380 --> 00:26:01.820
also the income of your family,
the years of education
00:26:01.820 --> 00:26:06.720
of your grandmother, and many
other factors like that.
00:26:06.720 --> 00:26:11.970
So you might write down a model
in which I believe that
00:26:11.970 --> 00:26:17.820
GPA has a relation, which is a
linear function of all these
00:26:17.820 --> 00:26:20.520
other variables that
I mentioned.
00:26:20.520 --> 00:26:24.350
So perhaps you have a theory of
what determines performance
00:26:24.350 --> 00:26:29.540
at college, and you want to
build a model of that type.
00:26:29.540 --> 00:26:31.460
How do we go about it
in this case?
00:26:31.460 --> 00:26:33.830
Well, again we collect
the data points.
00:26:33.830 --> 00:26:37.980
We look at the i-th student,
who has a college GPA.
00:26:37.980 --> 00:26:42.090
We record their SAT score,
their family income, and
00:26:42.090 --> 00:26:45.010
grandmother's years
of education.
00:26:45.010 --> 00:26:50.390
So this is one data point that
is for one particular student.
00:26:50.390 --> 00:26:52.580
We postulate the model
of this form.
00:26:52.580 --> 00:26:56.160
For the i-th student this would
be the mistake that our
00:26:56.160 --> 00:26:59.940
model makes if we have chosen
specific values for those
00:26:59.940 --> 00:27:01.070
parameters.
00:27:01.070 --> 00:27:05.450
And then we go and choose the
parameters that are going to
00:27:05.450 --> 00:27:07.950
give us, again, the
smallest possible
00:27:07.950 --> 00:27:10.000
sum of squared errors.
00:27:10.000 --> 00:27:12.360
So philosophically it's exactly
the same as what we
00:27:12.360 --> 00:27:15.700
were discussing before, except
that now we're including
00:27:15.700 --> 00:27:19.560
multiple explanatory variables
in our model instead of a
00:27:19.560 --> 00:27:22.600
single explanatory variable.
00:27:22.600 --> 00:27:24.070
So that's the formulation.
00:27:24.070 --> 00:27:26.070
What do you do next?
00:27:26.070 --> 00:27:29.420
Well, to do this minimization
you're going to take
00:27:29.420 --> 00:27:32.750
derivatives once you have your
data, you have a function of
00:27:32.750 --> 00:27:34.310
these three parameters.
00:27:34.310 --> 00:27:37.190
You take the derivative with
respect to the parameter, set
00:27:37.190 --> 00:27:39.170
the derivative equal
to 0, you get the
00:27:39.170 --> 00:27:41.060
system of linear equations.
00:27:41.060 --> 00:27:43.450
You throw that system of
linear equations to the
00:27:43.450 --> 00:27:46.260
computer, and you get numerical
values for the
00:27:46.260 --> 00:27:48.060
optimal parameters.
00:27:48.060 --> 00:27:52.130
There are no nice closed-form
formulas of the type that we
00:27:52.130 --> 00:27:54.510
had in the previous slide
when you're dealing
00:27:54.510 --> 00:27:56.230
with multiple variables.
00:27:56.230 --> 00:28:02.240
Unless you're willing to go
into matrix notation.
00:28:02.240 --> 00:28:04.760
In that case you can again
write down closed-form
00:28:04.760 --> 00:28:07.290
formulas, but they will be a
little less intuitive than
00:28:07.290 --> 00:28:09.210
what we had before.
00:28:09.210 --> 00:28:13.550
But the moral of the story is
that numerically this is a
00:28:13.550 --> 00:28:16.480
procedure that's very easy.
00:28:16.480 --> 00:28:18.780
It's a problem, an optimization
problem that the
00:28:18.780 --> 00:28:20.680
computer can solve for you.
00:28:20.680 --> 00:28:23.290
And it can solve it for
you very quickly.
00:28:23.290 --> 00:28:25.470
Because all that it involves
is solving a
00:28:25.470 --> 00:28:26.720
system of linear equations.
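The procedure just described can be sketched numerically. Note this is our illustration, not the lecture's: the student records are invented, and we let `numpy.linalg.lstsq` solve the least-squares problem, which is equivalent to handing the normal equations to the computer.

```python
# Multiple linear regression: with several explanatory variables,
# setting the derivatives of the sum of squared errors to zero gives a
# system of linear equations that the computer solves very quickly.
import numpy as np

# Columns: SAT score, family income, grandmother's years of education.
# All values are invented for illustration.
X = np.array([[1400.0, 50.0, 12.0],
              [1200.0, 40.0, 16.0],
              [1500.0, 90.0, 10.0],
              [1300.0, 60.0, 14.0],
              [1100.0, 30.0, 12.0]])
y = np.array([3.9, 3.2, 4.0, 3.5, 2.9])        # college GPAs

A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept
theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares parameters
residuals = y - A @ theta                      # model errors on the data
```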
00:28:29.590 --> 00:28:34.270
Now when you choose your
explanatory variables you may
00:28:34.270 --> 00:28:37.940
have some choices.
00:28:37.940 --> 00:28:43.550
One person may think that your
GPA has something to do with
00:28:43.550 --> 00:28:45.340
your SAT score.
00:28:45.340 --> 00:28:48.480
Some other person may think that
your GPA has something to
00:28:48.480 --> 00:28:51.800
do with the square of
your SAT score.
00:28:51.800 --> 00:28:55.380
And that other person may
want to try to build a
00:28:55.380 --> 00:28:58.840
model of this kind.
00:28:58.840 --> 00:29:01.550
Now when would you want
to do this?
00:29:01.550 --> 00:29:07.830
Suppose that the data that
you have looks like this.
00:29:12.177 --> 00:29:15.740
If the data looks like this then
you might be tempted to
00:29:15.740 --> 00:29:20.710
say well a linear model does
not look right, but maybe a
00:29:20.710 --> 00:29:25.650
quadratic model will give me
a better fit for the data.
00:29:25.650 --> 00:29:30.690
So if you want to fit a
quadratic model to the data
00:29:30.690 --> 00:29:35.550
then what you do is you take
X-squared as your explanatory
00:29:35.550 --> 00:29:42.520
variable instead of X, and you
build a model of this kind.
00:29:42.520 --> 00:29:45.910
There's nothing really different
in models of this
00:29:45.910 --> 00:29:48.830
kind compared to models
of that kind.
00:29:48.830 --> 00:29:54.700
They are still linear models
because we have theta's
00:29:54.700 --> 00:29:57.630
showing up in a linear
fashion.
00:29:57.630 --> 00:30:00.460
What you take as your
explanatory variables, whether
00:30:00.460 --> 00:30:02.870
it's X, whether it's X-squared,
or whether it's
00:30:02.870 --> 00:30:05.390
some other function
that you chose.
00:30:05.390 --> 00:30:09.590
Some general function h of X,
doesn't make a difference.
00:30:09.590 --> 00:30:14.470
So think of your h of X as being
your new X. So you can
00:30:14.470 --> 00:30:17.620
formulate the problem exactly
the same way, except that
00:30:17.620 --> 00:30:21.035
instead of using X's you
choose h of X's.
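The point that the model stays linear in the thetas can be sketched by reusing the same least-squares formulas with h(X) in place of X. The helper name and the data are our invention; the data here are exactly quadratic so the fit recovers the generating coefficients.

```python
# Fitting Y = theta0 + theta1 * h(X) + W is the same least-squares
# problem as before, with each X replaced by h(X). The model is still
# linear in the thetas even when h is nonlinear.

def fit_with_feature(xs, ys, h):
    zs = [h(x) for x in xs]          # the new explanatory variable
    n = len(zs)
    z_bar = sum(zs) / n
    y_bar = sum(ys) / n
    theta1 = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys)) \
        / sum((z - z_bar) ** 2 for z in zs)
    theta0 = y_bar - theta1 * z_bar
    return theta0, theta1

# Invented, exactly quadratic data: y = 1 + 2 x^2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 9.0, 19.0]
theta0, theta1 = fit_with_feature(xs, ys, lambda x: x ** 2)
```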
00:30:23.610 --> 00:30:26.540
So it's basically a question:
do I want to build a model
00:30:26.540 --> 00:30:31.390
that explains Y's based on the
values of X, or do I want to
00:30:31.390 --> 00:30:35.190
build a model that explains Y's
on the basis of the values
00:30:35.190 --> 00:30:38.970
of h of X. Which is the
right value to use?
00:30:38.970 --> 00:30:42.160
And with this picture here,
we see that it can make a
00:30:42.160 --> 00:30:43.160
difference.
00:30:43.160 --> 00:30:47.070
A linear model in X might be
a poor fit, but a quadratic
00:30:47.070 --> 00:30:49.660
model might give us
a better fit.
00:30:49.660 --> 00:30:55.450
So this brings us to the topic of
how to choose your functions h
00:30:55.450 --> 00:30:59.480
of X if you're dealing with
a real world problem.
00:30:59.480 --> 00:31:03.080
So in a real world problem
you're just given X's and Y's.
00:31:03.080 --> 00:31:05.990
And you have the freedom
of building models of
00:31:05.990 --> 00:31:07.120
any kind you want.
00:31:07.120 --> 00:31:11.330
You have the freedom of choosing
a function h of X of
00:31:11.330 --> 00:31:13.130
any type that you want.
00:31:13.130 --> 00:31:14.980
So this turns out to be quite a
00:31:14.980 --> 00:31:18.800
difficult and tricky topic.
00:31:18.800 --> 00:31:22.630
Because you may be tempted
to overdo it.
00:31:22.630 --> 00:31:28.450
For example, I got my 10 data
points, and I could say OK,
00:31:28.450 --> 00:31:35.660
I'm going to choose an h of X, and
00:31:35.660 --> 00:31:40.300
actually multiple h's of X
to do a multiple linear
00:31:40.300 --> 00:31:45.030
regression in which I'm going to
build a model that's uses a
00:31:45.030 --> 00:31:47.600
10th degree polynomial.
00:31:47.600 --> 00:31:51.160
If I choose to fit my data with
a 10th degree polynomial
00:31:51.160 --> 00:31:54.680
I'm going to fit my data
perfectly, but I may obtain a
00:31:54.680 --> 00:31:58.530
model that does something like
this, and goes through all my
00:31:58.530 --> 00:31:59.930
data points.
00:31:59.930 --> 00:32:03.830
So I can make my prediction
errors extremely small if I
00:32:03.830 --> 00:32:08.820
use lots of parameters, and
if I choose my h functions
00:32:08.820 --> 00:32:09.930
appropriately.
00:32:09.930 --> 00:32:11.800
But clearly this would
be garbage.
00:32:11.800 --> 00:32:15.270
If you get those data points,
and you say here's my model
00:32:15.270 --> 00:32:16.420
that explains them,
00:32:16.420 --> 00:32:21.320
with a polynomial going up
and down, then you're probably
00:32:21.320 --> 00:32:22.900
doing something wrong.
00:32:22.900 --> 00:32:26.180
So choosing how complicated
those functions,
00:32:26.180 --> 00:32:27.900
the h's, should be.
00:32:27.900 --> 00:32:32.020
And how many explanatory
variables to use is a very
00:32:32.020 --> 00:32:36.560
delicate and deep topic on which
there's deep theory that
00:32:36.560 --> 00:32:39.910
tells you what you should do,
and what you shouldn't do.
00:32:39.910 --> 00:32:43.830
But the main thing that one
should avoid doing is having
00:32:43.830 --> 00:32:46.620
too many parameters in
your model when you
00:32:46.620 --> 00:32:48.900
have too few data.
00:32:48.900 --> 00:32:52.350
So if you only have 10 data
points, you shouldn't have 10
00:32:52.350 --> 00:32:53.350
free parameters.
00:32:53.350 --> 00:32:56.150
With 10 free parameters you will
be able to fit your data
00:32:56.150 --> 00:33:00.760
perfectly, but you wouldn't be
able to really rely on the
00:33:00.760 --> 00:33:02.010
results that you are seeing.
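The overfitting warning can be sketched numerically. The data and seed below are invented: a polynomial with as many free parameters as data points fits the 10 points essentially perfectly, while the honest 2-parameter line leaves visible errors, and it is the perfect-fitting curve that is garbage as a model.

```python
# Overfitting sketch: fit 10 noisy points generated from a linear
# truth with (a) a 9th-degree polynomial, 10 free parameters, and
# (b) a straight line with 2 free parameters.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + rng.normal(0.0, 0.3, size=10)   # noisy linear truth

wild = np.polyfit(x, y, 9)   # as many coefficients as data points
sane = np.polyfit(x, y, 1)   # the honest two-parameter line

# The high-degree fit interpolates the data (near-zero training error)
# but oscillates between the points; the line keeps a modest error.
err_wild = np.max(np.abs(np.polyval(wild, x) - y))
err_sane = np.max(np.abs(np.polyval(sane, x) - y))
```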
00:33:06.050 --> 00:33:12.630
OK, now in practice, when people
run linear regressions
00:33:12.630 --> 00:33:15.410
they do not just give
point estimates for
00:33:15.410 --> 00:33:17.370
the parameters theta.
00:33:17.370 --> 00:33:20.300
But similar to what we did for
the case of estimating the
00:33:20.300 --> 00:33:23.790
mean of a random variable you
might want to give confidence
00:33:23.790 --> 00:33:27.200
intervals that sort of tell you
how much randomness there
00:33:27.200 --> 00:33:30.730
is when you estimate each one of
the particular parameters.
00:33:30.730 --> 00:33:33.960
There are formulas for building
confidence intervals
00:33:33.960 --> 00:33:36.230
for the estimates
of the theta's.
00:33:36.230 --> 00:33:38.520
We're not going to look
at them, it would
00:33:38.520 --> 00:33:39.990
take too much time.
00:33:39.990 --> 00:33:44.600
Also you might want to estimate
the variance in the
00:33:44.600 --> 00:33:47.400
noise that you have
in your model.
00:33:47.400 --> 00:33:52.540
That is if you are pretending
that your true model is of the
00:33:52.540 --> 00:33:57.026
kind we were discussing before,
namely Y equals theta0 plus theta1
00:33:57.026 --> 00:34:02.190
times X plus W, and W has a
variance sigma squared.
00:34:02.190 --> 00:34:05.170
You might want to estimate this,
because it tells you
00:34:05.170 --> 00:34:09.199
something about the model, and
this is called standard error.
00:34:09.199 --> 00:34:11.929
It puts a limit on how
good predictions
00:34:11.929 --> 00:34:14.730
your model can make.
00:34:14.730 --> 00:34:18.170
Even if you have the correct
theta0 and theta1, and
00:34:18.170 --> 00:34:22.179
somebody tells you X you can
make a prediction about Y, but
00:34:22.179 --> 00:34:24.710
that prediction will
not be accurate.
00:34:24.710 --> 00:34:26.739
Because there's this additional
randomness.
00:34:26.739 --> 00:34:29.699
And if that additional
randomness is big, then your
00:34:29.699 --> 00:34:33.810
predictions will also have a
substantial error in them.
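Estimating that noise variance from the residuals of the fitted line can be sketched as follows. The data are invented; dividing by n - 2 rather than n is the usual correction for the two parameters that were estimated, an assumption of ours rather than something stated in the lecture.

```python
# Estimate sigma^2 in Y = theta0 + theta1 * X + W from the residuals
# of the fitted line. Data invented for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.3, 2.8, 4.2, 5.1, 5.9])

theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()
residuals = y - (theta0 + theta1 * x)

# Sum of squared residuals, divided by n - 2 for the two fitted
# parameters: the estimated variance of the noise W.
sigma2_hat = np.sum(residuals ** 2) / (len(x) - 2)
```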
00:34:33.810 --> 00:34:38.300
There's another quantity that
gets reported usually.
00:34:38.300 --> 00:34:41.400
This is part of the computer
output that you get when you
00:34:41.400 --> 00:34:45.500
use a statistical package which
is called R-square.
00:34:45.500 --> 00:34:49.920
And it's a measure of the
explanatory power of the model
00:34:49.920 --> 00:34:52.469
that you have built
using linear regression.
00:34:55.650 --> 00:35:01.030
Instead of defining R-square
exactly, let me give you a
00:35:01.030 --> 00:35:05.170
sort of analogous quantity
that's involved.
00:35:05.170 --> 00:35:08.030
After you do your linear
regression you can look at the
00:35:08.030 --> 00:35:10.600
following quantity.
00:35:10.600 --> 00:35:15.720
You look at the variance of Y,
which is something that you
00:35:15.720 --> 00:35:17.400
can estimate from data.
00:35:17.400 --> 00:35:23.370
This is how much randomness
there is in Y. And compare it
00:35:23.370 --> 00:35:28.090
with the randomness that you
have in Y, but conditioned on
00:35:28.090 --> 00:35:35.840
X. So this quantity tells
me if I knew X how much
00:35:35.840 --> 00:35:39.820
randomness would there
still be in my Y?
00:35:39.820 --> 00:35:43.650
So if I know X, I have more
information, so Y is more
00:35:43.650 --> 00:35:44.390
constrained.
00:35:44.390 --> 00:35:48.640
There's less randomness in Y.
This is the randomness in Y if
00:35:48.640 --> 00:35:50.790
I don't know anything about X.
00:35:50.790 --> 00:35:54.855
So naturally this quantity would
be less than 1, and if
00:35:54.855 --> 00:35:58.830
this quantity is small it would
mean that whenever I
00:35:58.830 --> 00:36:03.320
know X then Y is very
well known.
00:36:03.320 --> 00:36:07.440
Which essentially tells me that
knowing X allows me to
00:36:07.440 --> 00:36:12.370
make very good predictions about
Y. Knowing X means that
00:36:12.370 --> 00:36:17.390
I'm explaining away most
of the randomness in Y.
00:36:17.390 --> 00:36:22.590
So if you read a statistical
study that uses linear
00:36:22.590 --> 00:36:29.730
regression you might encounter
statements of the form 60% of
00:36:29.730 --> 00:36:36.140
a student's GPA is explained
by the family income.
00:36:36.140 --> 00:36:40.600
If you read statements of
this kind, they really refer
00:36:40.600 --> 00:36:43.160
to quantities of this kind.
00:36:43.160 --> 00:36:47.820
Out of the total variance in Y,
how much variance is left
00:36:47.820 --> 00:36:50.060
after we build our model?
00:36:50.060 --> 00:36:56.490
So if only 40% of the variance
of Y is left after we build
00:36:56.490 --> 00:37:00.700
our model, that means that
X explains 60% of the
00:37:00.700 --> 00:37:02.510
variations in Y's.
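The "fraction of variance explained" reading of such statements can be sketched with the analogous quantity from the lecture: one minus the ratio of residual variance to total variance. The numbers are invented for illustration.

```python
# R-square-style quantity: what fraction of the variance of Y is left
# after fitting the linear model on X? Data invented for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()
residuals = y - (theta0 + theta1 * x)

# 1 - var(residual) / var(Y): a value of 0.6 would be read as
# "X explains 60% of the variation in Y, 40% is left to the noise".
r_square = 1.0 - np.var(residuals) / np.var(y)
```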
00:37:02.510 --> 00:37:06.570
So the idea is that
randomness in Y is
00:37:06.570 --> 00:37:09.560
caused by multiple sources.
00:37:09.560 --> 00:37:12.025
Our explanatory variable
and random noise.
00:37:12.025 --> 00:37:15.610
And we ask the question what
percentage of the total
00:37:15.610 --> 00:37:19.940
randomness in Y is explained by
00:37:19.940 --> 00:37:23.030
variations in the X parameter?
00:37:23.030 --> 00:37:26.860
And how much of the total
randomness in Y is attributed
00:37:26.860 --> 00:37:30.390
just to random effects?
00:37:30.390 --> 00:37:34.050
So if you have a model that
explains most of the variation
00:37:34.050 --> 00:37:37.710
in Y then you can think that
you have a good model that
00:37:37.710 --> 00:37:42.550
tells you something useful
about the real world.
00:37:42.550 --> 00:37:45.990
Now there's lots of things that
can go wrong when you use
00:37:45.990 --> 00:37:50.670
linear regression, and there's
many pitfalls.
00:37:50.670 --> 00:37:56.440
One pitfall happens when you
have this situation that's
00:37:56.440 --> 00:37:58.300
called heteroskedasticity.
00:37:58.300 --> 00:38:01.020
So suppose your data
are of this kind.
00:38:06.550 --> 00:38:09.330
So what's happening here?
00:38:09.330 --> 00:38:17.640
You seem to have a linear model,
but when X is small you
00:38:17.640 --> 00:38:19.200
have a very good model.
00:38:19.200 --> 00:38:23.830
So this means that W has a small
variance when X is here.
00:38:23.830 --> 00:38:26.760
On the other hand, when X is
there you have a lot of
00:38:26.760 --> 00:38:27.970
randomness.
00:38:27.970 --> 00:38:32.080
This would be a situation
in which the W's are not
00:38:32.080 --> 00:38:35.840
identically distributed, but
the variance of the W's, of
00:38:35.840 --> 00:38:40.360
the noise, has something
to do with the X's.
00:38:40.360 --> 00:38:43.720
So with different regions of our
x-space we have different
00:38:43.720 --> 00:38:45.260
amounts of noise.
00:38:45.260 --> 00:38:47.615
What will go wrong in
this situation?
00:38:47.615 --> 00:38:51.290
Since we're trying to minimize
sum of squared errors, we're
00:38:51.290 --> 00:38:54.080
really paying attention
to the biggest errors.
00:38:54.080 --> 00:38:57.010
Which will mean that we are
going to pay attention to
00:38:57.010 --> 00:38:59.690
these data points, because
that's where the big errors
00:38:59.690 --> 00:39:01.130
are going to be.
00:39:01.130 --> 00:39:04.250
So the linear regression
formulas will end up building
00:39:04.250 --> 00:39:09.110
a model based on these data,
which are the most noisy ones.
00:39:09.110 --> 00:39:14.810
Instead of those data that are
nicely stacked in order.
00:39:14.810 --> 00:39:17.410
Clearly that's not the
right thing to do.
00:39:17.410 --> 00:39:21.500
So you need to change something,
and use the fact
00:39:21.500 --> 00:39:25.800
that the variance of W changes
with the X's, and there are
00:39:25.800 --> 00:39:27.770
ways of dealing with it.
00:39:27.770 --> 00:39:31.280
It's something that one needs
to be careful about.
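One standard way of dealing with it, weighted least squares, can be sketched as follows. This is our illustration of the general idea, not a method given in the lecture: each point is down-weighted by its noise standard deviation (assumed known here, and invented along with the data), so the noisy points no longer dominate the squared-error criterion.

```python
# Weighted least squares for heteroskedastic data: rescale each row by
# 1/sigma_i, then ordinary least squares on the scaled problem equals
# weighted least squares on the original one. Data invented.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
y = np.array([2.0, 4.1, 5.9, 18.0, 14.0, 24.0])
var_w = np.array([0.01, 0.01, 0.01, 9.0, 9.0, 9.0])  # noise grows with x

w = 1.0 / np.sqrt(var_w)                     # per-point weights
A = np.column_stack([np.ones_like(x), x])    # intercept and slope
theta, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
# theta[0], theta[1]: intercept and slope driven mostly by the
# low-noise points instead of the noisy ones.
```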
00:39:31.280 --> 00:39:34.580
Another possibility of getting
into trouble is if you're
00:39:34.580 --> 00:39:38.550
using multiple explanatory
variables that are very
00:39:38.550 --> 00:39:41.330
closely related to each other.
00:39:41.330 --> 00:39:47.500
So for example, suppose that I
tried to predict your GPA by
00:39:47.500 --> 00:39:54.100
looking at your SAT the first
time that you took it plus
00:39:54.100 --> 00:39:58.290
your SAT the second time that
you took your SATs.
00:39:58.290 --> 00:40:00.470
I'm assuming that almost
everyone takes the
00:40:00.470 --> 00:40:02.450
SAT more than once.
00:40:02.450 --> 00:40:05.630
So suppose that you had
a model of this kind.
00:40:05.630 --> 00:40:09.380
Well, SAT on your first try and
SAT on your second try are
00:40:09.380 --> 00:40:12.480
very likely to be
fairly close.
00:40:12.480 --> 00:40:17.570
And you could think of coming
up with estimates in which
00:40:17.570 --> 00:40:19.390
this is ignored.
00:40:19.390 --> 00:40:22.780
And you build a model based on
this, or an alternative model
00:40:22.780 --> 00:40:25.810
in which this term is ignored,
and you make predictions based
00:40:25.810 --> 00:40:27.430
on the second SAT.
00:40:27.430 --> 00:40:31.840
And both models are likely to be
essentially as good as the
00:40:31.840 --> 00:40:34.430
other one, because these
two quantities are
00:40:34.430 --> 00:40:36.630
essentially the same.
00:40:36.630 --> 00:40:41.440
So in that case, your theta's
that you estimate are going to
00:40:41.440 --> 00:40:44.880
be very sensitive to little
details of the data.
00:40:44.880 --> 00:40:48.560
You have your data, and your data tell
00:40:48.560 --> 00:40:52.170
you that this coefficient
is big and that
00:40:52.170 --> 00:40:52.760
coefficient is small.
00:40:52.760 --> 00:40:56.060
You change your data just a
tiny bit, and your theta's
00:40:56.060 --> 00:40:57.720
would drastically change.
00:40:57.720 --> 00:41:00.750
So this is a case in which you
have multiple explanatory
00:41:00.750 --> 00:41:04.110
variables, but they're redundant
in the sense that
00:41:04.110 --> 00:41:07.300
they're very closely related
to each other, and perhaps
00:41:07.300 --> 00:41:08.830
with a linear relation.
00:41:08.830 --> 00:41:11.980
So one must be careful about the
situation, and do special
00:41:11.980 --> 00:41:15.940
tests to make sure that
this doesn't happen.
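The redundant-variables problem can be sketched numerically. Everything here is invented for illustration: two nearly identical explanatory variables (SAT on the first and second try) can shift their individual coefficients under tiny data changes, while their combined effect stays stable.

```python
# Multicollinearity sketch: regress GPA on two almost identical
# explanatory variables. The individual thetas are poorly determined,
# but their sum (the combined "SAT effect") is stable. Data invented.
import numpy as np

rng = np.random.default_rng(1)
sat1 = rng.uniform(1000, 1600, size=30)
sat2 = sat1 + rng.normal(0.0, 1.0, size=30)   # almost the same score
gpa = 0.002 * sat1 + rng.normal(0.0, 0.1, size=30)

def fit(y):
    A = np.column_stack([np.ones_like(sat1), sat1, sat2])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

theta_a = fit(gpa)
theta_b = fit(gpa + rng.normal(0.0, 0.01, size=30))  # tiny perturbation
# theta_a and theta_b may split the credit between sat1 and sat2 quite
# differently, yet theta[1] + theta[2] barely moves.
```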
00:41:15.940 --> 00:41:20.900
Finally the biggest and most
common blunder is that you run
00:41:20.900 --> 00:41:24.910
your linear regression, you
get your linear model, and
00:41:24.910 --> 00:41:26.760
then you say oh, OK.
00:41:26.760 --> 00:41:33.340
Y is caused by X according to
this particular formula.
00:41:33.340 --> 00:41:36.940
Well, all that we did was to
identify a linear relation
00:41:36.940 --> 00:41:40.120
between X and Y. This doesn't
tell us anything about
00:41:40.120 --> 00:41:44.130
whether it's Y that causes X, or
whether it's X that causes
00:41:44.130 --> 00:41:48.850
Y, or maybe both X and Y are
caused by some other variable
00:41:48.850 --> 00:41:51.110
that we didn't think about.
00:41:51.110 --> 00:41:56.800
So building a good linear model
that has small errors
00:41:56.800 --> 00:42:00.980
does not tell us anything about
causal relations between
00:42:00.980 --> 00:42:02.320
the two variables.
00:42:02.320 --> 00:42:05.210
It only tells us that there's
a close association between
00:42:05.210 --> 00:42:06.010
the two variables.
00:42:06.010 --> 00:42:10.370
If you know one you can make
predictions about the other.
00:42:10.370 --> 00:42:13.370
But it doesn't tell you anything
about the underlying
00:42:13.370 --> 00:42:18.120
physics, that there's some
physical mechanism that
00:42:18.120 --> 00:42:22.310
introduces the relation between
those variables.
00:42:22.310 --> 00:42:26.430
OK, that's it about
linear regression.
00:42:26.430 --> 00:42:30.510
Let us start the next topic,
which is hypothesis testing.
00:42:30.510 --> 00:42:35.140
And we're going to continue
with it next time.
00:42:35.140 --> 00:42:37.780
So here, instead of trying
to estimate continuous
00:42:37.780 --> 00:42:41.920
parameters, we have two
alternative hypotheses about
00:42:41.920 --> 00:42:46.550
the distribution of the
X random variable.
00:42:46.550 --> 00:42:53.620
So for example our random
variable could be either
00:42:53.620 --> 00:42:58.480
distributed according to this
distribution, under H0, or it
00:42:58.480 --> 00:43:02.930
might be distributed according
to this distribution under H1.
00:43:02.930 --> 00:43:06.230
And we want to make a decision
which distribution is the
00:43:06.230 --> 00:43:07.990
correct one?
00:43:07.990 --> 00:43:10.850
So we're given those two
distributions, and some common
00:43:10.850 --> 00:43:14.290
terminologies that one of them
is the null hypothesis--
00:43:14.290 --> 00:43:16.600
sort of the default hypothesis,
and we have some
00:43:16.600 --> 00:43:18.290
alternative hypotheses--
00:43:18.290 --> 00:43:20.560
and we want to check whether
this one is true,
00:43:20.560 --> 00:43:21.950
or that one is true.
00:43:21.950 --> 00:43:24.500
So you obtain a data
point, and you
00:43:24.500 --> 00:43:26.060
want to make a decision.
00:43:26.060 --> 00:43:28.820
In this picture what would
a reasonable person
00:43:28.820 --> 00:43:30.650
do to make a decision?
00:43:30.650 --> 00:43:35.500
They would probably choose a
certain threshold, Xi, and
00:43:35.500 --> 00:43:43.540
decide that H1 is true if your
data falls in this interval.
00:43:43.540 --> 00:43:49.590
And decide that H0 is true
if you fall on the other side.
00:43:49.590 --> 00:43:51.660
So that would be a
reasonable way of
00:43:51.660 --> 00:43:54.100
approaching the problem.
00:43:54.100 --> 00:43:59.160
More generally you take the set
of all possible X's, and
00:43:59.160 --> 00:44:03.050
you divide the set of possible
X's into two regions.
00:44:03.050 --> 00:44:11.110
One is the rejection region,
in which you decide H1,
00:44:11.110 --> 00:44:13.170
or you reject H0.
00:44:15.760 --> 00:44:21.640
And the complement of that
region is where you decide H0.
00:44:21.640 --> 00:44:25.210
So this is the x-space
of your data.
00:44:25.210 --> 00:44:28.350
In this example here, X
was one-dimensional.
00:44:28.350 --> 00:44:31.770
But in general X is going to
be a vector, where all the
00:44:31.770 --> 00:44:34.790
possible data vectors that
you can get, they're
00:44:34.790 --> 00:44:36.600
divided into two types.
00:44:36.600 --> 00:44:40.400
If it falls in this set you'd
make one decision.
00:44:40.400 --> 00:44:43.770
If it falls in that set, you
make the other decision.
00:44:43.770 --> 00:44:47.380
OK, so how would you
characterize the performance
00:44:47.380 --> 00:44:49.690
of the particular way of
making a decision?
00:44:49.690 --> 00:44:53.000
Suppose I chose my threshold.
00:44:53.000 --> 00:44:57.960
I may make mistakes of
two possible types.
00:44:57.960 --> 00:45:03.360
Perhaps H0 is true, but my data
happens to fall here.
00:45:03.360 --> 00:45:07.560
In which case I make a mistake,
and this would be a
00:45:07.560 --> 00:45:10.730
false rejection of H0.
00:45:10.730 --> 00:45:15.070
If my data falls here
I reject H0.
00:45:15.070 --> 00:45:16.890
I decide H1.
00:45:16.890 --> 00:45:19.510
Whereas H0 was true.
00:45:19.510 --> 00:45:21.690
The probability of
this happening?
00:45:21.690 --> 00:45:24.890
Let's call it alpha.
00:45:24.890 --> 00:45:28.040
But there's another kind of
error that can be made.
00:45:28.040 --> 00:45:32.810
Suppose that H1 was true, but by
accident my data happens to
00:45:32.810 --> 00:45:34.250
fall on that side.
00:45:34.250 --> 00:45:36.610
Then I'm going to make
an error again.
00:45:36.610 --> 00:45:40.540
I'm going to decide H0 even
though H1 was true.
00:45:40.540 --> 00:45:42.570
How likely is this to occur?
00:45:42.570 --> 00:45:46.420
This would be the area under
this curve here.
00:45:46.420 --> 00:45:50.600
And that's the other type of
error than can be made, and
00:45:50.600 --> 00:45:55.400
beta is the probability of this
particular type of error.
00:45:55.400 --> 00:45:57.550
Both of these are errors.
00:45:57.550 --> 00:45:59.640
Alpha is the probability
of error of one kind.
00:45:59.640 --> 00:46:02.110
Beta is the probability of an
error of the other kind.
00:46:02.110 --> 00:46:03.510
You would like the
probabilities
00:46:03.510 --> 00:46:05.050
of error to be small.
00:46:05.050 --> 00:46:07.550
So you would like to
make both alpha and
00:46:07.550 --> 00:46:09.780
beta as small as possible.
00:46:09.780 --> 00:46:13.300
Unfortunately that's not
possible, there's a trade-off.
00:46:13.300 --> 00:46:17.540
If I move my threshold this
way, then alpha becomes
00:46:17.540 --> 00:46:20.760
smaller, but beta
becomes bigger.
00:46:20.760 --> 00:46:22.770
So there's a trade-off.
00:46:22.770 --> 00:46:29.350
If I make my rejection region
smaller one kind of error is
00:46:29.350 --> 00:46:31.880
less likely, but the
other kind of error
00:46:31.880 --> 00:46:34.670
becomes more likely.
00:46:34.670 --> 00:46:38.050
So we got this trade-off.
00:46:38.050 --> 00:46:39.620
So what do we do about it?
00:46:39.620 --> 00:46:41.570
How do we move systematically?
00:46:41.570 --> 00:46:45.680
How do we come up with
rejection regions?
00:46:45.680 --> 00:46:48.900
Well, what the theory basically
tells you is it
00:46:48.900 --> 00:46:53.200
tells you how you should
create those regions.
00:46:53.200 --> 00:46:57.860
But it doesn't tell
you exactly how.
00:46:57.860 --> 00:47:00.970
It tells you the general
shape of those regions.
00:47:00.970 --> 00:47:05.120
For example here, the theory
tells us that the right
00:47:05.120 --> 00:47:07.430
thing to do would be to put
the threshold and make
00:47:07.430 --> 00:47:10.910
decisions one way to the right,
one way to the left.
00:47:10.910 --> 00:47:12.830
But it might not necessarily
tell us
00:47:12.830 --> 00:47:15.020
where to put the threshold.
00:47:15.020 --> 00:47:18.890
Still, it's useful enough to
know that the way to make a
00:47:18.890 --> 00:47:20.960
good decision would
be in terms of
00:47:20.960 --> 00:47:22.400
a particular threshold.
00:47:22.400 --> 00:47:24.770
Let me make this
more specific.
00:47:24.770 --> 00:47:27.380
We can take our inspiration
from the solution of the
00:47:27.380 --> 00:47:29.820
hypothesis testing problem
that we had in
00:47:29.820 --> 00:47:31.370
the Bayesian case.
00:47:31.370 --> 00:47:34.130
In the Bayesian case we just
pick the hypothesis which is
00:47:34.130 --> 00:47:37.480
more likely given the data.
00:47:37.480 --> 00:47:40.080
The posterior
probabilities, produced using the Bayes
00:47:40.080 --> 00:47:42.770
rule, are written
this way.
00:47:42.770 --> 00:47:45.240
And this term is the
same as that term.
00:47:45.240 --> 00:47:49.500
They cancel out, then let me
collect terms here and there.
00:47:52.370 --> 00:47:54.030
I get an expression here.
00:47:54.030 --> 00:47:56.090
I think the version you
have in your handout
00:47:56.090 --> 00:47:57.340
is the correct one.
00:47:59.810 --> 00:48:02.082
The one on the slide was
not the correct one, so
00:48:02.082 --> 00:48:03.730
I'm fixing it here.
00:48:03.730 --> 00:48:06.920
OK, so this is the form of how
you make decisions in the
00:48:06.920 --> 00:48:08.720
Bayesian case.
00:48:08.720 --> 00:48:10.620
What you do in the Bayesian
case, you
00:48:10.620 --> 00:48:13.270
calculate this ratio.
00:48:13.270 --> 00:48:17.110
Let's call it the likelihood
ratio.
00:48:17.110 --> 00:48:20.770
And compare that ratio
to a threshold.
00:48:20.770 --> 00:48:22.916
And the threshold that you
should be using in the
00:48:22.916 --> 00:48:25.240
Bayesian case has something
to do with the prior
00:48:25.240 --> 00:48:28.000
probabilities of the
two hypotheses.
00:48:28.000 --> 00:48:31.840
In the non-Bayesian case we do
not have prior probabilities,
00:48:31.840 --> 00:48:34.690
so we do not know how to
set this threshold.
00:48:34.690 --> 00:48:38.350
But we're going to do is we're
going to keep this particular
00:48:38.350 --> 00:48:42.690
structure anyway, and maybe use
some other considerations
00:48:42.690 --> 00:48:44.480
to pick the threshold.
00:48:44.480 --> 00:48:51.030
So we're going to use a
likelihood ratio test, that's
00:48:51.030 --> 00:48:54.260
how it's called, in which we
calculate a quantity of this
00:48:54.260 --> 00:48:56.830
kind that we call the
likelihood, and compare it
00:48:56.830 --> 00:48:58.480
with a threshold.
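A likelihood ratio test can be sketched concretely. The two normal densities here are our invented example, not from the lecture: H0 says X is N(0, 1) and H1 says X is N(2, 1), and we compare the ratio of the two densities at the observed x with a threshold xi.

```python
# Likelihood ratio test between two candidate normal densities for a
# single observation x. The densities and threshold are invented.
import math

def normal_pdf(x, mu, sigma):
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

def likelihood_ratio(x):
    # f(x; H1) / f(x; H0): how much better does H1 explain x than H0?
    return normal_pdf(x, 2.0, 1.0) / normal_pdf(x, 0.0, 1.0)

def decide(x, xi=1.0):
    # Reject H0 (decide H1) when the data are better explained by H1.
    return "H1" if likelihood_ratio(x) > xi else "H0"
```

In this two-normal example the ratio is monotonically increasing in x, so comparing the ratio to xi is the same as comparing x itself to some threshold, matching the picture in the lecture.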
00:48:58.480 --> 00:49:00.530
So what's the interpretation
of this likelihood?
00:49:03.140 --> 00:49:04.290
We ask--
00:49:04.290 --> 00:49:08.570
the X's that I have observed,
how likely were they to occur
00:49:08.570 --> 00:49:10.460
if H1 was true?
00:49:10.460 --> 00:49:14.590
And how likely were they to
occur if H0 was true?
00:49:14.590 --> 00:49:20.560
This ratio could be big if my
data are plausible, in that they might
00:49:20.560 --> 00:49:22.400
occur under H1.
00:49:22.400 --> 00:49:25.400
But they're very implausible,
extremely unlikely
00:49:25.400 --> 00:49:27.380
to occur under H0.
00:49:27.380 --> 00:49:30.060
Then my thinking would be well
the data that I saw are
00:49:30.060 --> 00:49:33.300
extremely unlikely to have
occurred under H0.
00:49:33.300 --> 00:49:36.780
So H0 is probably not true.
00:49:36.780 --> 00:49:39.820
I'm going to go for
H1 and choose H1.
00:49:39.820 --> 00:49:43.920
So when this ratio is big it
tells us that the data that
00:49:43.920 --> 00:49:47.720
we're seeing are better
explained if we assume H1 to
00:49:47.720 --> 00:49:50.620
be true rather than
H0 to be true.
00:49:50.620 --> 00:49:53.970
So I calculate this quantity,
compare it with a threshold,
00:49:53.970 --> 00:49:56.200
and that's how I make
my decision.
00:49:56.200 --> 00:49:59.360
So in this particular picture,
for example the way it would
00:49:59.360 --> 00:50:02.930
go would be the likelihood ratio
in this picture goes
00:50:02.930 --> 00:50:07.230
monotonically with my X. So
comparing the likelihood ratio
00:50:07.230 --> 00:50:10.150
to the threshold would be the
same as comparing my x to the
00:50:10.150 --> 00:50:12.890
threshold, and we've got
the question of how
00:50:12.890 --> 00:50:13.920
to choose the threshold.
00:50:13.920 --> 00:50:17.880
The way that the threshold is
chosen is usually done by
00:50:17.880 --> 00:50:21.560
fixing one of the two
probabilities of error.
00:50:21.560 --> 00:50:26.710
That is, I say, that I want my
error of one particular type
00:50:26.710 --> 00:50:30.160
to be a given number,
so I fix this alpha.
00:50:30.160 --> 00:50:33.160
And then I try to find where
my threshold should be.
00:50:33.160 --> 00:50:36.095
So that this probability, the
probability out there,
00:50:36.095 --> 00:50:39.190
is just equal to alpha.
00:50:39.190 --> 00:50:42.050
And then the other probability
of error, beta, will be
00:50:42.050 --> 00:50:44.190
whatever it turns out to be.
00:50:44.190 --> 00:50:48.140
So somebody picks alpha
ahead of time.
00:50:48.140 --> 00:50:52.210
Based on the probability of
a false rejection, namely
00:50:52.210 --> 00:50:55.890
alpha, I find where my threshold
is going to be.
00:50:55.890 --> 00:50:59.890
I choose my threshold, and that
determines subsequently
00:50:59.890 --> 00:51:01.270
the value of beta.
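That recipe, fix alpha, derive the threshold, and accept whatever beta comes out, can be sketched with the same invented two-normal example as above (H0: N(0, 1), H1: N(2, 1)); the alpha value is also just an illustrative choice.

```python
# Fix the false-rejection probability alpha under H0, solve for the
# threshold xi, then compute the resulting beta under H1.
from statistics import NormalDist

alpha = 0.05
h0 = NormalDist(mu=0.0, sigma=1.0)   # distribution of X under H0
h1 = NormalDist(mu=2.0, sigma=1.0)   # distribution of X under H1

# Threshold: reject H0 when x > xi, so that P(X > xi; H0) = alpha.
xi = h0.inv_cdf(1.0 - alpha)

# Beta is then whatever it turns out to be: P(X <= xi; H1),
# the probability of deciding H0 when H1 is actually true.
beta = h1.cdf(xi)
```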
00:51:01.270 --> 00:51:07.340
So we're going to continue with
this story next time, and
00:51:07.340 --> 00:51:08.590
we'll stop here.