WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:15.210
from hundreds of
MIT courses, visit
00:00:15.210 --> 00:00:17.360
MIT OpenCourseWare at ocw.mit.edu.
00:00:20.292 --> 00:00:22.790
PHILIPPE RIGOLLET: It's
because if I was not,
00:00:22.790 --> 00:00:25.640
this would be basically the
last topic we would ever see.
00:00:25.640 --> 00:00:29.201
And this is arguably, probably
the most important topic
00:00:29.201 --> 00:00:30.950
in statistics, or at
least that's probably
00:00:30.950 --> 00:00:33.470
the reason why most of
you are taking this class.
00:00:33.470 --> 00:00:36.980
Because regression
implies prediction,
00:00:36.980 --> 00:00:39.260
and prediction is what people
are after now, right?
00:00:39.260 --> 00:00:40.340
You don't need to
understand what
00:00:40.340 --> 00:00:41.960
the model for the
financial market
00:00:41.960 --> 00:00:43.460
is if you actually
have a formula
00:00:43.460 --> 00:00:47.430
to predict what the stock
prices are going to be tomorrow.
00:00:47.430 --> 00:00:49.705
And regression, in a way,
allows us to do that.
00:00:49.705 --> 00:00:52.080
And we'll start with a very
simple version of regression,
00:00:52.080 --> 00:00:55.130
which is linear regression,
which is the most standard one.
00:00:55.130 --> 00:00:58.070
And then we'll move on to
slightly more advanced notions
00:00:58.070 --> 00:00:59.560
such as nonparametric
regression.
00:00:59.560 --> 00:01:02.960
At least, we're going to see
the principles behind it.
00:01:02.960 --> 00:01:06.570
And I'll touch upon a little bit
of high dimensional regression,
00:01:06.570 --> 00:01:09.450
which is what people
are doing today.
00:01:09.450 --> 00:01:12.290
So the goal of
regression is to try
00:01:12.290 --> 00:01:16.220
to predict one variable
based on another variable.
00:01:16.220 --> 00:01:19.290
All right, so here the
notation is very important.
00:01:19.290 --> 00:01:22.220
It's extremely standard.
00:01:22.220 --> 00:01:25.600
It goes everywhere essentially,
and essentially you're
00:01:25.600 --> 00:01:29.840
trying to explain y
as a function of x,
00:01:29.840 --> 00:01:33.320
which is the usual y
equals f of x question--
00:01:33.320 --> 00:01:36.090
except that, you know, if
you look at a calculus class,
00:01:36.090 --> 00:01:39.930
people tell you y equals
f of x, and they give you
00:01:39.930 --> 00:01:42.142
a specific form for f,
and then you do something.
00:01:42.142 --> 00:01:43.850
Here, we're just going
to try to estimate
00:01:43.850 --> 00:01:46.040
what this link function is.
00:01:46.040 --> 00:01:49.790
And this is why we often
call y the explained variable
00:01:49.790 --> 00:01:52.800
and x the explanatory variable.
00:01:52.800 --> 00:01:55.960
All right, so we're
statisticians,
00:01:55.960 --> 00:01:56.925
so we start with data.
00:01:56.925 --> 00:01:58.800
All right, then what
does our data look like?
00:01:58.800 --> 00:02:01.820
Well, it looks like a
bunch of input, output
00:02:01.820 --> 00:02:03.110
to this relationship.
00:02:03.110 --> 00:02:05.870
All right, so we have
a bunch of xi, yi.
00:02:05.870 --> 00:02:09.280
Those are pairs, and I can do
a scatterplot of those guys.
00:02:09.280 --> 00:02:14.390
So each point here has an
x-coordinate, which is xi,
00:02:14.390 --> 00:02:16.550
and a y-coordinate,
which is yi, and here, I
00:02:16.550 --> 00:02:17.810
have a bunch of n points.
00:02:17.810 --> 00:02:19.830
And I just draw them like that.
00:02:19.830 --> 00:02:23.700
Now, the functions we're
going to be interested in
00:02:23.700 --> 00:02:30.170
are often functions of the form
y equals a plus b times x, OK.
00:02:30.170 --> 00:02:32.870
And that means that this
function looks like this.
00:02:36.310 --> 00:02:38.200
So if I do x and
y, this function
00:02:38.200 --> 00:02:41.530
looks exactly like a line,
and clearly those points
00:02:41.530 --> 00:02:42.980
are not on the line.
00:02:42.980 --> 00:02:44.577
And it will basically
never happen
00:02:44.577 --> 00:02:45.910
that those points are on a line.
00:02:45.910 --> 00:02:48.460
There's a famous
T-shirt from, I think,
00:02:48.460 --> 00:02:50.320
U.C. Berkeley's
stat department,
00:02:50.320 --> 00:02:52.736
that shows this picture
and puts a line between them
00:02:52.736 --> 00:02:53.860
like we're going to see it.
00:02:53.860 --> 00:02:56.890
And it says, oh,
statisticians, so many points,
00:02:56.890 --> 00:02:59.590
and you still managed
to miss all of them.
00:02:59.590 --> 00:03:04.150
And so essentially, we don't
believe that this relationship
00:03:04.150 --> 00:03:08.912
y is equal to a plus bx is true,
but maybe up to some noise.
00:03:08.912 --> 00:03:11.370
And that's where the statistics
is going to come into play.
00:03:11.370 --> 00:03:13.995
There's going to be some random
noise that's going to play out,
00:03:13.995 --> 00:03:17.470
and hopefully the noise is
going to be spread out evenly,
00:03:17.470 --> 00:03:20.950
so that we can average it
if we have enough points.
00:03:20.950 --> 00:03:22.830
Average it out, OK.
00:03:22.830 --> 00:03:26.910
And so this epsilon here is not
necessarily due to randomness.
00:03:26.910 --> 00:03:29.387
But again, just like we did
modeling in the first place,
00:03:29.387 --> 00:03:30.970
it essentially
accounts for everything
00:03:30.970 --> 00:03:33.520
we don't understand
about this relationship.
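NOTE
A minimal sketch of the model just described, y = a + b x + epsilon, to make the picture of points scattered around a line concrete. The values a = 1 and b = 2, the uniform distribution for x, the Gaussian noise, and the helper name simulate_linear_data are all assumptions made only for this illustration; the lecture has not fixed a distribution for epsilon at this point.
```python
import random
# Hypothetical helper, not from the lecture: simulate pairs (x_i, y_i)
# following y = a + b * x + eps.
def simulate_linear_data(a, b, n, noise_sd=1.0, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    pairs = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)      # assumed distribution for x
        eps = rng.gauss(0.0, noise_sd)  # assumed Gaussian noise
        pairs.append((x, a + b * x + eps))
    return pairs
# Draw 1000 points around the line y = 1 + 2x.
pairs = simulate_linear_data(a=1.0, b=2.0, n=1000)
# Vertical deviations of each point from the true line.
residuals = [y - (1.0 + 2.0 * x) for x, y in pairs]
mean_residual = sum(residuals) / len(residuals)
print(f"mean residual over {len(pairs)} points: {mean_residual:.3f}")
```
No single point sits exactly on the line, but with enough points the deviations should average out close to zero, which is the sense of "average it out" above.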
00:03:33.520 --> 00:03:36.200
All right, so for example--
00:03:36.200 --> 00:03:37.630
so here, I'm not going to be--
00:03:37.630 --> 00:03:41.960
give me one second, so we'll
see an example in a second.
00:03:41.960 --> 00:03:44.050
But the idea here is
that if you have data,
00:03:44.050 --> 00:03:45.850
and if you believe
that it's of the form,
00:03:45.850 --> 00:03:47.740
a plus b x plus
some noise, you're
00:03:47.740 --> 00:03:50.410
trying to find the line
that will explain your data
00:03:50.410 --> 00:03:51.640
the best, right?
00:03:51.640 --> 00:03:54.650
In the terminology
we've been using before,
00:03:54.650 --> 00:03:58.060
this would be the most likely
line that explains the data.
00:03:58.060 --> 00:03:59.680
So we can see that
it's slightly--
00:03:59.680 --> 00:04:01.145
we've just added
another dimension
00:04:01.145 --> 00:04:02.270
to our statistical problem.
00:04:02.270 --> 00:04:04.353
We don't have just x's,
but we have y's, and we're
00:04:04.353 --> 00:04:07.450
trying to find the most likely
explanation of the relationship
00:04:07.450 --> 00:04:09.220
between y and x.
00:04:09.220 --> 00:04:12.379
All right, and so
in practice, the way
00:04:12.379 --> 00:04:14.920
it's going to look like is that
we're going to have basically
00:04:14.920 --> 00:04:17.470
two parameters to
find: the slope b
00:04:17.470 --> 00:04:20.240
and the intercept
a, and given data,
00:04:20.240 --> 00:04:23.120
the goal is going to be to try
to find the best possible line.
00:04:23.120 --> 00:04:24.514
All right?
00:04:24.514 --> 00:04:25.930
So what we're going
to find is not
00:04:25.930 --> 00:04:29.310
exactly a and b, the ones that
actually generate the data,
00:04:29.310 --> 00:04:33.790
but some estimators of those
parameters, a hat and b hat
00:04:33.790 --> 00:04:35.680
constructed from the data.
00:04:35.680 --> 00:04:38.260
All right, so we'll see
that more generally,
00:04:38.260 --> 00:04:40.990
but we're not going to go too
much in the details of this.
00:04:40.990 --> 00:04:42.190
There's actually
quite a bit that you
00:04:42.190 --> 00:04:43.773
can understand if
you do what's called
00:04:43.773 --> 00:04:47.260
univariate regression
when x is actually
00:04:47.260 --> 00:04:49.940
a real valued random variable.
00:04:49.940 --> 00:04:52.720
So when this happens, this is
called univariate regression.
00:04:59.640 --> 00:05:05.550
And when x is in R^p for p
larger than or equal to 2,
00:05:05.550 --> 00:05:07.810
this is called
multivariate regression.
00:05:16.640 --> 00:05:20.340
OK, and so here we're
just trying to explain y
00:05:20.340 --> 00:05:23.940
is a plus bx plus epsilon.
00:05:23.940 --> 00:05:26.620
And here we're going to have
something more complicated.
00:05:26.620 --> 00:05:33.510
We're going to have y, which is
equal to a plus b1 x1 plus b2
00:05:33.510 --> 00:05:39.780
x2 plus ... plus bp xp plus epsilon--
00:05:39.780 --> 00:05:42.150
where x is equal to--
00:05:42.150 --> 00:05:46.710
the coordinates of x are
given by x1 to xp in R^p.
00:05:46.710 --> 00:05:49.200
OK, so it's still linear.
00:05:49.200 --> 00:05:51.030
Right, I still add
all the coordinates
00:05:51.030 --> 00:05:53.370
of x with a coefficient
in front of them,
00:05:53.370 --> 00:05:56.360
but it's a bit more complicated
than just one coefficient
00:05:56.360 --> 00:05:58.770
for one coordinate of x, OK?
00:05:58.770 --> 00:06:03.420
So we'll come back to
multivariate regression.
00:06:03.420 --> 00:06:08.280
Of course, you can write
this as x transpose b, right?
00:06:08.280 --> 00:06:14.800
So this entire thing here,
this linear combination
00:06:14.800 --> 00:06:17.710
is of the form x
transpose b, where
00:06:17.710 --> 00:06:23.310
b is the vector that has
coordinates b1 to bp.
00:06:23.310 --> 00:06:25.700
OK?
00:06:25.700 --> 00:06:31.100
Sorry, here, it's in [? rd, ?]
p is the natural notation.
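NOTE
Written out, the two models described here are as follows. This is a reconstruction from the spoken description, with b denoting the vector of coefficients b_1 to b_p in the multivariate case.
```latex
\begin{align*}
\text{univariate: } & Y = a + bX + \varepsilon, \qquad X \in \mathbb{R}, \\
\text{multivariate: } & Y = a + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p + \varepsilon
  = a + X^\top b + \varepsilon, \qquad X = (X_1, \ldots, X_p)^\top \in \mathbb{R}^p .
\end{align*}
```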
00:06:31.100 --> 00:06:35.660
All right, so our goal
here, in the univariate one,
00:06:35.660 --> 00:06:38.360
is to try to write
the model, make sense
00:06:38.360 --> 00:06:40.910
of this little twiddle here--
00:06:40.910 --> 00:06:44.050
essentially, from a
statistical modeling question,
00:06:44.050 --> 00:06:47.480
the question is going to be,
what distributional assumptions
00:06:47.480 --> 00:06:48.730
do you want to put on epsilon?
00:06:48.730 --> 00:06:50.313
Are you going to say
they're Gaussian?
00:06:50.313 --> 00:06:52.720
Are you going to say
they're binomial?
00:07:00.160 --> 00:07:03.450
OK, are you going to
say they're binomial?
00:07:03.450 --> 00:07:05.532
Are you going to say
they're Bernoulli?
00:07:05.532 --> 00:07:07.990
So that's going to be what
we're going to make sense of,
00:07:07.990 --> 00:07:10.230
and then we're going
to try to find a method
00:07:10.230 --> 00:07:11.700
to estimate a and b.
00:07:11.700 --> 00:07:13.680
And then maybe we're
going to try to do
00:07:13.680 --> 00:07:15.030
some inference about a and b--
00:07:15.030 --> 00:07:18.390
maybe test if a and b take
certain values, if they're
00:07:18.390 --> 00:07:20.850
less than something,
maybe find some confidence
00:07:20.850 --> 00:07:24.290
regions for a and b, all right?
00:07:24.290 --> 00:07:25.990
So why would you
want to do this?
00:07:25.990 --> 00:07:29.810
Well, I'm sure all of you have
an application, if I give you
00:07:29.810 --> 00:07:32.260
some x, you're trying
to predict what y is.
00:07:32.260 --> 00:07:34.730
Machine learning is all
about doing this, right?
00:07:34.730 --> 00:07:36.994
Without maybe trying
to even understand
00:07:36.994 --> 00:07:38.660
the physics behind
this, they're saying,
00:07:38.660 --> 00:07:40.590
well, you give me
a bag of words,
00:07:40.590 --> 00:07:43.520
I want to understand whether
it's going to be a spam or not.
00:07:43.520 --> 00:07:47.370
You give me a bunch of
economic indicators,
00:07:47.370 --> 00:07:51.530
I want you to tell me how much
I should be selling my car for.
00:07:51.530 --> 00:07:55.774
You give me a bunch of
measurements on some patient,
00:07:55.774 --> 00:07:57.440
I want you to predict
how this person is
00:07:57.440 --> 00:08:00.110
going to respond to my
drug-- and things like this.
00:08:00.110 --> 00:08:04.830
All right, and often we actually
don't have much modeling
00:08:04.830 --> 00:08:07.380
intuition about what the
relationship between x and y
00:08:07.380 --> 00:08:10.350
is, and this linear thing is
basically the simplest function
00:08:10.350 --> 00:08:11.530
we can think of.
00:08:11.530 --> 00:08:15.235
Arguably, linear functions
are the simplest functions
00:08:15.235 --> 00:08:16.110
that are not trivial.
00:08:16.110 --> 00:08:19.110
Otherwise, we would just say,
well, let's just predict f of x
00:08:19.110 --> 00:08:21.445
to be a constant, meaning
it does not depend on x.
00:08:21.445 --> 00:08:23.070
But if you want it
to depend on x, then
00:08:23.070 --> 00:08:25.710
linear functions are basically
as simple as it gets.
00:08:25.710 --> 00:08:30.840
It turns out, amazingly, this
does the trick quite often.
00:08:30.840 --> 00:08:33.750
So for example, if
you look at economics,
00:08:33.750 --> 00:08:35.909
you might want to assume
that the demand is
00:08:35.909 --> 00:08:38.039
a linear function of the price.
00:08:38.039 --> 00:08:40.200
So if your price
is zero, there's
00:08:40.200 --> 00:08:41.640
going to be a certain demand.
00:08:41.640 --> 00:08:45.037
And as the price increases,
the demand is going to move.
00:08:45.037 --> 00:08:47.370
Do you think b is going to
be positive or negative here?
00:08:51.000 --> 00:08:52.069
What?
00:08:52.069 --> 00:08:53.610
Typically, it's
negative unless we're
00:08:53.610 --> 00:08:56.292
talking about
maybe luxury goods,
00:08:56.292 --> 00:08:57.750
where you know,
the more expensive,
00:08:57.750 --> 00:09:00.130
the more people
actually want it.
00:09:00.130 --> 00:09:02.380
I mean, if we're talking
about actual economic demand,
00:09:02.380 --> 00:09:06.030
that's probably
definitely negative.
00:09:06.030 --> 00:09:11.360
It doesn't have to be,
you know, clearly linear,
00:09:11.360 --> 00:09:13.724
so that you can actually
make it linear, transform it
00:09:13.724 --> 00:09:14.640
into something linear.
00:09:14.640 --> 00:09:17.520
So for example, you have
this like multiplicative
00:09:17.520 --> 00:09:24.330
relationship, PV equals nRT,
which is the Ideal gas law.
00:09:24.330 --> 00:09:26.670
If you want to actually
write this relationship,
00:09:26.670 --> 00:09:28.680
if you want to predict
what the pressure is
00:09:28.680 --> 00:09:33.690
going to be as a function of
the volume and the temperature--
00:09:33.690 --> 00:09:37.810
and well, let's assume that
n is the Avogadro constant,
00:09:37.810 --> 00:09:42.060
and let's assume that the
constant R is actually fixed.
00:09:42.060 --> 00:09:47.840
Then you take the log on each
side, so you get PV equals nRT.
00:10:03.610 --> 00:10:07.690
So what that means is that
log PV is equal to log nRT.
00:10:10.600 --> 00:10:23.180
So that means log P plus log V
is equal to the log nR plus log
00:10:23.180 --> 00:10:28.737
T. So we said that R is
constant, so this is actually
00:10:28.737 --> 00:10:29.320
your constant.
00:10:29.320 --> 00:10:31.400
I'm going to call it a.
00:10:31.400 --> 00:10:35.410
And then that
means that log P is
00:10:35.410 --> 00:10:49.430
equal to minus log V. That
log P is equal to a minus log
00:10:49.430 --> 00:10:55.070
V plus log T. OK?
00:10:55.070 --> 00:11:01.650
And so in particular, if I
write b equal to negative 1
00:11:01.650 --> 00:11:04.800
and c equal to plus 1,
this gives me the formula
00:11:04.800 --> 00:11:06.210
that I have here.
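NOTE
A small numerical check of the log transformation above, offered as a sketch. The gas constant R is the standard value; taking n = 1 mole and the particular volumes and temperatures are assumptions chosen only for illustration.
```python
import math
R = 8.314462618  # ideal gas constant in J/(mol K)
n = 1.0          # amount of gas in moles (assumed for this example)
a = math.log(n * R)  # the constant term in the log-linear form
# Pressure from the ideal gas law, P = nRT / V.
def pressure(volume, temperature):
    return n * R * temperature / volume
# A few (volume in m^3, temperature in K) settings, chosen arbitrarily.
settings = [(0.01, 273.15), (0.02, 300.0), (0.05, 350.0)]
for volume, temperature in settings:
    log_p = math.log(pressure(volume, temperature))
    # Linear form with coefficients b = -1 on log V and c = +1 on log T.
    linear_form = a + (-1.0) * math.log(volume) + 1.0 * math.log(temperature)
    print(f"log P = {log_p:.6f}, a - log V + log T = {linear_form:.6f}")
```
With noiseless values the two columns agree up to rounding; measured data would only agree approximately, which is what the twiddle stands for.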
00:11:06.210 --> 00:11:10.670
Now again, it might be the case
that this is the ideal gas law.
00:11:10.670 --> 00:11:12.960
So in practice, if I
start recording pressure,
00:11:12.960 --> 00:11:16.830
and temperature, and volume, I
might make measurement errors,
00:11:16.830 --> 00:11:18.950
there might be slightly
different conditions
00:11:18.950 --> 00:11:21.346
in such a way that I'm not
going to get exactly those.
00:11:21.346 --> 00:11:23.220
And I'm just going to
put this little twiddle
00:11:23.220 --> 00:11:25.350
to account for the fact
that the points that I'm
00:11:25.350 --> 00:11:28.170
going to be recording
for log pressure,
00:11:28.170 --> 00:11:30.180
log volume, and log
temperature are not going
00:11:30.180 --> 00:11:32.590
to be exactly on one line.
00:11:32.590 --> 00:11:33.840
OK, they're going to be close.
00:11:33.840 --> 00:11:36.150
Actually, in those
physics experiments,
00:11:36.150 --> 00:11:39.600
usually, they're very close
because the conditions
00:11:39.600 --> 00:11:41.740
are controlled under
lab experiments.
00:11:41.740 --> 00:11:44.670
So it means that the
noise is very small.
00:11:44.670 --> 00:11:47.160
But for other cases,
like demand and prices,
00:11:47.160 --> 00:11:50.820
it's not a law of physics,
and so this must change.
00:11:50.820 --> 00:11:53.180
Even the linear structure is
probably not clear, right.
00:11:53.180 --> 00:11:54.763
At some points,
there's probably going
00:11:54.763 --> 00:11:57.550
to be some weird
curvature happening.
00:11:57.550 --> 00:12:00.910
All right, so this slide is
just to tell you maybe you
00:12:00.910 --> 00:12:03.071
don't have, obviously,
a linear relationship,
00:12:03.071 --> 00:12:04.570
but maybe you do
if you start taking
00:12:04.570 --> 00:12:08.380
logs, exponentials, squares.
00:12:08.380 --> 00:12:10.820
You can sometimes take the
product of two variables,
00:12:10.820 --> 00:12:12.040
things like this, right.
00:12:12.040 --> 00:12:13.570
So this is variable
transformation,
00:12:13.570 --> 00:12:15.610
and it's mostly
domain-specific, so we're not
00:12:15.610 --> 00:12:18.076
going to go into
more details of this.
00:12:18.076 --> 00:12:19.480
Any questions?
00:12:22.290 --> 00:12:27.100
All right, so now I'm
going to be giving--
00:12:27.100 --> 00:12:29.100
so if we start thinking
a little more about what
00:12:29.100 --> 00:12:32.100
these coefficients
should be, well,
00:12:32.100 --> 00:12:34.440
remember-- so
everybody's clear why
00:12:34.440 --> 00:12:36.280
I don't put the little i here?
00:12:41.971 --> 00:12:43.970
Right, I don't put the
little i because I'm just
00:12:43.970 --> 00:12:47.120
talking about a generic
x and a generic y,
00:12:47.120 --> 00:12:49.870
but the observations
are x1, y1, right.
00:12:49.870 --> 00:12:53.450
So typically, on
the blackboard I'm
00:12:53.450 --> 00:13:02.980
often going to write only xy,
but the data really is x1,
00:13:02.980 --> 00:13:07.180
y1, all the way to xn, yn.
00:13:07.180 --> 00:13:10.810
So those are those points in
this two dimensional plot.
00:13:10.810 --> 00:13:21.830
But I think of those as being
independent copies of the pair
00:13:21.830 --> 00:13:24.500
xy.
00:13:24.500 --> 00:13:26.120
They have to have--
00:13:26.120 --> 00:13:27.420
to contain their relationship.
00:13:27.420 --> 00:13:29.630
And so when I talk
about distribution
00:13:29.630 --> 00:13:32.420
of those random variables, I
talk about the distribution
00:13:32.420 --> 00:13:34.240
of xy, and that's the same.
00:13:34.240 --> 00:13:36.950
All right, so the first
thing you might want to ask
00:13:36.950 --> 00:13:41.790
is, well, if I have an
infinite amount of data,
00:13:41.790 --> 00:13:44.390
what can I hope to
get for a and b?
00:13:44.390 --> 00:13:46.350
If my sample size
goes to infinity,
00:13:46.350 --> 00:13:48.110
then I should actually
know exactly what
00:13:48.110 --> 00:13:50.040
the distribution of xy is.
00:13:50.040 --> 00:13:52.670
And so there should
be an a and a b
00:13:52.670 --> 00:13:57.305
that captures this linear
relationship between y and x.
00:13:57.305 --> 00:13:59.180
And so in particular,
we're going
00:13:59.180 --> 00:14:02.709
to try to ask for the population,
or theoretic, values of a and b,
00:14:02.709 --> 00:14:04.250
and you can see that
you can actually
00:14:04.250 --> 00:14:05.960
compute them explicitly.
00:14:05.960 --> 00:14:08.510
So let's just try to find how.
00:14:08.510 --> 00:14:10.640
So as I said, we have
a bunch of points
00:14:10.640 --> 00:14:16.460
on this line close
to a line, and I'm
00:14:16.460 --> 00:14:20.520
trying to find the best fit.
00:14:20.520 --> 00:14:23.330
All right, so this
guy is not a good fit.
00:14:23.330 --> 00:14:24.960
This guy is not a good fit.
00:14:24.960 --> 00:14:27.870
And we know that this guy
is a good fit somehow.
00:14:27.870 --> 00:14:30.680
So we need to mathematically
formulate the fact
00:14:30.680 --> 00:14:35.150
that this line here is
better than this line here
00:14:35.150 --> 00:14:37.460
or better than this line here.
00:14:37.460 --> 00:14:41.030
So what we're trying to do
is to create a function that
00:14:41.030 --> 00:14:43.580
has values that are
smaller for this curve
00:14:43.580 --> 00:14:45.590
and larger for these two curves.
00:14:45.590 --> 00:14:47.630
And the way we do it is
by measuring the fit,
00:14:47.630 --> 00:14:51.740
and the fit is essentially
the aggregate distance
00:14:51.740 --> 00:14:55.310
of all the points to the curve.
00:14:55.310 --> 00:14:56.930
And there's many
ways I can measure
00:14:56.930 --> 00:14:58.550
the distance to a curve.
00:14:58.550 --> 00:15:01.730
So if I want to find so--
let's just open a parenthesis.
00:15:01.730 --> 00:15:03.290
If I have a point
here-- so we're
00:15:03.290 --> 00:15:05.250
going to do it for
one point at a time.
00:15:05.250 --> 00:15:07.120
So if I have a point,
there's many ways
00:15:07.120 --> 00:15:09.530
I can measure its distance
to the curve, right?
00:15:09.530 --> 00:15:12.800
I can measure it like that.
00:15:12.800 --> 00:15:14.690
That is one distance
to the curve.
00:15:14.690 --> 00:15:19.280
I can measure it like that by
having a right angle here that
00:15:19.280 --> 00:15:20.840
is one distance to the curve.
00:15:20.840 --> 00:15:23.430
Or I can measure it like that.
00:15:23.430 --> 00:15:27.490
That is another distance
to the curve, right.
00:15:27.490 --> 00:15:29.650
There's many ways
I can go for it.
00:15:29.650 --> 00:15:31.030
It turns out that
one is actually
00:15:31.030 --> 00:15:33.040
going to be fairly
convenient for us,
00:15:33.040 --> 00:15:36.910
and that's the one that says,
let's look at the square
00:15:36.910 --> 00:15:38.720
of the value of x on the curve.
00:15:38.720 --> 00:15:43.690
So if this is the curve,
y is equal to a plus bx.
00:15:51.260 --> 00:15:54.140
Now, I'm going to think of
this point as a random point,
00:15:54.140 --> 00:15:57.050
capital X, capital
Y, so that means
00:15:57.050 --> 00:16:02.210
that it's going to be x1,
y1 or x2, y2, et cetera.
00:16:02.210 --> 00:16:04.250
Now, I want to
measure the distance.
00:16:04.250 --> 00:16:06.390
Can somebody tell me
which of the three--
00:16:06.390 --> 00:16:08.870
the first one, the second
one, or the third one--
00:16:08.870 --> 00:16:13.610
this formula, expectation of y
minus a minus bx squared is--
00:16:13.610 --> 00:16:18.578
which of the three
is it representing?
00:16:18.578 --> 00:16:20.020
AUDIENCE: The second one.
00:16:20.020 --> 00:16:21.395
PHILIPPE RIGOLLET:
The second one
00:16:21.395 --> 00:16:22.740
where I have the right angle?
00:16:22.740 --> 00:16:26.710
OK, everybody agrees with this?
00:16:26.710 --> 00:16:28.730
Anybody wants to vote
for something else?
00:16:28.730 --> 00:16:29.320
Yeah?
00:16:29.320 --> 00:16:30.320
AUDIENCE: The third one?
00:16:30.320 --> 00:16:31.695
PHILIPPE RIGOLLET:
The third one?
00:16:31.695 --> 00:16:34.520
Everybody agrees
with the third one?
00:16:34.520 --> 00:16:38.975
So by default, everybody's
on the first one?
00:16:38.975 --> 00:16:42.010
Yeah, it is the vertical
distance actually.
00:16:42.010 --> 00:16:44.555
And the reason is if it was the
one with the straight angle,
00:16:44.555 --> 00:16:46.180
with the right angle,
it would actually
00:16:46.180 --> 00:16:48.430
be a very complicated
mathematical formula,
00:16:48.430 --> 00:16:51.240
so let's just see why, right?
00:16:51.240 --> 00:16:53.470
And by why, I mean y.
00:16:53.470 --> 00:16:59.460
OK, so this means that this
is my x, and this is my y.
00:17:02.500 --> 00:17:05.829
All right, so that means
that this point is xy.
00:17:05.829 --> 00:17:07.900
So what I'm measuring
is the difference
00:17:07.900 --> 00:17:15.965
between y minus
a plus b times x.
00:17:15.965 --> 00:17:18.339
This is the thing I'm going
to take the expectation of--
00:17:18.339 --> 00:17:20.290
the square and then
the expectation-- so a
00:17:20.290 --> 00:17:24.140
plus b times x, if this is
this line, this is this point.
00:17:24.140 --> 00:17:27.310
So that's this value here.
00:17:27.310 --> 00:17:33.254
This value here is
a plus bx, right?
00:17:33.254 --> 00:17:35.170
So what I'm really
measuring is the difference
00:17:35.170 --> 00:17:38.740
between y and a plus bx,
which is this distance here.
00:17:42.400 --> 00:17:45.846
And since I like things
like Pythagoras theorem,
00:17:45.846 --> 00:17:47.470
I'm actually going
to put a square here
00:17:47.470 --> 00:17:51.500
before I take the expectation.
00:17:51.500 --> 00:17:53.090
So now this is a
random variable.
00:17:53.090 --> 00:17:55.210
This is this random variable.
00:17:55.210 --> 00:17:58.420
And so I want a number,
so I'm going to turn it
00:17:58.420 --> 00:18:00.020
into a deterministic number.
00:18:00.020 --> 00:18:03.400
And the way I do this is
by taking expectation.
00:18:03.400 --> 00:18:07.330
And if you think expectations
should be close to average,
00:18:07.330 --> 00:18:09.310
this is the same
thing as saying,
00:18:09.310 --> 00:18:12.010
I want that in
average, the y's are
00:18:12.010 --> 00:18:14.440
close to the a plus bx, right?
00:18:14.440 --> 00:18:16.570
So we're doing it
in expectation,
00:18:16.570 --> 00:18:18.370
but that's going to
translate into doing it
00:18:18.370 --> 00:18:20.650
in average for all the points.
00:18:20.650 --> 00:18:22.850
All right, so this is the
thing I want to measure.
00:18:22.850 --> 00:18:24.500
So that's this
vertical distance.
00:18:24.500 --> 00:18:26.321
Yeah?
00:18:26.321 --> 00:18:26.820
OK.
00:18:32.750 --> 00:18:36.292
This is my fault actually.
00:18:36.292 --> 00:18:37.890
Maybe we should
close those shades.
00:18:50.230 --> 00:18:53.280
OK, I cannot do just
one at a time, sorry.
00:19:11.910 --> 00:19:15.640
All right, so now that I do
those vertical distances,
00:19:15.640 --> 00:19:18.340
I can ask-- well, now,
I have this function,
00:19:18.340 --> 00:19:22.020
right-- I have a function that
takes two parameters a and b,
00:19:22.020 --> 00:19:30.220
maps it to the expectation
of y minus a plus bx squared.
00:19:30.220 --> 00:19:32.170
Sorry, the square is here.
00:19:32.170 --> 00:19:35.080
And I could ask, well,
this is a function that
00:19:35.080 --> 00:19:38.320
measures the fit of the
parameters a and b, right?
00:19:38.320 --> 00:19:40.210
This function should be small.
00:19:40.210 --> 00:19:45.700
The value of this
function here, function
00:19:45.700 --> 00:20:07.370
of a and b that measures
how close the point xy is
00:20:07.370 --> 00:20:14.210
to the line a plus
b times x while y
00:20:14.210 --> 00:20:18.869
is equal to a plus b
times x in expectation.
00:20:23.760 --> 00:20:24.400
OK, agreed?
00:20:24.400 --> 00:20:27.030
This is what we just said.
00:20:27.030 --> 00:20:29.480
Again, if you're not
comfortable with the reason why
00:20:29.480 --> 00:20:32.290
you get expectations, just
think about having data points
00:20:32.290 --> 00:20:34.410
and taking the average
value for this guy.
00:20:34.410 --> 00:20:36.360
So it's basically an
aggregate distance
00:20:36.360 --> 00:20:41.070
of the points to their line.
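NOTE
A sketch of the "average over data points" version of this criterion: the average squared vertical distance from the points to a candidate line. The five data points, the three candidate lines, and the function name average_squared_distance are made up for illustration.
```python
# Average squared vertical distance from the points to the line y = a + b * x.
def average_squared_distance(points, a, b):
    return sum((y - (a + b * x)) ** 2 for x, y in points) / len(points)
# Made-up points lying close to the line y = 1 + 2x.
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]
good_fit = average_squared_distance(points, a=1.0, b=2.0)
too_flat = average_squared_distance(points, a=5.0, b=0.0)
too_low = average_squared_distance(points, a=0.0, b=1.0)
print(good_fit, too_flat, too_low)
```
The line that visually fits gets the smallest value, which is the property the criterion is meant to have.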
00:20:41.070 --> 00:20:44.390
OK, everybody agrees this
is a legitimate measure?
00:20:44.390 --> 00:20:48.150
If all my points were on the
line-- if my distribution--
00:20:48.150 --> 00:20:51.720
if y was actually equal
to a plus bx for some a
00:20:51.720 --> 00:20:54.780
and b then this function
would be equal to 0
00:20:54.780 --> 00:20:57.906
for the correct a and b, right?
00:20:57.906 --> 00:20:59.510
If they are far--
well, it's going
00:20:59.510 --> 00:21:01.460
to depend on how much
noise I'm getting,
00:21:01.460 --> 00:21:04.190
but it's still going to be
minimized for the best one.
00:21:04.190 --> 00:21:06.800
So let's minimize this thing.
00:21:06.800 --> 00:21:11.000
So here, I don't make any--
00:21:11.000 --> 00:21:12.350
again, sorry.
00:21:12.350 --> 00:21:21.800
I don't make an assumption on
the distribution of x or y.
00:21:21.800 --> 00:21:27.290
Here, I assume, somehow,
that the variance of x
00:21:27.290 --> 00:21:28.289
is not equal to 0.
00:21:28.289 --> 00:21:29.330
Can somebody tell me why?
00:21:29.330 --> 00:21:30.310
Yeah?
00:21:30.310 --> 00:21:33.250
AUDIENCE: Not really a
question-- the slides,
00:21:33.250 --> 00:21:38.150
you have y minus a minus bx
quantity squared expectation
00:21:38.150 --> 00:21:41.204
of that, and here you've written
square of the expectation.
00:21:41.204 --> 00:21:42.870
PHILIPPE RIGOLLET:
No, here I'm actually
00:21:42.870 --> 00:21:46.890
in the expectation
of the square.
00:21:46.890 --> 00:21:49.200
If I wanted to write the
square of the expectation,
00:21:49.200 --> 00:21:52.350
I would just do this.
00:21:52.350 --> 00:21:53.680
So let's just make it clear.
00:22:00.970 --> 00:22:01.575
Right?
00:22:01.575 --> 00:22:03.820
Do you want me to put an
extra set of parenthesis?
00:22:03.820 --> 00:22:06.690
That's what you want me to do?
00:22:06.690 --> 00:22:11.034
AUDIENCE: Yeah, it's just
confusing with the [INAUDIBLE]
00:22:11.034 --> 00:22:13.450
PHILIPPE RIGOLLET: OK, that's
the one that makes sense, so
00:22:13.450 --> 00:22:14.700
the square of the expectation?
00:22:14.700 --> 00:22:15.680
AUDIENCE: Yeah.
00:22:15.680 --> 00:22:17.180
PHILIPPE RIGOLLET: Oh, the
expectation of the square,
00:22:17.180 --> 00:22:17.680
sorry.
00:22:20.310 --> 00:22:22.130
Yeah, dyslexia.
00:22:22.130 --> 00:22:25.100
All right, any question?
00:22:25.100 --> 00:22:25.600
Yeah?
00:22:25.600 --> 00:22:28.400
AUDIENCE: Does this assume
that the error is Gaussian?
00:22:28.400 --> 00:22:29.316
PHILIPPE RIGOLLET: No.
00:22:32.290 --> 00:22:34.133
AUDIENCE: I mean, in
the sense that like,
00:22:34.133 --> 00:22:36.980
if we knew that the
error was, like,
00:22:36.980 --> 00:22:40.062
e to the minus, followed
like-- so e to the minus x
00:22:40.062 --> 00:22:44.942
to the fourth distribution,
would we want to minimize
00:22:44.942 --> 00:22:48.358
the expectation of
the fourth power of y minus
00:22:48.358 --> 00:22:52.280
a minus bx in order to get
[? what the ?] [? best is? ?]
00:22:52.280 --> 00:22:53.238
PHILIPPE RIGOLLET: Why?
00:22:57.372 --> 00:22:59.080
So you know the answers
to your question,
00:22:59.080 --> 00:23:01.760
so I just want you to
use the words that--
00:23:01.760 --> 00:23:04.756
right, so why would you want
to use the fourth power?
00:23:04.756 --> 00:23:06.429
AUDIENCE: Well,
because, like, we
00:23:06.429 --> 00:23:08.137
want to more strongly
penalize deviations
00:23:08.137 --> 00:23:11.518
because we'd expect very
large deviations to be
00:23:11.518 --> 00:23:15.870
very rare, or more
rare, than it would
00:23:15.870 --> 00:23:18.170
with the Gaussian
[INAUDIBLE] power.
00:23:18.170 --> 00:23:19.360
PHILIPPE RIGOLLET: Yeah so,
that would be the maximum likelihood
00:23:19.360 --> 00:23:21.290
estimator that you're
describing to me, right?
00:23:21.290 --> 00:23:22.850
I can actually
write the likelihood
00:23:22.850 --> 00:23:25.340
of a pair of numbers ab.
00:23:25.340 --> 00:23:26.847
And if I know this,
that's actually
00:23:26.847 --> 00:23:28.430
what's going to come
into it because I
00:23:28.430 --> 00:23:31.610
know that the density is
going to come into play when
00:23:31.610 --> 00:23:32.740
I talk about there.
00:23:32.740 --> 00:23:34.580
But here, I'm just
talking about--
00:23:34.580 --> 00:23:36.350
this is a mechanical tool.
00:23:36.350 --> 00:23:39.640
I'm just saying, let's minimize
the distance to the curve.
00:23:39.640 --> 00:23:42.320
Another thing I could have
done is take the absolute value
00:23:42.320 --> 00:23:43.750
of this thing, for example.
00:23:43.750 --> 00:23:46.190
I just decided to take the
square before I did it.
00:23:46.190 --> 00:23:48.630
OK, so regardless
of what I'm doing,
00:23:48.630 --> 00:23:50.600
I'm just taking the
squares because that's just
00:23:50.600 --> 00:23:53.600
going to be convenient for me
to do my computations for now.
00:23:53.600 --> 00:23:55.400
But we don't have
any statistical model
00:23:55.400 --> 00:23:56.940
at this point.
00:23:56.940 --> 00:23:59.040
I didn't say anything--
that y follows this.
00:23:59.040 --> 00:24:00.320
X follows this.
00:24:00.320 --> 00:24:01.760
I'm just doing
minimal assumptions
00:24:01.760 --> 00:24:04.250
as we go, all right?
00:24:04.250 --> 00:24:06.140
So the variance of
x is not equal to 0?
00:24:06.140 --> 00:24:07.270
Could somebody tell me why?
00:24:11.330 --> 00:24:14.490
What would my point cloud
look like if the variance of x
00:24:14.490 --> 00:24:16.130
was equal to 0?
00:24:16.130 --> 00:24:18.122
Yeah, they would all
be at the same point.
00:24:18.122 --> 00:24:20.580
So it's going to be hard for
me to start fitting in a line,
00:24:20.580 --> 00:24:21.180
right?
00:24:21.180 --> 00:24:24.100
I mean, best case
scenario, I have this x.
00:24:24.100 --> 00:24:26.700
It has variance zero, so
this is the expectation of x.
00:24:26.700 --> 00:24:31.020
And all my points have
the same expectation,
00:24:31.020 --> 00:24:33.780
and so, yes, I could
probably fit that line.
00:24:33.780 --> 00:24:38.340
But that wouldn't help
very much for other x's.
00:24:38.340 --> 00:24:41.400
So I need a bit of variance
so that things spread out
00:24:41.400 --> 00:24:42.010
a little bit.
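To see numerically why the variance of x has to be nonzero, here is a minimal Python sketch. It relies on the fact, derived later in this lecture, that the slope of the best-fit line is the covariance of x and y divided by the variance of x, with averages standing in for expectations. The function names and the sample values are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def slope(xs, ys):
    """Empirical covariance of (x, y) divided by empirical variance of x."""
    mx, my = mean(xs), mean(ys)
    cov_xy = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    var_x = mean([(x - mx) ** 2 for x in xs])
    if var_x == 0:
        # All x values coincide: the division below would be by zero.
        raise ValueError("variance of x is zero: the slope is not identified")
    return cov_xy / var_x

# Spread-out x values (illustrative): the slope is well defined.
print(slope([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))

# All x values equal: the points sit on one vertical line, so no slope can be recovered.
try:
    slope([1.0, 1.0, 1.0], [0.5, 2.0, 3.5])
except ValueError as err:
    print(err)
```

Raising an error is only one possible design choice here; the point is that nothing sensible can be returned when x does not vary.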
00:24:47.440 --> 00:24:51.130
OK, I'm going to
have to do this.
00:24:51.130 --> 00:24:52.370
I think it's just my--
00:25:10.200 --> 00:25:13.460
All right, so I'm going to
put a little bit of variance.
00:25:13.460 --> 00:25:15.960
And the other thing is here,
I don't want to do much more,
00:25:15.960 --> 00:25:22.440
but I'm actually going to think
of x as having mean zero.
00:25:22.440 --> 00:25:24.430
And the way I do
this is as follows.
00:25:24.430 --> 00:25:30.570
Let's define x tilde, which is
x minus the expectation of x.
00:25:30.570 --> 00:25:33.920
OK, so definitely the
expectation of x tilde is what?
00:25:36.620 --> 00:25:38.110
Zero, OK.
00:25:38.110 --> 00:25:43.350
And so now I want to
minimize in a, b the expectation
00:25:43.350 --> 00:25:53.920
of y minus a plus bx, squared.
00:25:53.920 --> 00:26:03.810
And the way I'm going to do this
is by turning x into x tilde
00:26:03.810 --> 00:26:07.060
and stuffing the extra--
00:26:07.060 --> 00:26:12.760
and putting the extra
expectation of x into the a.
00:26:12.760 --> 00:26:19.840
So I'm going to write this as
an expectation of y minus a plus
00:26:19.840 --> 00:26:25.180
b expectation of x--
00:26:25.180 --> 00:26:27.530
which I'm going to call a tilde--
00:26:27.530 --> 00:26:30.300
and plus b x tilde.
00:26:33.930 --> 00:26:35.630
OK?
00:26:35.630 --> 00:26:38.920
And everybody agrees with this?
00:26:38.920 --> 00:26:41.490
So now I have two
parameters, a tilde and b,
00:26:41.490 --> 00:26:44.350
and I'm going to pretend
that now x tilde--
00:26:44.350 --> 00:26:50.830
so now the role of x is played
by x tilde, which is now
00:26:50.830 --> 00:26:53.020
a centered random variable.
00:26:53.020 --> 00:26:55.660
OK, so I'm going to
call this guy a tilde,
00:26:55.660 --> 00:26:58.859
but for my computations
I'm going to call it a.
00:26:58.859 --> 00:27:00.650
So how do I find the
minimum of this thing?
00:27:05.114 --> 00:27:06.620
Derivative equal to zero, right?
00:27:06.620 --> 00:27:08.235
So here it's a quadratic thing.
00:27:08.235 --> 00:27:09.360
It's going to be like that.
00:27:09.360 --> 00:27:10.880
I take the derivative,
set it to zero.
00:27:10.880 --> 00:27:13.130
So I'm first going to take
the derivative with respect
00:27:13.130 --> 00:27:16.370
to a and set it equal to zero,
so that's equivalent to saying
00:27:16.370 --> 00:27:18.320
that the expectation of--
00:27:18.320 --> 00:27:21.315
well, here, I'm going
to pick up a 2--
00:27:21.315 --> 00:27:33.720
y minus a plus bx
tilde is equal to zero.
00:27:33.720 --> 00:27:36.580
And then I also have that the
derivative with respect to b is
00:27:36.580 --> 00:27:40.260
equal to zero, which is
equivalent to the expectation
00:27:40.260 --> 00:27:42.100
of-- well, I have a
negative sign somewhere,
00:27:42.100 --> 00:27:43.410
so let me put it here--
00:27:43.410 --> 00:27:50.950
minus 2x tilde, y
minus a plus bx tilde.
00:27:55.644 --> 00:27:58.910
OK, see that's why I don't want
to put too many parentheses.
00:28:03.140 --> 00:28:05.741
OK.
00:28:05.741 --> 00:28:07.490
So I just took the
derivative with respect
00:28:07.490 --> 00:28:09.920
to a, which is just
basically the square,
00:28:09.920 --> 00:28:12.569
and then I have a negative 1
that comes out from inside.
00:28:12.569 --> 00:28:14.360
And then I take the
derivative with respect
00:28:14.360 --> 00:28:17.010
to b, and since b has x tilde
00:28:17.010 --> 00:28:19.340
in factor, it
comes out as well.
00:28:19.340 --> 00:28:24.420
All right, so the minus 2's
really won't matter for me.
00:28:24.420 --> 00:28:26.706
And so now I have two equations.
00:28:26.706 --> 00:28:28.580
The first equation,
well, it's pretty simple,
00:28:28.580 --> 00:28:31.955
it's just telling me that
the expectation of y minus a
00:28:31.955 --> 00:28:33.710
is equal to zero.
00:28:33.710 --> 00:28:41.870
So what I know is that a is
equal to the expectation of y.
00:28:41.870 --> 00:28:44.060
And really that
was a tilde, which
00:28:44.060 --> 00:28:47.870
implies that the a
I want is actually
00:28:47.870 --> 00:29:00.690
equal to the
expectation of y minus b
00:29:00.690 --> 00:29:05.030
times the expectation of x.
00:29:05.030 --> 00:29:05.530
OK?
00:29:10.240 --> 00:29:13.450
Just because a tilde is a plus
b times the expectation of x.
00:29:16.830 --> 00:29:19.360
So that's for my a.
00:29:19.360 --> 00:29:22.180
And then for my b, I
use the second one.
00:29:22.180 --> 00:29:27.990
So the second one tells me that
the expectation of x tilde of y
00:29:27.990 --> 00:29:32.430
is equal to a plus b times
the expectation of x tilde
00:29:32.430 --> 00:29:33.520
which is zero, right?
00:29:38.640 --> 00:29:39.460
OK?
00:29:39.460 --> 00:29:41.630
But this a is actually
a tilde in this problem,
00:29:41.630 --> 00:29:47.210
so it's actually a plus
b expectation of x.
00:29:51.900 --> 00:29:53.890
Now, this is the
expectation of the product
00:29:53.890 --> 00:29:57.480
of two random variables, but
x tilde is centered, right?
00:29:57.480 --> 00:30:00.670
It's x minus expectation of
x, so this thing is actually
00:30:00.670 --> 00:30:03.640
equal to the covariance
between x and y
00:30:03.640 --> 00:30:05.140
by definition of covariance.
00:30:09.130 --> 00:30:11.840
So now I have everything
I need, right.
00:30:11.840 --> 00:30:14.110
How do I just--
00:30:14.110 --> 00:30:16.520
I'm sorry about that.
00:30:16.520 --> 00:30:18.330
So I have everything I need.
00:30:18.330 --> 00:30:22.560
Now, I now have two
equations with two unknowns,
00:30:22.560 --> 00:30:25.110
and all I have to do is
to basically plug it in.
00:30:25.110 --> 00:30:29.460
So it's essentially telling
me that the covariance of xy--
00:30:29.460 --> 00:30:31.980
so the first equation tells
me that the covariance of xy
00:30:31.980 --> 00:30:36.750
is equal to a plus b expectation
of x, but a is expectation of y
00:30:36.750 --> 00:30:39.640
minus b expectation of x.
00:30:39.640 --> 00:30:45.113
So it's-- well, actually,
maybe I should start with b.
00:30:54.780 --> 00:30:56.010
Oh, sorry.
00:30:56.010 --> 00:30:59.580
OK, I forgot one thing.
00:30:59.580 --> 00:31:00.750
This is not true, right.
00:31:00.750 --> 00:31:02.516
I forgot this term.
00:31:02.516 --> 00:31:05.850
x tilde multiplies x
tilde here, so what
00:31:05.850 --> 00:31:07.680
I'm left with is x tilde--
00:31:07.680 --> 00:31:11.320
it's minus b times the
expectation of x tilde squared.
00:31:11.320 --> 00:31:14.800
So that's actually minus
b times the variance of x
00:31:14.800 --> 00:31:17.970
tilde because x tilde
is already centered,
00:31:17.970 --> 00:31:19.760
which is actually
the variance of x.
00:31:23.850 --> 00:31:29.790
So now I have that this thing
is actually a plus b expectation
00:31:29.790 --> 00:31:36.570
of x minus b variance of x.
00:31:36.570 --> 00:31:42.180
And I also have that a
is equal to expectation
00:31:42.180 --> 00:31:45.960
of y minus b expectation of x.
00:31:53.720 --> 00:31:58.100
So if I sum the two, those
guys are going to cancel.
00:31:58.100 --> 00:32:00.740
Those guys are going to cancel.
00:32:00.740 --> 00:32:05.630
And so what I'm going to be
left with is covariance of xy
00:32:05.630 --> 00:32:10.570
is equal to expectation
of x, expectation of y,
00:32:10.570 --> 00:32:12.610
and then I'm left with
this term here, minus
00:32:12.610 --> 00:32:14.050
b times the variance of x.
00:32:17.070 --> 00:32:20.171
And so that tells me that b--
00:32:20.171 --> 00:32:21.796
why do I still have
the variance there?
00:32:34.692 --> 00:32:37.668
AUDIENCE: So is the
covariance really
00:32:37.668 --> 00:32:43.124
the expectation of x tilde
times y minus expectation of y?
00:32:43.124 --> 00:32:46.092
Because y is not
centered, correct?
00:32:46.092 --> 00:32:47.092
PHILIPPE RIGOLLET: Yeah.
00:32:47.092 --> 00:32:48.814
AUDIENCE: OK, but x
is still the center.
00:32:48.814 --> 00:32:50.980
PHILIPPE RIGOLLET: But x
is still the center, right.
00:32:50.980 --> 00:32:52.700
So you just need
to have one that's
00:32:52.700 --> 00:32:53.830
centered for this to work.
00:32:57.187 --> 00:32:58.520
Right, I mean, you can check it.
00:32:58.520 --> 00:33:00.144
But basically when
you're going to have
00:33:00.144 --> 00:33:02.877
the product of the expectations,
you only need one of the two
00:33:02.877 --> 00:33:03.960
in the product to be zero.
00:33:03.960 --> 00:33:04.920
So the product is zero.
00:33:09.090 --> 00:33:11.020
OK, why do I keep my--
00:33:11.020 --> 00:33:14.542
so I get a, a, and
then the b expectation.
00:33:14.542 --> 00:33:16.750
OK, so that's probably
earlier that I made a mistake.
00:33:25.620 --> 00:33:29.140
So I get-- so this was a tilde.
00:33:29.140 --> 00:33:30.548
Let's just be clear about the--
00:33:40.508 --> 00:33:43.350
So that tells me that a tilde--
00:33:43.350 --> 00:33:45.570
maybe it's not super
fair of me to--
00:33:48.310 --> 00:33:50.426
yeah, OK, I think I know
where I made a mistake.
00:33:50.426 --> 00:33:51.550
I should not have centered.
00:33:51.550 --> 00:33:54.760
I wanted to make my life
easier, and I should not
00:33:54.760 --> 00:33:55.960
have done that.
00:33:55.960 --> 00:33:59.140
And the reason is a
tilde depends on b,
00:33:59.140 --> 00:34:01.780
so when I take the
derivative with respect
00:34:01.780 --> 00:34:04.840
to b, what I'm left with here--
00:34:04.840 --> 00:34:06.880
since a tilde
depends on b, when I
00:34:06.880 --> 00:34:09.370
take the derivative of
this guy, I actually
00:34:09.370 --> 00:34:12.550
don't get a tilde here,
but I really get--
00:34:17.570 --> 00:34:20.896
so again, this was not--
00:34:20.896 --> 00:34:21.960
so that's the first one.
00:34:30.389 --> 00:34:33.800
This is actually x here--
00:34:33.800 --> 00:34:38.050
because when I take the
derivative with respect to b.
00:34:38.050 --> 00:34:40.929
And so now, what I'm left with
is that the expectation-- so
00:34:40.929 --> 00:34:43.929
yeah, I'm basically left
with nothing that helps.
00:34:43.929 --> 00:34:46.300
So I'm sorry about that.
00:34:46.300 --> 00:34:49.929
Let's start from the
beginning because this is not
00:34:49.929 --> 00:34:53.090
getting us anywhere, and a
fix is not going to help.
00:34:53.090 --> 00:34:55.370
So let's just do it again.
00:34:55.370 --> 00:34:56.320
Sorry about that.
00:34:56.320 --> 00:34:59.230
So let's not center anything
and just do brute force
00:34:59.230 --> 00:35:01.120
because we're going to--
00:35:01.120 --> 00:35:04.820
b x squared.
00:35:04.820 --> 00:35:07.270
All right.
00:35:07.270 --> 00:35:09.520
Partial with respect
to a being
00:35:09.520 --> 00:35:11.920
equal to zero is
equivalent-- so my minus 2
00:35:11.920 --> 00:35:13.060
is going to cancel, right.
00:35:13.060 --> 00:35:14.851
So I'm going to actually
forget about this.
00:35:14.851 --> 00:35:17.980
So it's actually telling
me that the expectation
00:35:17.980 --> 00:35:25.660
of y minus a plus bx
is equal to zero, which
00:35:25.660 --> 00:35:31.090
is equivalent to a plus
b expectation of x, is
00:35:31.090 --> 00:35:33.775
equal to the expectation of y.
00:35:33.775 --> 00:35:35.650
Now, if I take the
derivative with respect to
00:35:35.650 --> 00:35:38.830
b and set it equal to
zero, this is telling me
00:35:38.830 --> 00:35:41.656
that the expectation of--
00:35:41.656 --> 00:35:43.780
well, it's the same thing
except that this time I'm
00:35:43.780 --> 00:35:45.280
going to pull out an x.
00:35:52.470 --> 00:35:54.310
This guy is equal to zero--
00:35:54.310 --> 00:35:56.660
this guy is not here--
00:35:56.660 --> 00:36:03.650
and so that implies that
the expectation of xy
00:36:03.650 --> 00:36:09.560
is equal to a times
the expectation of x,
00:36:09.560 --> 00:36:16.726
plus b times the
expectation of x square.
00:36:16.726 --> 00:36:17.226
OK?
00:36:21.540 --> 00:36:26.720
All right, so the first one is
actually not giving me much,
00:36:26.720 --> 00:36:29.700
so I need to actually work
with the two of those guys.
00:36:29.700 --> 00:36:31.470
So I'm going to take the first--
00:36:31.470 --> 00:36:33.690
so let me rewrite those two
equalities that I have.
00:36:33.690 --> 00:36:40.830
I have a plus b, e of
x is equal to e of y.
00:36:40.830 --> 00:36:43.092
And then I have e of xy.
00:36:50.970 --> 00:37:01.160
OK, and now what I do is
that I multiply this guy.
00:37:01.160 --> 00:37:03.230
So I want to cancel one
of those things, right?
00:37:03.230 --> 00:37:04.455
So what I'm going to--
00:37:12.197 --> 00:37:13.780
so I'm going to take
this guy, and I'm
00:37:13.780 --> 00:37:19.030
going to multiply it by e of
x and take the difference.
00:37:19.030 --> 00:37:26.330
So I do times e of x, and then
I take the sum of those two,
00:37:26.330 --> 00:37:28.840
and then those two terms
are going to cancel.
00:37:28.840 --> 00:37:33.550
So then that tells
me that b times e
00:37:33.550 --> 00:37:45.180
of x, squared, plus the
expectation of xy is equal to--
00:37:45.180 --> 00:37:48.423
so this guy is the
one that cancelled.
00:37:53.850 --> 00:37:56.570
Then I get this guy
here, expectation
00:37:56.570 --> 00:38:02.450
of x times the expectation
of y, plus the guy that
00:38:02.450 --> 00:38:04.070
remains here--
00:38:04.070 --> 00:38:08.752
which is b times the
expectation of x square.
00:38:11.920 --> 00:38:16.220
So here I have b expectation
of x, the whole thing squared.
00:38:16.220 --> 00:38:18.400
And here I have b
expectation of x square.
00:38:18.400 --> 00:38:22.440
So if I pull this guy
here, what do I get?
00:38:22.440 --> 00:38:26.140
b times the variance of x, OK?
00:38:26.140 --> 00:38:28.180
So I'm going to move here.
00:38:28.180 --> 00:38:31.160
And this guy here, when
I move this guy here,
00:38:31.160 --> 00:38:32.980
I get the expectation
of x times y,
00:38:32.980 --> 00:38:35.590
minus the expectation of x
times the expectation of y.
00:38:35.590 --> 00:38:40.540
So this is actually telling me
that the covariance of x and y
00:38:40.540 --> 00:38:45.450
is equal to b times
the variance of x.
00:38:45.450 --> 00:38:48.840
And so then that
tells me that b is
00:38:48.840 --> 00:38:55.519
equal to covariance of xy
divided by the variance of x.
00:38:55.519 --> 00:38:57.310
And that's why I actually
need the variance
00:38:57.310 --> 00:39:01.690
of x to be non-zero because
I couldn't do that otherwise.
00:39:01.690 --> 00:39:03.190
And because if it
was, it would mean
00:39:03.190 --> 00:39:04.890
that b should be
plus infinity, which
00:39:04.890 --> 00:39:08.220
is what the limit of this
guy is when the variance goes
00:39:08.220 --> 00:39:11.200
to zero or negative infinity.
00:39:11.200 --> 00:39:14.410
I cannot sort them out.
00:39:14.410 --> 00:39:16.130
All right, so I'm
sorry about the mess,
00:39:16.130 --> 00:39:19.070
but that should be more clear.
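For reference, the second pass of the computation can be written compactly. This is a reconstruction from the spoken steps, not a copy of the board, with E for expectation, Var for variance and Cov for covariance:

```latex
\begin{align*}
&\min_{a,b}\ \mathbb{E}\big[(Y - (a + bX))^2\big] \\[4pt]
\frac{\partial}{\partial a} = 0
  &\iff \mathbb{E}\big[Y - (a + bX)\big] = 0
  \iff a + b\,\mathbb{E}[X] = \mathbb{E}[Y] \\
\frac{\partial}{\partial b} = 0
  &\iff \mathbb{E}\big[X\,(Y - (a + bX))\big] = 0
  \iff \mathbb{E}[XY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[X^2] \\[4pt]
\text{first equation} \times \mathbb{E}[X]:\quad
  & a\,\mathbb{E}[X] + b\,(\mathbb{E}[X])^2 = \mathbb{E}[X]\,\mathbb{E}[Y] \\
\text{eliminating } a\,\mathbb{E}[X]:\quad
  & \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]
    = b\,\big(\mathbb{E}[X^2] - (\mathbb{E}[X])^2\big) \\
  &\iff \operatorname{Cov}(X, Y) = b\,\operatorname{Var}(X) \\[4pt]
b = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)},
  &\qquad a = \mathbb{E}[Y] - b\,\mathbb{E}[X]
\end{align*}
```

The last line is only defined when Var(X) is nonzero, which is the assumption made at the start.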
00:39:19.070 --> 00:39:21.410
Then a, of course,
you can write it
00:39:21.410 --> 00:39:23.240
by plugging in the
value of b, so you
00:39:23.240 --> 00:39:27.030
know it's only a function
of your distribution, right?
00:39:27.030 --> 00:39:29.240
So what are the characteristics
of the distribution--
00:39:29.240 --> 00:39:31.031
so distribution can
have a bunch of things.
00:39:31.031 --> 00:39:34.330
It can have moments
of order 4, of order 26.
00:39:34.330 --> 00:39:36.590
It can have heavy
tails or light tails.
00:39:36.590 --> 00:39:39.320
But when you compute
least squares,
00:39:39.320 --> 00:39:41.900
the only thing that
matters are the variance
00:39:41.900 --> 00:39:45.320
of x, the expectation
of the individual ones--
00:39:45.320 --> 00:39:50.300
and really what captures how
y changes when you change x,
00:39:50.300 --> 00:39:51.590
is captured in the covariance.
00:39:51.590 --> 00:39:54.510
The rest is really
just normalization.
00:39:54.510 --> 00:39:58.550
It's just telling you, I want
things to cross the y-axis
00:39:58.550 --> 00:39:59.360
at the right place.
00:39:59.360 --> 00:40:02.330
I want things to cross the
x-axis at the right place.
00:40:02.330 --> 00:40:05.720
But the slope is really captured
by how much more covariance
00:40:05.720 --> 00:40:08.330
you have relative to
the variance of x.
00:40:08.330 --> 00:40:12.350
So this is essentially setting
the scale for the x-axis,
00:40:12.350 --> 00:40:15.410
and this is telling
you for a unit scale,
00:40:15.410 --> 00:40:20.460
this is the unit of y
that you're changing.
00:40:20.460 --> 00:40:23.600
OK, so we have explicit forms.
00:40:23.600 --> 00:40:26.300
And what I could do, if I
wanted to estimate those things,
00:40:26.300 --> 00:40:32.510
is just say, well again, we
have expectations, right?
00:40:32.510 --> 00:40:36.050
The expectation of xy minus the
product of the expectations,
00:40:36.050 --> 00:40:38.510
I could replace
expectations by averages
00:40:38.510 --> 00:40:40.310
and get an empirical
covariance just
00:40:40.310 --> 00:40:42.710
like we can replace the
expectations for the variance
00:40:42.710 --> 00:40:44.720
and get a sample variance.
00:40:44.720 --> 00:40:47.300
And this is basically what
we're going to be doing.
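As a rough illustration of replacing expectations by averages in the closed-form expressions, here is a minimal Python sketch. The function name `plug_in_line` and the data values are made up for this example; only the formulas b = Cov(x, y) / Var(x) and a = E[y] - b E[x] come from the lecture.

```python
def mean(values):
    return sum(values) / len(values)

def plug_in_line(xs, ys):
    """Replace each expectation in the closed-form a and b by a sample average."""
    mean_x, mean_y = mean(xs), mean(ys)
    mean_xy = mean([x * y for x, y in zip(xs, ys)])
    mean_x2 = mean([x * x for x in xs])
    cov_xy = mean_xy - mean_x * mean_y   # empirical covariance of x and y
    var_x = mean_x2 - mean_x ** 2        # empirical variance of x
    b_hat = cov_xy / var_x
    a_hat = mean_y - b_hat * mean_x
    return a_hat, b_hat

# Illustrative data, not from the lecture.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
a_hat, b_hat = plug_in_line(xs, ys)
print(a_hat, b_hat)
```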
00:40:47.300 --> 00:40:49.470
All right, this is
essentially what you want.
00:40:49.470 --> 00:40:51.950
The problem is that if
you view it that way,
00:40:51.950 --> 00:40:54.860
you sort of prevent yourself
from being able to solve
00:40:54.860 --> 00:40:56.510
the multivariate problem.
00:40:56.510 --> 00:40:58.430
Because it's only in
the univariate problem
00:40:58.430 --> 00:41:00.930
that you have closed form
solutions for your problem.
00:41:00.930 --> 00:41:03.080
But if you actually
go to multivariate,
00:41:03.080 --> 00:41:05.510
this is not where you want
to replace expectations
00:41:05.510 --> 00:41:06.230
by averages.
00:41:06.230 --> 00:41:09.120
You actually want to replace
expectation by averages here.
00:41:12.520 --> 00:41:14.950
And once you do
it here, then you
00:41:14.950 --> 00:41:17.920
can actually just solve
the minimization problem.
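A minimal Python sketch of that second route, where the average goes inside the criterion itself and the resulting problem is then minimized. It stays in one dimension, so the minimizer can still be written down from the two first-order conditions; the function names and data are assumptions made for illustration.

```python
def empirical_risk(a, b, xs, ys):
    """Average of (y_i - (a + b * x_i))^2: the expectation in the criterion replaced by an average."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def minimize_empirical_risk(xs, ys):
    """Solve the two first-order conditions of the empirical criterion (one-dimensional x only)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b_hat = sxy / sxx
    a_hat = mean_y - b_hat * mean_x
    return a_hat, b_hat

# Illustrative data, not from the lecture.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
a_hat, b_hat = minimize_empirical_risk(xs, ys)
best = empirical_risk(a_hat, b_hat, xs, ys)

# Moving away from (a_hat, b_hat) in any of these directions does not decrease the criterion.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.05, -0.05)]:
    assert empirical_risk(a_hat + da, b_hat + db, xs, ys) >= best
print(a_hat, b_hat, best)
```

Writing the criterion as a function of the parameters is what carries over to several explanatory variables, where a numerical or matrix solution would take the place of the two-line formula.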
00:41:23.240 --> 00:41:29.840
OK, so one thing that
arises from this guy
00:41:29.840 --> 00:41:35.795
is that this is an
interesting formula.
00:41:40.640 --> 00:41:43.740
All right, think about it.
00:41:43.740 --> 00:42:00.190
If I have that y is a
plus bx plus some noise.
00:42:00.190 --> 00:42:02.680
Things are no
longer on something.
00:42:02.680 --> 00:42:08.470
I have that y is equal to
a plus bx plus some noise, which
00:42:08.470 --> 00:42:11.210
is usually denoted by epsilon.
00:42:11.210 --> 00:42:12.910
So that's the
distribution, right?
00:42:12.910 --> 00:42:15.760
If I tell you the
distribution of x, and I
00:42:15.760 --> 00:42:17.470
say y is a plus bx plus epsilon--
00:42:17.470 --> 00:42:18.940
I tell you the
distribution of epsilon,
00:42:18.940 --> 00:42:21.190
and if [? they mean ?] that
those two are independent,
00:42:21.190 --> 00:42:23.860
you have a distribution on y.
00:42:23.860 --> 00:42:27.364
So what happens is that I can
actually always say-- well, you
00:42:27.364 --> 00:42:28.780
know, this is
equivalent to saying
00:42:28.780 --> 00:42:35.560
that epsilon is equal to
y minus a plus bx, right?
00:42:35.560 --> 00:42:37.540
I can always write
this as just--
00:42:37.540 --> 00:42:40.320
I mean, as tautology.
00:42:40.320 --> 00:42:42.069
But here, for those guys--
00:42:42.069 --> 00:42:43.360
this is not for any guy, right.
00:42:43.360 --> 00:42:45.770
This is really for
the best fit, a
00:42:45.770 --> 00:42:50.170
and b, those ones that
satisfy this gradient is
00:42:50.170 --> 00:42:51.610
equal to zero thing.
00:42:51.610 --> 00:42:55.330
Then what we had is that
the expectation of epsilon
00:42:55.330 --> 00:42:59.380
was equal to expectation
of y minus a plus
00:42:59.380 --> 00:43:03.430
b expectation of x by linearity
of the expectation, which
00:43:03.430 --> 00:43:05.560
was equal to zero.
00:43:05.560 --> 00:43:10.180
So for this best
fit we have zero.
00:43:10.180 --> 00:43:13.630
Now, the covariance
between x and y--
00:43:17.190 --> 00:43:20.530
Between, sorry, x
and epsilon, is what?
00:43:20.530 --> 00:43:23.420
Well, it's the
covariance between x--
00:43:23.420 --> 00:43:27.540
and well, epsilon was
y minus a plus bx.
00:43:30.100 --> 00:43:33.240
Now, the covariance is
bilinear, so what I have
00:43:33.240 --> 00:43:35.640
is that the
covariance of this is
00:43:35.640 --> 00:43:38.760
the covariance of x times y--
00:43:38.760 --> 00:43:41.790
sorry, of x and y, minus
the variance-- well,
00:43:41.790 --> 00:43:50.220
minus a plus b,
covariance of x and x,
00:43:50.220 --> 00:43:54.720
which is the variance of x?
00:43:59.050 --> 00:44:03.510
Covariance of xy minus
a plus b variance of x.
00:44:12.384 --> 00:44:13.300
OK, I didn't write it.
00:44:13.300 --> 00:44:16.080
So here I have
covariance of xy is
00:44:16.080 --> 00:44:17.910
equal to b variance of x, right?
00:44:34.070 --> 00:44:35.270
Covariance of xy.
00:44:35.270 --> 00:44:38.057
Yeah, that's because they cannot
do that with the covariance.
00:44:44.030 --> 00:44:46.520
Yeah, I have those
averages again.
00:44:46.520 --> 00:44:48.320
No, because this
is centered, right?
00:44:48.320 --> 00:44:51.000
Sorry, this is centered,
so this is actually
00:44:51.000 --> 00:44:56.760
equal to the expectation of
x times y minus a plus bx.
00:45:01.527 --> 00:45:03.110
The covariance is
equal to the product
00:45:03.110 --> 00:45:05.750
just because this inside
is actually centered.
00:45:05.750 --> 00:45:09.980
So this is the
expectation of x times y
00:45:09.980 --> 00:45:20.100
minus the expectation of a times
the expectation of x, plus b
00:45:20.100 --> 00:45:23.013
minus b times the
expectation of x squared.
00:45:32.200 --> 00:45:34.720
Well, actually maybe I
should not really go too far.
00:45:38.894 --> 00:45:40.560
So this is actually
the one that I need.
00:45:40.560 --> 00:45:47.300
But if I stop here, this is
actually equal to zero, right.
00:45:47.300 --> 00:45:49.095
Those are the same equations.
00:45:52.065 --> 00:45:53.050
OK?
00:45:53.050 --> 00:45:53.550
Yeah?
00:45:53.550 --> 00:45:55.516
AUDIENCE: What are
we doing right now?
00:45:55.516 --> 00:45:57.140
PHILIPPE RIGOLLET:
So we're just saying
00:45:57.140 --> 00:46:01.070
that if I actually believe that
this best fit was the one that
00:46:01.070 --> 00:46:02.990
gave me the right
parameters, what would
00:46:02.990 --> 00:46:05.804
that imply on the noise
itself, on this epsilon?
00:46:05.804 --> 00:46:07.220
So here we're
actually just trying
00:46:07.220 --> 00:46:10.070
to find some necessary condition
for the noise to hold--
00:46:10.070 --> 00:46:11.030
for the noise.
00:46:11.030 --> 00:46:14.540
And so those conditions are,
that first, the expectation
00:46:14.540 --> 00:46:15.290
is zero.
00:46:15.290 --> 00:46:17.090
That's what we've got here.
00:46:17.090 --> 00:46:20.480
And then, that the covariance
between the noise and x
00:46:20.480 --> 00:46:22.900
has to be zero as well.
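The same two conditions can be checked on data, with averages in place of expectations. The following is a minimal Python sketch under that substitution: it fits the least squares line to a small made-up sample and computes the average residual and the empirical covariance between x and the residuals. The names `fit_line` and `residual_checks` are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for one-dimensional x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b_hat = sxy / sxx
    return mean_y - b_hat * mean_x, b_hat

def residual_checks(xs, ys):
    """Return (average residual, empirical covariance between x and the residuals)."""
    a_hat, b_hat = fit_line(xs, ys)
    n = len(xs)
    residuals = [y - (a_hat + b_hat * x) for x, y in zip(xs, ys)]
    mean_x = sum(xs) / n
    mean_res = sum(residuals) / n
    cov_x_res = sum((x - mean_x) * (r - mean_res) for x, r in zip(xs, residuals)) / n
    return mean_res, cov_x_res

# Illustrative data, not from the lecture.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
print(residual_checks(xs, ys))  # both entries are zero up to floating-point rounding
```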
00:46:22.900 --> 00:46:24.770
OK, so those are
actually conditions
00:46:24.770 --> 00:46:26.360
that the noise must satisfy.
00:46:26.360 --> 00:46:29.450
But the noise was just not
really defined as noise itself.
00:46:29.450 --> 00:46:31.550
We were just
saying, OK, if we're
00:46:31.550 --> 00:46:35.230
going to put some assumptions
on the epsilon, what
00:46:35.230 --> 00:46:36.110
do we better have?
00:46:36.110 --> 00:46:38.360
So the first one is that
it's centered, which is good,
00:46:38.360 --> 00:46:41.150
because otherwise, the noise
would shift everything.
00:46:41.150 --> 00:46:45.620
So now when you look at a
linear regression model--
00:46:45.620 --> 00:46:48.590
typically, if you open a book,
it doesn't start by saying,
00:46:48.590 --> 00:46:50.920
let the noise be the
difference between y
00:46:50.920 --> 00:46:52.940
and what I actually
want y to be.
00:46:52.940 --> 00:46:57.210
It says let y be a
plus bx plus epsilon.
00:46:57.210 --> 00:47:02.120
So conversely, if we assume that
this is the model that we have,
00:47:02.120 --> 00:47:04.340
then we're going to have
to assume that epsilon--
00:47:04.340 --> 00:47:06.298
we're going to assume
that epsilon is centered,
00:47:06.298 --> 00:47:10.840
and that the covariance
between x and epsilon is zero.
00:47:10.840 --> 00:47:13.760
Actually, often, we're
going to assume much more.
00:47:13.760 --> 00:47:17.600
And one way to ensure that
those two things are satisfied
00:47:17.600 --> 00:47:19.940
is to assume that x is
independent of epsilon,
00:47:19.940 --> 00:47:21.290
for example.
00:47:21.290 --> 00:47:23.940
If you assume that x is
independent of epsilon,
00:47:23.940 --> 00:47:28.332
of course the covariance
is going to be zero.
00:47:28.332 --> 00:47:30.720
Or we might assume that
the conditional expectation
00:47:30.720 --> 00:47:35.450
of epsilon, given x, is equal
to zero, then that implies that.
00:47:35.450 --> 00:47:38.710
OK, now the fact that it's
centered is one thing.
00:47:38.710 --> 00:47:43.500
So if we make this assumption,
the only thing it's telling us
00:47:43.500 --> 00:47:47.700
is that those ab's that come--
right, we started from there.
00:47:47.700 --> 00:47:51.240
y is equal to a plus bx plus
some epsilon for some a,
00:47:51.240 --> 00:47:51.960
for some b.
00:47:51.960 --> 00:47:55.890
What it turns out is that
those a's and b's are actually
00:47:55.890 --> 00:47:58.680
the ones that you would get
by solving this expectation
00:47:58.680 --> 00:48:00.690
of square thing.
00:48:00.690 --> 00:48:02.610
All right, so when you asked--
00:48:02.610 --> 00:48:04.530
back when you were following--
00:48:04.530 --> 00:48:07.170
so when you asked,
you know, why don't we
00:48:07.170 --> 00:48:10.290
take the square, for
example, or the power
00:48:10.290 --> 00:48:12.210
4, or something like this--
00:48:12.210 --> 00:48:15.990
then here, I'm saying, well, if
I have y is equal to a plus bx,
00:48:15.990 --> 00:48:19.230
I don't actually need to put
too many assumptions on epsilon.
00:48:19.230 --> 00:48:22.320
If epsilon is actually
satisfying those two things,
00:48:22.320 --> 00:48:25.620
expectation is equal to
zero and the covariance
00:48:25.620 --> 00:48:28.912
with x is equal to zero,
then the right a and b
00:48:28.912 --> 00:48:30.870
that I'm looking for are
actually the ones that
00:48:30.870 --> 00:48:32.120
come with the square--
00:48:32.120 --> 00:48:36.750
not with power 4 or power 25.
00:48:36.750 --> 00:48:39.300
So those are actually
pretty weak assumptions.
00:48:39.300 --> 00:48:41.510
If we want to do
inference, we're
00:48:41.510 --> 00:48:43.350
going to have to
assume slightly more.
00:48:43.350 --> 00:48:45.690
If we want to use
t-distributions at some point,
00:48:45.690 --> 00:48:47.520
for example, and we
will, we're going
00:48:47.520 --> 00:48:50.800
to have to assume that epsilon
has a Gaussian distribution.
00:48:50.800 --> 00:48:53.700
So if you want to start doing
more statistics beyond just
00:48:53.700 --> 00:48:56.550
like doing this least squares
thing, which is minimizing
00:48:56.550 --> 00:48:58.350
the squared criterion,
you're actually
00:48:58.350 --> 00:48:59.933
going to have to put
more assumptions.
00:48:59.933 --> 00:49:01.710
But right now, we
did not need them.
00:49:01.710 --> 00:49:04.210
We only need that epsilon
has mean zero and covariance
00:49:04.210 --> 00:49:04.998
zero with x.
00:49:08.750 --> 00:49:13.040
OK, so that was basically
probabilistic, right.
00:49:13.040 --> 00:49:14.450
If I were to do
probability and I
00:49:14.450 --> 00:49:17.090
were trying to model the
relationship between two
00:49:17.090 --> 00:49:20.330
random variables, x
and y, in the form
00:49:20.330 --> 00:49:24.320
y is a plus bx plus some noise,
this is what would come out.
00:49:24.320 --> 00:49:25.640
Everything was expectations.
00:49:25.640 --> 00:49:27.290
There was no data involved.
00:49:27.290 --> 00:49:33.620
So now let's go to the
data problem, which is now,
00:49:33.620 --> 00:49:35.540
I do not know what
those expectations are.
00:49:35.540 --> 00:49:38.240
In particular, I don't know what
the covariance of x and y is,
00:49:38.240 --> 00:49:40.610
and I don't know what
the expectation of x
00:49:40.610 --> 00:49:42.950
and the expectation of y are.
00:49:42.950 --> 00:49:44.570
So I have data to do that.
00:49:44.570 --> 00:49:45.880
So how am I going to do this?
00:49:49.244 --> 00:49:50.660
Well, I'm just
going to say, well,
00:49:50.660 --> 00:49:57.570
if I have x1, y1,
xn, yn, and I'm going
00:49:57.570 --> 00:49:59.781
to assume that
they're iid.
00:49:59.781 --> 00:50:01.530
And I'm actually going
to assume that they
00:50:01.530 --> 00:50:02.820
have some model, right.
00:50:02.820 --> 00:50:06.570
So I'm going to assume
that I have that a--
00:50:06.570 --> 00:50:09.150
so that Yi follows
the same model.
00:50:14.620 --> 00:50:17.000
So epsilon i
[? rad, ?] and I want to
00:50:17.000 --> 00:50:23.610
say that expectation of epsilon
i is zero and covariance of xi,
00:50:23.610 --> 00:50:25.630
epsilon i is equal to zero.
00:50:25.630 --> 00:50:28.880
So I'm going to put the
same model on all the data.
00:50:28.880 --> 00:50:31.420
So you can see that a is
not ai, and b is not bi.
00:50:31.420 --> 00:50:32.380
It's the same.
00:50:32.380 --> 00:50:34.090
So as my data
increases, I should
00:50:34.090 --> 00:50:36.850
be able to recover
the correct things--
00:50:36.850 --> 00:50:39.430
as the size of my
data increases.
00:50:39.430 --> 00:50:43.030
OK, so this is what the
statistical problem looks like.
00:50:43.030 --> 00:50:45.250
You're given the points.
00:50:45.250 --> 00:50:47.350
There is a true line
from which these points
00:50:47.350 --> 00:50:48.557
were generated, right.
00:50:48.557 --> 00:50:49.390
There was this line.
00:50:49.390 --> 00:50:54.250
There was a true ab that
I use to draw this plot,
00:50:54.250 --> 00:50:55.190
and that was the line.
00:50:55.190 --> 00:50:59.320
So first I picked an
x, say uniformly at random
00:50:59.320 --> 00:51:02.110
on this interval, 0 to 2.
00:51:02.110 --> 00:51:03.610
I said that was this one.
00:51:03.610 --> 00:51:06.800
Then I said well, I
want y to be a plus bx,
00:51:06.800 --> 00:51:08.500
so it should be
here, but then I'm
00:51:08.500 --> 00:51:10.840
going to add some noise
epsilon to go away again
00:51:10.840 --> 00:51:13.270
back from this line.
00:51:13.270 --> 00:51:16.970
And that's actually me, here, we
actually got two points correct
00:51:16.970 --> 00:51:18.070
on this line.
00:51:18.070 --> 00:51:20.170
So there's basically
two epsilons
00:51:20.170 --> 00:51:22.330
that were small enough
that the dots actually
00:51:22.330 --> 00:51:24.720
look like they're on the line.
00:51:24.720 --> 00:51:27.060
Everybody's clear
about what I'm drawing?
00:51:27.060 --> 00:51:28.810
So now of course if
you're a statistician,
00:51:28.810 --> 00:51:29.620
you don't see this.
00:51:29.620 --> 00:51:30.810
You only see this.
00:51:30.810 --> 00:51:32.610
And you have to
recover this guy,
00:51:32.610 --> 00:51:34.260
and it's going to
look like this.
00:51:34.260 --> 00:51:36.550
You're going to have an
estimated line, which
00:51:36.550 --> 00:51:37.780
is the red one.
00:51:37.780 --> 00:51:42.610
And the blue line, which is
the true one, the one that
00:51:42.610 --> 00:51:44.230
actually generated the data.
00:51:44.230 --> 00:51:46.810
And your question is,
while this line corresponds
00:51:46.810 --> 00:51:48.967
to some parameters
a hat and b hat,
00:51:48.967 --> 00:51:51.550
how could I make sure that those
two lines-- how far those two
00:51:51.550 --> 00:51:52.060
lines are?
00:51:52.060 --> 00:51:53.620
And one way to address
this question is
00:51:53.620 --> 00:51:57.920
to say how far is a from a hat,
and how far is b from b hat?
00:51:57.920 --> 00:51:58.785
OK?
00:51:58.785 --> 00:52:00.660
Another question, of
course, that you may ask
00:52:00.660 --> 00:52:04.470
is, how do you find
a hat and b hat?
00:52:04.470 --> 00:52:07.530
And as you can see, it's
basically the same thing.
00:52:07.530 --> 00:52:15.210
Remember, what was a-- so b
was the covariance between x
00:52:15.210 --> 00:52:21.240
and y divided by the
variance of x, right?
00:52:21.240 --> 00:52:22.410
We can rewrite this.
00:52:22.410 --> 00:52:26.430
The expectation of
xy minus expectation
00:52:26.430 --> 00:52:30.060
of x times the
expectation of y, divided
00:52:30.060 --> 00:52:35.580
by expectation of x squared
minus expectation of x,
00:52:35.580 --> 00:52:37.540
the whole thing squared.
00:52:37.540 --> 00:52:39.040
OK?
00:52:39.040 --> 00:52:42.910
If you look at the
expression for b hat,
00:52:42.910 --> 00:52:47.670
I basically replaced all
the expectations by bars.
00:52:47.670 --> 00:52:49.800
So I said, well,
this guy I'm going
00:52:49.800 --> 00:52:53.480
to estimate by an average.
00:52:53.480 --> 00:52:59.970
So that's the xy
bar, and is 1 over n,
00:52:59.970 --> 00:53:03.025
sum from i equal
1 to n of Xi times Yi.
00:53:05.555 --> 00:53:08.380
x bar, of course, is just
the one that we're used to.
00:53:12.690 --> 00:53:14.970
And same for y bar.
00:53:14.970 --> 00:53:20.580
X squared bar, the
one that's here,
00:53:20.580 --> 00:53:22.290
is the average of the squares.
00:53:22.290 --> 00:53:24.426
And x bar square is the
square of the average.
00:53:39.510 --> 00:53:44.070
OK, so you just basically
replace this guy by x bar,
00:53:44.070 --> 00:53:47.820
this guy by y bar, this
guy by x square bar,
00:53:47.820 --> 00:53:52.350
and this guy by x
bar and no square.
00:53:52.350 --> 00:53:54.810
OK, so that's basically
one way to do it.
00:53:54.810 --> 00:53:56.340
Everywhere you see
an expectation,
00:53:56.340 --> 00:53:58.740
you replace it by an average.
00:53:58.740 --> 00:54:02.070
That's the usual
statistical hammer.
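[Editor's sketch] The plug-in recipe described here can be tried on simulated data. This is a minimal illustration, not code from the lecture: the function names `plug_in_line` and `simulate`, the true values a = 1 and b = 2, and the Gaussian noise with standard deviation 0.5 are all assumptions chosen for the example. It uses the formula for b given above together with a = E[Y] - b E[X].

```python
import random


def plug_in_line(xs, ys):
    """Replace every expectation in the formulas for a and b by a sample average."""
    n = len(xs)
    x_bar = sum(xs) / n                                 # estimates E[X]
    y_bar = sum(ys) / n                                 # estimates E[Y]
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n     # estimates E[XY]
    x2_bar = sum(x * x for x in xs) / n                 # estimates E[X^2]
    b_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


def simulate(n, a, b, sigma, seed=0):
    """Draw X uniformly on [0, 2] and set Y = a + b X + noise (assumed Gaussian)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


xs, ys = simulate(n=10_000, a=1.0, b=2.0, sigma=0.5)
a_hat, b_hat = plug_in_line(xs, ys)
print(a_hat, b_hat)  # should land near the true (a, b) = (1, 2) for large n
```

With more data the estimated line should move closer to the true one, which is the recovery property mentioned earlier.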
00:54:02.070 --> 00:54:04.720
You can actually be slightly
more subtle about this.
00:54:09.980 --> 00:54:12.420
And as an exercise,
I invite you--
00:54:12.420 --> 00:54:14.940
just to make sure that you know
how to do this computation,
00:54:14.940 --> 00:54:17.400
it's going to be exactly the
same kind of computations
00:54:17.400 --> 00:54:18.840
that we've done.
00:54:18.840 --> 00:54:20.670
But as an exercise,
you can check
00:54:20.670 --> 00:54:23.311
that if you actually
look at say, well,
00:54:23.311 --> 00:54:25.810
what I wanted to minimize here,
I had an expectation, right?
00:54:32.720 --> 00:54:35.660
And I said, let's
minimize this thing.
00:54:35.660 --> 00:54:41.800
Well, let's replace this
by an average first.
00:54:51.630 --> 00:54:54.270
And now minimize.
00:54:54.270 --> 00:54:57.100
OK, so if I do
this, it turns out
00:54:57.100 --> 00:55:00.160
I'm going to actually
get the same result.
00:55:00.160 --> 00:55:03.940
The minimum of the
average is basically--
00:55:03.940 --> 00:55:06.160
when I replace the
average by-- sorry,
00:55:06.160 --> 00:55:09.040
when I replace the
expectation by the average
00:55:09.040 --> 00:55:11.817
and then minimize,
it's the same thing
00:55:11.817 --> 00:55:13.900
as first minimizing and
then replacing expectation
00:55:13.900 --> 00:55:17.510
by averages in this case.
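[Editor's sketch] One way to sanity-check this claim numerically, short of doing the exercise by hand: compute the plug-in values and confirm that nudging them in any direction increases the empirical sum of squares. The data values and the helper names `plug_in_line` and `sum_of_squares` are made up for illustration.

```python
def plug_in_line(xs, ys):
    """Formulas for a and b with expectations replaced by sample averages."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n
    x2_bar = sum(x * x for x in xs) / n
    b_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    return y_bar - b_hat * x_bar, b_hat


def sum_of_squares(a, b, xs, ys):
    """Empirical criterion: sum over i of (Yi - (a + b Xi))^2."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data, for illustration only.
xs = [0.1, 0.5, 0.9, 1.3, 1.8]
ys = [1.0, 2.3, 2.6, 3.9, 4.4]

a_hat, b_hat = plug_in_line(xs, ys)
best = sum_of_squares(a_hat, b_hat, xs, ys)

# Evaluate the criterion at the eight neighbours of (a_hat, b_hat) on a small grid.
steps = (-0.01, 0.0, 0.01)
nudged = [
    sum_of_squares(a_hat + da, b_hat + db, xs, ys)
    for da in steps
    for db in steps
    if (da, db) != (0.0, 0.0)
]
print(all(value > best for value in nudged))
```

This is only a local check on one data set, not a proof; the exercise itself is to show that the first-order conditions of the empirical criterion give the same formulas.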
00:55:17.510 --> 00:55:21.764
Again, this is a much
more general principle
00:55:21.764 --> 00:55:23.180
because if you
don't have a closed
00:55:23.180 --> 00:55:27.530
form for the minimum like for
some, say, likelihood problems,
00:55:27.530 --> 00:55:30.579
well, you might not
actually have a possibility
00:55:30.579 --> 00:55:32.870
to just look at what the
formula looks like-- see where
00:55:32.870 --> 00:55:35.480
the expectations show up-- and
then just plug in the averages
00:55:35.480 --> 00:55:36.380
instead.
00:55:36.380 --> 00:55:39.170
So this is the one you
want to keep in mind.
00:55:39.170 --> 00:55:41.000
And again, as an exercise.
00:55:47.000 --> 00:55:48.870
OK, so here, and then
you do expectation
00:55:48.870 --> 00:55:52.980
replaced by averages.
00:55:52.980 --> 00:55:57.800
And then that's the same
answer, and I encourage
00:55:57.800 --> 00:56:00.080
you to solve the exercise.
00:56:00.080 --> 00:56:03.770
OK, everybody's clear that this
is actually the same expression
00:56:03.770 --> 00:56:07.140
for a hat and b hat that we had
before that we had for a and b
00:56:07.140 --> 00:56:12.460
when we replaced the
expectations by averages?
00:56:12.460 --> 00:56:16.960
Here, by the way, I minimize
the sum rather than the average.
00:56:16.960 --> 00:56:19.708
It's clear to everyone that
this is the same thing, right?
00:56:22.680 --> 00:56:23.180
Yep?
00:56:23.180 --> 00:56:27.148
AUDIENCE: [INAUDIBLE] sum
replacing it [INAUDIBLE]
00:56:27.148 --> 00:56:29.628
minimize the
expectation, I'm assuming
00:56:29.628 --> 00:56:31.612
it's switched with
the derivative
00:56:31.612 --> 00:56:33.596
on the expectation [INAUDIBLE].
00:56:37.592 --> 00:56:39.050
PHILIPPE RIGOLLET:
So we did switch
00:56:39.050 --> 00:56:43.640
the derivative and the
expectation before you came,
00:56:43.640 --> 00:56:44.140
I think.
00:56:47.890 --> 00:56:49.810
All right, so
indeed, the picture
00:56:49.810 --> 00:56:52.150
was the one that we
said, so visually, this
00:56:52.150 --> 00:56:53.380
is what we're doing.
00:56:53.380 --> 00:56:55.780
We're looking among
all the lines.
00:56:55.780 --> 00:56:58.822
For each line, we
compute this distance.
00:56:58.822 --> 00:57:00.280
So if I give you
another line there
00:57:00.280 --> 00:57:01.759
would be another set of arrows.
00:57:01.759 --> 00:57:02.800
You look at their length.
00:57:02.800 --> 00:57:03.610
You square it.
00:57:03.610 --> 00:57:05.520
And then you sum it
all, and you find
00:57:05.520 --> 00:57:08.080
the line that has the minimum
sum of squared lengths
00:57:08.080 --> 00:57:09.364
of the arrows.
00:57:09.364 --> 00:57:11.780
All right, and those are the
arrows that we're looking at.
00:57:11.780 --> 00:57:14.710
But again, you could actually
think of other distances,
00:57:14.710 --> 00:57:17.307
and you would actually
get different--
00:57:17.307 --> 00:57:19.390
you could actually get
different solutions, right.
00:57:19.390 --> 00:57:22.644
So there's something called,
mean absolute deviation,
00:57:22.644 --> 00:57:24.310
which rather than
minimizing this thing,
00:57:24.310 --> 00:57:27.490
is actually minimizing the
sum from i equal 1 to n
00:57:27.490 --> 00:57:33.970
of the absolute value
of Yi minus a plus bXi.
00:57:33.970 --> 00:57:36.160
And that's not
something for which
00:57:36.160 --> 00:57:39.190
you're going to have a closed
form, as you can imagine.
00:57:39.190 --> 00:57:42.010
You might have something
that's sort of implicit,
00:57:42.010 --> 00:57:44.647
but you can actually still
solve it numerically.
00:57:44.647 --> 00:57:46.230
And this is something
that people also
00:57:46.230 --> 00:57:50.478
like to use but way, way less
than the least squares one.
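[Editor's sketch] A small illustration of solving the absolute-deviation problem numerically. It relies on a standard fact that is not stated in the lecture: when the Xi are not all equal, some minimizing line passes through at least two data points, so for a small sample one can simply try every pair. The function names and the data (four points on the line y = 1 + 2x plus one outlier) are assumptions for the example; practical solvers use linear programming instead of brute force.

```python
from itertools import combinations


def abs_deviation(a, b, xs, ys):
    """Criterion: sum over i of |Yi - (a + b Xi)|."""
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))


def lad_line(xs, ys):
    """Brute-force search over all lines through two data points."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not of the form a + b x
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        cost = abs_deviation(a, b, xs, ys)
        if best is None or cost < best[0]:
            best = (cost, a, b)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Made-up data: the last point is an outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 30.0]

a_lad, b_lad = lad_line(xs, ys)
print(a_lad, b_lad)
```

On this sample the absolute-deviation line stays on the four aligned points, whereas a least squares fit would be pulled toward the outlier.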
00:57:50.478 --> 00:57:52.174
AUDIENCE: [INAUDIBLE]
00:57:52.174 --> 00:57:53.840
PHILIPPE RIGOLLET:
What did I just what?
00:57:53.840 --> 00:57:56.600
AUDIENCE: [INAUDIBLE]
00:57:56.600 --> 00:58:02.230
PHILIPPE RIGOLLET: The sum of the absolute
values of Yi minus a plus bXi.
00:58:02.230 --> 00:58:04.432
So it's the same except
I don't square here.
00:58:07.820 --> 00:58:08.320
OK?
00:58:11.250 --> 00:58:18.330
So arguably, you know,
predicting a demand
00:58:18.330 --> 00:58:21.780
based on price is a
fairly naive problem.
00:58:21.780 --> 00:58:23.787
Typically, what we
have is a bunch of data
00:58:23.787 --> 00:58:25.620
that we've collected,
and we're hoping that,
00:58:25.620 --> 00:58:29.460
together, they can help
us do a better prediction.
00:58:29.460 --> 00:58:31.890
All right, so maybe I
don't have only the price,
00:58:31.890 --> 00:58:35.670
but maybe I have a bunch
of other social indicators.
00:58:35.670 --> 00:58:40.484
Maybe I know the competition,
the price of the competition.
00:58:40.484 --> 00:58:42.150
And maybe I know a
bunch of other things
00:58:42.150 --> 00:58:43.980
that are actually relevant.
00:58:43.980 --> 00:58:48.030
And so I'm trying to find a way
to combine a bunch of points,
00:58:48.030 --> 00:58:50.880
a bunch of measures.
00:58:50.880 --> 00:58:52.540
There's a nice
example that I like,
00:58:52.540 --> 00:58:56.370
which is people were
trying to measure something
00:58:56.370 --> 00:59:00.750
related to your body
mass index, so basically
00:59:00.750 --> 00:59:04.820
the volume of your-- the
density of your body.
00:59:04.820 --> 00:59:07.380
And the way you can do
this is by just, really,
00:59:07.380 --> 00:59:10.170
weighing someone and
also putting them
00:59:10.170 --> 00:59:13.920
in some cubic meter of water
and see how much overflows.
00:59:13.920 --> 00:59:15.750
And then you have
both the volume
00:59:15.750 --> 00:59:20.850
and the mass of
this person, and you
00:59:20.850 --> 00:59:23.370
can start computing density.
00:59:23.370 --> 00:59:25.860
But as you can
imagine, you know,
00:59:25.860 --> 00:59:27.684
I would not personally
like to go to a gym
00:59:27.684 --> 00:59:29.600
when the first thing
they ask me is to just go
00:59:29.600 --> 00:59:33.240
in a bucket of
water, and so people
00:59:33.240 --> 00:59:36.840
try to find ways to measure this
based on other indicators that
00:59:36.840 --> 00:59:38.110
are much easier to measure.
00:59:38.110 --> 00:59:41.040
For example, I don't know,
the length of my forearm,
00:59:41.040 --> 00:59:45.090
and the circumference of
my head, and maybe my belly
00:59:45.090 --> 00:59:46.870
would probably be
more appropriate here.
00:59:46.870 --> 00:59:48.870
And so you know, they
just try to find something
00:59:48.870 --> 00:59:50.340
that actually makes sense.
00:59:50.340 --> 00:59:52.094
And so there's
actually a nice example
00:59:52.094 --> 00:59:53.760
where you can show
that if you measure--
00:59:53.760 --> 00:59:55.050
I think one of the
most significant
00:59:55.050 --> 00:59:56.860
was with the circumference
of your wrist.
00:59:56.860 --> 01:00:02.070
This is actually a very good
indicator of your body density.
01:00:02.070 --> 01:00:06.780
And it turns out that if you
stuff all the bunch of things
01:00:06.780 --> 01:00:09.240
together, you might actually
get a very good formula that
01:00:09.240 --> 01:00:10.840
explains things.
01:00:10.840 --> 01:00:12.390
All right, so what
we're going to do
01:00:12.390 --> 01:00:14.406
is rather than saying
we have only one x
01:00:14.406 --> 01:00:15.780
to explain y's,
let's say we have
01:00:15.780 --> 01:00:19.510
20 x's that we're trying
to combine to explain y.
01:00:19.510 --> 01:00:22.410
And again, just like assuming
something of the form,
01:00:22.410 --> 01:00:26.107
y is a plus b times x was the
simplest thing we could do,
01:00:26.107 --> 01:00:28.440
here we're just going to
assume that we have y is a plus
01:00:28.440 --> 01:00:31.650
b1 x1 plus b2 x2 plus b3 x3.
01:00:31.650 --> 01:00:33.690
And we can write
it in a vector form
01:00:33.690 --> 01:00:39.210
by writing that Yi is
Xi transposed b, which
01:00:39.210 --> 01:00:42.770
is now a vector plus epsilon i.
01:00:42.770 --> 01:00:44.520
OK, and here, on
the board, I'm going
01:00:44.520 --> 01:00:46.980
to have a hard time
doing boldface,
01:00:46.980 --> 01:00:52.360
but all these things are
vectors except for y,
01:00:52.360 --> 01:00:53.520
which is a number.
01:00:53.520 --> 01:00:54.450
Yi is a number.
01:00:54.450 --> 01:00:57.780
It's always the
value of my y-axis.
01:00:57.780 --> 01:00:59.930
So even if my x-axis lives on--
01:00:59.930 --> 01:01:04.350
this is x1, and this is x2, y
is really just the real valued
01:01:04.350 --> 01:01:05.249
function.
01:01:05.249 --> 01:01:07.290
And so I'm going to get
a bunch of points, x1,y1,
01:01:07.290 --> 01:01:10.380
and I'm going to see
how much they respond.
01:01:10.380 --> 01:01:13.560
So for example, my
body density is y,
01:01:13.560 --> 01:01:16.562
and then all the x's are
a bunch of other things.
01:01:16.562 --> 01:01:17.270
Agreed with that?
01:01:17.270 --> 01:01:20.870
So this is an equation that
holds on the real line,
01:01:20.870 --> 01:01:27.390
but this guy here is in
R^p, and this guy's in R^p.
01:01:30.080 --> 01:01:33.550
It's actually common to
call b beta,
01:01:33.550 --> 01:01:38.650
when it's a vector, and that's
the usual linear regression
01:01:38.650 --> 01:01:39.370
notation.
01:01:39.370 --> 01:01:42.470
Y is x beta plus epsilon.
01:01:42.470 --> 01:01:45.780
So x's are called
explanatory variables.
01:01:45.780 --> 01:01:50.600
y is called explained variable,
or dependent variable,
01:01:50.600 --> 01:01:52.000
or response variable.
01:01:52.000 --> 01:01:53.050
It has a bunch of names.
01:01:53.050 --> 01:01:55.877
You can use whatever you
feel more comfortable with.
01:01:55.877 --> 01:01:57.460
It should actually
be explicit, right,
01:01:57.460 --> 01:01:58.668
so that's all you care about.
01:02:01.100 --> 01:02:05.840
Now, what we typically do
is that rather-- so you
01:02:05.840 --> 01:02:07.840
notice here, that there's
actually no intercept.
01:02:07.840 --> 01:02:10.840
If I actually fold that
back down to one dimension,
01:02:10.840 --> 01:02:13.210
there's actually a is
equal to zero, right?
01:02:13.210 --> 01:02:18.350
If I go back to p
is equal to 1, that
01:02:18.350 --> 01:02:22.430
would imply that Yi is,
well, say, beta times
01:02:22.430 --> 01:02:24.979
x plus epsilon i.
01:02:24.979 --> 01:02:27.020
And that's not good, I
want to have an intercept.
01:02:27.020 --> 01:02:29.480
And the way I do this,
rather than writing
01:02:29.480 --> 01:02:31.910
a plus this, and
you know, just have
01:02:31.910 --> 01:02:35.420
like an overload of notation,
what I am actually doing
01:02:35.420 --> 01:02:37.670
is that I fold back.
01:02:37.670 --> 01:02:40.750
I fold my intercept
back into my x.
01:02:43.460 --> 01:02:46.190
And so if I measure
20 variables,
01:02:46.190 --> 01:02:48.080
I'm going to create a
21st variable, which
01:02:48.080 --> 01:02:49.700
is always equal to 1.
01:02:49.700 --> 01:02:52.650
OK, so you should
think of x as being 1,
01:02:52.650 --> 01:02:58.120
and then x1 to xp.
01:02:58.120 --> 01:03:00.790
And sorry, xp minus 1, I guess.
01:03:00.790 --> 01:03:02.293
OK, and now this is in R^p.
01:03:05.590 --> 01:03:07.900
I'm always going to assume
that the first one is 1.
01:03:07.900 --> 01:03:09.250
I can always do that.
01:03:09.250 --> 01:03:11.320
If I have a table of data--
01:03:11.320 --> 01:03:15.940
if my data is given to me
in an Excel spreadsheet--
01:03:15.940 --> 01:03:19.990
and here I have the density
that I measured on my data,
01:03:19.990 --> 01:03:22.940
and then maybe here
I have the height,
01:03:22.940 --> 01:03:25.544
and here I have the
wrist circumference.
01:03:25.544 --> 01:03:26.710
And I have all these things.
01:03:26.710 --> 01:03:31.100
All I have to do is to create
another column here of ones,
01:03:31.100 --> 01:03:34.180
and I just put 1-1-1-1-1.
01:03:34.180 --> 01:03:37.090
OK, that's all I have to
do to create this guy.
01:03:37.090 --> 01:03:39.190
Agreed?
01:03:39.190 --> 01:03:43.940
And now my x is going to
be just one of those rows.
01:03:43.940 --> 01:03:46.190
So that's this is
Xi, this entire row.
01:03:46.190 --> 01:03:47.622
And this entry here is Yi.
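[Editor's sketch] The spreadsheet trick can be written in a few lines. The measurements below are invented purely for illustration (height and wrist circumference in centimeters, with a density value per person), and `add_intercept_column` is a hypothetical helper name.

```python
def add_intercept_column(rows):
    """Prepend a constant 1 to every observation so the intercept becomes a coefficient."""
    return [[1.0] + list(row) for row in rows]


# Made-up table: one row per person, columns are (height, wrist circumference).
measurements = [
    [170.0, 16.5],
    [182.0, 17.8],
    [165.0, 15.9],
]
densities = [1.05, 1.07, 1.04]  # made-up responses Yi, one per row

X = add_intercept_column(measurements)
for x_i, y_i in zip(X, densities):
    print(x_i, y_i)  # x_i is the whole row Xi (starting with 1), y_i is Yi
```

After this step each Xi has one more coordinate than the number of measured variables, and the coefficient on the constant coordinate plays the role of the intercept.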
01:03:54.430 --> 01:03:56.920
So now, for my
noise coefficients,
01:03:56.920 --> 01:03:59.300
I'm still going to
ask for the same thing
01:03:59.300 --> 01:04:04.090
except that here, the
covariance is not between x--
01:04:04.090 --> 01:04:07.210
between one random variable
and another random variable.
01:04:07.210 --> 01:04:10.930
It's between a random vector
and a random variable.
01:04:10.930 --> 01:04:13.130
OK, how do I measure the
covariance between a vector
01:04:13.130 --> 01:04:14.594
and a random variable?
01:04:23.866 --> 01:04:25.840
AUDIENCE: [INAUDIBLE]
01:04:25.840 --> 01:04:29.002
PHILIPPE RIGOLLET:
Yeah, so basically--
01:04:29.002 --> 01:04:31.380
AUDIENCE: [INAUDIBLE]
01:04:31.380 --> 01:04:33.630
PHILIPPE RIGOLLET: Yeah, I
mean, the covariance vector
01:04:33.630 --> 01:04:36.171
is equal to 0 is the same thing
as [INAUDIBLE] equal to zero,
01:04:36.171 --> 01:04:39.270
but yeah, this is basically
thought of entry-wise.
01:04:39.270 --> 01:04:41.820
For each coordinate of x,
I want that the covariance
01:04:41.820 --> 01:04:47.430
between epsilon and this
coordinate of x is equal to 0.
01:04:47.430 --> 01:04:50.370
So I'm just asking this
for all coordinates.
01:04:50.370 --> 01:04:52.020
Again, in most
instances, we're going
01:04:52.020 --> 01:04:53.520
to think that epsilon
is independent
01:04:53.520 --> 01:04:56.310
of x, and that's something we
can understand without thinking
01:04:56.310 --> 01:04:59.022
about coordinates.
01:04:59.022 --> 01:05:00.471
Yep?
01:05:00.471 --> 01:05:03.852
AUDIENCE: [INAUDIBLE] like
what if beta equals alpha
01:05:03.852 --> 01:05:04.818
[INAUDIBLE]?
01:05:06.774 --> 01:05:09.190
PHILIPPE RIGOLLET: I'm sorry,
can you repeat the question?
01:05:09.190 --> 01:05:09.773
I didn't hear.
01:05:09.773 --> 01:05:12.140
AUDIENCE: Is this the
parameter of beta, a parameter?
01:05:12.140 --> 01:05:13.100
PHILIPPE RIGOLLET: Yeah,
beta is the parameter
01:05:13.100 --> 01:05:14.141
we're looking for, right.
01:05:14.141 --> 01:05:18.485
The pair (a, b) has
become the whole vector beta
01:05:18.485 --> 01:05:19.394
now.
01:05:19.394 --> 01:05:20.810
AUDIENCE: And
what's [INAUDIBLE]?
01:05:22.720 --> 01:05:25.219
PHILIPPE RIGOLLET: Well, can
you think of an intercept
01:05:25.219 --> 01:05:26.260
of a function that take--
01:05:26.260 --> 01:05:28.630
I mean, there is one actually.
01:05:28.630 --> 01:05:30.370
There's the one
for which betas--
01:05:30.370 --> 01:05:31.840
all the betas that
don't correspond
01:05:31.840 --> 01:05:35.200
to the vector of all
ones, so the intercept
01:05:35.200 --> 01:05:38.469
is really the weight
that I put on this guy.
01:05:38.469 --> 01:05:40.510
That's the beta that's
going to come to this guy,
01:05:40.510 --> 01:05:44.310
but we don't really
talk about intercept.
01:05:44.310 --> 01:05:49.210
So if x lives in two
dimensions, the way
01:05:49.210 --> 01:05:50.950
you want to think
about this is you
01:05:50.950 --> 01:05:54.420
take a sheet of paper
like that, so now I
01:05:54.420 --> 01:05:57.080
have points that live
in three dimensions.
01:05:57.080 --> 01:05:59.320
So let's say one
direction here is x1.
01:05:59.320 --> 01:06:02.710
This direction is x2,
and this direction is y.
01:06:02.710 --> 01:06:04.960
And so what's going
to happen is that I'm
01:06:04.960 --> 01:06:07.120
going to have my points
that live in this three
01:06:07.120 --> 01:06:08.710
dimensional space.
01:06:08.710 --> 01:06:10.180
And what I'm trying
to do when I'm
01:06:10.180 --> 01:06:12.580
trying to do a linear
model for those guys--
01:06:12.580 --> 01:06:13.990
when I assume a linear model.
01:06:13.990 --> 01:06:17.380
What I assume is that there's
a plane in those three
01:06:17.380 --> 01:06:17.950
dimensions.
01:06:17.950 --> 01:06:20.170
So think of this guy
as going everywhere,
01:06:20.170 --> 01:06:23.920
and there's a plane close to
which all my points should be.
01:06:23.920 --> 01:06:26.320
That's what's happening
in two dimensions.
01:06:26.320 --> 01:06:29.930
If you see higher dimensions
then congratulations to you,
01:06:29.930 --> 01:06:30.975
but I can't.
01:06:33.530 --> 01:06:36.470
But you know, you can definitely
formalize that fairly easily
01:06:36.470 --> 01:06:38.405
mathematically and just
talk about vectors.
01:06:40.940 --> 01:06:44.200
So now here, if I talk about the
least square error estimator,
01:06:44.200 --> 01:06:47.470
or just the least squares
estimator of beta,
01:06:47.470 --> 01:06:49.990
it's simply the same
thing as before.
01:06:49.990 --> 01:06:52.460
Just like we said--
01:06:52.460 --> 01:06:56.750
so remember, you
should think of beta
01:06:56.750 --> 01:06:59.930
as being the
pair (a, b) generalized.
01:06:59.930 --> 01:07:05.060
So we said, oh, we wanted to
minimize the expectation of y
01:07:05.060 --> 01:07:13.640
minus a plus bx squared, right?
01:07:13.640 --> 01:07:16.910
Now, so that's in--
for p is equal to 1.
01:07:16.910 --> 01:07:19.510
Now for p larger
than or equal to 2,
01:07:19.510 --> 01:07:28.760
we're just going to write it
as y minus x transpose beta
01:07:28.760 --> 01:07:29.260
squared.
01:07:34.210 --> 01:07:37.900
OK, so I'm just trying to
minimize this quantity.
01:07:37.900 --> 01:07:40.857
Of course, I don't
have access to this,
01:07:40.857 --> 01:07:42.940
so what I'm going to do
is I'm going to replace
01:07:42.940 --> 01:07:44.881
my expectation by an average.
01:07:51.010 --> 01:07:54.890
So here I'm using the notation
t because beta is the true one,
01:07:54.890 --> 01:07:56.960
and I don't want you to just--
01:07:56.960 --> 01:07:59.960
so here, I have a variable
t that's just moving around.
01:07:59.960 --> 01:08:02.390
And so now I'm going to take
the square of this thing.
01:08:02.390 --> 01:08:08.450
And when I minimize this over
all t in R^p, the argmin,
01:08:08.450 --> 01:08:19.584
the minimum is attained at beta
hat, which is my estimator.
01:08:19.584 --> 01:08:20.084
OK?
01:08:25.359 --> 01:08:29.337
So if I want to
actually compute--
01:08:29.337 --> 01:08:29.837
yeah?
01:08:29.837 --> 01:08:31.420
AUDIENCE: I'm sorry,
on the last slide
01:08:31.420 --> 01:08:36.422
did we require the expectation
of [INAUDIBLE] to be zero?
01:08:36.422 --> 01:08:38.380
PHILIPPE RIGOLLET: You
mean the previous slide?
01:08:38.380 --> 01:08:38.963
AUDIENCE: Yes.
01:08:38.963 --> 01:08:40.262
[INAUDIBLE]
01:08:40.262 --> 01:08:42.720
PHILIPPE RIGOLLET: So again,
I'm just defining an estimator
01:08:42.720 --> 01:08:45.053
just like I would tell you,
just take the estimator that
01:08:45.053 --> 01:08:46.539
has coordinates for everywhere.
01:08:46.539 --> 01:08:48.984
AUDIENCE: So I'm saying like
[? in that sign, ?] we'll say
01:08:48.984 --> 01:08:51.918
the noise [? terms ?] we want to
satisfy the covariance of that
01:08:51.918 --> 01:08:55.830
[? side. ?] We also want them
to satisfy expectation of each
01:08:55.830 --> 01:08:56.808
[? noise turn ?] zero?
01:09:07.827 --> 01:09:09.660
PHILIPPE RIGOLLET: And
so the answer is yes.
01:09:09.660 --> 01:09:13.050
I was just trying to think
if this was captured.
01:09:13.050 --> 01:09:15.180
So it is not
captured in this guy
01:09:15.180 --> 01:09:17.700
because this is just telling
me that the expectation
01:09:17.700 --> 01:09:23.750
of epsilon i minus expectation
of epsilon i is equal to zero.
01:09:23.750 --> 01:09:27.380
OK, so yes I need to have
that epsilon has mean zero--
01:09:27.380 --> 01:09:29.130
let's assume that
expectation of epsilon
01:09:29.130 --> 01:09:31.545
is zero for this problem.
01:09:43.640 --> 01:09:45.374
And we're going
to need something
01:09:45.374 --> 01:09:47.540
about some sort of question
about the variance being
01:09:47.540 --> 01:09:51.060
not equal to zero, right, but
this is going to come up later.
01:09:51.060 --> 01:09:54.710
So let's think for one second
about doing the same approach
01:09:54.710 --> 01:09:55.490
as we did before.
01:09:55.490 --> 01:09:57.320
Take the partial
derivative with respect
01:09:57.320 --> 01:09:59.279
to the first coordinate
of t, with respect
01:09:59.279 --> 01:10:01.070
to the second coordinate
of t, with respect
01:10:01.070 --> 01:10:03.320
to the third coordinate
of t, et cetera.
01:10:03.320 --> 01:10:04.610
So that's what we did before.
01:10:04.610 --> 01:10:07.460
We had two equations,
and we reconciled them
01:10:07.460 --> 01:10:10.190
because it was fairly
easy to solve, right?
01:10:10.190 --> 01:10:11.826
But in general,
what's going to happen
01:10:11.826 --> 01:10:13.700
is we're going to have
a system of equations.
01:10:13.700 --> 01:10:17.150
We're going to have a system
of p equations, one for each
01:10:17.150 --> 01:10:19.340
of the coordinates of t.
01:10:19.340 --> 01:10:23.960
And we're going to have p
unknowns, each coordinate of t.
01:10:23.960 --> 01:10:26.559
And so we're going to
have the system to solve--
01:10:26.559 --> 01:10:28.850
actually, it turns out it's
going to be a linear system.
01:10:28.850 --> 01:10:29.960
But it's not going
to be something
01:10:29.960 --> 01:10:32.543
that we're going to be able to
solve coordinate by coordinate.
01:10:32.543 --> 01:10:34.020
It's going to be
annoying to solve.
01:10:34.020 --> 01:10:36.820
You know, you can guess
what's going to happen, right.
01:10:36.820 --> 01:10:40.700
Here, it involved the covariance
between x and epsilon, right.
01:10:40.700 --> 01:10:43.910
That's what it involved
to understand--
01:10:43.910 --> 01:10:47.540
sorry, the correlation
between x and y
01:10:47.540 --> 01:10:50.660
to understand how the
solution of this problem was.
01:10:50.660 --> 01:10:52.070
In this case,
there's going to be
01:10:52.070 --> 01:10:57.930
not only the covariance between
x1 and y, x2 and y, x3, et
01:10:57.930 --> 01:10:59.510
cetera, all the way to xp and y.
01:10:59.510 --> 01:11:02.960
There's also going to be all
the cross covariances between xj
01:11:02.960 --> 01:11:04.077
and xk.
01:11:04.077 --> 01:11:05.660
And so this is going
to be a nightmare
01:11:05.660 --> 01:11:08.210
to solve, like, in this system.
01:11:08.210 --> 01:11:12.100
And what we do is that we go
on to using a matrix notation,
01:11:12.100 --> 01:11:14.600
so that when we
take derivatives,
01:11:14.600 --> 01:11:16.340
we talk about
gradients, and then we
01:11:16.340 --> 01:11:20.390
can invert matrices and solve
linear systems in a somewhat
01:11:20.390 --> 01:11:23.330
formal manner by just saying
that, if I want to solve
01:11:23.330 --> 01:11:27.230
the system Ax equals b--
01:11:27.230 --> 01:11:28.760
rather than actually
solving this
01:11:28.760 --> 01:11:30.440
for each coordinate
of x individually,
01:11:30.440 --> 01:11:33.770
I just say that x is
equal to A inverse times b.
01:11:33.770 --> 01:11:37.490
So that's really why we're
going to the equation one,
01:11:37.490 --> 01:11:40.730
because we have a
formalism to write that x
01:11:40.730 --> 01:11:42.260
is the solution of the system.
01:11:42.260 --> 01:11:43.843
I'm not telling you
that this is going
01:11:43.843 --> 01:11:48.110
to be easy to solve numerically,
but at least I can write it.
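[Editor's sketch] To make the "write it formally, solve it as a whole" point concrete, here is a minimal Gaussian elimination routine that solves a square linear system for all coordinates at once. The function name `solve` and the 2-by-2 example are assumptions for illustration; this is a teaching sketch, and in practice one would call a numerical linear algebra library rather than hand-roll elimination or form an explicit inverse.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from all rows below the pivot.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution, from the last unknown to the first.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


x = solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
print(x)
```

The unknowns are coupled through the off-diagonal entries, which is why the system cannot be solved one coordinate at a time.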
01:11:48.110 --> 01:11:51.307
And so here's how it goes.
01:11:51.307 --> 01:11:52.390
I have a bunch of vectors.
01:11:55.540 --> 01:11:56.790
So what are my vectors, right?
01:11:56.790 --> 01:11:57.875
So I have x1--
01:11:57.875 --> 01:11:59.250
oh, by the way,
I didn't actually
01:11:59.250 --> 01:12:01.320
mention that when I
put the lowercase, when
01:12:01.320 --> 01:12:03.660
I put the subscript, I'm
talking about the observation.
01:12:03.660 --> 01:12:05.118
And when I put the
superscript, I'm
01:12:05.118 --> 01:12:07.110
talking about the
coordinates, right?
01:12:07.110 --> 01:12:13.290
So I have X1, which is
equal to (1, X1^1, ...,
01:12:13.290 --> 01:12:19.965
X1^p), X2, which is (1,
01:12:19.965 --> 01:12:32.380
X2^1, ..., X2^p), all the way to
Xn, which is (1, Xn^1, ..., Xn^p).
01:12:32.380 --> 01:12:35.210
All right, so those are n
observed x's, and then I
01:12:35.210 --> 01:12:40.870
have another y1, y2, yn, that
comes paired with those guys.
01:12:40.870 --> 01:12:42.510
OK?
01:12:42.510 --> 01:12:44.640
So the first thing
is that I'm going
01:12:44.640 --> 01:12:46.290
to stack those guys
into some vector
01:12:46.290 --> 01:12:47.520
that I'm going to call y.
01:12:47.520 --> 01:12:49.710
So maybe I should put
an arrow for the purpose
01:12:49.710 --> 01:12:53.310
of the blackboard, and
it's just y1 to yn.
01:12:53.310 --> 01:12:56.720
OK, so this is a vector in rn.
01:12:56.720 --> 01:12:59.150
Now, if I want to stack
those guys together,
01:12:59.150 --> 01:13:03.449
I can either create a long
vector of size n times p,
01:13:03.449 --> 01:13:05.990
but the problem is that I lose
the role of who's a coordinate
01:13:05.990 --> 01:13:08.815
and who's an observation.
01:13:08.815 --> 01:13:10.190
And so it's actually
nicer for me
01:13:10.190 --> 01:13:12.840
to just put those guys
next to each other
01:13:12.840 --> 01:13:15.320
and create one new variable.
01:13:15.320 --> 01:13:18.020
And so the way I'm going to do
this is-- rather than actually
01:13:18.020 --> 01:13:22.220
stacking those guys like that,
I'm going to take their transposes
01:13:22.220 --> 01:13:24.530
and stack them as
rows of a matrix.
01:13:24.530 --> 01:13:26.870
OK, so I'm going to
create a matrix, which
01:13:26.870 --> 01:13:28.700
here is denoted typically by--
01:13:28.700 --> 01:13:31.295
I'm going to write x double bar.
01:13:31.295 --> 01:13:33.420
And here, I'm going to
actually just-- so since I'm
01:13:33.420 --> 01:13:35.940
taking those guys like
this, the first column
01:13:35.940 --> 01:13:37.010
is going to be only ones.
01:13:40.510 --> 01:13:41.950
And then I'm going to have--
01:13:41.950 --> 01:13:47.130
well, X1^1 to X1^p.
01:13:47.130 --> 01:13:52.890
And here, I'm going
to have Xn^1 to Xn^p.
01:13:52.890 --> 01:13:57.690
OK, so here the number of rows
is n, and the number of columns
01:13:57.690 --> 01:13:58.800
is p.
01:13:58.800 --> 01:14:02.352
One row per observation,
one column per coordinate.
01:14:05.010 --> 01:14:10.710
And again, I make your life
miserable because this really
01:14:10.710 --> 01:14:13.380
should be p minus 1
because I already used
01:14:13.380 --> 01:14:15.850
the first one for this guy.
01:14:15.850 --> 01:14:16.820
I'm sorry about that.
01:14:16.820 --> 01:14:18.400
It's a bit painful.
01:14:18.400 --> 01:14:20.490
So usually we don't even
write what's in there.
01:14:20.490 --> 01:14:21.948
So we don't have
to think about it.
01:14:21.948 --> 01:14:23.970
Those are just
vectors of size p.
01:14:23.970 --> 01:14:25.380
OK?
01:14:25.380 --> 01:14:27.740
So now that I
created this thing,
01:14:27.740 --> 01:14:31.340
I can actually just basically
stack up all my models.
01:14:31.340 --> 01:14:39.270
So Yi equals Xi transpose
beta plus epsilon i for all i
01:14:39.270 --> 01:14:41.430
equal 1 to n.
01:14:41.430 --> 01:14:44.010
This transforms into-- this
is equivalent to saying
01:14:44.010 --> 01:14:47.610
that the vector y is
equal to the matrix x
01:14:47.610 --> 01:14:51.150
times beta plus a matrix,
plus a vector epsilon,
01:14:51.150 --> 01:14:57.940
where epsilon is just epsilon
1 to epsilon n, right.
01:14:57.940 --> 01:14:59.830
So I have just
this system, which
01:14:59.830 --> 01:15:02.000
I write as a matrix,
which really just consists
01:15:02.000 --> 01:15:04.900
in stacking up all these
equations next to each other.
01:15:10.195 --> 01:15:12.820
So now that I have this model--
this is the usual least squares
01:15:12.820 --> 01:15:13.330
model.
01:15:13.330 --> 01:15:16.150
And here, when I want to write
my least squares criterion
01:15:16.150 --> 01:15:17.500
in terms of matrices, right?
01:15:17.500 --> 01:15:19.041
My least squares
criterion, remember,
01:15:19.041 --> 01:15:27.010
was sum from i equal 1 to n
of Yi minus Xi transposed beta
01:15:27.010 --> 01:15:28.210
squared.
01:15:28.210 --> 01:15:31.060
Well, here it's
really just the sum
01:15:31.060 --> 01:15:35.260
of the squares of the
coordinates of the vector
01:15:35.260 --> 01:15:37.540
y minus x beta.
01:15:37.540 --> 01:15:40.380
So this is actually
equal to the norm squared
01:15:40.380 --> 01:15:43.090
of y minus x beta square.
01:15:46.382 --> 01:15:47.340
That's just the square.
01:15:47.340 --> 01:15:49.470
The squared norm is, by definition,
the sum of the squares
01:15:49.470 --> 01:15:51.720
of the coordinates.
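[Editor's sketch] The identity between the row-by-row criterion and the squared norm of y minus X t can be checked directly. The matrix, the response vector, and the function names below are made up for illustration; the first column of X is the column of ones.

```python
def mat_vec(X, t):
    """Matrix-vector product: entry i is Xi transpose t."""
    return [sum(x_ij * t_j for x_ij, t_j in zip(row, t)) for row in X]


def criterion_as_sum(X, y, t):
    """Sum over observations of (Yi - Xi transpose t)^2, one row at a time."""
    total = 0.0
    for row, y_i in zip(X, y):
        fitted = sum(x_ij * t_j for x_ij, t_j in zip(row, t))
        total += (y_i - fitted) ** 2
    return total


def criterion_as_norm(X, y, t):
    """Squared Euclidean norm of the stacked residual vector y - X t."""
    residual = [y_i - f_i for y_i, f_i in zip(y, mat_vec(X, t))]
    return sum(r * r for r in residual)


# Made-up design matrix: n = 4 rows (observations), p = 3 columns (coordinates).
X = [
    [1.0, 0.5, 2.0],
    [1.0, 1.5, 1.0],
    [1.0, 2.0, 0.0],
    [1.0, 3.0, 1.0],
]
y = [2.0, 3.5, 4.0, 6.5]
t = [0.5, 1.5, 0.25]  # an arbitrary candidate value of the parameter

print(criterion_as_sum(X, y, t), criterion_as_norm(X, y, t))
```

Both functions compute the same quantity; the second form is the one that lends itself to taking gradients with respect to t.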
01:15:51.720 --> 01:15:53.885
And so now I can actually
talk about minimizing
01:15:53.885 --> 01:15:56.090
a norm squared,
and here it's going
01:15:56.090 --> 01:15:58.160
to be easier for me
to take derivatives.
01:15:58.160 --> 01:16:01.300
All right, so we'll
do that next time.