WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high-quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:16.680
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.680 --> 00:00:19.865
at ocw.mit.edu.
00:00:19.865 --> 00:00:26.880
PHILIPPE RIGOLLET: [INAUDIBLE]
minus xi transpose t.
00:00:26.880 --> 00:00:30.220
I just pick whatever notation
I want for a variable.
00:00:30.220 --> 00:00:33.850
And let's say it's t.
00:00:33.850 --> 00:00:35.530
So that's the least
squares estimator.
00:00:35.530 --> 00:00:38.350
And it turns out that,
as I said last time,
00:00:38.350 --> 00:00:39.850
it's going to be
convenient to think
00:00:39.850 --> 00:00:42.160
of those things as matrices.
00:00:42.160 --> 00:00:44.530
So here, I already have vectors.
00:00:44.530 --> 00:00:47.050
I've already gone from one
dimension, just real valued
00:00:47.050 --> 00:00:49.450
random variables, to
random vectors when
00:00:49.450 --> 00:00:52.720
I think of each xi, but if I
start stacking them together,
00:00:52.720 --> 00:00:56.020
I'm going to have vectors
and matrices that show up.
00:00:56.020 --> 00:00:57.610
So the first vector
I'm getting is
00:00:57.610 --> 00:01:04.030
y, which is just a vector
where I have y1 to yn.
00:01:04.030 --> 00:01:07.900
Then I have-- so that's
a boldface vector.
00:01:07.900 --> 00:01:12.940
Then I have x, which is
a matrix where I have--
00:01:12.940 --> 00:01:16.150
well, the first
coordinate is always 1.
00:01:16.150 --> 00:01:24.760
So I have 1, and then x1 xp
minus 1, and that's-- sorry,
00:01:24.760 --> 00:01:29.672
x1 xp minus 1, and
that's for observation 1.
00:01:29.672 --> 00:01:31.630
And then I have the same
thing all the way down
00:01:31.630 --> 00:01:32.720
for observation n.
00:01:40.390 --> 00:01:42.685
OK, everybody
understands what this is?
00:01:42.685 --> 00:01:47.920
So I'm just basically
stacking up all the xi's.
00:01:47.920 --> 00:01:55.420
So this i-th row
is xi transpose.
00:01:55.420 --> 00:01:57.390
I am just stacking them up.
00:01:57.390 --> 00:02:00.310
And so if I want to write
all these things to be
00:02:00.310 --> 00:02:03.130
true for each of
them, all I need to do
00:02:03.130 --> 00:02:05.350
is to write a vector
epsilon, which
00:02:05.350 --> 00:02:08.680
is epsilon 1 to epsilon n.
00:02:08.680 --> 00:02:11.510
And what I'm going to have is
that y, the boldface vector,
00:02:11.510 --> 00:02:14.260
now is equal to the
matrix x times the vector
00:02:14.260 --> 00:02:18.490
beta plus the vector epsilon.
00:02:18.490 --> 00:02:20.540
And it's really
just exactly saying
00:02:20.540 --> 00:02:23.620
what's there, because for two--
so this is a vector, right?
00:02:23.620 --> 00:02:25.780
This is a vector.
00:02:25.780 --> 00:02:27.850
And what is the
dimension of this vector?
00:02:32.660 --> 00:02:37.140
n, so this is n observations.
00:02:37.140 --> 00:02:39.770
And for all these-- for
two vectors to be equal,
00:02:39.770 --> 00:02:41.990
I need to have all the
coordinates to be equal,
00:02:41.990 --> 00:02:44.600
and that's exactly the same
thing as saying that this
00:02:44.600 --> 00:02:46.290
holds for i equal 1 to n.
00:02:48.990 --> 00:02:51.400
But now, when I have
this, I can actually
00:02:51.400 --> 00:02:55.690
rewrite the sum for t equals--
00:02:55.690 --> 00:03:03.310
sorry, for i equals 1 to
n of yi minus xi transpose
00:03:03.310 --> 00:03:05.680
beta squared, this
turns out to be
00:03:05.680 --> 00:03:12.340
equal to the Euclidean norm of
the vector y minus the matrix x
00:03:12.340 --> 00:03:14.852
times beta squared.
00:03:14.852 --> 00:03:16.310
And I'm going to
put a 2 here so we
00:03:16.310 --> 00:03:19.079
know we're talking about
the Euclidean norm.
00:03:19.079 --> 00:03:20.870
This just means this
is the Euclidean norm.
00:03:27.259 --> 00:03:28.800
That's the one we've
seen before when
00:03:28.800 --> 00:03:30.365
we talked about chi squared--
00:03:30.365 --> 00:03:31.740
that's the square
norm is the sum
00:03:31.740 --> 00:03:32.940
of the square of
the coefficients,
00:03:32.940 --> 00:03:34.320
and then I take a
square root, but here I
00:03:34.320 --> 00:03:35.467
have an extra square.
00:03:35.467 --> 00:03:38.050
So it's really just the sum of
the square of the coefficients,
00:03:38.050 --> 00:03:38.730
which is this.
00:03:38.730 --> 00:03:40.464
And here are the coefficients.
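
NOTE
Editorial sketch, not part of the lecture: a quick numerical check that the sum of squared residuals equals the squared Euclidean norm of y minus X t. The toy data, the seed, and the names X, y, t are assumptions made for illustration.
```python
import numpy as np
# Hypothetical toy data: n = 5 observations, p = 3 columns (intercept + 2 covariates).
rng = np.random.default_rng(0)
n, p = 5, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column is all 1s
y = rng.normal(size=n)
t = rng.normal(size=p)  # any candidate coefficient vector
#
# Sum over observations of (y_i - x_i^T t)^2, one row of X at a time.
sum_of_squares = sum((y[i] - X[i] @ t) ** 2 for i in range(n))
#
# Squared Euclidean norm of the stacked residual vector y - X t.
squared_norm = np.linalg.norm(y - X @ t, ord=2) ** 2
#
print(np.isclose(sum_of_squares, squared_norm))
```
The i-th row of X plays the role of x_i transpose, so the two quantities should agree up to rounding.
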
00:03:43.370 --> 00:03:49.530
So now that I write this thing
like that, then minimizing--
00:03:49.530 --> 00:03:54.430
so my goal here, now, is
going to solve minimum over t
00:03:54.430 --> 00:04:05.300
in R^p of the norm of y minus
x times t, squared.
00:04:05.300 --> 00:04:07.260
And just like we did
for one dimension,
00:04:07.260 --> 00:04:12.200
we can actually write
optimality conditions for this.
00:04:12.200 --> 00:04:14.990
I mean, this is a function.
00:04:14.990 --> 00:04:23.064
So this is a function
from R^p to R.
00:04:23.064 --> 00:04:24.980
And if I want to minimize
it, all I have to do
00:04:24.980 --> 00:04:28.820
is to take its gradient
and set it equal to 0.
00:04:28.820 --> 00:04:42.010
So minimum, set gradient to 0.
00:04:42.010 --> 00:04:45.640
So that's where it becomes
a little complicated.
00:04:45.640 --> 00:04:49.210
Now I'm going to have to take
the gradient of this norm.
00:04:49.210 --> 00:04:51.550
It might be a little
annoying to do.
00:04:51.550 --> 00:04:53.980
But actually, what's
nice about those things--
00:04:53.980 --> 00:04:56.720
I mean, I remember that it
was a bit annoying to learn.
00:04:56.720 --> 00:04:59.800
I mean, it's just
basically rules of calculus
00:04:59.800 --> 00:05:01.480
that you don't use that much.
00:05:01.480 --> 00:05:05.690
But essentially, you can
actually expand this norm.
00:05:05.690 --> 00:05:07.787
And you will see that
the rules are basically
00:05:07.787 --> 00:05:09.370
the same as in one
dimension, you just
00:05:09.370 --> 00:05:11.890
have to be careful about the
fact that matrices do not
00:05:11.890 --> 00:05:13.050
commute.
00:05:13.050 --> 00:05:15.940
So let's expand this thing.
00:05:15.940 --> 00:05:19.550
y minus xt squared--
00:05:19.550 --> 00:05:21.400
well, this is equal
to the norm of y
00:05:21.400 --> 00:05:30.580
squared plus the norm
of xt squared plus 2
00:05:30.580 --> 00:05:33.980
times y transpose xt.
00:05:36.730 --> 00:05:41.230
That's just expanding the
square in more dimensions.
00:05:41.230 --> 00:05:47.710
And this, I'm actually going
to write as y squared plus--
00:05:47.710 --> 00:05:50.600
so here, the norm
squared of this guy,
00:05:50.600 --> 00:05:53.140
I always have that
the norm of x squared
00:05:53.140 --> 00:05:56.650
is equal to x transpose x.
00:05:56.650 --> 00:05:58.480
So I'm going to write
this as x transpose
00:05:58.480 --> 00:06:04.540
x, so it's t transpose
x transpose xt
00:06:04.540 --> 00:06:10.540
plus 2 times y transpose xt.
00:06:10.540 --> 00:06:13.735
So now, if I'm going to take
the gradient with respect to t,
00:06:13.735 --> 00:06:16.210
I have basically three
terms, and each of them
00:06:16.210 --> 00:06:18.280
has some sort of a
different nature.
00:06:18.280 --> 00:06:21.610
This term is linear
in t, and it's
00:06:21.610 --> 00:06:23.170
going to differentiate
the same way
00:06:23.170 --> 00:06:25.720
that I differentiate a times x.
00:06:25.720 --> 00:06:28.210
I'm just going to keep the a.
00:06:28.210 --> 00:06:29.710
This guy is quadratic.
00:06:29.710 --> 00:06:32.170
t appears twice.
00:06:32.170 --> 00:06:34.140
And this guy, I'm
going to pick up a 2,
00:06:34.140 --> 00:06:37.510
and it's going to differentiate
just like when I differentiate
00:06:37.510 --> 00:06:38.840
a times x squared.
00:06:38.840 --> 00:06:41.200
It's 2 times ax.
00:06:41.200 --> 00:06:43.330
And this guy is a constant
with respect to t,
00:06:43.330 --> 00:06:47.380
so it's going to
differentiate to 0.
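
NOTE
Editorial sketch, not part of the lecture: the expansion and the term-by-term derivatives written out, with the cross term carrying a minus sign (the lecturer corrects the sign a few minutes later). The gradient is written as a column vector.
```latex
\begin{aligned}
\|y - Xt\|_2^2 &= \|y\|_2^2 + \|Xt\|_2^2 - 2\, y^\top X t \\
               &= \|y\|_2^2 + t^\top X^\top X t - 2\, y^\top X t, \\
\nabla_t \|y\|_2^2 &= 0, \qquad
\nabla_t \left( t^\top X^\top X t \right) = 2\, X^\top X t, \qquad
\nabla_t \left( -2\, y^\top X t \right) = -2\, X^\top y, \\
\nabla_t \|y - Xt\|_2^2 &= 2\, X^\top X t - 2\, X^\top y .
\end{aligned}
```
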
00:06:47.380 --> 00:06:49.090
So when I compute the gradient--
00:06:53.680 --> 00:06:55.930
now, of course, all of these
rules that I give you you
00:06:55.930 --> 00:06:58.810
can check by looking at the
partial derivative with respect
00:06:58.810 --> 00:06:59.950
to each coordinate.
00:06:59.950 --> 00:07:02.800
But arguably, it's
much faster to know
00:07:02.800 --> 00:07:04.780
the rules of differentiation.
00:07:04.780 --> 00:07:06.922
It's like if I gave you
the function exponential x
00:07:06.922 --> 00:07:08.380
and I said, what
is the derivative,
00:07:08.380 --> 00:07:09.796
and you started
writing, well, I'm
00:07:09.796 --> 00:07:13.420
going to write exponential x
plus h minus exponential x
00:07:13.420 --> 00:07:15.670
divided by h and let h go to 0.
00:07:15.670 --> 00:07:17.476
That's a bit painful.
00:07:17.476 --> 00:07:19.891
AUDIENCE: Why did
you transpose your--
00:07:19.891 --> 00:07:23.755
why does x have
to be [INAUDIBLE]?
00:07:23.755 --> 00:07:25.106
PHILIPPE RIGOLLET: I'm sorry?
00:07:25.106 --> 00:07:26.814
AUDIENCE: I was
wondering why you times t
00:07:26.814 --> 00:07:29.080
times the [INAUDIBLE]?
00:07:29.080 --> 00:07:33.190
PHILIPPE RIGOLLET:
The transpose of ab
00:07:33.190 --> 00:07:35.490
is b transpose a transpose.
00:07:38.990 --> 00:07:40.880
If you're not sure
about this, just
00:07:40.880 --> 00:07:42.860
make a and b have
different size,
00:07:42.860 --> 00:07:46.070
and then you will see that
there's some incompatibility.
00:07:46.070 --> 00:07:48.440
I mean, there's basically
only one way to not screw
00:07:48.440 --> 00:07:51.230
that one up, so that's
easy to remember.
00:07:51.230 --> 00:07:54.650
So if I take the gradient, then
it's going to be equal to what?
00:07:54.650 --> 00:07:58.130
It's going to be 0
plus-- we said here,
00:07:58.130 --> 00:07:59.880
this is going to
differentiate like-- so
00:07:59.880 --> 00:08:05.130
think a times x squared.
00:08:05.130 --> 00:08:06.730
So I'm going to have 2ax.
00:08:06.730 --> 00:08:13.840
So here, basically, this guy is
going to go to x transpose xt.
00:08:13.840 --> 00:08:17.250
Now, I could have
made this one go away,
00:08:17.250 --> 00:08:20.050
but that's the same thing as
saying that my gradient is--
00:08:20.050 --> 00:08:21.610
I can think of my
gradient as being
00:08:21.610 --> 00:08:24.200
either a horizontal vector
or a vertical vector.
00:08:24.200 --> 00:08:26.530
So if I remove this guy,
I'm thinking of my gradient
00:08:26.530 --> 00:08:27.370
as being horizontal.
00:08:27.370 --> 00:08:30.460
If I remove that guy, I'm
thinking of my gradient
00:08:30.460 --> 00:08:31.300
as being vertical.
00:08:31.300 --> 00:08:33.258
And that's what I want
to think of, typically--
00:08:33.258 --> 00:08:36.820
vertical vectors,
column vectors.
00:08:36.820 --> 00:08:39.159
And then this guy, well,
it's like these guys just
00:08:39.159 --> 00:08:42.460
think a times x.
00:08:42.460 --> 00:08:44.560
So the derivative is
just a, so I'm going
00:08:44.560 --> 00:08:47.712
to keep only that part here.
00:08:47.712 --> 00:08:49.670
Sorry, I forgot a minus
somewhere-- yeah, here.
00:08:55.200 --> 00:08:59.700
Minus 2y transpose x.
00:08:59.700 --> 00:09:02.160
And what I want is this
thing to be equal to 0.
00:09:06.240 --> 00:09:20.530
So t, the optimal t, is called
beta hat and satisfies--
00:09:24.680 --> 00:09:27.786
well, I can cancel the
2's and put the minus
00:09:27.786 --> 00:09:29.160
on the other side,
and what I get
00:09:29.160 --> 00:09:36.970
is that x transpose xt is
equal to y transpose x.
00:09:44.720 --> 00:09:48.240
Yeah, that's not working for me.
00:09:48.240 --> 00:09:50.240
Yeah, that's because when
I took the derivative,
00:09:50.240 --> 00:09:51.665
I still need to make sure--
00:09:51.665 --> 00:09:53.660
so it's the same question
of whether I want
00:09:53.660 --> 00:09:55.610
things to be columns or rows.
00:09:55.610 --> 00:09:57.550
So this is not a column.
00:09:57.550 --> 00:10:01.465
If I remove that guy,
y transpose x is a row.
00:10:01.465 --> 00:10:03.590
So I'm just going to take
the transpose of this guy
00:10:03.590 --> 00:10:07.890
to make things work, and this is
just going to be x transpose y.
00:10:11.710 --> 00:10:14.310
And this guy is x transpose
y so that I have columns.
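
NOTE
Editorial sketch, not part of the lecture: the matrix-calculus rule can be checked coordinate by coordinate with finite differences, as the lecturer suggests. The toy data, the step size h, and the function names objective and gradient are assumptions.
```python
import numpy as np
rng = np.random.default_rng(1)
n, p = 6, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)
t = rng.normal(size=p)
#
def objective(t):
    """Squared Euclidean norm of y - X t."""
    r = y - X @ t
    return r @ r
#
def gradient(t):
    """Column-vector form of the gradient: 2 X^T X t - 2 X^T y."""
    return 2 * X.T @ X @ t - 2 * X.T @ y
#
# Central finite differences, one coordinate direction e at a time.
h = 1e-6
numeric = np.array([
    (objective(t + h * e) - objective(t - h * e)) / (2 * h)
    for e in np.eye(p)
])
print(np.allclose(numeric, gradient(t), atol=1e-5))
```
Setting this gradient to zero gives the linear equation x transpose x times t equals x transpose y.
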
00:10:19.090 --> 00:10:23.910
So this is just the
linear equation in t.
00:10:23.910 --> 00:10:26.640
And I have to solve it, so it's
of the form some matrix times
00:10:26.640 --> 00:10:29.890
t is equal to another vector.
00:10:29.890 --> 00:10:31.590
And so that's basically
a linear system.
00:10:31.590 --> 00:10:33.381
And the way to solve
it, at least formally,
00:10:33.381 --> 00:10:36.450
is to just take the inverse
of the matrix on the left.
00:10:36.450 --> 00:10:45.830
So if x transpose x is
invertible, then-- sorry,
00:10:45.830 --> 00:10:48.200
that's beta hat is the t I want.
00:10:48.200 --> 00:10:51.410
I get that beta hat is
equal to x transpose
00:10:51.410 --> 00:10:54.440
x inverse x transpose y.
00:10:57.860 --> 00:10:59.780
And that's the least
squares estimator.
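
NOTE
Editorial sketch, not part of the lecture: computing beta hat on simulated data. The true coefficients, noise level, and seed are made up for illustration. The code solves the linear system instead of forming the inverse explicitly, which is a common numerical choice and not something the lecture prescribes.
```python
import numpy as np
rng = np.random.default_rng(2)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical true coefficients
y = X @ beta + 0.1 * rng.normal(size=n)  # y = X beta + epsilon
#
# Solve (X^T X) t = X^T y as a linear system rather than forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
#
# Cross-check against NumPy's built-in least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))
```
Here p is much smaller than n and the columns are generic, so x transpose x should be invertible.
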
00:11:12.670 --> 00:11:16.600
So here, I use this condition.
00:11:16.600 --> 00:11:21.580
I want it to be invertible so I
can actually write its inverse.
00:11:21.580 --> 00:11:25.570
Here, I wrote, rank
of x is equal to p.
00:11:25.570 --> 00:11:26.948
What is the difference?
00:11:36.910 --> 00:11:40.900
Well, there's basically
no difference.
00:11:40.900 --> 00:11:45.100
Basically, here,
I have to assume--
00:11:45.100 --> 00:11:47.350
what is the size of the
matrix x transpose x?
00:11:52.509 --> 00:11:53.376
[INTERPOSING VOICES]
00:11:53.376 --> 00:11:55.250
PHILIPPE RIGOLLET: Yeah,
so what is the size?
00:11:55.250 --> 00:11:56.850
AUDIENCE: p by p.
00:11:56.850 --> 00:11:58.000
PHILIPPE RIGOLLET: p by p.
00:11:58.000 --> 00:12:00.940
So this matrix is
invertible if it's of rank p,
00:12:00.940 --> 00:12:02.500
if you know what rank means.
00:12:02.500 --> 00:12:05.740
If you don't, just think that rank
p means that it's invertible.
00:12:05.740 --> 00:12:07.780
So it's full rank
and it's invertible.
00:12:07.780 --> 00:12:10.690
And the rank of x
transpose x is actually
00:12:10.690 --> 00:12:13.540
just the rank of x because this
is the same matrix that you
00:12:13.540 --> 00:12:14.710
apply twice.
00:12:14.710 --> 00:12:15.940
And that's all it's saying.
00:12:15.940 --> 00:12:18.160
So if you're not comfortable
with the notion of rank
00:12:18.160 --> 00:12:21.070
that you see here, just
think of this condition
00:12:21.070 --> 00:12:25.080
just being the condition that
x transpose x is invertible.
00:12:25.080 --> 00:12:26.650
And that's all it says.
00:12:26.650 --> 00:12:29.620
What it means for it to be
invertible-- this was true.
00:12:29.620 --> 00:12:32.530
We made no assumption
up to this point.
00:12:32.530 --> 00:12:35.840
If x transpose x is not invertible,
it means that there
00:12:35.840 --> 00:12:38.840
might be multiple
solutions to this equation.
00:12:38.840 --> 00:12:42.830
In particular, for a matrix
to not be invertible,
00:12:42.830 --> 00:12:45.890
it means that there's
some vector v.
00:12:45.890 --> 00:12:55.800
So if x transpose x
is not invertible,
00:12:55.800 --> 00:13:00.080
then this is equivalent
to there exists a vector
00:13:00.080 --> 00:13:07.910
v, which is not 0, and such that
x transpose xv is equal to 0.
00:13:07.910 --> 00:13:10.400
That's what it means
to not be invertible.
00:13:10.400 --> 00:13:13.730
So in particular, if
beta hat is a solution--
00:13:22.090 --> 00:13:26.290
so these equations are sometimes
called score equations,
00:13:26.290 --> 00:13:28.280
because the gradient
is called the score,
00:13:28.280 --> 00:13:31.090
and so you're just checking
if the gradient is equal to 0.
00:13:31.090 --> 00:13:33.730
So if beta hat
satisfies star, then so
00:13:33.730 --> 00:13:46.820
does beta hat plus lambda v for
all lambda in the real line.
00:13:51.840 --> 00:13:54.930
And the reason is because,
well, if I start looking at--
00:13:54.930 --> 00:14:02.400
what is x transpose x times
beta hat plus lambda v?
00:14:02.400 --> 00:14:08.000
Well, by linearity, this
is just x transpose x
00:14:08.000 --> 00:14:16.510
beta hat plus lambda
x transpose x times v.
00:14:16.510 --> 00:14:17.750
But this guy is what?
00:14:22.420 --> 00:14:27.860
It's 0, just because
that's what we assumed.
00:14:27.860 --> 00:14:31.070
We assumed that x transpose
xv was equal to 0,
00:14:31.070 --> 00:14:34.130
so we're left only with this
part, which, by star, is just
00:14:34.130 --> 00:14:35.060
x transpose y.
00:14:40.040 --> 00:14:44.300
So that means that x transpose
x beta hat plus lambda v
00:14:44.300 --> 00:14:48.080
is actually equal to x transpose
y, which means that there's
00:14:48.080 --> 00:14:50.360
another solution, which
is not just beta hat,
00:14:50.360 --> 00:14:56.784
but any move of beta hat along
this direction v by any size.
00:14:56.784 --> 00:14:58.700
So that's going to be
an issue, because you're
00:14:58.700 --> 00:15:00.050
looking for one estimator.
00:15:00.050 --> 00:15:03.350
And there's not just one,
in this case, there's many.
00:15:03.350 --> 00:15:05.599
And so this is not
going to be well-defined
00:15:05.599 --> 00:15:07.140
and you're going to
have some issues.
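
NOTE
Editorial sketch, not part of the lecture: a design matrix with a duplicated column makes x transpose x singular, and every shift of a solution along v still solves the same equation. The data, the vector v, and the lambda values are assumptions chosen for illustration.
```python
import numpy as np
rng = np.random.default_rng(3)
n = 8
x1 = rng.normal(size=n)
# Third column duplicates the second, so the columns are linearly dependent.
X = np.column_stack([np.ones(n), x1, x1])
y = rng.normal(size=n)
#
A = X.T @ X
v = np.array([0.0, 1.0, -1.0])  # nonzero vector with X v = 0, hence A v = 0
#
# One particular solution: fit on the two distinct columns, then pad with a zero.
X_reduced = X[:, :2]
b = np.linalg.solve(X_reduced.T @ X_reduced, X_reduced.T @ y)
beta_hat = np.append(b, 0.0)
#
# Every beta_hat + lambda * v satisfies the same equation A t = X^T y.
for lam in (-5.0, 0.0, 2.5):
    shifted = beta_hat + lam * v
    print(lam, np.allclose(A @ shifted, X.T @ y))
```
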
00:15:07.140 --> 00:15:09.560
So if you want to talk about
the least squares estimator,
00:15:09.560 --> 00:15:13.510
you have to make
this assumption.
00:15:13.510 --> 00:15:15.310
What does it imply
in terms of, can I
00:15:15.310 --> 00:15:18.976
think of p being 2n,
for example, in this case?
00:15:18.976 --> 00:15:20.350
What happens if
p is equal to 2n?
00:15:27.528 --> 00:15:31.084
AUDIENCE: Well, then the rank
of your matrix is only p/2.
00:15:31.084 --> 00:15:33.500
PHILIPPE RIGOLLET: So the rank
of your matrix is only p/2,
00:15:33.500 --> 00:15:36.480
so that means that this is
actually not going to happen.
00:15:36.480 --> 00:15:39.530
I mean, it's not only
p/2, it's at most p/2.
00:15:39.530 --> 00:15:42.560
It's at most the smallest of the
two dimensions of your matrix.
00:15:42.560 --> 00:15:44.600
So if your matrix
is n times 2n, it's
00:15:44.600 --> 00:15:47.702
at most n, which means that
it's not going to be full rank,
00:15:47.702 --> 00:15:49.160
so it's not going
to be invertible.
00:15:49.160 --> 00:15:53.600
So every time the dimension p
is larger than the sample size,
00:15:53.600 --> 00:15:56.060
your matrix is not invertible,
and you cannot talk about
00:15:56.060 --> 00:15:57.950
the least squares estimator.
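
NOTE
Editorial sketch, not part of the lecture: with p equal to 2n, the rank of x transpose x cannot exceed n, so the p-by-p matrix is singular. The random matrix and the explicit tolerance passed to matrix_rank are assumptions made to keep the check numerically robust.
```python
import numpy as np
rng = np.random.default_rng(4)
n = 5
p = 2 * n                      # more unknowns than equations
X = rng.normal(size=(n, p))
#
gram = X.T @ X                 # p-by-p matrix
rank_X = np.linalg.matrix_rank(X)
# Explicit tolerance so rounding noise in the zero singular values is ignored.
rank_gram = np.linalg.matrix_rank(gram, tol=1e-8)
print(gram.shape, rank_X, rank_gram)
```
Both ranks come out as n, which is strictly less than p, so the inverse does not exist.
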
00:15:57.950 --> 00:15:59.930
So that's something
to keep in mind.
00:15:59.930 --> 00:16:01.710
And it's actually a
very simple thing.
00:16:01.710 --> 00:16:05.750
It's essentially saying,
well, if p is larger than n,
00:16:05.750 --> 00:16:07.760
it means that you
have more parameters
00:16:07.760 --> 00:16:11.000
to estimate than you have
equations to estimate it.
00:16:11.000 --> 00:16:12.480
So you have this linear system.
00:16:12.480 --> 00:16:17.240
There's one equation
per observation.
00:16:17.240 --> 00:16:19.400
Each row, which was
each observation,
00:16:19.400 --> 00:16:21.230
was giving me one equation.
00:16:21.230 --> 00:16:24.960
But then the number of unknowns
in this linear system is p,
00:16:24.960 --> 00:16:28.760
and so I cannot solve linear
systems that have more unknowns
00:16:28.760 --> 00:16:30.350
than they have equations.
00:16:30.350 --> 00:16:32.302
And so that's basically
what's happening.
00:16:32.302 --> 00:16:34.010
Now, in practice, if
you think about what
00:16:34.010 --> 00:16:36.330
data sets look like
these days, for example,
00:16:36.330 --> 00:16:38.940
people are trying to
express some phenotype.
00:16:38.940 --> 00:16:41.570
So phenotype is something you
can measure on people-- maybe
00:16:41.570 --> 00:16:43.880
the color of your
eyes, or your height,
00:16:43.880 --> 00:16:47.690
or whether you have diabetes
or not, things like this,
00:16:47.690 --> 00:16:51.080
so things that are macroscopic.
00:16:51.080 --> 00:16:53.429
And then they want to use
the genotype to do that.
00:16:53.429 --> 00:16:55.970
They want to measure your-- they
want to sequence your genome
00:16:55.970 --> 00:16:58.940
and try to use this to
predict whether you're going
00:16:58.940 --> 00:17:01.250
to be responsive to a drug
or whether your eyes are
00:17:01.250 --> 00:17:03.170
going to be blue, or
something like this.
00:17:03.170 --> 00:17:05.060
Now, the data sets
that you can have--
00:17:05.060 --> 00:17:09.619
people, maybe, for a given study
about some sort of disease.
00:17:09.619 --> 00:17:15.260
Maybe you will sequence the
genome of maybe 100 people.
00:17:15.260 --> 00:17:17.645
n is equal to 100.
00:17:17.645 --> 00:17:21.030
p is basically the number
of genes they're sequencing.
00:17:21.030 --> 00:17:23.849
This is of the order of 100,000.
00:17:23.849 --> 00:17:26.287
So you can imagine that this
is a case where n is much,
00:17:26.287 --> 00:17:28.620
much smaller than p, and you
cannot talk about the least
00:17:28.620 --> 00:17:29.670
squares estimator.
00:17:29.670 --> 00:17:31.080
There's plenty of them.
00:17:31.080 --> 00:17:33.630
There's not just
one line like that,
00:17:33.630 --> 00:17:36.180
lambda times v that
you can move away.
00:17:36.180 --> 00:17:40.320
There's basically an entire
space in which you can move,
00:17:40.320 --> 00:17:42.027
and so it's not well-defined.
00:17:42.027 --> 00:17:43.860
So at the end of this
class, I will give you
00:17:43.860 --> 00:17:46.740
a short introduction
on how you do this.
00:17:46.740 --> 00:17:49.200
This actually represents
more and more.
00:17:49.200 --> 00:17:51.652
It becomes a more and more
preponderant part of the data
00:17:51.652 --> 00:17:53.610
sets you have to deal
with, because people just
00:17:53.610 --> 00:17:54.950
collect data.
00:17:54.950 --> 00:17:57.810
When I do the
sequencing, the machine
00:17:57.810 --> 00:17:59.730
allows me to sequence
100,000 genes.
00:17:59.730 --> 00:18:03.600
I'm not going to stop at 100
because doctors are never
00:18:03.600 --> 00:18:06.510
going to have cohorts of
more than 100 patients.
00:18:06.510 --> 00:18:08.510
So you just collect
everything you can collect.
00:18:08.510 --> 00:18:11.310
And this is true for everything.
00:18:11.310 --> 00:18:13.372
Cars have sensors
all over the place,
00:18:13.372 --> 00:18:15.080
much more than they
actually gather data.
00:18:15.080 --> 00:18:16.890
There's data, there's--
we're creating,
00:18:16.890 --> 00:18:18.840
we're recording
everything we can.
00:18:18.840 --> 00:18:20.744
And so we need some new
techniques for that,
00:18:20.744 --> 00:18:23.410
and that's what high-dimensional
statistics is trying to answer.
00:18:23.410 --> 00:18:25.530
So this is way beyond
the scope of this class,
00:18:25.530 --> 00:18:27.029
but towards the
end, I will give you
00:18:27.029 --> 00:18:29.340
some hints about what can
be done in this framework
00:18:29.340 --> 00:18:34.810
because, well, this is the new
reality we have to deal with.
00:18:34.810 --> 00:18:37.100
So here, we're in a case
where p's less than n
00:18:37.100 --> 00:18:38.555
and typically much
smaller than n.
00:18:38.555 --> 00:18:40.680
So the kind of orders of
magnitude you want to have
00:18:40.680 --> 00:18:46.135
is maybe p's of order 10 and
n's of order 100, something
00:18:46.135 --> 00:18:46.635
like this.
00:18:46.635 --> 00:18:50.280
So you can scale that,
but maybe 10 times larger.
00:18:50.280 --> 00:18:57.810
So maybe you cannot solve this
guy for beta hat, but actually,
00:18:57.810 --> 00:19:00.480
you can talk about
x times beta hat,
00:19:00.480 --> 00:19:02.580
even if p is larger than n.
00:19:02.580 --> 00:19:06.880
And the reason is
that x times beta hat
00:19:06.880 --> 00:19:09.280
is actually something
that's very well-defined.
00:19:09.280 --> 00:19:11.400
So what is x times beta hat?
00:19:11.400 --> 00:19:16.810
Remember, I started
with the model.
00:19:16.810 --> 00:19:20.580
So if I look at this
definition, essentially, what I
00:19:20.580 --> 00:19:24.360
had as the original
thing was that the vector
00:19:24.360 --> 00:19:29.910
y was equal to x times beta
plus the vector epsilon.
00:19:29.910 --> 00:19:32.960
That was my model.
00:19:32.960 --> 00:19:36.400
So beta is actually
giving me something.
00:19:36.400 --> 00:19:39.380
Beta is actually some
parameter, some coefficients
00:19:39.380 --> 00:19:40.830
that are interesting.
00:19:40.830 --> 00:19:43.610
But a good estimator
for-- so here, it
00:19:43.610 --> 00:19:45.320
means that the
observations that I have
00:19:45.320 --> 00:19:48.870
are of the form x times
beta plus some noise.
00:19:48.870 --> 00:19:51.050
So if I want to denoise,
to remove the noise,
00:19:51.050 --> 00:19:57.110
a good candidate to
denoise is x times beta hat.
00:19:57.110 --> 00:19:59.450
x times beta hat is something
that should actually
00:19:59.450 --> 00:20:10.140
be useful to me, which should
be close to x times beta.
00:20:10.140 --> 00:20:13.770
So in the one-dimensional case,
what it means is that if I
00:20:13.770 --> 00:20:16.530
have-- let's say this
is the true line,
00:20:16.530 --> 00:20:19.050
and these are my
x's, so I have--
00:20:19.050 --> 00:20:22.050
these are the true
points on the real line,
00:20:22.050 --> 00:20:24.180
and then I have
my little epsilon
00:20:24.180 --> 00:20:26.670
that just give me
my observations that
00:20:26.670 --> 00:20:28.560
move around this line.
00:20:28.560 --> 00:20:34.860
So this is one of the
epsilons, say epsilon i.
00:20:34.860 --> 00:20:37.460
Then I can actually
either talk--
00:20:37.460 --> 00:20:39.210
to say that I
recovered the line,
00:20:39.210 --> 00:20:42.270
I can actually talk about
recovering the right intercept
00:20:42.270 --> 00:20:44.370
or recovering the right
slope for this line.
00:20:44.370 --> 00:20:46.740
Those are the two parameters
that I need to recover.
00:20:46.740 --> 00:20:48.900
But I can also say
that I've actually
00:20:48.900 --> 00:20:50.880
found a set of
points that's closer
00:20:50.880 --> 00:20:56.250
to being on the line that are
closer to this set of points
00:20:56.250 --> 00:21:00.900
right here than the original
crosses that I observed.
00:21:00.900 --> 00:21:03.870
So if we go back to
the picture here,
00:21:03.870 --> 00:21:08.850
for example, what I could do
is say, well, for this point
00:21:08.850 --> 00:21:09.750
here--
00:21:09.750 --> 00:21:11.430
there was an x here--
00:21:11.430 --> 00:21:14.550
rather than looking at this
dot, which was my observation,
00:21:14.550 --> 00:21:17.732
I can say, well, now that
I've estimated the red line,
00:21:17.732 --> 00:21:19.440
I can actually just
say, well, this point
00:21:19.440 --> 00:21:21.420
should really be here.
00:21:21.420 --> 00:21:23.760
And actually, I can
move all these dots
00:21:23.760 --> 00:21:26.010
so that they're actually
on the red line.
00:21:26.010 --> 00:21:28.680
And this should be a
better value, something
00:21:28.680 --> 00:21:30.840
that has less noise than
the original y value
00:21:30.840 --> 00:21:32.010
that I should see.
00:21:32.010 --> 00:21:33.630
It should be close
to the true value
00:21:33.630 --> 00:21:37.240
that I should be seeing
without the extra noise.
00:21:37.240 --> 00:21:40.080
So that's definitely something
that could be of interest.
00:21:43.410 --> 00:21:45.990
For example, in
imaging, you're not
00:21:45.990 --> 00:21:48.690
trying to understand--
so when you do imaging,
00:21:48.690 --> 00:21:50.370
y is basically an image.
00:21:50.370 --> 00:21:53.310
So think of a pixel
image, and you just
00:21:53.310 --> 00:21:55.644
stack it into one long vector.
00:21:55.644 --> 00:21:57.060
And what you see
is something that
00:21:57.060 --> 00:21:59.730
should look like some linear
combination of some feature
00:21:59.730 --> 00:22:01.440
vectors, maybe.
00:22:01.440 --> 00:22:05.790
So people created
a bunch of features.
00:22:05.790 --> 00:22:09.290
They're called, for example,
Gabor frames or wavelet
00:22:09.290 --> 00:22:14.820
transforms-- so just well-known
libraries of variables x such
00:22:14.820 --> 00:22:17.250
that when you take linear
combinations of those guys,
00:22:17.250 --> 00:22:19.730
this should look like
a bunch of images.
00:22:19.730 --> 00:22:22.049
And what you want
for your image--
00:22:22.049 --> 00:22:24.090
you don't care what the
coefficients of the image
00:22:24.090 --> 00:22:26.130
are in these bases
that you came up with.
00:22:26.130 --> 00:22:28.690
What you care about is
denoising the image.
00:22:28.690 --> 00:22:31.690
And so you really
want to get x beta.
00:22:31.690 --> 00:22:34.390
So if you want to
estimate x beta,
00:22:34.390 --> 00:22:36.920
well, you can use x beta hat.
00:22:36.920 --> 00:22:37.960
What is x beta hat?
00:22:37.960 --> 00:22:42.040
Well, since beta hat is x
transpose x inverse x transpose
00:22:42.040 --> 00:22:44.030
y, this is x times x transpose
x inverse x transpose y.
00:22:48.800 --> 00:22:50.830
That's my estimator for x beta.
00:22:54.060 --> 00:22:59.170
Now, this thing,
actually, I can define
00:22:59.170 --> 00:23:01.630
even if I'm not full rank.
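
NOTE
Editorial sketch, not part of the lecture: when x is not full rank there are many minimizers t, but they all give the same fitted vector x times t. The duplicated-column design, the simulated y, and the three candidate vectors are assumptions for illustration.
```python
import numpy as np
rng = np.random.default_rng(5)
n = 8
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x1])   # duplicated column: X^T X is singular
y = 1.0 + 2.0 * x1 + 0.1 * rng.normal(size=n)
#
# Fit on the two distinct columns, where the normal equations have a unique solution.
X_reduced = X[:, :2]
b = np.linalg.solve(X_reduced.T @ X_reduced, X_reduced.T @ y)
#
# Three different coefficient vectors that all solve the singular equation:
# the slope is placed on column 2, on column 3, or split between them.
candidates = [
    np.array([b[0], b[1], 0.0]),
    np.array([b[0], 0.0, b[1]]),
    np.array([b[0], b[1] / 2, b[1] / 2]),
]
fitted = [X @ t for t in candidates]
print(all(np.allclose(fitted[0], f) for f in fitted))
```
The coefficients differ, yet the fitted values coincide, which is why x times beta hat is well-defined here.
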
00:23:01.630 --> 00:23:03.190
So why is this
thing interesting?
00:23:03.190 --> 00:23:08.120
Well, there's a formula
for this estimator,
00:23:08.120 --> 00:23:10.260
but actually, I can
visualize what this thing is.
00:23:18.792 --> 00:23:22.840
So let's assume, for the
sake of illustration,
00:23:22.840 --> 00:23:26.200
that n is equal to 3.
00:23:29.500 --> 00:23:33.700
So that means that y lives
in a three-dimensional space.
00:23:33.700 --> 00:23:36.800
And so let's say it's here.
00:23:36.800 --> 00:23:43.970
And so I have my,
let's say, y's here.
00:23:43.970 --> 00:23:48.020
And I also have a
plane that's given
00:23:48.020 --> 00:23:55.450
by the vectors x1 transpose
x2 transpose, which
00:23:55.450 --> 00:23:58.890
is, by the way, 1--
00:23:58.890 --> 00:24:01.290
sorry, that's not
what I want to do.
00:24:04.320 --> 00:24:10.600
I'm going to say that n is equal
to 3 and that p is equal to 2.
00:24:10.600 --> 00:24:18.460
So I basically have two
vectors, 1, 1, 1 and another one,
00:24:18.460 --> 00:24:25.670
let's assume that
it's, for example, abc.
00:24:25.670 --> 00:24:27.020
So those are my two vectors.
00:24:27.020 --> 00:24:33.430
This is x1, and this is x2.
00:24:36.190 --> 00:24:39.190
And those are my three
observations for this guy.
00:24:39.190 --> 00:24:48.940
So what I want when
I minimize this,
00:24:48.940 --> 00:24:50.560
I'm looking at the
point which can
00:24:50.560 --> 00:24:52.660
be formed as the linear
combination of the columns
00:24:52.660 --> 00:24:57.887
of x, and I'm trying to find
the guy that's the closest to y.
00:24:57.887 --> 00:24:58.970
So what does it look like?
00:24:58.970 --> 00:25:01.870
Well, the two points, 1, 1,
1 is going to be, say, here.
00:25:01.870 --> 00:25:04.360
That's the point 1, 1, 1.
00:25:04.360 --> 00:25:06.300
And let's say that
abc is this point.
00:25:14.890 --> 00:25:17.410
So now I have a line that
goes through those two guys.
00:25:20.620 --> 00:25:22.820
That's not really--
let's say it's
00:25:22.820 --> 00:25:24.560
going through those two guys.
00:25:24.560 --> 00:25:27.800
And this is the line which
can be formed by looking only
00:25:27.800 --> 00:25:28.990
at linear combination.
00:25:28.990 --> 00:25:36.330
So this is the line of
x times t for t in R^2.
00:25:36.330 --> 00:25:39.320
That's this entire
line that you can get.
00:25:39.320 --> 00:25:42.870
Why is it-- yeah, sorry,
it's not just a line,
00:25:42.870 --> 00:25:45.720
I also have to have
t, all the 0's thing.
00:25:45.720 --> 00:25:48.860
So that actually
creates an entire plane,
00:25:48.860 --> 00:25:54.160
which is going to be really
hard for me to represent.
00:25:54.160 --> 00:25:55.390
I don't know.
00:25:55.390 --> 00:26:00.044
I mean, maybe I shouldn't
do it in these dimensions.
00:26:05.130 --> 00:26:08.650
So I'm going to do it like that.
00:26:11.240 --> 00:26:14.350
So this plane here is the
set of xt for t in r2.
00:26:17.770 --> 00:26:22.390
So that's a two-dimensional
plane, definitely goes through 0,
00:26:22.390 --> 00:26:23.760
and those are all these things.
00:26:23.760 --> 00:26:25.960
So think of a sheet of
paper in three dimensions.
00:26:25.960 --> 00:26:27.910
Those are the things I can get.
00:26:27.910 --> 00:26:29.600
So now, what I'm
going to have as y
00:26:29.600 --> 00:26:32.980
is not necessarily
in this plane.
00:26:32.980 --> 00:26:39.440
y is actually something
in this plane, x beta
00:26:39.440 --> 00:26:40.655
plus some epsilon.
00:26:44.810 --> 00:26:50.091
y is x beta plus epsilon.
00:26:50.091 --> 00:26:51.590
So I start from
this plane, and then
00:26:51.590 --> 00:26:53.048
I have this epsilon
that pushes me,
00:26:53.048 --> 00:26:54.620
maybe, outside of this plane.
00:26:54.620 --> 00:26:56.370
And what least squares
is doing is saying,
00:26:56.370 --> 00:26:59.212
well, I know that epsilon
should be fairly small,
00:26:59.212 --> 00:27:01.670
so the only thing I'm going to
be doing that actually makes
00:27:01.670 --> 00:27:04.370
sense is to take y and
find the point that's
00:27:04.370 --> 00:27:06.170
on this plane that's
the closest to it.
00:27:06.170 --> 00:27:10.010
And that corresponds to doing
an orthogonal projection of y
00:27:10.010 --> 00:27:13.070
onto this thing, and that's
actually exactly x beta hat.
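A minimal NumPy sketch of this picture, assuming made-up values for the design matrix X and the vector y (n = 3, p = 2). The numbers are illustrative only and do not come from the lecture. It computes beta hat by least squares, forms the fitted vector X beta hat, and checks that what is left over is orthogonal to the columns of X, which is what "orthogonal projection" means here.

```python
import numpy as np

# Hypothetical data: n = 3 observations, p = 2 columns (all-ones column plus one covariate).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 0.0, 2.0])

# Least squares coefficients: the t in R^2 that minimizes ||y - X t||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted vector X beta_hat: the point of the column span of X closest to y.
y_fit = X @ beta_hat

# The residual y - X beta_hat should be orthogonal to every column of X.
residual = y - y_fit
print("beta_hat:", beta_hat)
print("X^T residual:", X.T @ residual)
```

With these particular numbers beta hat works out to (0.5, 0.5) up to floating-point error, and X transpose times the residual is numerically zero.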
00:27:18.840 --> 00:27:21.390
So in one dimension, just
because this is actually
00:27:21.390 --> 00:27:22.920
a little hard--
00:27:22.920 --> 00:27:34.140
in one dimension, so
that's if p is equal to 1.
00:27:34.140 --> 00:27:36.780
So let's say this is my point.
00:27:36.780 --> 00:27:38.854
And then I have y, which
is in two dimensions,
00:27:38.854 --> 00:27:40.020
so this is all on the plane.
00:27:42.930 --> 00:27:44.590
What it does, this is my--
00:27:44.590 --> 00:27:48.579
the point that's right here
is actually x beta hat.
00:27:48.579 --> 00:27:49.870
That's how you find x beta hat.
00:27:49.870 --> 00:27:51.780
You take your point
y and you project it
00:27:51.780 --> 00:27:54.490
on the linear span
of the columns of x.
00:27:54.490 --> 00:27:56.640
And that's x beta hat.
00:27:56.640 --> 00:27:59.032
This does not tell you
exactly what beta should be.
00:27:59.032 --> 00:28:00.990
And if you know a little
bit of linear algebra,
00:28:00.990 --> 00:28:04.580
it's pretty clear, because
if you want to find beta hat,
00:28:04.580 --> 00:28:06.330
that means that you
should be able to find
00:28:06.330 --> 00:28:12.284
the coordinates of a point in
the system of columns of x.
00:28:12.284 --> 00:28:13.950
And if those guys are
redundant, there's
00:28:13.950 --> 00:28:17.430
not going to be unique
coordinates for these guys,
00:28:17.430 --> 00:28:19.410
so that's why it's
actually not easy to find.
00:28:19.410 --> 00:28:21.120
But x beta hat is
uniquely defined.
00:28:21.120 --> 00:28:21.870
It's a projection.
00:28:21.870 --> 00:28:22.744
Yeah?
00:28:22.744 --> 00:28:24.285
AUDIENCE: And epsilon
is the distance
00:28:24.285 --> 00:28:25.840
between the y and the--
00:28:25.840 --> 00:28:29.630
PHILIPPE RIGOLLET: No, epsilon
is the vector that goes from--
00:28:29.630 --> 00:28:33.800
so there's a true x beta.
00:28:33.800 --> 00:28:36.245
That's the true one.
00:28:36.245 --> 00:28:36.870
It's not clear.
00:28:36.870 --> 00:28:41.940
I mean, x beta hat is unlikely
to be exactly equal to x beta.
00:28:41.940 --> 00:28:44.410
And then the epsilon is the
one that starts from this line.
00:28:44.410 --> 00:28:46.800
It's the vector that
pushes you away.
00:28:46.800 --> 00:28:48.240
So really, this is this vector.
00:28:48.240 --> 00:28:50.600
That's epsilon.
00:28:50.600 --> 00:28:51.650
So it's not a length.
00:28:51.650 --> 00:28:54.245
The length of epsilon
is the distance,
00:28:54.245 --> 00:28:57.454
but epsilon is just the
actual vector that takes you
00:28:57.454 --> 00:28:58.370
from one to the other.
00:29:01.600 --> 00:29:03.020
So this is all in
two dimensions,
00:29:03.020 --> 00:29:05.060
and it's probably much
clearer than what's here.
00:29:09.080 --> 00:29:12.860
And so here, I claim
that this x beta hat--
00:29:12.860 --> 00:29:15.110
so from this
picture, I implicitly
00:29:15.110 --> 00:29:22.980
claim that forming this
operator that takes y
00:29:22.980 --> 00:29:26.400
and maps it into this vector
x times x transpose y, blah,
00:29:26.400 --> 00:29:33.570
blah, blah, this should actually
be equal to the projection of y
00:29:33.570 --> 00:29:44.990
onto the linear span
of the columns of x.
00:29:44.990 --> 00:29:46.889
That's what I just drew for you.
00:29:46.889 --> 00:29:48.430
And what it means
is that this matrix
00:29:48.430 --> 00:29:49.679
must be the projection matrix.
00:29:54.350 --> 00:29:56.910
So of course, anybody--
00:29:56.910 --> 00:29:59.400
who knows linear algebra here?
00:29:59.400 --> 00:30:01.730
OK, wow.
00:30:01.730 --> 00:30:04.560
So what are the conditions
that a projection matrix
00:30:04.560 --> 00:30:06.763
should be satisfying?
00:30:06.763 --> 00:30:08.150
AUDIENCE: Squares
to itself.
00:30:08.150 --> 00:30:09.360
PHILIPPE RIGOLLET: Squares
to itself, right.
00:30:09.360 --> 00:30:11.430
If I project twice,
I'm not moving.
00:30:11.430 --> 00:30:13.560
If I keep on
iterating projection,
00:30:13.560 --> 00:30:15.330
once I'm in the space
I'm projecting onto,
00:30:15.330 --> 00:30:16.854
I'm not moving.
00:30:16.854 --> 00:30:17.808
What else?
00:30:24.970 --> 00:30:28.110
Do they have to be
symmetric, maybe?
00:30:28.110 --> 00:30:29.960
AUDIENCE: If it's an
orthogonal projection.
00:30:29.960 --> 00:30:32.501
PHILIPPE RIGOLLET: Yeah, so this
is an orthogonal projection.
00:30:32.501 --> 00:30:34.710
It has to be symmetric.
00:30:34.710 --> 00:30:36.510
And that's pretty much it.
00:30:36.510 --> 00:30:38.520
So from those things,
you can actually
00:30:38.520 --> 00:30:39.694
get quite a bit of things.
00:30:39.694 --> 00:30:41.610
But what's interesting
is that if you actually
00:30:41.610 --> 00:30:44.550
look at the eigenvalues
of this matrix,
00:30:44.550 --> 00:30:47.670
they should be either
0 or 1, essentially.
00:30:47.670 --> 00:30:52.320
And they are 1 if the
eigenvector associated
00:30:52.320 --> 00:30:55.089
is within this space,
and 0 otherwise.
00:30:55.089 --> 00:30:56.880
And so that's basically
what you can check.
00:30:56.880 --> 00:30:58.630
This is not an exercise
in linear algebra,
00:30:58.630 --> 00:31:00.970
so I'm not going to go too
much into those details.
00:31:00.970 --> 00:31:03.330
But this is essentially what
you want to keep in mind.
00:31:03.330 --> 00:31:05.460
What's associated to
orthogonal projections
00:31:05.460 --> 00:31:07.860
is Pythagoras' theorem.
00:31:07.860 --> 00:31:10.380
And that's something that's
going to be useful for us.
00:31:10.380 --> 00:31:12.150
What it's essentially
telling is that if I
00:31:12.150 --> 00:31:16.342
look at this norm squared, it's
equal to this norm squared--
00:31:16.342 --> 00:31:18.300
sorry, this norm squared
plus this norm squared
00:31:18.300 --> 00:31:20.100
is equal to this norm squared.
00:31:20.100 --> 00:31:22.510
And that's something
the norm of y squared.
00:31:22.510 --> 00:31:32.040
So Pythagoras tells me
that the norm of y squared
00:31:32.040 --> 00:31:40.090
is equal to the norm of x beta
hat squared plus the norm of y
00:31:40.090 --> 00:31:41.230
minus x beta hat squared.
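A quick numerical check of this identity, again with made-up X and y (assumed values, not from the lecture). The only ingredient is that the residual y minus X beta hat is orthogonal to the fitted vector X beta hat.

```python
import numpy as np

# Hypothetical data with n = 3, p = 2.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 0.0, 2.0])

# Least squares fit via the normal equations (X^T X assumed invertible).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_fit = X @ beta_hat
residual = y - y_fit

# Pythagoras: ||y||^2 = ||X beta_hat||^2 + ||y - X beta_hat||^2.
lhs = y @ y
rhs = y_fit @ y_fit + residual @ residual
print(lhs, rhs)
```

For these numbers the three squared norms are 5, 3.5 and 1.5, so the two sides agree.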
00:31:46.120 --> 00:31:47.890
Agreed?
00:31:47.890 --> 00:31:51.700
It's just because I have
a right angle here.
00:31:51.700 --> 00:31:54.174
So that's this plus
this is equal to this.
00:31:58.840 --> 00:32:02.770
So now, to define this,
I made no assumption.
00:32:02.770 --> 00:32:04.630
Epsilon could be as wild.
00:32:04.630 --> 00:32:07.300
I was just crossing my fingers
that epsilon was actually
00:32:07.300 --> 00:32:09.910
small enough that
it would make sense
00:32:09.910 --> 00:32:13.450
to project onto the linear
span, because I implicitly
00:32:13.450 --> 00:32:16.640
assumed that epsilon did not
take me all the way there,
00:32:16.640 --> 00:32:19.900
so that actually, it makes
sense to project back.
00:32:19.900 --> 00:32:22.240
And so for that, I need to
somehow make assumptions
00:32:22.240 --> 00:32:24.730
that epsilon is
well-behaved and that it's
00:32:24.730 --> 00:32:31.330
completely wild, that
it's moving uniformly
00:32:31.330 --> 00:32:33.050
in all directions of the space.
00:32:33.050 --> 00:32:34.630
There's no privileged
direction where
00:32:34.630 --> 00:32:36.005
it's always going,
otherwise, I'm
00:32:36.005 --> 00:32:37.900
going to make a
systematic error.
00:32:37.900 --> 00:32:42.400
And I need that those epsilons
are going to average somehow.
00:32:42.400 --> 00:32:44.641
So here are the
assumptions we're
00:32:44.641 --> 00:32:46.390
going to be making so
that we can actually
00:32:46.390 --> 00:32:48.880
do some statistical inference.
00:32:48.880 --> 00:32:53.350
The first one is that the
design matrix is deterministic.
00:32:53.350 --> 00:32:55.270
So I started by saying the x--
00:32:55.270 --> 00:32:58.570
I have xi, yi, and maybe
they're independent.
00:32:58.570 --> 00:33:03.460
Here, they are, but the xi's, I
want to think as deterministic.
00:33:03.460 --> 00:33:06.400
If they're not deterministic,
I can condition on them,
00:33:06.400 --> 00:33:08.110
but otherwise,
it's very difficult
00:33:08.110 --> 00:33:11.770
to think about this thing
if I think of those entries
00:33:11.770 --> 00:33:14.380
as being random,
because then I have
00:33:14.380 --> 00:33:17.470
the inverse of a random matrix,
and things become very, very
00:33:17.470 --> 00:33:18.800
complicated.
00:33:18.800 --> 00:33:21.760
So we're going to think of those
guys as being deterministic.
00:33:21.760 --> 00:33:27.400
We're going to think of the
model as being homoscedastic.
00:33:27.400 --> 00:33:29.790
And actually, let me come
back to this in a second.
00:33:29.790 --> 00:33:31.780
Homoscedastic-- well,
I mean, if you're
00:33:31.780 --> 00:33:34.330
trying to find the
etymology of this word,
00:33:34.330 --> 00:33:38.080
"homo" means the same,
"scedastic" means scaling.
00:33:38.080 --> 00:33:40.090
So what I want to say
is that the epsilons
00:33:40.090 --> 00:33:41.890
have the same scaling.
00:33:41.890 --> 00:33:46.914
And since my third assumption is
that epsilon is Gaussian, then
00:33:46.914 --> 00:33:49.330
essentially, what I'm going
to want is that they all share
00:33:49.330 --> 00:33:52.900
the same sigma squared.
00:33:52.900 --> 00:33:55.540
They're independent, so this
is definitely in the identity
00:33:55.540 --> 00:33:56.784
covariance matrix.
00:33:56.784 --> 00:33:58.450
And I want them to
be centered, as well.
00:33:58.450 --> 00:34:00.310
That means that
there's no direction
00:34:00.310 --> 00:34:04.240
that I'm always privileging when
I'm moving away from my plane
00:34:04.240 --> 00:34:05.560
there.
00:34:05.560 --> 00:34:09.969
So these are
important conditions.
00:34:09.969 --> 00:34:13.210
It depends on how much
inference you want to do.
00:34:13.210 --> 00:34:16.310
If you want to write t-tests,
you need all these assumptions.
00:34:16.310 --> 00:34:19.810
But if you only want to
write, for example, the fact
00:34:19.810 --> 00:34:23.230
that your least squares
estimator is consistent,
00:34:23.230 --> 00:34:25.210
you really just need
the fact that epsilon
00:34:25.210 --> 00:34:27.630
has variance sigma squared.
00:34:27.630 --> 00:34:29.850
The fact that it's
Gaussian won't matter, just
00:34:29.850 --> 00:34:33.449
like Gaussianity doesn't
matter for the law of large numbers.
00:34:33.449 --> 00:34:34.055
Yeah?
00:34:34.055 --> 00:34:35.480
AUDIENCE: So the
first assumption
00:34:35.480 --> 00:34:38.013
that x has to be
deterministic, but I just
00:34:38.013 --> 00:34:40.327
made up this x1, x2--
00:34:40.327 --> 00:34:41.785
PHILIPPE RIGOLLET:
x is the matrix.
00:34:41.785 --> 00:34:42.485
AUDIENCE: Yeah.
00:34:42.485 --> 00:34:45.159
So those are random
variables, right?
00:34:45.159 --> 00:34:47.075
PHILIPPE RIGOLLET: No,
that's the assumption.
00:34:47.075 --> 00:34:49.400
AUDIENCE: OK.
00:34:49.400 --> 00:34:52.595
So I mean, once we collect the
data and put it in the matrix,
00:34:52.595 --> 00:34:54.020
it becomes deterministic.
00:34:54.020 --> 00:34:55.920
So maybe I'm missing something.
00:34:55.920 --> 00:34:56.920
PHILIPPE RIGOLLET: Yeah.
00:34:56.920 --> 00:35:00.510
So this is for the
purpose of the analysis.
00:35:00.510 --> 00:35:01.800
I can actually assume that--
00:35:01.800 --> 00:35:04.140
I look at my data,
and I think of this.
00:35:04.140 --> 00:35:06.210
So what is the difference
between thinking
00:35:06.210 --> 00:35:08.832
of data as deterministic or
thinking of it as random?
00:35:08.832 --> 00:35:11.040
When I talked about random
data, the only assumptions
00:35:11.040 --> 00:35:12.706
that I made were about
the distribution.
00:35:12.706 --> 00:35:14.730
I said, well, if my x
is a random variable,
00:35:14.730 --> 00:35:16.980
I want it to have this
variance and I want it to have,
00:35:16.980 --> 00:35:19.250
maybe, this distribution,
things like this.
00:35:19.250 --> 00:35:25.050
Here, I'm actually making
an assumption on the values
00:35:25.050 --> 00:35:25.940
that I see.
00:35:25.940 --> 00:35:30.120
I'm saying that the value
that you give me is--
00:35:30.120 --> 00:35:32.010
the matrix is
actually invertible.
00:35:32.010 --> 00:35:33.960
x transpose x will
be invertible.
00:35:33.960 --> 00:35:36.690
So I've never done
that before, assuming
00:35:36.690 --> 00:35:38.880
that some random variable--
00:35:38.880 --> 00:35:41.740
assuming that some Gaussian
random variable was positive,
00:35:41.740 --> 00:35:43.160
for example.
00:35:43.160 --> 00:35:45.570
We don't do that, because
there's always some probability
00:35:45.570 --> 00:35:49.110
that things don't happen if
you make things at random.
00:35:49.110 --> 00:35:52.380
And so here, I'm just going
to say, OK, forget about--
00:35:52.380 --> 00:35:54.990
here, it's basically
a little stronger.
00:35:54.990 --> 00:35:58.710
I start my assumption by saying,
the data that's given to me
00:35:58.710 --> 00:36:00.630
will actually satisfy
those assumptions.
00:36:00.630 --> 00:36:02.130
And that means that
I don't actually
00:36:02.130 --> 00:36:05.279
need to make some modeling
assumption on this thing,
00:36:05.279 --> 00:36:06.820
because I'm actually
putting directly
00:36:06.820 --> 00:36:08.028
the assumption I want to see.
00:36:12.650 --> 00:36:14.730
So here, either I
know sigma squared
00:36:14.730 --> 00:36:16.190
or I don't know sigma squared.
00:36:16.190 --> 00:36:16.940
So is that clear?
00:36:16.940 --> 00:36:21.880
So essentially, I'm assuming
that I have this model, where
00:36:21.880 --> 00:36:26.950
this guy, now, is
deterministic, and this
00:36:26.950 --> 00:36:30.490
is some multivariate
Gaussian with mean 0
00:36:30.490 --> 00:36:33.500
and covariance matrix
identity of rn.
00:36:33.500 --> 00:36:36.460
That's the model I'm assuming.
00:36:36.460 --> 00:36:40.810
And I'm observing this, and
I'm given this matrix x.
00:36:40.810 --> 00:36:42.130
Where does this make sense?
00:36:42.130 --> 00:36:44.770
You could say, well, if I think
of my rows as being people
00:36:44.770 --> 00:36:48.084
and I'm collecting genes,
it's a little intense
00:36:48.084 --> 00:36:50.000
to assume that I actually
know, ahead of time,
00:36:50.000 --> 00:36:51.340
what I'm going to be seeing,
and that those things are
00:36:51.340 --> 00:36:52.210
deterministic.
00:36:52.210 --> 00:36:55.630
That's true, but it still
does not prevent the analysis
00:36:55.630 --> 00:36:56.830
to go through, for one.
00:36:56.830 --> 00:37:00.970
And second, a better example
might be this imaging example
00:37:00.970 --> 00:37:04.870
that I described, where those
x's are actually libraries.
00:37:04.870 --> 00:37:07.570
Those are libraries of
patterns that people
00:37:07.570 --> 00:37:09.800
have created, maybe
from deep learning nets,
00:37:09.800 --> 00:37:10.847
or something like this.
00:37:10.847 --> 00:37:12.430
But they've created
patterns, and they
00:37:12.430 --> 00:37:14.830
say that all images should
be representable as a linear
00:37:14.830 --> 00:37:16.511
combination of those patterns.
00:37:16.511 --> 00:37:18.260
And those patterns are
somewhere in books,
00:37:18.260 --> 00:37:19.390
so they're certainly
deterministic.
00:37:19.390 --> 00:37:21.190
Everything that's actually
written down in a book
00:37:21.190 --> 00:37:22.600
is as deterministic as it gets.
00:37:29.027 --> 00:37:30.610
Any questions about
those assumptions?
00:37:30.610 --> 00:37:32.776
Those are the things we're
going to be working with.
00:37:32.776 --> 00:37:33.910
There's only three of them.
00:37:33.910 --> 00:37:35.130
One is about x.
00:37:35.130 --> 00:37:37.530
Actually, there's
really two of them.
00:37:37.530 --> 00:37:41.625
I mean, this guy
already appears here.
00:37:41.625 --> 00:37:44.640
So there's two-- one on
the noise, one on the x's.
00:37:44.640 --> 00:37:45.600
That's it.
00:37:48.480 --> 00:37:51.430
Those things allow
us to do quite a bit.
00:37:51.430 --> 00:37:52.830
They will allow us to--
00:37:55.410 --> 00:38:02.980
well, that's
actually-- they allow
00:38:02.980 --> 00:38:09.100
me to write the distribution
of beta hat, which is great,
00:38:09.100 --> 00:38:12.220
because when I know the
distribution of my estimator,
00:38:12.220 --> 00:38:14.250
I know its fluctuations.
00:38:14.250 --> 00:38:16.450
If it's centered around
the true parameter,
00:38:16.450 --> 00:38:19.060
I know that it's going
to be fluctuating
00:38:19.060 --> 00:38:20.440
around the true parameter.
00:38:20.440 --> 00:38:22.450
And it should tell me
what kind of distribution
00:38:22.450 --> 00:38:23.860
the fluctuations are.
00:38:23.860 --> 00:38:26.260
I actually know how to
build confidence intervals.
00:38:26.260 --> 00:38:27.790
I know how to build tests.
00:38:27.790 --> 00:38:29.170
I know how to build everything.
00:38:29.170 --> 00:38:31.660
It's just like when I told
you that asymptotically,
00:38:31.660 --> 00:38:33.760
the empirical
variance was Gaussian
00:38:33.760 --> 00:38:39.470
with mean theta and standard
deviation that depended on n,
00:38:39.470 --> 00:38:42.040
et cetera, that's basically
the only thing I needed.
00:38:42.040 --> 00:38:44.840
And this is what I'm
actually getting here.
00:38:44.840 --> 00:38:49.820
So let me start
with this statement.
00:38:49.820 --> 00:38:52.087
So remember, beta
hat satisfied this,
00:38:52.087 --> 00:38:53.420
so I'm going to rewrite it here.
00:38:57.940 --> 00:39:02.530
So beta hat was
equal to x transpose
00:39:02.530 --> 00:39:07.440
x inverse x transpose y.
00:39:07.440 --> 00:39:09.710
That was the definition
that we found.
00:39:09.710 --> 00:39:17.450
And now, I also know that y was
equal to x beta plus epsilon.
00:39:17.450 --> 00:39:20.800
So let me just replace y by
x beta plus epsilon here.
00:39:20.800 --> 00:39:21.300
Yeah?
00:39:21.300 --> 00:39:25.185
AUDIENCE: Isn't it x transpose
x inverse x transpose y?
00:39:25.185 --> 00:39:26.685
PHILIPPE RIGOLLET:
Yes, x transpose.
00:39:26.685 --> 00:39:27.184
Thank you.
00:39:31.890 --> 00:39:36.830
So I'm going to replace
y by x beta plus epsilon.
00:39:36.830 --> 00:39:58.560
So that's-- and here
comes the magic.
00:39:58.560 --> 00:40:00.780
I have an inverse of
a matrix, and then
00:40:00.780 --> 00:40:03.810
I have the true matrix, I
have the original matrix.
00:40:03.810 --> 00:40:08.420
So this is actually the
identity times beta.
00:40:08.420 --> 00:40:11.610
And now this guy, well,
this is a Gaussian,
00:40:11.610 --> 00:40:13.800
because this is a
Gaussian random vector,
00:40:13.800 --> 00:40:18.540
and I just multiply it by
a deterministic matrix.
00:40:18.540 --> 00:40:22.690
So we're going to use the rule
that if I have, say, epsilon,
00:40:22.690 --> 00:40:29.780
which is N(0, Sigma), then
b times epsilon is N(0,--
00:40:29.780 --> 00:40:32.280
can somebody tell me what the
covariance matrix of b epsilon
00:40:32.280 --> 00:40:32.780
is?
00:40:35.302 --> 00:40:37.010
AUDIENCE: What is
capital B in this case?
00:40:37.010 --> 00:40:38.593
PHILIPPE RIGOLLET:
It's just a matrix.
00:40:42.410 --> 00:40:46.230
And for any matrix, I mean any
matrix that I can premultiply--
00:40:46.230 --> 00:40:48.360
that I can postmultiply
with epsilon.
00:40:48.360 --> 00:40:49.042
Yeah?
00:40:49.042 --> 00:40:50.819
AUDIENCE: b transpose b.
00:40:50.819 --> 00:40:52.110
PHILIPPE RIGOLLET: b transpose?
00:40:52.110 --> 00:40:53.503
AUDIENCE: Times b.
00:40:53.503 --> 00:40:55.044
PHILIPPE RIGOLLET:
And sigma is gone.
00:40:55.044 --> 00:40:57.177
AUDIENCE: Oh,
times sigma, sorry.
00:40:57.177 --> 00:40:59.010
PHILIPPE RIGOLLET:
That's the matrix, right?
00:40:59.010 --> 00:41:00.427
AUDIENCE: b transpose sigma b.
00:41:00.427 --> 00:41:01.510
PHILIPPE RIGOLLET: Almost.
00:41:04.255 --> 00:41:07.470
Anybody wants to take a
guess at the last one?
00:41:07.470 --> 00:41:12.790
I think we've removed
all other possibilities.
00:41:12.790 --> 00:41:15.510
It's b sigma b transpose.
00:41:20.880 --> 00:41:24.910
So if you ever answered
to the question,
00:41:24.910 --> 00:41:26.590
do you know Gaussian
random vectors,
00:41:26.590 --> 00:41:29.414
but you did not know that,
there's a gap in your knowledge
00:41:29.414 --> 00:41:31.330
that you need to fill,
because that's probably
00:41:31.330 --> 00:41:33.880
the most important property
of Gaussian vectors.
00:41:33.880 --> 00:41:38.410
When you multiply
them by matrices,
00:41:38.410 --> 00:41:43.390
you have a simple rule on how
to update the covariance matrix.
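The rule can be sanity-checked by simulation. In the sketch below, the matrices Sigma and B, the seed, and the sample size are all arbitrary choices made for illustration. It draws many Gaussian vectors with covariance Sigma, multiplies each by B, and compares the empirical covariance with B Sigma B transpose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance of epsilon (2 x 2) and a hypothetical 3 x 2 matrix B.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
B = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])

# Draw epsilon ~ N(0, Sigma) many times; each row is one draw.
eps = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=200_000)

# Apply B to every draw: row i of `transformed` is B @ eps[i].
transformed = eps @ B.T

# Empirical covariance of B epsilon versus the rule B Sigma B^T.
empirical = np.cov(transformed, rowvar=False)
theoretical = B @ Sigma @ B.T
max_gap = np.abs(empirical - theoretical).max()
print(theoretical)
print(max_gap)
```

Because B is not square here, the dimensions also rule out the other guesses: B transpose Sigma B does not even multiply (2 x 3 times 2 x 2), whereas B Sigma B transpose gives the 3 x 3 matrix that a vector in three dimensions needs.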
00:41:43.390 --> 00:41:49.250
So here, sigma is the identity.
00:41:49.250 --> 00:41:53.480
And here, this is the
matrix b that I had here.
00:41:53.480 --> 00:41:58.480
So what this is is, basically,
n, some multivariate n,
00:41:58.480 --> 00:41:59.350
of course.
00:41:59.350 --> 00:42:00.970
Then I'm going to have 0.
00:42:00.970 --> 00:42:04.140
And so what I need to do is
b times the identity times b
00:42:04.140 --> 00:42:07.017
transpose, which is
just b b transpose.
00:42:07.017 --> 00:42:08.350
And what is it going to tell me?
00:42:08.350 --> 00:42:12.850
It's x transpose x--
00:42:12.850 --> 00:42:17.560
sorry, that's inverse--
inverse x transpose, and then
00:42:17.560 --> 00:42:21.760
the transpose of this
guy, which is x x
00:42:21.760 --> 00:42:25.170
transpose x inverse transpose.
00:42:25.170 --> 00:42:27.130
But this matrix is
symmetric, so I'm actually
00:42:27.130 --> 00:42:30.190
not going to make the
transpose of this guy.
00:42:30.190 --> 00:42:34.090
And again, magic shows up.
00:42:34.090 --> 00:42:36.220
Inverse times the
matrix of those two guys
00:42:36.220 --> 00:42:38.950
cancel, and so this is
actually equal to beta
00:42:38.950 --> 00:42:43.990
plus some N(0, x
transpose x inverse).
00:42:46.955 --> 00:42:47.455
Yeah?
00:42:47.455 --> 00:42:49.454
AUDIENCE: I'm a little
lost on the [INAUDIBLE].
00:42:49.454 --> 00:42:52.788
So you define that as the
b matrix, and what happens?
00:42:52.788 --> 00:42:54.954
PHILIPPE RIGOLLET: So I
just apply this rule, right?
00:42:54.954 --> 00:42:55.720
AUDIENCE: Yeah.
00:42:55.720 --> 00:42:57.850
PHILIPPE RIGOLLET: So
if I multiply a matrix
00:42:57.850 --> 00:43:01.840
by a Gaussian, then let's
say this Gaussian had
00:43:01.840 --> 00:43:05.680
mean 0, which is the
case of epsilon here,
00:43:05.680 --> 00:43:07.960
then the covariance
matrix that I get
00:43:07.960 --> 00:43:10.330
is b times the original
covariance matrix times b
00:43:10.330 --> 00:43:11.470
transpose.
00:43:11.470 --> 00:43:15.290
So all I did is write this
matrix times the identity
00:43:15.290 --> 00:43:18.195
times this matrix transpose.
00:43:18.195 --> 00:43:20.320
And the identity, of course,
doesn't play any role,
00:43:20.320 --> 00:43:21.240
so I can remove it.
00:43:21.240 --> 00:43:23.860
It's just this matrix,
then the matrix transpose.
00:43:23.860 --> 00:43:25.370
And what happened?
00:43:25.370 --> 00:43:27.280
So what is the transpose
of this matrix?
00:43:27.280 --> 00:43:32.710
So I used the fact that if I
look at x transpose x inverse x
00:43:32.710 --> 00:43:39.160
transpose, and now I look at the
whole transpose of this thing,
00:43:39.160 --> 00:43:40.510
that's actually equal to.
00:43:40.510 --> 00:43:43.510
And I use the rule that ab
transpose is b transpose
00:43:43.510 --> 00:43:46.030
a transpose-- let me finish--
00:43:46.030 --> 00:43:51.925
and it's x x
transpose x inverse.
00:43:55.151 --> 00:43:55.650
Yes?
00:43:55.650 --> 00:43:58.020
AUDIENCE: I thought the--
00:43:58.020 --> 00:44:00.335
for epsilon, it
was sigma squared.
00:44:00.335 --> 00:44:01.710
PHILIPPE RIGOLLET:
Oh, thank you.
00:44:01.710 --> 00:44:03.610
There's a sigma
squared somewhere.
00:44:03.610 --> 00:44:08.610
So this was sigma squared times
the identity, so I can just
00:44:08.610 --> 00:44:10.566
pick up a sigma
squared anywhere.
00:44:14.740 --> 00:44:28.560
So here, in our case, so
for epsilon, this is sigma.
00:44:28.560 --> 00:44:30.000
Sigma squared
times the identity,
00:44:30.000 --> 00:44:31.166
that's my covariance matrix.
00:44:33.920 --> 00:44:35.242
You seem perplexed.
00:44:35.242 --> 00:44:37.170
AUDIENCE: It's just
a new idea for me
00:44:37.170 --> 00:44:41.754
to think of a maximum likelihood
estimator as a random variable.
00:44:41.754 --> 00:44:43.420
PHILIPPE RIGOLLET:
Oh, it should not be.
00:44:43.420 --> 00:44:45.722
Any estimator is
a random variable.
00:44:45.722 --> 00:44:48.132
AUDIENCE: Oh, yeah,
that's a good point.
00:44:48.132 --> 00:44:52.236
PHILIPPE RIGOLLET:
[LAUGHS] And I have not
00:44:52.236 --> 00:44:54.110
told you that this was
the maximum likelihood
00:44:54.110 --> 00:44:55.720
estimator just yet.
00:44:55.720 --> 00:44:58.910
The estimator is
a random variable.
00:44:58.910 --> 00:45:00.890
There's a word-- some
people use estimate just
00:45:00.890 --> 00:45:03.519
to differentiate the
estimator while you're
00:45:03.519 --> 00:45:05.810
doing the analysis with random
variables and the values
00:45:05.810 --> 00:45:09.477
when you plug in the
numbers in there.
00:45:09.477 --> 00:45:12.060
But then, of course, people use
estimate because it's shorter,
00:45:12.060 --> 00:45:14.660
so then it's confusing.
00:45:14.660 --> 00:45:17.990
So any questions about
this computation?
00:45:17.990 --> 00:45:20.810
Did I forget any other
Greek letter along the way?
00:45:20.810 --> 00:45:22.620
All right, I think we're good.
00:45:22.620 --> 00:45:26.225
So one thing that it
says-- and actually,
00:45:26.225 --> 00:45:27.600
thank you for
pointing this out--
00:45:27.600 --> 00:45:30.540
I said there's actually a
little hidden statement there.
00:45:30.540 --> 00:45:33.130
By the way, this
answers this question.
00:45:33.130 --> 00:45:35.990
Beta hat is of the form beta
plus something that's centered,
00:45:35.990 --> 00:45:39.484
so it's indeed of the form
Gaussian with mean beta
00:45:39.484 --> 00:45:41.900
and covariance matrix sigma
squared x transpose x inverse.
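A small Monte Carlo sketch of this statement. The design matrix, true beta, sigma, seed, and number of repetitions are all assumed values for illustration. It keeps X fixed (deterministic), redraws the Gaussian noise many times, recomputes beta hat each time, and compares the empirical mean and covariance of beta hat with beta and sigma squared times (X^T X)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fixed design (n = 4, p = 2), true parameter, and noise level.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
beta = np.array([1.0, -2.0])
sigma = 0.5
n_rep = 100_000

# The deterministic matrix B = (X^T X)^{-1} X^T, so that beta_hat = B y.
XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T

# Each row of Y is one simulated y = X beta + epsilon, epsilon ~ N(0, sigma^2 I_n).
eps = sigma * rng.standard_normal((n_rep, X.shape[0]))
Y = X @ beta + eps

# Each row of beta_hats is the least squares estimator for one simulated y.
beta_hats = Y @ B.T

empirical_mean = beta_hats.mean(axis=0)
empirical_cov = np.cov(beta_hats, rowvar=False)
theoretical_cov = sigma**2 * XtX_inv

print(empirical_mean)
print(empirical_cov)
print(theoretical_cov)
```

The empirical mean should sit close to the assumed beta and the empirical covariance close to sigma squared times (X^T X)^{-1}, with the gap shrinking as the number of repetitions grows.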
00:45:45.520 --> 00:45:47.640
So that's very nice.
00:45:47.640 --> 00:45:50.830
As long as x transpose
x is not huge,
00:45:50.830 --> 00:45:55.900
I'm going to have something
that is close to what I want.
00:45:55.900 --> 00:45:58.550
Oh, sorry, x transpose
x inverse is not huge.
00:46:01.800 --> 00:46:05.670
So there's a hidden
claim in there,
00:46:05.670 --> 00:46:08.640
which is that least
squares estimator
00:46:08.640 --> 00:46:11.588
is equal to the maximum
likelihood estimator.
00:46:15.500 --> 00:46:17.920
Why does the maximum
likelihood estimator just
00:46:17.920 --> 00:46:19.770
enter the picture now?
00:46:19.770 --> 00:46:23.280
We've been talking about
regression for the past 18
00:46:23.280 --> 00:46:24.450
slides.
00:46:24.450 --> 00:46:26.130
And we've been talking
about estimators.
00:46:26.130 --> 00:46:29.070
And I just dumped on you
the least squares estimator,
00:46:29.070 --> 00:46:31.830
but I never really came back
to this thing that we know--
00:46:31.830 --> 00:46:35.100
maybe the method of moments,
or maybe the maximum likelihood
00:46:35.100 --> 00:46:35.930
estimator.
00:46:35.930 --> 00:46:37.930
It turns out that those
two things are the same.
00:46:37.930 --> 00:46:41.880
But if I want to talk about a
maximum likelihood estimator,
00:46:41.880 --> 00:46:43.140
I need to have a likelihood.
00:46:43.140 --> 00:46:46.160
In particular, I need
to have a density.
00:46:46.160 --> 00:46:47.600
And so if I want
a density, I have
00:46:47.600 --> 00:46:53.210
to make those assumptions,
such as the epsilons have
00:46:53.210 --> 00:46:55.970
this Gaussian distribution.
00:46:55.970 --> 00:46:58.580
So why is this the maximum
likelihood estimator?
00:46:58.580 --> 00:47:04.740
Well, remember, y is x
beta plus epsilon.
00:47:04.740 --> 00:47:07.530
So I actually have
a bunch of data.
00:47:07.530 --> 00:47:14.390
So what is my model here?
00:47:14.390 --> 00:47:18.040
Well, it's the
family of Gaussians
00:47:18.040 --> 00:47:22.460
on n observations with
mean x beta, variance sigma
00:47:22.460 --> 00:47:31.380
squared identity,
and beta lives in rp.
00:47:31.380 --> 00:47:34.800
Here's my family
of distributions.
00:47:34.800 --> 00:47:38.160
That's the possible
distributions for y.
00:47:38.160 --> 00:47:41.500
And so in particular, I
can write the density of y.
00:47:47.980 --> 00:47:48.760
Well, what is it?
00:47:48.760 --> 00:47:52.010
It's something that
looks like p of x--
00:47:52.010 --> 00:47:58.359
well, p of y, let's say,
is equal to 1 over--
00:47:58.359 --> 00:48:00.400
so now it's going to be a
little more complicated,
00:48:00.400 --> 00:48:17.740
but it's sigma squared times 2
pi to the p/2 exponential minus
00:48:17.740 --> 00:48:26.840
norm of y minus x beta squared
divided by 2 sigma squared.
00:48:26.840 --> 00:48:29.780
So that's just the
multivariate Gaussian density.
00:48:29.780 --> 00:48:30.890
I just wrote it.
00:48:30.890 --> 00:48:33.530
That's the density of
a multivariate Gaussian
00:48:33.530 --> 00:48:36.740
with mean x beta and
covariance matrix sigma squared
00:48:36.740 --> 00:48:37.700
times the identity.
00:48:37.700 --> 00:48:40.410
That's what it is.
00:48:40.410 --> 00:48:42.300
So you don't have to
learn this by heart,
00:48:42.300 --> 00:48:47.100
but if you are familiar with
the case where p is equal to 1,
00:48:47.100 --> 00:48:49.820
you can check that you recover
what you're familiar with,
00:48:49.820 --> 00:48:54.811
and this makes sense
as an extension.
00:48:59.730 --> 00:49:08.560
So now, I can actually
write my log likelihood.
00:49:08.560 --> 00:49:14.880
How many observations do
I have of this vector y?
00:49:23.710 --> 00:49:25.144
Do I have n observations of y?
00:49:30.580 --> 00:49:33.110
I have just one, right?
00:49:33.110 --> 00:49:36.830
Oh, sorry, I shouldn't
have said p, this is n.
00:49:36.830 --> 00:49:38.510
Everything is in dimension n.
00:49:38.510 --> 00:49:42.700
So I can think of either having
n independent observations
00:49:42.700 --> 00:49:44.180
of each coordinate,
or I can think
00:49:44.180 --> 00:49:47.210
of having just one
observation of the vector y.
00:49:47.210 --> 00:49:50.050
So when I write
my log likelihood,
00:49:50.050 --> 00:49:54.850
it's just the log
of the density at y.
00:50:09.090 --> 00:50:13.710
And that's the
vector y, which I can
00:50:13.710 --> 00:50:18.990
write as minus n/2
log sigma squared
00:50:18.990 --> 00:50:28.690
2 pi minus 1 over 2 sigma
squared, norm of y minus x beta squared.
00:50:28.690 --> 00:50:30.310
And that's, again,
my boldface y.
00:50:36.710 --> 00:50:39.222
And what is my maximum
likelihood estimator?
00:50:44.470 --> 00:50:47.940
Well, this guy does
not depend on beta.
00:50:47.940 --> 00:50:50.850
And this is just a constant
factor in front of this guy.
00:50:50.850 --> 00:50:54.270
So it's the same thing
as just minimizing,
00:50:54.270 --> 00:50:57.230
because I have a minus
sign, over all beta in R^p,
00:51:03.140 --> 00:51:05.580
y minus x beta squared,
and that's my least squares
00:51:05.580 --> 00:51:06.570
estimator.
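A minimal numerical sketch of this equivalence, assuming NumPy and a made-up simulated data set (the sizes, the true beta, and the helper name `log_likelihood` are all illustrative, not from the lecture). It computes the least squares estimator and checks that random perturbations of it never increase the Gaussian log likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 0.5          # illustrative sizes, not from the lecture

# Design matrix with a leading column of ones, as in the lecture's setup.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + sigma * rng.normal(size=n)

def log_likelihood(beta, sigma2):
    """Gaussian log likelihood of the single n-dimensional observation y."""
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

# Least squares estimator: minimizes ||y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Perturbing beta_hat can only lower the log likelihood.
perturbed = [beta_hat + 0.1 * rng.normal(size=p) for _ in range(5)]
best = log_likelihood(beta_hat, sigma**2)
others = [log_likelihood(b, sigma**2) for b in perturbed]
```

The value of sigma squared plugged in does not matter for the comparison, since it only rescales and shifts the term that depends on beta.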
00:51:15.312 --> 00:51:17.270
Is there anything that's
unclear on this board?
00:51:17.270 --> 00:51:17.910
Any question?
00:51:20.550 --> 00:51:23.230
So all I used was-- so I
wrote my log likelihood, which
00:51:23.230 --> 00:51:25.860
is just the log
of this expression
00:51:25.860 --> 00:51:28.750
where y is my observation.
00:51:28.750 --> 00:51:32.430
And that's indeed the
observation that I have here.
00:51:32.430 --> 00:51:35.980
And that was just some constant
minus some constant times
00:51:35.980 --> 00:51:37.960
this quantity that
depends on beta.
00:51:37.960 --> 00:51:40.270
So maximizing this whole
thing is the same thing
00:51:40.270 --> 00:51:42.810
as minimizing only this thing.
00:51:42.810 --> 00:51:44.620
The minimizers are the same.
00:51:44.620 --> 00:51:47.320
And so that tells me
that I actually just
00:51:47.320 --> 00:51:49.000
have to minimize
the squared norm
00:51:49.000 --> 00:51:51.710
to get my maximum
likelihood estimator.
00:51:51.710 --> 00:51:55.060
But this used, heavily, the
fact that I could actually
00:51:55.060 --> 00:52:03.450
write exactly what
my density was,
00:52:03.450 --> 00:52:06.240
and that when I took
the log of this thing,
00:52:06.240 --> 00:52:09.660
I had exactly the square
norm that showed up.
00:52:09.660 --> 00:52:12.630
If I had a different
density, if, for example,
00:52:12.630 --> 00:52:17.040
I assumed that my coordinates
of epsilons were, say, iid
00:52:17.040 --> 00:52:18.720
double exponential
random variables.
00:52:18.720 --> 00:52:21.240
So it's just half of an
exponential on the positives,
00:52:21.240 --> 00:52:24.280
plus half of an
exponential on the negatives.
00:52:24.280 --> 00:52:27.342
So if I said that,
then this would not
00:52:27.342 --> 00:52:28.800
have the square
norm that shows up.
00:52:28.800 --> 00:52:31.057
This is really
idiosyncratic to Gaussians.
00:52:31.057 --> 00:52:32.640
If I had something
else, I would have,
00:52:32.640 --> 00:52:35.190
maybe, a different norm
here, or something different
00:52:35.190 --> 00:52:39.420
that measures the difference
between y and x beta.
00:52:39.420 --> 00:52:41.820
And that's how you come up
with other maximum likelihood
00:52:41.820 --> 00:52:44.010
estimators that lead
to other estimators that
00:52:44.010 --> 00:52:45.420
are not the least squares--
00:52:45.420 --> 00:52:47.040
maybe the least
absolute deviation,
00:52:47.040 --> 00:52:50.310
for example, or this
fourth moment,
00:52:50.310 --> 00:52:52.890
for example, that you
suggested last time.
00:52:52.890 --> 00:52:55.650
So I can come up with a
bunch of different things,
00:52:55.650 --> 00:52:56.910
and they might be tied--
00:52:56.910 --> 00:52:59.716
maybe I can come up with them
from the same perspective
00:52:59.716 --> 00:53:01.590
that I came from the
least squares estimator.
00:53:01.590 --> 00:53:03.210
I said, let's just
do something smart
00:53:03.210 --> 00:53:06.350
and check, then, that it's
indeed the maximum likelihood
00:53:06.350 --> 00:53:08.040
estimator.
00:53:08.040 --> 00:53:11.250
Or I could just start
with the modeling on--
00:53:11.250 --> 00:53:13.260
and check, then, what happens--
00:53:13.260 --> 00:53:15.840
what was the implicit assumption
that I put on my noise.
00:53:15.840 --> 00:53:18.164
Or I could start with the
assumption of the noise,
00:53:18.164 --> 00:53:19.830
compute the maximum
likelihood estimator
00:53:19.830 --> 00:53:21.000
and see what it turns into.
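To make the "different noise, different estimator" point concrete, here is a small sketch in the simplest possible model, y_i = beta + noise with no covariates (an assumption made here for illustration). With Gaussian noise the maximum likelihood estimator minimizes the sum of squares, which gives the sample mean; with double exponential noise it minimizes the sum of absolute deviations, which gives the sample median. The data and the grid are made up.

```python
import random
import statistics

random.seed(0)
# Hypothetical sample; the model here is y_i = beta + noise (intercept only).
y = [random.gauss(2.0, 1.0) for _ in range(101)]

def squared_loss(b):
    """Sum of squared deviations: the Gaussian criterion."""
    return sum((yi - b) ** 2 for yi in y)

def absolute_loss(b):
    """Sum of absolute deviations: the double exponential criterion."""
    return sum(abs(yi - b) for yi in y)

mean_y = statistics.mean(y)       # minimizer of the squared loss
median_y = statistics.median(y)   # minimizer of the absolute loss

# Coarse grid used only to sanity-check the two minimizers.
grid = [i / 100 for i in range(0, 401)]
```

The same data give two different estimators, depending only on which noise distribution was assumed.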
00:53:24.660 --> 00:53:26.760
So that was the first thing.
00:53:26.760 --> 00:53:29.080
I've just proved to
you the first line.
00:53:29.080 --> 00:53:31.950
And from there, you
can get what you want.
00:53:31.950 --> 00:53:34.690
So all the other lines
are going to follow.
00:53:34.690 --> 00:53:39.570
So what is beta hat-- so
for example, let's look
00:53:39.570 --> 00:53:41.660
at the second line,
the quadratic risk.
00:53:46.180 --> 00:53:49.270
Beta hat minus beta,
from this formula,
00:53:49.270 --> 00:53:53.780
has a distribution,
which is N_n of 0,
00:53:53.780 --> 00:53:58.369
and then I have x
transpose x inverse.
00:53:58.369 --> 00:54:03.299
AUDIENCE: Wouldn't the
dimension be p on the board?
00:54:07.250 --> 00:54:10.308
PHILIPPE RIGOLLET: Sorry,
the dimension of what?
00:54:10.308 --> 00:54:11.769
AUDIENCE: Oh beta
hat minus beta.
00:54:11.769 --> 00:54:13.287
Isn't beta only a p dimensional?
00:54:13.287 --> 00:54:15.620
PHILIPPE RIGOLLET: Oh, yeah,
you're right, you're right.
00:54:15.620 --> 00:54:17.450
That was all p
dimensional there.
00:54:22.170 --> 00:54:23.700
Yeah.
00:54:23.700 --> 00:54:28.220
So if b here, the matrix
that I'm actually applying,
00:54:28.220 --> 00:54:30.810
has dimension p times n--
00:54:30.810 --> 00:54:34.710
so even if epsilon was an n
dimensional Gaussian vector,
00:54:34.710 --> 00:54:39.310
then b times epsilon is a p
dimensional Gaussian vector
00:54:39.310 --> 00:54:39.980
now.
00:54:39.980 --> 00:54:42.720
So that's how I
switch from p to n--
00:54:42.720 --> 00:54:43.770
from n to p.
00:54:43.770 --> 00:54:45.120
Thank you.
00:54:45.120 --> 00:54:50.430
So you're right, this is beta
hat minus beta is this guy.
00:54:50.430 --> 00:54:54.090
And so in particular, if
I look at the expectation
00:54:54.090 --> 00:55:01.160
of the norm of beta hat minus
beta squared, what is it?
00:55:01.160 --> 00:55:08.140
It's the expectation of the
norm of some Gaussian vector.
00:55:12.100 --> 00:55:15.530
And so it turns out--
so maybe we don't have--
00:55:15.530 --> 00:55:18.960
well, that's just also a
property of a Gaussian vector.
00:55:18.960 --> 00:55:26.840
So if epsilon is N(0, Sigma),
then the expectation
00:55:26.840 --> 00:55:34.576
of the norm of epsilon squared
is just the trace of sigma.
00:55:37.910 --> 00:55:41.030
Actually, we can
probably check this
00:55:41.030 --> 00:55:44.540
by saying that this is
the sum from j equal 1
00:55:44.540 --> 00:55:51.128
to p of the expectation of beta
hat j minus beta j squared.
00:55:54.310 --> 00:55:57.879
Since beta j squared is
the expectation-- beta j
00:55:57.879 --> 00:55:59.170
is the expectation of beta hat.
00:55:59.170 --> 00:56:01.990
This is actually equal
to the sum from j equal 1
00:56:01.990 --> 00:56:08.110
to p of the variance
of beta hat j,
00:56:08.110 --> 00:56:11.950
just because this is the
expectation of beta hat.
00:56:11.950 --> 00:56:15.590
And how do I read the variances
in a covariance matrix?
00:56:15.590 --> 00:56:17.830
They're just the
diagonal elements.
00:56:17.830 --> 00:56:25.390
So that's really just sigma jj.
00:56:25.390 --> 00:56:27.700
And so that's really equal to--
00:56:27.700 --> 00:56:29.470
so that's the sum of
the diagonal elements
00:56:29.470 --> 00:56:30.790
of this matrix.
00:56:30.790 --> 00:56:33.960
Let's call it sigma.
00:56:33.960 --> 00:56:40.020
So that's equal to the trace
of x transpose x inverse.
00:56:42.740 --> 00:56:45.364
The trace is the sum of the
diagonal elements of a matrix.
00:56:48.080 --> 00:56:49.700
And I still had something else.
00:56:49.700 --> 00:56:52.070
I'm sorry, this
was sigma squared.
00:56:52.070 --> 00:56:54.200
I forget it all the time.
00:56:54.200 --> 00:56:56.800
So the sigma squared comes out.
00:56:56.800 --> 00:56:58.760
It's there.
00:56:58.760 --> 00:57:01.275
And so the sigma
squared comes out
00:57:01.275 --> 00:57:02.900
because the trace is
a linear operator.
00:57:02.900 --> 00:57:06.275
If I multiply all the entries
of my matrix by the same number,
00:57:06.275 --> 00:57:08.150
then all the diagonal
elements are multiplied
00:57:08.150 --> 00:57:09.775
by the same number,
so when I sum them,
00:57:09.775 --> 00:57:13.930
the sum is multiplied
by the same number.
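The quadratic risk formula, sigma squared times the trace of x transpose x inverse, can be checked by simulation. The sketch below assumes NumPy, a fixed made-up design matrix, and Gaussian noise; the variable names and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 40, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.5, 1.0, -1.0])

XtX_inv = np.linalg.inv(X.T @ X)
theoretical_risk = sigma**2 * np.trace(XtX_inv)

# Monte Carlo: redraw the noise many times, keep the design X fixed.
reps = 20000
eps = sigma * rng.normal(size=(reps, n))
Y = X @ beta + eps                      # each row is one simulated y
beta_hats = Y @ X @ XtX_inv             # row r is ((X^T X)^{-1} X^T y_r)^T
empirical_risk = np.mean(np.sum((beta_hats - beta) ** 2, axis=1))
```

The two numbers `theoretical_risk` and `empirical_risk` should agree up to Monte Carlo error.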
00:57:13.930 --> 00:57:18.120
So that's for the
quadratic risk of beta hat.
00:57:18.120 --> 00:57:21.580
And now I need to tell
you about x beta hat.
00:57:21.580 --> 00:57:27.250
x beta hat was something
that was actually telling me
00:57:27.250 --> 00:57:30.370
that that was the point that
I reported on the red line
00:57:30.370 --> 00:57:31.480
that I estimated.
00:57:31.480 --> 00:57:32.800
That was my x beta hat.
00:57:32.800 --> 00:57:40.310
That was my y minus the noise.
00:57:40.310 --> 00:57:42.470
Now, this thing here--
00:57:42.470 --> 00:57:47.100
so remember, we had this line,
and I had my observation.
00:57:47.100 --> 00:57:51.370
And here, I'm really trying to
measure this distance squared.
00:57:51.370 --> 00:57:53.470
This distance is actually
quite important for me
00:57:53.470 --> 00:57:58.920
because it actually shows up
in the Pythagoras theorem.
00:57:58.920 --> 00:58:02.260
And so you could actually
try to estimate this thing.
00:58:02.260 --> 00:58:03.790
So what is the prediction error?
00:58:12.900 --> 00:58:18.840
So we said we have y minus
x beta hat, so that's
00:58:18.840 --> 00:58:21.930
the norm of this thing
we're trying to compute.
00:58:21.930 --> 00:58:25.350
But let's write this for
what it is for one second.
00:58:25.350 --> 00:58:27.810
So we said that beta
hat was x transpose
00:58:27.810 --> 00:58:31.710
x inverse x transpose
y, and we know that y is
00:58:31.710 --> 00:58:35.950
x beta plus epsilon.
00:58:35.950 --> 00:58:37.410
So let's write this--
00:58:40.620 --> 00:58:43.800
x beta plus epsilon plus x.
00:58:57.000 --> 00:59:00.320
And actually, maybe I
should not write it.
00:59:00.320 --> 00:59:02.722
Let me keep the y
for what it is now.
00:59:07.140 --> 00:59:08.960
So that means that
I have, essentially,
00:59:08.960 --> 00:59:13.050
the identity of R^n times y
minus this matrix times y.
00:59:13.050 --> 00:59:15.510
So I can factor
y out, and that's
00:59:15.510 --> 00:59:20.280
the identity of R^n
minus x x transpose
00:59:20.280 --> 00:59:27.280
x inverse x transpose,
the whole thing times y.
00:59:32.760 --> 00:59:37.980
We call this matrix p because
this was the projection matrix
00:59:37.980 --> 00:59:41.540
onto the linear span of the x's.
00:59:41.540 --> 00:59:46.120
So that means that if I take a
point x and I apply p times x,
00:59:46.120 --> 00:59:50.910
I'm projecting onto the linear
span of the columns of x.
00:59:50.910 --> 00:59:57.400
What happens if I do
i minus p times x?
00:59:57.400 --> 00:59:59.000
It's x minus px.
01:00:01.540 --> 01:00:04.690
So if I look at the
point on which--
01:00:04.690 --> 01:00:07.000
so this is the point
on which I project.
01:00:07.000 --> 01:00:08.660
This is x.
01:00:08.660 --> 01:00:13.260
I project orthogonally
to get p times x.
01:00:13.260 --> 01:00:15.920
And so what it means
is that this operator i
01:00:15.920 --> 01:00:21.810
minus px is actually giving me
this guy, this vector here--
01:00:21.810 --> 01:00:23.360
x minus p times x.
01:00:30.790 --> 01:00:33.920
Let's say this is 0.
01:00:33.920 --> 01:00:36.460
This means that this
vector, I can put it here.
01:00:36.460 --> 01:00:38.370
It's this vector here.
01:00:38.370 --> 01:00:40.510
And that's actually the
orthogonal projection
01:00:40.510 --> 01:00:43.870
of x onto the orthogonal
complement of the span
01:00:43.870 --> 01:00:45.532
of the columns of x.
01:00:45.532 --> 01:00:51.000
So if I project x, or if I
look at x minus its projection,
01:00:51.000 --> 01:00:55.730
I'm basically projecting
onto two orthogonal spaces.
01:00:55.730 --> 01:00:59.520
What I'm trying to say
here is that this here
01:00:59.520 --> 01:01:01.301
is another projection
matrix p prime.
01:01:04.460 --> 01:01:10.310
That is just the projection
matrix onto the orthogonal--
01:01:10.310 --> 01:01:29.560
projection onto orthogonal
of column span of x.
01:01:29.560 --> 01:01:31.180
Orthogonal means
the set of vectors
01:01:31.180 --> 01:01:34.329
that's orthogonal to everyone
in this linear space.
01:01:37.050 --> 01:01:40.080
So now, when I'm doing
this, this is exactly what--
01:01:40.080 --> 01:01:42.600
I mean, in a way, this is
illustrating this Pythagoras
01:01:42.600 --> 01:01:43.610
theorem.
01:01:43.610 --> 01:01:47.190
And so when I want to compute
the norm of this guy, the norm
01:01:47.190 --> 01:01:49.560
squared of this guy,
I'm really computing--
01:01:49.560 --> 01:01:52.810
if this is my y now,
this is px of y,
01:01:52.810 --> 01:01:55.738
I'm really computing the
norm squared of this thing.
01:02:06.720 --> 01:02:08.850
So if I want to compute
the norm squared--
01:02:42.540 --> 01:02:48.020
so I'm almost there.
01:02:48.020 --> 01:02:52.840
So what am I projecting here
onto the orthogonal projector?
01:02:52.840 --> 01:02:55.340
So here, y, now,
I know that y is
01:02:55.340 --> 01:03:00.480
equal to x beta plus epsilon.
01:03:00.480 --> 01:03:06.480
So when I look at this
matrix p prime times y,
01:03:06.480 --> 01:03:11.105
it's actually p prime times
x beta plus p prime times
01:03:11.105 --> 01:03:11.604
epsilon.
01:03:14.380 --> 01:03:18.400
What's happening to
p prime times x beta?
01:03:18.400 --> 01:03:19.525
Let's look at this picture.
01:03:23.400 --> 01:03:26.610
So we know that p prime takes
any point here and projects it
01:03:26.610 --> 01:03:29.350
orthogonally on this guy.
01:03:29.350 --> 01:03:33.960
But x beta is actually
a point that lives here.
01:03:33.960 --> 01:03:36.790
It's something that's
on the linear span.
01:03:36.790 --> 01:03:39.660
So where do all the points
that are on this line
01:03:39.660 --> 01:03:43.035
get projected to?
01:03:43.035 --> 01:03:43.970
AUDIENCE: The origin.
01:03:43.970 --> 01:03:45.920
PHILIPPE RIGOLLET:
The origin, to 0.
01:03:45.920 --> 01:03:47.750
They all get projected to 0.
01:03:47.750 --> 01:03:50.120
And that's because I'm
basically projecting
01:03:50.120 --> 01:03:54.872
something that's on the column
span of x onto its orthogonal.
01:03:54.872 --> 01:03:56.580
So that's always 0
that I'm getting here.
01:04:02.410 --> 01:04:04.410
So when I apply
p prime to y, I'm
01:04:04.410 --> 01:04:08.610
really just applying
p prime to epsilon.
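A quick numerical check of these projection facts, as a sketch assuming NumPy and a made-up full-rank design matrix: the matrix P' = I - X (X^T X)^{-1} X^T sends anything in the column span of X to 0, so P' y and P' epsilon coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)

P = X @ np.linalg.inv(X.T @ X) @ X.T    # projection onto the column span of X
P_prime = np.eye(n) - P                 # projection onto its orthogonal complement

eps = rng.normal(size=n)
y = X @ beta + eps

on_span = P_prime @ (X @ beta)          # numerically the zero vector
```

Projecting twice changes nothing, and P' is symmetric, which is what makes it an orthogonal projection.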
01:04:08.610 --> 01:04:10.590
So I know that now,
this, actually,
01:04:10.590 --> 01:04:18.480
is equal to the norm of
some multivariate Gaussian.
01:04:18.480 --> 01:04:20.092
What is the size
of this Gaussian?
01:04:22.980 --> 01:04:24.570
What is the size of this matrix?
01:04:24.570 --> 01:04:25.820
Well, I actually had it there.
01:04:25.820 --> 01:04:28.440
It's i n, so it's n dimensional.
01:04:28.440 --> 01:04:31.236
So it's some n
dimensional with mean 0.
01:04:31.236 --> 01:04:32.610
And what is the
covariance matrix
01:04:32.610 --> 01:04:34.179
of p prime times epsilon?
01:04:39.109 --> 01:04:40.588
AUDIENCE: p p transpose.
01:04:40.588 --> 01:04:43.880
PHILIPPE RIGOLLET: Yeah,
p prime p prime transpose,
01:04:43.880 --> 01:04:48.500
which we just said p
prime transpose is p prime,
01:04:48.500 --> 01:04:49.610
so that's p prime squared.
01:04:49.610 --> 01:04:51.740
And we see that when
we project twice,
01:04:51.740 --> 01:04:54.540
it's as if we
projected only once.
01:04:54.540 --> 01:05:00.090
So here, this is N(0, p
prime p prime transpose).
01:05:00.090 --> 01:05:05.150
That's the formula for
the covariance matrix.
01:05:05.150 --> 01:05:09.990
But this guy is actually equal
to p prime times p prime,
01:05:09.990 --> 01:05:13.580
which is equal to p prime.
01:05:13.580 --> 01:05:18.380
So now, what I'm looking for is
the expected norm squared, which is the trace.
01:05:18.380 --> 01:05:20.050
So that means that
this whole thing here
01:05:20.050 --> 01:05:22.270
is actually equal to the trace.
01:05:22.270 --> 01:05:24.730
Oh, did I forget
again a sigma squared?
01:05:24.730 --> 01:05:28.160
Yeah, I forgot it only
here, which is good news.
01:05:28.160 --> 01:05:32.665
So I should assume that
sigma squared is equal to 1.
01:05:32.665 --> 01:05:34.270
So sigma squared's here.
01:05:34.270 --> 01:05:36.430
And then what I'm left
with is sigma squared
01:05:36.430 --> 01:05:39.920
times the trace of p prime.
01:05:45.780 --> 01:05:51.240
At some point, I mentioned that
the eigenvalues of a projection
01:05:51.240 --> 01:05:54.210
matrix were actually 0 or 1.
01:05:54.210 --> 01:05:56.689
The trace is the sum
of the eigenvalues.
01:05:56.689 --> 01:05:58.230
So that means that
the trace is going
01:05:58.230 --> 01:06:03.720
to be an integer, namely the
number of non-0 eigenvalues.
01:06:03.720 --> 01:06:05.170
And the number of non-0
eigenvalues is just
01:06:05.170 --> 01:06:07.776
the dimension of the space
onto which I'm projecting.
01:06:10.490 --> 01:06:15.200
Now, I'm projecting from
something of dimension n
01:06:15.200 --> 01:06:19.520
onto the orthogonal of
a space of dimension p.
01:06:19.520 --> 01:06:21.860
What is the dimension
of the orthogonal
01:06:21.860 --> 01:06:23.720
of a space of dimension
p when thought
01:06:23.720 --> 01:06:26.546
of as a subspace of a space of dimension n?
01:06:26.546 --> 01:06:27.296
AUDIENCE: [? 1. ?]
01:06:27.296 --> 01:06:28.765
PHILIPPE RIGOLLET: N minus p--
01:06:28.765 --> 01:06:32.980
that's the so-called rank
theorem, I guess, as a name.
01:06:32.980 --> 01:06:35.710
And so that's how I get
this n minus p here.
01:06:35.710 --> 01:06:40.071
This is really just
equal to n minus p.
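The claim that the trace of p prime equals n minus p, because its eigenvalues are 0's and 1's, is easy to confirm numerically. This is a sketch assuming NumPy and a random design matrix, which has full column rank with probability 1; sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 4
X = rng.normal(size=(n, p))             # full column rank with probability 1
P_prime = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

eigenvalues = np.linalg.eigvalsh(P_prime)   # symmetric matrix, sorted ascending
trace = np.trace(P_prime)
```

Here `eigenvalues` should contain p zeros followed by n - p ones, up to rounding, and `trace` should be n - p.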
01:06:40.071 --> 01:06:40.570
Yeah?
01:06:40.570 --> 01:06:43.319
AUDIENCE: Here, we're taking the
expectation of the whole thing.
01:06:43.319 --> 01:06:44.860
PHILIPPE RIGOLLET:
Yes, you're right.
01:06:44.860 --> 01:06:48.780
So that's actually
the expectation
01:06:48.780 --> 01:06:50.410
of this thing that's
equal to that.
01:06:50.410 --> 01:06:53.020
Absolutely.
01:06:53.020 --> 01:06:55.150
But I actually have much better.
01:06:55.150 --> 01:06:57.412
I know, even, that the
norm that I'm looking at,
01:06:57.412 --> 01:06:58.870
I know it's going
to be this thing.
01:06:58.870 --> 01:07:00.911
What is going to be the
distribution of this guy?
01:07:03.860 --> 01:07:06.830
Norm squared of a
Gaussian, chi squared.
01:07:06.830 --> 01:07:09.150
So there's going to be some
chi squared that shows up.
01:07:09.150 --> 01:07:10.650
And the number of
degrees of freedom
01:07:10.650 --> 01:07:12.940
is actually going to
be also n minus p.
01:07:12.940 --> 01:07:16.510
And maybe it's
actually somewhere--
01:07:16.510 --> 01:07:20.560
yeah, right here-- n
minus p times sigma hat
01:07:20.560 --> 01:07:22.690
squared over sigma squared.
01:07:22.690 --> 01:07:24.675
This is my sigma hat squared.
01:07:24.675 --> 01:07:28.200
If I multiply n minus p, I'm
left only with this thing,
01:07:28.200 --> 01:07:31.136
and so that means that I
get sigma squared times--
01:07:31.136 --> 01:07:33.010
because I always
forget my sigma squared--
01:07:33.010 --> 01:07:34.870
I get sigma squared
times this thing.
01:07:34.870 --> 01:07:37.270
And it turns out that the
square norm of this guy
01:07:37.270 --> 01:07:39.412
is actually exactly chi
squared with n minus p
01:07:39.412 --> 01:07:40.226
degrees of freedom.
01:07:43.370 --> 01:07:47.900
So in particular, so we
know that the expectation
01:07:47.900 --> 01:07:50.556
of this thing is equal to
sigma squared times n minus p.
01:07:50.556 --> 01:07:53.342
So if I divide both
sides by n minus p,
01:07:53.342 --> 01:07:55.550
I'm going to have
something whose expectation is
01:07:55.550 --> 01:07:57.140
sigma squared.
01:07:57.140 --> 01:07:59.140
And this something, I
can actually compute.
01:07:59.140 --> 01:08:02.090
It depends on y,
and x that I know,
01:08:02.090 --> 01:08:04.100
and beta hat that
I've just estimated.
01:08:04.100 --> 01:08:05.000
I know what n is.
01:08:05.000 --> 01:08:07.520
And n and p are the
dimensions of my matrix x.
01:08:07.520 --> 01:08:11.120
So I'm actually given an
estimator whose expectation
01:08:11.120 --> 01:08:13.330
is sigma squared.
01:08:13.330 --> 01:08:15.880
And so now, I actually
have an unbiased estimator
01:08:15.880 --> 01:08:17.430
of sigma squared.
01:08:17.430 --> 01:08:19.269
That's this guy right here.
01:08:19.269 --> 01:08:20.560
And it's actually super useful.
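A simulation sketch of the unbiasedness claim, assuming NumPy, a fixed made-up design, and Gaussian noise (all names and sizes are illustrative). It compares dividing the sum of squared residuals by n minus p with the naive choice of dividing by n.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 30, 4, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
P_prime = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

reps = 20000
Y = X @ beta + sigma * rng.normal(size=(reps, n))
residuals = Y @ P_prime                 # row r is (P' y_r)^T since P' is symmetric
rss = np.sum(residuals**2, axis=1)      # ||y - X beta_hat||^2 for each draw

sigma2_hat = rss / (n - p)              # divides by n - p: unbiased
sigma2_naive = rss / n                  # divides by n: biased downward
```

The average of `sigma2_hat` over the draws should be close to sigma squared, while the average of `sigma2_naive` should fall short by the factor (n - p)/n.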
01:08:23.470 --> 01:08:25.270
So those are called the--
01:08:25.270 --> 01:08:27.950
this is the normalized
sum of squared residuals.
01:08:27.950 --> 01:08:29.340
These are called the residuals.
01:08:29.340 --> 01:08:32.410
Those are whatever
is residual when
01:08:32.410 --> 01:08:36.580
I project my points onto the
line that I've estimated.
01:08:36.580 --> 01:08:40.870
And so in a way, those guys--
if you go back to this picture,
01:08:40.870 --> 01:08:47.109
this was yi and this was
xi transpose beta hat.
01:08:47.109 --> 01:08:49.540
So if beta hat is close
to beta, the difference
01:08:49.540 --> 01:08:52.810
between yi and xi
transpose beta hat should
01:08:52.810 --> 01:08:55.870
be close to my epsilon i.
01:08:55.870 --> 01:08:57.430
It's some sort of epsilon i hat.
01:09:00.319 --> 01:09:02.590
Agreed?
01:09:02.590 --> 01:09:04.960
And so that means
that if I think
01:09:04.960 --> 01:09:07.510
of those as being
epsilon i hat, they
01:09:07.510 --> 01:09:09.910
should be close to epsilon
i, and so their norm
01:09:09.910 --> 01:09:14.390
should be giving me something
that looks like sigma squared.
01:09:14.390 --> 01:09:16.359
And so that's why it
actually makes sense.
01:09:16.359 --> 01:09:18.790
It's just magical that
everything works out together,
01:09:18.790 --> 01:09:21.130
because I'm not projecting
on the right line,
01:09:21.130 --> 01:09:23.229
I'm actually projecting
on the wrong line.
01:09:23.229 --> 01:09:27.310
But in the end, things
actually work out pretty well.
01:09:27.310 --> 01:09:28.990
There's one thing--
so here, the theorem
01:09:28.990 --> 01:09:31.779
is that this thing not only
has the right expectation,
01:09:31.779 --> 01:09:33.450
but also has a chi
squared distribution.
01:09:33.450 --> 01:09:34.700
That's what we just discussed.
01:09:34.700 --> 01:09:36.250
So here, I'm just
telling you this.
01:09:36.250 --> 01:09:37.899
But it's not too
hard to believe,
01:09:37.899 --> 01:09:40.300
because it's actually
the norm of some vector.
01:09:40.300 --> 01:09:42.279
You could make this
obvious, but again, I
01:09:42.279 --> 01:09:44.800
didn't want to bring in
too much linear algebra.
01:09:44.800 --> 01:09:46.359
So to prove this,
you actually have
01:09:46.359 --> 01:09:48.899
to diagonalize the matrix p.
01:09:48.899 --> 01:09:53.890
So you have to invoke the
eigenvalue decomposition
01:09:53.890 --> 01:09:56.600
and the fact that the norm
is invariant by rotation.
01:09:56.600 --> 01:09:59.440
So for those who are
familiar with, what I can do
01:09:59.440 --> 01:10:01.780
is just look at the
decomposition of p
01:10:01.780 --> 01:10:08.200
prime into U D U transpose, where U is an orthogonal matrix,
01:10:08.200 --> 01:10:10.630
and D is a diagonal
matrix of eigenvalues.
01:10:10.630 --> 01:10:13.312
And when I look at the
norm squared of this thing,
01:10:13.312 --> 01:10:14.770
I mean, I have,
basically, the norm
01:10:14.770 --> 01:10:20.200
squared of p prime
times some epsilon.
01:10:20.200 --> 01:10:26.300
It's the norm of U D U
transpose epsilon squared.
01:10:26.300 --> 01:10:28.550
The norm of a
rotation of a vector
01:10:28.550 --> 01:10:32.280
is the same as the norm of the
vector, so this guy goes away.
01:10:32.280 --> 01:10:34.140
This is not actually--
01:10:34.140 --> 01:10:36.140
I mean, you don't have
to care about this if you
01:10:36.140 --> 01:10:37.880
don't understand what I'm
saying, so don't freak out.
01:10:37.880 --> 01:10:39.810
This is really for
those who follow.
01:10:39.810 --> 01:10:42.211
What is the distribution
of u transpose epsilon?
01:10:45.899 --> 01:10:50.310
I take a Gaussian vector that
has covariance matrix sigma
01:10:50.310 --> 01:10:52.560
squared times the
[? identity, ?] and I basically
01:10:52.560 --> 01:10:54.100
rotate it.
01:10:54.100 --> 01:10:57.965
What is its distribution?
01:10:57.965 --> 01:10:58.465
Yeah?
01:10:58.465 --> 01:10:59.440
AUDIENCE: The same.
01:10:59.440 --> 01:11:00.830
PHILIPPE RIGOLLET:
It's the same.
01:11:00.830 --> 01:11:02.950
It's completely invariant,
because the Gaussian
01:11:02.950 --> 01:11:04.700
thinks of all directions
as being the same.
01:11:04.700 --> 01:11:07.550
So it doesn't really matter if
I take a Gaussian or a rotated
01:11:07.550 --> 01:11:08.600
Gaussian.
01:11:08.600 --> 01:11:10.190
So this is also a
Gaussian, so I'm
01:11:10.190 --> 01:11:11.800
going to call it epsilon prime.
01:11:11.800 --> 01:11:15.110
And I am left with just
the norm of D epsilon prime.
01:11:15.110 --> 01:11:23.730
So this is the sum of the
dj's squared times epsilon prime
01:11:23.730 --> 01:11:24.250
j squared.
01:11:27.030 --> 01:11:30.060
And we just said that
the eigenvalues of p
01:11:30.060 --> 01:11:33.780
are either 0 or 1,
because it's a projector.
01:11:33.780 --> 01:11:36.090
And so here, I'm going
to get only 0's and 1's.
01:11:36.090 --> 01:11:39.300
So I'm really just
summing a certain number
01:11:39.300 --> 01:11:42.050
of epsilon i squared.
01:11:42.050 --> 01:11:45.110
So squares of
standard Gaussians--
01:11:45.110 --> 01:11:48.210
sorry, with a sigma
squared somewhere.
01:11:48.210 --> 01:11:50.850
And basically, how
many am I summing?
01:11:50.850 --> 01:11:55.530
Well, n minus p, the
number of non-0 eigenvalues
01:11:55.530 --> 01:11:57.190
of p prime.
01:11:57.190 --> 01:12:00.490
So that's how it shows up.
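For those who follow the diagonalization argument, here is a sketch of it in code, assuming NumPy and a made-up design matrix. It checks the identity between the squared norm of P' epsilon and the weighted sum of squared rotated coordinates, then uses simulation to compare the mean and variance of the squared norm divided by sigma squared with those of a chi squared with n minus p degrees of freedom, namely n - p and 2(n - p).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 12, 3, 1.0
X = rng.normal(size=(n, p))
P_prime = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

# Eigendecomposition P' = U D U^T with U orthogonal and D diagonal (0's and 1's).
d, U = np.linalg.eigh(P_prime)

eps = sigma * rng.normal(size=n)
eps_rot = U.T @ eps                      # rotated noise, same N(0, sigma^2 I) law

lhs = np.sum((P_prime @ eps) ** 2)       # ||P' eps||^2
rhs = np.sum(d**2 * eps_rot**2)          # sum over j of d_j^2 (eps'_j)^2

# Monte Carlo check of the chi squared mean and variance.
draws = sigma * rng.normal(size=(50000, n))
q = np.sum((draws @ P_prime) ** 2, axis=1) / sigma**2
```

Matching two moments does not prove the distribution is chi squared, but it is consistent with the argument above.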
01:12:00.490 --> 01:12:05.820
When you see this, what
theorem am I using here?
01:12:05.820 --> 01:12:06.650
Cochran's theorem.
01:12:06.650 --> 01:12:07.650
This is this magic box.
01:12:07.650 --> 01:12:09.420
I'm actually going to dump
everything that I'm not going
01:12:09.420 --> 01:12:11.160
to prove to you and say, oh,
this is actually Cochran's.
01:12:11.160 --> 01:12:12.870
No, Cochran's theorem
is really just
01:12:12.870 --> 01:12:15.870
telling me something about
orthogonality of things,
01:12:15.870 --> 01:12:17.712
and therefore,
independence of things.
01:12:17.712 --> 01:12:19.170
And Cochran's
theorem was something
01:12:19.170 --> 01:12:23.271
that I used when I
wanted to use what?
01:12:23.271 --> 01:12:27.280
That's something I used
just one slide before.
01:12:27.280 --> 01:12:28.887
Student t-test, right?
01:12:28.887 --> 01:12:30.970
I used Cochran's theorem
to see that the numerator
01:12:30.970 --> 01:12:33.610
and the denominator of
the Student statistic
01:12:33.610 --> 01:12:35.414
were independent of each other.
01:12:35.414 --> 01:12:37.330
And this is exactly what
I'm going to do here.
01:12:40.170 --> 01:12:42.430
I'm going to actually write
a test to test, maybe,
01:12:42.430 --> 01:12:44.430
if the beta j's are equal to 0.
01:12:44.430 --> 01:12:49.110
I'm going to form a numerator,
which is beta hat minus beta.
01:12:49.110 --> 01:12:50.310
This is normal.
01:12:50.310 --> 01:12:53.287
And we know that beta hat
has a Gaussian distribution.
01:12:53.287 --> 01:12:54.870
I'm going to
standardized by something
01:12:54.870 --> 01:12:55.720
that makes sense to me.
01:12:55.720 --> 01:12:56.940
And I'm not going
to go into details,
01:12:56.940 --> 01:12:58.200
because we're out of time.
01:12:58.200 --> 01:12:59.866
But there's the sigma
hat that shows up.
01:12:59.866 --> 01:13:03.240
And then there's a gamma
j, which takes into account
01:13:03.240 --> 01:13:06.450
the fact that my x's--
01:13:06.450 --> 01:13:12.465
if I look at the distribution of
beta, which is gone, I think--
01:13:12.465 --> 01:13:14.220
yeah, beta is gone.
01:13:14.220 --> 01:13:16.020
Oh, yeah, that's where it is.
01:13:16.020 --> 01:13:20.040
The covariance matrix depends
on this matrix x transpose x.
01:13:20.040 --> 01:13:22.110
So this will show
up in the variance.
01:13:22.110 --> 01:13:25.000
In particular, diagonal elements
are going to play a role here.
01:13:25.000 --> 01:13:26.850
And so that's what
my gammas are.
01:13:26.850 --> 01:13:30.880
Gamma j is the j-th diagonal
element of this matrix.
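Since the details are deferred to the next lecture, the following is only a hedged sketch of what such a test statistic could look like. It assumes the standard form, beta hat j divided by sigma hat times the square root of gamma j, with gamma j taken as the j-th diagonal entry of the inverse of x transpose x; that reading of gamma, the data, and all names are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 60, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.0, 2.0])         # hypothetical truth with one zero coefficient
y = X @ beta + sigma * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
residuals = y - X @ beta_hat
sigma_hat = np.sqrt(residuals @ residuals / (n - p))   # unbiased variance estimate

gamma = np.diag(XtX_inv)                 # assumed: j-th diagonal entry of (X^T X)^{-1}
t_stats = beta_hat / (sigma_hat * np.sqrt(gamma))      # one statistic per H0: beta_j = 0
```

Under the null hypothesis that beta j equals 0, each entry of `t_stats` would follow a Student distribution with n minus p degrees of freedom, which is where the independence given by Cochran's theorem comes in.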
01:13:30.880 --> 01:13:35.010
So we'll resume
that on Tuesday, so
01:13:35.010 --> 01:13:38.476
don't worry too much if
this is going too fast.
01:13:38.476 --> 01:13:40.350
I'm not supposed to
cover it, but just so you
01:13:40.350 --> 01:13:45.300
get a hint of why Cochran's
theorem actually was useful.
01:13:45.300 --> 01:13:51.690
So I don't know if we
actually ended up recording.
01:13:51.690 --> 01:13:53.410
I have your homework.
01:13:53.410 --> 01:13:56.500
And as usual, I will
give it to you outside.