WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high-quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:16.650
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.650 --> 00:00:17.880
at ocw.mit.edu.
00:00:20.760 --> 00:00:25.220
PHILIPPE RIGOLLET: So
yes, before we start,
00:00:25.220 --> 00:00:27.710
this chapter will not
be part of the midterm.
00:00:27.710 --> 00:00:31.970
Everything else will be, so all
the way up to goodness of fit
00:00:31.970 --> 00:00:33.080
tests.
00:00:33.080 --> 00:00:36.920
And there will be
some practice exams
00:00:36.920 --> 00:00:38.690
that will be posted
in the recitation
00:00:38.690 --> 00:00:39.780
section of the course.
00:00:39.780 --> 00:00:40.790
And that will be--
00:00:40.790 --> 00:00:41.760
what you will be working on.
00:00:41.760 --> 00:00:44.480
So the recitation tomorrow
will be a review session
00:00:44.480 --> 00:00:46.430
for the midterm.
00:00:46.430 --> 00:00:49.850
I'll send an
announcement by email.
00:00:49.850 --> 00:00:55.370
So going back to
our estimator, we
00:00:55.370 --> 00:00:58.460
characterized the least
squares estimator in the case
00:00:58.460 --> 00:01:01.860
where we had some
Gaussian observations.
00:01:01.860 --> 00:01:04.459
So we had something that
looked like this-- y was
00:01:04.459 --> 00:01:07.850
equal to some matrix x times
beta plus some epsilon.
00:01:07.850 --> 00:01:10.640
This was an equation
that was happening in R
00:01:10.640 --> 00:01:13.010
to the n for n observations.
00:01:13.010 --> 00:01:15.530
And then we wrote the least
squares estimator beta hat.
00:01:21.300 --> 00:01:23.580
And for the purpose
from here on,
00:01:23.580 --> 00:01:26.200
you see that you have
this normal distribution,
00:01:26.200 --> 00:01:28.189
this Gaussian p-variate
distribution.
00:01:28.189 --> 00:01:29.730
That means that, at
some point, we've
00:01:29.730 --> 00:01:31.830
made the assumption
that epsilon
00:01:31.830 --> 00:01:38.060
was an n-dimensional N(0,
identity of R^n
00:01:38.060 --> 00:01:41.000
times sigma squared),
which I kept
00:01:41.000 --> 00:01:43.130
on forgetting about last time.
00:01:43.130 --> 00:01:45.530
I will try not to
do that this time.
00:01:45.530 --> 00:01:48.800
And so from this,
we derived a bunch
00:01:48.800 --> 00:01:54.080
of properties of this least
squares estimator, beta hat.
00:01:54.080 --> 00:01:58.090
And in particular, the key thing
that everything was built on
00:01:58.090 --> 00:02:02.090
was that we could write beta
hat as the true unknown beta
00:02:02.090 --> 00:02:05.020
plus some multivariate
Gaussian that was centered,
00:02:05.020 --> 00:02:07.010
but had a weird
covariance structure.
00:02:07.010 --> 00:02:08.979
So that was definitely
p dimensional.
00:02:08.979 --> 00:02:11.240
And it was sigma
squared times x--
00:02:13.970 --> 00:02:16.359
so that's x transpose x.
00:02:16.359 --> 00:02:17.150
And that whole thing is inverted.
00:02:19.720 --> 00:02:22.840
And the way we derived that
was by having a lot of--
00:02:22.840 --> 00:02:26.260
at least one cancellation
between x transpose x and x
00:02:26.260 --> 00:02:28.150
transpose x inverse.
00:02:28.150 --> 00:02:47.582
So this is the basis for
inference in linear regression.
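[As a quick numerical sketch of this setup (all names and values here are illustrative, not from the lecture): simulate y = X beta + epsilon with Gaussian noise and form the least squares estimator beta hat = (X transpose X) inverse X transpose y.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                   # n observations, p coefficients
sigma = 2.0                                     # noise level, known here because we simulate
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column of 1s for the intercept
beta = np.array([1.0, -2.0, 0.5])               # true coefficients, unknown in practice
y = X @ beta + sigma * rng.normal(size=n)       # y = X beta + epsilon, epsilon ~ N(0, sigma^2 I_n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # least squares: (X^T X)^{-1} X^T y]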
00:02:51.650 --> 00:02:54.980
So in a way, that's
correct, because what
00:02:54.980 --> 00:02:58.370
happened is that we used
the fact that x beta hat--
00:02:58.370 --> 00:02:59.980
once we have this
beta, x beta hat
00:02:59.980 --> 00:03:04.310
is really just a projection
of y onto the linear span
00:03:04.310 --> 00:03:08.480
of the columns of x, or
the column span of x.
00:03:08.480 --> 00:03:10.040
And so in particular,
those things--
00:03:10.040 --> 00:03:11.990
y minus x beta hats--
00:03:11.990 --> 00:03:13.230
are called residuals.
00:03:22.060 --> 00:03:25.180
So that's the
vector of residuals.
00:03:28.070 --> 00:03:32.390
What's the dimension
of this vector?
00:03:36.214 --> 00:03:37.180
AUDIENCE: n by 1.
00:03:37.180 --> 00:03:38.350
PHILIPPE RIGOLLET: n by 1.
00:03:38.350 --> 00:03:42.890
So those things, we can
write as epsilon hat.
00:03:42.890 --> 00:03:44.720
There's an estimate
for this epsilon
00:03:44.720 --> 00:03:47.870
because we just
put a hat on beta.
00:03:47.870 --> 00:03:49.610
And from this one,
we could actually
00:03:49.610 --> 00:03:54.560
build an unbiased estimator
of sigma squared,
00:03:54.560 --> 00:03:55.940
and that was this guy.
00:03:55.940 --> 00:03:59.330
And we showed that, indeed, the
right normalization for this
00:03:59.330 --> 00:04:04.730
was n minus p, because the squared
norm of y minus x beta hat
00:04:04.730 --> 00:04:07.830
is actually a chi squared with
n minus p degrees of freedom.
00:04:07.830 --> 00:04:11.120
And so that's up to this
scaling by sigma squared.
00:04:11.120 --> 00:04:12.766
So that's what we came up with.
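[Continuing the sketch above, with the same hypothetical names, the residuals and the unbiased variance estimator are:

eps_hat = y - X @ beta_hat                  # vector of residuals, n by 1
sigma2_hat = eps_hat @ eps_hat / (n - p)    # unbiased estimator of sigma^2
# (n - p) * sigma2_hat / sigma^2 ~ chi squared with n - p degrees of freedom]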
00:04:12.766 --> 00:04:14.390
And something I told
you, which follows
00:04:14.390 --> 00:04:15.389
from Cochran's theorem--
00:04:15.389 --> 00:04:17.420
we did not go into
details about this.
00:04:17.420 --> 00:04:18.950
But essentially,
since one of them
00:04:18.950 --> 00:04:22.640
corresponds to projection onto
the linear span of the columns
00:04:22.640 --> 00:04:25.400
of x, and the other one
corresponds to projection
00:04:25.400 --> 00:04:28.575
onto the orthogonal of this guy,
and we're in a Gaussian case,
00:04:28.575 --> 00:04:30.200
things that are
orthogonal are actually
00:04:30.200 --> 00:04:31.874
independent in a Gaussian case.
00:04:31.874 --> 00:04:33.290
So from a geometric
point of view,
00:04:33.290 --> 00:04:34.873
you can sort of
understand everything.
00:04:34.873 --> 00:04:37.580
You think of your subspace of
the linear span of the x's,
00:04:37.580 --> 00:04:39.080
sometimes you project
onto this guy,
00:04:39.080 --> 00:04:41.420
sometimes you project
onto its orthogonal.
00:04:41.420 --> 00:04:43.240
Beta hat corresponds
to projection
00:04:43.240 --> 00:04:44.471
onto the linear span.
00:04:44.471 --> 00:04:46.970
Epsilon hats correspond to a
projection onto the orthogonal.
00:04:46.970 --> 00:04:48.777
And those things are
then independent,
00:04:48.777 --> 00:04:50.360
and that's how you
get that beta hat
00:04:50.360 --> 00:04:53.560
is independent of
sigma hat squared.
00:04:53.560 --> 00:04:56.930
So it's really just a statement
about two linear spaces being
00:04:56.930 --> 00:05:00.510
orthogonal with
respect to each other.
00:05:00.510 --> 00:05:07.820
So we left off on this
slide last time.
00:05:07.820 --> 00:05:10.610
And what I claim is that
this thing here is actually--
00:05:10.610 --> 00:05:12.510
oh, yeah-- the other
thing we want to use.
00:05:12.510 --> 00:05:14.002
So that's good for beta hat.
00:05:14.002 --> 00:05:15.960
But since we don't know
what sigma squared is--
00:05:15.960 --> 00:05:17.335
if we knew what
sigma squared is,
00:05:17.335 --> 00:05:19.160
that would totally
be enough for us.
00:05:19.160 --> 00:05:21.290
But we also need
this extra thing--
00:05:21.290 --> 00:05:27.250
that sigma hat squared
over sigma squared follows--
00:05:27.250 --> 00:05:29.450
and there's an n minus p.
00:05:29.450 --> 00:05:33.250
This follows a chi squared with
n minus p degrees of freedom.
00:05:33.250 --> 00:05:36.820
And sigma hat squared is
independent of beta hat.
00:05:36.820 --> 00:05:41.780
So that's going to
be something we need.
00:05:41.780 --> 00:05:47.870
So that's useful if
sigma squared is unknown.
00:05:51.510 --> 00:05:53.490
And again, sometimes
it might be known
00:05:53.490 --> 00:05:56.164
if you're using some sort
of measurement device
00:05:56.164 --> 00:05:58.080
for which it's written
on the side of the box.
00:06:01.000 --> 00:06:02.860
So from these two
things, we're going
00:06:02.860 --> 00:06:05.830
to be able to do inference.
And inference, we
00:06:05.830 --> 00:06:09.370
said there's three
pillars to inference.
00:06:09.370 --> 00:06:12.340
The first one is estimation, and
we've been doing that so far.
00:06:12.340 --> 00:06:14.530
We've constructed this
least squares estimator,
00:06:14.530 --> 00:06:16.280
which happens to be
the maximum likelihood
00:06:16.280 --> 00:06:18.520
estimator in the Gaussian case.
00:06:18.520 --> 00:06:20.710
The two other things
we do in inference
00:06:20.710 --> 00:06:22.780
are confidence intervals.
00:06:22.780 --> 00:06:24.294
And we can do
confidence intervals.
00:06:24.294 --> 00:06:25.960
We're not going to
do much because we're
00:06:25.960 --> 00:06:29.260
going to talk about their sort
of cousin, which are tests.
00:06:29.260 --> 00:06:31.600
And that's really where
the statistical inference
00:06:31.600 --> 00:06:32.180
comes in.
00:06:32.180 --> 00:06:34.180
And here, we're going to
be interested in a very
00:06:34.180 --> 00:06:36.820
specific kind of test
for linear regression.
00:06:36.820 --> 00:06:42.650
And those are tests
of the form beta j--
00:06:42.650 --> 00:06:46.190
so the j-th coefficient
of beta is equal to 0,
00:06:46.190 --> 00:06:52.310
and that's going to be our null
hypothesis, versus h1 where
00:06:52.310 --> 00:06:55.190
beta j is, say, not equal to 0.
00:06:55.190 --> 00:06:57.560
And for the purpose
of regression,
00:06:57.560 --> 00:07:00.080
unless you have lots of
domain-specific knowledge,
00:07:00.080 --> 00:07:03.020
it won't be beta j positive
or beta j negative.
00:07:03.020 --> 00:07:06.800
It's really non-0 that's
interesting to you.
00:07:06.800 --> 00:07:09.830
So why would I want
to do this test?
00:07:09.830 --> 00:07:14.540
Well, if I expand this
thing where I have y
00:07:14.540 --> 00:07:19.740
is equal to x beta
plus epsilon--
00:07:19.740 --> 00:07:21.850
so what happens if
I look, for example,
00:07:21.850 --> 00:07:24.630
at the first coordinates?
00:07:24.630 --> 00:07:32.420
So I have that y is actually--
so say, y1 is equal to beta 1
00:07:32.420 --> 00:07:37.050
plus beta 2 x 1.
00:07:37.050 --> 00:07:38.866
Well, that's
actually complicated.
00:07:38.866 --> 00:07:39.990
Let me write it like this--
00:07:42.600 --> 00:07:56.500
beta 0 plus beta 1 xi1 plus dot dot dot
plus beta p minus 1 xi p minus 1
00:07:56.500 --> 00:07:58.860
plus epsilon i.
00:07:58.860 --> 00:08:00.130
And that's true for all i's.
00:08:04.510 --> 00:08:05.960
So this is beta 1 times 1.
00:08:05.960 --> 00:08:07.880
That was our first coordinate.
00:08:07.880 --> 00:08:09.920
So that's just expanding this--
00:08:09.920 --> 00:08:12.980
going back to the
scalar form rather than
00:08:12.980 --> 00:08:15.140
going to the matrix vector form.
00:08:15.140 --> 00:08:16.250
That's what we're doing.
00:08:16.250 --> 00:08:19.450
When I write y is equal
to x beta plus epsilon,
00:08:19.450 --> 00:08:22.400
I assume that each of my
y's can be represented
00:08:22.400 --> 00:08:25.520
as a linear combination
of the x's, the first one
00:08:25.520 --> 00:08:26.990
being 1 plus some epsilon i.
00:08:26.990 --> 00:08:29.630
Everybody agrees with this?
00:08:29.630 --> 00:08:32.539
What does it mean for
beta j to be equal to 0?
00:08:40.661 --> 00:08:41.161
Yeah?
00:08:41.161 --> 00:08:43.315
AUDIENCE: That
xj's not important.
00:08:43.315 --> 00:08:45.190
PHILIPPE RIGOLLET: Yeah,
that xj doesn't even
00:08:45.190 --> 00:08:46.750
show up in this thing.
00:08:46.750 --> 00:08:51.940
So if beta j is equal to 0,
that means that, essentially, we
00:08:51.940 --> 00:09:05.946
can remove the j-th coordinate,
xj, from all observations.
00:09:12.710 --> 00:09:15.080
So for example, I'm
a banker, and I'm
00:09:15.080 --> 00:09:19.280
trying to predict some score--
00:09:19.280 --> 00:09:21.260
let's call it y--
00:09:21.260 --> 00:09:22.460
without the noise.
00:09:22.460 --> 00:09:26.400
So I'm trying to predict what
is going to be your score.
00:09:26.400 --> 00:09:29.090
And that's something
that should be telling me
00:09:29.090 --> 00:09:33.080
how likely you are to
reimburse your loan on time
00:09:33.080 --> 00:09:34.490
or do you have late payments.
00:09:34.490 --> 00:09:36.530
Or actually, maybe
these days bankers
00:09:36.530 --> 00:09:40.550
are actually looking at
how much in late fees I will
00:09:40.550 --> 00:09:41.509
be collecting from you.
00:09:41.509 --> 00:09:44.049
Maybe that's what they are more
after rather than making sure
00:09:44.049 --> 00:09:45.490
that you reimburse everything.
00:09:45.490 --> 00:09:47.810
So they're trying to maximize
this number of late fees.
00:09:47.810 --> 00:09:49.970
And they collect a bunch
of things about you--
00:09:49.970 --> 00:09:52.130
definitely your credit
score, but maybe your
00:09:52.130 --> 00:09:57.110
zip code, profession, years
of education, family status,
00:09:57.110 --> 00:09:59.150
a bunch of things.
00:09:59.150 --> 00:10:01.560
One might be your shoe size.
00:10:01.560 --> 00:10:03.750
And they want to know--
maybe shoe size is actually
00:10:03.750 --> 00:10:07.050
a good explanation
for how much fees
00:10:07.050 --> 00:10:08.770
they're going to be
collecting from you.
00:10:08.770 --> 00:10:10.950
But as you can imagine, this
would be a controversial thing
00:10:10.950 --> 00:10:12.720
to bring, and people might
want to test whether using shoe
00:10:12.720 --> 00:10:14.010
size is a good idea.
00:10:14.010 --> 00:10:17.130
And so they would just
look at the j corresponding
00:10:17.130 --> 00:10:21.120
to shoe size and test whether
shoe size should appear or not
00:10:21.120 --> 00:10:22.484
in this formula.
00:10:22.484 --> 00:10:24.150
And that's essentially
the kind of thing
00:10:24.150 --> 00:10:25.410
that people are going to do.
00:10:25.410 --> 00:10:27.840
Now, if I do genomics
and I'm trying
00:10:27.840 --> 00:10:32.760
to predict the size, the girth,
of a pumpkin for a competition
00:10:32.760 --> 00:10:37.530
based on some
available genomic data,
00:10:37.530 --> 00:10:40.710
then I can test whether
gene j, which is called--
00:10:40.710 --> 00:10:44.010
I don't know-- pea snap 24--
they always have these crazy
00:10:44.010 --> 00:10:44.820
names--
00:10:44.820 --> 00:10:46.730
appears or not in this formula.
00:10:46.730 --> 00:10:49.350
Is the gene pea snap 24
going to be important or not
00:10:49.350 --> 00:10:52.080
for the size of
the final pumpkin?
00:10:52.080 --> 00:10:54.420
So those are definitely
the important things.
00:10:54.420 --> 00:10:57.660
And definitely, we
want to put beta j not
00:10:57.660 --> 00:11:00.120
equal to 0 as the alternative
because that's where
00:11:00.120 --> 00:11:02.880
scientific discovery shows up.
00:11:02.880 --> 00:11:06.450
And so to do that, well,
we're in a Gaussian set-up,
00:11:06.450 --> 00:11:10.470
so we know that even if we
don't know what sigma is,
00:11:10.470 --> 00:11:14.250
we can actually
call for a t-test.
00:11:14.250 --> 00:11:16.740
So how did we build
the t-test in general?
00:11:16.740 --> 00:11:23.630
Well, we had something that
looked like-- so before, what
00:11:23.630 --> 00:11:28.490
we had was something that
looked like theta hat was
00:11:28.490 --> 00:11:35.030
equal to theta plus some
n0 and something that
00:11:35.030 --> 00:11:38.540
depended on n, maybe, something
like this-- sigma squared
00:11:38.540 --> 00:11:39.350
over n.
00:11:39.350 --> 00:11:41.470
So that's what it looked like.
00:11:41.470 --> 00:11:46.120
Now what we have
is that beta hat
00:11:46.120 --> 00:11:50.470
is equal to beta plus some N,
but this time, it's p-variate,
00:11:50.470 --> 00:11:56.130
and then x transpose x
inverse sigma squared.
00:11:56.130 --> 00:12:00.700
So it's actually very similar,
except that the matrix
00:12:00.700 --> 00:12:03.110
x transpose x inverse
is now replacing
00:12:03.110 --> 00:12:06.830
just this number, 1/n, but
it's playing the same role.
00:12:06.830 --> 00:12:12.750
So in particular, this implies
that for every j from 1
00:12:12.750 --> 00:12:16.300
to p, what is the
distribution of beta hat j?
00:12:19.010 --> 00:12:22.550
Well, beta hat j is
actually equal to--
00:12:22.550 --> 00:12:26.350
so all I have to do-- so this
is a system of p equations,
00:12:26.350 --> 00:12:29.540
and all I have to do is
to read off the j-th row.
00:12:29.540 --> 00:12:32.090
So it's telling me here, I'm
going to read beta hat j.
00:12:32.090 --> 00:12:34.300
Here, I'm going to read beta j.
00:12:34.300 --> 00:12:36.470
And here, I need
to read, what is
00:12:36.470 --> 00:12:40.980
the distribution of the j-th
coordinates of this guy?
00:12:40.980 --> 00:12:43.570
So this is a Gaussian
vector, so we
00:12:43.570 --> 00:12:45.910
need to understand
what its distribution is.
00:12:49.470 --> 00:12:52.350
So how do I do this?
00:12:52.350 --> 00:12:56.360
Well, the observation that's
actually useful for this--
00:12:56.360 --> 00:12:59.830
maybe I shouldn't use the word
observation in a stats class,
00:12:59.830 --> 00:13:00.900
so let's call it claim.
00:13:03.648 --> 00:13:09.610
The interesting claim is
that if I have a vector--
00:13:09.610 --> 00:13:13.916
let's call it v--
00:13:13.916 --> 00:13:20.500
then vj is equal to
v transpose ej where
00:13:20.500 --> 00:13:28.890
ej is the vector with 0, 0,
0, and then the 1 on the j-th
00:13:28.890 --> 00:13:30.990
coordinate, and
then 0 elsewhere.
00:13:30.990 --> 00:13:32.687
That's the j-th coordinate.
00:13:35.620 --> 00:13:38.530
So that's the j-th vector of
the canonical basis of rp.
00:13:41.640 --> 00:13:43.860
So now that I have
this form, I can
00:13:43.860 --> 00:13:45.730
see that, essentially,
beta hat j
00:13:45.730 --> 00:13:51.270
is just ej transpose
this Np(0, sigma squared
00:13:51.270 --> 00:13:53.790
x transpose x inverse).
00:13:59.550 --> 00:14:02.040
And now, I know what
the distribution
00:14:02.040 --> 00:14:05.550
of the inner product
between a Gaussian
00:14:05.550 --> 00:14:08.660
and a deterministic vector is.
00:14:08.660 --> 00:14:09.160
What is it?
00:14:13.810 --> 00:14:15.640
It's a Gaussian.
00:14:15.640 --> 00:14:23.950
So all I have to check is that
ej transpose Np(0, sigma squared
00:14:23.950 --> 00:14:27.200
x transpose x inverse)--
00:14:27.200 --> 00:14:31.610
well, this is equal in
distribution to what?
00:14:31.610 --> 00:14:34.900
Well, this is going to be
a one-dimensional thing.
00:14:34.900 --> 00:14:38.230
An inner product
is just a real number.
00:14:38.230 --> 00:14:42.090
So it's going to
be some Gaussian.
00:14:42.090 --> 00:14:49.480
The mean is going to be the inner
product of 0 with ej, which is 0.
00:14:49.480 --> 00:14:52.502
What is the variance
of this guy?
00:14:52.502 --> 00:14:55.320
We actually used this, except
that ej was not a vector,
00:14:55.320 --> 00:14:57.460
but it was a matrix.
00:14:57.460 --> 00:15:04.990
So what we do is we, to see--
so the rule is that v transpose,
00:15:04.990 --> 00:15:16.610
say, N(mu, Sigma) is
some N(v transpose mu,
00:15:16.610 --> 00:15:21.140
and then v transpose
Sigma v). That's
00:15:21.140 --> 00:15:23.234
the rule for Gaussian vectors.
00:15:23.234 --> 00:15:25.150
It's just a property
of Gaussian vectors.
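[A small simulation check of that rule (illustrative names, nothing here is from the lecture): for a fixed v and Z ~ N(mu, Sigma), the empirical mean and variance of v transpose Z should match v transpose mu and v transpose Sigma v.

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
v = np.array([3.0, -2.0])

Z = rng.multivariate_normal(mu, Sigma, size=200_000)  # many draws of N(mu, Sigma)
proj = Z @ v                                          # v^T Z for each draw
print(proj.mean(), v @ mu)                            # empirical mean vs. v^T mu
print(proj.var(), v @ Sigma @ v)                      # empirical variance vs. v^T Sigma v]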
00:15:27.760 --> 00:15:29.300
So what do we have here?
00:15:29.300 --> 00:15:33.350
Well, ej plays the
role of v. And sigma
00:15:33.350 --> 00:15:36.990
squared x transpose x
inverse plays the role of Sigma.
00:15:36.990 --> 00:15:40.765
So here, I'm left
with ej transpose--
00:15:40.765 --> 00:15:42.390
let me pull out the
sigma squared here.
00:15:54.590 --> 00:15:57.350
But this thing is, what
happens if I take a matrix,
00:15:57.350 --> 00:16:00.110
I premultiply it
by this vector ej,
00:16:00.110 --> 00:16:02.298
and I postmultiply
it by this vector ej?
00:16:05.110 --> 00:16:07.890
I'm claiming that this
corresponds to only one
00:16:07.890 --> 00:16:09.510
single element of this matrix.
00:16:09.510 --> 00:16:11.371
Which one is it?
00:16:11.371 --> 00:16:11.870
AUDIENCE: j.
00:16:11.870 --> 00:16:14.570
PHILIPPE RIGOLLET: The
j-th diagonal element.
00:16:14.570 --> 00:16:23.210
So this thing here is nothing
but x transpose x inverse,
00:16:23.210 --> 00:16:27.840
and then the j-th
diagonal element is jj.
00:16:27.840 --> 00:16:30.840
Now, I cannot go any further.
00:16:30.840 --> 00:16:34.730
x transpose x inverse can
be a complicated matrix,
00:16:34.730 --> 00:16:40.350
and I do not know how to express
its j-th diagonal element much
00:16:40.350 --> 00:16:41.190
better than this.
00:16:43.990 --> 00:16:46.740
Well, no, actually, I don't.
00:16:46.740 --> 00:16:48.681
It involves basically
all the coefficients.
00:16:48.681 --> 00:16:49.181
Yeah?
00:16:49.181 --> 00:16:52.127
AUDIENCE: [INAUDIBLE]
second j come from,
00:16:52.127 --> 00:16:55.073
so I get why ej
transpose [INAUDIBLE]..
00:16:55.073 --> 00:16:56.560
Where did the--
00:16:56.560 --> 00:16:58.194
PHILIPPE RIGOLLET:
From this rule?
00:16:58.194 --> 00:16:59.070
AUDIENCE: [INAUDIBLE]
00:16:59.070 --> 00:16:59.790
PHILIPPE RIGOLLET:
So you always pre-
00:16:59.790 --> 00:17:01.956
and postmultiply when you
talk about the covariance,
00:17:01.956 --> 00:17:04.869
because if you did not, it would
be a vector and not a scalar,
00:17:04.869 --> 00:17:06.480
for one.
00:17:06.480 --> 00:17:08.550
But in general, think
of v as a matrix.
00:17:08.550 --> 00:17:11.310
It's still true even
if v is a matrix that's
00:17:11.310 --> 00:17:12.911
compatible with
the premultiplying
00:17:12.911 --> 00:17:13.619
by some Gaussian.
00:17:19.079 --> 00:17:20.805
Any other question?
00:17:20.805 --> 00:17:21.305
Yeah?
00:17:21.305 --> 00:17:25.241
AUDIENCE: When you say claim
a vector v, what is vector v?
00:17:29.180 --> 00:17:31.759
PHILIPPE RIGOLLET:
So for any vector v--
00:17:31.759 --> 00:17:32.300
AUDIENCE: OK.
00:17:37.700 --> 00:17:40.690
PHILIPPE RIGOLLET:
Any other question?
00:17:40.690 --> 00:17:44.890
So now we've identified
that the j-th coefficient
00:17:44.890 --> 00:17:47.440
of this Gaussian, which I
can represent from the claim
00:17:47.440 --> 00:17:49.540
as ej transpose
this guy, is also
00:17:49.540 --> 00:17:51.220
a Gaussian that's centered.
00:17:51.220 --> 00:17:54.310
And its variance,
now, is sigma squared
00:17:54.310 --> 00:17:58.700
times the j-th diagonal element
of x transpose x inverse.
00:17:58.700 --> 00:18:05.830
So the conclusion
is that beta hat j
00:18:05.830 --> 00:18:10.749
is equal to beta j plus some n.
00:18:10.749 --> 00:18:12.790
And I'm going to emphasize
the fact that now it's
00:18:12.790 --> 00:18:19.180
one-dimensional with mean 0
and variance sigma squared x
00:18:19.180 --> 00:18:25.100
transpose x inverse, element jj.
00:18:25.100 --> 00:18:28.430
Now, if you look at the last
line of the second board
00:18:28.430 --> 00:18:31.912
and the first line
on the first board,
00:18:31.912 --> 00:18:33.370
those are basically
the same thing.
00:18:36.630 --> 00:18:39.410
Beta hat j is my theta hat.
00:18:39.410 --> 00:18:41.630
Beta j is my theta.
00:18:41.630 --> 00:18:44.080
And the variance
sigma squared over n
00:18:44.080 --> 00:18:47.840
is now sigma squared times
this jj element.
00:18:47.840 --> 00:18:52.710
Now, the inverse suggests that
it looks like the inverse of n.
00:18:52.710 --> 00:18:53.960
So those things are going to--
00:18:53.960 --> 00:18:55.710
we're going to want
to think of those guys
00:18:55.710 --> 00:18:59.917
as being some sort of
1/n kind of statement.
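[To make that concrete in the running sketch (reusing the hypothetical names from the earlier snippet): gamma_j is the j-th diagonal entry of (X^T X)^{-1}, and refitting on fresh noise many times should give beta_hat_j an empirical variance close to sigma^2 gamma_j.

gamma = np.diag(np.linalg.inv(X.T @ X))     # gamma_j = [(X^T X)^{-1}]_{jj}

B = 20_000
beta_hats = np.empty((B, p))
for b in range(B):                          # refit on fresh noise each time
    yb = X @ beta + sigma * rng.normal(size=n)
    beta_hats[b] = np.linalg.solve(X.T @ X, X.T @ yb)
print(beta_hats.var(axis=0))                # empirical variance of each beta_hat_j
print(sigma**2 * gamma)                     # theoretical sigma^2 * gamma_j]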
00:19:04.120 --> 00:19:09.420
So from this, the fact that
those two things are the same
00:19:09.420 --> 00:19:11.660
leads us to believe
that we are now
00:19:11.660 --> 00:19:14.010
equipped to perform the task
that we're trying to do,
00:19:14.010 --> 00:19:16.410
because under the
null hypothesis,
00:19:16.410 --> 00:19:22.790
beta j is known-- it's equal
to 0-- so I can remove it.
00:19:22.790 --> 00:19:24.540
And I have to deal
with the sigma squared.
00:19:24.540 --> 00:19:26.130
If sigma squared
is known, then I
00:19:26.130 --> 00:19:29.100
can just perform
a regular Gaussian
00:19:29.100 --> 00:19:31.070
test using Gaussian quantiles.
00:19:31.070 --> 00:19:33.120
And if sigma squared
is unknown, I'm
00:19:33.120 --> 00:19:35.730
going to just divide
by sigma hat instead
00:19:35.730 --> 00:19:38.790
of sigma, and then I'm
00:19:38.790 --> 00:19:40.260
going to basically
get my t-test.
00:20:00.630 --> 00:20:03.600
Actually, for the
purpose of your exam,
00:20:03.600 --> 00:20:06.060
I really suggest that you
understand every single word
00:20:06.060 --> 00:20:08.220
I'm going to be saying now,
because this is exactly
00:20:08.220 --> 00:20:09.678
the same thing that
you're expected
00:20:09.678 --> 00:20:12.719
to know from other courses,
because right now, I'm just
00:20:12.719 --> 00:20:14.760
going to apply exactly
the same technique that we
00:20:14.760 --> 00:20:17.400
did for the single
parameter estimation.
00:20:17.400 --> 00:20:26.940
So what do we have now is that
under h0, beta j is equal to 0.
00:20:26.940 --> 00:20:39.030
Therefore, beta hat j follows
some N(0, sigma squared gamma j).
00:20:39.030 --> 00:20:41.670
Just like I do in the slide,
I'm going to call this gamma j.
00:20:50.810 --> 00:20:56.060
So gamma j is this x transpose
x inverse j-th diagonal element.
00:20:59.770 --> 00:21:06.140
So that implies that
beta hat j over sigma--
00:21:06.140 --> 00:21:08.120
oh, was it a square root?
00:21:08.120 --> 00:21:16.130
Yeah, sigma square root of
gamma j follows some N(0, 1).
00:21:16.130 --> 00:21:21.280
So I can form my
test statistic, which
00:21:21.280 --> 00:21:30.880
to be reject if the absolute
value of beta hat j divided
00:21:30.880 --> 00:21:38.159
by sigma square root gamma
j is larger than what?
00:21:38.159 --> 00:21:39.700
Can somebody tell
me what I want this
00:21:39.700 --> 00:21:41.050
to be larger than to reject?
00:21:43.948 --> 00:21:45.400
AUDIENCE: q alpha.
00:21:45.400 --> 00:21:46.525
PHILIPPE RIGOLLET: q alpha.
00:21:48.642 --> 00:21:49.350
Everybody agrees?
00:21:49.350 --> 00:21:50.852
Of what?
00:21:50.852 --> 00:21:58.070
Of this guy, where
the standard notation
00:21:58.070 --> 00:21:59.480
is that this is the quantile.
00:21:59.480 --> 00:22:01.257
Everybody agrees?
00:22:01.257 --> 00:22:02.756
AUDIENCE: It's alpha
over 2 I think.
00:22:02.756 --> 00:22:03.537
I think alpha's--
00:22:03.537 --> 00:22:04.870
PHILIPPE RIGOLLET: Alpha over 2.
00:22:04.870 --> 00:22:06.520
So not everybody
should be agreeing.
00:22:06.520 --> 00:22:08.765
Thank you, you're the first
one to disagree with yourself,
00:22:08.765 --> 00:22:09.723
which is probably good.
00:22:12.111 --> 00:22:14.110
It's alpha over 2 because
of the absolute value.
00:22:14.110 --> 00:22:15.670
I want to just be
away from this guy,
00:22:15.670 --> 00:22:17.110
and that's because I have--
00:22:17.110 --> 00:22:19.140
so the alpha over 2--
00:22:19.140 --> 00:22:27.650
the sanity check should be that
h1 is beta j not equal to 0.
00:22:27.650 --> 00:22:35.010
So that works if sigma is known,
because I need to know sigma
00:22:35.010 --> 00:22:37.380
to be able to build my test.
00:22:37.380 --> 00:22:39.960
So if sigma is unknown, well,
I can tell you, use this test,
00:22:39.960 --> 00:22:41.550
but you're going to
be like, OK, when
00:22:41.550 --> 00:22:44.310
I'm going to have to
plug in some numbers,
00:22:44.310 --> 00:22:45.810
I'm going to be stuck.
00:22:49.240 --> 00:22:59.570
But if sigma is unknown,
we have sigma hat
00:22:59.570 --> 00:23:03.400
squared as an estimator.
00:23:03.400 --> 00:23:06.850
So let me write
sigma squared here.
00:23:06.850 --> 00:23:12.050
So in particular,
beta hat divided
00:23:12.050 --> 00:23:18.220
by sigma hat squared times
square root gamma j-- something
00:23:18.220 --> 00:23:19.169
I can compute.
00:23:19.169 --> 00:23:20.210
Sorry, that's beta hat j.
00:23:23.070 --> 00:23:24.576
I can compute that thing.
00:23:24.576 --> 00:23:25.490
Agreed?
00:23:25.490 --> 00:23:27.230
Now I have sigma hat.
00:23:27.230 --> 00:23:28.980
What I need to do is
to be able to compute
00:23:28.980 --> 00:23:32.625
the distribution of this thing.
00:23:32.625 --> 00:23:37.880
So I know the distribution of
beta hat j over sigma square root
00:23:37.880 --> 00:23:38.410
of gamma j.
00:23:38.410 --> 00:23:40.479
That's some Gaussian 0, 1.
00:23:40.479 --> 00:23:42.770
I don't know exactly what
the distribution of sigma hat
00:23:42.770 --> 00:23:46.660
squared is, but what I know is
that that was actually written,
00:23:46.660 --> 00:23:54.790
maybe, here is that n minus p
sigma hat squared over sigma
00:23:54.790 --> 00:23:59.550
squared follows some chi
squared with n minus p
00:23:59.550 --> 00:24:01.350
degrees of freedom,
and that it's actually
00:24:01.350 --> 00:24:06.590
independent of beta hat j.
00:24:06.590 --> 00:24:08.220
It's independent of
beta hat, so it's
00:24:08.220 --> 00:24:10.030
independent of each
of its coordinates.
00:24:10.030 --> 00:24:13.680
That was part of your
homework where you had to--
00:24:13.680 --> 00:24:15.900
some of you were confused
by the fact that--
00:24:15.900 --> 00:24:18.199
I mean, if you're independent
of some big thing,
00:24:18.199 --> 00:24:19.740
you're independent
of all the smaller
00:24:19.740 --> 00:24:20.948
components of this big thing.
00:24:20.948 --> 00:24:24.080
That's basically what
you need to know.
00:24:24.080 --> 00:24:26.310
And so now I can
just write this as--
00:24:29.630 --> 00:24:35.970
this is beta hat j divided by--
00:24:35.970 --> 00:24:37.890
so now I want to
make this guy appear,
00:24:37.890 --> 00:24:42.630
so it's beta hat j sigma
squared over sigma squared--
00:24:42.630 --> 00:24:48.261
sigma hat squared over sigma
squared times n minus p divided
00:24:48.261 --> 00:24:49.510
by the square root of gamma j.
00:24:49.510 --> 00:24:51.580
So that's what I want to see.
00:24:51.580 --> 00:24:52.236
Yeah?
00:24:52.236 --> 00:24:53.188
AUDIENCE: Why do
you have to stick
00:24:53.188 --> 00:24:54.313
the hat in the denominator?
00:24:54.313 --> 00:24:56.996
Shouldn't it be sigma?
00:24:56.996 --> 00:24:59.130
PHILIPPE RIGOLLET:
Yeah, so I write this.
00:24:59.130 --> 00:25:01.330
I decide to write this.
00:25:01.330 --> 00:25:03.170
I could have put a
Mickey Mouse here.
00:25:03.170 --> 00:25:04.460
It just wouldn't make sense.
00:25:04.460 --> 00:25:05.960
I just decided to
take this thing.
00:25:05.960 --> 00:25:06.390
AUDIENCE: OK.
00:25:06.390 --> 00:25:07.306
PHILIPPE RIGOLLET: OK.
00:25:07.306 --> 00:25:12.800
So now, let-- so I
take this guy, and now,
00:25:12.800 --> 00:25:15.050
I'm going to rewrite
it as something I want,
00:25:15.050 --> 00:25:17.891
because if you don't
know what sigma is--
00:25:17.891 --> 00:25:18.890
sorry, that's not sigm--
00:25:18.890 --> 00:25:19.850
you mean the square?
00:25:19.850 --> 00:25:20.265
AUDIENCE: Yeah.
00:25:20.265 --> 00:25:21.020
PHILIPPE RIGOLLET:
Oh, thank you.
00:25:21.020 --> 00:25:22.160
Yes, that's correct.
00:25:22.160 --> 00:25:25.390
[LAUGHS] OK, so you
don't know what sigma
00:25:25.390 --> 00:25:26.740
is, you replace it by sigma hat.
00:25:26.740 --> 00:25:28.650
That's the most
natural thing to do.
00:25:28.650 --> 00:25:30.590
You just now want
to find out what
00:25:30.590 --> 00:25:33.380
the distribution of this guy is.
00:25:33.380 --> 00:25:35.780
So this is not
exactly what I had.
00:25:35.780 --> 00:25:41.530
To be able to get this, I need
to divide by sigma squared--
00:25:41.530 --> 00:25:42.640
sorry, I need to--
00:25:42.640 --> 00:25:43.950
AUDIENCE: Square root.
00:25:43.950 --> 00:25:44.741
PHILIPPE RIGOLLET: I'm sorry.
00:25:44.741 --> 00:25:46.157
AUDIENCE: Do we
need a square root
00:25:46.157 --> 00:25:47.450
of the sigma hat [INAUDIBLE].
00:25:47.450 --> 00:25:49.033
PHILIPPE RIGOLLET:
That's correct now.
00:25:55.400 --> 00:25:57.080
And now I have that--
00:25:57.080 --> 00:25:59.430
sorry, I should not
write it like that.
00:25:59.430 --> 00:26:01.770
That's not what I want.
00:26:01.770 --> 00:26:04.350
What I want is this.
00:26:08.260 --> 00:26:11.470
And to be able to get
this guy, what I need
00:26:11.470 --> 00:26:25.100
is sigma over sigma
hat square root.
00:26:25.100 --> 00:26:27.500
And then I need to make
this thing show up.
00:26:27.500 --> 00:26:32.670
So I need to have this n minus
p show up in the denominator.
00:26:32.670 --> 00:26:34.610
So to be able to get
it, I need to multiply
00:26:34.610 --> 00:26:37.343
the entire thing by the
square root of n minus p.
00:26:41.120 --> 00:26:42.590
So this is just a tautology.
00:26:42.590 --> 00:26:46.510
I just squeezed
in what I wanted.
00:26:46.510 --> 00:26:50.680
But now this whole thing
here, this is actually
00:26:50.680 --> 00:26:56.560
of the form beta hat j divided
by sigma over square root gamma
00:26:56.560 --> 00:27:01.450
j, and then divided by square
root of sigma hat squared
00:27:01.450 --> 00:27:04.700
over sigma squared.
00:27:08.607 --> 00:27:11.231
No, I don't want to divide it by
square root of n minus p, sorry.
00:27:15.290 --> 00:27:21.720
And now it's times n minus
p divided by n minus p.
00:27:27.560 --> 00:27:29.714
And what is the distribution
of this thing here?
00:27:43.546 --> 00:27:45.610
So I'm going to keep going here.
00:27:45.610 --> 00:27:48.480
So the distribution of
this thing here is what?
00:27:48.480 --> 00:27:54.075
Well, this numerator,
what is this distribution?
00:27:58.035 --> 00:28:01.650
AUDIENCE: [INAUDIBLE]
00:28:01.650 --> 00:28:02.900
PHILIPPE RIGOLLET: Yeah, N(0, 1).
00:28:02.900 --> 00:28:04.525
It's actually still
written over there.
00:28:09.460 --> 00:28:11.509
So that's our N(0, 1).
00:28:11.509 --> 00:28:13.050
What is the distribution
of this guy?
00:28:16.580 --> 00:28:18.970
Sorry, I don't think
you have color again.
00:28:18.970 --> 00:28:22.922
So what is the
distribution of this guy?
00:28:22.922 --> 00:28:24.380
This is still
written on the board.
00:28:24.380 --> 00:28:25.660
AUDIENCE: Chi squared.
00:28:25.660 --> 00:28:28.285
PHILIPPE RIGOLLET: It's the chi
squared that I have right here.
00:28:32.530 --> 00:28:35.580
So that's a chi squared with n
minus p, divided by its n minus p
00:28:35.580 --> 00:28:36.516
degrees of freedom.
00:28:36.516 --> 00:28:37.890
The only thing I
need to check is
00:28:37.890 --> 00:28:39.780
that those two guys
are independent, which
00:28:39.780 --> 00:28:43.050
is also what I have from here.
00:28:43.050 --> 00:28:49.690
And so that implies
that beta hat j divided
00:28:49.690 --> 00:28:53.290
by sigma hat square
root of gamma
00:28:53.290 --> 00:28:55.360
j, what is the
distribution of this guy?
00:29:04.822 --> 00:29:06.330
[INTERPOSING VOICES]
00:29:06.330 --> 00:29:09.095
PHILIPPE RIGOLLET: tn minus p.
00:29:09.095 --> 00:29:12.040
Was that crystal
clear for everyone?
00:29:12.040 --> 00:29:15.370
Was that so simple that
it was boring to everyone?
00:29:15.370 --> 00:29:16.090
OK, good.
00:29:16.090 --> 00:29:18.760
That's the point
at which you should be.
00:29:18.760 --> 00:29:23.350
So now that I have this, I can read
the quantiles of this guy.
00:29:23.350 --> 00:29:28.580
So my test statistic becomes--
00:29:28.580 --> 00:29:31.000
well, my rejection
region, I reject
00:29:31.000 --> 00:29:40.390
if the absolute
value of this new guy
00:29:40.390 --> 00:29:44.900
exceeds the quantile of order
alpha over 2, but this time,
00:29:44.900 --> 00:29:48.390
of a tn minus p.
00:29:48.390 --> 00:29:50.660
And now you can actually
see that the only difference
00:29:50.660 --> 00:29:53.600
between this test and that
test, apart from replacing sigma
00:29:53.600 --> 00:29:55.490
by sigma hat, is
that now I've moved
00:29:55.490 --> 00:29:58.280
from the quantiles of a
Gaussian to the quantiles
00:29:58.280 --> 00:29:59.640
of a tn minus p.
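[Assembled in code, the test looks like this (a sketch continuing the hypothetical example; using scipy is my assumption, the lecture names no software):

from scipy.stats import t as student_t

alpha = 0.05
j = 2                                                      # coefficient under test
t_stat = beta_hat[j] / np.sqrt(sigma2_hat * gamma[j])      # beta_hat_j / (sigma_hat sqrt(gamma_j))
q = student_t.ppf(1 - alpha / 2, df=n - p)                 # quantile of order alpha/2 of t_{n-p}
reject = abs(t_stat) > q                                   # reject H0: beta_j = 0 at level alpha
p_value = 2 * student_t.sf(abs(t_stat), df=n - p)          # two-sided p value]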
00:30:11.085 --> 00:30:13.210
What's actually interesting,
from this perspective,
00:30:13.210 --> 00:30:18.070
is that the tn minus
p, we know, has
00:30:18.070 --> 00:30:20.800
heavier tails than the Gaussian,
but if the number of degrees
00:30:20.800 --> 00:30:26.131
of freedom reaches, maybe, 30 or
40, they're virtually the same.
00:30:26.131 --> 00:30:27.880
And here, the number
of degrees of freedom
00:30:27.880 --> 00:30:30.610
is not given only by
n, but it's n minus p.
00:30:30.610 --> 00:30:33.100
So if I have more and more
parameters to estimate,
00:30:33.100 --> 00:30:35.616
this will result in some
heavier, heavier tails,
00:30:35.616 --> 00:30:37.240
and that's just to
account for the fact
00:30:37.240 --> 00:30:41.680
that it's harder and harder
to estimate the variance
00:30:41.680 --> 00:30:44.680
when I have a lot of parameters.
00:30:44.680 --> 00:30:46.780
That's basically where
it's coming from.
00:30:46.780 --> 00:30:52.270
So now let's move on to--
00:30:52.270 --> 00:30:57.040
well, I don't know what because
this is not working anymore.
00:30:57.040 --> 00:30:59.080
So this is the simplest test.
00:30:59.080 --> 00:31:02.560
And actually, if you run
any statistical software
00:31:02.560 --> 00:31:06.190
for least squares, the
output in any of them
00:31:06.190 --> 00:31:08.690
will look like this.
00:31:08.690 --> 00:31:11.780
You will have a
sequence of rows.
00:31:11.780 --> 00:31:15.330
And you're going to have
an estimate for beta 0,
00:31:15.330 --> 00:31:17.445
an estimate for
beta 1, et cetera.
00:31:17.445 --> 00:31:19.320
Here, you're going to
have a bunch of things.
00:31:19.320 --> 00:31:23.040
And on this row, you're
going to have the value here,
00:31:23.040 --> 00:31:25.910
so that's going to be what's
estimated by least squares.
00:31:25.910 --> 00:31:30.260
And then the second line
immediately is going to be,
00:31:30.260 --> 00:31:32.300
well, either the
value of this thing--
00:31:35.320 --> 00:31:36.854
so let's call it t.
00:31:36.854 --> 00:31:38.520
And then there's going
to be the p value
00:31:38.520 --> 00:31:40.800
corresponding to this t.
00:31:40.800 --> 00:31:44.109
This is something that's just
routinely coming out because--
00:31:44.109 --> 00:31:46.650
oh, and then there's, of course,
the last line for people who
00:31:46.650 --> 00:31:49.740
cannot read numbers that's
really just giving you little
00:31:49.740 --> 00:31:50.240
stars.
00:31:53.850 --> 00:31:56.900
They're not stickers,
but that's close to it.
00:31:56.900 --> 00:31:59.110
And that's just saying,
well, I have three stars,
00:31:59.110 --> 00:32:01.420
I'm very significantly
different from 0.
00:32:01.420 --> 00:32:04.160
If I have 2 stars, I'm
moderately different from 0.
00:32:04.160 --> 00:32:07.090
And if I have 1 star,
it means, well, just
00:32:07.090 --> 00:32:10.450
give me another $1,000 and I
will sign that it's actually
00:32:10.450 --> 00:32:12.250
different from 0.
00:32:12.250 --> 00:32:14.950
So that's basically
the kind of outputs.
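[If you want to reproduce a table like this, one option (my assumption; the lecture does not name a package) is statsmodels, whose OLS summary prints one row per coefficient with the estimate, standard error, t value, and p value; the significance stars are what R's summary of an lm fit adds on top.

import statsmodels.api as sm

# reuse the hypothetical X (which already has an intercept column) and y from the sketch above
model = sm.OLS(y, X).fit()
print(model.summary())]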
00:32:14.950 --> 00:32:16.467
Everybody sees what
I mean by that?
00:32:16.467 --> 00:32:18.550
So what I mean, what I'm
trying to emphasize here,
00:32:18.550 --> 00:32:20.260
is that those things
are so routine when
00:32:20.260 --> 00:32:23.740
you run linear regression,
because people stuff in maybe--
00:32:23.740 --> 00:32:25.510
even if you have
200 observations,
00:32:25.510 --> 00:32:28.720
you're going to stuff in maybe
20 variables-- p equals 20.
00:32:28.720 --> 00:32:31.110
That's still a big number to
interpret what's going on.
00:32:31.110 --> 00:32:35.410
And it's nice for you if you
can actually trim some fat out.
00:32:35.410 --> 00:32:41.260
And so the problem is that when
you start doing this, and then
00:32:41.260 --> 00:32:44.386
this, and then
this, and then this,
00:32:44.386 --> 00:32:47.750
the probability that
you make a mistake
00:32:47.750 --> 00:32:52.040
in your test, the probability
that you erroneously
00:32:52.040 --> 00:32:55.170
reject the null here is 5%.
00:32:55.170 --> 00:32:56.540
Here, it's 5%.
00:32:56.540 --> 00:32:58.500
Here, it's 5%.
00:32:58.500 --> 00:33:00.120
Here, it's 5%.
00:33:00.120 --> 00:33:05.370
And at some point, if things
happen with 5% chances
00:33:05.370 --> 00:33:08.130
and you keep on doing
them over and over again,
00:33:08.130 --> 00:33:10.240
they're going to
start to happen.
00:33:10.240 --> 00:33:14.160
So you can see that
basically what's happening
00:33:14.160 --> 00:33:15.900
is that you actually
have an issue, which is
00:33:15.900 --> 00:33:18.750
that if you start
repeating those tests,
00:33:18.750 --> 00:33:23.000
you might not be at 5%
error at some point.
00:33:23.000 --> 00:33:25.940
And so what do you do
to prevent that?
00:33:25.940 --> 00:33:28.850
If you want to test all those
beta j's simultaneously,
00:33:28.850 --> 00:33:32.340
you have to do what's called
the Bonferroni correction.
00:33:32.340 --> 00:33:35.060
And the Bonferroni correction
follows from what's
00:33:35.060 --> 00:33:36.790
called a union bound.
00:33:36.790 --> 00:33:40.392
A union bound is actually-- so
if you're a computer scientist,
00:33:40.392 --> 00:33:41.600
you're very familiar with it.
00:33:41.600 --> 00:33:44.390
If you're a mathematician,
that's just, essentially,
00:33:44.390 --> 00:33:46.650
the third axiom of
probability that you see,
00:33:46.650 --> 00:33:48.140
that the probability
of the union
00:33:48.140 --> 00:33:50.392
is less than the sum
of the probabilities.
00:34:00.350 --> 00:34:02.660
That's the union bound.
00:34:02.660 --> 00:34:05.570
And you, of course, can
generalize that to more than 2.
00:34:05.570 --> 00:34:07.460
And that's exactly
what you're doing here.
00:34:07.460 --> 00:34:11.870
So let's see how we would
want to perform Bonferroni
00:34:11.870 --> 00:34:19.340
correction to control the
probability that they're all
00:34:19.340 --> 00:34:21.429
equal to 0 at the same time.
00:34:26.690 --> 00:34:29.960
So recall-- so if I want to
perform this test over there
00:34:29.960 --> 00:34:34.820
where I want to
test h0, that beta j
00:34:34.820 --> 00:34:40.560
is equal to 0 for all
j in some subset s.
00:34:43.860 --> 00:34:48.409
So think of s included in {1, ..., p}.
00:34:48.409 --> 00:34:51.139
You can think of it as being
all of {1, ..., p} if you want.
00:34:51.139 --> 00:34:53.960
It really doesn't matter. s is
something that's given to you.
00:34:53.960 --> 00:34:55.790
Maybe you want to test
the subset of them,
00:34:55.790 --> 00:34:57.890
but maybe you want
to test all of them.
00:34:57.890 --> 00:35:04.540
Versus h1, beta j is not
equal to 0 for some j in s.
00:35:07.850 --> 00:35:10.610
That's a test that tests
all these things at once.
00:35:10.610 --> 00:35:13.880
And if you actually look
at this table all at once,
00:35:13.880 --> 00:35:16.820
implicitly, you're performing
this test for all of the rows,
00:35:16.820 --> 00:35:19.262
for s equal to all of {1, ..., p}.
00:35:19.262 --> 00:35:19.970
You will do that.
00:35:19.970 --> 00:35:23.120
Whether you like it
or not, you will.
00:35:23.120 --> 00:35:27.110
So now let's look at what the
probability of type I error
00:35:27.110 --> 00:35:28.100
looks like.
00:35:28.100 --> 00:35:31.270
So I want the probability
of type 1 error,
00:35:31.270 --> 00:35:35.370
so that's the probability
when h0 is true.
00:35:35.370 --> 00:35:41.930
Well, so let me call psi j the
indicator that, say, beta j
00:35:41.930 --> 00:35:51.330
hat over sigma hat square
root gamma j exceeds
00:35:51.330 --> 00:35:54.636
q alpha over 2 of tn minus p.
00:35:54.636 --> 00:35:56.760
So we know that those are
the tests that I perform.
00:35:56.760 --> 00:35:59.160
Here, I just add
this extra index j
00:35:59.160 --> 00:36:02.400
to tell me that I'm actually
testing the j-th coefficient.
00:36:02.400 --> 00:36:06.490
So what I want is the
probability that under the null
00:36:06.490 --> 00:36:12.450
so that those are all
equal to 0--
00:36:12.450 --> 00:36:16.620
that I will reject in favor of the
alternative for one of them.
00:36:16.620 --> 00:36:25.510
So that's psi 1 is
equal to 1 or psi 2
00:36:25.510 --> 00:36:29.120
is equal to 1, all
the way to psi--
00:36:29.120 --> 00:36:31.474
well, let's just say that
this is the entire thing,
00:36:31.474 --> 00:36:32.390
because it's annoying.
00:36:36.247 --> 00:36:37.830
I mean, you can check
the slide if you
00:36:37.830 --> 00:36:39.150
want to do it more generally.
00:36:39.150 --> 00:36:44.140
But psi p is equal to 1--
00:36:44.140 --> 00:36:48.850
or, or-- everybody agrees
that this is the probability
00:36:48.850 --> 00:36:51.940
of type I error?
00:36:51.940 --> 00:36:54.010
So either I reject
this one, or this one,
00:36:54.010 --> 00:36:55.757
or this one, or this
one, or this one.
00:36:55.757 --> 00:36:58.090
And that's exactly when I'm
going to reject at least one
00:36:58.090 --> 00:36:59.580
of them.
00:36:59.580 --> 00:37:08.550
So this is the probability
of type I error.
00:37:08.550 --> 00:37:12.380
And what I want is to keep
this guy less than alpha.
00:37:15.780 --> 00:37:17.730
But what I know how to do is
control the probability
00:37:17.730 --> 00:37:20.190
that this guy is less than
alpha, that this guy is
00:37:20.190 --> 00:37:22.820
less than alpha, that this
guy is less than alpha.
00:37:22.820 --> 00:37:26.260
In particular, if all
these guys are disjoint,
00:37:26.260 --> 00:37:29.530
then this could really be the
sum of all these probabilities.
00:37:29.530 --> 00:37:42.400
So in the worst case, if psi j
equals 1 intersected with psi k
00:37:42.400 --> 00:37:46.540
equals 1 is the empty
set, so that means
00:37:46.540 --> 00:37:47.960
those are called disjoint sets.
00:37:51.210 --> 00:37:53.970
You've seen this terminology
in probability, right?
00:37:53.970 --> 00:38:00.800
So if those sets are
disjoint, for all of them,
00:38:00.800 --> 00:38:04.176
for all j different from
k, then this probability--
00:38:07.370 --> 00:38:14.590
well, let me write it as star--
00:38:14.590 --> 00:38:20.990
then star is equal to, well,
the probability under h0
00:38:20.990 --> 00:38:30.320
that psi 1 is equal to 1
plus the probability under h0
00:38:30.320 --> 00:38:33.110
that psi p is equal to 1.
00:38:33.110 --> 00:38:37.120
Now, if I use this test
with this alpha here,
00:38:37.120 --> 00:38:40.600
then this probability
is equal to alpha.
00:38:40.600 --> 00:38:43.185
This probability is
also equal to alpha.
00:38:43.185 --> 00:38:45.810
So the probability of type I error
is actually not equal to alpha.
00:38:45.810 --> 00:38:47.404
It's equal to?
00:38:47.404 --> 00:38:48.270
AUDIENCE: p alpha.
00:38:48.270 --> 00:38:49.395
PHILIPPE RIGOLLET: p alpha.
00:38:52.770 --> 00:38:54.240
So what is the solution here?
00:38:54.240 --> 00:38:58.470
Well, it's to run those
guys not with alpha,
00:38:58.470 --> 00:38:59.802
but with alpha over p.
00:39:02.400 --> 00:39:06.870
And if they do this, then this
guy is equal to alpha over p,
00:39:06.870 --> 00:39:09.036
this guy is equal
to alpha over p.
00:39:09.036 --> 00:39:10.410
And so when I get
those things, I
00:39:10.410 --> 00:39:13.260
get p times alpha over
p, which is just alpha.
00:39:17.170 --> 00:39:20.410
So all I do is, rather than
running each of the tests
00:39:20.410 --> 00:39:23.862
with probability of error--
00:39:23.862 --> 00:39:28.751
so that's a test at
level alpha over p.
00:39:32.500 --> 00:39:33.800
That's actually very stringent.
00:39:33.800 --> 00:39:35.500
If you think about
it for 1 second,
00:39:35.500 --> 00:39:41.542
even if you have only 5
variables-- p equals 5--
00:39:41.542 --> 00:39:43.000
and you started
with the tests, you
00:39:43.000 --> 00:39:45.610
wanted to do your tests at 5%.
00:39:45.610 --> 00:39:50.720
It forces you to do the test at
1% for each of those variables.
00:39:50.720 --> 00:39:53.350
If you have 10
variables, I mean, that
00:39:53.350 --> 00:39:55.460
starts to be very stringent.
00:39:55.460 --> 00:39:59.690
So it's going to be
harder and harder for you
00:39:59.690 --> 00:40:01.945
to conclude in favor of the alternative.
00:40:01.945 --> 00:40:03.320
Now, one thing I
need to tell you
00:40:03.320 --> 00:40:05.240
is that here I said,
if they are disjoint,
00:40:05.240 --> 00:40:07.230
then those
probabilities are equal.
00:40:07.230 --> 00:40:12.610
But if they are not
disjoint, the union bound
00:40:12.610 --> 00:40:14.360
tells me that the
probability of the union
00:40:14.360 --> 00:40:16.650
is less than the sum
of the probabilities.
00:40:16.650 --> 00:40:20.090
And so now I'm not
exactly equal to alpha,
00:40:20.090 --> 00:40:23.220
but I'm bounded by alpha.
00:40:23.220 --> 00:40:26.170
And that's why the
Bonferroni correction is something
00:40:26.170 --> 00:40:28.110
people are not super
comfortable with--
00:40:28.110 --> 00:40:30.600
because, in reality,
you never think
00:40:30.600 --> 00:40:32.610
that those tests are
going to be giving you
00:40:32.610 --> 00:40:34.890
completely disjoint things.
00:40:34.890 --> 00:40:36.480
I mean, why would it be?
00:40:36.480 --> 00:40:39.210
Why would it be that if
this guy is equal to 1,
00:40:39.210 --> 00:40:42.110
then all the other
ones are equal to 0?
00:40:42.110 --> 00:40:44.340
Why would it make any sense?
00:40:44.340 --> 00:40:45.860
So this is definitely
conservative,
00:40:45.860 --> 00:40:49.394
but the problem is that we don't
know how to do much better.
00:40:49.394 --> 00:40:51.060
I mean, we have a
formula that tells you
00:40:51.060 --> 00:40:54.120
the probability of the
union as some crazy sum that
00:40:54.120 --> 00:40:57.330
looks at all the intersections
and all these little things.
00:40:57.330 --> 00:41:01.340
I mean, it's the
generalization of p of a or b
00:41:01.340 --> 00:41:06.060
is equal to p of a plus
p of b minus probability
00:41:06.060 --> 00:41:08.997
of the intersection.
00:41:08.997 --> 00:41:10.830
But if you start doing
this for more than 2,
00:41:10.830 --> 00:41:12.060
it's super complicated.
00:41:12.060 --> 00:41:15.030
The number of terms
grows really fast.
00:41:15.030 --> 00:41:17.432
But most importantly,
even if you go here,
00:41:17.432 --> 00:41:19.140
you still need to
control the probability
00:41:19.140 --> 00:41:20.130
of the intersection.
00:41:20.130 --> 00:41:22.470
And those tests are not
necessarily independent.
00:41:22.470 --> 00:41:24.090
If they were independent,
then that would be easy.
00:41:24.090 --> 00:41:26.340
The probability of the intersection
would be the product
00:41:26.340 --> 00:41:27.840
of the probabilities.
00:41:27.840 --> 00:41:31.270
But those things are
super correlated,
00:41:31.270 --> 00:41:33.220
and so it doesn't really help.
00:41:33.220 --> 00:41:37.026
And so we'll see, when we talk
about high-dimensional stats
00:41:37.026 --> 00:41:38.650
towards the end, that
there's something
00:41:38.650 --> 00:41:41.600
called false discovery rate,
which is essentially saying,
00:41:41.600 --> 00:41:45.380
listen, if I want to
control this thing,
00:41:45.380 --> 00:41:47.260
if I really define my
probability of type I
00:41:47.260 --> 00:41:50.260
error as this, I want to
make sure that I never make
00:41:50.260 --> 00:41:52.300
this kind of error, I'm doomed.
00:41:52.300 --> 00:41:54.680
This is just not
going to happen.
00:41:54.680 --> 00:41:59.500
But I can revise what my
goals are in terms of errors
00:41:59.500 --> 00:42:02.570
that I make, and then I
will actually be able to do it.
00:42:02.570 --> 00:42:05.680
And what people are looking
at is false discovery rate.
00:42:05.680 --> 00:42:07.750
And this is called
family-wise error rate, which
00:42:07.750 --> 00:42:10.280
is a stronger thing to control.
00:42:10.280 --> 00:42:14.590
So this trick that
consists in replacing
00:42:14.590 --> 00:42:16.704
alpha by alpha over
the number of times
00:42:16.704 --> 00:42:18.370
you're going to be
performing your test,
00:42:18.370 --> 00:42:21.700
or alpha over the number
of terms in your union,
00:42:21.700 --> 00:42:24.164
is actually called the
Bonferroni correction.
00:42:32.160 --> 00:42:35.450
And that's something you use
when you have what's called--
00:42:35.450 --> 00:42:41.010
another key word here
is multiple testing,
00:42:41.010 --> 00:42:43.830
when you're trying to do
multiple tests simultaneously.
00:42:47.470 --> 00:42:49.840
And if s is not all of
{1, ..., p}, well, you just
00:42:49.840 --> 00:42:52.760
divide by the number of tests
that you are actually making.
00:42:52.760 --> 00:42:56.130
So if s is of size k
for some k less than p,
00:42:56.130 --> 00:42:59.172
you just divide alpha by
k and not by p, of course.
00:42:59.172 --> 00:43:00.630
I mean, you can
always divide by p,
00:43:00.630 --> 00:43:03.170
but you're going to make your
life harder for no reason.
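[A sketch of the correction in the running example (hypothetical names, continuing the earlier snippets): to test beta_j = 0 for all j in a set S at overall level alpha, run each individual t-test at level alpha over the size of S.

S = [1, 2]                                              # indices under test (illustrative)
k = len(S)
alpha = 0.05
q_bonf = student_t.ppf(1 - (alpha / k) / 2, df=n - p)   # two-sided quantile at level alpha/k
for j in S:
    t_j = beta_hat[j] / np.sqrt(sigma2_hat * gamma[j])
    print(j, abs(t_j) > q_bonf)                         # reject the j-th null, Bonferroni-corrected]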
00:43:11.010 --> 00:43:13.260
Any question about
Bonferroni correction?
00:43:18.260 --> 00:43:26.100
So one thing that is
maybe not as obvious
00:43:26.100 --> 00:43:30.150
as the test beta j equal to 0
versus beta j not equal to 0--
00:43:30.150 --> 00:43:32.190
and in particular,
what it means is
00:43:32.190 --> 00:43:36.360
that it's not going to come
up as a software output
00:43:36.360 --> 00:43:39.480
without you requesting
it-- unlike the standard test,
00:43:39.480 --> 00:43:40.897
which just comes out on its own.
00:43:40.897 --> 00:43:42.480
But there's other
tests that you might
00:43:42.480 --> 00:43:45.060
think of that might be
more complicated and more
00:43:45.060 --> 00:43:47.590
tailored to your
particular problem.
00:43:47.590 --> 00:43:52.560
And those tests are of
the form g times beta
00:43:52.560 --> 00:43:56.260
is equal to some lambda.
00:43:56.260 --> 00:44:05.810
So let's see, the
test we've just done,
00:44:05.810 --> 00:44:14.910
beta j equals 0 versus
beta j not equal to 0,
00:44:14.910 --> 00:44:23.100
is actually equivalent to
ej transpose beta equals
00:44:23.100 --> 00:44:28.020
0 versus ej transpose
beta not equal to 0.
00:44:28.020 --> 00:44:31.260
That was our claim.
00:44:31.260 --> 00:44:32.870
But now I don't
have to stop here.
00:44:32.870 --> 00:44:34.970
I don't have to
multiply by a vector
00:44:34.970 --> 00:44:36.890
and test if it's equal to 0.
00:44:36.890 --> 00:44:46.790
I can actually replace this
by some general matrix g
00:44:46.790 --> 00:44:54.449
and replace this guy by
some general vector lambda.
00:44:54.449 --> 00:44:56.240
And I'm not telling
you what the dimensions
00:44:56.240 --> 00:44:57.406
are because they're general.
00:44:57.406 --> 00:44:58.830
I can take whatever I want.
00:44:58.830 --> 00:45:00.260
Take your favorite
matrix, as long
00:45:00.260 --> 00:45:05.690
as the matrix can
multiply beta on the right,
00:45:05.690 --> 00:45:09.710
and lambda, take it to have
as many entries as rows of g,
00:45:09.710 --> 00:45:11.820
and then you can do that.
00:45:11.820 --> 00:45:14.280
I can always
formulate this test.
00:45:14.280 --> 00:45:16.680
What will this test encompass?
00:45:16.680 --> 00:45:18.780
Well, those are
kind of weird tests.
00:45:18.780 --> 00:45:22.170
So you can think
of things like, I
00:45:22.170 --> 00:45:30.440
want to test if beta 2 plus beta
3 is equal to 0, for example.
00:45:30.440 --> 00:45:40.770
Maybe I want to test if beta 5
minus 2 beta 6 is equal to 23.
00:45:40.770 --> 00:45:42.270
Well, that's weird.
00:45:42.270 --> 00:45:44.730
But why would you want to
test if beta 2 plus beta 3
00:45:44.730 --> 00:45:46.814
is equal to 0?
00:45:46.814 --> 00:45:48.730
Maybe you don't want to
know if the-- you know
00:45:48.730 --> 00:45:50.720
that the effect of
some gene is not 0.
00:45:50.720 --> 00:45:54.210
Maybe you know that this
gene affects this trait,
00:45:54.210 --> 00:45:56.790
but you want to know if
the effect of this gene
00:45:56.790 --> 00:45:59.262
is canceled by the
effect of that gene.
00:45:59.262 --> 00:46:00.970
And this is the kind
of stuff that you're
00:46:00.970 --> 00:46:02.178
going to be testing for that.
00:46:04.470 --> 00:46:06.150
Now, this guy is
much more artificial,
00:46:06.150 --> 00:46:08.770
and I don't have a bedtime
story to tell you around this.
00:46:08.770 --> 00:46:13.340
So those things can happen and
can be much more complicated.
00:46:13.340 --> 00:46:15.180
Now, here, notice
that the matrix g
00:46:15.180 --> 00:46:18.270
has one row for both
of the examples.
00:46:18.270 --> 00:46:20.580
But if I want to test if
those two things happen
00:46:20.580 --> 00:46:25.380
at the same time, then I
actually can take a matrix.
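[For instance, the two constraints above can be stacked as rows of g (a hedged sketch; the 0-based positions and the choice of 7 coefficients are mine, for illustration only):

import numpy as np

G = np.array([
    [0, 0, 1, 1, 0, 0, 0],     # row encoding beta_2 + beta_3 = 0
    [0, 0, 0, 0, 0, 1, -2],    # row encoding beta_5 - 2 beta_6 = 23
], dtype=float)
lam = np.array([0.0, 23.0])
# H0: G @ beta == lam   versus   H1: G @ beta != lam]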
00:46:25.380 --> 00:46:27.840
Another matrix
that can be useful
00:46:27.840 --> 00:46:34.620
is g equals the identity of
R^p and lambda is equal to 0.
00:46:34.620 --> 00:46:39.530
What am I doing
here in this case?
00:46:39.530 --> 00:46:41.480
What is this test testing?
00:46:41.480 --> 00:46:42.280
Sorry, this test.
00:46:44.959 --> 00:46:45.458
Yeah?
00:46:45.458 --> 00:46:46.820
AUDIENCE: Whether
or not beta is 0.
00:46:46.820 --> 00:46:49.278
PHILIPPE RIGOLLET: Yeah, we're
testing if the entire vector
00:46:49.278 --> 00:46:54.120
beta is equal to 0, because g
times beta is equal to beta,
00:46:54.120 --> 00:46:56.100
and we're asking
whether it's equal to 0.
00:47:00.375 --> 00:47:04.590
So the thing is, when
you want to actually test
00:47:04.590 --> 00:47:07.140
if beta is equal to
0, you're actually
00:47:07.140 --> 00:47:09.510
testing if your entire
model, everything you're
00:47:09.510 --> 00:47:12.070
doing in life, is just junk.
00:47:12.070 --> 00:47:13.920
This is just telling
you, actually,
00:47:13.920 --> 00:47:17.090
forget about this y is
x beta plus epsilon.
00:47:17.090 --> 00:47:18.360
y is really just epsilon.
00:47:18.360 --> 00:47:19.200
There's nothing.
00:47:19.200 --> 00:47:21.810
There's just some big noise
with some big variance,
00:47:21.810 --> 00:47:23.950
and there's nothing else.
00:47:23.950 --> 00:47:26.860
So it turns out that the
statistical software
00:47:26.860 --> 00:47:30.970
output that I wrote here spits
out an answer to this question.
00:47:30.970 --> 00:47:34.480
Just the last line,
usually, is doing this test.
00:47:34.480 --> 00:47:36.642
Does your model even make sense?
00:47:36.642 --> 00:47:39.100
And it's probably for people
to check whether they actually
00:47:39.100 --> 00:47:41.230
just mixed up their two data sets.
00:47:41.230 --> 00:47:43.450
Maybe they're actually
trying to predict--
00:47:43.450 --> 00:47:49.190
I don't know-- some credit
score from genomic data,
00:47:49.190 --> 00:47:51.040
and they just want to
make sure, maybe, that's
00:47:51.040 --> 00:47:53.050
not the right thing.
00:47:53.050 --> 00:47:56.500
So it turns out that the
machinery is exactly the same
00:47:56.500 --> 00:47:58.750
as the one we've just used.
00:47:58.750 --> 00:48:00.380
So we actually start from here.
00:48:05.542 --> 00:48:06.500
So let me pull this up.
00:48:12.930 --> 00:48:15.000
So we start from here.
00:48:15.000 --> 00:48:18.470
Beta hat was equal to
beta plus this guy.
00:48:21.780 --> 00:48:23.640
And the first thing we
did was to say, well,
00:48:23.640 --> 00:48:27.180
beta j is equal to this thing
because, well, beta j was
00:48:27.180 --> 00:48:29.250
just ej times beta.
00:48:29.250 --> 00:48:32.616
So rather than taking ej
here, let me just take g.
00:48:42.280 --> 00:48:45.220
Now, we said that
for any vector--
00:48:45.220 --> 00:48:47.840
well, that was trivial.
00:48:47.840 --> 00:48:50.350
So the thing we need to
know is, what is this thing?
00:48:50.350 --> 00:48:55.110
Well, this thing here,
what is this guy?
00:48:55.110 --> 00:48:59.870
It's also normal
and the mean is 0.
00:48:59.870 --> 00:49:03.510
Again, that's just using
properties of Gaussian vectors.
00:49:03.510 --> 00:49:06.430
And what is the
covariance matrix?
00:49:06.430 --> 00:49:09.290
Let's call this guy sigma so
that you can make an answer,
00:49:09.290 --> 00:49:11.660
you can formulate an answer.
00:49:11.660 --> 00:49:14.230
So what is the
distribution of-- what
00:49:14.230 --> 00:49:18.354
is the covariance of g
times some Gaussian 0 sigma?
00:49:18.354 --> 00:49:20.290
AUDIENCE: g sigma g transpose.
00:49:20.290 --> 00:49:22.500
PHILIPPE RIGOLLET: g
sigma g transpose, right?
00:49:22.500 --> 00:49:33.895
So that's gx transpose
x inverse g transpose.
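A quick simulation check of this Gaussian-vector property, assuming numpy; the design and g below are random placeholders:

import numpy as np

rng = np.random.default_rng(0)
n, p, k, n_sim = 50, 4, 2, 200_000

x = rng.normal(size=(n, p))                 # placeholder design matrix
sigma = np.linalg.inv(x.T @ x)              # covariance of beta hat, up to sigma squared
g = rng.normal(size=(k, p))                 # any k x p matrix

z = rng.multivariate_normal(np.zeros(p), sigma, size=n_sim)
emp = np.cov((z @ g.T).T)                   # empirical covariance of g z
print(np.abs(emp - g @ sigma @ g.T).max())  # near 0: Cov(g z) = g sigma g transpose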
00:49:38.650 --> 00:49:41.780
Now, I'm not going to be
able to go much farther.
00:49:41.780 --> 00:49:44.900
I mean, I made this
very acute observation
00:49:44.900 --> 00:49:47.790
that ej transpose the matrix
times ej is the j-th angle
00:49:47.790 --> 00:49:48.290
element.
00:49:48.290 --> 00:49:50.450
Now, if I have a general matrix,
the price to pay is that I
00:49:50.450 --> 00:49:52.949
cannot just shrink this thing
any further because I'm trying
00:49:52.949 --> 00:49:54.640
to be abstract.
00:49:54.640 --> 00:49:56.487
And so I'm almost there.
00:49:56.487 --> 00:49:58.070
The only thing that
happened last time
00:49:58.070 --> 00:50:00.050
is that when this
was ej, under H0,
00:50:00.050 --> 00:50:03.380
we knew that this was
equal to 0 under the null.
00:50:03.380 --> 00:50:08.790
But under the null,
what is this equal to?
00:50:12.510 --> 00:50:13.440
AUDIENCE: Lambda.
00:50:13.440 --> 00:50:15.106
PHILIPPE RIGOLLET:
Lambda, which I know.
00:50:15.106 --> 00:50:16.880
I mean, I wrote my thing.
00:50:16.880 --> 00:50:19.730
And in the couple instances
I just showed you,
00:50:19.730 --> 00:50:22.700
including this one over there
on top, lambda was equal to 0.
00:50:22.700 --> 00:50:24.620
But in general, it
can be any lambda.
00:50:24.620 --> 00:50:27.890
But what's key about this lambda
is that I actually know it.
00:50:27.890 --> 00:50:31.940
That's the hypothesis
I'm formulating.
00:50:31.940 --> 00:50:34.340
So now I'm going to have to
be a little more careful when
00:50:34.340 --> 00:50:36.650
I want to build the
distribution of g beta hat.
00:50:36.650 --> 00:50:39.380
I need to actually
subtract this lambda.
00:50:39.380 --> 00:50:40.970
So now we go from
this, and we say,
00:50:40.970 --> 00:50:47.040
well, g beta hat
minus lambda follows
00:50:47.040 --> 00:50:57.730
some N(0, sigma squared
g x transpose x
00:50:57.730 --> 00:51:00.660
inverse g transpose).
00:51:04.070 --> 00:51:06.469
So that's true.
00:51:06.469 --> 00:51:08.510
Let's assume-- let's go
straight to the case when
00:51:08.510 --> 00:51:10.410
we don't know what sigma is.
00:51:10.410 --> 00:51:11.970
So what I'm going
to be interested in
00:51:11.970 --> 00:51:26.360
is g beta hat minus lambda
divided by sigma hat.
00:51:26.360 --> 00:51:29.870
And that's going to follow some
Gaussian that has this thing,
00:51:29.870 --> 00:51:37.660
gx transpose x
inverse g transpose.
00:51:37.660 --> 00:51:40.780
So now, what did I do last time?
00:51:40.780 --> 00:51:45.010
So clearly, the quantiles
of this distribution
00:51:45.010 --> 00:51:48.000
are-- well, OK, what is the
size of this distribution?
00:51:48.000 --> 00:51:52.848
Well, I need to tell
you that g is an--
00:51:52.848 --> 00:51:54.724
what did I take here?
00:51:54.724 --> 00:51:57.180
AUDIENCE: 1 divided by
sigma, not sigma hat.
00:51:57.180 --> 00:51:58.930
PHILIPPE RIGOLLET: Oh,
yeah, you're right.
00:51:58.930 --> 00:52:00.440
So let me write it like this.
00:52:05.750 --> 00:52:15.800
Well, let me write
it like this--
00:52:15.800 --> 00:52:17.253
sigma squared over sigma.
00:52:21.659 --> 00:52:23.325
So let's forget about
the size of g now.
00:52:23.325 --> 00:52:25.120
Let's just think
of any general g.
00:52:27.730 --> 00:52:30.820
When g was a vector,
what was nice
00:52:30.820 --> 00:52:35.410
is that this guy was just a
scalar, just one number.
00:52:35.410 --> 00:52:38.012
And so if I wanted to get rid
of this in the right-hand side,
00:52:38.012 --> 00:52:39.970
all I had to do was to
divide it by this thing.
00:52:39.970 --> 00:52:41.464
We called it gamma j.
00:52:41.464 --> 00:52:43.630
And we just had to divide
by square root of gamma j,
00:52:43.630 --> 00:52:45.820
and that would be gone.
00:52:45.820 --> 00:52:48.450
Now I have a matrix.
00:52:48.450 --> 00:52:50.100
So I need to get
rid of this matrix
00:52:50.100 --> 00:52:55.016
somehow because, clearly, the
quantiles of this distribution
00:52:55.016 --> 00:52:56.640
are not going to be
written in the back
00:52:56.640 --> 00:52:59.170
of a book for any value
of g and any value of x.
00:52:59.170 --> 00:53:01.660
So I need to standardize
before I can read anything out
00:53:01.660 --> 00:53:03.860
of a table.
00:53:03.860 --> 00:53:04.820
So how do we do it?
00:53:04.820 --> 00:53:14.880
Well, we just form
this guy here.
00:53:14.880 --> 00:53:18.770
So what we know is that if--
00:53:18.770 --> 00:53:21.120
so here's the claim,
again, another
00:53:21.120 --> 00:53:23.520
claim about Gaussian vector.
00:53:23.520 --> 00:53:43.220
If x follows some n0 sigma,
then x transpose sigma inverse x
00:53:43.220 --> 00:53:44.596
follows some chi squared.
00:53:48.330 --> 00:53:51.930
And here, it's going to depend
on what is the dimension here.
00:53:51.930 --> 00:53:56.160
So if I make this a k by k, a
k-dimensional Gaussian vector,
00:53:56.160 --> 00:53:57.497
this is chi squared k.
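Here is a small simulation of that claim, assuming numpy and scipy; the covariance below is an arbitrary positive-definite placeholder:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, n_sim = 3, 100_000

a = rng.normal(size=(k, k))
sigma = a @ a.T + k * np.eye(k)             # arbitrary positive-definite covariance
x = rng.multivariate_normal(np.zeros(k), sigma, size=n_sim)

quad = np.einsum('ij,jl,il->i', x, np.linalg.inv(sigma), x)  # x^T sigma^{-1} x per draw
print(stats.kstest(quad, 'chi2', args=(k,)).pvalue)          # should not be tiny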
00:54:02.467 --> 00:54:04.455
Where have we used that before?
00:54:08.928 --> 00:54:09.922
Yeah?
00:54:09.922 --> 00:54:10.850
AUDIENCE: Wald's test.
00:54:10.850 --> 00:54:13.350
PHILIPPE RIGOLLET: Wald's test,
that's exactly what we used.
00:54:13.350 --> 00:54:16.480
Wald's test had a chi
squared that was showing up.
00:54:16.480 --> 00:54:18.430
And the way we
made it show up was
00:54:18.430 --> 00:54:20.640
by taking the
asymptotic variance,
00:54:20.640 --> 00:54:24.852
taking its inverse, which, in
this framework, was called--
00:54:24.852 --> 00:54:25.710
AUDIENCE: Fisher.
00:54:25.710 --> 00:54:27.300
PHILIPPE RIGOLLET:
Fisher information.
00:54:27.300 --> 00:54:31.410
And then we pre- and
postmultiply by this thing.
00:54:31.410 --> 00:54:33.150
So this is the key.
00:54:33.150 --> 00:54:35.400
And so now, it tells
me exactly, when
00:54:35.400 --> 00:54:38.190
I start from this guy that has
this multivariate Gaussian,
00:54:38.190 --> 00:54:40.050
it tells me how to
turn it into something
00:54:40.050 --> 00:54:42.720
that has a distribution
which is pivotal.
00:54:42.720 --> 00:54:45.849
Chi squared k is completely
pivotal, does not depend
00:54:45.849 --> 00:54:46.890
on anything I don't know.
00:55:03.810 --> 00:55:06.400
The way I go from here
is by saying, well, now,
00:55:06.400 --> 00:55:13.380
I look at g beta hat
minus lambda transpose,
00:55:13.380 --> 00:55:15.390
and now I need to
look at the inverse
00:55:15.390 --> 00:55:16.600
of the matrix over there.
00:55:16.600 --> 00:55:29.950
So it's g x transpose x
inverse g transpose, all inverse, g beta
00:55:29.950 --> 00:55:32.510
hat minus lambda.
00:55:35.647 --> 00:55:36.855
This guy is going to follow--
00:55:39.700 --> 00:55:42.891
well, here, I need to actually
divide by sigma in this case--
00:55:56.540 --> 00:56:00.560
if g is k times p.
00:56:00.560 --> 00:56:04.370
So what I mean here is
just that's the same k.
00:56:04.370 --> 00:56:07.250
The k that shows up is
the number of constraints
00:56:07.250 --> 00:56:08.840
that I have in my tests.
00:56:13.340 --> 00:56:20.690
So now, if I go from
here to using sigma hat,
00:56:20.690 --> 00:56:23.180
the key thing to observe is
that this guy is actually
00:56:23.180 --> 00:56:25.100
not a Gaussian.
00:56:25.100 --> 00:56:28.410
I'm not going to have a student
t-distribution that shows up.
00:56:36.290 --> 00:57:03.850
So that implies that if
I take the same thing,
00:57:03.850 --> 00:57:06.450
so now I just go from
sigma to sigma hat,
00:57:06.450 --> 00:57:08.140
then this thing is of the form--
00:57:12.620 --> 00:57:17.280
well, this chi squared k divided
by the chi squared that shows
00:57:17.280 --> 00:57:20.590
up in the denominator
of the t-distribution,
00:57:20.590 --> 00:57:28.270
which is square root of--
00:57:28.270 --> 00:57:30.060
oh, I should not
divide by sigma--
00:57:30.060 --> 00:57:31.510
so this is sigma squared, right?
00:57:31.510 --> 00:57:32.567
AUDIENCE: Yeah.
00:57:32.567 --> 00:57:34.400
PHILIPPE RIGOLLET: So
this is sigma squared.
00:57:34.400 --> 00:57:40.550
So this is of the form
chi squared k divided by chi squared n
00:57:40.550 --> 00:57:44.180
minus p divided by n minus p.
00:57:44.180 --> 00:57:48.370
So that's the same denominator
that I saw in my t-test.
00:57:48.370 --> 00:57:49.955
The numerator has
changed, though.
00:57:49.955 --> 00:57:52.080
The numerator is now this
chi squared and no longer
00:57:52.080 --> 00:57:52.580
a Gaussian.
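The denominator claim -- that (n - p) sigma hat squared over sigma squared is a chi squared with n - p degrees of freedom -- can also be checked by simulation; a sketch, assuming numpy and scipy, with all sizes hypothetical:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, sigma, n_sim = 30, 4, 2.0, 50_000

x = rng.normal(size=(n, p))
h = x @ np.linalg.inv(x.T @ x) @ x.T        # projection onto the column span of x

eps = sigma * rng.normal(size=(n_sim, n))
resid = eps - eps @ h                       # residuals (I - H) eps, for any true beta
draws = np.einsum('ij,ij->i', resid, resid) / sigma**2

print(stats.kstest(draws, 'chi2', args=(n - p,)).pvalue)  # should not be tiny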
00:57:55.430 --> 00:58:00.350
But this distribution is
actually pivotal, as long
00:58:00.350 --> 00:58:02.210
as we can guarantee
that there's no hidden
00:58:02.210 --> 00:58:08.550
parameter in the correlation
between the two chi squares.
00:58:08.550 --> 00:58:13.470
So again, as all statements
of independence in this class,
00:58:13.470 --> 00:58:15.930
I will just give
it to you for free.
00:58:15.930 --> 00:58:20.660
Those two things, I claim--
00:58:20.660 --> 00:58:29.635
so OK, let's just admit
these are independent.
00:58:37.370 --> 00:58:38.730
We're almost there.
00:58:38.730 --> 00:58:41.627
This could be a
distribution that's pivotal.
00:58:41.627 --> 00:58:43.960
But there's something that's
a little unbalanced about it, which
00:58:43.960 --> 00:58:46.160
is that this guy is divided
by its number of degrees
00:58:46.160 --> 00:58:48.980
of freedom, but this guy is
not divided by its number
00:58:48.980 --> 00:58:50.670
of degrees of freedom.
00:58:50.670 --> 00:58:53.350
And so we just have
to make the extra step
00:58:53.350 --> 00:58:57.280
that if I divide this guy by k,
and this guy is a chi squared
00:58:57.280 --> 00:59:00.080
divided by k, if I
divide this guy by k,
00:59:00.080 --> 00:59:03.900
then I get this
guy divided by k.
00:59:03.900 --> 00:59:05.442
And now it looks--
00:59:05.442 --> 00:59:06.900
I mean, it doesn't
change anything.
00:59:06.900 --> 00:59:09.020
I've just divided
by a fixed number.
00:59:09.020 --> 00:59:11.200
But it just looks more elegant--
00:59:11.200 --> 00:59:13.650
it is the ratio of
two independent chi
00:59:13.650 --> 00:59:15.420
squareds that are
individually divided
00:59:15.420 --> 00:59:16.920
by the number of
degrees of freedom.
00:59:20.840 --> 00:59:31.100
And this has a name,
and it's called a Fisher
00:59:31.100 --> 00:59:34.190
or F-distribution.
00:59:34.190 --> 00:59:40.740
So unlike William
Gosset, who was not
00:59:40.740 --> 00:59:43.200
allowed to use his own name
and used the name student,
00:59:43.200 --> 00:59:45.000
Fisher was allowed
to use his own name,
00:59:45.000 --> 00:59:47.220
and that's called the
Fisher distribution.
00:59:47.220 --> 00:59:52.470
And the Fisher distribution
has now 2 parameters,
00:59:52.470 --> 00:59:53.910
a pair of degrees of freedom--
00:59:53.910 --> 00:59:57.180
1 for the numerator and
1 for the denominator.
00:59:57.180 --> 01:00:01.217
So F, for Fisher distribution--
01:00:07.430 --> 01:00:13.450
so F is equal to the
ratio of a chi squared p/p
01:00:13.450 --> 01:00:16.960
and a chi squared q/q.
01:00:16.960 --> 01:00:27.320
So that's F p, q, where the 2
chi squareds are independent.
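A sampling sanity check of this definition, assuming numpy and scipy, with hypothetical degrees of freedom:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p_df, q_df, n_sim = 3, 12, 100_000

num = rng.chisquare(p_df, n_sim) / p_df     # chi squared p, divided by p
den = rng.chisquare(q_df, n_sim) / q_df     # an independent chi squared q, divided by q
print(stats.kstest(num / den, 'f', args=(p_df, q_df)).pvalue)  # should not be tiny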
01:00:32.970 --> 01:00:35.160
Is that clear what
I'm defining here?
01:00:35.160 --> 01:00:41.460
So this is basically what plays
the role of t-distributions
01:00:41.460 --> 01:00:43.870
when you're testing more
than 1 parameter at a time.
01:00:43.870 --> 01:00:45.630
So you basically replace--
01:00:45.630 --> 01:00:47.190
the normal that was
in the numerator,
01:00:47.190 --> 01:00:49.023
you replace it by chi
squared because you're
01:00:49.023 --> 01:00:51.780
testing if 2 vectors are
simultaneously close.
01:00:51.780 --> 01:00:55.340
And the way you do it is by
looking at their squared norm.
01:00:55.340 --> 01:00:57.800
And that's how the
chi squared shows up.
01:01:00.632 --> 01:01:08.240
Quick remark-- are those
things really very different?
01:01:08.240 --> 01:01:12.090
How can I relate a chi
squared with a t-distribution?
01:01:12.090 --> 01:01:19.151
Well, if t follows, say, a t--
01:01:19.151 --> 01:01:20.400
I don't know, let's call it q.
01:01:24.080 --> 01:01:28.330
So that means that
t, let me look at--
01:01:28.330 --> 01:01:38.200
t is some n01 divided by
the square root of a chi
01:01:38.200 --> 01:01:40.650
squared q/q.
01:01:44.820 --> 01:01:48.926
That's the distribution of t.
01:01:48.926 --> 01:01:51.300
So if I look at the square of
the-- the distribution of t
01:01:51.300 --> 01:01:53.600
squared--
01:01:53.600 --> 01:01:55.010
let me put it here--
01:01:58.300 --> 01:02:06.280
well, that's the square of some
n01 divided by chi squared q/q.
01:02:09.690 --> 01:02:11.900
Agreed?
01:02:11.900 --> 01:02:13.470
I just removed the
square root here,
01:02:13.470 --> 01:02:15.810
and I took the square
of the Gaussian.
01:02:15.810 --> 01:02:20.030
But what is the distribution
of a square of a Gaussian?
01:02:20.030 --> 01:02:21.530
AUDIENCE: Chi squared
with 1 degree.
01:02:21.530 --> 01:02:25.140
PHILIPPE RIGOLLET: Chi squared
with 1 degree of freedom.
01:02:25.140 --> 01:02:27.284
So this is a chi squared
with 1 degree of freedom.
01:02:27.284 --> 01:02:28.700
And in particular,
it's also a chi
01:02:28.700 --> 01:02:31.836
squared with 1 degree
of freedom divided by 1.
01:02:31.836 --> 01:02:38.860
So t-squared, in the end,
has an F-distribution with 1
01:02:38.860 --> 01:02:41.300
and q degrees of freedom.
01:02:41.300 --> 01:02:43.589
So those two things are
actually very similar.
01:02:43.589 --> 01:02:45.130
The only thing that's
going to change
01:02:45.130 --> 01:02:48.280
is that, since we're actually
looking at, typically,
01:02:48.280 --> 01:02:51.164
absolute values of t
when we do our tests,
01:02:51.164 --> 01:02:52.830
it's going to be
exactly the same thing.
01:02:52.830 --> 01:02:54.330
The quantiles of
one guy are going
01:02:54.330 --> 01:02:56.496
to be, essentially, the
square root of the quantiles
01:02:56.496 --> 01:02:57.310
of the other guy.
01:02:57.310 --> 01:03:00.390
That's all it's going to be.
01:03:00.390 --> 01:03:07.360
So if my test is psi is
equal to the indicator
01:03:07.360 --> 01:03:16.010
that absolute t exceeds q alpha
over 2 of tq, for example,
01:03:16.010 --> 01:03:19.990
then it's equal to the
indicator that t-squared
01:03:19.990 --> 01:03:26.030
exceeds the square of q
alpha over 2 of tq,
01:03:26.030 --> 01:03:28.770
because I had the
absolute value here,
01:03:28.770 --> 01:03:33.110
which is equal to the
indicator that t squared is
01:03:33.110 --> 01:03:35.580
greater than q alpha over 2.
01:03:35.580 --> 01:03:37.000
And now this time, it's an F1q.
01:03:39.880 --> 01:03:42.340
So in a way, those two things
belong to the same family.
01:03:42.340 --> 01:03:44.680
They really are a natural
generalization of each other.
01:03:44.680 --> 01:03:47.310
I mean, at least the F-test is
a generalization of the t-test.
01:03:51.230 --> 01:03:54.480
And so now I can perform my test
just like it's written here.
01:03:54.480 --> 01:03:56.250
I just formed this
guy, and then I
01:03:56.250 --> 01:03:58.860
compare it against the
quantile of an F-distribution.
01:03:58.860 --> 01:04:01.440
Notice, there's no
absolute value--
01:04:01.440 --> 01:04:04.740
oh, yeah, I forgot,
this is actually
01:04:04.740 --> 01:04:09.761
q alpha because the F-statistic
is already positive.
01:04:09.761 --> 01:04:11.760
So I'm not going to look
between left and right,
01:04:11.760 --> 01:04:15.240
I'm just going to look
whether it's too large or not.
01:04:15.240 --> 01:04:18.030
So that's by definition.
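Putting the whole machinery together, here is a minimal sketch of the test, assuming numpy and scipy; the function name and interface are made up for illustration:

import numpy as np
from scipy import stats

def f_test(y, x, g, lam, alpha=0.05):
    # Test H0: g beta = lam in the model y = x beta + eps, eps ~ N(0, sigma^2 I)
    n, p = x.shape
    k = g.shape[0]                           # number of constraints
    xtx_inv = np.linalg.inv(x.T @ x)
    beta_hat = xtx_inv @ x.T @ y             # least squares estimator
    resid = y - x @ beta_hat
    sigma2_hat = resid @ resid / (n - p)     # unbiased estimate of sigma squared
    d = g @ beta_hat - lam
    stat = d @ np.linalg.solve(g @ xtx_inv @ g.T, d) / (k * sigma2_hat)
    # One-sided: reject when the statistic exceeds the 1 - alpha quantile of F(k, n - p)
    return stat, stat > stats.f.ppf(1 - alpha, k, n - p)

With g = np.eye(p) and lam = np.zeros(p), this would be the last-line "is my whole model junk" test mentioned earlier.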
01:04:18.030 --> 01:04:19.380
So you can check--
01:04:19.380 --> 01:04:21.120
if you look at a
table for student
01:04:21.120 --> 01:04:23.025
and you look at a
table for F1q,
01:04:23.025 --> 01:04:25.650
you're going
to have to move from one column
01:04:25.650 --> 01:04:26.610
to the other
because you're going
01:04:26.610 --> 01:04:28.475
to have to move from
alpha over 2 to alpha,
01:04:28.475 --> 01:04:31.670
but one is going to be
the square root of the other one,
01:04:31.670 --> 01:04:34.370
just like the chi squared is
the square of the Gaussian.
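That column-to-column relation is easy to verify with scipy, using a hypothetical level and degrees of freedom:

from scipy import stats

q, alpha = 17, 0.05
t_cut = stats.t.ppf(1 - alpha / 2, q)       # two-sided cutoff from the t table
f_cut = stats.f.ppf(1 - alpha, 1, q)        # one-sided cutoff from the F(1, q) table
print(t_cut**2, f_cut)                      # the same number, up to rounding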
01:04:34.370 --> 01:04:36.828
I mean, if you look at the chi
squared 1 degree of freedom,
01:04:36.828 --> 01:04:40.441
you will see the same
thing as the Gaussians.
01:04:47.035 --> 01:04:53.594
So I'm actually going to
start with the last one
01:04:53.594 --> 01:04:55.760
because you've been asking
a few questions about why
01:04:55.760 --> 01:04:58.450
is my design deterministic.
01:04:58.450 --> 01:04:59.660
So there's many answers.
01:04:59.660 --> 01:05:01.955
Some are philosophical.
01:05:01.955 --> 01:05:04.330
But one that's actually--
well, there's the one that says
01:05:04.330 --> 01:05:07.106
you cannot do any of this if
you don't condition on x--
01:05:07.106 --> 01:05:09.730
because all
of the statements that we made
01:05:09.730 --> 01:05:12.850
here, for example, just the
fact that this is chi squared,
01:05:12.850 --> 01:05:15.010
if those guys start to
be random variables,
01:05:15.010 --> 01:05:17.010
then it's clearly not
going to be a chi squared.
01:05:17.010 --> 01:05:19.000
I mean, it cannot be chi
squared both when those guys are
01:05:19.000 --> 01:05:20.624
deterministic and
when they are random.
01:05:20.624 --> 01:05:22.100
I mean, things change.
01:05:22.100 --> 01:05:25.060
So that's just maybe
[INAUDIBLE] check statement.
01:05:25.060 --> 01:05:27.580
But I think the one that
really matters is that--
01:05:27.580 --> 01:05:30.450
remember when we
did the t-test, we
01:05:30.450 --> 01:05:32.195
had this gamma j that showed up.
01:05:32.195 --> 01:05:34.910
Gamma j was playing the
role of the variance.
01:05:34.910 --> 01:05:36.904
So here, the variance,
you never think of--
01:05:36.904 --> 01:05:39.070
I mean, we'll talk about
this in the Bayesian setup,
01:05:39.070 --> 01:05:41.320
but so far, we haven't
thought of the variance
01:05:41.320 --> 01:05:42.390
as a random variable.
01:05:42.390 --> 01:05:45.580
And so here, your x's really
are the parameters of your data.
01:05:45.580 --> 01:05:48.110
And the diagonal elements
of x transpose x inverse
01:05:48.110 --> 01:05:49.787
actually tell you
what the variance is.
01:05:49.787 --> 01:05:52.120
So that's also one reason why
you should think of your x
01:05:52.120 --> 01:05:53.530
as being a deterministic number.
01:05:53.530 --> 01:05:55.450
They are, in a way,
things that change
01:05:55.450 --> 01:05:56.740
the geometry of your problem.
01:05:56.740 --> 01:05:58.450
They just say, oh,
let me look at it
01:05:58.450 --> 01:06:01.180
from the perspective of x.
01:06:01.180 --> 01:06:03.070
Actually, for that
matter, we didn't really
01:06:03.070 --> 01:06:06.000
spend much time
commenting on what
01:06:06.000 --> 01:06:09.730
is the effect of x on gamma.
01:06:09.730 --> 01:06:19.910
So remember, gamma j, so that
was the variance parameter.
01:06:19.910 --> 01:06:23.780
So we should try to understand
what x's lead to big variance
01:06:23.780 --> 01:06:26.030
and what x's lead
to small variance.
01:06:26.030 --> 01:06:28.610
That would be nice.
01:06:28.610 --> 01:06:31.550
Well, if this is the
identity matrix--
01:06:31.550 --> 01:06:35.140
let's say identity over n,
which is the natural thing
01:06:35.140 --> 01:06:38.620
to look at, because we want
this thing to scale like 1/n--
01:06:38.620 --> 01:06:39.820
then this is just 1/n.
01:06:39.820 --> 01:06:41.200
We're back to the original case.
01:06:41.200 --> 01:06:41.700
Yes?
01:06:41.700 --> 01:06:43.200
AUDIENCE: Shouldn't
that be inverse?
01:06:43.200 --> 01:06:45.500
PHILIPPE RIGOLLET: Yeah,
thank you. x transpose x inverse, yes.
01:06:45.500 --> 01:06:48.590
So if this is the identity,
then, well, the inverse
01:06:48.590 --> 01:06:53.180
is-- let's say just this guy
here is n times this guy.
01:06:53.180 --> 01:06:56.210
So then the inverse is 1/n.
01:06:56.210 --> 01:06:59.270
So in this case, that means
that gamma j is equal to 1/n
01:06:59.270 --> 01:07:02.240
and we're back to
the theta hat theta
01:07:02.240 --> 01:07:06.450
case, the basic
one-dimensional thing.
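A short numpy check of that special case -- columns orthogonal with squared norm n, so every gamma j is exactly 1/n:

import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 3
x = np.linalg.qr(rng.normal(size=(n, p)))[0] * np.sqrt(n)  # x^T x = n * identity
print(np.diag(np.linalg.inv(x.T @ x)))                     # every entry is 1/n = 0.01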
01:07:06.450 --> 01:07:11.390
What does it mean for a
matrix for when I take its--
01:07:11.390 --> 01:07:13.230
yeah, so that's of dimension p.
01:07:13.230 --> 01:07:15.420
But when I take its transpose--
01:07:15.420 --> 01:07:17.394
so forget about the
scaling by n right now.
01:07:17.394 --> 01:07:19.060
This is just a matter
of scaling things.
01:07:19.060 --> 01:07:20.840
I can always multiply
my x's so that I
01:07:20.840 --> 01:07:22.584
have this thing that shows up.
01:07:22.584 --> 01:07:24.750
But when I have a matrix,
if I look at x transpose x
01:07:24.750 --> 01:07:26.550
and I get something which
is the identity, how
01:07:26.550 --> 01:07:27.570
do I call this matrix?
01:07:31.980 --> 01:07:32.970
AUDIENCE: Orthonormal?
01:07:32.970 --> 01:07:34.470
PHILIPPE RIGOLLET:
Orthogonal, yeah.
01:07:34.470 --> 01:07:35.790
Orthonormal or orthogonal.
01:07:35.790 --> 01:07:37.919
So you call this thing
an orthogonal matrix.
01:07:37.919 --> 01:07:39.960
And when it's an orthogonal
matrix, what it means
01:07:39.960 --> 01:07:42.540
is that the--
01:07:42.540 --> 01:07:46.230
so this matrix here, if you
look at the matrix x transpose x,
01:07:46.230 --> 01:07:48.390
the entries of this matrix
are the inner products
01:07:48.390 --> 01:07:49.890
between the columns of x.
01:07:49.890 --> 01:07:51.240
That's what's happening.
01:07:51.240 --> 01:07:52.800
You can write it,
and you will see
01:07:52.800 --> 01:07:55.890
that the entries of this
matrix are inner products.
01:07:55.890 --> 01:07:59.860
If it's the identity, that
means that you get some 1's
01:07:59.860 --> 01:08:03.170
and a bunch of 0's, it means
that all the inner products
01:08:03.170 --> 01:08:05.910
between 2 different
columns are actually 0.
01:08:05.910 --> 01:08:07.980
What it means is
that this matrix x
01:08:07.980 --> 01:08:09.990
is an orthonormal
basis for your space.
01:08:09.990 --> 01:08:12.100
The columns form an
orthonormal basis.
01:08:12.100 --> 01:08:15.680
So they're basically as far
from each other as they can be.
01:08:15.680 --> 01:08:20.260
Now, if I start making those
guys closer and closer,
01:08:20.260 --> 01:08:21.939
then I'm starting
to have some issues.
01:08:21.939 --> 01:08:24.490
x transpose x is not
going to be the identity.
01:08:24.490 --> 01:08:27.330
I'm going to start to
have some non-0 entries.
01:08:27.330 --> 01:08:32.551
But if they all remain
of norm 1, then--
01:08:32.551 --> 01:08:34.880
oh, sorry, so that's
for the inverse.
01:08:34.880 --> 01:08:37.899
So I first start putting some
stuff here, which is non-0,
01:08:37.899 --> 01:08:39.550
by taking my x's.
01:08:39.550 --> 01:08:44.269
Rather than having
this, I move to this.
01:08:44.269 --> 01:08:46.310
Now I'm going to start
seeing some non-0 entries.
01:08:46.310 --> 01:08:49.410
And when I'm going to take
the inverse of this matrix,
01:08:49.410 --> 01:08:52.781
the diagonal elements are
going to start to blow up.
01:08:52.781 --> 01:08:56.010
Oh, sorry, the diagonals start
to become smaller and smaller.
01:08:56.010 --> 01:08:57.399
So when I take the inverse--
01:08:57.399 --> 01:09:01.399
no, sorry, the diagonal
elements are going to blow up.
01:09:01.399 --> 01:09:05.430
And so what it means is that the
variance is going to blow up.
01:09:05.430 --> 01:09:06.899
And that's essentially
telling you
01:09:06.899 --> 01:09:09.090
that if you get to
choose your x's, you
01:09:09.090 --> 01:09:12.582
want to take them as
orthogonal as you can.
01:09:12.582 --> 01:09:14.790
But if you don't, then you
just have to deal with it,
01:09:14.790 --> 01:09:18.950
and it will have a significant
impact on your estimation
01:09:18.950 --> 01:09:19.620
performance.
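A small numpy experiment showing that blow-up, with a hypothetical two-column design whose columns get more and more correlated:

import numpy as np

rng = np.random.default_rng(3)
n = 200
z = rng.normal(size=n)

for rho in (0.0, 0.9, 0.99):
    w = rng.normal(size=n)
    x = np.column_stack([z, rho * z + np.sqrt(1 - rho**2) * w])  # nearly collinear as rho -> 1
    print(rho, np.diag(np.linalg.inv(x.T @ x)))                  # the gamma j's blow up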
01:09:19.620 --> 01:09:25.010
And that's also why,
routinely, statistical software
01:09:25.010 --> 01:09:26.885
is going to spit out
this value here for you.
01:09:26.885 --> 01:09:28.884
And you're going to have--
well, actually square
01:09:28.884 --> 01:09:30.410
root of this value.
01:09:30.410 --> 01:09:32.440
And it's going to tell
you, essentially--
01:09:32.440 --> 01:09:34.939
you're going to know how much
randomness, how much variation
01:09:34.939 --> 01:09:36.952
you have in this
particular parameter
01:09:36.952 --> 01:09:37.910
that you're estimating.
01:09:37.910 --> 01:09:41.564
So if gamma j is
large, then you're
01:09:41.564 --> 01:09:43.189
going to have wide
confidence intervals
01:09:43.189 --> 01:09:45.740
and your tests are not
going to reject very much.
01:09:45.740 --> 01:09:47.110
And that's all captured by x.
01:09:47.110 --> 01:09:48.109
That's what's important.
01:09:48.109 --> 01:09:50.927
Everything, all of this, is
completely captured by x.
01:09:50.927 --> 01:09:52.760
Then, of course, there
was the sigma squared
01:09:52.760 --> 01:09:54.570
that showed up here.
01:09:54.570 --> 01:09:57.155
Actually, it was here, even
in the definition of gamma j.
01:09:57.155 --> 01:09:58.430
I forgot it.
01:09:58.430 --> 01:10:00.440
What is the sigma
squared police doing?
01:10:00.440 --> 01:10:02.950
And so this thing
was here as well,
01:10:02.950 --> 01:10:04.850
and that's just exogenous.
01:10:04.850 --> 01:10:06.269
It comes from the noise itself.
01:10:06.269 --> 01:10:08.810
But there was this huge factor
that came from the x's itself.
01:10:11.680 --> 01:10:13.960
So let's go back,
now, to reading
01:10:13.960 --> 01:10:15.320
this list in a linear fashion.
01:10:15.320 --> 01:10:20.680
So I mean, you're MIT
students, you've probably
01:10:20.680 --> 01:10:25.480
heard that correlation does
not imply causation many times.
01:10:25.480 --> 01:10:27.145
Maybe you don't
know what it means.
01:10:27.145 --> 01:10:30.900
If you don't, that's OK, you
just have to know the sentence.
01:10:30.900 --> 01:10:32.420
No, what it means
is that it's not
01:10:32.420 --> 01:10:35.255
because I decided that
something was going to be the x
01:10:35.255 --> 01:10:36.630
and that something
else was going
01:10:36.630 --> 01:10:39.640
to be the y, that whatever
thing I'm getting,
01:10:39.640 --> 01:10:42.010
means that x causes y.
01:10:42.010 --> 01:10:44.530
For example, even if I
do genetics, genomics,
01:10:44.530 --> 01:10:47.230
or whatever, I
mean, I implicitly
01:10:47.230 --> 01:10:49.630
assume that my genes
are going to have
01:10:49.630 --> 01:10:52.780
an effect on my outward appearance.
01:10:52.780 --> 01:10:54.310
It could be the opposite.
01:10:54.310 --> 01:10:55.720
I mean, who am I to say?
01:10:55.720 --> 01:10:56.570
I'm not a biologist.
01:10:56.570 --> 01:10:57.111
I don't know.
01:10:57.111 --> 01:10:59.590
I didn't open a biology
book in 20 years.
01:10:59.590 --> 01:11:02.140
So maybe, if I start hitting
my head with a hammer,
01:11:02.140 --> 01:11:04.720
I'm going to end up changing
my genetic material.
01:11:04.720 --> 01:11:07.140
Probably not, but that's why--
01:11:07.140 --> 01:11:09.450
but causation definitely does
not come from statistics.
01:11:09.450 --> 01:11:11.690
So if you know that that's
the different thing,
01:11:11.690 --> 01:11:13.180
it's actually going to--
01:11:13.180 --> 01:11:14.690
it's not coming from there.
01:11:14.690 --> 01:11:18.410
So actually, I remember, once,
I put an exam to students,
01:11:18.410 --> 01:11:21.685
and there was an old data
set on police expenditures,
01:11:21.685 --> 01:11:23.920
I think, in Chicago in the '60s.
01:11:23.920 --> 01:11:27.437
And they were trying
to understand--
01:11:27.437 --> 01:11:28.270
no, it was on crime.
01:11:28.270 --> 01:11:29.650
It was the crime data set.
01:11:29.650 --> 01:11:31.700
And they were trying-- so
the y variable was just
01:11:31.700 --> 01:11:34.530
the rate of crime, and the
x's were a bunch of things,
01:11:34.530 --> 01:11:36.670
and one of them was
police expenditures.
01:11:36.670 --> 01:11:38.200
And if you ran
the regression, you
01:11:38.200 --> 01:11:41.050
would find that the coefficient
in front of police expenditure
01:11:41.050 --> 01:11:42.700
was a positive
number, which means
01:11:42.700 --> 01:11:45.690
that if you increase
police expenditures,
01:11:45.690 --> 01:11:48.400
that increases the crime.
01:11:48.400 --> 01:11:52.800
I mean, that's what it means
to have a positive coefficient.
01:11:52.800 --> 01:11:55.410
Everybody agrees with this fact?
01:11:55.410 --> 01:11:57.830
If beta j is 10, then it
means that if I increase by $1
01:11:57.830 --> 01:12:01.860
my police expenditure, I
[INAUDIBLE] by 10 my crime,
01:12:01.860 --> 01:12:04.160
everything else
being kept equal.
01:12:04.160 --> 01:12:06.140
Well, there were,
I think, about 80%
01:12:06.140 --> 01:12:09.230
of the students that were able
to explain to me that if you
01:12:09.230 --> 01:12:11.844
give more money to
the police, then
01:12:11.844 --> 01:12:13.010
the crime is going to rise.
01:12:13.010 --> 01:12:14.780
Some people were
like, well, police
01:12:14.780 --> 01:12:16.730
is making too much
money, and they
01:12:16.730 --> 01:12:19.264
don't think about their
work, and they become lazy.
01:12:19.264 --> 01:12:20.930
And I mean, people
were really coming up
01:12:20.930 --> 01:12:22.340
with some crazy things.
01:12:22.340 --> 01:12:26.090
And what it just meant is
that, no, it's not causation.
01:12:26.090 --> 01:12:28.030
It's just, if you
have more crime,
01:12:28.030 --> 01:12:29.810
you give more money
to your police.
01:12:29.810 --> 01:12:31.370
That's what's happening.
01:12:31.370 --> 01:12:33.800
And that's all there is.
01:12:33.800 --> 01:12:35.750
So just be careful
when you actually
01:12:35.750 --> 01:12:38.360
draw some conclusions that
causation is a very important
01:12:38.360 --> 01:12:39.410
thing to keep in mind.
01:12:39.410 --> 01:12:43.280
And in practice, unless you
have external sources of reason
01:12:43.280 --> 01:12:45.680
for causality-- for
example, genetic material
01:12:45.680 --> 01:12:52.040
and physical traits,
we agree upon what
01:12:52.040 --> 01:12:54.690
the direction of the arrow
of causality is here.
01:12:54.690 --> 01:12:57.845
There's places
where you might not.
01:12:57.845 --> 01:12:59.930
Now, finally, the
normality on the noise--
01:12:59.930 --> 01:13:04.340
everything we did today required
normal Gaussian distribution
01:13:04.340 --> 01:13:05.750
on the noise.
01:13:05.750 --> 01:13:07.541
I mean, it's everywhere.
01:13:07.541 --> 01:13:09.540
There's some Gaussian,
there's some chi squared.
01:13:09.540 --> 01:13:11.330
Everything came out of Gaussian.
01:13:11.330 --> 01:13:13.836
And for that, we needed
this basic formula
01:13:13.836 --> 01:13:15.710
for inference, which we
derived from the fact
01:13:15.710 --> 01:13:18.610
that the noise was
Gaussian itself.
01:13:18.610 --> 01:13:20.860
If we did not have that, the
only thing we could write
01:13:20.860 --> 01:13:24.370
is, beta hat is this
number, or this vector.
01:13:24.370 --> 01:13:27.980
We would not be able to say,
the fluctuations of beta hat
01:13:27.980 --> 01:13:28.615
are this guy.
01:13:28.615 --> 01:13:30.472
We would not be
able to do tests.
01:13:30.472 --> 01:13:31.930
We would not be
able to build, say,
01:13:31.930 --> 01:13:34.160
confidence regions or anything.
01:13:34.160 --> 01:13:38.150
And so this is an important
condition that we need,
01:13:38.150 --> 01:13:40.670
and that's what statistical
software assumes by default.
01:13:40.670 --> 01:13:44.870
But we now have a recipe
on how to do those tests.
01:13:44.870 --> 01:13:47.060
We can do it either
visually, if we really
01:13:47.060 --> 01:13:49.430
want to conclude that,
yes, this is Gaussian,
01:13:49.430 --> 01:13:51.350
using our normal Q-Q plots.
01:13:51.350 --> 01:13:54.860
And we can also do it
using our favorite tests.
01:13:54.860 --> 01:13:56.750
What test should I be
using to test that?
01:14:01.540 --> 01:14:03.771
With two names?
01:14:03.771 --> 01:14:04.270
Yeah?
01:14:04.270 --> 01:14:06.957
AUDIENCE: Normal [INAUDIBLE].
01:14:06.957 --> 01:14:08.540
PHILIPPE RIGOLLET:
Not the 2 Russians.
01:14:08.540 --> 01:14:10.820
So I want a Russian and
a Scandinavian person
01:14:10.820 --> 01:14:12.722
for this one.
01:14:12.722 --> 01:14:13.416
What's that?
01:14:13.416 --> 01:14:14.540
AUDIENCE: Lillie-something?
01:14:14.540 --> 01:14:16.290
PHILIPPE RIGOLLET:
Yeah, Lillie-something.
01:14:16.290 --> 01:14:18.660
So Kolmogorov
Lillie-something test.
01:14:18.660 --> 01:14:23.370
And [LAUGHS] so it's the
Kolmogorov Lilliefors test.
01:14:23.370 --> 01:14:26.670
And because I'm testing if
they're Gaussian, and I'm actually
01:14:26.670 --> 01:14:28.140
not really making any--
01:14:28.140 --> 01:14:30.510
I don't need to know
what the variance is.
01:14:30.510 --> 01:14:31.350
The mean is 0.
01:14:31.350 --> 01:14:32.558
We saw that at the beginning.
01:14:32.558 --> 01:14:34.680
It's 0 by construction,
so we actually
01:14:34.680 --> 01:14:37.590
don't need to think about
the mean being 0 itself.
01:14:37.590 --> 01:14:38.850
This just happens to be 0.
01:14:38.850 --> 01:14:41.340
So we know that it's 0, but
the variance, we don't know.
01:14:41.340 --> 01:14:42.900
So we just want to
know if it belongs
01:14:42.900 --> 01:14:45.233
to the family of Gaussians,
and so we need to Kolmogorov
01:14:45.233 --> 01:14:46.660
Lilliefors for that.
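As a sketch of that check in practice -- assuming numpy and that statsmodels' lilliefors diagnostic is available -- one could test the residuals like this; the data here are synthetic, with genuinely Gaussian noise:

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(4)
n, p = 100, 3
x = rng.normal(size=(n, p))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)  # Gaussian noise by construction

beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
resid = y - x @ beta_hat

# Kolmogorov-Smirnov with the Gaussian variance estimated from the data
stat, pval = lilliefors(resid, dist='norm')
print(stat, pval)   # a large p-value is expected, since the noise really is Gaussian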
01:14:46.660 --> 01:14:49.650
And that's also one of the things
that's spit out by statistical
01:14:49.650 --> 01:14:52.680
software by default. When
you run a linear regression,
01:14:52.680 --> 01:14:54.670
actually, it spits out
both Kolmogorov-Smirnov
01:14:54.670 --> 01:14:59.118
and Kolmogorov Lilliefors,
probably contributing
01:14:59.118 --> 01:15:01.860
to the widespread use of
Kolmogorov-Smirnov when you
01:15:01.860 --> 01:15:03.550
really shouldn't.
01:15:03.550 --> 01:15:08.920
So next time, we will talk
about more advanced topics
01:15:08.920 --> 01:15:09.670
on regression.
01:15:09.670 --> 01:15:11.780
But I think I'm going
to stop here for today.
01:15:11.780 --> 01:15:14.740
So again, tomorrow,
sometime during the day,
01:15:14.740 --> 01:15:16.780
at least before
the recitation, you
01:15:16.780 --> 01:15:20.740
will have a list of practice
exercises that will be posted.
01:15:20.740 --> 01:15:23.600
And if you go to the
optional recitation,
01:15:23.600 --> 01:15:26.190
you will have
someone solving them