WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.850
Commons license.
00:00:03.850 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or view
additional materials
00:00:12.720 --> 00:00:16.680
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.680 --> 00:00:17.880
at ocw.mit.edu.
00:00:21.030 --> 00:00:23.880
PHILIPPE RIGOLLET: So
again, before we start,
00:00:23.880 --> 00:00:27.720
there is a survey online
if you haven't done so.
00:00:27.720 --> 00:00:30.600
I would guess at least
one of you has not.
00:00:30.600 --> 00:00:33.750
Some of you have entered your
answers and your thoughts,
00:00:33.750 --> 00:00:35.055
and I really appreciate this.
00:00:35.055 --> 00:00:36.180
It's actually very helpful.
00:00:36.180 --> 00:00:40.230
So it seems that the
course is going fairly well
00:00:40.230 --> 00:00:42.100
from what I've read so far.
00:00:42.100 --> 00:00:43.900
So if you don't think
this is the case,
00:00:43.900 --> 00:00:45.720
please enter your
opinion and tell us
00:00:45.720 --> 00:00:47.440
how we can make it better.
00:00:47.440 --> 00:00:48.900
One of the things
that was said is
00:00:48.900 --> 00:00:53.370
that I speak too fast,
which is absolutely true.
00:00:53.370 --> 00:00:54.520
I just can't help it.
00:00:54.520 --> 00:00:59.370
I get so excited, but I
will really do my best.
00:00:59.370 --> 00:01:02.860
I will try to.
00:01:02.860 --> 00:01:04.800
I think I always start OK.
00:01:04.800 --> 00:01:07.170
I just end not so well.
00:01:07.170 --> 00:01:10.920
So last time we talked about
this chi squared distribution,
00:01:10.920 --> 00:01:13.170
which is just another
distribution that's
00:01:13.170 --> 00:01:16.140
so common that it
deserves its own name.
00:01:16.140 --> 00:01:17.790
And this is
something that arises
00:01:17.790 --> 00:01:22.200
when we sum the squares of
independent standard Gaussian
00:01:22.200 --> 00:01:23.330
random variables.
00:01:23.330 --> 00:01:25.785
And in particular,
why is that relevant?
00:01:25.785 --> 00:01:27.960
It's because if I look
at the sample variance,
00:01:27.960 --> 00:01:29.610
then it is a chi
square distribution,
00:01:29.610 --> 00:01:32.370
and the parameter
that shows up is also
00:01:32.370 --> 00:01:35.430
known as the degrees of
freedom, is the number
00:01:35.430 --> 00:01:37.943
of observations minus one.
00:01:37.943 --> 00:01:39.360
And so as I said,
this chi squared
00:01:39.360 --> 00:01:43.110
distribution has an explicit
probability density function,
00:01:43.110 --> 00:01:44.980
and I tried to draw it.
00:01:44.980 --> 00:01:47.850
And one of the comments was
also about my handwriting,
00:01:47.850 --> 00:01:52.080
so I will actually not rely
on it for detailed things.
00:01:52.080 --> 00:01:54.570
So this is what the chi squared
with one degree of freedom
00:01:54.570 --> 00:01:55.237
would look like.
00:01:55.237 --> 00:01:57.862
And really, what this is is just
the distribution of the square
00:01:57.862 --> 00:01:58.890
of a standard Gaussian.
00:01:58.890 --> 00:02:01.650
I'm summing only one,
so that's what it is.
00:02:01.650 --> 00:02:03.810
Then when I go to 2,
this is what it is--
00:02:03.810 --> 00:02:07.038
3, 4, 5, 6, and 10.
00:02:07.038 --> 00:02:08.580
And as I move, you
can see this thing
00:02:08.580 --> 00:02:11.840
is becoming flatter and flatter,
and it's pushing to the right.
00:02:11.840 --> 00:02:14.910
And that's because I'm
summing more and more squares,
00:02:14.910 --> 00:02:18.600
and in expectation we
just get one every time.
00:02:18.600 --> 00:02:23.520
So it really means that the
mass is moving to infinity.
00:02:23.520 --> 00:02:26.140
In particular, a chi
squared distribution
00:02:26.140 --> 00:02:29.640
with n degrees of freedom
is going to infinity
00:02:29.640 --> 00:02:32.790
as n goes to infinity.
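Here is a minimal simulation sketch of that point in Python, assuming NumPy is available; the names are illustrative, not from the lecture.

import numpy as np

rng = np.random.default_rng(0)

# A chi squared with d degrees of freedom is a sum of d squared
# independent standard Gaussians, so its mean is d: the mass
# drifts to the right as d grows.
for d in [1, 2, 3, 4, 5, 6, 10]:
    draws = rng.standard_normal((100_000, d)) ** 2
    chi2_draws = draws.sum(axis=1)
    print(d, chi2_draws.mean())  # empirical mean is close to d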
00:02:32.790 --> 00:02:35.130
Another distribution
that I asked
00:02:35.130 --> 00:02:38.010
you to think about--
did anybody look up
00:02:38.010 --> 00:02:39.750
the student
t-distribution, what
00:02:39.750 --> 00:02:42.040
the history of this thing was?
00:02:42.040 --> 00:02:44.545
So I'll tell you a little bit.
00:02:44.545 --> 00:02:46.300
I understand if you
didn't have time.
00:02:46.300 --> 00:02:50.470
So the t-distribution is
another common distribution
00:02:50.470 --> 00:02:53.050
that is so common
that it will be used
00:02:53.050 --> 00:02:56.980
and will have its table
of quantiles that are
00:02:56.980 --> 00:02:59.320
drawn at the back of the book.
00:02:59.320 --> 00:03:02.110
Now, remember, when I
mentioned the Gaussian, I said,
00:03:02.110 --> 00:03:04.470
well, there are several
values for alpha
00:03:04.470 --> 00:03:06.250
that we're interested in.
00:03:06.250 --> 00:03:11.110
And so I wanted to draw
a table for the Gaussian.
00:03:11.110 --> 00:03:13.760
We had something that
looked like this,
00:03:13.760 --> 00:03:21.030
and I said, well, q alpha
over 2 to get alpha over 2
00:03:21.030 --> 00:03:22.690
to the right of this number.
00:03:22.690 --> 00:03:25.930
And we said that there is
a table for these things,
00:03:25.930 --> 00:03:28.330
for common values of alpha.
00:03:28.330 --> 00:03:31.180
Well, if you try to envision
what this table will look like,
00:03:31.180 --> 00:03:34.000
it's actually a
pretty sad table,
00:03:34.000 --> 00:03:35.800
because it's basically
one list of numbers.
00:03:35.800 --> 00:03:37.390
Why would I call it a table?
00:03:37.390 --> 00:03:38.828
Because all I need
to tell you is
00:03:38.828 --> 00:03:40.120
something that looks like this.
00:03:40.120 --> 00:03:43.960
If I tell you this is alpha
and this is q alpha over 2
00:03:43.960 --> 00:03:47.170
and then I say, OK,
basically the three alphas
00:03:47.170 --> 00:03:54.870
that I told you I care about are
something like 1%, 5%, and 10%,
00:03:54.870 --> 00:03:57.400
then my table will just
give me q alpha over 2.
00:03:57.400 --> 00:03:59.710
So that's alpha, and
that's q alpha over 2.
00:03:59.710 --> 00:04:01.330
And that's going
to tell me that--
00:04:01.330 --> 00:04:04.910
I don't remember this
one, but this guy is 1.96.
00:04:04.910 --> 00:04:08.920
This guy is something like 2.58.
00:04:08.920 --> 00:04:11.860
I think this one
is like 1.65 maybe.
00:04:11.860 --> 00:04:15.274
And maybe you can
be a little finer,
00:04:15.274 --> 00:04:16.899
but it's not going
to be an entire page
00:04:16.899 --> 00:04:18.148
at the back of the book.
00:04:18.148 --> 00:04:19.690
And the reason is
because I only need
00:04:19.690 --> 00:04:22.840
to draw these things
for the standard Gaussian
00:04:22.840 --> 00:04:24.730
when the parameters
are 0 for the mean
00:04:24.730 --> 00:04:26.250
and 1 for the variance.
00:04:26.250 --> 00:04:30.740
Now, if I'm actually doing
this for the chi squared,
00:04:30.740 --> 00:04:34.610
I basically have to give
you one table per value
00:04:34.610 --> 00:04:37.727
of the degrees of freedom,
because those things
00:04:37.727 --> 00:04:38.310
are different.
00:04:38.310 --> 00:04:41.070
There is no way I can take--
00:04:41.070 --> 00:04:43.070
for Gaussians, if you
give me a different mean,
00:04:43.070 --> 00:04:46.820
I can subtract it and make it
back into a standard Gaussian.
00:04:46.820 --> 00:04:49.345
For the chi squared,
there is no such thing.
00:04:49.345 --> 00:04:50.720
There is no thing
that just takes
00:04:50.720 --> 00:04:53.345
the chi squared with d
degrees of freedom, and
00:04:53.345 --> 00:04:54.935
and turns it into,
say, a chi square
00:04:54.935 --> 00:04:56.060
with one degree of freedom.
00:04:56.060 --> 00:04:58.070
This just does not happen.
00:04:58.070 --> 00:05:01.010
So the word is standardized.
00:05:01.010 --> 00:05:02.412
Make it a standard chi squared.
00:05:02.412 --> 00:05:04.370
There is no such thing
as standard chi squared.
00:05:04.370 --> 00:05:05.787
So what it means
is that I'm going
00:05:05.787 --> 00:05:09.860
to need one row like that
for each value of the number
00:05:09.860 --> 00:05:11.130
of degrees of freedom.
00:05:11.130 --> 00:05:14.420
So that will certainly fill a
page at the back of a book--
00:05:14.420 --> 00:05:16.400
maybe even more.
00:05:16.400 --> 00:05:18.390
I need one per sample size.
00:05:18.390 --> 00:05:21.000
So if I want to go from
sample size 1 to 1,000,
00:05:21.000 --> 00:05:24.470
I need 1,000 rows.
00:05:24.470 --> 00:05:26.990
So now the student
distribution is
00:05:26.990 --> 00:05:30.740
one that arises and looks
very much like the Gaussian
00:05:30.740 --> 00:05:33.920
distribution, and there's a
very simple reason for that:
00:05:33.920 --> 00:05:37.820
I take a standard Gaussian
and I divide it by something.
00:05:37.820 --> 00:05:39.410
That's how I get the student.
00:05:39.410 --> 00:05:40.520
What do I divide it with?
00:05:40.520 --> 00:05:42.900
Well, I take an
independent chi square--
00:05:42.900 --> 00:05:44.410
I'm going to call it v--
00:05:44.410 --> 00:05:47.030
and I want it to be
independent from z.
00:05:47.030 --> 00:05:52.040
And I'm going to divide
z by root v over d.
00:05:52.040 --> 00:05:55.910
So I start with a chi squared v.
00:05:55.910 --> 00:05:58.330
So this guy is chi squared d.
00:05:58.330 --> 00:06:02.840
I start with z, which is N(0, 1).
00:06:02.840 --> 00:06:06.640
I'm going to assume that
those guys are independent.
00:06:06.640 --> 00:06:08.640
In my t-distribution,
I'm going to write
00:06:08.640 --> 00:06:17.150
a T. Capital T is z divided by
the square root of v over d.
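A quick sketch of this construction, assuming NumPy and SciPy are available (all the names here are made up for the example):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d = 5

# T = Z / sqrt(V / d), with Z standard Gaussian and V an
# independent chi squared with d degrees of freedom.
z = rng.standard_normal(100_000)
v = rng.chisquare(d, 100_000)
t_draws = z / np.sqrt(v / d)

# The empirical 97.5% quantile should match scipy's t quantile.
print(np.quantile(t_draws, 0.975), stats.t.ppf(0.975, d))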
00:06:17.150 --> 00:06:18.660
Why would I want to do this?
00:06:18.660 --> 00:06:20.630
Well, because this
is exactly what
00:06:20.630 --> 00:06:25.800
happens when I divide a Gaussian
not by the true variance,
00:06:25.800 --> 00:06:28.750
but by its empirical variance.
00:06:28.750 --> 00:06:30.290
So let's see why in a second.
00:06:30.290 --> 00:06:34.480
So I know that if you give
me some random variable--
00:06:34.480 --> 00:06:38.900
let's call it x, which
is N(mu, sigma squared)--
00:06:38.900 --> 00:06:40.000
then I can do this.
00:06:40.000 --> 00:06:45.012
x minus mu divided by sigma.
00:06:45.012 --> 00:06:47.470
I'm going to call this thing
z, because this thing actually
00:06:47.470 --> 00:06:51.220
has some standard
Gaussian distribution.
00:06:51.220 --> 00:06:54.430
I have standardized
x into something
00:06:54.430 --> 00:06:58.090
that I can read the quantiles
at the back of the book.
00:06:58.090 --> 00:07:00.430
So that's this process
that I want to do.
00:07:00.430 --> 00:07:03.430
Now, to be able to do this,
I need to know what mu is,
00:07:03.430 --> 00:07:05.270
and I need to know
what sigma is.
00:07:05.270 --> 00:07:09.680
Otherwise I'm not going to be
able to make this operation.
00:07:09.680 --> 00:07:13.730
mu I can sort of get away
with, because remember,
00:07:13.730 --> 00:07:15.760
when we're doing
confidence intervals
00:07:15.760 --> 00:07:17.840
we're actually solving for mu.
00:07:17.840 --> 00:07:20.260
So it was good
that mu was there.
00:07:20.260 --> 00:07:22.070
When we're doing
hypothesis testing,
00:07:22.070 --> 00:07:26.690
we're actually plugging in here
the mu that shows up in h0.
00:07:26.690 --> 00:07:27.720
So that was good.
00:07:27.720 --> 00:07:28.490
We had this thing.
00:07:28.490 --> 00:07:31.250
Think of mu as being
p, for example.
00:07:31.250 --> 00:07:36.730
But this guy here, we don't
necessarily know what it is.
00:07:36.730 --> 00:07:40.090
I just had to tell you for
the entire first chapter,
00:07:40.090 --> 00:07:41.860
assume you have Gaussian
random variables
00:07:41.860 --> 00:07:44.050
and that you know
what the variance is.
00:07:44.050 --> 00:07:45.650
And the reason why
I said assume you
00:07:45.650 --> 00:07:47.567
know it-- and I said
sometimes you can read it
00:07:47.567 --> 00:07:52.390
on the side of the box of
measuring equipment in the lab.
00:07:52.390 --> 00:07:54.910
That was just the
way I justified it,
00:07:54.910 --> 00:07:57.490
but the real reason why I did
this is because I would not
00:07:57.490 --> 00:08:00.580
be able to perform this
operation if I actually did not
00:08:00.580 --> 00:08:02.380
know what sigma was.
00:08:02.380 --> 00:08:07.340
But from data, we know that
we can form this estimator
00:08:07.340 --> 00:08:11.530
Sn, which is 1 over n times
the sum from i equals 1 to n
00:08:11.530 --> 00:08:15.430
of (Xi minus X bar) squared.
00:08:15.430 --> 00:08:18.790
And this thing is approximately
equal to sigma squared.
00:08:18.790 --> 00:08:21.100
That's the sample
variance, and it's actually
00:08:21.100 --> 00:08:25.640
a good estimator just by the
law of large numbers, actually.
00:08:25.640 --> 00:08:29.472
This thing, by the law of large
numbers, as n goes to infinity--
00:08:32.370 --> 00:08:34.440
well, let's say
it in probability
00:08:34.440 --> 00:08:36.570
goes to sigma squared by
the law of large numbers.
00:08:36.570 --> 00:08:40.200
So it's a consistent
estimator of sigma squared.
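A quick sketch of this consistency in Python (assuming NumPy; the constants are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0

# Sn = (1/n) * sum((Xi - Xbar)^2) should approach sigma^2 = 9
# as n grows, by the law of large numbers.
for n in [10, 100, 10_000, 1_000_000]:
    x = rng.normal(mu, sigma, n)
    s_n = np.mean((x - x.mean()) ** 2)
    print(n, s_n)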
00:08:40.200 --> 00:08:43.080
So now, what I
want to do is to be
00:08:43.080 --> 00:08:46.425
able to use this estimator
rather than using sigma.
00:08:46.425 --> 00:08:47.800
And the way I'm
going to do it is
00:08:47.800 --> 00:08:50.220
I'm going to say, OK,
what I want to form
00:08:50.220 --> 00:08:58.400
is x minus mu divided
by Sn this time.
00:08:58.400 --> 00:09:01.540
I don't know what the
distribution of this guy is.
00:09:01.540 --> 00:09:02.790
Sorry, it's square root of Sn.
00:09:02.790 --> 00:09:05.430
This is sigma squared.
00:09:05.430 --> 00:09:07.860
So this is what I would take.
00:09:07.860 --> 00:09:10.080
And I could think
of Slutsky, maybe,
00:09:10.080 --> 00:09:14.150
something like this that would
tell me, well, just use that
00:09:14.150 --> 00:09:15.980
and pretend it's a Gaussian.
00:09:15.980 --> 00:09:18.170
And we'll see how
actually it's sort
00:09:18.170 --> 00:09:20.900
of valid to do that,
because Slutsky tells us
00:09:20.900 --> 00:09:22.630
it is valid to do that.
00:09:22.630 --> 00:09:24.590
But what we can
also do is to say,
00:09:24.590 --> 00:09:28.940
well, this is actually equal to
x minus mu, divided by sigma,
00:09:28.940 --> 00:09:31.550
which I knew what the
distribution of this guy is.
00:09:31.550 --> 00:09:33.770
And then what I'm going to
do is I'm going to just--
00:09:33.770 --> 00:09:38.670
well, I'm going to cancel this
effect, sigma over square root
00:09:38.670 --> 00:09:39.470
Sn.
00:09:39.470 --> 00:09:41.480
So I didn't change anything.
00:09:41.480 --> 00:09:43.540
I just put the sigma here.
00:09:43.540 --> 00:09:47.010
So now what I know
is that this is some z,
00:09:47.010 --> 00:09:51.340
and it has some standard
Gaussian distribution.
00:09:51.340 --> 00:09:54.480
What is this guy?
00:09:54.480 --> 00:09:57.340
Well, I know that Sn--
00:09:57.340 --> 00:09:59.870
we wrote this here.
00:09:59.870 --> 00:10:01.620
Maybe I shouldn't have
put those pictures,
00:10:01.620 --> 00:10:04.230
because now I keep on
skipping before and after.
00:10:04.230 --> 00:10:14.180
We know that Sn times n
divided by sigma squared
00:10:14.180 --> 00:10:18.636
is actually chi
squared n minus 1.
00:10:22.580 --> 00:10:23.900
So what do I have here?
00:10:23.900 --> 00:10:25.860
I have that chi squared--
00:10:25.860 --> 00:10:29.720
so here I have something that
looks like 1 over square root
00:10:29.720 --> 00:10:32.270
of Sn divided by sigma squared.
00:10:35.590 --> 00:10:38.680
This is what this guy is if
I just do some more writing.
00:10:38.680 --> 00:10:41.710
And maybe I actually want to
make my life a little easier.
00:10:41.710 --> 00:10:45.630
I'm actually going
to plug in my n here,
00:10:45.630 --> 00:10:48.693
and so I'm going to have to
multiply by square root of n
00:10:48.693 --> 00:10:49.193
here.
00:10:56.438 --> 00:10:59.718
Everybody's with me?
00:10:59.718 --> 00:11:01.510
So now what I end up
with is something that
00:11:01.510 --> 00:11:06.310
looks like this, where I have--
00:11:06.310 --> 00:11:07.775
here I started with x.
00:11:15.630 --> 00:11:19.890
I should really start
with Xn bar minus mu times
00:11:19.890 --> 00:11:21.650
square root of n.
00:11:21.650 --> 00:11:24.413
That's what the central
limit theorem would tell me.
00:11:24.413 --> 00:11:26.580
I need to work with the
average rather than just one
00:11:26.580 --> 00:11:27.953
observation.
00:11:27.953 --> 00:11:30.370
So if I start with this, then
I pick up a square root of n
00:11:30.370 --> 00:11:30.870
here.
00:11:43.180 --> 00:11:45.700
So if I had the sigma
here, I would know
00:11:45.700 --> 00:11:47.690
that this thing is actually--
00:11:47.690 --> 00:11:54.110
Xn bar minus mu divided by
sigma times the square root of n
00:11:54.110 --> 00:11:56.300
would be a standard Gaussian.
00:11:56.300 --> 00:11:58.460
So if I put Xn
bar here, I really
00:11:58.460 --> 00:12:00.620
need to put this thing that
goes around the Xn bar.
00:12:04.620 --> 00:12:06.120
That's just my
central limit theorem
00:12:06.120 --> 00:12:10.720
that says if I average, then my
variance has shrunk by a factor
00:12:10.720 --> 00:12:12.780
1 over n.
00:12:12.780 --> 00:12:15.400
Now, I can still do this.
00:12:15.400 --> 00:12:16.740
That was still fine.
00:12:16.740 --> 00:12:26.620
And now I said that this
thing is basically this guy.
00:12:26.620 --> 00:12:28.240
So what I know is
that this thing
00:12:28.240 --> 00:12:32.810
is a chi squared with n
minus 1 degrees of freedom,
00:12:32.810 --> 00:12:37.340
so this guy here is
chi squared with n
00:12:37.340 --> 00:12:40.710
minus 1 degrees of freedom.
00:12:40.710 --> 00:12:44.650
Let me call this thing v in the
spirit of what was used there
00:12:44.650 --> 00:12:49.690
and in the spirit of
what is written here.
00:12:49.690 --> 00:12:53.840
So this guy was called v,
so I'm going to call this v.
00:12:53.840 --> 00:12:57.000
So what I can write is
that square root of n Xn
00:12:57.000 --> 00:13:02.570
bar minus mu divided
by square root of Sn
00:13:02.570 --> 00:13:10.870
is equal to z times
square root of n
00:13:10.870 --> 00:13:20.460
divided by square root of
v. Everybody's with me here?
00:13:23.610 --> 00:13:37.070
Which I can rewrite as z divided
by the square root of v over n.
00:13:37.070 --> 00:13:40.150
And if you look at what the
definition of this thing is,
00:13:40.150 --> 00:13:41.920
I'm almost there.
00:13:41.920 --> 00:13:45.480
What is the only thing
that's wrong here?
00:13:45.480 --> 00:13:48.293
This is a student
distribution, right?
00:13:48.293 --> 00:13:49.210
So there's two things.
00:13:49.210 --> 00:13:51.840
The first one was that
they should be independent,
00:13:51.840 --> 00:13:53.360
and they actually
are independent.
00:13:53.360 --> 00:13:55.230
That's what Cochran's
theorem tells me,
00:13:55.230 --> 00:13:57.210
and you just have to
count on me for this.
00:13:57.210 --> 00:14:01.590
I told you already that Sn
was independent of Xn bar.
00:14:01.590 --> 00:14:04.510
So those two guys
are independent,
00:14:04.510 --> 00:14:07.590
which implies that the
numerator and denominator here
00:14:07.590 --> 00:14:08.410
are independent.
00:14:08.410 --> 00:14:12.260
That's what Cochran's
theorem tells us.
00:14:12.260 --> 00:14:14.660
But is this exactly
what I should
00:14:14.660 --> 00:14:17.990
be seeing if I wanted to
have my sample variance, if I
00:14:17.990 --> 00:14:19.430
want to have to write this?
00:14:19.430 --> 00:14:23.480
Is this actually the definition
of a student distribution?
00:14:23.480 --> 00:14:25.106
Yes?
00:14:25.106 --> 00:14:25.606
No.
00:14:28.890 --> 00:14:33.773
So we see z divided by
square root of v over d.
00:14:33.773 --> 00:14:35.690
That looks pretty much
like it, except there's
00:14:35.690 --> 00:14:36.560
a small discrepancy.
00:14:36.560 --> 00:14:38.016
What is the discrepancy?
00:14:47.260 --> 00:14:50.680
There's just the square
root of n minus 1 thing.
00:14:50.680 --> 00:14:55.520
So here, v has n minus
1 degrees of freedom.
00:14:55.520 --> 00:14:58.890
And in the definition, if the
v has d degrees of freedom,
00:14:58.890 --> 00:15:04.380
I divide it by d, not by d minus
1 or not by d plus 1, actually,
00:15:04.380 --> 00:15:06.060
in this case.
00:15:06.060 --> 00:15:07.820
So I have this extra thing.
00:15:07.820 --> 00:15:09.570
Well, there's two ways
I can address this.
00:15:13.230 --> 00:15:14.910
The first one is
by saying, well,
00:15:14.910 --> 00:15:18.380
this is actually equal
to z over square root
00:15:18.380 --> 00:15:27.390
of v divided by n minus
1 times square root of n
00:15:27.390 --> 00:15:28.547
over n minus 1.
00:15:32.810 --> 00:15:35.420
I can always do that and
say for n large enough
00:15:35.420 --> 00:15:37.740
this thing is actually
going to be pretty small,
00:15:37.740 --> 00:15:39.260
or I can take account for it.
00:15:39.260 --> 00:15:43.670
Or for any n you give me,
I can compute this number.
00:15:43.670 --> 00:15:45.615
And so rather than
having a t-distribution,
00:15:45.615 --> 00:15:47.240
I'm going to have a
t-distribution times
00:15:47.240 --> 00:15:49.480
this deterministic
number, which is just
00:15:49.480 --> 00:15:52.370
a function of my
number of observations.
00:15:52.370 --> 00:15:55.430
But what I actually
want to do instead
00:15:55.430 --> 00:16:00.520
is probably use a slightly
different normalization,
00:16:00.520 --> 00:16:04.140
which is just to say, well,
why do I have to define Sn--
00:16:10.260 --> 00:16:11.130
where was my Sn?
00:16:11.130 --> 00:16:14.910
Yeah, why do I have to define
Sn to be divided by n?
00:16:14.910 --> 00:16:17.730
Actually, this is
a biased estimator,
00:16:17.730 --> 00:16:20.010
and if I wanted to be
unbiased, I can actually just
00:16:20.010 --> 00:16:22.650
put an n minus 1 here.
00:16:22.650 --> 00:16:23.550
You can check that.
00:16:23.550 --> 00:16:25.800
You can expand this thing
and compute the expectation.
00:16:25.800 --> 00:16:27.990
You will see that it's
actually not sigma squared,
00:16:27.990 --> 00:16:31.360
but n minus 1 over
n times sigma squared.
00:16:31.360 --> 00:16:33.490
So you can actually
just make it unbiased.
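A small sketch of that bias check, assuming NumPy (the sample size and sigma are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 5

# Average the two variance estimators over many samples: the 1/n
# version is biased by a factor (n - 1)/n, while the 1/(n - 1)
# version (the "tilde" estimator) is unbiased.
biased, unbiased = [], []
for _ in range(200_000):
    x = rng.normal(0.0, sigma, n)
    biased.append(x.var(ddof=0))    # divide by n
    unbiased.append(x.var(ddof=1))  # divide by n - 1

print(np.mean(biased), (n - 1) / n * sigma**2)  # both about 3.2
print(np.mean(unbiased), sigma**2)              # both about 4.0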
00:16:33.490 --> 00:16:35.820
Let's call this
guy tilde, and then
00:16:35.820 --> 00:16:43.890
when I put this tilde here what
I actually get is s tilde here
00:16:43.890 --> 00:16:46.020
and s tilde here.
00:16:46.020 --> 00:16:49.920
I need actually to
have n minus 1 here
00:16:49.920 --> 00:16:55.801
to have this s tilde be a
chi squared distribution.
00:16:55.801 --> 00:16:56.755
Yes?
00:16:56.755 --> 00:17:02.010
AUDIENCE: [INAUDIBLE] defined
this way so that you--
00:17:02.010 --> 00:17:04.940
PHILIPPE RIGOLLET: So basically,
this is what happened historically.
00:17:04.940 --> 00:17:08.359
So the story was, well,
rather than using always
00:17:08.359 --> 00:17:10.760
the central limit theorem
and just pretending
00:17:10.760 --> 00:17:13.800
that my Sn is actually
the true sigma squared,
00:17:13.800 --> 00:17:16.460
since this is something
I'm going to do a lot,
00:17:16.460 --> 00:17:19.460
I might as well just
compute the distribution,
00:17:19.460 --> 00:17:21.890
like the quantiles for this
particular distribution,
00:17:21.890 --> 00:17:24.470
which clearly does not depend
on any unknown parameter.
00:17:24.470 --> 00:17:27.109
d is the only parameter
that shows up here,
00:17:27.109 --> 00:17:28.670
and it's completely
characterized
00:17:28.670 --> 00:17:30.510
by the number of
observations that you have,
00:17:30.510 --> 00:17:32.530
which you definitely know.
00:17:32.530 --> 00:17:35.450
And so people said, let's just
be slightly more accurate.
00:17:35.450 --> 00:17:38.900
And in a second, I'll show you
how the distribution of the T--
00:17:38.900 --> 00:17:41.492
so we know that if the
sample size is large enough,
00:17:41.492 --> 00:17:43.700
this should not have any
difference with the Gaussian
00:17:43.700 --> 00:17:44.480
distribution.
00:17:44.480 --> 00:17:45.920
I mean, those two
things should be
00:17:45.920 --> 00:17:48.020
the same because we've
actually not paid
00:17:48.020 --> 00:17:51.410
attention to this discrepancy by
using empirical variance rather
00:17:51.410 --> 00:17:52.970
than true so far.
00:17:52.970 --> 00:17:55.220
And so we'll see what
the difference is,
00:17:55.220 --> 00:17:57.830
and this difference actually
manifests itself only
00:17:57.830 --> 00:17:59.490
in small sample sizes.
00:17:59.490 --> 00:18:02.000
So those are things
that matter mostly
00:18:02.000 --> 00:18:04.512
if you have less than,
say, 50 observations.
00:18:04.512 --> 00:18:06.470
Then you might want to
be slightly more precise
00:18:06.470 --> 00:18:08.960
and use t-distribution
rather than Gaussian.
00:18:08.960 --> 00:18:12.640
So this is just a matter of
being slightly more precise.
00:18:12.640 --> 00:18:14.260
If you have more
than 50 observations,
00:18:14.260 --> 00:18:15.965
just drop everything
and just pretend
00:18:15.965 --> 00:18:17.048
that this is the true one.
00:18:19.610 --> 00:18:22.210
Any other questions?
00:18:22.210 --> 00:18:25.450
So now I have this
thing, and so I'm
00:18:25.450 --> 00:18:27.790
on my way to changing this guy.
00:18:27.790 --> 00:18:31.540
So here now, I have not
root n but root n minus 1.
00:18:47.680 --> 00:18:48.740
So I have a z.
00:18:48.740 --> 00:18:55.441
So this guy here is S tilde.
Where did I get my root
00:18:55.441 --> 00:18:56.524
n from in the first place?
00:19:00.340 --> 00:19:02.270
Yeah, because I wanted this guy.
00:19:02.270 --> 00:19:05.250
And so now what I am
left with is Xn bar minus mu
00:19:05.250 --> 00:19:08.900
divided by root Sn tilde, which
is the new one, which is now
00:19:08.900 --> 00:19:14.540
indeed of the form z over root
of v over n minus 1, which now I
00:19:14.540 --> 00:19:16.540
can write as a t with n minus 1 degrees of freedom.
00:19:16.540 --> 00:19:22.410
And so now I have
exactly what I want,
00:19:22.410 --> 00:19:25.100
and so this guy is N(0, 1).
00:19:25.100 --> 00:19:30.430
And this guy is chi squared with
n minus 1 degrees of freedom.
00:19:30.430 --> 00:19:33.310
And so now I'm back
to what I want.
00:19:33.310 --> 00:19:37.360
So rather than using Sn to be
the empirical variance where
00:19:37.360 --> 00:19:41.095
I just divide my normalization
by n, if I use n minus 1,
00:19:41.095 --> 00:19:42.310
I'm perfect.
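Here is a minimal simulation sketch of that claim (assuming NumPy and SciPy; the parameters are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 2.0, 8

# sqrt(n) * (Xbar - mu) / sqrt(S tilde), with S tilde the unbiased
# variance (ddof=1), should follow a t with n - 1 degrees of freedom.
t_stats = []
for _ in range(100_000):
    x = rng.normal(mu, sigma, n)
    t_stats.append(np.sqrt(n) * (x.mean() - mu) / x.std(ddof=1))

print(np.quantile(t_stats, 0.975))  # empirical quantile
print(stats.t.ppf(0.975, n - 1))    # about 2.36 for n = 8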
00:19:42.310 --> 00:19:45.730
Of course, I can still use
n and do this multiplying
00:19:45.730 --> 00:19:47.710
by root n minus 1
over n at the end.
00:19:47.710 --> 00:19:49.960
But that just doesn't
make as much sense.
00:19:52.535 --> 00:19:54.910
Everybody's fine with what
this T n distribution is doing
00:19:54.910 --> 00:19:58.970
and why this last
line is correct?
00:19:58.970 --> 00:20:01.570
So that's just
basically because it's
00:20:01.570 --> 00:20:04.250
been defined so that this
is actually happening.
00:20:04.250 --> 00:20:07.150
That was your question, and
that's really what happened.
00:20:07.150 --> 00:20:11.320
So what is this
student t-distribution?
00:20:11.320 --> 00:20:13.460
Where does the name come from?
00:20:13.460 --> 00:20:18.430
Well, it does not come from
Mr. T. And if you know who Mr.
00:20:18.430 --> 00:20:20.800
T was-- you're probably
too young for that--
00:20:20.800 --> 00:20:23.470
he was our hero in the 80s.
00:20:23.470 --> 00:20:26.580
And it comes from this guy.
00:20:26.580 --> 00:20:29.200
His name is William
Sealy Gosset--
00:20:29.200 --> 00:20:29.900
1908.
00:20:29.900 --> 00:20:31.623
So that was back in the day.
00:20:31.623 --> 00:20:33.790
And this guy actually worked
at the Guinness Brewery
00:20:33.790 --> 00:20:35.150
in Dublin, Ireland.
00:20:35.150 --> 00:20:38.500
And Mr. Guinness back then
was a bit of a fascist,
00:20:38.500 --> 00:20:41.320
and he didn't want him to
actually publish papers.
00:20:41.320 --> 00:20:45.330
And so what he had to do is
to use a fake name to do that.
00:20:45.330 --> 00:20:50.650
And he was not very creative,
and he used a name "student."
00:20:50.650 --> 00:20:52.720
Because I guess he
was a student of life.
00:20:52.720 --> 00:20:55.990
And so here's the guy, actually.
00:20:55.990 --> 00:20:57.850
So back in 1908,
it was actually not
00:20:57.850 --> 00:21:01.270
difficult to put your
name or your pen name
00:21:01.270 --> 00:21:03.340
on a distribution.
00:21:03.340 --> 00:21:05.620
So what does this
thing look like?
00:21:05.620 --> 00:21:09.620
How does it compare to the
standard normal distribution?
00:21:09.620 --> 00:21:12.117
You think it's going to have
heavier or lighter tails
00:21:12.117 --> 00:21:13.700
compared to the
standard distribution,
00:21:13.700 --> 00:21:17.530
the Gaussian distribution?
00:21:17.530 --> 00:21:21.640
Yeah, because there is extra
uncertainty in the denominator,
00:21:21.640 --> 00:21:25.090
so it's actually going to make
things wiggle a little wider.
00:21:25.090 --> 00:21:26.680
So let's start with
a reference, which
00:21:26.680 --> 00:21:29.030
is the standard
normal distribution.
00:21:29.030 --> 00:21:31.300
So that's my usual
bell-shaped curve.
00:21:31.300 --> 00:21:33.490
And this is actually
the t-distribution
00:21:33.490 --> 00:21:35.440
with 50 degrees of freedom.
00:21:35.440 --> 00:21:37.930
So right now, that's probably
where you should just
00:21:37.930 --> 00:21:39.823
stand up and leave,
because you're like,
00:21:39.823 --> 00:21:40.990
why are we wasting our time?
00:21:40.990 --> 00:21:43.750
Those are actually pretty much
the same thing, and it is true.
00:21:43.750 --> 00:21:46.720
If you have 50 observations,
both the central limit
00:21:46.720 --> 00:21:49.210
theorem-- so here one of the
things that you need to know
00:21:49.210 --> 00:21:54.660
is that if I want to talk about
t-distribution for, say, eight
00:21:54.660 --> 00:21:57.270
observations, I need those
observations to be Gaussian
00:21:57.270 --> 00:21:57.780
for real.
00:21:57.780 --> 00:21:59.530
There's no central
limit theorem happening
00:21:59.530 --> 00:22:00.732
at eight observations.
00:22:00.732 --> 00:22:02.190
But really, what
this is telling me
00:22:02.190 --> 00:22:04.148
is not that the central
limit theorem kicks in.
00:22:04.148 --> 00:22:07.320
It's telling me what are the
asymptotics that kick in?
00:22:13.620 --> 00:22:15.530
The law of large numbers, right?
00:22:15.530 --> 00:22:19.260
This is exactly this guy.
00:22:19.260 --> 00:22:21.530
That's here.
00:22:21.530 --> 00:22:24.530
When I write this statement,
what this picture is really
00:22:24.530 --> 00:22:28.190
telling us is that for n
equal to 50, I'm at the limit
00:22:28.190 --> 00:22:29.720
already almost.
00:22:29.720 --> 00:22:32.540
There's virtually no
difference between using
00:22:32.540 --> 00:22:36.860
the left-hand side or
using sigma squared.
00:22:36.860 --> 00:22:38.270
And now I start reducing.
00:22:38.270 --> 00:22:39.700
40, I'm still pretty good.
00:22:39.700 --> 00:22:41.690
We can start seeing that
this thing is actually
00:22:41.690 --> 00:22:43.010
losing some mass
on top, and that's
00:22:43.010 --> 00:22:44.843
because it's actually
pushing it to the left
00:22:44.843 --> 00:22:46.700
and to the right in the tails.
00:22:46.700 --> 00:22:49.940
And then we keep going,
keep going, keep going.
00:22:49.940 --> 00:22:50.943
So that's at 10.
00:22:50.943 --> 00:22:53.110
When you're at 10, there's
not much of a difference.
00:22:53.110 --> 00:22:54.860
And so you can start
seeing difference
00:22:54.860 --> 00:22:57.320
when you're at
five, for example.
00:22:57.320 --> 00:22:59.018
You can see the
tails become heavier.
00:22:59.018 --> 00:23:01.310
And the effect of this is
that when I'm going to build,
00:23:01.310 --> 00:23:05.930
for example, a confidence
interval to put the same amount
00:23:05.930 --> 00:23:07.700
of mass to the right
of some number--
00:23:07.700 --> 00:23:09.950
let's say I'm going to look
at this q alpha over 2--
00:23:09.950 --> 00:23:11.742
I'm going to have to
go much farther, which
00:23:11.742 --> 00:23:17.120
is going to result in much
wider confidence intervals
00:23:17.120 --> 00:23:20.890
Then 4, 3, 2, 1.
00:23:20.890 --> 00:23:22.530
So that's the t1.
00:23:22.530 --> 00:23:24.560
Obviously that's the worst.
00:23:24.560 --> 00:23:30.510
And if you ever use
the t1 distribution,
00:23:30.510 --> 00:23:33.830
please ask yourself, why in the
world are you doing statistics
00:23:33.830 --> 00:23:35.360
based on one observation?
00:23:38.570 --> 00:23:41.460
But that's basically what it is.
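The heavier tails can be checked directly from the quantiles; a sketch assuming SciPy is available:

from scipy import stats

# 97.5% quantiles: the t quantile shrinks toward the Gaussian 1.96
# as the degrees of freedom grow; for small d the heavier tails
# force wider confidence intervals.
print("normal", stats.norm.ppf(0.975))  # about 1.96
for d in [1, 2, 3, 4, 5, 10, 40, 50]:
    print(d, stats.t.ppf(0.975, d))     # 12.71 at d = 1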
00:23:41.460 --> 00:23:44.980
So now that we have
this t-distribution,
00:23:44.980 --> 00:23:48.000
we can define a more
sophisticated test
00:23:48.000 --> 00:23:50.640
than just take your
favorite estimator
00:23:50.640 --> 00:23:53.560
and see if it's far from the
value you're currently testing.
00:23:53.560 --> 00:23:57.360
That was our rationale
to build a test before.
00:23:57.360 --> 00:24:00.180
And the first test
that's non-trivial
00:24:00.180 --> 00:24:04.320
is a test that exploits the
fact that the maximum likelihood
00:24:04.320 --> 00:24:07.140
estimator, under some
technical condition,
00:24:07.140 --> 00:24:12.720
has a limit distribution
which is Gaussian with mean 0
00:24:12.720 --> 00:24:18.360
when properly centered and
a covariance matrix given
00:24:18.360 --> 00:24:19.993
by the Fisher
information matrix.
00:24:19.993 --> 00:24:21.660
Remember this Fisher
information matrix?
00:24:26.080 --> 00:24:29.870
And so this is the
setup that we have.
00:24:29.870 --> 00:24:31.190
So we have, again, an i.i.d.
00:24:31.190 --> 00:24:32.180
sample.
00:24:32.180 --> 00:24:35.570
Now I'm going to assume that I
have a d-dimensional parameter
00:24:35.570 --> 00:24:36.890
space, theta.
00:24:36.890 --> 00:24:39.740
And that's why I talk about
Fisher information matrix--
00:24:39.740 --> 00:24:41.350
and not just Fisher information.
00:24:41.350 --> 00:24:42.800
It's a number.
00:24:42.800 --> 00:24:45.420
And I'm going to
consider two hypotheses.
00:24:45.420 --> 00:24:52.730
So you're going to have h0,
theta is equal to theta 0.
00:24:52.730 --> 00:24:56.940
h1, theta is not
equal to theta 0.
00:24:56.940 --> 00:25:00.210
And this is basically
what we thought
00:25:00.210 --> 00:25:05.010
when we said, are we testing
if a coin is fair or unfair.
00:25:05.010 --> 00:25:09.390
So fair was p equals 1/2, and
unfair was p different from 1/2.
00:25:09.390 --> 00:25:13.860
And here I'm just making
my life a bit easier.
00:25:13.860 --> 00:25:16.860
So now, I have this
maximum likelihood estimate
00:25:16.860 --> 00:25:17.940
that I can construct.
00:25:17.940 --> 00:25:20.500
Because let's say I
know what p theta is,
00:25:20.500 --> 00:25:23.250
and so I can build a maximum
likelihood estimator.
00:25:23.250 --> 00:25:26.340
And I'm going to assume that
these technical conditions that
00:25:26.340 --> 00:25:29.010
ensure that this maximum
likelihood properly
00:25:29.010 --> 00:25:35.920
standardized converges to some
Gaussian are actually satisfied,
00:25:35.920 --> 00:25:38.250
and so this thing
is actually true.
00:25:38.250 --> 00:25:41.870
So the theorem, the
way I stated it--
00:25:41.870 --> 00:25:44.550
if you're a little puzzled,
this is not the way I stated it.
00:25:44.550 --> 00:25:47.580
And the first time, the way we
stated it was that theta hat
00:25:47.580 --> 00:25:51.420
mle minus theta
naught-- so here I'm
00:25:51.420 --> 00:25:53.420
going to place myself
under the null hypothesis,
00:25:53.420 --> 00:25:58.010
so here I'm going
to say under h0.
00:25:58.010 --> 00:26:01.050
And honestly, if you have
any exercise on tests,
00:26:01.050 --> 00:26:03.060
that's the way that
it should start.
00:26:03.060 --> 00:26:05.220
What is the
distribution under h0?
00:26:05.220 --> 00:26:08.610
Because otherwise you don't
know what this guy should be.
00:26:08.610 --> 00:26:10.110
So you have this,
and what we showed
00:26:10.110 --> 00:26:12.630
is that this thing was going
in distribution as n goes
00:26:12.630 --> 00:26:15.900
to infinity to some
normal with mean 0
00:26:15.900 --> 00:26:19.530
and covariance matrix,
which was i of theta,
00:26:19.530 --> 00:26:21.120
which was here for
the true parameter.
00:26:21.120 --> 00:26:22.590
But here I'm under
h0, so there's
00:26:22.590 --> 00:26:24.930
only one true parameter,
which is theta 0.
00:26:32.590 --> 00:26:36.830
This was our limiting
central limit theorem for--
00:26:36.830 --> 00:26:38.830
I mean, it's not really a
central limit theorem; it's a
00:26:38.830 --> 00:26:43.720
limit theorem for the
maximum likelihood estimator.
00:26:43.720 --> 00:26:47.230
Everybody remembers that part?
00:26:47.230 --> 00:26:50.830
The line before said, under
technical conditions, I guess.
00:26:50.830 --> 00:26:53.020
So now, it's not really
stated in the same way.
00:26:53.020 --> 00:26:54.850
If you look at
what's on the slide,
00:26:54.850 --> 00:26:57.550
here I don't have the
Fisher information matrix,
00:26:57.550 --> 00:26:59.290
but I really have
the identity of R^d.
00:27:02.860 --> 00:27:05.590
If I have a random
variable x, which
00:27:05.590 --> 00:27:10.580
has some covariance
matrix sigma,
00:27:10.580 --> 00:27:12.680
how do I turn this thing
into something that
00:27:12.680 --> 00:27:15.320
has covariance matrix identity?
00:27:15.320 --> 00:27:20.120
So if this was a sigma squared,
well, the thing I would do
00:27:20.120 --> 00:27:21.890
would be divide by
sigma, and then I
00:27:21.890 --> 00:27:24.620
would have a 1,
which is also known
00:27:24.620 --> 00:27:28.360
as the identity matrix of R^1.
00:27:28.360 --> 00:27:30.720
Now, what is this?
00:27:30.720 --> 00:27:32.880
This was root of sigma squared.
00:27:32.880 --> 00:27:35.400
So what I'm looking
for is the equivalent
00:27:35.400 --> 00:27:40.300
of taking sigma and dividing
by the square root of sigma,
00:27:40.300 --> 00:27:40.800
which--
00:27:40.800 --> 00:27:42.050
obviously those are matrices--
00:27:42.050 --> 00:27:43.988
I'm certainly not allowed to do.
00:27:43.988 --> 00:27:45.780
And so what I'm going
to do is I'm actually
00:27:45.780 --> 00:27:48.360
going to do the following.
00:27:48.360 --> 00:27:51.330
Sigma 1 over root
of sigma squared
00:27:51.330 --> 00:27:55.670
can be written as sigma
to the negative 1/2.
00:27:55.670 --> 00:27:58.900
And this is actually
the same thing here.
00:27:58.900 --> 00:28:02.180
So I'm going to write it as
sigma to the negative 1/2,
00:28:02.180 --> 00:28:06.730
and now this guy is
actually well-defined.
00:28:06.730 --> 00:28:08.808
So this is a positive
symmetric matrix,
00:28:08.808 --> 00:28:10.600
and you can actually
define the square root
00:28:10.600 --> 00:28:16.340
by just taking the square
root of its eigenvalues,
00:28:16.340 --> 00:28:17.480
for example.
00:28:17.480 --> 00:28:23.487
And so you get sigma to the negative
1/2 times x, and that follows N(0, identity).
00:28:26.170 --> 00:28:30.790
And in general, I'm
going to see something
00:28:30.790 --> 00:28:34.630
that looks like sigma to the
negative 1/2 times sigma
00:28:34.630 --> 00:28:37.060
times sigma to the negative 1/2.
00:28:37.060 --> 00:28:40.930
And I have minus 1/2
plus 1 minus 1/2.
00:28:40.930 --> 00:28:45.330
This whole thing collapses to 0,
and it's actually the identity.
00:28:45.330 --> 00:28:47.450
So that's the actual rule.
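A sketch of this maneuver with NumPy (the matrix is an arbitrary positive definite example):

import numpy as np

# Build sigma^(-1/2) from the eigendecomposition of a symmetric
# positive definite matrix: take eigenvalues to the power -1/2.
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
eigvals, eigvecs = np.linalg.eigh(sigma)
sigma_inv_half = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Check the rule: sigma^(-1/2) sigma sigma^(-1/2) is the identity.
print(np.round(sigma_inv_half @ sigma @ sigma_inv_half, 10))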
00:28:47.450 --> 00:28:52.410
So if you're not familiar, this
is basic multivariate Gaussian
00:28:52.410 --> 00:28:54.640
distribution computations.
00:28:54.640 --> 00:28:57.240
Take a look at it.
00:28:57.240 --> 00:28:59.220
If you feel like you
don't need to look at it
00:28:59.220 --> 00:29:03.250
but you know the basic maneuver,
it's fine as well.
00:29:03.250 --> 00:29:05.320
We're not going to go
much deeper into that,
00:29:05.320 --> 00:29:07.020
but those are part
of the thing that
00:29:07.020 --> 00:29:09.420
are sort of standard
manipulations
00:29:09.420 --> 00:29:11.250
about standard Gaussian vectors.
00:29:11.250 --> 00:29:13.710
Because obviously,
standard Gaussian vectors
00:29:13.710 --> 00:29:17.820
arise from this theorem a lot.
00:29:17.820 --> 00:29:22.040
So now I pre-multiply by
sigma to the minus 1/2.
00:29:22.040 --> 00:29:24.630
Now of course, I'm doing all
of this in the asymptotics,
00:29:24.630 --> 00:29:26.450
and so I have this effect.
00:29:26.450 --> 00:29:29.640
So if I pre-multiply
everything by sigma to the 1/2,
00:29:29.640 --> 00:29:34.680
sigma being the Fisher
information matrix at theta 0,
00:29:34.680 --> 00:29:38.530
then this is actually equivalent
to saying that square root
00:29:38.530 --> 00:29:39.030
of n--
00:29:43.630 --> 00:29:51.510
so now i of theta naught
plays the role of sigma--
00:29:51.510 --> 00:29:59.620
times theta hat mle minus
theta naught goes in distribution
00:29:59.620 --> 00:30:06.600
as n goes to infinity to some
multivariate standard Gaussian
00:30:06.600 --> 00:30:09.485
N(0, identity of R^d).
00:30:09.485 --> 00:30:10.860
And here, to make
sure that we're
00:30:10.860 --> 00:30:13.080
talking about a
multivariate distribution,
00:30:13.080 --> 00:30:16.040
I can put a d here--
00:30:16.040 --> 00:30:18.420
so just so we know we're
talking about the multivariate,
00:30:18.420 --> 00:30:20.170
though it's pretty
clear from the context,
00:30:20.170 --> 00:30:23.010
since the covariance matrix
is actually a matrix and not
00:30:23.010 --> 00:30:23.840
a number.
00:30:23.840 --> 00:30:24.570
Michael?
00:30:24.570 --> 00:30:26.040
AUDIENCE: [INAUDIBLE].
00:30:29.623 --> 00:30:30.790
PHILIPPE RIGOLLET: Oh, yeah.
00:30:30.790 --> 00:30:31.040
Right.
00:30:31.040 --> 00:30:31.540
Thanks.
00:30:34.313 --> 00:30:35.230
So yeah, you're right.
00:30:35.230 --> 00:30:39.550
So that's a minus
and that's a plus.
00:30:39.550 --> 00:30:40.420
Thanks.
00:30:40.420 --> 00:30:47.050
So yeah, anybody has
a way to remember
00:30:47.050 --> 00:30:49.390
whether it's inverse Fisher
information or Fisher
00:30:49.390 --> 00:30:54.100
information as a variance
other than just learning it?
00:30:54.100 --> 00:30:58.670
It is called information,
so it's really telling me
00:30:58.670 --> 00:31:00.620
how much information I have.
00:31:00.620 --> 00:31:02.450
So when a variance
increases, I'm
00:31:02.450 --> 00:31:04.430
getting less and
less information,
00:31:04.430 --> 00:31:08.175
and so this thing should
actually be 1 over a variance.
00:31:08.175 --> 00:31:10.550
The notion of information is
1 over a notion of variance.
00:31:13.320 --> 00:31:19.370
So now I just wrote this guy
like this, and the reason
00:31:19.370 --> 00:31:21.710
why I did this is
because now everything
00:31:21.710 --> 00:31:26.780
on the right-hand side does not
depend on any unknown parameter.
00:31:26.780 --> 00:31:30.080
There's 0 and identity.
00:31:30.080 --> 00:31:33.755
Those two things are
just absolute numbers
00:31:33.755 --> 00:31:38.420
or absolute quantities,
which means that this thing--
00:31:38.420 --> 00:31:42.500
I call this quantity here--
00:31:42.500 --> 00:31:44.520
what was the name that I used?
00:31:44.520 --> 00:31:47.030
Started with a "p."
00:31:47.030 --> 00:31:47.870
Pivotal.
00:31:47.870 --> 00:31:50.990
So this is a pivotal
quantity, meaning
00:31:50.990 --> 00:31:53.900
that its distribution, at
least asymptotic distribution,
00:31:53.900 --> 00:31:56.270
does not depend on
any unknown parameter.
00:31:56.270 --> 00:32:00.610
Moreover, it is
indeed a statistic,
00:32:00.610 --> 00:32:03.010
because I can
actually compute it.
00:32:03.010 --> 00:32:05.950
I know theta 0 and I
know theta hat mle.
00:32:05.950 --> 00:32:08.380
One thing that I did,
and you should actually
00:32:08.380 --> 00:32:11.020
complain about this,
is on the board
00:32:11.020 --> 00:32:15.730
I actually used i of theta naught.
00:32:15.730 --> 00:32:20.380
And on the slides, it
says i of theta hat.
00:32:20.380 --> 00:32:22.900
And it's exactly the same
thing that we did before.
00:32:22.900 --> 00:32:26.110
Do I want to use the
variance as a way for me
00:32:26.110 --> 00:32:29.320
to check whether I'm under
the right assumption or not?
00:32:29.320 --> 00:32:31.270
Or do I actually want
to leave that part
00:32:31.270 --> 00:32:33.520
and just plug in the theta
hat mle, which should
00:32:33.520 --> 00:32:36.310
go to the true one eventually?
00:32:36.310 --> 00:32:39.610
Or do I actually want to
just plug in the theta 0?
00:32:39.610 --> 00:32:41.740
So this is exactly
playing the same role
00:32:41.740 --> 00:32:45.460
as whether I wanted to
see square root of Xn bar
00:32:45.460 --> 00:32:48.970
1 minus Xn bar in the
denominator of my test
00:32:48.970 --> 00:32:55.060
statistic for p, or if I wanted
to see square root of 0.5,
00:32:55.060 --> 00:32:59.980
1 minus 0.5 when I was
testing if p was equal to 0.5.
00:32:59.980 --> 00:33:03.070
So this is really a choice
that's left up to you,
00:33:03.070 --> 00:33:06.150
and that's something you
can really choose the two.
00:33:06.150 --> 00:33:09.710
And as we said, maybe this
guy is slightly more precise,
00:33:09.710 --> 00:33:11.980
but it's not going
to extend to the case
00:33:11.980 --> 00:33:15.383
where theta 0 is not reduced
to one single number.
00:33:20.950 --> 00:33:22.660
Any questions?
00:33:22.660 --> 00:33:26.140
So now we have our pivotal
distribution, so from there
00:33:26.140 --> 00:33:29.140
this is going to be
my test statistic.
00:33:29.140 --> 00:33:31.090
I'm going to use this
as a test statistic
00:33:31.090 --> 00:33:35.800
and declare that if
this thing is too large,
00:33:35.800 --> 00:33:36.910
in absolute value--
00:33:36.910 --> 00:33:41.020
because this is really a way to
quantify how far theta hat is
00:33:41.020 --> 00:33:41.790
from theta 0.
00:33:41.790 --> 00:33:44.230
And since theta hat should be
close to the true one, when
00:33:44.230 --> 00:33:45.813
this thing is large
in absolute value,
00:33:45.813 --> 00:33:50.180
it means that the true theta
should be far from theta 0.
00:33:50.180 --> 00:33:56.250
So this is my new
test statistic.
00:33:56.250 --> 00:33:59.450
Now, I said it should be
far, but this is a vector.
00:33:59.450 --> 00:34:02.540
So if I want a vector to be
far, two vectors to be far,
00:34:02.540 --> 00:34:04.340
I measure their norm.
00:34:04.340 --> 00:34:07.520
And so I'm going to form the
Euclidean norm of this guy.
00:34:07.520 --> 00:34:10.600
So if I look at the
Euclidean norm of n--
00:34:14.639 --> 00:34:16.510
and Euclidean norm
is the one you know--
00:34:22.840 --> 00:34:25.139
I'm going to take its square.
00:34:25.139 --> 00:34:26.949
Let me now put a 2 here.
00:34:26.949 --> 00:34:28.739
So that's just the
Euclidean norm,
00:34:28.739 --> 00:34:36.679
and so the norm of vector
x is just x transpose x.
00:34:36.679 --> 00:34:40.330
In the slides, the transpose
is denoted by prime.
00:34:40.330 --> 00:34:41.429
Wow, that's hard to say.
00:34:41.429 --> 00:34:42.420
Put prime in quotes.
00:34:48.510 --> 00:34:50.580
That's a standard
convention in statistics.
00:34:50.580 --> 00:34:53.260
They put prime for transpose.
00:34:53.260 --> 00:34:56.139
Everybody knows what
the transpose is?
00:34:56.139 --> 00:34:58.070
So I just make it flat
and I do it like this,
00:34:58.070 --> 00:34:59.528
and then that means
that's actually
00:34:59.528 --> 00:35:03.620
equal to the sum of the
coordinates Xi squared.
00:35:06.160 --> 00:35:08.350
And that's what you know.
00:35:08.350 --> 00:35:10.880
But here, I'm just writing
it in terms of vectors.
00:35:10.880 --> 00:35:13.100
And so when I want to write
this, this is equivalent,
00:35:13.100 --> 00:35:14.320
this is equal to--
00:35:14.320 --> 00:35:17.500
well, the square root of n is
going to pick up the square.
00:35:17.500 --> 00:35:20.546
So I get square root of
n times square root of n.
00:35:20.546 --> 00:35:23.210
So each of these is a power 1/2.
00:35:23.210 --> 00:35:25.570
So 1/2 plus 1/2 is
going to give me 1,
00:35:25.570 --> 00:35:29.360
and so I get n times theta
hat mle minus theta naught, transpose.
00:35:29.360 --> 00:35:32.710
And then I have i of theta naught.
00:35:32.710 --> 00:35:37.630
And then I get theta
hat mle minus theta naught.
00:35:37.630 --> 00:35:41.680
And so by definition, I'm
going to say that this
00:35:41.680 --> 00:35:45.320
is my test statistic Tn.
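As an illustration of the statistic (assuming NumPy; n and the Fisher information below are made-up values, and theta hat echoes the example later in the lecture):

import numpy as np

n = 200
theta_hat = np.array([1.2, 0.9, 2.1])
theta_0 = np.array([1.0, 1.0, 2.0])
fisher = np.eye(3)  # stand-in for I(theta_0)

# Tn = n * (theta_hat - theta_0)' I(theta_0) (theta_hat - theta_0)
diff = theta_hat - theta_0
t_n = n * diff @ fisher @ diff
print(t_n)  # compare to a chi-squared(3) quantile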
00:35:45.320 --> 00:35:50.480
And now I'm going to have a test
that rejects if Tn is large,
00:35:50.480 --> 00:35:53.720
because Tn is really measuring
the distance between theta hat
00:35:53.720 --> 00:35:55.670
and theta 0.
00:35:55.670 --> 00:36:20.530
So my test now is going
to be psi, which rejects.
00:36:20.530 --> 00:36:27.300
So it says 1 if Tn is larger
than some threshold T.
00:36:27.300 --> 00:36:30.060
And how do I pick this T?
00:36:30.060 --> 00:36:32.210
Well, by controlling
my type I error--
00:36:32.210 --> 00:36:35.730
sorry, the c by controlling
my type I error.
00:36:35.730 --> 00:36:44.300
So to choose c, what
we have to check
00:36:44.300 --> 00:36:47.670
is that p under theta naught--
00:36:47.670 --> 00:36:49.460
so here it's theta naught--
00:36:49.460 --> 00:36:55.550
that I reject so that
psi is equal to 1.
00:36:55.550 --> 00:36:58.400
I want this to be
equal to alpha, right?
00:36:58.400 --> 00:37:01.010
That's how I maximize
my type I error
00:37:01.010 --> 00:37:04.550
under the budget that's actually
given to me, which is alpha.
00:37:04.550 --> 00:37:12.910
So that's actually equivalent
to checking whether p naught of Tn
00:37:12.910 --> 00:37:13.870
larger than c is equal to alpha.
00:37:19.270 --> 00:37:23.150
And so if I want to find
the c, all I need to know
00:37:23.150 --> 00:37:25.670
is what is the
distribution of Tn when
00:37:25.670 --> 00:37:28.400
theta is equal to theta naught?
00:37:28.400 --> 00:37:31.820
Whatever this distribution is--
maybe it has some weird density
00:37:31.820 --> 00:37:32.870
like this--
00:37:32.870 --> 00:37:35.120
whatever this
distribution is, I'm
00:37:35.120 --> 00:37:37.400
just going to be able
to pick this number,
00:37:37.400 --> 00:37:41.150
and I'm going to take this
quantile alpha, here alpha,
00:37:41.150 --> 00:37:44.030
and I'm going to reject
if I'm larger than this quantile--
00:37:44.030 --> 00:37:45.570
whatever this guy is.
00:37:45.570 --> 00:37:47.510
So to be able to do
that, I need to know
00:37:47.510 --> 00:37:56.890
what is the distribution of Tn
when theta is equal to theta 0.
00:37:56.890 --> 00:38:00.270
What is this distribution?
00:38:00.270 --> 00:38:02.842
What is Tn?
00:38:02.842 --> 00:38:08.720
It's the norm squared
of this vector.
00:38:08.720 --> 00:38:09.740
What is this vector?
00:38:09.740 --> 00:38:12.020
What is the asymptotic
distribution of this vector?
00:38:17.912 --> 00:38:18.894
Yes?
00:38:18.894 --> 00:38:21.650
AUDIENCE: [INAUDIBLE].
00:38:21.650 --> 00:38:23.400
PHILIPPE RIGOLLET:
Just look one board up.
00:38:23.400 --> 00:38:24.900
What is this
asymptotic distribution
00:38:24.900 --> 00:38:27.860
of the vector for which we're
taking the norm squared?
00:38:27.860 --> 00:38:30.560
It's right here.
00:38:30.560 --> 00:38:33.910
It's a standard
Gaussian multivariate.
00:38:33.910 --> 00:38:36.460
So when I look at
the norm squared--
00:38:36.460 --> 00:38:45.400
so if z is a standard
Gaussian multivariate,
00:38:45.400 --> 00:38:51.880
then the norm of z squared, by
definition of the norm squared,
00:38:51.880 --> 00:38:54.340
is the sum of the Zi squared.
00:39:01.790 --> 00:39:04.930
That's just the
definition of the norm.
00:39:04.930 --> 00:39:06.973
But what is this distribution?
00:39:06.973 --> 00:39:07.890
AUDIENCE: Chi-squared.
00:39:07.890 --> 00:39:09.515
PHILIPPE RIGOLLET:
That's a chi-square,
00:39:09.515 --> 00:39:12.750
because those guys
are all of variance 1.
00:39:12.750 --> 00:39:15.230
That's what the
diagonal tells me--
00:39:15.230 --> 00:39:15.805
only ones.
00:39:15.805 --> 00:39:18.180
And they're independent because
they have all these zeros
00:39:18.180 --> 00:39:20.320
outside of the diagonal.
00:39:20.320 --> 00:39:23.710
So really, this follows some
chi-squared distribution.
00:39:23.710 --> 00:39:25.260
How many degrees of freedom?
00:39:25.260 --> 00:39:30.560
Well, the number of
them that I sum, d.
00:39:30.560 --> 00:39:33.980
So now I have found
the distribution of Tn
00:39:33.980 --> 00:39:35.590
under this guy.
00:39:35.590 --> 00:39:41.120
And that's true because
this is true under h0.
00:39:41.120 --> 00:39:44.480
If I was not under
h0, again, I would
00:39:44.480 --> 00:39:46.100
need to take another guy here.
00:39:49.430 --> 00:39:52.190
How did I use the fact that
theta is equal to theta 0
00:39:52.190 --> 00:39:54.640
when I centered by theta 0?
00:39:54.640 --> 00:39:57.280
And that was very important.
00:39:57.280 --> 00:40:01.090
So now what I know is that
this is really equal--
00:40:01.090 --> 00:40:02.920
why did I put 0 here?
00:40:05.650 --> 00:40:10.390
So this here is actually equal.
00:40:10.390 --> 00:40:23.843
So in the end, I need c
such that the probability--
00:40:23.843 --> 00:40:25.510
and here I'm not going
to put a theta 0.
00:40:25.510 --> 00:40:26.770
I'm just talking
about the distribution
00:40:26.770 --> 00:40:29.080
of the random variable that
I'm going to put in there.
00:40:29.080 --> 00:40:31.570
It's a chi-square with d
degrees of freedom [INAUDIBLE]
00:40:31.570 --> 00:40:32.380
is equal to alpha.
00:40:35.200 --> 00:40:39.160
I just replaced the
fact that this guy, Tn,
00:40:39.160 --> 00:40:41.230
under the distribution
was just a chi-square.
00:40:41.230 --> 00:40:42.647
And this distribution
here is just
00:40:42.647 --> 00:40:44.890
really referring to the
distribution of a chi-square.
00:40:44.890 --> 00:40:46.820
There's no parameters here.
00:40:46.820 --> 00:40:51.390
And now, that means that I look
at my chi-square distribution.
00:40:51.390 --> 00:40:55.170
It sort of looks like this.
00:40:55.170 --> 00:40:59.940
And I'm going to
pick some alpha here,
00:40:59.940 --> 00:41:02.040
and I need to read
this number q alpha.
00:41:04.800 --> 00:41:09.010
And so here what I need to do
is to pick this q alpha here,
00:41:09.010 --> 00:41:11.780
for c.
00:41:11.780 --> 00:41:28.120
So take c to be q alpha, the
quantile of order 1 minus
00:41:28.120 --> 00:41:31.240
alpha of a chi-squared
distribution
00:41:31.240 --> 00:41:32.550
with d degrees of freedom.
00:41:32.550 --> 00:41:33.910
And why do I say 1 minus alpha?
00:41:33.910 --> 00:41:36.190
Because again, the
quantiles are usually
00:41:36.190 --> 00:41:41.680
referring to the area that's
to the left of them by--
00:41:41.680 --> 00:41:47.750
well, actually, it's
by a convention.
00:41:47.750 --> 00:41:52.460
However, in statistics, we
only care about the right tail
00:41:52.460 --> 00:41:55.000
usually, so it's not
very convenient for us.
00:41:55.000 --> 00:41:56.510
And that's why
rather than calling
00:41:56.510 --> 00:42:01.010
this guy q sub 1 minus alpha all
the time, I write it q alpha.
00:42:01.010 --> 00:42:03.890
So now you have
this q alpha, which
00:42:03.890 --> 00:42:08.600
is the 1 minus alpha quantile,
or quantile of order 1 minus
00:42:08.600 --> 00:42:10.680
alpha of chi squared d.
00:42:10.680 --> 00:42:12.770
And so now I need
to use a table.
00:42:12.770 --> 00:42:15.680
For each d, this thing is going
to take a different value,
00:42:15.680 --> 00:42:18.950
and this is why I cannot
just spit out a number to you
00:42:18.950 --> 00:42:21.650
like I spit out 1.96.
00:42:21.650 --> 00:42:24.068
Because if I were
able to do that,
00:42:24.068 --> 00:42:25.610
that would mean that
I would remember
00:42:25.610 --> 00:42:30.760
an entire column of this table
for each possible value of d,
00:42:30.760 --> 00:42:32.830
and that I just don't know.
00:42:32.830 --> 00:42:34.680
So you need just
to look at tables,
00:42:34.680 --> 00:42:36.870
and this is what
it will tell you.
00:42:36.870 --> 00:42:38.610
Often software
will do that, too.
00:42:38.610 --> 00:42:41.600
You don't have to
search through tables.
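For instance, with SciPy the table lookup is one call (a sketch; alpha and the list of d values are illustrative):

from scipy import stats

# c is the quantile of order 1 - alpha of chi squared with d
# degrees of freedom: reject when Tn exceeds it.
alpha = 0.05
for d in [1, 2, 3, 5, 10]:
    print(d, stats.chi2.ppf(1 - alpha, d))  # 3.84 at d = 1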
00:42:41.600 --> 00:42:46.400
And so just as a remark:
this test, Wald's test,
00:42:46.400 --> 00:42:50.040
is also valid when I have
this sort of other alternative
00:42:50.040 --> 00:42:51.400
that I could see quite a lot--
00:42:51.400 --> 00:42:55.670
if I actually have what's
called a one-sided alternative.
00:42:55.670 --> 00:42:58.280
By the way, this is
called Wald's test--
00:42:58.280 --> 00:43:01.250
so taking Tn to be this thing.
00:43:09.420 --> 00:43:12.980
So this is Wald's test.
00:43:12.980 --> 00:43:15.170
Abraham Wald was a
famous statistician
00:43:15.170 --> 00:43:22.768
in the early 20th century,
who actually was at Columbia
00:43:22.768 --> 00:43:26.226
for quite some time.
00:43:26.226 --> 00:43:27.750
And that was
actually at the time
00:43:27.750 --> 00:43:33.360
where statistics were getting
very popular in India,
00:43:33.360 --> 00:43:35.280
so he was actually
traveling all over India
00:43:35.280 --> 00:43:37.950
in some dinky planes.
00:43:37.950 --> 00:43:41.460
And one of them crashed,
and that's how he died--
00:43:41.460 --> 00:43:42.420
pretty young.
00:43:42.420 --> 00:43:45.060
But actually, there's a
huge school of statistics
00:43:45.060 --> 00:43:47.220
now in India thanks to him.
00:43:47.220 --> 00:43:49.110
There's the Indian
Statistical Institute,
00:43:49.110 --> 00:43:51.290
which is actually
a pretty big thing
00:43:51.290 --> 00:43:53.610
and trains the best
statisticians.
00:43:53.610 --> 00:43:55.610
So this is called Wald's
test, and it's actually
00:43:55.610 --> 00:43:56.527
a pretty popular test.
00:43:56.527 --> 00:43:59.360
Let's just look back a second.
00:43:59.360 --> 00:44:01.280
So you can do the
other alternatives,
00:44:01.280 --> 00:44:03.830
as I said, and for
the other alternatives
00:44:03.830 --> 00:44:06.260
you can actually do this
trick where you put theta 0 as
00:44:06.260 --> 00:44:08.780
well, as long as you
take the theta 0 that's
00:44:08.780 --> 00:44:10.550
the closest to the alternative.
00:44:10.550 --> 00:44:13.190
You just basically take the
one that's the least favorable
00:44:13.190 --> 00:44:13.690
to you--
00:44:16.540 --> 00:44:18.160
the alternative, I mean.
00:44:18.160 --> 00:44:21.540
So what is this thing doing?
00:44:21.540 --> 00:44:25.110
If you did not know anything
about statistics and I told
00:44:25.110 --> 00:44:26.950
you here's a vector--
00:44:26.950 --> 00:44:29.190
that's the mle
vector, theta hat mle.
00:44:32.250 --> 00:44:36.315
So let's say this theta hat
mle takes the values, say--
00:44:44.520 --> 00:44:57.430
so let's say theta hat mle takes
values, say, 1.2, 0.9, and 2.1.
00:44:57.430 --> 00:45:06.880
And then testing h0, theta is
equal to 1, 1, 2, versus theta
00:45:06.880 --> 00:45:08.950
is not equal to the same vector.
00:45:08.950 --> 00:45:11.110
That's what I'm testing.
00:45:11.110 --> 00:45:13.475
So you compute this
thing and you find this.
00:45:13.475 --> 00:45:14.850
If you don't know
any statistics,
00:45:14.850 --> 00:45:15.892
what are you going to do?
00:45:18.280 --> 00:45:21.400
You're just going to check
if this guy is close to that guy,
00:45:21.400 --> 00:45:24.370
and probably what you're going
to do is compute something that
00:45:24.370 --> 00:45:27.240
looks like the norm squared
between those guys-- so
00:45:27.240 --> 00:45:28.120
the sum.
00:45:28.120 --> 00:45:31.690
So you're going to do
1.2 minus 1 squared
00:45:31.690 --> 00:45:38.740
plus 0.9 minus 1 squared
plus 2.1 minus 2 squared
00:45:38.740 --> 00:45:41.090
and check if this
number is large or not.
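NOTE (editor): A minimal numerical sketch of this comparison, assuming Python with numpy; the matrix I0 and the sample size n are made-up illustrative values, not from the lecture:
    import numpy as np
    theta_hat = np.array([1.2, 0.9, 2.1])  # the mle from the example
    theta0 = np.array([1.0, 1.0, 2.0])     # the null value
    diff = theta_hat - theta0
    naive = diff @ diff                    # plain squared norm: 0.06
    I0 = np.array([[2.0, 0.3, 0.0],        # hypothetical Fisher
                   [0.3, 1.0, 0.2],        # information I(theta0)
                   [0.0, 0.2, 4.0]])
    n = 100                                # hypothetical sample size
    wald = n * diff @ I0 @ diff            # Wald statistic: 11.4
    # compare wald to the 1 - alpha quantile of chi-squared, d = 3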
00:45:41.090 --> 00:45:44.140
Maybe you are going to apply
some stats to try to understand
00:45:44.140 --> 00:45:46.930
how those things are,
but this is basically
00:45:46.930 --> 00:45:49.760
what you are going
to want to do.
00:45:49.760 --> 00:45:52.670
What Wald's test
is telling you is
00:45:52.670 --> 00:45:56.830
that this average is actually
not what you should be doing.
00:45:56.830 --> 00:45:59.110
It's telling you that
you should have some sort
00:45:59.110 --> 00:46:00.170
of a weighted average.
00:46:00.170 --> 00:46:01.837
Actually, it would
be a weighted average
00:46:01.837 --> 00:46:06.730
if I was guaranteed that
my Fisher information
00:46:06.730 --> 00:46:08.090
matrix was diagonal.
00:46:08.090 --> 00:46:10.900
If my Fisher information
matrix is diagonal,
00:46:10.900 --> 00:46:13.790
looking at this
number minus this guy,
00:46:13.790 --> 00:46:16.405
transpose i, and then
this guy minus this,
00:46:16.405 --> 00:46:19.030
that would look like I have some
weight here, some weight here,
00:46:19.030 --> 00:46:19.905
and some weight here.
00:46:25.430 --> 00:46:29.190
Sorry, that's only three.
00:46:29.190 --> 00:46:32.880
So if it has non-zero numbers
on all of its nine entries,
00:46:32.880 --> 00:46:36.440
then what I'm going to
see is weird cross-terms.
00:46:36.440 --> 00:46:41.150
If I look at some vector
pre-multiplying this thing
00:46:41.150 --> 00:46:42.710
and post-multiplying
this thing--
00:46:42.710 --> 00:46:44.930
so if I look at something
that looks like this,
00:46:44.930 --> 00:46:51.200
x transpose, I of theta
not, x--
00:46:51.200 --> 00:46:56.270
think of x as being theta
hat mle minus theta--
00:46:56.270 --> 00:46:58.570
so if I look at what
this guy looks like,
00:46:58.570 --> 00:47:08.330
it's basically a sum over i and
j of xi xj times the (i, j) entry of I of theta not.
00:47:08.330 --> 00:47:11.440
And so if none of
those things are 0,
00:47:11.440 --> 00:47:14.400
you're not going to see a sum
of three terms that are squares,
00:47:14.400 --> 00:47:18.560
but you're going to see a
sum of nine cross-products.
00:47:18.560 --> 00:47:20.030
And it's just weird.
00:47:20.030 --> 00:47:21.920
This is not something standard.
00:47:21.920 --> 00:47:26.450
So what is Wald's
test doing for you?
00:47:26.450 --> 00:47:29.680
Well, it's saying,
I'm actually going
00:47:29.680 --> 00:47:32.283
to look at all the
directions all at once.
00:47:32.283 --> 00:47:33.700
Some of those
directions are going
00:47:33.700 --> 00:47:41.660
to have more or less variance,
i.e., less or more information.
00:47:41.660 --> 00:47:43.500
And so for those
guys, I'm actually
00:47:43.500 --> 00:47:45.360
going to use a different weight.
00:47:45.360 --> 00:47:47.640
So what you're really
doing is putting a weight
00:47:47.640 --> 00:47:51.030
on all directions of
the space at once.
00:47:51.030 --> 00:47:53.280
So what this Wald's
test is doing--
00:47:53.280 --> 00:47:56.940
by squeezing in the
Fisher information matrix,
00:47:56.940 --> 00:48:00.840
it's placing your problem
into the right geometry.
00:48:00.840 --> 00:48:05.580
It's a geometry that's distorted
and where balls become ellipses
00:48:05.580 --> 00:48:07.860
that are stretched
in some directions
00:48:07.860 --> 00:48:10.260
and shrunk in
others, depending
00:48:10.260 --> 00:48:12.690
on whether you have more variance
or less variance in those
00:48:12.690 --> 00:48:13.565
directions.
00:48:13.565 --> 00:48:14.940
Those directions
don't have to be
00:48:14.940 --> 00:48:18.220
aligned with the axes of
your coordinate system.
00:48:18.220 --> 00:48:19.920
And if they were,
then that would
00:48:19.920 --> 00:48:24.570
mean you would have a
diagonal information matrix,
00:48:24.570 --> 00:48:25.800
but they might not be.
00:48:25.800 --> 00:48:28.260
And so there's this weird
geometry that shows up.
00:48:28.260 --> 00:48:31.410
There is actually
an entire field,
00:48:31.410 --> 00:48:34.200
admittedly a bit
dormant these days,
00:48:34.200 --> 00:48:36.270
that's called
information geometry,
00:48:36.270 --> 00:48:39.060
and it's really doing
differential geometry
00:48:39.060 --> 00:48:44.270
on spaces that are defined by
Fisher information matrices.
00:48:44.270 --> 00:48:46.770
And so you can do
some pretty hardcore--
00:48:46.770 --> 00:48:50.220
something that I
certainly cannot do--
00:48:50.220 --> 00:48:53.413
differential geometry, just by
playing around with statistical
00:48:53.413 --> 00:48:55.830
models and trying to understand
what the geometry of those
00:48:55.830 --> 00:48:56.700
models is.
00:48:56.700 --> 00:48:58.350
What does it mean
for two points to be
00:48:58.350 --> 00:49:01.570
close in some curved space?
00:49:01.570 --> 00:49:02.830
So that's basically the idea.
00:49:02.830 --> 00:49:06.440
So this thing is basically
curving your space.
00:49:06.440 --> 00:49:10.250
So again, I always
feel satisfied
00:49:10.250 --> 00:49:12.560
when my estimator
or my test does not
00:49:12.560 --> 00:49:14.150
involve just
computing an average
00:49:14.150 --> 00:49:16.520
and checking if it's big or not.
00:49:16.520 --> 00:49:18.560
And that's not what
we're doing here.
00:49:18.560 --> 00:49:23.350
We know that this theta hat
mle can be complicated--
00:49:23.350 --> 00:49:26.530
cf. problem set 2, I believe.
00:49:26.530 --> 00:49:29.093
And we know that this Fisher
information matrix can also
00:49:29.093 --> 00:49:30.010
be pretty complicated.
00:49:30.010 --> 00:49:33.470
So here, your test is not
going to be trivial at all,
00:49:33.470 --> 00:49:37.000
and that requires understanding
the mathematics behind it.
00:49:37.000 --> 00:49:40.840
I mean, it's all built
upon this theorem
00:49:40.840 --> 00:49:43.540
that I just erased,
I believe, which
00:49:43.540 --> 00:49:45.567
was that this guy
here inside this norm
00:49:45.567 --> 00:49:47.650
was actually converging
to some standard Gaussian.
00:49:52.690 --> 00:49:55.030
So there's another test
that you can actually use.
00:49:55.030 --> 00:50:00.800
So Wald's test is one option,
and there's another option.
00:50:00.800 --> 00:50:05.460
And just like maximum
likelihood estimation and method
00:50:05.460 --> 00:50:09.450
of moments would sometimes
agree and sometimes disagree,
00:50:09.450 --> 00:50:12.210
those guys are going to
sometimes agree and sometimes
00:50:12.210 --> 00:50:13.550
disagree.
00:50:13.550 --> 00:50:17.510
And this test is called
the likelihood ratio test.
00:50:17.510 --> 00:50:21.560
So let's parse those words--
00:50:21.560 --> 00:50:25.322
"likelihood," "ratio," "test."
00:50:25.322 --> 00:50:26.780
So at some point,
I'm going to have
00:50:26.780 --> 00:50:29.270
to take the likelihood
of something divided
00:50:29.270 --> 00:50:33.980
by the likelihood of some other
thing and then work with this.
00:50:33.980 --> 00:50:36.380
And this test is just
saying the following.
00:50:36.380 --> 00:50:39.654
Here's the simplest
principle you can think of.
00:50:44.513 --> 00:50:45.930
You're not going to
have to understand
00:50:45.930 --> 00:50:51.440
the notion of likelihood in
the context of statistics.
00:50:51.440 --> 00:50:53.565
You just have to understand
the meaning of the word
00:50:53.565 --> 00:50:54.930
"likelihood."
00:50:54.930 --> 00:51:03.740
This test is just saying
if I want to test h0,
00:51:03.740 --> 00:51:07.240
theta is equal to theta 0,
versus theta is equal to theta
00:51:07.240 --> 00:51:13.040
1, all I have to look at is
whether theta 0 is more or less
00:51:13.040 --> 00:51:14.990
likely than theta 1.
00:51:14.990 --> 00:51:18.960
And I have an exact
number that this spits out.
00:51:18.960 --> 00:51:24.760
Given a theta 0 or a
theta 1 and given data,
00:51:24.760 --> 00:51:26.830
I can put in this function
called the likelihood,
00:51:26.830 --> 00:51:31.630
and it tells me exactly
how likely those things are.
00:51:31.630 --> 00:51:33.420
And so all I have to
check is whether one
00:51:33.420 --> 00:51:35.760
is more likely than the
other, and so what I can do
00:51:35.760 --> 00:51:41.450
is form the likelihood
of theta, say,
00:51:41.450 --> 00:51:50.070
1 divided by the
likelihood of theta 0
00:51:50.070 --> 00:51:52.260
and check if this
thing is larger than 1.
00:51:52.260 --> 00:51:57.090
That would mean that this guy
is more likely than that guy.
00:51:57.090 --> 00:52:00.010
That's a natural way to proceed.
00:52:00.010 --> 00:52:03.190
Now, there's one
caveat here, which
00:52:03.190 --> 00:52:05.900
is that when I do
hypothesis testing
00:52:05.900 --> 00:52:10.960
and I have this asymmetry
between h0 and h1,
00:52:10.960 --> 00:52:13.660
I still need to be
able to control what
00:52:13.660 --> 00:52:15.340
my probability of type I error is.
00:52:15.340 --> 00:52:19.260
And here I basically
have no knob.
00:52:19.260 --> 00:52:21.310
This is something if you
give me data and theta 0
00:52:21.310 --> 00:52:24.470
and theta 1, I can compute for you
and spit out the yes/no answer.
00:52:24.470 --> 00:52:29.720
But I have no way of controlling
the type II and type I error,
00:52:29.720 --> 00:52:33.320
so what we do is that we
replace this 1 by some number c.
00:52:33.320 --> 00:52:35.300
And then we calibrate
c in such a way
00:52:35.300 --> 00:52:37.580
that the type I error is
exactly at level alpha.
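NOTE (editor): A minimal sketch of such a simple-versus-simple likelihood ratio test, assuming Python with numpy/scipy and, purely for illustration, a Gaussian model with known variance 1 (my choice, not the lecture's):
    import numpy as np
    from scipy.stats import norm
    def lr_reject(x, theta0, theta1, c):
        # reject h0 when L(theta1) / L(theta0) > c, computed in logs
        log_ratio = (norm.logpdf(x, loc=theta1).sum()
                     - norm.logpdf(x, loc=theta0).sum())
        return log_ratio > np.log(c)
Calibrating c then means choosing it so that the probability of rejecting under theta0 is exactly alpha.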
00:52:40.630 --> 00:52:44.820
So for example, if
I want to make sure
00:52:44.820 --> 00:52:50.610
that my type I error is
always 0, all I have to do
00:52:50.610 --> 00:52:52.350
is to say that this
guy is actually never
00:52:52.350 --> 00:52:55.020
more likely than that
guy, meaning never reject.
00:52:55.020 --> 00:52:57.912
And so if I let
c go to infinity,
00:52:57.912 --> 00:52:59.370
then this is actually
going to make
00:52:59.370 --> 00:53:02.220
my type I error go to zero.
00:53:02.220 --> 00:53:05.790
But if I let c go to
negative infinity,
00:53:05.790 --> 00:53:12.270
then I'm always
going to conclude
00:53:12.270 --> 00:53:14.730
that h1 is the right one.
00:53:14.730 --> 00:53:16.200
So I have this
trade-off, and I
00:53:16.200 --> 00:53:19.350
can turn this knob by
changing the values of c
00:53:19.350 --> 00:53:22.190
and get different results.
00:53:22.190 --> 00:53:25.890
And I'm going to be interested
in the one that maximizes
00:53:25.890 --> 00:53:29.010
my chances of rejecting the
null hypothesis while staying
00:53:29.010 --> 00:53:33.500
under my alpha budget
of type I error.
00:53:33.500 --> 00:53:37.280
So this is nice when I have
two very simple hypotheses,
00:53:37.280 --> 00:53:40.430
but to be fair, we've
actually not seen
00:53:40.430 --> 00:53:45.050
any tests that correspond
to a real-life example.
00:53:45.050 --> 00:53:49.070
Where theta 0 was of the
form am I equal to, say, 0.5
00:53:49.070 --> 00:53:51.853
or am I equal to
0.41, we actually
00:53:51.853 --> 00:53:53.270
sort of suspected
that if somebody
00:53:53.270 --> 00:53:54.895
asked you to perform
this test, they've
00:53:54.895 --> 00:53:57.810
sort of seen the data before
and they're sort of cheating.
00:53:57.810 --> 00:54:00.290
So it's typically
something am I equal to 0.5
00:54:00.290 --> 00:54:02.420
or not equal to 0.5
or am I equal to 0.5
00:54:02.420 --> 00:54:03.960
or larger than 0.5.
00:54:03.960 --> 00:54:06.830
But it's very rare that you
actually get only two points
00:54:06.830 --> 00:54:07.520
to test--
00:54:07.520 --> 00:54:09.500
am I this guy or that guy?
00:54:09.500 --> 00:54:11.180
Now, I could go on.
00:54:11.180 --> 00:54:13.432
There's actually a nice
mathematical theory,
00:54:13.432 --> 00:54:15.140
something called the
Neyman-Pearson lemma
00:54:15.140 --> 00:54:18.470
that actually tells me that
this test, the likelihood ratio
00:54:18.470 --> 00:54:22.670
test, is the test, given the
constraint of type I error,
00:54:22.670 --> 00:54:25.220
that will have the
smallest type II error.
00:54:25.220 --> 00:54:27.680
So this is the ultimate test.
00:54:27.680 --> 00:54:29.900
No one should ever use
anything different.
00:54:29.900 --> 00:54:32.420
And we could go on and
do this, but in a way,
00:54:32.420 --> 00:54:35.150
it's completely irrelevant to
practice because you will never
00:54:35.150 --> 00:54:37.220
encounter such tests.
00:54:37.220 --> 00:54:41.000
And I actually find students
who took my class
00:54:41.000 --> 00:54:44.180
as sophomores and then they're
still around a couple of years
00:54:44.180 --> 00:54:46.930
later, and they're like,
00:54:46.930 --> 00:54:50.250
I have this testing problem and
I want to use likelihood ratio
00:54:50.250 --> 00:54:54.740
test, the Neyman-Pearson one,
but I just can't because it
00:54:54.740 --> 00:54:56.110
just never occurs.
00:54:56.110 --> 00:54:57.480
This just does not happen.
00:54:57.480 --> 00:54:59.750
So here, rather than
going into details,
00:54:59.750 --> 00:55:02.950
let's just look at how,
building on this principle,
00:55:02.950 --> 00:55:05.570
we can actually make
a test that will work.
00:55:05.570 --> 00:55:08.720
So now, for
simplicity, I'm going
00:55:08.720 --> 00:55:11.810
to assume that my
alternatives-- so now, I still
00:55:11.810 --> 00:55:16.580
have a d dimensional
vector theta.
00:55:16.580 --> 00:55:20.840
And what I'm going to assume
is that the null hypothesis
00:55:20.840 --> 00:55:26.750
is actually only testing if
the last coefficients from r
00:55:26.750 --> 00:55:31.070
plus 1 to d are fixed numbers.
00:55:31.070 --> 00:55:35.460
So in this example, where
I have theta was equal--
00:55:35.460 --> 00:55:38.915
so if I have d equals
3, here's an example.
00:55:42.120 --> 00:55:53.510
h0 is theta 2 equals 1,
and theta 3 equals 2.
00:55:53.510 --> 00:55:56.360
That's my h0, but I
say I don't actually
00:55:56.360 --> 00:55:58.070
care about what theta
1 is going to be.
00:56:02.450 --> 00:56:04.450
So that's my null hypothesis.
00:56:04.450 --> 00:56:07.500
I'm not going to specify right
now what the alternative is.
00:56:07.500 --> 00:56:08.500
That's what the null is.
00:56:08.500 --> 00:56:13.240
And in particular, this null
is actually not of this form.
00:56:13.240 --> 00:56:15.190
It's not restricting
it to one point.
00:56:15.190 --> 00:56:18.070
It's actually restricting it to
an infinite amount of points.
00:56:18.070 --> 00:56:22.020
Those are all the vectors
of the form theta 1 1,
00:56:22.020 --> 00:56:29.440
2, for all theta 1 in, say, R.
00:56:29.440 --> 00:56:31.960
That's a lot of vectors,
and so it's certainly
00:56:31.960 --> 00:56:34.060
not like it's equal to
one specific vector.
00:56:36.670 --> 00:56:39.610
So now, what I'm going
to do is I'm actually
00:56:39.610 --> 00:56:43.300
going to look at the maximum
likelihood estimator,
00:56:43.300 --> 00:56:45.910
and I'm going to say, well, the
maximum likelihood estimator,
00:56:45.910 --> 00:56:50.310
regardless of anything, is
going to be close to reality.
00:56:50.310 --> 00:56:53.480
Now, if you actually
tell me ahead of time
00:56:53.480 --> 00:56:56.520
that the true parameter
is of this form,
00:56:56.520 --> 00:56:59.698
I'm not going to maximize over
all three coordinates of theta.
00:56:59.698 --> 00:57:01.740
I'm just going to say,
well, I might as well just
00:57:01.740 --> 00:57:06.900
set the second one to
1, the third one to 2,
00:57:06.900 --> 00:57:09.690
and just optimize for this guy.
00:57:09.690 --> 00:57:11.990
So effectively, you can
say if you're telling me
00:57:11.990 --> 00:57:14.390
that this is the
reality, I can compute
00:57:14.390 --> 00:57:17.000
a constrained maximum
likelihood estimator
00:57:17.000 --> 00:57:21.690
which is constrained to look
like what you think reality is.
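NOTE (editor): A minimal sketch of this constrained maximization, assuming Python with scipy and a user-supplied log-likelihood log_lik(theta, data) (a hypothetical name):
    import numpy as np
    from scipy.optimize import minimize_scalar
    def constrained_mle(log_lik, data):
        # under h0, theta = (theta1, 1, 2); only theta1 is free
        obj = lambda t1: -log_lik(np.array([t1, 1.0, 2.0]), data)
        return np.array([minimize_scalar(obj).x, 1.0, 2.0])
The unconstrained mle would instead maximize over all three coordinates at once.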
00:57:21.690 --> 00:57:24.270
So this is what the maximum
likelihood estimator is.
00:57:24.270 --> 00:57:26.130
That's the one that's
maximizing, say,
00:57:26.130 --> 00:57:30.120
here the log likelihood over
the entire space of candidate
00:57:30.120 --> 00:57:32.640
vectors, of
candidate parameters.
00:57:32.640 --> 00:57:36.357
But this partial one, this
is the constrained mle.
00:57:36.357 --> 00:57:38.940
That's the one that's actually
not maximizing over all thetas,
00:57:38.940 --> 00:57:41.120
but only over the thetas
that are plausible
00:57:41.120 --> 00:57:44.430
under the null hypothesis.
00:57:44.430 --> 00:57:52.880
So in particular, if I look
at ln of this constrained thing
00:57:52.880 --> 00:57:59.840
theta hat n c compared
to ln, theta hat--
00:57:59.840 --> 00:58:04.427
let's say n mle, so
we know which one--
00:58:04.427 --> 00:58:05.260
which one is bigger?
00:58:13.400 --> 00:58:15.400
The first one is bigger.
00:58:15.400 --> 00:58:17.330
So why?
00:58:17.330 --> 00:58:18.755
AUDIENCE: [INAUDIBLE].
00:58:20.770 --> 00:58:22.270
PHILIPPE RIGOLLET:
So the second one
00:58:22.270 --> 00:58:25.070
is maximized over
a larger space.
00:58:25.070 --> 00:58:25.570
Right.
00:58:25.570 --> 00:58:28.833
So I have this all
of theta, which
00:58:28.833 --> 00:58:30.250
are all the
parameters I can take,
00:58:30.250 --> 00:58:32.626
and let's say theta
0 is this guy.
00:58:32.626 --> 00:58:35.990
I'm maximizing a function
over all these things.
00:58:35.990 --> 00:58:38.930
So if the true
maximum is this here,
00:58:38.930 --> 00:58:41.210
then the two things are
equal, but if the maximum
00:58:41.210 --> 00:58:43.490
is on this side, then
the one on the right
00:58:43.490 --> 00:58:45.260
is actually going to be larger.
00:58:45.260 --> 00:58:48.050
They're maximizing
over a bigger space,
00:58:48.050 --> 00:58:51.440
so this guy has to be
less than this guy.
00:58:51.440 --> 00:58:53.450
So maybe it's not easy to see.
00:58:53.450 --> 00:59:01.610
So let's say that this is
theta and this is theta 0
00:59:01.610 --> 00:59:04.570
and now I have a function.
00:59:04.570 --> 00:59:09.720
The maximum over theta
0 is this guy here,
00:59:09.720 --> 00:59:12.040
but the maximum over the
entire space is here.
00:59:15.530 --> 00:59:17.330
So the maximum
over a larger space
00:59:17.330 --> 00:59:20.090
has to be larger than the
maximum over a smaller space.
00:59:20.090 --> 00:59:26.090
It can be equal, but the
one in the bigger space
00:59:26.090 --> 00:59:28.800
can be even bigger.
00:59:28.800 --> 00:59:33.730
However, if my
true theta actually
00:59:33.730 --> 00:59:35.440
did belong to theta 0--
00:59:35.440 --> 00:59:38.880
if h0 was true--
00:59:38.880 --> 00:59:39.850
what would happen?
00:59:39.850 --> 00:59:45.930
Well, if theta 0 is true,
then theta is in theta 0,
00:59:45.930 --> 00:59:49.487
and since the maximum likelihood
estimator should be close to theta,
00:59:49.487 --> 00:59:51.570
it should be the case that
those two things should
00:59:51.570 --> 00:59:52.890
be pretty similar.
00:59:52.890 --> 00:59:56.290
I should be in a case not
in this kind of thing,
00:59:56.290 --> 00:59:58.110
but more in this
kind of position,
00:59:58.110 --> 01:00:00.450
where the true maximum is
actually attained at theta 0.
01:00:00.450 --> 01:00:02.300
And in this case,
they're actually
01:00:02.300 --> 01:00:05.640
of the same size,
those two things.
01:00:05.640 --> 01:00:08.400
If it's not true, then I'm
going to see a discrepancy
01:00:08.400 --> 01:00:09.398
between the two guys.
01:00:12.030 --> 01:00:15.840
So my test is going to be
built on this intuition
01:00:15.840 --> 01:00:20.700
that if h0 is true, the values
of the likelihood at theta hat
01:00:20.700 --> 01:00:24.530
mle and at the constrained mle
should be pretty much the same.
01:00:24.530 --> 01:00:25.680
But if theta hat--
01:00:25.680 --> 01:00:29.490
if it's not true, then
the likelihood of the mle
01:00:29.490 --> 01:00:33.772
should be much larger
than the likelihood
01:00:33.772 --> 01:00:34.730
of the constrained mle.
01:00:37.600 --> 01:00:40.580
And this is exactly
what this test is doing.
01:00:40.580 --> 01:00:42.430
So that's the
likelihood ratio test.
01:00:42.430 --> 01:00:46.660
So rather than looking at
the ratio of the likelihoods,
01:00:46.660 --> 01:00:48.910
we look at the difference
of the log likelihood, which
01:00:48.910 --> 01:00:51.170
is really the same thing.
01:00:51.170 --> 01:00:54.420
And there is some weird
normalization factor, too,
01:00:54.420 --> 01:00:55.978
that shows up here.
01:01:04.910 --> 01:01:06.120
And this is what we get.
01:01:06.120 --> 01:01:18.900
So if I look at the
likelihood ratio test,
01:01:18.900 --> 01:01:25.280
so it's looking at two
times ln of theta hat mle
01:01:25.280 --> 01:01:32.070
minus ln of theta
hat mle constrained.
01:01:32.070 --> 01:01:34.100
And this is actually
the test statistic.
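NOTE (editor): In symbols, the statistic on the board is
    T_n = 2 * ( l_n(theta_hat_mle) - l_n(theta_hat_c) ),
twice the gap between the unconstrained and constrained maximized log-likelihoods.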
01:01:34.100 --> 01:01:39.810
So we've actually decided
that this statistic is what?
01:01:42.850 --> 01:01:44.565
It's non-negative, right?
01:01:44.565 --> 01:01:45.940
We've also decided
that it should
01:01:45.940 --> 01:01:49.120
be close to zero if h0
is true and of course
01:01:49.120 --> 01:01:52.990
then maybe far from
zero if h0 is not true.
01:01:52.990 --> 01:02:00.320
So what should be the
natural test based on Tn?
01:02:00.320 --> 01:02:03.300
Let me just check that it's--
01:02:03.300 --> 01:02:05.370
well, it's already there.
01:02:05.370 --> 01:02:08.610
So the natural test is something
that looks like indicator
01:02:08.610 --> 01:02:12.480
that Tn is larger than c.
01:02:12.480 --> 01:02:13.980
And you should say, well, again?
01:02:13.980 --> 01:02:15.800
I mean, we just did that.
01:02:15.800 --> 01:02:19.490
I mean, it is basically the
same thing that we just did.
01:02:19.490 --> 01:02:20.940
Agreed?
01:02:20.940 --> 01:02:22.380
But the Tn now is different.
01:02:22.380 --> 01:02:24.270
The Tn is the difference
of log likelihoods,
01:02:24.270 --> 01:02:29.970
whereas before the Tn was
this theta hat minus theta
01:02:29.970 --> 01:02:35.630
not, transpose, times the
Fisher information matrix theta
01:02:35.630 --> 01:02:37.170
hat minus theta not.
01:02:37.170 --> 01:02:39.330
And this, there's no
reason why this guy
01:02:39.330 --> 01:02:41.410
should be of the same form.
01:02:41.410 --> 01:02:43.117
Now, if I have a
Gaussian model, you
01:02:43.117 --> 01:02:45.700
can check that those two things
are actually exactly the same.
01:02:49.040 --> 01:02:52.190
But otherwise, they don't
have any reason to be.
01:02:52.190 --> 01:02:54.220
And now, what's
happening is that
01:02:54.220 --> 01:02:57.100
under some technical
conditions--
01:02:57.100 --> 01:02:59.210
if h0 is true, now
what happens is
01:02:59.210 --> 01:03:02.690
that if I want to calibrate
c, what I need to do
01:03:02.690 --> 01:03:08.630
is to look at what is the
c such that this guy is
01:03:08.630 --> 01:03:10.350
equal to alpha?
01:03:10.350 --> 01:03:15.775
And that's for the distribution
of Tn under the null.
01:03:20.330 --> 01:03:22.050
But there's not only one.
01:03:22.050 --> 01:03:26.790
The null hypothesis
here was actually
01:03:26.790 --> 01:03:28.050
just the family of things.
01:03:28.050 --> 01:03:29.580
It was not just one vector.
01:03:29.580 --> 01:03:31.500
It was an entire
family of vectors,
01:03:31.500 --> 01:03:33.520
just like in this example.
01:03:33.520 --> 01:03:35.670
So if I want my type I
error to be constrained
01:03:35.670 --> 01:03:39.120
over the entire space,
what I need to make sure of
01:03:39.120 --> 01:03:44.440
is that the maximum
over all theta in theta not
01:03:44.440 --> 01:03:45.860
is actually equal to alpha.
01:03:53.152 --> 01:03:53.652
Agreed?
01:03:53.652 --> 01:03:54.152
Yeah?
01:03:54.152 --> 01:03:55.600
AUDIENCE: [INAUDIBLE].
01:03:59.520 --> 01:04:04.050
PHILIPPE RIGOLLET: So not equal.
01:04:04.050 --> 01:04:06.858
In this case, it's
going to be not equal.
01:04:06.858 --> 01:04:08.650
I mean, it can really
be anything you want.
01:04:08.650 --> 01:04:12.670
It's just you're going to have
a different type II error.
01:04:12.670 --> 01:04:15.140
I guess here we're sort
of stuck in a corner.
01:04:15.140 --> 01:04:18.740
We built this T, and it has
to be small under the null.
01:04:18.740 --> 01:04:21.235
And whatever not
the null is, we just
01:04:21.235 --> 01:04:22.610
hope that it's
going to be large.
01:04:25.150 --> 01:04:27.200
So even if I tell you
what the alternative is,
01:04:27.200 --> 01:04:31.660
you're not going to change
anything about the procedure.
01:04:31.660 --> 01:04:33.970
So here, q alpha-- so
what I need to know
01:04:33.970 --> 01:04:37.540
is that if h0 is true,
then Tn in this case
01:04:37.540 --> 01:04:41.620
actually converges to some
chi-square distribution.
01:04:41.620 --> 01:04:44.500
And now here, the number
of degrees of freedom
01:04:44.500 --> 01:04:45.250
is kind of weird.
01:04:58.720 --> 01:05:02.100
But actually, what it should
tell you is, oh, finally, I
01:05:02.100 --> 01:05:05.030
know when you call this
parameter degrees of freedom
01:05:05.030 --> 01:05:08.790
rather than dimension
or just d parameter.
01:05:08.790 --> 01:05:13.100
It's because here what we did
is we actually pinned down
01:05:13.100 --> 01:05:19.330
everything, but r--
01:05:19.330 --> 01:05:23.050
sorry, we pinned
down everything but r
01:05:23.050 --> 01:05:24.190
coordinates of this thing.
01:05:26.710 --> 01:05:30.190
And so now I'm actually
wondering why--
01:05:34.102 --> 01:05:36.547
did I make a mistake here?
01:05:40.460 --> 01:05:41.930
I think this should
be chi square
01:05:41.930 --> 01:05:43.190
with r degrees of freedom.
01:05:46.290 --> 01:05:48.630
Let me check and send
you an update about this,
01:05:48.630 --> 01:05:53.140
because the number of
degrees of freedom,
01:05:53.140 --> 01:05:55.860
if you talk to normal
people they will tell you
01:05:55.860 --> 01:05:59.830
that here the number of
degrees of freedom is r.
01:05:59.830 --> 01:06:01.690
This is what's allowed
to move, and that's
01:06:01.690 --> 01:06:03.580
what's called
degrees of freedom.
01:06:03.580 --> 01:06:06.520
The rest is pinned down
to being something.
01:06:06.520 --> 01:06:10.480
So here, this chi-square
should be a chi-squared r.
01:06:10.480 --> 01:06:12.993
And that's something you
just have to believe me.
01:06:12.993 --> 01:06:15.160
Anybody guess what theorem
is going to tell me this?
01:06:19.050 --> 01:06:21.285
In some cases, it's going
to be Cochran's theorem--
01:06:21.285 --> 01:06:23.577
just something that tells
me that thing's [INAUDIBLE].
01:06:23.577 --> 01:06:27.020
Now, here, I use the
very specific form
01:06:27.020 --> 01:06:29.600
of the null alternative.
01:06:29.600 --> 01:06:31.100
And so for those
of you who are sort
01:06:31.100 --> 01:06:35.740
of familiar with linear
algebra, what I did here is h0
01:06:35.740 --> 01:06:39.530
consists in saying
that theta belongs
01:06:39.530 --> 01:06:43.040
to an r dimensional
linear space.
01:06:43.040 --> 01:06:45.380
It's actually here, the r
dimensional linear space
01:06:45.380 --> 01:06:49.160
of vectors, that have the first
r coordinates that can move
01:06:49.160 --> 01:06:54.688
and the last coordinates that
are fixed to some number.
01:06:54.688 --> 01:06:57.230
Actually, it's an affine space
because it doesn't necessarily
01:06:57.230 --> 01:06:58.410
go through zero.
01:06:58.410 --> 01:07:00.410
And so I have this
affine space that
01:07:00.410 --> 01:07:05.555
has dimension r, and if I were
to constrain it to any other r
01:07:05.555 --> 01:07:08.070
dimensional space, that would
be exactly the same thing.
01:07:08.070 --> 01:07:10.910
And so to do that, essentially
what you need to do is to say,
01:07:10.910 --> 01:07:15.440
if I take any matrix that's say,
invertible-- let's call it u--
01:07:15.440 --> 01:07:21.500
and then so h0 is going to be
something like of the form u
01:07:21.500 --> 01:07:33.210
times theta and now I look only
at the coordinates r plus 1 to d,
01:07:33.210 --> 01:07:35.620
then I want to fix those
guys to some numbers.
01:07:35.620 --> 01:07:39.040
So I want to call them theta,
so let's call them tau.
01:07:39.040 --> 01:07:44.850
So it's going to be tau r
plus 1, all the way to tau d.
01:07:44.850 --> 01:07:47.580
So this is not part
of the requirements,
01:07:47.580 --> 01:07:50.075
but just so you know,
it's really not a matter
01:07:50.075 --> 01:07:51.450
of keeping only
some coordinates.
01:07:51.450 --> 01:07:54.120
Really, what matters
is the dimension
01:07:54.120 --> 01:07:56.980
in the sense of linear
subspaces of the problem,
01:07:56.980 --> 01:07:59.500
and that's what determines what
your degrees of freedom are.
01:08:03.000 --> 01:08:06.660
So now that we know what the
asymptotic distribution is
01:08:06.660 --> 01:08:10.630
under the null, then
we know basically
01:08:10.630 --> 01:08:17.920
that we know which table we
need to pick our q alpha from.
01:08:17.920 --> 01:08:20.340
And here, again, the table
is a chi-squared table,
01:08:20.340 --> 01:08:22.090
but here, the number
of degrees of freedom
01:08:22.090 --> 01:08:26.277
is this weird d minus r
degrees of freedom thing.
01:08:29.689 --> 01:08:31.060
I just said it was r.
01:08:34.060 --> 01:08:36.952
I'm just checking,
actually, if I'm--
01:08:41.542 --> 01:08:42.042
it's r.
01:08:42.042 --> 01:08:42.792
It's definitely r.
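NOTE (editor): For the promised update: the standard statement of Wilks' theorem gives a chi-squared limit with d - r degrees of freedom, the number of pinned-down coordinates, matching the slide; with d = 3 and r = 1 in the example above, that is 2. A minimal sketch, reusing the hypothetical helpers mle(data), constrained_mle, and log_lik from the earlier notes:
    from scipy.stats import chi2
    def lrt_reject(data, d=3, r=1, alpha=0.05):
        Tn = 2 * (log_lik(mle(data), data)
                  - log_lik(constrained_mle(log_lik, data), data))
        return Tn > chi2.ppf(1 - alpha, df=d - r)  # reject h0?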
01:08:51.200 --> 01:08:54.260
So here we've made tests.
01:08:54.260 --> 01:08:57.170
We're testing if our parameter
theta was explicitly
01:08:57.170 --> 01:09:00.140
in some set or not.
01:09:00.140 --> 01:09:03.140
By explicitly, I mean we're
saying, is theta like this
01:09:03.140 --> 01:09:04.380
or is theta not like this?
01:09:04.380 --> 01:09:06.350
Is theta equal to
theta not or is theta
01:09:06.350 --> 01:09:07.720
not equal to theta not?
01:09:07.720 --> 01:09:10.160
Are the last
coordinates of theta
01:09:10.160 --> 01:09:12.490
equal to those fixed
numbers, or are they not?
01:09:12.490 --> 01:09:15.555
There was something I was
stating directly about theta.
01:09:15.555 --> 01:09:17.930
But there's going to be some
instances where you actually
01:09:17.930 --> 01:09:21.200
want to test something
about a function of theta,
01:09:21.200 --> 01:09:22.700
not theta itself.
01:09:22.700 --> 01:09:27.350
For example, is the difference
between the first coordinate
01:09:27.350 --> 01:09:30.715
of theta and the second
coordinate of theta positive?
01:09:30.715 --> 01:09:32.840
That's definitely something
you might want to test,
01:09:32.840 --> 01:09:37.477
because maybe theta 1 is--
01:09:37.477 --> 01:09:39.185
let me try to think
of some good example.
01:09:44.618 --> 01:09:45.160
I don't know.
01:09:45.160 --> 01:09:49.779
Maybe theta 1 is your drawing
accuracy with the right hand
01:09:49.779 --> 01:09:52.720
and theta 2 is the drawing
accuracy with the left hand,
01:09:52.720 --> 01:09:56.320
and I'm actually collecting
data on young children
01:09:56.320 --> 01:09:58.840
to be able to test
early on whether they're
01:09:58.840 --> 01:10:01.810
going to be left-handed or
right-handed, for example.
01:10:01.810 --> 01:10:04.907
And so I want to just compare
those two with respect
01:10:04.907 --> 01:10:06.490
to each other, but
I don't necessarily
01:10:06.490 --> 01:10:10.300
need to know what the absolute
scores for these handwriting
01:10:10.300 --> 01:10:12.010
skills are.
01:10:12.010 --> 01:10:14.890
So sometimes it's just
interesting to look
01:10:14.890 --> 01:10:17.520
at the difference of
things or maybe the sum,
01:10:17.520 --> 01:10:18.940
say the combined effect.
01:10:18.940 --> 01:10:22.690
Maybe this is my two
measurements of blood pressure,
01:10:22.690 --> 01:10:25.560
and I just want to talk about
the average blood pressure.
01:10:25.560 --> 01:10:28.040
And so I can make a linear
combination of those two,
01:10:28.040 --> 01:10:30.070
and so those things
implicitly depend on theta.
01:10:30.070 --> 01:10:36.460
And so I can generically
encapsulate them
01:10:36.460 --> 01:10:39.610
in some test of the form
g of theta is equal to 0
01:10:39.610 --> 01:10:42.400
versus g of theta
is not equal to 0.
01:10:42.400 --> 01:10:46.060
And sometimes, in the first
test that we saw, g of theta
01:10:46.060 --> 01:10:53.350
was just the identity or
maybe the identity minus 0.5.
01:10:53.350 --> 01:10:55.170
If g of theta is
theta minus 0.5,
01:10:55.170 --> 01:10:57.320
that's exactly what
we've been testing.
01:10:57.320 --> 01:11:01.910
If g of theta is theta
minus 0.5 and theta
01:11:01.910 --> 01:11:06.850
is p, the parameter of a coin,
this is exactly of this form.
01:11:06.850 --> 01:11:08.930
So this is a simple
one, but then there's
01:11:08.930 --> 01:11:11.250
more complicated
ones we can think of.
01:11:14.830 --> 01:11:20.100
So how can I do this?
01:11:20.100 --> 01:11:22.100
Well, let's just
follow a recipe.
01:11:24.830 --> 01:11:26.210
So we traced back.
01:11:26.210 --> 01:11:31.995
We were trying to build a test
statistic which was pivotal.
01:11:31.995 --> 01:11:33.370
We wanted to have
this thing that
01:11:33.370 --> 01:11:37.220
had nothing that depended
on the parameter,
01:11:37.220 --> 01:11:39.140
and the only thing
we had for that
01:11:39.140 --> 01:11:41.000
that we built in
our chi-square test
01:11:41.000 --> 01:11:44.270
one is basically some form
of central limit theorem.
01:11:44.270 --> 01:11:46.580
Maybe it's for the maximum
likelihood estimator.
01:11:46.580 --> 01:11:48.500
Maybe it's for the
average, but it's basically
01:11:48.500 --> 01:11:52.610
some form of asymptotic
normality of the estimator.
01:11:52.610 --> 01:11:55.830
And that's what we started
from every single time.
01:11:55.830 --> 01:11:58.400
So let's assume
that I have this,
01:11:58.400 --> 01:12:00.590
and I'm going to
talk very abstractly.
01:12:00.590 --> 01:12:03.110
Let's assume that I
start with an estimator.
01:12:03.110 --> 01:12:04.880
Doesn't have to be the mle.
01:12:04.880 --> 01:12:06.770
It doesn't have
to be the average,
01:12:06.770 --> 01:12:08.020
but it's just something.
01:12:08.020 --> 01:12:11.960
And I know that I have the
estimator such that this guy
01:12:11.960 --> 01:12:15.310
converges in
distribution to some n0,
01:12:15.310 --> 01:12:17.900
and I have some
covariance matrix sigma of theta.
01:12:17.900 --> 01:12:20.330
Maybe it's not the
Fisher information.
01:12:20.330 --> 01:12:23.060
Maybe that's something that's
not as good as the mle,
01:12:23.060 --> 01:12:25.190
meaning that this
is going to give me
01:12:25.190 --> 01:12:29.160
less information than the Fisher
information, less accuracy.
01:12:29.160 --> 01:12:34.110
And now I can actually just say,
OK, if I know this about theta,
01:12:34.110 --> 01:12:43.920
I can apply the multivariate
delta method, which tells me
01:12:43.920 --> 01:12:50.050
that square root of n, g of
theta hat, minus g of theta
01:12:50.050 --> 01:12:56.170
goes in distribution to some n0.
01:12:56.170 --> 01:12:58.060
And then the price to
pay in one dimension
01:12:58.060 --> 01:13:01.060
was multiplying by the square
of the derivative,
01:13:01.060 --> 01:13:03.730
and we know that in multivariate
dimensions pre-multiplying
01:13:03.730 --> 01:13:05.170
by the gradient,
post-multiplying
01:13:05.170 --> 01:13:06.490
by the gradient.
01:13:06.490 --> 01:13:14.060
So I'm going to write delta
g of theta transpose sigma--
01:13:14.060 --> 01:13:15.630
sorry, not delta; nabla--
01:13:15.630 --> 01:13:19.090
g of theta-- so gradient.
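NOTE (editor): Written out, the multivariate delta method being invoked here says
    sqrt(n) * ( g(theta_hat) - g(theta) ) --> N( 0, grad g(theta)^T Sigma(theta) grad g(theta) )
in distribution, with grad g the gradient of g.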
01:13:19.090 --> 01:13:25.420
And here, I assume that
g takes values into rk.
01:13:25.420 --> 01:13:28.770
That's what's written here.
g takes values from rd to rk,
01:13:28.770 --> 01:13:30.970
but think of k as
being 1 for now.
01:13:30.970 --> 01:13:33.910
So the gradient is really just
a vector and not a matrix.
01:13:33.910 --> 01:13:40.680
That's your usual gradient
for real-valued functions.
01:13:40.680 --> 01:13:45.797
So effectively, if g takes
values in dimension 1,
01:13:45.797 --> 01:13:47.130
what is the size of this matrix?
01:13:58.390 --> 01:13:59.920
I only ask trivial questions.
01:13:59.920 --> 01:14:02.990
Remember, that's
rule number one.
01:14:02.990 --> 01:14:04.320
It's one by one, right?
01:14:04.320 --> 01:14:06.540
And you can check it,
because on this side
01:14:06.540 --> 01:14:08.492
those are just the
difference between numbers.
01:14:08.492 --> 01:14:10.200
And it would be kind
of weird if they had
01:14:10.200 --> 01:14:11.550
a covariance matrix at the end.
01:14:11.550 --> 01:14:15.000
I mean, this is a random
variable, not a random vector.
01:14:15.000 --> 01:14:17.400
So I know that
this thing happens.
01:14:17.400 --> 01:14:21.390
And now, if I basically
divide by the square root
01:14:21.390 --> 01:14:22.110
of this thing--
01:14:30.210 --> 01:14:35.400
so for the board I'm working with k
is equal to 1 divided by square
01:14:35.400 --> 01:14:41.735
root of nabla g of theta
transpose sigma nabla
01:14:41.735 --> 01:14:43.030
g of theta--
01:14:45.620 --> 01:14:51.580
then this thing should go to
some standard normal random
01:14:51.580 --> 01:14:56.890
variable, standard
normal distribution.
01:14:56.890 --> 01:14:59.730
I just divided by square
root of the variance here,
01:14:59.730 --> 01:15:01.410
which is the usual thing.
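NOTE (editor): A minimal numerical sketch for k = 1, assuming Python with numpy/scipy; the estimate, covariance, and sample size are made-up illustrative values, and g is the difference of the first two coordinates, as in the handedness example above:
    import numpy as np
    from scipy.stats import norm
    theta_hat = np.array([1.2, 0.9])   # hypothetical estimate
    Sigma = np.array([[0.5, 0.1],      # hypothetical asymptotic
                      [0.1, 0.8]])     # covariance matrix
    n = 200                            # hypothetical sample size
    grad_g = np.array([1.0, -1.0])     # gradient of g(t) = t1 - t2
    g_hat = grad_g @ theta_hat         # g(theta_hat) = 0.3
    z = np.sqrt(n) * g_hat / np.sqrt(grad_g @ Sigma @ grad_g)
    reject = abs(z) > norm.ppf(0.975)  # level-0.05 test of g = 0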
01:15:01.410 --> 01:15:05.580
Now, if you do not have
a univariate thing,
01:15:05.580 --> 01:15:07.630
you do the same
thing we did before,
01:15:07.630 --> 01:15:11.190
which is pre-multiplying
by the covariance matrix
01:15:11.190 --> 01:15:12.820
to the negative 1/2--
01:15:12.820 --> 01:15:16.920
so before this role was
played by the inverse Fisher
01:15:16.920 --> 01:15:18.730
information matrix.
01:15:18.730 --> 01:15:22.980
That's why we ended up
having i of theta to the 1/2,
01:15:22.980 --> 01:15:25.830
and now we just have this gamma,
which is just this function
01:15:25.830 --> 01:15:26.930
that I wrote up there.
01:15:26.930 --> 01:15:31.848
That could be potentially k by
k if g takes values into rk.
01:15:31.848 --> 01:15:32.764
Yes?
01:15:32.764 --> 01:15:35.578
AUDIENCE: [INAUDIBLE].
01:15:35.578 --> 01:15:37.620
PHILIPPE RIGOLLET: Yeah,
the gradient of a vector
01:15:37.620 --> 01:15:41.400
is just the vector with all
the derivatives with respect
01:15:41.400 --> 01:15:42.520
to each component, yes.
01:15:45.460 --> 01:15:48.400
So you know the word vector
for derivatives, but not
01:15:48.400 --> 01:15:49.930
for vectors?
01:15:49.930 --> 01:15:54.678
I mean, the word gradient
you use for one-dimensional?
01:15:54.678 --> 01:15:57.163
Yes, derivative
in one dimension.
01:16:01.150 --> 01:16:03.550
Now, of course, here, you
notice there's something--
01:16:03.550 --> 01:16:06.700
I actually have a
little caveat here.
01:16:06.700 --> 01:16:08.270
I want this to have rank k.
01:16:08.270 --> 01:16:10.120
I want this to be invertible.
01:16:10.120 --> 01:16:11.980
I want this matrix
to be invertible.
01:16:11.980 --> 01:16:13.660
Even for the Fisher
information matrix,
01:16:13.660 --> 01:16:15.280
I sort of need it
to be invertible.
01:16:15.280 --> 01:16:16.792
Even for the original
theorem, that
01:16:16.792 --> 01:16:18.250
was part of my
technical condition,
01:16:18.250 --> 01:16:21.540
just so that I could actually
write Fisher information matrix
01:16:21.540 --> 01:16:22.870
inverse.
01:16:22.870 --> 01:16:26.045
And so here, you can make
your life easy and just assume
01:16:26.045 --> 01:16:28.420
that it's true all the time,
because I'm actually writing
01:16:28.420 --> 01:16:29.880
in a fairly abstract way.
01:16:29.880 --> 01:16:31.380
But in practice,
we're going to have
01:16:31.380 --> 01:16:33.390
to check whether
this is going to be
01:16:33.390 --> 01:16:35.390
true for specific distributions.
01:16:35.390 --> 01:16:37.230
And we will see an
example towards the end
01:16:37.230 --> 01:16:39.690
of the chapter, the
multinomial, where
01:16:39.690 --> 01:16:42.750
it's actually not the case
that the Fisher information
01:16:42.750 --> 01:16:43.650
matrix exists.
01:16:46.170 --> 01:16:49.230
The asymptotic covariance
matrix is not invertible,
01:16:49.230 --> 01:16:52.848
so it's not the inverse of
a Fisher information matrix.
01:16:52.848 --> 01:16:54.390
Because to be the
inverse of someone,
01:16:54.390 --> 01:16:55.848
you need to be
invertible yourself.
01:16:58.670 --> 01:17:01.910
And so now what I can
do is apply Slutsky.
01:17:01.910 --> 01:17:06.790
So here, what I needed to
have is theta, the true theta.
01:17:06.790 --> 01:17:10.670
So what I can do is just
put some theta hat in there,
01:17:10.670 --> 01:17:16.490
and so that's the gamma of
theta hat that I see there.
01:17:16.490 --> 01:17:19.683
And if h0 is true, then
g of theta is equal to 0.
01:17:19.683 --> 01:17:20.600
That's what we assume.
01:17:20.600 --> 01:17:25.970
That was our h0, was that under
h0 g of theta is equal to 0.
01:17:25.970 --> 01:17:29.550
So the number I need
to plug in here,
01:17:29.550 --> 01:17:31.620
I don't need to
replace theta here.
01:17:31.620 --> 01:17:33.135
What I need to
replace here is 0.
01:17:36.250 --> 01:17:38.000
Now let's go back to
what you were saying.
01:17:38.000 --> 01:17:41.490
Here you could say, let
me try to replace 0 here,
01:17:41.490 --> 01:17:42.690
but there is no such thing.
01:17:42.690 --> 01:17:43.910
There is no g here.
01:17:43.910 --> 01:17:45.530
It's only the gradient of g.
01:17:45.530 --> 01:17:50.300
So this thing that says
replace theta by theta 0
01:17:50.300 --> 01:17:53.990
wherever you see it
could not work here.
01:17:53.990 --> 01:17:57.050
If g was invertible,
I could just
01:17:57.050 --> 01:18:02.780
say that theta is equal to
g inverse of 0 in the null
01:18:02.780 --> 01:18:05.150
and then I could
plug in that value.
01:18:05.150 --> 01:18:08.860
But in general, it doesn't
have to be invertible.
01:18:08.860 --> 01:18:11.270
And it might be a pain
to invert g, even.
01:18:11.270 --> 01:18:13.250
I mean, it's not
clear how you can
01:18:13.250 --> 01:18:15.080
invert all functions like that.
01:18:15.080 --> 01:18:17.280
And so here you just go
with Slutsky, and you say,
01:18:17.280 --> 01:18:20.690
OK, I'm just going to
put theta hat in there.
01:18:20.690 --> 01:18:24.740
But this guy, I know I need to
check whether it's 0 or not.
01:18:24.740 --> 01:18:27.740
Same recipe we did for theta,
except we do it for g of theta
01:18:27.740 --> 01:18:28.240
now.
01:18:30.910 --> 01:18:34.030
And now I have my
asymptotic thing.
01:18:34.030 --> 01:18:36.570
I know this is a
pivotal distribution.
01:18:36.570 --> 01:18:38.100
This might be a vector.
01:18:38.100 --> 01:18:41.130
So rather than looking
at the vector itself,
01:18:41.130 --> 01:18:43.512
I'm going to actually
look at the norm--
01:18:43.512 --> 01:18:44.970
rather than looking
at the vectors,
01:18:44.970 --> 01:18:46.620
I'm going to look at
their square norm.
01:18:46.620 --> 01:18:47.995
That gives me a
chi square, and I
01:18:47.995 --> 01:18:51.270
reject when my test statistic,
which is the norm square,
01:18:51.270 --> 01:18:53.700
exceeds the quantile
of a chi square--
01:18:53.700 --> 01:18:56.490
same as before, just
doing on your own.
01:18:56.490 --> 01:19:00.810
Before we part ways, I wanted
to just mention one thing, which
01:19:00.810 --> 01:19:02.590
is look at this thing.
01:19:02.590 --> 01:19:08.740
If g was of dimension 1, the
Euclidean norm in dimension 1
01:19:08.740 --> 01:19:10.760
is just the absolute value
of the number, right?
01:19:13.730 --> 01:19:19.460
Which means that when I am
actually computing this,
01:19:19.460 --> 01:19:22.590
I'm looking at the square, so
it's the square of something.
01:19:22.590 --> 01:19:25.378
So it means that this is
the square of a Gaussian.
01:19:25.378 --> 01:19:26.836
And it's true that,
indeed, the chi
01:19:26.836 --> 01:19:28.780
squared 1 is just the
square of a Gaussian.
01:19:31.420 --> 01:19:36.390
Sure, this is a tautology,
but let's look at this test now.
01:19:36.390 --> 01:19:40.860
This test was built using Wald's
theory and some pretty heavy
01:19:40.860 --> 01:19:42.150
stuff.
01:19:42.150 --> 01:19:44.460
But now if I start looking
at Tn and I think of it
01:19:44.460 --> 01:19:47.600
as being just the absolute value
of this quantity over there,
01:19:47.600 --> 01:19:50.970
squared, what I'm
really doing is
01:19:50.970 --> 01:19:54.510
I'm looking at whether the
square of some Gaussian
01:19:54.510 --> 01:20:00.250
exceeds the quantile of a chi
squared with 1 degree of freedom,
01:20:00.250 --> 01:20:02.550
which means that this thing
is actually equivalent--
01:20:02.550 --> 01:20:04.870
completely equivalent--
to the test.
01:20:04.870 --> 01:20:10.740
So if k is equal to
1, this is completely
01:20:10.740 --> 01:20:15.300
equivalent to looking at the
absolute value of something
01:20:15.300 --> 01:20:19.260
and checking whether it's
larger than, say, q over 2--
01:20:19.260 --> 01:20:22.310
well, than q alpha--
01:20:22.310 --> 01:20:24.030
well, that's q alpha over 2--
01:20:24.030 --> 01:20:26.220
so that the probability
of this thing
01:20:26.220 --> 01:20:27.390
is actually equal to alpha.
01:20:27.390 --> 01:20:29.937
And that's exactly what
we've been doing before.
01:20:29.937 --> 01:20:31.770
When we introduced tests
in the first place,
01:20:31.770 --> 01:20:33.840
we just took absolute
values, said, well,
01:20:33.840 --> 01:20:36.180
this is the absolute value of
a Gaussian in the limit.
01:20:36.180 --> 01:20:37.420
And so it's the same thing.
01:20:37.420 --> 01:20:40.620
So this is actually
equivalent to the probability
01:20:40.620 --> 01:20:44.170
that the norm squared is
larger-- so the square
01:20:44.170 --> 01:20:45.420
of some normal--
01:20:45.420 --> 01:20:52.200
than the q alpha
of some chi squared
01:20:52.200 --> 01:20:53.850
with one degree of freedom.
01:20:53.850 --> 01:20:58.350
Those are exactly
the same two tests.
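NOTE (editor): A quick numerical check of this equivalence, assuming Python with scipy:
    from scipy.stats import chi2, norm
    alpha = 0.05
    print(chi2.ppf(1 - alpha, df=1))     # 3.8414...
    print(norm.ppf(1 - alpha / 2) ** 2)  # 1.9599...**2 = 3.8414...
The two thresholds agree, so the two tests reject on exactly the same event.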
01:20:58.350 --> 01:21:00.810
So in one dimension,
those things just
01:21:00.810 --> 01:21:03.437
collapse into being
one little thing,
01:21:03.437 --> 01:21:05.770
and that's because there's
no geometry in one dimension.
01:21:05.770 --> 01:21:08.820
It's just one dimension, whereas
if I'm in a higher dimension,
01:21:08.820 --> 01:21:12.560
then things get distorted
and things can become weird.