WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:16.680
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.680 --> 00:00:17.880
at ocw.mit.edu.
00:00:21.380 --> 00:00:23.850
PHILIPPE RIGOLLET:
So welcome back.
00:00:23.850 --> 00:00:27.840
We're going to finish this
chapter on maximum likelihood
00:00:27.840 --> 00:00:28.430
estimation.
00:00:28.430 --> 00:00:30.830
And last time, I briefly
mentioned something that
00:00:30.830 --> 00:00:33.220
was called Fisher information.
00:00:33.220 --> 00:00:35.990
So Fisher information,
in general,
00:00:35.990 --> 00:00:40.890
is actually a matrix when you
have a multivariate parameter
00:00:40.890 --> 00:00:41.390
theta.
00:00:41.390 --> 00:00:44.960
So if theta, for example,
is of dimension d,
00:00:44.960 --> 00:00:46.430
then the Fisher
information matrix
00:00:46.430 --> 00:00:48.350
is going to be a d by d matrix.
00:00:48.350 --> 00:00:51.740
You can see that, because
it's the outer product.
00:00:51.740 --> 00:00:55.605
So it's of the form
gradient gradient transpose.
00:00:55.605 --> 00:00:57.230
So if it's gradient
gradient transpose,
00:00:57.230 --> 00:00:59.060
the gradient is
the d dimensional.
00:00:59.060 --> 00:01:03.530
And so gradient times gradient
transpose is a d by d matrix.
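In symbols (my rendering of the definition being described, not in the original captions), the outer-product form of the Fisher information for a d-dimensional parameter is:

```latex
% Fisher information matrix: expected outer product of the score with itself
I(\theta) \;=\; \mathbb{E}\!\left[\,\nabla_\theta\,\ell(\theta)\;\nabla_\theta\,\ell(\theta)^{\top}\right]
\;\in\; \mathbb{R}^{d\times d},
\qquad \ell(\theta) \;=\; \log f_\theta(X).
```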
00:01:03.530 --> 00:01:07.280
And so this matrix
actually contains--
00:01:07.280 --> 00:01:09.560
well, that's why it's called
the Fisher information matrix.
00:01:09.560 --> 00:01:11.143
So it's basically
telling you how much
00:01:11.143 --> 00:01:14.270
information about the
theta is in your model.
00:01:14.270 --> 00:01:17.960
So for example, if your model
is very well-parameterized,
00:01:17.960 --> 00:01:20.120
then you will have a
lot of information.
00:01:20.120 --> 00:01:21.684
You will have a higher--
00:01:21.684 --> 00:01:23.600
so let's think of it as
being a scalar number,
00:01:23.600 --> 00:01:25.141
just one number
now-- so you're going
00:01:25.141 --> 00:01:27.920
to have a larger information
about your parameter
00:01:27.920 --> 00:01:30.090
in the same probability
distribution.
00:01:30.090 --> 00:01:35.900
But if you start having a weird
way to parameterize your model,
00:01:35.900 --> 00:01:38.150
then the Fisher information
is actually going to drop.
00:01:38.150 --> 00:01:40.820
So as a concrete example
think of, for example,
00:01:40.820 --> 00:01:44.240
a parameter of interest
in a Gaussian model,
00:01:44.240 --> 00:01:45.890
where the mean is
known to be zero.
00:01:45.890 --> 00:01:48.974
But what you're interested in
is the variance, sigma squared.
00:01:48.974 --> 00:01:50.390
If I'm interested
in sigma squared,
00:01:50.390 --> 00:01:53.750
I could parameterize my model
by sigma, sigma squared, sigma
00:01:53.750 --> 00:01:55.880
to the fourth, sigma to the 24th.
00:01:55.880 --> 00:01:57.690
I could parameterize
it by whatever I want,
00:01:57.690 --> 00:01:59.601
then I would have a
simple transformation.
00:01:59.601 --> 00:02:01.100
Then you could say
that some of them
00:02:01.100 --> 00:02:02.600
are actually more
or less informative,
00:02:02.600 --> 00:02:04.308
and you're going to
have different values
00:02:04.308 --> 00:02:06.800
for your Fisher information.
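The reparameterization effect can be made precise with the chain rule (a standard identity added here for reference; it is not spelled out on the board):

```latex
% If the model is reparameterized through theta = g(eta), then
I_\eta(\eta) \;=\; I_\theta\big(g(\eta)\big)\,\big(g'(\eta)\big)^{2}.
% E.g. for the Gaussian variance: theta = sigma^2, eta = sigma, g'(eta) = 2*sigma,
% so I_sigma(sigma) = 4*sigma^2 * I_{sigma^2}(sigma^2).
```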
00:02:06.800 --> 00:02:10.729
So let's just review a few
well-known computations.
00:02:10.729 --> 00:02:17.540
So I will focus primarily on the
one dimensional case as usual.
00:02:17.540 --> 00:02:19.430
And I claim that
there's two definitions.
00:02:19.430 --> 00:02:24.020
So if theta is a real
valued parameter,
00:02:24.020 --> 00:02:26.827
then there's basically
two definitions
00:02:26.827 --> 00:02:28.910
that you can think of for
your Fisher information.
00:02:28.910 --> 00:02:30.350
One involves the
first derivative
00:02:30.350 --> 00:02:31.670
of your log likelihood.
00:02:31.670 --> 00:02:34.840
And the second one involves
the second derivative.
00:02:34.840 --> 00:02:36.590
So the log likelihood
here, we're
00:02:36.590 --> 00:02:39.715
actually going to
define it as l of theta.
00:02:39.715 --> 00:02:40.340
And what is it?
00:02:40.340 --> 00:02:43.550
Well, it's simply the likelihood
function for one observation.
00:02:43.550 --> 00:02:46.640
So it's l-- and I'm going to
write 1 just to make sure that
00:02:46.640 --> 00:02:49.010
we all know what we're talking
about one observation--
00:02:49.010 --> 00:02:52.148
of-- what is the order again?
I think it's X and theta.
00:02:55.170 --> 00:02:57.290
So that's the log
likelihood, remember?
00:03:05.290 --> 00:03:07.100
So for example, if
I have a density,
00:03:07.100 --> 00:03:08.058
what is it going to be?
00:03:08.058 --> 00:03:12.330
It's going to be log
of f sub theta of X.
00:03:12.330 --> 00:03:15.190
So this guy is a
random variable,
00:03:15.190 --> 00:03:17.456
because it's a function
of a random variable.
00:03:17.456 --> 00:03:19.580
And that's why you see
expectations of this thing.
00:03:19.580 --> 00:03:21.940
It's a random function of theta.
00:03:21.940 --> 00:03:23.525
If I view this as a
function of theta,
00:03:23.525 --> 00:03:25.150
the function becomes
random, because it
00:03:25.150 --> 00:03:27.590
depends on this random X.
00:03:27.590 --> 00:03:35.180
And so I of theta is actually
defined as the variance
00:03:35.180 --> 00:03:40.400
of l prime of theta--
00:03:40.400 --> 00:03:43.760
so the variance of the
derivative of this function.
00:03:43.760 --> 00:03:50.750
And I also claim that it's equal
to negative the expectation
00:03:50.750 --> 00:03:54.160
of the second
derivative, l prime prime of theta.
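Written out (my LaTeX rendering of the claim, for the scalar case, with expectation and variance under P theta):

```latex
I(\theta) \;=\; \operatorname{Var}\big(\ell'(\theta)\big)
\;=\; -\,\mathbb{E}\big[\ell''(\theta)\big],
\qquad \ell(\theta) \;=\; \log f_\theta(X).
```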
00:03:57.380 --> 00:04:00.560
And here, the expectation
and the variance
00:04:00.560 --> 00:04:03.120
are computed, because this
function, remember, is random.
00:04:03.120 --> 00:04:05.450
So I need to tell you
what is the distribution
00:04:05.450 --> 00:04:06.920
of the X with
respect to which I'm
00:04:06.920 --> 00:04:08.882
computing the expectation
and the variance.
00:04:08.882 --> 00:04:09.965
And it's P theta itself.
00:04:13.882 --> 00:04:15.340
So typically, the
theta we're going
00:04:15.340 --> 00:04:18.670
to be interested in--
so there's a Fisher
00:04:18.670 --> 00:04:20.519
information for all
values of the parameter,
00:04:20.519 --> 00:04:22.227
but the one typically
we're interested in
00:04:22.227 --> 00:04:25.830
is the true
parameter, theta star.
00:04:25.830 --> 00:04:29.270
But view this as a function
of theta right now.
00:04:29.270 --> 00:04:31.760
So now, I need to prove
to you-- and this is not
00:04:31.760 --> 00:04:34.610
a trivial statement-- the
variance of the derivative
00:04:34.610 --> 00:04:36.590
is equal to negative
the expectation
00:04:36.590 --> 00:04:37.777
of the second derivative.
00:04:37.777 --> 00:04:40.360
I mean, there's really quite a
bit that comes into this right.
00:04:40.360 --> 00:04:44.267
And it comes from the fact that
this is a log not of anything.
00:04:44.267 --> 00:04:45.350
It's the log of a density.
00:04:45.350 --> 00:04:48.560
So let's just prove
that without having
00:04:48.560 --> 00:04:51.260
to bother ourselves
too much with
00:04:51.260 --> 00:04:54.219
some technical assumptions.
00:04:54.219 --> 00:04:56.260
And the technical assumptions
are the assumptions
00:04:56.260 --> 00:04:59.270
that allow me to permute
derivative and integral.
00:04:59.270 --> 00:05:01.820
Because when I compute the
variances and expectations,
00:05:01.820 --> 00:05:04.310
I'm actually integrating
against the density.
00:05:04.310 --> 00:05:09.080
And what I want to do is to make
sure that I can always do that.
00:05:09.080 --> 00:05:13.250
So my technical assumptions
are I can always permute
00:05:13.250 --> 00:05:15.320
integral and derivatives.
00:05:15.320 --> 00:05:19.350
So let's just prove this.
00:05:19.350 --> 00:05:21.560
So what I'm going to do
is I'm going to assume
00:05:21.560 --> 00:05:32.750
that X has density f theta.
00:05:35.211 --> 00:05:37.210
And I'm actually just
going to write-- well, let
00:05:37.210 --> 00:05:39.030
me write it f theta right now.
00:05:39.030 --> 00:05:42.000
Let me try to not be
lazy about writing.
00:05:42.000 --> 00:05:43.910
And so the thing
I'm going to use
00:05:43.910 --> 00:05:50.540
is the fact that the integral of
this density is equal to what?
00:05:50.540 --> 00:05:51.820
1.
00:05:51.820 --> 00:05:54.620
And this is where I'm going
to start doing weird things.
00:05:54.620 --> 00:05:56.860
That means that if I take
the derivative of this guy,
00:05:56.860 --> 00:05:59.650
it's equal to 0.
00:05:59.650 --> 00:06:03.370
So that means that if I look
at the derivative with respect
00:06:03.370 --> 00:06:11.840
to theta of integral f theta
of X dX, this is equal to 0.
00:06:11.840 --> 00:06:14.124
And this is where I'm
actually making this switch,
00:06:14.124 --> 00:06:16.040
is that I'm going to say
that this is actually
00:06:16.040 --> 00:06:19.670
equal to the integral
of the derivative.
00:06:27.670 --> 00:06:30.390
So that's going to be the
first thing I'm going to use.
00:06:30.390 --> 00:06:32.820
And of course, if it's true
for the first derivative,
00:06:32.820 --> 00:06:34.820
it's going to be true for
the second derivative.
00:06:34.820 --> 00:06:36.550
So I'm going to actually
do it a second time.
00:06:36.550 --> 00:06:38.220
And the second thing
I'm going to use
00:06:38.220 --> 00:06:46.860
is the fact that the integral
of the second derivative
00:06:46.860 --> 00:06:47.640
is equal to 0.
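Under the interchange assumption, differentiating the identity that the density integrates to 1 once and then twice gives the two facts used in this proof (my rendering of the board):

```latex
\int \frac{\partial}{\partial\theta} f_\theta(x)\,dx \;=\; 0,
\qquad\qquad
\int \frac{\partial^2}{\partial\theta^2} f_\theta(x)\,dx \;=\; 0 .
```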
00:06:50.739 --> 00:06:51.780
So let's start from here.
00:06:59.410 --> 00:07:01.990
And let me start from,
say, the expectation
00:07:01.990 --> 00:07:06.790
of the second derivative,
l prime prime of theta.
00:07:06.790 --> 00:07:08.440
So what is l prime prime theta?
00:07:08.440 --> 00:07:21.320
Well, it's the second derivative
of log of f theta of X.
00:07:21.320 --> 00:07:24.270
And we know that the
derivative of the log--
00:07:24.270 --> 00:07:30.780
sorry-- yeah, so the derivative
of the log is 1 over--
00:07:30.780 --> 00:07:34.050
well, it's the derivative
of f divided by f itself.
00:07:49.647 --> 00:07:50.480
Everybody's with me?
00:07:53.450 --> 00:07:58.760
Just log of f prime
is f prime over f.
00:07:58.760 --> 00:08:01.610
Here, it's just that f, I view
this as a function of theta
00:08:01.610 --> 00:08:04.200
and not as a function of X.
00:08:04.200 --> 00:08:08.040
So now, I need to take another
derivative of this thing.
00:08:08.040 --> 00:08:09.560
So that's going to be equal to--
00:08:09.560 --> 00:08:13.030
well, so we all know the
formula for the derivative
00:08:13.030 --> 00:08:14.000
of the ratio.
00:08:14.000 --> 00:08:22.080
So I pick up the second
derivative times f theta
00:08:22.080 --> 00:08:30.480
minus the first
derivative squared
00:08:30.480 --> 00:08:33.840
divided by f theta squared--
00:08:38.590 --> 00:08:40.542
basic calculus.
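In formulas, the two derivatives just computed on the board are (my rendering):

```latex
\ell'(\theta) \;=\; \frac{\partial_\theta f_\theta(X)}{f_\theta(X)},
\qquad
\ell''(\theta) \;=\;
\frac{\partial_\theta^2 f_\theta(X)\, f_\theta(X)
\;-\; \big(\partial_\theta f_\theta(X)\big)^2}{f_\theta(X)^{2}} .
```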
00:08:40.542 --> 00:08:43.390
And now, I need to check
that negative the expectation
00:08:43.390 --> 00:08:47.580
of this guy is giving
me back what I want.
00:08:47.580 --> 00:08:52.030
Well what is negative the
expectation of l prime prime
00:08:52.030 --> 00:08:54.150
of theta?
00:08:54.150 --> 00:08:56.440
Well, what we need to do
is to do negative integral
00:08:56.440 --> 00:08:59.500
of this guy against f theta.
00:08:59.500 --> 00:09:01.780
So it's minus the integral of--
00:09:28.340 --> 00:09:31.640
That's just the definition
of the expectation.
00:09:31.640 --> 00:09:34.370
I take an integral
against f theta.
00:09:34.370 --> 00:09:36.440
But here, I have something nice.
00:09:36.440 --> 00:09:38.540
What's happening is that
those guys are canceling.
00:09:41.937 --> 00:09:43.520
And now that those
guys are canceling,
00:09:43.520 --> 00:09:44.780
those guys are canceling too.
00:09:51.556 --> 00:09:53.180
So what I have is
that the first term--
00:09:53.180 --> 00:09:56.000
I'm going to break
this difference here.
00:09:56.000 --> 00:09:58.250
So I'm going to say that
integral of this difference
00:09:58.250 --> 00:10:00.320
is the difference
of the integrals.
00:10:00.320 --> 00:10:03.260
So the first term is
going to be the integral
00:10:03.260 --> 00:10:09.256
of d over d theta
squared of f theta.
00:10:13.910 --> 00:10:17.360
And the second one, the negative
signs are going to cancel,
00:10:17.360 --> 00:10:32.140
and I'm going to
be left with this.
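Rendering the board computation, with the cancellations just described (one f theta cancels against the f theta squared in the denominator):

```latex
-\,\mathbb{E}\big[\ell''(\theta)\big]
\;=\; -\int \frac{\partial_\theta^2 f_\theta(x)\, f_\theta(x)
- \big(\partial_\theta f_\theta(x)\big)^2}{f_\theta(x)^{2}}\; f_\theta(x)\,dx
\;=\; -\int \partial_\theta^2 f_\theta(x)\,dx
\;+\; \int \frac{\big(\partial_\theta f_\theta(x)\big)^2}{f_\theta(x)}\,dx .
```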
00:10:37.700 --> 00:10:39.147
Everybody's following?
00:10:39.147 --> 00:10:40.230
Anybody found the mistake?
00:10:44.860 --> 00:10:46.954
How about the other mistake?
00:10:46.954 --> 00:10:48.370
I don't know if
there's a mistake.
00:10:48.370 --> 00:10:50.980
I'm just trying to get you
to check what I'm doing.
00:10:54.010 --> 00:10:56.370
With me so far?
00:10:56.370 --> 00:10:59.670
So this guy here is the integral
of the second derivative
00:10:59.670 --> 00:11:02.400
of f of X dX.
00:11:02.400 --> 00:11:04.858
What is this?
00:11:04.858 --> 00:11:06.340
AUDIENCE: It's 0.
00:11:06.340 --> 00:11:08.330
PHILIPPE RIGOLLET: It's 0.
00:11:08.330 --> 00:11:16.910
And that's because of this guy,
which I will call frowny face.
00:11:16.910 --> 00:11:20.930
So frowny face tells me this.
00:11:20.930 --> 00:11:26.480
And let's call this guy
monkey that hides his eyes.
00:11:26.480 --> 00:11:28.030
No, let's just do
something simpler.
00:11:28.030 --> 00:11:29.360
Let's call it star.
00:11:29.360 --> 00:11:32.180
And this guy, we
will use later on.
00:11:35.630 --> 00:11:37.760
So now, I have to prove
that this guy, which
00:11:37.760 --> 00:11:40.460
I have proved is
equal to this, is now
00:11:40.460 --> 00:11:46.070
equal to the variance
of l prime theta.
00:11:46.070 --> 00:11:48.200
So now, let's go back
to the other way.
00:11:48.200 --> 00:11:49.520
We're going to meet halfway.
00:11:49.520 --> 00:11:51.020
I'm going to have a series--
00:11:51.020 --> 00:11:56.090
I want to prove that this
guy is equal to this guy.
00:11:56.090 --> 00:11:58.340
And I'm going to have
a series of equalities
00:11:58.340 --> 00:12:01.740
that I'm going to meet halfway.
00:12:01.740 --> 00:12:03.239
So let's start
from the other end.
00:12:03.239 --> 00:12:05.280
We started from the negative
l prime prime theta.
00:12:05.280 --> 00:12:06.743
Let's start with
the variance part.
00:12:10.330 --> 00:12:17.394
Variance of l prime of theta,
so that's the variance--
00:12:22.520 --> 00:12:29.170
so that's the
expectation of l prime
00:12:29.170 --> 00:12:35.230
of theta squared minus the
square of the expectation of l
00:12:35.230 --> 00:12:35.920
prime of theta.
00:12:41.370 --> 00:12:43.340
Now, what is the square
of the expectation
00:12:43.340 --> 00:12:44.630
of l prime of theta?
00:12:44.630 --> 00:12:50.750
Well, l prime of theta is equal
to the partial with respect
00:12:50.750 --> 00:12:57.461
to theta of log f theta of X,
which we know from the first
00:12:57.461 --> 00:12:59.960
line over there-- that's what's
in the bracket on the second
00:12:59.960 --> 00:13:01.030
line there--
00:13:01.030 --> 00:13:05.010
is actually equal to the
partial over theta of f
00:13:05.010 --> 00:13:09.410
theta X divided by
f theta X. That's
00:13:09.410 --> 00:13:11.390
the derivative of the log.
00:13:11.390 --> 00:13:16.150
So when I look at the
expectation of this guy,
00:13:16.150 --> 00:13:18.700
I'm going to have the integral
of this against f theta.
00:13:18.700 --> 00:13:20.540
And the f thetas are
going to cancel again,
00:13:20.540 --> 00:13:23.260
just like I did here.
00:13:23.260 --> 00:13:26.110
So this thing is actually
equal to the integral
00:13:26.110 --> 00:13:33.650
of partial over theta
of f theta of X dX.
00:13:33.650 --> 00:13:34.950
And what does this equal to?
00:13:37.720 --> 00:13:42.310
0, by the monkey hiding his eyes.
00:13:42.310 --> 00:13:46.950
So that's star-- tells me
that this is equal to 0.
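In formulas, the term that just vanished is (my rendering, using the first interchange identity):

```latex
\mathbb{E}\big[\ell'(\theta)\big]
\;=\; \int \frac{\partial_\theta f_\theta(x)}{f_\theta(x)}\; f_\theta(x)\,dx
\;=\; \int \partial_\theta f_\theta(x)\,dx
\;=\; 0 .
```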
00:13:50.090 --> 00:13:52.640
So basically, when I compute
the variance, this term is not
00:13:52.640 --> 00:13:53.390
going to matter.
00:13:53.390 --> 00:13:54.973
I only have to
compute the first one.
00:14:10.630 --> 00:14:12.460
So what is the first one?
00:14:12.460 --> 00:14:21.280
Well, the first one is the
expectation of l prime squared.
00:14:24.820 --> 00:14:29.770
And so that guy is the integral
of-- well, what is l prime?
00:14:29.770 --> 00:14:33.160
Again, it's partial
over partial theta
00:14:33.160 --> 00:14:37.960
f theta of X divided by f theta
of X. Now, this time, this guy
00:14:37.960 --> 00:14:40.975
is squared against the density.
00:14:44.320 --> 00:14:45.625
So one of the f thetas cancel.
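So the remaining term reads (my rendering of the board):

```latex
\mathbb{E}\big[\ell'(\theta)^{2}\big]
\;=\; \int \left(\frac{\partial_\theta f_\theta(x)}{f_\theta(x)}\right)^{\!2} f_\theta(x)\,dx
\;=\; \int \frac{\big(\partial_\theta f_\theta(x)\big)^{2}}{f_\theta(x)}\,dx ,
```

which is exactly the expression obtained from the other direction.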
00:15:07.195 --> 00:15:11.790
But now, I'm back to what
I had before for this guy.
00:15:16.420 --> 00:15:20.540
So this guy is now
equal to this guy.
00:15:20.540 --> 00:15:21.940
There's just the same formula.
00:15:21.940 --> 00:15:23.960
So they're the same thing.
00:15:23.960 --> 00:15:25.880
And so I've moved both ways.
00:15:25.880 --> 00:15:27.800
Starting from the
expression that
00:15:27.800 --> 00:15:30.230
involves the expectation
of the second derivative,
00:15:30.230 --> 00:15:31.820
I've come to this guy.
00:15:31.820 --> 00:15:34.310
And starting from the
expression that tells me
00:15:34.310 --> 00:15:36.350
about the variance of
the first derivative,
00:15:36.350 --> 00:15:38.300
I've come to the same guy.
00:15:38.300 --> 00:15:41.397
So that completes my proof.
00:15:41.397 --> 00:15:43.063
Are there any questions
about the proof?
00:15:47.050 --> 00:15:53.890
We also have on our way found an
explicit formula for the Fisher
00:15:53.890 --> 00:15:55.330
information as well.
00:15:55.330 --> 00:15:57.760
So now that I have this
thing, I could actually
00:15:57.760 --> 00:16:02.200
add that if X has a
density, for example,
00:16:02.200 --> 00:16:07.000
this is also equal
to the integral of--
00:16:09.810 --> 00:16:15.840
well, the partial over
theta of f theta of X
00:16:15.840 --> 00:16:24.431
squared divided by f
theta of X, because I just
00:16:24.431 --> 00:16:26.180
proved that those two
things were actually
00:16:26.180 --> 00:16:28.013
equal to the same thing,
which was this guy.
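As a concrete sanity check (my own sketch, not part of the lecture), one can verify symbolically that the three expressions for the Fisher information agree for the Gaussian variance model N(0, theta) with known mean zero, the example mentioned earlier; the helper E below is assumed notation, not a SymPy built-in:

```python
import sympy as sp

x = sp.symbols('x', real=True)
theta = sp.symbols('theta', positive=True)   # theta = sigma^2, the variance

# Density of N(0, theta) and the log-likelihood of one observation
f = sp.exp(-x**2 / (2 * theta)) / sp.sqrt(2 * sp.pi * theta)
l = sp.log(f)

lp = sp.diff(l, theta)        # l'(theta), the score
lpp = sp.diff(l, theta, 2)    # l''(theta)

def E(g):
    """Expectation of g(X) for X ~ N(0, theta)."""
    return sp.integrate(g * f, (x, -sp.oo, sp.oo))

I_var = sp.simplify(E(lp**2) - E(lp)**2)              # Var(l'(theta))
I_curv = sp.simplify(-E(lpp))                         # -E[l''(theta)]
I_expl = sp.simplify(E((sp.diff(f, theta) / f)**2))   # integral of (d_theta f)^2 / f

# All three should simplify to the same expression, 1/(2*theta**2)
print(I_var, I_curv, I_expl)
```

This also recovers the well-known value I(sigma^2) = 1/(2 sigma^4) for the Gaussian variance parameter.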
00:16:31.230 --> 00:16:34.024
Now in practice, this is really
going to be the useful one.
00:16:34.024 --> 00:16:35.940
The other two are going
to be useful depending
00:16:35.940 --> 00:16:37.810
on what case you're in.
00:16:37.810 --> 00:16:40.560
So if I ask you to compute
the Fisher information,
00:16:40.560 --> 00:16:43.020
you have now three
ways to pick from.
00:16:43.020 --> 00:16:44.670
And basically,
practice will tell you
00:16:44.670 --> 00:16:47.070
which one to choose if you
want to save five minutes when
00:16:47.070 --> 00:16:48.666
you're doing your computations.
00:16:48.666 --> 00:16:50.790
Maybe you're the guy who
likes to take derivatives.
00:16:50.790 --> 00:16:52.860
And then you're going to go
with the second derivative one.
00:16:52.860 --> 00:16:55.000
Maybe you're the guy who
likes to expand squares,
00:16:55.000 --> 00:16:56.500
so you're going to
take the one that
00:16:56.500 --> 00:16:59.340
involves the square
of l prime.
00:16:59.340 --> 00:17:01.042
And maybe you're
just a normal person,
00:17:01.042 --> 00:17:02.250
and you want to use that guy.
00:17:06.119 --> 00:17:07.740
Why do I care?
00:17:07.740 --> 00:17:09.540
This is the Fisher information.
00:17:09.540 --> 00:17:11.790
And I could have defined the
[? Hilbert ?] information
00:17:11.790 --> 00:17:13.470
by taking the square
root of this guy
00:17:13.470 --> 00:17:16.500
plus the sine of this thing
and just be super happy
00:17:16.500 --> 00:17:18.369
and have my name in textbooks.
00:17:18.369 --> 00:17:22.170
But this thing has a
very particular meaning.
00:17:22.170 --> 00:17:24.540
When we're doing the maximum
likelihood estimation--
00:17:28.590 --> 00:17:32.910
so remember the maximum
likelihood estimation is just
00:17:32.910 --> 00:17:36.720
an empirical version of trying
to minimize the KL divergence.
00:17:36.720 --> 00:17:39.570
So what we're trying to
do, maximum likelihood,
00:17:39.570 --> 00:17:42.820
is really trying to
minimize the KL divergence.
00:17:54.160 --> 00:17:57.490
And we're trying to minimize
this function, remember?
00:17:57.490 --> 00:18:01.034
So now what we're
going to do is we're
00:18:01.034 --> 00:18:02.200
going to plot this function.
00:18:02.200 --> 00:18:04.000
We said that, let's
place ourselves
00:18:04.000 --> 00:18:06.440
in cases where
this KL is convex,
00:18:06.440 --> 00:18:09.550
so that the inverse is concave.
00:18:09.550 --> 00:18:11.020
So it's going to
look like this--
00:18:11.020 --> 00:18:13.930
U-shaped, that's convex.
00:18:13.930 --> 00:18:16.119
So that's the true thing
I'm trying to minimize.
00:18:16.119 --> 00:18:18.160
And what I said is that
I'm going to actually try
00:18:18.160 --> 00:18:19.215
to estimate this guy.
00:18:19.215 --> 00:18:20.590
So in practice,
I'm going to have
00:18:20.590 --> 00:18:22.990
something that looks like
this, but it's not really this.
00:18:26.425 --> 00:18:28.050
And we're not going
to do this, but you
00:18:28.050 --> 00:18:30.120
can show that you
can control this
00:18:30.120 --> 00:18:33.810
uniformly over the entire space,
that there is no space where
00:18:33.810 --> 00:18:35.190
it just becomes huge.
00:18:35.190 --> 00:18:37.020
In particular, this
is not the space
00:18:37.020 --> 00:18:38.450
where it just
becomes super huge,
00:18:38.450 --> 00:18:39.930
and the minimum
of the dotted line
00:18:39.930 --> 00:18:42.120
becomes really
far from this guy.
00:18:42.120 --> 00:18:45.330
So if those two functions
are close to each other,
00:18:45.330 --> 00:18:48.510
then this implies that the
minimum here of the dotted line
00:18:48.510 --> 00:18:53.250
is close to the minimum
of the solid line.
00:18:53.250 --> 00:18:56.180
So we know that
this is theta star.
00:18:56.180 --> 00:19:00.472
And this is our MLE
estimator, theta hat ML.
00:19:00.472 --> 00:19:01.930
So that's basically
the principle--
00:19:01.930 --> 00:19:05.450
the more data we have,
the closer the dotted line
00:19:05.450 --> 00:19:07.020
is to the solid line.
00:19:07.020 --> 00:19:10.470
And so the minimum is
closer to the minimum.
00:19:10.470 --> 00:19:12.800
But now, this is
just one example,
00:19:12.800 --> 00:19:14.330
where I drew a picture for you.
00:19:14.330 --> 00:19:17.190
But there could be some
really nasty examples.
00:19:17.190 --> 00:19:20.330
Think of this
example, where I have
00:19:20.330 --> 00:19:23.840
a function, which is convex,
but it looks more like this.
00:19:30.120 --> 00:19:31.790
That's convex, it's U-shaped.
00:19:31.790 --> 00:19:36.300
It's just a professional U.
00:19:36.300 --> 00:19:41.430
Now, I'm going to put a dotted
line around it that has pretty
00:19:41.430 --> 00:19:42.690
much the same fluctuations.
00:19:42.690 --> 00:19:44.720
The band around it
is of this size.
00:19:52.980 --> 00:19:56.730
So do we agree that the
distance between the solid line
00:19:56.730 --> 00:19:58.590
and the dotted line is
pretty much the same
00:19:58.590 --> 00:20:01.110
in those two pictures?
00:20:01.110 --> 00:20:04.650
Now, here, depending
on how I tilt this guy,
00:20:04.650 --> 00:20:07.570
basically, I can put the minimum
theta star wherever I want.
00:20:07.570 --> 00:20:11.650
And let's say that here,
I actually put it here.
00:20:11.650 --> 00:20:13.800
That's pretty much the
minimum of this line.
00:20:13.800 --> 00:20:16.530
And now, the minimum of the
dotted line is this guy.
00:20:20.930 --> 00:20:22.730
So they're very far.
00:20:22.730 --> 00:20:25.880
The fact that I'm very
flat at the bottom
00:20:25.880 --> 00:20:28.340
makes my requirements
for being close
00:20:28.340 --> 00:20:31.950
to the U-shaped solid
curve much more stringent,
00:20:31.950 --> 00:20:34.020
if I want to stay close.
00:20:34.020 --> 00:20:38.190
And so this is the
canonical case.
00:20:38.190 --> 00:20:39.720
This is the annoying case.
00:20:39.720 --> 00:20:43.710
And of course, you
have the awesome case--
00:20:43.710 --> 00:20:45.540
looks like this.
00:20:45.540 --> 00:20:47.780
And then however
you deviate, you
00:20:47.780 --> 00:20:50.430
can have something
that moves pretty far.
00:20:50.430 --> 00:20:53.480
It doesn't matter, it's
always going to stay close.
00:20:53.480 --> 00:20:57.600
Now, what is the quantity
that measures how
00:20:57.600 --> 00:20:59.700
curved I am at a given point--
00:20:59.700 --> 00:21:03.840
how curved the function
is at a given point?
00:21:03.840 --> 00:21:05.420
The second derivative.
00:21:05.420 --> 00:21:11.150
And so the Fisher information is
negative the second derivative.
00:21:11.150 --> 00:21:12.394
Why the negative?
00:21:17.044 --> 00:21:18.960
Well here-- Yeah, we're
looking for a minimum,
00:21:18.960 --> 00:21:20.793
and this guy is really
the-- you should view
00:21:20.793 --> 00:21:23.460
this as an inverted function.
00:21:23.460 --> 00:21:26.370
This is we're trying to
maximize the likelihood, which
00:21:26.370 --> 00:21:28.950
is basically maximizing
the negative KL.
00:21:28.950 --> 00:21:31.650
So the picture I'm showing you
is trying to minimize the KL.
00:21:31.650 --> 00:21:33.990
So the true picture that
you should see for this guy
00:21:33.990 --> 00:21:37.080
is the same, except that
it's just flipped over.
00:21:37.080 --> 00:21:40.800
But the curvature is the same,
whether I flip my sheet or not.
00:21:40.800 --> 00:21:42.390
So it's the same thing.
00:21:42.390 --> 00:21:44.034
So apart from this
negative sign,
00:21:44.034 --> 00:21:45.450
which is just
coming from the fact
00:21:45.450 --> 00:21:47.550
that we're maximizing
instead of minimizing,
00:21:47.550 --> 00:21:50.490
this is just telling me
how curved my likelihood is
00:21:50.490 --> 00:21:51.810
around the maximum.
00:21:51.810 --> 00:21:55.080
And therefore, it's actually
telling me how good,
00:21:55.080 --> 00:21:58.830
how robust my maximum
likelihood estimator is.
00:21:58.830 --> 00:22:01.270
It's going to tell me how
close, actually, my likelihood
00:22:01.270 --> 00:22:03.870
estimator is going to be--
00:22:03.870 --> 00:22:06.510
maximum likelihood is going
to be to the true parameter.
00:22:06.510 --> 00:22:09.030
So I should be able
to see that somewhere.
00:22:09.030 --> 00:22:11.220
There should be some
statement that tells me
00:22:11.220 --> 00:22:13.230
that this Fisher
information will
00:22:13.230 --> 00:22:17.700
play a role when assessing the
precision of this estimator.
00:22:17.700 --> 00:22:20.280
And remember, how do we
characterize a good estimator?
00:22:20.280 --> 00:22:24.030
Well, we look at its bias,
or we look at its variance.
00:22:24.030 --> 00:22:26.880
And we can combine the two
and form the quadratic risk.
00:22:26.880 --> 00:22:29.160
So essentially, what
we're going to try to say
00:22:29.160 --> 00:22:31.525
is that one of those
guys-- either the bias
00:22:31.525 --> 00:22:33.150
or the variance or
the quadratic risk--
00:22:33.150 --> 00:22:35.284
is going to be worse if
my function is flatter,
00:22:35.284 --> 00:22:37.200
meaning that my Fisher
information is smaller.
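The statement being previewed here is the classical asymptotic normality of the maximum likelihood estimator (quoted for reference; the precise conditions are the ones listed next):

```latex
\sqrt{n}\,\big(\hat\theta_n^{\,MLE} - \theta^\star\big)
\;\xrightarrow[n\to\infty]{(d)}\;
\mathcal{N}\big(0,\; I(\theta^\star)^{-1}\big),
```

so a smaller Fisher information means a larger asymptotic variance.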
00:22:40.390 --> 00:22:44.030
And this is exactly the
point of this last theorem.
00:22:44.030 --> 00:22:46.270
So let's look at a
couple of conditions.
00:22:46.270 --> 00:22:51.310
So this is your typical
1950s statistics
00:22:51.310 --> 00:22:54.770
paper that has like one
page of assumptions.
00:22:54.770 --> 00:22:56.764
And this was like that
in the early days,
00:22:56.764 --> 00:22:58.180
because people
were trying to make
00:22:58.180 --> 00:23:01.572
theories that would be valid
for as many models as possible.
00:23:01.572 --> 00:23:03.280
And now, people are
sort of abusing this,
00:23:03.280 --> 00:23:05.470
and they're just making all
these lists of assumptions
00:23:05.470 --> 00:23:06.670
so that their
particular method works
00:23:06.670 --> 00:23:08.628
for their particular
problem, because they just
00:23:08.628 --> 00:23:10.030
want to take shortcuts.
00:23:10.030 --> 00:23:13.480
But really, the maximum
likelihood estimator
00:23:13.480 --> 00:23:15.820
is basically as old
as modern statistics.
00:23:15.820 --> 00:23:18.190
And so these were really
necessary conditions.
00:23:18.190 --> 00:23:19.810
And we'll just parse that.
00:23:19.810 --> 00:23:21.610
The model is identified.
00:23:21.610 --> 00:23:24.070
Well, better be, because
I'm trying to estimate
00:23:24.070 --> 00:23:25.550
theta and not P theta.
00:23:25.550 --> 00:23:26.860
So this one is good.
00:23:26.860 --> 00:23:32.630
For all theta, the support of P
theta does not depend on theta.
00:23:32.630 --> 00:23:34.850
So that's just something
that we need to have.
00:23:34.850 --> 00:23:36.710
Otherwise, things
become really messy.
00:23:36.710 --> 00:23:38.540
And in particular,
I'm not going to be
00:23:38.540 --> 00:23:40.600
able to define likelihood--
00:23:40.600 --> 00:23:42.910
Kullback-Leibler divergences.
00:23:42.910 --> 00:23:44.340
Then why can I not do that?
00:23:44.340 --> 00:23:46.730
Well, because the
Kullback-Leibler divergence
00:23:46.730 --> 00:23:49.430
has a log of the ratio
of two densities.
00:23:49.430 --> 00:23:51.554
And if one of the support
is changing with theta
00:23:51.554 --> 00:23:53.720
it might be that you have
the log of something that's
00:23:53.720 --> 00:23:55.820
0 or something that's not 0.
00:23:55.820 --> 00:23:59.450
And the log of 0 is a slightly
annoying quantity to play with.
00:23:59.450 --> 00:24:01.220
And so we're just
removing that case.
00:24:01.220 --> 00:24:02.870
Nothing depends on theta--
00:24:02.870 --> 00:24:05.170
think of it as being
basically the entire real line
00:24:05.170 --> 00:24:08.020
as the support for
the Gaussian, for example.
00:24:08.020 --> 00:24:10.830
Theta star is not on
the boundary of theta.
00:24:10.830 --> 00:24:13.147
Can anybody tell me
why this is important?
00:24:17.720 --> 00:24:19.142
We're talking about derivatives.
00:24:19.142 --> 00:24:20.850
So when I want to talk
about derivatives,
00:24:20.850 --> 00:24:23.260
I'm talking about fluctuations
around a certain point.
00:24:23.260 --> 00:24:26.166
And if I'm at the boundary,
it's actually really annoying.
00:24:26.166 --> 00:24:27.790
I might have the
derivative-- remember,
00:24:27.790 --> 00:24:28.940
I give you this example--
00:24:28.940 --> 00:24:31.720
where the maximum likelihood is
just obtained at the boundary,
00:24:31.720 --> 00:24:34.300
because the function cannot
grow anymore at the boundary.
00:24:34.300 --> 00:24:36.550
But it does not mean
that the first order
00:24:36.550 --> 00:24:38.050
derivative is equal to 0.
00:24:38.050 --> 00:24:39.700
It does not mean anything.
00:24:39.700 --> 00:24:42.040
So all this picture
here is valid
00:24:42.040 --> 00:24:46.720
only if I'm actually
achieving the minimum inside.
00:24:46.720 --> 00:24:52.030
Because if my theta space stops
here and it's just this guy,
00:24:52.030 --> 00:24:53.200
I'm going to be here.
00:24:53.200 --> 00:24:55.600
And there's no questions
about curvature or anything
00:24:55.600 --> 00:24:56.690
that comes into play.
00:24:56.690 --> 00:24:58.340
It's completely different.
00:24:58.340 --> 00:25:00.080
So here, it's inside.
00:25:00.080 --> 00:25:02.770
Again, think of theta as
being the entire real line.
00:25:02.770 --> 00:25:05.550
Then everything is inside.
00:25:05.550 --> 00:25:08.040
I is invertible.
00:25:08.040 --> 00:25:11.130
What does it mean for a
positive number, a 1 by 1 matrix
00:25:11.130 --> 00:25:12.420
to be invertible?
00:25:16.820 --> 00:25:17.320
Yep.
00:25:17.320 --> 00:25:20.260
AUDIENCE: It'd be equal
to its [INAUDIBLE]..
00:25:20.260 --> 00:25:24.167
PHILIPPE RIGOLLET: A 1 by 1
matrix, that's a number, right?
00:25:24.167 --> 00:25:26.750
What is a characteristic-- if I
give you a matrix with numbers
00:25:26.750 --> 00:25:28.250
and ask you if it's
invertible, what
00:25:28.250 --> 00:25:31.658
are you going to do with it?
00:25:31.658 --> 00:25:33.650
AUDIENCE: Check if
the determinant is 0.
00:25:33.650 --> 00:25:35.691
PHILIPPE RIGOLLET: Check
if the determinant is 0.
00:25:35.691 --> 00:25:37.600
What is the determinant
of the 1 by 1 matrix?
00:25:37.600 --> 00:25:38.944
It's just the number itself.
00:25:38.944 --> 00:25:41.360
So that's basically, you want
to check if this number is 0
00:25:41.360 --> 00:25:42.500
or not.
00:25:42.500 --> 00:25:44.990
So we're going to think in
the one dimensional case here.
00:25:44.990 --> 00:25:46.739
And in the one dimensional
case, that just
00:25:46.739 --> 00:25:51.480
means that the
curvature is not 0.
00:25:51.480 --> 00:25:53.230
Well, it better be not
0, because then I'm
00:25:53.230 --> 00:25:54.460
going to have no guarantees.
00:25:54.460 --> 00:25:56.680
If I'm totally flat,
if I have no curvature,
00:25:56.680 --> 00:25:58.900
I'm basically totally
flat at the bottom.
00:25:58.900 --> 00:26:00.580
And then I'm going
to get nasty things.
00:26:00.580 --> 00:26:02.170
Now, this is not true.
00:26:02.170 --> 00:26:05.740
I could have the curvature
which grows like-- so here, it's
00:26:05.740 --> 00:26:08.110
basically-- the second
derivative is telling me--
00:26:08.110 --> 00:26:09.670
if I do the Taylor
expansion, it's
00:26:09.670 --> 00:26:13.170
telling me how I grow as a
function of, say, x squared.
00:26:13.170 --> 00:26:15.250
It's the quadratic term
that I'm controlling.
00:26:15.250 --> 00:26:19.170
It could be that this guy is
0, and then the term of order,
00:26:19.170 --> 00:26:20.805
x to the fourth, is picking up.
00:26:20.805 --> 00:26:22.640
That could be the first
one that's non-zero.
00:26:23.290 --> 00:26:25.270
But that would mean that
my rate of convergence
00:26:25.270 --> 00:26:26.550
would not be square root of n.
00:26:26.550 --> 00:26:28.549
When I'm actually applying
the central limit theorem,
00:26:28.549 --> 00:26:30.820
it would become n to the 1/4th.
00:26:30.820 --> 00:26:33.660
And if I have a whole bunch
of zeros until the 16th order,
00:26:33.660 --> 00:26:36.460
I would have n to the
1/16th, because that's really
00:26:36.460 --> 00:26:39.572
telling me how flat I am.
00:26:39.572 --> 00:26:41.030
So we're going to
focus on the case
00:26:41.030 --> 00:26:43.160
where it's only quadratic
terms, and the rates
00:26:43.160 --> 00:26:46.370
of the central limit
theorems kick in.
00:26:46.370 --> 00:26:48.560
And then a few other
technical conditions--
00:26:48.560 --> 00:26:49.940
we just used a couple of them.
00:26:49.940 --> 00:26:52.100
So I permuted
limit and integral.
00:26:52.100 --> 00:26:54.890
And you can check that
really what you want
00:26:54.890 --> 00:26:58.246
is that the integral of a
derivative is equal to 0.
00:26:58.246 --> 00:27:00.370
Well, it just means that
the values at the two ends
00:27:00.370 --> 00:27:01.580
are actually the same.
00:27:01.580 --> 00:27:05.090
So those are slightly
different things.
00:27:05.090 --> 00:27:08.900
So now, what we have is that
the maximum likelihood estimator
00:27:08.900 --> 00:27:10.590
has the following
two properties.
00:27:10.590 --> 00:27:13.610
The first one, if I were to
say that in words, what would
00:27:13.610 --> 00:27:15.470
I say, that theta hat is--
00:27:18.851 --> 00:27:20.790
Is what?
00:27:20.790 --> 00:27:22.430
Yeah, that's what I
would say when I--
00:27:22.430 --> 00:27:23.630
that's for mathematicians.
00:27:23.630 --> 00:27:27.530
But if I'm a statistician,
what am I going to say?
00:27:27.530 --> 00:27:28.620
It's consistent.
00:27:28.620 --> 00:27:30.470
It's a consistent
estimator of theta star.
00:27:30.470 --> 00:27:33.830
It converges in
probability to theta star.
00:27:33.830 --> 00:27:35.960
And then we have this sort
of central limit theorem
00:27:35.960 --> 00:27:36.946
statement.
00:27:36.946 --> 00:27:39.320
The central limit theorem
statement tells me that if this
00:27:39.320 --> 00:27:44.200
was an average and I remove the
expectation of the average--
00:27:44.200 --> 00:27:45.717
let's say it's 0, for example--
00:27:45.717 --> 00:27:47.550
then square root of n
times the average blah
00:27:47.550 --> 00:27:49.830
goes through some
normal distribution.
00:27:49.830 --> 00:27:52.080
This is telling me that
this is actually true,
00:27:52.080 --> 00:27:55.500
even if theta hat has nothing
to do with an average.
00:27:55.500 --> 00:27:56.725
That's remarkable.
00:27:56.725 --> 00:27:59.640
Theta hat might not
even have a closed form,
00:27:59.640 --> 00:28:02.070
and I'm still having
basically the same properties
00:28:02.070 --> 00:28:04.410
as an average that
would be given to me
00:28:04.410 --> 00:28:08.180
by a central limit theorem.
00:28:08.180 --> 00:28:10.690
And what is the
asymptotic variance?
00:28:10.690 --> 00:28:12.510
So that's the asymptotic variance.
00:28:12.510 --> 00:28:15.980
So here, I'm thinking of having
those guys being multivariate.
00:28:15.980 --> 00:28:18.320
And so I have the inverse
of the Fisher information matrix
00:28:18.320 --> 00:28:21.050
that shows up as the
variance-covariance matrix
00:28:21.050 --> 00:28:22.640
asymptotically.
00:28:22.640 --> 00:28:25.430
But if you think of just being
a one dimensional parameter,
00:28:25.430 --> 00:28:27.680
it's one over this
Fisher information,
00:28:27.680 --> 00:28:29.616
one over the curvature.
00:28:29.616 --> 00:28:31.490
So if the curvature is
really flat, the variance
00:28:31.490 --> 00:28:33.230
becomes really big.
00:28:33.230 --> 00:28:36.230
If the function is really
flat, curvature is low,
00:28:36.230 --> 00:28:37.070
variance is big.
00:28:37.070 --> 00:28:41.384
If the curvature is very high,
the variance becomes very low.
00:28:41.384 --> 00:28:42.800
And so that
illustrates everything
00:28:42.800 --> 00:28:45.510
that's happening with the
pictures that we have.
00:28:45.510 --> 00:28:48.740
And if you look,
what's amazing here,
00:28:48.740 --> 00:28:51.970
there is no square root 2
pi, there's no fudge factors
00:28:51.970 --> 00:28:52.940
going on here.
00:28:52.940 --> 00:28:56.270
This is the asymptotic
variance, nothing else.
00:28:56.270 --> 00:28:58.404
It's all in there,
all in the curvature.
00:29:03.770 --> 00:29:05.228
Are there any
questions about this?
00:29:07.860 --> 00:29:11.190
So you can see here that theta
star is the true parameter.
00:29:11.190 --> 00:29:17.160
And the information matrix
is evaluated at theta star.
00:29:17.160 --> 00:29:18.660
That's the point that matters.
00:29:18.660 --> 00:29:20.280
When I drew this
picture, the point
00:29:20.280 --> 00:29:22.420
that was at the very bottom
was always theta star.
00:29:22.420 --> 00:29:26.980
It's the one that minimizes
the KL divergence,
00:29:26.980 --> 00:29:32.856
as long as it's identified.
00:29:32.856 --> 00:29:33.356
Yes.
00:29:33.356 --> 00:29:35.766
AUDIENCE: So the
higher the curvature,
00:29:35.766 --> 00:29:38.515
the higher the inverse of
the Fisher information?
00:29:38.515 --> 00:29:39.890
PHILIPPE RIGOLLET:
No, the higher
00:29:39.890 --> 00:29:42.520
the Fisher information itself.
00:29:42.520 --> 00:29:46.310
So the inverse is
going to be smaller.
00:29:46.310 --> 00:29:48.960
So small variance is good.
00:29:48.960 --> 00:29:50.510
So now what it
means, actually, if I
00:29:50.510 --> 00:29:51.980
look at what is
the quadratic risk
00:29:51.980 --> 00:29:54.050
of this guy,
asymptotically-- what
00:29:54.050 --> 00:29:56.270
is asymptotic quadratic risk?
00:29:56.270 --> 00:29:57.590
Well, it's 0 actually.
00:29:57.590 --> 00:30:01.470
But if I assume that
this thing is true,
00:30:01.470 --> 00:30:03.419
that this thing is
pretty much Gaussian,
00:30:03.419 --> 00:30:05.210
if I look at the
quadratic risk, well, it's
00:30:05.210 --> 00:30:08.170
the expectation of the
square of this thing.
00:30:08.170 --> 00:30:12.312
And so it's going to scale
like the variance divided by n.
00:30:14.930 --> 00:30:18.800
The bias goes to
0, just by this.
00:30:18.800 --> 00:30:20.590
And then the quadratic
risk is going
00:30:20.590 --> 00:30:23.340
to scale like one over Fisher
information divided by n.
00:30:28.241 --> 00:30:30.240
So here, the-- I'm not
mentioning the constants.
00:30:30.240 --> 00:30:32.160
There must be constants, because
everything is asymptotic.
00:30:32.160 --> 00:30:33.834
So for each finite
n, I'm going to have
00:30:33.834 --> 00:30:35.000
some constants that show up.
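The theorem can be checked numerically. Here is a minimal sketch, assuming a Bernoulli(p) model (my choice, not from the lecture), where the MLE is the sample mean and the Fisher information is I(p) = 1/(p(1-p)), so the theorem predicts asymptotic variance p(1-p)/n:

```python
import numpy as np

# Hedged sketch: for Bernoulli(p), the MLE is the sample mean and the
# Fisher information is I(p) = 1 / (p (1 - p)).  The theorem predicts
# Var(p_hat) ~ 1 / (n I(p)) = p (1 - p) / n for large n.
rng = np.random.default_rng(0)
p, n, reps = 0.3, 1000, 20000

# Draw `reps` samples of size n and compute the MLE on each one.
p_hat = rng.binomial(n, p, size=reps) / n

empirical_var = p_hat.var()
predicted_var = p * (1 - p) / n   # = 1 / (n * I(p))
print(empirical_var, predicted_var)
```

Over many repetitions, the empirical variance of the MLE matches 1/(n I(p)) closely, with no fudge factors, exactly as claimed.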
00:30:39.270 --> 00:30:43.590
Everybody just got their mind
blown by this amazing theorem?
00:30:43.590 --> 00:30:48.090
So I mean, if you think about
it, the MLE can be anything.
00:30:48.090 --> 00:30:50.562
I'm sorry to say to
you, in many instances,
00:30:50.562 --> 00:30:52.770
the MLE is just going to be
an average, which is just
00:30:52.770 --> 00:30:54.660
going to be slightly annoying.
00:30:54.660 --> 00:30:57.370
But there are some
cases where it's not.
00:30:57.370 --> 00:30:59.462
And we have to resort
to this theorem
00:30:59.462 --> 00:31:01.920
rather than actually resorting
to the central limit theorem
00:31:01.920 --> 00:31:03.090
to prove this thing.
00:31:03.090 --> 00:31:05.960
And more importantly, even
if this was an average,
00:31:05.960 --> 00:31:07.710
you don't have to even
know how to compute
00:31:07.710 --> 00:31:09.060
the covariance matrix--
00:31:09.060 --> 00:31:11.320
sorry, the variance
of this thing
00:31:11.320 --> 00:31:14.490
to plug it into the
central limit theorem.
00:31:14.490 --> 00:31:17.220
I'm telling you, it's actually
given by the Fisher information
00:31:17.220 --> 00:31:18.950
matrix.
00:31:18.950 --> 00:31:22.070
So if it's an average,
between you and me,
00:31:22.070 --> 00:31:24.590
you probably want to go the
central limit theorem route
00:31:24.590 --> 00:31:26.700
if you want to prove
this kind of stuff.
00:31:26.700 --> 00:31:28.910
But if it's not, then
that's your best shot.
00:31:28.910 --> 00:31:31.870
But you have to check
those conditions.
00:31:31.870 --> 00:31:38.020
I will give you for
granted the 0.5.
00:31:38.020 --> 00:31:39.780
Ready?
00:31:39.780 --> 00:31:40.410
Any questions?
00:31:40.410 --> 00:31:41.960
We're going to wrap
up this chapter.
00:31:41.960 --> 00:31:43.440
So if you have questions,
that's the time.
00:31:43.440 --> 00:31:43.939
Yes.
00:31:43.939 --> 00:31:45.925
AUDIENCE: What was the
quadratic risk up there?
00:31:45.925 --> 00:31:47.716
PHILIPPE RIGOLLET: You
mean the definition?
00:31:47.716 --> 00:31:49.620
AUDIENCE: No, the--
what it was for this.
00:31:49.620 --> 00:31:51.210
PHILIPPE RIGOLLET: Well,
you see the quadratic risk,
00:31:51.210 --> 00:31:53.070
if I think of it as
being one dimensional,
00:31:53.070 --> 00:31:55.272
the quadratic risk
is the expectation
00:31:55.272 --> 00:31:57.730
of the square of the difference
between theta hat and theta
00:31:57.730 --> 00:31:58.230
star.
00:32:01.010 --> 00:32:05.900
So that means that if I think
of this as having a normal 0, 1,
00:32:05.900 --> 00:32:09.680
that's basically computing
the expectation of the square
00:32:09.680 --> 00:32:13.310
of this Gaussian divided by n.
00:32:13.310 --> 00:32:15.759
I just divided by square
root of n on both sides.
00:32:15.759 --> 00:32:18.050
So it's the expectation of
the square of this Gaussian.
00:32:18.050 --> 00:32:20.383
The Gaussian is mean 0, so
the expectation of the square
00:32:20.383 --> 00:32:23.060
is just a variance.
00:32:23.060 --> 00:32:25.903
And so I'm left with 1 over
the Fisher information divided
00:32:25.903 --> 00:32:26.403
by n.
00:32:26.403 --> 00:32:26.886
AUDIENCE: I see.
00:32:26.886 --> 00:32:27.386
OK.
00:32:34.084 --> 00:32:36.250
PHILIPPE RIGOLLET: So let's
move on to chapter four.
00:32:36.250 --> 00:32:38.190
And this is the
method of moments.
00:32:38.190 --> 00:32:42.000
So the method of moments is
actually maybe a bit older
00:32:42.000 --> 00:32:44.260
than maximum likelihood.
00:32:44.260 --> 00:32:48.490
And maximum likelihood is
dated, say, early 20th century,
00:32:48.490 --> 00:32:50.490
I mean as a systematic
thing, because as I said,
00:32:50.490 --> 00:32:52.323
many of those guys are
going to be averages.
00:32:52.323 --> 00:32:56.010
So finding an average is
probably a little older.
00:32:56.010 --> 00:32:58.380
The method of moments,
there's some really nice uses.
00:32:58.380 --> 00:33:03.679
There's a paper by Pearson in
1904, I believe, or maybe 1894.
00:33:03.679 --> 00:33:04.220
I don't know.
00:33:06.930 --> 00:33:10.860
And this paper, he was
actually studying some species
00:33:10.860 --> 00:33:12.607
of crab in an island,
and he was trying
00:33:12.607 --> 00:33:13.690
to make some measurements.
00:33:13.690 --> 00:33:16.314
That's how he came up with this
model of mixtures of Gaussians,
00:33:16.314 --> 00:33:18.930
because there were actually
two different populations
00:33:18.930 --> 00:33:20.860
in this population of crabs.
00:33:20.860 --> 00:33:23.580
And the way he actually
fitted the parameters
00:33:23.580 --> 00:33:25.530
was by doing the
method of moments,
00:33:25.530 --> 00:33:27.740
except that since there
were a lot of parameters,
00:33:27.740 --> 00:33:33.580
he actually had to basically
solve six equations with six
00:33:33.580 --> 00:33:34.080
unknowns.
00:33:34.080 --> 00:33:35.496
And that was a
complete nightmare.
00:33:35.496 --> 00:33:36.980
And the guy did it by hand.
00:33:36.980 --> 00:33:40.140
And we don't know how
he did it actually.
00:33:40.140 --> 00:33:43.360
But that is pretty impressive.
00:33:43.360 --> 00:33:44.480
So I want to start--
00:33:44.480 --> 00:33:48.150
and this first part
is a little brutal.
00:33:48.150 --> 00:33:52.020
But this is a Course 18 class,
and I do not want to give you--
00:33:52.020 --> 00:33:54.510
So let's all agree that this
course might be slightly more
00:33:54.510 --> 00:33:56.820
challenging than AP statistics.
00:33:56.820 --> 00:34:00.540
And that means that it's
going to be challenging just
00:34:00.540 --> 00:34:01.670
during class.
00:34:01.670 --> 00:34:04.170
I'm not going to ask you about
the Weierstrass Approximation
00:34:04.170 --> 00:34:05.670
Theorem during the exams.
00:34:05.670 --> 00:34:08.219
But what I want is to give
you mathematical motivations
00:34:08.219 --> 00:34:10.130
for what we're doing.
00:34:10.130 --> 00:34:12.480
And I can promise
you that maybe you
00:34:12.480 --> 00:34:17.730
will have a slightly higher body
temperature during the lecture,
00:34:17.730 --> 00:34:20.760
but you will come out
smarter of this class.
00:34:20.760 --> 00:34:24.810
And I'm trying to motivate
you to use mathematical tools
00:34:24.810 --> 00:34:27.989
and show you where interesting
mathematical things that you
00:34:27.989 --> 00:34:31.800
might find dry elsewhere
actually work very beautifully
00:34:31.800 --> 00:34:33.348
in the stats literature.
00:34:33.348 --> 00:34:35.889
And one that we saw was using
Kullback-Leibler divergence as
00:34:35.889 --> 00:34:38.639
a motivation for maximum
likelihood estimation,
00:34:38.639 --> 00:34:39.929
for example.
00:34:39.929 --> 00:34:42.300
So the Weierstrass
Approximation Theorem
00:34:42.300 --> 00:34:45.270
is something that comes
from pure analysis.
00:34:45.270 --> 00:34:49.656
So maybe-- I mean, it took
me a while before I saw that.
00:34:49.656 --> 00:34:51.239
And essentially,
what it's telling you
00:34:51.239 --> 00:34:52.822
is that if you look
at a function that
00:34:52.822 --> 00:34:55.710
is continuous on
an interval a, b--
00:34:55.710 --> 00:34:57.810
on a segment a, b--
00:34:57.810 --> 00:35:02.250
then you can actually
approximate it
00:35:02.250 --> 00:35:05.430
uniformly well by
polynomials as long
00:35:05.430 --> 00:35:06.930
as you're willing
to take the degree
00:35:06.930 --> 00:35:09.000
of these polynomials
large enough.
00:35:09.000 --> 00:35:11.890
So the formal statement
is, for any epsilon,
00:35:11.890 --> 00:35:16.140
there exists a d that depends
on epsilon and a0 to ad--
00:35:16.140 --> 00:35:20.400
so if you insist on having an
accuracy which is 1/10,000,
00:35:20.400 --> 00:35:23.650
maybe you're going to need a
polynomial of degree 100,000,
00:35:23.650 --> 00:35:24.150
who knows.
00:35:24.150 --> 00:35:26.520
It doesn't tell you
anything about this.
00:35:26.520 --> 00:35:28.170
But it's telling you
that at least you
00:35:28.170 --> 00:35:29.850
have only a finite
number of parameters
00:35:29.850 --> 00:35:31.725
to approximate those
functions that typically
00:35:31.725 --> 00:35:35.310
require an infinite number of
parameters to be described.
00:35:35.310 --> 00:35:36.670
So that's actually quite nice.
00:35:36.670 --> 00:35:39.510
And that's the basis
for many things
00:35:39.510 --> 00:35:43.000
and many polynomial
methods typically.
00:35:43.000 --> 00:35:45.540
And so here, it's
uniform, so there's
00:35:45.540 --> 00:35:50.400
this max over x that shows up
that's actually nice as well.
00:35:50.400 --> 00:35:52.200
That's Weierstrass
Approximation Theorem.
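As a quick numerical sanity check (the target function and the degrees are arbitrary choices, and a least-squares Chebyshev fit stands in for the polynomial the theorem promises to exist), the sup-norm error on [0, 1] shrinks as the degree d grows:

```python
import numpy as np

# Illustration of Weierstrass approximation on the segment [0, 1]:
# the sup-norm error of a degree-d polynomial fit to a continuous
# function shrinks as d grows.
f = lambda x: np.exp(np.sin(3 * x))   # any continuous function on [0, 1]
x = np.linspace(0.0, 1.0, 2001)

def sup_error(d):
    # Fit a degree-d polynomial on a dense grid, measure max |f - poly|.
    cheb = np.polynomial.Chebyshev.fit(x, f(x), deg=d, domain=[0.0, 1.0])
    return np.max(np.abs(f(x) - cheb(x)))

errors = [sup_error(d) for d in (2, 5, 10, 20)]
print(errors)
```

The errors decrease rapidly with the degree, but the theorem itself gives no rate: how large d must be for a given epsilon is exactly what it does not tell you.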
00:35:52.200 --> 00:35:54.720
Why is that useful to us?
00:35:54.720 --> 00:35:58.180
Well, in statistics, I
have a sample of X1 to Xn.
00:35:58.180 --> 00:36:00.054
I have, say, a unified
statistical model.
00:36:00.054 --> 00:36:01.470
I'm not always
going to remind you
00:36:01.470 --> 00:36:04.200
that it's identified--
not unified-- identified
00:36:04.200 --> 00:36:05.640
statistical model.
00:36:05.640 --> 00:36:08.550
And I'm going to assume
that it has a density.
00:36:08.550 --> 00:36:10.170
You could think of
it as having a PMF,
00:36:10.170 --> 00:36:13.140
but think of it as having
a density for one second.
00:36:13.140 --> 00:36:16.770
Now, what I want is to
find the distribution.
00:36:16.770 --> 00:36:18.030
I want to find theta.
00:36:18.030 --> 00:36:20.340
And finding theta,
since it's identified
00:36:20.340 --> 00:36:22.590
is equivalent to
finding P theta, which
00:36:22.590 --> 00:36:26.460
is equivalent to finding f
theta, and knowing a function
00:36:26.460 --> 00:36:28.410
is the same--
00:36:28.410 --> 00:36:30.750
knowing a density is the
same as knowing its integral
00:36:30.750 --> 00:36:33.150
against any test function h.
00:36:33.150 --> 00:36:38.589
So that means that if I want
to make sure I know a density--
00:36:38.589 --> 00:36:40.630
if I want to check if two
densities are the same,
00:36:40.630 --> 00:36:42.520
all I have to do is to
compute their integral
00:36:42.520 --> 00:36:46.240
against all bounded
continuous functions.
00:36:46.240 --> 00:36:48.340
You already know
that it would be true
00:36:48.340 --> 00:36:50.530
if I checked for
all functions h.
00:36:50.530 --> 00:36:53.170
But since f is a
density, I can actually
00:36:53.170 --> 00:36:56.140
look only at functions
h that are bounded,
00:36:56.140 --> 00:37:04.360
say, between minus 1 and
1, and that are continuous.
00:37:04.360 --> 00:37:06.140
That's enough.
00:37:06.140 --> 00:37:06.875
Agreed?
00:37:06.875 --> 00:37:08.510
Well, just trust me on this.
00:37:08.510 --> 00:37:11.774
Yes, you have a question?
00:37:11.774 --> 00:37:14.518
AUDIENCE: Why is this--
like, why shouldn't you
00:37:14.518 --> 00:37:17.263
just say that [INAUDIBLE]?
00:37:20.195 --> 00:37:21.820
PHILIPPE RIGOLLET:
Yeah, I can do that.
00:37:21.820 --> 00:37:23.410
I'm just finding
a characterization
00:37:23.410 --> 00:37:25.210
that's going to be
useful for me later on.
00:37:25.210 --> 00:37:26.810
I can find a bunch of them.
00:37:26.810 --> 00:37:28.600
But here, this one is
going to be useful.
00:37:28.600 --> 00:37:32.670
So all I need to say is that if f
theta integrated against h equals
00:37:32.670 --> 00:37:35.700
f theta star against h-- so this
implies that f
00:37:35.700 --> 00:37:38.320
theta is equal to f
theta star-- not everywhere,
00:37:38.320 --> 00:37:41.770
but almost everywhere.
00:37:41.770 --> 00:37:44.635
And that's only true if I
guarantee to you that f theta
00:37:44.635 --> 00:37:46.390
and f theta stars are densities.
00:37:46.390 --> 00:37:49.180
This is not true
for any function.
00:37:49.180 --> 00:37:53.570
So now, that means that, well,
if I wanted to estimate theta
00:37:53.570 --> 00:37:56.880
hat, all I would have to do is
to compute the average, right--
00:37:56.880 --> 00:38:01.110
so this guy here, the integral--
00:38:01.110 --> 00:38:02.480
let me clean up a bit my board.
00:38:22.590 --> 00:38:30.350
So my goal is to find theta
such that, if I look at f theta
00:38:30.350 --> 00:38:34.680
and now I integrate it
against h of x, then
00:38:34.680 --> 00:38:36.540
this gives me the same
thing as if I were
00:38:36.540 --> 00:38:42.860
to do it against f theta star.
00:38:42.860 --> 00:38:45.690
And I want this for any h,
which is continuous and bounded.
00:38:48.390 --> 00:38:51.690
So of course, I don't know
what this quantity is.
00:38:51.690 --> 00:38:53.320
It depends on my
unknown theta star.
00:38:53.320 --> 00:38:54.600
But I have theta from this.
00:38:54.600 --> 00:38:56.100
And I'm going to do the usual--
00:38:56.100 --> 00:38:57.870
the good old statistical
trick, which is,
00:38:57.870 --> 00:39:01.470
well, this I can write as
the expectation with respect
00:39:01.470 --> 00:39:06.090
to P theta star of h of X.
00:39:06.090 --> 00:39:08.950
That's just the integral of
a function against something.
00:39:08.950 --> 00:39:11.250
And so what I can do
is say, well, now I
00:39:11.250 --> 00:39:12.160
don't know this guy.
00:39:12.160 --> 00:39:14.070
But my good old
trick from the book
00:39:14.070 --> 00:39:15.980
is replace expectations
by averages.
00:39:15.980 --> 00:39:16.670
And what I get--
00:39:23.050 --> 00:39:29.190
And that's approximately by
the law of large numbers.
00:39:29.190 --> 00:39:33.950
So if I can actually find
a function f theta such
00:39:33.950 --> 00:39:36.150
that when I integrate
it against h
00:39:36.150 --> 00:39:39.480
it gives me pretty much the
average of the evaluations
00:39:39.480 --> 00:39:42.890
of h over my data
points for all h,
00:39:42.890 --> 00:39:44.575
then that should be
a good candidate.
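A minimal sketch of this "replace expectations by averages" trick (the exponential distribution, the test function, and the sample size are illustrative choices, not from the lecture):

```python
import numpy as np

# Sketch: for a bounded continuous test function h, the sample average
# (1/n) sum h(X_i) approaches the integral of h against the true
# density, by the law of large numbers.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200_000)  # Exp with rate 1/2

h = lambda t: np.exp(-t)      # bounded continuous on [0, infinity)
sample_avg = h(x).mean()      # (1/n) sum h(X_i)

# Closed form: E[h(X)] = integral of (1/2) e^{-x/2} e^{-x} dx = 1/3.
population = 1.0 / 3.0
print(sample_avg, population)
```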
00:39:47.690 --> 00:39:52.040
The problem is that's a
lot of functions to try.
00:39:52.040 --> 00:39:54.500
Even if we reduced that
from all possible functions
00:39:54.500 --> 00:39:56.780
to bounded and
continuous ones, that's
00:39:56.780 --> 00:40:01.490
still a pretty large
infinite number of them.
00:40:01.490 --> 00:40:05.550
And so what we can do is to use
our Weierstrass Approximation
00:40:05.550 --> 00:40:06.050
Theorem.
00:40:06.050 --> 00:40:09.170
And it says, well, maybe I don't
need to test it against all h.
00:40:09.170 --> 00:40:11.987
Maybe the polynomials
are enough for me.
00:40:11.987 --> 00:40:14.570
So what I'm going to do is I'm
going to look only at functions
00:40:14.570 --> 00:40:20.130
h that are of the
form sum of ak--
00:40:20.130 --> 00:40:29.725
so h of x is sum of
ak X to the k-th for k
00:40:29.725 --> 00:40:34.040
equals 0 to d-- only
polynomials of degree d.
00:40:34.040 --> 00:40:37.520
So when I look at the
average of my h's, I'm
00:40:37.520 --> 00:40:40.360
going to get a term
like the first one.
00:40:40.360 --> 00:40:47.485
So the first one here, this guy,
becomes 1/n sum from i equal 1
00:40:47.485 --> 00:40:54.290
to n sum from k equal 0
to d of ak Xi to the k-th.
00:40:54.290 --> 00:40:58.430
That's just the average
of the values of h of Xi.
00:40:58.430 --> 00:41:00.710
And now, what I need
to do is to check
00:41:00.710 --> 00:41:03.590
that it's the same
thing when I integrate
00:41:03.590 --> 00:41:06.640
h of this form as well.
00:41:06.640 --> 00:41:10.880
I want this to hold for all
polynomials of degree d.
00:41:10.880 --> 00:41:12.110
That's still a lot of them.
00:41:12.110 --> 00:41:14.110
There's still an infinite
number of polynomials,
00:41:14.110 --> 00:41:17.870
because there's an infinite
number of numbers a0 to ad
00:41:17.870 --> 00:41:21.340
that describe those polynomials.
00:41:21.340 --> 00:41:23.410
But since those guys
are polynomials,
00:41:23.410 --> 00:41:26.320
it's actually enough for me
to look only at the terms
00:41:26.320 --> 00:41:28.420
of the form X to the k-th--
00:41:28.420 --> 00:41:30.610
no linear combination,
no nothing.
00:41:30.610 --> 00:41:32.290
So actually, it's
enough to look only
00:41:32.290 --> 00:41:40.050
at h of x, which is equal to X
to the k-th for k equal 0 to d.
00:41:43.260 --> 00:41:46.350
And now, how many of
those guys are there?
00:41:46.350 --> 00:41:49.210
Just d plus 1, 0 to d.
00:41:49.210 --> 00:41:51.640
So that's actually a much
easier thing for me to solve.
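A hedged sketch of where this is heading (the Gaussian model and the sample size are illustrative): with d = 2, matching the first two empirical moments already pins down a two-parameter model like N(mu, sigma^2), since m1 = mu and m2 = sigma^2 + mu^2.

```python
import numpy as np

# Method of moments with d = 2 for N(mu, sigma^2):
#   m1 = mu,  m2 = sigma^2 + mu^2
# so mu_hat = m1 and sigma2_hat = m2 - m1^2.
rng = np.random.default_rng(2)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=100_000)

m1 = np.mean(x)        # empirical first moment
m2 = np.mean(x**2)     # empirical second moment

mu_hat = m1
sigma2_hat = m2 - m1**2
print(mu_hat, sigma2_hat)
```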
00:41:54.290 --> 00:42:01.970
Now, this quantity, which is the
integral of f theta against X
00:42:01.970 --> 00:42:05.360
to the k-th-- so that the
expectation of X to the k-th
00:42:05.360 --> 00:42:06.860
here--
00:42:06.860 --> 00:42:12.940
it's called moment of order
k, or k-th moment of P theta.
00:42:12.940 --> 00:42:13.620
That's a moment.
00:42:13.620 --> 00:42:16.210
A moment is just the
expectation of the power.
00:42:16.210 --> 00:42:19.780
The mean is which moment?
00:42:19.780 --> 00:42:21.760
The first moment.
00:42:21.760 --> 00:42:24.720
And variance is not
exactly the second moment.
00:42:24.720 --> 00:42:27.170
It's the second moment minus
the first moment squared.
00:42:29.862 --> 00:42:30.695
That's the variance.
00:42:30.695 --> 00:42:34.691
It's E of X squared
minus the square of E of X.
00:42:34.691 --> 00:42:36.440
So those are things
that you already know.
00:42:36.440 --> 00:42:37.564
And then you can go higher.
00:42:37.564 --> 00:42:40.200
You can go to E of X
cube, E of X blah, blah.
00:42:40.200 --> 00:42:43.030
Here, I say go to
E of X to the d-th.
00:42:43.030 --> 00:42:44.780
Now, as you can see,
this is not something
00:42:44.780 --> 00:42:47.781
you can really put
in action right now,
00:42:47.781 --> 00:42:50.030
because the Weierstrass
Approximation Theorem does not
00:42:50.030 --> 00:42:52.070
tell you what d should be.
00:42:52.070 --> 00:42:54.020
Actually, we totally
lost track of the epsilon
00:42:54.020 --> 00:42:54.978
I was even looking for.
00:42:54.978 --> 00:42:57.300
I just said approximately
equal, approximately equal.
00:42:57.300 --> 00:42:59.300
And so all this thing is
really just motivation.
00:42:59.300 --> 00:43:02.730
But it's essentially
telling you that if you
00:43:02.730 --> 00:43:05.010
go to d large
enough, technically
00:43:05.010 --> 00:43:08.730
you should be able to identify
exactly your distribution up
00:43:08.730 --> 00:43:11.280
to epsilon.
00:43:11.280 --> 00:43:16.210
So I should be pretty good,
if I go to d large enough.
00:43:16.210 --> 00:43:19.190
Now in practice, actually
there should be much
00:43:19.190 --> 00:43:23.120
less than an arbitrarily large d.
00:43:23.120 --> 00:43:25.460
Typically, we are going
to need d which is 1 or 2.
00:43:28.150 --> 00:43:31.720
So there are some limitations
to the Weierstrass Approximation
00:43:31.720 --> 00:43:32.440
Theorem.
00:43:32.440 --> 00:43:33.940
And there's a few.
00:43:33.940 --> 00:43:35.950
The first one is
that it only works
00:43:35.950 --> 00:43:39.850
for continuous functions, which
is not so much of a problem.
00:43:39.850 --> 00:43:42.740
That can be fixed.
00:43:42.740 --> 00:43:44.560
Well, we need bounded
continuous functions.
00:43:44.560 --> 00:43:45.961
It works only on intervals.
00:43:45.961 --> 00:43:47.460
That's annoying,
because we're going
00:43:47.460 --> 00:43:51.080
to have random variables that
are defined beyond intervals.
00:43:51.080 --> 00:43:53.560
So we need something that just
goes beyond the intervals.
00:43:53.560 --> 00:43:55.840
And you can imagine that if
you let your interval be huge,
00:43:55.840 --> 00:43:57.256
it's going to be
very hard for you
00:43:57.256 --> 00:44:00.001
to have a polynomial
approximately [INAUDIBLE] well.
00:44:00.001 --> 00:44:02.500
Things are going to start going
up and down at the boundary,
00:44:02.500 --> 00:44:05.380
and it's going to be very hard.
00:44:05.380 --> 00:44:07.360
And again, as I
said several times,
00:44:07.360 --> 00:44:09.160
it doesn't tell us
what d should be.
00:44:09.160 --> 00:44:11.470
And as statisticians, we're
looking for methods, not
00:44:11.470 --> 00:44:15.910
like principles of existence
of a method that exists.
00:44:15.910 --> 00:44:21.840
So if E is discrete,
I can actually
00:44:21.840 --> 00:44:23.720
get a handle on this d.
00:44:23.720 --> 00:44:26.730
If E is discrete and
actually finite--
00:44:26.730 --> 00:44:29.250
I'm going to actually
look at a finite E,
00:44:29.250 --> 00:44:33.690
meaning that I have a PMF on,
say, r possible values, x1
00:44:33.690 --> 00:44:34.404
to xr.
00:44:34.404 --> 00:44:35.820
My random variable,
capital X, can
00:44:35.820 --> 00:44:37.290
take only r possible values.
00:44:37.290 --> 00:44:41.550
Let's think of them as being
the integer numbers 1 to r.
00:44:41.550 --> 00:44:44.880
That's the number of
success out of r trials
00:44:44.880 --> 00:44:46.590
that I get, for example.
00:44:46.590 --> 00:44:51.640
Binomial rp, that's exactly
something like this.
00:44:51.640 --> 00:44:55.600
So now, clearly this
entire distribution
00:44:55.600 --> 00:45:00.452
is defined by the PMF, which
gives me exactly r numbers.
00:45:00.452 --> 00:45:02.410
So it can completely
describe this distribution
00:45:02.410 --> 00:45:03.850
with r numbers.
00:45:03.850 --> 00:45:08.290
The question is, do I have an
enormous amount of redundancy
00:45:08.290 --> 00:45:12.250
if I try to describe this
distribution using moments?
00:45:12.250 --> 00:45:14.970
It might be that I need--
say, r is equal to 10,
00:45:14.970 --> 00:45:18.080
maybe I have only 10 numbers
to describe this thing,
00:45:18.080 --> 00:45:20.980
but I actually need to compute
moments up to the order of 100
00:45:20.980 --> 00:45:23.500
before I actually recover
entirely the distribution.
00:45:23.500 --> 00:45:25.090
Maybe I need to go infinite.
00:45:25.090 --> 00:45:27.220
Maybe the Weierstrass
Theorem is the only thing
00:45:27.220 --> 00:45:28.420
that actually saves me here.
00:45:28.420 --> 00:45:30.720
And I just cannot
recover it exactly.
00:45:30.720 --> 00:45:33.340
I can go to epsilon if I'm
willing to go to higher
00:45:33.340 --> 00:45:34.611
and higher polynomials.
00:45:34.611 --> 00:45:36.610
Oh, by the way, in the
Weierstrass Approximation
00:45:36.610 --> 00:45:39.190
Theorem, I can promise you
that as epsilon goes to 0,
00:45:39.190 --> 00:45:41.660
d goes to infinity.
00:45:41.660 --> 00:45:46.160
So now, really I don't even
have actually r parameters.
00:45:46.160 --> 00:45:50.070
I have only r minus 1 parameters,
because the last one--
00:45:50.070 --> 00:45:51.500
because they sum up to 1.
00:45:51.500 --> 00:45:53.630
So the last one I can
always get by doing
00:45:53.630 --> 00:45:56.960
1 minus the sum of the
first r minus 1.
00:45:56.960 --> 00:45:58.520
Agreed?
00:45:58.520 --> 00:46:01.100
So each distribution
r numbers is described
00:46:01.100 --> 00:46:04.700
by r minus 1 parameters.
00:46:04.700 --> 00:46:07.025
The question is, can I
use only r minus 1 moments
00:46:07.025 --> 00:46:08.020
to describe this guy?
00:46:12.870 --> 00:46:16.380
This is something called
Gaussian quadrature.
00:46:16.380 --> 00:46:18.930
The Gaussian quadrature
tells you, yes, moments
00:46:18.930 --> 00:46:22.380
are actually a good way to
reparametrize your distribution
00:46:22.380 --> 00:46:24.870
in the sense that if
I give you the moments
00:46:24.870 --> 00:46:27.120
or if I give you the
probability mass function,
00:46:27.120 --> 00:46:29.370
I'm basically giving you
exactly the same information.
00:46:29.370 --> 00:46:32.770
You can recover all the
probabilities from there.
00:46:32.770 --> 00:46:34.930
So here, I'm going
to denote by--
00:46:34.930 --> 00:46:37.870
I'm going to drop the
notation in theta.
00:46:37.870 --> 00:46:38.770
I don't have theta.
00:46:38.770 --> 00:46:41.460
Here, I'm talking about
any generic distribution.
00:46:41.460 --> 00:46:44.950
And so I'm going to
call mk the k-th moment.
00:46:49.080 --> 00:46:54.610
And I have a PMF, this
is really the sum for j
00:46:54.610 --> 00:47:06.690
equals 1 to r of xj to
the k-th times p of xj.
00:47:06.690 --> 00:47:10.450
And this is the PMF.
00:47:10.450 --> 00:47:12.100
So that's my k-th moment.
00:47:12.100 --> 00:47:15.340
So the k-th moment is a linear
combination of the numbers
00:47:15.340 --> 00:47:16.430
that I am interested in.
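As a quick illustration of this formula, here is a minimal sketch computing m_k = sum_j x_j^k p(x_j). The 3-point support and probabilities are hypothetical, made up just for the example:

```python
# Hypothetical 3-point distribution: values x_j with probabilities p(x_j).
xs = [1.0, 2.0, 3.0]
ps = [0.5, 0.25, 0.25]

def moment(k):
    """k-th population moment: sum_j x_j**k * p(x_j)."""
    return sum(x**k * p for x, p in zip(xs, ps))

m1 = moment(1)  # 0.5*1 + 0.25*2 + 0.25*3 = 1.75
m2 = moment(2)  # 0.5*1 + 0.25*4 + 0.25*9 = 3.75
```

Note that moment(0) is always 1: that is the "definition of PMF" equation that shows up below.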
00:47:19.750 --> 00:47:22.195
So that's one equation.
00:47:25.220 --> 00:47:27.350
And I have as many
equations as moments I'm
00:47:27.350 --> 00:47:28.700
actually willing to look at.
00:47:28.700 --> 00:47:34.250
So if I'm looking at 25
moments, I have 25 equations.
00:47:34.250 --> 00:47:36.650
m1 equals blah with
this to the power of 1,
00:47:36.650 --> 00:47:40.020
m2 equals blah with this to
the power of 2, et cetera.
00:47:40.020 --> 00:47:41.850
And then I also
have the equation
00:47:41.850 --> 00:47:51.240
that 1 is equal to the
sum of the p of xj.
00:47:51.240 --> 00:47:55.190
That's just the
definition of PMF.
00:47:55.190 --> 00:47:56.190
So these are 1's.
00:47:56.190 --> 00:47:58.163
They're ugly, but those are 1's.
00:48:00.790 --> 00:48:04.390
So now, this is a system
of linear equations in p,
00:48:04.390 --> 00:48:07.390
and I can actually write it
in its canonical form, which
00:48:07.390 --> 00:48:11.410
is of the form a
matrix of those guys
00:48:11.410 --> 00:48:15.750
times my parameters of interest
is equal to a right hand side.
00:48:15.750 --> 00:48:17.880
The right hand side
is the moments.
00:48:17.880 --> 00:48:20.070
That means, if I
give you the moments,
00:48:20.070 --> 00:48:22.740
can you come back and
find what the PMF is,
00:48:22.740 --> 00:48:24.870
because we know already
from probability
00:48:24.870 --> 00:48:27.390
that the PMF is all I need
to know to fully describe
00:48:27.390 --> 00:48:29.190
my distribution.
00:48:29.190 --> 00:48:32.010
Given the moments,
that's unclear.
00:48:32.010 --> 00:48:37.830
Now, here, I'm going to actually
take exactly r minus 1 moments
00:48:37.830 --> 00:48:39.960
and this extra condition
that the sum of those guys
00:48:39.960 --> 00:48:41.640
should be 1.
00:48:41.640 --> 00:48:45.240
So that gives me r equations
based on r minus 1 moments.
00:48:45.240 --> 00:48:47.260
And how many unknowns do I have?
00:48:47.260 --> 00:48:54.230
Well, I have my r
unknown parameters
00:48:54.230 --> 00:48:59.060
for the PMF, the r
values of the PMF.
00:48:59.060 --> 00:49:02.540
Now, of course, this is
going to play a huge role
00:49:02.540 --> 00:49:06.920
in whether there are many
p's that give me the same moments.
00:49:06.920 --> 00:49:09.620
The goal is to find if there
are several p's that can give me
00:49:09.620 --> 00:49:10.552
the same moments.
00:49:10.552 --> 00:49:13.010
But if there's only one p that
can give me a set of moments,
00:49:13.010 --> 00:49:15.260
that means that I have a
one-to-one correspondence
00:49:15.260 --> 00:49:17.294
between PMF and moments.
00:49:17.294 --> 00:49:18.710
And so if you give
me the moments,
00:49:18.710 --> 00:49:23.074
I can just go back to the PMF.
00:49:23.074 --> 00:49:23.990
Now, how do I go back?
00:49:23.990 --> 00:49:26.310
Well, by inverting this matrix.
00:49:26.310 --> 00:49:28.710
If I multiply this
matrix by its inverse,
00:49:28.710 --> 00:49:32.600
I'm going to get the identity
times the vector of p's equal
00:49:32.600 --> 00:49:36.890
the inverse of the
matrix times the m's.
00:49:36.890 --> 00:49:41.150
So what we want to
do is to say that p
00:49:41.150 --> 00:49:45.350
is equal to the inverse of this
big matrix times the moments
00:49:45.350 --> 00:49:47.190
that you give me.
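The inversion step can be sketched numerically. This is a hypothetical 3-point distribution (assuming NumPy is available); the matrix of powers is exactly the system described above:

```python
import numpy as np

# Hypothetical support points and PMF of a 3-point distribution.
xs = np.array([1.0, 2.0, 3.0])
p_true = np.array([0.5, 0.25, 0.25])

# Rows are x_j^0 (the all-1's row), x_j^1, x_j^2: a Vandermonde system.
V = np.vstack([xs**k for k in range(len(xs))])

# Right-hand side: the order-0 moment (which is 1), then m_1 and m_2.
m = V @ p_true

# Given the moments, solve the linear system to recover the PMF.
p_recovered = np.linalg.solve(V, m)
```

If the solve succeeds, the moments and the PMF carry exactly the same information, which is the point being made here.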
00:49:47.190 --> 00:49:49.380
And if I can actually
talk about the inverse,
00:49:49.380 --> 00:49:52.410
then I have basically
a one-to-one mapping
00:49:52.410 --> 00:49:55.930
between the m's, the
moments, and the matrix.
00:49:55.930 --> 00:49:58.380
So what I need to show is that
this matrix is invertible.
00:49:58.380 --> 00:50:01.350
And we just decided
that the way to check
00:50:01.350 --> 00:50:05.670
if a matrix is invertible is
by computing its determinant.
00:50:05.670 --> 00:50:10.300
Who has computed a
determinant before?
00:50:10.300 --> 00:50:12.820
Who was supposed to compute a
determinant at least once--just
00:50:12.820 --> 00:50:15.004
so I can say, OK, you
know how to do it.
00:50:15.004 --> 00:50:16.670
So you know how to
compute determinants.
00:50:16.670 --> 00:50:19.180
And if you've seen any
determinant in class,
00:50:19.180 --> 00:50:22.660
there's one that shows up in the
exercises that professors love.
00:50:22.660 --> 00:50:25.390
And it's called the
Vandermonde determinant.
00:50:25.390 --> 00:50:26.890
And it's the
determinant of a matrix
00:50:26.890 --> 00:50:28.900
that has a very specific form.
00:50:28.900 --> 00:50:31.780
It looks like-- so there's
basically only r parameters
00:50:31.780 --> 00:50:33.950
to this r by r matrix.
00:50:33.950 --> 00:50:36.340
The first row, or the
first column-- sometimes,
00:50:36.340 --> 00:50:37.810
it's presented like that--
00:50:37.810 --> 00:50:41.551
is this vector where each
entry is to the power of 1.
00:50:41.551 --> 00:50:43.800
And the second one is each
entry is to the power of 2,
00:50:43.800 --> 00:50:46.970
and to the power of 3, and
to the power 4, et cetera.
00:50:46.970 --> 00:50:49.410
So that's exactly what we
have-- x1 to the first, x2
00:50:49.410 --> 00:50:51.460
to the first, all the
way to xr to the first,
00:50:51.460 --> 00:50:53.290
and then same thing
to the power of 2,
00:50:53.290 --> 00:50:54.550
all the way to the last one.
00:50:54.550 --> 00:50:58.270
And I also need to add
the row of all 1's, which
00:50:58.270 --> 00:51:01.210
you can think of those guys are
to the power of 0, if you want.
00:51:01.210 --> 00:51:02.690
So I should really
put it on top,
00:51:02.690 --> 00:51:05.430
if I wanted to have
a nice ordering.
00:51:05.430 --> 00:51:07.290
So that was the
matrix that I had.
00:51:07.290 --> 00:51:09.060
And I'm not asking
you to check it.
00:51:09.060 --> 00:51:10.860
You can prove that by
induction actually,
00:51:10.860 --> 00:51:14.190
typically by doing the
usual let's eliminate
00:51:14.190 --> 00:51:15.810
some rows and columns
type of tricks
00:51:15.810 --> 00:51:17.260
that you do for matrices.
00:51:17.260 --> 00:51:19.197
So you basically start
from the whole matrix.
00:51:19.197 --> 00:51:21.780
And then you move onto a matrix
that has only a single 1 and then
00:51:21.780 --> 00:51:22.770
0's here.
00:51:22.770 --> 00:51:25.827
And then you have Vandermonde
that's just slightly smaller.
00:51:25.827 --> 00:51:26.910
And then you just iterate.
00:51:26.910 --> 00:51:27.894
Yeah.
00:51:27.894 --> 00:51:31.502
AUDIENCE: I feel like there's a
loss to either the supra index,
00:51:31.502 --> 00:51:35.274
or the sub index should have
a k somewhere [INAUDIBLE]..
00:51:38.119 --> 00:51:39.702
[INAUDIBLE] the one
I'm talking about?
00:51:39.702 --> 00:51:41.285
PHILIPPE RIGOLLET:
Yeah, I know, but I
00:51:41.285 --> 00:51:45.280
don't think the answer
to your question is yes.
00:51:45.280 --> 00:51:48.330
So k is the general
index, right?
00:51:48.330 --> 00:51:51.180
So there's no k. k does not
exist. k is just here for me
00:51:51.180 --> 00:51:53.310
to tell me for k equals 1 to r.
00:51:53.310 --> 00:51:56.250
So this is an r by r matrix.
00:51:56.250 --> 00:51:58.290
And so there is no k there.
00:51:58.290 --> 00:52:00.960
So if you wanted
the generic term,
00:52:00.960 --> 00:52:03.980
if I wanted to put 1 in the
middle on the j-th row and k-th
00:52:03.980 --> 00:52:07.740
column, that would be x--
00:52:07.740 --> 00:52:13.410
so j-th row would be x
sub k to the power of j.
00:52:13.410 --> 00:52:15.420
That would be the--
00:52:15.420 --> 00:52:19.090
And so now, this is
basically the sum--
00:52:19.090 --> 00:52:20.630
well, that should
not be strictly--
00:52:20.630 --> 00:52:25.000
So that would be for j
and k between 1 and r.
00:52:25.000 --> 00:52:26.920
So this is the
formula that you get when
00:52:26.920 --> 00:52:29.470
you try to expand this
Vandermonde determinant.
00:52:29.470 --> 00:52:32.110
You have to do it only once when
you're a sophomore typically.
00:52:32.110 --> 00:52:34.000
And then you can just go
on Wikipedia to do it.
00:52:34.000 --> 00:52:34.750
That's what I did.
00:52:34.750 --> 00:52:36.700
I actually made a
mistake copying it.
00:52:36.700 --> 00:52:39.370
The first one should be 1
less than or equal to j.
00:52:39.370 --> 00:52:42.370
And the last one should be
k less than or equal to r.
00:52:42.370 --> 00:52:43.870
And now what you
have is the product
00:52:43.870 --> 00:52:45.520
of the differences of xj and xk.
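That product formula is easy to check numerically on a small example (hypothetical points, assuming NumPy):

```python
import numpy as np
from itertools import combinations

xs = [1.0, 2.0, 4.0]          # distinct support points (made up)
r = len(xs)

# Vandermonde matrix: row j holds the powers 1, x_j, x_j^2, ...
V = np.array([[x**k for k in range(r)] for x in xs])
det_numeric = np.linalg.det(V)

# Product over 1 <= j < k <= r of (x_k - x_j).
det_formula = 1.0
for xj, xk in combinations(xs, 2):
    det_formula *= (xk - xj)

# A repeated support point makes one factor vanish: singular matrix.
det_repeat = np.linalg.det(np.array([[x**k for k in range(3)]
                                     for x in [1.0, 2.0, 2.0]]))
```

Here det_formula is (2-1)(4-1)(4-2) = 6, and the repeated-point determinant is 0, matching the argument that follows.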
00:52:47.204 --> 00:52:48.620
And for this thing
to be non-zero,
00:52:48.620 --> 00:52:51.259
you need all the
terms to be non-zero.
00:52:51.259 --> 00:52:52.800
And for all the
terms to be non-zero,
00:52:52.800 --> 00:52:58.412
you need to have no two xj's
that are identical.
00:52:58.412 --> 00:52:59.870
If all those are
different numbers,
00:52:59.870 --> 00:53:03.094
then this product is going
to be different from 0.
00:53:03.094 --> 00:53:05.010
And those are different
numbers, because those
00:53:05.010 --> 00:53:09.050
are r possible values that
your random variable takes.
00:53:09.050 --> 00:53:11.360
You're not going to
say that it takes two
00:53:11.360 --> 00:53:14.010
with probability 1.5--
00:53:14.010 --> 00:53:18.170
sorry, two with probability 0.5
and two with probability 0.25.
00:53:18.170 --> 00:53:22.370
You're going to say it takes two
with probability 0.75 directly.
00:53:22.370 --> 00:53:24.350
So those xj's are different.
00:53:24.350 --> 00:53:27.510
These are the different values
that your random variable
00:53:27.510 --> 00:53:28.010
can take.
00:53:32.200 --> 00:53:37.404
Remember, xj, xk was just the
different values x1 to xr--
00:53:37.404 --> 00:53:39.820
sorry-- was the different
values that your random variable
00:53:39.820 --> 00:53:41.020
can take.
00:53:41.020 --> 00:53:43.796
Nobody in their right mind
would write twice the same value
00:53:43.796 --> 00:53:44.788
in this list.
00:53:47.450 --> 00:53:49.119
So my Vandermonde is non-zero.
00:53:49.119 --> 00:53:49.910
So I can invert it.
00:53:49.910 --> 00:53:51.493
And I have a one-to-one
correspondence
00:53:51.493 --> 00:53:55.970
between my entire PMF and
the first r minus 1 moments
00:53:55.970 --> 00:54:00.390
to which I append the
number 1, which is really
00:54:00.390 --> 00:54:02.700
the moment of order 0 again.
00:54:02.700 --> 00:54:05.550
It's E of X to the
0-th, which is 1.
00:54:05.550 --> 00:54:10.110
So good news, I only
need r minus 1 parameters
00:54:10.110 --> 00:54:12.030
to describe my
distribution.
00:54:12.030 --> 00:54:14.260
And I can choose either
the values of my PMF.
00:54:14.260 --> 00:54:16.360
Or I can choose the first r
minus 1 moments.
00:54:20.300 --> 00:54:22.580
So the moments
tell me something.
00:54:22.580 --> 00:54:26.450
Here, it tells me that if I
have a discrete distribution
00:54:26.450 --> 00:54:28.160
with r possible
values, I only need
00:54:28.160 --> 00:54:30.200
to compute r minus 1 moments.
00:54:30.200 --> 00:54:34.471
So this is better than the
Weierstrass Approximation
00:54:34.471 --> 00:54:34.970
Theorem.
00:54:34.970 --> 00:54:37.970
This tells me exactly how many
moments I need to consider.
00:54:37.970 --> 00:54:39.410
And this is for
any distribution.
00:54:39.410 --> 00:54:41.100
This is not a
distribution that's
00:54:41.100 --> 00:54:43.790
parametrized by one
parameter, like the Poisson
00:54:43.790 --> 00:54:47.210
or the binomial
or all this stuff.
00:54:47.210 --> 00:54:50.250
This is for any distribution
on a finite number of values.
00:54:50.250 --> 00:54:53.810
So hopefully, if I
reduce the family of PMFs
00:54:53.810 --> 00:54:55.970
that I'm looking at to
a one-parameter family,
00:54:55.970 --> 00:54:58.430
I'm actually going to need
to compute many fewer than r
00:54:58.430 --> 00:55:01.110
minus 1 values.
00:55:01.110 --> 00:55:02.640
But this is actually hopeful.
00:55:02.640 --> 00:55:04.650
It tells you that
the method of moments
00:55:04.650 --> 00:55:06.775
is going to work for
any distribution.
00:55:06.775 --> 00:55:09.417
You just have to invert
a Vandermonde matrix.
00:55:13.220 --> 00:55:17.350
So just the conclusion--
the statistical conclusion--
00:55:17.350 --> 00:55:20.770
is that moments contain
important information
00:55:20.770 --> 00:55:24.880
about the PMF and the PDF.
00:55:24.880 --> 00:55:26.890
If we can estimate these
moments accurately,
00:55:26.890 --> 00:55:30.820
we can solve for the
parameters of the distribution
00:55:30.820 --> 00:55:32.674
and recover the distribution.
00:55:32.674 --> 00:55:34.090
And in a parametric
setting, where
00:55:34.090 --> 00:55:36.970
knowing P theta amounts
to knowing theta, which
00:55:36.970 --> 00:55:41.270
is identifiability--
this is not innocuous--
00:55:41.270 --> 00:55:44.260
it is often the case that
even fewer moments are needed.
00:55:44.260 --> 00:55:46.810
After all, if theta is a
one dimensional parameter,
00:55:46.810 --> 00:55:48.730
I have one parameter
to estimate.
00:55:48.730 --> 00:55:51.370
Why would I go
and get 25 moments
00:55:51.370 --> 00:55:52.870
to get this one parameter?
00:55:52.870 --> 00:55:54.532
Typically, there
is actually-- we
00:55:54.532 --> 00:55:55.990
will see that the
method of moments
00:55:55.990 --> 00:55:58.480
just says if you have a
d dimensional parameter,
00:55:58.480 --> 00:56:02.110
just compute d
moments, and that's it.
00:56:02.110 --> 00:56:04.280
But this is only on
a case-by-case basis.
00:56:04.280 --> 00:56:07.610
I mean, maybe your model will
totally screw up its parameters
00:56:07.610 --> 00:56:09.950
and you actually
need to get them.
00:56:09.950 --> 00:56:16.710
I mean, think about it, if the
function is parameterized just
00:56:16.710 --> 00:56:19.047
by its 27th moment--
00:56:19.047 --> 00:56:21.630
like, that's the only thing that
matters in this distribution,
00:56:21.630 --> 00:56:24.187
I just describe the function,
it's just a density,
00:56:24.187 --> 00:56:26.520
and the only thing that can
change from one distribution
00:56:26.520 --> 00:56:28.484
to another is this 27th moment--
00:56:28.484 --> 00:56:30.900
well, then you're going to
have to go get the 27th moment.
00:56:30.900 --> 00:56:33.780
And that probably means that
your modeling step was actually
00:56:33.780 --> 00:56:34.686
pretty bad.
00:56:37.680 --> 00:56:40.970
So the rule of thumb, if theta
is in Rd, we need d moments.
00:56:46.970 --> 00:56:48.430
So what is the
method of moments?
00:56:52.800 --> 00:56:55.080
That's just a good old trick.
00:56:55.080 --> 00:56:58.380
Replace the expectation
by averages.
00:56:58.380 --> 00:56:59.970
That's the beauty.
00:56:59.970 --> 00:57:02.080
The moments are expectations.
00:57:02.080 --> 00:57:04.710
So let's just replace the
expectations by averages
00:57:04.710 --> 00:57:07.620
and then do it with
the average version,
00:57:07.620 --> 00:57:10.200
as if it was the true one.
00:57:10.200 --> 00:57:14.160
So for example, I'm going to
talk about population moments,
00:57:14.160 --> 00:57:16.357
when I'm computing them
with the true distribution,
00:57:16.357 --> 00:57:18.690
and I'm going to talk about
empirical moments, when
00:57:18.690 --> 00:57:22.290
I talk about averages.
00:57:22.290 --> 00:57:24.690
So those are the two
quantities that I have.
00:57:24.690 --> 00:57:28.430
And now, what I hope
is that there is a map between the two.
00:57:28.430 --> 00:57:30.960
So this is basically--
00:57:30.960 --> 00:57:32.140
everything is here.
00:57:32.140 --> 00:57:33.750
That's where all the money is.
00:57:33.750 --> 00:57:36.960
I'm going to assume there's
a function psi that maps
00:57:36.960 --> 00:57:40.120
my parameters-- let's
say they're in Rd--
00:57:40.120 --> 00:57:42.385
to the set of the
first d moments.
00:57:45.490 --> 00:57:48.040
Well, what I want to do
is to come from this guy
00:57:48.040 --> 00:57:49.070
back to theta.
00:57:49.070 --> 00:57:50.980
So it better be that
this function is--
00:57:54.850 --> 00:57:55.802
invertible.
00:57:55.802 --> 00:57:57.385
I want this function
to be invertible.
00:57:57.385 --> 00:57:59.200
In the Vandermonde
case, this function
00:57:59.200 --> 00:58:03.610
was just a linear function--
multiply a matrix by theta.
00:58:03.610 --> 00:58:06.610
Then inverting a linear function
is inverting the matrix.
00:58:06.610 --> 00:58:08.145
Then this is the same thing.
00:58:08.145 --> 00:58:09.520
So now what I'm
going to assume--
00:58:09.520 --> 00:58:14.470
and that's key for this method
to work-- is that this theta--
00:58:14.470 --> 00:58:16.570
so this function
psi is one to one.
00:58:16.570 --> 00:58:24.360
There's only one theta that
gives me any given set of moments.
00:58:24.360 --> 00:58:26.750
And so if it's one to one, I
can talk about its inverse.
00:58:26.750 --> 00:58:28.750
And so now, I'm going to
be able to define theta
00:58:28.750 --> 00:58:32.330
as the inverse of the moments--
00:58:32.330 --> 00:58:33.620
the reciprocal of the moments.
00:58:33.620 --> 00:58:37.940
And so now, what I get is that
the moment estimator is just
00:58:37.940 --> 00:58:42.140
the thing where rather than
taking the true guys in there,
00:58:42.140 --> 00:58:44.780
I'm actually going to take the
empirical moments in there.
00:58:48.580 --> 00:58:50.530
Before we go any
further, I'd like
00:58:50.530 --> 00:58:53.980
to just go back and tell
you that this is not
00:58:53.980 --> 00:58:56.380
completely free.
00:58:56.380 --> 00:58:58.382
How well-behaved
your function psi
00:58:58.382 --> 00:58:59.590
is going to play a huge role.
00:59:02.490 --> 00:59:05.394
Can somebody tell me what
the typical distance--
00:59:05.394 --> 00:59:06.810
if I have a sample
of size n, what
00:59:06.810 --> 00:59:10.062
is the typical distance between
an average and the expectation?
00:59:12.790 --> 00:59:14.360
What is the typical distance?
00:59:14.360 --> 00:59:18.920
What is the order of magnitude
as a function of n between xn
00:59:18.920 --> 00:59:23.024
bar and its expectation?
00:59:23.024 --> 00:59:25.000
AUDIENCE: 1 over
square root of n.
00:59:25.000 --> 00:59:25.760
PHILIPPE RIGOLLET: 1
over square root n.
00:59:25.760 --> 00:59:28.064
That's what the central limit
theorem tells us, right?
00:59:28.064 --> 00:59:29.480
The central limit
theorem tells us
00:59:29.480 --> 00:59:31.521
that those things are
basically a Gaussian, which
00:59:31.521 --> 00:59:34.490
is of order of 1 divided
by square root of n.
00:59:34.490 --> 00:59:37.670
And so basically, I
start with something
00:59:37.670 --> 00:59:41.530
which is 1 over square root
of n away from the true thing.
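A quick simulation of that 1 over square root of n scaling, using Uniform(0, 1) sample means chosen purely for illustration:

```python
import random

random.seed(1)

def mean_abs_error(n, reps=2000):
    """Average |X_n bar - E[X]| over many repetitions; E[X] = 0.5 here."""
    total = 0.0
    for _ in range(reps):
        xbar = sum(random.random() for _ in range(n)) / n
        total += abs(xbar - 0.5)
    return total / reps

e_small = mean_abs_error(25)
e_large = mean_abs_error(400)   # 16x the sample size

# 1/sqrt(n) scaling predicts the error shrinks by about sqrt(16) = 4.
ratio = e_small / e_large
```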
00:59:41.530 --> 00:59:49.730
Now, if my function psi inverse
is super steep like this--
00:59:49.730 --> 00:59:54.970
that's psi inverse-- then
just small fluctuations, even
00:59:54.970 --> 00:59:57.310
if they're of order
1 square root of n,
00:59:57.310 --> 01:00:04.090
can translate into giant
fluctuations in the y-axis.
01:00:04.090 --> 01:00:06.040
And that's going
to be controlled
01:00:06.040 --> 01:00:09.640
by how steep psi inverse
is, which is the same
01:00:09.640 --> 01:00:14.150
as saying how flat psi is--
01:00:14.150 --> 01:00:15.720
how flat is psi.
01:00:15.720 --> 01:00:20.440
So if you go back to
this Vandermonde inverse,
01:00:20.440 --> 01:00:26.570
what it's telling you is that
if this inverse matrix blows up
01:00:26.570 --> 01:00:29.030
this guy a lot--
01:00:29.030 --> 01:00:32.566
so if I start from a small
fluctuation of this thing
01:00:32.566 --> 01:00:34.190
and then they're
blowing up by applying
01:00:34.190 --> 01:00:36.050
the inverse of
this matrix, things
01:00:36.050 --> 01:00:37.600
are not going to go well.
01:00:37.600 --> 01:00:41.860
Anybody knows what is the number
that I should be looking for?
01:00:41.860 --> 01:00:45.080
So that's from, say,
numerical linear algebra
01:00:45.080 --> 01:00:47.270
numerical methods.
01:00:47.270 --> 01:00:49.244
When I have a system
of linear equations,
01:00:49.244 --> 01:00:50.660
what is the actual
number I should
01:00:50.660 --> 01:00:53.510
be looking at to
know how much I'm
01:00:53.510 --> 01:00:54.950
blowing up the fluctuations?
01:00:54.950 --> 01:00:55.090
Yeah.
01:00:55.090 --> 01:00:55.776
AUDIENCE: Condition number?
01:00:55.776 --> 01:00:57.280
PHILIPPE RIGOLLET: The
condition number, right.
01:00:57.280 --> 01:00:59.600
So what's important here
is the condition number
01:00:59.600 --> 01:01:00.680
of this matrix.
01:01:00.680 --> 01:01:03.715
If the condition number
of this matrix is small,
01:01:03.715 --> 01:01:04.340
then it's good.
01:01:04.340 --> 01:01:05.660
It's not going to blow up much.
01:01:05.660 --> 01:01:07.280
But if the condition
number is very large,
01:01:07.280 --> 01:01:08.720
it's just going
to blow up a lot.
01:01:08.720 --> 01:01:10.310
And the condition
number is the ratio
01:01:10.310 --> 01:01:13.010
of the largest and the
smallest eigenvalues.
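For the Vandermonde system above this is one line in NumPy. The support points here are hypothetical; crowding them together makes the system far worse conditioned:

```python
import numpy as np

def vandermonde(xs):
    """Matrix whose row j holds the powers 1, x_j, x_j^2, ..."""
    r = len(xs)
    return np.array([[x**k for k in range(r)] for x in xs])

# Condition number: ratio of the largest to smallest singular value.
kappa_spread = np.linalg.cond(vandermonde([0.0, 1.0, 2.0]))
kappa_crowded = np.linalg.cond(vandermonde([0.0, 0.01, 0.02]))
```

The crowded points give a condition number several orders of magnitude larger, so the same error in the moments blows up into a much larger error in the recovered PMF.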
01:01:13.010 --> 01:01:14.720
So you'll have to
know what it is.
01:01:14.720 --> 01:01:17.180
But this is how all these
things get together.
01:01:17.180 --> 01:01:21.380
So the numerical
stability translates
01:01:21.380 --> 01:01:24.835
into statistical stability here.
01:01:24.835 --> 01:01:26.684
And numerical
means just if I had
01:01:26.684 --> 01:01:28.350
errors in measuring
the right hand side,
01:01:28.350 --> 01:01:30.360
how much would they translate
into errors on the left hand
01:01:30.360 --> 01:01:31.060
side.
01:01:31.060 --> 01:01:33.976
So the error here is intrinsic
to statistical questions.
01:01:38.610 --> 01:01:42.490
So that's my estimator,
provided that it exists.
01:01:42.490 --> 01:01:45.040
And I said it's a one to
one, so it should exist,
01:01:45.040 --> 01:01:48.520
if I assume that
psi is invertible.
01:01:48.520 --> 01:01:51.627
So how good is this guy?
01:01:51.627 --> 01:01:53.460
That's going to be
definitely our question--
01:01:53.460 --> 01:01:54.860
how good is this thing.
01:01:54.860 --> 01:02:00.560
And as I said, there's chances
that if psi is really steep,
01:02:00.560 --> 01:02:02.800
then it should be
not very good--
01:02:02.800 --> 01:02:06.140
if psi inverse is very steep,
it should not be very good,
01:02:06.140 --> 01:02:07.740
which means that it's--
01:02:07.740 --> 01:02:11.480
well, let's just
leave it to that.
01:02:11.480 --> 01:02:13.010
So that means that
I should probably
01:02:13.010 --> 01:02:16.460
see the derivative of
psi showing up somewhere.
01:02:16.460 --> 01:02:19.626
If the derivative of psi
inverse, say, is very large,
01:02:19.626 --> 01:02:21.500
then I should actually
have a larger variance
01:02:21.500 --> 01:02:22.520
in my estimator.
01:02:22.520 --> 01:02:26.900
So hopefully, just like we
had a theorem that told us
01:02:26.900 --> 01:02:29.390
that the Fisher information
was key in the variance
01:02:29.390 --> 01:02:30.890
of the maximum
likelihood estimator,
01:02:30.890 --> 01:02:32.473
we should have a
theorem that tells us
01:02:32.473 --> 01:02:33.920
that the derivative
of psi inverse
01:02:33.920 --> 01:02:37.313
is going to have a key role
in the method of moments.
01:02:37.313 --> 01:02:38.792
So let's do it.
01:02:57.540 --> 01:03:01.950
So I'm going to talk
to you about matrices.
01:03:01.950 --> 01:03:02.680
So now, I have--
01:03:10.150 --> 01:03:15.080
So since I have to manipulate
d numbers at any given time,
01:03:15.080 --> 01:03:17.610
I'm just going to concatenate
them into a vector.
01:03:17.610 --> 01:03:19.670
So I'm going to call
capital M theta--
01:03:19.670 --> 01:03:24.570
so that's basically
the population moment.
01:03:24.570 --> 01:03:31.320
And I have M hat, which is
just m hat 1 to m hat d.
01:03:31.320 --> 01:03:32.715
And that's my empirical moment.
01:03:39.100 --> 01:03:41.170
And what's going
to play a role is
01:03:41.170 --> 01:03:45.370
what is the variance-covariance
of the random vector.
01:03:45.370 --> 01:03:49.680
So I have this vector 1--
01:03:49.680 --> 01:03:50.440
do I have 1?
01:03:50.440 --> 01:03:51.865
No, I don't have 1.
01:03:59.240 --> 01:04:02.480
So that's a d
dimensional vector.
01:04:02.480 --> 01:04:04.940
And here, I take the
successive powers.
01:04:04.940 --> 01:04:08.780
Remember, that looks very much
like a column of my Vandermonde
01:04:08.780 --> 01:04:10.590
matrix.
01:04:10.590 --> 01:04:12.120
So now, I have
this random vector.
01:04:12.120 --> 01:04:15.570
It's just the successive powers
of some random variable X.
01:04:15.570 --> 01:04:19.480
And the variance-covariance
matrix is the expectation--
01:04:19.480 --> 01:04:20.130
so sigma--
01:04:20.130 --> 01:04:21.695
of theta.
01:04:21.695 --> 01:04:23.820
The theta just means I'm
going to take expectations
01:04:23.820 --> 01:04:26.310
with respect to theta.
01:04:26.310 --> 01:04:28.350
That's the expectation
with respect
01:04:28.350 --> 01:04:31.316
to theta of this
guy times this guy
01:04:31.316 --> 01:04:40.575
transpose minus the same
thing but with the expectation
01:04:40.575 --> 01:04:41.075
inside.
01:04:45.270 --> 01:04:46.760
Why do I do X, X1.
01:04:46.760 --> 01:04:48.070
I have X, X2, X3.
01:04:50.720 --> 01:05:04.384
X, X2, Xd times the
expectation of X, X2, Xd.
01:05:04.384 --> 01:05:05.550
Everybody sees what this is?
01:05:05.550 --> 01:05:11.790
So this is a matrix where if I
look at the ij-th term of this
01:05:11.790 --> 01:05:13.530
matrix--
01:05:13.530 --> 01:05:20.980
or let's say, jk-th term,
so on row j and column k,
01:05:20.980 --> 01:05:26.130
I have sigma jk of theta.
01:05:26.130 --> 01:05:30.960
And it's simply the
expectation of X to the j
01:05:30.960 --> 01:05:40.541
plus k-- well, X to the j times X to the k-- minus
expectation of X to the j times expectation
01:05:40.541 --> 01:05:41.040
of X to the k.
01:05:45.170 --> 01:05:53.970
So I can write this as m j plus
k of theta minus mj of theta
01:05:53.970 --> 01:05:55.080
times mk of theta.
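This identity can be checked entry by entry on a small discrete example (the same hypothetical 3-point PMF style as before):

```python
# Hypothetical 3-point PMF.
xs = [1.0, 2.0, 3.0]
ps = [0.5, 0.25, 0.25]

def m(k):
    """k-th population moment."""
    return sum(x**k * p for x, p in zip(xs, ps))

d = 2
# Entry (j, k) via the moment identity: m_{j+k} - m_j * m_k ...
sigma = [[m(j + k) - m(j) * m(k) for k in range(1, d + 1)]
         for j in range(1, d + 1)]

# ... agrees with the direct definition E[X^j X^k] - E[X^j] E[X^k],
# since X^j * X^k = X^{j+k}.
sigma_direct = [[sum((x**j) * (x**k) * p for x, p in zip(xs, ps))
                 - m(j) * m(k)
                 for k in range(1, d + 1)] for j in range(1, d + 1)]
```

The (1, 1) entry is just the variance m_2 - m_1^2, which is positive as it should be.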
01:06:00.840 --> 01:06:04.400
So that's my covariance matrix
of this particular vector
01:06:04.400 --> 01:06:06.870
that I define.
01:06:06.870 --> 01:06:09.240
And now, I'm going to
assume that psi inverse--
01:06:09.240 --> 01:06:11.070
well, if I want to
talk about the slope
01:06:11.070 --> 01:06:14.060
in an analytic fashion,
I have to assume
01:06:14.060 --> 01:06:16.250
that psi is differentiable.
01:06:16.250 --> 01:06:18.650
And I will talk
about the gradient
01:06:18.650 --> 01:06:20.500
of psi, which is, if
it's one dimensional,
01:06:20.500 --> 01:06:22.340
it's just the derivative.
01:06:22.340 --> 01:06:24.470
And here, that's where
notation becomes annoying.
01:06:24.470 --> 01:06:26.011
And I'm going to
actually just assume
01:06:26.011 --> 01:06:28.310
that-- so now I have a vector.
01:06:28.310 --> 01:06:30.590
But it's a vector
of functions and I
01:06:30.590 --> 01:06:32.840
want to compute those functions
at a particular value.
01:06:32.840 --> 01:06:34.506
And the value I'm
actually interested in
01:06:34.506 --> 01:06:37.010
is at the m of theta parameter.
01:06:37.010 --> 01:06:41.600
So psi inverse goes
from the set of moments
01:06:41.600 --> 01:06:43.710
to the set of parameters.
01:06:43.710 --> 01:06:45.680
So when I look at the
gradient of this guy,
01:06:45.680 --> 01:06:48.740
it should be a function that
takes as inputs moments.
01:06:48.740 --> 01:06:51.700
And where do I want this
function to be evaluated at?
01:06:51.700 --> 01:06:54.352
At the true moment--
01:06:54.352 --> 01:06:58.100
at the population moment vector.
01:06:58.100 --> 01:07:00.860
Just like when I computed
my Fisher information,
01:07:00.860 --> 01:07:04.250
I was computing it at
the true parameter.
01:07:04.250 --> 01:07:08.400
So now, once I
compute this guy--
01:07:08.400 --> 01:07:11.176
so now, why is this a
d by d gradient matrix?
01:07:15.840 --> 01:07:19.920
So I have a gradient vector when
I have a function from rd to r.
01:07:19.920 --> 01:07:22.160
This is the partial derivatives.
01:07:22.160 --> 01:07:25.900
But now, I have a
function from rd to rd.
01:07:25.900 --> 01:07:28.210
So I have to take the
derivative with respect
01:07:28.210 --> 01:07:32.457
to the arrival coordinate
and the departure coordinate.
01:07:35.260 --> 01:07:39.140
And so that's the
gradient matrix.
01:07:39.140 --> 01:07:41.360
And now, I have the
following properties.
01:07:41.360 --> 01:07:46.270
The first one is that
the law of large numbers
01:07:46.270 --> 01:07:52.720
tells me that theta hat is a
weakly or strongly consistent
01:07:52.720 --> 01:07:54.332
estimator.
01:07:54.332 --> 01:07:56.290
So either I use the strong
law of large numbers
01:07:56.290 --> 01:07:57.665
or the weak law
of large numbers,
01:07:57.665 --> 01:08:01.300
and I get strong or
weak consistency.
01:08:01.300 --> 01:08:02.870
So what does that mean?
01:08:02.870 --> 01:08:03.640
Why is that true?
01:08:03.640 --> 01:08:12.470
Well, because now so I
really have the function--
01:08:12.470 --> 01:08:13.930
so what is my estimator?
01:08:13.930 --> 01:08:23.689
Theta hat is psi inverse
of m hat 1 to m hat d.
01:08:23.689 --> 01:08:26.630
Now, by the law
of large numbers,
01:08:26.630 --> 01:08:28.890
let's look only at the weak one.
01:08:28.890 --> 01:08:35.600
Law of large numbers tells
me that each of the mj hat
01:08:35.600 --> 01:08:38.750
is going to converge
in probability as n goes
01:08:38.750 --> 01:08:40.970
to infinity to the-- so
the empirical moments
01:08:40.970 --> 01:08:44.950
converge to the
population moments.
01:08:44.950 --> 01:08:48.189
That's what the good
old trick is using,
01:08:48.189 --> 01:08:49.750
the fact that the
empirical moments
01:08:49.750 --> 01:08:52.760
are close to the true
moments as n becomes larger.
01:08:52.760 --> 01:08:55.390
And that's because, well,
just because the m hat j's
01:08:55.390 --> 01:08:57.160
are averages, and the
law of large numbers
01:08:57.160 --> 01:08:59.229
works for averages.
01:08:59.229 --> 01:09:04.930
So now, plus if I look at my
continuous mapping theorem,
01:09:04.930 --> 01:09:10.700
then I have that psi inverse
is continuously differentiable.
01:09:10.700 --> 01:09:12.279
So it's definitely continuous.
01:09:12.279 --> 01:09:16.510
And so what I have is
that psi inverse of m hat
01:09:16.510 --> 01:09:28.740
1 m hat d converges to psi
inverse m1 to md, which
01:09:28.740 --> 01:09:33.950
is equal to theta star.
01:09:33.950 --> 01:09:35.060
So that's theta star.
01:09:35.060 --> 01:09:37.910
By definition, we assumed that
that was the unique one that
01:09:37.910 --> 01:09:40.189
was actually doing this.
01:09:40.189 --> 01:09:43.109
Again, this is a very
strong assumption.
01:09:43.109 --> 01:09:46.100
I mean, it's basically saying,
if the method of moment works,
01:09:46.100 --> 01:09:47.540
it works.
01:09:47.540 --> 01:09:51.710
So the fact that psi
is one to one
01:09:51.710 --> 01:09:55.280
is really the key to
making this guy work.
01:09:55.280 --> 01:09:57.200
And then I also have a
central limit theorem.
01:09:57.200 --> 01:10:00.140
And the central limit
theorem is basically
01:10:00.140 --> 01:10:04.550
telling me that M hat
is converging to M even
01:10:04.550 --> 01:10:06.040
in the multivariate sense.
01:10:06.040 --> 01:10:09.410
So if I look at the vector of
M hat and the true vector of M,
01:10:09.410 --> 01:10:11.870
then I actually make them go--
01:10:11.870 --> 01:10:14.570
I look at the difference
scaled by square root of n.
01:10:14.570 --> 01:10:15.951
It goes to some Gaussian.
01:10:15.951 --> 01:10:18.200
And usually, we would see--
if it was one dimensional,
01:10:18.200 --> 01:10:19.283
we would see the variance.
01:10:19.283 --> 01:10:22.430
Then we see the
variance-covariance matrix.
01:10:22.430 --> 01:10:25.370
Who has never seen the-- well,
nobody answers this question.
01:10:25.370 --> 01:10:28.200
Who has already seen the
multivariate central limit
01:10:28.200 --> 01:10:28.700
theorem?
01:10:31.339 --> 01:10:33.380
Who has never seen the
multivariate central limit
01:10:33.380 --> 01:10:35.410
theorem?
01:10:35.410 --> 01:10:37.630
So the multivariate
central limit theorem
01:10:37.630 --> 01:10:41.860
is basically just
the slight extension
01:10:41.860 --> 01:10:43.630
of the univariate one.
01:10:43.630 --> 01:10:48.160
It just says that
if I want to think--
01:10:48.160 --> 01:10:51.020
so the univariate one would
tell me something like this--
01:11:05.460 --> 01:11:06.270
and 0.
01:11:06.270 --> 01:11:18.960
And then I would have basically
the variance of X to the j-th.
01:11:18.960 --> 01:11:22.240
So that's what the central
limit theorem tells me.
01:11:22.240 --> 01:11:23.350
This is an average.
01:11:29.350 --> 01:11:31.150
So this is just for averages.
01:11:31.150 --> 01:11:33.190
The central limit
theorem tells me this.
01:11:33.190 --> 01:11:36.560
Just think of X to
the j-th as being y.
01:11:36.560 --> 01:11:37.960
And that would be true.
01:11:37.960 --> 01:11:40.092
Everybody agrees with me?
01:11:40.092 --> 01:11:41.550
So now, this is
actually telling me
01:11:41.550 --> 01:11:45.610
what's happening for all
these guys individually.
01:11:45.610 --> 01:11:48.990
But what happens when those guys
start to correlate together?
01:11:48.990 --> 01:11:51.090
I'd like to know if
they actually correlate
01:11:51.090 --> 01:11:53.010
the same way asymptotically.
01:11:53.010 --> 01:11:56.760
And so if I actually looked
at the covariance matrix
01:11:56.760 --> 01:11:57.465
of this vector--
01:12:03.440 --> 01:12:07.470
so now, I need to look at
a matrix which is d by d--
01:12:07.470 --> 01:12:10.170
then would those univariate
central limit theorems
01:12:10.170 --> 01:12:12.896
tell me--
01:12:12.896 --> 01:12:16.890
so let me write it like
this, double bar.
01:12:16.890 --> 01:12:19.560
So that's just the
covariance matrix.
01:12:19.560 --> 01:12:23.050
This notation, V double bar is
the variance-covariance matrix.
01:12:23.050 --> 01:12:26.010
So what this thing tells
me-- so I know this thing
01:12:26.010 --> 01:12:30.117
is a matrix, d by d.
01:12:30.117 --> 01:12:31.950
Those univariate central
limit theorems only
01:12:31.950 --> 01:12:36.150
give me information
about the diagonal terms.
01:12:36.150 --> 01:12:40.860
But here, I have no idea what
the covariance matrix is.
01:12:40.860 --> 01:12:46.020
This guy is telling me, for
example, that this thing is
01:12:46.020 --> 01:12:49.520
like variance of X to the j-th.
01:12:49.520 --> 01:12:51.860
But what if I want to
find off-diagonal elements
01:12:51.860 --> 01:12:53.130
of this matrix?
01:12:53.130 --> 01:12:55.130
Well, I need to use a
multivariate central limit
01:12:55.130 --> 01:12:56.150
theorem.
01:12:56.150 --> 01:12:58.670
And really what it's telling
me is that you can actually
01:12:58.670 --> 01:13:00.200
replace this guy here--
01:13:10.450 --> 01:13:14.500
so that goes in distribution
to some normal mean 0, again.
01:13:14.500 --> 01:13:17.080
And now, what I
have is just sigma
01:13:17.080 --> 01:13:22.000
of theta, which is just
the covariance matrix
01:13:22.000 --> 01:13:26.696
of this vector X, X squared,
X cubed, all the way to X to the d.
01:13:26.696 --> 01:13:27.514
And that's it.
01:13:27.514 --> 01:13:28.930
So that's a
multivariate Gaussian.
01:13:28.930 --> 01:13:33.040
Who has never seen a
multivariate Gaussian?
01:13:33.040 --> 01:13:35.974
Please, just go on
Wikipedia or something.
01:13:35.974 --> 01:13:37.390
There's not much
to know about it.
01:13:37.390 --> 01:13:40.270
But I don't have time to
redo probability here.
01:13:40.270 --> 01:13:43.350
So we're going to
have to live with it.
01:13:43.350 --> 01:13:46.230
Now, to be fair,
if your goal is not
01:13:46.230 --> 01:13:48.970
to become a
statistical savant, we
01:13:48.970 --> 01:13:52.490
will stick to
univariate questions
01:13:52.490 --> 01:14:01.260
in the scope of
homework and exams.
01:14:01.260 --> 01:14:09.900
So now, what was the
delta method telling me?
01:14:09.900 --> 01:14:13.440
It was telling me that if I had
a central limit theorem that
01:14:13.440 --> 01:14:16.112
told me that theta hat
was going to theta,
01:14:16.112 --> 01:14:17.820
or square root of n
theta hat minus theta
01:14:17.820 --> 01:14:19.530
was going to some
Gaussian, then I
01:14:19.530 --> 01:14:23.220
could look at square root of n times g
of theta hat minus g of theta.
01:14:23.220 --> 01:14:25.110
And this thing was also
going to a Gaussian.
01:14:25.110 --> 01:14:27.030
But what had to
be there is the square
01:14:27.030 --> 01:14:32.700
of the derivative of
g in the variance.
01:14:32.700 --> 01:14:35.190
So the delta method,
it was just a way
01:14:35.190 --> 01:14:38.280
to go from square
root of n theta
01:14:38.280 --> 01:14:46.810
hat n minus theta goes to some
N, say 0, sigma squared, to--
01:14:46.810 --> 01:14:50.410
so delta method was telling
me that this was square root
01:14:50.410 --> 01:14:56.030
of n times g of theta hat n
minus g of theta
01:14:56.030 --> 01:15:01.770
was going in distribution
to N 0, sigma squared
01:15:01.770 --> 01:15:04.200
g prime squared of theta.
01:15:07.210 --> 01:15:09.130
That was the delta method.
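The univariate delta method on the board can also be verified numerically. An editor's toy example, not from the lecture: estimating the rate of an Exponential from the sample mean.

```python
import math
import random
import statistics

# Editor's check of the one-dimensional delta method: X ~ Exponential(lambda)
# with lambda = 2.  The sample mean x_bar estimates m = 1 / lambda, and
# g(m) = 1 / m maps it back to lambda.  Here sigma^2 = Var(X) = 1 / lambda^2
# and g'(m) = -1 / m^2 = -lambda^2, so the delta method predicts
#   sqrt(n) * (g(x_bar) - lambda) ~ N(0, sigma^2 * g'(m)^2) = N(0, lambda^2),
# i.e. an asymptotic variance of 4.
random.seed(2)
lam, n, trials = 2.0, 500, 4000
draws = []
for _ in range(trials):
    x_bar = statistics.fmean(random.expovariate(lam) for _ in range(n))
    draws.append(math.sqrt(n) * (1.0 / x_bar - lam))

print(statistics.variance(draws))   # predicted by the delta method: about 4
```

Note how the variance 1/lambda^2 of the data gets multiplied by the squared derivative lambda^4, exactly the "g prime squared" factor in the statement.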
01:15:09.130 --> 01:15:12.700
Now, here, we have a
function of those guys.
01:15:12.700 --> 01:15:15.580
The central limit theorem,
even the multivariate one,
01:15:15.580 --> 01:15:20.180
is only guaranteeing something
for me regarding the moments.
01:15:20.180 --> 01:15:23.350
But now, I need to map the
moments back into some theta,
01:15:23.350 --> 01:15:26.230
so I have a function
of the moments.
01:15:26.230 --> 01:15:31.950
And there is something
called the multivariate delta
01:15:31.950 --> 01:15:35.310
method, where derivatives
are replaced by gradients.
01:15:35.310 --> 01:15:39.310
Like, they always are in
multivariate calculus.
01:15:39.310 --> 01:15:43.080
And rather than multiplying,
since things do not commute,
01:15:43.080 --> 01:15:46.557
rather than choosing which
side I want to put the square,
01:15:46.557 --> 01:15:49.140
I'm actually just going to take
half of the square on one side
01:15:49.140 --> 01:15:51.810
and the other half of the
square on the other side.
01:15:51.810 --> 01:15:53.790
So the way you
should view this, you
01:15:53.790 --> 01:15:59.160
should think of sigma
squared times g prime squared
01:15:59.160 --> 01:16:02.490
as being g prime of
theta times sigma
01:16:02.490 --> 01:16:06.040
squared times g prime of theta.
01:16:06.040 --> 01:16:08.640
And now, this is
completely symmetric.
01:16:08.640 --> 01:16:14.850
And the multivariate
delta method
01:16:14.850 --> 01:16:20.010
is basically telling you that
you get the gradient here.
01:16:20.010 --> 01:16:21.480
So you start from
something that's
01:16:21.480 --> 01:16:24.100
like that over there, a sigma--
01:16:24.100 --> 01:16:26.280
so that's my sigma squared,
think of sigma squared.
01:16:26.280 --> 01:16:29.040
And then I premultiply by
the gradient and postmultiply
01:16:29.040 --> 01:16:30.514
by the gradient.
01:16:30.514 --> 01:16:31.680
The first one is transposed.
01:16:31.680 --> 01:16:33.620
The second one is not.
01:16:33.620 --> 01:16:36.140
But that's very
straightforward extension.
01:16:36.140 --> 01:16:37.890
You don't even have
to understand it.
01:16:37.890 --> 01:16:41.780
Just think of what would be
the natural generalization.
01:16:41.780 --> 01:16:44.450
Here, by the way,
I wrote explicitly
01:16:44.450 --> 01:16:48.020
what the gradient of a
multivariate function is.
01:16:48.020 --> 01:16:53.930
So that's a function
that goes from Rd to Rk.
01:16:53.930 --> 01:16:56.050
So now, the gradient
is a d by k matrix.
01:16:58.920 --> 01:17:00.504
And so now, for this
guy, we can do it
01:17:00.504 --> 01:17:01.586
for the method of moments.
01:17:01.586 --> 01:17:03.140
And we can see that
basically we're
01:17:03.140 --> 01:17:04.765
going to have this
scaling that depends
01:17:04.765 --> 01:17:08.300
on the gradient of the
inverse of psi, which
01:17:08.300 --> 01:17:08.810
is natural.
01:17:08.810 --> 01:17:13.137
Because if psi is super steep,
if psi inverse is super steep,
01:17:13.137 --> 01:17:14.720
then the gradient
is going to be huge,
01:17:14.720 --> 01:17:17.120
which is going to translate
into having a huge variance
01:17:17.120 --> 01:17:18.203
for the method of moments.
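The sandwich formula gradient-transpose times Sigma times gradient can be worked through on a small case. An editor's example, not from the lecture: estimating sigma squared of a Gaussian from the first two moments, where the answer is known in closed form.

```python
import math
import random
import statistics

# Editor's worked example of the multivariate delta method: estimate sigma^2
# of X ~ N(mu, sigma^2) by g(m1, m2) = m2 - m1^2.  The gradient of g at
# (m1, m2) is (-2 * m1, 1), and the delta method predicts an asymptotic
# variance of grad^T Sigma grad, with Sigma the covariance matrix of (X, X^2).
# For a Gaussian this works out to 2 * sigma^4.
random.seed(3)
mu, sigma = 1.0, 2.0
m1, m2 = mu, mu ** 2 + sigma ** 2

# Entries of Sigma = Cov((X, X^2)) from Gaussian moment formulas.
s11 = sigma ** 2                                  # Var(X)
s12 = 2 * mu * sigma ** 2                         # Cov(X, X^2)
s22 = 4 * mu ** 2 * sigma ** 2 + 2 * sigma ** 4   # Var(X^2)

grad = (-2 * m1, 1.0)
predicted = (grad[0] ** 2 * s11 + 2 * grad[0] * grad[1] * s12
             + grad[1] ** 2 * s22)                # grad^T Sigma grad
print(predicted, 2 * sigma ** 4)                  # both equal 32 here

# Compare with simulation: sqrt(n) * (g(m_hat) - sigma^2).
n, trials = 500, 4000
draws = []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    g_hat = statistics.fmean(x * x for x in xs) - statistics.fmean(xs) ** 2
    draws.append(math.sqrt(n) * (g_hat - sigma ** 2))
print(statistics.variance(draws))                 # should land near 32
```

The cross term with s12 is where premultiplying and postmultiplying by the gradient earns its keep: dropping it would give the wrong variance.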
01:17:21.180 --> 01:17:24.127
So this is actually the end.
01:17:24.127 --> 01:17:26.460
I would like to encourage
you-- and we'll probably do it
01:17:26.460 --> 01:17:27.550
on Thursday just to start.
01:17:27.550 --> 01:17:30.480
But I encourage you to do
it in one dimension,
01:17:30.480 --> 01:17:35.070
so that you know how to
use the method of moments,
01:17:35.070 --> 01:17:37.140
you know how to do
a bunch of things.
01:17:37.140 --> 01:17:40.470
Do it in one dimension and see
how you can check those things.
01:17:40.470 --> 01:17:43.860
So just as a quick comparison,
in terms of the quadratic risk,
01:17:43.860 --> 01:17:46.050
the maximum likelihood
estimator is typically
01:17:46.050 --> 01:17:50.024
more accurate than
the method of moments.
01:17:50.024 --> 01:17:51.440
What people like
to do is, when
01:17:51.440 --> 01:17:54.530
you have a
non-concave likelihood
01:17:54.530 --> 01:17:56.060
function,
01:17:56.060 --> 01:17:58.980
to start with the method of
moments as an initialization
01:17:58.980 --> 01:18:01.680
and then run some algorithm
that optimizes locally
01:18:01.680 --> 01:18:03.710
the likelihood starting
from this point,
01:18:03.710 --> 01:18:05.985
because it's actually
likely to be closer.
01:18:05.985 --> 01:18:07.610
And then the MLE is
going to improve it
01:18:07.610 --> 01:18:12.010
a little bit by pushing the
likelihood a little better.
01:18:12.010 --> 01:18:13.840
So of course, the
maximum likelihood
01:18:13.840 --> 01:18:14.890
is sometimes intractable.
01:18:14.890 --> 01:18:18.440
Whereas, computing
moments is fairly doable.
01:18:18.440 --> 01:18:20.262
If the likelihood is
concave, as I said,
01:18:20.262 --> 01:18:21.720
we can use optimization
algorithms,
01:18:21.720 --> 01:18:24.020
such as interior-point
methods or gradient descent,
01:18:24.020 --> 01:18:25.154
I guess, to maximize it.
01:18:25.154 --> 01:18:26.695
And if the likelihood
is non-concave,
01:18:26.695 --> 01:18:28.240
we only have local heuristics.
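The concave case just mentioned can be illustrated with a minimal gradient ascent. An editor's sketch, not from the lecture: the Poisson model, whose log-likelihood is concave and whose maximizer is known to be the sample mean, so the ascent can be checked against it.

```python
import math
import random

# Editor's sketch: for Poisson data the log-likelihood
# l(lambda) = sum(x) * log(lambda) - n * lambda + const is concave in lambda,
# so plain gradient ascent converges to the global maximizer -- which for the
# Poisson model is the sample mean.
random.seed(4)

def poisson(lam):
    """Crude Poisson sampler (Knuth's method), stdlib only."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

xs = [poisson(5.0) for _ in range(2000)]
x_bar = sum(xs) / len(xs)

lam = 1.0                         # arbitrary starting point
for _ in range(2000):
    grad = x_bar / lam - 1.0      # derivative of the average log-likelihood
    lam += 0.05 * grad            # gradient ascent step
print(lam, x_bar)                 # the two should essentially coincide
```

With concavity, the starting point does not matter; that is exactly what is lost in the non-concave case discussed next.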
01:18:28.240 --> 01:18:29.920
And that's what I meant--
01:18:29.920 --> 01:18:31.440
you have only local maxima.
01:18:31.440 --> 01:18:32.860
And one trick you can do--
01:18:32.860 --> 01:18:37.880
so your likelihood
looks like this,
01:18:37.880 --> 01:18:42.140
and it might be the case that if
you have a lot of those peaks,
01:18:42.140 --> 01:18:44.810
you basically have to start
your algorithm in each
01:18:44.810 --> 01:18:46.270
of those peaks.
01:18:46.270 --> 01:18:48.530
But the method of
moments can actually
01:18:48.530 --> 01:18:50.510
start you in the right
peak, and then you
01:18:50.510 --> 01:18:53.300
just move up by doing
some local algorithm
01:18:53.300 --> 01:18:55.040
for maximum likelihood.
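The initialization trick described above can be sketched end to end. An editor's example, not from the lecture: fitting a Gamma, where the moment equations invert in closed form and a crude local search then pushes the likelihood up from that starting point.

```python
import math
import random
import statistics

# Editor's sketch: fit Gamma(k, theta) by starting a local likelihood search
# at the method-of-moments estimate.  For the Gamma, mean = k * theta and
# var = k * theta^2, so the moment estimators are
#   theta_hat = var / mean,   k_hat = mean^2 / var.
random.seed(5)
xs = [random.gammavariate(3.0, 2.0) for _ in range(5000)]

def log_lik(k, theta):
    """Gamma log-likelihood of the sample xs at shape k, scale theta."""
    return (sum((k - 1) * math.log(x) - x / theta for x in xs)
            - len(xs) * (k * math.log(theta) + math.lgamma(k)))

mean, var = statistics.fmean(xs), statistics.variance(xs)
k, theta = mean ** 2 / var, var / mean    # method-of-moments initialization
start_ll = log_lik(k, theta)

# Crude local hill climb on the likelihood, starting from the MoM point.
step = 0.1
while step > 1e-4:
    improved = False
    for dk, dt in ((step, 0), (-step, 0), (0, step), (0, -step)):
        if (k + dk > 0 and theta + dt > 0
                and log_lik(k + dk, theta + dt) > log_lik(k, theta)):
            k, theta = k + dk, theta + dt
            improved = True
    if not improved:
        step /= 2                          # shrink the step when stuck

print(k, theta, log_lik(k, theta) - start_ll)
```

The local search only ever moves uphill, so the moment estimate decides which peak it climbs; that is the division of labor described in the lecture.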
01:18:55.040 --> 01:18:56.180
So that's not key.
01:18:56.180 --> 01:18:59.330
But that's just if you want
to think about algorithmically
01:18:59.330 --> 01:19:03.470
how I would end up doing this
and how I can combine the two.
01:19:03.470 --> 01:19:04.970
So I'll see you on Thursday.
01:19:04.970 --> 01:19:06.820
Thank you.