WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:16.680
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.680 --> 00:00:17.880
at ocw.mit.edu.
00:00:20.540 --> 00:00:22.070
PHILIPPE RIGOLLET: --124.
00:00:22.070 --> 00:00:24.350
If I were to repeat
this 1,000 times,
00:00:24.350 --> 00:00:26.390
so every one of
those 1,000 times
00:00:26.390 --> 00:00:29.150
they collect 124
data points and then
00:00:29.150 --> 00:00:31.430
I'd do it again and
do it again and again,
00:00:31.430 --> 00:00:34.070
then on average, the
number I should get
00:00:34.070 --> 00:00:37.222
should be close to the true
parameter that I'm looking for.
00:00:37.222 --> 00:00:38.930
The fluctuations that
are due to the fact
00:00:38.930 --> 00:00:40.850
that I get different
samples every time
00:00:40.850 --> 00:00:42.190
should somewhat vanish.
00:00:42.190 --> 00:00:46.310
And so what I want is to have a
small bias, hopefully a 0 bias.
00:00:46.310 --> 00:00:50.030
If this thing is 0, then we say
that the estimator is unbiased.
00:01:06.470 --> 00:01:08.250
So this is definitely
a property that we
00:01:08.250 --> 00:01:10.100
are going to be looking
for in an estimator,
00:01:10.100 --> 00:01:11.570
trying to find them
to be unbiased.
00:01:11.570 --> 00:01:14.060
But we'll see that it's
actually maybe not enough.
00:01:14.060 --> 00:01:16.040
So unbiasedness should
not be something
00:01:16.040 --> 00:01:18.140
you lose your sleep over.
00:01:18.140 --> 00:01:21.650
Something that's slightly
better is the risk, really
00:01:21.650 --> 00:01:33.050
the quadratic risk,
which is expectation of--
00:01:33.050 --> 00:01:35.400
so if I have an
estimator, theta hat,
00:01:35.400 --> 00:01:38.360
I'm going to look at the
expectation of theta hat n
00:01:38.360 --> 00:01:41.710
minus theta squared.
00:01:41.710 --> 00:01:44.130
And what we showed last time
is that we can actually--
00:01:44.130 --> 00:01:46.800
by inserting in there,
adding and removing
00:01:46.800 --> 00:01:49.374
the expectation of
theta hat, we actually
00:01:49.374 --> 00:01:50.790
get something where
this thing can
00:01:50.790 --> 00:01:59.160
be decomposed as the square
of the bias plus the variance,
00:01:59.160 --> 00:02:04.160
which is just the expectation of
theta hat minus its expectation
00:02:04.160 --> 00:02:06.986
squared.
00:02:06.986 --> 00:02:08.360
That came from
the fact that when
00:02:08.360 --> 00:02:10.669
I added and removed the
expectation of theta hat
00:02:10.669 --> 00:02:13.731
in there, the
cross-terms cancel.
00:02:13.731 --> 00:02:14.230
All right.
00:02:14.230 --> 00:02:19.556
So that was the bias squared,
and this is the variance.
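Written out, the add-and-remove step can be sketched as follows. The notation $\hat\theta_n$ for "theta hat n" is an assumption about what was on the board; the algebra is the standard decomposition.

```latex
\begin{align*}
\mathbb{E}\big[(\hat\theta_n - \theta)^2\big]
&= \mathbb{E}\big[(\hat\theta_n - \mathbb{E}[\hat\theta_n]
   + \mathbb{E}[\hat\theta_n] - \theta)^2\big] \\
&= \mathbb{E}\big[(\hat\theta_n - \mathbb{E}[\hat\theta_n])^2\big]
 + 2\,(\mathbb{E}[\hat\theta_n] - \theta)\,
   \mathbb{E}\big[\hat\theta_n - \mathbb{E}[\hat\theta_n]\big]
 + (\mathbb{E}[\hat\theta_n] - \theta)^2 \\
&= \operatorname{var}(\hat\theta_n) + \operatorname{bias}(\hat\theta_n)^2 .
\end{align*}
```

The middle term is the cross-term. It cancels because $\mathbb{E}[\hat\theta_n - \mathbb{E}[\hat\theta_n]] = 0$ and the factor $\mathbb{E}[\hat\theta_n] - \theta$ is deterministic.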
00:02:25.410 --> 00:02:29.550
And so for example, if the
quadratic risk goes to 0,
00:02:29.550 --> 00:02:31.470
then that means that
theta hat converges
00:02:31.470 --> 00:02:34.570
to theta in the L2 sense.
00:02:34.570 --> 00:02:38.050
And here we know that if
we want this to go to 0,
00:02:38.050 --> 00:02:40.260
since it's the sum of
two positive terms,
00:02:40.260 --> 00:02:42.877
we need to have both
the bias that goes to 0
00:02:42.877 --> 00:02:44.460
and the variance
that goes to 0, so we
00:02:44.460 --> 00:02:46.470
need to control both
of those things.
00:02:46.470 --> 00:02:49.230
And so there is usually
an inherent trade-off
00:02:49.230 --> 00:02:53.550
between getting a small bias
and getting a small variance.
00:02:53.550 --> 00:02:56.220
If you reduce one too much, then
the variance of the other one
00:02:56.220 --> 00:02:57.030
is going to--
00:02:57.030 --> 00:02:59.470
then the other one is going
to increase, or the opposite.
00:02:59.470 --> 00:03:03.360
That happens a lot, but not so
much, actually, in this class.
00:03:03.360 --> 00:03:07.113
So let's just look at
a couple of examples.
00:03:07.113 --> 00:03:10.541
So am I planning--
00:03:10.541 --> 00:03:11.041
yeah.
00:03:11.041 --> 00:03:19.040
So examples.
00:03:19.040 --> 00:03:26.164
So if I do, for example, X1, ...,
Xn, they are iid Bernoulli.
00:03:26.164 --> 00:03:27.580
And I'm going to
write it theta so
00:03:27.580 --> 00:03:29.440
that we keep the same notation.
00:03:29.440 --> 00:03:32.360
Then theta hat, what
is the theta hat
00:03:32.360 --> 00:03:33.860
that we proposed many times?
00:03:33.860 --> 00:03:38.530
It's just X bar, Xn bar,
the average of Xi's.
00:03:38.530 --> 00:03:40.990
So what is the bias of this guy?
00:03:40.990 --> 00:03:44.340
Well, to know the bias, I
just have to remove theta
00:03:44.340 --> 00:03:46.080
from the expectation.
00:03:46.080 --> 00:03:49.330
What is the
expectation of Xn bar?
00:03:49.330 --> 00:03:51.300
Well, by linearity
of the expectation,
00:03:51.300 --> 00:03:53.600
it's just the average
of the expectations.
00:03:57.950 --> 00:04:00.550
But since all my Xi's are
Bernoulli with the same theta,
00:04:00.550 --> 00:04:03.740
then each of this guy is
actually equal to theta.
00:04:03.740 --> 00:04:06.260
So this thing is actually
theta, which means
00:04:06.260 --> 00:04:07.520
that this isn't biased, right?
00:04:14.660 --> 00:04:16.320
Now, what is the
variance of this guy?
00:04:22.440 --> 00:04:27.774
So if you forgot the
properties of the variance
00:04:27.774 --> 00:04:29.440
for sum of independent
random variables,
00:04:29.440 --> 00:04:30.580
now it's time to wake up.
00:04:30.580 --> 00:04:34.060
So we have the
variance of something
00:04:34.060 --> 00:04:38.830
that looks like 1 over n, the
sum from i equal 1 to n of Xi.
00:04:41.460 --> 00:04:45.060
So it's of the form
variance of a constant times
00:04:45.060 --> 00:04:46.230
a random variable.
00:04:46.230 --> 00:04:49.140
So the first thing I'm going
to do is pull out the constant.
00:04:49.140 --> 00:04:52.180
But we know that the variance
lives on the squared scale,
00:04:52.180 --> 00:04:54.450
so when I pull out a constant
outside of the variance,
00:04:54.450 --> 00:04:56.416
it comes out with a square.
00:04:56.416 --> 00:04:59.210
The variance of a
times X is a-squared
00:04:59.210 --> 00:05:02.550
times the variance of
X, so this is equal to 1
00:05:02.550 --> 00:05:06.090
over n squared times
the variance of the sum.
00:05:10.580 --> 00:05:13.790
So now we want to always
do what we want to do.
00:05:13.790 --> 00:05:16.060
So we have the
variance of the sum.
00:05:16.060 --> 00:05:17.740
We would like somehow
to say that this
00:05:17.740 --> 00:05:19.640
is the sum of the variances.
00:05:19.640 --> 00:05:22.320
And in general, we are
not allowed to say that,
00:05:22.320 --> 00:05:26.500
but we are because my Xi's
are actually independent.
00:05:26.500 --> 00:05:30.660
So this is actually equal to 1
over n squared sum from i equal
00:05:30.660 --> 00:05:36.320
1 to n of the variance
of each of the Xi's.
00:05:36.320 --> 00:05:42.760
And that's by independence,
so this is basic probability.
00:05:42.760 --> 00:05:45.210
And now, what is the variance
of Xi's where again they're
00:05:45.210 --> 00:05:47.210
all the same distribution,
so the variance of Xi
00:05:47.210 --> 00:05:49.400
is the same as the
variance of X1.
00:05:49.400 --> 00:05:51.560
And so each of those
guys has variance what?
00:05:51.560 --> 00:05:53.060
What is the variance
of a Bernoulli?
00:05:53.060 --> 00:05:54.080
We've said it once.
00:05:54.080 --> 00:05:55.770
It's theta times 1 minus theta.
00:06:00.040 --> 00:06:03.390
And so now I'm going to have
the sum of n times a constant,
00:06:03.390 --> 00:06:05.730
so I get n times the constant
divided by n squared,
00:06:05.730 --> 00:06:07.720
so one of the n's
is going to cancel.
00:06:07.720 --> 00:06:10.210
And so the whole
thing here is actually
00:06:10.210 --> 00:06:15.110
equal to theta, 1 minus
theta divided by n.
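If you want to check this numerically, here is a minimal simulation sketch. It is not from the lecture: the function names, the seed, and the number of repetitions are all illustrative assumptions. It repeats the experiment many times, as in the "repeat this 1,000 times" picture, and compares the empirical bias and variance of Xn bar with 0 and theta, 1 minus theta divided by n.

```python
import random
import statistics


def sample_mean_bernoulli(n, theta, rng):
    """Draw n iid Bernoulli(theta) values and return their average."""
    return sum(1 if rng.random() < theta else 0 for _ in range(n)) / n


def simulate_bias_and_variance(n, theta, repetitions=5000, seed=0):
    """Repeat the experiment and return (empirical bias, empirical variance).

    `repetitions` and `seed` are arbitrary choices made for this sketch.
    """
    rng = random.Random(seed)
    estimates = [sample_mean_bernoulli(n, theta, rng) for _ in range(repetitions)]
    bias = statistics.fmean(estimates) - theta
    variance = statistics.pvariance(estimates)
    return bias, variance


def theoretical_variance(n, theta):
    """Variance of the sample mean of n iid Bernoulli(theta) variables."""
    return theta * (1 - theta) / n


if __name__ == "__main__":
    bias, variance = simulate_bias_and_variance(n=124, theta=0.3)
    print("empirical bias:", bias)
    print("empirical variance:", variance)
    print("theta(1 - theta)/n:", theoretical_variance(124, 0.3))
```

The empirical bias should be close to 0 and the empirical variance close to the formula, up to simulation noise.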
00:06:18.500 --> 00:06:20.336
So if I'm interested
in the quadratic risk--
00:06:27.434 --> 00:06:28.850
and again, I should
just say risk,
00:06:28.850 --> 00:06:30.910
because this is the
only risk we're going
00:06:30.910 --> 00:06:32.010
to be actually looking at.
00:06:32.010 --> 00:06:32.510
Yeah.
00:06:32.510 --> 00:06:34.230
This parenthesis should
really stop here.
00:06:38.000 --> 00:06:41.250
I really wanted to put
quadratic in parenthesis.
00:06:41.250 --> 00:06:43.350
So the risk of this guy is what?
00:06:43.350 --> 00:06:50.890
Well, it's the expectation of
x bar n minus theta squared.
00:06:50.890 --> 00:06:54.849
And we know it's the
square of the variance,
00:06:54.849 --> 00:06:56.390
so it's the square
of the bias, which
00:06:56.390 --> 00:07:00.110
we know is 0, so it's 0 squared
plus the variance, which
00:07:00.110 --> 00:07:03.430
is theta, 1 plus theta--
00:07:03.430 --> 00:07:07.600
1 minus theta divided by n.
00:07:07.600 --> 00:07:14.620
So it's just theta, 1
minus theta divided by n.
00:07:14.620 --> 00:07:17.512
So this is just summarizing the
performance of an estimator,
00:07:17.512 --> 00:07:18.720
which is a random variable.
00:07:18.720 --> 00:07:19.761
I mean, it's complicated.
00:07:19.761 --> 00:07:22.310
If I really wanted
to describe it,
00:07:22.310 --> 00:07:25.400
I would just tell you
the entire distribution
00:07:25.400 --> 00:07:27.319
of this random variable.
00:07:27.319 --> 00:07:29.110
But now what I'm doing
is I'm saying, well,
00:07:29.110 --> 00:07:32.710
let's just take this random
variable, remove theta from it,
00:07:32.710 --> 00:07:36.950
and see how small the
fluctuations around theta--
00:07:36.950 --> 00:07:41.120
the squared fluctuations around
theta are in expectation.
00:07:41.120 --> 00:07:43.140
So that's what the
quadratic risk is doing.
00:07:43.140 --> 00:07:44.550
And in a way, this
decomposition,
00:07:44.550 --> 00:07:46.508
as the sum of the bias
square and the variance,
00:07:46.508 --> 00:07:47.840
is really telling you that--
00:07:47.840 --> 00:07:50.489
it is really accounting for
the bias, which is, well,
00:07:50.489 --> 00:07:52.530
even if I had an infinite
amount of observations,
00:07:52.530 --> 00:07:54.166
is this thing doing
the right thing?
00:07:54.166 --> 00:07:56.040
And the other thing is
actually the variance,
00:07:56.040 --> 00:07:57.581
so for finite number
of observations,
00:07:57.581 --> 00:07:59.680
what are the fluctuations?
00:07:59.680 --> 00:08:00.210
All right.
00:08:00.210 --> 00:08:02.459
Then you can see that those
things, bias and variance,
00:08:02.459 --> 00:08:05.740
are actually very different.
00:08:05.740 --> 00:08:08.220
So I don't have any
colors here, so you're
00:08:08.220 --> 00:08:12.360
going to have to really
follow the speed--
00:08:12.360 --> 00:08:14.380
the order in which
I draw those curves.
00:08:14.380 --> 00:08:14.880
All right.
00:08:14.880 --> 00:08:15.720
So let's find--
00:08:15.720 --> 00:08:19.530
I'm going to give you three
candidate estimators, so--
00:08:29.980 --> 00:08:31.290
estimators for theta.
00:08:35.350 --> 00:08:38.230
So the first one is
definitely Xn bar.
00:08:38.230 --> 00:08:40.900
That will be a good
candidate estimator.
00:08:40.900 --> 00:08:45.070
The second one is going to
be 0.5, because after all,
00:08:45.070 --> 00:08:47.260
why should I bother if
it's actually going to be--
00:08:47.260 --> 00:08:47.759
right?
00:08:47.759 --> 00:08:51.760
So for example, if
I ask you to predict
00:08:51.760 --> 00:08:54.510
the score of some
candidate in some election,
00:08:54.510 --> 00:08:57.472
then since you know it's
going to be very close to 0.5,
00:08:57.472 --> 00:08:59.680
you might as well just throw
0.5 and you're not going
00:08:59.680 --> 00:09:00.880
to be very far from reality.
00:09:00.880 --> 00:09:02.945
And it's actually going
to cost you 0 time and $0
00:09:02.945 --> 00:09:03.820
to come up with that.
00:09:03.820 --> 00:09:06.460
So sometimes maybe
just a good old guess
00:09:06.460 --> 00:09:08.830
is actually doing
the job for you.
00:09:08.830 --> 00:09:10.990
Of course, for
presidential elections
00:09:10.990 --> 00:09:12.890
or something like this,
it's not very helpful
00:09:12.890 --> 00:09:14.514
if your prediction
is telling you this.
00:09:14.514 --> 00:09:17.170
But if it was
something different,
00:09:17.170 --> 00:09:21.112
that would be a good way to
generate something close to 1/2.
00:09:21.112 --> 00:09:23.895
For a coin, for example,
if I give you a coin,
00:09:23.895 --> 00:09:24.520
you never know.
00:09:24.520 --> 00:09:25.720
Maybe it's slightly biased.
00:09:25.720 --> 00:09:27.970
But the good guess, just
looking at it, inspecting it,
00:09:27.970 --> 00:09:29.560
maybe there's something
crazy happening
00:09:29.560 --> 00:09:31.143
with the structure
of it, you're going
00:09:31.143 --> 00:09:34.522
to guess that it's 0.5 without
trying to collect information.
00:09:34.522 --> 00:09:36.730
And let's find another one,
which is, well, you know,
00:09:36.730 --> 00:09:38.950
I have a lot of observations.
00:09:38.950 --> 00:09:43.810
But I'm recording couples
kissing, but I'm on a budget.
00:09:43.810 --> 00:09:46.120
I don't have time to
travel all around the world
00:09:46.120 --> 00:09:47.122
and collect some people.
00:09:47.122 --> 00:09:49.330
So really, I'm just going
to look at the first couple
00:09:49.330 --> 00:09:49.840
and go home.
00:09:49.840 --> 00:09:53.410
So my other estimator
is just going to be X1.
00:09:53.410 --> 00:09:55.710
I just take the first
observation, 0, 1,
00:09:55.710 --> 00:09:57.110
and that's it.
00:09:57.110 --> 00:09:57.860
So now I'm going--
00:09:57.860 --> 00:10:01.080
I want to actually understand
what the behavior of those guys
00:10:01.080 --> 00:10:01.900
is.
00:10:01.900 --> 00:10:02.400
All right.
00:10:02.400 --> 00:10:09.240
So we know-- and so we know
that for this guy, the bias is 0
00:10:09.240 --> 00:10:14.280
and the variance
is equal to theta,
00:10:14.280 --> 00:10:19.610
1 minus theta divided by n.
00:10:19.610 --> 00:10:22.980
What is the bias
of this guy, 0.5?
00:10:28.100 --> 00:10:29.917
AUDIENCE: 0.5.
00:10:29.917 --> 00:10:31.000
AUDIENCE: 0.5 minus theta?
00:10:31.000 --> 00:10:32.749
PHILIPPE RIGOLLET: 0.5
minus theta, right.
00:10:35.360 --> 00:10:39.136
So the bias, 0.5 minus theta.
00:10:39.136 --> 00:10:40.510
What is the variance
of this guy?
00:10:44.670 --> 00:10:46.702
What is the variance of 0.5?
00:10:46.702 --> 00:10:47.410
AUDIENCE: It's 0.
00:10:47.410 --> 00:10:48.285
PHILIPPE RIGOLLET: 0.
00:10:48.285 --> 00:10:49.095
Right.
00:10:49.095 --> 00:10:50.470
It's just a
deterministic number,
00:10:50.470 --> 00:10:53.110
so there's no
fluctuations for this guy.
00:10:53.110 --> 00:10:54.190
What is the bias?
00:10:54.190 --> 00:10:56.590
Well, X1 is actually--
00:10:56.590 --> 00:10:58.210
just for simplicity,
I can think of it
00:10:58.210 --> 00:11:00.820
as being X1 bar, the
average of itself,
00:11:00.820 --> 00:11:03.940
so that wherever I saw an n for
this guy, I can replace it by 1
00:11:03.940 --> 00:11:05.690
and that will give
me my formula.
00:11:05.690 --> 00:11:07.390
So the bias is
still going to be 0.
00:11:07.390 --> 00:11:10.190
And the variance is going to be
equal to theta, 1 minus theta.
00:11:13.270 --> 00:11:16.180
So now I have those
three estimators.
00:11:16.180 --> 00:11:19.660
Well, if I compare
X1 and Xn bar, then
00:11:19.660 --> 00:11:22.480
clearly I have 0
bias in both cases.
00:11:22.480 --> 00:11:23.740
That's good.
00:11:23.740 --> 00:11:27.220
And I have the variance that's
actually n times smaller when I
00:11:27.220 --> 00:11:29.556
use my n observations
than when I don't.
00:11:29.556 --> 00:11:31.180
So those two guys,
on these two fronts,
00:11:31.180 --> 00:11:32.846
you can actually look
at the two numbers
00:11:32.846 --> 00:11:35.264
and say, well, the first
number is the same.
00:11:35.264 --> 00:11:37.180
The second number is
better for the other guy,
00:11:37.180 --> 00:11:40.550
so I will definitely go for
this guy compared to this guy.
00:11:40.550 --> 00:11:42.460
So this guy is gone.
00:11:42.460 --> 00:11:43.790
But not this guy.
00:11:43.790 --> 00:11:47.080
Well, if I look at the
variance, the variance is 0.
00:11:47.080 --> 00:11:49.480
It's always beating the
variance of this guy.
00:11:49.480 --> 00:11:52.270
And if I look at the bias, it's
actually really not that bad.
00:11:52.270 --> 00:11:53.860
It's 0.5 minus theta.
00:11:53.860 --> 00:11:55.970
In particular, if theta
is 0.5, then this guy
00:11:55.970 --> 00:11:57.930
is strictly better.
00:11:57.930 --> 00:12:00.430
And so you can actually
now look at what
00:12:00.430 --> 00:12:05.100
the quadratic risk looks like.
00:12:05.100 --> 00:12:06.600
So here, what I'm
going to do is I'm
00:12:06.600 --> 00:12:08.141
going to take my
true theta-- so it's
00:12:08.141 --> 00:12:09.706
going to range between 0 and 1.
00:12:09.706 --> 00:12:12.080
And we know that those two
things are functions of theta,
00:12:12.080 --> 00:12:13.913
so I can only understand
them if I plot them
00:12:13.913 --> 00:12:16.650
as functions of theta.
00:12:16.650 --> 00:12:18.590
And so now I'm going
to actually plot--
00:12:18.590 --> 00:12:20.870
the y-axis is going
to be the risk.
00:12:23.860 --> 00:12:26.680
So what is the risk of
the estimator of 0.5?
00:12:26.680 --> 00:12:27.630
This one is easy.
00:12:27.630 --> 00:12:33.330
Well, it's 0 plus the
square of 0.5 minus theta.
00:12:33.330 --> 00:12:37.990
So we know that at theta equal 0.5,
it's actually going to be 0.
00:12:37.990 --> 00:12:39.640
And then it's going
to be a square.
00:12:39.640 --> 00:12:44.800
So at 0, it's going to be 0.25.
00:12:44.800 --> 00:12:49.024
And at 1, it's going
to be 0.25 as well.
00:12:49.024 --> 00:12:49.940
So it looks like this.
00:12:49.940 --> 00:12:50.856
Well, actually, sorry.
00:12:50.856 --> 00:12:52.650
Let me put the 0.5
where it should be.
00:12:56.680 --> 00:12:57.180
OK.
00:12:57.180 --> 00:13:03.690
So this here is the risk of 0.5.
00:13:03.690 --> 00:13:06.370
And we'll write it like this.
00:13:06.370 --> 00:13:09.490
So when theta is very close
to 0.5, I'm very happy.
00:13:09.490 --> 00:13:13.090
When theta gets farther,
it's a little bit annoying.
00:13:13.090 --> 00:13:16.680
And then here, I want to
plot the risk of this guy.
00:13:16.680 --> 00:13:18.430
So now the thing with
the risk of this guy
00:13:18.430 --> 00:13:20.740
is that it will depend on n.
00:13:20.740 --> 00:13:24.040
So I will just pick some
n that I'm happy with just
00:13:24.040 --> 00:13:26.417
so that I can
actually draw a curve.
00:13:26.417 --> 00:13:29.000
Otherwise, I'm going to have to
plot one curve per value of n.
00:13:29.000 --> 00:13:31.900
So let's just say, for
example, that n is equal to 10.
00:13:31.900 --> 00:13:35.250
And so now I need to plot
the function theta, 1 minus
00:13:35.250 --> 00:13:37.600
theta divided by 10.
00:13:37.600 --> 00:13:39.000
We know that theta,
1 minus theta
00:13:39.000 --> 00:13:40.780
is a curve that goes like this.
00:13:40.780 --> 00:13:42.200
It takes value at 1/2.
00:13:42.200 --> 00:13:43.480
It takes value 1/4.
00:13:43.480 --> 00:13:44.310
That's the maximum.
00:13:44.310 --> 00:13:46.000
And then it's 0 at the end.
00:13:46.000 --> 00:13:52.557
So really, if n is
equal to 1, this
00:13:52.557 --> 00:13:53.890
is what the variance looks like.
00:13:53.890 --> 00:13:56.530
The bias doesn't
count in the risk.
00:13:56.530 --> 00:13:57.029
Yeah.
00:13:57.029 --> 00:14:00.020
AUDIENCE: [INAUDIBLE]
00:14:00.020 --> 00:14:01.020
PHILIPPE RIGOLLET: Sure.
00:14:01.020 --> 00:14:03.560
Can you move?
00:14:03.560 --> 00:14:04.290
All right.
00:14:04.290 --> 00:14:05.065
Are you guys good?
00:14:08.060 --> 00:14:08.560
All right.
00:14:08.560 --> 00:14:10.060
So now I have this picture.
00:14:10.060 --> 00:14:12.280
And I know I'm going up to 0.25.
00:14:12.280 --> 00:14:15.230
And there's a place
where those curves cross.
00:14:15.230 --> 00:14:16.132
So if you're sure--
00:14:16.132 --> 00:14:18.340
let's say you're talking
about presidential election,
00:14:18.340 --> 00:14:20.830
you know that those things
are going to be really close.
00:14:20.830 --> 00:14:23.620
Maybe you're actually
better by predicting 0.5
00:14:23.620 --> 00:14:25.810
if you know it's not
going to go too far.
00:14:25.810 --> 00:14:32.670
But that's for one observation,
so that's the risk of X1.
00:14:32.670 --> 00:14:34.890
But if I look at the
risk of Xn bar, all I'm doing
00:14:34.890 --> 00:14:38.940
is just crushing
this curve down to 0.
00:14:38.940 --> 00:14:42.530
So as n increases, it's going
to look more and more like this.
00:14:42.530 --> 00:14:44.396
It's the same
curve divided by n.
00:14:48.600 --> 00:14:50.650
And so now I can just
start to understand
00:14:50.650 --> 00:14:52.660
that for different
values of thetas,
00:14:52.660 --> 00:14:56.240
now I'm going to have to be very
close to theta is equal to 1/2
00:14:56.240 --> 00:14:58.480
if I want to start saying
that Xn bar is worse
00:14:58.480 --> 00:15:03.226
than the naive estimator 0.5.
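The three risk curves on the board can be written down directly. The sketch below uses hypothetical function names; the crossing-point formula is not stated in the lecture, it comes from solving theta(1 - theta)/n = (0.5 - theta)^2 for theta, which gives theta = 1/2 plus or minus 1/(2 sqrt(n + 1)).

```python
import math


def risk_constant_half(theta):
    """Risk of the estimator 0.5: variance 0 plus squared bias (0.5 - theta)^2."""
    return (0.5 - theta) ** 2


def risk_sample_mean(theta, n):
    """Risk of Xn bar: squared bias 0 plus variance theta(1 - theta)/n."""
    return theta * (1 - theta) / n


def risk_first_observation(theta):
    """Risk of X1, which is the sample mean with n = 1."""
    return risk_sample_mean(theta, 1)


def crossing_points(n):
    """Values of theta where the risks of Xn bar and of 0.5 are equal."""
    half_width = 1 / (2 * math.sqrt(n + 1))
    return 0.5 - half_width, 0.5 + half_width


if __name__ == "__main__":
    for n in (1, 10, 124):
        low, high = crossing_points(n)
        print(f"n = {n}: 0.5 has the smaller risk only for theta in ({low:.3f}, {high:.3f})")
```

As n grows, the window around 1/2 in which the constant guess 0.5 has the smaller risk shrinks like 1 over the square root of n, which is the "crushing the curve down to 0" picture.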
00:15:03.226 --> 00:15:04.006
Yeah.
00:15:04.006 --> 00:15:04.672
AUDIENCE: Sorry.
00:15:04.672 --> 00:15:08.528
I know you explained a little
bit before, but can you just--
00:15:08.528 --> 00:15:11.420
what is an intuitive
definition of risk?
00:15:11.420 --> 00:15:13.840
What is it actually describing?
00:15:13.840 --> 00:15:16.380
PHILIPPE RIGOLLET:
So either you can--
00:15:16.380 --> 00:15:18.924
well, when you have an unbiased
estimator, it's simple.
00:15:18.924 --> 00:15:20.590
It's just telling you
it's the variance,
00:15:20.590 --> 00:15:23.190
because the theta that you
have over there is really-- so
00:15:23.190 --> 00:15:26.500
in the definition of
the risk, the theta
00:15:26.500 --> 00:15:28.200
that you have here
if you're unbiased
00:15:28.200 --> 00:15:31.390
is really the
expectation of theta hat.
00:15:31.390 --> 00:15:33.230
So that's really
just the variance.
00:15:33.230 --> 00:15:35.590
So the risk is
really telling you
00:15:35.590 --> 00:15:39.160
how much fluctuations I
have around my expectation
00:15:39.160 --> 00:15:39.790
if unbiased.
00:15:39.790 --> 00:15:42.164
But actually here, it's telling
you how much fluctuations
00:15:42.164 --> 00:15:44.420
I have on average around theta.
00:15:44.420 --> 00:15:47.105
So if you understand the
notion of variance as being--
00:15:47.105 --> 00:15:47.980
AUDIENCE: [INAUDIBLE]
00:15:47.980 --> 00:15:48.580
PHILIPPE RIGOLLET: What?
00:15:48.580 --> 00:15:49.780
AUDIENCE: Like
variance on average.
00:15:49.780 --> 00:15:49.990
PHILIPPE RIGOLLET: No.
00:15:49.990 --> 00:15:50.650
AUDIENCE: No.
00:15:50.650 --> 00:15:51.775
PHILIPPE RIGOLLET: It's
just like variance.
00:15:51.775 --> 00:15:52.060
AUDIENCE: Oh, OK.
00:15:52.060 --> 00:15:53.800
PHILIPPE RIGOLLET: So when you--
00:15:53.800 --> 00:15:56.140
I mean, if you claim you
understand what variance is,
00:15:56.140 --> 00:15:58.090
it's telling you
what is the expected
00:15:58.090 --> 00:16:00.310
squared fluctuation
around the expectation
00:16:00.310 --> 00:16:01.640
of my random variable.
00:16:01.640 --> 00:16:04.270
It's just telling you on
average how far I'm going to be.
00:16:04.270 --> 00:16:06.200
And you take the square because
you want to cancel the signs.
00:16:06.200 --> 00:16:07.250
Otherwise, you're
going to get 0.
00:16:07.250 --> 00:16:07.660
AUDIENCE: Oh, OK.
00:16:07.660 --> 00:16:08.660
PHILIPPE RIGOLLET: And
here it's saying, well,
00:16:08.660 --> 00:16:11.034
I really don't care what the
expectation of theta hat is.
00:16:11.034 --> 00:16:13.030
What I want to get
to is theta, so I'm
00:16:13.030 --> 00:16:15.430
looking at the expectation
of the squared fluctuations
00:16:15.430 --> 00:16:16.870
around theta itself.
00:16:16.870 --> 00:16:19.610
If I'm unbiased, it
coincides with the variance.
00:16:19.610 --> 00:16:21.940
But if I'm biased, then I
have to account for the fact
00:16:21.940 --> 00:16:23.260
that I'm really
not computing the--
00:16:23.260 --> 00:16:23.801
AUDIENCE: OK.
00:16:23.801 --> 00:16:24.640
OK.
00:16:24.640 --> 00:16:25.140
Thanks.
00:16:25.140 --> 00:16:27.490
PHILIPPE RIGOLLET: OK?
00:16:27.490 --> 00:16:28.670
All right.
00:16:28.670 --> 00:16:29.670
Are there any questions?
00:16:29.670 --> 00:16:31.459
So here, what I really
want to illustrate
00:16:31.459 --> 00:16:33.000
is that the risk
itself is a function
00:16:33.000 --> 00:16:34.260
of theta most of the time.
00:16:34.260 --> 00:16:35.620
And so for different
thetas, some estimators
00:16:35.620 --> 00:16:37.170
are going to be
better than others.
00:16:37.170 --> 00:16:38.580
But there's also
the entire range
00:16:38.580 --> 00:16:41.960
of estimators, those
that are really biased,
00:16:41.960 --> 00:16:44.720
but the bias can
completely vanish.
00:16:44.720 --> 00:16:47.270
And so here, you see
you have no bias,
00:16:47.270 --> 00:16:48.890
but the variance can be large.
00:16:48.890 --> 00:16:50.720
Or you have 0 bias--
00:16:50.720 --> 00:16:52.430
you have a bias, but
the variance is 0.
00:16:52.430 --> 00:16:54.130
So you can actually
have this trade-off
00:16:54.130 --> 00:16:58.220
and you can find things that are
in the entire range in general.
00:17:01.260 --> 00:17:05.940
So those things are
actually-- those trade-offs
00:17:05.940 --> 00:17:10.260
between bias and variance are
usually much better illustrated
00:17:10.260 --> 00:17:12.565
if we're talking about
multivariate parameters.
00:17:12.565 --> 00:17:14.190
If I actually look
at a parameter which
00:17:14.190 --> 00:17:19.025
is the mean of some multivariate
Gaussian, so an entire vector,
00:17:19.025 --> 00:17:20.150
then the bias is going to--
00:17:20.150 --> 00:17:23.599
I can make the bias
bigger by, for example,
00:17:23.599 --> 00:17:26.190
forcing all the coordinates of
my estimator to be the same.
00:17:26.190 --> 00:17:27.690
So here, I'm going
to get some bias,
00:17:27.690 --> 00:17:29.106
but the variance
is actually going
00:17:29.106 --> 00:17:31.200
to be much better, because
I get to average all
00:17:31.200 --> 00:17:32.940
the coordinates for this guy.
00:17:32.940 --> 00:17:35.680
And so really, the
bias/variance trade-off
00:17:35.680 --> 00:17:38.790
is when you have multiple
parameters to estimate,
00:17:38.790 --> 00:17:40.400
so you have a vector
of parameters,
00:17:40.400 --> 00:17:42.930
a multivariate
parameter, the bias
00:17:42.930 --> 00:17:45.450
increases when you're trying
to pool more information
00:17:45.450 --> 00:17:49.470
across the different
components to actually have
00:17:49.470 --> 00:17:50.670
a lower variance.
00:17:50.670 --> 00:17:53.220
So the more you average,
the lower the variance.
00:17:53.220 --> 00:17:54.870
That's exactly what
we've illustrated.
00:17:54.870 --> 00:17:56.700
As n increases, the
variance decreases,
00:17:56.700 --> 00:17:59.370
like 1 over n or theta,
1 minus theta over n.
00:17:59.370 --> 00:18:01.530
And so this is how it
happens in general.
00:18:01.530 --> 00:18:03.840
In this class, it's mostly
one-dimensional parameter
00:18:03.840 --> 00:18:06.150
estimation, so it's going to be
a little harder to illustrate
00:18:06.150 --> 00:18:06.649
that.
00:18:06.649 --> 00:18:09.210
But if you do, for example,
non-parametric estimation,
00:18:09.210 --> 00:18:10.450
that's all you do.
00:18:10.450 --> 00:18:14.590
There's just bias/variance
trade-offs all the time.
00:18:14.590 --> 00:18:16.980
And in between, when you have
high-dimensional parametric
00:18:16.980 --> 00:18:20.110
estimation, that
happens a lot as well.
00:18:20.110 --> 00:18:21.750
OK.
00:18:21.750 --> 00:18:25.022
So I'm just going to go quickly
through those two remaining
00:18:25.022 --> 00:18:26.730
slides, because we've
actually seen them.
00:18:26.730 --> 00:18:29.850
But I just wanted you to have
somewhere a formal definition
00:18:29.850 --> 00:18:32.700
of what a confidence
interval is.
00:18:32.700 --> 00:18:37.830
And so we fixed a statistical
model for n observations, X1
00:18:37.830 --> 00:18:38.700
to Xn.
00:18:38.700 --> 00:18:42.050
The parameter theta
here is one-dimensional.
00:18:42.050 --> 00:18:44.010
Theta is a subset
of the real line,
00:18:44.010 --> 00:18:47.010
and that's why I
talk about intervals.
00:18:47.010 --> 00:18:48.510
An interval is a
subset of the line.
00:18:48.510 --> 00:18:51.480
If I had a subset
of R2, for example,
00:18:51.480 --> 00:18:54.800
that would no longer be called
an interval, but a region,
00:18:54.800 --> 00:18:57.570
just because-- well, that's
just we can say a set,
00:18:57.570 --> 00:18:59.130
a confidence set.
00:18:59.130 --> 00:19:01.920
But people like to
say confidence region.
00:19:01.920 --> 00:19:04.170
So an interval is just a
one-dimensional confidence
00:19:04.170 --> 00:19:04.740
region.
00:19:04.740 --> 00:19:07.350
And it has to be an
interval as well.
00:19:07.350 --> 00:19:11.820
So a confidence interval
of level 1 minus alpha--
00:19:11.820 --> 00:19:16.110
so we refer to the quality
of a confidence interval
00:19:16.110 --> 00:19:18.120
is actually called its level.
00:19:18.120 --> 00:19:21.490
It takes value 1 minus alpha
for some positive alpha.
00:19:21.490 --> 00:19:23.080
And so the confidence level--
00:19:23.080 --> 00:19:26.760
the level of the confidence
interval is between 0 and 1.
00:19:26.760 --> 00:19:29.410
The closer to 1 it is, the
better the confidence interval.
00:19:29.410 --> 00:19:32.040
The closer to 0,
the worse it is.
00:19:32.040 --> 00:19:34.560
And so for any
random interval-- so
00:19:34.560 --> 00:19:37.830
a confidence interval
is a random interval.
00:19:37.830 --> 00:19:41.310
The bounds of this interval
depend on random data.
00:19:41.310 --> 00:19:44.650
Just like we had
X bar plus/minus
00:19:44.650 --> 00:19:46.570
1 over square root of
n, for example, or 2
00:19:46.570 --> 00:19:49.020
over square root
of n, this X bar
00:19:49.020 --> 00:19:53.020
was the random thing that would
make fluctuate those guys.
00:19:53.020 --> 00:19:54.342
And so now I have an interval.
00:19:54.342 --> 00:19:56.550
And now I have its boundaries,
but now the boundaries
00:19:56.550 --> 00:19:58.830
are not allowed to depend
on my unknown parameter.
00:19:58.830 --> 00:20:00.600
Otherwise, it's not a
confidence interval,
00:20:00.600 --> 00:20:02.070
just like an
estimator that depends
00:20:02.070 --> 00:20:04.929
on the unknown parameter
is not an estimator.
00:20:04.929 --> 00:20:06.720
The confidence interval
has to be something
00:20:06.720 --> 00:20:10.360
that I can compute
once I collect data.
00:20:10.360 --> 00:20:14.990
And so what I want is that--
so there's this weird notation.
00:20:14.990 --> 00:20:17.800
The fact that I write theta--
00:20:17.800 --> 00:20:19.960
that's the probability
that I contains theta.
00:20:19.960 --> 00:20:23.081
You're used to seeing
theta belongs to I.
00:20:23.081 --> 00:20:24.580
But here, I really
want to emphasize
00:20:24.580 --> 00:20:26.980
that the randomness is
in I. And so the way
00:20:26.980 --> 00:20:28.540
you actually say
it when you read
00:20:28.540 --> 00:20:32.980
this formula is the probability
that I contains theta
00:20:32.980 --> 00:20:36.930
is at least 1 minus alpha.
00:20:36.930 --> 00:20:39.810
So it better be close to 1.
00:20:39.810 --> 00:20:41.850
You want 1 minus alpha
to be very close to 1,
00:20:41.850 --> 00:20:43.724
because it's really
telling you that whatever
00:20:43.724 --> 00:20:46.320
random variable I'm giving
you, my error bars are actually
00:20:46.320 --> 00:20:49.190
covering the right theta.
00:20:49.190 --> 00:20:50.890
And I want this to be true.
00:20:50.890 --> 00:20:52.390
But I want this--
since I don't know
00:20:52.390 --> 00:20:54.340
what my confidence--
my parameter of theta
00:20:54.340 --> 00:20:58.450
is, I want this to hold
true for all possible values
00:20:58.450 --> 00:21:02.860
of the parameter that nature
may have come up with.
00:21:02.860 --> 00:21:05.050
So I want this-- so there's
theta that changes here,
00:21:05.050 --> 00:21:06.580
so the distribution
of the interval
00:21:06.580 --> 00:21:08.860
is actually changing
with theta hopefully.
00:21:08.860 --> 00:21:11.090
And theta is changing
with this guy.
00:21:11.090 --> 00:21:13.780
So regardless of the value
of theta that I'm getting,
00:21:13.780 --> 00:21:17.350
I want that the probability
that it contains the theta
00:21:17.350 --> 00:21:20.520
is actually larger
than 1 minus alpha.
00:21:20.520 --> 00:21:22.020
So I'll come back
to it in a second.
00:21:22.020 --> 00:21:23.600
I just want to say
that here, we can
00:21:23.600 --> 00:21:25.130
talk about asymptotic level.
00:21:25.130 --> 00:21:27.320
And that's typically when
you use central limit
00:21:27.320 --> 00:21:29.750
theorem to compute this guy.
00:21:29.750 --> 00:21:32.180
Then you're not guaranteed
that the value is
00:21:32.180 --> 00:21:35.840
at least 1 minus
alpha for every n,
00:21:35.840 --> 00:21:40.410
but it's actually in the limit
larger than 1 minus alpha.
00:21:40.410 --> 00:21:43.140
So maybe for each fixed n
it's going to be not true.
00:21:43.140 --> 00:21:44.970
But as n goes
to infinity, it's
00:21:44.970 --> 00:21:46.620
actually going to become true.
00:21:46.620 --> 00:21:49.230
If you want this to
hold for every n,
00:21:49.230 --> 00:21:51.840
you actually need to use things
such as Hoeffding's inequality
00:21:51.840 --> 00:21:55.170
that we described at some
point, that hold for every n.
00:21:55.170 --> 00:22:00.002
So as a rule of thumb, if you
use the central limit theorem,
00:22:00.002 --> 00:22:01.710
you're dealing with
a confidence interval
00:22:01.710 --> 00:22:04.057
with asymptotic
level 1 minus alpha.
00:22:04.057 --> 00:22:05.640
And the reason is
because you actually
00:22:05.640 --> 00:22:10.260
want to get the quantiles
of the normal-- the Gaussian
00:22:10.260 --> 00:22:13.110
distribution that comes from
the central limit theorem.
00:22:13.110 --> 00:22:15.579
And if you want to use
Hoeffding's, for example,
00:22:15.579 --> 00:22:18.120
you might actually get away with
a confidence interval that's
00:22:18.120 --> 00:22:20.280
actually true even
non-asymptotically.
00:22:20.280 --> 00:22:22.030
It's just the regular
confidence interval.
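For reference, the two notions of level just described can be written side by side. This is a sketch in standard notation; the symbols \mathcal{I} (or \mathcal{I}_n) for the interval and \Theta for the parameter set are assumptions of this note, not notation confirmed by the recording.

```latex
% Confidence interval of level 1 - alpha: the bound holds for every fixed n.
\[
  \mathbb{P}_\theta\big[\mathcal{I} \ni \theta\big] \;\ge\; 1 - \alpha
  \qquad \text{for all } \theta \in \Theta .
\]
% Confidence interval of asymptotic level 1 - alpha: the bound holds only in the limit.
\[
  \lim_{n \to \infty} \mathbb{P}_\theta\big[\mathcal{I}_n \ni \theta\big] \;\ge\; 1 - \alpha
  \qquad \text{for all } \theta \in \Theta .
\]
```

In both lines the randomness sits in the interval, not in theta, which is why it reads "I contains theta".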
00:22:24.560 --> 00:22:26.390
So this is the
formal definition.
00:22:26.390 --> 00:22:28.009
It's a bit of a mouthful.
00:22:28.009 --> 00:22:30.050
But we actually-- the best
way to understand them
00:22:30.050 --> 00:22:31.980
is to build them.
00:22:31.980 --> 00:22:33.930
Now, at some point I said--
00:22:33.930 --> 00:22:35.898
and I think it was
part of the homework--
00:22:38.429 --> 00:22:39.970
so here, I really
say the probability
00:22:39.970 --> 00:22:42.178
the true parameter belongs
to the confidence interval
00:22:42.178 --> 00:22:44.870
is actually 1 minus alpha.
00:22:44.870 --> 00:22:47.096
And so that's because here,
this confidence interval
00:22:47.096 --> 00:22:48.220
is still a random variable.
00:22:48.220 --> 00:22:50.350
Now, if I start plugging
in numbers instead
00:22:50.350 --> 00:22:52.000
of the random
variables X1 to Xn,
00:22:52.000 --> 00:22:55.240
I start putting 1,
0, 0, 1, 0, 0, 1,
00:22:55.240 --> 00:22:58.220
like I did for the kiss
example, then in this case,
00:22:58.220 --> 00:23:03.321
the random interval is actually
going to be [0.42, 0.65].
00:23:03.321 --> 00:23:05.570
And this guy, the probability
that theta belongs to it
00:23:05.570 --> 00:23:07.950
is not 1 minus alpha.
00:23:07.950 --> 00:23:10.000
It's either 0 if
it's not in there
00:23:10.000 --> 00:23:11.214
or it's 1 if it's in there.
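The distinction between the random interval and one realized interval can be checked by simulation. The sketch below is an illustration under assumptions: it uses the conservative interval Xn bar plus or minus q over 2 square root of n, takes q = 1.96 as the approximate 0.975 quantile of the standard Gaussian, and the function names are hypothetical.

```python
import math
import random

def conservative_interval(xs, q=1.96):
    """Interval xbar +/- q / (2 sqrt(n)); q = 1.96 is roughly the 97.5% Gaussian quantile."""
    n = len(xs)
    xbar = sum(xs) / n
    half = q / (2 * math.sqrt(n))
    return xbar - half, xbar + half

def coverage(p, n, trials=2000, seed=0):
    """Fraction of repeated samples whose interval contains the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        lo, hi = conservative_interval(xs)
        hits += lo <= p <= hi  # for one realized interval this is simply 0 or 1
    return hits / trials

cov = coverage(p=0.5, n=124)
print(f"empirical coverage over repeated samples: {cov:.3f}")
```

The coverage is a property of the procedure across repetitions; each individual interval either contains p or does not.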
00:23:16.870 --> 00:23:19.360
So here is the
example that we had.
00:23:19.360 --> 00:23:24.220
So let's just look back at
our favorite example, which
00:23:24.220 --> 00:23:26.130
is the average of
Bernoulli random variables,
00:23:26.130 --> 00:23:30.280
so we studied that maybe
that's the third time already.
00:23:30.280 --> 00:23:34.210
So the sample average, Xn
bar, is a strongly consistent
00:23:34.210 --> 00:23:35.200
estimator of p.
00:23:35.200 --> 00:23:37.480
That was one of the
properties that we wanted.
00:23:37.480 --> 00:23:40.285
Strongly consistent means
that as n goes to infinity,
00:23:40.285 --> 00:23:42.940
it converges almost surely
to the true parameter.
00:23:42.940 --> 00:23:44.710
That's the strong
law of large numbers.
00:23:44.710 --> 00:23:47.050
It is consistent also, because
it's strongly consistent,
00:23:47.050 --> 00:23:49.910
so it also converges
in probability,
00:23:49.910 --> 00:23:52.290
which makes it consistent.
00:23:52.290 --> 00:23:53.070
It's unbiased.
00:23:53.070 --> 00:23:53.970
We've seen that.
00:23:53.970 --> 00:23:57.780
We've actually computed
its quadratic risk.
00:23:57.780 --> 00:24:00.344
And now what I have
is that if I look at--
00:24:00.344 --> 00:24:02.760
thanks to the central limit
theorem, we actually did this.
00:24:02.760 --> 00:24:08.850
We built a confidence interval
at level 1 minus alpha--
00:24:08.850 --> 00:24:12.360
asymptotic level, sorry,
asymptotic level 1 minus alpha.
00:24:12.360 --> 00:24:15.680
And so here, this
is how we did it.
00:24:15.680 --> 00:24:17.640
Let me just go through it again.
00:24:17.640 --> 00:24:19.455
So we know from the
central limit theorem--
00:24:28.240 --> 00:24:31.270
so the central limit
theorem tells us
00:24:31.270 --> 00:24:38.040
that Xn bar minus p, divided
by square root of p(1 - p), times
00:24:38.040 --> 00:24:41.330
square root of n, converges
in distribution as n
00:24:41.330 --> 00:24:47.270
goes to infinity to some
standard normal distribution.
00:24:47.270 --> 00:24:49.910
So what it means is that if
I look at the probability
00:24:49.910 --> 00:24:53.990
under the true p, that the absolute value of
square root of n times Xn bar
00:24:53.990 --> 00:25:03.130
minus p, divided by square
root of p(1 - p),
00:25:03.130 --> 00:25:06.040
is less than q alpha
over 2, where this is
00:25:06.040 --> 00:25:07.780
the definition of the quantile.
00:25:07.780 --> 00:25:11.980
Then this guy-- and I'm actually
going to use the same notation,
00:25:11.980 --> 00:25:17.320
limit as n goes to infinity,
this is the same thing.
00:25:17.320 --> 00:25:22.720
So this is actually going to
be equal to 1 minus alpha.
00:25:22.720 --> 00:25:25.180
That's exactly what
I did last time.
00:25:25.180 --> 00:25:28.690
This is by definition of the
quantile of a standard Gaussian
00:25:28.690 --> 00:25:32.890
and of a limit in distribution.
00:25:32.890 --> 00:25:36.920
So the probabilities computed on
this guy in the limit converges
00:25:36.920 --> 00:25:38.620
to the probability
computed on this guy.
00:25:38.620 --> 00:25:40.580
And we know that this
is just the probability
00:25:40.580 --> 00:25:42.280
that the absolute
value of some N(0, 1)
00:25:42.280 --> 00:25:44.640
is less than Q alpha over 2.
00:25:47.750 --> 00:25:50.280
And so in particular,
if it's equal,
00:25:50.280 --> 00:25:54.180
then I can put some
larger than or equal to,
00:25:54.180 --> 00:25:57.480
which guarantees my
asymptotic confidence level.
00:25:57.480 --> 00:25:59.700
And I just solve for p.
00:25:59.700 --> 00:26:03.240
So this is equivalent
to the limit
00:26:03.240 --> 00:26:07.110
as n goes to infinity
of the probability
00:26:07.110 --> 00:26:15.990
that p is between
Xn bar minus q
00:26:15.990 --> 00:26:21.210
alpha over 2
00:26:21.210 --> 00:26:26.810
times square root of p(1 - p)
divided by square root of n, and Xn
00:26:26.810 --> 00:26:33.980
bar plus q alpha over 2
times square root of p(1 - p)
00:26:33.980 --> 00:26:37.370
divided by square root of
n is larger than or equal
00:26:37.370 --> 00:26:39.030
to 1 minus alpha.
00:26:39.030 --> 00:26:39.960
And so there you go.
00:26:39.960 --> 00:26:43.500
I have my confidence interval.
00:26:43.500 --> 00:26:45.750
Except that it's not, right?
00:26:45.750 --> 00:26:48.290
We just said that the bounds
of a confidence interval
00:26:48.290 --> 00:26:50.370
may not depend on the
unknown parameter.
00:26:50.370 --> 00:26:52.440
And here, they do.
00:26:52.440 --> 00:26:54.300
And so we actually
came up with two ways
00:26:54.300 --> 00:26:55.874
of getting rid of this.
00:26:55.874 --> 00:26:58.290
Since we only need this thing--
so this thing, as we said,
00:26:58.290 --> 00:26:59.926
is really equal.
00:26:59.926 --> 00:27:01.800
Every time I'm going to
make this guy smaller
00:27:01.800 --> 00:27:03.450
and this guy larger,
I'm only going
00:27:03.450 --> 00:27:05.210
to increase the probability.
00:27:05.210 --> 00:27:06.960
And so what we do is
we actually just take
00:27:06.960 --> 00:27:08.940
the largest possible
value for p(1 - p),
00:27:08.940 --> 00:27:13.090
which makes the interval
as large as possible.
00:27:13.090 --> 00:27:15.420
And so now I have this.
00:27:15.420 --> 00:27:17.070
I just do one of the two tricks.
00:27:17.070 --> 00:27:22.400
I replace p(1 - p) by its
upper bound, which is 1/4.
00:27:25.150 --> 00:27:28.255
As we said, p(1 - p), the
function looks like this.
00:27:28.255 --> 00:27:31.540
So I just take the
value here at 1/2.
00:27:31.540 --> 00:27:37.800
Or, I can use Slutsky and say
that if I replace p by Xn bar,
00:27:37.800 --> 00:27:40.890
that's the same as just
replacing p by Xn bar here.
00:27:45.640 --> 00:27:48.310
And by Slutsky, we know that
this is actually converging
00:27:48.310 --> 00:27:50.650
also to some standard Gaussian.
00:27:59.630 --> 00:28:04.120
We've seen that when we
saw Slutsky as an example.
00:28:04.120 --> 00:28:05.620
And so those two
things-- actually,
00:28:05.620 --> 00:28:07.300
just because I'm
taking the limit
00:28:07.300 --> 00:28:10.090
and I'm only caring about the
asymptotic confidence level,
00:28:10.090 --> 00:28:13.420
I can actually just plug in
consistent quantities in there,
00:28:13.420 --> 00:28:15.580
such as Xn bar where
I don't have a p.
00:28:15.580 --> 00:28:18.790
And that gives me another
confidence interval.
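The two tricks can be put side by side numerically. This is a minimal sketch, not the course's code: the function name and the example values (Xn bar = 0.6, n = 100) are made up for illustration, and the quantile comes from Python's standard `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def bernoulli_intervals(xbar, n, alpha=0.05):
    """Return (conservative, plug_in) asymptotic intervals for a Bernoulli parameter p."""
    q = NormalDist().inv_cdf(1 - alpha / 2)  # q_{alpha/2}: the 1 - alpha/2 Gaussian quantile
    # Trick 1: replace p(1 - p) by its upper bound 1/4.
    half_cons = q * math.sqrt(0.25) / math.sqrt(n)
    # Trick 2 (Slutsky): replace p by the consistent estimator xbar.
    half_plug = q * math.sqrt(xbar * (1 - xbar)) / math.sqrt(n)
    return (xbar - half_cons, xbar + half_cons), (xbar - half_plug, xbar + half_plug)

cons, plug = bernoulli_intervals(xbar=0.6, n=100)
print("conservative:", cons)
print("plug-in:     ", plug)
```

Neither interval depends on the unknown p, and the plug-in interval is never wider than the conservative one because xbar(1 - xbar) is at most 1/4.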
00:28:18.790 --> 00:28:19.290
All right.
00:28:19.290 --> 00:28:24.510
So this by now, hopefully
after doing it three times,
00:28:24.510 --> 00:28:28.320
you should really, really be
comfortable with just creating
00:28:28.320 --> 00:28:29.880
this confidence interval.
00:28:29.880 --> 00:28:31.260
We did it three times in class.
00:28:31.260 --> 00:28:33.660
I think you probably did
it another couple times
00:28:33.660 --> 00:28:34.612
in your homework.
00:28:34.612 --> 00:28:36.570
So just make sure you're
comfortable with this.
00:28:36.570 --> 00:28:37.070
All right.
00:28:37.070 --> 00:28:39.900
That's one of the basic
things you would want to know.
00:28:39.900 --> 00:28:41.470
Are there any questions?
00:28:41.470 --> 00:28:42.121
Yes.
00:28:42.121 --> 00:28:46.540
AUDIENCE: So Slutsky holds
for any single response set p.
00:28:46.540 --> 00:28:48.504
But Xn converges [INAUDIBLE].
00:28:52.940 --> 00:28:55.076
PHILIPPE RIGOLLET: So
that's not Slutsky, right?
00:28:55.076 --> 00:28:58.040
AUDIENCE: That's [INAUDIBLE].
00:28:58.040 --> 00:29:04.040
PHILIPPE RIGOLLET: So Slutsky
tells you that if you--
00:29:04.040 --> 00:29:06.530
Slutsky's about combining
two types of convergence.
00:29:06.530 --> 00:29:08.270
So Slutsky tells you
that if you actually
00:29:08.270 --> 00:29:13.910
have one Xn that converges
to X in distribution and Yn
00:29:13.910 --> 00:29:16.700
that converges to Y
in probability, then
00:29:16.700 --> 00:29:18.867
you can actually
multiply Xn and Yn
00:29:18.867 --> 00:29:20.450
and get that the
limit in distribution
00:29:20.450 --> 00:29:28.460
is the product of X and Y,
where Y is now a constant.
00:29:28.460 --> 00:29:32.650
And here we have the
constant, which is 1.
00:29:32.650 --> 00:29:35.160
But I did that already, right?
00:29:35.160 --> 00:29:37.890
Using Slutsky to
replace it for the--
00:29:37.890 --> 00:29:40.890
to replace P by
Xn bar, we've done
00:29:40.890 --> 00:29:44.368
that last time, maybe a
couple of times ago, actually.
00:29:44.368 --> 00:29:45.850
Yeah.
00:29:45.850 --> 00:29:49.802
AUDIENCE: So I guess these
statements are [INAUDIBLE].
00:29:49.802 --> 00:29:51.284
PHILIPPE RIGOLLET:
That's correct.
00:29:51.284 --> 00:29:53.754
AUDIENCE: So could we like
figure out [INAUDIBLE]
00:29:53.754 --> 00:29:58.220
can we set a finite [INAUDIBLE].
00:29:58.220 --> 00:30:00.830
PHILIPPE RIGOLLET: So of
course, the short answer is no.
00:30:04.280 --> 00:30:06.740
So here's how you
would go about thinking
00:30:06.740 --> 00:30:08.420
about which method is better.
00:30:08.420 --> 00:30:10.760
So there's always the
more conservative method.
00:30:10.760 --> 00:30:13.400
The first one, the only
thing you're losing
00:30:13.400 --> 00:30:16.430
is the rate of convergence
of the central limit theorem.
00:30:16.430 --> 00:30:19.990
So if n is large enough so
that the central limit theorem
00:30:19.990 --> 00:30:22.700
approximation is very good,
then that's all you're
00:30:22.700 --> 00:30:24.539
going to be losing.
00:30:24.539 --> 00:30:27.080
Of course, the price you pay is
that your confidence interval
00:30:27.080 --> 00:30:28.700
is wider than it
would be if you were
00:30:28.700 --> 00:30:31.160
to use Slutsky for this
particular problem,
00:30:31.160 --> 00:30:32.600
typically wider.
00:30:32.600 --> 00:30:37.140
Actually, it is always
wider, because Xn bar times
00:30:37.140 --> 00:30:41.120
1 minus Xn bar is always
less than or equal to 1/4 as well.
00:30:41.120 --> 00:30:45.920
And so that's the
first thing you--
00:30:45.920 --> 00:30:51.380
so with Slutsky, you're basically
relying on the central limit--
00:30:51.380 --> 00:30:53.570
you're relying on the
asymptotics again.
00:30:53.570 --> 00:30:56.180
Now of course, you don't
want to be conservative,
00:30:56.180 --> 00:30:59.060
because you actually want to
squeeze as much from your data
00:30:59.060 --> 00:30:59.930
as you can.
00:30:59.930 --> 00:31:04.040
So it depends on how comfortable
and how critical it is for you
00:31:04.040 --> 00:31:06.410
to put valid error bars.
00:31:06.410 --> 00:31:07.940
If they're valid
in the asymptotics,
00:31:07.940 --> 00:31:09.710
then maybe you're actually
going to go with Slutsky
00:31:09.710 --> 00:31:11.918
so it actually gives you
slightly narrower confidence
00:31:11.918 --> 00:31:16.060
intervals and so you feel
like you're a little more--
00:31:16.060 --> 00:31:17.869
you have a more precise answer.
00:31:17.869 --> 00:31:19.910
Now, if you really need
to be super-conservative,
00:31:19.910 --> 00:31:23.390
then you're actually going
to go with the bound on p(1 - p).
00:31:23.390 --> 00:31:25.790
Actually, if you need to
be even more conservative,
00:31:25.790 --> 00:31:28.850
you are going to go with
Hoeffding's so you don't even
00:31:28.850 --> 00:31:31.412
have to rely on the
asymptotic level at all.
00:31:31.412 --> 00:31:32.870
But then your
confidence interval
00:31:32.870 --> 00:31:35.000
becomes twice as wide
and twice as wide
00:31:35.000 --> 00:31:37.960
and it becomes wider
and wider as you go.
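To see the price of being non-asymptotic, one can compare half-widths directly. The sketch below assumes the two-sided Hoeffding bound for variables in [0, 1], P(|Xn bar - p| >= t) <= 2 exp(-2 n t^2), solved for t at level alpha; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def half_widths(n, alpha=0.05):
    """Half-widths of the conservative CLT interval and of the Hoeffding interval."""
    q = NormalDist().inv_cdf(1 - alpha / 2)
    clt_conservative = q / (2 * math.sqrt(n))             # asymptotic level 1 - alpha
    hoeffding = math.sqrt(math.log(2 / alpha) / (2 * n))  # level 1 - alpha for every n
    return clt_conservative, hoeffding

for n in (30, 100, 1000):
    clt, hoef = half_widths(n)
    print(f"n={n}: CLT (conservative) {clt:.4f}, Hoeffding {hoef:.4f}, ratio {hoef / clt:.2f}")
```

Both half-widths shrink like 1 over square root of n, so the ratio between them depends only on alpha, not on n.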
00:31:37.960 --> 00:31:39.859
So depends on--
00:31:39.859 --> 00:31:41.650
I mean, there's a lot
of art in statistics
00:31:41.650 --> 00:31:46.310
which is gauging how critical
it is for you to output
00:31:46.310 --> 00:31:48.380
valid error bounds or if
they're really just here
00:31:48.380 --> 00:31:51.620
to be indicative of the
precision of the estimator you
00:31:51.620 --> 00:31:55.396
gave from a more
qualitative perspective.
00:31:55.396 --> 00:31:57.540
AUDIENCE: So the error
there is [INAUDIBLE]?
00:31:57.540 --> 00:31:58.540
PHILIPPE RIGOLLET: Yeah.
00:31:58.540 --> 00:32:01.220
So here, there's basically
a bunch of errors.
00:32:01.220 --> 00:32:04.280
There's one that's-- so there's
a theorem called Berry-Esseen
00:32:04.280 --> 00:32:09.830
that quantifies how far this
probability is from 1 minus
00:32:09.830 --> 00:32:12.670
alpha, but the
constants are terrible.
00:32:12.670 --> 00:32:14.510
So it's not very
helpful, but it tells you
00:32:14.510 --> 00:32:17.330
as n grows, how much smaller
this thing grows--
00:32:17.330 --> 00:32:18.320
becomes smaller.
00:32:18.320 --> 00:32:20.330
And then for
Slutsky, again you're
00:32:20.330 --> 00:32:22.790
multiplying something that
converges by something that
00:32:22.790 --> 00:32:24.827
fluctuates around 1, so
you need to understand
00:32:24.827 --> 00:32:25.910
how this thing fluctuates.
00:32:25.910 --> 00:32:28.070
Now, there's something
that shows up.
00:32:28.070 --> 00:32:31.400
Basically, what is the
slope of the function 1
00:32:31.400 --> 00:32:36.220
over square root of x(1 - x),
around the value
00:32:36.220 --> 00:32:37.590
you're interested in?
00:32:37.590 --> 00:32:39.850
And so if this function
is super-sharp,
00:32:39.850 --> 00:32:43.200
then small fluctuations of Xn
bar around its expectation
00:32:43.200 --> 00:32:45.700
are going to lead to
really high fluctuations
00:32:45.700 --> 00:32:47.630
of the function itself.
00:32:47.630 --> 00:32:49.570
So if you're looking at--
00:32:49.570 --> 00:32:55.615
if you have f of Xn bar and
f around say the true P,
00:32:55.615 --> 00:32:58.390
if f is really sharp
like that, then
00:32:58.390 --> 00:33:00.730
if you move a little
bit here, then you're
00:33:00.730 --> 00:33:03.260
going to move really
a lot on the y-axis.
00:33:03.260 --> 00:33:05.650
So that's what the function
here-- the function
00:33:05.650 --> 00:33:09.205
you're interested in is 1 over
square root of x(1 - x).
00:33:09.205 --> 00:33:11.830
So what does this function look
like around the point where you
00:33:11.830 --> 00:33:14.260
think P is the true parameter?
00:33:17.270 --> 00:33:19.850
Its derivative really
is what matters.
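The derivative in question can be written out explicitly. This is a worked computation added for illustration; the name f follows the "f of Xn bar" used above.

```latex
\[
  f(x) = \frac{1}{\sqrt{x(1-x)}} = \big(x - x^2\big)^{-1/2},
  \qquad
  f'(x) = -\frac{1 - 2x}{2\,\big(x(1-x)\big)^{3/2}} .
\]
% f'(1/2) = 0, so near p = 1/2 small fluctuations of \bar{X}_n barely move f(\bar{X}_n);
% |f'(x)| blows up as x -> 0 or x -> 1, so there the plug-in step is most sensitive.
```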
00:33:19.850 --> 00:33:21.729
OK?
00:33:21.729 --> 00:33:22.520
Any other questions?
00:33:24.665 --> 00:33:25.165
OK.
00:33:25.165 --> 00:33:26.665
So it's important,
because now we're
00:33:26.665 --> 00:33:29.430
going to switch to the
real, "let's do some hardcore
00:33:29.430 --> 00:33:31.928
computation" type of things.
00:33:31.928 --> 00:33:32.892
All right.
00:33:36.760 --> 00:33:39.550
So in this chapter, we're
going to talk about maximum
00:33:39.550 --> 00:33:40.988
likelihood estimation.
00:33:44.340 --> 00:33:49.320
Who has already seen maximum
likelihood estimation?
00:33:49.320 --> 00:33:50.030
OK.
00:33:50.030 --> 00:33:55.380
And who knows what a
convex function is?
00:33:55.380 --> 00:33:56.340
OK.
00:33:56.340 --> 00:34:00.910
So we'll do a little bit of
reminders on those things.
00:34:00.910 --> 00:34:04.330
So those things matter because when we do
maximum likelihood estimation,
00:34:04.330 --> 00:34:07.470
the likelihood is a function, so
we need to maximize a function.
00:34:07.470 --> 00:34:09.325
That's basically
what we need to do.
00:34:09.325 --> 00:34:10.699
And if I give you
a function, you
00:34:10.699 --> 00:34:12.659
need to know how to
maximize this function.
00:34:12.659 --> 00:34:14.408
Sometimes, you have
closed-form solutions.
00:34:14.408 --> 00:34:18.219
You can take the derivative and
set it equal to 0 and solve it.
00:34:18.219 --> 00:34:21.060
But sometimes, you actually
need to resort to algorithms
00:34:21.060 --> 00:34:21.600
to do that.
00:34:21.600 --> 00:34:25.020
And there's an entire
industry doing that.
00:34:25.020 --> 00:34:27.750
And we'll briefly touch upon
it, but this is definitely
00:34:27.750 --> 00:34:30.370
not the focus of this class.
00:34:30.370 --> 00:34:31.330
OK.
00:34:31.330 --> 00:34:34.630
So before diving directly
into the definition
00:34:34.630 --> 00:34:36.520
of the likelihood and
what is the definition
00:34:36.520 --> 00:34:38.500
of the maximum likelihood
estimator, what
00:34:38.500 --> 00:34:41.840
I'm going to try to
do is to give you
00:34:41.840 --> 00:34:45.380
an insight for what we're
actually doing when we do
00:34:45.380 --> 00:34:48.870
maximum likelihood estimation.
00:34:48.870 --> 00:34:53.719
So remember, we have a
model on a sample space E
00:34:53.719 --> 00:34:57.600
and some candidate
distributions P theta.
00:34:57.600 --> 00:35:00.930
And really, your goal is
to estimate a true theta
00:35:00.930 --> 00:35:04.080
star, the one that generated
some data, X1 to Xn,
00:35:04.080 --> 00:35:06.360
in an iid fashion.
00:35:06.360 --> 00:35:08.790
But this theta star is
really a proxy for us
00:35:08.790 --> 00:35:10.740
to know that we
actually understand
00:35:10.740 --> 00:35:12.100
the distribution itself.
00:35:12.100 --> 00:35:15.540
The goal of knowing theta star
is so that you can actually
00:35:15.540 --> 00:35:17.790
know what P theta star is.
00:35:17.790 --> 00:35:19.380
Otherwise, it has--
well, sometimes we
00:35:19.380 --> 00:35:21.660
said it has some meaning
itself, but really you
00:35:21.660 --> 00:35:23.490
want to know what
the distribution is.
00:35:23.490 --> 00:35:27.810
And so your goal is to actually
come up with the distribution--
00:35:27.810 --> 00:35:30.270
hopefully that comes
from the family P theta--
00:35:30.270 --> 00:35:33.360
that's close to P theta star.
00:35:33.360 --> 00:35:38.710
So in a way, what does it mean
to have two distributions that
00:35:38.710 --> 00:35:39.210
are close?
00:35:39.210 --> 00:35:41.260
It means that when you
compute probabilities
00:35:41.260 --> 00:35:43.330
on one distribution,
you should have
00:35:43.330 --> 00:35:46.870
the same probability on the
other distribution pretty much.
00:35:46.870 --> 00:35:49.120
So what we can do
is say, well, now I
00:35:49.120 --> 00:35:51.311
have two candidate
distributions.
00:35:59.010 --> 00:36:03.360
So if theta hat leads to a
candidate distribution P theta
00:36:03.360 --> 00:36:06.210
hat, and this is
the true theta star,
00:36:06.210 --> 00:36:08.820
it leads to the true
distribution P theta star
00:36:08.820 --> 00:36:11.940
according to which
my data was drawn.
00:36:11.940 --> 00:36:12.970
That's my candidate.
00:36:16.060 --> 00:36:18.100
As a statistician, I'm
supposed to come up
00:36:18.100 --> 00:36:20.380
with a good candidate,
and this is the truth.
00:36:23.940 --> 00:36:26.790
And what I want is that
if you actually give me
00:36:26.790 --> 00:36:30.030
the distribution,
then I want when
00:36:30.030 --> 00:36:31.950
I'm computing
probabilities for this guy,
00:36:31.950 --> 00:36:34.980
I know what the probabilities
for the other guys are.
00:36:34.980 --> 00:36:40.040
And so really what I want is
that if I compute a probability
00:36:40.040 --> 00:36:44.340
under theta hat of
some interval a, b,
00:36:44.340 --> 00:36:46.580
it should be pretty
close to the probability
00:36:46.580 --> 00:36:51.660
under theta star of a, b.
00:36:51.660 --> 00:36:53.220
And more generally,
if I want to take
00:36:53.220 --> 00:36:55.470
the union of two intervals,
I want this to be true.
00:36:55.470 --> 00:36:58.500
If I take just half-lines, I
want this to be true from 0
00:36:58.500 --> 00:37:00.900
to infinity, for example,
things like this.
00:37:00.900 --> 00:37:03.550
I want this to be true
for all of them at once.
00:37:03.550 --> 00:37:07.620
And so what I do is that I
write A for a probability event.
00:37:07.620 --> 00:37:11.520
And I want that P hat of
A is close to P star of A
00:37:11.520 --> 00:37:15.517
for any event A in
the sample space.
00:37:15.517 --> 00:37:17.100
Does that sound like
a reasonable goal
00:37:17.100 --> 00:37:18.994
for a statistician?
00:37:18.994 --> 00:37:20.910
So in particular, if I
want those to be close,
00:37:20.910 --> 00:37:22.784
I want the absolute
value of their difference
00:37:22.784 --> 00:37:23.680
to be close to 0.
00:37:26.220 --> 00:37:28.140
And this turns out to be--
00:37:28.140 --> 00:37:31.875
if I want this to hold
for all possible A's, I
00:37:31.875 --> 00:37:35.460
have all possible events, so I'm
going to actually maximize over
00:37:35.460 --> 00:37:36.100
these events.
00:37:36.100 --> 00:37:37.516
And I'm going to
look at the worst
00:37:37.516 --> 00:37:41.160
possible event on which theta
hat can depart from theta star.
00:37:41.160 --> 00:37:43.170
And so rather than
defining it specifically
00:37:43.170 --> 00:37:44.790
for theta hat and
theta star, I'm
00:37:44.790 --> 00:37:47.910
just going to say, well, if
you give me two probability
00:37:47.910 --> 00:37:51.420
measures, P theta
and P theta prime,
00:37:51.420 --> 00:37:53.100
I want to know how
close they are.
00:37:53.100 --> 00:37:55.080
Well, if I want to
measure how close they
00:37:55.080 --> 00:37:58.980
are by how they can
differ when I measure
00:37:58.980 --> 00:38:01.920
the probability
of some event, I'm
00:38:01.920 --> 00:38:04.800
just looking at the absolute
value of the difference
00:38:04.800 --> 00:38:06.180
of the probabilities
and I'm just
00:38:06.180 --> 00:38:09.240
maximizing over the worst
possible event that might
00:38:09.240 --> 00:38:11.380
actually make them differ.
00:38:11.380 --> 00:38:13.040
Agreed?
00:38:13.040 --> 00:38:14.360
That's a pretty strong notion.
00:38:14.360 --> 00:38:17.720
So if the total variation
between theta and theta prime
00:38:17.720 --> 00:38:22.390
is small, it means that for all
possible A's that you give me,
00:38:22.390 --> 00:38:25.590
then P theta of A is
going to be close to P
00:38:25.590 --> 00:38:30.820
theta prime of A, because if--
00:38:30.820 --> 00:38:33.820
let's say I just found a
bound on the total variation
00:38:33.820 --> 00:38:41.911
distance, which is 0.01.
00:38:41.911 --> 00:38:42.410
All right.
00:38:42.410 --> 00:38:46.110
So that means that this
is going to be larger
00:38:46.110 --> 00:39:00.940
than the max over A of the absolute value of
P theta of A minus P theta prime of A,
00:39:00.940 --> 00:39:04.550
which means that for any A--
00:39:04.550 --> 00:39:06.990
actually, let me write P
theta hat and P theta star,
00:39:06.990 --> 00:39:10.611
like we said, theta
hat and theta star.
00:39:10.611 --> 00:39:12.860
And so if I have a bound,
say, on the total variation,
00:39:12.860 --> 00:39:19.270
which is 0.01, that
means that P theta hat--
00:39:19.270 --> 00:39:23.950
every time I compute a
probability on P theta hat,
00:39:23.950 --> 00:39:29.295
it's basically in the
interval P theta star of A,
00:39:29.295 --> 00:39:34.790
the one that I really wanted
to compute, plus or minus 0.01.
00:39:34.790 --> 00:39:36.790
This has nothing to do
with confidence interval.
00:39:36.790 --> 00:39:38.165
This is just
telling me how far I
00:39:38.165 --> 00:39:41.280
am from the value I'm
actually trying to compute.
00:39:41.280 --> 00:39:44.750
And that's true for
all A. And that's key.
00:39:44.750 --> 00:39:47.400
That's where this
max comes into play.
00:39:47.400 --> 00:39:49.310
It just says, I want
this bound to hold
00:39:49.310 --> 00:39:50.870
for all possible A's at once.
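Written out, the definition and the 0.01 example read as follows. This is a sketch in standard notation; the symbol TV for the distance is an assumption of this note.

```latex
\[
  \mathrm{TV}\big(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}\big)
  \;=\; \max_{A \subset E} \big|\, \mathbb{P}_{\theta}(A) - \mathbb{P}_{\theta'}(A) \,\big| .
\]
% Consequence of a bound TV(P_{\hat\theta}, P_{\theta^*}) <= 0.01:
\[
  \mathbb{P}_{\theta^*}(A) - 0.01
  \;\le\; \mathbb{P}_{\hat\theta}(A)
  \;\le\; \mathbb{P}_{\theta^*}(A) + 0.01
  \qquad \text{for every event } A .
\]
```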
00:39:55.300 --> 00:39:58.142
So this is actually a
very well-known distance
00:39:58.142 --> 00:39:59.350
between probability measures.
00:39:59.350 --> 00:40:00.766
It's the total
variation distance.
00:40:00.766 --> 00:40:04.880
It's extremely central to
probabilistic analysis.
00:40:04.880 --> 00:40:07.120
And it essentially tells
you that every time--
00:40:07.120 --> 00:40:09.160
if two probability
distributions are close,
00:40:09.160 --> 00:40:11.560
then it means that every
time I compute a probability
00:40:11.560 --> 00:40:15.160
under P theta but
I really actually
00:40:15.160 --> 00:40:17.290
have data from P
theta prime, then
00:40:17.290 --> 00:40:21.710
the error is no larger
than the total variation.
00:40:21.710 --> 00:40:23.470
OK.
00:40:23.470 --> 00:40:29.460
So this is maybe not
the most convenient way
00:40:29.460 --> 00:40:30.870
of finding a distance.
00:40:30.870 --> 00:40:32.130
I mean, how are you going--
00:40:32.130 --> 00:40:34.500
in reality, how are you
going to compute this maximum
00:40:34.500 --> 00:40:35.640
over all possible events?
00:40:35.640 --> 00:40:36.931
I mean, it's just crazy, right?
00:40:36.931 --> 00:40:38.430
There's an infinite
number of them.
00:40:38.430 --> 00:40:41.340
It's much larger than the number
of intervals, for example,
00:40:41.340 --> 00:40:43.050
so it's a bit annoying.
00:40:43.050 --> 00:40:46.800
And so there's actually
a way to compress it
00:40:46.800 --> 00:40:50.834
by just looking at the basically
function distance or vector
00:40:50.834 --> 00:40:53.250
distance between probability
mass functions or probability
00:40:53.250 --> 00:40:55.510
density functions.
00:40:55.510 --> 00:40:58.150
So I'm going to start
with the discrete version
00:40:58.150 --> 00:40:59.280
of the total variation.
00:40:59.280 --> 00:41:03.282
So throughout this
chapter, I will
00:41:03.282 --> 00:41:05.490
make the difference between
discrete random variables
00:41:05.490 --> 00:41:07.530
and continuous random variables.
00:41:07.530 --> 00:41:08.651
It really doesn't matter.
00:41:08.651 --> 00:41:10.650
All it means is that when
I talk about discrete,
00:41:10.650 --> 00:41:12.606
I will talk about
probability mass functions.
00:41:12.606 --> 00:41:13.980
And when I talk
about continuous,
00:41:13.980 --> 00:41:16.600
I will talk about probability
density functions.
00:41:16.600 --> 00:41:20.030
When I talk about
probability mass functions,
00:41:20.030 --> 00:41:21.510
I talk about sums.
00:41:21.510 --> 00:41:24.900
When I talk about probability
density functions,
00:41:24.900 --> 00:41:26.730
I talk about integrals.
00:41:26.730 --> 00:41:30.090
But they're all the
same thing, really.
00:41:30.090 --> 00:41:32.475
So let's start with the
probability mass function.
00:41:32.475 --> 00:41:34.350
Everybody remembers what
the probability mass
00:41:34.350 --> 00:41:37.980
function of a discrete
random variable is.
00:41:37.980 --> 00:41:42.180
This is the function that tells
me for each possible value
00:41:42.180 --> 00:41:43.720
that it can take,
the probability
00:41:43.720 --> 00:41:46.410
that it takes this value.
00:41:46.410 --> 00:41:53.200
So the Probability
Mass Function, PMF,
00:41:53.200 --> 00:41:57.310
is just the function for
all x in the sample space
00:41:57.310 --> 00:42:01.420
tells me the probability
that my random variable is
00:42:01.420 --> 00:42:03.970
equal to this little value.
00:42:03.970 --> 00:42:09.091
And I will denote it
by P sub theta of X.
00:42:09.091 --> 00:42:10.840
So what I want is, of
course, that the sum
00:42:10.840 --> 00:42:12.250
of the probabilities is 1.
00:42:17.620 --> 00:42:20.460
And I want them to
be non-negative.
00:42:20.460 --> 00:42:23.110
Actually, typically we will
assume that they are positive.
00:42:23.110 --> 00:42:27.410
Otherwise, we can just remove
this x from the sample space.
00:42:27.410 --> 00:42:31.700
And so then I have the total
variation distance, I mean,
00:42:31.700 --> 00:42:35.470
it's supposed to be the
maximum over all sets of--
00:42:35.470 --> 00:42:39.850
of subsets of E, such
that the probability
00:42:39.850 --> 00:42:43.130
of A minus probability
of theta prime of A--
00:42:43.130 --> 00:42:44.630
it's complicated,
but really there's
00:42:44.630 --> 00:42:46.130
this beautiful
formula that tells me
00:42:46.130 --> 00:42:50.410
that if I look at the total
variation between P theta
00:42:50.410 --> 00:42:54.520
and P theta prime, it's
actually equal to just 1/2
00:42:54.520 --> 00:43:04.402
of the sum for all X in E of the
absolute difference between P
00:43:04.402 --> 00:43:12.151
theta of x and P theta prime of x.
00:43:12.151 --> 00:43:13.650
So that's something
you can compute.
00:43:13.650 --> 00:43:16.110
If I give you two
probability mass functions,
00:43:16.110 --> 00:43:19.660
you can compute
this immediately.
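On a small sample space, the half-sum formula can be checked against the original definition by brute force over every event. The sketch below is illustrative only: the two PMFs are made-up examples, and the function names are hypothetical.

```python
from itertools import combinations

def tv_formula(p, q):
    """Half the sum over x of |p(x) - q(x)|, with PMFs given as dicts value -> probability."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def tv_bruteforce(p, q):
    """Max over all events A of |P(A) - Q(A)|; only feasible for tiny sample spaces."""
    support = sorted(set(p) | set(q))
    best = 0.0
    for r in range(len(support) + 1):
        for event in combinations(support, r):
            p_a = sum(p.get(x, 0.0) for x in event)
            q_a = sum(q.get(x, 0.0) for x in event)
            best = max(best, abs(p_a - q_a))
    return best

# Hypothetical PMFs on the sample space {0, 1, 2}.
p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}
print(tv_formula(p, q), tv_bruteforce(p, q))
```

The brute-force version visits 2^|E| events, while the formula needs one pass over the sample space, which is why the formula is the one you can actually work with.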
00:43:19.660 --> 00:43:24.020
But if I give you
just the densities
00:43:24.020 --> 00:43:26.460
and the original distribution,
the original definition
00:43:26.460 --> 00:43:28.200
where you have to max
over all possible events,
00:43:28.200 --> 00:43:29.575
it's not clear
you're going to be
00:43:29.575 --> 00:43:31.140
able to do that very quickly.
00:43:31.140 --> 00:43:35.335
So this is really the
one you can work with.
00:43:35.335 --> 00:43:36.960
But the other one is
really telling you
00:43:36.960 --> 00:43:37.830
what it is doing for you.
00:43:37.830 --> 00:43:39.829
It's controlling the
difference of probabilities
00:43:39.829 --> 00:43:41.077
you can compute on any event.
00:43:41.077 --> 00:43:42.660
But here, it's just
telling you, well,
00:43:42.660 --> 00:43:46.370
if you do it for each
simple event, it's little x.
00:43:46.370 --> 00:43:49.080
It's actually simple.
00:43:49.080 --> 00:43:53.150
Now, if we have continuous
random variables-- so
00:43:53.150 --> 00:43:56.060
by the way, I didn't mention,
but discrete means Bernoulli,
00:43:56.060 --> 00:43:59.420
binomial, but not only those
that have finite support,
00:43:59.420 --> 00:44:02.260
like Bernoulli has
support of size 2,
00:44:02.260 --> 00:44:05.760
binomial(n, p) has
support of size n--
00:44:05.760 --> 00:44:08.570
there's n possible values it
can take-- but also Poisson.
00:44:08.570 --> 00:44:10.570
Poisson distribution can
take an infinite number
00:44:10.570 --> 00:44:13.510
of values, all the
positive integers,
00:44:13.510 --> 00:44:16.100
non-negative integers.
00:44:16.100 --> 00:44:18.000
And so now we have also
the continuous ones,
00:44:18.000 --> 00:44:19.384
such as Gaussian, exponential.
00:44:19.384 --> 00:44:21.300
And what characterizes
those guys is that they
00:44:21.300 --> 00:44:24.230
have a probability density.
00:44:24.230 --> 00:44:26.630
So the density,
remember the way I
00:44:26.630 --> 00:44:28.820
use my density is
when I want to compute
00:44:28.820 --> 00:44:31.910
the probability of
belonging to some event A.
00:44:31.910 --> 00:44:37.010
The probability of X falling into
some subset of the real line A
00:44:37.010 --> 00:44:40.280
is simply the integral of
the density on this set.
00:44:40.280 --> 00:44:43.940
That's the famous area
under the curve thing.
00:44:43.940 --> 00:44:49.196
So since for each possible
value, the probability at X--
00:44:49.196 --> 00:44:51.350
so I hope you
remember that stuff.
00:44:51.350 --> 00:44:57.890
That's just probably
something that you
00:44:57.890 --> 00:44:59.210
must remember from probability.
00:44:59.210 --> 00:45:02.120
But essentially, we know that
the probability that X is equal
00:45:02.120 --> 00:45:04.820
to little x is 0 for a
continuous random variable,
00:45:04.820 --> 00:45:06.830
for all possible X's.
00:45:06.830 --> 00:45:09.030
There's just none of them
that actually gets weight.
00:45:09.030 --> 00:45:11.321
So what we have to do is to
describe the fact that it's
00:45:11.321 --> 00:45:12.980
in some little region.
00:45:12.980 --> 00:45:18.830
So the probability that it's in
some interval, say, a, b, this
00:45:18.830 --> 00:45:25.550
is the integral between a
and b of f theta of X, dx.
00:45:25.550 --> 00:45:28.379
So I have this density,
such as the Gaussian one.
00:45:28.379 --> 00:45:30.545
And the probability that I
belong to the interval a,
00:45:30.545 --> 00:45:36.920
b is just the area under
the curve between a and b.
00:45:36.920 --> 00:45:43.880
If you don't remember that,
please take immediate remedy.
00:45:43.880 --> 00:45:48.920
So this function f, just
like P, is non-negative.
00:45:48.920 --> 00:45:51.890
And rather than summing
to 1, it integrates to 1
00:45:51.890 --> 00:45:55.119
when I integrate it over
the entire sample space E.
00:45:55.119 --> 00:45:56.660
And now the total
variation, well, it
00:45:56.660 --> 00:45:58.130
takes basically the same form.
00:45:58.130 --> 00:46:00.230
I said that you
essentially replace sums
00:46:00.230 --> 00:46:03.264
by integrals when you're
dealing with densities.
00:46:03.264 --> 00:46:05.180
And here, it's just
saying, rather than having
00:46:05.180 --> 00:46:07.220
1/2 of the sum of
the absolute values,
00:46:07.220 --> 00:46:09.860
you have 1/2 of the integral
of the absolute value
00:46:09.860 --> 00:46:11.530
of the difference.
00:46:11.530 --> 00:46:15.310
Again, if I give
you two densities
00:46:15.310 --> 00:46:18.280
and if you're not too bad at
calculus, which you will often
00:46:18.280 --> 00:46:21.490
be, because there's lots of them
you can actually not compute.
00:46:21.490 --> 00:46:24.400
But if I gave you, for example,
two Gaussian densities,
00:46:24.400 --> 00:46:27.330
exponential minus x squared,
blah, blah, blah, and I say,
00:46:27.330 --> 00:46:29.080
just compute the total
variation distance,
00:46:29.080 --> 00:46:30.957
you could actually
write it as an integral.
00:46:30.957 --> 00:46:33.040
Now, whether you can
actually reduce this integral
00:46:33.040 --> 00:46:35.470
to some particular
number is another story.
00:46:35.470 --> 00:46:38.860
But you could technically do it.
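A hedged sketch of "writing it as an integral" for two Gaussian densities, using a plain midpoint rule instead of a computer algebra system. The helper names, the integration window [-20, 20], and the step count are assumptions. As a sanity check only, it also assumes both Gaussians share the same variance: in that case the set where one density exceeds the other is a half-line, and the integral reduces to an expression in the error function.

```python
import math


def gaussian_pdf(x, mean, sigma=1.0):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def tv_gaussians(theta, theta_prime, sigma=1.0, lo=-20.0, hi=20.0, steps=40_000):
    """Approximate 1/2 * integral of |f_theta - f_theta_prime| with a midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += abs(gaussian_pdf(x, theta, sigma) - gaussian_pdf(x, theta_prime, sigma))
    return 0.5 * total * h


def tv_gaussians_equal_variance(theta, theta_prime, sigma=1.0):
    """Equal-variance check: 2 * Phi(|d| / (2 sigma)) - 1, written with erf."""
    return math.erf(abs(theta - theta_prime) / (2.0 * math.sqrt(2.0) * sigma))


tv_numeric = tv_gaussians(0.0, 1.0)
tv_check = tv_gaussians_equal_variance(0.0, 1.0)
print(tv_numeric, tv_check)
```

The two printed numbers should agree to several decimal places. With unequal variances the erf shortcut no longer applies, but the numeric integral still does.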
00:46:38.860 --> 00:46:41.695
So now, you have actually
a handle on this thing
00:46:41.695 --> 00:46:43.660
and you could technically
ask Mathematica,
00:46:43.660 --> 00:46:45.280
whereas asking
Mathematica to take
00:46:45.280 --> 00:46:48.280
the max over all possible
events is going to be difficult.
00:46:48.280 --> 00:46:48.780
All right.
00:46:48.780 --> 00:46:55.240
So the total variation
has some properties.
00:46:55.240 --> 00:46:59.560
So let's keep on the
board the definition that
00:46:59.560 --> 00:47:05.410
involves, say, the densities.
00:47:05.410 --> 00:47:06.910
So think Gaussian in your mind.
00:47:06.910 --> 00:47:09.530
And you have two Gaussians,
one with mean theta
00:47:09.530 --> 00:47:10.810
and one with mean theta prime.
00:47:10.810 --> 00:47:13.143
And I'm looking at the total
variation between those two
00:47:13.143 --> 00:47:14.560
guys.
00:47:14.560 --> 00:47:20.030
So if I look at P theta minus--
00:47:20.030 --> 00:47:20.530
sorry.
00:47:20.530 --> 00:47:25.800
TV between P theta and
P theta prime, this
00:47:25.800 --> 00:47:31.110
is equal to 1/2 of the integral
between f theta, f theta prime.
00:47:31.110 --> 00:47:32.490
And when I don't write it--
00:47:32.490 --> 00:47:34.800
so I don't write the
X, dx but it's there.
00:47:34.800 --> 00:47:38.432
And then I integrate over E.
00:47:38.432 --> 00:47:39.890
So what is this
thing doing for me?
00:47:39.890 --> 00:47:41.480
It's just saying,
well, if I have-- so
00:47:41.480 --> 00:47:42.438
think of two Gaussians.
00:47:42.438 --> 00:47:44.940
For example, I have one that's
here and one that's here.
00:47:47.610 --> 00:47:51.670
So this is let's say f
theta, f theta prime.
00:47:51.670 --> 00:47:52.750
This guy is doing what?
00:47:52.750 --> 00:47:55.980
It's computing the absolute
value of the difference
00:47:55.980 --> 00:47:57.910
between f theta and f theta prime.
00:47:57.910 --> 00:48:01.980
You can check for yourself
that graphically, this I
00:48:01.980 --> 00:48:05.931
can represent as an area
not under the curve,
00:48:05.931 --> 00:48:10.300
but between the curves.
00:48:10.300 --> 00:48:11.760
So this is this guy.
00:48:16.370 --> 00:48:20.040
Now, this guy is really the
integral of the absolute value.
00:48:20.040 --> 00:48:22.570
So this thing here,
this area, this
00:48:22.570 --> 00:48:25.224
is 2 times the total variation.
00:48:28.240 --> 00:48:29.980
The scaling 1/2
really doesn't matter.
00:48:29.980 --> 00:48:32.790
It's just if I want to have
an actual correspondence
00:48:32.790 --> 00:48:36.350
between the maximum and the
other guy, I have to do this.
00:48:39.630 --> 00:48:41.290
So this is what it looks like.
00:48:41.290 --> 00:48:42.910
So we have this definition.
00:48:42.910 --> 00:48:48.279
And so we have a couple of
properties that come into this.
00:48:48.279 --> 00:48:49.820
The first one is
that it's symmetric.
00:48:49.820 --> 00:48:51.860
TV of P theta and
P theta prime is
00:48:51.860 --> 00:48:55.970
the same as the TV between
P theta prime and P theta.
00:48:55.970 --> 00:48:59.710
Well, that's pretty obvious
from this definition.
00:48:59.710 --> 00:49:02.090
I just flip those two,
I get the same number.
00:49:02.090 --> 00:49:05.297
It's actually also true
if I take the maximum.
00:49:05.297 --> 00:49:07.630
Those things are completely
symmetric in theta and theta
00:49:07.630 --> 00:49:08.350
prime.
00:49:08.350 --> 00:49:10.620
You can just flip them.
00:49:10.620 --> 00:49:11.830
It's non-negative.
00:49:11.830 --> 00:49:15.640
Is that clear to everyone that
this thing is non-negative?
00:49:15.640 --> 00:49:20.530
I integrate an absolute
value, so this thing
00:49:20.530 --> 00:49:22.640
is going to give me some
non-negative number.
00:49:22.640 --> 00:49:24.598
And so if I integrate
this non-negative number,
00:49:24.598 --> 00:49:26.670
it's going to be a
non-negative number.
00:49:26.670 --> 00:49:29.230
The fact also that
it's an area tells me
00:49:29.230 --> 00:49:32.680
that it's going to
be non-negative.
00:49:32.680 --> 00:49:36.900
The nice thing is that if
TV is equal to zero, then
00:49:36.900 --> 00:49:42.490
the two distributions, the two
probabilities are the same.
00:49:42.490 --> 00:49:46.540
That means that for
every A, P theta of A
00:49:46.540 --> 00:49:49.050
is equal to P theta
prime of A. Now,
00:49:49.050 --> 00:49:50.860
there's two ways to see that.
00:49:50.860 --> 00:49:53.140
The first one is to say
that if this integral is
00:49:53.140 --> 00:49:56.650
equal to 0, that means
that for almost all X,
00:49:56.650 --> 00:49:58.240
f theta is equal
to f theta prime.
00:49:58.240 --> 00:50:01.390
The only way I can integrate
a non-negative and get 0
00:50:01.390 --> 00:50:05.390
is that it's 0 pretty
much everywhere.
00:50:05.390 --> 00:50:07.550
And so what it means is
that the two densities
00:50:07.550 --> 00:50:09.530
have to be the same
pretty much everywhere,
00:50:09.530 --> 00:50:11.546
which means that the
distributions are the same.
00:50:11.546 --> 00:50:13.670
But this is not really the
way you want to do this,
00:50:13.670 --> 00:50:15.128
because you have
to understand what
00:50:15.128 --> 00:50:16.850
pretty much everywhere means--
00:50:16.850 --> 00:50:18.760
which I should really
say almost everywhere.
00:50:18.760 --> 00:50:20.570
That's the formal
way of saying it.
00:50:20.570 --> 00:50:22.280
But let's go to
this definition--
00:50:24.830 --> 00:50:26.160
which is gone.
00:50:26.160 --> 00:50:26.660
Yeah.
00:50:26.660 --> 00:50:28.670
That's the one here.
00:50:28.670 --> 00:50:35.230
The max of those two guys, if
this maximum is equal to 0--
00:50:35.230 --> 00:50:39.220
I have a maximum of non-negative
numbers, their absolute values.
00:50:39.220 --> 00:50:42.090
Their maximum is
equal to 0, well,
00:50:42.090 --> 00:50:44.490
they better be all equal
to 0, because if one is not
00:50:44.490 --> 00:50:47.470
equal to 0, then the
maximum is not equal to 0.
00:50:47.470 --> 00:50:50.170
So those two guys,
for those two things
00:50:50.170 --> 00:50:52.180
to be-- for the maximum
to be equal to 0,
00:50:52.180 --> 00:50:54.220
then each of the
individual absolute values
00:50:54.220 --> 00:50:57.430
have to be equal to 0, which
means that the probability here
00:50:57.430 --> 00:51:03.730
is equal to this probability
here for every event A.
00:51:03.730 --> 00:51:04.960
So those two things--
00:51:04.960 --> 00:51:06.070
this is nice, right?
00:51:06.070 --> 00:51:08.410
That's called definiteness.
00:51:08.410 --> 00:51:10.900
The total variation equal
to 0 implies that P theta
00:51:10.900 --> 00:51:12.210
is equal to P theta prime.
00:51:12.210 --> 00:51:14.350
So that's really some
notion of distance, right?
00:51:14.350 --> 00:51:16.060
That's what we want.
00:51:16.060 --> 00:51:17.980
If this thing
being small implied
00:51:17.980 --> 00:51:20.350
that P theta could be all
over the place compared
00:51:20.350 --> 00:51:24.270
to P theta prime, that
would not help very much.
00:51:24.270 --> 00:51:26.580
Now, there's also the
triangle inequality
00:51:26.580 --> 00:51:28.710
that follows immediately
from the triangle
00:51:28.710 --> 00:51:32.730
inequality inside this guy.
00:51:32.730 --> 00:51:35.654
If I squeeze in some f
theta prime prime in there,
00:51:35.654 --> 00:51:37.320
I'm going to use the
triangle inequality
00:51:37.320 --> 00:51:39.486
and get the triangle
inequality for the whole thing.
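The properties listed above (symmetry, non-negativity, value 0 on identical distributions, triangle inequality) can be spot-checked numerically. This is only a sketch on randomly drawn PMFs with five points each; the helper `random_pmf` and the seed are assumptions, and a random check illustrates the properties rather than proving them.

```python
import random


def tv_distance(p, q):
    """Total variation between two PMFs given as equal-length lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def random_pmf(k, rng):
    """Draw a hypothetical PMF on k points by normalizing positive weights."""
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
violations = 0
for _ in range(1000):
    p, q, r = (random_pmf(5, rng) for _ in range(3))
    symmetric = abs(tv_distance(p, q) - tv_distance(q, p)) < 1e-12
    nonnegative = tv_distance(p, q) >= 0.0
    # Only the easy direction: identical PMFs give TV = 0.
    zero_on_equal = tv_distance(p, p) == 0.0
    # Small tolerance absorbs floating-point rounding.
    triangle = tv_distance(p, r) <= tv_distance(p, q) + tv_distance(q, r) + 1e-12
    if not (symmetric and nonnegative and zero_on_equal and triangle):
        violations += 1
print(violations)
```

The converse direction of definiteness (TV equal to 0 forces the two distributions to coincide) is the argument made in the lecture and is not something sampling can establish.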
00:51:42.392 --> 00:51:42.892
Yeah?
00:51:42.892 --> 00:51:45.287
AUDIENCE: The fact that
you need two definitions
00:51:45.287 --> 00:51:48.640
of the [INAUDIBLE],
is it something
00:51:48.640 --> 00:51:50.090
obvious or is it complete?
00:51:50.090 --> 00:51:52.930
PHILIPPE RIGOLLET:
I'll do it for you now.
00:51:52.930 --> 00:51:56.530
So let's just prove that
those two things are actually
00:51:56.530 --> 00:51:58.756
giving me the same definition.
00:52:00.956 --> 00:52:02.830
So what I'm going to do
is I'm actually going
00:52:02.830 --> 00:52:04.420
to start with the second one.
00:52:04.420 --> 00:52:05.420
And I'm going to write--
00:52:05.420 --> 00:52:07.253
I'm going to start with
the density version.
00:52:07.253 --> 00:52:10.300
But as an exercise, you can
do it for the PMF version
00:52:10.300 --> 00:52:11.347
if you prefer.
00:52:11.347 --> 00:52:13.180
So I'm going to start
with the fact that f--
00:52:20.240 --> 00:52:23.810
so I'm going to write f and g so I don't have to write the indices.
I don't have to write f and g.
00:52:23.810 --> 00:52:27.490
So think of this as being f sub
theta, and think of this guy
00:52:27.490 --> 00:52:29.180
as being f sub theta prime.
00:52:29.180 --> 00:52:32.240
I just don't want to have to
write indices all the time.
00:52:32.240 --> 00:52:34.970
So I'm going to start with
this thing, the integral of f
00:52:34.970 --> 00:52:38.870
of X minus g of X dx.
00:52:38.870 --> 00:52:41.910
The first thing I'm going to do
is this is an absolute value,
00:52:41.910 --> 00:52:45.170
so either the number in the
absolute value is positive
00:52:45.170 --> 00:52:47.390
and I actually kept it
like that, or it's negative
00:52:47.390 --> 00:52:48.760
and I flipped its sign.
00:52:48.760 --> 00:52:51.600
So let's just split
between those two cases.
00:52:51.600 --> 00:52:55.460
So this thing is equal
to 1/2 the integral of--
00:52:55.460 --> 00:53:00.350
so let me actually
write the set A star as
00:53:00.350 --> 00:53:09.240
being the set of X's such that
f of X is larger than g of X.
00:53:09.240 --> 00:53:11.340
So that's the set on
which the difference is
00:53:11.340 --> 00:53:13.060
going to be positive
or the difference is
00:53:13.060 --> 00:53:14.370
going to be negative.
00:53:14.370 --> 00:53:17.082
So this, again,
is equivalent to f
00:53:17.082 --> 00:53:23.280
of X minus g of X is positive.
00:53:23.280 --> 00:53:23.780
OK.
00:53:23.780 --> 00:53:24.488
Everybody agrees?
00:53:24.488 --> 00:53:26.330
So this is the set
I'm interested in.
00:53:29.040 --> 00:53:31.830
So now I'm going to split
my integral into two parts,
00:53:31.830 --> 00:53:38.250
in A, A star, so on A
star, f is larger than g,
00:53:38.250 --> 00:53:40.666
so the absolute value is
just the difference itself.
00:53:45.150 --> 00:53:48.980
So here I put parenthesis
rather than absolute value.
00:53:48.980 --> 00:53:54.330
And then I have plus 1/2 of
the integral on the complement.
00:53:54.330 --> 00:53:57.940
What are you guys used to to
write the complement, to the C
00:53:57.940 --> 00:54:01.005
or the bar?
00:54:01.005 --> 00:54:01.991
To the C?
00:54:05.450 --> 00:54:08.320
And so here on the complement,
then f is less than g,
00:54:08.320 --> 00:54:17.810
so this is actually really
g of X minus f of X, dx.
00:54:17.810 --> 00:54:19.550
Everybody's with me here?
00:54:19.550 --> 00:54:20.900
So I just said--
00:54:20.900 --> 00:54:23.390
I mean, those are just
rewriting what the definition
00:54:23.390 --> 00:54:24.560
of the absolute value is.
00:54:33.290 --> 00:54:33.830
OK.
00:54:33.830 --> 00:54:38.120
So now there's nice things
that I know about f and g.
00:54:38.120 --> 00:54:40.880
And the two nice things is that
the integral of f is equal to 1
00:54:40.880 --> 00:54:42.790
and the integral
of g is equal to 1.
00:54:46.270 --> 00:54:53.614
This implies that the integral
of f minus g is equal to what?
00:54:53.614 --> 00:54:54.526
AUDIENCE: 0.
00:54:54.526 --> 00:54:56.400
PHILIPPE RIGOLLET: 0.
00:54:56.400 --> 00:54:59.060
And so now that
means that if I want
00:54:59.060 --> 00:55:04.130
to just go from the integral
here on A complement
00:55:04.130 --> 00:55:05.690
to the integral on A--
00:55:05.690 --> 00:55:08.780
or on A star, complement
to the integral of A star,
00:55:08.780 --> 00:55:11.700
I just have to flip the sign.
00:55:11.700 --> 00:55:14.920
So that implies that
an integral on A star
00:55:14.920 --> 00:55:21.198
complement of g
of X minus f of X,
00:55:21.198 --> 00:55:25.830
dx, this is simply equal
to the integral on A star
00:55:25.830 --> 00:55:30.250
of f of X minus g of X, dx.
00:55:40.880 --> 00:55:41.780
All right.
00:55:41.780 --> 00:55:46.100
So now this guy becomes
this guy over there.
00:55:46.100 --> 00:55:50.050
So I have 1/2 of this
plus 1/2 of the same guy,
00:55:50.050 --> 00:55:55.720
so that means that 1/2
of the integral of f
00:55:55.720 --> 00:55:57.450
minus g absolute value--
00:55:57.450 --> 00:55:59.810
so that was my
original definition,
00:55:59.810 --> 00:56:03.890
this thing is actually equal
to the integral on A star
00:56:03.890 --> 00:56:10.379
of f of X minus g of X, dx.
00:56:14.160 --> 00:56:21.440
And this is simply
equal to P of A star--
00:56:21.440 --> 00:56:26.160
so say Pf of A star
minus Pg of A star.
00:56:34.160 --> 00:56:36.810
Which one is larger
than the other one?
00:56:41.610 --> 00:56:43.540
AUDIENCE: [INAUDIBLE]
00:56:43.540 --> 00:56:44.600
PHILIPPE RIGOLLET: It is.
00:56:44.600 --> 00:56:45.951
Just look at this board.
00:56:45.951 --> 00:56:47.406
AUDIENCE: [INAUDIBLE]
00:56:47.406 --> 00:56:48.406
PHILIPPE RIGOLLET: What?
00:56:48.406 --> 00:56:49.880
AUDIENCE: [INAUDIBLE]
00:56:49.880 --> 00:56:50.510
PHILIPPE RIGOLLET:
The first one has
00:56:50.510 --> 00:56:51.980
to be larger, because
this thing is actually
00:56:51.980 --> 00:56:53.271
equal to a non-negative number.
00:56:59.590 --> 00:57:01.990
So now I have this absolute
value of two things,
00:57:01.990 --> 00:57:04.150
and so I'm closer to
the actual definition.
00:57:04.150 --> 00:57:06.910
But I still need to show
you that this thing is
00:57:06.910 --> 00:57:09.010
the maximum value.
00:57:09.010 --> 00:57:17.710
So this is definitely at
most the maximum over A of Pf
00:57:17.710 --> 00:57:21.670
of A minus Pg of A.
00:57:21.670 --> 00:57:24.290
That's certainly true.
00:57:24.290 --> 00:57:24.830
Right?
00:57:24.830 --> 00:57:27.850
We agree with this?
00:57:27.850 --> 00:57:30.620
Because this is just
for one specific A,
00:57:30.620 --> 00:57:34.930
and I'm bounding it by the
maximum over all possible A.
00:57:34.930 --> 00:57:36.932
So that's clearly true.
00:57:36.932 --> 00:57:38.640
So now I have to go
the other way around.
00:57:38.640 --> 00:57:44.370
I have to show you that the max
is actually this guy, A star.
00:57:44.370 --> 00:57:45.640
So why would that be true?
00:57:45.640 --> 00:57:49.180
Well, let's just inspect
this thing over there.
00:57:49.180 --> 00:57:50.730
So we want to show
that if I take
00:57:50.730 --> 00:57:53.490
any other A in this integral
than this guy A star,
00:57:53.490 --> 00:57:56.580
it's actually got to
decrease its value.
00:57:56.580 --> 00:57:57.720
So we have this function.
00:57:57.720 --> 00:57:59.303
I'm going to call
this function delta.
00:58:02.314 --> 00:58:03.730
And what we have
is-- so let's say
00:58:03.730 --> 00:58:04.920
this function looks like this.
00:58:04.920 --> 00:58:06.836
Now it's the difference
between two densities.
00:58:06.836 --> 00:58:09.500
It doesn't have to
integrate-- it doesn't
00:58:09.500 --> 00:58:10.500
have to be non-negative.
00:58:10.500 --> 00:58:12.420
But it certainly has
to integrate to 0.
00:58:15.510 --> 00:58:18.440
And so now I take this thing.
00:58:18.440 --> 00:58:22.126
And the A star, what
is the set A star here?
00:58:22.126 --> 00:58:25.640
The set A star is the set
over which the function
00:58:25.640 --> 00:58:27.645
delta is non-negative.
00:58:36.340 --> 00:58:37.590
So that's just the definition.
00:58:37.590 --> 00:58:41.660
A star was the set over
which f minus g was positive,
00:58:41.660 --> 00:58:44.430
and f minus g was
just called delta.
00:58:44.430 --> 00:58:47.720
So what it means is that
what I'm really integrating
00:58:47.720 --> 00:58:50.810
is delta on this set.
00:58:50.810 --> 00:58:53.570
So it's this area
under the curve,
00:58:53.570 --> 00:58:55.230
just on the positive things.
00:58:55.230 --> 00:58:57.830
Agreed?
00:58:57.830 --> 00:59:03.290
So now let's just make some
tiny variations around this guy.
00:59:03.290 --> 00:59:08.150
If I take A to be
larger than A star--
00:59:08.150 --> 00:59:10.280
so let me add, for
example, this part here.
00:59:12.920 --> 00:59:15.680
That means that when
I compute my integral,
00:59:15.680 --> 00:59:18.067
I'm removing this
area under the curve.
00:59:18.067 --> 00:59:18.650
It's negative.
00:59:18.650 --> 00:59:20.520
The integral here is negative.
00:59:20.520 --> 00:59:25.160
So if I start adding something
to A, the value goes lower.
00:59:25.160 --> 00:59:29.060
If I start removing something
from A, like say this guy,
00:59:29.060 --> 00:59:32.450
I'm actually removing this
value from the integral.
00:59:32.450 --> 00:59:33.320
So there's no way.
00:59:33.320 --> 00:59:34.370
I'm actually stuck.
00:59:34.370 --> 00:59:37.100
This A star is the one
that actually maximizes
00:59:37.100 --> 00:59:39.830
the integral of this function.
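The argument above can be checked by brute force in the discrete case, where the integral becomes a sum. In this sketch the four-point sample space and the PMFs `f` and `g` are hypothetical; the code compares the value at A star, the maximum over all events, and the 1/2-sum formula.

```python
from itertools import combinations


def prob(pmf, event):
    """Probability of an event (a collection of sample points) under a PMF."""
    return sum(pmf[x] for x in event)


# Hypothetical PMFs f and g on a four-point sample space E.
E = ["a", "b", "c", "d"]
f = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
g = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}

# A star is the set where delta = f - g is positive.
a_star = [x for x in E if f[x] - g[x] > 0]
value_at_a_star = prob(f, a_star) - prob(g, a_star)

# Brute force: maximize |P_f(A) - P_g(A)| over all 2^|E| events A.
best = max(
    abs(prob(f, A) - prob(g, A))
    for size in range(len(E) + 1)
    for A in combinations(E, size)
)

half_l1 = 0.5 * sum(abs(f[x] - g[x]) for x in E)
print(a_star, value_at_a_star, best, half_l1)
```

All three numbers should coincide up to rounding. The brute-force search grows as 2 to the size of E, which is why the sum formula is the one to work with.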
00:59:39.830 --> 00:59:49.470
So we used the fact
that for any function,
00:59:49.470 --> 00:59:59.180
say delta, the integral
over A of delta
00:59:59.180 --> 01:00:02.712
is less than the integral
over the set of X's
01:00:02.712 --> 01:00:07.670
such that delta of X is
non-negative of delta of X, dx.
01:00:10.280 --> 01:00:13.518
And that's an obvious
fact, just by picture, say.
01:00:18.498 --> 01:00:24.972
And that's true for all A. Yeah?
01:00:24.972 --> 01:00:28.956
AUDIENCE: [INAUDIBLE]
could you use
01:00:28.956 --> 01:00:33.106
like a portion under the
axis as like less than
01:00:33.106 --> 01:00:34.845
or equal to the
portion above the axis?
01:00:34.845 --> 01:00:36.470
PHILIPPE RIGOLLET:
It's actually equal.
01:00:36.470 --> 01:00:39.005
We know that the
integral of f minus g--
01:00:39.005 --> 01:00:41.580
the integral of delta is 0.
01:00:41.580 --> 01:00:47.344
So there's actually exactly
the same area above and below.
01:00:47.344 --> 01:00:49.880
But yeah, you're right.
01:00:49.880 --> 01:00:51.349
You could go to
the extreme cases.
01:00:51.349 --> 01:00:51.890
You're right.
01:00:57.470 --> 01:00:57.970
No.
01:00:57.970 --> 01:01:00.490
It would actually still be
true, even if there was--
01:01:00.490 --> 01:01:02.720
if this was a constant,
that would still be true.
01:01:02.720 --> 01:01:05.500
Here, I never use the fact that
the integral is equal to 0.
01:01:11.380 --> 01:01:15.560
I could shift this function by
1 so that the integral of delta
01:01:15.560 --> 01:01:18.230
is equal to 1,
and it would still
01:01:18.230 --> 01:01:21.000
be true that it's maximized
when I take A to be
01:01:21.000 --> 01:01:24.892
the set where it's positive.
01:01:24.892 --> 01:01:27.350
Just need to make sure that
there is someplace where it is,
01:01:27.350 --> 01:01:28.390
but that's about it.
01:01:33.390 --> 01:01:36.981
Of course, we used this before,
when we made this thing.
01:01:36.981 --> 01:01:38.730
But just the last
argument, this last fact
01:01:38.730 --> 01:01:39.646
does not require that.
01:01:43.820 --> 01:01:44.320
All right.
01:01:44.320 --> 01:01:47.030
So now we have this notion of--
01:01:47.030 --> 01:01:48.358
I need the--
01:01:52.531 --> 01:01:53.030
OK.
01:01:53.030 --> 01:01:57.450
So we have this
notion of distance
01:01:57.450 --> 01:01:58.830
between probability measures.
01:01:58.830 --> 01:02:00.940
I mean, these things
are exactly what--
01:02:00.940 --> 01:02:03.780
if I were to be in a formal
math class and I said,
01:02:03.780 --> 01:02:06.060
here are the axioms that
a distance should satisfy,
01:02:06.060 --> 01:02:08.640
those are exactly those things.
01:02:08.640 --> 01:02:10.150
If it's not
satisfying this thing,
01:02:10.150 --> 01:02:13.800
it's called pseudo-distance or
quasi-distance or just metric
01:02:13.800 --> 01:02:15.770
or nothing at all, honestly.
01:02:15.770 --> 01:02:16.590
So it's a distance.
01:02:16.590 --> 01:02:18.930
It's symmetric,
non-negative, equal to 0,
01:02:18.930 --> 01:02:21.720
if and only if the two
arguments are equal, then
01:02:21.720 --> 01:02:25.870
it satisfies the
triangle inequality.
01:02:25.870 --> 01:02:28.860
And so that means that we have
this actual total variation
01:02:28.860 --> 01:02:31.140
distance between
probability distributions.
01:02:31.140 --> 01:02:36.510
And here is now a statistical
strategy to implement our goal.
01:02:36.510 --> 01:02:38.190
Remember, our goal
was to spit out
01:02:38.190 --> 01:02:41.940
a theta hat, which was
close such that P theta
01:02:41.940 --> 01:02:45.700
hat was close to P theta star.
01:02:45.700 --> 01:02:48.940
So hopefully, we were trying
to minimize the total variation
01:02:48.940 --> 01:02:51.580
distance between P theta
hat and P theta star.
01:02:51.580 --> 01:02:55.090
Now, we cannot do that, because
just by this fact, this slide,
01:02:55.090 --> 01:02:57.340
if we wanted to do that
directly, we would just take--
01:02:57.340 --> 01:02:59.830
well, let's take theta hat
equals theta star and that will
01:02:59.830 --> 01:03:00.880
give me the value 0.
01:03:00.880 --> 01:03:03.196
And that's the minimum
possible value we can take.
01:03:03.196 --> 01:03:04.570
The problem is
that we don't know
01:03:04.570 --> 01:03:07.342
what the total variation is to
something that we don't know.
01:03:07.342 --> 01:03:09.550
We know how to compute total
variations if I give you
01:03:09.550 --> 01:03:10.660
the two arguments.
01:03:10.660 --> 01:03:12.560
But here, one of the
arguments is not known.
01:03:12.560 --> 01:03:16.370
P theta star is not known to
us, so we need to estimate it.
01:03:16.370 --> 01:03:18.910
And so here is the strategy.
01:03:18.910 --> 01:03:21.760
Just build an estimator
of the total variation
01:03:21.760 --> 01:03:24.580
distance between P
theta and P theta star
01:03:24.580 --> 01:03:27.250
for all candidate theta,
all possible theta
01:03:27.250 --> 01:03:30.240
in capital theta.
01:03:30.240 --> 01:03:33.390
Now, if this is a good estimate,
then when I minimize it,
01:03:33.390 --> 01:03:37.230
I should get something
that's close to P theta star.
01:03:37.230 --> 01:03:38.220
So here's the strategy.
01:03:38.220 --> 01:03:40.980
This is my function
that maps theta
01:03:40.980 --> 01:03:44.340
to the total variation between
P theta and P theta star.
01:03:44.340 --> 01:03:47.010
I know it's minimized
at theta star.
01:03:47.010 --> 01:03:51.090
That's definitely TV of P--
and the value here, the y-axis
01:03:51.090 --> 01:03:53.300
should say 0.
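To picture the curve being described, here is a small sketch for a Bernoulli model. It assumes a hypothetical true parameter `theta_star = 0.3`, which is exactly the quantity that is unknown in practice, so this only illustrates the shape of the map from theta to TV(P theta, P theta star), not an estimation procedure.

```python
def tv_bernoulli(theta, theta_star):
    """TV between Bernoulli(theta) and Bernoulli(theta_star): 1/2 * sum over {0, 1}."""
    return 0.5 * (abs(theta - theta_star) + abs((1 - theta) - (1 - theta_star)))


theta_star = 0.3  # hypothetical true parameter; unknown in practice
grid = [i / 100 for i in range(101)]  # candidate values of theta in [0, 1]
curve = [tv_bernoulli(theta, theta_star) for theta in grid]
minimizer = grid[curve.index(min(curve))]
print(minimizer, min(curve))
```

On this grid the curve is the V-shaped function |theta - theta_star|, and its minimum value 0 is attained at theta star.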
01:03:53.300 --> 01:03:54.800
And so I don't know
this guy, so I'm
01:03:54.800 --> 01:03:56.810
going to estimate it
by some estimator that
01:03:56.810 --> 01:03:57.680
comes from my data.
01:03:57.680 --> 01:04:00.590
Hopefully, the more data I have,
the better this estimator is.
01:04:00.590 --> 01:04:03.391
And I'm going to try to
minimize this estimator now.
01:04:03.391 --> 01:04:05.390
And if the two things are
close, then the minima
01:04:05.390 --> 01:04:07.470
should be close.
01:04:07.470 --> 01:04:09.560
That's a pretty good
estimation strategy.
01:04:09.560 --> 01:04:11.370
The problem is that
it's very unclear
01:04:11.370 --> 01:04:13.810
how you would build
this estimator of TV,
01:04:13.810 --> 01:04:18.710
of the Total Variation.
01:04:18.710 --> 01:04:21.410
So building
estimators, as I said,
01:04:21.410 --> 01:04:25.160
typically consists in replacing
expectations by averages.
01:04:25.160 --> 01:04:29.130
But there's no simple way of
expressing the total variation
01:04:29.130 --> 01:04:31.230
distance as an
expectation with respect
01:04:31.230 --> 01:04:33.840
to theta star of anything.
01:04:33.840 --> 01:04:36.060
So what we're going
to do is we're
01:04:36.060 --> 01:04:38.190
going to move from
total variation distance
01:04:38.190 --> 01:04:41.040
to another notion of
distance that sort of has
01:04:41.040 --> 01:04:43.020
the same properties
and the same feeling
01:04:43.020 --> 01:04:47.040
and the same motivations as
the total variation distance.
01:04:47.040 --> 01:04:49.650
But for this guy, we
will be able to build
01:04:49.650 --> 01:04:51.420
an estimate for it,
because it's actually
01:04:51.420 --> 01:04:53.929
going to be of the form
expectation of something.
01:04:53.929 --> 01:04:55.470
And we're going to
be able to replace
01:04:55.470 --> 01:05:00.280
the expectation by an average
and then minimize this average.
01:05:00.280 --> 01:05:04.290
So this surrogate for
total variation distance
01:05:04.290 --> 01:05:07.510
is actually called the
Kullback-Leibler divergence.
01:05:07.510 --> 01:05:09.760
And why we call it divergence
is because it's actually
01:05:09.760 --> 01:05:11.740
not a distance.
01:05:11.740 --> 01:05:14.760
It's not going to be
symmetric to start with.
01:05:14.760 --> 01:05:17.400
So this Kullback-Leibler
or even KL divergence--
01:05:17.400 --> 01:05:20.790
I will just refer to it as KL--
01:05:20.790 --> 01:05:22.860
is actually just
more convenient.
01:05:22.860 --> 01:05:27.480
But it has some roots coming
from information theory, which
01:05:27.480 --> 01:05:29.170
I will not delve into.
01:05:29.170 --> 01:05:31.450
But if any of you is
actually a Course 6 student,
01:05:31.450 --> 01:05:32.970
I'm sure you've
seen that in some--
01:05:32.970 --> 01:05:37.980
I don't know-- course that
has any content on information
01:05:37.980 --> 01:05:39.060
theory.
01:05:39.060 --> 01:05:39.560
All right.
01:05:39.560 --> 01:05:42.380
So the KL divergence between two
probability measures, P theta
01:05:42.380 --> 01:05:43.790
and P theta prime--
01:05:43.790 --> 01:05:47.810
and here, as I said, it's not
going to be symmetric,
01:05:47.810 --> 01:05:49.680
so it's very important
for you to specify
01:05:49.680 --> 01:05:51.930
which order you say it is,
between P theta and P theta
01:05:51.930 --> 01:05:52.429
prime.
01:05:52.429 --> 01:05:55.060
It's different from saying
between P theta prime and P
01:05:55.060 --> 01:05:56.510
theta.
01:05:56.510 --> 01:05:58.550
And so we denote it by KL.
01:05:58.550 --> 01:06:04.010
And so remember, before we had
either the sum or the integral
01:06:04.010 --> 01:06:07.910
of 1/2 of the distance--
absolute value of the distance
01:06:07.910 --> 01:06:10.550
between the PMFs and 1/2
of the absolute values
01:06:10.550 --> 01:06:17.900
of the distances between the
probability density functions.
01:06:17.900 --> 01:06:19.940
And then we replace
this absolute value
01:06:19.940 --> 01:06:24.740
of the distance divided by
2 by this weird function.
01:06:24.740 --> 01:06:28.100
This function is P
theta, log P theta,
01:06:28.100 --> 01:06:30.290
divided by P theta prime.
01:06:30.290 --> 01:06:31.880
That's the function.
01:06:31.880 --> 01:06:34.710
That's a weird function.
01:06:34.710 --> 01:06:35.210
OK.
01:06:35.210 --> 01:06:38.360
So this was what we had.
01:06:40.960 --> 01:06:41.590
That's the TV.
01:06:44.670 --> 01:06:48.120
And the KL, if I use the
same notation, f and g,
01:06:48.120 --> 01:06:57.315
is integral of f of X, log
of f of X over g of X, dx.
01:07:01.088 --> 01:07:04.280
It's a bit different.
01:07:04.280 --> 01:07:09.120
And I go from discrete to
continuous using an integral.
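A minimal sketch of the discrete version of this definition, KL(p, q) = sum over x of p(x) log(p(x) / q(x)). The function name and the two PMFs are assumptions for illustration, and the code assumes q(x) is positive wherever p(x) is. It also shows that swapping the arguments changes the value.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for PMFs on a shared support."""
    # Terms with p(x) = 0 contribute 0 by convention, so they are skipped.
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


# Hypothetical PMFs on {0, 1}.
p = {0: 0.9, 1: 0.1}
q = {0: 0.5, 1: 0.5}
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, kl_qp)  # two different positive numbers
```

Because the two orders give different values, the order of the arguments has to be stated whenever a KL value is reported.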
01:07:09.120 --> 01:07:10.240
Everybody can read this.
01:07:10.240 --> 01:07:11.365
Everybody's fine with this.
01:07:11.365 --> 01:07:15.780
Is there any uncertainty about
the actual definition here?
01:07:15.780 --> 01:07:17.480
So here I go straight
to the definition,
01:07:17.480 --> 01:07:19.910
which is just
plugging the functions
01:07:19.910 --> 01:07:22.190
into some integral and compute.
01:07:22.190 --> 01:07:24.670
So I don't bother with
maxima or anything.
01:07:24.670 --> 01:07:26.400
I mean, there is
something like that,
01:07:26.400 --> 01:07:29.885
but it's certainly not as
natural as the total variation.
01:07:29.885 --> 01:07:30.875
Yes?
01:07:30.875 --> 01:07:33.845
AUDIENCE: The total
variation, [INAUDIBLE].
01:07:38.732 --> 01:07:40.440
PHILIPPE RIGOLLET:
Yes, just because it's
01:07:40.440 --> 01:07:42.280
hard to build anything
from total variation,
01:07:42.280 --> 01:07:43.500
because I don't know it.
01:07:43.500 --> 01:07:45.835
So it's very difficult.
But if you can actually--
01:07:45.835 --> 01:07:47.910
and even computing it
between two Gaussians,
01:07:47.910 --> 01:07:49.680
just try it for yourself.
01:07:49.680 --> 01:07:52.740
And please stop doing it
after at most six minutes,
01:07:52.740 --> 01:07:54.730
because you won't
be able to do it.
01:07:54.730 --> 01:07:56.730
And so it's just very
hard to manipulate,
01:07:56.730 --> 01:07:59.070
like this integral of
absolute values of differences
01:07:59.070 --> 01:08:01.230
between probability
density function, at least
01:08:01.230 --> 01:08:02.771
for the probability
density functions
01:08:02.771 --> 01:08:04.860
we're used to manipulating
is actually a nightmare.
01:08:04.860 --> 01:08:08.370
And so people prefer KL,
because for the Gaussian,
01:08:08.370 --> 01:08:10.770
this is going to be theta
minus theta prime squared.
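The Gaussian computation mentioned here can be worked out in a few lines. This is a sketch under an assumption not stated at this point in the lecture: both distributions have variance 1, so P theta is N(theta, 1) and P theta prime is N(theta prime, 1). Under that assumption the squared difference comes with a factor of 1/2:

```latex
\begin{aligned}
\mathrm{KL}(P_\theta, P_{\theta'})
  &= \mathbb{E}_\theta\!\left[\log\frac{p_\theta(X)}{p_{\theta'}(X)}\right]
   = \mathbb{E}_\theta\!\left[\frac{(X-\theta')^2 - (X-\theta)^2}{2}\right] \\
  &= \frac{\bigl(1 + (\theta-\theta')^2\bigr) - 1}{2}
   = \frac{(\theta-\theta')^2}{2}.
\end{aligned}
```

The second line uses that, for X drawn from N(theta, 1), the expectation of (X - theta)^2 is 1 and the expectation of (X - theta prime)^2 is 1 + (theta - theta prime)^2.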
01:08:10.770 --> 01:08:12.580
And then we're
going to be happy.
01:08:12.580 --> 01:08:15.720
And so those things are
much easier to manipulate.
01:08:15.720 --> 01:08:18.029
But it's really--
the total variation
01:08:18.029 --> 01:08:20.162
is telling you how
far in the worst case
01:08:20.162 --> 01:08:21.370
the two probabilities can be.
01:08:21.370 --> 01:08:23.220
This is really the
intrinsic notion
01:08:23.220 --> 01:08:25.380
of closeness between
probabilities.
01:08:25.380 --> 01:08:27.229
So that's really the
one-- if we could,
01:08:27.229 --> 01:08:30.202
that's the one we
would go after.
01:08:30.202 --> 01:08:32.160
Sometimes people will
compute them numerically,
01:08:32.160 --> 01:08:34.785
so that they can say, oh, here's
the total variation distance I
01:08:34.785 --> 01:08:36.899
have between those two things.
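As an illustration of computing the total variation numerically, here is a minimal sketch that approximates one half of the integral of |f - g| with a midpoint rule. The choice of two unit-variance Gaussians with means 0 and 1, the integration range, and the function names are all assumptions made for this example, not something taken from the lecture.

```python
import math

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def tv_numeric(mean_f, mean_g, lo=-10.0, hi=11.0, steps=21000):
    """Midpoint-rule approximation of (1/2) * integral of |f - g| dx.

    The range [lo, hi] is a hypothetical choice, wide enough that the
    Gaussian tails outside it are negligible for means near 0 and 1.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += abs(normal_pdf(x, mean_f) - normal_pdf(x, mean_g))
    return 0.5 * total * h

tv = tv_numeric(0.0, 1.0)
print(tv)  # a number between 0 and 1, roughly 0.38 for these two Gaussians
```

The result reads directly as a probability statement: no event has probabilities under the two distributions that differ by more than this value.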
01:08:36.899 --> 01:08:38.670
And then you actually
know that that
01:08:38.670 --> 01:08:41.460
means they are close, because
the absolute value-- if I tell
01:08:41.460 --> 01:08:44.370
you total variation is
0.01, like we did here,
01:08:44.370 --> 01:08:46.319
it has a very specific meaning.
01:08:46.319 --> 01:08:49.762
If I tell you the KL
divergence is 0.01,
01:08:49.762 --> 01:08:50.970
it's not clear what it means.
01:08:55.130 --> 01:08:55.760
OK.
01:08:55.760 --> 01:08:58.109
So what are the properties?
01:08:58.109 --> 01:09:00.870
The KL divergence between
P theta and P theta prime
01:09:00.870 --> 01:09:03.170
is different from the KL
divergence between P theta
01:09:03.170 --> 01:09:05.569
prime and P theta in general.
01:09:05.569 --> 01:09:07.640
Of course, in general,
because if theta
01:09:07.640 --> 01:09:11.029
is equal to theta prime,
then this certainly is true.
01:09:11.029 --> 01:09:14.600
So there's cases
when it's not true.
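A small numerical check of the asymmetry, as a sketch only. It uses the discrete form of the definition (a sum instead of an integral) for two Bernoulli distributions; the parameter values 0.5 and 0.1 and the helper name `kl_bernoulli` are hypothetical choices for illustration.

```python
import math

def kl_bernoulli(p, q):
    """KL between Ber(p) (first argument) and Ber(q), for 0 < p, q < 1.

    Discrete version of the definition: sum over x in {0, 1} of
    p(x) * log(p(x) / q(x)), with natural logarithms.
    """
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

forward = kl_bernoulli(0.5, 0.1)   # KL(Ber(0.5), Ber(0.1))
backward = kl_bernoulli(0.1, 0.5)  # KL(Ber(0.1), Ber(0.5))
same = kl_bernoulli(0.3, 0.3)      # identical distributions

print(forward, backward, same)
```

The two orderings give different positive numbers, while the divergence of a distribution from itself is 0.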
01:09:14.600 --> 01:09:17.090
The KL divergence
is non-negative.
01:09:17.090 --> 01:09:19.742
Who knows the Jensen's
inequality here?
01:09:19.742 --> 01:09:21.450
That should be a subset
of the people who
01:09:21.450 --> 01:09:25.310
raised their hand when I asked
what a convex function is.
01:09:25.310 --> 01:09:26.090
All right.
01:09:26.090 --> 01:09:27.890
So you know what
Jensen's inequality is.
01:09:27.890 --> 01:09:30.490
This is Jensen's-- the
proof is just one step
01:09:30.490 --> 01:09:33.840
Jensen's inequality, which
we will not go into details.
01:09:33.840 --> 01:09:35.569
But that's basically
an inequality
01:09:35.569 --> 01:09:38.233
involving expectation
of a convex function
01:09:38.233 --> 01:09:40.399
of a random variable compared
to the convex function
01:09:40.399 --> 01:09:42.065
of the expectation
of a random variable.
01:09:45.460 --> 01:09:48.580
If you know Jensen,
have fun and prove it.
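For readers who want to check their proof, here is one way the one-step Jensen argument can go. It is a sketch in the f, g notation used for the definition, applying Jensen's inequality to the concave function log:

```latex
-\mathrm{KL}(f, g)
= \mathbb{E}_f\!\left[\log\frac{g(X)}{f(X)}\right]
\le \log \mathbb{E}_f\!\left[\frac{g(X)}{f(X)}\right]
= \log \int_{\{f > 0\}} g(x)\,dx
\le \log 1 = 0 .
```

So KL(f, g) is at least 0. The last inequality holds because g is a density, so its integral over any set is at most 1.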
01:09:48.580 --> 01:09:51.729
What's really nice is that
if the KL is equal to 0,
01:09:51.729 --> 01:09:55.220
then the two distributions
are the same.
01:09:55.220 --> 01:09:57.170
And that's something
we're looking for.
01:09:57.170 --> 01:09:59.020
Everything else we're
happy to throw out.
01:09:59.020 --> 01:10:00.478
And actually, if
you pay attention,
01:10:00.478 --> 01:10:03.500
we're actually really
throwing out everything else.
01:10:03.500 --> 01:10:05.060
So they're not symmetric.
01:10:05.060 --> 01:10:08.530
It does not satisfy the triangle
inequality in general.
01:10:08.530 --> 01:10:12.790
But it's non-negative and
it's 0 if and only if the two
01:10:12.790 --> 01:10:13.922
distributions are the same.
01:10:13.922 --> 01:10:15.130
And that's all we care about.
01:10:15.130 --> 01:10:17.129
And that's what we call
a divergence rather than
01:10:17.129 --> 01:10:21.910
a distance, and divergence will
be enough for our purposes.
01:10:21.910 --> 01:10:24.080
And actually, this
asymmetry, the fact
01:10:24.080 --> 01:10:26.570
that it's not flipping--
the first time I saw it,
01:10:26.570 --> 01:10:27.380
I was just annoyed.
01:10:27.380 --> 01:10:29.225
I was like, can we
just like, I don't
01:10:29.225 --> 01:10:31.550
know, take the average
of the KL between P theta
01:10:31.550 --> 01:10:34.270
and P theta prime and P
theta prime and P theta,
01:10:34.270 --> 01:10:36.290
you would think maybe
you could do this.
01:10:36.290 --> 01:10:39.590
You just symmetrize it by just
taking the average of the two
01:10:39.590 --> 01:10:41.480
possible values it can take.
01:10:41.480 --> 01:10:44.930
The problem is that this will
still not satisfy the triangle
01:10:44.930 --> 01:10:45.500
inequality.
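A concrete failure of the triangle inequality for the symmetrized version can be sketched as follows. The example assumes unit-variance Gaussians, for which the KL divergence between N(a, 1) and N(b, 1) has the closed form (a - b)^2 / 2; the three means 0, 1, 2 and the function names are hypothetical choices for illustration.

```python
def kl_gauss(theta, theta_prime):
    """KL between N(theta, 1) and N(theta_prime, 1): (theta - theta_prime)^2 / 2."""
    return 0.5 * (theta - theta_prime) ** 2

def sym_kl(a, b):
    """Average of the two orderings of the KL divergence."""
    return 0.5 * (kl_gauss(a, b) + kl_gauss(b, a))

direct = sym_kl(0.0, 2.0)                     # going straight from 0 to 2
detour = sym_kl(0.0, 1.0) + sym_kl(1.0, 2.0)  # going through 1

print(direct, detour, direct > detour)  # prints: 2.0 1.0 True
```

A distance would need the direct value to be at most the detour value; here the direct value is twice as large, because the quantity behaves like a squared distance.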
01:10:45.500 --> 01:10:48.290
And there's no way basically
to turn it into something
01:10:48.290 --> 01:10:49.850
that is a distance.
01:10:49.850 --> 01:10:52.350
But the divergence is doing
a pretty good thing for us.
01:10:52.350 --> 01:10:55.790
And this is what will allow us
to estimate it and basically
01:10:55.790 --> 01:11:03.160
overcome what we could not
do with the total variation.
01:11:03.160 --> 01:11:06.410
So the first thing
that you want to notice
01:11:06.410 --> 01:11:08.120
is the total
variation distance--
01:11:08.120 --> 01:11:10.130
the KL divergence,
sorry, is actually
01:11:10.130 --> 01:11:12.470
an expectation of something.
01:11:12.470 --> 01:11:15.260
Look at what it is here.
01:11:15.260 --> 01:11:20.420
It's the integral of some
function against a density.
01:11:20.420 --> 01:11:25.230
That's exactly the definition
of an expectation, right?
01:11:25.230 --> 01:11:29.950
So this is the expectation
of this particular function
01:11:29.950 --> 01:11:31.730
with respect to this density f.
01:11:31.730 --> 01:11:35.650
So in particular, if I call
this density f-- if I say,
01:11:35.650 --> 01:11:38.400
I want the true distribution
to be the first argument,
01:11:38.400 --> 01:11:39.920
this is an expectation
with respect
01:11:39.920 --> 01:11:42.310
to the true distribution from
which my data is actually
01:11:42.310 --> 01:11:45.760
drawn of the log of this ratio.
01:11:45.760 --> 01:11:46.870
So ha ha.
01:11:46.870 --> 01:11:47.700
I'm a statistician.
01:11:47.700 --> 01:11:49.300
Now I have an expectation.
01:11:49.300 --> 01:11:51.430
I can replace it by an
average, because I have data
01:11:51.430 --> 01:11:52.524
from this distribution.
01:11:52.524 --> 01:11:54.940
And I could actually replace
the expectation by an average
01:11:54.940 --> 01:11:56.680
and try to minimize here.
01:11:56.680 --> 01:11:57.959
The problem is that--
01:11:57.959 --> 01:12:00.250
actually the star here should
be in front of the theta,
01:12:00.250 --> 01:12:01.150
not of the P, right?
01:12:01.150 --> 01:12:04.460
That's P theta star,
not P star theta.
01:12:04.460 --> 01:12:05.960
But here, I still
cannot compute it,
01:12:05.960 --> 01:12:08.510
because I have this P
theta star that shows up.
01:12:08.510 --> 01:12:10.220
I don't know what it is.
01:12:10.220 --> 01:12:13.500
And that's now where
the log plays a role.
01:12:13.500 --> 01:12:15.050
If you actually pay
attention, I said
01:12:15.050 --> 01:12:16.940
you can use Jensen to
prove all this stuff.
01:12:16.940 --> 01:12:21.110
You could actually replace the
log by any concave function.
01:12:21.110 --> 01:12:22.440
That would be an f-divergence.
01:12:22.440 --> 01:12:24.030
That's called an f divergence.
01:12:24.030 --> 01:12:26.950
But the log itself has a
very, very specific property,
01:12:26.950 --> 01:12:29.790
which allows us to say
that the log of the ratio
01:12:29.790 --> 01:12:33.290
is the difference of the logs.
01:12:33.290 --> 01:12:38.620
Now, this thing here
does not depend on theta.
01:12:38.620 --> 01:12:43.010
If I think of this KL divergence
as a function of theta,
01:12:43.010 --> 01:12:45.239
then the first part is
actually a constant.
01:12:45.239 --> 01:12:47.530
If I change theta, this thing
is never going to change.
01:12:47.530 --> 01:12:49.980
It depends only on theta star.
01:12:49.980 --> 01:12:51.480
So if I look at
this function KL--
01:13:03.200 --> 01:13:05.500
so if I look at the
function, theta maps
01:13:05.500 --> 01:13:11.450
to KL P theta
star, P theta, it's
01:13:11.450 --> 01:13:15.400
of the form expectation
with respect to theta star,
01:13:15.400 --> 01:13:23.780
log of P theta star
of X. And then I
01:13:23.780 --> 01:13:29.610
have minus expectation with
respect to theta star of log
01:13:29.610 --> 01:13:33.340
of P theta of x.
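Written out, the decomposition just described looks like this. The lowercase p for the density (or PMF) of P is a notational assumption for this sketch:

```latex
\mathrm{KL}(P_{\theta^*}, P_\theta)
= \mathbb{E}_{\theta^*}\!\left[\log\frac{p_{\theta^*}(X)}{p_\theta(X)}\right]
= \underbrace{\mathbb{E}_{\theta^*}\bigl[\log p_{\theta^*}(X)\bigr]}_{\text{does not depend on } \theta}
\;-\; \mathbb{E}_{\theta^*}\bigl[\log p_\theta(X)\bigr].
```

Both expectations are taken with respect to P theta star, the distribution the data comes from; only the second term changes when theta changes.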
01:13:33.340 --> 01:13:38.900
Now as I said, this thing
here, this second expectation
01:13:38.900 --> 01:13:39.950
is a function of theta.
01:13:39.950 --> 01:13:42.381
When theta changes, this
thing is going to change.
01:13:42.381 --> 01:13:43.380
And that's a good thing.
01:13:43.380 --> 01:13:45.754
We want something that reflects
how close theta and theta
01:13:45.754 --> 01:13:46.537
star are.
01:13:46.537 --> 01:13:48.120
But this thing is
not going to change.
01:13:48.120 --> 01:13:49.620
This is a fixed value.
01:13:49.620 --> 01:13:53.125
Actually, it's the negative
entropy of P theta star.
01:13:53.125 --> 01:13:54.500
And if you've
heard of KL, you've
01:13:54.500 --> 01:13:55.583
probably heard of entropy.
01:13:55.583 --> 01:13:58.820
And that's what-- it's
basically minus the entropy.
01:13:58.820 --> 01:14:01.310
And that's a quantity that
just depends on theta star.
01:14:01.310 --> 01:14:03.470
But it's just a number.
01:14:03.470 --> 01:14:05.030
I could compute this
number if I told
01:14:05.030 --> 01:14:07.130
you this is N(theta star, 1).
01:14:07.130 --> 01:14:09.450
You could compute this.
01:14:09.450 --> 01:14:11.640
So now I'm going
to try to minimize
01:14:11.640 --> 01:14:14.500
the estimate of this function.
01:14:14.500 --> 01:14:16.870
And minimizing a function or
a function plus a constant
01:14:16.870 --> 01:14:18.800
is the same thing.
01:14:18.800 --> 01:14:20.840
I'm just shifting the
function here or here,
01:14:20.840 --> 01:14:23.560
but it's the same minimizer.
01:14:23.560 --> 01:14:24.060
OK.
01:14:24.060 --> 01:14:28.910
So the function that maps
theta to KL of P theta star
01:14:28.910 --> 01:14:32.370
to P theta is of the form
constant minus this expectation
01:14:32.370 --> 01:14:35.810
of a log of P theta.
01:14:35.810 --> 01:14:38.070
Everybody agrees?
01:14:38.070 --> 01:14:40.610
Are there any
questions about this?
01:14:40.610 --> 01:14:42.740
Are there any
remarks, including I
01:14:42.740 --> 01:14:46.230
have no idea what's
happening right now?
01:14:46.230 --> 01:14:46.730
OK.
01:14:46.730 --> 01:14:47.700
We're good?
01:14:47.700 --> 01:14:48.200
Yeah.
01:14:48.200 --> 01:14:50.160
AUDIENCE: So when you're
actually employing this method,
01:14:50.160 --> 01:14:52.610
how do you know which theta
to use as theta star and which
01:14:52.610 --> 01:14:53.142
isn't?
01:14:53.142 --> 01:14:55.600
PHILIPPE RIGOLLET: So this is
not a method just yet, right?
01:14:55.600 --> 01:14:57.550
I'm just describing to
you what the KL divergence
01:14:57.550 --> 01:14:58.720
between two distributions is.
01:14:58.720 --> 01:15:00.130
If you really wanted
to compute it,
01:15:00.130 --> 01:15:01.930
you would need to know
what P theta star is
01:15:01.930 --> 01:15:02.770
and what P theta is.
01:15:02.770 --> 01:15:03.467
AUDIENCE: Right.
01:15:03.467 --> 01:15:06.050
PHILIPPE RIGOLLET: And so here,
I'm just saying at some point,
01:15:06.050 --> 01:15:07.650
we still-- so here, you see--
01:15:07.650 --> 01:15:09.280
so now let's move on one step.
01:15:09.280 --> 01:15:12.570
I don't know expectation
of theta star.
01:15:12.570 --> 01:15:15.904
But I have data that comes
from distribution P theta star.
01:15:15.904 --> 01:15:17.820
So the expectation by
the law of large numbers
01:15:17.820 --> 01:15:19.950
should be close to the average.
01:15:19.950 --> 01:15:23.670
And so what I'm doing
is I'm replacing any--
01:15:23.670 --> 01:15:27.390
I can actually-- this is a very
standard estimation method.
01:15:27.390 --> 01:15:30.360
You write something as an
expectation with respect
01:15:30.360 --> 01:15:34.380
to the data-generating
process of some function.
01:15:34.380 --> 01:15:37.349
And then you replace this by
the average of this function.
01:15:37.349 --> 01:15:38.890
And the law of large
numbers tells me
01:15:38.890 --> 01:15:41.326
that those two quantities
should actually be close.
01:15:41.326 --> 01:15:43.820
Now, it doesn't mean that's
going to be the end of the day,
01:15:43.820 --> 01:15:44.319
right.
01:15:44.319 --> 01:15:46.950
When we did Xn bar, that
was the end of the day.
01:15:46.950 --> 01:15:47.900
We had an expectation.
01:15:47.900 --> 01:15:49.850
We replaced it by an average.
01:15:49.850 --> 01:15:51.170
And then we were gone.
01:15:51.170 --> 01:15:53.376
But here, we still
have to do something,
01:15:53.376 --> 01:15:55.250
because this is not
telling me what theta is.
01:15:55.250 --> 01:15:58.070
Now I still have to
minimize this average.
01:15:58.070 --> 01:16:04.370
So this is now my candidate
estimator for KL, KL hat.
01:16:04.370 --> 01:16:06.170
And that's the one
where I said, well, it's
01:16:06.170 --> 01:16:07.897
going to be of the
form of constant.
01:16:07.897 --> 01:16:09.230
And this constant, I don't know.
01:16:09.230 --> 01:16:09.771
You're right.
01:16:09.771 --> 01:16:11.586
I have no idea what
this constant is.
01:16:11.586 --> 01:16:13.640
It depends on P theta star.
01:16:13.640 --> 01:16:16.310
But then I have minus something
that I can completely compute.
01:16:16.310 --> 01:16:20.170
If you give me data and theta,
I can compute this entire thing.
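In symbols, the candidate estimator described here can be sketched as follows, assuming the data X_1, ..., X_n are i.i.d. draws from P theta star (the sample size n and the lowercase p for the density are notation introduced for this sketch):

```latex
\widehat{\mathrm{KL}}(P_{\theta^*}, P_\theta)
= \text{constant} \;-\; \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(X_i).
```

The constant is unknown because it depends on theta star, but the average on the right uses only the data and the candidate theta.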
01:16:20.170 --> 01:16:25.670
And now what I claim is that
the minimizer of f or f plus--
01:16:25.670 --> 01:16:28.950
f of X or f of X plus
4 are the same thing,
01:16:28.950 --> 01:16:32.200
or say 4 plus f of
X. I'm just shifting
01:16:32.200 --> 01:16:34.260
the plot of my
function up and down,
01:16:34.260 --> 01:16:36.340
but the minimizer stays
exactly where it is.
01:16:39.590 --> 01:16:41.075
If I have a function--
01:16:43.750 --> 01:16:45.284
so now I have a
function of theta.
01:16:51.620 --> 01:16:56.100
This is KL hat of P
theta star, P theta.
01:16:56.100 --> 01:16:58.831
And it's of the form--
it's a function like this.
01:16:58.831 --> 01:17:00.330
I don't know where
this function is.
01:17:00.330 --> 01:17:06.880
It might very well be this
function or this function.
01:17:06.880 --> 01:17:10.870
Every time it's a translation
on the y-axis of all these guys.
01:17:10.870 --> 01:17:14.690
And the value that I translated
by depends on theta star.
01:17:14.690 --> 01:17:15.970
I don't know what it is.
01:17:15.970 --> 01:17:19.600
But what I claim is that the
minimizer is always this guy,
01:17:19.600 --> 01:17:22.428
regardless of what the value is.
01:17:22.428 --> 01:17:25.290
OK?
01:17:25.290 --> 01:17:28.560
So when I say constant, it's a
constant with respect to theta.
01:17:28.560 --> 01:17:29.670
It's an unknown constant.
01:17:29.670 --> 01:17:32.490
But it's with respect to theta,
so without loss of generality,
01:17:32.490 --> 01:17:36.840
I can assume that this
constant is 0 for my purposes,
01:17:36.840 --> 01:17:38.040
or 25 if you prefer.
01:17:41.171 --> 01:17:41.670
All right.
01:17:41.670 --> 01:17:46.420
So we'll just keep going
on this property next time.
01:17:46.420 --> 01:17:49.359
And we'll see how from
here we can move on to--
01:17:49.359 --> 01:17:51.900
the likelihood is actually going
to come out of this formula.
01:17:51.900 --> 01:17:53.450
Thanks.