The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: --124. If I were to repeat this 1,000 times, so every one of those 1,000 times I collect 124 data points, and then I do it again and again and again, then on average the number I get should be close to the true parameter that I'm looking for. The fluctuations that are due to the fact that I get a different sample every time should somewhat vanish. And so what I want is to have a small bias, hopefully a zero bias. If this thing is 0, then we say that the estimator is unbiased.

So this is definitely a property that we are going to be looking for in an estimator: we try to find estimators that are unbiased. But we'll see that it's actually maybe not enough, so unbiasedness should not be something you lose sleep over. Something that's slightly better is the risk, really the quadratic risk, which is an expectation: if I have an estimator theta hat, I'm going to look at the expectation of (theta hat n minus theta) squared. And what we showed last time is that, by adding and removing the expectation of theta hat inside, this thing can be decomposed as the square of the bias plus the variance, which is just the expectation of (theta hat minus its expectation) squared. That came from the fact that when I added and removed the expectation of theta hat in there, the cross terms cancel. All right. So that was the bias squared, and this is the variance.

And so, for example, if the quadratic risk goes to 0, then that means that theta hat converges to theta in the L2 sense.
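For reference, here is the add-and-subtract computation being described at the board, written out using only the lecture's own definitions:

```latex
\mathbb{E}\big[(\hat\theta_n-\theta)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat\theta_n]-\theta\big)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}\Big[\big(\hat\theta_n-\mathbb{E}[\hat\theta_n]\big)^2\Big]}_{\text{variance}},
\quad\text{since the cross term } 2\big(\mathbb{E}[\hat\theta_n]-\theta\big)\,
\mathbb{E}\big[\hat\theta_n-\mathbb{E}[\hat\theta_n]\big]=0.
```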
And here we know that if we want this to go to 0, since it's the sum of two nonnegative terms, we need both the bias to go to 0 and the variance to go to 0, so we need to control both of those things. And there is usually an inherent trade-off between getting a small bias and getting a small variance. If you reduce one too much, then the other one is going to increase, or the opposite. That happens a lot, but not so much, actually, in this class.

So let's just look at a couple of examples. So am I planning-- yeah. So, examples. Say, for example, X1 through Xn are iid Bernoulli, and I'm going to write the parameter as theta so that we keep the same notation. Then theta hat-- what is the theta hat that we proposed many times? It's just Xn bar, the average of the Xi's. So what is the bias of this guy? Well, to know the bias, I just have to subtract theta from the expectation. What is the expectation of Xn bar? Well, by linearity of the expectation, it's just the average of the expectations. But since all my Xi's are Bernoulli with the same theta, each of these guys is actually equal to theta. So this thing is actually theta, which means that this is unbiased, right?

Now, what is the variance of this guy? So if you forgot the properties of the variance for sums of independent random variables, now it's time to wake up. So we have the variance of something that looks like 1 over n times the sum from i equals 1 to n of the Xi. So it's of the form: variance of a constant times a random variable. So the first thing I'm going to do is pull out the constant. But we know that the variance lives on the square scale, so when I pull a constant out of the variance, it comes out with a square. The variance of a times X is a squared times the variance of X, so this is equal to 1 over n squared times the variance of the sum.

So now we want to do what we always want to do.
So we have the variance of the sum. We would like somehow to say that this is the sum of the variances. In general, we are not allowed to say that, but here we are, because my Xi's are actually independent. So this is actually equal to 1 over n squared times the sum from i equals 1 to n of the variance of each of the Xi's. And that's by independence, so this is basic probability.

And now, what is the variance of the Xi's? Again, they all have the same distribution, so the variance of Xi is the same as the variance of X1. And so each of those guys has variance what? What is the variance of a Bernoulli? We've said it once. It's theta times (1 minus theta). And so now I'm going to have the sum of n times a constant, so I get n times the constant divided by n squared, so one of the n's is going to cancel. And so the whole thing here is actually equal to theta(1 minus theta) divided by n.

So if I'm interested in the quadratic risk-- and again, I should just say risk, because this is the only risk we're going to be actually looking at. Yeah, this parenthesis should really stop here; I really wanted to put "quadratic" in parentheses. So the risk of this guy is what? Well, it's the expectation of (Xn bar minus theta) squared. And we know it's the square of the bias, which we know is 0, so it's 0 squared, plus the variance, which is theta(1 minus theta) divided by n. So it's just theta(1 minus theta) divided by n.

So this is just summarizing the performance of an estimator, which is a random variable. I mean, it's complicated. If I really wanted to describe it, I would tell you the entire distribution of this random variable. But what I'm doing now is saying, well, let's just take this random variable, remove theta from it, and see how small the fluctuations around theta-- the squared fluctuations around theta-- are in expectation. So that's what the quadratic risk is doing.
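A minimal simulation sketch of this computation: the true theta, the sample size, and the number of repetitions below are arbitrary illustrative choices, not values from the lecture. It checks empirically that the sample mean of Bernoullis is unbiased, that its variance is theta(1 minus theta)/n, and that the risk splits as bias squared plus variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 124, 20_000  # arbitrary illustrative values

# Each row is one repetition of the experiment: n Bernoulli(theta) draws,
# and the estimator is the sample mean of that row.
theta_hat = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)

bias = theta_hat.mean() - theta            # ~0: the estimator is unbiased
variance = theta_hat.var()                 # ~theta*(1-theta)/n
risk = np.mean((theta_hat - theta) ** 2)   # equals bias**2 + variance

print(bias, variance, theta * (1 - theta) / n, risk)
```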
And in a way, this decomposition as the sum of the bias squared and the variance is really accounting for two things. The bias asks: even if I had an infinite amount of observations, is this thing doing the right thing? And the variance asks: for a finite number of observations, what are the fluctuations?

All right. Now you can see that those two things, bias and variance, are actually very different. So I don't have any colors here, so you're going to have to really follow the order in which I draw those curves. All right. So I'm going to give you three candidate estimators for theta.

The first one is definitely Xn bar. That will be a good candidate estimator. The second one is going to be 0.5, because after all, why should I bother, right? So for example, if I ask you to predict the score of some candidate in some election, then since you know it's going to be very close to 0.5, you might as well just throw out 0.5, and you're not going to be very far from reality. And it's actually going to cost you zero time and zero dollars to come up with that. So sometimes maybe just a good old guess is actually doing the job for you. Of course, for presidential elections or something like this, it's not very helpful if your prediction just tells you 0.5. But for anything whose answer is close to 1/2, that would be a good way to generate a guess. For a coin, for example, if I give you a coin, you never know-- maybe it's slightly biased. But unless something crazy is happening with its structure, a good guess, just from inspecting it, is 0.5, without trying to collect information.

And let's find another one, which is, well, you know, I could have a lot of observations. But I'm recording couples kissing, and I'm on a budget.
I don't have time to travel all around the world and collect data on people. So really, I'm just going to look at the first couple and go home. So my other estimator is just going to be X1. I just take the first observation, 0 or 1, and that's it.

So now I want to actually understand what the behavior of those guys is. All right. So we know that for this guy, Xn bar, the bias is 0 and the variance is equal to theta(1 minus theta) divided by n. What is the bias of this guy, 0.5?

AUDIENCE: 0.5.

AUDIENCE: 0.5 minus theta?

PHILIPPE RIGOLLET: 0.5 minus theta, right. So the bias is 0.5 minus theta. What is the variance of this guy? What is the variance of 0.5?

AUDIENCE: It's 0.

PHILIPPE RIGOLLET: 0, right. It's just a deterministic number, so there are no fluctuations for this guy. What about X1? Well, just for simplicity, I can think of it as being X1 bar, the average of itself, so that wherever I saw an n for the first estimator, I can replace it by 1, and that will give me my formula. So the bias is still going to be 0. And the variance is going to be equal to theta(1 minus theta).

So now I have those three estimators. Well, if I compare X1 and Xn bar, then clearly I have 0 bias in both cases. That's good. And the variance is actually n times smaller when I use my n observations than when I don't. So on these two fronts, you can look at the two numbers and say, well, the first number is the same, and the second number is better for Xn bar, so I will definitely go for Xn bar compared to X1. So X1 is gone. But not 0.5. Its variance is 0, which always beats the variance of Xn bar. And if I look at the bias, it's actually really not that bad. It's 0.5 minus theta.
In particular, if theta is 0.5, then this guy is strictly better. And so you can actually now look at what the quadratic risk looks like. So here, what I'm going to do is take my true theta, which is going to range between 0 and 1. And we know that those risks are functions of theta, so I can only understand them if I plot them as functions of theta. So the y-axis is going to be the risk.

So what is the risk of the estimator 0.5? This one is easy. Well, it's 0 plus the square of (0.5 minus theta). So we know that at theta equals 0.5, it's going to be 0. And then it's a square. So at 0, it's going to be 0.25, and at 1, it's going to be 0.25 as well. So it looks like this. Well, actually, sorry, let me put the 0.5 where it should be. OK. So this curve here is the risk of 0.5. So when theta is very close to 0.5, I'm very happy. When theta gets farther, it's a little bit annoying.

And then here, I want to plot the risk of Xn bar. Now, the thing with the risk of this guy is that it depends on n. So I will just pick some n that I'm happy with, just so that I can actually draw a curve. Otherwise, I would have to plot one curve per value of n. So let's just say, for example, that n is equal to 10. And so now I need to plot the function theta(1 minus theta) divided by 10. We know that theta(1 minus theta) is a curve that goes like this. It takes its maximum value, 1/4, at 1/2, and it's 0 at the two ends. So really, if n were equal to 1, this is what the variance would look like. The bias doesn't count in the risk for this estimator. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Sure. Can you move? All right. Are you guys good? All right.
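The picture being drawn on the board can be reproduced with a few lines. A small sketch, using n = 10 as chosen in the lecture (the grid resolution is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 200)
n = 10  # the value picked in the lecture

risk_mean = theta * (1 - theta) / n   # Xn bar: bias 0, variance theta(1-theta)/n
risk_half = (0.5 - theta) ** 2        # constant 0.5: bias 0.5-theta, variance 0
risk_x1 = theta * (1 - theta)         # X1: bias 0, variance theta(1-theta)

plt.plot(theta, risk_mean, label=r"$\bar{X}_n$ ($n=10$)")
plt.plot(theta, risk_half, label=r"constant $0.5$")
plt.plot(theta, risk_x1, label=r"$X_1$")
plt.xlabel(r"$\theta$"); plt.ylabel("quadratic risk"); plt.legend()
plt.show()
```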
So now I have this picture. And I know the parabola goes up to 0.25. And there's a place where those curves cross. So if you're sure-- let's say you're talking about a presidential election, you know that those things are going to be really close to 0.5. Maybe you're actually better off predicting 0.5 if you know it's not going to go too far. But that's for one observation, so that's the risk of X1. If I look at the risk of Xn bar, all I'm doing is crushing this curve down towards 0. So as n increases, it's going to look more and more like this: it's the same curve divided by n.

And so now I can start to understand that, for different values of theta, theta is going to have to be very close to 1/2 before I can say that Xn bar is worse than the naive estimator 0.5. Yeah?

AUDIENCE: Sorry. I know you explained a little bit before, but can you just-- what is an intuitive definition of risk? What is it actually describing?

PHILIPPE RIGOLLET: So, well, when you have an unbiased estimator, it's simple: it's just the variance, because the theta that you have in the definition of the risk, if you're unbiased, is really the expectation of theta hat. So that's really just the variance. So the risk is telling you how much fluctuation I have around my expectation, if unbiased. But in general, it's telling you how much fluctuation I have, on average, around theta. So if you understand the notion of variance as being--

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: What?

AUDIENCE: Like variance on average.

PHILIPPE RIGOLLET: No.

AUDIENCE: No.

PHILIPPE RIGOLLET: It's just like variance.

AUDIENCE: Oh, OK.
PHILIPPE RIGOLLET: So when you-- I mean, if you claim you understand what variance is, it's telling you the expected squared fluctuation around the expectation of my random variable. It's just telling you, on average, how far I'm going to be. And you take the square because you want to cancel the signs; otherwise, you're going to get 0.

AUDIENCE: Oh, OK.

PHILIPPE RIGOLLET: And here it's saying, well, I really don't care what the expectation of theta hat is. What I want to get to is theta, so I'm looking at the expectation of the squared fluctuations around theta itself. If I'm unbiased, it coincides with the variance. But if I'm biased, then I have to account for the fact that I'm really not computing the--

AUDIENCE: OK. OK. Thanks.

PHILIPPE RIGOLLET: OK? All right. Are there any questions? So here, what I really want to illustrate is that the risk itself is a function of theta most of the time. And so for different thetas, some estimators are going to be better than others. But there's also an entire range of estimators in between, ones with some bias that can be traded against variance. Here, you see, you have no bias but the variance can be large, or you have a bias but the variance is 0. So you can actually have this trade-off, and in general you can find things along the entire range.

So those trade-offs between bias and variance are usually much better illustrated when we're talking about multivariate parameters. If I look at a parameter which is the mean of some multivariate Gaussian, so an entire vector, then I can make the bias bigger by, for example, forcing all the coordinates of my estimator to be the same.
So here, I'm going to get some bias, but the variance is actually going to be much better, because I get to average all the coordinates for this guy. And so really, the bias/variance trade-off shows up when you have multiple parameters to estimate-- a vector of parameters, a multivariate parameter. The bias increases when you're trying to pool information across the different components in order to get a lower variance. The more you average, the lower the variance. That's exactly what we've illustrated: as n increases, the variance decreases, like 1 over n, or theta(1 minus theta) over n. And this is how it happens in general. In this class, it's mostly one-dimensional parameter estimation, so it's going to be a little harder to illustrate that. But if you do, for example, non-parametric estimation, that's all you do: there are just bias/variance trade-offs all the time. And in between, when you have high-dimensional parametric estimation, that happens a lot as well.

OK. So I'm just going to go quickly through these two remaining slides, because we've actually seen them. But I just wanted you to have somewhere a formal definition of what a confidence interval is. So we fix a statistical model for n observations, X1 to Xn. The parameter theta here is one-dimensional: the parameter set Theta is a subset of the real line, and that's why I talk about intervals. An interval is a subset of the line. If I had a subset of R2, for example, that would no longer be called an interval but a region-- well, we could say a set, a confidence set, but people like to say confidence region. So an interval is just a one-dimensional confidence region. And it has to be an interval as well.

So, a confidence interval of level 1 minus alpha. The quality of a confidence interval is referred to as its level. It takes value 1 minus alpha for some positive alpha.
And so the level of the confidence interval is between 0 and 1. The closer to 1 it is, the better the confidence interval; the closer to 0, the worse.

So a confidence interval is a random interval. The bounds of this interval depend on random data-- just like we had X bar plus or minus 1 over square root of n, for example, or 2 over square root of n. This X bar was the random thing that made those bounds fluctuate. So I have an interval, and I have its boundaries, but the boundaries are not allowed to depend on the unknown parameter. Otherwise, it's not a confidence interval, just like an estimator that depends on the unknown parameter is not an estimator. The confidence interval has to be something that I can compute once I collect data.

And so what I want is that-- so there's this notation that may look weird. The way I write it, it's the probability that I contains theta. You're used to seeing "theta belongs to I." But here, I really want to emphasize that the randomness is in I. And so the way you actually say it when you read this formula is: the probability that I contains theta is at least 1 minus alpha. So it had better be close to 1. You want 1 minus alpha to be very close to 1, because it's really telling you that, whatever random interval I'm giving you, my error bars are actually covering the right theta.

And I want this to be true. But since I don't know what my parameter theta is, I want this to hold for all possible values of the parameter that nature may have come up with. So theta changes here, so the distribution of the interval is actually changing with theta, hopefully, and theta is changing with this guy. So regardless of the value of theta that I'm getting, I want the probability that the interval contains theta to be larger than 1 minus alpha. So I'll come back to it in a second.
I just want to say that here, we can also talk about asymptotic level. And that's typically when you use the central limit theorem to compute this guy. Then you're not guaranteed that the coverage is at least 1 minus alpha for every n, but in the limit it's larger than 1 minus alpha. So maybe for each fixed n it's not true, but as n goes to infinity, it becomes true. If you want this to hold for every n, you actually need to use things such as Hoeffding's inequality, which we described at some point, and which holds for every n. So as a rule of thumb, if you use the central limit theorem, you're dealing with a confidence interval of asymptotic level 1 minus alpha. And the reason is that you actually use the quantiles of the Gaussian distribution that comes from the central limit theorem. If you use Hoeffding's, for example, you might actually get a confidence interval that's valid even non-asymptotically. That's just a regular confidence interval.

So this is the formal definition. It's a bit of a mouthful. But the best way to understand confidence intervals is to build them. Now, at some point I said-- and I think it was part of the homework-- so here, I really say the probability that the confidence interval contains the true parameter is at least 1 minus alpha. And that's because this confidence interval is still a random variable. Now, if I start plugging in numbers instead of the random variables X1 to Xn-- if I start putting 1, 0, 0, 1, 0, 0, 1, like I did for the kiss example-- then in this case, the realized interval is going to be something like [0.42, 0.65]. And for this guy, the probability that theta belongs to it is not 1 minus alpha. It's either 0, if theta is not in there, or 1, if it is in there.
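Both points can be seen in a simulation: across many repetitions, the fraction of random intervals covering the true p is at least 1 minus alpha, while each single realized interval either contains p or it doesn't. A minimal sketch, assuming arbitrary illustrative values of p, n, and alpha, and using the conservative interval that gets built a couple of paragraphs below:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, n, alpha, reps = 0.35, 124, 0.05, 20_000  # arbitrary illustrative values
q = norm.ppf(1 - alpha / 2)                  # q_{alpha/2}, ~1.96 for alpha=0.05

xbar = rng.binomial(1, p, size=(reps, n)).mean(axis=1)

# Conservative CLT interval: bound p(1-p) by its maximum value 1/4.
half = q * np.sqrt(0.25 / n)
covered = (xbar - half <= p) & (p <= xbar + half)
print(covered.mean())  # fraction of intervals containing p: at least ~0.95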
So here is the example that we had. Let's look back at our favorite example, the average of Bernoulli random variables-- we've studied that, so maybe this is the third time already. So the sample average, Xn bar, is a strongly consistent estimator of p. That was one of the properties that we wanted. Strongly consistent means that as n goes to infinity, it converges almost surely to the true parameter. That's the strong law of large numbers. It is consistent also: because it's strongly consistent, it also converges in probability, which makes it consistent. It's unbiased-- we've seen that. We've actually computed its quadratic risk.

And now, thanks to the central limit theorem, we built a confidence interval at asymptotic level 1 minus alpha. And here is how we did it. Let me just go through it again. So the central limit theorem tells us that square root of n times (Xn bar minus p), divided by square root of p(1 minus p), converges in distribution, as n goes to infinity, to a standard normal distribution. So what it means is that if I look at the probability, under the true p, that the absolute value of square root of n times (Xn bar minus p) divided by square root of p(1 minus p) is at most q alpha over 2, then in the limit as n goes to infinity-- and I'm going to use the same notation-- this is actually going to be equal to 1 minus alpha. That's exactly what I did last time. This is by definition of the quantile of a standard Gaussian and of a limit in distribution. The probability computed on this guy converges, in the limit, to the probability computed on the limiting Gaussian. And we know that the probability that the absolute value of some N(0, 1) variable is less than q alpha over 2 is 1 minus alpha.
And so in particular, since it's equal in the limit, I can put a "larger than or equal to," which guarantees my asymptotic confidence level. And I just solve for p. So this is equivalent to saying: the limit, as n goes to infinity, of the probability that p is between Xn bar minus q alpha over 2 times square root of p(1 minus p) divided by square root of n, and Xn bar plus q alpha over 2 times square root of p(1 minus p) divided by square root of n, is larger than or equal to 1 minus alpha. And so there you go. I have my confidence interval.

Except that's not one, right? We just said that the bounds of a confidence interval may not depend on the unknown parameter. And here, they do. And so we actually came up with two ways of getting rid of this. Since we only need this probability to be at least 1 minus alpha-- and this thing, as we said, is really an equality-- every time I make the left end smaller and the right end larger, I only increase the probability. And so the first trick is to take the largest possible value of p(1 minus p), which makes the interval as large as possible. So I replace p(1 minus p) by its upper bound, which is 1/4. As we said, p(1 minus p), the function, looks like this, so I just take the value at 1/2, which is 1/4.

Or, I can use Slutsky and replace p by Xn bar-- that's the same as just replacing p by Xn bar here. And by Slutsky, we know that this ratio still converges to a standard Gaussian. We've seen that when we saw Slutsky as an example. And those two things work because I'm taking the limit and I only care about the asymptotic confidence level, so I can plug in consistent estimators in there, such as Xn bar in place of p. And that gives me another confidence interval. All right.
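Both recipes are a one-liner each once you have the data. A minimal sketch, where the 0/1 data vector below is hypothetical (it is not the actual kiss dataset):

```python
import numpy as np
from scipy.stats import norm

x = np.array([1, 0, 0, 1, 0, 0, 1])  # hypothetical 0/1 observations
n, xbar = len(x), x.mean()
q = norm.ppf(1 - 0.05 / 2)           # q_{alpha/2} for alpha = 0.05

# Method 1: conservative, bound p(1-p) by 1/4.
h1 = q * np.sqrt(0.25 / n)
# Method 2: Slutsky plug-in, replace p by Xn bar.
h2 = q * np.sqrt(xbar * (1 - xbar) / n)

print((xbar - h1, xbar + h1))  # conservative interval (always the wider one)
print((xbar - h2, xbar + h2))  # plug-in interval
```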
So by now, hopefully after doing it three times, you should really, really be comfortable with creating this confidence interval. We did it three times in class. I think you probably did it another couple of times in your homework. So just make sure you're comfortable with this. All right. That's one of the basic things you would want to know. Are there any questions? Yes.

AUDIENCE: So Slutsky holds for any single response set p. But Xn converges [INAUDIBLE].

PHILIPPE RIGOLLET: So that's not Slutsky, right?

AUDIENCE: That's [INAUDIBLE].

PHILIPPE RIGOLLET: So Slutsky is about combining two types of convergence. Slutsky tells you that if you have Xn that converges to X in distribution, and Yn that converges to Y in probability, where Y is a constant, then you can multiply Xn and Yn and get that the limit in distribution is the product of X and Y. And here we have the constant, which is 1. But I did that already, right? Using Slutsky to replace p by Xn bar-- we've done that last time, or maybe a couple of lectures ago, actually. Yeah?

AUDIENCE: So I guess these statements are [INAUDIBLE].

PHILIPPE RIGOLLET: That's correct.

AUDIENCE: So could we, like, figure out [INAUDIBLE] can we set a finite [INAUDIBLE].

PHILIPPE RIGOLLET: So of course, the short answer is no. But here's how you would go about thinking about which method is better. There's always the more conservative method. With the first one, the only thing you're losing is the rate of convergence of the central limit theorem. So if n is large enough that the central limit theorem approximation is very good, then that's all you're going to be losing. Of course, the price you pay is that your confidence interval is wider than it would be if you were to use Slutsky for this particular problem-- typically wider.
649 00:30:32,600 --> 00:30:37,140 Actually, it is always wider, because Xn bar-- 650 00:30:37,140 --> 00:30:41,120 1 minus Xn bar is always less than 1/4 as well. 651 00:30:41,120 --> 00:30:45,920 And so that's the first thing you-- 652 00:30:45,920 --> 00:30:51,380 so Slutsky basically adds your relying on the central limit-- 653 00:30:51,380 --> 00:30:53,570 your relying on the asymptotics again. 654 00:30:53,570 --> 00:30:56,180 Now of course, you don't want to be conservative, 655 00:30:56,180 --> 00:30:59,060 because you actually want to squeeze as much from your data 656 00:30:59,060 --> 00:30:59,930 as you can. 657 00:30:59,930 --> 00:31:04,040 So it depends on how comfortable and how critical it is for you 658 00:31:04,040 --> 00:31:06,410 to put valid error bars. 659 00:31:06,410 --> 00:31:07,940 If they're valid in the asymptotics, 660 00:31:07,940 --> 00:31:09,710 then maybe you're actually going to go with Slutsky 661 00:31:09,710 --> 00:31:11,918 so it actually gives you slightly narrower confidence 662 00:31:11,918 --> 00:31:16,060 intervals and so you feel like you're a little more-- 663 00:31:16,060 --> 00:31:17,869 you have a more precise answer. 664 00:31:17,869 --> 00:31:19,910 Now, if you really need to be super-conservative, 665 00:31:19,910 --> 00:31:23,390 then you're actually going to go with the P1 minus P. 666 00:31:23,390 --> 00:31:25,790 Actually, if you need to be even more conservative, 667 00:31:25,790 --> 00:31:28,850 you are going to go with Hoeffding's so you don't even 668 00:31:28,850 --> 00:31:31,412 have to rely on the asymptotics level at all. 669 00:31:31,412 --> 00:31:32,870 But then you're confidence interval 670 00:31:32,870 --> 00:31:35,000 becomes twice as wide and twice as wide 671 00:31:35,000 --> 00:31:37,960 and it becomes wider and wider as you go. 672 00:31:37,960 --> 00:31:39,859 So depends on-- 673 00:31:39,859 --> 00:31:41,650 I mean, there's a lot of data in statistics 674 00:31:41,650 --> 00:31:46,310 which is gauging how critical it is for you to output 675 00:31:46,310 --> 00:31:48,380 valid error bounds or if they're really just here 676 00:31:48,380 --> 00:31:51,620 to be indicative of the precision of the estimator you 677 00:31:51,620 --> 00:31:55,396 gave from a more qualitative perspective. 678 00:31:55,396 --> 00:31:57,540 AUDIENCE: So the error there is [INAUDIBLE]?? 679 00:31:57,540 --> 00:31:58,540 PHILIPPE RIGOLLET: Yeah. 680 00:31:58,540 --> 00:32:01,220 So here, there's basically a bunch of errors. 681 00:32:01,220 --> 00:32:04,280 There's one that's-- so there's a theorem called Berry-Esseen 682 00:32:04,280 --> 00:32:09,830 that quantifies how far this probability is from 1 minus 683 00:32:09,830 --> 00:32:12,670 alpha, but the constants are terrible. 684 00:32:12,670 --> 00:32:14,510 So it's not very helpful, but it tells you 685 00:32:14,510 --> 00:32:17,330 as n grows how smaller this thing grows-- 686 00:32:17,330 --> 00:32:18,320 becomes smaller. 687 00:32:18,320 --> 00:32:20,330 And then for Slutsky, again you're 688 00:32:20,330 --> 00:32:22,790 multiplying something that converges by something that 689 00:32:22,790 --> 00:32:24,827 fluctuates around 1, so you need to understand 690 00:32:24,827 --> 00:32:25,910 how this thing fluctuates. 691 00:32:25,910 --> 00:32:28,070 Now, there's something that shows up. 
Basically, what is the slope of the function 1 over square root of x(1 minus x) around the value you're interested in? If this function is super sharp, then small fluctuations of Xn bar around its expectation are going to lead to really high fluctuations of the function itself. So if you're looking at f of Xn bar, and f around, say, the true p-- if f is really sharp like that, then if you move a little bit here, you're going to move really a lot on the y-axis. So the function you're interested in here is 1 over square root of x(1 minus x). What does this function look like around the point where you think p, the true parameter, is? Its derivative is really what matters. OK? Any other questions?

OK. So this is important, because now we're going to switch to the real "let's do some hardcore computation" type of things. All right.

So in this chapter, we're going to talk about maximum likelihood estimation. Who has already seen maximum likelihood estimation? OK. And who knows what a convex function is? OK. So we'll do a little bit of a reminder on those things. When we do maximum likelihood estimation, the likelihood is a function, so we need to maximize a function. That's basically what we need to do. And if I give you a function, you need to know how to maximize it. Sometimes, you have closed-form solutions: you can take the derivative, set it equal to 0, and solve. But sometimes, you actually need to resort to algorithms to do that. And there's an entire industry doing that. We'll briefly touch upon it, but it is definitely not the focus of this class. OK.
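A minimal sketch of the two routes just mentioned: the function below is a made-up smooth concave example, and scipy's generic solver stands in for the "entire industry" of numerical optimizers; neither is from the lecture.

```python
from scipy.optimize import minimize_scalar

# A made-up concave function; calculus gives the closed form:
# f'(x) = 2 - 2x = 0, so the maximizer is x = 1.
def f(x):
    return 2 * x - x ** 2

# The algorithmic route: maximize f by numerically minimizing -f.
res = minimize_scalar(lambda x: -f(x))
print(res.x)  # ~1.0, matching the derivative-equals-zero solution
```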
So before diving directly into the definition of the likelihood and of the maximum likelihood estimator, what I'm going to try to do is give you some insight into what we're actually doing when we do maximum likelihood estimation.

So remember, we have a model on a sample space E and some candidate distributions P theta. And really, your goal is to estimate a true theta star, the one that generated some data X1 to Xn in an iid fashion. But this theta star is really a proxy: knowing it means we actually understand the distribution itself. The goal of knowing theta star is so that you actually know P theta star. Otherwise-- well, sometimes we said theta has some meaning itself, but really you want to know what the distribution is. And so your goal is to come up with a distribution-- hopefully one that comes from the family of P thetas-- that's close to P theta star.

So, what does it mean to have two distributions that are close? It means that when you compute probabilities under one distribution, you should get pretty much the same probabilities under the other distribution. So what we can do is say, well, now I have two candidate distributions. Theta hat leads to a candidate distribution P theta hat, and the true theta star leads to the true distribution P theta star, according to which my data was drawn. That's my candidate-- as a statistician, I'm supposed to come up with a good candidate-- and this is the truth.

And what I want is that if you give me the candidate distribution, then when I compute probabilities with it, I know what the probabilities under the truth are, pretty much. And so really what I want is that if I compute a probability under theta hat of some interval [a, b], it should be pretty close to the probability under theta star of [a, b]. And more generally, if I take the union of two intervals, I want this to be true. If I take half-lines-- from 0 to infinity, for example-- I want this to be true, things like this. I want this to be true for all of them at once.
782 00:36:55,470 --> 00:36:58,500 If I take just half-lines, I want this to be true from 0 783 00:36:58,500 --> 00:37:00,900 to infinity, for example, things like this. 784 00:37:00,900 --> 00:37:03,550 I want this to be true for all of them at once. 785 00:37:03,550 --> 00:37:07,620 And so what I do is that I write A for a probability event. 786 00:37:07,620 --> 00:37:11,520 And I want that P hat of A is close to P star of A 787 00:37:11,520 --> 00:37:15,517 for any event A in the sample space. 788 00:37:15,517 --> 00:37:17,100 Does that sound like a reasonable goal 789 00:37:17,100 --> 00:37:18,994 for a statistician? 790 00:37:18,994 --> 00:37:20,910 So in particular, if I want those to be close, 791 00:37:20,910 --> 00:37:22,784 I want the absolute value of their difference 792 00:37:22,784 --> 00:37:23,680 to be close to 0. 793 00:37:26,220 --> 00:37:28,140 And this turns out to be-- 794 00:37:28,140 --> 00:37:31,875 if I want this to hold for all possible A's, I 795 00:37:31,875 --> 00:37:35,460 have all possible events, so I'm going to actually maximize over 796 00:37:35,460 --> 00:37:36,100 these events. 797 00:37:36,100 --> 00:37:37,516 And I'm going to look at the worst 798 00:37:37,516 --> 00:37:41,160 possible event on which theta hat can depart from theta star. 799 00:37:41,160 --> 00:37:43,170 And so rather than defining it specifically 800 00:37:43,170 --> 00:37:44,790 for theta hat and theta star, I'm 801 00:37:44,790 --> 00:37:47,910 just going to say, well, if you give me two probability 802 00:37:47,910 --> 00:37:51,420 measures, P theta and P theta prime, 803 00:37:51,420 --> 00:37:53,100 I want to know how close they are. 804 00:37:53,100 --> 00:37:55,080 Well, if I want to measure how close they 805 00:37:55,080 --> 00:37:58,980 are by how they can differ when I measure 806 00:37:58,980 --> 00:38:01,920 the probability of some event, I'm 807 00:38:01,920 --> 00:38:04,800 just looking at the absolute value of the difference 808 00:38:04,800 --> 00:38:06,180 of the probabilities and I'm just 809 00:38:06,180 --> 00:38:09,240 maximizing over the worst possible event that might 810 00:38:09,240 --> 00:38:11,380 actually make them differ. 811 00:38:11,380 --> 00:38:13,040 Agreed? 812 00:38:13,040 --> 00:38:14,360 That's a pretty strong notion. 813 00:38:14,360 --> 00:38:17,720 So if the total variation between P theta and P theta prime 814 00:38:17,720 --> 00:38:22,390 is small, it means that for all possible A's that you give me, 815 00:38:22,390 --> 00:38:25,590 then P theta of A is going to be close to P 816 00:38:25,590 --> 00:38:30,820 theta prime of A, because if-- 817 00:38:30,820 --> 00:38:33,820 let's say I just found the bound on the total variation 818 00:38:33,820 --> 00:38:41,911 distance, which is 0.01. 819 00:38:41,911 --> 00:38:42,410 All right. 820 00:38:42,410 --> 00:38:46,110 So that means that this is going to be larger 821 00:38:46,110 --> 00:39:00,940 than the max over A of P theta of A minus P theta prime of A, 822 00:39:00,940 --> 00:39:04,550 which means that for any A-- 823 00:39:04,550 --> 00:39:06,990 actually, let me write P theta hat and P theta star, 824 00:39:06,990 --> 00:39:10,611 like we said, theta hat and theta star.
825 00:39:10,611 --> 00:39:12,860 And so if I have a bound, say, on the total variation, 826 00:39:12,860 --> 00:39:19,270 which is 0.01, that means that P theta hat-- 827 00:39:19,270 --> 00:39:23,950 every time I compute a probability on P theta hat, 828 00:39:23,950 --> 00:39:29,295 it's basically in the interval P theta star of A, 829 00:39:29,295 --> 00:39:34,790 the one that I really wanted to compute, plus or minus 0.01. 830 00:39:34,790 --> 00:39:36,790 This has nothing to do with a confidence interval. 831 00:39:36,790 --> 00:39:38,165 This is just telling me how far I 832 00:39:38,165 --> 00:39:41,280 am from the value I'm actually trying to compute. 833 00:39:41,280 --> 00:39:44,750 And that's true for all A. And that's key. 834 00:39:44,750 --> 00:39:47,400 That's where this max comes into play. 835 00:39:47,400 --> 00:39:49,310 It just says, I want this bound to hold 836 00:39:49,310 --> 00:39:50,870 for all possible A's at once. 837 00:39:55,300 --> 00:39:58,142 So this is actually a very well-known distance 838 00:39:58,142 --> 00:39:59,350 between probability measures. 839 00:39:59,350 --> 00:40:00,766 It's the total variation distance. 840 00:40:00,766 --> 00:40:04,880 It's extremely central to probabilistic analysis. 841 00:40:04,880 --> 00:40:07,120 And it essentially tells you that every time-- 842 00:40:07,120 --> 00:40:09,160 if two probability distributions are close, 843 00:40:09,160 --> 00:40:11,560 then it means that every time I compute a probability 844 00:40:11,560 --> 00:40:15,160 under P theta but I really actually 845 00:40:15,160 --> 00:40:17,290 have data from P theta prime, then 846 00:40:17,290 --> 00:40:21,710 the error is no larger than the total variation. 847 00:40:21,710 --> 00:40:23,470 OK. 848 00:40:23,470 --> 00:40:29,460 So this is maybe not the most convenient way 849 00:40:29,460 --> 00:40:30,870 of finding a distance. 850 00:40:30,870 --> 00:40:32,130 I mean, how are you going-- 851 00:40:32,130 --> 00:40:34,500 in reality, how are you going to compute this maximum 852 00:40:34,500 --> 00:40:35,640 over all possible events? 853 00:40:35,640 --> 00:40:36,931 I mean, it's just crazy, right? 854 00:40:36,931 --> 00:40:38,430 There's an infinite number of them. 855 00:40:38,430 --> 00:40:41,340 It's much larger than the number of intervals, for example, 856 00:40:41,340 --> 00:40:43,050 so it's a bit annoying. 857 00:40:43,050 --> 00:40:46,800 And so there's actually a way to compress it 858 00:40:46,800 --> 00:40:50,834 by just looking at, basically, a function distance or vector 859 00:40:50,834 --> 00:40:53,250 distance between probability mass functions or probability 860 00:40:53,250 --> 00:40:55,510 density functions. 861 00:40:55,510 --> 00:40:58,150 So I'm going to start with the discrete version 862 00:40:58,150 --> 00:40:59,280 of the total variation. 863 00:40:59,280 --> 00:41:03,282 So throughout this chapter, I will 864 00:41:03,282 --> 00:41:05,490 distinguish between discrete random variables 865 00:41:05,490 --> 00:41:07,530 and continuous random variables. 866 00:41:07,530 --> 00:41:08,651 It really doesn't matter. 867 00:41:08,651 --> 00:41:10,650 All it means is that when I talk about discrete, 868 00:41:10,650 --> 00:41:12,606 I will talk about probability mass functions. 869 00:41:12,606 --> 00:41:13,980 And when I talk about continuous, 870 00:41:13,980 --> 00:41:16,600 I will talk about probability density functions.
871 00:41:16,600 --> 00:41:20,030 When I talk about probability mass functions, 872 00:41:20,030 --> 00:41:21,510 I talk about sums. 873 00:41:21,510 --> 00:41:24,900 When I talk about probability density functions, 874 00:41:24,900 --> 00:41:26,730 I talk about integrals. 875 00:41:26,730 --> 00:41:30,090 But they're all the same thing, really. 876 00:41:30,090 --> 00:41:32,475 So let's start with the probability mass function. 877 00:41:32,475 --> 00:41:34,350 Everybody remembers what the probability mass 878 00:41:34,350 --> 00:41:37,980 function of a discrete random variable is. 879 00:41:37,980 --> 00:41:42,180 This is the function that tells me for each possible value 880 00:41:42,180 --> 00:41:43,720 that it can take, the probability 881 00:41:43,720 --> 00:41:46,410 that it takes this value. 882 00:41:46,410 --> 00:41:53,200 So the Probability Mass Function, PMF, 883 00:41:53,200 --> 00:41:57,310 is just the function that, for all x in the sample space, 884 00:41:57,310 --> 00:42:01,420 tells me the probability that my random variable is 885 00:42:01,420 --> 00:42:03,970 equal to this little value. 886 00:42:03,970 --> 00:42:09,091 And I will denote it by P sub theta of X. 887 00:42:09,091 --> 00:42:10,840 So what I want is, of course, that the sum 888 00:42:10,840 --> 00:42:12,250 of the probabilities is 1. 889 00:42:17,620 --> 00:42:20,460 And I want them to be non-negative. 890 00:42:20,460 --> 00:42:23,110 Actually, typically we will assume that they are positive. 891 00:42:23,110 --> 00:42:27,410 Otherwise, we can just remove this x from the sample space. 892 00:42:27,410 --> 00:42:31,700 And so then I have the total variation distance, I mean, 893 00:42:31,700 --> 00:42:35,470 it's supposed to be the maximum over all sets of-- 894 00:42:35,470 --> 00:42:39,850 of subsets A of E, of the probability 895 00:42:39,850 --> 00:42:43,130 under theta of A minus the probability under theta prime of A-- 896 00:42:43,130 --> 00:42:44,630 it's complicated, but really there's 897 00:42:44,630 --> 00:42:46,130 this beautiful formula that tells me 898 00:42:46,130 --> 00:42:50,410 that if I look at the total variation between P theta 899 00:42:50,410 --> 00:42:54,520 and P theta prime, it's actually equal to just 1/2 900 00:42:54,520 --> 00:43:04,402 of the sum for all X in E of the absolute difference between P 901 00:43:04,402 --> 00:43:12,151 theta of X and P theta prime of X. 902 00:43:12,151 --> 00:43:13,650 So that's something you can compute. 903 00:43:13,650 --> 00:43:16,110 If I give you two probability mass functions, 904 00:43:16,110 --> 00:43:19,660 you can compute this immediately. 905 00:43:19,660 --> 00:43:24,020 But if I give you just the densities 906 00:43:24,020 --> 00:43:26,460 and the original distribution, the original definition 907 00:43:26,460 --> 00:43:28,200 where you have to max over all possible events, 908 00:43:28,200 --> 00:43:29,575 it's not clear you're going to be 909 00:43:29,575 --> 00:43:31,140 able to do that very quickly. 910 00:43:31,140 --> 00:43:35,335 So this is really the one you can work with. 911 00:43:35,335 --> 00:43:36,960 But the other one is really telling you 912 00:43:36,960 --> 00:43:37,830 what it is doing for you. 913 00:43:37,830 --> 00:43:39,829 It's controlling the difference of probabilities 914 00:43:39,829 --> 00:43:41,077 you can compute on any event. 915 00:43:41,077 --> 00:43:42,660 But here, it's just telling you, well, 916 00:43:42,660 --> 00:43:46,370 if you do it for each simple event, it's little x.
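Since the discrete formula is directly computable, here is a minimal sketch of it, my own and not from the lecture, for two Poisson PMFs; truncating the infinite sum at k = 100 is an assumption, harmless here because the tail mass beyond it is negligible for these parameters.

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    # P(X = k) for X ~ Poisson(lam)
    return exp(-lam) * lam ** k / factorial(k)

def tv_discrete(lam1, lam2, kmax=100):
    # TV = (1/2) * sum over x of |p_theta(x) - p_theta'(x)|, truncated at kmax
    return 0.5 * sum(abs(poisson_pmf(lam1, k) - poisson_pmf(lam2, k))
                     for k in range(kmax + 1))

print(tv_discrete(1.0, 1.5))  # a number between 0 and 1
```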
917 00:43:46,370 --> 00:43:49,080 It's actually simple. 918 00:43:49,080 --> 00:43:53,150 Now, if we have continuous random variables-- so 919 00:43:53,150 --> 00:43:56,060 by the way, I didn't mention, but discrete means Bernoulli, 920 00:43:56,060 --> 00:43:59,420 binomial-- but not only those that have finite support, 921 00:43:59,420 --> 00:44:02,260 like Bernoulli has support of size 2, 922 00:44:02,260 --> 00:44:05,760 binomial (n, p) has support of size n plus 1-- 923 00:44:05,760 --> 00:44:08,570 there are n plus 1 possible values it can take-- but also Poisson. 924 00:44:08,570 --> 00:44:10,570 Poisson distribution can take an infinite number 925 00:44:10,570 --> 00:44:13,510 of values-- 926 00:44:13,510 --> 00:44:16,100 all the non-negative integers. 927 00:44:16,100 --> 00:44:18,000 And so now we have also the continuous ones, 928 00:44:18,000 --> 00:44:19,384 such as Gaussian, exponential. 929 00:44:19,384 --> 00:44:21,300 And what characterizes those guys is that they 930 00:44:21,300 --> 00:44:24,230 have a probability density. 931 00:44:24,230 --> 00:44:26,630 So the density, remember the way I 932 00:44:26,630 --> 00:44:28,820 use my density is when I want to compute 933 00:44:28,820 --> 00:44:31,910 the probability of belonging to some event A. 934 00:44:31,910 --> 00:44:37,010 The probability of X falling into some subset A of the real line 935 00:44:37,010 --> 00:44:40,280 is simply the integral of the density on this set. 936 00:44:40,280 --> 00:44:43,940 That's the famous area under the curve thing. 937 00:44:43,940 --> 00:44:49,196 So since for each possible value, the probability at X-- 938 00:44:49,196 --> 00:44:51,350 so I hope you remember that stuff. 939 00:44:51,350 --> 00:44:57,890 That's just probably something that you 940 00:44:57,890 --> 00:44:59,210 must remember from probability. 941 00:44:59,210 --> 00:45:02,120 But essentially, we know that the probability that X is equal 942 00:45:02,120 --> 00:45:04,820 to little x is 0 for a continuous random variable, 943 00:45:04,820 --> 00:45:06,830 for all possible X's. 944 00:45:06,830 --> 00:45:09,030 There's just none of them that actually gets weight. 945 00:45:09,030 --> 00:45:11,321 So what we have to do is to describe the fact that it's 946 00:45:11,321 --> 00:45:12,980 in some little region. 947 00:45:12,980 --> 00:45:18,830 So the probability that it's in some interval, say, a, b, this 948 00:45:18,830 --> 00:45:25,550 is the integral between a and b of f theta of X, dx. 949 00:45:25,550 --> 00:45:28,379 So I have this density, such as the Gaussian one. 950 00:45:28,379 --> 00:45:30,545 And the probability that I belong to the interval a, 951 00:45:30,545 --> 00:45:36,920 b is just the area under the curve between a and b. 952 00:45:36,920 --> 00:45:43,880 If you don't remember that, please take immediate remedial action. 953 00:45:43,880 --> 00:45:48,920 So this function f, just like P, is non-negative. 954 00:45:48,920 --> 00:45:51,890 And rather than summing to 1, it integrates to 1 955 00:45:51,890 --> 00:45:55,119 when I integrate it over the entire sample space E. 956 00:45:55,119 --> 00:45:56,660 And now the total variation, well, it 957 00:45:56,660 --> 00:45:58,130 takes basically the same form. 958 00:45:58,130 --> 00:46:00,230 I said that you essentially replace sums 959 00:46:00,230 --> 00:46:03,264 by integrals when you're dealing with densities.
960 00:46:03,264 --> 00:46:05,180 And here, it's just saying, rather than having 961 00:46:05,180 --> 00:46:07,220 1/2 of the sum of the absolute values, 962 00:46:07,220 --> 00:46:09,860 you have 1/2 of the integral of the absolute value 963 00:46:09,860 --> 00:46:11,530 of the difference. 964 00:46:11,530 --> 00:46:15,310 Again, if I give you two densities 965 00:46:15,310 --> 00:46:18,280 and if you're not too bad at calculus, which you will often 966 00:46:18,280 --> 00:46:21,490 be, because there's lots of them you can actually not compute. 967 00:46:21,490 --> 00:46:24,400 But if I gave you, for example, two Gaussian densities, 968 00:46:24,400 --> 00:46:27,330 exponential minus x squared, blah, blah, blah, and I say, 969 00:46:27,330 --> 00:46:29,080 just compute the total variation distance, 970 00:46:29,080 --> 00:46:30,957 you could actually write it as an integral. 971 00:46:30,957 --> 00:46:33,040 Now, whether you can actually reduce this integral 972 00:46:33,040 --> 00:46:35,470 to some particular number is another story. 973 00:46:35,470 --> 00:46:38,860 But you could technically do it. 974 00:46:38,860 --> 00:46:41,695 So now, you have actually a handle on this thing 975 00:46:41,695 --> 00:46:43,660 and you could technically ask Mathematica, 976 00:46:43,660 --> 00:46:45,280 whereas asking Mathematica to take 977 00:46:45,280 --> 00:46:48,280 the max over all possible events is going to be difficult. 978 00:46:48,280 --> 00:46:48,780 All right. 979 00:46:48,780 --> 00:46:55,240 So the total variation has some properties. 980 00:46:55,240 --> 00:46:59,560 So let's keep on the board the definition that 981 00:46:59,560 --> 00:47:05,410 involves, say, the densities. 982 00:47:05,410 --> 00:47:06,910 So think Gaussian in your mind. 983 00:47:06,910 --> 00:47:09,530 And you have two Gaussians, one with mean theta 984 00:47:09,530 --> 00:47:10,810 and one with mean theta prime. 985 00:47:10,810 --> 00:47:13,143 And I'm looking at the total variation between those two 986 00:47:13,143 --> 00:47:14,560 guys. 987 00:47:14,560 --> 00:47:20,030 So if I look at P theta minus-- 988 00:47:20,030 --> 00:47:20,530 sorry. 989 00:47:20,530 --> 00:47:25,800 TV between P theta and P theta prime, this 990 00:47:25,800 --> 00:47:31,110 is equal to 1/2 of the integral between f theta, f theta prime. 991 00:47:31,110 --> 00:47:32,490 And when I don't write it-- 992 00:47:32,490 --> 00:47:34,800 so I don't write the X, dx but it's there. 993 00:47:34,800 --> 00:47:38,432 And then I integrate over E. 994 00:47:38,432 --> 00:47:39,890 So what is this thing doing for me? 995 00:47:39,890 --> 00:47:41,480 It's just saying, well, if I have-- so 996 00:47:41,480 --> 00:47:42,438 think of two Gaussians. 997 00:47:42,438 --> 00:47:44,940 For example, I have one that's here and one that's here. 998 00:47:47,610 --> 00:47:51,670 So this is let's say f theta, f theta prime. 999 00:47:51,670 --> 00:47:52,750 This guy is doing what? 1000 00:47:52,750 --> 00:47:55,980 It's computing the absolute value of the difference 1001 00:47:55,980 --> 00:47:57,910 between f and f theta prime. 1002 00:47:57,910 --> 00:48:01,980 You can check for yourself that graphically, this I 1003 00:48:01,980 --> 00:48:05,931 can represent as an area not under the curve, 1004 00:48:05,931 --> 00:48:10,300 but between the curves. 1005 00:48:10,300 --> 00:48:11,760 So this is this guy. 1006 00:48:16,370 --> 00:48:20,040 Now, this guy is really the integral of the absolute value. 
1007 00:48:20,040 --> 00:48:22,570 So this thing here, this area, this 1008 00:48:22,570 --> 00:48:25,224 is 2 times the total variation. 1009 00:48:28,240 --> 00:48:29,980 The scaling 1/2 really doesn't matter. 1010 00:48:29,980 --> 00:48:32,790 It's just if I want to have an actual correspondence 1011 00:48:32,790 --> 00:48:36,350 between the maximum and the other guy, I have to do this. 1012 00:48:39,630 --> 00:48:41,290 So this is what it looks like. 1013 00:48:41,290 --> 00:48:42,910 So we have this definition. 1014 00:48:42,910 --> 00:48:48,279 And so we have a couple of properties that come into this. 1015 00:48:48,279 --> 00:48:49,820 The first one is that it's symmetric. 1016 00:48:49,820 --> 00:48:51,860 TV of P theta and P theta prime is 1017 00:48:51,860 --> 00:48:55,970 the same as the TV between P theta prime and P theta. 1018 00:48:55,970 --> 00:48:59,710 Well, that's pretty obvious from this definition. 1019 00:48:59,710 --> 00:49:02,090 I just flip those two, I get the same number. 1020 00:49:02,090 --> 00:49:05,297 It's actually also true if I take the maximum. 1021 00:49:05,297 --> 00:49:07,630 Those things are completely symmetric in theta and theta 1022 00:49:07,630 --> 00:49:08,350 prime. 1023 00:49:08,350 --> 00:49:10,620 You can just flip them. 1024 00:49:10,620 --> 00:49:11,830 It's non-negative. 1025 00:49:11,830 --> 00:49:15,640 Is that clear to everyone that this thing is non-negative? 1026 00:49:15,640 --> 00:49:20,530 I integrate an absolute value, so this thing 1027 00:49:20,530 --> 00:49:22,640 is going to give me some non-negative number. 1028 00:49:22,640 --> 00:49:24,598 And so if I integrate this non-negative number, 1029 00:49:24,598 --> 00:49:26,670 it's going to be a non-negative number. 1030 00:49:26,670 --> 00:49:29,230 The fact also that it's an area tells me 1031 00:49:29,230 --> 00:49:32,680 that it's going to be non-negative. 1032 00:49:32,680 --> 00:49:36,900 The nice thing is that if TV is equal to zero, then 1033 00:49:36,900 --> 00:49:42,490 the two distributions, the two probabilities are the same. 1034 00:49:42,490 --> 00:49:46,540 That means that for every A, P theta of A 1035 00:49:46,540 --> 00:49:49,050 is equal to P theta prime of A. Now, 1036 00:49:49,050 --> 00:49:50,860 there's two ways to see that. 1037 00:49:50,860 --> 00:49:53,140 The first one is to say that if this integral is 1038 00:49:53,140 --> 00:49:56,650 equal to 0, that means that for almost all X, 1039 00:49:56,650 --> 00:49:58,240 f theta is equal to f theta prime. 1040 00:49:58,240 --> 00:50:01,390 The only way I can integrate a non-negative and get 0 1041 00:50:01,390 --> 00:50:05,390 is that it's 0 pretty much everywhere. 1042 00:50:05,390 --> 00:50:07,550 And so what it means is that the two densities 1043 00:50:07,550 --> 00:50:09,530 have to be the same pretty much everywhere, 1044 00:50:09,530 --> 00:50:11,546 which means that the distributions are the same. 1045 00:50:11,546 --> 00:50:13,670 But this is not really the way you want to do this, 1046 00:50:13,670 --> 00:50:15,128 because you have to understand what 1047 00:50:15,128 --> 00:50:16,850 pretty much everywhere means-- 1048 00:50:16,850 --> 00:50:18,760 which I should really say almost everywhere. 1049 00:50:18,760 --> 00:50:20,570 That's the formal way of saying it. 1050 00:50:20,570 --> 00:50:22,280 But let's go to this definition-- 1051 00:50:24,830 --> 00:50:26,160 which is gone. 1052 00:50:26,160 --> 00:50:26,660 Yeah. 1053 00:50:26,660 --> 00:50:28,670 That's the one here. 
1054 00:50:28,670 --> 00:50:35,230 The max of those two guys, if this maximum is equal to 0-- 1055 00:50:35,230 --> 00:50:39,220 I have a maximum of non-negative numbers, their absolute values. 1056 00:50:39,220 --> 00:50:42,090 If their maximum is equal to 0, well, 1057 00:50:42,090 --> 00:50:44,490 they better be all equal to 0, because if one is not 1058 00:50:44,490 --> 00:50:47,470 equal to 0, then the maximum is not equal to 0. 1059 00:50:47,470 --> 00:50:50,170 So those two guys, for those two things 1060 00:50:50,170 --> 00:50:52,180 to be-- for the maximum to be equal to 0, 1061 00:50:52,180 --> 00:50:54,220 then each of the individual absolute values 1062 00:50:54,220 --> 00:50:57,430 has to be equal to 0, which means that the probability here 1063 00:50:57,430 --> 00:51:03,730 is equal to this probability here for every event A. 1064 00:51:03,730 --> 00:51:04,960 So those two things-- 1065 00:51:04,960 --> 00:51:06,070 this is nice, right? 1066 00:51:06,070 --> 00:51:08,410 That's called definiteness. 1067 00:51:08,410 --> 00:51:10,900 The total variation equal to 0 implies that P theta 1068 00:51:10,900 --> 00:51:12,210 is equal to P theta prime. 1069 00:51:12,210 --> 00:51:14,350 So that's really some notion of distance, right? 1070 00:51:14,350 --> 00:51:16,060 That's what we want. 1071 00:51:16,060 --> 00:51:17,980 If this thing being small implied 1072 00:51:17,980 --> 00:51:20,350 that P theta could be all over the place compared 1073 00:51:20,350 --> 00:51:24,270 to P theta prime, that would not help very much. 1074 00:51:24,270 --> 00:51:26,580 Now, there's also the triangle inequality 1075 00:51:26,580 --> 00:51:28,710 that follows immediately from the triangle 1076 00:51:28,710 --> 00:51:32,730 inequality inside this guy. 1077 00:51:32,730 --> 00:51:35,654 If I squeeze in some f theta prime prime in there, 1078 00:51:35,654 --> 00:51:37,320 I'm going to use the triangle inequality 1079 00:51:37,320 --> 00:51:39,486 and get the triangle inequality for the whole thing. 1080 00:51:42,392 --> 00:51:42,892 Yeah? 1081 00:51:42,892 --> 00:51:45,287 AUDIENCE: The fact that you need two definitions 1082 00:51:45,287 --> 00:51:48,640 of the [INAUDIBLE], is it something 1083 00:51:48,640 --> 00:51:50,090 obvious or is it complete? 1084 00:51:50,090 --> 00:51:52,930 PHILIPPE RIGOLLET: I'll do it for you now. 1085 00:51:52,930 --> 00:51:56,530 So let's just prove that those two things are actually 1086 00:51:56,530 --> 00:51:58,756 giving me the same definition. 1087 00:52:00,956 --> 00:52:02,830 So what I'm going to do is I'm actually going 1088 00:52:02,830 --> 00:52:04,420 to start with the second one. 1089 00:52:04,420 --> 00:52:05,420 And I'm going to write-- 1090 00:52:05,420 --> 00:52:07,253 I'm going to start with the density version. 1091 00:52:07,253 --> 00:52:10,300 But as an exercise, you can do it for the PMF version 1092 00:52:10,300 --> 00:52:11,347 if you prefer. 1093 00:52:11,347 --> 00:52:13,180 So I'm going to start with the fact that f-- 1094 00:52:20,240 --> 00:52:23,810 so I'm going to write f and g so I don't have to write f theta and f theta prime. 1095 00:52:23,810 --> 00:52:27,490 So think of this as being f sub theta, and think of this guy 1096 00:52:27,490 --> 00:52:29,180 as being f sub theta prime. 1097 00:52:29,180 --> 00:52:32,240 I just don't want to have to write indices all the time. 1098 00:52:32,240 --> 00:52:34,970 So I'm going to start with this thing, the integral of f 1099 00:52:34,970 --> 00:52:38,870 of X minus g of X dx.
1100 00:52:38,870 --> 00:52:41,910 The first thing I'm going to do is this is an absolute value, 1101 00:52:41,910 --> 00:52:45,170 so either the number in the absolute value is positive 1102 00:52:45,170 --> 00:52:47,390 and I actually kept it like that, or it's negative 1103 00:52:47,390 --> 00:52:48,760 and I flipped its sign. 1104 00:52:48,760 --> 00:52:51,600 So let's just split between those two cases. 1105 00:52:51,600 --> 00:52:55,460 So this thing is equal to 1/2 the integral of-- 1106 00:52:55,460 --> 00:53:00,350 so let me actually write the set A star as 1107 00:53:00,350 --> 00:53:09,240 being the set of X's such that f of X is larger than g of X. 1108 00:53:09,240 --> 00:53:11,340 So that's the set on which the difference is 1109 00:53:11,340 --> 00:53:13,060 going to be positive, and on the complement the difference is 1110 00:53:13,060 --> 00:53:14,370 going to be negative. 1111 00:53:14,370 --> 00:53:17,082 So this, again, is equivalent to f 1112 00:53:17,082 --> 00:53:23,280 of X minus g of X being positive. 1113 00:53:23,280 --> 00:53:23,780 OK. 1114 00:53:23,780 --> 00:53:24,488 Everybody agrees? 1115 00:53:24,488 --> 00:53:26,330 So this is the set I'm interested in. 1116 00:53:29,040 --> 00:53:31,830 So now I'm going to split my integral into two parts, 1117 00:53:31,830 --> 00:53:38,250 on A star and its complement. So on A star, f is larger than g, 1118 00:53:38,250 --> 00:53:40,666 so the absolute value is just the difference itself. 1119 00:53:45,150 --> 00:53:48,980 So here I put parentheses rather than absolute values. 1120 00:53:48,980 --> 00:53:54,330 And then I have plus 1/2 of the integral on the complement. 1121 00:53:54,330 --> 00:53:57,940 What are you guys used to for writing the complement-- the C 1122 00:53:57,940 --> 00:54:01,005 or the bar? 1123 00:54:01,005 --> 00:54:01,991 The C? 1124 00:54:05,450 --> 00:54:08,320 And so here on the complement, then f is less than g, 1125 00:54:08,320 --> 00:54:17,810 so this is actually really g of X minus f of X, dx. 1126 00:54:17,810 --> 00:54:19,550 Everybody's with me here? 1127 00:54:19,550 --> 00:54:20,900 So I just said-- 1128 00:54:20,900 --> 00:54:23,390 I mean, those are just rewriting what the definition 1129 00:54:23,390 --> 00:54:24,560 of the absolute value is. 1130 00:54:33,290 --> 00:54:33,830 OK. 1131 00:54:33,830 --> 00:54:38,120 So now there's nice things that I know about f and g. 1132 00:54:38,120 --> 00:54:40,880 And the two nice things is that the integral of f is equal to 1 1133 00:54:40,880 --> 00:54:42,790 and the integral of g is equal to 1. 1134 00:54:46,270 --> 00:54:53,614 This implies that the integral of f minus g is equal to what? 1135 00:54:53,614 --> 00:54:54,526 AUDIENCE: 0. 1136 00:54:54,526 --> 00:54:56,400 PHILIPPE RIGOLLET: 0. 1137 00:54:56,400 --> 00:54:59,060 And so now that means that if I want 1138 00:54:59,060 --> 00:55:04,130 to just go from the integral here on A complement 1139 00:55:04,130 --> 00:55:05,690 to the integral on A-- 1140 00:55:05,690 --> 00:55:08,780 or on A star, complement to the integral of A star, 1141 00:55:08,780 --> 00:55:11,700 I just have to flip the sign. 1142 00:55:11,700 --> 00:55:14,920 So that implies that an integral on A star 1143 00:55:14,920 --> 00:55:21,198 complement of g of X minus f of X, 1144 00:55:21,198 --> 00:55:25,830 dx, this is simply equal to the integral on A star 1145 00:55:25,830 --> 00:55:30,250 of f of X minus g of X, dx. 1146 00:55:40,880 --> 00:55:41,780 All right. 1147 00:55:41,780 --> 00:55:46,100 So now this guy becomes this guy over there.
1148 00:55:46,100 --> 00:55:50,050 So I have 1/2 of this plus 1/2 of the same guy, 1149 00:55:50,050 --> 00:55:55,720 so that means that 1/2 of the integral of f 1150 00:55:55,720 --> 00:55:57,450 minus g, absolute value-- 1151 00:55:57,450 --> 00:55:59,810 so that was my original definition, 1152 00:55:59,810 --> 00:56:03,890 this thing is actually equal to the integral on A star 1153 00:56:03,890 --> 00:56:10,379 of f of X minus g of X, dx. 1154 00:56:14,160 --> 00:56:21,440 And this is simply equal to P of A star-- 1155 00:56:21,440 --> 00:56:26,160 so, say, Pf of A star minus Pg of A star. 1156 00:56:34,160 --> 00:56:36,810 Which one is larger than the other one? 1157 00:56:41,610 --> 00:56:43,540 AUDIENCE: [INAUDIBLE] 1158 00:56:43,540 --> 00:56:44,600 PHILIPPE RIGOLLET: It is. 1159 00:56:44,600 --> 00:56:45,951 Just look at this board. 1160 00:56:45,951 --> 00:56:47,406 AUDIENCE: [INAUDIBLE] 1161 00:56:47,406 --> 00:56:48,406 PHILIPPE RIGOLLET: What? 1162 00:56:48,406 --> 00:56:49,880 AUDIENCE: [INAUDIBLE] 1163 00:56:49,880 --> 00:56:50,510 PHILIPPE RIGOLLET: The first one has 1164 00:56:50,510 --> 00:56:51,980 to be larger, because this thing is actually 1165 00:56:51,980 --> 00:56:53,271 equal to a non-negative number. 1166 00:56:59,590 --> 00:57:01,990 So now I have this absolute value of two things, 1167 00:57:01,990 --> 00:57:04,150 and so I'm closer to the actual definition. 1168 00:57:04,150 --> 00:57:06,910 But I still need to show you that this thing is 1169 00:57:06,910 --> 00:57:09,010 the maximum value. 1170 00:57:09,010 --> 00:57:17,710 So this is definitely at most the maximum over A of Pf 1171 00:57:17,710 --> 00:57:21,670 of A minus Pg of A. 1172 00:57:21,670 --> 00:57:24,290 That's certainly true. 1173 00:57:24,290 --> 00:57:24,830 Right? 1174 00:57:24,830 --> 00:57:27,850 We agree with this? 1175 00:57:27,850 --> 00:57:30,620 Because this is just for one specific A, 1176 00:57:30,620 --> 00:57:34,930 and I'm bounding it by the maximum over all possible A. 1177 00:57:34,930 --> 00:57:36,932 So that's clearly true. 1178 00:57:36,932 --> 00:57:38,640 So now I have to go the other way around. 1179 00:57:38,640 --> 00:57:44,370 I have to show you that the max is actually this guy, A star. 1180 00:57:44,370 --> 00:57:45,640 So why would that be true? 1181 00:57:45,640 --> 00:57:49,180 Well, let's just inspect this thing over there. 1182 00:57:49,180 --> 00:57:50,730 So we want to show that if I take 1183 00:57:50,730 --> 00:57:53,490 any other A in this integral than this guy A star, 1184 00:57:53,490 --> 00:57:56,580 it's actually going to decrease its value. 1185 00:57:56,580 --> 00:57:57,720 So we have this function. 1186 00:57:57,720 --> 00:57:59,303 I'm going to call this function delta. 1187 00:58:02,314 --> 00:58:03,730 And what we have is-- so let's say 1188 00:58:03,730 --> 00:58:04,920 this function looks like this. 1189 00:58:04,920 --> 00:58:06,836 Now it's the difference between two densities. 1190 00:58:06,836 --> 00:58:09,500 It doesn't have to integrate-- it doesn't 1191 00:58:09,500 --> 00:58:10,500 have to be non-negative. 1192 00:58:10,500 --> 00:58:12,420 But it certainly has to integrate to 0. 1193 00:58:15,510 --> 00:58:18,440 And so now I take this thing. 1194 00:58:18,440 --> 00:58:22,126 And the A star, what is the set A star here? 1195 00:58:22,126 --> 00:58:25,640 The set A star is the set over which the function 1196 00:58:25,640 --> 00:58:27,645 delta is non-negative. 1197 00:58:36,340 --> 00:58:37,590 So that's just the definition.
1198 00:58:37,590 --> 00:58:41,660 A star was the set over which f minus g was positive, 1199 00:58:41,660 --> 00:58:44,430 and f minus g was just called delta. 1200 00:58:44,430 --> 00:58:47,720 So what it means is that what I'm really integrating 1201 00:58:47,720 --> 00:58:50,810 is delta on this set. 1202 00:58:50,810 --> 00:58:53,570 So it's this area under the curve, 1203 00:58:53,570 --> 00:58:55,230 just on the positive things. 1204 00:58:55,230 --> 00:58:57,830 Agreed? 1205 00:58:57,830 --> 00:59:03,290 So now let's just make some tiny variations around this guy. 1206 00:59:03,290 --> 00:59:08,150 If I take A to be larger than A star-- 1207 00:59:08,150 --> 00:59:10,280 so let me add, for example, this part here. 1208 00:59:12,920 --> 00:59:15,680 That means that when I compute my integral, 1209 00:59:15,680 --> 00:59:18,067 I'm removing this area under the curve. 1210 00:59:18,067 --> 00:59:18,650 It's negative. 1211 00:59:18,650 --> 00:59:20,520 The integral here is negative. 1212 00:59:20,520 --> 00:59:25,160 So if I start adding something to A, the value goes lower. 1213 00:59:25,160 --> 00:59:29,060 If I start removing something from A, like say this guy, 1214 00:59:29,060 --> 00:59:32,450 I'm actually removing this value from the integral. 1215 00:59:32,450 --> 00:59:33,320 So there's no way. 1216 00:59:33,320 --> 00:59:34,370 I'm actually stuck. 1217 00:59:34,370 --> 00:59:37,100 This A star is the one that actually maximizes 1218 00:59:37,100 --> 00:59:39,830 the integral of this function. 1219 00:59:39,830 --> 00:59:49,470 So we used the fact that for any function, 1220 00:59:49,470 --> 00:59:59,180 say delta, the integral over A of delta 1221 00:59:59,180 --> 01:00:02,712 is less than the integral over the set of X's 1222 01:00:02,712 --> 01:00:07,670 such that delta of X is non-negative of delta of X, dx. 1223 01:00:10,280 --> 01:00:13,518 And that's an obvious fact, just by picture, say. 1224 01:00:18,498 --> 01:00:24,972 And that's true for all A. Yeah? 1225 01:00:24,972 --> 01:00:28,956 AUDIENCE: [INAUDIBLE] could you use 1226 01:00:28,956 --> 01:00:33,106 like a portion under the axis as like less than 1227 01:00:33,106 --> 01:00:34,845 or equal to the portion above the axis? 1228 01:00:34,845 --> 01:00:36,470 PHILIPPE RIGOLLET: It's actually equal. 1229 01:00:36,470 --> 01:00:39,005 We know that the integral of f minus g-- 1230 01:00:39,005 --> 01:00:41,580 the integral of delta is 0. 1231 01:00:41,580 --> 01:00:47,344 So there's actually exactly the same area above and below. 1232 01:00:47,344 --> 01:00:49,880 But yeah, you're right. 1233 01:00:49,880 --> 01:00:51,349 You could go to the extreme cases. 1234 01:00:51,349 --> 01:00:51,890 You're right. 1235 01:00:57,470 --> 01:00:57,970 No. 1236 01:00:57,970 --> 01:01:00,490 It would actually still be true, even if there was-- 1237 01:01:00,490 --> 01:01:02,720 if this was a constant, that would still be true. 1238 01:01:02,720 --> 01:01:05,500 Here, I never use the fact that the integral is equal to 0. 1239 01:01:11,380 --> 01:01:15,560 I could shift this function by 1 so that the integral of delta 1240 01:01:15,560 --> 01:01:18,230 is equal to 1, and it would still 1241 01:01:18,230 --> 01:01:21,000 be true that it's maximized when I take A to be 1242 01:01:21,000 --> 01:01:24,892 the set where it's positive. 1243 01:01:24,892 --> 01:01:27,350 Just need to make sure that there is someplace where it is, 1244 01:01:27,350 --> 01:01:28,390 but that's about it.
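As a numerical sanity check on the identity just proved, here is a sketch of my own, not from the lecture, for two unit-variance Gaussians with means 0 and 1 (arbitrary choices): it evaluates both sides on a grid, the 1/2 integral of |f minus g| and Pf of A star minus Pg of A star, and they agree.

```python
import numpy as np

def gauss_pdf(x, mu):
    # density of N(mu, 1)
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

x = np.linspace(-10.0, 10.0, 200_001)   # grid wide enough to hold essentially all the mass
dx = x[1] - x[0]
f, g = gauss_pdf(x, 0.0), gauss_pdf(x, 1.0)

lhs = 0.5 * np.sum(np.abs(f - g)) * dx   # (1/2) * integral of |f - g|
a_star = f >= g                          # the set A* = {x : f(x) >= g(x)}
rhs = np.sum((f - g)[a_star]) * dx       # P_f(A*) - P_g(A*)
print(lhs, rhs)                          # both approximately 0.383
```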
1245 01:01:33,390 --> 01:01:36,981 Of course, we used this before, when we made this thing. 1246 01:01:36,981 --> 01:01:38,730 But just the last argument, this last fact 1247 01:01:38,730 --> 01:01:39,646 does not require that. 1248 01:01:43,820 --> 01:01:44,320 All right. 1249 01:01:44,320 --> 01:01:47,030 So now we have this notion of-- 1250 01:01:47,030 --> 01:01:48,358 I need the-- 1251 01:01:52,531 --> 01:01:53,030 OK. 1252 01:01:53,030 --> 01:01:57,450 So we have this notion of distance 1253 01:01:57,450 --> 01:01:58,830 between probability measures. 1254 01:01:58,830 --> 01:02:00,940 I mean, these things are exactly what-- 1255 01:02:00,940 --> 01:02:03,780 if I were to be in a formal math class and I said, 1256 01:02:03,780 --> 01:02:06,060 here are the axioms that a distance should satisfy, 1257 01:02:06,060 --> 01:02:08,640 those are exactly those things. 1258 01:02:08,640 --> 01:02:10,150 If it's not satisfying these things, 1259 01:02:10,150 --> 01:02:13,800 it's called a pseudo-distance or quasi-distance or just a metric 1260 01:02:13,800 --> 01:02:15,770 or nothing at all, honestly. 1261 01:02:15,770 --> 01:02:16,590 So it's a distance. 1262 01:02:16,590 --> 01:02:18,930 It's symmetric, non-negative, equal to 0 1263 01:02:18,930 --> 01:02:21,720 if and only if the two arguments are equal, and 1264 01:02:21,720 --> 01:02:25,870 it satisfies the triangle inequality. 1265 01:02:25,870 --> 01:02:28,860 And so that means that we have this actual total variation 1266 01:02:28,860 --> 01:02:31,140 distance between probability distributions. 1267 01:02:31,140 --> 01:02:36,510 And here is now a statistical strategy to implement our goal. 1268 01:02:36,510 --> 01:02:38,190 Remember, our goal was to spit out 1269 01:02:38,190 --> 01:02:41,940 a theta hat such that P theta 1270 01:02:41,940 --> 01:02:45,700 hat was close to P theta star. 1271 01:02:45,700 --> 01:02:48,940 So hopefully, we were trying to minimize the total variation 1272 01:02:48,940 --> 01:02:51,580 distance between P theta hat and P theta star. 1273 01:02:51,580 --> 01:02:55,090 Now, we cannot do that, because just by this fact, this slide, 1274 01:02:55,090 --> 01:02:57,340 if we wanted to do that directly, we would just take-- 1275 01:02:57,340 --> 01:02:59,830 well, let's take theta hat equals theta star and that will 1276 01:02:59,830 --> 01:03:00,880 give me the value 0. 1277 01:03:00,880 --> 01:03:03,196 And that's the minimum possible value we can take. 1278 01:03:03,196 --> 01:03:04,570 The problem is that we don't know 1279 01:03:04,570 --> 01:03:07,342 what the total variation is to something that we don't know. 1280 01:03:07,342 --> 01:03:09,550 We know how to compute total variations if I give you 1281 01:03:09,550 --> 01:03:10,660 the two arguments. 1282 01:03:10,660 --> 01:03:12,560 But here, one of the arguments is not known. 1283 01:03:12,560 --> 01:03:16,370 P theta star is not known to us, so we need to estimate it. 1284 01:03:16,370 --> 01:03:18,910 And so here is the strategy. 1285 01:03:18,910 --> 01:03:21,760 Just build an estimator of the total variation 1286 01:03:21,760 --> 01:03:24,580 distance between P theta and P theta star 1287 01:03:24,580 --> 01:03:27,250 for all candidate theta, all possible theta 1288 01:03:27,250 --> 01:03:30,240 in capital theta. 1289 01:03:30,240 --> 01:03:33,390 Now, if this is a good estimate, then when I minimize it, 1290 01:03:33,390 --> 01:03:37,230 I should get something that's close to P theta star.
1291 01:03:37,230 --> 01:03:38,220 So here's the strategy. 1292 01:03:38,220 --> 01:03:40,980 This is my function that maps theta 1293 01:03:40,980 --> 01:03:44,340 to the total variation between P theta and P theta star. 1294 01:03:44,340 --> 01:03:47,010 I know it's minimized at theta star. 1295 01:03:47,010 --> 01:03:51,090 That's definitely TV of P-- and the value here, the y-axis 1296 01:03:51,090 --> 01:03:53,300 should say 0. 1297 01:03:53,300 --> 01:03:54,800 And so I don't know this guy, so I'm 1298 01:03:54,800 --> 01:03:56,810 going to estimate it by some estimator that 1299 01:03:56,810 --> 01:03:57,680 comes from my data. 1300 01:03:57,680 --> 01:04:00,590 Hopefully, the more data I have, the better this estimator is. 1301 01:04:00,590 --> 01:04:03,391 And I'm going to try to minimize this estimator now. 1302 01:04:03,391 --> 01:04:05,390 And if the two things are close, then the minima 1303 01:04:05,390 --> 01:04:07,470 should be close. 1304 01:04:07,470 --> 01:04:09,560 That's a pretty good estimation strategy. 1305 01:04:09,560 --> 01:04:11,370 The problem is that it's very unclear 1306 01:04:11,370 --> 01:04:13,810 how you would build this estimator of TV, 1307 01:04:13,810 --> 01:04:18,710 of the Total Variation. 1308 01:04:18,710 --> 01:04:21,410 So building estimators, as I said, 1309 01:04:21,410 --> 01:04:25,160 typically consists in replacing expectations by averages. 1310 01:04:25,160 --> 01:04:29,130 But there's no simple way of expressing the total variation 1311 01:04:29,130 --> 01:04:31,230 distance as an expectation with respect 1312 01:04:31,230 --> 01:04:33,840 to theta star of anything. 1313 01:04:33,840 --> 01:04:36,060 So what we're going to do is we're 1314 01:04:36,060 --> 01:04:38,190 going to move from total variation distance 1315 01:04:38,190 --> 01:04:41,040 to another notion of distance that sort of has 1316 01:04:41,040 --> 01:04:43,020 the same properties and the same feeling 1317 01:04:43,020 --> 01:04:47,040 and the same motivations as the total variation distance. 1318 01:04:47,040 --> 01:04:49,650 But for this guy, we will be able to build 1319 01:04:49,650 --> 01:04:51,420 an estimate for it, because it's actually 1320 01:04:51,420 --> 01:04:53,929 going to be of the form expectation of something. 1321 01:04:53,929 --> 01:04:55,470 And we're going to be able to replace 1322 01:04:55,470 --> 01:05:00,280 the expectation by an average and then minimize this average. 1323 01:05:00,280 --> 01:05:04,290 So this surrogate for total variation distance 1324 01:05:04,290 --> 01:05:07,510 is actually called the Kullback-Leibler divergence. 1325 01:05:07,510 --> 01:05:09,760 And why we call it divergence is because it's actually 1326 01:05:09,760 --> 01:05:11,740 not a distance. 1327 01:05:11,740 --> 01:05:14,760 It's not going to be symmetric to start with. 1328 01:05:14,760 --> 01:05:17,400 So this Kullback-Leibler or even KL divergence-- 1329 01:05:17,400 --> 01:05:20,790 I will just refer to it as KL-- 1330 01:05:20,790 --> 01:05:22,860 is actually just more convenient. 1331 01:05:22,860 --> 01:05:27,480 But it has some roots coming from information theory, which 1332 01:05:27,480 --> 01:05:29,170 I will not delve into. 1333 01:05:29,170 --> 01:05:31,450 But if any of you is actually a Course 6 student, 1334 01:05:31,450 --> 01:05:32,970 I'm sure you've seen that in some-- 1335 01:05:32,970 --> 01:05:37,980 I don't know-- course that has any content on information 1336 01:05:37,980 --> 01:05:39,060 theory.
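Schematically, the strategy described on this slide is the following, in the notation used so far (the hat denotes a quantity built from the sample; this is a restatement, not new material):

```latex
% the ideal, but unavailable, plug-in strategy:
\hat{\theta} \;=\; \operatorname*{argmin}_{\theta \in \Theta}\;
  \widehat{\mathrm{TV}}\bigl(\mathbf{P}_{\theta},\, \mathbf{P}_{\theta^{*}}\bigr)
```

Since no good estimator of the total variation is available, the next step runs the same program with the Kullback-Leibler divergence in place of TV, precisely because KL can be written as an expectation and hence estimated by an average.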
1337 01:05:39,060 --> 01:05:39,560 All right. 1338 01:05:39,560 --> 01:05:42,380 So the KL divergence between two probability measures, P theta 1339 01:05:42,380 --> 01:05:43,790 and P theta prime-- 1340 01:05:43,790 --> 01:05:47,810 and here, as I said, it's not going to be the symmetric, 1341 01:05:47,810 --> 01:05:49,680 so it's very important for you to specify 1342 01:05:49,680 --> 01:05:51,930 which order you say it is, between P theta and P theta 1343 01:05:51,930 --> 01:05:52,429 prime. 1344 01:05:52,429 --> 01:05:55,060 It's different from saying between P theta prime and P 1345 01:05:55,060 --> 01:05:56,510 theta. 1346 01:05:56,510 --> 01:05:58,550 And so we denote it by KL. 1347 01:05:58,550 --> 01:06:04,010 And so remember, before we had either the sum or the integral 1348 01:06:04,010 --> 01:06:07,910 of 1/2 of the distance-- absolute value of the distance 1349 01:06:07,910 --> 01:06:10,550 between the PMFs and 1/2 of the absolute values 1350 01:06:10,550 --> 01:06:17,900 of the distances between the probability density functions. 1351 01:06:17,900 --> 01:06:19,940 And then we replace this absolute value 1352 01:06:19,940 --> 01:06:24,740 of the distance divided by 2 by this weird function. 1353 01:06:24,740 --> 01:06:28,100 This function is P theta, log P theta, 1354 01:06:28,100 --> 01:06:30,290 divided by P theta prime. 1355 01:06:30,290 --> 01:06:31,880 That's the function. 1356 01:06:31,880 --> 01:06:34,710 That's a weird function. 1357 01:06:34,710 --> 01:06:35,210 OK. 1358 01:06:35,210 --> 01:06:38,360 So this was what we had. 1359 01:06:40,960 --> 01:06:41,590 That's the TV. 1360 01:06:44,670 --> 01:06:48,120 And the KL, if I use the same notation, f and g, 1361 01:06:48,120 --> 01:06:57,315 is integral of f of X, log of f of X over g of X, dx. 1362 01:07:01,088 --> 01:07:04,280 It's a bit different. 1363 01:07:04,280 --> 01:07:09,120 And I go from discrete to continuous using an integral. 1364 01:07:09,120 --> 01:07:10,240 Everybody can read this. 1365 01:07:10,240 --> 01:07:11,365 Everybody's fine with this. 1366 01:07:11,365 --> 01:07:15,780 Is there any uncertainty about the actual definition here? 1367 01:07:15,780 --> 01:07:17,480 So here I go straight to the definition, 1368 01:07:17,480 --> 01:07:19,910 which is just plugging the functions 1369 01:07:19,910 --> 01:07:22,190 into some integral and compute. 1370 01:07:22,190 --> 01:07:24,670 So I don't bother with maxima or anything. 1371 01:07:24,670 --> 01:07:26,400 I mean, there is something like that, 1372 01:07:26,400 --> 01:07:29,885 but it's certainly not as natural as the total variation. 1373 01:07:29,885 --> 01:07:30,875 Yes? 1374 01:07:30,875 --> 01:07:33,845 AUDIENCE: The total variation, [INAUDIBLE].. 1375 01:07:38,732 --> 01:07:40,440 PHILIPPE RIGOLLET: Yes, just because it's 1376 01:07:40,440 --> 01:07:42,280 hard to build anything from total variation, 1377 01:07:42,280 --> 01:07:43,500 because I don't know it. 1378 01:07:43,500 --> 01:07:45,835 So it's very difficult. But if you can actually-- 1379 01:07:45,835 --> 01:07:47,910 and even computing it between two Gaussians, 1380 01:07:47,910 --> 01:07:49,680 just try it for yourself. 1381 01:07:49,680 --> 01:07:52,740 And please stop doing it after at most six minutes, 1382 01:07:52,740 --> 01:07:54,730 because you won't be able to do it. 
1383 01:07:54,730 --> 01:07:56,730 And so it's just very hard to manipulate; 1384 01:07:56,730 --> 01:07:59,070 this integral of absolute values of differences 1385 01:07:59,070 --> 01:08:01,230 between probability density functions, at least 1386 01:08:01,230 --> 01:08:02,771 for the probability density functions 1387 01:08:02,771 --> 01:08:04,860 we're used to manipulating, is actually a nightmare. 1388 01:08:04,860 --> 01:08:08,370 And so people prefer KL, because for the Gaussian, 1389 01:08:08,370 --> 01:08:10,770 this is going to be theta minus theta prime squared over 2. 1390 01:08:10,770 --> 01:08:12,580 And then we're going to be happy. 1391 01:08:12,580 --> 01:08:15,720 And so those things are much easier to manipulate. 1392 01:08:15,720 --> 01:08:18,029 But it's really-- the total variation 1393 01:08:18,029 --> 01:08:20,162 is telling you how far in the worst case 1394 01:08:20,162 --> 01:08:21,370 the two probabilities can be. 1395 01:08:21,370 --> 01:08:23,220 This is really the intrinsic notion 1396 01:08:23,220 --> 01:08:25,380 of closeness between probabilities. 1397 01:08:25,380 --> 01:08:27,229 So that's really the one-- if we could, 1398 01:08:27,229 --> 01:08:30,202 that's the one we would go after. 1399 01:08:30,202 --> 01:08:32,160 Sometimes people will compute them numerically, 1400 01:08:32,160 --> 01:08:34,785 so that they can say, oh, here's the total variation distance I 1401 01:08:34,785 --> 01:08:36,899 have between those two things. 1402 01:08:36,899 --> 01:08:38,670 And then you actually know that that 1403 01:08:38,670 --> 01:08:41,460 means they are close, because the absolute value-- if I tell 1404 01:08:41,460 --> 01:08:44,370 you total variation is 0.01, like we did here, 1405 01:08:44,370 --> 01:08:46,319 it has a very specific meaning. 1406 01:08:46,319 --> 01:08:49,762 If I tell you the KL divergence is 0.01, 1407 01:08:49,762 --> 01:08:50,970 it's not clear what it means. 1408 01:08:55,130 --> 01:08:55,760 OK. 1409 01:08:55,760 --> 01:08:58,109 So what are the properties? 1410 01:08:58,109 --> 01:09:00,870 The KL divergence between P theta and P theta prime 1411 01:09:00,870 --> 01:09:03,170 is different from the KL divergence between P theta 1412 01:09:03,170 --> 01:09:05,569 prime and P theta in general. 1413 01:09:05,569 --> 01:09:07,640 Of course, in general, because if theta 1414 01:09:07,640 --> 01:09:11,029 is equal to theta prime, then this certainly is true. 1415 01:09:11,029 --> 01:09:14,600 So there's cases when it's not true. 1416 01:09:14,600 --> 01:09:17,090 The KL divergence is non-negative. 1417 01:09:17,090 --> 01:09:19,742 Who knows Jensen's inequality here? 1418 01:09:19,742 --> 01:09:21,450 That should be a subset of the people who 1419 01:09:21,450 --> 01:09:25,310 raised their hand when I asked what a convex function is. 1420 01:09:25,310 --> 01:09:26,090 All right. 1421 01:09:26,090 --> 01:09:27,890 So you know what Jensen's inequality is. 1422 01:09:27,890 --> 01:09:30,490 This is Jensen's-- the proof is just one step 1423 01:09:30,490 --> 01:09:33,840 Jensen's inequality, which we will not go into in detail. 1424 01:09:33,840 --> 01:09:35,569 But that's basically an inequality 1425 01:09:35,569 --> 01:09:38,233 involving expectation of a convex function 1426 01:09:38,233 --> 01:09:40,399 of a random variable compared to the convex function 1427 01:09:40,399 --> 01:09:42,065 of the expectation of a random variable. 1428 01:09:45,460 --> 01:09:48,580 If you know Jensen, have fun and prove it.
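For anyone who wants to take up the invitation, here is the one-step Jensen argument, a standard proof sketched in the continuous notation used above (with g/f read where f is positive; the discrete case is identical with sums in place of integrals):

```latex
\mathrm{KL}(f, g)
  \;=\; \int f \,\log\frac{f}{g}
  \;=\; \mathbb{E}_{f}\!\left[-\log\frac{g(X)}{f(X)}\right]
  \;\ge\; -\log \mathbb{E}_{f}\!\left[\frac{g(X)}{f(X)}\right]
  \;=\; -\log \int g
  \;=\; -\log 1
  \;=\; 0,
```

where the inequality is Jensen's, applied to the convex function minus log.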
1429 01:09:48,580 --> 01:09:51,729 What's really nice is that if the KL is equal to 0, 1430 01:09:51,729 --> 01:09:55,220 then the two distributions are the same. 1431 01:09:55,220 --> 01:09:57,170 And that's something we're looking for. 1432 01:09:57,170 --> 01:09:59,020 Everything else we're happy to throw out. 1433 01:09:59,020 --> 01:10:00,478 And actually, if you pay attention, 1434 01:10:00,478 --> 01:10:03,500 we're actually really throwing out everything else. 1435 01:10:03,500 --> 01:10:05,060 So they're not symmetric. 1436 01:10:05,060 --> 01:10:08,530 It does not satisfy the triangle inequality in general. 1437 01:10:08,530 --> 01:10:12,790 But it's non-negative and it's 0 if and only if the two 1438 01:10:12,790 --> 01:10:13,922 distributions are the same. 1439 01:10:13,922 --> 01:10:15,130 And that's all we care about. 1440 01:10:15,130 --> 01:10:17,129 And that's what we call a divergence rather than 1441 01:10:17,129 --> 01:10:21,910 a distance, and divergence will be enough for our purposes. 1442 01:10:21,910 --> 01:10:24,080 And actually, this asymmetry, the fact 1443 01:10:24,080 --> 01:10:26,570 that it's not flipping-- the first time I saw it, 1444 01:10:26,570 --> 01:10:27,380 I was just annoyed. 1445 01:10:27,380 --> 01:10:29,225 I was like, can we just like, I don't 1446 01:10:29,225 --> 01:10:31,550 know, take the average of the KL between P theta 1447 01:10:31,550 --> 01:10:34,270 and P theta prime and P theta prime and P theta, 1448 01:10:34,270 --> 01:10:36,290 you would think maybe you could do this. 1449 01:10:36,290 --> 01:10:39,590 You just symmetrize it by just taking the average of the two 1450 01:10:39,590 --> 01:10:41,480 possible values it can take. 1451 01:10:41,480 --> 01:10:44,930 The problem is that this will still not satisfy the triangle 1452 01:10:44,930 --> 01:10:45,500 inequality. 1453 01:10:45,500 --> 01:10:48,290 And there's no way basically to turn it into something 1454 01:10:48,290 --> 01:10:49,850 that is a distance. 1455 01:10:49,850 --> 01:10:52,350 But the divergence is doing a pretty good thing for us. 1456 01:10:52,350 --> 01:10:55,790 And this is what will allow us to estimate it and basically 1457 01:10:55,790 --> 01:11:03,160 overcome what we could not do with the total variation. 1458 01:11:03,160 --> 01:11:06,410 So the first thing that you want to notice 1459 01:11:06,410 --> 01:11:08,120 is the total variation distance-- 1460 01:11:08,120 --> 01:11:10,130 the KL divergence, sorry, is actually 1461 01:11:10,130 --> 01:11:12,470 an expectation of something. 1462 01:11:12,470 --> 01:11:15,260 Look at what it is here. 1463 01:11:15,260 --> 01:11:20,420 It's the integral of some function against a density. 1464 01:11:20,420 --> 01:11:25,230 That's exactly the definition of an expectation, right? 1465 01:11:25,230 --> 01:11:29,950 So this is the expectation of this particular function 1466 01:11:29,950 --> 01:11:31,730 with respect to this density f. 1467 01:11:31,730 --> 01:11:35,650 So in particular, if I call this density f-- if I say, 1468 01:11:35,650 --> 01:11:38,400 I want the true distribution to be the first argument, 1469 01:11:38,400 --> 01:11:39,920 this is an expectation with respect 1470 01:11:39,920 --> 01:11:42,310 to the true distribution from which my data is actually 1471 01:11:42,310 --> 01:11:45,760 drawn of the log of this ratio. 1472 01:11:45,760 --> 01:11:46,870 So ha ha. 1473 01:11:46,870 --> 01:11:47,700 I'm a statistician. 1474 01:11:47,700 --> 01:11:49,300 Now I have an expectation.
1475 01:11:49,300 --> 01:11:51,430 I can replace it by an average, because I have data 1476 01:11:51,430 --> 01:11:52,524 from this distribution. 1477 01:11:52,524 --> 01:11:54,940 And I could actually replace the expectation by an average 1478 01:11:54,940 --> 01:11:56,680 and try to minimize here. 1479 01:11:56,680 --> 01:11:57,959 The problem is that-- 1480 01:11:57,959 --> 01:12:00,250 actually the star here should be in front of the theta, 1481 01:12:00,250 --> 01:12:01,150 not of the P, right? 1482 01:12:01,150 --> 01:12:04,460 That's P theta star, not P star theta. 1483 01:12:04,460 --> 01:12:05,960 But here, I still cannot compute it, 1484 01:12:05,960 --> 01:12:08,510 because I have this P theta star that shows up. 1485 01:12:08,510 --> 01:12:10,220 I don't know what it is. 1486 01:12:10,220 --> 01:12:13,500 And that's now where the log plays a role. 1487 01:12:13,500 --> 01:12:15,050 If you actually pay attention, I said 1488 01:12:15,050 --> 01:12:16,940 you can use Jensen to prove all this stuff. 1489 01:12:16,940 --> 01:12:21,110 You could actually replace the log by any concave function. 1490 01:12:21,110 --> 01:12:22,440 That would be an f divergence. 1491 01:12:22,440 --> 01:12:24,030 That's called an f divergence. 1492 01:12:24,030 --> 01:12:26,950 But the log itself has a very, very specific property, 1493 01:12:26,950 --> 01:12:29,790 which allows us to say that the log of the ratio 1494 01:12:29,790 --> 01:12:33,290 is the difference of the logs. 1495 01:12:33,290 --> 01:12:38,620 Now, this thing here does not depend on theta. 1496 01:12:38,620 --> 01:12:43,010 If I think of this KL divergence as a function of theta, 1497 01:12:43,010 --> 01:12:45,239 then the first part is actually a constant. 1498 01:12:45,239 --> 01:12:47,530 If I change theta, this thing is never going to change. 1499 01:12:47,530 --> 01:12:49,980 It depends only on theta star. 1500 01:12:49,980 --> 01:12:51,480 So if I look at this function KL-- 1501 01:13:03,200 --> 01:13:05,500 so if I look at the function, theta maps 1502 01:13:05,500 --> 01:13:11,450 to KL P theta star, P theta, it's 1503 01:13:11,450 --> 01:13:15,400 of the form expectation with respect to theta star, 1504 01:13:15,400 --> 01:13:23,780 log of P theta star of X. And then I 1505 01:13:23,780 --> 01:13:29,610 have minus expectation with respect to theta star of log 1506 01:13:29,610 --> 01:13:33,340 of P theta of X. 1507 01:13:33,340 --> 01:13:38,900 Now as I said, this thing here, this second expectation 1508 01:13:38,900 --> 01:13:39,950 is a function of theta. 1509 01:13:39,950 --> 01:13:42,381 When theta changes, this thing is going to change. 1510 01:13:42,381 --> 01:13:43,380 And that's a good thing. 1511 01:13:43,380 --> 01:13:45,754 We want something that reflects how close theta and theta 1512 01:13:45,754 --> 01:13:46,537 star are. 1513 01:13:46,537 --> 01:13:48,120 But this thing is not going to change. 1514 01:13:48,120 --> 01:13:49,620 This is a fixed value. 1515 01:13:49,620 --> 01:13:53,125 Actually, it's the negative entropy of P theta star. 1516 01:13:53,125 --> 01:13:54,500 And if you've heard of KL, you've 1517 01:13:54,500 --> 01:13:55,583 probably heard of entropy. 1518 01:13:55,583 --> 01:13:58,820 And that's what-- it's basically minus the entropy. 1519 01:13:58,820 --> 01:14:01,310 And that's a quantity that just depends on theta star. 1520 01:14:01,310 --> 01:14:03,470 But it's just the number.
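In symbols, the decomposition just described is the following, writing p for the PMF or the density and C(theta star) for the first term, the constant in theta (this is a restatement of the board computation, nothing new):

```latex
\mathrm{KL}\bigl(\mathbf{P}_{\theta^{*}}, \mathbf{P}_{\theta}\bigr)
  \;=\; \mathbb{E}_{\theta^{*}}\bigl[\log p_{\theta^{*}}(X)\bigr]
        \;-\; \mathbb{E}_{\theta^{*}}\bigl[\log p_{\theta}(X)\bigr]
  \;=\; \underbrace{C(\theta^{*})}_{\text{does not depend on }\theta}
        \;-\; \mathbb{E}_{\theta^{*}}\bigl[\log p_{\theta}(X)\bigr].
```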
1521 01:14:03,470 --> 01:14:05,030 I could compute this number if I told 1522 01:14:05,030 --> 01:14:07,130 you this is N(theta star, 1). 1523 01:14:07,130 --> 01:14:09,450 You could compute this. 1524 01:14:09,450 --> 01:14:11,640 So now I'm going to try to minimize 1525 01:14:11,640 --> 01:14:14,500 the estimate of this function. 1526 01:14:14,500 --> 01:14:16,870 And minimizing a function or a function plus a constant 1527 01:14:16,870 --> 01:14:18,800 is the same thing. 1528 01:14:18,800 --> 01:14:20,840 I'm just shifting the function here or here, 1529 01:14:20,840 --> 01:14:23,560 but it's the same minimizer. 1530 01:14:23,560 --> 01:14:24,060 OK. 1531 01:14:24,060 --> 01:14:28,910 So the function that maps theta to KL of P theta star 1532 01:14:28,910 --> 01:14:32,370 to P theta is of the form constant minus this expectation 1533 01:14:32,370 --> 01:14:35,810 of a log of P theta. 1534 01:14:35,810 --> 01:14:38,070 Everybody agrees? 1535 01:14:38,070 --> 01:14:40,610 Are there any questions about this? 1536 01:14:40,610 --> 01:14:42,740 Are there any remarks, including I 1537 01:14:42,740 --> 01:14:46,230 have no idea what's happening right now? 1538 01:14:46,230 --> 01:14:46,730 OK. 1539 01:14:46,730 --> 01:14:47,700 We're good? 1540 01:14:47,700 --> 01:14:48,200 Yeah. 1541 01:14:48,200 --> 01:14:50,160 AUDIENCE: So when you're actually employing this method, 1542 01:14:50,160 --> 01:14:52,610 how do you know which theta to use as theta star and which 1543 01:14:52,610 --> 01:14:53,142 isn't? 1544 01:14:53,142 --> 01:14:55,600 PHILIPPE RIGOLLET: So this is not a method just yet, right? 1545 01:14:55,600 --> 01:14:57,550 I'm just describing to you what the KL divergence 1546 01:14:57,550 --> 01:14:58,720 between two distributions is. 1547 01:14:58,720 --> 01:15:00,130 If you really wanted to compute it, 1548 01:15:00,130 --> 01:15:01,930 you would need to know what P theta star is 1549 01:15:01,930 --> 01:15:02,770 and what P theta is. 1550 01:15:02,770 --> 01:15:03,467 AUDIENCE: Right. 1551 01:15:03,467 --> 01:15:06,050 PHILIPPE RIGOLLET: And so here, I'm just saying at some point, 1552 01:15:06,050 --> 01:15:07,650 we still-- so here, you see-- 1553 01:15:07,650 --> 01:15:09,280 so now let's move on one step. 1554 01:15:09,280 --> 01:15:12,570 I don't know the expectation with respect to theta star. 1555 01:15:12,570 --> 01:15:15,904 But I have data that comes from distribution P theta star. 1556 01:15:15,904 --> 01:15:17,820 So the expectation by the law of large numbers 1557 01:15:17,820 --> 01:15:19,950 should be close to the average. 1558 01:15:19,950 --> 01:15:23,670 And so what I'm doing is I'm replacing any-- 1559 01:15:23,670 --> 01:15:27,390 I can actually-- this is a very standard estimation method. 1560 01:15:27,390 --> 01:15:30,360 You write something as an expectation with respect 1561 01:15:30,360 --> 01:15:34,380 to the data-generating process of some function. 1562 01:15:34,380 --> 01:15:37,349 And then you replace this by the average of this function. 1563 01:15:37,349 --> 01:15:38,890 And the law of large numbers tells me 1564 01:15:38,890 --> 01:15:41,326 that those two quantities should actually be close. 1565 01:15:41,326 --> 01:15:43,820 Now, it doesn't mean that's going to be the end of the day, 1566 01:15:43,820 --> 01:15:44,319 right. 1567 01:15:44,319 --> 01:15:46,950 When we did Xn bar, that was the end of the day. 1568 01:15:46,950 --> 01:15:47,900 We had an expectation. 1569 01:15:47,900 --> 01:15:49,850 We replaced it by an average.
1570 01:15:49,850 --> 01:15:51,170 And then we were gone. 1571 01:15:51,170 --> 01:15:53,376 But here, we still have to do something, 1572 01:15:53,376 --> 01:15:55,250 because this is not telling me what theta is. 1573 01:15:55,250 --> 01:15:58,070 Now I still have to minimize this average. 1574 01:15:58,070 --> 01:16:04,370 So this is now my candidate estimator for KL, KL hat. 1575 01:16:04,370 --> 01:16:06,170 And that's the one where I said, well, it's 1576 01:16:06,170 --> 01:16:07,897 going to be of the form of constant. 1577 01:16:07,897 --> 01:16:09,230 And this constant, I don't know. 1578 01:16:09,230 --> 01:16:09,771 You're right. 1579 01:16:09,771 --> 01:16:11,586 I have no idea what this constant is. 1580 01:16:11,586 --> 01:16:13,640 It depends on P theta star. 1581 01:16:13,640 --> 01:16:16,310 But then I have minus something that I can completely compute. 1582 01:16:16,310 --> 01:16:20,170 If you give me data and theta, I can compute this entire thing. 1583 01:16:20,170 --> 01:16:25,670 And now what I claim is that the minimizer of f or f plus-- 1584 01:16:25,670 --> 01:16:28,950 f of X or f of X plus 4 are the same thing, 1585 01:16:28,950 --> 01:16:32,200 or say 4 plus f of X. I'm just shifting 1586 01:16:32,200 --> 01:16:34,260 the plot of my function up and down, 1587 01:16:34,260 --> 01:16:36,340 but the minimizer stays exactly where it is. 1588 01:16:39,590 --> 01:16:41,075 If I have a function-- 1589 01:16:43,750 --> 01:16:45,284 so now I have a function of theta. 1590 01:16:51,620 --> 01:16:56,100 This is KL hat of P theta star, P theta. 1591 01:16:56,100 --> 01:16:58,831 And it's of the form-- it's a function like this. 1592 01:16:58,831 --> 01:17:00,330 I don't know where this function is. 1593 01:17:00,330 --> 01:17:06,880 It might very well be this function or this function. 1594 01:17:06,880 --> 01:17:10,870 Every time it's a translation on the y-axis of all these guys. 1595 01:17:10,870 --> 01:17:14,690 And the value that I translated by depends on theta star. 1596 01:17:14,690 --> 01:17:15,970 I don't know what it is. 1597 01:17:15,970 --> 01:17:19,600 But what I claim is that the minimizer is always this guy, 1598 01:17:19,600 --> 01:17:22,428 regardless of what the value is. 1599 01:17:22,428 --> 01:17:25,290 OK? 1600 01:17:25,290 --> 01:17:28,560 So when I say constant, it's a constant with respect to theta. 1601 01:17:28,560 --> 01:17:29,670 It's an unknown constant. 1602 01:17:29,670 --> 01:17:32,490 But it's with respect to theta, so without loss of generality, 1603 01:17:32,490 --> 01:17:36,840 I can assume that this constant is 0 for my purposes, 1604 01:17:36,840 --> 01:17:38,040 or 25 if you prefer. 1605 01:17:41,171 --> 01:17:41,670 All right. 1606 01:17:41,670 --> 01:17:46,420 So we'll just keep going on this property next time. 1607 01:17:46,420 --> 01:17:49,359 And we'll see how from here we can move on to-- 1608 01:17:49,359 --> 01:17:51,900 the likelihood is actually going to come out of this formula. 1609 01:17:51,900 --> 01:17:53,450 Thanks.
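To preview where this is heading, here is a minimal end-to-end sketch of the strategy on a Bernoulli model. It is my own illustration anticipating the next lecture, not a quote from it: replace the expectation in the KL, up to its unknown constant, by an average over the sample, and minimize over theta. Minimizing the average of minus log p theta of Xi is exactly maximizing the likelihood, which is the formula the lecture ends by pointing toward.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.3                               # unknown in practice; fixed here to check
x = rng.binomial(1, theta_star, size=5_000)    # iid Bernoulli(theta star) sample

def kl_hat_up_to_constant(theta, x):
    # estimate of KL(P_theta*, P_theta) minus its unknown constant C(theta*):
    # -(1/n) * sum_i log p_theta(X_i), with p_theta(x) = theta^x * (1 - theta)^(1 - x)
    return -np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)             # candidate thetas
values = [kl_hat_up_to_constant(t, x) for t in grid]
theta_hat = grid[int(np.argmin(values))]
print(theta_hat, x.mean())                     # both close to theta_star = 0.3
```

The unknown constant shifts the whole curve up or down but leaves the minimizer untouched, which is precisely the point made at the end of the lecture.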