1
00:00:00,120 --> 00:00:02,460
The following content is
provided under a Creative
2
00:00:02,460 --> 00:00:03,880
Commons license.
3
00:00:03,880 --> 00:00:06,090
Your support will help
MIT OpenCourseWare
4
00:00:06,090 --> 00:00:10,180
continue to offer high-quality
educational resources for free.
5
00:00:10,180 --> 00:00:12,720
To make a donation or to
view additional materials
6
00:00:12,720 --> 00:00:16,650
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:16,650 --> 00:00:17,880
at ocw.mit.edu.
8
00:00:20,524 --> 00:00:21,940
PHILIPPE RIGOLLET:
So today, we're
9
00:00:21,940 --> 00:00:24,820
going to close this
chapter, this short chapter,
10
00:00:24,820 --> 00:00:26,200
on Bayesian inference.
11
00:00:26,200 --> 00:00:28,990
Again, this was just
an overview of what you
12
00:00:28,990 --> 00:00:32,259
can do in Bayesian inference.
13
00:00:32,259 --> 00:00:34,630
And last time, we
started defining
14
00:00:34,630 --> 00:00:36,260
what's called Jeffreys priors.
15
00:00:36,260 --> 00:00:36,760
Right?
16
00:00:36,760 --> 00:00:38,560
So when you do
Bayesian inference,
17
00:00:38,560 --> 00:00:41,620
you have to introduce a
prior on your parameter.
18
00:00:41,620 --> 00:00:43,660
And we said that
usually, it's something
19
00:00:43,660 --> 00:00:45,820
that encodes your domain
knowledge about where
20
00:00:45,820 --> 00:00:47,130
the parameter could be.
21
00:00:47,130 --> 00:00:49,030
But there's also some
principle way to do it,
22
00:00:49,030 --> 00:00:51,155
if you want to do Bayesian
inference without really
23
00:00:51,155 --> 00:00:53,420
having to think about it.
24
00:00:53,420 --> 00:00:56,260
And for example, one
of the natural priors
25
00:00:56,260 --> 00:00:58,080
were those non-informative
priors, right?
26
00:00:58,080 --> 00:00:59,740
If you were on a
compact set, it's
27
00:00:59,740 --> 00:01:01,570
a uniform prior of this set.
28
00:01:01,570 --> 00:01:04,239
If you're on an infinite set,
you can still think of taking
29
00:01:04,239 --> 00:01:06,520
the constant prior.
30
00:01:06,520 --> 00:01:09,280
And that's called a flat prior.
That's always equal to 1.
31
00:01:09,280 --> 00:01:13,300
And that's an improper prior
if you are on an infinite set
32
00:01:13,300 --> 00:01:14,830
or proportional to one.
33
00:01:14,830 --> 00:01:17,860
And so another prior
that you can think of,
34
00:01:17,860 --> 00:01:20,230
in the case where you have
a Fisher information, which
35
00:01:20,230 --> 00:01:23,200
is well-defined, is something
called Jeffreys prior.
36
00:01:23,200 --> 00:01:25,600
And this prior is
a prior which is
37
00:01:25,600 --> 00:01:28,150
proportional to square root of
the determinant of the Fisher
38
00:01:28,150 --> 00:01:29,780
information matrix.
39
00:01:29,780 --> 00:01:31,750
And if you're in
one dimension, it's
40
00:01:31,750 --> 00:01:37,750
basically proportional to
a square root of the Fisher
41
00:01:37,750 --> 00:01:40,750
information coefficient,
which we know, for example,
42
00:01:40,750 --> 00:01:44,170
is the asymptotic variance
of the maximum likelihood
43
00:01:44,170 --> 00:01:45,370
estimator.
44
00:01:45,370 --> 00:01:48,010
And it turns out
that it's basically--
45
00:01:48,010 --> 00:01:50,330
So square root of this
thing is basically
46
00:01:50,330 --> 00:01:54,160
one over the standard deviation
of the maximum likelihood
47
00:01:54,160 --> 00:01:55,150
estimator.
48
00:01:55,150 --> 00:01:56,690
And so you can
compute this, right?
49
00:01:56,690 --> 00:01:59,944
So you can compute for the
maximum likelihood estimator.
50
00:01:59,944 --> 00:02:01,360
We know that the
variance is going
51
00:02:01,360 --> 00:02:09,910
to be p(1 - p)
in the Bernoulli
52
00:02:09,910 --> 00:02:11,200
statistical experiment.
53
00:02:11,200 --> 00:02:13,510
So you get this one over the
square root of this thing.
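As a quick numerical sanity check (not part of the lecture; the code is illustrative), the Bernoulli Jeffreys prior pi(p) proportional to 1 over the square root of p(1 - p) is exactly the Beta(1/2, 1/2) density, whose normalizing constant is the number pi:

```python
import math

def jeffreys_density_bernoulli(p):
    """Jeffreys prior for Bernoulli(p): pi(p) proportional to
    sqrt(I(p)) = 1 / sqrt(p * (1 - p)).
    Up to normalization this is the Beta(1/2, 1/2) density."""
    return 1.0 / math.sqrt(p * (1.0 - p))

# The normalizing constant is math.pi (the Beta(1/2, 1/2) normalizer),
# which we can check with a crude midpoint-rule integration.
n = 100000
total = sum(jeffreys_density_bernoulli((i + 0.5) / n) / n for i in range(n))
print(total)  # close to pi, about 3.14
```

The midpoint rule avoids the (integrable) singularities at 0 and 1, so the crude sum gets within a few thousandths of pi.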
54
00:02:13,510 --> 00:02:16,720
And for example, in
the Gaussian setting,
55
00:02:16,720 --> 00:02:19,880
you actually have the
Fisher information,
56
00:02:19,880 --> 00:02:22,000
even in the multi-variate
one, is actually
57
00:02:22,000 --> 00:02:24,752
going to be something
like the identity matrix.
58
00:02:24,752 --> 00:02:25,960
So this is proportional to 1.
59
00:02:25,960 --> 00:02:29,530
It's the improper prior that
you get, in this case, OK?
60
00:02:29,530 --> 00:02:31,690
Meaning that, for
the Gaussian setting,
61
00:02:31,690 --> 00:02:33,880
no place where you
center your Gaussian
62
00:02:33,880 --> 00:02:36,020
is actually better
than any other.
63
00:02:36,020 --> 00:02:36,520
All right.
64
00:02:36,520 --> 00:02:40,130
So we basically
left on this slide,
65
00:02:40,130 --> 00:02:43,570
where we saw that
Jeffreys priors satisfy
66
00:02:43,570 --> 00:02:46,170
a reparametrization
invariance-- they're invariant
67
00:02:46,170 --> 00:02:49,180
by transformation of
your parameter, which
68
00:02:49,180 --> 00:02:51,920
is a desirable property.
69
00:02:51,920 --> 00:02:57,217
And the way it works: it says that, well,
if I have my prior on theta,
70
00:02:57,217 --> 00:02:59,050
and then I suddenly
decide that theta is not
71
00:02:59,050 --> 00:03:01,720
the parameter I want to use
to parameterize my problem,
72
00:03:01,720 --> 00:03:04,640
actually what I want
is phi of theta.
73
00:03:04,640 --> 00:03:07,840
So think, for example, as theta
being the mean of a Gaussian,
74
00:03:07,840 --> 00:03:11,140
and phi of theta as
being the mean cubed.
75
00:03:11,140 --> 00:03:11,920
OK?
76
00:03:11,920 --> 00:03:15,520
This is a one-to-one
map phi, right?
77
00:03:15,520 --> 00:03:20,185
So for example, if I want to
go from theta to theta cubed,
78
00:03:20,185 --> 00:03:22,840
and now I decide that this is
the actual parameter that I
79
00:03:22,840 --> 00:03:26,200
want, well, then it means
that, on this parameter,
80
00:03:26,200 --> 00:03:29,110
my original prior is going
to induce another prior.
81
00:03:29,110 --> 00:03:30,970
And here, it says,
well, this prior
82
00:03:30,970 --> 00:03:33,200
is actually also Jeffreys prior.
83
00:03:33,200 --> 00:03:33,700
OK?
84
00:03:33,700 --> 00:03:35,450
So it's essentially
telling you that,
85
00:03:35,450 --> 00:03:38,410
for this new parametrization,
if you take Jeffreys prior, then
86
00:03:38,410 --> 00:03:41,201
you actually go back to having
exactly something that's
87
00:03:41,201 --> 00:03:43,450
of the form square root
of determinant of the Fisher
88
00:03:43,450 --> 00:03:45,116
information, but this
thing with respect
89
00:03:45,116 --> 00:03:47,810
to your new
parametrization. All right.
90
00:03:47,810 --> 00:03:50,360
And so why is this true?
91
00:03:50,360 --> 00:03:53,440
Well, it's just this
change of variable theorem.
92
00:03:53,440 --> 00:03:58,330
So it's essentially telling
you that, if you call--
93
00:03:58,330 --> 00:04:08,850
let's call p-- well, let's call
pi tilde of eta the prior over eta.
94
00:04:08,850 --> 00:04:11,130
And you have pi of
theta as the prior
95
00:04:11,130 --> 00:04:18,040
over theta. Then, since eta
is of the form phi of theta,
96
00:04:18,040 --> 00:04:26,620
just by change of variable,
so that's essentially
97
00:04:26,620 --> 00:04:33,070
a probability result. It
says that pi tilde of eta
98
00:04:33,070 --> 00:04:42,790
is equal to pi of
theta times d
99
00:04:42,790 --> 00:04:48,860
theta over d eta and--
100
00:04:55,706 --> 00:04:57,189
sorry, is that the one?
101
00:04:57,189 --> 00:04:58,730
Sorry, I'm going to
have to write it,
102
00:04:58,730 --> 00:04:59,938
because I always forget this.
103
00:05:05,209 --> 00:05:07,380
So if I take a function--
104
00:05:14,380 --> 00:05:14,960
OK.
105
00:05:14,960 --> 00:05:16,400
So what I want is to check.
106
00:05:38,340 --> 00:05:41,870
OK, so I want a function
h of eta that I can use here.
107
00:05:41,870 --> 00:05:48,480
And what I know is that
this is h of phi of theta.
108
00:05:48,480 --> 00:05:48,980
All right?
109
00:05:48,980 --> 00:05:51,810
So sorry, eta is
phi of theta, right?
110
00:05:51,810 --> 00:05:53,471
Yeah.
111
00:05:53,471 --> 00:05:54,970
So what I'm going
to do is I'm going
112
00:05:54,970 --> 00:06:09,130
to do the change of variable,
theta is phi inverse of eta.
113
00:06:09,130 --> 00:06:14,120
So eta is phi of
theta, which means
114
00:06:14,120 --> 00:06:20,540
that d eta is equal to d--
115
00:06:20,540 --> 00:06:26,020
well, to phi prime
of theta d theta.
116
00:06:26,020 --> 00:06:31,464
So when I'm going to write this,
I'm going to get integral of h.
117
00:06:31,464 --> 00:06:33,470
Actually, let me
write this, as I
118
00:06:33,470 --> 00:06:36,980
am more comfortable
writing this as e
119
00:06:36,980 --> 00:06:40,031
with respect to eta of h of eta.
120
00:06:40,031 --> 00:06:40,530
OK?
121
00:06:40,530 --> 00:06:44,580
So that's just with eta
drawn from the prior.
122
00:06:44,580 --> 00:06:47,670
And I want to write this as
the integral of h of eta times
123
00:06:47,670 --> 00:06:49,080
some function, right?
124
00:06:49,080 --> 00:06:58,580
So this is the
integral of h of phi
125
00:06:58,580 --> 00:07:03,556
of theta pi of theta d theta.
126
00:07:03,556 --> 00:07:06,150
Now, I'm going to do
my change of variable.
127
00:07:06,150 --> 00:07:09,290
So this is going to be
the integral of h of eta.
128
00:07:09,290 --> 00:07:16,420
And then pi of phi of--
129
00:07:16,420 --> 00:07:20,290
so theta is phi inverse of eta.
130
00:07:20,290 --> 00:07:27,390
And then d eta is phi
prime of theta d theta, OK?
131
00:07:27,390 --> 00:07:30,210
And so what is pi at phi inverse of eta?
132
00:07:30,210 --> 00:07:32,120
So this thing is proportional.
133
00:07:32,120 --> 00:07:33,750
So we're in, say,
dimension 1, so it's
134
00:07:33,750 --> 00:07:38,420
proportional of square root
of the Fisher information.
135
00:07:38,420 --> 00:07:39,920
And the Fisher
information, we know,
136
00:07:39,920 --> 00:07:44,630
is the expectation of the square
of the derivative of the log
137
00:07:44,630 --> 00:07:45,770
likelihood, right?
138
00:07:45,770 --> 00:07:48,740
So this is square root
of the expectation
139
00:07:48,740 --> 00:08:03,650
of d over d theta of log of--
140
00:08:03,650 --> 00:08:06,010
well, now, I need the density.
141
00:08:06,010 --> 00:08:10,050
Well, let's just
call it l of theta.
142
00:08:10,050 --> 00:08:17,030
And I want this to be taken
at phi inverse of eta squared.
143
00:08:19,980 --> 00:08:21,480
And then what I pick up is the--
144
00:08:23,771 --> 00:08:25,770
so I'm going to put
everything under the square.
145
00:08:25,770 --> 00:08:31,460
So I get phi prime of
theta squared d theta.
146
00:08:31,460 --> 00:08:33,260
OK?
147
00:08:33,260 --> 00:08:35,090
So now, I have the
expectation of a square.
148
00:08:35,090 --> 00:08:38,539
This does not depend, so this
is-- sorry, this is l of theta.
149
00:08:38,539 --> 00:08:42,307
This is the expectation of
l of theta of an x, right?
150
00:08:42,307 --> 00:08:44,390
That's for some variable,
and the expectation here
151
00:08:44,390 --> 00:08:45,710
is with respect to x.
152
00:08:45,710 --> 00:08:49,824
That's just the definition
of the Fisher information.
153
00:08:49,824 --> 00:08:52,240
So now I'm going to squeeze
this guy into the expectation.
154
00:08:52,240 --> 00:08:53,260
It does not depend on x.
155
00:08:53,260 --> 00:08:55,412
It just acts as a constant.
156
00:08:55,412 --> 00:08:57,370
And so what I have now
is that this is actually
157
00:08:57,370 --> 00:08:59,760
proportional to
the integral of h
158
00:08:59,760 --> 00:09:05,320
eta times the square root of
the expectation with respect
159
00:09:05,320 --> 00:09:06,600
to x of what?
160
00:09:06,600 --> 00:09:10,540
Well, here, I have d over
d theta of log of l of theta.
161
00:09:10,540 --> 00:09:15,620
And here, this guy is really
d eta over d theta, right?
162
00:09:19,524 --> 00:09:21,480
Agree?
163
00:09:21,480 --> 00:09:24,720
So now, what I'm really left
with-- so I get d over d theta
164
00:09:24,720 --> 00:09:25,520
times d--
165
00:09:25,520 --> 00:09:28,047
sorry, times d theta over d eta.
166
00:09:42,980 --> 00:09:51,396
so that's just d over
d eta of log of l of eta at x.
167
00:10:00,198 --> 00:10:04,370
And then this guy is now
becoming d eta, right?
168
00:10:04,370 --> 00:10:06,590
OK, so this was a mess.
169
00:10:09,710 --> 00:10:12,320
This is a complete mess, because
I actually want to use phi.
170
00:10:12,320 --> 00:10:14,150
I should not actually
introduce phi at all.
171
00:10:14,150 --> 00:10:21,930
I should just talk about d eta
over d theta type of things.
172
00:10:21,930 --> 00:10:24,370
And then that would actually
make my life so much easier.
173
00:10:24,370 --> 00:10:25,002
OK.
174
00:10:25,002 --> 00:10:26,710
I'm not going to spend
more time on this.
175
00:10:26,710 --> 00:10:28,210
This is really just
the idea, right?
176
00:10:28,210 --> 00:10:30,170
You have square root
of a square in there.
177
00:10:30,170 --> 00:10:31,480
And then, when you do
your change of variable,
178
00:10:31,480 --> 00:10:32,710
you just pick up a square.
179
00:10:32,710 --> 00:10:35,750
You just pick up
something in here.
180
00:10:35,750 --> 00:10:38,110
And so you just move
this thing in there.
181
00:10:38,110 --> 00:10:38,920
You get a square.
182
00:10:38,920 --> 00:10:40,400
It goes inside the square.
183
00:10:40,400 --> 00:10:42,280
And so your derivative
of the log likelihood
184
00:10:42,280 --> 00:10:44,488
with respect to theta becomes
a derivative of the log
185
00:10:44,488 --> 00:10:46,240
likelihood with respect to eta.
186
00:10:46,240 --> 00:10:48,850
And that's the only thing
that's happening here.
187
00:10:48,850 --> 00:10:52,478
I'm just being super
sloppy, for some reason.
188
00:10:52,478 --> 00:10:54,612
OK.
189
00:10:54,612 --> 00:10:56,570
And then, of course, now,
what you're left with
190
00:10:56,570 --> 00:10:59,442
is that this is really
just proportional.
191
00:10:59,442 --> 00:11:00,650
Well, this is actually equal.
192
00:11:00,650 --> 00:11:02,150
Everything is
proportional, but this
193
00:11:02,150 --> 00:11:05,090
is equal to the Fisher
information tilde with respect
194
00:11:05,090 --> 00:11:07,050
to eta now.
195
00:11:07,050 --> 00:11:07,550
Right?
196
00:11:07,550 --> 00:11:09,630
You're doing this
with respect to eta.
197
00:11:09,630 --> 00:11:17,010
And so that's your new
prior with respect to eta.
198
00:11:17,010 --> 00:11:17,510
OK.
199
00:11:17,510 --> 00:11:21,800
So one thing that
you want to do,
200
00:11:21,800 --> 00:11:23,870
once you have-- so
remember, when you actually
201
00:11:23,870 --> 00:11:26,600
compute your
posterior, right-- rather
202
00:11:26,600 --> 00:11:29,330
than having-- so you
start with a prior,
203
00:11:29,330 --> 00:11:32,090
and you have some observations,
let's say, x1 to xn.
204
00:11:36,190 --> 00:11:41,540
When you do Bayesian
inference, rather than spitting
205
00:11:41,540 --> 00:11:45,450
out just some theta hat, which
is an estimator for theta,
206
00:11:45,450 --> 00:11:48,565
you actually spit out an
entire posterior distribution--
207
00:11:53,220 --> 00:11:57,040
pi of theta, given x1 xn.
208
00:11:57,040 --> 00:11:57,540
OK?
209
00:11:57,540 --> 00:11:59,460
So there's an
entire distribution
210
00:11:59,460 --> 00:12:01,110
on the parameter theta.
211
00:12:01,110 --> 00:12:04,290
And you can actually use this
to perform inference, rather
212
00:12:04,290 --> 00:12:06,150
than just having one number.
213
00:12:06,150 --> 00:12:06,950
OK?
214
00:12:06,950 --> 00:12:09,300
And so you could actually
build confidence regions
215
00:12:09,300 --> 00:12:10,540
from this thing.
216
00:12:10,540 --> 00:12:11,040
OK.
217
00:12:11,040 --> 00:12:16,600
And so a Bayesian
confidence interval--
218
00:12:16,600 --> 00:12:21,480
so if your set of parameters
is included in the real line,
219
00:12:21,480 --> 00:12:23,880
then you can actually--
it's not even guaranteed
220
00:12:23,880 --> 00:12:25,740
to be an interval.
221
00:12:25,740 --> 00:12:33,350
So let me call it a confidence
region, so a Bayesian
222
00:12:33,350 --> 00:12:40,090
confidence region, OK?
223
00:12:40,090 --> 00:12:43,360
So it's just a random subset.
224
00:12:43,360 --> 00:12:47,810
So let's call it r,
included in capital theta.
225
00:12:47,810 --> 00:12:49,750
And when you have the
deterministic one,
226
00:12:49,750 --> 00:12:53,650
we had a definition, which was
with respect to the randomness
227
00:12:53,650 --> 00:12:54,880
of the data, right?
228
00:12:54,880 --> 00:12:57,850
That's how you actually
had a random subset.
229
00:12:57,850 --> 00:12:59,740
So you had a random
confidence interval.
230
00:12:59,740 --> 00:13:02,200
Here, it's actually
conditioned on the data,
231
00:13:02,200 --> 00:13:03,640
but with respect
to the randomness
232
00:13:03,640 --> 00:13:06,531
that you actually get from
your posterior distribution.
233
00:13:06,531 --> 00:13:07,030
OK?
234
00:13:07,030 --> 00:13:16,760
So such that the
probability that your theta
235
00:13:16,760 --> 00:13:18,350
belongs to this
confidence region,
236
00:13:18,350 --> 00:13:24,500
given x1 xn is, say,
at least 1 minus alpha.
237
00:13:24,500 --> 00:13:27,040
Let's just take it
equal to 1 minus alpha.
238
00:13:27,040 --> 00:13:34,530
OK so that's a confidence
region at level 1 minus alpha.
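As an illustrative sketch (the numbers are hypothetical, not from the lecture): with n = 20 Bernoulli observations, 7 successes, and the Beta(1/2, 1/2) Jeffreys prior, the posterior is Beta(7.5, 13.5), and a Bayesian confidence region at level 1 minus alpha can be read off by cutting alpha/2 of the posterior mass from each tail:

```python
# Posterior Beta(7.5, 13.5) from 7 successes in 20 trials under the
# Jeffreys Beta(1/2, 1/2) prior (hypothetical example numbers).
a, b = 7 + 0.5, 13 + 0.5

def beta_pdf(p, a, b):
    # Unnormalized Beta(a, b) density; the grid sum below normalizes it.
    return p ** (a - 1) * (1 - p) ** (b - 1)

m = 10000
grid = [(i + 0.5) / m for i in range(m)]
weights = [beta_pdf(p, a, b) for p in grid]
total = sum(weights)
probs = [w / total for w in weights]

def quantile(q):
    # Smallest grid point with cumulative posterior mass at least q.
    acc = 0.0
    for p, w in zip(grid, probs):
        acc += w
        if acc >= q:
            return p
    return grid[-1]

# Equal-tailed 95% region: 95% posterior probability that theta is inside.
lo, hi = quantile(0.025), quantile(0.975)
print(lo, hi)  # roughly (0.17, 0.57)
```

The probability statement here is with respect to the posterior, conditioned on the data, which is exactly the definition above.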
239
00:13:34,530 --> 00:13:36,240
OK, so that's one way.
240
00:13:36,240 --> 00:13:38,770
So why would you actually--
241
00:13:38,770 --> 00:13:41,390
when I actually implement
Bayesian inference,
242
00:13:41,390 --> 00:13:44,480
I'm actually spitting out
that entire distribution.
243
00:13:44,480 --> 00:13:47,540
I need to summarize this thing
to communicate it, right?
244
00:13:47,540 --> 00:13:49,730
I cannot just say this
is this entire function.
245
00:13:49,730 --> 00:13:51,230
I want to know where
are the regions
246
00:13:51,230 --> 00:13:54,344
of high probability, where my
parameter is supposed to be?
247
00:13:54,344 --> 00:13:56,510
And so here, when I have
this thing, what I actually
248
00:13:56,510 --> 00:13:58,010
want to have is
something that says,
249
00:13:58,010 --> 00:14:00,200
well, I want to
summarize this thing
250
00:14:00,200 --> 00:14:03,680
into some subset of the
real line, in which I'm
251
00:14:03,680 --> 00:14:08,120
sure that the area under the
curve, here, of my posterior
252
00:14:08,120 --> 00:14:11,734
is actually 1 minus alpha.
253
00:14:11,734 --> 00:14:13,400
And there's many ways
to do this, right?
254
00:14:16,790 --> 00:14:22,450
So one way to do this is
to look at level sets.
255
00:14:27,870 --> 00:14:29,550
And so rather than
actually-- so let's
256
00:14:29,550 --> 00:14:32,220
say my posterior
looks like this.
257
00:14:32,220 --> 00:14:35,760
I know, for example, if I
have a Gaussian distribution,
258
00:14:35,760 --> 00:14:38,230
I can actually take my posterior
to be-- my posterior is
259
00:14:38,230 --> 00:14:39,480
actually going to be Gaussian.
260
00:14:43,060 --> 00:14:50,760
And what I can do is to try
to cut it here on the y-axis
261
00:14:50,760 --> 00:14:54,910
so that now, the area under
the curve, when I cut here,
262
00:14:54,910 --> 00:14:59,430
is actually 1 minus alpha.
263
00:14:59,430 --> 00:15:02,080
OK, so I have some
threshold tau.
264
00:15:02,080 --> 00:15:05,490
If tau goes to plus
infinity, then I'm
265
00:15:05,490 --> 00:15:07,380
going to have that this
area under the curve
266
00:15:07,380 --> 00:15:10,380
here is going to--
267
00:15:18,012 --> 00:15:19,920
AUDIENCE: [INAUDIBLE]
268
00:15:19,920 --> 00:15:21,786
PHILIPPE RIGOLLET: Well, no.
269
00:15:21,786 --> 00:15:23,160
So the area under
the curve, when
270
00:15:23,160 --> 00:15:24,810
tau is going to
plus infinity, think
271
00:15:24,810 --> 00:15:27,892
of the case when
tau is just right here.
272
00:15:27,892 --> 00:15:29,280
AUDIENCE: [INAUDIBLE]
273
00:15:29,280 --> 00:15:32,150
PHILIPPE RIGOLLET: So this is
actually going to 0, right?
274
00:15:32,150 --> 00:15:33,530
And so I start here.
275
00:15:33,530 --> 00:15:36,290
And then I start going down
and down and down and down,
276
00:15:36,290 --> 00:15:39,440
until I actually get something
which is equal to 1 minus
277
00:15:39,440 --> 00:15:40,160
alpha.
278
00:15:40,160 --> 00:15:44,000
And if tau is going down to 0,
then my area under the curve
279
00:15:44,000 --> 00:15:44,750
is going to--
280
00:15:48,240 --> 00:15:51,604
if tau is here, I'm
cutting nowhere.
281
00:15:51,604 --> 00:15:52,770
And so I'm getting 1, right?
282
00:15:56,160 --> 00:15:56,980
Agree?
283
00:15:56,980 --> 00:16:00,540
Think of, when tau
is very close to 0,
284
00:16:00,540 --> 00:16:02,876
I'm cutting
very far down here.
285
00:16:02,876 --> 00:16:04,750
And so I'm getting some
area under the curve,
286
00:16:04,750 --> 00:16:06,000
which is almost everything.
287
00:16:06,000 --> 00:16:08,100
And so it's going to 1--
as tau goes down to 0.
288
00:16:08,100 --> 00:16:09,960
Yeah?
289
00:16:09,960 --> 00:16:12,882
AUDIENCE: Does this only
work for [INAUDIBLE]
290
00:16:12,882 --> 00:16:14,340
PHILIPPE RIGOLLET:
No, it does not.
291
00:16:14,340 --> 00:16:17,160
I mean-- so this is a picture.
292
00:16:17,160 --> 00:16:20,277
So those two things work
for all of them, right?
293
00:16:20,277 --> 00:16:22,110
But when you have a
bimodal posterior, actually,
294
00:16:22,110 --> 00:16:23,526
this is actually
when things start
295
00:16:23,526 --> 00:16:24,990
to become interesting, right?
296
00:16:24,990 --> 00:16:30,600
So when we built a frequentist
confidence interval,
297
00:16:30,600 --> 00:16:34,590
it was always of the form x
bar plus or minus something.
298
00:16:34,590 --> 00:16:36,510
But now, if I start to
have a posterior that
299
00:16:36,510 --> 00:16:40,230
looks like this, what I'm
going to start cutting off,
300
00:16:40,230 --> 00:16:41,370
I'm going to have two--
301
00:16:41,370 --> 00:16:44,550
I mean, my confidence
region is going
302
00:16:44,550 --> 00:16:47,740
to be the union of
those two things, right?
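The level-set construction just described can be sketched numerically (the bimodal posterior here is an assumed example, a mixture of two normals, not from the lecture): lower the threshold tau until the mass above it is 1 minus alpha, and for a two-bump density the region comes out as a union of two intervals, one around each mode.

```python
import math

def posterior(x):
    # Hypothetical posterior: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
    def phi(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return 0.5 * phi(x, -2.0, 0.5) + 0.5 * phi(x, 2.0, 0.5)

alpha = 0.05
n_cells = 12000
h = 12.0 / n_cells
cells = [-6.0 + (i + 0.5) * h for i in range(n_cells)]

# Keep the highest-density cells until they hold 1 - alpha of the mass;
# the density of the last cell kept plays the role of the cutoff tau.
mass, region = 0.0, []
for x in sorted(cells, key=posterior, reverse=True):
    region.append(x)
    mass += posterior(x) * h
    if mass >= 1 - alpha:
        break
region.sort()

# The kept cells split into runs; a jump bigger than 2h marks a break
# between intervals of the highest-posterior-density region.
gaps = [b - a for a, b in zip(region, region[1:]) if b - a > 2 * h]
print(len(gaps) + 1)  # number of intervals in the region: 2 here
```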
303
00:16:47,740 --> 00:16:50,700
And it really reflects
the fact that there
304
00:16:50,700 --> 00:16:51,820
is this bimodal thing.
305
00:16:51,820 --> 00:16:53,486
It's going to say,
well, with high probability,
306
00:16:53,486 --> 00:16:56,840
I'm actually going to
be either here or here.
307
00:16:56,840 --> 00:16:59,840
Now, the meaning here of a
Bayesian confidence region
308
00:16:59,840 --> 00:17:02,570
and the confidence interval are
completely distinct notions,
309
00:17:02,570 --> 00:17:03,260
right?
310
00:17:03,260 --> 00:17:06,140
And I'm going to work
out an example with you
311
00:17:06,140 --> 00:17:08,673
so that we can actually
see that sometimes--
312
00:17:08,673 --> 00:17:10,089
I mean, both of
them, actually you
313
00:17:10,089 --> 00:17:11,839
can come up with
some crazy paradoxes.
314
00:17:11,839 --> 00:17:13,609
So since we don't
have that much time,
315
00:17:13,609 --> 00:17:17,339
I will actually talk to you
about why, in some instances,
316
00:17:17,339 --> 00:17:19,819
it's actually a good idea to
think of Bayesian confidence
317
00:17:19,819 --> 00:17:22,369
intervals rather than
frequentist ones.
318
00:17:22,369 --> 00:17:25,609
So before we go into
more details about what
319
00:17:25,609 --> 00:17:27,440
those Bayesian
confidence intervals are,
320
00:17:27,440 --> 00:17:29,570
let's remind
ourselves what does it
321
00:17:29,570 --> 00:17:33,110
mean to have a frequentist
confidence interval?
322
00:17:33,110 --> 00:17:33,610
Right?
323
00:17:46,460 --> 00:17:46,960
OK.
324
00:17:46,960 --> 00:17:49,690
So when I have a frequentist
confidence interval,
325
00:17:49,690 --> 00:17:59,290
let's say something like x bar n
minus 1.96 sigma over root n
326
00:17:59,290 --> 00:18:06,136
and x bar n plus 1.96
sigma over root n,
327
00:18:06,136 --> 00:18:07,510
so that's the
confidence interval
328
00:18:07,510 --> 00:18:10,720
that you get for the
mean of some Gaussian
329
00:18:10,720 --> 00:18:16,390
with known variance
equal to sigma squared, OK.
330
00:18:16,390 --> 00:18:18,460
So what we know is that
the meaning of this
331
00:18:18,460 --> 00:18:20,410
is the probability
that theta belongs
332
00:18:20,410 --> 00:18:25,870
to this is equal to 95%, right?
333
00:18:25,870 --> 00:18:27,340
And this, more
generally, you can
334
00:18:27,340 --> 00:18:29,620
think of being q alpha over 2.
335
00:18:29,620 --> 00:18:33,040
And what you're going to get
is 1 minus alpha here, OK?
336
00:18:33,040 --> 00:18:34,280
So what does it mean here?
337
00:18:34,280 --> 00:18:37,480
Well, it looks very much
like what we have here,
338
00:18:37,480 --> 00:18:39,970
except that we're not
conditioning on x1 xn.
339
00:18:39,970 --> 00:18:40,720
And we should not.
340
00:18:40,720 --> 00:18:43,830
Because there was a question
like that in the midterm--
341
00:18:43,830 --> 00:18:47,590
if I condition on x1 xn, this
probability is either 0 or 1.
342
00:18:47,590 --> 00:18:48,610
OK?
343
00:18:48,610 --> 00:18:50,170
Because once I
condition-- so here,
344
00:18:50,170 --> 00:18:52,170
this probability, actually,
here is with respect
345
00:18:52,170 --> 00:18:55,010
to the randomness in x1 xn.
346
00:18:55,010 --> 00:18:56,040
So if I condition--
347
00:18:58,860 --> 00:19:04,890
so let's build this thing,
r freq, for frequentist.
348
00:19:07,830 --> 00:19:11,930
Well, given x1 xn--
349
00:19:11,930 --> 00:19:13,940
and actually, I don't
need to know x1 xn really.
350
00:19:13,940 --> 00:19:16,420
What I need to know
is what xn bar is.
351
00:19:16,420 --> 00:19:18,140
Well, this thing now is what?
352
00:19:18,140 --> 00:19:22,200
It's 1, if theta is
in r, and it's 0,
353
00:19:22,200 --> 00:19:27,110
if theta is not in r, right?
354
00:19:27,110 --> 00:19:28,010
That's all there is.
355
00:19:28,010 --> 00:19:29,900
This is a deterministic
confidence interval,
356
00:19:29,900 --> 00:19:32,360
once I condition x1 xn.
357
00:19:32,360 --> 00:19:33,270
So I have a number.
358
00:19:33,270 --> 00:19:35,720
The average is maybe 3.
359
00:19:35,720 --> 00:19:36,790
And so I get 3.
360
00:19:36,790 --> 00:19:41,900
Either theta is between 3
minus 0.5 and 3 plus 0.5,
361
00:19:41,900 --> 00:19:42,840
or it's not.
362
00:19:42,840 --> 00:19:44,000
And so there's basically--
363
00:19:44,000 --> 00:19:45,470
I mean, I write
it as probability,
364
00:19:45,470 --> 00:19:47,303
but it's really not a
probabilistic statement.
365
00:19:47,303 --> 00:19:49,160
It's either it's true or not.
366
00:19:49,160 --> 00:19:50,240
Agreed?
367
00:19:50,240 --> 00:19:52,580
So what does it mean to have
a frequentist confidence
368
00:19:52,580 --> 00:19:53,550
interval?
369
00:19:53,550 --> 00:19:55,270
It means that if I were--
370
00:19:55,270 --> 00:19:58,660
and here is where the word
frequentist comes from--
371
00:19:58,660 --> 00:20:02,840
it says that if I repeat this
experiment over and over,
372
00:20:02,840 --> 00:20:06,700
meaning that on Monday, I
collect a sample of size n,
373
00:20:06,700 --> 00:20:09,260
and I build a
confidence interval,
374
00:20:09,260 --> 00:20:12,260
and then on Tuesday, I collect
another sample of size n,
375
00:20:12,260 --> 00:20:13,890
and I build a
confidence interval,
376
00:20:13,890 --> 00:20:17,000
and on Wednesday, I do this
again and again, what's going
377
00:20:17,000 --> 00:20:18,510
to happen is the following.
378
00:20:18,510 --> 00:20:21,530
I'm going to have my true
theta that lives here.
379
00:20:21,530 --> 00:20:23,900
And then on Monday, this
is the confidence interval
380
00:20:23,900 --> 00:20:25,470
that I build.
381
00:20:25,470 --> 00:20:28,802
OK, so this is the real line.
382
00:20:28,802 --> 00:20:31,260
The true theta is here, and
this is the confidence interval
383
00:20:31,260 --> 00:20:32,300
I build on Monday.
384
00:20:32,300 --> 00:20:32,800
All right?
385
00:20:32,800 --> 00:20:37,530
So x bar was here, and this
is my confidence interval.
386
00:20:37,530 --> 00:20:41,540
On Tuesday, I build this
confidence interval maybe.
387
00:20:41,540 --> 00:20:44,640
x bar was closer to
theta, but smaller.
388
00:20:44,640 --> 00:20:49,820
But then on Wednesday, I build
this confidence interval.
389
00:20:49,820 --> 00:20:50,880
I'm not here.
390
00:20:50,880 --> 00:20:51,920
It's not in there.
391
00:20:51,920 --> 00:20:53,681
And that's this case.
392
00:20:53,681 --> 00:20:54,180
Right?
393
00:20:54,180 --> 00:20:56,100
It happens that it's
just not in there.
394
00:20:56,100 --> 00:20:57,930
And then on Thursday,
I build another one.
395
00:20:57,930 --> 00:21:01,300
I almost miss it, but
I'm in there, et cetera.
396
00:21:01,300 --> 00:21:04,430
Maybe here-- here, I miss again.
397
00:21:04,430 --> 00:21:07,490
And so what it means to have a
confidence interval-- so what
398
00:21:07,490 --> 00:21:12,131
does it mean to have a
confidence interval at 95%?
399
00:21:12,131 --> 00:21:15,610
AUDIENCE: [INAUDIBLE]
400
00:21:15,610 --> 00:21:18,150
PHILIPPE RIGOLLET: Yeah, so
it means that, if I repeat this,
401
00:21:18,150 --> 00:21:19,800
the frequency of times--
402
00:21:19,800 --> 00:21:21,720
hence, the word
frequentist-- at which
403
00:21:21,720 --> 00:21:24,150
I'm actually going
to overlap that,
404
00:21:24,150 --> 00:21:26,910
I'm actually going to
contain theta, should be 95%.
405
00:21:26,910 --> 00:21:28,890
That's what frequentist means.
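That repeated-experiment frequency can be checked directly by simulation (an illustrative sketch with made-up values for theta, sigma, and n; not from the lecture): each "day" we draw a fresh sample, build the known-variance Gaussian interval, and record whether it caught the true theta.

```python
import math
import random

random.seed(0)                     # hypothetical example values below
theta, sigma, n = 3.0, 1.0, 50
half = 1.96 * sigma / math.sqrt(n)  # half-width of the 95% interval
trials, covered = 2000, 0
for _ in range(trials):
    # One "day": a sample of size n and its interval xbar +/- half.
    xbar = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    if xbar - half <= theta <= xbar + half:
        covered += 1
print(covered / trials)  # the frequency of coverage, close to 0.95
```

On any single day the interval either contains theta or it doesn't; only the long-run frequency is pinned at 95%.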
406
00:21:28,890 --> 00:21:31,740
So it's just a matter
of trusting that.
407
00:21:31,740 --> 00:21:35,690
So on one given thing, one
given realization of your data,
408
00:21:35,690 --> 00:21:36,970
it's not telling you anything.
409
00:21:36,970 --> 00:21:38,460
[INAUDIBLE] it's there or not.
410
00:21:38,460 --> 00:21:42,530
So it's not really
something that's actually
411
00:21:42,530 --> 00:21:46,430
something that assesses the
confidence of your decision,
412
00:21:46,430 --> 00:21:48,230
such as whether theta is in there or not.
413
00:21:48,230 --> 00:21:50,360
It's something that
assesses the confidence
414
00:21:50,360 --> 00:21:52,410
you have in the method
that you're using.
415
00:21:52,410 --> 00:21:54,170
If you were to repeat
it over and again,
416
00:21:54,170 --> 00:21:56,470
it'd be the same thing.
417
00:21:56,470 --> 00:21:58,850
It would be 95% of the
time correct, right?
418
00:21:58,850 --> 00:22:02,570
So for example, we know
that we could build a test.
419
00:22:02,570 --> 00:22:04,940
So it's pretty clear
that you can actually
420
00:22:04,940 --> 00:22:09,020
build a test for whether
theta is equal to theta naught
421
00:22:09,020 --> 00:22:10,705
or not equal to
theta naught, by just
422
00:22:10,705 --> 00:22:13,080
checking whether theta naught
is in a confidence interval
423
00:22:13,080 --> 00:22:13,780
or not.
424
00:22:13,780 --> 00:22:15,530
And what it means is
that, if you actually
425
00:22:15,530 --> 00:22:21,170
are doing those tests at 5%,
that means that 5% of the time,
426
00:22:21,170 --> 00:22:23,440
if you do this over and
again, 5% of the time
427
00:22:23,440 --> 00:22:24,610
you're going to be wrong.
428
00:22:24,610 --> 00:22:27,640
I mentioned my wife
does market research.
429
00:22:27,640 --> 00:22:31,930
And she does maybe, I don't
know, 100,000 tests a year.
430
00:22:31,930 --> 00:22:34,210
And if they do
all of them at 1%,
431
00:22:34,210 --> 00:22:37,550
then it means that 1% of the
time, which is a lot of time,
432
00:22:37,550 --> 00:22:38,050
right?
433
00:22:38,050 --> 00:22:40,840
When you do 100,000 a
year, it's 1,000 of them
434
00:22:40,840 --> 00:22:41,755
are actually wrong.
435
00:22:41,755 --> 00:22:44,611
OK, I mean, she's
actually hedging
436
00:22:44,611 --> 00:22:47,110
against the fact that 1% of
them are going to be wrong.
437
00:22:47,110 --> 00:22:49,109
That's 1,000 of them that
are going to be wrong.
438
00:22:49,109 --> 00:22:52,890
Just like, if you do this
100,000 times at 95%,
439
00:22:52,890 --> 00:22:54,910
5,000 of those guys
are actually not going
440
00:22:54,910 --> 00:22:56,360
to be the correct ones.
441
00:22:56,360 --> 00:22:56,860
OK?
442
00:22:56,860 --> 00:22:58,600
So I mean, it's kind of scary.
443
00:22:58,600 --> 00:23:01,300
But that's the way it is.
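The multiple-testing arithmetic is easy to check by simulation (a minimal sketch, assuming every null hypothesis is true and the tests are independent):

```python
import random

random.seed(0)

n_tests = 100_000
alpha = 0.01

# A test on a true null rejects (is "wrong") with probability alpha
false_rejections = sum(random.random() < alpha for _ in range(n_tests))

expected = alpha * n_tests  # 1,000 expected false rejections
```

At level 5% instead, the same count would hover around 5,000, as in the lecture.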
444
00:23:01,300 --> 00:23:03,730
So that's what the frequentist
interpretation of this is.
445
00:23:03,730 --> 00:23:07,720
Now, as I mentioned, when we
started this Bayesian chapter,
446
00:23:07,720 --> 00:23:10,930
I said, Bayesian
statistics converge to--
447
00:23:10,930 --> 00:23:14,800
I mean, Bayesian decisions
and Bayesian methods converge
448
00:23:14,800 --> 00:23:16,510
to frequentist methods.
449
00:23:16,510 --> 00:23:18,590
When the sample size
is large enough,
450
00:23:18,590 --> 00:23:20,610
they lead to the same decisions.
451
00:23:20,610 --> 00:23:22,930
And in general, they
need not be the same,
452
00:23:22,930 --> 00:23:24,970
but they tend to
actually, when the sample
453
00:23:24,970 --> 00:23:27,830
size is large enough, to
have the same behavior.
454
00:23:27,830 --> 00:23:30,850
Think about, for
example, the posterior
455
00:23:30,850 --> 00:23:34,450
that you have
in the Gaussian case, right?
456
00:23:34,450 --> 00:23:36,420
We said that, in
the Gaussian case,
457
00:23:36,420 --> 00:23:38,020
what you're going
to see is that it's
458
00:23:38,020 --> 00:23:40,240
as if you had an extra
observation which
459
00:23:40,240 --> 00:23:43,230
was essentially
given by your prior.
460
00:23:43,230 --> 00:23:44,570
OK?
461
00:23:44,570 --> 00:23:50,830
And now, what's going to happen
is that, when this is just one
462
00:23:50,830 --> 00:23:53,470
observation among n
plus 1, it's really
463
00:23:53,470 --> 00:23:55,720
going to be totally
drowned out, and you
464
00:23:55,720 --> 00:23:58,390
won't see it when the
sample size grows larger.
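For the Gaussian case, the posterior mean under a conjugate N(mu0, tau2) prior makes the "one extra observation" effect explicit; when tau2 equals sigma2, the prior counts as a single data point and gets drowned out as n grows. A sketch with made-up numbers:

```python
def posterior_mean(xbar, n, sigma2, mu0, tau2):
    # N(mu0, tau2) prior on theta, X_i ~ N(theta, sigma2) iid:
    # posterior mean = (n*xbar/sigma2 + mu0/tau2) / (n/sigma2 + 1/tau2)
    return (n * xbar / sigma2 + mu0 / tau2) / (n / sigma2 + 1 / tau2)

# With tau2 == sigma2, the prior acts like one extra observation at mu0
small = posterior_mean(xbar=10.0, n=1, sigma2=1.0, mu0=0.0, tau2=1.0)
large = posterior_mean(xbar=10.0, n=1000, sigma2=1.0, mu0=0.0, tau2=1.0)
```

With one observation the prior pulls the estimate halfway to mu0; with a thousand, it is almost invisible.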
465
00:23:58,390 --> 00:24:00,400
So Bayesian methods are
particularly useful when
466
00:24:00,400 --> 00:24:02,190
you have a small sample size.
467
00:24:02,190 --> 00:24:05,680
And when you have a small sample
size, the effect of the prior
468
00:24:05,680 --> 00:24:06,980
is going to be bigger.
469
00:24:06,980 --> 00:24:08,950
But most importantly,
you're not going
470
00:24:08,950 --> 00:24:10,810
to have to repeat this
thing over and again.
471
00:24:10,810 --> 00:24:11,830
And you're going
to have a meaning.
472
00:24:11,830 --> 00:24:13,180
You're going to have
to have something
473
00:24:13,180 --> 00:24:15,138
that has a meaning for
this particular data set
474
00:24:15,138 --> 00:24:16,150
that you have.
475
00:24:16,150 --> 00:24:19,900
When I said that the probability
that theta belongs to r--
476
00:24:19,900 --> 00:24:22,810
and here, I'm going to specify
the fact that it's a Bayesian
477
00:24:22,810 --> 00:24:24,740
confidence region,
like this one--
478
00:24:24,740 --> 00:24:27,490
this is actually
conditionally on the data
479
00:24:27,490 --> 00:24:29,490
that you've collected.
480
00:24:29,490 --> 00:24:32,110
It says, given this data, given
the points that you have--
481
00:24:32,110 --> 00:24:34,540
just put in some numbers,
if you want, in there--
482
00:24:34,540 --> 00:24:36,460
it's actually telling
you the probability
483
00:24:36,460 --> 00:24:39,430
that theta belongs to
this Bayesian thing,
484
00:24:39,430 --> 00:24:41,750
to this Bayesian
confidence region.
485
00:24:41,750 --> 00:24:44,230
Here, since I have
conditioned on x1 xn,
486
00:24:44,230 --> 00:24:46,840
this probability is really
just with respect to theta
487
00:24:46,840 --> 00:24:51,660
drawn from the prior, right?
488
00:24:51,660 --> 00:24:54,150
And so now, it has a
slightly different meaning.
489
00:24:54,150 --> 00:24:57,170
It's just telling
you that when--
490
00:24:57,170 --> 00:24:59,570
it's really making a
statement about where
491
00:24:59,570 --> 00:25:03,870
the regions of high probability
of your posterior are.
492
00:25:03,870 --> 00:25:05,050
Now, why is that useful?
493
00:25:05,050 --> 00:25:11,600
Well, there's actually
an interesting story that
494
00:25:11,600 --> 00:25:13,980
goes behind Bayesian methods.
495
00:25:13,980 --> 00:25:17,240
Does anybody know the story of
the USS, I think it's Scorpion?
496
00:25:17,240 --> 00:25:18,610
Do you know the story?
497
00:25:18,610 --> 00:25:22,770
So that was an American
vessel that disappeared.
498
00:25:22,770 --> 00:25:25,490
I think it was close to
Bermuda or something.
499
00:25:25,490 --> 00:25:28,790
But you can tell the story
of the Malaysian Airlines,
500
00:25:28,790 --> 00:25:31,640
except that I don't think
it's such a successful story.
501
00:25:31,640 --> 00:25:33,770
But the idea was
essentially, we're
502
00:25:33,770 --> 00:25:36,050
trying to find where
this thing happened.
503
00:25:36,050 --> 00:25:39,800
And of course, this
is a one-time thing.
504
00:25:39,800 --> 00:25:41,686
You need something
that works once.
505
00:25:41,686 --> 00:25:44,060
You need something that works
for this particular vessel.
506
00:25:44,060 --> 00:25:46,601
And you don't care, if you go
to the Navy, and you tell them,
507
00:25:46,601 --> 00:25:48,320
well, here's a method.
508
00:25:48,320 --> 00:25:51,730
And for 95 out of 100 vessels
that you're going to lose,
509
00:25:51,730 --> 00:25:53,350
we're going to be
able to find it.
510
00:25:53,350 --> 00:25:57,230
And they want this to work
for this particular one.
511
00:25:57,230 --> 00:25:59,750
And so they were
looking, and they were
512
00:25:59,750 --> 00:26:02,200
diving in different places.
513
00:26:02,200 --> 00:26:04,710
And suddenly, they
brought in this guy.
514
00:26:04,710 --> 00:26:05,460
I forget his name.
515
00:26:05,460 --> 00:26:08,960
I mean, there's a whole story
about this on Wikipedia.
516
00:26:08,960 --> 00:26:10,612
And he started
collecting the data
517
00:26:10,612 --> 00:26:13,070
that they had from different
dives and maybe from currents.
518
00:26:13,070 --> 00:26:14,569
And he started to
put everything in.
519
00:26:14,569 --> 00:26:17,540
And he said, OK, what is
the posterior distribution
520
00:26:17,540 --> 00:26:21,140
of the location of the
vessel, given all the things
521
00:26:21,140 --> 00:26:22,340
that I've seen?
522
00:26:22,340 --> 00:26:23,390
And what have you seen?
523
00:26:23,390 --> 00:26:25,280
Well, you've seen that it's
not here, it's not there,
524
00:26:25,280 --> 00:26:26,071
and it's not there.
525
00:26:26,071 --> 00:26:29,360
And you've also seen that the
currents were going that way,
526
00:26:29,360 --> 00:26:30,786
and the winds were
going that way.
527
00:26:30,786 --> 00:26:32,660
And you can actually
put in some modeling
528
00:26:32,660 --> 00:26:33,890
to understand this.
529
00:26:33,890 --> 00:26:37,940
Now, given this, for this
particular data that you have,
530
00:26:37,940 --> 00:26:41,420
you can actually think of having
a two-dimensional density that
531
00:26:41,420 --> 00:26:44,650
tells you where it's more
likely that the vessel is.
532
00:26:44,650 --> 00:26:46,400
And where are you going
to be looking for?
533
00:26:46,400 --> 00:26:48,097
Well, if it's a
multimodal distribution,
534
00:26:48,097 --> 00:26:50,180
you're just going to go
to the highest mode first,
535
00:26:50,180 --> 00:26:52,190
because that's where it's
the most likely to be.
536
00:26:52,190 --> 00:26:53,600
And maybe it's not
there, so you're just
537
00:26:53,600 --> 00:26:55,250
going to update your
posterior, based on the fact
538
00:26:55,250 --> 00:26:56,791
that it's not there,
and do it again.
539
00:26:56,791 --> 00:26:59,270
And actually, after
two dives, I think,
540
00:26:59,270 --> 00:27:01,010
he actually found the thing.
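The search procedure just described, dive at the posterior mode and, if the wreck is not found, update and renormalize, can be sketched as a grid update (the cells, prior weights, and detection probability d below are all hypothetical):

```python
# Prior over four candidate cells, e.g. from currents/winds modeling (made up)
probs = [0.35, 0.30, 0.20, 0.15]
d = 0.8  # probability a dive detects the wreck when searching the right cell

def failed_search_update(probs, k, d):
    # Bayes update after an unsuccessful dive in cell k:
    # P(cell i | miss) is proportional to P(miss | cell i) * P(cell i)
    miss = [p * (1 - d) if i == k else p for i, p in enumerate(probs)]
    total = sum(miss)
    return [m / total for m in miss]

k = probs.index(max(probs))      # dive at the posterior mode first
probs = failed_search_update(probs, k, d)
```

After the failed dive, the searched cell's mass drops and a new mode emerges, which is where the next dive goes.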
541
00:27:01,010 --> 00:27:03,122
And that's exactly where
Bayesian statistics
542
00:27:03,122 --> 00:27:03,830
start to kick in.
543
00:27:03,830 --> 00:27:08,570
Because you put a lot of
knowledge into your model,
544
00:27:08,570 --> 00:27:11,340
but you also can actually factor
in a bunch of information,
545
00:27:11,340 --> 00:27:11,840
right?
546
00:27:11,840 --> 00:27:13,460
The model, he had
to build a model
547
00:27:13,460 --> 00:27:17,360
that was actually taking into
account the currents and the winds.
548
00:27:17,360 --> 00:27:20,780
And what you can have
as a guarantee is that,
549
00:27:20,780 --> 00:27:22,610
when you talk about
the probability
550
00:27:22,610 --> 00:27:27,346
that this vessel is
in this location,
551
00:27:27,346 --> 00:27:28,970
given what you've
observed in the past,
552
00:27:28,970 --> 00:27:30,140
it actually has some sense.
553
00:27:30,140 --> 00:27:34,610
Whereas, if you were to
use a frequentist approach,
554
00:27:34,610 --> 00:27:35,810
then there's no probability.
555
00:27:35,810 --> 00:27:38,660
Either it's underneath this
position or it's not, right?
556
00:27:38,660 --> 00:27:41,520
So that's actually where
it starts to make sense.
557
00:27:41,520 --> 00:27:43,370
And so you can
actually build this.
558
00:27:43,370 --> 00:27:44,930
And there's actually
a lot of methods
559
00:27:44,930 --> 00:27:47,300
for search that
560
00:27:47,300 --> 00:27:48,979
are based on Bayesian methods.
561
00:27:48,979 --> 00:27:50,520
I think, for example,
the Higgs boson search
562
00:27:50,520 --> 00:27:51,920
was based on a lot
of Bayesian methods,
563
00:27:51,920 --> 00:27:54,050
because this is something
you need to find [INAUDIBLE],,
564
00:27:54,050 --> 00:27:54,549
right?
565
00:27:54,549 --> 00:27:57,330
I mean, there was a lot of
prior that has to be built in.
566
00:27:57,330 --> 00:27:57,830
OK.
567
00:27:57,830 --> 00:27:59,621
So now, you build this
confidence interval.
568
00:27:59,621 --> 00:28:02,300
And the nicest way to do
it is to use level sets.
569
00:28:02,300 --> 00:28:05,210
But again, just like for
Gaussians, I mean, if I had,
570
00:28:05,210 --> 00:28:12,290
even in the Gaussian
case, I decided
571
00:28:12,290 --> 00:28:16,110
to go at x bar plus
or minus something,
572
00:28:16,110 --> 00:28:19,500
but I could go at something
that's completely asymmetric.
573
00:28:19,500 --> 00:28:21,467
So what's happening is
that here, this method
574
00:28:21,467 --> 00:28:23,550
guarantees that you're
going to have the narrowest
575
00:28:23,550 --> 00:28:24,800
possible confidence intervals.
576
00:28:24,800 --> 00:28:27,480
That's essentially what
it's telling you, OK?
577
00:28:27,480 --> 00:28:31,890
Because every time I'm choosing
a point, starting from here,
578
00:28:31,890 --> 00:28:36,170
I'm actually putting as much
area under the curve as I can.
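Choosing the region by level sets of the posterior density, so that each point added carries as much area under the curve as possible, is what makes the region narrowest. A discretized sketch, with a standard Gaussian standing in for the posterior:

```python
import numpy as np

# Discretized posterior density on a grid (standard Gaussian as an example)
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
dens = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Lower the level until the set {dens >= level} captures 95% of the mass:
# equivalently, add grid points in decreasing order of density
order = np.argsort(dens)[::-1]
mass = np.cumsum(dens[order]) * dx
keep = order[: int(np.searchsorted(mass, 0.95)) + 1]
hpd = (x[keep].min(), x[keep].max())  # close to (-1.96, 1.96) here
```

For a symmetric unimodal posterior this recovers the familiar symmetric interval; for a skewed or multimodal posterior the level-set region is genuinely asymmetric, or not an interval at all.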
579
00:28:36,170 --> 00:28:38,660
All right.
580
00:28:38,660 --> 00:28:41,737
So those are called Bayesian
confidence regions.
581
00:28:41,737 --> 00:28:43,320
Oh yeah, and I
promised you that we're
582
00:28:43,320 --> 00:28:46,500
going to work on some
example that actually
583
00:28:46,500 --> 00:28:50,940
gives a meaning to what I just
told you, with actual numbers.
584
00:28:50,940 --> 00:28:56,790
So this is something that's
taken from Wasserman's book.
585
00:28:56,790 --> 00:29:01,140
And also, it's
coming from a paper,
586
00:29:01,140 --> 00:29:03,780
from a stats paper,
from [? Wolpert ?] and I
587
00:29:03,780 --> 00:29:05,760
don't know who, from the '80s.
588
00:29:05,760 --> 00:29:07,760
And essentially,
this is how it works.
589
00:29:07,760 --> 00:29:10,680
So assume that you have
n equals 2 observations.
590
00:29:14,320 --> 00:29:18,780
And you have y1, so those
observations are y1--
591
00:29:18,780 --> 00:29:20,680
no, sorry, let's
call them x1, which
592
00:29:20,680 --> 00:29:26,000
is theta, plus epsilon 1 and x2,
which is theta plus epsilon 2,
593
00:29:26,000 --> 00:29:31,060
where epsilon 1 and
epsilon 2 are iid.
594
00:29:31,060 --> 00:29:33,280
And the probability
that epsilon i is equal
595
00:29:33,280 --> 00:29:35,110
to plus 1 is equal
to the probability
596
00:29:35,110 --> 00:29:38,440
that epsilon i is equal to
minus 1 is equal to 1/2.
597
00:29:38,440 --> 00:29:44,550
OK, so it's just the uniform
sign plus minus 1, OK?
598
00:29:44,550 --> 00:29:46,590
Now, let's think
about-- so you're trying
599
00:29:46,590 --> 00:29:47,970
to do some inference on theta.
600
00:29:47,970 --> 00:29:50,261
Maybe you actually want to
find some inference on theta
601
00:29:50,261 --> 00:29:51,825
that's actually based on--
602
00:29:51,825 --> 00:29:55,660
and that's based only
on the x1 and x2.
603
00:29:55,660 --> 00:29:56,430
OK?
604
00:29:56,430 --> 00:29:58,750
So I'm going to actually
build a confidence interval.
605
00:29:58,750 --> 00:30:01,110
But what I really
want to build is a--
606
00:30:03,594 --> 00:30:05,010
but let's start
thinking about how
607
00:30:05,010 --> 00:30:07,780
I would find an estimator
for those two things.
608
00:30:07,780 --> 00:30:09,970
Well, what values am I
going to be getting, right?
609
00:30:09,970 --> 00:30:13,750
So I'm going to get either
theta plus 1 or theta minus 1.
610
00:30:13,750 --> 00:30:15,610
And actually, I can
get basically four
611
00:30:15,610 --> 00:30:19,260
different observations, right?
612
00:30:19,260 --> 00:30:21,516
Sorry, four different
pairs of observations--
613
00:30:30,760 --> 00:30:32,410
theta plus 1, theta plus 1; theta plus 1, theta minus 1;
theta minus 1, theta plus 1; or theta minus 1, theta minus 1.
614
00:30:32,410 --> 00:30:33,170
Agreed?
615
00:30:33,170 --> 00:30:37,340
Those are the four possible
observations that I can get.
616
00:30:37,340 --> 00:30:38,970
Agreed?
617
00:30:38,970 --> 00:30:42,924
Either they're both equal to
plus 1, both equal to minus 1,
618
00:30:42,924 --> 00:30:44,340
or one of the two
is equal to plus
619
00:30:44,340 --> 00:30:46,950
1, the other one to
minus 1, or the epsilons.
620
00:30:46,950 --> 00:30:47,580
OK.
621
00:30:47,580 --> 00:30:49,730
So those are the four
observations I can get.
622
00:30:49,730 --> 00:30:56,010
So in particular, if
they take the same value,
623
00:30:56,010 --> 00:30:59,390
then you know it's either
theta plus 1 or theta minus 1,
624
00:30:59,390 --> 00:31:02,100
and if they take a different
value, I know one of them
625
00:31:02,100 --> 00:31:04,555
is theta plus 1, and one
is actually theta minus 1.
626
00:31:04,555 --> 00:31:07,180
So in particular, if I take the
average of those two guys, when
627
00:31:07,180 --> 00:31:09,138
they take different
values, I know I'm actually
628
00:31:09,138 --> 00:31:10,850
getting theta right.
629
00:31:10,850 --> 00:31:14,441
So let's build a
confidence region.
630
00:31:14,441 --> 00:31:16,940
OK, so I'm actually going to
take a confidence region, which
631
00:31:16,940 --> 00:31:18,810
is just a singleton.
632
00:31:21,662 --> 00:31:23,120
And I'm going to
say the following.
633
00:31:23,120 --> 00:31:32,460
Well, if x1 is equal to x2, I'm
just going to take x1 minus 1,
634
00:31:32,460 --> 00:31:33,320
OK?
635
00:31:33,320 --> 00:31:34,790
So I'm just saying,
well, I'm never
636
00:31:34,790 --> 00:31:37,310
going to be able to resolve
whether it's plus 1 or minus 1
637
00:31:37,310 --> 00:31:38,864
that actually gives
me the best one,
638
00:31:38,864 --> 00:31:41,030
so I'm just going to take
a dive and say, well, it's
639
00:31:41,030 --> 00:31:42,594
just plus 1.
640
00:31:42,594 --> 00:31:44,860
OK?
641
00:31:44,860 --> 00:31:47,710
And then, if they're
different, then here,
642
00:31:47,710 --> 00:31:50,830
I can do much better.
643
00:31:50,830 --> 00:31:52,929
I'm going to actually
just take the average.
644
00:31:56,282 --> 00:31:58,200
OK?
645
00:31:58,200 --> 00:32:08,360
Now, what I claim is that
this is a confidence region--
646
00:32:08,360 --> 00:32:10,370
and by default, when
I don't mention it,
647
00:32:10,370 --> 00:32:16,190
this is a frequentist
confidence region--
648
00:32:16,190 --> 00:32:18,740
at level 75%.
649
00:32:21,050 --> 00:32:21,550
OK?
650
00:32:21,550 --> 00:32:23,100
So let's just check that.
651
00:32:23,100 --> 00:32:24,685
To check that this
is correct, I need
652
00:32:24,685 --> 00:32:27,460
to check that the probability
under the realization of x1
653
00:32:27,460 --> 00:32:30,940
and x2, that theta belongs,
is one of those two guys,
654
00:32:30,940 --> 00:32:33,291
is actually equal to 0.75.
655
00:32:33,291 --> 00:32:33,790
Yes?
656
00:32:33,790 --> 00:32:36,529
AUDIENCE: What are
the [INAUDIBLE]
657
00:32:36,529 --> 00:32:39,070
PHILIPPE RIGOLLET: Well, it's
just the frequentist confidence
658
00:32:39,070 --> 00:32:41,842
interval that does not
need to be an interval.
659
00:32:41,842 --> 00:32:44,050
Actually, in this case, it's
going to be an interval.
660
00:32:44,050 --> 00:32:46,602
But that's just what it means.
661
00:32:46,602 --> 00:32:50,055
Yeah, region for Bayesian
was just because--
662
00:32:50,055 --> 00:32:51,430
I mean, the
confidence intervals,
663
00:32:51,430 --> 00:32:53,320
when we're frequentist,
we tend to make them
664
00:32:53,320 --> 00:32:54,606
intervals, because we want--
665
00:32:54,606 --> 00:32:56,980
but when you're Bayesian, and
you're doing this level set
666
00:32:56,980 --> 00:32:58,180
thing, you cannot
really guarantee,
667
00:32:58,180 --> 00:33:00,460
unless its [INAUDIBLE] is
going to be an interval.
668
00:33:00,460 --> 00:33:02,720
So region is just a way to
not have to say interval,
669
00:33:02,720 --> 00:33:03,430
in case it's not.
670
00:33:06,080 --> 00:33:06,640
OK.
671
00:33:06,640 --> 00:33:08,490
So I have this thing.
672
00:33:08,490 --> 00:33:11,440
So what I need to check is
the probability that theta
673
00:33:11,440 --> 00:33:13,000
is in one of those
two things, right?
674
00:33:13,000 --> 00:33:16,060
So what I need to find is
the probability that theta
675
00:33:16,060 --> 00:33:24,220
is in [INAUDIBLE] Well, x1 minus
1 and x1 is not equal to x2.
676
00:33:24,220 --> 00:33:26,840
And those are disjoint events,
so it's plus the probability
677
00:33:26,840 --> 00:33:35,980
that theta is in x1
plus x2 over 2 and x1--
678
00:33:35,980 --> 00:33:37,580
sorry, that's equal.
679
00:33:37,580 --> 00:33:39,700
That's different.
680
00:33:39,700 --> 00:33:40,200
OK.
681
00:33:40,200 --> 00:33:42,780
And OK, just before we actually
finish the computation,
682
00:33:42,780 --> 00:33:44,730
why do I have 75%?
683
00:33:44,730 --> 00:33:46,920
75% is 3/4.
684
00:33:46,920 --> 00:33:48,930
So it means that
we have four cases.
685
00:33:48,930 --> 00:33:52,020
And essentially, I did
not account for one case.
686
00:33:52,020 --> 00:33:52,650
And it's true.
687
00:33:52,650 --> 00:33:56,040
I did not account
for this case, when
688
00:33:56,040 --> 00:34:01,060
the both of the epsilon
i's are equal to minus 1.
689
00:34:01,060 --> 00:34:01,560
Right?
690
00:34:01,560 --> 00:34:03,393
So this is essentially
the one I'm not going
691
00:34:03,393 --> 00:34:04,620
to be able to account for.
692
00:34:04,620 --> 00:34:06,040
And so we'll see
that in a second.
693
00:34:06,040 --> 00:34:09,310
So in this case, we know
that everything goes great.
694
00:34:09,310 --> 00:34:09,810
Right?
695
00:34:09,810 --> 00:34:11,080
So in this case, this is--
696
00:34:11,080 --> 00:34:11,580
OK.
697
00:34:11,580 --> 00:34:13,831
Well, let's just start
from the first line.
698
00:34:13,831 --> 00:34:15,330
So the first line
is the probability
699
00:34:15,330 --> 00:34:20,290
that theta is equal to x1 minus
1 and those two are equal.
700
00:34:20,290 --> 00:34:28,440
So this is the probability
that theta is equal to--
701
00:34:28,440 --> 00:34:36,260
well, this is theta
plus epsilon 1 minus 1.
702
00:34:36,260 --> 00:34:43,409
And epsilon 1 is equal
to epsilon 2, right?
703
00:34:43,409 --> 00:34:45,290
Because I can remove
the theta from here,
704
00:34:45,290 --> 00:34:47,780
and I can actually remove
the theta from here,
705
00:34:47,780 --> 00:34:50,765
so that this guy here is
just epsilon 1 is equal to 1.
706
00:34:50,765 --> 00:34:52,407
So when I intersect
with this guy,
707
00:34:52,407 --> 00:34:54,740
it's actually the same thing
as epsilon 1 is equal to 1,
708
00:34:54,740 --> 00:34:56,530
as well--
709
00:34:56,530 --> 00:34:59,780
epsilon 2 is equal
to 1, as well, OK?
710
00:34:59,780 --> 00:35:05,240
So this first thing is actually
equal to the probability
711
00:35:05,240 --> 00:35:10,780
that epsilon 1 is equal to 1
and epsilon 2 is equal to 1,
712
00:35:10,780 --> 00:35:14,180
which is equal to what?
713
00:35:14,180 --> 00:35:15,570
AUDIENCE: [INAUDIBLE]
714
00:35:15,570 --> 00:35:17,070
PHILIPPE RIGOLLET:
Yeah, 1/4, right?
715
00:35:17,070 --> 00:35:19,870
So that's just the
first case over there.
716
00:35:19,870 --> 00:35:21,020
They're independent.
717
00:35:21,020 --> 00:35:23,420
Now, I still need to
do the second one.
718
00:35:23,420 --> 00:35:24,650
So this case is what?
719
00:35:24,650 --> 00:35:28,890
Well, when those things are
equal, x1 plus x2 over 2
720
00:35:28,890 --> 00:35:29,390
is what?
721
00:35:29,390 --> 00:35:31,920
Well, I get theta
plus epsilon 1 plus epsilon 2 over 2.
722
00:35:31,920 --> 00:35:33,800
So that's just equal
to the probability
723
00:35:33,800 --> 00:35:39,620
that epsilon 1 plus epsilon
2 over 2 is equal to 0
724
00:35:39,620 --> 00:35:43,600
and epsilon 1 is
different from epsilon 2.
725
00:35:43,600 --> 00:35:44,100
Agreed?
726
00:35:46,860 --> 00:35:49,797
I just removed the thetas from
these equations, because I can.
727
00:35:49,797 --> 00:35:51,380
They're just on both
sides every time.
728
00:35:54,810 --> 00:35:55,310
OK.
729
00:35:55,310 --> 00:35:56,482
And so that means what?
730
00:35:56,482 --> 00:35:58,440
That means that the second
part-- so this thing
731
00:35:58,440 --> 00:36:02,120
is actually equal to
1/4 plus the probability
732
00:36:02,120 --> 00:36:05,350
that epsilon 1 plus epsilon
2 over 2 is equal to 0.
733
00:36:05,350 --> 00:36:06,544
I can remove the 2.
734
00:36:06,544 --> 00:36:08,460
So this is just the
probability that one is 1,
735
00:36:08,460 --> 00:36:10,560
and the other one
is minus 1, right?
736
00:36:10,560 --> 00:36:12,510
So that's equal
to the probability
737
00:36:12,510 --> 00:36:17,820
that epsilon 1 is equal to 1 and
epsilon 2 is equal to minus 1
738
00:36:17,820 --> 00:36:21,360
plus the probability that
epsilon 1 is equal to minus 1
739
00:36:21,360 --> 00:36:24,447
and epsilon 2 is
equal to plus 1, OK?
740
00:36:24,447 --> 00:36:25,780
Because they're disjoint events.
741
00:36:25,780 --> 00:36:28,080
So I can break them
into the sum of the two.
742
00:36:28,080 --> 00:36:32,310
And each of those guys is also
one of the atomic parts of it.
743
00:36:32,310 --> 00:36:33,960
It's one of the basic things.
744
00:36:33,960 --> 00:36:36,011
And so each of those
guys has probability 1/4.
745
00:36:36,011 --> 00:36:38,010
And so here, we can really
see that we accounted
746
00:36:38,010 --> 00:36:41,910
for everything, except for the
case when epsilon 1 was equal
747
00:36:41,910 --> 00:36:44,730
to minus 1, and epsilon
2 was equal to minus 1.
748
00:36:44,730 --> 00:36:45,570
So this is 1/4.
749
00:36:45,570 --> 00:36:46,380
This is 1/4.
750
00:36:46,380 --> 00:36:49,850
So the whole thing
is equal to 3/4.
751
00:36:49,850 --> 00:36:56,060
So now, what we have is that
the probability that epsilon 1
752
00:36:56,060 --> 00:36:57,350
is in--
753
00:36:57,350 --> 00:37:03,230
so the probability that theta
belongs to this confidence
754
00:37:03,230 --> 00:37:06,280
region is equal to 3/4.
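The 3/4 coverage just computed can be confirmed by brute force (a minimal simulation sketch; theta is fixed arbitrarily at 3):

```python
import random

random.seed(1)
theta = 3
trials = 100_000
hits = 0

for _ in range(trials):
    x1 = theta + random.choice([-1, 1])
    x2 = theta + random.choice([-1, 1])
    # The region: {x1 - 1} if x1 == x2, else {(x1 + x2) / 2}
    guess = x1 - 1 if x1 == x2 else (x1 + x2) / 2
    hits += (guess == theta)

coverage = hits / trials  # close to 0.75
```

Only the (minus 1, minus 1) draw of the epsilons makes the region miss, which is exactly the one unaccounted case out of four.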
755
00:37:06,280 --> 00:37:07,990
And that's very nice.
756
00:37:07,990 --> 00:37:09,740
But the thing is some
people are sort of--
757
00:37:09,740 --> 00:37:12,650
I mean, it's not super nice
to be able to see this,
758
00:37:12,650 --> 00:37:17,510
because, in a way, I know that,
if I observe x1 and x2 that
759
00:37:17,510 --> 00:37:24,050
are different, I know
for sure that theta,
760
00:37:24,050 --> 00:37:25,882
that I actually got
the right theta, right?
761
00:37:25,882 --> 00:37:27,590
That this confidence
interval is actually
762
00:37:27,590 --> 00:37:31,370
happening with probability 1.
763
00:37:31,370 --> 00:37:34,700
And the problem is
that I do not know--
764
00:37:34,700 --> 00:37:37,640
I cannot make this precise
with the notion of frequentist
765
00:37:37,640 --> 00:37:39,230
confidence intervals.
766
00:37:39,230 --> 00:37:39,730
OK?
767
00:37:39,730 --> 00:37:41,396
Because frequentist
confidence intervals
768
00:37:41,396 --> 00:37:43,810
have to account for the
fact that, in the future,
769
00:37:43,810 --> 00:37:47,810
it might not be the case
that x1 and x2 are different.
770
00:37:47,810 --> 00:37:53,360
So Bayesian confidence
regions, by definition--
771
00:37:53,360 --> 00:37:54,530
well, they're all gone--
772
00:37:54,530 --> 00:37:57,387
but they are conditioned
on the data that I have.
773
00:37:57,387 --> 00:37:58,470
And so that's what I want.
774
00:37:58,470 --> 00:38:00,800
I want to be able to make
this statement conditionally
775
00:38:00,800 --> 00:38:02,640
on the data that I have.
776
00:38:02,640 --> 00:38:03,140
OK.
777
00:38:03,140 --> 00:38:06,450
So if I want to be able
to make this statement,
778
00:38:06,450 --> 00:38:08,450
if I want to build a
Bayesian confidence region,
779
00:38:08,450 --> 00:38:10,520
I'm going to have to
put a prior on theta.
780
00:38:10,520 --> 00:38:12,050
So without loss of generality--
781
00:38:12,050 --> 00:38:16,520
I mean, maybe with--
but let's assume
782
00:38:16,520 --> 00:38:25,980
that pi is a prior on theta.
783
00:38:25,980 --> 00:38:31,540
And let's assume that pi
of j is strictly positive
784
00:38:31,540 --> 00:38:35,920
for all integers
j equal, say, 0--
785
00:38:35,920 --> 00:38:42,770
well, actually, for all j in the
integers, positive or negative.
786
00:38:42,770 --> 00:38:43,270
OK.
787
00:38:43,270 --> 00:38:46,870
So that's a pretty weak
assumption on my prior.
788
00:38:46,870 --> 00:38:52,901
I'm just assuming that
theta is some integer.
789
00:38:52,901 --> 00:38:57,290
And now, let's build our
Bayesian confidence region.
790
00:38:57,290 --> 00:38:59,540
Well, if I want to build a
Bayesian confidence region,
791
00:38:59,540 --> 00:39:01,520
I need to understand what
my posterior is going to be.
792
00:39:01,520 --> 00:39:02,089
OK?
793
00:39:02,089 --> 00:39:04,630
And if I want to understand what
my posterior is going to be,
794
00:39:04,630 --> 00:39:11,530
I actually need to build
a likelihood, right?
795
00:39:11,530 --> 00:39:16,370
So we know that it's the
product of the likelihood
796
00:39:16,370 --> 00:39:20,740
and of the prior divided by--
797
00:39:20,740 --> 00:39:21,240
OK.
798
00:39:31,140 --> 00:39:32,850
So what is my likelihood?
799
00:39:32,850 --> 00:39:35,540
So my likelihood
is the probability
800
00:39:35,540 --> 00:39:40,580
of x1 x2, given theta.
801
00:39:40,580 --> 00:39:41,240
Right?
802
00:39:41,240 --> 00:39:45,010
That's what the
likelihood should be.
803
00:39:45,010 --> 00:39:49,840
And now let's say
that actually, just
804
00:39:49,840 --> 00:39:51,910
to make things a
little simpler, let
805
00:39:51,910 --> 00:40:07,230
us assume that x1 is
equal to, I don't know, 5,
806
00:40:07,230 --> 00:40:11,180
and x2 is equal to 7.
807
00:40:11,180 --> 00:40:12,540
OK?
808
00:40:12,540 --> 00:40:16,350
So I'm not going to take the
case where they're actually
809
00:40:16,350 --> 00:40:19,180
equal to each other, because
I know that, in this case,
810
00:40:19,180 --> 00:40:20,550
x1 and x2 are different.
811
00:40:20,550 --> 00:40:23,970
I know I'm going to actually
nail exactly what theta is,
812
00:40:23,970 --> 00:40:26,540
by looking at the average
of those guys, right?
813
00:40:26,540 --> 00:40:30,630
Here, it must be that
theta is equal to 6.
814
00:40:30,630 --> 00:40:34,491
So what I want is to compute
the likelihood at 5 and 7, OK?
815
00:40:38,419 --> 00:40:42,350
And what is this likelihood?
816
00:40:42,350 --> 00:40:53,950
Well, if theta is
equal to 6, that's
817
00:40:53,950 --> 00:41:00,010
just the probability that I
will observe 5 and 7, right?
818
00:41:00,010 --> 00:41:01,910
So what is the probability
I observe 5 and 7?
819
00:41:04,610 --> 00:41:05,510
Yeah?
820
00:41:05,510 --> 00:41:06,672
1?
821
00:41:06,672 --> 00:41:08,499
AUDIENCE: 1/4.
822
00:41:08,499 --> 00:41:10,040
PHILIPPE RIGOLLET:
That's 1/4, right?
823
00:41:10,040 --> 00:41:15,260
As the probability, I have
minus 1 for the first epsilon 1,
824
00:41:15,260 --> 00:41:15,760
right?
825
00:41:15,760 --> 00:41:17,260
So this is, if theta is 6.
826
00:41:17,260 --> 00:41:23,080
This is the probability that
epsilon 1 is equal to minus 1,
827
00:41:23,080 --> 00:41:28,790
and epsilon 2 is equal to
plus 1, which is equal to 1/4.
828
00:41:28,790 --> 00:41:31,520
So this probability is 1/4.
829
00:41:31,520 --> 00:41:35,560
If theta is different from
6, what is this probability?
830
00:41:35,560 --> 00:41:37,630
So if theta is different
from 6, since we
831
00:41:37,630 --> 00:41:41,210
know that we've only
loaded the integers--
832
00:41:41,210 --> 00:41:46,770
so if theta has to
be another integer,
833
00:41:46,770 --> 00:41:49,214
what is the probability
that I see 5 and 7?
834
00:41:49,214 --> 00:41:49,731
AUDIENCE: 0.
835
00:41:49,731 --> 00:41:50,606
PHILIPPE RIGOLLET: 0.
836
00:41:53,860 --> 00:41:55,190
So that's my likelihood.
837
00:41:55,190 --> 00:42:00,210
And if I want to know
what my posterior is,
838
00:42:00,210 --> 00:42:03,340
well, it's just
pi of theta times
839
00:42:03,340 --> 00:42:10,240
p of 5, 7, given theta, divided
by the sum over all T's, say,
840
00:42:10,240 --> 00:42:11,890
in Z. Right?
841
00:42:11,890 --> 00:42:14,590
So now, I just need to
normalize this thing.
842
00:42:14,590 --> 00:42:21,950
So of pi of T, p of
5, 7, given T. Agreed?
843
00:42:24,730 --> 00:42:27,350
That's just the definition
of the posterior.
844
00:42:27,350 --> 00:42:30,330
But when I sum
these guys, there's
845
00:42:30,330 --> 00:42:34,780
only one that counts,
because, for those things,
846
00:42:34,780 --> 00:42:38,140
we know that this is actually
equal to 0 for every T,
847
00:42:38,140 --> 00:42:41,470
except for when T is equal to 6.
848
00:42:41,470 --> 00:42:45,380
So this entire sum
here is actually
849
00:42:45,380 --> 00:42:54,310
equal to pi of 6
times p of 5, 6--
850
00:42:54,310 --> 00:43:03,360
sorry, 5, 7, of 5, 7,
given that theta
851
00:43:03,360 --> 00:43:08,370
is equal to 6, which we
know is equal to 1/4.
852
00:43:08,370 --> 00:43:10,630
And I did not tell
you what pi of 6 was.
853
00:43:16,840 --> 00:43:18,070
But it's the same thing here.
854
00:43:18,070 --> 00:43:21,020
The posterior for any
theta that's not 6
855
00:43:21,020 --> 00:43:23,520
is actually going to be-- this
guy's going to be equal to 0.
856
00:43:23,520 --> 00:43:26,130
So I really don't
care what this guy is.
857
00:43:26,130 --> 00:43:29,270
So what it means is that
my posterior becomes what?
858
00:43:33,870 --> 00:43:40,290
It becomes the
posterior pi of theta,
859
00:43:40,290 --> 00:43:46,970
given 5 and 7 is equal to--
well, when theta is not
860
00:43:46,970 --> 00:43:49,090
equal to 6, this is actually 0.
861
00:43:49,090 --> 00:43:52,450
So regardless of what I do here,
I get something which is 0.
862
00:43:55,120 --> 00:43:58,000
And if theta is equal
to 6, what I get
863
00:43:58,000 --> 00:44:02,500
is pi of 6 times
p of 5, 7, given 6,
864
00:44:02,500 --> 00:44:05,560
which I've just computed
here, which is 1/4 divided
865
00:44:05,560 --> 00:44:08,140
by pi of 6 times 1/4.
866
00:44:08,140 --> 00:44:10,640
So it's the ratio of two
things that are identical.
867
00:44:10,640 --> 00:44:13,360
So I get 1.
868
00:44:13,360 --> 00:44:16,570
So now, my posterior
tells me that, given
869
00:44:16,570 --> 00:44:22,440
that I observe 5
and 7, theta has
870
00:44:22,440 --> 00:44:27,690
to be 1 with probability-- has
to be 6 with probability 1.
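The posterior computation can be replayed numerically: for any prior putting positive mass on every integer, observing x1 = 5 and x2 = 7 forces all posterior mass onto theta = 6 (the truncated uniform prior below is a made-up stand-in):

```python
def likelihood(x1, x2, theta):
    # X_i = theta + eps_i, with eps_i uniform on {-1, +1}, independent
    p1 = 0.5 if abs(x1 - theta) == 1 else 0.0
    p2 = 0.5 if abs(x2 - theta) == 1 else 0.0
    return p1 * p2

# Any prior positive on the integers works; take a truncated uniform
support = range(-50, 51)
prior = {t: 1 / 101 for t in support}

x1, x2 = 5, 7
unnorm = {t: prior[t] * likelihood(x1, x2, t) for t in support}
z = sum(unnorm.values())
posterior = {t: u / z for t, u in unnorm.items()}
# The prior value at 6 cancels in the ratio: posterior mass at 6 is 1
```

The prior weight pi(6) appears in both numerator and denominator and cancels, which is why its exact value never mattered in the lecture's argument.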
871
00:44:27,690 --> 00:44:32,850
So now, I say that this
thing here-- so now, this
872
00:44:32,850 --> 00:44:34,590
is not something
that actually makes
873
00:44:34,590 --> 00:44:37,440
sense when I talk about
frequentist confidence
874
00:44:37,440 --> 00:44:38,310
intervals.
875
00:44:38,310 --> 00:44:40,560
They don't really make sense,
to talk about confidence
876
00:44:40,560 --> 00:44:42,330
intervals, given something.
877
00:44:42,330 --> 00:44:44,100
And so now, given that
I observe 5 and 7,
878
00:44:44,100 --> 00:44:46,224
I know that the probability
that theta equals 6 is 1.
879
00:44:46,224 --> 00:44:50,310
And in this sense, the
Bayesian confidence interval
880
00:44:50,310 --> 00:44:54,699
is actually more meaningful.
881
00:44:54,699 --> 00:44:56,990
So one thing I want to actually
say about this Bayesian
882
00:44:56,990 --> 00:44:58,466
confidence interval
is that it's--
883
00:45:01,100 --> 00:45:03,181
I mean, here, it's equal
to the value 1, right?
884
00:45:03,181 --> 00:45:05,180
So it really encompasses
the thing that we want.
885
00:45:05,180 --> 00:45:06,763
But the fact that
we actually computed
886
00:45:06,763 --> 00:45:09,140
it using the Bayesian
posterior and the Bayesian rule
887
00:45:09,140 --> 00:45:10,806
did not really matter
for this argument.
888
00:45:10,806 --> 00:45:12,980
All I just said was
that it had a prior.
889
00:45:12,980 --> 00:45:15,080
But just what I
want to illustrate
890
00:45:15,080 --> 00:45:17,930
is the fact that we can
actually give a meaning
891
00:45:17,930 --> 00:45:21,740
to the probability that
theta is equal to 6,
892
00:45:21,740 --> 00:45:23,390
given that I see 5 and 7.
893
00:45:23,390 --> 00:45:26,780
Whereas, we cannot really
in the other cases.
894
00:45:26,780 --> 00:45:28,490
And we don't have
to be particularly
895
00:45:28,490 --> 00:45:31,740
precise in the prior on theta
to be able to give theta this--
896
00:45:31,740 --> 00:45:32,930
to give this meaning.
897
00:45:32,930 --> 00:45:35,062
OK?
898
00:45:35,062 --> 00:45:36,038
All right.
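The discrete update just done on the board can be sketched numerically. A minimal sketch, assuming (since the model is not restated here) that X equals theta plus or minus 1, each with probability 1/2, which is what gives p(5, 7 given 6) = 1/4; the prior values below are made up, and only the fact that pi(6) > 0 matters:

```python
from fractions import Fraction

def likelihood(x, theta):
    # Assumed model: X = theta - 1 or theta + 1, each with probability 1/2,
    # so p(x | theta) = 1/2 exactly when |x - theta| = 1.
    return Fraction(1, 2) if abs(x - theta) == 1 else Fraction(0)

def posterior(data, prior):
    # Bayes' rule on a discrete parameter grid: pi(theta | data) is
    # proportional to pi(theta) * prod_i p(x_i | theta).
    unnorm = {}
    for theta, p in prior.items():
        w = p
        for x in data:
            w *= likelihood(x, theta)
        unnorm[theta] = w
    total = sum(unnorm.values())
    return {theta: w / total for theta, w in unnorm.items()}

# Any prior with pi(6) > 0 gives the same answer: pi(6) cancels in the ratio.
prior = {theta: Fraction(1, 11) for theta in range(11)}
post = posterior([5, 7], prior)
print(post[6])  # 1: theta = 6 is the only value that can produce both 5 and 7
```

Swapping in any other prior with positive mass at 6 leaves post[6] equal to 1, which is the point being made here.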
899
00:45:38,966 --> 00:45:43,130
So now, as I said, I think
the main power of Bayesian
900
00:45:43,130 --> 00:45:45,980
inference is that it spits out
the posterior distribution,
901
00:45:45,980 --> 00:45:48,830
and not just the single
number, like frequentists
902
00:45:48,830 --> 00:45:50,030
would give you.
903
00:45:50,030 --> 00:45:55,070
Then we can decorate theta
hat, our point estimate,
904
00:45:55,070 --> 00:45:56,570
with maybe some
confidence interval.
905
00:45:56,570 --> 00:45:58,400
Maybe we can do
a bunch of tests.
906
00:45:58,400 --> 00:46:01,070
But at the end of the
day, we just have,
907
00:46:01,070 --> 00:46:02,624
essentially, one number, right?
908
00:46:02,624 --> 00:46:04,040
Then maybe we can
understand where
909
00:46:04,040 --> 00:46:07,310
the fluctuations of this number
are in a frequentist setup.
910
00:46:07,310 --> 00:46:11,760
But the Bayesian
framework is essentially
911
00:46:11,760 --> 00:46:13,059
giving you a natural method.
912
00:46:13,059 --> 00:46:15,517
And you can interpret it in
terms of the probabilities that
913
00:46:15,517 --> 00:46:17,400
are associated to the prior.
914
00:46:17,400 --> 00:46:21,180
But you can actually
also try to make some--
915
00:46:21,180 --> 00:46:25,840
so a Bayesian, if you
give me any prior,
916
00:46:25,840 --> 00:46:29,040
you're going to actually build
an estimator from this prior,
917
00:46:29,040 --> 00:46:30,515
maybe from the posterior.
918
00:46:30,515 --> 00:46:32,890
And maybe it's going to have
some frequentist properties.
919
00:46:32,890 --> 00:46:35,181
And that's what's really nice
about [? Bayesians, ?] is
920
00:46:35,181 --> 00:46:36,700
that you can
actually try to give
921
00:46:36,700 --> 00:46:39,340
some frequentist properties
of Bayesian methods, that
922
00:46:39,340 --> 00:46:42,224
are built using
Bayesian methodology.
923
00:46:42,224 --> 00:46:44,140
But you cannot really
go the other way around.
924
00:46:44,140 --> 00:46:46,449
If I give you a
frequentist methodology,
925
00:46:46,449 --> 00:46:48,490
how are you going to say
something about the fact
926
00:46:48,490 --> 00:46:51,620
that there's a prior
going on, et cetera?
927
00:46:51,620 --> 00:46:53,457
And so this is actually
an area where
928
00:46:53,457 --> 00:46:55,790
there's actually some research
going on.
929
00:46:55,790 --> 00:46:58,147
They call it Bayesian
posterior concentration.
930
00:46:58,147 --> 00:46:59,980
And one of the things--
so there's something
931
00:46:59,980 --> 00:47:01,990
called the Bernstein-von
Mises theorem.
932
00:47:01,990 --> 00:47:03,910
And those are a
class of theorems,
933
00:47:03,910 --> 00:47:06,790
and those are essentially
methods that tell you, well,
934
00:47:06,790 --> 00:47:10,690
if I actually run
a Bayesian method,
935
00:47:10,690 --> 00:47:12,647
and I look at the
posterior that I get--
936
00:47:12,647 --> 00:47:14,230
it's going to be
something like this--
937
00:47:14,230 --> 00:47:16,540
but now, I try to study this
from a frequentist point of view,
938
00:47:16,540 --> 00:47:18,289
there's actually a
true parameter theta
939
00:47:18,289 --> 00:47:20,390
somewhere, the true one.
940
00:47:20,390 --> 00:47:21,640
There's no prior for this guy.
941
00:47:21,640 --> 00:47:23,410
This is just one fixed number.
942
00:47:23,410 --> 00:47:25,120
Is it true that as
my sample size is
943
00:47:25,120 --> 00:47:27,610
going to go to infinity,
then this thing is going
944
00:47:27,610 --> 00:47:29,860
to concentrate around theta?
945
00:47:29,860 --> 00:47:31,990
And the rate of
concentration of this thing,
946
00:47:31,990 --> 00:47:35,440
the size of this width,
the standard deviation
947
00:47:35,440 --> 00:47:38,290
of this thing, is something
that should decay maybe
948
00:47:38,290 --> 00:47:40,850
like 1 over square root of
n, or something like this.
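This 1 over square root of n width can be illustrated with the conjugate Beta posterior for Bernoulli data. A hedged sketch: the true p = 0.4 and the sample sizes are made up, and a Jeffreys Beta(1/2, 1/2) prior is assumed:

```python
import math
import random

def beta_std(a, b):
    # Closed-form standard deviation of a Beta(a, b) distribution.
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

random.seed(0)
p_true = 0.4  # made-up true parameter for the frequentist thought experiment
for n in (100, 400, 1600):
    s = sum(1 for _ in range(n) if random.random() < p_true)
    # Posterior width under a Beta(1/2, 1/2) prior: quadrupling n roughly
    # halves the width, i.e. the 1 over square root of n rate.
    print(n, round(beta_std(0.5 + s, 0.5 + n - s), 4))
```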
949
00:47:40,850 --> 00:47:43,349
And the rate of
posterior concentration,
950
00:47:43,349 --> 00:47:45,890
when you characterize it, it's
called the Bernstein-von Mises
951
00:47:45,890 --> 00:47:46,390
theorem.
952
00:47:46,390 --> 00:47:47,830
And so people are
looking at this
953
00:47:47,830 --> 00:47:49,566
in some non-parametric cases.
954
00:47:49,566 --> 00:47:51,190
You can do it in
pretty much everything
955
00:47:51,190 --> 00:47:52,190
we've been doing before.
956
00:47:52,190 --> 00:47:55,690
You can do it for non-parametric
regression estimation
957
00:47:55,690 --> 00:47:56,794
or density estimation.
958
00:47:56,794 --> 00:47:58,210
You can do it for,
of course-- you
959
00:47:58,210 --> 00:48:01,340
can do it for sparse
estimation, if you want.
960
00:48:01,340 --> 00:48:01,840
OK.
961
00:48:01,840 --> 00:48:04,967
So you can actually
compute the procedure and--
962
00:48:08,620 --> 00:48:09,290
yeah.
963
00:48:09,290 --> 00:48:12,660
And so you can think of it as
being just a method somehow.
964
00:48:12,660 --> 00:48:14,970
Now, the estimator
I'm talking about-- so
965
00:48:14,970 --> 00:48:18,210
that's just a general Bayesian
posterior concentration.
966
00:48:18,210 --> 00:48:20,430
But you can also
try to understand
967
00:48:20,430 --> 00:48:22,710
what is the property
of something that's
968
00:48:22,710 --> 00:48:24,210
extracted from this posterior.
969
00:48:24,210 --> 00:48:26,130
And one thing that
we actually describe
970
00:48:26,130 --> 00:48:28,310
was, for example,
well, given this guy,
971
00:48:28,310 --> 00:48:30,060
maybe it's a good idea
to think about what
972
00:48:30,060 --> 00:48:32,370
the mean of this
thing is, right?
973
00:48:32,370 --> 00:48:35,040
So there's going to
be some theta hat,
974
00:48:35,040 --> 00:48:41,460
which is just the integral of
theta times pi of theta, given x1 through xn--
975
00:48:41,460 --> 00:48:43,860
so that's my posterior--
976
00:48:43,860 --> 00:48:44,380
d theta.
977
00:48:44,380 --> 00:48:44,880
Right?
978
00:48:44,880 --> 00:48:46,500
So that's the posterior mean.
979
00:48:46,500 --> 00:48:48,750
That's the expected
value with respect
980
00:48:48,750 --> 00:48:50,880
to the posterior distribution.
981
00:48:50,880 --> 00:48:53,640
And I want to know how
does this thing behave,
982
00:48:53,640 --> 00:48:56,670
how close it is to a
true theta if I actually
983
00:48:56,670 --> 00:48:58,370
am in a frequency setup.
984
00:48:58,370 --> 00:48:59,784
So that's the posterior mean.
985
00:49:04,260 --> 00:49:08,450
But this is not the only thing
I can actually spit out, right?
986
00:49:08,450 --> 00:49:09,980
This is definitely
uniquely defined.
987
00:49:09,980 --> 00:49:13,490
If you give me a
distribution, I can actually
988
00:49:13,490 --> 00:49:15,170
spit out its posterior mean.
989
00:49:15,170 --> 00:49:17,480
But I can also think of
the posterior median.
990
00:49:21,450 --> 00:49:23,237
But now, if this
is not continuous,
991
00:49:23,237 --> 00:49:24,570
you might have some uncertainty.
992
00:49:24,570 --> 00:49:26,570
Maybe the median is
not uniquely defined,
993
00:49:26,570 --> 00:49:29,180
and so maybe that's not
something you use as much.
994
00:49:29,180 --> 00:49:31,690
Maybe you can actually talk
about the posterior mode.
995
00:49:35,160 --> 00:49:38,040
All right, so for example, if
your posterior density looks
996
00:49:38,040 --> 00:49:40,020
like this, then
maybe you just want
997
00:49:40,020 --> 00:49:43,600
to summarize your
posterior with this number.
998
00:49:43,600 --> 00:49:46,080
So clearly, in this case,
it's not such a good idea,
999
00:49:46,080 --> 00:49:48,270
because you completely
forget about this mode.
1000
00:49:48,270 --> 00:49:49,811
But maybe that's
what you want to do.
1001
00:49:49,811 --> 00:49:53,400
Maybe you want to focus
on the most peaked mode.
1002
00:49:53,400 --> 00:49:58,524
And this is actually called
maximum a posteriori.
1003
00:49:58,524 --> 00:49:59,940
As I said, maybe
you want a sample
1004
00:49:59,940 --> 00:50:03,240
from this posterior
distribution.
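The posterior mean, median, and mode just listed can all be read off one grid approximation of the posterior density. A minimal sketch with made-up numbers (a Beta posterior with parameters 7.5 and 3.5, i.e. seven 1s in ten Bernoulli draws under a Beta(1/2, 1/2) prior):

```python
def beta_density(p, a, b):
    # Unnormalized Beta(a, b) density; the constant cancels in every summary.
    return p ** (a - 1) * (1 - p) ** (b - 1)

def summaries(a, b, grid_size=100_000):
    # Grid approximations of the posterior mean, median, and mode (MAP).
    ps = [(i + 0.5) / grid_size for i in range(grid_size)]
    ws = [beta_density(p, a, b) for p in ps]
    total = sum(ws)
    mean = sum(p * w for p, w in zip(ps, ws)) / total
    acc, median = 0.0, None
    for p, w in zip(ps, ws):  # median: where the CDF crosses 1/2
        acc += w
        if acc >= total / 2:
            median = p
            break
    mode = ps[max(range(grid_size), key=ws.__getitem__)]
    return mean, median, mode

mean, median, mode = summaries(7.5, 3.5)
print(round(mean, 3))  # close to the exact value 7.5 / 11, about 0.682
```

The same grid recipe works for any posterior you can evaluate up to a constant, not just the Beta.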
1005
00:50:03,240 --> 00:50:06,420
OK, and so in all these cases,
these Bayesian estimators
1006
00:50:06,420 --> 00:50:09,000
will depend on the
prior distribution.
1007
00:50:09,000 --> 00:50:11,610
And the hope is that, as
the sample size grows,
1008
00:50:11,610 --> 00:50:14,130
you won't see its effect anymore.
1009
00:50:14,130 --> 00:50:14,630
OK.
1010
00:50:14,630 --> 00:50:20,840
So to conclude, let's just
do a couple of experiments.
1011
00:50:20,840 --> 00:50:22,340
So if I look at--
1012
00:50:25,200 --> 00:50:26,011
did we do this?
1013
00:50:26,011 --> 00:50:26,510
Yes.
1014
00:50:26,510 --> 00:50:30,398
So for example, so let's
focus on the posterior mean.
1015
00:50:34,366 --> 00:50:45,394
And we know-- so remember
in experiment one--
1016
00:50:45,394 --> 00:50:48,100
[INAUDIBLE] example
one, what we had
1017
00:50:48,100 --> 00:50:56,000
was x1 xn that were
[? iid, ?] Bernoulli p,
1018
00:50:56,000 --> 00:51:06,410
and the prior I put on p was
a beta with parameters a and a.
1019
00:51:06,410 --> 00:51:07,160
OK?
1020
00:51:07,160 --> 00:51:09,830
And if I go back to
what we computed,
1021
00:51:09,830 --> 00:51:12,740
you can actually compute
the posterior of this thing.
1022
00:51:12,740 --> 00:51:15,000
And we know that it's
actually going to be--
1023
00:51:15,000 --> 00:51:17,390
sorry, that was uniform?
1024
00:51:17,390 --> 00:51:18,620
Where is-- yeah.
1025
00:51:18,620 --> 00:51:31,170
So what we get is that
the posterior, this thing
1026
00:51:31,170 --> 00:51:36,630
is actually going to be
a beta with parameter
1027
00:51:36,630 --> 00:51:42,640
a plus the sum, so a
plus the number of 1s
1028
00:51:42,640 --> 00:51:44,770
and a plus the number of 0s.
1029
00:51:48,590 --> 00:51:49,870
OK?
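This conjugate update is literally just counting. A minimal sketch of the Beta(a, a)-prior update for Bernoulli data (the data below are made up):

```python
def beta_posterior(xs, a):
    # Bernoulli likelihood with a Beta(a, a) prior gives a Beta posterior
    # whose parameters are a plus the number of 1s and a plus the number of 0s.
    ones = sum(xs)
    zeros = len(xs) - ones
    return a + ones, a + zeros

print(beta_posterior([1, 0, 1, 1, 0], a=0.5))  # (3.5, 2.5)
```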
1030
00:51:49,870 --> 00:51:53,840
And the beta was just
something that looked like--
1031
00:51:56,480 --> 00:52:00,500
the density was p to the
a minus 1, times 1 minus p to the a minus 1.
1032
00:52:05,440 --> 00:52:05,940
OK?
1033
00:52:05,940 --> 00:52:11,130
So if I want to understand
the posterior mean,
1034
00:52:11,130 --> 00:52:13,950
I need to be able to compute
the expectation of a beta,
1035
00:52:13,950 --> 00:52:16,620
and then maybe plug
in a for a plus
1036
00:52:16,620 --> 00:52:17,980
this guy and minus this guy.
1037
00:52:17,980 --> 00:52:18,480
OK.
1038
00:52:18,480 --> 00:52:21,770
So actually, let me do this.
1039
00:52:21,770 --> 00:52:22,270
OK.
1040
00:52:22,270 --> 00:52:23,930
So what is the expectation?
1041
00:52:26,337 --> 00:52:27,920
So what I want is
something that looks
1042
00:52:27,920 --> 00:52:34,820
like the integral between 0
and 1 of p times p to the
1043
00:52:34,820 --> 00:52:42,320
a minus 1, times 1 minus
p to the b minus 1.
1044
00:52:42,320 --> 00:52:43,590
Do we agree that this--
1045
00:52:43,590 --> 00:52:46,290
and then there's a
normalizing constant.
1046
00:52:46,290 --> 00:52:49,270
Let's call it c.
1047
00:52:49,270 --> 00:52:49,770
OK?
1048
00:52:53,200 --> 00:52:56,330
So this is what I
need to compute.
1049
00:52:56,330 --> 00:52:57,640
So that's c of a and b.
1050
00:53:00,257 --> 00:53:01,840
Do we agree that
this is the posterior
1051
00:53:01,840 --> 00:53:08,651
mean with respect to a beta
with parameters a and b?
1052
00:53:08,651 --> 00:53:09,150
Right?
1053
00:53:09,150 --> 00:53:13,334
I just integrate p
against the density.
1054
00:53:13,334 --> 00:53:14,750
So what does this
thing look like?
1055
00:53:14,750 --> 00:53:18,550
Well, I can actually
move this guy in here.
1056
00:53:18,550 --> 00:53:23,402
And here, I'm going to
have a plus 1 minus 1.
1057
00:53:23,402 --> 00:53:26,366
OK?
1058
00:53:26,366 --> 00:53:29,360
So the problem is that
this thing is actually--
1059
00:53:29,360 --> 00:53:31,360
the constant is going to
play a big role, right?
1060
00:53:31,360 --> 00:53:33,100
Because this is
essentially equal
1061
00:53:33,100 --> 00:53:40,270
to c(a + 1, b)
divided by c(a, b), where
1062
00:53:40,270 --> 00:53:42,220
c(a + 1, b) is just
the normalizing
1063
00:53:42,220 --> 00:53:46,340
constant of a beta(a + 1, b).
1064
00:53:46,340 --> 00:53:48,729
So I need to know the ratio
of those two constants.
1065
00:53:58,320 --> 00:53:59,660
And this is not something--
1066
00:53:59,660 --> 00:54:01,680
I mean, this is just
a calculus exercise.
1067
00:54:01,680 --> 00:54:06,820
So in this case,
what you get is--
1068
00:54:06,820 --> 00:54:08,640
sorry.
1069
00:54:08,640 --> 00:54:09,750
In this case, you get--
1070
00:54:12,560 --> 00:54:34,940
well, OK, so we get
essentially a divided by,
1071
00:54:34,940 --> 00:54:37,990
I think, it's a plus b.
1072
00:54:37,990 --> 00:54:38,940
Yeah, it's a plus b.
1073
00:54:41,856 --> 00:54:43,314
So that's this quantity.
1074
00:54:47,188 --> 00:54:47,688
OK?
1075
00:54:51,100 --> 00:54:56,520
And when I plug in a to be this
guy and b to be this guy, what
1076
00:54:56,520 --> 00:55:02,520
I get is a plus sum of the xi.
1077
00:55:02,520 --> 00:55:06,240
And then I get a plus this
guy, a plus n minus this guy.
1078
00:55:06,240 --> 00:55:07,720
So those two guys
go away, and I'm
1079
00:55:07,720 --> 00:55:14,050
left with 2a plus n,
which does not work.
1080
00:55:14,050 --> 00:55:15,240
No, that actually works.
1081
00:55:15,240 --> 00:55:18,520
And so now what I do, I
can actually divide and get
1082
00:55:18,520 --> 00:55:19,850
this thing, over there.
1083
00:55:19,850 --> 00:55:20,350
OK.
1084
00:55:20,350 --> 00:55:23,380
So what you can see, the reason
why this thing has been divided
1085
00:55:23,380 --> 00:55:27,730
is that you can really see
that, as n goes to infinity,
1086
00:55:27,730 --> 00:55:30,120
then this thing behaves
like xn bar, which
1087
00:55:30,120 --> 00:55:31,650
is our frequentist estimator.
1088
00:55:31,650 --> 00:55:34,200
The effect of a is
actually going away.
1089
00:55:34,200 --> 00:55:37,530
The effect of the prior, which
is completely captured by a,
1090
00:55:37,530 --> 00:55:40,440
is going away as n
goes to infinity.
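The washing-out can be checked numerically with the formula just derived, (a plus the sum of the xi) over (2a plus n). A sketch where the prior parameter a = 5 and the true p = 0.3 are made up:

```python
import random

def posterior_mean(xs, a):
    # Posterior mean (a + sum x_i) / (2a + n) for a Beta(a, a) prior.
    return (a + sum(xs)) / (2 * a + len(xs))

random.seed(0)
p_true = 0.3
for n in (10, 100, 10_000):
    xs = [1 if random.random() < p_true else 0 for _ in range(n)]
    # The gap between the posterior mean and the frequentist xn bar
    # shrinks as n grows, regardless of the choice of a.
    print(n, round(posterior_mean(xs, a=5.0), 4), round(sum(xs) / n, 4))
```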
1091
00:55:40,440 --> 00:55:42,440
Is there any question?
1092
00:55:47,440 --> 00:55:48,850
You guys have a question.
1093
00:55:48,850 --> 00:55:50,202
What is it?
1094
00:55:50,202 --> 00:55:51,551
Do you have a question?
1095
00:55:51,551 --> 00:55:53,426
AUDIENCE: Yeah, on the
board, is that divided
1096
00:55:53,426 --> 00:55:56,259
by some [INAUDIBLE] stuff?
1097
00:55:56,259 --> 00:55:58,050
PHILIPPE RIGOLLET: Is
that divided by what?
1098
00:55:58,050 --> 00:56:00,555
AUDIENCE: That a over a plus
b, and then you just expanded--
1099
00:56:00,555 --> 00:56:01,930
PHILIPPE RIGOLLET:
Oh yeah, yeah,
1100
00:56:01,930 --> 00:56:05,220
then I said that this
is equal to this, right.
1101
00:56:05,220 --> 00:56:15,690
So that's for a becomes a plus
sum of the xi's, and b becomes
1102
00:56:15,690 --> 00:56:20,391
a plus n minus sum of the xi's.
1103
00:56:20,391 --> 00:56:20,890
OK.
1104
00:56:20,890 --> 00:56:22,508
So that's just for
the posterior one.
1105
00:56:22,508 --> 00:56:26,264
AUDIENCE: What's [INAUDIBLE]
1106
00:56:26,264 --> 00:56:27,430
PHILIPPE RIGOLLET: This guy?
1107
00:56:27,430 --> 00:56:28,070
AUDIENCE: Yeah.
1108
00:56:28,070 --> 00:56:28,740
PHILIPPE RIGOLLET: 2a.
1109
00:56:28,740 --> 00:56:29,281
AUDIENCE: 2a.
1110
00:56:29,281 --> 00:56:30,150
Oh, OK.
1111
00:56:30,150 --> 00:56:31,191
PHILIPPE RIGOLLET: Right.
1112
00:56:31,191 --> 00:56:34,885
So I get a plus a plus n.
1113
00:56:34,885 --> 00:56:37,960
And then those two guys cancel.
1114
00:56:37,960 --> 00:56:38,460
OK?
1115
00:56:38,460 --> 00:56:41,380
And that's what you have here.
1116
00:56:41,380 --> 00:56:44,920
So for a is equal to 1/2--
1117
00:56:44,920 --> 00:56:47,020
and I claim that this
is Jeffreys prior.
1118
00:56:47,020 --> 00:56:53,950
Because remember, Jeffreys was
[INAUDIBLE] was square root
1119
00:56:53,950 --> 00:56:56,100
and was proportional to
1 over the square root of p, 1 minus
1120
00:56:56,100 --> 00:57:01,050
p, which I can write as p to the
minus 1/2, 1 minus p to the minus 1/2.
1121
00:57:01,050 --> 00:57:03,501
So it's just the case
a is equal to 1/2.
1122
00:57:03,501 --> 00:57:04,000
OK.
1123
00:57:04,000 --> 00:57:07,660
So if I use Jeffreys prior, I
just plug in a equals to 1/2,
1124
00:57:07,660 --> 00:57:10,530
and this is what I get.
1125
00:57:10,530 --> 00:57:12,630
OK?
1126
00:57:12,630 --> 00:57:14,880
So those things are going
to have an impact again when
1127
00:57:14,880 --> 00:57:16,150
n is moderately large.
1128
00:57:16,150 --> 00:57:19,090
For large n, those things,
whether you take Jeffreys prior
1129
00:57:19,090 --> 00:57:20,710
or you take whatever
a you prefer,
1130
00:57:20,710 --> 00:57:23,130
it's going to have
no impact whatsoever.
1131
00:57:23,130 --> 00:57:26,894
But if n is of the
order of 10, say,
1132
00:57:26,894 --> 00:57:28,810
then you're going to
start to see some impact,
1133
00:57:28,810 --> 00:57:30,351
depending on what
a you want to pick.
1134
00:57:33,540 --> 00:57:34,040
OK.
1135
00:57:34,040 --> 00:57:38,390
And then in the second
example, well, here we actually
1136
00:57:38,390 --> 00:57:42,560
computed the posterior
to be this guy.
1137
00:57:42,560 --> 00:57:45,544
Well, here, I can just read off
what the expectation is, right?
1138
00:57:45,544 --> 00:57:47,210
I mean, I don't have
to actually compute
1139
00:57:47,210 --> 00:57:48,970
the expectation of a Gaussian.
1140
00:57:48,970 --> 00:57:50,650
It's just that xn bar.
1141
00:57:50,650 --> 00:57:52,660
And so in this case,
there's actually no--
1142
00:57:52,660 --> 00:57:57,190
I mean, when I have a
non-informative prior
1143
00:57:57,190 --> 00:58:01,750
for a Gaussian, then I
have basically xn bar.
1144
00:58:01,750 --> 00:58:04,390
As you can see, actually, this
is an interesting example.
1145
00:58:04,390 --> 00:58:06,490
When I actually look
at the posterior,
1146
00:58:06,490 --> 00:58:09,190
it's not something that cost
me a lot to communicate to you,
1147
00:58:09,190 --> 00:58:10,037
right?
1148
00:58:10,037 --> 00:58:12,370
There's one symbol here, one
symbol here, and one symbol
1149
00:58:12,370 --> 00:58:13,330
here.
1150
00:58:13,330 --> 00:58:17,950
I tell you the posterior is
a Gaussian with mean xn bar
1151
00:58:17,950 --> 00:58:19,660
and variance 1/n.
1152
00:58:19,660 --> 00:58:23,530
When I actually turn
that into a posterior mean,
1153
00:58:23,530 --> 00:58:26,264
I'm dropping all
this information.
1154
00:58:26,264 --> 00:58:27,930
I'm just giving you
the first parameter.
1155
00:58:27,930 --> 00:58:30,150
So you can see there's
actually much more information
1156
00:58:30,150 --> 00:58:35,100
in the posterior than there
is in the posterior mean.
1157
00:58:35,100 --> 00:58:37,210
The posterior mean
is just a point.
1158
00:58:37,210 --> 00:58:39,930
It's not telling me how
confident I am in this point.
1159
00:58:39,930 --> 00:58:41,950
And this thing is
actually very interesting.
1160
00:58:41,950 --> 00:58:42,450
OK.
1161
00:58:42,450 --> 00:58:44,283
So you can talk about
the posterior variance
1162
00:58:44,283 --> 00:58:45,880
that's associated to it, right?
1163
00:58:45,880 --> 00:58:47,516
You can talk about,
as an output,
1164
00:58:47,516 --> 00:58:49,890
you could give the posterior
mean and posterior variance.
1165
00:58:49,890 --> 00:58:53,311
And those things are
actually interesting.
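Reporting the posterior mean together with the posterior variance is especially cheap in the Gaussian example above. A sketch assuming the improper flat prior and known unit variance, with a made-up true theta = 2:

```python
import random

def gaussian_posterior(xs):
    # With a flat (improper) prior on theta and X_i ~ N(theta, 1),
    # the posterior is N(xn bar, 1/n): two numbers summarize it fully.
    n = len(xs)
    return sum(xs) / n, 1.0 / n

random.seed(1)
xs = [random.gauss(2.0, 1.0) for _ in range(400)]
mean, var = gaussian_posterior(xs)
print(round(mean, 2), var)  # mean near 2, posterior variance exactly 1/400
```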
1166
00:58:53,311 --> 00:58:53,810
All right.
1167
00:58:53,810 --> 00:58:56,370
So I think this is it.
1168
00:58:56,370 --> 00:59:05,360
So as I said, in general,
just like in this case,
1169
00:59:05,360 --> 00:59:07,980
the impact of the prior
is being washed away
1170
00:59:07,980 --> 00:59:10,310
as the sample size
goes to infinity.
1171
00:59:10,310 --> 00:59:12,860
Just well, like here, there's
no impact of the prior.
1172
00:59:12,860 --> 00:59:14,500
It was a non-informative one.
1173
00:59:14,500 --> 00:59:17,780
But if you actually had an
informative one, [? CF ?]
1174
00:59:17,780 --> 00:59:18,683
homework-- yeah?
1175
00:59:18,683 --> 00:59:19,650
AUDIENCE: [INAUDIBLE]
1176
00:59:19,650 --> 00:59:21,150
PHILIPPE RIGOLLET: Yeah,
so [? CF ?] homework,
1177
00:59:21,150 --> 00:59:23,358
you would actually see an
impact of the prior, which,
1178
00:59:23,358 --> 00:59:25,890
again, would be washed away
as your sample size increases.
1179
00:59:25,890 --> 00:59:26,820
Here, it goes away.
1180
00:59:26,820 --> 00:59:29,610
You just get xn bar.
1181
00:59:29,610 --> 00:59:31,830
And actually, in
these cases, you
1182
00:59:31,830 --> 00:59:35,580
see that the posterior
distribution converges
1183
00:59:35,580 --> 00:59:37,560
to-- sorry, the
Bayesian estimator
1184
00:59:37,560 --> 00:59:39,510
is asymptotically normal.
1185
00:59:39,510 --> 00:59:43,471
This is different from the
distribution of the posterior,
1186
00:59:43,471 --> 00:59:43,970
right?
1187
00:59:43,970 --> 00:59:45,886
This is just the posterior
mean, which happens
1188
00:59:45,886 --> 00:59:47,480
to be asymptotically normal.
1189
00:59:47,480 --> 00:59:49,595
But the posterior
may not have a--
1190
00:59:49,595 --> 00:59:53,000
I mean, here, the
posterior is a beta, right?
1191
00:59:53,000 --> 00:59:55,020
I mean, it's not normal.
1192
00:59:55,020 --> 00:59:57,210
OK, so there's
different-- those things
1193
00:59:57,210 --> 00:59:59,556
are two different things.
1194
00:59:59,556 --> 01:00:01,548
Your question?
1195
01:00:01,548 --> 01:00:04,487
AUDIENCE: What was
the prior [INAUDIBLE]
1196
01:00:04,487 --> 01:00:05,820
PHILIPPE RIGOLLET: All 1, right?
1197
01:00:05,820 --> 01:00:06,986
That was the improper prior.
1198
01:00:06,986 --> 01:00:08,896
AUDIENCE: OK.
1199
01:00:08,896 --> 01:00:12,563
And so that would give you the
same thing as [INAUDIBLE], not
1200
01:00:12,563 --> 01:00:13,790
just the proportion.
1201
01:00:13,790 --> 01:00:15,373
PHILIPPE RIGOLLET:
Well, I mean, yeah.
1202
01:00:15,373 --> 01:00:17,600
So it's essentially
telling you that--
1203
01:00:17,600 --> 01:00:23,390
so we said that, when you
have a non-informative prior,
1204
01:00:23,390 --> 01:00:25,760
essentially, the maximum
likelihood is the maximum
1205
01:00:25,760 --> 01:00:26,879
a posteriori, right?
1206
01:00:26,879 --> 01:00:28,670
But in this case,
there's so much symmetry,
1207
01:00:28,670 --> 01:00:30,560
that it just so happens that
the posterior
1208
01:00:30,560 --> 01:00:32,370
is completely symmetric
around its maximum.
1209
01:00:32,370 --> 01:00:34,809
So it means that the expectation
is equal to the maximum,
1210
01:00:34,809 --> 01:00:35,600
to [INAUDIBLE] max.
1211
01:00:40,957 --> 01:00:41,931
Yeah?
1212
01:00:41,931 --> 01:00:43,392
AUDIENCE: I read
somewhere that one
1213
01:00:43,392 --> 01:00:45,340
of the issues with
Bayesian methods
1214
01:00:45,340 --> 01:00:46,801
is that we choose
the wrong prior,
1215
01:00:46,801 --> 01:00:49,723
and it could mess
up your results.
1216
01:00:49,723 --> 01:00:51,370
PHILIPPE RIGOLLET:
Yeah, but hence,
1217
01:00:51,370 --> 01:00:53,980
do not pick the wrong prior.
1218
01:00:53,980 --> 01:00:55,244
I mean, of course, it would.
1219
01:00:55,244 --> 01:00:57,160
I mean, it would mess
up your res-- of course.
1220
01:00:57,160 --> 01:00:58,810
I mean, you're putting
extra information.
1221
01:00:58,810 --> 01:01:00,601
But you could say the
same thing by saying,
1222
01:01:00,601 --> 01:01:03,670
well, the issue with
frequentist method
1223
01:01:03,670 --> 01:01:06,730
is that, if you mess up the
choice of your likelihood,
1224
01:01:06,730 --> 01:01:09,424
then it's going to
mess up your output.
1225
01:01:09,424 --> 01:01:11,590
So here, you just have two
chances of messing it up,
1226
01:01:11,590 --> 01:01:12,250
right?
1227
01:01:12,250 --> 01:01:14,440
You have the-- well, it's gone.
1228
01:01:14,440 --> 01:01:17,920
So you have the product of
the likelihood and the prior,
1229
01:01:17,920 --> 01:01:20,350
and you have one
more chance to--
1230
01:01:20,350 --> 01:01:22,420
but it's true, if you
assume that the model is
1231
01:01:22,420 --> 01:01:25,960
right, then, of course,
finding the wrong prior could
1232
01:01:25,960 --> 01:01:28,520
completely mess up things
if your prior, for example,
1233
01:01:28,520 --> 01:01:30,780
has no support on
the true parameter.
1234
01:01:30,780 --> 01:01:34,715
But if your prior has a positive
weight on the true parameter
1235
01:01:34,715 --> 01:01:38,140
as n goes to infinity--
1236
01:01:38,140 --> 01:01:40,640
I mean, OK, I cannot speak
for all counterexamples
1237
01:01:40,640 --> 01:01:41,480
in the world.
1238
01:01:41,480 --> 01:01:44,450
But I'm sure, under minor
technical conditions,
1239
01:01:44,450 --> 01:01:46,550
you can guarantee
that your posterior
1240
01:01:46,550 --> 01:01:48,530
mean is going to
converge to what
1241
01:01:48,530 --> 01:01:49,742
you need it to converge to.
1242
01:01:53,678 --> 01:01:54,662
Any other question?
1243
01:01:57,881 --> 01:01:58,380
All right.
1244
01:01:58,380 --> 01:02:07,650
So I think this closes the more
traditional mathematical-- not
1245
01:02:07,650 --> 01:02:11,490
mathematical, but traditional
statistics part of this class.
1246
01:02:11,490 --> 01:02:14,310
And from here on, we'll
talk about more multivariate
1247
01:02:14,310 --> 01:02:17,740
statistics, starting with
principal component analysis.
1248
01:02:17,740 --> 01:02:19,800
So that's more like when
you have multiple data.
1249
01:02:19,800 --> 01:02:22,650
We started, in a way, to talk
about multivariate statistics
1250
01:02:22,650 --> 01:02:25,320
when we talked about
multivariate regression.
1251
01:02:25,320 --> 01:02:28,180
But we'll move on to
principal component analysis.
1252
01:02:28,180 --> 01:02:30,690
I'll talk a bit about
multiple testing.
1253
01:02:30,690 --> 01:02:32,400
I haven't made up my
mind yet about what
1254
01:02:32,400 --> 01:02:34,350
we'll really talk about in December.
1255
01:02:34,350 --> 01:02:36,480
But I want to make
sure that you have
1256
01:02:36,480 --> 01:02:41,310
a taste and a flavor of what is
interesting in statistics
1257
01:02:41,310 --> 01:02:44,341
these days, especially as you
go towards more [INAUDIBLE]
1258
01:02:44,341 --> 01:02:46,590
learning type of questions,
where really, the focus is
1259
01:02:46,590 --> 01:02:48,619
on prediction rather
than the modeling itself.
1260
01:02:48,619 --> 01:02:50,160
We'll talk about
logistic regression,
1261
01:02:50,160 --> 01:02:52,800
as well, for example,
which is generalized
1262
01:02:52,800 --> 01:02:55,470
linear models, which is just
the generalization in the case
1263
01:02:55,470 --> 01:03:00,480
that y does not take values in
the whole real line, maybe 0,1,
1264
01:03:00,480 --> 01:03:03,360
for example, for regression.
1265
01:03:03,360 --> 01:03:03,960
All right.
1266
01:03:03,960 --> 01:03:05,510
Thanks.