1
00:00:00,120 --> 00:00:02,460
The following content is
provided under a Creative
2
00:00:02,460 --> 00:00:03,880
Commons license.
3
00:00:03,880 --> 00:00:06,090
Your support will help
MIT OpenCourseWare
4
00:00:06,090 --> 00:00:10,180
continue to offer high-quality
educational resources for free.
5
00:00:10,180 --> 00:00:12,720
To make a donation or to
view additional materials
6
00:00:12,720 --> 00:00:16,650
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:16,650 --> 00:00:17,880
at ocw.mit.edu.
8
00:00:20,524 --> 00:00:21,940
PHILIPPE RIGOLLET:
So today, we're
9
00:00:21,940 --> 00:00:24,820
going to close this
chapter, this short chapter,
10
00:00:24,820 --> 00:00:26,200
on Bayesian inference.
11
00:00:26,200 --> 00:00:28,990
Again, this was just
an overview of what you
12
00:00:28,990 --> 00:00:32,259
can do in Bayesian inference.
13
00:00:32,259 --> 00:00:34,630
And last time, we
started defining
14
00:00:34,630 --> 00:00:36,260
what's called Jeffreys priors.
15
00:00:36,260 --> 00:00:36,760
Right?
16
00:00:36,760 --> 00:00:38,560
So when you do
Bayesian inference,
17
00:00:38,560 --> 00:00:41,620
you have to introduce a
prior on your parameter.
18
00:00:41,620 --> 00:00:43,660
And we said that
usually, it's something
19
00:00:43,660 --> 00:00:45,820
that encodes your domain
knowledge about where
20
00:00:45,820 --> 00:00:47,130
the parameter could be.
21
00:00:47,130 --> 00:00:49,030
But there's also some
principle way to do it,
22
00:00:49,030 --> 00:00:51,155
if you want to do Bayesian
inference without really
23
00:00:51,155 --> 00:00:53,420
having to think about it.
24
00:00:53,420 --> 00:00:56,260
And for example, one
of the natural priors
25
00:00:56,260 --> 00:00:58,080
were those non-informative
priors, right?
26
00:00:58,080 --> 00:00:59,740
If you were on a
compact set, it's
27
00:00:59,740 --> 00:01:01,570
a uniform prior of this set.
28
00:01:01,570 --> 00:01:04,239
If you're on an infinite set,
you can still think of taking
29
00:01:04,239 --> 00:01:06,520
the constant prior.
30
00:01:06,520 --> 00:01:09,280
And that's called a flat prior.
That's always equal to 1.
31
00:01:09,280 --> 00:01:13,300
And that's an improper prior
if you are on an infinite set
32
00:01:13,300 --> 00:01:14,830
or proportional to one.
33
00:01:14,830 --> 00:01:17,860
And so another prior
that you can think of,
34
00:01:17,860 --> 00:01:20,230
in the case where you have
a Fisher information, which
35
00:01:20,230 --> 00:01:23,200
is well-defined, is something
called Jeffreys prior.
36
00:01:23,200 --> 00:01:25,600
And this prior is
a prior which is
37
00:01:25,600 --> 00:01:28,150
proportional to square root of
the determinant of the Fisher
38
00:01:28,150 --> 00:01:29,780
information matrix.
39
00:01:29,780 --> 00:01:31,750
And if you're in
one dimension, it's
40
00:01:31,750 --> 00:01:37,750
basically proportional to
a square root of the Fisher
41
00:01:37,750 --> 00:01:40,750
information coefficient,
which we know, for example,
42
00:01:40,750 --> 00:01:44,170
is the asymptotic variance
of the maximum likelihood
43
00:01:44,170 --> 00:01:45,370
estimator.
44
00:01:45,370 --> 00:01:48,010
And it turns out
that it's basically--
45
00:01:48,010 --> 00:01:50,330
So square root of this
thing is basically
46
00:01:50,330 --> 00:01:54,160
one over the standard deviation
of the maximum likelihood
47
00:01:54,160 --> 00:01:55,150
estimator.
48
00:01:55,150 --> 00:01:56,690
And so you can
compute this, right?
49
00:01:56,690 --> 00:01:59,944
So you can compute for the
maximum likelihood estimator.
50
00:01:59,944 --> 00:02:01,360
We know that the
variance is going
51
00:02:01,360 --> 00:02:09,910
to be p(1 - p)
in the Bernoulli
52
00:02:09,910 --> 00:02:11,200
statistical experiment.
53
00:02:11,200 --> 00:02:13,510
So you get this one over the
square root of this thing.
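As a quick numerical sanity check (not part of the lecture; the code is illustrative), the Bernoulli Jeffreys prior pi(p) proportional to 1 over the square root of p(1 - p) is exactly the Beta(1/2, 1/2) density, whose normalizing constant is the number pi:

```python
import math

def jeffreys_density_bernoulli(p):
    """Jeffreys prior for Bernoulli(p): pi(p) proportional to
    sqrt(I(p)) = 1 / sqrt(p * (1 - p)).
    Up to normalization this is the Beta(1/2, 1/2) density."""
    return 1.0 / math.sqrt(p * (1.0 - p))

# The normalizing constant is math.pi (the Beta(1/2, 1/2) normalizer),
# which we can check with a crude midpoint-rule integration.
n = 100000
total = sum(jeffreys_density_bernoulli((i + 0.5) / n) / n for i in range(n))
print(total)  # close to pi, about 3.14
```

The midpoint rule avoids the (integrable) singularities at 0 and 1, so the crude sum gets within a few thousandths of pi.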
54
00:02:13,510 --> 00:02:16,720
And for example, in
the Gaussian setting,
55
00:02:16,720 --> 00:02:19,880
you actually have the
Fisher information,
56
00:02:19,880 --> 00:02:22,000
even in the multi-variate
one, is actually
57
00:02:22,000 --> 00:02:24,752
going to be something
like the identity matrix.
58
00:02:24,752 --> 00:02:25,960
So this is proportional to 1.
59
00:02:25,960 --> 00:02:29,530
It's the improper prior that
you get, in this case, OK?
60
00:02:29,530 --> 00:02:31,690
Meaning that, for
the Gaussian setting,
61
00:02:31,690 --> 00:02:33,880
no place where you
center your Gaussian
62
00:02:33,880 --> 00:02:36,020
is actually better
than any other.
63
00:02:36,020 --> 00:02:36,520
All right.
64
00:02:36,520 --> 00:02:40,130
So we basically
left on this slide,
65
00:02:40,130 --> 00:02:43,570
where we saw that
Jeffreys priors satisfy
66
00:02:43,570 --> 00:02:46,170
a reparametrization
invariance-- they're invariant
67
00:02:46,170 --> 00:02:49,180
by transformation of
your parameter, which
68
00:02:49,180 --> 00:02:51,920
is a desirable property.
69
00:02:51,920 --> 00:02:57,217
And the way it works: it says that, well,
if I have my prior on theta,
70
00:02:57,217 --> 00:02:59,050
and then I suddenly
decide that theta is not
71
00:02:59,050 --> 00:03:01,720
the parameter I want to use
to parameterize my problem,
72
00:03:01,720 --> 00:03:04,640
actually what I want
is phi of theta.
73
00:03:04,640 --> 00:03:07,840
So think, for example, as theta
being the mean of a Gaussian,
74
00:03:07,840 --> 00:03:11,140
and phi of theta as
being the mean cubed.
75
00:03:11,140 --> 00:03:11,920
OK?
76
00:03:11,920 --> 00:03:15,520
This is a one-to-one
map phi, right?
77
00:03:15,520 --> 00:03:20,185
So for example, if I want to
go from theta to theta cubed,
78
00:03:20,185 --> 00:03:22,840
and now I decide that this is
the actual parameter that I
79
00:03:22,840 --> 00:03:26,200
want, well, then it means
that, on this parameter,
80
00:03:26,200 --> 00:03:29,110
my original prior is going
to induce another prior.
81
00:03:29,110 --> 00:03:30,970
And here, it says,
well, this prior
82
00:03:30,970 --> 00:03:33,200
is actually also Jeffreys prior.
83
00:03:33,200 --> 00:03:33,700
OK?
84
00:03:33,700 --> 00:03:35,450
So it's essentially
telling you that,
85
00:03:35,450 --> 00:03:38,410
for this new parametrization,
if you take Jeffreys prior, then
86
00:03:38,410 --> 00:03:41,201
you actually go back to having
exactly something that's
87
00:03:41,201 --> 00:03:43,450
of the form square root
of determinant of the Fisher
88
00:03:43,450 --> 00:03:45,116
information, but this
thing with respect
89
00:03:45,116 --> 00:03:47,810
to your new
parametrization. All right.
90
00:03:47,810 --> 00:03:50,360
And so why is this true?
91
00:03:50,360 --> 00:03:53,440
Well, it's just this
change of variable theorem.
92
00:03:53,440 --> 00:03:58,330
So it's essentially telling
you that, if you call--
93
00:03:58,330 --> 00:04:08,850
let's call p-- well, let's call
pi tilde of eta the prior over eta.
94
00:04:08,850 --> 00:04:11,130
And you have pi of
theta as the prior
95
00:04:11,130 --> 00:04:18,040
over theta. Then, since eta
is of the form phi of theta,
96
00:04:18,040 --> 00:04:26,620
just by change of variable,
so that's essentially
97
00:04:26,620 --> 00:04:33,070
a probability result. It
says that pi tilde of eta
98
00:04:33,070 --> 00:04:42,790
is equal to pi of
theta times d
99
00:04:42,790 --> 00:04:48,860
theta over d eta and--
100
00:04:55,706 --> 00:04:57,189
sorry, is that the one?
101
00:04:57,189 --> 00:04:58,730
Sorry, I'm going to
have to write it,
102
00:04:58,730 --> 00:04:59,938
because I always forget this.
103
00:05:05,209 --> 00:05:07,380
So if I take a function--
104
00:05:14,380 --> 00:05:14,960
OK.
105
00:05:14,960 --> 00:05:16,400
So what I want is to check.
106
00:05:38,340 --> 00:05:41,870
OK, so I want a function
h of eta that I can use here.
107
00:05:41,870 --> 00:05:48,480
And what I know is that
this is h of phi of theta.
108
00:05:48,480 --> 00:05:48,980
All right?
109
00:05:48,980 --> 00:05:51,810
So sorry, eta is
phi of theta, right?
110
00:05:51,810 --> 00:05:53,471
Yeah.
111
00:05:53,471 --> 00:05:54,970
So what I'm going
to do is I'm going
112
00:05:54,970 --> 00:06:09,130
to do the change of variable,
theta is phi inverse of eta.
113
00:06:09,130 --> 00:06:14,120
So eta is phi of
theta, which means
114
00:06:14,120 --> 00:06:20,540
that d eta is equal to d--
115
00:06:20,540 --> 00:06:26,020
well, to phi prime
of theta d theta.
116
00:06:26,020 --> 00:06:31,464
So when I'm going to write this,
I'm going to get integral of h.
117
00:06:31,464 --> 00:06:33,470
Actually, let me
write this, as I
118
00:06:33,470 --> 00:06:36,980
am more comfortable
writing this as e
119
00:06:36,980 --> 00:06:40,031
with respect to eta of h of eta.
120
00:06:40,031 --> 00:06:40,530
OK?
121
00:06:40,530 --> 00:06:44,580
So that's just with eta
drawn from the prior.
122
00:06:44,580 --> 00:06:47,670
And I want to write this as
the integral of h of eta times
123
00:06:47,670 --> 00:06:49,080
some function, right?
124
00:06:49,080 --> 00:06:58,580
So this is the
integral of h of phi
125
00:06:58,580 --> 00:07:03,556
of theta pi of theta d theta.
126
00:07:03,556 --> 00:07:06,150
Now, I'm going to do
my change of variable.
127
00:07:06,150 --> 00:07:09,290
So this is going to be
the integral of h of eta.
128
00:07:09,290 --> 00:07:16,420
And then pi of phi of--
129
00:07:16,420 --> 00:07:20,290
so theta is phi inverse of eta.
130
00:07:20,290 --> 00:07:27,390
And then d eta is phi
prime of theta d theta, OK?
131
00:07:27,390 --> 00:07:30,210
And so what is pi at phi inverse of eta?
132
00:07:30,210 --> 00:07:32,120
So this thing is proportional.
133
00:07:32,120 --> 00:07:33,750
So we're in, say,
dimension 1, so it's
134
00:07:33,750 --> 00:07:38,420
proportional of square root
of the Fisher information.
135
00:07:38,420 --> 00:07:39,920
And the Fisher
information, we know,
136
00:07:39,920 --> 00:07:44,630
is the expectation of the square
of the derivative of the log
137
00:07:44,630 --> 00:07:45,770
likelihood, right?
138
00:07:45,770 --> 00:07:48,740
So this is square root
of the expectation
139
00:07:48,740 --> 00:08:03,650
of d over d theta of log of--
140
00:08:03,650 --> 00:08:06,010
well, now, I need the density.
141
00:08:06,010 --> 00:08:10,050
Well, let's just
call it l of theta.
142
00:08:10,050 --> 00:08:17,030
And I want this to be taken
at phi inverse of eta squared.
143
00:08:19,980 --> 00:08:21,480
And then what I pick up is the--
144
00:08:23,771 --> 00:08:25,770
so I'm going to put
everything under the square.
145
00:08:25,770 --> 00:08:31,460
So I get phi prime of
theta squared d theta.
146
00:08:31,460 --> 00:08:33,260
OK?
147
00:08:33,260 --> 00:08:35,090
So now, I have the
expectation of a square.
148
00:08:35,090 --> 00:08:38,539
This does not depend, so this
is-- sorry, this is l of theta.
149
00:08:38,539 --> 00:08:42,307
This is the expectation of
l of theta of an x, right?
150
00:08:42,307 --> 00:08:44,390
That's for some variable,
and the expectation here
151
00:08:44,390 --> 00:08:45,710
is with respect to x.
152
00:08:45,710 --> 00:08:49,824
That's just the definition
of the Fisher information.
153
00:08:49,824 --> 00:08:52,240
So now I'm going to squeeze
this guy into the expectation.
154
00:08:52,240 --> 00:08:53,260
It does not depend on x.
155
00:08:53,260 --> 00:08:55,412
It just acts as a constant.
156
00:08:55,412 --> 00:08:57,370
And so what I have now
is that this is actually
157
00:08:57,370 --> 00:08:59,760
proportional to
the integral of h
158
00:08:59,760 --> 00:09:05,320
eta times the square root of
the expectation with respect
159
00:09:05,320 --> 00:09:06,600
to x of what?
160
00:09:06,600 --> 00:09:10,540
Well, here, I have d over
d theta of log of l of theta.
161
00:09:10,540 --> 00:09:15,620
And here, this guy is really
d eta over d theta, right?
162
00:09:19,524 --> 00:09:21,480
Agree?
163
00:09:21,480 --> 00:09:24,720
So now, what I'm really left
with-- so I get d over d theta
164
00:09:24,720 --> 00:09:25,520
times d--
165
00:09:25,520 --> 00:09:28,047
sorry, times d theta over d eta.
166
00:09:42,980 --> 00:09:51,396
so that's just d over
d eta of log of l of eta at x.
167
00:10:00,198 --> 00:10:04,370
And then this guy is now
becoming d eta, right?
168
00:10:04,370 --> 00:10:06,590
OK, so this was a mess.
169
00:10:09,710 --> 00:10:12,320
This is a complete mess, because
I actually want to use phi.
170
00:10:12,320 --> 00:10:14,150
I should not actually
introduce phi at all.
171
00:10:14,150 --> 00:10:21,930
I should just talk about d eta
over d theta type of things.
172
00:10:21,930 --> 00:10:24,370
And then that would actually
make my life so much easier.
173
00:10:24,370 --> 00:10:25,002
OK.
174
00:10:25,002 --> 00:10:26,710
I'm not going to spend
more time on this.
175
00:10:26,710 --> 00:10:28,210
This is really just
the idea, right?
176
00:10:28,210 --> 00:10:30,170
You have square root
of a square in there.
177
00:10:30,170 --> 00:10:31,480
And then, when you do
your change of variable,
178
00:10:31,480 --> 00:10:32,710
you just pick up a square.
179
00:10:32,710 --> 00:10:35,750
You just pick up
something in here.
180
00:10:35,750 --> 00:10:38,110
And so you just move
this thing in there.
181
00:10:38,110 --> 00:10:38,920
You get a square.
182
00:10:38,920 --> 00:10:40,400
It goes inside the square.
183
00:10:40,400 --> 00:10:42,280
And so your derivative
of the log likelihood
184
00:10:42,280 --> 00:10:44,488
with respect to theta becomes
a derivative of the log
185
00:10:44,488 --> 00:10:46,240
likelihood with respect to eta.
186
00:10:46,240 --> 00:10:48,850
And that's the only thing
that's happening here.
187
00:10:48,850 --> 00:10:52,478
I'm just being super
sloppy, for some reason.
188
00:10:52,478 --> 00:10:54,612
OK.
189
00:10:54,612 --> 00:10:56,570
And then, of course, now,
what you're left with
190
00:10:56,570 --> 00:10:59,442
is that this is really
just proportional.
191
00:10:59,442 --> 00:11:00,650
Well, this is actually equal.
192
00:11:00,650 --> 00:11:02,150
Everything is
proportional, but this
193
00:11:02,150 --> 00:11:05,090
is equal to the Fisher
information tilde with respect
194
00:11:05,090 --> 00:11:07,050
to eta now.
195
00:11:07,050 --> 00:11:07,550
Right?
196
00:11:07,550 --> 00:11:09,630
You're doing this
with respect to eta.
197
00:11:09,630 --> 00:11:17,010
And so that's your new
prior with respect to eta.
198
00:11:17,010 --> 00:11:17,510
OK.
199
00:11:17,510 --> 00:11:21,800
So one thing that
you want to do,
200
00:11:21,800 --> 00:11:23,870
once you have-- so
remember, when you actually
201
00:11:23,870 --> 00:11:26,600
compute your
posterior, right-- rather
202
00:11:26,600 --> 00:11:29,330
than having-- so you
start with a prior,
203
00:11:29,330 --> 00:11:32,090
and you have some observations,
let's say, x1 to xn.
204
00:11:36,190 --> 00:11:41,540
When you do Bayesian
inference, rather than spitting
205
00:11:41,540 --> 00:11:45,450
out just some theta hat, which
is an estimator for theta,
206
00:11:45,450 --> 00:11:48,565
you actually spit out an
entire posterior distribution--
207
00:11:53,220 --> 00:11:57,040
pi of theta, given x1 xn.
208
00:11:57,040 --> 00:11:57,540
OK?
209
00:11:57,540 --> 00:11:59,460
So there's an
entire distribution
210
00:11:59,460 --> 00:12:01,110
on the parameter theta.
211
00:12:01,110 --> 00:12:04,290
And you can actually use this
to perform inference, rather
212
00:12:04,290 --> 00:12:06,150
than just having one number.
213
00:12:06,150 --> 00:12:06,950
OK?
214
00:12:06,950 --> 00:12:09,300
And so you could actually
build confidence regions
215
00:12:09,300 --> 00:12:10,540
from this thing.
216
00:12:10,540 --> 00:12:11,040
OK.
217
00:12:11,040 --> 00:12:16,600
And so a Bayesian
confidence interval--
218
00:12:16,600 --> 00:12:21,480
so if your set of parameters
is included in the real line,
219
00:12:21,480 --> 00:12:23,880
then you can actually--
it's not even guaranteed
220
00:12:23,880 --> 00:12:25,740
to be an interval.
221
00:12:25,740 --> 00:12:33,350
So let me call it a confidence
region, so a Bayesian
222
00:12:33,350 --> 00:12:40,090
confidence region, OK?
223
00:12:40,090 --> 00:12:43,360
So it's just a random subset.
224
00:12:43,360 --> 00:12:47,810
So let's call it r,
included in capital theta.
225
00:12:47,810 --> 00:12:49,750
And when you have the
deterministic one,
226
00:12:49,750 --> 00:12:53,650
we had a definition, which was
with respect to the randomness
227
00:12:53,650 --> 00:12:54,880
of the data, right?
228
00:12:54,880 --> 00:12:57,850
That's how you actually
had a random subset.
229
00:12:57,850 --> 00:12:59,740
So you had a random
confidence interval.
230
00:12:59,740 --> 00:13:02,200
Here, it's actually
conditioned on the data,
231
00:13:02,200 --> 00:13:03,640
but with respect
to the randomness
232
00:13:03,640 --> 00:13:06,531
that you actually get from
your posterior distribution.
233
00:13:06,531 --> 00:13:07,030
OK?
234
00:13:07,030 --> 00:13:16,760
So such that the
probability that your theta
235
00:13:16,760 --> 00:13:18,350
belongs to this
confidence region,
236
00:13:18,350 --> 00:13:24,500
given x1 xn is, say,
at least 1 minus alpha.
237
00:13:24,500 --> 00:13:27,040
Let's just take it
equal to 1 minus alpha.
238
00:13:27,040 --> 00:13:34,530
OK so that's a confidence
region at level 1 minus alpha.
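As an illustrative sketch (the numbers are hypothetical, not from the lecture): with n = 20 Bernoulli observations, 7 successes, and the Beta(1/2, 1/2) Jeffreys prior, the posterior is Beta(7.5, 13.5), and a Bayesian confidence region at level 1 minus alpha can be read off by cutting alpha/2 of the posterior mass from each tail:

```python
# Posterior Beta(7.5, 13.5) from 7 successes in 20 trials under the
# Jeffreys Beta(1/2, 1/2) prior (hypothetical example numbers).
a, b = 7 + 0.5, 13 + 0.5

def beta_pdf(p, a, b):
    # Unnormalized Beta(a, b) density; the grid sum below normalizes it.
    return p ** (a - 1) * (1 - p) ** (b - 1)

m = 10000
grid = [(i + 0.5) / m for i in range(m)]
weights = [beta_pdf(p, a, b) for p in grid]
total = sum(weights)
probs = [w / total for w in weights]

def quantile(q):
    # Smallest grid point with cumulative posterior mass at least q.
    acc = 0.0
    for p, w in zip(grid, probs):
        acc += w
        if acc >= q:
            return p
    return grid[-1]

# Equal-tailed 95% region: 95% posterior probability that theta is inside.
lo, hi = quantile(0.025), quantile(0.975)
print(lo, hi)  # roughly (0.17, 0.57)
```

The probability statement here is with respect to the posterior, conditioned on the data, which is exactly the definition above.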
239
00:13:34,530 --> 00:13:36,240
OK, so that's one way.
240
00:13:36,240 --> 00:13:38,770
So why would you actually--
241
00:13:38,770 --> 00:13:41,390
when I actually implement
Bayesian inference,
242
00:13:41,390 --> 00:13:44,480
I'm actually spitting out
that entire distribution.
243
00:13:44,480 --> 00:13:47,540
I need to summarize this thing
to communicate it, right?
244
00:13:47,540 --> 00:13:49,730
I cannot just say this
is this entire function.
245
00:13:49,730 --> 00:13:51,230
I want to know where
are the regions
246
00:13:51,230 --> 00:13:54,344
of high probability, where my
parameter is supposed to be?
247
00:13:54,344 --> 00:13:56,510
And so here, when I have
this thing, what I actually
248
00:13:56,510 --> 00:13:58,010
want to have is
something that says,
249
00:13:58,010 --> 00:14:00,200
well, I want to
summarize this thing
250
00:14:00,200 --> 00:14:03,680
into some subset of the
real line, in which I'm
251
00:14:03,680 --> 00:14:08,120
sure that the area under the
curve, here, of my posterior
252
00:14:08,120 --> 00:14:11,734
is actually 1 minus alpha.
253
00:14:11,734 --> 00:14:13,400
And there's many ways
to do this, right?
254
00:14:16,790 --> 00:14:22,450
So one way to do this is
to look at level sets.
255
00:14:27,870 --> 00:14:29,550
And so rather than
actually-- so let's
256
00:14:29,550 --> 00:14:32,220
say my posterior
looks like this.
257
00:14:32,220 --> 00:14:35,760
I know, for example, if I
have a Gaussian distribution,
258
00:14:35,760 --> 00:14:38,230
I can actually take my posterior
to be-- my posterior is
259
00:14:38,230 --> 00:14:39,480
actually going to be Gaussian.
260
00:14:43,060 --> 00:14:50,760
And what I can do is to try
to cut it here on the y-axis
261
00:14:50,760 --> 00:14:54,910
so that now, the area under
the curve, when I cut here,
262
00:14:54,910 --> 00:14:59,430
is actually 1 minus alpha.
263
00:14:59,430 --> 00:15:02,080
OK, so I have some
threshold tau.
264
00:15:02,080 --> 00:15:05,490
If tau goes to plus
infinity, then I'm
265
00:15:05,490 --> 00:15:07,380
going to have that this
area under the curve
266
00:15:07,380 --> 00:15:10,380
here is going to--
267
00:15:18,012 --> 00:15:19,920
AUDIENCE: [INAUDIBLE]
268
00:15:19,920 --> 00:15:21,786
PHILIPPE RIGOLLET: Well, no.
269
00:15:21,786 --> 00:15:23,160
So the area under
the curve, when
270
00:15:23,160 --> 00:15:24,810
tau is going to
plus infinity, think
271
00:15:24,810 --> 00:15:27,892
of the case when
tau is just right here.
272
00:15:27,892 --> 00:15:29,280
AUDIENCE: [INAUDIBLE]
273
00:15:29,280 --> 00:15:32,150
PHILIPPE RIGOLLET: So this is
actually going to 0, right?
274
00:15:32,150 --> 00:15:33,530
And so I start here.
275
00:15:33,530 --> 00:15:36,290
And then I start going down
and down and down and down,
276
00:15:36,290 --> 00:15:39,440
until I actually get something
which is equal to 1 minus
277
00:15:39,440 --> 00:15:40,160
alpha.
278
00:15:40,160 --> 00:15:44,000
And if tau is going down to 0,
then my area under the curve
279
00:15:44,000 --> 00:15:44,750
is going to--
280
00:15:48,240 --> 00:15:51,604
if tau is here, I'm
cutting nowhere.
281
00:15:51,604 --> 00:15:52,770
And so I'm getting 1, right?
282
00:15:56,160 --> 00:15:56,980
Agree?
283
00:15:56,980 --> 00:16:00,540
Think of, when tau
is very close to 0,
284
00:16:00,540 --> 00:16:02,876
I'm cutting
very far down here.
285
00:16:02,876 --> 00:16:04,750
And so I'm getting some
area under the curve,
286
00:16:04,750 --> 00:16:06,000
which is almost everything.
287
00:16:06,000 --> 00:16:08,100
And so it's going to 1--
as tau goes down to 0.
288
00:16:08,100 --> 00:16:09,960
Yeah?
289
00:16:09,960 --> 00:16:12,882
AUDIENCE: Does this only
work for [INAUDIBLE]
290
00:16:12,882 --> 00:16:14,340
PHILIPPE RIGOLLET:
No, it does not.
291
00:16:14,340 --> 00:16:17,160
I mean-- so this is a picture.
292
00:16:17,160 --> 00:16:20,277
So those two things work
for all of them, right?
293
00:16:20,277 --> 00:16:22,110
But when you have a
bimodal posterior, actually,
294
00:16:22,110 --> 00:16:23,526
this is actually
when things start
295
00:16:23,526 --> 00:16:24,990
to become interesting, right?
296
00:16:24,990 --> 00:16:30,600
So when we built a frequentist
confidence interval,
297
00:16:30,600 --> 00:16:34,590
it was always of the form x
bar plus or minus something.
298
00:16:34,590 --> 00:16:36,510
But now, if I start to
have a posterior that
299
00:16:36,510 --> 00:16:40,230
looks like this, what I'm
going to start cutting off,
300
00:16:40,230 --> 00:16:41,370
I'm going to have two--
301
00:16:41,370 --> 00:16:44,550
I mean, my confidence
region is going
302
00:16:44,550 --> 00:16:47,740
to be the union of
those two things, right?
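The level-set construction just described can be sketched numerically (the bimodal posterior here is an assumed example, a mixture of two normals, not from the lecture): lower the threshold tau until the mass above it is 1 minus alpha, and for a two-bump density the region comes out as a union of two intervals, one around each mode.

```python
import math

def posterior(x):
    # Hypothetical posterior: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
    def phi(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return 0.5 * phi(x, -2.0, 0.5) + 0.5 * phi(x, 2.0, 0.5)

alpha = 0.05
n_cells = 12000
h = 12.0 / n_cells
cells = [-6.0 + (i + 0.5) * h for i in range(n_cells)]

# Keep the highest-density cells until they hold 1 - alpha of the mass;
# the density of the last cell kept plays the role of the cutoff tau.
mass, region = 0.0, []
for x in sorted(cells, key=posterior, reverse=True):
    region.append(x)
    mass += posterior(x) * h
    if mass >= 1 - alpha:
        break
region.sort()

# The kept cells split into runs; a jump bigger than 2h marks a break
# between intervals of the highest-posterior-density region.
gaps = [b - a for a, b in zip(region, region[1:]) if b - a > 2 * h]
print(len(gaps) + 1)  # number of intervals in the region: 2 here
```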
303
00:16:47,740 --> 00:16:50,700
And it really reflects
the fact that there
304
00:16:50,700 --> 00:16:51,820
is this bimodal thing.
305
00:16:51,820 --> 00:16:53,486
It's going to say,
well, with high probability,
306
00:16:53,486 --> 00:16:56,840
I'm actually going to
be either here or here.
307
00:16:56,840 --> 00:16:59,840
Now, the meaning here of a
Bayesian confidence region
308
00:16:59,840 --> 00:17:02,570
and the confidence interval are
completely distinct notions,
309
00:17:02,570 --> 00:17:03,260
right?
310
00:17:03,260 --> 00:17:06,140
And I'm going to work
out an example with you
311
00:17:06,140 --> 00:17:08,673
so that we can actually
see that sometimes--
312
00:17:08,673 --> 00:17:10,089
I mean, both of
them, actually you
313
00:17:10,089 --> 00:17:11,839
can come up with
some crazy paradoxes.
314
00:17:11,839 --> 00:17:13,609
So since we don't
have that much time,
315
00:17:13,609 --> 00:17:17,339
I will actually talk to you
about why, in some instances,
316
00:17:17,339 --> 00:17:19,819
it's actually a good idea to
think of Bayesian confidence
317
00:17:19,819 --> 00:17:22,369
intervals rather than
frequentist ones.
318
00:17:22,369 --> 00:17:25,609
So before we go into
more details about what
319
00:17:25,609 --> 00:17:27,440
those Bayesian
confidence intervals are,
320
00:17:27,440 --> 00:17:29,570
let's remind
ourselves what does it
321
00:17:29,570 --> 00:17:33,110
mean to have a frequentist
confidence interval?
322
00:17:33,110 --> 00:17:33,610
Right?
323
00:17:46,460 --> 00:17:46,960
OK.
324
00:17:46,960 --> 00:17:49,690
So when I have a frequentist
confidence interval,
325
00:17:49,690 --> 00:17:59,290
let's say something like x bar n
minus 1.96 sigma over root n
326
00:17:59,290 --> 00:18:06,136
and x bar n plus 1.96
sigma over root n,
327
00:18:06,136 --> 00:18:07,510
so that's the
confidence interval
328
00:18:07,510 --> 00:18:10,720
that you get for the
mean of some Gaussian
329
00:18:10,720 --> 00:18:16,390
with known variance
equal to sigma squared, OK.
330
00:18:16,390 --> 00:18:18,460
So what we know is that
the meaning of this
331
00:18:18,460 --> 00:18:20,410
is the probability
that theta belongs
332
00:18:20,410 --> 00:18:25,870
to this is equal to 95%, right?
333
00:18:25,870 --> 00:18:27,340
And this, more
generally, you can
334
00:18:27,340 --> 00:18:29,620
think of being q alpha over 2.
335
00:18:29,620 --> 00:18:33,040
And what you're going to get
is 1 minus alpha here, OK?
336
00:18:33,040 --> 00:18:34,280
So what does it mean here?
337
00:18:34,280 --> 00:18:37,480
Well, it looks very much
like what we have here,
338
00:18:37,480 --> 00:18:39,970
except that we're not
conditioning on x1 xn.
339
00:18:39,970 --> 00:18:40,720
And we should not.
340
00:18:40,720 --> 00:18:43,830
Because there was a question
like that in the midterm--
341
00:18:43,830 --> 00:18:47,590
if I condition on x1 xn, this
probability is either 0 or 1.
342
00:18:47,590 --> 00:18:48,610
OK?
343
00:18:48,610 --> 00:18:50,170
Because once I
condition-- so here,
344
00:18:50,170 --> 00:18:52,170
this probability, actually,
here is with respect
345
00:18:52,170 --> 00:18:55,010
to the randomness in x1 xn.
346
00:18:55,010 --> 00:18:56,040
So if I condition--
347
00:18:58,860 --> 00:19:04,890
so let's build this thing,
r freq, for frequentist.
348
00:19:07,830 --> 00:19:11,930
Well, given x1 xn--
349
00:19:11,930 --> 00:19:13,940
and actually, I don't
need to know x1 xn really.
350
00:19:13,940 --> 00:19:16,420
What I need to know
is what xn bar is.
351
00:19:16,420 --> 00:19:18,140
Well, this thing now is what?
352
00:19:18,140 --> 00:19:22,200
It's 1, if theta is
in r, and it's 0,
353
00:19:22,200 --> 00:19:27,110
if theta is not in r, right?
354
00:19:27,110 --> 00:19:28,010
That's all there is.
355
00:19:28,010 --> 00:19:29,900
This is a deterministic
confidence interval,
356
00:19:29,900 --> 00:19:32,360
once I condition x1 xn.
357
00:19:32,360 --> 00:19:33,270
So I have a number.
358
00:19:33,270 --> 00:19:35,720
The average is maybe 3.
359
00:19:35,720 --> 00:19:36,790
And so I get 3.
360
00:19:36,790 --> 00:19:41,900
Either theta is between 3
minus 0.5 and 3 plus 0.5,
361
00:19:41,900 --> 00:19:42,840
or it's not.
362
00:19:42,840 --> 00:19:44,000
And so there's basically--
363
00:19:44,000 --> 00:19:45,470
I mean, I write
it as probability,
364
00:19:45,470 --> 00:19:47,303
but it's really not a
probabilistic statement.
365
00:19:47,303 --> 00:19:49,160
It's either it's true or not.
366
00:19:49,160 --> 00:19:50,240
Agreed?
367
00:19:50,240 --> 00:19:52,580
So what does it mean to have
a frequentist confidence
368
00:19:52,580 --> 00:19:53,550
interval?
369
00:19:53,550 --> 00:19:55,270
It means that if I were--
370
00:19:55,270 --> 00:19:58,660
and here is where the word
frequentist comes from--
371
00:19:58,660 --> 00:20:02,840
it says that if I repeat this
experiment over and over,
372
00:20:02,840 --> 00:20:06,700
meaning that on Monday, I
collect a sample of size n,
373
00:20:06,700 --> 00:20:09,260
and I build a
confidence interval,
374
00:20:09,260 --> 00:20:12,260
and then on Tuesday, I collect
another sample of size n,
375
00:20:12,260 --> 00:20:13,890
and I build a
confidence interval,
376
00:20:13,890 --> 00:20:17,000
and on Wednesday, I do this
again and again, what's going
377
00:20:17,000 --> 00:20:18,510
to happen is the following.
378
00:20:18,510 --> 00:20:21,530
I'm going to have my true
theta that lives here.
379
00:20:21,530 --> 00:20:23,900
And then on Monday, this
is the confidence interval
380
00:20:23,900 --> 00:20:25,470
that I build.
381
00:20:25,470 --> 00:20:28,802
OK, so this is the real line.
382
00:20:28,802 --> 00:20:31,260
The true theta is here, and
this is the confidence interval
383
00:20:31,260 --> 00:20:32,300
I build on Monday.
384
00:20:32,300 --> 00:20:32,800
All right?
385
00:20:32,800 --> 00:20:37,530
So x bar was here, and this
is my confidence interval.
386
00:20:37,530 --> 00:20:41,540
On Tuesday, I build this
confidence interval maybe.
387
00:20:41,540 --> 00:20:44,640
x bar was closer to
theta, but smaller.
388
00:20:44,640 --> 00:20:49,820
But then on Wednesday, I build
this confidence interval.
389
00:20:49,820 --> 00:20:50,880
I'm not here.
390
00:20:50,880 --> 00:20:51,920
It's not in there.
391
00:20:51,920 --> 00:20:53,681
And that's this case.
392
00:20:53,681 --> 00:20:54,180
Right?
393
00:20:54,180 --> 00:20:56,100
It happens that it's
just not in there.
394
00:20:56,100 --> 00:20:57,930
And then on Thursday,
I build another one.
395
00:20:57,930 --> 00:21:01,300
I almost miss it, but
I'm in there, et cetera.
396
00:21:01,300 --> 00:21:04,430
Maybe here-- here, I miss again.
397
00:21:04,430 --> 00:21:07,490
And so what it means to have a
confidence interval-- so what
398
00:21:07,490 --> 00:21:12,131
does it mean to have a
confidence interval at 95%?
399
00:21:12,131 --> 00:21:15,610
AUDIENCE: [INAUDIBLE]
400
00:21:15,610 --> 00:21:18,150
PHILIPPE RIGOLLET: Yeah, so
it means that, if I repeat this,
401
00:21:18,150 --> 00:21:19,800
the frequency of times--
402
00:21:19,800 --> 00:21:21,720
hence, the word
frequentist-- at which
403
00:21:21,720 --> 00:21:24,150
I'm actually going
to overlap that,
404
00:21:24,150 --> 00:21:26,910
I'm actually going to
contain theta, should be 95%.
405
00:21:26,910 --> 00:21:28,890
That's what frequentist means.
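That repeated-experiment frequency can be checked directly by simulation (an illustrative sketch with made-up values for theta, sigma, and n; not from the lecture): each "day" we draw a fresh sample, build the known-variance Gaussian interval, and record whether it caught the true theta.

```python
import math
import random

random.seed(0)                     # hypothetical example values below
theta, sigma, n = 3.0, 1.0, 50
half = 1.96 * sigma / math.sqrt(n)  # half-width of the 95% interval
trials, covered = 2000, 0
for _ in range(trials):
    # One "day": a sample of size n and its interval xbar +/- half.
    xbar = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    if xbar - half <= theta <= xbar + half:
        covered += 1
print(covered / trials)  # the frequency of coverage, close to 0.95
```

On any single day the interval either contains theta or it doesn't; only the long-run frequency is pinned at 95%.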
406
00:21:28,890 --> 00:21:31,740
So it's just a matter
of trusting that.
407
00:21:31,740 --> 00:21:35,690
So on one given thing, one
given realization of your data,
408
00:21:35,690 --> 00:21:36,970
it's not telling you anything.
409
00:21:36,970 --> 00:21:38,460
[INAUDIBLE] it's there or not.
410
00:21:38,460 --> 00:21:42,530
So it's not really
something that's actually
411
00:21:42,530 --> 00:21:46,430
something that assesses the
confidence of your decision,
412
00:21:46,430 --> 00:21:48,230
such as whether theta is in there or not.
413
00:21:48,230 --> 00:21:50,360
It's something that
assesses the confidence
414
00:21:50,360 --> 00:21:52,410
you have in the method
that you're using.
415
00:21:52,410 --> 00:21:54,170
If you were to repeat
it over and again,
416
00:21:54,170 --> 00:21:56,470
it'd be the same thing.
417
00:21:56,470 --> 00:21:58,850
It would be 95% of the
time correct, right?
418
00:21:58,850 --> 00:22:02,570
So for example, we know
that we could build a test.
419
00:22:02,570 --> 00:22:04,940
So it's pretty clear
that you can actually
420
00:22:04,940 --> 00:22:09,020
build a test for whether
theta is equal to theta naught
421
00:22:09,020 --> 00:22:10,705
or not equal to
theta naught, by just
422
00:22:10,705 --> 00:22:13,080
checking whether theta naught
is in a confidence interval
423
00:22:13,080 --> 00:22:13,780
or not.
424
00:22:13,780 --> 00:22:15,530
And what it means is
that, if you actually
425
00:22:15,530 --> 00:22:21,170
are doing those tests at 5%,
that means that 5% of the time,
426
00:22:21,170 --> 00:22:23,440
if you do this over and
again, 5% of the time
427
00:22:23,440 --> 00:22:24,610
you're going to be wrong.
428
00:22:24,610 --> 00:22:27,640
I mentioned my wife
does market research.
429
00:22:27,640 --> 00:22:31,930
And she does maybe, I don't
know, 100,000 tests a year.
430
00:22:31,930 --> 00:22:34,210
And if they do
all of them at 1%,
431
00:22:34,210 --> 00:22:37,550
then it means that 1% of the
time, which is a lot of time,
432
00:22:37,550 --> 00:22:38,050
right?
433
00:22:38,050 --> 00:22:40,840
When you do 100,000 a
year, it's 1,000 of them
434
00:22:40,840 --> 00:22:41,755
are actually wrong.
435
00:22:41,755 --> 00:22:44,611
OK, I mean, she's
actually hedging
436
00:22:44,611 --> 00:22:47,110
against the fact that 1% of
them are going to be wrong.
437
00:22:47,110 --> 00:22:49,109
That's 1,000 of them that
are going to be wrong.
438
00:22:49,109 --> 00:22:52,890
Just like, if you do this
100,000 times at 95%,
439
00:22:52,890 --> 00:22:54,910
5,000 of those guys
are actually not going
440
00:22:54,910 --> 00:22:56,360
to be the correct ones.
441
00:22:56,360 --> 00:22:56,860
OK?
442
00:22:56,860 --> 00:22:58,600
So I mean, it's kind of scary.
443
00:22:58,600 --> 00:23:01,300
But that's the way it is.
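The multiple-testing arithmetic is easy to check by simulation (a minimal sketch, assuming every null hypothesis is true and the tests are independent):

```python
import random

random.seed(0)

n_tests = 100_000
alpha = 0.01

# A test on a true null rejects (is "wrong") with probability alpha
false_rejections = sum(random.random() < alpha for _ in range(n_tests))

expected = alpha * n_tests  # 1,000 expected false rejections
```

At level 5% instead, the same count would hover around 5,000, as in the lecture.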
444
00:23:01,300 --> 00:23:03,730
So that's what the frequentist
interpretation of this is.
445
00:23:03,730 --> 00:23:07,720
Now, as I mentioned, when we
started this Bayesian chapter,
446
00:23:07,720 --> 00:23:10,930
I said, Bayesian
statistics converge to--
447
00:23:10,930 --> 00:23:14,800
I mean, Bayesian decisions
and Bayesian methods converge
448
00:23:14,800 --> 00:23:16,510
to frequentist methods.
449
00:23:16,510 --> 00:23:18,590
When the sample size
is large enough,
450
00:23:18,590 --> 00:23:20,610
they lead to the same decisions.
451
00:23:20,610 --> 00:23:22,930
And in general, they
need not be the same,
452
00:23:22,930 --> 00:23:24,970
but they tend to
actually, when the sample
453
00:23:24,970 --> 00:23:27,830
size is large enough, to
have the same behavior.
454
00:23:27,830 --> 00:23:30,850
Think about, for
example, the posterior
455
00:23:30,850 --> 00:23:34,450
that you have
in the Gaussian case, right?
456
00:23:34,450 --> 00:23:36,420
We said that, in
the Gaussian case,
457
00:23:36,420 --> 00:23:38,020
what you're going
to see is that it's
458
00:23:38,020 --> 00:23:40,240
as if you had an extra
observation which
459
00:23:40,240 --> 00:23:43,230
was essentially
given by your prior.
460
00:23:43,230 --> 00:23:44,570
OK?
461
00:23:44,570 --> 00:23:50,830
And now, what's going to happen
is that, when this is just one
462
00:23:50,830 --> 00:23:53,470
observation among n
plus 1, it's really
463
00:23:53,470 --> 00:23:55,720
going to be totally
drowned out, and you
464
00:23:55,720 --> 00:23:58,390
won't see it when the
sample size grows larger.
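For the Gaussian case, the posterior mean under a conjugate N(mu0, tau2) prior makes the "one extra observation" effect explicit; when tau2 equals sigma2, the prior counts as a single data point and gets drowned out as n grows. A sketch with made-up numbers:

```python
def posterior_mean(xbar, n, sigma2, mu0, tau2):
    # N(mu0, tau2) prior on theta, X_i ~ N(theta, sigma2) iid:
    # posterior mean = (n*xbar/sigma2 + mu0/tau2) / (n/sigma2 + 1/tau2)
    return (n * xbar / sigma2 + mu0 / tau2) / (n / sigma2 + 1 / tau2)

# With tau2 == sigma2, the prior acts like one extra observation at mu0
small = posterior_mean(xbar=10.0, n=1, sigma2=1.0, mu0=0.0, tau2=1.0)
large = posterior_mean(xbar=10.0, n=1000, sigma2=1.0, mu0=0.0, tau2=1.0)
```

With one observation the prior pulls the estimate halfway to mu0; with a thousand, it is almost invisible.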
465
00:23:58,390 --> 00:24:00,400
So Bayesian methods are
particularly useful when
466
00:24:00,400 --> 00:24:02,190
you have a small sample size.
467
00:24:02,190 --> 00:24:05,680
And when you have a small sample
size, the effect of the prior
468
00:24:05,680 --> 00:24:06,980
is going to be bigger.
469
00:24:06,980 --> 00:24:08,950
But most importantly,
you're not going
470
00:24:08,950 --> 00:24:10,810
to have to repeat this
thing over and again.
471
00:24:10,810 --> 00:24:11,830
And you're going
to have a meaning.
472
00:24:11,830 --> 00:24:13,180
You're going to have
to have something
473
00:24:13,180 --> 00:24:15,138
that has a meaning for
this particular data set
474
00:24:15,138 --> 00:24:16,150
that you have.
475
00:24:16,150 --> 00:24:19,900
When I said that the probability
that theta belongs to r--
476
00:24:19,900 --> 00:24:22,810
and here, I'm going to specify
the fact that it's a Bayesian
477
00:24:22,810 --> 00:24:24,740
confidence region,
like this one--
478
00:24:24,740 --> 00:24:27,490
this is actually
conditionally on the data
479
00:24:27,490 --> 00:24:29,490
that you've collected.
480
00:24:29,490 --> 00:24:32,110
It says, given this data, given
the points that you have--
481
00:24:32,110 --> 00:24:34,540
just put in some numbers,
if you want, in there--
482
00:24:34,540 --> 00:24:36,460
it's actually telling
you the probability
483
00:24:36,460 --> 00:24:39,430
that theta belongs to
this Bayesian thing,
484
00:24:39,430 --> 00:24:41,750
to this Bayesian
confidence region.
485
00:24:41,750 --> 00:24:44,230
Here, since I have
conditioned on x1 xn,
486
00:24:44,230 --> 00:24:46,840
this probability is really
just with respect to theta
487
00:24:46,840 --> 00:24:51,660
drawn from the prior, right?
488
00:24:51,660 --> 00:24:54,150
And so now, it has a
slightly different meaning.
489
00:24:54,150 --> 00:24:57,170
It's just telling
you that when--
490
00:24:57,170 --> 00:24:59,570
it's really making a
statement about where
491
00:24:59,570 --> 00:25:03,870
the regions of high probability
of your posterior are.
492
00:25:03,870 --> 00:25:05,050
Now, why is that useful?
493
00:25:05,050 --> 00:25:11,600
Well, there's actually
an interesting story that
494
00:25:11,600 --> 00:25:13,980
goes behind Bayesian methods.
495
00:25:13,980 --> 00:25:17,240
Does anybody know the story of
the USS, I think it's Scorpion?
496
00:25:17,240 --> 00:25:18,610
Do you know the story?
497
00:25:18,610 --> 00:25:22,770
So that was an American
vessel that disappeared.
498
00:25:22,770 --> 00:25:25,490
I think it was close to
Bermuda or something.
499
00:25:25,490 --> 00:25:28,790
But you can tell the story
of the Malaysian Airlines,
500
00:25:28,790 --> 00:25:31,640
except that I don't think
it's such a successful story.
501
00:25:31,640 --> 00:25:33,770
But the idea was
essentially, we're
502
00:25:33,770 --> 00:25:36,050
trying to find where
this thing happened.
503
00:25:36,050 --> 00:25:39,800
And of course, this
is a one-time thing.
504
00:25:39,800 --> 00:25:41,686
You need something
that works once.
505
00:25:41,686 --> 00:25:44,060
You need something that works
for this particular vessel.
506
00:25:44,060 --> 00:25:46,601
And you don't care, if you go
to the Navy, and you tell them,
507
00:25:46,601 --> 00:25:48,320
well, here's a method.
508
00:25:48,320 --> 00:25:51,730
And for 95 out of 100 vessels
that you're going to lose,
509
00:25:51,730 --> 00:25:53,350
we're going to be
able to find it.
510
00:25:53,350 --> 00:25:57,230
And they want this to work
for this particular one.
511
00:25:57,230 --> 00:25:59,750
And so they were
looking, and they were
512
00:25:59,750 --> 00:26:02,200
diving in different places.
513
00:26:02,200 --> 00:26:04,710
And suddenly, they
brought in this guy.
514
00:26:04,710 --> 00:26:05,460
I forget his name.
515
00:26:05,460 --> 00:26:08,960
I mean, there's a whole story
about this on Wikipedia.
516
00:26:08,960 --> 00:26:10,612
And he started
collecting the data
517
00:26:10,612 --> 00:26:13,070
that they had from different
dives and maybe from currents.
518
00:26:13,070 --> 00:26:14,569
And he started to
put everything in.
519
00:26:14,569 --> 00:26:17,540
And he said, OK, what is
the posterior distribution
520
00:26:17,540 --> 00:26:21,140
of the location of the
vessel, given all the things
521
00:26:21,140 --> 00:26:22,340
that I've seen?
522
00:26:22,340 --> 00:26:23,390
And what have you seen?
523
00:26:23,390 --> 00:26:25,280
Well, you've seen that it's
not here, it's not there,
524
00:26:25,280 --> 00:26:26,071
and it's not there.
525
00:26:26,071 --> 00:26:29,360
And you've also seen that the
currents were going that way,
526
00:26:29,360 --> 00:26:30,786
and the winds were
going that way.
527
00:26:30,786 --> 00:26:32,660
And you can actually
put in some modeling
528
00:26:32,660 --> 00:26:33,890
to understand this.
529
00:26:33,890 --> 00:26:37,940
Now, given this, for this
particular data that you have,
530
00:26:37,940 --> 00:26:41,420
you can actually think of having
a two-dimensional density that
531
00:26:41,420 --> 00:26:44,650
tells you where it's more
likely that the vessel is.
532
00:26:44,650 --> 00:26:46,400
And where are you going
to be looking for?
533
00:26:46,400 --> 00:26:48,097
Well, if it's a
multimodal distribution,
534
00:26:48,097 --> 00:26:50,180
you're just going to go
to the highest mode first,
535
00:26:50,180 --> 00:26:52,190
because that's where it's
the most likely to be.
536
00:26:52,190 --> 00:26:53,600
And maybe it's not
there, so you're just
537
00:26:53,600 --> 00:26:55,250
going to update your
posterior, based on the fact
538
00:26:55,250 --> 00:26:56,791
that it's not there,
and do it again.
539
00:26:56,791 --> 00:26:59,270
And actually, after
two dives, I think,
540
00:26:59,270 --> 00:27:01,010
he actually found the thing.
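The search procedure just described, dive at the posterior mode and, if the wreck is not found, update and renormalize, can be sketched as a grid update (the cells, prior weights, and detection probability d below are all hypothetical):

```python
# Prior over four candidate cells, e.g. from currents/winds modeling (made up)
probs = [0.35, 0.30, 0.20, 0.15]
d = 0.8  # probability a dive detects the wreck when searching the right cell

def failed_search_update(probs, k, d):
    # Bayes update after an unsuccessful dive in cell k:
    # P(cell i | miss) is proportional to P(miss | cell i) * P(cell i)
    miss = [p * (1 - d) if i == k else p for i, p in enumerate(probs)]
    total = sum(miss)
    return [m / total for m in miss]

k = probs.index(max(probs))      # dive at the posterior mode first
probs = failed_search_update(probs, k, d)
```

After the failed dive, the searched cell's mass drops and a new mode emerges, which is where the next dive goes.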
541
00:27:01,010 --> 00:27:03,122
And that's exactly where
Bayesian statistics
542
00:27:03,122 --> 00:27:03,830
start to kick in.
543
00:27:03,830 --> 00:27:08,570
Because you put a lot of
knowledge into your model,
544
00:27:08,570 --> 00:27:11,340
but you also can actually factor
in a bunch of information,
545
00:27:11,340 --> 00:27:11,840
right?
546
00:27:11,840 --> 00:27:13,460
The model, he had
to build a model
547
00:27:13,460 --> 00:27:17,360
that was actually taking into
account the currents and the winds.
548
00:27:17,360 --> 00:27:20,780
And what you can have
as a guarantee is that,
549
00:27:20,780 --> 00:27:22,610
when you talk about
the probability
550
00:27:22,610 --> 00:27:27,346
that this vessel is
in this location,
551
00:27:27,346 --> 00:27:28,970
given what you've
observed in the past,
552
00:27:28,970 --> 00:27:30,140
it actually has some sense.
553
00:27:30,140 --> 00:27:34,610
Whereas, if you were to
use a frequentist approach,
554
00:27:34,610 --> 00:27:35,810
then there's no probability.
555
00:27:35,810 --> 00:27:38,660
Either it's underneath this
position or it's not, right?
556
00:27:38,660 --> 00:27:41,520
So that's actually where
it starts to make sense.
557
00:27:41,520 --> 00:27:43,370
And so you can
actually build this.
558
00:27:43,370 --> 00:27:44,930
And there's actually
a lot of methods
559
00:27:44,930 --> 00:27:47,300
for search that
560
00:27:47,300 --> 00:27:48,979
are based on Bayesian methods.
561
00:27:48,979 --> 00:27:50,520
I think, for example,
the Higgs boson search
562
00:27:50,520 --> 00:27:51,920
was based on a lot
of Bayesian methods,
563
00:27:51,920 --> 00:27:54,050
because this is something
you need to find [INAUDIBLE],,
564
00:27:54,050 --> 00:27:54,549
right?
565
00:27:54,549 --> 00:27:57,330
I mean, there was a lot of
prior that has to be built in.
566
00:27:57,330 --> 00:27:57,830
OK.
567
00:27:57,830 --> 00:27:59,621
So now, you build this
confidence interval.
568
00:27:59,621 --> 00:28:02,300
And the nicest way to do
it is to use level sets.
569
00:28:02,300 --> 00:28:05,210
But again, just like for
Gaussians, I mean, if I had,
570
00:28:05,210 --> 00:28:12,290
even in the Gaussian
case, I decided
571
00:28:12,290 --> 00:28:16,110
to go at x bar plus
or minus something,
572
00:28:16,110 --> 00:28:19,500
but I could go at something
that's completely asymmetric.
573
00:28:19,500 --> 00:28:21,467
So what's happening is
that here, this method
574
00:28:21,467 --> 00:28:23,550
guarantees that you're
going to have the narrowest
575
00:28:23,550 --> 00:28:24,800
possible confidence intervals.
576
00:28:24,800 --> 00:28:27,480
That's essentially what
it's telling you, OK?
577
00:28:27,480 --> 00:28:31,890
Because every time I'm choosing
a point, starting from here,
578
00:28:31,890 --> 00:28:36,170
I'm actually putting as much
area under the curve as I can.
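Choosing the region by level sets of the posterior density, so that each point added carries as much area under the curve as possible, is what makes the region narrowest. A discretized sketch, with a standard Gaussian standing in for the posterior:

```python
import numpy as np

# Discretized posterior density on a grid (standard Gaussian as an example)
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
dens = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Lower the level until the set {dens >= level} captures 95% of the mass:
# equivalently, add grid points in decreasing order of density
order = np.argsort(dens)[::-1]
mass = np.cumsum(dens[order]) * dx
keep = order[: int(np.searchsorted(mass, 0.95)) + 1]
hpd = (x[keep].min(), x[keep].max())  # close to (-1.96, 1.96) here
```

For a symmetric unimodal posterior this recovers the familiar symmetric interval; for a skewed or multimodal posterior the level-set region is genuinely asymmetric, or not an interval at all.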
579
00:28:36,170 --> 00:28:38,660
All right.
580
00:28:38,660 --> 00:28:41,737
So those are called Bayesian
confidence regions.
581
00:28:41,737 --> 00:28:43,320
Oh yeah, and I
promised you that we're
582
00:28:43,320 --> 00:28:46,500
going to work on some
example that actually
583
00:28:46,500 --> 00:28:50,940
gives a meaning to what I just
told you, with actual numbers.
584
00:28:50,940 --> 00:28:56,790
So this is something that's
taken from Wasserman's book.
585
00:28:56,790 --> 00:29:01,140
And also, it's
coming from a paper,
586
00:29:01,140 --> 00:29:03,780
from a stats paper,
from [? Wolpert ?] and I
587
00:29:03,780 --> 00:29:05,760
don't know who, from the '80s.
588
00:29:05,760 --> 00:29:07,760
And essentially,
this is how it works.
589
00:29:07,760 --> 00:29:10,680
So assume that you have
n equals 2 observations.
590
00:29:14,320 --> 00:29:18,780
And you have y1, so those
observations are y1--
591
00:29:18,780 --> 00:29:20,680
no, sorry, let's
call them x1, which
592
00:29:20,680 --> 00:29:26,000
is theta, plus epsilon 1 and x2,
which is theta plus epsilon 2,
593
00:29:26,000 --> 00:29:31,060
where epsilon 1 and
epsilon 2 are iid.
594
00:29:31,060 --> 00:29:33,280
And the probability
that epsilon i is equal
595
00:29:33,280 --> 00:29:35,110
to plus 1 is equal
to the probability
596
00:29:35,110 --> 00:29:38,440
that epsilon i is equal to
minus 1 is equal to 1/2.
597
00:29:38,440 --> 00:29:44,550
OK, so it's just the uniform
sign plus minus 1, OK?
598
00:29:44,550 --> 00:29:46,590
Now, let's think
about-- so you're trying
599
00:29:46,590 --> 00:29:47,970
to do some inference on theta.
600
00:29:47,970 --> 00:29:50,261
Maybe you actually want to
find some inference on theta
601
00:29:50,261 --> 00:29:51,825
that's actually based on--
602
00:29:51,825 --> 00:29:55,660
and that's based only
on the x1 and x2.
603
00:29:55,660 --> 00:29:56,430
OK?
604
00:29:56,430 --> 00:29:58,750
So I'm going to actually
build a confidence interval.
605
00:29:58,750 --> 00:30:01,110
But what I really
want to build is a--
606
00:30:03,594 --> 00:30:05,010
but let's start
thinking about how
607
00:30:05,010 --> 00:30:07,780
I would find an estimator
for those two things.
608
00:30:07,780 --> 00:30:09,970
Well, what values am I
going to be getting, right?
609
00:30:09,970 --> 00:30:13,750
So I'm going to get either
theta plus 1 or theta minus 1.
610
00:30:13,750 --> 00:30:15,610
And actually, I can
get basically four
611
00:30:15,610 --> 00:30:19,260
different observations, right?
612
00:30:19,260 --> 00:30:21,516
Sorry, four different
pairs of observations--
613
00:30:30,760 --> 00:30:32,410
theta plus 1, theta plus 1; theta plus 1, theta minus 1;
theta minus 1, theta plus 1; or theta minus 1, theta minus 1.
614
00:30:32,410 --> 00:30:33,170
Agreed?
615
00:30:33,170 --> 00:30:37,340
Those are the four possible
observations that I can get.
616
00:30:37,340 --> 00:30:38,970
Agreed?
617
00:30:38,970 --> 00:30:42,924
Either they're both equal to
plus 1, both equal to minus 1,
618
00:30:42,924 --> 00:30:44,340
or one of the two
is equal to plus
619
00:30:44,340 --> 00:30:46,950
1, the other one to
minus 1, or the epsilons.
620
00:30:46,950 --> 00:30:47,580
OK.
621
00:30:47,580 --> 00:30:49,730
So those are the four
observations I can get.
622
00:30:49,730 --> 00:30:56,010
So in particular, if
they take the same value,
623
00:30:56,010 --> 00:30:59,390
then you know it's either
theta plus 1 or theta minus 1,
624
00:30:59,390 --> 00:31:02,100
and if they take a different
value, I know one of them
625
00:31:02,100 --> 00:31:04,555
is theta plus 1, and one
is actually theta minus 1.
626
00:31:04,555 --> 00:31:07,180
So in particular, if I take the
average of those two guys, when
627
00:31:07,180 --> 00:31:09,138
they take different
values, I know I'm actually
628
00:31:09,138 --> 00:31:10,850
getting theta right.
629
00:31:10,850 --> 00:31:14,441
So let's build a
confidence region.
630
00:31:14,441 --> 00:31:16,940
OK, so I'm actually going to
take a confidence region, which
631
00:31:16,940 --> 00:31:18,810
is just a singleton.
632
00:31:21,662 --> 00:31:23,120
And I'm going to
say the following.
633
00:31:23,120 --> 00:31:32,460
Well, if x1 is equal to x2, I'm
just going to take x1 minus 1,
634
00:31:32,460 --> 00:31:33,320
OK?
635
00:31:33,320 --> 00:31:34,790
So I'm just saying,
well, I'm never
636
00:31:34,790 --> 00:31:37,310
going to be able to resolve
whether it's plus 1 or minus 1
637
00:31:37,310 --> 00:31:38,864
that actually gives
me the best one,
638
00:31:38,864 --> 00:31:41,030
so I'm just going to take
a dive and say, well, it's
639
00:31:41,030 --> 00:31:42,594
just plus 1.
640
00:31:42,594 --> 00:31:44,860
OK?
641
00:31:44,860 --> 00:31:47,710
And then, if they're
different, then here,
642
00:31:47,710 --> 00:31:50,830
I can do much better.
643
00:31:50,830 --> 00:31:52,929
I'm going to actually
just take the average.
644
00:31:56,282 --> 00:31:58,200
OK?
645
00:31:58,200 --> 00:32:08,360
Now, what I claim is that
this is a confidence region--
646
00:32:08,360 --> 00:32:10,370
and by default, when
I don't mention it,
647
00:32:10,370 --> 00:32:16,190
this is a frequentist
confidence region--
648
00:32:16,190 --> 00:32:18,740
at level 75%.
649
00:32:21,050 --> 00:32:21,550
OK?
650
00:32:21,550 --> 00:32:23,100
So let's just check that.
651
00:32:23,100 --> 00:32:24,685
To check that this
is correct, I need
652
00:32:24,685 --> 00:32:27,460
to check that the probability
under the realization of x1
653
00:32:27,460 --> 00:32:30,940
and x2, that theta belongs,
is one of those two guys,
654
00:32:30,940 --> 00:32:33,291
is actually equal to 0.75.
655
00:32:33,291 --> 00:32:33,790
Yes?
656
00:32:33,790 --> 00:32:36,529
AUDIENCE: What are
the [INAUDIBLE]
657
00:32:36,529 --> 00:32:39,070
PHILIPPE RIGOLLET: Well, it's
just the frequentist confidence
658
00:32:39,070 --> 00:32:41,842
interval that does not
need to be an interval.
659
00:32:41,842 --> 00:32:44,050
Actually, in this case, it's
going to be an interval.
660
00:32:44,050 --> 00:32:46,602
But that's just what it means.
661
00:32:46,602 --> 00:32:50,055
Yeah, region for Bayesian
was just because--
662
00:32:50,055 --> 00:32:51,430
I mean, the
confidence intervals,
663
00:32:51,430 --> 00:32:53,320
when we're frequentist,
we tend to make them
664
00:32:53,320 --> 00:32:54,606
intervals, because we want--
665
00:32:54,606 --> 00:32:56,980
but when you're Bayesian, and
you're doing this level set
666
00:32:56,980 --> 00:32:58,180
thing, you cannot
really guarantee,
667
00:32:58,180 --> 00:33:00,460
unless its [INAUDIBLE] is
going to be an interval.
668
00:33:00,460 --> 00:33:02,720
So region is just a way to
not have to say interval,
669
00:33:02,720 --> 00:33:03,430
in case it's not.
670
00:33:06,080 --> 00:33:06,640
OK.
671
00:33:06,640 --> 00:33:08,490
So I have this thing.
672
00:33:08,490 --> 00:33:11,440
So what I need to check is
the probability that theta
673
00:33:11,440 --> 00:33:13,000
is in one of those
two things, right?
674
00:33:13,000 --> 00:33:16,060
So what I need to find is
the probability that theta
675
00:33:16,060 --> 00:33:24,220
is in [INAUDIBLE] Well, x1 minus
1 and x1 is not equal to x2.
676
00:33:24,220 --> 00:33:26,840
And those are disjoint events,
so it's plus the probability
677
00:33:26,840 --> 00:33:35,980
that theta is in x1
plus x2 over 2 and x1--
678
00:33:35,980 --> 00:33:37,580
sorry, that's equal.
679
00:33:37,580 --> 00:33:39,700
That's different.
680
00:33:39,700 --> 00:33:40,200
OK.
681
00:33:40,200 --> 00:33:42,780
And OK, just before we actually
finish the computation,
682
00:33:42,780 --> 00:33:44,730
why do I have 75%?
683
00:33:44,730 --> 00:33:46,920
75% is 3/4.
684
00:33:46,920 --> 00:33:48,930
So it means that
we have four cases.
685
00:33:48,930 --> 00:33:52,020
And essentially, I did
not account for one case.
686
00:33:52,020 --> 00:33:52,650
And it's true.
687
00:33:52,650 --> 00:33:56,040
I did not account
for this case, when
688
00:33:56,040 --> 00:34:01,060
the both of the epsilon
i's are equal to minus 1.
689
00:34:01,060 --> 00:34:01,560
Right?
690
00:34:01,560 --> 00:34:03,393
So this is essentially
the one I'm not going
691
00:34:03,393 --> 00:34:04,620
to be able to account for.
692
00:34:04,620 --> 00:34:06,040
And so we'll see
that in a second.
693
00:34:06,040 --> 00:34:09,310
So in this case, we know
that everything goes great.
694
00:34:09,310 --> 00:34:09,810
Right?
695
00:34:09,810 --> 00:34:11,080
So in this case, this is--
696
00:34:11,080 --> 00:34:11,580
OK.
697
00:34:11,580 --> 00:34:13,831
Well, let's just start
from the first line.
698
00:34:13,831 --> 00:34:15,330
So the first line
is the probability
699
00:34:15,330 --> 00:34:20,290
that theta is equal to x1 minus
1 and those two are equal.
700
00:34:20,290 --> 00:34:28,440
So this is the probability
that theta is equal to--
701
00:34:28,440 --> 00:34:36,260
well, this is theta
plus epsilon 1 minus 1.
702
00:34:36,260 --> 00:34:43,409
And epsilon 1 is equal
to epsilon 2, right?
703
00:34:43,409 --> 00:34:45,290
Because I can remove
the theta from here,
704
00:34:45,290 --> 00:34:47,780
and I can actually remove
the theta from here,
705
00:34:47,780 --> 00:34:50,765
so that this guy here is
just epsilon 1 is equal to 1.
706
00:34:50,765 --> 00:34:52,407
So when I intersect
with this guy,
707
00:34:52,407 --> 00:34:54,740
it's actually the same thing
as epsilon 1 is equal to 1,
708
00:34:54,740 --> 00:34:56,530
as well--
709
00:34:56,530 --> 00:34:59,780
epsilon 2 is equal
to 1, as well, OK?
710
00:34:59,780 --> 00:35:05,240
So this first thing is actually
equal to the probability
711
00:35:05,240 --> 00:35:10,780
that epsilon 1 is equal to 1
and epsilon 2 is equal to 1,
712
00:35:10,780 --> 00:35:14,180
which is equal to what?
713
00:35:14,180 --> 00:35:15,570
AUDIENCE: [INAUDIBLE]
714
00:35:15,570 --> 00:35:17,070
PHILIPPE RIGOLLET:
Yeah, 1/4, right?
715
00:35:17,070 --> 00:35:19,870
So that's just the
first case over there.
716
00:35:19,870 --> 00:35:21,020
They're independent.
717
00:35:21,020 --> 00:35:23,420
Now, I still need to
do the second one.
718
00:35:23,420 --> 00:35:24,650
So this case is what?
719
00:35:24,650 --> 00:35:28,890
Well, when those things are
equal, x1 plus x2 over 2
720
00:35:28,890 --> 00:35:29,390
is what?
721
00:35:29,390 --> 00:35:31,920
Well, I get theta
plus epsilon 1 plus epsilon 2 over 2.
722
00:35:31,920 --> 00:35:33,800
So that's just equal
to the probability
723
00:35:33,800 --> 00:35:39,620
that epsilon 1 plus epsilon
2 over 2 is equal to 0
724
00:35:39,620 --> 00:35:43,600
and epsilon 1 is
different from epsilon 2.
725
00:35:43,600 --> 00:35:44,100
Agreed?
726
00:35:46,860 --> 00:35:49,797
I just removed the thetas from
these equations, because I can.
727
00:35:49,797 --> 00:35:51,380
They're just on both
sides every time.
728
00:35:54,810 --> 00:35:55,310
OK.
729
00:35:55,310 --> 00:35:56,482
And so that means what?
730
00:35:56,482 --> 00:35:58,440
That means that the second
part-- so this thing
731
00:35:58,440 --> 00:36:02,120
is actually equal to
1/4 plus the probability
732
00:36:02,120 --> 00:36:05,350
that epsilon 1 plus epsilon
2 over 2 is equal to 0.
733
00:36:05,350 --> 00:36:06,544
I can remove the 2.
734
00:36:06,544 --> 00:36:08,460
So this is just the
probability that one is 1,
735
00:36:08,460 --> 00:36:10,560
and the other one
is minus 1, right?
736
00:36:10,560 --> 00:36:12,510
So that's equal
to the probability
737
00:36:12,510 --> 00:36:17,820
that epsilon 1 is equal to 1 and
epsilon 2 is equal to minus 1
738
00:36:17,820 --> 00:36:21,360
plus the probability that
epsilon 1 is equal to minus 1
739
00:36:21,360 --> 00:36:24,447
and epsilon 2 is
equal to plus 1, OK?
740
00:36:24,447 --> 00:36:25,780
Because they're disjoint events.
741
00:36:25,780 --> 00:36:28,080
So I can break them
into the sum of the two.
742
00:36:28,080 --> 00:36:32,310
And each of those guys is also
one of the atomic parts of it.
743
00:36:32,310 --> 00:36:33,960
It's one of the basic things.
744
00:36:33,960 --> 00:36:36,011
And so each of those
guys has probability 1/4.
745
00:36:36,011 --> 00:36:38,010
And so here, we can really
see that we accounted
746
00:36:38,010 --> 00:36:41,910
for everything, except for the
case when epsilon 1 was equal
747
00:36:41,910 --> 00:36:44,730
to minus 1, and epsilon
2 was equal to minus 1.
748
00:36:44,730 --> 00:36:45,570
So this is 1/4.
749
00:36:45,570 --> 00:36:46,380
This is 1/4.
750
00:36:46,380 --> 00:36:49,850
So the whole thing
is equal to 3/4.
751
00:36:49,850 --> 00:36:56,060
So now, what we have is that
the probability that epsilon 1
752
00:36:56,060 --> 00:36:57,350
is in--
753
00:36:57,350 --> 00:37:03,230
so the probability that theta
belongs to this confidence
754
00:37:03,230 --> 00:37:06,280
region is equal to 3/4.
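The 3/4 coverage just computed can be confirmed by brute force (a minimal simulation sketch; theta is fixed arbitrarily at 3):

```python
import random

random.seed(1)
theta = 3
trials = 100_000
hits = 0

for _ in range(trials):
    x1 = theta + random.choice([-1, 1])
    x2 = theta + random.choice([-1, 1])
    # The region: {x1 - 1} if x1 == x2, else {(x1 + x2) / 2}
    guess = x1 - 1 if x1 == x2 else (x1 + x2) / 2
    hits += (guess == theta)

coverage = hits / trials  # close to 0.75
```

Only the (minus 1, minus 1) draw of the epsilons makes the region miss, which is exactly the one unaccounted case out of four.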
755
00:37:06,280 --> 00:37:07,990
And that's very nice.
756
00:37:07,990 --> 00:37:09,740
But the thing is some
people are sort of--
757
00:37:09,740 --> 00:37:12,650
I mean, it's not super nice
to be able to see this,
758
00:37:12,650 --> 00:37:17,510
because, in a way, I know that,
if I observe x1 and x2 that
759
00:37:17,510 --> 00:37:24,050
are different, I know
for sure that theta,
760
00:37:24,050 --> 00:37:25,882
that I actually got
the right theta, right?
761
00:37:25,882 --> 00:37:27,590
That this confidence
interval is actually
762
00:37:27,590 --> 00:37:31,370
happening with probability 1.
763
00:37:31,370 --> 00:37:34,700
And the problem is
that I do not know--
764
00:37:34,700 --> 00:37:37,640
I cannot make this precise
with the notion of frequentist
765
00:37:37,640 --> 00:37:39,230
confidence intervals.
766
00:37:39,230 --> 00:37:39,730
OK?
767
00:37:39,730 --> 00:37:41,396
Because frequentist
confidence intervals
768
00:37:41,396 --> 00:37:43,810
have to account for the
fact that, in the future,
769
00:37:43,810 --> 00:37:47,810
it might not be the case
that x1 and x2 are different.
770
00:37:47,810 --> 00:37:53,360
So Bayesian confidence
regions, by definition--
771
00:37:53,360 --> 00:37:54,530
well, they're all gone--
772
00:37:54,530 --> 00:37:57,387
but they are conditioned
on the data that I have.
773
00:37:57,387 --> 00:37:58,470
And so that's what I want.
774
00:37:58,470 --> 00:38:00,800
I want to be able to make
this statement conditionally
775
00:38:00,800 --> 00:38:02,640
on the data that I have.
776
00:38:02,640 --> 00:38:03,140
OK.
777
00:38:03,140 --> 00:38:06,450
So if I want to be able
to make this statement,
778
00:38:06,450 --> 00:38:08,450
if I want to build a
Bayesian confidence region,
779
00:38:08,450 --> 00:38:10,520
I'm going to have to
put a prior on theta.
780
00:38:10,520 --> 00:38:12,050
So without loss of generality--
781
00:38:12,050 --> 00:38:16,520
I mean, maybe with--
but let's assume
782
00:38:16,520 --> 00:38:25,980
that pi is a prior on theta.
783
00:38:25,980 --> 00:38:31,540
And let's assume that pi
of j is strictly positive
784
00:38:31,540 --> 00:38:35,920
for all integers
j equal, say, 0--
785
00:38:35,920 --> 00:38:42,770
well, actually, for all j in the
integers, positive or negative.
786
00:38:42,770 --> 00:38:43,270
OK.
787
00:38:43,270 --> 00:38:46,870
So that's a pretty weak
assumption on my prior.
788
00:38:46,870 --> 00:38:52,901
I'm just assuming that
theta is some integer.
789
00:38:52,901 --> 00:38:57,290
And now, let's build our
Bayesian confidence region.
790
00:38:57,290 --> 00:38:59,540
Well, if I want to build a
Bayesian confidence region,
791
00:38:59,540 --> 00:39:01,520
I need to understand what
my posterior is going to be.
792
00:39:01,520 --> 00:39:02,089
OK?
793
00:39:02,089 --> 00:39:04,630
And if I want to understand what
my posterior is going to be,
794
00:39:04,630 --> 00:39:11,530
I actually need to build
a likelihood, right?
795
00:39:11,530 --> 00:39:16,370
So we know that it's the
product of the likelihood
796
00:39:16,370 --> 00:39:20,740
and of the prior divided by--
797
00:39:20,740 --> 00:39:21,240
OK.
798
00:39:31,140 --> 00:39:32,850
So what is my likelihood?
799
00:39:32,850 --> 00:39:35,540
So my likelihood
is the probability
800
00:39:35,540 --> 00:39:40,580
of x1 x2, given theta.
801
00:39:40,580 --> 00:39:41,240
Right?
802
00:39:41,240 --> 00:39:45,010
That's what the
likelihood should be.
803
00:39:45,010 --> 00:39:49,840
And now let's say
that actually, just
804
00:39:49,840 --> 00:39:51,910
to make things a
little simpler, let
805
00:39:51,910 --> 00:40:07,230
us assume that x1 is
equal to, I don't know, 5,
806
00:40:07,230 --> 00:40:11,180
and x2 is equal to 7.
807
00:40:11,180 --> 00:40:12,540
OK?
808
00:40:12,540 --> 00:40:16,350
So I'm not going to take the
case where they're actually
809
00:40:16,350 --> 00:40:19,180
equal to each other, because
I know that, in this case,
810
00:40:19,180 --> 00:40:20,550
x1 and x2 are different.
811
00:40:20,550 --> 00:40:23,970
I know I'm going to actually
nail exactly what theta is,
812
00:40:23,970 --> 00:40:26,540
by looking at the average
of those guys, right?
813
00:40:26,540 --> 00:40:30,630
Here, it must be that
theta is equal to 6.
814
00:40:30,630 --> 00:40:34,491
So what I want is to compute
the likelihood at 5 and 7, OK?
815
00:40:38,419 --> 00:40:42,350
And what is this likelihood?
816
00:40:42,350 --> 00:40:53,950
Well, if theta is
equal to 6, that's
817
00:40:53,950 --> 00:41:00,010
just the probability that I
will observe 5 and 7, right?
818
00:41:00,010 --> 00:41:01,910
So what is the probability
I observe 5 and 7?
819
00:41:04,610 --> 00:41:05,510
Yeah?
820
00:41:05,510 --> 00:41:06,672
1?
821
00:41:06,672 --> 00:41:08,499
AUDIENCE: 1/4.
822
00:41:08,499 --> 00:41:10,040
PHILIPPE RIGOLLET:
That's 1/4, right?
823
00:41:10,040 --> 00:41:15,260
As the probability, I have
minus 1 for the first epsilon 1,
824
00:41:15,260 --> 00:41:15,760
right?
825
00:41:15,760 --> 00:41:17,260
So this is, if theta is 6.
826
00:41:17,260 --> 00:41:23,080
This is the probability that
epsilon 1 is equal to minus 1,
827
00:41:23,080 --> 00:41:28,790
and epsilon 2 is equal to
plus 1, which is equal to 1/4.
828
00:41:28,790 --> 00:41:31,520
So this probability is 1/4.
829
00:41:31,520 --> 00:41:35,560
If theta is different from
6, what is this probability?
830
00:41:35,560 --> 00:41:37,630
So if theta is different
from 6, since we
831
00:41:37,630 --> 00:41:41,210
know that we've only
loaded the integers--
832
00:41:41,210 --> 00:41:46,770
so if theta has to
be another integer,
833
00:41:46,770 --> 00:41:49,214
what is the probability
that I see 5 and 7?
834
00:41:49,214 --> 00:41:49,731
AUDIENCE: 0.
835
00:41:49,731 --> 00:41:50,606
PHILIPPE RIGOLLET: 0.
836
00:41:53,860 --> 00:41:55,190
So that's my likelihood.
837
00:41:55,190 --> 00:42:00,210
And if I want to know
what my posterior is,
838
00:42:00,210 --> 00:42:03,340
well, it's just
pi of theta times
839
00:42:03,340 --> 00:42:10,240
p of 5, 7, given theta, divided
by the sum over all T's, say,
840
00:42:10,240 --> 00:42:11,890
in Z. Right?
841
00:42:11,890 --> 00:42:14,590
So now, I just need to
normalize this thing.
842
00:42:14,590 --> 00:42:21,950
So of pi of T, p of
5, 7, given T. Agreed?
843
00:42:24,730 --> 00:42:27,350
That's just the definition
of the posterior.
844
00:42:27,350 --> 00:42:30,330
But when I sum
these guys, there's
845
00:42:30,330 --> 00:42:34,780
only one that counts,
because, for those things,
846
00:42:34,780 --> 00:42:38,140
we know that this is actually
equal to 0 for every T,
847
00:42:38,140 --> 00:42:41,470
except for when T is equal to 6.
848
00:42:41,470 --> 00:42:45,380
So this entire sum
here is actually
849
00:42:45,380 --> 00:42:54,310
equal to pi of 6
times p of 5, 6--
850
00:42:54,310 --> 00:43:03,360
sorry, 5, 7, of 5, 7,
given that theta
851
00:43:03,360 --> 00:43:08,370
is equal to 6, which we
know is equal to 1/4.
852
00:43:08,370 --> 00:43:10,630
And I did not tell
you what pi of 6 was.
853
00:43:16,840 --> 00:43:18,070
But it's the same thing here.
854
00:43:18,070 --> 00:43:21,020
The posterior for any
theta that's not 6
855
00:43:21,020 --> 00:43:23,520
is actually going to be-- this
guy's going to be equal to 0.
856
00:43:23,520 --> 00:43:26,130
So I really don't
care what this guy is.
857
00:43:26,130 --> 00:43:29,270
So what it means is that
my posterior becomes what?
858
00:43:33,870 --> 00:43:40,290
It becomes the
posterior pi of theta,
859
00:43:40,290 --> 00:43:46,970
given 5 and 7 is equal to--
well, when theta is not
860
00:43:46,970 --> 00:43:49,090
equal to 6, this is actually 0.
861
00:43:49,090 --> 00:43:52,450
So regardless of what I do here,
I get something which is 0.
862
00:43:55,120 --> 00:43:58,000
And if theta is equal
to 6, what I get
863
00:43:58,000 --> 00:44:02,500
is pi of 6 times
p of 5, 7, given 6,
864
00:44:02,500 --> 00:44:05,560
which I've just computed
here, which is 1/4 divided
865
00:44:05,560 --> 00:44:08,140
by pi of 6 times 1/4.
866
00:44:08,140 --> 00:44:10,640
So it's the ratio of two
things that are identical.
867
00:44:10,640 --> 00:44:13,360
So I get 1.
868
00:44:13,360 --> 00:44:16,570
So now, my posterior
tells me that, given
869
00:44:16,570 --> 00:44:22,440
that I observe 5
and 7, theta has
870
00:44:22,440 --> 00:44:27,690
to be 1 with probability-- has
to be 6 with probability 1.
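The posterior computation can be replayed numerically: for any prior putting positive mass on every integer, observing x1 = 5 and x2 = 7 forces all posterior mass onto theta = 6 (the truncated uniform prior below is a made-up stand-in):

```python
def likelihood(x1, x2, theta):
    # X_i = theta + eps_i, with eps_i uniform on {-1, +1}, independent
    p1 = 0.5 if abs(x1 - theta) == 1 else 0.0
    p2 = 0.5 if abs(x2 - theta) == 1 else 0.0
    return p1 * p2

# Any prior positive on the integers works; take a truncated uniform
support = range(-50, 51)
prior = {t: 1 / 101 for t in support}

x1, x2 = 5, 7
unnorm = {t: prior[t] * likelihood(x1, x2, t) for t in support}
z = sum(unnorm.values())
posterior = {t: u / z for t, u in unnorm.items()}
# The prior value at 6 cancels in the ratio: posterior mass at 6 is 1
```

The prior weight pi(6) appears in both numerator and denominator and cancels, which is why its exact value never mattered in the lecture's argument.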
871
00:44:27,690 --> 00:44:32,850
So now, I say that this
thing here-- so now, this
872
00:44:32,850 --> 00:44:34,590
is not something
that actually makes
873
00:44:34,590 --> 00:44:37,440
sense when I talk about
frequentist confidence
874
00:44:37,440 --> 00:44:38,310
intervals.
875
00:44:38,310 --> 00:44:40,560
They don't really make sense,
to talk about confidence
876
00:44:40,560 --> 00:44:42,330
intervals, given something.
877
00:44:42,330 --> 00:44:44,100
And so now, given that
I observe 5 and 7,
878
00:44:44,100 --> 00:44:46,224
I know that the probability
that theta equals 6 is 1.
879
00:44:46,224 --> 00:44:50,310
And in this sense, the
Bayesian confidence interval
880
00:44:50,310 --> 00:44:54,699
is actually more meaningful.
881
00:44:54,699 --> 00:44:56,990
So one thing I want to actually
say about this Bayesian
882
00:44:56,990 --> 00:44:58,466
confidence interval
is that it's--
883
00:45:01,100 --> 00:45:03,181
I mean, here, it's equal
to the value 1, right?
884
00:45:03,181 --> 00:45:05,180
So it really encompasses
the thing that we want.
885
00:45:05,180 --> 00:45:06,763
But the fact that
we actually computed
886
00:45:06,763 --> 00:45:09,140
it using the Bayesian
posterior and the Bayesian rule
887
00:45:09,140 --> 00:45:10,806
did not really matter
for this argument.
888
00:45:10,806 --> 00:45:12,980
All I just said was
that it had a prior.
889
00:45:12,980 --> 00:45:15,080
But just what I
want to illustrate
890
00:45:15,080 --> 00:45:17,930
is the fact that we can
actually give a meaning
891
00:45:17,930 --> 00:45:21,740
to the probability that
theta is equal to 6,
892
00:45:21,740 --> 00:45:23,390
given that I see 5 and 7.
893
00:45:23,390 --> 00:45:26,780
Whereas, we cannot really
in the other cases.
894
00:45:26,780 --> 00:45:28,490
And we don't have
to be particularly
895
00:45:28,490 --> 00:45:31,740
precise in the prior on theta
to be able to give theta this--
896
00:45:31,740 --> 00:45:32,930
to give this meaning.
897
00:45:32,930 --> 00:45:35,062
OK?
898
00:45:35,062 --> 00:45:36,038
All right.
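The discrete update just done on the board can be sketched numerically. A minimal sketch, assuming (since the model is not restated here) that X equals theta plus or minus 1, each with probability 1/2, which is what gives p(5, 7 given 6) = 1/4; the prior values below are made up, and only the fact that pi(6) > 0 matters:

```python
from fractions import Fraction

def likelihood(x, theta):
    # Assumed model: X = theta - 1 or theta + 1, each with probability 1/2,
    # so p(x | theta) = 1/2 exactly when |x - theta| = 1.
    return Fraction(1, 2) if abs(x - theta) == 1 else Fraction(0)

def posterior(data, prior):
    # Bayes' rule on a discrete parameter grid: pi(theta | data) is
    # proportional to pi(theta) * prod_i p(x_i | theta).
    unnorm = {}
    for theta, p in prior.items():
        w = p
        for x in data:
            w *= likelihood(x, theta)
        unnorm[theta] = w
    total = sum(unnorm.values())
    return {theta: w / total for theta, w in unnorm.items()}

# Any prior with pi(6) > 0 gives the same answer: pi(6) cancels in the ratio.
prior = {theta: Fraction(1, 11) for theta in range(11)}
post = posterior([5, 7], prior)
print(post[6])  # 1: theta = 6 is the only value that can produce both 5 and 7
```

Swapping in any other prior with positive mass at 6 leaves post[6] equal to 1, which is the point being made here.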
899
00:45:38,966 --> 00:45:43,130
So now, as I said, I think
the main power of Bayesian
900
00:45:43,130 --> 00:45:45,980
inference is that it spits out
the posterior distribution,
901
00:45:45,980 --> 00:45:48,830
and not just the single
number, like frequentists
902
00:45:48,830 --> 00:45:50,030
would give you.
903
00:45:50,030 --> 00:45:55,070
Then we can decorate theta
hat, our point estimate,
904
00:45:55,070 --> 00:45:56,570
with maybe some
confidence interval.
905
00:45:56,570 --> 00:45:58,400
Maybe we can do
a bunch of tests.
906
00:45:58,400 --> 00:46:01,070
But at the end of the
day, we just have,
907
00:46:01,070 --> 00:46:02,624
essentially, one number, right?
908
00:46:02,624 --> 00:46:04,040
Then maybe we can
understand where
909
00:46:04,040 --> 00:46:07,310
the fluctuations of this number
are in a frequentist setup.
910
00:46:07,310 --> 00:46:11,760
But the Bayesian
framework is essentially
911
00:46:11,760 --> 00:46:13,059
giving you a natural method.
912
00:46:13,059 --> 00:46:15,517
And you can interpret it in
terms of the probabilities that
913
00:46:15,517 --> 00:46:17,400
are associated to the prior.
914
00:46:17,400 --> 00:46:21,180
But you can actually
also try to make some--
915
00:46:21,180 --> 00:46:25,840
so a Bayesian, if you
give me any prior,
916
00:46:25,840 --> 00:46:29,040
you're going to actually build
an estimator from this prior,
917
00:46:29,040 --> 00:46:30,515
maybe from the posterior.
918
00:46:30,515 --> 00:46:32,890
And maybe it's going to have
some frequentist properties.
919
00:46:32,890 --> 00:46:35,181
And that's what's really nice
about [? Bayesians, ?] is
920
00:46:35,181 --> 00:46:36,700
that you can
actually try to give
921
00:46:36,700 --> 00:46:39,340
some frequentist properties
of Bayesian methods, that
922
00:46:39,340 --> 00:46:42,224
are built using
Bayesian methodology.
923
00:46:42,224 --> 00:46:44,140
But you cannot really
go the other way around.
924
00:46:44,140 --> 00:46:46,449
If I give you a
frequentist methodology,
925
00:46:46,449 --> 00:46:48,490
how are you going to say
something about the fact
926
00:46:48,490 --> 00:46:51,620
that there's a prior
going on, et cetera?
927
00:46:51,620 --> 00:46:53,457
And so this is actually
an area where
928
00:46:53,457 --> 00:46:55,790
there's actually some research
going on.
929
00:46:55,790 --> 00:46:58,147
They call it Bayesian
posterior concentration.
930
00:46:58,147 --> 00:46:59,980
And one of the things--
so there's something
931
00:46:59,980 --> 00:47:01,990
called the Bernstein-von
Mises theorem.
932
00:47:01,990 --> 00:47:03,910
And those are a
class of theorems,
933
00:47:03,910 --> 00:47:06,790
and those are essentially
methods that tell you, well,
934
00:47:06,790 --> 00:47:10,690
if I actually run
a Bayesian method,
935
00:47:10,690 --> 00:47:12,647
and I look at the
posterior that I get--
936
00:47:12,647 --> 00:47:14,230
it's going to be
something like this--
937
00:47:14,230 --> 00:47:16,540
but now, I try to study this
from a frequentist point of view,
938
00:47:16,540 --> 00:47:18,289
there's actually a
true parameter theta
939
00:47:18,289 --> 00:47:20,390
somewhere, the true one.
940
00:47:20,390 --> 00:47:21,640
There's no prior for this guy.
941
00:47:21,640 --> 00:47:23,410
This is just one fixed number.
942
00:47:23,410 --> 00:47:25,120
Is it true that as
my sample size is
943
00:47:25,120 --> 00:47:27,610
going to go to infinity,
then this thing is going
944
00:47:27,610 --> 00:47:29,860
to concentrate around theta?
945
00:47:29,860 --> 00:47:31,990
And the rate of
concentration of this thing,
946
00:47:31,990 --> 00:47:35,440
the size of this width,
the standard deviation
947
00:47:35,440 --> 00:47:38,290
of this thing, is something
that should decay maybe
948
00:47:38,290 --> 00:47:40,850
like 1 over square root of
n, or something like this.
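This 1 over square root of n width can be illustrated with the conjugate Beta posterior for Bernoulli data. A hedged sketch: the true p = 0.4 and the sample sizes are made up, and a Jeffreys Beta(1/2, 1/2) prior is assumed:

```python
import math
import random

def beta_std(a, b):
    # Closed-form standard deviation of a Beta(a, b) distribution.
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

random.seed(0)
p_true = 0.4  # made-up true parameter for the frequentist thought experiment
for n in (100, 400, 1600):
    s = sum(1 for _ in range(n) if random.random() < p_true)
    # Posterior width under a Beta(1/2, 1/2) prior: quadrupling n roughly
    # halves the width, i.e. the 1 over square root of n rate.
    print(n, round(beta_std(0.5 + s, 0.5 + n - s), 4))
```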
949
00:47:40,850 --> 00:47:43,349
And the rate of
posterior concentration,
950
00:47:43,349 --> 00:47:45,890
when you characterize it, it's
called the Bernstein-von Mises
951
00:47:45,890 --> 00:47:46,390
theorem.
952
00:47:46,390 --> 00:47:47,830
And so people are
looking at this
953
00:47:47,830 --> 00:47:49,566
in some non-parametric cases.
954
00:47:49,566 --> 00:47:51,190
You can do it in
pretty much everything
955
00:47:51,190 --> 00:47:52,190
we've been doing before.
956
00:47:52,190 --> 00:47:55,690
You can do it for non-parametric
regression estimation
957
00:47:55,690 --> 00:47:56,794
or density estimation.
958
00:47:56,794 --> 00:47:58,210
You can do it for,
of course-- you
959
00:47:58,210 --> 00:48:01,340
can do it for sparse
estimation, if you want.
960
00:48:01,340 --> 00:48:01,840
OK.
961
00:48:01,840 --> 00:48:04,967
So you can actually
compute the procedure and--
962
00:48:08,620 --> 00:48:09,290
yeah.
963
00:48:09,290 --> 00:48:12,660
And so you can think of it as
being just a method somehow.
964
00:48:12,660 --> 00:48:14,970
Now, the estimator
I'm talking about-- so
965
00:48:14,970 --> 00:48:18,210
that's just a general Bayesian
posterior concentration.
966
00:48:18,210 --> 00:48:20,430
But you can also
try to understand
967
00:48:20,430 --> 00:48:22,710
what is the property
of something that's
968
00:48:22,710 --> 00:48:24,210
extracted from this posterior.
969
00:48:24,210 --> 00:48:26,130
And one thing that
we actually describe
970
00:48:26,130 --> 00:48:28,310
was, for example,
well, given this guy,
971
00:48:28,310 --> 00:48:30,060
maybe it's a good idea
to think about what
972
00:48:30,060 --> 00:48:32,370
the mean of this
thing is, right?
973
00:48:32,370 --> 00:48:35,040
So there's going to
be some theta hat,
974
00:48:35,040 --> 00:48:41,460
which is just the integral of
theta times pi of theta, given x1 through xn--
975
00:48:41,460 --> 00:48:43,860
so that's my posterior--
976
00:48:43,860 --> 00:48:44,380
d theta.
977
00:48:44,380 --> 00:48:44,880
Right?
978
00:48:44,880 --> 00:48:46,500
So that's the posterior mean.
979
00:48:46,500 --> 00:48:48,750
That's the expected
value with respect
980
00:48:48,750 --> 00:48:50,880
to the posterior distribution.
981
00:48:50,880 --> 00:48:53,640
And I want to know how
does this thing behave,
982
00:48:53,640 --> 00:48:56,670
how close it is to a
true theta if I actually
983
00:48:56,670 --> 00:48:58,370
am in a frequency setup.
984
00:48:58,370 --> 00:48:59,784
So that's the posterior mean.
985
00:49:04,260 --> 00:49:08,450
But this is not the only thing
I can actually spit out, right?
986
00:49:08,450 --> 00:49:09,980
This is definitely
uniquely defined.
987
00:49:09,980 --> 00:49:13,490
If you give me a
distribution, I can actually
988
00:49:13,490 --> 00:49:15,170
spit out its posterior mean.
989
00:49:15,170 --> 00:49:17,480
But I can also think of
the posterior median.
990
00:49:21,450 --> 00:49:23,237
But now, if this
is not continuous,
991
00:49:23,237 --> 00:49:24,570
you might have some uncertainty.
992
00:49:24,570 --> 00:49:26,570
Maybe the median is
not uniquely defined,
993
00:49:26,570 --> 00:49:29,180
and so maybe that's not
something you use as much.
994
00:49:29,180 --> 00:49:31,690
Maybe you can actually talk
about the posterior mode.
995
00:49:35,160 --> 00:49:38,040
All right, so for example, if
your posterior density looks
996
00:49:38,040 --> 00:49:40,020
like this, then
maybe you just want
997
00:49:40,020 --> 00:49:43,600
to summarize your
posterior with this number.
998
00:49:43,600 --> 00:49:46,080
So clearly, in this case,
it's not such a good idea,
999
00:49:46,080 --> 00:49:48,270
because you completely
forget about this mode.
1000
00:49:48,270 --> 00:49:49,811
But maybe that's
what you want to do.
1001
00:49:49,811 --> 00:49:53,400
Maybe you want to focus
on the most peaked mode.
1002
00:49:53,400 --> 00:49:58,524
And this is actually called
maximum a posteriori.
1003
00:49:58,524 --> 00:49:59,940
As I said, maybe
you want a sample
1004
00:49:59,940 --> 00:50:03,240
from this posterior
distribution.
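The posterior mean, median, and mode just listed can all be read off one grid approximation of the posterior density. A minimal sketch with made-up numbers (a Beta posterior with parameters 7.5 and 3.5, i.e. seven 1s in ten Bernoulli draws under a Beta(1/2, 1/2) prior):

```python
def beta_density(p, a, b):
    # Unnormalized Beta(a, b) density; the constant cancels in every summary.
    return p ** (a - 1) * (1 - p) ** (b - 1)

def summaries(a, b, grid_size=100_000):
    # Grid approximations of the posterior mean, median, and mode (MAP).
    ps = [(i + 0.5) / grid_size for i in range(grid_size)]
    ws = [beta_density(p, a, b) for p in ps]
    total = sum(ws)
    mean = sum(p * w for p, w in zip(ps, ws)) / total
    acc, median = 0.0, None
    for p, w in zip(ps, ws):  # median: where the CDF crosses 1/2
        acc += w
        if acc >= total / 2:
            median = p
            break
    mode = ps[max(range(grid_size), key=ws.__getitem__)]
    return mean, median, mode

mean, median, mode = summaries(7.5, 3.5)
print(round(mean, 3))  # close to the exact value 7.5 / 11, about 0.682
```

The same grid recipe works for any posterior you can evaluate up to a constant, not just the Beta.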
1005
00:50:03,240 --> 00:50:06,420
OK, and so in all these cases,
these Bayesian estimators
1006
00:50:06,420 --> 00:50:09,000
will depend on the
prior distribution.
1007
00:50:09,000 --> 00:50:11,610
And the hope is that, as
the sample size grows,
1008
00:50:11,610 --> 00:50:14,130
you won't see its effect anymore.
1009
00:50:14,130 --> 00:50:14,630
OK.
1010
00:50:14,630 --> 00:50:20,840
So to conclude, let's just
do a couple of experiments.
1011
00:50:20,840 --> 00:50:22,340
So if I look at--
1012
00:50:25,200 --> 00:50:26,011
did we do this?
1013
00:50:26,011 --> 00:50:26,510
Yes.
1014
00:50:26,510 --> 00:50:30,398
So for example, so let's
focus on the posterior mean.
1015
00:50:34,366 --> 00:50:45,394
And we know-- so remember
in experiment one--
1016
00:50:45,394 --> 00:50:48,100
[INAUDIBLE] example
one, what we had
1017
00:50:48,100 --> 00:50:56,000
was x1 xn that were
[? iid, ?] Bernoulli p,
1018
00:50:56,000 --> 00:51:06,410
and the prior I put on p was
a beta with parameters a and a.
1019
00:51:06,410 --> 00:51:07,160
OK?
1020
00:51:07,160 --> 00:51:09,830
And if I go back to
what we computed,
1021
00:51:09,830 --> 00:51:12,740
you can actually compute
the posterior of this thing.
1022
00:51:12,740 --> 00:51:15,000
And we know that it's
actually going to be--
1023
00:51:15,000 --> 00:51:17,390
sorry, that was uniform?
1024
00:51:17,390 --> 00:51:18,620
Where is-- yeah.
1025
00:51:18,620 --> 00:51:31,170
So what we get is that
the posterior, this thing
1026
00:51:31,170 --> 00:51:36,630
is actually going to be
a beta with parameter
1027
00:51:36,630 --> 00:51:42,640
a plus the sum, so a
plus the number of 1s
1028
00:51:42,640 --> 00:51:44,770
and a plus the number of 0s.
1029
00:51:48,590 --> 00:51:49,870
OK?
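This conjugate update is literally just counting. A minimal sketch of the Beta(a, a)-prior update for Bernoulli data (the data below are made up):

```python
def beta_posterior(xs, a):
    # Bernoulli likelihood with a Beta(a, a) prior gives a Beta posterior
    # whose parameters are a plus the number of 1s and a plus the number of 0s.
    ones = sum(xs)
    zeros = len(xs) - ones
    return a + ones, a + zeros

print(beta_posterior([1, 0, 1, 1, 0], a=0.5))  # (3.5, 2.5)
```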
1030
00:51:49,870 --> 00:51:53,840
And the beta was just
something that looked like--
1031
00:51:56,480 --> 00:52:00,500
the density was p to the
a minus 1, times 1 minus p to the a minus 1.
1032
00:52:05,440 --> 00:52:05,940
OK?
1033
00:52:05,940 --> 00:52:11,130
So if I want to understand
the posterior mean,
1034
00:52:11,130 --> 00:52:13,950
I need to be able to compute
the expectation of a beta,
1035
00:52:13,950 --> 00:52:16,620
and then maybe plug
in a for a plus
1036
00:52:16,620 --> 00:52:17,980
this guy and minus this guy.
1037
00:52:17,980 --> 00:52:18,480
OK.
1038
00:52:18,480 --> 00:52:21,770
So actually, let me do this.
1039
00:52:21,770 --> 00:52:22,270
OK.
1040
00:52:22,270 --> 00:52:23,930
So what is the expectation?
1041
00:52:26,337 --> 00:52:27,920
So what I want is
something that looks
1042
00:52:27,920 --> 00:52:34,820
like the integral between 0
and 1 of p times p to the
1043
00:52:34,820 --> 00:52:42,320
a minus 1, times 1 minus
p to the b minus 1.
1044
00:52:42,320 --> 00:52:43,590
Do we agree that this--
1045
00:52:43,590 --> 00:52:46,290
and then there's a
normalizing constant.
1046
00:52:46,290 --> 00:52:49,270
Let's call it c.
1047
00:52:49,270 --> 00:52:49,770
OK?
1048
00:52:53,200 --> 00:52:56,330
So this is what I
need to compute.
1049
00:52:56,330 --> 00:52:57,640
So that's c of a and b.
1050
00:53:00,257 --> 00:53:01,840
Do we agree that
this is the posterior
1051
00:53:01,840 --> 00:53:08,651
mean with respect to a beta
with parameters a and b?
1052
00:53:08,651 --> 00:53:09,150
Right?
1053
00:53:09,150 --> 00:53:13,334
I just integrate p
against the density.
1054
00:53:13,334 --> 00:53:14,750
So what does this
thing look like?
1055
00:53:14,750 --> 00:53:18,550
Well, I can actually
move this guy in here.
1056
00:53:18,550 --> 00:53:23,402
And here, I'm going to
have a plus 1 minus 1.
1057
00:53:23,402 --> 00:53:26,366
OK?
1058
00:53:26,366 --> 00:53:29,360
So the problem is that
this thing is actually--
1059
00:53:29,360 --> 00:53:31,360
the constant is going to
play a big role, right?
1060
00:53:31,360 --> 00:53:33,100
Because this is
essentially equal
1061
00:53:33,100 --> 00:53:40,270
to c(a + 1, b)
divided by c(a, b), where
1062
00:53:40,270 --> 00:53:42,220
c(a + 1, b) is just
the normalizing
1063
00:53:42,220 --> 00:53:46,340
constant of a beta(a + 1, b).
1064
00:53:46,340 --> 00:53:48,729
So I need to know the ratio
of those two constants.
1065
00:53:58,320 --> 00:53:59,660
And this is not something--
1066
00:53:59,660 --> 00:54:01,680
I mean, this is just
a calculus exercise.
1067
00:54:01,680 --> 00:54:06,820
So in this case,
what you get is--
1068
00:54:06,820 --> 00:54:08,640
sorry.
1069
00:54:08,640 --> 00:54:09,750
In this case, you get--
1070
00:54:12,560 --> 00:54:34,940
well, OK, so we get
essentially a divided by,
1071
00:54:34,940 --> 00:54:37,990
I think, it's a plus b.
1072
00:54:37,990 --> 00:54:38,940
Yeah, it's a plus b.
1073
00:54:41,856 --> 00:54:43,314
So that's this quantity.
1074
00:54:47,188 --> 00:54:47,688
OK?
1075
00:54:51,100 --> 00:54:56,520
And when I plug in a to be this
guy and b to be this guy, what
1076
00:54:56,520 --> 00:55:02,520
I get is a plus sum of the xi.
1077
00:55:02,520 --> 00:55:06,240
And then I get a plus this
guy, a plus n minus this guy.
1078
00:55:06,240 --> 00:55:07,720
So those two guys
go away, and I'm
1079
00:55:07,720 --> 00:55:14,050
left with 2a plus n,
which does not work.
1080
00:55:14,050 --> 00:55:15,240
No, that actually works.
1081
00:55:15,240 --> 00:55:18,520
And so now what I do, I
can actually divide and get
1082
00:55:18,520 --> 00:55:19,850
this thing, over there.
1083
00:55:19,850 --> 00:55:20,350
OK.
1084
00:55:20,350 --> 00:55:23,380
So what you can see, the reason
why this thing has been divided
1085
00:55:23,380 --> 00:55:27,730
is that you can really see
that, as n goes to infinity,
1086
00:55:27,730 --> 00:55:30,120
then this thing behaves
like xn bar, which
1087
00:55:30,120 --> 00:55:31,650
is our frequentist estimator.
1088
00:55:31,650 --> 00:55:34,200
The effect of a is
actually going away.
1089
00:55:34,200 --> 00:55:37,530
The effect of the prior, which
is completely captured by a,
1090
00:55:37,530 --> 00:55:40,440
is going away as n
goes to infinity.
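The washing-out can be checked numerically with the formula just derived, (a plus the sum of the xi) over (2a plus n). A sketch where the prior parameter a = 5 and the true p = 0.3 are made up:

```python
import random

def posterior_mean(xs, a):
    # Posterior mean (a + sum x_i) / (2a + n) for a Beta(a, a) prior.
    return (a + sum(xs)) / (2 * a + len(xs))

random.seed(0)
p_true = 0.3
for n in (10, 100, 10_000):
    xs = [1 if random.random() < p_true else 0 for _ in range(n)]
    # The gap between the posterior mean and the frequentist xn bar
    # shrinks as n grows, regardless of the choice of a.
    print(n, round(posterior_mean(xs, a=5.0), 4), round(sum(xs) / n, 4))
```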
1091
00:55:40,440 --> 00:55:42,440
Is there any question?
1092
00:55:47,440 --> 00:55:48,850
You guys have a question.
1093
00:55:48,850 --> 00:55:50,202
What is it?
1094
00:55:50,202 --> 00:55:51,551
Do you have a question?
1095
00:55:51,551 --> 00:55:53,426
AUDIENCE: Yeah, on the
board, is that divided
1096
00:55:53,426 --> 00:55:56,259
by some [INAUDIBLE] stuff?
1097
00:55:56,259 --> 00:55:58,050
PHILIPPE RIGOLLET: Is
that divided by what?
1098
00:55:58,050 --> 00:56:00,555
AUDIENCE: That a over a plus
b, and then you just expanded--
1099
00:56:00,555 --> 00:56:01,930
PHILIPPE RIGOLLET:
Oh yeah, yeah,
1100
00:56:01,930 --> 00:56:05,220
then I said that this
is equal to this, right.
1101
00:56:05,220 --> 00:56:15,690
So that's for a becomes a plus
sum of the xi's, and b becomes
1102
00:56:15,690 --> 00:56:20,391
a plus n minus sum of the xi's.
1103
00:56:20,391 --> 00:56:20,890
OK.
1104
00:56:20,890 --> 00:56:22,508
So that's just for
the posterior one.
1105
00:56:22,508 --> 00:56:26,264
AUDIENCE: What's [INAUDIBLE]
1106
00:56:26,264 --> 00:56:27,430
PHILIPPE RIGOLLET: This guy?
1107
00:56:27,430 --> 00:56:28,070
AUDIENCE: Yeah.
1108
00:56:28,070 --> 00:56:28,740
PHILIPPE RIGOLLET: 2a.
1109
00:56:28,740 --> 00:56:29,281
AUDIENCE: 2a.
1110
00:56:29,281 --> 00:56:30,150
Oh, OK.
1111
00:56:30,150 --> 00:56:31,191
PHILIPPE RIGOLLET: Right.
1112
00:56:31,191 --> 00:56:34,885
So I get a plus a plus n.
1113
00:56:34,885 --> 00:56:37,960
And then those two guys cancel.
1114
00:56:37,960 --> 00:56:38,460
OK?
1115
00:56:38,460 --> 00:56:41,380
And that's what you have here.
1116
00:56:41,380 --> 00:56:44,920
So for a is equal to 1/2--
1117
00:56:44,920 --> 00:56:47,020
and I claim that this
is Jeffreys prior.
1118
00:56:47,020 --> 00:56:53,950
Because remember, Jeffreys was
[INAUDIBLE] was square root
1119
00:56:53,950 --> 00:56:56,100
and was proportional to
1 over the square root of p, 1 minus
1120
00:56:56,100 --> 00:57:01,050
p, which I can write as p to the
minus 1/2, 1 minus p to the minus 1/2.
1121
00:57:01,050 --> 00:57:03,501
So it's just the case
a is equal to 1/2.
1122
00:57:03,501 --> 00:57:04,000
OK.
1123
00:57:04,000 --> 00:57:07,660
So if I use Jeffreys prior, I
just plug in a equals to 1/2,
1124
00:57:07,660 --> 00:57:10,530
and this is what I get.
1125
00:57:10,530 --> 00:57:12,630
OK?
1126
00:57:12,630 --> 00:57:14,880
So those things are going
to have an impact again when
1127
00:57:14,880 --> 00:57:16,150
n is moderately large.
1128
00:57:16,150 --> 00:57:19,090
For large n, those things,
whether you take Jeffreys prior
1129
00:57:19,090 --> 00:57:20,710
or you take whatever
a you prefer,
1130
00:57:20,710 --> 00:57:23,130
it's going to have
no impact whatsoever.
1131
00:57:23,130 --> 00:57:26,894
But if n is of the
order of 10, say,
1132
00:57:26,894 --> 00:57:28,810
then you're going to
start to see some impact,
1133
00:57:28,810 --> 00:57:30,351
depending on what
a you want to pick.
1134
00:57:33,540 --> 00:57:34,040
OK.
1135
00:57:34,040 --> 00:57:38,390
And then in the second
example, well, here we actually
1136
00:57:38,390 --> 00:57:42,560
computed the posterior
to be this guy.
1137
00:57:42,560 --> 00:57:45,544
Well, here, I can just read off
what the expectation is, right?
1138
00:57:45,544 --> 00:57:47,210
I mean, I don't have
to actually compute
1139
00:57:47,210 --> 00:57:48,970
the expectation of a Gaussian.
1140
00:57:48,970 --> 00:57:50,650
It's just that xn bar.
1141
00:57:50,650 --> 00:57:52,660
And so in this case,
there's actually no--
1142
00:57:52,660 --> 00:57:57,190
I mean, when I have a
non-informative prior
1143
00:57:57,190 --> 00:58:01,750
for a Gaussian, then I
have basically xn bar.
1144
00:58:01,750 --> 00:58:04,390
As you can see, actually, this
is an interesting example.
1145
00:58:04,390 --> 00:58:06,490
When I actually look
at the posterior,
1146
00:58:06,490 --> 00:58:09,190
it's not something that cost
me a lot to communicate to you,
1147
00:58:09,190 --> 00:58:10,037
right?
1148
00:58:10,037 --> 00:58:12,370
There's one symbol here, one
symbol here, and one symbol
1149
00:58:12,370 --> 00:58:13,330
here.
1150
00:58:13,330 --> 00:58:17,950
I tell you the posterior is
a Gaussian with mean xn bar
1151
00:58:17,950 --> 00:58:19,660
and variance 1/n.
1152
00:58:19,660 --> 00:58:23,530
When I actually turn
that into a posterior mean,
1153
00:58:23,530 --> 00:58:26,264
I'm dropping all
this information.
1154
00:58:26,264 --> 00:58:27,930
I'm just giving you
the first parameter.
1155
00:58:27,930 --> 00:58:30,150
So you can see there's
actually much more information
1156
00:58:30,150 --> 00:58:35,100
in the posterior than there
is in the posterior mean.
1157
00:58:35,100 --> 00:58:37,210
The posterior mean
is just a point.
1158
00:58:37,210 --> 00:58:39,930
It's not telling me how
confident I am in this point.
1159
00:58:39,930 --> 00:58:41,950
And this thing is
actually very interesting.
1160
00:58:41,950 --> 00:58:42,450
OK.
1161
00:58:42,450 --> 00:58:44,283
So you can talk about
the posterior variance
1162
00:58:44,283 --> 00:58:45,880
that's associated to it, right?
1163
00:58:45,880 --> 00:58:47,516
You can talk about,
as an output,
1164
00:58:47,516 --> 00:58:49,890
you could give the posterior
mean and posterior variance.
1165
00:58:49,890 --> 00:58:53,311
And those things are
actually interesting.
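Reporting the posterior mean together with the posterior variance is especially cheap in the Gaussian example above. A sketch assuming the improper flat prior and known unit variance, with a made-up true theta = 2:

```python
import random

def gaussian_posterior(xs):
    # With a flat (improper) prior on theta and X_i ~ N(theta, 1),
    # the posterior is N(xn bar, 1/n): two numbers summarize it fully.
    n = len(xs)
    return sum(xs) / n, 1.0 / n

random.seed(1)
xs = [random.gauss(2.0, 1.0) for _ in range(400)]
mean, var = gaussian_posterior(xs)
print(round(mean, 2), var)  # mean near 2, posterior variance exactly 1/400
```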
1166
00:58:53,311 --> 00:58:53,810
All right.
1167
00:58:53,810 --> 00:58:56,370
So I think this is it.
1168
00:58:56,370 --> 00:59:05,360
So as I said, in general,
just like in this case,
1169
00:59:05,360 --> 00:59:07,980
the impact of the prior
is being washed away
1170
00:59:07,980 --> 00:59:10,310
as the sample size
goes to infinity.
1171
00:59:10,310 --> 00:59:12,860
Just well, like here, there's
no impact of the prior.
1172
00:59:12,860 --> 00:59:14,500
It was a non-informative one.
1173
00:59:14,500 --> 00:59:17,780
But if you actually had an
informative one, [? CF ?]
1174
00:59:17,780 --> 00:59:18,683
homework-- yeah?
1175
00:59:18,683 --> 00:59:19,650
AUDIENCE: [INAUDIBLE]
1176
00:59:19,650 --> 00:59:21,150
PHILIPPE RIGOLLET: Yeah,
so [? CF ?] homework,
1177
00:59:21,150 --> 00:59:23,358
you would actually see an
impact of the prior, which,
1178
00:59:23,358 --> 00:59:25,890
again, would be washed away
as your sample size increases.
1179
00:59:25,890 --> 00:59:26,820
Here, it goes away.
1180
00:59:26,820 --> 00:59:29,610
You just get xn bar.
1181
00:59:29,610 --> 00:59:31,830
And actually, in
these cases, you
1182
00:59:31,830 --> 00:59:35,580
see that the posterior
distribution converges
1183
00:59:35,580 --> 00:59:37,560
to-- sorry, the
Bayesian estimator
1184
00:59:37,560 --> 00:59:39,510
is asymptotically normal.
1185
00:59:39,510 --> 00:59:43,471
This is different from the
distribution of the posterior,
1186
00:59:43,471 --> 00:59:43,970
right?
1187
00:59:43,970 --> 00:59:45,886
This is just the posterior
mean, which happens
1188
00:59:45,886 --> 00:59:47,480
to be asymptotically normal.
1189
00:59:47,480 --> 00:59:49,595
But the posterior
may not have a--
1190
00:59:49,595 --> 00:59:53,000
I mean, here, the
posterior is a beta, right?
1191
00:59:53,000 --> 00:59:55,020
I mean, it's not normal.
1192
00:59:55,020 --> 00:59:57,210
OK, so there's
different-- those things
1193
00:59:57,210 --> 00:59:59,556
are two different things.
1194
00:59:59,556 --> 01:00:01,548
Your question?
1195
01:00:01,548 --> 01:00:04,487
AUDIENCE: What was
the prior [INAUDIBLE]
1196
01:00:04,487 --> 01:00:05,820
PHILIPPE RIGOLLET: All 1, right?
1197
01:00:05,820 --> 01:00:06,986
That was the improper prior.
1198
01:00:06,986 --> 01:00:08,896
AUDIENCE: OK.
1199
01:00:08,896 --> 01:00:12,563
And so that would give you the
same thing as [INAUDIBLE], not
1200
01:00:12,563 --> 01:00:13,790
just the proportion.
1201
01:00:13,790 --> 01:00:15,373
PHILIPPE RIGOLLET:
Well, I mean, yeah.
1202
01:00:15,373 --> 01:00:17,600
So it's essentially
telling you that--
1203
01:00:17,600 --> 01:00:23,390
so we said that, when you
have a non-informative prior,
1204
01:00:23,390 --> 01:00:25,760
essentially, the maximum
likelihood is the maximum
1205
01:00:25,760 --> 01:00:26,879
a posteriori, right?
1206
01:00:26,879 --> 01:00:28,670
But in this case,
there's so much symmetry,
1207
01:00:28,670 --> 01:00:30,560
that it just so happens that
the posterior
1208
01:00:30,560 --> 01:00:32,370
is completely symmetric
around its maximum.
1209
01:00:32,370 --> 01:00:34,809
So it means that the expectation
is equal to the maximum,
1210
01:00:34,809 --> 01:00:35,600
to [INAUDIBLE] max.
1211
01:00:40,957 --> 01:00:41,931
Yeah?
1212
01:00:41,931 --> 01:00:43,392
AUDIENCE: I read
somewhere that one
1213
01:00:43,392 --> 01:00:45,340
of the issues with
Bayesian methods
1214
01:00:45,340 --> 01:00:46,801
is that we choose
the wrong prior,
1215
01:00:46,801 --> 01:00:49,723
and it could mess
up your results.
1216
01:00:49,723 --> 01:00:51,370
PHILIPPE RIGOLLET:
Yeah, but hence,
1217
01:00:51,370 --> 01:00:53,980
do not pick the wrong prior.
1218
01:00:53,980 --> 01:00:55,244
I mean, of course, it would.
1219
01:00:55,244 --> 01:00:57,160
I mean, it would mess
up your res-- of course.
1220
01:00:57,160 --> 01:00:58,810
I mean, you're putting
extra information.
1221
01:00:58,810 --> 01:01:00,601
But you could say the
same thing by saying,
1222
01:01:00,601 --> 01:01:03,670
well, the issue with
frequentist method
1223
01:01:03,670 --> 01:01:06,730
is that, if you mess up the
choice of your likelihood,
1224
01:01:06,730 --> 01:01:09,424
then it's going to
mess up your output.
1225
01:01:09,424 --> 01:01:11,590
So here, you just have two
chances of messing it up,
1226
01:01:11,590 --> 01:01:12,250
right?
1227
01:01:12,250 --> 01:01:14,440
You have the-- well, it's gone.
1228
01:01:14,440 --> 01:01:17,920
So you have the product of
the likelihood and the prior,
1229
01:01:17,920 --> 01:01:20,350
and you have one
more chance to--
1230
01:01:20,350 --> 01:01:22,420
but it's true, if you
assume that the model is
1231
01:01:22,420 --> 01:01:25,960
right, then, of course,
finding the wrong prior could
1232
01:01:25,960 --> 01:01:28,520
completely mess up things
if your prior, for example,
1233
01:01:28,520 --> 01:01:30,780
has no support on
the true parameter.
1234
01:01:30,780 --> 01:01:34,715
But if your prior has a positive
weight on the true parameter
1235
01:01:34,715 --> 01:01:38,140
as n goes to infinity--
1236
01:01:38,140 --> 01:01:40,640
I mean, OK, I cannot speak
for all counterexamples
1237
01:01:40,640 --> 01:01:41,480
in the world.
1238
01:01:41,480 --> 01:01:44,450
But I'm sure, under minor
technical conditions,
1239
01:01:44,450 --> 01:01:46,550
you can guarantee
that your posterior
1240
01:01:46,550 --> 01:01:48,530
mean is going to
converge to what
1241
01:01:48,530 --> 01:01:49,742
you need it to converge to.
1242
01:01:53,678 --> 01:01:54,662
Any other question?
1243
01:01:57,881 --> 01:01:58,380
All right.
1244
01:01:58,380 --> 01:02:07,650
So I think this closes the more
traditional mathematical-- not
1245
01:02:07,650 --> 01:02:11,490
mathematical, but traditional
statistics part of this class.
1246
01:02:11,490 --> 01:02:14,310
And from here on, we'll
talk about more multivariate
1247
01:02:14,310 --> 01:02:17,740
statistics, starting with
principal component analysis.
1248
01:02:17,740 --> 01:02:19,800
So that's more like when
you have multiple data.
1249
01:02:19,800 --> 01:02:22,650
We started, in a way, to talk
about multivariate statistics
1250
01:02:22,650 --> 01:02:25,320
when we talked about
multivariate regression.
1251
01:02:25,320 --> 01:02:28,180
But we'll move on to
principal component analysis.
1252
01:02:28,180 --> 01:02:30,690
I'll talk a bit about
multiple testing.
1253
01:02:30,690 --> 01:02:32,400
I haven't made up my
mind yet about what
1254
01:02:32,400 --> 01:02:34,350
we'll really talk about in December.
1255
01:02:34,350 --> 01:02:36,480
But I want to make
sure that you have
1256
01:02:36,480 --> 01:02:41,310
a taste and a flavor of what is
interesting in statistics
1257
01:02:41,310 --> 01:02:44,341
these days, especially as you
go towards more [INAUDIBLE]
1258
01:02:44,341 --> 01:02:46,590
learning type of questions,
where really, the focus is
1259
01:02:46,590 --> 01:02:48,619
on prediction rather
than the modeling itself.
1260
01:02:48,619 --> 01:02:50,160
We'll talk about
logistic regression,
1261
01:02:50,160 --> 01:02:52,800
as well, for example,
which is generalized
1262
01:02:52,800 --> 01:02:55,470
linear models, which is just
the generalization in the case
1263
01:02:55,470 --> 01:03:00,480
that y does not take values in
the whole real line, maybe 0,1,
1264
01:03:00,480 --> 01:03:03,360
for example, for regression.
1265
01:03:03,360 --> 01:03:03,960
All right.
1266
01:03:03,960 --> 01:03:05,510
Thanks.