The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So today we'll actually just do a brief chapter on Bayesian statistics. There are entire courses on Bayesian statistics, there are entire books on Bayesian statistics, there are entire careers in Bayesian statistics. So admittedly, I'm not going to be able to do it justice and tell you all the interesting things that are happening in Bayesian statistics. But I think it's important as a statistician to know what it is and how it works, because it's actually a weapon of choice for many practitioners, and because it allows them to incorporate their knowledge about a problem in a fairly systematic manner.

So if you look at, say, the Bayesian statistics literature, it's huge. And so here I give you sort of a range of what you can expect to see in Bayesian statistics, from your second edition of a traditional book, to something that involves computation, to some things that involve risk thinking. And there's a lot of Bayesian thinking. There are a lot of things talking about, sort of, the philosophy of thinking Bayesian. This book, for example, seems to be one of them. This book is definitely one of them. This one represents a broad literature on Bayesian statistics for applications, for example, in the social sciences. But even in large-scale machine learning, there's a lot of Bayesian statistics happening, in particular using something called Bayesian nonparametrics, or hierarchical Bayesian modeling. So we do have some experts at MIT in CSAIL. Tamara Broderick, for example, is a person who does quite a bit of interesting work on Bayesian nonparametrics.
And if that's something you want to know more about, I urge you to go and talk to her.

So before we go into more advanced things, we need to start with what the Bayesian approach is. What do Bayesians do, and how is it different from what we've been doing so far? To understand the difference between Bayesians and what we've been doing so far, we first need to put a name on what we've been doing so far. It's called frequentist statistics. People usually say Bayesian versus frequentist statistics, but by "versus" I don't mean that they are naturally in opposition. Actually, you will often see the same method come out of both approaches.

So let's see how we did it. The first thing: we had data. We observed some data, and we assumed that this data was generated randomly. The reason we did that is because it allows us to leverage tools from probability. So, say by nature, measurements, you do a survey, you get some data. Then we made some assumptions on the data generating process. For example, we assumed the observations were iid. That was one of the recurring things. Sometimes we assumed they were Gaussian, if we wanted to use, say, a t-test. Maybe we did some nonparametric statistics, where we assumed a smooth function, or maybe a linear regression function. So that was our modeling. And this was basically a way to say, well, we're not going to allow arbitrary distributions for the data that we have, but maybe a small set of distributions indexed by some small set of parameters, for example, or at least to remove some of the possibilities. Otherwise, there's nothing we can learn. And so, for example, this was associated to some parameter of interest, say theta, or beta in the regression model. Then we had this unknown parameter, and we wanted to find it. We wanted to either estimate it or test it, or maybe find a confidence interval for it.
So far I should not have said anything that's new. But this last sentence is actually what's going to be different in the Bayesian approach. In particular, this "unknown but fixed" thing is what's going to change.

In the Bayesian approach, we still assume that we observe some random data, but the generating process is slightly different. It's sort of a two-layer process. There's one process that generates the parameter, and then one process that, given this parameter, generates the data. So as for what the first layer does: nobody really believes that there's some random process happening that generates what is going to be the true expected number of people who turn their head to the right when they kiss. But this is actually going to be something that makes it easy for us to incorporate what we call prior belief. We'll see an example in a second. But often, you actually have a prior belief about what this parameter should be. When we did, say, least squares, we looked over all of the vectors in all of R^p, including the ones that have coefficients equal to 50 million. Those are things that we might be able to rule out. We might be able to rule things out on a much smaller scale. For example, I'm not an expert on turning your head to the right or to the left, but maybe you can rule out that almost everybody is turning their head in the same direction, or that almost everybody is turning their head in the other direction.

So we have this prior belief. And this belief is going to play, hopefully, a less and less important role as we collect more and more data. But if we have a smaller amount of data, we might want to be able to use this information rather than just shooting in the dark. And so the idea is to have this prior belief, and then we want to update this prior belief into what's called the posterior belief after we've seen some data.
Maybe I believe that something should be in some range, but maybe after I see data, it's comforting me in my beliefs, so my belief becomes stronger. So belief encompasses basically what you think and how strongly you think it. That's what I call belief. For example, if I have a belief about some parameter theta, maybe my belief is telling me where theta should be and how strongly I believe in it, in the sense that I have a very narrow region where theta could be. The posterior belief is, well, you see some data, and maybe you're more confident or less confident about what you've seen. Maybe you've shifted your belief a little bit. And so that's what we're going to try to see: how to do this in a principled manner.

To understand this better, there's nothing better than an example. So let's talk about another stupid statistical question, which is: let's try to understand p. Of course, I'm not going to talk about politics from now on. So let's talk about p, the proportion of women in the population.

And so what I could do is collect some data X1, ..., Xn and assume that they are Bernoulli with some parameter p unknown, so p is in [0, 1]. OK, let's assume that those guys are iid. So this is just an indicator, for each of my collected data points, of whether the person I randomly sample is a woman: if so, I get a one; if it's a man, I get a zero.

Now, I sample these people randomly, and I do know their gender. The frequentist approach was just saying, OK, let's just estimate p hat as Xn bar. And then we could do some tests. So here, there's a test: I want to test maybe whether p is equal to 0.5 or not. That sounds like a pretty reasonable thing to test. But we might also want to estimate p. And here, this is a case where we definitely have a prior belief about what p should be. We are pretty confident that p is not going to be 0.7.
We actually believe that it should be extremely close to one half, but maybe not exactly. Maybe this population is not the population of the world. Maybe this is the population of, say, some college, and we want to understand whether this college is half women or not. Maybe we know it's going to be close to one half, but we're not quite sure.

We're going to want to integrate that knowledge. So I could integrate it in a blunt manner by saying: discard the data and declare that p is equal to one half. But maybe that's just a little too much. So how do I trade off between using the data and combining it with this prior knowledge? In many instances, essentially what's going to happen is that this one half is going to act like one new observation. So if you have five observations, this is just a sixth observation, which will play a role. If you have a million observations, you're going to have a million and one; it's not going to play so much of a role. That's basically how it goes.

But definitely not always, because we'll see that if I take my prior to be a point mass at one half, it's basically as if I were discarding my data. So essentially, there's also your ability to encode how strongly you believe in this prior. And if you believe infinitely more in the prior than you believe in the data you collected, then it's not going to act like just one more observation.

The Bayesian approach is a tool to, one, mathematically include our prior belief into statistical procedures. Maybe I have this prior knowledge, but if I'm a medical doctor, it's not clear to me how I'm going to turn this into some principled way of building estimators. And the second goal is going to be to update this prior belief into a posterior belief by using the data.

How do I do this? At some point, I sort of suggested that there are two layers. One is where you draw the parameter at random.
And two, once you have the parameter, conditionally on this parameter, you draw your data. Nobody believes this is actually happening, that nature is just rolling dice for us and choosing parameters at random. But this idea that the parameter comes from some distribution actually captures very well how you would encode your prior. How would you say, "my belief is as follows"? Well, here's an example about p. I'm 90% sure that p is between 0.4 and 0.6. And I'm 95% sure that p is between 0.3 and 0.8. So essentially, I have these possible values of p, and what I know is that there's 90% here, between 0.4 and 0.6. And then I have 0.3 and 0.8, and I know that I'm 95% sure that I'm in there.

If you remember, this sort of looks like the kind of pictures I drew when I had some Gaussian, for example, and I said, oh, here we have 90% of the observations, and here we have 95% of the observations. So in a way, if I were able to tell you all those ranges for all possible values, then I would essentially describe a probability distribution for p. And what I'm saying is that p is going to have this kind of shape. Of course, if I only tell you these two pieces of information, that with 90% I'm between here and here, and with 95% I'm between here and here, then there are many ways I can accomplish that. I could have something that looks like this, maybe. It could be like this. There are many ways I can have this. Some of them are definitely going to be mathematically more convenient than others. And hopefully, we're going to have things that I can parameterize very well. Because if I tell you it's this guy, then there are basically one, two, three, four, five, six, seven parameters. So I probably don't want something that has seven parameters. But maybe I can say, oh, it's a Gaussian, and all I have to do is tell you where it's centered and what the standard deviation is.
So the idea of using this two-layer thing, where we think of the parameter p as being drawn from some distribution, is really just a way for us to capture this information, our prior belief being, well, there's this percentage of chances that it's there. And "percentage of chances": I'm deliberately not using the word probability here. So it's really a way to get close to this. That's why I say the true parameter is not random, but the Bayesian approach acts as if it were random, and then just spits out a procedure out of this thought process, this thought experiment.

So when you practice Bayesian statistics a lot, you start developing automatisms, things that you do without really thinking about them. Just like when you're a statistician, the first thing you do is ask, can I think of this data as being Gaussian, for example? When you're Bayesian, you're thinking, OK, I have a set of parameters. So here, I can describe my parameter as being theta in general, living in some big parameter space, capital Theta. But what spaces did we encounter? Well, we encountered the real line. We encountered the interval [0, 1] for Bernoullis. And we encountered the positive real line for exponential distributions, et cetera. And so, if I want to put a prior on those spaces, I'm going to have to have a usual set of tools for this guy, a usual set of tools for this guy, a usual set of tools for this guy. And by a usual set of tools, I mean a family of distributions that's supported on that space.

So in particular, this is the space in which my parameter, which I usually denote by p for Bernoulli, lives. And so what I need is to find a distribution on the interval [0, 1], just like this guy. The problem with the Gaussian is that it's not on the interval [0, 1]. It's going to spill out at the ends, and it's not going to be something that works for me.
And so I need to think about distributions that are preferably continuous (why would I restrict myself to discrete distributions?) and that are actually convenient. And for Bernoulli, the one that's basically the main tool everybody uses is the so-called beta distribution.

So the beta distribution has two parameters. X follows a beta with parameters a and b if it has a density f(x) equal to x^(a-1) (1 - x)^(b-1) if x is in the interval [0, 1], and 0 for all other x's. OK?

Why is that a good thing? Well, it's a density that's on the interval [0, 1], for sure. But now I have these two parameters, and the set of shapes that I can get by tweaking those two parameters is incredible. It's going to be a unimodal distribution; it's still fairly nice. It's not going to be something that goes like this and this. Because, if you think about it, what would it mean if your prior distribution on the interval [0, 1] had this shape? It would mean that maybe you think p is here, or maybe you think p is here, or maybe you think p is here, which essentially means that you think p can come from three different phenomena. And there are other models, called mixtures, that directly account for the fact that maybe there are several phenomena aggregated in your data set. But if you think that your data set is sort of pure, and that everything comes from the same phenomenon, you want something that looks like this, or maybe looks like this, or maybe is sort of symmetric. You want to be able to get all this stuff. Maybe you want something that says, well, if I'm talking about p being the proportion of women in the whole world, you want something that's really spiked around one half, almost a point mass, because, you know, let's agree that 0.5 is the actual number. So you want something that says, OK, maybe I'm wrong, but I'm sure I'm not going to be really that far off.
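As a rough numerical illustration of how the two parameters control these shapes, here is a minimal sketch; it assumes numpy and scipy are available, and the (a, b) pairs are arbitrary choices, not anything from the lecture.

```python
# How the beta parameters control the shape of a prior on [0, 1].
# The (a, b) pairs below are illustrative only.
import numpy as np
from scipy.stats import beta

grid = np.linspace(0.0, 1.0, 5)  # a coarse grid: 0, 0.25, 0.5, 0.75, 1
for a, b in [(1, 1), (2, 2), (10, 10), (2, 5)]:
    pdf_on_grid = beta.pdf(grid, a, b)                    # normalized density values
    mass_center = beta.cdf(0.6, a, b) - beta.cdf(0.4, a, b)  # prior mass on [0.4, 0.6]
    print(f"Beta({a},{b}): density on grid {np.round(pdf_on_grid, 2)}, "
          f"P(0.4 <= p <= 0.6) = {mass_center:.2f}")
```

Larger equal parameters concentrate the prior more and more tightly around one half, which is exactly the "spiked around one half" behavior being described.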
So you want something that's really pointy. But if it's something you've never checked, and again I cannot make references at this point, but something where you might have some uncertainty that it should be around one half, maybe you want something that allows you to say, well, I think it's more likely to be around one half, but there are still some fluctuations that are possible.

And in particular here, I talk about p where the two parameters a and b are actually the same; I call them both a. One is called scale, the other one is called shape. Oh, sorry, this is not a density as written; it actually has to be normalized. When you integrate this thing, you get some quantity that depends on a and b, namely the beta function, which is a combination of gamma functions, and that's why it's called the beta distribution. That's the definition of the beta function: what you get when you integrate this thing. You just have to normalize by it; it's just a number that depends on a and b.

So here, if you take a equal to b, you have something that is essentially symmetric around one half. Because what does it look like? Well, my density f(x) is going to be what? It's going to be my constant times x times (1 - x), all to the power a minus 1. And this function, x times (1 - x), looks like this. We've drawn it before; it showed up as the variance of my Bernoulli. So we know it's something that takes its maximum at one half. And now I'm just taking a power of this guy, so I'm really just distorting this thing in some fairly symmetric manner.

So this is the distribution that we actually take for p. Notice that this is kind of weird. First of all, this is probably the first time in this entire course that something has a distribution when it's actually a lowercase letter.
That's something you have to deal with, because we've been using lowercase letters for parameters, and now we want them to have a distribution. So that's what's going to happen. This is called the prior distribution. So really, I should write something like: f of p is equal to a constant times p^(a-1) (1 - p)^(a-1). Well no, actually I should not, because then it's confusing. One piece of notation that I'm going to use, when I have a constant here and I don't want to make it explicit (and we'll see in a second why I don't need to make it explicit): I'm going to write f(x) is proportional to x^(a-1) (1 - x)^(a-1). That's just to say, equal to some constant that does not depend on x, times this thing.

So if we continue with our experiment, where I'm drawing this data X1, ..., Xn, which is Bernoulli(p): if p has some distribution, it's not clear what it means to have a Bernoulli with some random parameter. So what I'm going to do is first draw my p. Let's say I get a number, 0.52. And then I'm going to draw my data conditionally on p. So here comes the first and last flowchart of this class. Nature first draws p; p follows some Beta(a, a). Then I condition on p, and then I draw X1, ..., Xn that are iid Bernoulli(p). Does everybody understand the process of generating this data? You first draw a parameter, and then you just flip those independent biased coins with this particular p. There's this layered thing.

Now, conditionally on p: so here I have this prior about p, which was this thing. So this is just the thought process again; it's not anything that actually happens in practice. This is my way of thinking about how the data was generated. And from this, I'm going to try to come up with some procedure. Just like, if your estimator is the average of the data, you don't have to understand probability to say that my estimator is the average of the data.
Anyone outside this room understands that the average is a good estimator for some average behavior, and they don't need to think of the data as being a random variable, et cetera. So, same thing, basically.

In this case, you can see that the posterior distribution is still a beta. What it means is that I had this thing, then I observed my data, and then I continue, and here I'm going to update my prior into some posterior distribution, pi. And here, this guy is actually also a beta. My posterior distribution for p is also a beta distribution, with the parameters that are on this slide, and I'll have the space to reproduce them. So I start at the beginning of this flowchart with a prior on p, I'm going to get some observations, and then I'm going to update what my posterior is.

What is beautiful in Bayesian statistics is that, as soon as you have this posterior distribution, it's essentially capturing all the information about the data that you want for p. And it's not just a point; it's not just an average. It's actually an entire distribution for the possible values of theta. And it's not the same thing as saying, well, if theta hat is equal to Xn bar, in the Gaussian case I know that this has some mean mu and then maybe it has variance sigma squared over n. That's not what I mean by "this is my posterior distribution." This is not what I mean. That would come from this guy, the Gaussian thing and the central limit theorem. What I mean is this guy, and this came exclusively from the prior distribution. If I had another prior, I would not necessarily get a beta distribution as the output.

So when I have the same family of distributions at the beginning and at the end of this flowchart, I say that the beta is a conjugate prior, meaning I put in a beta as a prior and I get a beta as a posterior. And that's why betas are so popular.
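The parameters referred to on the slide are the ones spelled out just below (the first parameter picks up the number of ones, the second the number of zeros). As a minimal sketch of the whole two-layer story, assuming numpy is available and with arbitrary choices of a, n, and the seed:

```python
# Two-layer generative story for the Bernoulli/beta example, plus the conjugate update:
# nature draws p ~ Beta(a, a), then X_1, ..., X_n are iid Bernoulli(p),
# and the posterior is again a beta with updated parameters.
import numpy as np

rng = np.random.default_rng(0)
a, n = 3.0, 50

p = rng.beta(a, a)                    # layer 1: draw the parameter from the prior
x = rng.binomial(1, p, size=n)        # layer 2: draw the data given p

post_a = a + x.sum()                  # first parameter: a + number of ones
post_b = a + n - x.sum()              # second parameter: a + number of zeros
post_mean = post_a / (post_a + post_b)

print(f"p drawn by 'nature': {p:.3f}")
print(f"posterior is Beta({post_a:.0f}, {post_b:.0f}), posterior mean {post_mean:.3f}")
```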
Conjugate priors are really nice, because you know that whatever you put in, what you're going to get in the end is a beta. So all you have to think about is the parameters. You don't have to work out again what the posterior is going to look like, what its PDF is going to be. You just have to check what the parameters are. And there are families of conjugate priors: Gaussian gives Gaussian, for example. There's a bunch of them. And this is what drives people to use specific priors as opposed to others. It has nice mathematical properties. Nobody believes that p is really distributed according to a beta, but it's flexible enough and super convenient mathematically.

Now, let's see for one second before we actually go any further. I didn't mention it, but a and b here are both positive numbers; they can be anything positive. So here, what I did is that I updated a into a plus the sum of my data, and b into b plus n minus the sum of my data. So essentially, a becomes a plus the number of ones. Well, that's when I have a and a: the first parameter becomes itself plus the number of ones, and the second one becomes itself plus the number of zeros.

And just as a sanity check, what does this mean? If a goes to zero, what is the beta when a goes to 0? We can actually read this from here. Actually, let's take a goes to... no. Sorry, let's just skip this. I'll do it when we talk about non-informative priors, because it's a little too messy.

How do we do this? How did I get this posterior distribution, given the prior? How do I update it? Well, this is where Bayes comes into Bayesian statistics. You've heard this word, Bayes, before, and the way you've heard it is in the Bayes formula. What was the Bayes formula? The Bayes formula was telling you that the probability of A given B was equal to something that depended on the probability of B given A. That's what it was.
You can actually either remember the formula or you can remember the definition. And this is: the probability of A and B, divided by the probability of B. So this is P(B given A) times P(A), divided by P(B). That's what Bayes' formula is telling you. Agreed?

So now what I want is something that tells me how this is going to work. What is going to play the role of those events, A and B? Well, one of them is going to be the distribution of my parameter theta given that I see the data. And the other is going to tell me what the distribution of the data is, given that I know what my parameter theta is. But that part, if this is the data and this is the parameter theta, is what we've been doing all along. The distribution of the data given the parameter here was n iid Bernoulli(p); I knew exactly what their joint probability mass function was.

So we said that this is going to be my data and this is going to be my parameter. So that means that this is the probability of my data given the parameter, and this is the probability of the parameter. What is this? What did we call this? This is the prior; it's just the distribution of my parameter. Now what is this? Well, this is just the distribution of the data itself. This is essentially the distribution of this data if it were not conditioned on p. If I don't condition on p, this data is a bunch of iid Bernoullis with some parameter, but the parameter is random. So for different realizations of this data set, I'm going to get different parameters for the Bernoulli. And so that leads to some sort of convolution. It's not really a convolution in this case, but it's some sort of composition of distributions: I have the randomness that comes from here, and then the randomness that comes from realizing the Bernoulli. That's just the marginal distribution.
It actually might be painful to figure out what this marginal distribution is. In a way, it's sort of a mixture, and it's not super nice. But we'll see that this actually won't matter for us. It's going to be some number; it's going to be there. But it won't matter to us what it is, because it does not depend on the parameter, and that's all that matters to us.

Let's put some names on those things; this was very informal. So let's put some actual names on what we call the prior. What is the formal definition of a prior, what is the formal definition of a posterior, and what are the rules to update them?

So I'm going to have my data, which is going to be X1, ..., Xn. Let's say they are iid, but they don't actually have to be. And so I'm going to have, given theta. When I say "given," it's either "given" like I did in the first part of this course, in all previous chapters, or "conditionally on." If you're thinking like a Bayesian, what I really mean is conditionally on this random parameter; it's as if it were a fixed number. They're going to have a distribution: X1, ..., Xn is going to have some distribution. Let's assume for now it's a PDF, p_n of X1, ..., Xn, and I'm going to write "given theta" like this, with a bar. So for example, what is this? Let's say this is a PDF. It could be a PMF. Everything I say, I'm going to think of them as being PDFs. I'm going to combine PDFs with PDFs, but I could combine a PDF with a PMF, a PMF with a PDF, or a PMF with a PMF. So everywhere you see a D, it could be an M.

Now I have those things. So what does it mean? Here is an example. X1, ..., Xn are iid N(theta, 1). Now I know exactly what the joint PDF of this thing is. It means that p_n of X1, ..., Xn given theta is equal to what? Well, it's (1 over the square root of 2 pi) to the power n, times e to the minus the sum from i equal 1 to n of (xi minus theta) squared over 2. So that's just the joint distribution of n iid N(theta, 1) random variables. That's my p_n given theta. Now, this is what we denoted by f sub theta before.
We had the subscript before, but now we just put a bar before theta, because we want to remember that this is actually conditioned on theta. But this is just notation. You should just think of this as being the usual thing that you get from some statistical model. So that's going to be p_n.

Theta has a prior distribution, pi. So think of it as either a PDF or a PMF again. For example, pi of theta was what? Well, it was some constant times theta^(a-1) (1 - theta)^(a-1). So it has some prior distribution, and that's another PDF.

So now I'm given the distribution of my X's given theta, and I'm given the distribution of my theta. I'm given this guy, that's this guy, and I'm given that guy, which is my pi. So that's my p_n of X1, ..., Xn given theta, and that's my pi of theta. And this last one, well, this is just the integral of p_n of X1, ..., Xn given theta, times pi of theta, d theta, over all possible values of theta. That's just integrating out my theta: I compute the marginal distribution by integrating. That's just basic probability, conditional probabilities. And if I had a PMF, I would just sum over the values of theta.

Now, what I want to find (that was the prior distribution) is the posterior distribution. It's pi of theta given X1, ..., Xn. If I use Bayes' rule, I know that this is p_n of X1, ..., Xn given theta, times pi of theta, and then divided by the distribution of those guys, which I will write as the integral over theta of p_n of X1, ..., Xn given theta, times pi of theta, d theta.

Is everybody with me, still? If you're not comfortable with this, it means that you probably need to go reread a couple of pages on conditional densities and conditional PMFs from your probability class. There's really not much there. It's just a matter of being able to define those quantities, f, the density of x given y. This is just what's called a conditional density.
You need to understand what this object is and how it relates to the joint distribution of x and y, or maybe the distribution of x, or the distribution of y. But it's the same rules. One way to remember this is that it's exactly the same rule as for events: when you see a bar, it's the same thing as the probability of this and this guy, so for densities it's just a comma, divided by the probability of the second guy. That's it. So if you remember this, you can just do some pattern matching and see what I just wrote here.

Now, I can compute every single one of these pieces. This is something I get from my modeling. So I did not write this; it's not written in the slides. But I gave a name to this guy: that was my prior distribution. And that was my posterior distribution. In chapter three, maybe, what did we call this guy, the one that does not have a name and that's in the box? What did we call it?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: It is the joint distribution of the Xi's. And we gave it a name.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: It's the likelihood, right? This is exactly the likelihood. This was the likelihood of theta.

And this is something that's very important to remember, and that really reminds you that these things, maximum likelihood estimation and Bayesian estimation, are really not that different, because your posterior is really just your likelihood times something that puts some weights on the thetas, depending on where you think theta should be. If I had, say, a maximum likelihood estimate, and my likelihood in theta looked like this, but my prior on theta looked like this, and I said, oh, I really want thetas that are like this, then what's going to happen is that I'm going to turn this into some posterior that looks like this. So I'm really just reweighting. As for this posterior, this term here is a constant that does not depend on theta, right? Agreed? I integrated over theta, so theta is gone.
So forget about this guy. I have, basically, that the posterior distribution, up to scaling (because it has to be a probability density and not just any function that's positive), is the product of these two. It's a weighted version of my likelihood; that's all it is. I'm just weighting the likelihood using my prior belief on theta.

And so, given this, a natural estimator, if you follow the maximum likelihood principle, would be the maximum of this posterior. Agreed? That would basically be doing exactly what maximum likelihood estimation is telling you. So it turns out that you can; it's called Maximum A Posteriori, or MAP, and I won't talk much about this. That's Maximum A Posteriori: theta hat is the arg max of pi of theta given X1, ..., Xn.

And it sounds like it's OK. I give you a density, and you say, OK, I have a density over all values of my parameter, and you're asking me to summarize it into one number; I'm just going to take the most likely value. But you could summarize it otherwise. You could take the average. You could take the median. You could take a bunch of numbers. And the beauty of Bayesian statistics is that you don't have to take any particular number. You have an entire posterior distribution. This is not only telling you where theta is; it's actually telling you more, if what you report is the posterior itself.

Now, let's say theta is p, between 0 and 1. If my posterior distribution looks like this, or my posterior distribution looks like this, then those two guys have, one, the same mode (this is the same value), and they're symmetric, so they also have the same mean. So these two posterior distributions give me the same summary in one number. However, clearly one is much more confident than the other one. So I might as well just report the whole posterior as my solution.
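A minimal sketch of this last point, assuming scipy is available; the two posteriors below are made-up beta distributions chosen so that they share the same mode and mean but have very different spread.

```python
# Two beta posteriors with the same mode and mean but very different spread,
# illustrating why a single summary number loses information.
from scipy.stats import beta

for post_a, post_b in [(3, 3), (300, 300)]:
    mode = (post_a - 1) / (post_a + post_b - 2)   # MAP of a Beta(a, b), valid when a, b > 1
    mean = post_a / (post_a + post_b)             # posterior mean
    lo, hi = beta.interval(0.95, post_a, post_b)  # central 95% region under the posterior
    print(f"Beta({post_a},{post_b}): mode {mode:.2f}, mean {mean:.2f}, "
          f"95% of the mass in [{lo:.2f}, {hi:.2f}]")
```

Both report 0.5 as the mode and the mean, but the second concentrates almost all of its mass in a much narrower interval.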
784 00:42:04,010 --> 00:42:05,180 You can do even better. 785 00:42:05,180 --> 00:42:09,560 People actually do things such as drawing a random number 786 00:42:09,560 --> 00:42:10,600 from this distribution. 787 00:42:10,600 --> 00:42:12,940 Say, this is my number. 788 00:42:12,940 --> 00:42:14,440 That's kind of dangerous, but you 789 00:42:14,440 --> 00:42:15,690 can imagine you could do this. 790 00:42:20,730 --> 00:42:22,140 This is what works. 791 00:42:22,140 --> 00:42:23,680 That's what we went through. 792 00:42:23,680 --> 00:42:28,650 So here, as you notice, I don't care so much about this part 793 00:42:28,650 --> 00:42:30,240 here. 794 00:42:30,240 --> 00:42:32,240 Because it does not depend on theta. 795 00:42:32,240 --> 00:42:35,190 So I know that given the product of those two things, 796 00:42:35,190 --> 00:42:37,650 this thing is only the constant that I need to divide by 797 00:42:37,650 --> 00:42:40,050 so that when I integrate this thing over theta, 798 00:42:40,050 --> 00:42:41,460 it integrates to one. 799 00:42:41,460 --> 00:42:45,540 Because this has to be a probability density on theta. 800 00:42:45,540 --> 00:42:47,910 I can write this and just forget about that part. 801 00:42:47,910 --> 00:42:52,280 And that's what's written on the top of this slide. 802 00:42:52,280 --> 00:42:57,920 This notation, this sort of weird alpha, or, I don't know, 803 00:42:57,920 --> 00:42:59,780 infinity sign propped to the right, 804 00:42:59,780 --> 00:43:02,330 whatever you want to call this thing, 805 00:43:02,330 --> 00:43:04,700 is actually just really emphasizing the fact 806 00:43:04,700 --> 00:43:06,310 that I don't care. 807 00:43:06,310 --> 00:43:12,490 I write it because I can, but you know what it is. 808 00:43:17,314 --> 00:43:19,480 In some instances, you have to compute the integral. 809 00:43:19,480 --> 00:43:21,640 In some instances, you don't have to compute the integral. 810 00:43:21,640 --> 00:43:23,200 And a lot of Bayesian computation 811 00:43:23,200 --> 00:43:25,600 is about saying, OK, it's actually 812 00:43:25,600 --> 00:43:27,146 really hard to compute this integral, 813 00:43:27,146 --> 00:43:28,270 so I'd rather not do it. 814 00:43:28,270 --> 00:43:31,450 So let me try to find some methods that will allow me 815 00:43:31,450 --> 00:43:33,789 to sample from the posterior distribution, 816 00:43:33,789 --> 00:43:35,080 without having to compute this. 817 00:43:35,080 --> 00:43:37,720 And that's what's called Markov chain Monte 818 00:43:37,720 --> 00:43:40,580 Carlo, or MCMC, and that's exactly what those methods are doing. 819 00:43:40,580 --> 00:43:42,370 They're just using only ratios of things 820 00:43:42,370 --> 00:43:44,130 like that, for different thetas. 821 00:43:44,130 --> 00:43:45,890 Which means that if you take ratios, 822 00:43:45,890 --> 00:43:47,860 the normalizing constant is gone and you don't 823 00:43:47,860 --> 00:43:50,810 need to find this integral. 824 00:43:50,810 --> 00:43:53,015 So we won't go into those details at all. 825 00:43:53,015 --> 00:43:54,890 That would be the purpose of an entire course 826 00:43:54,890 --> 00:43:56,630 on Bayesian inference. 827 00:43:56,630 --> 00:43:59,570 Actually, even Bayesian computation 828 00:43:59,570 --> 00:44:02,154 would be an entire course on its own. 829 00:44:02,154 --> 00:44:03,820 And there are some very interesting things 830 00:44:03,820 --> 00:44:05,778 that are going on there, at the interface of stats 831 00:44:05,778 --> 00:44:06,890 and computation.
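[MCMC is only mentioned in passing here and is not part of this course, but a minimal random-walk Metropolis sketch shows the point being made: the accept/reject step only ever uses ratios of the unnormalized posterior, so the normalizing integral is never computed. The target below is the same Bernoulli-with-Beta(a, a)-prior posterior, with made-up data:]

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up Bernoulli data and a Beta(a, a) prior, for illustration only.
    x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
    a = 2.0
    n, s = len(x), x.sum()

    def log_unnorm_post(p):
        """Log of (likelihood times prior), without the normalizing constant."""
        if p <= 0.0 or p >= 1.0:
            return -np.inf
        return (s + a - 1) * np.log(p) + (n - s + a - 1) * np.log(1 - p)

    samples, p_cur = [], 0.5
    for _ in range(20_000):
        p_prop = p_cur + 0.1 * rng.standard_normal()   # symmetric random-walk proposal
        # Acceptance uses only the RATIO of unnormalized posteriors,
        # so the integral over theta never appears.
        if np.log(rng.uniform()) < log_unnorm_post(p_prop) - log_unnorm_post(p_cur):
            p_cur = p_prop
        samples.append(p_cur)

    print(np.mean(samples[5_000:]))   # should be close to (s + a) / (n + 2a)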
832 00:44:10,054 --> 00:44:12,470 So let's go back to our example and see if we can actually 833 00:44:12,470 --> 00:44:13,636 compute any of those things. 834 00:44:13,636 --> 00:44:17,420 Because it's very nice to give you some data, some formulas. 835 00:44:17,420 --> 00:44:19,990 Let's see if we can actually do it. 836 00:44:19,990 --> 00:44:23,810 In particular, can I actually recover this claim 837 00:44:23,810 --> 00:44:31,250 that the posterior associated to a beta prior with a Bernoulli 838 00:44:31,250 --> 00:44:35,780 likelihood is actually giving me a beta again? 839 00:44:35,780 --> 00:44:36,710 What was my prior? 840 00:44:42,670 --> 00:44:45,970 So p was following a beta A, A, which 841 00:44:45,970 --> 00:44:48,320 means that p has this density. 842 00:44:53,620 --> 00:44:56,610 That was pi of theta. 843 00:44:56,610 --> 00:44:59,580 Well, I'm going to write this as pi of p-- 844 00:44:59,580 --> 00:45:05,800 it was proportional to p to the A minus 1, times 1 minus p 845 00:45:05,800 --> 00:45:08,806 to the A minus 1. 846 00:45:08,806 --> 00:45:11,430 So that's the first ingredient I need to compute my posterior. 847 00:45:11,430 --> 00:45:14,370 I really need only two, if I only want it up to a constant. 848 00:45:14,370 --> 00:45:16,234 The second one was p n, the likelihood. 849 00:45:20,710 --> 00:45:22,620 We've computed that many times. 850 00:45:22,620 --> 00:45:25,610 And we had even a nice compact way of writing it, 851 00:45:25,610 --> 00:45:32,570 which was that p n of X1, ..., Xn, given the parameter p-- 852 00:45:32,570 --> 00:45:36,850 so the joint density of my data, given p, that's my likelihood. 853 00:45:36,850 --> 00:45:38,730 The likelihood of p was what? 854 00:45:38,730 --> 00:45:41,230 Well, it was p to the sum of the Xi's, 855 00:45:44,030 --> 00:45:46,300 1 minus p to the n minus sum of the Xi's. 856 00:45:50,990 --> 00:45:53,750 Anybody wants me to parse this more? 857 00:45:53,750 --> 00:45:56,060 Or do you remember seeing that from maximum likelihood 858 00:45:56,060 --> 00:45:57,060 estimation? 859 00:45:57,060 --> 00:45:57,697 Yeah? 860 00:45:57,697 --> 00:46:02,929 AUDIENCE: [INAUDIBLE] 861 00:46:02,929 --> 00:46:04,970 PHILIPPE RIGOLLET: That's what conditioning does. 862 00:46:10,838 --> 00:46:15,239 AUDIENCE: [INAUDIBLE] previous slide. 863 00:46:15,239 --> 00:46:19,151 [INAUDIBLE] bottom there, it says D pi of t. 864 00:46:19,151 --> 00:46:23,570 Shouldn't it be dt pi of t? 865 00:46:23,570 --> 00:46:25,300 PHILIPPE RIGOLLET: So D pi of T is 866 00:46:25,300 --> 00:46:29,110 a measure theoretic notation, which I used without thinking. 867 00:46:29,110 --> 00:46:32,380 And I should not, because I can see it upsets you. 868 00:46:32,380 --> 00:46:35,050 D pi of T is just a natural way to say 869 00:46:35,050 --> 00:46:38,170 that I integrate against whatever I'm 870 00:46:38,170 --> 00:46:43,930 given for the prior on theta. 871 00:46:43,930 --> 00:46:48,820 In particular, if my prior on theta is just the mix of a PDF and a point 872 00:46:48,820 --> 00:46:51,430 mass, maybe I say that my p takes 873 00:46:51,430 --> 00:46:54,400 value 0.5 with probability 0.5, 874 00:46:54,400 --> 00:46:58,900 and then is uniform on the interval with probability 0.5. 875 00:46:58,900 --> 00:47:01,930 For this, I neither have a PDF nor a PMF. 876 00:47:01,930 --> 00:47:04,150 But I can still talk about integrating with respect 877 00:47:04,150 --> 00:47:04,930 to this, right?
878 00:47:04,930 --> 00:47:08,530 It's going to look like, if I take a function f of T, 879 00:47:08,530 --> 00:47:14,480 D pi of T is going to be one half of f of one half. 880 00:47:14,480 --> 00:47:16,480 That's the point mass with probability one half, 881 00:47:16,480 --> 00:47:17,560 at one half. 882 00:47:17,560 --> 00:47:23,230 Plus one half of the integral between 0 and 1 of f of T, dT. 883 00:47:23,230 --> 00:47:26,980 This is just the notation, which is actually, funnily enough, 884 00:47:26,980 --> 00:47:29,360 interchangeable with pi of dT. 885 00:47:32,460 --> 00:47:34,890 But if you have a density, it's really 886 00:47:34,890 --> 00:47:39,801 just the density, pi of T, dT, 887 00:47:39,801 --> 00:47:41,940 if pi is really a density. The D pi notation is for 888 00:47:41,940 --> 00:47:44,120 when pi is a measure and not a density. 889 00:47:46,820 --> 00:47:49,700 Everybody else, forget about this. 890 00:47:49,700 --> 00:47:51,627 This is not something you should really 891 00:47:51,627 --> 00:47:52,710 worry about at this point. 892 00:47:52,710 --> 00:47:55,719 This is more graduate level probability classes. 893 00:47:55,719 --> 00:47:57,260 But yeah, it's called measure theory. 894 00:47:57,260 --> 00:47:59,160 And that's when you think of pi as being a measure 895 00:47:59,160 --> 00:47:59,980 in an abstract fashion. 896 00:47:59,980 --> 00:48:01,896 You don't have to worry whether it's a density 897 00:48:01,896 --> 00:48:04,000 or not, or whether it has a density. 898 00:48:08,350 --> 00:48:10,250 So everybody is OK with this? 899 00:48:15,530 --> 00:48:17,390 Now I need to compute my posterior. 900 00:48:17,390 --> 00:48:23,120 And as I said, my posterior is really 901 00:48:23,120 --> 00:48:25,550 just the product of the likelihood weighted 902 00:48:25,550 --> 00:48:28,970 by the prior. 903 00:48:28,970 --> 00:48:33,030 Hopefully, at this stage of your education, 904 00:48:33,030 --> 00:48:35,390 you can multiply two functions. 905 00:48:35,390 --> 00:48:37,580 So what's happening is, if I multiply this guy 906 00:48:37,580 --> 00:48:41,300 with this guy, p gets this guy to the power 907 00:48:41,300 --> 00:48:42,860 this guy plus this guy. 908 00:48:53,810 --> 00:49:00,020 And then 1 minus p gets the power n minus sum of Xi's. 909 00:49:00,020 --> 00:49:02,900 So this is always from i equal 1 to n. 910 00:49:02,900 --> 00:49:04,390 And then plus A minus 1 as well. 911 00:49:10,010 --> 00:49:15,560 This is up to constant, because I still need to solve this. 912 00:49:15,560 --> 00:49:17,259 And I could try to do it. 913 00:49:17,259 --> 00:49:18,800 But I really don't have to, because I 914 00:49:18,800 --> 00:49:24,380 know that if my density has this form, then 915 00:49:24,380 --> 00:49:25,532 it's a beta distribution. 916 00:49:25,532 --> 00:49:26,990 And then I can just go on Wikipedia 917 00:49:26,990 --> 00:49:29,021 and see what should be the normalization factor. 918 00:49:29,021 --> 00:49:31,020 But I know it's going to be a beta distribution. 919 00:49:31,020 --> 00:49:34,020 It's actually the beta with these parameters. 920 00:49:34,020 --> 00:49:39,210 So this is really my beta with parameter sum of Xi, 921 00:49:39,210 --> 00:49:43,580 i equal 1 to n, plus A minus 1. 922 00:49:43,580 --> 00:49:46,130 And then the second parameter is n minus sum 923 00:49:46,130 --> 00:49:49,806 of the Xi's plus A minus 1. 924 00:49:54,980 --> 00:49:59,030 I just wrote what was here. 925 00:49:59,030 --> 00:50:01,580 What happened to my one?
926 00:50:01,580 --> 00:50:02,920 Oh no, sorry. 927 00:50:02,920 --> 00:50:05,640 Beta has the power minus 1. 928 00:50:05,640 --> 00:50:08,847 So that's the parameter of the beta. 929 00:50:08,847 --> 00:50:10,430 And this is the parameter of the beta. 930 00:50:15,127 --> 00:50:16,210 Beta is over there, right? 931 00:50:16,210 --> 00:50:19,852 So I just replace A by what I see. 932 00:50:19,852 --> 00:50:22,290 A is just becoming this guy plus this guy, 933 00:50:22,290 --> 00:50:26,400 and this guy plus this guy. 934 00:50:26,400 --> 00:50:28,662 Everybody is comfortable with this computation? 935 00:50:34,170 --> 00:50:38,850 We just agreed that beta priors for Bernoulli observations 936 00:50:38,850 --> 00:50:42,540 are certainly convenient. 937 00:50:42,540 --> 00:50:44,457 Because they are just conjugate, and we know 938 00:50:44,457 --> 00:50:46,290 what is going to come out in the end. 939 00:50:46,290 --> 00:50:48,899 It's going to be a beta as well. 940 00:50:48,899 --> 00:50:50,190 I just claimed it was convenient. 941 00:50:50,190 --> 00:50:52,890 It was certainly convenient to compute this, right? 942 00:50:52,890 --> 00:50:55,741 There was certainly some compatibility 943 00:50:55,741 --> 00:50:57,990 when I had to multiply this function by that function. 944 00:50:57,990 --> 00:51:00,916 And you can imagine that things could go much more wrong 945 00:51:00,916 --> 00:51:03,540 than just having p to some power times p to some other power, and 1 minus p 946 00:51:03,540 --> 00:51:06,390 to some power times 1 minus p to some other power. 947 00:51:06,390 --> 00:51:09,280 Things were nice. 948 00:51:09,280 --> 00:51:12,410 Now this is nice, but I can also question the following things. 949 00:51:12,410 --> 00:51:14,330 Why beta, for one? 950 00:51:14,330 --> 00:51:17,840 The beta tells me something. 951 00:51:17,840 --> 00:51:20,636 That's convenient, but then how do I pick A? 952 00:51:20,636 --> 00:51:27,500 I know that A should definitely capture where 953 00:51:27,500 --> 00:51:30,200 I want to have my p most likely located. 954 00:51:30,200 --> 00:51:32,390 But it actually also captures 955 00:51:32,390 --> 00:51:34,580 the variance of my beta. 956 00:51:34,580 --> 00:51:36,740 And so choosing different A's is going 957 00:51:36,740 --> 00:51:37,950 to give me different functions. 958 00:51:37,950 --> 00:51:43,050 If I have A and B, if I started with the beta with parameters A and B, 959 00:51:43,050 --> 00:51:48,110 if I started with a B here, I would just pick up the B here. 960 00:51:48,110 --> 00:51:49,862 Agreed? 961 00:51:49,862 --> 00:51:51,320 And that would just be symmetric. 962 00:51:51,320 --> 00:51:53,270 But they're going to capture mean and variance 963 00:51:53,270 --> 00:51:53,853 of this thing. 964 00:51:53,853 --> 00:51:56,030 And so how do I pick those guys? 965 00:51:56,030 --> 00:51:59,437 If I'm a doctor and you're asking me, 966 00:51:59,437 --> 00:52:01,520 what do you think the chances of this drug working 967 00:52:01,520 --> 00:52:03,230 in this kind of patient are? 968 00:52:03,230 --> 00:52:06,080 And I have to spit out the parameters of a beta for you, 969 00:52:06,080 --> 00:52:08,630 it might be a bit of a complicated thing to do. 970 00:52:08,630 --> 00:52:10,720 So how do you do this, especially for new problems? 971 00:52:10,720 --> 00:52:14,750 So by now, people have actually mastered 972 00:52:14,750 --> 00:52:19,290 the art of coming up with how to formulate those numbers.
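[One way to get a feel for what choosing A (and B) does is to look at the prior mean and standard deviation they imply, and at the conjugate update that was just computed, Beta(sum of Xi + A, n - sum of Xi + B). A small sketch with made-up counts; the specific values of a and b below are just examples:]

    from scipy import stats

    n, s = 10, 7          # made-up data: n Bernoulli trials, s of them equal to one
    for a, b in [(1, 1), (2, 2), (10, 10), (2, 8)]:
        prior = stats.beta(a, b)
        post = stats.beta(s + a, n - s + b)   # conjugate update from the computation above
        print(f"Beta({a},{b}) prior: mean={prior.mean():.2f}, sd={prior.std():.2f}"
              f" -> posterior: mean={post.mean():.2f}, sd={post.std():.2f}")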
973 00:52:19,290 --> 00:52:21,660 But in new problems that come up, how do you do this? 974 00:52:21,660 --> 00:52:23,840 What happens if you want to use Bayesian methods, 975 00:52:23,840 --> 00:52:30,140 but you actually do not know what you expect to see? 976 00:52:30,140 --> 00:52:33,260 To be fair, before we started this class, I hope all of you 977 00:52:33,260 --> 00:52:36,870 had no idea whether people tend to bend their head to the right 978 00:52:36,870 --> 00:52:38,172 or to the left before kissing. 979 00:52:38,172 --> 00:52:40,130 Because if you did, well, you have too much time 980 00:52:40,130 --> 00:52:42,130 on your hands and I should double your homework. 981 00:52:44,390 --> 00:52:46,970 So in this case, maybe you still want 982 00:52:46,970 --> 00:52:48,830 to use the Bayesian machinery. 983 00:52:48,830 --> 00:52:50,980 Maybe you just want to do something nice. 984 00:52:50,980 --> 00:52:53,512 It's nice, right? I mean, it worked out pretty well. 985 00:52:53,512 --> 00:52:54,470 What do you want to do? 986 00:52:54,470 --> 00:52:56,870 Well, you actually want to use some priors that 987 00:52:56,870 --> 00:53:00,170 carry no information, that basically do not prefer 988 00:53:00,170 --> 00:53:02,750 any theta to another theta. 989 00:53:02,750 --> 00:53:05,435 Now, you could read this slide, or you 990 00:53:05,435 --> 00:53:06,560 could look at this formula. 991 00:53:10,010 --> 00:53:14,920 We just said that this pi here was just here 992 00:53:14,920 --> 00:53:18,220 to weigh some thetas more than others, depending 993 00:53:18,220 --> 00:53:19,870 on our prior belief. 994 00:53:19,870 --> 00:53:21,400 If our prior belief does not want 995 00:53:21,400 --> 00:53:24,880 to put any preference towards some thetas over others, 996 00:53:24,880 --> 00:53:26,332 what do I do? 997 00:53:26,332 --> 00:53:27,655 AUDIENCE: [INAUDIBLE] 998 00:53:27,655 --> 00:53:29,462 PHILIPPE RIGOLLET: Yeah, I remove it. 999 00:53:29,462 --> 00:53:31,420 And the way to remove something we multiply by 1000 00:53:31,420 --> 00:53:32,650 is just to replace it by one. 1001 00:53:32,650 --> 00:53:35,100 That's really what we're doing. 1002 00:53:35,100 --> 00:53:38,560 If this was a constant not depending on theta, 1003 00:53:38,560 --> 00:53:41,400 then that would mean that we're not preferring any theta. 1004 00:53:41,400 --> 00:53:44,370 And we're looking at the likelihood, 1005 00:53:44,370 --> 00:53:46,560 but not as a function that we're trying to maximize. 1006 00:53:46,560 --> 00:53:50,220 It is a function that we normalize in such a way 1007 00:53:50,220 --> 00:53:52,570 that it's actually a distribution. 1008 00:53:52,570 --> 00:53:54,782 So if pi is not here, 1009 00:53:54,782 --> 00:53:56,740 this is really just taking the likelihood, 1010 00:53:56,740 --> 00:53:57,990 which is a positive function. 1011 00:53:57,990 --> 00:53:59,970 It may not integrate to 1, so I normalize it 1012 00:53:59,970 --> 00:54:02,330 so that it integrates to 1. 1013 00:54:02,330 --> 00:54:05,120 And then I just say, well, this is my posterior distribution. 1014 00:54:05,120 --> 00:54:06,770 Now I could just maximize this thing 1015 00:54:06,770 --> 00:54:09,180 and spit out my maximum likelihood estimator. 1016 00:54:09,180 --> 00:54:10,850 But I can also integrate and find 1017 00:54:10,850 --> 00:54:12,350 what the expectation of this guy is. 1018 00:54:12,350 --> 00:54:14,210 I can find what the median of this guy is.
1019 00:54:14,210 --> 00:54:16,370 I can sample data from this guy. 1020 00:54:16,370 --> 00:54:19,430 I can understand what the variance of this guy is. 1021 00:54:19,430 --> 00:54:21,830 Which is something we did not do when we just did 1022 00:54:21,830 --> 00:54:24,800 maximum likelihood estimation, because given a function, all 1023 00:54:24,800 --> 00:54:27,998 we cared about was the arg max of this function. 1024 00:54:31,680 --> 00:54:36,120 These priors are called uninformative. 1025 00:54:36,120 --> 00:54:43,440 This is just replacing this number by one, or by a constant. 1026 00:54:43,440 --> 00:54:45,020 Because it still has to be a density. 1027 00:54:49,236 --> 00:54:50,610 If I have a bounded set, I'm just 1028 00:54:50,610 --> 00:54:52,950 looking for the uniform distribution 1029 00:54:52,950 --> 00:54:56,580 on this bounded set, the one that puts constant one 1030 00:54:56,580 --> 00:54:59,200 over the size of this thing. 1031 00:54:59,200 --> 00:55:01,590 But if I have an unbounded set, what 1032 00:55:01,590 --> 00:55:03,870 is the density that takes a constant value 1033 00:55:03,870 --> 00:55:07,555 on the entire real line, for example? 1034 00:55:07,555 --> 00:55:08,430 What is this density? 1035 00:55:13,200 --> 00:55:16,550 AUDIENCE: [INAUDIBLE] 1036 00:55:16,550 --> 00:55:18,530 PHILIPPE RIGOLLET: Doesn't exist, right? 1037 00:55:18,530 --> 00:55:20,990 It just doesn't exist. 1038 00:55:20,990 --> 00:55:22,770 The way you can think of it is a Gaussian 1039 00:55:22,770 --> 00:55:24,860 with the variance going to infinity, maybe, 1040 00:55:24,860 --> 00:55:26,289 or something like this. 1041 00:55:26,289 --> 00:55:27,830 But you can think of it in many ways. 1042 00:55:27,830 --> 00:55:32,330 You can think of the limit of the uniform between minus T 1043 00:55:32,330 --> 00:55:34,250 and T, with T going to infinity. 1044 00:55:34,250 --> 00:55:36,480 But this thing is actually zero. 1045 00:55:36,480 --> 00:55:39,530 There's nothing there. 1046 00:55:39,530 --> 00:55:41,990 You can actually still talk about this. 1047 00:55:41,990 --> 00:55:44,390 You could always talk about this thing, where 1048 00:55:44,390 --> 00:55:46,550 you think of this guy as being a constant, 1049 00:55:46,550 --> 00:55:49,080 remove this thing from this equation, and just say, 1050 00:55:49,080 --> 00:55:51,320 well, my posterior is just the likelihood 1051 00:55:51,320 --> 00:55:54,680 divided by the integral of the likelihood over theta. 1052 00:55:54,680 --> 00:55:58,650 And if theta is the entire real line, so be it. 1053 00:55:58,650 --> 00:56:00,390 As long as this integral converges, 1054 00:56:00,390 --> 00:56:01,890 you can still talk about this stuff. 1055 00:56:04,460 --> 00:56:06,300 This is what's called an improper prior. 1056 00:56:09,140 --> 00:56:11,990 An improper prior is just a non-negative function defined 1057 00:56:11,990 --> 00:56:17,390 on the thetas, but it does not have to integrate to one, 1058 00:56:17,390 --> 00:56:18,170 or to anything finite. 1059 00:56:20,900 --> 00:56:22,700 If I integrate the function equal to 1 1060 00:56:22,700 --> 00:56:24,330 on the entire real line, what do I get? 1061 00:56:27,800 --> 00:56:28,520 Infinity. 1062 00:56:32,390 --> 00:56:35,960 It's not a proper prior, and it's called an improper prior. 1063 00:56:35,960 --> 00:56:39,380 And those improper priors are usually 1064 00:56:39,380 --> 00:56:42,830 what you see when you start to want non-informative priors 1065 00:56:42,830 --> 00:56:44,360 on infinite sets of thetas.
1066 00:56:44,360 --> 00:56:46,880 That's just the nature of it. 1067 00:56:46,880 --> 00:56:50,020 You should think of them as being the uniform distribution 1068 00:56:50,020 --> 00:56:52,550 on some infinite set, if that thing were to exist. 1069 00:56:56,360 --> 00:57:01,070 Let's see some examples of non-informative priors. 1070 00:57:01,070 --> 00:57:04,410 If I'm in the interval 0, 1, this is a bounded set. 1071 00:57:04,410 --> 00:57:07,730 So I can talk about the uniform prior 1072 00:57:07,730 --> 00:57:10,600 on the interval 0, 1 for a parameter p of a Bernoulli. 1073 00:57:26,380 --> 00:57:28,000 If I want to talk about this, then it 1074 00:57:28,000 --> 00:57:35,910 means that my prior is: p follows some uniform on the interval 1075 00:57:35,910 --> 00:57:37,570 0, 1. 1076 00:57:37,570 --> 00:57:48,940 So that means that f of x is 1 if x is in 0, 1, and 0 otherwise. 1077 00:57:48,940 --> 00:57:52,000 There is actually not even a normalization needed. 1078 00:57:52,000 --> 00:57:53,860 This thing integrates to 1. 1079 00:57:53,860 --> 00:57:56,137 And so now, if I look at my likelihood, 1080 00:57:56,137 --> 00:57:57,220 it's still the same thing. 1081 00:57:57,220 --> 00:58:04,510 So my posterior becomes pi of theta, given X1, ..., Xn. 1082 00:58:04,510 --> 00:58:07,022 That's my posterior. 1083 00:58:07,022 --> 00:58:08,480 I don't write the likelihood again, 1084 00:58:08,480 --> 00:58:09,830 because we still have it-- 1085 00:58:09,830 --> 00:58:11,583 well, we don't have it here anymore. 1086 00:58:15,440 --> 00:58:17,940 The likelihood is given here. 1087 00:58:17,940 --> 00:58:20,930 Copy, paste over there. 1088 00:58:20,930 --> 00:58:23,069 The posterior is just this thing times 1. 1089 00:58:23,069 --> 00:58:24,360 So you will see it in a second. 1090 00:58:24,360 --> 00:58:28,570 So it's p to the power sum of the Xi's, 1 minus p 1091 00:58:28,570 --> 00:58:31,970 to the power n minus sum of the Xi's. 1092 00:58:31,970 --> 00:58:36,380 And then it's multiplied by 1, and then divided by this 1093 00:58:36,380 --> 00:58:42,250 integral between 0 and 1 of p, sum of the Xi's, 1094 00:58:42,250 --> 00:58:47,870 1 minus p, n minus sum of the Xi's, 1095 00:58:47,870 --> 00:58:51,866 dp, which does not depend on p. 1096 00:58:51,866 --> 00:58:53,990 And I really don't care what this thing actually is. 1097 00:58:58,900 --> 00:59:03,550 That's the posterior of p. 1098 00:59:03,550 --> 00:59:06,280 And now I can see, well, what is this? 1099 00:59:06,280 --> 00:59:12,870 It's actually just the beta with parameters 1100 00:59:12,870 --> 00:59:14,120 this guy plus 1 1101 00:59:19,670 --> 00:59:21,680 and this guy plus 1. 1102 00:59:34,430 --> 00:59:38,057 I didn't tell you what the expectation of a beta was. 1103 00:59:38,057 --> 00:59:39,890 We don't know what the expectation of a beta 1104 00:59:39,890 --> 00:59:42,200 is, agreed? 1105 00:59:42,200 --> 00:59:45,980 If I wanted to find, say, the expectation of this thing, that 1106 00:59:45,980 --> 00:59:47,990 would be some good estimator. We know 1107 00:59:47,990 --> 00:59:49,902 that the maximum of this guy-- what 1108 00:59:49,902 --> 00:59:51,110 is the maximum of this thing? 1109 00:59:54,880 --> 00:59:57,937 Well, it's just this thing, it's the average of the Xi's. 1110 00:59:57,937 --> 00:59:59,770 That's just the maximum likelihood estimator 1111 00:59:59,770 --> 01:00:00,353 for Bernoulli. 1112 01:00:00,353 --> 01:00:01,702 We know it's the average.
1113 01:00:01,702 --> 01:00:03,910 Do you think if I take the expectation of this thing, 1114 01:00:03,910 --> 01:00:05,295 I'm going to get the average? 1115 01:00:13,864 --> 01:00:15,780 So actually, I'm not going to get the average. 1116 01:00:15,780 --> 01:00:19,790 I'm going to get this guy plus this guy, divided by n plus 2. 1117 01:00:27,246 --> 01:00:28,870 Let's look at what this thing is doing. 1118 01:00:28,870 --> 01:00:34,364 It's looking at the number of ones and it's adding one. 1119 01:00:34,364 --> 01:00:36,280 And this guy is looking at the number of zeros 1120 01:00:36,280 --> 01:00:39,190 and it's adding one. 1121 01:00:39,190 --> 01:00:41,910 Why is it adding this one? 1122 01:00:41,910 --> 01:00:42,840 What's going on here? 1123 01:00:47,510 --> 01:00:52,040 This is going to matter mostly when the number of ones 1124 01:00:52,040 --> 01:00:56,060 is actually zero, or the number of zeros is zero. 1125 01:00:56,060 --> 01:01:00,000 Because what it does is just push the estimate away from zero. 1126 01:01:00,000 --> 01:01:03,020 And why is that something that this Bayesian method actually 1127 01:01:03,020 --> 01:01:04,600 does for you automatically? 1128 01:01:04,600 --> 01:01:06,530 It's because when we put this non-informative 1129 01:01:06,530 --> 01:01:11,169 prior on p, which was uniform on the interval 0, 1, 1130 01:01:11,169 --> 01:01:12,960 in particular, we know that the probability 1131 01:01:12,960 --> 01:01:16,690 that p is equal to 0 is zero. 1132 01:01:16,690 --> 01:01:19,180 And the probability that p is equal to 1 is zero. 1133 01:01:19,180 --> 01:01:21,880 And so the problem is that if I did not 1134 01:01:21,880 --> 01:01:24,520 add this 1, then with some positive probability 1135 01:01:24,520 --> 01:01:28,120 I would be allowed to spit out a 1136 01:01:28,120 --> 01:01:30,640 p hat which was equal to 0. 1137 01:01:30,640 --> 01:01:33,280 If by chance, let's say I have n is equal to 3, 1138 01:01:33,280 --> 01:01:37,750 and I get only 0, 0, 0, that could happen with probability 1139 01:01:37,750 --> 01:01:41,470 1 minus p to the cube, which is positive. 1140 01:01:46,360 --> 01:01:47,880 That's not something that I want. 1141 01:01:47,880 --> 01:01:49,359 And I'm using my prior. 1142 01:01:49,359 --> 01:01:51,900 My prior is not informative, but somehow it captures the fact 1143 01:01:51,900 --> 01:01:53,550 that I don't want to believe p is going 1144 01:01:53,550 --> 01:01:56,110 to be either equal to 0 or 1. 1145 01:01:56,110 --> 01:01:59,790 So that's sort of taken care of here. 1146 01:01:59,790 --> 01:02:05,640 So let's move away a little bit from the Bernoulli example, 1147 01:02:05,640 --> 01:02:06,310 shall we? 1148 01:02:06,310 --> 01:02:08,120 I think we've seen enough of it. 1149 01:02:08,120 --> 01:02:10,860 And so let's talk about the Gaussian model. 1150 01:02:10,860 --> 01:02:12,690 Let's say I want to do Gaussian inference. 1151 01:02:17,859 --> 01:02:19,650 I want to do inference in a Gaussian model, 1152 01:02:19,650 --> 01:02:20,730 using Bayesian methods. 1153 01:02:30,600 --> 01:02:39,840 What I want is that X1, ..., Xn are, say, N 0, 1 iid. 1154 01:02:44,720 --> 01:02:47,770 Sorry, N theta, 1, iid conditionally on theta. 1155 01:02:50,630 --> 01:02:56,300 That means that p n of X1, ..., Xn, given theta, 1156 01:02:56,300 --> 01:02:58,670 is equal to exactly what I wrote before. 1157 01:02:58,670 --> 01:03:04,760 So 1 over square root of 2 pi, to the n, exponential of minus one half 1158 01:03:04,760 --> 01:03:09,579 sum of Xi minus theta, squared.
1159 01:03:09,579 --> 01:03:11,120 So that's just the joint distribution 1160 01:03:11,120 --> 01:03:13,410 of my Gaussians with mean theta. 1161 01:03:13,410 --> 01:03:14,810 And now the question is, what 1162 01:03:14,810 --> 01:03:17,540 is the posterior distribution? 1163 01:03:17,540 --> 01:03:22,500 Well, here I said, let's use the uninformative prior, 1164 01:03:22,500 --> 01:03:23,840 which is an improper prior. 1165 01:03:23,840 --> 01:03:25,490 It puts weight on everyone. 1166 01:03:25,490 --> 01:03:29,310 That's the so-called uniform on the entire real line. 1167 01:03:29,310 --> 01:03:31,190 So that's certainly not a density. 1168 01:03:31,190 --> 01:03:34,360 But I can still just use this. 1169 01:03:34,360 --> 01:03:40,430 So all I need to do is take this, divided 1170 01:03:40,430 --> 01:03:44,690 by whatever normalizes this thing. 1171 01:03:44,690 --> 01:03:47,900 But if you look at this, essentially here is what I 1172 01:03:47,900 --> 01:03:49,530 want to understand. 1173 01:03:49,530 --> 01:03:52,470 So this is proportional to the exponential of 1174 01:03:52,470 --> 01:03:55,040 minus one half sum from i equal 1 1175 01:03:55,040 --> 01:03:58,950 to n of Xi minus theta, squared. 1176 01:03:58,950 --> 01:04:01,370 And now I want to see this thing as a density, 1177 01:04:01,370 --> 01:04:03,560 not on the Xi's but on theta. 1178 01:04:06,420 --> 01:04:10,120 What I want is a density on theta. 1179 01:04:10,120 --> 01:04:13,650 So it looks like I have chances of getting something 1180 01:04:13,650 --> 01:04:16,800 that looks like a Gaussian. 1181 01:04:16,800 --> 01:04:19,500 To have a Gaussian, I would need to see minus one half. 1182 01:04:19,500 --> 01:04:21,660 And then I would need to see theta minus something 1183 01:04:21,660 --> 01:04:25,230 here, not just the sum of something minus thetas. 1184 01:04:25,230 --> 01:04:29,820 So I need to work a little bit more, 1185 01:04:29,820 --> 01:04:31,475 to expand the square here. 1186 01:04:31,475 --> 01:04:32,850 So this thing here is going to be 1187 01:04:32,850 --> 01:04:37,330 equal to exponential of minus one half sum from i equal 1 1188 01:04:37,330 --> 01:04:45,280 to n of Xi squared minus 2 Xi theta plus theta squared. 1189 01:05:10,590 --> 01:05:13,590 Now what I'm going to do is, everything, remember, 1190 01:05:13,590 --> 01:05:15,870 is up to this little sign. 1191 01:05:15,870 --> 01:05:19,710 So every time I see a term that does not depend on theta, 1192 01:05:19,710 --> 01:05:22,250 I can just push it in there and just make it disappear. 1193 01:05:22,250 --> 01:05:24,550 Agreed? 1194 01:05:24,550 --> 01:05:28,420 This term here, exponential of minus one half sum of Xi 1195 01:05:28,420 --> 01:05:31,661 squared, does it depend on theta? 1196 01:05:31,661 --> 01:05:32,160 No. 1197 01:05:32,160 --> 01:05:33,420 So I'm just pushing it there. 1198 01:05:33,420 --> 01:05:34,530 This guy, yes. 1199 01:05:34,530 --> 01:05:35,970 And the other one, yes. 1200 01:05:35,970 --> 01:05:45,020 So this is proportional to exponential of sum of the Xi's 1201 01:05:45,020 --> 01:05:47,780 times theta-- and then I pulled out my theta, the minus one half 1202 01:05:47,780 --> 01:05:50,150 canceled with the minus 2-- 1203 01:05:50,150 --> 01:05:56,460 and then I have minus one half sum from i 1204 01:05:56,460 --> 01:05:58,180 equal 1 to n of theta squared. 1205 01:06:01,480 --> 01:06:03,460 Agreed? 1206 01:06:03,460 --> 01:06:05,350 So now what this thing looks like, 1207 01:06:05,350 --> 01:06:09,570 this looks very much like some theta minus something, squared.
1208 01:06:09,570 --> 01:06:15,110 This thing here is really just n over 2 times theta. 1209 01:06:18,520 --> 01:06:21,740 Sorry, times theta squared. 1210 01:06:21,740 --> 01:06:25,120 So now what I need to do is to write this in the form theta 1211 01:06:25,120 --> 01:06:26,230 minus something-- 1212 01:06:26,230 --> 01:06:31,820 let's call it mu-- squared, divided by 2 sigma squared. 1213 01:06:31,820 --> 01:06:34,160 I want to turn this into that, maybe up to terms 1214 01:06:34,160 --> 01:06:36,510 that do not depend on theta. 1215 01:06:36,510 --> 01:06:39,062 That's what I'm going to try to do. 1216 01:06:39,062 --> 01:06:40,770 So that's called completing the square. 1217 01:06:40,770 --> 01:06:42,010 That's some exercise you do. 1218 01:06:42,010 --> 01:06:44,260 You've probably done it already in the homework. 1219 01:06:44,260 --> 01:06:46,560 And that's something you do a lot when 1220 01:06:46,560 --> 01:06:48,750 you do Bayesian statistics, in particular. 1221 01:06:48,750 --> 01:06:50,010 So let's do this. 1222 01:06:50,010 --> 01:06:51,910 What is the leading term going to be? 1223 01:06:51,910 --> 01:06:54,160 Theta squared is going to be multiplied by this thing. 1224 01:06:54,160 --> 01:06:57,130 So I'm going to pull out my n over 2. 1225 01:06:57,130 --> 01:07:03,070 And then I'm going to write this as minus n over 2 times something. 1226 01:07:03,070 --> 01:07:06,220 And then I'm going to write theta minus something, squared. 1227 01:07:06,220 --> 01:07:08,890 And this something is going to be one half of what 1228 01:07:08,890 --> 01:07:10,160 I see in the cross-product. 1229 01:07:12,966 --> 01:07:14,590 I need to actually pull this thing out. 1230 01:07:14,590 --> 01:07:18,340 So let me write it like that first. 1231 01:07:18,340 --> 01:07:21,860 So that's theta squared. 1232 01:07:21,860 --> 01:07:30,680 And then I'm going to write it as minus 2 times 1 over n, sum 1233 01:07:30,680 --> 01:07:36,980 from i equal 1 to n of the Xi's, times theta. 1234 01:07:36,980 --> 01:07:39,874 That's exactly just a rewriting of what we had before. 1235 01:07:39,874 --> 01:07:41,540 And that should look much more familiar. 1236 01:07:44,990 --> 01:07:49,700 A squared minus 2 A B, and then I'm missing something. 1237 01:07:49,700 --> 01:07:51,860 So this thing, I'm going to be able to rewrite 1238 01:07:51,860 --> 01:07:57,930 as theta minus Xn bar, squared. 1239 01:07:57,930 --> 01:08:00,720 But then I need to remove the square of Xn bar. 1240 01:08:00,720 --> 01:08:01,740 Because it's not here. 1241 01:08:09,210 --> 01:08:11,297 So I just complete the square. 1242 01:08:11,297 --> 01:08:13,880 And then I actually really don't care what this thing actually 1243 01:08:13,880 --> 01:08:16,899 was, because it's going to go again into the little alpha 1244 01:08:16,899 --> 01:08:18,416 sign over there. 1245 01:08:18,416 --> 01:08:19,790 So this thing eventually is going 1246 01:08:19,790 --> 01:08:24,620 to be proportional to exponential 1247 01:08:24,620 --> 01:08:31,090 of minus n over 2 times theta minus Xn bar, squared. 1248 01:08:31,090 --> 01:08:33,370 And so we know that if this is a density that's 1249 01:08:33,370 --> 01:08:44,100 proportional to this guy, it has to be some N with mean Xn bar. 1250 01:08:44,100 --> 01:08:47,520 And the variance-- this guy over here, this n, 1251 01:08:47,520 --> 01:08:49,318 is supposed to be 1 over sigma squared. 1252 01:08:49,318 --> 01:08:50,609 So the variance is really just 1 over n.
1253 01:08:53,870 --> 01:09:01,740 So the posterior distribution is a Gaussian 1254 01:09:01,740 --> 01:09:05,819 centered at the average of my observations, 1255 01:09:05,819 --> 01:09:08,430 and with variance 1 over n. 1256 01:09:13,307 --> 01:09:14,140 Everybody's with me? 1257 01:09:16,740 --> 01:09:19,779 Why am I saying this? This was the output of some computation. 1258 01:09:19,779 --> 01:09:21,450 But it sort of makes sense, right? 1259 01:09:21,450 --> 01:09:24,210 It's really telling me that the more observations I have, 1260 01:09:24,210 --> 01:09:26,250 the more concentrated this posterior is. 1261 01:09:26,250 --> 01:09:27,819 Concentrated around what? 1262 01:09:27,819 --> 01:09:30,529 Well, around this Xn bar. 1263 01:09:30,529 --> 01:09:33,140 That looks like something we've sort of seen before. 1264 01:09:33,140 --> 01:09:35,420 But it does not have the same meaning, somehow. 1265 01:09:35,420 --> 01:09:37,580 This is really just the posterior distribution. 1266 01:09:40,490 --> 01:09:43,160 It's sort of a sanity check that I have this 1 over n 1267 01:09:43,160 --> 01:09:44,139 when I have Xn bar. 1268 01:09:44,139 --> 01:09:45,680 But it's not the same thing as saying 1269 01:09:45,680 --> 01:09:48,429 that the variance of Xn bar was 1 over n, like we had before. 1270 01:09:55,670 --> 01:09:59,390 As an exercise, I would recommend, 1271 01:09:59,390 --> 01:10:10,140 if you don't get it, just try pi of theta 1272 01:10:10,140 --> 01:10:15,290 to be equal to some N of mu, 1. 1273 01:10:18,120 --> 01:10:22,350 Here, the prior that we used was completely non-informative. 1274 01:10:22,350 --> 01:10:25,594 What happens if I take my prior to be some Gaussian, which 1275 01:10:25,594 --> 01:10:27,510 is centered at mu and has the same variance 1276 01:10:27,510 --> 01:10:30,120 as the other guys? 1277 01:10:30,120 --> 01:10:32,204 So what's going to happen here is that we're 1278 01:10:32,204 --> 01:10:33,120 going to put a weight, 1279 01:10:33,120 --> 01:10:34,536 and everything that's away from mu 1280 01:10:34,536 --> 01:10:38,469 is going to actually get less weight. 1281 01:10:38,469 --> 01:10:40,260 I want to know how I'm going to be updating 1282 01:10:40,260 --> 01:10:41,850 this prior into a posterior. 1283 01:10:44,520 --> 01:10:47,040 Everybody sees what I'm saying here? 1284 01:10:47,040 --> 01:10:50,040 So that means that pi of theta has the density proportional 1285 01:10:50,040 --> 01:10:55,680 to exponential of minus one half theta minus mu, squared. 1286 01:10:55,680 --> 01:11:00,540 So I need to multiply my likelihood with this, 1287 01:11:00,540 --> 01:11:01,849 and then see. 1288 01:11:01,849 --> 01:11:03,390 It's actually going to be a Gaussian. 1289 01:11:03,390 --> 01:11:04,774 This is also a conjugate prior. 1290 01:11:04,774 --> 01:11:06,440 It's going to spit out another Gaussian. 1291 01:11:06,440 --> 01:11:09,390 You're going to have to complete a square again, and just check 1292 01:11:09,390 --> 01:11:10,814 what it's actually giving you. 1293 01:11:10,814 --> 01:11:12,480 And so, spoiler alert, it's going to look 1294 01:11:12,480 --> 01:11:14,790 like you get an extra observation, which is actually 1295 01:11:14,790 --> 01:11:15,360 equal to mu. 1296 01:11:18,800 --> 01:11:22,440 It's going to be the average of n plus 1 observations, 1297 01:11:22,440 --> 01:11:24,110 the first n ones being X1 to Xn, 1298 01:11:24,110 --> 01:11:27,530 and then the last one being mu. 1299 01:11:27,530 --> 01:11:30,860 And it sort of makes sense.
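[If you want to check these two Gaussian answers numerically, here is a small sketch with made-up data. It computes the posterior on a grid for both the flat improper prior and the N(mu, 1) prior, and compares with the closed forms. The 1/(n + 1) posterior variance in the conjugate case is not stated above, but it falls out of the same completing-the-square step:]

    import numpy as np

    rng = np.random.default_rng(1)
    theta_true, mu, n = 1.3, 0.0, 20          # made-up truth and an N(mu, 1) prior
    x = theta_true + rng.standard_normal(n)   # X1, ..., Xn ~ N(theta, 1)

    theta = np.linspace(-3, 5, 20_001)
    dt = theta[1] - theta[0]
    loglik = -0.5 * ((x[:, None] - theta[None, :]) ** 2).sum(axis=0)

    for name, logprior in [("flat improper prior", 0.0),
                           ("N(mu, 1) prior", -0.5 * (theta - mu) ** 2)]:
        logpost = loglik + logprior
        w = np.exp(logpost - logpost.max())
        post = w / (w.sum() * dt)                      # normalized posterior on the grid
        mean = (theta * post).sum() * dt
        var = ((theta - mean) ** 2 * post).sum() * dt
        print(name, mean, var)

    # Closed forms: flat prior -> N(Xn bar, 1/n);
    # N(mu, 1) prior -> mean (sum of Xi + mu)/(n + 1), variance 1/(n + 1).
    print(x.mean(), 1 / n, (x.sum() + mu) / (n + 1), 1 / (n + 1))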
1300 01:11:30,860 --> 01:11:34,700 That's actually a fairly simple exercise. 1301 01:11:34,700 --> 01:11:36,441 Rather than going into more computation, 1302 01:11:36,441 --> 01:11:37,940 this is something you can definitely 1303 01:11:37,940 --> 01:11:41,510 do when you're in the comfort of your room. 1304 01:11:41,510 --> 01:11:43,910 I want to talk about other types of priors. 1305 01:11:43,910 --> 01:11:47,330 The first thing I said is, there's this beta prior 1306 01:11:47,330 --> 01:11:50,390 that I just pulled out of my hat and that was just convenient. 1307 01:11:50,390 --> 01:11:52,860 Then there was this non-informative prior. 1308 01:11:52,860 --> 01:11:53,720 It was convenient. 1309 01:11:53,720 --> 01:11:56,300 It was non-informative, so if you don't know anything 1310 01:11:56,300 --> 01:11:58,950 else, maybe that's what you want to do. 1311 01:11:58,950 --> 01:12:01,940 The question is, are there any other priors that 1312 01:12:01,940 --> 01:12:04,490 are sort of principled and generic, in the sense 1313 01:12:04,490 --> 01:12:08,600 that the uninformative prior was generic, right? 1314 01:12:08,600 --> 01:12:11,400 It was equal to 1, that's as generic as it gets. 1315 01:12:11,400 --> 01:12:14,190 So is there anything else that's generic as well? 1316 01:12:14,190 --> 01:12:17,180 Well, there are these priors that are called Jeffreys priors. 1317 01:12:17,180 --> 01:12:20,540 And the Jeffreys prior is proportional to the square root 1318 01:12:20,540 --> 01:12:23,290 of the determinant of the Fisher information at theta. 1319 01:12:26,360 --> 01:12:28,600 This is actually a weird thing to do. 1320 01:12:28,600 --> 01:12:31,380 It says, look at your model. 1321 01:12:31,380 --> 01:12:34,152 Your model is going to have a Fisher information. 1322 01:12:34,152 --> 01:12:34,985 Let's say it exists. 1323 01:12:38,150 --> 01:12:39,957 Because we know it does not always exist. 1324 01:12:39,957 --> 01:12:41,540 For example, in the multinomial model, 1325 01:12:41,540 --> 01:12:44,660 we didn't have a Fisher information. 1326 01:12:44,660 --> 01:12:46,670 The determinant of a matrix is somehow 1327 01:12:46,670 --> 01:12:48,800 measuring the size of a matrix. 1328 01:12:48,800 --> 01:12:50,540 If you don't trust me, just think 1329 01:12:50,540 --> 01:12:53,870 about the matrix being of size one by one, 1330 01:12:53,870 --> 01:12:56,910 then the determinant is just the number that you have there. 1331 01:12:56,910 --> 01:13:00,770 And so this is really something that looks like the Fisher 1332 01:13:00,770 --> 01:13:01,670 information. 1333 01:13:04,374 --> 01:13:06,290 It's proportional to the amount of information 1334 01:13:06,290 --> 01:13:09,620 that you have at a certain point. 1335 01:13:09,620 --> 01:13:12,310 And so what my prior is saying is, well, 1336 01:13:12,310 --> 01:13:14,280 I want to put more weight on those thetas that 1337 01:13:14,280 --> 01:13:17,050 are going to just extract more information from the data. 1338 01:13:20,510 --> 01:13:22,760 You can actually compute those things. 1339 01:13:22,760 --> 01:13:26,215 In the first example, Jeffreys prior 1340 01:13:26,215 --> 01:13:28,360 is something that looks like this. 1341 01:13:28,360 --> 01:13:30,230 In one dimension, the Fisher information 1342 01:13:30,230 --> 01:13:33,476 is essentially one over the variance. 1343 01:13:33,476 --> 01:13:35,600 So that's just 1 over the square root of the variance, 1344 01:13:35,600 --> 01:13:37,550 because I have the square root.
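[For the Bernoulli case this can be made concrete: the Fisher information is 1/(p(1 - p)), so the Jeffreys prior is proportional to 1 over the square root of p(1 - p). That function is, up to its normalizing constant, a Beta(1/2, 1/2) density — a fact not spelled out on the slide — so the conjugate update still applies. A small sketch with made-up counts:]

    from scipy import stats

    # Jeffreys prior for Bernoulli(p): proportional to 1 / sqrt(p (1 - p)),
    # i.e. a Beta(1/2, 1/2) density (proper, even though it blows up at 0 and 1).
    jeffreys = stats.beta(0.5, 0.5)

    n, s = 10, 7                              # made-up data: n trials, s ones
    post = stats.beta(s + 0.5, n - s + 0.5)   # conjugate update, as for any beta prior

    print(jeffreys.pdf(0.5), jeffreys.pdf(0.01))  # more weight pushed toward the boundary
    print(post.mean())                            # compare with s/n and (s + 1)/(n + 2)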
1345 01:13:37,550 --> 01:13:45,770 And when I have the Jeffreys prior in the Gaussian 1346 01:13:45,770 --> 01:13:48,770 case, this is the identity matrix 1347 01:13:48,770 --> 01:13:50,840 that I would have for the Fisher information. 1348 01:13:50,840 --> 01:13:52,580 The determinant of the identity is 1. 1349 01:13:52,580 --> 01:13:56,180 So square root of 1 is 1, and so I would basically get 1. 1350 01:13:56,180 --> 01:13:59,170 And that gives me my improper prior, my uninformative prior 1351 01:13:59,170 --> 01:14:01,020 that I had. 1352 01:14:01,020 --> 01:14:03,690 So the uninformative prior 1 is fine. 1353 01:14:03,690 --> 01:14:06,780 Clearly, all the thetas carry the same information 1354 01:14:06,780 --> 01:14:08,160 in the Gaussian model. 1355 01:14:08,160 --> 01:14:10,200 Whether I translate it here or here, 1356 01:14:10,200 --> 01:14:12,120 it's pretty clear none of them is actually 1357 01:14:12,120 --> 01:14:13,140 better than the other. 1358 01:14:13,140 --> 01:14:16,530 But clearly, for the Bernoulli case, 1359 01:14:16,530 --> 01:14:22,560 the p's that are closer to the boundary carry 1360 01:14:22,560 --> 01:14:23,940 more information. 1361 01:14:23,940 --> 01:14:26,250 I sort of like those guys, because they just 1362 01:14:26,250 --> 01:14:27,757 carry more information. 1363 01:14:27,757 --> 01:14:29,340 So what I do is, I take this function, 1364 01:14:29,340 --> 01:14:30,300 p times 1 minus p. 1365 01:14:30,300 --> 01:14:34,170 Remember, it's something that looks like this, 1366 01:14:34,170 --> 01:14:35,390 on the interval 0, 1. 1367 01:14:38,710 --> 01:14:40,979 This guy, 1 over square root of p times 1 minus p, 1368 01:14:40,979 --> 01:14:42,395 is something that looks like this. 1369 01:14:45,780 --> 01:14:47,620 Agreed? 1370 01:14:47,620 --> 01:14:49,780 What it's doing is, it sort of wants to push 1371 01:14:49,780 --> 01:14:54,586 towards the p's that actually carry more information. 1372 01:14:54,586 --> 01:14:56,210 Whether you want to bias your inference that 1373 01:14:56,210 --> 01:14:59,120 way or not is something you need to think about. 1374 01:14:59,120 --> 01:15:01,550 When you put a prior on your data, on your parameter, 1375 01:15:01,550 --> 01:15:06,140 you're sort of biasing your inference towards this prior idea. 1376 01:15:06,140 --> 01:15:07,700 That's maybe not such a good idea 1377 01:15:07,700 --> 01:15:13,160 when you have some p that's actually close to one half, 1378 01:15:13,160 --> 01:15:13,820 for example. 1379 01:15:13,820 --> 01:15:14,960 You're actually saying, no, I don't 1380 01:15:14,960 --> 01:15:16,610 want to see a p that's close to one half. 1381 01:15:16,610 --> 01:15:18,350 Just make a decision, one way or another. 1382 01:15:18,350 --> 01:15:19,699 But just make a decision. 1383 01:15:19,699 --> 01:15:20,990 So it's forcing you to do that. 1384 01:15:23,690 --> 01:15:26,090 Jeffreys priors-- I'm running out of time, 1385 01:15:26,090 --> 01:15:29,850 so I don't want to go into too much detail. 1386 01:15:29,850 --> 01:15:31,670 We'll probably stop here, actually. 1387 01:15:44,570 --> 01:15:47,810 So Jeffreys priors have this very nice property. 1388 01:15:47,810 --> 01:15:51,740 It's that they actually do not care about the parameterization 1389 01:15:51,740 --> 01:15:53,150 of your space. 1390 01:15:53,150 --> 01:15:56,360 If you actually have p, and you suddenly 1391 01:15:56,360 --> 01:15:58,850 decide that p is not the right parameter for Bernoulli, 1392 01:15:58,850 --> 01:16:00,740 but it's p squared.
1393 01:16:00,740 --> 01:16:03,200 You could decide to parameterize this by p squared. 1394 01:16:03,200 --> 01:16:05,840 Maybe your doctor is actually much more able 1395 01:16:05,840 --> 01:16:08,840 to formulate some prior assumption on p squared, 1396 01:16:08,840 --> 01:16:09,800 rather than p. 1397 01:16:09,800 --> 01:16:11,100 You never know. 1398 01:16:11,100 --> 01:16:14,390 And so what happens is that Jeffreys priors 1399 01:16:14,390 --> 01:16:15,990 are invariant under this. 1400 01:16:15,990 --> 01:16:18,560 And the reason is because the information carried by p 1401 01:16:18,560 --> 01:16:21,130 is the same as the information carried by p squared, somehow. 1402 01:16:28,822 --> 01:16:30,280 They're essentially the same thing. 1403 01:16:32,950 --> 01:16:34,630 You need to have a one-to-one map, 1404 01:16:34,630 --> 01:16:37,896 where basically for each parameter before, 1405 01:16:37,896 --> 01:16:39,020 you have another parameter. 1406 01:16:39,020 --> 01:16:40,810 Let's call eta the new parameter. 1407 01:16:45,790 --> 01:16:50,380 The PDF of the new prior, indexed by eta this time, 1408 01:16:50,380 --> 01:16:52,990 is actually also a Jeffreys prior. 1409 01:16:52,990 --> 01:16:55,174 But this time, the new Fisher information 1410 01:16:55,174 --> 01:16:57,340 is not the Fisher information with respect to theta. 1411 01:16:57,340 --> 01:17:00,010 It's the Fisher information associated 1412 01:17:00,010 --> 01:17:03,130 to the statistical model indexed by eta. 1413 01:17:03,130 --> 01:17:08,110 So essentially, when you change the parameterization 1414 01:17:08,110 --> 01:17:10,600 of your model, you still get the Jeffreys prior 1415 01:17:10,600 --> 01:17:12,820 for the new parameterization. 1416 01:17:12,820 --> 01:17:15,020 Which is, in a way, a desirable property. 1417 01:17:19,410 --> 01:17:21,920 Jeffreys priors are just uninformative priors, 1418 01:17:21,920 --> 01:17:24,140 or priors you want to use when you 1419 01:17:24,140 --> 01:17:26,480 want a systematic way of picking a prior, without really thinking about what 1420 01:17:26,480 --> 01:17:27,396 to pick for your model. 1421 01:17:35,440 --> 01:17:37,060 I'll finish this next time. 1422 01:17:37,060 --> 01:17:39,910 And we'll talk about Bayesian confidence regions. 1423 01:17:39,910 --> 01:17:41,620 We'll talk about Bayesian estimation. 1424 01:17:41,620 --> 01:17:44,074 Once I have a posterior, what do I get? 1425 01:17:44,074 --> 01:17:45,490 And basically, the only message is 1426 01:17:45,490 --> 01:17:47,860 going to be that you might want to integrate 1427 01:17:47,860 --> 01:17:48,910 against the posterior. 1428 01:17:48,910 --> 01:17:51,490 Find the posterior, the expectation of your posterior 1429 01:17:51,490 --> 01:17:52,130 distribution. 1430 01:17:52,130 --> 01:17:54,010 That's a good point estimator for theta. 1431 01:17:56,860 --> 01:18:01,020 We'll just do a couple of computations.