WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:15.210
from hundreds of
MIT courses, visit
00:00:15.210 --> 00:00:21.470
MIT OpenCourseWare at ocw.mit.edu.
00:00:21.470 --> 00:00:26.670
PHILIPPE RIGOLLET: So today
we'll actually just do a brief
00:00:26.670 --> 00:00:28.590
chapter on Bayesian statistics.
00:00:28.590 --> 00:00:31.380
And there's entire courses
on Bayesian statistics,
00:00:31.380 --> 00:00:33.480
there's entire books
on Bayesian statistics,
00:00:33.480 --> 00:00:36.130
there's entire careers
in Bayesian statistics.
00:00:36.130 --> 00:00:39.270
So admittedly, I'm
not going to be
00:00:39.270 --> 00:00:40.920
able to do it
justice and tell you
00:00:40.920 --> 00:00:42.850
all the interesting
things that are happening
00:00:42.850 --> 00:00:44.040
in Bayesian statistics.
00:00:44.040 --> 00:00:47.310
But I think it's important
as a statistician
00:00:47.310 --> 00:00:49.320
to know what it
is, how it works,
00:00:49.320 --> 00:00:52.500
because it's actually
a weapon of choice
00:00:52.500 --> 00:00:55.260
for many practitioners.
00:00:55.260 --> 00:00:58.080
And because it allows them to
incorporate their knowledge
00:00:58.080 --> 00:01:00.790
about a problem in a
fairly systematic manner.
00:01:00.790 --> 00:01:04.099
So if you look at like, say the
Bayesian statistics literature,
00:01:04.099 --> 00:01:05.489
it's huge.
00:01:05.489 --> 00:01:09.570
And so here I give
you sort of a range
00:01:09.570 --> 00:01:12.840
of what you can expect to
see in Bayesian statistics
00:01:12.840 --> 00:01:18.300
from your second edition of
a traditional book, something
00:01:18.300 --> 00:01:20.580
that involves computation,
some things that
00:01:20.580 --> 00:01:22.200
involve risk thinking.
00:01:22.200 --> 00:01:24.750
And there's a lot of
Bayesian thinking.
00:01:24.750 --> 00:01:26.640
There's a lot of
things that you know
00:01:26.640 --> 00:01:29.010
talking about sort of like
philosophy of thinking
00:01:29.010 --> 00:01:30.180
Bayesian.
00:01:30.180 --> 00:01:32.380
This book, for example,
seems to be one of them.
00:01:32.380 --> 00:01:34.710
This book is
definitely one of them.
00:01:34.710 --> 00:01:38.880
This one represents sort of
a wide, a broad literature
00:01:38.880 --> 00:01:42.370
on Bayesian statistics, for
applications for example,
00:01:42.370 --> 00:01:43.620
in social sciences.
00:01:43.620 --> 00:01:45.380
But even in large
scale machine learning,
00:01:45.380 --> 00:01:47.340
there's a lot of Bayesian
statistics happening,
00:01:47.340 --> 00:01:50.280
in particular using something
called Bayesian nonparametrics,
00:01:50.280 --> 00:01:53.490
or hierarchical
Bayesian modeling.
00:01:53.490 --> 00:01:59.470
So we do have some experts
at MIT, in CSAIL.
00:01:59.470 --> 00:02:02.070
Tamara Broderick for
example, is a person
00:02:02.070 --> 00:02:04.560
who does quite a bit
of interesting work
00:02:04.560 --> 00:02:06.093
on Bayesian nonparametrics.
00:02:06.093 --> 00:02:08.259
And if that's something you
want to know more about,
00:02:08.259 --> 00:02:10.889
I urge you to go
and talk to her.
00:02:10.889 --> 00:02:14.070
So before we go into
more advanced things,
00:02:14.070 --> 00:02:17.220
we need to start with what
is the Bayesian approach.
00:02:17.220 --> 00:02:19.290
What do Bayesians
do, and how is it
00:02:19.290 --> 00:02:22.720
different from what
we've been doing so far?
00:02:22.720 --> 00:02:26.340
So to understand the
difference between Bayesians
00:02:26.340 --> 00:02:28.800
and what we've been
doing so far,
00:02:28.800 --> 00:02:31.350
we need to first put a name on
what we've been doing so far.
00:02:31.350 --> 00:02:32.940
It's called
frequentist statistics.
00:02:32.940 --> 00:02:36.720
People usually say Bayesian
versus frequentist statistics,
00:02:36.720 --> 00:02:38.760
but by versus I don't mean
that they are naturally
00:02:38.760 --> 00:02:40.380
in opposition to each other.
00:02:40.380 --> 00:02:43.350
Actually, often you will
see the same method that
00:02:43.350 --> 00:02:45.420
comes out of both approaches.
00:02:45.420 --> 00:02:46.860
So let's see how
we did it, right.
00:02:46.860 --> 00:02:48.930
The first thing, we had data.
00:02:48.930 --> 00:02:50.700
We observed some data.
00:02:50.700 --> 00:02:52.980
And we assumed that this
data was generated randomly.
00:02:52.980 --> 00:02:54.840
The reason we did
that is because this
00:02:54.840 --> 00:02:57.840
would allow us to leverage
tools from probability.
00:02:57.840 --> 00:03:01.020
So let's say it's generated by nature,
by measurements; you do a survey,
00:03:01.020 --> 00:03:03.090
you get some data.
00:03:03.090 --> 00:03:06.030
Then we made some assumptions
on the data generating process.
00:03:06.030 --> 00:03:07.939
For example, we
assumed they were iid.
00:03:07.939 --> 00:03:09.480
That was one of the
recurring things.
00:03:09.480 --> 00:03:11.530
Sometimes we assume
it was Gaussian.
00:03:11.530 --> 00:03:13.470
If you wanted to
use, say, a t-test.
00:03:13.470 --> 00:03:15.330
Maybe we did some
nonparametric statistics.
00:03:15.330 --> 00:03:18.240
We assumed it was a
smooth function, or maybe
00:03:18.240 --> 00:03:20.350
a linear regression function.
00:03:20.350 --> 00:03:21.540
So those were our modeling assumptions.
00:03:21.540 --> 00:03:24.850
And this was basically
a way to say, well,
00:03:24.850 --> 00:03:28.440
we're not going to allow for
any distributions for the data
00:03:28.440 --> 00:03:29.160
that we have.
00:03:29.160 --> 00:03:31.640
But maybe a small
set of distributions
00:03:31.640 --> 00:03:34.770
that are indexed by a few
parameters, for example.
00:03:34.770 --> 00:03:38.400
Or at least remove some
of the possibilities.
00:03:38.400 --> 00:03:41.660
Otherwise, there's
nothing we can learn.
00:03:41.660 --> 00:03:45.270
And so for example,
this was associated
00:03:45.270 --> 00:03:48.980
to some parameter of
interest, say theta or beta
00:03:48.980 --> 00:03:51.270
in the regression model.
00:03:51.270 --> 00:03:55.860
Then we had this
unknown thing,
00:03:55.860 --> 00:03:56.610
an unknown parameter.
00:03:56.610 --> 00:03:57.651
And we wanted to find it.
00:03:57.651 --> 00:03:59.610
We wanted to either
estimate it or test it,
00:03:59.610 --> 00:04:02.460
or maybe find a confidence
interval for it.
00:04:02.460 --> 00:04:06.030
So, so far I should not have
said anything that's new.
00:04:06.030 --> 00:04:08.210
But this last
sentence is actually
00:04:08.210 --> 00:04:10.590
what's going to be different
from the Bayesian part.
00:04:10.590 --> 00:04:12.989
In particular, this
unknown but fixed thing
00:04:12.989 --> 00:04:14.280
is what's going to be changing.
00:04:16.965 --> 00:04:18.740
In the Bayesian
approach, we still
00:04:18.740 --> 00:04:22.050
assume that we observe
some random data.
00:04:22.050 --> 00:04:24.180
But the generating process
is slightly different.
00:04:24.180 --> 00:04:25.737
It's sort of a
two-layer process.
00:04:25.737 --> 00:04:27.320
And there's one
process that generates
00:04:27.320 --> 00:04:28.740
the parameter and
then one process
00:04:28.740 --> 00:04:31.470
that, given this parameter
generates the data.
00:04:31.470 --> 00:04:35.990
As for what the first
layer does, nobody really
00:04:35.990 --> 00:04:38.030
believes that there's
some random process that's
00:04:38.030 --> 00:04:41.000
happening that
generates what is going
00:04:41.000 --> 00:04:44.820
to be the true expected
number of people
00:04:44.820 --> 00:04:47.060
who turn their head to
the right when they kiss.
00:04:47.060 --> 00:04:49.435
But this is actually going to
be something that brings us
00:04:49.435 --> 00:04:53.270
ease in incorporating
00:04:53.270 --> 00:04:57.230
what we call prior belief.
00:04:57.230 --> 00:04:58.640
We'll see an
example in a second.
00:04:58.640 --> 00:05:01.430
But often, you actually
have prior belief
00:05:01.430 --> 00:05:02.960
of what this
parameter should be.
00:05:02.960 --> 00:05:05.510
When we did, say, least
squares, we looked
00:05:05.510 --> 00:05:09.350
over all of the vectors
in all of R to the p,
00:05:09.350 --> 00:05:11.840
including the ones that
have coefficients equal
00:05:11.840 --> 00:05:15.080
to 50 million.
00:05:15.080 --> 00:05:18.050
Those are things that we
might be able to rule out.
00:05:18.050 --> 00:05:21.800
We might be able to rule out
that on a much smaller scale.
00:05:21.800 --> 00:05:24.650
For example, well
I'm not an expert
00:05:24.650 --> 00:05:29.180
on turning your head to
the right or to the left.
00:05:29.180 --> 00:05:30.950
But maybe you can
rule out the fact
00:05:30.950 --> 00:05:33.200
that almost everybody
is turning their head
00:05:33.200 --> 00:05:35.990
in the same direction, or almost
everybody is turning their head
00:05:35.990 --> 00:05:38.090
to another direction.
00:05:38.090 --> 00:05:39.980
So we have this prior belief.
00:05:39.980 --> 00:05:43.750
And this belief is going
to play, hopefully, a
00:05:43.750 --> 00:05:47.534
less and less important role as
we collect more and more data.
00:05:47.534 --> 00:05:49.200
But if we have a
smaller amount of data,
00:05:49.200 --> 00:05:52.510
we might want to be able
to use this information,
00:05:52.510 --> 00:05:54.700
rather than just
shooting in the dark.
00:05:54.700 --> 00:05:58.150
And so the idea is to
have this prior belief.
00:05:58.150 --> 00:06:00.430
And then, we want to
update this prior belief
00:06:00.430 --> 00:06:03.550
into what's called the
posterior belief after we've
00:06:03.550 --> 00:06:04.870
seen some data.
00:06:04.870 --> 00:06:08.050
Maybe I believe that
there's something
00:06:08.050 --> 00:06:09.640
that should be in some range.
00:06:09.640 --> 00:06:12.580
But maybe after I see data, it's
confirming my beliefs.
00:06:12.580 --> 00:06:15.330
So maybe I end up with
a belief that's even stronger.
00:06:15.330 --> 00:06:18.460
So belief encompasses
basically what you think
00:06:18.460 --> 00:06:20.000
and how strongly
you think about it.
00:06:20.000 --> 00:06:21.370
That's what I call belief.
00:06:21.370 --> 00:06:24.070
So for example, if I have a
belief about some parameter
00:06:24.070 --> 00:06:26.050
theta, maybe my
belief is telling me
00:06:26.050 --> 00:06:28.970
where theta should
be and how strongly I
00:06:28.970 --> 00:06:32.920
believe in it, in the sense
that I have a very narrow region
00:06:32.920 --> 00:06:35.470
where theta could be.
00:06:35.470 --> 00:06:37.810
The posterior belief is, well,
you see some data.
00:06:37.810 --> 00:06:40.000
And maybe you're more confident
or less confident about what
00:06:40.000 --> 00:06:40.499
you've seen.
00:06:40.499 --> 00:06:42.796
Maybe you've shifted
your belief a little bit.
00:06:42.796 --> 00:06:44.670
And so that's what we're
going to try to see,
00:06:44.670 --> 00:06:48.630
and how to do this in
a principled manner.
00:06:48.630 --> 00:06:50.190
To understand this
better, there's
00:06:50.190 --> 00:06:52.150
nothing better than an example.
00:06:52.150 --> 00:06:56.220
So let's talk about another
stupid statistical question.
00:06:56.220 --> 00:06:58.620
Which is, let's try
to understand p.
00:06:58.620 --> 00:07:01.430
Of course, I'm not going to
talk about politics from now on.
00:07:01.430 --> 00:07:03.930
So let's talk about p,
the proportion of women
00:07:03.930 --> 00:07:04.950
in the population.
00:07:15.330 --> 00:07:21.850
And so what I could do is
to collect some data, X1 to Xn,
00:07:21.850 --> 00:07:23.950
and assume that
they're Bernoulli
00:07:23.950 --> 00:07:25.900
with some parameter, p unknown.
00:07:25.900 --> 00:07:30.810
So p is in 0, 1.
00:07:30.810 --> 00:07:33.270
OK, let's assume that
those guys are iid.
00:07:33.270 --> 00:07:38.190
So this is just an indicator
for each of my collected data,
00:07:38.190 --> 00:07:42.130
whether the person I randomly
sample is a woman, I get a one.
00:07:42.130 --> 00:07:43.350
If it's a man, I get a zero.
00:07:46.200 --> 00:07:49.470
Now the question is, I
sample these people randomly.
00:07:49.470 --> 00:07:51.560
I know their gender.
00:07:51.560 --> 00:07:54.600
And the frequentist
approach was just saying,
00:07:54.600 --> 00:07:58.250
OK, let's just estimate
p hat being Xn bar.
00:07:58.250 --> 00:08:01.110
And then we could do some tests.
00:08:01.110 --> 00:08:02.240
So here, there's a test.
00:08:02.240 --> 00:08:05.330
I want to test maybe if
p is equal to 0.5 or not.
00:08:05.330 --> 00:08:09.710
That sounds like a pretty
reasonable thing to test.
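The frequentist side described here, estimating p by the sample mean and testing p = 0.5, can be sketched in a few lines. This is an illustrative sketch, not code from the course; the sample data and the helper name `z_test_half` are made up:

```python
import math

def z_test_half(xs):
    """Frequentist side: estimate p by the sample mean and compute the
    usual z statistic for H0: p = 0.5 (a sketch of the test mentioned)."""
    n = len(xs)
    p_hat = sum(xs) / n
    # Under H0, the standard deviation of the sample mean is sqrt(0.5 * 0.5 / n).
    z = (p_hat - 0.5) / math.sqrt(0.25 / n)
    return p_hat, z

p_hat, z = z_test_half([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
print(p_hat, round(z, 3))
```

A large |z| would lead us to reject p = 0.5 at the usual Gaussian thresholds.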
00:08:09.710 --> 00:08:13.100
But we want to also
maybe estimate p.
00:08:13.100 --> 00:08:16.160
But here, this is a case where
we definitely have a prior belief
00:08:16.160 --> 00:08:17.720
of what p should be.
00:08:17.720 --> 00:08:22.040
We are pretty confident that
p is not going to be 0.7.
00:08:22.040 --> 00:08:23.570
We actually believe
that we should
00:08:23.570 --> 00:08:29.330
be extremely close to one
half, but maybe not exactly.
00:08:29.330 --> 00:08:32.679
Maybe this population is not
the population in the world.
00:08:32.679 --> 00:08:35.659
But maybe this is the
population of, say some college
00:08:35.659 --> 00:08:38.720
and we want to understand if
this college has half women
00:08:38.720 --> 00:08:40.069
or not.
00:08:40.069 --> 00:08:42.110
Maybe we know it's going
to be close to one half,
00:08:42.110 --> 00:08:43.460
but maybe we're not quite sure.
00:08:46.840 --> 00:08:49.960
We're going to want to
integrate that knowledge.
00:08:49.960 --> 00:08:52.660
So I could integrate it in
a blunt manner by saying,
00:08:52.660 --> 00:08:55.420
discard the data and say
that p is equal to one half.
00:08:55.420 --> 00:08:57.650
But maybe that's just
a little too much.
00:08:57.650 --> 00:09:01.360
So how do I do this trade
off between using the data
00:09:01.360 --> 00:09:06.760
and combining it with
this prior knowledge?
00:09:06.760 --> 00:09:09.610
In many instances, essentially
what's going to happen
00:09:09.610 --> 00:09:14.330
is this one half is going to
act like one new observation.
00:09:14.330 --> 00:09:17.062
So if you have
five observations,
00:09:17.062 --> 00:09:18.520
this is just the
sixth observation,
00:09:18.520 --> 00:09:20.240
which will play a role.
00:09:20.240 --> 00:09:21.790
If you have a
million observations,
00:09:21.790 --> 00:09:22.860
you're going to have
a million and one.
00:09:22.860 --> 00:09:24.568
It's not going to play
so much of a role.
00:09:24.568 --> 00:09:25.900
That's basically how it goes.
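The "prior acts like one extra observation" intuition can be checked numerically. The sketch below assumes the standard Beta-Bernoulli conjugate update that the lecture arrives at later, under which a Beta(a, b) prior contributes a extra ones and b extra zeros; the function name and sample sizes are illustrative:

```python
import random

def posterior_mean(successes, n, a=1.0, b=1.0):
    """Posterior mean of p under a Beta(a, b) prior and n Bernoulli
    observations: the prior acts like a pseudo-ones and b pseudo-zeros."""
    return (successes + a) / (n + a + b)

random.seed(0)
p_true = 0.55

# With few observations the prior pulls the estimate toward 1/2;
# with a million observations its pull is negligible.
for n in [5, 100, 1_000_000]:
    successes = sum(random.random() < p_true for _ in range(n))
    print(n, round(posterior_mean(successes, n, a=1, b=1), 4))
```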
00:09:28.760 --> 00:09:33.470
But, definitely not
always because we'll
00:09:33.470 --> 00:09:36.700
see that if I take my prior to
be a point mass at one half here,
00:09:36.700 --> 00:09:39.290
it's basically as if I
was discarding my data.
00:09:39.290 --> 00:09:41.740
So essentially, there's
also your ability
00:09:41.740 --> 00:09:45.520
to encompass how strongly
you believe in this prior.
00:09:45.520 --> 00:09:47.809
And if you believe
infinitely more in the prior
00:09:47.809 --> 00:09:49.600
than you believe in
the data you collected,
00:09:49.600 --> 00:09:54.600
then it's not going to act
like one more observation.
00:09:54.600 --> 00:09:56.820
The Bayesian approach
is a tool to, one,
00:09:56.820 --> 00:09:59.010
mathematically incorporate
our prior belief
00:09:59.010 --> 00:10:02.580
into statistical
procedures.
00:10:02.580 --> 00:10:04.030
Maybe I have this
prior knowledge.
00:10:04.030 --> 00:10:06.090
But if I'm a medical
doctor, it's not clear to me
00:10:06.090 --> 00:10:09.870
how I'm going to turn this into
some principled way of building
00:10:09.870 --> 00:10:10.410
estimators.
00:10:10.410 --> 00:10:12.330
And the second
goal is going to be
00:10:12.330 --> 00:10:16.260
to update this prior belief
into a posterior belief
00:10:16.260 --> 00:10:17.270
by using the data.
00:10:22.200 --> 00:10:23.917
How do I do this?
00:10:23.917 --> 00:10:25.500
And at some point,
I sort of suggested
00:10:25.500 --> 00:10:28.610
that there's two layers.
00:10:28.610 --> 00:10:31.660
One is where you draw
the parameter at random.
00:10:31.660 --> 00:10:35.290
And two, once you
have the parameter,
00:10:35.290 --> 00:10:39.320
conditionally on this parameter,
you draw your data.
00:10:39.320 --> 00:10:42.080
Nobody believes this is actually
happening, that nature is just
00:10:42.080 --> 00:10:45.510
rolling dice for us and
choosing parameters at random.
00:10:45.510 --> 00:10:48.260
But what's happening
is that, this idea
00:10:48.260 --> 00:10:51.410
that the parameter comes
from some random distribution
00:10:51.410 --> 00:10:54.860
actually captures, very
well, how
00:10:54.860 --> 00:10:56.960
you would encompass your prior.
00:10:56.960 --> 00:10:59.090
How would you say, my
belief is as follows?
00:10:59.090 --> 00:11:01.870
Well here's an example about p.
00:11:01.870 --> 00:11:07.856
I'm 90% sure that p is
between 0.4 and 0.6.
00:11:07.856 --> 00:11:14.230
And I'm 95% sure that p
is between 0.3 and 0.8.
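One way to turn interval statements like these into a prior can be sketched with the standard library, under the assumption (made here for illustration; the lecture introduces the beta family later) that we restrict attention to symmetric Beta(a, a) candidates. The specific values of a tried below are arbitrary:

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b): x^(a-1) (1-x)^(b-1) / B(a, b),
    with B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm

def prob_interval(lo, hi, a, b, steps=10_000):
    """P(lo < p < hi) under Beta(a, b), by the midpoint rule."""
    h = (hi - lo) / steps
    return sum(beta_pdf(lo + (i + 0.5) * h, a, b) for i in range(steps)) * h

# Search for a symmetric Beta(a, a) that puts about 90% of its mass
# on (0.4, 0.6) -- one way to encode the stated belief.
for a in [5, 15, 35, 70]:
    print(a, round(prob_interval(0.4, 0.6, a, a), 3))
```

Larger a concentrates the prior more tightly around one half, which is exactly the "how strongly you believe it" dial discussed above.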
00:11:14.230 --> 00:11:18.490
So essentially, I have
these possible values of p.
00:11:18.490 --> 00:11:35.430
And what I know is that, there's
90% here between 0.4 and 0.6.
00:11:35.430 --> 00:11:39.340
And then I have 0.3 and 0.8.
00:11:39.340 --> 00:11:44.200
And I know that I'm 95%
sure that I'm in here.
00:11:44.200 --> 00:11:47.050
If you remember, this sort of
looks like the kind of pictures
00:11:47.050 --> 00:11:50.110
that I made when I had
some Gaussian, for example.
00:11:50.110 --> 00:11:54.220
And I said, oh here we have
90% of the observations.
00:11:54.220 --> 00:11:57.105
And here, we have 95%
of the observations.
00:12:00.500 --> 00:12:04.570
So in a way, if I
were able to tell you
00:12:04.570 --> 00:12:07.610
all those ranges for
all possible values,
00:12:07.610 --> 00:12:10.510
then I would essentially
describe a probability
00:12:10.510 --> 00:12:13.400
distribution for p.
00:12:13.400 --> 00:12:15.410
And what I'm saying
is that, p is going
00:12:15.410 --> 00:12:16.582
to have this kind of shape.
00:12:16.582 --> 00:12:19.040
So of course, if I tell you
only these two pieces of information,
00:12:19.040 --> 00:12:22.280
that there's 90% I'm here,
and I'm between here and here.
00:12:22.280 --> 00:12:24.980
And 95%, I'm between here
and here, then there's
00:12:24.980 --> 00:12:26.845
many ways I can
accomplish that, right.
00:12:26.845 --> 00:12:28.970
I could have something that
looks like this, maybe.
00:12:33.190 --> 00:12:35.830
It could be like this.
00:12:35.830 --> 00:12:37.605
There's many ways
I can have this.
00:12:37.605 --> 00:12:38.980
Some of them are
definitely going
00:12:38.980 --> 00:12:42.280
to be mathematically more
convenient than others.
00:12:42.280 --> 00:12:44.320
And hopefully, we're
going to have things
00:12:44.320 --> 00:12:47.230
that I can
parameterize very well.
00:12:47.230 --> 00:12:49.900
Because if I tell
you this is this guy,
00:12:49.900 --> 00:12:54.340
then there's basically one,
two three, four, five, six,
00:12:54.340 --> 00:12:56.554
seven parameters.
00:12:56.554 --> 00:12:57.970
So I probably don't
want something
00:12:57.970 --> 00:12:59.053
that has seven parameters.
00:12:59.053 --> 00:13:01.582
But maybe I can say, oh,
it's a Gaussian, and all
00:13:01.582 --> 00:13:03.540
I have to do is to tell
you where it's centered
00:13:03.540 --> 00:13:04.998
and what the standard
deviation is.
00:13:07.250 --> 00:13:11.030
So the idea of using
this two layer thing,
00:13:11.030 --> 00:13:12.800
where we think of
the parameter p
00:13:12.800 --> 00:13:14.450
as being drawn from
some distribution,
00:13:14.450 --> 00:13:17.660
is really just a way for us
to capture this information.
00:13:17.660 --> 00:13:20.420
Our prior belief
being, well there's
00:13:20.420 --> 00:13:22.794
this percentage of
chances that it's there.
00:13:22.794 --> 00:13:24.710
But the percentage of
this chance, I'm
00:13:24.710 --> 00:13:28.730
deliberately not using
probability here.
00:13:28.730 --> 00:13:30.980
So it's really a way
to get close to this.
00:13:33.620 --> 00:13:36.170
That's why I say, the true
parameter is not random.
00:13:36.170 --> 00:13:40.420
But the Bayesian approach
does as if it was random.
00:13:40.420 --> 00:13:42.430
And then, just spits
out a procedure
00:13:42.430 --> 00:13:49.110
out of this thought process,
this thought experiment.
00:13:49.110 --> 00:13:54.060
So when you practice
Bayesian statistics a lot,
00:13:54.060 --> 00:13:57.840
you start getting automatisms.
00:13:57.840 --> 00:14:00.905
You start getting some things
that you do without really
00:14:00.905 --> 00:14:02.280
thinking about
it. Just like when
00:14:02.280 --> 00:14:04.860
you're a statistician,
the first thing you do is,
00:14:04.860 --> 00:14:07.419
can I think of this data as
being Gaussian for example?
00:14:07.419 --> 00:14:09.210
When you're Bayesian
you're thinking about,
00:14:09.210 --> 00:14:11.400
OK I have a set of parameters.
00:14:11.400 --> 00:14:14.250
So here, I can
describe my parameter
00:14:14.250 --> 00:14:20.190
as being theta in
general, in some big space
00:14:20.190 --> 00:14:21.540
of parameters, capital theta.
00:14:21.540 --> 00:14:24.730
But what spaces
did we encounter?
00:14:24.730 --> 00:14:27.090
Well, we encountered
the real line.
00:14:27.090 --> 00:14:31.320
We encountered the interval
0, 1 for Bernoullis. And we
00:14:31.320 --> 00:14:36.320
encountered some of
the positive real line
00:14:36.320 --> 00:14:39.320
for exponential
distributions, etc.
00:14:39.320 --> 00:14:42.020
And so what I'm
going to need to do,
00:14:42.020 --> 00:14:44.570
if I want to put some
prior on those spaces,
00:14:44.570 --> 00:14:47.694
I'm going to have to
have a usual set of tools
00:14:47.694 --> 00:14:49.610
for this guy, usual set
of tools for this guy,
00:14:49.610 --> 00:14:51.176
usual set of
tools for this guy.
00:14:51.176 --> 00:14:52.550
And by usual set
of tools, I mean
00:14:52.550 --> 00:14:54.966
I'm going to have to have a
family of distributions that's
00:14:54.966 --> 00:14:56.690
supported on this.
00:14:56.690 --> 00:14:59.010
So in particular,
this is the space
00:14:59.010 --> 00:15:01.610
in which my parameter
that I usually denote
00:15:01.610 --> 00:15:03.900
by p for Bernoulli lives.
00:15:03.900 --> 00:15:07.840
And so what I need is to find a
distribution on the interval 0,
00:15:07.840 --> 00:15:13.540
1 just like this guy.
00:15:13.540 --> 00:15:15.310
The problem with the
Gaussian is that it's
00:15:15.310 --> 00:15:17.890
not on the interval 0, 1.
00:15:17.890 --> 00:15:20.230
It's going to spill
out at the ends.
00:15:20.230 --> 00:15:22.850
And it's not going to be
something that works for me.
00:15:22.850 --> 00:15:25.802
And so the question is, I need
to think about distributions
00:15:25.802 --> 00:15:27.010
that are continuous.
00:15:27.010 --> 00:15:30.070
Why would I restrict myself
to discrete distributions? I want ones that
00:15:30.070 --> 00:15:34.060
are actually convenient. And for
Bernoulli, the one that's
00:15:34.060 --> 00:15:36.730
basically the main tool
that everybody is using
00:15:36.730 --> 00:15:39.670
is the so-called
beta distribution.
00:15:39.670 --> 00:15:42.190
So the beta distribution
has two parameters.
00:15:50.680 --> 00:15:56.910
So x follows a beta
with parameters
00:15:56.910 --> 00:16:05.070
a and b if it has
a density, f of x
00:16:05.070 --> 00:16:09.050
is equal to x to the a minus 1,
00:16:09.050 --> 00:16:15.800
times 1 minus x to the b minus 1,
if x is in the interval 0,
00:16:15.800 --> 00:16:22.730
1 and 0 for all other x's.
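As written, this kernel is not yet normalized; the lecture comes back to that point. A quick stdlib-only sketch (mine, not from the course) checking that the kernel's integral over (0, 1) is the beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b), the constant that turns it into a true density:

```python
import math

def unnormalized(x, a, b):
    """The density kernel from the definition: x^(a-1) (1-x)^(b-1)."""
    return x ** (a - 1) * (1 - x) ** (b - 1)

def integral(a, b, steps=20_000):
    """Midpoint-rule integral of the kernel over (0, 1)."""
    h = 1.0 / steps
    return sum(unnormalized((i + 0.5) * h, a, b) for i in range(steps)) * h

# The numeric integral should match B(a, b) computed via gamma functions.
for a, b in [(2, 2), (5, 1), (3, 7)]:
    exact = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    print((a, b), round(integral(a, b), 6), round(exact, 6))
```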
00:16:22.730 --> 00:16:23.230
OK?
00:16:27.590 --> 00:16:30.470
Why is that a good thing?
00:16:30.470 --> 00:16:33.870
Well, it's a density that's
on the interval 0, 1 for sure.
00:16:33.870 --> 00:16:37.130
But now I have these two
parameters and a set of shapes
00:16:37.130 --> 00:16:41.525
that I can get by tweaking those
two parameters is incredible.
00:16:44.260 --> 00:16:46.190
It's going to be a
unimodal distribution.
00:16:46.190 --> 00:16:47.260
It's still fairly nice.
00:16:47.260 --> 00:16:49.760
It's not going to be something
that goes like this and this.
00:16:49.760 --> 00:16:52.790
Because if you think
about this, what
00:16:52.790 --> 00:16:55.550
would it mean if your prior
distribution of the interval 0,
00:16:55.550 --> 00:16:57.120
1 had this shape?
00:16:59.630 --> 00:17:01.934
It would mean that, maybe
you think that p is here
00:17:01.934 --> 00:17:03.350
or maybe you think
that p is here,
00:17:03.350 --> 00:17:05.127
or maybe you think
that p is here.
00:17:05.127 --> 00:17:06.710
Which essentially
means that you think
00:17:06.710 --> 00:17:10.661
that p can come from
three different phenomena.
00:17:10.661 --> 00:17:12.619
And there's other models
that are called mixtures
00:17:12.619 --> 00:17:15.079
for that, that directly
account for the fact
00:17:15.079 --> 00:17:19.550
that maybe there are several
phenomena that are aggregated
00:17:19.550 --> 00:17:21.050
in your data set.
00:17:21.050 --> 00:17:23.390
But if you think that your
data set is sort of pure,
00:17:23.390 --> 00:17:25.650
and that everything comes
from the same phenomenon,
00:17:25.650 --> 00:17:28.850
you want something
that looks like this,
00:17:28.850 --> 00:17:32.850
or maybe looks like this, or
maybe is sort of symmetric.
00:17:32.850 --> 00:17:34.410
You want to get all this stuff.
00:17:34.410 --> 00:17:36.900
Maybe you want something
that says, well
00:17:36.900 --> 00:17:42.330
if I'm talking about p being the
probability of the proportion
00:17:42.330 --> 00:17:45.840
of women in the whole world, you
want something that's probably
00:17:45.840 --> 00:17:48.600
really spiked around one half.
00:17:48.600 --> 00:17:50.850
Almost a point
mass, because you know
00:17:50.850 --> 00:17:54.990
let's agree that 0.5
is the actual number.
00:17:54.990 --> 00:17:58.950
So you want something that
says, OK maybe I'm wrong.
00:17:58.950 --> 00:18:01.300
But I'm sure I'm not going
to be really that way off.
00:18:01.300 --> 00:18:03.300
So you want something
that's really pointy.
00:18:03.300 --> 00:18:06.660
But if it's something
you've never checked,
00:18:06.660 --> 00:18:09.780
and again I cannot make
references at this point,
00:18:09.780 --> 00:18:13.197
but something where you might
have some uncertainty that
00:18:13.197 --> 00:18:14.280
should be around one half.
00:18:14.280 --> 00:18:17.070
Maybe you want something
that a little more allows
00:18:17.070 --> 00:18:19.410
you to say, well, I think
there's more around one half.
00:18:19.410 --> 00:18:22.950
But there's still some
fluctuations that are possible.
00:18:22.950 --> 00:18:25.110
And in particular
here, I talk about p,
00:18:25.110 --> 00:18:29.310
where the two parameters a
and b are actually the same.
00:18:29.310 --> 00:18:30.500
I call them a.
00:18:30.500 --> 00:18:31.710
One is called scale.
00:18:31.710 --> 00:18:33.430
The other one is called shape.
00:18:33.430 --> 00:18:35.500
Oh sorry, this is not a density.
00:18:35.500 --> 00:18:38.646
So it actually has
to be normalized.
00:18:38.646 --> 00:18:40.020
When you integrate
this guy, it's
00:18:40.020 --> 00:18:41.490
going to be some function
that depends on a
00:18:41.490 --> 00:18:43.410
and b, actually depends
on this function
00:18:43.410 --> 00:18:45.427
through the beta function.
00:18:45.427 --> 00:18:47.260
Which is this combination
of gamma functions,
00:18:47.260 --> 00:18:51.515
so that's why it's
called the beta distribution.
00:18:51.515 --> 00:18:53.640
That's the definition of
the beta function when you
00:18:53.640 --> 00:18:55.721
integrate this thing anyway.
00:18:55.721 --> 00:18:56.970
You just have to normalize it.
00:18:56.970 --> 00:18:59.730
That's just a number that
depends on a and b.
00:18:59.730 --> 00:19:01.542
So here, if you
take a equal to b,
00:19:01.542 --> 00:19:03.000
you have something
that essentially
00:19:03.000 --> 00:19:05.340
is symmetric around one half.
00:19:05.340 --> 00:19:07.120
Because what does it look like?
00:19:07.120 --> 00:19:10.980
Well, so my density f of
x, is going to be what?
00:19:10.980 --> 00:19:19.200
It's going to be my constant
times x times 1 minus x, all raised
00:19:19.200 --> 00:19:21.670
to the power a minus 1.
00:19:21.670 --> 00:19:26.080
And this function, x times
1 minus x looks like this.
00:19:26.080 --> 00:19:27.730
We've drawn it before.
00:19:27.730 --> 00:19:29.380
That was something
that showed up
00:19:29.380 --> 00:19:36.490
as being the variance
of my Bernoulli.
00:19:36.490 --> 00:19:42.240
So we know it's something that
takes its maximum at one half.
00:19:42.240 --> 00:19:44.190
And now I'm just taking
a power of this guy.
00:19:44.190 --> 00:19:46.020
So I'm really just
distorting this thing
00:19:46.020 --> 00:19:51.340
into some fairly
symmetric manner.
00:19:56.400 --> 00:20:00.630
So this is the distribution that
we actually take for p.
00:20:00.630 --> 00:20:03.030
I assume that p, the
parameter, has this distribution. Notice
00:20:03.030 --> 00:20:04.470
that this is kind of weird.
00:20:04.470 --> 00:20:06.344
First of all, this is
probably the first time
00:20:06.344 --> 00:20:09.570
in this entire
course that something
00:20:09.570 --> 00:20:12.085
has a distribution when it's
actually a lower case letter.
00:20:12.085 --> 00:20:13.710
That's something you
have to deal with,
00:20:13.710 --> 00:20:16.827
because we've been using lower
case letters for parameters.
00:20:16.827 --> 00:20:18.660
And now we want them
to have a distribution.
00:20:18.660 --> 00:20:20.550
So that's what's
going to happen.
00:20:20.550 --> 00:20:23.850
This is called the
prior distribution.
00:20:23.850 --> 00:20:27.750
So really, I should write
something like f of p
00:20:27.750 --> 00:20:35.290
is equal to a constant times
p times 1 minus p, to the a minus 1.
00:20:35.290 --> 00:20:39.985
Well no, actually I should not
because then it's confusing.
00:20:39.985 --> 00:20:41.610
One thing in terms
of notation that I'm
00:20:41.610 --> 00:20:43.639
going to write, when
I have a constant here
00:20:43.639 --> 00:20:45.180
and I don't want to
make it explicit.
00:20:45.180 --> 00:20:48.480
And we'll see in a second why I
don't need to make it explicit.
00:20:48.480 --> 00:20:53.250
I'm going to write
this as f of x
00:20:53.250 --> 00:21:04.060
is proportional to x times 1
minus x, to the a minus 1.
00:21:04.060 --> 00:21:08.740
That's just to say, equal to
some constant that does not
00:21:08.740 --> 00:21:11.260
depend on x times this thing.
00:21:16.320 --> 00:21:21.930
So if we continue
with our experiment
00:21:21.930 --> 00:21:25.410
where I'm drawing
this data, X1 to Xn,
00:21:25.410 --> 00:21:29.050
which is Bernoulli p, if
p has some distribution
00:21:29.050 --> 00:21:31.050
it's not clear what it
means to have a Bernoulli
00:21:31.050 --> 00:21:32.427
with some random parameter.
00:21:32.427 --> 00:21:35.010
So what I'm going to do is, then
I'm going to first draw my p.
00:21:35.010 --> 00:21:38.310
Let's say I get a number, 0.52.
00:21:38.310 --> 00:21:41.100
And then, I'm going to draw
my data conditionally on p.
00:21:41.100 --> 00:21:45.150
So here comes the first and
last flowchart of this class.
00:21:49.500 --> 00:21:51.190
So nature first draws p.
00:21:53.930 --> 00:21:58.360
p follows some beta with parameters a, a.
00:21:58.360 --> 00:21:59.670
Then I condition on p.
00:22:02.460 --> 00:22:10.760
And then I draw X1 to Xn
that are iid Bernoulli p.
00:22:10.760 --> 00:22:14.250
Everybody understand the
process of generating this data?
00:22:14.250 --> 00:22:16.250
So you first draw a
parameter, and then you just
00:22:16.250 --> 00:22:21.040
flip those independent biased
coins with this particular p.
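The flowchart can be simulated directly with the Python standard library: `random.betavariate` draws the first layer, and uniform draws give the conditional Bernoulli flips. A sketch; the seed and the values n = 20, a = 35 are arbitrary choices:

```python
import random

random.seed(1)

def draw_sample(n, a):
    """Two-layer draw: nature first picks p ~ Beta(a, a), then,
    conditionally on p, flips n iid Bernoulli(p) coins."""
    p = random.betavariate(a, a)                              # layer 1: the parameter
    xs = [1 if random.random() < p else 0 for _ in range(n)]  # layer 2: the data
    return p, xs

p, xs = draw_sample(n=20, a=35)
print(round(p, 3), sum(xs), "ones out of", len(xs))
```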
00:22:21.040 --> 00:22:23.230
There's this layered thing.
00:22:26.570 --> 00:22:31.010
Now conditionally on p, right so
here I have this prior about p
00:22:31.010 --> 00:22:32.070
which was the thing.
00:22:32.070 --> 00:22:34.090
So this is just the
thought process again,
00:22:34.090 --> 00:22:36.480
it's not anything that
actually happens in practice.
00:22:36.480 --> 00:22:39.920
This is my way of thinking about
how the data was generated.
00:22:39.920 --> 00:22:43.310
And from this, I'm going to try
to come up with some procedure.
00:22:43.310 --> 00:22:47.880
Just like, if your estimator
is the average of the data,
00:22:47.880 --> 00:22:49.700
you don't have to
understand probability
00:22:49.700 --> 00:22:52.670
to say that my estimator
is the average of the data.
00:22:52.670 --> 00:22:54.200
Anyone outside this
room understands
00:22:54.200 --> 00:22:55.970
that the average
is a good estimator
00:22:55.970 --> 00:22:58.550
for some average behavior.
00:22:58.550 --> 00:23:01.070
And they don't need
to think of the data
00:23:01.070 --> 00:23:02.960
as being a random
variable, et cetera.
00:23:02.960 --> 00:23:04.570
So same thing, basically.
00:23:10.760 --> 00:23:13.790
In this case, you can see that
the posterior distribution
00:23:13.790 --> 00:23:14.720
is still a beta.
00:23:18.320 --> 00:23:20.090
What it means is that,
I had this thing.
00:23:20.090 --> 00:23:21.650
Then, I observed my data.
00:23:21.650 --> 00:23:23.570
And then, I continue
and here I'm
00:23:23.570 --> 00:23:32.800
going to update my prior
into some posterior
00:23:32.800 --> 00:23:36.680
distribution, pi.
00:23:36.680 --> 00:23:39.210
And here, this guy is
actually also a beta.
00:23:43.370 --> 00:23:45.950
My posterior
distribution, p, is also
00:23:45.950 --> 00:23:48.002
a beta distribution
with the parameters
00:23:48.002 --> 00:23:48.960
that are on this slide.
00:23:48.960 --> 00:23:51.670
And I'll have the space
to reproduce them.
00:23:51.670 --> 00:23:54.180
So I start the beginning
of this flowchart
00:23:54.180 --> 00:23:57.130
as having p, which is a prior.
00:23:57.130 --> 00:23:58.810
I'm going to get
some observations
00:23:58.810 --> 00:24:01.120
and then, I'm going to
update what my posterior is.
00:24:04.530 --> 00:24:06.900
This posterior is
basically something
00:24:06.900 --> 00:24:09.690
that, in Bayesian
statistics, is
00:24:09.690 --> 00:24:13.720
beautiful: as soon as
you have this distribution,
00:24:13.720 --> 00:24:17.030
it's essentially capturing all
the information about the data
00:24:17.030 --> 00:24:19.010
that you want for p.
00:24:19.010 --> 00:24:20.429
And it's not just the point.
00:24:20.429 --> 00:24:21.470
It's not just an average.
00:24:21.470 --> 00:24:23.660
It's actually an
entire distribution
00:24:23.660 --> 00:24:27.050
for the possible
values of theta.
00:24:27.050 --> 00:24:30.740
And it's not the same
thing as saying, well
00:24:30.740 --> 00:24:35.030
if theta hat is equal to Xn
bar, in the Gaussian case I know
00:24:35.030 --> 00:24:37.130
that this is some mean, mu.
00:24:37.130 --> 00:24:39.680
And then maybe it has
varying sigma squared over n.
00:24:39.680 --> 00:24:43.550
That's not what I mean by, this
is my posterior distribution.
00:24:43.550 --> 00:24:46.640
This is not what I mean.
00:24:46.640 --> 00:24:49.790
This is going to come from
this guy, the Gaussian thing
00:24:49.790 --> 00:24:51.350
and the central limit theorem.
00:24:51.350 --> 00:24:52.970
But what I mean is this guy.
00:24:52.970 --> 00:24:58.130
And this came exclusively
from the prior distribution.
00:24:58.130 --> 00:25:00.830
If I had another prior,
I would not necessarily
00:25:00.830 --> 00:25:03.840
have a beta distribution
on the output.
00:25:03.840 --> 00:25:07.580
So when I have the same
family of distributions
00:25:07.580 --> 00:25:11.090
at the beginning and at
the end of this flowchart,
00:25:11.090 --> 00:25:16.520
I say that beta is
a conjugate prior.
00:25:21.200 --> 00:25:27.390
Meaning I put in beta as a prior
and I get beta as a posterior.
00:25:27.390 --> 00:25:30.850
And that's why betas
are so popular.
00:25:30.850 --> 00:25:32.280
Conjugate priors
are really nice,
00:25:32.280 --> 00:25:35.730
because you know that whatever
you put in, what you're going
00:25:35.730 --> 00:25:37.170
to get in the end is a beta.
00:25:37.170 --> 00:25:38.790
So all you have to think
about is the parameters.
00:25:38.790 --> 00:25:41.040
You don't have to check
again what the posterior is
00:25:41.040 --> 00:25:43.290
going to look like, what the
PDF of this guy is going to be.
00:25:43.290 --> 00:25:44.664
You don't have to
think about it.
00:25:44.664 --> 00:25:46.650
You just have to check
what the parameters are.
00:25:46.650 --> 00:25:48.358
And there's families
of conjugate priors.
00:25:48.358 --> 00:25:51.150
Gaussian gives
Gaussian, for example.
00:25:51.150 --> 00:25:52.170
There's a bunch of them.
00:25:52.170 --> 00:25:57.210
And this is what drives people
into using specific priors as
00:25:57.210 --> 00:25:58.200
opposed to others.
00:25:58.200 --> 00:26:00.660
It has nice
mathematical properties.
00:26:00.660 --> 00:26:05.910
Nobody believes that p is really
distributed according to beta.
00:26:05.910 --> 00:26:08.640
But it's flexible enough
and super convenient
00:26:08.640 --> 00:26:09.700
mathematically.
00:26:12.450 --> 00:26:14.640
Now let's see for one
second, before we actually
00:26:14.640 --> 00:26:17.080
go any further.
00:26:17.080 --> 00:26:19.790
I didn't mention A and
B are both in here,
00:26:19.790 --> 00:26:21.540
A and B are both
positive numbers.
00:26:24.320 --> 00:26:27.610
They can be anything positive.
00:26:27.610 --> 00:26:29.460
So here what I did
is that, I updated A
00:26:29.460 --> 00:26:34.650
into a plus the sum
of my data, and b
00:26:34.650 --> 00:26:38.500
into b plus n minus
the sum of my data.
00:26:38.500 --> 00:26:41.910
So that's essentially, a becomes
a plus the number of ones.
00:26:45.040 --> 00:26:47.350
Well, that's only
when I have a and a.
00:26:47.350 --> 00:26:50.116
So the first parameter becomes
itself plus the number of ones.
00:26:50.116 --> 00:26:51.490
And the second
one becomes itself
00:26:51.490 --> 00:26:52.531
plus the number of zeros.
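In code, the conjugate update is one line; this is a sketch (the function name and the example data are ours, not the lecture's):

```python
# Conjugacy for Beta prior + Bernoulli data: the first beta parameter
# grows by the number of ones, the second by the number of zeros.
def beta_posterior(a, b, xs):
    ones = sum(xs)
    zeros = len(xs) - ones
    return a + ones, b + zeros

# Starting from a Beta(2, 2) prior and seeing the made-up data below:
print(beta_posterior(2, 2, [1, 1, 0, 1, 0]))  # -> (5, 4)
```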
00:26:55.440 --> 00:26:59.160
And so just as a sanity
check, what does this mean?
00:26:59.160 --> 00:27:08.910
If a goes to zero, what
is the beta when a goes to 0?
00:27:08.910 --> 00:27:10.410
We can actually
read this from here.
00:27:16.920 --> 00:27:19.170
Actually, let's take a goes to--
00:27:25.370 --> 00:27:26.110
no.
00:27:26.110 --> 00:27:27.310
Sorry, let's just do this.
00:27:38.670 --> 00:27:40.840
I'll do it when we talk
about non-informative prior,
00:27:40.840 --> 00:27:42.840
because it's a little too messy.
00:27:47.220 --> 00:27:47.970
How do we do this?
00:27:47.970 --> 00:27:51.390
How did I get this posterior
distribution, given the prior?
00:27:51.390 --> 00:27:56.070
How do I update this? Well, this
is called Bayesian statistics.
00:27:56.070 --> 00:27:58.800
And you've heard this
word, Bayes before.
00:27:58.800 --> 00:28:02.010
And the way you've heard
it is in the Bayes formula.
00:28:02.010 --> 00:28:03.680
What was the Bayes formula?
00:28:03.680 --> 00:28:05.190
The Bayes formula
was telling you
00:28:05.190 --> 00:28:11.390
that the probability of A, given
B was equal to something that
00:28:11.390 --> 00:28:14.430
depended on the probability of
B, given A. That's what it was.
00:28:16.787 --> 00:28:18.620
You can actually either
remember the formula
00:28:18.620 --> 00:28:20.250
or you can remember
the definition.
00:28:20.250 --> 00:28:26.000
And this is what p of A
and B divided by p of B.
00:28:26.000 --> 00:28:35.480
So this is p of B, given A
times p of A divided by p of B.
00:28:35.480 --> 00:28:37.590
That's what Bayes
formula is telling you.
00:28:37.590 --> 00:28:40.050
Agree?
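As a quick numerical sanity check of the formula (on a tiny made-up joint PMF, not from the lecture):

```python
# Bayes' formula P(A|B) = P(B|A) * P(A) / P(B), checked against the
# definition P(A|B) = P(A and B) / P(B) on two binary events A and B.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

p_A = sum(v for (a, b), v in joint.items() if a == 1)  # marginal P(A) = 0.5
p_B = sum(v for (a, b), v in joint.items() if b == 1)  # marginal P(B) = 0.6
p_A_given_B = joint[(1, 1)] / p_B                      # by definition
p_B_given_A = joint[(1, 1)] / p_A
bayes = p_B_given_A * p_A / p_B                        # by Bayes' formula
print(p_A_given_B, bayes)
```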
00:28:40.050 --> 00:28:46.200
So now what I want is to have
something that's telling me
00:28:46.200 --> 00:28:49.730
how this is going to work.
00:28:49.730 --> 00:28:54.410
What is going to play the
role of those events, A and B?
00:28:54.410 --> 00:28:59.280
Well one is going
to be, this is going
00:28:59.280 --> 00:29:01.980
to be the distribution
of my parameter theta,
00:29:01.980 --> 00:29:03.894
given that I see the data.
00:29:03.894 --> 00:29:05.310
And this is going
to tell me, what
00:29:05.310 --> 00:29:07.601
is the distribution of the
data, given that I know what
00:29:07.601 --> 00:29:09.270
my parameter theta is.
00:29:09.270 --> 00:29:11.456
But that part, if
this is the data and this
00:29:11.456 --> 00:29:13.080
is the parameter
theta, this is what
00:29:13.080 --> 00:29:15.720
we've been doing all along.
00:29:15.720 --> 00:29:18.720
The distribution of the data,
given the parameter here
00:29:18.720 --> 00:29:22.350
was n iid Bernoulli p.
00:29:22.350 --> 00:29:27.960
I knew exactly what their joint
probability mass function is.
00:29:27.960 --> 00:29:29.290
Then, that was what?
00:29:29.290 --> 00:29:32.700
So we said that this
is going to be my data
00:29:32.700 --> 00:29:34.730
and this is going
to be my parameter.
00:29:37.270 --> 00:29:40.210
So that means that, this is
the probability of my data,
00:29:40.210 --> 00:29:43.000
given the parameter.
00:29:43.000 --> 00:29:45.729
This is the probability
of the parameter.
00:29:45.729 --> 00:29:46.270
What is this?
00:29:46.270 --> 00:29:49.095
What did we call this?
00:29:49.095 --> 00:29:50.280
This is the prior.
00:29:50.280 --> 00:29:53.690
It's just the distribution
of my parameter.
00:29:53.690 --> 00:29:56.030
Now what is this?
00:29:56.030 --> 00:29:57.490
Well, this is just
the distribution
00:29:57.490 --> 00:30:00.340
of the data, itself.
00:30:00.340 --> 00:30:06.800
This is essentially the
distribution of this,
00:30:06.800 --> 00:30:15.080
if this was indeed
not conditioned on p.
00:30:15.080 --> 00:30:18.710
So if I don't condition
on p, this data
00:30:18.710 --> 00:30:23.982
is going to be a bunch of iid,
Bernoulli with some parameter.
00:30:23.982 --> 00:30:25.440
But the parameter
is random, right.
00:30:25.440 --> 00:30:27.837
So for different realization
of this data set,
00:30:27.837 --> 00:30:30.170
I'm going to get different
parameters for the Bernoulli.
00:30:30.170 --> 00:30:34.379
And so that leads to
some sort of convolution.
00:30:34.379 --> 00:30:36.170
It's not really a
convolution in this case,
00:30:36.170 --> 00:30:38.660
but it's like some sort of
composition of distributions.
00:30:38.660 --> 00:30:41.600
I have the randomness that
comes from here and then,
00:30:41.600 --> 00:30:44.757
the randomness that comes
from realizing the Bernoulli.
00:30:44.757 --> 00:30:46.340
That's just the
marginal distribution.
00:30:46.340 --> 00:30:49.820
It actually might be painful to
understand what this is, right.
00:30:49.820 --> 00:30:52.970
In a way, it's sort of a
mixture and it's not super nice.
00:30:52.970 --> 00:30:55.880
But we'll see that this
actually won't matter for us.
00:30:55.880 --> 00:30:57.240
This is going to be some number.
00:30:57.240 --> 00:30:58.220
It's going to be there.
00:30:58.220 --> 00:31:00.260
But it will matter
for us, what it is.
00:31:00.260 --> 00:31:02.510
Because it actually does
not depend on the parameter.
00:31:02.510 --> 00:31:04.340
And that's all
that matters to us.
00:31:09.100 --> 00:31:11.170
Let's put some names
on those things.
00:31:11.170 --> 00:31:12.860
This was very informal.
00:31:12.860 --> 00:31:19.710
So let's put some actual
names on what we call prior.
00:31:19.710 --> 00:31:22.320
So what is the formal
definition of a prior,
00:31:22.320 --> 00:31:24.960
what is the formal
definition of a posterior,
00:31:24.960 --> 00:31:27.450
and what are the
rules to update it?
00:31:27.450 --> 00:31:30.100
So I'm going to have my data,
which is going to be X1, Xn.
00:31:35.710 --> 00:31:38.520
Let's say they are iid, but
they don't actually have to be.
00:31:38.520 --> 00:31:41.260
And so I'm going to
have given, theta.
00:31:47.450 --> 00:31:48.890
And when I say
given, it's either
00:31:48.890 --> 00:31:51.890
given like I did in the
first part of this course
00:31:51.890 --> 00:31:55.940
in all previous chapters,
or conditionally on.
00:31:55.940 --> 00:31:58.340
If you're thinking like a
Bayesian, what I really mean
00:31:58.340 --> 00:32:02.250
is conditionally on
this random parameter.
00:32:02.250 --> 00:32:06.350
It's as if it was
a fixed number.
00:32:06.350 --> 00:32:08.410
They're going to
have a distribution,
00:32:08.410 --> 00:32:12.350
X1, Xn is going to
have some distribution.
00:32:12.350 --> 00:32:19.260
Let's assume for now
it's a PDF, pn of X1, Xn.
00:32:19.260 --> 00:32:22.140
I'm going to write
theta like this.
00:32:22.140 --> 00:32:24.900
So for example, what is this?
00:32:24.900 --> 00:32:27.140
Let's say this is a PDF.
00:32:27.140 --> 00:32:28.110
It could be a PMF.
00:32:28.110 --> 00:32:31.197
Everything I say, I'm going to
think of them as being PDF's.
00:32:31.197 --> 00:32:33.030
I'm going to combine
PDF's with PDF's, but I
00:32:33.030 --> 00:32:37.440
could combine PDF with PMF, PMF
with PDF's, or PMF with PMF.
00:32:37.440 --> 00:32:41.590
So everywhere you see
a D could be an M.
00:32:41.590 --> 00:32:42.590
Now I have those things.
00:32:42.590 --> 00:32:43.465
So what does it mean?
00:32:43.465 --> 00:32:46.430
So here is an example.
00:32:46.430 --> 00:32:53.970
X1, Xn are iid N theta, 1.
00:32:53.970 --> 00:32:57.530
Now I know exactly what the
joint PDF of this thing is.
00:32:57.530 --> 00:33:03.790
It means that pn of X1, Xn
given theta is equal to what?
00:33:03.790 --> 00:33:10.560
Well it's 1 over
root 2 pi to the power n
00:33:10.560 --> 00:33:15.000
e, to the minus sum
from i equal 1 to n
00:33:15.000 --> 00:33:18.450
of xi minus theta
squared divided by 2.
00:33:18.450 --> 00:33:21.220
So that's just the joint
distribution of n iid
00:33:21.220 --> 00:33:25.120
N theta, 1, random variables.
00:33:25.120 --> 00:33:27.290
That's my pn given theta.
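The joint density on the board can be written as a small Python function (an illustrative sketch, not lecture code):

```python
import numpy as np

# Joint density p_n(x1..xn | theta) of n iid N(theta, 1) observations:
# (1 / sqrt(2*pi))^n * exp(-sum((xi - theta)^2) / 2).
def gaussian_joint_pdf(xs, theta):
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    return (2 * np.pi) ** (-n / 2) * np.exp(-np.sum((xs - theta) ** 2) / 2)

# For a single observation sitting exactly at theta, this is 1/sqrt(2*pi):
print(gaussian_joint_pdf([0.0], 0.0))  # ~0.3989
```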
00:33:27.290 --> 00:33:33.310
Now this is what we denoted
by f sub theta before.
00:33:33.310 --> 00:33:36.790
We had the subscript before, but
now we just put a bar in theta
00:33:36.790 --> 00:33:38.860
because we want to remember
that this is actually
00:33:38.860 --> 00:33:40.660
conditioned on theta.
00:33:40.660 --> 00:33:42.130
But this is just notation.
00:33:42.130 --> 00:33:46.060
You should just think of this
as being, just the usual thing
00:33:46.060 --> 00:33:50.910
that you get from some
statistical model.
00:33:50.910 --> 00:33:53.910
Now, that's going to be pn.
00:34:11.020 --> 00:34:19.500
Theta has prior
distribution, pi.
00:34:22.400 --> 00:34:29.130
For example, so think of it
as either PDF or PMF again.
00:34:29.130 --> 00:34:33.920
For example, pi
of theta was what?
00:34:33.920 --> 00:34:40.159
Well it was some constant
times theta to the a minus 1,
00:34:40.159 --> 00:34:43.739
1 minus theta to a minus 1.
00:34:43.739 --> 00:34:45.900
So it has some
prior distribution,
00:34:45.900 --> 00:34:49.050
and that's another PDF.
00:34:49.050 --> 00:34:51.090
So now I'm given the
distribution of my
00:34:51.090 --> 00:34:54.000
Xi's given theta, and given
the distribution of my theta.
00:34:54.000 --> 00:34:57.410
I'm given this guy.
00:34:57.410 --> 00:35:00.100
That's this guy.
00:35:00.100 --> 00:35:05.340
I'm given that guy,
which is my pi.
00:35:05.340 --> 00:35:11.700
So that's my pn of
X1, Xn given theta.
00:35:11.700 --> 00:35:13.063
That's my pi of theta.
00:35:17.390 --> 00:35:21.130
Well, this is just
the integral of pn
00:35:21.130 --> 00:35:28.280
of X1, Xn given theta, times pi
of theta, d theta,
00:35:28.280 --> 00:35:29.720
over all possible values of theta.
00:35:29.720 --> 00:35:33.360
That's just when I
integrate out my theta,
00:35:33.360 --> 00:35:35.790
or I compute the
marginal distribution,
00:35:35.790 --> 00:35:37.290
I did this by integrating.
00:35:37.290 --> 00:35:41.010
That's just basic probability,
conditional probabilities.
00:35:41.010 --> 00:35:42.610
Then if I had the
PMF, I would just
00:35:42.610 --> 00:35:43.970
sum over the values of thetas.
00:35:49.020 --> 00:35:55.210
Now what I want is to
find what's called,
00:35:55.210 --> 00:35:58.870
so that's the
prior distribution,
00:35:58.870 --> 00:36:01.227
and I want to find the
posterior distribution.
00:36:15.110 --> 00:36:18.690
It's pi of theta, given X1, Xn.
00:36:21.780 --> 00:36:23.970
If I use Bayes' rule
I know that this
00:36:23.970 --> 00:36:34.650
is pn of X1, Xn, given
theta times pi of theta.
00:36:34.650 --> 00:36:37.530
And then it's divided
by the distribution
00:36:37.530 --> 00:36:41.070
of those guys, which I will
write as integral over theta
00:36:41.070 --> 00:36:48.830
of pn, X1, Xn, given theta
times pi of theta, d theta.
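This Bayes' rule computation can be carried out numerically on a grid; the sketch below uses a hypothetical Bernoulli model with a Beta(2, 2) prior and made-up data, so the exact posterior is Beta(s + 2, n - s + 2):

```python
import numpy as np

# Posterior = likelihood * prior, divided by (a numerical version of)
# the integral in the denominator.
theta = np.linspace(1e-6, 1 - 1e-6, 20_000)
dtheta = theta[1] - theta[0]
xs = np.array([1, 0, 1, 1, 0, 1, 1, 1])           # made-up data
s, n = xs.sum(), len(xs)

likelihood = theta ** s * (1 - theta) ** (n - s)   # p_n(x1..xn | theta)
prior = theta * (1 - theta)                        # Beta(2, 2), up to a constant
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)  # divide by the integral

post_mean = np.sum(posterior * theta) * dtheta
print(post_mean)  # close to the exact posterior mean (s + 2) / (n + 4)
```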
00:36:55.360 --> 00:36:57.700
Everybody's with me, still?
00:36:57.700 --> 00:36:59.200
If you're not
comfortable with this,
00:36:59.200 --> 00:37:03.010
it means that you probably need
to go read your couple of pages
00:37:03.010 --> 00:37:04.930
on conditional densities
and conditional
00:37:04.930 --> 00:37:07.420
PMF's from your probability class.
00:37:07.420 --> 00:37:08.870
There's really not much there.
00:37:08.870 --> 00:37:13.660
It's just a matter of being able
to define those quantities, f
00:37:13.660 --> 00:37:15.289
density of x, given y.
00:37:15.289 --> 00:37:17.330
This is just what's called
a conditional density.
00:37:17.330 --> 00:37:19.079
You need to understand
what this object is
00:37:19.079 --> 00:37:21.920
and how it relates to the
joint distribution of x and y,
00:37:21.920 --> 00:37:24.302
or maybe the distribution of
x or the distribution of y.
00:37:27.400 --> 00:37:29.920
But it's the same rules.
00:37:29.920 --> 00:37:31.465
One way to actually
remember this
00:37:31.465 --> 00:37:33.730
is, this is exactly
the same rules as this.
00:37:33.730 --> 00:37:36.610
When you see a bar, it's the
same thing as the probability
00:37:36.610 --> 00:37:37.790
of this and this guy.
00:37:37.790 --> 00:37:40.060
So for densities,
it's just a comma
00:37:40.060 --> 00:37:43.240
divided by the probability
of the second guy.
00:37:43.240 --> 00:37:45.120
That's it.
00:37:45.120 --> 00:37:48.360
So if you remember this, you can
just do some pattern matching
00:37:48.360 --> 00:37:49.980
and see what I just wrote here.
00:37:53.220 --> 00:37:57.010
Now, I can compute every
single one of these guys.
00:37:57.010 --> 00:38:04.030
This something I get
from my modeling.
00:38:04.030 --> 00:38:05.290
So I did not write this.
00:38:05.290 --> 00:38:09.130
It's not written in the slides.
00:38:09.130 --> 00:38:14.820
But I give a name to this guy
that was my prior distribution.
00:38:14.820 --> 00:38:16.550
And that was my
posterior distribution.
00:38:22.550 --> 00:38:26.980
In chapter three, maybe
what did we call this guy?
00:38:32.120 --> 00:38:35.180
The one that does not have a
name and that's in the box.
00:38:39.347 --> 00:38:40.180
What did we call it?
00:38:43.498 --> 00:38:46.335
AUDIENCE: [INAUDIBLE]
00:38:46.335 --> 00:38:48.835
PHILIPPE RIGOLLET: It is the
joint distribution of the Xi's.
00:38:51.950 --> 00:38:53.235
And we gave it a name.
00:38:53.235 --> 00:38:54.214
AUDIENCE: [INAUDIBLE]
00:38:54.214 --> 00:38:56.130
PHILIPPE RIGOLLET: It's
the likelihood, right?
00:38:56.130 --> 00:38:57.630
This is exactly the likelihood.
00:38:57.630 --> 00:38:59.100
This was the
likelihood of theta.
00:39:03.920 --> 00:39:06.350
And this is something that's
very important to remember,
00:39:06.350 --> 00:39:10.520
and that really reminds you
that these things are really not
00:39:10.520 --> 00:39:11.540
that different.
00:39:11.540 --> 00:39:13.970
Maximum likelihood estimation
and Bayesian estimation,
00:39:13.970 --> 00:39:18.860
because your posterior is really
just your likelihood times
00:39:18.860 --> 00:39:23.570
something that's just putting
some weights on the thetas,
00:39:23.570 --> 00:39:26.390
depending on where you
think theta should be.
00:39:26.390 --> 00:39:28.420
If I had, say a maximum
likelihood estimate,
00:39:28.420 --> 00:39:31.130
and my likelihood and
theta looked like this,
00:39:31.130 --> 00:39:33.410
but my prior and theta
looked like this.
00:39:33.410 --> 00:39:37.040
I said, oh I really want
thetas that are like this.
00:39:37.040 --> 00:39:38.710
So what's going to
happen is that, I'm
00:39:38.710 --> 00:39:41.320
going to turn this into some
posterior that looks like this.
00:39:44.400 --> 00:39:47.610
So I'm just really
weighting this posterior,
00:39:47.610 --> 00:39:49.971
this is a constant that does
not depend on theta right?
00:39:49.971 --> 00:39:50.470
Agreed?
00:39:50.470 --> 00:39:53.460
I integrated over
theta, so theta is gone.
00:39:53.460 --> 00:39:56.220
So forget about this guy.
00:39:56.220 --> 00:39:59.247
I have basically, that the
posterior distribution up
00:39:59.247 --> 00:40:01.830
to scaling, because it has to
be a probability density and not
00:40:01.830 --> 00:40:03.810
just any
function that's positive,
00:40:03.810 --> 00:40:05.070
is the product of this guy.
00:40:05.070 --> 00:40:06.920
It's a weighted version
of my likelihood.
00:40:06.920 --> 00:40:07.890
That's all it is.
00:40:07.890 --> 00:40:09.990
I'm just weighing
the likelihood,
00:40:09.990 --> 00:40:13.150
using my prior belief on theta.
00:40:13.150 --> 00:40:16.870
And so given this guy
a natural estimator,
00:40:16.870 --> 00:40:19.480
if you follow the maximum
likelihood principle,
00:40:19.480 --> 00:40:23.150
would be the maximum
of this posterior.
00:40:23.150 --> 00:40:24.620
Agreed?
00:40:24.620 --> 00:40:28.830
That would basically be doing
exactly what maximum likelihood
00:40:28.830 --> 00:40:31.740
estimation is telling you.
00:40:31.740 --> 00:40:33.560
So it turns out that you can.
00:40:33.560 --> 00:40:35.330
It's called Maximum
A Posteriori,
00:40:35.330 --> 00:40:39.370
and I won't talk much
about this, or MAP.
00:40:39.370 --> 00:40:44.500
That's Maximum a Posteriori.
00:40:44.500 --> 00:40:47.200
So it's just,
theta hat is the arg
00:40:47.200 --> 00:40:50.790
max of pi theta, given X1, Xn.
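A grid-based sketch of the MAP computation (the Beta(3, 5) posterior is a made-up example, not from the lecture): since the normalizing constant does not move the argmax, we can maximize the unnormalized density and compare against the closed-form beta mode.

```python
import numpy as np

# MAP = argmax of the posterior.  For a Beta(alpha, beta) posterior the
# exact mode is (alpha - 1) / (alpha + beta - 2); the grid argmax of the
# unnormalized density should land on the same value.
alpha, beta = 3, 5
theta = np.linspace(1e-6, 1 - 1e-6, 1_000_001)
post = theta ** (alpha - 1) * (1 - theta) ** (beta - 1)  # constants dropped

theta_map = theta[np.argmax(post)]
print(theta_map, (alpha - 1) / (alpha + beta - 2))  # both ~ 1/3
```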
00:40:54.990 --> 00:40:56.190
And it sounds like it's OK.
00:40:56.190 --> 00:40:58.660
I'll give you a
density and you say, OK
00:40:58.660 --> 00:41:00.970
I have a density for all
values of my parameters.
00:41:00.970 --> 00:41:03.440
You're asking me to
summarize it into one number.
00:41:03.440 --> 00:41:06.570
I'm just going to take the most
likely number of those guys.
00:41:06.570 --> 00:41:08.310
But you could summarize
it, otherwise.
00:41:08.310 --> 00:41:10.770
You could take the average.
00:41:10.770 --> 00:41:12.420
You could take the median.
00:41:12.420 --> 00:41:14.370
You could take a
bunch of numbers.
00:41:14.370 --> 00:41:16.080
And the beauty of
Bayesian statistics
00:41:16.080 --> 00:41:19.230
is that, you don't have to
take any number in particular.
00:41:19.230 --> 00:41:21.480
You have an entire
posterior distribution.
00:41:21.480 --> 00:41:25.080
This is not only telling
you where theta is,
00:41:25.080 --> 00:41:29.160
but it's actually telling
you the difference,
00:41:29.160 --> 00:41:31.920
if you actually
give the posterior
00:41:31.920 --> 00:41:33.180
itself as your answer.
00:41:33.180 --> 00:41:36.270
Now, let's say the theta
is p between 0 and 1.
00:41:36.270 --> 00:41:39.990
If my posterior distribution
looks like this,
00:41:39.990 --> 00:41:43.410
or my posterior distribution
looks like this,
00:41:43.410 --> 00:41:47.610
then those two guys
have one, the same mode.
00:41:47.610 --> 00:41:49.200
This is the same value.
00:41:49.200 --> 00:41:51.630
And they're symmetric, so they'll
also have the same mean.
00:41:51.630 --> 00:41:53.130
So these two posterior
distributions
00:41:53.130 --> 00:41:55.500
give me the same
summary into one number.
00:41:55.500 --> 00:41:58.229
However clearly, one
is much more confident
00:41:58.229 --> 00:41:59.020
than the other one.
00:41:59.020 --> 00:42:04.010
So I might as well just
spit it out as a solution.
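A quick numerical illustration with made-up parameters: Beta(2, 2) and Beta(50, 50) share the same mean and mode, yet have very different variances, so summarizing either one by a single number hides the difference in confidence.

```python
# Two symmetric beta posteriors with the same mean (and mode) of 0.5 but
# very different spread: Beta(2, 2) vs Beta(50, 50).
def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

print(beta_mean(2, 2), beta_mean(50, 50))  # both 0.5
print(beta_var(2, 2), beta_var(50, 50))    # 0.05 vs ~0.0025
```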
00:42:04.010 --> 00:42:05.180
You can do even better.
00:42:05.180 --> 00:42:09.560
People actually do things,
such as drawing a random number
00:42:09.560 --> 00:42:10.600
from this distribution.
00:42:10.600 --> 00:42:12.940
Say, this is my number.
00:42:12.940 --> 00:42:14.440
That's kind of
dangerous, but you
00:42:14.440 --> 00:42:15.690
can imagine you could do this.
00:42:20.730 --> 00:42:22.140
This is what works.
00:42:22.140 --> 00:42:23.680
That's what we went through.
00:42:23.680 --> 00:42:28.650
So here, as you notice I don't
care so much about this part
00:42:28.650 --> 00:42:30.240
here.
00:42:30.240 --> 00:42:32.240
Because it does not
depend on theta.
00:42:32.240 --> 00:42:35.190
So I know that given the
product of those two things,
00:42:35.190 --> 00:42:37.650
this thing is only the
constant that I need to divide
00:42:37.650 --> 00:42:40.050
so that when I integrate
this thing over theta,
00:42:40.050 --> 00:42:41.460
it integrates to one.
00:42:41.460 --> 00:42:45.540
Because this has to be a
probability density on theta.
00:42:45.540 --> 00:42:47.910
I can write this and just
forget about that part.
00:42:47.910 --> 00:42:52.280
And that's what's written
on the top of this slide.
00:42:52.280 --> 00:42:57.920
This notation, this sort of
weird alpha, or I don't know.
00:42:57.920 --> 00:42:59.780
Infinity sign
propped to the right.
00:42:59.780 --> 00:43:02.330
Whatever you want
to call this thing
00:43:02.330 --> 00:43:04.700
is actually just really
emphasizing the fact
00:43:04.700 --> 00:43:06.310
that I don't care.
00:43:06.310 --> 00:43:12.490
I write it because I can,
but you know what it is.
00:43:17.314 --> 00:43:19.480
In some instances, you have
to compute the integral.
00:43:19.480 --> 00:43:21.640
In some instances, you don't
have to compute the integral.
00:43:21.640 --> 00:43:23.200
And a lot of
Bayesian computation
00:43:23.200 --> 00:43:25.600
is about saying,
OK it's actually
00:43:25.600 --> 00:43:27.146
really hard to
compute this integral,
00:43:27.146 --> 00:43:28.270
so I'd rather not do it.
00:43:28.270 --> 00:43:31.450
So let me try to find some
methods that will allow me
00:43:31.450 --> 00:43:33.789
to sample from the
posterior distribution,
00:43:33.789 --> 00:43:35.080
without having to compute this.
00:43:35.080 --> 00:43:37.720
And that's what's called
Markov chain
00:43:37.720 --> 00:43:40.580
Monte-Carlo, or MCMC, and that's
exactly what they're doing.
00:43:40.580 --> 00:43:42.370
They're just using
only ratios of things,
00:43:42.370 --> 00:43:44.130
like that for different thetas.
00:43:44.130 --> 00:43:45.890
And which means that
if you take ratios,
00:43:45.890 --> 00:43:47.860
the normalizing constant
is gone and you don't
00:43:47.860 --> 00:43:50.810
need to find this integral.
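Here is a minimal Metropolis sketch illustrating that idea (the target and its parameters are made-up; this is not the course's implementation): the acceptance step uses only ratios of the unnormalized density, so the integral in the denominator is never computed.

```python
import numpy as np

# Target: the Beta-Bernoulli posterior with s = 6 ones out of n = 8 and
# a Beta(2, 2) prior, i.e. Beta(8, 4) *up to its normalizing constant*.
rng = np.random.default_rng(1)

def target(p, s=6, n=8, a=2, b=2):
    if not (0.0 < p < 1.0):
        return 0.0
    return p ** (s + a - 1) * (1 - p) ** (n - s + b - 1)

samples, p = [], 0.5
for _ in range(20_000):
    prop = p + rng.normal(0.0, 0.1)            # random-walk proposal
    # Only the ratio target(prop)/target(p) appears, so the
    # normalizing constant cancels and is never needed.
    if rng.random() < target(prop) / target(p):
        p = prop                               # accept the move
    samples.append(p)

print(np.mean(samples[5_000:]))  # close to the exact mean 8 / 12
```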
00:43:50.810 --> 00:43:53.015
So we won't go into
those details at all.
00:43:53.015 --> 00:43:54.890
That would be the purpose
of an entire course
00:43:54.890 --> 00:43:56.630
on Bayesian inference.
00:43:56.630 --> 00:43:59.570
Actually, even
Bayesian computations
00:43:59.570 --> 00:44:02.154
would be an entire
course on its own.
00:44:02.154 --> 00:44:03.820
And there's some very
interesting things
00:44:03.820 --> 00:44:05.778
that are going on there,
the interface of stats
00:44:05.778 --> 00:44:06.890
and computation.
00:44:10.054 --> 00:44:12.470
So let's go back to our example
and see if we can actually
00:44:12.470 --> 00:44:13.636
compute any of those things.
00:44:13.636 --> 00:44:17.420
Because it's very nice to give
you some data, some formulas.
00:44:17.420 --> 00:44:19.990
Let's see if we
can actually do it.
00:44:19.990 --> 00:44:23.810
In particular, can I
actually recover this claim
00:44:23.810 --> 00:44:31.250
that the posterior associated
to a beta prior with a Bernoulli
00:44:31.250 --> 00:44:35.780
likelihood is actually
giving me a beta again?
00:44:35.780 --> 00:44:36.710
What was my prior?
00:44:42.670 --> 00:44:45.970
So p was following
a beta a, a, which
00:44:45.970 --> 00:44:48.320
means that p has the density.
00:44:53.620 --> 00:44:56.610
That was pi of theta.
00:44:56.610 --> 00:44:59.580
Well I'm going to
write this as pi of p--
00:44:59.580 --> 00:45:05.800
was proportional to p to the
A minus 1 times 1 minus p
00:45:05.800 --> 00:45:08.806
to the A minus 1.
00:45:08.806 --> 00:45:11.430
So that's the first ingredient
I need to complete my posterior.
00:45:11.430 --> 00:45:14.370
I really need only two, if I
wanted to bound up to constant.
00:45:14.370 --> 00:45:16.234
The second one was pn.
00:45:20.710 --> 00:45:22.620
We've computed that many times.
00:45:22.620 --> 00:45:25.610
And we had even a nice
compact way of writing it,
00:45:25.610 --> 00:45:32.570
which was that pn of X1,
Xn, given the parameter p.
00:45:32.570 --> 00:45:36.850
So the joint density of my data,
given p, that's my likelihood.
00:45:36.850 --> 00:45:38.730
The likelihood of p was what?
00:45:38.730 --> 00:45:41.230
Well it was p to
the sum of Xi's.
00:45:44.030 --> 00:45:46.300
1 minus p to the n
minus sum of the Xi's.
00:45:50.990 --> 00:45:53.750
Anybody wants me
to parse this more?
00:45:53.750 --> 00:45:56.060
Or do you remember seeing
that from maximum likelihood
00:45:56.060 --> 00:45:57.060
estimation?
00:45:57.060 --> 00:45:57.697
Yeah?
00:45:57.697 --> 00:46:02.929
AUDIENCE: [INAUDIBLE]
00:46:02.929 --> 00:46:04.970
PHILIPPE RIGOLLET: That's
what conditioning does.
00:46:10.838 --> 00:46:15.239
AUDIENCE: [INAUDIBLE]
previous slide.
00:46:15.239 --> 00:46:19.151
[INAUDIBLE] bottom
there, it says D pi of t.
00:46:19.151 --> 00:46:23.570
Shouldn't it be dt pi of t?
00:46:23.570 --> 00:46:25.300
PHILIPPE RIGOLLET:
So D pi of T is
00:46:25.300 --> 00:46:29.110
a measure theoretic notation,
which I used without thinking.
00:46:29.110 --> 00:46:32.380
And I should not because
I can see it upsets you.
00:46:32.380 --> 00:46:35.050
D pi of T is just a
natural way to say
00:46:35.050 --> 00:46:38.170
that I integrate
against whatever I'm
00:46:38.170 --> 00:46:43.930
given for the prior of theta.
00:46:43.930 --> 00:46:48.820
In particular, if theta is just
the mix of a PDF and a point
00:46:48.820 --> 00:46:51.430
mass, maybe I say
that my p takes
00:46:51.430 --> 00:46:54.400
value 0.5 with probability 0.5.
00:46:54.400 --> 00:46:58.900
And then is uniform on the
interval with probability 0.5.
00:46:58.900 --> 00:47:01.930
For this, I neither
have a PDF nor a PMF.
00:47:01.930 --> 00:47:04.150
But I can still talk about
integrating with respect
00:47:04.150 --> 00:47:04.930
to this, right?
00:47:04.930 --> 00:47:08.530
It's going to look like, if
I take a function f of T,
00:47:08.530 --> 00:47:14.480
D pi of T is going to be
one half of f of one half.
00:47:14.480 --> 00:47:16.480
That's the point mass
with probability one half,
00:47:16.480 --> 00:47:17.560
at one half.
00:47:17.560 --> 00:47:23.230
Plus one half of the integral
between 0 and 1, of f of TDT.
00:47:23.230 --> 00:47:26.980
This is just the notation, which
is actually funnily enough,
00:47:26.980 --> 00:47:29.360
interchangeable with pi of DT.
00:47:32.460 --> 00:47:34.890
But if you have a
density, it's really
00:47:34.890 --> 00:47:39.801
just the density pi of TDT.
00:47:39.801 --> 00:47:41.940
If pi is really a
density, but that's
00:47:41.940 --> 00:47:44.120
when pi is a measure
and not a density.
00:47:46.820 --> 00:47:49.700
Everybody else,
forget about this.
00:47:49.700 --> 00:47:51.627
This is not something
you should really
00:47:51.627 --> 00:47:52.710
worry about at this point.
00:47:52.710 --> 00:47:55.719
This is more graduate
level probability classes.
00:47:55.719 --> 00:47:57.260
But yeah, it's called
measure theory.
00:47:57.260 --> 00:47:59.160
And that's when you think
of pi as being a measure
00:47:59.160 --> 00:47:59.980
in an abstract fashion.
00:47:59.980 --> 00:48:01.896
You don't have to worry
whether it's a density
00:48:01.896 --> 00:48:04.000
or not, or whether
it has a density.
00:48:08.350 --> 00:48:10.250
So everybody is OK with this?
00:48:15.530 --> 00:48:17.390
Now I need to
compute my posterior.
00:48:17.390 --> 00:48:23.120
And as I said, my
posterior is really
00:48:23.120 --> 00:48:25.550
just the product of
the likelihood weighted
00:48:25.550 --> 00:48:28.970
by the prior.
00:48:28.970 --> 00:48:33.030
Hopefully, at this stage
of your education,
00:48:33.030 --> 00:48:35.390
you can multiply two functions.
00:48:35.390 --> 00:48:37.580
So what's happening is,
if I multiply this guy
00:48:37.580 --> 00:48:41.300
with this guy, p gets
this guy to the power
00:48:41.300 --> 00:48:42.860
this guy plus this guy.
00:48:53.810 --> 00:49:00.020
And then 1 minus p gets the
power n minus sum of Xi's.
00:49:00.020 --> 00:49:02.900
So this is always
from I equal 1 to n.
00:49:02.900 --> 00:49:04.390
And then plus A minus 1 as well.
00:49:10.010 --> 00:49:15.560
This is up to constant, because
I still need to solve this.
00:49:15.560 --> 00:49:17.259
And I could try to do it.
00:49:17.259 --> 00:49:18.800
But I really don't
have to, because I
00:49:18.800 --> 00:49:24.380
know that if my density
has this form, then
00:49:24.380 --> 00:49:25.532
it's a beta distribution.
00:49:25.532 --> 00:49:26.990
And then I can just
go on Wikipedia
00:49:26.990 --> 00:49:29.021
and see what should be
the normalization factor.
00:49:29.021 --> 00:49:31.020
But I know it's going to
be a beta distribution.
00:49:31.020 --> 00:49:34.020
It's actually the
beta with parameter.
00:49:34.020 --> 00:49:39.210
So this is really my beta
with parameter, sum of Xi,
00:49:39.210 --> 00:49:43.580
i equal 1 to n plus A minus 1.
00:49:43.580 --> 00:49:46.130
And then the second
parameter is n minus sum
00:49:46.130 --> 00:49:49.806
of the Xi's plus A minus 1.
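The conjugate update on the board can be sketched in a few lines of Python (the names are my own): given Bernoulli data and a symmetric Beta(a, a) prior, it returns the parameters of the Beta posterior, with no normalizing integral needed.

```python
def beta_posterior(xs, a=2.0):
    """Posterior of p for Bernoulli data xs under a Beta(a, a) prior.

    Conjugacy: Beta(a, a) prior times the Bernoulli likelihood gives
    Beta(sum(xs) + a, n - sum(xs) + a).
    """
    n, s = len(xs), sum(xs)
    return (s + a, n - s + a)

# 3 successes and 1 failure with a = 2: Beta(3 + 2, 1 + 2) = (5.0, 3.0).
print(beta_posterior([1, 0, 1, 1], a=2.0))
```

With a Beta(a, b) prior, the same computation would return (s + a, n - s + b), the asymmetric version mentioned a moment later.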
00:49:54.980 --> 00:49:59.030
I just wrote what was here.
00:49:59.030 --> 00:50:01.580
What happened to my one?
00:50:01.580 --> 00:50:02.920
Oh no, sorry.
00:50:02.920 --> 00:50:05.640
Beta has the power minus 1.
00:50:05.640 --> 00:50:08.847
So that's the
parameter of the beta.
00:50:08.847 --> 00:50:10.430
And this is the
parameter of the beta.
00:50:15.127 --> 00:50:16.210
Beta is over there, right?
00:50:16.210 --> 00:50:19.852
So I just replace
A by what I see.
00:50:19.852 --> 00:50:22.290
A is just becoming
this guy plus this guy
00:50:22.290 --> 00:50:26.400
and this guy plus this guy.
00:50:26.400 --> 00:50:28.662
Everybody is comfortable
with this computation?
00:50:34.170 --> 00:50:38.850
We just agreed that beta priors
for Bernoulli observations
00:50:38.850 --> 00:50:42.540
are certainly convenient.
00:50:42.540 --> 00:50:44.457
Because they are just
conjugate, and we know
00:50:44.457 --> 00:50:46.290
that's what is going
to come out in the end.
00:50:46.290 --> 00:50:48.899
That's going to
be a beta as well.
00:50:48.899 --> 00:50:50.190
I just claim it was convenient.
00:50:50.190 --> 00:50:52.890
It was certainly convenient
to compute this, right?
00:50:52.890 --> 00:50:55.741
There was certainly
some compatibility
00:50:55.741 --> 00:50:57.990
when I had to multiply this
function by that function.
00:50:57.990 --> 00:51:00.916
And you can imagine that things
could go much more wrong,
00:51:00.916 --> 00:51:03.540
than just having p to some power
and p to some power, 1 minus p
00:51:03.540 --> 00:51:06.390
to some power, when it might
just be some other power.
00:51:06.390 --> 00:51:09.280
Things were nice.
00:51:09.280 --> 00:51:12.410
Now this is nice, but I can also
question the following things.
00:51:12.410 --> 00:51:14.330
Why beta, for one?
00:51:14.330 --> 00:51:17.840
The beta tells me something.
00:51:17.840 --> 00:51:20.636
That's convenient, but
then how do I pick A?
00:51:20.636 --> 00:51:27.500
I know that A should definitely
capture where
00:51:27.500 --> 00:51:30.200
I believe my p is
most likely located.
00:51:30.200 --> 00:51:32.390
But it also actually
also captures
00:51:32.390 --> 00:51:34.580
the variance of my beta.
00:51:34.580 --> 00:51:36.740
And so choosing
different As is going
00:51:36.740 --> 00:51:37.950
to have different effects.
00:51:37.950 --> 00:51:43.050
If I had A and B, that is, if I started
with the beta with parameters A and B.
00:51:43.050 --> 00:51:48.110
If I started with a B here, I
would just pick up the B here.
00:51:48.110 --> 00:51:49.862
Agreed?
00:51:49.862 --> 00:51:51.320
And that would just
be asymmetric.
00:51:51.320 --> 00:51:53.270
But they're going to
capture mean and variance
00:51:53.270 --> 00:51:53.853
of this thing.
00:51:53.853 --> 00:51:56.030
And so how do I pick those guys?
00:51:56.030 --> 00:51:59.437
If I'm a doctor and
you're asking me,
00:51:59.437 --> 00:52:01.520
what do you think the
chances of this drug working
00:52:01.520 --> 00:52:03.230
in this kind of patients is?
00:52:03.230 --> 00:52:06.080
And I have to spit out the
parameters of a beta for you,
00:52:06.080 --> 00:52:08.630
it might be a bit of a
complicated thing to do.
00:52:08.630 --> 00:52:10.720
So how do you do this,
especially for new problems?
00:52:10.720 --> 00:52:14.750
So by now, people
have actually mastered
00:52:14.750 --> 00:52:19.290
the art of coming up with how
to formulate those numbers.
00:52:19.290 --> 00:52:21.660
But in new problems that
come up, how do you do this?
00:52:21.660 --> 00:52:23.840
What happens if you want
to use Bayesian methods,
00:52:23.840 --> 00:52:30.140
but you actually do not
know what you expect to see?
00:52:30.140 --> 00:52:33.260
To be fair, before we started
this class, I hope all of you
00:52:33.260 --> 00:52:36.870
had no idea whether people tend
to bend their head to the right
00:52:36.870 --> 00:52:38.172
or to the left before kissing.
00:52:38.172 --> 00:52:40.130
Because if you did, well
you have too much time
00:52:40.130 --> 00:52:42.130
on your hands and I should
double your homework.
00:52:44.390 --> 00:52:46.970
So in this case,
maybe you still want
00:52:46.970 --> 00:52:48.830
to use the Bayesian machinery.
00:52:48.830 --> 00:52:50.980
Maybe you just want
to do something nice.
00:52:50.980 --> 00:52:53.512
It's nice, right? I mean,
it worked out pretty well.
00:52:53.512 --> 00:52:54.470
Well, what do you want to do?
00:52:54.470 --> 00:52:56.870
Well you actually want
to use some priors that
00:52:56.870 --> 00:53:00.170
carry no information, that
basically do not prefer
00:53:00.170 --> 00:53:02.750
any theta to another theta.
00:53:02.750 --> 00:53:05.435
Now, you could read
this slide or you
00:53:05.435 --> 00:53:06.560
could look at this formula.
00:53:10.010 --> 00:53:14.920
We just said that this
pi here was just here
00:53:14.920 --> 00:53:18.220
to weigh some thetas more
than others, depending
00:53:18.220 --> 00:53:19.870
on their prior belief.
00:53:19.870 --> 00:53:21.400
If our prior belief
does not want
00:53:21.400 --> 00:53:24.880
to put any preference towards
some thetas than to others,
00:53:24.880 --> 00:53:26.332
what do I do?
00:53:26.332 --> 00:53:27.655
AUDIENCE: [INAUDIBLE]
00:53:27.655 --> 00:53:29.462
PHILIPPE RIGOLLET:
Yeah, I remove it.
00:53:29.462 --> 00:53:31.420
And the way to remove
something we multiply by,
00:53:31.420 --> 00:53:32.650
is just replace it by one.
00:53:32.650 --> 00:53:35.100
That's really what we're doing.
00:53:35.100 --> 00:53:38.560
If this was a constant
not depending on theta,
00:53:38.560 --> 00:53:41.400
then that would mean that
we're not preferring any theta.
00:53:41.400 --> 00:53:44.370
And we're looking
at the likelihood.
00:53:44.370 --> 00:53:46.560
But not as a function that
we're trying to maximize,
00:53:46.560 --> 00:53:50.220
but it is a function that
we normalize in such a way
00:53:50.220 --> 00:53:52.570
that it's actually
a distribution.
00:53:52.570 --> 00:53:54.782
So if I have pi,
which is not here,
00:53:54.782 --> 00:53:56.740
this is really just taking
the likelihood,
00:53:56.740 --> 00:53:57.990
which is a positive function.
00:53:57.990 --> 00:53:59.970
It may not integrate
to 1, so I normalize it
00:53:59.970 --> 00:54:02.330
so that it integrates to 1.
00:54:02.330 --> 00:54:05.120
And then I just say, well this
is my posterior distribution.
00:54:05.120 --> 00:54:06.770
Now I could just
maximize this thing
00:54:06.770 --> 00:54:09.180
and spit out my maximum
likelihood estimator.
00:54:09.180 --> 00:54:10.850
But I can also
integrate and find
00:54:10.850 --> 00:54:12.350
what the expectation
of this guy is.
00:54:12.350 --> 00:54:14.210
I can find what the
median of this guy is.
00:54:14.210 --> 00:54:16.370
I can sample data from this guy.
00:54:16.370 --> 00:54:19.430
I can build, understand what
the variance of this guy is.
00:54:19.430 --> 00:54:21.830
Which is something we did
not do when we just did
00:54:21.830 --> 00:54:24.800
maximum likelihood estimation
because given a function, all
00:54:24.800 --> 00:54:27.998
we cared about was the
argmax of this function.
00:54:31.680 --> 00:54:36.120
These priors are
called uninformative.
00:54:36.120 --> 00:54:43.440
This is just replacing this
number by one or by a constant.
00:54:43.440 --> 00:54:45.020
Because it still
has to be a density.
00:54:49.236 --> 00:54:50.610
If I have a bounded
set, I'm just
00:54:50.610 --> 00:54:52.950
looking for the
uniform distribution
00:54:52.950 --> 00:54:56.580
on this bounded set, the
one that puts constant one
00:54:56.580 --> 00:54:59.200
over the size of this thing.
00:54:59.200 --> 00:55:01.590
But if I have an
infinite set, what
00:55:01.590 --> 00:55:03.870
is the density that
takes a constant value
00:55:03.870 --> 00:55:07.555
on the entire real
line, for example?
00:55:07.555 --> 00:55:08.430
What is this density?
00:55:13.200 --> 00:55:16.550
AUDIENCE: [INAUDIBLE]
00:55:16.550 --> 00:55:18.530
PHILIPPE RIGOLLET:
Doesn't exist, right?
00:55:18.530 --> 00:55:20.990
It just doesn't exist.
00:55:20.990 --> 00:55:22.770
The way you can think
of it is a Gaussian
00:55:22.770 --> 00:55:24.860
with the variance going
to infinity, maybe,
00:55:24.860 --> 00:55:26.289
or something like this.
00:55:26.289 --> 00:55:27.830
But you can think
of it in many ways.
00:55:27.830 --> 00:55:32.330
You can think of the limit of
the uniform between minus T
00:55:32.330 --> 00:55:34.250
and T, with T going to infinity.
00:55:34.250 --> 00:55:36.480
But this thing is actually zero.
00:55:36.480 --> 00:55:39.530
There's nothing there.
00:55:39.530 --> 00:55:41.990
You can actually
still talk about this.
00:55:41.990 --> 00:55:44.390
You could always talk
about this thing, where
00:55:44.390 --> 00:55:46.550
you think of this guy
as being a constant,
00:55:46.550 --> 00:55:49.080
remove this thing from this
equation, and just say,
00:55:49.080 --> 00:55:51.320
well my posterior is
just the likelihood
00:55:51.320 --> 00:55:54.680
divided by the integral of
the likelihood over theta.
00:55:54.680 --> 00:55:58.650
And if theta is the entire
real line, so be it.
00:55:58.650 --> 00:56:00.390
As long as this
integral converges,
00:56:00.390 --> 00:56:01.890
you can still talk
about this stuff.
00:56:04.460 --> 00:56:06.300
This is what's called
an improper prior.
00:56:09.140 --> 00:56:11.990
An improper prior is just a
non-negative function defined
00:56:11.990 --> 00:56:17.390
in theta, but it does not have
to integrate neither to one,
00:56:17.390 --> 00:56:18.170
nor to anything.
00:56:20.900 --> 00:56:22.700
If I integrate the
function equal to 1
00:56:22.700 --> 00:56:24.330
on the entire real
line, what do I get?
00:56:27.800 --> 00:56:28.520
Infinity.
00:56:32.390 --> 00:56:35.960
It's not a proper prior, and
it's called an improper prior.
00:56:35.960 --> 00:56:39.380
And those improper
priors are usually
00:56:39.380 --> 00:56:42.830
what you see when you start
to want non-informative priors
00:56:42.830 --> 00:56:44.360
on infinite sets of thetas.
00:56:44.360 --> 00:56:46.880
That's just the nature of it.
00:56:46.880 --> 00:56:50.020
You should think of them as
being the uniform distribution
00:56:50.020 --> 00:56:52.550
of some infinite set, if
that thing were to exist.
00:56:56.360 --> 00:57:01.070
Let's see some examples
about non-informative priors.
00:57:01.070 --> 00:57:04.410
If I'm in the interval 0,
1 this is a finite set.
00:57:04.410 --> 00:57:07.730
So I can talk about
the uniform prior
00:57:07.730 --> 00:57:10.600
on the interval 0, 1 for a
parameter, p of a Bernoulli.
00:57:26.380 --> 00:57:28.000
If I want to talk
about this, then it
00:57:28.000 --> 00:57:35.910
means that my prior is p follows
some uniform on the interval
00:57:35.910 --> 00:57:37.570
0, 1.
00:57:37.570 --> 00:57:48.940
So that means that f of
x is 1 if x is in 0, 1.
00:57:48.940 --> 00:57:52.000
Otherwise, there is actually
not even a normalization.
00:57:52.000 --> 00:57:53.860
This thing integrates to 1.
00:57:53.860 --> 00:57:56.137
And so now if I look
at my likelihood,
00:57:56.137 --> 00:57:57.220
it's still the same thing.
00:57:57.220 --> 00:58:04.510
So my posterior
becomes pi of theta given X1, Xn.
00:58:04.510 --> 00:58:07.022
That's my posterior.
00:58:07.022 --> 00:58:08.480
I don't write the
likelihood again,
00:58:08.480 --> 00:58:09.830
because we still have it--
00:58:09.830 --> 00:58:11.583
well we don't have
it here anymore.
00:58:15.440 --> 00:58:17.940
The likelihood is given here.
00:58:17.940 --> 00:58:20.930
Copy, paste over there.
00:58:20.930 --> 00:58:23.069
The posterior is just
this thing times 1.
00:58:23.069 --> 00:58:24.360
So you will see it in a second.
00:58:24.360 --> 00:58:28.570
So it's p to the power sum
of the Xi's, one minus p
00:58:28.570 --> 00:58:31.970
to the power, n minus
sum of the Xi's.
00:58:31.970 --> 00:58:36.380
And then it's multiplied by
1, and then divided by this
00:58:36.380 --> 00:58:42.250
integral between 0 and
1 of p, sum of the Xi's.
00:58:42.250 --> 00:58:47.870
1 minus p, n minus
sum of the Xi's.
00:58:47.870 --> 00:58:51.866
dp, which does not depend on p.
00:58:51.866 --> 00:58:53.990
And I really don't care
what the thing actually is.
00:58:58.900 --> 00:59:03.550
That's posterior of p.
00:59:03.550 --> 00:59:06.280
And now I can see,
well what is this?
00:59:06.280 --> 00:59:12.870
It's actually just the
beta with parameters.
00:59:12.870 --> 00:59:14.120
This guy plus 1.
00:59:19.670 --> 00:59:21.680
And this guy plus 1.
00:59:34.430 --> 00:59:38.057
I didn't tell you what the
expectation of a beta was.
00:59:38.057 --> 00:59:39.890
We don't know what the
expectation of a beta
00:59:39.890 --> 00:59:42.200
is, agreed?
00:59:42.200 --> 00:59:45.980
If I wanted to find say, the
expectation of this thing that
00:59:45.980 --> 00:59:47.990
would be some good
estimator, we know
00:59:47.990 --> 00:59:49.902
that the maximum
of this guy-- what
00:59:49.902 --> 00:59:51.110
is the maximum of this thing?
00:59:54.880 --> 00:59:57.937
Well, it's just this thing,
it's the average of the Xi's.
00:59:57.937 --> 00:59:59.770
That's just the maximum
likelihood estimator
00:59:59.770 --> 01:00:00.353
for Bernoulli.
01:00:00.353 --> 01:00:01.702
We know it's the average.
01:00:01.702 --> 01:00:03.910
Do you think if I take the
expectation of this thing,
01:00:03.910 --> 01:00:05.295
I'm going to get the average?
01:00:13.864 --> 01:00:15.780
So actually, I'm not
going to get the average.
01:00:15.780 --> 01:00:19.790
I'm going to get this guy plus
this guy, divided by n plus 2.
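This posterior mean under the uniform prior is Laplace's rule of succession, and it is a one-liner to check (the function name is mine):

```python
def posterior_mean_uniform_prior(xs):
    """Posterior mean of p for Bernoulli data under the uniform (Beta(1, 1)) prior.

    The posterior is Beta(sum(xs) + 1, n - sum(xs) + 1), whose mean is
    (sum(xs) + 1) / (n + 2): pretend one extra success and one extra
    failure, then average. It is never exactly 0 or 1.
    """
    return (sum(xs) + 1) / (len(xs) + 2)

# Three zeros: (0 + 1) / (3 + 2) = 0.2, while the MLE would be 0.0.
print(posterior_mean_uniform_prior([0, 0, 0]))
```

Unlike the maximum likelihood estimator, this estimate stays strictly inside (0, 1), which is exactly the point made next about all-zero samples.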
01:00:27.246 --> 01:00:28.870
Let's look at what
this thing is doing.
01:00:28.870 --> 01:00:34.364
It's looking at the number
of ones and it's adding one.
01:00:34.364 --> 01:00:36.280
And this guy is looking
at the number of zeros
01:00:36.280 --> 01:00:39.190
and it's adding one.
01:00:39.190 --> 01:00:41.910
Why is it adding this one?
01:00:41.910 --> 01:00:42.840
What's going on here?
01:00:47.510 --> 01:00:52.040
This is going to matter
mostly when the number of ones
01:00:52.040 --> 01:00:56.060
is actually zero, or the
number of zeros is zero.
01:00:56.060 --> 01:01:00.000
Because what it does is just
pushes the estimate away from zero.
01:01:00.000 --> 01:01:03.020
And why is that something that
this Bayesian method actually
01:01:03.020 --> 01:01:04.600
does for you automatically?
01:01:04.600 --> 01:01:06.530
It's because when we
put this non-informative
01:01:06.530 --> 01:01:11.169
prior on p, which was
uniform on the interval 0, 1.
01:01:11.169 --> 01:01:12.960
In particular, we know
that the probability
01:01:12.960 --> 01:01:16.690
that p is equal to 0 is zero.
01:01:16.690 --> 01:01:19.180
And the probability p
is equal to 1 is zero.
01:01:19.180 --> 01:01:21.880
And so the problem
is that if I did not
01:01:21.880 --> 01:01:24.520
add this 1, then with some
positive probability,
01:01:24.520 --> 01:01:28.120
I would spit
out something that actually had
01:01:28.120 --> 01:01:30.640
p hat, which was equal to 0.
01:01:30.640 --> 01:01:33.280
If by chance, let's say
I have n is equal to 3,
01:01:33.280 --> 01:01:37.750
and I get only 0, 0, 0, that
could happen with probability
01:01:37.750 --> 01:01:41.470
1 minus p, cubed.
01:01:46.360 --> 01:01:47.880
That's not something
that I want.
01:01:47.880 --> 01:01:49.359
And I'm using my priors.
01:01:49.359 --> 01:01:51.900
My prior is not informative,
but somehow it captures the fact
01:01:51.900 --> 01:01:53.550
that I don't want to
believe p is going
01:01:53.550 --> 01:01:56.110
to be either equal to 0 or 1.
01:01:56.110 --> 01:01:59.790
So that's sort of
taken care of here.
01:01:59.790 --> 01:02:05.640
So let's move away a little
bit from the Bernoulli example,
01:02:05.640 --> 01:02:06.310
shall we?
01:02:06.310 --> 01:02:08.120
I think we've seen enough of it.
01:02:08.120 --> 01:02:10.860
And so let's talk about
the Gaussian model.
01:02:10.860 --> 01:02:12.690
Let's say I want to
do Gaussian inference.
01:02:17.859 --> 01:02:19.650
I want to do inference
in a Gaussian model,
01:02:19.650 --> 01:02:20.730
using Bayesian methods.
01:02:30.600 --> 01:02:39.840
What I want is that
X1, ..., Xn are, say, N 0, 1 iid.
01:02:44.720 --> 01:02:47.770
Sorry, N theta, 1, iid
conditionally on theta.
01:02:50.630 --> 01:02:56.300
That means that pn of
X1, Xn, given theta
01:02:56.300 --> 01:02:58.670
is equal to exactly
what I wrote before.
01:02:58.670 --> 01:03:04.760
So 1 over square root of 2 pi, to the
n, exponential minus one half
01:03:04.760 --> 01:03:09.579
sum of Xi minus theta squared.
01:03:09.579 --> 01:03:11.120
So that's just the
joint distribution
01:03:11.120 --> 01:03:13.410
of my Gaussian with mean theta.
01:03:13.410 --> 01:03:14.810
And now the
question is, what
01:03:14.810 --> 01:03:17.540
is the posterior distribution?
01:03:17.540 --> 01:03:22.500
Well here I said, let's use
the uninformative prior,
01:03:22.500 --> 01:03:23.840
which is an improper prior.
01:03:23.840 --> 01:03:25.490
It puts weight on everyone.
01:03:25.490 --> 01:03:29.310
That's the so-called uniform
on the entire real line.
01:03:29.310 --> 01:03:31.190
So that's certainly
not a density.
01:03:31.190 --> 01:03:34.360
But it can still just use this.
01:03:34.360 --> 01:03:40.430
So all I need to do
is get this divided
01:03:40.430 --> 01:03:44.690
by normalizing this thing.
01:03:44.690 --> 01:03:47.900
But if you look at
this, essentially I
01:03:47.900 --> 01:03:49.530
want to understand.
01:03:49.530 --> 01:03:52.470
So this is proportional
to the exponential
01:03:52.470 --> 01:03:55.040
minus one half
sum from I equal 1
01:03:55.040 --> 01:03:58.950
to n of Xi minus theta squared.
01:03:58.950 --> 01:04:01.370
And now I want to see
this thing as a density,
01:04:01.370 --> 01:04:03.560
not on the Xi's but on theta.
01:04:06.420 --> 01:04:10.120
What I want is a
density on theta.
01:04:10.120 --> 01:04:13.650
So it looks like I have
chances of getting something
01:04:13.650 --> 01:04:16.800
that looks like a Gaussian.
01:04:16.800 --> 01:04:19.500
To have a Gaussian, I would
need to see minus one half.
01:04:19.500 --> 01:04:21.660
And then I would need to
see theta minus something
01:04:21.660 --> 01:04:25.230
here, not just the sum of
something minus thetas.
01:04:25.230 --> 01:04:29.820
So I need to work
a little bit more,
01:04:29.820 --> 01:04:31.475
to expand the square here.
01:04:31.475 --> 01:04:32.850
So this thing here
is going to be
01:04:32.850 --> 01:04:37.330
equal to exponential minus
one half sum from I equal 1
01:04:37.330 --> 01:04:45.280
to n of Xi squared minus 2Xi
theta plus theta squared.
01:05:10.590 --> 01:05:13.590
Now what I'm going to do
is, everything remember
01:05:13.590 --> 01:05:15.870
is up to this little sign.
01:05:15.870 --> 01:05:19.710
So every time I see a term
that does not depend on theta,
01:05:19.710 --> 01:05:22.250
I can just push it in there
and just make it disappear.
01:05:22.250 --> 01:05:24.550
Agreed?
01:05:24.550 --> 01:05:28.420
This term here, exponential
minus one half sum of Xi
01:05:28.420 --> 01:05:31.661
squared, does it
depend on theta?
01:05:31.661 --> 01:05:32.160
No.
01:05:32.160 --> 01:05:33.420
So I'm just pushing it here.
01:05:33.420 --> 01:05:34.530
This guy, yes.
01:05:34.530 --> 01:05:35.970
And the other one, yes.
01:05:35.970 --> 01:05:45.020
So this is proportional to
exponential sum of the Xi times theta.
01:05:45.020 --> 01:05:47.780
And then I'm going to pull out
my theta, the minus one half
01:05:47.780 --> 01:05:50.150
canceled with the minus 2.
01:05:50.150 --> 01:05:56.460
And then I have minus
one half sum from I
01:05:56.460 --> 01:05:58.180
equal 1 to n of theta squared.
01:06:01.480 --> 01:06:03.460
Agreed?
01:06:03.460 --> 01:06:05.350
So now what this
thing looks like,
01:06:05.350 --> 01:06:09.570
this looks very much like some
theta minus something squared.
01:06:09.570 --> 01:06:15.110
This thing here is really
just n over 2 times theta.
01:06:18.520 --> 01:06:21.740
Sorry, times theta squared.
01:06:21.740 --> 01:06:25.120
So now what I need to do is to
write this of the form, theta
01:06:25.120 --> 01:06:26.230
minus something.
01:06:26.230 --> 01:06:31.820
Let's call it mu, squared,
divided by 2 sigma squared.
01:06:31.820 --> 01:06:34.160
I want to turn this into
that, maybe up to terms
01:06:34.160 --> 01:06:36.510
that do not depend on theta.
01:06:36.510 --> 01:06:39.062
That's what I'm
going to try to do.
01:06:39.062 --> 01:06:40.770
So that's called
completing the square.
01:06:40.770 --> 01:06:42.010
That's some exercises you do.
01:06:42.010 --> 01:06:44.260
You've done it probably,
already in the homework.
01:06:44.260 --> 01:06:46.560
And that's something
you do a lot when
01:06:46.560 --> 01:06:48.750
you do Bayesian
statistics, in particular.
01:06:48.750 --> 01:06:50.010
So let's do this.
01:06:50.010 --> 01:06:51.910
What is the leading
term going to be?
01:06:51.910 --> 01:06:54.160
Theta squared is going to
be multiplied by this thing.
01:06:54.160 --> 01:06:57.130
So I'm going to pull
out my n over 2.
01:06:57.130 --> 01:07:03.070
And then I'm going to write
this as minus n over 2.
01:07:03.070 --> 01:07:06.220
And then I'm going to write
theta minus something squared.
01:07:06.220 --> 01:07:08.890
And this something is going
to be one half of what
01:07:08.890 --> 01:07:10.160
I see in the cross-product.
01:07:12.966 --> 01:07:14.590
I need to actually
pull this thing out.
01:07:14.590 --> 01:07:18.340
So let me write it
like that first.
01:07:18.340 --> 01:07:21.860
So that's theta squared.
01:07:21.860 --> 01:07:30.680
And then I'm going to write it
as minus 2 times 1 over n sum
01:07:30.680 --> 01:07:36.980
from I equal 1 to n
of Xi's times theta.
01:07:36.980 --> 01:07:39.874
That's exactly just a rewriting
of what we had before.
01:07:39.874 --> 01:07:41.540
And that should look
much more familiar.
01:07:44.990 --> 01:07:49.700
A squared minus 2 A B,
and then I missed something.
01:07:49.700 --> 01:07:51.860
So this thing, I'm going
to be able to rewrite
01:07:51.860 --> 01:07:57.930
as theta minus Xn bar squared.
01:07:57.930 --> 01:08:00.720
But then I need to remove
the square of Xn bar.
01:08:00.720 --> 01:08:01.740
Because it's not here.
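The completing-the-square manipulation being carried out on the board can be summarized in one chain, using the same notation:

```latex
\pi(\theta \mid X_1, \dots, X_n)
  \propto \exp\Big(-\tfrac{1}{2}\sum_{i=1}^n (X_i - \theta)^2\Big)
  \propto \exp\Big(-\tfrac{n}{2}\theta^2 + \Big(\sum_{i=1}^n X_i\Big)\theta\Big)
  = \exp\Big(-\tfrac{n}{2}\big(\theta^2 - 2\bar{X}_n \theta\big)\Big)
  = \exp\Big(-\tfrac{n}{2}(\theta - \bar{X}_n)^2\Big)\exp\Big(\tfrac{n}{2}\bar{X}_n^2\Big)
  \propto \exp\Big(-\tfrac{n}{2}(\theta - \bar{X}_n)^2\Big),
```

where every factor not depending on theta is absorbed into the proportionality sign; the leftover exp((n/2) Xn bar squared) is exactly the term that "goes into the little sign over there."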
01:08:09.210 --> 01:08:11.297
So I just complete the square.
01:08:11.297 --> 01:08:13.880
And then I actually really don't
care what this thing actually
01:08:13.880 --> 01:08:16.899
was, because it's going to go
again into the little proportionality
01:08:16.899 --> 01:08:18.416
sign over there.
01:08:18.416 --> 01:08:19.790
So this thing
eventually is going
01:08:19.790 --> 01:08:24.620
to be proportional
to exponential
01:08:24.620 --> 01:08:31.090
of minus n over 2 times theta
of minus Xn bar squared.
01:08:31.090 --> 01:08:33.370
And so we know that if
this is a density that's
01:08:33.370 --> 01:08:44.100
proportional to this guy, it has
to be some N with mean Xn bar.
01:08:44.100 --> 01:08:47.520
And variance, this is supposed
to be 1 over sigma squared.
01:08:47.520 --> 01:08:49.318
This guy over here, this n.
01:08:49.318 --> 01:08:50.609
So that's really just 1 over n.
01:08:53.870 --> 01:09:01.740
So the posterior
distribution is a Gaussian
01:09:01.740 --> 01:09:05.819
centered at the average
of my observations.
01:09:05.819 --> 01:09:08.430
And with variance, 1 over n.
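This conclusion, posterior N(Xn bar, 1/n) under the flat improper prior, can be encoded directly (a sketch with my own function name):

```python
import statistics

def gaussian_posterior_flat_prior(xs):
    """Posterior of theta for N(theta, 1) data under the flat improper prior.

    Completing the square gives N(mean(xs), 1/n): centered at the sample
    average, with variance shrinking as n grows.
    """
    n = len(xs)
    return statistics.mean(xs), 1.0 / n

# Two observations: centered at their average 2.0, variance 1/2.
print(gaussian_posterior_flat_prior([1.0, 3.0]))
```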
01:09:13.307 --> 01:09:14.140
Everybody's with me?
01:09:16.740 --> 01:09:19.779
Why am I saying this? This was
the output of some computation.
01:09:19.779 --> 01:09:21.450
But it sort of
makes sense, right?
01:09:21.450 --> 01:09:24.210
It's really telling me that
the more observations I have,
01:09:24.210 --> 01:09:26.250
the more concentrated
this posterior is.
01:09:26.250 --> 01:09:27.819
Concentrated around what?
01:09:27.819 --> 01:09:30.529
Well around this Xn bar.
01:09:30.529 --> 01:09:33.140
That looks like something
we've sort of seen before.
01:09:33.140 --> 01:09:35.420
But it does not have the
same meaning, somehow.
01:09:35.420 --> 01:09:37.580
This is really just the
posterior distribution.
01:09:40.490 --> 01:09:43.160
It's sort of a sanity check,
that I have this 1 over n
01:09:43.160 --> 01:09:44.139
when I have Xn bar.
01:09:44.139 --> 01:09:45.680
But it's not the
same thing as saying
01:09:45.680 --> 01:09:48.429
that the variance of Xn bar was
1 over n, like we had before.
01:09:55.670 --> 01:09:59.390
As an exercise,
I would recommend
01:09:59.390 --> 01:10:10.140
if you don't get it,
just try pi of theta
01:10:10.140 --> 01:10:15.290
to be equal to some N mu, 1.
01:10:18.120 --> 01:10:22.350
Here, the prior that we used
was completely non-informative.
01:10:22.350 --> 01:10:25.594
What happens if I take my prior
to be some Gaussian, which
01:10:25.594 --> 01:10:27.510
is centered at mu and
it has the same variance
01:10:27.510 --> 01:10:30.120
as the other guys?
01:10:30.120 --> 01:10:32.204
So what's going to
happen here is that we're
01:10:32.204 --> 01:10:33.120
going to put a weight.
01:10:33.120 --> 01:10:34.536
And everything
that's away from mu
01:10:34.536 --> 01:10:38.469
is going to actually
get less weight.
01:10:38.469 --> 01:10:40.260
I want to know how I'm
going to be updating
01:10:40.260 --> 01:10:41.850
this prior into a posterior.
01:10:44.520 --> 01:10:47.040
Everybody sees what
I'm saying here?
01:10:47.040 --> 01:10:50.040
So that means that pi of theta
has the density proportional
01:10:50.040 --> 01:10:55.680
to exponential minus one
half theta minus mu squared.
01:10:55.680 --> 01:11:00.540
So I need to multiply
my posterior with this,
01:11:00.540 --> 01:11:01.849
and then see.
01:11:01.849 --> 01:11:03.390
It's actually going
to be a Gaussian.
01:11:03.390 --> 01:11:04.774
This is also a conjugate prior.
01:11:04.774 --> 01:11:06.440
It's going to spit
out another Gaussian.
01:11:06.440 --> 01:11:09.390
You're going to have to complete
a square again, and just check
01:11:09.390 --> 01:11:10.814
what it's actually giving you.
01:11:10.814 --> 01:11:12.480
And so spoiler alert,
it's going to look
01:11:12.480 --> 01:11:14.790
like you get an extra
observation, which is actually
01:11:14.790 --> 01:11:15.360
equal to mu.
01:11:18.800 --> 01:11:22.440
It's going to be the average
of n plus 1 observations.
01:11:22.440 --> 01:11:24.110
The first n being X1 to Xn.
01:11:24.110 --> 01:11:27.530
And then, the last one being mu.
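The spoiler for this exercise can be written down directly (a sketch, with my own function name, assuming the N(mu, 1) conjugate prior just described): the prior behaves like one extra observation equal to mu.

```python
def gaussian_posterior_normal_prior(xs, mu):
    """Posterior of theta for N(theta, 1) data under an N(mu, 1) prior.

    Completing the square again gives N((sum(xs) + mu) / (n + 1), 1 / (n + 1)):
    the prior mean mu counts as one extra observation.
    """
    n = len(xs)
    return (sum(xs) + mu) / (n + 1), 1.0 / (n + 1)

# The average of the n + 1 "observations" 1.0, 3.0, and mu = 5.0 is 3.0.
print(gaussian_posterior_normal_prior([1.0, 3.0], mu=5.0))
```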
01:11:27.530 --> 01:11:30.860
And it sort of makes sense.
01:11:30.860 --> 01:11:34.700
That's actually a
fairly simple exercise.
01:11:34.700 --> 01:11:36.441
Rather than going
into more computation,
01:11:36.441 --> 01:11:37.940
this is something
you can definitely
01:11:37.940 --> 01:11:41.510
do when you're in the
comfort of your room.
01:11:41.510 --> 01:11:43.910
I want to talk about
other types of priors.
01:11:43.910 --> 01:11:47.330
The first thing I said is,
there's this beta prior
01:11:47.330 --> 01:11:50.390
that I just pulled out of my hat
and that was just convenient.
01:11:50.390 --> 01:11:52.860
Then there was this
non-informative prior.
01:11:52.860 --> 01:11:53.720
It was convenient.
01:11:53.720 --> 01:11:56.300
It was non-informative, so
if you don't know anything
01:11:56.300 --> 01:11:58.950
else maybe that's
what you want to do.
01:11:58.950 --> 01:12:01.940
The question is, are there
any other priors that
01:12:01.940 --> 01:12:04.490
are sort of principled
and generic, in the sense
01:12:04.490 --> 01:12:08.600
that the uninformative
prior was generic, right?
01:12:08.600 --> 01:12:11.400
It was equal to 1, that's
as generic as it gets.
01:12:11.400 --> 01:12:14.190
So is there anything
that's generic as well?
01:12:14.190 --> 01:12:17.180
Well, there are these priors
that are called Jeffreys priors.
01:12:17.180 --> 01:12:20.540
And Jeffreys prior is
proportional to the square root
01:12:20.540 --> 01:12:23.290
of the determinant of the
Fisher information of theta.
01:12:26.360 --> 01:12:28.600
This is actually a
weird thing to do.
01:12:28.600 --> 01:12:31.380
It says, look at your model.
01:12:31.380 --> 01:12:34.152
Your model is going to
have a Fisher information.
01:12:34.152 --> 01:12:34.985
Let's say it exists.
01:12:38.150 --> 01:12:39.957
Because we know it
does not always exist.
01:12:39.957 --> 01:12:41.540
For example, in the
multinomial model,
01:12:41.540 --> 01:12:44.660
we didn't have a
Fisher information.
01:12:44.660 --> 01:12:46.670
The determinant of
a matrix is somehow
01:12:46.670 --> 01:12:48.800
measuring the size of a matrix.
01:12:48.800 --> 01:12:50.540
If you don't trust
me, just think
01:12:50.540 --> 01:12:53.870
about the matrix being
of size one by one,
01:12:53.870 --> 01:12:56.910
then the determinant is just
the number that you have there.
01:12:56.910 --> 01:13:00.770
And so this is really something
that looks like the Fisher
01:13:00.770 --> 01:13:01.670
information.
01:13:04.374 --> 01:13:06.290
It's proportional to the
amount of information
01:13:06.290 --> 01:13:09.620
that you have at
a certain point.
01:13:09.620 --> 01:13:12.310
And so what my prior
is saying well,
01:13:12.310 --> 01:13:14.280
I want to put more weights
on those thetas that
01:13:14.280 --> 01:13:17.050
are going to just extract more
information from the data.
01:13:20.510 --> 01:13:22.760
You can actually
compute those things.
01:13:22.760 --> 01:13:26.215
In the first example,
Jeffreys prior
01:13:26.215 --> 01:13:28.360
is something that
looks like this.
01:13:28.360 --> 01:13:30.230
In one dimension,
Fisher information
01:13:30.230 --> 01:13:33.476
is essentially 1
over the variance.
01:13:33.476 --> 01:13:35.600
That's just 1 over the
square root of the variance,
01:13:35.600 --> 01:13:37.550
because I have the square root.
01:13:37.550 --> 01:13:45.770
And when I have the Jeffreys
prior in the Gaussian
01:13:45.770 --> 01:13:48.770
case, this is the
identity matrix
01:13:48.770 --> 01:13:50.840
that I would have in
the Gaussian case.
01:13:50.840 --> 01:13:52.580
The determinant of
the identities is 1.
01:13:52.580 --> 01:13:56.180
So square root of 1 is 1, and
so I would basically get 1.
01:13:56.180 --> 01:13:59.170
And that gives me my improper
prior, my uninformative prior
01:13:59.170 --> 01:14:01.020
that I had.
01:14:01.020 --> 01:14:03.690
So the uninformative
prior, constant equal to 1, is fine.
01:14:03.690 --> 01:14:06.780
Clearly, all the thetas
carry the same information
01:14:06.780 --> 01:14:08.160
in the Gaussian model.
01:14:08.160 --> 01:14:10.200
Whether I translate
it here or here,
01:14:10.200 --> 01:14:12.120
it's pretty clear none
of them is actually
01:14:12.120 --> 01:14:13.140
better than the other.
01:14:13.140 --> 01:14:16.530
But clearly for
the Bernoulli case,
01:14:16.530 --> 01:14:22.560
the p's that are closer
to the boundary carry
01:14:22.560 --> 01:14:23.940
more information.
01:14:23.940 --> 01:14:26.250
I sort of like those
guys, because they just
01:14:26.250 --> 01:14:27.757
carry more information.
01:14:27.757 --> 01:14:29.340
So what I do is, I
take this function.
01:14:29.340 --> 01:14:30.300
So p times 1 minus p.
01:14:30.300 --> 01:14:34.170
Remember, it's something
that looks like this.
01:14:34.170 --> 01:14:35.390
On the interval 0, 1.
01:14:38.710 --> 01:14:40.979
This guy, 1 over square
root of p times 1 minus p
01:14:40.979 --> 01:14:42.395
is something that
looks like this.
01:14:45.780 --> 01:14:47.620
Agreed?
01:14:47.620 --> 01:14:49.780
What it's doing is, it
sort of wants to push
01:14:49.780 --> 01:14:54.586
towards the p's that actually
carry more information.
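To make the shape on the board concrete, here is a minimal sketch of my own (not from the lecture) evaluating the Bernoulli Jeffreys prior 1 over the square root of p times 1 minus p, which is smallest at one half and blows up at the boundary; normalized, it is the Beta(1/2, 1/2) density.

```python
import math

# Jeffreys prior for the Bernoulli model, up to normalization:
# pi(p) proportional to sqrt(I(p)) = 1 / sqrt(p * (1 - p)).
def jeffreys_bernoulli(p):
    return 1.0 / math.sqrt(p * (1.0 - p))

# The density is smallest at p = 1/2 and blows up near 0 and 1,
# putting more weight on the p's that carry more information.
assert jeffreys_bernoulli(0.5) < jeffreys_bernoulli(0.1) < jeffreys_bernoulli(0.01)

# Normalized, this is the Beta(1/2, 1/2) density; the normalizing
# constant is B(1/2, 1/2) = pi, so the density at p = 1/2 is 2/pi.
print(jeffreys_bernoulli(0.5) / math.pi)  # 2/pi, about 0.6366
```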
01:14:54.586 --> 01:14:56.210
Whether you want to
bias your data that
01:14:56.210 --> 01:14:59.120
way or not, is something
you need to think about.
01:14:59.120 --> 01:15:01.550
When you put a prior on your
parameter,
01:15:01.550 --> 01:15:06.140
you're sort of biasing
your data towards this idea.
01:15:06.140 --> 01:15:07.700
That's maybe not
such a good idea,
01:15:07.700 --> 01:15:13.160
when you have some p that's
actually close to one half,
01:15:13.160 --> 01:15:13.820
for example.
01:15:13.820 --> 01:15:14.960
You're actually
saying, no I don't
01:15:14.960 --> 01:15:16.610
want to see a p that's
close to one half.
01:15:16.610 --> 01:15:18.350
Just make a decision,
one way or another.
01:15:18.350 --> 01:15:19.699
But just make a decision.
01:15:19.699 --> 01:15:20.990
So it's forcing you to do that.
01:15:23.690 --> 01:15:26.090
Jeffreys prior, I'm
running out of time
01:15:26.090 --> 01:15:29.850
so I don't want to go
into too much detail.
01:15:29.850 --> 01:15:31.670
We'll probably stop
here, actually.
01:15:44.570 --> 01:15:47.810
So Jeffreys priors have
this very nice property.
01:15:47.810 --> 01:15:51.740
It's that they actually do not
care about the parameterization
01:15:51.740 --> 01:15:53.150
of your space.
01:15:53.150 --> 01:15:56.360
If you actually have
p and you suddenly
01:15:56.360 --> 01:15:58.850
decide that p is not the
right parameter for Bernoulli,
01:15:58.850 --> 01:16:00.740
but it's p squared.
01:16:00.740 --> 01:16:03.200
You could decide to
parameterize this by p squared.
01:16:03.200 --> 01:16:05.840
Maybe your doctor is
actually much more able
01:16:05.840 --> 01:16:08.840
to formulate some prior
assumption on p squared,
01:16:08.840 --> 01:16:09.800
rather than p.
01:16:09.800 --> 01:16:11.100
You never know.
01:16:11.100 --> 01:16:14.390
And so what happens is
that Jeffreys priors
01:16:14.390 --> 01:16:15.990
are invariant under this.
01:16:15.990 --> 01:16:18.560
And the reason is because
the information carried by p
01:16:18.560 --> 01:16:21.130
is the same as the information
carried by p squared, somehow.
01:16:28.822 --> 01:16:30.280
They're essentially
the same thing.
01:16:32.950 --> 01:16:34.630
You need to have a one-to-one map.
01:16:34.630 --> 01:16:37.896
Where basically for
each parameter,
01:16:37.896 --> 01:16:39.020
you have another parameter.
01:16:39.020 --> 01:16:40.810
Let's call eta the
new parameter.
01:16:45.790 --> 01:16:50.380
The PDF of the new prior
indexed by eta this time
01:16:50.380 --> 01:16:52.990
is actually also
Jeffreys prior.
01:16:52.990 --> 01:16:55.174
But this time, the
new Fisher information
01:16:55.174 --> 01:16:57.340
is not the Fisher information
with respect to theta.
01:16:57.340 --> 01:17:00.010
But it's this Fisher
information associated
01:17:00.010 --> 01:17:03.130
to this statistical
model indexed by eta.
01:17:03.130 --> 01:17:08.110
So essentially, when you
change the parameterization
01:17:08.110 --> 01:17:10.600
of your model, you still
get Jeffreys prior
01:17:10.600 --> 01:17:12.820
for the new parameterization.
01:17:12.820 --> 01:17:15.020
Which is, in a way,
a desirable property.
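To illustrate that invariance numerically, here is a minimal sketch of my own (not from the lecture): reparameterize the Bernoulli model by eta equal to p squared, and check that Jeffreys prior written directly in eta equals the change-of-variables transform of the Jeffreys prior in p.

```python
import math

def fisher_p(p):
    # Fisher information of the Bernoulli model in the parameter p
    return 1.0 / (p * (1.0 - p))

def fisher_eta(eta):
    # Under eta = p**2: p = sqrt(eta), dp/deta = 1 / (2 * sqrt(eta)),
    # and the Fisher information transforms as
    # I(eta) = I(p) * (dp/deta)**2.
    p = math.sqrt(eta)
    dp_deta = 1.0 / (2.0 * math.sqrt(eta))
    return fisher_p(p) * dp_deta ** 2

for p in [0.2, 0.5, 0.9]:
    eta = p ** 2
    # Jeffreys prior written directly in eta ...
    direct = math.sqrt(fisher_eta(eta))
    # ... matches the change of variables applied to the Jeffreys
    # prior in p: pi_tilde(eta) = pi(p(eta)) * |dp/deta|.
    transformed = math.sqrt(fisher_p(p)) / (2.0 * p)
    assert abs(direct - transformed) < 1e-12
```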
01:17:19.410 --> 01:17:21.920
Jeffreys priors are just
uninformative priors,
01:17:21.920 --> 01:17:24.140
or priors you want
to use when you
01:17:24.140 --> 01:17:26.480
want a systematic way without
really thinking about what
01:17:26.480 --> 01:17:27.396
to pick for your prior.
01:17:35.440 --> 01:17:37.060
I'll finish this next time.
01:17:37.060 --> 01:17:39.910
And we'll talk about
Bayesian confidence regions.
01:17:39.910 --> 01:17:41.620
We'll talk about
Bayesian estimation.
01:17:41.620 --> 01:17:44.074
Once I have a posterior,
what do I get?
01:17:44.074 --> 01:17:45.490
And basically, the
only message is
01:17:45.490 --> 01:17:47.860
going to be that you
might want to integrate
01:17:47.860 --> 01:17:48.910
against the posterior.
01:17:48.910 --> 01:17:51.490
Find the posterior, the
expectation of your posterior
01:17:51.490 --> 01:17:52.130
distribution.
01:17:52.130 --> 01:17:54.010
That's a good point
estimator for theta.
01:17:56.860 --> 01:18:01.020
We'll just do a
couple of computations.