WEBVTT
00:00:00.700 --> 00:00:03.880
We are now ready to
move on to a model which
00:00:03.880 --> 00:00:07.240
is quite interesting
and quite realistic.
00:00:07.240 --> 00:00:11.900
This is a model in which we have
an unknown parameter modeled
00:00:11.900 --> 00:00:14.970
as a random variable
that we try to estimate.
00:00:14.970 --> 00:00:17.080
This is the random
variable, Theta.
00:00:17.080 --> 00:00:18.880
And we have multiple
observations
00:00:18.880 --> 00:00:20.540
of that random variable.
00:00:20.540 --> 00:00:22.380
Each one of those
observations is
00:00:22.380 --> 00:00:24.720
equal to the unknown
random variable,
00:00:24.720 --> 00:00:27.490
plus some additive noise.
00:00:27.490 --> 00:00:30.900
This is a model that appears
quite often in practice.
00:00:30.900 --> 00:00:32.509
It is often the case
that we're trying
00:00:32.509 --> 00:00:35.730
to estimate a certain
quantity, but we can only
00:00:35.730 --> 00:00:39.300
observe values of that quantity
in the presence of noise.
00:00:39.300 --> 00:00:42.130
And because of the
noise, what we want to do
00:00:42.130 --> 00:00:45.280
is to try to measure
it multiple times.
00:00:45.280 --> 00:00:48.470
And so we have multiple
such measurement equations.
00:00:48.470 --> 00:00:52.020
And then we want to combine all
of the observations together
00:00:52.020 --> 00:00:56.340
to come up with a good
estimate of that parameter.
00:00:56.340 --> 00:00:58.440
The assumptions that
we will be making
00:00:58.440 --> 00:01:01.950
are that Theta is a
normal random variable.
00:01:01.950 --> 00:01:05.019
It has a certain mean
that we denote by x0.
00:01:05.019 --> 00:01:08.030
The reason for this strange
notation will be seen later.
00:01:08.030 --> 00:01:10.800
And it also has a
certain variance.
00:01:10.800 --> 00:01:14.130
The noise terms are also
normal random variables
00:01:14.130 --> 00:01:17.360
with 0 mean and a
certain variance.
00:01:17.360 --> 00:01:21.300
And finally, we assume that
these basic random variables
00:01:21.300 --> 00:01:26.240
that define our model
are all independent.
00:01:26.240 --> 00:01:29.680
Based on these assumptions, now
we would like to estimate Theta
00:01:29.680 --> 00:01:32.840
on the basis of the X's.
00:01:32.840 --> 00:01:36.140
And as usual, in the
Bayesian setting,
00:01:36.140 --> 00:01:38.759
what we want to do is to
calculate the posterior
00:01:38.759 --> 00:01:42.360
distribution of
Theta, given the X's.
00:01:42.360 --> 00:01:45.090
The Bayes rule
has the usual form
00:01:45.090 --> 00:01:47.590
for the case of continuous
random variables.
00:01:47.590 --> 00:01:51.740
The only remark that needs to
be made is that in this case,
00:01:51.740 --> 00:01:55.729
there are multiple X's,
so X up here stands
00:01:55.729 --> 00:01:58.265
for the vector of
the observations
00:01:58.265 --> 00:01:59.920
that we have obtained.
00:01:59.920 --> 00:02:02.160
And similarly,
little x will stand
00:02:02.160 --> 00:02:06.260
for the vector of the values
of these observations.
00:02:06.260 --> 00:02:09.070
So we need now to start
making some progress
00:02:09.070 --> 00:02:11.810
towards calculating
this term here.
00:02:11.810 --> 00:02:15.080
What is the distribution of
the vector of measurements
00:02:15.080 --> 00:02:16.600
given Theta?
00:02:16.600 --> 00:02:19.030
Before we move to
the vector case,
00:02:19.030 --> 00:02:23.570
let us look at one of the
measurements in isolation.
00:02:23.570 --> 00:02:26.600
This is something that
we have already seen.
00:02:26.600 --> 00:02:30.770
If I tell you the value
of the random variable,
00:02:30.770 --> 00:02:35.210
Theta, which is what happens
in this conditional universe
00:02:35.210 --> 00:02:38.270
when you condition on
the value of Theta,
00:02:38.270 --> 00:02:42.300
then in that universe,
the random variable, Xi,
00:02:42.300 --> 00:02:45.610
is equal to the numerical
value that you gave me
00:02:45.610 --> 00:02:47.460
for Theta, plus Wi.
00:02:50.329 --> 00:02:55.360
And because Wi is independent
from the random variable Theta,
00:02:55.360 --> 00:02:57.579
knowing the value of
the random variable
00:02:57.579 --> 00:03:00.300
Theta does not change
the distribution of Wi.
00:03:00.300 --> 00:03:02.990
It will still have this
normal distribution.
00:03:02.990 --> 00:03:08.580
So Xi is a normal of this
kind plus a constant.
00:03:08.580 --> 00:03:12.780
And so Xi is a normal
random variable
00:03:12.780 --> 00:03:16.470
with mean equal to the
constant that we added,
00:03:16.470 --> 00:03:19.020
and variance equal to
the original variance
00:03:19.020 --> 00:03:21.420
of the random variable, Wi.
00:03:21.420 --> 00:03:25.750
And so we can now write down
the PDF, the conditional PDF,
00:03:25.750 --> 00:03:27.325
of Xi.
00:03:27.325 --> 00:03:30.140
There's going to be a
normalizing constant.
00:03:30.140 --> 00:03:32.700
And then the usual
exponential term,
00:03:32.700 --> 00:03:39.730
which is going to be xi minus
the mean of the distribution,
00:03:39.730 --> 00:03:42.704
which is theta.
00:03:42.704 --> 00:03:45.620
And then we divide by
the usual variance term.
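A minimal sketch of this conditional PDF in Python; the numeric arguments in the usage note are assumed example values, not from the lecture.

```python
import math

# Conditional PDF f_{Xi|Theta}(xi | theta): given Theta = theta,
# Xi = theta + Wi is normal with mean theta and variance sigma_i**2.
def cond_pdf(xi, theta, sigma_i):
    # normalizing constant times the usual exponential term
    return math.exp(-(xi - theta) ** 2 / (2 * sigma_i ** 2)) / (
        sigma_i * math.sqrt(2 * math.pi)
    )
```

For instance, `cond_pdf(0.0, 0.0, 1.0)` gives the standard normal density at its mean, about 0.3989, and the density decreases as xi moves away from theta.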
00:03:48.280 --> 00:03:53.730
Let us move next to
this distribution here.
00:03:53.730 --> 00:03:58.940
This is a shorthand
notation for the joint PDF
00:03:58.940 --> 00:04:03.480
of the random
variables X1 up to Xn,
00:04:03.480 --> 00:04:07.610
conditional on the
random variable Theta.
00:04:07.610 --> 00:04:11.446
So it's really a function
of multiple variables.
00:04:14.990 --> 00:04:17.820
And how do we proceed now?
00:04:17.820 --> 00:04:20.899
Here is the crucial observation.
00:04:20.899 --> 00:04:26.090
If I tell you the value of
the random variable capital
00:04:26.090 --> 00:04:32.400
Theta as before, then
you argue as follows.
00:04:32.400 --> 00:04:35.460
All of these random
variables are independent.
00:04:35.460 --> 00:04:39.180
So if I tell you the value
of the random variable Theta,
00:04:39.180 --> 00:04:43.490
this does not change the joint
distribution of the Wi's.
00:04:43.490 --> 00:04:46.680
The Wi's were independent
when we started,
00:04:46.680 --> 00:04:50.000
so they remain independent
in the conditional universe.
00:04:58.630 --> 00:05:01.110
And since the Wi's
are independent
00:05:01.110 --> 00:05:05.560
and Xi's are obtained from the
Wi's by just adding a constant,
00:05:05.560 --> 00:05:09.670
this means that
the Xi's are also
00:05:09.670 --> 00:05:13.420
independent in this
conditional universe.
00:05:13.420 --> 00:05:16.080
Once I tell you
the value of Theta,
00:05:16.080 --> 00:05:18.490
then because the
noises are independent,
00:05:18.490 --> 00:05:21.460
the observations are
also independent.
00:05:21.460 --> 00:05:27.810
But this means that the joint
PDF factors as a product
00:05:27.810 --> 00:05:32.330
of the individual
marginal PDFs of the Xi's.
00:05:37.900 --> 00:05:41.540
And these PDFs, we
have already found.
00:05:41.540 --> 00:05:44.409
So now, we can put
everything together
00:05:44.409 --> 00:05:48.560
to write down a formula
for the posterior PDF
00:05:48.560 --> 00:05:50.400
using the Bayes rule.
00:05:50.400 --> 00:05:57.300
We have this denominator term,
which I will write first,
00:05:57.300 --> 00:06:00.580
and which we do
not need to evaluate.
00:06:00.580 --> 00:06:04.320
Then we have the
marginal PDF of Theta.
00:06:04.320 --> 00:06:08.440
Now since Theta is normal
with these parameters,
00:06:08.440 --> 00:06:16.120
this is of the form e to
the minus theta minus x0
00:06:16.120 --> 00:06:21.690
squared over 2 sigma 0 squared.
00:06:21.690 --> 00:06:27.500
And then we have this joint
density of the X's conditioned
00:06:27.500 --> 00:06:32.840
on Theta, which we have already
found; it is this product here.
00:06:32.840 --> 00:06:36.710
It's a product of n terms,
one for each observation.
00:06:36.710 --> 00:06:40.480
And each one of these terms
is what we have found earlier,
00:06:40.480 --> 00:06:43.780
so I'm just substituting
this expression up here.
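A sketch of this factorization: because the Xi's are conditionally independent given Theta, the joint conditional PDF is the product of the individual conditional normal PDFs. The inputs are assumed example values.

```python
import math

# Normal density with the given mean and standard deviation.
def normal_pdf(x, mean, sigma):
    return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )

# Joint conditional PDF of X1..Xn given Theta = theta: a product of n
# terms, one per observation, by conditional independence.
def joint_cond_pdf(xs, theta, sigmas):
    prod = 1.0
    for xi, sigma_i in zip(xs, sigmas):
        prod *= normal_pdf(xi, theta, sigma_i)
    return prod
```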
00:06:52.060 --> 00:06:56.390
Now once we have obtained the
observations, so the value
00:06:56.390 --> 00:07:00.130
of the random variable capital
X, that is, the value little x,
00:07:00.130 --> 00:07:01.710
is fixed.
00:07:01.710 --> 00:07:07.430
Once it is fixed, then the x's
that appear here are constant.
00:07:07.430 --> 00:07:11.040
So in particular, this
term here is a constant.
00:07:11.040 --> 00:07:12.840
We do not bother with it.
00:07:12.840 --> 00:07:19.030
And what we have is a constant
times an exponential of terms
00:07:19.030 --> 00:07:21.720
that are quadratic in theta.
00:07:21.720 --> 00:07:25.020
So we recognize this
kind of expression.
00:07:25.020 --> 00:07:29.410
It has to correspond to
a normal distribution.
00:07:29.410 --> 00:07:34.120
And this is the first
conclusion of this exercise.
00:07:34.120 --> 00:07:38.770
That is, the posterior PDF
of the parameter, Theta,
00:07:38.770 --> 00:07:43.560
given our observations, this
posterior PDF is normal.
00:07:43.560 --> 00:07:47.890
We have e to a quadratic
function in theta.
00:07:47.890 --> 00:07:50.080
And that quadratic
function also involves
00:07:50.080 --> 00:07:53.659
the specific values of the
X's that we have obtained.
00:07:53.659 --> 00:07:57.710
Let us copy what we have
found and rearrange it.
00:07:57.710 --> 00:08:02.430
Once more, we have a
constant, then the exponential
00:08:02.430 --> 00:08:06.170
of the negative of some
quadratic function in theta.
00:08:06.170 --> 00:08:07.955
And the specific
quadratic function
00:08:07.955 --> 00:08:15.370
that we calculated just before
takes this particular form.
00:08:15.370 --> 00:08:19.250
What is the mean of this
normal distribution?
00:08:19.250 --> 00:08:23.340
The mean is the same as the peak.
00:08:23.340 --> 00:08:30.180
And to find the peak, the
location at which this PDF is
00:08:30.180 --> 00:08:34.940
largest, what we do
is we try to find
00:08:34.940 --> 00:08:39.190
the place at which this
quadratic function is smallest.
00:08:39.190 --> 00:08:42.890
So what we do is to take
the derivative with respect
00:08:42.890 --> 00:08:50.520
to theta of this
quadratic, and set it to 0.
00:08:50.520 --> 00:08:54.530
This gives us a sum of terms.
00:08:54.530 --> 00:08:59.970
The derivative of
the typical term
00:08:59.970 --> 00:09:14.970
is going to be theta minus xi,
divided by sigma i squared.
00:09:14.970 --> 00:09:18.340
And this expression
must be equal to 0
00:09:18.340 --> 00:09:23.440
if theta is at the peak of
the posterior distribution.
00:09:23.440 --> 00:09:26.930
And so we now rearrange
this equation.
00:09:26.930 --> 00:09:30.700
We split and take first
the term involving theta,
00:09:30.700 --> 00:09:33.465
which gives us a
contribution of this kind.
00:09:39.190 --> 00:09:43.160
And we move the terms involving
x's to the other side.
00:09:43.160 --> 00:09:45.365
And this gives us
this expression.
00:09:50.830 --> 00:09:52.960
And finally, we
take this sum here
00:09:52.960 --> 00:09:56.630
and move it to the
denominator of the other side.
00:09:56.630 --> 00:10:00.940
And this gives us the
final form of the solution:
00:10:00.940 --> 00:10:05.280
the peak of the posterior
distribution, which is also
00:10:05.280 --> 00:10:08.400
the same as the conditional
expectation of the posterior
00:10:08.400 --> 00:10:09.240
distribution.
00:10:09.240 --> 00:10:12.670
Whenever we have a
normal distribution,
00:10:12.670 --> 00:10:15.550
the expected value is
the same as the place
00:10:15.550 --> 00:10:17.390
where the distribution
is largest.
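The final formula can be sketched numerically: the MAP estimate (equal to the conditional expectation, since the posterior is normal) is a precision-weighted average in which the prior mean x0 is treated as one more observation. All numeric values below are assumed for illustration.

```python
# Assumed example values for the prior and the noise standard deviations.
x0, sigma0 = 5.0, 2.0        # prior mean and std of Theta
xs = [4.2, 5.1, 7.0]         # observed values x1..xn
sigmas = [1.0, 0.5, 3.0]     # noise stds sigma_1..sigma_n

# Treat x0 as one more data point, weighted like the observations.
values = [x0] + xs
weights = [1.0 / s ** 2 for s in [sigma0] + sigmas]  # 1/variance weights

# theta_hat = (sum_i x_i / sigma_i^2) / (sum_i 1 / sigma_i^2), i = 0..n
theta_hat = sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Note how the estimate always lands between the smallest and largest of the combined values, as any weighted average must.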
00:10:21.280 --> 00:10:25.000
Let us now conclude with a few
comments and words about how
00:10:25.000 --> 00:10:28.120
to interpret the
result that we found.
00:10:28.120 --> 00:10:32.210
First, let me emphasize that the
same conclusions that we have
00:10:32.210 --> 00:10:35.680
obtained for the case
of a single observation
00:10:35.680 --> 00:10:40.110
go through in this case as well.
00:10:40.110 --> 00:10:43.250
The posterior distribution
of the unknown parameter
00:10:43.250 --> 00:10:47.480
is still a normal distribution.
00:10:47.480 --> 00:10:49.860
Our state of
knowledge about Theta
00:10:49.860 --> 00:10:52.000
after we obtain
the observations is
00:10:52.000 --> 00:10:54.450
described by a
normal distribution.
00:10:54.450 --> 00:10:56.190
Because it is a
normal distribution,
00:10:56.190 --> 00:11:00.690
the location of its peak is
the same as the expected value.
00:11:00.690 --> 00:11:03.070
And for this reason, the
conditional expectation
00:11:03.070 --> 00:11:05.640
estimate and the maximum
a posteriori probability
00:11:05.640 --> 00:11:07.490
estimates coincide.
00:11:07.490 --> 00:11:11.760
And finally, the form of
the estimates that we get is
00:11:11.760 --> 00:11:16.700
a linear function in the xi's.
00:11:16.700 --> 00:11:20.430
And this linearity is a very
convenient property to have,
00:11:20.430 --> 00:11:23.180
because it allows
further analysis
00:11:23.180 --> 00:11:26.630
of these ways of
obtaining estimates.
00:11:26.630 --> 00:11:30.620
How do we interpret
this formula?
00:11:30.620 --> 00:11:33.530
What we have here
is the following.
00:11:33.530 --> 00:11:35.880
Each one of the
xi's gets multiplied
00:11:35.880 --> 00:11:39.760
by a certain coefficient,
which is 1 over the variance.
00:11:39.760 --> 00:11:42.750
And in the denominator,
we have the sum
00:11:42.750 --> 00:11:44.870
of all of those coefficients.
00:11:44.870 --> 00:11:50.110
So what we really have here is
a weighted average of the xi's.
00:11:50.110 --> 00:11:52.720
Now keep in mind that
those xi's are not
00:11:52.720 --> 00:11:54.520
all of them of the same kind.
00:11:54.520 --> 00:11:58.170
One term is x0, which
is the prior mean,
00:11:58.170 --> 00:12:02.470
whereas the remaining xi's are
the values of the observations.
00:12:02.470 --> 00:12:05.080
So there's something
interesting happening here.
00:12:05.080 --> 00:12:08.350
We combine the values
of the observations
00:12:08.350 --> 00:12:10.790
with the value of
the prior mean.
00:12:10.790 --> 00:12:12.900
And in some sense,
the prior mean
00:12:12.900 --> 00:12:17.200
is treated as just one
more piece of information
00:12:17.200 --> 00:12:19.010
available to us.
00:12:19.010 --> 00:12:23.100
And it is treated in
a sort of equal way
00:12:23.100 --> 00:12:26.060
as the other observations.
00:12:26.060 --> 00:12:29.980
The weights that we have
in this weighted average
00:12:29.980 --> 00:12:36.410
are such that each xi gets divided
by the corresponding variance.
00:12:36.410 --> 00:12:38.130
Does this make sense?
00:12:38.130 --> 00:12:42.090
Well, suppose that sigma
i squared is large.
00:12:45.290 --> 00:12:49.300
This means that the noise
term Wi tends to be large.
00:12:49.300 --> 00:12:53.190
So Xi is very noisy.
00:12:53.190 --> 00:12:56.940
And so it's not a useful
observation to have.
00:12:56.940 --> 00:13:00.590
And in that case, it
gets a small weight.
00:13:03.320 --> 00:13:06.510
So the weights are
determined by the variances
00:13:06.510 --> 00:13:09.480
in a way that is quite sensible.
00:13:09.480 --> 00:13:12.350
Those observations that
will get the most weight
00:13:12.350 --> 00:13:15.730
will be those observations for
which the corresponding noise
00:13:15.730 --> 00:13:18.760
variance is small.
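A quick numerical illustration of this weighting, with assumed example standard deviations: the very noisy third observation receives almost no weight.

```python
# Assumed noise stds; the third observation is much noisier than the rest.
sigmas = [0.5, 1.0, 10.0]

# Weight of each observation is the inverse of its noise variance.
weights = [1.0 / s ** 2 for s in sigmas]

# Normalize so the weights sum to 1, as in a weighted average.
total = sum(weights)
normalized = [w / total for w in weights]
# The noisy observation contributes a tiny fraction of the estimate,
# while the least noisy observation dominates.
```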
00:13:18.760 --> 00:13:22.280
So the solution to this
estimation problem that we just
00:13:22.280 --> 00:13:26.350
went through has
many nice properties.
00:13:26.350 --> 00:13:30.960
First, we stay within the world
of normal random variables,
00:13:30.960 --> 00:13:33.760
because even the
posterior is normal.
00:13:33.760 --> 00:13:36.540
We stay within the world
of linear functions
00:13:36.540 --> 00:13:39.590
of normal random
variables, and the form
00:13:39.590 --> 00:13:42.670
of the formula that
we have, itself,
00:13:42.670 --> 00:13:46.120
has an appealing
intuitive content.