WEBVTT
00:00:00.040 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.870
Commons license.
00:00:03.870 --> 00:00:06.910
Your support will help MIT
OpenCourseWare continue to
00:00:06.910 --> 00:00:10.560
offer high quality educational
resources for free.
00:00:10.560 --> 00:00:13.460
To make a donation or view
additional materials from
00:00:13.460 --> 00:00:19.290
hundreds of MIT courses, visit
MIT OpenCourseWare at
00:00:19.290 --> 00:00:22.410
ocw.mit.edu
00:00:22.410 --> 00:00:25.430
PROFESSOR: So we're going to
finish today our discussion of
00:00:25.430 --> 00:00:28.870
Bayesian Inference, which
we started last time.
00:00:28.870 --> 00:00:32.960
As you probably saw there's
not a huge lot of concepts
00:00:32.960 --> 00:00:37.370
that we're introducing at this
point in terms of specific
00:00:37.370 --> 00:00:39.770
skills of calculating
probabilities.
00:00:39.770 --> 00:00:44.040
But, rather, it's more of an
interpretation and setting up
00:00:44.040 --> 00:00:45.460
the framework.
00:00:45.460 --> 00:00:48.010
So the framework in Bayesian
estimation is that there is
00:00:48.010 --> 00:00:52.500
some parameter which is not
known, but we have a prior
00:00:52.500 --> 00:00:53.550
distribution on it.
00:00:53.550 --> 00:01:00.040
These are beliefs about what
this variable might be, and
00:01:00.040 --> 00:01:02.370
then we'll obtain some
measurements.
00:01:02.370 --> 00:01:05.410
And the measurements are
affected by the value of that
00:01:05.410 --> 00:01:07.560
parameter that we don't know.
00:01:07.560 --> 00:01:12.490
And this effect, the fact that
X is affected by Theta, is
00:01:12.490 --> 00:01:15.970
captured by introducing a
conditional probability
00:01:15.970 --> 00:01:16.660
distribution--
00:01:16.660 --> 00:01:19.590
the distribution of X
depends on Theta.
00:01:19.590 --> 00:01:22.270
It's a conditional probability
distribution.
00:01:22.270 --> 00:01:26.280
So we have formulas for these
two densities, the prior
00:01:26.280 --> 00:01:28.330
density and the conditional
density.
00:01:28.330 --> 00:01:31.110
And given that we have these,
if we multiply them we can
00:01:31.110 --> 00:01:34.000
also get the joint density
of X and Theta.
00:01:34.000 --> 00:01:35.940
So we have everything
that there is to
00:01:35.940 --> 00:01:37.450
know in this second.
00:01:37.450 --> 00:01:41.650
And now we observe the random
variable X. Given this random
00:01:41.650 --> 00:01:44.400
variable what can we
say about Theta?
00:01:44.400 --> 00:01:48.380
Well, what we can do is we
can always calculate the
00:01:48.380 --> 00:01:52.600
conditional distribution of
theta given X. And now that we
00:01:52.600 --> 00:01:55.990
have the specific value of
X we can plot this as
00:01:55.990 --> 00:01:58.650
a function of Theta.
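In symbols, the posterior described here is Bayes' rule for densities. This is only a sketch in assumed notation (f_Theta for the prior, f_{X|Theta} for the observation model); the notation on the slide may differ.

```latex
f_{\Theta \mid X}(\theta \mid x)
  = \frac{f_{\Theta}(\theta)\, f_{X \mid \Theta}(x \mid \theta)}
         {\int f_{\Theta}(\theta')\, f_{X \mid \Theta}(x \mid \theta')\, d\theta'}
```

With x fixed at the observed value, the right-hand side is a function of theta alone, which is what gets plotted.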
00:01:58.650 --> 00:01:59.150
OK.
00:01:59.150 --> 00:02:01.380
And this is the complete
answer to a
00:02:01.380 --> 00:02:02.990
Bayesian Inference problem.
00:02:02.990 --> 00:02:06.130
This posterior distribution
captures everything there is
00:02:06.130 --> 00:02:10.240
to say about Theta, that's
what we know about Theta.
00:02:10.240 --> 00:02:13.330
Given the X that we have
observed Theta is still
00:02:13.330 --> 00:02:15.080
random, it's still unknown.
00:02:15.080 --> 00:02:18.270
And it might be here, there,
or there with several
00:02:18.270 --> 00:02:19.900
probabilities.
00:02:19.900 --> 00:02:22.780
On the other hand, if you want
to report a single value for
00:02:22.780 --> 00:02:27.590
Theta then you do
some extra work.
00:02:27.590 --> 00:02:31.430
You continue from here, and you
do some data processing on
00:02:31.430 --> 00:02:35.360
X. Doing data processing means
that you apply a certain
00:02:35.360 --> 00:02:39.000
function on the data,
and this function is
00:02:39.000 --> 00:02:40.650
something that you design.
00:02:40.650 --> 00:02:42.930
It's the so-called estimator.
00:02:42.930 --> 00:02:46.460
And once that function is
applied it outputs an estimate
00:02:46.460 --> 00:02:50.760
of Theta, which we
call Theta hat.
00:02:50.760 --> 00:02:53.490
So this is sort of the big
picture of what's happening.
00:02:53.490 --> 00:02:55.880
Now one thing to keep in mind
is that even though I'm
00:02:55.880 --> 00:03:00.450
writing single letters here, in
general Theta or X could be
00:03:00.450 --> 00:03:02.030
vector random variables.
00:03:02.030 --> 00:03:03.540
So think of this--
00:03:03.540 --> 00:03:08.170
it could be a collection
Theta1, Theta2, Theta3.
00:03:08.170 --> 00:03:11.570
And maybe we obtained several
measurements, so this X is
00:03:11.570 --> 00:03:15.630
really a vector X1,
X2, up to Xn.
00:03:15.630 --> 00:03:20.190
All right, so now how do we
choose a Theta to report?
00:03:20.190 --> 00:03:21.960
There are various ways
of doing it.
00:03:21.960 --> 00:03:25.280
One is to look at the posterior
distribution and
00:03:25.280 --> 00:03:29.940
report the value of Theta, at
which the density or the PMF
00:03:29.940 --> 00:03:31.990
is highest.
00:03:31.990 --> 00:03:35.570
This is called the maximum
a posteriori estimate.
00:03:35.570 --> 00:03:38.770
So we pick a value of theta for
which the posterior is
00:03:38.770 --> 00:03:40.990
maximum, and we report it.
00:03:40.990 --> 00:03:46.030
An alternative way is to try to
be optimal with respect to
00:03:46.030 --> 00:03:48.500
a mean squared error.
00:03:48.500 --> 00:03:49.410
So what is this?
00:03:49.410 --> 00:03:53.260
If we have a specific estimator,
g, this is the
00:03:53.260 --> 00:03:55.880
estimate it's going
to produce.
00:03:55.880 --> 00:03:58.300
This is the true value of
Theta, so this is our
00:03:58.300 --> 00:03:59.740
estimation error.
00:03:59.740 --> 00:04:03.180
We look at the square of the
estimation error, and look at
00:04:03.180 --> 00:04:04.180
the average value.
00:04:04.180 --> 00:04:07.180
We would like this squared
estimation error to be as
00:04:07.180 --> 00:04:08.710
small as possible.
00:04:08.710 --> 00:04:12.470
How can we design our estimator
g to make that error
00:04:12.470 --> 00:04:13.920
as small as possible?
00:04:13.920 --> 00:04:19.490
It turns out that the answer is
to produce, as an estimate,
00:04:19.490 --> 00:04:22.660
the conditional expectation
of Theta given X. So the
00:04:22.660 --> 00:04:26.600
conditional expectation is the
best estimate that you could
00:04:26.600 --> 00:04:30.690
produce if your objective is to
keep the mean squared error
00:04:30.690 --> 00:04:32.720
as small as possible.
00:04:32.720 --> 00:04:35.280
So this statement here is a
statement of what happens on
00:04:35.280 --> 00:04:39.950
the average over all Theta's and
all X's that may happen in
00:04:39.950 --> 00:04:42.490
our experiment.
00:04:42.490 --> 00:04:45.160
The conditional expectation as
an estimator has an even
00:04:45.160 --> 00:04:47.750
stronger property.
00:04:47.750 --> 00:04:51.490
Not only is it optimal on the
average, but it's also optimal
00:04:51.490 --> 00:04:56.130
given that you have made a
specific observation, no
00:04:56.130 --> 00:04:57.840
matter what you observe.
00:04:57.840 --> 00:05:01.150
Let's say you observe the
specific value for the random
00:05:01.150 --> 00:05:05.560
variable X. After that point if
you're asked to produce a
00:05:05.560 --> 00:05:11.190
best estimate Theta hat that
minimizes this mean squared
00:05:11.190 --> 00:05:14.080
error, your best estimate
would be the conditional
00:05:14.080 --> 00:05:18.940
expectation given the specific
value that you have observed.
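One way to see why the conditional expectation wins for a specific observation is the following decomposition. It is a standard identity written in assumed notation, with c standing for any candidate estimate; it is not copied from the slide.

```latex
\mathbf{E}\big[(\Theta - c)^2 \mid X = x\big]
  = \operatorname{var}(\Theta \mid X = x)
  + \big(\mathbf{E}[\Theta \mid X = x] - c\big)^2
```

The first term does not depend on c, and the second is smallest, namely zero, when c equals E[Theta | X = x].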
00:05:18.940 --> 00:05:23.150
These two statements say almost
the same thing, but
00:05:23.150 --> 00:05:25.650
this one is a bit stronger.
00:05:25.650 --> 00:05:30.830
This one tells you no matter
what specific X happens the
00:05:30.830 --> 00:05:33.370
conditional expectation
is the best estimate.
00:05:33.370 --> 00:05:36.870
This one tells you on the
average, over all X's that may
00:05:36.870 --> 00:05:39.050
happen, the conditional
00:05:39.050 --> 00:05:42.650
expectation is the best estimator.
00:05:42.650 --> 00:05:44.870
Now this is really a consequence
of this.
00:05:44.870 --> 00:05:48.510
If the conditional expectation
is best for any specific X,
00:05:48.510 --> 00:05:52.750
then it's the best one even when
X is left random and you
00:05:52.750 --> 00:05:58.200
are averaging your error
over all possible X's.
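The step from the conditional statement to the averaged one can be sketched with the law of iterated expectations. Here g is any estimator; the notation is an assumption for illustration.

```latex
\mathbf{E}\big[(\Theta - g(X))^2\big]
  = \mathbf{E}\Big[\mathbf{E}\big[(\Theta - g(X))^2 \mid X\big]\Big]
  \ge \mathbf{E}\Big[\mathbf{E}\big[(\Theta - \mathbf{E}[\Theta \mid X])^2 \mid X\big]\Big]
  = \mathbf{E}\big[(\Theta - \mathbf{E}[\Theta \mid X])^2\big]
```

The inequality holds because the inner conditional expectation is minimized by E[Theta | X] for every value of X.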
00:05:58.200 --> 00:06:02.120
OK so now that we know what is
the optimal way of producing
00:06:02.120 --> 00:06:05.510
an estimate let's do a
simple example to see
00:06:05.510 --> 00:06:07.240
how things work out.
00:06:07.240 --> 00:06:10.290
So we have started with an
unknown random variable,
00:06:10.290 --> 00:06:15.080
Theta, which is uniformly
distributed between 4 and 10.
00:06:15.080 --> 00:06:18.270
And then we have an observation
model that tells
00:06:18.270 --> 00:06:22.430
us that given the value of
Theta, X is going to be a
00:06:22.430 --> 00:06:24.532
random variable that ranges
between Theta -
00:06:24.532 --> 00:06:26.570
1, and Theta + 1.
00:06:26.570 --> 00:06:32.550
So think of X as a noisy
measurement of Theta, plus
00:06:32.550 --> 00:06:37.600
some noise, which is
between -1, and +1.
00:06:37.600 --> 00:06:41.980
So really the model that we are
using here is that X is
00:06:41.980 --> 00:06:44.430
equal to Theta plus U --
00:06:44.430 --> 00:06:50.500
where U is uniform
on -1, and +1.
00:06:52.350 --> 00:06:55.946
So we have the true value of
Theta, but X could be Theta -
00:06:55.946 --> 00:07:00.750
1, or it could be all the
way up to Theta + 1.
00:07:00.750 --> 00:07:03.770
And the X is uniformly
distributed on that interval.
00:07:03.770 --> 00:07:08.060
That's the same as saying that
U is uniformly distributed
00:07:08.060 --> 00:07:09.820
over this interval.
00:07:09.820 --> 00:07:12.780
So now we have all the
information that we need, we
00:07:12.780 --> 00:07:15.270
can construct the
joint density.
00:07:15.270 --> 00:07:19.020
And the joint density is, of
course, the prior density
00:07:19.020 --> 00:07:21.850
times the conditional density.
00:07:21.850 --> 00:07:24.540
We have both of these.
00:07:24.540 --> 00:07:28.880
Both of these are constants, so
the joint density is also
00:07:28.880 --> 00:07:30.150
going to be a constant.
00:07:30.150 --> 00:07:34.420
1/6 times 1/2, this
is one over 12.
00:07:34.420 --> 00:07:37.700
But it is a constant,
not everywhere.
00:07:37.700 --> 00:07:41.280
Only on the range of possible
x's and thetas.
00:07:41.280 --> 00:07:46.030
So theta can take any value
between four and ten, so these
00:07:46.030 --> 00:07:47.430
are the values of theta.
00:07:47.430 --> 00:07:51.990
And for any given value of theta
x can take values from
00:07:51.990 --> 00:07:55.690
theta minus one, up
to theta plus one.
00:07:55.690 --> 00:08:00.210
So here, if you can imagine, a
line that goes with slope one,
00:08:00.210 --> 00:08:08.530
and then x can take that value
of theta plus or minus one.
00:08:08.530 --> 00:08:14.720
So this object here, this is
the set of possible x and
00:08:14.720 --> 00:08:16.070
theta pairs.
00:08:16.070 --> 00:08:21.490
So the density is equal to one
over 12 over this set, and
00:08:21.490 --> 00:08:23.640
it's zero everywhere else.
00:08:23.640 --> 00:08:28.035
So outside here the density is
zero, the density only applies
00:08:28.035 --> 00:08:29.800
on that set.
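The joint density for this example can be written down as a small function. This is a minimal sketch; the function name `joint_density` is hypothetical and chosen only for illustration.

```python
def joint_density(x, theta):
    """Joint density of (X, Theta) in the example.

    Theta is uniform on [4, 10], so the prior density is 1/6 there.
    Given Theta = theta, X is uniform on [theta - 1, theta + 1],
    so the conditional density is 1/2 there. Outside those ranges
    the density is zero.
    """
    prior = 1 / 6 if 4 <= theta <= 10 else 0.0
    conditional = 1 / 2 if theta - 1 <= x <= theta + 1 else 0.0
    return prior * conditional


# A pair inside the set of possible (x, theta) pairs, then one outside it.
print(joint_density(7.0, 7.5))
print(joint_density(3.5, 6.0))
```

The first call evaluates a point inside the parallelogram-shaped set, where the value is 1/6 times 1/2; the second point has x more than 1 away from theta, so the density there is zero.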
00:08:29.800 --> 00:08:33.110
All right, so now we're
asked to estimate
00:08:33.110 --> 00:08:34.890
theta in terms of x.
00:08:34.890 --> 00:08:37.500
So we want to build an estimator
which is going to be
00:08:37.500 --> 00:08:40.000
a function from the
x's to the thetas.
00:08:40.000 --> 00:08:42.909
That's why I chose the axis
this way-- x to be on this
00:08:42.909 --> 00:08:44.600
axis, theta on that axis--
00:08:44.600 --> 00:08:48.020
Because the estimator we're
building is a function of x.
00:08:48.020 --> 00:08:51.070
Based on the observation that
we obtained, we want to
00:08:51.070 --> 00:08:51.940
estimate theta.
00:08:51.940 --> 00:08:55.680
So we know that the optimal
estimator is the conditional
00:08:55.680 --> 00:08:59.360
expectation, given
the value of x.
00:08:59.360 --> 00:09:02.160
So what is the conditional
expectation?
00:09:02.160 --> 00:09:07.890
If you fix a particular value of
x, let's say in this range.
00:09:07.890 --> 00:09:13.240
So this is our x, then what
do we know about theta?
00:09:13.240 --> 00:09:18.050
We know that theta lies
in this range.
00:09:18.050 --> 00:09:21.670
Theta can only be sampled
between those two values.
00:09:21.670 --> 00:09:24.760
And what kind of distribution
does theta have?
00:09:24.760 --> 00:09:28.980
What is the conditional
distribution of theta given x?
00:09:28.980 --> 00:09:32.260
Well, remember how we built
conditional distributions from
00:09:32.260 --> 00:09:33.410
joint distributions?
00:09:33.410 --> 00:09:38.900
The conditional distribution is
just a section of the joint
00:09:38.900 --> 00:09:41.640
distribution applied to
the place where we're
00:09:41.640 --> 00:09:42.770
conditioning.
00:09:42.770 --> 00:09:45.800
So the joint is constant.
00:09:45.800 --> 00:09:49.310
So the conditional is also going
to be a constant density
00:09:49.310 --> 00:09:50.630
over this interval.
00:09:50.630 --> 00:09:53.560
So the posterior distribution
of theta is
00:09:53.560 --> 00:09:57.210
uniform over this interval.
00:09:57.210 --> 00:10:01.110
So if the posterior of theta is
uniform over that interval,
00:10:01.110 --> 00:10:04.900
the expected value of theta is
going to be the midpoint of
00:10:04.900 --> 00:10:06.070
that interval.
00:10:06.070 --> 00:10:08.880
So the estimate which
you report--
00:10:08.880 --> 00:10:10.710
if you observe that x--
00:10:10.710 --> 00:10:15.750
is going to be this particular
point here, it's the midpoint.
00:10:15.750 --> 00:10:19.140
The same argument goes through
even if you obtain an x
00:10:19.140 --> 00:10:22.570
somewhere here.
00:10:22.570 --> 00:10:29.540
Given this x, theta
can take a value
00:10:29.540 --> 00:10:32.800
between these two values.
00:10:32.800 --> 00:10:35.990
Theta is going to have a uniform
distribution over this
00:10:35.990 --> 00:10:40.650
interval, and the conditional
expectation of theta given x
00:10:40.650 --> 00:10:43.840
is going to be the midpoint
of that interval.
00:10:43.840 --> 00:10:50.790
So now if we plot our estimator
by tracing midpoints
00:10:50.790 --> 00:10:56.300
in this diagram what you're
going to obtain is a curve
00:10:56.300 --> 00:11:01.795
that starts like this, then
it changes slope.
00:11:04.490 --> 00:11:07.280
So that it keeps track of the
midpoint, and then it goes
00:11:07.280 --> 00:11:09.000
like that again.
00:11:09.000 --> 00:11:13.760
So this blue curve here is
our g of x, which is the
00:11:13.760 --> 00:11:16.910
conditional expectation of
theta given that x is
00:11:16.910 --> 00:11:20.480
equal to little x.
00:11:20.480 --> 00:11:26.610
So it's a curve, in our example
it consists of three
00:11:26.610 --> 00:11:28.220
straight segments.
00:11:28.220 --> 00:11:30.780
But overall it's non-linear.
00:11:30.780 --> 00:11:33.440
It's not a single line
through this diagram.
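The midpoint-tracing construction can be sketched in a few lines. The function name `lms_estimate` is hypothetical; the interval endpoints come from the example (Theta between 4 and 10, X within 1 of Theta).

```python
def lms_estimate(x):
    """Conditional expectation E[Theta | X = x] for the example.

    Given x, Theta must lie in [4, 10] and within 1 of x, so the
    posterior is uniform on [max(4, x - 1), min(10, x + 1)] and its
    mean is the midpoint of that interval.
    """
    if not 3 <= x <= 11:
        raise ValueError("x must lie in [3, 11] for this example")
    lo = max(4.0, x - 1.0)
    hi = min(10.0, x + 1.0)
    return (lo + hi) / 2.0


for x in (3, 4, 7, 10, 11):
    print(x, lms_estimate(x))
```

For x between 5 and 9 the estimate equals x itself; below 5 and above 9 the slope drops to 1/2, which gives the three straight segments.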
00:11:33.440 --> 00:11:35.670
And that's how things
are in general.
00:11:35.670 --> 00:11:39.300
g of x, our optimal estimate has
no reason to be a linear
00:11:39.300 --> 00:11:40.460
function of x.
00:11:40.460 --> 00:11:42.780
In general it's going to be
some complicated curve.
00:11:47.350 --> 00:11:51.170
So how good is our estimate?
00:11:51.170 --> 00:11:55.700
I mean you reported your x, your
estimate of theta based
00:11:55.700 --> 00:12:00.690
on x, and your boss asks you
what kind of error do you
00:12:00.690 --> 00:12:03.350
expect to get?
00:12:03.350 --> 00:12:07.010
Having observed the particular
value of x, what you can
00:12:07.010 --> 00:12:11.140
report to your boss is what you
think the mean squared
00:12:11.140 --> 00:12:13.040
error is going to be.
00:12:13.040 --> 00:12:15.380
We observe the particular
value of x.
00:12:15.380 --> 00:12:19.650
So we're conditioning, and we're
living in this universe.
00:12:19.650 --> 00:12:22.760
Given that we have made this
observation, this is the true
00:12:22.760 --> 00:12:25.840
value of theta, this is the
estimate that we have
00:12:25.840 --> 00:12:32.220
produced, this is the expected
squared error, given that we
00:12:32.220 --> 00:12:35.740
have made the particular
observation.
00:12:35.740 --> 00:12:39.700
Now in this conditional universe
this is the expected
00:12:39.700 --> 00:12:42.880
value of theta given x.
00:12:42.880 --> 00:12:46.240
So this is the expected value of
this random variable inside
00:12:46.240 --> 00:12:47.900
the conditional universe.
00:12:47.900 --> 00:12:50.900
So when you take the mean
squared of a random variable
00:12:50.900 --> 00:12:53.780
minus the expected value, this
is the same thing as the
00:12:53.780 --> 00:12:55.840
variance of that random
variable.
00:12:55.840 --> 00:12:58.670
Except that it's the
variance inside
00:12:58.670 --> 00:13:00.940
the conditional universe.
00:13:00.940 --> 00:13:06.230
Having observed x, theta is
still a random variable.
00:13:06.230 --> 00:13:09.010
It's distributed according to
the posterior distribution.
00:13:09.010 --> 00:13:12.220
Since it's a random variable,
it has a variance.
00:13:12.220 --> 00:13:16.060
And that variance is our
mean squared error.
00:13:16.060 --> 00:13:20.280
So this is the variance of the
posterior distribution of
00:13:20.280 --> 00:13:22.605
Theta given the observation
that we have made.
00:13:26.688 --> 00:13:30.180
OK, so what is the variance
in our example?
00:13:30.180 --> 00:13:36.270
If X happens to be here, then
Theta is uniform over this
00:13:36.270 --> 00:13:41.990
interval, and this interval
has length 2.
00:13:41.990 --> 00:13:46.960
Theta is uniformly distributed
over an interval of length 2.
00:13:46.960 --> 00:13:49.900
This is the posterior
distribution of Theta.
00:13:49.900 --> 00:13:51.410
What is the variance?
00:13:51.410 --> 00:13:54.680
Then you remember the formula
for the variance of a uniform
00:13:54.680 --> 00:13:59.520
random variable, it is the
length of the interval squared
00:13:59.520 --> 00:14:03.590
divided by 12, so this is 1/3.
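Written out, with a and b as assumed names for the endpoints of the interval:

```latex
\operatorname{var}(\Theta \mid X = x) = \frac{(b - a)^2}{12} = \frac{2^2}{12} = \frac{1}{3}
```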
00:14:03.590 --> 00:14:06.060
So the variance of Theta --
00:14:06.060 --> 00:14:10.330
the mean squared error-- is
going to be 1/3 whenever this
00:14:10.330 --> 00:14:12.430
kind of picture applies.
00:14:12.430 --> 00:14:16.460
This picture applies when
X is between 5 and 9.
00:14:16.460 --> 00:14:20.100
If X is less than 5, then the
picture is a little different,
00:14:20.100 --> 00:14:22.020
and Theta is going
to be uniform
00:14:22.020 --> 00:14:24.660
over a smaller interval.
00:14:24.660 --> 00:14:26.930
And so the variance of
theta is going to
00:14:26.930 --> 00:14:28.770
be smaller as well.
00:14:28.770 --> 00:14:31.470
So let's start plotting our
mean squared error.
00:14:31.470 --> 00:14:35.930
Between 5 and 9 the variance
of Theta --
00:14:35.930 --> 00:14:37.260
the posterior variance--
00:14:37.260 --> 00:14:39.090
is 1/3.
00:14:39.090 --> 00:14:46.100
Now when the X falls in here
Theta is uniformly distributed
00:14:46.100 --> 00:14:48.450
over a smaller interval.
00:14:48.450 --> 00:14:50.670
The size of this interval
changes
00:14:50.670 --> 00:14:52.800
linearly over that range.
00:14:52.800 --> 00:14:59.260
And so when we take the square
of the size of that interval we get a
00:14:59.260 --> 00:15:01.560
quadratic function of
how much we have
00:15:01.560 --> 00:15:03.120
moved from that corner.
00:15:03.120 --> 00:15:07.140
So at that corner what is
the variance of Theta?
00:15:07.140 --> 00:15:11.290
Well if I observe an X that's
equal to 3 then I know with
00:15:11.290 --> 00:15:14.810
certainty that Theta
is equal to 4.
00:15:14.810 --> 00:15:18.340
Then I'm in very good shape, I
know exactly what Theta is
00:15:18.340 --> 00:15:19.240
going to be.
00:15:19.240 --> 00:15:22.890
So the variance, in this
case, is going to be 0.
00:15:22.890 --> 00:15:26.570
If I observe an X that's a
little larger, then Theta is
00:15:26.570 --> 00:15:31.130
now random, takes values on
a little interval, and the
00:15:31.130 --> 00:15:35.430
variance of Theta is going to be
proportional to the square
00:15:35.430 --> 00:15:37.910
of the length of that
little interval.
00:15:37.910 --> 00:15:40.400
So we get a curve that
starts rising
00:15:40.400 --> 00:15:42.560
quadratically from here.
00:15:42.560 --> 00:15:45.390
It goes up toward 1/3.
00:15:45.390 --> 00:15:48.980
At the other end of the picture
the same is true.
00:15:48.980 --> 00:15:54.500
If you observe an X which is
11 then Theta can only be
00:15:54.500 --> 00:15:57.150
equal to 10.
00:15:57.150 --> 00:16:00.720
And so the error in Theta
is equal to 0,
00:16:00.720 --> 00:16:02.920
there's 0 error variance.
00:16:02.920 --> 00:16:07.360
But as we obtain X's that are
slightly less than 11 then the
00:16:07.360 --> 00:16:10.380
mean squared error again
rises quadratically.
00:16:10.380 --> 00:16:13.450
So we end up with a
plot like this.
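The shape of this plot can be reproduced with a short sketch. The function name `posterior_variance` is hypothetical; it uses the uniform-variance formula (length squared over 12) on the posterior interval from the example.

```python
def posterior_variance(x):
    """Conditional mean squared error var(Theta | X = x) for the example.

    The posterior of Theta is uniform on [max(4, x - 1), min(10, x + 1)],
    and a uniform on an interval of length L has variance L**2 / 12.
    """
    if not 3 <= x <= 11:
        raise ValueError("x must lie in [3, 11] for this example")
    lo = max(4.0, x - 1.0)
    hi = min(10.0, x + 1.0)
    return (hi - lo) ** 2 / 12.0


for x in (3, 4, 5, 7, 9, 10, 11):
    print(x, posterior_variance(x))
```

The value is 0 at x = 3 and x = 11, rises quadratically toward the middle, and stays at 1/3 for x between 5 and 9.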
00:16:13.450 --> 00:16:17.120
What this plot tells us is that
certain measurements are
00:16:17.120 --> 00:16:18.920
better than others.
00:16:18.920 --> 00:16:25.270
If you're lucky, and you see X
equal to 3 then you're lucky,
00:16:25.270 --> 00:16:28.820
because you know Theta
exactly what it is.
00:16:28.820 --> 00:16:33.830
If you see an X which is equal
to 6 then you're sort of
00:16:33.830 --> 00:16:35.800
unlucky, because it
doesn't tell you
00:16:35.800 --> 00:16:37.900
Theta with great precision.
00:16:37.900 --> 00:16:40.560
Theta could be anywhere
on that interval.
00:16:40.560 --> 00:16:42.360
And so the variance
of Theta --
00:16:42.360 --> 00:16:44.630
even after you have
observed X --
00:16:44.630 --> 00:16:48.470
is a certain number,
1/3 in our case.
00:16:48.470 --> 00:16:52.370
So the moral to take out
of that story is
00:16:52.370 --> 00:16:56.970
that the error variance--
00:16:56.970 --> 00:17:00.380
or the mean squared error--
00:17:00.380 --> 00:17:03.350
depends on what particular
observation
00:17:03.350 --> 00:17:04.829
you happen to obtain.
00:17:04.829 --> 00:17:10.240
Some observations may be very
informative, and once you see
00:17:10.240 --> 00:17:13.550
a specific number then you know
exactly what Theta is.
00:17:13.550 --> 00:17:15.760
Some observations might
be less informative.
00:17:15.760 --> 00:17:18.980
You observe your X, but it could
still leave a lot of
00:17:18.980 --> 00:17:20.230
uncertainty about Theta.
00:17:23.839 --> 00:17:27.650
So conditional expectations are
really the cornerstone of
00:17:27.650 --> 00:17:28.890
Bayesian estimation.
00:17:28.890 --> 00:17:31.690
They're particularly
popular, especially
00:17:31.690 --> 00:17:33.950
in engineering contexts.
00:17:33.950 --> 00:17:38.260
They're used a lot in signal
processing, communications,
00:17:38.260 --> 00:17:40.940
control theory, so on.
00:17:40.940 --> 00:17:44.300
So that makes it worth playing
a little bit with their
00:17:44.300 --> 00:17:50.450
theoretical properties, and get
some appreciation of a few
00:17:50.450 --> 00:17:53.590
subtleties involved here.
00:17:53.590 --> 00:17:57.990
No new math in reality, in what
we're going to do here.
00:17:57.990 --> 00:18:01.290
But it's going to be a good
opportunity to practice
00:18:01.290 --> 00:18:05.310
manipulation of conditional
expectations.
00:18:05.310 --> 00:18:13.150
So let's look at the expected
value of the estimation error
00:18:13.150 --> 00:18:15.330
that we obtained.
00:18:15.330 --> 00:18:18.540
So Theta hat is our estimator,
is the conditional
00:18:18.540 --> 00:18:19.855
expectation.
00:18:19.855 --> 00:18:25.690
Theta hat minus Theta is what
kind of error do we have?
00:18:25.690 --> 00:18:29.610
If Theta hat, is bigger than
Theta then we have made the
00:18:29.610 --> 00:18:31.510
positive error.
00:18:31.510 --> 00:18:33.910
If not, if it's on the other
side, we have made a
00:18:33.910 --> 00:18:35.290
negative error.
00:18:35.290 --> 00:18:39.110
Then it turns out that on the
average the errors cancel each
00:18:39.110 --> 00:18:41.030
other out, on the average.
00:18:41.030 --> 00:18:43.110
So let's do this calculation.
00:18:43.110 --> 00:18:50.010
Let's calculate the expected
value of the error given X.
00:18:50.010 --> 00:18:54.480
Now by definition the error is
expected value of Theta hat
00:18:54.480 --> 00:18:57.850
minus Theta given X.
00:18:57.850 --> 00:19:01.090
We use linearity of expectations
to break it up as
00:19:01.090 --> 00:19:04.850
expected value of Theta hat
given X minus expected value
00:19:04.850 --> 00:19:11.090
of Theta given X.
And now what?
00:19:11.090 --> 00:19:18.680
Our estimate is made on the
basis of the data of the X's.
00:19:18.680 --> 00:19:23.600
If I tell you X then you
know what Theta hat is.
00:19:23.600 --> 00:19:26.490
Remember that the conditional
expectation is a random
00:19:26.490 --> 00:19:29.680
variable which is a function
of the random variable, on
00:19:29.680 --> 00:19:31.560
which you're conditioning.
00:19:31.560 --> 00:19:35.330
If you know X then you know the
conditional expectation
00:19:35.330 --> 00:19:38.390
given X, you know what Theta
hat is going to be.
00:19:38.390 --> 00:19:42.910
So Theta hat is a function of
X. If it's a function of X
00:19:42.910 --> 00:19:45.910
then once I tell you X
you know what Theta
00:19:45.910 --> 00:19:47.460
hat is going to be.
00:19:47.460 --> 00:19:49.580
So this conditional expectation
is going to be
00:19:49.580 --> 00:19:51.860
Theta hat itself.
00:19:51.860 --> 00:19:54.030
Here this is-- just
by definition--
00:19:54.030 --> 00:19:59.580
Theta hat, and so we
get equality to 0.
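The calculation just described, written out as one line. The hat notation for the estimator E[Theta | X] follows the lecture; the typesetting is a reconstruction rather than a copy of the slide.

```latex
\mathbf{E}\big[\hat{\Theta} - \Theta \mid X\big]
  = \mathbf{E}\big[\hat{\Theta} \mid X\big] - \mathbf{E}\big[\Theta \mid X\big]
  = \hat{\Theta} - \hat{\Theta}
  = 0
```

The first term equals Theta hat because Theta hat is a function of X; the second equals Theta hat by the definition of the estimator.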
00:19:59.580 --> 00:20:04.260
So what we have proved is that
no matter what I have
00:20:04.260 --> 00:20:08.970
observed, and given that I have
observed something on the
00:20:08.970 --> 00:20:14.050
average my error is
going to be 0.
00:20:14.050 --> 00:20:19.960
This is a statement involving
equality of random variables.
00:20:19.960 --> 00:20:22.620
Remember that conditional
expectations are random
00:20:22.620 --> 00:20:26.970
variables because they depend
on the thing you're
00:20:26.970 --> 00:20:28.440
conditioning on.
00:20:28.440 --> 00:20:31.630
0 is sort of a trivial
random variable.
00:20:31.630 --> 00:20:34.080
This tells you that this random
variable is identically
00:20:34.080 --> 00:20:36.390
equal to the 0 random
variable.
00:20:36.390 --> 00:20:40.720
More specifically it tells you
that no matter what value for
00:20:40.720 --> 00:20:45.120
X you observe, the conditional
expectation of the error is
00:20:45.120 --> 00:20:46.410
going to be 0.
00:20:46.410 --> 00:20:49.150
And this takes us to this
statement here, which is
00:20:49.150 --> 00:20:51.830
an equality between numbers.
00:20:51.830 --> 00:20:56.330
No matter what specific value
for capital X you have
00:20:56.330 --> 00:21:00.440
observed, your error, on
the average, is going
00:21:00.440 --> 00:21:02.420
to be equal to 0.
00:21:02.420 --> 00:21:06.730
So this is a less abstract
version of these statements.
00:21:06.730 --> 00:21:09.300
This is an equality between
two numbers.
00:21:09.300 --> 00:21:15.080
It's true for every value of
X, so it's true in terms of
00:21:15.080 --> 00:21:18.550
these random variables being
equal to that random variable.
00:21:18.550 --> 00:21:21.170
Because remember according to
our definition this random
00:21:21.170 --> 00:21:24.400
variable is the random variable
that takes this
00:21:24.400 --> 00:21:27.410
specific value when capital
X happens to be
00:21:27.410 --> 00:21:29.410
equal to little x.
00:21:29.410 --> 00:21:33.480
Now this doesn't mean that your
error is 0, it only means
00:21:33.480 --> 00:21:37.050
that your error is as likely, in
some sense, to fall on the
00:21:37.050 --> 00:21:40.040
positive side, as to fall
on the negative side.
00:21:40.040 --> 00:21:41.400
So sometimes your error will be
00:21:41.400 --> 00:21:42.880
positive, sometimes negative.
00:21:42.880 --> 00:21:46.360
And on the average these
things cancel out and
00:21:46.360 --> 00:21:48.150
give you a 0 --
00:21:48.150 --> 00:21:49.470
on the average.
00:21:49.470 --> 00:21:53.620
So this is a property that's
sometimes given a name: we
00:21:53.620 --> 00:21:59.040
say that Theta hat
is unbiased.
00:21:59.040 --> 00:22:03.190
So Theta hat, our estimate, does
not have a tendency to be
00:22:03.190 --> 00:22:04.180
on the high side.
00:22:04.180 --> 00:22:06.920
It does not have a tendency
to be on the low side.
00:22:06.920 --> 00:22:10.580
On the average it's
just right.
00:22:14.700 --> 00:22:18.390
So let's do a little
more playing here.
00:22:21.790 --> 00:22:27.690
Let's see how our error is
related to an arbitrary
00:22:27.690 --> 00:22:30.270
function of the data.
00:22:30.270 --> 00:22:36.960
Let's do this in a conditional
universe and
00:22:36.960 --> 00:22:38.210
look at this quantity.
00:22:45.210 --> 00:22:47.910
In a conditional universe
where X is known
00:22:47.910 --> 00:22:51.060
then h of X is known.
00:22:51.060 --> 00:22:54.200
And so you can pull it outside
the expectation.
00:22:54.200 --> 00:22:58.010
In the conditional universe
where the value of X is given
00:22:58.010 --> 00:23:01.290
this quantity becomes
just a constant.
00:23:01.290 --> 00:23:03.250
There's nothing random
about it.
00:23:03.250 --> 00:23:06.280
So you can pull it out of
the expectation, and
00:23:06.280 --> 00:23:09.840
write things this way.
00:23:09.840 --> 00:23:14.090
And we have just calculated
that this quantity is 0.
00:23:14.090 --> 00:23:17.390
So this number turns out
to be 0 as well.
00:23:20.810 --> 00:23:23.830
Now having done this,
we can take
00:23:23.830 --> 00:23:26.110
expectations of both sides.
00:23:26.110 --> 00:23:29.530
And now let's use the law of
iterated expectations.
00:23:29.530 --> 00:23:33.040
Expectation of a conditional
expectation gives us the
00:23:33.040 --> 00:23:42.200
unconditional expectation, and
this is also going to be 0.
00:23:42.200 --> 00:23:47.455
So here we use the law of
iterated expectations.
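The two steps can be sketched in symbols, writing Theta tilde for the error Theta hat minus Theta and h for an arbitrary function of the data. The typesetting is an assumed reconstruction of the board work.

```latex
\mathbf{E}\big[\tilde{\Theta}\, h(X) \mid X\big]
  = h(X)\, \mathbf{E}\big[\tilde{\Theta} \mid X\big]
  = 0,
\qquad
\mathbf{E}\big[\tilde{\Theta}\, h(X)\big]
  = \mathbf{E}\Big[\mathbf{E}\big[\tilde{\Theta}\, h(X) \mid X\big]\Big]
  = 0
```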
00:23:54.460 --> 00:23:55.710
OK.
00:24:04.510 --> 00:24:06.290
OK, why are we doing this?
00:24:06.290 --> 00:24:09.990
We're doing this because I would
like to calculate the
00:24:09.990 --> 00:24:13.940
covariance between Theta
tilde and Theta hat.
00:24:13.940 --> 00:24:16.490
That is, ask the question
-- is there a systematic
00:24:16.490 --> 00:24:20.870
relation between the error
and the estimate?
00:24:20.870 --> 00:24:30.830
So to calculate the covariance
we use the property that we
00:24:30.830 --> 00:24:34.460
can calculate the covariances
by calculating the expected
00:24:34.460 --> 00:24:39.520
value of the product minus
the product of
00:24:39.520 --> 00:24:40.770
the expected values.
00:24:48.440 --> 00:24:50.850
And what do we get?
00:24:50.850 --> 00:24:56.080
This is 0, because of
what we just proved.
00:25:00.980 --> 00:25:06.160
And this is 0, because of
what we proved earlier.
00:25:06.160 --> 00:25:09.740
That the expected value of
the error is equal to 0.
00:25:12.900 --> 00:25:27.800
So the covariance between the
error and any function of X is
00:25:27.800 --> 00:25:29.470
equal to 0.
00:25:29.470 --> 00:25:33.060
Let's apply that to the case where
the function of X we're
00:25:33.060 --> 00:25:38.620
considering is Theta
hat itself.
00:25:38.620 --> 00:25:43.300
Theta hat is our estimate, it's
a function of X. So this
00:25:43.300 --> 00:25:46.845
0 result would still apply,
and we get that this
00:25:46.845 --> 00:25:50.570
covariance is equal to 0.
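These two facts can be checked numerically on the earlier example (Theta uniform on 4 to 10, X equal to Theta plus uniform noise on -1 to 1). This is a rough Monte Carlo sketch, not part of the lecture; the function names, sample size, and seed are arbitrary choices for illustration.

```python
import random


def lms_estimate(x):
    """E[Theta | X = x]: midpoint of the posterior interval in the example."""
    lo = max(4.0, x - 1.0)
    hi = min(10.0, x + 1.0)
    return (lo + hi) / 2.0


def simulate(n=200_000, seed=0):
    """Estimate E[error] and cov(error, Theta hat) by simulation."""
    rng = random.Random(seed)
    errors, estimates = [], []
    for _ in range(n):
        theta = rng.uniform(4.0, 10.0)      # draw from the prior
        x = theta + rng.uniform(-1.0, 1.0)  # noisy measurement
        theta_hat = lms_estimate(x)
        errors.append(theta_hat - theta)    # error = Theta hat - Theta
        estimates.append(theta_hat)
    mean_err = sum(errors) / n
    mean_est = sum(estimates) / n
    # Covariance = E[product] - product of the expected values.
    cov = sum(e * t for e, t in zip(errors, estimates)) / n - mean_err * mean_est
    return mean_err, cov


mean_err, cov = simulate()
print(mean_err, cov)
```

Both printed numbers should come out close to 0, up to sampling noise, which is consistent with the estimator being unbiased and the error being uncorrelated with the estimate.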
00:25:50.570 --> 00:25:59.100
OK, so that's what we proved.
00:25:59.100 --> 00:26:02.720
Let's see, what are the morals
to take out of all this?
00:26:02.720 --> 00:26:07.640
First is you should be very
comfortable with this type of
00:26:07.640 --> 00:26:10.580
calculation involving
conditional expectations.
00:26:10.580 --> 00:26:14.100
The main two things that we're
using are that when you
00:26:14.100 --> 00:26:17.630
condition on a random variable
any function of that random
00:26:17.630 --> 00:26:21.020
variable becomes a constant,
and can be pulled out of the
00:26:21.020 --> 00:26:22.690
conditional expectation.
00:26:22.690 --> 00:26:25.460
The other thing that we are
using is the law of iterated
00:26:25.460 --> 00:26:29.450
expectations, so these are
the skills involved.
00:26:29.450 --> 00:26:32.980
Now on the substance, why is
this result interesting?
00:26:32.980 --> 00:26:35.390
This tells us that the error is
00:26:35.390 --> 00:26:37.060
uncorrelated with the estimate.
00:26:39.770 --> 00:26:42.530
What's a hypothetical situation
where this would
00:26:42.530 --> 00:26:44.160
not happen?
00:26:44.160 --> 00:26:52.720
Whenever Theta hat is positive
my error tends to be negative.
00:26:52.720 --> 00:26:57.000
Suppose that whenever Theta hat
is big then you say oh my
00:26:57.000 --> 00:27:00.610
estimate is too big, maybe the
true Theta is on the lower
00:27:00.610 --> 00:27:04.470
side, so I expect my error
to be negative.
00:27:04.470 --> 00:27:09.230
That would be a situation that
would violate this condition.
00:27:09.230 --> 00:27:13.880
This condition tells you that
no matter what Theta hat is,
00:27:13.880 --> 00:27:17.110
you don't expect your error to
be on the positive side or on
00:27:17.110 --> 00:27:18.030
the negative side.
00:27:18.030 --> 00:27:21.630
Your error will still
be 0 on the average.
00:27:21.630 --> 00:27:25.780
So if you obtain a very high
estimate this is no reason for
00:27:25.780 --> 00:27:29.630
you to suspect that
the true Theta is
00:27:29.630 --> 00:27:30.890
lower than your estimate.
00:27:30.890 --> 00:27:34.420
If you suspected that the true
Theta was lower than your
00:27:34.420 --> 00:27:38.830
estimate you should have
changed your Theta hat.
00:27:38.830 --> 00:27:42.580
If you make an estimate and
after obtaining that estimate
00:27:42.580 --> 00:27:46.270
you say I think my estimate
is too big, and so
00:27:46.270 --> 00:27:47.770
the error is negative.
00:27:47.770 --> 00:27:50.730
If you thought that way then
that means that your estimate
00:27:50.730 --> 00:27:53.690
is not the optimal one, that
your estimate should have been
00:27:53.690 --> 00:27:57.200
corrected to be smaller.
00:27:57.200 --> 00:28:00.030
And that would mean that there's
a better estimate than
00:28:00.030 --> 00:28:03.060
the one you used, but the
estimate that we are using
00:28:03.060 --> 00:28:06.060
here is the optimal one in terms
of mean squared error,
00:28:06.060 --> 00:28:08.350
there's no way of
improving it.
00:28:08.350 --> 00:28:11.640
And this is really captured
in that statement.
00:28:11.640 --> 00:28:14.250
That is knowing Theta hat
doesn't give you a lot of
00:28:14.250 --> 00:28:18.290
information about the error, and
gives you, therefore, no
00:28:18.290 --> 00:28:24.430
reason to adjust your estimate
from what it was.
00:28:24.430 --> 00:28:29.190
Finally, a consequence
of all this.
00:28:29.190 --> 00:28:31.910
This is the definition
of the error.
00:28:31.910 --> 00:28:35.770
Send Theta to this side, send
Theta tilde to that side, you
00:28:35.770 --> 00:28:36.850
get this relation.
00:28:36.850 --> 00:28:41.010
The true parameter is composed
of two quantities.
00:28:41.010 --> 00:28:44.940
The estimate, and the
error, taken
00:28:44.940 --> 00:28:46.460
with a minus sign.
00:28:46.460 --> 00:28:49.790
These two quantities are
uncorrelated with each other.
00:28:49.790 --> 00:28:53.350
Their covariance is 0, and
therefore, the variance of
00:28:53.350 --> 00:28:56.330
this is the sum of the variances
of these two
00:28:56.330 --> 00:28:57.580
quantities.
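Both facts, zero covariance and the variance split, can be checked exactly on a small discrete model. The sketch below assumes a made-up joint PMF for Theta and X (the numbers are hypothetical and chosen only for illustration) and uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction as F

# Hypothetical joint PMF p(theta, x); the values are illustrative only.
JOINT = {
    (0, 0): F(3, 10), (0, 1): F(1, 10),
    (1, 0): F(1, 10), (1, 1): F(2, 10),
    (2, 1): F(3, 10),
}

def expect(f):
    """Expected value of f(theta, x) under JOINT."""
    return sum(p * f(t, x) for (t, x), p in JOINT.items())

def cond_mean(x):
    """E[Theta | X = x], the least mean squares estimate for that x."""
    px = sum(p for (_, xx), p in JOINT.items() if xx == x)
    return sum(p * t for (t, xx), p in JOINT.items() if xx == x) / px

def var(f):
    """Variance of the random variable f(theta, x)."""
    m = expect(f)
    return expect(lambda t, x: (f(t, x) - m) ** 2)

theta = lambda t, x: F(t)
theta_hat = lambda t, x: cond_mean(x)
theta_tilde = lambda t, x: cond_mean(x) - t  # error = estimate minus true value

# Covariance via E[product] minus product of expectations.
cov_err_est = (expect(lambda t, x: theta_tilde(t, x) * theta_hat(t, x))
               - expect(theta_tilde) * expect(theta_hat))

print("cov(error, estimate) =", cov_err_est)
print("var(Theta)           =", var(theta))
print("var(est) + var(err)  =", var(theta_hat) + var(theta_tilde))
```

Any other valid joint PMF could be substituted for JOINT; the two printed variances should still agree, since the argument does not depend on the particular distribution.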
00:29:00.470 --> 00:29:07.520
So what's an interpretation
of this equality?
00:29:07.520 --> 00:29:10.930
There is some inherent
randomness in the random
00:29:10.930 --> 00:29:14.540
variable Theta that we're
trying to estimate.
00:29:14.540 --> 00:29:19.360
Theta hat tries to estimate it,
tries to get close to it.
00:29:19.360 --> 00:29:25.500
And if Theta hat always stays
close to Theta, since Theta is
00:29:25.500 --> 00:29:29.260
random Theta hat must also be
quite random, so it has
00:29:29.260 --> 00:29:31.170
uncertainty in it.
00:29:31.170 --> 00:29:35.270
And the more uncertain Theta
hat is the more it moves
00:29:35.270 --> 00:29:36.640
together with Theta.
00:29:36.640 --> 00:29:40.860
So the more uncertainty
it removes from Theta.
00:29:40.860 --> 00:29:43.900
And this is the remaining
uncertainty in Theta.
00:29:43.900 --> 00:29:47.140
The uncertainty that's left
after we've done our
00:29:47.140 --> 00:29:48.350
estimation.
00:29:48.350 --> 00:29:52.330
So ideally, to have a small
error we want this
00:29:52.330 --> 00:29:54.120
quantity to be small.
00:29:54.120 --> 00:29:55.820
Which is the same as
saying that this
00:29:55.820 --> 00:29:57.740
quantity should be big.
00:29:57.740 --> 00:30:02.070
In the ideal case Theta hat
is the same as Theta.
00:30:02.070 --> 00:30:04.820
That's the best we
could hope for.
00:30:04.820 --> 00:30:09.250
That corresponds to 0 error,
and all the uncertainty in
00:30:09.250 --> 00:30:14.230
Theta is absorbed by the
uncertainty in Theta hat.
00:30:14.230 --> 00:30:18.960
Interestingly, this relation
here is just another variation
00:30:18.960 --> 00:30:21.630
of the law of total variance
that we have seen at some
00:30:21.630 --> 00:30:23.880
point in the past.
00:30:23.880 --> 00:30:28.570
I will skip that derivation, but
it's an interesting fact,
00:30:28.570 --> 00:30:31.430
and it can give you an
alternative interpretation of
00:30:31.430 --> 00:30:32.680
the law of total variance.
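For reference, here is a sketch of the skipped derivation, assuming $\hat\Theta = E[\Theta \mid X]$ and $\tilde\Theta = \hat\Theta - \Theta$ as above. It only matches the two terms of the law of total variance with the two variances in the decomposition.

```latex
\begin{align*}
\operatorname{var}(\Theta) &= E\big[\operatorname{var}(\Theta \mid X)\big] + \operatorname{var}\big(E[\Theta \mid X]\big)
  && \text{(law of total variance)} \\
\operatorname{var}\big(E[\Theta \mid X]\big) &= \operatorname{var}(\hat\Theta), \\
E\big[\operatorname{var}(\Theta \mid X)\big] &= E\Big[ E\big[(\Theta - \hat\Theta)^2 \mid X\big] \Big]
  = E[\tilde\Theta^2] = \operatorname{var}(\tilde\Theta)
  && \text{(since } E[\tilde\Theta] = 0\text{)}.
\end{align*}
```

So the expected conditional variance is the uncertainty left after estimation, and the variance of the conditional expectation is the uncertainty that the estimate absorbs.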
00:30:36.840 --> 00:30:40.570
OK, so now let's return
to our example.
00:30:40.570 --> 00:30:45.630
In our example we obtained the
optimal estimator, and we saw
00:30:45.630 --> 00:30:51.220
that it was a nonlinear curve,
something like this.
00:30:51.220 --> 00:30:53.660
I'm exaggerating the corner
a little bit to
00:30:53.660 --> 00:30:55.350
show that it's nonlinear.
00:30:55.350 --> 00:30:57.400
This is the optimal estimator.
00:30:57.400 --> 00:31:01.070
It's a nonlinear function
of X --
00:31:01.070 --> 00:31:05.200
nonlinear generally
means complicated.
00:31:05.200 --> 00:31:09.020
Sometimes the conditional
expectation is really hard to
00:31:09.020 --> 00:31:12.320
compute, because whenever you
have to compute expectations
00:31:12.320 --> 00:31:17.270
you need to do some integrals.
00:31:17.270 --> 00:31:19.880
And if you have many random
variables involved it might
00:31:19.880 --> 00:31:23.160
correspond to a
multi-dimensional integration.
00:31:23.160 --> 00:31:24.370
We don't like this.
00:31:24.370 --> 00:31:27.370
Can we come up, maybe,
with a simpler way
00:31:27.370 --> 00:31:29.200
of estimating Theta?
00:31:29.200 --> 00:31:32.580
Of coming up with a point
estimate which still has some
00:31:32.580 --> 00:31:34.350
nice properties, it
has some good
00:31:34.350 --> 00:31:37.120
motivation, but is simpler.
00:31:37.120 --> 00:31:38.630
What does simpler mean?
00:31:38.630 --> 00:31:40.920
Perhaps linear.
00:31:40.920 --> 00:31:45.570
Let's put ourselves in a
straitjacket and restrict
00:31:45.570 --> 00:31:50.260
ourselves to estimators that
are of this form.
00:31:50.260 --> 00:31:53.280
My estimate is constrained
to be a linear
00:31:53.280 --> 00:31:54.930
function of the X's.
00:31:54.930 --> 00:31:59.320
So my estimator is going to be
a curve, a linear curve.
00:31:59.320 --> 00:32:03.450
It could be this, it could be
that, maybe it would want to
00:32:03.450 --> 00:32:06.350
be something like this.
00:32:06.350 --> 00:32:10.540
I want to choose the best
possible linear function.
00:32:10.540 --> 00:32:11.490
What does that mean?
00:32:11.490 --> 00:32:15.570
It means that I write my
Theta hat in this form.
00:32:15.570 --> 00:32:20.750
If I fix a certain a and b I
have fixed the functional form
00:32:20.750 --> 00:32:23.940
of my estimator, and this
is the corresponding
00:32:23.940 --> 00:32:25.360
mean squared error.
00:32:25.360 --> 00:32:28.210
That's the error between the
true parameter and the
00:32:28.210 --> 00:32:31.130
estimate of that parameter, we
take the square of this.
00:32:33.730 --> 00:32:38.350
And now the optimal linear
estimator is defined as one
00:32:38.350 --> 00:32:42.210
for which this mean squared
error is the smallest possible
00:32:42.210 --> 00:32:45.600
over all choices of a and b.
00:32:45.600 --> 00:32:48.260
So we want to minimize
this expression
00:32:48.260 --> 00:32:52.030
over all a's and b's.
00:32:52.030 --> 00:32:55.650
How do we do this
minimization?
00:32:55.650 --> 00:32:58.910
Well this is a square,
you can expand it.
00:32:58.910 --> 00:33:02.040
Write down all the terms in the
expansion of the square.
00:33:02.040 --> 00:33:03.810
So you're going to get
the term expected
00:33:03.810 --> 00:33:05.400
value of Theta squared.
00:33:05.400 --> 00:33:07.380
You're going to get
another term--
00:33:07.380 --> 00:33:11.010
a squared expected value of X
squared, another term which is
00:33:11.010 --> 00:33:13.340
b squared, and then you're
going to get
00:33:13.340 --> 00:33:16.620
various cross terms.
00:33:16.620 --> 00:33:22.050
What you have here is really a
quadratic function of a and b.
00:33:22.050 --> 00:33:25.030
So think of this quantity that
we're minimizing as some
00:33:25.030 --> 00:33:28.920
function h of a and b, and it
happens to be quadratic.
00:33:32.500 --> 00:33:35.280
How do we minimize a
quadratic function?
00:33:35.280 --> 00:33:38.890
We set the derivative of this
function with respect to a and
00:33:38.890 --> 00:33:42.940
b to 0, and then
do the algebra.
00:33:42.940 --> 00:33:48.000
After you do the algebra you
find that the best choice for
00:33:48.000 --> 00:33:54.380
a is this one, so this is the
coefficient next to X. This is
00:33:54.380 --> 00:33:55.630
the optimal a.
00:33:59.560 --> 00:34:03.660
And the optimal b corresponds
to the constant terms.
00:34:03.660 --> 00:34:08.770
So this term and this times that
together are the optimal
00:34:08.770 --> 00:34:11.090
choices of b.
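The slide formulas are not reproduced in this transcript, so the display below is a sketch of the standard calculation rather than a copy of the slide. It assumes $\operatorname{var}(X) > 0$ and writes $\hat\Theta_L$ for the best linear estimator (that subscript is notation chosen here).

```latex
\begin{align*}
h(a,b) &= E\big[(\Theta - aX - b)^2\big], \\
\frac{\partial h}{\partial b} &= -2\,E[\Theta - aX - b] = 0
  \;\Longrightarrow\; b = E[\Theta] - a\,E[X], \\
\frac{\partial h}{\partial a} &= -2\,E\big[X(\Theta - aX - b)\big] = 0
  \;\Longrightarrow\; a = \frac{\operatorname{cov}(X,\Theta)}{\operatorname{var}(X)}, \\
\hat\Theta_L &= E[\Theta] + \frac{\operatorname{cov}(X,\Theta)}{\operatorname{var}(X)}\,\big(X - E[X]\big).
\end{align*}
```

The third line follows after substituting the expression for $b$ from the second line: the condition becomes $\operatorname{cov}(X,\Theta) - a\operatorname{var}(X) = 0$. The constant $b$ collects $E[\Theta]$ and $-a\,E[X]$, which are the two constant pieces mentioned above.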
00:34:11.090 --> 00:34:15.590
So the algebra itself is
not very interesting.
00:34:15.590 --> 00:34:19.210
What is really interesting is
the nature of the result that
00:34:19.210 --> 00:34:21.179
we get here.
00:34:21.179 --> 00:34:26.260
If we were to plot the result on
this particular example you
00:34:26.260 --> 00:34:32.280
would get the curve that's
something like this.
00:34:36.949 --> 00:34:40.710
It goes through the middle
of this diagram
00:34:40.710 --> 00:34:43.080
and is a little slanted.
00:34:43.080 --> 00:34:48.639
In this example, X and Theta
are positively correlated.
00:34:48.639 --> 00:34:51.190
Bigger values of X generally
correspond to
00:34:51.190 --> 00:34:53.139
bigger values of Theta.
00:34:53.139 --> 00:34:56.310
So in this example the
covariance between X and Theta
00:34:56.310 --> 00:35:05.530
is positive, and so our estimate
can be interpreted in
00:35:05.530 --> 00:35:09.110
the following way: The expected
value of Theta is the
00:35:09.110 --> 00:35:13.130
estimate that you would come up
with if you didn't have any
00:35:13.130 --> 00:35:15.960
information about Theta.
00:35:15.960 --> 00:35:19.590
If you don't make any
observations this is the best
00:35:19.590 --> 00:35:22.270
way of estimating Theta.
00:35:22.270 --> 00:35:26.190
But I have made an observation,
X, and I need to
00:35:26.190 --> 00:35:27.920
take it into account.
00:35:27.920 --> 00:35:32.360
I look at this difference, which
is the piece of news
00:35:32.360 --> 00:35:34.380
contained in X.
00:35:34.380 --> 00:35:37.870
That's what X should
be on the average.
00:35:37.870 --> 00:35:41.910
If I observe an X which is
bigger than what I expected it
00:35:41.910 --> 00:35:46.830
to be, and since X and Theta
are positively correlated,
00:35:46.830 --> 00:35:51.070
this tells me that Theta should
also be bigger than its
00:35:51.070 --> 00:35:52.690
average value.
00:35:52.690 --> 00:35:57.180
Whenever I see an X that's
larger than its average value
00:35:57.180 --> 00:36:00.230
this gives me an indication
that Theta should also
00:36:00.230 --> 00:36:04.480
probably be larger than
its average value.
00:36:04.480 --> 00:36:08.040
And so I'm taking that
difference and multiplying it
00:36:08.040 --> 00:36:10.240
by a positive coefficient.
00:36:10.240 --> 00:36:12.360
And that's what gives
me a curve here that
00:36:12.360 --> 00:36:14.880
has a positive slope.
00:36:14.880 --> 00:36:17.780
So this increment--
00:36:17.780 --> 00:36:21.750
the new information contained
in X as compared to the
00:36:21.750 --> 00:36:25.950
average value we expected
a priori, that increment allows
00:36:25.950 --> 00:36:30.780
us to make a correction to our
prior estimate of Theta, and
00:36:30.780 --> 00:36:34.780
the amount of that correction is
guided by the covariance of
00:36:34.780 --> 00:36:36.260
X with Theta.
00:36:36.260 --> 00:36:39.670
If the covariance of X with
Theta were 0, that would mean
00:36:39.670 --> 00:36:43.050
there's no systematic relation
between the two, and in that
00:36:43.050 --> 00:36:46.380
case obtaining some information
from X doesn't
00:36:46.380 --> 00:36:51.010
give us a guide as to how to
change the estimates of Theta.
00:36:51.010 --> 00:36:53.870
If that were 0, we would
just stay with
00:36:53.870 --> 00:36:55.050
this particular estimate.
00:36:55.050 --> 00:36:57.090
We're not able to make
a correction.
00:36:57.090 --> 00:37:00.810
But when there's a non zero
covariance between X and Theta
00:37:00.810 --> 00:37:04.620
that covariance works as a
guide for us to obtain a
00:37:04.620 --> 00:37:08.130
better estimate of Theta.
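As a small illustration of this correction rule, the sketch below computes the slope and intercept from the moments alone. The function names and the numbers in the example are hypothetical, and it assumes the standard form of the estimator, prior mean plus covariance over variance times the "news" in X, with var(X) strictly positive.

```python
def linear_lms(mean_theta, mean_x, var_x, cov_x_theta):
    """Return (a, b) so that a * x + b is the best linear estimate of Theta.

    Assumes var_x > 0. Only means, one variance and one covariance are used.
    """
    a = cov_x_theta / var_x          # slope: how strongly to react to the news
    b = mean_theta - a * mean_x      # intercept: keeps the estimate unbiased
    return a, b


def estimate(x, mean_theta, mean_x, var_x, cov_x_theta):
    """Prior mean plus a correction proportional to (x - mean_x)."""
    a, b = linear_lms(mean_theta, mean_x, var_x, cov_x_theta)
    return a * x + b


# Hypothetical moments: E[Theta] = 5, E[X] = 5, var(X) = 4, cov(X, Theta) = 2.
correlated = estimate(7.0, 5.0, 5.0, 4.0, 2.0)    # X came out above its mean
uncorrelated = estimate(7.0, 5.0, 5.0, 4.0, 0.0)  # zero covariance: no correction
print(correlated, uncorrelated)
```

With a positive covariance an observation above the mean of X pushes the estimate above the prior mean; with zero covariance the estimate stays at the prior mean whatever X is.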
00:37:12.270 --> 00:37:15.220
How about the resulting
mean squared error?
00:37:15.220 --> 00:37:18.690
In this context it turns out that
there's a very nice formula
00:37:18.690 --> 00:37:21.360
for the mean squared
error obtained from
00:37:21.360 --> 00:37:24.780
the best linear estimate.
00:37:24.780 --> 00:37:27.900
What's the story here?
00:37:27.900 --> 00:37:31.210
The mean squared error that we
have has something to do with
00:37:31.210 --> 00:37:35.450
the variance of the original
random variable.
00:37:35.450 --> 00:37:38.710
The more uncertain our original
random variable is,
00:37:38.710 --> 00:37:41.670
the more error we're
going to make.
00:37:41.670 --> 00:37:45.590
On the other hand, when the two
variables are correlated
00:37:45.590 --> 00:37:48.370
we exploit that correlation
to improve our estimate.
00:37:52.100 --> 00:37:54.650
This rho here is the correlation
coefficient
00:37:54.650 --> 00:37:56.730
between the two random
variables.
00:37:56.730 --> 00:37:59.720
When this correlation
coefficient is larger this
00:37:59.720 --> 00:38:01.780
factor here becomes smaller.
00:38:01.780 --> 00:38:04.660
And our mean squared error
becomes smaller.
00:38:04.660 --> 00:38:07.560
So think of the two
extreme cases.
00:38:07.560 --> 00:38:11.270
One extreme case is when
rho is equal to 1 --
00:38:11.270 --> 00:38:14.200
so X and Theta are perfectly
correlated.
00:38:14.200 --> 00:38:18.420
When they're perfectly
correlated once I know X then
00:38:18.420 --> 00:38:20.310
I also know Theta.
00:38:20.310 --> 00:38:23.580
And the two random variables
are linearly related.
00:38:23.580 --> 00:38:27.080
In that case, my estimate is
right on the target, and the
00:38:27.080 --> 00:38:30.860
mean squared error
is going to be 0.
00:38:30.860 --> 00:38:34.810
The other extreme case is
if rho is equal to 0.
00:38:34.810 --> 00:38:37.590
The two random variables
are uncorrelated.
00:38:37.590 --> 00:38:41.740
In that case the measurement
does not help me estimate
00:38:41.740 --> 00:38:45.390
Theta, and the uncertainty
that's left--
00:38:45.390 --> 00:38:46.970
the mean squared error--
00:38:46.970 --> 00:38:49.830
is just the original
variance of Theta.
00:38:49.830 --> 00:38:53.750
So the uncertainty in Theta
does not get reduced.
00:38:53.750 --> 00:38:54.670
So moral--
00:38:54.670 --> 00:38:59.710
the estimation error is a
reduced version of the
00:38:59.710 --> 00:39:03.660
original amount of uncertainty
in the random variable Theta,
00:39:03.660 --> 00:39:08.280
and the larger the correlation
between those two random
00:39:08.280 --> 00:39:12.620
variables, the better we can
remove uncertainty from the
00:39:12.620 --> 00:39:13.970
original random variable.
00:39:17.320 --> 00:39:21.200
I didn't derive this formula,
but it's just a matter of
00:39:21.200 --> 00:39:22.430
algebraic manipulations.
00:39:22.430 --> 00:39:25.770
We have a formula for
Theta hat, subtract
00:39:25.770 --> 00:39:27.620
Theta from that formula.
00:39:27.620 --> 00:39:30.640
Take square, take expectations,
and do a few
00:39:30.640 --> 00:39:33.750
lines of algebra that you can
read in the text, and you end
00:39:33.750 --> 00:39:35.915
up with this really neat
and clean formula.
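A sketch of those few lines of algebra follows, assuming the best linear estimator has the form $\hat\Theta_L = E[\Theta] + a\,(X - E[X])$ with $a = \operatorname{cov}(X,\Theta)/\operatorname{var}(X)$, and writing $\sigma_X$, $\sigma_\Theta$ for the standard deviations (notation chosen here).

```latex
\begin{align*}
\hat\Theta_L - \Theta &= a\,(X - E[X]) - (\Theta - E[\Theta]), \\
E\big[(\hat\Theta_L - \Theta)^2\big]
  &= a^2 \operatorname{var}(X) - 2a \operatorname{cov}(X,\Theta) + \operatorname{var}(\Theta) \\
  &= \operatorname{var}(\Theta) - \frac{\operatorname{cov}(X,\Theta)^2}{\operatorname{var}(X)} \\
  &= (1 - \rho^2)\,\sigma_\Theta^2,
  \qquad \rho = \frac{\operatorname{cov}(X,\Theta)}{\sigma_X\,\sigma_\Theta}.
\end{align*}
```

This agrees with the two extreme cases: $\rho = 1$ gives zero error, and $\rho = 0$ leaves the full variance of $\Theta$.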
00:39:38.650 --> 00:39:42.360
Now I mentioned in the beginning
of the lecture that
00:39:42.360 --> 00:39:45.220
we can do inference with Theta's
and X's not just being
00:39:45.220 --> 00:39:48.970
single numbers, but they could
be vector random variables.
00:39:48.970 --> 00:39:52.100
So for example we might have
multiple data that gives us
00:39:52.100 --> 00:39:56.710
information about Theta.
00:39:56.710 --> 00:40:00.240
There are no vectors here, so
this discussion was for the
00:40:00.240 --> 00:40:04.460
case where Theta and X were just
scalar, one-dimensional
00:40:04.460 --> 00:40:05.350
quantities.
00:40:05.350 --> 00:40:08.060
What do we do if we have
multiple data?
00:40:08.060 --> 00:40:11.990
Suppose that Theta is still a
scalar, it's one dimensional,
00:40:11.990 --> 00:40:14.710
but we make several
observations.
00:40:14.710 --> 00:40:17.050
And on the basis of these
observations we want to
00:40:17.050 --> 00:40:20.080
estimate Theta.
00:40:20.080 --> 00:40:24.650
The optimal least mean squares
estimator would be again the
00:40:24.650 --> 00:40:28.830
conditional expectation of
Theta given X. That's the
00:40:28.830 --> 00:40:30.130
optimal one.
00:40:30.130 --> 00:40:36.330
And in this case X is a
vector, so the general
00:40:36.330 --> 00:40:40.650
estimator we would use
would be this one.
00:40:40.650 --> 00:40:44.050
But if we want to keep things
simple and we want our
00:40:44.050 --> 00:40:47.300
estimator to have a simple
functional form we might
00:40:47.300 --> 00:40:51.870
restrict to estimators that are
linear functions of the data.
00:40:51.870 --> 00:40:53.800
And then the story is
exactly the same as
00:40:53.800 --> 00:40:57.010
we discussed before.
00:40:57.010 --> 00:41:00.460
I constrained myself to
estimating Theta using a
00:41:00.460 --> 00:41:05.880
linear function of the data,
so my signal processing box
00:41:05.880 --> 00:41:07.830
just applies a linear
function.
00:41:07.830 --> 00:41:11.145
And I'm looking for the best
coefficients, the coefficients
00:41:11.145 --> 00:41:13.490
that are going to result
in the least
00:41:13.490 --> 00:41:15.990
possible squared error.
00:41:15.990 --> 00:41:19.780
This is my squared error, this
is (my estimate minus the
00:41:19.780 --> 00:41:22.110
thing I'm trying to estimate)
squared, and
00:41:22.110 --> 00:41:24.100
then taking the average.
00:41:24.100 --> 00:41:25.330
How do we do this?
00:41:25.330 --> 00:41:26.580
Same story as before.
00:41:29.510 --> 00:41:32.500
The X's and the Theta's get
averaged out because we have
00:41:32.500 --> 00:41:33.430
an expectation.
00:41:33.430 --> 00:41:36.830
Whatever is left is just a
function of the coefficients
00:41:36.830 --> 00:41:38.760
of the a's and of b's.
00:41:38.760 --> 00:41:42.110
As before it turns out to
be a quadratic function.
00:41:42.110 --> 00:41:46.580
Then we set the derivatives of
this function of a's and b's
00:41:46.580 --> 00:41:50.000
with respect to the
coefficients, we set it to 0.
00:41:50.000 --> 00:41:54.340
And this gives us a system
of linear equations.
00:41:54.340 --> 00:41:56.780
It's a system of linear
equations that's satisfied by
00:41:56.780 --> 00:41:57.730
those coefficients.
00:41:57.730 --> 00:42:00.860
It's a linear system because
this is a quadratic function
00:42:00.860 --> 00:42:03.950
of those coefficients.
00:42:03.950 --> 00:42:10.410
So to get closed-form formulas
in this particular case one
00:42:10.410 --> 00:42:13.180
would need to introduce vectors,
and matrices, and
00:42:13.180 --> 00:42:15.330
matrix inverses and so on.
00:42:15.330 --> 00:42:18.570
The particular formulas are not
so much what interests us
00:42:18.570 --> 00:42:22.950
here, rather, the interesting
thing is that this is simply
00:42:22.950 --> 00:42:27.120
done just using straightforward
solvers of
00:42:27.120 --> 00:42:29.240
linear equations.
00:42:29.240 --> 00:42:32.470
The only thing you need to do
is to write down the correct
00:42:32.470 --> 00:42:35.280
coefficients of those linear
equations.
00:42:35.280 --> 00:42:37.440
And the typical coefficient
that you would
00:42:37.440 --> 00:42:39.240
get would be what?
00:42:39.240 --> 00:42:42.480
Let's say a typical
equation would be --
00:42:42.480 --> 00:42:44.190
let's take a typical
term of this
00:42:44.190 --> 00:42:45.680
quadratic once you expand it.
00:42:45.680 --> 00:42:51.470
You're going to get terms
such as a1x1 times a2x2.
00:42:51.470 --> 00:42:55.680
When you take expectations
you're left with a1a2 times
00:42:55.680 --> 00:42:58.210
expected value of x1x2.
00:43:02.030 --> 00:43:06.700
So this would involve terms such
as a1 squared expected
00:43:06.700 --> 00:43:08.520
value of x1 squared.
00:43:08.520 --> 00:43:14.760
You would get terms such as
a1a2, expected value of x1x2,
00:43:14.760 --> 00:43:20.120
and a lot of other terms;
this one should have a 2, too.
00:43:20.120 --> 00:43:23.600
So you get something that's
quadratic in your
00:43:23.600 --> 00:43:24.890
coefficients.
00:43:24.890 --> 00:43:30.490
And the constants that show up
in your system of equations
00:43:30.490 --> 00:43:33.790
are things that have to do with
the expected values of
00:43:33.790 --> 00:43:37.070
squares of your random
variables, or products of your
00:43:37.070 --> 00:43:39.130
random variables.
00:43:39.130 --> 00:43:43.060
To write down the numerical
values for these the only
00:43:43.060 --> 00:43:46.330
thing you need to know are the
means and variances of your
00:43:46.330 --> 00:43:47.570
random variables.
00:43:47.570 --> 00:43:50.360
If you know the mean and
variance then you know what
00:43:50.360 --> 00:43:51.760
this thing is.
00:43:51.760 --> 00:43:54.950
And if you know the covariances
as well then you
00:43:54.950 --> 00:43:57.250
know what this thing is.
00:43:57.250 --> 00:44:02.080
So in order to find the optimal
linear estimator in
00:44:02.080 --> 00:44:06.870
the case of multiple data you do
not need to know the entire
00:44:06.870 --> 00:44:09.230
probability distribution
of the random
00:44:09.230 --> 00:44:11.050
variables that are involved.
00:44:11.050 --> 00:44:14.690
You only need to know your
means and covariances.
00:44:14.690 --> 00:44:18.670
These are the only quantities
that affect the construction
00:44:18.670 --> 00:44:20.570
of your optimal estimator.
00:44:20.570 --> 00:44:23.840
We could see this already
in this formula.
00:44:23.840 --> 00:44:29.650
The form of my optimal estimator
is completely
00:44:29.650 --> 00:44:34.100
determined once I know the
means, variance, and
00:44:34.100 --> 00:44:37.970
covariance of the random
variables in my model.
00:44:37.970 --> 00:44:44.410
I do not need to know the
detailed distribution of the
00:44:44.410 --> 00:44:46.570
random variables that
are involved here.
00:44:51.690 --> 00:44:55.110
So as I said in general, you
find the form of the optimal
00:44:55.110 --> 00:44:59.550
estimator by using a linear
equation solver.
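To make the "linear equation solver" step concrete, here is a minimal sketch in plain Python. It assumes the standard result that setting the derivatives to zero gives the system: sum over j of cov(Xi, Xj) * a_j = cov(Xi, Theta) for each i, with b = E[Theta] minus the sum of a_i * E[Xi]. The function names and the numbers in the example are hypothetical, and the small Gaussian elimination routine stands in for whatever solver one would actually use.

```python
def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    # Augmented matrix [A | y], copied so the inputs are not modified.
    M = [list(map(float, row)) + [float(v)] for row, v in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def linear_lms_coeffs(mean_theta, mean_x, cov_xx, cov_x_theta):
    """Coefficients (a, b) of the estimator sum_i a[i] * x[i] + b.

    Only means and covariances enter; no full distribution is needed.
    Assumes the covariance matrix cov_xx is invertible.
    """
    a = solve(cov_xx, cov_x_theta)
    b = mean_theta - sum(ai * mi for ai, mi in zip(a, mean_x))
    return a, b


# Hypothetical moments for two observations.
a, b = linear_lms_coeffs(
    mean_theta=2.0,
    mean_x=[2.0, 2.0],
    cov_xx=[[5.0, 4.0], [4.0, 8.0]],
    cov_x_theta=[4.0, 4.0],
)
print(a, b)
```

The inputs are exactly the quantities named above: means, variances and covariances. Two models with different distributions but the same moments would give the same coefficients.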
00:44:59.550 --> 00:45:01.890
There are special examples
in which you can
00:45:01.890 --> 00:45:05.210
get closed-form solutions.
00:45:05.210 --> 00:45:10.090
The nicest simplest estimation
problem one can think of is
00:45:10.090 --> 00:45:11.120
the following--
00:45:11.120 --> 00:45:14.870
you have some uncertain
parameter, and you make
00:45:14.870 --> 00:45:17.790
multiple measurements
of that parameter in
00:45:17.790 --> 00:45:19.950
the presence of noise.
00:45:19.950 --> 00:45:22.520
So the Wi's are noises.
00:45:22.520 --> 00:45:25.130
The index i corresponds to your
i-th experiment.
00:45:25.130 --> 00:45:27.810
So this is the most common
situation that you encounter
00:45:27.810 --> 00:45:28.490
in the lab.
00:45:28.490 --> 00:45:31.240
If you are dealing with some
process, you're trying to
00:45:31.240 --> 00:45:34.110
measure something you measure
it over and over.
00:45:34.110 --> 00:45:37.030
Each time your measurement
has some random error.
00:45:37.030 --> 00:45:40.360
And then you need to take all
your measurements together and
00:45:40.360 --> 00:45:43.550
come up with a single
estimate.
00:45:43.550 --> 00:45:48.320
So the noises are assumed to be
independent of each other,
00:45:48.320 --> 00:45:50.010
and also to be independent
from the
00:45:50.010 --> 00:45:52.090
value of the true parameter.
00:45:52.090 --> 00:45:55.010
Without loss of generality we
can assume that the noises
00:45:55.010 --> 00:45:58.890
have 0 mean and they have
some variances that we
00:45:58.890 --> 00:46:00.340
assume to be known.
00:46:00.340 --> 00:46:03.180
Theta itself has a prior
distribution with a certain
00:46:03.180 --> 00:46:05.670
mean and a certain variance.
00:46:05.670 --> 00:46:07.610
So the form of the
optimal linear
00:46:07.610 --> 00:46:10.940
estimator is really nice.
00:46:10.940 --> 00:46:14.930
Well maybe you cannot see it
right away because this looks
00:46:14.930 --> 00:46:18.580
messy, but what is it really?
00:46:18.580 --> 00:46:24.590
It's a linear combination of
the X's and the prior mean.
00:46:24.590 --> 00:46:28.560
And it's actually a weighted
average of the X's and the
00:46:28.560 --> 00:46:30.250
prior mean.
00:46:30.250 --> 00:46:33.570
Here we collect all of
the coefficients that
00:46:33.570 --> 00:46:35.920
we have at the top.
00:46:35.920 --> 00:46:42.060
So the whole thing is basically
a weighted average.
00:46:46.460 --> 00:46:51.110
1/(sigma_i-squared) is the
weight that we give to Xi, and
00:46:51.110 --> 00:46:54.710
in the denominator we have the
sum of all of the weights.
00:46:54.710 --> 00:46:59.260
So in the end we're dealing
with a weighted average.
00:46:59.260 --> 00:47:03.760
If mu was equal to 1, and all
the Xi's were equal to 1 then
00:47:03.760 --> 00:47:06.790
our estimate would also
be equal to 1.
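The weighted average can be sketched in a few lines. The code below assumes the model described above, Xi = Theta + Wi with independent zero-mean noises, and treats the prior mean as one more data point whose weight is one over the prior variance. The function name, argument names and sample numbers are hypothetical.

```python
def weighted_average_estimate(prior_mean, prior_var, xs, noise_vars):
    """Weighted average of the prior mean and the measurements.

    Each value is weighted by 1 / variance: the prior mean by 1 / prior_var,
    and measurement xs[i] by 1 / noise_vars[i]. All variances must be positive.
    """
    weights = [1.0 / prior_var] + [1.0 / v for v in noise_vars]
    values = [prior_mean] + list(xs)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# If the prior mean and every measurement equal 1, so does the estimate.
all_ones = weighted_average_estimate(1.0, 4.0, [1.0, 1.0, 1.0], [1.0, 2.0, 4.0])

# A precise measurement (variance 1) pulls harder than a noisy one (variance 4).
mixed = weighted_average_estimate(2.0, 4.0, [3.0, 5.0], [1.0, 4.0])
print(all_ones, mixed)
```

In the second example the estimate lands closer to 3 than to 5, because the measurement with the smaller noise variance carries four times the weight of the noisier one.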
00:47:06.790 --> 00:47:10.670
Now the form of the weights that
we have is interesting.
00:47:10.670 --> 00:47:16.050
Any given data point is
weighted inversely
00:47:16.050 --> 00:47:17.820
proportional to the variance.
00:47:17.820 --> 00:47:20.270
What does that say?
00:47:20.270 --> 00:47:26.920
If my i-th data point has a lot
of variance, if Wi is very
00:47:26.920 --> 00:47:32.900
noisy then Xi is not very
useful, is not very reliable.
00:47:32.900 --> 00:47:36.840
So I'm giving it
a small weight.
00:47:36.840 --> 00:47:41.870
Large variance, a lot of error
in my Xi means that I should
00:47:41.870 --> 00:47:44.200
give it a smaller weight.
00:47:44.200 --> 00:47:47.920
If two data points have the
same variance, they're of
00:47:47.920 --> 00:47:50.140
comparable quality,
then I'm going to
00:47:50.140 --> 00:47:51.950
give them equal weight.
00:47:51.950 --> 00:47:56.200
The other interesting thing is
that the prior mean is treated
00:47:56.200 --> 00:47:58.300
the same way as the X's.
00:47:58.300 --> 00:48:03.050
So it's treated as an additional
observation.
00:48:03.050 --> 00:48:07.100
So we're taking a weighted
average of the prior mean and
00:48:07.100 --> 00:48:09.850
of the measurements that
we are making.
00:48:09.850 --> 00:48:13.380
The formula looks as if the
prior mean was just another
00:48:13.380 --> 00:48:14.210
data point.
00:48:14.210 --> 00:48:17.440
So that's the way of thinking
about Bayesian estimation.
00:48:17.440 --> 00:48:20.270
You have your real data points,
the X's that you
00:48:20.270 --> 00:48:23.430
observe, you also had some
prior information.
00:48:23.430 --> 00:48:27.470
This plays a role similar
to a data point.
00:48:27.470 --> 00:48:31.580
It's interesting to note that if all
random variables are normal in
00:48:31.580 --> 00:48:35.230
this model this optimal linear
estimator happens to be
00:48:35.230 --> 00:48:36.950
also the conditional
expectation.
00:48:36.950 --> 00:48:40.000
That's the nice thing about
normal random variables that
00:48:40.000 --> 00:48:42.770
conditional expectations
turn out to be linear.
00:48:42.770 --> 00:48:46.920
So the optimal estimate and the
optimal linear estimate
00:48:46.920 --> 00:48:48.560
turn out to be the same.
00:48:48.560 --> 00:48:51.050
And that gives us another
interpretation of linear
00:48:51.050 --> 00:48:52.100
estimation.
00:48:52.100 --> 00:48:54.660
Linear estimation is essentially
the same as
00:48:54.660 --> 00:48:58.970
pretending that all random
variables are normal.
00:48:58.970 --> 00:49:02.040
So that's a side point.
00:49:02.040 --> 00:49:04.230
Now I'd like to close
with a comment.
00:49:08.370 --> 00:49:11.760
You do your measurements and
you estimate Theta on the
00:49:11.760 --> 00:49:17.040
basis of X. Suppose that instead
you have a measuring
00:49:17.040 --> 00:49:20.970
device that measures X-cubed
instead of measuring X, and
00:49:20.970 --> 00:49:23.350
you want to estimate Theta.
00:49:23.350 --> 00:49:26.760
Are you going to get a
different estimate?
00:49:26.760 --> 00:49:31.790
Well, X and X-cubed contain
the same information.
00:49:31.790 --> 00:49:34.730
Telling you X is the
same as telling you
00:49:34.730 --> 00:49:36.640
the value of X-cubed.
00:49:36.640 --> 00:49:40.660
So the posterior distribution
of Theta given X is the same
00:49:40.660 --> 00:49:44.160
as the posterior distribution
of Theta given X-cubed.
00:49:44.160 --> 00:49:47.450
And so the means of these
posterior distributions are
00:49:47.450 --> 00:49:49.390
going to be the same.
00:49:49.390 --> 00:49:52.850
So doing transformations to
your data does not
00:49:52.850 --> 00:49:57.370
matter if you're doing optimal
least squares estimation.
00:49:57.370 --> 00:50:00.100
On the other hand, if you
restrict yourself to doing
00:50:00.100 --> 00:50:05.540
linear estimation then using a
linear function of X is not
00:50:05.540 --> 00:50:09.720
the same as using a linear
function of X-cubed.
00:50:09.720 --> 00:50:14.720
So this is a linear estimator,
but where the data are the
00:50:14.720 --> 00:50:19.250
X-cube's, and we have a linear
function of the data.
00:50:19.250 --> 00:50:23.690
So this means that when you're
using linear estimation you
00:50:23.690 --> 00:50:28.040
have some choices to make:
linear in what?
00:50:28.040 --> 00:50:32.290
Sometimes you want to plot your
data on an ordinary
00:50:32.290 --> 00:50:35.090
scale and try to plot
a line through them.
00:50:35.090 --> 00:50:38.360
Sometimes you plot your data
on a logarithmic scale, and
00:50:38.360 --> 00:50:40.480
try to plot a line
through them.
00:50:40.480 --> 00:50:42.390
Which scale is the
appropriate one?
00:50:42.390 --> 00:50:44.510
Here it would be
a cubic scale.
00:50:44.510 --> 00:50:46.830
And you have to think about
your particular model to
00:50:46.830 --> 00:50:51.180
decide which version would be
a more appropriate one.
00:50:51.180 --> 00:50:55.830
Finally when we have multiple
data sometimes these multiple
00:50:55.830 --> 00:50:59.910
data might contain the
same information.
00:50:59.910 --> 00:51:02.800
So X is one data point,
X-squared is another data
00:51:02.800 --> 00:51:05.610
point, X-cubed is another
data point.
00:51:05.610 --> 00:51:08.540
The three of them contain the
same information, but you can
00:51:08.540 --> 00:51:11.480
try to form a linear
function of them.
00:51:11.480 --> 00:51:14.380
And then you obtain a linear
estimator that has a more
00:51:14.380 --> 00:51:16.930
general form as a
function of X.
00:51:16.930 --> 00:51:22.130
So if you want to estimate your
Theta as a cubic function
00:51:22.130 --> 00:51:26.330
of X, for example, you can set
up a linear estimation model
00:51:26.330 --> 00:51:29.480
of this particular form and find
the optimal coefficients,
00:51:29.480 --> 00:51:32.900
the a's and the b's.
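In symbols, such a model might look as follows. This is a sketch: the coefficient names $a_1, a_2, a_3, b$ are chosen here, and the slide may label them differently.

```latex
\begin{align*}
\hat\Theta &= a_1 X + a_2 X^2 + a_3 X^3 + b, \\
(a_1, a_2, a_3, b) &= \arg\min_{a_1, a_2, a_3, b}\;
  E\big[(\Theta - a_1 X - a_2 X^2 - a_3 X^3 - b)^2\big].
\end{align*}
```

The estimator is cubic in $X$ but still linear in the coefficients, so the objective remains a quadratic function of $(a_1, a_2, a_3, b)$ and the optimum again comes from a system of linear equations. The constants in that system are now the moments $E[X^k]$ for $k$ up to 6 and the cross moments $E[\Theta X^k]$ for $k = 1, 2, 3$, together with $E[\Theta]$.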
00:51:32.900 --> 00:51:35.700
All right, so the last slide
just gives you the big picture
00:51:35.700 --> 00:51:39.330
of what's happening in Bayesian
Inference, it's for
00:51:39.330 --> 00:51:40.330
you to ponder.
00:51:40.330 --> 00:51:41.930
Basically we talked about three
00:51:41.930 --> 00:51:43.470
possible estimation methods.
00:51:43.470 --> 00:51:48.300
Maximum a posteriori, mean squared
error estimation, and
00:51:48.300 --> 00:51:51.070
linear mean squared error
estimation, or least squares
00:51:51.070 --> 00:51:52.290
estimation.
00:51:52.290 --> 00:51:54.410
And there's a number of standard
examples that you
00:51:54.410 --> 00:51:57.130
will be seeing over and over in
the recitations, tutorials,
00:51:57.130 --> 00:52:00.950
homework, and so on, perhaps
on exams even.
00:52:00.950 --> 00:52:05.630
Where we take some nice priors
on some unknown parameter, we
00:52:05.630 --> 00:52:09.410
take some nice models for the
noise or the observations, and
00:52:09.410 --> 00:52:11.880
then you need to work out
posterior distributions and the
00:52:11.880 --> 00:52:13.570
various estimates and
compare them.