WEBVTT
00:00:00.910 --> 00:00:03.000
In this segment, we
introduce a model
00:00:03.000 --> 00:00:07.020
with multiple parameters
and multiple observations.
00:00:07.020 --> 00:00:09.980
It is a model that appears
in countless real-world
00:00:09.980 --> 00:00:11.110
applications.
00:00:11.110 --> 00:00:13.920
But instead of giving you a
general and abstract model,
00:00:13.920 --> 00:00:16.160
we will talk about a
specific context that
00:00:16.160 --> 00:00:18.900
has all the elements
of the general model,
00:00:18.900 --> 00:00:21.200
but it has the advantage
of being concrete.
00:00:21.200 --> 00:00:23.930
And we can also
visualize the results.
00:00:23.930 --> 00:00:26.040
The model is as follows.
00:00:26.040 --> 00:00:30.820
Somebody is holding a ball
and throws it upwards.
00:00:30.820 --> 00:00:35.290
This ball is going to
follow a certain trajectory.
00:00:35.290 --> 00:00:37.520
What kind of trajectory is it?
00:00:37.520 --> 00:00:40.240
According to Newton's
laws, we know
00:00:40.240 --> 00:00:43.560
that it's going to be described
by a quadratic function
00:00:43.560 --> 00:00:44.940
of time.
00:00:44.940 --> 00:00:48.500
So here's a plot of such
a quadratic function,
00:00:48.500 --> 00:00:50.700
where this is the time axis.
00:00:50.700 --> 00:00:55.355
And this variable here, x,
is the vertical displacement
00:00:55.355 --> 00:00:56.910
of the ball.
00:00:56.910 --> 00:01:00.000
The ball initially is
at a certain location,
00:01:00.000 --> 00:01:02.790
at a certain height-- theta 0.
00:01:02.790 --> 00:01:04.420
It is thrown upwards.
00:01:04.420 --> 00:01:09.190
And it starts moving with some
initial velocity, theta 1.
00:01:09.190 --> 00:01:11.260
But because of the
gravitational force,
00:01:11.260 --> 00:01:13.640
it will start
eventually going down.
00:01:13.640 --> 00:01:17.740
So this parameter theta 2, which
would actually be negative,
00:01:17.740 --> 00:01:21.120
reflects the
gravitational constant.
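
NOTE The quadratic trajectory described here can be written out as below. The name x(t) for the vertical displacement at time t is notation assumed for this note; the slide itself is not part of the transcript.
```latex
x(t) = \theta_0 + \theta_1 t + \theta_2 t^2, \qquad \theta_2 < 0
```
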
00:01:21.120 --> 00:01:25.170
Suppose now that you do
not know these parameters.
00:01:25.170 --> 00:01:27.750
You do not know
exactly where the ball was
00:01:27.750 --> 00:01:28.840
when it was thrown.
00:01:28.840 --> 00:01:31.170
You don't know the
exact velocity.
00:01:31.170 --> 00:01:33.870
And perhaps you also live
in a strange gravitational
00:01:33.870 --> 00:01:37.190
environment, and you do not
know the gravitational constant.
00:01:37.190 --> 00:01:39.330
And you would like to
estimate those quantities
00:01:39.330 --> 00:01:41.150
based on measurements.
00:01:41.150 --> 00:01:44.350
If you are to follow the
Bayesian inference methodology,
00:01:44.350 --> 00:01:47.550
what you need to do is
to model those variables
00:01:47.550 --> 00:01:51.539
as random variables, and
think of the actual realized
00:01:51.539 --> 00:01:55.820
values as realizations of
these random variables.
00:01:55.820 --> 00:01:58.890
And we also need some
prior distributions
00:01:58.890 --> 00:02:00.190
for these random variables.
00:02:00.190 --> 00:02:03.580
We should specify the
joint PDF of these three
00:02:03.580 --> 00:02:04.890
random variables.
00:02:04.890 --> 00:02:07.290
A common assumption
here is to assume
00:02:07.290 --> 00:02:09.620
that the random
variables involved are
00:02:09.620 --> 00:02:11.560
independent of each other.
00:02:11.560 --> 00:02:14.210
And each one has
a certain prior.
00:02:14.210 --> 00:02:16.590
Where do these priors come from?
00:02:16.590 --> 00:02:21.520
If, for example, you know
where the person is who's
00:02:21.520 --> 00:02:27.220
throwing the ball, if you know
their location within let's say
00:02:27.220 --> 00:02:31.600
a meter or so, you should have
a distribution for Theta 0
00:02:31.600 --> 00:02:33.850
that describes your
state of knowledge
00:02:33.850 --> 00:02:36.160
and which has a
width or standard
00:02:36.160 --> 00:02:40.060
deviation of maybe
a couple of meters.
00:02:40.060 --> 00:02:43.280
So the priors are chosen
to reflect whatever
00:02:43.280 --> 00:02:48.710
you know about the possible
values of these parameters.
00:02:48.710 --> 00:02:51.260
Then what is going to
happen is that you're
00:02:51.260 --> 00:02:55.110
going to observe the trajectory
at certain points in time.
00:02:55.110 --> 00:02:59.100
For example, at a certain time
t1, you make a measurement.
00:02:59.100 --> 00:03:01.440
But your measurement
is not exact.
00:03:01.440 --> 00:03:04.900
It is noisy, and you
record a certain value.
00:03:04.900 --> 00:03:08.020
At another time you make
another measurement,
00:03:08.020 --> 00:03:10.040
and you record another value.
00:03:10.040 --> 00:03:12.560
At another time, you
make another measurement,
00:03:12.560 --> 00:03:14.550
and you record another value.
00:03:14.550 --> 00:03:17.840
And similarly, you get
multiple measurements.
00:03:17.840 --> 00:03:19.430
On the basis of
these measurements,
00:03:19.430 --> 00:03:22.500
you would like to
estimate the parameters.
00:03:22.500 --> 00:03:25.420
Let us write down a model
for these measurements.
00:03:25.420 --> 00:03:27.220
We assume that
the measurement is
00:03:27.220 --> 00:03:29.840
equal to the true
position of the ball
00:03:29.840 --> 00:03:31.720
at the time of the
measurement, which
00:03:31.720 --> 00:03:35.300
is this quantity,
plus some noise.
00:03:35.300 --> 00:03:39.170
And we introduce a model for
these noise random variables.
00:03:39.170 --> 00:03:41.370
We assume that we
have a probability
00:03:41.370 --> 00:03:43.050
distribution for them.
00:03:43.050 --> 00:03:47.079
And we also assume that they're
independent of each other,
00:03:47.079 --> 00:03:49.440
and independent from the Thetas.
00:03:49.440 --> 00:03:52.550
It is quite often the case that, when
measuring devices
00:03:52.550 --> 00:03:55.220
try to measure
something multiple times,
00:03:55.220 --> 00:03:57.740
the corresponding noises
will be independent.
00:03:57.740 --> 00:04:00.370
So this is often a
realistic assumption.
00:04:00.370 --> 00:04:04.150
And in addition, the noise
in the measuring devices
00:04:04.150 --> 00:04:08.410
is usually independent from
whatever randomness there
00:04:08.410 --> 00:04:12.430
is that determined the
values of the phenomenon
00:04:12.430 --> 00:04:14.470
that you are trying to measure.
00:04:14.470 --> 00:04:19.120
So these are pretty common
and realistic assumptions.
00:04:19.120 --> 00:04:21.220
Let us take stock
of what we have.
00:04:21.220 --> 00:04:24.670
We have observations that
are determined according
00:04:24.670 --> 00:04:28.160
to this relation, where
Wi are noise terms.
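
NOTE A minimal simulation sketch of the observation relation, X_i = Theta0 + Theta1 t_i + Theta2 t_i^2 + W_i, under the zero-mean normal assumptions made in this segment. The standard deviations, measurement times, and variable names are hypothetical values chosen for illustration; the lecture does not specify them.
```python
import numpy as np
# All numbers below are assumptions for illustration, not values from the lecture.
rng = np.random.default_rng(0)
prior_std = np.array([2.0, 5.0, 3.0])  # assumed std devs of Theta0, Theta1, Theta2
noise_std = 0.5                        # assumed std dev shared by every W_i
t = np.linspace(0.0, 2.0, 10)          # assumed measurement times t_1..t_n
# One realization of the three independent zero-mean normal parameters.
theta = rng.normal(0.0, prior_std)
# Independent zero-mean normal noise terms, also independent of theta.
w = rng.normal(0.0, noise_std, size=t.size)
# Each observation is the true position at time t_i plus noise.
x = theta[0] + theta[1] * t + theta[2] * t**2 + w
```
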
00:04:28.160 --> 00:04:30.700
Now let us make some more
concrete assumptions.
00:04:30.700 --> 00:04:33.970
We will assume that the
random variables, the Thetas,
00:04:33.970 --> 00:04:36.010
are normal random variables.
00:04:36.010 --> 00:04:38.320
And similarly, the
W's are normal.
00:04:38.320 --> 00:04:42.020
As we said before,
they're all independent.
00:04:42.020 --> 00:04:44.470
And to keep the
formulas simple, we
00:04:44.470 --> 00:04:47.270
will assume that the means
of those random variables
00:04:47.270 --> 00:04:50.640
are 0, although the
same procedure can
00:04:50.640 --> 00:04:53.900
be followed if the
means are nonzero.
00:04:53.900 --> 00:04:56.930
We would like to estimate
the Theta parameters
00:04:56.930 --> 00:04:59.370
on the basis of
the observations.
00:04:59.370 --> 00:05:03.070
We will use, as usual, the
appropriate form of the Bayes
00:05:03.070 --> 00:05:05.320
rule, which is this
one-- the Bayes
00:05:05.320 --> 00:05:08.030
rule for continuous
random variables.
00:05:08.030 --> 00:05:12.430
The only thing to notice is
that in this notation here,
00:05:12.430 --> 00:05:18.720
x is an n-dimensional vector,
because we have n observations.
00:05:18.720 --> 00:05:22.700
And theta in this example is
a three-dimensional vector,
00:05:22.700 --> 00:05:25.830
because we have three
unknown parameters.
00:05:25.830 --> 00:05:29.850
So wherever you see a theta
or an x without a subscript,
00:05:29.850 --> 00:05:34.230
it should be
interpreted as a vector.
00:05:34.230 --> 00:05:37.890
Now, in order to calculate
this posterior distribution,
00:05:37.890 --> 00:05:43.060
we need to put our hands on
the conditional density of X.
00:05:43.060 --> 00:05:45.650
Actually, it's a joint
density, because X
00:05:45.650 --> 00:05:49.460
is a vector, given
the value of Theta.
00:05:49.460 --> 00:05:51.720
The arguments are
pretty much the same
00:05:51.720 --> 00:05:53.950
as in our previous examples.
00:05:53.950 --> 00:05:56.390
And it goes as follows.
00:05:56.390 --> 00:05:59.420
Suppose that I tell you the
value of the parameters,
00:05:59.420 --> 00:06:00.550
as here.
00:06:00.550 --> 00:06:02.590
Then we look at this equation.
00:06:02.590 --> 00:06:06.560
This quantity is now a constant.
00:06:06.560 --> 00:06:09.090
And we have a constant
plus normal noise.
00:06:09.090 --> 00:06:13.050
So Xi is this normal
noise whose mean
00:06:13.050 --> 00:06:15.610
is shifted by this constant.
00:06:15.610 --> 00:06:19.090
So Xi is going to be a
normal random variable,
00:06:19.090 --> 00:06:24.160
with a mean of theta
0 plus theta 1ti,
00:06:24.160 --> 00:06:29.270
plus theta 2ti squared.
00:06:29.270 --> 00:06:32.075
And a variance equal
to the variance of Wi.
00:06:34.970 --> 00:06:37.900
We know what the normal PDF is.
00:06:37.900 --> 00:06:39.840
So we can write it down.
00:06:39.840 --> 00:06:42.880
It's the usual exponential
of a quadratic.
00:06:42.880 --> 00:06:45.880
And in this
quadratic, we have Xi
00:06:45.880 --> 00:06:49.710
minus the mean of the
normal random variable
00:06:49.710 --> 00:06:52.870
that we're dealing with.
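
NOTE The conditional density described in words here can be sketched as follows. The symbol sigma^2 for the variance of W_i and the constant c_i are notation assumed for this note; the transcript does not name them.
```latex
f_{X_i \mid \Theta}(x_i \mid \theta)
  = c_i \exp\left\{ -\frac{\left(x_i - \theta_0 - \theta_1 t_i - \theta_2 t_i^2\right)^2}{2\sigma^2} \right\},
\qquad c_i = \frac{1}{\sqrt{2\pi}\,\sigma}
```
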
00:06:52.870 --> 00:06:55.500
Let us now continue
and write down
00:06:55.500 --> 00:06:57.600
a formula for the posterior.
00:06:57.600 --> 00:07:00.270
We first have this
denominator term,
00:07:00.270 --> 00:07:02.760
which does not
involve any thetas,
00:07:02.760 --> 00:07:07.580
and as in previous examples,
does not really concern us.
00:07:07.580 --> 00:07:11.860
Here we have the joint PDF
of the vector of Thetas.
00:07:11.860 --> 00:07:13.450
There's three of them.
00:07:13.450 --> 00:07:17.110
Because we assumed that
the Thetas are independent,
00:07:17.110 --> 00:07:23.390
the joint PDF factors as a
product of individual PDFs.
00:07:30.550 --> 00:07:37.120
And then, we have here the joint
PDF of X, conditioned on Theta.
00:07:37.120 --> 00:07:39.960
Now, with the same argument
as in our previous discussion
00:07:39.960 --> 00:07:42.290
of the case of
multiple observations,
00:07:42.290 --> 00:07:45.310
once I tell you the
values of Theta,
00:07:45.310 --> 00:07:48.960
then the Xi's are just
functions of the noises.
00:07:48.960 --> 00:07:53.470
The noises are independent, so
the Xi's are also independent.
00:07:53.470 --> 00:07:57.020
So in the conditional universe,
where the value of Theta is
00:07:57.020 --> 00:08:00.950
known, the Xi's are
independent and therefore,
00:08:00.950 --> 00:08:06.270
the joint PDF of the Xi's
is equal to the product
00:08:06.270 --> 00:08:10.770
of the marginal PDFs of
each one of the Xi's.
00:08:14.330 --> 00:08:19.950
But this marginal PDF in
the conditional universe
00:08:19.950 --> 00:08:24.390
of the Xi's is something that
we have already calculated.
00:08:24.390 --> 00:08:27.560
And so we know what each
one of these densities is.
00:08:27.560 --> 00:08:29.300
We can write them down.
00:08:29.300 --> 00:08:32.169
And we obtain an
expression of this form.
00:08:32.169 --> 00:08:35.000
We have a normalizing
constant in the beginning.
00:08:35.000 --> 00:08:39.090
We have here a term that comes
from the prior for Theta 0,
00:08:39.090 --> 00:08:41.789
and a term from the prior for Theta 1.
00:08:41.789 --> 00:08:44.120
Here is a typical
term that comes
00:08:44.120 --> 00:08:50.390
from the density of Xi,
which is this term up here.
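
NOTE One way to write the expression being described, assuming zero-mean normal priors with variances sigma_0^2, sigma_1^2, sigma_2^2 and a common noise variance sigma^2. These variance symbols and the normalizing factor c(x) are notation assumed for this note, since the slide is not part of the transcript.
```latex
f_{\Theta \mid X}(\theta \mid x)
  = c(x)\,
    \exp\left\{-\frac{\theta_0^2}{2\sigma_0^2}\right\}
    \exp\left\{-\frac{\theta_1^2}{2\sigma_1^2}\right\}
    \exp\left\{-\frac{\theta_2^2}{2\sigma_2^2}\right\}
    \prod_{i=1}^{n}
    \exp\left\{-\frac{\left(x_i - \theta_0 - \theta_1 t_i - \theta_2 t_i^2\right)^2}{2\sigma^2}\right\}
```
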
00:08:50.390 --> 00:08:53.610
So here is what we have so far.
00:08:53.610 --> 00:08:56.840
How should we now
estimate Theta if we
00:08:56.840 --> 00:08:59.330
wish to derive a point estimate?
00:08:59.330 --> 00:09:02.570
The natural process is to look
for the maximum a posteriori
00:09:02.570 --> 00:09:05.760
probability estimate,
which will maximize
00:09:05.760 --> 00:09:09.040
this expression over theta.
00:09:09.040 --> 00:09:12.550
Find a set of theta parameters--
00:09:12.550 --> 00:09:15.270
it's a three-dimensional
vector-- for which
00:09:15.270 --> 00:09:17.480
this quantity is largest.
00:09:17.480 --> 00:09:21.660
Equivalently, we can look at the
exponent, get rid of the minus
00:09:21.660 --> 00:09:24.550
signs, and minimize
the quadratic function
00:09:24.550 --> 00:09:26.430
that we obtain here.
00:09:26.430 --> 00:09:29.680
How does one minimize
a quadratic function?
00:09:29.680 --> 00:09:33.080
We take the derivatives
with respect
00:09:33.080 --> 00:09:39.450
to each one of the parameters,
and set the derivative to 0.
00:09:42.000 --> 00:09:48.375
This way, we will get three
equations in three unknowns.
00:09:51.490 --> 00:09:54.730
And because it's a
quadratic function of theta,
00:09:54.730 --> 00:09:58.970
these derivatives will be
linear functions of theta.
00:09:58.970 --> 00:10:01.510
So these equations
that we're dealing with
00:10:01.510 --> 00:10:03.560
will be linear equations.
00:10:03.560 --> 00:10:06.430
So it's a system of
three linear equations
00:10:06.430 --> 00:10:08.190
which we can solve numerically.
00:10:08.190 --> 00:10:11.570
And this is what is
usually done in practice.
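
NOTE A minimal sketch of solving the three linear equations numerically, assuming zero-mean normal priors and a common noise variance. The function name map_estimate and its parameters prior_std and noise_std are hypothetical names introduced for illustration; this is one possible way to set up the system, not necessarily the one used in the lecture.
```python
import numpy as np
def map_estimate(t, x, prior_std, noise_std):
    """Return the MAP estimate of (theta0, theta1, theta2) under the assumptions above."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    prior_std = np.asarray(prior_std, dtype=float)
    # Row i of A is (1, t_i, t_i^2), so A @ theta is the noiseless trajectory.
    A = np.column_stack([np.ones_like(t), t, t**2])
    # Setting the gradient of the quadratic exponent to zero gives
    # (diag(1 / prior_var) + A^T A / noise_var) theta = A^T x / noise_var.
    lhs = np.diag(1.0 / prior_std**2) + A.T @ A / noise_std**2
    rhs = A.T @ x / noise_std**2
    # Three linear equations in three unknowns.
    return np.linalg.solve(lhs, rhs)
```
With very broad priors the prior terms contribute little, and the result approaches an ordinary least-squares fit of the quadratic; with tight priors the estimate is pulled toward the prior mean of zero.
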
00:10:11.570 --> 00:10:14.920
And this is exactly what
it takes to calculate
00:10:14.920 --> 00:10:16.960
the maximum a
posteriori probability
00:10:16.960 --> 00:10:21.910
estimate in this particular
example that we have discussed.
00:10:21.910 --> 00:10:24.680
It turns out, as we
will discuss later,
00:10:24.680 --> 00:10:28.600
that there are many interesting
properties of this estimate,
00:10:28.600 --> 00:10:31.160
which are quite general.