WEBVTT
00:00:01.240 --> 00:00:05.370
We can finally go ahead and
introduce the basic elements
00:00:05.370 --> 00:00:08.660
of the Bayesian
inference framework.
00:00:08.660 --> 00:00:11.700
There is an unknown
quantity, which
00:00:11.700 --> 00:00:15.900
we treat as a random variable,
and this is what's special
00:00:15.900 --> 00:00:19.780
and why we call this the
Bayesian inference framework.
00:00:19.780 --> 00:00:22.295
This is in contrast
to other frameworks
00:00:22.295 --> 00:00:25.850
in which the unknown
quantity theta is just
00:00:25.850 --> 00:00:28.110
treated as an unknown constant.
00:00:28.110 --> 00:00:30.260
But here, we treat it
as a random variable,
00:00:30.260 --> 00:00:32.980
and as such, it
has a distribution.
00:00:32.980 --> 00:00:34.660
This is the prior distribution.
00:00:34.660 --> 00:00:36.890
This is what we
believe about Theta
00:00:36.890 --> 00:00:39.590
before we obtain any data.
00:00:39.590 --> 00:00:44.720
And then, we obtain some data,
which are some observation.
00:00:44.720 --> 00:00:47.250
That observation is
a random variable,
00:00:47.250 --> 00:00:49.510
but when the process
gets realized,
00:00:49.510 --> 00:00:53.310
we observe an actual
value, numerical value,
00:00:53.310 --> 00:00:55.000
of this random variable.
00:00:55.000 --> 00:00:58.280
The observation
process is modeled,
00:00:58.280 --> 00:01:00.710
again in terms of a
probabilistic model.
00:01:00.710 --> 00:01:03.810
We specify the
distribution of X,
00:01:03.810 --> 00:01:07.510
but we actually specify the
conditional distribution of X.
00:01:07.510 --> 00:01:12.750
We say how X will
behave if Theta happens
00:01:12.750 --> 00:01:15.870
to take on a specific value.
00:01:15.870 --> 00:01:20.690
These two pieces, the prior and
the model of the observations,
00:01:20.690 --> 00:01:22.580
are the two components
of the model
00:01:22.580 --> 00:01:25.370
that we will be working with.
00:01:25.370 --> 00:01:29.560
Once we have obtained a specific
value for the observations,
00:01:29.560 --> 00:01:33.220
then we can use the
Bayes rule to calculate
00:01:33.220 --> 00:01:37.090
the conditional
distribution of Theta,
00:01:37.090 --> 00:01:40.750
either a conditional
PMF if Theta is discrete
00:01:40.750 --> 00:01:44.970
or a conditional PDF
if Theta is continuous.
00:01:44.970 --> 00:01:48.310
And this will be a complete
solution, in some sense,
00:01:48.310 --> 00:01:51.740
of the Bayesian
inference problem.
00:01:51.740 --> 00:01:55.550
There's one philosophical issue
about this framework, which
00:01:55.550 --> 00:01:59.870
is where does this prior
distribution come from?
00:01:59.870 --> 00:02:01.810
How do we choose it?
00:02:01.810 --> 00:02:05.110
Sometimes we can choose it
using a symmetry argument.
00:02:05.110 --> 00:02:08.000
If there's a number of
possible choices for Theta
00:02:08.000 --> 00:02:10.860
and there's a reason to
believe that they're all
00:02:10.860 --> 00:02:12.630
equally likely,
we have no reason
00:02:12.630 --> 00:02:15.300
to believe that one is more
likely than the other, then
00:02:15.300 --> 00:02:19.800
the symmetry consideration
gives us a uniform prior.
00:02:19.800 --> 00:02:22.760
We definitely take into
account any information
00:02:22.760 --> 00:02:26.520
we have about the range
of the parameter Theta,
00:02:26.520 --> 00:02:32.220
so we use that range and we
assign 0 prior probability
00:02:32.220 --> 00:02:35.130
for values of Theta
outside the range.
00:02:35.130 --> 00:02:37.720
Sometimes, we have some
knowledge about Theta
00:02:37.720 --> 00:02:42.740
from previous studies of a
certain problem, that tell us
00:02:42.740 --> 00:02:45.640
a little bit about
what Theta might be,
00:02:45.640 --> 00:02:48.090
and then when we obtain
new observations,
00:02:48.090 --> 00:02:51.280
we refine those results
that were obtained
00:02:51.280 --> 00:02:54.450
from previous studies by
applying the Bayes rule.
00:02:54.450 --> 00:02:56.750
And in some cases,
finally, the choice
00:02:56.750 --> 00:03:01.510
could be arbitrary or subjective
just reflecting our beliefs
00:03:01.510 --> 00:03:05.280
about Theta, some
plausible judgment
00:03:05.280 --> 00:03:10.690
about the relative likelihoods
of different choices of Theta.
00:03:10.690 --> 00:03:14.140
Now, as we just discussed,
the complete solution
00:03:14.140 --> 00:03:18.020
or the complete answer to a
Bayesian inference problem
00:03:18.020 --> 00:03:21.690
is just the specification of
the posterior distribution
00:03:21.690 --> 00:03:24.810
of Theta given the particular
observation that we
00:03:24.810 --> 00:03:26.430
have obtained.
00:03:26.430 --> 00:03:29.000
Pictorially, if
Theta is discrete,
00:03:29.000 --> 00:03:33.220
a complete answer might be in
the form of such a diagram that
00:03:33.220 --> 00:03:35.380
tells us that certain
values of Theta
00:03:35.380 --> 00:03:38.120
are possible with
certain probabilities.
00:03:38.120 --> 00:03:41.500
Or if Theta is continuous,
a complete solution
00:03:41.500 --> 00:03:45.090
might be in the form of a
conditional PDF that again
00:03:45.090 --> 00:03:49.380
tells us the conditional
distribution of Theta.
00:03:49.380 --> 00:03:53.160
To appreciate the idea
here, consider the problem
00:03:53.160 --> 00:03:57.360
of guessing the number
of electoral votes
00:03:57.360 --> 00:04:01.470
that a candidate gets in
the presidential election.
00:04:01.470 --> 00:04:04.210
The electoral votes
are certain votes
00:04:04.210 --> 00:04:06.190
that the candidate
gets from each one
00:04:06.190 --> 00:04:09.210
of the states in
the United States.
00:04:09.210 --> 00:04:11.690
And there is a certain
number that the candidate
00:04:11.690 --> 00:04:14.540
needs to get in order
to be elected president.
00:04:14.540 --> 00:04:17.149
One possible prediction
could be a statement
00:04:17.149 --> 00:04:20.670
that I predict that
candidate A will win,
00:04:20.670 --> 00:04:23.160
but actually a more
complete presentation
00:04:23.160 --> 00:04:25.480
of the results of
a poll could be
00:04:25.480 --> 00:04:30.190
a diagram of this kind,
which is essentially a PMF.
00:04:30.190 --> 00:04:34.270
Here, a particular pollster
collected all the data
00:04:34.270 --> 00:04:38.240
and gave the posterior
probability distribution
00:04:38.240 --> 00:04:43.500
for the different possible
numbers of electoral votes.
00:04:43.500 --> 00:04:45.900
And this diagram is a
lot more informative
00:04:45.900 --> 00:04:50.230
than the simple statement that
we expect a certain candidate
00:04:50.230 --> 00:04:54.409
to get more than the
required electoral votes.
00:04:54.409 --> 00:04:56.810
So what is next?
00:04:56.810 --> 00:04:59.190
As we just discussed,
the complete solution
00:04:59.190 --> 00:05:01.950
is in terms of a
posterior distribution,
00:05:01.950 --> 00:05:05.480
but sometimes, you may want
to summarize this posterior
00:05:05.480 --> 00:05:09.810
distribution in a single
number or a single estimate,
00:05:09.810 --> 00:05:11.930
and this could be
a further stage
00:05:11.930 --> 00:05:14.670
of processing of the results.
00:05:14.670 --> 00:05:18.170
So let us talk about this.
00:05:18.170 --> 00:05:21.780
Once you have in your hands
the posterior distribution
00:05:21.780 --> 00:05:26.260
of Theta, either in a discrete
or in a continuous setting,
00:05:26.260 --> 00:05:30.280
and if you're asked to provide
a single guess about what
00:05:30.280 --> 00:05:33.430
Theta is, how might you proceed?
00:05:33.430 --> 00:05:36.480
In the discrete case, you
could argue as follows.
00:05:36.480 --> 00:05:40.210
These values of Theta all
have some chance of occurring.
00:05:40.210 --> 00:05:43.540
This value of Theta is the
one which is the most likely,
00:05:43.540 --> 00:05:46.420
so I'm going to
report this value
00:05:46.420 --> 00:05:49.760
as my best guess
of what Theta is.
00:05:49.760 --> 00:05:51.760
And using a similar
philosophy, you
00:05:51.760 --> 00:05:53.690
could look at the
continuous case
00:05:53.690 --> 00:05:59.000
and find the value of Theta
at which the PDF is largest
00:05:59.000 --> 00:06:02.350
and report that
particular value.
00:06:02.350 --> 00:06:05.600
This particular way
of estimating Theta
00:06:05.600 --> 00:06:09.530
is called the maximum a
posteriori probability rule.
00:06:09.530 --> 00:06:13.460
We already have in our hands
the specific value of X,
00:06:13.460 --> 00:06:15.130
and therefore, we
have determined
00:06:15.130 --> 00:06:17.920
the conditional
distribution for Theta.
00:06:17.920 --> 00:06:21.230
What we then do is to
find the value of theta
00:06:21.230 --> 00:06:27.200
that maximizes over all possible
thetas the conditional PMF
00:06:27.200 --> 00:06:30.090
of this random
variables capital Theta.
00:06:30.090 --> 00:06:31.950
And similarly in
the continuous case,
00:06:31.950 --> 00:06:37.380
the value of theta that
maximizes the conditional PDF
00:06:37.380 --> 00:06:39.400
of the random variable Theta.
00:06:39.400 --> 00:06:42.360
This is one way of coming
up with an estimate.
00:06:42.360 --> 00:06:44.360
One can think of other ways.
00:06:44.360 --> 00:06:49.180
For example, I might want
to report instead, the mean
00:06:49.180 --> 00:06:51.890
of the conditional distribution,
which in this diagram
00:06:51.890 --> 00:06:54.920
might be somewhere here,
and in this picture,
00:06:54.920 --> 00:06:57.380
it might be somewhere here.
00:06:57.380 --> 00:07:02.250
This way of estimating theta
is the conditional expectation
00:07:02.250 --> 00:07:03.320
estimator.
00:07:03.320 --> 00:07:06.500
It just reports the value of
the conditional expectation,
00:07:06.500 --> 00:07:09.410
the mean of this
conditional distribution.
00:07:09.410 --> 00:07:13.190
It is called the least
mean squares estimator,
00:07:13.190 --> 00:07:16.940
because it has a certain
useful and important property.
00:07:16.940 --> 00:07:19.290
It is the estimator
that gives you
00:07:19.290 --> 00:07:21.710
the smallest mean squared error.
00:07:21.710 --> 00:07:24.120
We will discuss this
particular issue
00:07:24.120 --> 00:07:27.910
in much more depth
a little later.
00:07:27.910 --> 00:07:31.610
Now, let me make two
comments about terminology.
00:07:31.610 --> 00:07:35.540
What we have produced
here is an estimate.
00:07:35.540 --> 00:07:39.700
I gave you the conditional
PDF or conditional PMF,
00:07:39.700 --> 00:07:41.710
and you tell me a number.
00:07:41.710 --> 00:07:45.730
This number, the
estimate, is obtained
00:07:45.730 --> 00:07:49.960
by starting with the data, doing
some processing to the data,
00:07:49.960 --> 00:07:54.880
and eventually, coming up
with a numerical value.
00:07:54.880 --> 00:07:58.940
Now, g is the way that
we process the data.
00:07:58.940 --> 00:08:00.670
It's a certain rule.
00:08:00.670 --> 00:08:03.570
Now, if we know the
value of the data,
00:08:03.570 --> 00:08:06.290
we know what the
estimate is going to be.
00:08:06.290 --> 00:08:09.580
But if I do not tell you
the value of the data
00:08:09.580 --> 00:08:12.580
and you look at the
situation more abstractly,
00:08:12.580 --> 00:08:14.700
then the only thing
you can tell me
00:08:14.700 --> 00:08:17.900
is that I will be seeing
a random variable,
00:08:17.900 --> 00:08:21.480
capital X, I will do
some processing to it,
00:08:21.480 --> 00:08:25.070
and then I will obtain
a certain quantity.
00:08:25.070 --> 00:08:29.370
Because capital X is random,
the quantity that I will obtain
00:08:29.370 --> 00:08:30.800
will also be random.
00:08:30.800 --> 00:08:32.679
It's a random variable.
00:08:32.679 --> 00:08:36.490
This random variable,
capital Theta hat,
00:08:36.490 --> 00:08:38.750
we call it an estimator.
00:08:38.750 --> 00:08:41.610
Sometimes, we might also
use the term estimator
00:08:41.610 --> 00:08:43.429
to [refer to] the
function g, which
00:08:43.429 --> 00:08:46.390
is the way that we
process the data.
00:08:46.390 --> 00:08:49.810
In any case, it is important to
keep this distinction in mind.
00:08:49.810 --> 00:08:54.680
The estimator is the rule that
we use to process the data,
00:08:54.680 --> 00:08:58.810
and it is equivalent to a
certain random variable.
00:08:58.810 --> 00:09:02.110
An estimate is the
specific numerical value
00:09:02.110 --> 00:09:08.410
that we get when the data take
a specific numerical value.
00:09:08.410 --> 00:09:11.890
So if little x is the
numerical value of capital X,
00:09:11.890 --> 00:09:16.730
in that case, little theta
hat is the numerical value
00:09:16.730 --> 00:09:22.010
of the estimator
capital Theta hat.
00:09:22.010 --> 00:09:26.750
So at this point, we have a
complete conceptual framework.
00:09:26.750 --> 00:09:29.440
We know, abstractly
speaking, what
00:09:29.440 --> 00:09:33.610
it takes to calculate
conditional distributions,
00:09:33.610 --> 00:09:37.520
and we have two specific
estimators at hand.
00:09:37.520 --> 00:09:39.870
All that's left
for us to do now is
00:09:39.870 --> 00:09:43.780
to consider various examples
in which we can discuss what
00:09:43.780 --> 00:09:46.940
it takes to go through
these various steps.