WEBVTT

00:00:00.580 --> 00:00:02.640
Let us now discuss
in some more detail

00:00:02.640 --> 00:00:05.900
what it takes to carry
out Bayesian inference,

00:00:05.900 --> 00:00:08.630
when both random
variables are discrete.

00:00:08.630 --> 00:00:12.120
The unknown parameter,
Theta, is a random variable

00:00:12.120 --> 00:00:14.980
that takes values
in a discrete set.

00:00:14.980 --> 00:00:19.340
And we can think of these values
as alternative hypotheses.

00:00:19.340 --> 00:00:22.400
In this case, we know
how to do inference.

00:00:22.400 --> 00:00:24.760
We have in our
hands the Bayes rule

00:00:24.760 --> 00:00:27.110
and we have seen
plenty of examples.

00:00:27.110 --> 00:00:30.530
So instead of going through
one more example in detail,

00:00:30.530 --> 00:00:34.240
let us assume that we have a
model, that we have observed

00:00:34.240 --> 00:00:38.470
the value of X, and that
we have already determined

00:00:38.470 --> 00:00:42.100
the conditional PMF of
the random variable Theta.

00:00:42.100 --> 00:00:44.860
As a concrete example,
suppose that Theta

00:00:44.860 --> 00:00:47.460
can take values 1, 2, or 3.

00:00:47.460 --> 00:00:49.310
We have obtained
our observation,

00:00:49.310 --> 00:00:52.890
and the conditional
PMF takes this form.

00:00:52.890 --> 00:00:55.280
We could stop at
this point or we

00:00:55.280 --> 00:01:00.090
could continue by asking for
a specific estimate of Theta--

00:01:00.090 --> 00:01:03.550
our best guess as
to what Theta is.

00:01:03.550 --> 00:01:05.440
One way of coming
up with an estimate

00:01:05.440 --> 00:01:08.750
is to use the maximum a
posteriori probability

00:01:08.750 --> 00:01:11.560
rule, which looks for
that value of Theta that

00:01:11.560 --> 00:01:15.530
has the largest posterior,
or conditional, probability.

00:01:15.530 --> 00:01:17.880
In this example,
it is this value,

00:01:17.880 --> 00:01:21.480
so our estimate is
going to be equal to 2.
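
As a quick sketch in code, here is the MAP rule applied to this example. The values 0.6 (for Theta = 2) and 0.3 (for Theta = 3) are stated in the lecture; the value 0.1 for Theta = 1 is implied, since the conditional probabilities must sum to 1 (and it is consistent with the conditional mean of 2.2 computed below).

```python
# Conditional (posterior) PMF of Theta given the observed value of X.
# 0.6 and 0.3 are from the example; 0.1 is implied by normalization.
posterior = {1: 0.1, 2: 0.6, 3: 0.3}

# MAP rule: pick the value of Theta with the largest posterior probability.
theta_map = max(posterior, key=posterior.get)
print(theta_map)  # 2
```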

00:01:21.480 --> 00:01:24.890
An alternative way of
coming up with an estimate

00:01:24.890 --> 00:01:28.340
could be the LMS
rule, which calculates

00:01:28.340 --> 00:01:31.980
an estimate equal to the
conditional expectation

00:01:31.980 --> 00:01:35.520
of the unknown parameter,
given the observation that we

00:01:35.520 --> 00:01:36.650
have made.

00:01:36.650 --> 00:01:39.670
This is just the mean of this
conditional distribution.

00:01:39.670 --> 00:01:43.080
In this example, it would
fall somewhere around here,

00:01:43.080 --> 00:01:48.820
and the numerical value, as
you can check, is equal to 2.2.
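
Continuing the sketch, the LMS estimate is the mean of the same conditional PMF:

```python
# Conditional (posterior) PMF of Theta from the example, as above.
posterior = {1: 0.1, 2: 0.6, 3: 0.3}

# LMS rule: the conditional expectation of Theta given the observation,
# i.e., the mean of the conditional distribution.
theta_lms = sum(theta * p for theta, p in posterior.items())
print(theta_lms)  # 1*0.1 + 2*0.6 + 3*0.3 = 2.2
```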

00:01:48.820 --> 00:01:55.090
Next, we may be interested in
how good a certain estimate is.

00:01:55.090 --> 00:01:58.270
And for the case where we
interpret the values of Theta

00:01:58.270 --> 00:02:01.510
as hypotheses, a
relevant criterion

00:02:01.510 --> 00:02:04.080
is the probability of error.

00:02:04.080 --> 00:02:06.560
In this case, because
we already have

00:02:06.560 --> 00:02:09.610
some data available
in our hands and we're

00:02:09.610 --> 00:02:12.270
called to make an estimate,
what we care about

00:02:12.270 --> 00:02:15.435
is the conditional probability,
given the information

00:02:15.435 --> 00:02:18.450
that we have, that
we're making an error.

00:02:18.450 --> 00:02:20.950
Making an error
means the following.

00:02:20.950 --> 00:02:23.410
We have the observation,
the value of the estimate

00:02:23.410 --> 00:02:26.270
has been determined,
it is now a number,

00:02:26.270 --> 00:02:30.220
and that's why we write it
with a lowercase theta hat.

00:02:30.220 --> 00:02:34.059
But the parameter
is still unknown.

00:02:34.059 --> 00:02:35.100
We don't know what it is.

00:02:35.100 --> 00:02:37.022
It is described by
this distribution.

00:02:37.022 --> 00:02:38.480
And there's a
probability that it's

00:02:38.480 --> 00:02:41.170
going to be different
from our estimate.

00:02:41.170 --> 00:02:43.340
What is this probability?

00:02:43.340 --> 00:02:46.220
It depends on how we
construct the estimates.

00:02:46.220 --> 00:02:48.540
If, in this example,
we use the MAP rule

00:02:48.540 --> 00:02:52.295
and we make an
estimate of 2, there

00:02:52.295 --> 00:02:55.340
is probability 0.6 that
the true value of Theta

00:02:55.340 --> 00:02:59.120
is also equal to
2, and we are fine.

00:02:59.120 --> 00:03:02.830
But there's a remaining
probability of 0.4

00:03:02.830 --> 00:03:06.340
that the true value of Theta
is different from our estimate.

00:03:06.340 --> 00:03:11.010
So there's probability 0.4
of having made a mistake.

00:03:11.010 --> 00:03:13.010
If, instead of an
estimate equal to 2,

00:03:13.010 --> 00:03:16.850
we had chosen an
estimate equal to 3,

00:03:16.850 --> 00:03:21.190
then the true parameter would
be equal to our estimate

00:03:21.190 --> 00:03:25.210
with probability 0.3, but
we would have made an error

00:03:25.210 --> 00:03:28.310
with probability
0.7, which would

00:03:28.310 --> 00:03:31.260
be a bigger
probability of error.

00:03:31.260 --> 00:03:33.620
More generally, the
probability of error

00:03:33.620 --> 00:03:35.720
of a particular
estimate is the sum

00:03:35.720 --> 00:03:39.540
of the probabilities of
the other values of Theta.

00:03:39.540 --> 00:03:44.180
And if we want to keep the
probability of error small,

00:03:44.180 --> 00:03:47.200
we want to keep the sum
of the probabilities

00:03:47.200 --> 00:03:50.230
of the other values
small, which means

00:03:50.230 --> 00:03:54.530
we want to pick an estimate
whose own posterior probability is

00:03:54.530 --> 00:03:55.550
large.

00:03:55.550 --> 00:03:59.100
And so by that argument,
we see that the way

00:03:59.100 --> 00:04:02.540
to achieve the smallest
possible probability of error

00:04:02.540 --> 00:04:04.540
is to employ the MAP rule.

00:04:04.540 --> 00:04:09.450
This is a very important
property of the MAP rule.

00:04:09.450 --> 00:04:11.480
Now, this is the
conditional probability

00:04:11.480 --> 00:04:15.650
of error, given that we
already have data in our hands.

00:04:15.650 --> 00:04:20.890
But more generally, we may want
to compare estimators or talk

00:04:20.890 --> 00:04:25.250
about their performance in terms
of their overall probability

00:04:25.250 --> 00:04:26.320
of error.

00:04:26.320 --> 00:04:29.080
We're designing a
decision-making system

00:04:29.080 --> 00:04:32.430
that's going to process
data and make decisions.

00:04:32.430 --> 00:04:35.270
In order to say how
good our system is,

00:04:35.270 --> 00:04:38.909
we want to say that overall,
whenever you use the system,

00:04:38.909 --> 00:04:41.790
there's going to be
some random parameter,

00:04:41.790 --> 00:04:45.090
there's going to be some
value of the estimate.

00:04:45.090 --> 00:04:49.300
And we want to know what's the
probability that these two will

00:04:49.300 --> 00:04:51.340
be different.

00:04:51.340 --> 00:04:54.750
We can calculate this
overall probability of error

00:04:54.750 --> 00:04:58.780
by using the total
probability theorem

00:04:58.780 --> 00:05:01.900
and the conditional
probabilities of error, as

00:05:01.900 --> 00:05:03.010
follows.

00:05:03.010 --> 00:05:08.380
We condition on the value of
X. For any possible value of X,

00:05:08.380 --> 00:05:10.710
we have a conditional
probability of error.

00:05:10.710 --> 00:05:13.220
And then we take
a weighted average

00:05:13.220 --> 00:05:16.190
of these conditional
probabilities of error.

00:05:16.190 --> 00:05:19.650
There's also an alternative way
of using the total probability

00:05:19.650 --> 00:05:22.470
theorem, which would be to
first condition on Theta

00:05:22.470 --> 00:05:26.010
and calculate the conditional
probability of error

00:05:26.010 --> 00:05:30.160
for a given choice of
this unknown parameter.

00:05:30.160 --> 00:05:34.000
And both of these
formulas can be used.
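
As a sketch of the two calculations, consider a small hypothetical joint PMF of X and Theta (the numbers below are made up for illustration; they are not from the lecture). Conditioning on X and conditioning on Theta give the same overall probability of error, P(error) = sum over x of p_X(x) P(error | X = x) = sum over theta of p_Theta(theta) P(error | Theta = theta):

```python
# Hypothetical joint PMF of (X, Theta) -- illustrative numbers only.
joint = {(1, 1): 0.20, (1, 2): 0.10,
         (2, 1): 0.15, (2, 2): 0.55}
xs = sorted({x for x, _ in joint})
thetas = sorted({t for _, t in joint})

# Marginal PMFs of X and Theta.
p_X = {x: sum(joint[(x, t)] for t in thetas) for x in xs}
p_Theta = {t: sum(joint[(x, t)] for x in xs) for t in thetas}

# MAP estimate for a given observation x (maximizing the joint PMF over
# theta is the same as maximizing the posterior, since p_X(x) is fixed).
def map_estimate(x):
    return max(thetas, key=lambda t: joint[(x, t)])

# Total probability theorem, conditioning on X.
err_via_x = sum(
    p_X[x] * sum(joint[(x, t)] / p_X[x] for t in thetas if t != map_estimate(x))
    for x in xs)

# Total probability theorem, conditioning on Theta.
err_via_theta = sum(
    p_Theta[t] * sum(joint[(x, t)] / p_Theta[t] for x in xs if map_estimate(x) != t)
    for t in thetas)

print(err_via_x, err_via_theta)  # both 0.25, up to float rounding
```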

00:05:34.000 --> 00:05:36.650
Which one of the two
is more convenient

00:05:36.650 --> 00:05:41.640
really depends on the
specifics of the problem.

00:05:41.640 --> 00:05:45.290
Finally, I would like to make
an important observation.

00:05:45.290 --> 00:05:50.260
We argued that for
any particular choice

00:05:50.260 --> 00:05:55.390
of an observation, the MAP
rule achieves the smallest

00:05:55.390 --> 00:05:58.800
possible probability of error.

00:05:58.800 --> 00:06:02.920
So under the MAP rule,
this term is as small

00:06:02.920 --> 00:06:07.270
as possible for any given
value of the random variable,

00:06:07.270 --> 00:06:10.530
capital X.

00:06:10.530 --> 00:06:15.290
Since each term of this
sum is as small as possible

00:06:15.290 --> 00:06:19.480
under the MAP rule, it means
that the overall sum will also

00:06:19.480 --> 00:06:21.950
be as small as possible.

00:06:21.950 --> 00:06:27.750
And this means that the
overall probability of error

00:06:27.750 --> 00:06:31.110
is also smallest
under the MAP rule.
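
A brute-force check of this optimality claim, reusing the hypothetical joint PMF from the previous sketch: enumerate every possible decision rule (every function from observations to parameter values) and confirm that none achieves a smaller overall probability of error than the MAP rule.

```python
from itertools import product

# Same hypothetical joint PMF of (X, Theta) as in the previous sketch.
joint = {(1, 1): 0.20, (1, 2): 0.10,
         (2, 1): 0.15, (2, 2): 0.55}
xs = sorted({x for x, _ in joint})
thetas = sorted({t for _, t in joint})

# Overall probability of error of a decision rule, represented as a
# dict mapping each observation x to a guess for Theta.
def overall_error(rule):
    return sum(p for (x, t), p in joint.items() if rule[x] != t)

# The MAP rule: for each x, the value of Theta with the largest
# posterior (equivalently, joint) probability.
map_rule = {x: max(thetas, key=lambda t: joint[(x, t)]) for x in xs}

# Enumerate all decision rules and compare.
all_errors = [overall_error(dict(zip(xs, guesses)))
              for guesses in product(thetas, repeat=len(xs))]
print(overall_error(map_rule) == min(all_errors))  # True
```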

00:06:31.110 --> 00:06:34.730
In this sense, the MAP
rule is the optimum way

00:06:34.730 --> 00:06:40.670
of coming up with estimates in
the hypothesis-testing context,

00:06:40.670 --> 00:06:44.943
where we want to minimize
the probability of error.