WEBVTT

00:00:00.680 --> 00:00:02.550
We will now continue
with the problem

00:00:02.550 --> 00:00:06.300
of inferring the unknown
bias of a certain coin

00:00:06.300 --> 00:00:09.130
for which we have a
certain prior distribution

00:00:09.130 --> 00:00:11.980
and of which we observe
the number of heads

00:00:11.980 --> 00:00:15.130
in n independent coin tosses.

00:00:15.130 --> 00:00:19.210
We have already seen that if
we assume a uniform prior,

00:00:19.210 --> 00:00:21.810
the posterior takes this
particular form, which

00:00:21.810 --> 00:00:24.830
comes from the family
of Beta distributions.

00:00:24.830 --> 00:00:28.010
What we want to do now
is to actually derive

00:00:28.010 --> 00:00:29.830
point estimates.

00:00:29.830 --> 00:00:33.620
That is, instead of just
providing the posterior,

00:00:33.620 --> 00:00:36.120
we would like to select
a specific estimate

00:00:36.120 --> 00:00:38.310
for the unknown bias.

00:00:38.310 --> 00:00:42.040
Let us look at the maximum
a posteriori probability

00:00:42.040 --> 00:00:43.380
estimate.

00:00:43.380 --> 00:00:45.030
How can we find it?

00:00:45.030 --> 00:00:50.340
By definition, the MAP
estimate is that value of theta

00:00:50.340 --> 00:00:53.750
that maximizes the
posterior, the value of theta

00:00:53.750 --> 00:00:56.790
at which the
posterior is largest.

00:00:56.790 --> 00:00:59.800
Now, instead of
maximizing the posterior,

00:00:59.800 --> 00:01:02.070
it is more convenient
in this example

00:01:02.070 --> 00:01:06.030
to maximize the logarithm
of the posterior.

00:01:06.030 --> 00:01:11.860
And the logarithm is
k times log theta,

00:01:11.860 --> 00:01:19.160
plus n minus k times the
log of 1 minus theta.

00:01:19.160 --> 00:01:23.100
To carry out the
maximization over theta,

00:01:23.100 --> 00:01:26.530
we form the derivative
with respect to theta

00:01:26.530 --> 00:01:28.970
and set that derivative to 0.

00:01:28.970 --> 00:01:33.300
So the derivative of the
first term is k over theta.

00:01:33.300 --> 00:01:35.550
And the derivative
of the second term

00:01:35.550 --> 00:01:42.550
is n minus k over this
quantity, 1 minus theta.

00:01:42.550 --> 00:01:47.520
But because of the minus here
when we apply the chain rule,

00:01:47.520 --> 00:01:53.130
actually, this plus sign here
is going to become a minus sign.

00:01:53.130 --> 00:01:56.580
And now we set this
derivative to 0.

00:01:56.580 --> 00:02:00.340
We carry out the algebra,
which is rather simple.

00:02:00.340 --> 00:02:03.460
And the end result
that you will find

00:02:03.460 --> 00:02:08.728
is that the estimate
is equal to k over n.

00:02:08.728 --> 00:02:12.080
Notice that this is lowercase k.

00:02:12.080 --> 00:02:15.200
We are told the
specific value of heads

00:02:15.200 --> 00:02:17.170
that has been observed.

00:02:17.170 --> 00:02:21.140
So little k is a number, and
our estimate, accordingly,

00:02:21.140 --> 00:02:23.360
is a number.

00:02:23.360 --> 00:02:26.370
This answer makes perfect sense.

00:02:26.370 --> 00:02:28.980
A very reasonable
way of estimating

00:02:28.980 --> 00:02:32.110
the probability of
heads of a certain coin

00:02:32.110 --> 00:02:34.260
is to look at the
number of heads

00:02:34.260 --> 00:02:38.270
obtained and divide by the
total number of trials.

00:02:38.270 --> 00:02:40.640
So we see that the
MAP estimate turns out

00:02:40.640 --> 00:02:43.570
to be a quite natural one.

00:02:43.570 --> 00:02:46.920
How about the
corresponding estimator?

00:02:46.920 --> 00:02:49.390
Recall the distinction
that the estimator

00:02:49.390 --> 00:02:52.760
is a random variable that tells
us what the estimate is going

00:02:52.760 --> 00:02:57.810
to be as a function of
the random variable that

00:02:57.810 --> 00:03:00.630
is going to be observed.

00:03:00.630 --> 00:03:07.690
The estimator is uppercase
K divided by little n.

00:03:07.690 --> 00:03:11.610
So it is a random
variable whose value

00:03:11.610 --> 00:03:15.130
is determined by the value of
the random variable capital

00:03:15.130 --> 00:03:17.620
K. If the random variable
capital K happens

00:03:17.620 --> 00:03:20.730
to take on this specific
value, little k, then

00:03:20.730 --> 00:03:23.329
our estimator, this
random variable,

00:03:23.329 --> 00:03:28.690
will be taking this specific
value, which is the estimate.

00:03:28.690 --> 00:03:31.390
And let us now compare
with an alternative way

00:03:31.390 --> 00:03:32.970
of estimating Theta.

00:03:32.970 --> 00:03:35.270
We will consider
estimating Theta

00:03:35.270 --> 00:03:39.460
by forming the conditional
expectation of Theta, given

00:03:39.460 --> 00:03:43.470
the specific number of
heads that we have observed.

00:03:43.470 --> 00:03:49.110
This is what we call the LMS
or least mean squares estimate.

00:03:49.110 --> 00:03:51.620
To calculate this
conditional expectation,

00:03:51.620 --> 00:03:58.650
all that we need to do is to
form the integral of theta

00:03:58.650 --> 00:04:02.390
times the density of Theta.

00:04:02.390 --> 00:04:04.330
But since it's a
conditional expectation,

00:04:04.330 --> 00:04:07.690
we need to take the
conditional density of Theta.

00:04:11.240 --> 00:04:14.200
And the integral
ranges from 0 to 1,

00:04:14.200 --> 00:04:17.089
because this is the range of
our random variable, Theta.

00:04:19.880 --> 00:04:21.200
Now, what is this?

00:04:21.200 --> 00:04:24.380
We have a formula for
the posterior density.

00:04:24.380 --> 00:04:28.160
So we need to just multiply
this expression here by theta,

00:04:28.160 --> 00:04:29.970
and then integrate.

00:04:29.970 --> 00:04:32.440
This term here is a constant.

00:04:32.440 --> 00:04:35.150
So it can be pulled
outside the integral.

00:04:40.030 --> 00:04:47.260
And inside the integral, we
are left with this term times

00:04:47.260 --> 00:04:54.520
theta, which changes the
exponent of theta to k plus 1.

00:04:54.520 --> 00:04:59.860
Then we have 1 minus theta to
the power n minus k, d theta.

00:05:02.630 --> 00:05:06.540
At this point, we need
to do some calculations.

00:05:06.540 --> 00:05:09.020
What is d of n, k?

00:05:09.020 --> 00:05:13.860
d of n, k is the normalizing
constant of this PDF.

00:05:13.860 --> 00:05:18.960
For this to be a PDF and to
integrate to 1, d of n, k

00:05:18.960 --> 00:05:25.310
has to be equal to the integral
of this expression from 0 to 1.

00:05:25.310 --> 00:05:29.490
So we need to somehow be able
to evaluate this integral.

00:05:29.490 --> 00:05:35.300
Here, we will be helped by the
following very nice formula.

00:05:35.300 --> 00:05:37.790
This formula tells
us that the integral

00:05:37.790 --> 00:05:42.710
of for such a function
of theta from 0 to 1

00:05:42.710 --> 00:05:47.730
is equal to this very nice
and simple expression.

00:05:47.730 --> 00:05:49.670
Of course, this
formula is only valid

00:05:49.670 --> 00:05:52.250
when these factorials
make sense.

00:05:52.250 --> 00:05:54.860
So we assume that
alpha is non-negative

00:05:54.860 --> 00:05:58.070
and theta is non-negative.

00:05:58.070 --> 00:05:59.960
How is this formula derived?

00:05:59.960 --> 00:06:05.140
There's various algebraic or
calculus style derivations.

00:06:05.140 --> 00:06:07.900
One possibility is to
use integration by parts.

00:06:07.900 --> 00:06:10.830
And there are also other
tricks for deriving it.

00:06:10.830 --> 00:06:13.930
It turns out that there
is also a very clever

00:06:13.930 --> 00:06:16.780
probabilistic
proof of this fact.

00:06:16.780 --> 00:06:19.380
But in any case, we
will not derive it.

00:06:19.380 --> 00:06:24.120
We will just take it as a fact
that comes to us from calculus.

00:06:24.120 --> 00:06:28.440
And now, let us
apply this formula.

00:06:28.440 --> 00:06:33.770
d of n, k is equal to the
integral of this expression,

00:06:33.770 --> 00:06:38.520
which is of this form, with
alpha equal to k and beta

00:06:38.520 --> 00:06:41.390
equal to n minus k.

00:06:41.390 --> 00:06:53.625
So d of n, k takes the form
alpha is k, beta is n minus k.

00:06:57.960 --> 00:07:00.760
And then in the
denominator, we have

00:07:00.760 --> 00:07:03.630
the sum of the two
indices plus 1.

00:07:03.630 --> 00:07:05.560
So it's going to be
k plus n minus k.

00:07:05.560 --> 00:07:07.030
That gives us n.

00:07:07.030 --> 00:07:08.520
And then there's a plus 1.

00:07:12.530 --> 00:07:15.450
And how about this integral?

00:07:15.450 --> 00:07:20.540
Well, this integral is also of
the form that we have up here.

00:07:20.540 --> 00:07:30.530
But now, we have alpha
equal to k plus 1,

00:07:30.530 --> 00:07:33.460
beta is n minus k.

00:07:37.940 --> 00:07:42.710
And in the denominator, we have
the sum of the indices plus 1.

00:07:42.710 --> 00:07:45.880
So when we add these
indices, we get n plus 1.

00:07:45.880 --> 00:07:48.880
And then we get
another factor of 1,

00:07:48.880 --> 00:07:51.280
which gives us an n plus 2.

00:07:55.510 --> 00:07:56.960
This looks formidable.

00:07:56.960 --> 00:08:00.590
But actually, there's a
lot of simplifications.

00:08:00.590 --> 00:08:05.000
This term here cancels
with that term.

00:08:05.000 --> 00:08:10.160
k plus 1 factorial divided
by k factorial, what is it?

00:08:10.160 --> 00:08:14.450
It is just a factor of k plus 1.

00:08:18.030 --> 00:08:20.110
And what do we have here?

00:08:20.110 --> 00:08:23.380
This term is in the
denominator of the denominator.

00:08:23.380 --> 00:08:26.430
So it can be moved
up to the numerator.

00:08:26.430 --> 00:08:30.870
We have n plus 1 factorial
divided by n plus 2 factorial.

00:08:30.870 --> 00:08:33.250
This is just n plus 2.

00:08:36.039 --> 00:08:39.640
And this is the final
form of the answer.

00:08:39.640 --> 00:08:44.810
This is what the conditional
expectation of theta is.

00:08:44.810 --> 00:08:48.570
So now, we can compare the
two estimates that we have,

00:08:48.570 --> 00:08:51.840
the MAP estimate and the
conditional expectation

00:08:51.840 --> 00:08:53.230
estimate.

00:08:53.230 --> 00:08:57.120
They're fairly similar,
but not exactly the same.

00:08:57.120 --> 00:09:01.640
This means that the mean
of a Beta distribution

00:09:01.640 --> 00:09:05.550
is not the same as the point
at which the distribution is

00:09:05.550 --> 00:09:06.990
highest.

00:09:06.990 --> 00:09:10.830
On the other hand, if n
is a very large number,

00:09:10.830 --> 00:09:17.450
this expression is going to be
approximately equal to k over n

00:09:17.450 --> 00:09:18.850
when n is large.

00:09:18.850 --> 00:09:24.730
And so in the limit of
large n, the two estimators

00:09:24.730 --> 00:09:28.504
will not be very
different from each other.