WEBVTT
00:00:00.680 --> 00:00:02.550
We will now continue
with the problem
00:00:02.550 --> 00:00:06.300
of inferring the unknown
bias of a certain coin
00:00:06.300 --> 00:00:09.130
for which we have a
certain prior distribution
00:00:09.130 --> 00:00:11.980
and of which we observe
the number of heads
00:00:11.980 --> 00:00:15.130
in n independent coin tosses.
00:00:15.130 --> 00:00:19.210
We have already seen that if
we assume a uniform prior,
00:00:19.210 --> 00:00:21.810
the posterior takes this
particular form, which
00:00:21.810 --> 00:00:24.830
comes from the family
of Beta distributions.
00:00:24.830 --> 00:00:28.010
What we want to do now
is to actually derive
00:00:28.010 --> 00:00:29.830
point estimates.
00:00:29.830 --> 00:00:33.620
That is, instead of just
providing the posterior,
00:00:33.620 --> 00:00:36.120
we would like to select
a specific estimate
00:00:36.120 --> 00:00:38.310
for the unknown bias.
00:00:38.310 --> 00:00:42.040
Let us look at the maximum
a posteriori probability
00:00:42.040 --> 00:00:43.380
estimate.
00:00:43.380 --> 00:00:45.030
How can we find it?
00:00:45.030 --> 00:00:50.340
By definition, the MAP
estimate is that value of theta
00:00:50.340 --> 00:00:53.750
that maximizes the
posterior, the value of theta
00:00:53.750 --> 00:00:56.790
at which the
posterior is largest.
00:00:56.790 --> 00:00:59.800
Now, instead of
maximizing the posterior,
00:00:59.800 --> 00:01:02.070
it is more convenient
in this example
00:01:02.070 --> 00:01:06.030
to maximize the logarithm
of the posterior.
00:01:06.030 --> 00:01:11.860
And the logarithm is
k times log theta,
00:01:11.860 --> 00:01:19.160
plus n minus k times the
log of 1 minus theta.
00:01:19.160 --> 00:01:23.100
To carry out the
maximization over theta,
00:01:23.100 --> 00:01:26.530
we form the derivative
with respect to theta
00:01:26.530 --> 00:01:28.970
and set that derivative to 0.
00:01:28.970 --> 00:01:33.300
So the derivative of the
first term is k over theta.
00:01:33.300 --> 00:01:35.550
And the derivative
of the second term
00:01:35.550 --> 00:01:42.550
is n minus k over this
quantity, 1 minus theta.
00:01:42.550 --> 00:01:47.520
But because of the minus here
when we apply the chain rule,
00:01:47.520 --> 00:01:53.130
actually, this plus sign here
is going to become a minus sign.
00:01:53.130 --> 00:01:56.580
And now we set this
derivative to 0.
00:01:56.580 --> 00:02:00.340
We carry out the algebra,
which is rather simple.
00:02:00.340 --> 00:02:03.460
And the end result
that you will find
00:02:03.460 --> 00:02:08.728
is that the estimate
is equal to k over n.
00:02:08.728 --> 00:02:12.080
Notice that this is lowercase k.
00:02:12.080 --> 00:02:15.200
We are told the
specific value of heads
00:02:15.200 --> 00:02:17.170
that has been observed.
00:02:17.170 --> 00:02:21.140
So little k is a number, and
our estimate, accordingly,
00:02:21.140 --> 00:02:23.360
is a number.
00:02:23.360 --> 00:02:26.370
This answer makes perfect sense.
00:02:26.370 --> 00:02:28.980
A very reasonable
way of estimating
00:02:28.980 --> 00:02:32.110
the probability of
heads of a certain coin
00:02:32.110 --> 00:02:34.260
is to look at the
number of heads
00:02:34.260 --> 00:02:38.270
obtained and divide by the
total number of trials.
00:02:38.270 --> 00:02:40.640
So we see that the
MAP estimate turns out
00:02:40.640 --> 00:02:43.570
to be a quite natural one.
00:02:43.570 --> 00:02:46.920
How about the
corresponding estimator?
00:02:46.920 --> 00:02:49.390
Recall the distinction
that the estimator
00:02:49.390 --> 00:02:52.760
is a random variable that tells
us what the estimate is going
00:02:52.760 --> 00:02:57.810
to be as a function of
the random variable that
00:02:57.810 --> 00:03:00.630
is going to be observed.
00:03:00.630 --> 00:03:07.690
The estimator is uppercase
K divided by little n.
00:03:07.690 --> 00:03:11.610
So it is a random
variable whose value
00:03:11.610 --> 00:03:15.130
is determined by the value of
the random variable capital
00:03:15.130 --> 00:03:17.620
K. If the random variable
capital K happens
00:03:17.620 --> 00:03:20.730
to take on this specific
value, little k, then
00:03:20.730 --> 00:03:23.329
our estimator, this
random variable,
00:03:23.329 --> 00:03:28.690
will be taking this specific
value, which is the estimate.
00:03:28.690 --> 00:03:31.390
And let us now compare
with an alternative way
00:03:31.390 --> 00:03:32.970
of estimating Theta.
00:03:32.970 --> 00:03:35.270
We will consider
estimating Theta
00:03:35.270 --> 00:03:39.460
by forming the conditional
expectation of Theta, given
00:03:39.460 --> 00:03:43.470
the specific number of
heads that we have observed.
00:03:43.470 --> 00:03:49.110
This is what we call the LMS
or least mean squares estimate.
00:03:49.110 --> 00:03:51.620
To calculate this
conditional expectation,
00:03:51.620 --> 00:03:58.650
all that we need to do is to
form the integral of theta
00:03:58.650 --> 00:04:02.390
times the density of Theta.
00:04:02.390 --> 00:04:04.330
But since it's a
conditional expectation,
00:04:04.330 --> 00:04:07.690
we need to take the
conditional density of Theta.
00:04:11.240 --> 00:04:14.200
And the integral
ranges from 0 to 1,
00:04:14.200 --> 00:04:17.089
because this is the range of
our random variable, Theta.
00:04:19.880 --> 00:04:21.200
Now, what is this?
00:04:21.200 --> 00:04:24.380
We have a formula for
the posterior density.
00:04:24.380 --> 00:04:28.160
So we need to just multiply
this expression here by theta,
00:04:28.160 --> 00:04:29.970
and then integrate.
00:04:29.970 --> 00:04:32.440
This term here is a constant.
00:04:32.440 --> 00:04:35.150
So it can be pulled
outside the integral.
00:04:40.030 --> 00:04:47.260
And inside the integral, we
are left with this term times
00:04:47.260 --> 00:04:54.520
theta, which changes the
exponent of theta to k plus 1.
00:04:54.520 --> 00:04:59.860
Then we have 1 minus theta to
the power n minus k, d theta.
00:05:02.630 --> 00:05:06.540
At this point, we need
to do some calculations.
00:05:06.540 --> 00:05:09.020
What is d of n, k?
00:05:09.020 --> 00:05:13.860
d of n, k is the normalizing
constant of this PDF.
00:05:13.860 --> 00:05:18.960
For this to be a PDF and to
integrate to 1, d of n, k
00:05:18.960 --> 00:05:25.310
has to be equal to the integral
of this expression from 0 to 1.
00:05:25.310 --> 00:05:29.490
So we need to somehow be able
to evaluate this integral.
00:05:29.490 --> 00:05:35.300
Here, we will be helped by the
following very nice formula.
00:05:35.300 --> 00:05:37.790
This formula tells
us that the integral
00:05:37.790 --> 00:05:42.710
of for such a function
of theta from 0 to 1
00:05:42.710 --> 00:05:47.730
is equal to this very nice
and simple expression.
00:05:47.730 --> 00:05:49.670
Of course, this
formula is only valid
00:05:49.670 --> 00:05:52.250
when these factorials
make sense.
00:05:52.250 --> 00:05:54.860
So we assume that
alpha is non-negative
00:05:54.860 --> 00:05:58.070
and theta is non-negative.
00:05:58.070 --> 00:05:59.960
How is this formula derived?
00:05:59.960 --> 00:06:05.140
There's various algebraic or
calculus style derivations.
00:06:05.140 --> 00:06:07.900
One possibility is to
use integration by parts.
00:06:07.900 --> 00:06:10.830
And there are also other
tricks for deriving it.
00:06:10.830 --> 00:06:13.930
It turns out that there
is also a very clever
00:06:13.930 --> 00:06:16.780
probabilistic
proof of this fact.
00:06:16.780 --> 00:06:19.380
But in any case, we
will not derive it.
00:06:19.380 --> 00:06:24.120
We will just take it as a fact
that comes to us from calculus.
00:06:24.120 --> 00:06:28.440
And now, let us
apply this formula.
00:06:28.440 --> 00:06:33.770
d of n, k is equal to the
integral of this expression,
00:06:33.770 --> 00:06:38.520
which is of this form, with
alpha equal to k and beta
00:06:38.520 --> 00:06:41.390
equal to n minus k.
00:06:41.390 --> 00:06:53.625
So d of n, k takes the form
alpha is k, beta is n minus k.
00:06:57.960 --> 00:07:00.760
And then in the
denominator, we have
00:07:00.760 --> 00:07:03.630
the sum of the two
indices plus 1.
00:07:03.630 --> 00:07:05.560
So it's going to be
k plus n minus k.
00:07:05.560 --> 00:07:07.030
That gives us n.
00:07:07.030 --> 00:07:08.520
And then there's a plus 1.
00:07:12.530 --> 00:07:15.450
And how about this integral?
00:07:15.450 --> 00:07:20.540
Well, this integral is also of
the form that we have up here.
00:07:20.540 --> 00:07:30.530
But now, we have alpha
equal to k plus 1,
00:07:30.530 --> 00:07:33.460
beta is n minus k.
00:07:37.940 --> 00:07:42.710
And in the denominator, we have
the sum of the indices plus 1.
00:07:42.710 --> 00:07:45.880
So when we add these
indices, we get n plus 1.
00:07:45.880 --> 00:07:48.880
And then we get
another factor of 1,
00:07:48.880 --> 00:07:51.280
which gives us an n plus 2.
00:07:55.510 --> 00:07:56.960
This looks formidable.
00:07:56.960 --> 00:08:00.590
But actually, there's a
lot of simplifications.
00:08:00.590 --> 00:08:05.000
This term here cancels
with that term.
00:08:05.000 --> 00:08:10.160
k plus 1 factorial divided
by k factorial, what is it?
00:08:10.160 --> 00:08:14.450
It is just a factor of k plus 1.
00:08:18.030 --> 00:08:20.110
And what do we have here?
00:08:20.110 --> 00:08:23.380
This term is in the
denominator of the denominator.
00:08:23.380 --> 00:08:26.430
So it can be moved
up to the numerator.
00:08:26.430 --> 00:08:30.870
We have n plus 1 factorial
divided by n plus 2 factorial.
00:08:30.870 --> 00:08:33.250
This is just n plus 2.
00:08:36.039 --> 00:08:39.640
And this is the final
form of the answer.
00:08:39.640 --> 00:08:44.810
This is what the conditional
expectation of theta is.
00:08:44.810 --> 00:08:48.570
So now, we can compare the
two estimates that we have,
00:08:48.570 --> 00:08:51.840
the MAP estimate and the
conditional expectation
00:08:51.840 --> 00:08:53.230
estimate.
00:08:53.230 --> 00:08:57.120
They're fairly similar,
but not exactly the same.
00:08:57.120 --> 00:09:01.640
This means that the mean
of a Beta distribution
00:09:01.640 --> 00:09:05.550
is not the same as the point
at which the distribution is
00:09:05.550 --> 00:09:06.990
highest.
00:09:06.990 --> 00:09:10.830
On the other hand, if n
is a very large number,
00:09:10.830 --> 00:09:17.450
this expression is going to be
approximately equal to k over n
00:09:17.450 --> 00:09:18.850
when n is large.
00:09:18.850 --> 00:09:24.730
And so in the limit of
large n, the two estimators
00:09:24.730 --> 00:09:28.504
will not be very
different from each other.