WEBVTT

00:00:01.550 --> 00:00:04.170
In this segment, we will
go through two examples

00:00:04.170 --> 00:00:06.760
of maximum likelihood
estimation,

00:00:06.760 --> 00:00:10.240
just in order to get a feel
for the procedure involved

00:00:10.240 --> 00:00:13.450
and the calculations that
one has to go through.

00:00:13.450 --> 00:00:16.129
Our first example
will be very simple.

00:00:16.129 --> 00:00:18.320
We have a binomial
random variable

00:00:18.320 --> 00:00:21.270
with parameters n and theta.

00:00:21.270 --> 00:00:25.450
So think of having a coin
that you flip n times,

00:00:25.450 --> 00:00:27.750
and theta is the
probability of heads

00:00:27.750 --> 00:00:29.680
at each one of the tosses.

00:00:29.680 --> 00:00:32.090
So we flip it n
times and we observe

00:00:32.090 --> 00:00:35.550
a certain numerical
value, little k

00:00:35.550 --> 00:00:38.350
for the random variable
K. And on the basis

00:00:38.350 --> 00:00:41.810
of that numerical value, we
would like to estimate theta.

00:00:41.810 --> 00:00:44.510
According to the maximum
likelihood methodology,

00:00:44.510 --> 00:00:48.250
the first step is to write
down the likelihood function.

00:00:48.250 --> 00:00:51.510
This is the probability of
obtaining this particular piece

00:00:51.510 --> 00:00:55.220
of data if the true
parameter is theta.

00:00:55.220 --> 00:00:57.980
Now, since K is a
binomial random variable,

00:00:57.980 --> 00:01:01.020
the probability of obtaining
k heads in n tosses

00:01:01.020 --> 00:01:03.830
is given by this
expression here.

00:01:03.830 --> 00:01:07.900
So what we need to do is to take
the data that we have observed,

00:01:07.900 --> 00:01:11.740
plug it in this formula,
leave theta free--

00:01:11.740 --> 00:01:14.420
we have here a
function of theta--

00:01:14.420 --> 00:01:18.260
and then maximize this function
of theta over all theta.

00:01:18.260 --> 00:01:21.020
Let us now do this calculation.

00:01:21.020 --> 00:01:24.220
Actually, instead of
maximizing this expression,

00:01:24.220 --> 00:01:27.340
it's a little easier to
maximize the logarithm

00:01:27.340 --> 00:01:28.960
of this expression.

00:01:28.960 --> 00:01:32.150
And the logarithm of this
expression is as follows.

00:01:32.150 --> 00:01:35.410
There's a first term, which
is the logarithm of the n

00:01:35.410 --> 00:01:36.870
choose k term.

00:01:36.870 --> 00:01:42.890
Then, the logarithm of theta
to the k is k times log theta.

00:01:42.890 --> 00:01:45.660
And finally, the
logarithm of the last term

00:01:45.660 --> 00:01:52.130
is n minus k, log
of 1 minus theta.

00:01:52.130 --> 00:01:54.550
So we need to maximize this
expression with respect

00:01:54.550 --> 00:01:55.289
to theta.

00:01:55.289 --> 00:01:58.140
In order to do that, we take
the derivative with respect

00:01:58.140 --> 00:01:59.530
to theta.

00:01:59.530 --> 00:02:01.130
Here, there is no
theta involved.

00:02:01.130 --> 00:02:03.210
We get a contribution of 0.

00:02:03.210 --> 00:02:07.390
This term has a derivative
of k divided by theta.

00:02:07.390 --> 00:02:09.880
And this term here
has a derivative,

00:02:09.880 --> 00:02:14.020
which is n minus k
times the derivative

00:02:14.020 --> 00:02:16.920
of this logarithmic
term, which is

00:02:16.920 --> 00:02:19.480
1 over what is
inside the logarithm.

00:02:19.480 --> 00:02:22.100
But by the chain rule,
because of this minus sign

00:02:22.100 --> 00:02:27.900
here, we get also a minus sign,
and we obtain this expression.

00:02:27.900 --> 00:02:29.790
Now, at the maximum,
the derivative

00:02:29.790 --> 00:02:31.920
has to be equal to 0.

00:02:31.920 --> 00:02:34.770
And this gives us now
an equation for theta

00:02:34.770 --> 00:02:36.270
that we can solve.

00:02:36.270 --> 00:02:39.630
Let us take this term, move
it to the right-hand side,

00:02:39.630 --> 00:02:43.270
and then cross-multiply
with the denominators

00:02:43.270 --> 00:02:49.560
to obtain the relation
that k minus k theta-- this

00:02:49.560 --> 00:02:52.550
is obtained by multiplying this
k with this one minus theta

00:02:52.550 --> 00:02:57.920
factor-- has to be equal to
this term times theta, which

00:02:57.920 --> 00:03:02.160
is n times theta minus k theta.

00:03:02.160 --> 00:03:05.370
The k theta terms
cancel, and we're

00:03:05.370 --> 00:03:08.390
left with this
expression, which tells us

00:03:08.390 --> 00:03:11.680
that theta should be
equal to k over n.

00:03:11.680 --> 00:03:13.950
So this is the maximum
likelihood estimate

00:03:13.950 --> 00:03:15.450
for this particular
problem, which

00:03:15.450 --> 00:03:17.990
is a pretty reasonable answer.

00:03:17.990 --> 00:03:20.380
If you would like to
rephrase what we just

00:03:20.380 --> 00:03:24.040
found in terms of estimators
and random variables,

00:03:24.040 --> 00:03:28.660
the maximum likelihood
estimator is as follows.

00:03:28.660 --> 00:03:32.579
We take the random variable that
we observe, our observations,

00:03:32.579 --> 00:03:34.270
and divide it by n.

00:03:34.270 --> 00:03:36.630
And this is now a
random variable,

00:03:36.630 --> 00:03:39.720
which will be our estimator.

00:03:39.720 --> 00:03:42.160
Now, notice that in
this particular example,

00:03:42.160 --> 00:03:46.090
the answer that we got is
exactly the same as the answer

00:03:46.090 --> 00:03:49.360
that we got in the context
of Bayesian inference

00:03:49.360 --> 00:03:52.950
when we were finding the
maximum a posteriori probability

00:03:52.950 --> 00:03:55.810
estimator, but for
the special case

00:03:55.810 --> 00:04:00.160
where the prior was a
uniform distribution.

00:04:00.160 --> 00:04:04.400
So if we assume that theta
is actually a random variable

00:04:04.400 --> 00:04:08.240
but has a uniform distribution,
so that we have a flat prior,

00:04:08.240 --> 00:04:11.020
and we carry out maximum
a posteriori probability

00:04:11.020 --> 00:04:12.080
estimation.

00:04:12.080 --> 00:04:14.930
We do obtain exactly
the same estimate.

00:04:14.930 --> 00:04:16.740
And this is consistent
with the comments

00:04:16.740 --> 00:04:19.990
that we made earlier, that
maximum likelihood estimation

00:04:19.990 --> 00:04:25.630
can be interpreted also as MAP
estimation with a flat prior.

00:04:25.630 --> 00:04:27.970
Let us now move to our
second example, which

00:04:27.970 --> 00:04:30.670
will be a little
more complicated.

00:04:30.670 --> 00:04:34.830
Here, we have n random
variables that are independent,

00:04:34.830 --> 00:04:36.090
identically distributed.

00:04:36.090 --> 00:04:38.070
They all have a
normal distribution

00:04:38.070 --> 00:04:40.540
with a certain
mean and variance.

00:04:40.540 --> 00:04:43.600
But both the mean and
the variance are unknown,

00:04:43.600 --> 00:04:45.560
and we want to estimate
them on the basis

00:04:45.560 --> 00:04:47.570
of these observations.

00:04:47.570 --> 00:04:51.409
The first step is to write
down the likelihood function.

00:04:51.409 --> 00:04:53.960
That is the probability
density function

00:04:53.960 --> 00:04:59.380
for the vector of observations
given some set of parameters.

00:04:59.380 --> 00:05:02.650
Because of independence,
the joint distribution

00:05:02.650 --> 00:05:08.250
of the vector of X's that we
have obtained is the product

00:05:08.250 --> 00:05:13.510
of the PDFs of the
individual X's, of the Xi's.

00:05:13.510 --> 00:05:21.920
So the PDF of the typical Xi
that has variance v and mean mu

00:05:21.920 --> 00:05:23.770
is of this form.

00:05:23.770 --> 00:05:27.000
So this is the likelihood
function in this case.

00:05:27.000 --> 00:05:29.430
This is the probability
density of obtaining

00:05:29.430 --> 00:05:32.730
a particular vector
X of observations

00:05:32.730 --> 00:05:35.909
when we have these
particular parameters.

00:05:35.909 --> 00:05:40.980
We would like to
maximize this function.

00:05:40.980 --> 00:05:44.700
As in our previous example,
it is actually a little easier

00:05:44.700 --> 00:05:48.900
to maximize the logarithm
of this expression.

00:05:48.900 --> 00:05:50.880
And this is the
same as minimizing

00:05:50.880 --> 00:05:54.480
the negative of the
logarithm of this expression.

00:05:54.480 --> 00:05:57.180
Now, when we take the
logarithm of this expression,

00:05:57.180 --> 00:05:58.159
we have a product.

00:05:58.159 --> 00:06:01.710
So we're going to get
a sum of logarithms.

00:06:01.710 --> 00:06:06.350
And I leave it to you to verify
that the negative logarithm

00:06:06.350 --> 00:06:11.810
of this expression is of this
form plus some other constant

00:06:11.810 --> 00:06:13.930
that does not involve
the parameters,

00:06:13.930 --> 00:06:18.120
and which comes from this factor
of 1 over square root 2pi.

00:06:18.120 --> 00:06:21.820
In particular, this
term here appears

00:06:21.820 --> 00:06:24.220
when we take the
logarithm of this.

00:06:24.220 --> 00:06:28.410
And this happens n times because
we have a product of n terms.

00:06:28.410 --> 00:06:31.070
And this term here
appears when we

00:06:31.070 --> 00:06:33.640
take the logarithm
of this expression,

00:06:33.640 --> 00:06:36.550
and after we put
in the minus sign,

00:06:36.550 --> 00:06:38.080
because we're
actually considering

00:06:38.080 --> 00:06:40.409
the negative of the logarithm.

00:06:40.409 --> 00:06:43.100
Now, to carry out the
minimization, what

00:06:43.100 --> 00:06:46.900
we need to do is to take the
derivative of this expression

00:06:46.900 --> 00:06:49.420
with respect to
mu, set it to zero,

00:06:49.420 --> 00:06:52.270
and also take the
derivative with respect to v

00:06:52.270 --> 00:06:54.320
and set it to zero as well.

00:06:54.320 --> 00:07:00.000
Solve those equations and find
the optimal mu and v. So let's

00:07:00.000 --> 00:07:02.960
start by optimizing
with respect to mu.

00:07:02.960 --> 00:07:05.720
So we're going to take the
derivative of this expression

00:07:05.720 --> 00:07:09.160
with respect to mu
and set it to zero.

00:07:09.160 --> 00:07:11.940
This term does not
involve mu, so we only

00:07:11.940 --> 00:07:14.690
need to take the
derivative of this.

00:07:14.690 --> 00:07:16.640
And the derivative
of this is going

00:07:16.640 --> 00:07:21.470
to be-- there's a term
1 over v. And then

00:07:21.470 --> 00:07:24.180
the derivative of a
quadratic divided by 2

00:07:24.180 --> 00:07:27.930
is just xi minus mu.

00:07:27.930 --> 00:07:31.960
And we have one term
for each possible i.

00:07:31.960 --> 00:07:34.700
We get this equation.

00:07:34.700 --> 00:07:40.909
Now we can cancel out v, and
we're left with the equation

00:07:40.909 --> 00:07:45.170
that the sum of the
xi's is equal to the sum

00:07:45.170 --> 00:07:47.920
of the mus, which is n times mu.

00:07:47.920 --> 00:07:51.260
And now we can send
n to the denominator

00:07:51.260 --> 00:07:55.020
to obtain that
the estimate of mu

00:07:55.020 --> 00:07:59.960
is going to be the sum
of the xi's divided by n.

00:07:59.960 --> 00:08:02.670
So the maximum likelihood
estimate of the mean

00:08:02.670 --> 00:08:05.480
takes a very simple
and very natural form.

00:08:05.480 --> 00:08:08.160
It is just the sample mean.

00:08:08.160 --> 00:08:11.070
Now, let us continue with
the minimization with respect

00:08:11.070 --> 00:08:14.670
to v. In order to carry
out that minimization,

00:08:14.670 --> 00:08:17.680
we need to take the derivative
of this expression with respect

00:08:17.680 --> 00:08:19.900
to v and set it to zero.

00:08:19.900 --> 00:08:22.110
The derivative of
the first term is

00:08:22.110 --> 00:08:28.700
equal to n over 2 times 1
over v. And then from here,

00:08:28.700 --> 00:08:32.915
when we take the
derivative, we obtain

00:08:32.915 --> 00:08:45.180
the sum of all these terms
divided by 2v squared.

00:08:45.180 --> 00:08:47.925
But actually, when we take
the derivative of 1 over v,

00:08:47.925 --> 00:08:51.310
the derivative is
minus 1 over v squared.

00:08:51.310 --> 00:08:56.330
And for this reason here,
we will have a minus sign.

00:08:56.330 --> 00:08:58.030
So this is the
derivative with respect

00:08:58.030 --> 00:09:03.260
to v. We set it equal to zero
and carry out some algebra.

00:09:03.260 --> 00:09:05.770
What is the algebra
involved here?

00:09:05.770 --> 00:09:11.280
We can delete this term, 2,
that appears here and there.

00:09:11.280 --> 00:09:16.840
This term v cancels
out this exponent here.

00:09:16.840 --> 00:09:20.700
Then we take this v, move
it to the other side,

00:09:20.700 --> 00:09:24.860
and then take this n and
move it to this side,

00:09:24.860 --> 00:09:26.630
underneath this term.

00:09:26.630 --> 00:09:29.260
And finally, what we
obtain after you carry out

00:09:29.260 --> 00:09:32.690
this algebra is this
expression, that the estimate

00:09:32.690 --> 00:09:39.150
of the variance is some
form of the sample variance

00:09:39.150 --> 00:09:43.490
where we use the
optimal value of mu.

00:09:43.490 --> 00:09:47.500
And the optimal value of
mu we have already found.

00:09:47.500 --> 00:09:49.990
It's given by this
expression here.

00:09:49.990 --> 00:09:53.700
So we obtain a pretty natural
estimate for the variance

00:09:53.700 --> 00:09:58.340
as well by using this maximum
likelihood methodology.

00:09:58.340 --> 00:10:00.880
Now, these two examples
were particularly nice

00:10:00.880 --> 00:10:04.450
because the algebra was
not too complicated.

00:10:04.450 --> 00:10:07.160
And the answers
turned out to be what

00:10:07.160 --> 00:10:11.410
you might have guessed without
using any fancy methods.

00:10:11.410 --> 00:10:13.840
But in other problems,
the calculations

00:10:13.840 --> 00:10:18.869
may be more complicated and the
answers may not be so obvious.