WEBVTT

00:00:00.880 --> 00:00:03.630
We will finish our discussion
of classical statistical

00:00:03.630 --> 00:00:07.330
methods by discussing a general
method for estimation,

00:00:07.330 --> 00:00:10.650
the so-called maximum
likelihood method.

00:00:10.650 --> 00:00:14.150
If an unknown parameter can be
expressed as an expectation,

00:00:14.150 --> 00:00:17.710
we have seen that there's a
natural way of estimating it.

00:00:17.710 --> 00:00:20.730
But what if this is
not the case?

00:00:20.730 --> 00:00:24.660
Suppose there's no apparent way
of interpreting theta as

00:00:24.660 --> 00:00:25.760
an expectation.

00:00:25.760 --> 00:00:28.410
So we need to do
something else.

00:00:28.410 --> 00:00:32.110
So rather than using this
approach, we will use a

00:00:32.110 --> 00:00:34.550
different approach, which
is the following.

00:00:34.550 --> 00:00:39.780
We will find a value of theta
that makes the data that we

00:00:39.780 --> 00:00:42.420
have seen most likely.

00:00:42.420 --> 00:00:46.970
That is, we will find the value
of theta under which the

00:00:46.970 --> 00:00:49.950
probability of obtaining
the particular x

00:00:49.950 --> 00:00:51.710
that we have seen--

00:00:51.710 --> 00:00:54.900
that probability is as
large as possible.

00:00:54.900 --> 00:00:57.780
And that value of theta is going
to be our estimate, the

00:00:57.780 --> 00:00:59.900
maximum likelihood estimate.
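
NOTE
The formula on the slide is not part of this transcript. As a sketch in
assumed notation (p_X for the PMF of X, x for the observed value, and
\hat{\theta}_{ML} for the estimate), the estimate described above could be written as:
```latex
% For a continuous X, a PDF f_X(x; \theta) would take the place of the PMF.
\hat{\theta}_{ML} = \arg\max_{\theta} \; p_X(x; \theta)
```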

00:00:59.900 --> 00:01:02.240
Here, I wrote a PMF.

00:01:02.240 --> 00:01:04.129
That's what you would
do if X was a

00:01:04.129 --> 00:01:05.470
discrete random variable.

00:01:05.470 --> 00:01:10.170
But the same procedure, of
course, applies when X is a

00:01:10.170 --> 00:01:12.440
continuous random variable.

00:01:12.440 --> 00:01:16.039
And more generally, this
procedure also applies when X

00:01:16.039 --> 00:01:20.550
is a vector of observations and
when theta is a vector of

00:01:20.550 --> 00:01:22.480
parameters.

00:01:22.480 --> 00:01:25.289
But what does this
method really do?

00:01:25.289 --> 00:01:28.420
It is instructive to compare
maximum likelihood estimation

00:01:28.420 --> 00:01:30.160
to a Bayesian approach.

00:01:30.160 --> 00:01:34.270
In a Bayesian setting, what we
do is, we find the posterior

00:01:34.270 --> 00:01:37.330
distribution of the unknown
parameter, which is now

00:01:37.330 --> 00:01:40.180
treated as a random variable.

00:01:40.180 --> 00:01:45.729
And then we look for the most
likely value of theta.

00:01:45.729 --> 00:01:49.050
We look at this distribution
and try to find its peak.

00:01:49.050 --> 00:01:53.210
So we want to maximize this
quantity over theta.

00:01:53.210 --> 00:01:55.870
The denominator does not
involve any thetas.

00:01:55.870 --> 00:01:57.320
So we ignore it.

00:01:57.320 --> 00:02:02.000
And suppose now that
we use a prior for

00:02:02.000 --> 00:02:04.760
theta, which is flat.

00:02:04.760 --> 00:02:09.389
Suppose that this prior is
constant over the range of

00:02:09.389 --> 00:02:11.630
possible values of theta.

00:02:11.630 --> 00:02:15.520
In that case, what we need to
do is to just take this

00:02:15.520 --> 00:02:19.750
expression and to maximize
it over all thetas.

00:02:19.750 --> 00:02:22.960
And this looks very similar to
what is happening here, where

00:02:22.960 --> 00:02:27.400
we take this expression and
maximize it over all thetas.

00:02:27.400 --> 00:02:31.579
So operationally, maximum
likelihood estimation is the

00:02:31.579 --> 00:02:36.790
same as Bayesian estimation, in
which we find the peak of

00:02:36.790 --> 00:02:41.160
the posterior for the special
case where we're using

00:02:41.160 --> 00:02:43.910
a constant, or flat, prior.
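
NOTE
The expressions being pointed to are not in the transcript. A sketch of the
comparison for discrete X and Theta, in assumed notation (p_Theta for the prior,
c for its constant value), might read:
```latex
p_{\Theta \mid X}(\theta \mid x)
  = \frac{p_{\Theta}(\theta)\, p_{X \mid \Theta}(x \mid \theta)}{p_X(x)}
% With a flat prior, p_{\Theta}(\theta) = c on the range of theta, so the
% factor c and the denominator p_X(x) do not affect the maximizing theta:
\arg\max_{\theta} \; p_{\Theta \mid X}(\theta \mid x)
  = \arg\max_{\theta} \; p_{X \mid \Theta}(x \mid \theta)
```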

00:02:43.910 --> 00:02:47.030
But despite this similarity,
the two methods are

00:02:47.030 --> 00:02:49.505
philosophically very
different.

00:02:49.505 --> 00:02:53.010
In the Bayesian setting, you're
asking the question,

00:02:53.010 --> 00:02:57.090
what is the most likely
value of theta?

00:02:57.090 --> 00:03:00.500
Whereas in the maximum
likelihood setting, you're

00:03:00.500 --> 00:03:04.750
asking, what is the value
of theta that makes

00:03:04.750 --> 00:03:08.070
my data most likely?

00:03:08.070 --> 00:03:12.380
Or what is the value of theta
under which my data are the

00:03:12.380 --> 00:03:14.610
least surprising?

00:03:14.610 --> 00:03:19.579
So the interpretation of the
two methods is quite

00:03:19.579 --> 00:03:22.579
different, even though
the mechanics

00:03:22.579 --> 00:03:24.810
can be fairly similar.

00:03:24.810 --> 00:03:29.350
The maximum likelihood method
has some remarkable properties

00:03:29.350 --> 00:03:31.430
that we would like
now to discuss.

00:03:31.430 --> 00:03:33.560
But first, one comment--

00:03:33.560 --> 00:03:38.230
we need to take the probability
of the observed

00:03:38.230 --> 00:03:39.579
data given theta,

00:03:39.579 --> 00:03:43.300
which is a function of theta,
and maximize it over theta.

00:03:43.300 --> 00:03:47.190
In some problems, we can find
closed form solutions for the

00:03:47.190 --> 00:03:50.400
optimal value of theta, which
is going to be our estimate.

00:03:50.400 --> 00:03:54.190
But more often, and especially
for large problems, one has to

00:03:54.190 --> 00:03:57.960
do this maximization
in a numerical way.

00:03:57.960 --> 00:04:01.440
This is possible these days,
and routinely, people solve

00:04:01.440 --> 00:04:04.700
very high dimensional problems
with lots of data and lots of

00:04:04.700 --> 00:04:08.530
parameters using the maximum
likelihood methodology.
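
NOTE
A minimal sketch of numerical maximization, not taken from the lecture. It
assumes a hypothetical model (i.i.d. exponential data with unknown rate theta)
chosen because a closed-form answer exists to compare against. The names
log_likelihood and maximize_numerically, the search bounds, and the simulated
data are all illustrative assumptions.
```python
import math
import random
def log_likelihood(theta, data):
    # Log of the product of theta * exp(-theta * x) over the data:
    # n * log(theta) - theta * sum(x).
    return len(data) * math.log(theta) - theta * sum(data)
def maximize_numerically(data, lo=1e-6, hi=100.0, iters=200):
    # Ternary search over [lo, hi]; valid here because this particular
    # log-likelihood is concave in theta, so it has a single peak.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, data) < log_likelihood(m2, data):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
random.seed(0)
true_theta = 2.0  # hypothetical; unknown in a real problem
data = [random.expovariate(true_theta) for _ in range(1000)]
theta_numeric = maximize_numerically(data)
theta_closed_form = len(data) / sum(data)  # closed-form MLE for this model
print(theta_numeric, theta_closed_form)
```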

00:04:08.530 --> 00:04:11.480
The maximum likelihood
methodology is very popular

00:04:11.480 --> 00:04:16.399
because it has a very sound
theoretical basis.

00:04:16.399 --> 00:04:19.990
I will list a few facts, which
we will not attempt to prove

00:04:19.990 --> 00:04:21.829
or even justify.

00:04:21.829 --> 00:04:25.760
But they're useful to know
as general background.

00:04:25.760 --> 00:04:29.770
Suppose that we have n pieces of
data that are drawn from a

00:04:29.770 --> 00:04:32.450
model of a certain
structure.

00:04:32.450 --> 00:04:37.050
Then under mild assumptions,
the maximum likelihood

00:04:37.050 --> 00:04:40.190
estimator has the property
that it is consistent.

00:04:40.190 --> 00:04:43.720
That is, as we draw more and
more data, our estimate is

00:04:43.720 --> 00:04:47.640
going to converge to the true
value of the parameter.

00:04:47.640 --> 00:04:50.070
In addition, we know
quite a bit more.

00:04:50.070 --> 00:04:53.870
Asymptotically, the maximum
likelihood estimator behaves

00:04:53.870 --> 00:04:55.930
like a normal random variable.

00:04:55.930 --> 00:05:00.330
That is, after we normalize,
subtract the target and divide

00:05:00.330 --> 00:05:04.480
by its standard deviation, it
approaches a standard normal

00:05:04.480 --> 00:05:05.440
distribution.
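
NOTE
The expression on the slide is not in the transcript. In assumed notation
(\hat{\Theta}_n for the estimator based on n observations, \sigma(\hat{\Theta}_n)
for its standard deviation), the statement above could be sketched as:
```latex
\frac{\hat{\Theta}_n - \theta}{\sigma(\hat{\Theta}_n)}
  \;\xrightarrow{d}\; N(0, 1) \quad \text{as } n \to \infty
```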

00:05:05.440 --> 00:05:10.360
So in this sense, it behaves the
same way that the sample

00:05:10.360 --> 00:05:12.370
mean behaves.

00:05:12.370 --> 00:05:15.940
Notice that this expression
here involves the standard

00:05:15.940 --> 00:05:18.840
error of the maximum likelihood
estimator.

00:05:18.840 --> 00:05:20.710
This is an important quantity.

00:05:20.710 --> 00:05:23.960
And for this reason, people
have developed either

00:05:23.960 --> 00:05:27.320
analytical or simulation methods
for calculating or

00:05:27.320 --> 00:05:30.160
approximating this
standard error.

00:05:30.160 --> 00:05:33.530
Once you have an estimate or
an approximation of the

00:05:33.530 --> 00:05:37.400
standard error in your hands,
you can further use it to

00:05:37.400 --> 00:05:40.140
construct confidence
intervals.

00:05:40.140 --> 00:05:43.980
Using the asymptotic normality,
we can then

00:05:43.980 --> 00:05:46.710
construct a confidence interval
in exactly the same

00:05:46.710 --> 00:05:50.690
way as we did for the case of
the sample mean estimator.

00:05:50.690 --> 00:05:56.010
And this, for example, would be
a 95% confidence interval.
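
NOTE
A rough sketch of such an interval, not taken from the lecture. It assumes a
hypothetical exponential model with unknown rate theta, for which the
standard error of the ML estimate is approximately theta_hat / sqrt(n) when
computed from the Fisher information. The data are simulated for illustration.
```python
import math
import random
random.seed(1)
true_theta = 2.0  # hypothetical; unknown in a real problem
n = 1000
data = [random.expovariate(true_theta) for _ in range(n)]
theta_hat = n / sum(data)  # ML estimate of the exponential rate
# Approximate standard error for this model, evaluated at the estimate.
se = theta_hat / math.sqrt(n)
# 1.96 is the standard normal quantile that leaves 2.5% in each tail.
ci_low, ci_high = theta_hat - 1.96 * se, theta_hat + 1.96 * se
print(f"estimate {theta_hat:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```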

00:05:56.010 --> 00:05:59.650
Finally, one last important
property is that the maximum

00:05:59.650 --> 00:06:05.670
likelihood estimator is what
is called an asymptotically

00:06:05.670 --> 00:06:07.700
efficient estimator.

00:06:07.700 --> 00:06:11.720
That is, it is the best possible
estimator in the

00:06:11.720 --> 00:06:16.070
sense that, asymptotically, it achieves
the smallest possible variance.

00:06:16.070 --> 00:06:18.710
So all of these are very
strong properties.

00:06:18.710 --> 00:06:22.790
And this is the reason why
maximum likelihood estimation

00:06:22.790 --> 00:06:26.780
is the most common approach for
problems that do not have

00:06:26.780 --> 00:06:29.520
any particular special
structure that

00:06:29.520 --> 00:06:30.770
you can exploit otherwise.