WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:16.650
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.650 --> 00:00:17.880
at ocw.mit.edu.
00:00:22.729 --> 00:00:24.770
PROFESSOR: So I'm using
a few things here, right?
00:00:24.770 --> 00:00:27.740
I'm using the fact that
KL is non-negative.
00:00:27.740 --> 00:00:31.130
But KL is equal to 0 when I
take twice the same argument.
00:00:31.130 --> 00:00:34.061
So I know that this function
is always non-negative.
00:00:38.650 --> 00:00:45.580
So that's theta and that's
KL P theta star P theta.
00:00:45.580 --> 00:00:51.170
And I know that at theta
star, it's equal to 0.
00:00:51.170 --> 00:00:52.360
OK?
00:00:52.360 --> 00:00:56.450
I could be in the case
where I have this happening.
00:00:56.450 --> 00:01:01.460
I have two-- let's call
it theta star prime.
00:01:01.460 --> 00:01:04.069
I have two minimizers.
00:01:04.069 --> 00:01:05.390
That could be the case, right?
00:01:05.390 --> 00:01:07.460
I'm not saying
that-- so K of L--
00:01:07.460 --> 00:01:11.810
KL is 0 at the minimum.
00:01:11.810 --> 00:01:14.660
That doesn't mean that I
have a unique minimum, right?
00:01:14.660 --> 00:01:16.110
But it does, actually.
00:01:16.110 --> 00:01:17.562
What do I need to
use to make sure
00:01:17.562 --> 00:01:18.770
that I have only one minimum?
00:01:22.210 --> 00:01:24.280
So the definiteness
is guaranteeing to me
00:01:24.280 --> 00:01:28.412
that there's a unique P
theta star that minimizes it.
00:01:28.412 --> 00:01:30.120
But then I need to
make sure that there's
00:01:30.120 --> 00:01:33.214
a unique-- from this
unique P theta star,
00:01:33.214 --> 00:01:35.380
I need to make sure there's
a unique theta star that
00:01:35.380 --> 00:01:36.550
defines this P theta star.
00:01:39.250 --> 00:01:40.550
Exactly.
00:01:40.550 --> 00:01:43.420
All right, so I
combine definiteness
00:01:43.420 --> 00:01:47.410
and identifiability to make
sure that there is a unique
00:01:47.410 --> 00:01:50.070
minimizer, and this
case cannot exist.
00:01:50.070 --> 00:01:55.630
OK, so basically, let me
write what I just said.
00:01:55.630 --> 00:02:06.540
So definiteness, that
implies that P theta star
00:02:06.540 --> 00:02:23.120
is the unique minimizer of P
theta maps to KL P theta star P
00:02:23.120 --> 00:02:23.910
theta.
00:02:23.910 --> 00:02:26.290
So definiteness only
guarantees that the probability
00:02:26.290 --> 00:02:29.430
distribution is
uniquely identified.
00:02:29.430 --> 00:02:42.260
And identifiability
implies that theta star
00:02:42.260 --> 00:02:56.937
is the unique minimizer of
theta maps to KL P theta star P
00:02:56.937 --> 00:03:00.590
theta, OK?
00:03:00.590 --> 00:03:02.480
So I'm basically
doing the composition
00:03:02.480 --> 00:03:04.260
of two injective functions.
00:03:04.260 --> 00:03:07.970
The first one is the one that
maps, say, theta to P theta.
00:03:07.970 --> 00:03:11.090
And the second one is
the one that maps P theta
00:03:11.090 --> 00:03:14.482
to the set of minimizers, OK?
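The two-step argument above can be written as a pair of equivalences. This is a sketch in the lecture's notation; the displays use only plain LaTeX commands.

```latex
\[
\mathrm{KL}(P_{\theta^*}, P_\theta) = 0
\iff P_\theta = P_{\theta^*}
\quad \mbox{(definiteness)}
\]
\[
P_\theta = P_{\theta^*}
\iff \theta = \theta^*
\quad \mbox{(identifiability)}
\]
```

Since the KL is non-negative, chaining the two lines says that the map from theta to KL P theta star P theta vanishes only at theta star, so theta star is its unique minimizer.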
00:03:20.770 --> 00:03:27.560
So at least morally, you
should agree that theta star
00:03:27.560 --> 00:03:28.940
is the minimizer of this thing.
00:03:28.940 --> 00:03:30.523
Whether it's unique
or not, you should
00:03:30.523 --> 00:03:33.110
agree that it's a good one.
00:03:33.110 --> 00:03:36.620
So maybe you can think
a little longer on this.
00:03:36.620 --> 00:03:38.720
So thinking about this
being the minimizer,
00:03:38.720 --> 00:03:40.220
then it says,
well, if I actually
00:03:40.220 --> 00:03:42.230
had a good estimate
for this function,
00:03:42.230 --> 00:03:44.050
I would use the strategy
that I described
00:03:44.050 --> 00:03:45.860
for the total
variation, which is,
00:03:45.860 --> 00:03:48.050
well, I don't know what
this function looks like.
00:03:48.050 --> 00:03:49.700
It depends on theta star.
00:03:49.700 --> 00:03:51.470
But maybe I can
find an estimator
00:03:51.470 --> 00:03:55.170
of this function that
fluctuates around this function,
00:03:55.170 --> 00:03:58.070
and such that when I minimize
this estimator of the function,
00:03:58.070 --> 00:04:01.380
I'm actually not too far, OK?
00:04:01.380 --> 00:04:04.650
And this is exactly what
drives me to do this,
00:04:04.650 --> 00:04:07.380
because I can actually
construct an estimator.
00:04:07.380 --> 00:04:09.300
I can actually construct
an estimator such
00:04:09.300 --> 00:04:12.540
that this estimator
is actually--
00:04:12.540 --> 00:04:15.900
of the KL is actually
close to the KL, all right?
00:04:15.900 --> 00:04:18.709
So I define KL hat.
00:04:18.709 --> 00:04:22.920
So all we did is just replacing
expectation with respect
00:04:22.920 --> 00:04:27.400
to theta star by averages.
00:04:30.840 --> 00:04:33.850
That's what we did.
00:04:33.850 --> 00:04:37.270
So if you're a little puzzled by
this arrow, that's all it says.
00:04:37.270 --> 00:04:39.540
Replace this guy by this guy.
00:04:39.540 --> 00:04:41.190
It has no mathematical meaning.
00:04:41.190 --> 00:04:42.990
It just means just
replace it by.
00:04:42.990 --> 00:04:46.190
And now that actually tells
me how to get my estimator.
00:04:46.190 --> 00:04:51.660
It just says, well,
my estimator, KL hat,
00:04:51.660 --> 00:04:54.969
is equal to some constant
which I don't know.
00:04:54.969 --> 00:04:56.760
I mean, it certainly
depends on theta star,
00:04:56.760 --> 00:04:59.720
but I won't care about it
when I'm trying to minimize--
00:04:59.720 --> 00:05:09.610
minus 1/n sum from i from
1 to n log f theta of x.
00:05:09.610 --> 00:05:11.320
So here I'm reading
it with the density.
00:05:11.320 --> 00:05:13.950
You have it with the
PMF on the slides,
00:05:13.950 --> 00:05:18.560
and so you have the two
versions in front of you, OK?
00:05:18.560 --> 00:05:22.400
Oh sorry, I forgot the xi.
00:05:22.400 --> 00:05:25.430
Now clearly, this function
I know how to compute.
00:05:25.430 --> 00:05:30.710
If you give me a theta, since
I know the form of the density
00:05:30.710 --> 00:05:33.380
f theta, for each
theta that you give me,
00:05:33.380 --> 00:05:38.070
I can actually compute
this quantity, right?
00:05:38.070 --> 00:05:40.475
This I don't know,
but I don't care.
00:05:40.475 --> 00:05:42.600
Because I'm just shifting
the value of the function
00:05:42.600 --> 00:05:43.590
I'm trying to minimize.
00:05:43.590 --> 00:05:46.830
The set of minimizers
is not going to change.
00:05:46.830 --> 00:05:50.420
So now, this is my
estimation strategy.
00:05:50.420 --> 00:06:01.560
Minimize in theta KL hat
P theta star P theta, OK?
00:06:01.560 --> 00:06:05.575
So now let's just make sure
that we all agree that--
00:06:05.575 --> 00:06:07.960
so what we want is the
argument of the minimum,
00:06:07.960 --> 00:06:10.710
right? arg min means the
theta that minimizes this guy,
00:06:10.710 --> 00:06:13.534
rather than finding
the value of the min.
00:06:13.534 --> 00:06:15.700
OK, so I'm trying to find
the arg min of this thing.
00:06:15.700 --> 00:06:18.900
Well, this is equivalent
to finding the arg
00:06:18.900 --> 00:06:28.020
min of, say, a constant minus
1/n sum from i from 1 to n
00:06:28.020 --> 00:06:31.226
of log f theta of xi.
00:06:33.864 --> 00:06:34.530
So that's just--
00:06:38.850 --> 00:06:41.026
I don't think it likes me.
00:06:41.026 --> 00:06:42.490
No.
00:06:42.490 --> 00:06:46.110
OK, so thus minimizing
this average, right?
00:06:46.110 --> 00:06:48.620
I just plugged in the
definition of KL hat.
00:06:48.620 --> 00:06:50.360
Now, I claim that
taking the arg min
00:06:50.360 --> 00:06:53.510
of a constant plus a function
or the arg min of the function
00:06:53.510 --> 00:06:55.820
is the same thing.
00:06:55.820 --> 00:07:00.650
Is anybody not comfortable
with this idea?
00:07:00.650 --> 00:07:03.530
OK, so this is the same.
00:07:13.757 --> 00:07:15.750
By the way, this
I should probably
00:07:15.750 --> 00:07:18.630
switch to the next slide,
because I'm writing
00:07:18.630 --> 00:07:22.830
the same thing, but better.
00:07:22.830 --> 00:07:29.360
And it's with PMF
rather than PDF.
00:07:29.360 --> 00:07:34.000
OK, now, arg min of the minimum
is the same of arg max--
00:07:34.000 --> 00:07:35.595
sorry, arg min of
the negative thing
00:07:35.595 --> 00:07:37.720
is the same as arg max
without the negative, right?
00:07:40.324 --> 00:07:49.010
arg max over theta of 1/n from
i equal 1 to n log f
00:07:49.010 --> 00:07:49.850
theta of xi.
00:07:53.540 --> 00:07:54.980
Taking the arg
min of the average
00:07:54.980 --> 00:07:56.563
or the arg min of
the sum, again, it's
00:07:56.563 --> 00:07:59.030
not going to make
much difference.
00:07:59.030 --> 00:08:01.310
Just adding constants or
multiplying by positive constants
00:08:01.310 --> 00:08:04.440
does not change the
arg min or the arg max.
00:08:04.440 --> 00:08:07.677
Now, I have the
sum of logs, which
00:08:07.677 --> 00:08:08.760
is the log of the product.
00:08:23.310 --> 00:08:24.350
OK?
00:08:24.350 --> 00:08:27.280
It's the arg max of the
log of f theta of x1 times
00:08:27.280 --> 00:08:30.190
f theta of x2, f theta of xn.
00:08:30.190 --> 00:08:37.440
But the log is a function
that's increasing, so maximizing
00:08:37.440 --> 00:08:40.830
log of a function or
maximizing the function itself
00:08:40.830 --> 00:08:42.860
is the same thing.
00:08:42.860 --> 00:08:45.200
The value is going to
change, but the arg max
00:08:45.200 --> 00:08:46.220
is not going to change.
00:08:46.220 --> 00:08:47.344
Everybody agrees with this?
00:08:50.340 --> 00:08:59.990
So this is equivalent to arg
max over theta of pi from 1 to n
00:08:59.990 --> 00:09:02.780
of f theta xi.
00:09:02.780 --> 00:09:10.515
And that's because x maps
to log x is increasing.
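The equivalence between maximizing the product and maximizing the average of the logs can be checked numerically. This is a sketch under assumptions: the N(theta, 1) density, a simulated sample of size 20, and a grid search standing in for the arg max.

```python
import math
import random

def density(x, theta):
    """N(theta, 1) density at x."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

def likelihood(theta, sample):
    """Product of the densities at the observations."""
    return math.prod(density(x, theta) for x in sample)

def average_log_likelihood(theta, sample):
    """(1/n) times the sum of the log-densities."""
    return sum(math.log(density(x, theta)) for x in sample) / len(sample)

random.seed(1)
sample = [random.gauss(0.7, 1.0) for _ in range(20)]  # simulated data
grid = [i / 100 for i in range(-200, 301)]            # candidate thetas

argmax_product = max(grid, key=lambda t: likelihood(t, sample))
argmax_avg_log = max(grid, key=lambda t: average_log_likelihood(t, sample))
print(argmax_product, argmax_avg_log)
```

The maximal values differ, but the maximizing theta is the same. For large n the raw product can underflow in floating point, which is one practical reason to work with the sum of logs.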
00:09:13.930 --> 00:09:17.140
So now I've gone from
minimizing the KL
00:09:17.140 --> 00:09:19.750
to minimizing the
estimate of the KL
00:09:19.750 --> 00:09:23.520
to maximizing this product.
00:09:23.520 --> 00:09:27.280
Well, this chapter is called
maximum likelihood estimation.
00:09:27.280 --> 00:09:30.370
The maximum comes from the
fact that our original idea
00:09:30.370 --> 00:09:32.240
was to minimize the
negative of a function.
00:09:32.240 --> 00:09:34.150
So that's why it's
maximum likelihood.
00:09:34.150 --> 00:09:42.770
And this function here
is called the likelihood.
00:09:42.770 --> 00:09:45.150
This function is really
just telling me--
00:09:45.150 --> 00:09:47.370
they call it
likelihood because it's
00:09:47.370 --> 00:09:49.920
some measure of how
likely it is that theta
00:09:49.920 --> 00:09:52.800
was the parameter that
generated the data.
00:09:52.800 --> 00:09:55.444
OK, so let's go to the--
00:09:55.444 --> 00:09:57.610
well, we'll go to the formal
definition in a second.
00:09:57.610 --> 00:09:59.160
But actually, let
me just give you
00:09:59.160 --> 00:10:05.592
intuition as to why this is
the distribution of the data.
00:10:05.592 --> 00:10:07.050
Why this is the
likelihood-- sorry.
00:10:07.050 --> 00:10:11.940
Why is this making sense
as a measure of likelihood?
00:10:11.940 --> 00:10:14.550
Let's now think for simplicity
of the following model.
00:10:14.550 --> 00:10:15.710
So I have--
00:10:15.710 --> 00:10:19.550
I'm on the real line
and I look at, say,
00:10:19.550 --> 00:10:25.540
N(theta, 1) for theta in the
real-- do you see that?
00:10:25.540 --> 00:10:26.040
OK.
00:10:26.040 --> 00:10:27.660
Probably you don't.
00:10:27.660 --> 00:10:28.860
Not that you care.
00:10:28.860 --> 00:10:29.790
OK, so--
00:10:41.120 --> 00:10:42.590
OK, let's look at
a simple example.
00:10:45.990 --> 00:10:48.910
So here's the model.
00:10:48.910 --> 00:10:52.310
As I said, we're looking at
observations on the real line.
00:10:52.310 --> 00:10:57.032
And they're distributed
according to some N(theta, 1).
00:10:57.032 --> 00:10:58.490
So I don't care
about the variance.
00:10:58.490 --> 00:10:59.810
I know it's 1.
00:10:59.810 --> 00:11:03.170
And it's indexed by
theta in the real line.
00:11:03.170 --> 00:11:05.360
OK, so this is-- the only
thing I need to figure out
00:11:05.360 --> 00:11:09.260
is, what is the mean
of those guys, OK?
00:11:09.260 --> 00:11:11.600
Now, I have these n observations.
00:11:11.600 --> 00:11:15.920
And if you actually remember
from your probability class,
00:11:15.920 --> 00:11:18.610
are you familiar with the
concept of joint density?
00:11:18.610 --> 00:11:20.420
You have multivariate
observations.
00:11:20.420 --> 00:11:23.750
The joint density of
independent random variables
00:11:23.750 --> 00:11:26.990
is just a product of their
individual densities.
00:11:26.990 --> 00:11:30.950
So really, when I look
at the product from i
00:11:30.950 --> 00:11:34.610
equal 1 to n of f
theta of xi, this
00:11:34.610 --> 00:11:44.800
is really the joint
density of the vector--
00:11:48.592 --> 00:11:51.210
well, let me not use
the word vector--
00:11:51.210 --> 00:11:55.660
of x1 xn, OK?
00:11:55.660 --> 00:11:58.930
So if I take the product of
densities, it is still a density.
00:11:58.930 --> 00:12:04.136
And it's actually-- but
this time on the r to the n.
00:12:04.136 --> 00:12:06.010
And so now what this
thing is telling me-- so
00:12:06.010 --> 00:12:07.330
think of it in R2, right?
00:12:07.330 --> 00:12:10.630
So this is the joint
density of two Gaussians.
00:12:10.630 --> 00:12:14.640
So it's something that looks
like some bell-shaped curve
00:12:14.640 --> 00:12:15.940
in two dimensions.
00:12:15.940 --> 00:12:20.410
And it's centered at
the value theta theta.
00:12:20.410 --> 00:12:22.390
OK, they both have
the mean theta.
00:12:22.390 --> 00:12:24.280
So let's assume for
one second-- it's
00:12:24.280 --> 00:12:28.000
going to be hard for me to
make pictures in n dimensions.
00:12:28.000 --> 00:12:29.710
Actually, already
in two dimensions,
00:12:29.710 --> 00:12:31.660
I can promise you that
it's not very easy.
00:12:31.660 --> 00:12:34.300
So I'm actually
just going to assume
00:12:34.300 --> 00:12:37.270
that n is equal to 1 for
the sake of illustration.
00:12:37.270 --> 00:12:40.900
OK, so now I have this data.
00:12:40.900 --> 00:12:44.350
And now I have one
observation, OK?
00:12:44.350 --> 00:12:47.524
And I know that the f
theta looks like this.
00:12:47.524 --> 00:12:48.940
And what I'm doing
is I'm actually
00:12:48.940 --> 00:12:51.961
looking at the value of f
theta at my observation.
00:12:54.787 --> 00:12:57.870
Let's call it x1.
00:12:57.870 --> 00:13:00.590
Now, my principle tells me,
just find the theta that
00:13:00.590 --> 00:13:03.260
makes this guy the most likely.
00:13:03.260 --> 00:13:05.360
What is the likelihood of my x1?
00:13:05.360 --> 00:13:07.640
Well, it's just the
value of the function.
00:13:07.640 --> 00:13:09.410
That's this value here.
00:13:09.410 --> 00:13:13.670
And if I wanted to find the most
likely theta that had generated
00:13:13.670 --> 00:13:16.370
this x1, what I would need
to do is to shift this thing
00:13:16.370 --> 00:13:19.290
and put it here.
00:13:19.290 --> 00:13:21.950
And so my estimate, my
maximum likelihood estimator
00:13:21.950 --> 00:13:28.720
here would be theta
is equal to x1, OK?
00:13:28.720 --> 00:13:30.370
That would be just
the observation.
00:13:30.370 --> 00:13:32.110
Because if I have
only one observation,
00:13:32.110 --> 00:13:33.454
what else am I going to do?
00:13:33.454 --> 00:13:34.870
OK, and so it sort
of makes sense.
00:13:34.870 --> 00:13:36.286
And if you have
more observations,
00:13:36.286 --> 00:13:40.540
you can think of it this way,
as if you had more observations.
00:13:40.540 --> 00:13:42.735
So now I have, say,
K observations,
00:13:42.735 --> 00:13:44.157
or n observations.
00:13:44.157 --> 00:13:46.240
And what I do is that I
look at the value for each
00:13:46.240 --> 00:13:48.790
of these guys.
00:13:48.790 --> 00:13:52.240
So this value, this value,
this value, this value.
00:13:52.240 --> 00:13:55.870
I take their product and
I make this thing large.
00:13:55.870 --> 00:13:57.250
OK, why do I take the product?
00:13:57.250 --> 00:14:00.100
Well, because I'm trying
to maximize their value
00:14:00.100 --> 00:14:02.830
all together, and I need to
just turn it into one number
00:14:02.830 --> 00:14:04.030
that I can maximize.
00:14:04.030 --> 00:14:06.580
And taking the product
is the natural way
00:14:06.580 --> 00:14:08.470
of doing it, either
by motivating it
00:14:08.470 --> 00:14:11.050
by the KL principle
or motivating it
00:14:11.050 --> 00:14:14.800
by maximizing the joint density,
rather than just maximizing
00:14:14.800 --> 00:14:15.910
anything.
00:14:15.910 --> 00:14:20.200
OK, so that's why, visually,
this is the maximum likelihood.
00:14:20.200 --> 00:14:24.010
It just says that if my
observations are here,
00:14:24.010 --> 00:14:29.210
then this guy, this mean theta,
is more likely than this guy.
00:14:29.210 --> 00:14:31.450
Because now if I
look at the value
00:14:31.450 --> 00:14:33.850
of the function
for this guy-- if I
00:14:33.850 --> 00:14:35.740
look at theta being
this thing, then this
00:14:35.740 --> 00:14:37.540
is a very small value.
00:14:37.540 --> 00:14:39.850
Very small value, very small
value, very small value.
00:14:39.850 --> 00:14:41.984
Everything gets a super
small value, right?
00:14:41.984 --> 00:14:43.900
That's just the value
that it gets in the tail
00:14:43.900 --> 00:14:45.730
here, which is very close to 0.
00:14:45.730 --> 00:14:47.980
But as soon as I start
covering all my points
00:14:47.980 --> 00:14:53.260
with my bell-shaped curve,
then all the values go up.
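The picture described above can be reproduced with numbers. This is a sketch with made-up observations and two hypothetical candidate means, one far in the tail and one sitting on the data.

```python
import math

def density(x, theta):
    """N(theta, 1) density at x."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

observations = [1.8, 2.1, 2.4, 2.9]  # made-up data points for illustration

def likelihood(theta):
    """Product of the density values at the observations."""
    return math.prod(density(x, theta) for x in observations)

far_theta = -3.0                                      # bell curve far from the data
near_theta = sum(observations) / len(observations)    # bell curve covering the data

for theta in (far_theta, near_theta):
    values = [density(x, theta) for x in observations]
    print(theta, [f"{v:.2e}" for v in values], f"{likelihood(theta):.2e}")
```

With the curve centered far from the data every factor is a tiny tail value, so the product is tiny; centering the curve on the data raises every factor at once.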
00:14:53.260 --> 00:14:58.720
All right, so I just want
to make a short break
00:14:58.720 --> 00:15:00.490
into statistics,
and just make sure
00:15:00.490 --> 00:15:04.120
that the maximum likelihood
principle involves
00:15:04.120 --> 00:15:05.650
maximizing a function.
00:15:05.650 --> 00:15:07.250
So I just want to
make sure that we're
00:15:07.250 --> 00:15:11.200
all on par about how do
we maximize functions.
00:15:11.200 --> 00:15:13.990
In most instances, it's going to
be a one-dimensional function,
00:15:13.990 --> 00:15:16.690
because theta is going to be
a one-dimensional parameter.
00:15:16.690 --> 00:15:18.800
Like here it's the real line.
00:15:18.800 --> 00:15:20.110
So it's going to be easy.
00:15:20.110 --> 00:15:22.130
In some cases, it may be
a multivariate function
00:15:22.130 --> 00:15:24.790
and it might be
more complicated.
00:15:24.790 --> 00:15:26.650
OK, so let's just
make this interlude.
00:15:26.650 --> 00:15:28.450
So the first thing
I want you to notice
00:15:28.450 --> 00:15:31.940
is that if you open any book
on what's called optimization,
00:15:31.940 --> 00:15:35.110
which basically is the science
behind optimizing functions,
00:15:35.110 --> 00:15:36.610
you will talk mostly--
00:15:36.610 --> 00:15:40.170
I mean, I'd say
99.9% of the cases
00:15:40.170 --> 00:15:42.349
will talk about
minimizing functions.
00:15:42.349 --> 00:15:44.890
But it doesn't matter, because
you can just flip the function
00:15:44.890 --> 00:15:47.710
and you just put a minus
sign, and minimizing h
00:15:47.710 --> 00:15:51.640
is the same as maximizing
minus h and the opposite, OK?
00:15:51.640 --> 00:15:53.547
So for this class,
since we're only
00:15:53.547 --> 00:15:55.630
going to talk about maximum
likelihood estimation,
00:15:55.630 --> 00:15:57.654
we will talk about
maximizing functions.
00:15:57.654 --> 00:15:59.320
But don't be lost if
you decide suddenly
00:15:59.320 --> 00:16:01.569
to open a book on optimization
and find only something
00:16:01.569 --> 00:16:03.700
about minimizing functions.
00:16:03.700 --> 00:16:08.279
OK, so maximizing an arbitrary
function can actually be fairly
00:16:08.279 --> 00:16:10.570
difficult. If I give you a
function that has this weird
00:16:10.570 --> 00:16:13.740
shape, right-- let's think of
this polynomial for example--
00:16:13.740 --> 00:16:17.140
and I wanted to find the
maximum, how would we do it?
00:16:20.350 --> 00:16:23.580
So what is the thing you've
learned in calculus on how
00:16:23.580 --> 00:16:26.200
to maximize the function?
00:16:26.200 --> 00:16:27.856
Set the derivative equal to 0.
00:16:27.856 --> 00:16:29.730
Maybe you want to check
the second derivative
00:16:29.730 --> 00:16:31.647
to make sure it's a
maximum and not a minimum.
00:16:31.647 --> 00:16:34.105
But the thing is, this is only
guaranteeing to you that you
00:16:34.105 --> 00:16:35.280
have a local one, right?
00:16:35.280 --> 00:16:38.412
So if I do it for this function,
for example, then this guy
00:16:38.412 --> 00:16:39.870
is going to satisfy
this criterion,
00:16:39.870 --> 00:16:41.610
this guy is going to
satisfy this criterion,
00:16:41.610 --> 00:16:43.530
this guy is going to satisfy
this criterion, this guy here,
00:16:43.530 --> 00:16:45.404
and this guy satisfies
the criterion, but not
00:16:45.404 --> 00:16:46.950
the second derivative one.
00:16:46.950 --> 00:16:50.160
So I have a lot of candidates.
00:16:50.160 --> 00:16:52.800
And if my function can
be really anything,
00:16:52.800 --> 00:16:54.696
it's going to be
difficult, whether it's
00:16:54.696 --> 00:16:56.820
analytically by taking
derivatives and setting them
00:16:56.820 --> 00:17:00.230
to 0, or trying to find
some algorithms to do this.
00:17:00.230 --> 00:17:02.400
Because if my function
is very jittery,
00:17:02.400 --> 00:17:05.130
then my algorithm basically
has to check all candidates.
00:17:05.130 --> 00:17:08.520
And if there's a lot of them,
it might take forever, OK?
00:17:08.520 --> 00:17:11.369
So this is-- I have only
one, two, three, four,
00:17:11.369 --> 00:17:13.109
five candidates to check.
00:17:13.109 --> 00:17:15.900
But in practice, you might have
a million of them to check.
00:17:15.900 --> 00:17:17.460
And that might take forever.
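To make the "five candidates" point concrete, here is a sketch with a hypothetical polynomial (not the one drawn in the lecture) whose derivative has five known roots. Each root is classified with the second derivative, and the local maxima are then compared by value.

```python
def h(x):
    """Hypothetical degree-6 polynomial that tends to minus infinity at both ends."""
    return -x**6 / 6 + 5 * x**4 / 4 - 2 * x**2

def h_prime(x):
    # Factors as -x (x - 1)(x + 1)(x - 2)(x + 2), so the critical points are known.
    return -x**5 + 5 * x**3 - 4 * x

def h_second(x):
    return -5 * x**4 + 15 * x**2 - 4

critical_points = [-2.0, -1.0, 0.0, 1.0, 2.0]  # zeros of h_prime

# The second-derivative test only separates local maxima from local minima.
local_maxima = [x for x in critical_points if h_second(x) < 0]
local_minima = [x for x in critical_points if h_second(x) > 0]

# Finding the global maximum still requires comparing all local maxima.
best_value = max(h(x) for x in local_maxima)
global_maximizers = [x for x in local_maxima if h(x) == best_value]
print(local_maxima, local_minima, global_maximizers)
```

Three of the five candidates pass the second-derivative test, and only a comparison of function values shows which of them are global.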
00:17:17.460 --> 00:17:21.180
OK, so what's nice about
statistical models, and one
00:17:21.180 --> 00:17:24.400
of the things that makes all
these models particularly
00:17:24.400 --> 00:17:27.792
robust, and that we
still talk about them 100
00:17:27.792 --> 00:17:29.250
years after they've
been introduced
00:17:29.250 --> 00:17:31.680
is that the functions
that-- the likelihoods
00:17:31.680 --> 00:17:33.450
that they lead
for us to maximize
00:17:33.450 --> 00:17:34.800
are actually very simple.
00:17:34.800 --> 00:17:37.090
And they all share
a nice property,
00:17:37.090 --> 00:17:40.350
which is that of being concave.
00:17:40.350 --> 00:17:42.567
All right, so what is
a concave function?
00:17:42.567 --> 00:17:44.775
Well, by definition, it's
just a function for which--
00:17:44.775 --> 00:17:47.760
let's think of it as being
twice differentiable.
00:17:47.760 --> 00:17:49.320
You can define
functions that are not
00:17:49.320 --> 00:17:51.695
differentiable as being concave,
but let's think about it
00:17:51.695 --> 00:17:53.035
as having a second derivative.
00:17:53.035 --> 00:17:54.660
And so if you look
at the function that
00:17:54.660 --> 00:17:57.480
has a second derivative,
concave are the functions
00:17:57.480 --> 00:17:59.340
that have their second
derivative that's
00:17:59.340 --> 00:18:02.060
non-positive everywhere.
00:18:02.060 --> 00:18:06.430
Not just at the
maximum, everywhere, OK?
00:18:06.430 --> 00:18:09.190
And so if it's strictly
concave, this second derivative
00:18:09.190 --> 00:18:12.280
is actually strictly
less than zero.
00:18:12.280 --> 00:18:16.110
And particularly if I
think of a linear function,
00:18:16.110 --> 00:18:19.480
y is equal to x,
then this function
00:18:19.480 --> 00:18:24.130
has its second derivative
which is equal to zero, OK?
00:18:24.130 --> 00:18:26.020
So it is concave.
00:18:26.020 --> 00:18:28.430
But it's not
strictly concave, OK?
00:18:28.430 --> 00:18:31.570
If I look at the function
which is negative x squared,
00:18:31.570 --> 00:18:33.060
what is its second derivative?
00:18:35.700 --> 00:18:36.810
Minus 2.
00:18:36.810 --> 00:18:39.810
So it's strictly
negative everywhere, OK?
00:18:39.810 --> 00:18:43.767
So actually, this is a
pretty canonical example
00:18:43.767 --> 00:18:44.850
of a strictly concave function.
00:18:44.850 --> 00:18:46.850
If you want to think of a
picture of a strictly concave
00:18:46.850 --> 00:18:48.540
function, think of
negative x squared.
00:18:48.540 --> 00:18:52.770
So parabola pointing downwards.
00:18:52.770 --> 00:18:56.980
OK, so we can talk about
strictly convex functions.
00:18:56.980 --> 00:18:59.940
So convex is just happening when
the negative of the function
00:18:59.940 --> 00:19:00.757
is concave.
00:19:00.757 --> 00:19:03.090
So that translates into having
a second derivative which
00:19:03.090 --> 00:19:05.910
is either non-negative
or positive, depending
00:19:05.910 --> 00:19:09.040
on whether you're talking about
convexity or strict convexity.
00:19:09.040 --> 00:19:11.850
But again, those
convex functions
00:19:11.850 --> 00:19:14.580
are convenient when you're
trying to minimize something.
00:19:14.580 --> 00:19:16.579
And since we're trying
to maximize the function,
00:19:16.579 --> 00:19:18.710
we're looking for concave.
00:19:18.710 --> 00:19:21.732
So here are some examples.
00:19:21.732 --> 00:19:23.190
Let's just go
through them quickly.
00:19:39.060 --> 00:19:41.830
OK, so the first one is--
00:19:41.830 --> 00:19:46.540
so here I made my
life a little uneasy
00:19:46.540 --> 00:19:49.889
by talking about the
functions in theta, right?
00:19:49.889 --> 00:19:51.430
I'm talking about
likelihoods, right?
00:19:51.430 --> 00:19:54.460
So I'm thinking of functions
where the parameter is theta.
00:19:54.460 --> 00:19:56.270
So I have h of theta.
00:19:56.270 --> 00:19:59.370
And so if I start
with theta squared,
00:19:59.370 --> 00:20:02.170
negative theta squared,
then as we said,
00:20:02.170 --> 00:20:09.490
h prime prime of theta, the
second derivative is minus 2,
00:20:09.490 --> 00:20:11.830
which is strictly negative,
so this function is strictly
00:20:11.830 --> 00:20:12.329
concave.
00:20:19.210 --> 00:20:24.400
OK, another function is
h of theta, which is--
00:20:24.400 --> 00:20:25.980
what did we pick--
00:20:25.980 --> 00:20:28.380
square root of theta.
00:20:28.380 --> 00:20:30.018
What is the first derivative?
00:20:35.760 --> 00:20:39.720
1 over 2 square root of theta.
00:20:39.720 --> 00:20:41.332
What is the second derivative?
00:20:48.220 --> 00:20:51.617
So that's theta to
the negative 1/2.
00:20:51.617 --> 00:20:53.450
So I'm just picking up
another negative 1/2,
00:20:53.450 --> 00:20:56.640
so I get negative 1/4.
00:20:56.640 --> 00:21:02.420
And then I get theta to
the 3/4 downstairs, OK?
00:21:02.420 --> 00:21:03.320
Sorry, 3/2.
00:21:09.430 --> 00:21:16.820
And that's strictly negative
for theta, say, larger than 0.
00:21:16.820 --> 00:21:20.060
And I really need to have
this thing larger than 0
00:21:20.060 --> 00:21:21.470
so that it's well-defined.
00:21:21.470 --> 00:21:24.320
But strictly larger than 0 is
so that this thing does not
00:21:24.320 --> 00:21:25.597
blow up to infinity.
00:21:25.597 --> 00:21:26.180
And it's true.
00:21:26.180 --> 00:21:30.740
If you think about this
function, it looks like this.
00:21:30.740 --> 00:21:34.940
And already, the first
derivative goes to infinity at 0.
00:21:34.940 --> 00:21:37.520
And it's a concave function, OK?
00:21:37.520 --> 00:21:39.920
Another one is the
log, of course.
00:21:44.070 --> 00:21:47.640
What is the
derivative of the log?
00:21:47.640 --> 00:21:52.710
That's 1 over theta, where h
prime of theta is 1 over theta.
00:21:52.710 --> 00:22:01.080
And the second derivative is
negative 1 over theta squared,
00:22:01.080 --> 00:22:06.210
which again, is negative if
theta is strictly positive.
00:22:06.210 --> 00:22:07.770
Here I define it as--
00:22:07.770 --> 00:22:10.310
I don't need to define it to
be strictly positive here,
00:22:10.310 --> 00:22:13.110
but I need it for the log.
00:22:13.110 --> 00:22:16.320
And sine.
00:22:16.320 --> 00:22:18.030
OK, so let's just do one more.
00:22:18.030 --> 00:22:22.555
So h of theta is sine of theta.
00:22:22.555 --> 00:22:24.180
But here I take it
only on an interval,
00:22:24.180 --> 00:22:27.540
because you want to
think of this function
00:22:27.540 --> 00:22:29.112
as pointing always downwards.
00:22:29.112 --> 00:22:31.070
And in particular, you
don't want this function
00:22:31.070 --> 00:22:32.400
to have an inflection point.
00:22:32.400 --> 00:22:34.080
You don't want it to
go down and then up
00:22:34.080 --> 00:22:37.200
and then down and then up,
because this is not concave.
00:22:37.200 --> 00:22:39.960
And so sine is certainly
going up and down, right?
00:22:39.960 --> 00:22:43.530
So what we do is we restrict
it to an interval where sine
00:22:43.530 --> 00:22:45.690
is actually-- so what does
the sine function looks
00:22:45.690 --> 00:22:47.530
like at 0, 0?
00:22:47.530 --> 00:22:48.600
And it's going up.
00:22:48.600 --> 00:22:53.110
Where is the first
maximum of the sine?
00:22:53.110 --> 00:22:54.341
STUDENT: [INAUDIBLE]
00:22:54.341 --> 00:22:55.215
PROFESSOR: I'm sorry.
00:22:55.215 --> 00:22:56.006
STUDENT: Pi over 2.
00:22:56.006 --> 00:22:59.220
PROFESSOR: Pi over 2,
where it takes value 1.
00:22:59.220 --> 00:23:01.700
And then it goes down again.
00:23:01.700 --> 00:23:04.220
And then that's at pi.
00:23:04.220 --> 00:23:05.420
And then I go down again.
00:23:05.420 --> 00:23:08.360
And here you see I actually
start changing my inflection.
00:23:08.360 --> 00:23:10.700
So what we do is
we stop it at pi.
00:23:10.700 --> 00:23:12.450
And we look at this
function, it certainly
00:23:12.450 --> 00:23:14.404
looks like a parabola
pointing downwards.
00:23:14.404 --> 00:23:16.820
And so if you look at the--
you can check that it actually
00:23:16.820 --> 00:23:17.944
works with the derivatives.
00:23:17.944 --> 00:23:22.530
So the derivative
of sine is cosine.
00:23:25.160 --> 00:23:31.850
And the derivative of
cosine is negative sine.
00:23:34.560 --> 00:23:38.170
OK, and this thing between 0
and pi is actually positive.
00:23:38.170 --> 00:23:40.730
So this entire thing is
going to be negative.
00:23:40.730 --> 00:23:41.860
OK?
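The second derivatives worked out above can be spot-checked on a grid. This is a sketch, not a proof: the grids and the dictionary layout are assumptions, and the second derivatives are entered by hand from the computations in the lecture.

```python
import math

# Each entry: (second derivative of h, grid of thetas inside the domain).
examples = {
    "-theta^2": (lambda t: -2.0, [i / 10 for i in range(-50, 51)]),
    "sqrt(theta)": (lambda t: -0.25 * t ** -1.5, [i / 10 for i in range(1, 101)]),
    "log(theta)": (lambda t: -1.0 / t ** 2, [i / 10 for i in range(1, 101)]),
    "sin(theta) on (0, pi)": (
        lambda t: -math.sin(t),
        [math.pi * i / 100 for i in range(1, 100)],
    ),
}

strictly_concave_on_grid = {
    name: all(second(t) < 0 for t in grid)
    for name, (second, grid) in examples.items()
}
print(strictly_concave_on_grid)

# Past pi the sine changes curvature: its second derivative becomes positive.
second_derivative_of_sine_past_pi = -math.sin(1.5 * math.pi)
print(second_derivative_of_sine_past_pi)
```

Every example has a strictly negative second derivative on its grid, while the last value shows why the sine has to be stopped at pi.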
00:23:41.860 --> 00:23:45.160
And you know, I can come
up with a lot of examples,
00:23:45.160 --> 00:23:46.730
but let's just stick to those.
00:23:46.730 --> 00:23:48.990
There's a linear
function, of course.
00:23:48.990 --> 00:23:51.196
And the affine function
is going to be concave,
00:23:51.196 --> 00:23:53.320
but it's actually going to
be convex as well, which
00:23:53.320 --> 00:23:55.028
means that it's
certainly not going to be
00:23:55.028 --> 00:23:58.780
strictly concave or convex, OK?
00:23:58.780 --> 00:24:01.450
So here's your standard picture.
00:24:01.450 --> 00:24:04.630
And here, if you look
at the dotted line, what
00:24:04.630 --> 00:24:07.194
it tells me is that
a concave function,
00:24:07.194 --> 00:24:08.860
and the property we're
going to be using
00:24:08.860 --> 00:24:12.910
is that if a strictly concave
function has a maximum, which
00:24:12.910 --> 00:24:15.790
is not always the case,
but if it has a maximum,
00:24:15.790 --> 00:24:18.770
then it actually must be--
sorry, a local maximum,
00:24:18.770 --> 00:24:21.350
it must be a global maximum.
00:24:21.350 --> 00:24:23.870
OK, so just the fact that
it goes up and down and not
00:24:23.870 --> 00:24:28.260
again means that there's only
one global maximum that can exist.
00:24:28.260 --> 00:24:32.480
Now if you looked, for example,
at the square root function,
00:24:32.480 --> 00:24:34.855
look at the entire
positive real line,
00:24:34.855 --> 00:24:36.980
then this thing is never
going to attain a maximum.
00:24:36.980 --> 00:24:39.362
It's just going to infinity
as x goes to infinity.
00:24:39.362 --> 00:24:40.820
So if I wanted to
find the maximum,
00:24:40.820 --> 00:24:42.590
I would have to stop
somewhere and say
00:24:42.590 --> 00:24:46.200
that the maximum is attained
at the right-hand side.
00:24:46.200 --> 00:24:49.890
OK, so that's the beauty about
convex functions or concave
00:24:49.890 --> 00:24:53.880
functions, is that
essentially, these functions
00:24:53.880 --> 00:24:55.110
are easy to maximize.
00:24:55.110 --> 00:24:57.660
And if I tell you a
function is concave,
00:24:57.660 --> 00:25:00.164
you take the first
derivative, set it equal to 0.
00:25:00.164 --> 00:25:01.830
If you find a point
that satisfies this,
00:25:01.830 --> 00:25:07.560
then it must be a
global maximum, OK?
00:25:07.560 --> 00:25:09.086
STUDENT: What if
your set theta was
00:25:09.086 --> 00:25:13.695
[INAUDIBLE] then couldn't
you have a function that,
00:25:13.695 --> 00:25:17.090
by the definition, is concave,
with two upside down parabolas
00:25:17.090 --> 00:25:22.910
at two disjoint intervals, but
yet it has two global maximums?
00:25:26.704 --> 00:25:28.120
PROFESSOR: So you
won't get them--
00:25:28.120 --> 00:25:31.030
so you want the function
to be concave on what?
00:25:31.030 --> 00:25:34.430
On the convex hull
of the intervals?
00:25:34.430 --> 00:25:35.875
Or you want it to be--
00:25:35.875 --> 00:25:38.250
STUDENT: [INAUDIBLE] just
said that any subset.
00:25:38.250 --> 00:25:40.029
PROFESSOR: OK, OK.
00:25:40.029 --> 00:25:40.570
You're right.
00:25:40.570 --> 00:25:42.060
So maybe the
definition-- so you're
00:25:42.060 --> 00:25:45.810
pointing to a weakness
in the definition.
00:25:45.810 --> 00:25:49.050
Let's just say that
theta is a convex set
00:25:49.050 --> 00:25:50.220
and then you're good, OK?
00:25:50.220 --> 00:25:51.450
So you're right.
00:25:54.210 --> 00:25:56.790
Since I actually just said that
this is true only for theta,
00:25:56.790 --> 00:25:59.280
I can just take pieces of
concave functions, right?
00:25:59.280 --> 00:26:00.990
I can do this, and
then the next one
00:26:00.990 --> 00:26:03.330
I can do this, on the
next one I can do this.
00:26:03.330 --> 00:26:05.530
And then I would
have a bunch of them.
00:26:05.530 --> 00:26:10.620
But what I want is think
of it as a global function
00:26:10.620 --> 00:26:11.590
on some convex set.
00:26:11.590 --> 00:26:13.450
You're right.
00:26:13.450 --> 00:26:14.900
So think of theta
as being convex
00:26:14.900 --> 00:26:17.560
for this guy, an interval,
if it's a real line.
00:26:20.340 --> 00:26:25.689
OK, so as I said, for
more generally-- so
00:26:25.689 --> 00:26:27.980
we can actually define concave
functions more generally
00:26:27.980 --> 00:26:29.540
in higher dimensions.
00:26:29.540 --> 00:26:32.690
And that will be useful
if theta is not just
00:26:32.690 --> 00:26:34.640
one parameter but
several parameters.
00:26:34.640 --> 00:26:39.050
And for that, you need to
remind yourself of Calculus II,
00:26:39.050 --> 00:26:42.440
and you have generalization of
the notion of derivative, which
00:26:42.440 --> 00:26:46.130
is called a gradient, which
is basically a vector where
00:26:46.130 --> 00:26:49.390
each coordinate is just the
partial derivative with respect
00:26:49.390 --> 00:26:51.170
to each coordinate of theta.
00:26:51.170 --> 00:26:54.380
And the Hessian is
the matrix, which
00:26:54.380 --> 00:26:58.020
is essentially a generalization
of the second derivative.
00:26:58.020 --> 00:27:01.220
I denote it by nabla
squared, but you
00:27:01.220 --> 00:27:02.610
can write it the way you want.
00:27:02.610 --> 00:27:07.296
And so this matrix
here is taking as entry
00:27:07.296 --> 00:27:10.970
the second partial
derivatives of h with respect
00:27:10.970 --> 00:27:12.920
to theta i and theta j.
00:27:12.920 --> 00:27:15.440
And so that's the ij-th entry.
00:27:15.440 --> 00:27:16.550
Who has never seen that?
00:27:19.400 --> 00:27:20.600
OK.
00:27:20.600 --> 00:27:27.200
So now, being concave here
is essentially generalizing,
00:27:27.200 --> 00:27:28.820
saying that a vector
is equal to zero.
00:27:28.820 --> 00:27:31.390
Well, that's just setting
the vector-- sorry.
00:27:31.390 --> 00:27:33.700
The first order condition
to say that it's a maximum
00:27:33.700 --> 00:27:34.700
is going to be the same.
00:27:34.700 --> 00:27:38.930
Saying that a function has
a gradient equal to zero
00:27:38.930 --> 00:27:43.730
is the same as saying that
each of its coordinates
00:27:43.730 --> 00:27:44.730
is equal to zero.
00:27:44.730 --> 00:27:46.521
And that's actually
going to be a condition
00:27:46.521 --> 00:27:48.560
for a global maximum here.
00:27:48.560 --> 00:27:52.190
So to check convexity, we need
to see that a matrix itself
00:27:52.190 --> 00:27:53.760
is negative.
00:27:53.760 --> 00:27:55.220
Sorry, to check
concavity, we need
00:27:55.220 --> 00:27:57.020
to check that a
matrix is negative.
00:27:57.020 --> 00:27:59.840
And there is a
notion among matrices
00:27:59.840 --> 00:28:03.320
that compares a matrix to zero,
and that's exactly this notion.
00:28:03.320 --> 00:28:06.170
You pre- and post-multiply
by the same x.
00:28:06.170 --> 00:28:08.960
So that works for
symmetric matrices,
00:28:08.960 --> 00:28:10.590
which is the case here.
00:28:10.590 --> 00:28:13.940
And so you pre-multiply by x,
post-multiply by the same x.
00:28:13.940 --> 00:28:15.930
So you have your matrix,
your Hessian here.
00:28:20.630 --> 00:28:24.030
It's a d by d matrix if you
have a d-dimensional theta.
00:28:24.030 --> 00:28:26.900
So let's call it--
00:28:26.900 --> 00:28:27.400
OK.
00:28:27.400 --> 00:28:31.150
And then here I
pre-multiply by x transpose.
00:28:31.150 --> 00:28:34.330
I post-multiply by x.
00:28:34.330 --> 00:28:38.470
And this has to be non-positive
if I want it to be concave,
00:28:38.470 --> 00:28:42.850
and strictly negative if I
want it to be strictly concave.
00:28:42.850 --> 00:28:44.740
OK, that's just a
real generalization.
00:28:44.740 --> 00:28:47.050
You can check for yourself
that this is the same thing.
00:28:47.050 --> 00:28:49.760
If I were in dimension 1,
this would be the same thing.
00:28:49.760 --> 00:28:50.410
Why?
00:28:50.410 --> 00:28:53.380
Because in dimension 1, pre-
and post-multiplying by x
00:28:53.380 --> 00:28:55.840
is the same as
multiplying by x squared.
00:28:55.840 --> 00:28:58.820
Because in dimension 1, I can
just move my x's around, right?
00:28:58.820 --> 00:29:01.180
And so that would just
mean the first condition
00:29:01.180 --> 00:29:04.930
would mean in dimension 1 that
the second derivative times x
00:29:04.930 --> 00:29:11.110
squared has to be less
than or equal to zero.
00:29:11.110 --> 00:29:14.371
So here I need this for
all x's that are not zero,
00:29:14.371 --> 00:29:16.870
because I can take x to be zero
and make this equal to zero,
00:29:16.870 --> 00:29:17.370
right?
00:29:17.370 --> 00:29:21.640
So this is for x's that
are not equal to zero, OK?
00:29:21.640 --> 00:29:25.720
And so some examples.
00:29:25.720 --> 00:29:27.340
Just look at this function.
00:29:27.340 --> 00:29:29.830
So now I have functions that
depend on two parameters,
00:29:29.830 --> 00:29:31.600
theta1 and theta2.
00:29:31.600 --> 00:29:33.130
So the first one is--
00:29:33.130 --> 00:29:36.460
so if I take theta
to be equal to--
00:29:36.460 --> 00:29:39.010
now I need two
parameters, so R squared.
00:29:39.010 --> 00:29:42.670
And I look at the function,
which is h of theta.
00:29:42.670 --> 00:29:45.266
Can somebody tell me
what h of theta is?
00:29:45.266 --> 00:29:49.530
STUDENT: [INAUDIBLE]
00:29:49.530 --> 00:29:52.040
PROFESSOR: Minus
2 theta2 squared?
00:29:52.040 --> 00:30:00.920
OK, so let's compute the
gradient of h of theta.
00:30:00.920 --> 00:30:04.210
So it's going to be something
that has two coordinates.
00:30:04.210 --> 00:30:06.064
To get the first
coordinate, what do I do?
00:30:06.064 --> 00:30:07.730
Well, I take the
derivative with respect
00:30:07.730 --> 00:30:10.230
to theta1, thinking of
theta2 as being a constant.
00:30:10.230 --> 00:30:11.750
So this thing is
going to go away.
00:30:11.750 --> 00:30:14.189
And so I get negative 2 theta1.
00:30:14.189 --> 00:30:15.980
And when I take the
derivative with respect
00:30:15.980 --> 00:30:18.620
to the second part, thinking
of this part as being constant,
00:30:18.620 --> 00:30:21.490
I get minus 4 theta2.
00:30:24.560 --> 00:30:26.970
That clear for everyone?
00:30:26.970 --> 00:30:29.455
That's just the definition
of partial derivatives.
00:30:32.430 --> 00:30:40.880
And then if I want
to do the Hessian,
00:30:40.880 --> 00:30:42.860
so now I'm going to
get a 2 by 2 matrix.
00:30:45.690 --> 00:30:48.650
The first guy here, I take
the first-- so this guy
00:30:48.650 --> 00:30:51.480
I get by taking the derivative
of this guy with respect
00:30:51.480 --> 00:30:52.550
to theta1.
00:30:52.550 --> 00:30:53.380
So that's easy.
00:30:53.380 --> 00:30:55.152
So that's just minus 2.
00:30:55.152 --> 00:30:56.610
This guy I get by
taking derivative
00:30:56.610 --> 00:30:58.530
of this guy with
respect to theta2.
00:30:58.530 --> 00:31:00.341
So I get what?
00:31:00.341 --> 00:31:00.840
Zero.
00:31:00.840 --> 00:31:03.234
I treat this guy as
being a constant.
00:31:03.234 --> 00:31:04.650
This guy is also
going to be zero,
00:31:04.650 --> 00:31:06.990
because I take the derivative
of this guy with respect
00:31:06.990 --> 00:31:08.269
to theta1.
00:31:08.269 --> 00:31:10.560
And then I take the derivative
of this guy with respect
00:31:10.560 --> 00:31:14.310
to theta2, so I get minus 4.
00:31:14.310 --> 00:31:19.220
OK, so now I want to check
that this matrix satisfies
00:31:19.220 --> 00:31:21.210
x transpose--
00:31:21.210 --> 00:31:24.690
this matrix x is negative.
00:31:24.690 --> 00:31:25.920
So what I do is--
00:31:25.920 --> 00:31:27.360
so what is x transpose x?
00:31:27.360 --> 00:31:33.810
So if I do x transpose nabla
squared h theta x, what I get
00:31:33.810 --> 00:31:42.570
is minus 2 x1 squared
minus 4 x2 squared.
00:31:42.570 --> 00:31:45.990
Because this matrix is diagonal,
so all it does is just weights
00:31:45.990 --> 00:31:47.920
the square of the x's.
00:31:47.920 --> 00:31:51.270
So this guy is
definitely negative.
00:31:51.270 --> 00:31:53.580
This guy is negative.
00:31:53.580 --> 00:31:56.070
And actually, if one
of the two is non-zero,
00:31:56.070 --> 00:31:58.050
which means that x is
non-zero, then this thing
00:31:58.050 --> 00:32:00.240
is actually strictly negative.
00:32:00.240 --> 00:32:02.600
So this function is
actually strictly concave.
00:32:05.380 --> 00:32:07.730
And it looks like a
parabola that's slightly
00:32:07.730 --> 00:32:09.499
distorted in one direction.
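A minimal sketch of the check just carried out, assuming the function on the board is h(theta) = -theta1^2 - 2 theta2^2. The helper names (`gradient`, `quadratic_form`, `HESSIAN`) are illustrative only and do not come from the lecture. It evaluates x transpose Hessian x on a few non-zero vectors and confirms each value is strictly negative.

```python
def h(t1, t2):
    """Example function: h(theta) = -theta1^2 - 2*theta2^2."""
    return -t1**2 - 2.0 * t2**2


def gradient(t1, t2):
    """Partial derivatives of h with respect to theta1 and theta2."""
    return (-2.0 * t1, -4.0 * t2)


# The Hessian is constant here: second partials of h.
HESSIAN = ((-2.0, 0.0),
           (0.0, -4.0))


def quadratic_form(H, x):
    """Compute x^T H x for a 2x2 matrix H and a length-2 vector x."""
    return sum(x[i] * H[i][j] * x[j] for i in range(2) for j in range(2))


# A few non-zero test vectors (a spot check, not a proof).
test_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0), (-3.0, 2.0)]
values = [quadratic_form(HESSIAN, x) for x in test_vectors]
print(values)
```

Because the matrix is diagonal, each value is just -2 x1^2 - 4 x2^2, which is why every entry comes out negative.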
00:32:15.730 --> 00:32:21.257
So well, I know this might
have been some time ago.
00:32:21.257 --> 00:32:23.590
Maybe for some of you might
have been since high school.
00:32:23.590 --> 00:32:27.360
So just remind yourself of doing
second derivatives and Hessians
00:32:27.360 --> 00:32:29.710
and things like this.
00:32:29.710 --> 00:32:32.920
Here's another one
as an exercise.
00:32:32.920 --> 00:32:36.660
h is minus the square of
theta1 minus theta2.
00:32:36.660 --> 00:32:44.100
So this one is going to
actually not be diagonal.
00:32:44.100 --> 00:32:46.630
The Hessian is not
going to be diagonal.
00:32:46.630 --> 00:32:50.660
Who would like to do
this now in class?
00:32:50.660 --> 00:32:51.800
OK, thank you.
00:32:51.800 --> 00:32:53.730
This is not a calculus class.
00:32:53.730 --> 00:32:56.090
So you can just do it
as a calculus exercise.
00:32:56.090 --> 00:32:58.110
And you can do it
for log as well.
00:32:58.110 --> 00:33:01.100
Now, there is a nice
recipe for concavity
00:33:01.100 --> 00:33:05.111
that works for the second
one and the third one.
00:33:05.111 --> 00:33:07.610
And the thing is, if you look
at those particular functions,
00:33:07.610 --> 00:33:11.360
what I'm doing is taking, first
of all, a linear combination
00:33:11.360 --> 00:33:13.040
of my arguments.
00:33:13.040 --> 00:33:15.890
And then I take a concave
function of this guy.
00:33:15.890 --> 00:33:18.350
And this is always
going to work.
00:33:18.350 --> 00:33:20.930
This is always going to
give me a concave function.
00:33:20.930 --> 00:33:22.841
So the computations
that I just made,
00:33:22.841 --> 00:33:24.840
I actually never made
them when I prepared those
00:33:24.840 --> 00:33:26.132
slides because I don't have to.
00:33:26.132 --> 00:33:28.548
I know that if I take a linear
combination of those things
00:33:28.548 --> 00:33:30.650
and then I take a concave
function of this guy,
00:33:30.650 --> 00:33:33.750
I'm always going to
get a concave function.
00:33:33.750 --> 00:33:39.410
OK, so that's an easy way to
check this, or at least as
00:33:39.410 --> 00:33:42.520
a sanity check.
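For readers who want to do the exercise by hand anyway, here is a worked version, assuming the second function is h(theta) = -(theta1 - theta2)^2 (that reading of the slide is an assumption). It shows a Hessian that is not diagonal and a quadratic form that is non-positive but can vanish.

```latex
\begin{aligned}
h(\theta) &= -(\theta_1 - \theta_2)^2, \\
\nabla h(\theta) &= \begin{pmatrix} -2(\theta_1 - \theta_2) \\ \phantom{-}2(\theta_1 - \theta_2) \end{pmatrix}, \qquad
\nabla^2 h(\theta) = \begin{pmatrix} -2 & \phantom{-}2 \\ \phantom{-}2 & -2 \end{pmatrix}, \\
x^\top \nabla^2 h(\theta)\, x &= -2x_1^2 + 4x_1x_2 - 2x_2^2 = -2(x_1 - x_2)^2 \le 0 .
\end{aligned}
```

The quadratic form equals zero whenever x1 = x2, even for non-zero x, so under this reading the function is concave but not strictly concave.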
00:33:42.520 --> 00:33:48.250
All right, and so as I said,
finding maximizers of concave
00:33:48.250 --> 00:33:50.380
or strictly concave
function is the same
00:33:50.380 --> 00:33:52.870
as it was in the
one-dimensional case.
00:33:52.870 --> 00:33:55.052
What I do-- sorry, in
the one-dimensional case,
00:33:55.052 --> 00:33:57.010
we just agreed that we
just take the derivative
00:33:57.010 --> 00:33:58.077
and set it to zero.
00:33:58.077 --> 00:34:00.160
In the high dimensional
case, we take the gradient
00:34:00.160 --> 00:34:01.270
and set it equal to zero.
00:34:01.270 --> 00:34:04.300
Again, that's
calculus, all right?
00:34:04.300 --> 00:34:07.930
So it turns out that
so this is going
00:34:07.930 --> 00:34:09.489
to give me equations, right?
00:34:09.489 --> 00:34:11.530
The first one is an
equation in theta.
00:34:11.530 --> 00:34:15.040
The second one is an equation
in theta1, theta2, theta3,
00:34:15.040 --> 00:34:16.734
all the way to theta d.
00:34:16.734 --> 00:34:19.150
And it doesn't mean that because
I can write this equation
00:34:19.150 --> 00:34:21.130
that I can actually solve it.
00:34:21.130 --> 00:34:23.110
This equation might
be super nasty.
00:34:23.110 --> 00:34:28.929
It might be like some polynomial
and exponentials and logs equal
00:34:28.929 --> 00:34:31.219
zero, or some crazy thing.
00:34:31.219 --> 00:34:36.620
And so there's actually,
for a concave function,
00:34:36.620 --> 00:34:38.760
since we know there's
a unique maximizer,
00:34:38.760 --> 00:34:42.780
there's this theory of convex
optimization, which really,
00:34:42.780 --> 00:34:44.909
since those books are
talking about minimizing,
00:34:44.909 --> 00:34:46.620
you had to find some
sort of direction.
00:34:46.620 --> 00:34:50.280
But you can think of it as the
theory of concave maximization.
00:34:50.280 --> 00:34:54.060
And they allow you to
find algorithms to solve
00:34:54.060 --> 00:34:57.670
this numerically and
fairly efficiently.
00:34:57.670 --> 00:34:58.800
OK, that means fast.
00:34:58.800 --> 00:35:01.099
Even if d is of
size 10,000, you're
00:35:01.099 --> 00:35:02.640
going to wait for
one second and it's
00:35:02.640 --> 00:35:05.130
going to tell you
what the maximum is.
00:35:05.130 --> 00:35:06.914
And that's what machine
learning is about.
00:35:06.914 --> 00:35:08.830
If you've taken any class
on machine learning,
00:35:08.830 --> 00:35:11.163
there's a lot of optimization,
because they have really,
00:35:11.163 --> 00:35:13.850
really big problems to solve.
00:35:13.850 --> 00:35:15.470
Often in this
class, since this is
00:35:15.470 --> 00:35:19.460
more introductory statistics,
we will have a closed form.
00:35:19.460 --> 00:35:21.250
For the maximum
likelihood estimator
00:35:21.250 --> 00:35:25.240
we will be saying theta hat
equals, say, x bar,
00:35:25.240 --> 00:35:28.150
and that will be the maximum
likelihood estimator.
00:35:28.150 --> 00:35:34.310
So just why-- so has anybody
seen convex optimization
00:35:34.310 --> 00:35:36.950
before?
00:35:36.950 --> 00:35:38.830
So let me just give
you an intuition
00:35:38.830 --> 00:35:43.690
why those functions are easy
to maximize or to minimize.
00:35:43.690 --> 00:35:46.990
In one dimension, it's actually
very easy for you to see that.
00:35:50.540 --> 00:35:52.550
And the reason is this.
00:35:52.550 --> 00:35:57.110
If I want to maximize the
concave function, what
00:35:57.110 --> 00:35:59.780
I need to do is to be
able to query a point
00:35:59.780 --> 00:36:04.080
and get as an answer the
derivative of this function,
00:36:04.080 --> 00:36:04.791
OK?
00:36:04.791 --> 00:36:07.040
So now I said this is the
function I want to optimize,
00:36:07.040 --> 00:36:13.410
and I've been running my
algorithm for 5/10 of a second.
00:36:13.410 --> 00:36:15.650
And it's at this point here.
00:36:15.650 --> 00:36:17.214
OK, that's the candidate.
00:36:17.214 --> 00:36:19.130
Now, what I can ask is,
what is the derivative
00:36:19.130 --> 00:36:21.051
of my function here?
00:36:21.051 --> 00:36:22.550
Well, it's going
to give me a value.
00:36:22.550 --> 00:36:26.600
And this value is going to
be either negative, positive,
00:36:26.600 --> 00:36:27.246
or zero.
00:36:27.246 --> 00:36:28.620
Well, if it's
zero, that's great.
00:36:28.620 --> 00:36:30.411
That means I'm here
and I can just go home.
00:36:30.411 --> 00:36:31.679
I've solved my problem.
00:36:31.679 --> 00:36:33.470
I know there's a unique
maximum, and that's
00:36:33.470 --> 00:36:34.760
what I wanted to find.
00:36:34.760 --> 00:36:37.340
If it's positive,
it actually tells me
00:36:37.340 --> 00:36:41.470
that I'm on the left
of the optimizer.
00:36:41.470 --> 00:36:43.520
And on the left of
the optimal value.
00:36:43.520 --> 00:36:47.270
And if it's negative,
it means that I'm
00:36:47.270 --> 00:36:50.370
at the right of the
value I'm looking for.
00:36:50.370 --> 00:36:53.600
And so most of the convex
optimization methods
00:36:53.600 --> 00:36:56.780
basically tell you, well,
if you query the derivative
00:36:56.780 --> 00:37:00.390
and it's actually positive,
move to the right.
00:37:00.390 --> 00:37:02.430
And if it's negative,
move to the left.
00:37:02.430 --> 00:37:07.280
Now, by how much you
move is basically, well,
00:37:07.280 --> 00:37:09.020
why people write books.
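One simple answer to "by how much", sketched here as bisection on the sign of the derivative. This is an illustration, not the method of any particular book. The function name `maximize_concave_1d` and the example derivative are hypothetical, and the sketch assumes the derivative is positive at `left` and negative at `right`.

```python
def maximize_concave_1d(derivative, left, right, tol=1e-8):
    """Locate the maximizer of a concave function on [left, right].

    Assumes derivative(left) > 0 > derivative(right), so the maximizer
    lies strictly inside the interval.
    """
    while right - left > tol:
        mid = 0.5 * (left + right)
        slope = derivative(mid)
        if slope == 0.0:
            return mid          # derivative is zero: this is the maximizer
        if slope > 0.0:
            left = mid          # we are left of the maximizer: move right
        else:
            right = mid         # we are right of the maximizer: move left
    return 0.5 * (left + right)


# Hypothetical example: h(theta) = -(theta - 3)^2, so h'(theta) = -2 (theta - 3).
theta_hat = maximize_concave_1d(lambda t: -2.0 * (t - 3.0), left=-10.0, right=10.0)
print(theta_hat)
```

Each query of the derivative halves the interval that can contain the maximizer, which is the one-dimensional version of "do not go backwards".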
00:37:09.020 --> 00:37:13.400
And in higher dimension, it's
a little more complicated,
00:37:13.400 --> 00:37:16.260
because in higher dimension,
think about two dimensions,
00:37:16.260 --> 00:37:21.940
then I'm only
able to get a vector.
00:37:21.940 --> 00:37:24.320
And the vector is only
telling me, well, here
00:37:24.320 --> 00:37:26.579
is half of the space
in which you can move.
00:37:26.579 --> 00:37:28.370
Now here, if you tell
me move to the right,
00:37:28.370 --> 00:37:30.620
I know exactly which direction
I'm going to have to move.
00:37:30.620 --> 00:37:32.036
But in two dimension,
you're going
00:37:32.036 --> 00:37:37.160
to basically tell me, well,
move in this global direction.
00:37:37.160 --> 00:37:40.190
And so, of course, I know
there's a line on the floor I
00:37:40.190 --> 00:37:42.140
cannot move behind.
00:37:42.140 --> 00:37:45.350
But even if you tell me,
draw a line on the floor
00:37:45.350 --> 00:37:47.720
and move only to that
side of the line,
00:37:47.720 --> 00:37:50.840
then there's many directions
in that line that I can go to.
00:37:50.840 --> 00:37:53.870
And that's also why
there's lots of things
00:37:53.870 --> 00:37:55.870
you can do in optimization.
00:37:55.870 --> 00:38:00.990
OK, but still, putting this
line on the floor is telling me,
00:38:00.990 --> 00:38:02.167
do not go backwards.
00:38:02.167 --> 00:38:03.250
And that's very important.
00:38:03.250 --> 00:38:04.791
It's just telling
you which direction
00:38:04.791 --> 00:38:07.470
I should be going always, OK?
00:38:07.470 --> 00:38:11.310
All right, so that's
what's behind this notion
00:38:11.310 --> 00:38:14.490
of gradient descent
algorithm, steepest descent.
00:38:14.490 --> 00:38:17.940
Or steepest ascent, actually,
if we're trying to maximize.
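A minimal sketch of steepest ascent in two dimensions, assuming a fixed step size and reusing the earlier example h(theta) = -theta1^2 - 2 theta2^2. The step size, iteration count, and starting point are arbitrary choices for illustration, not values from the lecture.

```python
def gradient_ascent(grad, theta, step=0.1, iterations=200):
    """Repeatedly move in the direction of the gradient with a fixed step."""
    for _ in range(iterations):
        g = grad(theta)
        theta = tuple(t + step * gi for t, gi in zip(theta, g))
    return theta


def grad_h(theta):
    """Gradient of h(theta) = -theta1^2 - 2*theta2^2."""
    t1, t2 = theta
    return (-2.0 * t1, -4.0 * t2)


theta_hat = gradient_ascent(grad_h, theta=(5.0, -3.0))
print(theta_hat)
```

With this step size each coordinate shrinks by a constant factor per iteration, so the iterates approach the unique maximizer at the origin. A step that is too large would overshoot, which is part of what the optimization literature is about.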
00:38:17.940 --> 00:38:22.150
OK, so let's move on.
00:38:22.150 --> 00:38:26.400
So this course is not about
optimization, all right?
00:38:26.400 --> 00:38:30.690
So as I said, the
likelihood was this guy.
00:38:30.690 --> 00:38:32.532
The product of f of the xi's.
00:38:32.532 --> 00:38:33.990
And one way you
can view this is just
00:38:33.990 --> 00:38:39.060
basically as the joint distribution
of my data at the point theta.
00:38:39.060 --> 00:38:41.160
So now the likelihood,
formally-- so here
00:38:41.160 --> 00:38:44.760
I am giving myself
the model E, P theta.
00:38:44.760 --> 00:38:48.120
And here I'm going to
assume that E is discrete
00:38:48.120 --> 00:38:49.740
so that I can talk about PMFs.
00:38:49.740 --> 00:38:51.840
But everything
you're doing, just
00:38:51.840 --> 00:38:55.080
redo for the sake of yourself
by replacing PMFs by PDFs,
00:38:55.080 --> 00:38:56.680
and everything's
going to be fine.
00:38:56.680 --> 00:38:58.260
We'll do it in a second.
00:38:58.260 --> 00:39:02.470
All right, so the
likelihood of the model.
00:39:02.470 --> 00:39:05.552
So here I'm not looking at
the likelihood of a parameter.
00:39:05.552 --> 00:39:07.260
I'm looking at the
likelihood of a model.
00:39:07.260 --> 00:39:09.234
So it's actually a
function of the parameter.
00:39:09.234 --> 00:39:10.650
And actually, I'm
going to make it
00:39:10.650 --> 00:39:14.130
even a function of
the points x1 to xn.
00:39:14.130 --> 00:39:15.760
All right, so I have a function.
00:39:15.760 --> 00:39:18.070
And what it takes as
input is all the points x1
00:39:18.070 --> 00:39:22.062
to xn and a candidate
parameter theta.
00:39:22.062 --> 00:39:22.770
Not the true one.
00:39:22.770 --> 00:39:23.989
A candidate.
00:39:23.989 --> 00:39:25.530
And what I'm going
to do is I'm going
00:39:25.530 --> 00:39:28.530
to look at the probability
that my random variables
00:39:28.530 --> 00:39:29.970
under this
distribution, p theta,
00:39:29.970 --> 00:39:34.630
take these exact
values, x1, x2, ..., xn.
00:39:34.630 --> 00:39:40.290
Now remember, if my
data was independent,
00:39:40.290 --> 00:39:43.200
then I could actually just
say that the probability
00:39:43.200 --> 00:39:45.960
of this intersection is just a
product of the probabilities.
00:39:45.960 --> 00:39:48.790
And it would look
something like this.
00:39:48.790 --> 00:39:50.790
But I can define likelihood
even if I don't have
00:39:50.790 --> 00:39:52.865
independent random variables.
00:39:52.865 --> 00:39:54.490
But think of them as
being independent,
00:39:54.490 --> 00:39:57.550
because that's all we're going
to encounter in this class, OK?
00:39:57.550 --> 00:40:00.380
I just want you to be aware that
if I had dependent variables,
00:40:00.380 --> 00:40:02.089
I could still define
the likelihood.
00:40:02.089 --> 00:40:04.630
I would have to understand how
to compute these probabilities
00:40:04.630 --> 00:40:08.270
there to be able to compute it.
00:40:08.270 --> 00:40:11.000
OK, so think of
Bernoullis, for example.
00:40:11.000 --> 00:40:12.985
So here is my example
of a Bernoulli.
00:40:16.570 --> 00:40:18.650
So my parameter is--
00:40:18.650 --> 00:40:25.211
so my model is {0,1}, Bernoulli p.
00:40:25.211 --> 00:40:31.790
p is in the interval 0,1.
00:40:31.790 --> 00:40:35.917
The probability, just
as a side remark,
00:40:35.917 --> 00:40:38.000
I'm just going to use the
fact that I can actually
00:40:38.000 --> 00:40:41.840
write the PMF of a Bernoulli
in a very concise form, right?
00:40:41.840 --> 00:40:43.970
If I ask you what the
PMF of a Bernoulli is,
00:40:43.970 --> 00:40:46.500
you could tell me, well,
the probability that x--
00:40:46.500 --> 00:40:50.720
so under p, the probability that
x is equal to 0 is 1 minus p.
00:40:50.720 --> 00:40:57.230
The probability under p that
x is equal to 1 is equal to p.
00:40:57.230 --> 00:41:01.790
But I can be a bit smart and
say that for any X that's
00:41:01.790 --> 00:41:04.910
either 0 or 1, the
probability under p
00:41:04.910 --> 00:41:07.610
that X is equal to
little x, I can write it
00:41:07.610 --> 00:41:14.150
in a compact form as p to the
x, times 1 minus p to the 1 minus x.
00:41:14.150 --> 00:41:17.570
And you can check that this is
the right form because, well,
00:41:17.570 --> 00:41:20.910
you have to check it only
for two values of X, 0 and 1.
00:41:20.910 --> 00:41:23.350
And if you plug in 1,
you only keep the p.
00:41:23.350 --> 00:41:27.840
If you plug in 0, you
only keep the 1 minus p.
00:41:27.840 --> 00:41:31.190
And that's just a trick, OK?
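A minimal sketch of this compact form, checked at the only two values x can take. The function name `bernoulli_pmf` is chosen here for illustration.

```python
def bernoulli_pmf(x, p):
    """Compact Bernoulli PMF: p^x * (1 - p)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p**x * (1.0 - p)**(1 - x)


p = 0.3
# Plugging in 1 keeps only p; plugging in 0 keeps only 1 - p.
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))
```

The exponent form is the convenient one because a product of such terms over a sample collapses into sums in the exponents.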
00:41:31.190 --> 00:41:34.350
I could have gone
with many other ways.
00:41:34.350 --> 00:41:35.940
Agreed?
00:41:35.940 --> 00:41:39.342
I could have said,
actually, something like--
00:41:39.342 --> 00:41:41.550
another one would be-- which
we are not going to use,
00:41:41.550 --> 00:41:47.340
but we could say, well, it's
x p plus 1 minus x, times 1 minus
00:41:47.340 --> 00:41:47.850
p, right?
00:41:50.680 --> 00:41:53.160
That's another one.
00:41:53.160 --> 00:41:56.057
But this one is going
to be convenient.
00:41:56.057 --> 00:41:57.640
So forget about this
guy for a second.
00:42:02.750 --> 00:42:05.450
So now, I said that
the likelihood is just
00:42:05.450 --> 00:42:12.380
this function that's computing
the probability that X1
00:42:12.380 --> 00:42:15.050
is equal to little x1.
00:42:15.050 --> 00:42:27.950
So likelihood is L of X1, Xn.
00:42:27.950 --> 00:42:30.140
So let me try to make
those calligraphic so you
00:42:30.140 --> 00:42:33.140
know that I'm talking about
smaller values, right?
00:42:33.140 --> 00:42:35.010
Small x's.
00:42:35.010 --> 00:42:38.840
x1, xn, and then of course p.
00:42:38.840 --> 00:42:40.284
Sometimes we even put--
00:42:40.284 --> 00:42:42.200
I didn't do it, but
sometimes you can actually
00:42:42.200 --> 00:42:46.640
put a semicolon here, semicolon
so you know that those two
00:42:46.640 --> 00:42:48.860
things are treated differently.
00:42:48.860 --> 00:42:51.570
And so now you have this
thing is equal to what?
00:42:51.570 --> 00:42:54.440
Well, it's just the
probability under p
00:42:54.440 --> 00:42:59.990
that X1 is little x1 all
the way to Xn is little xn.
00:42:59.990 --> 00:43:02.064
OK, that's just the definition.
00:43:06.910 --> 00:43:11.590
All right, so now
let's start working.
00:43:11.590 --> 00:43:13.240
So we write the
definition, and then we
00:43:13.240 --> 00:43:16.030
want to make it look like
something we would potentially
00:43:16.030 --> 00:43:17.902
be able to maximize if I were--
00:43:17.902 --> 00:43:20.235
like if I take the derivative
of this with respect to p,
00:43:20.235 --> 00:43:22.570
it's not very helpful
because I just don't know.
00:43:22.570 --> 00:43:26.770
Just want the algebraic
function of p.
00:43:26.770 --> 00:43:28.580
So this thing is going
to be equal to what?
00:43:28.580 --> 00:43:30.413
Well, what is the first
thing I want to use?
00:43:32.740 --> 00:43:35.350
I have a probability of
an intersection of events,
00:43:35.350 --> 00:43:39.630
so it's just the product
of the probabilities.
00:43:39.630 --> 00:43:44.396
So this is the product from
i equal 1 to n of P sub p that
00:43:44.396 --> 00:43:47.970
Xi is equal to little xi.
00:43:47.970 --> 00:43:49.858
That's independence.
00:43:54.070 --> 00:43:58.690
OK, now, I'm starting to mean
business, because for each P,
00:43:58.690 --> 00:44:00.370
we have a closed form, right?
00:44:00.370 --> 00:44:03.910
I wrote this as this
supposedly convenient form.
00:44:03.910 --> 00:44:06.470
I still have to reveal to
you why it's convenient.
00:44:06.470 --> 00:44:09.640
So this thing is equal to--
00:44:09.640 --> 00:44:15.090
well, we said that that
was p to the little xi, times
00:44:15.090 --> 00:44:20.240
1 minus p to the 1 minus xi, OK?
00:44:22.960 --> 00:44:26.650
So that was just what I wrote
over there as the probability
00:44:26.650 --> 00:44:29.540
that Xi is equal to little xi.
00:44:29.540 --> 00:44:32.780
And since they all have
the same parameter p, just
00:44:32.780 --> 00:44:34.280
have this p that shows up here.
00:44:38.140 --> 00:44:41.230
And so now I'm just taking
the products of something
00:44:41.230 --> 00:44:45.160
to the xi, so it's this
thing to the sum of the xi's.
00:44:45.160 --> 00:44:48.090
Everybody agrees with this?
00:44:48.090 --> 00:44:56.360
So this is equal to p to the
sum of the xi, times 1 minus p
00:44:56.360 --> 00:44:58.180
to the n minus sum of the xi.
00:45:10.180 --> 00:45:13.300
If you don't feel comfortable
with writing it directly,
00:45:13.300 --> 00:45:15.520
you can observe
that this thing here
00:45:15.520 --> 00:45:22.170
is actually equal to p over
1 minus p to the xi times 1
00:45:22.170 --> 00:45:26.022
minus p, OK?
00:45:26.022 --> 00:45:27.480
So now when I take
the product, I'm
00:45:27.480 --> 00:45:28.938
getting the products
of those guys.
00:45:28.938 --> 00:45:31.380
So it's just this guy
to the power of sum
00:45:31.380 --> 00:45:33.570
and this guy to the power n.
00:45:33.570 --> 00:45:39.670
And then I can rewrite
it like this if I want to.
00:45:39.670 --> 00:45:42.720
And so now-- well,
that's what we have here.
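A minimal sketch comparing the term-by-term product with the collapsed form p to the sum of the xi, times 1 minus p to the n minus the sum of the xi. The function names and the sample are made up for illustration.

```python
import math


def bernoulli_likelihood_product(xs, p):
    """Likelihood as a product over observations (uses independence)."""
    return math.prod(p**x * (1.0 - p)**(1 - x) for x in xs)


def bernoulli_likelihood_closed_form(xs, p):
    """Same likelihood written as p^(sum x_i) * (1 - p)^(n - sum x_i)."""
    n, s = len(xs), sum(xs)
    return p**s * (1.0 - p)**(n - s)


sample = [1, 0, 1, 1, 0]  # hypothetical observations
for p in (0.2, 0.5, 0.8):
    print(p,
          bernoulli_likelihood_product(sample, p),
          bernoulli_likelihood_closed_form(sample, p))
```

The two columns agree up to floating-point rounding, and the closed form depends on the data only through n and the sum of the xi.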
00:45:42.720 --> 00:45:45.750
And now I am in business
because I can still
00:45:45.750 --> 00:45:48.750
hope to maximize this function.
00:45:48.750 --> 00:45:50.679
And how to maximize
this function?
00:45:50.679 --> 00:45:52.470
All I have to do is to
take the derivative.
00:45:52.470 --> 00:45:54.710
Do you want to do it?
00:45:54.710 --> 00:45:56.502
Let's just take
the derivative, OK?
00:45:56.502 --> 00:45:58.960
Sorry, I didn't tell you that,
well, the maximum likelihood
00:45:58.960 --> 00:46:01.700
principle is to just maxim-- the
idea is to maximize this thing,
00:46:01.700 --> 00:46:02.200
OK?
00:46:02.200 --> 00:46:04.310
But I'm not going to
get there right now.
00:46:04.310 --> 00:46:08.810
OK, so let's do it maybe for
the Poisson model for a second.
00:46:08.810 --> 00:46:16.910
So if you want to do it
for the Poisson model,
00:46:16.910 --> 00:46:18.380
let's write the likelihood.
00:46:18.380 --> 00:46:20.020
So right now I'm
not doing anything.
00:46:20.020 --> 00:46:21.010
I'm not maximizing.
00:46:21.010 --> 00:46:24.040
I'm just computing the
likelihood function.
00:46:29.640 --> 00:46:32.470
OK, so the likelihood
function for Poisson.
00:46:32.470 --> 00:46:36.710
So now I know-- what is my
sample space for Poisson?
00:46:36.710 --> 00:46:38.140
STUDENT: Positives.
00:46:38.140 --> 00:46:41.170
PROFESSOR: Non-negative integers.
00:46:41.170 --> 00:46:45.220
And well, let me
write it like this.
00:46:45.220 --> 00:46:51.170
Poisson lambda, and I'm going
to take lambda to be positive.
00:46:51.170 --> 00:46:53.560
And so that means that the
probability under lambda
00:46:53.560 --> 00:46:57.920
that X is equal to little
x in the sample space
00:46:57.920 --> 00:47:01.030
is lambda to the
X over factorial x
00:47:01.030 --> 00:47:03.130
e to the minus lambda.
00:47:03.130 --> 00:47:05.530
So that's basically the
same as the compact form
00:47:05.530 --> 00:47:06.740
that I wrote over there.
00:47:06.740 --> 00:47:08.860
It's just now a different one.
00:47:08.860 --> 00:47:12.340
And so when I want to
write my likelihood, again,
00:47:12.340 --> 00:47:13.500
we said little x's.
00:47:17.050 --> 00:47:18.390
This is equal to what?
00:47:18.390 --> 00:47:23.690
Well, it's equal to the
probability under lambda
00:47:23.690 --> 00:47:31.796
that X1 is little
x1, Xn is little xn,
00:47:31.796 --> 00:47:33.045
which is equal to the product.
00:47:40.950 --> 00:47:42.720
OK?
00:47:42.720 --> 00:47:45.270
Just by independence.
00:47:45.270 --> 00:47:47.640
And now I can write those
guys as being-- each
00:47:47.640 --> 00:47:52.080
of them being i equal 1 to n.
00:47:52.080 --> 00:47:56.100
So this guy is just this
thing where I plug in xi.
00:47:56.100 --> 00:48:05.540
So I get lambda to the xi
divided by xi factorial times e
00:48:05.540 --> 00:48:10.660
to the minus lambda, OK?
00:48:10.660 --> 00:48:13.709
And now, I mean, this
guy is going to be nice.
00:48:13.709 --> 00:48:15.250
This guy is not
going to be too nice.
00:48:15.250 --> 00:48:16.570
But let's write it.
00:48:16.570 --> 00:48:18.820
When I'm going to take the
product of those guys here,
00:48:18.820 --> 00:48:21.910
I'm going to pick up lambda
to the sum of the xi's.
00:48:21.910 --> 00:48:23.470
Here I'm going to
pick up exponential
00:48:23.470 --> 00:48:25.334
minus n times lambda.
00:48:25.334 --> 00:48:27.250
And here I'm going to
pick up just the product
00:48:27.250 --> 00:48:29.200
of the factorials.
00:48:29.200 --> 00:48:35.900
So x1 factorial all the
way to xn factorial.
00:48:35.900 --> 00:48:41.130
Then I get lambda,
the sum of the xi.
00:48:41.130 --> 00:48:43.480
Those are little xi's.
00:48:43.480 --> 00:48:46.581
e to the minus n lambda.
00:48:46.581 --> 00:48:47.080
OK?
00:48:51.900 --> 00:48:55.510
So that might be freaky at
this point, but remember,
00:48:55.510 --> 00:48:58.100
this is a function we
will be maximizing.
00:48:58.100 --> 00:49:01.480
And the denominator here
does not depend on lambda.
00:49:01.480 --> 00:49:04.860
So we know that maximizing this
function with this denominator,
00:49:04.860 --> 00:49:07.590
or any other
denominator, including 1,
00:49:07.590 --> 00:49:09.930
will give me the same arg max.
00:49:09.930 --> 00:49:12.180
So it won't be a problem for me.
00:49:12.180 --> 00:49:14.349
As long as it does
not depend on lambda,
00:49:14.349 --> 00:49:15.640
this thing is going to go away.
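A minimal sketch of that claim, assuming some made-up counts `xs` and a grid of candidate lambdas (neither comes from the lecture): the likelihood with the factorial denominator and the one with the denominator replaced by 1 should peak at the same lambda.

```python
import math

def poisson_likelihood(xs, lam):
    """Full likelihood: lam**sum(xs) * exp(-n*lam) / (x1! * ... * xn!)."""
    denom = math.prod(math.factorial(x) for x in xs)
    return lam ** sum(xs) * math.exp(-len(xs) * lam) / denom

def poisson_kernel(xs, lam):
    """Same function with the lambda-free denominator replaced by 1."""
    return lam ** sum(xs) * math.exp(-len(xs) * lam)

xs = [2, 0, 3, 1, 4]                       # hypothetical counts
grid = [k / 100 for k in range(1, 801)]    # candidate lambdas in (0, 8]

argmax_full = max(grid, key=lambda lam: poisson_likelihood(xs, lam))
argmax_kernel = max(grid, key=lambda lam: poisson_kernel(xs, lam))
print(argmax_full, argmax_kernel)
```

The grid search is only for illustration; it is not how the maximizer is found later in the lecture.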
00:49:19.130 --> 00:49:24.720
OK, so in the continuous case,
the likelihood I cannot--
00:49:24.720 --> 00:49:25.220
right?
00:49:25.220 --> 00:49:26.720
So if I would write
the likelihood
00:49:26.720 --> 00:49:29.600
like this in the
continuous case,
00:49:29.600 --> 00:49:32.240
this one would be equal to what?
00:49:32.240 --> 00:49:33.160
Zero, right?
00:49:33.160 --> 00:49:34.565
So it's not very helpful.
00:49:34.565 --> 00:49:36.440
And so what we do is we
define the likelihood
00:49:36.440 --> 00:49:39.860
as the product of
the f of theta xi.
00:49:39.860 --> 00:49:43.340
Now that would be a
jump if I told you,
00:49:43.340 --> 00:49:45.230
well, just define it
like that and go home
00:49:45.230 --> 00:49:46.700
and don't discuss it.
00:49:46.700 --> 00:49:52.011
But we know that this is
exactly what's coming from the--
00:49:52.011 --> 00:49:53.510
well, actually, I
think I erased it.
00:49:53.510 --> 00:49:55.370
It was just behind.
00:49:55.370 --> 00:49:58.280
So this was exactly what
was coming from the KL
00:49:58.280 --> 00:50:00.200
divergence estimate, right?
00:50:00.200 --> 00:50:01.700
The thing that I
showed you, if we
00:50:01.700 --> 00:50:03.200
want to follow this
strategy, which
00:50:03.200 --> 00:50:06.830
consists in estimating the KL
divergence and minimizing it,
00:50:06.830 --> 00:50:08.210
is exactly doing this.
00:50:12.190 --> 00:50:16.730
So in the Gaussian case--
00:50:16.730 --> 00:50:17.835
well, let's write it.
00:50:17.835 --> 00:50:19.610
So in the Gaussian
case, let's see
00:50:19.610 --> 00:50:20.940
what the likelihood looks like.
00:50:27.600 --> 00:50:32.000
OK, so if I have a
Gaussian experiment here--
00:50:32.000 --> 00:50:33.430
did I actually write it?
00:50:36.440 --> 00:50:40.590
OK, so I'm going to take mu and
sigma as being two parameters.
00:50:40.590 --> 00:50:43.756
So that means that my sample
space is going to be what?
00:50:47.330 --> 00:50:49.700
Well, my sample
space is still R.
00:50:49.700 --> 00:50:51.750
Those are just my observations.
00:50:51.750 --> 00:50:56.840
But then I'm going to
have a N mu sigma squared.
00:50:56.840 --> 00:50:58.400
And the parameters
of interest are mu
00:50:58.400 --> 00:51:04.291
in R. And sigma squared
in, say, 0 to infinity.
00:51:04.291 --> 00:51:06.450
OK, so that's my Gaussian model.
00:51:06.450 --> 00:51:07.736
Yes.
00:51:07.736 --> 00:51:17.455
STUDENT: [INAUDIBLE]
00:51:17.455 --> 00:51:18.580
PROFESSOR: No, there's no--
00:51:18.580 --> 00:51:20.080
I mean, there's no difference.
00:51:20.080 --> 00:51:21.514
STUDENT: [INAUDIBLE]
00:51:21.514 --> 00:51:22.180
PROFESSOR: Yeah.
00:51:22.180 --> 00:51:24.880
I think on all the slides
I put the curly bracket,
00:51:24.880 --> 00:51:26.520
then I'm just being lazy.
00:51:26.520 --> 00:51:31.540
I just like those
concave parentheses.
00:51:31.540 --> 00:51:33.850
All right, so let's write it.
00:51:33.850 --> 00:51:39.670
So the definition, L of x1 to xn.
00:51:39.670 --> 00:51:43.810
And now I have two parameters,
mu and sigma squared.
00:51:43.810 --> 00:51:48.035
We said, by definition,
is the product from i
00:51:48.035 --> 00:51:55.540
equal 1 to n of f
theta of little xi.
00:51:55.540 --> 00:51:57.550
Now, think about it.
00:51:57.550 --> 00:52:00.790
Here we always had
an extra line, right?
00:52:00.790 --> 00:52:03.460
The line was to say that the
definition was the probability
00:52:03.460 --> 00:52:05.470
that they were all
equal to each other.
00:52:05.470 --> 00:52:08.230
That was the joint probability.
00:52:08.230 --> 00:52:12.430
And here it could actually have
a line that says it's the joint
00:52:12.430 --> 00:52:14.146
probability distribution
of the xi's.
00:52:14.146 --> 00:52:15.520
And if it's not
independent, it's
00:52:15.520 --> 00:52:16.732
not going to be the product.
00:52:16.732 --> 00:52:18.190
But again, since
we're only dealing
00:52:18.190 --> 00:52:21.020
with independent observations
in the scope of this class,
00:52:21.020 --> 00:52:23.890
this is the only definition
we're going to be using.
00:52:23.890 --> 00:52:26.710
OK, and actually,
from here on, I
00:52:26.710 --> 00:52:30.910
will literally skip this step
when I talk about discrete ones
00:52:30.910 --> 00:52:33.270
as well, because they
are also independent.
00:52:33.270 --> 00:52:35.530
Agreed?
00:52:35.530 --> 00:52:37.570
So we start with
this, which we agreed
00:52:37.570 --> 00:52:39.590
was the definition for
this particular case.
00:52:39.590 --> 00:52:44.545
And so now all of you know by
heart what the density of a--
00:52:44.545 --> 00:52:45.600
sorry, that's not theta.
00:52:45.600 --> 00:52:47.540
I should write it
mu sigma squared.
00:52:47.540 --> 00:52:50.650
And so you need to
understand what this density is.
00:52:50.650 --> 00:53:01.070
And it's product of 1 over
sigma square root 2 pi times
00:53:01.070 --> 00:53:07.350
exponential minus
xi minus mu squared
00:53:07.350 --> 00:53:10.210
divided by 2 sigma squared.
00:53:10.210 --> 00:53:13.750
OK, that's the Gaussian density
with parameters mu and sigma
00:53:13.750 --> 00:53:15.810
squared.
00:53:15.810 --> 00:53:18.360
I just plugged in this thing
which I don't give you,
00:53:18.360 --> 00:53:20.630
so you just have to trust me.
00:53:20.630 --> 00:53:22.500
It's all over any book.
00:53:22.500 --> 00:53:25.334
Certainly, I mean,
you can find it.
00:53:25.334 --> 00:53:26.250
I will give it to you.
00:53:26.250 --> 00:53:29.310
And again, you're not
expected to know it by heart.
00:53:29.310 --> 00:53:34.290
Though, if you do your homework
every week without wanting to,
00:53:34.290 --> 00:53:36.180
you will definitely
use some of your brain
00:53:36.180 --> 00:53:38.140
to remember that thing.
00:53:38.140 --> 00:53:42.680
OK, and so now, well, I
have this constant in front.
00:53:42.680 --> 00:53:45.000
1 over sigma square root
2 pi that I can pull out.
00:53:45.000 --> 00:53:50.474
So I get 1 over sigma square
root 2 pi to the power n.
00:53:50.474 --> 00:53:52.890
And then I have the product
of exponentials, which we know
00:53:52.890 --> 00:53:55.420
is the exponential of the sum.
00:53:55.420 --> 00:53:58.710
So this is equal to
exponential minus.
00:53:58.710 --> 00:54:01.260
And here I'm going to put
the 1 over 2 sigma squared
00:54:01.260 --> 00:54:02.210
outside the sum.
00:54:15.740 --> 00:54:19.850
And so that's how
this guy shows up.
00:54:19.850 --> 00:54:23.550
Just the product of the densities
evaluated at, respectively,
00:54:23.550 --> 00:54:24.676
x1 to xn.
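As a rough check of that algebra, here is a sketch that compares the product of the individual Gaussian densities with the pulled-out form. The observations and the values of mu and sigma squared are hypothetical, chosen only for illustration.

```python
import math

def gaussian_density(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def likelihood_as_product(xs, mu, sigma2):
    """Definition: product over i of f(x_i)."""
    return math.prod(gaussian_density(x, mu, sigma2) for x in xs)

def likelihood_closed_form(xs, mu, sigma2):
    """Constant to the power n, times the exponential of the sum."""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-sum_sq / (2 * sigma2))

xs = [0.3, -1.2, 2.5, 0.8]   # hypothetical observations
a = likelihood_as_product(xs, mu=0.5, sigma2=1.5)
b = likelihood_closed_form(xs, mu=0.5, sigma2=1.5)
print(a, b)
```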
00:54:28.850 --> 00:54:33.240
OK, any questions about
computing those likelihoods?
00:54:33.240 --> 00:54:34.556
Yes.
00:54:34.556 --> 00:54:41.460
STUDENT: Why [INAUDIBLE]
00:54:41.460 --> 00:54:42.890
PROFESSOR: Oh, that's a typo.
00:54:42.890 --> 00:54:43.740
Thank you.
00:54:43.740 --> 00:54:47.040
Because I just took it from
probably the previous thing.
00:54:47.040 --> 00:54:48.840
So those are
actually-- should be--
00:54:48.840 --> 00:54:50.850
OK, thank you for
noting that one.
00:54:50.850 --> 00:55:00.180
So this line should say for
any x1 to xn in R to the n.
00:55:00.180 --> 00:55:01.470
Thank you, good catch.
00:55:06.940 --> 00:55:10.840
All right, so that's
really E to the n, right?
00:55:10.840 --> 00:55:12.490
My sample space always.
00:55:16.090 --> 00:55:19.800
OK, so what is maximum
likelihood estimation?
00:55:19.800 --> 00:55:24.770
Well again, if you go
back to the estimate
00:55:24.770 --> 00:55:27.770
that we got, the estimation
strategy, which consisted
00:55:27.770 --> 00:55:31.160
in replacing expectation
with respect to theta star
00:55:31.160 --> 00:55:35.540
by average of the data
in the KL divergence,
00:55:35.540 --> 00:55:41.810
we would try to maximize
not this guy, but this guy.
00:55:45.770 --> 00:55:48.260
The thing that we actually
plugged in were not any small
00:55:48.260 --> 00:55:48.760
xi's.
00:55:48.760 --> 00:55:52.040
Were actually-- the random
variable is capital Xi.
00:55:52.040 --> 00:55:54.190
So the maximum
likelihood estimator
00:55:54.190 --> 00:55:57.090
is actually taking
the likelihood,
00:55:57.090 --> 00:55:59.570
which is a function
of little x's, and now
00:55:59.570 --> 00:56:02.210
the values at which it
is evaluated, if you look at it,
00:56:02.210 --> 00:56:03.620
is actually--
00:56:03.620 --> 00:56:05.870
the capital X is my data.
00:56:05.870 --> 00:56:09.800
So it looks at the
function, at the data,
00:56:09.800 --> 00:56:11.900
and at the parameter theta.
00:56:11.900 --> 00:56:14.932
That's what the-- so
that's the first thing.
00:56:14.932 --> 00:56:16.640
And then the maximum
likelihood estimator
00:56:16.640 --> 00:56:19.930
is maximizing this, OK?
00:56:19.930 --> 00:56:24.090
So in a way, what it does is
it's a function that couples
00:56:24.090 --> 00:56:27.810
together the data,
capital X1 to capital Xn,
00:56:27.810 --> 00:56:32.310
with the parameter theta and
just now tries to maximize it.
00:56:32.310 --> 00:56:40.120
So if this is just a
little hard for you to get,
00:56:40.120 --> 00:56:42.340
the likelihood is
formally defined
00:56:42.340 --> 00:56:43.750
as a function of x, right?
00:56:43.750 --> 00:56:46.105
Like when I write f of x.
00:56:46.105 --> 00:56:48.580
f of little x, I
define it like that.
00:56:48.580 --> 00:56:52.990
But really, the only
x arguments we're
00:56:52.990 --> 00:56:54.680
going to evaluate
this function at
00:56:54.680 --> 00:56:57.920
are always the random
variable, which is the data.
00:56:57.920 --> 00:56:59.440
So if you want,
you can think of it
00:56:59.440 --> 00:57:02.230
as those guys being not
parameters of this function,
00:57:02.230 --> 00:57:04.810
but really, random variables
themselves directly.
00:57:09.390 --> 00:57:10.683
Is there any question?
00:57:10.683 --> 00:57:15.516
STUDENT: [INAUDIBLE] those
random variables [INAUDIBLE]??
00:57:15.516 --> 00:57:17.890
PROFESSOR: So those are going
to be known once you have--
00:57:17.890 --> 00:57:20.500
so it's always the
same thing in stats.
00:57:20.500 --> 00:57:24.040
You first design your
estimator as a function
00:57:24.040 --> 00:57:25.270
of random variables.
00:57:25.270 --> 00:57:27.490
And then once you get
data, you just plug it in.
00:57:27.490 --> 00:57:29.920
But we want to think of them
as being random variables
00:57:29.920 --> 00:57:32.262
because we want to understand
what the fluctuations are.
00:57:32.262 --> 00:57:34.720
So we're going to keep them as
random variables for as long
00:57:34.720 --> 00:57:35.685
as we can.
00:57:35.685 --> 00:57:37.810
We're going to spit out
the estimator as a function
00:57:37.810 --> 00:57:38.690
of the random variables.
00:57:38.690 --> 00:57:40.060
And then when we want
to compute it from data,
00:57:40.060 --> 00:57:41.351
we're just going to plug it in.
00:57:44.170 --> 00:57:46.630
So keep the random variables
for as long as you can.
00:57:46.630 --> 00:57:48.430
Unless I give you
numbers, actual numbers,
00:57:48.430 --> 00:57:51.130
just those are random variables.
00:57:51.130 --> 00:57:53.549
OK, so there might
be some confusion
00:57:53.549 --> 00:57:55.590
if you've seen any stats
class, sometimes there's
00:57:55.590 --> 00:57:58.420
a notation which says,
oh, the realization
00:57:58.420 --> 00:58:01.240
of the random variables
are lower case versions
00:58:01.240 --> 00:58:02.730
of the original
random variables.
00:58:02.730 --> 00:58:05.920
So lowercase x should be
thought as the realization
00:58:05.920 --> 00:58:09.610
of the upper case X. This
is not the case here.
00:58:09.610 --> 00:58:12.010
When I write this,
it's the same way
00:58:12.010 --> 00:58:16.630
as I write f of x is
equal to x squared, right?
00:58:16.630 --> 00:58:20.260
It's just an argument of a
function that I want to define.
00:58:20.260 --> 00:58:22.150
So those are just generic x.
00:58:22.150 --> 00:58:24.580
So if you correct
the typo that I have,
00:58:24.580 --> 00:58:27.150
this should say that this
should be for any x1 to xn.
00:58:27.150 --> 00:58:28.990
I'm just describing a function.
00:58:28.990 --> 00:58:30.816
And now the only
place at which I'm
00:58:30.816 --> 00:58:32.440
interested in evaluating
that function,
00:58:32.440 --> 00:58:35.477
at least for those first n
arguments, is at the capital
00:58:35.477 --> 00:58:37.310
X observations, the random
variables that I have.
00:58:41.110 --> 00:58:45.040
So there's actually
texts, there's actually
00:58:45.040 --> 00:58:48.070
people doing research on when
does the maximum likelihood
00:58:48.070 --> 00:58:49.720
estimator exist?
00:58:49.720 --> 00:58:56.890
And that happens when you
have infinite sets, thetas.
00:58:56.890 --> 00:58:58.770
And this thing can diverge.
00:58:58.770 --> 00:59:00.160
There is no global maximum.
00:59:00.160 --> 00:59:01.990
There's crazy things
that might happen.
00:59:01.990 --> 00:59:04.630
And so we're actually
always going to be in a case
00:59:04.630 --> 00:59:07.450
where this maximum
likelihood estimator exists.
00:59:07.450 --> 00:59:09.580
And if it doesn't, then
it means that you actually
00:59:09.580 --> 00:59:13.840
need to restrict your
parameter space, capital Theta,
00:59:13.840 --> 00:59:15.430
to something smaller.
00:59:15.430 --> 00:59:17.500
Otherwise it won't exist.
00:59:17.500 --> 00:59:21.910
OK, so another thing is the
log likelihood estimator.
00:59:21.910 --> 00:59:23.800
So it is still the
likelihood estimator.
00:59:23.800 --> 00:59:26.380
We saw before that
maximizing a function
00:59:26.380 --> 00:59:27.820
or maximizing log
of this function
00:59:27.820 --> 00:59:30.350
is the same thing, because the
log function is increasing.
00:59:30.350 --> 00:59:32.100
So the same thing is
maximizing a function
00:59:32.100 --> 00:59:35.352
or maximizing, I don't know,
exponential of this function.
00:59:35.352 --> 00:59:37.060
Every time I take an
increasing function,
00:59:37.060 --> 00:59:38.410
it's actually the same thing.
00:59:38.410 --> 00:59:40.360
Maximizing a function
or maximizing 10 times
00:59:40.360 --> 00:59:41.693
this function is the same thing.
00:59:41.693 --> 00:59:45.730
So the function x maps to
10 times x is increasing.
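To see this on a toy case, here is a sketch with a hypothetical positive function `g` that has a single peak (it is not from the lecture). Composing it with log, with exp, or multiplying it by 10 should leave the maximizer unchanged.

```python
import math

def g(theta):
    # Hypothetical positive function with a single peak at theta = 0.3.
    return math.exp(-(theta - 0.3) ** 2)

grid = [k / 100 for k in range(1, 100)]   # candidate thetas in (0, 1)

argmax_g = max(grid, key=g)
argmax_log = max(grid, key=lambda t: math.log(g(t)))   # increasing transform
argmax_10x = max(grid, key=lambda t: 10 * g(t))        # increasing transform
argmax_exp = max(grid, key=lambda t: math.exp(g(t)))   # increasing transform
print(argmax_g, argmax_log, argmax_10x, argmax_exp)
```

The maximum value changes under each transform; only the location of the maximum is preserved.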
00:59:45.730 --> 00:59:49.480
And so why do we talk about
log likelihood rather than
00:59:49.480 --> 00:59:50.620
likelihood?
00:59:50.620 --> 00:59:52.590
So the log of likelihood
is really just--
00:59:52.590 --> 00:59:55.810
I mean the log likelihood is
the log of the likelihood.
00:59:55.810 --> 00:59:59.420
And the reason is exactly
for this kind of reasons.
00:59:59.420 --> 01:00:02.240
Remember, that was
my likelihood, right?
01:00:02.240 --> 01:00:04.170
And I want to maximize it.
01:00:04.170 --> 01:00:05.940
And it turns out that
in stats, there's
01:00:05.940 --> 01:00:10.410
a lot of distributions that look
like exponential of something.
01:00:10.410 --> 01:00:12.930
So I might as well just
remove the exponential
01:00:12.930 --> 01:00:14.730
by taking the log.
01:00:14.730 --> 01:00:17.230
So once I have this
guy, I can take the log.
01:00:17.230 --> 01:00:19.080
This is something to
a power of something.
01:00:19.080 --> 01:00:21.720
If I take the log, it's
going to look better for me.
01:00:21.720 --> 01:00:23.400
I have this thing--
01:00:23.400 --> 01:00:25.650
well, I have another
one somewhere, I think,
01:00:25.650 --> 01:00:27.910
where I had the Poisson.
01:00:27.910 --> 01:00:29.070
Where was the Poisson?
01:00:29.070 --> 01:00:31.890
The Poisson's gone.
01:00:31.890 --> 01:00:33.610
So the Poisson was
the same thing.
01:00:33.610 --> 01:00:35.670
If I took the log,
because it had a power,
01:00:35.670 --> 01:00:37.210
that would make my life easier.
01:00:37.210 --> 01:00:43.800
So the log doesn't have any
particular intrinsic notion,
01:00:43.800 --> 01:00:47.550
except that it's
just more convenient.
01:00:47.550 --> 01:00:49.500
Now, that being
said, if you think
01:00:49.500 --> 01:00:53.370
about minimizing the KL,
the original formulation,
01:00:53.370 --> 01:00:55.590
we actually remove the log.
01:00:55.590 --> 01:00:57.040
If we come back
to the KL thing--
01:01:00.700 --> 01:01:01.610
where is my KL?
01:01:01.610 --> 01:01:03.770
Sorry.
01:01:03.770 --> 01:01:08.630
That was maximizing the sum
of the logs of the pi's.
01:01:08.630 --> 01:01:11.870
And so then we worked at it by
saying that the sum of the logs
01:01:11.870 --> 01:01:12.539
was--
01:01:12.539 --> 01:01:14.330
maximizing the sum of
the logs was the same
01:01:14.330 --> 01:01:16.220
as maximizing the product.
01:01:16.220 --> 01:01:18.140
But here, we're
basically-- log likelihood
01:01:18.140 --> 01:01:21.571
is just going backwards in
this chain of equivalences.
01:01:21.571 --> 01:01:23.570
And that's just because
the original formulation
01:01:23.570 --> 01:01:27.180
was already convenient.
01:01:27.180 --> 01:01:28.940
So we went to define
the likelihood
01:01:28.940 --> 01:01:32.620
and then coming back to our
original estimation strategy.
01:01:32.620 --> 01:01:34.250
So look at the Poisson.
01:01:34.250 --> 01:01:39.210
I want to take log here to
make my sum of xi's go down.
01:01:39.210 --> 01:01:47.510
OK, so this is my estimator.
01:01:47.510 --> 01:01:50.090
So the log of L--
01:01:50.090 --> 01:01:51.590
so one thing that
you want to notice
01:01:51.590 --> 01:01:59.960
is that the log of L of
x1, xn theta, as we said,
01:01:59.960 --> 01:02:02.860
is equal to the
sum from i equal 1
01:02:02.860 --> 01:02:09.950
to n of the log of either
p theta of xi, or--
01:02:09.950 --> 01:02:11.270
so that's in the discrete case.
01:02:11.270 --> 01:02:14.480
And in the continuous
case is the sum
01:02:14.480 --> 01:02:16.627
of the log of f theta of xi.
01:02:19.277 --> 01:02:21.860
The beauty of this is that you
don't have to really understand
01:02:21.860 --> 01:02:23.360
the difference between
probability mass
01:02:23.360 --> 01:02:25.310
function and probability
density function
01:02:25.310 --> 01:02:26.690
to implement this.
01:02:26.690 --> 01:02:29.518
Whatever you get,
that's what you plug in.
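One way to picture "whatever you get, that's what you plug in" is a single helper that takes either a pmf or a pdf. This is only a sketch: the helper name `log_likelihood`, the data, and the parameter values are all assumptions made for illustration.

```python
import math

def log_likelihood(xs, density, theta):
    """Sum over i of log density(x_i, theta); density may be a pmf or a pdf."""
    return sum(math.log(density(x, theta)) for x in xs)

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def gaussian_pdf(x, theta):
    mu, sigma2 = theta
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

counts = [2, 0, 3, 1, 4]          # hypothetical discrete data
reals = [0.3, -1.2, 2.5, 0.8]     # hypothetical continuous data

ll_poisson = log_likelihood(counts, poisson_pmf, 2.0)
ll_gaussian = log_likelihood(reals, gaussian_pdf, (0.5, 1.5))

# The sum of the logs agrees with the log of the product.
log_of_product = math.log(math.prod(poisson_pmf(x, 2.0) for x in counts))
print(ll_poisson, log_of_product, ll_gaussian)
```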
01:02:32.930 --> 01:02:33.810
Any questions so far?
01:02:36.550 --> 01:02:39.940
All right, so shall we
do some computations
01:02:39.940 --> 01:02:44.720
and check that, actually, we've
introduced all this stuff--
01:02:44.720 --> 01:02:47.380
complicated functions,
maximizing, KL divergence,
01:02:47.380 --> 01:02:50.590
lot of things-- so that we
can spit out, again, averages?
01:02:50.590 --> 01:02:51.160
All right?
01:02:51.160 --> 01:02:51.580
That's great.
01:02:51.580 --> 01:02:52.810
We're going to be able
to sleep at night
01:02:52.810 --> 01:02:55.150
and know that there's a really
powerful mechanism called
01:02:55.150 --> 01:02:57.370
maximum likelihood
estimator that was actually
01:02:57.370 --> 01:03:00.370
driving our intuition
without us knowing.
01:03:00.370 --> 01:03:04.730
OK, so let's do this.
01:03:04.730 --> 01:03:06.240
Bernoulli trials.
01:03:06.240 --> 01:03:07.400
I still have it over there.
01:03:15.920 --> 01:03:19.120
OK, so actually, I
don't know what--
01:03:19.120 --> 01:03:21.260
well, let me write it like that.
01:03:21.260 --> 01:03:25.730
So it's p over 1 minus p to the--
01:03:25.730 --> 01:03:32.650
sorry, to the sum of the xi's,
times 1 minus p to the n.
01:03:32.650 --> 01:03:37.960
So now I want to maximize
this as a function of P.
01:03:37.960 --> 01:03:39.880
Well, the first thing
we would want to do
01:03:39.880 --> 01:03:41.860
is to check that this
function is concave.
01:03:41.860 --> 01:03:45.220
And I'm just going to ask
you to trust me on this.
01:03:45.220 --> 01:03:47.800
So I don't want--
sorry, sum of the xi's.
01:03:47.800 --> 01:03:52.520
I only want to take the
derivative and just go home.
01:03:52.520 --> 01:03:55.150
So let's just take the
derivative of this with respect
01:03:55.150 --> 01:03:56.332
to P. Actually, no.
01:03:56.332 --> 01:03:57.540
This one was more convenient.
01:03:57.540 --> 01:03:58.040
I'm sorry.
01:04:00.820 --> 01:04:03.100
This one was slightly
more convenient, OK?
01:04:03.100 --> 01:04:05.980
So now we have--
01:04:05.980 --> 01:04:09.130
so now let me take the log.
01:04:09.130 --> 01:04:16.960
So if I take the log, what I get
is sum of the xi's times log p
01:04:16.960 --> 01:04:24.704
plus n minus sum of the
xi's times log 1 minus p.
01:04:27.970 --> 01:04:29.590
Now I take the
derivative with respect
01:04:29.590 --> 01:04:35.837
to p and set it equal to zero.
01:04:35.837 --> 01:04:36.920
So what does that give me?
01:04:36.920 --> 01:04:43.710
It tells me that sum of the
xi's divided by p minus n
01:04:43.710 --> 01:04:50.130
minus sum of the xi's divided by
1 minus p is equal to 0.
01:04:56.360 --> 01:04:58.980
So now I need to solve for p.
01:04:58.980 --> 01:04:59.920
So let's just do it.
01:04:59.920 --> 01:05:06.500
So what we get is that 1 minus p
sum of the xi's is equal to p n
01:05:06.500 --> 01:05:10.530
minus sum of the xi's.
01:05:10.530 --> 01:05:17.300
So that's p times n minus sum of
the xi's plus sum of the xi's.
01:05:17.300 --> 01:05:18.550
So let me put it on the right.
01:05:18.550 --> 01:05:24.410
So that's p times n is
equal to sum of the xi's.
01:05:24.410 --> 01:05:27.170
And that's equivalent to p--
01:05:27.170 --> 01:05:30.020
actually, I should start
by putting p hat from here
01:05:30.020 --> 01:05:33.720
on, because I'm already
solving an equation, right?
01:05:33.720 --> 01:05:36.880
And so p hat is equal
to sum of the xi's
01:05:36.880 --> 01:05:38.510
divided by n,
which is my xn bar.
01:05:44.050 --> 01:05:50.280
Poisson model, as I
said, Poisson is gone.
01:05:50.280 --> 01:05:51.874
So let me rewrite it quickly.
01:06:00.850 --> 01:06:07.975
So Poisson, the likelihood
in X1, Xn, and lambda
01:06:07.975 --> 01:06:13.270
was equal to lambda to
the sum of the xi's e
01:06:13.270 --> 01:06:17.650
to the minus n lambda
divided by X1 factorial,
01:06:17.650 --> 01:06:20.920
all the way to Xn factorial.
01:06:20.920 --> 01:06:25.110
So let me take the
log likelihood.
01:06:25.110 --> 01:06:26.490
That's going to
be equal to what?
01:06:26.490 --> 01:06:27.406
It's going to tell me.
01:06:27.406 --> 01:06:29.096
It's going to be--
01:06:29.096 --> 01:06:30.720
well, let me get rid
of this guy first.
01:06:30.720 --> 01:06:36.780
Minus log of X1 factorial
all the way to Xn factorial.
01:06:36.780 --> 01:06:39.520
That's a constant with
respect to lambda.
01:06:39.520 --> 01:06:43.180
So when I'm going to take the
derivative, it's going to go.
01:06:43.180 --> 01:06:49.410
Then I'm going to have plus sum
of the xi's times log lambda.
01:06:49.410 --> 01:06:51.410
And then I'm going to
have minus n lambda.
01:06:54.390 --> 01:06:55.890
So now then, you
take the derivative
01:06:55.890 --> 01:06:57.660
and set it equal to zero.
01:06:57.660 --> 01:07:04.860
So log L-- well, partial with
respect to lambda of log L,
01:07:04.860 --> 01:07:08.820
say lambda, equals zero.
01:07:08.820 --> 01:07:11.160
This is equivalent
to, so this guy goes.
01:07:11.160 --> 01:07:16.440
This guy gives me sum of the
xi's divided by lambda hat
01:07:16.440 --> 01:07:17.070
equals n.
01:07:22.470 --> 01:07:25.690
And so that's
equivalent to lambda hat
01:07:25.690 --> 01:07:31.092
is equal to sum of the xi's
divided by n, which is Xn bar.
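Written out, the steps just described look roughly as follows. This is a reconstruction from the spoken derivation, so the exact notation on the board may differ.

```latex
\begin{align*}
\log L(x_1,\dots,x_n;\lambda)
  &= -\log\bigl(x_1!\cdots x_n!\bigr)
     + \Bigl(\sum_{i=1}^n x_i\Bigr)\log\lambda - n\lambda, \\
\frac{\partial}{\partial\lambda}\log L(x_1,\dots,x_n;\lambda)
  &= \frac{\sum_{i=1}^n x_i}{\lambda} - n, \\
\frac{\sum_{i=1}^n X_i}{\hat\lambda} = n
  &\iff \hat\lambda = \frac{1}{n}\sum_{i=1}^n X_i = \bar X_n .
\end{align*}
```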
01:07:34.044 --> 01:07:38.785
Take derivative, set it equal
to zero, and just solve.
01:07:38.785 --> 01:07:42.930
It's a very satisfying
exercise, especially when
01:07:42.930 --> 01:07:45.150
you get the average in the end.
01:07:45.150 --> 01:07:49.060
You don't have to
think about it forever.
01:07:49.060 --> 01:07:54.360
OK, the Gaussian model I'm going
to leave to you as an exercise.
01:07:54.360 --> 01:07:57.600
Take the log to get rid
of the pesky exponential,
01:07:57.600 --> 01:08:00.690
and then take the derivative
and you should be fine.
01:08:00.690 --> 01:08:02.940
It's a bit more--
01:08:02.940 --> 01:08:05.960
it might be one more
line than those guys.
01:08:05.960 --> 01:08:12.760
OK, so-- well actually,
you need to take
01:08:12.760 --> 01:08:14.040
the gradient in this case.
01:08:14.040 --> 01:08:15.930
Don't check the second
derivative right now.
01:08:15.930 --> 01:08:17.596
You don't have to
really think about it.
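For anyone who wants to check their answer to the exercise, here is a hedged sketch. It assumes the standard closed form (sample mean for mu, average squared deviation for sigma squared), which the lecture has not derived at this point, and verifies numerically that the gradient of the log likelihood vanishes there. The data are hypothetical.

```python
def gaussian_log_likelihood_gradient(xs, mu, sigma2):
    """Gradient of the Gaussian log likelihood with respect to (mu, sigma2)."""
    n = len(xs)
    sum_dev = sum(x - mu for x in xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    d_mu = sum_dev / sigma2
    d_sigma2 = -n / (2 * sigma2) + sum_sq / (2 * sigma2 ** 2)
    return d_mu, d_sigma2

xs = [0.3, -1.2, 2.5, 0.8]                                  # hypothetical observations
mu_hat = sum(xs) / len(xs)                                   # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)    # average squared deviation

grad = gaussian_log_likelihood_gradient(xs, mu_hat, sigma2_hat)
print(grad)
```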
01:08:21.430 --> 01:08:23.537
What did I want to add?
01:08:23.537 --> 01:08:25.370
I think there was
something I wanted to say.
01:08:25.370 --> 01:08:27.319
Yes.
01:08:27.319 --> 01:08:31.040
When I have a function that's
concave and I'm on, like,
01:08:31.040 --> 01:08:33.671
some infinite
interval, then it's
01:08:33.671 --> 01:08:36.170
true that taking the derivative
and setting it equal to zero
01:08:36.170 --> 01:08:38.029
will give me the maximum.
01:08:38.029 --> 01:08:42.330
But again, I might have a
function that looks like this.
01:08:42.330 --> 01:08:46.260
Now, if I'm on some finite
interval-- let me go elsewhere.
01:08:46.260 --> 01:08:55.550
So if I'm on some
finite interval
01:08:55.550 --> 01:09:00.979
and my function looks like
this as a function of theta--
01:09:00.979 --> 01:09:03.220
let's say this is
my log likelihood
01:09:03.220 --> 01:09:06.410
as a function of theta--
01:09:06.410 --> 01:09:13.200
then, OK, there's no
place in this interval--
01:09:13.200 --> 01:09:15.040
let's say this is
between 0 and 1-- there's
01:09:15.040 --> 01:09:19.870
no place in this interval where
the derivative is equal to 0.
01:09:19.870 --> 01:09:22.569
And if you actually
try to solve this,
01:09:22.569 --> 01:09:26.187
you will only find a solution which
is not in the interval 0, 1.
01:09:26.187 --> 01:09:28.270
And that's actually how
you know that you probably
01:09:28.270 --> 01:09:30.144
should not set the
derivative equal to zero.
01:09:30.144 --> 01:09:32.720
So don't panic if you
get something that says,
01:09:32.720 --> 01:09:34.720
well, the solution is
at infinity, right?
01:09:34.720 --> 01:09:36.285
If this function
keeps going, you
01:09:36.285 --> 01:09:37.660
will find that
the solution-- you
01:09:37.660 --> 01:09:40.490
won't be able to find a
solution apart from infinity.
01:09:40.490 --> 01:09:43.720
You are going to see something
like 1 over theta hat
01:09:43.720 --> 01:09:46.359
is equal to 0, or
something like this.
01:09:46.359 --> 01:09:48.939
So you know that when you've
found this kind of solution,
01:09:48.939 --> 01:09:51.370
you've probably made a
mistake at some point.
01:09:51.370 --> 01:09:54.820
And the reason is because the
functions that are like this,
01:09:54.820 --> 01:09:58.150
you don't find the maximum by
setting the derivative equal
01:09:58.150 --> 01:09:59.230
to zero.
01:09:59.230 --> 01:10:01.159
You actually just find
the maximum by saying,
01:10:01.159 --> 01:10:03.450
well, it's an increasing
function on the interval 0, 1,
01:10:03.450 --> 01:10:05.000
so the maximum must
be attained at 1.
01:10:07.209 --> 01:10:08.750
So here in this
case, that would mean
01:10:08.750 --> 01:10:12.560
that my maximum would be 1.
01:10:12.560 --> 01:10:14.540
My estimator would be
1, which would be weird.
01:10:14.540 --> 01:10:17.316
So typically here, you have
a function of the xi's.
01:10:17.316 --> 01:10:19.940
So one example that you will see
many times is when this guy is
01:10:19.940 --> 01:10:24.870
the maximum of the xi's.
01:10:24.870 --> 01:10:27.210
And in which case, the
maximum is attained here,
01:10:27.210 --> 01:10:29.190
which is the maximum of this.
01:10:29.190 --> 01:10:31.620
OK, so just keep in mind--
01:10:31.620 --> 01:10:33.840
what I would recommend
is every time
01:10:33.840 --> 01:10:36.450
you're trying to take the
maximum of a function,
01:10:36.450 --> 01:10:39.320
just try to plot the
function in your head.
01:10:39.320 --> 01:10:40.380
It's not too complicated.
01:10:40.380 --> 01:10:44.790
Those things are usually
squares, or square roots,
01:10:44.790 --> 01:10:45.630
or logs.
01:10:45.630 --> 01:10:47.430
You know what those
functions look like.
01:10:47.430 --> 01:10:50.040
Just plot them in your
mind and make sure
01:10:50.040 --> 01:10:52.230
that you will find a
maximum which really
01:10:52.230 --> 01:10:54.210
goes up and then down again.
01:10:54.210 --> 01:10:56.400
If you don't, then
that means your maximum
01:10:56.400 --> 01:10:59.370
is achieved at the
boundary and you have
01:10:59.370 --> 01:11:01.950
to think differently to get it.
01:11:01.950 --> 01:11:04.590
So the machinery that consists
in setting the derivative equal
01:11:04.590 --> 01:11:06.870
to zero works 80% of the time.
01:11:06.870 --> 01:11:08.880
But you have to be careful.
01:11:08.880 --> 01:11:11.880
And from the context,
it will be clear
01:11:11.880 --> 01:11:14.460
that you had to be careful,
because you will find
01:11:14.460 --> 01:11:17.190
some crazy stuff, such
as solve 1 over theta hat
01:11:17.190 --> 01:11:18.090
is equal to zero.
01:11:23.140 --> 01:11:25.280
All right, so
before we conclude,
01:11:25.280 --> 01:11:28.090
I just wanted to give you
some intuition about how does
01:11:28.090 --> 01:11:30.620
the maximum likelihood perform?
01:11:30.620 --> 01:11:33.070
So there's something called
the Fisher information
01:11:33.070 --> 01:11:35.980
that essentially controls
how this thing performs.
01:11:35.980 --> 01:11:38.710
And the Fisher information
is, essentially,
01:11:38.710 --> 01:11:40.420
a second derivative
or a Hessian.
01:11:40.420 --> 01:11:44.980
So if I'm in a one-dimensional
parameter case, it's a number,
01:11:44.980 --> 01:11:46.300
it's a second derivative.
01:11:46.300 --> 01:11:51.000
If I'm in a multidimensional
case, it's actually a Hessian,
01:11:51.000 --> 01:11:52.780
it's a matrix.
01:11:52.780 --> 01:11:57.800
So I'm going to actually take
in notation little curly L
01:11:57.800 --> 01:12:00.670
of theta to be the
log likelihood, OK?
01:12:00.670 --> 01:12:02.910
And that's the log likelihood
for one observation.
01:12:02.910 --> 01:12:05.560
So let's call it x generically,
but think of it as being x1,
01:12:05.560 --> 01:12:07.480
for example.
01:12:07.480 --> 01:12:09.250
And I don't care
of, like, summing,
01:12:09.250 --> 01:12:11.260
because I'm actually going to
take expectation of this thing.
01:12:11.260 --> 01:12:13.176
So it's not going to be
a data driven quantity
01:12:13.176 --> 01:12:14.390
I'm going to play with.
01:12:14.390 --> 01:12:15.806
So now I'm going
to assume that it
01:12:15.806 --> 01:12:19.390
is twice differentiable,
almost surely, because it's
01:12:19.390 --> 01:12:21.350
a random function.
01:12:21.350 --> 01:12:23.890
And so now I'm going to
just sweep under the rug
01:12:23.890 --> 01:12:27.700
some technical conditions
when these things hold.
01:12:27.700 --> 01:12:32.130
So typically, when can I
permute integral and derivatives
01:12:32.130 --> 01:12:35.160
and this kind of stuff that
you don't want to think about?
01:12:35.160 --> 01:12:36.730
OK, the rule of
thumb is it always
01:12:36.730 --> 01:12:39.589
works until it
doesn't, in which case,
01:12:39.589 --> 01:12:41.380
that probably means
you're actually solving
01:12:41.380 --> 01:12:44.080
some sort of calculus problem.
01:12:44.080 --> 01:12:47.390
Because in practice,
it just doesn't happen.
01:12:47.390 --> 01:12:56.010
So the Fisher information
is the expectation of the--
01:12:56.010 --> 01:12:57.790
that's called the outer product.
01:12:57.790 --> 01:13:01.240
So that's the product
of this gradient
01:13:01.240 --> 01:13:02.390
and the gradient transpose.
01:13:02.390 --> 01:13:04.540
So that forms a matrix, right?
01:13:04.540 --> 01:13:09.830
That's a matrix minus the outer
product of the expectations.
01:13:09.830 --> 01:13:12.910
So that's really what's
called the covariance matrix
01:13:12.910 --> 01:13:16.285
of this vector, nabla
of L theta, which
01:13:16.285 --> 01:13:18.090
is a random vector.
01:13:18.090 --> 01:13:21.042
So I'm forming the covariance
matrix of this thing.
01:13:21.042 --> 01:13:23.250
And the technical conditions
tell me that, actually,
01:13:23.250 --> 01:13:26.600
this guy, which depends
only on the Hessian,
01:13:26.600 --> 01:13:31.115
is actually equal to negative
expectation of the-- sorry.
01:13:31.115 --> 01:13:32.240
It depends on the gradient.
01:13:32.240 --> 01:13:36.140
Is actually negative
expectation of the Hessian.
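[Note: the definition and identity just described can be written out as follows. This is a sketch in standard notation rather than a copy of the board: \ell(\theta) stands for the log likelihood of one observation X, and I(\theta) is the assumed symbol for the Fisher information. The second equality is the one that relies on the technical conditions mentioned above.]

```latex
I(\theta)
  = \operatorname{Cov}\bigl(\nabla \ell(\theta)\bigr)
  = \mathbb{E}\bigl[\nabla \ell(\theta)\, \nabla \ell(\theta)^{\top}\bigr]
    - \mathbb{E}\bigl[\nabla \ell(\theta)\bigr]\,
      \mathbb{E}\bigl[\nabla \ell(\theta)\bigr]^{\top}
  = -\,\mathbb{E}\bigl[\nabla^{2} \ell(\theta)\bigr]
```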
01:13:36.140 --> 01:13:38.300
So I can actually
get a quantity that
01:13:38.300 --> 01:13:40.400
depends on the second
derivatives only using
01:13:40.400 --> 01:13:41.740
first derivatives.
01:13:41.740 --> 01:13:44.202
But the expectation is
going to play a role here.
01:13:44.202 --> 01:13:45.410
And the fact that it's a log.
01:13:45.410 --> 01:13:48.180
And lots of things
actually show up here.
01:13:48.180 --> 01:13:51.220
And so in this case,
what I get is that--
01:13:51.220 --> 01:13:53.510
so in the one-dimensional
case, then this
01:13:53.510 --> 01:13:56.480
is just the covariance matrix of
a one-dimensional thing, which
01:13:56.480 --> 01:13:58.200
is just a variance of itself.
01:13:58.200 --> 01:14:00.050
So the variance
of the derivative
01:14:00.050 --> 01:14:04.190
is actually equal to
negative the expectation
01:14:04.190 --> 01:14:07.080
of the second derivative.
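[Note: a small numerical check of the one-dimensional identity, variance of the derivative equals minus the expectation of the second derivative. The Bernoulli(p) model and all function names below are illustrative assumptions, not something taken from the lecture. Since X only takes the values 0 and 1, the expectations are computed exactly rather than by sampling.]

```python
def score(x, p):
    # First derivative in p of log likelihood x*log(p) + (1-x)*log(1-p).
    return x / p - (1 - x) / (1 - p)


def second_derivative(x, p):
    # Second derivative in p of the same one-observation log likelihood.
    return -x / p**2 - (1 - x) / (1 - p) ** 2


def expect(f, p):
    # Exact expectation of f(X) when X ~ Bernoulli(p).
    return (1 - p) * f(0) + p * f(1)


def fisher_from_score(p):
    # Variance of the score: E[score^2] - (E[score])^2.
    mean = expect(lambda x: score(x, p), p)
    return expect(lambda x: score(x, p) ** 2, p) - mean**2


def fisher_from_hessian(p):
    # Minus the expected second derivative.
    return -expect(lambda x: second_derivative(x, p), p)


if __name__ == "__main__":
    p = 0.3
    print(fisher_from_score(p), fisher_from_hessian(p), 1 / (p * (1 - p)))
```

[For this model both routes agree with the closed form 1 / (p (1 - p)), up to floating-point rounding.]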
01:14:07.080 --> 01:14:09.280
OK, so we'll see that next time.
01:14:09.280 --> 01:14:12.600
But what I wanted to emphasize
with this is that why do
01:14:12.600 --> 01:14:15.109
we care about this quantity?
01:14:15.109 --> 01:14:16.650
That's called the
Fisher information.
01:14:16.650 --> 01:14:19.770
Fisher is the founding
father of modern statistics.
01:14:19.770 --> 01:14:23.070
Why do we give this
quantity his name?
01:14:23.070 --> 01:14:25.546
Well, it's because this quantity
is actually very critical.
01:14:25.546 --> 01:14:27.420
What does the second
derivative of a function
01:14:27.420 --> 01:14:29.560
tell me at the maximum?
01:14:29.560 --> 01:14:34.350
Well, it's telling me
how curved it is, right?
01:14:34.350 --> 01:14:37.780
If I have a zero second
derivative, I'm basically flat.
01:14:37.780 --> 01:14:41.137
And if I have a very high second
derivative, I'm very curvy.
01:14:41.137 --> 01:14:42.720
And when I'm very
curvy, what it means
01:14:42.720 --> 01:14:45.760
is that I'm very robust
to the estimation error.
01:14:45.760 --> 01:14:47.160
Remember our
estimation strategy,
01:14:47.160 --> 01:14:50.130
which consisted in replacing
expectation by averages?
01:14:50.130 --> 01:14:52.830
If I'm extremely curvy,
I can move a little bit.
01:14:52.830 --> 01:14:55.410
This thing, the maximum,
is not going to move much.
01:14:55.410 --> 01:14:57.280
And this formula here--
01:14:57.280 --> 01:15:00.090
so forget about the matrix
version for a second--
01:15:00.090 --> 01:15:01.780
is actually telling me exactly--
01:15:01.780 --> 01:15:06.000
it's telling me the curvature
is basically the variance
01:15:06.000 --> 01:15:08.290
of the first derivative.
01:15:08.290 --> 01:15:10.840
And so the more the first
derivative fluctuates,
01:15:10.840 --> 01:15:12.930
the more your maximum is
actually-- your arg max
01:15:12.930 --> 01:15:14.710
is going to move
all over the place.
01:15:14.710 --> 01:15:16.950
So this is really
controlling how flat
01:15:16.950 --> 01:15:20.280
your likelihood, your log
likelihood, is at its maximum.
01:15:20.280 --> 01:15:23.340
The flatter it is, the more
sensitive to fluctuation
01:15:23.340 --> 01:15:24.630
the arg max is going to be.
01:15:24.630 --> 01:15:27.060
The curvier it is, the
less sensitive it is.
01:15:27.060 --> 01:15:28.740
And so what we're
hoping-- a good model
01:15:28.740 --> 01:15:31.710
is going to be one that
has a large or small value
01:15:31.710 --> 01:15:34.350
for the Fisher information.
01:15:34.350 --> 01:15:36.938
I want this to be--
01:15:36.938 --> 01:15:38.300
small?
01:15:38.300 --> 01:15:40.070
I want it to be large.
01:15:40.070 --> 01:15:42.290
Because this is the
curvature, right?
01:15:42.290 --> 01:15:44.414
This number is
negative, it's concave.
01:15:44.414 --> 01:15:45.830
So if I take a
negative sign, it's
01:15:45.830 --> 01:15:47.810
going to be something
that's positive.
01:15:47.810 --> 01:15:51.230
And the larger this thing,
the more curvy it is.
01:15:51.230 --> 01:15:52.730
Oh, yeah, because
it's the variance.
01:15:52.730 --> 01:15:53.271
Again, sorry.
01:15:53.271 --> 01:15:55.430
This is what--
01:15:55.430 --> 01:15:55.930
OK.
01:15:59.480 --> 01:16:02.156
Yeah, maybe I should not
go into those details
01:16:02.156 --> 01:16:03.530
because I'm actually
out of time.
01:16:03.530 --> 01:16:06.890
But just spoiler alert,
the asymptotic variance
01:16:06.890 --> 01:16:09.020
of your-- the variance,
basically, as n
01:16:09.020 --> 01:16:11.370
goes to infinity of the
maximum likelihood estimator
01:16:11.370 --> 01:16:12.830
is going to be 1 over this guy.
01:16:12.830 --> 01:16:15.260
So we want it to be large,
because the asymptotic variance
01:16:15.260 --> 01:16:16.910
is going to be very small.
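[Note: a rough simulation of this spoiler, again assuming a Bernoulli(p) model chosen here only for illustration. In that model the maximum likelihood estimator is the sample mean and the Fisher information is 1 / (p (1 - p)), so n times the variance of the estimator, multiplied by the Fisher information, should come out close to 1. The function name, sample size, repetition count, and seed are all arbitrary choices.]

```python
import random
import statistics


def simulate_mle_variance(p=0.3, n=200, reps=2000, seed=0):
    # Draw `reps` independent samples of size n from Bernoulli(p) and
    # return the empirical variance of the MLE (the sample mean).
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        estimates.append(successes / n)
    return statistics.pvariance(estimates)


p, n = 0.3, 200
fisher = 1.0 / (p * (1.0 - p))
# n * Var(MLE) should be close to 1 / fisher, so this ratio should be near 1.
ratio = n * simulate_mle_variance(p, n) * fisher

if __name__ == "__main__":
    print(ratio)
```

[A larger Fisher information therefore means a smaller variance for the estimator at a given sample size, which is why we want it to be large.]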
01:16:16.910 --> 01:16:18.650
All right, so we're out of time.
01:16:18.650 --> 01:16:20.630
We'll see that next week.
01:16:20.630 --> 01:16:22.730
And I have your
homework with me.
01:16:22.730 --> 01:16:25.052
And I will actually turn it in.
01:16:25.052 --> 01:16:26.510
I will give it to
you outside so we
01:16:26.510 --> 01:16:28.580
can let the other room come in.
01:16:28.580 --> 01:16:31.630
OK, I'll just leave you the--