WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:16.680
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.680 --> 00:00:17.880
at ocw.mit.edu.
00:00:32.680 --> 00:00:34.120
PHILIPPE RIGOLLET: All right.
00:00:34.120 --> 00:00:41.180
So let's continue talking about
maximum likelihood estimation
00:00:41.180 --> 00:00:43.640
in the context of generalized
linear models, all right?
00:00:43.640 --> 00:00:45.570
So in those generalized
linear models,
00:00:45.570 --> 00:00:49.730
what we spent most of the
past lectures working on
00:00:49.730 --> 00:00:55.680
is the conditional
distribution of Y given
00:00:55.680 --> 00:00:59.940
X. And we're going
to assume that this
00:00:59.940 --> 00:01:08.990
follows some distribution
in the exponential family.
00:01:12.861 --> 00:01:13.360
OK.
00:01:13.360 --> 00:01:18.170
And so what it means is that if
we look at the density, say--
00:01:18.170 --> 00:01:21.970
or the PMF, but let's
talk about density
00:01:21.970 --> 00:01:23.440
to make things clearer--
00:01:23.440 --> 00:01:29.020
we're going to assume that
Y given X has distribution.
00:01:29.020 --> 00:01:31.860
So X is now fixed, because
we're conditioning on it.
00:01:31.860 --> 00:01:50.240
And it has a density, which
is of this form, c of Yi phi.
00:01:50.240 --> 00:01:50.740
OK.
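For reference, the density being written on the board is the canonical exponential family form:

```latex
f_{\theta_i}(y_i) = \exp\!\left( \frac{y_i\,\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right)
```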
00:01:50.740 --> 00:01:53.290
So this c, again, we don't
really need to think about it.
00:01:53.290 --> 00:01:55.415
This is something that's
going to come up naturally
00:01:55.415 --> 00:01:58.690
as soon as you need
a normalization factor.
00:01:58.690 --> 00:02:02.710
And so here what it means, if
this is the distribution of Y
00:02:02.710 --> 00:02:08.080
given Xi, so that's
the density of Yi
00:02:08.080 --> 00:02:11.830
given Xi is equal to little xi.
00:02:11.830 --> 00:02:15.760
So if it's the conditional
distribution of Yi given Xi,
00:02:15.760 --> 00:02:18.040
it should depend on xi somehow.
00:02:18.040 --> 00:02:20.980
And it does not appear
to depend on Xi.
00:02:20.980 --> 00:02:26.640
And here, the model is going
to be on theta i, which is just
00:02:26.640 --> 00:02:29.137
a function, theta i of Xi.
00:02:29.137 --> 00:02:30.970
And we're going to take
a very specific one.
00:02:30.970 --> 00:02:38.430
It's going to be a function
of a linear form of the Xi.
00:02:38.430 --> 00:02:40.930
So really we're going to take
something which is of the form
00:02:40.930 --> 00:02:43.270
theta i, which is really
just-- a theta that does not depend
00:02:43.270 --> 00:02:44.080
on i--
00:02:44.080 --> 00:02:47.310
of Xi transpose beta.
00:02:47.310 --> 00:02:47.980
OK?
00:02:47.980 --> 00:02:50.800
So all these parts here,
this is really some modeling
00:02:50.800 --> 00:02:54.140
assumptions that we're
making once we've agreed
00:02:54.140 --> 00:02:55.740
on what distribution we want.
00:02:55.740 --> 00:02:56.240
OK.
00:02:56.240 --> 00:02:58.730
So to do that, our
goal, of course,
00:02:58.730 --> 00:03:01.010
is going to try to
understand what this beta is.
00:03:01.010 --> 00:03:02.810
There's one beta here.
00:03:02.810 --> 00:03:12.660
What's important is that this
beta does not depend on i.
00:03:12.660 --> 00:03:17.480
So if I observe
pairs Xi, Yi--
00:03:17.480 --> 00:03:22.460
let's say I observe n of
them, i equals 1 to n--
00:03:22.460 --> 00:03:25.940
the hope is that as I
accumulate more and more pairs
00:03:25.940 --> 00:03:29.360
of this form where there's
always the same parameter that
00:03:29.360 --> 00:03:32.960
links Xi to Yi, that's
this parameter beta,
00:03:32.960 --> 00:03:35.960
that I should have a better and
better estimation of this beta.
00:03:35.960 --> 00:03:37.232
Because it's always the same.
00:03:37.232 --> 00:03:38.690
And that's essentially
what couples
00:03:38.690 --> 00:03:40.040
all of our distributions.
00:03:40.040 --> 00:03:42.620
If I did not assume
this, then I could
00:03:42.620 --> 00:03:46.220
have a different distribution
for each pair Yi given Xi.
00:03:46.220 --> 00:03:48.830
And I would not be able
to do any statistics.
00:03:48.830 --> 00:03:50.420
Nothing would
average out in the end.
00:03:50.420 --> 00:03:52.267
But here I have the
same beta, which
00:03:52.267 --> 00:03:54.350
means that I can hope to
do statistics and average
00:03:54.350 --> 00:03:56.310
errors in them.
00:03:56.310 --> 00:03:56.810
OK.
00:03:56.810 --> 00:04:00.540
So I'm going to collect,
so I'll come back to this.
00:04:00.540 --> 00:04:02.910
But as usual in the
linear regression model,
00:04:02.910 --> 00:04:05.740
we're going to collect
all our observations Yi.
00:04:05.740 --> 00:04:08.060
So I'm going to assume
that they're real valued
00:04:08.060 --> 00:04:10.940
and that my Xi's
take values in Rp
00:04:10.940 --> 00:04:12.560
just like in the
regression model.
00:04:12.560 --> 00:04:16.410
And I'm going to collect all my
Yi's into one big vector of Y
00:04:16.410 --> 00:04:23.480
in Rn and all my X's into
one big matrix in Rn times p
00:04:23.480 --> 00:04:26.420
just like for the
linear regression model.
00:04:26.420 --> 00:04:30.020
All right, so, again, what
I'm interested in here
00:04:30.020 --> 00:04:36.361
is the conditional
distribution of Yi given Xi.
00:04:36.361 --> 00:04:36.860
OK.
00:04:36.860 --> 00:04:39.020
I said this is
this distribution.
00:04:39.020 --> 00:04:40.520
When we're talking
about regression,
00:04:40.520 --> 00:04:42.470
I defined last time
what the definition
00:04:42.470 --> 00:04:44.675
of regression function was.
00:04:44.675 --> 00:04:46.760
It's just one
particular aspect of
00:04:46.760 --> 00:04:48.240
this conditional distribution.
00:04:48.240 --> 00:04:51.580
It's the conditional
expectation of Yi given Xi.
00:04:51.580 --> 00:04:52.080
OK.
00:04:52.080 --> 00:04:53.780
And so this conditional
expectation,
00:04:53.780 --> 00:04:57.180
I will denote it by--
00:04:57.180 --> 00:05:06.950
so I talk about the
conditional, I'm
00:05:06.950 --> 00:05:08.420
going to call it,
say, mu i, which
00:05:08.420 --> 00:05:11.720
is the conditional
expectation of Yi given
00:05:11.720 --> 00:05:15.217
Xi equals some little xi, say.
00:05:15.217 --> 00:05:17.550
You can forget about this
part if you find it confusing.
00:05:17.550 --> 00:05:18.591
It really doesn't matter.
00:05:18.591 --> 00:05:21.590
It's just that this
means that this
00:05:21.590 --> 00:05:25.280
is a function of little xi.
00:05:25.280 --> 00:05:29.885
But if I only had the
expectation of Yi given big Xi,
00:05:29.885 --> 00:05:32.489
this would be just a
function of big Xi.
00:05:32.489 --> 00:05:34.030
So it really doesn't
change anything.
00:05:34.030 --> 00:05:36.240
It's just a matter of notation.
00:05:36.240 --> 00:05:36.740
OK.
00:05:36.740 --> 00:05:39.000
So just forget about this part.
00:05:39.000 --> 00:05:41.720
But I'll just do
it like that here.
00:05:41.720 --> 00:05:42.220
OK.
00:05:42.220 --> 00:05:46.630
So this is just the conditional
expectation of Yi given Xi.
00:05:46.630 --> 00:05:50.390
It just depends on Xi, so
it depends on i,
00:05:50.390 --> 00:05:52.040
and so I will call it mu i.
00:05:52.040 --> 00:05:55.880
But I know that, since I'm in a
canonical exponential family,
00:05:55.880 --> 00:06:00.510
then I know that mu i is
actually B prime of theta i.
00:06:00.510 --> 00:06:01.010
OK.
00:06:01.010 --> 00:06:04.370
So there's a 1 to 1 link
between the canonical parameter
00:06:04.370 --> 00:06:07.362
of my exponential
family and the mean mu
00:06:07.362 --> 00:06:09.920
i, the conditional expectation.
00:06:09.920 --> 00:06:12.630
And the modeling assumption
we're going to make
00:06:12.630 --> 00:06:14.390
is not directly--
00:06:14.390 --> 00:06:15.890
remember, that was
the second aspect
00:06:15.890 --> 00:06:17.480
of the generalized linear model.
00:06:17.480 --> 00:06:20.600
We're not going to assume
that theta i itself directly
00:06:20.600 --> 00:06:22.040
depends on Xi.
00:06:22.040 --> 00:06:23.570
We're going to
assume that mu i has
00:06:23.570 --> 00:06:28.190
a particular dependence on
Xi through the link function.
00:06:28.190 --> 00:06:31.490
So, again, we're
back to modeling.
00:06:31.490 --> 00:06:33.150
So we have a link function g.
00:06:38.020 --> 00:06:51.840
And we assume that mu i
depends on Xi as follows.
00:06:56.580 --> 00:06:58.625
g of mu i--
00:06:58.625 --> 00:07:02.520
and remember, all g
does for us is really
00:07:02.520 --> 00:07:05.260
map the space in which
mu i lives, which
00:07:05.260 --> 00:07:08.930
could be just the interval
0, 1 to the entire real line,
00:07:08.930 --> 00:07:09.464
all right?
00:07:09.464 --> 00:07:11.380
And we're going to assume
that this thing that
00:07:11.380 --> 00:07:14.105
lives in the real line is
just Xi transpose beta.
00:07:14.105 --> 00:07:17.235
I should maybe put a small
i, Xi transpose beta.
00:07:17.235 --> 00:07:17.734
OK?
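Written out, the modeling assumption being made here is:

```latex
g(\mu_i) = x_i^\top \beta, \qquad \mu_i = \mathbb{E}[\,Y_i \mid X_i = x_i\,]
```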
00:07:20.460 --> 00:07:23.040
So we're making, indeed,
some modeling assumption.
00:07:23.040 --> 00:07:26.080
But compared to in the
linear regression model,
00:07:26.080 --> 00:07:29.250
we only assume that mu
i was Xi transpose beta.
00:07:29.250 --> 00:07:30.990
So if you want to
make a parallel
00:07:30.990 --> 00:07:32.970
between generalized
linear models
00:07:32.970 --> 00:07:35.090
and linear models, the
only difference
00:07:35.090 --> 00:07:39.580
is that g is not the identity
necessarily in this case.
00:07:39.580 --> 00:07:42.760
And all the g does
for us is to just
00:07:42.760 --> 00:07:45.400
make this thing
compatible, that those two
00:07:45.400 --> 00:07:48.100
things on the left and
the right of the equality
00:07:48.100 --> 00:07:50.210
live in the same space.
00:07:50.210 --> 00:07:56.170
So in a way, we're not making
a much bigger leap of faith
00:07:56.170 --> 00:07:57.390
by assuming a linear model.
00:07:57.390 --> 00:07:59.740
The linear link is already here.
00:07:59.740 --> 00:08:04.020
We're just making things
compatible, all right?
00:08:04.020 --> 00:08:09.140
And so it's always the
same link function.
00:08:09.140 --> 00:08:12.790
So now if I want to
go back to beta--
00:08:12.790 --> 00:08:16.510
right, because I'm going to
want to express my likelihood--
00:08:16.510 --> 00:08:18.670
if I were to express my
likelihood from this,
00:08:18.670 --> 00:08:21.110
it would just be a
function of theta, right?
00:08:21.110 --> 00:08:22.960
And so if I want to
maximize my likelihood,
00:08:22.960 --> 00:08:24.501
I don't want to
maximize it in theta.
00:08:24.501 --> 00:08:26.000
I want to maximize it in beta.
00:08:26.000 --> 00:08:29.910
So if I can write my density
as a function of beta,
00:08:29.910 --> 00:08:31.660
then I will be able
to write my likelihood
00:08:31.660 --> 00:08:33.160
as a function of
beta, and then talk
00:08:33.160 --> 00:08:35.169
about my maximum
likelihood estimator.
00:08:35.169 --> 00:08:37.669
And so all I need to
do is to just say, OK,
00:08:37.669 --> 00:08:39.669
how do I replace theta by--
00:08:39.669 --> 00:08:42.820
I know that theta is a
function of beta, right?
00:08:42.820 --> 00:08:45.120
I wrote it here.
00:08:45.120 --> 00:08:46.980
So the question is,
what is this function?
00:08:46.980 --> 00:08:49.400
And I actually have
access to all of this.
00:08:49.400 --> 00:08:50.930
So what I know is that theta--
00:08:50.930 --> 00:08:54.990
right, so mu is
b prime of theta,
00:08:54.990 --> 00:09:02.280
which means that theta i
is b prime inverse of mu i.
00:09:02.280 --> 00:09:02.990
OK.
00:09:02.990 --> 00:09:06.080
So that's what we've got from
this derivative of the log
00:09:06.080 --> 00:09:07.850
likelihood equal to 0.
00:09:07.850 --> 00:09:09.710
That gives us this guy, inverted.
00:09:09.710 --> 00:09:13.670
And now I know that mu i
is g inverse of Xi beta.
00:09:23.790 --> 00:09:27.210
So this composition of b
prime inverse and g inverse
00:09:27.210 --> 00:09:30.960
is actually just the inverse of
the composition of g with b prime.
00:09:38.960 --> 00:09:40.840
Everybody's comfortable
with this notation,
00:09:40.840 --> 00:09:42.470
the little circle?
00:09:42.470 --> 00:09:43.760
Any question about this?
00:09:43.760 --> 00:09:45.999
It just means that I
first applied b prime.
00:09:45.999 --> 00:09:47.290
Well, actually, it's the inverse.
00:09:47.290 --> 00:09:49.850
But if I look at a function
g composed with b prime,
00:09:49.850 --> 00:09:59.230
then g composed with b prime
of x is just g of b prime of x.
00:09:59.230 --> 00:10:00.050
OK.
00:10:00.050 --> 00:10:02.160
And then I take the
inverse of this function,
00:10:02.160 --> 00:10:05.645
which is first take g inverse,
and then take b prime inverse.
00:10:09.110 --> 00:10:09.610
OK.
00:10:09.610 --> 00:10:12.810
So now I have
everywhere I saw theta,
00:10:12.810 --> 00:10:16.360
now I see this function of beta.
00:10:16.360 --> 00:10:18.010
So I could technically
plug that in.
00:10:18.010 --> 00:10:20.080
Of course, it's a
little painful to have
00:10:20.080 --> 00:10:22.720
to write g circle b
prime all the time.
00:10:22.720 --> 00:10:24.700
So I'm going to give
this guy a name.
00:10:24.700 --> 00:10:29.800
And so you're just
going to define
00:10:29.800 --> 00:10:39.090
h, which is the inverse of g
composed with b prime, so that theta i is simply
00:10:39.090 --> 00:10:42.890
h of Xi transpose beta.
00:10:42.890 --> 00:10:43.510
OK.
00:10:43.510 --> 00:10:46.150
I could give it
a name, you know.
00:10:46.150 --> 00:10:47.890
But let's just call
that the h function.
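So, written out, the definition on the board is:

```latex
h := (g \circ b')^{-1} = (b')^{-1} \circ g^{-1}, \qquad \theta_i = h(x_i^\top \beta)
```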
00:10:52.170 --> 00:10:54.380
And something which is
nice about this h function
00:10:54.380 --> 00:11:01.360
is that if g is the
canonical link--
00:11:04.790 --> 00:11:05.940
what is the canonical link?
00:11:12.010 --> 00:11:15.440
So what is it canonical to?
00:11:15.440 --> 00:11:19.860
A canonical link, it's canonical
to a particular distribution
00:11:19.860 --> 00:11:22.720
in the canonical
exponential family, right?
00:11:22.720 --> 00:11:26.490
A canonical exponential family
is completely characterized
00:11:26.490 --> 00:11:28.640
by the function b.
00:11:28.640 --> 00:11:31.830
Which means that if I want to
talk about the canonical link,
00:11:31.830 --> 00:11:35.470
all I need to tell you
is how it depends on b.
00:11:35.470 --> 00:11:37.070
So what is g as a function of b?
00:11:40.332 --> 00:11:41.277
AUDIENCE: [INAUDIBLE]
00:11:41.277 --> 00:11:42.485
PHILIPPE RIGOLLET: b inverse.
00:11:45.120 --> 00:11:47.490
b prime inverse, right?
00:11:47.490 --> 00:11:52.710
So g is equal to
b prime inverse, which
00:11:52.710 --> 00:11:56.370
means that if g is composed
with b prime that means
00:11:56.370 --> 00:11:58.120
that this is just the identity.
00:12:02.530 --> 00:12:04.020
So h is the identity.
00:12:11.700 --> 00:12:18.870
So h of Xi transpose beta
is simply Xi transpose beta.
00:12:18.870 --> 00:12:22.650
And it's true that the way we
introduced the canonical link
00:12:22.650 --> 00:12:26.130
was just the function for
which we model directly theta i
00:12:26.130 --> 00:12:30.740
as Xi transpose beta, which
we can read off here, right?
00:12:30.740 --> 00:12:34.830
So theta i is simply
Xi transpose beta.
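In symbols, the canonical-link case collapses to:

```latex
g = (b')^{-1} \;\Longrightarrow\; h = (g \circ b')^{-1} = \mathrm{id} \;\Longrightarrow\; \theta_i = x_i^\top \beta
```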
00:12:34.830 --> 00:12:43.450
So now, for example, if I go
back to my log-likelihood,
00:12:43.450 --> 00:12:51.650
so if I look log-likelihood,
the log-likelihood
00:12:51.650 --> 00:12:56.150
is the sum of the logs
of the densities.
00:12:56.150 --> 00:13:06.740
So it's sum from i equal 1 to n
of log of exponential Yi theta
00:13:06.740 --> 00:13:14.740
i minus b theta i divided
by phi plus c of Yi phi.
00:13:14.740 --> 00:13:19.065
So this term does
not depend on theta.
00:13:19.065 --> 00:13:19.940
So I have two things.
00:13:19.940 --> 00:13:22.579
First of all, the log
and the exponential
00:13:22.579 --> 00:13:23.870
are going to cancel each other.
00:13:23.870 --> 00:13:28.230
And second, I actually
know that theta is just
00:13:28.230 --> 00:13:29.780
a function of beta.
00:13:29.780 --> 00:13:30.710
And it has this form.
00:13:30.710 --> 00:13:32.990
Theta i is h of
Xi transpose beta.
00:13:32.990 --> 00:13:34.740
And that's my
modeling assumption.
00:13:34.740 --> 00:13:41.300
So this is actually equal to the
sum from i equal 1 to n of Yi.
00:13:41.300 --> 00:13:45.010
And then here I'm going to
write h of Xi transpose beta
00:13:45.010 --> 00:13:54.010
minus b of h of Xi transpose
beta divided by phi.
00:13:54.010 --> 00:13:56.580
And then I have,
again, this function
00:13:56.580 --> 00:14:00.710
c of Yi phi, which
again won't matter.
00:14:00.710 --> 00:14:03.539
Because when I'm going to
try to maximize this thing,
00:14:03.539 --> 00:14:05.330
this is just playing
the role of a constant
00:14:05.330 --> 00:14:07.100
that's shifting the
entire function.
00:14:07.100 --> 00:14:10.648
In particular, your max is
going to be exactly what it was.
00:14:10.648 --> 00:14:12.480
OK?
00:14:12.480 --> 00:14:14.720
So this thing is really
not going to matter for me.
00:14:14.720 --> 00:14:16.440
I'm keeping track of it.
00:14:16.440 --> 00:14:20.400
And actually, if you look
here, it's gone, right?
00:14:20.400 --> 00:14:23.250
It's gone, because
it does not matter.
00:14:23.250 --> 00:14:26.865
So let's just pretend
it's not here,
00:14:26.865 --> 00:14:28.490
because it won't
matter when I'm trying
00:14:28.490 --> 00:14:31.845
to maximize the likelihood.
00:14:31.845 --> 00:14:33.840
OK?
00:14:33.840 --> 00:14:36.529
Well, it's here, "up to
constant term," it says.
00:14:36.529 --> 00:14:37.570
That's the constant term.
00:14:40.340 --> 00:14:41.670
All right, any question?
00:14:44.360 --> 00:14:49.120
All I'm doing here is replacing
my likelihood as a function
00:14:49.120 --> 00:14:50.410
of theta i's.
00:14:50.410 --> 00:14:52.690
So if I had one theta
i per observation,
00:14:52.690 --> 00:14:56.200
again, this would not
help me very much.
00:14:56.200 --> 00:14:58.690
But if I assume that they
are all linked together
00:14:58.690 --> 00:15:02.560
by saying that theta i is of
the form Xi transpose beta
00:15:02.560 --> 00:15:07.060
or h of Xi transpose beta if I'm
not using the canonical link,
00:15:07.060 --> 00:15:10.510
then I can hope to
make some estimation.
00:15:10.510 --> 00:15:14.260
And so, again, if I
have the canonical link,
00:15:14.260 --> 00:15:15.420
h is the identity.
00:15:15.420 --> 00:15:19.060
So I'm left only with Yi
times Xi transpose beta.
00:15:19.060 --> 00:15:21.060
And then I have b
of Xi transpose beta
00:15:21.060 --> 00:15:26.590
and not b composed with h,
because h is the identity,
00:15:26.590 --> 00:15:28.690
which is fairly simple, right?
00:15:28.690 --> 00:15:30.560
Why is it simple?
00:15:30.560 --> 00:15:35.489
Well, let's actually focus
on this guy for one second.
00:15:35.489 --> 00:15:38.030
So let me write it down, so we
know what we're talking about.
00:15:47.300 --> 00:15:52.420
So we just showed that the
log-likelihood when I use
00:15:52.420 --> 00:15:54.790
the canonical link--
00:15:54.790 --> 00:15:57.500
so that h is equal
to the identity,
00:15:57.500 --> 00:16:00.340
the log-likelihood
actually takes the form ln.
00:16:00.340 --> 00:16:01.960
And it depends on
a bunch of stuff.
00:16:01.960 --> 00:16:04.085
But let's just make it
depend only on the parameter
00:16:04.085 --> 00:16:06.520
that we care about,
which is beta, all right?
00:16:06.520 --> 00:16:09.700
So this is of the form l of
beta and that's equal to what?
00:16:09.700 --> 00:16:16.390
It's the sum from i equal 1 to n
of Yi Xi transpose beta minus--
00:16:16.390 --> 00:16:18.100
let me put the phi here.
00:16:18.100 --> 00:16:24.060
And then I'm going to have
minus b of Xi transpose beta.
00:16:28.621 --> 00:16:29.120
OK.
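As a concrete numerical illustration (not from the lecture itself, and with made-up data): for the Bernoulli family, that is, logistic regression, the cumulant generating function is b(theta) = log(1 + e^theta) with phi = 1, so the canonical-link log-likelihood above can be sketched as:

```python
import numpy as np

def log_likelihood(beta, X, y, phi=1.0):
    """Canonical-link GLM log-likelihood, up to the constant c(y, phi) term.

    Here b is the cumulant generating function; for the Bernoulli family
    (logistic regression), b(theta) = log(1 + exp(theta)) and phi = 1.
    """
    eta = X @ beta              # theta_i = x_i^T beta (canonical link)
    b = np.logaddexp(0.0, eta)  # b(theta) = log(1 + e^theta), numerically stable
    return np.sum(y * eta - b) / phi

# Tiny illustration with made-up data (intercept plus one covariate).
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
# At beta = 0, each term is y_i * 0 - log(2), so the value is -3 * log(2).
print(np.isclose(log_likelihood(np.zeros(2), X, y), -3 * np.log(2)))  # True
```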
00:16:29.120 --> 00:16:33.980
And phi we know is some
known positive term.
00:16:33.980 --> 00:16:37.856
So again, optimizing a
function plus some constant
00:16:37.856 --> 00:16:39.730
or optimizing of function
time as a constant,
00:16:39.730 --> 00:16:41.800
that's not going to
change much either.
00:16:41.800 --> 00:16:44.350
So it won't really matter to
think about whether this phi is
00:16:44.350 --> 00:16:45.340
here or not.
00:16:45.340 --> 00:16:48.320
But let's just think about
what this function looks like.
00:16:48.320 --> 00:16:50.410
I'm trying to
maximize a function.
00:16:50.410 --> 00:16:52.690
I'm trying to maximize
a log-likelihood.
00:16:52.690 --> 00:16:57.080
If it looked like this, that
would be a serious problem.
00:16:57.080 --> 00:16:59.965
But we can do like a
basic, you know, back
00:16:59.965 --> 00:17:03.700
of the envelope guess of what
the variations of this function
00:17:03.700 --> 00:17:04.930
are.
00:17:04.930 --> 00:17:08.490
This first term here is--
00:17:08.490 --> 00:17:11.302
as a function of beta, what
kind of function is it?
00:17:11.302 --> 00:17:12.010
AUDIENCE: Linear.
00:17:12.010 --> 00:17:13.880
PHILIPPE RIGOLLET:
It's linear, right?
00:17:13.880 --> 00:17:16.400
This is just Xi transpose beta.
00:17:16.400 --> 00:17:18.470
If I multiply beta
by 2, I get twice.
00:17:18.470 --> 00:17:21.050
If I add something to
beta, it just gets added,
00:17:21.050 --> 00:17:23.030
so it's a linear
function of beta.
00:17:23.030 --> 00:17:25.480
And so this thing is
both convex and concave.
00:17:25.480 --> 00:17:28.179
In the one-dimensional
case-- so think about p
00:17:28.179 --> 00:17:29.720
as being one-dimensional--
so if beta
00:17:29.720 --> 00:17:31.190
is a one-dimensional
thing, those
00:17:31.190 --> 00:17:34.070
are just functions that
look like this, right?
00:17:34.070 --> 00:17:35.670
Those are linear functions.
00:17:35.670 --> 00:17:40.220
They are both
convex and concave.
00:17:40.220 --> 00:17:42.380
So this is not
going to matter when
00:17:42.380 --> 00:17:44.659
it comes to the convexity
of my overall function,
00:17:44.659 --> 00:17:46.950
because I'm just adding
something which is just a line.
00:17:46.950 --> 00:17:49.340
And so if I started with convex,
it's going to stay convex.
00:17:49.340 --> 00:17:51.460
If I started with concave,
it's going to stay concave.
00:17:51.460 --> 00:17:53.376
And if I started with
something which is both,
00:17:53.376 --> 00:17:56.160
it's going to stay
both, meaning neither.
00:17:56.160 --> 00:17:57.621
It cannot be both.
00:17:57.621 --> 00:17:58.120
Yeah.
00:17:58.120 --> 00:18:00.619
So if you're neither convex or
concave, adding this linear--
00:18:00.619 --> 00:18:02.074
so this will not really matter.
00:18:02.074 --> 00:18:04.240
If I want to understand
what my function looks like,
00:18:04.240 --> 00:18:07.540
I need to understand what
b of Xi transpose beta does.
00:18:07.540 --> 00:18:10.210
Again, the Xi transpose beta--
00:18:10.210 --> 00:18:10.940
no impact.
00:18:10.940 --> 00:18:12.050
It's a linear function.
00:18:12.050 --> 00:18:15.220
In terms of convexity, it's
not going to play any role.
00:18:15.220 --> 00:18:18.561
So I really need to understand
what my function b looks like.
00:18:18.561 --> 00:18:19.810
What do we know about b again?
00:18:22.930 --> 00:18:30.610
So we know that b prime of
theta is equal to mu, right?
00:18:30.610 --> 00:18:34.630
Well, the mean of
a random variable
00:18:34.630 --> 00:18:36.069
in a canonical
exponential family
00:18:36.069 --> 00:18:37.610
can be a positive
or negative number.
00:18:37.610 --> 00:18:40.090
This really does not
tell me anything.
00:18:40.090 --> 00:18:41.290
That can be really anything.
00:18:41.290 --> 00:18:48.080
However, if I look at the
second derivative of b,
00:18:48.080 --> 00:18:49.780
I know that this is what?
00:18:49.780 --> 00:18:59.190
This is the variance
of Y divided by phi.
00:18:59.190 --> 00:19:00.840
That was my
dispersion parameter.
00:19:00.840 --> 00:19:03.730
The variance was equal to
phi times b prime prime.
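In symbols, the two facts being recalled here are:

```latex
b'(\theta_i) = \mu_i = \mathbb{E}[\,Y_i \mid X_i\,], \qquad b''(\theta_i) = \frac{\operatorname{Var}(Y_i \mid X_i)}{\phi} > 0
```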
00:19:03.730 --> 00:19:06.630
So we know that if
theta is not degenerate,
00:19:06.630 --> 00:19:08.640
meaning that the
distribution does not
00:19:08.640 --> 00:19:10.920
put all of its mass
at only one point,
00:19:10.920 --> 00:19:12.570
this thing is actually positive.
00:19:12.570 --> 00:19:14.070
And clearly, when
you have something
00:19:14.070 --> 00:19:16.770
that looks like this, unless you
have some crazy stuff happening
00:19:16.770 --> 00:19:20.755
with phi being equal to 0 or
anything that's not normal,
00:19:20.755 --> 00:19:22.630
then you will see that
you're not degenerate.
00:19:22.630 --> 00:19:24.690
So this thing is
strictly positive.
00:19:24.690 --> 00:19:27.060
And we've said several times
that if b prime prime is
00:19:27.060 --> 00:19:32.080
positive, then that means that's
the derivative of b prime,
00:19:32.080 --> 00:19:34.090
meaning that b
prime is increasing.
00:19:34.090 --> 00:19:36.090
And b prime is
increasing is just
00:19:36.090 --> 00:19:39.300
the same thing as saying
that b is convex, all right?
00:19:39.300 --> 00:19:44.850
So that implies that
b is strictly convex.
00:19:44.850 --> 00:19:47.760
And the strictly
comes from the fact
00:19:47.760 --> 00:19:50.180
that this is a strict sign.
00:19:50.180 --> 00:19:52.515
Well, I should not do that,
because now it's no longer.
00:19:52.515 --> 00:19:54.170
So it's just a
strict sign, meaning
00:19:54.170 --> 00:19:56.890
that the function, this
is not strictly convex,
00:19:56.890 --> 00:19:57.920
because it's linear.
00:19:57.920 --> 00:19:59.420
Strictly convex
means there's always
00:19:59.420 --> 00:20:02.479
some curvature everywhere.
00:20:02.479 --> 00:20:05.020
So now I have this thing that's
linear minus something that's
00:20:05.020 --> 00:20:07.150
convex.
00:20:07.150 --> 00:20:10.240
Something that's negative,
something convex, is concave.
00:20:10.240 --> 00:20:12.490
So this thing is
linear plus concave.
00:20:12.490 --> 00:20:13.840
So it is concave.
00:20:13.840 --> 00:20:18.280
So I know just by looking at
this that ln of beta, which,
00:20:18.280 --> 00:20:20.440
of course, is something
that lives in Rp,
00:20:20.440 --> 00:20:23.940
but if I saw it living in
R1 it would look like this.
00:20:23.940 --> 00:20:25.750
And if I saw it
living in R2, it would
00:20:25.750 --> 00:20:28.310
look like a dome like this.
00:20:28.310 --> 00:20:30.290
And the fact that
it's strict is also
00:20:30.290 --> 00:20:34.490
telling me that it is
actually a unique maximizer.
00:20:34.490 --> 00:20:37.400
So there's a unique maximizer
in Xi transpose beta,
00:20:37.400 --> 00:20:38.660
but not in beta necessarily.
00:20:38.660 --> 00:20:41.300
We're going to need extra
assumptions for this.
00:20:41.300 --> 00:20:41.800
OK.
00:20:41.800 --> 00:20:44.080
So this is what I say here.
00:20:44.080 --> 00:20:46.420
The log-likelihood
is strictly concave.
00:20:46.420 --> 00:20:49.175
And so, as a consequence, under
extra assumptions on the Xi's--
00:20:49.175 --> 00:20:55.240
because, of course, what if the
Xi's are all the same, right?
00:20:55.240 --> 00:20:58.070
So if the entries of Xi's--
00:20:58.070 --> 00:21:01.857
so if Xi is equal
to 1, 1, 1, 1, 1,
00:21:01.857 --> 00:21:05.920
then Xi transpose beta is
just the sum of the betas.
00:21:05.920 --> 00:21:10.300
And in the sum of the beta
i's, it will be strictly concave,
00:21:10.300 --> 00:21:13.070
but certainly not in
the individual entries.
00:21:13.070 --> 00:21:13.570
OK.
00:21:13.570 --> 00:21:17.320
So I need extra things on my
Xi's, so that this happens,
00:21:17.320 --> 00:21:19.270
just like we needed
the matrix capital
00:21:19.270 --> 00:21:23.060
X in the linear regression
case to be of full rank,
00:21:23.060 --> 00:21:25.371
so we could actually
identify what beta was.
00:21:25.371 --> 00:21:25.870
OK.
00:21:25.870 --> 00:21:27.536
It's going to be
exactly the same thing.
00:21:32.350 --> 00:21:35.270
So here, this is when we
have this very specific
00:21:35.270 --> 00:21:36.780
parametrization.
00:21:36.780 --> 00:21:39.000
And the question is--
00:21:39.000 --> 00:21:41.220
but it may not be the case
if we change the parameter
00:21:41.220 --> 00:21:42.510
beta into something else.
00:21:42.510 --> 00:21:43.170
OK.
00:21:43.170 --> 00:21:46.800
So here, the fact that we use
the canonical link, et cetera,
00:21:46.800 --> 00:21:49.710
everything actually works
really to our advantage,
00:21:49.710 --> 00:21:51.690
so that everything
becomes strictly concave.
00:21:51.690 --> 00:21:53.321
And we know exactly
what's happening.
00:21:56.637 --> 00:21:58.470
All right, so I understand
I went a bit fast
00:21:58.470 --> 00:22:00.460
on playing with convex
and concave functions.
00:22:00.460 --> 00:22:02.490
This is not the purpose.
00:22:02.490 --> 00:22:04.590
You know, I could
spend a lecture
00:22:04.590 --> 00:22:06.780
telling you, oh, if I add
two concave functions,
00:22:06.780 --> 00:22:08.480
then the result remains concave.
00:22:08.480 --> 00:22:11.500
If I add a concave and
a strictly concave,
00:22:11.500 --> 00:22:13.760
then the result still
remains strictly concave.
00:22:13.760 --> 00:22:16.070
And we could spend
time proving this.
00:22:16.070 --> 00:22:17.850
This was just for you
to get an intuition
00:22:17.850 --> 00:22:19.020
as to why this is correct.
00:22:19.020 --> 00:22:22.090
But we don't really have time
to go into too much detail.
00:22:22.090 --> 00:22:23.971
One thing you can do--
00:22:23.971 --> 00:22:26.220
a strictly concave function,
if it's in one dimension,
00:22:26.220 --> 00:22:32.010
all I need to have is that the
second derivative is strictly
00:22:32.010 --> 00:22:32.820
negative, right?
00:22:32.820 --> 00:22:34.500
That's a strictly
concave function.
00:22:34.500 --> 00:22:38.340
That was the analytic definition
we had for strict concavity.
00:22:38.340 --> 00:22:40.080
So if this was in
one dimension, it
00:22:40.080 --> 00:22:45.670
would look like this,
Yi times Xi times beta.
00:22:45.670 --> 00:22:47.640
Now, beta is just one number.
00:22:47.640 --> 00:22:52.500
And then I would have
minus b of Xi times beta.
00:22:52.500 --> 00:22:54.870
And this is all over phi.
00:22:54.870 --> 00:22:56.160
You take second derivatives.
00:22:56.160 --> 00:22:59.310
The fact that this is linear in
beta, this is going to go away.
00:22:59.310 --> 00:23:03.050
And here, I'm just going
to be left with minus--
00:23:03.050 --> 00:23:07.900
so if I take the second
derivative with respect
00:23:07.900 --> 00:23:13.490
to beta, this is going to be
equal to minus b prime prime Xi
00:23:13.490 --> 00:23:18.520
beta times Xi squared
divided by phi.
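Collecting the one-dimensional computation:

```latex
\ell(\beta) = \frac{1}{\phi} \sum_{i=1}^n \bigl( y_i\, x_i \beta - b(x_i \beta) \bigr)
\quad\Longrightarrow\quad
\ell''(\beta) = -\frac{1}{\phi} \sum_{i=1}^n b''(x_i \beta)\, x_i^2 \;<\; 0
```

with strict inequality as long as some x_i is nonzero.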
00:23:18.520 --> 00:23:21.790
So this part is clearly positive, which makes the whole thing negative.
00:23:21.790 --> 00:23:25.126
If Xi is 0, this is degenerate,
so I would not get it.
00:23:25.126 --> 00:23:26.500
Then I have the
second derivative
00:23:26.500 --> 00:23:28.180
of b, b prime prime, which
I know is positive,
00:23:28.180 --> 00:23:30.840
because of the variance
thing that I have here,
00:23:30.840 --> 00:23:32.270
divided by phi.
00:23:32.270 --> 00:23:34.320
And so that would all be fine.
00:23:34.320 --> 00:23:35.377
That's for one dimension.
00:23:35.377 --> 00:23:37.210
If I wanted to do this
in higher dimensions,
00:23:37.210 --> 00:23:39.070
I would have to say
that the Hessian is
00:23:39.070 --> 00:23:41.440
a positive definite matrix.
00:23:41.440 --> 00:23:44.352
And that's maybe a bit
beyond what this course is about.
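In higher dimensions the same check can still be sketched numerically; for instance, for logistic regression (an illustration with made-up data), the Hessian of the log-likelihood is -X^T W X / phi with W = diag(b''(x_i^T beta)), and one can verify it is negative definite when X has full column rank:

```python
import numpy as np

def loglik_hessian(beta, X, phi=1.0):
    """Hessian of the logistic-regression log-likelihood: -X^T W X / phi,
    where W = diag(b''(x_i^T beta)) and, for the Bernoulli family,
    b''(t) = sigma(t) * (1 - sigma(t)) with sigma the logistic function."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # sigma(x_i^T beta)
    w = p * (1.0 - p)                      # b''(theta_i), always in (0, 1/4]
    return -(X.T * w) @ X / phi            # sum_i -w_i x_i x_i^T / phi

# Made-up design matrix with full column rank (intercept plus one covariate).
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
H = loglik_hessian(np.array([0.3, -0.2]), X)
print(np.all(np.linalg.eigvalsh(H) < 0))  # True: the log-likelihood is strictly concave
```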
00:23:50.760 --> 00:23:54.040
So in the rest of
this chapter, I
00:23:54.040 --> 00:23:56.530
will do what I did
not do when we talked
00:23:56.530 --> 00:23:58.450
about maximum likelihood.
00:23:58.450 --> 00:24:00.100
And what we're
going to do is we're
00:24:00.100 --> 00:24:03.130
going to actually show how to
do this maximization, right?
00:24:03.130 --> 00:24:06.060
So here, we know that
the function is concave.
00:24:06.060 --> 00:24:08.890
But what it looks
like specifically
00:24:08.890 --> 00:24:11.350
depends on what b is.
00:24:11.350 --> 00:24:15.562
And for different b's, I'm going
to have different things to do,
00:24:15.562 --> 00:24:17.770
just like when I was talking
about maximum likelihood
00:24:17.770 --> 00:24:22.120
estimation, if it had a concave
log-likelihood function,
00:24:22.120 --> 00:24:23.040
I could optimize it.
00:24:23.040 --> 00:24:24.790
But depending on
what the function is,
00:24:24.790 --> 00:24:26.350
I would actually
need some algorithms
00:24:26.350 --> 00:24:29.400
that may be working better on
some functions than others.
00:24:29.400 --> 00:24:32.530
Now, here I don't
have random things.
00:24:32.530 --> 00:24:34.960
I have that b is the
cumulant generating
00:24:34.960 --> 00:24:38.480
function of a canonical
exponential family.
00:24:38.480 --> 00:24:42.280
And there is a way for me
to sort of leverage that.
00:24:42.280 --> 00:24:45.610
So not only is there the
b part, but there's also
00:24:45.610 --> 00:24:46.930
the linear part.
00:24:46.930 --> 00:24:49.000
And if I start
trying to use that,
00:24:49.000 --> 00:24:51.760
I'm actually going to be
able to devise very specific
00:24:51.760 --> 00:24:53.139
optimization algorithms.
00:24:53.139 --> 00:24:54.930
And the way I'm going
to be able to do this
00:24:54.930 --> 00:24:57.220
is by thinking of
simple black box
00:24:57.220 --> 00:24:59.560
optimization to which I can
actually technically feed
00:24:59.560 --> 00:25:00.434
any function.
00:25:00.434 --> 00:25:02.350
But it's going to turn
out that the iterations
00:25:02.350 --> 00:25:05.170
of these iterative
algorithms are going
00:25:05.170 --> 00:25:10.450
to look very
familiar when we just
00:25:10.450 --> 00:25:13.360
plug in the particular values
of b, of the log-likelihood
00:25:13.360 --> 00:25:15.500
that we have for this problem.
00:25:15.500 --> 00:25:19.030
And so the three methods we're
going to talk about going from
00:25:19.030 --> 00:25:22.199
more black box-- meaning you can
basically stuff in any function
00:25:22.199 --> 00:25:24.490
that's going to work, any
concave function that's going
00:25:24.490 --> 00:25:27.850
to work, all the way to
this is working specifically
00:25:27.850 --> 00:25:29.890
for generalized linear models--
00:25:29.890 --> 00:25:31.670
are Newton-Raphson method.
00:25:31.670 --> 00:25:35.100
Who's already heard about
the Newton-Raphson method?
00:25:35.100 --> 00:25:40.930
So there are probably some people
who actually learned this algorithm
00:25:40.930 --> 00:25:43.400
without even knowing the
word algorithm, right?
00:25:43.400 --> 00:25:44.514
It's a method.
00:25:44.514 --> 00:25:46.930
Typically, it's supposed to
be finding roots of functions.
00:25:46.930 --> 00:25:49.990
But finding the root of a
function of the derivative
00:25:49.990 --> 00:25:53.150
is the same as finding
the minimum of a function.
00:25:53.150 --> 00:25:55.560
So that's the first
black box method.
00:25:55.560 --> 00:25:57.945
I mean, it's pretty old.
00:25:57.945 --> 00:25:59.320
And then there's
something that's
00:25:59.320 --> 00:26:01.750
very specific to what we're
doing, which is called--
00:26:01.750 --> 00:26:03.430
so this Newton-Raphson
method is going
00:26:03.430 --> 00:26:06.820
to involve the Hessian
of our log-likelihood.
00:26:06.820 --> 00:26:09.246
And since we know
something about the Hessian
00:26:09.246 --> 00:26:11.620
for a particular problem,
we're going to be able to move on
00:26:11.620 --> 00:26:13.030
to Fisher-scoring.
00:26:13.030 --> 00:26:15.790
And the word Fisher here
is actually exactly coming
00:26:15.790 --> 00:26:17.260
from Fisher information.
00:26:17.260 --> 00:26:20.450
So the Hessian is going to
involve the Fisher information.
00:26:20.450 --> 00:26:25.600
And finally, we will talk about
iteratively re-weighted least
00:26:25.600 --> 00:26:26.440
squares.
00:26:26.440 --> 00:26:28.090
And that's not for any function.
00:26:28.090 --> 00:26:29.710
It's really when
we're trying to use
00:26:29.710 --> 00:26:35.076
the fact that there is this
linear dependence on the Xi.
00:26:35.076 --> 00:26:37.450
And this is essentially going
to tell us, well, you know,
00:26:37.450 --> 00:26:39.655
you can use least squares
for linear regression.
00:26:39.655 --> 00:26:41.530
Here, you can use least
squares, but locally,
00:26:41.530 --> 00:26:42.810
and you have to iterate.
00:26:42.810 --> 00:26:43.510
OK.
00:26:43.510 --> 00:26:46.510
And this last part is
essentially a trick
00:26:46.510 --> 00:26:49.060
by statisticians
to be able to solve
00:26:49.060 --> 00:26:52.570
the Newton-Raphson
updates without actually
00:26:52.570 --> 00:26:54.550
having a dedicated
software for this,
00:26:54.550 --> 00:26:59.361
but just being able to reuse
some least squares software.
00:26:59.361 --> 00:26:59.860
OK.
00:26:59.860 --> 00:27:02.830
So you know, we've talked
about this many times.
00:27:02.830 --> 00:27:05.740
I just want to make sure that
we're all on the same page
00:27:05.740 --> 00:27:07.030
here.
00:27:07.030 --> 00:27:09.790
We have a function f.
00:27:09.790 --> 00:27:12.100
We're going to assume that
it has two derivatives.
00:27:12.100 --> 00:27:14.800
And it's a function
from Rm to R.
00:27:14.800 --> 00:27:16.960
So its first derivative
is called the gradient.
00:27:16.960 --> 00:27:20.590
That's the vector that collects
all the partial derivatives
00:27:20.590 --> 00:27:23.040
with respect to each
of the coordinates.
00:27:23.040 --> 00:27:25.180
It's dimension m, of course.
00:27:25.180 --> 00:27:28.520
And the second derivative
is an m by m matrix.
00:27:28.520 --> 00:27:29.980
It's called the Hessian.
00:27:29.980 --> 00:27:34.900
And in the ith row and
jth column, you see
00:27:34.900 --> 00:27:37.330
the second partial
derivative with respect
00:27:37.330 --> 00:27:39.820
to the ith component
and the jth component.
00:27:39.820 --> 00:27:40.620
OK.
00:27:40.620 --> 00:27:41.960
We've seen that several times.
00:27:41.960 --> 00:27:44.440
This is just
multi-variable calculus.
00:27:44.440 --> 00:27:48.490
But really the point here
is that maybe the notation
00:27:48.490 --> 00:27:51.370
is slightly different, because
I want to keep track of f.
00:27:51.370 --> 00:27:55.090
So when I write the gradient,
I write nabla sub f.
00:27:55.090 --> 00:28:00.190
And when I write the Hessian,
I write H sub f.
00:28:00.190 --> 00:28:03.520
And as I said, if f
is strictly concave,
00:28:03.520 --> 00:28:05.950
then Hf of x is
negative definite.
00:28:05.950 --> 00:28:14.250
What it means is that if
I take any x in Rm, then x
00:28:14.250 --> 00:28:20.420
transpose Hf at x0 times x, well,
that's for any x0 and any x,
00:28:20.420 --> 00:28:23.520
this is actually
strictly negative.
00:28:23.520 --> 00:28:25.650
That's what it means to
be negative definite.
00:28:25.650 --> 00:28:26.640
OK?
00:28:26.640 --> 00:28:30.160
So every time I do x transpose--
00:28:30.160 --> 00:28:31.940
so this is like
a quadratic form.
00:28:31.940 --> 00:28:35.580
And I want it to be negative
for all values of x0 and x,
00:28:35.580 --> 00:28:36.900
both of them.
00:28:36.900 --> 00:28:39.180
That's very strong, clearly.
00:28:39.180 --> 00:28:41.340
But for us, actually,
this is what happens just
00:28:41.340 --> 00:28:42.848
because of the properties of b.
00:28:45.540 --> 00:28:48.150
Well, at least
the fact that it's
00:28:48.150 --> 00:28:50.850
negative, less than
or equal to, if I
00:28:50.850 --> 00:28:54.500
want it to be strictly less than 0,
I need some properties on X.
00:28:54.500 --> 00:28:56.630
And then I will call the
Hessian map the function
00:28:56.630 --> 00:29:02.270
that maps x to this
matrix Hf of x.
00:29:02.270 --> 00:29:04.080
So that's just the
second derivative at x.
00:29:04.080 --> 00:29:04.580
Yeah.
00:29:04.580 --> 00:29:09.750
AUDIENCE: When you
what are [INAUDIBLE]??
00:29:09.750 --> 00:29:11.776
PHILIPPE RIGOLLET:
Where do [INAUDIBLE]??
00:29:11.776 --> 00:29:13.589
Oh, yeah.
00:29:13.589 --> 00:29:16.130
I mean, you know, you need to
be able to apply Schwarz lemma.
00:29:16.130 --> 00:29:19.466
Let's say two continuous
derivatives-- that's smooth.
00:29:19.466 --> 00:29:21.542
AUDIENCE: [INAUDIBLE]
00:29:21.542 --> 00:29:23.000
PHILIPPE RIGOLLET:
No, that's fine.
00:29:29.190 --> 00:29:29.690
OK.
00:29:29.690 --> 00:29:32.390
So how does the
Newton-Raphson method work?
00:29:32.390 --> 00:29:35.180
Well, what it does is that it
forms a quadratic approximation
00:29:35.180 --> 00:29:37.040
to your function.
00:29:37.040 --> 00:29:39.770
And that's the one it optimizes
at every single point.
00:29:39.770 --> 00:29:40.370
OK.
00:29:40.370 --> 00:29:41.745
And the reason is
because we have
00:29:41.745 --> 00:29:44.600
a closed-form solution
to defining the minimum
00:29:44.600 --> 00:29:46.394
of a quadratic function.
00:29:46.394 --> 00:29:47.810
So if I give you
a function that's
00:29:47.810 --> 00:29:51.160
of the form ax squared
plus b x plus c,
00:29:51.160 --> 00:29:54.350
you know exactly a closed
form for its minimum.
00:29:54.350 --> 00:29:57.930
But if I give you any
function or, let's say--
00:29:57.930 --> 00:29:58.460
yeah, yeah.
00:29:58.460 --> 00:29:59.793
So here, it's all about maximum.
00:29:59.793 --> 00:30:01.100
I'm sorry.
00:30:01.100 --> 00:30:03.830
If you're confused with
me using the word minimum,
00:30:03.830 --> 00:30:06.890
just assume that it
was the word maximum.
00:30:06.890 --> 00:30:07.910
So this is how it works.
00:30:07.910 --> 00:30:08.410
OK.
00:30:08.410 --> 00:30:14.070
If I give you a function which
is concave, say quadratic.
00:30:14.070 --> 00:30:14.570
OK.
00:30:14.570 --> 00:30:15.903
So it's going to look like this.
00:30:19.050 --> 00:30:21.630
So that's of the
form ax squared--
00:30:21.630 --> 00:30:23.405
where a is negative, of course--
00:30:23.405 --> 00:30:25.670
plus bx plus c.
00:30:25.670 --> 00:30:28.550
Then you can solve
your whatever.
00:30:28.550 --> 00:30:31.100
You can take the derivative of
this guy, set it equal to 0,
00:30:31.100 --> 00:30:33.040
and you will have
an exact equation
00:30:33.040 --> 00:30:37.530
into what the value of x is
that it realizes this maximum.
00:30:37.530 --> 00:30:42.840
If I give you any
function that's concave,
00:30:42.840 --> 00:30:43.970
it's less clear, right?
00:30:43.970 --> 00:30:46.700
I mean, if I tell you the
function that we have here
00:30:46.700 --> 00:30:50.390
is of the form
ax minus b of x,
00:30:50.390 --> 00:30:53.960
then I'm just going to have
something that inverts b prime.
00:30:53.960 --> 00:30:56.050
But how do I do that exactly?
00:30:56.050 --> 00:30:57.140
It's not clear.
00:30:57.140 --> 00:31:00.530
And so what we do is we do a
quadratic approximation, which
00:31:00.530 --> 00:31:03.170
should be true approximately
everywhere, right?
00:31:03.170 --> 00:31:05.630
So I'm at this point
here, I'm going to say,
00:31:05.630 --> 00:31:08.964
oh, I'm close to
being that function.
00:31:08.964 --> 00:31:10.380
And if I'm at this
point here, I'm
00:31:10.380 --> 00:31:12.540
going to be close to
being that function.
00:31:12.540 --> 00:31:14.820
And for this function,
I can actually optimize.
00:31:14.820 --> 00:31:17.670
And so if I'm not moving too
far from one to the other,
00:31:17.670 --> 00:31:19.500
I should actually get something.
00:31:19.500 --> 00:31:21.960
So here's how the quadratic
approximation works.
00:31:21.960 --> 00:31:27.791
I'm going to write the second
order Taylor expansion.
00:31:31.400 --> 00:31:32.420
OK.
00:31:32.420 --> 00:31:35.160
And so that's just going to
be my quadratic approximation.
00:31:35.160 --> 00:31:38.100
It's going to say, oh,
f of x, when x is close
00:31:38.100 --> 00:31:40.710
to some point x0,
is going to close
00:31:40.710 --> 00:31:48.870
to f of x0 plus the gradient of
f at x0 transpose x minus x0.
00:31:48.870 --> 00:31:53.640
And then I'm going to have
plus 1/2 x minus x0 transpose
00:31:53.640 --> 00:31:58.020
Hf at x0 times x minus x0,
right--
00:31:58.020 --> 00:31:59.760
x minus x0.
00:31:59.760 --> 00:32:02.310
So that's just my second
order Taylor expansion
00:32:02.310 --> 00:32:03.820
the multivariate one.
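This second-order expansion can be sanity-checked numerically. The function f below is an assumed example for illustration, not one from the lecture.

```python
import numpy as np

# Assumed strictly concave test function on R^2 (not from the lecture).
def f(x):
    return -(x[0]**2 + 2.0 * x[1]**2) + np.sin(x[0])

def grad_f(x):
    return np.array([-2.0 * x[0] + np.cos(x[0]), -4.0 * x[1]])

def hess_f(x):
    return np.array([[-2.0 - np.sin(x[0]), 0.0],
                     [0.0, -4.0]])

x0 = np.array([0.3, -0.2])
dx = np.array([1e-2, -1e-2])   # a small step away from x0

# f(x0 + dx) is f(x0) + grad^T dx + 1/2 dx^T H dx, up to a third-order
# remainder in the size of dx: the quadratic model is locally accurate.
quad = f(x0) + grad_f(x0) @ dx + 0.5 * dx @ hess_f(x0) @ dx
assert abs(f(x0 + dx) - quad) < 1e-5
```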
00:32:03.820 --> 00:32:06.000
And let's say x0 is this guy.
00:32:08.840 --> 00:32:13.110
Now, what I'm going
to do is say, OK,
00:32:13.110 --> 00:32:15.290
if I wanted to set this
derivative of this guy
00:32:15.290 --> 00:32:17.900
equal to 0, I would
just have to solve,
00:32:17.900 --> 00:32:21.260
well, you know, f
prime of x equals 0,
00:32:21.260 --> 00:32:26.300
meaning that x has to
be f prime inverse of 0.
00:32:26.300 --> 00:32:29.630
And really apart from like being
some notation manipulation,
00:32:29.630 --> 00:32:31.670
this is really not helping me.
00:32:31.670 --> 00:32:32.170
OK.
00:32:32.170 --> 00:32:35.140
Because I don't know
what f prime inverse of 0
00:32:35.140 --> 00:32:36.880
is in many instances.
00:32:36.880 --> 00:32:39.910
However if f has a very
specific form which
00:32:39.910 --> 00:32:43.340
is something that depends
on x in a very specific way,
00:32:43.340 --> 00:32:45.670
there's just a linear term
and then a quadratic term,
00:32:45.670 --> 00:32:47.690
then I can actually
do something.
00:32:47.690 --> 00:32:49.610
So let's forget
about this approach.
00:32:49.610 --> 00:32:51.580
And rather than
minimizing f, let's just
00:32:51.580 --> 00:32:54.590
minimize the right-hand side.
00:32:54.590 --> 00:32:55.090
OK.
00:32:55.090 --> 00:32:59.200
So sorry-- maximize.
00:32:59.200 --> 00:33:06.820
So maximize the right-hand side.
00:33:06.820 --> 00:33:08.060
And so how do I get this?
00:33:08.060 --> 00:33:11.471
Well, I just set the
gradient equal to 0.
00:33:11.471 --> 00:33:12.470
So what is the gradient?
00:33:12.470 --> 00:33:15.600
The first term does
not depend on x.
00:33:15.600 --> 00:33:21.420
So that means that this
is going to be 0 plus--
00:33:21.420 --> 00:33:25.410
what is the gradient of this
thing, of the gradient of f
00:33:25.410 --> 00:33:27.275
at x0 transpose x minus x0?
00:33:27.275 --> 00:33:28.650
What is the gradient
of this guy?
00:33:40.300 --> 00:33:43.540
So I have a function of
the form b transpose x.
00:33:43.540 --> 00:33:46.648
What is the gradient
of this thing?
00:33:46.648 --> 00:33:47.582
AUDIENCE: [INAUDIBLE]
00:33:47.582 --> 00:33:49.450
PHILIPPE RIGOLLET: I'm sorry?
00:33:49.450 --> 00:33:50.326
AUDIENCE: [INAUDIBLE]
00:33:50.326 --> 00:33:52.866
PHILIPPE RIGOLLET: I'm writing
everything in two-column form,
00:33:52.866 --> 00:33:53.900
right?
00:33:53.900 --> 00:33:55.311
So it's just b.
00:33:55.311 --> 00:33:55.810
OK.
00:33:55.810 --> 00:33:57.470
So here, what is b?
00:33:57.470 --> 00:33:59.964
Well, it's gradient of f at x0.
00:34:03.540 --> 00:34:04.040
OK.
00:34:04.040 --> 00:34:07.119
And this term here gradient
of f at x0 transpose x0
00:34:07.119 --> 00:34:07.910
is just a constant.
00:34:07.910 --> 00:34:10.560
This thing is
going away as well.
00:34:10.560 --> 00:34:13.850
And then I'm looking at the
derivative of this guy here.
00:34:13.850 --> 00:34:15.320
And this is like
a quadratic term.
00:34:15.320 --> 00:34:18.931
It's like H times
x minus x0 squared.
00:34:18.931 --> 00:34:20.639
So when I'm going to
take the derivative,
00:34:20.639 --> 00:34:22.097
I'm going to have
a factor 2 that's
00:34:22.097 --> 00:34:23.924
going to pop out and
cancel this one half.
00:34:23.924 --> 00:34:25.340
And then I'm going
to be left only
00:34:25.340 --> 00:34:28.360
with this part times this part.
00:34:28.360 --> 00:34:28.860
OK.
00:34:28.860 --> 00:34:36.320
So that's plus Hf at x0 times x minus x0.
00:34:36.320 --> 00:34:36.820
OK.
00:34:36.820 --> 00:34:39.100
So that's just a gradient.
00:34:39.100 --> 00:34:40.810
And I want it to be equal to 0.
00:34:40.810 --> 00:34:44.910
So I'm just going to
solve this equal to 0.
00:34:44.910 --> 00:34:46.830
OK?
00:34:46.830 --> 00:34:49.860
So that means that if I
want to find the minimum,
00:34:49.860 --> 00:34:53.400
this is just going to be
the x* that satisfies this.
00:34:53.400 --> 00:35:07.860
So that's actually equivalent
to Hf times x* is equal to Hf x0
00:35:07.860 --> 00:35:14.010
minus gradient f at x0.
00:35:14.010 --> 00:35:16.140
Now, this is a much
easier thing to solve.
00:35:16.140 --> 00:35:16.910
What is this?
00:35:21.410 --> 00:35:25.240
This is just a system of
linear equations, right?
00:35:25.240 --> 00:35:29.170
I just need to find the x* such
that when I pre-multiply it
00:35:29.170 --> 00:35:33.120
by a matrix I get this vector
on the right-hand side.
00:35:33.120 --> 00:35:37.890
This is just something
of the form ax equals b.
00:35:37.890 --> 00:35:39.400
And I have many
ways I can do this.
00:35:39.400 --> 00:35:41.760
I could do Gaussian
elimination, or I
00:35:41.760 --> 00:35:46.200
could use Spielman's
fast Laplacian solvers
00:35:46.200 --> 00:35:49.530
if I had some particular
properties of H. I mean,
00:35:49.530 --> 00:35:54.190
there's huge activity in terms
of how to solve those systems.
00:35:54.190 --> 00:35:56.550
But let's say I have some time.
00:35:56.550 --> 00:35:58.230
It's not a huge problem.
00:35:58.230 --> 00:35:59.910
I can actually just
use linear algebra.
00:35:59.910 --> 00:36:06.590
And linear algebra just tells me
that x* is equal to Hf inverse
00:36:06.590 --> 00:36:16.010
times this guy, where those
two guys are going to cancel.
00:36:16.010 --> 00:36:23.630
So this is actually equal to
x0 minus Hf inverse gradient f
00:36:23.630 --> 00:36:26.460
at x0.
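In code, this is exactly how a Newton iteration is usually implemented: solve the linear system Hf x* = Hf x0 minus the gradient, rather than forming the inverse explicitly. A minimal sketch, with an assumed concave quadratic objective (on a quadratic, one Newton step lands exactly on the maximizer):

```python
import numpy as np

def newton_step(x0, grad_f, hess_f):
    """One Newton iteration: solve H x* = H x0 - grad(x0), no explicit inverse."""
    H = hess_f(x0)
    g = grad_f(x0)
    return np.linalg.solve(H, H @ x0 - g)

# Assumed concave quadratic f(x) = -1/2 x^T A x + b^T x, maximized where
# A x = b, i.e. at x = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])     # positive definite, so f is concave
b = np.array([1.0, -1.0])
grad_f = lambda x: -A @ x + b
hess_f = lambda x: -A                      # Hessian is constant and negative definite

x_star = newton_step(np.array([10.0, -7.0]), grad_f, hess_f)
# A single step from any starting point hits the exact maximizer.
assert np.allclose(x_star, np.linalg.solve(A, b))
```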
00:36:26.460 --> 00:36:28.920
And that's just what's
called a Newton iteration.
00:36:28.920 --> 00:36:31.990
I started at some x0.
00:36:31.990 --> 00:36:34.610
I'm at some x0, where I
make my approximation.
00:36:34.610 --> 00:36:37.190
And it's telling me
starting from this x0,
00:36:37.190 --> 00:36:40.890
if I wanted to fully optimize
a quadratic approximation,
00:36:40.890 --> 00:36:43.310
I would just have
to take the x*.
00:36:43.310 --> 00:36:44.830
That's this guy.
00:36:44.830 --> 00:36:48.800
And then I could just use this
guy as my x0 and do it again,
00:36:48.800 --> 00:36:50.160
and again, and again, and again.
00:36:50.160 --> 00:36:52.070
And those are called
Newton iterations.
00:36:52.070 --> 00:36:54.320
And they're basically
the workhorse
00:36:54.320 --> 00:36:57.830
of interior point methods,
for example, a lot
00:36:57.830 --> 00:36:59.930
of optimization algorithms.
00:36:59.930 --> 00:37:01.770
And that's what
you can see here.
00:37:01.770 --> 00:37:05.480
x* is equal to x0 minus
the inverse Hessian times
00:37:05.480 --> 00:37:07.370
the gradient.
00:37:07.370 --> 00:37:10.844
We briefly mentioned
gradient descent
00:37:10.844 --> 00:37:13.010
at some point,
00:37:13.010 --> 00:37:15.350
to optimize the convex
function, right?
00:37:15.350 --> 00:37:22.680
And if I wanted to use gradient
descent, again, H is a matrix.
00:37:22.680 --> 00:37:24.810
But if I wanted to think
of H as being a scalar,
00:37:24.810 --> 00:37:26.559
would it be a positive
or negative number?
00:37:31.897 --> 00:37:33.388
Yeah.
00:37:33.388 --> 00:37:34.382
AUDIENCE: [INAUDIBLE]
00:37:34.382 --> 00:37:36.370
PHILIPPE RIGOLLET: Why?
00:37:36.370 --> 00:37:39.870
AUDIENCE: [INAUDIBLE]
00:37:39.870 --> 00:37:40.870
PHILIPPE RIGOLLET: Yeah.
00:37:40.870 --> 00:37:42.250
So that would be this.
00:37:42.250 --> 00:37:44.516
So I want to move against
the gradient to do what?
00:37:44.516 --> 00:37:45.390
AUDIENCE: [INAUDIBLE]
00:37:45.390 --> 00:37:46.020
PHILIPPE RIGOLLET: To minimize.
00:37:46.020 --> 00:37:47.790
But I'm maximizing here, right?
00:37:47.790 --> 00:37:49.140
Everything is maximized, right?
00:37:49.140 --> 00:37:52.950
So I know that H is
actually negative definite.
00:37:52.950 --> 00:37:55.620
So it's a negative number.
00:37:55.620 --> 00:37:59.040
So you have the same
confusions they do.
00:37:59.040 --> 00:38:01.390
We're maximizing a
concave function here.
00:38:01.390 --> 00:38:02.790
So H is negative.
00:38:02.790 --> 00:38:07.710
So this is something of
the form x0 plus something
00:38:07.710 --> 00:38:10.090
times the gradient.
00:38:10.090 --> 00:38:13.270
And this is what your gradient
ascent, rather than descent,
00:38:13.270 --> 00:38:15.090
would look like.
00:38:15.090 --> 00:38:17.130
And all it's saying,
Newton is telling
00:38:17.130 --> 00:38:19.770
you don't take the
gradient for granted
00:38:19.770 --> 00:38:21.690
as a direction in
which you want to go.
00:38:21.690 --> 00:38:25.410
It says, do a slight change
of coordinates before you
00:38:25.410 --> 00:38:30.840
do this according to what your
Hessian looks like, all right?
00:38:30.840 --> 00:38:34.480
And those are called second
order methods that require
00:38:34.480 --> 00:38:36.410
knowing what the Hessian is.
00:38:36.410 --> 00:38:39.271
But those are actually
much more powerful
00:38:39.271 --> 00:38:41.020
than the gradient
descent, because they're
00:38:41.020 --> 00:38:43.690
using all of the local
geometry of the problem.
00:38:43.690 --> 00:38:45.820
All of the local
geometry of your function
00:38:45.820 --> 00:38:48.370
is completely encoded
in this Hessian.
00:38:48.370 --> 00:38:50.290
And in particular it
implies that it tells you
00:38:50.290 --> 00:38:53.500
whether to
go slower in some places
00:38:53.500 --> 00:38:55.550
or go faster in other places.
00:38:55.550 --> 00:38:58.870
Now, this in practice for,
say, modern large scale
00:38:58.870 --> 00:39:02.170
machine learning problems,
inverting this matrix H
00:39:02.170 --> 00:39:03.610
is extremely painful.
00:39:03.610 --> 00:39:05.180
It takes too much time.
00:39:05.180 --> 00:39:08.450
The matrix is too big, and
computers cannot do it.
00:39:08.450 --> 00:39:10.780
And people resort
to what's called
00:39:10.780 --> 00:39:13.810
quasi-Newton methods,
which essentially try
00:39:13.810 --> 00:39:15.760
to emulate what this guy is.
00:39:15.760 --> 00:39:17.920
And there's many
ways you can do this.
00:39:17.920 --> 00:39:20.170
Some of them is
by using gradients
00:39:20.170 --> 00:39:23.080
that you've collected
in the past.
00:39:23.080 --> 00:39:27.435
Some of them just say,
well let's just pretend H
00:39:27.435 --> 00:39:29.554
is diagonal.
00:39:29.554 --> 00:39:30.970
There's a lot of
things you can do
00:39:30.970 --> 00:39:32.980
to just play around this
and not actually have
00:39:32.980 --> 00:39:34.740
to invert this matrix.
00:39:34.740 --> 00:39:36.620
OK?
00:39:36.620 --> 00:39:38.840
So once you have this,
you started from x0.
00:39:38.840 --> 00:39:44.630
It tells you which x* you can
get as a maximizer of the local
00:39:44.630 --> 00:39:47.220
quadratic approximation
to your function.
00:39:47.220 --> 00:39:49.220
You can actually just
iterate that, all right?
00:39:49.220 --> 00:39:52.940
So you start at
some x0 somewhere.
00:39:52.940 --> 00:39:55.040
And then once you
get to some xk,
00:39:55.040 --> 00:39:57.510
you just do the iteration
which is described,
00:39:57.510 --> 00:39:59.690
which is just find
an xk plus 1, which
00:39:59.690 --> 00:40:03.860
is the maximizer of the
local quadratic approximation
00:40:03.860 --> 00:40:09.011
to your function at xk and
repeat until convergence.
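The iteration just described might be sketched as follows; the objective below is a made-up strictly concave example, not from the lecture.

```python
import numpy as np

def newton_raphson(grad_f, hess_f, x0, tol=1e-10, max_iter=50):
    """Maximize a smooth strictly concave function by Newton iterations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # Newton step: solve H(xk) step = grad(xk), then xk+1 = xk - step.
        step = np.linalg.solve(hess_f(x), grad_f(x))
        x = x - step
        if np.linalg.norm(step) < tol:   # stop once the update is negligible
            break
    return x

# Assumed example: f(x, y) = -exp(x) - exp(y) + x + 2y, strictly concave,
# maximized where exp(x) = 1 and exp(y) = 2, i.e. at (0, ln 2).
grad_f = lambda v: np.array([1.0 - np.exp(v[0]), 2.0 - np.exp(v[1])])
hess_f = lambda v: np.diag([-np.exp(v[0]), -np.exp(v[1])])

x_hat = newton_raphson(grad_f, hess_f, [0.0, 0.0])
assert np.allclose(x_hat, [0.0, np.log(2.0)])
```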
00:40:09.011 --> 00:40:09.510
OK.
00:40:09.510 --> 00:40:13.780
So if this was an
optimization class,
00:40:13.780 --> 00:40:16.530
we would prove that convergence
actually, eventually,
00:40:16.530 --> 00:40:20.779
happens for a strictly
concave function.
00:40:20.779 --> 00:40:22.320
This is a stats
class, so you're just
00:40:22.320 --> 00:40:24.750
going to have to trust
me that this is the case.
00:40:24.750 --> 00:40:26.730
And it's globally
convergent, meaning
00:40:26.730 --> 00:40:30.390
that you can start
wherever you want, and it's
00:40:30.390 --> 00:40:35.472
going to work under
minor conditions on f.
00:40:35.472 --> 00:40:36.930
And in particular,
those conditions
00:40:36.930 --> 00:40:40.050
are satisfied for the
log-likelihood functions
00:40:40.050 --> 00:40:41.120
we have in mind.
00:40:41.120 --> 00:40:42.130
OK.
00:40:42.130 --> 00:40:44.310
And it converges at an
extremely fast rate.
00:40:44.310 --> 00:40:46.470
Usually it's
quadratic convergence,
00:40:46.470 --> 00:40:49.260
which means that every
time you make one step,
00:40:49.260 --> 00:40:52.154
you roughly double the number of
correct digits of your solution.
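This digit-doubling behavior is easy to see on a one-dimensional example. The function here, f(x) = ln(x) - x^2/2 with maximizer x = 1, is an assumed illustration:

```python
# Assumed 1-D example: maximize f(x) = ln(x) - x**2 / 2, maximizer x = 1.
fp  = lambda x: 1.0 / x - x            # f'
fpp = lambda x: -1.0 / x**2 - 1.0      # f'' (always negative: strictly concave)

x = 0.5
errors = [abs(x - 1.0)]
for _ in range(4):
    x = x - fp(x) / fpp(x)             # one-dimensional Newton iteration
    errors.append(abs(x - 1.0))

# Quadratic convergence: each error is bounded by the square of the previous
# one here, so the number of correct digits roughly doubles at every step.
for prev, nxt in zip(errors, errors[1:]):
    assert nxt < prev**2
```

Running this, the errors shrink from about 0.5 to roughly 1e-8 in four steps.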
00:40:56.410 --> 00:40:58.960
If that's something you're
vaguely interested in,
00:40:58.960 --> 00:41:01.095
I highly recommend that
you take a class on them
00:41:01.095 --> 00:41:02.950
in optimization.
00:41:02.950 --> 00:41:04.175
It's a fascinating topic.
00:41:04.175 --> 00:41:05.800
Unfortunately, we
don't have much time,
00:41:05.800 --> 00:41:08.880
but it starts being more
and more intertwined
00:41:08.880 --> 00:41:11.680
with high dimensional
statistics and machine learning.
00:41:14.652 --> 00:41:17.000
I mean, it's an algorithms
class, typically.
00:41:17.000 --> 00:41:22.047
But it's very much
more principled.
00:41:22.047 --> 00:41:24.630
It's not a bunch of algorithms
that solve a bunch of problems.
00:41:24.630 --> 00:41:26.310
There's basically
one basic idea,
00:41:26.310 --> 00:41:28.250
which is if I have
a convex function,
00:41:28.250 --> 00:41:30.050
I can actually minimize it.
00:41:30.050 --> 00:41:32.370
If I have a concave
function, I can maximize it.
00:41:32.370 --> 00:41:35.930
And it evolves around
a similar thing.
00:41:35.930 --> 00:41:40.410
So let's stare at this iterative
step for a second and pause.
00:41:40.410 --> 00:41:43.318
And let me know if you
have any questions.
00:41:46.186 --> 00:41:47.150
OK.
00:41:47.150 --> 00:41:50.930
So, of course, in a
second we will plug in
00:41:50.930 --> 00:41:51.900
for the log-likelihood.
00:41:51.900 --> 00:41:54.710
This is just a general thing
for a general function f.
00:41:54.710 --> 00:41:57.550
But then in a second,
f is going to be ln.
00:41:57.550 --> 00:41:58.200
OK.
00:41:58.200 --> 00:42:00.500
So if I wanted to
implement that for real,
00:42:00.500 --> 00:42:04.220
I would have to compute the
gradient of ln at a point xk.
00:42:04.220 --> 00:42:06.740
And I would have to compute
the Hessian at a given point
00:42:06.740 --> 00:42:08.141
and invert it.
00:42:08.141 --> 00:42:08.640
OK.
00:42:08.640 --> 00:42:11.520
So this is just the
basic algorithm.
00:42:11.520 --> 00:42:14.540
And this, as you can
tell, used in no place
00:42:14.540 --> 00:42:17.150
the fact that ln was the
log-likelihood associated
00:42:17.150 --> 00:42:19.100
to some canonical
exponential family
00:42:19.100 --> 00:42:21.020
in a generalized linear model.
00:42:21.020 --> 00:42:23.550
This never showed up.
00:42:23.550 --> 00:42:26.220
So can we use that somehow?
00:42:26.220 --> 00:42:28.920
Optimization for the longest time
was about making your problems
00:42:28.920 --> 00:42:31.800
as general as possible
culminating maybe
00:42:31.800 --> 00:42:34.680
in the interior point method
theory and conic programming
00:42:34.680 --> 00:42:35.790
in the mid-'90s.
00:42:35.790 --> 00:42:37.350
And now what
optimization is doing
00:42:37.350 --> 00:42:38.710
is that it's [INAUDIBLE]
very general.
00:42:38.710 --> 00:42:40.620
It says, OK, if I want
to start to go fast,
00:42:40.620 --> 00:42:42.930
I need to exploit as much
structure about my problem
00:42:42.930 --> 00:42:43.980
as I can.
00:42:43.980 --> 00:42:45.936
And the beauty is
that as statisticians
00:42:45.936 --> 00:42:47.310
or machine
learning people, we
00:42:47.310 --> 00:42:49.350
do have a bunch of
very specific problems
00:42:49.350 --> 00:42:50.940
that we want
optimizers to solve.
00:42:50.940 --> 00:42:53.070
And they can make
things run much faster.
00:42:53.070 --> 00:42:56.850
But this did not require us to
wait until the 21st century.
00:42:56.850 --> 00:42:58.460
Problems with very
specific structure
00:42:58.460 --> 00:43:01.900
arose already in this
generalized linear model.
00:43:01.900 --> 00:43:03.430
So what do we know?
00:43:03.430 --> 00:43:07.770
Well, we know that this
log-likelihood is really
00:43:07.770 --> 00:43:09.780
one thing that comes
when we're trying
00:43:09.780 --> 00:43:12.270
to replace an expectation
by an average,
00:43:12.270 --> 00:43:14.310
and then doing
something fancy, right?
00:43:14.310 --> 00:43:16.350
That was our statistical hammer.
00:43:16.350 --> 00:43:19.260
And remember when we introduced
likelihood maximization we just
00:43:19.260 --> 00:43:21.780
said, what we
really want to do
00:43:21.780 --> 00:43:23.935
is to minimize the KL, right?
00:43:23.935 --> 00:43:25.560
That's the thing we
wanted to minimize,
00:43:25.560 --> 00:43:29.640
the KL divergence between two
distributions, the true one
00:43:29.640 --> 00:43:32.310
and the one that's parameterized
by some unknown theta.
00:43:32.310 --> 00:43:34.620
And we're trying to
minimize that over theta.
00:43:34.620 --> 00:43:36.179
And we said, well,
I don't know what
00:43:36.179 --> 00:43:38.220
this is, because it's an
expectation with respect
00:43:38.220 --> 00:43:40.030
to some unknown distribution.
00:43:40.030 --> 00:43:42.900
So let me just replace the
expectation with respect
00:43:42.900 --> 00:43:46.990
to my unknown distribution by
an average over my data points.
00:43:46.990 --> 00:43:49.800
And that's how we
justified the existence
00:43:49.800 --> 00:43:54.220
of the log-likelihood
maximization problem.
00:43:54.220 --> 00:44:00.690
But here, actually, I
might be able to compute
00:44:00.690 --> 00:44:04.410
this expectation, at least
partially where I need it.
00:44:04.410 --> 00:44:07.690
And what we're going to do
is we're going to say, OK,
00:44:07.690 --> 00:44:12.400
since at a given point xk,
say, let me call it here theta,
00:44:12.400 --> 00:44:16.080
I'm trying to find the
inverse of the Hessian
00:44:16.080 --> 00:44:17.550
of my log-likelihood, right?
00:44:17.550 --> 00:44:19.710
So if you look at the
previous one, as I said,
00:44:19.710 --> 00:44:23.659
we're going to have to compute
the Hessian H sub l n of xk,
00:44:23.659 --> 00:44:24.450
and then invert it.
00:44:24.450 --> 00:44:27.250
But let's forget about the
inversion step for a second.
00:44:27.250 --> 00:44:30.102
We have to compute the Hessian.
00:44:30.102 --> 00:44:31.560
This is the Hessian
of the function
00:44:31.560 --> 00:44:32.820
we're trying to minimize.
00:44:32.820 --> 00:44:34.680
But if I could
actually replace it
00:44:34.680 --> 00:44:37.100
not by the function I'm
trying to minimize to maximize
00:44:37.100 --> 00:44:38.970
or the log-likelihood,
but really
00:44:38.970 --> 00:44:42.390
by the function I wish I
was actually minimizing,
00:44:42.390 --> 00:44:45.240
which is the KL, right?
00:44:45.240 --> 00:44:46.569
Then that would be really nice.
00:44:46.569 --> 00:44:48.360
And what happens is
that since I'm actually
00:44:48.360 --> 00:44:51.630
trying to find
this at a given xk,
00:44:51.630 --> 00:44:53.760
I can always pretend
that this xk that I have
00:44:53.760 --> 00:44:55.770
in my current iteration
is the true one
00:44:55.770 --> 00:44:58.902
and compute my expectation
with respect to that guy.
00:44:58.902 --> 00:45:00.610
And what happens is
that I know that when
00:45:00.610 --> 00:45:05.190
I compute the expectation of the
Hessian of the log-likelihood
00:45:05.190 --> 00:45:08.020
at a given theta and when I take
the expectation with respect
00:45:08.020 --> 00:45:10.870
to the same theta,
what I get out
00:45:10.870 --> 00:45:15.180
is negative Fisher information.
00:45:15.180 --> 00:45:18.240
The Fisher information
was defined in two ways--
00:45:18.240 --> 00:45:25.509
as the expectation of the square
of the derivative or negative
00:45:25.509 --> 00:45:27.300
of the expectation of
the second derivative
00:45:27.300 --> 00:45:30.040
of the log-likelihood.
00:45:30.040 --> 00:45:35.090
And so now, I'm doing some
sort of a leap of faith here.
00:45:35.090 --> 00:45:40.390
Because there's no way the
theta, which is the current xk,
00:45:40.390 --> 00:45:42.950
that's the current theta
at which I'm actually
00:45:42.950 --> 00:45:44.210
doing this optimization--
00:45:44.210 --> 00:45:46.334
I'm actually pretending
that this is the right one.
00:45:48.930 --> 00:45:51.230
But what's going to
change by doing this
00:45:51.230 --> 00:45:52.980
is that it's going to
make my life easier.
00:45:52.980 --> 00:45:56.670
Because when I
take expectations,
00:45:56.670 --> 00:46:00.630
we'll see that when we
look at the Hessian,
00:46:00.630 --> 00:46:07.200
the Hessian is essentially the
derivative of, say, a product
00:46:07.200 --> 00:46:09.420
is going to be the sum
of two terms, right?
00:46:09.420 --> 00:46:13.224
The derivative of u times v
is u prime v plus uv prime.
00:46:13.224 --> 00:46:14.640
One of those two
terms is actually
00:46:14.640 --> 00:46:16.462
going to have expectation 0.
00:46:16.462 --> 00:46:18.420
And that's going to make
my life very easy when
00:46:18.420 --> 00:46:20.400
I take expectations
and basically just
00:46:20.400 --> 00:46:22.850
have one term that's
going to go away.
00:46:22.850 --> 00:46:24.300
And so in particular,
my formula,
00:46:24.300 --> 00:46:27.180
just by the virtue of
taking this expectation
00:46:27.180 --> 00:46:29.340
before inverting the
Hessian, is going
00:46:29.340 --> 00:46:33.710
to just shrink the size
of my formulas by half.
00:46:33.710 --> 00:46:34.210
OK.
00:46:34.210 --> 00:46:35.620
So let's see how this works.
00:46:35.620 --> 00:46:37.820
You don't have to believe me.
00:46:37.820 --> 00:46:39.500
Is there any question
about this slide?
00:46:39.500 --> 00:46:41.050
You guys remember
when we were doing
00:46:41.050 --> 00:46:43.850
maximum likelihood estimation
and Fisher information
00:46:43.850 --> 00:46:47.228
and the KL
divergence, et cetera?
00:46:47.228 --> 00:46:48.202
Yeah.
00:46:48.202 --> 00:46:51.175
AUDIENCE: [INAUDIBLE]
00:46:51.175 --> 00:46:53.300
PHILIPPE RIGOLLET: Because
that's what we're really
00:46:53.300 --> 00:46:55.140
trying to minimize.
00:46:55.140 --> 00:46:59.540
AUDIENCE: [INAUDIBLE]
00:46:59.540 --> 00:47:00.540
PHILIPPE RIGOLLET: Yeah.
00:47:00.540 --> 00:47:05.100
So there's something
you need to trust me
00:47:05.100 --> 00:47:08.850
with, which is that the
expectation of H of ln
00:47:08.850 --> 00:47:14.975
is actually H of the
expectation of ln, all right?
00:47:14.975 --> 00:47:16.780
Yeah, it's true, right?
00:47:16.780 --> 00:47:20.387
Because taking derivative
is a linear operator.
00:47:20.387 --> 00:47:22.220
And we actually used
that several times when
00:47:22.220 --> 00:47:30.530
we said expectation of partial
of l with respect to theta
00:47:30.530 --> 00:47:31.310
is equal to 0.
00:47:31.310 --> 00:47:32.619
Remember we did that?
00:47:32.619 --> 00:47:34.160
That's basically
what we used, right?
00:47:39.070 --> 00:47:40.716
AUDIENCE: ln is the likelihood.
00:47:40.716 --> 00:47:42.507
PHILIPPE RIGOLLET: It's
the log-likelihood.
00:47:42.507 --> 00:47:45.453
AUDIENCE: Log-likelihood,
[INAUDIBLE] OK.
00:47:45.453 --> 00:47:47.417
When we did Fisher
[INAUDIBLE],, we
00:47:47.417 --> 00:47:49.872
did the likelihood of
[INAUDIBLE] observation.
00:47:49.872 --> 00:47:51.836
PHILIPPE RIGOLLET: Yeah.
00:47:51.836 --> 00:47:54.291
AUDIENCE: Why is
it ln in this case?
00:47:54.291 --> 00:48:01.410
PHILIPPE RIGOLLET: So actually,
ln is typically not normalized.
00:48:01.410 --> 00:48:04.060
So I really should
talk about ln over n.
00:48:04.060 --> 00:48:04.620
OK.
00:48:04.620 --> 00:48:05.840
But let's see that, OK?
00:48:05.840 --> 00:48:10.790
So if I have IID observations,
that should be pretty obvious.
00:48:10.790 --> 00:48:11.290
OK.
00:48:11.290 --> 00:48:22.320
So if I have IID X1, ..., Xn with
density f theta, and if I look
00:48:22.320 --> 00:48:28.290
at log f theta of Xi, sum from
i equal 1 to n, as I said,
00:48:28.290 --> 00:48:30.915
I need to actually
have a 1 over n here.
00:48:30.915 --> 00:48:32.790
When I look at the
expectation, they all have
00:48:32.790 --> 00:48:34.350
the same expectation, right?
00:48:34.350 --> 00:48:41.250
So this is actually,
indeed, equal to negative KL
00:48:41.250 --> 00:48:42.690
plus a constant.
00:48:42.690 --> 00:48:43.490
OK?
00:48:43.490 --> 00:48:45.210
And negative KL
is because this--
00:48:45.210 --> 00:48:46.710
sorry, if I look
at the expectation.
00:48:46.710 --> 00:48:50.850
So the expectation of this guy
is just the expectation of one
00:48:50.850 --> 00:48:56.430
of them, all right?
00:48:56.430 --> 00:48:57.990
So I just do expectation theta.
00:48:57.990 --> 00:48:59.710
OK?
00:48:59.710 --> 00:49:02.110
Agree?
00:49:02.110 --> 00:49:09.060
Remember, the KL was expectation
theta log f theta divided by f.
00:49:09.060 --> 00:49:11.270
So that's between p
theta and p theta prime.
00:49:14.714 --> 00:49:15.380
Well, no, sorry.
00:49:15.380 --> 00:49:17.750
That's the true p.
00:49:17.750 --> 00:49:19.690
And let's call it f.
00:49:19.690 --> 00:49:20.620
p theta, right?
00:49:20.620 --> 00:49:23.440
So that's what showed
up, which is, indeed,
00:49:23.440 --> 00:49:27.075
equal to minus
expectation theta log
00:49:27.075 --> 00:49:33.340
f theta plus log of f, which
is just a constant with respect
00:49:33.340 --> 00:49:34.150
to theta.
00:49:34.150 --> 00:49:36.465
It's just a constant shift,
so it doesn't matter.
00:49:36.465 --> 00:49:38.890
OK?
00:49:38.890 --> 00:49:40.719
So this is what shows up here.
00:49:40.719 --> 00:49:42.510
And just the fact that
I have this 1 over n
00:49:42.510 --> 00:49:44.614
doesn't change,
because they're IID.
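The identity being used here, that the expected normalized log-likelihood is a constant minus the KL divergence, can be illustrated with Bernoulli distributions. A sketch in Python, not from the lecture:

```python
import numpy as np

# For X ~ Bernoulli(p) and a candidate Bernoulli(q),
#   E_p[log f_q(X)] = -KL(p || q) + E_p[log f_p(X)],
# so maximizing the expected log-likelihood over q is the same as
# minimizing KL(p || q); the last term is a constant in q.
def bern_kl(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def expected_loglik(p, q):
    return p * np.log(q) + (1 - p) * np.log(1 - q)

p = 0.3
for q in (0.1, 0.5, 0.9):
    assert np.isclose(expected_loglik(p, q),
                      -bern_kl(p, q) + expected_loglik(p, p))
```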
00:49:44.614 --> 00:49:47.280
Now, when I have things that are
not IID-- because what I really
00:49:47.280 --> 00:49:54.570
had was Y1, ..., Yn, and Yi has
density f theta i, which is just
00:49:54.570 --> 00:49:56.602
the conditional
density given Xi,
00:49:56.602 --> 00:49:59.040
then I could still write this.
00:49:59.040 --> 00:50:02.220
And now when I look at the
expectation of this guy, what
00:50:02.220 --> 00:50:08.130
I'm going to be left with
is just 1 over n sum from i
00:50:08.130 --> 00:50:17.391
equal 1 to n of the expectation
of log f theta i of Yi.
00:50:19.635 --> 00:50:22.010
And it's basically the same
thing, except that I have a 1
00:50:22.010 --> 00:50:24.772
over n expectation in front.
00:50:24.772 --> 00:50:26.980
And I didn't tell you this,
because I only showed you
00:50:26.980 --> 00:50:32.830
what the KL divergence was
between two distributions.
00:50:32.830 --> 00:50:35.110
But here, I'm telling
you what the KL
00:50:35.110 --> 00:50:39.012
is between two products
of distributions
00:50:39.012 --> 00:50:40.720
that are independent,
but not necessarily
00:50:40.720 --> 00:50:42.200
identically distributed.
00:50:48.412 --> 00:50:50.620
But that's what's going to
show up, just because it's
00:50:50.620 --> 00:50:51.680
a product of things.
00:50:51.680 --> 00:50:53.930
So when you have the log,
it's just going to be a sum.
00:50:58.360 --> 00:51:01.706
Other questions?
00:51:01.706 --> 00:51:05.440
All right, so what
do we do here?
00:51:05.440 --> 00:51:08.850
Well, as I said, now we know
that the expectation of H
00:51:08.850 --> 00:51:10.920
is negative Fisher information.
00:51:10.920 --> 00:51:13.740
So rather than putting
H inverse in my iterates
00:51:13.740 --> 00:51:19.050
for Newton-Raphson, I'm just
going to put the inverse Fisher
00:51:19.050 --> 00:51:20.280
information.
00:51:20.280 --> 00:51:23.320
And remember, it had
a minus sign in front.
00:51:23.320 --> 00:51:25.950
So I'm just going to
pick up a plus sign now,
00:51:25.950 --> 00:51:31.330
just because I is the negative of
the expectation of the Hessian.
00:51:31.330 --> 00:51:33.880
And this guy has, essentially,
the same convergence
00:51:33.880 --> 00:51:35.590
properties.
00:51:35.590 --> 00:51:37.330
And it just happens
that it's easier
00:51:37.330 --> 00:51:39.550
to compute I than the Hessian of ln.
00:51:39.550 --> 00:51:40.930
And that's it.
00:51:40.930 --> 00:51:43.430
That's really why
you want to do this.
00:51:43.430 --> 00:51:50.180
Now, you might say that, well,
if I use more information,
00:51:50.180 --> 00:51:51.986
I should do better, right?
00:51:51.986 --> 00:51:53.360
But it's actually
not necessarily
00:51:53.360 --> 00:51:54.540
true for several reasons.
00:51:54.540 --> 00:51:56.990
But let's say that one reason is
probably the fact that I
00:51:56.990 --> 00:51:58.400
did not use more information.
00:51:58.400 --> 00:52:02.030
Every step when I was
computing this thing at xk,
00:52:02.030 --> 00:52:04.600
I actually pretended
that at theta k
00:52:04.600 --> 00:52:07.780
the true distribution
was the one distributed
00:52:07.780 --> 00:52:09.450
according to theta k.
00:52:09.450 --> 00:52:10.880
And that was not true.
00:52:10.880 --> 00:52:12.740
This is only true
when theta k becomes
00:52:12.740 --> 00:52:13.830
close to the true theta.
00:52:14.570 --> 00:52:18.080
And so in a way, what
I gained I lost again
00:52:18.080 --> 00:52:20.380
by making this approximation.
00:52:20.380 --> 00:52:23.040
It's just really a matter
of simple computation.
00:52:23.040 --> 00:52:25.620
So let's just see it on
a particular example.
00:52:25.620 --> 00:52:28.530
Actually, in this example, it's
not going to look much simpler.
00:52:28.530 --> 00:52:30.700
It's actually going
to be the same.
00:52:30.700 --> 00:52:35.061
All right, so I'm going to
have the Bernoulli example.
00:52:35.061 --> 00:52:36.560
All right, so we
know that Bernoulli
00:52:36.560 --> 00:52:41.480
belongs to the canonical
exponential family.
00:52:41.480 --> 00:52:46.700
And essentially, all I
need to tell you is what b is.
00:52:46.700 --> 00:52:53.670
And b of theta for Bernoulli
is log 1 plus e theta, right?
00:52:53.670 --> 00:52:56.620
We computed that.
00:52:56.620 --> 00:52:57.690
OK.
00:52:57.690 --> 00:53:02.060
And so when I look
at my log-likelihood,
00:53:02.060 --> 00:53:10.079
it is going to look like the sum
from i equal 1 to n of Yi of--
00:53:10.079 --> 00:53:11.620
OK, so here I'm
going to actually use
00:53:11.620 --> 00:53:12.710
the canonical link.
00:53:12.710 --> 00:53:15.610
So it's going to be
Xi transpose beta
00:53:15.610 --> 00:53:20.750
minus log 1 plus exponential
Xi transpose beta.
00:53:24.330 --> 00:53:27.620
And phi for this
guy is equal to 1.
00:53:27.620 --> 00:53:29.370
Is it clear for
everyone what I did?
00:53:29.370 --> 00:53:29.870
OK.
00:53:29.870 --> 00:53:35.780
So remember the density,
so that was really just--
00:53:35.780 --> 00:53:45.070
so the PMF was exponential Y
theta minus log 1 plus e theta.
00:53:45.070 --> 00:53:47.030
There was actually
no normalization.
00:53:47.030 --> 00:53:51.140
That's just the
density of a Bernoulli.
00:53:51.140 --> 00:54:00.330
And the theta is actually
log p over 1 minus p.
00:54:00.330 --> 00:54:03.120
And so that's what
actually gives me what my--
00:54:03.120 --> 00:54:05.220
since p is the expectation,
this is actually
00:54:05.220 --> 00:54:06.750
giving me also my
canonical link,
00:54:06.750 --> 00:54:09.210
which is the logit link.
We saw that last time.
00:54:09.210 --> 00:54:12.960
And so if I start taking the log
of this guy and summing over n
00:54:12.960 --> 00:54:16.050
and replacing theta by
Xi transpose beta, which
00:54:16.050 --> 00:54:19.980
is what the canonical link
tells me to do, I get this guy.
00:54:19.980 --> 00:54:21.930
Is that clear for everyone?
00:54:21.930 --> 00:54:25.120
If it's not, please redo
this step on your own.
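The log-likelihood just written, the sum of Yi Xi'beta minus log(1 + exp(Xi'beta)), is easy to code up. A sketch in Python, not from the lecture, with made-up data:

```python
import numpy as np

def logistic_loglik(beta, X, y):
    """Log-likelihood sum_i [ y_i x_i'beta - log(1 + exp(x_i'beta)) ]."""
    eta = X @ beta                                   # systematic component
    return np.sum(y * eta - np.logaddexp(0.0, eta))  # log(1+e^eta), stably

# tiny synthetic check (all values made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))
print(logistic_loglik(beta_true, X, y))
```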
00:54:27.740 --> 00:54:28.240
OK.
00:54:31.100 --> 00:54:33.830
So I want to maximize
this function.
00:54:33.830 --> 00:54:34.340
Sorry.
00:54:34.340 --> 00:54:36.256
So I want to maximize
this function over there
00:54:36.256 --> 00:54:39.690
on the first line as
a function of beta.
00:54:39.690 --> 00:54:44.280
And so to do this, I want
to use either Newton-Raphson
00:54:44.280 --> 00:54:45.840
or what I call Fisher-scoring.
00:54:45.840 --> 00:54:47.340
So Fisher-scoring
is the second one,
00:54:47.340 --> 00:54:51.980
when you replace the Hessian
by negative Fisher information.
00:54:51.980 --> 00:54:53.840
So I replace these two things.
00:54:53.840 --> 00:54:55.215
And so I first
take the gradient.
00:54:55.215 --> 00:54:55.714
OK.
00:54:55.714 --> 00:54:57.160
So let's take the
gradient of ln.
00:55:04.360 --> 00:55:08.910
So the gradient of ln is
going to be, well, sum--
00:55:08.910 --> 00:55:11.470
so here, this is of
the form Yi, which
00:55:11.470 --> 00:55:14.740
is a scalar, times
Xi transpose beta.
00:55:14.740 --> 00:55:18.400
That's what I erased from here.
00:55:18.400 --> 00:55:20.650
The gradient of b
transpose x is just b.
00:55:20.650 --> 00:55:23.620
So here, I have just Yi Xi.
00:55:23.620 --> 00:55:27.400
So that's of the form Yi,
which is a scalar, times Xi,
00:55:27.400 --> 00:55:30.029
which is a vector.
00:55:30.029 --> 00:55:31.070
Now, what about this guy?
00:55:31.070 --> 00:55:32.278
Well, here I have a function.
00:55:32.278 --> 00:55:34.990
So I'm going to have just the
usual rule, the chain rule,
00:55:34.990 --> 00:55:35.490
right?
00:55:35.490 --> 00:55:38.140
So that's just going
to be 1 over this guy.
00:55:40.690 --> 00:55:43.220
And then I need to find
the derivative of this thing.
00:55:43.220 --> 00:55:44.620
So the 1 is going away.
00:55:44.620 --> 00:55:46.910
And then I apply the
chain rule again.
00:55:46.910 --> 00:55:49.990
So I get e of Xi
transpose beta, and then
00:55:49.990 --> 00:55:53.274
the gradient of this
thing, which is Xi.
00:55:57.700 --> 00:56:00.450
So my Hessian--
my gradient, sorry,
00:56:00.450 --> 00:56:02.506
I can actually factor
out all my Xi's.
00:56:02.506 --> 00:56:03.880
And it's going to
look like this.
00:56:15.230 --> 00:56:21.180
My gradient is a weighted
average or weighted sum
00:56:21.180 --> 00:56:22.080
of the Xi's.
00:56:25.130 --> 00:56:30.140
This will always
happen when you have
00:56:30.140 --> 00:56:32.290
a generalized linear model.
00:56:32.290 --> 00:56:33.410
And that's pretty clear.
00:56:33.410 --> 00:56:35.360
Where did the Xi show up?
00:56:35.360 --> 00:56:37.640
Whether it's from
this guy or that guy,
00:56:37.640 --> 00:56:39.220
the Xi came from
the fact that when
00:56:39.220 --> 00:56:41.270
I take the gradient
of Xi transpose beta,
00:56:41.270 --> 00:56:43.940
I have this vector
Xi that comes out.
00:56:43.940 --> 00:56:46.740
It's always going to be
the thing that comes out.
00:56:46.740 --> 00:56:49.926
So I will always have something
that looks like a sum
00:56:49.926 --> 00:56:53.590
with some weights
here of the Xi's.
00:56:53.590 --> 00:56:55.938
Now, when I look at
the second derivative--
00:57:04.350 --> 00:57:08.310
so same thing, I'm just going
to take the derivative this guy.
00:57:08.310 --> 00:57:11.460
Since nothing depends
on beta here or here,
00:57:11.460 --> 00:57:13.990
I'm just going to have to take
the derivative of this thing.
00:57:13.990 --> 00:57:15.240
And so it's going to be equal.
00:57:15.240 --> 00:57:21.110
So if I look now at the Hessian
ln as a function of beta,
00:57:21.110 --> 00:57:25.860
I'm going to have sum from i
equal 1 to n of, well, Yi--
00:57:25.860 --> 00:57:28.590
what is a derivative of
Yi with respect to beta?
00:57:31.482 --> 00:57:32.446
AUDIENCE: [INAUDIBLE]
00:57:32.446 --> 00:57:34.374
PHILIPPE RIGOLLET: What?
00:57:34.374 --> 00:57:37.266
AUDIENCE: [INAUDIBLE]
00:57:37.266 --> 00:57:39.161
PHILIPPE RIGOLLET: Yeah.
00:57:39.161 --> 00:57:39.660
0.
00:57:39.660 --> 00:57:40.159
OK?
00:57:40.159 --> 00:57:41.669
It doesn't depend on beta.
00:57:41.669 --> 00:57:42.960
I mean, this distribution does.
00:57:42.960 --> 00:57:46.290
But Y itself is just
a number, right?
00:57:46.290 --> 00:57:47.730
So this is 0.
00:57:47.730 --> 00:57:49.050
So I'm going to get the minus.
00:57:49.050 --> 00:57:51.060
And then I'm going to
have, again, the chain
00:57:51.060 --> 00:57:52.030
rule that shows up.
00:57:52.030 --> 00:57:56.840
So I need to find the
derivative of x over 1 plus x.
00:57:56.840 --> 00:57:59.289
What is the derivative
of x over 1 plus x?
00:57:59.289 --> 00:58:00.330
I actually don't even know.
00:58:03.180 --> 00:58:04.510
So that gives me--
00:58:12.491 --> 00:58:12.990
OK.
00:58:12.990 --> 00:58:15.300
So that's 1 over
1 plus x squared.
00:58:15.300 --> 00:58:19.380
So that's minus e
Xi transpose beta--
00:58:19.380 --> 00:58:27.390
sorry, 1 divided by 1 plus e
Xi transpose beta squared times
00:58:27.390 --> 00:58:31.110
the derivative of the
exponential, which is e Xi
00:58:31.110 --> 00:58:35.280
transpose beta and again, Xi.
00:58:38.090 --> 00:58:39.680
And then I have this
Xi that shows up.
00:58:39.680 --> 00:58:41.138
But since I'm
looking for a matrix,
00:58:41.138 --> 00:58:42.890
I'm going to have Xi,
Xi transpose, right?
00:58:53.080 --> 00:58:55.765
OK?
00:58:55.765 --> 00:58:58.600
AUDIENCE: [INAUDIBLE]
00:58:58.600 --> 00:59:02.830
PHILIPPE RIGOLLET: So I know
I'm going to need something that
00:59:02.830 --> 00:59:04.360
looks like a matrix in the end.
00:59:04.360 --> 00:59:07.480
And so one way you want
to think about it is this
00:59:07.480 --> 00:59:09.850
is going to spit out an Xi.
00:59:09.850 --> 00:59:11.937
There's already an Xi here.
00:59:11.937 --> 00:59:14.020
So I'm going to have
something that looks like Xi.
00:59:14.020 --> 00:59:15.850
And I'm going to have to
multiply it by another vector
00:59:15.850 --> 00:59:16.630
Xi.
00:59:16.630 --> 00:59:18.520
And I want it to form a matrix.
00:59:18.520 --> 00:59:22.264
And so what you need to do
is to take an outer product.
00:59:22.264 --> 00:59:23.240
And that's it.
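Putting the gradient and Hessian just derived together gives the Newton-Raphson iterate for logistic regression. A sketch in Python, not from the lecture; the toy data and fixed step count are made up, and a real implementation would check convergence instead:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def newton_logistic(X, y, steps=25):
    """Newton-Raphson for logistic regression with the canonical link.

    gradient:  sum_i (y_i - sigmoid(x_i'beta)) x_i
    Hessian:  -sum_i sigmoid(x_i'beta)(1 - sigmoid(x_i'beta)) x_i x_i'
    update:    beta <- beta - H^{-1} gradient  (solve, don't invert)
    """
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)
        H = -(X.T * (p * (1 - p))) @ X   # -X' diag(p(1-p)) X
        beta = beta - np.linalg.solve(H, grad)
    return beta

# made-up, non-separable toy data so the MLE exists
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
beta_hat = newton_logistic(X, y)
print(beta_hat)
```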
00:59:36.420 --> 00:59:42.510
So now as a result, the
updating rule is this.
00:59:42.510 --> 00:59:44.460
Honestly, this is not
a result of anything.
00:59:44.460 --> 00:59:47.950
I actually rewrote everything
that I had before with a theta
00:59:47.950 --> 00:59:50.160
replaced by beta,
because it's just
00:59:50.160 --> 00:59:54.290
painful to rewrite this entire
thing, put some big parenthesis
00:59:54.290 --> 00:59:55.335
and put minus 1 here.
00:59:58.680 --> 01:00:00.920
And then I would have to
put the gradient, which
01:00:00.920 --> 01:00:03.390
is this thing here.
01:00:03.390 --> 01:00:07.340
So as you can imagine,
this is not super nice.
01:00:07.340 --> 01:00:11.450
Actually, what's
interesting is at some point
01:00:11.450 --> 01:00:15.080
I mentioned there's a
pseudo-Newton method.
01:00:15.080 --> 01:00:16.680
They're actually
doing exactly this.
01:00:16.680 --> 01:00:22.360
They're saying, oh,
at each iteration,
01:00:22.360 --> 01:00:24.580
I'm actually going to
just take those guys.
01:00:24.580 --> 01:00:26.260
If I'm at iteration
k, I'm actually
01:00:26.260 --> 01:00:28.225
just going to sum
those guys up to k
01:00:28.225 --> 01:00:30.600
rather than going all the way
to n and look at every one.
01:00:30.600 --> 01:00:33.020
So you're just looking at your
observations one at a time
01:00:33.020 --> 01:00:35.851
based on where you were before.
01:00:35.851 --> 01:00:36.350
OK.
01:00:36.350 --> 01:00:38.380
So you have a matrix.
01:00:38.380 --> 01:00:39.560
You need to invert it.
01:00:39.560 --> 01:00:41.185
So if you want to be
able to invert it,
01:00:41.185 --> 01:00:43.120
you need to make
sure that the sum
01:00:43.120 --> 01:00:46.810
with those weights of Xi
outer, Xi, or Xi, Xi transpose
01:00:46.810 --> 01:00:47.920
is invertible.
01:00:47.920 --> 01:00:50.440
So that's a condition
that you need to have.
01:00:50.440 --> 01:00:53.826
And well, you don't have
to, because technically you
01:00:53.826 --> 01:00:54.700
don't need to invert.
01:00:54.700 --> 01:00:57.430
You just need to solve
the linear system.
01:00:57.430 --> 01:01:00.910
But that's actually guaranteed
in most of the cases
01:01:00.910 --> 01:01:03.210
if n is large enough.
01:01:03.210 --> 01:01:05.880
All right, so everybody
sees what we're doing here?
01:01:05.880 --> 01:01:06.380
OK.
01:01:06.380 --> 01:01:11.480
So that's for the
Newton-Raphson.
01:01:11.480 --> 01:01:17.740
If I wanted to actually
do the Fisher-scoring,
01:01:17.740 --> 01:01:22.070
all I would need to do is
to replace the Hessian here
01:01:22.070 --> 01:01:25.100
by its expectation when I
pretend that the beta I have, at
01:01:25.100 --> 01:01:28.090
iteration k, is the true one.
01:01:28.090 --> 01:01:31.510
What is the expectation
of this thing?
01:01:31.510 --> 01:01:34.330
And when I say
expectation here, I'm
01:01:34.330 --> 01:01:37.810
always talking about conditional
expectation of Y given X.
01:01:37.810 --> 01:01:40.690
The only distributions that
matter, that have mattered
01:01:40.690 --> 01:01:44.530
in this entire chapter, are
the conditional distributions of Y
01:01:44.530 --> 01:01:45.730
given X.
01:01:45.730 --> 01:01:50.810
The conditional expectation
of this thing given X is what?
01:01:55.870 --> 01:01:57.050
It's itself.
01:01:57.050 --> 01:01:59.720
It does not depend on Y.
It only depends on the X's.
01:01:59.720 --> 01:02:01.520
So conditionally
on x, this thing
01:02:01.520 --> 01:02:04.160
as far as we're concerned,
is completely deterministic.
01:02:04.160 --> 01:02:07.700
So it's actually equal
to its expectation.
01:02:07.700 --> 01:02:11.240
And so in this
particular example,
01:02:11.240 --> 01:02:16.450
there's no difference
between Fisher-scoring
01:02:16.450 --> 01:02:19.990
and Newton-Raphson.
01:02:19.990 --> 01:02:24.190
And the reason is because
the gradient no longer
01:02:24.190 --> 01:02:26.050
depends on Yi--
01:02:26.050 --> 01:02:26.550
I'm sorry.
01:02:26.550 --> 01:02:30.172
The Hessian no
longer depends on Yi.
01:02:30.172 --> 01:02:30.672
OK?
01:02:41.630 --> 01:02:44.158
This slide is just repeating
some stuff that I've said.
01:02:49.466 --> 01:02:49.966
OK.
01:02:54.330 --> 01:02:55.800
So I think this is probably--
01:02:58.692 --> 01:03:00.150
OK, let's go through
this actually.
01:03:04.980 --> 01:03:07.380
At some point, I said
that Newton-Raphson--
01:03:07.380 --> 01:03:08.362
do you have a question?
01:03:08.362 --> 01:03:08.987
AUDIENCE: Yeah.
01:03:08.987 --> 01:03:12.290
When would the gradient-- sorry,
the Hessian ever depend on Yi?
01:03:12.290 --> 01:03:15.111
Because it seems like
Yi is just-- or at least
01:03:15.111 --> 01:03:22.601
when you have a canonical link,
that the log-likelihood is just
01:03:22.601 --> 01:03:27.346
[INAUDIBLE] to Yi Xi [INAUDIBLE]
theta and that's the only place
01:03:27.346 --> 01:03:28.002
Y shows up.
01:03:28.002 --> 01:03:30.884
So [INAUDIBLE] derivative
[INAUDIBLE] never depend on Y?
01:03:30.884 --> 01:03:33.134
PHILIPPE RIGOLLET: Not when
you have a canonical link.
01:03:33.134 --> 01:03:35.298
AUDIENCE: So if it's not a
[INAUDIBLE] there's is no
01:03:35.298 --> 01:03:35.918
difference between--
01:03:35.918 --> 01:03:36.834
PHILIPPE RIGOLLET: No.
01:03:36.834 --> 01:03:37.880
AUDIENCE: OK.
01:03:37.880 --> 01:03:38.880
PHILIPPE RIGOLLET: Yeah.
01:03:38.880 --> 01:03:45.350
So yeah, maybe I wanted you to
figure that out for yourself.
01:03:45.350 --> 01:03:45.850
OK.
01:03:45.850 --> 01:03:49.690
So Yi times Xi transpose beta.
01:03:49.690 --> 01:03:55.390
So essentially, when I
have a general family, what
01:03:55.390 --> 01:04:00.167
he's referring to is that this
is just b of Xi transpose beta.
01:04:00.167 --> 01:04:01.750
So I'm going to take
some derivatives.
01:04:01.750 --> 01:04:02.590
And there's going
to be something
01:04:02.590 --> 01:04:04.180
complicated coming out of this.
01:04:04.180 --> 01:04:06.520
But I'm certainly not going
to have some Yi showing up.
01:04:06.520 --> 01:04:09.670
The only place where
Yi shows up is here.
01:04:09.670 --> 01:04:11.890
Now, if I take two
derivatives, this thing
01:04:11.890 --> 01:04:13.480
is gone, because it's linear.
01:04:13.480 --> 01:04:15.550
The first one is going
to keep something like this guy.
01:04:15.550 --> 01:04:17.424
And the second one is
going to make it go away.
01:04:17.424 --> 01:04:21.730
The only way this actually shows
up is when I have an H here.
01:04:21.730 --> 01:04:24.880
And if I have an H, then I
can take second derivatives.
01:04:24.880 --> 01:04:26.980
And this thing is
not going to be
01:04:26.980 --> 01:04:30.920
completely independent of beta.
01:04:30.920 --> 01:04:31.420
Sorry.
01:04:31.420 --> 01:04:33.128
Yeah, this thing is
still going to depend
01:04:33.128 --> 01:04:35.080
on beta, which means
that this Yi term is not
01:04:35.080 --> 01:04:36.240
going to disappear.
01:04:39.127 --> 01:04:41.710
I believe we'll see an example
of that, or maybe I removed it.
01:04:41.710 --> 01:04:42.690
I'm not sure, actually.
01:04:42.690 --> 01:04:44.626
I think we will see an example.
01:04:49.460 --> 01:04:55.390
So let us do Iteratively
Re-weighted Least
01:04:55.390 --> 01:05:00.100
Squares, or IRLS, which I've
actually recently learned
01:05:00.100 --> 01:05:03.490
is a term that even though
it was defined in the '50s,
01:05:03.490 --> 01:05:06.572
people still feel
free to use to name
01:05:06.572 --> 01:05:08.530
their new algorithms,
which have nothing
01:05:08.530 --> 01:05:09.710
to do with this.
01:05:09.710 --> 01:05:12.460
This is really something where
you actually do iteratively
01:05:12.460 --> 01:05:13.660
re-weighted least squares.
01:05:18.290 --> 01:05:18.790
OK.
01:05:18.790 --> 01:05:20.620
Let's just actually go
through this quickly
01:05:20.620 --> 01:05:23.290
what is going to be iteratively
re-weighted least squares.
01:05:23.290 --> 01:05:28.250
The way the steps that
we had here showed up--
01:05:28.250 --> 01:05:32.665
let's say those guys, x*--
01:05:32.665 --> 01:05:37.430
is this, is when we were actually
solving this linear system,
01:05:37.430 --> 01:05:37.980
right?
01:05:37.980 --> 01:05:43.215
That was the linear system
we were trying to solve.
01:05:43.215 --> 01:05:45.270
But solving a
linear system can be
01:05:45.270 --> 01:05:48.390
done by just trying
to minimize, right?
01:05:48.390 --> 01:05:50.640
If I have ax equals
b, it's the same
01:05:50.640 --> 01:05:58.820
as minimizing the norm of
ax minus b squared over x.
01:05:58.820 --> 01:06:01.490
If I can actually find
an x for which it's 0,
01:06:01.490 --> 01:06:04.280
it means that I've
actually solved my problem.
01:06:04.280 --> 01:06:10.580
And so that means that I can
solve linear systems by solving
01:06:10.580 --> 01:06:11.889
least square problems.
01:06:11.889 --> 01:06:14.180
And least square problems
are things that statisticians
01:06:14.180 --> 01:06:15.630
are comfortable solving.
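The point that solving the linear system A x = b is the same as minimizing ||Ax - b|| squared can be seen directly with NumPy. A sketch, not from the lecture, with a made-up invertible A:

```python
import numpy as np

# Solving A x = b is the same as minimizing ||A x - b||^2:
# if the minimum value is 0, the minimizer solves the system.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # invertible, so a unique solution exists
b = np.array([1.0, 2.0, 3.0])

x_solve = np.linalg.solve(A, b)               # direct linear solve
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # least squares route
print(x_solve, x_lstsq)                       # same answer both ways
```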
01:06:15.630 --> 01:06:18.500
And so all I have to
do is to rephrase this
01:06:18.500 --> 01:06:19.940
as a least squares problem.
01:06:19.940 --> 01:06:20.780
OK?
01:06:20.780 --> 01:06:23.240
And you know, I could just
write it directly like this.
01:06:23.240 --> 01:06:25.850
But there's a way to
streamline it a little bit.
01:06:25.850 --> 01:06:30.220
And that's actually
by using weights.
01:06:30.220 --> 01:06:30.860
OK.
01:06:30.860 --> 01:06:32.840
So I'll come to the weights--
01:06:32.840 --> 01:06:37.580
well, not today, actually,
but very soon, all right?
01:06:37.580 --> 01:06:39.560
So this is just a
reminder of what we had.
01:06:39.560 --> 01:06:43.670
We have that's Yi give Xi as
a distribution distributed
01:06:43.670 --> 01:06:47.120
according to some distribution
in the canonical exponential
01:06:47.120 --> 01:06:48.360
family.
01:06:48.360 --> 01:06:50.730
So that means that the
log-likelihood looks like this.
01:06:50.730 --> 01:06:52.700
Again, this does
not matter to us.
01:06:52.700 --> 01:06:54.390
This is the form that matters.
01:06:54.390 --> 01:06:55.940
And we have a bunch
of relationships
01:06:55.940 --> 01:06:58.670
that we actually spent
some time computing.
01:06:58.670 --> 01:07:01.250
The first one is that mu i
is b prime of theta i.
01:07:01.250 --> 01:07:04.780
The second one is that
if I take g of mu i,
01:07:04.780 --> 01:07:08.150
I get this systematic component,
Xi transpose beta, that we're
01:07:08.150 --> 01:07:09.740
modeling.
01:07:09.740 --> 01:07:13.030
Now, if I look at the derivative
mu i with respect to theta i,
01:07:13.030 --> 01:07:15.510
this is the derivative
of b prime of theta i
01:07:15.510 --> 01:07:16.510
with respect to theta i.
01:07:16.510 --> 01:07:18.120
So that's the second derivative.
01:07:18.120 --> 01:07:20.240
And I'm going to call it Vi.
01:07:20.240 --> 01:07:24.430
If phi is equal to 1, this
is actually the variance.
01:07:24.430 --> 01:07:27.130
And then I have this
function H, which
01:07:27.130 --> 01:07:30.710
allows me to bypass altogether
the existence of this parameter
01:07:30.710 --> 01:07:33.730
mu, which says if I want to
go from Xi transpose beta
01:07:33.730 --> 01:07:37.420
all the way to theta i, I
have to first do g inverse,
01:07:37.420 --> 01:07:39.210
and then b prime inverse.
01:07:39.210 --> 01:07:41.796
If I stopped here, I
would just have mu.
01:07:41.796 --> 01:07:42.296
OK?
01:07:46.030 --> 01:07:46.530
OK.
01:07:46.530 --> 01:07:48.240
So now what I'm
going to do is I'm
01:07:48.240 --> 01:07:50.800
going to apply the chain rule.
01:07:50.800 --> 01:07:53.070
And I'm going to try to
compute the derivative
01:07:53.070 --> 01:07:56.930
of my log-likelihood
with respect to beta.
01:07:56.930 --> 01:08:00.517
So, again, the
log-likelihood is much nicer
01:08:00.517 --> 01:08:03.100
when I read it as a function of
theta than a function of beta,
01:08:03.100 --> 01:08:05.930
but it's basically what
we've been doing by hand.
01:08:05.930 --> 01:08:08.420
You can write it as a
derivative with respect
01:08:08.420 --> 01:08:11.390
to theta first, and then
multiply by the derivative
01:08:11.390 --> 01:08:13.240
of theta with respect to beta.
01:08:13.240 --> 01:08:13.740
OK.
01:08:13.740 --> 01:08:16.189
And we know that
theta depends on beta
01:08:16.189 --> 01:08:20.020
as H of Xi transpose beta.
01:08:20.020 --> 01:08:20.523
OK?
01:08:20.523 --> 01:08:22.189
I mean, that's basically
what we've been
01:08:22.189 --> 01:08:26.010
doing for the Bernoulli case.
01:08:26.010 --> 01:08:28.540
I mean, we used the chain rule
without actually saying it.
01:08:28.540 --> 01:08:30.665
But this is going to be
convenient to actually make
01:08:30.665 --> 01:08:32.200
it explicitly show up.
01:08:32.200 --> 01:08:32.700
OK.
01:08:32.700 --> 01:08:35.640
So when I first take the
derivative of my log-likelihood
01:08:35.640 --> 01:08:39.689
with respect to theta,
I'm going to use the fact
01:08:39.689 --> 01:08:42.280
that my canonical
family is super simple.
01:08:42.280 --> 01:08:42.779
OK.
01:08:42.779 --> 01:08:50.220
So what I have is that
my log-likelihood ln
01:08:50.220 --> 01:08:56.019
is the sum from i equal 1
to n of Yi theta i minus b
01:08:56.019 --> 01:09:00.240
of theta i divided by phi
plus some constant, which
01:09:00.240 --> 01:09:01.680
will go away as
soon as I'm going
01:09:01.680 --> 01:09:03.120
to take my first derivative.
01:09:03.120 --> 01:09:08.210
So if I take the derivative with
respect to theta i of this guy,
01:09:08.210 --> 01:09:12.439
this is actually going to be
equal to Yi minus b prime theta
01:09:12.439 --> 01:09:16.870
i divided by phi.
01:09:16.870 --> 01:09:20.319
And then I need to multiply
by the derivative of theta i
01:09:20.319 --> 01:09:21.700
with respect to beta.
01:09:21.700 --> 01:09:26.319
Remember, theta is H
of Xi transpose beta.
01:09:26.319 --> 01:09:31.390
So the derivative of theta
i with respect to beta j,
01:09:31.390 --> 01:09:36.819
this is equal to H prime
of Xi transpose beta.
01:09:36.819 --> 01:09:40.060
And then I have the
derivative of this guy.
01:09:40.060 --> 01:09:46.029
Actually, let me just do
the gradient of theta i
01:09:46.029 --> 01:09:48.229
at beta, right?
01:09:48.229 --> 01:09:50.020
That's what we did.
01:09:50.020 --> 01:09:53.029
I'm just thinking of theta i
as being a function of beta.
01:09:53.029 --> 01:09:55.210
So what should I add here?
01:09:55.210 --> 01:10:01.812
It was just the vector Xi, which
is just the chain rule again.
01:10:01.812 --> 01:10:02.770
That's H prime, right?
01:10:02.770 --> 01:10:06.170
You don't see it, but there's
a prime here that's derivative.
01:10:06.170 --> 01:10:06.730
OK.
01:10:06.730 --> 01:10:10.030
We've done that without
saying it explicitly.
01:10:10.030 --> 01:10:11.780
So now if I multiply
those two things
01:10:11.780 --> 01:10:14.440
I have this Yi minus
b prime of the theta
01:10:14.440 --> 01:10:19.020
i, which I call by its
good name, which is mu i.
01:10:19.020 --> 01:10:20.790
b prime of theta i
is the expectation
01:10:20.790 --> 01:10:22.980
of Yi conditionally on Xi.
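As a minimal sketch of the identity b prime of theta equals the conditional mean, take the Bernoulli family in canonical form, where b(theta) = log(1 + exp(theta)) and b'(theta) is the sigmoid, i.e. mu = E[Y | X]; function names here are illustrative, not from the lecture.

```python
import numpy as np

# Bernoulli in canonical (exponential-family) form:
# b(theta) = log(1 + exp(theta)), so b'(theta) = sigmoid(theta) = mu.

def b(theta):
    return np.log1p(np.exp(theta))

def b_prime(theta):
    return 1.0 / (1.0 + np.exp(-theta))

theta = 0.7
eps = 1e-6
# central-difference derivative of b matches b'(theta) = sigmoid(theta)
numeric = (b(theta + eps) - b(theta - eps)) / (2 * eps)
assert abs(numeric - b_prime(theta)) < 1e-8
```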
01:10:22.980 --> 01:10:24.910
And then I multiply
by this thing here.
01:10:24.910 --> 01:10:28.320
So here, this thing is written
coordinate by coordinate.
01:10:28.320 --> 01:10:30.630
But I can write
it as a big vector
01:10:30.630 --> 01:10:33.350
when I stack them together.
01:10:33.350 --> 01:10:36.250
And so what I claim is
that this thing here
01:10:36.250 --> 01:10:39.190
is of the form Y minus mu.
01:10:39.190 --> 01:10:40.420
But here I put some tildes.
01:10:40.420 --> 01:10:46.660
Because what I did is that
first I multiplied everything
01:10:46.660 --> 01:10:50.471
by g prime of mu for each mu.
01:10:50.471 --> 01:10:50.970
OK.
01:10:50.970 --> 01:10:52.620
So why not?
01:10:52.620 --> 01:10:54.930
OK.
01:10:54.930 --> 01:10:58.680
Actually, on this slide it will
make no sense why I do this.
01:10:58.680 --> 01:11:01.710
I basically multiply
by g prime on one side
01:11:01.710 --> 01:11:04.080
and divide by g prime
on the other side.
01:11:04.080 --> 01:11:12.090
So what I write so far is that
the gradient of ln with respect
01:11:12.090 --> 01:11:19.970
to beta is the sum from i
equal 1 to n of Yi minus mu i,
01:11:19.970 --> 01:11:23.290
let's call it,
divide by phi times
01:11:23.290 --> 01:11:28.591
H prime of Xi transpose beta Xi.
01:11:28.591 --> 01:11:29.090
OK.
01:11:29.090 --> 01:11:31.840
So I just stacked
everything that's here.
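The stacked gradient formula above can be sketched numerically; this is an illustrative check, assuming the logistic case with the canonical link (so H is the identity, H' = 1, and mu_i is the sigmoid of Xi transpose beta), with made-up data.

```python
import numpy as np

# grad ln(beta) = sum_i (Yi - mu_i)/phi * H'(Xi^T beta) * Xi
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = rng.integers(0, 2, size=n).astype(float)
beta = rng.normal(size=p)
phi = 1.0

mu = 1.0 / (1.0 + np.exp(-X @ beta))   # b'(theta_i) with theta_i = Xi^T beta
h_prime = np.ones(n)                   # H is the identity for the canonical link

# coordinate-by-coordinate sum over i ...
grad_sum = sum((Y[i] - mu[i]) / phi * h_prime[i] * X[i] for i in range(n))
# ... equals the stacked (vectorized) version
grad_vec = X.T @ ((Y - mu) / phi * h_prime)

assert np.allclose(grad_sum, grad_vec)
```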
01:11:31.840 --> 01:11:33.660
And now I'm going to
start calling things.
01:11:33.660 --> 01:11:36.180
The first thing I'm going to
do is I'm going to divide.
01:11:36.180 --> 01:11:38.320
So this guy here I'm
going to push here.
01:11:40.840 --> 01:11:42.640
Now, this guy here
I'm actually going
01:11:42.640 --> 01:11:48.870
to multiply by g prime of mu i.
01:11:48.870 --> 01:11:51.475
And this guy I'm going to
divide by g prime of mu i.
01:11:51.475 --> 01:11:54.780
So there's really nothing
that happened here.
01:11:54.780 --> 01:11:59.780
I just took g prime and multiply
and divide it by g prime.
01:11:59.780 --> 01:12:00.615
Why do I do this?
01:12:00.615 --> 01:12:02.240
Well, that's actually
going to be clear
01:12:02.240 --> 01:12:06.320
when we talk about iteratively
re-weighted least squares.
01:12:06.320 --> 01:12:11.320
But now, essentially I have
a new mu, a Y which is--
01:12:11.320 --> 01:12:14.620
so this thing now is going
to be Y tilde minus mu
01:12:14.620 --> 01:12:19.450
tilde, so Y tilde i minus mu tilde i.
01:12:19.450 --> 01:12:23.720
Now, this guy here
I'm going to call Wi.
01:12:27.800 --> 01:12:31.520
And I have the Xi
that's there, which
01:12:31.520 --> 01:12:38.120
means that now the thing that
I have here I can write as
01:12:38.120 --> 01:12:39.860
follows.
01:12:39.860 --> 01:12:46.280
Gradient ln of beta
is equal to what?
01:12:46.280 --> 01:12:49.360
Well, I'm going to write
it in matrix forms.
01:12:49.360 --> 01:12:53.750
So I have the sum over i of
something multiplied by Xi.
01:12:53.750 --> 01:12:57.110
So I'm going to write
it as x transpose.
01:12:57.110 --> 01:13:02.990
Then I'm going to have
this matrix W1 Wn, and then
01:13:02.990 --> 01:13:05.030
0 elsewhere.
01:13:05.030 --> 01:13:09.520
And then I'm going to
have my Y tilde minus mu.
01:13:09.520 --> 01:13:15.190
And remember, X is the
matrix with-- sorry,
01:13:15.190 --> 01:13:16.880
it should be a bit [INAUDIBLE].
01:13:16.880 --> 01:13:19.610
I have n, and then p.
01:13:19.610 --> 01:13:26.560
And here I have my Xi j in this
matrix on row i and column j.
01:13:26.560 --> 01:13:30.670
And this is just a matrix that
has the Wi's on the diagonal.
01:13:30.670 --> 01:13:32.830
And then I have
Y tilde minus mu.
01:13:32.830 --> 01:13:36.530
So this is just the matrix
rewriting of this formula.
01:13:36.530 --> 01:13:37.030
All right.
01:13:37.030 --> 01:13:38.446
So it's just saying
that if I look
01:13:38.446 --> 01:13:42.340
at the sum of weighted
things of my columns of Xi,
01:13:42.340 --> 01:13:44.480
it's basically the same thing.
01:13:44.480 --> 01:13:46.750
When I'm going to multiply
this by my matrix,
01:13:46.750 --> 01:13:48.740
I'm going to get exactly
those terms, right?
01:13:48.740 --> 01:13:52.900
Yi tilde minus mu i tilde times Wi.
01:13:52.900 --> 01:13:56.170
And then when I actually
take this Xi transpose
01:13:56.170 --> 01:13:58.120
times this guy, I'm
really just getting
01:13:58.120 --> 01:14:04.590
the sum of the columns
with the weights, right?
01:14:04.590 --> 01:14:05.930
Agree?
01:14:05.930 --> 01:14:08.540
If I look at this
thing here, this
01:14:08.540 --> 01:14:15.915
is a vector that has as
coordinates Wi times Yi
01:14:15.915 --> 01:14:19.320
tilde minus mu i tilde.
01:14:19.320 --> 01:14:21.020
And I have n of them.
01:14:21.020 --> 01:14:24.230
So when I multiply X
transpose by this guy,
01:14:24.230 --> 01:14:28.060
I'm just looking at a weighted
sum of the columns of X
01:14:28.060 --> 01:14:31.130
transpose, which is a
weighted sum of the rows of X,
01:14:31.130 --> 01:14:34.166
which are exactly my Xi's.
01:14:34.166 --> 01:14:36.332
All right, and that's this
weighted sum of the Xi's.
01:14:40.110 --> 01:14:40.610
OK.
01:14:40.610 --> 01:14:43.220
So here, as I said,
the fact that we
01:14:43.220 --> 01:14:48.050
decided to put this g prime of
mu i here and g prime of mu i
01:14:48.050 --> 01:14:50.600
here, we could have
not done this, right?
01:14:50.600 --> 01:14:54.470
We could have just said,
I forget about the tilde
01:14:54.470 --> 01:14:56.890
and just call it Yi minus mu i.
01:14:56.890 --> 01:15:00.500
And here, I just put everything
I don't know into some Wi.
01:15:00.500 --> 01:15:01.784
And so why do I do this?
01:15:01.784 --> 01:15:03.200
Well, it's because
when I actually
01:15:03.200 --> 01:15:06.830
start looking at the Hessian,
what's going to happen?
01:15:06.830 --> 01:15:08.080
AUDIENCE: [INAUDIBLE].
01:15:08.080 --> 01:15:09.080
PHILIPPE RIGOLLET: Yeah.
01:15:09.080 --> 01:15:10.330
We'll do that next time.
01:15:10.330 --> 01:15:14.370
But let's just look quickly at
the outcome of the computation
01:15:14.370 --> 01:15:16.560
of my Hessian.
01:15:16.560 --> 01:15:20.160
So I compute a bunch
of second derivatives.
01:15:20.160 --> 01:15:21.630
And here, I have
two terms, right?
01:15:21.630 --> 01:15:22.590
Well, he's gone.
01:15:22.590 --> 01:15:23.910
So I have two terms.
01:15:23.910 --> 01:15:26.569
And when I take the
expectation now,
01:15:26.569 --> 01:15:28.110
it's going to actually
change, right?
01:15:28.110 --> 01:15:31.150
This thing is actually
going to depend on Yi.
01:15:31.150 --> 01:15:34.070
Because I have an H which
is not the identity.
01:15:34.070 --> 01:15:37.020
Oh, no, you're here, sorry.
01:15:37.020 --> 01:15:39.460
So when I start looking
at the expectation,
01:15:39.460 --> 01:15:42.780
so I look at the conditional
expectation given Xi.
01:15:42.780 --> 01:15:46.599
The first term here has
a Yi minus expectation.
01:15:46.599 --> 01:15:48.390
So when I take the
conditional expectation,
01:15:48.390 --> 01:15:49.260
this is going to be 0.
01:15:49.260 --> 01:15:50.926
The first term is
going away when I take
01:15:50.926 --> 01:15:52.470
the conditional expectation.
01:15:52.470 --> 01:15:54.122
But this was
actually gone already
01:15:54.122 --> 01:15:56.580
if we had the canonical link,
because the second derivative
01:15:56.580 --> 01:16:00.480
of H when H is
the identity is 0.
01:16:00.480 --> 01:16:03.490
But if H is not the identity,
H prime prime may not be 0.
01:16:03.490 --> 01:16:07.120
And so I need that part
to remove that term.
01:16:07.120 --> 01:16:09.340
And so now, you know,
I work a little bit,
01:16:09.340 --> 01:16:10.420
and I get this term.
01:16:10.420 --> 01:16:11.650
That's not very surprising.
01:16:11.650 --> 01:16:15.150
In the second derivative, I see
I have terms in b prime prime.
01:16:15.150 --> 01:16:18.160
I have term in H
prime, but squared.
01:16:18.160 --> 01:16:21.700
And then I have my Xi
outer Xi, Xi, Xi transpose,
01:16:21.700 --> 01:16:23.180
which we know we would see.
01:16:23.180 --> 01:16:23.680
OK.
01:16:23.680 --> 01:16:26.170
So we'll go through
those things next time.
01:16:26.170 --> 01:16:31.850
But what I want to show you is
that now once I compute this,
01:16:31.850 --> 01:16:35.960
I can actually show that if
I look at this product that
01:16:35.960 --> 01:16:40.700
showed up, I had b prime
prime times H prime squared.
01:16:40.700 --> 01:16:44.200
One of those terms is
actually 1 over g prime.
01:16:44.200 --> 01:16:46.390
And so I can rewrite it
as one of the H primes,
01:16:46.390 --> 01:16:49.120
because I had a square,
divided by g prime.
01:16:49.120 --> 01:16:51.730
And now, I have this
Xi Xi transpose.
01:16:51.730 --> 01:16:57.370
So if I did not put
the g prime in the W
01:16:57.370 --> 01:17:00.340
that I put here
completely artificially,
01:17:00.340 --> 01:17:04.660
I would not be able to call
this guy Wi, which is exactly
01:17:04.660 --> 01:17:06.920
what it is from this board.
01:17:06.920 --> 01:17:09.970
And now that this guy is Wi, I
can actually write this thing
01:17:09.970 --> 01:17:14.510
here as X transpose WX.
01:17:14.510 --> 01:17:15.010
OK?
01:17:15.010 --> 01:17:17.310
And that's why I
really wanted my W
01:17:17.310 --> 01:17:20.920
to have this g prime of
mu i in the denominator.
01:17:20.920 --> 01:17:24.107
Because now I can actually
write a term that depends on W.
01:17:24.107 --> 01:17:26.440
Now, you might say, how do I
reconcile those two things?
01:17:26.440 --> 01:17:28.090
What the hell are you doing?
01:17:28.090 --> 01:17:29.980
And what the hell I'm
doing is essentially
01:17:29.980 --> 01:17:35.380
that I'm saying that if
you write beta k according
01:17:35.380 --> 01:17:37.750
to the Fisher-scoring
iterations,
01:17:37.750 --> 01:17:41.660
you can actually write it
as just this term here,
01:17:41.660 --> 01:17:46.090
which is of the form X transpose
X inverse X transpose Y.
01:17:46.090 --> 01:17:50.120
But I actually
squeezed in these W's.
01:17:50.120 --> 01:17:52.024
And that's actually a
weighted least square.
01:17:52.024 --> 01:17:53.690
And it's applied to
this particular guy.
01:17:53.690 --> 01:17:56.090
So we'll talk about those
weighted least squares.
01:17:56.090 --> 01:17:58.400
But remember, least
squares is of the form--
01:17:58.400 --> 01:18:02.725
beta hat is X transpose
X inverse X transpose Y.
01:18:02.725 --> 01:18:04.350
And here it's basically
the same thing,
01:18:04.350 --> 01:18:09.290
except that I squeeze in
some W after my X transpose.
01:18:09.290 --> 01:18:09.790
OK.
01:18:09.790 --> 01:18:12.562
So that's how we're
going to solve it.
01:18:12.562 --> 01:18:14.020
I don't want to go
into the details
01:18:14.020 --> 01:18:18.190
now, mostly because we're
running out of time.
01:18:18.190 --> 01:18:20.940
Are there any questions?