WEBVTT
00:00:00.120 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.880
Commons license.
00:00:03.880 --> 00:00:06.090
Your support will help
MIT OpenCourseWare
00:00:06.090 --> 00:00:10.180
continue to offer high-quality
educational resources for free.
00:00:10.180 --> 00:00:12.720
To make a donation or to
view additional materials
00:00:12.720 --> 00:00:16.680
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.680 --> 00:00:17.880
at ocw.mit.edu.
00:00:21.884 --> 00:00:23.300
PHILIPPE RIGOLLET:
So I apologize.
00:00:23.300 --> 00:00:27.810
My voice is not 100%.
00:00:27.810 --> 00:00:32.930
So if you don't understand
what I'm saying, please ask me.
00:00:32.930 --> 00:00:36.440
So we're going to be analyzing--
actually, not really analyzing.
00:00:36.440 --> 00:00:38.750
We described a
second-order method
00:00:38.750 --> 00:00:42.860
to optimize the log likelihood
in a generalized linear model,
00:00:42.860 --> 00:00:45.637
when the parameter
of interest was beta.
00:00:45.637 --> 00:00:47.970
So here, I'm going to rewrite
the whole thing as a function of beta.
00:00:47.970 --> 00:00:49.740
So that's the equation you see.
00:00:49.740 --> 00:00:51.560
But we really have this beta.
00:00:51.560 --> 00:00:58.160
And at iteration k plus 1,
beta is given by beta k.
00:00:58.160 --> 00:01:01.170
And then I have a plus sign.
00:01:01.170 --> 00:01:06.390
And the plus, if you think of
the Fisher information at beta
00:01:06.390 --> 00:01:09.340
k as being some number--
00:01:09.340 --> 00:01:11.090
if you were to say
whether it's a positive
00:01:11.090 --> 00:01:12.549
or a negative
number, it's actually
00:01:12.549 --> 00:01:14.339
going to be a positive
number, because it's
00:01:14.339 --> 00:01:15.770
a positive semi-definite matrix.
00:01:15.770 --> 00:01:18.020
So since we're doing
gradient ascent,
00:01:18.020 --> 00:01:19.730
we have a plus sign here.
00:01:19.730 --> 00:01:21.620
And then the
direction is basically
00:01:21.620 --> 00:01:26.750
gradient ln at beta k.
00:01:26.750 --> 00:01:27.410
OK?
00:01:27.410 --> 00:01:30.320
So these are the iterations that
we're trying to implement.
00:01:30.320 --> 00:01:31.540
And we could just do this.
00:01:31.540 --> 00:01:34.814
At each iteration, we compute
the Fisher information,
00:01:34.814 --> 00:01:36.230
and then we do it
again and again.
00:01:36.230 --> 00:01:36.869
All right.
00:01:36.869 --> 00:01:38.660
That's called the
Fisher-scoring algorithm.
00:01:38.660 --> 00:01:41.045
And I told you that this
was going to converge.
00:01:41.045 --> 00:01:44.090
And what we're going to
try to do in this lecture
00:01:44.090 --> 00:01:46.100
is to show how we can
re-implement this,
00:01:46.100 --> 00:01:48.770
using iteratively
re-weighted least squares,
00:01:48.770 --> 00:01:50.870
so that each step
of this algorithm
00:01:50.870 --> 00:01:54.270
consists simply of solving a
weighted least squares problem.
00:01:54.270 --> 00:01:54.770
All right.
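The iteration just described can be sketched in code. This is a minimal sketch, not the lecture's own implementation: grad_ln and fisher_info are hypothetical caller-supplied callables, and the plus sign reflects gradient ascent on the log-likelihood with I(beta) positive semi-definite.

```python
import numpy as np

def fisher_scoring(beta0, grad_ln, fisher_info, n_iters=25):
    """Generic Fisher scoring: beta <- beta + I(beta)^{-1} grad ln(beta).

    grad_ln and fisher_info are hypothetical callables supplied by the
    caller (names are placeholders, not from the lecture).
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iters):
        # Solve I(beta) step = grad ln(beta) instead of forming the inverse.
        step = np.linalg.solve(fisher_info(beta), grad_ln(beta))
        beta = beta + step
    return beta

# Toy check: ascending the concave function -0.5*(beta - 2)^2,
# whose gradient is -(beta - 2) and whose Fisher information is 1.
beta_hat = fisher_scoring(
    np.array([0.0]),
    grad_ln=lambda b: -(b - 2.0),
    fisher_info=lambda b: np.array([[1.0]]),
)
```

With a constant unit Fisher information the step equals the gradient, so the iterate lands on the maximizer immediately.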
00:01:54.770 --> 00:01:59.840
So let's go back quickly
and remind ourselves
00:01:59.840 --> 00:02:04.830
that we are in the Gaussian--
00:02:04.830 --> 00:02:07.110
sorry, we're in the
exponential family.
00:02:07.110 --> 00:02:10.400
So if I look at the log
likelihood for one observation,
00:02:10.400 --> 00:02:12.290
so here it's ln--
00:02:12.290 --> 00:02:13.280
sorry.
00:02:13.280 --> 00:02:17.210
This is the sum from i
equal 1 to n of yi minus--
00:02:20.694 --> 00:02:26.770
OK, so it's yi times theta
i, sorry, minus b of theta i.
00:02:26.770 --> 00:02:28.550
Then there's going
to be some parameter.
00:02:28.550 --> 00:02:32.810
And then I have
plus c of yi phi.
00:02:32.810 --> 00:02:33.310
OK.
00:02:33.310 --> 00:02:35.210
So just the
exponential went away
00:02:35.210 --> 00:02:36.860
when I took the log
of the likelihood.
00:02:36.860 --> 00:02:38.600
And I have n observations,
so I'm summing
00:02:38.600 --> 00:02:40.320
over all n observations.
00:02:40.320 --> 00:02:40.820
All right.
00:02:40.820 --> 00:02:43.445
Then we had a bunch of
formulas that we came up with.
00:02:43.445 --> 00:02:46.100
So if I look at the
expectation of yi--
00:02:46.100 --> 00:02:49.880
so that's really the
conditional of yi, given xi.
00:02:49.880 --> 00:02:52.230
But like here, it
really doesn't matter.
00:02:52.230 --> 00:02:54.110
It's just going to be
different for each i.
00:02:54.110 --> 00:02:55.880
This is denoted by mu i.
00:02:55.880 --> 00:03:00.830
And we showed that this
was b prime of theta i.
00:03:00.830 --> 00:03:03.880
Then the other equation
that we found was that.
00:03:03.880 --> 00:03:06.350
And so what we want to
model is this thing.
00:03:06.350 --> 00:03:10.220
We want it to be equal to
xi transpose beta-- sorry,
00:03:10.220 --> 00:03:11.190
g of this thing.
00:03:14.918 --> 00:03:15.620
All right.
00:03:15.620 --> 00:03:19.010
So that's our model.
00:03:19.010 --> 00:03:21.650
And then we had that
the variance was also
00:03:21.650 --> 00:03:23.360
given by the second derivative.
00:03:23.360 --> 00:03:24.747
I'm not going to go into it.
00:03:24.747 --> 00:03:26.330
What's actually
interesting is to see,
00:03:26.330 --> 00:03:32.240
if we want to express theta i as
a function of xi, what we get,
00:03:32.240 --> 00:03:38.900
going from xi to mu i by g
inverse, and then to theta i
00:03:38.900 --> 00:03:43.790
by b prime inverse, we
get that theta i
00:03:43.790 --> 00:03:51.660
is equal to h of xi transpose
beta, h of xi transpose beta,
00:03:51.660 --> 00:03:56.340
where h is the inverse--
00:03:56.340 --> 00:03:58.890
so which order is this?
00:03:58.890 --> 00:04:03.650
It's b prime inverse
composed with g inverse.
00:04:03.650 --> 00:04:05.210
OK?
00:04:05.210 --> 00:04:09.194
So we remember that last time,
those are all computations
00:04:09.194 --> 00:04:10.610
that we've made,
but they're going
00:04:10.610 --> 00:04:12.660
to be useful in our derivation.
00:04:12.660 --> 00:04:14.510
And the first thing
we did last time is
00:04:14.510 --> 00:04:17.690
to show that, if I look now
at the derivative of the log
00:04:17.690 --> 00:04:20.852
likelihood with respect to
one coordinate of beta, which
00:04:20.852 --> 00:04:23.060
is going to give me the
gradient if I do that for all
00:04:23.060 --> 00:04:25.250
the coordinates, what
we ended up finding
00:04:25.250 --> 00:04:28.400
is that we can rewrite
it in this form, a sum
00:04:28.400 --> 00:04:31.470
of yi tilde minus mu tilde.
00:04:31.470 --> 00:04:33.380
So let's remind ourselves that--
00:04:36.340 --> 00:04:41.560
so y tilde is just y divided--
00:04:41.560 --> 00:04:45.010
well, OK y tilde i is yi--
00:04:45.010 --> 00:04:46.300
is it times or divided--
00:04:46.300 --> 00:04:50.890
times g prime of mu i.
00:04:50.890 --> 00:05:00.140
Mu tilde i is mu i
times g prime of mu i.
00:05:00.140 --> 00:05:02.980
And then that was just
an artificial thing,
00:05:02.980 --> 00:05:07.070
so that we could actually
divide the weights by g prime.
00:05:07.070 --> 00:05:10.060
But the real thing that builds
the weights is this h prime.
00:05:10.060 --> 00:05:12.310
And there's this
normalization factor.
00:05:12.310 --> 00:05:14.440
And so if we read it
like that-- so if I also
00:05:14.440 --> 00:05:22.900
write that wi is h prime of
xi transpose beta divided
00:05:22.900 --> 00:05:27.640
by g prime of mu
i times phi, then
00:05:27.640 --> 00:05:30.820
I could actually rewrite
my gradient, which
00:05:30.820 --> 00:05:34.270
is a vector, in the
following matrix form,
00:05:34.270 --> 00:05:40.820
the gradient ln at beta.
00:05:40.820 --> 00:05:44.300
So the gradient of my
log likelihood of beta
00:05:44.300 --> 00:05:45.390
took the following form.
00:05:45.390 --> 00:05:53.600
It was x transpose w, and
then y tilde minus mu tilde.
00:05:53.600 --> 00:05:57.770
And here, w was just
the matrix with w1,
00:05:57.770 --> 00:06:02.020
w2, all the way to wn
on the diagonal and 0
00:06:02.020 --> 00:06:04.142
off the diagonal.
00:06:04.142 --> 00:06:06.030
OK?
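The matrix form of the gradient just written on the board can be checked numerically. This is purely an illustrative sketch with randomly generated placeholder data: it verifies that X transpose W (y tilde minus mu tilde) matches the coordinate-wise sum from the previous board.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))          # rows are x_i
w = rng.uniform(1.0, 2.0, size=n)    # weights w_1, ..., w_n
y_t = rng.normal(size=n)             # stands in for y tilde
mu_t = rng.normal(size=n)            # stands in for mu tilde

# Matrix form: gradient = X^T W (y_tilde - mu_tilde), W = diag(w).
grad_matrix = X.T @ np.diag(w) @ (y_t - mu_t)

# Coordinate form: the j-th entry is sum_i (y_t_i - mu_t_i) * w_i * x_{ij}.
grad_coord = np.array(
    [np.sum((y_t - mu_t) * w * X[:, j]) for j in range(p)]
)
```

The two computations agree entry by entry, which is exactly the rewriting of the coordinate-wise derivative into vector form.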
00:06:06.030 --> 00:06:08.340
So that was just
taking the derivative
00:06:08.340 --> 00:06:11.490
and doing a slight
manipulation that said,
00:06:11.490 --> 00:06:14.670
well, let's just divide
whatever is here by g
00:06:14.670 --> 00:06:17.740
prime and multiply whatever
is here by g prime.
00:06:17.740 --> 00:06:19.680
So today, we'll see why
we make this division
00:06:19.680 --> 00:06:23.220
and multiplication by g prime,
which seems to make no sense,
00:06:23.220 --> 00:06:26.620
but it actually comes from
the Hessian computations.
00:06:26.620 --> 00:06:28.530
So the Hessian
computations are going
00:06:28.530 --> 00:06:29.790
to be a little more annoying.
00:06:29.790 --> 00:06:33.600
Actually, let me start directly
with the coordinate-wise
00:06:33.600 --> 00:06:34.440
derivative, right?
00:06:34.440 --> 00:06:37.740
So to build this gradient,
what we used, in the end,
00:06:37.740 --> 00:06:41.880
was that the partial derivative
of ln with respect to the jth
00:06:41.880 --> 00:06:49.220
coordinate of beta was
equal to the sum over i
00:06:49.220 --> 00:06:55.520
of yi tilde minus mu
i tilde times wi times
00:06:55.520 --> 00:06:59.725
the jth coordinate of xi.
00:06:59.725 --> 00:07:01.310
OK?
00:07:01.310 --> 00:07:03.480
So now, let's just take
another derivative,
00:07:03.480 --> 00:07:07.810
and that's going to give us
the entries of the Hessian.
00:07:07.810 --> 00:07:11.680
OK, so we're going to
the second derivative.
00:07:11.680 --> 00:07:16.950
So what I want to compute is
the derivative with respect
00:07:16.950 --> 00:07:18.740
to beta j and beta k.
00:07:21.830 --> 00:07:22.330
OK.
00:07:22.330 --> 00:07:24.525
So where does beta j--
00:07:24.525 --> 00:07:26.650
so here, I already took
the derivative with respect
00:07:26.650 --> 00:07:27.550
to beta j.
00:07:27.550 --> 00:07:29.530
So this is just the
derivative with respect
00:07:29.530 --> 00:07:32.850
to beta k of the derivative
with respect to beta j.
00:07:36.874 --> 00:07:39.290
So what I need to do is to
take the derivative of this guy
00:07:39.290 --> 00:07:40.790
with respect to beta k.
00:07:40.790 --> 00:07:42.510
Where does beta k show up here?
00:07:48.920 --> 00:07:52.170
It's sitting in two places.
00:07:52.170 --> 00:07:53.179
AUDIENCE: In the y's?
00:07:53.179 --> 00:07:54.970
PHILIPPE RIGOLLET: No,
it's not in the y's.
00:07:54.970 --> 00:07:56.760
The y's are my data, right?
00:07:59.470 --> 00:08:02.220
But I mean, it's
in the y tildes.
00:08:02.220 --> 00:08:03.700
Yeah, because it's in mu, right?
00:08:03.700 --> 00:08:04.960
Mu depends on beta.
00:08:04.960 --> 00:08:09.270
Mu is g inverse of
xi transpose beta.
00:08:09.270 --> 00:08:12.930
And it's also in the wi's.
00:08:12.930 --> 00:08:17.070
Actually, everything that you
see is directly-- well, OK, w
00:08:17.070 --> 00:08:21.810
depends on mu and on
beta explicitly.
00:08:21.810 --> 00:08:24.480
But the rest depends only on mu.
00:08:24.480 --> 00:08:27.930
And so we might want
to be a little--
00:08:27.930 --> 00:08:30.660
well, we can actually use the--
00:08:30.660 --> 00:08:32.220
did I use the
chain rule already?
00:08:35.059 --> 00:08:36.950
Yeah, it's here.
00:08:36.950 --> 00:08:40.780
But OK, well, let's go for it.
00:08:49.200 --> 00:08:50.695
Oh yeah, OK.
00:08:50.695 --> 00:08:52.320
Sorry, I should not
write it like that,
00:08:52.320 --> 00:08:54.390
because that was actually--
00:08:54.390 --> 00:08:57.000
right, so I make my life
miserable by just multiplying
00:08:57.000 --> 00:09:00.800
and dividing by
this g prime of mu.
00:09:00.800 --> 00:09:02.310
I should not do this, right?
00:09:02.310 --> 00:09:04.860
So what I should just write
is say that this guy here--
00:09:04.860 --> 00:09:09.180
I'm actually going to
remove the g prime of mu,
00:09:09.180 --> 00:09:11.570
because I just make something
that depends on theta
00:09:11.570 --> 00:09:13.090
appear when it
really should not.
00:09:13.090 --> 00:09:15.880
So let's just look at the
last but one equality.
00:09:23.000 --> 00:09:23.500
OK.
00:09:23.500 --> 00:09:27.430
So that's the one over
there, and then I have xi j.
00:09:27.430 --> 00:09:29.800
OK, so here, it makes my
life much simpler,
00:09:29.800 --> 00:09:31.852
because yi does
not depend on beta,
00:09:31.852 --> 00:09:34.310
but this guy depends on beta,
and this guy depends on beta.
00:09:34.310 --> 00:09:35.074
All right.
00:09:35.074 --> 00:09:36.490
So when I take the
derivative, I'm
00:09:36.490 --> 00:09:38.440
going to have to be a
little more careful now.
00:09:38.440 --> 00:09:40.300
But I just have a
derivative of a product,
00:09:40.300 --> 00:09:42.080
nothing more complicated.
00:09:42.080 --> 00:09:43.345
So this is what?
00:09:43.345 --> 00:09:45.350
Well, the sum is
going to be linear,
00:09:45.350 --> 00:09:46.710
so it's going to come out.
00:09:46.710 --> 00:09:51.170
Then I'm going to have to take
the derivative of this term.
00:09:51.170 --> 00:09:54.120
So it's just going
to be 1 over phi.
00:09:54.120 --> 00:09:58.460
Then the derivative
of mu i with respect
00:09:58.460 --> 00:10:04.440
to beta k, which I will
just write like this,
00:10:04.440 --> 00:10:09.920
times h prime of xi
transpose beta xi j.
00:10:09.920 --> 00:10:15.570
And then I'm going to have the
other one, which is yi minus mu
00:10:15.570 --> 00:10:24.640
i over phi times the second
derivative of h of xi transpose
00:10:24.640 --> 00:10:25.330
beta.
00:10:25.330 --> 00:10:27.038
And then I'm going to
take the derivative
00:10:27.038 --> 00:10:30.190
of this guy with respect to beta
k, which is just
00:10:30.190 --> 00:10:32.196
xi k.
00:10:32.196 --> 00:10:36.200
So I have xi j times xi k.
00:10:36.200 --> 00:10:36.700
OK.
00:10:36.700 --> 00:10:40.400
So I still need to
compute this guy.
00:10:40.400 --> 00:10:42.590
So what is the
partial derivative
00:10:42.590 --> 00:10:46.430
with respect to beta k of g?
00:10:46.430 --> 00:10:49.310
So mu is g of--
00:10:49.310 --> 00:10:52.524
sorry, it's g inverse
of xi transpose beta.
00:10:56.610 --> 00:10:58.400
OK?
00:10:58.400 --> 00:10:59.460
So what do I get?
00:10:59.460 --> 00:11:01.610
Well, I'm going
to get definitely
00:11:01.610 --> 00:11:02.990
the second derivative of g.
00:11:11.558 --> 00:11:14.050
Well, OK, that's
actually not a bad idea.
00:11:17.857 --> 00:11:18.690
Well, no, that's OK.
00:11:18.690 --> 00:11:21.150
I can make the second--
00:11:21.150 --> 00:11:22.850
what makes my life
easier, actually?
00:11:26.690 --> 00:11:31.010
Give me one second.
00:11:31.010 --> 00:11:33.230
Well, there's no
one that actually
00:11:33.230 --> 00:11:35.660
makes my life so much easier.
00:11:35.660 --> 00:11:36.872
Let's just write it.
00:11:36.872 --> 00:11:37.830
Let's go with this guy.
00:11:37.830 --> 00:11:43.300
So it's going to be g prime
prime of xi transpose beta
00:11:43.300 --> 00:11:47.677
times xi k.
00:11:47.677 --> 00:11:50.140
OK?
00:11:50.140 --> 00:11:53.470
So now, what do I have
if I collect my terms?
00:11:53.470 --> 00:12:05.990
I have that this whole thing
here, the second derivative is,
00:12:05.990 --> 00:12:10.600
well, I have the sum
from i equal 1 to n.
00:12:10.600 --> 00:12:13.200
Then I have terms that
I can factor out, right?
00:12:13.200 --> 00:12:17.790
Both of these guys have xi j,
and this guy pulls out an xi k.
00:12:17.790 --> 00:12:21.450
And it's also here, xi
j times xi k, right?
00:12:21.450 --> 00:12:26.690
So everybody here is xi j xi k.
00:12:26.690 --> 00:12:29.790
And now, I just have to take
the terms that I have here.
00:12:29.790 --> 00:12:33.490
The 1 over phi, I can
actually pull out in front.
00:12:33.490 --> 00:12:40.400
And I'm left with the
second derivative of g times
00:12:40.400 --> 00:12:46.370
the first derivative of h, both
taken at xi transpose beta.
00:12:46.370 --> 00:12:48.880
And then, I have
this yi minus mu i
00:12:48.880 --> 00:12:52.166
times the second derivative of
h, taken at xi transpose beta.
00:13:00.180 --> 00:13:00.680
OK.
00:13:00.680 --> 00:13:03.240
But here, I'm looking
at Fisher scoring.
00:13:03.240 --> 00:13:07.200
I'm not looking at
Newton's method, which
00:13:07.200 --> 00:13:09.660
means that I can actually
take the expectation
00:13:09.660 --> 00:13:11.636
of the second derivative.
00:13:11.636 --> 00:13:13.260
So when I start taking
the expectation,
00:13:13.260 --> 00:13:15.640
what's going to happen--
00:13:15.640 --> 00:13:17.670
so if I take the expectation
of this whole thing
00:13:17.670 --> 00:13:21.830
here, well, this guy, it's not--
00:13:21.830 --> 00:13:24.990
and when I say expectation,
it's always conditionally on xi.
00:13:24.990 --> 00:13:27.470
So let's write it--
00:13:27.470 --> 00:13:29.540
x1 xn.
00:13:29.540 --> 00:13:31.160
So I take conditional.
00:13:31.160 --> 00:13:32.790
This is just deterministic.
00:13:32.790 --> 00:13:34.430
But what is the
conditional expectation
00:13:34.430 --> 00:13:39.570
of yi minus mu i times this
guy, conditionally on xi?
00:13:42.160 --> 00:13:43.499
0, right?
00:13:43.499 --> 00:13:45.790
Because this is just the
conditional expectation of yi,
00:13:45.790 --> 00:13:47.620
and everything else
depends on xi only,
00:13:47.620 --> 00:13:50.810
so I can push it out of the
conditional expectation.
00:13:50.810 --> 00:13:52.200
So I'm left only with this term.
00:14:06.460 --> 00:14:06.960
OK.
00:14:13.850 --> 00:14:14.950
So now I need to--
00:14:14.950 --> 00:14:23.185
sorry, and I have
xi j xi k.
00:14:23.185 --> 00:14:26.953
OK.
00:14:26.953 --> 00:14:34.850
So now, I want to go to
something that's slightly more
00:14:34.850 --> 00:14:35.880
convenient for me.
00:14:35.880 --> 00:14:37.820
So maybe we can
skip that part here,
00:14:37.820 --> 00:14:40.790
because this is not going to
be convenient for me anyway.
00:14:40.790 --> 00:14:45.500
So I just want to go back to
something that looks eventually
00:14:45.500 --> 00:14:48.150
like this.
00:14:48.150 --> 00:14:50.010
OK, that's what
I'm going to want.
00:14:50.010 --> 00:14:53.700
So I need to have my xi show
up with some weight somehow.
00:14:53.700 --> 00:14:57.530
And the weight should involve
h prime divided by g prime.
00:14:57.530 --> 00:15:00.840
Again, the reason why I want
to see g prime coming back
00:15:00.840 --> 00:15:03.900
is because I had g prime
coming in the original w.
00:15:03.900 --> 00:15:06.690
This is actually the
same definition as the w
00:15:06.690 --> 00:15:09.870
that I used when I was
computing the gradient.
00:15:09.870 --> 00:15:13.600
Those are exactly
these w's, those guys.
00:15:13.600 --> 00:15:15.750
So I need to have g
prime that shows up.
00:15:15.750 --> 00:15:17.166
And that's where
I'm going to have
00:15:17.166 --> 00:15:21.240
to make a little bit
of computation here.
00:15:21.240 --> 00:15:26.460
And it's coming from this
kind of consideration.
00:15:26.460 --> 00:15:27.840
OK?
00:15:27.840 --> 00:15:29.960
So this thing here--
00:15:33.680 --> 00:15:39.180
well, actually, I'm missing
the phi over there, right?
00:15:39.180 --> 00:15:41.170
There should be a phi here.
00:15:41.170 --> 00:15:41.670
OK.
00:15:41.670 --> 00:15:46.482
So we have exactly this thing,
because this tells me that,
00:15:46.482 --> 00:15:47.565
if I look at the Hessian--
00:15:53.840 --> 00:15:56.740
so this was entry-wise,
and this is exactly
00:15:56.740 --> 00:15:58.930
the form of the j, k entry
of a matrix.
00:15:58.930 --> 00:16:06.120
This is exactly the jth kth
entry of xi xi transpose.
00:16:06.120 --> 00:16:06.620
Right?
00:16:06.620 --> 00:16:07.880
We've used that before.
00:16:07.880 --> 00:16:09.980
So if I want to write
this in a vector form,
00:16:09.980 --> 00:16:13.010
this is just going to be the
sum of something that depends
00:16:13.010 --> 00:16:15.710
on i times xi xi transpose.
00:16:15.710 --> 00:16:20.660
So this is 1 over phi sum
from i equal 1 to n of g
00:16:20.660 --> 00:16:28.580
prime prime xi transpose beta
h prime xi transpose beta xi xi
00:16:28.580 --> 00:16:30.631
transpose.
00:16:30.631 --> 00:16:31.130
OK?
00:16:31.130 --> 00:16:32.504
And that's for
the entire matrix.
00:16:32.504 --> 00:16:34.820
Here, that was just the j
kth entries of this matrix.
00:16:38.520 --> 00:16:41.640
And you can just check
that, if I take this matrix,
00:16:41.640 --> 00:16:45.330
the j kth entry is just the
product of the jth coordinate
00:16:45.330 --> 00:16:48.780
and the kth coordinate of xi.
00:16:48.780 --> 00:16:51.540
All right.
00:16:51.540 --> 00:16:54.000
So now I need to
do my rewriting.
00:16:54.000 --> 00:16:54.975
Can I write this?
00:16:58.529 --> 00:17:00.070
So I'm missing
something here, right?
00:17:11.790 --> 00:17:13.829
Oh, I know where
it's coming from.
00:17:18.630 --> 00:17:22.010
Mu is not g prime of x beta.
00:17:22.010 --> 00:17:24.386
Mu is g inverse
of x beta, right?
00:17:27.859 --> 00:17:34.670
So the derivative of g
inverse is not g prime prime.
00:17:34.670 --> 00:17:39.915
It's like this guy--
00:17:44.890 --> 00:17:46.660
no, 1 over this, right?
00:17:51.583 --> 00:17:52.083
Yeah.
00:18:06.880 --> 00:18:08.040
OK?
00:18:08.040 --> 00:18:12.180
The derivative of g inverse is
1 over g prime of g inverse.
00:18:15.390 --> 00:18:18.260
I need you guys, OK?
00:18:18.260 --> 00:18:18.790
All right.
00:18:18.790 --> 00:18:20.670
So now, I'm going to
have to rewrite this.
00:18:20.670 --> 00:18:21.810
This guy is still
going to go away.
00:18:21.810 --> 00:18:23.351
It doesn't matter,
but now this thing
00:18:23.351 --> 00:18:30.180
is becoming h prime over g prime
of g inverse of xi transpose
00:18:30.180 --> 00:18:41.820
beta, which is the same
here, which is the same here.
00:18:52.220 --> 00:18:53.300
OK?
00:18:53.300 --> 00:18:55.435
Everybody approves?
00:18:55.435 --> 00:18:55.935
All right.
00:18:55.935 --> 00:18:58.460
Well, now, it's
actually much nicer.
00:18:58.460 --> 00:19:01.040
What is g inverse of
xi transpose beta?
00:19:05.154 --> 00:19:07.320
Well, that was exactly the
mistake that I just made,
00:19:07.320 --> 00:19:08.310
right?
00:19:08.310 --> 00:19:10.960
It's mu i itself.
00:19:10.960 --> 00:19:18.330
So this guy is really
g prime of mu i.
00:19:18.330 --> 00:19:19.630
Sorry, just the bottom, right?
00:19:23.200 --> 00:19:32.870
So now, I have something
which looks like a sum from i
00:19:32.870 --> 00:19:38.470
equal 1 to n of h prime
of xi transpose beta,
00:19:38.470 --> 00:19:46.780
divided by g prime of mu i phi
times xi xi transpose, which
00:19:46.780 --> 00:19:54.550
I can certainly write in
matrix form as x transpose wx,
00:19:54.550 --> 00:20:00.410
where w is exactly
the same as before.
00:20:00.410 --> 00:20:05.330
So it's w1 wn.
00:20:05.330 --> 00:20:11.380
And wi is h prime
of xi transpose beta
00:20:11.380 --> 00:20:16.082
divided by g prime of mu i.
00:20:16.082 --> 00:20:20.880
There's a phi here
times phi, which
00:20:20.880 --> 00:20:23.390
is the same that we had here.
00:20:23.390 --> 00:20:26.610
And it's supposed to be
the same that we have here,
00:20:26.610 --> 00:20:30.380
except the phi is in white.
00:20:30.380 --> 00:20:31.820
That's why it's not there.
00:20:31.820 --> 00:20:32.320
OK.
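The x transpose W x form of the expected Hessian can also be checked entry by entry. Again a sketch with placeholder numbers: the weights stand in for h prime of xi transpose beta divided by g prime of mu i times phi, and the check is that summing w_i times xi xi transpose reproduces X transpose W X.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
X = rng.normal(size=(n, p))          # rows are x_i
w = rng.uniform(0.5, 1.5, size=n)    # placeholder for h'(x_i^T b)/(g'(mu_i) phi)

# Matrix form: X^T W X with W = diag(w_1, ..., w_n).
H_matrix = X.T @ np.diag(w) @ X

# Entry-wise form: the (j, k) entry is sum_i w_i * x_{ij} * x_{ik},
# i.e. the sum of w_i times the rank-one matrices x_i x_i^T.
H_entry = np.zeros((p, p))
for i in range(n):
    H_entry += w[i] * np.outer(X[i], X[i])
```

The rank-one decomposition is the same fact used on the board: the j, kth entry of xi xi transpose is the product of the jth and kth coordinates of xi.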
00:20:37.655 --> 00:20:39.610
All right?
00:20:39.610 --> 00:20:42.891
So it's actually simpler than
what's on the slides, I guess.
00:20:42.891 --> 00:20:43.390
All right.
00:20:43.390 --> 00:20:46.450
So now, if you pay
attention, I actually
00:20:46.450 --> 00:20:49.060
never force this g prime
of mu i to be here.
00:20:49.060 --> 00:20:52.660
Actually, I even tried to
make a mistake to not have it.
00:20:52.660 --> 00:20:57.670
And so this g prime of mu i
shows up completely naturally.
00:20:57.670 --> 00:21:04.200
If I had started with this,
you would have never questioned
00:21:04.200 --> 00:21:07.060
why I actually didn't
multiply by g prime
00:21:07.060 --> 00:21:09.700
and divided by g prime
completely artificially here.
00:21:09.700 --> 00:21:12.790
It just shows up
naturally in the weights.
00:21:12.790 --> 00:21:14.230
But it's just more
natural for me
00:21:14.230 --> 00:21:15.771
to compute the first
derivative first
00:21:15.771 --> 00:21:17.790
than the second
derivative second, OK?
00:21:17.790 --> 00:21:20.620
And so we just did it
the other way around.
00:21:20.620 --> 00:21:23.620
But now, let's assume we
forgot about everything.
00:21:23.620 --> 00:21:24.270
We have this.
00:21:24.270 --> 00:21:28.240
This is a natural way of
writing it, x transpose wx.
00:21:28.240 --> 00:21:30.310
If I want something that
involves some weights,
00:21:30.310 --> 00:21:34.270
I have to force them in by
dividing by g prime of mu i
00:21:34.270 --> 00:21:40.410
and therefore, multiplying
yi and mu i by this g prime.
00:21:40.410 --> 00:21:41.140
OK?
00:21:41.140 --> 00:21:46.490
So now, if we recap what we've
actually found, we got that--
00:21:49.470 --> 00:21:51.540
let me write it here.
00:21:58.740 --> 00:22:02.010
We also have that
the expectation
00:22:02.010 --> 00:22:12.100
of the Hessian of ln at
beta is x transpose w x.
00:22:12.100 --> 00:22:15.190
So if I go back to my
iterations over there,
00:22:15.190 --> 00:22:20.260
I should actually
update beta k plus 1
00:22:20.260 --> 00:22:25.240
to be equal to beta
k plus the inverse.
00:22:25.240 --> 00:22:30.250
So that's actually equal
to negative i of beta k--
00:22:30.250 --> 00:22:33.200
well, yeah.
00:22:33.200 --> 00:22:35.180
That's negative i
of beta, I guess.
00:22:38.230 --> 00:22:42.680
Oh, and beta here shows up in
w, right? w depends on beta.
00:22:42.680 --> 00:22:44.460
So that's going to be beta k.
00:22:44.460 --> 00:22:45.380
So let me call it wk.
00:22:49.151 --> 00:22:54.460
So that's the diagonal of
h prime of xi transpose beta
00:22:54.460 --> 00:23:01.800
k, this time, divided by
g prime of mu i k phi.
00:23:01.800 --> 00:23:02.300
OK?
00:23:02.300 --> 00:23:06.650
So this beta k induces
a mu by looking
00:23:06.650 --> 00:23:11.141
at g inverse of xi
transpose beta k.
00:23:11.141 --> 00:23:11.670
All right.
00:23:11.670 --> 00:23:21.804
So mu i k is g inverse
of xi transpose beta k.
00:23:21.804 --> 00:23:25.470
So that's not 2 to the k--
sorry, the k is an iteration index.
00:23:25.470 --> 00:23:28.080
And so now, if I actually
write these things together,
00:23:28.080 --> 00:23:37.820
I get minus x
transpose wx inverse.
00:23:37.820 --> 00:23:38.385
So that's wk.
00:23:41.900 --> 00:23:45.260
And then I have my
gradient here that I
00:23:45.260 --> 00:23:50.810
have to apply at k,
which is x transpose wk.
00:23:50.810 --> 00:23:58.610
And then I have y tilde k minus
mu tilde k, where, again, the
00:23:58.610 --> 00:23:59.330
indices--
00:23:59.330 --> 00:24:01.860
I mean the superscript
k are pretty natural.
00:24:01.860 --> 00:24:05.720
y tilde k just means that--
00:24:05.720 --> 00:24:07.370
so that's just yi.
00:24:07.370 --> 00:24:14.650
So that's just yi times
g prime of mu i k.
00:24:14.650 --> 00:24:21.050
And mu tilde k is, if I
look at the i coordinate,
00:24:21.050 --> 00:24:27.960
it's just going to be mu
i times g prime of mu i.
00:24:31.571 --> 00:24:32.070
OK?
00:24:32.070 --> 00:24:34.470
So I just add superscripts
k to everything.
00:24:34.470 --> 00:24:37.710
So I know that those things
get updated real time, right?
00:24:37.710 --> 00:24:41.670
Every time I make one iteration,
I get a new value for beta,
00:24:41.670 --> 00:24:43.800
I get a new value for
mu, and therefore, I
00:24:43.800 --> 00:24:44.981
get a new value for w.
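The full update just assembled can be made concrete for one familiar model. As a hedged illustration (this specific example is mine, not the board's): for logistic regression with the canonical logit link, h is the identity, phi is 1, and the weights reduce to w_i = mu_i (1 - mu_i), so the general step collapses to a simple form.

```python
import numpy as np

def irls_logistic(X, y, n_iters=15):
    """Fisher scoring / IRLS for logistic regression (canonical logit link).

    With the canonical link, h is the identity and phi = 1, so
    w_i = mu_i (1 - mu_i) and the general update
      beta <- beta + (X^T W X)^{-1} X^T W (y_tilde - mu_tilde)
    simplifies to
      beta <- beta + (X^T W X)^{-1} X^T (y - mu).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # mu_i = g^{-1}(x_i^T beta)
        W = np.diag(mu * (1.0 - mu))           # weights w_i on the diagonal
        beta = beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - mu))
    return beta

# Toy data: intercept plus one covariate, binary responses.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * X[:, 1])))
y = (rng.uniform(size=200) < p_true).astype(float)
beta_hat = irls_logistic(X, y)
```

At convergence the gradient X transpose (y minus mu) vanishes, which is exactly the first-order condition for the maximum likelihood estimator.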
00:24:44.981 --> 00:24:45.480
Yes?
00:24:45.480 --> 00:24:50.210
AUDIENCE: [INAUDIBLE] the
Fisher equation [INAUDIBLE]??
00:24:50.210 --> 00:24:52.660
PHILIPPE RIGOLLET: Yeah,
that's a good point.
00:24:52.660 --> 00:24:54.400
So that's definitely
a plus, because this
00:24:54.400 --> 00:24:56.030
is a positive,
semi-definite matrix.
00:24:56.030 --> 00:24:57.700
So this is a plus.
00:24:57.700 --> 00:25:01.330
And well, that's probably
where I erased it.
00:25:15.920 --> 00:25:16.420
OK.
00:25:16.420 --> 00:25:19.105
Let's see where I
made my mistake.
00:25:23.510 --> 00:25:28.602
So there should be a minus here.
00:25:28.602 --> 00:25:29.810
There should be a minus here.
00:25:29.810 --> 00:25:32.720
There should be a minus even
at the beginning, I believe.
00:25:32.720 --> 00:25:37.940
So that means that what
is my-- oh, yeah, yeah.
00:25:37.940 --> 00:25:41.160
So you see, when we
go back to the first,
00:25:41.160 --> 00:25:47.440
so what I erased was basically
this thing here, yi minus mu i.
00:25:47.440 --> 00:25:49.680
And when I took the
first derivative--
00:25:49.680 --> 00:25:53.170
so it was the derivative
with respect to H prime.
00:25:53.170 --> 00:25:55.830
So the derivative with
respect to the second term--
00:25:55.830 --> 00:25:57.920
I mean, the derivative
of the second term
00:25:57.920 --> 00:25:59.754
was actually killed,
because we took
00:25:59.754 --> 00:26:00.920
the expectation of this guy.
00:26:00.920 --> 00:26:03.253
But when we took the derivative
of the first term, which
00:26:03.253 --> 00:26:05.747
is the only one that
stayed, this guy went away.
00:26:05.747 --> 00:26:07.580
But there was a negative
sign from this guy,
00:26:07.580 --> 00:26:09.740
because that's the thing
we took the negative off.
00:26:09.740 --> 00:26:12.920
So it's really, when I
take my second derivative,
00:26:12.920 --> 00:26:15.896
I should carry out the
minus signs everywhere.
00:26:22.084 --> 00:26:24.000
OK?
00:26:24.000 --> 00:26:26.530
So it's just I forget
this minus throughout.
00:26:31.700 --> 00:26:34.735
You see the first term went
away, on the first line there.
00:26:34.735 --> 00:26:36.110
The first term
went away, because
00:26:36.110 --> 00:26:38.930
the conditional expectation
of yi minus mu i, given xi, is 0.
00:26:38.930 --> 00:26:41.410
And then I had this minus
sign in front of everyone,
00:26:41.410 --> 00:26:42.110
and I forgot it.
00:26:44.660 --> 00:26:45.770
All right.
00:26:45.770 --> 00:26:47.390
Any other mistake that I made?
00:26:51.230 --> 00:26:52.800
We're good?
00:26:52.800 --> 00:26:54.858
All right.
00:26:54.858 --> 00:27:08.360
So now, this is what
we have, that xk--
00:27:08.360 --> 00:27:14.220
sorry, that beta k plus
1 is equal to beta k
00:27:14.220 --> 00:27:15.590
plus this thing.
00:27:15.590 --> 00:27:16.920
OK?
00:27:16.920 --> 00:27:19.140
And if you look at this
thing, it sort of reminds
00:27:19.140 --> 00:27:20.700
us of something.
00:27:20.700 --> 00:27:22.860
Remember the least
squares estimator.
00:27:22.860 --> 00:27:24.870
So here, I'm going to
actually deviate slightly
00:27:24.870 --> 00:27:25.820
from the slides.
00:27:25.820 --> 00:27:27.480
And I will tell you how.
00:27:27.480 --> 00:27:30.690
The slides take
beta k and put it
00:27:30.690 --> 00:27:33.220
in here, which is one way to go.
00:27:33.220 --> 00:27:36.300
And just think of this as a
big least square solution.
00:27:36.300 --> 00:27:41.040
Or you can keep the beta k,
solve another least squares,
00:27:41.040 --> 00:27:43.150
and then add it to the
beta k that you have.
00:27:43.150 --> 00:27:44.280
It's the same thing.
00:27:44.280 --> 00:27:45.820
So I will take the
different routes.
00:27:45.820 --> 00:27:47.445
So you have the two
options, all right?
00:28:07.410 --> 00:28:09.340
OK.
00:28:09.340 --> 00:28:10.880
So when we did the
least squares--
00:28:10.880 --> 00:28:15.880
so parenthesis least squares--
00:28:19.210 --> 00:28:23.810
we had y equals x
beta plus epsilon.
00:28:23.810 --> 00:28:27.850
And our estimator beta
hat was x transpose
00:28:27.850 --> 00:28:33.382
x inverse x transpose y, right?
00:28:33.382 --> 00:28:36.560
And that was just solving
the first order condition,
00:28:36.560 --> 00:28:38.230
and that's what we found.
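That least squares estimator can be written down directly. A short sketch on synthetic data, solving the normal equations rather than forming the inverse explicitly; the data here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=n)   # y = X beta + epsilon

# beta_hat = (X^T X)^{-1} X^T y, i.e. the solution of the
# first-order condition X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With this little noise the estimate essentially recovers the true coefficients.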
00:28:38.230 --> 00:28:40.680
Now look at this--
00:28:40.680 --> 00:28:46.770
x transpose bleep x inverse,
x transpose bleep something.
00:28:46.770 --> 00:28:47.460
OK?
00:28:47.460 --> 00:28:58.120
So this looks like, if this
is the same as the left board,
00:28:58.120 --> 00:29:04.140
if wk is equal to the
identity matrix, meaning we
00:29:04.140 --> 00:29:11.040
don't see it, and y is equal
to y tilde k minus mu tilde k--
00:29:13.560 --> 00:29:16.950
so those similarities, the
fact that we just squeeze in--
00:29:16.950 --> 00:29:19.730
so the fact that the response
variable is different
00:29:19.730 --> 00:29:20.850
is really not a problem.
00:29:20.850 --> 00:29:22.560
We just have to
pretend that this
00:29:22.560 --> 00:29:24.877
is equal to y tilde
minus mu tilde.
00:29:24.877 --> 00:29:26.460
I mean, that's just
the least squares.
00:29:26.460 --> 00:29:29.440
When you call a software that
does least squares for you,
00:29:29.440 --> 00:29:31.710
you just tell it what y
is, you tell it what x is,
00:29:31.710 --> 00:29:32.940
and it makes the computation.
00:29:32.940 --> 00:29:35.470
So you would just lie to
it and say the actual y
00:29:35.470 --> 00:29:37.530
I want is this thing.
00:29:37.530 --> 00:29:42.420
And then we need to somehow
incorporate those weights.
00:29:42.420 --> 00:29:44.980
And so the question
is, is that easy to do?
00:29:44.980 --> 00:29:48.390
And the answer is yes,
because this is a setup where
00:29:48.390 --> 00:29:50.460
this would actually arise.
00:29:50.460 --> 00:29:52.876
So one of the things that's
very specific to what
00:29:52.876 --> 00:29:54.750
we did here with
least squares is that,
00:29:54.750 --> 00:29:58.140
when we did
the inference,
00:29:58.140 --> 00:30:01.080
we assumed that
epsilon was normal 0
00:30:01.080 --> 00:30:04.960
and the covariance matrix
was the identity, right?
00:30:04.960 --> 00:30:07.180
What if the covariance
matrix is not the identity?
00:30:07.180 --> 00:30:09.610
If the covariance matrix
is not the identity,
00:30:09.610 --> 00:30:12.140
then your maximum
likelihood is not exactly
00:30:12.140 --> 00:30:13.600
these least squares.
00:30:13.600 --> 00:30:15.580
If the covariance
matrix is any matrix
00:30:15.580 --> 00:30:18.280
you have another solution,
which involves the inverse
00:30:18.280 --> 00:30:20.620
of the covariance
matrix that you have,
00:30:20.620 --> 00:30:24.100
but if your covariance matrix,
in particular, is diagonal--
00:30:24.100 --> 00:30:26.560
which would mean that
each observation that you
00:30:26.560 --> 00:30:30.160
get in this system of
equations is still independent,
00:30:30.160 --> 00:30:32.530
but the variances can
change from one line
00:30:32.530 --> 00:30:35.030
to another, from one
observation to another--
00:30:35.030 --> 00:30:37.570
then it's called
heteroscedastic.
00:30:37.570 --> 00:30:39.730
"Hetero" means "not the same."
00:30:39.730 --> 00:30:41.680
"Scedastic" is "scale."
00:30:41.680 --> 00:30:45.280
And in the heteroscedastic
case, you would have
00:30:45.280 --> 00:30:47.000
something slightly different.
00:30:47.000 --> 00:30:49.750
And it makes sense
that, if you know
00:30:49.750 --> 00:30:52.970
that some observations have
much less variance than others,
00:30:52.970 --> 00:30:54.790
you might want to
give them more weight.
00:30:54.790 --> 00:30:55.420
OK?
00:30:55.420 --> 00:31:02.940
So if you think about
your usual drawing,
00:31:02.940 --> 00:31:07.100
and maybe you have
something like this,
00:31:07.100 --> 00:31:08.600
but the actual line is really--
00:31:08.600 --> 00:31:12.350
OK, let's say you have this guy
as well, so just a few here.
00:31:12.350 --> 00:31:16.474
If you start drawing this
thing, if you do least squares,
00:31:16.474 --> 00:31:18.140
you're going to see
something that looks
00:31:18.140 --> 00:31:20.030
like this on those points.
00:31:20.030 --> 00:31:22.640
But now, if I tell you
that, on this side,
00:31:22.640 --> 00:31:26.900
the variance is equal to 100,
meaning that those points are
00:31:26.900 --> 00:31:29.030
actually really far
from the true one,
00:31:29.030 --> 00:31:31.527
and here on this side, the
variance is equal to 1,
00:31:31.527 --> 00:31:33.860
meaning that those points are
actually close to the line
00:31:33.860 --> 00:31:36.151
you're looking for, then the
line you should be fitting
00:31:36.151 --> 00:31:38.450
is probably this
guy, meaning do not
00:31:38.450 --> 00:31:42.210
trust the guys that
have a lot of variance.
00:31:42.210 --> 00:31:44.140
And so you need somehow
to incorporate that.
00:31:44.140 --> 00:31:46.600
If you know that those things
have much more variance
00:31:46.600 --> 00:31:49.370
than these guys, you
want to weight this.
00:31:49.370 --> 00:31:52.620
And the way you do it is by
using weighted least squares.
00:31:52.620 --> 00:31:53.120
OK.
00:31:53.120 --> 00:31:54.661
So we're going to
open a parenthesis
00:31:54.661 --> 00:31:55.820
on weighted least squares.
00:31:55.820 --> 00:31:57.980
It's not a fundamental
statistical question,
00:31:57.980 --> 00:32:00.470
but it's useful for us,
because this is exactly
00:32:00.470 --> 00:32:01.850
what's going to spit out--
00:32:01.850 --> 00:32:05.160
something that looks like this
with this matrix w in there.
00:32:05.160 --> 00:32:05.660
OK.
00:32:05.660 --> 00:32:09.720
So let's go back in
time for a second.
00:32:09.720 --> 00:32:12.840
Assume we're still covering
least squares regression.
00:32:12.840 --> 00:32:19.220
So now, I'm going to assume
that y is x beta plus epsilon,
00:32:19.220 --> 00:32:23.600
but this time, epsilon is a
multivariate Gaussian in, say,
00:32:23.600 --> 00:32:25.940
p dimensions with mean 0.
00:32:25.940 --> 00:32:29.720
And covariance matrix, I
will write it as w inverse,
00:32:29.720 --> 00:32:32.790
because w is going to be the
one that's going to show up.
00:32:32.790 --> 00:32:34.650
OK?
00:32:34.650 --> 00:32:37.080
So this is the so-called
heteroscedastic.
00:32:37.080 --> 00:32:43.560
That's how it's spelled,
and yet another name
00:32:43.560 --> 00:32:47.800
that you can pick for your
soccer team or a cappella group.
00:32:47.800 --> 00:32:48.300
All right.
00:32:48.300 --> 00:32:52.289
So the maximum
likelihood, in this case--
00:32:52.289 --> 00:32:54.330
so actually, let's compute
the maximum likelihood
00:32:54.330 --> 00:32:55.470
for this problem, right?
00:32:55.470 --> 00:32:58.770
So the log likelihood is what?
00:32:58.770 --> 00:33:02.110
Well, we're going to have
the term that tells us
00:33:02.110 --> 00:33:04.120
that it's going to be-- so OK.
00:33:04.120 --> 00:33:06.390
What is the density of
a multivariate Gaussian?
00:33:10.339 --> 00:33:12.130
So it's going to be a
multivariate Gaussian
00:33:12.130 --> 00:33:17.270
in p dimension with mean x
beta and covariance matrix w
00:33:17.270 --> 00:33:19.040
inverse, right?
00:33:19.040 --> 00:33:20.660
So that's the
density that we want.
00:33:20.660 --> 00:33:30.490
Well, it's of the form 1 over
square root of determinant of w inverse, times
00:33:30.490 --> 00:33:35.734
2 pi to the p/2.
00:33:35.734 --> 00:33:37.730
OK?
00:33:37.730 --> 00:33:47.570
And times exponential, and now,
what I have is x minus x beta
00:33:47.570 --> 00:33:51.980
transpose w-- so that's
the inverse of w inverse--
00:33:51.980 --> 00:33:58.340
x minus x beta divided by 2.
00:33:58.340 --> 00:33:59.240
OK?
00:33:59.240 --> 00:34:03.080
So this is x minus mu
transpose sigma inverse x
00:34:03.080 --> 00:34:04.920
minus mu divided by 2.
00:34:04.920 --> 00:34:10.766
And if you want a sanity
check, just assume that sigma--
00:34:10.766 --> 00:34:11.266
yeah?
00:34:11.266 --> 00:34:15.074
AUDIENCE: Is it x
minus x beta or y?
00:34:15.074 --> 00:34:18.290
PHILIPPE RIGOLLET: Well, you
know, if you want this to be y,
00:34:18.290 --> 00:34:21.629
then this is y, right?
00:34:21.629 --> 00:34:22.601
Sure.
00:34:22.601 --> 00:34:24.960
Yeah, maybe it's less confusing.
00:34:24.960 --> 00:34:29.886
So if you do p equal
to 1, then what does it mean?
00:34:29.886 --> 00:34:31.469
It means that you
have this mean here.
00:34:31.469 --> 00:34:32.969
So let's forget
about what it is.
00:34:32.969 --> 00:34:35.520
But this guy is going to be
just sigma squared, right?
00:34:35.520 --> 00:34:38.699
So what you see here is the
inverse of sigma squared.
00:34:38.699 --> 00:34:41.670
So that's going to be divided by 2
sigma squared, like we usually
00:34:41.670 --> 00:34:42.420
see it.
00:34:42.420 --> 00:34:44.310
The determinant of
w inverse is just
00:34:44.310 --> 00:34:45.960
the product of
the entry of the 1
00:34:45.960 --> 00:34:53.341
by 1 matrix, which
is just sigma square.
00:34:53.341 --> 00:34:53.840
OK?
00:34:53.840 --> 00:34:58.390
So that should be actually--
00:34:58.390 --> 00:35:01.100
yeah, no, that's actually--
yeah, that's sigma square.
00:35:01.100 --> 00:35:02.480
And then I have this 2 pi.
00:35:02.480 --> 00:35:04.670
So square root of this,
because p is equal to 1,
00:35:04.670 --> 00:35:06.290
I get sigma square
root 2 pi, which is
00:35:06.290 --> 00:35:07.719
the normalization that I get.
00:35:07.719 --> 00:35:09.260
This is not going
to matter, because,
00:35:09.260 --> 00:35:12.640
when I look at
the log likelihood
00:35:12.640 --> 00:35:15.400
as a function of beta--
00:35:15.400 --> 00:35:17.720
so I'm assuming
that w is known--
00:35:17.720 --> 00:35:19.760
what I get is something
which is a constant.
00:35:19.760 --> 00:35:25.520
So it's minus p minus
n times p/2 times
00:35:25.520 --> 00:35:31.290
log of w inverse times 2 pi.
00:35:31.290 --> 00:35:31.790
OK?
00:35:31.790 --> 00:35:33.290
So this is just going
to be a constant.
00:35:33.290 --> 00:35:35.390
It won't matter when I do
the maximum likelihood.
00:35:35.390 --> 00:35:36.723
And then I'm going to have what?
00:35:36.723 --> 00:35:44.508
I'm going to have plus 1/2
of y minus x beta transpose w
00:35:44.508 --> 00:35:45.820
y minus x beta.
00:35:49.230 --> 00:35:53.520
So if I want to take the
maximum of this guy--
00:35:53.520 --> 00:35:56.620
sorry, there's a minus here.
00:35:56.620 --> 00:35:58.590
So if I want to take
the maximum of this guy,
00:35:58.590 --> 00:36:01.230
I'm going to have to take
the minimum of this thing.
00:36:01.230 --> 00:36:04.530
And the minimum of this thing,
if you take the derivative,
00:36:04.530 --> 00:36:05.820
you get to see--
00:36:05.820 --> 00:36:07.350
so that's what we have, right?
00:36:07.350 --> 00:36:09.240
We need to compute
the minimum of y
00:36:09.240 --> 00:36:13.980
minus x beta transpose
w times y minus x beta.
00:36:13.980 --> 00:36:16.570
And the solution that you get--
00:36:16.570 --> 00:36:20.320
I mean, you can actually
check this for yourself.
00:36:20.320 --> 00:36:24.640
The way you can see this
is by doing the following.
00:36:24.640 --> 00:36:27.702
If you're lazy and you don't
want to redo the entire thing--
00:36:27.702 --> 00:36:28.910
maybe I should keep that guy.
00:36:36.110 --> 00:36:39.240
W is diagonal, right?
00:36:39.240 --> 00:36:42.540
I'm going to assume that
so w inverse is diagonal,
00:36:42.540 --> 00:36:45.270
and I'm going to assume that
no variance is equal to 0
00:36:45.270 --> 00:36:47.280
and no variance is
equal to infinity,
00:36:47.280 --> 00:36:52.050
so that both w inverse and
w have only positive entries
00:36:52.050 --> 00:36:53.010
on the diagonal.
00:36:53.010 --> 00:36:53.887
All right?
00:36:53.887 --> 00:36:55.970
So in particular, I can
talk about the square root
00:36:55.970 --> 00:36:58.520
of w, which is just the
matrix, the diagonal matrix,
00:36:58.520 --> 00:37:00.460
with the square roots
on the diagonal.
00:37:00.460 --> 00:37:01.040
OK?
00:37:01.040 --> 00:37:08.960
And so I want to minimize in
beta y minus x beta transpose w
00:37:08.960 --> 00:37:11.420
y minus x beta.
00:37:11.420 --> 00:37:13.850
So I'm going to write
w as square root
00:37:13.850 --> 00:37:17.584
of w times square root of
w, which I can, because w--
00:37:17.584 --> 00:37:19.250
and it's just the
simplest thing, right?
00:37:19.250 --> 00:37:28.030
If w is w1 through wn on the diagonal, so that's my
w, then the square root of w
00:37:28.030 --> 00:37:31.580
is just square root of
w1 through square root of wn,
00:37:31.580 --> 00:37:33.960
and then 0s elsewhere.
00:37:33.960 --> 00:37:35.210
OK?
00:37:35.210 --> 00:37:37.100
So the product of
those two matrices
00:37:37.100 --> 00:37:38.740
gives me definitely
back what I want,
00:37:38.740 --> 00:37:41.387
and that's the usual
matrix product.
00:37:41.387 --> 00:37:43.970
Now, what I'm going to do is I'm
going to push one on one side
00:37:43.970 --> 00:37:45.681
and push the other
one on the other side.
00:37:45.681 --> 00:37:47.180
So that gives me
that this is really
00:37:47.180 --> 00:37:49.124
the minimum over beta of--
00:37:49.124 --> 00:37:50.540
well, here I have
this transposed,
00:37:50.540 --> 00:37:52.123
so I have to put it
on the other side.
00:37:52.123 --> 00:37:55.970
w is clearly symmetric and
so is square root of w.
00:37:55.970 --> 00:37:57.424
So the transpose doesn't matter.
00:37:57.424 --> 00:37:59.090
And so what I'm left
with is square root
00:37:59.090 --> 00:38:06.290
of wy minus square root of wx
beta transpose, and then times
00:38:06.290 --> 00:38:07.870
itself.
00:38:07.870 --> 00:38:15.010
So that's square root
wy minus square root w--
00:38:15.010 --> 00:38:17.530
oh, I don't have enough space--
00:38:17.530 --> 00:38:20.130
x beta.
00:38:20.130 --> 00:38:23.347
OK, and that stops here.
00:38:23.347 --> 00:38:25.680
But this is the same thing
that we've been doing before.
00:38:25.680 --> 00:38:26.680
This is a new y.
00:38:26.680 --> 00:38:28.074
Let's call it y prime.
00:38:28.074 --> 00:38:28.740
This is a new x.
00:38:28.740 --> 00:38:31.250
Let's call it x prime.
00:38:31.250 --> 00:38:33.480
And now, this is just the
least squares estimator
00:38:33.480 --> 00:38:39.000
associated to a response y prime
and a design matrix x prime.
00:38:39.000 --> 00:38:47.460
So I know that the solution is x
prime transpose x prime inverse
00:38:47.460 --> 00:38:53.020
x prime transpose y prime.
00:38:53.020 --> 00:38:55.380
And now, I'm just going
to substitute again
00:38:55.380 --> 00:38:58.560
what my x prime is in
terms of x and what
00:38:58.560 --> 00:39:01.830
my y prime is in terms of y.
00:39:01.830 --> 00:39:06.630
And that gives me exactly x
transpose square root w square root w
00:39:06.630 --> 00:39:11.490
x, all inverse.
00:39:11.490 --> 00:39:17.490
And then I have x transpose
square root w for this guy.
00:39:17.490 --> 00:39:21.660
And then I have square
root wy for that guy.
00:39:21.660 --> 00:39:23.400
And that's exactly
what I wanted.
00:39:23.400 --> 00:39:30.880
I'm left with x transpose
wx inverse x transpose wy.
00:39:34.664 --> 00:39:35.164
OK?
00:39:38.020 --> 00:39:41.510
So that's a simple way
to take into account
00:39:41.510 --> 00:39:44.150
the w that we had before.
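As a sketch of that computation, checking that the square-root-of-w trick gives the same answer as the closed form x transpose w x inverse x transpose w y (the data and names are illustrative, not from the lecture):

```python
import numpy as np

# Illustrative heteroscedastic setup; all names are made up for the sketch.
rng = np.random.default_rng(1)
n, p = 40, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)  # positive diagonal weights
W = np.diag(w)

# Direct closed form: (X^T W X)^{-1} X^T W y
beta_direct = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Same solution via plain least squares on the rescaled data
# y' = sqrt(W) y and X' = sqrt(W) X.
sw = np.sqrt(w)
beta_rescaled, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
```

This is exactly the "lazy" route from the board: hand a standard least squares solver the rescaled response and design instead of rederiving the first-order condition.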
00:39:44.150 --> 00:39:47.285
And you could actually do it
with any matrix that's positive
00:39:47.285 --> 00:39:48.910
semi-definite, because
you can actually
00:39:48.910 --> 00:39:52.204
talk about the square
root of those matrices.
00:39:52.204 --> 00:39:54.620
And it's just the square root
of a matrix is just a matrix
00:39:54.620 --> 00:39:58.260
such that, when you
multiply it by itself,
00:39:58.260 --> 00:40:00.846
it gives you the
original matrix.
00:40:00.846 --> 00:40:03.560
OK?
00:40:03.560 --> 00:40:06.220
So here, that was
just a shortcut
00:40:06.220 --> 00:40:08.560
that consisted in
saying, OK, maybe I
00:40:08.560 --> 00:40:12.910
don't want to recompute the
gradient of this quantity,
00:40:12.910 --> 00:40:17.510
set it equal to 0, and see
what beta hat should be.
00:40:17.510 --> 00:40:19.810
Instead, I am going to
assume that I already
00:40:19.810 --> 00:40:23.050
know that, if I
did not have the w,
00:40:23.050 --> 00:40:24.560
I would know how to solve it.
00:40:24.560 --> 00:40:25.810
And that's exactly what I did.
00:40:25.810 --> 00:40:28.120
I said, well, I
know that this is
00:40:28.120 --> 00:40:30.370
the minimum of something
that looks like this,
00:40:30.370 --> 00:40:32.020
when I have the primes.
00:40:32.020 --> 00:40:36.148
And then I just substitute
back my w in there.
00:40:36.148 --> 00:40:36.790
All right.
00:40:36.790 --> 00:40:38.390
So that's just the
lazy computation.
00:40:38.390 --> 00:40:40.440
But again, if you
don't like it, you
00:40:40.440 --> 00:40:42.380
can always take the
gradient of this guy.
00:40:42.380 --> 00:40:42.880
Yes?
00:40:42.880 --> 00:40:44.612
AUDIENCE: Why is
the solution written
00:40:44.612 --> 00:40:45.685
in the slides different?
00:40:45.685 --> 00:40:47.560
PHILIPPE RIGOLLET:
Because there's a mistake.
00:40:49.647 --> 00:40:51.230
Yeah, there's a
mistake on the slides.
00:40:58.385 --> 00:40:59.520
How did I make that one?
00:40:59.520 --> 00:41:01.220
I'm actually trying
to parse it back.
00:41:11.570 --> 00:41:13.820
I mean, it's clearly
wrong, right?
00:41:13.820 --> 00:41:14.600
Oh, no, it's not.
00:41:24.590 --> 00:41:27.570
No, it is.
00:41:27.570 --> 00:41:29.110
So it's not clearly wrong.
00:41:32.680 --> 00:41:34.960
Actually, it is clearly wrong.
00:41:34.960 --> 00:41:37.840
Because if I put
the identity here,
00:41:37.840 --> 00:41:39.360
those are still
associative, right?
00:41:39.360 --> 00:41:42.140
So this product is
actually not compatible.
00:41:42.140 --> 00:41:44.140
So it's wrong, but there's
just this extra thing
00:41:44.140 --> 00:41:46.630
that I probably copy-pasted
from some place.
00:41:46.630 --> 00:41:48.430
Since this is one
of my latest slide,
00:41:48.430 --> 00:41:51.280
I'll just color it in white.
00:41:51.280 --> 00:41:54.961
But yeah, sorry, there's a mis--
this parenthesis is not here.
00:41:54.961 --> 00:41:55.460
Thank you.
00:41:55.460 --> 00:41:56.388
AUDIENCE: [INAUDIBLE].
00:41:56.388 --> 00:41:57.388
PHILIPPE RIGOLLET: Yeah.
00:42:01.244 --> 00:42:03.172
OK?
00:42:03.172 --> 00:42:06.124
AUDIENCE: So why not
square root [INAUDIBLE]??
00:42:06.124 --> 00:42:08.040
PHILIPPE RIGOLLET: Because
I have two of them.
00:42:08.040 --> 00:42:11.180
I have one that comes from the
x prime that's here, this guy.
00:42:11.180 --> 00:42:15.760
And then I have one that
comes from this guy here.
00:42:15.760 --> 00:42:17.530
OK, so the solution--
00:42:17.530 --> 00:42:20.121
let's write it in some place
that's actually legible--
00:42:25.530 --> 00:42:27.150
which is the correction
for this thing
00:42:27.150 --> 00:42:34.930
is x transpose wx
inverse x transpose wy.
00:42:34.930 --> 00:42:35.470
OK?
00:42:35.470 --> 00:42:38.270
So you just squeeze
in this w in there.
00:42:38.270 --> 00:42:41.860
And that's exactly
what we had before,
00:42:41.860 --> 00:42:47.740
x transpose wx inverse
x transpose w some y.
00:42:47.740 --> 00:42:49.360
OK?
00:42:49.360 --> 00:42:53.050
And what I claim is that this
is routinely implemented.
00:42:53.050 --> 00:42:55.000
As you can imagine,
heteroscedastic linear
00:42:55.000 --> 00:42:57.550
regression is something
that's very common.
00:42:57.550 --> 00:43:00.100
So every time you have a
least squares formula,
00:43:00.100 --> 00:43:02.886
you also have a way to
put in some weights.
00:43:02.886 --> 00:43:04.510
You don't have to
put diagonal weights,
00:43:04.510 --> 00:43:05.718
but here, that's all we need.
00:43:08.190 --> 00:43:12.310
So here on the slides,
again, I took the beta k,
00:43:12.310 --> 00:43:15.060
and I put it in there, so that
I have only one least square
00:43:15.060 --> 00:43:17.370
solution to formulate.
00:43:17.370 --> 00:43:19.680
But let's do it
slightly differently.
00:43:19.680 --> 00:43:21.180
What I'm going to
do here now is I'm
00:43:21.180 --> 00:43:24.430
going to say, OK, let's feed
it to some least squares.
00:43:24.430 --> 00:43:32.600
So let's do weighted least
squares on a response,
00:43:32.600 --> 00:43:44.810
y being y tilde k minus mu tilde
k, and design matrix being,
00:43:44.810 --> 00:43:47.520
well, just the x itself.
00:43:47.520 --> 00:43:50.240
So that doesn't change.
00:43:50.240 --> 00:44:00.090
And the weights-- so
the weights are what?
00:44:00.090 --> 00:44:04.290
The weights are the
wk that I had here.
00:44:04.290 --> 00:44:15.630
So wki is h prime
of xi transpose beta
00:44:15.630 --> 00:44:24.380
k divided by g prime of
mu i at time k times phi.
00:44:28.630 --> 00:44:32.620
OK, and so this, if I solve
it, will spit out something
00:44:32.620 --> 00:44:33.910
that I will call a solution.
00:44:33.910 --> 00:44:41.290
I will call it u hat k plus 1.
00:44:41.290 --> 00:44:44.590
And to get beta
hat k plus 1, all I
00:44:44.590 --> 00:44:53.215
need to do is to do beta
k plus u hat k plus 1--
00:44:53.215 --> 00:44:55.808
sorry, beta-- yeah.
00:44:55.808 --> 00:44:58.730
OK?
00:44:58.730 --> 00:45:01.430
And that's because-- so
here, that's not clear.
00:45:01.430 --> 00:45:04.080
But I started from
there, remember?
00:45:04.080 --> 00:45:08.250
I started from this guy here.
00:45:08.250 --> 00:45:10.775
So I'm just solving a least
squares, a weighted least
00:45:10.775 --> 00:45:12.525
square that's going
to give me this thing.
00:45:12.525 --> 00:45:15.300
That's what I called
u hat k plus 1.
00:45:15.300 --> 00:45:18.475
And then I add it to beta k, and
that gives me beta k plus 1.
00:45:18.475 --> 00:45:21.490
So I just have this
intermediate step,
00:45:21.490 --> 00:45:25.238
which is removed in the slides.
00:45:25.238 --> 00:45:28.070
OK?
00:45:28.070 --> 00:45:29.960
So then you can repeat
until convergence.
00:45:29.960 --> 00:45:32.270
What does it mean to
repeat until convergence?
00:45:35.066 --> 00:45:37.435
AUDIENCE: [INAUDIBLE]?
00:45:37.435 --> 00:45:38.810
PHILIPPE RIGOLLET:
Yeah, exactly.
00:45:38.810 --> 00:45:41.030
So you just set some
threshold and you say,
00:45:41.030 --> 00:45:43.550
I promise you that this
will converge, right?
00:45:43.550 --> 00:45:46.570
So you know that at some point,
you're going to be there.
00:45:46.570 --> 00:45:48.320
You're going to go
there, but you're never
00:45:48.320 --> 00:45:49.403
going to be exactly there.
00:45:49.403 --> 00:45:52.430
And so you just say, OK, I
want this accuracy on my data.
00:45:52.430 --> 00:45:55.712
Actually, machine precision
is a little strong.
00:45:55.712 --> 00:45:57.920
Especially if you have 10
observations to start with,
00:45:57.920 --> 00:46:01.789
you know you're going to
have something that's going
00:46:01.789 --> 00:46:03.080
to have some statistical error.
00:46:03.080 --> 00:46:05.990
So that should actually guide
you into what kind of error
00:46:05.990 --> 00:46:06.930
you want to be making.
00:46:06.930 --> 00:46:08.780
So for example, a
good rule of thumb
00:46:08.780 --> 00:46:11.960
is that if you have
n observations,
00:46:11.960 --> 00:46:13.670
you just take some within--
00:46:13.670 --> 00:46:17.060
if you want the L2
distance between the beta--
00:46:17.060 --> 00:46:19.787
the two consecutive beta
to be less than 1/n,
00:46:19.787 --> 00:46:20.870
you should be good enough.
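That rule of thumb could be coded as follows (purely illustrative; the function name is mine):

```python
import numpy as np

def converged(beta_new, beta_old, n):
    # Rule of thumb from the lecture: stop once the L2 distance between
    # consecutive iterates is below 1/n, the order of the statistical error.
    return np.linalg.norm(beta_new - beta_old) < 1.0 / n

beta_old = np.array([1.0, 2.0])
beta_new = np.array([1.0001, 2.0001])
done = converged(beta_new, beta_old, n=100)            # diff ~1.4e-4 < 1e-2
still_going = converged(beta_new, beta_old, n=100000)  # diff ~1.4e-4 > 1e-5
```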
00:46:20.870 --> 00:46:24.560
It doesn't have to be
that machine precision.
00:46:24.560 --> 00:46:27.260
And so it's clear how
we do this, right?
00:46:27.260 --> 00:46:30.680
So here, I just have to maintain
a bunch of things, right?
00:46:30.680 --> 00:46:33.680
So remember, when I want to
recompute-- at every step,
00:46:33.680 --> 00:46:35.430
I have to recompute
a bunch of things.
00:46:35.430 --> 00:46:36.890
So I have to
recompute the weights.
00:46:36.890 --> 00:46:39.080
But if I want to recompute
the weights, not only do
00:46:39.080 --> 00:46:40.760
I need the previous
iterate, but I
00:46:40.760 --> 00:46:46.040
need to know how the previous
iterate impacts my means.
00:46:46.040 --> 00:46:50.300
So at each step, I have
to recalculate mu i k
00:46:50.300 --> 00:46:53.090
by doing g inverse, right?
00:46:53.090 --> 00:47:02.670
Remember mu i k was just g
inverse of xi transpose beta k,
00:47:02.670 --> 00:47:03.170
right?
00:47:03.170 --> 00:47:05.630
So I have to recompute that.
00:47:05.630 --> 00:47:09.340
And then I use this
to compute my weights.
00:47:09.340 --> 00:47:15.790
I also use this to
compute my y, right?
00:47:15.790 --> 00:47:20.030
So my y depends also
on g prime of mu i k.
00:47:20.030 --> 00:47:24.950
I feed that to my weighted
least squares engine.
00:47:24.950 --> 00:47:28.520
It spits out the u hat k, that
I add to my previous beta k.
00:47:28.520 --> 00:47:30.605
And that gives me my
new beta k plus 1.
00:47:33.170 --> 00:47:33.670
OK.
00:47:33.670 --> 00:47:35.980
So here's the
pseudocode, if you want
00:47:35.980 --> 00:47:40.781
to take some time to parse it.
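The slides' pseudocode isn't reproduced here, but as one concrete instance, here is a sketch of the loop for logistic regression with the canonical link (phi = 1), following the slides' variant where beta k is folded into the working response; every name and constant in it is illustrative:

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """Sketch of iteratively reweighted least squares for logistic
    regression (canonical link, phi = 1); X is n-by-p, y is 0/1."""
    beta = np.zeros(X.shape[1])          # simple starting point
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))  # mu = g^{-1}(X beta)
        w = mu * (1.0 - mu)              # diagonal weights w_i
        z = eta + (y - mu) / w           # working response, beta_k folded in
        # Weighted least squares step via the sqrt(W) rescaling trick.
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(sw[:, None] * X, sw * z, rcond=None)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# Illustrative data drawn from a logistic model.
rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.5])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat = irls_logistic(X, y)
mu_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
```

At convergence the score x transpose (y minus mu) is numerically zero, which is a useful sanity check that the weighted least squares oracle is being fed the right pieces.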
00:47:40.781 --> 00:47:41.280
All right.
00:47:41.280 --> 00:47:43.970
So here again, the
trick is not much.
00:47:43.970 --> 00:47:49.400
It's just saying, if you don't
feel like implementing Fisher
00:47:49.400 --> 00:47:52.662
scoring or inverting your
Hessian at every step,
00:47:52.662 --> 00:47:54.620
then a weighted least
squares is actually going
00:47:54.620 --> 00:47:56.360
to do it for you automatically.
00:47:56.360 --> 00:47:56.860
All right.
00:47:56.860 --> 00:47:58.610
Then that's just
a numerical trick.
00:47:58.610 --> 00:48:00.950
There's nothing really
statistical about this,
00:48:00.950 --> 00:48:04.730
except the fact that this
calls for a solution for each
00:48:04.730 --> 00:48:09.682
of the steps reminded us
of some least squares,
00:48:09.682 --> 00:48:11.390
except that there was
some extra weights.
00:48:14.180 --> 00:48:14.680
OK.
00:48:14.680 --> 00:48:18.670
So to conclude, we'll
need to know, of course,
00:48:18.670 --> 00:48:19.945
x, y, the link function.
00:48:22.629 --> 00:48:24.170
Why do we need the
variance function?
00:48:29.530 --> 00:48:33.250
I'm not sure we actually
need the variance function.
00:48:33.250 --> 00:48:36.220
No, I don't know why I say that.
00:48:36.220 --> 00:48:39.750
You need phi, not the
variance function.
00:48:39.750 --> 00:48:41.370
So where do you start
actually, right?
00:48:41.370 --> 00:48:44.400
So clearly, if you start
very close to your solution,
00:48:44.400 --> 00:48:46.810
you're actually going
to do much better.
00:48:46.810 --> 00:48:48.760
And one good way to start--
00:48:48.760 --> 00:48:51.710
so for the beta itself, it's
not clear what it's going to be.
00:48:51.710 --> 00:48:53.490
But you can actually
get a good idea
00:48:53.490 --> 00:48:57.960
of what beta is by just having
a good idea of what mu is.
00:48:57.960 --> 00:49:01.830
Because mu is g inverse
of xi transpose beta.
00:49:01.830 --> 00:49:04.020
And so what you
could do is to try
00:49:04.020 --> 00:49:07.560
to set mu to be the actual
observations that you have,
00:49:07.560 --> 00:49:09.150
because that's the
best guess that you
00:49:09.150 --> 00:49:11.540
have for their expected value.
00:49:11.540 --> 00:49:14.740
And then you just say,
OK, once I have my mu,
00:49:14.740 --> 00:49:17.630
I know that my mu is a
function of this thing.
00:49:17.630 --> 00:49:21.380
So I can write g of mu and solve
it, using your least squares
00:49:21.380 --> 00:49:22.340
estimator, right?
00:49:22.340 --> 00:49:28.970
So g of mu is of
the form x beta.
00:49:28.970 --> 00:49:33.710
So you just solve for--
once you have your mu,
00:49:33.710 --> 00:49:36.350
you pass it through g, and
then you solve for the beta
00:49:36.350 --> 00:49:37.937
that you want.
00:49:37.937 --> 00:49:40.020
And then that's the beta
that you initialize with.
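For logistic regression, that initialization heuristic might look like this; the clipping constants are an ad hoc choice of mine, since g(y) = log(y / (1 - y)) is undefined at y = 0 or 1:

```python
import numpy as np

# Illustrative data; names and constants are made up for the sketch.
rng = np.random.default_rng(3)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = (rng.uniform(size=n) < 0.5).astype(float)

# Best guess for mu is the observations themselves, nudged off {0, 1}.
mu0 = np.clip(y, 0.25, 0.75)
g_mu0 = np.log(mu0 / (1.0 - mu0))  # pass the guess through the link g
beta0, *_ = np.linalg.lstsq(X, g_mu0, rcond=None)  # solve g(mu) ~ X beta
```

The resulting beta0 is only a starting point for the iterations, not an estimator in its own right.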
00:49:42.954 --> 00:49:44.910
OK?
00:49:44.910 --> 00:49:47.850
And actually, this was your
question from last time.
00:49:47.850 --> 00:49:50.320
As soon as I use
the canonical link,
00:49:50.320 --> 00:49:53.880
Fisher scoring
and Newton-Raphson
00:49:53.880 --> 00:49:57.720
are the same thing, because
the Hessian is actually
00:49:57.720 --> 00:50:05.870
deterministic in that
case, just because when
00:50:05.870 --> 00:50:09.290
you use the canonical link,
h is the identity, which
00:50:09.290 --> 00:50:12.050
means that its second
derivative is equal to 0.
00:50:12.050 --> 00:50:15.650
So this term goes away even
without taking the expectation.
00:50:15.650 --> 00:50:17.840
So remember, the
term that went away
00:50:17.840 --> 00:50:23.420
was of the form yi
minus mu i divided
00:50:23.420 --> 00:50:29.609
by phi times h prime prime
of xi transpose beta, right?
00:50:29.609 --> 00:50:32.150
That's the term that we said,
oh, the conditional expectation
00:50:32.150 --> 00:50:34.170
of this guy is 0.
00:50:34.170 --> 00:50:36.384
But if h prime prime
is already equal to 0,
00:50:36.384 --> 00:50:37.800
then there's nothing
that changes.
00:50:37.800 --> 00:50:39.120
There's nothing that goes away.
00:50:39.120 --> 00:50:40.530
It was already equal to 0.
00:50:40.530 --> 00:50:43.710
And that always happens when
you have the canonical link,
00:50:43.710 --> 00:50:54.450
because h is g composed with b prime, all inverse.
00:50:54.450 --> 00:50:57.690
And the canonical link
is b prime inverse,
00:50:57.690 --> 00:51:00.176
so this thing is the identity.
00:51:00.176 --> 00:51:06.780
So the second derivative of
the function f of x equals x is 0.
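In symbols, the argument just made can be summarized as follows (using the lecture's notation):

```latex
% Canonical link: Newton-Raphson coincides with Fisher scoring
h = (g \circ b')^{-1}, \qquad g_{\mathrm{canonical}} = (b')^{-1}
\;\Longrightarrow\; h = \mathrm{id}, \qquad h'' = 0 .
```

So the term \(\frac{y_i - \mu_i}{\phi}\, h''(x_i^\top \beta)\) vanishes identically, not just in conditional expectation, and the Hessian equals its expectation.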
00:51:06.780 --> 00:51:08.630
OK.
00:51:08.630 --> 00:51:11.620
My screen says end of show.
00:51:11.620 --> 00:51:13.862
So we can start
with some questions.
00:51:13.862 --> 00:51:15.320
AUDIENCE: I just
wanted to clarify.
00:51:15.320 --> 00:51:19.127
So iterative-- what is
it say for iterative--
00:51:19.127 --> 00:51:20.960
PHILIPPE RIGOLLET:
Reweighted least squares.
00:51:20.960 --> 00:51:21.386
AUDIENCE: Reweighted
least squares
00:51:21.386 --> 00:51:23.840
is an implementation of the
Fisher scoring [INAUDIBLE]??
00:51:23.840 --> 00:51:25.631
PHILIPPE RIGOLLET:
That's an implementation
00:51:25.631 --> 00:51:29.000
that's just making calls to
weighted least squares oracles.
00:51:29.000 --> 00:51:30.730
It's called an oracle sometimes.
00:51:30.730 --> 00:51:33.849
An oracle is what you assume the
machine can do easily for you.
00:51:33.849 --> 00:51:35.390
So if you assume
that your machine is
00:51:35.390 --> 00:51:38.150
very good at multiplying
by the inverse of a matrix,
00:51:38.150 --> 00:51:40.580
you might as well just do
Fisher scoring yourself, right?
00:51:40.580 --> 00:51:43.130
It's just a way so that you
don't have to actually do it.
00:51:43.130 --> 00:51:46.460
And usually, those
things are implemented--
00:51:46.460 --> 00:51:49.320
and I just said routinely--
in statistical software.
00:51:49.320 --> 00:51:51.440
But they're implemented
very efficiently
00:51:51.440 --> 00:51:52.440
in statistical software.
00:51:52.440 --> 00:51:54.770
So this is going to be one
of the fastest ways you're
00:51:54.770 --> 00:51:59.165
going to have to
solve, to do this step,
00:51:59.165 --> 00:52:01.145
especially for
large-scale problems.
00:52:01.145 --> 00:52:03.186
AUDIENCE: So the thing
that computers can do well
00:52:03.186 --> 00:52:05.105
is the multiplier [INAUDIBLE].
00:52:05.105 --> 00:52:07.580
What's the thing that
the computers can do fast
00:52:07.580 --> 00:52:09.525
and what's the thing
that [INAUDIBLE]??
00:52:09.525 --> 00:52:10.900
PHILIPPE RIGOLLET:
So if you were
00:52:10.900 --> 00:52:13.210
to do this in the
simplest possible way,
00:52:13.210 --> 00:52:18.070
your iterations for,
say, Fisher scoring
00:52:18.070 --> 00:52:21.500
is just multiply by the inverse
of the Fisher information,
00:52:21.500 --> 00:52:22.000
right?
00:52:22.000 --> 00:52:24.160
AUDIENCE: So finding
that inverse is slow?
00:52:24.160 --> 00:52:26.530
PHILIPPE RIGOLLET: Yeah,
so it takes a bit of time.
00:52:26.530 --> 00:52:30.330
Whereas, since you know you're
going to multiply directly
00:52:30.330 --> 00:52:33.177
by something, if you just say--
00:52:33.177 --> 00:52:35.010
those things are not
as optimized as solving
00:52:35.010 --> 00:52:35.580
least squares.
00:52:35.580 --> 00:52:36.990
Actually, the way
it's typically done
00:52:36.990 --> 00:52:38.340
is by doing some least squares.
00:52:38.340 --> 00:52:41.190
So you might as well just do
least squares that you like.
00:52:41.190 --> 00:52:42.180
And there's also less--
00:52:45.870 --> 00:52:47.770
well, no, there's no--
00:52:47.770 --> 00:52:51.035
well, there is less
recalculation, right?
00:52:51.035 --> 00:52:52.410
Here, your Fisher,
you would have
00:52:52.410 --> 00:52:54.720
to recompute the entire
matrix of Fisher information.
00:52:54.720 --> 00:52:56.170
Whereas here, you don't have to.
00:52:56.170 --> 00:52:56.670
Right?
00:52:56.670 --> 00:52:59.850
You really just have to compute
some vectors and the vector
00:52:59.850 --> 00:53:00.600
of weights, right?
00:53:00.600 --> 00:53:03.230
So the Fisher information
matrix has, say,
00:53:03.230 --> 00:53:05.910
n choose two entries that
you need to compute, right?
00:53:05.910 --> 00:53:08.910
It's symmetric, so it's
order n squared entries.
00:53:08.910 --> 00:53:11.460
But here, the only thing you
update, if you think about it,
00:53:11.460 --> 00:53:13.987
are this weight matrix.
00:53:13.987 --> 00:53:15.570
So there is only the
diagonal elements
00:53:15.570 --> 00:53:19.330
that you need to update, and
these vectors in there also.
00:53:19.330 --> 00:53:21.660
That's order n versus n squared.
00:53:21.660 --> 00:53:23.960
So that's much less
to actually compute there.
00:53:23.960 --> 00:53:25.100
It does it for you somehow.
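[The step described above can be sketched in code. This is a minimal illustration, assuming logistic regression, where Fisher scoring coincides with iteratively reweighted least squares; only the diagonal weights and the score vector are recomputed each iteration, and the update solves a linear system rather than explicitly inverting the Fisher information.]

```python
import numpy as np

def fisher_scoring_step(beta, X, y):
    """One Fisher-scoring / IRLS step for logistic regression (a sketch).

    Per iteration we only recompute the fitted means and the diagonal
    weights mu*(1-mu) -- order n work -- then solve the weighted
    least-squares system (X^T W X) delta = X^T (y - mu) instead of
    forming the inverse of the Fisher information matrix.
    """
    mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted means
    w = mu * (1.0 - mu)                    # diagonal of the weight matrix W
    fisher = X.T @ (w[:, None] * X)        # Fisher information X^T W X
    score = X.T @ (y - mu)                 # gradient of the log likelihood
    return beta + np.linalg.solve(fisher, score)
```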
00:53:29.680 --> 00:53:30.810
Any other question?
00:53:34.440 --> 00:53:35.070
Yeah?
00:53:35.070 --> 00:53:37.950
AUDIENCE: So if I have
a data set [INAUDIBLE],
00:53:37.950 --> 00:53:40.451
then I can always try to model
it with least squares, right?
00:53:40.451 --> 00:53:41.825
PHILIPPE RIGOLLET:
Yeah, you can.
00:53:41.825 --> 00:53:44.670
AUDIENCE: And so this is like
setting my weight equal to 1--
00:53:44.670 --> 00:53:46.159
the identity,
essentially, right?
00:53:46.159 --> 00:53:47.700
PHILIPPE RIGOLLET:
Well, not exactly,
00:53:47.700 --> 00:53:50.640
because the g also shows
up in this correction
00:53:50.640 --> 00:53:51.982
that you have here, right?
00:53:51.982 --> 00:53:52.934
AUDIENCE: Yeah.
00:53:52.934 --> 00:53:55.350
PHILIPPE RIGOLLET: I mean, I
don't know what you mean by--
00:53:55.350 --> 00:53:56.725
AUDIENCE: I'm just
trying to say,
00:53:56.725 --> 00:53:59.652
are there ever situations where
I'm trying to model a data set
00:53:59.652 --> 00:54:03.910
and I would want to pick my
weights in a particular way?
00:54:03.910 --> 00:54:04.910
PHILIPPE RIGOLLET: Yeah.
00:54:04.910 --> 00:54:05.400
AUDIENCE: OK.
00:54:05.400 --> 00:54:06.216
PHILIPPE RIGOLLET: I mean--
00:54:06.216 --> 00:54:07.920
AUDIENCE: [INAUDIBLE]
example [INAUDIBLE].
00:54:07.920 --> 00:54:09.420
PHILIPPE RIGOLLET:
Well, OK, there's
00:54:09.420 --> 00:54:10.960
the heteroscedastic
case for sure.
00:54:10.960 --> 00:54:13.632
So if you're going to actually
compute those things-- and more
00:54:13.632 --> 00:54:15.340
generally, I don't
think you should think
00:54:15.340 --> 00:54:16.390
of those as being weights.
00:54:16.390 --> 00:54:18.473
You should really think
of those as being matrices
00:54:18.473 --> 00:54:19.510
that you invert.
00:54:19.510 --> 00:54:21.370
And don't think of
it as being diagonal,
00:54:21.370 --> 00:54:23.890
but really think of them
as being full matrices.
00:54:23.890 --> 00:54:25.390
So if you have--
00:54:25.390 --> 00:54:30.280
when we wrote weighted least
squares here, this was really--
00:54:30.280 --> 00:54:31.776
the w, I said, is diagonal.
00:54:31.776 --> 00:54:34.150
But all the computations really
never really use the fact
00:54:34.150 --> 00:54:35.140
that it's diagonal.
00:54:35.140 --> 00:54:38.500
So what shows up here
is just the inverse
00:54:38.500 --> 00:54:40.180
of your covariance matrix.
00:54:40.180 --> 00:54:42.580
And so if you have
data that's correlated,
00:54:42.580 --> 00:54:45.330
this is where it's
going to show up.
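[The point above, that the "weights" are really a full inverse covariance matrix, can be sketched as generalized least squares. This is an illustration, not from the lecture: correlated noise enters through the off-diagonal entries of Sigma, and we solve linear systems rather than forming Sigma inverse.]

```python
import numpy as np

def generalized_least_squares(X, y, sigma):
    """GLS estimate (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y (a sketch).

    The weight matrix here is the full inverse covariance Sigma^{-1},
    not merely a diagonal; none of the algebra uses diagonality.
    """
    # Solve Sigma Z = X and Sigma z = y instead of inverting Sigma
    Z = np.linalg.solve(sigma, X)
    z = np.linalg.solve(sigma, y)
    return np.linalg.solve(X.T @ Z, X.T @ z)
```

With Sigma equal to the identity this reduces to ordinary least squares, matching the remark that unweighted least squares is the special case of uncorrelated, equal-variance noise.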