The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So I apologize, my voice is not at 100%. If you don't understand what I'm saying, please ask me.

So we're going to be analyzing--actually, not really analyzing. Last time we described a second-order method to optimize the log-likelihood in a generalized linear model, where the parameter of interest is beta. So I'm going to rewrite the whole thing in terms of beta. That's the equation you see, but we really have this beta: at iteration k plus 1, beta is given by

beta^(k+1) = beta^(k) + I(beta^(k))^{-1} grad l_n(beta^(k)).

And that's a plus sign: if you think of the Fisher information at beta^(k) as being some number, and you had to say whether it's a positive or a negative number, it's going to be a positive number, because it's a positive semi-definite matrix. So since we're doing gradient ascent, we have a plus sign here, and the direction is basically the gradient of l_n at beta^(k).

So these are the iterations that we're trying to implement. And we could just do this: at each iteration, we compute the Fisher information, and then we do it again and again. That's called the Fisher-scoring algorithm, and I told you that this converges. What we're going to try to do in this lecture is show how we can re-implement this using iteratively reweighted least squares, so that each step of the algorithm consists simply of solving a weighted least squares problem.
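To make this concrete, here is a minimal numpy sketch of the iteration just described. The helper names `grad_loglik` and `fisher_info` are placeholders for routines computing grad l_n and I(beta); nothing here is specific to a particular GLM yet.

```python
import numpy as np

def fisher_scoring(beta0, grad_loglik, fisher_info, n_iter=25):
    """Fisher-scoring iterations: beta <- beta + I(beta)^{-1} grad l_n(beta)."""
    beta = beta0.copy()
    for _ in range(n_iter):
        # Solve I(beta) u = grad rather than inverting the matrix explicitly.
        u = np.linalg.solve(fisher_info(beta), grad_loglik(beta))
        beta = beta + u  # plus sign: gradient *ascent* on the log-likelihood
    return beta
```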
So let's go back quickly and remind ourselves that we are in the Gaussian--sorry, in the exponential family. So if I look at the log-likelihood l_n for the n observations, this is the sum from i equal 1 to n of y_i times theta_i, minus b of theta_i, divided by the dispersion parameter phi, plus c of (y_i, phi):

l_n = sum_{i=1}^n [ (y_i theta_i - b(theta_i)) / phi + c(y_i, phi) ].

The exponential just went away when I took the log of the likelihood, and I have n observations, so I'm summing over all n observations.

Then we had a bunch of formulas that we came up with. If I look at the expectation of y_i--so that's really the conditional expectation of y_i given x_i, but here it really doesn't matter, it's just going to be different for each i--this is denoted by mu_i, and we showed that it equals b'(theta_i). Then the other equation we found is what we want to model: we want g(mu_i) to be equal to x_i transpose beta. So that's our model.

And then we had that the variance was also given by the second derivative, b''(theta_i); I'm not going to go into it. What's actually interesting is, if we want to express theta_i as a function of x_i, what we get: going from x_i transpose beta to mu_i by g inverse, and then to theta_i by (b') inverse, we get that

theta_i = h(x_i transpose beta), where h = (b')^{-1} composed with g^{-1}

--so which order is this? First the inverse of g, and then composed with (b') inverse. So remember, those are all computations that we made last time, but they're going to be useful in our derivation.

And the first thing we did last time was to show that, if I look at the derivative of the log-likelihood with respect to one coordinate of beta--which is going to give me the gradient if I do that for all the coordinates--we can rewrite it as a sum of terms y_i tilde minus mu_i tilde. So let's remind ourselves what those are: y tilde_i is y_i--is it times or divided?--times g'(mu_i).
And mu tilde_i is mu_i times g'(mu_i). That was just an artificial step, so that we could divide the weights by g'. But the real thing that builds the weights is this h', and there's this normalization factor phi. So if I also write

w_i = h'(x_i transpose beta) / (g'(mu_i) phi),

then I can rewrite my gradient, which is a vector, in the following matrix form:

grad l_n(beta) = X transpose W (y tilde - mu tilde),

where W is the diagonal matrix with w_1, w_2, all the way to w_n on the diagonal and 0 off the diagonal.

So that was just taking the derivative and doing a slight manipulation that said: well, let's just divide whatever is here by g' and multiply whatever is there by g'. Today we'll see why we make this division and multiplication by g', which seems to make no sense--it actually comes from the Hessian computation.

So the Hessian computation is going to be a little more annoying. Actually, let me start directly with the coordinate-wise derivative. To build this gradient, what we used, in the end, was that the partial derivative of l_n with respect to the j-th coordinate of beta is

partial l_n / partial beta_j = sum over i of (y_i tilde - mu_i tilde) w_i x_ij.

So now, let's just take another derivative, and that's going to give us the entries of the Hessian. So what I want to compute is the second derivative with respect to beta_j and beta_k. Here, I already took the derivative with respect to beta_j, so this is just the derivative with respect to beta_k of the derivative with respect to beta_j.
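Here is what the gradient in this matrix form might look like in code--a sketch, assuming vectorized callables `g_inv`, `g_prime`, and `h_prime` standing in for g^{-1}, g', and h' (hypothetical names, to be supplied per family and link):

```python
import numpy as np

def glm_gradient(X, y, beta, g_inv, g_prime, h_prime, phi=1.0):
    """Gradient of the GLM log-likelihood, written as X^T W (y~ - mu~)."""
    eta = X @ beta                            # linear predictor x_i^T beta
    mu = g_inv(eta)                           # mu_i = g^{-1}(x_i^T beta)
    y_tilde = y * g_prime(mu)                 # y_i g'(mu_i)
    mu_tilde = mu * g_prime(mu)               # mu_i g'(mu_i)
    w = h_prime(eta) / (g_prime(mu) * phi)    # diagonal of W
    return X.T @ (w * (y_tilde - mu_tilde))   # X^T W (y~ - mu~)
```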
So what I need to do is take the derivative of this guy with respect to beta_k. Where does beta_k show up here?

AUDIENCE: In the y's?

PHILIPPE RIGOLLET: No, it's not in the y's. The y's are my data, right? But I mean, it is in the y tildes, because it's in mu: mu depends on beta. Mu_i is g inverse of x_i transpose beta. And it's also in the w_i's. Actually, everything that you see--well, OK, w depends on mu and on beta explicitly, but the rest depends only on mu.

And so we might want to be a little careful--well, we can actually use the chain rule. Did I use the chain rule already? Yeah, it's here. But OK, let's go for it.

Oh, sorry, I should not write it like that. Right, I make my life miserable by multiplying and dividing by this g'(mu). I should not do this here. So I'm actually going to remove the g'(mu), because it makes something that depends on beta appear when it really should not. So let's just look at the last-but-one equality, the one over there:

sum over i of (y_i - mu_i)/phi times h'(x_i transpose beta) x_ij.

OK, so here it makes my life much simpler, because y_i does not depend on beta, but this factor depends on beta, and this factor depends on beta. So when I take the derivative, I'm going to have to be a little more careful now, but I just have the derivative of a product, nothing more complicated.

So this is what? Well, the sum is linear, so it's going to come out. Then I'm going to have to take the derivative of the first factor: that's just going to be 1 over phi, times the derivative of mu_i with respect to beta_k--which I will just write like this for now--times h'(x_i transpose beta) x_ij.
And then I'm going to have the other term, which is (y_i - mu_i) over phi times the second derivative, h''(x_i transpose beta), times the derivative of x_i transpose beta with respect to beta_k, which is just x_ik. So I have x_ij times x_ik.

OK. So I still need to compute this guy: the partial derivative of mu_i with respect to beta_k. So mu is g of--sorry, it's g inverse of x_i transpose beta. So what do I get? Well, I'm going to get, definitely, the second derivative of g... Well, OK, that's actually not a bad idea. What makes my life easier, actually? Give me one second. Well, there's no choice that makes my life so much easier, so let's just write it. Let's go with this guy: it's going to be g''(x_i transpose beta) times x_ik.

So now, what do I have if I collect my terms? I have that this whole second derivative is: well, a sum from i equal 1 to n; then terms that I can factor out, right? Both of these guys have x_ij, and this guy pulls out an x_ik--and it's also here, x_ij times x_ik--so everybody here has x_ij x_ik. The 1 over phi I can actually pull out in front. And I'm left with the second derivative of g times the first derivative of h, both taken at x_i transpose beta; and then I have this (y_i - mu_i) times the second derivative of h, taken at x_i transpose beta.

OK. But here, I'm looking at Fisher scoring. I'm not looking at Newton's method, which means that I can actually take the expectation of the second derivative.
So when I start taking the expectation, what's going to happen? When I say expectation, it's always conditionally on the x_i's, so let's write it: conditional on x_1, ..., x_n. This first factor is just deterministic. But what is the conditional expectation of (y_i - mu_i) times this guy, conditionally on x_i? Zero, right? Because this is just the conditional expectation of y_i minus mu_i, and everything else depends on x_i only, so I can pull it out of the conditional expectation. So I'm left only with the other term--sorry, and I have the x_ij x_ik factor.

OK. So now, I want to go to something that's slightly more convenient for me. So maybe we can skip that part on the slide, because this is not going to be convenient for me anyway. I just want to go back to something that eventually looks like the weights we had. So I need to have my x_i's show up with some weight somehow, and the weight should involve h' divided by g'. Again, the reason I want to see g' coming back is that I had g' coming in the original w_i--this is actually the same definition as the w_i that I used when I was computing the gradient; those are exactly these w's. So I need to have a g' that shows up, and that's where I'm going to have to make a little bit of computation here. And it's coming from this kind of consideration.

So this thing here--well, actually, I'm missing the phi over there, right? There should be a phi here. OK. So we have exactly this thing, and this tells me the form of the Hessian: this was entry-wise, and x_ij x_ik is exactly the (j,k)-th entry of x_i x_i transpose, right? We've used that before.
So if I want to write this in matrix form, this is just going to be a sum of something that depends on i times x_i x_i transpose:

(1/phi) sum from i equal 1 to n of g''(x_i transpose beta) h'(x_i transpose beta) x_i x_i transpose.

OK? And that's for the entire matrix; before, that was just the (j,k)-th entry of this matrix. And you can just check that, if I take this matrix, the (j,k)-th entry is just the product of the j-th coordinate and the k-th coordinate of x_i.

All right. So now I need to do my rewriting. Can I write this? So I'm missing something here, right? ... Oh, I know where it's coming from: mu is not g' of x beta, mu is g inverse of x beta, right? So the derivative of mu is not g''. It's--no, 1 over this, right? Yeah. The derivative of g inverse is 1 over g' of g inverse. I need you guys, OK?

All right. So now, I'm going to have to rewrite this. This phi is still going to be there, it doesn't matter, but now this factor is becoming h' over g'(g^{-1}(x_i transpose beta))--which is the same here, which is the same here. Everybody approves?

All right. Well, now it's actually much nicer. What is g inverse of x_i transpose beta? Well, that was exactly the mistake that I just made, right? It's mu_i itself. So this denominator is really g'(mu_i). So now, I have something which looks like

sum from i equal 1 to n of h'(x_i transpose beta) / (g'(mu_i) phi) times x_i x_i transpose,

which I can certainly write in matrix form as X transpose W X, where W is exactly the same as before: the diagonal matrix of w_1, ..., w_n, with w_i = h'(x_i transpose beta) divided by g'(mu_i) times phi, which is the same as what we had here.
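In code, this expected Hessian is one line once the weights are in hand--again a sketch with the same hypothetical callables as above (and note the sign: up to the minus sign sorted out just below, this is the Fisher information X^T W X):

```python
import numpy as np

def glm_fisher_info(X, beta, g_inv, g_prime, h_prime, phi=1.0):
    """Fisher information X^T W X, with the same weights w_i as the gradient."""
    eta = X @ beta
    w = h_prime(eta) / (g_prime(g_inv(eta)) * phi)
    return X.T @ (w[:, None] * X)  # X^T W X without materializing diag(w)
```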
And it's supposed to be the same as what we have on the slide, except the phi there is in white--that's why it's not there. OK. So it's actually simpler than what's on the slides, I guess.

All right. So now, if you pay attention: I actually never forced this g'(mu_i) to be here. Actually, I even tried to make a mistake so as not to have it. And so this g'(mu_i) shows up completely naturally. If I had started with this computation, you would have never questioned why I actually multiplied by g' and divided by g' completely artificially over there; it just shows up naturally in the weights. But it's just more natural for me to compute the first derivative first and the second derivative second, OK? And so we just did it the other way around. But now, let's assume we forgot about everything. We have this, and X transpose W X is the natural way of writing it. If I want the gradient to involve the same weights, I have to force them in by dividing by g'(mu_i), and therefore multiplying y_i and mu_i by g'(mu_i). OK?

So now, if we recap what we've actually found--let me write it here--we also have that the expectation of the Hessian, E[H l_n(beta)], is X transpose W X. So if I go back to my iterations over there, I should update beta^(k+1) to be equal to beta^(k) plus the inverse--so that's actually equal to negative I(beta^(k))--well, yeah, that's negative I(beta), I guess.

Oh, and beta here shows up in W, right? W depends on beta, so that's going to be evaluated at beta^(k). So let me call it W^(k): that's the diagonal matrix of h'(x_i transpose beta^(k)) divided by g'(mu_i^(k)) times phi. So this beta^(k) induces a mu^(k), by mu_i^(k) = g inverse of x_i transpose beta^(k)--that's an iteration. And so now, if I actually write these things together, I get minus X transpose W^(k) X, inverse--
--so that's W^(k). And then I have my gradient here, which I have to evaluate at iteration k, which is X transpose W^(k) times (y tilde^(k) minus mu tilde^(k)), where, again, the superscripts k are pretty natural: y tilde_i^(k) is just y_i times g'(mu_i^(k)), and mu tilde_i^(k) is mu_i^(k) times g'(mu_i^(k)). So I just add superscripts k to everything, so I know that those things get updated in real time, right? Every time I make one iteration, I get a new value for beta, I get a new value for mu, and therefore I get a new value for W. Yes?

AUDIENCE: [INAUDIBLE] the Fisher equation [INAUDIBLE]?

PHILIPPE RIGOLLET: Yeah, that's a good point. So that's definitely a plus, because this is a positive semi-definite matrix. So this is a plus. And, well, that's probably where I erased it. Let's see where I made my mistake. So there should be a minus here. There should be a minus here. There should be a minus even at the beginning, I believe.

So you see, when we go back to the first derivative, what I erased was basically this y_i minus mu_i. The derivative of the second term was actually killed, because we took the expectation of this guy--the conditional expectation of y_i minus mu_i given x_i is 0. But when we took the derivative of the first term, which is the only one that stayed, this y_i minus mu_i went away, and there was a negative sign from it, because it's the minus mu_i we took the derivative of. So really, when I take my second derivative, I should carry the minus signs everywhere.
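With the sign put right, one Fisher-scoring step for the GLM can be sketched by combining the two helpers above (hypothetical names carried over from the earlier sketches):

```python
import numpy as np

def scoring_step(X, y, beta, g_inv, g_prime, h_prime, phi=1.0):
    """One step: beta^{k+1} = beta^k + (X^T W^k X)^{-1} X^T W^k (y~^k - mu~^k)."""
    grad = glm_gradient(X, y, beta, g_inv, g_prime, h_prime, phi)
    info = glm_fisher_info(X, beta, g_inv, g_prime, h_prime, phi)
    return beta + np.linalg.solve(info, grad)  # plus sign: ascent direction
```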
OK? So it's just that I forgot this minus throughout. The term with y_i minus mu_i went away, because the conditional expectation of y_i given x_i is mu_i, and then there was this minus sign in front of everything, and I forgot it.

All right. Any other mistake that I made? We're good? All right.

So now, this is what we have: beta^(k+1) is equal to beta^(k) plus this thing. OK? And if you look at this thing, it sort of reminds us of something. Remember the least squares estimator. So here, I'm going to actually deviate slightly from the slides, and I will tell you how. The slides take the beta^(k) and put it in here, which is one way to go--you just think of the whole update as one big least squares solution. Or you can keep the beta^(k), solve another least squares for the increment, and then add it to the beta^(k) that you have. It's the same thing. So I will take the second route, so you have the two options, all right?

OK. So when we did least squares--parenthesis: least squares--we had y = X beta + epsilon. And our estimator beta hat was (X transpose X)^{-1} X transpose y, right? And that was just solving the first-order condition, and that's what we found. Now look at this: X transpose, something, X, inverse; X transpose, something, something. OK? So this looks like the left board if W^(k) is equal to the identity matrix, meaning we don't see it, and y is equal to y tilde^(k) minus mu tilde^(k).

So the fact that the response variable is different is really not a problem. We just have to pretend that the response is equal to y tilde minus mu tilde. I mean, that's just least squares. When you call software that does least squares for you, you just tell it what y is, you tell it what X is, and it makes the computation. So you would just lie to it and say: the actual y I want is this thing.
And then we need to somehow incorporate those weights. And so the question is: is that easy to do? And the answer is yes, because there is a setup where this would actually arise. So one of the things that's very specific to what we did with least squares: when we did the inference, at least, we assumed that epsilon was normal with mean 0 and covariance matrix the identity, right? What if the covariance matrix is not the identity? If the covariance matrix is not the identity, then your maximum likelihood estimator is not exactly least squares. If the covariance matrix is any matrix, you have another solution, which involves the inverse of the covariance matrix that you have. But if your covariance matrix, in particular, is diagonal--which would mean that each observation that you get in this system of equations is still independent, but the variances can change from one observation to another--then it's called heteroscedastic. "Hetero" means "not the same"; "scedastic" means "scale." In the heteroscedastic case, you would have something slightly different. And it makes sense that, if you know that some observations have much less variance than others, you might want to give them more weight. OK?

So think about your usual drawing: maybe you have a cloud of points like this, and then you have this other group of points as well, just a few over here. If you start fitting this thing with least squares, you're going to see a line that goes through all of those points. But now, if I tell you that, on this side, the variance is equal to 100--meaning that those points are actually really far from the true line--and here on this side, the variance is equal to 1--meaning that those points are actually close to the line you're looking for--then the line you should be fitting is probably the one through the low-variance points. Meaning: do not trust the guys that have a lot of variance.
And so you need somehow to incorporate that. If you know that those points have much more variance than these guys, you want to weight them accordingly. And the way you do it is by using weighted least squares. OK. So we're going to open a parenthesis on weighted least squares. It's not a fundamental statistical question, but it's useful for us, because this is exactly what's going to spit out something that looks like what we have, with this matrix W in there.

OK. So let's go back in time for a second and assume we're still covering least squares regression. So now, I'm going to assume that y = X beta + epsilon, but this time, epsilon is a multivariate Gaussian in, say, p dimensions, with mean 0. And the covariance matrix I will write as W inverse, because W is going to be the one that shows up. OK? So this is the so-called heteroscedastic model. That's how it's spelled--and yet another name that you can pick for your soccer team or a cappella group.

All right. So let's actually compute the maximum likelihood estimator for this problem. So the log-likelihood is what? Well, OK: what is the density of a multivariate Gaussian in p dimensions with mean X beta and covariance matrix W inverse? Well, it's of the form

1 / ( det(W^{-1}) (2 pi)^p )^{1/2} times exp( -(x - X beta) transpose W (x - X beta) / 2 ),

where W is the inverse of W inverse. OK? So this is the usual (x - mu) transpose Sigma inverse (x - mu), divided by 2. And if you want a sanity check, just assume that Sigma--yeah?

AUDIENCE: Is it x minus X beta or y?

PHILIPPE RIGOLLET: Well, you know, if you want this to be y, then this is y, right? Sure. Yeah, maybe it's less confusing.
So do the sanity check with p equal to 1. What does it mean? It means that you have this mean here--let's forget about what it is--but this W inverse is going to be just sigma squared, right? So what you see in the exponent is the inverse of sigma squared: that's the over-2-sigma-squared, like we usually see it. The determinant of W inverse is just the product of the entries of the 1-by-1 matrix, which is just sigma squared. So taking the square root of this, because p is equal to 1, I get sigma square root 2 pi, which is the normalization that I expect.

This normalization is not going to matter, because, when I look at the log-likelihood as a function of beta--so I'm assuming that W is known--what I get is something which is a constant: minus one half times log of det(2 pi W^{-1}). OK? So this is just going to be a constant; it won't matter when I do the maximum likelihood. And then I'm going to have what? I'm going to have minus 1/2 of (y - X beta) transpose W (y - X beta).

So if I want to take the maximum of this guy, I'm going to have to take the minimum of this thing: we need to compute the minimum over beta of (y - X beta) transpose W (y - X beta). And the solution that you get--I mean, you can actually check this for yourself. The way you can see it, if you're lazy and you don't want to redo the entire thing, is by doing the following--maybe I should keep that guy.

W is diagonal, right?
So I'm going to assume that W inverse is diagonal, and I'm going to assume that no variance is equal to 0 and no variance is equal to infinity, so that both W inverse and W have only positive entries on the diagonal. All right? So in particular, I can talk about the square root of W, which is just the diagonal matrix with the square roots on the diagonal. OK?

And so I want to minimize over beta: (y - X beta) transpose W (y - X beta). So I'm going to write W as square root of W times square root of W, which I can, because W--and it's just the simplest thing, right? If W is diag(w_1, ..., w_n), then the square root of W is just diag(square root of w_1, ..., square root of w_n), with 0 elsewhere. OK? So the product of those two matrices definitely gives me back what I want, and that's the usual matrix product.

Now, what I'm going to do is push one onto one side and push the other one onto the other side. So that gives me that this is really the minimum over beta of--well, here I have this transposed, so I have to put it on the other side; W is clearly symmetric and so is square root of W, so the transpose doesn't matter. And so what I'm left with is

( sqrt(W) y - sqrt(W) X beta ) transpose ( sqrt(W) y - sqrt(W) X beta ).

OK, and that stops here. But this is the same thing that we've been doing before. This is a new y--let's call it y prime. This is a new X--let's call it X prime. And now, this is just the least squares estimator associated to a response y prime and a design matrix X prime. So I know that the solution is

(X prime transpose X prime)^{-1} X prime transpose y prime.

And now, I'm just going to substitute again what my X prime is in terms of X and what my y prime is in terms of y.
And that gives me exactly (X transpose sqrt(W) sqrt(W) X) inverse. And then I have X transpose sqrt(W) for this guy, and then I have sqrt(W) y for that guy. And that's exactly what I wanted: I'm left with

(X transpose W X)^{-1} X transpose W y.

OK? So that's a simple way to take into account the W that we had before. And you could actually do it with any covariance structure that's positive semi-definite, because you can actually talk about the square root of those matrices too: the square root of a matrix is just a matrix such that, when you multiply it by itself, it gives you back the original matrix. OK?

So here, that was just a shortcut that consisted in saying: OK, maybe I don't want to recompute the gradient of this quantity, set it equal to 0, and see what beta hat should be. Instead, I am going to assume that I already know that, if I did not have the W, I would know how to solve it. And that's exactly what I did. I said: well, I know that this is the minimum of something that looks like least squares when I have the primes, and then I just substitute my W back in there. All right, so that's just the lazy computation. But again, if you don't like it, you can always take the gradient of this guy.
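In code, the same sqrt(W) trick reduces weighted least squares to a single call to an ordinary least squares routine--a minimal sketch, assuming the diagonal weights are given as a vector w:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Solve min_beta (y - X beta)^T W (y - X beta) via the sqrt(W) reduction."""
    sw = np.sqrt(w)                  # diagonal of W^{1/2}
    Xp = sw[:, None] * X             # X' = W^{1/2} X
    yp = sw * y                      # y' = W^{1/2} y
    # Ordinary least squares on (X', y') returns (X^T W X)^{-1} X^T W y.
    coef, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
    return coef
```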
Yes?

AUDIENCE: Why is the solution written in the slides different?

PHILIPPE RIGOLLET: Because there's a mistake. Yeah, there's a mistake on the slides. How did I make that one? I'm actually trying to parse it back. I mean, it's clearly wrong, right? Oh, no, it's not. No, it is. So it's not clearly wrong... actually, it is clearly wrong: because even if I put the identity here, the products are still associative, right? So this product is actually not compatible. So it's wrong--there's just this extra thing that I probably copy-pasted from some place. Since this is one of my latest slides, I'll just color it in white. But yeah, sorry--this parenthesis should not be here. Thank you.

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: Yeah. OK?

AUDIENCE: So why not square root [INAUDIBLE]?

PHILIPPE RIGOLLET: Because I have two of them. I have one that comes from the X prime that's here, this guy. And then I have one that comes from this guy here. OK, so the solution--let's write it in some place that's actually legible--the correction for this thing is

(X transpose W X)^{-1} X transpose W y.

OK? So you just squeeze this W in there. And that's exactly what we had before: X transpose W X inverse, X transpose W, times some y. OK? And what I claim is that this is routinely implemented. As you can imagine, heteroscedastic linear regression is something that's very common. So every time you have a least squares routine, you also have a way to put in some weights. You don't have to allow general weights; but here, diagonal weights are all we need.

So here on the slides, again, I took the beta^(k) and put it in there, so that I have only one least squares solution to formulate. But let's do it slightly differently. What I'm going to do here is say: OK, let's feed it to some least squares solver. So let's do weighted least squares with response y being y tilde^(k) minus mu tilde^(k); design matrix being, well, just X itself--so that doesn't change; and the weights--so what are the weights? The weights are the w^(k) that I had here:

w_i^(k) = h'(x_i transpose beta^(k)) / (g'(mu_i^(k)) phi).

OK, and so this, if I solve it, will spit out something that I will call a solution: I will call it u hat^(k+1). And to get beta hat^(k+1), all I need to do is beta^(k) plus u hat^(k+1).
And that's because--so here, that's maybe not clear, but I started from there, remember? I started from this guy here. So I'm just solving a weighted least squares that's going to give me this increment--that's what I called u hat^(k+1)--and then I add it to beta^(k), and that gives me beta^(k+1). So I just have this intermediate step, which is removed in the slides. OK?

So then you can repeat until convergence. What does it mean to repeat until convergence?

AUDIENCE: [INAUDIBLE]?

PHILIPPE RIGOLLET: Yeah, exactly. So you just set some threshold, and you say: I promise you that this will converge, right? So you know that, at some point, you're going to go toward the solution, but you're never going to be exactly there. And so you just say: OK, I want this accuracy. Actually, machine precision is a little strong. Especially if you have 10 observations to start with, you know you're going to have something that has some statistical error anyway. So that should actually guide you into what kind of numerical error you want to be making. So for example, a good rule of thumb is: if you have n observations, and you want the L2 distance between two consecutive betas to be less than 1/n, you should be good enough. It doesn't have to be machine precision.

And so it's clear how we do this, right? So here, I just have to maintain a bunch of things. At every step, I have to recompute a bunch of things. So I have to recompute the weights. But if I want to recompute the weights, not only do I need the previous iterate, but I need to know how the previous iterate impacts my means. So at each step, I have to recalculate mu_i^(k)--remember, mu_i^(k) was just g inverse of x_i transpose beta^(k), right? So I have to recompute that. And then I use this to compute my weights. I also use it to compute my response, which also depends on g'(mu_i^(k)). I feed all of that to my weighted least squares engine; it spits out the u hat^(k+1) that I add to my previous beta^(k), and that gives me my new beta^(k+1).
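Here is what that loop might look like end to end--a sketch reusing the hypothetical `weighted_least_squares` helper above, with the 1/n rule of thumb as the stopping criterion:

```python
import numpy as np

def irls(X, y, beta0, g_inv, g_prime, h_prime, phi=1.0, max_iter=100):
    """Iteratively reweighted least squares for a GLM (sketch)."""
    n = X.shape[0]
    beta = beta0.copy()
    for _ in range(max_iter):
        eta = X @ beta
        mu = g_inv(eta)                           # mu_i^k = g^{-1}(x_i^T beta^k)
        w = h_prime(eta) / (g_prime(mu) * phi)    # weights w_i^k
        resp = (y - mu) * g_prime(mu)             # working response y~^k - mu~^k
        u_hat = weighted_least_squares(X, resp, w)
        beta = beta + u_hat                       # beta^{k+1} = beta^k + u_hat
        if np.linalg.norm(u_hat) < 1.0 / n:       # rule-of-thumb threshold
            break
    return beta
```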
826 00:47:15,790 --> 00:47:20,030 So my y depends also on g prime of mu i k. 827 00:47:20,030 --> 00:47:24,950 I feed that to my weighted least squares engine. 828 00:47:24,950 --> 00:47:28,520 It spits out the u hat k plus 1, which I add to my previous beta k. 829 00:47:28,520 --> 00:47:30,605 And that gives me my new beta k plus 1. 830 00:47:33,170 --> 00:47:33,670 OK. 831 00:47:33,670 --> 00:47:35,980 So here's the pseudocode, if you want 832 00:47:35,980 --> 00:47:40,781 to take some time to parse it. 833 00:47:40,781 --> 00:47:41,280 All right. 834 00:47:41,280 --> 00:47:43,970 So here again, the trick is not much. 835 00:47:43,970 --> 00:47:49,400 It's just saying, if you don't feel like implementing Fisher 836 00:47:49,400 --> 00:47:52,662 scoring or inverting your Hessian at every step, 837 00:47:52,662 --> 00:47:54,620 then a weighted least squares is actually going 838 00:47:54,620 --> 00:47:56,360 to do it for you automatically. 839 00:47:56,360 --> 00:47:56,860 All right. 840 00:47:56,860 --> 00:47:58,610 Then that's just a numerical trick. 841 00:47:58,610 --> 00:48:00,950 There's nothing really statistical about this, 842 00:48:00,950 --> 00:48:04,730 except for the fact that the solution called for 843 00:48:04,730 --> 00:48:09,682 at each step reminded us of least squares, 844 00:48:09,682 --> 00:48:11,390 just with some extra weights. 845 00:48:14,180 --> 00:48:14,680 OK. 846 00:48:14,680 --> 00:48:18,670 So to conclude, we'll need to know, of course, 847 00:48:18,670 --> 00:48:19,945 x, y, and the link function. 848 00:48:22,629 --> 00:48:24,170 Why do we need the variance function? 849 00:48:29,530 --> 00:48:33,250 I'm not sure we actually need the variance function. 850 00:48:33,250 --> 00:48:36,220 No, I don't know why I say that. 851 00:48:36,220 --> 00:48:39,750 You need phi, not the variance function. 852 00:48:39,750 --> 00:48:41,370 So where do you start actually, right? 853 00:48:41,370 --> 00:48:44,400 So clearly, if you start very close to your solution, 854 00:48:44,400 --> 00:48:46,810 you're actually going to do much better. 855 00:48:46,810 --> 00:48:48,760 And one good way to start-- 856 00:48:48,760 --> 00:48:51,710 so for the beta itself, it's not clear what it's going to be. 857 00:48:51,710 --> 00:48:53,490 But you can actually get a good idea 858 00:48:53,490 --> 00:48:57,960 of what beta is by just having a good idea of what mu is. 859 00:48:57,960 --> 00:49:01,830 Because mu is g inverse of xi transpose beta. 860 00:49:01,830 --> 00:49:04,020 And so what you could do is to try 861 00:49:04,020 --> 00:49:07,560 to set mu to be the actual observations that you have, 862 00:49:07,560 --> 00:49:09,150 because that's the best guess that you 863 00:49:09,150 --> 00:49:11,540 have for their expected value. 864 00:49:11,540 --> 00:49:14,740 And then you just say, OK, once I have my mu, 865 00:49:14,740 --> 00:49:17,630 I know that my mu is a function of this thing. 866 00:49:17,630 --> 00:49:21,380 So I can write g of mu and solve it, using your least squares 867 00:49:21,380 --> 00:49:22,340 estimator, right? 868 00:49:22,340 --> 00:49:28,970 So g of mu is of the form x beta. 869 00:49:28,970 --> 00:49:33,710 So you just solve for-- once you have your mu, 870 00:49:33,710 --> 00:49:36,350 you pass it through g, and then you solve for the beta 871 00:49:36,350 --> 00:49:37,937 that you want. 872 00:49:37,937 --> 00:49:40,020 And then that's the beta that you initialize with. 873 00:49:42,954 --> 00:49:44,910 OK?
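That warm start fits in a few lines. A sketch, assuming the link g is given as a function; the clipping constant eps is an arbitrary illustrative choice to keep g from blowing up on boundary observations (zero counts under a log link, say):

    import numpy as np

    def initial_beta(X, y, g, eps=1e-3):
        mu0 = np.clip(y, eps, None)   # best guess for the means: the observations themselves
        eta0 = g(mu0)                 # pass mu through g, so that g(mu) ~ X beta
        beta0, *_ = np.linalg.lstsq(X, eta0, rcond=None)
        return beta0                  # solve for beta by ordinary least squares
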
874 00:49:44,910 --> 00:49:47,850 And actually, this was your question from last time. 875 00:49:47,850 --> 00:49:50,320 As soon as I use the canonical link, 876 00:49:50,320 --> 00:49:53,880 Fisher scoring and Newton-Raphson 877 00:49:53,880 --> 00:49:57,720 are the same thing, because the Hessian is actually 878 00:49:57,720 --> 00:50:05,870 deterministic in that case, just because when 879 00:50:05,870 --> 00:50:09,290 you use the canonical link, h is the identity, which 880 00:50:09,290 --> 00:50:12,050 means that its second derivative is equal to 0. 881 00:50:12,050 --> 00:50:15,650 So this term goes away even without taking the expectation. 882 00:50:15,650 --> 00:50:17,840 So remember, the term that went away 883 00:50:17,840 --> 00:50:23,420 was of the form yi minus mu i divided 884 00:50:23,420 --> 00:50:29,609 by phi times h prime prime of xi transpose beta, right? 885 00:50:29,609 --> 00:50:32,150 That's the term that we said, oh, the conditional expectation 886 00:50:32,150 --> 00:50:34,170 of this guy is 0. 887 00:50:34,170 --> 00:50:36,384 But if h prime prime is already equal to 0, 888 00:50:36,384 --> 00:50:37,800 then there's nothing that changes. 889 00:50:37,800 --> 00:50:39,120 There's nothing that goes away. 890 00:50:39,120 --> 00:50:40,530 It was already equal to 0. 891 00:50:40,530 --> 00:50:43,710 And that always happens when you have the canonical link, 892 00:50:43,710 --> 00:50:54,450 because h is the inverse of g composed with b prime. 893 00:50:54,450 --> 00:50:57,690 And the canonical link is b prime inverse, 894 00:50:57,690 --> 00:51:00,176 so this composition is the identity. 895 00:51:00,176 --> 00:51:06,780 So the second derivative of the identity function f of x equals x is 0. 896 00:51:06,780 --> 00:51:08,630 OK. 897 00:51:08,630 --> 00:51:11,620 My screen says end of show. 898 00:51:11,620 --> 00:51:13,862 So we can start with some questions. 899 00:51:13,862 --> 00:51:15,320 AUDIENCE: I just wanted to clarify. 900 00:51:15,320 --> 00:51:19,127 So iterative-- what does it say, for iterative-- 901 00:51:19,127 --> 00:51:20,960 PHILIPPE RIGOLLET: Reweighted least squares. 902 00:51:20,960 --> 00:51:21,386 AUDIENCE: Reweighted least squares 903 00:51:21,386 --> 00:51:23,840 is an implementation of the Fisher scoring [INAUDIBLE]? 904 00:51:23,840 --> 00:51:25,631 PHILIPPE RIGOLLET: That's an implementation 905 00:51:25,631 --> 00:51:29,000 that's just making calls to weighted least squares oracles. 906 00:51:29,000 --> 00:51:30,730 It's called an oracle sometimes. 907 00:51:30,730 --> 00:51:33,849 An oracle is what you assume the machine can do easily for you. 908 00:51:33,849 --> 00:51:35,390 So if you assume that your machine is 909 00:51:35,390 --> 00:51:38,150 very good at multiplying by the inverse of a matrix, 910 00:51:38,150 --> 00:51:40,580 you might as well just do Fisher scoring yourself, right? 911 00:51:40,580 --> 00:51:43,130 It's just a way so that you don't have to actually do it. 912 00:51:43,130 --> 00:51:46,460 And usually, those things are implemented-- 913 00:51:46,460 --> 00:51:49,320 and I just said routinely-- in statistical software. 914 00:51:49,320 --> 00:51:51,440 But they're implemented very efficiently 915 00:51:51,440 --> 00:51:52,440 in statistical software. 916 00:51:52,440 --> 00:51:54,770 So this is going to be one of the fastest ways you're 917 00:51:54,770 --> 00:51:59,165 going to have to solve, to do this step, 918 00:51:59,165 --> 00:52:01,145 especially for large-scale problems.
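A quick numerical check of this claim, sketched for logistic regression, whose logit link is canonical: the Hessian of the log-likelihood contains no y terms, so it equals minus the Fisher information and the Newton step coincides with the Fisher-scoring step. The data here is random, purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = rng.integers(0, 2, size=50).astype(float)
    beta = np.zeros(3)

    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # mu_i = b'(x_i^T beta), logistic case
    grad = X.T @ (y - mu)                    # gradient of the log-likelihood
    w = mu * (1.0 - mu)                      # b''(theta_i): note that no y_i appears
    fisher = (X.T * w) @ X                   # Fisher information
    hessian = -fisher                        # exact Hessian: deterministic, same matrix

    newton_step = np.linalg.solve(-hessian, grad)
    scoring_step = np.linalg.solve(fisher, grad)
    assert np.allclose(newton_step, scoring_step)   # identical steps, as claimed
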
919 00:52:01,145 --> 00:52:03,186 AUDIENCE: So the thing that computers can do well 920 00:52:03,186 --> 00:52:05,105 is the multiplier [INAUDIBLE]. 921 00:52:05,105 --> 00:52:07,580 What's the thing that the computers can do fast 922 00:52:07,580 --> 00:52:09,525 and what's the thing that [INAUDIBLE]? 923 00:52:09,525 --> 00:52:10,900 PHILIPPE RIGOLLET: So if you were 924 00:52:10,900 --> 00:52:13,210 to do this in the simplest possible way, 925 00:52:13,210 --> 00:52:18,070 your iteration for, say, Fisher scoring 926 00:52:18,070 --> 00:52:21,500 is just to multiply by the inverse of the Fisher information, 927 00:52:21,500 --> 00:52:22,000 right? 928 00:52:22,000 --> 00:52:24,160 AUDIENCE: So finding that inverse is slow? 929 00:52:24,160 --> 00:52:26,530 PHILIPPE RIGOLLET: Yeah, so it takes a bit of time. 930 00:52:26,530 --> 00:52:30,330 Whereas, since you know you're going to multiply directly 931 00:52:30,330 --> 00:52:33,177 by something, if you just say-- 932 00:52:33,177 --> 00:52:35,010 those things are not as optimized as solving 933 00:52:35,010 --> 00:52:35,580 least squares. 934 00:52:35,580 --> 00:52:36,990 Actually, the way it's typically done 935 00:52:36,990 --> 00:52:38,340 is by doing some least squares. 936 00:52:38,340 --> 00:52:41,190 So you might as well just do the least squares that you like. 937 00:52:41,190 --> 00:52:42,180 And there's also less-- 938 00:52:45,870 --> 00:52:47,770 well, no, there's no-- 939 00:52:47,770 --> 00:52:51,035 well, there is less recalculation, right? 940 00:52:51,035 --> 00:52:52,410 Here, your Fisher, you would have 941 00:52:52,410 --> 00:52:54,720 to recompute the entire matrix of Fisher information. 942 00:52:54,720 --> 00:52:56,170 Whereas here, you don't have to. 943 00:52:56,170 --> 00:52:56,670 Right? 944 00:52:56,670 --> 00:52:59,850 You really just have to compute some vectors and the vector 945 00:52:59,850 --> 00:53:00,600 of weights, right? 946 00:53:00,600 --> 00:53:03,230 So the Fisher information matrix has, say, 947 00:53:03,230 --> 00:53:05,910 n choose two entries that you need to compute, right? 948 00:53:05,910 --> 00:53:08,910 It's symmetric, so it's order n squared entries. 949 00:53:08,910 --> 00:53:11,460 But here, the only things you update, if you think about it, 950 00:53:11,460 --> 00:53:13,987 are this weight matrix. 951 00:53:13,987 --> 00:53:15,570 So there is only the diagonal elements 952 00:53:15,570 --> 00:53:19,330 that you need to update, and these vectors in there also. 953 00:53:19,330 --> 00:53:21,660 That's two n's, versus n squared. 954 00:53:21,660 --> 00:53:23,960 So that's much less to actually put in there. 955 00:53:23,960 --> 00:53:25,100 It does it for you somehow. 956 00:53:29,680 --> 00:53:30,810 Any other question? 957 00:53:34,440 --> 00:53:35,070 Yeah? 958 00:53:35,070 --> 00:53:37,950 AUDIENCE: So if I have a data set [INAUDIBLE], 959 00:53:37,950 --> 00:53:40,451 then I can always try to model it with least squares, right? 960 00:53:40,451 --> 00:53:41,825 PHILIPPE RIGOLLET: Yeah, you can. 961 00:53:41,825 --> 00:53:44,670 AUDIENCE: And so this is like setting my weight equal to 1-- 962 00:53:44,670 --> 00:53:46,159 the identity, essentially, right? 963 00:53:46,159 --> 00:53:47,700 PHILIPPE RIGOLLET: Well, not exactly, 964 00:53:47,700 --> 00:53:50,640 because the g also shows up in this correction 965 00:53:50,640 --> 00:53:51,982 that you have here, right? 966 00:53:51,982 --> 00:53:52,934 AUDIENCE: Yeah.
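Concretely, doing it by least squares means never forming the inverse: rescale the rows by the square roots of the weights and hand the problem to an ordinary least squares routine, which factorizes rather than inverts. A sketch, with illustrative names:

    import numpy as np

    def wls_via_lstsq(X, w, z):
        # argmin_u sum_i w_i (z_i - x_i^T u)^2 is OLS on sqrt(w)-rescaled rows,
        # so no (X^T W X)^{-1} is ever formed explicitly.
        s = np.sqrt(w)
        u, *_ = np.linalg.lstsq(X * s[:, None], z * s, rcond=None)
        return u
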
967 00:53:52,934 --> 00:53:55,350 PHILIPPE RIGOLLET: I mean, I don't know what you mean by-- 968 00:53:55,350 --> 00:53:56,725 AUDIENCE: I'm just trying to say, 969 00:53:56,725 --> 00:53:59,652 are there ever situations where I'm trying to model a data set 970 00:53:59,652 --> 00:54:03,910 and I would want to pick my weights in a particular way? 971 00:54:03,910 --> 00:54:04,910 PHILIPPE RIGOLLET: Yeah. 972 00:54:04,910 --> 00:54:05,400 AUDIENCE: OK. 973 00:54:05,400 --> 00:54:06,216 PHILIPPE RIGOLLET: I mean-- 974 00:54:06,216 --> 00:54:07,920 AUDIENCE: [INAUDIBLE] example [INAUDIBLE]. 975 00:54:07,920 --> 00:54:09,420 PHILIPPE RIGOLLET: Well, OK, there's 976 00:54:09,420 --> 00:54:10,960 the heteroscedastic case for sure. 977 00:54:10,960 --> 00:54:13,632 So if you're going to actually compute those things-- and more 978 00:54:13,632 --> 00:54:15,340 generally, I don't think you should think 979 00:54:15,340 --> 00:54:16,390 of those as being weights. 980 00:54:16,390 --> 00:54:18,473 You should really think of those as being matrices 981 00:54:18,473 --> 00:54:19,510 that you invert. 982 00:54:19,510 --> 00:54:21,370 And don't think of it as being diagonal, 983 00:54:21,370 --> 00:54:23,890 but really think of them as being full matrices. 984 00:54:23,890 --> 00:54:25,390 So if you have-- 985 00:54:25,390 --> 00:54:30,280 when we wrote weighted least squares here, this was really-- 986 00:54:30,280 --> 00:54:31,776 the w, I said, is diagonal. 987 00:54:31,776 --> 00:54:34,150 But all the computations never really use the fact 988 00:54:34,150 --> 00:54:35,140 that it's diagonal. 989 00:54:35,140 --> 00:54:38,500 So what shows up here is just the inverse 990 00:54:38,500 --> 00:54:40,180 of your covariance matrix. 991 00:54:40,180 --> 00:54:42,580 And so if you have data that's correlated, 992 00:54:42,580 --> 00:54:45,330 this is where it's going to show up.
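For that correlated case, the same formula with a full covariance matrix Sigma in place of the diagonal weights is generalized least squares, beta = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y. A sketch that whitens with a Cholesky factor instead of inverting Sigma directly (names are illustrative):

    import numpy as np

    def gls(X, y, Sigma):
        L = np.linalg.cholesky(Sigma)   # Sigma = L L^T
        Xw = np.linalg.solve(L, X)      # whiten the design: L^{-1} X
        yw = np.linalg.solve(L, y)      # whiten the response: L^{-1} y
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        return beta                     # equals (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y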