The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So welcome back. We're going to finish this chapter on maximum likelihood estimation. Last time, I briefly mentioned something called Fisher information. Fisher information, in general, is actually a matrix when you have a multivariate parameter theta. So if theta, for example, is of dimension d, then the Fisher information matrix is going to be a d by d matrix. You can see that because it's an outer product: it's of the form gradient times gradient transpose. The gradient is d-dimensional, and so gradient times gradient transpose is a d by d matrix. And this matrix -- well, it's called the Fisher information matrix because it's basically telling you how much information about theta is in your model. So for example, if your model is well parameterized, then you will have a lot of information -- let's think of it as being a scalar, just one number for now -- so you're going to have a larger information about your parameter in the same probability distribution. But if you start having a weird way of parameterizing your model, then the Fisher information is actually going to drop. As a concrete example, think of a parameter of interest in a Gaussian model where the mean is known to be zero, but what you're interested in is the variance, sigma squared. If I'm interested in sigma squared, I could parameterize my model by sigma, sigma squared, sigma to the fourth, sigma to the 24th -- I could parameterize it by whatever I want, and I would just have a simple transformation.
Then you could say that some of them are actually more or less informative, and you're going to have different values for your Fisher information.

So let's just review a few well-known computations. I will focus primarily on the one-dimensional case, as usual. And I claim that there are two definitions. So if theta is a real-valued parameter, then there are basically two definitions that you can think of for your Fisher information. One involves the first derivative of your log-likelihood, and the second one involves the second derivative. So the log-likelihood here, we're actually going to define as l of theta. And what is it? Well, it's simply the log-likelihood for one observation. So it's l -- and I'm going to write a subscript 1 just to make sure we all know we're talking about one observation -- of, in this order, X and theta. That's the log-likelihood, remember? So for example, if I have a density, what is it going to be? It's going to be log of f sub theta of X. So this guy is a random variable, because it's a function of a random variable. And that's why you see expectations of this thing. It's a random function of theta: if I view this as a function of theta, the function becomes random, because it depends on this random X.

And so I of theta is actually defined as the variance of l prime of theta -- the variance of the derivative of this function. And I also claim that it's equal to negative the expectation of l double prime of theta, the second derivative. And here the expectation and the variance are computed -- because this function, remember, is random -- so I need to tell you what the distribution of the X is with respect to which I'm computing the expectation and the variance. And it's P theta itself. So typically -- there's a Fisher information for all values of the parameter, but the one we're typically interested in is the true parameter, theta star.
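As a quick sanity check of these two definitions -- not something done in the lecture, and with a Bernoulli(p) model as a purely illustrative choice -- a few lines of symbolic computation confirm that the variance of l prime and negative the expectation of l double prime agree, both giving 1/(p(1-p)):

```python
# Symbolic sanity check (assumed Bernoulli(p) model, purely illustrative):
# verify that Var(l'(p)) and -E[l''(p)] agree, both giving 1/(p(1-p)).
import sympy as sp

p, x = sp.symbols("p x", positive=True)

# log-likelihood of a single observation x in {0, 1}
log_lik = x * sp.log(p) + (1 - x) * sp.log(1 - p)
score = sp.diff(log_lik, p)       # l'(p)
second = sp.diff(log_lik, p, 2)   # l''(p)

def E(expr):
    # expectation under Bernoulli(p): P(X=1) = p, P(X=0) = 1 - p
    return sp.simplify(p * expr.subs(x, 1) + (1 - p) * expr.subs(x, 0))

var_score = sp.simplify(E(score**2) - E(score)**2)   # Var(l'(p))
neg_exp_second = sp.simplify(-E(second))             # -E[l''(p)]

print(var_score)                                # equivalent to 1/(p*(1 - p))
print(neg_exp_second)                           # equivalent to 1/(p*(1 - p))
print(sp.simplify(var_score - neg_exp_second))  # 0
```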
But view this as a function of theta right now. So now I need to prove to you -- and this is not a trivial statement -- that the variance of the derivative is equal to negative the expectation of the second derivative. There's really quite a bit that goes into this, right? And it comes from the fact that this is the log not of anything; it's the log of a density. So let's just prove that, without bothering too much with some technical assumptions. The technical assumptions are the assumptions that allow me to permute derivative and integral. Because when I compute the variances and expectations, I'm actually integrating against the density, and what I want to do is make sure I can always do that. So my technical assumption is that I can always permute integrals and derivatives.

So let's just prove this. What I'm going to do is assume that X has density f theta. And I'm actually just going to write -- well, let me write it f theta right now. Let me try not to be lazy about writing. And the thing I'm going to use is the fact that the integral of this density is equal to what? 1. And this is where I'm going to start doing weird things. That means that if I take the derivative of this guy, it's equal to 0. So if I look at the derivative with respect to theta of the integral of f theta of X dX, this is equal to 0. And this is where I'm actually making the switch: I'm going to say that this is actually equal to the integral of the derivative. So that's the first thing I'm going to use. And of course, if it's true for the first derivative, it's going to be true for the second derivative, so I'm going to actually do it a second time. And the second thing I'm going to use is the fact that the integral of the second derivative is equal to 0.

So let's start from here. And let me start from, say, the expectation of the second derivative, l double prime of theta. So what is l double prime of theta?
Well, it's the second derivative of log of f theta of X. And we know that the derivative of the log -- sorry -- yeah, so the derivative of the log is 1 over -- well, it's the derivative of f divided by f itself. Everybody's with me? Just log of f, prime, is f prime over f. Here, it's just that for f, I view this as a function of theta and not as a function of X. So now I need to take another derivative of this thing. So that's going to be equal to -- well, we all know the formula for the derivative of a ratio. So I pick up the second derivative times f theta, minus the first derivative squared, divided by f theta squared -- basic calculus.

And now I need to check that negative the expectation of this guy is giving me back what I want. Well, what is negative the expectation of l double prime of theta? What we need to do is take negative the integral of this guy against f theta. So it's minus the integral of -- that's just the definition of the expectation; I take an integral against f theta. But here, I have something nice. What's happening is that those guys are canceling. And now that those guys are canceling, those guys are canceling too. So what I have is that the first term -- I'm going to break this difference here. I'm going to say that the integral of this difference is the difference of the integrals. So the first term is going to be the integral of the second derivative with respect to theta of f theta. And in the second one, the negative signs are going to cancel, and I'm going to be left with this.

Everybody's following? Anybody found the mistake? How about the other mistake? I don't know if there's a mistake. I'm just trying to get you to check what I'm doing. With me so far? So this guy here is the integral of the second derivative of f theta of X dX. What is this?

AUDIENCE: It's 0.

PHILIPPE RIGOLLET: It's 0. And that's because of this guy, which I will call frowny face.
So frowny face tells me this. And let's call this guy monkey that hides his eyes. No, let's just do something simpler. Let's call it star. And this guy, we will use later on.

So now I have to prove that this guy, which I have proved is equal to this, is now equal to the variance of l prime of theta. So now let's go the other way. We're going to meet halfway. I want to prove that this guy is equal to this guy, and I'm going to have a series of equalities so that we meet halfway. So let's start from the other end. We started from negative l double prime of theta; let's start with the variance part. The variance of l prime of theta -- that's the expectation of l prime of theta squared, minus the square of the expectation of l prime of theta.

Now, what is the square of the expectation of l prime of theta? Well, l prime of theta is equal to the partial with respect to theta of log of f theta of X, which we know from the first line over there -- that's what's in the bracket on the second line there -- is actually equal to the partial over theta of f theta of X divided by f theta of X. That's the derivative of the log. So when I look at the expectation of this guy, I'm going to have the integral of this against f theta. And the f thetas are going to cancel again, just like they did here. So this thing is actually equal to the integral of the partial over theta of f theta of X dX. And what is this equal to? 0, by the monkey hiding its eyes -- so that's star -- which tells me that this is equal to 0. So basically, when I compute the variance, this term is not going to matter. I only have to compute the first one.

So what is the first one? Well, the first one is the expectation of l prime squared. And so that guy is the integral of -- well, what is l prime? Again, it's the partial over partial theta of f theta of X divided by f theta of X.
Now, this time, this guy is squared against the density. So one of the f thetas cancels. But now I'm back to what I had before for this guy. So this guy is now equal to this guy. It's just the same formula, so they're the same thing. And so I've moved both ways. Starting from the expression that involves the expectation of the second derivative, I've come to this guy. And starting from the expression that tells me about the variance of the first derivative, I've come to the same guy. So that completes my proof. Are there any questions about the proof?

We have also, on the way, found an explicit formula for the Fisher information. So now that I have this thing, I can actually add that if X has a density, for example, this is also equal to the integral of the partial over theta of f theta of X, squared, divided by f theta of X, because I just proved that those two things were actually equal to the same thing, which was this guy. Now in practice, this is really going to be the useful one. The other two are going to be useful depending on what case you're in. So if I ask you to compute the Fisher information, you now have three ways to pick from. And basically, practice will tell you which one to choose if you want to save five minutes when you're doing your computations. Maybe you're the person who likes to take derivatives, and then you're going to go with the second-derivative one. Maybe you're the person who likes to expand squares, so you're going to take the one that involves the square of l prime. And maybe you're just a normal person, and you want to use that guy.

Why do I care? This is the Fisher information. And I could have defined the [? Hilbert ?] information by taking the square root of this guy plus the sine of this thing and just be super happy and have my name in textbooks. But this thing has a very particular meaning.
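As a quick aside, not from the lecture: the three formulas just listed can be checked numerically. Here is a rough Monte Carlo sketch in the N(0, sigma squared) model parameterized by theta = sigma squared -- the example from the start of the lecture -- where I(theta) = 1/(2 theta squared); the specific numbers are arbitrary choices:

```python
# Monte Carlo check of the three expressions for the Fisher information in the
# N(0, sigma^2) model parameterized by theta = sigma^2 (illustrative choice),
# where I(theta) = 1/(2*theta^2).  Rough sketch, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5                                    # true value of sigma^2
x = rng.normal(0.0, np.sqrt(theta), size=1_000_000)

# l(theta) = -0.5*log(2*pi*theta) - x^2/(2*theta)
score = -1.0 / (2 * theta) + x**2 / (2 * theta**2)   # l'(theta)
second = 1.0 / (2 * theta**2) - x**2 / theta**3      # l''(theta)

print(np.var(score))        # Var(l')        ~ 0.222
print(-np.mean(second))     # -E[l'']        ~ 0.222
print(np.mean(score**2))    # E[(l')^2], i.e. the integral of (df/dtheta)^2 / f
print(1 / (2 * theta**2))   # exact I(theta) = 0.2222...
```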
When we're doing maximum likelihood estimation -- so remember, maximum likelihood estimation is just an empirical version of trying to minimize the KL divergence. So what we're trying to do with maximum likelihood is really trying to minimize the KL divergence. And we're trying to minimize this function, remember? So now what we're going to do is plot this function. We said, let's place ourselves in cases where this KL is convex, so that its negative is concave. So it's going to look like this -- U-shaped, that's convex. So that's the true thing I'm trying to minimize. And what I said is that I'm going to actually try to estimate this guy. So in practice, I'm going to have something that looks like this, but it's not really this. And we're not going to do this, but you can show that you can control this uniformly over the entire space, so that there is no place where it just becomes huge. In particular, this is not a place where it just becomes super huge and the minimum of the dotted line becomes really far from this guy. So if those two functions are close to each other, then this implies that the minimum of the dotted line is close to the minimum of the solid line. And we know that this is theta star, and this is our MLE estimator, theta hat ML. So that's basically the principle: the more data we have, the closer the dotted line is to the solid line, and so its minimum is closer to the minimum.

But now, this is just one example where I drew a picture for you. There could be some really nasty examples. Think of this example, where I have a function which is convex, but it looks more like this. That's convex, it's U-shaped -- it's just a professional U. Now, I'm going to put a dotted line around it that has pretty much the same fluctuations. The band around it is of this size.
So do we agree that the distance between the solid line and the dotted line is pretty much the same in those two pictures? Now here, depending on how I tilt this guy, basically I can put the minimum theta star wherever I want. And let's say that here, I actually put it here. That's pretty much the minimum of this line. And now the minimum of the dotted line is this guy. So they're very far apart. The fact that I'm very flat at the bottom makes my requirements for staying close to the U-shaped solid curve much more stringent, if I want the minima to stay close. So this is the canonical case, this is the annoying case, and of course you have the awesome case -- it looks like this. And then when you deviate, you can have something that moves pretty far; it doesn't matter, it's always going to stay close.

Now, what is the quantity that measures how curved I am at a given point -- how curved the function is at a given point? The second derivative. And so the Fisher information is negative the second derivative. Why the negative? Well, here -- yeah, we're looking for a minimum, and this guy is really -- you should view this as a flipped function. Here we're trying to maximize the likelihood, which is basically maximizing the negative KL. So the picture I'm showing you is for trying to minimize the KL. The true picture that you should see for this guy is the same, except that it's just flipped over. But the curvature is the same whether I flip my sheet or not, so it's the same thing. So apart from this negative sign, which is just coming from the fact that we're maximizing instead of minimizing, this is just telling me how curved my likelihood is around the maximum. And therefore, it's actually telling me how good, how robust, my maximum likelihood estimator is. It's going to tell me how close my maximum likelihood estimator is going to be to the true parameter.
So I should be able to see that somewhere. There should be some statement that tells me that this Fisher information will play a role when assessing the precision of this estimator. And remember, how do we characterize a good estimator? Well, we look at its bias, or we look at its variance. And we can combine the two and form the quadratic risk. So essentially, what we're going to try to say is that one of those guys -- either the bias or the variance or the quadratic risk -- is going to be worse if my function is flatter, meaning that my Fisher information is smaller. And this is exactly the point of this last theorem.

So let's look at a couple of conditions. This is your typical 1950s statistics paper that has, like, one page of assumptions. And it was like that in the early days because people were trying to make theories that would be valid for as many models as possible. And now people are sort of abusing this, and they're just making these lists of assumptions so that their particular method works for their particular problem, because they just want to take shortcuts. But really, the maximum likelihood estimator is basically as old as modern statistics, and so these were really necessary conditions. We'll just parse them.

The model is identified. Well, it had better be, because I'm trying to estimate theta and not P theta. So this one is good. For all theta, the support of P theta does not depend on theta. That's just something we need to have; otherwise, things become really messy. In particular, I'm not going to be able to define likelihoods -- Kullback-Leibler divergences. Why can I not do that? Well, because the Kullback-Leibler divergence has a log of the ratio of two densities. And if the support is changing with theta, it might be that I have the log of a ratio of something that's 0 and something that's not 0. And the log of 0 is a slightly annoying quantity to play with. So we're just removing that case.
Nothing depends on theta -- think of the support as being basically the entire real line for a Gaussian, for example. Theta star is not on the boundary of theta. Can anybody tell me why this is important? We're talking about derivatives. So when I want to talk about derivatives, I'm talking about fluctuations around a certain point. And if I'm at the boundary, it's actually really annoying. I might have the derivative -- remember, I gave you this example -- where the maximum likelihood is just attained at the boundary, because the function cannot grow anymore at the boundary. But that does not mean that the first-order derivative is equal to 0. It does not mean anything. So all of this picture here is valid only if I'm actually achieving the minimum inside. Because if my theta space stops here and it's just this guy, I'm going to be here, and there are no questions about curvature or anything that comes into play. It's completely different. So here, it's inside. Again, think of theta as being the entire real line; then everything is inside.

I is invertible. What does it mean for a positive number, a 1 by 1 matrix, to be invertible? Yep.

AUDIENCE: It'd be equal to its [INAUDIBLE].

PHILIPPE RIGOLLET: A 1 by 1 matrix, that's a number, right? What is a characteristic -- if I give you a matrix with numbers and ask you if it's invertible, what are you going to do with it?

AUDIENCE: Check if the determinant is 0.

PHILIPPE RIGOLLET: Check if the determinant is 0. What is the determinant of a 1 by 1 matrix? It's just the number itself. So basically, you want to check whether this number is 0 or not. So we're going to think in the one-dimensional case here. And in the one-dimensional case, that just means that the curvature is not 0. Well, it had better not be 0, because otherwise I'm going to have no guarantees. If I'm totally flat, if I have no curvature, I'm basically totally flat at the bottom.
And then I'm going to get nasty things. Now, this is not quite true. I could have the curvature which grows like -- so here, basically, the second derivative is telling me -- if I do a Taylor expansion, it's telling me how I grow as a function of, say, x squared. It's the quadratic term that I'm controlling. It could be that this guy is 0, and then the term of order x to the fourth is picking up -- that could be the first one that's non-zero. But that would mean that my rate of convergence would not be square root of n. When the central limit theorem comes into play, it would become n to the 1/4. And if I have a bunch of zeros until the 16th order, I would have n to the 1/16, because that's really telling me how flat I am. So we're going to focus on the case where it's only the quadratic term, and the rates of the central limit theorem kick in.

And then a few other technical conditions -- we just used a couple of them. So I permuted limit and integral. And you can check that really what you want is that the integral of a derivative is equal to 0. Well, it just means that the values at the two ends are actually the same. So those are slightly different things.

So now, what we have is that the maximum likelihood estimator has the following two properties. The first one -- if I were to say that in words, what would I say? That theta hat is -- is what? Yeah, that's what I would say when I -- that's for mathematicians. But if I'm a statistician, what am I going to say? It's consistent. It's a consistent estimator of theta star. It converges in probability to theta star. And then we have this sort of central limit theorem statement. The central limit theorem tells me that if this was an average, and I remove the expectation of the average -- let's say it's 0, for example -- then square root of n times the average goes to some normal distribution.
This is telling me that this is actually true even if theta hat has nothing to do with an average. That's remarkable. Theta hat might not even have a closed form, and I still have basically the same properties that an average would get from a central limit theorem. And what is the asymptotic variance? So that's the variance in the end. Here, I'm thinking of those guys as being multivariate, and so I have the inverse of the Fisher information matrix that shows up as the variance-covariance matrix asymptotically. But if you think of just a one-dimensional parameter, it's one over the Fisher information, one over the curvature. So if the curvature is really flat, the variance becomes really big. If the function is really flat, curvature is low, variance is big. If the curvature is very high, the variance becomes very low. And so that illustrates everything that's happening with the pictures that we have. And if you look -- what's amazing here -- there is no square root of 2 pi, there are no fudge factors going on here. This is the asymptotic variance, nothing else. It's all in there, all in the curvature.

Are there any questions about this? So you can see here that theta star is the true parameter, and the information matrix is evaluated at theta star. That's the point that matters. When I drew this picture, the point that was at the very bottom was always theta star. It's the one that minimizes the KL divergence, as long as my model is identified. Yes?

AUDIENCE: So the higher the curvature, the higher the inverse of the Fisher information?

PHILIPPE RIGOLLET: No, the higher the Fisher information itself. So the inverse is going to be smaller. So small variance -- that's good. So now, what does it mean? If I look at what the quadratic risk of this guy is, asymptotically -- what is the asymptotic quadratic risk? Well, it's 0, actually.
But if I assume that this thing is true, that this thing is pretty much Gaussian, and I look at the quadratic risk -- well, it's the expectation of the square of this thing. And so it's going to scale like the variance divided by n. The bias goes to 0, just by this. And then the quadratic risk is going to scale like one over the Fisher information, divided by n. So here -- I'm not mentioning the constants. There must be constants, because everything is asymptotic, so for each finite n, I'm going to have some constants that show up.

Everybody just got their mind blown by this amazing theorem? I mean, if you think about it, the MLE can be anything. I'm sorry to tell you, in many instances the MLE is just going to be an average, which is just going to be slightly annoying. But there are some cases where it's not, and we have to resort to this theorem rather than resorting to the central limit theorem to prove this thing. And more importantly, even if this was an average, you don't even have to know how to compute the covariance matrix -- sorry, the variance of this thing -- to plug it into the central limit theorem. I'm telling you, it's actually given by the Fisher information matrix. So if it's an average, between you and me, you probably want to go the central limit theorem route if you want to prove this kind of stuff. But if it's not, then that's your best shot. But you have to check those conditions. I will give you point 5 for granted.

Ready? Any questions? We're going to wrap up this chapter four, so if you have questions, that's the time. Yes?

AUDIENCE: What was the quadratic risk up there?

PHILIPPE RIGOLLET: You mean the definition?

AUDIENCE: No -- what it was for this.

PHILIPPE RIGOLLET: Well, you see, the quadratic risk, if I think of it as being one-dimensional, is the expectation of the square of the difference between theta hat and theta star.
So that means that if I think of this as having that normal distribution, that's basically computing the expectation of the square of this Gaussian, divided by n. I just divided by square root of n on both sides. So it's the expectation of the square of this Gaussian. The Gaussian has mean 0, so the expectation of the square is just the variance. And so I'm left with 1 over the Fisher information, divided by n.

AUDIENCE: I see. OK.
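To make the theorem and this quadratic-risk calculation concrete, here is a small simulation sketch, not from the lecture. It reuses the N(0, sigma squared) model with theta = sigma squared as a purely illustrative choice: there the MLE is the average of the Xi squared, and 1/I(theta star) = 2 theta star squared, so sqrt(n) times (theta hat minus theta star) should look Gaussian with that variance, and the quadratic risk should scale like 1/(n I(theta star)):

```python
# Simulation sketch of the theorem (illustrative N(0, sigma^2) model, theta = sigma^2):
# the MLE is the average of the X_i^2, and 1/I(theta*) = 2*theta*^2, so
# sqrt(n)*(theta_hat - theta*) should look like N(0, 2*theta*^2), and the
# quadratic risk should scale like 1/(n*I(theta*)).  Constants are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
theta_star = 1.5
n, reps = 500, 20_000

x = rng.normal(0.0, np.sqrt(theta_star), size=(reps, n))
theta_hat = (x**2).mean(axis=1)              # MLE of sigma^2 in each replication

z = np.sqrt(n) * (theta_hat - theta_star)
print(z.mean())                              # ~ 0 (consistency)
print(z.var())                               # ~ 2*theta_star^2 = 4.5 = 1/I(theta*)
print(((theta_hat - theta_star)**2).mean())  # quadratic risk ~ 1/(n*I) = 0.009
```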
PHILIPPE RIGOLLET: So let's move on to chapter four. And this is the method of moments. So the method of moments is actually maybe a bit older than maximum likelihood. Maximum likelihood dates, say, to the early 20th century -- I mean, as a systematic thing -- because, as I said, many of those guys are going to be averages, and finding an average is probably a little older. The method of moments has some really nice uses. There's a paper by Pearson in 1904, I believe, or maybe 1894, I don't know. And in this paper, he was actually studying some species of crab on an island, and he was trying to make some measurements. That's how he came up with this model of mixtures of Gaussians, because there were actually two different populations in this population of crabs. And the way he actually fitted the parameters was by the method of moments, except that since there were a lot of parameters, he basically had to solve six equations with six unknowns. And that was a complete nightmare. The guy did it by hand, and we don't know how he did it, actually. But that is pretty impressive.

So I want to start -- and this first part is a little brutal. But this is a Course 18 class, and I do not want to give you -- so let's all agree that this course might be slightly more challenging than AP statistics. And that means that it's going to be challenging just during class: I'm not going to ask you about the Weierstrass Approximation Theorem during the exams. But what I want is to give you mathematical motivations for what we're doing. And I can promise you that maybe you will have a slightly higher body temperature during the lecture, but you will come out of this class smarter. I'm trying to motivate you to use mathematical tools and show you where interesting mathematical things that you might find dry elsewhere actually work very beautifully in the stats literature. One that we saw was using the Kullback-Leibler divergence as a motivation for maximum likelihood estimation, for example.

So the Weierstrass Approximation Theorem is something that comes from pure analysis. Maybe -- I mean, it took me a while before I saw that. Essentially, what it's telling you is that if you look at a function that is continuous on an interval a, b -- on a segment a, b -- then you can actually approximate it uniformly well by polynomials, as long as you're willing to take the degree of these polynomials large enough. So the formal statement is: for any epsilon, there exist a d that depends on epsilon, and coefficients a0 to ad -- so if you insist on having an accuracy which is 1/10,000, maybe you're going to need a polynomial of degree 100,000, who knows; it doesn't tell you anything about this. But it's telling you that at least you have only a finite number of parameters to approximate those functions that typically require an infinite number of parameters to be described. So that's actually quite nice, and that's the basis for many polynomial methods. And here it's uniform, so there's this max over x that shows up, which is actually nice as well. That's the Weierstrass Approximation Theorem.

Why is that useful to us? Well, in statistics, I have a sample X1 to Xn. I have, say, a unified statistical model -- not unified, identified; I'm not always going to remind you that it's identified. And I'm going to assume that it has a density.
You could think of it as having a PMF, but think of it as having a density for one second. Now, what I want is to find the distribution; I want to find theta. And finding theta, since the model is identified, is equivalent to finding P theta, which is equivalent to finding f theta. And knowing a function -- knowing a density -- is the same as knowing its integral against any test function h. So that means that if I want to make sure I know a density -- if I want to check whether two densities are the same -- all I have to do is compute their integrals against all bounded continuous functions. You already know that it would be true if I checked it for all functions h. But since f is a density, I can actually look only at functions h that are bounded, say between minus 1 and 1, and that are continuous. That's enough. Agreed? Well, just trust me on this. Yes, you have a question?

AUDIENCE: Why is this -- like, why shouldn't you just say that [INAUDIBLE]?

PHILIPPE RIGOLLET: Yeah, I can do that. I'm just finding a characterization that's going to be useful for me later on. I can find a bunch of them, but here, this one is going to be useful. So all I need to say is that if f theta integrated against h of x agrees with f theta star integrated against h of x for all such h, then this implies that f theta is equal to f theta star -- not everywhere, but almost everywhere. And that's only true if I guarantee to you that f theta and f theta star are densities; this is not true for any function. So now, that means that if I wanted to estimate theta hat, all I would have to do is compute the average -- so this guy here, the integral -- let me clean up my board a bit.

So my goal is to find theta such that, if I look at f theta and I integrate it against h of x, this gives me the same thing as if I were to do it against f theta star. And I want this for any h which is continuous and bounded.
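An aside on the Weierstrass theorem invoked above, not from the lecture: one constructive version of it uses Bernstein polynomials. On [0, 1], the degree-d polynomial that averages f(k/d) with binomial weights converges uniformly to any continuous f as d grows. Here is a minimal sketch; the test function and the degrees are arbitrary choices:

```python
# Aside (not from the lecture): a constructive take on the Weierstrass theorem.
# On [0, 1], the Bernstein polynomial of degree d,
#   B_d(f)(x) = sum_k f(k/d) * C(d, k) * x^k * (1 - x)^(d - k),
# converges uniformly to any continuous f as d grows.  The test function
# and the degrees below are arbitrary choices.
import numpy as np
from math import comb

def bernstein(f, d, x):
    """Evaluate the degree-d Bernstein approximation of f at points x in [0, 1]."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for k in range(d + 1):
        out += f(k / d) * comb(d, k) * x**k * (1 - x)**(d - k)
    return out

f = lambda t: np.abs(t - 0.5)          # continuous, but not smooth at 0.5
grid = np.linspace(0.0, 1.0, 1001)

for d in (5, 20, 100, 400):
    err = np.max(np.abs(f(grid) - bernstein(f, d, grid)))
    print(d, err)                      # the uniform error shrinks as d grows
```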
So of course, I don't know what this quantity is; it depends on my unknown theta star. But I have data from this. And I'm going to do the usual, good old statistical trick, which is -- well, this I can write as the expectation with respect to P theta star of h of X. That's just the integral of a function against something. And so what I can do is say, well, now I don't know this guy, but my good old trick from the book is to replace expectations by averages. And what I get is the average of the h of Xi's, which is approximately this expectation by the law of large numbers. So if I can actually find an f theta such that, when I integrate it against h, it gives me pretty much the average of the evaluations of h over my data points -- for all h -- then that should be a good candidate.

The problem is that that's a lot of functions to try. Even if we reduced it from all possible functions to bounded and continuous ones, that's still a pretty large infinite number of them. And so what we can do is use our Weierstrass Approximation Theorem. It says, well, maybe I don't need to test against all h; maybe polynomials are enough for me. So what I'm going to do is look only at functions h of the form h of x equals the sum of ak x to the k-th, for k equal 0 to d -- only polynomials of degree d. So when I look at the average of my h's, I'm going to get a term like the first one. So the first one here, this guy, becomes 1/n, sum from i equal 1 to n, sum from k equal 0 to d, of ak Xi to the k-th. That's just the average of the values h of Xi. And now what I need to do is check that it's the same thing when I integrate an h of this form as well. I want this to hold for all polynomials of degree d. That's still a lot of them -- there's still an infinite number of polynomials, because there's an infinite number of numbers a0 to ad that describe those polynomials.
789 00:41:21,340 --> 00:41:23,410 But since those guys are polynomials, 790 00:41:23,410 --> 00:41:26,320 it's actually enough for me to look only at the terms 791 00:41:26,320 --> 00:41:28,420 of the form X to the k-th-- 792 00:41:28,420 --> 00:41:30,610 no linear combination, no nothing. 793 00:41:30,610 --> 00:41:32,290 So actually, it's enough to look only 794 00:41:32,290 --> 00:41:40,050 at h of x, which is equal to X to the k-th for k equal 0 to d. 795 00:41:43,260 --> 00:41:46,350 And now, how many of those guys are there? 796 00:41:46,350 --> 00:41:49,210 Just d plus 1, 0 to d. 797 00:41:49,210 --> 00:41:51,640 So that's actually a much easier thing for me to solve. 798 00:41:54,290 --> 00:42:01,970 Now, this quantity, which is the integral of f theta against X 799 00:42:01,970 --> 00:42:05,360 to the k-th-- so that the expectation of X to the k-th 800 00:42:05,360 --> 00:42:06,860 here-- 801 00:42:06,860 --> 00:42:12,940 it's called moment of order k, or k-th moment of P theta. 802 00:42:12,940 --> 00:42:13,620 That's a moment. 803 00:42:13,620 --> 00:42:16,210 A moment is just the expectation of the power. 804 00:42:16,210 --> 00:42:19,780 The mean is which moment? 805 00:42:19,780 --> 00:42:21,760 The first moment. 806 00:42:21,760 --> 00:42:24,720 And variance is not exactly the second moment. 807 00:42:24,720 --> 00:42:27,170 It's the second moment minus the first moment squared. 808 00:42:29,862 --> 00:42:30,695 That's the variance. 809 00:42:30,695 --> 00:42:34,691 It's E of X squared minus E of X squared. 810 00:42:34,691 --> 00:42:36,440 So those are things that you already know. 811 00:42:36,440 --> 00:42:37,564 And then you can go higher. 812 00:42:37,564 --> 00:42:40,200 You can go to E of X cube, E of X blah, blah. 813 00:42:40,200 --> 00:42:43,030 Here, I say go to E of X to the d-th. 814 00:42:43,030 --> 00:42:44,780 Now, as you can see, this is not something 815 00:42:44,780 --> 00:42:47,781 you can really put in action right now, 816 00:42:47,781 --> 00:42:50,030 because the Weierstrass Approximation Theorem does not 817 00:42:50,030 --> 00:42:52,070 tell you what d should be. 818 00:42:52,070 --> 00:42:54,020 Actually, we totally lost track of the epsilon 819 00:42:54,020 --> 00:42:54,978 I was even looking for. 820 00:42:54,978 --> 00:42:57,300 I just said approximately equal, approximately equal. 821 00:42:57,300 --> 00:42:59,300 And so all this thing is really just motivation. 822 00:42:59,300 --> 00:43:02,730 But it's essentially telling you that if you 823 00:43:02,730 --> 00:43:05,010 go to d large enough, technically 824 00:43:05,010 --> 00:43:08,730 you should be able to identify exactly your distribution up 825 00:43:08,730 --> 00:43:11,280 to epsilon. 826 00:43:11,280 --> 00:43:16,210 So I should be pretty good, if I go to d large enough. 827 00:43:16,210 --> 00:43:19,190 Now in practice, actually there should be much 828 00:43:19,190 --> 00:43:23,120 less than arbitrarily large d. 829 00:43:23,120 --> 00:43:25,460 Typically, we are going to need d which is 1 or 2. 830 00:43:28,150 --> 00:43:31,720 So there are some limitations to the Weierstrass Approximation 831 00:43:31,720 --> 00:43:32,440 Theorem. 832 00:43:32,440 --> 00:43:33,940 And there's a few. 833 00:43:33,940 --> 00:43:35,950 The first one is that it only works 834 00:43:35,950 --> 00:43:39,850 for continuous functions, which is not so much of a problem. 835 00:43:39,850 --> 00:43:42,740 That can be fixed. 836 00:43:42,740 --> 00:43:44,560 Well, we need bounded continuous functions. 
837 00:43:44,560 --> 00:43:45,961 It works only on intervals. 838 00:43:45,961 --> 00:43:47,460 That's annoying, because we're going 839 00:43:47,460 --> 00:43:51,080 to have random variables that are defined beyond intervals. 840 00:43:51,080 --> 00:43:53,560 So we need something that just goes beyond the intervals. 841 00:43:53,560 --> 00:43:55,840 And you can imagine that if you let your functions be huge, 842 00:43:55,840 --> 00:43:57,256 it's going to be very hard for you 843 00:43:57,256 --> 00:44:00,001 to have a polynomial approximately [INAUDIBLE] well. 844 00:44:00,001 --> 00:44:02,500 Things are going to start going up and down at the boundary, 845 00:44:02,500 --> 00:44:05,380 and it's going to be very hard. 846 00:44:05,380 --> 00:44:07,360 And again, as I said several times, 847 00:44:07,360 --> 00:44:09,160 it doesn't tell us what d should be. 848 00:44:09,160 --> 00:44:11,470 And as statisticians, we're looking for methods, not 849 00:44:11,470 --> 00:44:15,910 like principles of existence of a method that exists. 850 00:44:15,910 --> 00:44:21,840 So if E is discrete, I can actually 851 00:44:21,840 --> 00:44:23,720 get a handle on this d. 852 00:44:23,720 --> 00:44:26,730 If E is discrete and actually finite-- 853 00:44:26,730 --> 00:44:29,250 I'm going to actually look at a finite E, 854 00:44:29,250 --> 00:44:33,690 meaning that I have a PMF on, say, r possible values, x1 855 00:44:33,690 --> 00:44:34,404 and xr. 856 00:44:34,404 --> 00:44:35,820 My random variable, capital X, can 857 00:44:35,820 --> 00:44:37,290 take only r possible values. 858 00:44:37,290 --> 00:44:41,550 Let's think of them as being the integer numbers 1 to r. 859 00:44:41,550 --> 00:44:44,880 That's the number of success out of r trials 860 00:44:44,880 --> 00:44:46,590 that I get, for example. 861 00:44:46,590 --> 00:44:51,640 Binomial rp, that's exactly something like this. 862 00:44:51,640 --> 00:44:55,600 So now, clearly this entire distribution 863 00:44:55,600 --> 00:45:00,452 is defined by the PMF, which gives me exactly r numbers. 864 00:45:00,452 --> 00:45:02,410 So it can completely describe this distribution 865 00:45:02,410 --> 00:45:03,850 with r numbers. 866 00:45:03,850 --> 00:45:08,290 The question is, do I have an enormous amount of redundancy 867 00:45:08,290 --> 00:45:12,250 if I try to describe this distribution using moments? 868 00:45:12,250 --> 00:45:14,970 It might be that I need-- say, r is equal to 10, 869 00:45:14,970 --> 00:45:18,080 maybe I have only 10 numbers to describe this thing, 870 00:45:18,080 --> 00:45:20,980 but I actually need to compute moments up to the order of 100 871 00:45:20,980 --> 00:45:23,500 before I actually recover entirely the distribution. 872 00:45:23,500 --> 00:45:25,090 Maybe I need to go infinite. 873 00:45:25,090 --> 00:45:27,220 Maybe the Weierstrass Theorem is the only thing 874 00:45:27,220 --> 00:45:28,420 that actually saves me here. 875 00:45:28,420 --> 00:45:30,720 And I just cannot recover it exactly. 876 00:45:30,720 --> 00:45:33,340 I can go to epsilon if I'm willing to go to higher 877 00:45:33,340 --> 00:45:34,611 and higher polynomials. 878 00:45:34,611 --> 00:45:36,610 Oh, by the way, in the Weierstrass Approximation 879 00:45:36,610 --> 00:45:39,190 Theorem, I can promise you that as epsilon goes to 0, 880 00:45:39,190 --> 00:45:41,660 d goes to infinity. 881 00:45:41,660 --> 00:45:46,160 So now, really I don't even have actually r parameters. 
882 00:45:46,160 --> 00:45:50,070 I have only r minus 1 parameters, because the last one-- 883 00:45:50,070 --> 00:45:51,500 because they sum up to 1. 884 00:45:51,500 --> 00:45:53,630 So the last one I can always get by doing 885 00:45:53,630 --> 00:45:56,960 1 minus the sum of the first r minus 1. 886 00:45:56,960 --> 00:45:58,520 Agreed? 887 00:45:58,520 --> 00:46:01,100 So each distribution on r numbers is described 888 00:46:01,100 --> 00:46:04,700 by r minus 1 parameters. 889 00:46:04,700 --> 00:46:07,025 The question is, can I use only r minus 1 moments 890 00:46:07,025 --> 00:46:08,020 to describe this guy? 891 00:46:12,870 --> 00:46:16,380 This is something called Gaussian quadrature. 892 00:46:16,380 --> 00:46:18,930 The Gaussian quadrature tells you, yes, moments 893 00:46:18,930 --> 00:46:22,380 are actually a good way to reparametrize your distribution 894 00:46:22,380 --> 00:46:24,870 in the sense that if I give you the moments 895 00:46:24,870 --> 00:46:27,120 or if I give you the probability mass function, 896 00:46:27,120 --> 00:46:29,370 I'm basically giving you exactly the same information. 897 00:46:29,370 --> 00:46:32,770 You can recover all the probabilities from there. 898 00:46:32,770 --> 00:46:34,930 So here, I'm going to denote by-- 899 00:46:34,930 --> 00:46:37,870 I'm going to drop the notation in theta. 900 00:46:37,870 --> 00:46:38,770 I don't have theta. 901 00:46:38,770 --> 00:46:41,460 Here, I'm talking about any generic distribution. 902 00:46:41,460 --> 00:46:44,950 And so I'm going to call mk the k-th moment. 903 00:46:49,080 --> 00:46:54,610 And if I have a PMF, this is really the sum for j 904 00:46:54,610 --> 00:47:06,690 equals 1 to r of xj to the k-th times p of xj. 905 00:47:06,690 --> 00:47:10,450 And this is the PMF. 906 00:47:10,450 --> 00:47:12,100 So that's my k-th moment. 907 00:47:12,100 --> 00:47:15,340 So the k-th moment is a linear combination of the numbers 908 00:47:15,340 --> 00:47:16,430 that I am interested in. 909 00:47:19,750 --> 00:47:22,195 So that's one equation. 910 00:47:25,220 --> 00:47:27,350 And I have as many equations as I'm actually 911 00:47:27,350 --> 00:47:28,700 willing to look at moments. 912 00:47:28,700 --> 00:47:34,250 So if I'm looking at 25 moments, I have 25 equations. 913 00:47:34,250 --> 00:47:36,650 m1 equals blah with this to the power of 1, 914 00:47:36,650 --> 00:47:40,020 m2 equals blah with this to the power of 2, et cetera. 915 00:47:40,020 --> 00:47:41,850 And then I also have the equation 916 00:47:41,850 --> 00:47:51,240 that 1 is equal to the sum of the p of xj. 917 00:47:51,240 --> 00:47:55,190 That's just the definition of PMF. 918 00:47:55,190 --> 00:47:56,190 So this is r's. 919 00:47:56,190 --> 00:47:58,163 They're ugly, but those are r's. 920 00:48:00,790 --> 00:48:04,390 So now, this is a system of linear equations in p, 921 00:48:04,390 --> 00:48:07,390 and I can actually write it in its canonical form, which 922 00:48:07,390 --> 00:48:11,410 is of the form a matrix of those guys 923 00:48:11,410 --> 00:48:15,750 times my parameters of interest is equal to a right hand side. 924 00:48:15,750 --> 00:48:17,880 The right hand side is the moments.
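Written out, the system just described looks like this (a reconstruction from the description above, with the row of 1's, the moment of order 0, placed on top):

$$
\begin{pmatrix}
1 & 1 & \cdots & 1\\
x_1 & x_2 & \cdots & x_r\\
x_1^2 & x_2^2 & \cdots & x_r^2\\
\vdots & \vdots & & \vdots\\
x_1^{r-1} & x_2^{r-1} & \cdots & x_r^{r-1}
\end{pmatrix}
\begin{pmatrix}
p(x_1)\\ p(x_2)\\ \vdots\\ p(x_r)
\end{pmatrix}
=
\begin{pmatrix}
1\\ m_1\\ \vdots\\ m_{r-1}
\end{pmatrix},
\qquad
m_k = \sum_{j=1}^{r} x_j^k\, p(x_j).
$$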
925 00:48:17,880 --> 00:48:20,070 That means, if I did you the moments, 926 00:48:20,070 --> 00:48:22,740 can you come back and find what the PMF, 927 00:48:22,740 --> 00:48:24,870 because we know already from probability 928 00:48:24,870 --> 00:48:27,390 that the PMF is all I need to know to fully describe 929 00:48:27,390 --> 00:48:29,190 my distribution. 930 00:48:29,190 --> 00:48:32,010 Given the moments, that's unclear. 931 00:48:32,010 --> 00:48:37,830 Now, here, I'm going to actually take exactly r minus 1 moment 932 00:48:37,830 --> 00:48:39,960 and this extra condition that the sum of those guys 933 00:48:39,960 --> 00:48:41,640 should be 1. 934 00:48:41,640 --> 00:48:45,240 So that gives me r equations based on r minus 1 moments. 935 00:48:45,240 --> 00:48:47,260 And how many unknowns do I have? 936 00:48:47,260 --> 00:48:54,230 Well, I have my r unknown parameters 937 00:48:54,230 --> 00:48:59,060 for the PMF, the r values of the PMF. 938 00:48:59,060 --> 00:49:02,540 Now, of course, this is going to play a huge role 939 00:49:02,540 --> 00:49:06,920 in whether the are many p's that give me the same. 940 00:49:06,920 --> 00:49:09,620 The goal is to find if there are several p's that can give me 941 00:49:09,620 --> 00:49:10,552 the same moments. 942 00:49:10,552 --> 00:49:13,010 But if there's only one p that can give me a set of moment, 943 00:49:13,010 --> 00:49:15,260 that means that I have a one-to-one correspondence 944 00:49:15,260 --> 00:49:17,294 between PMF and moments. 945 00:49:17,294 --> 00:49:18,710 And so if you give me the moments, 946 00:49:18,710 --> 00:49:23,074 I can just go back to the PMF. 947 00:49:23,074 --> 00:49:23,990 Now, how do I go back? 948 00:49:23,990 --> 00:49:26,310 Well, by inverting this matrix. 949 00:49:26,310 --> 00:49:28,710 If I multiply this matrix by its inverse, 950 00:49:28,710 --> 00:49:32,600 I'm going to get the identity times the vector of p's equal 951 00:49:32,600 --> 00:49:36,890 the inverse of the matrix times the m's. 952 00:49:36,890 --> 00:49:41,150 So what we want to do is to say that p 953 00:49:41,150 --> 00:49:45,350 is equal to the inverse of this big matrix times the moments 954 00:49:45,350 --> 00:49:47,190 that you give me. 955 00:49:47,190 --> 00:49:49,380 And if I can actually talk about the inverse, 956 00:49:49,380 --> 00:49:52,410 then I have basically a one-to-one mapping 957 00:49:52,410 --> 00:49:55,930 between the m's, the moments, and the matrix. 958 00:49:55,930 --> 00:49:58,380 So what I need to show is that this matrix is invertible. 959 00:49:58,380 --> 00:50:01,350 And we just decided that the way to check 960 00:50:01,350 --> 00:50:05,670 if a matrix is invertible is by computing its determinant. 961 00:50:05,670 --> 00:50:10,300 Who has computed a determinant before? 962 00:50:10,300 --> 00:50:12,820 Who was supposed to compute a determinant at least than just 963 00:50:12,820 --> 00:50:15,004 to say, no, you know how to do it. 964 00:50:15,004 --> 00:50:16,670 So you know how to compute determinants. 965 00:50:16,670 --> 00:50:19,180 And if you've seen any determinant in class, 966 00:50:19,180 --> 00:50:22,660 there's one that shows up in the exercises that professors love. 967 00:50:22,660 --> 00:50:25,390 And it's called the Vandermonde determinant. 968 00:50:25,390 --> 00:50:26,890 And it's the determinant of a matrix 969 00:50:26,890 --> 00:50:28,900 that have a very specific form. 
970 00:50:28,900 --> 00:50:31,780 It looks like-- so there's basically only r parameters 971 00:50:31,780 --> 00:50:33,950 to this r by r matrix. 972 00:50:33,950 --> 00:50:36,340 The first row, or the first column-- sometimes, 973 00:50:36,340 --> 00:50:37,810 it's presented like that-- 974 00:50:37,810 --> 00:50:41,551 is this vector where each entry is to the power of 1. 975 00:50:41,551 --> 00:50:43,800 And the second one is each entry is to the power of 2, 976 00:50:43,800 --> 00:50:46,970 and to the power of 3, and to the power 4, et cetera. 977 00:50:46,970 --> 00:50:49,410 So that's exactly what we have-- x1 to the first, x2 978 00:50:49,410 --> 00:50:51,460 to the first, all the way to xr to the first, 979 00:50:51,460 --> 00:50:53,290 and then same thing to the power of 2, 980 00:50:53,290 --> 00:50:54,550 all the way to the last one. 981 00:50:54,550 --> 00:50:58,270 And I also need to add the row of all 1's, which 982 00:50:58,270 --> 00:51:01,210 you can think of those guys are to the power of 0, if you want. 983 00:51:01,210 --> 00:51:02,690 So I should really put it on top, 984 00:51:02,690 --> 00:51:05,430 if I wanted to have a nice ordering. 985 00:51:05,430 --> 00:51:07,290 So that was the matrix that I had. 986 00:51:07,290 --> 00:51:09,060 And I'm not asking you to check it. 987 00:51:09,060 --> 00:51:10,860 You can prove that by induction actually, 988 00:51:10,860 --> 00:51:14,190 typically by doing the usual let's eliminate 989 00:51:14,190 --> 00:51:15,810 some rows and columns type of tricks 990 00:51:15,810 --> 00:51:17,260 that you do for matrices. 991 00:51:17,260 --> 00:51:19,197 So you basically start from the whole matrix. 992 00:51:19,197 --> 00:51:21,780 And then you move onto a matrix that has only one 1's and then 993 00:51:21,780 --> 00:51:22,770 0's here. 994 00:51:22,770 --> 00:51:25,827 And then you have Vandermonde that's just slightly smaller. 995 00:51:25,827 --> 00:51:26,910 And then you just iterate. 996 00:51:26,910 --> 00:51:27,894 Yeah. 997 00:51:27,894 --> 00:51:31,502 AUDIENCE: I feel like there's a loss to either the supra index, 998 00:51:31,502 --> 00:51:35,274 or the sub index should have a k somewhere [INAUDIBLE].. 999 00:51:38,119 --> 00:51:39,702 [INAUDIBLE] the one I'm talking about? 1000 00:51:39,702 --> 00:51:41,285 PHILIPPE RIGOLLET: Yeah, I know, but I 1001 00:51:41,285 --> 00:51:45,280 don't think the answer to your question is yes. 1002 00:51:45,280 --> 00:51:48,330 So k is the general index, right? 1003 00:51:48,330 --> 00:51:51,180 So there's no k. k does not exist. k just is here for me 1004 00:51:51,180 --> 00:51:53,310 to tell me for k equals 1 to r. 1005 00:51:53,310 --> 00:51:56,250 So this is an r by r matrix. 1006 00:51:56,250 --> 00:51:58,290 And so there is no k there. 1007 00:51:58,290 --> 00:52:00,960 So if you wanted the generic term, 1008 00:52:00,960 --> 00:52:03,980 if I wanted to put 1 in the middle on the j-th row and k-th 1009 00:52:03,980 --> 00:52:07,740 column, that would be x-- 1010 00:52:07,740 --> 00:52:13,410 so j-th row would be x sub k to the power of j. 1011 00:52:13,410 --> 00:52:15,420 That would be the-- 1012 00:52:15,420 --> 00:52:19,090 And so now, this is basically the sum-- 1013 00:52:19,090 --> 00:52:20,630 well, that should not be strictly-- 1014 00:52:20,630 --> 00:52:25,000 So that would be for j and k between 1 and r. 1015 00:52:25,000 --> 00:52:26,920 So this is the formula that get when 1016 00:52:26,920 --> 00:52:29,470 you try to expand this Vandermonde determinant. 
1017 00:52:29,470 --> 00:52:32,110 You have to do it only once when you're a sophomore typically. 1018 00:52:32,110 --> 00:52:34,000 And then you can just go on Wikipedia to do it. 1019 00:52:34,000 --> 00:52:34,750 That's what I did. 1020 00:52:34,750 --> 00:52:36,700 I actually made a mistake copying it. 1021 00:52:36,700 --> 00:52:39,370 The first one should be 1 less than or equal to j. 1022 00:52:39,370 --> 00:52:42,370 And the last one should be k less than or equal to r. 1023 00:52:42,370 --> 00:52:43,870 And now what you have is the product 1024 00:52:43,870 --> 00:52:45,520 of the differences of xj and xk. 1025 00:52:47,204 --> 00:52:48,620 And for this thing to be non-zero, 1026 00:52:48,620 --> 00:52:51,259 you need all the terms to be non-zero. 1027 00:52:51,259 --> 00:52:52,800 And for all the terms to be non-zero, 1028 00:52:52,800 --> 00:52:58,412 you need to have no xi, xj, and no xj, xk that are identical. 1029 00:52:58,412 --> 00:52:59,870 If all those are different numbers, 1030 00:52:59,870 --> 00:53:03,094 then this product is going to be different from 0. 1031 00:53:03,094 --> 00:53:05,010 And those are different numbers, because those 1032 00:53:05,010 --> 00:53:09,050 are r possible values that your random verbal takes. 1033 00:53:09,050 --> 00:53:11,360 You're not going to say that it takes two 1034 00:53:11,360 --> 00:53:14,010 with probability 1.5-- 1035 00:53:14,010 --> 00:53:18,170 sorry, two with probability 0.5 and two with probability 0.25. 1036 00:53:18,170 --> 00:53:22,370 You're going to say it takes two with probability 0.75 directly. 1037 00:53:22,370 --> 00:53:24,350 So those xj's are different. 1038 00:53:24,350 --> 00:53:27,510 These are the different values that your random variable 1039 00:53:27,510 --> 00:53:28,010 can take. 1040 00:53:32,200 --> 00:53:37,404 Remember, xj, xk was just the different values x1 to xr-- 1041 00:53:37,404 --> 00:53:39,820 sorry-- was the different values that your random variable 1042 00:53:39,820 --> 00:53:41,020 can take. 1043 00:53:41,020 --> 00:53:43,796 Nobody in their right mind would write twice the same value 1044 00:53:43,796 --> 00:53:44,788 in this list. 1045 00:53:47,450 --> 00:53:49,119 So my Vandermonde is non-zero. 1046 00:53:49,119 --> 00:53:49,910 So I can invert it. 1047 00:53:49,910 --> 00:53:51,493 And I have a one-to-one correspondence 1048 00:53:51,493 --> 00:53:55,970 between my entire PMF and the first r minus 1's moments 1049 00:53:55,970 --> 00:54:00,390 to which I append the number 1, which is really 1050 00:54:00,390 --> 00:54:02,700 the moment of order 0 again. 1051 00:54:02,700 --> 00:54:05,550 It's E of X to the 0-th, which is 1. 1052 00:54:05,550 --> 00:54:10,110 So good news, I only need r minus 1 parameters 1053 00:54:10,110 --> 00:54:12,030 to describe r minus 1 parameters. 1054 00:54:12,030 --> 00:54:14,260 And I can choose either the values of my PMF. 1055 00:54:14,260 --> 00:54:16,360 Or I can choose the r minus 1 first moments. 1056 00:54:20,300 --> 00:54:22,580 So the moments tell me something. 1057 00:54:22,580 --> 00:54:26,450 Here, it tells me that if I have a discrete distribution 1058 00:54:26,450 --> 00:54:28,160 with r possible values, I only need 1059 00:54:28,160 --> 00:54:30,200 to compute r minus 1 moments. 1060 00:54:30,200 --> 00:54:34,471 So this is better than Weierstrass Approximation 1061 00:54:34,471 --> 00:54:34,970 Theorem. 1062 00:54:34,970 --> 00:54:37,970 This tells me exactly how many moments I need to consider. 
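To make the inversion concrete, here is a minimal sketch in Python (the support points and the PMF are hypothetical, chosen only for illustration): it recovers a PMF from the moment of order 0, which is 1, together with the first r minus 1 moments, by solving the Vandermonde-type system described above.

```python
import numpy as np

# Hypothetical discrete distribution on r = 4 known support points.
x = np.array([1.0, 2.0, 3.0, 4.0])
p_true = np.array([0.1, 0.2, 0.3, 0.4])   # the PMF we pretend not to know

# Vandermonde-type matrix: row k holds x_1^k, ..., x_r^k for k = 0, ..., r-1.
V = x[None, :] ** np.arange(len(x))[:, None]
m = V @ p_true                            # right-hand side: 1, m_1, ..., m_{r-1}

# The x_j are distinct, so the Vandermonde determinant is non-zero and the
# system has a unique solution: the PMF itself.
p_recovered = np.linalg.solve(V, m)
print(np.allclose(p_recovered, p_true))   # True
```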
1063 00:54:37,970 --> 00:54:39,410 And this is for any distribution. 1064 00:54:39,410 --> 00:54:41,100 This is not a distribution that's 1065 00:54:41,100 --> 00:54:43,790 parametrized by one parameter, like the Poisson 1066 00:54:43,790 --> 00:54:47,210 or the binomial or all this stuff. 1067 00:54:47,210 --> 00:54:50,250 This is for any distribution under a finite number. 1068 00:54:50,250 --> 00:54:53,810 So hopefully, if I reduce the family of PMFs 1069 00:54:53,810 --> 00:54:55,970 that I'm looking at to a one-parameter family, 1070 00:54:55,970 --> 00:54:58,430 I'm actually going to need to compute much less than r 1071 00:54:58,430 --> 00:55:01,110 minus 1 values. 1072 00:55:01,110 --> 00:55:02,640 But this is actually hopeful. 1073 00:55:02,640 --> 00:55:04,650 It tells you that the method of moments 1074 00:55:04,650 --> 00:55:06,775 is going to work for any distribution. 1075 00:55:06,775 --> 00:55:09,417 You just have to invert a Vandermonde matrix. 1076 00:55:13,220 --> 00:55:17,350 So just the conclusion-- the statistical conclusion-- 1077 00:55:17,350 --> 00:55:20,770 is that moments contain important information 1078 00:55:20,770 --> 00:55:24,880 about the PMF and the PDF. 1079 00:55:24,880 --> 00:55:26,890 If we can estimate these moments accurately, 1080 00:55:26,890 --> 00:55:30,820 we can solve for the parameters of the distribution 1081 00:55:30,820 --> 00:55:32,674 and recover the distribution. 1082 00:55:32,674 --> 00:55:34,090 And in a parametric setting, where 1083 00:55:34,090 --> 00:55:36,970 knowing P theta amounts to knowing theta, which 1084 00:55:36,970 --> 00:55:41,270 is identifiability-- this is not innocuous-- 1085 00:55:41,270 --> 00:55:44,260 it is often the case that even less moments are needed. 1086 00:55:44,260 --> 00:55:46,810 After all, if theta is a one dimensional parameter, 1087 00:55:46,810 --> 00:55:48,730 I have one parameter to estimate. 1088 00:55:48,730 --> 00:55:51,370 Why would I go and get 25 moments 1089 00:55:51,370 --> 00:55:52,870 to get this one parameter. 1090 00:55:52,870 --> 00:55:54,532 Typically, there is actually-- we 1091 00:55:54,532 --> 00:55:55,990 will see that the method of moments 1092 00:55:55,990 --> 00:55:58,480 just says if you have a d dimensional parameter, 1093 00:55:58,480 --> 00:56:02,110 just compute d moments, and that's it. 1094 00:56:02,110 --> 00:56:04,280 But this is only on a case-by-case basis. 1095 00:56:04,280 --> 00:56:07,610 I mean, maybe your model will totally screw up its parameters 1096 00:56:07,610 --> 00:56:09,950 and you actually need to get them. 1097 00:56:09,950 --> 00:56:16,710 I mean, think about it, if the function is parameterized just 1098 00:56:16,710 --> 00:56:19,047 by its 27th moment-- 1099 00:56:19,047 --> 00:56:21,630 like, that's the only thing that matters in this distribution, 1100 00:56:21,630 --> 00:56:24,187 I just describe the function, it's just a density, 1101 00:56:24,187 --> 00:56:26,520 and the only thing that can change from one distribution 1102 00:56:26,520 --> 00:56:28,484 to another is this 27th moment-- 1103 00:56:28,484 --> 00:56:30,900 well, then you're going to have to go get the 27th moment. 1104 00:56:30,900 --> 00:56:33,780 And that probably means that your modeling step was actually 1105 00:56:33,780 --> 00:56:34,686 pretty bad. 1106 00:56:37,680 --> 00:56:40,970 So the rule of thumb, if theta is in Rd, we need d moments. 1107 00:56:46,970 --> 00:56:48,430 So what is the method of moments? 
1108 00:56:52,800 --> 00:56:55,080 That's just a good old trick. 1109 00:56:55,080 --> 00:56:58,380 Replace the expectation by averages. 1110 00:56:58,380 --> 00:56:59,970 That's the beauty. 1111 00:56:59,970 --> 00:57:02,080 The moments are expectations. 1112 00:57:02,080 --> 00:57:04,710 So let's just replace the expectations by averages 1113 00:57:04,710 --> 00:57:07,620 and then do it with the average version, 1114 00:57:07,620 --> 00:57:10,200 as if it was the true one. 1115 00:57:10,200 --> 00:57:14,160 So for example, I'm going to talk about population moments, 1116 00:57:14,160 --> 00:57:16,357 when I'm computing them with the true distribution, 1117 00:57:16,357 --> 00:57:18,690 and I'm going to talk about them empirical moments, when 1118 00:57:18,690 --> 00:57:22,290 I talk about averages. 1119 00:57:22,290 --> 00:57:24,690 So those are the two quantities that I have. 1120 00:57:24,690 --> 00:57:28,430 And now, what I hope is that there is. 1121 00:57:28,430 --> 00:57:30,960 So this is basically-- 1122 00:57:30,960 --> 00:57:32,140 everything is here. 1123 00:57:32,140 --> 00:57:33,750 That's where all the money is. 1124 00:57:33,750 --> 00:57:36,960 I'm going to assume there's a function psi that maps 1125 00:57:36,960 --> 00:57:40,120 my parameters-- let's say they're in Rd-- 1126 00:57:40,120 --> 00:57:42,385 to the set of the first d moments. 1127 00:57:45,490 --> 00:57:48,040 Well, what I want to do is to come from this guy 1128 00:57:48,040 --> 00:57:49,070 back to theta. 1129 00:57:49,070 --> 00:57:50,980 So it better be that this function is-- 1130 00:57:54,850 --> 00:57:55,802 invertible. 1131 00:57:55,802 --> 00:57:57,385 I want this function to be invertible. 1132 00:57:57,385 --> 00:57:59,200 In the Vandermonde case, this function 1133 00:57:59,200 --> 00:58:03,610 with just a linear function-- multiply a matrix by theta. 1134 00:58:03,610 --> 00:58:06,610 Then inverting a linear function is inverting the matrix. 1135 00:58:06,610 --> 00:58:08,145 Then this is the same thing. 1136 00:58:08,145 --> 00:58:09,520 So now what I'm going to assume-- 1137 00:58:09,520 --> 00:58:14,470 and that's key for this method to work-- is that this theta-- 1138 00:58:14,470 --> 00:58:16,570 so this function psi is one to one. 1139 00:58:16,570 --> 00:58:24,360 There's only one theta that gets only one set of moments. 1140 00:58:24,360 --> 00:58:26,750 And so if it's one to one, I can talk about its inverse. 1141 00:58:26,750 --> 00:58:28,750 And so now, I'm going to be able to define theta 1142 00:58:28,750 --> 00:58:32,330 as the inverse of the moments-- 1143 00:58:32,330 --> 00:58:33,620 the reciprocal of the moments. 1144 00:58:33,620 --> 00:58:37,940 And so now, what I get is that the moment estimator is just 1145 00:58:37,940 --> 00:58:42,140 the thing where rather than taking the true guys in there, 1146 00:58:42,140 --> 00:58:44,780 I'm actually going to take the empirical moments in there. 1147 00:58:48,580 --> 00:58:50,530 Before we go any further, I'd like 1148 00:58:50,530 --> 00:58:53,980 to just go back and tell you that this is not 1149 00:58:53,980 --> 00:58:56,380 completely free. 1150 00:58:56,380 --> 00:58:58,382 How well-behaved your function psi 1151 00:58:58,382 --> 00:58:59,590 is going to play a huge role. 1152 00:59:02,490 --> 00:59:05,394 Can somebody tell me what the typical distance-- 1153 00:59:05,394 --> 00:59:06,810 if I have a sample of size n, what 1154 00:59:06,810 --> 00:59:10,062 is the typical distance between an average and the expectation? 
1155 00:59:12,790 --> 00:59:14,360 What is the typical distance? 1156 00:59:14,360 --> 00:59:18,920 What is the order of magnitude as a function of n between xn 1157 00:59:18,920 --> 00:59:23,024 bar and its expectation. 1158 00:59:23,024 --> 00:59:25,000 AUDIENCE: 1 over square root of n. 1159 00:59:25,000 --> 00:59:25,760 PHILIPPE RIGOLLET: 1 over square root n. 1160 00:59:25,760 --> 00:59:28,064 That's what the central limit theorem tells us, right? 1161 00:59:28,064 --> 00:59:29,480 The central limit theorem tells us 1162 00:59:29,480 --> 00:59:31,521 that those things are basically a Gaussian, which 1163 00:59:31,521 --> 00:59:34,490 is of order of 1 divided by its square of n. 1164 00:59:34,490 --> 00:59:37,670 And so basically, I start with something 1165 00:59:37,670 --> 00:59:41,530 which is 1 over square root of n away from the true thing. 1166 00:59:41,530 --> 00:59:49,730 Now, if my function psi inverse is super steep like this-- 1167 00:59:49,730 --> 00:59:54,970 that's psi inverse-- then just small fluctuations, even 1168 00:59:54,970 --> 00:59:57,310 if they're of order 1 square root of n, 1169 00:59:57,310 --> 01:00:04,090 can translate into giant fluctuations in the y-axis. 1170 01:00:04,090 --> 01:00:06,040 And that's going to be controlled 1171 01:00:06,040 --> 01:00:09,640 by how steep psi inverse is, which is the same 1172 01:00:09,640 --> 01:00:14,150 as saying how flat psi is-- 1173 01:00:14,150 --> 01:00:15,720 how flat is psi. 1174 01:00:15,720 --> 01:00:20,440 So if you go back to this Vandermonde inverse, 1175 01:00:20,440 --> 01:00:26,570 what it's telling you is that if this inverse matrix blows up 1176 01:00:26,570 --> 01:00:29,030 this guy a lot-- 1177 01:00:29,030 --> 01:00:32,566 so if I start from a small fluctuation of this thing 1178 01:00:32,566 --> 01:00:34,190 and then they're blowing up by applying 1179 01:00:34,190 --> 01:00:36,050 the inverse of this matrix, things 1180 01:00:36,050 --> 01:00:37,600 are not going to go well. 1181 01:00:37,600 --> 01:00:41,860 Anybody knows what is the number that I should be looking for? 1182 01:00:41,860 --> 01:00:45,080 So that's from, say, numerical linear algebra 1183 01:00:45,080 --> 01:00:47,270 numerical methods. 1184 01:00:47,270 --> 01:00:49,244 When I have a system of linear equations, 1185 01:00:49,244 --> 01:00:50,660 what is the actual number I should 1186 01:00:50,660 --> 01:00:53,510 be looking at to know how much I'm 1187 01:00:53,510 --> 01:00:54,950 blowing up the fluctuations? 1188 01:00:54,950 --> 01:00:55,090 Yeah. 1189 01:00:55,090 --> 01:00:55,776 AUDIENCE: Condition number? 1190 01:00:55,776 --> 01:00:57,280 PHILIPPE RIGOLLET: The condition number, right. 1191 01:00:57,280 --> 01:00:59,600 So what's important here is the condition number 1192 01:00:59,600 --> 01:01:00,680 of this matrix. 1193 01:01:00,680 --> 01:01:03,715 If the condition number of this matrix is small, 1194 01:01:03,715 --> 01:01:04,340 then it's good. 1195 01:01:04,340 --> 01:01:05,660 It's not going to blow up much. 1196 01:01:05,660 --> 01:01:07,280 But if the condition number is very large, 1197 01:01:07,280 --> 01:01:08,720 it's just going to blow up a lot. 1198 01:01:08,720 --> 01:01:10,310 And the condition number is the ratio 1199 01:01:10,310 --> 01:01:13,010 of the largest and the smallest eigenvalues. 1200 01:01:13,010 --> 01:01:14,720 So you'll have to know what it is. 1201 01:01:14,720 --> 01:01:17,180 But this is how all these things get together. 
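As a rough numerical illustration of that amplification (my own sketch, not from the lecture), the condition number of the same Vandermonde-type matrix blows up quickly as r grows, so small errors in the estimated moments can turn into large errors in the recovered PMF.

```python
import numpy as np

# numpy's cond() returns the 2-norm condition number, the ratio of the largest
# to the smallest singular value; it measures how much solving the linear
# system can amplify errors in the right-hand side (here, estimated moments).
for r in (3, 5, 8, 12):
    x = np.arange(1.0, r + 1.0)                 # hypothetical support points 1, ..., r
    V = x[None, :] ** np.arange(r)[:, None]     # rows x_j^0, x_j^1, ..., x_j^(r-1)
    print(r, np.linalg.cond(V))
```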
1202 01:01:17,180 --> 01:01:21,380 So the numerical stability translates 1203 01:01:21,380 --> 01:01:24,835 into statistical stability here. 1204 01:01:24,835 --> 01:01:26,684 And numerical means just if I had 1205 01:01:26,684 --> 01:01:28,350 errors in measuring the right hand side, 1206 01:01:28,350 --> 01:01:30,360 how much would they translate into errors on the left hand 1207 01:01:30,360 --> 01:01:31,060 side. 1208 01:01:31,060 --> 01:01:33,976 So the error here is intrinsic to statistical questions. 1209 01:01:38,610 --> 01:01:42,490 So that's my estimator, provided that it exists. 1210 01:01:42,490 --> 01:01:45,040 And I said it's a one to one, so it should exist, 1211 01:01:45,040 --> 01:01:48,520 if I assume that psi is invertible. 1212 01:01:48,520 --> 01:01:51,627 So how good is this guy? 1213 01:01:51,627 --> 01:01:53,460 That's going to be definitely our question-- 1214 01:01:53,460 --> 01:01:54,860 how good is this thing. 1215 01:01:54,860 --> 01:02:00,560 And as I said, there's chances that if psi is really steep, 1216 01:02:00,560 --> 01:02:02,800 then it should be not very good-- 1217 01:02:02,800 --> 01:02:06,140 if psi inverse is very steep, it should not be very good, 1218 01:02:06,140 --> 01:02:07,740 which means that it's-- 1219 01:02:07,740 --> 01:02:11,480 well, let's just leave it to that. 1220 01:02:11,480 --> 01:02:13,010 So that means that I should probably 1221 01:02:13,010 --> 01:02:16,460 see the derivative of psi showing up somewhere. 1222 01:02:16,460 --> 01:02:19,626 If the derivative of psi inverse, say, is very large, 1223 01:02:19,626 --> 01:02:21,500 then I should actually have a larger variance 1224 01:02:21,500 --> 01:02:22,520 in my estimator. 1225 01:02:22,520 --> 01:02:26,900 So hopefully, just like we had a theorem that told us 1226 01:02:26,900 --> 01:02:29,390 that the Fisher information was key in the variance 1227 01:02:29,390 --> 01:02:30,890 of the maximum likelihood estimator, 1228 01:02:30,890 --> 01:02:32,473 we should have a theorem that tells us 1229 01:02:32,473 --> 01:02:33,920 that the derivative of psi inverse 1230 01:02:33,920 --> 01:02:37,313 is going to have a key role in the method of moments. 1231 01:02:37,313 --> 01:02:38,792 So let's do it. 1232 01:02:57,540 --> 01:03:01,950 So I'm going to talk to you about matrices. 1233 01:03:01,950 --> 01:03:02,680 So now, I have-- 1234 01:03:10,150 --> 01:03:15,080 So since I have to manipulate d numbers at any given time, 1235 01:03:15,080 --> 01:03:17,610 I'm just going to concatenate them into a vector. 1236 01:03:17,610 --> 01:03:19,670 So I'm going to call capital M theta-- 1237 01:03:19,670 --> 01:03:24,570 so that's basically the population moment. 1238 01:03:24,570 --> 01:03:31,320 And I have M hat, which is just m hat 1 to m hat d. 1239 01:03:31,320 --> 01:03:32,715 And that's my empirical moment. 1240 01:03:39,100 --> 01:03:41,170 And what's going to play a role is 1241 01:03:41,170 --> 01:03:45,370 what is the variance-covariance of the random vector. 1242 01:03:45,370 --> 01:03:49,680 So I have this vector 1-- 1243 01:03:49,680 --> 01:03:50,440 do I have 1? 1244 01:03:50,440 --> 01:03:51,865 No, I don't have 1. 1245 01:03:59,240 --> 01:04:02,480 So that's a d dimensional vector. 1246 01:04:02,480 --> 01:04:04,940 And here, I take the successive powers. 1247 01:04:04,940 --> 01:04:08,780 Remember, that looks very much like a column of my Vandermonde 1248 01:04:08,780 --> 01:04:10,590 matrix. 1249 01:04:10,590 --> 01:04:12,120 So now, I have this random vector. 
1250 01:04:12,120 --> 01:04:15,570 It's just the successive powers of some random variable X. 1251 01:04:15,570 --> 01:04:19,480 And the variance-covariance matrix is the expectation-- 1252 01:04:19,480 --> 01:04:20,130 so sigma-- 1253 01:04:20,130 --> 01:04:21,695 of theta. 1254 01:04:21,695 --> 01:04:23,820 The theta just means I'm going to take expectations 1255 01:04:23,820 --> 01:04:26,310 with respect to theta. 1256 01:04:26,310 --> 01:04:28,350 That's the expectation with respect 1257 01:04:28,350 --> 01:04:31,316 to theta of this guy times this guy 1258 01:04:31,316 --> 01:04:40,575 transpose minus the same thing but with the expectation 1259 01:04:40,575 --> 01:04:41,075 inside. 1260 01:04:45,270 --> 01:04:46,760 Why do I do X, X1. 1261 01:04:46,760 --> 01:04:48,070 I have X, X2, X3. 1262 01:04:50,720 --> 01:05:04,384 X, X2, Xd times the expectation of X, X2, Xd. 1263 01:05:04,384 --> 01:05:05,550 Everybody sees what this is? 1264 01:05:05,550 --> 01:05:11,790 So this is a matrix where if I look at the ij-th term of this 1265 01:05:11,790 --> 01:05:13,530 matrix-- 1266 01:05:13,530 --> 01:05:20,980 or let's say, jk-th term, so on row j and column k, 1267 01:05:20,980 --> 01:05:26,130 I have sigma jk of theta. 1268 01:05:26,130 --> 01:05:30,960 And it's simply the expectation of X to the j 1269 01:05:30,960 --> 01:05:40,541 plus k-- well, Xj Xk minus expectation of Xj expectation 1270 01:05:40,541 --> 01:05:41,040 of Xk. 1271 01:05:45,170 --> 01:05:53,970 So I can write this as m j plus k of theta minus mj of theta 1272 01:05:53,970 --> 01:05:55,080 times mk of theta. 1273 01:06:00,840 --> 01:06:04,400 So that's my covariance matrix of this particular vector 1274 01:06:04,400 --> 01:06:06,870 that I define. 1275 01:06:06,870 --> 01:06:09,240 And now, I'm going to assume that psi inverse-- 1276 01:06:09,240 --> 01:06:11,070 well, if I want to talk about the slope 1277 01:06:11,070 --> 01:06:14,060 in an analytic fashion, I have to assume 1278 01:06:14,060 --> 01:06:16,250 that psi is differentiable. 1279 01:06:16,250 --> 01:06:18,650 And I will talk about the gradient 1280 01:06:18,650 --> 01:06:20,500 of psi, which is, if it's one dimensional, 1281 01:06:20,500 --> 01:06:22,340 it's just the derivative. 1282 01:06:22,340 --> 01:06:24,470 And here, that's where notation becomes annoying. 1283 01:06:24,470 --> 01:06:26,011 And I'm going to actually just assume 1284 01:06:26,011 --> 01:06:28,310 that so now I have a vector. 1285 01:06:28,310 --> 01:06:30,590 But it's a vector of functions and I 1286 01:06:30,590 --> 01:06:32,840 want to compute those functions at a particular value. 1287 01:06:32,840 --> 01:06:34,506 And the value I'm actually interested in 1288 01:06:34,506 --> 01:06:37,010 is at the m of theta parameter. 1289 01:06:37,010 --> 01:06:41,600 So psi inverse goes from the set of moments 1290 01:06:41,600 --> 01:06:43,710 to the set of parameters. 1291 01:06:43,710 --> 01:06:45,680 So when I look at the gradient of this guy, 1292 01:06:45,680 --> 01:06:48,740 it should be a function that takes as inputs moments. 1293 01:06:48,740 --> 01:06:51,700 And where do I want this function to be evaluated at? 1294 01:06:51,700 --> 01:06:54,352 At the true moment-- 1295 01:06:54,352 --> 01:06:58,100 at the population moment vector. 1296 01:06:58,100 --> 01:07:00,860 Just like when I computed my Fisher information, 1297 01:07:00,860 --> 01:07:04,250 I was computing it at the true parameter. 
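In symbols, restating what was just set up: the entries of the covariance matrix are

$$\sigma_{jk}(\theta) = \mathbb{E}_\theta\big[X^{j+k}\big] - \mathbb{E}_\theta\big[X^{j}\big]\,\mathbb{E}_\theta\big[X^{k}\big] = m_{j+k}(\theta) - m_j(\theta)\,m_k(\theta), \qquad \Sigma(\theta) = \operatorname{Cov}_\theta\!\big((X, X^2, \dots, X^d)^\top\big),$$

and the gradient of psi inverse is evaluated at the population moment vector, $\nabla \psi^{-1}\big(M(\theta)\big)$, a d by d matrix.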
1298 01:07:04,250 --> 01:07:08,400 So now, once I compute this guy-- 1299 01:07:08,400 --> 01:07:11,176 so now, why is this a d by d gradient matrix? 1300 01:07:15,840 --> 01:07:19,920 So I have a gradient vector when I have a function from Rd to R. 1301 01:07:19,920 --> 01:07:22,160 This is the partial derivatives. 1302 01:07:22,160 --> 01:07:25,900 But now, I have a function from Rd to Rd. 1303 01:07:25,900 --> 01:07:28,210 So I have to take the derivative with respect 1304 01:07:28,210 --> 01:07:32,457 to the arrival coordinate and the departure coordinate. 1305 01:07:35,260 --> 01:07:39,140 And so that's the gradient matrix. 1306 01:07:39,140 --> 01:07:41,360 And now, I have the following properties. 1307 01:07:41,360 --> 01:07:46,270 The first one is that the law of large numbers 1308 01:07:46,270 --> 01:07:52,720 tells me that theta hat is a weakly or strongly consistent 1309 01:07:52,720 --> 01:07:54,332 estimator. 1310 01:07:54,332 --> 01:07:56,290 So either I use the strong law of large numbers 1311 01:07:56,290 --> 01:07:57,665 or the weak law of large numbers, 1312 01:07:57,665 --> 01:08:01,300 and I get strong or weak consistency. 1313 01:08:01,300 --> 01:08:02,870 So what does that mean? 1314 01:08:02,870 --> 01:08:03,640 Why is that true? 1315 01:08:03,640 --> 01:08:12,470 Well, because now I really have the function-- 1316 01:08:12,470 --> 01:08:13,930 so what is my estimator? 1317 01:08:13,930 --> 01:08:23,689 Theta hat is psi inverse of m hat 1 to m hat d. 1318 01:08:23,689 --> 01:08:26,630 Now, by the law of large numbers, 1319 01:08:26,630 --> 01:08:28,890 let's look only at the weak one. 1320 01:08:28,890 --> 01:08:35,600 Law of large numbers tells me that each of the mj hat 1321 01:08:35,600 --> 01:08:38,750 is going to converge in probability as n 1322 01:08:38,750 --> 01:08:40,970 goes to infinity to the-- so the empirical moments 1323 01:08:40,970 --> 01:08:44,950 converge to the population moments. 1324 01:08:44,950 --> 01:08:48,189 That's what the good old trick is using, 1325 01:08:48,189 --> 01:08:49,750 the fact that the empirical moments 1326 01:08:49,750 --> 01:08:52,760 are close to the true moments as n becomes larger. 1327 01:08:52,760 --> 01:08:55,390 And that's because, well, just because the m hat j's 1328 01:08:55,390 --> 01:08:57,160 are averages, and the law of large numbers 1329 01:08:57,160 --> 01:08:59,229 works for averages. 1330 01:08:59,229 --> 01:09:04,930 So now, on top of that, if I use the continuous mapping theorem-- 1331 01:09:04,930 --> 01:09:10,700 well, I have that psi inverse is continuously differentiable. 1332 01:09:10,700 --> 01:09:12,279 So it's definitely continuous. 1333 01:09:12,279 --> 01:09:16,510 And so what I have is that psi inverse of m hat 1334 01:09:16,510 --> 01:09:28,740 1 to m hat d converges to psi inverse of m1 to md, which 1335 01:09:28,740 --> 01:09:33,950 is equal to theta star. 1336 01:09:33,950 --> 01:09:35,060 So that's theta star. 1337 01:09:35,060 --> 01:09:37,910 By definition, we assumed that that was the unique one that 1338 01:09:37,910 --> 01:09:40,189 was actually doing this. 1339 01:09:40,189 --> 01:09:43,109 Again, this is a very strong assumption. 1340 01:09:43,109 --> 01:09:46,100 I mean, it's basically saying, if the method of moments works, 1341 01:09:46,100 --> 01:09:47,540 it works. 1342 01:09:47,540 --> 01:09:51,710 So the fact that psi is one to one 1343 01:09:51,710 --> 01:09:55,280 is really the key to making this guy work.
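Here is a minimal numerical sketch of that consistency argument (my own example; the Gaussian model is an arbitrary choice). For N(mu, sigma squared), psi maps (mu, sigma squared) to (m1, m2) = (mu, sigma squared plus mu squared), so psi inverse of (m1, m2) is (m1, m2 minus m1 squared), which is continuous.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, sigma2_star = 2.0, 3.0   # hypothetical true parameters

for n in (100, 10_000, 1_000_000):
    x = rng.normal(mu_star, np.sqrt(sigma2_star), size=n)
    m1_hat, m2_hat = x.mean(), (x ** 2).mean()   # empirical moments (LLN)
    theta_hat = (m1_hat, m2_hat - m1_hat ** 2)   # psi^{-1} of the empirical moments
    print(n, theta_hat)                          # approaches (2.0, 3.0) as n grows
```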
1344 01:09:55,280 --> 01:09:57,200 And then I also have a central limit theorem. 1345 01:09:57,200 --> 01:10:00,140 And the central limit theorem is basically 1346 01:10:00,140 --> 01:10:04,550 telling me that M hat is converging to M even 1347 01:10:04,550 --> 01:10:06,040 in the multivariate sense. 1348 01:10:06,040 --> 01:10:09,410 So if I look at the vector of M hat and the true vector of M, 1349 01:10:09,410 --> 01:10:11,870 then I actually make them go-- 1350 01:10:11,870 --> 01:10:14,570 I look at the difference for scale by square root of n. 1351 01:10:14,570 --> 01:10:15,951 It goes to some Gaussian. 1352 01:10:15,951 --> 01:10:18,200 And usually, we would see-- if it was one dimensional, 1353 01:10:18,200 --> 01:10:19,283 we would see the variance. 1354 01:10:19,283 --> 01:10:22,430 Then we see the variance-covariance matrix. 1355 01:10:22,430 --> 01:10:25,370 Who has never seen the-- well, nobody answers this question. 1356 01:10:25,370 --> 01:10:28,200 Who has already seen the multivariate central limit 1357 01:10:28,200 --> 01:10:28,700 theorem? 1358 01:10:31,339 --> 01:10:33,380 Who was never seen the multivariate central limit 1359 01:10:33,380 --> 01:10:35,410 theorem? 1360 01:10:35,410 --> 01:10:37,630 So the multivariate central limit theorem 1361 01:10:37,630 --> 01:10:41,860 is basically just the slight extension 1362 01:10:41,860 --> 01:10:43,630 of the univariate one. 1363 01:10:43,630 --> 01:10:48,160 It just says that if I want to think-- 1364 01:10:48,160 --> 01:10:51,020 so the univariate one would tell me something like this-- 1365 01:11:05,460 --> 01:11:06,270 and 0. 1366 01:11:06,270 --> 01:11:18,960 And then I would have basically the variance of X to the j-th. 1367 01:11:18,960 --> 01:11:22,240 So that's what the central limit theorem tells me. 1368 01:11:22,240 --> 01:11:23,350 This is an average. 1369 01:11:29,350 --> 01:11:31,150 So this is just for averages. 1370 01:11:31,150 --> 01:11:33,190 The central limit theorem tells me this. 1371 01:11:33,190 --> 01:11:36,560 Just think of X to the j-th as being y. 1372 01:11:36,560 --> 01:11:37,960 And that would be true. 1373 01:11:37,960 --> 01:11:40,092 Everybody agrees with me? 1374 01:11:40,092 --> 01:11:41,550 So now, this is actually telling me 1375 01:11:41,550 --> 01:11:45,610 what's happening for all these guys individually. 1376 01:11:45,610 --> 01:11:48,990 But what happens when those guys start to correlate together? 1377 01:11:48,990 --> 01:11:51,090 I'd like to know if they actually correlate 1378 01:11:51,090 --> 01:11:53,010 the same way asymptotically. 1379 01:11:53,010 --> 01:11:56,760 And so if I actually looked at the covariance matrix 1380 01:11:56,760 --> 01:11:57,465 of this vector-- 1381 01:12:03,440 --> 01:12:07,470 so now, I need to look at a matrix which is d by d-- 1382 01:12:07,470 --> 01:12:10,170 then would those univariate central limit theorems 1383 01:12:10,170 --> 01:12:12,896 tell me-- 1384 01:12:12,896 --> 01:12:16,890 so let me right like this, double bar. 1385 01:12:16,890 --> 01:12:19,560 So that's just the covariance matrix. 1386 01:12:19,560 --> 01:12:23,050 This notation, V double bar is the variance-covariance matrix. 1387 01:12:23,050 --> 01:12:26,010 So what this thing tells me-- so I know this thing 1388 01:12:26,010 --> 01:12:30,117 is a matrix, d by d. 1389 01:12:30,117 --> 01:12:31,950 Those univariate central limit theorems only 1390 01:12:31,950 --> 01:12:36,150 give me information about the diagonal terms. 
1391 01:12:36,150 --> 01:12:40,860 But here, I have no idea where the covariance matrix is. 1392 01:12:40,860 --> 01:12:46,020 This guy is telling me, for example, that this thing is 1393 01:12:46,020 --> 01:12:49,520 like variance of X to the j-th. 1394 01:12:49,520 --> 01:12:51,860 But what if I want to find off-diagonal elements 1395 01:12:51,860 --> 01:12:53,130 of this matrix? 1396 01:12:53,130 --> 01:12:55,130 Well, I need to use a multivariate central limit 1397 01:12:55,130 --> 01:12:56,150 theorem. 1398 01:12:56,150 --> 01:12:58,670 And really what it's telling me is that you can actually 1399 01:12:58,670 --> 01:13:00,200 replace this guy here-- 1400 01:13:10,450 --> 01:13:14,500 so that goes in distribution to some normal mean 0, again. 1401 01:13:14,500 --> 01:13:17,080 And now, what I have is just sigma 1402 01:13:17,080 --> 01:13:22,000 of theta, which is just the covariance matrix 1403 01:13:22,000 --> 01:13:26,696 of this vector X, X2, X3, X4, all the way to Xd. 1404 01:13:26,696 --> 01:13:27,514 And that's it. 1405 01:13:27,514 --> 01:13:28,930 So that's a multivariate Gaussian. 1406 01:13:28,930 --> 01:13:33,040 Who has never seen a multivariate Gaussian? 1407 01:13:33,040 --> 01:13:35,974 Please, just go on Wikipedia or something. 1408 01:13:35,974 --> 01:13:37,390 There's not much to know about it. 1409 01:13:37,390 --> 01:13:40,270 But I don't have time to redo probability here. 1410 01:13:40,270 --> 01:13:43,350 So we're going to have to live with it. 1411 01:13:43,350 --> 01:13:46,230 Now, to be fair, if your goal is not 1412 01:13:46,230 --> 01:13:48,970 to become a statistical savant, we 1413 01:13:48,970 --> 01:13:52,490 will stick to univariate questions 1414 01:13:52,490 --> 01:14:01,260 in the scope of homework and exams. 1415 01:14:01,260 --> 01:14:09,900 So now, what was the delta method telling me? 1416 01:14:09,900 --> 01:14:13,440 It was telling me that if I had a central limit theorem that 1417 01:14:13,440 --> 01:14:16,112 told me that theta hat was going to theta, 1418 01:14:16,112 --> 01:14:17,820 or square root of n theta hat minus theta 1419 01:14:17,820 --> 01:14:19,530 was going to some Gaussian, then I 1420 01:14:19,530 --> 01:14:23,220 could look at square root of Mg of theta hat minus g of theta. 1421 01:14:23,220 --> 01:14:25,110 And this thing was also going to a Gaussian. 1422 01:14:25,110 --> 01:14:27,030 But what it had to be is the square 1423 01:14:27,030 --> 01:14:32,700 of the derivative of g in the variance. 1424 01:14:32,700 --> 01:14:35,190 So the delta method, it was just a way 1425 01:14:35,190 --> 01:14:38,280 to go from square root of n theta 1426 01:14:38,280 --> 01:14:46,810 hat n minus theta goes to some N, say 0, sigma squared, to-- 1427 01:14:46,810 --> 01:14:50,410 so delta method was telling me that this was square root 1428 01:14:50,410 --> 01:14:56,030 Ng of theta hat N minus g of theta 1429 01:14:56,030 --> 01:15:01,770 was going in distribution to N0 sigma squared 1430 01:15:01,770 --> 01:15:04,200 g prime squared of theta. 1431 01:15:07,210 --> 01:15:09,130 That was the delta method. 1432 01:15:09,130 --> 01:15:12,700 Now, here, we have a function of those guys. 1433 01:15:12,700 --> 01:15:15,580 The central limit theorem, even the multivariate one, 1434 01:15:15,580 --> 01:15:20,180 is only guaranteeing something for me regarding the moments. 1435 01:15:20,180 --> 01:15:23,350 But now, I need to map the moments back into some theta, 1436 01:15:23,350 --> 01:15:26,230 so I have a function of the moments. 
1437 01:15:26,230 --> 01:15:31,950 And there is something called the multivariate delta 1438 01:15:31,950 --> 01:15:35,310 method, where derivatives are replaced by gradients. 1439 01:15:35,310 --> 01:15:39,310 Like, they always are in multivariate calculus. 1440 01:15:39,310 --> 01:15:43,080 And rather than multiplying, since things do not commute, 1441 01:15:43,080 --> 01:15:46,557 rather than choosing which side I want to put the square, 1442 01:15:46,557 --> 01:15:49,140 I'm actually just going to take half of the square on one side 1443 01:15:49,140 --> 01:15:51,810 and the other half of the square on the other side. 1444 01:15:51,810 --> 01:15:53,790 So the way you should view this, you 1445 01:15:53,790 --> 01:15:59,160 should think of sigma squared times g prime squared 1446 01:15:59,160 --> 01:16:02,490 as being g prime of theta times sigma 1447 01:16:02,490 --> 01:16:06,040 squared times g prime of theta. 1448 01:16:06,040 --> 01:16:08,640 And now, this is completely symmetric. 1449 01:16:08,640 --> 01:16:14,850 And the multivariate delta method 1450 01:16:14,850 --> 01:16:20,010 is basically telling you that you get the gradient here. 1451 01:16:20,010 --> 01:16:21,480 So you start from something that's 1452 01:16:21,480 --> 01:16:24,100 like that over there, a sigma-- 1453 01:16:24,100 --> 01:16:26,280 so that's my sigma squared, think of sigma squared. 1454 01:16:26,280 --> 01:16:29,040 And then I premultiply by the gradient and postmultiply 1455 01:16:29,040 --> 01:16:30,514 by the gradient. 1456 01:16:30,514 --> 01:16:31,680 The first one is transposed. 1457 01:16:31,680 --> 01:16:33,620 The second one is not. 1458 01:16:33,620 --> 01:16:36,140 But that's a very straightforward extension. 1459 01:16:36,140 --> 01:16:37,890 You don't even have to understand it. 1460 01:16:37,890 --> 01:16:41,780 Just think of what would be the natural generalization. 1461 01:16:41,780 --> 01:16:44,450 Here, by the way, I wrote explicitly 1462 01:16:44,450 --> 01:16:48,020 what the gradient of a multivariate function is. 1463 01:16:48,020 --> 01:16:53,930 So that's a function that goes from Rd to Rk. 1464 01:16:53,930 --> 01:16:56,050 So now, the gradient is a d by k matrix. 1465 01:16:58,920 --> 01:17:00,504 And so now, for this guy, we can do it 1466 01:17:00,504 --> 01:17:01,586 for the method of moments. 1467 01:17:01,586 --> 01:17:03,140 And we can see that basically we're 1468 01:17:03,140 --> 01:17:04,765 going to have this scaling that depends 1469 01:17:04,765 --> 01:17:08,300 on the gradient of the reciprocal of psi, which 1470 01:17:08,300 --> 01:17:08,810 is normal. 1471 01:17:08,810 --> 01:17:13,137 Because if psi is super steep, if psi inverse is super steep, 1472 01:17:13,137 --> 01:17:14,720 then the gradient is going to be huge, 1473 01:17:14,720 --> 01:17:17,120 which is going to translate into having a huge variance 1474 01:17:17,120 --> 01:17:18,203 for the method of moments. 1475 01:17:21,180 --> 01:17:24,127 So this is actually the end. 1476 01:17:24,127 --> 01:17:26,460 I would like to encourage you-- and we'll probably do it 1477 01:17:26,460 --> 01:17:27,550 on Thursday just to start. 1478 01:17:27,550 --> 01:17:30,480 But I encourage you to do it in one dimension, 1479 01:17:30,480 --> 01:17:35,070 so that you know how to use the method of moments, 1480 01:17:35,070 --> 01:17:37,140 you know how to do a bunch of things. 1481 01:17:37,140 --> 01:17:40,470 Do it in one dimension and see how you can check those things.
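One possible one-dimensional version of that exercise (the exponential model is my choice, not the lecture's): for $X_1, \dots, X_n$ i.i.d. $\mathrm{Exp}(\lambda)$, the first moment is $m_1(\lambda) = 1/\lambda$, so $\psi^{-1}(m) = 1/m$ and $\hat\lambda^{MM} = 1/\bar X_n$. The CLT gives $\sqrt{n}\,(\bar X_n - 1/\lambda) \to \mathcal N(0, 1/\lambda^2)$, and the delta method with $g(m) = 1/m$, $g'(1/\lambda) = -\lambda^2$ yields

$$\sqrt{n}\,\big(\hat\lambda^{MM} - \lambda\big) \xrightarrow{(d)} \mathcal N\!\big(0,\ \lambda^4 \cdot \tfrac{1}{\lambda^2}\big) = \mathcal N(0, \lambda^2).$$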
1482 01:17:40,470 --> 01:17:43,860 So just as a quick comparison, in terms of the quadratic risk, 1483 01:17:43,860 --> 01:17:46,050 the maximum likelihood estimator is typically 1484 01:17:46,050 --> 01:17:50,024 more accurate than the method of moments. 1485 01:17:50,024 --> 01:17:51,440 What is pretty good to do, when 1486 01:17:51,440 --> 01:17:54,530 you have a non-concave likelihood 1487 01:17:54,530 --> 01:17:56,060 function, is 1488 01:17:56,060 --> 01:17:58,980 to start with the method of moments as an initialization 1489 01:17:58,980 --> 01:18:01,680 and then run some algorithm that optimizes locally 1490 01:18:01,680 --> 01:18:03,710 the likelihood starting from this point, 1491 01:18:03,710 --> 01:18:05,985 because it's actually likely to be closer. 1492 01:18:05,985 --> 01:18:07,610 And then the MLE is going to improve it 1493 01:18:07,610 --> 01:18:12,010 a little bit by pushing the likelihood a little better. 1494 01:18:12,010 --> 01:18:13,840 So of course, the maximum likelihood 1495 01:18:13,840 --> 01:18:14,890 is sometimes intractable. 1496 01:18:14,890 --> 01:18:18,440 Whereas computing moments is fairly doable. 1497 01:18:18,440 --> 01:18:20,262 If the likelihood is concave, as I said, 1498 01:18:20,262 --> 01:18:21,720 we can use optimization algorithms, 1499 01:18:21,720 --> 01:18:24,020 such as interior-point methods or gradient descent, 1500 01:18:24,020 --> 01:18:25,154 I guess, to maximize it. 1501 01:18:25,154 --> 01:18:26,695 And if the likelihood is non-concave, 1502 01:18:26,695 --> 01:18:28,240 we only have local heuristics. 1503 01:18:28,240 --> 01:18:29,920 And that's what I meant-- 1504 01:18:29,920 --> 01:18:31,440 you have only local maxima. 1505 01:18:31,440 --> 01:18:32,860 And one trick you can do-- 1506 01:18:32,860 --> 01:18:37,880 so your likelihood looks like this, 1507 01:18:37,880 --> 01:18:42,140 and it might be the case that if you have a lot of those peaks, 1508 01:18:42,140 --> 01:18:44,810 you basically have to start your algorithm in each 1509 01:18:44,810 --> 01:18:46,270 of those peaks. 1510 01:18:46,270 --> 01:18:48,530 But the method of moments can actually 1511 01:18:48,530 --> 01:18:50,510 start you in the right peak, and then you 1512 01:18:50,510 --> 01:18:53,300 just move up by doing some local algorithm 1513 01:18:53,300 --> 01:18:55,040 for maximum likelihood. 1514 01:18:55,040 --> 01:18:56,180 So that's not key. 1515 01:18:56,180 --> 01:18:59,330 But that's just if you want to think about algorithmically 1516 01:18:59,330 --> 01:19:03,470 how I would end up doing this and how I can combine the two. 1517 01:19:03,470 --> 01:19:04,970 So I'll see you on Thursday. 1518 01:19:04,970 --> 01:19:06,820 Thank you.
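A rough sketch of that initialize-with-moments strategy, for reference (everything here is my own illustration, not from the lecture; the Gamma model and the scipy calls are just one way to set it up):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Gamma with unknown shape alpha and scale fixed to 1: E[X] = alpha,
# so the method-of-moments estimate is simply the sample mean.
rng = np.random.default_rng(1)
x = gamma.rvs(a=4.0, size=500, random_state=rng)

alpha_mm = x.mean()                       # method-of-moments initializer

def neg_log_lik(a):
    # negative log-likelihood of the sample under Gamma(shape=a, scale=1)
    return -gamma.logpdf(x, a=a).sum()

# Local optimization of the likelihood, started at the moment estimate.
res = minimize(neg_log_lik, x0=alpha_mm, bounds=[(1e-6, None)])
print(alpha_mm, res.x[0])                 # moment start, locally refined MLE
```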