The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So I'm using a few things here, right? I'm using the fact that KL is non-negative, but KL is equal to 0 when I take twice the same argument. So I know that this function is always non-negative. So that's theta, and that's KL(P theta star, P theta). And I know that at theta star, it's equal to 0. OK? I could be in the case where I have this happening -- I have two, let's call it theta star prime. I have two minimizers. That could be the case, right? I'm not saying that -- so KL is 0 at the minimum. That doesn't mean that I have a unique minimum, right? But it does, actually. What do I need to use to make sure that I have only one minimum?

So the definiteness is guaranteeing to me that there's a unique P theta star that minimizes it. But then I need to make sure that there's a unique -- from this unique P theta star, I need to make sure there's a unique theta star that defines this P theta star.

Exactly. All right, so I combine definiteness and identifiability to make sure that there is a unique minimizer, so that this picture with two minimizers cannot exist. OK, so basically, let me write what I just said. Definiteness implies that P theta star is the unique minimizer of P theta maps to KL(P theta star, P theta). So definiteness only guarantees that the probability distribution is uniquely identified. And identifiability implies that theta star is the unique minimizer of theta maps to KL(P theta star, P theta), OK? So I'm basically doing the composition of two injective functions. The first one is the one that maps, say, theta to P theta. And the second one is the one that maps P theta to the set of minimizers, OK?
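In symbols, the argument being sketched on the board can be summarized as follows (a hedged reconstruction for the reader, not a verbatim slide):

```latex
% Reconstruction of the board argument, not verbatim from the lecture.
\[
\mathrm{KL}(P_{\theta^*}, P_\theta) \ge 0,
\qquad
\mathrm{KL}(P_{\theta^*}, P_\theta) = 0
\iff P_\theta = P_{\theta^*}
\quad \text{(definiteness)},
\]
\[
P_\theta = P_{\theta^*} \iff \theta = \theta^*
\quad \text{(identifiability)},
\]
\[
\text{so } \theta^* \text{ is the unique minimizer of }
\theta \mapsto \mathrm{KL}(P_{\theta^*}, P_\theta).
\]
```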
So at least morally, you should agree that theta star is the minimizer of this thing. Whether it's unique or not, you should agree that it's a good one. So maybe you can think a little longer about this. So thinking about this being the minimizer, then it says, well, if I actually had a good estimate of this function, I would use the strategy that I described for the total variation, which is: well, I don't know what this function looks like -- it depends on theta star. But maybe I can find an estimator of this function that fluctuates around this function, such that when I minimize this estimator of the function, I'm actually not too far, OK? And this is exactly what drives me to do this, because I can actually construct an estimator -- an estimator of the KL that is actually close to the KL, all right? So I define KL hat. All we did is just replace the expectation with respect to theta star by an average. That's what we did. So if you're a little puzzled by this arrow, that's all it says: replace this guy by this guy. It has no mathematical meaning; it just means "replace it by". And now that actually tells me how to get my estimator. It just says, well, my estimator, KL hat, is equal to some constant which I don't know -- I mean, it certainly depends on theta star, but I won't care about it when I'm trying to minimize -- minus 1/n times the sum for i from 1 to n of log f theta of xi. So here I'm writing it with the density; you have it with the PMF on the slides, so you have the two versions in front of you, OK? Oh sorry, I forgot the xi. Now clearly, this function I know how to compute. If you give me a theta, since I know the form of the density f theta, for each theta that you give me, I can actually compute this quantity, right? This constant I don't know, but I don't care, because I'm just shifting the value of the function I'm trying to minimize. The set of minimizers is not going to change.
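A minimal numerical sketch of this replace-the-expectation-by-an-average step (not from the lecture; the Gaussian N(theta, 1) model and all names here are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_star = 2.0
x = rng.normal(theta_star, 1.0, size=1000)   # a sample from P_theta*

def kl_hat_up_to_constant(theta, x):
    # KL hat = (unknown constant) - (1/n) sum_i log f_theta(x_i);
    # the constant depends only on theta*, so we drop it -- it does
    # not move the minimizer.
    return -np.mean(norm.logpdf(x, loc=theta, scale=1.0))

print(kl_hat_up_to_constant(2.0, x))   # small: theta near theta*
print(kl_hat_up_to_constant(5.0, x))   # much larger: theta far from theta*
```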
So now, this is my estimation strategy: minimize in theta KL hat(P theta star, P theta), OK? So now let's just make sure that we all agree -- so what we want is the argument of the minimum, right? arg min means the theta that minimizes this guy, rather than the value of the min. OK, so I'm trying to find the arg min of this thing. Well, this is equivalent to finding the arg min of, say, a constant minus 1/n times the sum for i from 1 to n of log f theta of xi. So that's just -- I don't think it likes me. No. OK, so that's minimizing this average, right? I just plugged in the definition of KL hat. Now, I claim that taking the arg min of a constant plus a function, or the arg min of the function, is the same thing. Is anybody not comfortable with this idea? OK, so this is the same.

By the way, I should probably switch to the next slide, because I'm writing the same thing, but better -- and it's with PMFs rather than PDFs. OK, now, the arg min of the negative of a thing is the same as the arg max without the negative, right? So this is the arg max over theta of 1/n times the sum for i equal 1 to n of log f theta of xi. Taking the arg max of the average or the arg max of the sum, again, is not going to make much difference. Adding constants or multiplying by constants does not change the arg min or the arg max. Now, I have a sum of logs, which is the log of the product. OK? It's the arg max of the log of f theta of x1 times f theta of x2, all the way to f theta of xn. But the log is a function that's increasing, so maximizing the log of a function or maximizing the function itself is the same thing. The value is going to change, but the arg max is not going to change. Everybody agrees with this? So this is equivalent to the arg max over theta of the product from i equal 1 to n of f theta of xi. And that's because x maps to log x is increasing.
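A quick numerical check of these invariances (not from the lecture; same illustrative Gaussian setup as before), showing that the constant shift, the 1/n rescaling, and the log all leave the arg max untouched:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=500)
grid = np.linspace(0, 4, 2001)

log_lik = np.array([norm.logpdf(x, loc=t).sum() for t in grid])
lik = np.exp(log_lik - log_lik.max())   # the product, rescaled by a positive
                                        # constant to avoid numerical underflow

# All three maximizers land on the same grid point:
print(grid[np.argmax(log_lik)])            # arg max of the sum of logs
print(grid[np.argmax(log_lik / len(x))])   # arg max of the average
print(grid[np.argmax(lik)])                # arg max of the (rescaled) product
```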
So now I've gone from minimizing the KL, to minimizing the estimate of the KL, to maximizing this product. Well, this chapter is called maximum likelihood estimation. The maximum comes from the fact that our original idea was to minimize the negative of a function. So that's why it's maximum likelihood. And this function here is called the likelihood. This function is really just telling me -- they call it likelihood because it's some measure of how likely it is that theta was the parameter that generated the data.

OK, so let's go to the -- well, we'll go to the formal definition in a second. But actually, let me just give you intuition as to why this is the distribution of the data -- why this is the likelihood. Sorry, why this makes sense as a measure of likelihood. Let's now think for simplicity of the following model. So I have -- I'm on the real line, and I look at N(theta, 1) for theta in the real line -- do you see that? OK. Probably you don't. Not that you care. OK, so -- OK, let's look at a simple example.

So here's the model. As I said, we're looking at observations on the real line, and they're distributed according to some N(theta, 1). So I don't care about the variance -- I know it's 1. And it's indexed by theta in the real line. OK, so the only thing I need to figure out is: what is the mean of those guys, OK? Now, I have these n observations. And if you remember from your probability class -- are you familiar with the concept of joint density? You have multivariate observations. The joint density of independent random variables is just the product of their individual densities. So really, when I look at the product from i equal 1 to n of f theta of xi, this is really the joint density of the vector -- well, let me not use the word vector -- of x1, ..., xn, OK? So if I take the product of densities, it is still a density, but this time on R^n.
And so now what this thing is telling me -- so think of it in R^2, right? So this is the joint density of two Gaussians. So it's something that looks like some bell-shaped curve in two dimensions, and it's centered at the value (theta, theta). OK, they both have mean theta. So let's assume for one second -- it's going to be hard for me to make pictures in n dimensions. Actually, already in two dimensions, I can promise you that it's not very easy. So I'm actually just going to assume that n is equal to 1 for the sake of illustration.

OK, so now I have this data. And now I have one observation, OK? And I know that f theta looks like this. And what I'm doing is I'm actually looking at the value of f theta at my observation. Let's call it x1. Now, my principle tells me: just find the theta that makes this guy the most likely. What is the likelihood of my x1? Well, it's just the value of the function -- this value here. And if I wanted to find the most likely theta to have generated this x1, what I would need to do is shift this thing and put it here. And so my estimate, my maximum likelihood estimator here, would be theta hat equal to x1, OK? That would be just the observation. Because if I have only one observation, what else am I going to do? OK, and so it sort of makes sense.

And if you have more observations, you can think of it this way. So now I have, say, k observations, or n observations. And what I do is I look at the value for each of these guys -- this value, this value, this value, this value. I take their product, and I make this thing large. OK, why do I take the product? Well, because I'm trying to maximize their values all together, and I need to turn them into one number that I can maximize.
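Before going on, a minimal numerical sketch of exactly this picture (not from the lecture; the data values are hypothetical): slide the center of the bell curve and keep the position where the product of density values at the data points is largest.

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.8, 2.4, 2.1, 1.6, 2.6])   # hypothetical observations

grid = np.linspace(0, 4, 4001)
log_lik = np.array([norm.logpdf(x, loc=t).sum() for t in grid])

# The best center agrees with the sample mean.
print(grid[np.argmax(log_lik)], x.mean())   # 2.1 2.1
```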
And taking the product is the natural way of doing it -- whether you motivate it by the KL principle or by maximizing the joint density -- rather than just maximizing anything. OK, so that's why, visually, this is the maximum likelihood. It just says that if my observations are here, then this guy, this mean theta, is more likely than that guy. Because if I look at the value of the function for this guy -- if I look at theta being this thing -- then this is a very small value. Very small value, very small value, very small value. Everything gets a super small value, right? That's just the value that it gets in the tail here, which is very close to 0. But as soon as I start covering all my points with my bell-shaped curve, then all the values go up.

All right, so I just want to take a short break from statistics, because the maximum likelihood principle involves maximizing a function. So I just want to make sure that we're all on par about how we maximize functions. In most instances, it's going to be a one-dimensional function, because theta is going to be a one-dimensional parameter -- like here, it's the real line. So it's going to be easy. In some cases, it may be a multivariate function, and it might be more complicated. OK, so let's just make this interlude. So the first thing I want you to notice is that if you open any book on what's called optimization -- which basically is the science of optimizing functions -- it will, in I'd say 99.9% of cases, talk about minimizing functions. But it doesn't matter, because you can just flip the function: you put a minus sign, and minimizing h is the same as maximizing minus h, and the opposite, OK? So for this class, since we're only going to talk about maximum likelihood estimation, we will talk about maximizing functions. But don't be lost if you suddenly decide to open a book on optimization and find only material about minimizing functions.
OK, so maximizing an arbitrary function can actually be fairly difficult. If I give you a function that has some weird shape -- let's think of this polynomial, for example -- and I want to find the maximum, how would we do it? So what is the thing you've learned in calculus about how to maximize a function? Set the derivative equal to 0. Maybe you want to check the second derivative to make sure it's a maximum and not a minimum. But the thing is, this only guarantees that you have a local one, right? So if I do it for this function, for example, then this guy satisfies the criterion, this guy satisfies it, this guy, this guy here -- and this guy satisfies the first criterion but not the second-derivative one. So I have a lot of candidates. And if my function can be really anything, it's going to be difficult, whether analytically, by taking derivatives and setting them to 0, or by trying to find some algorithm to do this. Because if my function is very jittery, then my algorithm basically has to check all candidates. And if there are a lot of them, it might take forever, OK? So here I have only one, two, three, four, five candidates to check. But in practice, you might have a million of them to check, and that might take forever.

OK, so what's nice about statistical models -- and one of the things that makes all these models particularly robust, and why we still talk about them 100 years after they were introduced -- is that the functions, the likelihoods, that they lead us to maximize are actually very simple. And they all share a nice property, which is that of being concave. All right, so what is a concave function? Well, by definition, it's just a function for which -- let's think of it as being twice differentiable. You can define functions that are not differentiable as being concave, but let's think about it as having a second derivative.
And so if you look at a function that has a second derivative, the concave ones are the functions whose second derivative is negative everywhere. Not just at the maximum -- everywhere, OK? And if it's strictly concave, the second derivative is actually strictly less than zero. In particular, if I think of a linear function, y equals x, then this function has a second derivative which is equal to zero, OK? So it is concave, but it's not strictly concave, OK? If I look at the function negative x squared, what is its second derivative? Minus 2. So it's strictly negative everywhere, OK? So actually, this is a pretty canonical example of a strictly concave function. If you want to picture a strictly concave function, think of negative x squared -- a parabola pointing downwards.

OK, so we can also talk about strictly convex functions. Convex is just what happens when the negative of the function is concave. That translates into having a second derivative which is either non-negative or positive, depending on whether you're talking about convexity or strict convexity. But again, convex functions are convenient when you're trying to minimize something. And since we're trying to maximize a function, we're looking for concave.

So here are some examples. Let's just go through them quickly. OK, so the first one is -- so here I made my life a little uneasy by talking about the functions in theta, right? I'm talking about likelihoods, so I'm thinking of functions where the parameter is theta. So I have h of theta. And if I start with negative theta squared, then as we said, h prime prime of theta, the second derivative, is minus 2, which is strictly negative, so this function is strictly concave. OK, another function is h of theta equal to -- what did we pick -- square root of theta. What is the first derivative?
1 over 2 square root of theta. What is the second derivative? So that's theta to the negative 1/2, so I'm just picking up another negative 1/2, so I get negative 1/4, and then I get theta to the 3/4 downstairs -- sorry, 3/2. And that's strictly negative for theta, say, larger than 0. And I really need to have this thing larger than 0 so that it's well-defined; but strictly larger than 0 is so that this thing does not blow up to infinity. And it's true: if you think about this function, it looks like this, and already the first derivative goes to infinity at 0. And it's a concave function, OK?

Another one is the log, of course. What is the derivative of the log? That's 1 over theta -- h prime of theta is 1 over theta. And the second derivative is negative 1 over theta squared, which, again, is negative if theta is strictly positive. Here I don't strictly need theta positive for this expression, but I need it for the log to be defined.

And sine. OK, so let's just do one more. So h of theta is sine of theta. But here I take it only on an interval, because you want to think of this function as pointing always downwards. And in particular, you don't want this function to have an inflection point. You don't want it to go down and then up and then down and then up, because that is not concave. And sine is certainly going up and down, right? So what we do is we restrict it to an interval where sine is actually -- so what does the sine function look like? At 0 it's 0, and it's going up. Where is the first maximum of the sine?

STUDENT: [INAUDIBLE]

PROFESSOR: I'm sorry?

STUDENT: Pi over 2.

PROFESSOR: Pi over 2, where it takes value 1. And then it goes down again, and that's until pi. And then it keeps going down, but here you see I actually start changing my inflection. So what we do is we stop it at pi.
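A quick numerical check of these second-derivative signs (not from the lecture; a finite-difference sketch covering all four examples, sine included, on their stated domains):

```python
import numpy as np

def second_derivative(h, t, eps=1e-4):
    # central finite-difference approximation of h''(t)
    return (h(t + eps) - 2 * h(t) + h(t - eps)) / eps**2

grid = np.linspace(0.1, np.pi - 0.1, 50)   # inside (0, pi), away from 0
for name, h in [("-theta^2", lambda t: -t**2),
                ("sqrt",     np.sqrt),
                ("log",      np.log),
                ("sin",      np.sin)]:
    print(name, np.all(second_derivative(h, grid) < 0))   # all True
```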
And if we look at this function, it certainly looks like a parabola pointing downwards. And you can check that it actually works with the derivatives. So the derivative of sine is cosine, and the derivative of cosine is negative sine. OK, and sine between 0 and pi is positive, so this entire thing is going to be negative. OK? And you know, I can come up with a lot of examples, but let's just stick to those. There's the linear function, of course. An affine function is going to be concave, but it's actually going to be convex as well, which means it's certainly not going to be strictly concave or strictly convex, OK?

So here's your standard picture. And here, if you look at the dotted line, what it tells me -- and this is the property we're going to be using -- is that if a strictly concave function has a maximum, which is not always the case, but if it has a local maximum, then it must be a global maximum. OK, so just the fact that it goes up and down, and not up again, means that only a global maximum can exist. Now if you looked, for example, at the square root function on the entire positive real line, then this thing is never going to attain a maximum; it just goes to infinity as x goes to infinity. So if I wanted to find the maximum, I would have to stop somewhere and say that the maximum is attained at the right-hand side. OK, so that's the beauty of convex functions, or concave functions: essentially, these functions are easy to maximize. And if I tell you a function is concave, you take the first derivative, set it equal to 0, and if you find a point that satisfies this, then it must be a global maximum, OK?

STUDENT: What if your set theta was [INAUDIBLE] then couldn't you have a function that, by the definition, is concave, with two upside-down parabolas on two disjoint intervals, but yet it has two global maximums?
PROFESSOR: So you won't get them -- so you want the function to be concave on what? On the convex hull of the intervals? Or you want it to be --

STUDENT: [INAUDIBLE] just said that any subset.

PROFESSOR: OK, OK, you're right. So maybe the definition -- so you're pointing to a weakness in the definition. Let's just say that theta is a convex set, and then you're good, OK? So you're right. Since I actually just said that this is true only for theta, I could just take pieces of concave functions, right? I can do this, and then on the next one I can do this, and on the next one I can do this, and then I would have a bunch of them. But what I want is to think of it as a global function on some convex set. You're right. So think of theta as being convex -- for this guy, an interval, since it's the real line.

OK, so as I said, more generally -- we can actually define concave functions more generally, in higher dimensions. And that will be useful if theta is not just one parameter but several parameters. And for that, you need to remind yourself of Calculus II. You have a generalization of the notion of derivative, which is called the gradient, which is basically a vector where each coordinate is the partial derivative with respect to each coordinate of theta. And the Hessian is a matrix, which is essentially a generalization of the second derivative. I denote it by nabla squared, but you can write it the way you want. And this matrix has as its ij-th entry the second partial derivative of h with respect to theta i and theta j. Who has never seen that? OK. So now, being concave here generalizes in essentially the same way. Saying that a vector is equal to zero -- well, that's just setting the vector -- sorry. The first-order condition to say that it's a maximum is going to be the same.
Saying that a function has a gradient equal to zero is the same as saying that each of its coordinates is equal to zero. And that's actually going to be the condition for a global maximum here. So to check convexity, we need to see that a matrix itself is negative -- sorry, to check concavity, we need to check that a matrix is negative. And there is a notion among matrices for comparing a matrix to zero, and that's exactly this notion: you pre- and post-multiply by the same x. That works for symmetric matrices, which is the case here. And so you pre-multiply by x, post-multiply by the same x. So you have your matrix, your Hessian here -- it's a d by d matrix if you have a d-dimensional parameter. So let's call it -- OK. And then here I pre-multiply by x transpose, and I post-multiply by x. And this has to be non-positive if I want the function to be concave, and strictly negative if I want it to be strictly concave.

OK, that's just the generalization. You can check for yourself that this is the same thing: if I were in dimension 1, this would be the same thing. Why? Because in dimension 1, pre- and post-multiplying by x is the same as multiplying by x squared -- in dimension 1, I can just move my x's around, right? And so the first condition would mean, in dimension 1, that the second derivative times x squared has to be less than or equal to zero. And here I need this for all x's that are not zero, because I could take x to be zero and make this equal to zero, right? So this is for x's that are not equal to zero, OK?

And so, some examples. Just look at this function. So now I have functions that depend on two parameters, theta1 and theta2. So the first one is -- so if I take theta to be in -- now I need two parameters -- R squared, and I look at the function h of theta. Can somebody tell me what h of theta is?

STUDENT: [INAUDIBLE]

PROFESSOR: Minus 2 theta2 squared?
OK, so let's compute the gradient of h of theta. So it's going to be something that has two coordinates. To get the first coordinate, what do I do? Well, I take the derivative with respect to theta1, thinking of theta2 as a constant. So this term is going to go away, and I get negative 2 theta1. And when I take the derivative with respect to the second coordinate, thinking of the first part as a constant, I get minus 4 theta2. Is that clear for everyone? That's just the definition of partial derivatives.

And then if I want to do the Hessian, now I'm going to get a 2 by 2 matrix. The first entry here I get by taking the derivative of this guy with respect to theta1. So that's easy -- that's just minus 2. This entry I get by taking the derivative of this guy with respect to theta2. So I get what? Zero -- I treat this guy as a constant. This entry is also going to be zero, because I take the derivative of this guy with respect to theta1. And then I take the derivative of this guy with respect to theta2, so I get minus 4.

OK, so now I want to check that this matrix satisfies: x transpose times this matrix times x is negative. So what I do is -- so what is this quadratic form? If I do x transpose nabla squared h of theta x, what I get is minus 2 x1 squared minus 4 x2 squared. Because this matrix is diagonal, all it does is weight the squares of the x's. So this term is definitely negative; this term is negative. And actually, if one of the two is non-zero, which means that x is non-zero, then this thing is actually strictly negative. So this function is actually strictly concave. And it looks like a parabola that's slightly distorted in one direction.

So, well, I know this might have been some time ago. Maybe for some of you it might have been since high school.
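The same check in code (not from the lecture): a symmetric matrix H satisfies x transpose H x < 0 for all non-zero x exactly when all its eigenvalues are strictly negative, which gives a mechanical test.

```python
import numpy as np

H = np.array([[-2.0,  0.0],
              [ 0.0, -4.0]])   # the Hessian just computed (constant in theta)

eigenvalues = np.linalg.eigvalsh(H)   # eigvalsh is for symmetric matrices
print(eigenvalues)                     # [-4. -2.]
print(np.all(eigenvalues < 0))         # True, so strictly concave

# Spot check of the quadratic form at a particular x
x = np.array([1.0, 2.0])
print(x @ H @ x)                       # -2*1^2 - 4*2^2 = -18.0
```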
So just remind yourself about second derivatives and Hessians and things like this. Here's another one as an exercise: h is minus (theta1 minus theta2) squared. This one is actually not going to be diagonal -- the Hessian is not going to be diagonal. Who would like to do this now in class? OK, thank you. This is not a calculus class, so you can just do it as a calculus exercise. And you can do it for the log as well.

Now, there is a nice recipe for concavity that works for the second one and the third one. The thing is, if you look at those particular functions, what I'm doing is, first of all, taking a linear combination of my arguments, and then taking a concave function of this guy. And this is always going to work -- this is always going to give me a concave function. So the computations that I just made, I actually never made them when I prepared those slides, because I don't have to. I know that if I take a linear combination of those things and then take a concave function of this guy, I'm always going to get a concave function. OK, so that's an easy way to check this, or at least a sanity check.

All right, and so as I said, finding maximizers of a concave or strictly concave function is the same as it was in the one-dimensional case. In the one-dimensional case, we just agreed that we take the derivative and set it to zero; in the higher-dimensional case, we take the gradient and set it equal to zero. Again, that's calculus, all right? So this is going to give me equations, right? The first one is an equation in theta. The second one is an equation in theta1, theta2, theta3, all the way to theta d. And it doesn't mean that because I can write this equation, I can actually solve it. This equation might be super nasty -- it might be some polynomial with exponentials and logs equal to zero, or some crazy thing.
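For the record, a symbolic sketch of the exercise just posed (not worked in the lecture; assumes SymPy is available). It also illustrates the recipe: theta1 minus theta2 is a linear combination, and u maps to minus u squared is concave, so h is concave -- but only concave, not strictly, since the form vanishes along theta1 = theta2.

```python
import sympy as sp

t1, t2 = sp.symbols("theta1 theta2", real=True)
h = -(t1 - t2) ** 2

H = sp.hessian(h, (t1, t2))
print(H)               # Matrix([[-2, 2], [2, -2]]): not diagonal this time
print(H.eigenvals())   # {-4: 1, 0: 1}: concave, but not strictly concave
```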
And so, for a concave function, since we know there's a unique maximizer, there's this theory of convex optimization -- which really, since those books talk about minimizing, requires flipping directions here and there, but you can think of it as the theory of concave maximization. And it gives you algorithms to solve this numerically and fairly efficiently. OK, that means fast: even if d is of size 10,000, you're going to wait for one second and it's going to tell you what the maximum is. And that's what machine learning is about. If you've taken any class on machine learning, there's a lot of optimization, because they have really, really big problems to solve. Often in this class, since this is more introductory statistics, we will have a closed form: the maximum likelihood estimator will be "theta hat equals", say, x bar, and that will be the maximum likelihood estimator.

So just quickly -- has anybody seen convex optimization before? So let me just give you an intuition for why those functions are easy to maximize or to minimize. In one dimension, it's actually very easy for you to see. And the reason is this: if I want to maximize a concave function, what I need is to be able to query a point and get as an answer the derivative of the function at that point, OK? So now, say this is the function I want to optimize, and I've been running my algorithm for 5/10 of a second, and it's at this point here. OK, that's the candidate. Now, what I can ask is: what is the derivative of my function here? Well, it's going to give me a value, and this value is going to be either negative, positive, or zero. Well, if it's zero, that's great -- that means I'm here, and I can just go home; I've solved my problem. I know there's a unique maximum, and that's what I wanted to find. If it's positive, it actually tells me that I'm to the left of the optimizer.
To the left of the optimal value. And if it's negative, it means that I'm to the right of the value I'm looking for. And so most convex optimization methods basically tell you: well, if you query the derivative and it's positive, move to the right; and if it's negative, move to the left. Now, by how much you move is, basically, well, why people write books. And in higher dimensions, it's a little more complicated. Because in higher dimensions -- think about two dimensions -- I only get back a vector, and the vector is only telling me, well, here is the half of the space in which you can move. In one dimension, if you tell me "move to the right", I know exactly which direction I'm going to have to move. But in two dimensions, you're basically telling me, well, move in this general direction. And so, of course, I know there's a line on the floor I cannot move behind. But even if you tell me, "draw a line on the floor and move only to that side of the line", there are still many directions on that side of the line that I can go. And that's also why there are lots of things you can do in optimization. OK, but still, putting this line on the floor is telling me: do not go backwards. And that's very important -- it's telling you which direction you should be going, always, OK? All right, so that's what's behind this notion of gradient descent algorithms -- steepest descent. Or steepest ascent, actually, if we're trying to maximize.

OK, so let's move on -- this course is not about optimization, all right? So as I said, the likelihood was this guy, the product of the f theta of the xi's. And one way you can think of this is just basically as the joint distribution of my data at the point theta. So now the likelihood, formally -- so here I am giving myself the model (E, (P theta)). And here I'm going to assume that E is discrete, so that I can talk about PMFs. But everything we're doing, just redo it for yourself replacing PMFs by PDFs, and everything is going to be fine.
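Looping back to the one-dimensional intuition for a moment, here is a minimal sketch of the "query the derivative and move toward it" rule (not from the lecture; the step size and the target function are illustrative):

```python
def gradient_ascent(grad, theta0=0.0, step=0.1, n_steps=200):
    # Repeatedly query the derivative and move in its direction:
    # positive slope -> move right, negative slope -> move left.
    theta = theta0
    for _ in range(n_steps):
        theta += step * grad(theta)
    return theta

# Maximize the strictly concave h(theta) = -(theta - 3)^2, h'(theta) = -2(theta - 3)
print(gradient_ascent(lambda t: -2 * (t - 3)))   # converges to 3.0
```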
We'll do it in a second. All right, so the likelihood of the model. So here I'm not looking at the likelihood of a parameter; I'm looking at the likelihood of the model. So it's actually a function of the parameter. And actually, I'm going to make it even a function of the points x1 to xn. All right, so I have a function, and what it takes as input is all the points x1 to xn and a candidate parameter theta. Not the true one -- a candidate. And what I'm going to do is look at the probability that my random variables, under this distribution P theta, take these exact values x1, x2, ..., xn. Now remember, if my data is independent, then I can actually just say that the probability of this intersection is the product of the probabilities, and it would look something like this. But I can define the likelihood even if I don't have independent random variables. Think of them as being independent, though, because that's all we're going to encounter in this class, OK? I just want you to be aware that if I had dependent variables, I could still define the likelihood; I would just have to understand how to compute these probabilities to be able to compute it.

OK, so think of Bernoullis, for example. So here is my example of a Bernoulli. So my model is ({0, 1}, Bernoulli(p)), for p in the interval (0, 1). Just as a side remark, I'm going to use the fact that I can write the PMF of a Bernoulli in a very concise form, right? If I ask you what the PMF of a Bernoulli is, you could tell me: well, under p, the probability that X is equal to 0 is 1 minus p, and the probability under p that X is equal to 1 is p. But I can be a bit smart and say that for any x that's either 0 or 1, the probability under p that X is equal to little x can be written in the compact form p to the x times 1 minus p to the 1 minus x.
794 00:41:14,150 --> 00:41:17,570 And you can check that this is the right form because, well, 795 00:41:17,570 --> 00:41:20,910 you have to check it only for two values of x, 0 and 1. 796 00:41:20,910 --> 00:41:23,350 And if you plug in 1, you only keep the p. 797 00:41:23,350 --> 00:41:27,840 If you plug in 0, you only keep the 1 minus p. 798 00:41:27,840 --> 00:41:31,190 And that's just a trick, OK? 799 00:41:31,190 --> 00:41:34,350 I could have gone with many other ways. 800 00:41:34,350 --> 00:41:35,940 Agreed? 801 00:41:35,940 --> 00:41:39,342 I could have said, actually, something like-- 802 00:41:39,342 --> 00:41:41,550 another one would be-- which we are not going to use, 803 00:41:41,550 --> 00:41:47,340 but we could say, well, it's x times p plus 1 minus x times 1 minus 804 00:41:47,340 --> 00:41:47,850 p, right? 805 00:41:50,680 --> 00:41:53,160 That's another one. 806 00:41:53,160 --> 00:41:56,057 But this one is going to be convenient. 807 00:41:56,057 --> 00:41:57,640 So forget about this guy for a second. 808 00:42:02,750 --> 00:42:05,450 So now, I said that the likelihood is just 809 00:42:05,450 --> 00:42:12,380 this function that's computing the probability that X1 810 00:42:12,380 --> 00:42:15,050 is equal to little x1, and so on. 811 00:42:15,050 --> 00:42:27,950 So the likelihood is L of x1, ..., xn. 812 00:42:27,950 --> 00:42:30,140 So let me try to make those calligraphic so you 813 00:42:30,140 --> 00:42:33,140 know that I'm talking about the small values, right? 814 00:42:33,140 --> 00:42:35,010 Small x's. 815 00:42:35,010 --> 00:42:38,840 x1, ..., xn, and then of course p. 816 00:42:38,840 --> 00:42:40,284 Sometimes we even put-- 817 00:42:40,284 --> 00:42:42,200 I didn't do it, but sometimes you can actually 818 00:42:42,200 --> 00:42:46,640 put a semicolon here, so you know that those two 819 00:42:46,640 --> 00:42:48,860 things are treated differently. 820 00:42:48,860 --> 00:42:51,570 And so now this thing is equal to what? 821 00:42:51,570 --> 00:42:54,440 Well, it's just the probability under p 822 00:42:54,440 --> 00:42:59,990 that X1 is little x1 all the way to Xn is little xn. 823 00:42:59,990 --> 00:43:02,064 OK, that's just the definition. 824 00:43:06,910 --> 00:43:11,590 All right, so now let's start working. 825 00:43:11,590 --> 00:43:13,240 So we write the definition, and then we 826 00:43:13,240 --> 00:43:16,030 want to make it look like something we would potentially 827 00:43:16,030 --> 00:43:17,902 be able to maximize-- 828 00:43:17,902 --> 00:43:20,235 like if I take the derivative of this with respect to p, 829 00:43:20,235 --> 00:43:22,570 it's not very helpful as it stands. 830 00:43:22,570 --> 00:43:26,770 I just want an algebraic function of p. 831 00:43:26,770 --> 00:43:28,580 So this thing is going to be equal to what? 832 00:43:28,580 --> 00:43:30,413 Well, what is the first thing I want to use? 833 00:43:32,740 --> 00:43:35,350 I have a probability of an intersection of events, 834 00:43:35,350 --> 00:43:39,630 so it's just the product of the probabilities. 835 00:43:39,630 --> 00:43:44,396 So this is the product from i equal 1 to n of the probability under 836 00:43:44,396 --> 00:43:47,970 p that Xi is equal to little xi. 837 00:43:47,970 --> 00:43:49,858 That's independence. 838 00:43:54,070 --> 00:43:58,690 OK, now I'm starting to mean business, because for each of these, 839 00:43:58,690 --> 00:44:00,370 we have a closed form, right? 840 00:44:00,370 --> 00:44:03,910 I wrote this as this supposedly convenient form.
841 00:44:03,910 --> 00:44:06,470 I still have to reveal to you why it's convenient. 842 00:44:06,470 --> 00:44:09,640 So this thing is equal to-- 843 00:44:09,640 --> 00:44:15,090 well, we said that that was p to the little xi times 844 00:44:15,090 --> 00:44:20,240 1 minus p to the 1 minus xi, OK? 845 00:44:22,960 --> 00:44:26,650 So that was just what I wrote over there as the probability 846 00:44:26,650 --> 00:44:29,540 that Xi is equal to little xi. 847 00:44:29,540 --> 00:44:32,780 And since they all have the same parameter p, I just 848 00:44:32,780 --> 00:44:34,280 have this one p that shows up here. 849 00:44:38,140 --> 00:44:41,230 And so now I'm just taking the product of something 850 00:44:41,230 --> 00:44:45,160 to the xi, so it's this thing to the sum of the xi's. 851 00:44:45,160 --> 00:44:48,090 Everybody agrees with this? 852 00:44:48,090 --> 00:44:56,360 So this is equal to p to the sum of the xi's times 1 minus p 853 00:44:56,360 --> 00:44:58,180 to the n minus sum of the xi's. 854 00:45:10,180 --> 00:45:13,300 If you don't feel comfortable with writing it directly, 855 00:45:13,300 --> 00:45:15,520 you can observe that this thing here 856 00:45:15,520 --> 00:45:22,170 is actually equal to p over 1 minus p to the xi, times 1 857 00:45:22,170 --> 00:45:26,022 minus p, OK? 858 00:45:26,022 --> 00:45:27,480 So now when I take the product, I'm 859 00:45:27,480 --> 00:45:28,938 getting the product of those guys. 860 00:45:28,938 --> 00:45:31,380 So it's just this guy to the power of the sum 861 00:45:31,380 --> 00:45:33,570 and this guy to the power n. 862 00:45:33,570 --> 00:45:39,670 And then I can rewrite it like this if I want to. 863 00:45:39,670 --> 00:45:42,720 And so now-- well, that's what we have here. 864 00:45:42,720 --> 00:45:45,750 And now I am in business because I can still 865 00:45:45,750 --> 00:45:48,750 hope to maximize this function. 866 00:45:48,750 --> 00:45:50,679 And how do I maximize this function? 867 00:45:50,679 --> 00:45:52,470 All I have to do is to take the derivative. 868 00:45:52,470 --> 00:45:54,710 Do you want to do it? 869 00:45:54,710 --> 00:45:56,502 Let's just take the derivative, OK? 870 00:45:56,502 --> 00:45:58,960 Sorry, I didn't tell you that, well, the maximum likelihood 871 00:45:58,960 --> 00:46:01,700 principle-- the idea is to maximize this thing, 872 00:46:01,700 --> 00:46:02,200 OK? 873 00:46:02,200 --> 00:46:04,310 But I'm not going to get there right now. 874 00:46:04,310 --> 00:46:08,810 OK, so let's do it maybe for the Poisson model for a second. 875 00:46:08,810 --> 00:46:16,910 So if you want to do it for the Poisson model, 876 00:46:16,910 --> 00:46:18,380 let's write the likelihood. 877 00:46:18,380 --> 00:46:20,020 So right now I'm not doing anything. 878 00:46:20,020 --> 00:46:21,010 I'm not maximizing. 879 00:46:21,010 --> 00:46:24,040 I'm just computing the likelihood function. 880 00:46:29,640 --> 00:46:32,470 OK, so the likelihood function for Poisson. 881 00:46:32,470 --> 00:46:36,710 So now I know-- what is my sample space for Poisson? 882 00:46:36,710 --> 00:46:38,140 STUDENT: Positives. 883 00:46:38,140 --> 00:46:41,170 PROFESSOR: The non-negative integers. 884 00:46:41,170 --> 00:46:45,220 And well, let me write it like this. 885 00:46:45,220 --> 00:46:51,170 Poisson lambda, and I'm going to take lambda to be positive.
886 00:46:51,170 --> 00:46:53,560 And so that means that the probability under lambda 887 00:46:53,560 --> 00:46:57,920 that X is equal to little x in the sample space 888 00:46:57,920 --> 00:47:01,030 is lambda to the x over x factorial, 889 00:47:01,030 --> 00:47:03,130 e to the minus lambda. 890 00:47:03,130 --> 00:47:05,530 So that's basically the same as the compact form 891 00:47:05,530 --> 00:47:06,740 that I wrote over there. 892 00:47:06,740 --> 00:47:08,860 It's just now a different one. 893 00:47:08,860 --> 00:47:12,340 And so when I want to write my likelihood, again, 894 00:47:12,340 --> 00:47:13,500 we said little x's. 895 00:47:17,050 --> 00:47:18,390 This is equal to what? 896 00:47:18,390 --> 00:47:23,690 Well, it's equal to the probability under lambda 897 00:47:23,690 --> 00:47:31,796 that X1 is little x1, ..., Xn is little xn, 898 00:47:31,796 --> 00:47:33,045 which is equal to the product. 899 00:47:40,950 --> 00:47:42,720 OK? 900 00:47:42,720 --> 00:47:45,270 Just by independence. 901 00:47:45,270 --> 00:47:47,640 And now I can write those guys as a product 902 00:47:47,640 --> 00:47:52,080 over i equal 1 to n. 903 00:47:52,080 --> 00:47:56,100 So this guy is just this thing where I plug in xi. 904 00:47:56,100 --> 00:48:05,540 So I get lambda to the xi divided by xi factorial times e 905 00:48:05,540 --> 00:48:10,660 to the minus lambda, OK? 906 00:48:10,660 --> 00:48:13,709 And now, I mean, this guy is going to be nice. 907 00:48:13,709 --> 00:48:15,250 This guy is not going to be too nice. 908 00:48:15,250 --> 00:48:16,570 But let's write it. 909 00:48:16,570 --> 00:48:18,820 When I take the product of those guys here, 910 00:48:18,820 --> 00:48:21,910 I'm going to pick up lambda to the sum of the xi's. 911 00:48:21,910 --> 00:48:23,470 Here I'm going to pick up exponential 912 00:48:23,470 --> 00:48:25,334 minus n times lambda. 913 00:48:25,334 --> 00:48:27,250 And here I'm going to pick up just the product 914 00:48:27,250 --> 00:48:29,200 of the factorials. 915 00:48:29,200 --> 00:48:35,900 So x1 factorial all the way to xn factorial. 916 00:48:35,900 --> 00:48:41,130 Then I get lambda to the sum of the xi's. 917 00:48:41,130 --> 00:48:43,480 Those are little xi's. 918 00:48:43,480 --> 00:48:46,581 e to the minus n lambda. 919 00:48:46,581 --> 00:48:47,080 OK? 920 00:48:51,900 --> 00:48:55,510 So that might look freaky at this point, but remember, 921 00:48:55,510 --> 00:48:58,100 this is a function we will be maximizing. 922 00:48:58,100 --> 00:49:01,480 And the denominator here does not depend on lambda. 923 00:49:01,480 --> 00:49:04,860 So we know that maximizing this function with this denominator, 924 00:49:04,860 --> 00:49:07,590 or any other denominator, including 1, 925 00:49:07,590 --> 00:49:09,930 will give me the same arg max. 926 00:49:09,930 --> 00:49:12,180 So it won't be a problem for me. 927 00:49:12,180 --> 00:49:14,349 As long as it does not depend on lambda, 928 00:49:14,349 --> 00:49:15,640 this thing is going to go away. 929 00:49:19,130 --> 00:49:24,720 OK, so in the continuous case, I cannot define the likelihood this way-- 930 00:49:24,720 --> 00:49:25,220 right? 931 00:49:25,220 --> 00:49:26,720 Because if I wrote the likelihood 932 00:49:26,720 --> 00:49:29,600 like this in the continuous case, 933 00:49:29,600 --> 00:49:32,240 this one would be equal to what? 934 00:49:32,240 --> 00:49:33,160 Zero, right? 935 00:49:33,160 --> 00:49:34,565 So it's not very helpful.
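Before the continuous-case fix, here is the discrete Poisson likelihood just computed, as a sketch in Python. The sample and the candidate lambda are arbitrary illustrative choices, and math.prod needs Python 3.8 or later.

    import math

    # L(x1, ..., xn, lam) = lam**(sum xi) * exp(-n*lam) / (x1! * ... * xn!)
    # The data and the candidate lam are arbitrary illustrative choices.

    def poisson_likelihood(xs, lam):
        n = len(xs)
        denom = math.prod(math.factorial(x) for x in xs)  # constant in lam
        return lam ** sum(xs) * math.exp(-n * lam) / denom

    xs = [3, 0, 2, 4, 1]
    print(poisson_likelihood(xs, 2.0))
    # The denominator does not depend on lam, so dropping it (or using any
    # other constant) leaves the arg max over lam unchanged.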
936 00:49:34,565 --> 00:49:36,440 And so what we do is we define the likelihood 937 00:49:36,440 --> 00:49:39,860 as the product of the f theta of the xi's. 938 00:49:39,860 --> 00:49:43,340 Now that would be a jump if I told you, 939 00:49:43,340 --> 00:49:45,230 well, just define it like that and go home 940 00:49:45,230 --> 00:49:46,700 and don't discuss it. 941 00:49:46,700 --> 00:49:52,011 But we know that this is exactly what's coming from the-- 942 00:49:52,011 --> 00:49:53,510 well, actually, I think I erased it. 943 00:49:53,510 --> 00:49:55,370 It was just behind. 944 00:49:55,370 --> 00:49:58,280 So this was exactly what was coming from the estimated KL 945 00:49:58,280 --> 00:50:00,200 divergence, right? 946 00:50:00,200 --> 00:50:01,700 The thing that I showed you-- if we 947 00:50:01,700 --> 00:50:03,200 want to follow this strategy, which 948 00:50:03,200 --> 00:50:06,830 consists in estimating the KL divergence and minimizing it, 949 00:50:06,830 --> 00:50:08,210 it's exactly doing this. 950 00:50:12,190 --> 00:50:16,730 So in the Gaussian case-- 951 00:50:16,730 --> 00:50:17,835 well, let's write it. 952 00:50:17,835 --> 00:50:19,610 So in the Gaussian case, let's see 953 00:50:19,610 --> 00:50:20,940 what the likelihood looks like. 954 00:50:27,600 --> 00:50:32,000 OK, so if I have a Gaussian experiment here-- 955 00:50:32,000 --> 00:50:33,430 did I actually write it? 956 00:50:36,440 --> 00:50:40,590 OK, so I'm going to take mu and sigma as being two parameters. 957 00:50:40,590 --> 00:50:43,756 So that means that my sample space is going to be what? 958 00:50:47,330 --> 00:50:49,700 Well, my sample space is still R. 959 00:50:49,700 --> 00:50:51,750 Those are just my observations. 960 00:50:51,750 --> 00:50:56,840 But then I'm going to have N(mu, sigma squared). 961 00:50:56,840 --> 00:50:58,400 And the parameters of interest are mu 962 00:50:58,400 --> 00:51:04,291 in R and sigma squared in (0, infinity). 963 00:51:04,291 --> 00:51:06,450 OK, so that's my Gaussian model. 964 00:51:06,450 --> 00:51:07,736 Yes. 965 00:51:07,736 --> 00:51:17,455 STUDENT: [INAUDIBLE] 966 00:51:17,455 --> 00:51:18,580 PROFESSOR: No, there's no-- 967 00:51:18,580 --> 00:51:20,080 I mean, there's no difference. 968 00:51:20,080 --> 00:51:21,514 STUDENT: [INAUDIBLE] 969 00:51:21,514 --> 00:51:22,180 PROFESSOR: Yeah. 970 00:51:22,180 --> 00:51:24,880 I think on all the slides I put the curly brackets, 971 00:51:24,880 --> 00:51:26,520 and here I'm just being lazy. 972 00:51:26,520 --> 00:51:31,540 I just like those curved parentheses. 973 00:51:31,540 --> 00:51:33,850 All right, so let's write it. 974 00:51:33,850 --> 00:51:39,670 So the definition, L of x1, ..., xn. 975 00:51:39,670 --> 00:51:43,810 And now I have two parameters, mu and sigma squared. 976 00:51:43,810 --> 00:51:48,035 We said, by definition, this is the product from i 977 00:51:48,035 --> 00:51:55,540 equal 1 to n of f theta of little xi. 978 00:51:55,540 --> 00:51:57,550 Now, think about it. 979 00:51:57,550 --> 00:52:00,790 Here we always had an extra line, right? 980 00:52:00,790 --> 00:52:03,460 The line was to say that the definition was the probability 981 00:52:03,460 --> 00:52:05,470 that the Xi's were all equal to the little xi's. 982 00:52:05,470 --> 00:52:08,230 That was the joint probability. 983 00:52:08,230 --> 00:52:12,430 And here I could actually have a line that says it's the joint 984 00:52:12,430 --> 00:52:14,146 probability density of the xi's.
985 00:52:14,146 --> 00:52:15,520 And if it's not independent, it's 986 00:52:15,520 --> 00:52:16,732 not going to be the product. 987 00:52:16,732 --> 00:52:18,190 But again, since we're only dealing 988 00:52:18,190 --> 00:52:21,020 with independent observations in the scope of this class, 989 00:52:21,020 --> 00:52:23,890 this is the only definition we're going to be using. 990 00:52:23,890 --> 00:52:26,710 OK, and actually, from here on, I 991 00:52:26,710 --> 00:52:30,910 will literally skip this step when I talk about discrete ones 992 00:52:30,910 --> 00:52:33,270 as well, because they are also independent. 993 00:52:33,270 --> 00:52:35,530 Agreed? 994 00:52:35,530 --> 00:52:37,570 So we start with this, which we agreed 995 00:52:37,570 --> 00:52:39,590 was the definition for this particular case. 996 00:52:39,590 --> 00:52:44,545 And now all of you know by heart the density of a Gaussian-- 997 00:52:44,545 --> 00:52:45,600 sorry, that's not theta. 998 00:52:45,600 --> 00:52:47,540 I should write it mu sigma squared. 999 00:52:47,540 --> 00:52:50,650 And so you need to know what this density is. 1000 00:52:50,650 --> 00:53:01,070 And it's the product of 1 over sigma square root 2 pi times 1001 00:53:01,070 --> 00:53:07,350 exponential of minus xi minus mu squared 1002 00:53:07,350 --> 00:53:10,210 divided by 2 sigma squared. 1003 00:53:10,210 --> 00:53:13,750 OK, that's the Gaussian density with parameters mu and sigma 1004 00:53:13,750 --> 00:53:15,810 squared. 1005 00:53:15,810 --> 00:53:18,360 I just plugged in this thing which I don't derive for you, 1006 00:53:18,360 --> 00:53:20,630 so you just have to trust me. 1007 00:53:20,630 --> 00:53:22,500 It's in every book. 1008 00:53:22,500 --> 00:53:25,334 Certainly, I mean, you can find it. 1009 00:53:25,334 --> 00:53:26,250 I will give it to you. 1010 00:53:26,250 --> 00:53:29,310 And again, you're not expected to know it by heart. 1011 00:53:29,310 --> 00:53:34,290 Though, if you do your homework every week, then without wanting to, 1012 00:53:34,290 --> 00:53:36,180 you will definitely use some of your brain 1013 00:53:36,180 --> 00:53:38,140 to remember that thing. 1014 00:53:38,140 --> 00:53:42,680 OK, and so now, well, I have this constant in front, 1015 00:53:42,680 --> 00:53:45,000 1 over sigma square root 2 pi, that I can pull out. 1016 00:53:45,000 --> 00:53:50,474 So I get 1 over sigma square root 2 pi to the power n. 1017 00:53:50,474 --> 00:53:52,890 And then I have the product of exponentials, which we know 1018 00:53:52,890 --> 00:53:55,420 is the exponential of the sum. 1019 00:53:55,420 --> 00:53:58,710 So this is equal to exponential of minus-- 1020 00:53:58,710 --> 00:54:01,260 and here I'm going to put the 1 over 2 sigma squared 1021 00:54:01,260 --> 00:54:02,210 outside the sum. 1022 00:54:15,740 --> 00:54:19,850 And so that's how this guy shows up. 1023 00:54:19,850 --> 00:54:23,550 Just the product of the densities evaluated at, respectively, 1024 00:54:23,550 --> 00:54:24,676 x1 to xn. 1025 00:54:28,850 --> 00:54:33,240 OK, any questions about computing those likelihoods? 1026 00:54:33,240 --> 00:54:34,556 Yes. 1027 00:54:34,556 --> 00:54:41,460 STUDENT: Why [INAUDIBLE] 1028 00:54:41,460 --> 00:54:42,890 PROFESSOR: Oh, that's a typo. 1029 00:54:42,890 --> 00:54:43,740 Thank you. 1030 00:54:43,740 --> 00:54:47,040 Because I just took it from probably the previous slide. 1031 00:54:47,040 --> 00:54:48,840 So those are actually-- should be-- 1032 00:54:48,840 --> 00:54:50,850 OK, thank you for noting that one.
1033 00:54:50,850 --> 00:55:00,180 So this line should say for any x1 to xn in R to the n. 1034 00:55:00,180 --> 00:55:01,470 Thank you, good catch. 1035 00:55:06,940 --> 00:55:10,840 All right, so that's really E to the n, right? 1036 00:55:10,840 --> 00:55:12,490 My sample space, always. 1037 00:55:16,090 --> 00:55:19,800 OK, so what is maximum likelihood estimation? 1038 00:55:19,800 --> 00:55:24,770 Well again, if you go back to the estimation 1039 00:55:24,770 --> 00:55:27,770 strategy that we got, which consisted 1040 00:55:27,770 --> 00:55:31,160 in replacing expectation with respect to theta star 1041 00:55:31,160 --> 00:55:35,540 by averages of the data in the KL divergence, 1042 00:55:35,540 --> 00:55:41,810 we would try to maximize not this guy, but this guy. 1043 00:55:45,770 --> 00:55:48,260 The things that we actually plugged in were not just any small 1044 00:55:48,260 --> 00:55:48,760 xi's. 1045 00:55:48,760 --> 00:55:52,040 They were actually the random variables, capital Xi. 1046 00:55:52,040 --> 00:55:54,190 So the maximum likelihood estimator 1047 00:55:54,190 --> 00:55:57,090 is actually taking the likelihood, 1048 00:55:57,090 --> 00:55:59,570 which is a function of little x's, and now 1049 00:55:59,570 --> 00:56:02,210 the values at which it evaluates it, if you look at it, 1050 00:56:02,210 --> 00:56:03,620 are actually-- 1051 00:56:03,620 --> 00:56:05,870 the capital X's are my data. 1052 00:56:05,870 --> 00:56:09,800 So it looks at the function, at the data, 1053 00:56:09,800 --> 00:56:11,900 and at the parameter theta. 1054 00:56:11,900 --> 00:56:14,932 So that's the first thing. 1055 00:56:14,932 --> 00:56:16,640 And then the maximum likelihood estimator 1056 00:56:16,640 --> 00:56:19,930 is maximizing this, OK? 1057 00:56:19,930 --> 00:56:24,090 So in a way, what it does is it's a function that couples 1058 00:56:24,090 --> 00:56:27,810 together the data, capital X1 to capital Xn, 1059 00:56:27,810 --> 00:56:32,310 with the parameter theta, and now just tries to maximize it. 1060 00:56:32,310 --> 00:56:40,120 So if this is just a little hard for you to get, 1061 00:56:40,120 --> 00:56:42,340 the likelihood is formally defined 1062 00:56:42,340 --> 00:56:43,750 as a function of x, right? 1063 00:56:43,750 --> 00:56:46,105 Like when I write f of x-- 1064 00:56:46,105 --> 00:56:48,580 f of little x, I define it like that. 1065 00:56:48,580 --> 00:56:52,990 But really, the only x arguments we're 1066 00:56:52,990 --> 00:56:54,680 going to evaluate this function at 1067 00:56:54,680 --> 00:56:57,920 are always the random variables, which are the data. 1068 00:56:57,920 --> 00:56:59,440 So if you want, you can think of it 1069 00:56:59,440 --> 00:57:02,230 as those guys being not parameters of this function, 1070 00:57:02,230 --> 00:57:04,810 but really, random variables themselves directly. 1071 00:57:09,390 --> 00:57:10,683 Is there any question? 1072 00:57:10,683 --> 00:57:15,516 STUDENT: [INAUDIBLE] those random variables [INAUDIBLE]? 1073 00:57:15,516 --> 00:57:17,890 PROFESSOR: So those are going to be known once you have-- 1074 00:57:17,890 --> 00:57:20,500 so it's always the same thing in stats. 1075 00:57:20,500 --> 00:57:24,040 You first design your estimator as a function 1076 00:57:24,040 --> 00:57:25,270 of random variables. 1077 00:57:25,270 --> 00:57:27,490 And then once you get data, you just plug it in.
1078 00:57:27,490 --> 00:57:29,920 But we want to think of them as being random variables 1079 00:57:29,920 --> 00:57:32,262 because we want to understand what the fluctuations are. 1080 00:57:32,262 --> 00:57:34,720 So we're going to keep them as random variables for as long 1081 00:57:34,720 --> 00:57:35,685 as we can. 1082 00:57:35,685 --> 00:57:37,810 We're going to spit out the estimator as a function 1083 00:57:37,810 --> 00:57:38,690 of the random variables. 1084 00:57:38,690 --> 00:57:40,060 And then when we want to compute it from data, 1085 00:57:40,060 --> 00:57:41,351 we're just going to plug it in. 1086 00:57:44,170 --> 00:57:46,630 So keep the random variables for as long as you can. 1087 00:57:46,630 --> 00:57:48,430 Unless I give you numbers, actual numbers, 1088 00:57:48,430 --> 00:57:51,130 those are random variables. 1089 00:57:51,130 --> 00:57:53,549 OK, so there might be some confusion 1090 00:57:53,549 --> 00:57:55,590 if you've seen any stats class; sometimes there's 1091 00:57:55,590 --> 00:57:58,420 a notation which says, oh, the realizations 1092 00:57:58,420 --> 00:58:01,240 of the random variables are lowercase versions 1093 00:58:01,240 --> 00:58:02,730 of the original random variables. 1094 00:58:02,730 --> 00:58:05,920 So lowercase x should be thought of as the realization 1095 00:58:05,920 --> 00:58:09,610 of the uppercase X. This is not the case here. 1096 00:58:09,610 --> 00:58:12,010 When I write this, it's the same way 1097 00:58:12,010 --> 00:58:16,630 as I write f of x is equal to x squared, right? 1098 00:58:16,630 --> 00:58:20,260 It's just an argument of a function that I want to define. 1099 00:58:20,260 --> 00:58:22,150 So those are just generic x's. 1100 00:58:22,150 --> 00:58:24,580 So if you correct the typo that I have, 1101 00:58:24,580 --> 00:58:27,150 this should say for any x1 to xn. 1102 00:58:27,150 --> 00:58:28,990 I'm just describing a function. 1103 00:58:28,990 --> 00:58:30,816 And now the only place at which I'm 1104 00:58:30,816 --> 00:58:32,440 interested in evaluating that function, 1105 00:58:32,440 --> 00:58:35,477 at least for those first n arguments, is at the capital 1106 00:58:35,477 --> 00:58:37,310 X's, the n observed random variables that I have. 1107 00:58:41,110 --> 00:58:45,040 So there are actually texts, there are actually 1108 00:58:45,040 --> 00:58:48,070 people doing research on when the maximum likelihood 1109 00:58:48,070 --> 00:58:49,720 estimator exists. 1110 00:58:49,720 --> 00:58:56,890 And that question arises when you have an infinite parameter set Theta. 1111 00:58:56,890 --> 00:58:58,770 And this thing can diverge. 1112 00:58:58,770 --> 00:59:00,160 There is no global maximum. 1113 00:59:00,160 --> 00:59:01,990 There are crazy things that might happen. 1114 00:59:01,990 --> 00:59:04,630 And so we're actually always going to be in a case 1115 00:59:04,630 --> 00:59:07,450 where this maximum likelihood estimator exists. 1116 00:59:07,450 --> 00:59:09,580 And if it doesn't, then it means that you actually 1117 00:59:09,580 --> 00:59:13,840 need to restrict your parameter space, capital Theta, 1118 00:59:13,840 --> 00:59:15,430 to something smaller. 1119 00:59:15,430 --> 00:59:17,500 Otherwise it won't exist. 1120 00:59:17,500 --> 00:59:21,910 OK, so another thing is the log likelihood. 1121 00:59:21,910 --> 00:59:23,800 Maximizing it still gives the maximum likelihood estimator.
1122 00:59:23,800 --> 00:59:26,380 We saw before that maximizing a function 1123 00:59:26,380 --> 00:59:27,820 or maximizing the log of this function 1124 00:59:27,820 --> 00:59:30,350 is the same thing, because the log function is increasing. 1125 00:59:30,350 --> 00:59:32,100 The same goes for maximizing a function 1126 00:59:32,100 --> 00:59:35,352 or maximizing, I don't know, the exponential of this function. 1127 00:59:35,352 --> 00:59:37,060 Every time I take an increasing function, 1128 00:59:37,060 --> 00:59:38,410 it's actually the same thing. 1129 00:59:38,410 --> 00:59:40,360 Maximizing a function or maximizing 10 times 1130 00:59:40,360 --> 00:59:41,693 this function is the same thing. 1131 00:59:41,693 --> 00:59:45,730 The function x maps to 10 times x is increasing. 1132 00:59:45,730 --> 00:59:49,480 And so why do we talk about log likelihood rather than 1133 00:59:49,480 --> 00:59:50,620 likelihood? 1134 00:59:50,620 --> 00:59:52,590 So the log likelihood is really just-- 1135 00:59:52,590 --> 00:59:55,810 I mean, the log likelihood is the log of the likelihood. 1136 00:59:55,810 --> 00:59:59,420 And the reason is exactly this kind of reason. 1137 00:59:59,420 --> 01:00:02,240 Remember, that was my likelihood, right? 1138 01:00:02,240 --> 01:00:04,170 And I want to maximize it. 1139 01:00:04,170 --> 01:00:05,940 And it turns out that in stats, there are 1140 01:00:05,940 --> 01:00:10,410 a lot of distributions that look like exponential of something. 1141 01:00:10,410 --> 01:00:12,930 So I might as well just remove the exponential 1142 01:00:12,930 --> 01:00:14,730 by taking the log. 1143 01:00:14,730 --> 01:00:17,230 So once I have this guy, I can take the log. 1144 01:00:17,230 --> 01:00:19,080 This is something to a power of something. 1145 01:00:19,080 --> 01:00:21,720 If I take the log, it's going to look better for me. 1146 01:00:21,720 --> 01:00:23,400 I have this thing-- 1147 01:00:23,400 --> 01:00:25,650 well, I had another one somewhere, I think, 1148 01:00:25,650 --> 01:00:27,910 where I had the Poisson. 1149 01:00:27,910 --> 01:00:29,070 Where was the Poisson? 1150 01:00:29,070 --> 01:00:31,890 The Poisson's gone. 1151 01:00:31,890 --> 01:00:33,610 So the Poisson was the same thing. 1152 01:00:33,610 --> 01:00:35,670 If I took the log, because it had a power, 1153 01:00:35,670 --> 01:00:37,210 that would make my life easier. 1154 01:00:37,210 --> 01:00:43,800 So the log doesn't have any particular intrinsic meaning, 1155 01:00:43,800 --> 01:00:47,550 except that it's just more convenient. 1156 01:00:47,550 --> 01:00:49,500 Now, that being said, if you think 1157 01:00:49,500 --> 01:00:53,370 about minimizing the KL, the original formulation, 1158 01:00:53,370 --> 01:00:55,590 we actually removed the log. 1159 01:00:55,590 --> 01:00:57,040 If we come back to the KL thing-- 1160 01:01:00,700 --> 01:01:01,610 where is my KL? 1161 01:01:01,610 --> 01:01:03,770 Sorry. 1162 01:01:03,770 --> 01:01:08,630 That was maximizing the sum of the logs of the p theta of the xi's. 1163 01:01:08,630 --> 01:01:11,870 And so then we worked at it by saying that 1164 01:01:11,870 --> 01:01:12,539 maximizing 1165 01:01:12,539 --> 01:01:14,330 the sum of the logs was the same 1166 01:01:14,330 --> 01:01:16,220 as maximizing the product. 1167 01:01:16,220 --> 01:01:18,140 But here, with the log likelihood, 1168 01:01:18,140 --> 01:01:21,571 we're basically just going backwards in this chain of equivalences.
1169 01:01:21,571 --> 01:01:23,570 And that's just because the original formulation 1170 01:01:23,570 --> 01:01:27,180 was already convenient. 1171 01:01:27,180 --> 01:01:28,940 So we went to the likelihood, 1172 01:01:28,940 --> 01:01:32,620 and now we're coming back to our original estimation strategy. 1173 01:01:32,620 --> 01:01:34,250 So look at the Poisson. 1174 01:01:34,250 --> 01:01:39,210 I want to take the log here to bring my sum of the xi's down from the exponent. 1175 01:01:39,210 --> 01:01:47,510 OK, so this is my estimator. 1176 01:01:47,510 --> 01:01:50,090 So the log of L-- 1177 01:01:50,090 --> 01:01:51,590 so one thing that you want to notice 1178 01:01:51,590 --> 01:01:59,960 is that the log of L of x1, ..., xn, theta, as we said, 1179 01:01:59,960 --> 01:02:02,860 is equal to the sum from i equal 1 1180 01:02:02,860 --> 01:02:09,950 to n of the log of p theta of xi-- 1181 01:02:09,950 --> 01:02:11,270 so that's in the discrete case. 1182 01:02:11,270 --> 01:02:14,480 And in the continuous case, it is the sum 1183 01:02:14,480 --> 01:02:16,627 of the log of f theta of xi. 1184 01:02:19,277 --> 01:02:21,860 The beauty of this is that you don't have to really understand 1185 01:02:21,860 --> 01:02:23,360 the difference between a probability mass 1186 01:02:23,360 --> 01:02:25,310 function and a probability density function 1187 01:02:25,310 --> 01:02:26,690 to implement this. 1188 01:02:26,690 --> 01:02:29,518 Whatever you get, that's what you plug in. 1189 01:02:32,930 --> 01:02:33,810 Any questions so far? 1190 01:02:36,550 --> 01:02:39,940 All right, so shall we do some computations 1191 01:02:39,940 --> 01:02:44,720 and check that we've introduced all this stuff-- 1192 01:02:44,720 --> 01:02:47,380 complicated functions, maximizing, KL divergence, 1193 01:02:47,380 --> 01:02:50,590 lots of things-- so that we can spit out, again, averages? 1194 01:02:50,590 --> 01:02:51,160 All right? 1195 01:02:51,160 --> 01:02:51,580 That's great. 1196 01:02:51,580 --> 01:02:52,810 We're going to be able to sleep at night 1197 01:02:52,810 --> 01:02:55,150 knowing that there's a really powerful mechanism called 1198 01:02:55,150 --> 01:02:57,370 the maximum likelihood estimator that was actually 1199 01:02:57,370 --> 01:03:00,370 driving our intuition without us knowing. 1200 01:03:00,370 --> 01:03:04,730 OK, so let's do this. 1201 01:03:04,730 --> 01:03:06,240 Bernoulli trials. 1202 01:03:06,240 --> 01:03:07,400 I still have it over there. 1203 01:03:15,920 --> 01:03:19,120 OK, so actually, I don't know what-- 1204 01:03:19,120 --> 01:03:21,260 well, let me write it like that. 1205 01:03:21,260 --> 01:03:25,730 So it's p over 1 minus p to the 1206 01:03:25,730 --> 01:03:32,650 sum of the xi's, times 1 minus p to the n. 1207 01:03:32,650 --> 01:03:37,960 So now I want to maximize this as a function of p. 1208 01:03:37,960 --> 01:03:39,880 Well, the first thing we would want to do 1209 01:03:39,880 --> 01:03:41,860 is to check that this function is concave. 1210 01:03:41,860 --> 01:03:45,220 And I'm just going to ask you to trust me on this. 1211 01:03:45,220 --> 01:03:47,800 I only 1212 01:03:47,800 --> 01:03:52,520 want to take the derivative and just go home. 1213 01:03:52,520 --> 01:03:55,150 So let's just take the derivative of this with respect 1214 01:03:55,150 --> 01:03:56,332 to p. Actually, no. 1215 01:03:56,332 --> 01:03:57,540 This one was more convenient. 1216 01:03:57,540 --> 01:03:58,040 I'm sorry.
1217 01:04:00,820 --> 01:04:03,100 This one was slightly more convenient, OK? 1218 01:04:03,100 --> 01:04:05,980 So now we have-- 1219 01:04:05,980 --> 01:04:09,130 so now let me take the log. 1220 01:04:09,130 --> 01:04:16,960 So if I take the log, what I get is sum of the xi's times log p, 1221 01:04:16,960 --> 01:04:24,704 plus n minus sum of the xi's, times log of 1 minus p. 1222 01:04:27,970 --> 01:04:29,590 Now I take the derivative with respect 1223 01:04:29,590 --> 01:04:35,837 to p and set it equal to zero. 1224 01:04:35,837 --> 01:04:36,920 So what does that give me? 1225 01:04:36,920 --> 01:04:43,710 It tells me that sum of the xi's divided by p, minus n 1226 01:04:43,710 --> 01:04:50,130 minus sum of the xi's divided by 1 minus p, is equal to 0. 1227 01:04:56,360 --> 01:04:58,980 So now I need to solve for p. 1228 01:04:58,980 --> 01:04:59,920 So let's just do it. 1229 01:04:59,920 --> 01:05:06,500 So what we get is that 1 minus p times sum of the xi's is equal to p times n 1230 01:05:06,500 --> 01:05:10,530 minus sum of the xi's. 1231 01:05:10,530 --> 01:05:17,300 So that's sum of the xi's is equal to p times n minus sum of the xi's, plus p times sum of the xi's. 1232 01:05:17,300 --> 01:05:18,550 So let me put it together on the right. 1233 01:05:18,550 --> 01:05:24,410 So that's p times n is equal to sum of the xi's. 1234 01:05:24,410 --> 01:05:27,170 And that's equivalent to p-- 1235 01:05:27,170 --> 01:05:30,020 actually, I should start by putting p hat from here 1236 01:05:30,020 --> 01:05:33,720 on, because I'm already solving an equation, right? 1237 01:05:33,720 --> 01:05:36,880 And so p hat is equal to sum of the xi's 1238 01:05:36,880 --> 01:05:38,510 divided by n, which is my Xn bar. 1239 01:05:44,050 --> 01:05:50,280 Poisson model-- as I said, the Poisson is gone. 1240 01:05:50,280 --> 01:05:51,874 So let me rewrite it quickly. 1241 01:06:00,850 --> 01:06:07,975 So Poisson, the likelihood in x1, ..., xn and lambda 1242 01:06:07,975 --> 01:06:13,270 was equal to lambda to the sum of the xi's, e 1243 01:06:13,270 --> 01:06:17,650 to the minus n lambda, divided by x1 factorial 1244 01:06:17,650 --> 01:06:20,920 all the way to xn factorial. 1245 01:06:20,920 --> 01:06:25,110 So let me take the log likelihood. 1246 01:06:25,110 --> 01:06:26,490 That's going to be equal to what? 1247 01:06:26,490 --> 01:06:27,406 Well, 1248 01:06:27,406 --> 01:06:29,096 let me 1249 01:06:29,096 --> 01:06:30,720 get rid of this guy first. 1250 01:06:30,720 --> 01:06:36,780 Minus log of x1 factorial all the way to xn factorial. 1251 01:06:36,780 --> 01:06:39,520 That's a constant with respect to lambda. 1252 01:06:39,520 --> 01:06:43,180 So when I take the derivative, it's going to go. 1253 01:06:43,180 --> 01:06:49,410 Then I'm going to have plus sum of the xi's times log lambda. 1254 01:06:49,410 --> 01:06:51,410 And then I'm going to have minus n lambda. 1255 01:06:54,390 --> 01:06:55,890 So now you take the derivative 1256 01:06:55,890 --> 01:06:57,660 and set it equal to zero. 1257 01:06:57,660 --> 01:07:04,860 So the partial derivative with respect to lambda of log L, 1258 01:07:04,860 --> 01:07:08,820 set equal to zero. 1259 01:07:08,820 --> 01:07:11,160 This is equivalent to-- so this guy goes. 1260 01:07:11,160 --> 01:07:16,440 This guy gives me sum of the xi's divided by lambda hat 1261 01:07:16,440 --> 01:07:17,070 equals n. 1262 01:07:22,470 --> 01:07:25,690 And so that's equivalent to lambda hat 1263 01:07:25,690 --> 01:07:31,092 being equal to sum of the xi's divided by n, which is Xn bar.
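A sketch that double-checks these two closed forms by brute force, maximizing each log likelihood over a fine grid and comparing with the sample mean. The data sets and the grids are arbitrary choices, not from the lecture.

    import math

    # Brute-force check of the two MLEs above: maximize the log likelihood
    # over a grid and compare with the sample mean. Data are arbitrary.

    def bernoulli_loglik(xs, p):
        s = sum(xs)
        return s * math.log(p) + (len(xs) - s) * math.log(1 - p)

    def poisson_loglik(xs, lam):
        # drops the -log(x1! ... xn!) term, which does not move the arg max
        return sum(xs) * math.log(lam) - len(xs) * lam

    xs = [1, 0, 1, 1, 0, 1]                      # Bernoulli sample
    grid = [i / 1000 for i in range(1, 1000)]    # p in (0, 1)
    p_hat = max(grid, key=lambda p: bernoulli_loglik(xs, p))
    print(p_hat, sum(xs) / len(xs))              # both about 0.667

    xs = [3, 0, 2, 4, 1]                         # Poisson sample
    grid = [i / 100 for i in range(1, 1001)]     # lambda in (0, 10]
    lam_hat = max(grid, key=lambda lam: poisson_loglik(xs, lam))
    print(lam_hat, sum(xs) / len(xs))            # both 2.0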
1264 01:07:34,044 --> 01:07:38,785 Take the derivative, set it equal to zero, and just solve. 1265 01:07:38,785 --> 01:07:42,930 It's a very satisfying exercise, especially when 1266 01:07:42,930 --> 01:07:45,150 you get the average in the end. 1267 01:07:45,150 --> 01:07:49,060 You don't have to think about it forever. 1268 01:07:49,060 --> 01:07:54,360 OK, the Gaussian model I'm going to leave to you as an exercise. 1269 01:07:54,360 --> 01:07:57,600 Take the log to get rid of the pesky exponential, 1270 01:07:57,600 --> 01:08:00,690 and then take the derivative and you should be fine. 1271 01:08:00,690 --> 01:08:02,940 It's a bit more-- 1272 01:08:02,940 --> 01:08:05,960 it might be one more line than those guys. 1273 01:08:05,960 --> 01:08:12,760 OK, so-- well actually, you need to take 1274 01:08:12,760 --> 01:08:14,040 the gradient in this case. 1275 01:08:14,040 --> 01:08:15,930 Don't check the second derivative right now. 1276 01:08:15,930 --> 01:08:17,596 You don't have to really think about it. 1277 01:08:21,430 --> 01:08:23,537 What did I want to add? 1278 01:08:23,537 --> 01:08:25,370 I think there was something I wanted to say. 1279 01:08:25,370 --> 01:08:27,319 Yes. 1280 01:08:27,319 --> 01:08:31,040 When I have a function that's concave and I'm on, like, 1281 01:08:31,040 --> 01:08:33,671 some infinite interval, then it's 1282 01:08:33,671 --> 01:08:36,170 true that taking the derivative and setting it equal to zero 1283 01:08:36,170 --> 01:08:38,029 will give me the maximum. 1284 01:08:38,029 --> 01:08:42,330 But again, I might have a function that looks like this. 1285 01:08:42,330 --> 01:08:46,260 Now, if I'm on some finite interval-- let me go elsewhere. 1286 01:08:46,260 --> 01:08:55,550 So if I'm on some finite interval 1287 01:08:55,550 --> 01:09:00,979 and my function looks like this as a function of theta-- 1288 01:09:00,979 --> 01:09:03,220 let's say this is my log likelihood 1289 01:09:03,220 --> 01:09:06,410 as a function of theta-- 1290 01:09:06,410 --> 01:09:13,200 then, OK, there's no place in this interval-- 1291 01:09:13,200 --> 01:09:15,040 let's say this is between 0 and 1-- there's 1292 01:09:15,040 --> 01:09:19,870 no place in this interval where the derivative is equal to 0. 1293 01:09:19,870 --> 01:09:22,569 And if you actually try to solve this, 1294 01:09:22,569 --> 01:09:26,187 you won't find a solution in the interval 0, 1. 1295 01:09:26,187 --> 01:09:28,270 And that's actually how you know that you probably 1296 01:09:28,270 --> 01:09:30,144 should not be setting the derivative equal to zero. 1297 01:09:30,144 --> 01:09:32,720 So don't panic if you get something that says, 1298 01:09:32,720 --> 01:09:34,720 well, the solution is at infinity, right? 1299 01:09:34,720 --> 01:09:36,285 If this function keeps going, you 1300 01:09:36,285 --> 01:09:37,660 will find that-- 1301 01:09:37,660 --> 01:09:40,490 you won't be able to find a solution other than at infinity. 1302 01:09:40,490 --> 01:09:43,720 You are going to see something like 1 over theta hat 1303 01:09:43,720 --> 01:09:46,359 is equal to 0, or something like this. 1304 01:09:46,359 --> 01:09:48,939 So you know that when you've found this kind of solution, 1305 01:09:48,939 --> 01:09:51,370 you've probably made a mistake at some point. 1306 01:09:51,370 --> 01:09:54,820 And the reason is that for functions like this, 1307 01:09:54,820 --> 01:09:58,150 you don't find the maximum by setting the derivative equal 1308 01:09:58,150 --> 01:09:59,230 to zero.
1309 01:09:59,230 --> 01:10:01,159 You actually just find the maximum by saying, 1310 01:10:01,159 --> 01:10:03,450 well, it's an increasing function on the interval 0, 1, 1311 01:10:03,450 --> 01:10:05,000 so the maximum must be attained at 1. 1312 01:10:07,209 --> 01:10:08,750 So here in this case, that would mean 1313 01:10:08,750 --> 01:10:12,560 that my maximum would be 1. 1314 01:10:12,560 --> 01:10:14,540 My estimator would be 1, which would be weird. 1315 01:10:14,540 --> 01:10:17,316 So typically, the end of the interval here is a function of the xi's. 1316 01:10:17,316 --> 01:10:19,940 So one example that you will see many times is when this guy is 1317 01:10:19,940 --> 01:10:24,870 the maximum of the xi's. 1318 01:10:24,870 --> 01:10:27,210 In which case, the maximum is attained here, 1319 01:10:27,210 --> 01:10:29,190 at the maximum of the xi's. 1320 01:10:29,190 --> 01:10:31,620 OK, so just keep in mind-- 1321 01:10:31,620 --> 01:10:33,840 what I would recommend is every time 1322 01:10:33,840 --> 01:10:36,450 you're trying to take the maximum of a function, 1323 01:10:36,450 --> 01:10:39,320 just try to plot the function in your head. 1324 01:10:39,320 --> 01:10:40,380 It's not too complicated. 1325 01:10:40,380 --> 01:10:44,790 Those things are usually squares, or square roots, 1326 01:10:44,790 --> 01:10:45,630 or logs. 1327 01:10:45,630 --> 01:10:47,430 You know what those functions look like. 1328 01:10:47,430 --> 01:10:50,040 Just plot them in your mind and make sure 1329 01:10:50,040 --> 01:10:52,230 that you find a maximum where the function really 1330 01:10:52,230 --> 01:10:54,210 goes up and then down again. 1331 01:10:54,210 --> 01:10:56,400 If you don't, then that means your maximum 1332 01:10:56,400 --> 01:10:59,370 is achieved at the boundary and you have 1333 01:10:59,370 --> 01:11:01,950 to think differently to get it. 1334 01:11:01,950 --> 01:11:04,590 So the machinery that consists in setting the derivative equal 1335 01:11:04,590 --> 01:11:06,870 to zero works 80% of the time. 1336 01:11:06,870 --> 01:11:08,880 But you do have to be careful. 1337 01:11:08,880 --> 01:11:11,880 And from the context, it will be clear 1338 01:11:11,880 --> 01:11:14,460 that you had to be careful, because you will find 1339 01:11:14,460 --> 01:11:17,190 some crazy stuff, such as having to solve 1 over theta hat 1340 01:11:17,190 --> 01:11:18,090 equals zero. 1341 01:11:23,140 --> 01:11:25,280 All right, so before we conclude, 1342 01:11:25,280 --> 01:11:28,090 I just wanted to give you some intuition about how 1343 01:11:28,090 --> 01:11:30,620 the maximum likelihood performs. 1344 01:11:30,620 --> 01:11:33,070 So there's something called the Fisher information 1345 01:11:33,070 --> 01:11:35,980 that essentially controls how this thing performs. 1346 01:11:35,980 --> 01:11:38,710 And the Fisher information is, essentially, 1347 01:11:38,710 --> 01:11:40,420 a second derivative or a Hessian. 1348 01:11:40,420 --> 01:11:44,980 So if I'm in a one-dimensional parameter case, it's a number, 1349 01:11:44,980 --> 01:11:46,300 it's a second derivative. 1350 01:11:46,300 --> 01:11:51,000 If I'm in a multidimensional case, it's actually a Hessian, 1351 01:11:51,000 --> 01:11:52,780 it's a matrix. 1352 01:11:52,780 --> 01:11:57,800 So I'm going to actually introduce the notation little curly l 1353 01:11:57,800 --> 01:12:00,670 of theta for the log likelihood, OK? 1354 01:12:00,670 --> 01:12:02,910 And that's the log likelihood for one observation.
1355 01:12:02,910 --> 01:12:05,560 So let's call it x generically, but think of it as being x1, 1356 01:12:05,560 --> 01:12:07,480 for example. 1357 01:12:07,480 --> 01:12:09,250 And I don't care about, like, summing, 1358 01:12:09,250 --> 01:12:11,260 because I'm actually going to take the expectation of this thing. 1359 01:12:11,260 --> 01:12:13,176 So it's not going to be a data-driven quantity 1360 01:12:13,176 --> 01:12:14,390 I'm going to play with. 1361 01:12:14,390 --> 01:12:15,806 So now I'm going to assume that it 1362 01:12:15,806 --> 01:12:19,390 is twice differentiable, almost surely, because it's 1363 01:12:19,390 --> 01:12:21,350 a random function. 1364 01:12:21,350 --> 01:12:23,890 And so now I'm going to just sweep under the rug 1365 01:12:23,890 --> 01:12:27,700 some technical conditions for when these things hold. 1366 01:12:27,700 --> 01:12:32,130 So typically, when can I permute integrals and derivatives 1367 01:12:32,130 --> 01:12:35,160 and this kind of stuff that you don't want to think about? 1368 01:12:35,160 --> 01:12:36,730 OK, the rule of thumb is it always 1369 01:12:36,730 --> 01:12:39,589 works until it doesn't, in which case 1370 01:12:39,589 --> 01:12:41,380 that probably means you're actually solving 1371 01:12:41,380 --> 01:12:44,080 some sort of calculus problem. 1372 01:12:44,080 --> 01:12:47,390 Because in practice, it just doesn't happen. 1373 01:12:47,390 --> 01:12:56,010 So the Fisher information is the expectation of the-- 1374 01:12:56,010 --> 01:12:57,790 that's called the outer product. 1375 01:12:57,790 --> 01:13:01,240 So that's the product of this gradient 1376 01:13:01,240 --> 01:13:02,390 and the gradient transpose. 1377 01:13:02,390 --> 01:13:04,540 So that forms a matrix, right? 1378 01:13:04,540 --> 01:13:09,830 That's a matrix, minus the outer product of the expectations. 1379 01:13:09,830 --> 01:13:12,910 So that's really what's called the covariance matrix 1380 01:13:12,910 --> 01:13:16,285 of this vector, nabla of little l of theta, which 1381 01:13:16,285 --> 01:13:18,090 is a random vector. 1382 01:13:18,090 --> 01:13:21,042 So I'm forming the covariance matrix of this thing. 1383 01:13:21,042 --> 01:13:23,250 And the technical conditions tell me that, actually, 1384 01:13:23,250 --> 01:13:26,600 this guy, which depends only on the gradient, 1385 01:13:26,600 --> 01:13:31,115 is actually equal to 1386 01:13:31,115 --> 01:13:32,240 the 1387 01:13:32,240 --> 01:13:36,140 negative expectation of the Hessian. 1388 01:13:36,140 --> 01:13:38,300 So I can actually get a quantity that 1389 01:13:38,300 --> 01:13:40,400 depends on the second derivatives using only 1390 01:13:40,400 --> 01:13:41,740 first derivatives. 1391 01:13:41,740 --> 01:13:44,202 But the expectation is going to play a role here. 1392 01:13:44,202 --> 01:13:45,410 And the fact that it's a log. 1393 01:13:45,410 --> 01:13:48,180 And lots of things actually show up here. 1394 01:13:48,180 --> 01:13:51,220 And so in this case, what I get is that-- 1395 01:13:51,220 --> 01:13:53,510 so in the one-dimensional case, this 1396 01:13:53,510 --> 01:13:56,480 is just the covariance matrix of a one-dimensional thing, which 1397 01:13:56,480 --> 01:13:58,200 is just the variance. 1398 01:13:58,200 --> 01:14:00,050 So the variance of the derivative 1399 01:14:00,050 --> 01:14:04,190 is actually equal to the negative of the expectation 1400 01:14:04,190 --> 01:14:07,080 of the second derivative.
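In the one-dimensional case, this identity, Var(l'(theta)) = -E[l''(theta)], can be checked exactly for a single Bernoulli(p) observation, where l(p) = x log p + (1 - x) log(1 - p) and both sides come out to 1/(p(1 - p)). A sketch; p = 0.3 is an arbitrary choice.

    # Exact check of Var(l'(p)) = -E[l''(p)] for one Bernoulli(p) draw,
    # with l(p) = x*log(p) + (1 - x)*log(1 - p). p = 0.3 is arbitrary.

    p = 0.3

    def score(x, p):          # l'(p)
        return x / p - (1 - x) / (1 - p)

    def l_second(x, p):       # l''(p)
        return -x / p ** 2 - (1 - x) / (1 - p) ** 2

    # Expectations over x in {0, 1} with P(X = 1) = p
    mean_score = p * score(1, p) + (1 - p) * score(0, p)      # equals 0
    var_score = (p * score(1, p) ** 2
                 + (1 - p) * score(0, p) ** 2) - mean_score ** 2
    neg_exp_hess = -(p * l_second(1, p) + (1 - p) * l_second(0, p))

    print(var_score, neg_exp_hess, 1 / (p * (1 - p)))  # all about 4.76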
1401 01:14:07,080 --> 01:14:09,280 OK, so we'll see that next time. 1402 01:14:09,280 --> 01:14:12,600 But what I wanted to emphasize with this is why do 1403 01:14:12,600 --> 01:14:15,109 we care about this quantity? 1404 01:14:15,109 --> 01:14:16,650 That's called the Fisher information. 1405 01:14:16,650 --> 01:14:19,770 Fisher is the founding father of modern statistics. 1406 01:14:19,770 --> 01:14:23,070 Why do we give this quantity his name? 1407 01:14:23,070 --> 01:14:25,546 Well, it's because this quantity is actually very critical. 1408 01:14:25,546 --> 01:14:27,420 What does the second derivative of a function 1409 01:14:27,420 --> 01:14:29,560 tell me at the maximum? 1410 01:14:29,560 --> 01:14:34,350 Well, it's telling me how curved it is, right? 1411 01:14:34,350 --> 01:14:37,780 If I have a zero second derivative, I'm basically flat. 1412 01:14:37,780 --> 01:14:41,137 And if I have a very high second derivative, I'm very curvy. 1413 01:14:41,137 --> 01:14:42,720 And when I'm very curvy, what it means 1414 01:14:42,720 --> 01:14:45,760 is that I'm very robust to the estimation error. 1415 01:14:45,760 --> 01:14:47,160 Remember our estimation strategy, 1416 01:14:47,160 --> 01:14:50,130 which consisted in replacing expectations by averages? 1417 01:14:50,130 --> 01:14:52,830 If I'm extremely curvy, the function can move a little bit, 1418 01:14:52,830 --> 01:14:55,410 and the maximum is not going to move much. 1419 01:14:55,410 --> 01:14:57,280 And this formula here-- 1420 01:14:57,280 --> 01:15:00,090 so forget about the matrix version for a second-- 1421 01:15:00,090 --> 01:15:01,780 is actually telling me exactly-- 1422 01:15:01,780 --> 01:15:06,000 it's telling me the curvature is basically the variance 1423 01:15:06,000 --> 01:15:08,290 of the first derivative. 1424 01:15:08,290 --> 01:15:10,840 And so the more the first derivative fluctuates, 1425 01:15:10,840 --> 01:15:12,930 the more your maximum-- your arg max 1426 01:15:12,930 --> 01:15:14,710 is going to move all over the place. 1427 01:15:14,710 --> 01:15:16,950 So this is really controlling how flat 1428 01:15:16,950 --> 01:15:20,280 your likelihood, your log likelihood, is at its maximum. 1429 01:15:20,280 --> 01:15:23,340 The flatter it is, the more sensitive to fluctuations 1430 01:15:23,340 --> 01:15:24,630 the arg max is going to be. 1431 01:15:24,630 --> 01:15:27,060 The curvier it is, the less sensitive it is. 1432 01:15:27,060 --> 01:15:28,740 And so what we're hoping-- is a good model 1433 01:15:28,740 --> 01:15:31,710 going to be one that has a large or a small value 1434 01:15:31,710 --> 01:15:34,350 for the Fisher information? 1435 01:15:34,350 --> 01:15:36,938 I want this to be-- 1436 01:15:36,938 --> 01:15:38,300 small? 1437 01:15:38,300 --> 01:15:40,070 I want it to be large. 1438 01:15:40,070 --> 01:15:42,290 Because this is the curvature, right? 1439 01:15:42,290 --> 01:15:44,414 This number is negative; it's concave. 1440 01:15:44,414 --> 01:15:45,830 So if I take a negative sign, it's 1441 01:15:45,830 --> 01:15:47,810 going to be something that's positive. 1442 01:15:47,810 --> 01:15:51,230 And the larger this thing, the more curvy it is. 1443 01:15:51,230 --> 01:15:52,730 Oh, yeah, because it's the variance. 1444 01:15:52,730 --> 01:15:53,271 Again, sorry. 1445 01:15:53,271 --> 01:15:55,430 This is what-- 1446 01:15:55,430 --> 01:15:55,930 OK.
1447 01:15:59,480 --> 01:16:02,156 Yeah, maybe I should not go into those details 1448 01:16:02,156 --> 01:16:03,530 because I'm actually out of time. 1449 01:16:03,530 --> 01:16:06,890 But just a spoiler alert: the asymptotic variance-- 1450 01:16:06,890 --> 01:16:09,020 the variance, basically, as n 1451 01:16:09,020 --> 01:16:11,370 goes to infinity, of the maximum likelihood estimator 1452 01:16:11,370 --> 01:16:12,830 is going to be 1 over this guy. 1453 01:16:12,830 --> 01:16:15,260 So we want it to be large, because then the asymptotic variance 1454 01:16:15,260 --> 01:16:16,910 is going to be very small. 1455 01:16:16,910 --> 01:16:18,650 All right, so we're out of time. 1456 01:16:18,650 --> 01:16:20,630 We'll see that next week. 1457 01:16:20,630 --> 01:16:22,730 And I have your homework with me. 1458 01:16:22,730 --> 01:16:25,052 And I will actually hand it back. 1459 01:16:25,052 --> 01:16:26,510 I will give it to you outside so we 1460 01:16:26,510 --> 01:16:28,580 can let the other room come in. 1461 01:16:28,580 --> 01:16:31,630 OK, I'll just leave you the--