1
00:00:00,120 --> 00:00:02,460
The following content is
provided under a Creative

2
00:00:02,460 --> 00:00:03,880
Commons license.

3
00:00:03,880 --> 00:00:06,090
Your support will help
MIT OpenCourseWare

4
00:00:06,090 --> 00:00:10,180
continue to offer high quality
educational resources for free.

5
00:00:10,180 --> 00:00:12,720
To make a donation or to
view additional materials

6
00:00:12,720 --> 00:00:16,680
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:16,680 --> 00:00:17,610
at ocw.mit.edu.

8
00:00:20,340 --> 00:00:23,280
PHILIPPE RIGOLLET:
We're talking about

9
00:00:23,280 --> 00:00:24,390
generalized linear models.

10
00:00:24,390 --> 00:00:25,764
And in generalized
linear models,

11
00:00:25,764 --> 00:00:28,780
we generalize linear
models in two ways.

12
00:00:28,780 --> 00:00:31,170
The first one is to allow
for a different distribution

13
00:00:31,170 --> 00:00:32,910
for the response variables.

14
00:00:32,910 --> 00:00:34,560
And the distributions
that we wanted

15
00:00:34,560 --> 00:00:37,230
was the exponential family.

16
00:00:44,240 --> 00:00:46,910
And this is a family
that can be generalized

17
00:00:46,910 --> 00:00:49,340
over random variables
that are defined

18
00:00:49,340 --> 00:00:52,640
on R^q in general,
with parameters in R^k.

19
00:00:52,640 --> 00:00:58,310
But we're going to focus in
a very specific case when

20
00:00:58,310 --> 00:01:00,710
y is a real valued
response variable, which

21
00:01:00,710 --> 00:01:04,040
is the one you're used to when
you're doing linear regression.

22
00:01:04,040 --> 00:01:09,500
And the parameter
theta also lives in r.

23
00:01:09,500 --> 00:01:12,250
And so we're going to talk
about the canonical case.

24
00:01:12,250 --> 00:01:15,920
So that's the canonical
exponential family,

25
00:01:15,920 --> 00:01:19,760
where you have a density,
f theta of x, which is

26
00:01:19,760 --> 00:01:25,280
of the form, exponential plus.

27
00:01:25,280 --> 00:01:27,800
And then, we have y,
which interacts with theta

28
00:01:27,800 --> 00:01:29,580
only by taking a product.

29
00:01:29,580 --> 00:01:32,990
Then, there's a term that
depends only on theta,

30
00:01:32,990 --> 00:01:35,180
some dispersion parameter phi.

31
00:01:35,180 --> 00:01:37,340
And then, we have some
normalization factor.

32
00:01:37,340 --> 00:01:44,900
Let's call it c of y phi.

33
00:01:44,900 --> 00:01:48,340
So it really should not matter
too much, so it's c of y phi,

34
00:01:48,340 --> 00:01:50,520
and that's really just the
normalization factor.

35
00:01:50,520 --> 00:01:54,010
And here, we're going to
assume that phi is known.
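
Written out, the density just described can be sketched as follows. The symbols y, theta, phi, b, and c follow the lecture; the exact form of c is left unspecified, as it is on the board.

```latex
f_\theta(y) \;=\; \exp\!\left( \frac{y\,\theta - b(\theta)}{\phi} + c(y, \phi) \right),
\qquad y \in \mathbb{R},\ \theta \in \mathbb{R},\ \phi \text{ known}.
```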

36
00:01:57,480 --> 00:01:58,694
I have no idea what I write.

37
00:01:58,694 --> 00:02:00,110
I don't know if
you guys can read.

38
00:02:00,110 --> 00:02:01,943
I don't know what chalk
has been used today,

39
00:02:01,943 --> 00:02:05,480
but I just can't see it.

40
00:02:05,480 --> 00:02:08,694
That's not my fault. All
right, so we're going

41
00:02:08,694 --> 00:02:09,860
to assume that phi is known.

42
00:02:09,860 --> 00:02:12,021
And so we saw that
several distributions

43
00:02:12,021 --> 00:02:14,270
that we know well, including
the Gaussian for example,

44
00:02:14,270 --> 00:02:15,612
belong to this family.

45
00:02:15,612 --> 00:02:17,320
And there's other
ones, such as Poisson--

46
00:02:21,084 --> 00:02:22,000
Poisson and Bernoulli.

47
00:02:22,000 --> 00:02:24,040
So if the PMF has
this form, if you

48
00:02:24,040 --> 00:02:27,610
have a discrete random
variable, this is also valid.

49
00:02:27,610 --> 00:02:29,779
And the reason why we
introduced this family

50
00:02:29,779 --> 00:02:32,320
is because there are going to
be some properties that we know

51
00:02:32,320 --> 00:02:34,960
that this thing here,
this function, b of theta,

52
00:02:34,960 --> 00:02:37,540
is essentially what
completely characterizes

53
00:02:37,540 --> 00:02:38,950
your distribution.

54
00:02:38,950 --> 00:02:42,504
So if phi is fixed, we know that
the interaction is the form.

55
00:02:42,504 --> 00:02:44,170
And this really just
comes from the fact

56
00:02:44,170 --> 00:02:46,490
that we want the function
to integrate to one.

57
00:02:46,490 --> 00:02:49,180
So this b here in
the canonical form

58
00:02:49,180 --> 00:02:50,860
encodes everything
we want to know.

59
00:02:50,860 --> 00:02:53,164
If I tell you what
b of theta is--

60
00:02:53,164 --> 00:02:54,580
and of course, I
tell you what phi

61
00:02:54,580 --> 00:02:56,914
is, but let's say for a second
that phi is equal to one.

62
00:02:56,914 --> 00:02:58,330
If I tell you this
b of theta, you

63
00:02:58,330 --> 00:03:00,420
know exactly what distribution
I'm talking about.

64
00:03:00,420 --> 00:03:02,520
So it should encode
everything that's

65
00:03:02,520 --> 00:03:05,920
specific to this distribution,
such as mean, variance,

66
00:03:05,920 --> 00:03:07,780
all the moments
that you would want.

67
00:03:07,780 --> 00:03:12,310
And we'll see how we can
compute from this thing

68
00:03:12,310 --> 00:03:14,590
the mean and the
variance, for example.

69
00:03:14,590 --> 00:03:16,567
So today, we're going to
talk about likelihood,

70
00:03:16,567 --> 00:03:18,400
and we're going to start
with the likelihood

71
00:03:18,400 --> 00:03:21,100
function or the log likelihood
for one observation.

72
00:03:21,100 --> 00:03:23,440
From this, we're going
to do some computations,

73
00:03:23,440 --> 00:03:26,590
and then, we'll move on to the
actual log likelihood based

74
00:03:26,590 --> 00:03:28,750
on n independent observations.

75
00:03:28,750 --> 00:03:30,730
And here, as we will
see, the observations

76
00:03:30,730 --> 00:03:32,950
are not going to be
identically distributed,

77
00:03:32,950 --> 00:03:35,080
because we're going
to want each of them,

78
00:03:35,080 --> 00:03:39,530
conditionally on x to be a
different function of x, where

79
00:03:39,530 --> 00:03:41,200
theta is just a
different function of x

80
00:03:41,200 --> 00:03:43,230
for each of the observations.

81
00:03:43,230 --> 00:03:45,649
So remember, the
log likelihood--

82
00:03:50,050 --> 00:03:52,630
and this is for
one observation--

83
00:03:52,630 --> 00:03:54,400
is just the log of
the density, right?

84
00:03:59,090 --> 00:04:02,840
And we have this
identity that I mentioned

85
00:04:02,840 --> 00:04:04,530
at the end of the
class on Tuesday.

86
00:04:04,530 --> 00:04:06,800
And this identity is
just that the expectation

87
00:04:06,800 --> 00:04:08,960
of the derivative of this
guy with respect to theta

88
00:04:08,960 --> 00:04:10,430
is equal to 0.

89
00:04:10,430 --> 00:04:11,330
So let's see why.

90
00:04:11,330 --> 00:04:15,610
So if I take the derivative
with respect to theta, of log f,

91
00:04:15,610 --> 00:04:18,860
theta of x, what I
get is the derivative

92
00:04:18,860 --> 00:04:21,930
with respect to
theta of f, theta

93
00:04:21,930 --> 00:04:26,880
of x, divided by f theta of x.

94
00:04:26,880 --> 00:04:30,820
Now, if I take the
expectation of this guy,

95
00:04:30,820 --> 00:04:37,810
with respect to this
theta as well, what I get

96
00:04:37,810 --> 00:04:40,210
is that this thing--
what is the expectation?

97
00:04:40,210 --> 00:04:42,970
Well, it's just the
integral against f theta.

98
00:04:42,970 --> 00:04:45,040
Or if I'm in a
discrete case, I just

99
00:04:45,040 --> 00:04:48,590
have the sum against f
theta, if it's a pmf.

100
00:04:48,590 --> 00:04:53,320
Just the definition,
the expectation of x,

101
00:04:53,320 --> 00:04:56,790
is either the integral--
well, let's say of h of x--

102
00:04:56,790 --> 00:04:59,770
is integral of h of x.

103
00:04:59,770 --> 00:05:01,390
f theta of x--

104
00:05:01,390 --> 00:05:04,090
if this is continuous,
or is just the sum

105
00:05:04,090 --> 00:05:07,960
of h of x, f theta of x.

106
00:05:07,960 --> 00:05:09,880
If x is discrete--

107
00:05:09,880 --> 00:05:15,115
so if it's continuous,
you put this soft sum.

108
00:05:15,115 --> 00:05:17,110
This guy is the
same thing, right?

109
00:05:17,110 --> 00:05:20,450
So I'm just going to illustrate
the case when it's continuous.

110
00:05:20,450 --> 00:05:21,300
So this is what?

111
00:05:21,300 --> 00:05:24,790
Well, this is the integral of
partial derivative with respect

112
00:05:24,790 --> 00:05:29,740
to theta, of f theta of
x, divided by f theta

113
00:05:29,740 --> 00:05:35,060
of x, all times f theta of x--

114
00:05:35,060 --> 00:05:36,526
dx.

115
00:05:36,526 --> 00:05:38,627
And now, this f
theta is canceled,

116
00:05:38,627 --> 00:05:40,210
so I'm actually left
with the integral

117
00:05:40,210 --> 00:05:41,290
of the derivative,
which I'm going

118
00:05:41,290 --> 00:05:43,081
to write as the derivative
of the integral.

119
00:05:50,690 --> 00:06:01,640
But f theta being density
for any value of theta

120
00:06:01,640 --> 00:06:04,150
that I can take,
this is the function.

121
00:06:04,150 --> 00:06:07,720
As a function of
theta, this function

122
00:06:07,720 --> 00:06:10,670
is constantly equal to 1.

123
00:06:10,670 --> 00:06:13,710
For any theta that I take
it, it takes value of 1.

124
00:06:13,710 --> 00:06:16,344
So this is constantly
equal to 1.

125
00:06:16,344 --> 00:06:18,510
I put three bars to see
that for any value of theta,

126
00:06:18,510 --> 00:06:21,830
this is 1, which actually
tells me that the derivative is

127
00:06:21,830 --> 00:06:24,200
equal to 0.
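
The chain of equalities for this first identity can be written compactly. This is a sketch for the continuous case and assumes enough regularity to swap the derivative and the integral, which the lecture takes for granted.

```latex
\mathbb{E}_\theta\!\left[ \frac{\partial}{\partial\theta} \log f_\theta(X) \right]
= \int \frac{\partial_\theta f_\theta(x)}{f_\theta(x)}\, f_\theta(x)\, dx
= \int \partial_\theta f_\theta(x)\, dx
= \frac{\partial}{\partial\theta} \underbrace{\int f_\theta(x)\, dx}_{\equiv\, 1}
= 0 .
```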

128
00:06:24,200 --> 00:06:25,010
OK, yes?

129
00:06:30,455 --> 00:06:32,930
AUDIENCE: What is
the first [INAUDIBLE]

130
00:06:32,930 --> 00:06:34,415
that you wrote on the board?

131
00:06:38,666 --> 00:06:40,540
PHILIPPE RIGOLLET: That's
just the definition

132
00:06:40,540 --> 00:06:44,396
of the derivative of
the log of a function?

133
00:06:44,396 --> 00:06:45,364
AUDIENCE: OK.

134
00:06:49,720 --> 00:06:53,660
PHILIPPE RIGOLLET: Log of f,
prime, is f prime over f.

135
00:06:53,660 --> 00:06:56,060
That's a log, yeah.

136
00:06:56,060 --> 00:06:59,735
Just by elimination.

137
00:06:59,735 --> 00:07:01,652
AUDIENCE: [INAUDIBLE]

138
00:07:01,652 --> 00:07:02,860
PHILIPPE RIGOLLET: I'm sorry.

139
00:07:02,860 --> 00:07:05,276
AUDIENCE: When you write a
squiggle that starts with an l,

140
00:07:05,276 --> 00:07:06,680
I assume it's lambda.

141
00:07:06,680 --> 00:07:09,420
PHILIPPE RIGOLLET: And you do
good, because that's probably

142
00:07:09,420 --> 00:07:11,370
how my mind processes.

143
00:07:11,370 --> 00:07:13,310
And so I'm like, yeah, l.

144
00:07:13,310 --> 00:07:16,820
Here is enough information.

145
00:07:16,820 --> 00:07:19,290
OK, everybody is good with this?

146
00:07:19,290 --> 00:07:21,260
So that was convenient.

147
00:07:21,260 --> 00:07:22,860
So it just said
that the expectation

148
00:07:22,860 --> 00:07:26,970
of the derivative of the log
likelihood is equal to 0.

149
00:07:26,970 --> 00:07:29,340
That's going to be
our first identity.

150
00:07:29,340 --> 00:07:30,900
Let's move onto the
second identity,

151
00:07:30,900 --> 00:07:34,140
using exactly the same
trick, which is let's hope

152
00:07:34,140 --> 00:07:35,850
that at some point,
we have the integral

153
00:07:35,850 --> 00:07:37,266
of this function
that's constantly

154
00:07:37,266 --> 00:07:41,550
equal to 1 as a function of
theta, and then use the fact

155
00:07:41,550 --> 00:07:43,650
that its derivative
is equal to 0.

156
00:07:43,650 --> 00:07:54,850
So if I start taking the
second derivative of the log

157
00:07:54,850 --> 00:07:57,470
of f theta, so what is this?

158
00:07:57,470 --> 00:08:00,600
Well, it's the
derivative of this guy

159
00:08:00,600 --> 00:08:02,720
here, so I'm going
to go straight to it.

160
00:08:02,720 --> 00:08:08,830
So it's second derivative,
f theta of x, times

161
00:08:08,830 --> 00:08:19,810
f theta of x, minus first
derivative of f theta of x,

162
00:08:19,810 --> 00:08:22,160
times first derivative
of f theta of x.

163
00:08:26,360 --> 00:08:29,670
Here is some super
important stuff--

164
00:08:29,670 --> 00:08:31,740
no, I'm kidding.

165
00:08:31,740 --> 00:08:34,080
So you can still see
that guy over there?

166
00:08:34,080 --> 00:08:35,870
So it's just the square.

167
00:08:35,870 --> 00:08:38,100
And then, I divide by
f theta of x squared.

168
00:08:43,890 --> 00:08:48,340
So here I have the second
derivative, times f itself.

169
00:08:48,340 --> 00:08:51,630
And here, I have the product
of the first derivative

170
00:08:51,630 --> 00:08:52,160
with itself.

171
00:08:52,160 --> 00:08:53,544
So that's the square.

172
00:08:53,544 --> 00:08:55,210
So now, I'm going to
integrate this guy.

173
00:08:55,210 --> 00:09:01,550
So if I take the expectation
of this thing here, what I get

174
00:09:01,550 --> 00:09:03,741
is the integral.

175
00:09:03,741 --> 00:09:05,240
So here, the only
thing that's going

176
00:09:05,240 --> 00:09:07,073
to happen when I'm going
to take my integral

177
00:09:07,073 --> 00:09:09,380
is that one of those
squares is going to cancel

178
00:09:09,380 --> 00:09:10,940
against f theta, right?

179
00:09:10,940 --> 00:09:22,430
So I'm going to get
the second derivative

180
00:09:22,430 --> 00:09:24,830
minus the first
derivative squared.

181
00:09:32,950 --> 00:09:34,560
And then, I'm
divided by f theta.

182
00:09:38,120 --> 00:09:39,850
And I know that this
thing is equal to 0.

183
00:09:44,460 --> 00:09:46,660
Now, one of these guys here--

184
00:09:46,660 --> 00:09:48,180
sorry, why do I have--

185
00:09:48,180 --> 00:09:49,350
so I have this guy here.

186
00:09:49,350 --> 00:09:50,865
So this guy here
is going to cancel.

187
00:09:53,440 --> 00:09:57,320
So this is what
this is equal to--

188
00:09:59,970 --> 00:10:05,595
the integral of the partial,
so the second derivative of f

189
00:10:05,595 --> 00:10:09,620
theta of x, because
those two guys cancel--

190
00:10:09,620 --> 00:10:14,156
minus the integral of
the second derivative.

191
00:10:24,380 --> 00:10:28,360
And this is telling me what?

192
00:10:55,480 --> 00:10:58,040
Yeah, I'm losing one, because
I have some weird sequences.

193
00:10:58,040 --> 00:10:58,539
Thank you.

194
00:11:03,210 --> 00:11:12,282
OK, this is still positive.

195
00:11:12,282 --> 00:11:14,490
I want to say that this
thing is actually equal to 0.

196
00:11:17,040 --> 00:11:19,410
But then, it gives
me some weird things,

197
00:11:19,410 --> 00:11:24,150
which are that I
have an integral

198
00:11:24,150 --> 00:11:26,080
of a positive function,
which is equal to 0.

199
00:11:32,814 --> 00:11:34,480
Yeah, that's what I'm
thinking of doing.

200
00:11:34,480 --> 00:11:37,230
But I'm going to get 0 for
this entire integral, which

201
00:11:37,230 --> 00:11:39,729
means that I have the integral
of a positive function, which

202
00:11:39,729 --> 00:11:42,810
is equal to 0, which means that
this function is equal to 0,

203
00:11:42,810 --> 00:11:44,940
which sounds a little bad--

204
00:11:44,940 --> 00:11:48,310
basically, tells me that this
function, f theta, is linear.

205
00:11:52,710 --> 00:11:55,259
So I went a little
too far, I believe,

206
00:11:55,259 --> 00:11:57,300
because I only want to
prove that the expectation

207
00:11:57,300 --> 00:11:58,937
of the second derivative--

208
00:12:24,190 --> 00:12:25,970
Yes, so I want to pull this out.

209
00:12:31,020 --> 00:12:35,229
So let's see, if I keep rolling
with this, I'm going to get--

210
00:12:35,229 --> 00:12:37,520
well, no because the fact
that it's divided by f theta,

211
00:12:37,520 --> 00:12:40,670
means that, indeed, the second
derivative is equal to 0.

212
00:12:40,670 --> 00:12:41,960
So I cannot do this here.

213
00:12:49,446 --> 00:12:51,438
AUDIENCE: [INAUDIBLE]

214
00:12:59,920 --> 00:13:03,120
PHILIPPE RIGOLLET: OK, but
let's write it like this.

215
00:13:03,120 --> 00:13:05,020
You're right, so this is what?

216
00:13:05,020 --> 00:13:12,200
This is the expectation of
the partial with respect

217
00:13:12,200 --> 00:13:15,740
to theta of f
theta of x, divided

218
00:13:15,740 --> 00:13:21,360
by f theta of x squared.

219
00:13:21,360 --> 00:13:25,530
And this is exactly the
derivative of the log, right?

220
00:13:25,530 --> 00:13:28,830
So indeed, this thing is equal
to the expectation with respect

221
00:13:28,830 --> 00:13:34,980
to theta of the partial of l--

222
00:13:34,980 --> 00:13:41,160
of log f theta, divided
by partial theta.

223
00:13:41,160 --> 00:13:44,660
All right, so this is one of
the guys that I want squared.

224
00:13:44,660 --> 00:13:47,510
This is one of the
guys that I want.

225
00:13:47,510 --> 00:13:50,810
And this is actually
equal, so this will

226
00:13:50,810 --> 00:13:53,270
be equal to the expectation--

227
00:13:56,186 --> 00:13:58,130
AUDIENCE: [INAUDIBLE]

228
00:13:59,110 --> 00:14:02,860
PHILIPPE RIGOLLET: Oh, right, so
this term should be equal to 0.

229
00:14:02,860 --> 00:14:03,940
This was not 0.

230
00:14:03,940 --> 00:14:04,990
You're absolutely right.

231
00:14:04,990 --> 00:14:06,672
So at some point,
I got confused,

232
00:14:06,672 --> 00:14:08,380
because I thought
putting this equal to 0

233
00:14:08,380 --> 00:14:09,463
would mean that this is 0.

234
00:14:09,463 --> 00:14:10,840
But this thing is
not equal to 0.

235
00:14:10,840 --> 00:14:11,650
So this thing, you're right.

236
00:14:11,650 --> 00:14:13,858
I take the same trick as
before, and this is actually

237
00:14:13,858 --> 00:14:16,900
equal to 0, which
means that now I have

238
00:14:16,900 --> 00:14:19,690
what's on the left-hand side,
which is equal to what's

239
00:14:19,690 --> 00:14:20,720
on the right-hand side.

240
00:14:20,720 --> 00:14:23,220
And if I recap, I
get that e theta

241
00:14:23,220 --> 00:14:30,750
of the second derivative
of the log of f theta

242
00:14:30,750 --> 00:14:32,670
is equal to minus--

243
00:14:32,670 --> 00:14:34,360
because I had a
minus sign here--

244
00:14:34,360 --> 00:14:39,300
to the expectation with respect
to theta, of partial of log f theta,

245
00:14:39,300 --> 00:14:44,100
divided by partial theta, squared.

246
00:14:44,100 --> 00:14:48,720
Thank you for being on watch
when I'm falling apart.

247
00:14:48,720 --> 00:14:50,390
All right, so this
is exactly what

248
00:14:50,390 --> 00:14:52,140
you have here, except
that both terms have

249
00:14:52,140 --> 00:14:54,180
been put on the same side.
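
Since the board work took a detour, here is a cleaned-up sketch of the second identity. It uses the same trick as the first one and assumes the derivative and the integral can be exchanged twice.

```latex
\frac{\partial^2}{\partial\theta^2} \log f_\theta(x)
= \frac{\partial_\theta^2 f_\theta(x)\, f_\theta(x) - \big(\partial_\theta f_\theta(x)\big)^2}{f_\theta(x)^2},
\\[6pt]
\mathbb{E}_\theta\!\left[ \frac{\partial^2}{\partial\theta^2} \log f_\theta(X) \right]
= \underbrace{\int \partial_\theta^2 f_\theta(x)\, dx}_{=\, \partial_\theta^2 1 \,=\, 0}
\;-\; \int \left( \frac{\partial_\theta f_\theta(x)}{f_\theta(x)} \right)^{2} f_\theta(x)\, dx
= -\,\mathbb{E}_\theta\!\left[ \left( \frac{\partial}{\partial\theta} \log f_\theta(X) \right)^{2} \right].
```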

250
00:14:54,180 --> 00:14:57,550
All right, so those things
are going to be useful to us,

251
00:14:57,550 --> 00:14:59,880
so maybe, we should write
them somewhere here.

252
00:15:13,820 --> 00:15:16,210
And then, we have
that the expectation

253
00:15:16,210 --> 00:15:26,090
of the second
derivative of the log

254
00:15:26,090 --> 00:15:33,170
is equal to minus the
expectation of the square

255
00:15:33,170 --> 00:15:34,941
of the first derivative.
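
Both identities can be checked numerically on a concrete model. The sketch below uses a Poisson distribution parameterized by its mean mu, which is a choice made here for illustration and not something done at this point in the lecture; the helper names and the truncation at kmax are also assumptions.

```python
import math

def poisson_pmf(k, mu):
    # mu^k * exp(-mu) / k!, computed on the log scale for stability
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def score(k, mu):
    # d/dmu of log f_mu(k) = k*log(mu) - mu - log(k!)
    return k / mu - 1.0

def second_derivative(k, mu):
    # d^2/dmu^2 of log f_mu(k)
    return -k / mu**2

def expectation(h, mu, kmax=200):
    # Truncated sum of h(k) * f_mu(k); the tail beyond kmax is negligible here
    return sum(h(k) * poisson_pmf(k, mu) for k in range(kmax))

mu = 3.5
mean_score = expectation(lambda k: score(k, mu), mu)
mean_second = expectation(lambda k: second_derivative(k, mu), mu)
mean_score_sq = expectation(lambda k: score(k, mu) ** 2, mu)

print(mean_score)                   # first identity: should be close to 0
print(mean_second, -mean_score_sq)  # second identity: the two should agree
```

The two printed values on the last line are both minus the Fisher information of this model.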

256
00:15:40,020 --> 00:15:42,750
And this is, indeed,
my Fisher information.

257
00:15:42,750 --> 00:15:48,030
This is just telling me what is
the second derivative of my log

258
00:15:48,030 --> 00:15:49,217
likelihood at theta, right?

259
00:15:49,217 --> 00:15:50,800
So everything is
with respect to theta

260
00:15:50,800 --> 00:15:52,440
when I take these expectations.

261
00:15:52,440 --> 00:15:55,140
And so it tells me
that the expectation

262
00:15:55,140 --> 00:15:58,530
of the second derivative--
at least first of all, what

263
00:15:58,530 --> 00:16:00,150
it's telling me is
that it's concave,

264
00:16:00,150 --> 00:16:02,970
because the second
derivative of this thing,

265
00:16:02,970 --> 00:16:05,910
which is the second
derivative of kl divergence,

266
00:16:05,910 --> 00:16:09,436
is actually minus something
which must be non-negative.

267
00:16:09,436 --> 00:16:11,310
And so it's telling me
that it's concave here

268
00:16:11,310 --> 00:16:14,646
at this [INAUDIBLE].

269
00:16:14,646 --> 00:16:16,270
And in particular,
it's also telling me

270
00:16:16,270 --> 00:16:19,240
that it has to be strictly
positive, unless the derivative

271
00:16:19,240 --> 00:16:21,040
of f is equal to 0.

272
00:16:21,040 --> 00:16:27,700
Unless f is constant, then
it's not going to change.

273
00:16:27,700 --> 00:16:32,920
All right, do you
have a question?

274
00:16:32,920 --> 00:16:35,660
So now, let's use this.

275
00:16:35,660 --> 00:16:37,660
So what does my
log likelihood look

276
00:16:37,660 --> 00:16:41,390
like when I actually compute it
for this canonical exponential

277
00:16:41,390 --> 00:16:42,760
family.

278
00:16:42,760 --> 00:16:45,310
We have this exponential
function, so taking the log

279
00:16:45,310 --> 00:16:48,260
should make my life much
easier, and indeed, it does.

280
00:16:48,260 --> 00:16:56,250
So if I look at the
canonical, what I have

281
00:16:56,250 --> 00:17:04,339
is that the log of f
theta of x, it's equal

282
00:17:04,339 --> 00:17:10,849
simply to y theta minus b
of theta, divided by phi,

283
00:17:10,849 --> 00:17:14,880
plus this function that
does not depend on theta.

284
00:17:18,790 --> 00:17:20,440
So let's see what this tells me.

285
00:17:20,440 --> 00:17:23,329
Let's just plug-in those
equalities in there.

286
00:17:23,329 --> 00:17:25,329
I can take the derivative
of the right-hand side

287
00:17:25,329 --> 00:17:28,060
and just say that in
expectation, it's equal to 0.

288
00:17:28,060 --> 00:17:32,300
So if I start looking
at the derivative,

289
00:17:32,300 --> 00:17:33,780
this is equal to what?

290
00:17:33,780 --> 00:17:37,820
Well, here I'm going
to pick up only y.

291
00:17:37,820 --> 00:17:39,160
Sorry, this is a function of y.

292
00:17:46,585 --> 00:17:48,460
I was talking about
likelihood, so I actually

293
00:17:48,460 --> 00:17:50,630
need to put the
random variable here.

294
00:17:50,630 --> 00:17:54,750
So I get y minus the
derivative of b of theta.

295
00:17:54,750 --> 00:17:56,250
Since it's only a
function of theta,

296
00:17:56,250 --> 00:17:58,180
I'm just going to write
b prime, is that OK--

297
00:17:58,180 --> 00:18:01,450
rather than having the
partial with respect to theta.

298
00:18:01,450 --> 00:18:02,932
Then, this is a constant.

299
00:18:02,932 --> 00:18:04,890
This does not depend on
theta, so it goes away.

300
00:18:10,200 --> 00:18:15,970
So if I start taking the
expectation of this guy,

301
00:18:15,970 --> 00:18:20,200
I get the expectation
of this guy,

302
00:18:20,200 --> 00:18:24,960
which is the expectation
of y, minus-- well,

303
00:18:24,960 --> 00:18:27,100
this does not depend on
y, so it's just itself--

304
00:18:27,100 --> 00:18:28,390
b prime of theta.

305
00:18:28,390 --> 00:18:30,460
And the whole thing
is divided by phi.

306
00:18:30,460 --> 00:18:33,100
But from my first
equality over there,

307
00:18:33,100 --> 00:18:35,020
I know that this thing
is actually equal to 0.

308
00:18:38,680 --> 00:18:40,740
We just proved that.

309
00:18:40,740 --> 00:18:43,420
So in particular, it means
that since phi is non-zero,

310
00:18:43,420 --> 00:18:45,550
it means that this guy
must be equal to this guy--

311
00:18:45,550 --> 00:18:47,500
or phi is not infinity.

312
00:18:47,500 --> 00:18:50,395
And so that implies
that the expectation

313
00:18:50,395 --> 00:18:56,310
with respect to theta of y
is equal to b prime of theta.
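
The computation of the mean can be summarized in two lines, using the canonical form of the log likelihood and the first identity.

```latex
\frac{\partial}{\partial\theta} \log f_\theta(Y) = \frac{Y - b'(\theta)}{\phi},
\qquad
0 = \mathbb{E}_\theta\!\left[ \frac{Y - b'(\theta)}{\phi} \right]
= \frac{\mathbb{E}_\theta[Y] - b'(\theta)}{\phi}
\;\Longrightarrow\;
\mathbb{E}_\theta[Y] = b'(\theta).
```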

314
00:19:02,322 --> 00:19:04,280
I'm sorry, you're not
registered in this class.

315
00:19:04,280 --> 00:19:07,230
I'm going to have
to ask you to leave.

316
00:19:07,230 --> 00:19:09,150
I'm not kidding.

317
00:19:09,150 --> 00:19:10,570
AUDIENCE: [INAUDIBLE]

318
00:19:11,070 --> 00:19:12,520
PHILIPPE RIGOLLET: You are?

319
00:19:12,520 --> 00:19:13,970
I've never seen you here.

320
00:19:13,970 --> 00:19:15,861
I saw you for the first lecture.

321
00:19:15,861 --> 00:19:16,360
OK.

322
00:19:19,120 --> 00:19:23,105
All right, so e theta of y
is equal to b prime of theta.

323
00:19:23,105 --> 00:19:24,230
Everybody agrees with that?

324
00:19:27,320 --> 00:19:31,130
So this is actually nice,
because if I give you

325
00:19:31,130 --> 00:19:34,190
an exponential family, the only
thing I really need to tell

326
00:19:34,190 --> 00:19:36,210
you is what b of theta is.

327
00:19:36,210 --> 00:19:38,780
And if I give you b of theta,
then computing a derivative

328
00:19:38,780 --> 00:19:42,470
is actually much easier
than having to integrate y

329
00:19:42,470 --> 00:19:44,000
against the density itself.

330
00:19:44,000 --> 00:19:46,010
You could really
have fun and try

331
00:19:46,010 --> 00:19:48,310
to compute this, which you
would be able to do, right?

332
00:19:54,080 --> 00:19:58,840
And then, there's the plus c
of y phi, blah, blah, blah--

333
00:19:58,840 --> 00:19:59,385
dy.

334
00:19:59,385 --> 00:20:01,760
And that's the way you would
actually compute this thing.

335
00:20:05,040 --> 00:20:06,680
Sorry, this guy is here.

336
00:20:06,680 --> 00:20:07,940
That would be painful.

337
00:20:07,940 --> 00:20:10,310
I don't know what this
normalization looks like,

338
00:20:10,310 --> 00:20:12,230
so I would have to also make that explicit,
also explicit that,

339
00:20:12,230 --> 00:20:13,790
so I can actually
compute this thing.

340
00:20:13,790 --> 00:20:15,415
And you know, just
the same way, if you

341
00:20:15,415 --> 00:20:17,852
want to compute the
expectation of a Gaussian--

342
00:20:17,852 --> 00:20:19,310
well, the expectation
of a Gaussian

343
00:20:19,310 --> 00:20:20,750
is not the most difficult one.

344
00:20:20,750 --> 00:20:23,380
But even if you compute the
expectation of a Poisson,

345
00:20:23,380 --> 00:20:25,130
you start to have to
work a little bit.

346
00:20:25,130 --> 00:20:27,200
There's a few things that
you have to work through.

347
00:20:27,200 --> 00:20:29,199
Here, I'm just telling
you, all you have to know

348
00:20:29,199 --> 00:20:30,740
is what b of theta
is, and then, you

349
00:20:30,740 --> 00:20:33,190
can just take the derivative.

350
00:20:33,190 --> 00:20:35,750
Let's see what the second
equality is going to give us.

351
00:20:56,490 --> 00:21:00,830
OK, so what is the
second equality?

352
00:21:00,830 --> 00:21:03,850
It's telling me that if I
look at the second derivative,

353
00:21:03,850 --> 00:21:07,410
and then I take its
expectation, I'm

354
00:21:07,410 --> 00:21:11,190
going to have something which
is equal to negative this guy

355
00:21:11,190 --> 00:21:13,059
squared.

356
00:21:13,059 --> 00:21:14,350
Sorry, that was the log, right?

357
00:21:19,970 --> 00:21:22,700
We've already computed
this first derivative

358
00:21:22,700 --> 00:21:24,380
of the likelihood.

359
00:21:24,380 --> 00:21:29,390
It's just the expectation of
the square of this thing here.

360
00:21:29,390 --> 00:21:34,070
So expectation of
the derivative,

361
00:21:34,070 --> 00:21:38,900
with respect to theta of
log, f theta of x, divided

362
00:21:38,900 --> 00:21:44,130
by partial theta squared.

363
00:21:44,130 --> 00:21:50,040
This is equal to the
expectation of the square of y,

364
00:21:50,040 --> 00:21:58,580
minus b theta, divided
by phi squared--

365
00:21:58,580 --> 00:21:59,720
b prime, theta squared.

366
00:22:04,375 --> 00:22:06,500
OK, sorry, I'm actually
going to move on with the--

367
00:22:11,100 --> 00:22:13,320
so if I start computing,
what is this thing?

368
00:22:13,320 --> 00:22:16,350
Well, we just agreed
that this was what?

369
00:22:19,580 --> 00:22:22,920
The expectation of theta, right?

370
00:22:22,920 --> 00:22:27,120
So that's just the
expectation of y.

371
00:22:27,120 --> 00:22:28,230
We just computed it here.

372
00:22:31,548 --> 00:22:32,970
AUDIENCE: [INAUDIBLE]

373
00:22:35,744 --> 00:22:37,410
PHILIPPE RIGOLLET:
Yeah, that's b prime.

374
00:22:37,410 --> 00:22:39,050
There's a derivative here.

375
00:22:44,760 --> 00:22:47,900
So now, this is what?

376
00:22:47,900 --> 00:22:57,680
This is simply-- anyone?

377
00:23:01,580 --> 00:23:02,810
I'm sorry?

378
00:23:02,810 --> 00:23:05,660
Variance of y, but you're
scaling by phi squared.

379
00:23:11,040 --> 00:23:15,390
OK, so this is negative
of the right-hand side

380
00:23:15,390 --> 00:23:17,250
of our equality.

381
00:23:17,250 --> 00:23:21,810
And now, I just have to take
one more derivative to this guy.

382
00:23:21,810 --> 00:23:27,420
So now, if I look at
the left-hand side now,

383
00:23:27,420 --> 00:23:29,920
I have that the
second derivative

384
00:23:29,920 --> 00:23:38,890
of log, of f theta of y, divided
by partial of theta squared.

385
00:23:38,890 --> 00:23:40,680
So this thing is equal to--

386
00:23:40,680 --> 00:23:42,710
well, no, I'm not
left with much.

387
00:23:42,710 --> 00:23:44,630
The y part is
going to go away,

388
00:23:44,630 --> 00:23:47,590
and I'm left only with the
second derivative of b of theta--

389
00:23:47,590 --> 00:23:49,930
minus the second derivative of
b of theta, divided by phi.

390
00:24:00,540 --> 00:24:03,790
So if I take expectation--

391
00:24:03,790 --> 00:24:05,360
well, it just doesn't change.

392
00:24:08,010 --> 00:24:09,320
This is deterministic.

393
00:24:09,320 --> 00:24:11,240
So now, what I've
established is that this guy

394
00:24:11,240 --> 00:24:14,010
is equal to negative this guy.

395
00:24:14,010 --> 00:24:17,050
So those two things, the
signs are going to go away.

396
00:24:17,050 --> 00:24:20,910
And so this implies
that the variance of y

397
00:24:20,910 --> 00:24:25,800
is equal to b prime prime theta.

398
00:24:25,800 --> 00:24:30,240
And then, I have a phi
square in denominator

399
00:24:30,240 --> 00:24:33,180
that cancels only one of the
phi squares, so it's times phi.
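
Putting the two sides of the second identity next to each other gives the variance formula. This is a sketch of the steps just described, with the same symbols as on the board.

```latex
\frac{\partial^2}{\partial\theta^2} \log f_\theta(Y) = -\frac{b''(\theta)}{\phi},
\qquad
\mathbb{E}_\theta\!\left[ \left( \frac{Y - b'(\theta)}{\phi} \right)^{2} \right]
= \frac{\operatorname{Var}_\theta(Y)}{\phi^{2}},
\\[6pt]
-\frac{b''(\theta)}{\phi} = -\frac{\operatorname{Var}_\theta(Y)}{\phi^{2}}
\;\Longrightarrow\;
\operatorname{Var}_\theta(Y) = \phi\, b''(\theta).
```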

400
00:24:37,480 --> 00:24:41,140
So now, I have that my second
derivative-- since I know phi

401
00:24:41,140 --> 00:24:46,280
is completely
determining the variance.
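
As a quick sanity check of both formulas, one can take a Bernoulli variable. The sketch assumes the standard canonical form with theta = log(p / (1 - p)), b(theta) = log(1 + e^theta), and phi = 1, which is not derived in this part of the lecture; the finite-difference helpers are hypothetical names used only here.

```python
import math

def b(theta):
    # Assumed log-partition function of the Bernoulli in canonical form
    return math.log1p(math.exp(theta))

def derivative(f, x, h=1e-5):
    # Central finite difference for f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    # Central finite difference for f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

p = 0.3
theta = math.log(p / (1 - p))  # canonical parameter for success probability p
phi = 1.0                      # dispersion, assumed known and equal to 1

mean_from_b = derivative(b, theta)             # should approximate E[Y] = p
var_from_b = phi * second_derivative(b, theta) # should approximate p * (1 - p)

print(mean_from_b, p)
print(var_from_b, p * (1 - p))
```

No integral or sum against the pmf is needed; differentiating b is enough.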

402
00:24:46,280 --> 00:24:52,470
So basically, that's why b is
called the cumulant generating

403
00:24:52,470 --> 00:24:52,970
function.

404
00:24:52,970 --> 00:24:55,430
It's not generating
moments, but cumulants.

405
00:24:55,430 --> 00:24:59,180
But cumulants, in this
case, correspond, basically,

406
00:24:59,180 --> 00:25:01,060
to the moments, at
least for the first two.

407
00:25:01,060 --> 00:25:03,890
If I start going
farther, I'm going

408
00:25:03,890 --> 00:25:08,090
to have more combinations of
the expectation of y cubed, y squared,

409
00:25:08,090 --> 00:25:09,530
and y itself.

410
00:25:13,150 --> 00:25:14,770
But as we know,
those are the ones

411
00:25:14,770 --> 00:25:17,170
that are usually the
most useful, at least

412
00:25:17,170 --> 00:25:19,384
if we're interested in
asymptotic performance.

413
00:25:19,384 --> 00:25:20,800
The central limit
theorem tells us

414
00:25:20,800 --> 00:25:23,380
that all that matters are
the first two moments,

415
00:25:23,380 --> 00:25:25,580
and then, the rest is just
going to go and say well,

416
00:25:25,580 --> 00:25:26,330
it doesn't matter.

417
00:25:26,330 --> 00:25:29,300
It's all going to
[INAUDIBLE] anyway.

418
00:25:29,300 --> 00:25:31,290
So let's go to a
Poisson for example.

419
00:25:31,290 --> 00:25:33,518
So if I had a Poisson
distribution--

420
00:25:39,910 --> 00:25:42,430
so this is a discrete
distribution.

421
00:25:42,430 --> 00:25:46,390
And what I know is that f--

422
00:25:46,390 --> 00:25:51,580
let me call mu the
parameter of y.

423
00:25:56,290 --> 00:26:01,870
So it's mu to the y, divided
by y factorial, exponential

424
00:26:01,870 --> 00:26:02,650
minus mu.

425
00:26:02,650 --> 00:26:04,570
OK so mu is usually
called lambda,

426
00:26:04,570 --> 00:26:06,294
and y is usually
called x, that's

427
00:26:06,294 --> 00:26:07,960
why it takes me a
little bit of time.

428
00:26:07,960 --> 00:26:09,626
But usually, it's
lambda to the x over

429
00:26:09,626 --> 00:26:13,810
factorial x, exponential
minus lambda.

430
00:26:13,810 --> 00:26:16,490
Since this is just the series
expansion of the exponential

431
00:26:16,490 --> 00:26:19,230
when I sum those things
from 0 to infinity,

432
00:26:19,230 --> 00:26:20,949
this thing sums to 1.

433
00:26:20,949 --> 00:26:22,990
But then, if I wanted to
start understanding what

434
00:26:22,990 --> 00:26:25,900
the expectation
of this thing is--

435
00:26:25,900 --> 00:26:28,340
so if I want to understand
the expectation with respect

436
00:26:28,340 --> 00:26:33,820
to mu of y, then, I would
have to compute the sum

437
00:26:33,820 --> 00:26:48,280
from k equals 0 to infinity
of k, times mu to the k,

438
00:26:48,280 --> 00:26:51,587
over factorial of k,
exponential minus mu--

439
00:26:51,587 --> 00:26:53,170
which means that I
would, essentially,

440
00:26:53,170 --> 00:27:06,090
have to take the derivative
of my series in the end.

441
00:27:06,090 --> 00:27:07,174
So I can do this.

442
00:27:07,174 --> 00:27:08,340
This is a standard exercise.

443
00:27:08,340 --> 00:27:10,630
You've probably done it
when you took probability.

444
00:27:10,630 --> 00:27:12,900
But let's see if we can
actually just read it off

445
00:27:12,900 --> 00:27:14,760
from the first derivative of b.

446
00:27:14,760 --> 00:27:16,380
So to do that, we
need to write this

447
00:27:16,380 --> 00:27:18,850
in the form of an
exponential, where there

448
00:27:18,850 --> 00:27:23,220
is one parameter that captures
mu, that interacts with y, just

449
00:27:23,220 --> 00:27:25,860
doing this parameter times
y, and then something that

450
00:27:25,860 --> 00:27:29,430
depends only on y, and then
something that depends only

451
00:27:29,430 --> 00:27:32,979
on mu.

452
00:27:32,979 --> 00:27:34,020
That's the important one.

453
00:27:34,020 --> 00:27:35,550
That's going to be
our B. And then,

454
00:27:35,550 --> 00:27:39,150
there's going to be something
that depends only on y.

455
00:27:39,150 --> 00:27:42,990
So let's write this and
check that this f mu, indeed,

456
00:27:42,990 --> 00:27:46,510
belongs to this canonical
exponential family.

457
00:27:46,510 --> 00:27:48,600
So I definitely
have an exponential

458
00:27:48,600 --> 00:27:50,075
that comes from this guy.

459
00:27:50,075 --> 00:27:51,454
So I have minus mu.

460
00:27:51,454 --> 00:27:53,370
And then, this thing is
going to give me what?

461
00:27:53,370 --> 00:27:58,260
It's going to give
me plus y log mu.

462
00:27:58,260 --> 00:28:02,166
And then, I'm going to have
minus log of y factorial.

463
00:28:06,480 --> 00:28:08,790
So clearly, I have a
term that depends only

464
00:28:08,790 --> 00:28:11,340
on mu, terms that
depend only on y,

465
00:28:11,340 --> 00:28:15,300
and I have a product of y and
something that depends on mu.

466
00:28:15,300 --> 00:28:17,820
If I want to be
canonical, I must

467
00:28:17,820 --> 00:28:23,650
have this to be exactly
the parameter theta itself.

468
00:28:23,650 --> 00:28:27,150
So I'm going to
call this guy theta.

469
00:28:27,150 --> 00:28:30,750
So theta is log mu,
which means that mu

470
00:28:30,750 --> 00:28:32,592
is equal to e to the theta.

471
00:28:32,592 --> 00:28:34,050
And so wherever I
see mu, I'm going

472
00:28:34,050 --> 00:28:36,716
to replace it by e to the
theta, because my new parameter

473
00:28:36,716 --> 00:28:38,070
now, is theta.

474
00:28:38,070 --> 00:28:38,880
So this is what?

475
00:28:38,880 --> 00:28:43,490
This is equal to
exponential y times theta.

476
00:28:43,490 --> 00:28:47,860
And then, I'm going to
have minus e to the theta.

477
00:28:47,860 --> 00:28:51,630
And then, who cares, something
that depends only on y.

478
00:28:51,630 --> 00:28:58,330
So this is my c of y, and phi
is equal to 1 in this case.
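
The Poisson rewrite just described can be written out compactly. This is a sketch that assumes the canonical form $\exp\big((y\theta - b(\theta))/\phi + c(y,\phi)\big)$ from earlier in the lecture; the exact symbols used for the general form are an assumption here.

```latex
\begin{aligned}
f_\mu(y) &= \frac{\mu^{y}}{y!}\,e^{-\mu}
          = \exp\big(y\log\mu - \mu - \log(y!)\big) \\
         &= \exp\big(y\theta - e^{\theta} - \log(y!)\big),
            \qquad \theta = \log\mu, \\
b(\theta) &= e^{\theta}, \qquad c(y) = -\log(y!), \qquad \phi = 1 .
\end{aligned}
```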

479
00:28:58,330 --> 00:29:00,930
So that's all I care about.

480
00:29:00,930 --> 00:29:01,830
So let's use it.

481
00:29:05,000 --> 00:29:08,770
So this is my canonical
exponential family.

482
00:29:08,770 --> 00:29:11,680
Y interacts with theta
exactly like this.

483
00:29:11,680 --> 00:29:13,040
And then, I have this function.

484
00:29:13,040 --> 00:29:17,410
So this function here
must be b of theta.

485
00:29:20,080 --> 00:29:22,780
So from this function,
exponential theta,

486
00:29:22,780 --> 00:29:25,000
I'm supposed to be able
to read what the mean is.

487
00:29:39,820 --> 00:29:41,796
AUDIENCE: [INAUDIBLE]

488
00:29:43,772 --> 00:29:46,990
PHILIPPE RIGOLLET: Because
since in this course

489
00:29:46,990 --> 00:29:48,610
I always know what
the dispersion is,

490
00:29:48,610 --> 00:29:52,450
I can actually always absorb
it into theta from one.

491
00:29:52,450 --> 00:29:54,670
But here, it's really of
the form y times something

492
00:29:54,670 --> 00:29:55,720
divided by 1, right?

493
00:30:01,030 --> 00:30:04,805
If it was like log
of mu divided by phi,

494
00:30:04,805 --> 00:30:06,430
that would be the
question of whether I

495
00:30:06,430 --> 00:30:10,300
want to call phi my
dispersion, or if I

496
00:30:10,300 --> 00:30:12,070
want to just have it in there.

497
00:30:16,610 --> 00:30:18,740
This makes no
difference in practice.

498
00:30:18,740 --> 00:30:20,860
But the real thing
is it's never going

499
00:30:20,860 --> 00:30:22,610
to happen that this
thing, this dispersion,

500
00:30:22,610 --> 00:30:23,960
is going to be an exact number.

501
00:30:23,960 --> 00:30:26,240
If it's an actual
numerical number,

502
00:30:26,240 --> 00:30:28,580
this just means that this
number should be absorbed

503
00:30:28,580 --> 00:30:32,120
in the definition of theta.

504
00:30:32,120 --> 00:30:34,700
But if it's something
that is called sigma, say,

505
00:30:34,700 --> 00:30:36,470
and I will assume
that sigma is known,

506
00:30:36,470 --> 00:30:39,162
then it's probably preferable
to keep it in the dispersion,

507
00:30:39,162 --> 00:30:41,120
so you can see that
there's this parameter here

508
00:30:41,120 --> 00:30:44,450
that you can,
essentially, play with.

509
00:30:44,450 --> 00:30:48,810
It doesn't make any
difference when you know phi.

510
00:30:48,810 --> 00:30:55,050
So now, if I look at the
expectation of some y-- so now,

511
00:30:55,050 --> 00:31:00,419
I'm going to have y, which
follows my Poisson mu.

512
00:31:00,419 --> 00:31:01,960
I'm going to look
at the expectation,

513
00:31:01,960 --> 00:31:09,210
and I know that the expectation
is b prime of theta.

514
00:31:09,210 --> 00:31:09,950
Agreed?

515
00:31:09,950 --> 00:31:14,780
That's what I just
erased, I think.

516
00:31:14,780 --> 00:31:17,020
Agreed with this,
the derivative?

517
00:31:17,020 --> 00:31:18,550
So what is this?

518
00:31:18,550 --> 00:31:23,050
Well, it's the derivative
of e to the theta, which

519
00:31:23,050 --> 00:31:27,270
is e to the theta, which is mu.

520
00:31:27,270 --> 00:31:30,030
So my Poisson is
parametrized by its mean.

521
00:31:30,030 --> 00:31:34,850
I can also compute
the variance, which

522
00:31:34,850 --> 00:31:40,580
is equal to minus the
second derivative of--

523
00:31:40,580 --> 00:31:42,460
no, it's equal to the
second derivative of b.

524
00:31:47,170 --> 00:31:49,090
Dispersion is equal to 1.

525
00:31:49,090 --> 00:31:55,000
Again, if I took phi elsewhere,
I would see it here as well.

526
00:31:55,000 --> 00:31:57,760
So if I just absorbed phi here,
I would see it divided here,

527
00:31:57,760 --> 00:32:00,040
so it would not
make any difference.

528
00:32:00,040 --> 00:32:02,925
And what is the second
derivative of the exponential?

529
00:32:06,570 --> 00:32:09,820
Still the exponential--
so it's still equal to mu.

530
00:32:14,760 --> 00:32:17,950
So that certainly
makes our life easier.

531
00:32:17,950 --> 00:32:19,620
Just one quick remark--

532
00:32:31,130 --> 00:32:32,360
here's the function.

533
00:32:32,360 --> 00:32:35,150
I am giving you a problem--

534
00:32:35,150 --> 00:32:36,710
can the b function--

535
00:32:39,320 --> 00:32:46,550
can it ever be equal
to log of theta?

536
00:32:55,840 --> 00:32:58,310
Who says yes?

537
00:32:58,310 --> 00:33:00,428
Who says no?

538
00:33:00,428 --> 00:33:02,858
Why?

539
00:33:02,858 --> 00:33:04,802
AUDIENCE: [INAUDIBLE]

540
00:33:09,680 --> 00:33:13,670
PHILIPPE RIGOLLET: Yeah, so
what I've learned from this--

541
00:33:13,670 --> 00:33:16,610
it's sort of completely
analytic, right?

542
00:33:16,610 --> 00:33:19,640
So we just took derivatives,
and this thing just happened.

543
00:33:19,640 --> 00:33:22,490
This thing actually allowed us
to relate the second derivative

544
00:33:22,490 --> 00:33:24,299
of b to the variance.

545
00:33:24,299 --> 00:33:26,090
And one thing that we
know about a variance

546
00:33:26,090 --> 00:33:27,920
is that this is non-negative.

547
00:33:27,920 --> 00:33:30,830
And in particular,
it's always positive.

548
00:33:30,830 --> 00:33:35,330
If they give you a canonical
exponential family that

549
00:33:35,330 --> 00:33:39,260
has zero variance, trust
me, you will see it.

550
00:33:39,260 --> 00:33:40,919
That means that
this thing is not

551
00:33:40,919 --> 00:33:42,710
going to look like
something that's finite,

552
00:33:42,710 --> 00:33:44,280
and it's going to
have a point mass.

553
00:33:44,280 --> 00:33:46,280
It's going to take value
infinity at one point.

554
00:33:46,280 --> 00:33:48,080
So this will,
basically, never happen.

555
00:33:48,080 --> 00:33:50,220
This thing is, actually,
strictly positive,

556
00:33:50,220 --> 00:33:52,600
which means that this thing
is always strictly convex.

557
00:33:52,600 --> 00:33:55,220
It means that the second
derivative of this function, b,

558
00:33:55,220 --> 00:34:00,440
has to be strictly positive, and
so that the function is convex.

559
00:34:00,440 --> 00:34:03,005
So this is concave, so this
is definitely not working.

560
00:34:03,005 --> 00:34:04,880
I need to have something
that looks like this

561
00:34:04,880 --> 00:34:07,920
when I talk about my b.

562
00:34:07,920 --> 00:34:10,500
So theta squared--

563
00:34:10,500 --> 00:34:13,190
we'll see a bunch of
exponential theta.

564
00:34:13,190 --> 00:34:14,389
And there's a bunch of them.

565
00:34:14,389 --> 00:34:18,280
But if you started writing
something, and you find b--

566
00:34:18,280 --> 00:34:20,230
try to think of the
plot of b in your mind,

567
00:34:20,230 --> 00:34:23,980
and you find that b looks like
it's going to become concave,

568
00:34:23,980 --> 00:34:25,844
you've made a sign
mistake somewhere.

569
00:34:30,110 --> 00:34:33,320
All right, so we've done
a pretty big parenthesis

570
00:34:33,320 --> 00:34:37,040
to try to characterize
what the distribution of y

571
00:34:37,040 --> 00:34:37,820
was going to be.

572
00:34:37,820 --> 00:34:41,679
We wanted to extend from, say,
Gaussian to something else.

573
00:34:41,679 --> 00:34:43,909
But when we're doing
regression, which

574
00:34:43,909 --> 00:34:46,520
means generalized
linear models, we

575
00:34:46,520 --> 00:34:48,500
are not interested in
the distribution of y

576
00:34:48,500 --> 00:34:51,650
but really the conditional
distribution of y given x.

577
00:34:51,650 --> 00:34:55,760
So I need now to couple
those back together.

578
00:34:55,760 --> 00:34:59,702
So what I know is that
this same mu, in this case,

579
00:34:59,702 --> 00:35:01,910
which is the expectation--
what I want to say is that

580
00:35:01,910 --> 00:35:09,740
the conditional
expectation of y given x--

581
00:35:12,710 --> 00:35:15,870
this is some mu of x.

582
00:35:15,870 --> 00:35:17,700
When we did linear
models, we said well,

583
00:35:17,700 --> 00:35:21,870
this thing was some x transpose
beta for linear models.

584
00:35:26,139 --> 00:35:27,680
And the whole premise
of this chapter

585
00:35:27,680 --> 00:35:29,900
is to say well, this
might make no sense,

586
00:35:29,900 --> 00:35:32,930
because x transpose beta
can take the entire range

587
00:35:32,930 --> 00:35:34,320
of real values.

588
00:35:34,320 --> 00:35:36,870
Whereas, this mu can take
only a partial range.

589
00:35:36,870 --> 00:35:40,550
So even if you actually focus
on the Poisson, for example,

590
00:35:40,550 --> 00:35:43,970
we know that the expectation
of a Poisson has to be

591
00:35:43,970 --> 00:35:45,910
a non-negative number--

592
00:35:45,910 --> 00:35:47,660
actually, a positive
number as soon as you

593
00:35:47,660 --> 00:35:49,970
have a little bit of variance.

594
00:35:49,970 --> 00:35:52,590
It's mu itself-- mu
is a positive number.

595
00:35:52,590 --> 00:35:54,800
And so it's not going
to make any sense

596
00:35:54,800 --> 00:35:57,710
to assume that mu of x is
equal to x transpose beta,

597
00:35:57,710 --> 00:36:00,710
because you might find some x's
for which this value ends up

598
00:36:00,710 --> 00:36:02,302
being negative.

599
00:36:02,302 --> 00:36:03,760
And so we're going
to need, what we

600
00:36:03,760 --> 00:36:05,860
call, the link
function that relates,

601
00:36:05,860 --> 00:36:08,560
that transforms mu,
maps onto the real line,

602
00:36:08,560 --> 00:36:13,210
so that you can now express it
of the form x transpose beta.

603
00:36:13,210 --> 00:36:17,560
So we're going to take
not this, but we're

604
00:36:17,560 --> 00:36:21,250
going to assume
that g of mu of x

605
00:36:21,250 --> 00:36:24,430
is now equal to
x transpose beta,

606
00:36:24,430 --> 00:36:26,140
and that's the
generalized linear models.

607
00:36:33,220 --> 00:36:40,650
So as I said, it's weird to
transform x transpose beta--

608
00:36:40,650 --> 00:36:43,420
a mu to make it
take the real line.

609
00:36:43,420 --> 00:36:44,920
At least to me, it
feels a bit more

610
00:36:44,920 --> 00:36:47,530
natural to take x
transpose beta and make

611
00:36:47,530 --> 00:36:51,000
it fit to the particular
distribution that I want.

612
00:36:51,000 --> 00:36:53,890
And so I'm going to want to
talk about g and g inverse

613
00:36:53,890 --> 00:36:55,550
at the same time.

614
00:36:55,550 --> 00:36:59,070
So I'm going to
actually take always g.

615
00:36:59,070 --> 00:37:04,920
So g is my link
function, and I'm

616
00:37:04,920 --> 00:37:10,036
going to want g to be
continuously differentiable.

617
00:37:16,980 --> 00:37:19,020
OK, let's say that
it has a derivative,

618
00:37:19,020 --> 00:37:22,630
and its derivative
is continuous.

619
00:37:22,630 --> 00:37:28,398
And I'm going to want g
to be strictly increasing.

620
00:37:34,770 --> 00:37:39,930
And that actually implies
that g inverse exists.

621
00:37:39,930 --> 00:37:43,410
Actually, that's not true.

622
00:37:43,410 --> 00:37:54,505
What I'm also going to want
is that g of mu spans--

623
00:37:57,420 --> 00:37:58,380
how do I do this?

624
00:38:06,090 --> 00:38:09,720
So I want g, as I range
over all possible values of mu,

625
00:38:09,720 --> 00:38:11,220
whether they're all
positive values,

626
00:38:11,220 --> 00:38:12,969
or whether they're
values that are limited

627
00:38:12,969 --> 00:38:15,240
to the interval 0,
1, I want those to span

628
00:38:15,240 --> 00:38:18,340
the entire real line, so that
when I want to talk about g

629
00:38:18,340 --> 00:38:20,282
inverse being defined over
the entire real line,

630
00:38:20,282 --> 00:38:21,240
I know where I started.

631
00:38:24,396 --> 00:38:26,660
So this implies that
g inverse exists.

632
00:38:30,200 --> 00:38:32,388
What else does it
imply about g inverse?

633
00:38:39,500 --> 00:38:41,270
So for a function
to be invertible,

634
00:38:41,270 --> 00:38:43,855
I only need for it to
be strictly monotone.

635
00:38:43,855 --> 00:38:45,605
I don't need it to be
strictly increasing.

636
00:38:45,605 --> 00:38:47,729
So in particular, the fact
that I picked increasing

637
00:38:47,729 --> 00:38:53,360
implies that this guy
is actually increasing.

638
00:38:53,360 --> 00:38:54,820
AUDIENCE: [INAUDIBLE]

639
00:38:54,820 --> 00:38:56,320
PHILIPPE RIGOLLET:
That's the image.

640
00:39:03,470 --> 00:39:06,830
So this is my link function, and
this slide is just telling me

641
00:39:06,830 --> 00:39:08,330
I want my function
to be invertible,

642
00:39:08,330 --> 00:39:09,890
so I can talk about g inverse.

643
00:39:09,890 --> 00:39:12,510
I'm going to switch
between the two.

644
00:39:12,510 --> 00:39:15,450
So what link functions
am I going to get?

645
00:39:15,450 --> 00:39:17,214
So for linear
models, we just said

646
00:39:17,214 --> 00:39:18,630
there's no link
function, which is

647
00:39:18,630 --> 00:39:20,962
the same as saying that the
link function is identity,

648
00:39:20,962 --> 00:39:22,920
which certainly satisfies
all these conditions.

649
00:39:22,920 --> 00:39:23,735
It's invertible.

650
00:39:23,735 --> 00:39:25,110
It has all these
nice properties,

651
00:39:25,110 --> 00:39:27,540
but might as well
not talk about it.

652
00:39:27,540 --> 00:39:29,220
For Poisson data,
when we assume that

653
00:39:29,220 --> 00:39:32,250
the conditional distribution
of y given x is Poisson,

654
00:39:32,250 --> 00:39:37,200
the mu, as I just said, is
required to be positive.

655
00:39:37,200 --> 00:39:43,650
So I need a g that goes
from the interval 0 infinity

656
00:39:43,650 --> 00:39:45,540
to the entire real line.

657
00:39:45,540 --> 00:39:47,910
I need a function that
starts from one end

658
00:39:47,910 --> 00:39:51,720
and just takes-- not
only the positive values

659
00:39:51,720 --> 00:39:54,580
are split between positive
and negative values.

660
00:39:54,580 --> 00:39:56,820
And here, for example, I
could take the log link.

661
00:39:56,820 --> 00:40:01,260
So the log is defined
on this entire interval.

662
00:40:01,260 --> 00:40:04,050
And as I range from
0 to plus infinity,

663
00:40:04,050 --> 00:40:07,440
the log is ranging from negative
infinity to plus infinity.

664
00:40:10,382 --> 00:40:12,090
You can probably think
of other functions

665
00:40:12,090 --> 00:40:15,510
that do that, like 2 times log.

666
00:40:15,510 --> 00:40:16,860
That's another one.

667
00:40:16,860 --> 00:40:20,170
But there's many others
you can think of.

668
00:40:20,170 --> 00:40:21,752
But let's say the
log is one of them

669
00:40:21,752 --> 00:40:23,210
that you might want
to think about.

670
00:40:32,680 --> 00:40:34,410
It is natural in
the sense that it's

671
00:40:34,410 --> 00:40:36,160
one of the first
functions we can think of.

672
00:40:36,160 --> 00:40:39,840
We will see, also, that it has
another canonical property that

673
00:40:39,840 --> 00:40:42,090
makes it a natural choice.

674
00:40:42,090 --> 00:40:44,130
The other one is
the other example,

675
00:40:44,130 --> 00:40:47,520
where we had an even stronger
condition on what mu could be.

676
00:40:47,520 --> 00:40:49,620
Mu could only be a
number between 0 and 1,

677
00:40:49,620 --> 00:40:52,780
that was the probability
of success of a coin flip--

678
00:40:52,780 --> 00:40:55,290
probability of success of a
Bernoulli random variable.

679
00:40:55,290 --> 00:40:59,310
And now, I need g to map 0,
1 to the entire real line.

680
00:40:59,310 --> 00:41:02,490
And so here are
a bunch of things

681
00:41:02,490 --> 00:41:04,980
that you can come up
with, because now you

682
00:41:04,980 --> 00:41:08,220
start to have maybe--

683
00:41:08,220 --> 00:41:11,340
I will soon claim that
this one, log of mu,

684
00:41:11,340 --> 00:41:14,220
divided by 1 minus mu
is the most natural one.

685
00:41:14,220 --> 00:41:16,770
But maybe, if you had
never thought of this,

686
00:41:16,770 --> 00:41:18,780
that might not be
the first function

687
00:41:18,780 --> 00:41:20,490
you would come up with, right?

688
00:41:20,490 --> 00:41:23,670
You mentioned trigonometric
functions, for example,

689
00:41:23,670 --> 00:41:25,980
so maybe, you can
come up with something

690
00:41:25,980 --> 00:41:30,960
that comes from hyperbolic
trigonometry or something.

691
00:41:30,960 --> 00:41:32,329
So what does this function do?

692
00:41:32,329 --> 00:41:34,370
Well, we'll see a picture,
but this function does

693
00:41:34,370 --> 00:41:36,990
map the interval 0, 1
to the entire real line.

694
00:41:36,990 --> 00:41:40,770
We also discussed the fact that
if we think reciprocally--

695
00:41:43,740 --> 00:41:46,110
what I want if I want to
think about g inverse,

696
00:41:46,110 --> 00:41:49,140
I want a function that maps the
entire real line into the unit

697
00:41:49,140 --> 00:41:49,920
interval.

698
00:41:49,920 --> 00:41:52,650
And as we said, if I'm not
a very creative statistician

699
00:41:52,650 --> 00:41:55,960
or probabilist, I can just
pick my favorite continuous,

700
00:41:55,960 --> 00:41:59,710
strictly increasing cumulative
distribution function,

701
00:41:59,710 --> 00:42:01,350
which as we know,
will arise as soon

702
00:42:01,350 --> 00:42:03,060
as I have a density
that has support

703
00:42:03,060 --> 00:42:04,830
on the entire real line.

704
00:42:04,830 --> 00:42:07,350
If I have support everywhere,
then it means that my--

705
00:42:12,100 --> 00:42:14,140
it is strictly positive
everywhere, then,

706
00:42:14,140 --> 00:42:17,070
it means that my cumulative
distribution function

707
00:42:17,070 --> 00:42:18,690
has to be strictly increasing.

708
00:42:18,690 --> 00:42:21,450
And of course, it has to go
from 0 to 1, because that's just

709
00:42:21,450 --> 00:42:22,717
the nature of those things.

710
00:42:22,717 --> 00:42:24,550
And so for example, I
can take the Gaussian,

711
00:42:24,550 --> 00:42:25,980
that's one such function.

712
00:42:25,980 --> 00:42:28,591
But I could also take
the double exponential

713
00:42:28,591 --> 00:42:30,340
that looks like an
exponential on one end,

714
00:42:30,340 --> 00:42:32,460
and then an exponential
on the other end.

715
00:42:32,460 --> 00:42:39,930
And basically, if you
take capital phi, which

716
00:42:39,930 --> 00:42:43,560
is the standard Gaussian
cumulative distribution

717
00:42:43,560 --> 00:42:47,460
function, it does work for you,
and you can take its inverse.

718
00:42:47,460 --> 00:42:49,160
And in this case,
we don't talk about,

719
00:42:49,160 --> 00:42:51,810
so this guy is called
logit or logit.

720
00:42:51,810 --> 00:42:53,172
And this guy is called probit.

721
00:42:53,172 --> 00:42:54,630
And you see it,
usually, every time

722
00:42:54,630 --> 00:42:58,534
you have a package on
generalized linear models.

723
00:42:58,534 --> 00:42:59,700
You are trying to implement.

724
00:42:59,700 --> 00:43:00,970
You have this choice.

725
00:43:00,970 --> 00:43:04,009
And for what's called logistic
regression-- so it's funny

726
00:43:04,009 --> 00:43:05,550
that it's called
logistic regression,

727
00:43:05,550 --> 00:43:07,410
but you can actually
use the probit link,

728
00:43:07,410 --> 00:43:10,620
which in this case, is
called probit regression.

729
00:43:10,620 --> 00:43:12,480
But those things are
essentially equivalent,

730
00:43:12,480 --> 00:43:14,816
and it's really a
matter of taste.

731
00:43:14,816 --> 00:43:16,440
Maybe of communities--
some communities

732
00:43:16,440 --> 00:43:18,140
might prefer one over the other.

733
00:43:18,140 --> 00:43:20,490
We'll see that
again, as I claimed

734
00:43:20,490 --> 00:43:24,810
before, the logistic,
the logit one

735
00:43:24,810 --> 00:43:29,130
has a slightly more compelling
argument for its reason

736
00:43:29,130 --> 00:43:30,152
to exist.

737
00:43:30,152 --> 00:43:31,860
I guess this one, the
compelling argument

738
00:43:31,860 --> 00:43:34,770
is that it involves the
standard Gaussian, which

739
00:43:34,770 --> 00:43:37,470
of course, is something that
should show up everywhere.

740
00:43:37,470 --> 00:43:41,340
And then, you can think
about crazy stuff.

741
00:43:41,340 --> 00:43:43,770
Even crazy gets a name--

742
00:43:43,770 --> 00:43:48,670
complementary log-log, which is
the log of minus log of 1 minus mu.

743
00:43:48,670 --> 00:43:49,170
Why not?
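
The three links named above for a mean in the interval (0, 1) can be sketched as follows, each paired with its inverse. The function names are illustrative assumptions; the probit uses Python's standard-library `statistics.NormalDist` for the Gaussian CDF and its inverse.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard Gaussian, mean 0 and sd 1

def logit(mu):
    # log(mu / (1 - mu)), maps (0, 1) onto the real line
    return math.log(mu / (1 - mu))

def inv_logit(t):
    # e^t / (1 + e^t), written in the equivalent form 1 / (1 + e^-t)
    return 1 / (1 + math.exp(-t))

def probit(mu):
    # Inverse of the standard Gaussian CDF (capital phi inverse)
    return _std_normal.inv_cdf(mu)

def inv_probit(t):
    # Standard Gaussian CDF (capital phi)
    return _std_normal.cdf(t)

def cloglog(mu):
    # Complementary log-log: log(-log(1 - mu))
    return math.log(-math.log(1 - mu))

def inv_cloglog(t):
    # Inverse of the complementary log-log link
    return 1 - math.exp(-math.exp(t))

links = [(logit, inv_logit), (probit, inv_probit), (cloglog, inv_cloglog)]
for g, g_inv in links:
    # Each inverse link sends real numbers back into (0, 1)
    print(g.__name__, [round(g_inv(t), 4) for t in (-3.0, 0.0, 3.0)])
```

All three are strictly increasing on (0, 1), so any of them satisfies the conditions stated for a link function; they differ only in shape.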

744
00:43:52,890 --> 00:43:56,450
So I guess you can
iterate that thing.

745
00:43:56,450 --> 00:43:59,510
You can just put a log 1
minus in front of this thing,

746
00:43:59,510 --> 00:44:01,160
and it's still going to go.

747
00:44:01,160 --> 00:44:07,810
So that's not true.

748
00:44:07,810 --> 00:44:10,377
I have to put a minus and take--

749
00:44:10,377 --> 00:44:11,210
no, that's not true.

750
00:44:13,707 --> 00:44:15,290
So you can think of
whatever you want.

751
00:44:19,320 --> 00:44:21,770
So I claimed that the logit
link is the natural choice, so

752
00:44:21,770 --> 00:44:22,970
here's a picture.

753
00:44:22,970 --> 00:44:25,590
I should have actually
plotted the other one,

754
00:44:25,590 --> 00:44:27,399
so we can actually compare it.

755
00:44:27,399 --> 00:44:29,690
To be fair, I don't even
remember how it would actually

756
00:44:29,690 --> 00:44:32,010
fit into those two functions.

757
00:44:32,010 --> 00:44:35,712
So the blue one, which is
this one, for those of you

758
00:44:35,712 --> 00:44:37,670
who don't see the difference
between blue and red--

759
00:44:37,670 --> 00:44:39,300
sorry about that.

760
00:44:39,300 --> 00:44:45,320
So the blue one
is the logistic one.

761
00:44:45,320 --> 00:44:49,980
So this guy is the function that
does e to the x, over 1 plus

762
00:44:49,980 --> 00:44:50,480
e to the x.

763
00:44:50,480 --> 00:44:51,560
As you can see,
this is a function

764
00:44:51,560 --> 00:44:53,600
that's supposed to map
the entire real line

765
00:44:53,600 --> 00:44:55,970
into the interval 0, 1.

766
00:44:55,970 --> 00:44:58,220
So that's supposed to be the
inverse of your function,

767
00:44:58,220 --> 00:45:00,580
and I claimed that this is
the inverse of the logistic

768
00:45:00,580 --> 00:45:02,090
of the logit function.

769
00:45:02,090 --> 00:45:04,340
And the other one, well,
this is the Gaussian CDF,

770
00:45:04,340 --> 00:45:06,470
so you know it's clearly
the inverse of the inverse

771
00:45:06,470 --> 00:45:07,732
of the Gaussian CDF.

772
00:45:07,732 --> 00:45:08,690
And that's the red one.

773
00:45:08,690 --> 00:45:09,939
That's the one that goes here.

774
00:45:12,290 --> 00:45:15,320
I would guess that the
complementary log-log is

775
00:45:15,320 --> 00:45:17,600
something that's probably
going above here,

776
00:45:17,600 --> 00:45:19,790
and for which the
slope is, actually,

777
00:45:19,790 --> 00:45:22,840
even a little flatter
as you cross 0.

778
00:45:26,250 --> 00:45:29,119
So of course, these are
not our link functions.

779
00:45:29,119 --> 00:45:30,910
These are the inverses
of our link functions.

780
00:45:30,910 --> 00:45:32,730
So what do they look
like when I actually,

781
00:45:32,730 --> 00:45:36,670
basically, flip my
thing like this?

782
00:45:36,670 --> 00:45:38,810
So this is what I see.

783
00:45:38,810 --> 00:45:42,600
And so I can see that in blue,
this is my logistic link.

784
00:45:42,600 --> 00:45:46,140
So it crosses 0 with a
slightly faster rate.

785
00:45:46,140 --> 00:45:49,830
Remember, if we could
use the identity, that

786
00:45:49,830 --> 00:45:51,134
would be very nice to us.

787
00:45:51,134 --> 00:45:52,800
We would just want
to take the identity.

788
00:45:52,800 --> 00:45:55,145
The problem is that
if I start having

789
00:45:55,145 --> 00:45:56,520
the identity that
goes here, it's

790
00:45:56,520 --> 00:45:58,810
going to start being a problem.

791
00:45:58,810 --> 00:46:06,419
And this is the probit link,
the phi inverse that you see here.

792
00:46:06,419 --> 00:46:07,335
It's a little flatter.

793
00:46:10,290 --> 00:46:16,599
You can compute the derivative
at zero of those guys.

794
00:46:16,599 --> 00:46:17,890
What is the derivative of the--

795
00:46:21,180 --> 00:46:24,380
So I'm taking the derivative
of log of x over 1 minus x.

796
00:46:24,380 --> 00:46:32,010
So it's 1 over x,
minus 1 over 1 minus x.

797
00:46:35,850 --> 00:46:39,120
So if I look at 0.5--

798
00:46:39,120 --> 00:46:40,770
sorry, this is
the interval 0, 1.

799
00:46:40,770 --> 00:46:48,070
So I'm interested
in the slope at 0.5.

800
00:46:48,070 --> 00:46:49,660
Yes, it's plus, thank you.

801
00:46:49,660 --> 00:46:53,230
So at 0.5, what I
get is 2 plus 2.

802
00:46:57,090 --> 00:47:02,650
Yeah, so that's the
slope that we get,

803
00:47:02,650 --> 00:47:07,434
and if you compute
for the derivative--

804
00:47:07,434 --> 00:47:09,100
what is the derivative
of a phi inverse?

805
00:47:09,100 --> 00:47:13,180
Well, it's
1 divided

806
00:47:13,180 --> 00:47:20,640
by little phi of capital
phi, inverse of x.

807
00:47:20,640 --> 00:47:24,319
So little phi at 1/2--

808
00:47:24,319 --> 00:47:24,860
I don't know.

809
00:47:29,450 --> 00:47:30,950
Yeah, I guess I can
probably compute

810
00:47:30,950 --> 00:47:32,590
the derivative of
the capital phi

811
00:47:32,590 --> 00:47:34,460
at 0, which is going
to be just that.

812
00:47:34,460 --> 00:47:37,070
1 over square root of 2
pi, and then just say well,

813
00:47:37,070 --> 00:47:38,870
the slope has to be 1 over that.

814
00:47:42,972 --> 00:47:43,680
Square root 2 pi.
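
The two slopes at mu = 0.5 can be checked numerically. This is a minimal sketch using central finite differences; the step size and helper names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def logit(mu):
    # log(mu / (1 - mu))
    return math.log(mu / (1 - mu))

def probit(mu):
    # Inverse of the standard Gaussian CDF
    return NormalDist().inv_cdf(mu)

def slope(f, x, h=1e-5):
    # Central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

logit_slope = slope(logit, 0.5)    # exact value: 1/0.5 + 1/0.5 = 4
probit_slope = slope(probit, 0.5)  # exact value: sqrt(2 * pi)
print(logit_slope, probit_slope, math.sqrt(2 * math.pi))
```

The logit link is steeper at the midpoint (slope 4) than the probit link (slope square root of 2 pi, about 2.5), matching the remark that the probit curve is a little flatter there.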

815
00:47:47,130 --> 00:47:50,310
So that's just a comparison,
but again, so far, we

816
00:47:50,310 --> 00:47:54,151
do not have any reason to
prefer one to the other.

817
00:47:54,151 --> 00:47:56,400
And so now, I'm going to
start giving you some reasons

818
00:47:56,400 --> 00:47:58,110
to prefer one to the other.

819
00:47:58,110 --> 00:48:01,260
And one of those two--

820
00:48:01,260 --> 00:48:03,570
and actually for each
canonical family,

821
00:48:03,570 --> 00:48:05,820
there is something which is
called the canonical link.

822
00:48:05,820 --> 00:48:07,486
And when you don't
have any other reason

823
00:48:07,486 --> 00:48:10,386
to choose anything else, why
not choose the canonical one?

824
00:48:10,386 --> 00:48:11,760
And the canonical
link is the one

825
00:48:11,760 --> 00:48:19,580
that says OK, what I want is g
to map mu onto the real line.

826
00:48:22,980 --> 00:48:28,550
But mu is not the parameter
of my canonical family.

827
00:48:28,550 --> 00:48:31,470
Here for example,
mu is e to the theta,

828
00:48:31,470 --> 00:48:33,290
but the canonical
parameter is theta.

829
00:48:36,050 --> 00:48:39,480
But the parameter of a
canonical exponential family

830
00:48:39,480 --> 00:48:42,650
is something that lives
in the entire real line.

831
00:48:42,650 --> 00:48:45,510
It was defined for all thetas.

832
00:48:45,510 --> 00:48:50,250
And so in particular,
I can just take theta

833
00:48:50,250 --> 00:48:52,950
to be the one that's
x transpose beta.

834
00:48:52,950 --> 00:48:54,480
And so in particular,
I'm just going

835
00:48:54,480 --> 00:48:57,180
to try to find the link
that just says OK, when

836
00:48:57,180 --> 00:49:00,470
I take g of mu,
I'm going to map,

837
00:49:00,470 --> 00:49:01,700
so that's what's going to be.

838
00:49:01,700 --> 00:49:05,499
So I know that g of mu is
going to be equal to x beta.

839
00:49:05,499 --> 00:49:07,040
And now, what I'm
going to say is OK,

840
00:49:07,040 --> 00:49:09,850
let's just take the g that
makes this guy equal to theta,

841
00:49:09,850 --> 00:49:11,600
so that it's theta
that I actually model,

842
00:49:11,600 --> 00:49:14,880
like x transpose beta.

843
00:49:14,880 --> 00:49:17,960
Feels pretty canonical, right?

844
00:49:17,960 --> 00:49:19,880
What else?

845
00:49:19,880 --> 00:49:22,280
What other central, easy
choice would you take?

846
00:49:22,280 --> 00:49:23,650
This was pretty easy.

847
00:49:23,650 --> 00:49:27,560
There is a natural parameter
for this canonical family,

848
00:49:27,560 --> 00:49:29,780
and it takes value on
the entire real line.

849
00:49:29,780 --> 00:49:32,500
I have a function that maps
mu onto the entire real line,

850
00:49:32,500 --> 00:49:36,260
so let's just map it to
the actual parameter.

851
00:49:36,260 --> 00:49:40,419
So now, OK, why do I have this?

852
00:49:40,419 --> 00:49:41,960
Well, we've already
figured that out.

853
00:49:41,960 --> 00:49:44,840
The canonical link function
is strictly increasing.

854
00:49:44,840 --> 00:49:49,670
Sorry, so I said that
now I want this guy--

855
00:49:49,670 --> 00:49:57,470
so I want g of mu to
be equal to theta,

856
00:49:57,470 --> 00:50:00,560
which is equivalent to
saying that I want mu to be

857
00:50:00,560 --> 00:50:03,860
equal to g inverse of theta.

858
00:50:03,860 --> 00:50:08,140
But we know that mu is what--

859
00:50:08,140 --> 00:50:09,160
b prime of theta.

860
00:50:15,640 --> 00:50:21,090
So that means that b prime is
the same function as g inverse.

861
00:50:21,090 --> 00:50:24,570
And I claimed that this is
actually giving me, indeed,

862
00:50:24,570 --> 00:50:27,930
a function that has the
properties that I want,

863
00:50:27,930 --> 00:50:30,060
because before I said,
just pick any function that

864
00:50:30,060 --> 00:50:31,080
has these properties.

865
00:50:31,080 --> 00:50:33,102
And now, I'm giving
you a very hard rule

866
00:50:33,102 --> 00:50:34,560
to pick this, though
you need still

867
00:50:34,560 --> 00:50:37,050
to check that it satisfies
those conditions, in particular,

868
00:50:37,050 --> 00:50:39,050
that it's increasing
and invertible.

869
00:50:39,050 --> 00:50:41,050
And so for this to be
increasing and invertible,

870
00:50:41,050 --> 00:50:42,630
strictly increasing
and invertible,

871
00:50:42,630 --> 00:50:44,880
really what I need is that
the inverse is strictly

872
00:50:44,880 --> 00:50:48,070
increasing and invertible,
which is the case here,

873
00:50:48,070 --> 00:50:51,220
because b prime as we said--

874
00:50:51,220 --> 00:50:56,610
well, b prime is the derivative
of a strictly convex function.

875
00:50:56,610 --> 00:50:58,749
A strictly convex function
has a second derivative

876
00:50:58,749 --> 00:50:59,790
that's strictly positive.

877
00:50:59,790 --> 00:51:01,770
We just figured that
out using the fact

878
00:51:01,770 --> 00:51:03,790
that the variance was
strictly positive.

879
00:51:03,790 --> 00:51:06,330
And if phi is strictly
positive, then this thing

880
00:51:06,330 --> 00:51:08,530
has to be strictly positive.

881
00:51:08,530 --> 00:51:10,650
So if b prime, prime
is strictly positive--

882
00:51:10,650 --> 00:51:13,604
this is the derivative of
a function called b prime.

883
00:51:13,604 --> 00:51:15,270
If your derivative
is strictly positive,

884
00:51:15,270 --> 00:51:17,670
you are strictly increasing.

885
00:51:17,670 --> 00:51:22,810
And so we know that b prime is,
indeed, strictly increasing.

886
00:51:22,810 --> 00:51:26,060
And what I need also
to check-- well,

887
00:51:26,060 --> 00:51:28,010
I guess this is already
checked on its own,

888
00:51:28,010 --> 00:51:33,560
because b prime is
actually mapping all of R

889
00:51:33,560 --> 00:51:37,100
onto the possible values.

890
00:51:37,100 --> 00:51:38,870
When theta ranges on
the entire real line,

891
00:51:38,870 --> 00:51:41,120
then b prime ranges
in the entire interval

892
00:51:41,120 --> 00:51:45,440
of the mean values
that it can take.

893
00:51:45,440 --> 00:51:48,030
And so now, I have this thing
that's completely defined.

894
00:51:48,030 --> 00:51:50,490
B prime inverse is a valid link.

895
00:51:56,030 --> 00:51:57,450
And it's called
a canonical link.

896
00:52:02,470 --> 00:52:05,770
OK, so again, if I give you
an exponential family, which

897
00:52:05,770 --> 00:52:09,350
is another way of saying I'll
give you a convex function, b,

898
00:52:09,350 --> 00:52:12,400
which gives you some
exponential family,

899
00:52:12,400 --> 00:52:15,160
then if you just
take b prime inverse,

900
00:52:15,160 --> 00:52:17,770
this gives you the
associated canonical link

901
00:52:17,770 --> 00:52:21,590
for this canonical
exponential family.

902
00:52:21,590 --> 00:52:26,196
So clearly there's
an advantage of doing

903
00:52:26,196 --> 00:52:28,070
this, which is I don't
have to actually think

904
00:52:28,070 --> 00:52:30,800
about which one to pick if I
don't want to think about it.

905
00:52:30,800 --> 00:52:34,220
But there's other
advantages that come to it,

906
00:52:34,220 --> 00:52:36,170
and we'll see that in
the representations.

907
00:52:36,170 --> 00:52:38,711
There's, basically, going to be
some light cancellations that

908
00:52:38,711 --> 00:52:39,290
show up.

909
00:52:39,290 --> 00:52:40,665
So before we go
there, let's just

910
00:52:40,665 --> 00:52:43,790
compute the canonical link for
the Bernoulli distribution.

911
00:52:43,790 --> 00:52:46,360
So remember, the
Bernoulli distribution

912
00:52:46,360 --> 00:52:55,510
has a PMF, which is part of the
canonical exponential family.

913
00:52:55,510 --> 00:53:00,180
So the PMF of the
Bernoulli is f theta of x.

914
00:53:06,529 --> 00:53:07,820
Let me just write it like this.

915
00:53:07,820 --> 00:53:12,470
So it's p to the y,
let's say-- one minus p

916
00:53:12,470 --> 00:53:16,700
to the 1 minus y,
which I will write

917
00:53:16,700 --> 00:53:28,910
as exponential y log p, plus
1 minus y, log 1 minus p.

918
00:53:28,910 --> 00:53:30,750
OK, we've done that last time.

919
00:53:30,750 --> 00:53:32,670
Now, I'm going to
group my terms in y

920
00:53:32,670 --> 00:53:37,530
to see how y interacts
with this parameter p.

921
00:53:37,530 --> 00:53:40,710
And what I'm getting is
y times log of p

922
00:53:40,710 --> 00:53:42,540
divided by 1 minus p.

923
00:53:42,540 --> 00:53:47,040
And then, the only term that
remains is log of 1 minus p.

924
00:53:50,370 --> 00:53:53,710
Now, I want this to be a
canonical exponential family,

925
00:53:53,710 --> 00:53:56,880
which means that I just
need to call this guy,

926
00:53:56,880 --> 00:53:58,710
so it is part of the
exponential family.

927
00:53:58,710 --> 00:53:59,460
You can read that.

928
00:53:59,460 --> 00:54:04,520
If I want it to be canonical,
this guy must be theta itself.

929
00:54:04,520 --> 00:54:11,180
So I have that theta is
equal to log of p over 1 minus p.

930
00:54:11,180 --> 00:54:12,800
If I invert this
thing, it tells me

931
00:54:12,800 --> 00:54:16,880
that p is e to the theta
divided by 1 plus e

932
00:54:16,880 --> 00:54:18,434
to the theta.

933
00:54:18,434 --> 00:54:19,850
It's just inverting
this function.

934
00:54:23,550 --> 00:54:28,140
In particular, it means
that log of 1 minus p

935
00:54:28,140 --> 00:54:31,900
is equal to log of 1
minus this thing.

936
00:54:31,900 --> 00:54:33,520
So the exponential
thetas go away.

937
00:54:33,520 --> 00:54:39,350
So in the numerator,
this is what I get.

938
00:54:39,350 --> 00:54:44,870
That's the log of 1 minus this guy,
which is equal to minus log of 1

939
00:54:44,870 --> 00:54:46,010
plus e to the theta.

940
00:54:50,790 --> 00:54:52,540
So I'm going a bit too
fast, but these are

941
00:54:52,540 --> 00:54:56,230
very elementary manipulations--

942
00:54:56,230 --> 00:54:59,220
maybe, it requires one more
line to convince yourself.

943
00:54:59,220 --> 00:55:05,940
But just do it in the
comfort of your room.

944
00:55:05,940 --> 00:55:11,210
And then, what you have is the
exponential of y times theta,

945
00:55:11,210 --> 00:55:16,850
and then, I have minus
log of 1 plus e to the theta.

946
00:55:16,850 --> 00:55:20,960
So this is the
representation of the PMF

947
00:55:20,960 --> 00:55:23,990
of a Bernoulli
distribution

948
00:55:23,990 --> 00:55:27,680
as a member of the
canonical exponential family.

949
00:55:27,680 --> 00:55:33,530
And it tells me that b of
theta is equal to log of 1

950
00:55:33,530 --> 00:55:36,510
plus e to the theta.

951
00:55:36,510 --> 00:55:38,140
That's what I have there.
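
For reference, the board computation described in the cues above can be written out as follows. This is a reconstruction from the spoken steps, using the lecture's own symbols (p, y, theta, b); it is a sketch, not a copy of the actual slide.

```latex
\begin{align*}
f_\theta(y) &= p^{y}(1-p)^{1-y}
  = \exp\bigl(y\log p + (1-y)\log(1-p)\bigr) \\
  &= \exp\Bigl(y\log\frac{p}{1-p} + \log(1-p)\Bigr), \\
\theta &= \log\frac{p}{1-p}
  \quad\Longleftrightarrow\quad
  p = \frac{e^{\theta}}{1+e^{\theta}}, \\
\log(1-p) &= \log\frac{1}{1+e^{\theta}} = -\log\bigl(1+e^{\theta}\bigr), \\
f_\theta(y) &= \exp\bigl(y\theta - \log(1+e^{\theta})\bigr),
  \qquad b(\theta) = \log\bigl(1+e^{\theta}\bigr).
\end{align*}
```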

952
00:55:38,140 --> 00:55:41,790
From there, I can compute the
expectation, which hopefully,

953
00:55:41,790 --> 00:55:46,170
I'm going to get p as
the mean and p times 1

954
00:55:46,170 --> 00:55:47,759
minus p as the variance.

955
00:55:47,759 --> 00:55:49,050
Otherwise, that would be weird.

956
00:55:51,960 --> 00:55:55,840
So let's just do this.

957
00:55:55,840 --> 00:56:00,950
B prime of theta should
give me the mean.

958
00:56:00,950 --> 00:56:04,010
And indeed, b prime of
theta is e to the theta,

959
00:56:04,010 --> 00:56:08,060
divided by 1 plus e to
the theta, which is exactly

960
00:56:08,060 --> 00:56:09,290
this p that I had there.

961
00:56:14,850 --> 00:56:18,350
OK just for fun--

962
00:56:18,350 --> 00:56:19,200
well, I don't know.

963
00:56:19,200 --> 00:56:20,520
Maybe, that's not part of it.

964
00:56:20,520 --> 00:56:22,820
Yeah, let's not compute
the second derivative.

965
00:56:22,820 --> 00:56:25,800
That's probably going to be on
your homework at some point--

966
00:56:25,800 --> 00:56:29,440
if not, on the final.

967
00:56:29,440 --> 00:56:32,890
So b prime now--

968
00:56:32,890 --> 00:56:34,120
oh, I erased it, of course.

969
00:56:37,300 --> 00:56:39,380
g, the canonical link,
is b prime inverse.

970
00:56:42,520 --> 00:56:44,770
And I claim that this
is going to give me

971
00:56:44,770 --> 00:56:48,910
the logit function, log
of mu over 1 minus mu.

972
00:56:48,910 --> 00:56:50,480
So let's check that.

973
00:56:50,480 --> 00:56:54,236
So b prime is this
thing, so now,

974
00:56:54,236 --> 00:56:55,360
I want to find the inverse.

975
00:57:02,180 --> 00:57:05,360
Well, I should really call
my inverse a function of p.

976
00:57:05,360 --> 00:57:06,750
And I've done it before--

977
00:57:06,750 --> 00:57:08,930
all I have to do is to
solve this equation, which

978
00:57:08,930 --> 00:57:10,400
I've actually just
done it, that's

979
00:57:10,400 --> 00:57:11,890
where I'm actually coming from.

980
00:57:11,890 --> 00:57:14,510
So it's actually telling me
that the solution of this thing

981
00:57:14,510 --> 00:57:17,941
is equal to log of
p over 1 minus p.

982
00:57:25,810 --> 00:57:28,540
We just solve this
thing both ways.

983
00:57:28,540 --> 00:57:38,520
And this is, indeed, logit
of p by definition of logit.
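
A quick numerical check of the Bernoulli computation above, as a minimal sketch: it assumes only b(theta) = log(1 + e^theta) from the lecture, and the function names (`b`, `b_prime`, `logit`, `numerical_derivative`) are illustrative choices, not anything from the course materials. It verifies that the derivative of b is the mean p and that the logit undoes b prime.

```python
import math

def b(theta):
    # Bernoulli log-partition function from the lecture: log(1 + e^theta)
    return math.log1p(math.exp(theta))

def b_prime(theta):
    # Closed-form derivative of b: e^theta / (1 + e^theta), i.e. the mean p
    return math.exp(theta) / (1.0 + math.exp(theta))

def logit(mu):
    # Candidate canonical link g(mu) = log(mu / (1 - mu))
    return math.log(mu / (1.0 - mu))

def numerical_derivative(f, x, eps=1e-6):
    # Central finite difference, used only as a sanity check
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

for theta in (-2.0, 0.0, 1.5):
    mu = b_prime(theta)
    # b'(theta) matches the numerical derivative of b
    assert abs(numerical_derivative(b, theta) - mu) < 1e-6
    # logit is the inverse of b', so g(b'(theta)) recovers theta
    assert abs(logit(mu) - theta) < 1e-9
```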

984
00:57:38,520 --> 00:57:40,470
So b prime inverse,
this function that

985
00:57:40,470 --> 00:57:42,440
seemed to come out
of nowhere, is really

986
00:57:42,440 --> 00:57:45,030
just the inverse of b
prime, which we know

987
00:57:45,030 --> 00:57:46,200
is the canonical link.

988
00:57:46,200 --> 00:57:49,200
And canonical is some
sort of ad hoc choices

989
00:57:49,200 --> 00:57:53,040
that we've made by saying let's
just take the link such that g

990
00:57:53,040 --> 00:57:57,819
of mu is giving me the actual
canonical parameter theta.

991
00:57:57,819 --> 00:57:58,785
Yeah?

992
00:57:58,785 --> 00:58:00,717
AUDIENCE: [INAUDIBLE]

993
00:58:02,197 --> 00:58:03,530
PHILIPPE RIGOLLET: You're right.

994
00:58:08,520 --> 00:58:11,550
Now, of course, I'm going
through all this trouble,

995
00:58:11,550 --> 00:58:13,470
but you could see
it immediately.

996
00:58:13,470 --> 00:58:16,550
I know this is
going to be theta.

997
00:58:16,550 --> 00:58:19,380
We also have prior
knowledge, hopefully,

998
00:58:19,380 --> 00:58:23,520
that the expectation of
a Bernoulli is p itself.

999
00:58:23,520 --> 00:58:25,760
So right at this
step, when I say

1000
00:58:25,760 --> 00:58:28,070
that I'm going to take
theta to be this guy,

1001
00:58:28,070 --> 00:58:32,959
I already knew that the
canonical link was the logit--

1002
00:58:32,959 --> 00:58:34,500
because I just said
oh, here's theta.

1003
00:58:34,500 --> 00:58:37,356
And it's just this
function of mu [INAUDIBLE].

1004
00:58:41,100 --> 00:58:43,539
OK, so you can do that
for a bunch of examples,

1005
00:58:43,539 --> 00:58:45,330
and this is what they're
going to give you.

1006
00:58:45,330 --> 00:58:47,820
So the Gaussian
case, b of theta--

1007
00:58:47,820 --> 00:58:49,760
we've actually computed
it, actually, once.

1008
00:58:49,760 --> 00:58:51,290
This is theta squared over 2.

1009
00:58:51,290 --> 00:58:53,130
So the derivative of
this thing is really

1010
00:58:53,130 --> 00:58:56,970
just theta, which means that
g or g inverse is actually

1011
00:58:56,970 --> 00:58:59,280
equal to the identity.

1012
00:58:59,280 --> 00:59:02,760
And again, sanity check--

1013
00:59:02,760 --> 00:59:04,410
when I'm in the
Gaussian case, there's

1014
00:59:04,410 --> 00:59:06,780
nothing generalized about
generalized linear models

1015
00:59:06,780 --> 00:59:09,040
if you don't have a link.

1016
00:59:09,040 --> 00:59:12,390
The Poisson case-- you
can actually check.

1017
00:59:12,390 --> 00:59:13,480
Did we do this, actually?

1018
00:59:13,480 --> 00:59:14,350
Yes we did.

1019
00:59:14,350 --> 00:59:16,570
So that's when we
had this e to the theta.

1020
00:59:16,570 --> 00:59:19,960
And so b is e to the theta, which
means that the natural link is

1021
00:59:19,960 --> 00:59:24,790
the inverse, which is log, which
is the inverse of exponential.

1022
00:59:24,790 --> 00:59:29,680
And so that's logarithm
link, which as I said,

1023
00:59:29,680 --> 00:59:31,560
I used the word natural.

1024
00:59:31,560 --> 00:59:33,610
You can also use
the word canonical

1025
00:59:33,610 --> 00:59:35,740
if you want to describe
this function as being

1026
00:59:35,740 --> 00:59:38,620
the right function to map
the positive real line

1027
00:59:38,620 --> 00:59:40,959
to the entire real line.

1028
00:59:40,959 --> 00:59:42,250
The Bernoulli-- we just did it.

1029
00:59:42,250 --> 00:59:46,930
So b-- the cumulant
generating function-- is log of 1

1030
00:59:46,930 --> 00:59:52,990
plus e to the theta, and the link
is log of mu over 1 minus mu.

1031
00:59:52,990 --> 00:59:57,520
And gamma function--
where you have

1032
00:59:57,520 --> 01:00:00,700
the thing you're going to see is
minus log of minus [INAUDIBLE].

1033
01:00:00,700 --> 01:00:04,030
You see the reciprocal link
is the link that actually

1034
01:00:04,030 --> 01:00:08,045
shows up, so minus 1 over mu.
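
The four examples just listed can be checked side by side. The sketch below pairs each family's b prime with its canonical link and confirms that g(b'(theta)) returns theta. The dictionary name `FAMILIES` and the helper `link_of_mean` are hypothetical. The gamma entry assumes the standard form b(theta) = -log(-theta) with theta negative, which is consistent with the link minus 1 over mu quoted above but is not fully audible in the recording.

```python
import math

# For each family: (b_prime, canonical link g). Names are illustrative only.
FAMILIES = {
    # b(theta) = theta^2 / 2, so b'(theta) = theta and g is the identity
    "gaussian": (lambda t: t, lambda m: m),
    # b(theta) = e^theta, so b'(theta) = e^theta and g is the log
    "poisson": (math.exp, math.log),
    # b(theta) = log(1 + e^theta), so b' is the sigmoid and g is the logit
    "bernoulli": (lambda t: 1.0 / (1.0 + math.exp(-t)),
                  lambda m: math.log(m / (1.0 - m))),
    # Assumed standard form b(theta) = -log(-theta), theta < 0:
    # b'(theta) = -1/theta and g(mu) = -1/mu
    "gamma": (lambda t: -1.0 / t, lambda m: -1.0 / m),
}

def link_of_mean(name, theta):
    # Apply b' then the canonical link; this should return theta itself
    b_prime, g = FAMILIES[name]
    return g(b_prime(theta))

for name, theta in [("gaussian", 0.3), ("poisson", 0.7),
                    ("bernoulli", -1.2), ("gamma", -2.0)]:
    assert abs(link_of_mean(name, theta) - theta) < 1e-9
```

Note that the gamma check uses a negative theta, since the mean -1/theta must be positive.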

1035
01:00:08,045 --> 01:00:08,545
That maps.

1036
01:00:35,690 --> 01:00:40,400
So are there any questions
about the canonical links,

1037
01:00:40,400 --> 01:00:42,532
canonical families?

1038
01:00:42,532 --> 01:00:45,020
I use the word canonical a lot.

1039
01:00:45,020 --> 01:00:48,929
But is everything fitting
together right now?

1040
01:00:48,929 --> 01:00:49,970
So we have this function.

1041
01:00:49,970 --> 01:00:53,060
We have canonical exponential
family, by assumption.

1042
01:00:53,060 --> 01:00:54,800
It has a function,
b, which contains

1043
01:00:54,800 --> 01:00:56,552
all the information we want.

1044
01:00:56,552 --> 01:00:58,010
At the beginning
of the lecture, we

1045
01:00:58,010 --> 01:00:59,468
established that
it has information

1046
01:00:59,468 --> 01:01:01,310
about the mean in
the first derivative,

1047
01:01:01,310 --> 01:01:03,143
about the variance in
the second derivative.

1048
01:01:03,143 --> 01:01:05,210
And it's also giving
us a canonical link.

1049
01:01:05,210 --> 01:01:08,035
So just cherish this b
once you've found it,

1050
01:01:08,035 --> 01:01:09,410
because it's
everything you need.

1051
01:01:09,410 --> 01:01:09,909
Yeah?

1052
01:01:09,909 --> 01:01:11,750
AUDIENCE: [INAUDIBLE]

1053
01:01:15,962 --> 01:01:19,342
PHILIPPE RIGOLLET: I don't
know, a political preference?

1054
01:01:24,710 --> 01:01:26,730
I don't know, honestly.

1055
01:01:26,730 --> 01:01:29,570
If I were a serious
practitioner,

1056
01:01:29,570 --> 01:01:31,700
I probably would have a
better answer for you.

1057
01:01:31,700 --> 01:01:32,870
At this point, I just don't.

1058
01:01:32,870 --> 01:01:34,244
I think it's a
matter of practice

1059
01:01:34,244 --> 01:01:36,860
and actual preferences.

1060
01:01:36,860 --> 01:01:38,426
You can also try both.

1061
01:01:38,426 --> 01:01:39,800
We didn't mention
it, but there's

1062
01:01:39,800 --> 01:01:41,360
this idea of
cross-validation-- well,

1063
01:01:41,360 --> 01:01:43,610
we mentioned it without
going too much into detail.

1064
01:01:43,610 --> 01:01:46,460
But you could try both
and see which one performs

1065
01:01:46,460 --> 01:01:48,617
best on a yet unseen data set.

1066
01:01:48,617 --> 01:01:51,200
In terms of prediction, just say
I prefer this one of the two,

1067
01:01:51,200 --> 01:01:53,450
because this actually comes
as part of your modeling

1068
01:01:53,450 --> 01:01:56,090
assumption, right?

1069
01:01:56,090 --> 01:01:59,630
Not only did you decide
to model the image of mu

1070
01:01:59,630 --> 01:02:03,057
through the link function as
a linear model, but really

1071
01:02:03,057 --> 01:02:03,890
what you're saying--

1072
01:02:03,890 --> 01:02:05,750
your model is saying
well, you have

1073
01:02:05,750 --> 01:02:07,860
two pieces of [INAUDIBLE],
the distribution of y.

1074
01:02:07,860 --> 01:02:10,340
But you also have
the fact that mu

1075
01:02:10,340 --> 01:02:14,870
is modeled as g inverse
of x transpose beta.

1076
01:02:14,870 --> 01:02:17,120
And for different g's, this
is just different modeling

1077
01:02:17,120 --> 01:02:18,380
assumptions, right?

1078
01:02:18,380 --> 01:02:25,930
So why should this be linear--

1079
01:02:25,930 --> 01:02:26,610
I don't know.

1080
01:02:29,470 --> 01:02:32,740
My authority as a
person who has not

1081
01:02:32,740 --> 01:02:34,780
examined the
[INAUDIBLE] data sets

1082
01:02:34,780 --> 01:02:38,050
for both things would be that
the changes are fairly minor.

1083
01:02:42,270 --> 01:02:45,420
OK, so this was all
for one observation.

1084
01:02:45,420 --> 01:02:49,350
We just, basically,
did probability.

1085
01:02:49,350 --> 01:02:52,620
We described some density, some
properties of the densities,

1086
01:02:52,620 --> 01:02:53,940
how to compute expectations.

1087
01:02:53,940 --> 01:02:55,314
That was really
just probability.

1088
01:02:55,314 --> 01:02:57,240
There was no data
involved at any point.

1089
01:02:57,240 --> 01:03:00,330
We did a bit of modeling, but
it was all for one observation.

1090
01:03:00,330 --> 01:03:01,710
What we're going
to try to do now

1091
01:03:01,710 --> 01:03:06,360
is the reverse
engineering of probability,

1092
01:03:06,360 --> 01:03:08,310
that is, statistics:
given data, what

1093
01:03:08,310 --> 01:03:10,780
can I infer about my model?

1094
01:03:10,780 --> 01:03:12,370
Now remember, there's
three parameters

1095
01:03:12,370 --> 01:03:15,040
that are floating
around in this model.

1096
01:03:15,040 --> 01:03:18,190
There is one that was theta.

1097
01:03:18,190 --> 01:03:21,689
There is one that was mu, and
there is one that is beta.

1098
01:03:21,689 --> 01:03:23,230
OK, so those are
the three parameters

1099
01:03:23,230 --> 01:03:25,110
that are floating around.

1100
01:03:25,110 --> 01:03:32,550
What we said is that the
expectation of y, given x,

1101
01:03:32,550 --> 01:03:34,980
is mu of x.

1102
01:03:34,980 --> 01:03:37,950
So if I estimate mu, I know the
conditional expectation of y,

1103
01:03:37,950 --> 01:03:44,960
given x, which definitely
gives me theta of x.

1104
01:03:44,960 --> 01:03:46,830
How do I go from mu
of x to theta of x?

1105
01:03:55,080 --> 01:03:58,010
The inverse of what--

1106
01:03:58,010 --> 01:03:59,890
of the arrow?

1107
01:03:59,890 --> 01:04:07,290
Yeah, sure, but how do I go
from this guy to this guy?

1108
01:04:07,290 --> 01:04:08,860
So theta as a function of mu is?

1109
01:04:12,556 --> 01:04:13,792
AUDIENCE: [INAUDIBLE]

1110
01:04:13,792 --> 01:04:15,250
PHILIPPE RIGOLLET:
Yeah, so we just

1111
01:04:15,250 --> 01:04:18,760
computed that mu was
b prime of theta.

1112
01:04:18,760 --> 01:04:23,260
So it means that theta is
just b prime inverse of mu.

1113
01:04:23,260 --> 01:04:24,910
So those two things
are the same as far

1114
01:04:24,910 --> 01:04:27,580
as we're concerned, because we
know that b prime is strictly

1115
01:04:27,580 --> 01:04:29,020
increasing, so it's invertible.

1116
01:04:29,020 --> 01:04:31,560
So it's just a matter
of re-parametrization,

1117
01:04:31,560 --> 01:04:34,420
and we just can switch from one
to the other whenever we want.

1118
01:04:34,420 --> 01:04:36,754
But why do we go through
mu? Because so far

1119
01:04:36,754 --> 01:04:38,170
for the entire
semester I told you

1120
01:04:38,170 --> 01:04:39,150
there was one
parameter that's theta.

1121
01:04:39,150 --> 01:04:41,420
It does not have to be the
mean, and that's the parameter

1122
01:04:41,420 --> 01:04:42,130
that we care about.

1123
01:04:42,130 --> 01:04:43,700
It's the one on which we
want to do inference.

1124
01:04:43,700 --> 01:04:45,580
That's the one for which we're
going to compute the Fisher

1125
01:04:45,580 --> 01:04:46,360
information.

1126
01:04:46,360 --> 01:04:49,572
This was the parameter that
was our object of worship.

1127
01:04:49,572 --> 01:04:51,280
And now, I'm saying
oh, I'm going to have

1128
01:04:51,280 --> 01:04:53,200
mu that's coming around.

1129
01:04:53,200 --> 01:04:55,270
And why do we have mu?
Because this is the mu

1130
01:04:55,270 --> 01:04:58,390
that we use to go to beta.

1131
01:04:58,390 --> 01:05:06,360
So I can go freely from theta
to mu using b prime or b

1132
01:05:06,360 --> 01:05:07,600
prime inverse.

1133
01:05:07,600 --> 01:05:11,080
And now, I can go
from mu to beta,

1134
01:05:11,080 --> 01:05:19,120
because I have that g of mu
of x is beta transpose x.

1135
01:05:19,120 --> 01:05:21,130
So in the end,
now, this is going

1136
01:05:21,130 --> 01:05:22,360
to be my object of worship.

1137
01:05:22,360 --> 01:05:24,318
This is going to be the
parameter that matters.

1138
01:05:24,318 --> 01:05:27,910
Because once I set beta,
I set everything else

1139
01:05:27,910 --> 01:05:30,290
through this chain.
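
The chain from beta to mu to theta can be sketched in a few lines. This is an illustration under assumptions: the function name `chain`, the example numbers, and the choice of the Poisson family with the log link are all made up for the example. The only facts taken from the lecture are mu = g inverse of x transpose beta and theta = b prime inverse of mu.

```python
import math

def chain(beta, x, g_inverse, b_prime_inverse):
    # Linear predictor: x transpose beta (a single number)
    eta = sum(b_j * x_j for b_j, x_j in zip(beta, x))
    # Regression function: mu(x) = g^{-1}(x transpose beta)
    mu = g_inverse(eta)
    # Canonical parameter: theta(x) = (b')^{-1}(mu(x))
    theta = b_prime_inverse(mu)
    return eta, mu, theta

# Hypothetical coefficients and covariates, chosen only for illustration
beta = [0.5, -1.0]
x = [1.0, 0.2]

# Poisson family: b'(theta) = e^theta, so (b')^{-1} = log; the link is log,
# so g^{-1} = exp
eta, mu, theta = chain(beta, x, g_inverse=math.exp, b_prime_inverse=math.log)

# With the canonical link, theta coincides with the linear predictor
assert abs(theta - eta) < 1e-12
```

Once beta is fixed, every other quantity is a deterministic function of x, which is why beta is the only thing left to estimate.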

1140
01:05:30,290 --> 01:05:33,010
So the question is if I
start stacking up this pile

1141
01:05:33,010 --> 01:05:36,260
of parameters-- so I
start with my beta,

1142
01:05:36,260 --> 01:05:38,520
which in turn gives me
a mu, which in turn,

1143
01:05:38,520 --> 01:05:39,580
gives me a theta--

1144
01:05:39,580 --> 01:05:43,720
can I just have a
long, streamlined--

1145
01:05:43,720 --> 01:05:45,640
what is the outcome
when I actually

1146
01:05:45,640 --> 01:05:48,016
start writing my likelihood,
not as a function of theta,

1147
01:05:48,016 --> 01:05:50,140
not as a function of mu,
but as a function of beta,

1148
01:05:50,140 --> 01:05:52,720
which is the one at
the end of the chain?

1149
01:05:52,720 --> 01:05:55,540
And hopefully, things are
going to happen nicely,

1150
01:05:55,540 --> 01:05:56,292
and they might not.

1151
01:05:56,292 --> 01:05:56,792
Yeah?

1152
01:05:56,792 --> 01:05:58,702
AUDIENCE: [INAUDIBLE]

1153
01:06:02,076 --> 01:06:03,680
PHILIPPE RIGOLLET: Is G--

1154
01:06:03,680 --> 01:06:04,800
that's my link.

1155
01:06:04,800 --> 01:06:06,710
G of mu of x--

1156
01:06:06,710 --> 01:06:09,320
now, mu is a function of x,
because it's conditional on x.

1157
01:06:12,200 --> 01:06:17,000
So this is really
theta of x, mu of x,

1158
01:06:17,000 --> 01:06:21,100
but b is not a function of x,
because it's just something

1159
01:06:21,100 --> 01:06:22,965
that tells me what the
function of x is.

1160
01:06:22,965 --> 01:06:24,865
AUDIENCE: [INAUDIBLE]

1161
01:06:26,074 --> 01:06:28,240
PHILIPPE RIGOLLET: Mu is
the conditional expectation

1162
01:06:28,240 --> 01:06:29,770
of y, given x.

1163
01:06:29,770 --> 01:06:33,010
It has, actually, a fancy name
in the statistics literature.

1164
01:06:33,010 --> 01:06:36,989
It's called-- anybody knows
the name of the function, mu

1165
01:06:36,989 --> 01:06:39,280
of x, which is a conditional
expectation of y, given x?

1166
01:06:42,116 --> 01:06:43,960
AUDIENCE: [INAUDIBLE]

1167
01:06:43,960 --> 01:06:46,120
PHILIPPE RIGOLLET: That's
the regression function.

1168
01:06:46,120 --> 01:06:47,230
That's the actual definition.

1169
01:06:47,230 --> 01:06:48,970
If I tell you what is the
definition of the regression

1170
01:06:48,970 --> 01:06:51,011
function, that's just the
conditional expectation

1171
01:06:51,011 --> 01:06:52,970
of y, given x.

1172
01:06:52,970 --> 01:06:58,720
And I could look at any property
of the conditional distribution

1173
01:06:58,720 --> 01:07:00,020
of y given x.

1174
01:07:00,020 --> 01:07:02,639
I could look at the
conditional 95th percentile.

1175
01:07:02,639 --> 01:07:04,180
I can look at the
conditional median.

1176
01:07:04,180 --> 01:07:06,450
I can look at the conditional
[INAUDIBLE] range.

1177
01:07:06,450 --> 01:07:08,470
I can look at the
conditional variance.

1178
01:07:08,470 --> 01:07:12,300
But I decide to look at the
conditional expectation, which

1179
01:07:12,300 --> 01:07:15,429
is called the
regression function.

1180
01:07:18,363 --> 01:07:19,341
Yes?

1181
01:07:19,341 --> 01:07:21,297
AUDIENCE: [INAUDIBLE]

1182
01:07:24,231 --> 01:07:26,290
PHILIPPE RIGOLLET: Oh,
there's no transpose here.

1183
01:07:26,290 --> 01:07:28,700
Actually, only Victor-Emmanuel
used this prime for transpose,

1184
01:07:28,700 --> 01:07:30,710
and I found it confusing
with the derivatives.

1185
01:07:30,710 --> 01:07:33,306
So primes here is
only a derivative.

1186
01:07:33,306 --> 01:07:34,623
AUDIENCE: [INAUDIBLE]

1187
01:07:35,122 --> 01:07:38,640
PHILIPPE RIGOLLET: Oh, yeah,
sorry, beta transpose x.

1188
01:07:38,640 --> 01:07:40,350
So you said what?

1189
01:07:40,350 --> 01:07:43,245
I said that g of mu of
x is beta transpose x?

1190
01:07:43,245 --> 01:07:45,145
AUDIENCE: [INAUDIBLE]

1191
01:07:48,035 --> 01:07:49,910
PHILIPPE RIGOLLET: Isn't
that the same thing?

1192
01:07:52,510 --> 01:07:53,970
X is a vector here, right?

1193
01:07:53,970 --> 01:07:54,930
AUDIENCE: Yeah.

1194
01:07:54,930 --> 01:07:56,555
PHILIPPE RIGOLLET:
So x transpose beta,

1195
01:07:56,555 --> 01:08:00,348
and beta transpose x
are the same thing.

1196
01:08:00,348 --> 01:08:02,280
AUDIENCE: [INAUDIBLE]

1197
01:08:03,979 --> 01:08:05,770
PHILIPPE RIGOLLET: So
beta looks like this.

1198
01:08:05,770 --> 01:08:08,706
X looks like this.

1199
01:08:08,706 --> 01:08:12,420
It's just a simple number.

1200
01:08:12,420 --> 01:08:13,386
Yeah, you're right.

1201
01:08:13,386 --> 01:08:15,010
I'm going to start
to look at matrices.

1202
01:08:15,010 --> 01:08:18,189
I'm going to have to be slightly
more careful when I do this.

1203
01:08:18,189 --> 01:08:20,740
OK so let's do the
reverse engineering.

1204
01:08:20,740 --> 01:08:22,199
I'm giving you data.

1205
01:08:22,199 --> 01:08:23,740
From this data,
hopefully, you should

1206
01:08:23,740 --> 01:08:26,979
be able to get what the
conditional-- if I give you

1207
01:08:26,979 --> 01:08:29,630
an infinite amount of data,
you would know exactly,

1208
01:08:29,630 --> 01:08:33,819
of pairs xy, what the
conditional distribution of y

1209
01:08:33,819 --> 01:08:36,130
given x is.

1210
01:08:36,130 --> 01:08:37,770
And in particular,
you would know

1211
01:08:37,770 --> 01:08:40,560
what the conditional
expectation of y given x

1212
01:08:40,560 --> 01:08:42,359
is, which means that
you would know mu,

1213
01:08:42,359 --> 01:08:44,192
which means that you
would know theta, which

1214
01:08:44,192 --> 01:08:45,920
means that you would know beta.

1215
01:08:45,920 --> 01:08:48,600
Now, when I have a finite
number of observations,

1216
01:08:48,600 --> 01:08:50,910
I'm going to try to
estimate mu of x.

1217
01:08:50,910 --> 01:08:53,250
But really, I'm going to
go the other way around.

1218
01:08:53,250 --> 01:08:56,279
Because I assume,
specifically, that mu of x

1219
01:08:56,279 --> 01:09:00,510
satisfies g of mu of x equals
x transpose beta, that

1220
01:09:00,510 --> 01:09:02,850
means that I only have
to estimate beta, which

1221
01:09:02,850 --> 01:09:06,432
is a much simpler object than
the entire regression function.

1222
01:09:06,432 --> 01:09:07,890
So that's what I'm
going to go for.

1223
01:09:07,890 --> 01:09:10,330
I'm going to try to represent
the likelihood, the log

1224
01:09:10,330 --> 01:09:12,890
likelihood, of my data as
a function, not of theta,

1225
01:09:12,890 --> 01:09:15,390
not of mu, but of beta--

1226
01:09:15,390 --> 01:09:18,120
and then, maximize that guy.

1227
01:09:18,120 --> 01:09:21,870
So now, rather than thinking
of just one observation,

1228
01:09:21,870 --> 01:09:23,940
I'm going to have a
bunch of observations.

1229
01:09:27,100 --> 01:09:29,069
So this might actually
look a little confusing,

1230
01:09:29,069 --> 01:09:32,189
but let's just make sure
that we understand each other

1231
01:09:32,189 --> 01:09:33,700
before we go any further.

1232
01:09:33,700 --> 01:09:38,510
So I'm going to
have observations,

1233
01:09:38,510 --> 01:09:43,359
x1, y1, all the
way to xn, yn, just

1234
01:09:43,359 --> 01:09:45,310
like in a natural
regression problem,

1235
01:09:45,310 --> 01:09:49,180
except that here my y's
might be 0-1 valued.

1236
01:09:49,180 --> 01:09:50,649
They might be positive valued.

1237
01:09:50,649 --> 01:09:51,732
They might be exponential.

1238
01:09:51,732 --> 01:09:54,600
They might be anything in the
canonical exponential family.

1239
01:09:57,840 --> 01:09:59,950
OK so I have this
thing, and now,

1240
01:09:59,950 --> 01:10:01,900
what I have is that my
observations are x1,

1241
01:10:01,900 --> 01:10:03,310
y1, xn, yn.

1242
01:10:03,310 --> 01:10:06,460
And what I want
is that I'm going

1243
01:10:06,460 --> 01:10:11,640
to assume that the conditional
expectation of yi, given--

1244
01:10:14,980 --> 01:10:18,710
the conditional distribution
of yi, given xi,

1245
01:10:18,710 --> 01:10:20,390
is something that has density.

1246
01:10:30,070 --> 01:10:31,473
Did I put an i on y-- yeah.

1247
01:10:42,820 --> 01:10:45,920
I'm not going to deal with
the phi and the c now.

1248
01:10:45,920 --> 01:10:48,610
And why do I have
theta i and not theta

1249
01:10:48,610 --> 01:11:01,350
is because theta i is
really a function of xi.

1250
01:11:01,350 --> 01:11:05,270
So it's really theta i of xi.

1251
01:11:05,270 --> 01:11:07,240
But what do I know
about theta i of xi,

1252
01:11:07,240 --> 01:11:11,890
it's actually equal to b--

1253
01:11:11,890 --> 01:11:13,920
I did this error twice--

1254
01:11:13,920 --> 01:11:16,450
b prime inverse of mu of xi.

1255
01:11:30,620 --> 01:11:34,160
And I'm going to assume that
this is of the form beta

1256
01:11:34,160 --> 01:11:36,190
transpose xi.

1257
01:11:36,190 --> 01:11:37,810
And this is why I have theta i--

1258
01:11:37,810 --> 01:11:40,414
is because this theta
i is a function of xi,

1259
01:11:40,414 --> 01:11:42,830
and I'm going to assume a very
simple form for this thing.

1260
01:11:46,030 --> 01:11:48,747
Sorry, sorry, sorry, sorry--

1261
01:11:48,747 --> 01:11:50,080
I should not write it like this.

1262
01:11:50,080 --> 01:11:51,980
This is only when I
have the canonical link.

1263
01:11:51,980 --> 01:11:57,310
So this is actually equal
to b prime inverse of g,

1264
01:11:57,310 --> 01:11:59,650
of xi transpose beta.

1265
01:12:05,010 --> 01:12:07,754
Sorry, g inverse--
those two things

1266
01:12:07,754 --> 01:12:09,170
are actually
canceling each other.

1267
01:12:13,760 --> 01:12:17,735
So as before, I'm going to
stack everything into some--

1268
01:12:17,735 --> 01:12:20,360
well, actually, I'm not going to
stack anything for the moment.

1269
01:12:20,360 --> 01:12:22,151
I'm just going to give
you a peek at what's

1270
01:12:22,151 --> 01:12:28,010
happening next week, rather
than just manipulating the data.

1271
01:12:28,010 --> 01:12:33,810
So here is how we're going
to proceed at this point.

1272
01:12:33,810 --> 01:12:36,540
Well now, I want to write
my likelihood function,

1273
01:12:36,540 --> 01:12:39,270
not as a function of theta,
but as a function of beta,

1274
01:12:39,270 --> 01:12:44,270
because that's the parameter
I'm actually trying to maximize.

1275
01:12:44,270 --> 01:12:47,050
So if I have a link--

1276
01:12:47,050 --> 01:12:50,455
so this thing that matters
here, I'm going to call h.

1277
01:12:53,600 --> 01:12:58,190
By definition, this is going
to be h of xi transpose beta.

1278
01:12:58,190 --> 01:13:00,080
Helena, you have a question?

1279
01:13:00,080 --> 01:13:02,069
AUDIENCE: Uh, no [INAUDIBLE]

1280
01:13:02,069 --> 01:13:04,110
PHILIPPE RIGOLLET: So this
is just all the things

1281
01:13:04,110 --> 01:13:04,930
that we know.

1282
01:13:04,930 --> 01:13:09,150
Theta is just the, by
definition of the fact that mu

1283
01:13:09,150 --> 01:13:11,505
is b prime of theta, the
mean is b prime of theta--

1284
01:13:11,505 --> 01:13:14,250
it means that theta is
b prime inverse of mu.

1285
01:13:14,250 --> 01:13:19,190
And then, mu is modeled from
the systematic component.

1286
01:13:19,190 --> 01:13:21,940
g of mu is xi transpose
beta, so this is

1287
01:13:21,940 --> 01:13:23,590
g inverse of xi transpose beta.

1288
01:13:23,590 --> 01:13:27,810
So I want to have b prime
inverse of g inverse.

1289
01:13:27,810 --> 01:13:30,030
This function is a
bit annoying to say,

1290
01:13:30,030 --> 01:13:32,750
so I'm just going to call it h.

1291
01:13:32,750 --> 01:13:34,837
And when I do the
composition of two inverses,

1292
01:13:34,837 --> 01:13:36,920
the inverse of the composition
of those two things

1293
01:13:36,920 --> 01:13:38,280
in the reverse order--

1294
01:13:38,280 --> 01:13:42,140
so h is really the inverse
of g composed with b

1295
01:13:42,140 --> 01:13:46,677
prime-- g of b prime, inverse.

1296
01:13:46,677 --> 01:13:48,260
And now, if I have
the canonical link,

1297
01:13:48,260 --> 01:13:51,200
since I know that g
is b prime inverse,

1298
01:13:51,200 --> 01:13:54,180
this is really
just the identity.
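
A minimal sketch of this composition, assuming the Bernoulli family as an illustrative choice, so that b(theta) = log(1 + e^theta) and b prime is the sigmoid. The function names and the complementary log-log link used for contrast are assumptions made for this example, not something written on the board.

```python
import math

def b_prime(theta):
    # Bernoulli: b(theta) = log(1 + e^theta), so b'(theta) is the sigmoid.
    return 1.0 / (1.0 + math.exp(-theta))

def b_prime_inverse(mu):
    # Inverse of the sigmoid: the logit.
    return math.log(mu / (1.0 - mu))

def g_inverse_logit(eta):
    # Canonical link g = logit = b' inverse, so g inverse is the sigmoid.
    return 1.0 / (1.0 + math.exp(-eta))

def g_inverse_cloglog(eta):
    # Non-canonical link for contrast: g(mu) = log(-log(1 - mu)).
    return 1.0 - math.exp(-math.exp(eta))

def h(eta, g_inverse):
    # h = b' inverse composed with g inverse, i.e. the inverse of (g o b').
    return b_prime_inverse(g_inverse(eta))

eta = 0.7  # stands in for xi transpose beta
h_canonical = h(eta, g_inverse_logit)    # equals eta up to rounding
h_cloglog = h(eta, g_inverse_cloglog)    # differs from eta
```

With the canonical link, h returns its input up to rounding, which is the "h is the identity" statement; with the other link, theta is a genuinely nonlinear function of xi transpose beta.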

1299
01:13:54,180 --> 01:13:58,109
As you can imagine,
this entire thing,

1300
01:13:58,109 --> 01:13:59,650
which is actually
quite complicated--

1301
01:13:59,650 --> 01:14:01,750
would just say oh, this thing,
actually, does not show up

1302
01:14:01,750 --> 01:14:03,041
when I have the canonical link.

1303
01:14:03,041 --> 01:14:06,370
I really just have that theta
can be replaced by xi transpose beta.

1304
01:14:06,370 --> 01:14:09,280
So think about going
back to this guy here.

1305
01:14:09,280 --> 01:14:15,160
Now, theta becomes
only xi transpose beta.

1306
01:14:15,160 --> 01:14:18,425
That's going to be much
more simple to optimize,

1307
01:14:18,425 --> 01:14:20,550
because remember, when I'm
going to take the log likelihood,

1308
01:14:20,550 --> 01:14:21,841
this thing is going to go away.

1309
01:14:21,841 --> 01:14:23,020
I'm going to sum those guys.

1310
01:14:23,020 --> 01:14:24,310
And so what I'm going to
have is something which

1311
01:14:24,310 --> 01:14:26,140
is essentially linear in beta.

1312
01:14:26,140 --> 01:14:28,340
And then, I'm going
to have this minus b,

1313
01:14:28,340 --> 01:14:31,760
which is just minus the sum
of convex functions of beta.

1314
01:14:31,760 --> 01:14:34,220
And so I'm going to have to
bring in the tools of convex

1315
01:14:34,220 --> 01:14:34,860
optimization.

1316
01:14:34,860 --> 01:14:37,566
Now, it's not just going to be
take the gradient, set it to 0.

1317
01:14:37,566 --> 01:14:39,440
It's going to be more
complicated to do that.

1318
01:14:39,440 --> 01:14:42,320
I'm going to have to do that
in an iterative fashion.

1319
01:14:42,320 --> 01:14:43,800
And so that's what
I'm telling you,

1320
01:14:43,800 --> 01:14:46,400
when you look at your
log likelihood for all

1321
01:14:46,400 --> 01:14:47,330
those functions.

1322
01:14:47,330 --> 01:14:50,062
You sum, the exponential goes
away because you had the log,

1323
01:14:50,062 --> 01:14:51,770
and then, you have
all these things here.

1324
01:14:51,770 --> 01:14:52,660
I kept the b.

1325
01:14:52,660 --> 01:14:53,990
I kept the h.

1326
01:14:53,990 --> 01:14:56,690
But if h is the identity,
this is the linear function,

1327
01:14:56,690 --> 01:14:59,210
the linear part, yi
times xi transpose

1328
01:14:59,210 --> 01:15:03,776
beta, minus b of my theta, which
is now only xi transpose beta.

1329
01:15:03,776 --> 01:15:05,900
And that's the function I
want to maximize in beta.
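
As a rough sketch, the function to maximize under the canonical link can be written out directly. The assumptions here are a dispersion equal to 1, the terms that do not depend on beta dropped, and the Poisson choice b(theta) = e^theta with a made-up three-point data set, purely for illustration.

```python
import math

def log_likelihood(beta, xs, ys, b):
    # Sum over i of  yi * (xi transpose beta) - b(xi transpose beta),
    # i.e. the log likelihood up to terms that do not involve beta.
    total = 0.0
    for x, y in zip(xs, ys):
        theta = sum(bj * xj for bj, xj in zip(beta, x))  # xi transpose beta
        total += y * theta - b(theta)
    return total

# Hypothetical toy data: each xi is (1, covariate), so beta = (intercept, slope).
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
ys = [1, 2, 4]

ll_zero = log_likelihood((0.0, 0.0), xs, ys, math.exp)
ll_other = log_likelihood((0.0, 0.7), xs, ys, math.exp)
```

Here ll_other is larger than ll_zero, so the second beta fits this toy data better; the optimization methods mentioned next are about finding the best beta systematically.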

1330
01:15:10,370 --> 01:15:11,390
It's a concave function.

1331
01:15:11,390 --> 01:15:15,130
When I know what b is, I have
an explicit formula for this,

1332
01:15:15,130 --> 01:15:18,230
and I want to just bring
in some optimization.

1333
01:15:18,230 --> 01:15:19,682
And that's what
we're going to do,

1334
01:15:19,682 --> 01:15:21,890
and we're going to see three
different methods, which

1335
01:15:21,890 --> 01:15:24,110
are really, basically,
the same method.

1336
01:15:24,110 --> 01:15:28,760
It's just an adaptation
or specialization

1337
01:15:28,760 --> 01:15:31,550
of the so-called Newton-Raphson
method, which is essentially

1338
01:15:31,550 --> 01:15:34,735
telling you do iterative
local quadratic approximations

1339
01:15:34,735 --> 01:15:36,360
to your function--
so second order

1340
01:15:36,360 --> 01:15:38,480
Taylor expansion,
minimize this guy,

1341
01:15:38,480 --> 01:15:41,060
and then do it again
from where you were.

1342
01:15:41,060 --> 01:15:43,460
And we'll see that
this can be, actually,

1343
01:15:43,460 --> 01:15:47,210
implemented using what's called
iteratively re-weighted least

1344
01:15:47,210 --> 01:15:49,640
squares, which means
that every step--

1345
01:15:49,640 --> 01:15:51,200
since it's just
a quadratic, it's

1346
01:15:51,200 --> 01:15:53,090
going to be just
squares in there--

1347
01:15:53,090 --> 01:15:56,190
can actually be solved
by using a weighted least

1348
01:15:56,190 --> 01:15:59,420
squares version of the problem.
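
A minimal sketch of what such an iteration could look like, assuming logistic regression (Bernoulli response, canonical link) with one covariate plus an intercept, and a small made-up data set. This is a plain Newton-Raphson loop with the 2-by-2 system solved by hand; it is not the derivation from the lecture. The weights w = mu (1 - mu) are the ones that would appear in the weighted least squares form of the same step.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def score(b0, b1, xs, ys):
    # Gradient of the log likelihood: sum of (yi - mu_i) * (1, xi).
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        mu = sigmoid(b0 + b1 * x)
        g0 += y - mu
        g1 += (y - mu) * x
    return g0, g1

def newton_logistic(xs, ys, n_steps=25):
    b0, b1 = 0.0, 0.0
    for _ in range(n_steps):
        g0, g1 = score(b0, b1, xs, ys)
        # Entries of X^T W X with W = diag(mu_i * (1 - mu_i)),
        # which is minus the Hessian of the log likelihood.
        h00 = h01 = h11 = 0.0
        for x in xs:
            mu = sigmoid(b0 + b1 * x)
            w = mu * (1.0 - mu)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: beta <- beta + (X^T W X)^{-1} * gradient.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data, chosen so that the two classes overlap.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b0_hat, b1_hat = newton_logistic(xs, ys)
```

At the returned point the gradient is numerically zero, which is what the maximizer has to satisfy; each pass through the loop is one local quadratic approximation followed by an exact solve.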

1349
01:15:59,420 --> 01:16:02,270
So I'm going to
stop here for today.

1350
01:16:02,270 --> 01:16:05,930
So we'll continue and probably
not finish this chapter,

1351
01:16:05,930 --> 01:16:07,440
but finish next week.

1352
01:16:07,440 --> 01:16:10,670
And then, I think
there's only one lecture.

1353
01:16:10,670 --> 01:16:13,310
Actually, for the last lecture,
what do you guys want to do?

1354
01:16:16,320 --> 01:16:18,460
Do you want to have
doughnuts and cider?

1355
01:16:18,460 --> 01:16:25,620
Do you want to just have
some more outlooking lecture

1356
01:16:25,620 --> 01:16:31,390
on what's happening
post 1975 in statistics?

1357
01:16:31,390 --> 01:16:36,130
Do you want to have a
review for the final exam--

1358
01:16:36,130 --> 01:16:38,970
pragmatic people.

1359
01:16:38,970 --> 01:16:43,300
AUDIENCE: [INAUDIBLE]
interesting, advanced topics.

1360
01:16:43,300 --> 01:16:46,100
PHILIPPE RIGOLLET: You want
to do interesting, advanced--

1361
01:16:46,100 --> 01:16:48,200
for the last lecture?

1362
01:16:48,200 --> 01:16:50,470
AUDIENCE: Something that
we haven't thought of yet.

1363
01:16:50,470 --> 01:16:53,920
PHILIPPE RIGOLLET: Yeah, that's,
basically, what I'm asking,

1364
01:16:53,920 --> 01:16:55,420
right-- interesting
advanced topics,

1365
01:16:55,420 --> 01:17:00,694
versus ask me any
question you want.

1366
01:17:00,694 --> 01:17:03,110
Those questions can be about
interesting, advanced topics,

1367
01:17:03,110 --> 01:17:03,850
though.

1368
01:17:03,850 --> 01:17:06,020
Like, what are interesting,
advanced topics?

1369
01:17:06,020 --> 01:17:06,547
I'm sorry?

1370
01:17:06,547 --> 01:17:08,630
AUDIENCE: Interesting with
doughnuts-- is that OK?

1371
01:17:08,630 --> 01:17:10,963
PHILIPPE RIGOLLET: Yeah, we
can always do the doughnuts.

1372
01:17:10,963 --> 01:17:11,838
[LAUGHTER]

1373
01:17:11,838 --> 01:17:14,792
AUDIENCE: As long as
there are doughnuts.

1374
01:17:14,792 --> 01:17:16,750
PHILIPPE RIGOLLET: All
right, so we'll do that.

1375
01:17:16,750 --> 01:17:19,500
So you guys have a good weekend.