The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: [INAUDIBLE] minus xi transpose t. I just pick whatever notation I want for a variable, and let's say it's t. So that's the least squares estimator. And it turns out that, as I said last time, it's going to be convenient to think of those things as matrices. So here, I already have vectors. I've already gone from one dimension, just real-valued random variables, to random vectors when I think of each xi. But if I start stacking them together, I'm going to have vectors and matrices that show up.

So the first vector I'm getting is y, which is just a vector where I have y1 to yn. So that's a boldface vector. Then I have X, which is a matrix where-- well, the first coordinate is always 1. So I have 1, and then x1 up to xp minus 1, and that's for observation 1. And then I have the same thing all the way down for observation n.

OK, everybody understands what this is? So I'm just basically stacking up all the xi's. So the i-th row is xi transpose. I am just stacking them up. And so if I want to write all these things to be true for each of them, all I need to do is to write a vector epsilon, which is epsilon 1 to epsilon n. And what I'm going to have is that y, the boldface vector, now is equal to the matrix X times the vector beta plus the vector epsilon. And it's really just exactly saying what's there, because for two-- so this is a vector, right? This is a vector. And what is the dimension of this vector? n, so this is n observations.
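As a concrete picture of this stacking, here is a small NumPy sketch-- not from the lecture, with made-up numbers-- that builds the boldface vector y, the design matrix X whose i-th row is xi transpose with a leading 1, and the vector equation y = X beta + epsilon.

```python
# A minimal sketch (illustrative data only) of the matrix notation: each row of
# the design matrix X is x_i^T = (1, x_{i,1}, ..., x_{i,p-1}), and the model
# stacks the n scalar equations y_i = x_i^T beta + eps_i into y = X beta + eps.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4                                  # hypothetical sample size and dimension

features = rng.normal(size=(n, p - 1))         # the covariates x_{i,1}, ..., x_{i,p-1}
X = np.column_stack([np.ones(n), features])    # prepend the constant 1 in each row
beta = np.array([2.0, -1.0, 0.5, 3.0])         # a made-up "true" coefficient vector
eps = rng.normal(scale=0.3, size=n)            # noise vector epsilon in R^n

y = X @ beta + eps                             # the vector equation y = X beta + eps
print(X.shape, y.shape)                        # (100, 4) (100,)
```

Each of the n rows of X carries one observation, which is exactly the statement that the n scalar equations hold simultaneously.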
And for all these-- for two vectors to be equal, I need all the coordinates to be equal, and that's exactly the same thing as saying that this holds for i equals 1 to n.

But now, when I have this, I can actually rewrite the sum for t equals-- sorry, for i equals 1 to n, of yi minus xi transpose beta, squared. This turns out to be equal to the Euclidean norm of the vector y minus the matrix X times beta, squared. And I'm going to put a 2 here so we know we're talking about the Euclidean norm. This just means this is the Euclidean norm. That's the one we've seen before when we talked about chi squared-- the norm is the square root of the sum of the squares of the coefficients, but here I have an extra square. So it's really just the sum of the squares of the coefficients, which is this. And here are the coefficients.

So now that I write this thing like that, then minimizing-- so my goal here, now, is going to be to solve the minimum over t in R^p of the norm of y minus X times t, squared. And just like we did for one dimension, we can actually write optimality conditions for this. I mean, this is a function. So this is a function from R^p to R. And if I want to minimize it, all I have to do is to take its gradient and set it equal to 0. So to find the minimum, set the gradient to 0.

So that's where it becomes a little complicated. Now I'm going to have to take the gradient of this norm. It might be a little annoying to do. But actually, what's nice about those things-- I mean, I remember that it was a bit annoying to learn. It's just basically rules of calculus that you don't use that much. But essentially, you can actually expand this norm. And you will see that the rules are basically the same as in one dimension; you just have to be careful about the fact that matrices do not commute. So let's expand this thing.
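The identity between the sum of squared residuals and the squared Euclidean norm is easy to sanity-check numerically; the sketch below (again illustrative, with arbitrary data) just compares the two expressions for some candidate t.

```python
# A quick numerical check (illustrative only) that the residual sum of squares
# sum_i (y_i - x_i^T t)^2 equals the squared Euclidean norm ||y - X t||_2^2,
# which is the objective the least squares estimator minimizes.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)
t = rng.normal(size=p)                                 # any candidate coefficient vector t

rss = sum((y[i] - X[i] @ t) ** 2 for i in range(n))    # coordinate-by-coordinate sum
norm_form = np.linalg.norm(y - X @ t) ** 2             # matrix/vector form
print(np.isclose(rss, norm_form))                      # True
```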
y minus Xt, squared-- well, this is equal to the norm of y squared, plus the norm of Xt squared, plus 2 times y transpose Xt. That's just expanding the square in more dimensions. And this, I'm actually going to write as y squared plus-- so here, for the norm squared of this guy, I always have that the norm of x squared is equal to x transpose x. So I'm going to write this as x transpose x, so it's t transpose, X transpose X, t, plus 2 times y transpose Xt.

So now, if I'm going to take the gradient with respect to t, I have basically three terms, and each of them has some sort of a different nature. This term is linear in t, and it's going to differentiate the same way that I differentiate a times x. I'm just going to keep the a. This guy is quadratic-- t appears twice. And this guy, I'm going to pick up a 2, and it's going to differentiate just like when I differentiate a times x squared. It's 2 times ax. And this guy is a constant with respect to t, so it's going to differentiate to 0.

So when I compute the gradient-- now, of course, all of these rules that I give you, you can check by looking at the partial derivative with respect to each coordinate. But arguably, it's much faster to know the rules of differentiation. It's like if I gave you the function exponential of x and I said, what is the derivative, and you started writing, well, I'm going to write exponential of x plus h, minus exponential of x, divided by h, and let h go to 0. That's a bit painful.

AUDIENCE: Why did you transpose your-- why does x have to be [INAUDIBLE]?

PHILIPPE RIGOLLET: I'm sorry?

AUDIENCE: I was wondering why you times t times the [INAUDIBLE]?

PHILIPPE RIGOLLET: The transpose of AB is B transpose A transpose. If you're not sure about this, just make A and B have different sizes, and then you will see that there's some incompatibility.
I mean, there's basically only one way to not screw that one up, so that's easy to remember.

So if I take the gradient, then it's going to be equal to what? It's going to be 0, plus-- we said here, this is going to differentiate like-- so think a times x squared, so I'm going to have 2ax. So here, basically, this guy is going to go to 2 X transpose X t. Now, I could have made this one go away, but that's the same thing as saying that my gradient is-- I can think of my gradient as being either a horizontal vector or a vertical vector. So if I remove this guy, I'm thinking of my gradient as being horizontal. If I remove that guy, I'm thinking of my gradient as being vertical. And that's what I want to think of, typically-- vertical vectors, column vectors. And then this guy, well, it's like-- these guys, just think a times x. So the derivative is just a, so I'm going to keep only that part here. Sorry, I forgot a minus somewhere-- yeah, here. Minus 2 y transpose X.

And what I want is this thing to be equal to 0. So t, the optimal t, is called beta hat and satisfies-- well, I can cancel the 2's and put the minus on the other side, and what I get is that X transpose X t is equal to y transpose X. Yeah, that's not working for me. That's because when I took the derivative, I still need to make sure-- it's the same question of whether I want things to be columns or rows. So this is not a column. If I remove that guy, y transpose X is a row. So I'm just going to take the transpose of this guy to make things work, and this is just going to be X transpose y. And this guy is X transpose y, so that I have columns.

So this is just a linear equation in t. And I have to solve it, so it's of the form some matrix times t is equal to another vector. And so that's basically a linear system. And the way to solve it, at least formally, is to just take the inverse of the matrix on the left.
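The calculus step above can be checked numerically: the gradient of the norm of y minus Xt squared should match 2 X transpose X t minus 2 X transpose y, and setting it to zero gives the normal equations X transpose X t = X transpose y. A small sketch, with made-up data:

```python
# A sketch (not part of the lecture) checking the gradient rule numerically and
# then solving the normal equations X^T X t = X^T y.
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

f = lambda t: np.linalg.norm(y - X @ t) ** 2
t0 = rng.normal(size=p)

# finite-difference gradient vs. the closed form 2 X^T X t - 2 X^T y
h = 1e-6
fd_grad = np.array([(f(t0 + h * e) - f(t0 - h * e)) / (2 * h) for e in np.eye(p)])
closed_form = 2 * X.T @ X @ t0 - 2 * X.T @ y
print(np.allclose(fd_grad, closed_form, atol=1e-4))    # True

# solving the normal equations (preferable in practice to forming the inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(X.T @ X @ beta_hat, X.T @ y))        # True
```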
So if X transpose X is invertible, then-- sorry, beta hat is the t I want-- I get that beta hat is equal to X transpose X inverse, X transpose y. And that's the least squares estimator.

So here, I use this condition. I want it to be invertible so I can actually write its inverse. Here, I wrote, rank of X is equal to p. What is the difference? Well, there's basically no difference. Basically, here, I have to assume-- what is the size of the matrix X transpose X?

[INTERPOSING VOICES]

PHILIPPE RIGOLLET: Yeah, so what is the size?

AUDIENCE: p by p.

PHILIPPE RIGOLLET: p by p. So this matrix is invertible if it's of rank p, if you know what rank means. If you don't, just know that rank p means that it's invertible. So it's full rank and it's invertible. And the rank of X transpose X is actually just the rank of X, because this is the same matrix that you apply twice. And that's all it's saying. So if you're not comfortable with the notion of rank that you see here, just think of this condition as being the condition that X transpose X is invertible. And that's all it says.

What does it mean for it to be invertible? This was true-- we made no assumption up to this point. If X transpose X is not invertible, it means that there might be multiple solutions to this equation. In particular, for a matrix to not be invertible, it means that there's some vector v. So if X transpose X is not invertible, then this is equivalent to: there exists a vector v, which is not 0, such that X transpose X v is equal to 0. That's what it means to not be invertible.

So in particular, if beta hat is a solution-- so this equation is sometimes called the score equations, because the gradient is called the score, and you're just checking that the gradient is equal to 0. So if beta hat satisfies star, then so does beta hat plus lambda v, for all lambda in the real line. And the reason is because, well, if I start looking at-- what is X transpose X times beta hat plus lambda v?
Well, by linearity, this is just X transpose X beta hat, plus lambda X transpose X times v. But this guy is what? It's 0, just because that's what we assumed. We assumed that X transpose X v was equal to 0, so we're left only with this part, which, by star, is just X transpose y. So that means that X transpose X times beta hat plus lambda v is actually equal to X transpose y, which means that there's another solution, which is not just beta hat, but any move of beta hat along this direction v by any size.

So that's going to be an issue, because you're looking for one estimator, and there's not just one-- in this case, there's many. And so this is not going to be well-defined, and you're going to have some issues. So if you want to talk about the least squares estimator, you have to make this assumption.

What does it imply in terms of-- can I think of p being, say, 2n, for example, in this case? What happens if p is equal to 2n?

AUDIENCE: Well, then the rank of your matrix is only p/2.

PHILIPPE RIGOLLET: So the rank of your matrix is only p/2, so that means that this is actually not going to happen. I mean, it's not only p/2, it's at most p/2. It's at most the smallest of the two dimensions of your matrix. So if your matrix is n times 2n, it's at most n, which means that it's not going to be full rank, so it's not going to be invertible. So every time the dimension p is larger than the sample size, your matrix is not invertible, and you cannot talk about the least squares estimator. So that's something to keep in mind.

And it's actually a very simple thing. It's essentially saying, well, if p is larger than n, it means that you have more parameters to estimate than you have equations to estimate them. So you have this linear system. There's one equation per observation. Each row, which was each observation, was giving me one equation. But then the number of unknowns in this linear system is p, and I cannot solve linear systems that have more unknowns than they have equations.
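Here is a minimal illustration, with made-up numbers, of exactly this failure when p is larger than n: X transpose X is singular, a direction v in its null space exists, and adding any multiple of v to one solution gives another solution of the score equations.

```python
# A sketch of the non-uniqueness when p > n: X^T X is singular, any v with
# X^T X v = 0 can be added to a solution, and the score equations still hold.
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 10                                   # more unknowns than equations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

G = X.T @ X
print(np.linalg.matrix_rank(G))                # at most n = 5 < p, so G is singular

beta_hat = np.linalg.pinv(X) @ y               # one particular solution
_, _, Vt = np.linalg.svd(X)                    # a direction v in the null space of X
v = Vt[-1]
print(np.allclose(G @ v, 0))                   # True

for lam in (0.0, 1.0, -7.5):
    b = beta_hat + lam * v
    print(np.allclose(G @ b, X.T @ y))         # every such b solves the score equations
```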
And so that's basically what's happening. Now, in practice, if you think about what data sets look like these days-- for example, people are trying to express some phenotype. So a phenotype is something you can measure on people-- maybe the color of your eyes, or your height, or whether you have diabetes or not, things like this, so things that are macroscopic. And then they want to use the genotype to do that. They want to sequence your genome and try to use this to predict whether you're going to be responsive to a drug, or whether your eyes are going to be blue, or something like this.

Now, the data sets that you can have-- maybe, for a given study about some sort of disease, you will sequence the genome of maybe 100 people. n is equal to 100. p is basically the number of genes they're sequencing. This is of the order of 100,000. So you can imagine that this is a case where n is much, much smaller than p, and you cannot talk about the least squares estimator. There's plenty of them. There's not just one line like that, lambda times v, that you can move along. There's basically an entire space in which you can move, and so it's not well-defined.

So at the end of this class, I will give you a short introduction on how you do this. This actually represents more and more-- it becomes a more and more preponderant part of the data sets you have to deal with, because people just collect data. When I do the sequencing, the machine allows me to sequence 100,000 genes. I'm not going to stop at 100 because doctors are never going to have cohorts of more than 100 patients. So you just collect everything you can collect. And this is true for everything. Cars have sensors all over the place-- much more data than they actually use. We're creating, we're recording everything we can. And so we need some new techniques for that, and that's what high-dimensional statistics is trying to answer.
So this is way beyond the scope of this class, but towards the end, I will give you some hints about what can be done in this framework, because, well, this is the new reality we have to deal with. So here, we're in a case where p is less than n, and typically much smaller than n. So the kind of orders of magnitude you want to have is maybe p of order 10 and n of order 100, something like this. You can scale that, but maybe 10 times larger.

So maybe you cannot solve this for beta hat, but actually, you can talk about X times beta hat, even if p is larger than n. And the reason is that X times beta hat is actually something that's very well-defined. So what is X times beta hat? Remember, I started with the model. So if I look at this definition, essentially, what I had as the original thing was that the vector y was equal to X times beta plus the vector epsilon. That was my model. So beta is actually giving me something. Beta is actually some parameter, some coefficients that are interesting. But here, it means that the observations that I have are of the form X times beta plus some noise. So if I want to adjust for the noise, remove the noise, a good candidate to denoise is X times beta hat. X times beta hat is something that should actually be useful to me, which should be close to X times beta.

So in the one-dimensional case, what it means is that if I have-- let's say this is the true line, and these are my x's-- these are the true points on the real line, and then I have my little epsilons that just give me my observations that move around this line. So this is one of the epsilons, say epsilon i. Then, to say that I recovered the line, I can talk about recovering the right intercept or recovering the right slope for this line. Those are the two parameters that I need to recover.
But I can also say that I've actually found a set of points that's closer to being on the line-- that's closer to this set of points right here-- than the original crosses that I observed. So if we go back to the picture here, for example, what I could do is say, well, for this point here-- there was an x here-- rather than looking at this dot, which was my observation, I can say, well, now that I've estimated the red line, this point should really be here. And actually, I can move all these dots so that they're actually on the red line. And this should be a better value, something that has less noise than the original y value that I observed. It should be close to the true value that I should be seeing without the extra noise. So that's definitely something that could be of interest.

For example, in imaging-- so when you do imaging, y is basically an image. So think of a pixel image, and you just stack it into one long vector. And what you see is something that should look like some linear combination of some feature vectors, maybe. So people have created a bunch of features-- they're called, for example, Gabor frames or wavelet transforms-- so just well-known libraries of variables x such that when you take linear combinations of those guys, this should look like a bunch of images. And for your image, you don't care what the coefficients of the image are in these bases that you came up with. What you care about is the noise in the image, and so you really want to get X beta.

So if you want to estimate X beta, well, you can use X beta hat. What is X beta hat? Well, since beta hat is X transpose X inverse, X transpose y, this is X times X transpose X inverse, X transpose y. That's my estimator for X beta. Now, this thing, actually, I can define even if X is not full rank. So why is this thing interesting?
Well, there's a formula for this estimator, but actually, I can visualize what this thing is. So let's assume, for the sake of illustration, that n is equal to 3. So that means that y lives in a three-dimensional space. And so let's say it's here-- I have my, let's say, y here. And I also have a plane that's given by the vectors x1 transpose, x2 transpose-- sorry, that's not what I want to do. I'm going to say that n is equal to 3 and that p is equal to 2. So I basically have two vectors: 1, 1, 1, and another one-- let's assume that it's, for example, a, b, c. So those are my two vectors. This is x1, and this is x2. And those are my three observations for this guy.

So when I minimize this, I'm looking at the points which can be formed as linear combinations of the columns of X, and I'm trying to find the guy that's the closest to y. So what does it look like? Well, of the two points, 1, 1, 1 is going to be, say, here-- that's the point 1, 1, 1-- and let's say that a, b, c is this point. So now I have a line that goes through those two guys. That's not really-- let's say it's going through those two guys. And this is the line which can be formed by looking only at linear combinations. So this is the line of X times t, for t in R2. That's this entire line that you can get. Yeah, sorry, it's not just a line-- I also have to have t equal to 0, all those things. So that actually creates an entire plane, which is going to be really hard for me to represent. I don't know, maybe I shouldn't do it in these dimensions. So I'm going to do it like that.

So this plane here is the set of Xt for t in R2. So that's a two-dimensional plane-- it definitely goes through 0, and those are all these things. So think of a sheet of paper in three dimensions. Those are the things I can get. So now, what I'm going to have as y is not necessarily in this plane.
y is actually something in this plane-- X beta-- plus some epsilon. y is X beta plus epsilon. So I start from this plane, and then I have this epsilon that pushes me, maybe, outside of this plane. And what least squares is doing is saying, well, I know that epsilon should be fairly small, so the only thing I'm going to do that actually makes sense is to take y and find the point on this plane that's the closest to it. And that corresponds to doing an orthogonal projection of y onto this thing, and that's actually exactly X beta hat.

So in one dimension, just because this is actually a little hard-- in one dimension, that's if p is equal to 1. So let's say this is my point. And then I have y, which is in two dimensions, so this is all on the plane. The point that's right here is actually X beta hat. That's how you find X beta hat: you take your point y and you project it onto the linear span of the columns of X. And that's X beta hat.

This does not tell you exactly what beta should be. And if you know a little bit of linear algebra, it's pretty clear, because if you want to find beta hat, that means that you should be able to find the coordinates of a point in the system of columns of X. And if those guys are redundant, there are not going to be unique coordinates for these guys, so that's why it's actually not easy to find. But X beta hat is uniquely defined. It's a projection. Yeah?

AUDIENCE: And epsilon is the distance between the y and the--

PHILIPPE RIGOLLET: No, epsilon is the vector that goes from-- so there's a true X beta. That's the true one. It's not clear-- I mean, X beta hat is unlikely to be exactly equal to X beta. And then the epsilon is the one that starts from this line. It's the vector that pushes you away. So really, this is this vector. That's epsilon. So it's not a length.
The length of epsilon is the distance, but epsilon is just the actual vector that takes you from one to the other. So this is all in two dimensions, and it's probably much clearer than what's here.

And so here, I claim that this X beta hat-- so from this picture, I implicitly claim that forming this operator that takes y and maps it into this vector X times X transpose X inverse, X transpose y, this should actually be equal to the projection of y onto the linear span of the columns of X. That's what I just drew for you. And what it means is that this matrix must be the projection matrix.

So of course, anybody-- who knows linear algebra here? OK, wow. So what are the conditions that a projection matrix should be satisfying?

AUDIENCE: Squares to itself.

PHILIPPE RIGOLLET: Squares to itself, right. If I project twice, I'm not moving. If I keep on iterating the projection, once I'm in the space I'm projecting onto, I'm not moving. What else? Do they have to be symmetric, maybe?

AUDIENCE: If it's an orthogonal projection.

PHILIPPE RIGOLLET: Yeah, so this is an orthogonal projection. It has to be symmetric. And that's pretty much it. So from those things, you can actually get quite a bit. But what's interesting is that if you actually look at the eigenvalues of this matrix, they should be either 0 or 1, essentially. And they are 1 if the associated eigenvector is within this space, and 0 otherwise. And so that's basically what you can check. This is not an exercise in linear algebra, so I'm not going to go too much into those details. But this is essentially what you want to keep in mind. What's associated to orthogonal projections is Pythagoras' theorem. And that's something that's going to be useful for us.
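A short sketch, with arbitrary data, of the claims just made about the matrix H = X (X transpose X) inverse X transpose: it is symmetric, it squares to itself, its eigenvalues are 0 or 1, and, because the projection is orthogonal, the Pythagoras identity the lecture turns to next holds for y = Hy + (I - H)y.

```python
# A sketch (made-up data) of the orthogonal projection onto the column span of X.
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(H, H.T))                 # symmetric
print(np.allclose(H @ H, H))               # idempotent: projecting twice changes nothing

eigvals = np.linalg.eigvalsh(H)            # ascending order: n - p zeros, then p ones
print(np.allclose(eigvals[-p:], 1), np.allclose(eigvals[:-p], 0))

fitted = H @ y                             # X beta hat, the projection of y
residual = y - fitted                      # orthogonal to the column span of X
print(np.isclose(np.linalg.norm(y) ** 2,
                 np.linalg.norm(fitted) ** 2 + np.linalg.norm(residual) ** 2))
```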
What it's essentially telling us is that if I look at this norm squared-- sorry, this norm squared plus this norm squared is equal to this norm squared. And that one is the norm of y squared. So Pythagoras tells me that the norm of y squared is equal to the norm of X beta hat squared, plus the norm of y minus X beta hat squared. Agreed? It's just because I have a right angle here. So this plus this is equal to this.

So now, to define this, I made no assumption. Epsilon could be as wild as it wants. I was just crossing my fingers that epsilon was actually small enough that it would make sense to project onto the linear span, because I implicitly assumed that epsilon did not take me all the way there, so that, actually, it makes sense to project back. And so for that, I need to somehow make assumptions that epsilon is well-behaved and not completely wild-- that it's moving uniformly in all directions of the space. There's no privileged direction where it's always going; otherwise, I'm going to make a systematic error. And I need those epsilons to average out somehow.

So here are the assumptions we're going to be making so that we can actually do some statistical inference. The first one is that the design matrix is deterministic. So I started by saying I have xi, yi, and maybe they're independent. Here, they are, but the xi's I want to think of as deterministic. If they're not deterministic, I can condition on them, but otherwise, it's very difficult to think about this thing if I think of those entries as being random, because then I have the inverse of a random matrix, and things become very, very complicated. So we're going to think of those guys as being deterministic.

We're going to think of the model as being homoscedastic. And actually, let me come back to this in a second. Homoscedastic-- well, if you're trying to find the etymology of this word, "homo" means the same, "scedastic" means scaling.
So what I want to say is that the epsilons have the same scaling. And since my third assumption is that epsilon is Gaussian, then essentially, what I'm going to want is that they all share the same sigma squared. They're independent, so this is indeed sigma squared times the identity covariance matrix. And I want them to be centered, as well. That means that there's no direction that I'm always privileging when I'm moving away from my plane there.

So these are important conditions. It depends on how much inference you want to do. If you want to write t-tests, you need all these assumptions. But if you only want to say, for example, that your least squares estimator is consistent, you really just need the fact that epsilon has variance sigma squared. The fact that it's Gaussian won't matter, just like Gaussianity doesn't matter for the law of large numbers. Yeah?

AUDIENCE: So the first assumption is that x has to be deterministic, but I just made up this x1, x2--

PHILIPPE RIGOLLET: x is the matrix.

AUDIENCE: Yeah. So those are random variables, right?

PHILIPPE RIGOLLET: No, that's the assumption.

AUDIENCE: OK. So I mean, once we collect the data and put it in the matrix, it becomes deterministic. So maybe I'm missing something.

PHILIPPE RIGOLLET: Yeah. So this is for the purpose of the analysis. I can actually assume that-- I look at my data, and I think of it this way. So what is the difference between thinking of data as deterministic or thinking of it as random? When I talked about random data, the only assumptions that I made were about the distribution. I said, well, if my x is a random variable, I want it to have this variance and I want it to have, maybe, this distribution, things like this. Here, I'm actually making an assumption on the values that I see. I'm saying that the values that you give me are such that the matrix is actually invertible-- X transpose X will be invertible.
So I've never done that before-- assuming that some random variable, say some Gaussian random variable, was positive, for example. We don't do that, because there's always some probability that things don't happen if you make things random. And so here, it's basically a little stronger. I start my assumption by saying, the data that's given to me will actually satisfy those assumptions. And that means that I don't actually need to make some modeling assumption on this thing, because I'm putting in directly the assumption I want to see.

So here, either I know sigma squared or I don't know sigma squared. So is that clear? So essentially, I'm assuming that I have this model, where this guy, now, is deterministic, and this is some multivariate Gaussian with mean 0 and covariance matrix the identity of R^n. That's the model I'm assuming. And I'm observing this, and I'm given this matrix X.

Where does this make sense? You could say, well, if I think of my rows as being people and I'm collecting genes, it's a little intense to assume that I actually know, ahead of time, what I'm going to be seeing, and that those things are deterministic. That's true, but it still does not prevent the analysis from going through, for one. And second, a better example might be this imaging example that I described, where those x's are actually libraries-- libraries of patterns that people have created, maybe from deep learning nets, or something like this. They've created patterns, and they say that all images should be representable as a linear combination of those patterns. And those patterns are somewhere in books, so they're certainly deterministic. Everything that's actually written down in a book is as deterministic as it gets.

Any questions about those assumptions? Those are the things we're going to be working with. There's only three of them. One is about x.
Actually, there's really two of them. I mean, this guy already appears here. So there's two-- one on the noise, one on the x's. That's it. Those things allow us to do quite a bit. They allow me to write the distribution of beta hat, which is great, because when I know the distribution of my estimator, I know its fluctuations. If it's centered around the true parameter, I know that it's going to be fluctuating around the true parameter, and it should tell me what kind of distribution the fluctuations have. I actually know how to build confidence intervals, I know how to build tests, I know how to build everything. It's just like when I told you that, asymptotically, the empirical average was Gaussian with mean theta and a standard deviation that depended on n, et cetera-- that's basically the only thing I needed. And this is what I'm actually getting here.

So let me start with this statement. So remember, beta hat satisfied this, so I'm going to rewrite it here. So beta hat was equal to X transpose X, inverse, X transpose y. That was the definition that we found. And now, I also know that y was equal to X beta plus epsilon. So let me just replace y by X beta plus epsilon here. Yeah?

AUDIENCE: Isn't it x transpose x inverse x transpose y?

PHILIPPE RIGOLLET: Yes, x transpose. Thank you. So I'm going to replace y by X beta plus epsilon. So that's-- and here comes the magic. I have the inverse of a matrix, and then I have the original matrix. So this is actually the identity, times beta. And now this guy-- well, this is a Gaussian, because this is a Gaussian random vector, and I just multiply it by a deterministic matrix. So we're going to use the rule that if I have, say, epsilon, which is N(0, Sigma), then B times epsilon is N(0, ...)-- can somebody tell me what the covariance matrix of B epsilon is?
AUDIENCE: What is capital B in this case?

PHILIPPE RIGOLLET: It's just a matrix-- any matrix that I can premultiply-- that I can postmultiply with epsilon. Yeah?

AUDIENCE: b transpose b.

PHILIPPE RIGOLLET: b transpose?

AUDIENCE: Times b.

PHILIPPE RIGOLLET: And Sigma is gone?

AUDIENCE: Oh, times sigma, sorry.

PHILIPPE RIGOLLET: That's the matrix, right?

AUDIENCE: b transpose sigma b.

PHILIPPE RIGOLLET: Almost. Anybody want to take a guess at the last one? I think we've removed all other possibilities. It's B Sigma B transpose. So if you ever answered yes to the question, "do you know Gaussian random vectors," but you did not know that, there's a gap in your knowledge that you need to fill, because that's probably the most important property of Gaussian vectors. When you multiply them by matrices, you have a simple rule on how to update the covariance matrix.

So here, Sigma is the identity. And here, this is the matrix B that I had here. So what this is, basically, is some multivariate N, of course-- then I'm going to have 0-- and what I need to do is B times the identity times B transpose, which is just B B transpose. And what is it going to tell me? It's X transpose X-- sorry, that's inverse-- inverse, X transpose, and then the transpose of this guy, which is X, X transpose X inverse, transpose. But this matrix is symmetric, so I'm actually not going to take the transpose of this guy.

And again, magic shows up. The inverse times the matrix-- those two guys cancel, and so this is actually equal to beta plus some N(0, X transpose X inverse). Yeah?

AUDIENCE: I'm a little lost on the [INAUDIBLE]. So you define that as the B matrix, and what happens?

PHILIPPE RIGOLLET: So I just apply this rule, right?

AUDIENCE: Yeah.
782 00:42:55,720 --> 00:42:57,850 PHILIPPE RIGOLLET: So if I multiply a matrix 783 00:42:57,850 --> 00:43:01,840 by a Gaussian, then let's say this Gaussian had 784 00:43:01,840 --> 00:43:05,680 mean 0, which is the case of epsilon here, 785 00:43:05,680 --> 00:43:07,960 then the covariance matrix that I get 786 00:43:07,960 --> 00:43:10,330 is b times the original covariance matrix times b 787 00:43:10,330 --> 00:43:11,470 transpose. 788 00:43:11,470 --> 00:43:15,290 So all I did is write this matrix times the identity 789 00:43:15,290 --> 00:43:18,195 times this matrix transpose. 790 00:43:18,195 --> 00:43:20,320 And the identity, of course, doesn't play any role, 791 00:43:20,320 --> 00:43:21,240 so I can remove it. 792 00:43:21,240 --> 00:43:23,860 It's just this matrix, then the matrix transpose. 793 00:43:23,860 --> 00:43:25,370 And what happened? 794 00:43:25,370 --> 00:43:27,280 So what is the transpose of this matrix? 795 00:43:27,280 --> 00:43:32,710 So I used the fact that if I look at x transpose x inverse x 796 00:43:32,710 --> 00:43:39,160 transpose, and now I look at the whole transpose of this thing, 797 00:43:39,160 --> 00:43:40,510 that's actually equal 2. 798 00:43:40,510 --> 00:43:43,510 And I use the rule that ab transpose is b transpose 799 00:43:43,510 --> 00:43:46,030 a transpose-- let me finish-- 800 00:43:46,030 --> 00:43:51,925 and it's x x transpose x inverse. 801 00:43:55,151 --> 00:43:55,650 Yes? 802 00:43:55,650 --> 00:43:58,020 AUDIENCE: I thought the-- 803 00:43:58,020 --> 00:44:00,335 for epsilon, it was sigma squared. 804 00:44:00,335 --> 00:44:01,710 PHILIPPE RIGOLLET: Oh, thank you. 805 00:44:01,710 --> 00:44:03,610 There's a sigma squared somewhere. 806 00:44:03,610 --> 00:44:08,610 So this was sigma squared times the identity, so I can just 807 00:44:08,610 --> 00:44:10,566 pick up a sigma squared anywhere. 808 00:44:14,740 --> 00:44:28,560 So here, in our case, so for epsilon, this is sigma. 809 00:44:28,560 --> 00:44:30,000 Sigma squared times the identity, 810 00:44:30,000 --> 00:44:31,166 that's my covariance matrix. 811 00:44:33,920 --> 00:44:35,242 You seem perplexed. 812 00:44:35,242 --> 00:44:37,170 AUDIENCE: It's just a new idea for me 813 00:44:37,170 --> 00:44:41,754 to think of a maximum likelihood estimator as a random variable. 814 00:44:41,754 --> 00:44:43,420 PHILIPPE RIGOLLET: Oh, it should not be. 815 00:44:43,420 --> 00:44:45,722 Any estimator is a random variable. 816 00:44:45,722 --> 00:44:48,132 AUDIENCE: Oh, yeah, that's a good point. 817 00:44:48,132 --> 00:44:52,236 PHILIPPE RIGOLLET: [LAUGHS] And I have not 818 00:44:52,236 --> 00:44:54,110 told you that this was the maximum likelihood 819 00:44:54,110 --> 00:44:55,720 estimator just yet. 820 00:44:55,720 --> 00:44:58,910 The estimator is a random variable. 821 00:44:58,910 --> 00:45:00,890 There's a word-- some people use estimate just 822 00:45:00,890 --> 00:45:03,519 to differentiate the estimator while you're 823 00:45:03,519 --> 00:45:05,810 doing the analysis with random variables and the values 824 00:45:05,810 --> 00:45:09,477 when you plug in the numbers in there. 825 00:45:09,477 --> 00:45:12,060 But then, of course, people use estimate because it's shorter, 826 00:45:12,060 --> 00:45:14,660 so then it's confusing. 827 00:45:14,660 --> 00:45:17,990 So any questions about this computation? 828 00:45:17,990 --> 00:45:20,810 Did I forget any other Greek letter along the way? 829 00:45:20,810 --> 00:45:22,620 All right, I think we're good. 
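[A small numerical sanity check of the rule just used may help here: if epsilon ~ N(0, Sigma) and B is a fixed matrix, then B epsilon ~ N(0, B Sigma B transpose), and with B = (X^T X)^{-1} X^T and Sigma = sigma^2 times the identity this gives beta hat = beta + B epsilon with covariance sigma^2 (X^T X)^{-1}. The Python sketch below simulates this; the design matrix, true beta, sigma, and sample sizes are made-up illustrative values, not anything from the lecture.]

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 0.5                        # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5])                # arbitrary "true" parameter

B = np.linalg.inv(X.T @ X) @ X.T                 # beta_hat = B y = beta + B epsilon

reps = 20000
beta_hats = np.empty((reps, p))
for r in range(reps):
    eps = sigma * rng.normal(size=n)             # epsilon ~ N(0, sigma^2 I_n)
    beta_hats[r] = B @ (X @ beta + eps)

emp_cov = np.cov(beta_hats, rowvar=False)        # empirical covariance of beta_hat
theo_cov = sigma**2 * np.linalg.inv(X.T @ X)     # B (sigma^2 I) B^T = sigma^2 (X^T X)^{-1}
print(np.round(emp_cov, 4))
print(np.round(theo_cov, 4))
print(np.round(beta_hats.mean(axis=0), 3))       # close to beta: the estimator is centered
```

[The two covariance matrices should agree up to Monte Carlo error.]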
830 00:45:22,620 --> 00:45:26,225 So one thing that it says-- and actually, 831 00:45:26,225 --> 00:45:27,600 thank you for pointing this out-- 832 00:45:27,600 --> 00:45:30,540 I said there's actually a little hidden statement there. 833 00:45:30,540 --> 00:45:33,130 By the way, this answers this question. 834 00:45:33,130 --> 00:45:35,990 Beta hat is of the form beta plus something that's centered, 835 00:45:35,990 --> 00:45:39,484 so it's indeed of the form Gaussian with mean beta 836 00:45:39,484 --> 00:45:41,900 and covariance matrix sigma squared x transpose x inverse. 837 00:45:45,520 --> 00:45:47,640 So that's very nice. 838 00:45:47,640 --> 00:45:50,830 As long as x transpose x is not huge, 839 00:45:50,830 --> 00:45:55,900 I'm going to have something that is close to what I want. 840 00:45:55,900 --> 00:45:58,550 Oh, sorry, x transpose x inverse is not huge. 841 00:46:01,800 --> 00:46:05,670 So there's a hidden claim in there, 842 00:46:05,670 --> 00:46:08,640 which is that least squares estimator 843 00:46:08,640 --> 00:46:11,588 is equal to the maximum likelihood estimator. 844 00:46:15,500 --> 00:46:17,920 Why does the maximum likelihood estimator just 845 00:46:17,920 --> 00:46:19,770 enter the picture now? 846 00:46:19,770 --> 00:46:23,280 We've been talking about regression for the past 18 847 00:46:23,280 --> 00:46:24,450 slides. 848 00:46:24,450 --> 00:46:26,130 And we've been talking about estimators. 849 00:46:26,130 --> 00:46:29,070 And I just dumped on you the least squares estimator, 850 00:46:29,070 --> 00:46:31,830 but I never really came back to this thing that we know-- 851 00:46:31,830 --> 00:46:35,100 maybe the method of moments, or maybe the maximum likelihood 852 00:46:35,100 --> 00:46:35,930 estimator. 853 00:46:35,930 --> 00:46:37,930 It turns out that those two things are the same. 854 00:46:37,930 --> 00:46:41,880 But if I want to talk about a maximum likelihood estimator, 855 00:46:41,880 --> 00:46:43,140 I need to have a likelihood. 856 00:46:43,140 --> 00:46:46,160 In particular, I need to have a density. 857 00:46:46,160 --> 00:46:47,600 And so if I want a density, I have 858 00:46:47,600 --> 00:46:53,210 to make those assumptions, such as the epsilons have 859 00:46:53,210 --> 00:46:55,970 this Gaussian distribution. 860 00:46:55,970 --> 00:46:58,580 So why is this the maximum likelihood estimator? 861 00:46:58,580 --> 00:47:04,740 Well, remember, y is x transpose beta plus epsilon. 862 00:47:04,740 --> 00:47:07,530 So I actually have a bunch of data. 863 00:47:07,530 --> 00:47:14,390 So what is my model here? 864 00:47:14,390 --> 00:47:18,040 Well, its the family of Gaussians 865 00:47:18,040 --> 00:47:22,460 on n observations with mean x beta, variance sigma 866 00:47:22,460 --> 00:47:31,380 squared identity, and beta lives in rp. 867 00:47:31,380 --> 00:47:34,800 Here's my family of distributions. 868 00:47:34,800 --> 00:47:38,160 That's the possible distributions for y. 869 00:47:38,160 --> 00:47:41,500 And so in particular, I can write the density of y. 870 00:47:47,980 --> 00:47:48,760 Well, what is it? 871 00:47:48,760 --> 00:47:52,010 It's something that looks like p of x-- 872 00:47:52,010 --> 00:47:58,359 well, p of y, let's say, is equal to 1 over-- 873 00:47:58,359 --> 00:48:00,400 so now its going to be a little more complicated, 874 00:48:00,400 --> 00:48:17,740 but its sigma squared times 2 pi to the p/2 exponential minus 875 00:48:17,740 --> 00:48:26,840 norm of y minus x beta squared divided by 2 sigma squared. 
876 00:48:26,840 --> 00:48:29,780 So that's just the multivariate Gaussian density. 877 00:48:29,780 --> 00:48:30,890 I just wrote it. 878 00:48:30,890 --> 00:48:33,530 That's the density of a multivariate Gaussian 879 00:48:33,530 --> 00:48:36,740 with mean x beta and covariance matrix sigma squared 880 00:48:36,740 --> 00:48:37,700 times the identity. 881 00:48:37,700 --> 00:48:40,410 That's what it is. 882 00:48:40,410 --> 00:48:42,300 So you don't have to learn this by heart, 883 00:48:42,300 --> 00:48:47,100 but if you are familiar with the case where p is equal to 1, 884 00:48:47,100 --> 00:48:49,820 you can check that you recover what you're familiar with, 885 00:48:49,820 --> 00:48:54,811 and this makes sense as an extension. 886 00:48:59,730 --> 00:49:08,560 So now, I can actually write my log likelihood. 887 00:49:08,560 --> 00:49:14,880 How many observations do I have of this vector y? 888 00:49:23,710 --> 00:49:25,144 Do I have n observations of y? 889 00:49:30,580 --> 00:49:33,110 I have just one, right? 890 00:49:33,110 --> 00:49:36,830 Oh, sorry, I shouldn't have said p, this is n. 891 00:49:36,830 --> 00:49:38,510 Everything is in dimension n. 892 00:49:38,510 --> 00:49:42,700 So I can think of either having n independent observations 893 00:49:42,700 --> 00:49:44,180 of each coordinate, or I can think 894 00:49:44,180 --> 00:49:47,210 of having just one observation of the vector y. 895 00:49:47,210 --> 00:49:50,050 So when I write my log likelihood, 896 00:49:50,050 --> 00:49:54,850 it's just the log of the density at y. 897 00:50:09,090 --> 00:50:13,710 And that's the vector y, which I can 898 00:50:13,710 --> 00:50:18,990 write as minus n/2 log sigma squared 899 00:50:18,990 --> 00:50:28,690 2 pi minus 1 over 2 sigma squared norm of y minus x beta. 900 00:50:28,690 --> 00:50:30,310 And that's, again, my boldface y. 901 00:50:36,710 --> 00:50:39,222 And what is my maximum likelihood estimator? 902 00:50:44,470 --> 00:50:47,940 Well, this guy does not depend on beta. 903 00:50:47,940 --> 00:50:50,850 And this is just a constant factor in front of this guy. 904 00:50:50,850 --> 00:50:54,270 So it's the same thing as just minimizing, 905 00:50:54,270 --> 00:50:57,230 because I have a minus sign, over all beta and rp. 906 00:51:03,140 --> 00:51:05,580 y minus x beta squared, and that's my least squares 907 00:51:05,580 --> 00:51:06,570 estimator. 908 00:51:15,312 --> 00:51:17,270 Is there anything that's unclear on this board? 909 00:51:17,270 --> 00:51:17,910 Any question? 910 00:51:20,550 --> 00:51:23,230 So all I used was-- so I wrote my log likelihood, which 911 00:51:23,230 --> 00:51:25,860 is just the log of this expression 912 00:51:25,860 --> 00:51:28,750 where y is my observation. 913 00:51:28,750 --> 00:51:32,430 And that's indeed the observation that I have here. 914 00:51:32,430 --> 00:51:35,980 And that was just some constant minus some constant times 915 00:51:35,980 --> 00:51:37,960 this quantity that depends on beta. 916 00:51:37,960 --> 00:51:40,270 So maximizing this whole thing is the same thing 917 00:51:40,270 --> 00:51:42,810 as minimizing only this thing. 918 00:51:42,810 --> 00:51:44,620 The minimizers are the same. 919 00:51:44,620 --> 00:51:47,320 And so that tells me that I actually just 920 00:51:47,320 --> 00:51:49,000 have to minimize the squared norm 921 00:51:49,000 --> 00:51:51,710 to get my maximum likelihood estimator. 
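[As a sketch of the equivalence just argued (under Gaussian noise, maximizing the likelihood is the same as minimizing the squared norm), the snippet below minimizes the negative log-likelihood from the board numerically and compares the minimizer to the closed-form least squares solution. The data-generating choices are illustrative assumptions.]

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, sigma = 100, 3, 1.0                        # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + sigma * rng.normal(size=n)

def neg_log_likelihood(b):
    # -log p(y) = (n/2) log(2 pi sigma^2) + ||y - X b||^2 / (2 sigma^2)
    resid = y - X @ b
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)      # closed-form least squares

print(np.round(beta_mle, 6))
print(np.round(beta_ls, 6))                      # the two agree up to optimizer tolerance
```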
922 00:51:51,710 --> 00:51:55,060 But this used, heavily, the fact that I could actually 923 00:51:55,060 --> 00:52:03,450 write exactly what my density was, 924 00:52:03,450 --> 00:52:06,240 and that when I took the log of this thing, 925 00:52:06,240 --> 00:52:09,660 I had exactly the square norm that showed up. 926 00:52:09,660 --> 00:52:12,630 If I had a different density, if, for example, 927 00:52:12,630 --> 00:52:17,040 I assumed that my coordinates of epsilons were, say, iid 928 00:52:17,040 --> 00:52:18,720 double exponential random variables. 929 00:52:18,720 --> 00:52:21,240 So it's just half of an exponential on the positives. 930 00:52:21,240 --> 00:52:24,280 And half of an exponential on the negatives. 931 00:52:24,280 --> 00:52:27,342 So if I said that, then this would not 932 00:52:27,342 --> 00:52:28,800 have the square norm that shows up. 933 00:52:28,800 --> 00:52:31,057 This is really idiosyncratic to Gaussians. 934 00:52:31,057 --> 00:52:32,640 If I had something else, I would have, 935 00:52:32,640 --> 00:52:35,190 maybe, a different norm here, or something different 936 00:52:35,190 --> 00:52:39,420 that measures the difference between y and x beta. 937 00:52:39,420 --> 00:52:41,820 And that's how you come up with other maximum likelihood 938 00:52:41,820 --> 00:52:44,010 estimators that lead to other estimators that 939 00:52:44,010 --> 00:52:45,420 are not the least squares-- 940 00:52:45,420 --> 00:52:47,040 maybe the least absolute deviation, 941 00:52:47,040 --> 00:52:50,310 for example, or this fourth moment, 942 00:52:50,310 --> 00:52:52,890 for example, that you suggested last time. 943 00:52:52,890 --> 00:52:55,650 So I can come up with a bunch of different things, 944 00:52:55,650 --> 00:52:56,910 and they might be tied-- 945 00:52:56,910 --> 00:52:59,716 maybe I can come at them from the same perspective 946 00:52:59,716 --> 00:53:01,590 that I came at the least squares estimator. 947 00:53:01,590 --> 00:53:03,210 I said, let's just do something smart 948 00:53:03,210 --> 00:53:06,350 and check, then, that it's indeed the maximum likelihood 949 00:53:06,350 --> 00:53:08,040 estimator. 950 00:53:08,040 --> 00:53:11,250 Or I could just start with the modeling on-- 951 00:53:11,250 --> 00:53:13,260 and check, then, what happens-- 952 00:53:13,260 --> 00:53:15,840 what was the implicit assumption that I put on my noise. 953 00:53:15,840 --> 00:53:18,164 Or I could start with the assumption of the noise, 954 00:53:18,164 --> 00:53:19,830 compute the maximum likelihood estimator 955 00:53:19,830 --> 00:53:21,000 and see what it turns into. 956 00:53:24,660 --> 00:53:26,760 So that was the first thing. 957 00:53:26,760 --> 00:53:29,080 I've just proved to you the first line. 958 00:53:29,080 --> 00:53:31,950 And from there, you can get what you want. 959 00:53:31,950 --> 00:53:34,690 So all the other lines are going to follow. 960 00:53:34,690 --> 00:53:39,570 So what is beta hat-- so for example, let's look 961 00:53:39,570 --> 00:53:41,660 at the second line, the quadratic risk. 962 00:53:46,180 --> 00:53:49,270 Beta hat minus beta, from this formula, 963 00:53:49,270 --> 00:53:53,780 has a distribution, which is N of 0, 964 00:53:53,780 --> 00:53:58,369 and then I have x transpose x inverse. 965 00:53:58,369 --> 00:54:03,299 AUDIENCE: Wouldn't the dimension be p on the board? 966 00:54:07,250 --> 00:54:10,308 PHILIPPE RIGOLLET: Sorry, the dimension of what? 967 00:54:10,308 --> 00:54:11,769 AUDIENCE: Oh beta hat minus beta.
968 00:54:11,769 --> 00:54:13,287 Isn't beta only a p dimensional? 969 00:54:13,287 --> 00:54:15,620 PHILIPPE RIGOLLET: Oh, yeah, you're right, you're right. 970 00:54:15,620 --> 00:54:17,450 That was all p dimensional there. 971 00:54:22,170 --> 00:54:23,700 Yeah. 972 00:54:23,700 --> 00:54:28,220 So if b here, the matrix that I'm actually applying, 973 00:54:28,220 --> 00:54:30,810 has dimension p times n-- 974 00:54:30,810 --> 00:54:34,710 so even if epsilon was an n dimensional Gaussian vector, 975 00:54:34,710 --> 00:54:39,310 then b times epsilon is a p dimensional Gaussian vector 976 00:54:39,310 --> 00:54:39,980 now. 977 00:54:39,980 --> 00:54:42,720 So that's how I switch from p to n-- 978 00:54:42,720 --> 00:54:43,770 from n to p. 979 00:54:43,770 --> 00:54:45,120 Thank you. 980 00:54:45,120 --> 00:54:50,430 So you're right, this is beta hat minus beta is this guy. 981 00:54:50,430 --> 00:54:54,090 And so in particular, if I look at the expectation 982 00:54:54,090 --> 00:55:01,160 of the norm of beta hat minus beta squared, what is it? 983 00:55:01,160 --> 00:55:08,140 It's the expectation of the norm of some Gaussian vector. 984 00:55:12,100 --> 00:55:15,530 And so it turns out-- so maybe we don't have-- 985 00:55:15,530 --> 00:55:18,960 well, that's just also a property of a Gaussian vector. 986 00:55:18,960 --> 00:55:26,840 So if epsilon is n0 sigma, then the expectation 987 00:55:26,840 --> 00:55:34,576 of the norm of epsilon squared is just the trace of sigma. 988 00:55:37,910 --> 00:55:41,030 Actually, we can probably check this 989 00:55:41,030 --> 00:55:44,540 by saying that this is the sum from j equal 1 990 00:55:44,540 --> 00:55:51,128 to p of the expectation of beta hat j minus beta j squared. 991 00:55:54,310 --> 00:55:57,879 Since beta j squared is the expectation-- beta j 992 00:55:57,879 --> 00:55:59,170 is the expectation of beta hat. 993 00:55:59,170 --> 00:56:01,990 This is actually equal to the sum from j equal 1 994 00:56:01,990 --> 00:56:08,110 to p of the variance of beta hat j, 995 00:56:08,110 --> 00:56:11,950 just because this is the expectation of beta hat. 996 00:56:11,950 --> 00:56:15,590 And how do I read the variances in a covariance matrix? 997 00:56:15,590 --> 00:56:17,830 There are just the diagonal elements. 998 00:56:17,830 --> 00:56:25,390 So that's really just sigma jj. 999 00:56:25,390 --> 00:56:27,700 And so that's really equal to-- 1000 00:56:27,700 --> 00:56:29,470 so that's the sum of the diagonal elements 1001 00:56:29,470 --> 00:56:30,790 of this matrix. 1002 00:56:30,790 --> 00:56:33,960 Let's call it sigma. 1003 00:56:33,960 --> 00:56:40,020 So that's equal to the trace of x transpose x inverse. 1004 00:56:42,740 --> 00:56:45,364 The trace is the sum of the diagonal elements of a matrix. 1005 00:56:48,080 --> 00:56:49,700 And I still had something else. 1006 00:56:49,700 --> 00:56:52,070 I'm sorry, this was sigma squared. 1007 00:56:52,070 --> 00:56:54,200 I forget it all the time. 1008 00:56:54,200 --> 00:56:56,800 So the sigma squared comes out. 1009 00:56:56,800 --> 00:56:58,760 It's there. 1010 00:56:58,760 --> 00:57:01,275 And so the sigma squared comes out 1011 00:57:01,275 --> 00:57:02,900 because the trace is a linear operator. 
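[The risk identity being assembled here, E||beta hat minus beta||^2 = sigma^2 trace((X^T X)^{-1}), can also be checked by simulation. The sketch below averages the squared estimation error over many draws of the noise; all numerical values are illustrative assumptions.]

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 50, 4, 2.0                         # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
XtX_inv = np.linalg.inv(X.T @ X)

reps = 50000
sq_err = np.empty(reps)
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = XtX_inv @ (X.T @ y)
    sq_err[r] = np.sum((beta_hat - beta) ** 2)

print("Monte Carlo quadratic risk:", sq_err.mean())
print("sigma^2 * trace((X^T X)^{-1}):", sigma**2 * np.trace(XtX_inv))
```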
1012 00:57:02,900 --> 00:57:06,275 If I multiply all the entries of my matrix by the same number, 1013 00:57:06,275 --> 00:57:08,150 then all the diagonal elements are multiplied 1014 00:57:08,150 --> 00:57:09,775 by the same number, so when I sum them, 1015 00:57:09,775 --> 00:57:13,930 the sum is multiplied by the same number. 1016 00:57:13,930 --> 00:57:18,120 So that's for the quadratic risk of beta hat. 1017 00:57:18,120 --> 00:57:21,580 And now I need to tell you about x beta hat. 1018 00:57:21,580 --> 00:57:27,250 x beta hat was something that was actually telling me 1019 00:57:27,250 --> 00:57:30,370 that was the point that I reported on the red line 1020 00:57:30,370 --> 00:57:31,480 that I estimated. 1021 00:57:31,480 --> 00:57:32,800 That was my x beta hat. 1022 00:57:32,800 --> 00:57:40,310 That was my y minus the noise. 1023 00:57:40,310 --> 00:57:42,470 Now, this thing here-- 1024 00:57:42,470 --> 00:57:47,100 so remember, we had this line, and I had my observation. 1025 00:57:47,100 --> 00:57:51,370 And here, I'm really trying to measure this distance squared. 1026 00:57:51,370 --> 00:57:53,470 This distance is actually quite important for me 1027 00:57:53,470 --> 00:57:58,920 because it actually shows up in the Pythagoras theorem. 1028 00:57:58,920 --> 00:58:02,260 And so you could actually try to estimate this thing. 1029 00:58:02,260 --> 00:58:03,790 So what is the prediction error? 1030 00:58:12,900 --> 00:58:18,840 So we said we have y minus x beta hat, so that's 1031 00:58:18,840 --> 00:58:21,930 the norm of this thing we're trying to compute. 1032 00:58:21,930 --> 00:58:25,350 But let's write this for what it is for one second. 1033 00:58:25,350 --> 00:58:27,810 So we said that beta hat was x transpose 1034 00:58:27,810 --> 00:58:31,710 x inverse x transpose y, and we know that y is 1035 00:58:31,710 --> 00:58:35,950 x beta plus epsilon. 1036 00:58:35,950 --> 00:58:37,410 So let's write this-- 1037 00:58:40,620 --> 00:58:43,800 x beta plus epsilon plus x. 1038 00:58:57,000 --> 00:59:00,320 And actually, maybe I should not write it. 1039 00:59:00,320 --> 00:59:02,722 Let me keep the y for what it is now. 1040 00:59:07,140 --> 00:59:08,960 So that means that I have, essentially, 1041 00:59:08,960 --> 00:59:13,050 the identity of rn times y minus this matrix times y. 1042 00:59:13,050 --> 00:59:15,510 So I can factor y out, and that's 1043 00:59:15,510 --> 00:59:20,280 the identity of rn minus x x transpose 1044 00:59:20,280 --> 00:59:27,280 x inverse x transpose, the whole thing times y. 1045 00:59:32,760 --> 00:59:37,980 We call this matrix p because this was the projection matrix 1046 00:59:37,980 --> 00:59:41,540 onto the linear span of the x's. 1047 00:59:41,540 --> 00:59:46,120 So that means that if I take a point x and I apply p times x, 1048 00:59:46,120 --> 00:59:50,910 I'm projecting onto the linear span of the columns of x. 1049 00:59:50,910 --> 00:59:57,400 What happens if I do i minus p times x? 1050 00:59:57,400 --> 00:59:59,000 It's x minus px. 1051 01:00:01,540 --> 01:00:04,690 So if I look at the point on which-- 1052 01:00:04,690 --> 01:00:07,000 so this is the point on which I project. 1053 01:00:07,000 --> 01:00:08,660 This is x. 1054 01:00:08,660 --> 01:00:13,260 I project orthogonally to get p times x. 1055 01:00:13,260 --> 01:00:15,920 And so what it means is that this operator i 1056 01:00:15,920 --> 01:00:21,810 minus px is actually giving me this guy, this vector here-- 1057 01:00:21,810 --> 01:00:23,360 x minus p times x.
1058 01:00:30,790 --> 01:00:33,920 Let's say this is 0. 1059 01:00:33,920 --> 01:00:36,460 This means that this vector, I can put it here. 1060 01:00:36,460 --> 01:00:38,370 It's this vector here. 1061 01:00:38,370 --> 01:00:40,510 And that's actually the orthogonal projection 1062 01:00:40,510 --> 01:00:43,870 of x onto the orthogonal complement of the span 1063 01:00:43,870 --> 01:00:45,532 of the columns of x. 1064 01:00:45,532 --> 01:00:51,000 So if I project x, or if I look of x minus its projection, 1065 01:00:51,000 --> 01:00:55,730 I'm basically projecting onto two orthogonal spaces. 1066 01:00:55,730 --> 01:00:59,520 What I'm trying to say here is that this here 1067 01:00:59,520 --> 01:01:01,301 is another projection matrix p prime. 1068 01:01:04,460 --> 01:01:10,310 That is just the projection matrix onto the orthogonal-- 1069 01:01:10,310 --> 01:01:29,560 projection onto orthogonal of column span of x. 1070 01:01:29,560 --> 01:01:31,180 Orthogonal means the set of vectors 1071 01:01:31,180 --> 01:01:34,329 that's orthogonal to everyone in this linear space. 1072 01:01:37,050 --> 01:01:40,080 So now, when I'm doing this, this is exactly what-- 1073 01:01:40,080 --> 01:01:42,600 I mean, in a way, this is illustrating this Pythagoras 1074 01:01:42,600 --> 01:01:43,610 theorem. 1075 01:01:43,610 --> 01:01:47,190 And so when I want to compute the norm of this guy, the norm 1076 01:01:47,190 --> 01:01:49,560 squared of this guy, I'm really computing-- 1077 01:01:49,560 --> 01:01:52,810 if this is my y now, this is px of y, 1078 01:01:52,810 --> 01:01:55,738 I'm really controlling the norm squared of this thing. 1079 01:02:06,720 --> 01:02:08,850 So if I want to compute the norm squared-- 1080 01:02:42,540 --> 01:02:48,020 so I'm almost there. 1081 01:02:48,020 --> 01:02:52,840 So what am I projecting here onto the orthogonal projector? 1082 01:02:52,840 --> 01:02:55,340 So here, y, now, I know that y is 1083 01:02:55,340 --> 01:03:00,480 equal to x beta plus epsilon. 1084 01:03:00,480 --> 01:03:06,480 So when I look at this matrix p prime times y, 1085 01:03:06,480 --> 01:03:11,105 It's actually p prime times x beta plus p prime times 1086 01:03:11,105 --> 01:03:11,604 epsilon. 1087 01:03:14,380 --> 01:03:18,400 What's happening to p prime times x beta? 1088 01:03:18,400 --> 01:03:19,525 Let's look at this picture. 1089 01:03:23,400 --> 01:03:26,610 So we know that p prime takes any point here and projects it 1090 01:03:26,610 --> 01:03:29,350 orthogonally on this guy. 1091 01:03:29,350 --> 01:03:33,960 But x beta is actually a point that lives here. 1092 01:03:33,960 --> 01:03:36,790 It's something that's on the linear span. 1093 01:03:36,790 --> 01:03:39,660 So where do all the points that are on this line 1094 01:03:39,660 --> 01:03:43,035 get projected to? 1095 01:03:43,035 --> 01:03:43,970 AUDIENCE: The origin. 1096 01:03:43,970 --> 01:03:45,920 PHILIPPE RIGOLLET: The origin, to 0. 1097 01:03:45,920 --> 01:03:47,750 They all get projected to 0. 1098 01:03:47,750 --> 01:03:50,120 And that's because I'm basically projecting 1099 01:03:50,120 --> 01:03:54,872 something that's on the column span of x onto its orthogonal. 1100 01:03:54,872 --> 01:03:56,580 So that's always 0 that I'm getting here. 1101 01:04:02,410 --> 01:04:04,410 So when I apply p prime to y, I'm 1102 01:04:04,410 --> 01:04:08,610 really just applying p prime to epsilon. 
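[To make the projection argument concrete, here is a small check, on an arbitrary random design, that P' = I - X (X^T X)^{-1} X^T is symmetric and idempotent, that it sends anything in the column span of X to 0 (which is exactly why P' y reduces to P' epsilon), and that its trace is n - p. The design and dimensions below are illustrative.]

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 3                                     # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)

P = X @ np.linalg.inv(X.T @ X) @ X.T             # projection onto the column span of X
P_prime = np.eye(n) - P                          # projection onto its orthogonal complement

print(np.allclose(P_prime, P_prime.T))           # symmetric
print(np.allclose(P_prime @ P_prime, P_prime))   # idempotent: projecting twice = once
print(np.allclose(P_prime @ (X @ beta), 0))      # the column span of X is sent to 0
print(np.round(np.trace(P_prime), 6), n - p)     # trace = n - p
```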
1103 01:04:08,610 --> 01:04:10,590 So I know that now, this, actually, 1104 01:04:10,590 --> 01:04:18,480 is equal to the norm of some multivariate Gaussian. 1105 01:04:18,480 --> 01:04:20,092 What is the size of this Gaussian? 1106 01:04:22,980 --> 01:04:24,570 What is the size of this matrix? 1107 01:04:24,570 --> 01:04:25,820 Well, I actually had it there. 1108 01:04:25,820 --> 01:04:28,440 It's i n, so it's n dimensional. 1109 01:04:28,440 --> 01:04:31,236 So it's some n dimensional with mean 0. 1110 01:04:31,236 --> 01:04:32,610 And what is the covariance matrix 1111 01:04:32,610 --> 01:04:34,179 of p prime times epsilon? 1112 01:04:39,109 --> 01:04:40,588 AUDIENCE: p p transpose. 1113 01:04:40,588 --> 01:04:43,880 PHILIPPE RIGOLLET: Yeah, p prime p prime transpose, 1114 01:04:43,880 --> 01:04:48,500 which we just said p prime transpose is p, 1115 01:04:48,500 --> 01:04:49,610 so that's p squared. 1116 01:04:49,610 --> 01:04:51,740 And we see that when we project twice, 1117 01:04:51,740 --> 01:04:54,540 it's as if we projected only once. 1118 01:04:54,540 --> 01:05:00,090 So here, this is n0 p prime p prime transpose. 1119 01:05:00,090 --> 01:05:05,150 That's the formula for the covariance matrix. 1120 01:05:05,150 --> 01:05:09,990 But this guy is actually equal to p prime times p prime, 1121 01:05:09,990 --> 01:05:13,580 which is equal to p prime. 1122 01:05:13,580 --> 01:05:18,380 So now, what I'm looking for is the norm squared of the trace. 1123 01:05:18,380 --> 01:05:20,050 So that means that this whole thing here 1124 01:05:20,050 --> 01:05:22,270 is actually equal to the trace. 1125 01:05:22,270 --> 01:05:24,730 Oh, did I forget again a sigma squared? 1126 01:05:24,730 --> 01:05:28,160 Yeah, I forgot it only here, which is good news. 1127 01:05:28,160 --> 01:05:32,665 So I should assume that sigma squared is equal to 1. 1128 01:05:32,665 --> 01:05:34,270 So sigma squared's here. 1129 01:05:34,270 --> 01:05:36,430 And then what I'm left with is sigma squared 1130 01:05:36,430 --> 01:05:39,920 times the trace of p prime. 1131 01:05:45,780 --> 01:05:51,240 At some point, I mentioned that the eigenvalues of a projection 1132 01:05:51,240 --> 01:05:54,210 matrix were actually 0 or 1. 1133 01:05:54,210 --> 01:05:56,689 The trace is the sum of the eigenvalues. 1134 01:05:56,689 --> 01:05:58,230 So that means that the trace is going 1135 01:05:58,230 --> 01:06:03,720 to be an integer number as the number of non-0 eigenvalues. 1136 01:06:03,720 --> 01:06:05,170 And the non-0 eigenvalues are just 1137 01:06:05,170 --> 01:06:07,776 the dimension of the space onto which I'm projecting. 1138 01:06:10,490 --> 01:06:15,200 Now, I'm projecting from something of dimension n 1139 01:06:15,200 --> 01:06:19,520 onto the orthogonal of a space of dimension p. 1140 01:06:19,520 --> 01:06:21,860 What is the dimension of the orthogonal 1141 01:06:21,860 --> 01:06:23,720 of a space of dimension p when thought 1142 01:06:23,720 --> 01:06:26,546 of space in dimension n? 1143 01:06:26,546 --> 01:06:27,296 AUDIENCE: [? 1. ?] 1144 01:06:27,296 --> 01:06:28,765 PHILIPPE RIGOLLET: N minus p-- 1145 01:06:28,765 --> 01:06:32,980 that's the so-called rank theorem, I guess, as a name. 1146 01:06:32,980 --> 01:06:35,710 And so that's how I get this n minus p here. 1147 01:06:35,710 --> 01:06:40,071 This is really just equal to n minus p. 1148 01:06:40,071 --> 01:06:40,570 Yeah? 1149 01:06:40,570 --> 01:06:43,319 AUDIENCE: Here, we're taking the expectation of the whole thing. 
1150 01:06:43,319 --> 01:06:44,860 PHILIPPE RIGOLLET: Yes, you're right. 1151 01:06:44,860 --> 01:06:48,780 So that's actually the expectation 1152 01:06:48,780 --> 01:06:50,410 of this thing that's equal to that. 1153 01:06:50,410 --> 01:06:53,020 Absolutely. 1154 01:06:53,020 --> 01:06:55,150 But I actually have much better. 1155 01:06:55,150 --> 01:06:57,412 I know, even, that the norm that I'm looking at, 1156 01:06:57,412 --> 01:06:58,870 I know it's going to be this thing. 1157 01:06:58,870 --> 01:07:00,911 What is going to be the distribution of this guy? 1158 01:07:03,860 --> 01:07:06,830 Norm squared of a Gaussian, chi squared. 1159 01:07:06,830 --> 01:07:09,150 So there's going to be some chi squared that shows up. 1160 01:07:09,150 --> 01:07:10,650 And the number of degrees of freedom 1161 01:07:10,650 --> 01:07:12,940 is actually going to be also n minus p. 1162 01:07:12,940 --> 01:07:16,510 And maybe it's actually somewhere-- 1163 01:07:16,510 --> 01:07:20,560 yeah, right here-- n minus p times sigma hat 1164 01:07:20,560 --> 01:07:22,690 squared over sigma squared. 1165 01:07:22,690 --> 01:07:24,675 This is my sigma hat squared. 1166 01:07:24,675 --> 01:07:28,200 If I multiply n minus p, I'm left only with this thing, 1167 01:07:28,200 --> 01:07:31,136 and so that means that I get sigma squared times-- 1168 01:07:31,136 --> 01:07:33,010 because I always forget my sigma squared-- 1169 01:07:33,010 --> 01:07:34,870 I get sigma squared times this thing. 1170 01:07:34,870 --> 01:07:37,270 And it turns out that the square norm of this guy 1171 01:07:37,270 --> 01:07:39,412 is actually exactly chi squared with n minus p 1172 01:07:39,412 --> 01:07:40,226 degrees of freedom. 1173 01:07:43,370 --> 01:07:47,900 So in particular, so we know that the expectation 1174 01:07:47,900 --> 01:07:50,556 of this thing is equal to sigma squared times n minus p. 1175 01:07:50,556 --> 01:07:53,342 So if I divide both sides by n minus p, 1176 01:07:53,342 --> 01:07:55,550 I'm going to have that something whose expectation is 1177 01:07:55,550 --> 01:07:57,140 sigma squared. 1178 01:07:57,140 --> 01:07:59,140 And this something, I can actually compute. 1179 01:07:59,140 --> 01:08:02,090 It depends on y, and x that I know, 1180 01:08:02,090 --> 01:08:04,100 and beta hat that I've just estimated. 1181 01:08:04,100 --> 01:08:05,000 I know what n is. 1182 01:08:05,000 --> 01:08:07,520 And n and p are the dimensions of my matrix x. 1183 01:08:07,520 --> 01:08:11,120 So I'm actually given an estimator whose expectation 1184 01:08:11,120 --> 01:08:13,330 is sigma squared. 1185 01:08:13,330 --> 01:08:15,880 And so now, I actually have an unbiased estimator 1186 01:08:15,880 --> 01:08:17,430 of sigma squared. 1187 01:08:17,430 --> 01:08:19,269 That's this guy right here. 1188 01:08:19,269 --> 01:08:20,560 And it's actually super useful. 1189 01:08:23,470 --> 01:08:25,270 So those are called the-- 1190 01:08:25,270 --> 01:08:27,950 this is the normalized sum of square residuals. 1191 01:08:27,950 --> 01:08:29,340 These are called the residuals. 1192 01:08:29,340 --> 01:08:32,410 Those are whatever is residual when 1193 01:08:32,410 --> 01:08:36,580 I project my points onto the line that I've estimated. 1194 01:08:36,580 --> 01:08:40,870 And so in a way, those guys-- if you go back to this picture, 1195 01:08:40,870 --> 01:08:47,109 this was yi and this was xi transpose beta hat.
1196 01:08:47,109 --> 01:08:49,540 So if beta hat is close to beta, the difference 1197 01:08:49,540 --> 01:08:52,810 between yi and xi transpose beta should 1198 01:08:52,810 --> 01:08:55,870 be close to my epsilon i. 1199 01:08:55,870 --> 01:08:57,430 It's some sort of epsilon i hat. 1200 01:09:00,319 --> 01:09:02,590 Agreed? 1201 01:09:02,590 --> 01:09:04,960 And so that means that if I think 1202 01:09:04,960 --> 01:09:07,510 of those as being epsilon i hat, they 1203 01:09:07,510 --> 01:09:09,910 should be close to epsilon i, and so their norm 1204 01:09:09,910 --> 01:09:14,390 should be giving me something that looks like sigma squared. 1205 01:09:14,390 --> 01:09:16,359 And so that's why it actually makes sense. 1206 01:09:16,359 --> 01:09:18,790 It's just magical that everything works out together, 1207 01:09:18,790 --> 01:09:21,130 because I'm not projecting on the right line, 1208 01:09:21,130 --> 01:09:23,229 I'm actually projecting on the wrong line. 1209 01:09:23,229 --> 01:09:27,310 But in the end, things actually work out pretty well. 1210 01:09:27,310 --> 01:09:28,990 There's one thing-- so here, the theorem 1211 01:09:28,990 --> 01:09:31,779 is that this thing not only has the right expectation, 1212 01:09:31,779 --> 01:09:33,450 but also has a chi squared distribution. 1213 01:09:33,450 --> 01:09:34,700 That's what we just discussed. 1214 01:09:34,700 --> 01:09:36,250 So here, I'm just telling you this. 1215 01:09:36,250 --> 01:09:37,899 But it's not too hard to believe, 1216 01:09:37,899 --> 01:09:40,300 because it's actually the norm of some vector. 1217 01:09:40,300 --> 01:09:42,279 You could make this obvious, but again, I 1218 01:09:42,279 --> 01:09:44,800 didn't want to bring in too much linear algebra. 1219 01:09:44,800 --> 01:09:46,359 So to prove this, you actually have 1220 01:09:46,359 --> 01:09:48,899 to diagonalize the matrix p. 1221 01:09:48,899 --> 01:09:53,890 So you have to invoke the eigenvalue decomposition 1222 01:09:53,890 --> 01:09:56,600 and the fact that the norm is invariant by rotation. 1223 01:09:56,600 --> 01:09:59,440 So for those who are familiar with, what I can do 1224 01:09:59,440 --> 01:10:01,780 is just look at the decomposition of p 1225 01:10:01,780 --> 01:10:08,200 prime into ud u transpose where this is an orthogonal matrix, 1226 01:10:08,200 --> 01:10:10,630 and this is a diagonal matrix of eigenvalues. 1227 01:10:10,630 --> 01:10:13,312 And when I look at the norm squared of this thing, 1228 01:10:13,312 --> 01:10:14,770 I mean, I have, basically, the norm 1229 01:10:14,770 --> 01:10:20,200 squared of p prime times some epsilon. 1230 01:10:20,200 --> 01:10:26,300 It's the norm of ud u transpose epsilon squared. 1231 01:10:26,300 --> 01:10:28,550 The norm of a rotation of a vector 1232 01:10:28,550 --> 01:10:32,280 is the same as the norm of the vector, so this guy goes away. 1233 01:10:32,280 --> 01:10:34,140 This is not actually-- 1234 01:10:34,140 --> 01:10:36,140 I mean, you don't have to care about this if you 1235 01:10:36,140 --> 01:10:37,880 don't understand what I'm saying, so don't freak out. 1236 01:10:37,880 --> 01:10:39,810 This is really for those who follow. 1237 01:10:39,810 --> 01:10:42,211 What is the distribution of u transpose epsilon? 1238 01:10:45,899 --> 01:10:50,310 I take a Gaussian vector that has covariance matrix sigma 1239 01:10:50,310 --> 01:10:52,560 squared times the [? identity, ?] and I basically 1240 01:10:52,560 --> 01:10:54,100 rotate it. 
1241 01:10:54,100 --> 01:10:57,965 What is its distribution? 1242 01:10:57,965 --> 01:10:58,465 Yeah? 1243 01:10:58,465 --> 01:10:59,440 AUDIENCE: The same. 1244 01:10:59,440 --> 01:11:00,830 PHILIPPE RIGOLLET: It's the same. 1245 01:11:00,830 --> 01:11:02,950 It's completely invariant, because the Gaussian 1246 01:11:02,950 --> 01:11:04,700 thinks of all directions as being the same. 1247 01:11:04,700 --> 01:11:07,550 So it doesn't really matter if I take a Gaussian or a rotated 1248 01:11:07,550 --> 01:11:08,600 Gaussian. 1249 01:11:08,600 --> 01:11:10,190 So this is also a Gaussian, so I'm 1250 01:11:10,190 --> 01:11:11,800 going to call it epsilon prime. 1251 01:11:11,800 --> 01:11:15,110 And I am left with just the norm of epsilon prime. 1252 01:11:15,110 --> 01:11:23,730 So this is the sum of the dj's squared times epsilon 1253 01:11:23,730 --> 01:11:24,250 j squared. 1254 01:11:27,030 --> 01:11:30,060 And we just said that the eigenvalues of p 1255 01:11:30,060 --> 01:11:33,780 are either 0 or 1, because it's a projector. 1256 01:11:33,780 --> 01:11:36,090 And so here, I'm going to get only 0's and 1's. 1257 01:11:36,090 --> 01:11:39,300 So I'm really just summing a certain number 1258 01:11:39,300 --> 01:11:42,050 of epsilon i squared. 1259 01:11:42,050 --> 01:11:45,110 So squares of standard Gaussians-- 1260 01:11:45,110 --> 01:11:48,210 sorry, with a sigma squared somewhere. 1261 01:11:48,210 --> 01:11:50,850 And basically, how many am I summing? 1262 01:11:50,850 --> 01:11:55,530 Well, the n minus p, the number of non-0 eigenvalues 1263 01:11:55,530 --> 01:11:57,190 of p prime. 1264 01:11:57,190 --> 01:12:00,490 So that's how it shows up. 1265 01:12:00,490 --> 01:12:05,820 When you see this, what theorem am I using here? 1266 01:12:05,820 --> 01:12:06,650 Cochran's theorem. 1267 01:12:06,650 --> 01:12:07,650 This is this magic book. 1268 01:12:07,650 --> 01:12:09,420 I'm actually going to dump everything that I'm not going 1269 01:12:09,420 --> 01:12:11,160 to prove to you and say, oh, this is actually Cochran's. 1270 01:12:11,160 --> 01:12:12,870 No, Cochran's theorem is really just 1271 01:12:12,870 --> 01:12:15,870 telling me something about orthogonality of things, 1272 01:12:15,870 --> 01:12:17,712 and therefore, independence of things. 1273 01:12:17,712 --> 01:12:19,170 And Cochran's theorem was something 1274 01:12:19,170 --> 01:12:23,271 that I used when I wanted to use what? 1275 01:12:23,271 --> 01:12:27,280 That's something I used just one slide before. 1276 01:12:27,280 --> 01:12:28,887 Student t-test, right? 1277 01:12:28,887 --> 01:12:30,970 I used Cochran's theorem to see that the numerator 1278 01:12:30,970 --> 01:12:33,610 and the denominator of the student statistic 1279 01:12:33,610 --> 01:12:35,414 were independent of each other. 1280 01:12:35,414 --> 01:12:37,330 And this is exactly what I'm going to do here. 1281 01:12:40,170 --> 01:12:42,430 I'm going to actually write a test to test, maybe, 1282 01:12:42,430 --> 01:12:44,430 if the beta j's are equal to 0. 1283 01:12:44,430 --> 01:12:49,110 I'm going to form a numerator, which is beta hat minus beta. 1284 01:12:49,110 --> 01:12:50,310 This is normal. 1285 01:12:50,310 --> 01:12:53,287 And we know that beta hat has a Gaussian distribution. 1286 01:12:53,287 --> 01:12:54,870 I'm going to standardize by something 1287 01:12:54,870 --> 01:12:55,720 that makes sense to me. 1288 01:12:55,720 --> 01:12:56,940 And I'm not going to go into details, 1289 01:12:56,940 --> 01:12:58,200 because we're out of time.
1290 01:12:58,200 --> 01:12:59,866 But there's the sigma hat that shows up. 1291 01:12:59,866 --> 01:13:03,240 And then there's a gamma j, which takes into account 1292 01:13:03,240 --> 01:13:06,450 the fact that my x's-- 1293 01:13:06,450 --> 01:13:12,465 if I look at the distribution of beta, which is gone, I think-- 1294 01:13:12,465 --> 01:13:14,220 yeah, beta is gone. 1295 01:13:14,220 --> 01:13:16,020 Oh, yeah, that's where it is. 1296 01:13:16,020 --> 01:13:20,040 The covariance matrix depends on this matrix x transpose x. 1297 01:13:20,040 --> 01:13:22,110 So this will show up in the variance. 1298 01:13:22,110 --> 01:13:25,000 In particular, diagonal elements are going to play a role here. 1299 01:13:25,000 --> 01:13:26,850 And so that's what my gammas are. 1300 01:13:26,850 --> 01:13:30,880 The gamma j is the jth diagonal element of this matrix. 1301 01:13:30,880 --> 01:13:35,010 So we'll resume that on Tuesday, so 1302 01:13:35,010 --> 01:13:38,476 don't worry too much if this is going too fast. 1303 01:13:38,476 --> 01:13:40,350 I'm not supposed to cover it, but just so you 1304 01:13:40,350 --> 01:13:45,300 get a hint of why Cochran's theorem actually was useful. 1305 01:13:45,300 --> 01:13:51,690 So I don't know if we actually ended up recording. 1306 01:13:51,690 --> 01:13:53,410 I have your homework. 1307 01:13:53,410 --> 01:13:56,500 And as usual, I will give it to you outside.
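[As a closing sketch tying the last results of the lecture together (all numerical values are illustrative assumptions, not from the lecture): the snippet below forms sigma hat squared = ||y - X beta hat||^2 / (n - p) over many simulated data sets, checks that its average is close to sigma^2, and checks that (n - p) sigma hat squared / sigma^2 has the mean and variance of a chi squared with n - p degrees of freedom.]

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 30, 5, 1.5                         # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20000
sigma2_hat = np.empty(reps)
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = XtX_inv @ (X.T @ y)
    resid = y - X @ beta_hat                     # the residuals
    sigma2_hat[r] = resid @ resid / (n - p)      # unbiased estimator of sigma^2

print("mean of sigma^2 hat:", sigma2_hat.mean(), "vs sigma^2 =", sigma**2)

scaled = (n - p) * sigma2_hat / sigma**2         # should look like chi^2 with n - p dof
print("mean:", scaled.mean(), "(expect", n - p, ")")
print("variance:", scaled.var(), "(expect", 2 * (n - p), ")")
```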