The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: --bunch of x's and a bunch of y's. The y's were univariate, just one real-valued random variable. And the x's were vectors that described a bunch of attributes for each of our individuals or each of our observations. Let's assume now that we're given essentially only the x's. This is sometimes referred to as unsupervised learning. There are just the x's; usually, the supervision is done by the y's. And so what you're trying to do is to make sense of this data. You're going to try to understand this data, represent this data, visualize this data, try to understand something, right?

So I give you a d-dimensional random vector, and you're going to have n independent copies of this random vector, OK? You will see that I'm going to very quickly run into some limitations about what I can actually draw on the board, because I'm using boldface here, and I'm also going to use the blackboard boldface, so it's going to be a bit difficult. So tell me if you're actually a little confused about what is a vector, what is a number, and what is a matrix. But we'll get there.

So I have X in R^d, and that's a random vector. And I have X_1 to X_n that are i.i.d. -- they're independent copies of X. So you can think of the realizations of those guys as a cloud of n points in R^d. And we're going to think of d as being fairly large. For this to start to make sense, we're going to think of d as being at least 4, OK? Meaning that you're going to have a hard time visualizing those things. If it was 3 or 2, you would be able to draw these points.
And that's pretty much as much sense as you're going to be making of those guys, just looking at the picture.

All right, so I'm going to write each of those X's out. This vector X has d coordinates, and I'm going to write them as X^1 to X^d. And I'm going to stack the observations into a matrix, OK? So once I have those guys, I'm going to have a matrix -- but here, I'm going to use the double bar -- whose rows are X_1 transpose down to X_n transpose. So what it means is that the coordinates of this guy are X_1^1 through X_1^d in the first row, down to X_n^1 through X_n^d in the last row. And the entry in the i-th row and j-th column of the matrix is X_i^j. OK, so the rows are the observations, and the columns are the covariates or attributes. So this is an n by d matrix.

All right, this is really just some bookkeeping -- how do we store this data? And the fact that we use a matrix, just like for regression, is going to be convenient, because we're going to be able to talk about projections -- going to be able to talk about things like this.

All right, so everything I'm going to say now is about variances or covariances of those things, which means that I need two moments, OK? If the variance does not exist, there's nothing I can say about this problem. So I'm going to assume that the variance exists. And one way to put it is to say that the expected squared two-norm of those guys is finite, which is another way to say that each coordinate has a finite second moment. You can think of it the way you want.

All right, so now, the mean of X: I have a random vector, so I can talk about the expectation of X. That's a vector in R^d, and it's just obtained by taking the expectation entrywise: E[X] = (E[X^1], ..., E[X^d]).
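Here is a minimal numpy sketch of this bookkeeping. The data, the sizes n and d, and the Gaussian distribution are all made up for illustration; nothing in the lecture pins them down.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5                 # illustrative sizes: n observations in dimension d

# The data matrix (the double-bar X): one observation per row, so it is n x d.
X = rng.normal(size=(n, d))   # a made-up cloud of n points in R^d

# Row i is the observation X_i; the entry X[i, j] is X_i^j, the j-th
# coordinate (covariate) of the i-th observation.
x_1 = X[0]                    # X_1, a vector in R^d

# The expectation E[X] is taken entrywise; its empirical counterpart:
x_bar = X.mean(axis=0)        # one average per coordinate, a vector in R^d
```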
One thing I should say out loud: for the purposes of this class, I will denote by subscripts the indices that correspond to observations, and by superscripts the indices that correspond to coordinates of a variable. And I think that's the same convention that we took for the regression case. Of course, you could use whatever you want -- if you want to put commas, et cetera, it just becomes a bit more complicated.

All right, and so now, once I have this, this tells me where my cloud of points is centered, right? So now I have a distribution on R^d -- I'll talk about this more when we talk about the empirical version. But if you think that you have, say, a two-dimensional Gaussian random variable, then you have a center in two dimensions, which is where it peaks, basically. And that's what we're talking about here. But the other thing we want to know is how much it spreads in every direction, right? In every direction of the two-dimensional thing, I can try to understand how much spread I'm getting. And the way you measure this is by using covariance.

So the covariance matrix, sigma -- that's a matrix which is d by d. And in its (j, k)-th entry, it records the covariance between the j-th coordinate of X and the k-th coordinate of X, OK? So sigma is the matrix with entries sigma_11 through sigma_dd down the diagonal, sigma_1d and sigma_d1 in the corners, and in general sigma_jk in position (j, k), where sigma_jk is just the covariance between X^j, the j-th coordinate, and X^k, the k-th coordinate. OK? So in particular, it's symmetric, because the covariance between X^j and X^k is the same as the covariance between X^k and X^j. It's just the covariance matrix -- something that records everything.
And so what's nice about the covariance matrix is that if I actually give you X as a vector, you can build the matrix just by looking at vectors times vectors transpose, rather than building it coordinate by coordinate. So for example, if you're used to using MATLAB, that's the way you want to build a covariance matrix, because MATLAB is good at manipulating vectors and matrices, rather than having you enter it entry by entry.

OK, so what is the covariance between X^j and X^k? Well, by definition, it's

sigma_jk = E[X^j X^k] - E[X^j] E[X^k],

right? That's the definition of the covariance; I hope everybody's seen that. And so, in particular, I can see that sigma can be written as

sigma = E[X X^T] - E[X] E[X]^T.

Why? Well, let's look at the (j, k)-th coefficient of this guy. If I look at the (j, k)-th coefficient of the first term, I see the (j, k)-th entry of E[X X^T], which is equal to E[(X X^T)_jk]. And what are the entries of X X^T? Well, they're of the form X^j times X^k exactly. So this is equal to E[X^j X^k].

Is that clear -- that when I have a rank-1 matrix of this form, X X^T, the entries are of this form? Because if I take, for example, the vector (x_1, x_2, x_3) and I multiply it by (x_1, x_2, x_3) transpose, the entries I'm getting are

x_1 x_1, x_1 x_2, x_1 x_3; x_2 x_1, x_2 x_2, x_2 x_3; x_3 x_1, x_3 x_2, x_3 x_3, OK?

So indeed, this is exactly of that form: if you look at entry (j, k), you get exactly x_j times x_k, OK? So that's the beauty of those matrices.
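A quick numpy illustration of this outer-product view; the vector and the sample data below are made up. np.outer builds exactly the rank-1 matrix x x^T described above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])       # think (x_1, x_2, x_3)
xxT = np.outer(x, x)                # the rank-1 matrix x x^T
print(xxT[0, 2], x[0] * x[2])       # entry (j, k) is x_j * x_k: 3.0 3.0

# The vectorized, MATLAB-style way to build a covariance estimate:
# average outer products instead of looping over the entries sigma_jk.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # made-up data, one observation per row
x_bar = X.mean(axis=0)
Sigma_hat = X.T @ X / len(X) - np.outer(x_bar, x_bar)
```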
So now, once I have this, I can do exactly the same thing for the second term, except that if I take the (j, k)-th entry, I will get exactly the same kind of thing, except that it's not going to be the expectation of the product, but the product of the expectations, right? So I get that the (j, k)-th entry of E[X] E[X]^T is just the j-th entry of E[X] times the k-th entry of E[X].

So if I put those two together, it's telling me that if I look at the (j, k)-th entry of sigma, which I called little sigma_jk, then this is equal to the first term minus the second term:

sigma_jk = E[X^j X^k] - E[X^j] E[X^k].

Oh, by the way, I forgot to say: the j-th entry of E[X] is equal to E[X^j], because that's just the definition of the expectation of a random vector -- my j and my k are now inside the expectation. And that's by definition the covariance between X^j and X^k, OK?

So, if you've seen those manipulations with vectors, hopefully you're bored out of your mind. And if you have not, then that's something you just need to get comfortable with, right? One thing that's going to be useful is to know very quickly what the entries of the outer product of a vector with itself are -- the vector times the vector transpose. And that's what we've been using on this second set of boards.

OK, so everybody agrees now that we've sort of shown that the covariance matrix can be written in this vector form: the expectation of X X^T, minus the expectation of X times the expectation of X transpose.

OK, and just like the covariance can be written in two ways -- right, we know that the covariance can also be written as

sigma_jk = E[(X^j - E[X^j])(X^k - E[X^k])].

Sometimes this is taken as the original definition of the covariance, and the other one is the second definition.
Just like you have the variance, which is either the expectation of the square of X minus E[X], or the expectation of X squared minus the square of the expectation of X -- it's the same thing for the covariance. And you can actually see this in terms of vectors: this implies that you can also rewrite sigma as

sigma = E[(X - E[X])(X - E[X])^T].

And the reason is that if you just distribute those guys, this is

E[X X^T] - E[X E[X]^T] - E[E[X] X^T] + E[X] E[X]^T.

Now, things could go wrong, because the main difference between matrices slash vectors and numbers is that multiplication does not commute, right? So in particular, X E[X]^T and E[X] X^T are not the same thing. That's the main difference with what we had before, but it actually does not matter for our problem, because when I take the expectation of the one guy, it's actually the same as the expectation of the other guy. And so, just because the expectation is linear, sigma becomes

E[X X^T] - E[X] E[X]^T - E[X] E[X]^T + E[X] E[X]^T.

And those last three terms are all the same matrix, E[X] E[X]^T -- two with a minus sign and one with a plus -- just because the expectation of X transpose is the same as the transpose of the expectation of X. So two of them cancel, and what I'm left with is just

sigma = E[X X^T] - E[X] E[X]^T, OK?

So the same thing that happens when you want to prove that you can write the variance either this way or that way happens for matrices, or for vectors -- for the covariance matrix. They go together.
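Here is a small numeric check of these two expressions for sigma, using a made-up discrete distribution on three points, so the expectations below are exact weighted sums rather than sample averages.

```python
import numpy as np

# A made-up distribution: X takes one of three values in R^2 with these probabilities.
pts = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 3.0]])
p = np.array([0.2, 0.5, 0.3])

mu = p @ pts                                       # E[X], a vector in R^2
# Form 1: E[X X^T] - E[X] E[X]^T
sigma1 = sum(pi * np.outer(x, x) for pi, x in zip(p, pts)) - np.outer(mu, mu)
# Form 2: E[(X - E[X])(X - E[X])^T]
sigma2 = sum(pi * np.outer(x - mu, x - mu) for pi, x in zip(p, pts))
assert np.allclose(sigma1, sigma2)                 # the two forms agree
```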
Are there any questions so far? And if you have some, please tell me, because I don't know to what extent you guys are comfortable with this at all or not.

OK, so let's move on. All right, so of course, what I'm describing here is in terms of the distribution. I took expectations, and covariances are also expectations, so those depend on the distribution of X, right? If I wanted to compute them, I would basically need to know what the distribution of X is. Now, we're doing statistics, so my question is going to be: how well can I estimate the covariance matrix itself, or some properties of this covariance matrix, based on data?

All right, so if I want to understand what my covariance matrix looks like based on data, I'm going to have to form its empirical counterpart, which I can do by the age-old statistical trick: replace your expectation by an average. Everything that's on the board -- wherever you see an expectation, just replace it by an average.

OK, so now I'm given X_1, ..., X_n, and I'm going to define the empirical mean. Really, the idea is: take your expectation and replace it by 1 over n times a sum, right? And so the empirical mean is just 1 over n times the sum of the X_i's -- I'm guessing everybody knows how to average vectors; it's just the average coordinate by coordinate. I will write this as X bar.

And the empirical covariance matrix, often called the sample covariance matrix -- hence the notation S -- well, this is my covariance matrix; let's just replace the expectations by averages:

S = (1/n) sum_{i=1}^n X_i X_i^T - X bar (X bar)^T,

where I replaced the expectation of X by the average, which I just called X bar, OK? And that's using the first form of the covariance, but I could actually do exactly the same thing using the other definition.
Using that other definition, this is

S = (1/n) sum_{i=1}^n (X_i - X bar)(X_i - X bar)^T.

And those are actually -- I mean, in a way, it looks like I could define two different estimators, but you can check, and I do encourage you to do this if you're not comfortable making those manipulations, that those two things are actually exactly the same, OK?

So now, I'm going to want to talk about matrices, OK? And remember, we defined this big matrix X, with the double bar. And the question is: can I express both X bar and the sample covariance matrix in terms of this big matrix X? Because right now, they're still expressed in terms of the vectors -- I'm summing vectors times vectors transpose. The question is, can I do that in a very compact way, in a way that I can actually remove this sum, all right? That's going to be the goal. And it's not just a notational goal -- it's really something that's going to be convenient for us, just like it was convenient to talk about matrices when we did linear regression.

OK, X bar. We just said it's (1/n) sum_{i=1}^n X_i, right? Now remember, what does this matrix look like? If I look at X transpose, its columns become X_1, my first observation, X_2, my second observation, all the way to X_n, my last observation, right? Agreed? That's what X transpose is. So if I want to sum those guys, I can multiply by the all-ones vector -- that's the definition of the all-ones vector; it's just a bunch of 1's, in R^n in this case. And so when I do X transpose times 1, what I get is just sum_{i=1}^n X_i. So if I divide by n, I get my average:

X bar = (1/n) X^T 1, OK?

So here, I definitely removed the sum.
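A short numpy check of both claims -- the matrix form of the mean, and the equality of the two candidate estimators -- on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d))              # made-up data, one observation per row

ones = np.ones(n)
x_bar = X.T @ ones / n                   # X bar = (1/n) X^T 1, no explicit sum
assert np.allclose(x_bar, X.mean(axis=0))

# The two expressions for the sample covariance matrix agree:
S1 = sum(np.outer(xi, xi) for xi in X) / n - np.outer(x_bar, x_bar)
S2 = sum(np.outer(xi - x_bar, xi - x_bar) for xi in X) / n
assert np.allclose(S1, S2)
```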
Let's see if we can do the same with the covariance matrix. Well, that's actually a little more difficult to see, I guess. But let's use the first definition for S, OK? So, let's think for one second: it's going to be something that involves X multiplying itself, OK? And the question is: is it going to be X times X transpose, or X transpose times X? To answer this question, you can go the easy route, which says: well, my covariance matrix is of what size? What is the size of S?

AUDIENCE: d by d.

PHILIPPE RIGOLLET: d by d, OK? X is of size n by d. So if I do X times X transpose, I'm going to have something which is of size n by n. If I do X transpose X, I'm going to have something which is d by d. That's the easy route, and it picks out one of the two guys. But you can actually open the box a little bit and see what's going on in there. If you do X transpose X, which we know gives you a d by d matrix, you'll see that X transpose has columns that are of the form X_i, and X has rows that are of the form X_i transpose, right? And so this is actually probably the right way to go.

So let's look at what X transpose X gives us. I claim that it's actually going to give us what we want. But rather than going there directly -- I mean, we could check it entry by entry, but there's actually a nice thing we can do. Before we go there, let's write X transpose as the following sum of matrices: first, X_1 in the first column and just a bunch of 0's everywhere else -- so it's still d by n; n minus 1 of the columns are equal to 0 -- then, a matrix with a 0 column, then X_2, and then just a bunch of 0's, right? And so on, all the way to a last matrix of 0's with X_n in the last column, OK? Everybody agrees with this? See what I'm doing here? I'm just splitting it into a sum of matrices that each have only one nonzero column.
But clearly, that's true. Now let's look at the product of this guy with itself. So, let's call these matrices M_1, M_2, ..., M_n. So when I do X transpose X, what I do is the sum of the M_i's, for i equal 1 to n, times the transpose of that sum, right? Now, the transpose of the sum of the M_i's is just the sum of each of the M_i's transposed, OK? So now I just have a product of two sums, so I'm just going to re-index the second one by j:

X^T X = sum_{i=1}^n sum_{j=1}^n M_i M_j^T. OK?

And now what we want to notice is: if i is different from j, what's happening? Well, if i is different from j, let's look at, say, M_1 times M_2 transpose. So what is the product of those two matrices?

AUDIENCE: It's like a dot product with the transpose.

PHILIPPE RIGOLLET: A dot product is just giving you a number, right? This is going to be a matrix -- it's the product of two matrices, a matrix times a matrix, so it should be a matrix of size d by d.

Yeah, I should see a lot of hands that look like a zero, right? Because look at this. The only nonzero column of M_1 is its first column, so in the product it only ever multiplies the first row of M_2 transpose -- and that row is all 0's, because X_2 sits in the second row. So every time, this is going to give you 0, and it's going to be the same for every single entry. So this matrix is just full of 0's, right? They never hit each other when I do the matrix-matrix multiplication -- every nonzero hits a 0. And this, of course, you can check for every i different from j. So this means that

M_i M_j^T = 0 when i is different from j, right?

Everybody is OK with this?
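A tiny numpy illustration of these one-nonzero-column matrices; the sizes and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))

def M(i):
    """M_i: d x n, with the observation X_i in column i and zeros elsewhere."""
    out = np.zeros((d, n))
    out[:, i] = X[i]
    return out

assert np.allclose(sum(M(i) for i in range(n)), X.T)      # the decomposition of X^T
assert np.allclose(M(0) @ M(1).T, 0)                      # cross terms vanish
assert np.allclose(M(0) @ M(0).T, np.outer(X[0], X[0]))   # diagonal terms survive
```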
So what that means is that when I do this double sum, really, it's a simple sum. It's just the sum from i equal 1 to n of M_i M_i^T, because those are the only terms in this double sum that are not going to be 0 -- M_1 only survives against M_1 itself. Now, let's see what's going on when I do M_i times M_i transpose. Well, now the X_i in M_i lines up exactly with the X_i transpose in M_i transpose, and so I really have X_i times X_i transpose. So this is really just

X^T X = sum_{i=1}^n X_i X_i^T,

just because M_i M_i^T is X_i X_i^T. There's nothing else there.

So that's the good news, right? The first term of S is really just X transpose X divided by n.

OK, so let me rewrite S. That's the definition we have:

S = (1/n) sum_{i=1}^n X_i X_i^T - X bar (X bar)^T.

And we know that the first guy is equal to (1/n) X^T X. For the second, we just proved that little X bar is equal to (1/n) X^T times the all-ones vector, so I'm just going to substitute that. I'm going to pull out my two 1 over n's -- one from this guy, one from that guy -- so I'm going to get 1 over n squared, and then X^T 1 times (X^T 1) transpose. And (X^T 1) transpose -- right, the rule: if I have (AB) transpose, it's B transpose times A transpose; that's just the rule of transposition -- is 1^T times X transpose transpose. And so when I put all these guys together, this is

S = (1/n) X^T X - (1/n^2) X^T 1 1^T X,

because X transpose, transposed, is X, OK?
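The same identities, checked numerically on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d))          # made-up data
ones = np.ones(n)

# X^T X collapses the sum of outer products into one matrix product:
assert np.allclose(X.T @ X, sum(np.outer(xi, xi) for xi in X))

# The sum-free form of the sample covariance matrix:
x_bar = X.T @ ones / n
S_sum = sum(np.outer(xi, xi) for xi in X) / n - np.outer(x_bar, x_bar)
S_mat = X.T @ X / n - np.outer(X.T @ ones, X.T @ ones) / n**2
assert np.allclose(S_sum, S_mat)
```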
So now I have something which is of the form X transpose something X -- in both terms, to the left, an X transpose; to the right, an X. So I can factor out whatever's in there:

S = X^T ( (1/n) I_n - (1/n^2) 1 1^T ) X.

Because if you distribute it back, right -- here, I'm going to get X transpose times the identity times X, the whole thing divided by n; that's the first term. And then the second one: I'm going to get 1 over n squared times X transpose 1 1 transpose X, which is the other term, OK? So, the way it's written on the slides, I factored out one of the 1 over n's:

S = (1/n) X^T ( I_n - (1/n) 1 1^T ) X.

So that's just what's on the slides.

What does the matrix 1 1^T look like?

AUDIENCE: All 1's.

PHILIPPE RIGOLLET: It's just all 1's, right? Because its entries are the products of the coordinates of the all-ones vector with the coordinates of the all-ones vector, so I only get 1's. So it's an n by n matrix with only 1's.

So this matrix, I can actually write exactly, right? H -- this matrix that I call H, which is what's sandwiched in between the X transpose and the X:

H = I_n - (1/n) 1 1^T.

By definition -- this is the definition of H. And this thing, I can write its coordinates exactly. It's the identity -- the matrix with 1's on the diagonal and 0's elsewhere -- minus a matrix that has 1 over n everywhere.
OK, so the whole thing has 1 minus 1 over n on the diagonal, and minus 1 over n everywhere off the diagonal, OK? And now I claim that this matrix is an orthogonal projector. Now, writing the coordinates out like this is completely useless -- it's just a way for you to see that it's actually very convenient to think about this as a matrix problem, because things are much nicer when you work with the structured form of your matrices, right? I mean, imagine you're sitting at a midterm, and I say: here's the matrix that has 1 minus 1 over n on the diagonal and minus 1 over n off the diagonal; prove to me that it's a projection matrix. If you basically take this guy times itself entry by entry, it's going to be really complicated, right? So -- we know it's symmetric; that's for sure. But the fact that it can be written as I_n minus (1/n) 1 1^T is going to make my life super easy to check this.

That's the definition of a projector: it has to be symmetric, and it has to square to itself, because we just said in the chapter on linear regression that once you project, if you apply the projection again, you're not moving -- you're already there.

OK, so why is H squared equal to H? Well, let's just write H squared:

H^2 = ( I - (1/n) 1 1^T )( I - (1/n) 1 1^T ).

Let's just expand this. The identity times (1/n) 1 1^T is just (1/n) 1 1^T, so I get the identity, minus (1/n) 1 1^T, minus (1/n) 1 1^T, and then -- here's what makes the deal -- a plus term with a 1 over n squared this time, and the product 1 1^T times 1 1^T. But this thing in the middle, 1^T 1 -- what is this? It's n, right? It's the inner product of the all-ones vector with the all-ones vector: I'm just summing n times 1 squared, which is n. So this is equal to n; I pull it out, it cancels one of the n's, and I'm back to what I had before:

H^2 = I - (2/n) 1 1^T + (1/n) 1 1^T = I - (1/n) 1 1^T = H,

because one of the 1 over n's cancels, OK?
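A two-line check of the projector property (n here is an arbitrary made-up size):

```python
import numpy as np

n = 6
H = np.eye(n) - np.ones((n, n)) / n   # H = I_n - (1/n) 1 1^T

assert np.allclose(H, H.T)            # symmetric
assert np.allclose(H @ H, H)          # squares to itself, so H is a projector
```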
So it's a projection matrix. It's projecting onto some linear space, right? It's taking a vector and projecting it onto a certain space of vectors. What is this space? Right -- so I'm only asking for the answer to this question in words: how would you describe the vectors onto which this matrix projects? Well, if you want to answer this question, the way you would tackle it is first by asking: what does a vector of the form H times something look like? What can I say about this vector that definitely tells me something about the space onto which it projects? I would need to know a little more to conclude that it projects exactly onto that space, but one way to start is just to see how H acts on a vector -- what applying H does to a vector.

So I take a v, and let's see what taking v and applying H to it looks like. Well, H is the identity minus something, so it takes v and removes something from v. What does it remove? Well, it's

Hv = v - (1/n) (v^T 1) 1,

right? Agreed? I just wrote v transpose 1 instead of 1 transpose v, which are the same thing. What is this quantity, (1/n) v^T 1? What should I call it in mathematical notation? v bar, right? I should call it v bar, because this is exactly the average of the entries of v, agreed? This sums the entries of v, and this divides by the number of entries.
Sorry -- v is in which space here? Let me check what my dimensions are.

AUDIENCE: v has to be in R^n.

PHILIPPE RIGOLLET: Oh, yeah. OK, thank you. So everywhere I wrote d, that was actually n. The thing that I can sandwich between X transpose and X has to be n by n -- what matters is the inner dimension of the product, not the outer one. So H is n by n, the all-ones vector is in R^n, and v is in R^n. And actually, I already used the fact that the all-ones vector is of size n when I said 1^T 1 equals n. Sorry about that. OK, and so that's indeed v bar, the average of the n entries of v.

So what is this projection doing to a vector? It's removing its average from each coordinate, right? And the effect of this is -- v is a vector; what is the average of the entries of Hv?

AUDIENCE: 0.

PHILIPPE RIGOLLET: Right, so it's 0. It's the average of v, which is v bar, minus the average of a vector whose every entry is v bar, which is v bar. So this thing is actually 0.

So let me repeat my question: onto what subspace does H project? Onto the subspace of vectors that have mean 0.
And if you want to talk more linear algebra: for a vector, having mean 0 means that v is orthogonal to the span of the all-ones vector. That's it -- H projects onto this space. So in words, it projects onto the space of vectors that have 0 mean. In linear algebra, it projects onto the hyperplane which is orthogonal to the all-ones vector, OK? So that's all.

Can you guys still see the screen? Are you good over there? OK.

All right, so now, what this means is that -- well, I'm doing this weird thing, right? S is taking X and then removing the mean from each of the columns of X: when I take H times X, I'm basically applying this projection, which consists of removing the mean of each of the columns. And then I multiply by X transpose on the left. But what's actually nice is that, remember, H is a projector, which means that when I look at X^T H X, it's the same as looking at X^T H^2 X. But since H is equal to its transpose, this is actually the same as looking at X^T H^T H X, which is the same as looking at (HX)^T (HX), OK? So what it's doing is: it first applies this projection matrix H, which removes the mean of each of your columns, and then looks at the inner products between those centered columns, right? Each entry of this guy is just the covariance between those centered things. That's all it's doing.

All right, so those are actually going to be the key statements. So everything we've done so far is really mainly linear algebra, right? I mean, looking at expectations and covariances, we just used the fact that the expectation is linear. We didn't do much.
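Here is a quick numpy sanity check of this centering picture, again on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d)) + 3.0      # made-up data with a nonzero mean

H = np.eye(n) - np.ones((n, n)) / n
assert np.allclose(H @ np.ones(n), 0)  # H kills the all-ones vector

Xc = H @ X                             # HX: every column now has mean 0
assert np.allclose(Xc.mean(axis=0), 0)

# S = (1/n) X^T H X = (1/n) (HX)^T (HX), matching numpy's 1/n-normalized covariance:
S = X.T @ H @ X / n
assert np.allclose(S, Xc.T @ Xc / n)
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
```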
But now there's a nice thing that's happening, and that's why we're going to switch from the language of linear algebra to something more statistical. What's happening is: let's look at this quadratic form. So I take sigma, and I take a vector, u, let's say in R^d, and I'm going to look at u^T sigma u, OK? What is this doing? Well, we know that sigma is the expectation of X X^T minus the expectation of X times the expectation of X transpose, right? So I just substitute in there. Now, u is deterministic, so in particular, I can push it inside the expectation here, agreed? And I can do the same from the right. So when I push u transpose in on the left and u in on the right, what I'm left with is the expectation of (u^T X)(X^T u). And I can do the same thing for the second guy, and this tells me that it's the expectation of u^T X times the expectation of X^T u. Of course, u^T X is equal to X^T u. So what it means is that this is actually equal to

u^T sigma u = E[(u^T X)^2] - (E[u^T X])^2.

But this is something that should look familiar. This is really just the variance of this particular random variable, u^T X, right? u transpose X is a number; it involves a random vector, so it's a random variable. And so it has a variance, and this variance is exactly given by this formula. So what we've proved is that

u^T sigma u = Var(u^T X), OK?

I can do the same thing for the sample variance. So let's do this. And, as you can see -- spoiler alert -- this is going to be the sample variance. OK, so remember,

S = (1/n) sum_{i=1}^n X_i X_i^T - X bar (X bar)^T.
800 00:48:12,100 --> 00:48:16,060 So when I do u transpose, Su, what 801 00:48:16,060 --> 00:48:19,400 it gives me is 1 over n sum from i equal 1 802 00:48:19,400 --> 00:48:25,780 to n of u transpose Xi times Xi transpose u, all right? 803 00:48:25,780 --> 00:48:27,880 So those are two numbers that multiply each other 804 00:48:27,880 --> 00:48:30,370 and that happen to be equal to each other, 805 00:48:30,370 --> 00:48:36,430 minus u transpose X bar X bar transpose u, 806 00:48:36,430 --> 00:48:38,770 which is also the product of two numbers that happen 807 00:48:38,770 --> 00:48:39,997 to be equal to each other. 808 00:48:39,997 --> 00:48:41,455 So I can rewrite this with squares. 809 00:48:55,120 --> 00:48:57,390 So we're almost there. 810 00:48:57,390 --> 00:49:00,360 All I need to check is that this thing is actually 811 00:49:00,360 --> 00:49:02,010 the average of those guys, right? 812 00:49:02,010 --> 00:49:04,530 So u transpose X bar. 813 00:49:04,530 --> 00:49:05,030 What is it? 814 00:49:05,030 --> 00:49:10,980 It's 1 over n sum from i equal 1 to n of u transpose Xi. 815 00:49:10,980 --> 00:49:17,050 So it's really something that I can write as u transpose X bar, 816 00:49:17,050 --> 00:49:17,550 right? 817 00:49:17,550 --> 00:49:19,383 That's the average of those random variables 818 00:49:19,383 --> 00:49:21,240 of the form, u transpose Xi. 819 00:49:23,880 --> 00:49:29,910 So what it means is that u transpose Su, I can write as 1 820 00:49:29,910 --> 00:49:38,060 over n sum from i equal 1 to n of u transpose Xi squared 821 00:49:38,060 --> 00:49:46,720 minus u transpose X bar squared, which 822 00:49:46,720 --> 00:49:51,660 is the empirical variance that we denoted by small 823 00:49:51,660 --> 00:49:54,600 s squared, right? 824 00:49:54,600 --> 00:50:06,850 So that's the empirical variance of u transpose X1 all the way 825 00:50:06,850 --> 00:50:08,209 to u transpose Xn. 826 00:50:12,430 --> 00:50:13,910 OK, and here, same thing. 827 00:50:13,910 --> 00:50:15,210 I use exactly the same thing. 828 00:50:15,210 --> 00:50:17,990 The only thing I use here is really 829 00:50:17,990 --> 00:50:20,790 the linearity of this guy, of the 1 over n sum, 830 00:50:20,790 --> 00:50:24,020 or the linearity of expectation-- that I can push things 831 00:50:24,020 --> 00:50:26,740 in there, OK? 832 00:50:30,224 --> 00:50:31,640 AUDIENCE: So what have you written 833 00:50:31,640 --> 00:50:33,844 at the end of that sum for uT Su? 834 00:50:33,844 --> 00:50:35,010 PHILIPPE RIGOLLET: This one? 835 00:50:35,010 --> 00:50:35,380 AUDIENCE: Yeah. 836 00:50:35,380 --> 00:50:37,290 PHILIPPE RIGOLLET: Yeah, I said it's equal to small s, 837 00:50:37,290 --> 00:50:39,430 and I want to make a distinction with the big S 838 00:50:39,430 --> 00:50:40,660 that I'm using here. 839 00:50:40,660 --> 00:50:42,650 So this is equal to small-- 840 00:50:42,650 --> 00:50:45,190 I don't know, I'm trying to make it look 841 00:50:45,190 --> 00:50:47,550 like a calligraphic s squared.
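The sample version can be checked exactly, not just by simulation. A small sketch on made-up data: it verifies that u transpose Su equals the empirical variance of the numbers u transpose X1, ..., u transpose Xn.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 500, 4
    X = rng.normal(size=(n, d))
    u = rng.normal(size=d)

    S = X.T @ X / n - np.outer(X.mean(0), X.mean(0))   # S = (1/n) sum Xi Xi^T - Xbar Xbar^T
    proj = X @ u                                       # the numbers u^T X_1, ..., u^T X_n
    assert np.allclose(u @ S @ u, proj.var())          # u^T S u is their empirical variance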
And the coordinates, well, 849 00:51:19,140 --> 00:51:22,530 by definition, are attached to a coordinate system. 850 00:51:22,530 --> 00:51:25,830 So I only know what the covariance 851 00:51:25,830 --> 00:51:30,570 of X along those two things is, or the covariance of those two 852 00:51:30,570 --> 00:51:31,320 coordinates is. 853 00:51:31,320 --> 00:51:33,570 But what if I want to find coordinates between linear 854 00:51:33,570 --> 00:51:35,076 combinations of the X's? 855 00:51:35,076 --> 00:51:37,200 Sorry, if I want to find covariances between linear 856 00:51:37,200 --> 00:51:38,566 combinations of those X's. 857 00:51:38,566 --> 00:51:40,440 And that's exactly what this allows me to do. 858 00:51:40,440 --> 00:51:44,640 It says, well, if I pre- and post-multiply by u, 859 00:51:44,640 --> 00:51:47,010 this is actually telling me what the variance 860 00:51:47,010 --> 00:51:51,950 of X along direction u is, OK? 861 00:51:51,950 --> 00:51:53,944 So there's a lot of information in there, 862 00:51:53,944 --> 00:51:55,610 and it's just really exploiting the fact 863 00:51:55,610 --> 00:52:00,600 that there is some linearity going on in the covariance. 864 00:52:00,600 --> 00:52:02,060 So, why variance? 865 00:52:02,060 --> 00:52:03,870 Why is variance interesting for us, right? 866 00:52:03,870 --> 00:52:04,370 Why? 867 00:52:04,370 --> 00:52:05,760 I started by saying, here, we're going 868 00:52:05,760 --> 00:52:07,050 to be interested in having some way 869 00:52:07,050 --> 00:52:08,151 to do dimension reduction. 870 00:52:08,151 --> 00:52:10,650 We have-- think of your points as being in a dimension 871 00:52:10,650 --> 00:52:13,990 larger than 4, and we're going to try to reduce the dimension. 872 00:52:13,990 --> 00:52:15,480 So let's just think for one second, 873 00:52:15,480 --> 00:52:19,320 what do we want about a dimension reduction procedure? 874 00:52:19,320 --> 00:52:23,427 If I have all my points that live in, say, three dimensions, 875 00:52:23,427 --> 00:52:25,260 and I have one point here and one point here 876 00:52:25,260 --> 00:52:28,020 and one point here and one point here and one point here, 877 00:52:28,020 --> 00:52:30,090 and I decide to project them onto some plane-- 878 00:52:30,090 --> 00:52:32,132 say I take a plane that's just like this-- what's 879 00:52:32,132 --> 00:52:34,673 going to happen is that those points are all going to project 880 00:52:34,673 --> 00:52:36,030 to the same point, right? 881 00:52:36,030 --> 00:52:38,070 I'm just going to not see anything. 882 00:52:38,070 --> 00:52:40,410 However, if I take a plane which is like this, 883 00:52:40,410 --> 00:52:42,932 they're all going to project into some nice line. 884 00:52:42,932 --> 00:52:44,640 Maybe I can even project them onto a line 885 00:52:44,640 --> 00:52:47,160 and they will still be far apart from each other. 886 00:52:47,160 --> 00:52:48,160 So that's what you want. 887 00:52:48,160 --> 00:52:51,930 You want to be able to say, when I take my points 888 00:52:51,930 --> 00:52:54,610 and I say I project them onto lower dimensions, 889 00:52:54,610 --> 00:52:57,270 I do not want them to collapse into one single point. 890 00:52:57,270 --> 00:53:00,540 I want them to be as spread out as possible in the direction 891 00:53:00,540 --> 00:53:02,251 on which I project. 892 00:53:02,251 --> 00:53:04,000 And this is what we're going to try to do. 893 00:53:04,000 --> 00:53:06,510 And of course, measuring spread between points 894 00:53:06,510 --> 00:53:08,160 can be done in many ways, right?
895 00:53:08,160 --> 00:53:09,960 I mean, you could look at, I don't know, 896 00:53:09,960 --> 00:53:12,900 sum of pairwise distances between those guys. 897 00:53:12,900 --> 00:53:14,790 You could look at some sort of energy. 898 00:53:14,790 --> 00:53:16,380 You can look at many ways to measure 899 00:53:16,380 --> 00:53:18,199 spread in a direction. 900 00:53:18,199 --> 00:53:19,740 But variance is a good way to measure 901 00:53:19,740 --> 00:53:21,150 spread between points. 902 00:53:21,150 --> 00:53:23,727 If you have a lot of variance between your points, 903 00:53:23,727 --> 00:53:25,560 then chances are they're going to be spread. 904 00:53:25,560 --> 00:53:27,720 Now, this is not always the case, right? 905 00:53:27,720 --> 00:53:30,480 If I have a direction in which all my points are clumped 906 00:53:30,480 --> 00:53:33,234 onto one big point and one other big point, 907 00:53:33,234 --> 00:53:34,900 it's going to choose this because that's 908 00:53:34,900 --> 00:53:37,180 the direction that has a lot of variance. 909 00:53:37,180 --> 00:53:39,030 But hopefully, the variance is going 910 00:53:39,030 --> 00:53:41,560 to spread things out nicely. 911 00:53:41,560 --> 00:53:47,730 So the idea of principal component analysis 912 00:53:47,730 --> 00:53:51,330 is going to try to identify those variances-- 913 00:53:51,330 --> 00:53:55,740 those directions along which we have a lot of variance. 914 00:53:55,740 --> 00:53:57,870 Reciprocally, we're going to try to eliminate 915 00:53:57,870 --> 00:54:01,890 the directions along which we do not have a lot of variance, OK? 916 00:54:01,890 --> 00:54:02,640 And let's see why. 917 00:54:02,640 --> 00:54:08,130 Well, if-- so here's the first claim. 918 00:54:08,130 --> 00:54:14,000 If u transpose Su is equal to 0, what's happening? 919 00:54:14,000 --> 00:54:17,159 Well, I know that an empirical variance is equal to 0. 920 00:54:17,159 --> 00:54:18,950 What does it mean for an empirical variance 921 00:54:18,950 --> 00:54:22,056 to be equal to 0? 922 00:54:22,056 --> 00:54:23,680 So I give you a bunch of points, right? 923 00:54:23,680 --> 00:54:26,420 So those points are those points-- u transpose 924 00:54:26,420 --> 00:54:29,090 X1, u transpose-- those are a bunch of numbers. 925 00:54:29,090 --> 00:54:31,090 What does it mean to have the empirical variance 926 00:54:31,090 --> 00:54:33,279 of those points being equal to 0? 927 00:54:33,279 --> 00:54:34,570 AUDIENCE: They're all the same. 928 00:54:34,570 --> 00:54:36,590 PHILIPPE RIGOLLET: They're all the same. 929 00:54:36,590 --> 00:54:43,680 So what it means is that when I have my points, right? 930 00:54:43,680 --> 00:54:46,470 So, can you find a direction for those points in which they 931 00:54:46,470 --> 00:54:48,850 project to all the same point? 932 00:54:51,400 --> 00:54:52,360 No, right? 933 00:54:52,360 --> 00:54:53,590 There's no such thing. 934 00:54:53,590 --> 00:54:55,870 For this to happen, you have to have your points which 935 00:54:55,870 --> 00:54:57,849 are perfectly aligned. 936 00:54:57,849 --> 00:54:59,390 And then when you're going to project 937 00:54:59,390 --> 00:55:01,830 onto the orthogonal of this guy, they're 938 00:55:01,830 --> 00:55:03,690 going to all project to the same point 939 00:55:03,690 --> 00:55:06,450 here, which means that the empirical variance is 940 00:55:06,450 --> 00:55:08,790 going to be 0. 941 00:55:08,790 --> 00:55:10,270 Now, this is an extreme case.
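That extreme case is easy to reproduce numerically. In this made-up example, nine points sit exactly on a line through the origin, and their projections onto a direction orthogonal to that line have empirical variance 0 (up to floating-point round-off).

    import numpy as np

    t = np.linspace(-2.0, 2.0, 9)
    pts = np.outer(t, [3.0, 1.0])        # nine points, perfectly aligned along (3, 1)
    u = np.array([1.0, -3.0])
    u /= np.linalg.norm(u)               # a unit vector orthogonal to (3, 1)
    assert np.allclose((pts @ u).var(), 0.0)   # every point projects to the same spot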
942 00:55:10,270 --> 00:55:11,760 This will never happen in practice, 943 00:55:11,760 --> 00:55:13,840 because if that happens, well, I mean, 944 00:55:13,840 --> 00:55:16,850 you can basically figure that out very quickly. 945 00:55:16,850 --> 00:55:21,520 So in the same way, it's very unlikely 946 00:55:21,520 --> 00:55:23,710 that you're going to have u transpose sigma u, which 947 00:55:23,710 --> 00:55:26,230 is equal to 0, which means that, essentially, all 948 00:55:26,230 --> 00:55:28,510 your points are [INAUDIBLE] or let's say all of them 949 00:55:28,510 --> 00:55:30,069 are orthogonal to u, right? 950 00:55:30,069 --> 00:55:31,360 So it's exactly the same thing. 951 00:55:31,360 --> 00:55:33,330 It just says that in the population case, 952 00:55:33,330 --> 00:55:36,960 there's no probability that your points deviate from this guy 953 00:55:36,960 --> 00:55:37,510 here. 954 00:55:37,510 --> 00:55:41,142 This happens with zero probability, OK? 955 00:55:41,142 --> 00:55:42,600 And that's just because if you look 956 00:55:42,600 --> 00:55:46,690 at the variance of this guy, it's going to be 0. 957 00:55:46,690 --> 00:55:48,910 And then that means that there's no deviation. 958 00:55:48,910 --> 00:55:51,430 By the way, I'm using the name projection 959 00:55:51,430 --> 00:55:55,510 when I talk about u transpose X, right? 960 00:55:55,510 --> 00:55:59,170 So let's just be clear about this. 961 00:55:59,170 --> 00:56:04,090 If you-- so let's say I have a bunch of points, 962 00:56:04,090 --> 00:56:06,050 and u is a vector in this direction. 963 00:56:06,050 --> 00:56:08,650 And let's say that u has the-- 964 00:56:08,650 --> 00:56:10,120 so this is 0. 965 00:56:10,120 --> 00:56:10,720 This is u. 966 00:56:10,720 --> 00:56:17,560 And let's say that u has norm, 1, OK? 967 00:56:17,560 --> 00:56:21,140 When I look, what is the coordinate of the projection? 968 00:56:21,140 --> 00:56:23,860 So what is the length of this guy here? 969 00:56:23,860 --> 00:56:25,569 Let's call this guy X1. 970 00:56:25,569 --> 00:56:26,860 What is the length of this guy? 971 00:56:31,150 --> 00:56:32,330 In terms of inner products? 972 00:56:35,990 --> 00:56:39,678 This is exactly u transpose X1. 973 00:56:39,678 --> 00:56:42,730 This length here, if this is X2, this 974 00:56:42,730 --> 00:56:46,580 is exactly u transpose X2, OK? 975 00:56:46,580 --> 00:56:52,430 So those-- u transpose X measure exactly the distance 976 00:56:52,430 --> 00:56:55,700 to the origin of those-- 977 00:56:55,700 --> 00:56:58,310 I mean, it's really-- 978 00:56:58,310 --> 00:57:00,887 think of it as being just an x-axis thing. 979 00:57:00,887 --> 00:57:02,220 You just have a bunch of points. 980 00:57:02,220 --> 00:57:02,960 You have an origin. 981 00:57:02,960 --> 00:57:04,520 And it's really just telling you what 982 00:57:04,520 --> 00:57:07,670 the coordinate on this axis is going to be, right? 983 00:57:07,670 --> 00:57:10,820 So in particular, if the empirical variance is 0, 984 00:57:10,820 --> 00:57:12,470 it means that all these points project 985 00:57:12,470 --> 00:57:14,840 to the same point, which means that they have 986 00:57:14,840 --> 00:57:16,912 to be orthogonal to this guy. 987 00:57:16,912 --> 00:57:19,370 And you can think of it as being also maybe an entire plane 988 00:57:19,370 --> 00:57:23,990 that's orthogonal to this line, OK? 
989 00:57:23,990 --> 00:57:26,590 So that's why I talk about projection, 990 00:57:26,590 --> 00:57:29,560 because the inner product, u transpose X, 991 00:57:29,560 --> 00:57:36,220 is really measuring the coordinates of X 992 00:57:36,220 --> 00:57:39,410 when u becomes the x-axis. 993 00:57:39,410 --> 00:57:42,820 Now, if u does not have norm 1, then you just 994 00:57:42,820 --> 00:57:44,365 have a change of scale here. 995 00:57:44,365 --> 00:57:46,790 You just have a change of unit, right? 996 00:57:46,790 --> 00:57:51,560 So this is really u transpose X1. 997 00:57:51,560 --> 00:57:54,044 The coordinates should really be divided by the norm of u. 998 00:57:59,150 --> 00:58:04,970 OK, so now, just in the same way-- so 999 00:58:04,970 --> 00:58:07,160 we're never going to have exactly 0. 1000 00:58:07,160 --> 00:58:08,810 But if we look at the other end, 1001 00:58:08,810 --> 00:58:12,050 if u transpose Su is large, what does it mean? 1002 00:58:14,990 --> 00:58:17,740 It means that when I look at my points 1003 00:58:17,740 --> 00:58:22,194 as projected onto the axis generated by u, 1004 00:58:22,194 --> 00:58:23,860 they're going to have a lot of variance. 1005 00:58:23,860 --> 00:58:25,930 They're going to be far away from each other on average, 1006 00:58:25,930 --> 00:58:26,430 right? 1007 00:58:26,430 --> 00:58:28,900 That's what large variance means, or at least 1008 00:58:28,900 --> 00:58:31,310 large empirical variance means. 1009 00:58:31,310 --> 00:58:34,690 And same thing for u. 1010 00:58:34,690 --> 00:58:36,130 So what we're going to try to find 1011 00:58:36,130 --> 00:58:39,870 is a u that maximizes this. 1012 00:58:39,870 --> 00:58:42,230 If I can find a u that maximizes this 1013 00:58:42,230 --> 00:58:44,790 so I can look in every direction, 1014 00:58:44,790 --> 00:58:48,320 and suddenly I find a direction in which the spread is massive, 1015 00:58:48,320 --> 00:58:50,070 then that's a direction along which I'm basically 1016 00:58:50,070 --> 00:58:52,260 the least likely to have my points 1017 00:58:52,260 --> 00:58:54,824 project onto each other and collide, right? 1018 00:58:54,824 --> 00:58:56,490 At least I know they're going to project 1019 00:58:56,490 --> 00:58:59,710 at least onto two points. 1020 00:58:59,710 --> 00:59:02,290 So the idea now is to say, OK, let's try 1021 00:59:02,290 --> 00:59:04,630 to maximize this spread, right? 1022 00:59:04,630 --> 00:59:09,130 So we're going to try to find the maximum over all u's 1023 00:59:09,130 --> 00:59:12,886 of u transpose Su. 1024 00:59:12,886 --> 00:59:15,010 And that's going to be the direction that maximizes 1025 00:59:15,010 --> 00:59:15,968 the empirical variance. 1026 00:59:15,968 --> 00:59:22,075 Now of course, if I read it like that for all u's in Rd, 1027 00:59:22,075 --> 00:59:23,666 what is the value of this maximum? 1028 00:59:28,060 --> 00:59:29,220 It's infinity, right? 1029 00:59:29,220 --> 00:59:32,160 Because I can always multiply u by 10, 1030 00:59:32,160 --> 00:59:34,662 and this entire thing is going to be multiplied by 100. 1031 00:59:34,662 --> 00:59:36,620 So I'm just going to take u as large as I want, 1032 00:59:36,620 --> 00:59:38,661 and this thing is going to be as large as I want, 1033 00:59:38,661 --> 00:59:40,050 and so I need to constrain u. 1034 00:59:40,050 --> 00:59:42,840 And as I said, I need to have u of size 1 1035 00:59:42,840 --> 00:59:45,990 to talk about coordinates in the system generated 1036 00:59:45,990 --> 00:59:47,340 by u like this.
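A small numerical illustration of this projection picture, with made-up vectors: for a unit-norm u, the inner product u transpose x is the signed coordinate of x along u, and for a non-unit vector you divide by its norm to recover the coordinate.

    import numpy as np

    u = np.array([1.0, 1.0])
    u /= np.linalg.norm(u)               # a direction: norm exactly 1
    x = np.array([2.0, 0.0])

    coord = u @ x                        # u^T x: the coordinate of x along u
    print(coord)                         # sqrt(2), the signed length of the projection
    print(coord * u)                     # the projected point itself, on the line through u

    v = 10 * u                           # if the vector does not have norm 1 ...
    print((v @ x) / np.linalg.norm(v))   # ... divide by its norm to get the coordinate back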
1037 00:59:47,340 --> 00:59:50,730 So I'm just going to constrain u to have 1038 00:59:50,730 --> 00:59:55,467 Euclidean norm equal to 1, OK? 1039 00:59:55,467 --> 00:59:57,050 So that's going to be my goal-- trying 1040 00:59:57,050 --> 01:00:01,100 to find the largest possible u transpose Su, 1041 01:00:01,100 --> 01:00:03,680 or in other words, empirical variance of the points 1042 01:00:03,680 --> 01:00:07,520 projected onto the direction u when u is of norm 1, 1043 01:00:07,520 --> 01:00:11,039 which justifies using the word, "direction," 1044 01:00:11,039 --> 01:00:12,830 because there's no magnitude to this u. 1045 01:00:17,770 --> 01:00:22,410 OK, so how am I going to do this? 1046 01:00:22,410 --> 01:00:25,230 I could just fold and say, let's just optimize 1047 01:00:25,230 --> 01:00:26,700 this thing, right? 1048 01:00:26,700 --> 01:00:28,540 Let's just take this problem. 1049 01:00:28,540 --> 01:00:32,250 It says maximize a function under some constraints. 1050 01:00:32,250 --> 01:00:34,125 Immediately, the constraint is sort of nasty. 1051 01:00:34,125 --> 01:00:37,212 I'm on a sphere, and I'm trying to move points on the sphere. 1052 01:00:37,212 --> 01:00:38,670 And I'm maximizing this thing which 1053 01:00:38,670 --> 01:00:40,182 actually happens to be convex. 1054 01:00:40,182 --> 01:00:42,390 And we know how to minimize convex functions, 1055 01:00:42,390 --> 01:00:45,280 but maximizing them is a different question. 1056 01:00:45,280 --> 01:00:47,340 And so this problem might be super hard. 1057 01:00:47,340 --> 01:00:49,020 So I can just say, OK, here's what 1058 01:00:49,020 --> 01:00:52,950 I want to do, and let me give that to an optimizer 1059 01:00:52,950 --> 01:00:56,010 and just hope that the optimizer can solve this problem for me. 1060 01:00:56,010 --> 01:00:57,630 That's one thing we can do. 1061 01:00:57,630 --> 01:01:00,092 Now as you can imagine, PCA is so widespread, right? 1062 01:01:00,092 --> 01:01:01,800 Principal component analysis is something 1063 01:01:01,800 --> 01:01:03,700 that people do constantly. 1064 01:01:03,700 --> 01:01:06,190 And so that means that we know how to do this fast. 1065 01:01:06,190 --> 01:01:07,600 So that's one thing. 1066 01:01:07,600 --> 01:01:10,740 The other thing that you should probably question is why-- 1067 01:01:10,740 --> 01:01:13,110 if this thing is actually difficult, why in the world 1068 01:01:13,110 --> 01:01:16,200 would you even choose the variance as a measure of spread 1069 01:01:16,200 --> 01:01:19,020 if there are so many measures of spread, right? 1070 01:01:19,020 --> 01:01:21,222 The variance is one measure of spread. 1071 01:01:21,222 --> 01:01:22,680 It's not guaranteed that everything 1072 01:01:22,680 --> 01:01:26,366 is going to project nicely far apart from each other. 1073 01:01:26,366 --> 01:01:27,990 So we could choose the variance, but we 1074 01:01:27,990 --> 01:01:28,800 could choose something else. 1075 01:01:28,800 --> 01:01:30,990 If the variance does not help, why choose it? 1076 01:01:30,990 --> 01:01:32,520 Turns out the variance helps. 1077 01:01:32,520 --> 01:01:35,555 So this is indeed a non-convex problem. 1078 01:01:35,555 --> 01:01:38,340 I'm maximizing, so it's actually the same. 1079 01:01:38,340 --> 01:01:41,850 I can make this constraint convex 1080 01:01:41,850 --> 01:01:43,920 because I'm maximizing a convex function, 1081 01:01:43,920 --> 01:01:45,720 so it's clear that the maximum is going 1082 01:01:45,720 --> 01:01:47,220 to be attained at the boundary.
1083 01:01:47,220 --> 01:01:51,540 So I can actually just relax this sphere into some convex ball. 1084 01:01:51,540 --> 01:01:53,430 However, I'm still maximizing, so this 1085 01:01:53,430 --> 01:01:55,170 is a non-convex problem. 1086 01:01:55,170 --> 01:01:57,550 And this turns out to be the fanciest non-convex problem 1087 01:01:57,550 --> 01:01:59,001 we know how to solve. 1088 01:01:59,001 --> 01:02:00,750 And the reason why we know how to solve it 1089 01:02:00,750 --> 01:02:04,410 is not because of optimization or using gradient-type things 1090 01:02:04,410 --> 01:02:06,690 or any of the algorithms that I mentioned 1091 01:02:06,690 --> 01:02:09,350 during the maximum likelihood. 1092 01:02:09,350 --> 01:02:11,000 It's because of linear algebra. 1093 01:02:11,000 --> 01:02:13,980 Linear algebra guarantees that we know how to solve this. 1094 01:02:13,980 --> 01:02:17,885 And to understand this, we need to go a little deeper 1095 01:02:17,885 --> 01:02:22,360 into linear algebra, and we need to understand the concept 1096 01:02:22,360 --> 01:02:24,590 of diagonalization of a matrix. 1097 01:02:24,590 --> 01:02:29,850 So who has ever seen the concept of an eigenvalue? 1098 01:02:29,850 --> 01:02:30,790 Oh, that's beautiful. 1099 01:02:30,790 --> 01:02:31,880 And if you're not raising your hand, 1100 01:02:31,880 --> 01:02:33,588 you're just playing "Candy Crush," right? 1101 01:02:33,588 --> 01:02:35,930 All right, so, OK. 1102 01:02:44,930 --> 01:02:46,640 This is great. 1103 01:02:46,640 --> 01:02:48,160 Everybody's seen it. 1104 01:02:48,160 --> 01:02:51,230 For my live audience of millions, maybe you have not, 1105 01:02:51,230 --> 01:02:53,600 so I will still go through it. 1106 01:02:53,600 --> 01:02:58,840 All right, so one of the basic facts-- 1107 01:02:58,840 --> 01:03:02,490 and I remember when I learned this in-- 1108 01:03:02,490 --> 01:03:04,090 I mean, when I was an undergrad, I 1109 01:03:04,090 --> 01:03:05,860 learned about the spectral decomposition 1110 01:03:05,860 --> 01:03:07,450 and this diagonalization of matrices. 1111 01:03:07,450 --> 01:03:09,070 And for me, it was just a structural property 1112 01:03:09,070 --> 01:03:11,445 of matrices, but it turns out that it's extremely useful, 1113 01:03:11,445 --> 01:03:13,294 and it's useful for algorithmic purposes. 1114 01:03:13,294 --> 01:03:14,710 And so what this theorem tells you 1115 01:03:14,710 --> 01:03:16,765 is that if you take a symmetric matrix-- 1116 01:03:22,860 --> 01:03:24,340 well, with real entries, but that 1117 01:03:24,340 --> 01:03:28,220 really does not matter so much. 1118 01:03:28,220 --> 01:03:30,730 And here, I'm going to actually-- 1119 01:03:30,730 --> 01:03:33,200 so I take a symmetric matrix, and actually S and sigma 1120 01:03:33,200 --> 01:03:36,190 are two such symmetric matrices, right? 1121 01:03:36,190 --> 01:03:44,500 Then there exist P and D, which are both-- 1122 01:03:44,500 --> 01:03:47,000 so let's say d by d-- 1123 01:03:47,000 --> 01:03:55,960 which are both d by d, such that P is orthogonal. 1124 01:03:58,960 --> 01:04:02,420 That means that P transpose P is equal to PP transpose 1125 01:04:02,420 --> 01:04:06,360 is equal to the identity. 1126 01:04:06,360 --> 01:04:07,630 And D is diagonal. 1127 01:04:11,840 --> 01:04:20,130 And sigma, let's say, is equal to PDP transpose, OK? 1128 01:04:20,130 --> 01:04:22,080 So it's a diagonalization because it's 1129 01:04:22,080 --> 01:04:23,970 finding a nice transformation.
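The theorem is easy to see numerically. The sketch below, on an arbitrary made-up symmetric matrix, uses numpy's eigh (one standard routine for this decomposition) to produce an orthogonal P and a diagonal D with sigma = PDP transpose.

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.normal(size=(5, 5))
    Sigma = (A + A.T) / 2                     # any real symmetric matrix

    lam, P = np.linalg.eigh(Sigma)            # eigenvalues (ascending) and orthogonal P
    D = np.diag(lam)
    assert np.allclose(P.T @ P, np.eye(5))    # P^T P = I: P is orthogonal
    assert np.allclose(P @ P.T, np.eye(5))    # P P^T = I as well
    assert np.allclose(Sigma, P @ D @ P.T)    # Sigma = P D P^T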
1130 01:04:23,970 --> 01:04:25,260 P has some nice properties. 1131 01:04:25,260 --> 01:04:28,050 It's really just a change of coordinates in which 1132 01:04:28,050 --> 01:04:31,044 your matrix is diagonal, right? 1133 01:04:31,044 --> 01:04:32,460 And the way you want to see this-- 1134 01:04:32,460 --> 01:04:35,610 and I think it sort of helps to think about this problem 1135 01:04:35,610 --> 01:04:36,720 as being-- 1136 01:04:36,720 --> 01:04:38,276 sigma being a covariance matrix. 1137 01:04:38,276 --> 01:04:39,900 What does a covariance matrix tell you? 1138 01:04:39,900 --> 01:04:41,490 Think of a multivariate Gaussian. 1139 01:04:41,490 --> 01:04:43,660 Can everybody visualize a three-dimensional Gaussian 1140 01:04:43,660 --> 01:04:45,150 density? 1141 01:04:45,150 --> 01:04:48,200 Right, so it's going to be some sort of a bell-shaped curve, 1142 01:04:48,200 --> 01:04:51,870 but it might be more elongated in one direction than another. 1143 01:04:51,870 --> 01:04:54,310 And then I'm going to chop it like that, all right? 1144 01:04:54,310 --> 01:04:56,120 So I'm going to chop it off. 1145 01:04:56,120 --> 01:05:00,070 And I'm going to look at how it bleeds, all right? 1146 01:05:00,070 --> 01:05:02,287 So I'm just going to look at where the blood is. 1147 01:05:02,287 --> 01:05:03,620 And what it's going to look like-- 1148 01:05:03,620 --> 01:05:08,720 it's going to look like some sort of ellipsoid, right? 1149 01:05:08,720 --> 01:05:11,652 In high dimension, it's just going to be an olive. 1150 01:05:11,652 --> 01:05:13,610 And that is just going to be bigger and bigger. 1151 01:05:13,610 --> 01:05:16,460 And then I chop it off a little lower, 1152 01:05:16,460 --> 01:05:20,150 and I get something a little bigger like this. 1153 01:05:20,150 --> 01:05:23,070 And so it turns out that sigma is capturing exactly this, 1154 01:05:23,070 --> 01:05:23,570 right? 1155 01:05:23,570 --> 01:05:27,320 The matrix sigma-- so the mean of your Gaussian 1156 01:05:27,320 --> 01:05:29,240 is going to be the center of this thing. 1157 01:05:29,240 --> 01:05:33,690 And sigma is going to tell you which direction it's elongated. 1158 01:05:33,690 --> 01:05:36,140 And so in particular, if you know an ellipse, 1159 01:05:36,140 --> 01:05:38,160 you know there's something called a principal axis, right? 1160 01:05:38,160 --> 01:05:39,743 So you could actually define something 1161 01:05:39,743 --> 01:05:43,190 that looks like this, which is this axis, the one along which 1162 01:05:43,190 --> 01:05:44,390 it's the most elongated. 1163 01:05:44,390 --> 01:05:47,345 Then the axis which is orthogonal to it, 1164 01:05:47,345 --> 01:05:49,370 along which it's slightly less elongated, 1165 01:05:49,370 --> 01:05:52,880 and you go again and again along the orthogonal ones. 1166 01:05:52,880 --> 01:05:56,500 It turns out that those things here 1167 01:05:56,500 --> 01:05:59,620 are the new coordinate system that this transformation, P 1168 01:05:59,620 --> 01:06:03,190 and P transpose, is putting you into. 1169 01:06:03,190 --> 01:06:06,390 And D has entries on the diagonal 1170 01:06:06,390 --> 01:06:09,979 which are exactly this length and this length, right? 1171 01:06:09,979 --> 01:06:11,270 So that's just what it's doing. 1172 01:06:11,270 --> 01:06:12,920 It's just telling you, well, if you 1173 01:06:12,920 --> 01:06:16,760 think of having this Gaussian or this high-dimensional 1174 01:06:16,760 --> 01:06:19,990 ellipsoid, it's elongated along certain directions.
1175 01:06:19,990 --> 01:06:23,020 And these directions are actually maybe not well aligned 1176 01:06:23,020 --> 01:06:25,270 with your original coordinate system, which might just 1177 01:06:25,270 --> 01:06:27,430 be the usual one, right-- 1178 01:06:27,430 --> 01:06:29,740 north-south and east-west. 1179 01:06:29,740 --> 01:06:30,800 Maybe I need to turn it. 1180 01:06:30,800 --> 01:06:33,174 And that's exactly what this orthogonal transformation is 1181 01:06:33,174 --> 01:06:36,820 doing for you, all right? 1182 01:06:36,820 --> 01:06:39,627 So, in a way, this is actually telling you even more. 1183 01:06:39,627 --> 01:06:41,710 It's telling you that for any matrix that's symmetric, 1184 01:06:41,710 --> 01:06:45,190 you can actually turn it somewhere, 1185 01:06:45,190 --> 01:06:47,530 dilate things in the directions 1186 01:06:47,530 --> 01:06:49,060 that you have, and then turn it back 1187 01:06:49,060 --> 01:06:50,800 to what you originally had. 1188 01:06:50,800 --> 01:06:53,110 And that's actually exactly the effect 1189 01:06:53,110 --> 01:06:57,180 of applying a symmetric matrix to a vector, right? 1190 01:06:57,180 --> 01:06:58,920 And it's pretty impressive. 1191 01:06:58,920 --> 01:07:04,650 It says if I take sigma times v-- any sigma that's 1192 01:07:04,650 --> 01:07:07,560 of this form, which is symmetric. 1193 01:07:07,560 --> 01:07:09,360 What I'm really doing to v is I'm 1194 01:07:09,360 --> 01:07:12,150 changing its coordinate system, so I'm rotating it. 1195 01:07:12,150 --> 01:07:14,970 Then I'm changing-- I'm multiplying its coordinates, 1196 01:07:14,970 --> 01:07:16,956 and then I'm rotating it back. 1197 01:07:16,956 --> 01:07:18,330 That's all it's doing, and that's 1198 01:07:18,330 --> 01:07:21,550 what all symmetric matrices do, which 1199 01:07:21,550 --> 01:07:24,070 means that this is doing a lot. 1200 01:07:24,070 --> 01:07:27,130 All right, so OK. 1201 01:07:27,130 --> 01:07:29,237 So, what do I know? 1202 01:07:29,237 --> 01:07:30,820 So I'm not going to prove this-- 1203 01:07:30,820 --> 01:07:32,140 it's the so-called spectral theorem. 1204 01:07:39,270 --> 01:07:45,850 And the diagonal entries of D are of the form lambda 1, 1205 01:07:45,850 --> 01:07:49,980 lambda 2, up to lambda d, with 0's off the diagonal. 1206 01:07:49,980 --> 01:08:01,800 And the lambda j's are called eigenvalues of sigma. 1207 01:08:01,800 --> 01:08:05,170 Now in general, those numbers can be positive, negative, 1208 01:08:05,170 --> 01:08:06,660 or equal to 0. 1209 01:08:06,660 --> 01:08:12,000 But here, I know that sigma and S are-- 1210 01:08:12,000 --> 01:08:15,290 well, they're symmetric for sure, 1211 01:08:15,290 --> 01:08:17,467 but they are positive semidefinite. 1212 01:08:23,939 --> 01:08:25,840 What does it mean? 1213 01:08:25,840 --> 01:08:30,930 It means that when I take u transpose sigma u for example, 1214 01:08:30,930 --> 01:08:33,192 this number is always non-negative. 1215 01:08:35,910 --> 01:08:36,720 Why is this true? 1216 01:08:42,770 --> 01:08:43,609 What is this number? 1217 01:08:47,670 --> 01:08:49,850 It's the variance of-- and actually, I don't even 1218 01:08:49,850 --> 01:08:51,229 need to finish this sentence. 1219 01:08:51,229 --> 01:08:53,957 As soon as I say that this is a variance, well, 1220 01:08:53,957 --> 01:08:55,040 it has to be non-negative. 1221 01:08:55,040 --> 01:08:57,990 We know that a variance is non-negative. 1222 01:08:57,990 --> 01:09:00,532 And so, that's also a nice way you can use that.
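The rotate-stretch-rotate description can be replayed step by step in code. In this made-up example, applying P transpose, then the diagonal scaling, then P gives exactly the same result as applying sigma directly.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.normal(size=(4, 4))
    Sigma = (A + A.T) / 2                 # a symmetric matrix
    lam, P = np.linalg.eigh(Sigma)

    v = rng.normal(size=4)                # any vector
    w = P.T @ v                           # step 1: rotate into the eigen-coordinate system
    w = lam * w                           # step 2: stretch each coordinate by lambda_j
    out = P @ w                           # step 3: rotate back
    assert np.allclose(out, Sigma @ v)    # same as applying Sigma directly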
1223 01:09:00,532 --> 01:09:02,240 So it's just to say, well, OK, this thing 1224 01:09:02,240 --> 01:09:04,680 is positive semidefinite because it's a covariance matrix. 1225 01:09:04,680 --> 01:09:06,920 So I know it's a variance, OK? 1226 01:09:06,920 --> 01:09:08,779 So I get this. 1227 01:09:08,779 --> 01:09:10,560 Now, if I had some negative numbers-- 1228 01:09:10,560 --> 01:09:15,350 so the effect of that is that when I draw this picture, 1229 01:09:15,350 --> 01:09:19,040 those axes are always positive, which is kind of a weird thing 1230 01:09:19,040 --> 01:09:19,950 to say. 1231 01:09:19,950 --> 01:09:23,840 But what it means is that when I take a vector, v, I rotate it, 1232 01:09:23,840 --> 01:09:28,250 and then I stretch it in the directions of the coordinates, 1233 01:09:28,250 --> 01:09:30,260 I cannot flip it. 1234 01:09:30,260 --> 01:09:34,260 I can only stretch or shrink, but I cannot flip its sign, 1235 01:09:34,260 --> 01:09:34,760 all right? 1236 01:09:34,760 --> 01:09:37,370 But in general, for any symmetric matrices, 1237 01:09:37,370 --> 01:09:38,840 I could do this. 1238 01:09:38,840 --> 01:09:40,910 But when it's positive semidefinite, 1239 01:09:40,910 --> 01:09:43,020 actually what turns out is that all the lambda 1240 01:09:43,020 --> 01:09:48,350 j's are non-negative. 1241 01:09:48,350 --> 01:09:51,370 I cannot flip it, OK? 1242 01:09:51,370 --> 01:09:53,778 So all the eigenvalues are non-negative. 1243 01:09:56,590 --> 01:09:58,469 That's a property of positive semidefinite matrices. 1244 01:09:58,469 --> 01:10:00,510 So when it's symmetric, you have the eigenvalues. 1245 01:10:00,510 --> 01:10:01,670 They can be any number. 1246 01:10:01,670 --> 01:10:03,780 And when it's positive semidefinite, in particular 1247 01:10:03,780 --> 01:10:05,220 that's the case of the covariance matrix 1248 01:10:05,220 --> 01:10:07,110 and the empirical covariance matrix, right? 1249 01:10:07,110 --> 01:10:08,940 Because u transpose times the empirical covariance matrix 1250 01:10:08,940 --> 01:10:12,150 times u is an empirical variance, which itself is non-negative. 1251 01:10:12,150 --> 01:10:17,900 And so I get that the eigenvalues are non-negative. 1252 01:10:17,900 --> 01:10:23,030 All right, so principal component analysis is saying, 1253 01:10:23,030 --> 01:10:32,370 OK, I want to find the direction, u, 1254 01:10:32,370 --> 01:10:38,830 that maximizes u transpose Su, all right? 1255 01:10:38,830 --> 01:10:40,420 I've just introduced in one slide 1256 01:10:40,420 --> 01:10:41,690 something about eigenvalues. 1257 01:10:41,690 --> 01:10:44,740 So hopefully, they should help. 1258 01:10:44,740 --> 01:10:47,560 So what is it that I'm going to be getting? 1259 01:10:47,560 --> 01:10:51,446 Well, let's just see what happens. 1260 01:10:51,446 --> 01:10:53,570 Oh, I forgot to mention that-- and I will use this. 1261 01:10:53,570 --> 01:10:56,020 So the lambda j's come with eigenvectors. 1262 01:10:56,020 --> 01:11:08,690 And then the matrix, P, has columns v1 to vd, OK? 1263 01:11:08,690 --> 01:11:13,370 The fact that it's orthogonal-- that P transpose P is equal 1264 01:11:13,370 --> 01:11:15,470 to the identity-- 1265 01:11:15,470 --> 01:11:20,810 means that those guys satisfy that vi transpose 1266 01:11:20,810 --> 01:11:27,485 vj is equal to 0 if i is different from j.
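A short check on made-up data that an empirical covariance matrix has non-negative eigenvalues and orthonormal eigenvectors, as just claimed.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(100, 4))
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(X)                     # an empirical covariance matrix, hence PSD

    lam, P = np.linalg.eigh(S)
    assert (lam >= -1e-12).all()               # all eigenvalues >= 0, up to round-off
    assert np.allclose(P.T @ P, np.eye(4))     # the columns v1, ..., vd are orthonormal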
1267 01:11:27,485 --> 01:11:31,040 And vi transpose vi is actually equal to 1, 1268 01:11:31,040 --> 01:11:33,920 right, because the entries of P transpose P 1269 01:11:33,920 --> 01:11:38,990 are exactly going to be of the form, vi transpose vj, OK? 1270 01:11:38,990 --> 01:11:40,890 So those v's are called eigenvectors. 1271 01:11:46,000 --> 01:11:52,020 And v1 is attached to lambda 1, and v2 is attached to lambda 2, 1272 01:11:52,020 --> 01:11:53,180 OK? 1273 01:11:53,180 --> 01:11:56,280 So let's see what's happening with those things. 1274 01:11:56,280 --> 01:11:58,045 What happens if I take sigma-- 1275 01:11:58,045 --> 01:12:00,170 so if you know eigenvalues, you know exactly what's 1276 01:12:00,170 --> 01:12:01,580 going to happen. 1277 01:12:01,580 --> 01:12:06,920 If I look at, say, sigma times v1, well, what is sigma? 1278 01:12:06,920 --> 01:12:15,440 We know that sigma is PDP transpose, so this is PDP transpose v1. 1279 01:12:15,440 --> 01:12:17,420 What is P transpose times v1? 1280 01:12:17,420 --> 01:12:21,560 Well, P transpose has rows v1 transpose, 1281 01:12:21,560 --> 01:12:26,850 v2 transpose, all the way to vd transpose. 1282 01:12:26,850 --> 01:12:30,910 So when I multiply this by v1, what 1283 01:12:30,910 --> 01:12:32,820 I'm left with is the first coordinate 1284 01:12:32,820 --> 01:12:38,010 is going to be equal to 1 and the second coordinate is 1285 01:12:38,010 --> 01:12:40,980 going to be equal to 0, right? 1286 01:12:40,980 --> 01:12:42,910 Because they're orthogonal to each other-- 1287 01:12:42,910 --> 01:12:45,810 0 all the way to the end. 1288 01:12:45,810 --> 01:12:48,890 So that's when I do P transpose v1. 1289 01:12:48,890 --> 01:12:55,250 Now I multiply by D. Well, I'm just 1290 01:12:55,250 --> 01:12:58,950 multiplying this guy by lambda 1, this guy by lambda 2, 1291 01:12:58,950 --> 01:13:02,150 and this guy by lambda d, so this is really just lambda 1. 1292 01:13:04,720 --> 01:13:12,080 And now I need to multiply by P. 1293 01:13:12,080 --> 01:13:14,190 So what is P times this guy? 1294 01:13:14,190 --> 01:13:19,730 Well, P is v1 all the way to vd. 1295 01:13:19,730 --> 01:13:21,290 And now I multiply by a vector that 1296 01:13:21,290 --> 01:13:24,620 only has 0's except lambda 1 on the first guy. 1297 01:13:24,620 --> 01:13:26,510 So this is just lambda 1 times v1. 1298 01:13:29,470 --> 01:13:34,630 So what we've proved is that sigma times v1 is lambda 1 v1, 1299 01:13:34,630 --> 01:13:37,330 and that's probably the notion of eigenvalue you're 1300 01:13:37,330 --> 01:13:39,010 most comfortable with, right? 1301 01:13:39,010 --> 01:13:41,620 So just when I multiply by v1, I get 1302 01:13:41,620 --> 01:13:45,440 v1 back multiplied by something, which is the eigenvalue. 1303 01:13:45,440 --> 01:13:54,450 So in particular, if I look at v1 transpose sigma v1, 1304 01:13:54,450 --> 01:13:55,180 what do I get? 1305 01:13:55,180 --> 01:13:58,800 Well, I get lambda 1 times v1 transpose v1, 1306 01:13:58,800 --> 01:14:00,180 and v1 transpose v1 is 1, right? 1307 01:14:00,180 --> 01:14:04,050 So this is actually lambda 1 times 1, 1308 01:14:04,050 --> 01:14:08,360 which is lambda 1, OK? 1309 01:14:08,360 --> 01:14:10,940 And if I do the same with v2, clearly I'm 1310 01:14:10,940 --> 01:14:13,450 going to get v2 transpose sigma 1311 01:14:13,450 --> 01:14:16,910 v2 is equal to lambda 2.
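The same computation in code, on an arbitrary made-up positive semidefinite sigma: the top eigenvector v1 returned by eigh satisfies sigma v1 = lambda 1 v1, and v1 transpose sigma v1 = lambda 1.

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.normal(size=(4, 4))
    Sigma = A @ A.T                              # symmetric positive semidefinite

    lam, P = np.linalg.eigh(Sigma)               # eigenvalues in ascending order
    v1, lam1 = P[:, -1], lam[-1]                 # the top eigenpair
    assert np.allclose(Sigma @ v1, lam1 * v1)    # Sigma v1 = lambda_1 v1
    assert np.allclose(v1 @ Sigma @ v1, lam1)    # v1^T Sigma v1 = lambda_1, since v1^T v1 = 1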
1312 01:14:16,910 --> 01:14:19,910 So for each of the vj's, I know that if I 1313 01:14:19,910 --> 01:14:21,650 look at the variance along the vj, 1314 01:14:21,650 --> 01:14:27,760 it's actually exactly given by those eigenvalues, all right? 1315 01:14:27,760 --> 01:14:38,490 Which proves this, because the variance along the eigenvectors 1316 01:14:38,490 --> 01:14:40,270 is actually equal to the eigenvalues. 1317 01:14:40,270 --> 01:14:43,760 So since they're variances, they have to be non-negative. 1318 01:14:43,760 --> 01:14:47,960 So now, I'm looking for the one direction that 1319 01:14:47,960 --> 01:14:50,450 has the most variance, right? 1320 01:14:50,450 --> 01:14:53,040 But that's not only among the eigenvectors. 1321 01:14:53,040 --> 01:14:55,520 That's also among the other directions 1322 01:14:55,520 --> 01:14:57,200 that are in-between the eigenvectors. 1323 01:14:57,200 --> 01:14:59,390 If I were to look only at the eigenvectors, 1324 01:14:59,390 --> 01:15:02,420 it would just tell me, well, just pick the eigenvector, vj, 1325 01:15:02,420 --> 01:15:05,990 that's associated to the largest of the lambda j's. 1326 01:15:05,990 --> 01:15:09,080 But it turns out that that's also true over all vectors-- 1327 01:15:09,080 --> 01:15:11,810 the maximizing direction is actually one direction which 1328 01:15:11,810 --> 01:15:13,809 is among the eigenvectors. 1329 01:15:13,809 --> 01:15:16,100 And among the eigenvectors, we know that the one that's 1330 01:15:16,100 --> 01:15:17,080 the largest-- 1331 01:15:17,080 --> 01:15:18,740 that carries the largest variance is 1332 01:15:18,740 --> 01:15:23,780 the one that's associated to the largest eigenvalue, all right? 1333 01:15:23,780 --> 01:15:26,990 And so this is what PCA is going to try to do for me. 1334 01:15:26,990 --> 01:15:29,420 So in practice, that's what I mentioned already, right? 1335 01:15:29,420 --> 01:15:31,970 We're trying to project the point cloud 1336 01:15:31,970 --> 01:15:34,730 onto a low-dimensional space of dimension d prime, 1337 01:15:34,730 --> 01:15:36,800 by keeping as much information as possible. 1338 01:15:36,800 --> 01:15:39,230 And by "as much information," I mean we do not 1339 01:15:39,230 --> 01:15:41,540 want points to collide. 1340 01:15:41,540 --> 01:15:45,530 And so what PCA is going to do is just 1341 01:15:45,530 --> 01:15:48,231 going to try to project onto directions. 1342 01:15:48,231 --> 01:15:49,730 So there's going to be a u, and then 1343 01:15:49,730 --> 01:15:52,021 there's going to be something orthogonal to u, and then 1344 01:15:52,021 --> 01:15:55,550 the third one, et cetera, so that once we project on those, 1345 01:15:55,550 --> 01:15:59,600 we're keeping as much of the covariance as possible, OK? 1346 01:15:59,600 --> 01:16:02,859 And in particular, those directions 1347 01:16:02,859 --> 01:16:04,400 that we're going to pick are actually 1348 01:16:04,400 --> 01:16:06,920 a subset of the vj's that are associated to the largest 1349 01:16:06,920 --> 01:16:08,580 eigenvalues. 1350 01:16:08,580 --> 01:16:11,300 So I'm going to stop here for today. 1351 01:16:11,300 --> 01:16:15,020 We'll finish this on Tuesday. 1352 01:16:15,020 --> 01:16:18,260 But basically, the idea is just the following. 1353 01:16:18,260 --> 01:16:22,590 You're just going to-- well, let me skip one more. 1354 01:16:22,590 --> 01:16:24,812 Yeah, this is the idea.
1355 01:16:24,812 --> 01:16:27,020 You're first going to pick the eigenvector associated 1356 01:16:27,020 --> 01:16:30,290 to the largest eigenvalue. 1357 01:16:30,290 --> 01:16:33,890 Then you're going to pick the direction that's orthogonal 1358 01:16:33,890 --> 01:16:37,130 to the vector that you've picked, 1359 01:16:37,130 --> 01:16:38,984 and that's carrying the most variance. 1360 01:16:38,984 --> 01:16:40,650 And that's actually the second largest-- 1361 01:16:40,650 --> 01:16:44,030 the eigenvector associated to the second largest eigenvalue. 1362 01:16:44,030 --> 01:16:46,520 And you're going to go all the way to the number of them 1363 01:16:46,520 --> 01:16:50,120 that you actually want to pick, which is in this case, d, OK? 1364 01:16:50,120 --> 01:16:53,180 And wherever you choose to chop this process, 1365 01:16:53,180 --> 01:16:56,390 not going all the way to d, is going to actually give you 1366 01:16:56,390 --> 01:16:57,890 a lower-dimensional representation 1367 01:16:57,890 --> 01:17:01,238 in the coordinate system that's given by v1, v2, v3, et 1368 01:17:01,238 --> 01:17:02,420 cetera, OK? 1369 01:17:02,420 --> 01:17:04,591 So we'll see that in more detail on Tuesday. 1370 01:17:04,591 --> 01:17:06,090 But I don't want to get into it now. 1371 01:17:06,090 --> 01:17:07,500 We don't have enough time. 1372 01:17:07,500 --> 01:17:10,000 Are there any questions?
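As a preview of Tuesday's material, here is a minimal numpy sketch of the whole procedure just outlined: center the data, diagonalize the empirical covariance, and keep the coordinates along the top eigenvectors. The function name pca_project and all sizes are made up for illustration; a real analysis would typically rely on a library implementation.

    import numpy as np

    def pca_project(X, k):
        # Minimal PCA sketch: project the rows of X onto the k directions
        # of largest empirical variance (the top k eigenvectors of S).
        Xc = X - X.mean(axis=0)              # center each column (the H step)
        S = Xc.T @ Xc / len(X)               # empirical covariance, (1/n) X^T H X
        lam, P = np.linalg.eigh(S)           # eigenvalues in ascending order
        V = P[:, ::-1][:, :k]                # v1, ..., vk for the k largest eigenvalues
        return Xc @ V                        # new coordinates: u^T Xi along each vj

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 10))           # made-up data: 200 points in dimension 10
    Y = pca_project(X, 2)                    # a 2-dimensional representation
    print(Y.shape)                           # (200, 2)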