1
00:00:00,120 --> 00:00:02,460
The following content is
provided under a Creative
2
00:00:02,460 --> 00:00:03,880
Commons license.
3
00:00:03,880 --> 00:00:06,090
Your support will help
MIT OpenCourseWare
4
00:00:06,090 --> 00:00:10,180
continue to offer high quality
educational resources for free.
5
00:00:10,180 --> 00:00:12,720
To make a donation or to
view additional materials
6
00:00:12,720 --> 00:00:16,680
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:16,680 --> 00:00:19,219
at ocw.mit.edu.
8
00:00:19,219 --> 00:00:20,760
PHILIPPE RIGOLLET:
We keep on talking
9
00:00:20,760 --> 00:00:24,870
about principal component
analysis, which we essentially
10
00:00:24,870 --> 00:00:27,910
introduced as a way to
work with a bunch of data.
11
00:00:27,910 --> 00:00:31,560
So the data that's given to
us when we want to do PCA
12
00:00:31,560 --> 00:00:35,270
is a bunch of vectors X1 to Xn.
13
00:00:35,270 --> 00:00:40,090
So they are random vectors.
14
00:00:45,290 --> 00:00:46,652
in Rd.
15
00:00:46,652 --> 00:00:48,110
And what we mentioned
is that we're
16
00:00:48,110 --> 00:00:51,742
going to be using linear
algebra-- in particular,
17
00:00:51,742 --> 00:00:54,200
the spectral theorem-- that
guarantees to us that if I look
18
00:00:54,200 --> 00:00:56,000
at the covariance
matrix of this guy,
19
00:00:56,000 --> 00:00:57,890
or its empirical
covariance matrix,
20
00:00:57,890 --> 00:01:00,132
since they're
symmetric real matrices
21
00:01:00,132 --> 00:01:01,590
and they are positive
semidefinite,
22
00:01:01,590 --> 00:01:06,830
there exists a diagonalization
into non-negative eigenvalues.
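As an illustrative aside (not part of the lecture), this spectral decomposition can be checked numerically in Python with NumPy; the data and dimensions below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5                      # n points in R^d (illustrative sizes)
X = rng.normal(size=(n, d))        # rows are the observations X_1, ..., X_n

Xc = X - X.mean(axis=0)            # center the points
S = Xc.T @ Xc / n                  # empirical covariance matrix, d x d

# S is symmetric real and positive semidefinite, so eigh returns real,
# non-negative eigenvalues and an orthogonal matrix of eigenvectors.
lam, P = np.linalg.eigh(S)         # S = P diag(lam) P^T
assert np.all(lam >= -1e-10)                 # eigenvalues non-negative
assert np.allclose(P.T @ P, np.eye(d))       # P is orthogonal
```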
23
00:01:06,830 --> 00:01:09,555
And so here, those
things live in Rd,
24
00:01:09,555 --> 00:01:11,570
so it's a really large space.
25
00:01:11,570 --> 00:01:14,600
And what we want to
do is to map it down
26
00:01:14,600 --> 00:01:16,640
into a space that
we can visualize,
27
00:01:16,640 --> 00:01:19,610
hopefully a space
of size 2 or 3.
28
00:01:19,610 --> 00:01:22,460
Or if not, then we're just going
to take more and start looking
29
00:01:22,460 --> 00:01:24,920
at subspaces altogether.
30
00:01:24,920 --> 00:01:33,120
So think of the case where d
is large but not larger than n.
31
00:01:33,120 --> 00:01:36,520
So let's say, you have a
large number of points.
32
00:01:36,520 --> 00:01:40,590
The question is, is it possible
to project those things onto
33
00:01:40,590 --> 00:01:45,260
a lower dimensional
space, d prime,
34
00:01:45,260 --> 00:01:49,480
which is much less than d-- so
think of d prime equals, say,
35
00:01:49,480 --> 00:01:52,180
2 or 3--
36
00:01:52,180 --> 00:01:54,490
and so that you keep
as much information
37
00:01:54,490 --> 00:01:56,740
about the cloud of points
that you had originally.
38
00:01:56,740 --> 00:01:58,990
So again, the example
that we could have
39
00:01:58,990 --> 00:02:04,060
is that X1 to Xn are, say,
Xi for patient i, recording
40
00:02:04,060 --> 00:02:08,740
a bunch of body measurements
and maybe blood pressure,
41
00:02:08,740 --> 00:02:10,639
some symptoms, et cetera.
42
00:02:10,639 --> 00:02:12,520
And then we have a
cloud of n patients.
43
00:02:12,520 --> 00:02:15,222
And we're trying to
visualize maybe to see if--
44
00:02:15,222 --> 00:02:16,930
If I could see, for
example, that there's
45
00:02:16,930 --> 00:02:18,820
two groups of
patients, maybe I would
46
00:02:18,820 --> 00:02:21,252
know that I have two
groups with different diseases
47
00:02:21,252 --> 00:02:22,960
or maybe two groups
of different patients
48
00:02:22,960 --> 00:02:25,540
that respond differently
to a particular disease
49
00:02:25,540 --> 00:02:27,040
or drug et cetera.
50
00:02:27,040 --> 00:02:28,900
So visualizing is
going to give us
51
00:02:28,900 --> 00:02:33,880
quite a bit of insight about
what the spatial arrangement
52
00:02:33,880 --> 00:02:35,980
of those vectors is.
53
00:02:35,980 --> 00:02:40,660
And so PCA says, well, here,
of course, in this question,
54
00:02:40,660 --> 00:02:42,880
one thing that's not defined
is what is information.
55
00:02:42,880 --> 00:02:44,338
And we said that
one thing we might
56
00:02:44,338 --> 00:02:46,600
want to do when we project
is that points do not
57
00:02:46,600 --> 00:02:48,267
collide with each other.
58
00:02:48,267 --> 00:02:50,350
And so that means we're
trying to find directions,
59
00:02:50,350 --> 00:02:53,110
where after I project, the
points are still pretty spread
60
00:02:53,110 --> 00:02:53,860
out.
61
00:02:53,860 --> 00:02:55,630
And so I can see
what's going on.
62
00:02:55,630 --> 00:02:58,270
And PCA says-- OK,
so there's many ways
63
00:02:58,270 --> 00:02:59,500
to answer this question.
64
00:02:59,500 --> 00:03:04,290
And PCA says, let's just
find a subspace of dimension
65
00:03:04,290 --> 00:03:08,110
d prime that keeps as much
covariance structure as
66
00:03:08,110 --> 00:03:10,150
possible.
67
00:03:10,150 --> 00:03:13,390
And the reason is
that those directions
68
00:03:13,390 --> 00:03:15,430
are the ones that maximize
the variance, which
69
00:03:15,430 --> 00:03:17,230
is a proxy for the spread.
70
00:03:17,230 --> 00:03:19,540
There's many, many
ways to do this.
71
00:03:19,540 --> 00:03:22,840
There's actually a
Google video that
72
00:03:22,840 --> 00:03:26,440
was released maybe last week
about the data visualization
73
00:03:26,440 --> 00:03:29,260
team of Google that shows
you something called
74
00:03:29,260 --> 00:03:31,554
t-SNE, which is
essentially something
75
00:03:31,554 --> 00:03:32,470
that tries to do that.
76
00:03:32,470 --> 00:03:34,540
It takes points in
very high dimensions
77
00:03:34,540 --> 00:03:36,400
and tries to map them
in lower dimensions,
78
00:03:36,400 --> 00:03:38,280
so that you can
actually visualize them.
79
00:03:38,280 --> 00:03:41,800
And t-SNE is some
alternative to PCA
80
00:03:41,800 --> 00:03:46,850
that gives another definition
for the word information.
81
00:03:46,850 --> 00:03:49,970
I'll talk about this towards
the end, how you can actually
82
00:03:49,970 --> 00:03:52,730
somewhat automatically
extend everything
83
00:03:52,730 --> 00:03:58,830
we've said for PCA to an
infinite family of procedures.
84
00:03:58,830 --> 00:04:00,460
So how do we do this?
85
00:04:00,460 --> 00:04:02,690
Well, the way we do
this is as follows.
86
00:04:02,690 --> 00:04:05,010
So remember, given
those guys, we
87
00:04:05,010 --> 00:04:09,120
can form something which is
called S, which is the sample,
88
00:04:09,120 --> 00:04:16,885
or the empirical
covariance matrix.
89
00:04:19,930 --> 00:04:22,210
And from a
couple of slides ago,
90
00:04:22,210 --> 00:04:25,450
we know that S has an
eigenvalue decomposition,
91
00:04:25,450 --> 00:04:32,930
S is equal to PDP transpose,
where P is orthogonal.
92
00:04:35,570 --> 00:04:37,720
So that's where we use our
linear algebra results.
93
00:04:37,720 --> 00:04:43,640
So that means that P transpose P
is equal to PP transpose, which
94
00:04:43,640 --> 00:04:46,220
is the identity.
95
00:04:46,220 --> 00:04:50,370
So remember, S is
a d by d matrix.
96
00:04:50,370 --> 00:04:53,070
And so P is also d by d.
97
00:04:53,070 --> 00:04:55,860
And D is diagonal.
98
00:05:00,402 --> 00:05:02,860
And I'm actually going to take,
without loss of generality,
99
00:05:02,860 --> 00:05:04,487
I'm going to assume that D--
100
00:05:04,487 --> 00:05:06,070
so it's going to be
diagonal-- and I'm
101
00:05:06,070 --> 00:05:10,240
going to have something
that looks like lambda 1
102
00:05:10,240 --> 00:05:10,930
to lambda d.
103
00:05:10,930 --> 00:05:14,830
Those are called the
eigenvalues of S.
104
00:05:14,830 --> 00:05:19,036
What we know is that lambda
j's are non-negative.
105
00:05:19,036 --> 00:05:21,160
And actually, what I'm
going to assume without loss
106
00:05:21,160 --> 00:05:24,820
of generality is lambda 1
is larger than lambda 2, which
107
00:05:24,820 --> 00:05:30,259
is larger than, and so on, down to lambda d.
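As a practical aside (not from the lecture): NumPy's eigh returns eigenvalues in ascending order, so matching this decreasing convention means reversing them and permuting the columns of P the same way, exactly the relabeling described here. The matrix S below is made up for illustration:

```python
import numpy as np

S = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])    # an illustrative symmetric PSD matrix

lam, P = np.linalg.eigh(S)         # ascending: lam[0] <= ... <= lam[-1]

# Reorder so lambda_1 >= lambda_2 >= ... >= lambda_d, permuting the
# columns of P to match -- the same permutation the lecture describes.
order = np.argsort(lam)[::-1]
lam, P = lam[order], P[:, order]

assert np.all(np.diff(lam) <= 0)                 # now decreasing
assert np.allclose(P @ np.diag(lam) @ P.T, S)    # still a valid decomposition
```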
108
00:05:30,259 --> 00:05:32,050
Because in particular,
this decomposition--
109
00:05:32,050 --> 00:05:35,470
the spectral decomposition--
is not entirely unique.
110
00:05:35,470 --> 00:05:39,750
I could permute
the columns of P,
111
00:05:39,750 --> 00:05:42,600
and I would still have
an orthogonal matrix.
112
00:05:42,600 --> 00:05:44,820
And to balance that,
I would also have
113
00:05:44,820 --> 00:05:46,890
to permute the entries of d.
114
00:05:46,890 --> 00:05:49,680
So there's as many
decompositions
115
00:05:49,680 --> 00:05:51,180
as there are permutations.
116
00:05:51,180 --> 00:05:52,860
So there's actually quite a bit.
117
00:05:52,860 --> 00:05:56,760
But the bag of
eigenvalues is unique.
118
00:05:56,760 --> 00:05:58,430
The set of
eigenvalues is unique.
119
00:05:58,430 --> 00:06:01,020
The ordering is
certainly not unique.
120
00:06:01,020 --> 00:06:02,730
So here, I'm just
going to pick--
121
00:06:02,730 --> 00:06:05,640
I'm going to nail down one
particular permutation--
122
00:06:05,640 --> 00:06:08,070
actually, maybe two in
case I have equalities.
123
00:06:08,070 --> 00:06:12,570
But let's say, I pick
one that satisfies this.
124
00:06:12,570 --> 00:06:15,450
And the reason why I do this
is really not very important.
125
00:06:15,450 --> 00:06:18,060
It's just to say,
I'm going to want
126
00:06:18,060 --> 00:06:20,500
to talk about the largest
of those eigenvalues.
127
00:06:20,500 --> 00:06:22,110
So this is just
going to be easier
128
00:06:22,110 --> 00:06:23,910
for me to say that
this one is lambda 1,
129
00:06:23,910 --> 00:06:26,730
rather than say it's lambda 7.
130
00:06:26,730 --> 00:06:39,980
So this is just to say that
the largest eigenvalue of S
131
00:06:39,980 --> 00:06:42,588
is lambda 1.
132
00:06:42,588 --> 00:06:45,550
If I didn't do that, I would
just call it maybe lambda max,
133
00:06:45,550 --> 00:06:47,760
and you would just know
which one I'm talking about.
134
00:06:52,910 --> 00:07:01,520
So what's happening now
is that if I look at d,
135
00:07:01,520 --> 00:07:04,250
then it turns out
that if I start--
136
00:07:04,250 --> 00:07:09,890
so if I do P transpose Xi, I am
actually projecting my Xi's--
137
00:07:09,890 --> 00:07:12,820
I'm basically changing
the basis for my Xi's.
138
00:07:12,820 --> 00:07:15,140
And now, D is the
empirical covariance matrix
139
00:07:15,140 --> 00:07:16,700
of those guys.
140
00:07:16,700 --> 00:07:18,630
So let's check that.
141
00:07:18,630 --> 00:07:22,010
So what it means is
that if I look at--
142
00:07:26,303 --> 00:07:29,120
so what I claim is
that P transpose Xi--
143
00:07:29,120 --> 00:07:35,180
that's a new vector, let's
call it Yi, it's also in Rd--
144
00:07:35,180 --> 00:07:37,940
and what I claim is that the
covariance matrix of this guy
145
00:07:37,940 --> 00:07:41,840
is actually now this
diagonal matrix, which
146
00:07:41,840 --> 00:07:45,140
means in particular that
if they were Gaussian, then
147
00:07:45,140 --> 00:07:46,280
they would be independent.
148
00:07:46,280 --> 00:07:48,890
But I also know now that
there's no correlation
149
00:07:48,890 --> 00:07:50,530
across coordinates of Yi.
150
00:07:50,530 --> 00:08:00,939
So to prove this, let me assume
that X bar is equal to 0.
151
00:08:00,939 --> 00:08:02,980
And the reason why I do
this is because it's just
152
00:08:02,980 --> 00:08:05,560
annoying to carry out all
this centering constantly
153
00:08:05,560 --> 00:08:09,400
when I talk about S. So
when X bar is equal to 0,
154
00:08:09,400 --> 00:08:11,640
that implies that S
has a very simple form.
155
00:08:11,640 --> 00:08:14,170
It's of the form 1/n
times the sum from i equal 1
156
00:08:14,170 --> 00:08:18,790
to n of Xi Xi transpose.
157
00:08:18,790 --> 00:08:20,380
So that's my S.
158
00:08:20,380 --> 00:08:24,370
But what I want is the S of Y--
159
00:08:24,370 --> 00:08:28,830
So OK, that implies
also that P times X
160
00:08:28,830 --> 00:08:34,690
bar, the average of the
PXi's, is also equal to 0.
161
00:08:34,690 --> 00:08:37,929
So that means that Y bar--
162
00:08:37,929 --> 00:08:40,240
Y has mean 0, if this is 0.
163
00:08:40,240 --> 00:08:43,970
So if I look at the sample
covariance matrix of Y,
164
00:08:43,970 --> 00:08:45,880
it's just going to
be something that
165
00:08:45,880 --> 00:08:49,990
looks like the sum of the
outer products or the Yi Yi
166
00:08:49,990 --> 00:08:50,590
transpose.
167
00:08:53,290 --> 00:08:56,770
And again, the reason why
I make this assumption
168
00:08:56,770 --> 00:09:01,400
is so that I don't have to write
minus X bar X bar transpose.
169
00:09:01,400 --> 00:09:02,284
But you can do it.
170
00:09:02,284 --> 00:09:03,950
And it's going to
work exactly the same.
171
00:09:06,790 --> 00:09:08,640
So now, I look at this S prime.
172
00:09:08,640 --> 00:09:11,340
And so what is this S prime?
173
00:09:11,340 --> 00:09:14,340
Well, I'm just going
to replace Yi with PXi.
174
00:09:14,340 --> 00:09:22,850
So it's the sum from i equal
1 to n of PXi PXi transpose,
175
00:09:22,850 --> 00:09:26,627
which is equal to the sum from--
176
00:09:26,627 --> 00:09:27,460
sorry there's a 1/n.
177
00:09:32,360 --> 00:09:34,820
So it's equal to 1/n
sum from i equal 1
178
00:09:34,820 --> 00:09:43,490
to n of PXi Xi
transpose P transpose.
179
00:09:43,490 --> 00:09:45,130
Agree?
180
00:09:45,130 --> 00:09:48,580
I just said that the transpose
of AB is the transpose of B
181
00:09:48,580 --> 00:09:53,830
times the transpose of A.
182
00:09:53,830 --> 00:09:55,900
And so now, I can
push the sum in.
183
00:09:55,900 --> 00:09:57,520
P does not depend on i.
184
00:09:57,520 --> 00:10:05,800
So this thing here is
equal to PS P transpose,
185
00:10:05,800 --> 00:10:10,130
because the sum of the Xi Xi
transpose divided by n is S.
186
00:10:10,130 --> 00:10:12,200
But what is PS P transpose?
187
00:10:12,200 --> 00:10:17,090
Well, we know that
S is equal to--
188
00:10:17,090 --> 00:10:19,340
sorry that's P transpose.
189
00:10:19,340 --> 00:10:20,880
So this was with a P transpose.
190
00:10:20,880 --> 00:10:23,420
I'm sorry, I made an
important mistake here.
191
00:10:23,420 --> 00:10:25,420
So Yi is P transpose Xi.
192
00:10:25,420 --> 00:10:27,440
So this is P transpose
and P transpose
193
00:10:27,440 --> 00:10:29,600
here, which means that
this is P transpose
194
00:10:29,600 --> 00:10:32,450
and this is double transpose,
which is just nothing
195
00:10:32,450 --> 00:10:34,150
so that transpose becomes nothing.
196
00:10:36,680 --> 00:10:41,600
So now, I write S
as PD P transpose.
197
00:10:41,600 --> 00:10:43,781
That's the spectral
decomposition
198
00:10:43,781 --> 00:10:44,530
that I had before.
199
00:10:44,530 --> 00:10:46,550
That's my eigenvalue
decomposition,
200
00:10:46,550 --> 00:10:49,050
which means that now,
if I look at S prime,
201
00:10:49,050 --> 00:10:56,000
it's P transpose times
PD P transpose P.
202
00:10:56,000 --> 00:10:58,300
But now, P transpose
P is the identity,
203
00:10:58,300 --> 00:11:00,250
P transpose P is the identity.
204
00:11:00,250 --> 00:11:06,646
So this is actually
just equal to D.
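As an illustrative check (synthetic data, not from the lecture): forming Yi = P transpose Xi and recomputing the empirical covariance should give exactly the diagonal matrix D of eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated data
X = X - X.mean(axis=0)             # assume X bar = 0, as in the lecture

S = X.T @ X / n                    # empirical covariance of the X_i's
lam, P = np.linalg.eigh(S)         # S = P diag(lam) P^T

Y = X @ P                          # row i of Y is (P^T X_i)^T
S_prime = Y.T @ Y / n              # empirical covariance of the Y_i's

# S' = P^T S P = D: diagonal, with the eigenvalues on the diagonal
assert np.allclose(S_prime, np.diag(lam))
```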
205
00:11:06,646 --> 00:11:08,270
And again, you can
check that this also
206
00:11:08,270 --> 00:11:12,840
works if you have to center
all those guys as you go.
207
00:11:12,840 --> 00:11:15,630
But if you think about
it, this is the same thing
208
00:11:15,630 --> 00:11:19,530
as saying that I just
replaced Xi by Xi minus X bar.
209
00:11:19,530 --> 00:11:26,590
And then it's true that Yi
is P transpose times Xi minus X bar.
210
00:11:26,590 --> 00:11:29,770
So now, we have that D is
the empirical covariance
211
00:11:29,770 --> 00:11:30,940
matrix of those guys--
212
00:11:30,940 --> 00:11:33,112
the Yi's, which are
P transpose Xi's.
213
00:11:33,112 --> 00:11:34,570
And so in particular,
what it means
214
00:11:34,570 --> 00:11:42,810
is that if I look at the
covariance of Yj Yk--
215
00:11:46,130 --> 00:11:48,920
So that's the covariance
of the j-th coordinate of Y
216
00:11:48,920 --> 00:11:51,650
and the k-th coordinate of Y.
I'm just not putting an index.
217
00:11:51,650 --> 00:11:53,720
But maybe, let's say the
first one or something
218
00:11:53,720 --> 00:11:56,142
like this-- any of
them, they're IID.
219
00:11:56,142 --> 00:11:57,350
Then what is this covariance?
220
00:11:57,350 --> 00:12:01,760
It's actually 0 if j
is different from k.
221
00:12:01,760 --> 00:12:06,590
And the covariance
between Yj and Yj,
222
00:12:06,590 --> 00:12:13,070
which is just the variance
of Yj, is equal to lambda j--
223
00:12:13,070 --> 00:12:17,300
the j-th largest eigenvalue.
224
00:12:17,300 --> 00:12:22,580
So the eigenvalues capture the
variance of my observations
225
00:12:22,580 --> 00:12:25,110
in this new coordinate system.
226
00:12:25,110 --> 00:12:26,632
And they're
completely orthogonal.
227
00:12:26,632 --> 00:12:27,590
So what does that mean?
228
00:12:27,590 --> 00:12:29,750
Well, again, remember,
if I chop off
229
00:12:29,750 --> 00:12:34,160
the head of my Gaussian
in multiple dimensions,
230
00:12:34,160 --> 00:12:35,780
we said that what
we started from
231
00:12:35,780 --> 00:12:39,560
was something that
looked like this.
232
00:12:39,560 --> 00:12:42,320
And we said, well, there's one
direction that's important,
233
00:12:42,320 --> 00:12:45,230
that's this guy, and one
important one that's this guy.
234
00:12:45,230 --> 00:12:48,200
When I applied a transformation
P transpose, what I'm doing
235
00:12:48,200 --> 00:12:51,110
is that I'm realigning this
thing with the new axes.
236
00:12:51,110 --> 00:12:53,660
Or in a way, rather
to be fair, I'm
237
00:12:53,660 --> 00:12:59,600
not actually realigning
the ellipses with the axes.
238
00:12:59,600 --> 00:13:02,690
I'm really realigning the
axes with the ellipses.
239
00:13:02,690 --> 00:13:05,360
So really, what I'm doing is
I'm saying, after I apply P,
240
00:13:05,360 --> 00:13:08,690
I'm just rotating this
coordinate system.
241
00:13:08,690 --> 00:13:12,670
So now, it becomes this guy.
242
00:13:19,360 --> 00:13:22,850
And now, my ellipses
actually completely align.
243
00:13:22,850 --> 00:13:25,730
And what happens here is
that this coordinate is
244
00:13:25,730 --> 00:13:27,110
independent of that coordinate.
245
00:13:27,110 --> 00:13:31,715
And that's what we write
here, if they are Gaussian.
246
00:13:31,715 --> 00:13:32,840
I didn't really tell you this--
247
00:13:32,840 --> 00:13:34,810
I'm only making statements
about covariances.
248
00:13:34,810 --> 00:13:36,768
If they are Gaussians,
those imply statements
249
00:13:36,768 --> 00:13:37,614
about independence.
250
00:13:40,960 --> 00:13:44,590
So as I said, the
variance now, lambda 1,
251
00:13:44,590 --> 00:13:54,700
is actually the variance
of P transpose Xi.
252
00:13:57,890 --> 00:14:00,140
But if I look now at
the-- so this is a vector,
253
00:14:00,140 --> 00:14:04,910
so I need to look at the
first coordinate of this guy.
254
00:14:08,490 --> 00:14:11,250
So it turns out that
doing this is actually
255
00:14:11,250 --> 00:14:15,440
the same thing as looking
at the variance of what?
256
00:14:15,440 --> 00:14:21,480
Well, the first
column of P times Xi.
257
00:14:21,480 --> 00:14:24,490
So that's the variance of--
258
00:14:24,490 --> 00:14:30,344
I'm going to call it v1
transpose Xi, where P--
259
00:14:44,390 --> 00:14:53,920
So the v1 to vd in Rd
are eigenvectors.
260
00:14:53,920 --> 00:14:57,190
And each vi is
associated to lambda i.
261
00:14:57,190 --> 00:14:59,740
So that's what we saw when
we talked about this eigen
262
00:14:59,740 --> 00:15:02,800
decomposition a
couple of slides back.
263
00:15:02,800 --> 00:15:06,040
That's the one here.
264
00:15:06,040 --> 00:15:10,310
So if I call the
columns of P v1 to vd,
265
00:15:10,310 --> 00:15:13,600
this is what's happening.
266
00:15:13,600 --> 00:15:16,030
So when I look at lambda
1, it's just the variance
267
00:15:16,030 --> 00:15:19,700
of Xi inner product with v1.
268
00:15:19,700 --> 00:15:22,180
And we made this picture
when we said, well,
269
00:15:22,180 --> 00:15:25,870
let's say v1 is here
and then x1 is here.
270
00:15:25,870 --> 00:15:31,180
And if v1 has unit
norm, then the inner product
271
00:15:31,180 --> 00:15:38,050
between Xi and v1 is just
the length of this guy here.
272
00:15:38,050 --> 00:15:41,020
So that's the variance of the
Xi's-- that's the length of Xi--
273
00:15:41,020 --> 00:15:43,720
so this is 0-- that's the
length of Xi when I project it
274
00:15:43,720 --> 00:15:46,750
onto the direction
that's spanned by v1.
275
00:15:46,750 --> 00:15:52,210
If v1 has length 2, this is
really just twice this length.
276
00:15:52,210 --> 00:15:56,340
If v1 has length 3,
it's three times this.
277
00:15:56,340 --> 00:16:01,570
But it turns out that since
P satisfies P transpose
278
00:16:01,570 --> 00:16:04,780
P is equal to the identity--
279
00:16:04,780 --> 00:16:07,900
that's an orthogonal
matrix, that's right here--
280
00:16:07,900 --> 00:16:11,470
then this is actually
saying the same thing
281
00:16:11,470 --> 00:16:18,760
as vj transpose vj, which is
really the norm squared of vj,
282
00:16:18,760 --> 00:16:20,800
is equal to 1.
283
00:16:20,800 --> 00:16:26,520
And vj transpose vk is equal
to 0, if j is different from k.
284
00:16:29,610 --> 00:16:31,560
The eigenvectors are
orthogonal to each other.
285
00:16:31,560 --> 00:16:33,050
And they're actually
all of norm 1.
286
00:16:37,390 --> 00:16:39,580
So now, I know that this
is indeed a direction.
287
00:16:39,580 --> 00:16:44,290
And so when I look
at v1 transpose Xi,
288
00:16:44,290 --> 00:16:46,240
I'm really measuring
exactly this length.
289
00:16:46,240 --> 00:16:47,460
And what is this length?
290
00:16:47,460 --> 00:16:49,660
It's the length of
the projection of Xi
291
00:16:49,660 --> 00:16:51,190
onto this line.
292
00:16:51,190 --> 00:16:53,920
That's the line
that's spanned by v1.
293
00:16:53,920 --> 00:16:57,680
So if I had a very high
dimensional problem
294
00:16:57,680 --> 00:17:01,460
and I started to look
at the direction v1--
295
00:17:01,460 --> 00:17:03,884
let's say v1 now is
not an eigenvector,
296
00:17:03,884 --> 00:17:08,270
it's any direction-- then
if I want to do this lower
297
00:17:08,270 --> 00:17:11,819
dimensional projection, then
I have to understand how those
298
00:17:11,819 --> 00:17:14,272
Xi's project onto the
line that's spanned by v1,
299
00:17:14,272 --> 00:17:16,730
because this is all that I'm
going to be keeping at the end
300
00:17:16,730 --> 00:17:17,646
of the day about Xi's.
301
00:17:20,170 --> 00:17:23,200
So what we want is
to find the direction
302
00:17:23,200 --> 00:17:25,240
where those Xi's,
those projections,
303
00:17:25,240 --> 00:17:26,361
have a lot of variance.
304
00:17:26,361 --> 00:17:28,569
And we know that the variance
of Xi on this direction
305
00:17:28,569 --> 00:17:30,490
is actually exactly
given by lambda 1.
306
00:17:36,890 --> 00:17:40,490
Sorry, that's the
empirical var--
307
00:17:40,490 --> 00:17:42,480
yeah, I should
call variance hat.
308
00:17:42,480 --> 00:17:43,730
That's the empirical variance.
309
00:17:43,730 --> 00:17:45,063
Everything is in empirical here.
310
00:17:45,063 --> 00:17:48,680
We're talking about the
empirical covariance matrix.
311
00:17:48,680 --> 00:17:54,150
And so I also have that lambda
2 is the empirical variance
312
00:17:54,150 --> 00:17:59,160
of when I project Xi onto
v2, which is the second one,
313
00:17:59,160 --> 00:18:00,600
just for exactly this reason.
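A quick numerical sanity check of this point (synthetic data, with made-up per-coordinate spreads): the empirical variance of the projections vj transpose Xi should equal lambda j for every j:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 3
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 0.5])  # unequal spreads
X = X - X.mean(axis=0)             # center, so X bar = 0

S = X.T @ X / n
lam, P = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]      # decreasing: lambda_1 >= ... >= lambda_d
lam, P = lam[order], P[:, order]

# The empirical variance of the projections v_j^T X_i is lambda_j.
for j in range(d):
    proj = X @ P[:, j]                       # v_j^T X_i for each i
    assert np.isclose(proj.var(), lam[j])    # population variance (ddof=0)
```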
314
00:18:07,474 --> 00:18:08,456
Any question?
315
00:18:14,170 --> 00:18:16,830
So lambda j's are going
to be important for us.
316
00:18:16,830 --> 00:18:19,320
Lambda j measures the
spread of the points
317
00:18:19,320 --> 00:18:22,530
when I project them onto a
line which is a one dimensional
318
00:18:22,530 --> 00:18:23,259
space.
319
00:18:23,259 --> 00:18:25,800
And so I'm going to have-- let's
say I want to pick only one,
320
00:18:25,800 --> 00:18:28,133
I'm going to have to find the
one dimensional space that
321
00:18:28,133 --> 00:18:29,690
carries the most variance.
322
00:18:29,690 --> 00:18:32,070
And I claim that
v1 is the one that
323
00:18:32,070 --> 00:18:35,280
actually maximizes the spread.
324
00:18:35,280 --> 00:18:55,900
So the claim-- so for
any direction, u in Rd--
325
00:18:55,900 --> 00:18:59,380
and by direction, I really
just mean that the norm of u
326
00:18:59,380 --> 00:19:00,920
is equal to 1.
327
00:19:00,920 --> 00:19:02,020
I need to play fair--
328
00:19:02,020 --> 00:19:04,690
I'm going to compare myself to
other things of length one,
329
00:19:04,690 --> 00:19:07,600
so I need to play fair and
look at directions of length 1.
330
00:19:07,600 --> 00:19:16,321
Now, if I'm interested
in the empirical variance
331
00:19:16,321 --> 00:19:20,875
of X1 transpose--
332
00:19:20,875 --> 00:19:29,150
sorry, u transpose X1 through u
transpose Xn, then this thing
333
00:19:29,150 --> 00:19:37,950
is maximized for
u equals v1, where
334
00:19:37,950 --> 00:19:40,610
v1 is the eigenvector
associated to lambda 1
335
00:19:40,610 --> 00:19:42,110
and lambda 1 is not
just any eigenvalue,
336
00:19:42,110 --> 00:19:45,090
it's the largest of all those.
337
00:19:45,090 --> 00:19:46,992
So it's the largest eigenvalue.
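This claim can be probed numerically (synthetic, illustrative data): no unit direction u should give empirical variance above lambda 1, and u = v1 attains it:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated cloud
X = X - X.mean(axis=0)
S = X.T @ X / n

lam, P = np.linalg.eigh(S)
v1, lam1 = P[:, -1], lam[-1]       # eigh is ascending: last one is largest

# Empirical variance along unit u is u^T S u <= lambda_1, equality at v1.
assert np.isclose(v1 @ S @ v1, lam1)
for _ in range(100):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)         # a random unit direction
    assert u @ S @ u <= lam1 + 1e-10
```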
338
00:19:50,607 --> 00:19:51,440
So why is that true?
339
00:19:55,410 --> 00:20:00,840
Well, there's also a claim
that for any direction u--
340
00:20:00,840 --> 00:20:03,380
so that's 1 and 2--
341
00:20:03,380 --> 00:20:08,990
the variance of u
transpose X-- now,
342
00:20:08,990 --> 00:20:11,900
this is just a random variable,
and I'm looking about the true
343
00:20:11,900 --> 00:20:13,040
variance--
344
00:20:13,040 --> 00:20:27,440
this is maximized for u
equals, let's call it w1,
345
00:20:27,440 --> 00:20:38,320
where w1 is the
eigenvector of sigma--
346
00:20:38,320 --> 00:20:40,204
Now, I'm talking about
the true variance.
347
00:20:40,204 --> 00:20:42,620
Whereas, here, I was talking
about the empirical variance.
348
00:20:42,620 --> 00:20:44,950
So the true variance
is maximized by the eigenvector
349
00:20:44,950 --> 00:20:55,630
of the true sigma
associated to the largest
350
00:20:55,630 --> 00:20:59,554
eigenvalue of sigma.
351
00:21:02,870 --> 00:21:04,270
So I did not give it a name.
352
00:21:04,270 --> 00:21:06,285
Here, that was lambda 1
for the empirical one.
353
00:21:06,285 --> 00:21:07,660
For the true one,
you can give it
354
00:21:07,660 --> 00:21:10,330
another name, mu 1 if you want.
355
00:21:10,330 --> 00:21:13,407
But that's just the same thing.
356
00:21:13,407 --> 00:21:15,490
All it's saying is like,
wherever I see empirical,
357
00:21:15,490 --> 00:21:16,156
I can remove it.
358
00:21:27,690 --> 00:21:29,815
So why is this claim true?
359
00:21:29,815 --> 00:21:31,815
Well, let's look at the
second one, for example.
360
00:21:38,180 --> 00:21:44,480
So what is the variance
of u transpose X?
361
00:21:44,480 --> 00:21:47,570
So that's what I want to know.
362
00:21:47,570 --> 00:21:54,850
So that's the expectation--
so let's assume that the mean of X is 0,
363
00:21:54,850 --> 00:21:56,711
again, for the same
reasons as before.
364
00:21:56,711 --> 00:21:57,710
So what is the variance?
365
00:21:57,710 --> 00:21:59,410
It's just the expectation
of the square.
366
00:22:06,460 --> 00:22:08,260
I don't need to remove
the expectation.
367
00:22:08,260 --> 00:22:10,870
And the expectation
of the square is just
368
00:22:10,870 --> 00:22:12,700
the expectation
of u transpose X.
369
00:22:12,700 --> 00:22:15,250
And then I'm going to write
the other one X transpose u.
370
00:22:19,510 --> 00:22:22,360
And we know that this
is deterministic.
371
00:22:22,360 --> 00:22:25,570
So I'm just going to take
that this is just u transpose
372
00:22:25,570 --> 00:22:31,995
expectation of X X transpose u.
373
00:22:31,995 --> 00:22:32,870
And what is this guy?
374
00:22:39,305 --> 00:22:40,760
That's covariance sigma.
375
00:22:40,760 --> 00:22:41,870
That's just what sigma is.
376
00:22:44,730 --> 00:22:48,590
So the variance I can write
as u transpose sigma u.
377
00:22:48,590 --> 00:22:51,272
We've made this
computation before.
378
00:22:51,272 --> 00:22:53,730
And now what I want to claim
is that this thing is actually
379
00:22:53,730 --> 00:22:57,275
less than the largest
eigenvalue, which I actually
380
00:22:57,275 --> 00:22:58,150
called lambda 1 here.
381
00:22:58,150 --> 00:22:59,680
I should probably not.
382
00:22:59,680 --> 00:23:01,100
And the P is-- well, OK.
383
00:23:06,430 --> 00:23:11,260
Let's just pretend
everything is not empirical.
384
00:23:11,260 --> 00:23:22,580
So now, I'm going to write
sigma as P lambda 1 to lambda d P
385
00:23:22,580 --> 00:23:23,180
transpose.
386
00:23:23,180 --> 00:23:25,010
That's just the
eigendecomposition,
387
00:23:25,010 --> 00:23:32,090
where I admittedly reuse the
same notation as I did for S.
388
00:23:32,090 --> 00:23:34,764
So I should really put
some primes everywhere,
389
00:23:34,764 --> 00:23:36,680
so you know those are
things that are actually
390
00:23:36,680 --> 00:23:38,630
different in practice.
391
00:23:38,630 --> 00:23:43,469
So this is just the
decomposition of sigma.
392
00:23:43,469 --> 00:23:44,510
You seem confused, Helen.
393
00:23:44,510 --> 00:23:47,570
You have a question?
394
00:23:47,570 --> 00:23:48,070
Yeah?
395
00:23:48,070 --> 00:23:53,830
AUDIENCE: What is-- when you
talked about the empirical data
396
00:23:53,830 --> 00:23:55,750
and--
397
00:23:55,750 --> 00:23:56,880
PHILIPPE RIGOLLET: So OK--
398
00:24:00,670 --> 00:24:02,801
so I can make
everything I'm saying,
399
00:24:02,801 --> 00:24:04,300
I can talk about
either the variance
400
00:24:04,300 --> 00:24:05,470
or the empirical variance.
401
00:24:05,470 --> 00:24:07,720
And you can just add the
word empirical in front of it
402
00:24:07,720 --> 00:24:08,680
whenever you want.
403
00:24:08,680 --> 00:24:09,910
The same thing works.
404
00:24:09,910 --> 00:24:13,120
But just for the sake of
removing the confusion,
405
00:24:13,120 --> 00:24:20,409
let's just do it again
with S. So I'm just
406
00:24:20,409 --> 00:24:21,950
going to do everything
with S. So I'm
407
00:24:21,950 --> 00:24:24,650
going to assume that
X bar is equal to 0.
408
00:24:24,650 --> 00:24:27,780
And here, I'm going to talk
about the empirical variance,
409
00:24:27,780 --> 00:24:31,530
which is just 1/n
sum from i equal 1
410
00:24:31,530 --> 00:24:35,272
to n of u transpose Xi squared.
411
00:24:35,272 --> 00:24:36,230
So it's the same thing.
412
00:24:36,230 --> 00:24:37,646
Everywhere you see
an expectation,
413
00:24:37,646 --> 00:24:39,110
you just put in average.
414
00:24:45,930 --> 00:24:50,850
And then I get 1/n
sum from i equal 1
415
00:24:50,850 --> 00:24:53,032
to n of Xi Xi transpose.
416
00:24:53,032 --> 00:24:54,490
And now, I'm going
to call this guy
417
00:24:54,490 --> 00:24:58,200
S, because that's what it is.
418
00:24:58,200 --> 00:24:59,994
So this is u transpose Su.
419
00:24:59,994 --> 00:25:02,410
But given that I could
just replace the expectation
420
00:25:02,410 --> 00:25:03,910
by averages everywhere,
you can tell
421
00:25:03,910 --> 00:25:06,590
that the thing is going to work
for either one or the other.
422
00:25:06,590 --> 00:25:08,491
So now, this thing
was actually-- so now,
423
00:25:08,491 --> 00:25:10,240
I don't have any problem
with my notation.
424
00:25:10,240 --> 00:25:14,310
This is actually the
decomposition of S.
425
00:25:14,310 --> 00:25:16,030
That's just the
spectral decomposition
426
00:25:16,030 --> 00:25:18,840
and it's to its eigenvalues.
427
00:25:18,840 --> 00:25:27,080
And so now, what I have is that
when I look at u transpose Su,
428
00:25:27,080 --> 00:25:34,920
this is actually equal
to P u transpose S Pu.
429
00:25:39,294 --> 00:25:40,500
OK.
430
00:25:40,500 --> 00:25:41,750
There's a transpose somewhere.
431
00:25:41,750 --> 00:25:42,416
That's this guy.
432
00:25:45,300 --> 00:25:46,161
And that's this guy.
433
00:25:57,057 --> 00:26:00,220
Now-- sorry, that's
not P, that's
434
00:26:00,220 --> 00:26:05,000
D. That's D, that's
this diagonal matrix.
435
00:26:10,269 --> 00:26:11,310
Let's look at this thing.
436
00:26:11,310 --> 00:26:15,810
And let's call P transpose
u, let's call it b.
437
00:26:15,810 --> 00:26:18,705
So that's also a vector in Rd.
438
00:26:18,705 --> 00:26:19,530
What is it?
439
00:26:19,530 --> 00:26:21,370
It's just, I take a
unit vector, and then
440
00:26:21,370 --> 00:26:23,020
I apply P transpose to it.
441
00:26:23,020 --> 00:26:25,740
So that's basically what
happens to a unit vector
442
00:26:25,740 --> 00:26:29,820
when I apply the same
change of basis that I did.
443
00:26:29,820 --> 00:26:34,650
So I'm just changing my
orthogonal system the same way
444
00:26:34,650 --> 00:26:36,360
I did for the other ones.
445
00:26:36,360 --> 00:26:38,940
So what's happening
when I write this?
446
00:26:38,940 --> 00:26:46,590
Well, now I have that u
transpose Su is b transpose Db.
447
00:26:46,590 --> 00:26:50,310
But now, doing b transpose
Db when D is diagonal
448
00:26:50,310 --> 00:26:52,690
and b is a vector is
a very simple thing.
449
00:26:52,690 --> 00:26:53,910
I can expand it.
450
00:26:53,910 --> 00:26:54,480
This is what?
451
00:26:54,480 --> 00:26:57,120
This is just the
sum from j equal 1
452
00:26:57,120 --> 00:27:01,650
to d of lambda j bj squared.
453
00:27:05,386 --> 00:27:08,947
So that's just like matrix
vector multiplication.
454
00:27:08,947 --> 00:27:11,280
And in particular, I know
that the largest of those guys
455
00:27:11,280 --> 00:27:14,010
is lambda 1 and those
guys are all non-negative.
456
00:27:14,010 --> 00:27:16,705
So this thing is actually
less than lambda 1 times
457
00:27:16,705 --> 00:27:20,430
the sum from j equal 1 to
d of lambda j squared--
458
00:27:23,330 --> 00:27:24,490
sorry, bj squared.
459
00:27:27,560 --> 00:27:34,010
And this is just the
norm of b squared.
460
00:27:34,010 --> 00:27:38,320
So if I want to prove what's on
the slide, all I need to check
461
00:27:38,320 --> 00:27:40,965
is that b has norm, which is--
462
00:27:40,965 --> 00:27:41,935
AUDIENCE: 1.
463
00:27:41,935 --> 00:27:43,910
PHILIPPE RIGOLLET: At most, 1.
464
00:27:43,910 --> 00:27:45,090
It's going to be at most 1.
465
00:27:45,090 --> 00:27:45,780
Why?
466
00:27:45,780 --> 00:27:51,690
Well, because b is really
just a change of basis for u.
467
00:27:51,690 --> 00:27:55,650
And so if I take a vector,
I'm just changing its basis.
468
00:27:55,650 --> 00:27:57,540
I'm certainly not
changing its length--
469
00:27:57,540 --> 00:27:59,580
think of a rotation,
and I can also flip it,
470
00:27:59,580 --> 00:28:00,790
but think of a rotation--
471
00:28:02,839 --> 00:28:05,380
well, actually, for vector, it's
just going to be a rotation.
472
00:28:05,380 --> 00:28:06,850
And so now, what
I have I just have
473
00:28:06,850 --> 00:28:11,970
to check that the norm of
b squared is equal to what?
474
00:28:11,970 --> 00:28:16,470
Well, it's equal to the norm
of P transpose u squared,
475
00:28:16,470 --> 00:28:21,620
which is equal to u
transpose P P transpose u.
476
00:28:21,620 --> 00:28:23,000
But P is orthogonal.
477
00:28:23,000 --> 00:28:26,210
So this thing is actually
just the identity.
478
00:28:26,210 --> 00:28:28,307
So that's just u
transpose u, which
479
00:28:28,307 --> 00:28:33,260
is equal to the norm u
squared, which is equal to 1,
480
00:28:33,260 --> 00:28:37,070
because I took u to have
norm 1 in the first place.
481
00:28:37,070 --> 00:28:39,640
And so this-- you're right--
was actually of norm equal to 1.
482
00:28:39,640 --> 00:28:42,017
I just needed to have
it less, but it's equal.
483
00:28:42,017 --> 00:28:44,350
And so what I'm left with is
that this thing is actually
484
00:28:44,350 --> 00:28:45,820
equal to lambda 1.
485
00:28:45,820 --> 00:28:50,030
So I know that for
every u that I pick--
486
00:28:50,030 --> 00:28:52,890
that has norm--
487
00:28:52,890 --> 00:28:55,030
So I'm just reminding
you that u here
488
00:28:55,030 --> 00:28:57,730
has norm squared equal to 1.
489
00:28:57,730 --> 00:29:00,760
For every u that I
pick, this u transpose
490
00:29:00,760 --> 00:29:02,890
Su is at most lambda 1.
491
00:29:06,400 --> 00:29:11,250
So the maximum of u transpose
Su is at most lambda 1.
492
00:29:11,250 --> 00:29:13,270
And we know that that's
the variance, that's
493
00:29:13,270 --> 00:29:15,790
the empirical variance,
when I project my points
494
00:29:15,790 --> 00:29:17,500
onto direction spanned by u.
495
00:29:20,240 --> 00:29:23,040
So now, I have an
empirical variance,
496
00:29:23,040 --> 00:29:24,650
which is at most lambda 1.
497
00:29:24,650 --> 00:29:28,457
But I also know that if I take u
to be something very specific--
498
00:29:28,457 --> 00:29:30,040
I mean, it was on
the previous board--
499
00:29:30,040 --> 00:29:32,510
if I take u to be
equal to v1, then
500
00:29:32,510 --> 00:29:35,270
this thing is actually
not an inequality,
501
00:29:35,270 --> 00:29:37,160
this is an equality.
502
00:29:37,160 --> 00:29:41,990
And the reason is, when I
actually take u to be v1,
503
00:29:41,990 --> 00:29:46,410
all of these bj's are going to
be 0, except for the one that's
504
00:29:46,410 --> 00:29:50,360
b1, which is itself equal to 1.
505
00:29:50,360 --> 00:29:52,190
So I mean, we can
briefly check this.
506
00:29:52,190 --> 00:29:53,738
But if I take v--
507
00:29:59,106 --> 00:30:07,100
if u is equal to v1, what
I have is that u transpose
508
00:30:07,100 --> 00:30:24,800
Su is equal to P transpose
v1 D P transpose v1.
509
00:30:24,800 --> 00:30:26,680
But what is P transpose v1?
510
00:30:26,680 --> 00:30:31,960
Well, remember P
transpose is just
511
00:30:31,960 --> 00:30:34,820
the matrix that has
vectors v1 transpose here,
512
00:30:34,820 --> 00:30:40,110
v2 transpose here, all the
way to vd transpose here.
513
00:30:40,110 --> 00:30:45,570
And we know that when I take
vj transpose vk, I get 0,
514
00:30:45,570 --> 00:30:46,680
if j is different from k.
515
00:30:46,680 --> 00:30:49,620
And if j is equal to k, I get 1.
516
00:30:49,620 --> 00:30:53,690
So P transpose v1
is equal to what?
517
00:31:05,040 --> 00:31:06,570
Take v1 here and multiply it.
518
00:31:06,570 --> 00:31:08,250
So the first coordinate
is going to be
519
00:31:08,250 --> 00:31:12,870
v1 transpose v1, which is 1.
520
00:31:12,870 --> 00:31:14,370
The second coordinate
is going to be
521
00:31:14,370 --> 00:31:19,030
v2 transpose v1, which is 0.
522
00:31:19,030 --> 00:31:22,740
And so I get 0's
all the way, right?
523
00:31:22,740 --> 00:31:25,470
So that means that this
thing here is really
524
00:31:25,470 --> 00:31:29,040
just the vector 1, 0, 0.
525
00:31:29,040 --> 00:31:32,220
And here, this is just
the vector 1, 0, 0.
526
00:31:32,220 --> 00:31:34,100
So when I multiply
it with this guy,
527
00:31:34,100 --> 00:31:37,980
I am only picking up
the top left element
528
00:31:37,980 --> 00:31:41,740
of D, which is lambda 1.
529
00:31:41,740 --> 00:31:44,940
So for every u,
it's at most lambda 1.
530
00:31:44,940 --> 00:31:46,950
And for v1, it's
equal to lambda 1,
531
00:31:46,950 --> 00:31:52,590
which means that it's
maximized for u equals v1.
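[This whole argument -- u transpose Su is at most lambda 1 for every unit u, with equality at v1 -- can be verified directly. A numpy sketch on synthetic data; note that numpy's eigh returns eigenvalues in ascending order, so the leading pair sits at the end.]

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
evals, evecs = np.linalg.eigh(S)
lambda1 = evals[-1]
v1 = evecs[:, -1]

# v1 attains the bound: v1^T S v1 = lambda_1.
assert np.isclose(v1 @ S @ v1, lambda1)

# And no random unit vector exceeds it.
for _ in range(1000):
    u = rng.normal(size=4)
    u /= np.linalg.norm(u)
    assert u @ S @ u <= lambda1 + 1e-12
```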
532
00:31:52,590 --> 00:31:54,480
And that's where
I said that this
533
00:31:54,480 --> 00:31:57,527
is the fanciest non-convex
problem we know how to solve.
534
00:31:57,527 --> 00:31:59,610
This was a problem that
was definitely non-convex.
535
00:31:59,610 --> 00:32:02,820
We were maximizing a convex
function over a sphere.
536
00:32:02,820 --> 00:32:06,156
But we know that v1,
which is something--
537
00:32:06,156 --> 00:32:07,530
I mean, of course,
you still have
538
00:32:07,530 --> 00:32:08,946
to believe me that
you can compute
539
00:32:08,946 --> 00:32:11,670
the spectral decomposition
efficiently--
540
00:32:11,670 --> 00:32:14,880
but essentially, if you've
taken linear algebra,
541
00:32:14,880 --> 00:32:17,020
you know that you can
diagonalize a matrix.
542
00:32:17,020 --> 00:32:19,797
And so you get that v1
is just the maximum.
543
00:32:19,797 --> 00:32:21,630
So you can find your
maximum just by looking
544
00:32:21,630 --> 00:32:24,109
at the spectral decomposition.
545
00:32:24,109 --> 00:32:25,650
You don't have to
do any optimization
546
00:32:25,650 --> 00:32:28,790
or anything like this.
547
00:32:28,790 --> 00:32:29,870
So let's recap.
548
00:32:29,870 --> 00:32:32,390
Where are we?
549
00:32:32,390 --> 00:32:34,160
We've established
that if I start
550
00:32:34,160 --> 00:32:37,820
with my empirical covariance
matrix, I can diagonalize it
551
00:32:37,820 --> 00:32:42,270
as P D P transpose.
552
00:32:42,270 --> 00:32:44,250
And then if I take the
eigenvector associated
553
00:32:44,250 --> 00:32:48,630
to the largest eigenvalues-- so
if I permute the columns of P
554
00:32:48,630 --> 00:32:50,810
and of D in such
a way that they
555
00:32:50,810 --> 00:32:53,520
are ordered from the
largest to the smallest when
556
00:32:53,520 --> 00:32:56,490
I look at the diagonal
elements of D,
557
00:32:56,490 --> 00:32:59,430
then if I pick the first
column of P, it's v1.
558
00:32:59,430 --> 00:33:04,750
And v1 is the direction on
which, if I project my points,
559
00:33:04,750 --> 00:33:08,090
they are going to carry the
most empirical variance.
560
00:33:08,090 --> 00:33:09,090
Well, that's a good way.
561
00:33:09,090 --> 00:33:13,064
If I told you,
pick one direction
562
00:33:13,064 --> 00:33:14,980
along which if you were
to project your points
563
00:33:14,980 --> 00:33:17,313
they would be as spread out
as possible, that's probably
564
00:33:17,313 --> 00:33:19,270
the one you would pick.
565
00:33:19,270 --> 00:33:22,160
And so that's exactly
what PCA is doing for us.
566
00:33:22,160 --> 00:33:28,780
It says, OK, if you ask me
to take d prime equal to 1,
567
00:33:28,780 --> 00:33:31,510
I will take v1.
568
00:33:31,510 --> 00:33:33,892
I will just take the direction
that's spanned by v1.
569
00:33:33,892 --> 00:33:36,100
And that's just when I come
back to this picture that
570
00:33:36,100 --> 00:33:43,750
was here before, this is v1.
571
00:33:43,750 --> 00:33:45,970
Of course, here, I
only have two of them.
572
00:33:45,970 --> 00:33:48,580
So v2 has to be this
guy, or this guy,
573
00:33:48,580 --> 00:33:49,940
or I mean or this thing.
574
00:33:49,940 --> 00:33:53,060
I mean, I only know
them up to sign.
575
00:33:53,060 --> 00:33:55,600
But then if I have three--
576
00:33:55,600 --> 00:33:58,090
think of like an olive
in three dimensions--
577
00:33:58,090 --> 00:34:00,550
then maybe I have one
direction that's slightly more
578
00:34:00,550 --> 00:34:02,180
elongated than the other one.
579
00:34:02,180 --> 00:34:04,480
And so I'm going to
pick the second one.
580
00:34:04,480 --> 00:34:07,330
And so the procedure is
to say, well, first, I'm
581
00:34:07,330 --> 00:34:11,194
going to pick v1 the same way
I pick v1 in the first place.
582
00:34:11,194 --> 00:34:12,610
So the first
direction I am taking
583
00:34:12,610 --> 00:34:14,620
is the leading eigenvector.
584
00:34:14,620 --> 00:34:18,199
And then I'm looking
for a direction.
585
00:34:18,199 --> 00:34:20,719
Well, if I found
one-- the one I'm
586
00:34:20,719 --> 00:34:23,239
going to want to find-- if you
say you can take d equal 2,
587
00:34:23,239 --> 00:34:24,949
you're going to need
the basis for this guy.
588
00:34:24,949 --> 00:34:27,240
So the second one has to be
orthogonal to the first one
589
00:34:27,240 --> 00:34:28,705
you've already picked.
590
00:34:28,705 --> 00:34:30,080
And so the second
one you pick is
591
00:34:30,080 --> 00:34:31,940
the one that's just,
among all those that
592
00:34:31,940 --> 00:34:36,529
are orthogonal to v1, maximizes
the empirical variance
593
00:34:36,529 --> 00:34:37,570
when you project onto it.
594
00:34:40,100 --> 00:34:44,000
And it turns out that this
is actually exactly v2.
595
00:34:44,000 --> 00:34:46,153
You don't have to
redo anything again.
596
00:34:46,153 --> 00:34:47,569
Your eigendecomposition,
this is
597
00:34:47,569 --> 00:34:54,690
just the second column
of P. Clearly, v2
598
00:34:54,690 --> 00:34:56,120
is orthogonal to v1.
599
00:34:56,120 --> 00:34:58,890
We just used it here.
600
00:34:58,890 --> 00:35:03,730
This 0 here just says this
v2 is orthogonal to v1.
601
00:35:03,730 --> 00:35:05,770
So they're like this.
602
00:35:05,770 --> 00:35:06,940
And now, what I said--
603
00:35:06,940 --> 00:35:08,530
what this slide
tells you extra--
604
00:35:08,530 --> 00:35:10,670
is that v2 among all
those directions that are
605
00:35:10,670 --> 00:35:11,170
orthogonal--
606
00:35:11,170 --> 00:35:13,610
I mean, there's still
d minus 1 of them--
607
00:35:13,610 --> 00:35:16,030
this is the one that
maximizes the, say,
608
00:35:16,030 --> 00:35:18,730
residual empirical
variance-- the one that
609
00:35:18,730 --> 00:35:21,950
was not explained by the first
direction that you picked.
610
00:35:21,950 --> 00:35:22,910
And you can check that.
611
00:35:22,910 --> 00:35:27,200
I mean, it's becoming a bit
more cumbersome to write down,
612
00:35:27,200 --> 00:35:28,760
but you can check that.
613
00:35:28,760 --> 00:35:32,130
If you're not convinced,
please raise your concern.
614
00:35:32,130 --> 00:35:38,641
I mean, basically, one
way to view this is--
615
00:35:38,641 --> 00:35:40,640
I mean, you're not really
dropping a coordinate,
616
00:35:40,640 --> 00:35:42,420
because v1 is not a coordinate.
617
00:35:42,420 --> 00:35:46,040
But let's assume actually for
simplicity that v1 was actually
618
00:35:46,040 --> 00:35:49,730
equal to e1, that the direction
that carries the most variance
619
00:35:49,730 --> 00:35:51,440
is the one that
just says, just look
620
00:35:51,440 --> 00:35:56,520
at the first coordinate of X.
So if that was the case, then
621
00:35:56,520 --> 00:35:58,380
clearly the orthogonal
directions are
622
00:35:58,380 --> 00:36:03,420
the ones that comprise only
of the coordinates 2 to d.
623
00:36:03,420 --> 00:36:05,670
So you could actually just
drop the first coordinate
624
00:36:05,670 --> 00:36:08,460
and do the same thing on
a slightly shorter vector
625
00:36:08,460 --> 00:36:10,129
of length d minus 1.
626
00:36:10,129 --> 00:36:12,420
And then you would just look
at the largest eigenvector
627
00:36:12,420 --> 00:36:14,530
of these guys, et
cetera, et cetera.
628
00:36:14,530 --> 00:36:16,230
So in a way, that's
what's happening,
629
00:36:16,230 --> 00:36:19,200
except that you rotate it
before you actually do this.
630
00:36:19,200 --> 00:36:22,260
And that's exactly
what's happening.
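[The "rotate, then drop the first coordinate" picture is exactly deflation: subtract the lambda 1 component from S and take the leading eigenvector of what's left. A numpy sketch on synthetic data -- up to sign, it recovers v2.]

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

evals, evecs = np.linalg.eigh(S)       # ascending order
v1, v2 = evecs[:, -1], evecs[:, -2]

# Deflate: remove the lambda_1 component of S, then take the
# leading eigenvector of the remainder.
S_deflated = S - evals[-1] * np.outer(v1, v1)
w = np.linalg.eigh(S_deflated)[1][:, -1]

# w matches v2 up to sign, as claimed.
assert np.isclose(abs(w @ v2), 1.0)
```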
631
00:36:22,260 --> 00:36:30,890
So what we put together here
is essentially three things.
632
00:36:30,890 --> 00:36:32,220
One was statistics.
633
00:36:32,220 --> 00:36:34,690
Statistics says, if
you want spread,
634
00:36:34,690 --> 00:36:39,230
if you want information, you
should be looking at variance.
635
00:36:39,230 --> 00:36:40,820
The second one was optimization.
636
00:36:40,820 --> 00:36:44,870
Optimization said, well, if you
want to maximize spread, well,
637
00:36:44,870 --> 00:36:48,260
you have to maximize variance
in a certain direction.
638
00:36:48,260 --> 00:36:51,920
And that means maximizing
over the sphere of vectors
639
00:36:51,920 --> 00:36:54,510
that have unit norm.
640
00:36:54,510 --> 00:36:56,720
And that's an optimization
problem, which actually
641
00:36:56,720 --> 00:36:58,310
turned out to be difficult.
642
00:36:58,310 --> 00:37:00,800
But then the third thing that
we use to solve this problem
643
00:37:00,800 --> 00:37:01,830
was linear algebra.
644
00:37:01,830 --> 00:37:03,410
Linear algebra
said, well, it looks
645
00:37:03,410 --> 00:37:05,450
like it's a difficult
optimization problem.
646
00:37:05,450 --> 00:37:08,410
But it turns out that the
answer comes in almost--
647
00:37:08,410 --> 00:37:11,210
I mean, it's not a closed form,
but those things are so used,
648
00:37:11,210 --> 00:37:12,590
that it's almost a closed form--
649
00:37:12,590 --> 00:37:17,240
says, just pick the
eigenvectors in order
650
00:37:17,240 --> 00:37:20,480
of their associated eigenvalues
from largest to smallest.
651
00:37:23,020 --> 00:37:24,940
And that's why principal
component analysis
652
00:37:24,940 --> 00:37:29,080
has been so popular and has
gained huge amount of traction
653
00:37:29,080 --> 00:37:33,760
since we had computers that were
allowed to compute eigenvalues
654
00:37:33,760 --> 00:37:37,429
and eigenvectors for
matrices of gigantic sizes.
655
00:37:37,429 --> 00:37:38,470
You can actually do that.
656
00:37:38,470 --> 00:37:39,760
If I give you--
657
00:37:39,760 --> 00:37:42,340
I don't know, this Google
video, for example,
658
00:37:42,340 --> 00:37:43,750
is talking about words.
659
00:37:43,750 --> 00:37:45,970
They want to do just the,
say, principal component
660
00:37:45,970 --> 00:37:47,380
analysis of words.
661
00:37:47,380 --> 00:37:50,230
So I give you all the
words in the dictionary.
662
00:37:50,230 --> 00:37:53,500
And-- sorry, well,
you would have
663
00:37:53,500 --> 00:37:55,090
to have a representation
for words,
664
00:37:55,090 --> 00:37:59,500
so it's a little more
difficult. But how do I do this?
665
00:38:03,980 --> 00:38:06,382
Let's say, for example,
pages of a book.
666
00:38:06,382 --> 00:38:08,090
I want to understand
the pages of a book.
667
00:38:08,090 --> 00:38:10,580
And I need to turn
it into a number.
668
00:38:10,580 --> 00:38:13,150
And a page of a book is
basically the word count.
669
00:38:13,150 --> 00:38:15,350
So I just count the number
of times "the" shows up,
670
00:38:15,350 --> 00:38:18,140
the number of times "and"
shows up, number of times "dog"
671
00:38:18,140 --> 00:38:19,100
shows up.
672
00:38:19,100 --> 00:38:20,934
And so that gives me a vector.
673
00:38:20,934 --> 00:38:22,225
It's in pretty high dimensions.
674
00:38:22,225 --> 00:38:25,350
It's as many dimensions as there
are words in the dictionary.
675
00:38:25,350 --> 00:38:28,310
And now, I want to visualize
how those pages get together--
676
00:38:28,310 --> 00:38:30,450
are two pages very
similar or not.
677
00:38:30,450 --> 00:38:32,630
And so what you would
do is essentially
678
00:38:32,630 --> 00:38:35,470
just compute the largest
eigenvector of this matrix--
679
00:38:35,470 --> 00:38:38,925
maybe the two largest-- and
then project this into a plane.
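[A minimal numpy sketch of the pages-of-a-book example: the pages, words, and counts here are made up, but the pipeline -- word counts, center, covariance, project onto the two leading eigenvectors -- is the one just described.]

```python
import numpy as np
from collections import Counter

# Hypothetical "pages": each page becomes a word-count vector,
# with one dimension per word in the vocabulary.
pages = [
    "the dog and the cat",
    "the dog chased the dog",
    "stocks and bonds and stocks",
    "bonds the stocks the bonds",
]
vocab = sorted({w for p in pages for w in p.split()})
X = np.array([[Counter(p.split())[w] for w in vocab] for p in pages], float)

# Center, form S, and project onto the two leading eigenvectors.
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)
evecs = np.linalg.eigh(S)[1]
P2 = evecs[:, -2:]          # top-2 directions as columns
Y = X @ P2                  # each page is now a point in the plane

assert Y.shape == (4, 2)
```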
680
00:38:38,925 --> 00:38:39,425
Yeah.
681
00:38:39,425 --> 00:38:41,325
AUDIENCE: Can we assume
the number of points
682
00:38:41,325 --> 00:38:43,060
was far larger
than the dimension?
683
00:38:43,060 --> 00:38:44,560
PHILIPPE RIGOLLET:
Yeah, but there's
684
00:38:44,560 --> 00:38:46,834
many pages in the world.
685
00:38:46,834 --> 00:38:48,500
There's probably more
pages in the world
686
00:38:48,500 --> 00:38:50,154
than there's words
in the dictionary.
687
00:38:54,960 --> 00:38:57,185
Yeah, so of course, if
you are in high dimensions
688
00:38:57,185 --> 00:38:58,560
and you don't have
enough points,
689
00:38:58,560 --> 00:39:00,240
it's going to be
clearly an issue.
690
00:39:00,240 --> 00:39:03,605
If you have two points,
then the leading eigenvector
691
00:39:03,605 --> 00:39:04,980
is going to be
just the line that
692
00:39:04,980 --> 00:39:06,879
goes through those
two points, regardless
693
00:39:06,879 --> 00:39:07,920
of what the dimension is.
694
00:39:07,920 --> 00:39:09,670
And clearly, you're
not learning anything.
695
00:39:13,850 --> 00:39:16,310
So you have to pick,
say, the k largest one.
696
00:39:16,310 --> 00:39:18,842
If you go all the way, you're
just reordering your thing,
697
00:39:18,842 --> 00:39:20,550
and you're not actually
gaining anything.
698
00:39:20,550 --> 00:39:22,130
You start from d
and you go to d.
699
00:39:22,130 --> 00:39:26,300
So at some point, this
procedure has to stop.
700
00:39:26,300 --> 00:39:28,960
And let's say it stops at k.
701
00:39:28,960 --> 00:39:31,360
Now, of course, you
should ask me a question,
702
00:39:31,360 --> 00:39:34,100
which is, how do you choose k?
703
00:39:34,100 --> 00:39:37,400
So that's, of course,
a natural question.
704
00:39:37,400 --> 00:39:41,360
Probably the basic answer
is just pick k equals 3,
705
00:39:41,360 --> 00:39:43,220
because you can
actually visualize it.
706
00:39:43,220 --> 00:39:47,906
But what happens if I
take k is equal to 4?
707
00:39:47,906 --> 00:39:51,860
If I take k equal
to 4, I'm not going
708
00:39:51,860 --> 00:39:54,070
to be able to plot points
in four dimensions.
709
00:39:54,070 --> 00:39:55,550
Well, I could, I
could add color,
710
00:39:55,550 --> 00:39:57,440
or I could try to be a
little smart about it.
711
00:39:57,440 --> 00:40:00,060
But it's actually
quite difficult.
712
00:40:00,060 --> 00:40:04,420
And so what people tend to do,
if you have four dimensions,
713
00:40:04,420 --> 00:40:06,850
they actually do a bunch
of two dimensional plots.
714
00:40:06,850 --> 00:40:08,920
And that's what a computer does--
a computer is not very good--
715
00:40:08,920 --> 00:40:10,750
I mean, by default,
they don't spit out
716
00:40:10,750 --> 00:40:12,380
three dimensional plots.
717
00:40:12,380 --> 00:40:15,024
So let's say they want to plot
only two dimensional things.
718
00:40:15,024 --> 00:40:17,440
So they're going to take the
first directions of, say, v1,
719
00:40:17,440 --> 00:40:18,586
v2.
720
00:40:18,586 --> 00:40:19,960
Let's say you have
three, but you
721
00:40:19,960 --> 00:40:21,760
want to have only two
dimensional plots.
722
00:40:21,760 --> 00:40:29,660
And then it's going to do
v1, v3; and then v2, v3.
723
00:40:29,660 --> 00:40:31,850
So really, you take
all three of them,
724
00:40:31,850 --> 00:40:35,240
but it's really just
showing you all choices
725
00:40:35,240 --> 00:40:37,340
of pairs of those guys.
726
00:40:37,340 --> 00:40:41,960
So if you were to
keep k is equal to 5,
727
00:40:41,960 --> 00:40:44,450
you would have 5 choose 2--
that is, 10-- different plots.
728
00:40:48,540 --> 00:40:51,930
So this is the actual
principal component algorithm,
729
00:40:51,930 --> 00:40:53,640
how it's implemented.
730
00:40:53,640 --> 00:40:55,000
And it's actually fairly simple.
731
00:40:55,000 --> 00:40:56,430
I mean, it looks like
there's lots of steps.
732
00:40:56,430 --> 00:40:58,600
But really, there's only
one that's important.
733
00:40:58,600 --> 00:40:59,850
So the first one is the input.
734
00:40:59,850 --> 00:41:04,860
I give you a bunch of points,
x1 to xn in d dimensions.
735
00:41:04,860 --> 00:41:07,680
And step two is, well, compute
their empirical covariance
736
00:41:07,680 --> 00:41:10,570
matrix S. The points themselves,
we don't really care.
737
00:41:10,570 --> 00:41:12,570
We care about their
empirical covariance matrix.
738
00:41:12,570 --> 00:41:14,530
So it's a d by d matrix.
739
00:41:14,530 --> 00:41:15,750
Now, I'm going to feed that.
740
00:41:15,750 --> 00:41:17,880
And that's where the actual
computation starts happening.
741
00:41:17,880 --> 00:41:19,796
I'm going to feed that
to something that knows
742
00:41:19,796 --> 00:41:21,090
how to diagonalize this matrix.
743
00:41:21,090 --> 00:41:23,220
And you have to
trust me, if I want
744
00:41:23,220 --> 00:41:25,770
to compute the k
largest eigenvalues
745
00:41:25,770 --> 00:41:27,960
and my matrix is
d by d, it's going
746
00:41:27,960 --> 00:41:32,730
to take me about k times
d squared operations.
747
00:41:32,730 --> 00:41:34,980
So if I want only three,
it's 3 times d squared,
748
00:41:34,980 --> 00:41:36,420
which is about--
749
00:41:36,420 --> 00:41:39,570
d squared is the time for me
it takes to just even read
750
00:41:39,570 --> 00:41:41,040
the matrix S.
751
00:41:41,040 --> 00:41:43,360
So that's not too bad.
752
00:41:43,360 --> 00:41:45,110
So what it's going to
spit out, of course,
753
00:41:45,110 --> 00:41:48,230
is the diagonal matrix
D. And those are nice,
754
00:41:48,230 --> 00:41:53,720
because they tell
me what
755
00:41:53,720 --> 00:41:56,210
is the order in which I should
be taking the columns of P.
756
00:41:56,210 --> 00:41:58,930
But what's really important
to me is v1 to vd,
757
00:41:58,930 --> 00:42:01,430
because those are going to be
the ones I'm going to be using
758
00:42:01,430 --> 00:42:04,250
to draw those plots.
759
00:42:04,250 --> 00:42:05,900
And now, I'm going
to say, OK, I need
760
00:42:05,900 --> 00:42:09,190
to actually choose some set k.
761
00:42:09,190 --> 00:42:11,630
And I'm going to basically
truncate and look
762
00:42:11,630 --> 00:42:16,380
only at the first
k columns of P.
763
00:42:16,380 --> 00:42:18,300
Once I have those
columns, what I
764
00:42:18,300 --> 00:42:20,820
want to do is to project
onto the linear span
765
00:42:20,820 --> 00:42:21,610
of those columns.
766
00:42:21,610 --> 00:42:23,340
And there's actually
a simple way
767
00:42:23,340 --> 00:42:26,940
to do this, which is just take
this matrix Pk, which is really
768
00:42:26,940 --> 00:42:29,460
the matrix of projection onto
the linear span of those k
769
00:42:29,460 --> 00:42:30,120
columns.
770
00:42:30,120 --> 00:42:32,160
And you just take Pk transpose.
771
00:42:32,160 --> 00:42:38,070
And then you apply this to
every single one of your points.
772
00:42:38,070 --> 00:42:42,000
Now Pk transpose, what is
the size of the matrix Pk?
773
00:42:46,410 --> 00:42:47,880
Yeah, [INAUDIBLE]?
774
00:42:47,880 --> 00:42:49,840
AUDIENCE: [INAUDIBLE]
775
00:42:49,840 --> 00:42:52,100
PHILIPPE RIGOLLET: So
Pk is just this matrix.
776
00:42:52,100 --> 00:42:54,601
I take the v1 and I stop at vk--
777
00:42:54,601 --> 00:42:55,100
well--
778
00:42:55,100 --> 00:42:57,656
AUDIENCE: [INAUDIBLE]
779
00:42:57,656 --> 00:42:59,030
PHILIPPE RIGOLLET:
d by k, right?
780
00:42:59,030 --> 00:43:01,290
Each of the column
is an eigenvector.
781
00:43:01,290 --> 00:43:02,840
It's of dimension d.
782
00:43:02,840 --> 00:43:05,730
I mean, that's a vector
in the original space.
783
00:43:05,730 --> 00:43:07,220
So I have this d by k matrix.
784
00:43:07,220 --> 00:43:11,360
So all it is is if I had my--
785
00:43:11,360 --> 00:43:13,970
well, I'm going to talk in
a second about Pk transpose.
786
00:43:13,970 --> 00:43:17,060
Pk transpose is
just this guy, where
787
00:43:17,060 --> 00:43:19,460
I stop at the k-th vector.
788
00:43:19,460 --> 00:43:22,370
So Pk transpose is k by d.
789
00:43:22,370 --> 00:43:26,825
So now, when I take Yi,
which is Pk transpose Xi,
790
00:43:26,825 --> 00:43:29,330
I end up with a point
which is in k dimensions.
791
00:43:29,330 --> 00:43:30,900
I have only k coordinates.
792
00:43:30,900 --> 00:43:33,350
So I took every single one
of my original points Xi,
793
00:43:33,350 --> 00:43:35,780
which had d coordinates, and
I turned it into a point that
794
00:43:35,780 --> 00:43:37,180
has only k coordinates.
795
00:43:37,180 --> 00:43:40,260
Particularly, I could
have k is equal to 2.
796
00:43:40,260 --> 00:43:42,820
This matrix is exactly
the one that projects.
797
00:43:42,820 --> 00:43:44,960
If you think about
it for one second,
798
00:43:44,960 --> 00:43:46,890
this is just the
matrix that says--
799
00:43:46,890 --> 00:43:48,610
well, we actually did
that several times.
800
00:43:48,610 --> 00:43:51,820
The matrix, so that
was this P transpose u
801
00:43:51,820 --> 00:43:53,470
that showed up somewhere.
802
00:43:53,470 --> 00:43:57,460
And so that's just
the matrix that
803
00:43:57,460 --> 00:44:01,030
takes your point X in,
say, three dimensions,
804
00:44:01,030 --> 00:44:04,750
and then just project it
down to two dimensions.
805
00:44:04,750 --> 00:44:09,220
And that's just-- it goes to the
closest point in the subspace.
806
00:44:09,220 --> 00:44:12,650
Now, here, the floor is flat.
807
00:44:12,650 --> 00:44:16,510
But we can pick any
subspace we want,
808
00:44:16,510 --> 00:44:18,310
depending on what
the lambdas are.
809
00:44:18,310 --> 00:44:19,930
So the lambdas were
important for us
810
00:44:19,930 --> 00:44:23,610
to be able to identify
which columns to pick.
811
00:44:23,610 --> 00:44:25,692
The fact that we assumed
that they were ordered
812
00:44:25,692 --> 00:44:27,400
tells us that we can
pick the first ones.
813
00:44:27,400 --> 00:44:28,500
If they were not
ordered, it would
814
00:44:28,500 --> 00:44:30,583
be just a subset of the
columns, depending on what
815
00:44:30,583 --> 00:44:32,550
the size of the eigenvalue is.
816
00:44:32,550 --> 00:44:36,509
So each column is labeled.
817
00:44:36,509 --> 00:44:38,800
And so then, of course, we
still have this question of,
818
00:44:38,800 --> 00:44:40,570
how do I pick k?
819
00:44:40,570 --> 00:44:42,760
So there's definitely the
matter of convenience.
820
00:44:42,760 --> 00:44:44,410
Maybe 2 is convenient.
821
00:44:44,410 --> 00:44:47,180
If it works for 2, you don't
have to go any farther.
822
00:44:47,180 --> 00:44:50,680
But you might want
to say, well--
823
00:44:50,680 --> 00:44:52,690
originally, I did
that to actually keep
824
00:44:52,690 --> 00:44:54,320
as much information as possible.
825
00:44:54,320 --> 00:44:56,230
I know that the
ultimate thing is
826
00:44:56,230 --> 00:44:58,515
to keep as much information,
which would be to k
827
00:44:58,515 --> 00:45:00,970
is equal d-- that's as much
information as you want.
828
00:45:00,970 --> 00:45:03,310
But it's essentially the
same question about, well,
829
00:45:03,310 --> 00:45:07,180
if I want to compress
a JPEG image,
830
00:45:07,180 --> 00:45:10,100
how much information should
I keep so it's still visible?
831
00:45:10,100 --> 00:45:11,840
And so there's some
rules for that.
832
00:45:11,840 --> 00:45:14,950
But none of them is
actually really a science.
833
00:45:14,950 --> 00:45:16,600
So it's really a
matter of what you
834
00:45:16,600 --> 00:45:18,250
think is actually tolerable.
835
00:45:18,250 --> 00:45:21,970
And we're just going to start
replacing this choice by maybe
836
00:45:21,970 --> 00:45:22,900
another parameter.
837
00:45:22,900 --> 00:45:26,440
So here, we're going to
basically replace k by alpha,
838
00:45:26,440 --> 00:45:29,360
and so we just do stuff.
839
00:45:29,360 --> 00:45:32,020
So the first one that
people do that is probably
840
00:45:32,020 --> 00:45:33,750
the most popular one--
841
00:45:33,750 --> 00:45:35,860
OK, the most popular
one is definitely
842
00:45:35,860 --> 00:45:39,190
take k is equal to 2
or 3, because it's just
843
00:45:39,190 --> 00:45:41,320
convenient to visualize.
844
00:45:41,320 --> 00:45:48,050
The second most popular
one is the scree plot.
845
00:45:48,050 --> 00:45:49,370
So the scree plot--
846
00:45:49,370 --> 00:45:54,180
remember, I have my
eigenvalues, lambda j's.
847
00:45:54,180 --> 00:45:57,670
And I've chosen the
lambda j's to decrease.
848
00:45:57,670 --> 00:45:59,380
So the indices are
chosen in such a way
849
00:45:59,380 --> 00:46:01,480
that lambda is a
decreasing function.
850
00:46:01,480 --> 00:46:04,332
So I have lambda 1, and
let's say it's this guy here.
851
00:46:04,332 --> 00:46:06,790
And then I have lambda 2, and
let's say it's this guy here.
852
00:46:06,790 --> 00:46:09,370
And then I have lambda 3, and
let's say it's this guy here,
853
00:46:09,370 --> 00:46:12,760
lambda 4, lambda 5, lambda 6.
854
00:46:12,760 --> 00:46:16,322
And all I care about is
that this thing decreases.
855
00:46:16,322 --> 00:46:19,580
The scree plot says
something like this--
856
00:46:19,580 --> 00:46:22,520
if there's an inflection point,
meaning that you can sort of do
857
00:46:22,520 --> 00:46:25,230
something like this and
then something like this,
858
00:46:25,230 --> 00:46:27,610
you should stop at 3.
859
00:46:27,610 --> 00:46:29,500
That's what the
scree plot tells you.
860
00:46:29,500 --> 00:46:34,590
What it's saying in a way
is that the percentage
861
00:46:34,590 --> 00:46:39,170
of the marginal
increment of explained
862
00:46:39,170 --> 00:46:41,990
variance that you get
starts to decrease after you
863
00:46:41,990 --> 00:46:43,555
pass this inflection point.
864
00:46:43,555 --> 00:46:45,840
So let's see why I say this.
865
00:46:45,840 --> 00:46:52,390
Well, here, what I
have-- so this ratio
866
00:46:52,390 --> 00:46:54,280
that you see there is
actually the percentage
867
00:46:54,280 --> 00:46:56,470
of explained variance.
868
00:46:56,470 --> 00:47:01,590
So what it means is that, if I
look at lambda 1 plus ... plus lambda k,
869
00:47:01,590 --> 00:47:08,260
and then I divide by lambda
1 plus ... plus lambda d, well,
870
00:47:08,260 --> 00:47:08,980
what is this?
871
00:47:08,980 --> 00:47:12,010
Well, this lambda
1 plus ... plus lambda d
872
00:47:12,010 --> 00:47:14,530
is the total amount of variance
that I get in my points.
873
00:47:14,530 --> 00:47:18,070
That's the trace of sigma.
874
00:47:18,070 --> 00:47:20,640
So that's the variance
in the first direction
875
00:47:20,640 --> 00:47:22,420
plus the variance in
the second direction
876
00:47:22,420 --> 00:47:24,280
plus the variance in
the third direction.
877
00:47:24,280 --> 00:47:26,571
That's basically all the
variance that I have possible.
878
00:47:28,900 --> 00:47:32,175
Now, this is the variance that
I kept in the first direction.
879
00:47:32,175 --> 00:47:34,550
This is the variance that I
kept in the second direction,
880
00:47:34,550 --> 00:47:37,190
all the way to the variance that
I kept in the k-th direction.
881
00:47:37,190 --> 00:47:41,800
So I know that this number is
always less than or equal to 1.
882
00:47:41,800 --> 00:47:43,540
And it's larger than 0.
883
00:47:43,540 --> 00:47:48,500
And this is just
the proportion, say,
884
00:47:48,500 --> 00:47:59,520
of variance explained
by v1 to vk,
885
00:47:59,520 --> 00:48:03,720
or simply, the proportion of
explained variance by my PCA,
886
00:48:03,720 --> 00:48:05,720
say.
887
00:48:05,720 --> 00:48:07,550
So now, what this
thing is telling me,
888
00:48:07,550 --> 00:48:09,860
it says, well, if
I look at this thing
889
00:48:09,860 --> 00:48:13,050
and I start seeing this
inflection point, it's saying,
890
00:48:13,050 --> 00:48:16,400
oh, here, you're gaining
a lot of variance.
891
00:48:16,400 --> 00:48:19,090
And then at some point,
you stop gaining a lot
892
00:48:19,090 --> 00:48:21,820
in your proportion of
explained variance.
893
00:48:21,820 --> 00:48:23,870
So this will
translate into something
894
00:48:23,870 --> 00:48:28,490
where when I look at this ratio,
lambda 1 plus ... plus lambda k divided
895
00:48:28,490 --> 00:48:31,490
by lambda 1 plus ... plus
lambda d, this would
896
00:48:31,490 --> 00:48:34,195
translate into a function
that would look like this.
897
00:48:34,195 --> 00:48:36,320
And what it's telling you,
it says, well, maybe you
898
00:48:36,320 --> 00:48:38,570
should stop here, because
here every time you add one,
899
00:48:38,570 --> 00:48:40,520
you don't get as much
as you did before.
900
00:48:40,520 --> 00:48:43,700
You actually get like
smaller marginal returns.
901
00:48:50,910 --> 00:48:56,630
So explained variance is
the numerator of this ratio.
902
00:48:56,630 --> 00:48:58,430
And the total variance
is the denominator.
903
00:48:58,430 --> 00:49:01,010
Those are pretty
straightforward terms
904
00:49:01,010 --> 00:49:03,320
that you would want
to use for this.
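The explained-variance curve and a crude elbow rule can be computed directly from the sorted eigenvalues. Below is a minimal sketch in numpy; the eigenvalues and the 0.05 cutoff for the marginal gain are invented for illustration, not values from the lecture.

```python
import numpy as np

# Hypothetical eigenvalues of a sample covariance matrix,
# sorted so that lambda_1 >= lambda_2 >= ... >= lambda_d.
lambdas = np.array([5.0, 3.0, 1.5, 0.2, 0.15, 0.1, 0.05])

# Proportion of explained variance for each k:
# (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_d).
# The denominator is the total variance, i.e. the trace of S.
explained = np.cumsum(lambdas) / lambdas.sum()

# A crude "elbow" rule for the scree plot: keep components until the
# marginal gain lambda_k / trace(S) drops below a chosen threshold.
marginal = lambdas / lambdas.sum()
k = int(np.argmax(marginal < 0.05))  # first index with a small marginal gain
```

Here `explained` is the curve sketched on the board, and `k` is where this particular elbow rule would stop.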
905
00:49:03,320 --> 00:49:06,620
So if your goal is to
do data visualization--
906
00:49:06,620 --> 00:49:10,100
so why would you
take k larger than 2?
907
00:49:10,100 --> 00:49:11,750
Let's say, if you
take k larger than 6,
908
00:49:11,750 --> 00:49:12,906
you can start to
imagine that you're
909
00:49:12,906 --> 00:49:15,364
going to have 6 choose 2 plots,
which starts to be annoying.
910
00:49:15,364 --> 00:49:16,850
And if you have k
is equal to 10--
911
00:49:16,850 --> 00:49:19,310
because you could start
in dimension 50,000--
912
00:49:19,310 --> 00:49:21,080
and then k equal to
10 would be the place
913
00:49:21,080 --> 00:49:22,780
where you have this thing
that's a lot of plots
914
00:49:22,780 --> 00:49:23,960
that you would have to show.
915
00:49:23,960 --> 00:49:26,900
So it's not always for
data visualization.
916
00:49:26,900 --> 00:49:29,540
Once I've actually
done this, I've
917
00:49:29,540 --> 00:49:32,460
actually effectively reduced
the dimension of my problem.
918
00:49:32,460 --> 00:49:34,230
And what I could do
with what I have is
919
00:49:34,230 --> 00:49:36,080
do a regression on those guys.
920
00:49:36,080 --> 00:49:39,010
The v1-- so I
forgot to tell you--
921
00:49:39,010 --> 00:49:41,460
why is that called principal
component analysis?
922
00:49:41,460 --> 00:49:46,910
Well, the vj's that
I keep, v1 to vk
923
00:49:46,910 --> 00:49:51,932
are called principal components.
924
00:49:59,020 --> 00:50:04,690
And they effectively act
as the summary of my Xi's.
925
00:50:04,690 --> 00:50:06,850
When I mentioned
image compression,
926
00:50:06,850 --> 00:50:10,840
I started with a point
Xi that was d numbers--
927
00:50:10,840 --> 00:50:12,604
let's say 50,000 numbers.
928
00:50:12,604 --> 00:50:14,020
And now, I'm saying,
actually, you
929
00:50:14,020 --> 00:50:16,270
can throw out those
50,000 numbers.
930
00:50:16,270 --> 00:50:19,390
If you actually know only
the k numbers that you need--
931
00:50:19,390 --> 00:50:20,860
the 6 numbers that you need--
932
00:50:20,860 --> 00:50:22,318
you're going to
have something that
933
00:50:22,318 --> 00:50:24,820
is pretty close to the
information you had.
934
00:50:24,820 --> 00:50:26,736
So in a way, there is
some form of compression
935
00:50:26,736 --> 00:50:27,810
that's going on here.
936
00:50:27,810 --> 00:50:31,150
And what you can do is that
those principal components,
937
00:50:31,150 --> 00:50:34,120
you can actually use
now for regression.
938
00:50:34,120 --> 00:50:39,130
If I want to regress
Y onto X that's
939
00:50:39,130 --> 00:50:41,862
very high dimensional,
before I do this,
940
00:50:41,862 --> 00:50:44,320
if I don't have enough points,
maybe what I can actually do
941
00:50:44,320 --> 00:50:47,780
is to do principal
component analysis
942
00:50:47,780 --> 00:50:49,510
on my
X's, replace them
943
00:50:49,510 --> 00:50:52,150
by those compressed versions,
and do linear regression
944
00:50:52,150 --> 00:50:53,020
on those guys.
945
00:50:53,020 --> 00:50:55,330
And that's called principal
component regression,
946
00:50:55,330 --> 00:50:56,039
not surprisingly.
947
00:50:56,039 --> 00:50:57,830
And that's something
that's pretty popular.
948
00:50:57,830 --> 00:51:00,086
And you can do it with k
equal to 10, for example.
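Principal component regression can be sketched in plain numpy: compress the X_i's to their top-k PCA scores, then run least squares on the scores. The data, dimensions, and noise level below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in R^d, with both the extra variance and the
# signal placed along the first coordinate (an artificial setup).
n, d, k = 200, 30, 5
X = rng.standard_normal((n, d))
X[:, 0] *= 3.0
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)

# Principal component regression:
# 1. center, 2. top-k eigenvectors of the sample covariance,
# 3. replace each X_i by its k scores, 4. least squares on the scores.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                     # sample covariance matrix
_, eigvecs = np.linalg.eigh(S)        # eigh returns ascending order
V = eigvecs[:, ::-1][:, :k]           # top-k principal components
scores = Xc @ V                       # the compressed X_i's
theta, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
y_hat = scores @ theta + y.mean()
mse = float(np.mean((y - y_hat) ** 2))
```

Because the first principal component picks up the high-variance first coordinate, the regression on 5 scores recovers most of the signal even though d is 30.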
949
00:51:03,020 --> 00:51:07,640
So for data visualization, I did
not find a Thanksgiving themed
950
00:51:07,640 --> 00:51:08,270
picture.
951
00:51:08,270 --> 00:51:11,960
But I found one that
has turkey in it.
952
00:51:11,960 --> 00:51:12,460
Get it?
953
00:51:15,310 --> 00:51:21,820
So this is actually a
gene data set that was--
954
00:51:21,820 --> 00:51:24,190
so when you see
something like this,
955
00:51:24,190 --> 00:51:27,056
you can imagine that someone
has been preprocessing
956
00:51:27,056 --> 00:51:28,180
the hell out of this thing.
957
00:51:28,180 --> 00:51:30,820
This is not like, oh, I
collect data on 23andMe
958
00:51:30,820 --> 00:51:32,670
and I'm just going
to run PCA on this.
959
00:51:32,670 --> 00:51:34,730
It just doesn't
happen like that.
960
00:51:34,730 --> 00:51:38,740
And so what happened is that--
so let's assume that this was
961
00:51:38,740 --> 00:51:41,560
a bunch of preprocessed data,
which are gene expression
962
00:51:41,560 --> 00:51:42,550
levels--
963
00:51:42,550 --> 00:51:47,650
so 500,000 genes
among 1,400 Europeans.
964
00:51:47,650 --> 00:51:50,260
So here, I actually
have fewer observations
965
00:51:50,260 --> 00:51:52,180
than I have variables.
966
00:51:52,180 --> 00:51:54,880
And that's when you use
principal component regression
967
00:51:54,880 --> 00:51:57,460
most of the time, so
it doesn't stop you.
968
00:51:57,460 --> 00:52:01,480
And then what you do is you say,
OK, I have those 500,000 genes
969
00:52:01,480 --> 00:52:03,640
among--
970
00:52:03,640 --> 00:52:06,760
so here, that means that
there's 1,400 points here.
971
00:52:06,760 --> 00:52:09,760
And I actually take
those 500,000 directions.
972
00:52:09,760 --> 00:52:13,347
So each person has a vector
of, say, 500,000 genes
973
00:52:13,347 --> 00:52:14,430
that are attached to them.
974
00:52:14,430 --> 00:52:17,020
And I project them onto
two dimensions, which
975
00:52:17,020 --> 00:52:19,380
should be extremely lossy.
976
00:52:19,380 --> 00:52:21,040
I lose a lot of information.
977
00:52:21,040 --> 00:52:24,790
And indeed, I do, because
I'm one of these guys.
978
00:52:24,790 --> 00:52:27,350
And I'm pretty sure I'm very
different from this guy,
979
00:52:27,350 --> 00:52:30,070
even though probably from
an American perspective,
980
00:52:30,070 --> 00:52:31,970
we're all the same.
981
00:52:31,970 --> 00:52:35,690
But I think we have like
slightly different genomes.
982
00:52:35,690 --> 00:52:39,220
And so the thing is
now we have this--
983
00:52:39,220 --> 00:52:41,980
so you see there's lots of
Swiss that participate in this.
984
00:52:41,980 --> 00:52:43,900
But actually, those two
principal components
985
00:52:43,900 --> 00:52:46,210
recover sort of
the map of Europe.
986
00:52:46,210 --> 00:52:50,169
I mean, OK, again, this is
actually maybe fine-grained
987
00:52:50,169 --> 00:52:50,710
for you guys.
988
00:52:50,710 --> 00:52:52,810
But right here, there's
Portugal and Spain,
989
00:52:52,810 --> 00:52:54,430
which are those colors.
990
00:52:54,430 --> 00:52:55,450
So here is color-coded.
991
00:52:55,450 --> 00:52:58,510
And here is Turkey, of
course, which we know
992
00:52:58,510 --> 00:53:02,230
has very different genomes.
993
00:53:02,230 --> 00:53:04,850
So Turks are very
at the boundary.
994
00:53:04,850 --> 00:53:06,100
So you can see all the greens.
995
00:53:06,100 --> 00:53:08,560
They stay very far apart
from everything else.
996
00:53:08,560 --> 00:53:11,080
And then the rest
here is pretty mixed.
997
00:53:11,080 --> 00:53:13,430
But it sort of recovers--
if you look at the colors,
998
00:53:13,430 --> 00:53:14,500
it sort of recovers that.
999
00:53:14,500 --> 00:53:16,390
So in a way, those two
principal components
1000
00:53:16,390 --> 00:53:18,050
are just the geographic feature.
1001
00:53:18,050 --> 00:53:25,570
So if you insist to compress
all the genomic information
1002
00:53:25,570 --> 00:53:28,330
of these people into two
numbers, what you're actually
1003
00:53:28,330 --> 00:53:31,320
going to get is
longitude and latitude,
1004
00:53:31,320 --> 00:53:35,550
which is somewhat
surprising, but not
1005
00:53:35,550 --> 00:53:37,740
so much if you think that's
it's been preprocessed.
1006
00:53:43,120 --> 00:53:47,530
So what do you do
beyond practice?
1007
00:53:47,530 --> 00:53:50,780
Well, you could try to
actually study those things.
1008
00:53:50,780 --> 00:53:52,330
If you think about
it for a second,
1009
00:53:52,330 --> 00:53:54,880
we did not do any statistics.
1010
00:53:54,880 --> 00:53:57,460
I talked to you about
IID observations,
1011
00:53:57,460 --> 00:53:59,950
but we never used the fact
that they were independent.
1012
00:53:59,950 --> 00:54:01,491
The way we typically
use independence
1013
00:54:01,491 --> 00:54:04,270
is to have central
limit theorem, maybe.
1014
00:54:04,270 --> 00:54:06,640
I mentioned the fact that
the covariance, in the case of the
1015
00:54:06,640 --> 00:54:09,520
Gaussian would actually give me
something which is independent.
1016
00:54:09,520 --> 00:54:10,870
We didn't care.
1017
00:54:10,870 --> 00:54:16,280
This was a data analysis, data
mining process that we did.
1018
00:54:16,280 --> 00:54:19,280
I give you points, and you just
put them through the crank.
1019
00:54:19,280 --> 00:54:21,350
There was an algorithm
in six steps.
1020
00:54:21,350 --> 00:54:23,750
And you just put it through
and that's what you got.
1021
00:54:23,750 --> 00:54:26,940
Now, of course, there's some
work which studies says, OK,
1022
00:54:26,940 --> 00:54:30,440
if my data is actually generated
from some process-- maybe,
1023
00:54:30,440 --> 00:54:33,050
my points are multivariate
Gaussian with some structure
1024
00:54:33,050 --> 00:54:34,520
on the covariance--
1025
00:54:34,520 --> 00:54:37,010
how well am I recovering
the covariance structure?
1026
00:54:37,010 --> 00:54:38,990
And that's where
statistics kicks in.
1027
00:54:38,990 --> 00:54:41,390
And that's where we stop.
1028
00:54:41,390 --> 00:54:44,730
So this is actually a bit
more difficult to study.
1029
00:54:44,730 --> 00:54:48,250
But in a way, it's not
entirely satisfactory,
1030
00:54:48,250 --> 00:54:50,320
because we could work
for a couple of boards
1031
00:54:50,320 --> 00:54:53,470
and I would just basically
sort of reverse engineer this
1032
00:54:53,470 --> 00:54:57,457
and find some models under which
it's a good idea to do that.
1033
00:54:57,457 --> 00:54:58,540
And what are those models?
1034
00:54:58,540 --> 00:55:01,450
Well, those are the models
that sort of give you
1035
00:55:01,450 --> 00:55:03,911
sort of prominent directions
that you want to find.
1036
00:55:03,911 --> 00:55:06,160
And it will say, yes, if you
have enough observations,
1037
00:55:06,160 --> 00:55:08,260
you will find those
directions along which
1038
00:55:08,260 --> 00:55:10,150
your data is elongated.
1039
00:55:10,150 --> 00:55:14,890
So that's essentially
what you want to do.
1040
00:55:14,890 --> 00:55:20,660
So that's exactly what
this thing is telling you.
1041
00:55:20,660 --> 00:55:23,010
So where does the
statistics come in?
1042
00:55:23,010 --> 00:55:26,020
Well, everything, remember--
so actually that's
1043
00:55:26,020 --> 00:55:28,490
where Alana was confused--
the idea was to say, well,
1044
00:55:28,490 --> 00:55:32,590
if I have a true
covariance matrix sigma
1045
00:55:32,590 --> 00:55:34,540
and I never really
have access to it,
1046
00:55:34,540 --> 00:55:38,870
I'm just running PCA on the
empirical covariance matrix,
1047
00:55:38,870 --> 00:55:41,380
how do those results relate?
1048
00:55:41,380 --> 00:55:44,270
And this is something
that you can study.
1049
00:55:44,270 --> 00:55:47,530
So for example, if
n goes to infinity
1050
00:55:47,530 --> 00:55:55,840
and d, your
dimension, is fixed,
1051
00:55:55,840 --> 00:56:00,370
then S goes to sigma
in any sense you want.
1052
00:56:00,370 --> 00:56:02,860
Maybe each entry is going
to each entry of sigma,
1053
00:56:02,860 --> 00:56:03,730
for example.
1054
00:56:03,730 --> 00:56:04,840
So S is a good estimator.
1055
00:56:04,840 --> 00:56:06,381
We know that the
empirical covariance
1056
00:56:06,381 --> 00:56:07,600
is a consistent estimator.
1057
00:56:07,600 --> 00:56:10,230
And if d is fixed, this
is actually not an issue.
1058
00:56:10,230 --> 00:56:14,450
So in particular, if you run
PCA on the sample covariance
1059
00:56:14,450 --> 00:56:16,150
matrix, you look
at, say, v1, then
1060
00:56:16,150 --> 00:56:20,140
v1 is going to converge to the
largest eigenvector of sigma
1061
00:56:20,140 --> 00:56:23,990
as n goes to infinity,
but for d fixed.
1062
00:56:23,990 --> 00:56:27,960
And that's a story that
we've known since the '60s.
1063
00:56:27,960 --> 00:56:30,906
More recently, people have
started challenging this.
1064
00:56:30,906 --> 00:56:33,030
Because what's happening
when you fix the dimension
1065
00:56:33,030 --> 00:56:35,310
and let the sample
size go to infinity,
1066
00:56:35,310 --> 00:56:38,961
you're certainly not
allowing for this.
1067
00:56:38,961 --> 00:56:41,460
It's certainly not explaining
to you anything about the fact
1068
00:56:41,460 --> 00:56:44,512
when d is equal to 500,000
and n is equal to 1,400.
1069
00:56:44,512 --> 00:56:46,470
Because when d is fixed
and n goes to infinity,
1070
00:56:46,470 --> 00:56:48,660
in particular, n is
much larger than d,
1071
00:56:48,660 --> 00:56:50,280
which is not the case here.
1072
00:56:50,280 --> 00:56:53,610
And so when n is much larger
than d, things go well.
1073
00:56:53,610 --> 00:56:57,430
But if d is not much less than n,
it's not clear what happens.
1074
00:56:57,430 --> 00:57:01,540
And particularly, if d is of the
order of n, what's happening?
1075
00:57:01,540 --> 00:57:04,320
So there's an entire theory
in mathematics that's called
1076
00:57:04,320 --> 00:57:07,890
random matrix theory that
studies the behavior of exactly
1077
00:57:07,890 --> 00:57:10,770
this question-- what is the
behavior of the spectrum--
1078
00:57:10,770 --> 00:57:13,020
the eigenvalues
and eigenvectors--
1079
00:57:13,020 --> 00:57:16,470
of a matrix in which I put
random numbers and I let--
1080
00:57:16,470 --> 00:57:19,710
so the matrix I'm interested
in here is the matrix of X's.
1081
00:57:19,710 --> 00:57:21,830
When I stack all my
X's next to each other,
1082
00:57:21,830 --> 00:57:26,940
so that's a matrix of size,
say, d by n, so each column
1083
00:57:26,940 --> 00:57:28,890
is of size d, it's one person.
1084
00:57:28,890 --> 00:57:29,880
And so I put them.
1085
00:57:29,880 --> 00:57:31,790
And when I let the
matrix go to infinity,
1086
00:57:31,790 --> 00:57:33,920
I let both d and n go to infinity.
1087
00:57:33,920 --> 00:57:37,260
But I want the aspect ratio,
d/n, to go to some constant.
1088
00:57:37,260 --> 00:57:38,940
That's what they do.
1089
00:57:38,940 --> 00:57:41,730
And what's nice is that in the
end, you have this constant--
1090
00:57:41,730 --> 00:57:42,840
let's call it gamma--
1091
00:57:42,840 --> 00:57:44,550
that shows up in
all the asymptotics.
1092
00:57:44,550 --> 00:57:46,680
And then you can
replace it by d/n.
1093
00:57:46,680 --> 00:57:50,520
And you know that you still have
a handle of both the dimension
1094
00:57:50,520 --> 00:57:51,360
and the sample size.
1095
00:57:51,360 --> 00:57:54,020
Whereas, usually the dimension
goes away, as you let n
1096
00:57:54,020 --> 00:57:57,370
go to infinity without having
dimension going to infinity.
1097
00:57:57,370 --> 00:57:59,400
And so now, when
this happens, as soon
1098
00:57:59,400 --> 00:58:01,920
as d/n goes to a
constant, you can
1099
00:58:01,920 --> 00:58:07,380
show that essentially there's
an angle between the largest
1100
00:58:07,380 --> 00:58:14,460
eigenvector of sigma and the
largest eigenvector of S, as n
1101
00:58:14,460 --> 00:58:15,460
and d go to infinity.
1102
00:58:15,460 --> 00:58:17,251
There is always an
angle-- you can actually
1103
00:58:17,251 --> 00:58:18,930
write it explicitly.
1104
00:58:18,930 --> 00:58:22,240
And it's an angle that
depends on this ratio, gamma--
1105
00:58:22,240 --> 00:58:24,840
the asymptotic ratio of d/n.
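This angle is easy to see in simulation. The sketch below samples from a spiked covariance (the identity plus a rank-one spike along e1, with invented sizes) and compares the leading eigenvector of S to the true one for a small and a large aspect ratio d/n; it illustrates the phenomenon, not the exact random-matrix formula.

```python
import numpy as np

rng = np.random.default_rng(1)

def leading_angle(n, d, spike=2.0):
    # Sample n points from N(0, Sigma) with Sigma = I + spike * e1 e1^T,
    # then return the angle in degrees between the leading eigenvector
    # of the empirical covariance S and the true one, e1.
    scale = np.ones(d)
    scale[0] = np.sqrt(1.0 + spike)
    X = rng.standard_normal((n, d)) * scale
    S = X.T @ X / n
    _, vecs = np.linalg.eigh(S)       # ascending order
    v1 = vecs[:, -1]                  # leading empirical eigenvector
    cos = min(abs(float(v1[0])), 1.0) # |<v1, e1>|, since e1 is the truth
    return float(np.degrees(np.arccos(cos)))

small_gamma = leading_angle(n=2000, d=20)   # gamma = d/n = 0.01
large_gamma = leading_angle(n=200, d=400)   # gamma = d/n = 2
```

With a small aspect ratio the empirical eigenvector is nearly aligned with the truth; with gamma around 2 the angle is substantial even though n is the same order as d.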
1106
00:58:24,840 --> 00:58:29,392
And so there's been a lot of
understanding how to correct,
1107
00:58:29,392 --> 00:58:30,600
how to pay attention to this.
1108
00:58:30,600 --> 00:58:34,320
This creates some biases that
were sort of overlooked before.
1109
00:58:34,320 --> 00:58:37,470
In particular, when
I do this, this
1110
00:58:37,470 --> 00:58:40,490
is not the proportion
of explained variance,
1111
00:58:40,490 --> 00:58:42,940
when n and d are similar.
1112
00:58:42,940 --> 00:58:44,940
This is an estimated
number computed from S.
1113
00:58:44,940 --> 00:58:48,030
This is computed from S. All
these guys are computed from S.
1114
00:58:48,030 --> 00:58:49,830
So those are
actually not exactly
1115
00:58:49,830 --> 00:58:51,060
where you want them to be.
1116
00:58:51,060 --> 00:58:54,510
And there's some nice work that
allows you to recalibrate what
1117
00:58:54,510 --> 00:58:57,626
this ratio should be, how
this ratio should be computed,
1118
00:58:57,626 --> 00:58:59,250
so it's a better
representative of what
1119
00:58:59,250 --> 00:59:04,680
the proportion of explained
variance actually is.
1120
00:59:04,680 --> 00:59:07,470
So then, of course,
there's the question
1121
00:59:07,470 --> 00:59:09,870
of-- so that's when d/n
goes to some constant.
1122
00:59:09,870 --> 00:59:12,105
So the best case--
so that was the '60s--
1123
00:59:12,105 --> 00:59:15,040
d is fixed and n is
much larger than d.
1124
00:59:15,040 --> 00:59:18,310
And then random matrix theory
tells you, well, d and n
1125
00:59:18,310 --> 00:59:20,680
are sort of the same
order of magnitude.
1126
00:59:20,680 --> 00:59:23,620
When they go to infinity, the
ratio goes to some constant.
1127
00:59:23,620 --> 00:59:25,270
Think of it as being order 1.
1128
00:59:25,270 --> 00:59:30,440
To be fair, if d is 100 times
larger than n, it still works.
1129
00:59:30,440 --> 00:59:32,440
And it depends on
what you think what
1130
00:59:32,440 --> 00:59:33,910
the infinity is at this point.
1131
00:59:33,910 --> 00:59:37,880
But I think the random matrix
theory results are very useful.
1132
00:59:37,880 --> 00:59:39,880
But then even in
this case, I told you
1133
00:59:39,880 --> 00:59:42,460
that the leading
eigenvector of S
1134
00:59:42,460 --> 00:59:48,812
is actually at an angle from the
leading eigenvector of--
1135
00:59:48,812 --> 00:59:50,020
So what's happening is that--
1136
00:59:56,970 --> 01:00:01,320
so let's say that d/n
goes to some gamma.
1137
01:00:01,320 --> 01:00:04,470
And what I claim is
that, if you look at--
1138
01:00:04,470 --> 01:00:09,130
so that's v1, that's the v1 of
S. And then there's the v1 of--
1139
01:00:09,130 --> 01:00:11,760
so this should be of size 1.
1140
01:00:11,760 --> 01:00:13,096
So that's the v1 of sigma.
1141
01:00:13,096 --> 01:00:15,220
Then those things are going
to have an angle, which
1142
01:00:15,220 --> 01:00:16,629
is some function of gamma.
1143
01:00:16,629 --> 01:00:18,670
It's complicated, but
there's a function of gamma
1144
01:00:18,670 --> 01:00:19,628
that you can see there.
1145
01:00:19,628 --> 01:00:21,830
And there's some models.
1146
01:00:21,830 --> 01:00:24,620
When gamma goes
to infinity, which
1147
01:00:24,620 --> 01:00:27,800
means that d is now
much larger than n,
1148
01:00:27,800 --> 01:00:30,860
this angle is 90
degrees, which means
1149
01:00:30,860 --> 01:00:32,798
that you're getting nothing.
1150
01:00:32,798 --> 01:00:33,796
Yeah.
1151
01:00:33,796 --> 01:00:37,289
AUDIENCE: If d is not
on your lower plane,
1152
01:00:37,289 --> 01:00:40,782
so like gamma is 0,
is there still angle?
1153
01:00:40,782 --> 01:00:43,780
PHILIPPE RIGOLLET: No,
but that's consistent--
1154
01:00:43,780 --> 01:00:45,659
the fact that it's
consistent when--
1155
01:00:45,659 --> 01:00:46,825
so the angle is a function--
1156
01:00:46,825 --> 01:00:49,605
AUDIENCE: d is not a
constant [INAUDIBLE]?
1157
01:00:52,599 --> 01:00:54,600
PHILIPPE RIGOLLET:
d is not a constant?
1158
01:00:54,600 --> 01:00:57,090
So if d is little o of n?
1159
01:00:57,090 --> 01:00:59,985
Then gamma goes to 0 and
f of gamma goes to 0.
1160
01:00:59,985 --> 01:01:02,490
So f of gamma is
a function that--
1161
01:01:02,490 --> 01:01:05,200
so for example, if f of gamma--
1162
01:01:05,200 --> 01:01:08,960
this is the sine of the
angle, for example--
1163
01:01:08,960 --> 01:01:11,840
then it's a function that starts
at 0, and that goes like this.
1164
01:01:15,340 --> 01:01:18,120
But as soon as gamma is
positive, it goes away from 0.
1165
01:01:20,650 --> 01:01:24,517
So now when gamma
goes to infinity,
1166
01:01:24,517 --> 01:01:26,350
then this thing goes
to a right angle, which
1167
01:01:26,350 --> 01:01:27,516
means I'm getting just junk.
1168
01:01:27,516 --> 01:01:29,210
So this is not my
leading eigenvector.
1169
01:01:29,210 --> 01:01:31,160
So how do you do this?
1170
01:01:31,160 --> 01:01:33,850
Well, just like
everywhere in statistics,
1171
01:01:33,850 --> 01:01:35,500
you have to just make
more assumptions.
1172
01:01:35,500 --> 01:01:36,916
You have to assume
that you're not
1173
01:01:36,916 --> 01:01:39,220
looking for the leading
eigenvector or the direction
1174
01:01:39,220 --> 01:01:40,610
that carries the most variance.
1175
01:01:40,610 --> 01:01:42,830
But you're looking, maybe,
for a special direction.
1176
01:01:42,830 --> 01:01:44,910
And that's what
sparse PCA is doing.
1177
01:01:44,910 --> 01:01:48,610
Sparse PCA is saying, I'm not
looking for any direction u
1178
01:01:48,610 --> 01:01:50,290
that carries the most variance.
1179
01:01:50,290 --> 01:01:54,070
I'm only looking for a
direction u that is sparse.
1180
01:01:54,070 --> 01:01:58,460
Think of it, for example, as
having 10 non-zero coordinates.
1181
01:01:58,460 --> 01:02:02,050
So that's a lot of
directions still to look for.
1182
01:02:02,050 --> 01:02:05,560
But once you do this,
then you actually
1183
01:02:05,560 --> 01:02:07,060
have not only--
there's a few things
1184
01:02:07,060 --> 01:02:08,930
that actually you
get from doing this.
1185
01:02:08,930 --> 01:02:12,160
The first one is you
actually essentially replace
1186
01:02:12,160 --> 01:02:15,660
d by k, which means
that n now just--
1187
01:02:15,660 --> 01:02:18,480
I'm sorry, let's say s
non-zero coefficients.
1188
01:02:18,480 --> 01:02:21,420
You replace d by s,
which means that n only
1189
01:02:21,420 --> 01:02:24,740
has to be much larger than S
for this thing to actually work.
1190
01:02:24,740 --> 01:02:26,760
Now, of course, you've
set your goal weaker.
1191
01:02:26,760 --> 01:02:28,830
Your goal is not to
find any direction, only
1192
01:02:28,830 --> 01:02:30,360
a sparse direction.
1193
01:02:30,360 --> 01:02:31,830
But there's something
very valuable
1194
01:02:31,830 --> 01:02:33,746
about sparse directions,
is that they actually
1195
01:02:33,746 --> 01:02:35,310
are interpretable.
1196
01:02:35,310 --> 01:02:37,810
When I found the v--
1197
01:02:37,810 --> 01:02:40,230
let's say that the v
that I found before
1198
01:02:40,230 --> 01:02:48,390
was 0.2, and then 0.9, and
then 1.1 minus 3, et cetera.
1199
01:02:48,390 --> 01:02:51,570
So that was the coordinates
of my leading eigenvector
1200
01:02:51,570 --> 01:02:54,410
in the original
coordinate system.
1201
01:02:54,410 --> 01:02:55,160
What does it mean?
1202
01:02:55,160 --> 01:02:57,140
Well, it means that if
I see a large number,
1203
01:02:57,140 --> 01:03:01,610
that means that this
v is very close--
1204
01:03:01,610 --> 01:03:03,830
so that's my original
coordinate system.
1205
01:03:03,830 --> 01:03:05,330
Let's call it e1 and e2.
1206
01:03:05,330 --> 01:03:09,230
So that's just 1,
0; and then 0, 1.
1207
01:03:09,230 --> 01:03:11,170
Then clearly, from
the coordinates of v,
1208
01:03:11,170 --> 01:03:13,550
I can tell if my v is like
this, or it's like this,
1209
01:03:13,550 --> 01:03:15,610
or it's like this.
1210
01:03:15,610 --> 01:03:18,330
Well, I mean, they should
all be of the same size.
1211
01:03:18,330 --> 01:03:20,590
So I can tell if
it's here or here
1212
01:03:20,590 --> 01:03:24,739
or here, depending
on-- like here,
1213
01:03:24,739 --> 01:03:26,280
that means I'm going
to see something
1214
01:03:26,280 --> 01:03:29,090
where the Y-coordinate is much
larger than the X-coordinate.
1215
01:03:29,090 --> 01:03:30,960
Here, I'm going to see something
where the X-coordinate is much
1216
01:03:30,960 --> 01:03:32,370
larger than the Y-coordinate.
1217
01:03:32,370 --> 01:03:33,480
And here, I'm going
to see something
1218
01:03:33,480 --> 01:03:35,354
where the X-coordinate
is about the same size
1219
01:03:35,354 --> 01:03:38,390
of the Y-coordinate.
1220
01:03:38,390 --> 01:03:40,499
So when things
starts to be bigger,
1221
01:03:40,499 --> 01:03:42,040
you're going to have
to make choices.
1222
01:03:42,040 --> 01:03:43,900
What does it mean to be bigger--
1223
01:03:43,900 --> 01:03:48,670
when d is 100,000,
I mean, the sum
1224
01:03:48,670 --> 01:03:51,160
of the squares of those
guys have to be equal to 1.
1225
01:03:51,160 --> 01:03:52,790
So they're all
very small numbers.
1226
01:03:52,790 --> 01:03:54,670
And so it's hard for you to
tell which one is a big number
1227
01:03:54,670 --> 01:03:56,045
and which ones is
a small number.
1228
01:03:56,045 --> 01:03:57,378
Why would you want to know this?
1229
01:03:57,378 --> 01:03:58,840
Because it's
actually telling you
1230
01:03:58,840 --> 01:04:03,219
that if v is very close to
e1, then that means that e1--
1231
01:04:03,219 --> 01:04:04,760
in the case of the
gene example, that
1232
01:04:04,760 --> 01:04:08,510
would mean that e1 is the
gene that's very important.
1233
01:04:08,510 --> 01:04:10,100
Maybe there's actually
just two genes
1234
01:04:10,100 --> 01:04:12,109
that explain those two things.
1235
01:04:12,109 --> 01:04:14,150
And those are the genes
that have been picked up.
1236
01:04:14,150 --> 01:04:16,880
There are two genes that
encode geographic location,
1237
01:04:16,880 --> 01:04:18,224
and that's it.
1238
01:04:18,224 --> 01:04:19,640
And so it's very
important for you
1239
01:04:19,640 --> 01:04:21,630
to be able to
interpret what v means.
1240
01:04:21,630 --> 01:04:23,270
Where it has large
values, it means
1241
01:04:23,270 --> 01:04:26,689
that maybe it has large
values for e1, e2, and e3.
1242
01:04:26,689 --> 01:04:28,980
And it means that it's a
combination of e1, e2, and e3.
1243
01:04:28,980 --> 01:04:30,813
And now, you can
interpret, because you have
1244
01:04:30,813 --> 01:04:33,150
only three variables to find.
1245
01:04:33,150 --> 01:04:36,780
And so sparse PCA
builds that in.
1246
01:04:36,780 --> 01:04:39,920
Sparse PCA says,
listen, I'm going
1247
01:04:39,920 --> 01:04:42,600
to want to have at most
10 non-zero coefficients.
1248
01:04:42,600 --> 01:04:44,550
And the rest, I want to be 0.
1249
01:04:44,550 --> 01:04:47,040
I want to be able to be a
combination of at most 10
1250
01:04:47,040 --> 01:04:50,540
of my original variables.
1251
01:04:50,540 --> 01:04:52,740
And now, I can do
interpretation.
1252
01:04:52,740 --> 01:04:54,690
So the problem
with sparse PCA is
1253
01:04:54,690 --> 01:04:57,404
that it becomes very
difficult numerically
1254
01:04:57,404 --> 01:04:58,320
to solve this problem.
1255
01:04:58,320 --> 01:04:59,220
I can write it.
1256
01:04:59,220 --> 01:05:05,700
So the problem is simply
maximize the variance u
1257
01:05:05,700 --> 01:05:09,360
transpose, say, Su
subject to-- well,
1258
01:05:09,360 --> 01:05:12,180
I want the norm of u to be equal to 1.
1259
01:05:12,180 --> 01:05:14,450
So that's the original PCA.
1260
01:05:14,450 --> 01:05:16,020
But now, I also
want that the sum
1261
01:05:16,020 --> 01:05:19,320
of the indicators of the
uj that are not equal to 0
1262
01:05:19,320 --> 01:05:23,120
is at most, say, 10.
1263
01:05:23,120 --> 01:05:26,550
This constraint is
very non-convex.
1264
01:05:26,550 --> 01:05:28,430
So I can relax it
to a convex one
1265
01:05:28,430 --> 01:05:31,720
like we did for
linear regression.
1266
01:05:31,720 --> 01:05:33,920
But now, I've totally
messed up with the fact
1267
01:05:33,920 --> 01:05:37,930
that I could use linear
algebra to solve this problem.
1268
01:05:37,930 --> 01:05:40,812
And so now, you have to go
through much more complicated
1269
01:05:40,812 --> 01:05:42,520
optimization techniques,
which are called
1270
01:05:42,520 --> 01:05:44,350
semidefinite
programs, which do not
1271
01:05:44,350 --> 01:05:46,600
scale well in high dimensions.
1272
01:05:46,600 --> 01:05:48,730
And so you have to do
a bunch of tricks--
1273
01:05:48,730 --> 01:05:49,660
numerical tricks.
1274
01:05:49,660 --> 01:05:52,630
But there are some packages
that implements some heuristics
1275
01:05:52,630 --> 01:05:55,140
or some other things--
1276
01:05:55,140 --> 01:05:56,800
iterative
thresholding, all sorts
1277
01:05:56,800 --> 01:05:58,896
of various numerical
tricks that you can do.
1278
01:05:58,896 --> 01:06:01,270
But the problem they are trying
to solve is exactly this.
1279
01:06:01,270 --> 01:06:03,947
Among all directions that
I have norm 1, of course,
1280
01:06:03,947 --> 01:06:06,030
because it's the direction
that have at most, say,
1281
01:06:06,030 --> 01:06:09,382
10 non-zero coordinates, I want
to find the one that maximizes
1282
01:06:09,382 --> 01:06:10,340
the empirical variance.
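One of the iterative-thresholding heuristics mentioned above can be sketched as truncated power iteration: ordinary power iteration on S, but keeping only the s largest-magnitude coordinates at each step. The covariance, spike size, and sparsity level below are invented; this is a heuristic sketch, not an exact solver for the non-convex problem.

```python
import numpy as np

rng = np.random.default_rng(2)

def truncated_power_iteration(S, s, n_iter=100):
    # Power iteration on S, hard-thresholding to the s largest-magnitude
    # coordinates at every step, then renormalizing to norm 1.
    d = S.shape[0]
    u = np.ones(d) / np.sqrt(d)
    for _ in range(n_iter):
        u = S @ u
        keep = np.argsort(np.abs(u))[-s:]  # indices of the s largest entries
        mask = np.zeros(d)
        mask[keep] = 1.0
        u = u * mask
        u /= np.linalg.norm(u)
    return u

# Toy check: a covariance with a sparse spike on the first 3 coordinates.
d = 50
v = np.zeros(d)
v[:3] = 1.0 / np.sqrt(3)
Sigma = np.eye(d) + 5.0 * np.outer(v, v)
X = rng.standard_normal((400, d)) @ np.linalg.cholesky(Sigma).T
S = X.T @ X / 400
u = truncated_power_iteration(S, s=3)
```

With a spike this strong, the heuristic recovers both the 3-coordinate support and the direction of the planted sparse eigenvector.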
1283
01:06:23,030 --> 01:06:27,782
Actually, let me
just show you this.
1284
01:06:41,910 --> 01:06:47,830
I wanted to show
you an output of PCA
1285
01:06:47,830 --> 01:06:50,620
where people are actually
trying to do directly--
1286
01:06:56,043 --> 01:07:05,903
maybe-- there you go.
1287
01:07:20,700 --> 01:07:26,690
So right here, you
see this is SPSS.
1288
01:07:26,690 --> 01:07:29,310
That's a statistical software.
1289
01:07:29,310 --> 01:07:33,100
And this is an output
that was preprocessed
1290
01:07:33,100 --> 01:07:34,650
by a professional--
1291
01:07:34,650 --> 01:07:36,240
not preprocessed,
post-processed.
1292
01:07:36,240 --> 01:07:38,520
So that's something
where they ran PCA.
1293
01:07:38,520 --> 01:07:39,390
So what is the data?
1294
01:07:39,390 --> 01:07:43,890
This is raw data
where you ask doctors
1295
01:07:43,890 --> 01:07:47,907
what they think of the
behavior of a particular sales
1296
01:07:47,907 --> 01:07:49,740
representative for
pharmaceutical companies.
1297
01:07:49,740 --> 01:07:51,323
So pharmaceutical
companies are trying
1298
01:07:51,323 --> 01:07:52,950
to improve their sales force.
1299
01:07:52,950 --> 01:07:56,430
And they're asking
doctors how would they
1300
01:07:56,430 --> 01:07:58,920
rate-- what do they value
about their interaction
1301
01:07:58,920 --> 01:08:01,410
with a sales representative.
1302
01:08:01,410 --> 01:08:04,140
So basically, there's
a bunch of questions.
1303
01:08:04,140 --> 01:08:10,410
One is: offers credible point
of view on, say, trends,
1304
01:08:10,410 --> 01:08:12,720
provides valuable
networking opportunities.
1305
01:08:12,720 --> 01:08:13,950
This is one question.
1306
01:08:13,950 --> 01:08:15,750
Rate this on a
scale from 1 to 5.
1307
01:08:15,750 --> 01:08:16,790
That was the question.
1308
01:08:16,790 --> 01:08:18,840
And they had a bunch
of questions like this.
1309
01:08:18,840 --> 01:08:22,410
And then they asked 1,000
doctors to make those ratings.
1310
01:08:22,410 --> 01:08:24,210
And what they want--
so each doctor now
1311
01:08:24,210 --> 01:08:25,890
is a vector of ratings.
1312
01:08:25,890 --> 01:08:28,960
And they want to know if there's
different groups of doctors,
1313
01:08:28,960 --> 01:08:30,210
what do doctors respond to.
1314
01:08:30,210 --> 01:08:31,240
If there's different
groups, then
1315
01:08:31,240 --> 01:08:33,450
maybe they know that they
can actually address them
1316
01:08:33,450 --> 01:08:35,500
separately, et cetera.
1317
01:08:35,500 --> 01:08:37,950
And so to do that, of course,
there's lots of questions.
1318
01:08:37,950 --> 01:08:39,840
And so what you want is
to just first project
1319
01:08:39,840 --> 01:08:41,589
into lower dimensions,
so you can actually
1320
01:08:41,589 --> 01:08:42,819
visualize what's going on.
1321
01:08:42,819 --> 01:08:44,760
And this is what
was done for this.
1322
01:08:44,760 --> 01:08:47,490
So these are the first
three principal components
1323
01:08:47,490 --> 01:08:49,439
that came out.
1324
01:08:49,439 --> 01:08:52,439
And even though we ordered
the values of the lambdas,
1325
01:08:52,439 --> 01:08:56,130
there's no reason why the
entries of v should be ordered.
1326
01:08:56,130 --> 01:08:57,840
And if you look at
the values of v here,
1327
01:08:57,840 --> 01:08:59,631
they look like they're
pretty much ordered.
1328
01:08:59,631 --> 01:09:04,142
It starts at 0.784, and then
you're at 0.3 around here.
1329
01:09:04,142 --> 01:09:06,600
There's something that goes up
again, and then you go down.
1330
01:09:06,600 --> 01:09:11,200
Actually, it's marked in red
every time it goes up again.
1331
01:09:11,200 --> 01:09:13,660
And so now, what they
did is they said,
1332
01:09:13,660 --> 01:09:16,270
OK, I need to
interpret those guys.
1333
01:09:16,270 --> 01:09:18,340
I need to tell you what this is.
1334
01:09:18,340 --> 01:09:21,160
If you tell me, we found
the principal component
1335
01:09:21,160 --> 01:09:24,866
that really discriminates
the doctors in two groups,
1336
01:09:24,866 --> 01:09:26,740
the drug company is
going to come back to you
1337
01:09:26,740 --> 01:09:29,080
and say, OK, what is
this characteristic?
1338
01:09:29,080 --> 01:09:31,510
And you say, oh, it's
actually a linear combination
1339
01:09:31,510 --> 01:09:33,460
of 40 characteristics.
1340
01:09:33,460 --> 01:09:35,735
And they say, well, we
don't need you to do that.
1341
01:09:35,735 --> 01:09:38,109
I mean, it cannot be a linear
combination of anything you
1342
01:09:38,109 --> 01:09:39,220
didn't ask.
1343
01:09:39,220 --> 01:09:41,680
And so for that,
first of all, there's
1344
01:09:41,680 --> 01:09:44,859
a post-processing of PCA, which
says, OK, once I actually,
1345
01:09:44,859 --> 01:09:46,990
say, found three
principal components,
1346
01:09:46,990 --> 01:09:51,370
that means that I found the
dimension three space on which
1347
01:09:51,370 --> 01:09:52,899
I want to project my points.
1348
01:09:52,899 --> 01:09:55,720
In this space, I can pick
any direction I want.
1349
01:09:55,720 --> 01:09:57,100
So the first thing
is that you do
1350
01:09:57,100 --> 01:09:59,308
some sort of local arrangements,
so that those things
1351
01:09:59,308 --> 01:10:01,790
look like they are increasing
and then decreasing.
1352
01:10:01,790 --> 01:10:06,130
So you just change, you
rotate your coordinate system
1353
01:10:06,130 --> 01:10:09,880
in this three dimensional space
that you've actually isolated.
1354
01:10:09,880 --> 01:10:11,830
And so once you do
this, the reason
1355
01:10:11,830 --> 01:10:13,600
to do that is that
it sort of makes
1356
01:10:13,600 --> 01:10:16,554
big, sharp differences
between large and small values
1357
01:10:16,554 --> 01:10:18,220
of the coordinates
of the thing you had.
1358
01:10:18,220 --> 01:10:19,261
And why do you want this?
1359
01:10:19,261 --> 01:10:21,100
Because now, you
can say, well, I'm
1360
01:10:21,100 --> 01:10:23,590
going to start looking at the
ones that have large values.
1361
01:10:23,590 --> 01:10:24,250
And what do they say?
1362
01:10:24,250 --> 01:10:26,249
They say in-depth knowledge,
in-depth knowledge,
1363
01:10:26,249 --> 01:10:28,270
in-depth knowledge,
knowledge about.
1364
01:10:28,270 --> 01:10:30,280
This thing is clearly
something that
1365
01:10:30,280 --> 01:10:34,090
actually characterizes
the knowledge of my sales
1366
01:10:34,090 --> 01:10:35,260
representative.
1367
01:10:35,260 --> 01:10:38,311
And so that's something that
doctors are sensitive to.
1368
01:10:38,311 --> 01:10:40,060
That's something that
really discriminates
1369
01:10:40,060 --> 01:10:40,960
the doctors in a way.
1370
01:10:40,960 --> 01:10:43,120
There's lots of variance
along those things,
1371
01:10:43,120 --> 01:10:45,576
or at least a lot of variance--
1372
01:10:45,576 --> 01:10:47,950
I mean, doctors are separate
in terms of their experience
1373
01:10:47,950 --> 01:10:49,240
with respect to this.
1374
01:10:49,240 --> 01:10:51,102
And so what they
did is said, OK,
1375
01:10:51,102 --> 01:10:53,310
all these guys, some of
those they have large values,
1376
01:10:53,310 --> 01:10:55,015
but I don't know how
to interpret them.
1377
01:10:55,015 --> 01:10:56,890
And so I'm just going
to put the first block,
1378
01:10:56,890 --> 01:10:58,681
and I'm going to call
it medical knowledge,
1379
01:10:58,681 --> 01:11:01,330
because all those things are
knowledge about medical stuff.
1380
01:11:01,330 --> 01:11:03,538
Then here, I didn't know
how to interpret those guys.
1381
01:11:03,538 --> 01:11:06,220
But those guys, there's a big
clump of large coordinates,
1382
01:11:06,220 --> 01:11:10,720
and they're about respectful
of my time, listens, friendly
1383
01:11:10,720 --> 01:11:12,070
but courteous.
1384
01:11:12,070 --> 01:11:14,000
This is all about the
quality of interaction.
1385
01:11:14,000 --> 01:11:17,446
So this block was actually
called quality of interaction.
1386
01:11:17,446 --> 01:11:18,820
And then there
was a third block,
1387
01:11:18,820 --> 01:11:21,320
which you can tell starts to
be spreading a little thin.
1388
01:11:21,320 --> 01:11:22,864
There's just much less of them.
1389
01:11:22,864 --> 01:11:24,280
But this thing was
actually called
1390
01:11:24,280 --> 01:11:26,260
fair and critical opinion.
1391
01:11:26,260 --> 01:11:30,010
And so now, you have three
discriminating directions.
1392
01:11:30,010 --> 01:11:31,990
And you can actually
give them a name.
1393
01:11:31,990 --> 01:11:34,780
Wouldn't it be beautiful if
all the numbers in the gray box
1394
01:11:34,780 --> 01:11:36,700
came non-zero and
all the other numbers
1395
01:11:36,700 --> 01:11:38,860
came zero-- there
was no ad hoc choice.
1396
01:11:38,860 --> 01:11:40,750
I mean, this is probably
an afternoon of work
1397
01:11:40,750 --> 01:11:42,850
to like scratch out
all these numbers
1398
01:11:42,850 --> 01:11:44,801
and put all these
color codes, et cetera.
1399
01:11:44,801 --> 01:11:47,050
Whereas, you could just have
something that tells you,
1400
01:11:47,050 --> 01:11:49,090
OK, here are the non-zeros.
1401
01:11:49,090 --> 01:11:52,120
If you can actually make a story
around why this group of things
1402
01:11:52,120 --> 01:11:54,820
actually makes sense, such
as it is medical knowledge,
1403
01:11:54,820 --> 01:11:55,730
then good for you.
1404
01:11:55,730 --> 01:11:57,804
Otherwise, you could
just say, I can't.
1405
01:11:57,804 --> 01:11:59,470
And that's what sparse
PCA does for you.
1406
01:11:59,470 --> 01:12:02,890
Sparse PCA outputs something
where all those numbers would
1407
01:12:02,890 --> 01:12:03,850
be zero.
1408
01:12:03,850 --> 01:12:06,964
And there would be exactly,
say, 10 non-zero coordinates.
1409
01:12:06,964 --> 01:12:08,380
And you can tune
this knob of 10.
1410
01:12:08,380 --> 01:12:09,220
You can make it 9.
1411
01:12:11,687 --> 01:12:13,270
Depending on what
your major is, maybe
1412
01:12:13,270 --> 01:12:15,310
you can actually go
on with 20 of them
1413
01:12:15,310 --> 01:12:18,310
and have the ability to
tell the story about 20
1414
01:12:18,310 --> 01:12:20,650
different variables and how
they fit in the same group.
1415
01:12:20,650 --> 01:12:22,750
And depending on
how you feel, it's
1416
01:12:22,750 --> 01:12:25,390
easy to rerun the PCA
depending on the value
1417
01:12:25,390 --> 01:12:26,535
that you want here.
1418
01:12:26,535 --> 01:12:28,660
And so you could actually
just come up with the one
1419
01:12:28,660 --> 01:12:30,240
you prefer.
1420
01:12:30,240 --> 01:12:32,354
And so that's the
sparse PCA thing
1421
01:12:32,354 --> 01:12:33,520
which I'm trying to promote.
1422
01:12:33,520 --> 01:12:35,250
I mean, this is not
super well-spread.
1423
01:12:35,250 --> 01:12:39,300
It's a fairly new idea,
maybe at most 10 years old.
1424
01:12:39,300 --> 01:12:40,940
And it's not
completely well-spread
1425
01:12:40,940 --> 01:12:42,540
in statistical packages.
1426
01:12:42,540 --> 01:12:44,040
But that's clearly
what people are
1427
01:12:44,040 --> 01:12:46,601
trying to emulate currently.
1428
01:12:46,601 --> 01:12:47,100
Yes?
1429
01:12:47,100 --> 01:12:48,600
AUDIENCE: So what
exactly does it
1430
01:12:48,600 --> 01:12:50,932
mean that the doctors
have a lot of variance
1431
01:12:50,932 --> 01:12:53,100
in medical knowledge,
quality of interaction,
1432
01:12:53,100 --> 01:12:55,600
and fair and critical opinion?
1433
01:12:55,600 --> 01:13:00,200
Like, it was saying that
these are like the main things
1434
01:13:00,200 --> 01:13:02,986
that doctors vary on,
some doctors care.
1435
01:13:02,986 --> 01:13:05,590
Like we could sort of
characterize a doctor by, oh,
1436
01:13:05,590 --> 01:13:08,030
he cares this much about
medical knowledge, this much
1437
01:13:08,030 --> 01:13:09,494
about the quality
of interaction,
1438
01:13:09,494 --> 01:13:11,446
and this much about
critical opinion.
1439
01:13:11,446 --> 01:13:14,862
And that says most of the story
about what this doctor wants
1440
01:13:14,862 --> 01:13:17,790
from a drug representative?
1441
01:13:17,790 --> 01:13:20,610
PHILIPPE RIGOLLET: Not really.
1442
01:13:20,610 --> 01:13:22,590
I mean, OK, let's say
you pick only one.
1443
01:13:22,590 --> 01:13:31,535
So that means that you
would take all your doctors,
1444
01:13:31,535 --> 01:13:33,160
and you would have
one direction, which
1445
01:13:33,160 --> 01:13:36,480
is quality of interaction.
1446
01:13:36,480 --> 01:13:38,710
And there would be just
spread out points here.
1447
01:13:42,604 --> 01:13:44,270
So there are two
things that can happen.
1448
01:13:44,270 --> 01:13:46,900
The first one is that
there's a clump here,
1449
01:13:46,900 --> 01:13:49,014
and then there's a clump here.
1450
01:13:49,014 --> 01:13:50,680
That still represents
a lot of variance.
1451
01:13:50,680 --> 01:13:52,420
And if this happens,
you probably
1452
01:13:52,420 --> 01:13:55,120
want to go back in
your data and see
1453
01:13:55,120 --> 01:13:58,540
were these people visited
by a different group
1454
01:13:58,540 --> 01:14:00,520
than these people,
or maybe these people
1455
01:14:00,520 --> 01:14:02,700
have a different specialty.
1456
01:14:05,250 --> 01:14:07,000
I mean, you have to
look back at your data
1457
01:14:07,000 --> 01:14:08,470
and try to understand
why you would have
1458
01:14:08,470 --> 01:14:09,700
different groups of people.
1459
01:14:09,700 --> 01:14:13,510
And if it's like completely
evenly spread out,
1460
01:14:13,510 --> 01:14:15,730
then all it's saying
is that, if you
1461
01:14:15,730 --> 01:14:18,460
want to have a uniform
quality of interaction,
1462
01:14:18,460 --> 01:14:20,410
you need to take
measures on this.
1463
01:14:20,410 --> 01:14:24,114
You need to have this to
not be discrimination.
1464
01:14:24,114 --> 01:14:26,530
But I think really when it's
becoming interesting it's not
1465
01:14:26,530 --> 01:14:27,779
when it's complete spread out.
1466
01:14:27,779 --> 01:14:29,350
It's when there's
a big group here.
1467
01:14:29,350 --> 01:14:30,520
And then there's
almost no one here,
1468
01:14:30,520 --> 01:14:32,020
and then there's
a big group here.
1469
01:14:32,020 --> 01:14:34,880
And then maybe there's
something you can do.
1470
01:14:34,880 --> 01:14:40,690
And so those two things actually
give you a lot of variance.
1471
01:14:40,690 --> 01:14:47,490
So actually, maybe
I'll talk about this.
1472
01:14:47,490 --> 01:14:49,732
Here, this is sort of a mixture.
1473
01:14:49,732 --> 01:14:51,690
You have a mixture of
two different populations
1474
01:14:51,690 --> 01:14:53,040
of doctors.
1475
01:14:53,040 --> 01:14:56,670
And it turns out that
principal component analysis--
1476
01:14:56,670 --> 01:14:59,750
so a mixture is when you
have different populations--
1477
01:14:59,750 --> 01:15:02,010
think of like two
Gaussians that are just
1478
01:15:02,010 --> 01:15:03,690
centered at two
different points,
1479
01:15:03,690 --> 01:15:05,460
and maybe they're
in high dimensions.
1480
01:15:05,460 --> 01:15:07,350
And those are
clusters of people,
1481
01:15:07,350 --> 01:15:09,680
and you want to be able to
differentiate those guys.
1482
01:15:09,680 --> 01:15:10,770
If you're in very
high dimensions,
1483
01:15:10,770 --> 01:15:12,120
it's going to be very
difficult. But one
1484
01:15:12,120 --> 01:15:14,730
of the first processing tools
that people do is to do PCA.
1485
01:15:14,730 --> 01:15:18,046
Because if you have one big
group here and one big group
1486
01:15:18,046 --> 01:15:19,920
here, it means that
there's a lot of variance
1487
01:15:19,920 --> 01:15:21,961
along the direction that
goes through the centers
1488
01:15:21,961 --> 01:15:22,860
of those groups.
1489
01:15:22,860 --> 01:15:24,630
And that's essentially
what happened here.
1490
01:15:24,630 --> 01:15:27,967
You could think of this as being
two blobs in high dimensions.
1491
01:15:27,967 --> 01:15:29,550
But you're really
just projecting them
1492
01:15:29,550 --> 01:15:30,810
into one dimension.
1493
01:15:30,810 --> 01:15:33,370
And this dimension, hopefully,
goes through the center.
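The two-blobs picture can be sketched numerically: two Gaussian populations in high dimensions, separated by their centers, become visibly separated after projecting onto the first principal component. This is an illustrative toy example, not the lecture's data; the parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 400
mu = np.zeros(d)
mu[0] = 3.0                                   # centers at +mu and -mu

labels = rng.integers(0, 2, n)                # which population each point is in
X = rng.standard_normal((n, d)) + np.where(labels[:, None] == 1, mu, -mu)

Xc = X - X.mean(axis=0)                       # center the data
S = Xc.T @ Xc / n                             # empirical covariance (PSD)
eigvals, eigvecs = np.linalg.eigh(S)          # spectral theorem
v1 = eigvecs[:, -1]                           # leading principal direction

proj = Xc @ v1                                # project onto one dimension
pred = (proj > 0).astype(int)                 # sign of projection = cluster guess
acc = max(np.mean(pred == labels), np.mean(pred != labels))
print(acc)
```

The direction of largest variance runs through the two centers, so the 1-D projection recovers almost all of the mixture labels; in very high dimensions this PCA step is exactly the preprocessing being described.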
1494
01:15:33,370 --> 01:15:37,460
And so as preprocessing--
so I'm going to stop here.
1495
01:15:37,460 --> 01:15:42,720
But PCA is not just made
for dimension reduction.
1496
01:15:42,720 --> 01:15:44,700
It's used for
mixtures, for example.
1497
01:15:44,700 --> 01:15:47,340
It's also used when you
have graphical data.
1498
01:15:47,340 --> 01:15:48,750
What is the idea of PCA?
1499
01:15:48,750 --> 01:15:53,400
It just says, if you have a
matrix that seems to have low
1500
01:15:53,400 --> 01:15:56,370
rank-- meaning that there's a
lot of those lambda i's that
1501
01:15:56,370 --> 01:15:57,570
are very small--
1502
01:15:57,570 --> 01:16:00,420
and then I see that
plus noise, then
1503
01:16:00,420 --> 01:16:02,790
it's a good idea to
do PCA on this thing.
1504
01:16:02,790 --> 01:16:05,520
And in particular, people
use that in networks a lot.
1505
01:16:05,520 --> 01:16:08,300
So you take the adjacency
matrix of a graph--
1506
01:16:08,300 --> 01:16:11,160
well, you sort of preprocess it
a little bit, so it looks nice.
1507
01:16:11,160 --> 01:16:13,590
And then if you have, for
example, two communities
1508
01:16:13,590 --> 01:16:15,570
in there, it should
look like something that
1509
01:16:15,570 --> 01:16:18,510
is low rank plus some noise.
1510
01:16:18,510 --> 01:16:22,670
And low rank means that there's
just very few non-zero--
1511
01:16:22,670 --> 01:16:24,226
well, low rank means this.
1512
01:16:24,226 --> 01:16:26,100
Low rank means that if
you do the scree plot,
1513
01:16:26,100 --> 01:16:27,250
you will see
something like this,
1514
01:16:27,250 --> 01:16:29,720
which means that if you throw
out all the smaller ones,
1515
01:16:29,720 --> 01:16:33,420
it should not really matter
in the overall structure.
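The "low rank plus noise" picture for a graph with two communities can be checked directly: the expected adjacency matrix has rank 2, so the scree plot of the observed adjacency matrix shows two large eigenvalues above a bulk of small noise eigenvalues. A minimal sketch, with assumed connection probabilities (0.5 within a community, 0.1 across):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
z = np.array([1] * (n // 2) + [-1] * (n // 2))   # two planted communities

# Connection probability: 0.5 within a community, 0.1 across communities,
# so the expected adjacency matrix has rank 2 (low rank plus noise).
P = 0.30 + 0.20 * np.outer(z, z)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                      # symmetric, no self-loops

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]   # the "scree plot" values
print(eigvals[:4])                               # two spikes, then a bulk
```

The two leading eigenvalues are of order n while the rest are of order sqrt(n), so throwing out everything past the gap keeps the community structure; the second eigenvector's signs approximately recover the two groups.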
1516
01:16:33,420 --> 01:16:35,430
And so you can use all--
1517
01:16:35,430 --> 01:16:39,090
these techniques are used
everywhere these days, not
1518
01:16:39,090 --> 01:16:39,900
just in PCA.
1519
01:16:39,900 --> 01:16:41,670
So we call it PCA
as statisticians.
1520
01:16:41,670 --> 01:16:46,700
But people call it the
spectral methods or SVD.
1521
01:16:46,700 --> 01:16:49,450
So everyone--