1
00:00:00,120 --> 00:00:02,460
The following content is
provided under a Creative
2
00:00:02,460 --> 00:00:03,880
Commons license.
3
00:00:03,880 --> 00:00:06,090
Your support will help
MIT OpenCourseWare
4
00:00:06,090 --> 00:00:10,180
continue to offer high quality
educational resources for free.
5
00:00:10,180 --> 00:00:12,720
To make a donation or to
view additional materials
6
00:00:12,720 --> 00:00:16,680
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:16,680 --> 00:00:19,219
at ocw.mit.edu.
8
00:00:19,219 --> 00:00:20,760
PHILIPPE RIGOLLET:
We keep on talking
9
00:00:20,760 --> 00:00:24,870
about principal component
analysis, which we essentially
10
00:00:24,870 --> 00:00:27,910
introduced as a way to
work with a bunch of data.
11
00:00:27,910 --> 00:00:31,560
So the data that's given to
us when we want to do PCA
12
00:00:31,560 --> 00:00:35,270
is a bunch of vectors X1 to Xn.
13
00:00:35,270 --> 00:00:40,090
So they are random vectors.
14
00:00:45,290 --> 00:00:46,652
in Rd.
15
00:00:46,652 --> 00:00:48,110
And what we mentioned
is that we're
16
00:00:48,110 --> 00:00:51,742
going to be using linear
algebra-- in particular,
17
00:00:51,742 --> 00:00:54,200
the spectral theorem-- that
guarantees to us that if I look
18
00:00:54,200 --> 00:00:56,000
at the covariance
matrix of this guy,
19
00:00:56,000 --> 00:00:57,890
or its empirical
covariance matrix,
20
00:00:57,890 --> 00:01:00,132
since they're
symmetric real matrices
21
00:01:00,132 --> 00:01:01,590
and they are positive
semidefinite,
22
00:01:01,590 --> 00:01:06,830
there exists a diagonalization
into non-negative eigenvalues.
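As an illustrative aside (not part of the lecture), this spectral decomposition can be checked numerically in Python with NumPy; the data and dimensions below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5                      # n points in R^d (illustrative sizes)
X = rng.normal(size=(n, d))        # rows are the observations X_1, ..., X_n

Xc = X - X.mean(axis=0)            # center the points
S = Xc.T @ Xc / n                  # empirical covariance matrix, d x d

# S is symmetric real and positive semidefinite, so eigh returns real,
# non-negative eigenvalues and an orthogonal matrix of eigenvectors.
lam, P = np.linalg.eigh(S)         # S = P diag(lam) P^T
assert np.all(lam >= -1e-10)                 # eigenvalues non-negative
assert np.allclose(P.T @ P, np.eye(d))       # P is orthogonal
```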
23
00:01:06,830 --> 00:01:09,555
And so here, those
things live in Rd,
24
00:01:09,555 --> 00:01:11,570
so it's a really large space.
25
00:01:11,570 --> 00:01:14,600
And what we want to
do is to map it down
26
00:01:14,600 --> 00:01:16,640
into a space that
we can visualize,
27
00:01:16,640 --> 00:01:19,610
hopefully a space
of size 2 or 3.
28
00:01:19,610 --> 00:01:22,460
Or if not, then we're just going
to take more and start looking
29
00:01:22,460 --> 00:01:24,920
at subspaces altogether.
30
00:01:24,920 --> 00:01:33,120
So think of the case where d
is large but not larger than n.
31
00:01:33,120 --> 00:01:36,520
So let's say, you have a
large number of points.
32
00:01:36,520 --> 00:01:40,590
The question is, is it possible
to project those things onto
33
00:01:40,590 --> 00:01:45,260
a lower dimensional
space, d prime,
34
00:01:45,260 --> 00:01:49,480
which is much less than d-- so
think of d prime equals, say,
35
00:01:49,480 --> 00:01:52,180
2 or 3--
36
00:01:52,180 --> 00:01:54,490
and so that you keep
as much information
37
00:01:54,490 --> 00:01:56,740
about the cloud of points
that you had originally.
38
00:01:56,740 --> 00:01:58,990
So again, the example
that we could have
39
00:01:58,990 --> 00:02:04,060
is that X1 to Xn are, say,
Xi for patient i, recording
40
00:02:04,060 --> 00:02:08,740
a bunch of body measurements
and maybe blood pressure,
41
00:02:08,740 --> 00:02:10,639
some symptoms, et cetera.
42
00:02:10,639 --> 00:02:12,520
And then we have a
cloud of n patients.
43
00:02:12,520 --> 00:02:15,222
And we're trying to
visualize maybe to see if--
44
00:02:15,222 --> 00:02:16,930
If I could see, for
example, that there's
45
00:02:16,930 --> 00:02:18,820
two groups of
patients, maybe I would
46
00:02:18,820 --> 00:02:21,252
know that I have two
groups with different diseases
47
00:02:21,252 --> 00:02:22,960
or maybe two groups
of different patients
48
00:02:22,960 --> 00:02:25,540
that respond differently
to a particular disease
49
00:02:25,540 --> 00:02:27,040
or drug et cetera.
50
00:02:27,040 --> 00:02:28,900
So visualizing is
going to give us
51
00:02:28,900 --> 00:02:33,880
quite a bit of insight about
what the spatial arrangement
52
00:02:33,880 --> 00:02:35,980
of those vectors is.
53
00:02:35,980 --> 00:02:40,660
And so PCA says, well, here,
of course, in this question,
54
00:02:40,660 --> 00:02:42,880
one thing that's not defined
is what is information.
55
00:02:42,880 --> 00:02:44,338
And we said that
one thing we might
56
00:02:44,338 --> 00:02:46,600
want to do when we project
is that points do not
57
00:02:46,600 --> 00:02:48,267
collide with each other.
58
00:02:48,267 --> 00:02:50,350
And so that means we're
trying to find directions,
59
00:02:50,350 --> 00:02:53,110
where after I project, the
points are still pretty spread
60
00:02:53,110 --> 00:02:53,860
out.
61
00:02:53,860 --> 00:02:55,630
And so I can see
what's going on.
62
00:02:55,630 --> 00:02:58,270
And PCA says-- OK,
so there's many ways
63
00:02:58,270 --> 00:02:59,500
to answer this question.
64
00:02:59,500 --> 00:03:04,290
And PCA says, let's just
find a subspace of dimension
65
00:03:04,290 --> 00:03:08,110
d prime that keeps as much
covariance structure as
66
00:03:08,110 --> 00:03:10,150
possible.
67
00:03:10,150 --> 00:03:13,390
And the reason is
that those directions
68
00:03:13,390 --> 00:03:15,430
are the ones that maximize
the variance, which
69
00:03:15,430 --> 00:03:17,230
is a proxy for the spread.
70
00:03:17,230 --> 00:03:19,540
There's many, many
ways to do this.
71
00:03:19,540 --> 00:03:22,840
There's actually a
Google video that
72
00:03:22,840 --> 00:03:26,440
was released maybe last week
about the data visualization
73
00:03:26,440 --> 00:03:29,260
team of Google that shows
you something called
74
00:03:29,260 --> 00:03:31,554
t-SNE, which is
essentially something
75
00:03:31,554 --> 00:03:32,470
that tries to do that.
76
00:03:32,470 --> 00:03:34,540
It takes points in
very high dimensions
77
00:03:34,540 --> 00:03:36,400
and tries to map them
in lower dimensions,
78
00:03:36,400 --> 00:03:38,280
so that you can
actually visualize them.
79
00:03:38,280 --> 00:03:41,800
And t-SNE is some
alternative to PCA
80
00:03:41,800 --> 00:03:46,850
that gives another definition
for the word information.
81
00:03:46,850 --> 00:03:49,970
I'll talk about this towards
the end, how you can actually
82
00:03:49,970 --> 00:03:52,730
somewhat automatically
extend everything
83
00:03:52,730 --> 00:03:58,830
we've said for PCA to an
infinite family of procedures.
84
00:03:58,830 --> 00:04:00,460
So how do we do this?
85
00:04:00,460 --> 00:04:02,690
Well, the way we do
this is as follows.
86
00:04:02,690 --> 00:04:05,010
So remember, given
those guys, we
87
00:04:05,010 --> 00:04:09,120
can form something which is
called S, which is the sample,
88
00:04:09,120 --> 00:04:16,885
or the empirical
covariance matrix.
89
00:04:19,930 --> 00:04:22,210
And from a
couple of slides ago,
90
00:04:22,210 --> 00:04:25,450
we know that S has an
eigenvalue decomposition,
91
00:04:25,450 --> 00:04:32,930
S is equal to PDP transpose,
where P is orthogonal.
92
00:04:35,570 --> 00:04:37,720
So that's where we use our
linear algebra results.
93
00:04:37,720 --> 00:04:43,640
So that means that P transpose P
is equal to PP transpose, which
94
00:04:43,640 --> 00:04:46,220
is the identity.
95
00:04:46,220 --> 00:04:50,370
So remember, S is
a d by d matrix.
96
00:04:50,370 --> 00:04:53,070
And so P is also d by d.
97
00:04:53,070 --> 00:04:55,860
And D is diagonal.
98
00:05:00,402 --> 00:05:02,860
And I'm actually going to take,
without loss of generality,
99
00:05:02,860 --> 00:05:04,487
I'm going to assume that D--
100
00:05:04,487 --> 00:05:06,070
so it's going to be
diagonal-- and I'm
101
00:05:06,070 --> 00:05:10,240
going to have something
that looks like lambda 1
102
00:05:10,240 --> 00:05:10,930
to lambda d.
103
00:05:10,930 --> 00:05:14,830
Those are called the
eigenvalues of S.
104
00:05:14,830 --> 00:05:19,036
What we know is that lambda
j's are non-negative.
105
00:05:19,036 --> 00:05:21,160
And actually, what I'm
going to assume without loss
106
00:05:21,160 --> 00:05:24,820
of generality is lambda 1
is larger than lambda 2, which
107
00:05:24,820 --> 00:05:30,259
is larger than, and so on, down to lambda d.
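As a practical aside (not from the lecture): NumPy's eigh returns eigenvalues in ascending order, so matching this decreasing convention means reversing them and permuting the columns of P the same way, exactly the relabeling described here. The matrix S below is made up for illustration:

```python
import numpy as np

S = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])    # an illustrative symmetric PSD matrix

lam, P = np.linalg.eigh(S)         # ascending: lam[0] <= ... <= lam[-1]

# Reorder so lambda_1 >= lambda_2 >= ... >= lambda_d, permuting the
# columns of P to match -- the same permutation the lecture describes.
order = np.argsort(lam)[::-1]
lam, P = lam[order], P[:, order]

assert np.all(np.diff(lam) <= 0)                 # now decreasing
assert np.allclose(P @ np.diag(lam) @ P.T, S)    # still a valid decomposition
```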
108
00:05:30,259 --> 00:05:32,050
Because in particular,
this decomposition--
109
00:05:32,050 --> 00:05:35,470
the spectral decomposition--
is not entirely unique.
110
00:05:35,470 --> 00:05:39,750
I could permute
the columns of P,
111
00:05:39,750 --> 00:05:42,600
and I would still have
an orthogonal matrix.
112
00:05:42,600 --> 00:05:44,820
And to balance that,
I would also have
113
00:05:44,820 --> 00:05:46,890
to permute the entries of d.
114
00:05:46,890 --> 00:05:49,680
So there's as many
decompositions
115
00:05:49,680 --> 00:05:51,180
as there are permutations.
116
00:05:51,180 --> 00:05:52,860
So there's actually quite a bit.
117
00:05:52,860 --> 00:05:56,760
But the bag of
eigenvalues is unique.
118
00:05:56,760 --> 00:05:58,430
The set of
eigenvalues is unique.
119
00:05:58,430 --> 00:06:01,020
The ordering is
certainly not unique.
120
00:06:01,020 --> 00:06:02,730
So here, I'm just
going to pick--
121
00:06:02,730 --> 00:06:05,640
I'm going to nail down one
particular permutation--
122
00:06:05,640 --> 00:06:08,070
actually, maybe two in
case I have equalities.
123
00:06:08,070 --> 00:06:12,570
But let's say, I pick
one that satisfies this.
124
00:06:12,570 --> 00:06:15,450
And the reason why I do this
is really not very important.
125
00:06:15,450 --> 00:06:18,060
It's just to say,
I'm going to want
126
00:06:18,060 --> 00:06:20,500
to talk about the largest
of those eigenvalues.
127
00:06:20,500 --> 00:06:22,110
So this is just
going to be easier
128
00:06:22,110 --> 00:06:23,910
for me to say that
this one is lambda 1,
129
00:06:23,910 --> 00:06:26,730
rather than say it's lambda 7.
130
00:06:26,730 --> 00:06:39,980
So this is just to say that
the largest eigenvalue of S
131
00:06:39,980 --> 00:06:42,588
is lambda 1.
132
00:06:42,588 --> 00:06:45,550
If I didn't do that, I would
just call it maybe lambda max,
133
00:06:45,550 --> 00:06:47,760
and you would just know
which one I'm talking about.
134
00:06:52,910 --> 00:07:01,520
So what's happening now
is that if I look at d,
135
00:07:01,520 --> 00:07:04,250
then it turns out
that if I start--
136
00:07:04,250 --> 00:07:09,890
so if I do P transpose Xi, I am
actually projecting my Xi's--
137
00:07:09,890 --> 00:07:12,820
I'm basically changing
the basis for my Xi's.
138
00:07:12,820 --> 00:07:15,140
And now, D is the
empirical covariance matrix
139
00:07:15,140 --> 00:07:16,700
of those guys.
140
00:07:16,700 --> 00:07:18,630
So let's check that.
141
00:07:18,630 --> 00:07:22,010
So what it means is
that if I look at--
142
00:07:26,303 --> 00:07:29,120
so what I claim is
that P transpose Xi--
143
00:07:29,120 --> 00:07:35,180
that's a new vector, let's
call it Yi, it's also in Rd--
144
00:07:35,180 --> 00:07:37,940
and what I claim is that the
covariance matrix of this guy
145
00:07:37,940 --> 00:07:41,840
is actually now this
diagonal matrix, which
146
00:07:41,840 --> 00:07:45,140
means in particular that
if they were Gaussian, then
147
00:07:45,140 --> 00:07:46,280
they would be independent.
148
00:07:46,280 --> 00:07:48,890
But I also know now that
there's no correlation
149
00:07:48,890 --> 00:07:50,530
across coordinates of Yi.
150
00:07:50,530 --> 00:08:00,939
So to prove this, let me assume
that X bar is equal to 0.
151
00:08:00,939 --> 00:08:02,980
And the reason why I do
this is because it's just
152
00:08:02,980 --> 00:08:05,560
annoying to carry out all
this centering constantly
153
00:08:05,560 --> 00:08:09,400
when I talk about S. So
when X bar is equal to 0,
154
00:08:09,400 --> 00:08:11,640
that implies that S
has a very simple form.
155
00:08:11,640 --> 00:08:14,170
It's of the form 1/n
times the sum from i equal 1
156
00:08:14,170 --> 00:08:18,790
to n of Xi Xi transpose.
157
00:08:18,790 --> 00:08:20,380
So that's my S.
158
00:08:20,380 --> 00:08:24,370
But what I want is the S of Y--
159
00:08:24,370 --> 00:08:28,830
So OK, that implies
also that P times X
160
00:08:28,830 --> 00:08:34,690
bar, the average of the
PXi's, is also equal to 0.
161
00:08:34,690 --> 00:08:37,929
So that means that Y bar--
162
00:08:37,929 --> 00:08:40,240
Y has mean 0, if this is 0.
163
00:08:40,240 --> 00:08:43,970
So if I look at the sample
covariance matrix of Y,
164
00:08:43,970 --> 00:08:45,880
it's just going to
be something that
165
00:08:45,880 --> 00:08:49,990
looks like the sum of the
outer products or the Yi Yi
166
00:08:49,990 --> 00:08:50,590
transpose.
167
00:08:53,290 --> 00:08:56,770
And again, the reason why
I make this assumption
168
00:08:56,770 --> 00:09:01,400
is so that I don't have to write
minus X bar X bar transpose.
169
00:09:01,400 --> 00:09:02,284
But you can do it.
170
00:09:02,284 --> 00:09:03,950
And it's going to
work exactly the same.
171
00:09:06,790 --> 00:09:08,640
So now, I look at this S prime.
172
00:09:08,640 --> 00:09:11,340
And so what is this S prime?
173
00:09:11,340 --> 00:09:14,340
Well, I'm just going
to replace Yi with PXi.
174
00:09:14,340 --> 00:09:22,850
So it's the sum from i equal
1 to n of PXi PXi transpose,
175
00:09:22,850 --> 00:09:26,627
which is equal to the sum from--
176
00:09:26,627 --> 00:09:27,460
sorry there's a 1/n.
177
00:09:32,360 --> 00:09:34,820
So it's equal to 1/n
sum from i equal 1
178
00:09:34,820 --> 00:09:43,490
to n of PXi Xi
transpose P transpose.
179
00:09:43,490 --> 00:09:45,130
Agree?
180
00:09:45,130 --> 00:09:48,580
I just said that the transpose
of AB is the transpose of B
181
00:09:48,580 --> 00:09:53,830
times the transpose of A.
182
00:09:53,830 --> 00:09:55,900
And so now, I can
push the sum in.
183
00:09:55,900 --> 00:09:57,520
P does not depend on i.
184
00:09:57,520 --> 00:10:05,800
So this thing here is
equal to PS P transpose,
185
00:10:05,800 --> 00:10:10,130
because the sum of the Xi Xi
transpose divided by n is S.
186
00:10:10,130 --> 00:10:12,200
But what is PS P transpose?
187
00:10:12,200 --> 00:10:17,090
Well, we know that
S is equal to--
188
00:10:17,090 --> 00:10:19,340
sorry that's P transpose.
189
00:10:19,340 --> 00:10:20,880
So this was with a P transpose.
190
00:10:20,880 --> 00:10:23,420
I'm sorry, I made an
important mistake here.
191
00:10:23,420 --> 00:10:25,420
So Yi is P transpose Xi.
192
00:10:25,420 --> 00:10:27,440
So this is P transpose
and P transpose
193
00:10:27,440 --> 00:10:29,600
here, which means that
this is P transpose
194
00:10:29,600 --> 00:10:32,450
and this is double transpose,
which is just nothing
195
00:10:32,450 --> 00:10:34,150
so that transpose becomes nothing.
196
00:10:36,680 --> 00:10:41,600
So now, I write S
as PD P transpose.
197
00:10:41,600 --> 00:10:43,781
That's the spectral
decomposition
198
00:10:43,781 --> 00:10:44,530
that I had before.
199
00:10:44,530 --> 00:10:46,550
That's my eigenvalue
decomposition,
200
00:10:46,550 --> 00:10:49,050
which means that now,
if I look at S prime,
201
00:10:49,050 --> 00:10:56,000
it's P transpose times
PD P transpose P.
202
00:10:56,000 --> 00:10:58,300
But now, P transpose
P is the identity,
203
00:10:58,300 --> 00:11:00,250
P transpose P is the identity.
204
00:11:00,250 --> 00:11:06,646
So this is actually
just equal to D.
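As an illustrative check (synthetic data, not from the lecture): forming Yi = P transpose Xi and recomputing the empirical covariance should give exactly the diagonal matrix D of eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated data
X = X - X.mean(axis=0)             # assume X bar = 0, as in the lecture

S = X.T @ X / n                    # empirical covariance of the X_i's
lam, P = np.linalg.eigh(S)         # S = P diag(lam) P^T

Y = X @ P                          # row i of Y is (P^T X_i)^T
S_prime = Y.T @ Y / n              # empirical covariance of the Y_i's

# S' = P^T S P = D: diagonal, with the eigenvalues on the diagonal
assert np.allclose(S_prime, np.diag(lam))
```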
205
00:11:06,646 --> 00:11:08,270
And again, you can
check that this also
206
00:11:08,270 --> 00:11:12,840
works if you have to center
all those guys as you go.
207
00:11:12,840 --> 00:11:15,630
But if you think about
it, this is the same thing
208
00:11:15,630 --> 00:11:19,530
as saying that I just
replaced Xi by Xi minus X bar.
209
00:11:19,530 --> 00:11:26,590
And then it's true that Yi
is P transpose times Xi minus X bar.
210
00:11:26,590 --> 00:11:29,770
So now, we have that D is
the empirical covariance
211
00:11:29,770 --> 00:11:30,940
matrix of those guys--
212
00:11:30,940 --> 00:11:33,112
the Yi's, which are
P transpose Xi's.
213
00:11:33,112 --> 00:11:34,570
And so in particular,
what it means
214
00:11:34,570 --> 00:11:42,810
is that if I look at the
covariance of Yj Yk--
215
00:11:46,130 --> 00:11:48,920
So that's the covariance
of the j-th coordinate of Y
216
00:11:48,920 --> 00:11:51,650
and the k-th coordinate of Y.
I'm just not putting an index.
217
00:11:51,650 --> 00:11:53,720
But maybe, let's say the
first one or something
218
00:11:53,720 --> 00:11:56,142
like this-- any of
them, they're IID.
219
00:11:56,142 --> 00:11:57,350
Then what is this covariance?
220
00:11:57,350 --> 00:12:01,760
It's actually 0 if j
is different from k.
221
00:12:01,760 --> 00:12:06,590
And the covariance
between Yj and Yj,
222
00:12:06,590 --> 00:12:13,070
which is just the variance
of Yj, is equal to lambda j--
223
00:12:13,070 --> 00:12:17,300
the j-th largest eigenvalue.
224
00:12:17,300 --> 00:12:22,580
So the eigenvalues capture the
variance of my observations
225
00:12:22,580 --> 00:12:25,110
in this new coordinate system.
226
00:12:25,110 --> 00:12:26,632
And they're
completely orthogonal.
227
00:12:26,632 --> 00:12:27,590
So what does that mean?
228
00:12:27,590 --> 00:12:29,750
Well, again, remember,
if I chop off
229
00:12:29,750 --> 00:12:34,160
the head of my Gaussian
in multiple dimensions,
230
00:12:34,160 --> 00:12:35,780
we said that what
we started from
231
00:12:35,780 --> 00:12:39,560
was something that
looked like this.
232
00:12:39,560 --> 00:12:42,320
And we said, well, there's one
direction that's important,
233
00:12:42,320 --> 00:12:45,230
that's this guy, and one
important one that's this guy.
234
00:12:45,230 --> 00:12:48,200
When I applied a transformation
P transpose, what I'm doing
235
00:12:48,200 --> 00:12:51,110
is that I'm realigning this
thing with the new axes.
236
00:12:51,110 --> 00:12:53,660
Or in a way, rather
to be fair, I'm
237
00:12:53,660 --> 00:12:59,600
not actually realigning
the ellipses with the axes.
238
00:12:59,600 --> 00:13:02,690
I'm really realigning the
axes with the ellipses.
239
00:13:02,690 --> 00:13:05,360
So really, what I'm doing is
I'm saying, after I apply P,
240
00:13:05,360 --> 00:13:08,690
I'm just rotating this
coordinate system.
241
00:13:08,690 --> 00:13:12,670
So now, it becomes this guy.
242
00:13:19,360 --> 00:13:22,850
And now, my ellipses
actually completely align.
243
00:13:22,850 --> 00:13:25,730
And what happens here is
that this coordinate is
244
00:13:25,730 --> 00:13:27,110
independent of that coordinate.
245
00:13:27,110 --> 00:13:31,715
And that's what we write
here, if they are Gaussian.
246
00:13:31,715 --> 00:13:32,840
I didn't really tell you this--
247
00:13:32,840 --> 00:13:34,810
I'm only making statements
about covariances.
248
00:13:34,810 --> 00:13:36,768
If they are Gaussians,
those imply statements
249
00:13:36,768 --> 00:13:37,614
about independence.
250
00:13:40,960 --> 00:13:44,590
So as I said, the
variance now, lambda 1,
251
00:13:44,590 --> 00:13:54,700
is actually the variance
of P transpose Xi.
252
00:13:57,890 --> 00:14:00,140
But if I look now at
the-- so this is a vector,
253
00:14:00,140 --> 00:14:04,910
so I need to look at the
first coordinate of this guy.
254
00:14:08,490 --> 00:14:11,250
So it turns out that
doing this is actually
255
00:14:11,250 --> 00:14:15,440
the same thing as looking
at the variance of what?
256
00:14:15,440 --> 00:14:21,480
Well, the first
column of P times Xi.
257
00:14:21,480 --> 00:14:24,490
So that's the variance of--
258
00:14:24,490 --> 00:14:30,344
I'm going to call it v1
transpose Xi, where P--
259
00:14:44,390 --> 00:14:53,920
So the v1 to vd in Rd
are eigenvectors.
260
00:14:53,920 --> 00:14:57,190
And each vi is
associated to lambda i.
261
00:14:57,190 --> 00:14:59,740
So that's what we saw when
we talked about this eigen
262
00:14:59,740 --> 00:15:02,800
decomposition a
couple of slides back.
263
00:15:02,800 --> 00:15:06,040
That's the one here.
264
00:15:06,040 --> 00:15:10,310
So if I call the
columns of P v1 to vd,
265
00:15:10,310 --> 00:15:13,600
this is what's happening.
266
00:15:13,600 --> 00:15:16,030
So when I look at lambda
1, it's just the variance
267
00:15:16,030 --> 00:15:19,700
of Xi inner product with v1.
268
00:15:19,700 --> 00:15:22,180
And we made this picture
when we said, well,
269
00:15:22,180 --> 00:15:25,870
let's say v1 is here
and then x1 is here.
270
00:15:25,870 --> 00:15:31,180
And if v1 has unit
norm, then the inner product
271
00:15:31,180 --> 00:15:38,050
between Xi and v1 is just
the length of this guy here.
272
00:15:38,050 --> 00:15:41,020
So that's the variance of the
Xi's-- that's the length of Xi--
273
00:15:41,020 --> 00:15:43,720
so this is 0-- that's the
length of Xi when I project it
274
00:15:43,720 --> 00:15:46,750
onto the direction
that's spanned by v1.
275
00:15:46,750 --> 00:15:52,210
If v1 has length 2, this is
really just twice this length.
276
00:15:52,210 --> 00:15:56,340
If v1 has length 3,
it's three times this.
277
00:15:56,340 --> 00:16:01,570
But it turns out that since
P satisfies P transpose
278
00:16:01,570 --> 00:16:04,780
P is equal to the identity--
279
00:16:04,780 --> 00:16:07,900
that's an orthogonal
matrix, that's right here--
280
00:16:07,900 --> 00:16:11,470
then this is actually
saying the same thing
281
00:16:11,470 --> 00:16:18,760
as vj transpose vj, which is
really the norm squared of vj,
282
00:16:18,760 --> 00:16:20,800
is equal to 1.
283
00:16:20,800 --> 00:16:26,520
And vj transpose vk is equal
to 0, if j is different from k.
284
00:16:29,610 --> 00:16:31,560
The eigenvectors are
orthogonal to each other.
285
00:16:31,560 --> 00:16:33,050
And they're actually
all of norm 1.
286
00:16:37,390 --> 00:16:39,580
So now, I know that this
is indeed a direction.
287
00:16:39,580 --> 00:16:44,290
And so when I look
at v1 transpose Xi,
288
00:16:44,290 --> 00:16:46,240
I'm really measuring
exactly this length.
289
00:16:46,240 --> 00:16:47,460
And what is this length?
290
00:16:47,460 --> 00:16:49,660
It's the length of
the projection of Xi
291
00:16:49,660 --> 00:16:51,190
onto this line.
292
00:16:51,190 --> 00:16:53,920
That's the line
that's spanned by v1.
293
00:16:53,920 --> 00:16:57,680
So if I had a very high
dimensional problem
294
00:16:57,680 --> 00:17:01,460
and I started to look
at the direction v1--
295
00:17:01,460 --> 00:17:03,884
let's say v1 now is
not an eigenvector,
296
00:17:03,884 --> 00:17:08,270
it's any direction-- then
if I want to do this lower
297
00:17:08,270 --> 00:17:11,819
dimensional projection, then
I have to understand how those
298
00:17:11,819 --> 00:17:14,272
Xi's project onto the
line that's spanned by v1,
299
00:17:14,272 --> 00:17:16,730
because this is all that I'm
going to be keeping at the end
300
00:17:16,730 --> 00:17:17,646
of the day about Xi's.
301
00:17:20,170 --> 00:17:23,200
So what we want is
to find the direction
302
00:17:23,200 --> 00:17:25,240
where those Xi's,
those projections,
303
00:17:25,240 --> 00:17:26,361
have a lot of variance.
304
00:17:26,361 --> 00:17:28,569
And we know that the variance
of Xi on this direction
305
00:17:28,569 --> 00:17:30,490
is actually exactly
given by lambda 1.
306
00:17:36,890 --> 00:17:40,490
Sorry, that's the
empirical var--
307
00:17:40,490 --> 00:17:42,480
yeah, I should
call variance hat.
308
00:17:42,480 --> 00:17:43,730
That's the empirical variance.
309
00:17:43,730 --> 00:17:45,063
Everything is in empirical here.
310
00:17:45,063 --> 00:17:48,680
We're talking about the
empirical covariance matrix.
311
00:17:48,680 --> 00:17:54,150
And so I also have that lambda
2 is the empirical variance
312
00:17:54,150 --> 00:17:59,160
of when I project Xi onto
v2, which is the second one,
313
00:17:59,160 --> 00:18:00,600
just for exactly this reason.
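A quick numerical sanity check of this point (synthetic data, with made-up per-coordinate spreads): the empirical variance of the projections vj transpose Xi should equal lambda j for every j:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 3
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 0.5])  # unequal spreads
X = X - X.mean(axis=0)             # center, so X bar = 0

S = X.T @ X / n
lam, P = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]      # decreasing: lambda_1 >= ... >= lambda_d
lam, P = lam[order], P[:, order]

# The empirical variance of the projections v_j^T X_i is lambda_j.
for j in range(d):
    proj = X @ P[:, j]                       # v_j^T X_i for each i
    assert np.isclose(proj.var(), lam[j])    # population variance (ddof=0)
```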
314
00:18:07,474 --> 00:18:08,456
Any question?
315
00:18:14,170 --> 00:18:16,830
So lambda j's are going
to be important for us.
316
00:18:16,830 --> 00:18:19,320
Lambda j measures the
spread of the points
317
00:18:19,320 --> 00:18:22,530
when I project them onto a
line which is a one dimensional
318
00:18:22,530 --> 00:18:23,259
space.
319
00:18:23,259 --> 00:18:25,800
And so I'm going to have-- let's
say I want to pick only one,
320
00:18:25,800 --> 00:18:28,133
I'm going to have to find the
one dimensional space that
321
00:18:28,133 --> 00:18:29,690
carries the most variance.
322
00:18:29,690 --> 00:18:32,070
And I claim that
v1 is the one that
323
00:18:32,070 --> 00:18:35,280
actually maximizes the spread.
324
00:18:35,280 --> 00:18:55,900
So the claim-- so for
any direction, u in Rd--
325
00:18:55,900 --> 00:18:59,380
and by direction, I really
just mean that the norm of u
326
00:18:59,380 --> 00:19:00,920
is equal to 1.
327
00:19:00,920 --> 00:19:02,020
I need to play fair--
328
00:19:02,020 --> 00:19:04,690
I'm going to compare myself to
other things of length one,
329
00:19:04,690 --> 00:19:07,600
so I need to play fair and
look at directions of length 1.
330
00:19:07,600 --> 00:19:16,321
Now, if I'm interested
in the empirical variance
331
00:19:16,321 --> 00:19:20,875
of X1 transpose--
332
00:19:20,875 --> 00:19:29,150
sorry, u transpose X1 through u
transpose Xn, then this thing
333
00:19:29,150 --> 00:19:37,950
is maximized for
u equals v1, where
334
00:19:37,950 --> 00:19:40,610
v1 is the eigenvector
associated to lambda 1
335
00:19:40,610 --> 00:19:42,110
and lambda 1 is not
just any eigenvalue,
336
00:19:42,110 --> 00:19:45,090
it's the largest of all those.
337
00:19:45,090 --> 00:19:46,992
So it's the largest eigenvalue.
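This claim can be probed numerically (synthetic, illustrative data): no unit direction u should give empirical variance above lambda 1, and u = v1 attains it:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated cloud
X = X - X.mean(axis=0)
S = X.T @ X / n

lam, P = np.linalg.eigh(S)
v1, lam1 = P[:, -1], lam[-1]       # eigh is ascending: last one is largest

# Empirical variance along unit u is u^T S u <= lambda_1, equality at v1.
assert np.isclose(v1 @ S @ v1, lam1)
for _ in range(100):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)         # a random unit direction
    assert u @ S @ u <= lam1 + 1e-10
```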
338
00:19:50,607 --> 00:19:51,440
So why is that true?
339
00:19:55,410 --> 00:20:00,840
Well, there's also a claim
that for any direction u--
340
00:20:00,840 --> 00:20:03,380
so that's 1 and 2--
341
00:20:03,380 --> 00:20:08,990
the variance of u
transpose X-- now,
342
00:20:08,990 --> 00:20:11,900
this is just a random variable,
and I'm looking about the true
343
00:20:11,900 --> 00:20:13,040
variance--
344
00:20:13,040 --> 00:20:27,440
this is maximized for u
equals, let's call it w1,
345
00:20:27,440 --> 00:20:38,320
where w1 is the
eigenvector of sigma--
346
00:20:38,320 --> 00:20:40,204
Now, I'm talking about
the true variance.
347
00:20:40,204 --> 00:20:42,620
Whereas, here, I was talking
about the empirical variance.
348
00:20:42,620 --> 00:20:44,950
So the true variance
is maximized by the eigenvector
349
00:20:44,950 --> 00:20:55,630
of the true sigma
associated to the largest
350
00:20:55,630 --> 00:20:59,554
eigenvalue of sigma.
351
00:21:02,870 --> 00:21:04,270
So I did not give it a name.
352
00:21:04,270 --> 00:21:06,285
Here, that was lambda 1
for the empirical one.
353
00:21:06,285 --> 00:21:07,660
For the true one,
you can give it
354
00:21:07,660 --> 00:21:10,330
another name, mu 1 if you want.
355
00:21:10,330 --> 00:21:13,407
But that's just the same thing.
356
00:21:13,407 --> 00:21:15,490
All it's saying is like,
wherever I see empirical,
357
00:21:15,490 --> 00:21:16,156
I can remove it.
358
00:21:27,690 --> 00:21:29,815
So why is this claim true?
359
00:21:29,815 --> 00:21:31,815
Well, let's look at the
second one, for example.
360
00:21:38,180 --> 00:21:44,480
So what is the variance
of u transpose X?
361
00:21:44,480 --> 00:21:47,570
So that's what I want to know.
362
00:21:47,570 --> 00:21:54,850
So that's the expectation--
so let's assume that the mean of X is 0,
363
00:21:54,850 --> 00:21:56,711
again, for the same
reasons as before.
364
00:21:56,711 --> 00:21:57,710
So what is the variance?
365
00:21:57,710 --> 00:21:59,410
It's just the expectation
of the square.
366
00:22:06,460 --> 00:22:08,260
I don't need to remove
the expectation.
367
00:22:08,260 --> 00:22:10,870
And the expectation
of the square is just
368
00:22:10,870 --> 00:22:12,700
the expectation
of u transpose X.
369
00:22:12,700 --> 00:22:15,250
And then I'm going to write
the other one X transpose u.
370
00:22:19,510 --> 00:22:22,360
And we know that this
is deterministic.
371
00:22:22,360 --> 00:22:25,570
So I'm just going to take
that this is just u transpose
372
00:22:25,570 --> 00:22:31,995
expectation of X X transpose u.
373
00:22:31,995 --> 00:22:32,870
And what is this guy?
374
00:22:39,305 --> 00:22:40,760
That's covariance sigma.
375
00:22:40,760 --> 00:22:41,870
That's just what sigma is.
376
00:22:44,730 --> 00:22:48,590
So the variance I can write
as u transpose sigma u.
377
00:22:48,590 --> 00:22:51,272
We've made this
computation before.
378
00:22:51,272 --> 00:22:53,730
And now what I want to claim
is that this thing is actually
379
00:22:53,730 --> 00:22:57,275
less than the largest
eigenvalue, which I actually
380
00:22:57,275 --> 00:22:58,150
called lambda 1 here.
381
00:22:58,150 --> 00:22:59,680
I should probably not.
382
00:22:59,680 --> 00:23:01,100
And the P is-- well, OK.
383
00:23:06,430 --> 00:23:11,260
Let's just pretend
everything is not empirical.
384
00:23:11,260 --> 00:23:22,580
So now, I'm going to write
sigma as P lambda 1 to lambda d P
385
00:23:22,580 --> 00:23:23,180
transpose.
386
00:23:23,180 --> 00:23:25,010
That's just the
eigendecomposition,
387
00:23:25,010 --> 00:23:32,090
where I admittedly reuse the
same notation as I did for S.
388
00:23:32,090 --> 00:23:34,764
So I should really put
some primes everywhere,
389
00:23:34,764 --> 00:23:36,680
so you know those are
things that are actually
390
00:23:36,680 --> 00:23:38,630
different in practice.
391
00:23:38,630 --> 00:23:43,469
So this is just the
decomposition of sigma.
392
00:23:43,469 --> 00:23:44,510
You seem confused, Helen.
393
00:23:44,510 --> 00:23:47,570
You have a question?
394
00:23:47,570 --> 00:23:48,070
Yeah?
395
00:23:48,070 --> 00:23:53,830
AUDIENCE: What is-- when you
talked about the empirical data
396
00:23:53,830 --> 00:23:55,750
and--
397
00:23:55,750 --> 00:23:56,880
PHILIPPE RIGOLLET: So OK--
398
00:24:00,670 --> 00:24:02,801
so I can make
everything I'm saying,
399
00:24:02,801 --> 00:24:04,300
I can talk about
either the variance
400
00:24:04,300 --> 00:24:05,470
or the empirical variance.
401
00:24:05,470 --> 00:24:07,720
And you can just add the
word empirical in front of it
402
00:24:07,720 --> 00:24:08,680
whenever you want.
403
00:24:08,680 --> 00:24:09,910
The same thing works.
404
00:24:09,910 --> 00:24:13,120
But just for the sake of
removing the confusion,
405
00:24:13,120 --> 00:24:20,409
let's just do it again
with S. So I'm just
406
00:24:20,409 --> 00:24:21,950
going to do everything
with S. So I'm
407
00:24:21,950 --> 00:24:24,650
going to assume that
X bar is equal to 0.
408
00:24:24,650 --> 00:24:27,780
And here, I'm going to talk
about the empirical variance,
409
00:24:27,780 --> 00:24:31,530
which is just 1/n
sum from i equal 1
410
00:24:31,530 --> 00:24:35,272
to n of u transpose Xi squared.
411
00:24:35,272 --> 00:24:36,230
So it's the same thing.
412
00:24:36,230 --> 00:24:37,646
Everywhere you see
an expectation,
413
00:24:37,646 --> 00:24:39,110
you just put in average.
414
00:24:45,930 --> 00:24:50,850
And then I get 1/n
sum from i equal 1
415
00:24:50,850 --> 00:24:53,032
to n of Xi Xi transpose.
416
00:24:53,032 --> 00:24:54,490
And now, I'm going
to call this guy
417
00:24:54,490 --> 00:24:58,200
S, because that's what it is.
418
00:24:58,200 --> 00:24:59,994
So this is u transpose Su.
419
00:24:59,994 --> 00:25:02,410
But given that I could
just replace the expectation
420
00:25:02,410 --> 00:25:03,910
by averages everywhere,
you can tell
421
00:25:03,910 --> 00:25:06,590
that the thing is going to work
for either one or the other.
422
00:25:06,590 --> 00:25:08,491
So now, this thing
was actually-- so now,
423
00:25:08,491 --> 00:25:10,240
I don't have any problem
with my notation.
424
00:25:10,240 --> 00:25:14,310
This is actually the
decomposition of S.
425
00:25:14,310 --> 00:25:16,030
That's just the
spectral decomposition
426
00:25:16,030 --> 00:25:18,840
and it's to its eigenvalues.
427
00:25:18,840 --> 00:25:27,080
And so now, what I have is that
when I look at u transpose Su,
428
00:25:27,080 --> 00:25:34,920
this is actually equal
to P u transpose S Pu.
429
00:25:39,294 --> 00:25:40,500
OK.
430
00:25:40,500 --> 00:25:41,750
There's a transpose somewhere.
431
00:25:41,750 --> 00:25:42,416
That's this guy.
432
00:25:45,300 --> 00:25:46,161
And that's this guy.
433
00:25:57,057 --> 00:26:00,220
Now-- sorry, that's
not P, that's
434
00:26:00,220 --> 00:26:05,000
D. That's D, that's
this diagonal matrix.
435
00:26:10,269 --> 00:26:11,310
Let's look at this thing.
436
00:26:11,310 --> 00:26:15,810
And let's call P transpose
u, let's call it b.
437
00:26:15,810 --> 00:26:18,705
So that's also a vector in Rd.
438
00:26:18,705 --> 00:26:19,530
What is it?
439
00:26:19,530 --> 00:26:21,370
It's just, I take a
unit vector, and then
440
00:26:21,370 --> 00:26:23,020
I apply P transpose to it.
441
00:26:23,020 --> 00:26:25,740
So that's basically what
happens to a unit vector
442
00:26:25,740 --> 00:26:29,820
when I apply the same
change of basis that I did.
443
00:26:29,820 --> 00:26:34,650
So I'm just changing my
orthogonal system the same way
444
00:26:34,650 --> 00:26:36,360
I did for the other ones.
445
00:26:36,360 --> 00:26:38,940
So what's happening
when I write this?
446
00:26:38,940 --> 00:26:46,590
Well, now I have that u
transpose Su is b transpose Db.
447
00:26:46,590 --> 00:26:50,310
But now, doing b transpose
Db when D is diagonal
448
00:26:50,310 --> 00:26:52,690
and b is a vector is
a very simple thing.
449
00:26:52,690 --> 00:26:53,910
I can expand it.
450
00:26:53,910 --> 00:26:54,480
This is what?
451
00:26:54,480 --> 00:26:57,120
This is just the
sum from j equal 1
452
00:26:57,120 --> 00:27:01,650
to d of lambda j bj squared.
453
00:27:05,386 --> 00:27:08,947
So that's just like matrix
vector multiplication.
454
00:27:08,947 --> 00:27:11,280
And in particular, I know
that the largest of those guys
455
00:27:11,280 --> 00:27:14,010
is lambda 1 and those
guys are all non-negative.
456
00:27:14,010 --> 00:27:16,705
So this thing is actually
less than lambda 1 times
457
00:27:16,705 --> 00:27:20,430
the sum from j equal 1 to
d of lambda j squared--
458
00:27:23,330 --> 00:27:24,490
sorry, bj squared.
459
00:27:27,560 --> 00:27:34,010
And this is just the
norm of b squared.
460
00:27:34,010 --> 00:27:38,320
So if I want to prove what's on
the slide, all I need to check
461
00:27:38,320 --> 00:27:40,965
is that b has norm, which is--
462
00:27:40,965 --> 00:27:41,935
AUDIENCE: 1.
463
00:27:41,935 --> 00:27:43,910
PHILIPPE RIGOLLET: At most, 1.
464
00:27:43,910 --> 00:27:45,090
It's going to be at most 1.
465
00:27:45,090 --> 00:27:45,780
Why?
466
00:27:45,780 --> 00:27:51,690
Well, because b is really
just a change of basis for u.
467
00:27:51,690 --> 00:27:55,650
And so if I take a vector,
I'm just changing its basis.
468
00:27:55,650 --> 00:27:57,540
I'm certainly not
changing its length--
469
00:27:57,540 --> 00:27:59,580
think of a rotation,
and I can also flip it,
470
00:27:59,580 --> 00:28:00,790
but think of a rotation--
471
00:28:02,839 --> 00:28:05,380
well, actually, for vector, it's
just going to be a rotation.
472
00:28:05,380 --> 00:28:06,850
And so now, what
I have I just have
473
00:28:06,850 --> 00:28:11,970
to check that the norm of
b squared is equal to what?
474
00:28:11,970 --> 00:28:16,470
Well, it's equal to the norm
of P transpose u squared,
475
00:28:16,470 --> 00:28:21,620
which is equal to u
transpose P P transpose u.
476
00:28:21,620 --> 00:28:23,000
But P is orthogonal.
477
00:28:23,000 --> 00:28:26,210
So this thing is actually
just the identity.
478
00:28:26,210 --> 00:28:28,307
So that's just u
transpose u, which
479
00:28:28,307 --> 00:28:33,260
is equal to the norm u
squared, which is equal to 1,
480
00:28:33,260 --> 00:28:37,070
because I took u to have
norm 1 in the first place.
481
00:28:37,070 --> 00:28:39,640
And so this-- you're right--
was actually of norm equal to 1.
482
00:28:39,640 --> 00:28:42,017
I just needed to have
it less, but it's equal.
483
00:28:42,017 --> 00:28:44,350
And so what I'm left with is
that this thing is actually
484
00:28:44,350 --> 00:28:45,820
equal to lambda 1.
485
00:28:45,820 --> 00:28:50,030
So I know that for
every u that I pick--
486
00:28:50,030 --> 00:28:52,890
that has norm--
487
00:28:52,890 --> 00:28:55,030
So I'm just reminding
you that u here
488
00:28:55,030 --> 00:28:57,730
has norm squared equal to 1.
489
00:28:57,730 --> 00:29:00,760
For every u that I
pick, this u transpose
490
00:29:00,760 --> 00:29:02,890
Su is at most lambda 1.
491
00:29:06,400 --> 00:29:11,250
So the maximum of u transpose
Su is at most lambda 1.
492
00:29:11,250 --> 00:29:13,270
And we know that that's
the variance, that's
493
00:29:13,270 --> 00:29:15,790
the empirical variance,
when I project my points
494
00:29:15,790 --> 00:29:17,500
onto direction spanned by u.
495
00:29:20,240 --> 00:29:23,040
So now, I have an
empirical variance,
496
00:29:23,040 --> 00:29:24,650
which is at most lambda 1.
497
00:29:24,650 --> 00:29:28,457
But I also know that if I take u
to be something very specific--
498
00:29:28,457 --> 00:29:30,040
I mean, it was on
the previous board--
499
00:29:30,040 --> 00:29:32,510
if I take u to be
equal to v1, then
500
00:29:32,510 --> 00:29:35,270
this thing is actually
not an inequality,
501
00:29:35,270 --> 00:29:37,160
this is an equality.
502
00:29:37,160 --> 00:29:41,990
And the reason is, when I
actually take u to be v1,
503
00:29:41,990 --> 00:29:46,410
all of these bj's are going to
be 0, except for the one that's
504
00:29:46,410 --> 00:29:50,360
b1, which is itself equal to 1.
505
00:29:50,360 --> 00:29:52,190
So I mean, we can
briefly check this.
506
00:29:52,190 --> 00:29:53,738
But if I take v--
507
00:29:59,106 --> 00:30:07,100
if u is equal to v1, what
I have is that u transpose
508
00:30:07,100 --> 00:30:24,800
Su is equal to P transpose
v1 D P transpose v1.
509
00:30:24,800 --> 00:30:26,680
But what is P transpose v1?
510
00:30:26,680 --> 00:30:31,960
Well, remember P
transpose is just
511
00:30:31,960 --> 00:30:34,820
the matrix that has
vectors v1 transpose here,
512
00:30:34,820 --> 00:30:40,110
v2 transpose here, all the
way to vd transpose here.
513
00:30:40,110 --> 00:30:45,570
And we know that when I take
vj transpose vk, I get 0,
514
00:30:45,570 --> 00:30:46,680
if j is different from k.
515
00:30:46,680 --> 00:30:49,620
And if j is equal to k, I get 1.
516
00:30:49,620 --> 00:30:53,690
So P transpose v1
is equal to what?
517
00:31:05,040 --> 00:31:06,570
Take v1 here and multiply it.
518
00:31:06,570 --> 00:31:08,250
So the first coordinate
is going to be
519
00:31:08,250 --> 00:31:12,870
v1 transpose v1, which is 1.
520
00:31:12,870 --> 00:31:14,370
The second coordinate
is going to be
521
00:31:14,370 --> 00:31:19,030
v2 transpose v1, which is 0.
522
00:31:19,030 --> 00:31:22,740
And so I get 0's
all the way, right?
523
00:31:22,740 --> 00:31:25,470
So that means that this
thing here is really
524
00:31:25,470 --> 00:31:29,040
just the vector 1, 0, 0.
525
00:31:29,040 --> 00:31:32,220
And here, this is just
the vector 1, 0, 0.
526
00:31:32,220 --> 00:31:34,100
So when I multiply
it with this guy,
527
00:31:34,100 --> 00:31:37,980
I am only picking up
the top left element
528
00:31:37,980 --> 00:31:41,740
of D, which is lambda 1.
529
00:31:41,740 --> 00:31:44,940
So for every u,
it's at most lambda 1.
530
00:31:44,940 --> 00:31:46,950
And for v1, it's
equal to lambda 1,
531
00:31:46,950 --> 00:31:52,590
which means that it's
maximized for u equals v1.
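[This whole argument -- u transpose Su is at most lambda 1 for every unit u, with equality at v1 -- can be verified directly. A numpy sketch on synthetic data; note that numpy's eigh returns eigenvalues in ascending order, so the leading pair sits at the end.]

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
evals, evecs = np.linalg.eigh(S)
lambda1 = evals[-1]
v1 = evecs[:, -1]

# v1 attains the bound: v1^T S v1 = lambda_1.
assert np.isclose(v1 @ S @ v1, lambda1)

# And no random unit vector exceeds it.
for _ in range(1000):
    u = rng.normal(size=4)
    u /= np.linalg.norm(u)
    assert u @ S @ u <= lambda1 + 1e-12
```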
532
00:31:52,590 --> 00:31:54,480
And that's where
I said that this
533
00:31:54,480 --> 00:31:57,527
is the fanciest non-convex
problem we know how to solve.
534
00:31:57,527 --> 00:31:59,610
This was a problem that
was definitely non-convex.
535
00:31:59,610 --> 00:32:02,820
We were maximizing a convex
function over a sphere.
536
00:32:02,820 --> 00:32:06,156
But we know that v1,
which is something--
537
00:32:06,156 --> 00:32:07,530
I mean, of course,
you still have
538
00:32:07,530 --> 00:32:08,946
to believe me that
you can compute
539
00:32:08,946 --> 00:32:11,670
the spectral decomposition
efficiently--
540
00:32:11,670 --> 00:32:14,880
but essentially, if you've
taken linear algebra,
541
00:32:14,880 --> 00:32:17,020
you know that you can
diagonalize a matrix.
542
00:32:17,020 --> 00:32:19,797
And so you get that v1
is just the maximum.
543
00:32:19,797 --> 00:32:21,630
So you can find your
maximum just by looking
544
00:32:21,630 --> 00:32:24,109
at the spectral decomposition.
545
00:32:24,109 --> 00:32:25,650
You don't have to
do any optimization
546
00:32:25,650 --> 00:32:28,790
or anything like this.
547
00:32:28,790 --> 00:32:29,870
So let's recap.
548
00:32:29,870 --> 00:32:32,390
Where are we?
549
00:32:32,390 --> 00:32:34,160
We've established
that if I start
550
00:32:34,160 --> 00:32:37,820
with my empirical covariance
matrix, I can diagonalize it
551
00:32:37,820 --> 00:32:42,270
as P D P transpose.
552
00:32:42,270 --> 00:32:44,250
And then if I take the
eigenvector associated
553
00:32:44,250 --> 00:32:48,630
to the largest eigenvalues-- so
if I permute the columns of P
554
00:32:48,630 --> 00:32:50,810
and of D in such
a way that they
555
00:32:50,810 --> 00:32:53,520
are ordered from the
largest to the smallest when
556
00:32:53,520 --> 00:32:56,490
I look at the diagonal
elements of D,
557
00:32:56,490 --> 00:32:59,430
then if I pick the first
column of P, it's v1.
558
00:32:59,430 --> 00:33:04,750
And v1 is the direction on
which, if I project my points,
559
00:33:04,750 --> 00:33:08,090
they are going to carry the
most empirical variance.
560
00:33:08,090 --> 00:33:09,090
Well, that's a good way.
561
00:33:09,090 --> 00:33:13,064
If I told you,
pick one direction
562
00:33:13,064 --> 00:33:14,980
along which if you were
to project your points
563
00:33:14,980 --> 00:33:17,313
they would be as spread out
as possible, that's probably
564
00:33:17,313 --> 00:33:19,270
the one you would pick.
565
00:33:19,270 --> 00:33:22,160
And so that's exactly
what PCA is doing for us.
566
00:33:22,160 --> 00:33:28,780
It says, OK, if you ask me
to take d prime equal to 1,
567
00:33:28,780 --> 00:33:31,510
I will take v1.
568
00:33:31,510 --> 00:33:33,892
I will just take the direction
that's spanned by v1.
569
00:33:33,892 --> 00:33:36,100
And that's just when I come
back to this picture that
570
00:33:36,100 --> 00:33:43,750
was here before, this is v1.
571
00:33:43,750 --> 00:33:45,970
Of course, here, I
only have two of them.
572
00:33:45,970 --> 00:33:48,580
So v2 has to be this
guy, or this guy,
573
00:33:48,580 --> 00:33:49,940
or I mean or this thing.
574
00:33:49,940 --> 00:33:53,060
I mean, I only know
them up to sign.
575
00:33:53,060 --> 00:33:55,600
But then if I have three--
576
00:33:55,600 --> 00:33:58,090
think of like an olive
in three dimensions--
577
00:33:58,090 --> 00:34:00,550
then maybe I have one
direction that's slightly more
578
00:34:00,550 --> 00:34:02,180
elongated than the other one.
579
00:34:02,180 --> 00:34:04,480
And so I'm going to
pick the second one.
580
00:34:04,480 --> 00:34:07,330
And so the procedure is
to say, well, first, I'm
581
00:34:07,330 --> 00:34:11,194
going to pick v1 the same way
I pick v1 in the first place.
582
00:34:11,194 --> 00:34:12,610
So the first
direction I am taking
583
00:34:12,610 --> 00:34:14,620
is the leading eigenvector.
584
00:34:14,620 --> 00:34:18,199
And then I'm looking
for a direction.
585
00:34:18,199 --> 00:34:20,719
Well, if I found
one-- the one I'm
586
00:34:20,719 --> 00:34:23,239
going to want to find-- if you
say you can take d equal 2,
587
00:34:23,239 --> 00:34:24,949
you're going to need
the basis for this guy.
588
00:34:24,949 --> 00:34:27,240
So the second one has to be
orthogonal to the first one
589
00:34:27,240 --> 00:34:28,705
you've already picked.
590
00:34:28,705 --> 00:34:30,080
And so the second
one you pick is
591
00:34:30,080 --> 00:34:31,940
the one that's just,
among all those that
592
00:34:31,940 --> 00:34:36,529
are orthogonal to v1, maximizes
the empirical variance
593
00:34:36,529 --> 00:34:37,570
when you project onto it.
594
00:34:40,100 --> 00:34:44,000
And it turns out that this
is actually exactly v2.
595
00:34:44,000 --> 00:34:46,153
You don't have to
redo anything again.
596
00:34:46,153 --> 00:34:47,569
Your eigendecomposition,
this is
597
00:34:47,569 --> 00:34:54,690
just the second column
of P. Clearly, v2
598
00:34:54,690 --> 00:34:56,120
is orthogonal to v1.
599
00:34:56,120 --> 00:34:58,890
We just used it here.
600
00:34:58,890 --> 00:35:03,730
This 0 here just says this
v2 is orthogonal to v1.
601
00:35:03,730 --> 00:35:05,770
So they're like this.
602
00:35:05,770 --> 00:35:06,940
And now, what I said--
603
00:35:06,940 --> 00:35:08,530
what this slide
tells you extra--
604
00:35:08,530 --> 00:35:10,670
is that v2 among all
those directions that are
605
00:35:10,670 --> 00:35:11,170
orthogonal--
606
00:35:11,170 --> 00:35:13,610
I mean, there's still
d minus 1 of them--
607
00:35:13,610 --> 00:35:16,030
this is the one that
maximizes the, say,
608
00:35:16,030 --> 00:35:18,730
residual empirical
variance-- the one that
609
00:35:18,730 --> 00:35:21,950
was not explained by the first
direction that you picked.
610
00:35:21,950 --> 00:35:22,910
And you can check that.
611
00:35:22,910 --> 00:35:27,200
I mean, it's becoming a bit
more cumbersome to write down,
612
00:35:27,200 --> 00:35:28,760
but you can check that.
613
00:35:28,760 --> 00:35:32,130
If you're not convinced,
please raise your concern.
614
00:35:32,130 --> 00:35:38,641
I mean, basically, one
way to view this is--
615
00:35:38,641 --> 00:35:40,640
I mean, you're not really
dropping a coordinate,
616
00:35:40,640 --> 00:35:42,420
because v1 is not a coordinate.
617
00:35:42,420 --> 00:35:46,040
But let's assume actually for
simplicity that v1 was actually
618
00:35:46,040 --> 00:35:49,730
equal to e1, that the direction
that carries the most variance
619
00:35:49,730 --> 00:35:51,440
is the one that
just says, just look
620
00:35:51,440 --> 00:35:56,520
at the first coordinate of X.
So if that was the case, then
621
00:35:56,520 --> 00:35:58,380
clearly the orthogonal
directions are
622
00:35:58,380 --> 00:36:03,420
the ones that comprise only
of the coordinates 2 to d.
623
00:36:03,420 --> 00:36:05,670
So you could actually just
drop the first coordinate
624
00:36:05,670 --> 00:36:08,460
and do the same thing on
a slightly shorter vector
625
00:36:08,460 --> 00:36:10,129
of length d minus 1.
626
00:36:10,129 --> 00:36:12,420
And then you would just look
at the largest eigenvector
627
00:36:12,420 --> 00:36:14,530
of these guys, et
cetera, et cetera.
628
00:36:14,530 --> 00:36:16,230
So in a way, that's
what's happening,
629
00:36:16,230 --> 00:36:19,200
except that you rotate it
before you actually do this.
630
00:36:19,200 --> 00:36:22,260
And that's exactly
what's happening.
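[The "rotate, then drop the first coordinate" picture is exactly deflation: subtract the lambda 1 component from S and take the leading eigenvector of what's left. A numpy sketch on synthetic data -- up to sign, it recovers v2.]

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

evals, evecs = np.linalg.eigh(S)       # ascending order
v1, v2 = evecs[:, -1], evecs[:, -2]

# Deflate: remove the lambda_1 component of S, then take the
# leading eigenvector of the remainder.
S_deflated = S - evals[-1] * np.outer(v1, v1)
w = np.linalg.eigh(S_deflated)[1][:, -1]

# w matches v2 up to sign, as claimed.
assert np.isclose(abs(w @ v2), 1.0)
```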
631
00:36:22,260 --> 00:36:30,890
So what we put together here
is essentially three things.
632
00:36:30,890 --> 00:36:32,220
One was statistics.
633
00:36:32,220 --> 00:36:34,690
Statistics says, if
you want spread,
634
00:36:34,690 --> 00:36:39,230
if you want information, you
should be looking at variance.
635
00:36:39,230 --> 00:36:40,820
The second one was optimization.
636
00:36:40,820 --> 00:36:44,870
Optimization said, well, if you
want to maximize spread, well,
637
00:36:44,870 --> 00:36:48,260
you have to maximize variance
in a certain direction.
638
00:36:48,260 --> 00:36:51,920
And that means maximizing
over the sphere of vectors
639
00:36:51,920 --> 00:36:54,510
that have unit norm.
640
00:36:54,510 --> 00:36:56,720
And that's an optimization
problem, which actually
641
00:36:56,720 --> 00:36:58,310
turned out to be difficult.
642
00:36:58,310 --> 00:37:00,800
But then the third thing that
we use to solve this problem
643
00:37:00,800 --> 00:37:01,830
was linear algebra.
644
00:37:01,830 --> 00:37:03,410
Linear algebra
said, well, it looks
645
00:37:03,410 --> 00:37:05,450
like it's a difficult
optimization problem.
646
00:37:05,450 --> 00:37:08,410
But it turns out that the
answer comes in almost--
647
00:37:08,410 --> 00:37:11,210
I mean, it's not a closed form,
but those things are so used,
648
00:37:11,210 --> 00:37:12,590
that it's almost a closed form--
649
00:37:12,590 --> 00:37:17,240
says, just pick the
eigenvectors in order
650
00:37:17,240 --> 00:37:20,480
of their associated eigenvalues
from largest to smallest.
651
00:37:23,020 --> 00:37:24,940
And that's why principal
component analysis
652
00:37:24,940 --> 00:37:29,080
has been so popular and has
gained huge amount of traction
653
00:37:29,080 --> 00:37:33,760
since we had computers that were
allowed to compute eigenvalues
654
00:37:33,760 --> 00:37:37,429
and eigenvectors for
matrices of gigantic sizes.
655
00:37:37,429 --> 00:37:38,470
You can actually do that.
656
00:37:38,470 --> 00:37:39,760
If I give you--
657
00:37:39,760 --> 00:37:42,340
I don't know, this Google
video, for example,
658
00:37:42,340 --> 00:37:43,750
is talking about words.
659
00:37:43,750 --> 00:37:45,970
They want to do just the,
say, principal component
660
00:37:45,970 --> 00:37:47,380
analysis of words.
661
00:37:47,380 --> 00:37:50,230
So I give you all the
words in the dictionary.
662
00:37:50,230 --> 00:37:53,500
And-- sorry, well,
you would have
663
00:37:53,500 --> 00:37:55,090
to have a representation
for words,
664
00:37:55,090 --> 00:37:59,500
so it's a little more
difficult. But how do I do this?
665
00:38:03,980 --> 00:38:06,382
Let's say, for example,
pages of a book.
666
00:38:06,382 --> 00:38:08,090
I want to understand
the pages of a book.
667
00:38:08,090 --> 00:38:10,580
And I need to turn
it into a number.
668
00:38:10,580 --> 00:38:13,150
And a page of a book is
basically the word count.
669
00:38:13,150 --> 00:38:15,350
So I just count the number
of times "the" shows up,
670
00:38:15,350 --> 00:38:18,140
the number of times "and"
shows up, number of times "dog"
671
00:38:18,140 --> 00:38:19,100
shows up.
672
00:38:19,100 --> 00:38:20,934
And so that gives me a vector.
673
00:38:20,934 --> 00:38:22,225
It's in pretty high dimensions.
674
00:38:22,225 --> 00:38:25,350
It's as many dimensions as there
are words in the dictionary.
675
00:38:25,350 --> 00:38:28,310
And now, I want to visualize
how those pages get together--
676
00:38:28,310 --> 00:38:30,450
are two pages very
similar or not.
677
00:38:30,450 --> 00:38:32,630
And so what you would
do is essentially
678
00:38:32,630 --> 00:38:35,470
just compute the largest
eigenvector of this matrix--
679
00:38:35,470 --> 00:38:38,925
maybe the two largest-- and
then project this into a plane.
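[A minimal numpy sketch of the pages-of-a-book example: the pages, words, and counts here are made up, but the pipeline -- word counts, center, covariance, project onto the two leading eigenvectors -- is the one just described.]

```python
import numpy as np
from collections import Counter

# Hypothetical "pages": each page becomes a word-count vector,
# with one dimension per word in the vocabulary.
pages = [
    "the dog and the cat",
    "the dog chased the dog",
    "stocks and bonds and stocks",
    "bonds the stocks the bonds",
]
vocab = sorted({w for p in pages for w in p.split()})
X = np.array([[Counter(p.split())[w] for w in vocab] for p in pages], float)

# Center, form S, and project onto the two leading eigenvectors.
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)
evecs = np.linalg.eigh(S)[1]
P2 = evecs[:, -2:]          # top-2 directions as columns
Y = X @ P2                  # each page is now a point in the plane

assert Y.shape == (4, 2)
```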
680
00:38:38,925 --> 00:38:39,425
Yeah.
681
00:38:39,425 --> 00:38:41,325
AUDIENCE: Can we assume
the number of points
682
00:38:41,325 --> 00:38:43,060
was far larger
than the dimension?
683
00:38:43,060 --> 00:38:44,560
PHILIPPE RIGOLLET:
Yeah, but there's
684
00:38:44,560 --> 00:38:46,834
many pages in the world.
685
00:38:46,834 --> 00:38:48,500
There's probably more
pages in the world
686
00:38:48,500 --> 00:38:50,154
than there's words
in the dictionary.
687
00:38:54,960 --> 00:38:57,185
Yeah, so of course, if
you are in high dimensions
688
00:38:57,185 --> 00:38:58,560
and you don't have
enough points,
689
00:38:58,560 --> 00:39:00,240
it's going to be
clearly an issue.
690
00:39:00,240 --> 00:39:03,605
If you have two points,
then the leading eigenvector
691
00:39:03,605 --> 00:39:04,980
is going to be
just the line that
692
00:39:04,980 --> 00:39:06,879
goes through those
two points, regardless
693
00:39:06,879 --> 00:39:07,920
of what the dimension is.
694
00:39:07,920 --> 00:39:09,670
And clearly, you're
not learning anything.
695
00:39:13,850 --> 00:39:16,310
So you have to pick,
say, the k largest one.
696
00:39:16,310 --> 00:39:18,842
If you go all the way, you're
just reordering your thing,
697
00:39:18,842 --> 00:39:20,550
and you're not actually
gaining anything.
698
00:39:20,550 --> 00:39:22,130
You start from d
and you go to d.
699
00:39:22,130 --> 00:39:26,300
So at some point, this
procedure has to stop.
700
00:39:26,300 --> 00:39:28,960
And let's say it stops at k.
701
00:39:28,960 --> 00:39:31,360
Now, of course, you
should ask me a question,
702
00:39:31,360 --> 00:39:34,100
which is, how do you choose k?
703
00:39:34,100 --> 00:39:37,400
So that's, of course,
a natural question.
704
00:39:37,400 --> 00:39:41,360
Probably the basic answer
is just pick k equals 3,
705
00:39:41,360 --> 00:39:43,220
because you can
actually visualize it.
706
00:39:43,220 --> 00:39:47,906
But what happens if I
take k is equal to 4?
707
00:39:47,906 --> 00:39:51,860
If I take k equal
to 4, I'm not going
708
00:39:51,860 --> 00:39:54,070
to be able to plot points
in four dimensions.
709
00:39:54,070 --> 00:39:55,550
Well, I could, I
could add color,
710
00:39:55,550 --> 00:39:57,440
or I could try to be a
little smart about it.
711
00:39:57,440 --> 00:40:00,060
But it's actually
quite difficult.
712
00:40:00,060 --> 00:40:04,420
And so what people tend to do,
if you have four dimensions,
713
00:40:04,420 --> 00:40:06,850
they actually do a bunch
of two dimensional plots.
714
00:40:06,850 --> 00:40:08,920
And that's what a computer does--
a computer is not very good--
715
00:40:08,920 --> 00:40:10,750
I mean, by default,
they don't spit out
716
00:40:10,750 --> 00:40:12,380
three dimensional plots.
717
00:40:12,380 --> 00:40:15,024
So let's say they want to plot
only two dimensional things.
718
00:40:15,024 --> 00:40:17,440
So they're going to take the
first directions of, say, v1,
719
00:40:17,440 --> 00:40:18,586
v2.
720
00:40:18,586 --> 00:40:19,960
Let's say you have
three, but you
721
00:40:19,960 --> 00:40:21,760
want to have only two
dimensional plots.
722
00:40:21,760 --> 00:40:29,660
And then it's going to do
v1, v3; and then v2, v3.
723
00:40:29,660 --> 00:40:31,850
So really, you take
all three of them,
724
00:40:31,850 --> 00:40:35,240
but it's really just
showing you all choices
725
00:40:35,240 --> 00:40:37,340
of pairs of those guys.
726
00:40:37,340 --> 00:40:41,960
So if you were to
keep k is equal to 5,
727
00:40:41,960 --> 00:40:44,450
you would have 5 choose 2--
that is, 10-- different plots.
728
00:40:48,540 --> 00:40:51,930
So this is the actual
principal component algorithm,
729
00:40:51,930 --> 00:40:53,640
how it's implemented.
730
00:40:53,640 --> 00:40:55,000
And it's actually fairly simple.
731
00:40:55,000 --> 00:40:56,430
I mean, it looks like
there's lots of steps.
732
00:40:56,430 --> 00:40:58,600
But really, there's only
one that's important.
733
00:40:58,600 --> 00:40:59,850
So the first one is the input.
734
00:40:59,850 --> 00:41:04,860
I give you a bunch of points,
x1 to xn in d dimensions.
735
00:41:04,860 --> 00:41:07,680
And step two is, well, compute
their empirical covariance
736
00:41:07,680 --> 00:41:10,570
matrix S. The points themselves,
we don't really care.
737
00:41:10,570 --> 00:41:12,570
We care about their
empirical covariance matrix.
738
00:41:12,570 --> 00:41:14,530
So it's a d by d matrix.
739
00:41:14,530 --> 00:41:15,750
Now, I'm going to feed that.
740
00:41:15,750 --> 00:41:17,880
And that's where the actual
computation starts happening.
741
00:41:17,880 --> 00:41:19,796
I'm going to feed that
to something that knows
742
00:41:19,796 --> 00:41:21,090
how to diagonalize this matrix.
743
00:41:21,090 --> 00:41:23,220
And you have to
trust me, if I want
744
00:41:23,220 --> 00:41:25,770
to compute the k
largest eigenvalues
745
00:41:25,770 --> 00:41:27,960
and my matrix is
d by d, it's going
746
00:41:27,960 --> 00:41:32,730
to take me about k times
d squared operations.
747
00:41:32,730 --> 00:41:34,980
So if I want only three,
it's 3 times d squared,
748
00:41:34,980 --> 00:41:36,420
which is about--
749
00:41:36,420 --> 00:41:39,570
d squared is the time for me
it takes to just even read
750
00:41:39,570 --> 00:41:41,040
the matrix S.
751
00:41:41,040 --> 00:41:43,360
So that's not too bad.
752
00:41:43,360 --> 00:41:45,110
So what it's going to
spit out, of course,
753
00:41:45,110 --> 00:41:48,230
is the diagonal matrix
D. And those are nice,
754
00:41:48,230 --> 00:41:53,720
because they tell
me what
755
00:41:53,720 --> 00:41:56,210
is the order in which I should
be taking the columns of P.
756
00:41:56,210 --> 00:41:58,930
But what's really important
to me is v1 to vd,
757
00:41:58,930 --> 00:42:01,430
because those are going to be
the ones I'm going to be using
758
00:42:01,430 --> 00:42:04,250
to draw those plots.
759
00:42:04,250 --> 00:42:05,900
And now, I'm going
to say, OK, I need
760
00:42:05,900 --> 00:42:09,190
to actually choose some set k.
761
00:42:09,190 --> 00:42:11,630
And I'm going to basically
truncate and look
762
00:42:11,630 --> 00:42:16,380
only at the first
k columns of P.
763
00:42:16,380 --> 00:42:18,300
Once I have those
columns, what I
764
00:42:18,300 --> 00:42:20,820
want to do is to project
onto the linear span
765
00:42:20,820 --> 00:42:21,610
of those columns.
766
00:42:21,610 --> 00:42:23,340
And there's actually
a simple way
767
00:42:23,340 --> 00:42:26,940
to do this, which is just take
this matrix Pk, which is really
768
00:42:26,940 --> 00:42:29,460
the matrix of projection onto
the linear span of those k
769
00:42:29,460 --> 00:42:30,120
columns.
770
00:42:30,120 --> 00:42:32,160
And you just take Pk transpose.
771
00:42:32,160 --> 00:42:38,070
And then you apply this to
every single one of your points.
772
00:42:38,070 --> 00:42:42,000
Now Pk transpose, what is
the size of the matrix Pk?
773
00:42:46,410 --> 00:42:47,880
Yeah, [INAUDIBLE]?
774
00:42:47,880 --> 00:42:49,840
AUDIENCE: [INAUDIBLE]
775
00:42:49,840 --> 00:42:52,100
PHILIPPE RIGOLLET: So
Pk is just this matrix.
776
00:42:52,100 --> 00:42:54,601
I take the v1 and I stop at vk--
777
00:42:54,601 --> 00:42:55,100
well--
778
00:42:55,100 --> 00:42:57,656
AUDIENCE: [INAUDIBLE]
779
00:42:57,656 --> 00:42:59,030
PHILIPPE RIGOLLET:
d by k, right?
780
00:42:59,030 --> 00:43:01,290
Each of the column
is an eigenvector.
781
00:43:01,290 --> 00:43:02,840
It's of dimension d.
782
00:43:02,840 --> 00:43:05,730
I mean, that's a vector
in the original space.
783
00:43:05,730 --> 00:43:07,220
So I have this d by k matrix.
784
00:43:07,220 --> 00:43:11,360
So all it is is if I had my--
785
00:43:11,360 --> 00:43:13,970
well, I'm going to talk in
a second about Pk transpose.
786
00:43:13,970 --> 00:43:17,060
Pk transpose is
just this guy, where
787
00:43:17,060 --> 00:43:19,460
I stop at the k-th vector.
788
00:43:19,460 --> 00:43:22,370
So Pk transpose is k by d.
789
00:43:22,370 --> 00:43:26,825
So now, when I take Yi,
which is Pk transpose Xi,
790
00:43:26,825 --> 00:43:29,330
I end up with a point
which is in k dimensions.
791
00:43:29,330 --> 00:43:30,900
I have only k coordinates.
792
00:43:30,900 --> 00:43:33,350
So I took every single one
of my original points Xi,
793
00:43:33,350 --> 00:43:35,780
which had d coordinates, and
I turned it into a point that
794
00:43:35,780 --> 00:43:37,180
has only k coordinates.
795
00:43:37,180 --> 00:43:40,260
Particularly, I could
have k is equal to 2.
796
00:43:40,260 --> 00:43:42,820
This matrix is exactly
the one that projects.
797
00:43:42,820 --> 00:43:44,960
If you think about
it for one second,
798
00:43:44,960 --> 00:43:46,890
this is just the
matrix that says--
799
00:43:46,890 --> 00:43:48,610
well, we actually did
that several times.
800
00:43:48,610 --> 00:43:51,820
The matrix, so that
was this P transpose u
801
00:43:51,820 --> 00:43:53,470
that showed up somewhere.
802
00:43:53,470 --> 00:43:57,460
And so that's just
the matrix that
803
00:43:57,460 --> 00:44:01,030
takes your point X in,
say, three dimensions,
804
00:44:01,030 --> 00:44:04,750
and then just project it
down to two dimensions.
805
00:44:04,750 --> 00:44:09,220
And that's just-- it goes to the
closest point in the subspace.
806
00:44:09,220 --> 00:44:12,650
Now, here, the floor is flat.
807
00:44:12,650 --> 00:44:16,510
But we can pick any
subspace we want,
808
00:44:16,510 --> 00:44:18,310
depending on what
the lambdas are.
809
00:44:18,310 --> 00:44:19,930
So the lambdas were
important for us
810
00:44:19,930 --> 00:44:23,610
to be able to identify
which columns to pick.
811
00:44:23,610 --> 00:44:25,692
The fact that we assumed
that they were ordered
812
00:44:25,692 --> 00:44:27,400
tells us that we can
pick the first ones.
813
00:44:27,400 --> 00:44:28,500
If they were not
ordered, it would
814
00:44:28,500 --> 00:44:30,583
be just a subset of the
columns, depending on what
815
00:44:30,583 --> 00:44:32,550
the size of the eigenvalue is.
816
00:44:32,550 --> 00:44:36,509
So each column is labeled.
817
00:44:36,509 --> 00:44:38,800
And so then, of course, we
still have this question of,
818
00:44:38,800 --> 00:44:40,570
how do I pick k?
819
00:44:40,570 --> 00:44:42,760
So there's definitely the
matter of convenience.
820
00:44:42,760 --> 00:44:44,410
Maybe 2 is convenient.
821
00:44:44,410 --> 00:44:47,180
If it works for 2, you don't
have to go any farther.
822
00:44:47,180 --> 00:44:50,680
But you might want
to say, well--
823
00:44:50,680 --> 00:44:52,690
originally, I did
that to actually keep
824
00:44:52,690 --> 00:44:54,320
as much information as possible.
825
00:44:54,320 --> 00:44:56,230
I know that the
ultimate thing is
826
00:44:56,230 --> 00:44:58,515
to keep as much information,
which would be to k
827
00:44:58,515 --> 00:45:00,970
is equal d-- that's as much
information as you want.
828
00:45:00,970 --> 00:45:03,310
But it's essentially the
same question about, well,
829
00:45:03,310 --> 00:45:07,180
if I want to compress
a JPEG image,
830
00:45:07,180 --> 00:45:10,100
how much information should
I keep so it's still visible?
831
00:45:10,100 --> 00:45:11,840
And so there's some
rules for that.
832
00:45:11,840 --> 00:45:14,950
But none of them is
actually really a science.
833
00:45:14,950 --> 00:45:16,600
So it's really a
matter of what you
834
00:45:16,600 --> 00:45:18,250
think is actually tolerable.
835
00:45:18,250 --> 00:45:21,970
And we're just going to start
replacing this choice by maybe
836
00:45:21,970 --> 00:45:22,900
another parameter.
837
00:45:22,900 --> 00:45:26,440
So here, we're going to
basically replace k by alpha,
838
00:45:26,440 --> 00:45:29,360
and so we just do stuff.
839
00:45:29,360 --> 00:45:32,020
So the first one that
people do that is probably
840
00:45:32,020 --> 00:45:33,750
the most popular one--
841
00:45:33,750 --> 00:45:35,860
OK, the most popular
one is definitely
842
00:45:35,860 --> 00:45:39,190
take k is equal to 2
or 3, because it's just
843
00:45:39,190 --> 00:45:41,320
convenient to visualize.
844
00:45:41,320 --> 00:45:48,050
The second most popular
one is the scree plot.
845
00:45:48,050 --> 00:45:49,370
So the scree plot--
846
00:45:49,370 --> 00:45:54,180
remember, I have my
eigenvalues, lambda j's.
847
00:45:54,180 --> 00:45:57,670
And I've chosen the
lambda j's to decrease.
848
00:45:57,670 --> 00:45:59,380
So the indices are
chosen in such a way
849
00:45:59,380 --> 00:46:01,480
that lambda is a
decreasing function.
850
00:46:01,480 --> 00:46:04,332
So I have lambda 1, and
let's say it's this guy here.
851
00:46:04,332 --> 00:46:06,790
And then I have lambda 2, and
let's say it's this guy here.
852
00:46:06,790 --> 00:46:09,370
And then I have lambda 3, and
let's say it's this guy here,
853
00:46:09,370 --> 00:46:12,760
lambda 4, lambda 5, lambda 6.
854
00:46:12,760 --> 00:46:16,322
And all I care about is
that this thing decreases.
855
00:46:16,322 --> 00:46:19,580
The scree plot says
something like this--
856
00:46:19,580 --> 00:46:22,520
if there's an inflection point,
meaning that you can sort of do
857
00:46:22,520 --> 00:46:25,230
something like this and
then something like this,
858
00:46:25,230 --> 00:46:27,610
you should stop at 3.
859
00:46:27,610 --> 00:46:29,500
That's what the
scree plot tells you.
860
00:46:29,500 --> 00:46:34,590
What it's saying in a way
is that the percentage
861
00:46:34,590 --> 00:46:39,170
of the marginal
increment of explained
862
00:46:39,170 --> 00:46:41,990
variance that you get
starts to decrease after you
863
00:46:41,990 --> 00:46:43,555
pass this inflection point.
864
00:46:43,555 --> 00:46:45,840
So let's see why I say this.
865
00:46:45,840 --> 00:46:52,390
Well, here, what I
have-- so this ratio
866
00:46:52,390 --> 00:46:54,280
that you see there is
actually the percentage
867
00:46:54,280 --> 00:46:56,470
of explained variance.
868
00:46:56,470 --> 00:47:01,590
So what it means is that, if I
look at lambda 1 plus ... plus lambda k,
869
00:47:01,590 --> 00:47:08,260
and then I divide by lambda
1 plus ... plus lambda d, well,
870
00:47:08,260 --> 00:47:08,980
what is this?
871
00:47:08,980 --> 00:47:12,010
Well, this lambda
1 plus ... plus lambda d
872
00:47:12,010 --> 00:47:14,530
is the total amount of variance
that I get in my points.
873
00:47:14,530 --> 00:47:18,070
That's the trace of sigma.
874
00:47:18,070 --> 00:47:20,640
So that's the variance
in the first direction
875
00:47:20,640 --> 00:47:22,420
plus the variance in
the second direction
876
00:47:22,420 --> 00:47:24,280
plus the variance in
the third direction.
877
00:47:24,280 --> 00:47:26,571
That's basically all the
variance that I have possible.
878
00:47:28,900 --> 00:47:32,175
Now, this is the variance that
I kept in the first direction.
879
00:47:32,175 --> 00:47:34,550
This is the variance that I
kept in the second direction,
880
00:47:34,550 --> 00:47:37,190
all the way to the variance that
I kept in the k-th direction.
881
00:47:37,190 --> 00:47:41,800
So I know that this number is
always less than or equal to 1.
882
00:47:41,800 --> 00:47:43,540
And it's larger than 0.
883
00:47:43,540 --> 00:47:48,500
And this is just
the proportion, say,
884
00:47:48,500 --> 00:47:59,520
of variance explained
by v1 to vk,
885
00:47:59,520 --> 00:48:03,720
or simply, the proportion of
explained variance by my PCA,
886
00:48:03,720 --> 00:48:05,720
say.
887
00:48:05,720 --> 00:48:07,550
So now, what this
thing is telling me,
888
00:48:07,550 --> 00:48:09,860
it says, well, if
I look at this thing
889
00:48:09,860 --> 00:48:13,050
and I start seeing this
inflection point, it's saying,
890
00:48:13,050 --> 00:48:16,400
oh, here, you're gaining
a lot of variance.
891
00:48:16,400 --> 00:48:19,090
And then at some point,
you stop gaining a lot
892
00:48:19,090 --> 00:48:21,820
in your proportion of
explained variance.
893
00:48:21,820 --> 00:48:23,870
So this will
translate into something
894
00:48:23,870 --> 00:48:28,490
where when I look at this ratio,
lambda 1 plus ... plus lambda k divided
895
00:48:28,490 --> 00:48:31,490
by lambda 1 plus ... plus
lambda d, this would
896
00:48:31,490 --> 00:48:34,195
translate into a function
that would look like this.
897
00:48:34,195 --> 00:48:36,320
And what it's telling you,
it says, well, maybe you
898
00:48:36,320 --> 00:48:38,570
should stop here, because
here every time you add one,
899
00:48:38,570 --> 00:48:40,520
you don't get as much
as you did before.
900
00:48:40,520 --> 00:48:43,700
You actually get like
smaller marginal returns.
901
00:48:50,910 --> 00:48:56,630
So explained variance is
the numerator of this ratio.
902
00:48:56,630 --> 00:48:58,430
And the total variance
is the denominator.
903
00:48:58,430 --> 00:49:01,010
Those are pretty
straightforward terms
904
00:49:01,010 --> 00:49:03,320
that you would want
to use for this.
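The explained-variance curve and a crude elbow rule can be computed directly from the sorted eigenvalues. Below is a minimal sketch in numpy; the eigenvalues and the 0.05 cutoff for the marginal gain are invented for illustration, not values from the lecture.

```python
import numpy as np

# Hypothetical eigenvalues of a sample covariance matrix,
# sorted so that lambda_1 >= lambda_2 >= ... >= lambda_d.
lambdas = np.array([5.0, 3.0, 1.5, 0.2, 0.15, 0.1, 0.05])

# Proportion of explained variance for each k:
# (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_d).
# The denominator is the total variance, i.e. the trace of S.
explained = np.cumsum(lambdas) / lambdas.sum()

# A crude "elbow" rule for the scree plot: keep components until the
# marginal gain lambda_k / trace(S) drops below a chosen threshold.
marginal = lambdas / lambdas.sum()
k = int(np.argmax(marginal < 0.05))  # first index with a small marginal gain
```

Here `explained` is the curve sketched on the board, and `k` is where this particular elbow rule would stop.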
905
00:49:03,320 --> 00:49:06,620
So if your goal is to
do data visualization--
906
00:49:06,620 --> 00:49:10,100
so why would you
take k larger than 2?
907
00:49:10,100 --> 00:49:11,750
Let's say, if you
take k larger than 6,
908
00:49:11,750 --> 00:49:12,906
you can start to
imagine that you're
909
00:49:12,906 --> 00:49:15,364
going to have 6 choose 2 plots,
which starts to be annoying.
910
00:49:15,364 --> 00:49:16,850
And if you have k
is equal to 10--
911
00:49:16,850 --> 00:49:19,310
because you could start
in dimension 50,000--
912
00:49:19,310 --> 00:49:21,080
and then k equal to
10 would be the place
913
00:49:21,080 --> 00:49:22,780
where you have this thing
that's a lot of plots
914
00:49:22,780 --> 00:49:23,960
that you would have to show.
915
00:49:23,960 --> 00:49:26,900
So it's not always for
data visualization.
916
00:49:26,900 --> 00:49:29,540
Once I've actually
done this, I've
917
00:49:29,540 --> 00:49:32,460
actually effectively reduced
the dimension of my problem.
918
00:49:32,460 --> 00:49:34,230
And what I could do
with what I have is
919
00:49:34,230 --> 00:49:36,080
do a regression on those guys.
920
00:49:36,080 --> 00:49:39,010
The v1-- so I
forgot to tell you--
921
00:49:39,010 --> 00:49:41,460
why is that called principal
component analysis?
922
00:49:41,460 --> 00:49:46,910
Well, the vj's that
I keep, v1 to vk
923
00:49:46,910 --> 00:49:51,932
are called principal components.
924
00:49:59,020 --> 00:50:04,690
And they effectively act
as the summary of my Xi's.
925
00:50:04,690 --> 00:50:06,850
When I mentioned
image compression,
926
00:50:06,850 --> 00:50:10,840
I started with a point
Xi that was d numbers--
927
00:50:10,840 --> 00:50:12,604
let's say 50,000 numbers.
928
00:50:12,604 --> 00:50:14,020
And now, I'm saying,
actually, you
929
00:50:14,020 --> 00:50:16,270
can throw out those
50,000 numbers.
930
00:50:16,270 --> 00:50:19,390
If you actually know only
the k numbers that you need--
931
00:50:19,390 --> 00:50:20,860
the 6 numbers that you need--
932
00:50:20,860 --> 00:50:22,318
you're going to
have something that
933
00:50:22,318 --> 00:50:24,820
is pretty close to the
information you had.
934
00:50:24,820 --> 00:50:26,736
So in a way, there is
some form of compression
935
00:50:26,736 --> 00:50:27,810
that's going on here.
936
00:50:27,810 --> 00:50:31,150
And what you can do is that
those principal components,
937
00:50:31,150 --> 00:50:34,120
you can actually use
now for regression.
938
00:50:34,120 --> 00:50:39,130
If I want to regress
Y onto X that's
939
00:50:39,130 --> 00:50:41,862
very high dimensional,
before I do this,
940
00:50:41,862 --> 00:50:44,320
if I don't have enough points,
maybe what I can actually do
941
00:50:44,320 --> 00:50:47,780
is to do principal
component analysis
942
00:50:47,780 --> 00:50:49,510
on my
X's, replace them
943
00:50:49,510 --> 00:50:52,150
by those compressed versions,
and do linear regression
944
00:50:52,150 --> 00:50:53,020
on those guys.
945
00:50:53,020 --> 00:50:55,330
And that's called principal
component regression,
946
00:50:55,330 --> 00:50:56,039
not surprisingly.
947
00:50:56,039 --> 00:50:57,830
And that's something
that's pretty popular.
948
00:50:57,830 --> 00:51:00,086
And you can do it with k
equal to 10, for example.
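Principal component regression can be sketched in plain numpy: compress the X_i's to their top-k PCA scores, then run least squares on the scores. The data, dimensions, and noise level below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in R^d, with both the extra variance and the
# signal placed along the first coordinate (an artificial setup).
n, d, k = 200, 30, 5
X = rng.standard_normal((n, d))
X[:, 0] *= 3.0
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)

# Principal component regression:
# 1. center, 2. top-k eigenvectors of the sample covariance,
# 3. replace each X_i by its k scores, 4. least squares on the scores.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                     # sample covariance matrix
_, eigvecs = np.linalg.eigh(S)        # eigh returns ascending order
V = eigvecs[:, ::-1][:, :k]           # top-k principal components
scores = Xc @ V                       # the compressed X_i's
theta, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
y_hat = scores @ theta + y.mean()
mse = float(np.mean((y - y_hat) ** 2))
```

Because the first principal component picks up the high-variance first coordinate, the regression on 5 scores recovers most of the signal even though d is 30.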
949
00:51:03,020 --> 00:51:07,640
So for data visualization, I did
not find a Thanksgiving themed
950
00:51:07,640 --> 00:51:08,270
picture.
951
00:51:08,270 --> 00:51:11,960
But I found one that
has turkey in it.
952
00:51:11,960 --> 00:51:12,460
Get it?
953
00:51:15,310 --> 00:51:21,820
So this is actually a
gene data set that was--
954
00:51:21,820 --> 00:51:24,190
so when you see
something like this,
955
00:51:24,190 --> 00:51:27,056
you can imagine that someone
has been preprocessing
956
00:51:27,056 --> 00:51:28,180
the hell out of this thing.
957
00:51:28,180 --> 00:51:30,820
This is not like, oh, I
collect data on 23andMe
958
00:51:30,820 --> 00:51:32,670
and I'm just going
to run PCA on this.
959
00:51:32,670 --> 00:51:34,730
It just doesn't
happen like that.
960
00:51:34,730 --> 00:51:38,740
And so what happened is that--
so let's assume that this was
961
00:51:38,740 --> 00:51:41,560
a bunch of preprocessed data,
which are gene expression
962
00:51:41,560 --> 00:51:42,550
levels--
963
00:51:42,550 --> 00:51:47,650
so 500,000 genes
among 1,400 Europeans.
964
00:51:47,650 --> 00:51:50,260
So here, I actually
have fewer observations
965
00:51:50,260 --> 00:51:52,180
than I have variables.
966
00:51:52,180 --> 00:51:54,880
And that's when you use
principal component regression
967
00:51:54,880 --> 00:51:57,460
most of the time, so
it doesn't stop you.
968
00:51:57,460 --> 00:52:01,480
And then what you do is you say,
OK, I have those 500,000 genes
969
00:52:01,480 --> 00:52:03,640
among--
970
00:52:03,640 --> 00:52:06,760
so here, that means that
there's 1,400 points here.
971
00:52:06,760 --> 00:52:09,760
And I actually take
those 500,000 directions.
972
00:52:09,760 --> 00:52:13,347
So each person has a vector
of, say, 500,000 genes
973
00:52:13,347 --> 00:52:14,430
that are attached to them.
974
00:52:14,430 --> 00:52:17,020
And I project them onto
two dimensions, which
975
00:52:17,020 --> 00:52:19,380
should be extremely lossy.
976
00:52:19,380 --> 00:52:21,040
I lose a lot of information.
977
00:52:21,040 --> 00:52:24,790
And indeed, I do, because
I'm one of these guys.
978
00:52:24,790 --> 00:52:27,350
And I'm pretty sure I'm very
different from this guy,
979
00:52:27,350 --> 00:52:30,070
even though probably from
an American perspective,
980
00:52:30,070 --> 00:52:31,970
we're all the same.
981
00:52:31,970 --> 00:52:35,690
But I think we have like
slightly different genomes.
982
00:52:35,690 --> 00:52:39,220
And so the thing is
now we have this--
983
00:52:39,220 --> 00:52:41,980
so you see there's lots of
Swiss that participate in this.
984
00:52:41,980 --> 00:52:43,900
But actually, those two
principal components
985
00:52:43,900 --> 00:52:46,210
recover sort of
the map of Europe.
986
00:52:46,210 --> 00:52:50,169
I mean, OK, again, this is
actually maybe fine-grained
987
00:52:50,169 --> 00:52:50,710
for you guys.
988
00:52:50,710 --> 00:52:52,810
But right here, there's
Portugal and Spain,
989
00:52:52,810 --> 00:52:54,430
which are those colors.
990
00:52:54,430 --> 00:52:55,450
So here is color-coded.
991
00:52:55,450 --> 00:52:58,510
And here is Turkey, of
course, which we know
992
00:52:58,510 --> 00:53:02,230
has very different genomes.
993
00:53:02,230 --> 00:53:04,850
So Turks are very
at the boundary.
994
00:53:04,850 --> 00:53:06,100
So you can see all the greens.
995
00:53:06,100 --> 00:53:08,560
They stay very far apart
from everything else.
996
00:53:08,560 --> 00:53:11,080
And then the rest
here is pretty mixed.
997
00:53:11,080 --> 00:53:13,430
But it sort of recovers--
if you look at the colors,
998
00:53:13,430 --> 00:53:14,500
it sort of recovers that.
999
00:53:14,500 --> 00:53:16,390
So in a way, those two
principal components
1000
00:53:16,390 --> 00:53:18,050
are just the geographic feature.
1001
00:53:18,050 --> 00:53:25,570
So if you insist to compress
all the genomic information
1002
00:53:25,570 --> 00:53:28,330
of these people into two
numbers, what you're actually
1003
00:53:28,330 --> 00:53:31,320
going to get is
longitude and latitude,
1004
00:53:31,320 --> 00:53:35,550
which is somewhat
surprising, but not
1005
00:53:35,550 --> 00:53:37,740
so much if you think that's
it's been preprocessed.
1006
00:53:43,120 --> 00:53:47,530
So what do you do
beyond practice?
1007
00:53:47,530 --> 00:53:50,780
Well, you could try to
actually study those things.
1008
00:53:50,780 --> 00:53:52,330
If you think about
it for a second,
1009
00:53:52,330 --> 00:53:54,880
we did not do any statistics.
1010
00:53:54,880 --> 00:53:57,460
I talked to you about
IID observations,
1011
00:53:57,460 --> 00:53:59,950
but we never used the fact
that they were independent.
1012
00:53:59,950 --> 00:54:01,491
The way we typically
use independence
1013
00:54:01,491 --> 00:54:04,270
is to have central
limit theorem, maybe.
1014
00:54:04,270 --> 00:54:06,640
I mentioned the fact that
the covariance, in the case of the
1015
00:54:06,640 --> 00:54:09,520
Gaussian would actually give me
something which is independent.
1016
00:54:09,520 --> 00:54:10,870
We didn't care.
1017
00:54:10,870 --> 00:54:16,280
This was a data analysis, data
mining process that we did.
1018
00:54:16,280 --> 00:54:19,280
I give you points, and you just
put them through the crank.
1019
00:54:19,280 --> 00:54:21,350
There was an algorithm
in six steps.
1020
00:54:21,350 --> 00:54:23,750
And you just put it through
and that's what you got.
1021
00:54:23,750 --> 00:54:26,940
Now, of course, there's some
work which studies says, OK,
1022
00:54:26,940 --> 00:54:30,440
if my data is actually generated
from some process-- maybe,
1023
00:54:30,440 --> 00:54:33,050
my points are multivariate
Gaussian with some structure
1024
00:54:33,050 --> 00:54:34,520
on the covariance--
1025
00:54:34,520 --> 00:54:37,010
how well am I recovering
the covariance structure?
1026
00:54:37,010 --> 00:54:38,990
And that's where
statistics kicks in.
1027
00:54:38,990 --> 00:54:41,390
And that's where we stop.
1028
00:54:41,390 --> 00:54:44,730
So this is actually a bit
more difficult to study.
1029
00:54:44,730 --> 00:54:48,250
But in a way, it's not
entirely satisfactory,
1030
00:54:48,250 --> 00:54:50,320
because we could work
for a couple of boards
1031
00:54:50,320 --> 00:54:53,470
and I would just basically
sort of reverse engineer this
1032
00:54:53,470 --> 00:54:57,457
and find some models under which
it's a good idea to do that.
1033
00:54:57,457 --> 00:54:58,540
And what are those models?
1034
00:54:58,540 --> 00:55:01,450
Well, those are the models
that sort of give you
1035
00:55:01,450 --> 00:55:03,911
sort of prominent directions
that you want to find.
1036
00:55:03,911 --> 00:55:06,160
And it will say, yes, if you
have enough observations,
1037
00:55:06,160 --> 00:55:08,260
you will find those
directions along which
1038
00:55:08,260 --> 00:55:10,150
your data is elongated.
1039
00:55:10,150 --> 00:55:14,890
So that's essentially
what you want to do.
1040
00:55:14,890 --> 00:55:20,660
So that's exactly what
this thing is telling you.
1041
00:55:20,660 --> 00:55:23,010
So where does the
statistics come in?
1042
00:55:23,010 --> 00:55:26,020
Well, everything, remember--
so actually that's
1043
00:55:26,020 --> 00:55:28,490
where Alana was confused--
the idea was to say, well,
1044
00:55:28,490 --> 00:55:32,590
if I have a true
covariance matrix sigma
1045
00:55:32,590 --> 00:55:34,540
and I never really
have access to it,
1046
00:55:34,540 --> 00:55:38,870
I'm just running PCA on the
empirical covariance matrix,
1047
00:55:38,870 --> 00:55:41,380
how do those results relate?
1048
00:55:41,380 --> 00:55:44,270
And this is something
that you can study.
1049
00:55:44,270 --> 00:55:47,530
So for example, if
n goes to infinity
1050
00:55:47,530 --> 00:55:55,840
and d, your
dimension, is fixed,
1051
00:55:55,840 --> 00:56:00,370
then S goes to sigma
in any sense you want.
1052
00:56:00,370 --> 00:56:02,860
Maybe each entry is going
to each entry of sigma,
1053
00:56:02,860 --> 00:56:03,730
for example.
1054
00:56:03,730 --> 00:56:04,840
So S is a good estimator.
1055
00:56:04,840 --> 00:56:06,381
We know that the
empirical covariance
1056
00:56:06,381 --> 00:56:07,600
is a consistent estimator.
1057
00:56:07,600 --> 00:56:10,230
And if d is fixed, this
is actually not an issue.
1058
00:56:10,230 --> 00:56:14,450
So in particular, if you run
PCA on the sample covariance
1059
00:56:14,450 --> 00:56:16,150
matrix, you look
at, say, v1, then
1060
00:56:16,150 --> 00:56:20,140
v1 is going to converge to the
largest eigenvector of sigma
1061
00:56:20,140 --> 00:56:23,990
as n goes to infinity,
but for d fixed.
1062
00:56:23,990 --> 00:56:27,960
And that's a story that
we've known since the '60s.
1063
00:56:27,960 --> 00:56:30,906
More recently, people have
started challenging this.
1064
00:56:30,906 --> 00:56:33,030
Because what's happening
when you fix the dimension
1065
00:56:33,030 --> 00:56:35,310
and let the sample
size go to infinity,
1066
00:56:35,310 --> 00:56:38,961
you're certainly not
allowing for this.
1067
00:56:38,961 --> 00:56:41,460
It's certainly not explaining
to you anything about the fact
1068
00:56:41,460 --> 00:56:44,512
when d is equal to 500,000
and n is equal to 1,400.
1069
00:56:44,512 --> 00:56:46,470
Because when d is fixed
and n goes to infinity,
1070
00:56:46,470 --> 00:56:48,660
in particular, n is
much larger than d,
1071
00:56:48,660 --> 00:56:50,280
which is not the case here.
1072
00:56:50,280 --> 00:56:53,610
And so when n is much larger
than d, things go well.
1073
00:56:53,610 --> 00:56:57,430
But if d is not much less than n,
it's not clear what happens.
1074
00:56:57,430 --> 00:57:01,540
And particularly, if d is of the
order of n, what's happening?
1075
00:57:01,540 --> 00:57:04,320
So there's an entire theory
in mathematics that's called
1076
00:57:04,320 --> 00:57:07,890
random matrix theory that
studies the behavior of exactly
1077
00:57:07,890 --> 00:57:10,770
this question-- what is the
behavior of the spectrum--
1078
00:57:10,770 --> 00:57:13,020
the eigenvalues
and eigenvectors--
1079
00:57:13,020 --> 00:57:16,470
of a matrix in which I put
random numbers and I let--
1080
00:57:16,470 --> 00:57:19,710
so the matrix I'm interested
in here is the matrix of X's.
1081
00:57:19,710 --> 00:57:21,830
When I stack all my
X's next to each other,
1082
00:57:21,830 --> 00:57:26,940
so that's a matrix of size,
say, d by n, so each column
1083
00:57:26,940 --> 00:57:28,890
is of size d, it's one person.
1084
00:57:28,890 --> 00:57:29,880
And so I put them.
1085
00:57:29,880 --> 00:57:31,790
And when I let the
matrix go to infinity,
1086
00:57:31,790 --> 00:57:33,920
I let both d and n go to infinity.
1087
00:57:33,920 --> 00:57:37,260
But I want the aspect ratio,
d/n, to go to some constant.
1088
00:57:37,260 --> 00:57:38,940
That's what they do.
1089
00:57:38,940 --> 00:57:41,730
And what's nice is that in the
end, you have this constant--
1090
00:57:41,730 --> 00:57:42,840
let's call it gamma--
1091
00:57:42,840 --> 00:57:44,550
that shows up in
all the asymptotics.
1092
00:57:44,550 --> 00:57:46,680
And then you can
replace it by d/n.
1093
00:57:46,680 --> 00:57:50,520
And you know that you still have
a handle of both the dimension
1094
00:57:50,520 --> 00:57:51,360
and the sample size.
1095
00:57:51,360 --> 00:57:54,020
Whereas, usually the dimension
goes away, as you let n
1096
00:57:54,020 --> 00:57:57,370
go to infinity without having
dimension going to infinity.
1097
00:57:57,370 --> 00:57:59,400
And so now, when
this happens, as soon
1098
00:57:59,400 --> 00:58:01,920
as d/n goes to a
constant, you can
1099
00:58:01,920 --> 00:58:07,380
show that essentially there's
an angle between the largest
1100
00:58:07,380 --> 00:58:14,460
eigenvector of sigma and the
largest eigenvector of S, as n
1101
00:58:14,460 --> 00:58:15,460
and d go to infinity.
1102
00:58:15,460 --> 00:58:17,251
There is always an
angle-- you can actually
1103
00:58:17,251 --> 00:58:18,930
write it explicitly.
1104
00:58:18,930 --> 00:58:22,240
And it's an angle that
depends on this ratio, gamma--
1105
00:58:22,240 --> 00:58:24,840
the asymptotic ratio of d/n.
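This angle is easy to see in simulation. The sketch below samples from a spiked covariance (the identity plus a rank-one spike along e1, with invented sizes) and compares the leading eigenvector of S to the true one for a small and a large aspect ratio d/n; it illustrates the phenomenon, not the exact random-matrix formula.

```python
import numpy as np

rng = np.random.default_rng(1)

def leading_angle(n, d, spike=2.0):
    # Sample n points from N(0, Sigma) with Sigma = I + spike * e1 e1^T,
    # then return the angle in degrees between the leading eigenvector
    # of the empirical covariance S and the true one, e1.
    scale = np.ones(d)
    scale[0] = np.sqrt(1.0 + spike)
    X = rng.standard_normal((n, d)) * scale
    S = X.T @ X / n
    _, vecs = np.linalg.eigh(S)       # ascending order
    v1 = vecs[:, -1]                  # leading empirical eigenvector
    cos = min(abs(float(v1[0])), 1.0) # |<v1, e1>|, since e1 is the truth
    return float(np.degrees(np.arccos(cos)))

small_gamma = leading_angle(n=2000, d=20)   # gamma = d/n = 0.01
large_gamma = leading_angle(n=200, d=400)   # gamma = d/n = 2
```

With a small aspect ratio the empirical eigenvector is nearly aligned with the truth; with gamma around 2 the angle is substantial even though n is the same order as d.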
1106
00:58:24,840 --> 00:58:29,392
And so there's been a lot of
understanding how to correct,
1107
00:58:29,392 --> 00:58:30,600
how to pay attention to this.
1108
00:58:30,600 --> 00:58:34,320
This creates some biases that
were sort of overlooked before.
1109
00:58:34,320 --> 00:58:37,470
In particular, when
I do this, this
1110
00:58:37,470 --> 00:58:40,490
is not the proportion
of explained variance,
1111
00:58:40,490 --> 00:58:42,940
when n and d are similar.
1112
00:58:42,940 --> 00:58:44,940
This is an estimated
number computed from S.
1113
00:58:44,940 --> 00:58:48,030
This is computed from S. All
these guys are computed from S.
1114
00:58:48,030 --> 00:58:49,830
So those are
actually not exactly
1115
00:58:49,830 --> 00:58:51,060
where you want them to be.
1116
00:58:51,060 --> 00:58:54,510
And there's some nice work that
allows you to recalibrate what
1117
00:58:54,510 --> 00:58:57,626
this ratio should be, how
this ratio should be computed,
1118
00:58:57,626 --> 00:58:59,250
so it's a better
representative of what
1119
00:58:59,250 --> 00:59:04,680
the proportion of explained
variance actually is.
1120
00:59:04,680 --> 00:59:07,470
So then, of course,
there's the question
1121
00:59:07,470 --> 00:59:09,870
of-- so that's when d/n
goes to some constant.
1122
00:59:09,870 --> 00:59:12,105
So the best case--
so that was the '60s--
1123
00:59:12,105 --> 00:59:15,040
d is fixed and n is
much larger than d.
1124
00:59:15,040 --> 00:59:18,310
And then random matrix theory
tells you, well, d and n
1125
00:59:18,310 --> 00:59:20,680
are sort of the same
order of magnitude.
1126
00:59:20,680 --> 00:59:23,620
When they go to infinity, the
ratio goes to some constant.
1127
00:59:23,620 --> 00:59:25,270
Think of it as being order 1.
1128
00:59:25,270 --> 00:59:30,440
To be fair, if d is 100 times
larger than n, it still works.
1129
00:59:30,440 --> 00:59:32,440
And it depends on
what you think what
1130
00:59:32,440 --> 00:59:33,910
the infinity is at this point.
1131
00:59:33,910 --> 00:59:37,880
But I think the random matrix
theory results are very useful.
1132
00:59:37,880 --> 00:59:39,880
But then even in
this case, I told you
1133
00:59:39,880 --> 00:59:42,460
that the leading
eigenvector of S
1134
00:59:42,460 --> 00:59:48,812
is actually at an angle from the
leading eigenvector of--
1135
00:59:48,812 --> 00:59:50,020
So what's happening is that--
1136
00:59:56,970 --> 01:00:01,320
so let's say that d/n
goes to some gamma.
1137
01:00:01,320 --> 01:00:04,470
And what I claim is
that, if you look at--
1138
01:00:04,470 --> 01:00:09,130
so that's v1, that's the v1 of
S. And then there's the v1 of--
1139
01:00:09,130 --> 01:00:11,760
so this should be of size 1.
1140
01:00:11,760 --> 01:00:13,096
So that's the v1 of sigma.
1141
01:00:13,096 --> 01:00:15,220
Then those things are going
to have an angle, which
1142
01:00:15,220 --> 01:00:16,629
is some function of gamma.
1143
01:00:16,629 --> 01:00:18,670
It's complicated, but
there's a function of gamma
1144
01:00:18,670 --> 01:00:19,628
that you can see there.
1145
01:00:19,628 --> 01:00:21,830
And there's some models.
1146
01:00:21,830 --> 01:00:24,620
When gamma goes
to infinity, which
1147
01:00:24,620 --> 01:00:27,800
means that d is now
much larger than n,
1148
01:00:27,800 --> 01:00:30,860
this angle is 90
degrees, which means
1149
01:00:30,860 --> 01:00:32,798
that you're getting nothing.
1150
01:00:32,798 --> 01:00:33,796
Yeah.
1151
01:00:33,796 --> 01:00:37,289
AUDIENCE: If d is not
on your lower plane,
1152
01:00:37,289 --> 01:00:40,782
so like gamma is 0,
is there still angle?
1153
01:00:40,782 --> 01:00:43,780
PHILIPPE RIGOLLET: No,
but that's consistent--
1154
01:00:43,780 --> 01:00:45,659
the fact that it's
consistent when--
1155
01:00:45,659 --> 01:00:46,825
so the angle is a function--
1156
01:00:46,825 --> 01:00:49,605
AUDIENCE: d is not a
constant [INAUDIBLE]?
1157
01:00:52,599 --> 01:00:54,600
PHILIPPE RIGOLLET:
d is not a constant?
1158
01:00:54,600 --> 01:00:57,090
So if d is little o of n?
1159
01:00:57,090 --> 01:00:59,985
Then gamma goes to 0 and
f of gamma goes to 0.
1160
01:00:59,985 --> 01:01:02,490
So f of gamma is
a function that--
1161
01:01:02,490 --> 01:01:05,200
so for example, if f of gamma--
1162
01:01:05,200 --> 01:01:08,960
this is the sine of the
angle, for example--
1163
01:01:08,960 --> 01:01:11,840
then it's a function that starts
at 0, and that goes like this.
1164
01:01:15,340 --> 01:01:18,120
But as soon as gamma is
positive, it goes away from 0.
1165
01:01:20,650 --> 01:01:24,517
So now when gamma
goes to infinity,
1166
01:01:24,517 --> 01:01:26,350
then this thing goes
to a right angle, which
1167
01:01:26,350 --> 01:01:27,516
means I'm getting just junk.
1168
01:01:27,516 --> 01:01:29,210
So this is not my
leading eigenvector.
1169
01:01:29,210 --> 01:01:31,160
So how do you do this?
1170
01:01:31,160 --> 01:01:33,850
Well, just like
everywhere in statistics,
1171
01:01:33,850 --> 01:01:35,500
you have to just make
more assumptions.
1172
01:01:35,500 --> 01:01:36,916
You have to assume
that you're not
1173
01:01:36,916 --> 01:01:39,220
looking for the leading
eigenvector or the direction
1174
01:01:39,220 --> 01:01:40,610
that carries the most variance.
1175
01:01:40,610 --> 01:01:42,830
But you're looking, maybe,
for a special direction.
1176
01:01:42,830 --> 01:01:44,910
And that's what
sparse PCA is doing.
1177
01:01:44,910 --> 01:01:48,610
Sparse PCA is saying, I'm not
looking for any direction u
1178
01:01:48,610 --> 01:01:50,290
that carries the most variance.
1179
01:01:50,290 --> 01:01:54,070
I'm only looking for a
direction u that is sparse.
1180
01:01:54,070 --> 01:01:58,460
Think of it, for example, as
having 10 non-zero coordinates.
1181
01:01:58,460 --> 01:02:02,050
So that's a lot of
directions still to look for.
1182
01:02:02,050 --> 01:02:05,560
But once you do this,
then you actually
1183
01:02:05,560 --> 01:02:07,060
have not only--
there's a few things
1184
01:02:07,060 --> 01:02:08,930
that actually you
get from doing this.
1185
01:02:08,930 --> 01:02:12,160
The first one is you
actually essentially replace
1186
01:02:12,160 --> 01:02:15,660
d by k, which means
that n now just--
1187
01:02:15,660 --> 01:02:18,480
I'm sorry, let's say s
non-zero coefficients.
1188
01:02:18,480 --> 01:02:21,420
You replace d by s,
which means that n only
1189
01:02:21,420 --> 01:02:24,740
has to be much larger than S
for this thing to actually work.
1190
01:02:24,740 --> 01:02:26,760
Now, of course, you've
set your goal weaker.
1191
01:02:26,760 --> 01:02:28,830
Your goal is not to
find any direction, only
1192
01:02:28,830 --> 01:02:30,360
a sparse direction.
1193
01:02:30,360 --> 01:02:31,830
But there's something
very valuable
1194
01:02:31,830 --> 01:02:33,746
about sparse directions,
is that they actually
1195
01:02:33,746 --> 01:02:35,310
are interpretable.
1196
01:02:35,310 --> 01:02:37,810
When I found the v--
1197
01:02:37,810 --> 01:02:40,230
let's say that the v
that I found before
1198
01:02:40,230 --> 01:02:48,390
was 0.2, and then 0.9, and
then 1.1 minus 3, et cetera.
1199
01:02:48,390 --> 01:02:51,570
So that was the coordinates
of my leading eigenvector
1200
01:02:51,570 --> 01:02:54,410
in the original
coordinate system.
1201
01:02:54,410 --> 01:02:55,160
What does it mean?
1202
01:02:55,160 --> 01:02:57,140
Well, it means that if
I see a large number,
1203
01:02:57,140 --> 01:03:01,610
that means that this
v is very close--
1204
01:03:01,610 --> 01:03:03,830
so that's my original
coordinate system.
1205
01:03:03,830 --> 01:03:05,330
Let's call it e1 and e2.
1206
01:03:05,330 --> 01:03:09,230
So that's just 1,
0; and then 0, 1.
1207
01:03:09,230 --> 01:03:11,170
Then clearly, from
the coordinates of v,
1208
01:03:11,170 --> 01:03:13,550
I can tell if my v is like
this, or it's like this,
1209
01:03:13,550 --> 01:03:15,610
or it's like this.
1210
01:03:15,610 --> 01:03:18,330
Well, I mean, they should
all be of the same size.
1211
01:03:18,330 --> 01:03:20,590
So I can tell if
it's here or here
1212
01:03:20,590 --> 01:03:24,739
or here, depending
on-- like here,
1213
01:03:24,739 --> 01:03:26,280
that means I'm going
to see something
1214
01:03:26,280 --> 01:03:29,090
where the Y-coordinate is much
larger than the X-coordinate.
1215
01:03:29,090 --> 01:03:30,960
Here, I'm going to see something
where the X-coordinate is much
1216
01:03:30,960 --> 01:03:32,370
larger than the Y-coordinate.
1217
01:03:32,370 --> 01:03:33,480
And here, I'm going
to see something
1218
01:03:33,480 --> 01:03:35,354
where the X-coordinate
is about the same size
1219
01:03:35,354 --> 01:03:38,390
of the Y-coordinate.
1220
01:03:38,390 --> 01:03:40,499
So when things
starts to be bigger,
1221
01:03:40,499 --> 01:03:42,040
you're going to have
to make choices.
1222
01:03:42,040 --> 01:03:43,900
What does it mean to be bigger--
1223
01:03:43,900 --> 01:03:48,670
when d is 100,000,
I mean, the sum
1224
01:03:48,670 --> 01:03:51,160
of the squares of those
guys have to be equal to 1.
1225
01:03:51,160 --> 01:03:52,790
So they're all
very small numbers.
1226
01:03:52,790 --> 01:03:54,670
And so it's hard for you to
tell which one is a big number
1227
01:03:54,670 --> 01:03:56,045
and which ones is
a small number.
1228
01:03:56,045 --> 01:03:57,378
Why would you want to know this?
1229
01:03:57,378 --> 01:03:58,840
Because it's
actually telling you
1230
01:03:58,840 --> 01:04:03,219
that if v is very close to
e1, then that means that e1--
1231
01:04:03,219 --> 01:04:04,760
in the case of the
gene example, that
1232
01:04:04,760 --> 01:04:08,510
would mean that e1 is the
gene that's very important.
1233
01:04:08,510 --> 01:04:10,100
Maybe there's actually
just two genes
1234
01:04:10,100 --> 01:04:12,109
that explain those two things.
1235
01:04:12,109 --> 01:04:14,150
And those are the genes
that have been picked up.
1236
01:04:14,150 --> 01:04:16,880
There are two genes that
encode geographic location,
1237
01:04:16,880 --> 01:04:18,224
and that's it.
1238
01:04:18,224 --> 01:04:19,640
And so it's very
important for you
1239
01:04:19,640 --> 01:04:21,630
to be able to
interpret what v means.
1240
01:04:21,630 --> 01:04:23,270
Where it has large
values, it means
1241
01:04:23,270 --> 01:04:26,689
that maybe it has large
values for e1, e2, and e3.
1242
01:04:26,689 --> 01:04:28,980
And it means that it's a
combination of e1, e2, and e3.
1243
01:04:28,980 --> 01:04:30,813
And now, you can
interpret, because you have
1244
01:04:30,813 --> 01:04:33,150
only three variables to find.
1245
01:04:33,150 --> 01:04:36,780
And so sparse PCA
builds that in.
1246
01:04:36,780 --> 01:04:39,920
Sparse PCA says,
listen, I'm going
1247
01:04:39,920 --> 01:04:42,600
to want to have at most
10 non-zero coefficients.
1248
01:04:42,600 --> 01:04:44,550
And the rest, I want to be 0.
1249
01:04:44,550 --> 01:04:47,040
I want to be able to be a
combination of at most 10
1250
01:04:47,040 --> 01:04:50,540
of my original variables.
1251
01:04:50,540 --> 01:04:52,740
And now, I can do
interpretation.
1252
01:04:52,740 --> 01:04:54,690
So the problem
with sparse PCA is
1253
01:04:54,690 --> 01:04:57,404
that it becomes very
difficult numerically
1254
01:04:57,404 --> 01:04:58,320
to solve this problem.
1255
01:04:58,320 --> 01:04:59,220
I can write it.
1256
01:04:59,220 --> 01:05:05,700
So the problem is simply
maximize the variance u
1257
01:05:05,700 --> 01:05:09,360
transpose, say, Su
subject to-- well,
1258
01:05:09,360 --> 01:05:12,180
I want the norm of u to be equal to 1.
1259
01:05:12,180 --> 01:05:14,450
So that's the original PCA.
1260
01:05:14,450 --> 01:05:16,020
But now, I also
want that the sum
1261
01:05:16,020 --> 01:05:19,320
of the indicators of the
uj that are not equal to 0
1262
01:05:19,320 --> 01:05:23,120
is at most, say, 10.
1263
01:05:23,120 --> 01:05:26,550
This constraint is
very non-convex.
1264
01:05:26,550 --> 01:05:28,430
So I can relax it
to a convex one
1265
01:05:28,430 --> 01:05:31,720
like we did for
linear regression.
1266
01:05:31,720 --> 01:05:33,920
But now, I've totally
messed up with the fact
1267
01:05:33,920 --> 01:05:37,930
that I could use linear
algebra to solve this problem.
1268
01:05:37,930 --> 01:05:40,812
And so now, you have to go
through much more complicated
1269
01:05:40,812 --> 01:05:42,520
optimization techniques,
which are called
1270
01:05:42,520 --> 01:05:44,350
semidefinite
programs, which do not
1271
01:05:44,350 --> 01:05:46,600
scale well in high dimensions.
1272
01:05:46,600 --> 01:05:48,730
And so you have to do
a bunch of tricks--
1273
01:05:48,730 --> 01:05:49,660
numerical tricks.
1274
01:05:49,660 --> 01:05:52,630
But there are some packages
that implements some heuristics
1275
01:05:52,630 --> 01:05:55,140
or some other things--
1276
01:05:55,140 --> 01:05:56,800
iterative
thresholding, all sorts
1277
01:05:56,800 --> 01:05:58,896
of various numerical
tricks that you can do.
1278
01:05:58,896 --> 01:06:01,270
But the problem they are trying
to solve is exactly this.
1279
01:06:01,270 --> 01:06:03,947
Among all directions that
I have norm 1, of course,
1280
01:06:03,947 --> 01:06:06,030
because it's the direction
that have at most, say,
1281
01:06:06,030 --> 01:06:09,382
10 non-zero coordinates, I want
to find the one that maximizes
1282
01:06:09,382 --> 01:06:10,340
the empirical variance.
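One of the iterative-thresholding heuristics mentioned above can be sketched as truncated power iteration: ordinary power iteration on S, but keeping only the s largest-magnitude coordinates at each step. The covariance, spike size, and sparsity level below are invented; this is a heuristic sketch, not an exact solver for the non-convex problem.

```python
import numpy as np

rng = np.random.default_rng(2)

def truncated_power_iteration(S, s, n_iter=100):
    # Power iteration on S, hard-thresholding to the s largest-magnitude
    # coordinates at every step, then renormalizing to norm 1.
    d = S.shape[0]
    u = np.ones(d) / np.sqrt(d)
    for _ in range(n_iter):
        u = S @ u
        keep = np.argsort(np.abs(u))[-s:]  # indices of the s largest entries
        mask = np.zeros(d)
        mask[keep] = 1.0
        u = u * mask
        u /= np.linalg.norm(u)
    return u

# Toy check: a covariance with a sparse spike on the first 3 coordinates.
d = 50
v = np.zeros(d)
v[:3] = 1.0 / np.sqrt(3)
Sigma = np.eye(d) + 5.0 * np.outer(v, v)
X = rng.standard_normal((400, d)) @ np.linalg.cholesky(Sigma).T
S = X.T @ X / 400
u = truncated_power_iteration(S, s=3)
```

With a spike this strong, the heuristic recovers both the 3-coordinate support and the direction of the planted sparse eigenvector.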
1283
01:06:23,030 --> 01:06:27,782
Actually, let me
just show you this.
1284
01:06:41,910 --> 01:06:47,830
I wanted to show
you an output of PCA
1285
01:06:47,830 --> 01:06:50,620
where people are actually
trying to do directly--
1286
01:06:56,043 --> 01:07:05,903
maybe-- there you go.
1287
01:07:20,700 --> 01:07:26,690
So right here, you
see this is SPSS.
1288
01:07:26,690 --> 01:07:29,310
That's a statistical software.
1289
01:07:29,310 --> 01:07:33,100
And this is an output
that was preprocessed
1290
01:07:33,100 --> 01:07:34,650
by a professional--
1291
01:07:34,650 --> 01:07:36,240
not preprocessed,
post-processed.
1292
01:07:36,240 --> 01:07:38,520
So that's something
where they ran PCA.
1293
01:07:38,520 --> 01:07:39,390
So what is the data?
1294
01:07:39,390 --> 01:07:43,890
This is raw data
where you ask doctors
1295
01:07:43,890 --> 01:07:47,907
what they think of the
behavior of a particular sales
1296
01:07:47,907 --> 01:07:49,740
representative for
pharmaceutical companies.
1297
01:07:49,740 --> 01:07:51,323
So pharmaceutical
companies are trying
1298
01:07:51,323 --> 01:07:52,950
to improve their sales force.
1299
01:07:52,950 --> 01:07:56,430
And they're asking
doctors how would they
1300
01:07:56,430 --> 01:07:58,920
rate-- what do they value
about their interaction
1301
01:07:58,920 --> 01:08:01,410
with a sales representative.
1302
01:08:01,410 --> 01:08:04,140
So basically, there's
a bunch of questions.
1303
01:08:04,140 --> 01:08:10,410
One is: offers credible point
of view on, say, trends,
1304
01:08:10,410 --> 01:08:12,720
provides valuable
networking opportunities.
1305
01:08:12,720 --> 01:08:13,950
This is one question.
1306
01:08:13,950 --> 01:08:15,750
Rate this on a
scale from 1 to 5.
1307
01:08:15,750 --> 01:08:16,790
That was the question.
1308
01:08:16,790 --> 01:08:18,840
And they had a bunch
of questions like this.
1309
01:08:18,840 --> 01:08:22,410
And then they asked 1,000
doctors to make those ratings.
1310
01:08:22,410 --> 01:08:24,210
And what they want--
so each doctor now
1311
01:08:24,210 --> 01:08:25,890
is a vector of ratings.
1312
01:08:25,890 --> 01:08:28,960
And they want to know if there's
different groups of doctors,
1313
01:08:28,960 --> 01:08:30,210
what do doctors respond to.
1314
01:08:30,210 --> 01:08:31,240
If there's different
groups, then
1315
01:08:31,240 --> 01:08:33,450
maybe they know that they
can actually address them
1316
01:08:33,450 --> 01:08:35,500
separately, et cetera.
1317
01:08:35,500 --> 01:08:37,950
And so to do that, of course,
there's lots of questions.
1318
01:08:37,950 --> 01:08:39,840
And so what you want is
to just first project
1319
01:08:39,840 --> 01:08:41,589
into lower dimensions,
so you can actually
1320
01:08:41,589 --> 01:08:42,819
visualize what's going on.
1321
01:08:42,819 --> 01:08:44,760
And this is what
was done for this.
1322
01:08:44,760 --> 01:08:47,490
So these are the first
three principal components
1323
01:08:47,490 --> 01:08:49,439
that came out.
1324
01:08:49,439 --> 01:08:52,439
And even though we ordered
the values of the lambdas,
1325
01:08:52,439 --> 01:08:56,130
there's no reason why the
entries of v should be ordered.
1326
01:08:56,130 --> 01:08:57,840
And if you look at
the values of v here,
1327
01:08:57,840 --> 01:08:59,631
they look like they're
pretty much ordered.
1328
01:08:59,631 --> 01:09:04,142
It starts at 0.784, and then
you're at 0.3 around here.
1329
01:09:04,142 --> 01:09:06,600
There's something that goes up
again, and then you go down.
1330
01:09:06,600 --> 01:09:11,200
Actually, it's marked in red
every time it goes up again.
1331
01:09:11,200 --> 01:09:13,660
And so now, what they
did is they said,
1332
01:09:13,660 --> 01:09:16,270
OK, I need to
interpret those guys.
1333
01:09:16,270 --> 01:09:18,340
I need to tell you what this is.
1334
01:09:18,340 --> 01:09:21,160
If you tell me, we found
the principal component
1335
01:09:21,160 --> 01:09:24,866
that really discriminates
the doctors in two groups,
1336
01:09:24,866 --> 01:09:26,740
the drug company is
going to come back to you
1337
01:09:26,740 --> 01:09:29,080
and say, OK, what is
this characteristic?
1338
01:09:29,080 --> 01:09:31,510
And you say, oh, it's
actually a linear combination
1339
01:09:31,510 --> 01:09:33,460
of 40 characteristics.
1340
01:09:33,460 --> 01:09:35,735
And they say, well, we
don't need you to do that.
1341
01:09:35,735 --> 01:09:38,109
I mean, it cannot be a linear
combination of anything you
1342
01:09:38,109 --> 01:09:39,220
didn't ask.
1343
01:09:39,220 --> 01:09:41,680
And so for that,
first of all, there's
1344
01:09:41,680 --> 01:09:44,859
a post-processing of PCA, which
says, OK, once I actually,
1345
01:09:44,859 --> 01:09:46,990
say, found three
principal components,
1346
01:09:46,990 --> 01:09:51,370
that means that I found the
dimension three space on which
1347
01:09:51,370 --> 01:09:52,899
I want to project my points.
1348
01:09:52,899 --> 01:09:55,720
In this space, I can pick
any direction I want.
1349
01:09:55,720 --> 01:09:57,100
So the first thing
is that you do
1350
01:09:57,100 --> 01:09:59,308
some sort of local arrangements,
so that those things
1351
01:09:59,308 --> 01:10:01,790
look like they are increasing
and then decreasing.
1352
01:10:01,790 --> 01:10:06,130
So you just change, you
rotate your coordinate system
1353
01:10:06,130 --> 01:10:09,880
in this three dimensional space
that you've actually isolated.
1354
01:10:09,880 --> 01:10:11,830
And so once you do
this, the reason
1355
01:10:11,830 --> 01:10:13,600
to do that is that
it sort of makes
1356
01:10:13,600 --> 01:10:16,554
big, sharp differences
between large and small values
1357
01:10:16,554 --> 01:10:18,220
of the coordinates
of the thing you had.
1358
01:10:18,220 --> 01:10:19,261
And why do you want this?
1359
01:10:19,261 --> 01:10:21,100
Because now, you
can say, well, I'm
1360
01:10:21,100 --> 01:10:23,590
going to start looking at the
ones that have large values.
1361
01:10:23,590 --> 01:10:24,250
And what do they say?
1362
01:10:24,250 --> 01:10:26,249
They say in-depth knowledge,
in-depth knowledge,
1363
01:10:26,249 --> 01:10:28,270
in-depth knowledge,
knowledge about.
1364
01:10:28,270 --> 01:10:30,280
This thing is clearly
something that
1365
01:10:30,280 --> 01:10:34,090
actually characterizes
the knowledge of my sales
1366
01:10:34,090 --> 01:10:35,260
representative.
1367
01:10:35,260 --> 01:10:38,311
And so that's something that
doctors are sensitive to.
1368
01:10:38,311 --> 01:10:40,060
That's something that
really discriminates
1369
01:10:40,060 --> 01:10:40,960
the doctors in a way.
1370
01:10:40,960 --> 01:10:43,120
There's lots of variance
along those things,
1371
01:10:43,120 --> 01:10:45,576
or at least a lot of variance--
1372
01:10:45,576 --> 01:10:47,950
I mean, doctors are separate
in terms of their experience
1373
01:10:47,950 --> 01:10:49,240
with respect to this.
1374
01:10:49,240 --> 01:10:51,102
And so what they
did is said, OK,
1375
01:10:51,102 --> 01:10:53,310
all these guys, some of
those they have large values,
1376
01:10:53,310 --> 01:10:55,015
but I don't know how
to interpret them.
1377
01:10:55,015 --> 01:10:56,890
And so I'm just going
to put the first block,
1378
01:10:56,890 --> 01:10:58,681
and I'm going to call
it medical knowledge,
1379
01:10:58,681 --> 01:11:01,330
because all those things are
knowledge about medical stuff.
1380
01:11:01,330 --> 01:11:03,538
Then here, I didn't know
how to interpret those guys.
1381
01:11:03,538 --> 01:11:06,220
But those guys, there's a big
clump of large coordinates,
1382
01:11:06,220 --> 01:11:10,720
and they're about respectful
of my time, listens, friendly
1383
01:11:10,720 --> 01:11:12,070
but courteous.
1384
01:11:12,070 --> 01:11:14,000
This is all about the
quality of interaction.
1385
01:11:14,000 --> 01:11:17,446
So this block was actually
called quality of interaction.
1386
01:11:17,446 --> 01:11:18,820
And then there
was a third block,
1387
01:11:18,820 --> 01:11:21,320
which you can tell starts to
be spreading a little thin.
1388
01:11:21,320 --> 01:11:22,864
There's just much less of them.
1389
01:11:22,864 --> 01:11:24,280
But this thing was
actually called
1390
01:11:24,280 --> 01:11:26,260
fair and critical opinion.
1391
01:11:26,260 --> 01:11:30,010
And so now, you have three
discriminating directions.
1392
01:11:30,010 --> 01:11:31,990
And you can actually
give them a name.
1393
01:11:31,990 --> 01:11:34,780
Wouldn't it be beautiful if
all the numbers in the gray box
1394
01:11:34,780 --> 01:11:36,700
came non-zero and
all the other numbers
1395
01:11:36,700 --> 01:11:38,860
came zero-- there
was no ad hoc choice.
1396
01:11:38,860 --> 01:11:40,750
I mean, this is probably
an afternoon of work
1397
01:11:40,750 --> 01:11:42,850
to like scratch out
all these numbers
1398
01:11:42,850 --> 01:11:44,801
and put all these
color codes, et cetera.
1399
01:11:44,801 --> 01:11:47,050
Whereas, you could just have
something that tells you,
1400
01:11:47,050 --> 01:11:49,090
OK, here are the non-zeros.
1401
01:11:49,090 --> 01:11:52,120
If you can actually make a story
around why this group of things
1402
01:11:52,120 --> 01:11:54,820
actually makes sense, such
as it is medical knowledge,
1403
01:11:54,820 --> 01:11:55,730
then good for you.
1404
01:11:55,730 --> 01:11:57,804
Otherwise, you could
just say, I can't.
1405
01:11:57,804 --> 01:11:59,470
And that's what sparse
PCA does for you.
1406
01:11:59,470 --> 01:12:02,890
Sparse PCA outputs something
where all those numbers would
1407
01:12:02,890 --> 01:12:03,850
be zero.
1408
01:12:03,850 --> 01:12:06,964
And there would be exactly,
say, 10 non-zero coordinates.
1409
01:12:06,964 --> 01:12:08,380
And you can tune
this knob of 10.
1410
01:12:08,380 --> 01:12:09,220
You can make it 9.
1411
01:12:11,687 --> 01:12:13,270
Depending on what
your major is, maybe
1412
01:12:13,270 --> 01:12:15,310
you can actually go
on with 20 of them
1413
01:12:15,310 --> 01:12:18,310
and have the ability to
tell the story about 20
1414
01:12:18,310 --> 01:12:20,650
different variables and how
they fit in the same group.
1415
01:12:20,650 --> 01:12:22,750
And depending on
how you feel, it's
1416
01:12:22,750 --> 01:12:25,390
easy to rerun the PCA
depending on the value
1417
01:12:25,390 --> 01:12:26,535
that you want here.
1418
01:12:26,535 --> 01:12:28,660
And so you could actually
just come up with the one
1419
01:12:28,660 --> 01:12:30,240
you prefer.
1420
01:12:30,240 --> 01:12:32,354
And so that's the
sparse PCA thing
1421
01:12:32,354 --> 01:12:33,520
which I'm trying to promote.
1422
01:12:33,520 --> 01:12:35,250
I mean, this is not
super well-spread.
1423
01:12:35,250 --> 01:12:39,300
It's a fairly new idea,
maybe at most 10 years old.
1424
01:12:39,300 --> 01:12:40,940
And it's not
completely well-spread
1425
01:12:40,940 --> 01:12:42,540
in statistical packages.
1426
01:12:42,540 --> 01:12:44,040
But that's clearly
what people are
1427
01:12:44,040 --> 01:12:46,601
trying to emulate currently.
1428
01:12:46,601 --> 01:12:47,100
Yes?
1429
01:12:47,100 --> 01:12:48,600
AUDIENCE: So what
exactly does it
1430
01:12:48,600 --> 01:12:50,932
mean that the doctors
have a lot of variance
1431
01:12:50,932 --> 01:12:53,100
in medical knowledge,
quality of interaction,
1432
01:12:53,100 --> 01:12:55,600
and fair and critical opinion?
1433
01:12:55,600 --> 01:13:00,200
Like, it was saying that
these are like the main things
1434
01:13:00,200 --> 01:13:02,986
that doctors vary on,
some doctors care.
1435
01:13:02,986 --> 01:13:05,590
Like we could sort of
characterize a doctor by, oh,
1436
01:13:05,590 --> 01:13:08,030
he cares this much about
medical knowledge, this much
1437
01:13:08,030 --> 01:13:09,494
about the quality
of interaction,
1438
01:13:09,494 --> 01:13:11,446
and this much about
critical opinion.
1439
01:13:11,446 --> 01:13:14,862
And that says most of the story
about what this doctor wants
1440
01:13:14,862 --> 01:13:17,790
from a drug representative?
1441
01:13:17,790 --> 01:13:20,610
PHILIPPE RIGOLLET: Not really.
1442
01:13:20,610 --> 01:13:22,590
I mean, OK, let's say
you pick only one.
1443
01:13:22,590 --> 01:13:31,535
So that means that you
would take all your doctors,
1444
01:13:31,535 --> 01:13:33,160
and you would have
one direction, which
1445
01:13:33,160 --> 01:13:36,480
is quality of interaction.
1446
01:13:36,480 --> 01:13:38,710
And there would be just
spread out points here.
1447
01:13:42,604 --> 01:13:44,270
So there are two
things that can happen.
1448
01:13:44,270 --> 01:13:46,900
The first one is that
there's a clump here,
1449
01:13:46,900 --> 01:13:49,014
and then there's a clump here.
1450
01:13:49,014 --> 01:13:50,680
That still represents
a lot of variance.
1451
01:13:50,680 --> 01:13:52,420
And if this happens,
you probably
1452
01:13:52,420 --> 01:13:55,120
want to go back in
your data and see
1453
01:13:55,120 --> 01:13:58,540
were these people visited
by a different group
1454
01:13:58,540 --> 01:14:00,520
than these people,
or maybe these people
1455
01:14:00,520 --> 01:14:02,700
have a different specialty.
1456
01:14:05,250 --> 01:14:07,000
I mean, you have to
look back at your data
1457
01:14:07,000 --> 01:14:08,470
and try to understand
why you would have
1458
01:14:08,470 --> 01:14:09,700
different groups of people.
1459
01:14:09,700 --> 01:14:13,510
And if it's like completely
evenly spread out,
1460
01:14:13,510 --> 01:14:15,730
then all it's saying
is that, if you
1461
01:14:15,730 --> 01:14:18,460
want to have a uniform
quality of interaction,
1462
01:14:18,460 --> 01:14:20,410
you need to take
measures on this.
1463
01:14:20,410 --> 01:14:24,114
You need to have this to
not be discrimination.
1464
01:14:24,114 --> 01:14:26,530
But I think really when it's
becoming interesting it's not
1465
01:14:26,530 --> 01:14:27,779
when it's complete spread out.
1466
01:14:27,779 --> 01:14:29,350
It's when there's
a big group here.
1467
01:14:29,350 --> 01:14:30,520
And then there's
almost no one here,
1468
01:14:30,520 --> 01:14:32,020
and then there's
a big group here.
1469
01:14:32,020 --> 01:14:34,880
And then maybe there's
something you can do.
1470
01:14:34,880 --> 01:14:40,690
And so those two things actually
give you a lot of variance.
1471
01:14:40,690 --> 01:14:47,490
So actually, maybe
I'll talk about this.
1472
01:14:47,490 --> 01:14:49,732
Here, this is sort of a mixture.
1473
01:14:49,732 --> 01:14:51,690
You have a mixture of
two different populations
1474
01:14:51,690 --> 01:14:53,040
of doctors.
1475
01:14:53,040 --> 01:14:56,670
And it turns out that
principal component analysis--
1476
01:14:56,670 --> 01:14:59,750
so a mixture is when you
have different populations--
1477
01:14:59,750 --> 01:15:02,010
think of like two
Gaussians that are just
1478
01:15:02,010 --> 01:15:03,690
centered at two
different points,
1479
01:15:03,690 --> 01:15:05,460
and maybe they're
in high dimensions.
1480
01:15:05,460 --> 01:15:07,350
And those are
clusters of people,
1481
01:15:07,350 --> 01:15:09,680
and you want to be able to
differentiate those guys.
1482
01:15:09,680 --> 01:15:10,770
If you're in very
high dimensions,
1483
01:15:10,770 --> 01:15:12,120
it's going to be very
difficult. But one
1484
01:15:12,120 --> 01:15:14,730
of the first processing tools
that people do is to do PCA.
1485
01:15:14,730 --> 01:15:18,046
Because if you have one big
group here and one big group
1486
01:15:18,046 --> 01:15:19,920
here, it means that
there's a lot of variance
1487
01:15:19,920 --> 01:15:21,961
along the direction that
goes through the centers
1488
01:15:21,961 --> 01:15:22,860
of those groups.
1489
01:15:22,860 --> 01:15:24,630
And that's essentially
what happened here.
1490
01:15:24,630 --> 01:15:27,967
You could think of this as being
two blobs in high dimensions.
1491
01:15:27,967 --> 01:15:29,550
But you're really
just projecting them
1492
01:15:29,550 --> 01:15:30,810
into one dimension.
1493
01:15:30,810 --> 01:15:33,370
And this dimension, hopefully,
goes through the center.
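The two-blobs picture can be sketched numerically: two Gaussian populations in high dimensions, separated by their centers, become visibly separated after projecting onto the first principal component. This is an illustrative toy example, not the lecture's data; the parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 400
mu = np.zeros(d)
mu[0] = 3.0                                   # centers at +mu and -mu

labels = rng.integers(0, 2, n)                # which population each point is in
X = rng.standard_normal((n, d)) + np.where(labels[:, None] == 1, mu, -mu)

Xc = X - X.mean(axis=0)                       # center the data
S = Xc.T @ Xc / n                             # empirical covariance (PSD)
eigvals, eigvecs = np.linalg.eigh(S)          # spectral theorem
v1 = eigvecs[:, -1]                           # leading principal direction

proj = Xc @ v1                                # project onto one dimension
pred = (proj > 0).astype(int)                 # sign of projection = cluster guess
acc = max(np.mean(pred == labels), np.mean(pred != labels))
print(acc)
```

The direction of largest variance runs through the two centers, so the 1-D projection recovers almost all of the mixture labels; in very high dimensions this PCA step is exactly the preprocessing being described.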
1494
01:15:33,370 --> 01:15:37,460
And so as preprocessing--
so I'm going to stop here.
1495
01:15:37,460 --> 01:15:42,720
But PCA is not just made
for dimension reduction.
1496
01:15:42,720 --> 01:15:44,700
It's used for
mixtures, for example.
1497
01:15:44,700 --> 01:15:47,340
It's also used when you
have graphical data.
1498
01:15:47,340 --> 01:15:48,750
What is the idea of PCA?
1499
01:15:48,750 --> 01:15:53,400
It just says, if you have a
matrix that seems to have low
1500
01:15:53,400 --> 01:15:56,370
rank-- meaning that there's a
lot of those lambda i's that
1501
01:15:56,370 --> 01:15:57,570
are very small--
1502
01:15:57,570 --> 01:16:00,420
and then I see that
plus noise, then
1503
01:16:00,420 --> 01:16:02,790
it's a good idea to
do PCA on this thing.
1504
01:16:02,790 --> 01:16:05,520
And in particular, people
use that in networks a lot.
1505
01:16:05,520 --> 01:16:08,300
So you take the adjacency
matrix of a graph--
1506
01:16:08,300 --> 01:16:11,160
well, you sort of preprocess it
a little bit, so it looks nice.
1507
01:16:11,160 --> 01:16:13,590
And then if you have, for
example, two communities
1508
01:16:13,590 --> 01:16:15,570
in there, it should
look like something that
1509
01:16:15,570 --> 01:16:18,510
is low rank plus some noise.
1510
01:16:18,510 --> 01:16:22,670
And low rank means that there's
just very few non-zero--
1511
01:16:22,670 --> 01:16:24,226
well, low rank means this.
1512
01:16:24,226 --> 01:16:26,100
Low rank means that if
you do the scree plot,
1513
01:16:26,100 --> 01:16:27,250
you will see
something like this,
1514
01:16:27,250 --> 01:16:29,720
which means that if you throw
out all the smaller ones,
1515
01:16:29,720 --> 01:16:33,420
it should not really matter
in the overall structure.
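The "low rank plus noise" picture for a graph with two communities can be checked directly: the expected adjacency matrix has rank 2, so the scree plot of the observed adjacency matrix shows two large eigenvalues above a bulk of small noise eigenvalues. A minimal sketch, with assumed connection probabilities (0.5 within a community, 0.1 across):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
z = np.array([1] * (n // 2) + [-1] * (n // 2))   # two planted communities

# Connection probability: 0.5 within a community, 0.1 across communities,
# so the expected adjacency matrix has rank 2 (low rank plus noise).
P = 0.30 + 0.20 * np.outer(z, z)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                      # symmetric, no self-loops

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]   # the "scree plot" values
print(eigvals[:4])                               # two spikes, then a bulk
```

The two leading eigenvalues are of order n while the rest are of order sqrt(n), so throwing out everything past the gap keeps the community structure; the second eigenvector's signs approximately recover the two groups.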
1516
01:16:33,420 --> 01:16:35,430
And so you can use all--
1517
01:16:35,430 --> 01:16:39,090
these techniques are used
everywhere these days, not
1518
01:16:39,090 --> 01:16:39,900
just in PCA.
1519
01:16:39,900 --> 01:16:41,670
So we call it PCA
as statisticians.
1520
01:16:41,670 --> 01:16:46,700
But people call it the
spectral methods or SVD.
1521
01:16:46,700 --> 01:16:49,450
So everyone--