1
00:00:00,120 --> 00:00:02,460
The following content is
provided under a Creative
2
00:00:02,460 --> 00:00:03,880
Commons license.
3
00:00:03,880 --> 00:00:06,090
Your support will help
MIT OpenCourseWare
4
00:00:06,090 --> 00:00:10,180
continue to offer high quality
educational resources for free.
5
00:00:10,180 --> 00:00:12,720
To make a donation or to
view additional materials
6
00:00:12,720 --> 00:00:16,680
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:16,680 --> 00:00:19,219
at ocw.mit.edu.
8
00:00:19,219 --> 00:00:20,760
PHILIPPE RIGOLLET:
We keep on talking
9
00:00:20,760 --> 00:00:24,870
about principal component
analysis, which we essentially
10
00:00:24,870 --> 00:00:27,910
introduced as a way to
work with a bunch of data.
11
00:00:27,910 --> 00:00:31,560
So the data that's given to
us when we want to do PCA
12
00:00:31,560 --> 00:00:35,270
is a bunch of vectors X1 to Xn.
13
00:00:35,270 --> 00:00:40,090
So they are random vectors.
14
00:00:45,290 --> 00:00:46,652
in Rd.
15
00:00:46,652 --> 00:00:48,110
And what we mentioned
is that we're
16
00:00:48,110 --> 00:00:51,742
going to be using linear
algebra-- in particular,
17
00:00:51,742 --> 00:00:54,200
the spectral theorem-- that
guarantees to us that if I look
18
00:00:54,200 --> 00:00:56,000
at the covariance
matrix of this guy,
19
00:00:56,000 --> 00:00:57,890
or its empirical
covariance matrix,
20
00:00:57,890 --> 00:01:00,132
since they're
symmetric real matrices
21
00:01:00,132 --> 00:01:01,590
and they are positive
semidefinite,
22
00:01:01,590 --> 00:01:06,830
there exists a diagonalization
into non-negative eigenvalues.
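As an illustrative aside (not part of the lecture), this spectral decomposition can be checked numerically in Python with NumPy; the data and dimensions below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5                      # n points in R^d (illustrative sizes)
X = rng.normal(size=(n, d))        # rows are the observations X_1, ..., X_n

Xc = X - X.mean(axis=0)            # center the points
S = Xc.T @ Xc / n                  # empirical covariance matrix, d x d

# S is symmetric real and positive semidefinite, so eigh returns real,
# non-negative eigenvalues and an orthogonal matrix of eigenvectors.
lam, P = np.linalg.eigh(S)         # S = P diag(lam) P^T
assert np.all(lam >= -1e-10)                 # eigenvalues non-negative
assert np.allclose(P.T @ P, np.eye(d))       # P is orthogonal
```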
23
00:01:06,830 --> 00:01:09,555
And so here, those
things live in Rd,
24
00:01:09,555 --> 00:01:11,570
so it's a really large space.
25
00:01:11,570 --> 00:01:14,600
And what we want to
do is to map it down
26
00:01:14,600 --> 00:01:16,640
into a space that
we can visualize,
27
00:01:16,640 --> 00:01:19,610
hopefully a space
of size 2 or 3.
28
00:01:19,610 --> 00:01:22,460
Or if not, then we're just going
to take more and start looking
29
00:01:22,460 --> 00:01:24,920
at subspaces altogether.
30
00:01:24,920 --> 00:01:33,120
So think of the case where d
is large but not larger than n.
31
00:01:33,120 --> 00:01:36,520
So let's say, you have a
large number of points.
32
00:01:36,520 --> 00:01:40,590
The question is, is it possible
to project those things onto
33
00:01:40,590 --> 00:01:45,260
a lower dimensional
space, d prime,
34
00:01:45,260 --> 00:01:49,480
which is much less than d-- so
think of d prime equals, say,
35
00:01:49,480 --> 00:01:52,180
2 or 3--
36
00:01:52,180 --> 00:01:54,490
and so that you keep
as much information
37
00:01:54,490 --> 00:01:56,740
about the cloud of points
that you had originally.
38
00:01:56,740 --> 00:01:58,990
So again, the example
that we could have
39
00:01:58,990 --> 00:02:04,060
is that X1 to Xn are, say,
Xi for patient i, recording
40
00:02:04,060 --> 00:02:08,740
a bunch of body measurements
and maybe blood pressure,
41
00:02:08,740 --> 00:02:10,639
some symptoms, et cetera.
42
00:02:10,639 --> 00:02:12,520
And then we have a
cloud of n patients.
43
00:02:12,520 --> 00:02:15,222
And we're trying to
visualize maybe to see if--
44
00:02:15,222 --> 00:02:16,930
If I could see, for
example, that there's
45
00:02:16,930 --> 00:02:18,820
two groups of
patients, maybe I would
46
00:02:18,820 --> 00:02:21,252
know that I have two
groups with different diseases
47
00:02:21,252 --> 00:02:22,960
or maybe two groups
of different patients
48
00:02:22,960 --> 00:02:25,540
that respond differently
to a particular disease
49
00:02:25,540 --> 00:02:27,040
or drug et cetera.
50
00:02:27,040 --> 00:02:28,900
So visualizing is
going to give us
51
00:02:28,900 --> 00:02:33,880
quite a bit of insight about
what the spatial arrangement
52
00:02:33,880 --> 00:02:35,980
of those vectors is.
53
00:02:35,980 --> 00:02:40,660
And so PCA says, well, here,
of course, in this question,
54
00:02:40,660 --> 00:02:42,880
one thing that's not defined
is what is information.
55
00:02:42,880 --> 00:02:44,338
And we said that
one thing we might
56
00:02:44,338 --> 00:02:46,600
want to do when we project
is that points do not
57
00:02:46,600 --> 00:02:48,267
collide with each other.
58
00:02:48,267 --> 00:02:50,350
And so that means we're
trying to find directions,
59
00:02:50,350 --> 00:02:53,110
where after I project, the
points are still pretty spread
60
00:02:53,110 --> 00:02:53,860
out.
61
00:02:53,860 --> 00:02:55,630
And so I can see
what's going on.
62
00:02:55,630 --> 00:02:58,270
And PCA says-- OK,
so there's many ways
63
00:02:58,270 --> 00:02:59,500
to answer this question.
64
00:02:59,500 --> 00:03:04,290
And PCA says, let's just
find a subspace of dimension
65
00:03:04,290 --> 00:03:08,110
d prime that keeps as much
covariance structure as
66
00:03:08,110 --> 00:03:10,150
possible.
67
00:03:10,150 --> 00:03:13,390
And the reason is
that those directions
68
00:03:13,390 --> 00:03:15,430
are the ones that maximize
the variance, which
69
00:03:15,430 --> 00:03:17,230
is a proxy for the spread.
70
00:03:17,230 --> 00:03:19,540
There's many, many
ways to do this.
71
00:03:19,540 --> 00:03:22,840
There's actually a
Google video that
72
00:03:22,840 --> 00:03:26,440
was released maybe last week
about the data visualization
73
00:03:26,440 --> 00:03:29,260
team of Google that shows
you something called
74
00:03:29,260 --> 00:03:31,554
t-SNE, which is
essentially something
75
00:03:31,554 --> 00:03:32,470
that tries to do that.
76
00:03:32,470 --> 00:03:34,540
It takes points in
very high dimensions
77
00:03:34,540 --> 00:03:36,400
and tries to map them
in lower dimensions,
78
00:03:36,400 --> 00:03:38,280
so that you can
actually visualize them.
79
00:03:38,280 --> 00:03:41,800
And t-SNE is some
alternative to PCA
80
00:03:41,800 --> 00:03:46,850
that gives another definition
for the word information.
81
00:03:46,850 --> 00:03:49,970
I'll talk about this towards
the end, how you can actually
82
00:03:49,970 --> 00:03:52,730
somewhat automatically
extend everything
83
00:03:52,730 --> 00:03:58,830
we've said for PCA to an
infinite family of procedures.
84
00:03:58,830 --> 00:04:00,460
So how do we do this?
85
00:04:00,460 --> 00:04:02,690
Well, the way we do
this is as follows.
86
00:04:02,690 --> 00:04:05,010
So remember, given
those guys, we
87
00:04:05,010 --> 00:04:09,120
can form something which is
called S, which is the sample,
88
00:04:09,120 --> 00:04:16,885
or the empirical
covariance matrix.
89
00:04:19,930 --> 00:04:22,210
And from a
couple of slides ago,
90
00:04:22,210 --> 00:04:25,450
we know that S has an
eigenvalue decomposition,
91
00:04:25,450 --> 00:04:32,930
S is equal to PDP transpose,
where P is orthogonal.
92
00:04:35,570 --> 00:04:37,720
So that's where we use our
linear algebra results.
93
00:04:37,720 --> 00:04:43,640
So that means that P transpose P
is equal to PP transpose, which
94
00:04:43,640 --> 00:04:46,220
is the identity.
95
00:04:46,220 --> 00:04:50,370
So remember, S is
a d by d matrix.
96
00:04:50,370 --> 00:04:53,070
And so P is also d by d.
97
00:04:53,070 --> 00:04:55,860
And D is diagonal.
98
00:05:00,402 --> 00:05:02,860
And I'm actually going to take,
without loss of generality,
99
00:05:02,860 --> 00:05:04,487
I'm going to assume that D--
100
00:05:04,487 --> 00:05:06,070
so it's going to be
diagonal-- and I'm
101
00:05:06,070 --> 00:05:10,240
going to have something
that looks like lambda 1
102
00:05:10,240 --> 00:05:10,930
to lambda d.
103
00:05:10,930 --> 00:05:14,830
Those are called the
eigenvalues of S.
104
00:05:14,830 --> 00:05:19,036
What we know is that lambda
j's are non-negative.
105
00:05:19,036 --> 00:05:21,160
And actually, what I'm
going to assume without loss
106
00:05:21,160 --> 00:05:24,820
of generality is lambda 1
is larger than lambda 2, which
107
00:05:24,820 --> 00:05:30,259
is larger than, and so on, down to lambda d.
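As a practical aside (not from the lecture): NumPy's eigh returns eigenvalues in ascending order, so matching this decreasing convention means reversing them and permuting the columns of P the same way, exactly the relabeling described here. The matrix S below is made up for illustration:

```python
import numpy as np

S = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])    # an illustrative symmetric PSD matrix

lam, P = np.linalg.eigh(S)         # ascending: lam[0] <= ... <= lam[-1]

# Reorder so lambda_1 >= lambda_2 >= ... >= lambda_d, permuting the
# columns of P to match -- the same permutation the lecture describes.
order = np.argsort(lam)[::-1]
lam, P = lam[order], P[:, order]

assert np.all(np.diff(lam) <= 0)                 # now decreasing
assert np.allclose(P @ np.diag(lam) @ P.T, S)    # still a valid decomposition
```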
108
00:05:30,259 --> 00:05:32,050
Because in particular,
this decomposition--
109
00:05:32,050 --> 00:05:35,470
the spectral decomposition--
is not entirely unique.
110
00:05:35,470 --> 00:05:39,750
I could permute
the columns of P,
111
00:05:39,750 --> 00:05:42,600
and I would still have
an orthogonal matrix.
112
00:05:42,600 --> 00:05:44,820
And to balance that,
I would also have
113
00:05:44,820 --> 00:05:46,890
to permute the entries of d.
114
00:05:46,890 --> 00:05:49,680
So there's as many
decompositions
115
00:05:49,680 --> 00:05:51,180
as there are permutations.
116
00:05:51,180 --> 00:05:52,860
So there's actually quite a bit.
117
00:05:52,860 --> 00:05:56,760
But the bag of
eigenvalues is unique.
118
00:05:56,760 --> 00:05:58,430
The set of
eigenvalues is unique.
119
00:05:58,430 --> 00:06:01,020
The ordering is
certainly not unique.
120
00:06:01,020 --> 00:06:02,730
So here, I'm just
going to pick--
121
00:06:02,730 --> 00:06:05,640
I'm going to nail down one
particular permutation--
122
00:06:05,640 --> 00:06:08,070
actually, maybe two in
case I have equalities.
123
00:06:08,070 --> 00:06:12,570
But let's say, I pick
one that satisfies this.
124
00:06:12,570 --> 00:06:15,450
And the reason why I do this
is really not very important.
125
00:06:15,450 --> 00:06:18,060
It's just to say,
I'm going to want
126
00:06:18,060 --> 00:06:20,500
to talk about the largest
of those eigenvalues.
127
00:06:20,500 --> 00:06:22,110
So this is just
going to be easier
128
00:06:22,110 --> 00:06:23,910
for me to say that
this one is lambda 1,
129
00:06:23,910 --> 00:06:26,730
rather than say it's lambda 7.
130
00:06:26,730 --> 00:06:39,980
So this is just to say that
the largest eigenvalue of S
131
00:06:39,980 --> 00:06:42,588
is lambda 1.
132
00:06:42,588 --> 00:06:45,550
If I didn't do that, I would
just call it maybe lambda max,
133
00:06:45,550 --> 00:06:47,760
and you would just know
which one I'm talking about.
134
00:06:52,910 --> 00:07:01,520
So what's happening now
is that if I look at d,
135
00:07:01,520 --> 00:07:04,250
then it turns out
that if I start--
136
00:07:04,250 --> 00:07:09,890
so if I do P transpose Xi, I am
actually projecting my Xi's--
137
00:07:09,890 --> 00:07:12,820
I'm basically changing
the basis for my Xi's.
138
00:07:12,820 --> 00:07:15,140
And now, D is the
empirical covariance matrix
139
00:07:15,140 --> 00:07:16,700
of those guys.
140
00:07:16,700 --> 00:07:18,630
So let's check that.
141
00:07:18,630 --> 00:07:22,010
So what it means is
that if I look at--
142
00:07:26,303 --> 00:07:29,120
so what I claim is
that P transpose Xi--
143
00:07:29,120 --> 00:07:35,180
that's a new vector, let's
call it Yi, it's also in Rd--
144
00:07:35,180 --> 00:07:37,940
and what I claim is that the
covariance matrix of this guy
145
00:07:37,940 --> 00:07:41,840
is actually now this
diagonal matrix, which
146
00:07:41,840 --> 00:07:45,140
means in particular that
if they were Gaussian, then
147
00:07:45,140 --> 00:07:46,280
they would be independent.
148
00:07:46,280 --> 00:07:48,890
But I also know now that
there's no correlation
149
00:07:48,890 --> 00:07:50,530
across coordinates of Yi.
150
00:07:50,530 --> 00:08:00,939
So to prove this, let me assume
that X bar is equal to 0.
151
00:08:00,939 --> 00:08:02,980
And the reason why I do
this is because it's just
152
00:08:02,980 --> 00:08:05,560
annoying to carry out all
this centering constantly
153
00:08:05,560 --> 00:08:09,400
when I talk about S. So
when X bar is equal to 0,
154
00:08:09,400 --> 00:08:11,640
that implies that S
has a very simple form.
155
00:08:11,640 --> 00:08:14,170
It's of the form 1/n
times the sum from i equal 1
156
00:08:14,170 --> 00:08:18,790
to n of Xi Xi transpose.
157
00:08:18,790 --> 00:08:20,380
So that's my S.
158
00:08:20,380 --> 00:08:24,370
But what I want is the S of Y--
159
00:08:24,370 --> 00:08:28,830
So OK, that implies
also that P times X
160
00:08:28,830 --> 00:08:34,690
bar, the average of the
PXi's, is also equal to 0.
161
00:08:34,690 --> 00:08:37,929
So that means that Y bar--
162
00:08:37,929 --> 00:08:40,240
Y has mean 0, if this is 0.
163
00:08:40,240 --> 00:08:43,970
So if I look at the sample
covariance matrix of Y,
164
00:08:43,970 --> 00:08:45,880
it's just going to
be something that
165
00:08:45,880 --> 00:08:49,990
looks like the sum of the
outer products or the Yi Yi
166
00:08:49,990 --> 00:08:50,590
transpose.
167
00:08:53,290 --> 00:08:56,770
And again, the reason why
I make this assumption
168
00:08:56,770 --> 00:09:01,400
is so that I don't have to write
minus X bar X bar transpose.
169
00:09:01,400 --> 00:09:02,284
But you can do it.
170
00:09:02,284 --> 00:09:03,950
And it's going to
work exactly the same.
171
00:09:06,790 --> 00:09:08,640
So now, I look at this S prime.
172
00:09:08,640 --> 00:09:11,340
And so what is this S prime?
173
00:09:11,340 --> 00:09:14,340
Well, I'm just going
to replace Yi with PXi.
174
00:09:14,340 --> 00:09:22,850
So it's the sum from i equal
1 to n of PXi PXi transpose,
175
00:09:22,850 --> 00:09:26,627
which is equal to the sum from--
176
00:09:26,627 --> 00:09:27,460
sorry there's a 1/n.
177
00:09:32,360 --> 00:09:34,820
So it's equal to 1/n
sum from i equal 1
178
00:09:34,820 --> 00:09:43,490
to n of PXi Xi
transpose P transpose.
179
00:09:43,490 --> 00:09:45,130
Agree?
180
00:09:45,130 --> 00:09:48,580
I just said that the transpose
of AB is the transpose of B
181
00:09:48,580 --> 00:09:53,830
times the transpose of A.
182
00:09:53,830 --> 00:09:55,900
And so now, I can
push the sum in.
183
00:09:55,900 --> 00:09:57,520
P does not depend on i.
184
00:09:57,520 --> 00:10:05,800
So this thing here is
equal to PS P transpose,
185
00:10:05,800 --> 00:10:10,130
because the sum of the Xi Xi
transpose divided by n is S.
186
00:10:10,130 --> 00:10:12,200
But what is PS P transpose?
187
00:10:12,200 --> 00:10:17,090
Well, we know that
S is equal to--
188
00:10:17,090 --> 00:10:19,340
sorry that's P transpose.
189
00:10:19,340 --> 00:10:20,880
So this was with a P transpose.
190
00:10:20,880 --> 00:10:23,420
I'm sorry, I made an
important mistake here.
191
00:10:23,420 --> 00:10:25,420
So Yi is P transpose Xi.
192
00:10:25,420 --> 00:10:27,440
So this is P transpose
and P transpose
193
00:10:27,440 --> 00:10:29,600
here, which means that
this is P transpose
194
00:10:29,600 --> 00:10:32,450
and this is double transpose,
which is just nothing
195
00:10:32,450 --> 00:10:34,150
so that transpose becomes nothing.
196
00:10:36,680 --> 00:10:41,600
So now, I write S
as PD P transpose.
197
00:10:41,600 --> 00:10:43,781
That's the spectral
decomposition
198
00:10:43,781 --> 00:10:44,530
that I had before.
199
00:10:44,530 --> 00:10:46,550
That's my eigenvalue
decomposition,
200
00:10:46,550 --> 00:10:49,050
which means that now,
if I look at S prime,
201
00:10:49,050 --> 00:10:56,000
it's P transpose times
PD P transpose P.
202
00:10:56,000 --> 00:10:58,300
But now, P transpose
P is the identity,
203
00:10:58,300 --> 00:11:00,250
P transpose P is the identity.
204
00:11:00,250 --> 00:11:06,646
So this is actually
just equal to D.
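As an illustrative check (synthetic data, not from the lecture): forming Yi = P transpose Xi and recomputing the empirical covariance should give exactly the diagonal matrix D of eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated data
X = X - X.mean(axis=0)             # assume X bar = 0, as in the lecture

S = X.T @ X / n                    # empirical covariance of the X_i's
lam, P = np.linalg.eigh(S)         # S = P diag(lam) P^T

Y = X @ P                          # row i of Y is (P^T X_i)^T
S_prime = Y.T @ Y / n              # empirical covariance of the Y_i's

# S' = P^T S P = D: diagonal, with the eigenvalues on the diagonal
assert np.allclose(S_prime, np.diag(lam))
```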
205
00:11:06,646 --> 00:11:08,270
And again, you can
check that this also
206
00:11:08,270 --> 00:11:12,840
works if you have to center
all those guys as you go.
207
00:11:12,840 --> 00:11:15,630
But if you think about
it, this is the same thing
208
00:11:15,630 --> 00:11:19,530
as saying that I just
replaced Xi by Xi minus X bar.
209
00:11:19,530 --> 00:11:26,590
And then it's true that Yi
is P transpose times Xi minus X bar.
210
00:11:26,590 --> 00:11:29,770
So now, we have that D is
the empirical covariance
211
00:11:29,770 --> 00:11:30,940
matrix of those guys--
212
00:11:30,940 --> 00:11:33,112
the Yi's, which are
P transpose Xi's.
213
00:11:33,112 --> 00:11:34,570
And so in particular,
what it means
214
00:11:34,570 --> 00:11:42,810
is that if I look at the
covariance of Yj Yk--
215
00:11:46,130 --> 00:11:48,920
So that's the covariance
of the j-th coordinate of Y
216
00:11:48,920 --> 00:11:51,650
and the k-th coordinate of Y.
I'm just not putting an index.
217
00:11:51,650 --> 00:11:53,720
But maybe, let's say the
first one or something
218
00:11:53,720 --> 00:11:56,142
like this-- any of
them, they're IID.
219
00:11:56,142 --> 00:11:57,350
Then what is this covariance?
220
00:11:57,350 --> 00:12:01,760
It's actually 0 if j
is different from k.
221
00:12:01,760 --> 00:12:06,590
And the covariance
between Yj and Yj,
222
00:12:06,590 --> 00:12:13,070
which is just the variance
of Yj, is equal to lambda j--
223
00:12:13,070 --> 00:12:17,300
the j-th largest eigenvalue.
224
00:12:17,300 --> 00:12:22,580
So the eigenvalues capture the
variance of my observations
225
00:12:22,580 --> 00:12:25,110
in this new coordinate system.
226
00:12:25,110 --> 00:12:26,632
And they're
completely orthogonal.
227
00:12:26,632 --> 00:12:27,590
So what does that mean?
228
00:12:27,590 --> 00:12:29,750
Well, again, remember,
if I chop off
229
00:12:29,750 --> 00:12:34,160
the head of my Gaussian
in multiple dimensions,
230
00:12:34,160 --> 00:12:35,780
we said that what
we started from
231
00:12:35,780 --> 00:12:39,560
was something that
looked like this.
232
00:12:39,560 --> 00:12:42,320
And we said, well, there's one
direction that's important,
233
00:12:42,320 --> 00:12:45,230
that's this guy, and one
important one that's this guy.
234
00:12:45,230 --> 00:12:48,200
When I applied a transformation
P transpose, what I'm doing
235
00:12:48,200 --> 00:12:51,110
is that I'm realigning this
thing with the new axes.
236
00:12:51,110 --> 00:12:53,660
Or in a way, rather
to be fair, I'm
237
00:12:53,660 --> 00:12:59,600
not actually realigning
the ellipses with the axes.
238
00:12:59,600 --> 00:13:02,690
I'm really realigning the
axes with the ellipses.
239
00:13:02,690 --> 00:13:05,360
So really, what I'm doing is
I'm saying, after I apply P,
240
00:13:05,360 --> 00:13:08,690
I'm just rotating this
coordinate system.
241
00:13:08,690 --> 00:13:12,670
So now, it becomes this guy.
242
00:13:19,360 --> 00:13:22,850
And now, my ellipses
actually completely align.
243
00:13:22,850 --> 00:13:25,730
And what happens here is
that this coordinate is
244
00:13:25,730 --> 00:13:27,110
independent of that coordinate.
245
00:13:27,110 --> 00:13:31,715
And that's what we write
here, if they are Gaussian.
246
00:13:31,715 --> 00:13:32,840
I didn't really tell you this--
247
00:13:32,840 --> 00:13:34,810
I'm only making statements
about covariances.
248
00:13:34,810 --> 00:13:36,768
If they are Gaussians,
those imply statements
249
00:13:36,768 --> 00:13:37,614
about independence.
250
00:13:40,960 --> 00:13:44,590
So as I said, the
variance now, lambda 1,
251
00:13:44,590 --> 00:13:54,700
is actually the variance
of P transpose Xi.
252
00:13:57,890 --> 00:14:00,140
But if I look now at
the-- so this is a vector,
253
00:14:00,140 --> 00:14:04,910
so I need to look at the
first coordinate of this guy.
254
00:14:08,490 --> 00:14:11,250
So it turns out that
doing this is actually
255
00:14:11,250 --> 00:14:15,440
the same thing as looking
at the variance of what?
256
00:14:15,440 --> 00:14:21,480
Well, the first
column of P times Xi.
257
00:14:21,480 --> 00:14:24,490
So that's the variance of--
258
00:14:24,490 --> 00:14:30,344
I'm going to call it v1
transpose Xi, where P--
259
00:14:44,390 --> 00:14:53,920
So the v1 to vd in Rd
are eigenvectors.
260
00:14:53,920 --> 00:14:57,190
And each vi is
associated to lambda i.
261
00:14:57,190 --> 00:14:59,740
So that's what we saw when
we talked about this eigen
262
00:14:59,740 --> 00:15:02,800
decomposition a
couple of slides back.
263
00:15:02,800 --> 00:15:06,040
That's the one here.
264
00:15:06,040 --> 00:15:10,310
So if I call the
columns of P v1 to vd,
265
00:15:10,310 --> 00:15:13,600
this is what's happening.
266
00:15:13,600 --> 00:15:16,030
So when I look at lambda
1, it's just the variance
267
00:15:16,030 --> 00:15:19,700
of Xi inner product with v1.
268
00:15:19,700 --> 00:15:22,180
And we made this picture
when we said, well,
269
00:15:22,180 --> 00:15:25,870
let's say v1 is here
and then x1 is here.
270
00:15:25,870 --> 00:15:31,180
And if v1 has unit
norm, then the inner product
271
00:15:31,180 --> 00:15:38,050
between Xi and v1 is just
the length of this guy here.
272
00:15:38,050 --> 00:15:41,020
So that's the variance of the
Xi's-- that's the length of Xi--
273
00:15:41,020 --> 00:15:43,720
so this is 0-- that's the
length of Xi when I project it
274
00:15:43,720 --> 00:15:46,750
onto the direction
that's spanned by v1.
275
00:15:46,750 --> 00:15:52,210
If v1 has length 2, this is
really just twice this length.
276
00:15:52,210 --> 00:15:56,340
If v1 has length 3,
it's three times this.
277
00:15:56,340 --> 00:16:01,570
But it turns out that since
P satisfies P transpose
278
00:16:01,570 --> 00:16:04,780
P is equal to the identity--
279
00:16:04,780 --> 00:16:07,900
that's an orthogonal
matrix, that's right here--
280
00:16:07,900 --> 00:16:11,470
then this is actually
saying the same thing
281
00:16:11,470 --> 00:16:18,760
as vj transpose vj, which is
really the norm squared of vj,
282
00:16:18,760 --> 00:16:20,800
is equal to 1.
283
00:16:20,800 --> 00:16:26,520
And vj transpose vk is equal
to 0, if j is different from k.
284
00:16:29,610 --> 00:16:31,560
The eigenvectors are
orthogonal to each other.
285
00:16:31,560 --> 00:16:33,050
And they're actually
all of norm 1.
286
00:16:37,390 --> 00:16:39,580
So now, I know that this
is indeed a direction.
287
00:16:39,580 --> 00:16:44,290
And so when I look
at v1 transpose Xi,
288
00:16:44,290 --> 00:16:46,240
I'm really measuring
exactly this length.
289
00:16:46,240 --> 00:16:47,460
And what is this length?
290
00:16:47,460 --> 00:16:49,660
It's the length of
the projection of Xi
291
00:16:49,660 --> 00:16:51,190
onto this line.
292
00:16:51,190 --> 00:16:53,920
That's the line
that's spanned by v1.
293
00:16:53,920 --> 00:16:57,680
So if I had a very high
dimensional problem
294
00:16:57,680 --> 00:17:01,460
and I started to look
at the direction v1--
295
00:17:01,460 --> 00:17:03,884
let's say v1 now is
not an eigenvector,
296
00:17:03,884 --> 00:17:08,270
it's any direction-- then
if I want to do this lower
297
00:17:08,270 --> 00:17:11,819
dimensional projection, then
I have to understand how those
298
00:17:11,819 --> 00:17:14,272
Xi's project onto the
line that's spanned by v1,
299
00:17:14,272 --> 00:17:16,730
because this is all that I'm
going to be keeping at the end
300
00:17:16,730 --> 00:17:17,646
of the day about Xi's.
301
00:17:20,170 --> 00:17:23,200
So what we want is
to find the direction
302
00:17:23,200 --> 00:17:25,240
where those Xi's,
those projections,
303
00:17:25,240 --> 00:17:26,361
have a lot of variance.
304
00:17:26,361 --> 00:17:28,569
And we know that the variance
of Xi on this direction
305
00:17:28,569 --> 00:17:30,490
is actually exactly
given by lambda 1.
306
00:17:36,890 --> 00:17:40,490
Sorry, that's the
empirical var--
307
00:17:40,490 --> 00:17:42,480
yeah, I should
call variance hat.
308
00:17:42,480 --> 00:17:43,730
That's the empirical variance.
309
00:17:43,730 --> 00:17:45,063
Everything is in empirical here.
310
00:17:45,063 --> 00:17:48,680
We're talking about the
empirical covariance matrix.
311
00:17:48,680 --> 00:17:54,150
And so I also have that lambda
2 is the empirical variance
312
00:17:54,150 --> 00:17:59,160
of when I project Xi onto
v2, which is the second one,
313
00:17:59,160 --> 00:18:00,600
just for exactly this reason.
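A quick numerical sanity check of this point (synthetic data, with made-up per-coordinate spreads): the empirical variance of the projections vj transpose Xi should equal lambda j for every j:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 3
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 0.5])  # unequal spreads
X = X - X.mean(axis=0)             # center, so X bar = 0

S = X.T @ X / n
lam, P = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]      # decreasing: lambda_1 >= ... >= lambda_d
lam, P = lam[order], P[:, order]

# The empirical variance of the projections v_j^T X_i is lambda_j.
for j in range(d):
    proj = X @ P[:, j]                       # v_j^T X_i for each i
    assert np.isclose(proj.var(), lam[j])    # population variance (ddof=0)
```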
314
00:18:07,474 --> 00:18:08,456
Any question?
315
00:18:14,170 --> 00:18:16,830
So lambda j's are going
to be important for us.
316
00:18:16,830 --> 00:18:19,320
Lambda j measures the
spread of the points
317
00:18:19,320 --> 00:18:22,530
when I project them onto a
line which is a one dimensional
318
00:18:22,530 --> 00:18:23,259
space.
319
00:18:23,259 --> 00:18:25,800
And so I'm going to have-- let's
say I want to pick only one,
320
00:18:25,800 --> 00:18:28,133
I'm going to have to find the
one dimensional space that
321
00:18:28,133 --> 00:18:29,690
carries the most variance.
322
00:18:29,690 --> 00:18:32,070
And I claim that
v1 is the one that
323
00:18:32,070 --> 00:18:35,280
actually maximizes the spread.
324
00:18:35,280 --> 00:18:55,900
So the claim-- so for
any direction, u in Rd--
325
00:18:55,900 --> 00:18:59,380
and by direction, I really
just mean that the norm of u
326
00:18:59,380 --> 00:19:00,920
is equal to 1.
327
00:19:00,920 --> 00:19:02,020
I need to play fair--
328
00:19:02,020 --> 00:19:04,690
I'm going to compare myself to
other things of length one,
329
00:19:04,690 --> 00:19:07,600
so I need to play fair and
look at directions of length 1.
330
00:19:07,600 --> 00:19:16,321
Now, if I'm interested
in the empirical variance
331
00:19:16,321 --> 00:19:20,875
of X1 transpose--
332
00:19:20,875 --> 00:19:29,150
sorry, u transpose X1 through u
transpose Xn, then this thing
333
00:19:29,150 --> 00:19:37,950
is maximized for
u equals v1, where
334
00:19:37,950 --> 00:19:40,610
v1 is the eigenvector
associated to lambda 1
335
00:19:40,610 --> 00:19:42,110
and lambda 1 is not
just any eigenvalue,
336
00:19:42,110 --> 00:19:45,090
it's the largest of all those.
337
00:19:45,090 --> 00:19:46,992
So it's the largest eigenvalue.
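This claim can be probed numerically (synthetic, illustrative data): no unit direction u should give empirical variance above lambda 1, and u = v1 attains it:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated cloud
X = X - X.mean(axis=0)
S = X.T @ X / n

lam, P = np.linalg.eigh(S)
v1, lam1 = P[:, -1], lam[-1]       # eigh is ascending: last one is largest

# Empirical variance along unit u is u^T S u <= lambda_1, equality at v1.
assert np.isclose(v1 @ S @ v1, lam1)
for _ in range(100):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)         # a random unit direction
    assert u @ S @ u <= lam1 + 1e-10
```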
338
00:19:50,607 --> 00:19:51,440
So why is that true?
339
00:19:55,410 --> 00:20:00,840
Well, there's also a claim
that for any direction u--
340
00:20:00,840 --> 00:20:03,380
so that's 1 and 2--
341
00:20:03,380 --> 00:20:08,990
the variance of u
transpose X-- now,
342
00:20:08,990 --> 00:20:11,900
this is just a random variable,
and I'm looking about the true
343
00:20:11,900 --> 00:20:13,040
variance--
344
00:20:13,040 --> 00:20:27,440
this is maximized for u
equals, let's call it w1,
345
00:20:27,440 --> 00:20:38,320
where w1 is the
eigenvector of sigma--
346
00:20:38,320 --> 00:20:40,204
Now, I'm talking about
the true variance.
347
00:20:40,204 --> 00:20:42,620
Whereas, here, I was talking
about the empirical variance.
348
00:20:42,620 --> 00:20:44,950
So the true variance
is maximized by the eigenvector
349
00:20:44,950 --> 00:20:55,630
of the true sigma
associated to the largest
350
00:20:55,630 --> 00:20:59,554
eigenvalue of sigma.
351
00:21:02,870 --> 00:21:04,270
So I did not give it a name.
352
00:21:04,270 --> 00:21:06,285
Here, that was lambda 1
for the empirical one.
353
00:21:06,285 --> 00:21:07,660
For the true one,
you can give it
354
00:21:07,660 --> 00:21:10,330
another name, mu 1 if you want.
355
00:21:10,330 --> 00:21:13,407
But that's just the same thing.
356
00:21:13,407 --> 00:21:15,490
All it's saying is like,
wherever I see empirical,
357
00:21:15,490 --> 00:21:16,156
I can remove it.
358
00:21:27,690 --> 00:21:29,815
So why is this claim true?
359
00:21:29,815 --> 00:21:31,815
Well, let's look at the
second one, for example.
360
00:21:38,180 --> 00:21:44,480
So what is the variance
of u transpose X?
361
00:21:44,480 --> 00:21:47,570
So that's what I want to know.
362
00:21:47,570 --> 00:21:54,850
So that's the expectation--
so let's assume that the mean of X is 0,
363
00:21:54,850 --> 00:21:56,711
again, for the same
reasons as before.
364
00:21:56,711 --> 00:21:57,710
So what is the variance?
365
00:21:57,710 --> 00:21:59,410
It's just the expectation
of the square.
366
00:22:06,460 --> 00:22:08,260
I don't need to remove
the expectation.
367
00:22:08,260 --> 00:22:10,870
And the expectation
of the square is just
368
00:22:10,870 --> 00:22:12,700
the expectation
of u transpose X.
369
00:22:12,700 --> 00:22:15,250
And then I'm going to write
the other one X transpose u.
370
00:22:19,510 --> 00:22:22,360
And we know that this
is deterministic.
371
00:22:22,360 --> 00:22:25,570
So I'm just going to take
that this is just u transpose
372
00:22:25,570 --> 00:22:31,995
expectation of X X transpose u.
373
00:22:31,995 --> 00:22:32,870
And what is this guy?
374
00:22:39,305 --> 00:22:40,760
That's covariance sigma.
375
00:22:40,760 --> 00:22:41,870
That's just what sigma is.
376
00:22:44,730 --> 00:22:48,590
So the variance I can write
as u transpose sigma u.
377
00:22:48,590 --> 00:22:51,272
We've made this
computation before.
378
00:22:51,272 --> 00:22:53,730
And now what I want to claim
is that this thing is actually
379
00:22:53,730 --> 00:22:57,275
less than the largest
eigenvalue, which I actually
380
00:22:57,275 --> 00:22:58,150
called lambda 1 here.
381
00:22:58,150 --> 00:22:59,680
I should probably not.
382
00:22:59,680 --> 00:23:01,100
And the P is-- well, OK.
383
00:23:06,430 --> 00:23:11,260
Let's just pretend
everything is not empirical.
384
00:23:11,260 --> 00:23:22,580
So now, I'm going to write
sigma as P lambda 1 to lambda d P
385
00:23:22,580 --> 00:23:23,180
transpose.
386
00:23:23,180 --> 00:23:25,010
That's just the
eigendecomposition,
387
00:23:25,010 --> 00:23:32,090
where I admittedly reuse the
same notation as I did for S.
388
00:23:32,090 --> 00:23:34,764
So I should really put
some primes everywhere,
389
00:23:34,764 --> 00:23:36,680
so you know those are
things that are actually
390
00:23:36,680 --> 00:23:38,630
different in practice.
391
00:23:38,630 --> 00:23:43,469
So this is just the
decomposition of sigma.
392
00:23:43,469 --> 00:23:44,510
You seem confused, Helen.
393
00:23:44,510 --> 00:23:47,570
You have a question?
394
00:23:47,570 --> 00:23:48,070
Yeah?
395
00:23:48,070 --> 00:23:53,830
AUDIENCE: What is-- when you
talked about the empirical data
396
00:23:53,830 --> 00:23:55,750
and--
397
00:23:55,750 --> 00:23:56,880
PHILIPPE RIGOLLET: So OK--
398
00:24:00,670 --> 00:24:02,801
so I can make
everything I'm saying,
399
00:24:02,801 --> 00:24:04,300
I can talk about
either the variance
400
00:24:04,300 --> 00:24:05,470
or the empirical variance.
401
00:24:05,470 --> 00:24:07,720
And you can just add the
word empirical in front of it
402
00:24:07,720 --> 00:24:08,680
whenever you want.
403
00:24:08,680 --> 00:24:09,910
The same thing works.
404
00:24:09,910 --> 00:24:13,120
But just for the sake of
removing the confusion,
405
00:24:13,120 --> 00:24:20,409
let's just do it again
with S. So I'm just
406
00:24:20,409 --> 00:24:21,950
going to do everything
with S. So I'm
407
00:24:21,950 --> 00:24:24,650
going to assume that
X bar is equal to 0.
408
00:24:24,650 --> 00:24:27,780
And here, I'm going to talk
about the empirical variance,
409
00:24:27,780 --> 00:24:31,530
which is just 1/n
sum from i equal 1
410
00:24:31,530 --> 00:24:35,272
to n of u transpose Xi squared.
411
00:24:35,272 --> 00:24:36,230
So it's the same thing.
412
00:24:36,230 --> 00:24:37,646
Everywhere you see
an expectation,
413
00:24:37,646 --> 00:24:39,110
you just put in average.
414
00:24:45,930 --> 00:24:50,850
And then I get 1/n
sum from i equal 1
415
00:24:50,850 --> 00:24:53,032
to n of Xi Xi transpose.
416
00:24:53,032 --> 00:24:54,490
And now, I'm going
to call this guy
417
00:24:54,490 --> 00:24:58,200
S, because that's what it is.
418
00:24:58,200 --> 00:24:59,994
So this is u transpose Su.
419
00:24:59,994 --> 00:25:02,410
But given that I could
just replace the expectation
420
00:25:02,410 --> 00:25:03,910
by averages everywhere,
you can tell
421
00:25:03,910 --> 00:25:06,590
that the thing is going to work
for either one or the other.
422
00:25:06,590 --> 00:25:08,491
So now, this thing
was actually-- so now,
423
00:25:08,491 --> 00:25:10,240
I don't have any problem
with my notation.
424
00:25:10,240 --> 00:25:14,310
This is actually the
decomposition of S.
425
00:25:14,310 --> 00:25:16,030
That's just the
spectral decomposition
426
00:25:16,030 --> 00:25:18,840
and it's to its eigenvalues.
427
00:25:18,840 --> 00:25:27,080
And so now, what I have is that
when I look at u transpose Su,
428
00:25:27,080 --> 00:25:34,920
this is actually equal
to P u transpose S Pu.
429
00:25:39,294 --> 00:25:40,500
OK.
430
00:25:40,500 --> 00:25:41,750
There's a transpose somewhere.
431
00:25:41,750 --> 00:25:42,416
That's this guy.
432
00:25:45,300 --> 00:25:46,161
And that's this guy.
433
00:25:57,057 --> 00:26:00,220
Now-- sorry, that's
not P, that's
434
00:26:00,220 --> 00:26:05,000
D. That's D, that's
this diagonal matrix.
435
00:26:10,269 --> 00:26:11,310
Let's look at this thing.
436
00:26:11,310 --> 00:26:15,810
And let's call P transpose
u, let's call it b.
437
00:26:15,810 --> 00:26:18,705
So that's also a vector in Rd.
438
00:26:18,705 --> 00:26:19,530
What is it?
439
00:26:19,530 --> 00:26:21,370
It's just, I take a
unit vector, and then
440
00:26:21,370 --> 00:26:23,020
I apply P transpose to it.
441
00:26:23,020 --> 00:26:25,740
So that's basically what
happens to a unit vector
442
00:26:25,740 --> 00:26:29,820
when I apply the same
change of basis that I did.
443
00:26:29,820 --> 00:26:34,650
So I'm just changing my
orthogonal system the same way
444
00:26:34,650 --> 00:26:36,360
I did for the other ones.
445
00:26:36,360 --> 00:26:38,940
So what's happening
when I write this?
446
00:26:38,940 --> 00:26:46,590
Well, now I have that u
transpose Su is b transpose Db.
447
00:26:46,590 --> 00:26:50,310
But now, doing b transpose
Db when D is diagonal
448
00:26:50,310 --> 00:26:52,690
and b is a vector is
a very simple thing.
449
00:26:52,690 --> 00:26:53,910
I can expand it.
450
00:26:53,910 --> 00:26:54,480
This is what?
451
00:26:54,480 --> 00:26:57,120
This is just the
sum from j equal 1
452
00:26:57,120 --> 00:27:01,650
to d of lambda j bj squared.
453
00:27:05,386 --> 00:27:08,947
So that's just like matrix
vector multiplication.
454
00:27:08,947 --> 00:27:11,280
And in particular, I know
that the largest of those guys
455
00:27:11,280 --> 00:27:14,010
is lambda 1 and those
guys are all non-negative.
456
00:27:14,010 --> 00:27:16,705
So this thing is actually
less than lambda 1 times
457
00:27:16,705 --> 00:27:20,430
the sum from j equal 1 to
d of lambda j squared--
458
00:27:23,330 --> 00:27:24,490
sorry, bj squared.
459
00:27:27,560 --> 00:27:34,010
And this is just the
norm of b squared.
460
00:27:34,010 --> 00:27:38,320
So if I want to prove what's on
the slide, all I need to check
461
00:27:38,320 --> 00:27:40,965
is that b has norm, which is--
462
00:27:40,965 --> 00:27:41,935
AUDIENCE: 1.
463
00:27:41,935 --> 00:27:43,910
PHILIPPE RIGOLLET: At most, 1.
464
00:27:43,910 --> 00:27:45,090
It's going to be at most 1.
465
00:27:45,090 --> 00:27:45,780
Why?
466
00:27:45,780 --> 00:27:51,690
Well, because b is really
just a change of basis for u.
467
00:27:51,690 --> 00:27:55,650
And so if I take a vector,
I'm just changing its basis.
468
00:27:55,650 --> 00:27:57,540
I'm certainly not
changing its length--
469
00:27:57,540 --> 00:27:59,580
think of a rotation,
and I can also flip it,
470
00:27:59,580 --> 00:28:00,790
but think of a rotation--
471
00:28:02,839 --> 00:28:05,380
well, actually, for vector, it's
just going to be a rotation.
472
00:28:05,380 --> 00:28:06,850
And so now, what
I have I just have
473
00:28:06,850 --> 00:28:11,970
to check that the norm of
b squared is equal to what?
474
00:28:11,970 --> 00:28:16,470
Well, it's equal to the norm
of P transpose u squared,
475
00:28:16,470 --> 00:28:21,620
which is equal to u
transpose P P transpose u.
476
00:28:21,620 --> 00:28:23,000
But P is orthogonal.
477
00:28:23,000 --> 00:28:26,210
So this thing is actually
just the identity.
478
00:28:26,210 --> 00:28:28,307
So that's just u
transpose u, which
479
00:28:28,307 --> 00:28:33,260
is equal to the norm u
squared, which is equal to 1,
480
00:28:33,260 --> 00:28:37,070
because I took u to have
norm 1 in the first place.
481
00:28:37,070 --> 00:28:39,640
And so this-- you're right--
was actually of norm equal to 1.
482
00:28:39,640 --> 00:28:42,017
I just needed to have
it less, but it's equal.
483
00:28:42,017 --> 00:28:44,350
And so what I'm left with is
that this thing is actually
484
00:28:44,350 --> 00:28:45,820
equal to lambda 1.
485
00:28:45,820 --> 00:28:50,030
So I know that for
every u that I pick--
486
00:28:50,030 --> 00:28:52,890
that has norm--
487
00:28:52,890 --> 00:28:55,030
So I'm just reminding
you that u here
488
00:28:55,030 --> 00:28:57,730
has norm squared equal to 1.
489
00:28:57,730 --> 00:29:00,760
For every u that I
pick, this u transpose
490
00:29:00,760 --> 00:29:02,890
Su is at most lambda 1.
491
00:29:06,400 --> 00:29:11,250
So the maximum of u transpose
Su is at most lambda 1.
492
00:29:11,250 --> 00:29:13,270
And we know that that's
the variance, that's
493
00:29:13,270 --> 00:29:15,790
the empirical variance,
when I project my points
494
00:29:15,790 --> 00:29:17,500
onto direction spanned by u.
495
00:29:20,240 --> 00:29:23,040
So now, I have an
empirical variance,
496
00:29:23,040 --> 00:29:24,650
which is at most lambda 1.
497
00:29:24,650 --> 00:29:28,457
But I also know that if I take u
to be something very specific--
498
00:29:28,457 --> 00:29:30,040
I mean, it was on
the previous board--
499
00:29:30,040 --> 00:29:32,510
if I take u to be
equal to v1, then
500
00:29:32,510 --> 00:29:35,270
this thing is actually
not an inequality,
501
00:29:35,270 --> 00:29:37,160
this is an equality.
502
00:29:37,160 --> 00:29:41,990
And the reason is, when I
actually take u to be v1,
503
00:29:41,990 --> 00:29:46,410
all of these bj's are going to
be 0, except for the one that's
504
00:29:46,410 --> 00:29:50,360
b1, which is itself equal to 1.
505
00:29:50,360 --> 00:29:52,190
So I mean, we can
briefly check this.
506
00:29:52,190 --> 00:29:53,738
But if I take v--
507
00:29:59,106 --> 00:30:07,100
if u is equal to v1, what
I have is that u transpose
508
00:30:07,100 --> 00:30:24,800
Su is equal to P transpose
v1 D P transpose v1.
509
00:30:24,800 --> 00:30:26,680
But what is P transpose v1?
510
00:30:26,680 --> 00:30:31,960
Well, remember P
transpose is just
511
00:30:31,960 --> 00:30:34,820
the matrix that has
vectors v1 transpose here,
512
00:30:34,820 --> 00:30:40,110
v2 transpose here, all the
way to vd transpose here.
513
00:30:40,110 --> 00:30:45,570
And we know that when I take
vj transpose vk, I get 0,
514
00:30:45,570 --> 00:30:46,680
if j is different from k.
515
00:30:46,680 --> 00:30:49,620
And if j is equal to k, I get 1.
516
00:30:49,620 --> 00:30:53,690
So P transpose v1
is equal to what?
517
00:31:05,040 --> 00:31:06,570
Take v1 here and multiply it.
518
00:31:06,570 --> 00:31:08,250
So the first coordinate
is going to be
519
00:31:08,250 --> 00:31:12,870
v1 transpose v1, which is 1.
520
00:31:12,870 --> 00:31:14,370
The second coordinate
is going to be
521
00:31:14,370 --> 00:31:19,030
v2 transpose v1, which is 0.
522
00:31:19,030 --> 00:31:22,740
And so I get 0's
all the way, right?
523
00:31:22,740 --> 00:31:25,470
So that means that this
thing here is really
524
00:31:25,470 --> 00:31:29,040
just the vector 1, 0, 0.
525
00:31:29,040 --> 00:31:32,220
And here, this is just
the vector 1, 0, 0.
526
00:31:32,220 --> 00:31:34,100
So when I multiply
it with this guy,
527
00:31:34,100 --> 00:31:37,980
I am only picking up
the top left element
528
00:31:37,980 --> 00:31:41,740
of D, which is lambda 1.
529
00:31:41,740 --> 00:31:44,940
So for every u,
it's at most lambda 1.
530
00:31:44,940 --> 00:31:46,950
And for v1, it's
equal to lambda 1,
531
00:31:46,950 --> 00:31:52,590
which means that it's
maximized for u equals v1.
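[This whole argument -- u transpose Su is at most lambda 1 for every unit u, with equality at v1 -- can be verified directly. A numpy sketch on synthetic data; note that numpy's eigh returns eigenvalues in ascending order, so the leading pair sits at the end.]

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
evals, evecs = np.linalg.eigh(S)
lambda1 = evals[-1]
v1 = evecs[:, -1]

# v1 attains the bound: v1^T S v1 = lambda_1.
assert np.isclose(v1 @ S @ v1, lambda1)

# And no random unit vector exceeds it.
for _ in range(1000):
    u = rng.normal(size=4)
    u /= np.linalg.norm(u)
    assert u @ S @ u <= lambda1 + 1e-12
```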
532
00:31:52,590 --> 00:31:54,480
And that's where
I said that this
533
00:31:54,480 --> 00:31:57,527
is the fanciest non-convex
problem we know how to solve.
534
00:31:57,527 --> 00:31:59,610
This was a problem that
was definitely non-convex.
535
00:31:59,610 --> 00:32:02,820
We were maximizing a convex
function over a sphere.
536
00:32:02,820 --> 00:32:06,156
But we know that v1,
which is something--
537
00:32:06,156 --> 00:32:07,530
I mean, of course,
you still have
538
00:32:07,530 --> 00:32:08,946
to believe me that
you can compute
539
00:32:08,946 --> 00:32:11,670
the spectral decomposition
efficiently--
540
00:32:11,670 --> 00:32:14,880
but essentially, if you've
taken linear algebra,
541
00:32:14,880 --> 00:32:17,020
you know that you can
diagonalize a matrix.
542
00:32:17,020 --> 00:32:19,797
And so you get that v1
is just the maximum.
543
00:32:19,797 --> 00:32:21,630
So you can find your
maximum just by looking
544
00:32:21,630 --> 00:32:24,109
at the spectral decomposition.
545
00:32:24,109 --> 00:32:25,650
You don't have to
do any optimization
546
00:32:25,650 --> 00:32:28,790
or anything like this.
547
00:32:28,790 --> 00:32:29,870
So let's recap.
548
00:32:29,870 --> 00:32:32,390
Where are we?
549
00:32:32,390 --> 00:32:34,160
We've established
that if I start
550
00:32:34,160 --> 00:32:37,820
with my empirical covariance
matrix, I can diagonalize it
551
00:32:37,820 --> 00:32:42,270
as P D P transpose.
552
00:32:42,270 --> 00:32:44,250
And then if I take the
eigenvector associated
553
00:32:44,250 --> 00:32:48,630
to the largest eigenvalues-- so
if I permute the columns of P
554
00:32:48,630 --> 00:32:50,810
and of D in such
a way that they
555
00:32:50,810 --> 00:32:53,520
are ordered from the
largest to the smallest when
556
00:32:53,520 --> 00:32:56,490
I look at the diagonal
elements of D,
557
00:32:56,490 --> 00:32:59,430
then if I pick the first
column of P, it's v1.
558
00:32:59,430 --> 00:33:04,750
And v1 is the direction on
which, if I project my points,
559
00:33:04,750 --> 00:33:08,090
they are going to carry the
most empirical variance.
560
00:33:08,090 --> 00:33:09,090
Well, that's a good way.
561
00:33:09,090 --> 00:33:13,064
If I told you,
pick one direction
562
00:33:13,064 --> 00:33:14,980
along which if you were
to project your points
563
00:33:14,980 --> 00:33:17,313
they would be as spread out
as possible, that's probably
564
00:33:17,313 --> 00:33:19,270
the one you would pick.
565
00:33:19,270 --> 00:33:22,160
And so that's exactly
what PCA is doing for us.
566
00:33:22,160 --> 00:33:28,780
It says, OK, if you ask me
to take d prime equal to 1,
567
00:33:28,780 --> 00:33:31,510
I will take v1.
568
00:33:31,510 --> 00:33:33,892
I will just take the direction
that's spanned by v1.
569
00:33:33,892 --> 00:33:36,100
And that's just when I come
back to this picture that
570
00:33:36,100 --> 00:33:43,750
was here before, this is v1.
571
00:33:43,750 --> 00:33:45,970
Of course, here, I
only have two of them.
572
00:33:45,970 --> 00:33:48,580
So v2 has to be this
guy, or this guy,
573
00:33:48,580 --> 00:33:49,940
or I mean or this thing.
574
00:33:49,940 --> 00:33:53,060
I mean, I only know
them up to sign.
575
00:33:53,060 --> 00:33:55,600
But then if I have three--
576
00:33:55,600 --> 00:33:58,090
think of like an olive
in three dimensions--
577
00:33:58,090 --> 00:34:00,550
then maybe I have one
direction that's slightly more
578
00:34:00,550 --> 00:34:02,180
elongated than the other one.
579
00:34:02,180 --> 00:34:04,480
And so I'm going to
pick the second one.
580
00:34:04,480 --> 00:34:07,330
And so the procedure is
to say, well, first, I'm
581
00:34:07,330 --> 00:34:11,194
going to pick v1 the same way
I pick v1 in the first place.
582
00:34:11,194 --> 00:34:12,610
So the first
direction I am taking
583
00:34:12,610 --> 00:34:14,620
is the leading eigenvector.
584
00:34:14,620 --> 00:34:18,199
And then I'm looking
for a direction.
585
00:34:18,199 --> 00:34:20,719
Well, if I found
one-- the one I'm
586
00:34:20,719 --> 00:34:23,239
going to want to find-- if you
say you can take d equal 2,
587
00:34:23,239 --> 00:34:24,949
you're going to need
the basis for this guy.
588
00:34:24,949 --> 00:34:27,240
So the second one has to be
orthogonal to the first one
589
00:34:27,240 --> 00:34:28,705
you've already picked.
590
00:34:28,705 --> 00:34:30,080
And so the second
one you pick is
591
00:34:30,080 --> 00:34:31,940
the one that's just,
among all those that
592
00:34:31,940 --> 00:34:36,529
are orthogonal to v1, maximizes
the empirical variance
593
00:34:36,529 --> 00:34:37,570
when you project onto it.
594
00:34:40,100 --> 00:34:44,000
And it turns out that this
is actually exactly v2.
595
00:34:44,000 --> 00:34:46,153
You don't have to
redo anything again.
596
00:34:46,153 --> 00:34:47,569
Your eigendecomposition,
this is
597
00:34:47,569 --> 00:34:54,690
just the second column
of P. Clearly, v2
598
00:34:54,690 --> 00:34:56,120
is orthogonal to v1.
599
00:34:56,120 --> 00:34:58,890
We just used it here.
600
00:34:58,890 --> 00:35:03,730
This 0 here just says this
v2 is orthogonal to v1.
601
00:35:03,730 --> 00:35:05,770
So they're like this.
602
00:35:05,770 --> 00:35:06,940
And now, what I said--
603
00:35:06,940 --> 00:35:08,530
what this slide
tells you extra--
604
00:35:08,530 --> 00:35:10,670
is that v2 among all
those directions that are
605
00:35:10,670 --> 00:35:11,170
orthogonal--
606
00:35:11,170 --> 00:35:13,610
I mean, there's still
d minus 1 of them--
607
00:35:13,610 --> 00:35:16,030
this is the one that
maximizes the, say,
608
00:35:16,030 --> 00:35:18,730
residual empirical
variance-- the one that
609
00:35:18,730 --> 00:35:21,950
was not explained by the first
direction that you picked.
610
00:35:21,950 --> 00:35:22,910
And you can check that.
611
00:35:22,910 --> 00:35:27,200
I mean, it's becoming a bit
more cumbersome to write down,
612
00:35:27,200 --> 00:35:28,760
but you can check that.
613
00:35:28,760 --> 00:35:32,130
If you're not convinced,
please raise your concern.
614
00:35:32,130 --> 00:35:38,641
I mean, basically, one
way to view this is--
615
00:35:38,641 --> 00:35:40,640
I mean, you're not really
dropping a coordinate,
616
00:35:40,640 --> 00:35:42,420
because v1 is not a coordinate.
617
00:35:42,420 --> 00:35:46,040
But let's assume actually for
simplicity that v1 was actually
618
00:35:46,040 --> 00:35:49,730
equal to e1, that the direction
that carries the most variance
619
00:35:49,730 --> 00:35:51,440
is the one that
just says, just look
620
00:35:51,440 --> 00:35:56,520
at the first coordinate of X.
So if that was the case, then
621
00:35:56,520 --> 00:35:58,380
clearly the orthogonal
directions are
622
00:35:58,380 --> 00:36:03,420
the ones that comprise only
of the coordinates 2 to d.
623
00:36:03,420 --> 00:36:05,670
So you could actually just
drop the first coordinate
624
00:36:05,670 --> 00:36:08,460
and do the same thing on
a slightly shorter vector
625
00:36:08,460 --> 00:36:10,129
of length d minus 1.
626
00:36:10,129 --> 00:36:12,420
And then you would just look
at the largest eigenvector
627
00:36:12,420 --> 00:36:14,530
of these guys, et
cetera, et cetera.
628
00:36:14,530 --> 00:36:16,230
So in a way, that's
what's happening,
629
00:36:16,230 --> 00:36:19,200
except that you rotate it
before you actually do this.
630
00:36:19,200 --> 00:36:22,260
And that's exactly
what's happening.
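[The "rotate, then drop the first coordinate" picture is exactly deflation: subtract the lambda 1 component from S and take the leading eigenvector of what's left. A numpy sketch on synthetic data -- up to sign, it recovers v2.]

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

evals, evecs = np.linalg.eigh(S)       # ascending order
v1, v2 = evecs[:, -1], evecs[:, -2]

# Deflate: remove the lambda_1 component of S, then take the
# leading eigenvector of the remainder.
S_deflated = S - evals[-1] * np.outer(v1, v1)
w = np.linalg.eigh(S_deflated)[1][:, -1]

# w matches v2 up to sign, as claimed.
assert np.isclose(abs(w @ v2), 1.0)
```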
631
00:36:22,260 --> 00:36:30,890
So what we put together here
is essentially three things.
632
00:36:30,890 --> 00:36:32,220
One was statistics.
633
00:36:32,220 --> 00:36:34,690
Statistics says, if
you want spread,
634
00:36:34,690 --> 00:36:39,230
if you want information, you
should be looking at variance.
635
00:36:39,230 --> 00:36:40,820
The second one was optimization.
636
00:36:40,820 --> 00:36:44,870
Optimization said, well, if you
want to maximize spread, well,
637
00:36:44,870 --> 00:36:48,260
you have to maximize variance
in a certain direction.
638
00:36:48,260 --> 00:36:51,920
And that means maximizing
over the sphere of vectors
639
00:36:51,920 --> 00:36:54,510
that have unit norm.
640
00:36:54,510 --> 00:36:56,720
And that's an optimization
problem, which actually
641
00:36:56,720 --> 00:36:58,310
turned out to be difficult.
642
00:36:58,310 --> 00:37:00,800
But then the third thing that
we use to solve this problem
643
00:37:00,800 --> 00:37:01,830
was linear algebra.
644
00:37:01,830 --> 00:37:03,410
Linear algebra
said, well, it looks
645
00:37:03,410 --> 00:37:05,450
like it's a difficult
optimization problem.
646
00:37:05,450 --> 00:37:08,410
But it turns out that the
answer comes in almost--
647
00:37:08,410 --> 00:37:11,210
I mean, it's not a closed form,
but those things are so used,
648
00:37:11,210 --> 00:37:12,590
that it's almost a closed form--
649
00:37:12,590 --> 00:37:17,240
says, just pick the
eigenvectors in order
650
00:37:17,240 --> 00:37:20,480
of their associated eigenvalues
from largest to smallest.
651
00:37:23,020 --> 00:37:24,940
And that's why principal
component analysis
652
00:37:24,940 --> 00:37:29,080
has been so popular and has
gained huge amount of traction
653
00:37:29,080 --> 00:37:33,760
since we had computers that were
allowed to compute eigenvalues
654
00:37:33,760 --> 00:37:37,429
and eigenvectors for
matrices of gigantic sizes.
655
00:37:37,429 --> 00:37:38,470
You can actually do that.
656
00:37:38,470 --> 00:37:39,760
If I give you--
657
00:37:39,760 --> 00:37:42,340
I don't know, this Google
video, for example,
658
00:37:42,340 --> 00:37:43,750
is talking about words.
659
00:37:43,750 --> 00:37:45,970
They want to do just the,
say, principal component
660
00:37:45,970 --> 00:37:47,380
analysis of words.
661
00:37:47,380 --> 00:37:50,230
So I give you all the
words in the dictionary.
662
00:37:50,230 --> 00:37:53,500
And-- sorry, well,
you would have
663
00:37:53,500 --> 00:37:55,090
to have a representation
for words,
664
00:37:55,090 --> 00:37:59,500
so it's a little more
difficult. But how do I do this?
665
00:38:03,980 --> 00:38:06,382
Let's say, for example,
pages of a book.
666
00:38:06,382 --> 00:38:08,090
I want to understand
the pages of a book.
667
00:38:08,090 --> 00:38:10,580
And I need to turn
it into a number.
668
00:38:10,580 --> 00:38:13,150
And a page of a book is
basically the word count.
669
00:38:13,150 --> 00:38:15,350
So I just count the number
of times "the" shows up,
670
00:38:15,350 --> 00:38:18,140
the number of times "and"
shows up, number of times "dog"
671
00:38:18,140 --> 00:38:19,100
shows up.
672
00:38:19,100 --> 00:38:20,934
And so that gives me a vector.
673
00:38:20,934 --> 00:38:22,225
It's in pretty high dimensions.
674
00:38:22,225 --> 00:38:25,350
It's as many dimensions as there
are words in the dictionary.
675
00:38:25,350 --> 00:38:28,310
And now, I want to visualize
how those pages get together--
676
00:38:28,310 --> 00:38:30,450
are two pages very
similar or not.
677
00:38:30,450 --> 00:38:32,630
And so what you would
do is essentially
678
00:38:32,630 --> 00:38:35,470
just compute the largest
eigenvector of this matrix--
679
00:38:35,470 --> 00:38:38,925
maybe the two largest-- and
then project this into a plane.
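[A minimal numpy sketch of the pages-of-a-book example: the pages, words, and counts here are made up, but the pipeline -- word counts, center, covariance, project onto the two leading eigenvectors -- is the one just described.]

```python
import numpy as np
from collections import Counter

# Hypothetical "pages": each page becomes a word-count vector,
# with one dimension per word in the vocabulary.
pages = [
    "the dog and the cat",
    "the dog chased the dog",
    "stocks and bonds and stocks",
    "bonds the stocks the bonds",
]
vocab = sorted({w for p in pages for w in p.split()})
X = np.array([[Counter(p.split())[w] for w in vocab] for p in pages], float)

# Center, form S, and project onto the two leading eigenvectors.
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)
evecs = np.linalg.eigh(S)[1]
P2 = evecs[:, -2:]          # top-2 directions as columns
Y = X @ P2                  # each page is now a point in the plane

assert Y.shape == (4, 2)
```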
680
00:38:38,925 --> 00:38:39,425
Yeah.
681
00:38:39,425 --> 00:38:41,325
AUDIENCE: Can we assume
the number of points
682
00:38:41,325 --> 00:38:43,060
was far larger
than the dimension?
683
00:38:43,060 --> 00:38:44,560
PHILIPPE RIGOLLET:
Yeah, but there's
684
00:38:44,560 --> 00:38:46,834
many pages in the world.
685
00:38:46,834 --> 00:38:48,500
There's probably more
pages in the world
686
00:38:48,500 --> 00:38:50,154
than there's words
in the dictionary.
687
00:38:54,960 --> 00:38:57,185
Yeah, so of course, if
you are in high dimensions
688
00:38:57,185 --> 00:38:58,560
and you don't have
enough points,
689
00:38:58,560 --> 00:39:00,240
it's going to be
clearly an issue.
690
00:39:00,240 --> 00:39:03,605
If you have two points,
then the leading eigenvector
691
00:39:03,605 --> 00:39:04,980
is going to be
just the line that
692
00:39:04,980 --> 00:39:06,879
goes through those
two points, regardless
693
00:39:06,879 --> 00:39:07,920
of what the dimension is.
694
00:39:07,920 --> 00:39:09,670
And clearly, you're
not learning anything.
695
00:39:13,850 --> 00:39:16,310
So you have to pick,
say, the k largest one.
696
00:39:16,310 --> 00:39:18,842
If you go all the way, you're
just reordering your thing,
697
00:39:18,842 --> 00:39:20,550
and you're not actually
gaining anything.
698
00:39:20,550 --> 00:39:22,130
You start from d
and you go to d.
699
00:39:22,130 --> 00:39:26,300
So at some point, this
procedure has to stop.
700
00:39:26,300 --> 00:39:28,960
And let's say it stops at k.
701
00:39:28,960 --> 00:39:31,360
Now, of course, you
should ask me a question,
702
00:39:31,360 --> 00:39:34,100
which is, how do you choose k?
703
00:39:34,100 --> 00:39:37,400
So that's, of course,
a natural question.
704
00:39:37,400 --> 00:39:41,360
Probably the basic answer
is just pick k equals 3,
705
00:39:41,360 --> 00:39:43,220
because you can
actually visualize it.
706
00:39:43,220 --> 00:39:47,906
But what happens if I
take k is equal to 4?
707
00:39:47,906 --> 00:39:51,860
If I take k equal
to 4, I'm not going
708
00:39:51,860 --> 00:39:54,070
to be able to plot points
in four dimensions.
709
00:39:54,070 --> 00:39:55,550
Well, I could, I
could add color,
710
00:39:55,550 --> 00:39:57,440
or I could try to be a
little smart about it.
711
00:39:57,440 --> 00:40:00,060
But it's actually
quite difficult.
712
00:40:00,060 --> 00:40:04,420
And so what people tend to do,
if you have four dimensions,
713
00:40:04,420 --> 00:40:06,850
they actually do a bunch
of two dimensional plots.
714
00:40:06,850 --> 00:40:08,920
And that's what a computer does--
a computer is not very good--
715
00:40:08,920 --> 00:40:10,750
I mean, by default,
they don't spit out
716
00:40:10,750 --> 00:40:12,380
three dimensional plots.
717
00:40:12,380 --> 00:40:15,024
So let's say they want to plot
only two dimensional things.
718
00:40:15,024 --> 00:40:17,440
So they're going to take the
first directions of, say, v1,
719
00:40:17,440 --> 00:40:18,586
v2.
720
00:40:18,586 --> 00:40:19,960
Let's say you have
three, but you
721
00:40:19,960 --> 00:40:21,760
want to have only two
dimensional plots.
722
00:40:21,760 --> 00:40:29,660
And then it's going to do
v1, v3; and then v2, v3.
723
00:40:29,660 --> 00:40:31,850
So really, you take
all three of them,
724
00:40:31,850 --> 00:40:35,240
but it's really just
showing you all choices
725
00:40:35,240 --> 00:40:37,340
of pairs of those guys.
726
00:40:37,340 --> 00:40:41,960
So if you were to
keep k is equal to 5,
727
00:40:41,960 --> 00:40:44,450
you would have 5 choose 2--
that is, 10-- different plots.
728
00:40:48,540 --> 00:40:51,930
So this is the actual
principal component algorithm,
729
00:40:51,930 --> 00:40:53,640
how it's implemented.
730
00:40:53,640 --> 00:40:55,000
And it's actually fairly simple.
731
00:40:55,000 --> 00:40:56,430
I mean, it looks like
there's lots of steps.
732
00:40:56,430 --> 00:40:58,600
But really, there's only
one that's important.
733
00:40:58,600 --> 00:40:59,850
So the first one is the input.
734
00:40:59,850 --> 00:41:04,860
I give you a bunch of points,
x1 to xn in d dimensions.
735
00:41:04,860 --> 00:41:07,680
And step two is, well, compute
their empirical covariance
736
00:41:07,680 --> 00:41:10,570
matrix S. The points themselves,
we don't really care.
737
00:41:10,570 --> 00:41:12,570
We care about their
empirical covariance matrix.
738
00:41:12,570 --> 00:41:14,530
So it's a d by d matrix.
739
00:41:14,530 --> 00:41:15,750
Now, I'm going to feed that.
740
00:41:15,750 --> 00:41:17,880
And that's where the actual
computation starts happening.
741
00:41:17,880 --> 00:41:19,796
I'm going to feed that
to something that knows
742
00:41:19,796 --> 00:41:21,090
how to diagonalize this matrix.
743
00:41:21,090 --> 00:41:23,220
And you have to
trust me, if I want
744
00:41:23,220 --> 00:41:25,770
to compute the k
largest eigenvalues
745
00:41:25,770 --> 00:41:27,960
and my matrix is
d by d, it's going
746
00:41:27,960 --> 00:41:32,730
to take me about k times
d squared operations.
747
00:41:32,730 --> 00:41:34,980
So if I want only three,
it's 3 times d squared,
748
00:41:34,980 --> 00:41:36,420
which is about--
749
00:41:36,420 --> 00:41:39,570
d squared is the time for me
it takes to just even read
750
00:41:39,570 --> 00:41:41,040
the matrix S.
751
00:41:41,040 --> 00:41:43,360
So that's not too bad.
752
00:41:43,360 --> 00:41:45,110
So what it's going to
spit out, of course,
753
00:41:45,110 --> 00:41:48,230
is the diagonal matrix
D. And those are nice,
754
00:41:48,230 --> 00:41:53,720
because they tell
me what
755
00:41:53,720 --> 00:41:56,210
is the order in which I should
be taking the columns of P.
756
00:41:56,210 --> 00:41:58,930
But what's really important
to me is v1 to vd,
757
00:41:58,930 --> 00:42:01,430
because those are going to be
the ones I'm going to be using
758
00:42:01,430 --> 00:42:04,250
to draw those plots.
759
00:42:04,250 --> 00:42:05,900
And now, I'm going
to say, OK, I need
760
00:42:05,900 --> 00:42:09,190
to actually choose some set k.
761
00:42:09,190 --> 00:42:11,630
And I'm going to basically
truncate and look
762
00:42:11,630 --> 00:42:16,380
only at the first
k columns of P.
763
00:42:16,380 --> 00:42:18,300
Once I have those
columns, what I
764
00:42:18,300 --> 00:42:20,820
want to do is to project
onto the linear span
765
00:42:20,820 --> 00:42:21,610
of those columns.
766
00:42:21,610 --> 00:42:23,340
And there's actually
a simple way
767
00:42:23,340 --> 00:42:26,940
to do this, which is just take
this matrix Pk, which is really
768
00:42:26,940 --> 00:42:29,460
the matrix of projection onto
the linear span of those k
769
00:42:29,460 --> 00:42:30,120
columns.
770
00:42:30,120 --> 00:42:32,160
And you just take Pk transpose.
771
00:42:32,160 --> 00:42:38,070
And then you apply this to
every single one of your points.
772
00:42:38,070 --> 00:42:42,000
Now Pk transpose, what is
the size of the matrix Pk?
773
00:42:46,410 --> 00:42:47,880
Yeah, [INAUDIBLE]?
774
00:42:47,880 --> 00:42:49,840
AUDIENCE: [INAUDIBLE]
775
00:42:49,840 --> 00:42:52,100
PHILIPPE RIGOLLET: So
Pk is just this matrix.
776
00:42:52,100 --> 00:42:54,601
I take the v1 and I stop at vk--
777
00:42:54,601 --> 00:42:55,100
well--
778
00:42:55,100 --> 00:42:57,656
AUDIENCE: [INAUDIBLE]
779
00:42:57,656 --> 00:42:59,030
PHILIPPE RIGOLLET:
d by k, right?
780
00:42:59,030 --> 00:43:01,290
Each of the column
is an eigenvector.
781
00:43:01,290 --> 00:43:02,840
It's of dimension d.
782
00:43:02,840 --> 00:43:05,730
I mean, that's a vector
in the original space.
783
00:43:05,730 --> 00:43:07,220
So I have this d by k matrix.
784
00:43:07,220 --> 00:43:11,360
So all it is is if I had my--
785
00:43:11,360 --> 00:43:13,970
well, I'm going to talk in
a second about Pk transpose.
786
00:43:13,970 --> 00:43:17,060
Pk transpose is
just this guy, where
787
00:43:17,060 --> 00:43:19,460
I stop at the k-th vector.
788
00:43:19,460 --> 00:43:22,370
So Pk transpose is k by d.
789
00:43:22,370 --> 00:43:26,825
So now, when I take Yi,
which is Pk transpose Xi,
790
00:43:26,825 --> 00:43:29,330
I end up with a point
which is in k dimensions.
791
00:43:29,330 --> 00:43:30,900
I have only k coordinates.
792
00:43:30,900 --> 00:43:33,350
So I took every single one
of my original points Xi,
793
00:43:33,350 --> 00:43:35,780
which had d coordinates, and
I turned it into a point that
794
00:43:35,780 --> 00:43:37,180
has only k coordinates.
795
00:43:37,180 --> 00:43:40,260
Particularly, I could
have k is equal to 2.
796
00:43:40,260 --> 00:43:42,820
This matrix is exactly
the one that projects.
797
00:43:42,820 --> 00:43:44,960
If you think about
it for one second,
798
00:43:44,960 --> 00:43:46,890
this is just the
matrix that says--
799
00:43:46,890 --> 00:43:48,610
well, we actually did
that several times.
800
00:43:48,610 --> 00:43:51,820
The matrix, so that
was this P transpose u
801
00:43:51,820 --> 00:43:53,470
that showed up somewhere.
802
00:43:53,470 --> 00:43:57,460
And so that's just
the matrix that
803
00:43:57,460 --> 00:44:01,030
takes your point X in,
say, three dimensions,
804
00:44:01,030 --> 00:44:04,750
and then just project it
down to two dimensions.
805
00:44:04,750 --> 00:44:09,220
And that's just-- it goes to the
closest point in the subspace.
806
00:44:09,220 --> 00:44:12,650
Now, here, the floor is flat.
807
00:44:12,650 --> 00:44:16,510
But we can pick any
subspace we want,
808
00:44:16,510 --> 00:44:18,310
depending on what
the lambdas are.
809
00:44:18,310 --> 00:44:19,930
So the lambdas were
important for us
810
00:44:19,930 --> 00:44:23,610
to be able to identify
which columns to pick.
811
00:44:23,610 --> 00:44:25,692
The fact that we assumed
that they were ordered
812
00:44:25,692 --> 00:44:27,400
tells us that we can
pick the first ones.
813
00:44:27,400 --> 00:44:28,500
If they were not
ordered, it would
814
00:44:28,500 --> 00:44:30,583
be just a subset of the
columns, depending on what
815
00:44:30,583 --> 00:44:32,550
the size of the eigenvalue is.
816
00:44:32,550 --> 00:44:36,509
So each column is labeled.
817
00:44:36,509 --> 00:44:38,800
And so then, of course, we
still have this question of,
818
00:44:38,800 --> 00:44:40,570
how do I pick k?
819
00:44:40,570 --> 00:44:42,760
So there's definitely the
matter of convenience.
820
00:44:42,760 --> 00:44:44,410
Maybe 2 is convenient.
821
00:44:44,410 --> 00:44:47,180
If it works for 2, you don't
have to go any farther.
822
00:44:47,180 --> 00:44:50,680
But you might want
to say, well--
823
00:44:50,680 --> 00:44:52,690
originally, I did
that to actually keep
824
00:44:52,690 --> 00:44:54,320
as much information as possible.
825
00:44:54,320 --> 00:44:56,230
I know that the
ultimate thing is
826
00:44:56,230 --> 00:44:58,515
to keep as much information,
which would be to k
827
00:44:58,515 --> 00:45:00,970
is equal d-- that's as much
information as you want.
828
00:45:00,970 --> 00:45:03,310
But it's essentially the
same question about, well,
829
00:45:03,310 --> 00:45:07,180
if I want to compress
a JPEG image,
830
00:45:07,180 --> 00:45:10,100
how much information should
I keep so it's still visible?
831
00:45:10,100 --> 00:45:11,840
And so there's some
rules for that.
832
00:45:11,840 --> 00:45:14,950
But none of them is
actually really a science.
833
00:45:14,950 --> 00:45:16,600
So it's really a
matter of what you
834
00:45:16,600 --> 00:45:18,250
think is actually tolerable.
835
00:45:18,250 --> 00:45:21,970
And we're just going to start
replacing this choice by maybe
836
00:45:21,970 --> 00:45:22,900
another parameter.
837
00:45:22,900 --> 00:45:26,440
So here, we're going to
basically replace k by alpha,
838
00:45:26,440 --> 00:45:29,360
and so we just do stuff.
839
00:45:29,360 --> 00:45:32,020
So the first one that
people do that is probably
840
00:45:32,020 --> 00:45:33,750
the most popular one--
841
00:45:33,750 --> 00:45:35,860
OK, the most popular
one is definitely
842
00:45:35,860 --> 00:45:39,190
take k is equal to 2
or 3, because it's just
843
00:45:39,190 --> 00:45:41,320
convenient to visualize.
844
00:45:41,320 --> 00:45:48,050
The second most popular
one is the scree plot.
845
00:45:48,050 --> 00:45:49,370
So the scree plot--
846
00:45:49,370 --> 00:45:54,180
remember, I have my
eigenvalues, lambda j's.
847
00:45:54,180 --> 00:45:57,670
And I've chosen the
lambda j's to decrease.
848
00:45:57,670 --> 00:45:59,380
So the indices are
chosen in such a way
849
00:45:59,380 --> 00:46:01,480
that lambda is a
decreasing function.
850
00:46:01,480 --> 00:46:04,332
So I have lambda 1, and
let's say it's this guy here.
851
00:46:04,332 --> 00:46:06,790
And then I have lambda 2, and
let's say it's this guy here.
852
00:46:06,790 --> 00:46:09,370
And then I have lambda 3, and
let's say it's this guy here,
853
00:46:09,370 --> 00:46:12,760
lambda 4, lambda 5, lambda 6.
854
00:46:12,760 --> 00:46:16,322
And all I care about is
that this thing decreases.
855
00:46:16,322 --> 00:46:19,580
The scree plot says
something like this--
856
00:46:19,580 --> 00:46:22,520
if there's an inflection point,
meaning that you can sort of do
857
00:46:22,520 --> 00:46:25,230
something like this and
then something like this,
858
00:46:25,230 --> 00:46:27,610
you should stop at 3.
859
00:46:27,610 --> 00:46:29,500
That's what the
scree plot tells you.
860
00:46:29,500 --> 00:46:34,590
What it's saying in a way
is that the percentage
861
00:46:34,590 --> 00:46:39,170
of the marginal
increment of explained
862
00:46:39,170 --> 00:46:41,990
variance that you get
starts to decrease after you
863
00:46:41,990 --> 00:46:43,555
pass this inflection point.
864
00:46:43,555 --> 00:46:45,840
So let's see why I say this.
865
00:46:45,840 --> 00:46:52,390
Well, here, what I
have-- so this ratio
866
00:46:52,390 --> 00:46:54,280
that you see there is
actually the percentage
867
00:46:54,280 --> 00:46:56,470
of explained variance.
868
00:46:56,470 --> 00:47:01,590
So what it means is that, if I
look at lambda 1 plus ... plus lambda k,
869
00:47:01,590 --> 00:47:08,260
and then I divide by lambda
1 plus ... plus lambda d, well,
870
00:47:08,260 --> 00:47:08,980
what is this?
871
00:47:08,980 --> 00:47:12,010
Well, this lambda
1 plus ... plus lambda d
872
00:47:12,010 --> 00:47:14,530
is the total amount of variance
that I get in my points.
873
00:47:14,530 --> 00:47:18,070
That's the trace of sigma.
874
00:47:18,070 --> 00:47:20,640
So that's the variance
in the first direction
875
00:47:20,640 --> 00:47:22,420
plus the variance in
the second direction
876
00:47:22,420 --> 00:47:24,280
plus the variance in
the third direction.
877
00:47:24,280 --> 00:47:26,571
That's basically all the
variance that I have possible.
878
00:47:28,900 --> 00:47:32,175
Now, this is the variance that
I kept in the first direction.
879
00:47:32,175 --> 00:47:34,550
This is the variance that I
kept in the second direction,
880
00:47:34,550 --> 00:47:37,190
all the way to the variance that
I kept in the k-th direction.
881
00:47:37,190 --> 00:47:41,800
So I know that this number is
always less than or equal to 1.
882
00:47:41,800 --> 00:47:43,540
And it's larger than 0.
883
00:47:43,540 --> 00:47:48,500
And this is just
the proportion, say,
884
00:47:48,500 --> 00:47:59,520
of variance explained
by v1 to vk,
885
00:47:59,520 --> 00:48:03,720
or simply, the proportion of
explained variance by my PCA,
886
00:48:03,720 --> 00:48:05,720
say.
887
00:48:05,720 --> 00:48:07,550
So now, what this
thing is telling me,
888
00:48:07,550 --> 00:48:09,860
it says, well, if
I look at this thing
889
00:48:09,860 --> 00:48:13,050
and I start seeing this
inflection point, it's saying,
890
00:48:13,050 --> 00:48:16,400
oh, here, you're gaining
a lot of variance.
891
00:48:16,400 --> 00:48:19,090
And then at some point,
you stop gaining a lot
892
00:48:19,090 --> 00:48:21,820
in your proportion of
explained variance.
893
00:48:21,820 --> 00:48:23,870
So this will
translate into something
894
00:48:23,870 --> 00:48:28,490
where when I look at this ratio,
lambda 1 plus ... plus lambda k divided
895
00:48:28,490 --> 00:48:31,490
by lambda 1 plus ... plus
lambda d, this would
896
00:48:31,490 --> 00:48:34,195
translate into a function
that would look like this.
897
00:48:34,195 --> 00:48:36,320
And what it's telling you,
it says, well, maybe you
898
00:48:36,320 --> 00:48:38,570
should stop here, because
here every time you add one,
899
00:48:38,570 --> 00:48:40,520
you don't get as much
as you did before.
900
00:48:40,520 --> 00:48:43,700
You actually get like
smaller marginal returns.
901
00:48:50,910 --> 00:48:56,630
So explained variance is
the numerator of this ratio.
902
00:48:56,630 --> 00:48:58,430
And the total variance
is the denominator.
903
00:48:58,430 --> 00:49:01,010
Those are pretty
straightforward terms
904
00:49:01,010 --> 00:49:03,320
that you would want
to use for this.
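The explained-variance curve and a crude elbow rule can be computed directly from the sorted eigenvalues. Below is a minimal sketch in numpy; the eigenvalues and the 0.05 cutoff for the marginal gain are invented for illustration, not values from the lecture.

```python
import numpy as np

# Hypothetical eigenvalues of a sample covariance matrix,
# sorted so that lambda_1 >= lambda_2 >= ... >= lambda_d.
lambdas = np.array([5.0, 3.0, 1.5, 0.2, 0.15, 0.1, 0.05])

# Proportion of explained variance for each k:
# (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_d).
# The denominator is the total variance, i.e. the trace of S.
explained = np.cumsum(lambdas) / lambdas.sum()

# A crude "elbow" rule for the scree plot: keep components until the
# marginal gain lambda_k / trace(S) drops below a chosen threshold.
marginal = lambdas / lambdas.sum()
k = int(np.argmax(marginal < 0.05))  # first index with a small marginal gain
```

Here `explained` is the curve sketched on the board, and `k` is where this particular elbow rule would stop.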
905
00:49:03,320 --> 00:49:06,620
So if your goal is to
do data visualization--
906
00:49:06,620 --> 00:49:10,100
so why would you
take k larger than 2?
907
00:49:10,100 --> 00:49:11,750
Let's say, if you
take k larger than 6,
908
00:49:11,750 --> 00:49:12,906
you can start to
imagine that you're
909
00:49:12,906 --> 00:49:15,364
going to have 6 choose 2 plots,
which starts to be annoying.
910
00:49:15,364 --> 00:49:16,850
And if you have k
is equal to 10--
911
00:49:16,850 --> 00:49:19,310
because you could start
in dimension 50,000--
912
00:49:19,310 --> 00:49:21,080
and then k equal to
10 would be the place
913
00:49:21,080 --> 00:49:22,780
where you have this thing
that's a lot of plots
914
00:49:22,780 --> 00:49:23,960
that you would have to show.
915
00:49:23,960 --> 00:49:26,900
So it's not always for
data visualization.
916
00:49:26,900 --> 00:49:29,540
Once I've actually
done this, I've
917
00:49:29,540 --> 00:49:32,460
actually effectively reduced
the dimension of my problem.
918
00:49:32,460 --> 00:49:34,230
And what I could do
with what I have is
919
00:49:34,230 --> 00:49:36,080
do a regression on those guys.
920
00:49:36,080 --> 00:49:39,010
The v1-- so I
forgot to tell you--
921
00:49:39,010 --> 00:49:41,460
why is that called principal
component analysis?
922
00:49:41,460 --> 00:49:46,910
Well, the vj's that
I keep, v1 to vk
923
00:49:46,910 --> 00:49:51,932
are called principal components.
924
00:49:59,020 --> 00:50:04,690
And they effectively act
as the summary of my Xi's.
925
00:50:04,690 --> 00:50:06,850
When I mentioned
image compression,
926
00:50:06,850 --> 00:50:10,840
I started with a point
Xi that was d numbers--
927
00:50:10,840 --> 00:50:12,604
let's say 50,000 numbers.
928
00:50:12,604 --> 00:50:14,020
And now, I'm saying,
actually, you
929
00:50:14,020 --> 00:50:16,270
can throw out those
50,000 numbers.
930
00:50:16,270 --> 00:50:19,390
If you actually know only
the k numbers that you need--
931
00:50:19,390 --> 00:50:20,860
the 6 numbers that you need--
932
00:50:20,860 --> 00:50:22,318
you're going to
have something that
933
00:50:22,318 --> 00:50:24,820
is pretty close to the
information you had.
934
00:50:24,820 --> 00:50:26,736
So in a way, there is
some form of compression
935
00:50:26,736 --> 00:50:27,810
that's going on here.
936
00:50:27,810 --> 00:50:31,150
And what you can do is that
those principal components,
937
00:50:31,150 --> 00:50:34,120
you can actually use
now for regression.
938
00:50:34,120 --> 00:50:39,130
If I want to regress
Y onto X that's
939
00:50:39,130 --> 00:50:41,862
very high dimensional,
before I do this,
940
00:50:41,862 --> 00:50:44,320
if I don't have enough points,
maybe what I can actually do
941
00:50:44,320 --> 00:50:47,780
is to do principal
component analysis
942
00:50:47,780 --> 00:50:49,510
on my
X's, replace them
943
00:50:49,510 --> 00:50:52,150
by those compressed versions,
and do linear regression
944
00:50:52,150 --> 00:50:53,020
on those guys.
945
00:50:53,020 --> 00:50:55,330
And that's called principal
component regression,
946
00:50:55,330 --> 00:50:56,039
not surprisingly.
947
00:50:56,039 --> 00:50:57,830
And that's something
that's pretty popular.
948
00:50:57,830 --> 00:51:00,086
And you can do it with k
equal to 10, for example.
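Principal component regression can be sketched in plain numpy: compress the X_i's to their top-k PCA scores, then run least squares on the scores. The data, dimensions, and noise level below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in R^d, with both the extra variance and the
# signal placed along the first coordinate (an artificial setup).
n, d, k = 200, 30, 5
X = rng.standard_normal((n, d))
X[:, 0] *= 3.0
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)

# Principal component regression:
# 1. center, 2. top-k eigenvectors of the sample covariance,
# 3. replace each X_i by its k scores, 4. least squares on the scores.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                     # sample covariance matrix
_, eigvecs = np.linalg.eigh(S)        # eigh returns ascending order
V = eigvecs[:, ::-1][:, :k]           # top-k principal components
scores = Xc @ V                       # the compressed X_i's
theta, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
y_hat = scores @ theta + y.mean()
mse = float(np.mean((y - y_hat) ** 2))
```

Because the first principal component picks up the high-variance first coordinate, the regression on 5 scores recovers most of the signal even though d is 30.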
949
00:51:03,020 --> 00:51:07,640
So for data visualization, I did
not find a Thanksgiving themed
950
00:51:07,640 --> 00:51:08,270
picture.
951
00:51:08,270 --> 00:51:11,960
But I found one that
has turkey in it.
952
00:51:11,960 --> 00:51:12,460
Get it?
953
00:51:15,310 --> 00:51:21,820
So this is actually a
gene data set that was--
954
00:51:21,820 --> 00:51:24,190
so when you see
something like this,
955
00:51:24,190 --> 00:51:27,056
you can imagine that someone
has been preprocessing
956
00:51:27,056 --> 00:51:28,180
the hell out of this thing.
957
00:51:28,180 --> 00:51:30,820
This is not like, oh, I
collect data on 23andMe
958
00:51:30,820 --> 00:51:32,670
and I'm just going
to run PCA on this.
959
00:51:32,670 --> 00:51:34,730
It just doesn't
happen like that.
960
00:51:34,730 --> 00:51:38,740
And so what happened is that--
so let's assume that this was
961
00:51:38,740 --> 00:51:41,560
a bunch of preprocessed data,
which are gene expression
962
00:51:41,560 --> 00:51:42,550
levels--
963
00:51:42,550 --> 00:51:47,650
so 500,000 genes
among 1,400 Europeans.
964
00:51:47,650 --> 00:51:50,260
So here, I actually
have fewer observations
965
00:51:50,260 --> 00:51:52,180
than I have variables.
966
00:51:52,180 --> 00:51:54,880
And that's when you use
principal component regression
967
00:51:54,880 --> 00:51:57,460
most of the time, so
it doesn't stop you.
968
00:51:57,460 --> 00:52:01,480
And then what you do is you say,
OK, I have those 500,000 genes
969
00:52:01,480 --> 00:52:03,640
among--
970
00:52:03,640 --> 00:52:06,760
so here, that means that
there's 1,400 points here.
971
00:52:06,760 --> 00:52:09,760
And I actually take
those 500,000 directions.
972
00:52:09,760 --> 00:52:13,347
So each person has a vector
of, say, 500,000 genes
973
00:52:13,347 --> 00:52:14,430
that are attached to them.
974
00:52:14,430 --> 00:52:17,020
And I project them onto
two dimensions, which
975
00:52:17,020 --> 00:52:19,380
should be extremely lossy.
976
00:52:19,380 --> 00:52:21,040
I lose a lot of information.
977
00:52:21,040 --> 00:52:24,790
And indeed, I do, because
I'm one of these guys.
978
00:52:24,790 --> 00:52:27,350
And I'm pretty sure I'm very
different from this guy,
979
00:52:27,350 --> 00:52:30,070
even though probably from
an American perspective,
980
00:52:30,070 --> 00:52:31,970
we're all the same.
981
00:52:31,970 --> 00:52:35,690
But I think we have like
slightly different genomes.
982
00:52:35,690 --> 00:52:39,220
And so the thing is
now we have this--
983
00:52:39,220 --> 00:52:41,980
so you see there's lots of
Swiss that participate in this.
984
00:52:41,980 --> 00:52:43,900
But actually, those two
principal components
985
00:52:43,900 --> 00:52:46,210
recover sort of
the map of Europe.
986
00:52:46,210 --> 00:52:50,169
I mean, OK, again, this is
actually maybe fine-grained
987
00:52:50,169 --> 00:52:50,710
for you guys.
988
00:52:50,710 --> 00:52:52,810
But right here, there's
Portugal and Spain,
989
00:52:52,810 --> 00:52:54,430
which are those colors.
990
00:52:54,430 --> 00:52:55,450
So here is color-coded.
991
00:52:55,450 --> 00:52:58,510
And here is Turkey, of
course, which we know
992
00:52:58,510 --> 00:53:02,230
has very different genomes.
993
00:53:02,230 --> 00:53:04,850
So Turks are very
at the boundary.
994
00:53:04,850 --> 00:53:06,100
So you can see all the greens.
995
00:53:06,100 --> 00:53:08,560
They stay very far apart
from everything else.
996
00:53:08,560 --> 00:53:11,080
And then the rest
here is pretty mixed.
997
00:53:11,080 --> 00:53:13,430
But it sort of recovers--
if you look at the colors,
998
00:53:13,430 --> 00:53:14,500
it sort of recovers that.
999
00:53:14,500 --> 00:53:16,390
So in a way, those two
principal components
1000
00:53:16,390 --> 00:53:18,050
are just the geographic feature.
1001
00:53:18,050 --> 00:53:25,570
So if you insist to compress
all the genomic information
1002
00:53:25,570 --> 00:53:28,330
of these people into two
numbers, what you're actually
1003
00:53:28,330 --> 00:53:31,320
going to get is
longitude and latitude,
1004
00:53:31,320 --> 00:53:35,550
which is somewhat
surprising, but not
1005
00:53:35,550 --> 00:53:37,740
so much if you think that's
it's been preprocessed.
1006
00:53:43,120 --> 00:53:47,530
So what do you do
beyond practice?
1007
00:53:47,530 --> 00:53:50,780
Well, you could try to
actually study those things.
1008
00:53:50,780 --> 00:53:52,330
If you think about
it for a second,
1009
00:53:52,330 --> 00:53:54,880
we did not do any statistics.
1010
00:53:54,880 --> 00:53:57,460
I talked to you about
IID observations,
1011
00:53:57,460 --> 00:53:59,950
but we never used the fact
that they were independent.
1012
00:53:59,950 --> 00:54:01,491
The way we typically
use independence
1013
00:54:01,491 --> 00:54:04,270
is to have central
limit theorem, maybe.
1014
00:54:04,270 --> 00:54:06,640
I mentioned the fact that
the covariance, in the case of the
1015
00:54:06,640 --> 00:54:09,520
Gaussian would actually give me
something which is independent.
1016
00:54:09,520 --> 00:54:10,870
We didn't care.
1017
00:54:10,870 --> 00:54:16,280
This was a data analysis, data
mining process that we did.
1018
00:54:16,280 --> 00:54:19,280
I give you points, and you just
put them through the crank.
1019
00:54:19,280 --> 00:54:21,350
There was an algorithm
in six steps.
1020
00:54:21,350 --> 00:54:23,750
And you just put it through
and that's what you got.
1021
00:54:23,750 --> 00:54:26,940
Now, of course, there's some
work which studies says, OK,
1022
00:54:26,940 --> 00:54:30,440
if my data is actually generated
from some process-- maybe,
1023
00:54:30,440 --> 00:54:33,050
my points are multivariate
Gaussian with some structure
1024
00:54:33,050 --> 00:54:34,520
on the covariance--
1025
00:54:34,520 --> 00:54:37,010
how well am I recovering
the covariance structure?
1026
00:54:37,010 --> 00:54:38,990
And that's where
statistics kicks in.
1027
00:54:38,990 --> 00:54:41,390
And that's where we stop.
1028
00:54:41,390 --> 00:54:44,730
So this is actually a bit
more difficult to study.
1029
00:54:44,730 --> 00:54:48,250
But in a way, it's not
entirely satisfactory,
1030
00:54:48,250 --> 00:54:50,320
because we could work
for a couple of boards
1031
00:54:50,320 --> 00:54:53,470
and I would just basically
sort of reverse engineer this
1032
00:54:53,470 --> 00:54:57,457
and find some models under which
it's a good idea to do that.
1033
00:54:57,457 --> 00:54:58,540
And what are those models?
1034
00:54:58,540 --> 00:55:01,450
Well, those are the models
that sort of give you
1035
00:55:01,450 --> 00:55:03,911
sort of prominent directions
that you want to find.
1036
00:55:03,911 --> 00:55:06,160
And it will say, yes, if you
have enough observations,
1037
00:55:06,160 --> 00:55:08,260
you will find those
directions along which
1038
00:55:08,260 --> 00:55:10,150
your data is elongated.
1039
00:55:10,150 --> 00:55:14,890
So that's essentially
what you want to do.
1040
00:55:14,890 --> 00:55:20,660
So that's exactly what
this thing is telling you.
1041
00:55:20,660 --> 00:55:23,010
So where does the
statistics come in?
1042
00:55:23,010 --> 00:55:26,020
Well, everything, remember--
so actually that's
1043
00:55:26,020 --> 00:55:28,490
where Alana was confused--
the idea was to say, well,
1044
00:55:28,490 --> 00:55:32,590
if I have a true
covariance matrix sigma
1045
00:55:32,590 --> 00:55:34,540
and I never really
have access to it,
1046
00:55:34,540 --> 00:55:38,870
I'm just running PCA on the
empirical covariance matrix,
1047
00:55:38,870 --> 00:55:41,380
how do those results relate?
1048
00:55:41,380 --> 00:55:44,270
And this is something
that you can study.
1049
00:55:44,270 --> 00:55:47,530
So for example, if
n goes to infinity
1050
00:55:47,530 --> 00:55:55,840
and d, your
dimension, is fixed,
1051
00:55:55,840 --> 00:56:00,370
then S goes to sigma
in any sense you want.
1052
00:56:00,370 --> 00:56:02,860
Maybe each entry is going
to each entry of sigma,
1053
00:56:02,860 --> 00:56:03,730
for example.
1054
00:56:03,730 --> 00:56:04,840
So S is a good estimator.
1055
00:56:04,840 --> 00:56:06,381
We know that the
empirical covariance
1056
00:56:06,381 --> 00:56:07,600
is a consistent estimator.
1057
00:56:07,600 --> 00:56:10,230
And if d is fixed, this
is actually not an issue.
1058
00:56:10,230 --> 00:56:14,450
So in particular, if you run
PCA on the sample covariance
1059
00:56:14,450 --> 00:56:16,150
matrix, you look
at, say, v1, then
1060
00:56:16,150 --> 00:56:20,140
v1 is going to converge to the
largest eigenvector of sigma
1061
00:56:20,140 --> 00:56:23,990
as n goes to infinity,
but for d fixed.
1062
00:56:23,990 --> 00:56:27,960
And that's a story that
we've known since the '60s.
1063
00:56:27,960 --> 00:56:30,906
More recently, people have
started challenging this.
1064
00:56:30,906 --> 00:56:33,030
Because what's happening
when you fix the dimension
1065
00:56:33,030 --> 00:56:35,310
and let the sample
size go to infinity,
1066
00:56:35,310 --> 00:56:38,961
you're certainly not
allowing for this.
1067
00:56:38,961 --> 00:56:41,460
It's certainly not explaining
to you anything about the fact
1068
00:56:41,460 --> 00:56:44,512
when d is equal to 500,000
and n is equal to 1,400.
1069
00:56:44,512 --> 00:56:46,470
Because when d is fixed
and n goes to infinity,
1070
00:56:46,470 --> 00:56:48,660
in particular, n is
much larger than d,
1071
00:56:48,660 --> 00:56:50,280
which is not the case here.
1072
00:56:50,280 --> 00:56:53,610
And so when n is much larger
than d, things go well.
1073
00:56:53,610 --> 00:56:57,430
But if d is not much less than n,
it's not clear what happens.
1074
00:56:57,430 --> 00:57:01,540
And particularly, if d is of the
order of n, what's happening?
1075
00:57:01,540 --> 00:57:04,320
So there's an entire theory
in mathematics that's called
1076
00:57:04,320 --> 00:57:07,890
random matrix theory that
studies the behavior of exactly
1077
00:57:07,890 --> 00:57:10,770
this question-- what is the
behavior of the spectrum--
1078
00:57:10,770 --> 00:57:13,020
the eigenvalues
and eigenvectors--
1079
00:57:13,020 --> 00:57:16,470
of a matrix in which I put
random numbers and I let--
1080
00:57:16,470 --> 00:57:19,710
so the matrix I'm interested
in here is the matrix of X's.
1081
00:57:19,710 --> 00:57:21,830
When I stack all my
X's next to each other,
1082
00:57:21,830 --> 00:57:26,940
so that's a matrix of size,
say, d by n, so each column
1083
00:57:26,940 --> 00:57:28,890
is of size d, it's one person.
1084
00:57:28,890 --> 00:57:29,880
And so I put them.
1085
00:57:29,880 --> 00:57:31,790
And when I let the
matrix go to infinity,
1086
00:57:31,790 --> 00:57:33,920
I let both d and n go to infinity.
1087
00:57:33,920 --> 00:57:37,260
But I want the aspect ratio,
d/n, to go to some constant.
1088
00:57:37,260 --> 00:57:38,940
That's what they do.
1089
00:57:38,940 --> 00:57:41,730
And what's nice is that in the
end, you have this constant--
1090
00:57:41,730 --> 00:57:42,840
let's call it gamma--
1091
00:57:42,840 --> 00:57:44,550
that shows up in
all the asymptotics.
1092
00:57:44,550 --> 00:57:46,680
And then you can
replace it by d/n.
1093
00:57:46,680 --> 00:57:50,520
And you know that you still have
a handle of both the dimension
1094
00:57:50,520 --> 00:57:51,360
and the sample size.
1095
00:57:51,360 --> 00:57:54,020
Whereas, usually the dimension
goes away, as you let n
1096
00:57:54,020 --> 00:57:57,370
go to infinity without having
dimension going to infinity.
1097
00:57:57,370 --> 00:57:59,400
And so now, when
this happens, as soon
1098
00:57:59,400 --> 00:58:01,920
as d/n goes to a
constant, you can
1099
00:58:01,920 --> 00:58:07,380
show that essentially there's
an angle between the largest
1100
00:58:07,380 --> 00:58:14,460
eigenvector of sigma and the
largest eigenvector of S, as n
1101
00:58:14,460 --> 00:58:15,460
and d go to infinity.
1102
00:58:15,460 --> 00:58:17,251
There is always an
angle-- you can actually
1103
00:58:17,251 --> 00:58:18,930
write it explicitly.
1104
00:58:18,930 --> 00:58:22,240
And it's an angle that
depends on this ratio, gamma--
1105
00:58:22,240 --> 00:58:24,840
the asymptotic ratio of d/n.
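This angle is easy to see in simulation. The sketch below samples from a spiked covariance (the identity plus a rank-one spike along e1, with invented sizes) and compares the leading eigenvector of S to the true one for a small and a large aspect ratio d/n; it illustrates the phenomenon, not the exact random-matrix formula.

```python
import numpy as np

rng = np.random.default_rng(1)

def leading_angle(n, d, spike=2.0):
    # Sample n points from N(0, Sigma) with Sigma = I + spike * e1 e1^T,
    # then return the angle in degrees between the leading eigenvector
    # of the empirical covariance S and the true one, e1.
    scale = np.ones(d)
    scale[0] = np.sqrt(1.0 + spike)
    X = rng.standard_normal((n, d)) * scale
    S = X.T @ X / n
    _, vecs = np.linalg.eigh(S)       # ascending order
    v1 = vecs[:, -1]                  # leading empirical eigenvector
    cos = min(abs(float(v1[0])), 1.0) # |<v1, e1>|, since e1 is the truth
    return float(np.degrees(np.arccos(cos)))

small_gamma = leading_angle(n=2000, d=20)   # gamma = d/n = 0.01
large_gamma = leading_angle(n=200, d=400)   # gamma = d/n = 2
```

With a small aspect ratio the empirical eigenvector is nearly aligned with the truth; with gamma around 2 the angle is substantial even though n is the same order as d.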
1106
00:58:24,840 --> 00:58:29,392
And so there's been a lot of
understanding how to correct,
1107
00:58:29,392 --> 00:58:30,600
how to pay attention to this.
1108
00:58:30,600 --> 00:58:34,320
This creates some biases that
were sort of overlooked before.
1109
00:58:34,320 --> 00:58:37,470
In particular, when
I do this, this
1110
00:58:37,470 --> 00:58:40,490
is not the proportion
of explained variance,
1111
00:58:40,490 --> 00:58:42,940
when n and d are similar.
1112
00:58:42,940 --> 00:58:44,940
This is an estimated
number computed from S.
1113
00:58:44,940 --> 00:58:48,030
This is computed from S. All
these guys are computed from S.
1114
00:58:48,030 --> 00:58:49,830
So those are
actually not exactly
1115
00:58:49,830 --> 00:58:51,060
where you want them to be.
1116
00:58:51,060 --> 00:58:54,510
And there's some nice work that
allows you to recalibrate what
1117
00:58:54,510 --> 00:58:57,626
this ratio should be, how
this ratio should be computed,
1118
00:58:57,626 --> 00:58:59,250
so it's a better
representative of what
1119
00:58:59,250 --> 00:59:04,680
the proportion of explained
variance actually is.
1120
00:59:04,680 --> 00:59:07,470
So then, of course,
there's the question
1121
00:59:07,470 --> 00:59:09,870
of-- so that's when d/n
goes to some constant.
1122
00:59:09,870 --> 00:59:12,105
So the best case--
so that was the '60s--
1123
00:59:12,105 --> 00:59:15,040
d is fixed and n is
much larger than d.
1124
00:59:15,040 --> 00:59:18,310
And then random matrix theory
tells you, well, d and n
1125
00:59:18,310 --> 00:59:20,680
are sort of the same
order of magnitude.
1126
00:59:20,680 --> 00:59:23,620
When they go to infinity, the
ratio goes to some constant.
1127
00:59:23,620 --> 00:59:25,270
Think of it as being order 1.
1128
00:59:25,270 --> 00:59:30,440
To be fair, if d is 100 times
larger than n, it still works.
1129
00:59:30,440 --> 00:59:32,440
And it depends on
what you think what
1130
00:59:32,440 --> 00:59:33,910
the infinity is at this point.
1131
00:59:33,910 --> 00:59:37,880
But I think the random matrix
theory results are very useful.
1132
00:59:37,880 --> 00:59:39,880
But then even in
this case, I told you
1133
00:59:39,880 --> 00:59:42,460
that the leading
eigenvector of S
1134
00:59:42,460 --> 00:59:48,812
is actually at an angle from the
leading eigenvector of--
1135
00:59:48,812 --> 00:59:50,020
So what's happening is that--
1136
00:59:56,970 --> 01:00:01,320
so let's say that d/n
goes to some gamma.
1137
01:00:01,320 --> 01:00:04,470
And what I claim is
that, if you look at--
1138
01:00:04,470 --> 01:00:09,130
so that's v1, that's the v1 of
S. And then there's the v1 of--
1139
01:00:09,130 --> 01:00:11,760
so this should be of size 1.
1140
01:00:11,760 --> 01:00:13,096
So that's the v1 of sigma.
1141
01:00:13,096 --> 01:00:15,220
Then those things are going
to have an angle, which
1142
01:00:15,220 --> 01:00:16,629
is some function of gamma.
1143
01:00:16,629 --> 01:00:18,670
It's complicated, but
there's a function of gamma
1144
01:00:18,670 --> 01:00:19,628
that you can see there.
1145
01:00:19,628 --> 01:00:21,830
And there's some models.
1146
01:00:21,830 --> 01:00:24,620
When gamma goes
to infinity, which
1147
01:00:24,620 --> 01:00:27,800
means that d is now
much larger than n,
1148
01:00:27,800 --> 01:00:30,860
this angle is 90
degrees, which means
1149
01:00:30,860 --> 01:00:32,798
that you're getting nothing.
1150
01:00:32,798 --> 01:00:33,796
Yeah.
1151
01:00:33,796 --> 01:00:37,289
AUDIENCE: If d is not
on your lower plane,
1152
01:00:37,289 --> 01:00:40,782
so like gamma is 0,
is there still angle?
1153
01:00:40,782 --> 01:00:43,780
PHILIPPE RIGOLLET: No,
but that's consistent--
1154
01:00:43,780 --> 01:00:45,659
the fact that it's
consistent when--
1155
01:00:45,659 --> 01:00:46,825
so the angle is a function--
1156
01:00:46,825 --> 01:00:49,605
AUDIENCE: d is not a
constant [INAUDIBLE]?
1157
01:00:52,599 --> 01:00:54,600
PHILIPPE RIGOLLET:
d is not a constant?
1158
01:00:54,600 --> 01:00:57,090
So if d is little o of n?
1159
01:00:57,090 --> 01:00:59,985
Then gamma goes to 0 and
f of gamma goes to 0.
1160
01:00:59,985 --> 01:01:02,490
So f of gamma is
a function that--
1161
01:01:02,490 --> 01:01:05,200
so for example, if f of gamma--
1162
01:01:05,200 --> 01:01:08,960
this is the sine of the
angle, for example--
1163
01:01:08,960 --> 01:01:11,840
then it's a function that starts
at 0, and that goes like this.
1164
01:01:15,340 --> 01:01:18,120
But as soon as gamma is
positive, it goes away from 0.
1165
01:01:20,650 --> 01:01:24,517
So now when gamma
goes to infinity,
1166
01:01:24,517 --> 01:01:26,350
then this thing goes
to a right angle, which
1167
01:01:26,350 --> 01:01:27,516
means I'm getting just junk.
1168
01:01:27,516 --> 01:01:29,210
So this is not my
leading eigenvector.
1169
01:01:29,210 --> 01:01:31,160
So how do you do this?
1170
01:01:31,160 --> 01:01:33,850
Well, just like
everywhere in statistics,
1171
01:01:33,850 --> 01:01:35,500
you have to just make
more assumptions.
1172
01:01:35,500 --> 01:01:36,916
You have to assume
that you're not
1173
01:01:36,916 --> 01:01:39,220
looking for the leading
eigenvector or the direction
1174
01:01:39,220 --> 01:01:40,610
that carries the most variance.
1175
01:01:40,610 --> 01:01:42,830
But you're looking, maybe,
for a special direction.
1176
01:01:42,830 --> 01:01:44,910
And that's what
sparse PCA is doing.
1177
01:01:44,910 --> 01:01:48,610
Sparse PCA is saying, I'm not
looking for any direction u
1178
01:01:48,610 --> 01:01:50,290
that carries the most variance.
1179
01:01:50,290 --> 01:01:54,070
I'm only looking for a
direction u that is sparse.
1180
01:01:54,070 --> 01:01:58,460
Think of it, for example, as
having 10 non-zero coordinates.
1181
01:01:58,460 --> 01:02:02,050
So that's a lot of
directions still to look for.
1182
01:02:02,050 --> 01:02:05,560
But once you do this,
then you actually
1183
01:02:05,560 --> 01:02:07,060
have not only--
there's a few things
1184
01:02:07,060 --> 01:02:08,930
that actually you
get from doing this.
1185
01:02:08,930 --> 01:02:12,160
The first one is you
actually essentially replace
1186
01:02:12,160 --> 01:02:15,660
d by k, which means
that n now just--
1187
01:02:15,660 --> 01:02:18,480
I'm sorry, let's say s
non-zero coefficients.
1188
01:02:18,480 --> 01:02:21,420
You replace d by s,
which means that n only
1189
01:02:21,420 --> 01:02:24,740
has to be much larger than S
for this thing to actually work.
1190
01:02:24,740 --> 01:02:26,760
Now, of course, you've
set your goal weaker.
1191
01:02:26,760 --> 01:02:28,830
Your goal is not to
find any direction, only
1192
01:02:28,830 --> 01:02:30,360
a sparse direction.
1193
01:02:30,360 --> 01:02:31,830
But there's something
very valuable
1194
01:02:31,830 --> 01:02:33,746
about sparse directions,
is that they actually
1195
01:02:33,746 --> 01:02:35,310
are interpretable.
1196
01:02:35,310 --> 01:02:37,810
When I found the v--
1197
01:02:37,810 --> 01:02:40,230
let's say that the v
that I found before
1198
01:02:40,230 --> 01:02:48,390
was 0.2, and then 0.9, and
then 1.1 minus 3, et cetera.
1199
01:02:48,390 --> 01:02:51,570
So that was the coordinates
of my leading eigenvector
1200
01:02:51,570 --> 01:02:54,410
in the original
coordinate system.
1201
01:02:54,410 --> 01:02:55,160
What does it mean?
1202
01:02:55,160 --> 01:02:57,140
Well, it means that if
I see a large number,
1203
01:02:57,140 --> 01:03:01,610
that means that this
v is very close--
1204
01:03:01,610 --> 01:03:03,830
so that's my original
coordinate system.
1205
01:03:03,830 --> 01:03:05,330
Let's call it e1 and e2.
1206
01:03:05,330 --> 01:03:09,230
So that's just 1,
0; and then 0, 1.
1207
01:03:09,230 --> 01:03:11,170
Then clearly, from
the coordinates of v,
1208
01:03:11,170 --> 01:03:13,550
I can tell if my v is like
this, or it's like this,
1209
01:03:13,550 --> 01:03:15,610
or it's like this.
1210
01:03:15,610 --> 01:03:18,330
Well, I mean, they should
all be of the same size.
1211
01:03:18,330 --> 01:03:20,590
So I can tell if
it's here or here
1212
01:03:20,590 --> 01:03:24,739
or here, depending
on-- like here,
1213
01:03:24,739 --> 01:03:26,280
that means I'm going
to see something
1214
01:03:26,280 --> 01:03:29,090
where the Y-coordinate is much
larger than the X-coordinate.
1215
01:03:29,090 --> 01:03:30,960
Here, I'm going to see something
where the X-coordinate is much
1216
01:03:30,960 --> 01:03:32,370
larger than the Y-coordinate.
1217
01:03:32,370 --> 01:03:33,480
And here, I'm going
to see something
1218
01:03:33,480 --> 01:03:35,354
where the X-coordinate
is about the same size
1219
01:03:35,354 --> 01:03:38,390
of the Y-coordinate.
1220
01:03:38,390 --> 01:03:40,499
So when things
starts to be bigger,
1221
01:03:40,499 --> 01:03:42,040
you're going to have
to make choices.
1222
01:03:42,040 --> 01:03:43,900
What does it mean to be bigger--
1223
01:03:43,900 --> 01:03:48,670
when d is 100,000,
I mean, the sum
1224
01:03:48,670 --> 01:03:51,160
of the squares of those
guys have to be equal to 1.
1225
01:03:51,160 --> 01:03:52,790
So they're all
very small numbers.
1226
01:03:52,790 --> 01:03:54,670
And so it's hard for you to
tell which one is a big number
1227
01:03:54,670 --> 01:03:56,045
and which ones is
a small number.
1228
01:03:56,045 --> 01:03:57,378
Why would you want to know this?
1229
01:03:57,378 --> 01:03:58,840
Because it's
actually telling you
1230
01:03:58,840 --> 01:04:03,219
that if v is very close to
e1, then that means that e1--
1231
01:04:03,219 --> 01:04:04,760
in the case of the
gene example, that
1232
01:04:04,760 --> 01:04:08,510
would mean that e1 is the
gene that's very important.
1233
01:04:08,510 --> 01:04:10,100
Maybe there's actually
just two genes
1234
01:04:10,100 --> 01:04:12,109
that explain those two things.
1235
01:04:12,109 --> 01:04:14,150
And those are the genes
that have been picked up.
1236
01:04:14,150 --> 01:04:16,880
There are two genes that
encode geographic location,
1237
01:04:16,880 --> 01:04:18,224
and that's it.
1238
01:04:18,224 --> 01:04:19,640
And so it's very
important for you
1239
01:04:19,640 --> 01:04:21,630
to be able to
interpret what v means.
1240
01:04:21,630 --> 01:04:23,270
Where it has large
values, it means
1241
01:04:23,270 --> 01:04:26,689
that maybe it has large
values for e1, e2, and e3.
1242
01:04:26,689 --> 01:04:28,980
And it means that it's a
combination of e1, e2, and e3.
1243
01:04:28,980 --> 01:04:30,813
And now, you can
interpret, because you have
1244
01:04:30,813 --> 01:04:33,150
only three variables to find.
1245
01:04:33,150 --> 01:04:36,780
And so sparse PCA
builds that in.
1246
01:04:36,780 --> 01:04:39,920
Sparse PCA says,
listen, I'm going
1247
01:04:39,920 --> 01:04:42,600
to want to have at most
10 non-zero coefficients.
1248
01:04:42,600 --> 01:04:44,550
And the rest, I want to be 0.
1249
01:04:44,550 --> 01:04:47,040
I want to be able to be a
combination of at most 10
1250
01:04:47,040 --> 01:04:50,540
of my original variables.
1251
01:04:50,540 --> 01:04:52,740
And now, I can do
interpretation.
1252
01:04:52,740 --> 01:04:54,690
So the problem
with sparse PCA is
1253
01:04:54,690 --> 01:04:57,404
that it becomes very
difficult numerically
1254
01:04:57,404 --> 01:04:58,320
to solve this problem.
1255
01:04:58,320 --> 01:04:59,220
I can write it.
1256
01:04:59,220 --> 01:05:05,700
So the problem is simply
maximize the variance u
1257
01:05:05,700 --> 01:05:09,360
transpose, say, Su
subject to-- well,
1258
01:05:09,360 --> 01:05:12,180
I want the norm of u to be equal to 1.
1259
01:05:12,180 --> 01:05:14,450
So that's the original PCA.
1260
01:05:14,450 --> 01:05:16,020
But now, I also
want that the sum
1261
01:05:16,020 --> 01:05:19,320
of the indicators of the
uj that are not equal to 0
1262
01:05:19,320 --> 01:05:23,120
is at most, say, 10.
1263
01:05:23,120 --> 01:05:26,550
This constraint is
very non-convex.
1264
01:05:26,550 --> 01:05:28,430
So I can relax it
to a convex one
1265
01:05:28,430 --> 01:05:31,720
like we did for
linear regression.
1266
01:05:31,720 --> 01:05:33,920
But now, I've totally
messed up with the fact
1267
01:05:33,920 --> 01:05:37,930
that I could use linear
algebra to solve this problem.
1268
01:05:37,930 --> 01:05:40,812
And so now, you have to go
through much more complicated
1269
01:05:40,812 --> 01:05:42,520
optimization techniques,
which are called
1270
01:05:42,520 --> 01:05:44,350
semidefinite
programs, which do not
1271
01:05:44,350 --> 01:05:46,600
scale well in high dimensions.
1272
01:05:46,600 --> 01:05:48,730
And so you have to do
a bunch of tricks--
1273
01:05:48,730 --> 01:05:49,660
numerical tricks.
1274
01:05:49,660 --> 01:05:52,630
But there are some packages
that implements some heuristics
1275
01:05:52,630 --> 01:05:55,140
or some other things--
1276
01:05:55,140 --> 01:05:56,800
iterative
thresholding, all sorts
1277
01:05:56,800 --> 01:05:58,896
of various numerical
tricks that you can do.
1278
01:05:58,896 --> 01:06:01,270
But the problem they are trying
to solve is exactly this.
1279
01:06:01,270 --> 01:06:03,947
Among all directions that
I have norm 1, of course,
1280
01:06:03,947 --> 01:06:06,030
because it's the direction
that have at most, say,
1281
01:06:06,030 --> 01:06:09,382
10 non-zero coordinates, I want
to find the one that maximizes
1282
01:06:09,382 --> 01:06:10,340
the empirical variance.
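One of the iterative-thresholding heuristics mentioned above can be sketched as truncated power iteration: ordinary power iteration on S, but keeping only the s largest-magnitude coordinates at each step. The covariance, spike size, and sparsity level below are invented; this is a heuristic sketch, not an exact solver for the non-convex problem.

```python
import numpy as np

rng = np.random.default_rng(2)

def truncated_power_iteration(S, s, n_iter=100):
    # Power iteration on S, hard-thresholding to the s largest-magnitude
    # coordinates at every step, then renormalizing to norm 1.
    d = S.shape[0]
    u = np.ones(d) / np.sqrt(d)
    for _ in range(n_iter):
        u = S @ u
        keep = np.argsort(np.abs(u))[-s:]  # indices of the s largest entries
        mask = np.zeros(d)
        mask[keep] = 1.0
        u = u * mask
        u /= np.linalg.norm(u)
    return u

# Toy check: a covariance with a sparse spike on the first 3 coordinates.
d = 50
v = np.zeros(d)
v[:3] = 1.0 / np.sqrt(3)
Sigma = np.eye(d) + 5.0 * np.outer(v, v)
X = rng.standard_normal((400, d)) @ np.linalg.cholesky(Sigma).T
S = X.T @ X / 400
u = truncated_power_iteration(S, s=3)
```

With a spike this strong, the heuristic recovers both the 3-coordinate support and the direction of the planted sparse eigenvector.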
1283
01:06:23,030 --> 01:06:27,782
Actually, let me
just show you this.
1284
01:06:41,910 --> 01:06:47,830
I wanted to show
you an output of PCA
1285
01:06:47,830 --> 01:06:50,620
where people are actually
trying to do directly--
1286
01:06:56,043 --> 01:07:05,903
maybe-- there you go.
1287
01:07:20,700 --> 01:07:26,690
So right here, you
see this is SPSS.
1288
01:07:26,690 --> 01:07:29,310
That's a statistical software.
1289
01:07:29,310 --> 01:07:33,100
And this is an output
that was preprocessed
1290
01:07:33,100 --> 01:07:34,650
by a professional--
1291
01:07:34,650 --> 01:07:36,240
not preprocessed,
post-processed.
1292
01:07:36,240 --> 01:07:38,520
So that's something
where they ran PCA.
1293
01:07:38,520 --> 01:07:39,390
So what is the data?
1294
01:07:39,390 --> 01:07:43,890
This is raw data
where you ask doctors
1295
01:07:43,890 --> 01:07:47,907
what they think of the
behavior of a particular sales
1296
01:07:47,907 --> 01:07:49,740
representative for
pharmaceutical companies.
1297
01:07:49,740 --> 01:07:51,323
So pharmaceutical
companies are trying
1298
01:07:51,323 --> 01:07:52,950
to improve their sales force.
1299
01:07:52,950 --> 01:07:56,430
And they're asking
doctors how would they
1300
01:07:56,430 --> 01:07:58,920
rate-- what do they value
about their interaction
1301
01:07:58,920 --> 01:08:01,410
with a sales representative.
1302
01:08:01,410 --> 01:08:04,140
So basically, there's
a bunch of questions.
1303
01:08:04,140 --> 01:08:10,410
One is: offers credible point
of view on, say, trends,
1304
01:08:10,410 --> 01:08:12,720
provides valuable
networking opportunities.
1305
01:08:12,720 --> 01:08:13,950
This is one question.
1306
01:08:13,950 --> 01:08:15,750
Rate this on a
scale from 1 to 5.
1307
01:08:15,750 --> 01:08:16,790
That was the question.
1308
01:08:16,790 --> 01:08:18,840
And they had a bunch
of questions like this.
1309
01:08:18,840 --> 01:08:22,410
And then they asked 1,000
doctors to make those ratings.
1310
01:08:22,410 --> 01:08:24,210
And what they want--
so each doctor now
1311
01:08:24,210 --> 01:08:25,890
is a vector of ratings.
1312
01:08:25,890 --> 01:08:28,960
And they want to know if there's
different groups of doctors,
1313
01:08:28,960 --> 01:08:30,210
what do doctors respond to.
1314
01:08:30,210 --> 01:08:31,240
If there's different
groups, then
1315
01:08:31,240 --> 01:08:33,450
maybe they know that they
can actually address them
1316
01:08:33,450 --> 01:08:35,500
separately, et cetera.
1317
01:08:35,500 --> 01:08:37,950
And so to do that, of course,
there's lots of questions.
1318
01:08:37,950 --> 01:08:39,840
And so what you want is
to just first project
1319
01:08:39,840 --> 01:08:41,589
into lower dimensions,
so you can actually
1320
01:08:41,589 --> 01:08:42,819
visualize what's going on.
1321
01:08:42,819 --> 01:08:44,760
And this is what
was done for this.
1322
01:08:44,760 --> 01:08:47,490
So these are the first
three principal components
1323
01:08:47,490 --> 01:08:49,439
that came out.
1324
01:08:49,439 --> 01:08:52,439
And even though we ordered
the values of the lambdas,
1325
01:08:52,439 --> 01:08:56,130
there's no reason why the
entries of v should be ordered.
1326
01:08:56,130 --> 01:08:57,840
And if you look at
the values of v here,
1327
01:08:57,840 --> 01:08:59,631
they look like they're
pretty much ordered.
1328
01:08:59,631 --> 01:09:04,142
It starts at 0.784, and then
you're at 0.3 around here.
1329
01:09:04,142 --> 01:09:06,600
There's something that goes up
again, and then you go down.
1330
01:09:06,600 --> 01:09:11,200
Actually, it's marked in red
every time it goes up again.
1331
01:09:11,200 --> 01:09:13,660
And so now, what they
did is they said,
1332
01:09:13,660 --> 01:09:16,270
OK, I need to
interpret those guys.
1333
01:09:16,270 --> 01:09:18,340
I need to tell you what this is.
1334
01:09:18,340 --> 01:09:21,160
If you tell me, we found
the principal component
1335
01:09:21,160 --> 01:09:24,866
that really discriminates
the doctors in two groups,
1336
01:09:24,866 --> 01:09:26,740
the drug company is
going to come back to you
1337
01:09:26,740 --> 01:09:29,080
and say, OK, what is
this characteristic?
1338
01:09:29,080 --> 01:09:31,510
And you say, oh, it's
actually a linear combination
1339
01:09:31,510 --> 01:09:33,460
of 40 characteristics.
1340
01:09:33,460 --> 01:09:35,735
And they say, well, we
don't need you to do that.
1341
01:09:35,735 --> 01:09:38,109
I mean, it cannot be a linear
combination of anything you
1342
01:09:38,109 --> 01:09:39,220
didn't ask.
1343
01:09:39,220 --> 01:09:41,680
And so for that,
first of all, there's
1344
01:09:41,680 --> 01:09:44,859
a post-processing of PCA, which
says, OK, once I actually,
1345
01:09:44,859 --> 01:09:46,990
say, found three
principal components,
1346
01:09:46,990 --> 01:09:51,370
that means that I found the
dimension three space on which
1347
01:09:51,370 --> 01:09:52,899
I want to project my points.
1348
01:09:52,899 --> 01:09:55,720
In this space, I can pick
any direction I want.
1349
01:09:55,720 --> 01:09:57,100
So the first thing
is that you do
1350
01:09:57,100 --> 01:09:59,308
some sort of local arrangements,
so that those things
1351
01:09:59,308 --> 01:10:01,790
look like they are increasing
and then decreasing.
1352
01:10:01,790 --> 01:10:06,130
So you just change, you
rotate your coordinate system
1353
01:10:06,130 --> 01:10:09,880
in this three dimensional space
that you've actually isolated.
1354
01:10:09,880 --> 01:10:11,830
And so once you do
this, the reason
1355
01:10:11,830 --> 01:10:13,600
to do that is that
it sort of makes
1356
01:10:13,600 --> 01:10:16,554
big, sharp differences
between large and small values
1357
01:10:16,554 --> 01:10:18,220
of the coordinates
of the thing you had.
1358
01:10:18,220 --> 01:10:19,261
And why do you want this?
1359
01:10:19,261 --> 01:10:21,100
Because now, you
can say, well, I'm
1360
01:10:21,100 --> 01:10:23,590
going to start looking at the
ones that have large values.
1361
01:10:23,590 --> 01:10:24,250
And what do they say?
1362
01:10:24,250 --> 01:10:26,249
They say in-depth knowledge,
in-depth knowledge,
1363
01:10:26,249 --> 01:10:28,270
in-depth knowledge,
knowledge about.
1364
01:10:28,270 --> 01:10:30,280
This thing is clearly
something that
1365
01:10:30,280 --> 01:10:34,090
actually characterizes
the knowledge of my sales
1366
01:10:34,090 --> 01:10:35,260
representative.
1367
01:10:35,260 --> 01:10:38,311
And so that's something that
doctors are sensitive to.
1368
01:10:38,311 --> 01:10:40,060
That's something that
really discriminates
1369
01:10:40,060 --> 01:10:40,960
the doctors in a way.
1370
01:10:40,960 --> 01:10:43,120
There's lots of variance
along those things,
1371
01:10:43,120 --> 01:10:45,576
or at least a lot of variance--
1372
01:10:45,576 --> 01:10:47,950
I mean, doctors are separate
in terms of their experience
1373
01:10:47,950 --> 01:10:49,240
with respect to this.
1374
01:10:49,240 --> 01:10:51,102
And so what they
did is said, OK,
1375
01:10:51,102 --> 01:10:53,310
all these guys, some of
those they have large values,
1376
01:10:53,310 --> 01:10:55,015
but I don't know how
to interpret them.
1377
01:10:55,015 --> 01:10:56,890
And so I'm just going
to put the first block,
1378
01:10:56,890 --> 01:10:58,681
and I'm going to call
it medical knowledge,
1379
01:10:58,681 --> 01:11:01,330
because all those things are
knowledge about medical stuff.
1380
01:11:01,330 --> 01:11:03,538
Then here, I didn't know
how to interpret those guys.
1381
01:11:03,538 --> 01:11:06,220
But those guys, there's a big
clump of large coordinates,
1382
01:11:06,220 --> 01:11:10,720
and they're about respectful
of my time, listens, friendly
1383
01:11:10,720 --> 01:11:12,070
but courteous.
1384
01:11:12,070 --> 01:11:14,000
This is all about the
quality of interaction.
1385
01:11:14,000 --> 01:11:17,446
So this block was actually
called quality of interaction.
1386
01:11:17,446 --> 01:11:18,820
And then there
was a third block,
1387
01:11:18,820 --> 01:11:21,320
which you can tell starts to
be spreading a little thin.
1388
01:11:21,320 --> 01:11:22,864
There's just much less of them.
1389
01:11:22,864 --> 01:11:24,280
But this thing was
actually called
1390
01:11:24,280 --> 01:11:26,260
fair and critical opinion.
1391
01:11:26,260 --> 01:11:30,010
And so now, you have three
discriminating directions.
1392
01:11:30,010 --> 01:11:31,990
And you can actually
give them a name.
1393
01:11:31,990 --> 01:11:34,780
Wouldn't it be beautiful if
all the numbers in the gray box
1394
01:11:34,780 --> 01:11:36,700
came non-zero and
all the other numbers
1395
01:11:36,700 --> 01:11:38,860
came zero-- there
was no ad hoc choice.
1396
01:11:38,860 --> 01:11:40,750
I mean, this is probably
an afternoon of work
1397
01:11:40,750 --> 01:11:42,850
to like scratch out
all these numbers
1398
01:11:42,850 --> 01:11:44,801
and put all these
color codes, et cetera.
1399
01:11:44,801 --> 01:11:47,050
Whereas, you could just have
something that tells you,
1400
01:11:47,050 --> 01:11:49,090
OK, here are the non-zeros.
1401
01:11:49,090 --> 01:11:52,120
If you can actually make a story
around why this group of things
1402
01:11:52,120 --> 01:11:54,820
actually makes sense, such
as it is medical knowledge,
1403
01:11:54,820 --> 01:11:55,730
then good for you.
1404
01:11:55,730 --> 01:11:57,804
Otherwise, you could
just say, I can't.
1405
01:11:57,804 --> 01:11:59,470
And that's what sparse
PCA does for you.
1406
01:11:59,470 --> 01:12:02,890
Sparse PCA outputs something
where all those numbers would
1407
01:12:02,890 --> 01:12:03,850
be zero.
1408
01:12:03,850 --> 01:12:06,964
And there would be exactly,
say, 10 non-zero coordinates.
1409
01:12:06,964 --> 01:12:08,380
And you can tune
this knob of 10.
1410
01:12:08,380 --> 01:12:09,220
You can make it 9.
1411
01:12:11,687 --> 01:12:13,270
Depending on what
your major is, maybe
1412
01:12:13,270 --> 01:12:15,310
you can actually go
on with 20 of them
1413
01:12:15,310 --> 01:12:18,310
and have the ability to
tell the story about 20
1414
01:12:18,310 --> 01:12:20,650
different variables and how
they fit in the same group.
1415
01:12:20,650 --> 01:12:22,750
And depending on
how you feel, it's
1416
01:12:22,750 --> 01:12:25,390
easy to rerun the PCA
depending on the value
1417
01:12:25,390 --> 01:12:26,535
that you want here.
1418
01:12:26,535 --> 01:12:28,660
And so you could actually
just come up with the one
1419
01:12:28,660 --> 01:12:30,240
you prefer.
1420
01:12:30,240 --> 01:12:32,354
And so that's the
sparse PCA thing
1421
01:12:32,354 --> 01:12:33,520
which I'm trying to promote.
1422
01:12:33,520 --> 01:12:35,250
I mean, this is not
super well-spread.
1423
01:12:35,250 --> 01:12:39,300
It's a fairly new idea,
maybe at most 10 years old.
1424
01:12:39,300 --> 01:12:40,940
And it's not
completely well-spread
1425
01:12:40,940 --> 01:12:42,540
in statistical packages.
1426
01:12:42,540 --> 01:12:44,040
But that's clearly
what people are
1427
01:12:44,040 --> 01:12:46,601
trying to emulate currently.
1428
01:12:46,601 --> 01:12:47,100
Yes?
1429
01:12:47,100 --> 01:12:48,600
AUDIENCE: So what
exactly does it
1430
01:12:48,600 --> 01:12:50,932
mean that the doctors
have a lot of variance
1431
01:12:50,932 --> 01:12:53,100
in medical knowledge,
quality of interaction,
1432
01:12:53,100 --> 01:12:55,600
and fair and critical opinion?
1433
01:12:55,600 --> 01:13:00,200
Like, it was saying that
these are like the main things
1434
01:13:00,200 --> 01:13:02,986
that doctors vary on,
some doctors care.
1435
01:13:02,986 --> 01:13:05,590
Like we could sort of
characterize a doctor by, oh,
1436
01:13:05,590 --> 01:13:08,030
he cares this much about
medical knowledge, this much
1437
01:13:08,030 --> 01:13:09,494
about the quality
of interaction,
1438
01:13:09,494 --> 01:13:11,446
and this much about
critical opinion.
1439
01:13:11,446 --> 01:13:14,862
And that says most of the story
about what this doctor wants
1440
01:13:14,862 --> 01:13:17,790
from a drug representative?
1441
01:13:17,790 --> 01:13:20,610
PHILIPPE RIGOLLET: Not really.
1442
01:13:20,610 --> 01:13:22,590
I mean, OK, let's say
you pick only one.
1443
01:13:22,590 --> 01:13:31,535
So that means that you
would take all your doctors,
1444
01:13:31,535 --> 01:13:33,160
and you would have
one direction, which
1445
01:13:33,160 --> 01:13:36,480
is quality of interaction.
1446
01:13:36,480 --> 01:13:38,710
And there would be just
spread out points here.
1447
01:13:42,604 --> 01:13:44,270
So there are two
things that can happen.
1448
01:13:44,270 --> 01:13:46,900
The first one is that
there's a clump here,
1449
01:13:46,900 --> 01:13:49,014
and then there's a clump here.
1450
01:13:49,014 --> 01:13:50,680
That still represents
a lot of variance.
1451
01:13:50,680 --> 01:13:52,420
And if this happens,
you probably
1452
01:13:52,420 --> 01:13:55,120
want to go back in
your data and see
1453
01:13:55,120 --> 01:13:58,540
were these people visited
by a different group
1454
01:13:58,540 --> 01:14:00,520
than these people,
or maybe these people
1455
01:14:00,520 --> 01:14:02,700
have a different specialty.
1456
01:14:05,250 --> 01:14:07,000
I mean, you have to
look back at your data
1457
01:14:07,000 --> 01:14:08,470
and try to understand
why you would have
1458
01:14:08,470 --> 01:14:09,700
different groups of people.
1459
01:14:09,700 --> 01:14:13,510
And if it's like completely
evenly spread out,
1460
01:14:13,510 --> 01:14:15,730
then all it's saying
is that, if you
1461
01:14:15,730 --> 01:14:18,460
want to have a uniform
quality of interaction,
1462
01:14:18,460 --> 01:14:20,410
you need to take
measures on this.
1463
01:14:20,410 --> 01:14:24,114
You need to have this to
not be discrimination.
1464
01:14:24,114 --> 01:14:26,530
But I think really when it's
becoming interesting it's not
1465
01:14:26,530 --> 01:14:27,779
when it's complete spread out.
1466
01:14:27,779 --> 01:14:29,350
It's when there's
a big group here.
1467
01:14:29,350 --> 01:14:30,520
And then there's
almost no one here,
1468
01:14:30,520 --> 01:14:32,020
and then there's
a big group here.
1469
01:14:32,020 --> 01:14:34,880
And then maybe there's
something you can do.
1470
01:14:34,880 --> 01:14:40,690
And so those two things actually
give you a lot of variance.
1471
01:14:40,690 --> 01:14:47,490
So actually, maybe
I'll talk about this.
1472
01:14:47,490 --> 01:14:49,732
Here, this is sort of a mixture.
1473
01:14:49,732 --> 01:14:51,690
You have a mixture of
two different populations
1474
01:14:51,690 --> 01:14:53,040
of doctors.
1475
01:14:53,040 --> 01:14:56,670
And it turns out that
principal component analysis--
1476
01:14:56,670 --> 01:14:59,750
so a mixture is when you
have different populations--
1477
01:14:59,750 --> 01:15:02,010
think of like two
Gaussians that are just
1478
01:15:02,010 --> 01:15:03,690
centered at two
different points,
1479
01:15:03,690 --> 01:15:05,460
and maybe they're
in high dimensions.
1480
01:15:05,460 --> 01:15:07,350
And those are
clusters of people,
1481
01:15:07,350 --> 01:15:09,680
and you want to be able to
differentiate those guys.
1482
01:15:09,680 --> 01:15:10,770
If you're in very
high dimensions,
1483
01:15:10,770 --> 01:15:12,120
it's going to be very
difficult. But one
1484
01:15:12,120 --> 01:15:14,730
of the first processing tools
that people do is to do PCA.
1485
01:15:14,730 --> 01:15:18,046
Because if you have one big
group here and one big group
1486
01:15:18,046 --> 01:15:19,920
here, it means that
there's a lot of variance
1487
01:15:19,920 --> 01:15:21,961
along the direction that
goes through the centers
1488
01:15:21,961 --> 01:15:22,860
of those groups.
1489
01:15:22,860 --> 01:15:24,630
And that's essentially
what happened here.
1490
01:15:24,630 --> 01:15:27,967
You could think of this as being
two blobs in high dimensions.
1491
01:15:27,967 --> 01:15:29,550
But you're really
just projecting them
1492
01:15:29,550 --> 01:15:30,810
into one dimension.
1493
01:15:30,810 --> 01:15:33,370
And this dimension, hopefully,
goes through the center.
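The two-blobs picture can be sketched numerically: two Gaussian populations in high dimensions, separated by their centers, become visibly separated after projecting onto the first principal component. This is an illustrative toy example, not the lecture's data; the parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 400
mu = np.zeros(d)
mu[0] = 3.0                                   # centers at +mu and -mu

labels = rng.integers(0, 2, n)                # which population each point is in
X = rng.standard_normal((n, d)) + np.where(labels[:, None] == 1, mu, -mu)

Xc = X - X.mean(axis=0)                       # center the data
S = Xc.T @ Xc / n                             # empirical covariance (PSD)
eigvals, eigvecs = np.linalg.eigh(S)          # spectral theorem
v1 = eigvecs[:, -1]                           # leading principal direction

proj = Xc @ v1                                # project onto one dimension
pred = (proj > 0).astype(int)                 # sign of projection = cluster guess
acc = max(np.mean(pred == labels), np.mean(pred != labels))
print(acc)
```

The direction of largest variance runs through the two centers, so the 1-D projection recovers almost all of the mixture labels; in very high dimensions this PCA step is exactly the preprocessing being described.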
1494
01:15:33,370 --> 01:15:37,460
And so as preprocessing--
so I'm going to stop here.
1495
01:15:37,460 --> 01:15:42,720
But PCA is not just made
for dimension reduction.
1496
01:15:42,720 --> 01:15:44,700
It's used for
mixtures, for example.
1497
01:15:44,700 --> 01:15:47,340
It's also used when you
have graphical data.
1498
01:15:47,340 --> 01:15:48,750
What is the idea of PCA?
1499
01:15:48,750 --> 01:15:53,400
It just says, if you have a
matrix that seems to have low
1500
01:15:53,400 --> 01:15:56,370
rank-- meaning that there's a
lot of those lambda i's that
1501
01:15:56,370 --> 01:15:57,570
are very small--
1502
01:15:57,570 --> 01:16:00,420
and then I see that
plus noise, then
1503
01:16:00,420 --> 01:16:02,790
it's a good idea to
do PCA on this thing.
1504
01:16:02,790 --> 01:16:05,520
And in particular, people
use that in networks a lot.
1505
01:16:05,520 --> 01:16:08,300
So you take the adjacency
matrix of a graph--
1506
01:16:08,300 --> 01:16:11,160
well, you sort of preprocess it
a little bit, so it looks nice.
1507
01:16:11,160 --> 01:16:13,590
And then if you have, for
example, two communities
1508
01:16:13,590 --> 01:16:15,570
in there, it should
look like something that
1509
01:16:15,570 --> 01:16:18,510
is low rank plus some noise.
1510
01:16:18,510 --> 01:16:22,670
And low rank means that there's
just very few non-zero--
1511
01:16:22,670 --> 01:16:24,226
well, low rank means this.
1512
01:16:24,226 --> 01:16:26,100
Low rank means that if
you do the scree plot,
1513
01:16:26,100 --> 01:16:27,250
you will see
something like this,
1514
01:16:27,250 --> 01:16:29,720
which means that if you throw
out all the smaller ones,
1515
01:16:29,720 --> 01:16:33,420
it should not really matter
in the overall structure.
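The "low rank plus noise" picture for a graph with two communities can be checked directly: the expected adjacency matrix has rank 2, so the scree plot of the observed adjacency matrix shows two large eigenvalues above a bulk of small noise eigenvalues. A minimal sketch, with assumed connection probabilities (0.5 within a community, 0.1 across):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
z = np.array([1] * (n // 2) + [-1] * (n // 2))   # two planted communities

# Connection probability: 0.5 within a community, 0.1 across communities,
# so the expected adjacency matrix has rank 2 (low rank plus noise).
P = 0.30 + 0.20 * np.outer(z, z)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                      # symmetric, no self-loops

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]   # the "scree plot" values
print(eigvals[:4])                               # two spikes, then a bulk
```

The two leading eigenvalues are of order n while the rest are of order sqrt(n), so throwing out everything past the gap keeps the community structure; the second eigenvector's signs approximately recover the two groups.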
1516
01:16:33,420 --> 01:16:35,430
And so you can use all--
1517
01:16:35,430 --> 01:16:39,090
these techniques are used
everywhere these days, not
1518
01:16:39,090 --> 01:16:39,900
just in PCA.
1519
01:16:39,900 --> 01:16:41,670
So we call it PCA
as statisticians.
1520
01:16:41,670 --> 01:16:46,700
But people call it the
spectral methods or SVD.
1521
01:16:46,700 --> 01:16:49,450
So everyone--