The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

LORENZO ROSASCO: If you remember, we did local methods and bias-variance. Then we passed to global regularization methods -- least squares, linear least squares, kernel least squares, computations, and modeling. And that's where we were at. And then we moved on and started to think about more intractable models, and we were starting to think of the problem of variable selection, OK? And the way we posed it is that you are going to consider a linear model, and use the weight associated to each variable as the strength of the corresponding variable, which you can view as a measurement. And your game is not only to build good predictions from the measurements, but also to tell which measurements are interesting, OK? And so here, the term "relevant variable" is going to be related to the contribution to the predictive power of the corresponding function, OK? So that's how we measure relevance of a variable. So we looked at this problem with the funny name. And then we kind of agreed that there seems to be a default approach, which is basically based on trying all possible subsets, OK? So this is also called best subset selection. Variable selection is also sometimes called best subset selection. And this gives you a feeling that what you should do is try all possible subsets and check the one which is best with respect to your data, which, again, would be like a trade-off between how well you fit the data and how many variables you have, OK? And what I told you last was that you could actually see that this trying of all possible subsets is related to a form of regularization, one that looks very similar to the one we saw until a minute ago.
The main difference here is that I put fw, but fw is just the usual linear function. The only difference is that here, rather than the square norm, we put this functional that is called the 0 norm, which is a functional that, given a vector, returns the number of entries in the vector which are different from 0, OK? It turns out that if you were to minimize this, you would be solving the best subset selection problem. The issue here -- another manifestation of the complexity of the problem -- is the fact that this functional is non-convex. So there is no known polynomial-time algorithm to actually find a solution. It comes to my mind that somebody made a comment during the break. Notice that here I'm passing a bit quickly over a refinement of the question of best subset selection, which is related to: is there a unique subset which is good? Is there more than one? And if there's more than one, which one should I pick, OK? In practice, these questions arise immediately, because if you have two measurements that are very correlated, or even more, if they're perfectly correlated -- if you just build the measurements, you might build, out of two measurements, a third measurement which is just a linear combination of the first two. So at that point, what would you want to do? Do you want to keep the minimum number of variables, or the biggest possible number of variables? And you have to decide, because all these variables are, to some extent, completely dependent, OK? So for now, we just keep to the case where we don't really worry about this, OK? We just say: among the good ones, we want to pick one. A harder question would be: pick all of them, or pick one of them. And if you wanted to pick one of them, you would have to tell me which one you want, according to which criterion, OK? So the problem we're concerned with now is: OK, now that we know that we might want to do this, how can you do it in an approximate way that will be good enough, and what does it mean, good enough?
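For reference, the regularized form of best subset selection just described can be written out as follows. This is a sketch in the notation used so far, assuming the same square-loss data term as in the earlier least squares discussion; the exact normalization is a detail of the slides.

\[
\min_{w \in \mathbb{R}^{d}} \ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - w^\top x_i \big)^2 \;+\; \lambda \, \| w \|_0 ,
\qquad \| w \|_0 = \#\{\, j : w_j \neq 0 \,\}.
\]

Both terms are easy to write down; the difficulty mentioned above is that the second one is non-convex, which is what makes the exact problem intractable.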
So the simplest way -- again, we can try to think about it together -- is kind of a greedy version of the brute force approach. The brute force approach was: start from all single variables, then all couples, then all triplets, and so on, OK? And this doesn't work computationally. So, just to wake up: how could you twist this approach to make it approximate, but computationally feasible? Let's keep the same spirit. Let's start from few, and then try to add more. So the general idea is: I pick one. And once I pick one, I pick another one, just keeping the one I already picked. And then I pick another one, and then another one, and so on. This, of course, will not be the exhaustive search from before. It's probably doable. There are a bunch of different ways you can do it. And you can hope that, under some conditions, you might be able to prove that it's not too far away from the brute force approach. And this is kind of what we would want to do, OK? So we will have a notion of residual. At the first iteration, the residual will be just the output. So just think of the first iteration. You get the output vector, and you want to explain it. You want to predict it well, OK? So what you do is that you first check the one variable that gives you the best prediction of this guy, and then you compute the prediction. Then, at the next round, you want to discount what you have already explained. What you do is that basically you take the actual output minus your prediction, and you get the residual. And then you try to explain that. That's what's left to explain, OK? So now you check for the variable that best explains this remaining bit.
Then you add this variable to the ones you already have, and you have a new notion of residual, which is what's left after what you explained in the first round and what you added in the second round. And then there's still something left, and you keep on going. If you let this thing go for enough time, you will have the least squares solution. At the end of the day, you will have explained everything. At each round, notice that you might or might not decide to put a variable back in, OK? So you might have that at each step you add one new variable, or you might take multiple steps but end up with fewer variables than the number of steps. No matter what, the number of steps will be related to the number of variables that are active in your model, OK? Does it make sense? This is the wordy version, but now we go into the details, OK? But this is it, roughly speaking. So, first round: you try to explain something, then you see what's left to explain. You keep the variable that explains the rest, and then you iterate. I'm not sure I used the word, but it's important. The key word here is "sparsity," OK? The fact that I'm assuming my model to depend on just a few vectors -- sorry, a few entries. So it's a vector with many zero entries. Sparsity is the key word for this property, which is a property of the problem. And so I build algorithms that will try to find sparse solutions explaining my data, and this is one way. So let's look at this list. You define the notion of residual as the thing that you want to try to explain: at the first round it will be the output, and at the second round it will be what's left to explain after your prediction. You have a coefficient vector, OK, because we're building a linear function. And then you have an index set, which is the set of variables which are important at that stage. So these are the three objects that you have to initialize.
So at the first round, the coefficient vector is going to be 0, the index set is going to be empty, and the residual is just going to be the output vector. Then you find the best single variable, and you update the index set: you add that variable to the index set. To include that variable, you compute the coefficient vector. And then you update the residual, and then you start again, OK? If you want, here I show you the first example -- just to give you an idea. So first of all, notice this, OK? Forget about anything that's written here. Just look at this matrix. The output vector is of the same length as a column of the matrix, right? So each column of the matrix will be related to one variable. So what you're going to do is try to see which of these best explains my output vector, and then you're going to define the residual and keep on going, OK? So in this case, for example, you can ask: which of the two directions, X1 and X2, best explains the vector Y, OK? This is the case where it's simple. I basically have this one direction. One variable is this one. Another variable is this one. And then I have that vector. I want to know which direction I should pick to best explain my Y. Which one do you think I should pick?
AUDIENCE: X1.
LORENZO ROSASCO: I should pick X1? OK. This projection here will be the weight I have to put on X1 to get a good prediction. And then what's the residual? Well, I have to take this X1 -- I have to take Yn, I have to subtract that, and this is what I have left. So this is it in simple terms, OK? So that's what we want to do. We said it with hands, we said it with words. Here is, more or less, the pseudocode, OK?
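As a companion to the pseudocode, here is a minimal sketch of that loop in Python. The names are hypothetical, it assumes a data matrix X with nonzero columns, and it implements the basic, non-orthogonal variant described here, where previously computed coefficients are never updated.

```python
import numpy as np

def matching_pursuit(X, y, T):
    """Basic (non-orthogonal) matching pursuit: T greedy iterations."""
    n, d = X.shape
    w = np.zeros(d)          # coefficient vector, starts at zero
    S = set()                # index set of selected variables
    r = y.copy()             # residual, starts at the output vector
    for _ in range(T):
        # one-dimensional least squares fit of the residual on each column
        a = X.T @ r / np.sum(X**2, axis=0)
        errs = np.sum((r[:, None] - X * a) ** 2, axis=0)
        k = int(np.argmin(errs))   # best column = most correlated with the residual
        S.add(k)
        w[k] += a[k]               # add the new coefficient, keep the old ones
        r = r - a[k] * X[:, k]     # discount what has just been explained
    return w, S
```

As said above, if you let it run long enough it ends up explaining everything, approaching the least squares fit; the free parameter is the number of iterations T.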
It's a bit boring to read, but you can see that it's four lines of code anyway. Now that we've said it 15 times, it probably won't be that hard to read, because what you see is that you have a notion of residual, you have the coefficient vector, and you have the index set. This is empty, this is all 0's, and this, at the first round, is just the output. Then you start. And the free parameter here is T, the number of iterations, OK? It's going to play the role of lambda, so to say. What you do is -- OK, you have, for each j -- so, just notation: j runs over the variables, and capital Xj will be the column of the data matrix that corresponds to the j-th variable, OK? And then what you do in this line, which I expand here, is to find the coefficient -- sorry, find the error that corresponds to the best variable, OK? If you look, it turns out that finding the column best correlated with the output, or the residual, is equivalent to finding the column that best explains it. These two things are the same, so here I write the equivalence. Pick the one that you prefer, OK? Either you say, I find the column that is best correlated with the residual, or you find the column that best explains the residual in the sense of least squares, OK? These two things are equivalent. Pick the one that you like. And that's the content of this line. Then you select the index of that column. So you solve this problem for each column. It's an easy problem -- it's a one-dimensional problem. And then you pick the one column that you like the most, which is the one that gives you the best correlation, a.k.a. the smallest least squares error. Then you add this k to the index set. And then, in this case, it's very simple.
I'm just not going to recompute anything. So, suppose that -- you remember the coefficient vector, which was all 0's, OK? Then at the first round, I compute one number, the solution with, say, the first coordinate, for example. And then I add a number in that entry, OK? So this is the canonical basis vector, OK? It has all 0's, but 1 in position k. So here I put this number. This is just a typo. And then what you do is that you sum them up, OK? So you have all 0's, just one number here at the first iteration, then the other one. And then you add this one there, and you keep on going, OK? This is the simplest possible version. And once you have this, you have this vector. This is a long vector. You multiply this -- sorry, this should be Xn. Maybe we should take note of the typos, because I'm never going to remember all of them. And then what you do is that you just discount what you have explained so far from the residual. So you already explained some part of the residual; now you discount this new part, you define the new residual, and then you go back. This method is -- so "greedy approaches" is one name. As often happens in machine learning and statistics and other fields, things get reinvented constantly, a bit because people come to them from a different perspective, a bit because people decide that studying and reading is not a priority sometimes. And so this one algorithm is often called greedy -- it's one example of greedy approaches. It's sometimes called matching pursuit. It's very much related to so-called forward stagewise regression; that's how it's called in statistics. And it has a bunch of other names. Now, this version is just the basic version -- it's the simplest version. This step typically remains; these two steps can be changed slightly, OK? For example, can you think of another way of doing this? Let me just give you a hint.
In this case, what you do is that you select a variable and you compute the coefficient. Then you select another variable and compute the coefficient for the second variable, but you keep the coefficient you already computed for the first variable. That first coefficient never knew that you took another one, because you hadn't taken it yet. So from this comment, do you see how you could change this method to somewhat fix this aspect? Do you see what I'm saying? I would like to change this one line where I compute the coefficient, and perhaps even this one line where I compute the residual, to account for the fact that this method basically never updates the weights it computed before; you only add a new one. And this seems potentially not a good idea, because when you have two variables, it's better to compute the solution with both of them. So what could you do?
AUDIENCE: [INAUDIBLE]
LORENZO ROSASCO: Right. So what you could do is essentially what is called orthogonal matching pursuit. You would take this set, and now you would solve a least squares problem with all the variables that are in the index set up to that point. You recompute everything. And now you have to solve not a one-dimensional problem, but an n times k-dimensional problem, where k -- I don't know, k is a bad name -- is the size of the index set, which could be T or less than T, OK? And then at that point, you also want to redefine this, because you're not discounting anymore what you already explained; each time you're recomputing everything. So you just want to do Yn minus the prediction, OK? So this algorithm is the one that actually has better properties. It works better. You pay a price, because each time you have to recompute the least squares solution, and when you have more than one variable inside, the problems become bigger and bigger.
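A minimal sketch of that orthogonal variant, under the same assumptions as the earlier snippet (hypothetical names; the least squares subproblem is solved with np.linalg.lstsq for simplicity):

```python
import numpy as np

def orthogonal_matching_pursuit(X, y, T):
    """Orthogonal matching pursuit: refit all selected variables at each step."""
    n, d = X.shape
    w = np.zeros(d)
    S = []                    # index set of selected variables
    r = y.copy()              # residual
    for _ in range(T):
        # select the column most correlated with the current residual
        corr = np.abs(X.T @ r) / np.sqrt(np.sum(X**2, axis=0))
        k = int(np.argmax(corr))
        if k not in S:
            S.append(k)
        # re-solve least squares on all selected columns, not one at a time
        w_S, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        w = np.zeros(d)
        w[S] = w_S
        # residual is the output minus the full current prediction
        r = y - X[:, S] @ w_S
    return w, S
```

The design trade-off is the one described above: each iteration now solves a small least squares problem on the selected columns instead of a one-dimensional one.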
So if you stop after a few iterations, it's fine. But if you start to have many iterations, each time you have to solve a linear system, so the complexity is much higher. The basic one, as you can imagine, is super fast. So that's it. It turns out that this method -- called matching pursuit, or forward stagewise regression -- is one way to approximate the zero-norm solution. And one can prove exactly in which sense you can approximate it, OK? So I think this is the one that we might give you this afternoon, right?
AUDIENCE: Orthogonal.
LORENZO ROSASCO: Oh, yeah, the orthogonal version, the nicer version. The other way of doing this is the one that basically says: look, here what you're doing is just counting the number of entries different from 0. What if you were to replace this with something that does a bit more -- something that not only counts, but actually sums up the weights? So, if you want, in one case you just check: if a weight is different from 0, you count it as 1; otherwise, you count it as 0. Here you actually take the absolute value. So instead of summing up binary values, you sum up real numbers, OK? This is what is called the L1 norm. So each weight doesn't just count as 1 when it's nonzero; it counts for its absolute value. It turns out that this term, which you can imagine -- the absolute value looks like this, right, and now you're just summing them up -- is actually convex. So you're summing up two convex terms, and the overall functional is convex. And if you want, you can think of this a bit as a relaxation of the zero norm. We say "relaxation" in the sense of relaxing a strict requirement. I talked about relaxation before when I said that instead of binary values, you take real values and you optimize over the reals instead of over the binary values. Here it's kind of the same thing.
Instead of restricting yourself to this functional, which is binary-valued, you allow yourself to relax and get real numbers. And what you gain is that the corresponding optimization problem is convex, and you can try to solve it. It is still not something where we can do what we did before -- we cannot just take derivatives and set them equal to 0, because this term is not smooth. The absolute value looks like this, which means that here, around the kink, it's not differentiable. But we can still use convex analysis to try to get the solution, and actually the solution doesn't look too complicated. Getting there requires a bit of convex analysis, but there are techniques. The ones that are trendy these days are called forward-backward splitting, or proximal methods, to solve this problem. And apparently I'm not even going to show them to you, because they're not on the slides. But essentially, it's not too complicated. Just to tell you in one word what they do: they do gradient descent on the first term, and then at each step of the gradient they threshold. So they take a step of the gradient, get a vector, and look inside the vector. If an entry is smaller than a threshold that depends on lambda, I set it equal to 0; otherwise, I let it go, OK? I didn't put it on the slides, I don't know why, because it's really a one-line algorithm. It's a bit harder to derive, but it's very simple to check.
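For concreteness, here is a minimal sketch of that thresholded gradient iteration -- the proximal (forward-backward) scheme for the least squares plus L1 objective. The names are hypothetical, the step size is chosen from the spectral norm of X, and the update uses the standard soft-thresholding step, a slight refinement of the "set small entries to 0" description above.

```python
import numpy as np

def l1_regularized_least_squares(X, y, lam, n_iter=1000):
    """Proximal iteration for (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    step = n / (2 * np.linalg.norm(X, 2) ** 2)   # step size from the Lipschitz constant
    w = np.zeros(d)
    for _ in range(n_iter):
        # gradient step on the smooth (least squares) term
        grad = (2.0 / n) * X.T @ (X @ w - y)
        v = w - step * grad
        # soft thresholding: shrink, and set small entries exactly to 0
        w = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    return w
```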
So let's talk one second about this picture, and then let me tell you about what I'm not telling you. Hiding behind everything I said so far, there is a linear system, right? There is a linear system that is n by p, or d, or whatever you want to call it, with the number p of variables. And our game so far has always been: look, we have a linear system that can be non-invertible, or, even if it is invertible, it might have a bad condition number, and I want to try to find a way to stabilize the problem. The first time around, we basically replaced the inverse with an approximate inverse. That's the classical way of doing it. Here, we're making another assumption. We're basically saying: look, this vector does look very long, so the problem seems ill-posed. But in fact, if only a few entries were different from 0, and if you were able to tell me which ones they are, you could go in, delete all the other entries, delete all the corresponding columns, and then you would have a matrix that no longer looks short and wide -- it would look skinny and tall, OK? And that would probably be easier to solve. It would be one of the linear systems that we know how to solve. So what we described so far is a way to find solutions of linear systems where the number of equations is smaller than the number of unknowns, and which, by themselves, cannot be solved, under the extra assumption that, in fact, there are fewer unknowns than what it looks like. It's just that I'm not telling you which ones they are, OK? You see, if I could tell you, you would just get back to a very easy problem, where the number of unknowns is much smaller, OK? So this is a mathematical fact, OK? And these questions were open, because -- well, now they're not, because people have been talking about this stuff constantly for 10 years. But one question is: how much does this assumption buy you? For example, could you prove that in certain situations, even if you don't know which entries are nonzero, you could actually solve this problem exactly? If I give them to you, you can do it, right? But is there a way to try to guess them in some way, so that you can do almost as well, or with high probability as well, as if I had told you them in advance? And it turns out that the answer is yes, OK?
And the answer is basically that if the number of entries that are different from 0 is small enough, and the columns corresponding to those variables are not too correlated -- not too collinear, so they're distinguishable enough that when you perturb the problem a little bit nothing changes -- then you can solve the problem exactly, OK? So this, on the one hand, is exactly the kind of theory that tells you why using greedy methods and convex relaxations will give you a good approximation to L0, because that's basically what this story tells you. People have also been using this observation -- and this is the part that is interesting for us -- in a slightly different context, which is the following. You see, for us, Y and X we don't choose; we get them. And whatever they are, they are. If the columns are nice, nice. But if they're not nice, sorry, you have to live with it, OK? But there are settings where you can think of the following. Suppose that you have a signal, and you want to be able to reconstruct it. The classical Shannon sampling theorem results basically tell you that if you have something which is band-limited, you have to sample at twice the maximum frequency. But this is kind of worst case, because it's assuming that all the bands, all the frequencies, are full. Suppose that now we play -- it's an analogy, OK? -- and I tell you: oh, look, it's true, this is the maximum frequency, but there's only one other frequency, this one. Do you really need to sample that much, or can you do with much less?
And it turns out that, answering this question, the story here is that yes, you can do with much less. Ideally, what you would like to say is: well, instead of sampling at twice the maximum frequency, if I have just four frequencies different from 0, I should only need eight samples, OK? That would be ideal, but you would have to know which ones they are. You don't, so you pay a price, but it's just logarithmic. So you basically have a new sampling theorem that tells you that you don't need to sample that much. You can't sample that little either. Say the maximum frequency is d, and the number of non-zero frequencies is s. With the classical result, you would have to say 2d. Ideally, we would like to say 2s. What you can actually say is something like 2s log d. So you pay a log d price, because you didn't know where they are. But still, it's much less than being linear in the dimension. So essentially, the field of compressed sensing has been built around this observation, and the focus is slightly different. Instead of saying I want to do statistical estimation where I just get the data, you say: I have a signal, and now I view this as a sensing matrix that I design, with the property that I know it will allow me to do this estimation well. So you basically assume that you can choose those vectors in certain ways, and then you can prove that you can reconstruct with much fewer samples, OK? And this has been used, for example -- I never remember -- for, as you call it, MEG? In what? No, MRI, MRI. Two things I didn't tell you about, but that are worth mentioning: suppose that what I tell you is that it's actually not individual entries that are 0, but groups of entries that are 0 -- for example, because each entry belongs to a biological process. So I have genes, but genes are actually involved in biological processes. So there is a group of genes that is doing something, another group of genes doing something else, and what I want to select is not individual genes, but groups. Can you twist this stuff in such a way that you select groups? Yes. What if the groups are actually overlapping? How do you want to deal with the overlaps? Do you want to keep the overlap? Do you want to cut the overlap? What if you have a tree structure, OK? What do you do with this?
So first of all, who gives you this information, OK? And then, if you have the information, how do you use it? How are you going to use it? See, this is the whole field of structured sparsity. It's a whole industry of building penalties other than L1 that allow you to incorporate this kind of prior information. If you want, just as in kernel methods the kernel was the place where you could incorporate prior information, here, in this field, you can do that by designing a suitable regularizer. And then a lot of the reasoning is the same; everything translates, with these new regularizers. The last bit is that, with a bit of a twist, some of the ideas I showed you now, which are basically related to vectors and sparsity, translate to more general contexts, in particular that of matrices that have low rank, OK? The classical example is matrix completion, OK? I give you a matrix, but I actually delete most of the entries of the matrix. And I tell you: OK, estimate the original matrix. Well, how can I do that, right? It turns out that if the matrix itself has very low rank, so that many of the columns and rows you see are actually related to each other, and the way the entries to delete were chosen was not malicious, then you might actually be able to fill in the missing entries, OK? And the theory behind this is very similar to the theory that allows you to fill in the right entries of the vector, OK? Last bit -- PCA in 15 minutes. So what we've seen so far was the very hard problem of variable selection. It is still a supervised problem, where I give you labels, OK? The last bit I want to show you is PCA, which is the case where I don't give you labels. And what you try to answer is actually -- perhaps it's a simpler question.
Because you don't want to select one of the directions, but you would like to know if there are directions that matter. So you allow yourself, for example, to combine the different directions in your data, OK? This question is interesting for many, many reasons. One is data visualization, for example. You have stuff that you cannot look at, because you have, for example, digits in very high dimensions. You would like to look at them. How do you do it? Well, you would like to find directions: a first direction to project everything onto, a second direction, a third direction, because then you can plot them and look at them, OK? And this is one visualization of these images here. I don't remember the code now -- it's written here. You have different colors, and what you see is that this actually did a good job. Because what you expect, if you do a nice visualization, is that similar numbers, or the same numbers, end up in the same regions, and perhaps similar numbers are close, OK? So this is one reason why you might want to do this. Another reason is that you might want to reduce the dimensionality of your data, just to compress them, or because you might hope that certain dimensions don't matter or are simply noise. And so you just want to get rid of them, because this could be good for statistical reasons. OK, so the game is going to be the following. X is the data space, which is going to be R^D. And we want to define a map M that sends vectors of length D into vectors of length k. So k is going to be my reduced dimensionality. And what we're going to do is build a basic method to do this, which is PCA, and we're going to give a purely geometric view of PCA, OK? And this is going to be done by taking first the case where k is equal to 1, and then iterating to go up.
So in the first case, we're going to ask: if I give you vectors which are D-dimensional, how can I project them onto one dimension, with respect to some criterion of optimality, OK? And here what we ask is: we want to project the data onto the one dimension that gives the best possible reconstruction error. So I think I had it before. Do I have it -- no, no. This was done for another reason, but it's useful now. If you have this vector and you want to project it onto this direction, and this is a unit vector, what do you do? I want to know how to write this vector here, the projection. What you do is that you take the inner product between Yn and X. You get a number, and that number is the length you want to assign to X1, OK? So suppose that w is the direction, and I have a vector x, and I want to compute the projection, OK? What do I do? I take the inner product of x and w, and this is the length I have to assign to the vector w, which is unit norm, OK? So this is the best approximation of xi in the direction of w. Does it make sense? I fix a w, and I want to know how well I can describe x. I project x onto that direction, and then I take the difference between x and the projection, OK? And then I sum over all points. And then I check, among all possible directions, for the one that gives me the best error. So suppose that this is your data set. Which direction do you think is going to give me the best error?
AUDIENCE: Keep going.
LORENZO ROSASCO: Well, if you go in this direction, you can explain most of the stuff, OK? You can reconstruct it best. So this is going to be the solution. So the question here is really: how do you solve this problem? You could try to minimize with respect to w, but it's not immediately clear what kind of computation you have. And if we massage this a little bit, it turns out that it is actually exactly an eigenvalue problem. So that's what we want to do next.
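Written out (with the normalization left as in the slides), the one-dimensional problem just described is: find the unit-norm direction that minimizes the average reconstruction error,

\[
\min_{\|w\| = 1} \ \frac{1}{n} \sum_{i=1}^{n} \big\| \, x_i - (w^\top x_i)\, w \, \big\|^2 ,
\]

where \((w^\top x_i)\, w\) is the projection of \(x_i\) onto the direction \(w\).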
So conceptually, what we want to do is what I said here and nothing more. I want to find the single direction that allows me to reconstruct best, on average, all the training set points. And now what we want to do is just check what kind of computation this entails and learn a bit more about it, OK? So this notation is just to say that the vector has norm 1, so that I don't have to fumble with the size of the vector. OK, so let's do a couple of computations. This is ideal after lunch. You just take this square and develop it, OK? And remember that w is unit norm, so when you do w transpose w, you get 1. And if you don't forget to put your square, and you just develop this, you'll see that this is an equality, OK? There is a square missing here. So you have xi squared. Then you have the cross term between xi and this, which will be w transpose xi squared. And then you would also have this term squared, but that square is w transpose xi squared times w transpose w, which is 1. And so what you see is that instead of three terms we have two, because two of them -- not cancel out, they balance each other. OK. So then I'd argue that instead of minimizing this -- because this is equal to this -- you can maximize this. Why? Well, because this term is just a constant. It doesn't depend on w at all, so I can drop it from my functional. The minimum value will be different, but the minimizer, the w that solves the problem, will be the same, OK? And then, minimizing something with a minus is the same as maximizing the same thing without the minus, OK? I won't ask if it's all good so far, because I'm scared. So what you see now is that basically, if the data were centered, this would just be a variance.
778 00:34:23,110 --> 00:34:25,739 If the data are centered, so the mean here is 0, 779 00:34:25,739 --> 00:34:29,360 you can interpret this as measuring the variance 780 00:34:29,360 --> 00:34:30,069 in one direction. 781 00:34:30,069 --> 00:34:31,651 And so you have another interpretation 782 00:34:31,651 --> 00:34:33,960 of PCA, which is the one where instead of picking 783 00:34:33,960 --> 00:34:37,370 the single direction with the best possible reconstruction, 784 00:34:37,370 --> 00:34:39,889 you're picking the direction where the variance of the data 785 00:34:39,889 --> 00:34:42,569 is biggest, OK? 786 00:34:42,569 --> 00:34:44,860 And these two points of view are completely equivalent. 787 00:34:44,860 --> 00:34:46,886 Essentially, whenever you have a square norm, 788 00:34:46,886 --> 00:34:48,469 thinking about maximizing the variance 789 00:34:48,469 --> 00:34:51,080 or minimizing the reconstruction error are two 790 00:34:51,080 --> 00:34:53,540 complementary, dual ideas, OK? 791 00:34:53,540 --> 00:34:56,090 So that's what you will be doing here. 792 00:34:56,090 --> 00:34:57,090 One more bit. 793 00:34:57,090 --> 00:34:58,080 What about computation? 794 00:34:58,080 --> 00:35:00,640 So this is-- so we can think about reconstruction. 795 00:35:00,640 --> 00:35:03,920 You can think about variance, if you like. 796 00:35:03,920 --> 00:35:05,100 What about this computation? 797 00:35:05,100 --> 00:35:07,180 What kind of computation is this, OK? 798 00:35:07,180 --> 00:35:08,600 If we massage it a little bit, we 799 00:35:08,600 --> 00:35:10,550 see that it is just an eigenvalue problem. 800 00:35:10,550 --> 00:35:12,540 So this is how you do it. 801 00:35:12,540 --> 00:35:16,670 This actually looks-- well, it's annoying, but it's very simple. 802 00:35:16,670 --> 00:35:18,110 So I wrote out all the steps. 803 00:35:18,110 --> 00:35:21,200 This is a square, so it's something times itself. 804 00:35:21,200 --> 00:35:23,030 This whole thing is a scalar, so you 805 00:35:23,030 --> 00:35:26,420 can swap the order of this multiplication. 806 00:35:26,420 --> 00:35:30,980 So you get w transpose xi, xi transpose w. 807 00:35:30,980 --> 00:35:33,740 But then the sum is only 808 00:35:33,740 --> 00:35:35,240 going to involve these terms. 809 00:35:35,240 --> 00:35:40,310 So I can move the sum inside, and this is what you get. 810 00:35:40,310 --> 00:35:47,270 So you get w transpose, 1/n times the sum of xi xi transpose, w. 811 00:35:47,270 --> 00:35:53,247 So this is just a number. w transpose xi is just a number. 812 00:35:53,247 --> 00:35:55,580 But the moment you look at something that looks like xi 813 00:35:55,580 --> 00:35:57,750 xi transpose, what is that? 814 00:35:57,750 --> 00:36:00,310 Well, just look at dimensionality, OK? 815 00:36:00,310 --> 00:36:05,137 1 times d times d times 1 gives you a number, which is 1 by 1. 816 00:36:05,137 --> 00:36:06,720 Now you're doing it the other way around. 817 00:36:06,720 --> 00:36:07,610 So what is this? 818 00:36:07,610 --> 00:36:08,771 AUDIENCE: It's a matrix. 819 00:36:08,771 --> 00:36:09,812 LORENZO ROSASCO: It's a-- 820 00:36:09,812 --> 00:36:10,286 AUDIENCE: Matrix. 821 00:36:10,286 --> 00:36:11,577 LORENZO ROSASCO: It's a matrix. 822 00:36:11,577 --> 00:36:15,590 And it's a matrix which is d by d, and it's of rank 1, OK?
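A small numerical check of the claim above, assuming NumPy and centered data: the objective (1/n) sum over i of (w transpose xi) squared is exactly the quadratic form w transpose C w, that is, the variance of the data projected along w. The variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)              # center the data
w = np.array([1.0, 1.0, 0.0])
w = w / np.linalg.norm(w)           # a unit-norm direction

C = (X.T @ X) / len(X)              # C = (1/n) sum_i xi xi^T, a d by d matrix
proj = X @ w                        # the numbers w^T xi, one per point

print(np.mean(proj ** 2))           # variance of the projections (data are centered)
print(w @ C @ w)                    # the quadratic form: the same number
```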
823 00:36:15,590 --> 00:36:18,347 And what you do now is that you sum them all up, 824 00:36:18,347 --> 00:36:20,180 and what you have is that this quantity here 825 00:36:20,180 --> 00:36:22,180 becomes what is called a quadratic form. 826 00:36:22,180 --> 00:36:26,300 It is a matrix C, which just looks like this. 827 00:36:26,300 --> 00:36:30,810 And it's squeezed in between two vectors, w transpose and w. 828 00:36:30,810 --> 00:36:33,350 So now what you want to do is that you 829 00:36:33,350 --> 00:36:38,840 can rewrite this just this way as maximizing over w-- 830 00:36:38,840 --> 00:36:42,320 sorry, finding the unit norm vector w that 831 00:36:42,320 --> 00:36:44,475 maximizes this quadratic form. 832 00:36:44,475 --> 00:36:46,100 And at this point, you can still ask me 833 00:36:46,100 --> 00:36:47,740 who cares, because I'm just rewriting 834 00:36:47,740 --> 00:36:50,530 the same problem over and over. 835 00:36:50,530 --> 00:36:54,230 But it turns out that, essentially using 836 00:36:54,230 --> 00:36:57,390 Lagrange multipliers, and it is relatively simple to do, 837 00:36:57,390 --> 00:36:59,660 you can check-- 838 00:36:59,660 --> 00:37:03,740 oh, so boring-- that the solution of this problem 839 00:37:03,740 --> 00:37:08,060 is the eigenvector of this matrix with the largest eigenvalue, OK? 840 00:37:08,060 --> 00:37:10,490 So this you can leave as an exercise. 841 00:37:10,490 --> 00:37:13,340 Essentially, you take the Lagrangian of this and use 842 00:37:13,340 --> 00:37:16,610 a little bit of duality, and you show that the solution 843 00:37:16,610 --> 00:37:19,370 of this problem is just 844 00:37:19,370 --> 00:37:23,930 the eigenvector corresponding 845 00:37:23,930 --> 00:37:28,610 to the maximum eigenvalue of the matrix C. 846 00:37:28,610 --> 00:37:30,800 So finding this direction is just 847 00:37:30,800 --> 00:37:32,810 solving an eigenvalue problem. 848 00:37:32,810 --> 00:37:33,310 OK. 849 00:37:36,274 --> 00:37:39,850 I think I'll just do the last few of these slides; they're kind of cute. 850 00:37:39,850 --> 00:37:42,860 It's pretty simple, OK? 851 00:37:42,860 --> 00:37:46,280 So this part, all these lines after lunch, 852 00:37:46,280 --> 00:37:48,689 is a bit there because I'm nice. 853 00:37:48,689 --> 00:37:51,230 But really, the only part which is a bit more complicated 854 00:37:51,230 --> 00:37:51,950 is this one here. 855 00:37:51,950 --> 00:37:56,630 The rest is really just very simple algebra. 856 00:37:56,630 --> 00:37:58,500 So what about k equal 2? 857 00:37:58,500 --> 00:37:59,660 I'm running out of time. 858 00:37:59,660 --> 00:38:03,350 But it turns out that what you want to do 859 00:38:03,350 --> 00:38:07,382 is basically this: say that you want to look for a second direction. 860 00:38:07,382 --> 00:38:08,840 So you look at the first direction. 861 00:38:08,840 --> 00:38:11,340 You solve it, and you know that it's 862 00:38:11,340 --> 00:38:12,290 the first eigenvector. 863 00:38:12,290 --> 00:38:14,206 And then let's say that you add the constraint 864 00:38:14,206 --> 00:38:15,710 that the second direction you find 865 00:38:15,710 --> 00:38:19,124 has to be orthogonal to the first direction. 866 00:38:19,124 --> 00:38:20,540 You might not want to do this, OK?
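A sketch of the resulting computation, assuming NumPy: as the Lagrangian argument above indicates, the unit-norm maximizer of the quadratic form comes out of the eigendecomposition of C. The toy data and names are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([3.0, 1.0, 0.5, 0.1])   # anisotropic toy data
X = X - X.mean(axis=0)

C = (X.T @ X) / len(X)                 # second moment / covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh, since C is symmetric; ascending eigenvalues
w_star = eigvecs[:, -1]                # eigenvector with the largest eigenvalue

# No random unit-norm direction should beat w_star on the quadratic form.
for _ in range(5):
    w = rng.normal(size=4)
    w = w / np.linalg.norm(w)
    assert w @ C @ w <= w_star @ C @ w_star + 1e-12

print("first principal direction:", w_star)
```

Here np.linalg.eigh is the natural choice because C is symmetric, so its eigendecomposition is real.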
867 00:38:20,540 --> 00:38:22,545 But if you do, if you say you add 868 00:38:22,545 --> 00:38:25,670 the orthogonality constraint, then what you can check 869 00:38:25,670 --> 00:38:28,170 is that you can repeat the argument-- 870 00:38:28,170 --> 00:38:29,534 sorry, I didn't do it here. 871 00:38:29,534 --> 00:38:31,700 It's in my notes, the ones that I have on the website, 872 00:38:31,700 --> 00:38:33,500 and the computation is kind of cute. 873 00:38:33,500 --> 00:38:36,530 And what you see is that the solution of this problem, 874 00:38:36,530 --> 00:38:38,660 which looks exactly like the one before, 875 00:38:38,660 --> 00:38:40,970 only with this additional constraint, 876 00:38:40,970 --> 00:38:44,060 is exactly the eigenvector corresponding 877 00:38:44,060 --> 00:38:46,980 to the second largest eigenvalue, OK? 878 00:38:46,980 --> 00:38:49,439 And so you can keep on going. 879 00:38:49,439 --> 00:38:50,980 And so now this gives you a way to go 880 00:38:50,980 --> 00:38:54,040 from k equal to 1 to k bigger than 1, 881 00:38:54,040 --> 00:38:56,530 and you can keep on going, OK? 882 00:38:56,530 --> 00:38:58,690 So if you're looking for the directions that 883 00:38:58,690 --> 00:39:00,565 maximize the variance or minimize the reconstruction error, 884 00:39:00,565 --> 00:39:04,339 they turn out to be 885 00:39:04,339 --> 00:39:06,380 the eigenvectors corresponding 886 00:39:06,380 --> 00:39:09,430 to the biggest eigenvalues of this matrix C, which you 887 00:39:09,430 --> 00:39:11,470 can call the second moment or covariance 888 00:39:11,470 --> 00:39:14,380 matrix of the data. 889 00:39:14,380 --> 00:39:17,350 OK, so this is more or less the end. 890 00:39:17,350 --> 00:39:20,380 This is the basic, basic, basic version of this. 891 00:39:20,380 --> 00:39:24,970 You can mix this with pretty much all the other stuff 892 00:39:24,970 --> 00:39:25,900 we said today. 893 00:39:25,900 --> 00:39:29,620 So one is, how about trying to use kernels to do a nonlinear 894 00:39:29,620 --> 00:39:30,784 extension of this? 895 00:39:30,784 --> 00:39:32,950 So here we just looked at the linear reconstruction. 896 00:39:32,950 --> 00:39:34,417 How about nonlinear reconstruction? 897 00:39:34,417 --> 00:39:36,250 So what you would do is that you would first 898 00:39:36,250 --> 00:39:38,140 map the data in some way and then 899 00:39:38,140 --> 00:39:40,820 try to find some kind of nonlinear dimensionality 900 00:39:40,820 --> 00:39:41,390 reduction. 901 00:39:41,390 --> 00:39:44,980 You see that what I'm doing here is that I'm just 902 00:39:44,980 --> 00:39:46,840 using this linear-- 903 00:39:46,840 --> 00:39:49,496 it's just a linear dimensionality reduction, just 904 00:39:49,496 --> 00:39:51,100 a linear operator. 905 00:39:51,100 --> 00:39:52,620 But what about something nonlinear? 906 00:39:52,620 --> 00:39:54,790 What if my data lie on some kind of structure 907 00:39:54,790 --> 00:39:57,240 that looks like that-- 908 00:39:57,240 --> 00:40:01,210 our beloved machine learning Swiss roll? 909 00:40:01,210 --> 00:40:03,470 Well, if you do PCA, you're 910 00:40:03,470 --> 00:40:07,950 just going to find a plane that cuts that thing somewhere, OK? 911 00:40:07,950 --> 00:40:10,270 But if you try to embed the data in some nonlinear way, 912 00:40:10,270 --> 00:40:11,920 you could try to resolve this. 913 00:40:11,920 --> 00:40:13,830 And much of the research 914 00:40:13,830 --> 00:40:17,650 done in the direction of manifold learning is about exactly this.
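Putting the last few steps together, a sketch of the k greater than 1 case discussed above, assuming NumPy: keep the eigenvectors with the largest eigenvalues and project the centered data onto them. The helper name pca_directions is illustrative, not a library function:

```python
import numpy as np

def pca_directions(X, k):
    # Top-k eigenvectors of the covariance matrix, as columns of a d by k matrix.
    Xc = X - X.mean(axis=0)
    C = (Xc.T @ Xc) / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(C)   # ascending order of eigenvalues
    return eigvecs[:, ::-1][:, :k]         # reorder so the largest eigenvalues come first

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
W = pca_directions(X, k=2)                 # two orthogonal directions
Z = (X - X.mean(axis=0)) @ W               # n by k reduced representation of the data
print(Z.shape)                             # (300, 2)
```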
915 00:40:17,650 --> 00:40:21,790 Here are just a few keywords-- 916 00:40:21,790 --> 00:40:25,270 kernel PCA is the easiest version, Laplacian eigenmaps, 917 00:40:25,270 --> 00:40:28,870 diffusion maps, and so on, OK? 918 00:40:28,870 --> 00:40:32,410 I only touch quickly upon random projections. 919 00:40:32,410 --> 00:40:34,310 There is a whole literature about those. 920 00:40:34,310 --> 00:40:38,290 The idea is, again, that by multiplying the data 921 00:40:38,290 --> 00:40:41,570 by random vectors, you can keep the information in the data, 922 00:40:41,570 --> 00:40:43,120 and you might be able to reconstruct the data 923 00:40:43,120 --> 00:40:45,100 as well as preserve distances. 924 00:40:45,100 --> 00:40:52,780 Also, you can combine ideas from sparsity with ideas from PCA. 925 00:40:52,780 --> 00:40:56,126 For example, you can say, what if I want not only-- 926 00:40:56,126 --> 00:40:58,000 I want to find something like an eigenvector, 927 00:40:58,000 --> 00:41:01,210 but I would like most of the entries of the eigenvector to be 0. 928 00:41:01,210 --> 00:41:03,520 So can I add here a constraint which basically 929 00:41:03,520 --> 00:41:05,470 says, among all the unit vectors, 930 00:41:05,470 --> 00:41:07,750 find the one whose entries are mostly 0-- 931 00:41:07,750 --> 00:41:10,810 so I want to add an L0 norm or an L1 norm constraint. 932 00:41:10,810 --> 00:41:12,580 So how can you do that, OK? 933 00:41:12,580 --> 00:41:15,930 And this leads to sparse PCA and other structured matrix 934 00:41:15,930 --> 00:41:17,239 estimation problems, OK? 935 00:41:17,239 --> 00:41:19,780 So this is, again, something I'm not going to tell you about, 936 00:41:19,780 --> 00:41:21,640 but that's kind of the beginning. 937 00:41:21,640 --> 00:41:26,650 And this, more or less, brings us to the desert island, 938 00:41:26,650 --> 00:41:28,500 and I'm done.
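And a sketch of the random-projection idea mentioned above, assuming NumPy; the Gaussian construction and the 1/sqrt(k) scaling follow the usual Johnson-Lindenstrauss style, with sizes chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 50
X = rng.normal(size=(n, d))                # high-dimensional toy data

R = rng.normal(size=(d, k)) / np.sqrt(k)   # random Gaussian projection matrix
Z = X @ R                                  # n by k projected data

# Pairwise distances are approximately preserved (exactly so in expectation).
print(np.linalg.norm(X[0] - X[1]))
print(np.linalg.norm(Z[0] - Z[1]))
```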