The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

LORENZO ROSASCO: So what we want to do now is to move away from local methods and start to look at some form of global regularization method. The word "regularization" I'm going to use broadly, as a term for statistical and computational procedures that have some parameter that lets you go from a complex model to a simple model, in a very broad sense. What I mean by complex is something that is potentially getting closer to overfitting, and by simple, something that gives me a solution which is stable with respect to the data.

So we're going to consider the following algorithm. I imagine a lot of you have seen it before. It has a bunch of different names -- probably the most famous one is Tikhonov regularization. A bunch of people at the beginning of the '60s thought about something similar, either in the context of statistics or of solving linear equations. Tikhonov is the only one for whom I could find a picture. The other one was Phillips, and then there is Hoerl and other people. They basically all thought about this same procedure.

The procedure is based on a functional that you want to minimize, made of two terms. There are several ingredients going on here. First of all, this is f of x. We try to estimate the function, and we do assume a parametric form for this function, which in this case is just linear. And for the time being -- because you can easily put it back in -- I don't look at the offset. So I just take lines passing through the origin. And this is just because you can prove in one line that you can put the offset back in at zero cost. So for the time being, just think that the data are actually centered.
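The functional on the slide is not reproduced in the transcript; a minimal LaTeX rendering of the regularized least squares objective being described, assuming the 1/n-scaled squared loss that matches the derivation later in the lecture, is:

```latex
\min_{w \in \mathbb{R}^d} \;\; \frac{1}{n} \sum_{i=1}^{n} \big( y_i - w^\top x_i \big)^2 \;+\; \lambda \,\|w\|^2
```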
The way you try to estimate these parameters is, on one hand, to make the empirical error small, and on the other hand, to put a budget on the weights. The reason why you do this -- there are a bunch of ways to explain it. Andrei yesterday talked about margins, different separating lines, and so on. Another way to think about it is that you can convince yourself -- and we are going to see this later -- that in low dimensions a line is a very poor model, because if you have more than a few points, and they're not sitting on a line, you will not be able to make zero error. But if the number of points is lower than the number of dimensions, you can show that a line can actually give you zero error. It's just a matter of degrees of freedom: you have fewer equations than variables. So what you do is add a regularization term. It's basically a term that makes the problem well-posed. We're going to see this in a minute from a different perspective; the easiest one is going to be numerical.

We stick to least squares -- and there is an extra parenthesis that I forgot. Before I tell you why we use least squares, let me tell you that, as somebody pointed out, there is a mistake here: this should just be a minus. I'll fix it. So, back to why we use least squares. On the one hand, especially in low dimensions, you can think of least squares as pretty basic, but it's not a very robust way to measure error, because you square the errors, and so just one error can count a lot. So there is a whole literature on robust statistics, where you want to replace the square with something like an absolute value.
It turns out that, at least in our experience, when you have a high-dimensional problem it's not completely clear how much this kind of instability will actually occur, and whether it won't be cured by just adding a regularization term. And the computations underlying this algorithm are extremely simple. That's why we're sticking to it: it works pretty well in practice. We've actually developed in the last few years some toolboxes that you can use; they're pretty much plug and play. And the algorithm is easy to understand in simple terms. Yesterday Andrei was talking about SVM. SVM is very similar in principle; basically the only difference is that you change the way you measure the cost here. This algorithm you can use both for classification and regression, whereas the SVM that was discussed yesterday is just for classification. And because that cost function turns out to be non-smooth -- non-smooth basically means non-differentiable -- the math is much more complicated, because you have to learn how to minimize things that are not differentiable. In this case, you can stick to elementary stuff. And, as I think I put somewhere, also because Legendre 200 years ago said that least squares are really great. There is this old story about who invented least squares first, Gauss or Legendre, and there are actually long articles about this. But anyway, it's around that time -- around the end of the 18th century. So the algorithm is pretty old.

So what's the idea? Back to the case we had before: you're going to take a linear function. One thing, just to be careful -- think about it once, because if you've never thought about it before, it's good to focus. When you do this drawing, this line is not f of x. It's f of x equals zero.
I don't think I made time for a 3D plot, but f of x is actually a plane that cuts through the slide. It's positive where the line is not dotted -- because these points are positive -- and then it becomes negative. And this line is where it changes sign. So the decision boundary is not f of x itself; it's the level set that corresponds to f of x equals zero, whereas f of x itself is the plane. If you think in one dimension, the points are just sitting on a line: some here are plus 1, some here are minus 1. So what is f of x? It's just a line. What is the decision boundary in this case? It will just be one point, actually, because it's one line that cuts the input line in one point. That's it. If you were to take a more complicated nonlinear function, it would be more than one point. In two dimensions, the boundary becomes a line; in three dimensions, it becomes a plane, and so on and so forth. But the important piece -- just remember it at least once -- when we look at this plot, this is not f of x, but only the set where f of x equals zero, which is where it changes sign. And that's how you're going to make predictions. You take real-valued functions. In principle, in classification, you would allow this function to be binary, but optimization with binary functions is very hard. So what do you typically do to relax this? You allow it to be a real-valued function, and then you take the sign: when it's positive, you predict plus 1; when it's negative, you predict minus 1. If it's a regression problem, you just keep the value as it is.

And how many free parameters does this algorithm have? Well, one -- it's lambda, for now -- and w. But w we're going to find by solving this optimization problem. How about lambda? Well, whatever we discussed before for k.
We would try to sit down and do some bias-variance decomposition, see what it depends on, try to get a grasp on what the theory of this algorithm is. And then we would try to see if we can use cross-validation. You can do all these things, so we're not going to discuss much how you choose lambda; mostly we're going to discuss how you compute the minimizer. And this is not a problem, because the functional is smooth, so you can take the derivative with respect to w of this term and also of this one. What you can do is just take the derivative, set it equal to zero, and check what happens.

It's useful to introduce some vector notation; we've already seen it before. You take all the x's and stack them as rows of the data matrix Xn. The y's you stack as entries of a vector, which you call Yn. Then you can rewrite this term as this vector minus this vector here, which you obtain by multiplying the matrix by w. This norm is the norm in Rn. So this is just a simple rewriting. It's useful because if you now take the derivative of this with respect to w and set it equal to zero, you get this. This is the gradient -- I haven't set it to zero yet. This is the gradient of the least squares part, and this is the gradient of the second term, still multiplied by lambda. If you set them equal to zero, what you get is this. You take everything with X -- the 2 and the 2 cancel -- you take everything with X and put it here; there's still the term with lambda, and you put it here. You take the term X transpose Y and put it on the other side of the equality. So you take everything with w on one side and everything without w on the other side. And then here, I remove the n by multiplying through. And so what you get is a linear system. It's just a linear system. So that's the beauty of least squares.
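A minimal NumPy sketch of the computation just described: setting the gradient to zero gives the linear system (Xnᵀ Xn + λ n I) w = Xnᵀ Yn. The helper name and the toy data are mine, not from the lecture's toolbox.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (X^T X + lam * n * I) w = X^T y for the regularized least squares weights."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # d x d system matrix
    b = X.T @ y
    return np.linalg.solve(A, b)        # solve the linear system rather than forming an inverse

# Toy usage: random data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = rng.standard_normal(5)
y = np.sign(X @ w_true)

w = ridge_primal(X, y, lam=0.1)
y_hat = np.sign(X @ w)                  # classification: take the sign of the real-valued f(x)
print("training accuracy:", np.mean(y_hat == y))
```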
Whether you regularize it or not -- in this case, for this simple squared loss regularization -- all you get is a linear system. And this is the first way to think about the effect of adding this term. So what is it doing? Just a quick linear system recap; this is a parenthesis, and I've changed notation a little. You're solving a linear system. The simplest case you can think of is when M is diagonal: suppose it's just a square diagonal matrix. How do you solve this problem? You have to invert the matrix M. What is the inverse of a diagonal matrix? It's just another diagonal matrix where, instead of each entry sigma, you have 1 over sigma. So if M is diagonal like this, this is what you're going to get. Suppose now that some of these numbers are actually small; then when you take 1 over them, they are going to blow up. When you apply this matrix to b, what can happen is that if you change the sigmas or b slightly, you get an explosion. And if you want, this is one way to understand why adding the lambda helps. It's another way to look at overfitting, from a numerical point of view: you take the data, change them slightly, and you have numerical instability right away. What is the effect of adding this term? Well, instead of just computing M inverse, you're computing the inverse of M plus lambda I. This is the simple case where everything is diagonal, and what you see is that on the diagonal, instead of 1 over sigma 1, you take 1 over sigma 1 plus lambda. If sigma 1 is big, adding lambda won't matter. If sigma d is small -- now think of the sigmas as ordered, and sigma d is the smallest--
If this is small, at some point lambda is going to jump in and make the problem stable, at the price of ignoring the information in that sigma -- you basically consider it to be at the same scale as the noise, the perturbation, or the sampling in your data. Does this make sense? So this is what the algorithm is doing, and it's a numerical way to look at stability. But you can imagine that there is an immediate statistical consequence: change the data slightly and you can have a big change in your solution, and the other way around. And lambda governs this, by basically telling you how invertible the matrix is. So it's a connection between statistical and numerical stability. Now of course you can say this is oversimplistic, because this is just a diagonal matrix. But if you now take matrices that you can diagonalize, conceptually nothing changes. There is a mistake here, by the way: there should be no minus 1, this is just sigma. If you have an M that you can diagonalize, every operation you want to do on the matrix you can just do on the diagonal. So all the reasoning here works the same; only now you have to remember to squeeze the diagonal matrix in between V and V transpose. I'm not saying this is what you want to do numerically, but the conceptual reasoning -- what we said was the effect of lambda -- holds just the same. This is M, which you can write like this, and M inverse you can write like this: it's just the same diagonal terms inverted. And now you see the effect of lambda; it's just the same. Once you grasp this conceptually, for any matrix you can make diagonal, it's the same.
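A small numerical illustration of this point (the values are made up for the sketch): a symmetric matrix with one tiny eigenvalue amplifies a tiny perturbation of b, while adding lambda times the identity tames it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric matrix with one tiny eigenvalue: M = V diag(sigma) V^T
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
sigma = np.array([10.0, 1.0, 1e-8])
M = V @ np.diag(sigma) @ V.T

b = rng.standard_normal(3)
b_perturbed = b + 1e-6 * rng.standard_normal(3)   # tiny change in the data

lam = 1e-3
solve = np.linalg.solve

# Without regularization: 1/sigma_min is huge, so a tiny change in b explodes
print(np.linalg.norm(solve(M, b) - solve(M, b_perturbed)))
# With regularization: the eigenvalues become sigma_i + lambda, so the solution barely moves
print(np.linalg.norm(solve(M + lam * np.eye(3), b) - solve(M + lam * np.eye(3), b_perturbed)))
```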
And the point is that as long as you have a symmetric positive semi-definite matrix -- which you can diagonalize -- you just have the same thing squeezed in between V and V transpose. And that's what we have, because what we have is exactly this matrix here. And you see here that this depends a lot on the dimensionality of the data. If the number of points is much bigger than the dimensionality, this matrix in principle could be invertible -- it's easier for it to be invertible. But if the number of points is smaller than the dimensionality -- how big is this matrix? Remember how big Xn was: the rows were the points, and the columns were the variables. So how big is this? We called the dimension d and the number of points n. So this is--

AUDIENCE: [INAUDIBLE]

LORENZO ROSASCO: --n by d. So this matrix here is how big? Just d by d. And if the number of points is smaller than the number of dimensions, the rank of this-- it's going to be rank-deficient, so it's not invertible. So if you're in a so-called high-dimensional scenario, where the number of points is smaller than the number of dimensions, for sure you won't be able to invert this. Ordinary least squares will not work; it will be unstable. And then you will have to regularize to get anything reasonable. So in the case of least squares, just by looking at this computation you get a grasp of both what kind of computations you have to do and what they mean, from both the statistical and the numerical point of view. And that's one of the beauties of least squares.
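A quick numerical sketch of the rank argument (the sizes are illustrative): when n < d, the d-by-d matrix XᵀX has rank at most n and is not invertible on its own, while the regularized system is still well-posed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 100                       # fewer points than dimensions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

G = X.T @ X                          # d x d, but rank at most n
print(np.linalg.matrix_rank(G))      # -> 10: rank-deficient, so G alone cannot be inverted

lam = 0.1
w = np.linalg.solve(G + lam * n * np.eye(d), X.T @ y)   # regularized system is solvable
print(np.linalg.norm(X @ w - y))     # small training residual
```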
We could go through a whole derivation of this -- what I've given is more the linear system perspective. There is a whole literature trying to justify what I'm saying from a more statistical point of view. You can talk about maximum likelihood, then maximum a posteriori. You can talk about variance reduction and the so-called Stein effect. And you can make a much bigger story, for example developing the whole theory of shrinkage estimators and the bias-variance tradeoff of this. But we're not going to talk about that. This simple numerical stability, statistical stability intuition is going to be my main motivation for considering these schemes.

So let me skip these. I wanted to show the demo, but it's very simple -- it's going to be very stable, because you're just drawing a one-dimensional line. Let's move on just a bit, because we didn't cover as much as I wanted in the first part. So first of all, so far so good? Are you all with me on this? Again, this is the one line where there is something conceptual happening, and this is the one line where we make it a bit more complicated mathematically. And then all you have to do is match this with what we just wrote before. That's all. These are the main three things we want to do. And think a bit about dimensionality.

Now, if you look at a problem even like this, as I said, it might be misleading -- it's low dimensional. In fact, what we typically do in high dimensions is, first of all, start with the linear model and see how far we can go with that. And typically you go a bit further than you might imagine. But still, you can ask: why should I stick to a linear decision rule? It won't give me much flexibility. In this case, obviously, it looks like something better would be some kind of quadratic decision boundary. So how can you do this? Suppose that I give you the code for least squares, and you're the laziest programmer in the world -- which in my case is actually not that hard to imagine.
How can you recycle the code to create a solution like this, instead of a solution like this? You see the question? I give you the code to solve the problem I showed you before -- the linear system for different lambdas -- but you want to go from this solution to that solution. How could you do that? One way you can do it in this simple case is this example. The idea is -- remember the data matrix? I'm going to invent new entries of the matrix: not new points, because you cannot invent points, but new variables. In this case I call them x1 and x2; I'm just in two dimensions. These are my data. This is one point, and x1 and x2 here are just the entries of the point x: the first coordinate and the second coordinate. So what you said is exactly one way to do this. I'm going to build a new vector representation of the same points. It's going to be the same point, but instead of two coordinates I now use three, which are going to be the first coordinate squared, the second coordinate squared, and the product of the two coordinates. Once I've done this, I forget about how I got it, and I just treat these as new variables, and I take a linear model in those variables. It's a linear model in these new variables, but it's a non-linear model in the original variables. And that's what you see here: x tilde is this stuff, just a new vector representation. Now I'm linear with respect to this new vector representation, but when you write x tilde out explicitly, it's some kind of non-linear function of the original variables. So this function here is non-linear in the original variables.
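A sketch of this recycling trick (my own toy example, not the lecture's demo): build the three monomial features and feed them to exactly the same regularized least squares solve as before.

```python
import numpy as np

def quadratic_features(X):
    """Map (x1, x2) -> (x1^2, x2^2, x1*x2): same points, new variables."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2, x1 * x2])

# Toy labels: sign(x1 * x2). No line through the origin gets this right,
# but it is exactly linear in the new variables (it is the third feature).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])

X_tilde = quadratic_features(X)            # n x 3 matrix replaces the n x 2 one
n, p = X_tilde.shape
lam = 1e-3
w = np.linalg.solve(X_tilde.T @ X_tilde + lam * n * np.eye(p), X_tilde.T @ y)  # same code as before
print("training accuracy:", np.mean(np.sign(X_tilde @ w) == y))
```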
It's probably harder to say than to see. Does it make sense? If you do this, you're completely recycling the beauty of linearity from a computational point of view, while augmenting the power of your model from linear to non-linear. It's still parametric, in the sense that -- what I mean by parametric is that we still fix a priori the number of degrees of freedom of our problem. It was two; now I make it three. More generally, I could make it p, but the number of numbers I have to find is fixed a priori; it doesn't depend on my data. But I can definitely go from linear to non-linear. So let's keep going. From the simple linear model we've already gone quite far, because we basically know that with the same computation we can now solve stuff like this. Let's take a couple of steps further. One is to appreciate that the code really is just the same: instead of X, I have to do a pre-processing step that replaces X with this new matrix X tilde, which, instead of being n by d, is now n by p, where p is the number of new variables that I invented.

Now it's useful to get a feeling for the complexity of this method, and this is a very quick complexity recap. Here, basically, the product of two numbers counts as one operation, and when you take products of vectors or matrices, you just count the real-number multiplications you do. So, a quick recap: if I multiply two vectors of size p, the cost is p. A matrix-vector product is going to be np. A matrix-matrix product is going to be n squared p: you have n vectors against another n vectors, each of size p, so each pairing costs you p, and you have to do n against n, so it's n squared p. The last one is less obvious just by looking at it, but roughly speaking, the inversion of a matrix costs n cubed in the worst case.
This is just to give you a feeling for what the complexities are. Does it make sense? It's a bit quick, but it's simple; if you know it, OK, otherwise just keep it on the side while you think about this. So what is the complexity of this? Well, you have to multiply this times this, and that is going to cost you nd or np. You have to build this matrix, and that is going to cost you d squared n or p squared n. And then you have to invert, and that is going to be p cubed -- or d cubed, because this matrix is d by d. So this is, roughly speaking, the cost. Now look at this. In this case p is the number of new variables, otherwise it's d. So I have p cubed, and then I have p squared n. But one question is: what if n is much smaller than p -- and that does happen? If n is, say, 10, do I really have to pay quadratically or even cubically in the number of dimensions to solve this problem? Because in some sense it looks like I'm overshooting a bit. I'm inverting a matrix, yes, but this matrix really has rank n: it has at most n linearly independent rows. It might be fewer, but at most it has n. So can I break the complexity of this? It's a linear system you have to solve; you just use the table I showed you before and check the computations. These are the computations you have to do. And one observation is that you pay a lot in the dimension -- the number of variables, or the number of features you invented. That might be OK when p is smaller than n. But it seems wrong intuitively when n is much smaller than p, because the complexity of the problem -- the rank of the problem -- is just n. The matrix here has n rows and d or p columns, depending on which representation you take, and so the rank of the whole thing is at most n. So now the red dot appears.
And what you can do is prove this one line. Let's see what it does, and then I'll tell you how you can prove it -- it's an exercise. You see here, if you invert this, then you have to multiply the inverse of this matrix by X transpose Y, which is what's written here. So I claim that this equality holds. Look at what it does: I take this X transpose and move it in front. But if I just do that, you clearly see that I'm messing around with dimensions, so what you do is also switch the order of the two matrices in the middle. Now, from a dimensionality point of view at least, I can see that this side and this side have matching dimensions. How do you prove this? You basically just need the SVD. You take the singular value decomposition of the matrix Xn, you plug it in, you compute, and you check that this side of the equality is the same as that side. There's nothing more than that, but we're going to skip it, so you just take it as a fact. It's a little trick. Why do I want to do this trick? Because look: now what I'm saying is that my w is going to be X transpose of something. What is this something? So w is going to be X transpose of this thing here. How big is this vector? First of all, how big is this matrix? Remember, how big was Xn?

AUDIENCE: N by d.

LORENZO ROSASCO: N by d, or p. How big is this?

AUDIENCE: N by n.

LORENZO ROSASCO: N by n. So how big is this vector? It's n by 1. So I found out that my w can always be written as X transpose c, where c is just an n-dimensional vector. I've rewritten it like this, if you want. So what is the cost of doing this? Well, this was the cost of doing the previous one.
But now, what is the cost of doing this thing here, above the bracket? Well, if that one was p cubed plus p squared n, this one will be how much? There, the matrix was p by p and the vector was p by 1, whereas here my matrix is n by n and the vector is n by 1. So basically these two numbers swap: instead of that complexity, you now have a complexity which is n cubed, and then n squared p, which sounds about right. It's linear in p -- you cannot avoid that, you have to look at the data at least once -- but then it's polynomial only in the smaller of the two quantities. In some sense, what you see is that, depending on the size of n, you still have to do this multiplication, but this multiplication is just nd or np. So let's recap what I'm telling you. This is the most mathematical fact I've put in, and I have a warning here. The first thing is that the question should be clear: can I break the complexity in the case where n is smaller than p or d? This is relevant because of the question that came up a second ago: should I always explode the dimension of my features? And here what you see is that, at least for now, even if you do, you don't pay more than linearly in that dimension. And the way you prove it is: A, you observe this fact -- which, again, I mentioned how to prove if you're curious, but it's one line. And B, you observe that once you have this, you can rewrite w as X transpose c. And to find c -- you've basically re-parametrized -- to find the new c is going to cost you only n cubed plus n squared p. So you do exactly what you wanted to do.
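A numerical sanity check of the identity and of the cost swap (the sizes are illustrative, and this is just a sketch of the trick, not the lecture's toolbox code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 500                      # many more features than points
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 0.1

# Primal: (X^T X + lam*n*I) w = X^T y  -> p x p solve, cost roughly p^3 + p^2 n
w_primal = np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ y)

# Dual: w = X^T (X X^T + lam*n*I)^{-1} y  -> n x n solve, cost roughly n^3 + n^2 p
c = np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)
w_dual = X.T @ c

print(np.allclose(w_primal, w_dual))   # True: same solution from a much smaller system
```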
Basically, what you see now is that whenever you do least squares, you can check the number of dimensions and the number of points, and always re-parametrize the problem in such a way that the complexity depends linearly on the bigger of the two and polynomially on the smaller of the two. So that's good news. Oh, I wrote it down -- so this is where we are right now. If you're lost now, you're going to become completely lost in one second, because here is what we want to do: we want to introduce kernels in the simplest possible way, which is the following. Look at what we found out. We discovered -- we actually proved -- a theorem. The theorem says that the w's output by the least squares algorithm are not arbitrary d-dimensional vectors; they are always vectors that I can write as a combination of the training set vectors. So xi has length d, or p, and I sum them up with these weights. The w's that come out of least squares are always of that form; they cannot be of any other form. This is called the representer theorem. It's the basic theorem of so-called kernel methods. It shows you that the solution you're looking for can be written as a linear superposition of these terms. If you now write down f of x -- f of x is going to be x transpose w, just the linear function -- by linearity you get this: w is written like this, you multiply by x transpose, this is a finite sum, so you can bring x transpose inside the sum. This is what you get. Are you OK? You have x transpose times a sum, which is the sum of x transpose multiplied by the rest.
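In symbols, a minimal rendering of what was just derived, with c the n-dimensional vector of coefficients from the re-parametrization:

```latex
w \;=\; X_n^\top c \;=\; \sum_{i=1}^{n} c_i\, x_i,
\qquad
f(x) \;=\; x^\top w \;=\; \sum_{i=1}^{n} c_i\, x^\top x_i,
\qquad
c \;=\; \big( X_n X_n^\top + \lambda n I \big)^{-1} Y_n
```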
Why do we care about this? Because the idea of kernel methods -- in this very basic form -- is: what if I replace this inner product, which is a way to measure similarity between my inputs, with another similarity? So instead of mapping each x into a very high-dimensional vector and then taking the inner product -- which is itself, if you want, another way of measuring similarity, or distances, between vectors -- what if I define it not by an explicit mapping, but by redefining the inner product? This k here is similar to the one we had on the very first slide. The idea is to change the inner product, and then reuse everything else. So we need to answer two questions. The first one is: if I give you a procedure that, whenever you would want to do x transpose x prime, does something else called k of x comma x prime, how do the computations change? This is going to be very easy. But also, what are you doing from a modeling perspective?

From the computational point of view it's very easy, because you see here that you always had to build a matrix whose entries were xi transpose xj. It was always a product of two vectors. And what you do now is the same: you build the matrix Kn, which is not just Xn Xn transpose, but a new matrix whose entries are just this. This is just a generalization: if I put in the linear kernel, I get back exactly what we had before; if you put in another kernel, you just get something else. So from the computational point of view, you're done for the computation of c; you have to do nothing else, you just replace this matrix with the general matrix. And now if you want to compute the solution -- well, w you cannot compute anymore, because you don't know what an x is by itself. But if you want to compute f of x, you can, because you just have to plug in. So you know how to compute the c.
And you know how to compute this quantity, because you just have to put the kernel there. So the magic here is that you never, ever see a point x in isolation; you always have a point x multiplied by another point. And this allows you to replace the vectors -- in some sense, this is an implicit remapping of the points, obtained just by redefining the inner product. So what you should see for now is that the computation you did to compute f of x in the linear case you can redo, if you replace the inner product with this new function. Because, A, you can compute c by just using this new matrix in place of that one. And B, you can compute f of x, because all you need is to replace this inner product with this one and put in the right weights, which you know how to compute.

From a modeling perspective, what you can check is that, for example, if you choose this polynomial kernel -- which is just x transpose x prime plus 1, raised to the power d -- and you take, for example, d equal to 2, this is equivalent to the mapping I showed you before, the one with explicit monomials as entries. This is just doing it implicitly. If you're low-dimensional -- if n is very big and the dimension is very small -- the first way might be better. But if the dimension is much bigger than n, this way is better. But you can also use stuff like this, like a Gaussian kernel. In that case, you cannot really write down the explicit map, because it turns out to be infinite-dimensional: the vectors you would need to write down for the explicit embedding version of this are infinite-dimensional. So if you use this, you get a truly non-parametric model. If you think about the effect of using these kernels, it's quite clear if you plug them in here: in one case you have a superposition of linear terms, a superposition of polynomial terms, or a superposition of Gaussians.
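A minimal sketch of kernel least squares with a Gaussian kernel (my own toy code, assuming the same lambda-n scaling as before; it is not the toolbox mentioned in the lecture, and the parameter values are illustrative):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_ridge_fit(X, y, lam, sigma):
    """Solve (K + lam*n*I) c = y; by the representer theorem, f(x) = sum_i c_i k(x, x_i)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_train, c, X_test, sigma):
    return gaussian_kernel(X_test, X_train, sigma) @ c   # real-valued f; take the sign to classify

# Toy usage: a non-linear decision boundary (points inside a disk vs. outside)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X**2).sum(axis=1) < 0.5, 1.0, -1.0)

c = kernel_ridge_fit(X, y, lam=1e-3, sigma=0.3)
y_hat = np.sign(kernel_ridge_predict(X, c, X, sigma=0.3))
print("training accuracy:", np.mean(y_hat == y))
```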
795 00:35:28,490 --> 00:35:30,470 So same dataset we train. 796 00:35:30,470 --> 00:35:33,070 I take kernel least squares-- 797 00:35:33,070 --> 00:35:34,650 which is what I just showed you-- 798 00:35:34,650 --> 00:35:37,900 compute the c inverting that matrix, 799 00:35:37,900 --> 00:35:40,090 use the Gaussian kernel-- the last of the examples-- 800 00:35:40,090 --> 00:35:41,170 and then compute f of x. 801 00:35:41,170 --> 00:35:42,544 And then we just want to plot it. 802 00:35:47,231 --> 00:35:48,230 So this is the solution. 803 00:35:51,770 --> 00:35:53,510 The algorithm depends on two parameters. 804 00:35:53,510 --> 00:35:54,093 What are they? 805 00:35:57,700 --> 00:35:58,620 AUDIENCE: Lambda. 806 00:35:58,620 --> 00:36:00,870 LORENZO ROSASCO: Lambda, the regularization parameter, 807 00:36:00,870 --> 00:36:03,700 the one that appeared already in the linear case-- 808 00:36:03,700 --> 00:36:04,240 and then-- 809 00:36:04,240 --> 00:36:06,760 AUDIENCE: Whatever parameter you've chosen [INAUDIBLE].. 810 00:36:06,760 --> 00:36:07,570 LORENZO ROSASCO: Exactly. 811 00:36:07,570 --> 00:36:09,220 Whatever parameter there is in your kernel. 812 00:36:09,220 --> 00:36:10,803 In this case, it's the Gaussian, so it 813 00:36:10,803 --> 00:36:12,870 will depend on this width. 814 00:36:17,370 --> 00:36:22,120 Now suppose that I take gamma big. 815 00:36:22,120 --> 00:36:24,120 I don't know what big is. 816 00:36:24,120 --> 00:36:27,530 I just do it by hand here, so we see what happens. 817 00:36:32,620 --> 00:36:36,120 If you take gamma-- sorry, not gamma, sigma-- big, 818 00:36:36,120 --> 00:36:39,450 you start to get something very simple. 819 00:36:39,450 --> 00:36:42,820 And if I make it a bit bigger, it 820 00:36:42,820 --> 00:36:47,550 will probably start to look very much like a linear solution. 821 00:36:54,530 --> 00:36:56,220 If I make it small-- 822 00:36:59,805 --> 00:37:01,560 and again, I don't know what small is, 823 00:37:01,560 --> 00:37:02,601 so I'm just going to try. 824 00:37:07,725 --> 00:37:08,715 It's very small. 825 00:37:15,660 --> 00:37:18,540 You start to see what's going on. 826 00:37:18,540 --> 00:37:20,040 And if you go in between, you really 827 00:37:20,040 --> 00:37:23,224 start to see that you can circle out individual examples. 828 00:37:23,224 --> 00:37:25,140 So let's think a second what we're doing here. 829 00:37:29,370 --> 00:37:33,830 It is going to be again another hand-waving explanation. 830 00:37:33,830 --> 00:37:37,190 Look at this equation. 831 00:37:37,190 --> 00:37:39,170 Let's read out what it says. 832 00:37:39,170 --> 00:37:42,710 In the case of Gaussians, it says, I take a Gaussian-- 833 00:37:42,710 --> 00:37:44,480 just a usual Gaussian-- 834 00:37:44,480 --> 00:37:47,840 I center it over a training set point, 835 00:37:47,840 --> 00:37:52,160 then by choosing the ci I'm choosing whether it is 836 00:37:52,160 --> 00:37:54,310 going to be a peak or a valley. 837 00:37:54,310 --> 00:37:57,680 It can go up, or it can go down in the two-dimensional case. 838 00:37:57,680 --> 00:38:00,520 And by choosing the width, I decide 839 00:38:00,520 --> 00:38:03,140 how large it's going to be. 840 00:38:03,140 --> 00:38:08,060 If I do f of x, then I sum up all this stuff, which basically 841 00:38:08,060 --> 00:38:11,810 means that I'm going to have these peaks and these valleys 842 00:38:11,810 --> 00:38:17,000 and I connect them in some way.
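The equation being read out here is not reproduced in the transcript; for the Gaussian kernel it presumably has the form below (in LaTeX), a weighted sum of Gaussian bumps centered at the training points, with the coefficients c computed as in the sketch above:

f(x) \;=\; \sum_{i=1}^{n} c_i \, k(x_i, x) \;=\; \sum_{i=1}^{n} c_i \exp\!\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right), \qquad c = (K_n + \lambda n I)^{-1} y .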
843 00:38:17,000 --> 00:38:18,910 Now you remember before that I pointed out 844 00:38:18,910 --> 00:38:20,780 in the two-dimensional case what 845 00:38:20,780 --> 00:38:24,920 we draw is not f of x, but f of x equal to zero. 846 00:38:24,920 --> 00:38:31,525 So what you should really think is that f of x in this case 847 00:38:31,525 --> 00:38:36,250 is no longer a hyperplane, but it's this surface. 848 00:38:36,250 --> 00:38:37,545 It goes up, and it goes down. 849 00:38:37,545 --> 00:38:38,920 And it goes up, and it goes down. 850 00:38:38,920 --> 00:38:42,290 So in the blue part, it goes up, and in the orange part, 851 00:38:42,290 --> 00:38:45,260 it goes down into a valley. 852 00:38:45,260 --> 00:38:48,110 So what you do is that right now you're 853 00:38:48,110 --> 00:38:50,270 taking all these small Gaussians, 854 00:38:50,270 --> 00:38:54,350 and you put them around blue and orange points, 855 00:38:54,350 --> 00:38:56,702 and then you connect their peaks. 856 00:38:56,702 --> 00:38:58,160 And by making them small, you allow 857 00:38:58,160 --> 00:39:00,050 them to create a very complicated surface. 858 00:39:03,310 --> 00:39:05,290 So what did we put before? 859 00:39:11,240 --> 00:39:11,950 So they're small. 860 00:39:11,950 --> 00:39:14,210 They're getting smaller, and smaller, and smaller. 861 00:39:14,210 --> 00:39:16,640 And they go out, and you see the-- 862 00:39:16,640 --> 00:39:18,620 there is a point here, so they circle it out 863 00:39:18,620 --> 00:39:21,530 here by putting basically a Gaussian right there 864 00:39:21,530 --> 00:39:24,920 for that individual point. 865 00:39:24,920 --> 00:39:26,870 Imagine what happens if my points-- 866 00:39:26,870 --> 00:39:29,240 I have two points here and two points here-- 867 00:39:29,240 --> 00:39:33,020 and now I put a huge Gaussian around each point. 868 00:39:33,020 --> 00:39:36,314 Basically, the peaks are almost going to touch each other. 869 00:39:36,314 --> 00:39:37,730 So what you can imagine is that you 870 00:39:37,730 --> 00:39:40,670 get something, where basically the decision boundary has 871 00:39:40,670 --> 00:39:42,170 to look like a line, because you get 872 00:39:42,170 --> 00:39:43,470 something which is so smooth. 873 00:39:43,470 --> 00:39:45,095 It doesn't go up and down all the time. 874 00:39:45,095 --> 00:39:47,539 It's going to be-- 875 00:39:47,539 --> 00:39:49,080 And that's what we saw before, right? 876 00:39:49,080 --> 00:39:51,110 And again, I don't remember what I put here. 877 00:39:54,718 --> 00:39:56,490 So this is starting to look good. 878 00:39:56,490 --> 00:40:00,720 So you really see that somewhat something nice happens. 879 00:40:00,720 --> 00:40:03,720 Maybe if I put-- five is what we put before maybe. 880 00:40:09,010 --> 00:40:10,890 So what you're basically doing 881 00:40:10,890 --> 00:40:13,560 is that you're computing the center of mass 882 00:40:13,560 --> 00:40:16,794 of one class in the sense of the Gaussians. 883 00:40:16,794 --> 00:40:18,210 So you're doing a Gaussian mixture 884 00:40:18,210 --> 00:40:19,850 on one side, a Gaussian mixture on the other side, 885 00:40:19,850 --> 00:40:21,780 you're basically computing the center of masses, 886 00:40:21,780 --> 00:40:22,890 and then you just find the line that 887 00:40:22,890 --> 00:40:24,230 separates the center of masses. 888 00:40:24,230 --> 00:40:26,021 That's what you're doing here, and you just 889 00:40:26,021 --> 00:40:27,810 find this one big line here.
890 00:40:30,600 --> 00:40:35,250 So again, so we're not playing around 891 00:40:35,250 --> 00:40:38,040 with the number of points. 892 00:40:38,040 --> 00:40:39,776 We're not playing around with lambda. 893 00:40:39,776 --> 00:40:42,150 Because this is basically what we already saw before. 894 00:40:42,150 --> 00:40:44,691 All I want to show you right now is the effect of the kernel. 895 00:40:44,691 --> 00:40:51,610 And here I'm using the Gaussian kernel, but-- let's see-- 896 00:40:51,610 --> 00:40:54,177 but you can also use the linear kernel. 897 00:40:54,177 --> 00:40:55,260 This is the linear kernel. 898 00:40:55,260 --> 00:40:56,885 This is using the linear least squares. 899 00:40:56,885 --> 00:40:58,520 If you now use the Gaussian kernel, 900 00:40:58,520 --> 00:41:00,270 you give yourself the extra possibility. 901 00:41:00,270 --> 00:41:01,650 Essentially, what you see is that if you 902 00:41:01,650 --> 00:41:03,691 put a Gaussian which is very big, in some sense 903 00:41:03,691 --> 00:41:05,810 you get back the linear kernel. 904 00:41:05,810 --> 00:41:07,810 But if you put a Gaussian which is very small, 905 00:41:07,810 --> 00:41:12,050 you allow yourself this extra complexity. 906 00:41:12,050 --> 00:41:14,910 And so that's what we gain with this little trick 907 00:41:14,910 --> 00:41:20,500 that we did of replacing the inner product 908 00:41:20,500 --> 00:41:24,400 with this new kernel. 909 00:41:24,400 --> 00:41:26,380 We went from the simple linear estimators 910 00:41:26,380 --> 00:41:28,045 to something, which is-- 911 00:41:28,045 --> 00:41:29,650 It's the same thing-- if you want-- 912 00:41:29,650 --> 00:41:31,300 that we did by building explicitly 913 00:41:31,300 --> 00:41:34,180 these monomials of higher power, but here you're 914 00:41:34,180 --> 00:41:35,879 doing it implicitly. 915 00:41:35,879 --> 00:41:37,420 And it turns out that it's actually-- 916 00:41:37,420 --> 00:41:40,510 there is no explicit version that you can-- 917 00:41:40,510 --> 00:41:43,690 You can do it mathematically, but the feature representation, 918 00:41:43,690 --> 00:41:45,910 the variable representation of this kernel 919 00:41:45,910 --> 00:41:48,260 would be an infinitely long vector. 920 00:41:48,260 --> 00:41:49,900 The space of functions that is built 921 00:41:49,900 --> 00:41:53,130 as a combination of Gaussians is not finite-dimensional. 922 00:41:53,130 --> 00:41:56,040 For polynomials, you can check that the space of functions 923 00:41:56,040 --> 00:41:59,680 basically has dimension polynomial in d. 924 00:41:59,680 --> 00:42:02,030 If I ask you how big is the function 925 00:42:02,030 --> 00:42:04,780 space that you can build using this-- well, this is easy. 926 00:42:04,780 --> 00:42:06,910 It's just d-dimensional. 927 00:42:06,910 --> 00:42:09,220 With this, well, this is a bit more complicated, 928 00:42:09,220 --> 00:42:11,250 but you can compute it. 929 00:42:11,250 --> 00:42:16,270 For this, it's not easy to compute, because it's infinite. 930 00:42:16,270 --> 00:42:19,690 So it in some sense is a non-parametric model. 931 00:42:19,690 --> 00:42:21,280 What does it mean? 932 00:42:21,280 --> 00:42:22,990 Of course, you still have a finite number 933 00:42:22,990 --> 00:42:24,380 of parameters in practice. 934 00:42:24,380 --> 00:42:26,020 And that's the good news. 935 00:42:26,020 --> 00:42:28,440 But there is no fixed number of parameters a priori. 936 00:42:28,440 --> 00:42:31,129 If I give you a hundred points, you get a hundred parameters.
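A hypothetical way to reproduce this comparison with the sketch functions defined earlier (the toy data, the lambda value, and the two widths are made up for illustration): the same points fit with a linear kernel, a very wide Gaussian, and a very narrow Gaussian.

import numpy as np

# assumes linear_kernel, gaussian_kernel, kernel_least_squares_fit,
# and kernel_least_squares_predict from the earlier sketch are in scope
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # toy 2D inputs
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))    # noisy, roughly linear labels

kernels = {
    "linear": linear_kernel,
    "gaussian, big sigma": lambda a, b: gaussian_kernel(a, b, sigma=10.0),
    "gaussian, small sigma": lambda a, b: gaussian_kernel(a, b, sigma=0.1),
}
for name, k in kernels.items():
    c = kernel_least_squares_fit(X, y, k, lam=1e-3)
    preds = np.sign([kernel_least_squares_predict(X, c, k, x) for x in X])
    print(name, (preds == y).mean())                 # training fit only: the narrow Gaussian can fit almost anything

The wide Gaussian behaves much like the linear kernel, while the narrow one buys the extra complexity discussed above, for better or worse.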
937 00:42:31,129 --> 00:42:33,670 If I give you 2 million points, you get 2 million parameters. 938 00:42:33,670 --> 00:42:36,400 If I give you 5 million points, you get 5 million parameters. 939 00:42:36,400 --> 00:42:40,300 But you never hit a boundary of complexity, 940 00:42:40,300 --> 00:42:44,800 because these in some sense have an infinite-dimensional 941 00:42:44,800 --> 00:42:45,640 parameter space. 942 00:42:48,940 --> 00:42:52,210 So of course, I see that here 943 00:42:52,210 --> 00:42:55,040 some of the parts that I'm explaining are complicated, 944 00:42:55,040 --> 00:42:57,370 especially if this is the first time you see them. 945 00:42:57,370 --> 00:43:00,370 But the take-home message should be essentially 946 00:43:00,370 --> 00:43:02,020 from least squares, I can understand 947 00:43:02,020 --> 00:43:03,936 what's going on from a numerical point of view 948 00:43:03,936 --> 00:43:06,700 and bridge numerics and statistics. 949 00:43:06,700 --> 00:43:08,680 Then by just simple linear algebra, 950 00:43:08,680 --> 00:43:10,090 I can understand the complexity-- 951 00:43:10,090 --> 00:43:12,250 how I can get complexity-- which is 952 00:43:12,250 --> 00:43:14,110 linear in the number of dimensions 953 00:43:14,110 --> 00:43:16,240 or the number of points. 954 00:43:16,240 --> 00:43:19,840 And then by following up, I can do a little magic 955 00:43:19,840 --> 00:43:23,480 and go from the linear model to something non-linear. 956 00:43:23,480 --> 00:43:26,200 The deep reasons why this is possible are complicated. 957 00:43:26,200 --> 00:43:29,099 But as a take-home message, A, the computations 958 00:43:29,099 --> 00:43:29,890 you can check easily. 959 00:43:29,890 --> 00:43:31,620 They remain the same. 960 00:43:31,620 --> 00:43:33,796 B, you can check that what you're doing is now 961 00:43:33,796 --> 00:43:35,920 allowing yourself to take a more complicated model, 962 00:43:35,920 --> 00:43:38,800 a combination of the kernel functions. 963 00:43:38,800 --> 00:43:46,480 And then even just by playing with these simple demos, 964 00:43:46,480 --> 00:43:48,370 you can understand a bit what is the effect. 965 00:43:48,370 --> 00:43:50,300 And that's what you intuitively would expect. 966 00:43:50,300 --> 00:43:52,600 So I hope that it would get you close enough 967 00:43:52,600 --> 00:43:55,900 to have some awareness, when you use this. 968 00:43:55,900 --> 00:43:57,850 And of course, you can put-- 969 00:43:57,850 --> 00:44:01,510 when you abstract from the specificity of this algorithm, 970 00:44:01,510 --> 00:44:03,990 you build an algorithm with one or two parameters-- 971 00:44:03,990 --> 00:44:05,380 lambda and sigma. 972 00:44:05,380 --> 00:44:07,540 And so as soon as you ask me how you choose those, 973 00:44:07,540 --> 00:44:09,685 well, we go back to the first part of the lecture-- 974 00:44:09,685 --> 00:44:12,100 bias-variance tradeoffs, cross-validation, 975 00:44:12,100 --> 00:44:13,810 and so on and so forth. 976 00:44:13,810 --> 00:44:16,750 So you just have to put them together. 977 00:44:16,750 --> 00:44:19,100 There is a lot of stuff I've not talked about. 978 00:44:19,100 --> 00:44:21,130 And it's a step away from what we discussed, 979 00:44:21,130 --> 00:44:23,902 so you've just seen the take-home message part, 980 00:44:23,902 --> 00:44:25,360 but we could talk about reproducing 981 00:44:25,360 --> 00:44:27,670 kernel Hilbert spaces, the functional analysis 982 00:44:27,670 --> 00:44:29,560 behind everything I said.
983 00:44:29,560 --> 00:44:31,990 We can talk about Gaussian processes, which is basically 984 00:44:31,990 --> 00:44:34,780 the probabilistic version of what I just showed you now. 985 00:44:34,780 --> 00:44:37,450 Then we can also see the connection with a bunch of math 986 00:44:37,450 --> 00:44:39,746 like integral equations and PDEs. 987 00:44:39,746 --> 00:44:41,620 There is a whole connection with the sampling 988 00:44:41,620 --> 00:44:44,240 theory a la Shannon, inverse problems and so on. 989 00:44:44,240 --> 00:44:47,230 And there is a bunch of extensions, 990 00:44:47,230 --> 00:44:48,705 which are almost for free. 991 00:44:48,705 --> 00:44:50,234 You change the loss function. 992 00:44:50,234 --> 00:44:52,150 You can make it the logistic loss, and you get kernel 993 00:44:52,150 --> 00:44:53,110 logistic regression. 994 00:44:53,110 --> 00:44:57,520 You can take SVM, and you get kernel SVM. 995 00:44:57,520 --> 00:45:00,430 Then you can also take more complicated output spaces. 996 00:45:00,430 --> 00:45:03,220 And you can do multiclass, multivariate regression. 997 00:45:03,220 --> 00:45:04,210 You can do regression. 998 00:45:04,210 --> 00:45:06,220 You can do multilabel, and you can 999 00:45:06,220 --> 00:45:07,640 do a bunch of different things. 1000 00:45:07,640 --> 00:45:10,210 And these are really a step away. 1001 00:45:10,210 --> 00:45:12,387 These are minor modifications of the code. 1002 00:45:12,387 --> 00:45:13,720 And you can do a bunch of stuff. 1003 00:45:13,720 --> 00:45:16,120 So the good thing of this is that with really, really, 1004 00:45:16,120 --> 00:45:17,770 really minor effort, you can actually 1005 00:45:17,770 --> 00:45:19,090 solve a bunch of problems. 1006 00:45:19,090 --> 00:45:21,700 I'm not saying that it's going to be the best algorithm ever, 1007 00:45:21,700 --> 00:45:26,290 but definitely it gets you quite far. 1008 00:45:26,290 --> 00:45:28,240 So again we spent quite a bit of time thinking 1009 00:45:28,240 --> 00:45:30,950 about bias-variance and what it means and used 1010 00:45:30,950 --> 00:45:33,250 least squares and just basically warming up 1011 00:45:33,250 --> 00:45:34,570 a bit with this setting. 1012 00:45:34,570 --> 00:45:39,022 And then in the last hour or so, we discussed least squares, 1013 00:45:39,022 --> 00:45:41,480 because it allows you to just think in terms of linear algebra, 1014 00:45:41,480 --> 00:45:43,120 which is something that-- 1015 00:45:43,120 --> 00:45:45,130 one way or another-- you've seen in your life. 1016 00:45:45,130 --> 00:45:48,400 And then from there, you can go from linear to non-linear. 1017 00:45:48,400 --> 00:45:51,100 And that's a bit of magic, but a couple of parts-- 1018 00:45:51,100 --> 00:45:54,070 which are how you use it both numerically 1019 00:45:54,070 --> 00:45:56,300 and just from a practical perspective 1020 00:45:56,300 --> 00:45:59,260 to go from complex models to simple models and vice versa-- 1021 00:45:59,260 --> 00:46:03,579 should be-- is the part that I hope you keep in your mind. 1022 00:46:03,579 --> 00:46:05,870 For now, our concern has just been to make predictions. 1023 00:46:05,870 --> 00:46:07,370 If you hear classification, you want 1024 00:46:07,370 --> 00:46:08,494 to have good classification. 1025 00:46:08,494 --> 00:46:10,870 If you hear regression, you want to do good regression. 1026 00:46:10,870 --> 00:46:14,440 But you didn't talk about-- we didn't talk about understanding 1027 00:46:14,440 --> 00:46:17,200 how you do good regression.
1028 00:46:17,200 --> 00:46:21,944 So a typical example is the example in biology. 1029 00:46:21,944 --> 00:46:23,110 This is, perhaps, a bit old. 1030 00:46:23,110 --> 00:46:24,350 This is micro-arrays. 1031 00:46:24,350 --> 00:46:31,860 But the idea is the dataset you have is a bunch of patients. 1032 00:46:31,860 --> 00:46:33,870 For each patient, you have measurements, 1033 00:46:33,870 --> 00:46:36,480 and the measurements correspond to some gene expression 1034 00:46:36,480 --> 00:46:38,265 level or some other biological process. 1035 00:46:42,390 --> 00:46:45,840 The patients are divided in two groups, say, disease type 1036 00:46:45,840 --> 00:46:48,630 A and disease type B. And based on the good prediction 1037 00:46:48,630 --> 00:46:50,880 of whether a patient is disease type A or B, 1038 00:46:50,880 --> 00:46:55,089 you can change the way you cure it or you address the disease. 1039 00:46:55,089 --> 00:46:57,130 So of course, you want to have a good prediction. 1040 00:46:57,130 --> 00:46:59,190 You want to be able-- when a new patient arrives-- 1041 00:46:59,190 --> 00:47:04,440 to say whether this is type A or type B. 1042 00:47:04,440 --> 00:47:06,810 But oftentimes, what you want to do 1043 00:47:06,810 --> 00:47:09,900 is that you want to use this not as the final tool, 1044 00:47:09,900 --> 00:47:13,540 because unless deep learning can solve this, 1045 00:47:13,540 --> 00:47:18,450 you might go back and study a bit more the biological process 1046 00:47:18,450 --> 00:47:19,900 to understand a bit more. 1047 00:47:19,900 --> 00:47:23,590 So you use this more as a statistical tool, 1048 00:47:23,590 --> 00:47:28,840 like a measurement, like the way you can use a microscope 1049 00:47:28,840 --> 00:47:31,335 or something to look into your data and get information. 1050 00:47:31,335 --> 00:47:32,710 And in that sense sometimes, it's 1051 00:47:32,710 --> 00:47:34,335 interesting to-- instead of just saying 1052 00:47:34,335 --> 00:47:37,720 is this patient going to be more likely to be disease type 1053 00:47:37,720 --> 00:47:40,300 A or B, it's to go in and say, ah, 1054 00:47:40,300 --> 00:47:42,340 but when you make the prediction, what 1055 00:47:42,340 --> 00:47:44,500 are the processes that matter for this prediction? 1056 00:47:44,500 --> 00:47:49,032 Is it gene number 33 or 34, so that I can go in and say, 1057 00:47:49,032 --> 00:47:51,490 oh, these genes make sense, because they're in fact related 1058 00:47:51,490 --> 00:47:54,210 to these other processes, which are known to be 1059 00:47:54,210 --> 00:47:56,354 involved in this disease. 1060 00:47:56,354 --> 00:47:58,270 And doing that, you use this just as a little tool, 1061 00:47:58,270 --> 00:48:00,219 then you use other ones to get a picture. 1062 00:48:00,219 --> 00:48:01,510 And then you put them together. 1063 00:48:01,510 --> 00:48:04,330 And then it's mostly on the doctor, or the clinician, 1064 00:48:04,330 --> 00:48:08,320 or the biostatistician to try to develop better understanding. 1065 00:48:08,320 --> 00:48:10,640 But you do use these as tools to understand and look 1066 00:48:10,640 --> 00:48:12,460 into the data. 1067 00:48:12,460 --> 00:48:14,650 And in that perspective, the word interpretability 1068 00:48:14,650 --> 00:48:15,440 plays a big role.
1069 00:48:15,440 --> 00:48:17,590 And here by interpretability I mean 1070 00:48:17,590 --> 00:48:19,160 I not only want to make predictions, 1071 00:48:19,160 --> 00:48:22,300 but I want to know how I make predictions and to come 1072 00:48:22,300 --> 00:48:25,060 afterwards with an explanation of how 1073 00:48:25,060 --> 00:48:29,410 I picked the information that was contained in my data. 1074 00:48:29,410 --> 00:48:35,080 So far it's hard to see how to do it with the tools we had. 1075 00:48:35,080 --> 00:48:41,940 So this is basically the field of variable selection. 1076 00:48:41,940 --> 00:48:44,917 And in this basic form, the setting 1077 00:48:44,917 --> 00:48:46,500 where we do understand what's going on 1078 00:48:46,500 --> 00:48:49,210 is the setting of linear models. 1079 00:48:49,210 --> 00:48:52,530 So in this setting basically, I just 1080 00:48:52,530 --> 00:48:54,040 rewrite what we've seen before. 1081 00:48:54,040 --> 00:48:57,510 You have x is a vector, and you can think of it, for example, 1082 00:48:57,510 --> 00:48:59,000 as a patient. 1083 00:48:59,000 --> 00:49:01,290 And the xj are measurements that you have 1084 00:49:01,290 --> 00:49:04,140 done describing this patient. 1085 00:49:04,140 --> 00:49:06,060 When you do a linear model, you basically 1086 00:49:06,060 --> 00:49:10,630 have that by putting a weight on each variable, 1087 00:49:10,630 --> 00:49:13,680 you're putting a weight on each measurement. 1088 00:49:13,680 --> 00:49:15,840 If a measurement doesn't matter, you 1089 00:49:15,840 --> 00:49:17,310 think you might put a zero here. 1090 00:49:17,310 --> 00:49:19,650 And it will disappear from the sum. 1091 00:49:19,650 --> 00:49:21,420 If the measurement matters a lot, 1092 00:49:21,420 --> 00:49:23,610 then here you might get a big weight. 1093 00:49:23,610 --> 00:49:27,840 So one way to try to get a feeling of which measurements 1094 00:49:27,840 --> 00:49:29,640 are important and which are not is to try 1095 00:49:29,640 --> 00:49:33,980 to estimate a model, a linear model, where you get the w, 1096 00:49:33,980 --> 00:49:37,650 but ideally we would like to get a w, which has many zeros. 1097 00:49:37,650 --> 00:49:40,189 You don't want to fumble with what's small and what's not. 1098 00:49:40,189 --> 00:49:42,480 So if you do least squares the way I showed you before, 1099 00:49:42,480 --> 00:49:44,047 you would get a w. 1100 00:49:44,047 --> 00:49:44,880 Then you would get-- 1101 00:49:44,880 --> 00:49:47,820 most of them, you can check, will not be zero. 1102 00:49:47,820 --> 00:49:50,254 In fact, none of them will be zero in general. 1103 00:49:50,254 --> 00:49:52,670 And so now you have to decide what's small and what's big, 1104 00:49:52,670 --> 00:49:53,794 and that might not be easy. 1105 00:49:56,740 --> 00:49:57,975 Oops, what happened here? 1106 00:50:05,590 --> 00:50:10,797 So funny enough, this is the name I found-- 1107 00:50:10,797 --> 00:50:12,380 I don't remember the name of the book. 1108 00:50:12,380 --> 00:50:14,250 It's the name that was used to describe 1109 00:50:14,250 --> 00:50:16,759 the process of variable selection, which 1110 00:50:16,759 --> 00:50:18,300 is a much harder problem, because you 1111 00:50:18,300 --> 00:50:19,330 don't just want to make predictions. 1112 00:50:19,330 --> 00:50:20,705 You want to go back and check 1113 00:50:20,705 --> 00:50:22,640 how you make the prediction.
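A minimal numerical check of this last point, on made-up synthetic data (the sizes, the noise level, and the regularization value are all assumptions): even when only a few variables truly matter, the regularized least squares weight vector generically has no entries that are exactly zero, so there is nothing you can read off directly.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]                  # only the first 3 measurements actually matter
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 1e-2
w_ls = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)
print(np.sum(w_ls == 0.0))                     # typically 0: no exact zeros to read off
print(np.round(w_ls, 3))                       # the rest are small but nonzero: what counts as "small"?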
1114 00:50:22,640 --> 00:50:28,230 And so it's very easy to start to get overfitting and start 1115 00:50:28,230 --> 00:50:30,690 to try to squeeze the data until you get some information. 1116 00:50:30,690 --> 00:50:32,231 So it's good to have a procedure that 1117 00:50:32,231 --> 00:50:34,950 will give you a somewhat clean way to extract 1118 00:50:34,950 --> 00:50:36,060 the important variables. 1119 00:50:36,060 --> 00:50:37,980 Again, you can think of this as a-- basically, 1120 00:50:37,980 --> 00:50:40,500 I want to build an f, but I also want 1121 00:50:40,500 --> 00:50:43,890 to come up with a list or even better weights that tell me 1122 00:50:43,890 --> 00:50:45,600 which variables are important. 1123 00:50:45,600 --> 00:50:47,150 And often this will be just a list, 1124 00:50:47,150 --> 00:50:49,560 which is much smaller than d, so that I can go back 1125 00:50:49,560 --> 00:50:52,740 and say, oh, measurement 33, 34, and 50-- what are they? 1126 00:50:52,740 --> 00:50:55,494 I could go in and look at it. 1127 00:50:55,494 --> 00:50:57,660 Notice that there is also a computational reason why 1128 00:50:57,660 --> 00:50:58,743 this would be interesting. 1129 00:50:58,743 --> 00:51:01,530 Because of course, if d here is 50,000-- 1130 00:51:01,530 --> 00:51:03,600 and what I see is that, in fact, I 1131 00:51:03,600 --> 00:51:07,710 can throw away most of these measurements and just keep 10-- 1132 00:51:07,710 --> 00:51:09,300 then it means that I can hopefully 1133 00:51:09,300 --> 00:51:10,980 reduce the complexity of my computation, 1134 00:51:10,980 --> 00:51:12,905 but also the storage of the data, for example. 1135 00:51:12,905 --> 00:51:14,780 If I have to send you the dataset after I've 1136 00:51:14,780 --> 00:51:16,410 done this thing, I just have to send you 1137 00:51:16,410 --> 00:51:17,520 this teeny tiny matrix. 1138 00:51:20,160 --> 00:51:22,210 So interpretability is one reason, 1139 00:51:22,210 --> 00:51:27,690 but the computational aspect could be another one. 1140 00:51:30,250 --> 00:51:33,510 Another reason that I don't want to talk too much about is also-- 1141 00:51:33,510 --> 00:51:36,270 remember that we had this idea, where 1142 00:51:36,270 --> 00:51:39,600 we said we could augment the complexity of a model 1143 00:51:39,600 --> 00:51:43,770 by inventing features, and we said 1144 00:51:43,770 --> 00:51:47,100 do I always have to pay the price of making it big? 1145 00:51:47,100 --> 00:51:49,339 Well, I basically-- if you want-- 1146 00:51:49,339 --> 00:51:50,130 I was pointing at-- 1147 00:51:50,130 --> 00:51:52,566 I said, no, not always, because I was thinking of kernels. 1148 00:51:52,566 --> 00:51:54,690 These, if you want, give you another way potentially 1149 00:51:54,690 --> 00:51:57,150 to go around it, in which what you do is that, first of all, 1150 00:51:57,150 --> 00:51:59,252 you explode the number of features. 1151 00:51:59,252 --> 00:52:00,960 You take many, many, many, many, and then 1152 00:52:00,960 --> 00:52:02,940 you use this as a preliminary step 1153 00:52:02,940 --> 00:52:07,050 to shrink them down to a more reasonable number. 1154 00:52:07,050 --> 00:52:08,700 Because it's quite likely that among 1155 00:52:08,700 --> 00:52:10,980 these many, many measurements, some of them 1156 00:52:10,980 --> 00:52:13,170 would just be very correlated, or uninteresting, 1157 00:52:13,170 --> 00:52:15,240 or so on and so forth.
1158 00:52:15,240 --> 00:52:18,480 So this dimensionality reduction or 1159 00:52:18,480 --> 00:52:21,030 computational or interpretable model perspective 1160 00:52:21,030 --> 00:52:25,860 is what stands behind the desire to do something like this. 1161 00:52:25,860 --> 00:52:28,450 So let's say one more thing and then we'll stop. 1162 00:52:31,570 --> 00:52:35,330 So suppose that you have infinite computational power. 1163 00:52:35,330 --> 00:52:39,450 So the computations are not your concern, 1164 00:52:39,450 --> 00:52:41,959 and you want to solve this problem. 1165 00:52:41,959 --> 00:52:42,750 How will you do it? 1166 00:52:46,358 --> 00:52:50,280 Suppose that you have the code for least squares. 1167 00:52:50,280 --> 00:52:52,450 And you can run it as many times as you want. 1168 00:52:52,450 --> 00:52:55,352 How would you go and try to estimate which 1169 00:52:55,352 --> 00:52:56,560 variables are more important? 1170 00:52:56,560 --> 00:52:58,864 AUDIENCE: [INAUDIBLE] possibility of computations. 1171 00:52:58,864 --> 00:53:00,530 LORENZO ROSASCO: That's one possibility. 1172 00:53:00,530 --> 00:53:02,760 What you do is that you have-- 1173 00:53:02,760 --> 00:53:05,270 you start and look at all single variables. 1174 00:53:05,270 --> 00:53:09,120 And you solve least squares for all single variables. 1175 00:53:09,120 --> 00:53:12,260 Then you take all couples of variables. 1176 00:53:12,260 --> 00:53:14,510 Then you get all triplets of variables. 1177 00:53:14,510 --> 00:53:17,797 And then you find which one is best. 1178 00:53:17,797 --> 00:53:19,380 From a statistical point of view there 1179 00:53:19,380 --> 00:53:21,530 is absolutely nothing wrong with this, 1180 00:53:21,530 --> 00:53:23,330 because you're trying everything. 1181 00:53:23,330 --> 00:53:26,510 And at some point, you find what's the best. 1182 00:53:26,510 --> 00:53:28,910 The problem is that it's combinatorial. 1183 00:53:28,910 --> 00:53:33,680 And you see that when you're in dimension more than-- 1184 00:53:33,680 --> 00:53:36,950 a very few, it's huge. 1185 00:53:36,950 --> 00:53:39,740 So it's exponential. 1186 00:53:39,740 --> 00:53:43,670 So it turns out that doing what you just 1187 00:53:43,670 --> 00:53:46,600 told me to do, which is what I asked you to tell me to do, 1188 00:53:46,600 --> 00:53:48,320 which is this brute force approach, 1189 00:53:48,320 --> 00:53:52,400 is equivalent to doing something like this, which is again 1190 00:53:52,400 --> 00:53:54,470 a regularization approach. 1191 00:53:54,470 --> 00:53:57,400 Here I put what is called the zero norm. 1192 00:53:57,400 --> 00:53:59,730 The zero norm is actually not a norm. 1193 00:53:59,730 --> 00:54:01,685 And it is just a functional. 1194 00:54:01,685 --> 00:54:04,040 It's a thing that does the following. 1195 00:54:04,040 --> 00:54:06,320 If I give you a vector, you have to return 1196 00:54:06,320 --> 00:54:10,250 the number of components different from zero, only that. 1197 00:54:10,250 --> 00:54:11,930 So you go inside and look at each entry, 1198 00:54:11,930 --> 00:54:14,630 and you tell if they are different from zero. 1199 00:54:14,630 --> 00:54:17,800 This is absolutely not convex. 1200 00:54:17,800 --> 00:54:21,270 And so this is the reason why this problem is equivalent-- 1201 00:54:21,270 --> 00:54:24,530 it becomes computationally not feasible. 1202 00:54:24,530 --> 00:54:26,480 So perhaps, we can stop here.
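For completeness, a brute-force sketch of the exhaustive search just described, which, under the usual identification, is the same as minimizing the least squares error plus a penalty on the zero norm of w; the data layout and the penalty value lam0 are assumptions for illustration. The point is the cost: it visits every subset of the d variables, so it is exponential in d.

import itertools
import numpy as np

def best_subset_l0(X, y, lam0=0.1):
    # Exhaustive search for  min_w (1/n) * ||X w - y||^2 + lam0 * ||w||_0  over all supports.
    n, d = X.shape
    best_score, best_vars = np.mean(y ** 2), ()               # start from the empty model, w = 0
    for k in range(1, d + 1):
        for subset in itertools.combinations(range(d), k):    # all subsets of size k
            Xs = X[:, list(subset)]
            w, *_ = np.linalg.lstsq(Xs, y, rcond=None)        # least squares restricted to this subset
            score = np.mean((Xs @ w - y) ** 2) + lam0 * k
            if score < best_score:
                best_score, best_vars = score, subset
    return best_vars                                          # 2^d subsets in total: only feasible for tiny d

The greedy methods and convex relaxations mentioned next are the two standard ways to approximate this search at a tractable cost.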
1203 00:54:26,480 --> 00:54:29,810 And what I want to show you next is essentially-- 1204 00:54:29,810 --> 00:54:31,270 if you have this-- 1205 00:54:31,270 --> 00:54:33,890 and you know that in some sense, this is what you would like 1206 00:54:33,890 --> 00:54:35,810 to do, if you could do it computationally, 1207 00:54:35,810 --> 00:54:37,100 but you cannot-- 1208 00:54:37,100 --> 00:54:41,330 so how can you find approximate versions of this 1209 00:54:41,330 --> 00:54:42,770 that you can compute in practice? 1210 00:54:42,770 --> 00:54:44,769 And we're going to discuss two ways of doing it. 1211 00:54:44,769 --> 00:54:48,820 One is greedy methods and one is convex relaxations.