The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at OCW.mit.edu.

LORENZO ROSASCO: I'm Lorenzo Rosasco. This is going to be a couple of hours plus of basic machine learning. And I want to emphasize the word "basic," because I really tried to stick to the essentials, or the things I would consider essential if you just want to start. Suppose you have zero knowledge of machine learning and you want to start from zero. So if you have already had classes in machine learning, you might find this a little bit boring, or at least a rehearsal of things you already know.

The idea of looking at machine learning these days comes from at least two different perspectives. The first one is for those of you, probably most, who are interested in developing intelligent systems in a very broad sense. What has happened in the last few years is a kind of data-driven revolution, where systems that are trained rather than programmed have started to be the key engines for solving tasks. Here there are some pictures that are probably outdated: robotics, Siri on our phones, self-driving cars. In all these systems, one key engine is providing data to the system to essentially try to learn how to solve the task. So one idea of this class is to try to see what it means to learn.

And the moment you start to use data to solve complex tasks, there is a natural connection with what today is called data science, which is somewhat a rapidly [INAUDIBLE] renovated version of what we used to call just statistics. So basically, we start to have tons of data of all kinds.
Data are very easy to collect, and we are starving for knowledge, trying to extract information from them. And as it turns out, many of the techniques that are used to develop intelligent systems are the very same techniques that you can use to extract relevant information and patterns from your data. So what we want to do today is to see a bit of what is in the middle: what is the set of techniques that allows you to go from data to knowledge, or to acquiring the ability to solve tasks?

Machine learning is huge these days, and there are tons of possible applications. There has been theory developed in the last 20 or 30 years that brought the field to a certain level of maturity from a mathematical point of view, and there have been tons and tons of algorithms developed. So in three hours there is no way I could give you even just a little view of what machine learning is these days. So what I did is pretty much this. I don't know if you've ever done this, but you used to make a mixtape, and you would try to pick the songs that you would bring with yourself to a desert island. That's kind of the way I thought about what to put in this [INAUDIBLE] set of slides that we're going to show in a minute. So basically I thought: what are those three, four, five learning algorithms that you should know if you know nothing about machine learning? And this is more or less one answer, at least one part. Of course there are a few songs that stayed out of the compilation, but this is one selection.

So, as such, we're going to start, as I said, simple. The idea is that this morning you're going to see a few algorithms, and I picked algorithms that are relatively simple from a computational point of view. So the math level is going to be pretty basic.
I think I'm going to use some linear algebra at some point, and maybe some calculus, but that's about it. Most of the idea here is to emphasize the conceptual ideas, the concepts. And then this afternoon there are going to be labs, basically, where you sit down, pick these kinds of algorithms, and use them, so you immediately see what they mean. So at the end of the day you should have a reasonable knowledge of whatever you see this morning.

So this is how the class is structured. It's divided into parts, plus the lab. In the first part, what we want to do is start from probably the simplest learning algorithm you can think of, and use it as an excuse to introduce the idea of the bias-variance trade-off, which, to me, is probably the, or at least one of the, most fundamental concepts in statistics and machine learning. You're going to see it in more detail in a few minutes, but it's essentially the idea that you never have enough data. The game here is not about describing the data that you have today, as much as using the data you have today as a basis of knowledge to describe the data you're going to get tomorrow. So there is an inherent trade-off between what you have at your disposal and what you would like to predict. And then it turns out that you have to decide how much you want to trust the data, and how much you want to somewhat throw away, or regularize, as they say (smooth out the information in your data), because you think that some of it is actually an accident: the data you saw today have aspects that are not really reflective of the phenomenon that produced them, just because you saw 10 points rather than 100. The basic idea here is essentially the law of large numbers.
When you toss a coin, you might find that if you toss it just 10 times, it looks like it's not a fair coin, but if you go to 100, or 1,000, you start to see that it converges to 50-50. So that's kind of what's going on here. The idea is that you want to use some kind of induction principle that tells you how much you can trust the data.

Moving on from this basic class of algorithms, we're going to consider so-called regularization techniques. I use "regularization" in a very broad sense. And here we're going to concentrate on least squares, essentially because, A, it's simple and it just reduces to linear algebra, so you don't have to know anything about convex optimization or any other kind of fancy optimization technique; and B, because it's relatively simple to move from linear models to non-parametric, non-linear models using kernels. Kernels are a big field with a lot of math, but we're just going to look at the recipe to move from simple models to complicated models.

Finally, in the last part, we're going to move a bit away from pure prediction. These first two parts are about prediction, or what is called supervised learning. In the last part we're going to move away from prediction and ask questions more like: you have data, and you want to know what the important factors in your data are. So the one keyword here is interpretability. You want to have some form of interpretability of the data at hand. You would like to know not only how you can make good predictions, but what the important factors are. So you not only want to make good predictions, but you want to know how you make good predictions, what the important information is to actually get good predictions. In this last part we're going to take a peek into this. And as I said, the afternoon is basically going to be a practical session.
It's all in MATLAB; I think there is a quick part so that, if you have never seen MATLAB before, you can play around with it a little bit. But it's very easy, and then you've got a few different proposals, I think, of things you can do. You can pick, depending on what you already know and what you want to try, and start from there and be more or less fancy. So, it goes without saying: stop me. The more we interact, the better it is.

So in the first part, as I said, the idea is to use so-called local methods as an excuse to understand the bias-variance trade-off by experience. We're going to introduce the simplest algorithm you can think of, and we're going to use it to understand a much deeper concept.

So first of all, let's just put down our setup. The idea is that we are-- so, how many of you have had a machine learning class before? All right. So you won't be too bored. The idea is that we want to do supervised learning. In supervised learning there is an input and an output, and these inputs and outputs are somewhat related; I'll be more precise in a minute. The idea is that you want to learn this input-output relationship, and all you have at your disposal is a set of inputs and outputs. So x here is an input, y is the output, and f is a functional relation between the input and the output. All you have in this puzzle are these couples: I give you an input, and then what's the corresponding output? I give you another input, and I know the corresponding output. But I don't give you all of them; you just have n of them. n is the number of points, and you call this a training set, because it will be the basis of knowledge on which you can try to train a machine to estimate this functional relationship. And the key point here is that, on the one hand, you want to describe these data.
So you want to get a functional relationship that works well on these data, so that f(x1) is close to y1, f(x2) is close to y2, and so on. But more importantly, you want an f that, given a new point that was not in the training set, will give you an output which is a good estimate of the true output corresponding to that input. This is the most important thing about the setup: the idea of so-called generalization, or, if you want, prediction. You want to really do inference. You don't want to do descriptive statistics; you really want to do inferential statistics.

So here is a very, very simple example, just to start to have something in mind. Suppose that you have-- well, it's just a toy version of the face recognition system we have on our phones. You know that when you take a picture, you start--

AUDIENCE: Sorry.

LORENZO ROSASCO: They really weren't talking. You have something like this: a little square appearing around a face sometimes. It means that basically the system is going inside the image and recognizing faces. The real thing is a bit more complicated than this, but a toy version of the algorithm is: you have an image like this, and you think of the image as a matrix of numbers. Now this one is color, but imagine it's black and white; then each entry would just contain a number, which is the pixel value, the light intensity of that pixel. And you just have this array. And then, if you want, you can brutalize it and just unroll the matrix into one long vector. That gives you one vector. So p here would be, what? The number of? Just the number of pixels. So I take this image and unroll it, I take another image and unroll it, and I do the same for all the images.
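Just to make that unrolling step concrete, here is a minimal sketch in Python with NumPy (the labs use MATLAB; the array names and image size below are made up purely for illustration):

```python
import numpy as np

# Suppose we have n grayscale images, each 200 x 200 pixels, stored in an
# array of shape (n, 200, 200). These names and sizes are just placeholders.
n, height, width = 40, 200, 200
images = np.random.rand(n, height, width)   # stand-in for real images

p = height * width                          # p = number of pixels
X = images.reshape(n, p)                    # each row is one unrolled image

print(X.shape)                              # (40, 40000): n rows, p columns
```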
And you see, some images here do contain faces, and some of the images do not contain faces; here I use color to code them. And now what I have is that the images are my inputs, the x's. Here, full disclosure: I never use the little arrow above letters to denote vectors, so hopefully it will be clear from the context; when it's really useful I use upper or lower indices. Anyway, this is the data matrix. Rows are inputs, and columns are the so-called features or variables, the entries of each vector. So I have n rows and p columns. Associated to this, I have my output vector. And what is the output vector? Well, in this case it's just a simple binary vector. The idea is: if there is a face, I put 1; if there is not a face, I put minus 1.

So this is the way I turn an abstract question, recognizing faces in images, into a data structure that in a minute we're going to elaborate on, to try to actually answer the question whether there is a face in an image or not. This first step is kind of obvious in this case, but it's actually a tricky step. It's the part that I'm not going to give you any hints about; it's kind of an art. You have data, and at the very beginning you have to turn them into some kind of manageable data structure. Then you can elaborate in multiple ways, but the very first step is deciding how. For example, here we decided to unroll all these numbers into vectors. Does this sound like a good idea or a bad idea? One thing you notice is that this pixel here and this pixel here are probably related; there is some structure in the image. And when you take this pixel, say pixel 136, and you unroll it, it ends up over here, so they're not close anymore. Now, it turns out that, if you think about it, you'll see in a minute:
for those of you who remember, if you just take the Euclidean distance, you take products of numbers and you sum them up; that's invariant to the position of the individual pixels, so that's OK. But still, there is this intuition that maybe here I'm losing too much geometric information about the content of the image. And indeed, while this kind of works in practice, if you want to get better results you have to do the fancy stuff that Andrei was talking about today, looking locally, looking at collections of pixels, trying to keep more geometric information. I'm not going to talk about that kind of stuff; to date this is a lot of engineering, plus some good ways to learn it. We're going to try to just stick to simple representations. So how you build the representation is not going to be part of what I'm going to talk about.

So imagine that either you stick to this super-simple representation, or some friends of yours come in and put a box here in the middle, where you feed in this array of numbers and extract another vector, much fancier than this one, that contains some better representation of the image. At the end of the day, my job starts when you give me a vector representation that I can trust, where I can basically say that if two vectors seem similar, they should have the same label. That's the basic idea.

All right. So here is a little game. Imagine that these are just the two-pixel version of the images I showed you before. You have some boxes, some circles, and then I give you this one triangle. It's very original; Andrei showed you this yesterday. And the question is: what's the color of that? Unless you haven't slept a minute, you're going to say it's orange. But the question is, why do you think it's orange?
AUDIENCE: [INAUDIBLE]

LORENZO ROSASCO: Say it again?

AUDIENCE: It's surrounded by oranges.

LORENZO ROSASCO: It's surrounded by oranges. OK. She said it's close to the oranges. So it turns out that this is actually the simplest algorithm you can think of: you check who is close to you, and if it's orange, you say orange; if it's blue, you say blue.

But we already made an assumption here, hidden in the question, which is about nearby things. We are basically saying that our vector representation is such that I do have a distance, and if two things are close, then they might have the same semantic content. Which might be true or not. For example, if you take the images I showed you before, we cannot just draw them, right? We cannot just take 200 by 200 vectors, look at them, and do a visual inspection. You have to believe that this distance will be fine, and so the discussion that we just had about what is a good representation is going to kick in.

But the assumption you make, and in this case it's visually very easy because it's low dimensional, is that nearby things have similar labels. One thing that I forgot to tell you in the previous slides, but it's key, is exactly this observation: in machine learning we typically move away from situations like this one, where you can do visual inspection and you have low dimensionality, to a situation like the one I showed you a minute before, where you have images. If you have to think of each of these circles as an image, you won't be able to draw it, because it's typically going to be a vector with several hundreds, or tens of thousands, of dimensions. So the game is kind of different. Can we still do this kind of stuff? Can we just say that close things should have the same semantic content?
That's another question we're going to try to answer. But I just want to do a bit of inception here: this is a big deal, going from low dimensions to very high dimensions.

All right. But let's stick for a minute to the idea that nearby things should have the same label, and just write down the algorithm; it's one line. It's the kind of case where it's harder to write it down than to code it up or just explain what it is; it's super simple. What you do is: you have data points xi, where the xi are the input data in the training set, and x-bar is what I called x-new before, a new point. What you do is that you search; this just says, look for the index of the closest point. That's what you did before. So here, i-prime is the index of the point xi closest to x-bar. Once you find it, go into your dataset and find the label of that point, and then assign that label to the new point. Does that make sense? Everybody's happy? Not super complicated. Fair enough.

How does it work? So let me see if I can do this. This is extremely fancy code. Let's see. All right. So what did I do? Let me do it a bit smaller. This is just a simple two-dimensional dataset; I take 40 points. The dataset looks like this; it's the one on the left. What I do is take 40 points, and to make it a bit more complex, I flip some of the labels. This is called the two moons dataset, or something like that. And what I did is that for some of the points in this sea, I changed the color; I changed the label. So I made the problem a bit harder. And here is what, fortunately, you don't have in practice.
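As a quick aside, the one-line nearest-neighbor rule just written down takes only a few lines of code. This is an illustrative sketch in Python with NumPy, not the MATLAB code used in the demo; the names Xtr, Ytr, and Xnew are assumptions, with training inputs as rows and labels in {-1, +1}:

```python
import numpy as np

def nearest_neighbor(Xtr, Ytr, Xnew):
    """Assign to each new point the label of its closest training point."""
    labels = np.empty(len(Xnew))
    for j, x_bar in enumerate(Xnew):
        # squared Euclidean distance from x_bar to every training point
        dists = np.sum((Xtr - x_bar) ** 2, axis=1)
        i_prime = np.argmin(dists)      # index of the closest training point
        labels[j] = Ytr[i_prime]        # copy its label to the new point
    return labels
```

That is the entire rule; now back to the demo.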
488 00:19:18,830 --> 00:19:19,650 Here we're cheating. 489 00:19:19,650 --> 00:19:20,800 We're doing just the simulations. 490 00:19:20,800 --> 00:19:22,140 We're looking at the future. 491 00:19:22,140 --> 00:19:25,276 We assume that because we can generate this data, 492 00:19:25,276 --> 00:19:27,150 we can look at the future and check how we're 493 00:19:27,150 --> 00:19:28,274 going to do in future data. 494 00:19:28,274 --> 00:19:29,990 So you can think of this as a future data 495 00:19:29,990 --> 00:19:31,240 that typically you don't have. 496 00:19:31,240 --> 00:19:33,420 So here you're a normal human being. 497 00:19:33,420 --> 00:19:35,550 Here you're playing god and looking at the future. 498 00:19:35,550 --> 00:19:36,050 OK. 499 00:19:36,050 --> 00:19:38,224 Because we just want to do a little simulation. 500 00:19:41,190 --> 00:19:51,780 So based on that, we can just go here and put 1, train, 501 00:19:51,780 --> 00:19:53,110 and then test and plot. 502 00:19:56,200 --> 00:20:00,830 So what you see here is the so-called decision boundary. 503 00:20:00,830 --> 00:20:01,530 OK. 504 00:20:01,530 --> 00:20:05,730 What I did is exactly that one line of code you saw before. 505 00:20:05,730 --> 00:20:06,360 OK. 506 00:20:06,360 --> 00:20:07,980 And what I did is, in this case I can draw it, 507 00:20:07,980 --> 00:20:09,188 because it's low dimensional. 508 00:20:09,188 --> 00:20:12,000 And basically what I do is that I just put in the regions 509 00:20:12,000 --> 00:20:14,730 where I think I should put orange, 510 00:20:14,730 --> 00:20:18,281 and the region where it think I should put blue. 511 00:20:18,281 --> 00:20:18,780 OK. 512 00:20:18,780 --> 00:20:21,840 And here you can kind of see what's going on. 513 00:20:21,840 --> 00:20:26,370 These are actually very good on the data, right? 514 00:20:26,370 --> 00:20:29,510 How many mistakes do you make on the new dataset? 515 00:20:29,510 --> 00:20:32,790 Sorry, on the training set? 516 00:20:32,790 --> 00:20:33,330 Zero. 517 00:20:33,330 --> 00:20:34,170 It's perfect. 518 00:20:34,170 --> 00:20:35,250 OK. 519 00:20:35,250 --> 00:20:36,520 Is that a good idea? 520 00:20:36,520 --> 00:20:39,651 Well, when you look at it here, it doesn't look that good. 521 00:20:39,651 --> 00:20:40,150 OK. 522 00:20:40,150 --> 00:20:42,820 There is this whole region of points, 523 00:20:42,820 --> 00:20:49,470 for example, that are going to be predicted to be orange, 524 00:20:49,470 --> 00:20:51,026 but they're actually blue. 525 00:20:51,026 --> 00:20:53,400 Of course if you want to have zero errors in the training 526 00:20:53,400 --> 00:20:55,289 set, there's nothing else you can do, right? 527 00:20:55,289 --> 00:20:57,330 Because you see, you have this orange point here. 528 00:20:57,330 --> 00:20:58,913 You have these two orange points here. 529 00:20:58,913 --> 00:21:01,200 And you want to go and follow them. 530 00:21:01,200 --> 00:21:03,665 So there's nothing you can do. 531 00:21:03,665 --> 00:21:05,040 So this is the first observation. 532 00:21:05,040 --> 00:21:08,470 The second observation is, the curve, 533 00:21:08,470 --> 00:21:11,490 if you look close enough, it's piecewise linear. 534 00:21:11,490 --> 00:21:19,200 It's like a sequence of linear pieces stuck together. 535 00:21:19,200 --> 00:21:21,810 If we just try to do a little game 536 00:21:21,810 --> 00:21:23,670 and generate some new data-- 537 00:21:23,670 --> 00:21:25,810 OK, so imagine again, I'm playing god now. 
I generate a new dataset of the same kind, the way it should look. So take another peek at this. Oop. So now I generate the points, I plot them, I train. And now let's test. OK. If you remember the decision curve you've seen before, what do you notice here?

AUDIENCE: They're different.

LORENZO ROSASCO: They're very different. For example, the one before, if you remember, went all the way down here to follow those couple of points. But here you don't have those couple of points. So now, is that a good thing or a bad thing? Well, the point is that, because you have so few points, the moment you start to just fit the data, this will happen: you have something that changes all the time. It's very unstable. That's a key word. You have something where you change the data just a little bit, and it changes completely. That sounds like a bad idea. If I want to make a prediction, and I keep getting slightly different data and I change my mind completely, that's probably not a good way to make a prediction about anything. And this is happening all the time here. And it's exactly because our algorithm is, in some sense, greedy: you just try to get perfect performance on the training set without worrying much about the future.

Let's do this just once more. OK. And we keep on going; it's going to change all the time, all the time. Of course-- I don't know how much I can push this, because it's not super-duper fast, but let's try. Let's say 18 by 30. So what I did now is just that I augmented the number of points in my training set. It was 20 or 30, I don't remember; now I make it 100. So now you should see-- OK.
So this is one solution. We want to play the same game: generate other datasets of the same kind. So maybe now it might be that I took them all; I don't remember how many there are. No, I didn't take them all. So, what do you see now? We are doing exactly the same thing. And this is something that you absolutely cannot do in practice, because you cannot just generate datasets. But here, what you see is that I just augmented the number of training set points, and now the solution does change, but not as much. And you can kind of start to see that there is something going on a bit like this here. OK, this one actually looks pretty bad; let's try to do it once more.

OK. So again, it does change a lot, but not as much as before. And you roughly see that this guy says that here it should be orange and here it should be blue. So that's kind of what you expect: the more points you get, the better your solution gets. And if I put here all the possible points, what you would start to see is that the closest point to any point here would be a blue point, so it would be perfect.

So if I ask you whether this is a good algorithm or not, what would you say?

AUDIENCE: It's overfitting the data.

LORENZO ROSASCO: It's kind of overfitting the data. But it is not always overfitting the data. If the data are good, it's a good idea to fit them. But in some sense, this algorithm doesn't have a way to prevent itself from falling in love with the data when there are very few. And if you have very few data points, you just wiggle around, become extremely unstable, change your mind all the time. If the data are enough, it stabilizes, and in some sense, in this setting, fitting the data, or, as she's saying, overfitting the data, is actually not a bad thing.
OK. So this is what's going on here.

AUDIENCE: What do you mean by overfitting?

LORENZO ROSASCO: Fitting a bit too much. So if you look here: you're always fitting the data, but you're doing nothing else. If you have many data points, fitting the data is just fine. If you have few data points, by fitting them you, in some sense, overfit, in the sense that when you look at new data points, you have done a bit too much. What you saw before is that you get something that is very good because it perfectly fits the training data, but it's overfitting with respect to the future. Whereas here, the fit on the left-hand side reflects, not too badly, the fit on the right-hand side.

So the ideas of overfitting and stability that came out in this discussion are key. If you want, everything we're going to do in the next three hours is about understanding how you can prevent overfitting and build a good way to stabilize your algorithms.

OK. So let's go back here. This is going to be quick, because if I ask you, what is this, what would you say?

AUDIENCE: [INAUDIBLE]

[LAUGHING]

LORENZO ROSASCO: So the idea is that, when you have a situation like this, you're still pretty much able to say what the right answer is. And what you're going to do is move away from just asking what the closest point is, and look at a few more points; you don't look at just one. You look at, how many? Boh? "Boh" is a very useful Italian word; it means, I don't know.

So this algorithm, which is called the k-nearest neighbor algorithm, is probably the second simplest algorithm you can think of.
It's kind of the same as before. The notation here is a bit boring, but it's basically saying: take the new point, check its distance to every training point, sort the distances, and take the first k. If it's a classification problem, it's probably a good idea to take an odd number for k, so that you can then just have voting. Basically everybody votes, each vote counts one, somebody says blue, somebody says orange, and you make a decision. Fair enough.

Well, how does this work? You can kind of imagine. So what we have to do-- for example, here we have this guy. Now let's just put k-- well, let's make this a bit smaller, so we do 40. Generate, plot, train, [INAUDIBLE] test, plot. OK. Well, we got a bit lucky; this is actually a good dataset, because in some sense there are none of what you might call outliers. There are no orange points that really go and sit in the blue. So I just want to show you a bit of the dramatic effect of this, so I'm going to try to redo this one so that we get more-- yeah, this should do.

OK. So this is nearest neighbor; this is the solution you get. It's not too horrible, but, for example, you see that it starts following this guy. Now, what you can do is just go in and say, four. Well, four's a bad idea. Five. You retrain, same thing. And all of a sudden it just ignores this guy. Because the moment you put more points in, you realize that he's surrounded by blue guys, so his vote just counts one against four. And you can keep on going. And the idea here is that the more you make this big, the more your solution is going to be, what?
Well, you'd say it's going to be good, but it's actually not true. Because if you start to make k too big, at some point all you're doing is counting how many points you have in class one, counting how many points you have in class two, and always saying the same thing. So I'm going to put here, say, 20. What you start to see is that you obtain a decision boundary which is simpler, and simpler, and simpler. It looks kind of linear here.

What you will see is that, suppose now I regenerate the data. And you remember how much it changed before, when I was using nearest neighbor with just k equal to 1. Of course, it's probabilistic, so at some point I'm going to get a dataset like the one I showed you minutes ago, and I'll move past it as fast as possible. Because if I pick 10, one is going to look like that and nine are going to look like this. And when they look like this, you see, they kind of start to have this kind of line, a decision boundary with some twists, but it's very simple. And at some point, if I put k big enough, that is, the number of all the points, it won't change any more; it will just essentially divide the set into two equal parts.

So does that make sense? Now, would it make sense to weight the votes, so that, essentially, if a point is closer, its vote counts more than if a point is farther away? Yes, absolutely. Here we're doing the simplest thing in the world, the second simplest thing in the world, the third simplest thing in the world, and that is exactly what this next one does. And you can see that you can go pretty far with this. I mean, it's simple, but these are actually algorithms that are used sometimes. And what you do is that, if you just look at this-- again, this I don't want to explain too much.
If you've seen it before, it's simple; otherwise it doesn't really matter. But the basic idea here is that each vote is going to be between 0 and 1. You see here that I put the distance between the new point and each of the other points on top of an exponential, so the number I get is not 1, but something between 0 and 1. If the two points are close, and in the limit where they are the same, the exponent becomes 0 and the vote counts exactly one. If they're very far away, this would be, say, infinity, and the weight would be close to 0. So the closer you are, the more you count.

If you want, you can read it like this. You're sitting on a new point, and you put a zooming window, a zooming window of a certain size, and you basically say that everything inside this window is close; and the farther away you go-- so the window is like this, and you deform the space so that, basically, things that are far away are going to count less. And if I move sigma here, I'm somewhat making my visual field, if you want, larger or smaller around this one new point. This is just a physical interpretation of what this is doing; there are 15 other ways of looking at what the Gaussian is doing. Voting, changing the weight of the vote, is another one.

Why the Gaussian here? Well, because. Just because. You can use many, many others; you can use, for example, a hat window. And this is part of your prior knowledge, how much you want to weight. If you are in this kind of low-dimensional situation, you might have good ways to just look inside the data and decide, almost by visual inspection. Otherwise you have to trust some broader principles. And it's again back to the problem of learning the representation and deciding how to measure distances, which are two sides of the same story.
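The weighting he is describing is presumably the usual Gaussian one, something like w = exp(-||xbar - xi||^2 / (2 sigma^2)), so that the closest neighbors get weights near 1 and far-away ones get weights near 0. Here is a minimal sketch of k-nearest neighbors with this kind of weighted voting, in Python with NumPy (illustrative only; the function and parameter names are assumptions, and plain majority voting is the special case where every weight is 1):

```python
import numpy as np

def knn_predict(Xtr, Ytr, Xnew, k=5, sigma=None):
    """k-nearest-neighbor classification with optional Gaussian-weighted votes.

    Xtr: (n, p) training inputs; Ytr: (n,) labels in {-1, +1}; Xnew: (m, p) new inputs.
    If sigma is None, every neighbor's vote counts one (plain majority vote);
    in that case an odd k avoids ties.
    """
    preds = np.empty(len(Xnew))
    for j, x_bar in enumerate(Xnew):
        dists = np.sum((Xtr - x_bar) ** 2, axis=1)            # squared distances to all points
        nn = np.argsort(dists)[:k]                            # indices of the k closest points
        if sigma is None:
            weights = np.ones(k)                              # each vote counts one
        else:
            weights = np.exp(-dists[nn] / (2 * sigma ** 2))   # closer points count more
        preds[j] = np.sign(np.sum(weights * Ytr[nn]))         # weighted vote between the classes
    return preds
```

With a large sigma the weights are all close to 1 and this behaves like the plain majority vote; with a small sigma only the very closest neighbors really matter, which is the "zooming window" picture he describes.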
848 00:33:29,490 --> 00:33:33,510 And the other thing you see is that, if you 849 00:33:33,510 --> 00:33:35,370 start to do these games, you might actually 850 00:33:35,370 --> 00:33:37,020 add more parameters. 851 00:33:37,020 --> 00:33:37,680 OK. 852 00:33:37,680 --> 00:33:39,372 Because we start from nearest neighbor, 853 00:33:39,372 --> 00:33:40,830 which is completely parameter-free, 854 00:33:40,830 --> 00:33:42,690 but it was very unstable. 855 00:33:42,690 --> 00:33:43,840 We added k. 856 00:33:43,840 --> 00:33:45,930 We allow ourselves to go from simple to complex, 857 00:33:45,930 --> 00:33:48,000 from stability to overfitting. 858 00:33:48,000 --> 00:33:50,244 But we introduced a new parameter. 859 00:33:50,244 --> 00:33:51,910 And so that's not an algorithm any more. 860 00:33:51,910 --> 00:33:52,860 It's a half algorithm. 861 00:33:52,860 --> 00:33:55,059 A true algorithm is a parameter-free algorithm 862 00:33:55,059 --> 00:33:56,850 where I tell you how you choose everything. 863 00:33:56,850 --> 00:33:57,500 OK. 864 00:33:57,500 --> 00:33:59,166 So if they just give you something, say, 865 00:33:59,166 --> 00:34:01,940 yeah, there's k, well, how do you choose it? 866 00:34:01,940 --> 00:34:03,240 OK. 867 00:34:03,240 --> 00:34:05,230 It's not something you can use. 868 00:34:05,230 --> 00:34:06,690 And here I'm adding sigma. 869 00:34:06,690 --> 00:34:08,621 And again, you have to decide how you use it. 870 00:34:08,621 --> 00:34:09,120 OK. 871 00:34:09,120 --> 00:34:11,790 And so that's what we want to ask in a minute. 872 00:34:11,790 --> 00:34:17,580 So before doing that, just a side remark is-- 873 00:34:17,580 --> 00:34:19,480 we've been looking at vector data. 874 00:34:19,480 --> 00:34:19,980 OK. 875 00:34:19,980 --> 00:34:21,646 And we were basically measuring distance 876 00:34:21,646 --> 00:34:24,360 through just the Euclidean norm, OK, just the usual one, 877 00:34:24,360 --> 00:34:26,790 or this version like the Gaussian kernel 878 00:34:26,790 --> 00:34:30,690 that somewhat amplifies distances. 879 00:34:30,690 --> 00:34:34,630 What if you have strings, for example, or graphs? 880 00:34:34,630 --> 00:34:35,130 OK. 881 00:34:35,130 --> 00:34:36,570 Your data turns out to be strings 882 00:34:36,570 --> 00:34:37,778 and you want to compare them? 883 00:34:40,380 --> 00:34:42,210 Say even if they're binary strings, 884 00:34:42,210 --> 00:34:43,530 there's no linear structure. 885 00:34:43,530 --> 00:34:45,654 You cannot just sum them up. The Euclidean distance 886 00:34:45,654 --> 00:34:48,580 doesn't really make a lot of sense. 887 00:34:48,580 --> 00:34:50,310 But what you can do is that as long 888 00:34:50,310 --> 00:34:52,632 as you can define a distance-- and say this one 889 00:34:52,632 --> 00:34:54,840 would be the simplest one, just the Hamming distance. 890 00:34:54,840 --> 00:34:56,850 You just check entries, and if they're different, 891 00:34:56,850 --> 00:34:57,450 you count one. 892 00:34:57,450 --> 00:34:59,340 If they're the same, you count zero. 893 00:34:59,340 --> 00:35:00,600 OK. 894 00:35:00,600 --> 00:35:03,030 The moment you can define a distance on your data, 895 00:35:03,030 --> 00:35:06,760 then you can use this kind of technique.
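As a small illustration of nearest neighbors on non-vector data, here is a sketch using binary strings and the Hamming distance (the number of positions at which two equal-length strings differ). The toy strings and labels are made up for the example.

def hamming(s, t):
    # Hamming distance: count the positions where the two equal-length strings differ.
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

def nn_predict_string(train, x_new):
    # Nearest neighbor on strings: return the label of the closest training string.
    closest_string, closest_label = min(train, key=lambda pair: hamming(pair[0], x_new))
    return closest_label

train = [("0110", +1), ("0111", +1), ("1001", -1), ("1000", -1)]   # (string, label) pairs
print(hamming("0110", "1001"))           # 4: the two strings differ in every position
print(nn_predict_string(train, "0100"))  # closest training string is "0110", so the prediction is +1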
896 00:35:06,760 --> 00:35:10,150 So this technique is pretty flexible in that sense, 897 00:35:10,150 --> 00:35:12,296 that whenever you can give-- you don't need 898 00:35:12,296 --> 00:35:14,470 a vectoral representation, you just 899 00:35:14,470 --> 00:35:16,300 need a way to measure, say, similarity 900 00:35:16,300 --> 00:35:19,060 or distances between things, and then you can use this method. 901 00:35:19,060 --> 00:35:19,690 OK. 902 00:35:19,690 --> 00:35:21,398 So here I just mentioned this, and that's 903 00:35:21,398 --> 00:35:24,520 what most of these classes are going to be, about vector data. 904 00:35:24,520 --> 00:35:29,200 But this is one point where, the moment you have k-- 905 00:35:29,200 --> 00:35:31,450 you can think of this case sometimes as a similarity. 906 00:35:31,450 --> 00:35:31,950 OK. 907 00:35:31,950 --> 00:35:34,370 Similarity is kind of concept that is dual to distances. 908 00:35:34,370 --> 00:35:36,700 So if the similarity is big, it's good. 909 00:35:36,700 --> 00:35:37,980 The distance small is good. 910 00:35:37,980 --> 00:35:38,710 OK. 911 00:35:38,710 --> 00:35:42,040 And so here, if you have a way to build the k or a distance, 912 00:35:42,040 --> 00:35:43,730 then you're good to go. 913 00:35:43,730 --> 00:35:46,260 And we're not going to really talk about it, 914 00:35:46,260 --> 00:35:48,232 but there's a whole industry about how 915 00:35:48,232 --> 00:35:49,440 you build this kind of stuff. 916 00:35:49,440 --> 00:35:51,120 So we give restraints. 917 00:35:51,120 --> 00:35:53,830 Maybe I want to say that I should not only 918 00:35:53,830 --> 00:35:57,400 look at the entry of a string, but also the nearby entry when 919 00:35:57,400 --> 00:35:58,860 I make the score for that specific. 920 00:35:58,860 --> 00:36:01,960 So maybe I shifted a value of the string a little bit. 921 00:36:01,960 --> 00:36:02,820 It's not right here. 922 00:36:02,820 --> 00:36:05,560 It's in the next position over, so that should come to bits. 923 00:36:05,560 --> 00:36:08,150 So I want to do a soft version of this. 924 00:36:08,150 --> 00:36:08,800 OK. 925 00:36:08,800 --> 00:36:11,560 Or maybe I have graphs, and I want to compare graphs. 926 00:36:11,560 --> 00:36:14,080 And I want to say that if two graphs are close, then 927 00:36:14,080 --> 00:36:15,710 I want them to have the same label. 928 00:36:15,710 --> 00:36:16,210 OK. 929 00:36:16,210 --> 00:36:18,950 How do you do that? 930 00:36:18,950 --> 00:36:22,470 The next big question is-- 931 00:36:22,470 --> 00:36:24,219 we introduced three parameters. 932 00:36:24,219 --> 00:36:26,010 They look really nice, because they kind of 933 00:36:26,010 --> 00:36:29,430 allowed us to get more flexible solutions to the problem 934 00:36:29,430 --> 00:36:32,880 by choosing, for example, k or the sigma in the Gaussian. 935 00:36:32,880 --> 00:36:35,736 We can go from overfitting to stability. 936 00:36:35,736 --> 00:36:37,860 But then of course we have to choose the parameter, 937 00:36:37,860 --> 00:36:40,592 and we have to find good ways to choose them. 938 00:36:40,592 --> 00:36:42,175 And so there are a bunch of questions. 939 00:36:42,175 --> 00:36:46,020 So the first one is, well, is there an optimal value at all? 940 00:36:46,020 --> 00:36:46,950 OK. 941 00:36:46,950 --> 00:36:48,580 Does it exist? 942 00:36:48,580 --> 00:36:51,580 But if it does exist, I can go try to estimate it in some way. 943 00:36:51,580 --> 00:36:54,230 If it doesn't, well it does not even make sense. 
944 00:36:54,230 --> 00:36:55,610 I just throw a random number. 945 00:36:55,610 --> 00:36:56,741 I just say, k equals 4. 946 00:36:56,741 --> 00:36:57,240 Why? 947 00:36:57,240 --> 00:36:58,310 Just because. 948 00:36:58,310 --> 00:36:58,856 OK. 949 00:36:58,856 --> 00:36:59,730 So what do you think? 950 00:36:59,730 --> 00:37:01,980 It exists or not? 951 00:37:01,980 --> 00:37:04,130 What does it depend on? 952 00:37:04,130 --> 00:37:05,510 Because that's the next question. 953 00:37:05,510 --> 00:37:06,560 What does it depend on? 954 00:37:06,560 --> 00:37:07,730 Can we compute it? 955 00:37:07,730 --> 00:37:08,510 OK. 956 00:37:08,510 --> 00:37:10,759 So let's try to guess one minute before we go 957 00:37:10,759 --> 00:37:11,800 and check how we do this. 958 00:37:11,800 --> 00:37:14,640 OK. 959 00:37:14,640 --> 00:37:15,937 OK. 960 00:37:15,937 --> 00:37:16,770 I have to choose it. 961 00:37:16,770 --> 00:37:17,561 How do I choose it? 962 00:37:17,561 --> 00:37:19,332 What does it depend on? 963 00:37:19,332 --> 00:37:20,401 AUDIENCE: Size of this. 964 00:37:20,401 --> 00:37:22,650 LORENZO ROSASCO: One thing is the size of the dataset. 965 00:37:22,650 --> 00:37:26,400 Because what we saw is that a small k seems a good idea when 966 00:37:26,400 --> 00:37:29,370 you have a lot of data, but it seems like a bad idea 967 00:37:29,370 --> 00:37:31,670 when you have few. 968 00:37:31,670 --> 00:37:32,407 OK. 969 00:37:32,407 --> 00:37:33,240 So it should depend. 970 00:37:33,240 --> 00:37:34,656 It should be something that scales 971 00:37:34,656 --> 00:37:37,830 with n, the number of points, and probably also the training 972 00:37:37,830 --> 00:37:38,687 set itself. 973 00:37:38,687 --> 00:37:40,770 But we want something that works for all datasets, 974 00:37:40,770 --> 00:37:42,300 say, in expectation. 975 00:37:42,300 --> 00:37:44,190 So cardinality of the training set 976 00:37:44,190 --> 00:37:45,520 is going to be a main factor. 977 00:37:45,520 --> 00:37:46,840 What else? 978 00:37:46,840 --> 00:37:49,270 AUDIENCE: The smoothness of the boundary. 979 00:37:49,270 --> 00:37:49,470 LORENZO ROSASCO: The what? 980 00:37:49,470 --> 00:37:50,440 AUDIENCE: The smoothness. 981 00:37:50,440 --> 00:37:51,790 LORENZO ROSASCO: This smoothness of the boundary. 982 00:37:51,790 --> 00:37:52,290 Yeah. 983 00:37:52,290 --> 00:37:55,305 So what he's saying is, if my problem looks like this, 984 00:37:55,305 --> 00:37:57,525 or if my problem looks like this, 985 00:37:57,525 --> 00:37:59,310 it looks like k should be different. 986 00:37:59,310 --> 00:38:04,680 In this case I can take any arbitrary high k-- 987 00:38:04,680 --> 00:38:06,639 sorry, small k, I guess, or i. 988 00:38:06,639 --> 00:38:08,430 It doesn't matter, because whatever you do, 989 00:38:08,430 --> 00:38:09,888 you pretty much get the good thing. 990 00:38:09,888 --> 00:38:12,150 But if you start doing something like this, 991 00:38:12,150 --> 00:38:14,180 then you want-- k is enough, because otherwise 992 00:38:14,180 --> 00:38:15,690 you just start to blur everything. 993 00:38:15,690 --> 00:38:17,231 And this is exactly what he's saying. 994 00:38:17,231 --> 00:38:19,770 If your problem is complicated or it's easy. 995 00:38:19,770 --> 00:38:21,439 OK. 996 00:38:21,439 --> 00:38:22,980 And at the same time, this is related 997 00:38:22,980 --> 00:38:25,610 to the fact of how much noise you might have in the data, 998 00:38:25,610 --> 00:38:29,590 OK, how much flipping you might have in your data. 
999 00:38:29,590 --> 00:38:34,510 If the problem is hard, then you expect to need a different k. 1000 00:38:34,510 --> 00:38:35,010 OK. 1001 00:38:35,010 --> 00:38:37,200 So it depends on the cardinality of the data, 1002 00:38:37,200 --> 00:38:38,670 and how complicated is the problem? 1003 00:38:38,670 --> 00:38:40,128 How complicated it is the boundary? 1004 00:38:40,128 --> 00:38:41,280 How much noise do I have? 1005 00:38:41,280 --> 00:38:42,420 OK. 1006 00:38:42,420 --> 00:38:45,450 So it turns out that one thing you can ask 1007 00:38:45,450 --> 00:38:46,661 is, can we prove it? 1008 00:38:46,661 --> 00:38:47,160 OK. 1009 00:38:47,160 --> 00:38:51,690 Can we prove a theorem that says that there is an optimal k, 1010 00:38:51,690 --> 00:38:57,570 and it really does depends on this, on this quantities. 1011 00:38:57,570 --> 00:38:59,280 And it turns out that you can. 1012 00:38:59,280 --> 00:39:02,060 Of course, as always, to make a theory or to make assumptions, 1013 00:39:02,060 --> 00:39:03,800 you have to work within a model. 1014 00:39:03,800 --> 00:39:05,883 And the model we want to work on is the following. 1015 00:39:05,883 --> 00:39:07,707 You're basically saying, this is the k 1016 00:39:07,707 --> 00:39:08,790 nearest neighbor solution. 1017 00:39:08,790 --> 00:39:10,520 So big k here is the number of neighbors, 1018 00:39:10,520 --> 00:39:13,050 and this is hat because it depends on the data. 1019 00:39:13,050 --> 00:39:14,550 And what I say here is that I'm just 1020 00:39:14,550 --> 00:39:17,575 going to look at squared loss error, just because it's easy. 1021 00:39:17,575 --> 00:39:19,950 And I'm going to look at the regression problem, not just 1022 00:39:19,950 --> 00:39:21,439 this classification. 1023 00:39:21,439 --> 00:39:22,980 And what you do here is that you take 1024 00:39:22,980 --> 00:39:26,430 expectation over all possible input-output pairs. 1025 00:39:26,430 --> 00:39:30,160 So basically you say, when I tried to do math, 1026 00:39:30,160 --> 00:39:31,686 I want to see what's ideal. 1027 00:39:31,686 --> 00:39:33,060 An ideally I want a solution that 1028 00:39:33,060 --> 00:39:35,070 does well on future points. 1029 00:39:35,070 --> 00:39:35,670 OK. 1030 00:39:35,670 --> 00:39:36,780 So how do I do that? 1031 00:39:36,780 --> 00:39:40,230 I think the average error over all possible points 1032 00:39:40,230 --> 00:39:41,820 in the future, x and y. 1033 00:39:41,820 --> 00:39:45,220 So this is the meaning of this first expectation. 1034 00:39:45,220 --> 00:39:47,070 Make sense? 1035 00:39:47,070 --> 00:39:48,705 Yes? 1036 00:39:48,705 --> 00:39:49,980 No? 1037 00:39:49,980 --> 00:39:54,330 So if they fix y and x, this is the error on a specific couple 1038 00:39:54,330 --> 00:39:55,290 input and output. 1039 00:39:55,290 --> 00:39:56,300 I give you the input. 1040 00:39:56,300 --> 00:40:01,049 I do f(kx) and then I check if it's close or not to y. 1041 00:40:01,049 --> 00:40:03,090 But what I want to do if I want to be theoretical 1042 00:40:03,090 --> 00:40:04,506 is to say, OK, what I would really 1043 00:40:04,506 --> 00:40:08,461 like to be small is this error over all possible points. 1044 00:40:08,461 --> 00:40:10,710 So I take the expectation, not the one on the training 1045 00:40:10,710 --> 00:40:12,220 set, the one in the future. 
1046 00:40:12,220 --> 00:40:13,980 And I take expectation so that if points 1047 00:40:13,980 --> 00:40:15,450 are more likely to be sampled, they 1048 00:40:15,450 --> 00:40:19,050 will count more than points that are less likely to be sampled. 1049 00:40:19,050 --> 00:40:20,322 OK. 1050 00:40:20,322 --> 00:40:21,791 AUDIENCE: What was Es? 1051 00:40:21,791 --> 00:40:23,790 LORENZO ROSASCO: We haven't got to that one yet. 1052 00:40:23,790 --> 00:40:24,360 OK. 1053 00:40:24,360 --> 00:40:26,280 So Exy is what I just said. 1054 00:40:26,280 --> 00:40:27,610 What is Es? 1055 00:40:27,610 --> 00:40:30,360 It's the expectation over the training set. 1056 00:40:30,360 --> 00:40:31,684 Why do we need that? 1057 00:40:31,684 --> 00:40:33,600 Well because if we don't put that expectation, 1058 00:40:33,600 --> 00:40:37,290 I'm basically telling you what's the good k for this one 1059 00:40:37,290 --> 00:40:38,440 training set here. 1060 00:40:38,440 --> 00:40:39,465 Then I give you another training set 1061 00:40:39,465 --> 00:40:41,140 and I get another one, which in some sense is good, 1062 00:40:41,140 --> 00:40:42,598 but it's also bad, because we would 1063 00:40:42,598 --> 00:40:44,400 like to have a take-home message that 1064 00:40:44,400 --> 00:40:46,599 holds for all training sets. 1065 00:40:46,599 --> 00:40:47,640 And this is the simplest. 1066 00:40:47,640 --> 00:40:50,110 You say, for the average training set, 1067 00:40:50,110 --> 00:40:52,697 this is how I should choose k. 1068 00:40:52,697 --> 00:40:53,780 That's what we want to do. 1069 00:40:53,780 --> 00:40:54,090 OK. 1070 00:40:54,090 --> 00:40:56,470 So the first expectation is to measure error with respect 1071 00:40:56,470 --> 00:40:57,130 to the future. 1072 00:40:57,130 --> 00:40:58,812 The second expectation is to say, 1073 00:40:58,812 --> 00:41:00,270 I want to deal with the fact that I 1074 00:41:00,270 --> 00:41:03,740 have several potential training sets appearing. 1075 00:41:03,740 --> 00:41:06,480 OK. 1076 00:41:06,480 --> 00:41:09,180 So in the next couple of slides, this red dot 1077 00:41:09,180 --> 00:41:10,811 means that there are computations. 1078 00:41:10,811 --> 00:41:11,310 OK. 1079 00:41:11,310 --> 00:41:14,340 And so I want to do them quickly. 1080 00:41:14,340 --> 00:41:17,710 And the important thing of this bit is, it's an exercise. 1081 00:41:17,710 --> 00:41:18,210 OK. 1082 00:41:18,210 --> 00:41:22,550 So this is an exercise of stats zero. 1083 00:41:22,550 --> 00:41:23,550 OK. 1084 00:41:23,550 --> 00:41:25,800 So we don't want to spend time doing that. 1085 00:41:25,800 --> 00:41:28,133 The important thing is going to be the conceptual parts. 1086 00:41:28,133 --> 00:41:30,150 I'm going to go a bit quickly through it. 1087 00:41:30,150 --> 00:41:32,200 So you start from this, and you would 1088 00:41:32,200 --> 00:41:33,724 like to understand if there exists-- 1089 00:41:33,724 --> 00:41:36,140 so this is the quantity that you would like to make small, 1090 00:41:36,140 --> 00:41:37,590 ideally. 1091 00:41:37,590 --> 00:41:39,450 You will never have access to this, 1092 00:41:39,450 --> 00:41:42,930 but ideally, in the optimal scenario, 1093 00:41:42,930 --> 00:41:45,420 you want k to make this small. 1094 00:41:45,420 --> 00:41:46,260 OK. 1095 00:41:46,260 --> 00:41:49,020 Now the problem is that you want to essentially mathematically 1096 00:41:49,020 --> 00:41:51,000 study this minimization problem, but it's not easy, 1097 00:41:51,000 --> 00:41:52,166 because, how do you do this?
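In formulas (the notation here is a reconstruction of what the slide presumably shows), the quantity described in words above is the expected error of the k-nearest-neighbor estimator, averaged both over a new pair (x, y) and over the training set S:
\[
\mathcal{E}(k) \;=\; \mathbb{E}_{S}\,\mathbb{E}_{x,y}\Big[\big(\hat{f}_k(x) - y\big)^2\Big],
\qquad
k_{\mathrm{opt}} \;=\; \arg\min_{k}\,\mathcal{E}(k).
\]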
1098 00:41:52,166 --> 00:41:52,740 OK. 1099 00:41:52,740 --> 00:41:55,680 The dependence of this function on k is complicated. 1100 00:41:55,680 --> 00:41:57,895 It's that equation we had before, right? 1101 00:41:57,895 --> 00:41:59,520 So you kind of just take the derivative 1102 00:41:59,520 --> 00:42:01,034 and set it equal to zero. 1103 00:42:01,034 --> 00:42:02,200 Let's keep on going into to. 1104 00:42:02,200 --> 00:42:03,780 So what we are at is, these are the points 1105 00:42:03,780 --> 00:42:04,770 I would like to make small. 1106 00:42:04,770 --> 00:42:07,170 I would like to choose k so that I can make this small. 1107 00:42:07,170 --> 00:42:09,480 I want to study this from a mathematical point of view. 1108 00:42:09,480 --> 00:42:11,310 But I cannot just use what you're doing in calculus, 1109 00:42:11,310 --> 00:42:13,726 which is taking a derivative and setting it equal to zero, 1110 00:42:13,726 --> 00:42:16,880 because the dependence of these two k, which is my variable, 1111 00:42:16,880 --> 00:42:17,920 it's complicated. 1112 00:42:17,920 --> 00:42:18,420 OK. 1113 00:42:18,420 --> 00:42:20,400 So we go a bit of a round way. 1114 00:42:20,400 --> 00:42:21,960 We turn out to be pretty universal. 1115 00:42:21,960 --> 00:42:23,460 And this is what we are going to do. 1116 00:42:29,540 --> 00:42:31,730 First of all, we assume a model for our data. 1117 00:42:31,730 --> 00:42:33,781 And this is just for the sake of simplicity. 1118 00:42:33,781 --> 00:42:34,280 OK. 1119 00:42:34,280 --> 00:42:37,100 I can use a much more general model. 1120 00:42:37,100 --> 00:42:38,540 But this is the model. 1121 00:42:38,540 --> 00:42:41,210 I'm going to say that my y are just some fixed function 1122 00:42:41,210 --> 00:42:44,720 of star plus some noise. 1123 00:42:44,720 --> 00:42:47,180 OK. 1124 00:42:47,180 --> 00:42:51,210 And the noise is zero mean and variance sigma 1125 00:42:51,210 --> 00:42:54,150 square for all entries. 1126 00:42:54,150 --> 00:42:56,570 OK. 1127 00:42:56,570 --> 00:42:58,190 This is the simplest model. 1128 00:42:58,190 --> 00:43:00,740 It's a Gaussian regression model. 1129 00:43:07,500 --> 00:43:09,702 So one thing I'm doing, and this is like a trick 1130 00:43:09,702 --> 00:43:11,410 and you can really forget it, but it just 1131 00:43:11,410 --> 00:43:15,190 makes life much easier is that I take the expectation over xy 1132 00:43:15,190 --> 00:43:16,780 and a condition here. 1133 00:43:16,780 --> 00:43:18,094 OK. 1134 00:43:18,094 --> 00:43:19,510 The reason why you do this is just 1135 00:43:19,510 --> 00:43:20,890 to make the math a bit easier. 1136 00:43:20,890 --> 00:43:23,317 Because basically now, if you put this expectation out, 1137 00:43:23,317 --> 00:43:24,900 and you look just at these quantities, 1138 00:43:24,900 --> 00:43:26,835 you're looking at everything for fixed x. 1139 00:43:26,835 --> 00:43:30,490 And these just become a real number, OK, not the function 1140 00:43:30,490 --> 00:43:31,120 anymore. 1141 00:43:31,120 --> 00:43:33,280 So you can use normal calculus. 1142 00:43:33,280 --> 00:43:35,620 You have a real-valued function and you can just 1143 00:43:35,620 --> 00:43:36,970 use the usual stuff. 1144 00:43:36,970 --> 00:43:37,659 OK. 1145 00:43:37,659 --> 00:43:39,700 Again, I'm going to going a bit quickly over this 1146 00:43:39,700 --> 00:43:41,074 because it doesn't really matter. 1147 00:43:41,074 --> 00:43:42,070 So this ingredient one. 1148 00:43:42,070 --> 00:43:44,390 This is observation two. 
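Written out (again in reconstructed notation), the model and the conditioning step just described are
\[
y_i \;=\; f_*(x_i) + \varepsilon_i,
\qquad
\mathbb{E}[\varepsilon_i] = 0,
\qquad
\operatorname{Var}(\varepsilon_i) = \sigma^2,
\]
and, conditioning on a fixed input \(x\) (the irreducible noise term \(\sigma^2\) does not depend on \(k\), so it can be set aside), the object to study becomes the real-valued function
\[
k \;\longmapsto\; \mathbb{E}_{S}\Big[\big(\hat{f}_k(x) - f_*(x)\big)^2\Big].
\]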
1149 00:43:44,390 --> 00:43:46,000 Observation three is that you need 1150 00:43:46,000 --> 00:43:49,780 to introduce an object between the solution you 1151 00:43:49,780 --> 00:43:54,490 get in practice and this ideal function. 1152 00:43:54,490 --> 00:43:55,220 What is this? 1153 00:43:55,220 --> 00:43:58,780 It's this kind of, what is called the expectation 1154 00:43:58,780 --> 00:44:00,700 of my algorithm. 1155 00:44:00,700 --> 00:44:02,620 What you do is that-- in my algorithm 1156 00:44:02,620 --> 00:44:06,160 what I do here is that I put Yi, i OK, 1157 00:44:06,160 --> 00:44:07,960 just the label of my training set. 1158 00:44:07,960 --> 00:44:09,360 And the label are noisy. 1159 00:44:09,360 --> 00:44:11,950 But this is an ideal object where you put the true function 1160 00:44:11,950 --> 00:44:17,320 itself, and you just average the value of the true function. 1161 00:44:17,320 --> 00:44:18,250 Why do I use this? 1162 00:44:18,250 --> 00:44:20,680 Because I want to get something which 1163 00:44:20,680 --> 00:44:24,310 is in between this f-star and this f-hat. 1164 00:44:24,310 --> 00:44:27,860 So if you put k big enough-- so if you have enough points, 1165 00:44:27,860 --> 00:44:28,990 this is going to be-- 1166 00:44:28,990 --> 00:44:30,500 sorry, if you take k small enough-- 1167 00:44:30,500 --> 00:44:34,330 so this is closer to f-star than my f-hat, 1168 00:44:34,330 --> 00:44:37,120 OK, because you get no noisy data. 1169 00:44:37,120 --> 00:44:39,171 And what I want to do-- 1170 00:44:39,171 --> 00:44:39,670 oops. 1171 00:44:43,716 --> 00:44:46,090 What I want to do is that I want to plug it in the middle 1172 00:44:46,090 --> 00:44:49,620 and split this error in two. 1173 00:44:49,620 --> 00:44:50,890 And this is what I do. 1174 00:44:50,890 --> 00:44:51,580 OK. 1175 00:44:51,580 --> 00:44:55,960 If you do this, you can check that you have a square here. 1176 00:44:55,960 --> 00:44:56,860 You get two terms. 1177 00:44:56,860 --> 00:44:59,354 One simplifies, because of this assumption on the noise, 1178 00:44:59,354 --> 00:45:00,520 and you get these two terms. 1179 00:45:00,520 --> 00:45:00,790 OK. 1180 00:45:00,790 --> 00:45:02,539 And the important thing is these two terms 1181 00:45:02,539 --> 00:45:05,470 are-- one is the comparison between my algorithm 1182 00:45:05,470 --> 00:45:06,760 and its expectation. 1183 00:45:06,760 --> 00:45:09,000 So that's exactly what we called a variance. 1184 00:45:09,000 --> 00:45:10,000 OK. 1185 00:45:10,000 --> 00:45:12,210 And one is the comparison between the value 1186 00:45:12,210 --> 00:45:14,230 of the true function here, and the value 1187 00:45:14,230 --> 00:45:15,860 of this other function. 1188 00:45:15,860 --> 00:45:17,610 Sorry, this should be-- 1189 00:45:17,610 --> 00:45:18,410 oh yeah. 1190 00:45:18,410 --> 00:45:20,050 This is the expectation, which is 1191 00:45:20,050 --> 00:45:23,020 my ideal version of my algorithm, the one that has 1192 00:45:23,020 --> 00:45:25,000 access to the noiseless labels. 1193 00:45:25,000 --> 00:45:25,750 OK. 1194 00:45:25,750 --> 00:45:27,290 It's what you call a bias. 1195 00:45:27,290 --> 00:45:28,990 It's basically because, instead of using 1196 00:45:28,990 --> 00:45:31,620 the exact value of the function, you blur it 1197 00:45:31,620 --> 00:45:33,410 a bit by averaging out. 1198 00:45:33,410 --> 00:45:33,910 OK. 1199 00:45:33,910 --> 00:45:36,370 You see here, instead of using the value of the function, 1200 00:45:36,370 --> 00:45:39,590 you average out a few nearby values. 
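Spelled out, the intermediate object is the average of the true function over the k nearest neighbors of x, and adding and subtracting it inside the square splits the error into the two terms just named (the cross term vanishes because the noise has zero mean):
\[
\bar{f}_k(x) \;=\; \frac{1}{k}\sum_{i \in N_k(x)} f_*(x_i),
\]
\[
\mathbb{E}_{S}\Big[\big(\hat{f}_k(x) - f_*(x)\big)^2\Big]
\;=\;
\underbrace{\mathbb{E}_{S}\Big[\big(\hat{f}_k(x) - \bar{f}_k(x)\big)^2\Big]}_{\text{variance}}
\;+\;
\underbrace{\mathbb{E}_{S}\Big[\big(\bar{f}_k(x) - f_*(x)\big)^2\Big]}_{\text{bias}},
\]
where \(N_k(x)\) denotes the indices of the \(k\) training inputs closest to \(x\).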
1201 00:45:39,590 --> 00:45:41,986 So you're making it a bit dirtier. 1202 00:45:41,986 --> 00:45:44,110 The question now is, how would these two quantities 1203 00:45:44,110 --> 00:45:45,754 depend on k? 1204 00:45:45,754 --> 00:45:47,920 How this quantity depends on k and how this quantity 1205 00:45:47,920 --> 00:45:49,840 depends on k. 1206 00:45:49,840 --> 00:45:50,824 OK. 1207 00:45:50,824 --> 00:45:52,240 And then by putting this together, 1208 00:45:52,240 --> 00:45:54,340 we'll see that we have a certain behavior of this, 1209 00:45:54,340 --> 00:45:55,810 and a certain behavior of this. 1210 00:45:55,810 --> 00:45:57,400 And then balancing this out, we'll 1211 00:45:57,400 --> 00:45:59,530 get what the optimal value looked like. 1212 00:45:59,530 --> 00:46:02,980 And this is going to be all useless from-- 1213 00:46:02,980 --> 00:46:04,480 so these are going to be interesting 1214 00:46:04,480 --> 00:46:05,680 from a conceptual perspective. 1215 00:46:05,680 --> 00:46:07,638 We're going to learn something, but we'll still 1216 00:46:07,638 --> 00:46:10,030 have to do something practical, because nothing of this 1217 00:46:10,030 --> 00:46:11,350 you can measure in practice. 1218 00:46:11,350 --> 00:46:12,040 OK. 1219 00:46:12,040 --> 00:46:14,320 So the next question would be, now that we 1220 00:46:14,320 --> 00:46:16,520 know that it exists and it depends on this stuff, 1221 00:46:16,520 --> 00:46:18,220 how can we actually approximate it in practice? 1222 00:46:18,220 --> 00:46:20,770 And cross-validation is going to pop out of the window. 1223 00:46:20,770 --> 00:46:21,320 OK. 1224 00:46:21,320 --> 00:46:22,861 But this is the theory that shows you 1225 00:46:22,861 --> 00:46:25,710 that this would help proving a theory that 1226 00:46:25,710 --> 00:46:29,650 shows that cross-validation is a good idea, in a precise sense. 1227 00:46:29,650 --> 00:46:32,110 The take-home message is, by making this model 1228 00:46:32,110 --> 00:46:33,880 and using this as an intermediate object, 1229 00:46:33,880 --> 00:46:37,500 you split the error in two, and you start to be able to study. 1230 00:46:37,500 --> 00:46:41,260 And what you get is basically the following. 1231 00:46:41,260 --> 00:46:44,030 This term, by basically using-- 1232 00:46:44,030 --> 00:46:45,504 so we assume that the data-- 1233 00:46:45,504 --> 00:46:47,170 I didn't say that, but that's important. 1234 00:46:47,170 --> 00:46:49,960 We assume that the data are independent with each other. 1235 00:46:49,960 --> 00:46:50,800 OK. 1236 00:46:50,800 --> 00:46:54,070 And by using that, you get these results right away, 1237 00:46:54,070 --> 00:46:56,217 essentially using the fact that the variance 1238 00:46:56,217 --> 00:46:57,800 of the sum of the independent variable 1239 00:46:57,800 --> 00:47:00,130 is the sum of the variances. 1240 00:47:00,130 --> 00:47:01,930 You get these results in one line. 1241 00:47:01,930 --> 00:47:04,570 OK. 1242 00:47:04,570 --> 00:47:09,946 And basically what this shows is that, if k gets big-- 1243 00:47:09,946 --> 00:47:12,550 so variance is another word for the stability. 1244 00:47:12,550 --> 00:47:13,300 OK. 1245 00:47:13,300 --> 00:47:16,190 So if you have a big variance, things will vary a lot. 1246 00:47:16,190 --> 00:47:17,390 It will be unstable. 1247 00:47:17,390 --> 00:47:21,830 So what you see here is exactly what we observe in the plot 1248 00:47:21,830 --> 00:47:22,330 before. 1249 00:47:22,330 --> 00:47:24,940 If k was big, things are not changing as much. 
1250 00:47:24,940 --> 00:47:27,220 If k was small, things were changing a lot. 1251 00:47:27,220 --> 00:47:27,820 OK. 1252 00:47:27,820 --> 00:47:31,400 And this is the one equation that shows you that. 1253 00:47:31,400 --> 00:47:32,040 OK. 1254 00:47:32,040 --> 00:47:34,450 And if you just look at that, it would just tell you, 1255 00:47:34,450 --> 00:47:35,740 bigger is better. 1256 00:47:35,740 --> 00:47:36,660 Big with respect to what? 1257 00:47:36,660 --> 00:47:37,550 To the noise. 1258 00:47:37,550 --> 00:47:38,050 OK. 1259 00:47:38,050 --> 00:47:40,216 If there is a lot of noise, I should make it bigger. 1260 00:47:40,216 --> 00:47:43,480 If there's less noise, I can make it smaller. 1261 00:47:43,480 --> 00:47:45,970 But the point that we saw before is 1262 00:47:45,970 --> 00:47:47,650 that the problem of putting k large 1263 00:47:47,650 --> 00:47:49,300 was that we were forgetting about the problem. 1264 00:47:49,300 --> 00:47:51,341 We're just getting something that was very stable 1265 00:47:51,341 --> 00:47:54,220 but could be potentially very bad, if my function was not 1266 00:47:54,220 --> 00:47:55,430 that simple. 1267 00:47:55,430 --> 00:47:56,080 OK. 1268 00:47:56,080 --> 00:47:58,620 This is a bit harder to study mathematically. 1269 00:47:58,620 --> 00:47:59,290 OK. 1270 00:47:59,290 --> 00:48:00,850 This is a calculation that I show you 1271 00:48:00,850 --> 00:48:05,952 because you can do it yourself in like 20 minutes, or less. 1272 00:48:05,952 --> 00:48:07,660 This one takes a bit more. 1273 00:48:07,660 --> 00:48:10,360 But you can get a hunch of how it looks. 1274 00:48:10,360 --> 00:48:12,910 And the basic idea is what we already said. 1275 00:48:12,910 --> 00:48:19,210 If k is small, and the points are close enough, instead 1276 00:48:19,210 --> 00:48:24,910 of f-star of x, we are looking at f-star of the nearest points Xi. 1277 00:48:24,910 --> 00:48:26,560 And the Xi are close to x. 1278 00:48:26,560 --> 00:48:27,850 OK. 1279 00:48:27,850 --> 00:48:29,920 Now if we start to put k bigger, we 1280 00:48:29,920 --> 00:48:35,620 start to blur that prediction by looking at many nearby points. 1281 00:48:35,620 --> 00:48:37,090 But here there is no noise. 1282 00:48:37,090 --> 00:48:37,699 OK. 1283 00:48:37,699 --> 00:48:38,990 So that sounds like a bad idea. 1284 00:48:38,990 --> 00:48:41,110 So we expect the error in that case 1285 00:48:41,110 --> 00:48:45,610 to be either increasing, or at least flat with respect to k. 1286 00:48:45,610 --> 00:48:50,800 So when we take k larger, we're blurring this prediction, 1287 00:48:50,800 --> 00:48:53,690 and potentially make it far away from the true one. 1288 00:48:53,690 --> 00:48:54,190 OK. 1289 00:48:54,190 --> 00:48:57,040 And you can make this statement precise. 1290 00:48:57,040 --> 00:48:58,090 You can prove it. 1291 00:48:58,090 --> 00:49:00,780 And if you prove it, it's basically that you have-- 1292 00:49:00,780 --> 00:49:03,130 what happened? 1293 00:49:03,130 --> 00:49:04,700 You have linear dependence. 1294 00:49:04,700 --> 00:49:07,460 So the error here is linearly increasing or polynomially 1295 00:49:07,460 --> 00:49:10,551 increasing-- in fact I don't remember-- with respect to k. 1296 00:49:10,551 --> 00:49:11,050 OK.
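The one-line computation for the variance term (using independent, zero-mean noise with variance sigma squared) is
\[
\mathbb{E}_{S}\Big[\big(\hat{f}_k(x) - \bar{f}_k(x)\big)^2\Big]
\;=\;
\mathbb{E}\bigg[\Big(\frac{1}{k}\sum_{i \in N_k(x)} \varepsilon_i\Big)^{\!2}\bigg]
\;=\;
\frac{\sigma^2}{k},
\]
so the variance shrinks as k grows, while the bias term tends to grow with k, because averaging over more, and hence farther, neighbors blurs the value of f-star at x.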
1297 00:49:13,385 --> 00:49:14,760 So the reason why I'm showing you 1298 00:49:14,760 --> 00:49:16,577 this, skipping all these details, 1299 00:49:16,577 --> 00:49:18,910 is just to give you a feeling of the kind of computation 1300 00:49:18,910 --> 00:49:23,370 that answered the question if there is a optimal value 1301 00:49:23,370 --> 00:49:25,310 and what it depends on. 1302 00:49:25,310 --> 00:49:27,060 And then at this point, once you get this, 1303 00:49:27,060 --> 00:49:28,560 you start to see this kind of plot. 1304 00:49:28,560 --> 00:49:30,990 And typically here I put them the wrong way. 1305 00:49:30,990 --> 00:49:32,430 But here you basically say, I have 1306 00:49:32,430 --> 00:49:34,750 this one function I wanted to study, 1307 00:49:34,750 --> 00:49:39,030 which is the sum of two functions. 1308 00:49:39,030 --> 00:49:41,040 I have this, and I have this. 1309 00:49:41,040 --> 00:49:42,725 OK. 1310 00:49:42,725 --> 00:49:44,100 And now to study the minimum, I'm 1311 00:49:44,100 --> 00:49:45,750 basically going to sum them up and see 1312 00:49:45,750 --> 00:49:47,666 what's the optimal value to optimize this too. 1313 00:49:47,666 --> 00:49:51,854 And the k that optimized this is exactly the optimal k. 1314 00:49:51,854 --> 00:49:54,270 And you see that the optimal k will behave as we expected. 1315 00:49:54,270 --> 00:49:55,440 OK. 1316 00:49:55,440 --> 00:49:58,770 So here, one ingredient is missing. 1317 00:49:58,770 --> 00:50:01,425 And it's just missing because I didn't put it in, 1318 00:50:01,425 --> 00:50:02,790 which is the number of points. 1319 00:50:02,790 --> 00:50:03,290 OK. 1320 00:50:03,290 --> 00:50:05,660 It's just because I didn't renormalize things. 1321 00:50:05,660 --> 00:50:06,210 OK. 1322 00:50:06,210 --> 00:50:09,150 It should be a 1 over n here. 1323 00:50:15,210 --> 00:50:16,950 It's just that I didn't renormalize. 1324 00:50:16,950 --> 00:50:17,629 OK. 1325 00:50:17,629 --> 00:50:19,920 But you announced it, and it's good, because it's true. 1326 00:50:19,920 --> 00:50:21,450 There should be a 1 over n there. 1327 00:50:21,450 --> 00:50:23,040 But the rest is what we expected. 1328 00:50:23,040 --> 00:50:23,592 OK. 1329 00:50:23,592 --> 00:50:25,800 In some sense what we expect is that if my problem is 1330 00:50:25,800 --> 00:50:28,100 complicated, I need the smaller k. 1331 00:50:28,100 --> 00:50:31,680 If there is a lot of noise, I need a bigger k. 1332 00:50:31,680 --> 00:50:33,330 And depending on the number of points, 1333 00:50:33,330 --> 00:50:35,150 which would be in the numerator here, 1334 00:50:35,150 --> 00:50:37,890 I can make a bigger or a larger. 1335 00:50:37,890 --> 00:50:38,620 k. 1336 00:50:38,620 --> 00:50:40,290 OK. 1337 00:50:40,290 --> 00:50:43,020 This plot is fundamental because it shows some property which 1338 00:50:43,020 --> 00:50:44,410 is inherent in the problem. 1339 00:50:44,410 --> 00:50:47,490 And the theorem that somewhat is behind it-- 1340 00:50:47,490 --> 00:50:50,680 intuition I've been saying, repeating over and over, 1341 00:50:50,680 --> 00:50:53,890 which is this intuition that you cannot trust the data too much. 1342 00:50:53,890 --> 00:50:56,850 And there is the optimal amount of trust you can of your data 1343 00:50:56,850 --> 00:50:58,320 based on certain assumptions. 1344 00:50:58,320 --> 00:50:58,830 OK. 1345 00:50:58,830 --> 00:51:03,560 And in our case, the assumption where this kind of model. 
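Putting the two terms together, and restoring the 1/n just mentioned (it enters through the bias, since with more data the k nearest neighbors sit closer to x), the curve being minimized has the qualitative shape below; the exact exponent depends on smoothness assumptions on f-star, so it is left generic here:
\[
\mathcal{E}(k) \;\approx\; \frac{\sigma^2}{k} \;+\; C\Big(\frac{k}{n}\Big)^{\alpha},
\qquad \alpha > 0,
\]
and the minimizing k grows with the noise level and with the number of points n, and shrinks when the target function is more complex, exactly as discussed above.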
1346 00:51:03,560 --> 00:51:07,800 So little calculation I'll show you quickly, 1347 00:51:07,800 --> 00:51:11,220 grounds this intuition into a mathematical argument. 1348 00:51:11,220 --> 00:51:13,660 OK. 1349 00:51:13,660 --> 00:51:14,160 All right. 1350 00:51:14,160 --> 00:51:17,490 So we spent quite a bit of time on this. 1351 00:51:17,490 --> 00:51:20,420 In some sense, from a conceptual point of view, 1352 00:51:20,420 --> 00:51:21,420 this is a critical idea. 1353 00:51:21,420 --> 00:51:21,720 OK. 1354 00:51:21,720 --> 00:51:23,511 Because it's behind pretty much everything. 1355 00:51:23,511 --> 00:51:26,880 This idea of, how much you can trust or not of the data. 1356 00:51:26,880 --> 00:51:30,920 Of course here, as we said, this has been informative, 1357 00:51:30,920 --> 00:51:31,800 hopefully. 1358 00:51:31,800 --> 00:51:34,202 But you cannot really choose this k, 1359 00:51:34,202 --> 00:51:35,910 because you would need to know the noise, 1360 00:51:35,910 --> 00:51:38,250 but especially to know how to estimate this in order 1361 00:51:38,250 --> 00:51:40,510 to minimize this quantity. 1362 00:51:40,510 --> 00:51:44,730 So in practice what you can show is, 1363 00:51:44,730 --> 00:51:46,604 you can use what is called cross-validation. 1364 00:51:46,604 --> 00:51:48,020 And in effect, cross-validation is 1365 00:51:48,020 --> 00:51:50,430 one of a few other techniques you can use. 1366 00:51:50,430 --> 00:51:53,840 And the idea is that you don't have access [AUDIO OUT] 1367 00:51:53,840 --> 00:51:57,320 but you can show that if you take a bunch of data points, 1368 00:51:57,320 --> 00:51:59,780 you split them in two, you use half for the training 1369 00:51:59,780 --> 00:52:02,750 as you've always done, and you use the other half as a proxy 1370 00:52:02,750 --> 00:52:05,450 for this future data. 1371 00:52:05,450 --> 00:52:07,600 Then by minimizing the k-- 1372 00:52:07,600 --> 00:52:10,590 taking the k that minimized the error on this so-called holdout 1373 00:52:10,590 --> 00:52:15,620 set, then you can prove it's as good as if you 1374 00:52:15,620 --> 00:52:17,461 could have access to this. 1375 00:52:17,461 --> 00:52:17,960 OK. 1376 00:52:17,960 --> 00:52:19,730 And it's actually very easy to prove. 1377 00:52:19,730 --> 00:52:22,460 You can show that if you're just split in two, 1378 00:52:22,460 --> 00:52:24,260 and you minimize the error in second half-- 1379 00:52:24,260 --> 00:52:27,260 you do what is called the holdout cross-validation-- it's 1380 00:52:27,260 --> 00:52:31,291 as good as if you'd had access to this. 1381 00:52:31,291 --> 00:52:31,790 OK. 1382 00:52:31,790 --> 00:52:32,930 So it's optimal in a way. 1383 00:52:36,310 --> 00:52:39,790 Now, the problem with this is that we are only looking 1384 00:52:39,790 --> 00:52:44,394 at the area and expectation. 1385 00:52:44,394 --> 00:52:46,810 And what you can check is that if you look at higher order 1386 00:52:46,810 --> 00:52:49,741 statistics, say that variance of your estimators 1387 00:52:49,741 --> 00:52:51,490 and so on and so forth, what you might get 1388 00:52:51,490 --> 00:52:54,520 is that by splitting in two, [AUDIO OUT] big is fine. 1389 00:52:54,520 --> 00:52:56,337 In practice the difference is small, 1390 00:52:56,337 --> 00:52:58,420 you might get that the way you split might matter. 1391 00:52:58,420 --> 00:53:00,753 You might have bad luck and just split in a certain way. 1392 00:53:00,753 --> 00:53:03,850 And so there is a whole zoology of ways of splitting. 
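Before the refinements below, here is what the simple holdout procedure just described might look like in code, for choosing k in k-nearest-neighbor regression; the toy dataset, the candidate values of k, and the 50/50 split are illustrative assumptions.

import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    # k-NN regression: average the labels of the k nearest training points.
    nearest = np.argsort(np.linalg.norm(X_train - x_new, axis=1))[:k]
    return y_train[nearest].mean()

def holdout_error(X_tr, y_tr, X_val, y_val, k):
    # Squared error on the held-out half, used as a proxy for the error on future data.
    preds = np.array([knn_regress(X_tr, y_tr, x, k) for x in X_val])
    return np.mean((preds - y_val) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 200)      # noisy toy regression data

half = len(X) // 2
X_tr, y_tr, X_val, y_val = X[:half], y[:half], X[half:], y[half:]
errors = {k: holdout_error(X_tr, y_tr, X_val, y_val, k) for k in [1, 3, 5, 10, 20, 50]}
best_k = min(errors, key=errors.get)               # the k that minimizes the holdout error
print(errors, best_k)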
1393 00:53:03,850 --> 00:53:06,947 And the basic one is, say, split-- 1394 00:53:06,947 --> 00:53:08,405 this is, for example, the simplest. 1395 00:53:08,405 --> 00:53:08,905 OK. 1396 00:53:08,905 --> 00:53:13,630 Split in a bunch of groups. 1397 00:53:13,630 --> 00:53:14,260 OK. 1398 00:53:14,260 --> 00:53:16,240 k-fold or v-fold cross-validation. 1399 00:53:16,240 --> 00:53:18,081 Take one group out at a time. 1400 00:53:18,081 --> 00:53:18,580 OK. 1401 00:53:18,580 --> 00:53:20,150 And do the same trick. 1402 00:53:20,150 --> 00:53:23,470 You know, you train here and calculate the error here 1403 00:53:23,470 --> 00:53:24,430 for different k's. 1404 00:53:24,430 --> 00:53:25,960 Then you do the same here, do the same here, 1405 00:53:25,960 --> 00:53:26,740 do the same here. 1406 00:53:26,740 --> 00:53:29,572 Sum the errors up, renormalizing, 1407 00:53:29,572 --> 00:53:31,280 and then just choose the k that minimizes 1408 00:53:31,280 --> 00:53:34,090 this new form of error. 1409 00:53:34,090 --> 00:53:37,450 And if the dataset is small, small, small, then 1410 00:53:37,450 --> 00:53:39,630 typically these groups will become very small. 1411 00:53:39,630 --> 00:53:42,171 And in the limit each group becomes one point, the leave-one-out error. 1412 00:53:42,171 --> 00:53:42,670 OK. 1413 00:53:42,670 --> 00:53:44,740 What you do is that you literally 1414 00:53:44,740 --> 00:53:47,110 leave one out, train on the rest, 1415 00:53:47,110 --> 00:53:49,670 get the error for all the values of k in this case. 1416 00:53:49,670 --> 00:53:56,210 Put it back in, take another one out, and repeat the procedure. 1417 00:53:56,210 --> 00:53:59,440 Now the question that I had 10, 15 minutes ago was, 1418 00:53:59,440 --> 00:54:01,770 how do you choose v? 1419 00:54:01,770 --> 00:54:04,270 OK. 1420 00:54:04,270 --> 00:54:06,150 Shall I make this two? 1421 00:54:06,150 --> 00:54:08,630 So I just do one split like this? 1422 00:54:08,630 --> 00:54:12,640 Or shall I make it n, so I do leave one out? 1423 00:54:12,640 --> 00:54:14,570 And as far as I know there is not 1424 00:54:14,570 --> 00:54:17,180 a lot of theory that would support 1425 00:54:17,180 --> 00:54:18,830 an answer to this question. 1426 00:54:18,830 --> 00:54:23,040 And what I know is mostly what you can expect intuitively, 1427 00:54:23,040 --> 00:54:24,874 which is, if you have a lot of data points-- 1428 00:54:24,874 --> 00:54:25,873 what does it mean a lot? 1429 00:54:25,873 --> 00:54:26,480 I don't know. 1430 00:54:26,480 --> 00:54:29,690 If you have two million, 10,000, I don't know. 1431 00:54:29,690 --> 00:54:32,120 If you have a big dataset, typically splitting in two, 1432 00:54:32,120 --> 00:54:36,590 or maybe doing just random splits is stable enough. 1433 00:54:36,590 --> 00:54:37,700 What does it mean? 1434 00:54:37,700 --> 00:54:42,320 That you try, and you look at how much it moves. 1435 00:54:42,320 --> 00:54:43,370 Whereas if you have, say-- 1436 00:54:43,370 --> 00:54:44,560 you know, I don't know if it even exists anymore, 1437 00:54:44,560 --> 00:54:46,768 but, you know, a few years ago there 1438 00:54:46,768 --> 00:54:51,350 were microarray applications where you would have 20, 30 inputs, 1439 00:54:51,350 --> 00:54:54,830 and you have 20 dimensions. 1440 00:54:54,830 --> 00:54:57,302 And then in that case, you really don't do much splitting. 1441 00:54:57,302 --> 00:54:59,510 If you have 20, for example, you try to leave one out 1442 00:54:59,510 --> 00:55:00,718 and it's the best you can do.
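And a sketch of the v-fold and leave-one-out variants just described (it reuses the toy knn_regress helper from the holdout sketch above; the fold counts and candidate k's are illustrative):

import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    # Same toy k-NN regressor as in the holdout sketch above.
    nearest = np.argsort(np.linalg.norm(X_train - x_new, axis=1))[:k]
    return y_train[nearest].mean()

def vfold_error(X, y, k, v):
    # v-fold cross-validation error for a given k; with v = len(X) this is the leave-one-out error.
    folds = np.array_split(np.arange(len(X)), v)
    total = 0.0
    for fold in folds:
        mask = np.ones(len(X), dtype=bool)
        mask[fold] = False                        # hold this group out, train on the rest...
        preds = np.array([knn_regress(X[mask], y[mask], x, k) for x in X[fold]])
        total += np.sum((preds - y[fold]) ** 2)   # ...and score the predictions on the held-out group
    return total / len(X)

# For example, pick k among a few candidates by 5-fold cross-validation
# (using the toy X, y from the previous sketch); v = len(X) gives leave-one-out:
# best_k = min([1, 3, 5, 10, 20], key=lambda k: vfold_error(X, y, k, v=5))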
1443 00:55:00,718 --> 00:55:02,990 And it's already very unstable and sucks. 1444 00:55:02,990 --> 00:55:03,920 OK. 1445 00:55:03,920 --> 00:55:05,760 So in this case, there is work to be done. 1446 00:55:05,760 --> 00:55:08,311 I mean, as far as I know, that's the state of things. 1447 00:55:08,311 --> 00:55:08,810 OK. 1448 00:55:08,810 --> 00:55:10,964 So we introduced a class of very simple algorithms. 1449 00:55:10,964 --> 00:55:12,380 They seem to be pretty reasonable. 1450 00:55:12,380 --> 00:55:14,750 They seem to allow us, provided that we have a way 1451 00:55:14,750 --> 00:55:16,460 to measure distances or similarity, 1452 00:55:16,460 --> 00:55:19,010 to go from simple to complex. 1453 00:55:19,010 --> 00:55:21,320 And we have some kind of theory that 1454 00:55:21,320 --> 00:55:24,642 tells us what is the optimal value of a parameter, 1455 00:55:24,642 --> 00:55:28,800 a kind of practical procedure to actually choose it in practice. 1456 00:55:28,800 --> 00:55:30,630 OK. 1457 00:55:30,630 --> 00:55:31,850 Are we done? 1458 00:55:31,850 --> 00:55:33,229 Is that all? 1459 00:55:33,229 --> 00:55:34,520 do we need to do anything else? 1460 00:55:34,520 --> 00:55:36,326 What's missing here? 1461 00:55:36,326 --> 00:55:37,700 One thing that is missing here is 1462 00:55:37,700 --> 00:55:40,310 that most of the intuition we developed so far 1463 00:55:40,310 --> 00:55:42,000 are really related to low dimension. 1464 00:55:42,000 --> 00:55:42,720 OK. 1465 00:55:42,720 --> 00:55:46,160 And here, very quickly, if you just do a little exercise 1466 00:55:46,160 --> 00:55:47,960 where you try to say how big is a cube 1467 00:55:47,960 --> 00:55:53,911 that covers 1% of the volume of a bigger cube of a unit length? 1468 00:55:53,911 --> 00:55:54,410 OK. 1469 00:55:54,410 --> 00:55:56,540 So the big cube is volume 1. 1470 00:55:56,540 --> 00:55:58,040 The length of that is just 1. 1471 00:55:58,040 --> 00:55:59,700 And it ask you, how big is this, if it 1472 00:55:59,700 --> 00:56:02,280 has to cover 1% of the volume? 1473 00:56:02,280 --> 00:56:05,840 It's really to check that these are just going to be a dth-root 1474 00:56:05,840 --> 00:56:07,820 where d is the dimension of the cube. 1475 00:56:07,820 --> 00:56:10,460 And this is the shape of the dth-root. 1476 00:56:10,460 --> 00:56:11,090 OK. 1477 00:56:11,090 --> 00:56:13,940 So if you're in low dimension, basically, 1% 1478 00:56:13,940 --> 00:56:17,024 is intuitively small within the big cube. 1479 00:56:17,024 --> 00:56:19,190 But as soon as you're go in higher dimensional, what 1480 00:56:19,190 --> 00:56:22,940 you see is that the length of the edge of the little cube 1481 00:56:22,940 --> 00:56:26,390 that has to cover 1% of the volume becomes very close to 1, 1482 00:56:26,390 --> 00:56:27,490 almost immediately. 1483 00:56:27,490 --> 00:56:29,300 It's this curve going up. 1484 00:56:29,300 --> 00:56:30,410 OK. 1485 00:56:30,410 --> 00:56:31,160 What does it mean? 1486 00:56:31,160 --> 00:56:35,630 That if you say, our intuition is, well, 1%. 1487 00:56:35,630 --> 00:56:36,950 It's a pretty small volume. 1488 00:56:36,950 --> 00:56:39,920 If I just took the neighbors in 1%, they're pretty close, 1489 00:56:39,920 --> 00:56:42,420 so they should have the same label. 1490 00:56:42,420 --> 00:56:45,750 Well, in dimension 10, it's everything. 1491 00:56:45,750 --> 00:56:46,670 OK. 
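The little exercise is a one-liner: the edge of a sub-cube covering 1% of the volume of the unit cube in dimension d is 0.01 raised to the power 1/d, which rushes toward 1 as d grows.

for d in [1, 2, 3, 10, 100]:
    # edge length of a cube with volume 0.01 inside the unit cube in dimension d
    print(d, round(0.01 ** (1 / d), 3))
# prints roughly: 1 0.01, 2 0.1, 3 0.215, 10 0.631, 100 0.955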
1492 00:56:46,670 --> 00:56:49,962 So our intuition-- now you can say that probably there 1493 00:56:49,962 --> 00:56:52,420 is something wrong with my way of thinking of volume, sure. 1494 00:56:52,420 --> 00:56:54,420 But the problem is that we have to rethink a bit 1495 00:56:54,420 --> 00:56:57,400 how you think of dimensions and similarity in high dimension, 1496 00:56:57,400 --> 00:56:59,530 because things that are obvious low dimensional 1497 00:56:59,530 --> 00:57:01,450 start to be very complicated. 1498 00:57:01,450 --> 00:57:02,200 OK. 1499 00:57:02,200 --> 00:57:05,770 And the basic idea is that this neighbor technique just 1500 00:57:05,770 --> 00:57:08,770 looks at what's happening in one region. 1501 00:57:08,770 --> 00:57:11,540 But what you hope to do is that if your function actually 1502 00:57:11,540 --> 00:57:13,840 has some kind of global properties-- 1503 00:57:13,840 --> 00:57:16,204 so, say for example a sign is the simplest example 1504 00:57:16,204 --> 00:57:18,370 of something which is global, because the value here 1505 00:57:18,370 --> 00:57:21,507 and the value here are very much related. 1506 00:57:21,507 --> 00:57:23,090 And then it goes up and it's the same. 1507 00:57:23,090 --> 00:57:24,012 And then it goes down. 1508 00:57:24,012 --> 00:57:25,470 So if you know something like this, 1509 00:57:25,470 --> 00:57:27,490 the idea is that you can borrow strength 1510 00:57:27,490 --> 00:57:29,729 from points which are far away. 1511 00:57:29,729 --> 00:57:32,020 In some sense the function has some similar properties. 1512 00:57:32,020 --> 00:57:34,180 And so you want to go from a local estimation 1513 00:57:34,180 --> 00:57:36,310 to some form of global estimation. 1514 00:57:36,310 --> 00:57:36,900 OK. 1515 00:57:36,900 --> 00:57:38,500 And instead of making a decision based 1516 00:57:38,500 --> 00:57:40,030 only on the neighbors of the points, 1517 00:57:40,030 --> 00:57:42,640 you might want to use points which are potentially far away. 1518 00:57:42,640 --> 00:57:43,140 OK. 1519 00:57:43,140 --> 00:57:47,620 And this seems to be like a good idea in high dimensions where 1520 00:57:47,620 --> 00:57:51,066 the neighboring points might not give enough information . 1521 00:57:51,066 --> 00:57:52,440 And that's kind of what's called, 1522 00:57:52,440 --> 00:57:53,560 curse of dimensionality. 1523 00:57:53,560 --> 00:57:54,060 OK. 1524 00:57:54,060 --> 00:57:56,260 So what I want to do next-- 1525 00:57:56,260 --> 00:57:58,030 we can take a break here-- 1526 00:57:58,030 --> 00:58:02,010 is discussing least squares and kernel least squares. 1527 00:58:02,010 --> 00:58:02,520 OK. 1528 00:58:02,520 --> 00:58:04,186 But what we're going to do is that we're 1529 00:58:04,186 --> 00:58:06,200 going to take a linear model of our data, 1530 00:58:06,200 --> 00:58:07,930 and then we are going to try to see 1531 00:58:07,930 --> 00:58:09,250 how you can estimate and learn. 1532 00:58:09,250 --> 00:58:11,291 And we're going to look at bit of the computation 1533 00:58:11,291 --> 00:58:14,100 and a bit of the statistical idea underlying this model. 1534 00:58:14,100 --> 00:58:16,600 And then we're going to play around in a very simple for way 1535 00:58:16,600 --> 00:58:19,390 to extend from a linear model to a non-linear model 1536 00:58:19,390 --> 00:58:21,245 and actually make it non-parametric. 1537 00:58:21,245 --> 00:58:24,030 I'll tell you what non-parametric means.