The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: It's because if I was not, this would be basically the last topic we would ever see. And this is arguably the most important topic in statistics, or at least that's probably the reason why most of you are taking this class. Because regression implies prediction, and prediction is what people are after now, right? You don't need to understand what the model for the financial market is if you have a formula to predict what the stock prices are going to be tomorrow. And regression, in a way, allows us to do that. We'll start with a very simple version of regression, which is linear regression, the most standard one. And then we'll move on to slightly more advanced notions such as nonparametric regression. At least, we're going to see the principles behind it.
And I'll touch upon a little bit of high-dimensional regression, which is what people are doing today. So the goal of regression is to try to predict one variable based on another variable. All right, so here the notation is very important. It's extremely standard; it goes everywhere, essentially. You're trying to explain y as a function of x, which is the usual y = f(x) question. Except that if you look at a calculus class, people tell you y = f(x), and they give you a specific form for f, and then you do something. Here, we're just going to try to estimate what this function f is. And this is why we often call y the explained variable and x the explanatory variable. All right, so we're statisticians, so we start with data. What does our data look like? Well, it looks like a bunch of inputs and outputs to this relationship. So we have a bunch of pairs (xi, yi), and I can do a scatterplot of those guys. Each point here has an x-coordinate, which is xi, and a y-coordinate, which is yi, and here I have a bunch of n points.
And I just draw them like that. Now, the functions we're going to be interested in are often of the form y = a + b*x, OK? And that means this function looks like this. So if I do x and y, this function looks exactly like a line, and clearly those points are not on the line. And it will basically never happen that those points are on a line. There's a famous T-shirt from, I think, UC Berkeley's statistics department, that shows this picture with a line through the points, like the one we're going to see. And it says: oh, statisticians, so many points, and you still managed to miss all of them. And so essentially, we don't believe that this relationship y = a + bx is exactly true, but maybe it's true up to some noise. And that's where the statistics is going to come into play. There's going to be some random noise, and hopefully the noise is going to be spread out evenly, so that we can average it out if we have enough points. And this epsilon here is not necessarily due to randomness.
But again, just like we did modeling in the first place, it essentially accounts for everything we don't understand about this relationship. All right, so for example; give me one second, we'll see an example in a second. But the idea here is that if you have data, and if you believe that it's of the form a + bx plus some noise, you're trying to find the line that explains your data the best, right? In the terminology we've been using before, this would be the most likely line that explains the data. So you can see that we've just added another dimension to our statistical problem. We don't have just x's, we also have y's, and we're trying to find the most likely explanation of the relationship between y and x. All right, and so in practice, the way it's going to look is that we have two parameters to find, the slope b and the intercept a, and given data, the goal is going to be to find the best possible line. All right?
So what we're going to find is not exactly a and b, the ones that actually generated the data, but some estimators of those parameters, a hat and b hat, constructed from the data. All right, so we'll see that more generally, but we're not going to go too much into the details of this. There's actually quite a bit that you can understand if you do what's called univariate regression, when x is a real-valued random variable. So when this happens, this is called univariate regression. And when x is in R^p for p larger than or equal to 2, this is called multivariate regression. OK, and so here we're just trying to explain y = a + bx + epsilon. And there we're going to have something more complicated. We're going to have y = a + b1*x1 + b2*x2 + ... + bp*xp + epsilon, where the coordinates of x are given by x1, ..., xp in R^p. OK, so it's still linear. We still add all the coordinates of x, each with a coefficient in front of it, but it's a bit more complicated than just one coefficient for one coordinate of x, OK?
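[The univariate and multivariate models just described can be sketched numerically. The sketch below simulates data from both; all the specific values (a, b, the noise scale, the sample size) are illustrative assumptions, not numbers from the lecture.]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # sample size (arbitrary choice for the demo)

# Univariate model: y = a + b*x + epsilon
a, b = 1.0, 2.0                       # illustrative intercept and slope
x = rng.normal(size=n)
y = a + b * x + rng.normal(scale=0.5, size=n)

# Multivariate model: y = a + b1*x1 + ... + bp*xp + epsilon = a + x^T b + epsilon
p = 3
B = np.array([0.5, -1.0, 2.0])        # illustrative coefficients b1, ..., bp
X = rng.normal(size=(n, p))           # each row is one observation x in R^p
y_multi = a + X @ B + rng.normal(scale=0.5, size=n)

print(y.shape, y_multi.shape)
```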
So we'll come back to multivariate regression. Of course, you can write this as x transpose b, right? So this entire linear combination is of the form x transpose b, where b is the vector that has coordinates b1 to bp. OK? Sorry, here I wrote R^d, but p is the natural notation. All right, so our goal here, in the univariate case, is to try to write the model, to make sense of this little twiddle here. Essentially, from a statistical modeling point of view, the question is going to be: what distributional assumptions do you want to put on epsilon? Are you going to say it's Gaussian? Are you going to say it's binomial? Are you going to say it's Bernoulli? So that's what we're going to make sense of, and then we're going to try to find a method to estimate a and b. And then maybe we're going to try to do some inference about a and b: maybe test whether a and b take certain values, or whether they're less than something, maybe find some confidence regions for a and b, all right? So why would you want to do this?
Well, I'm sure all of you have an application: given some x, you're trying to predict what y is. Machine learning is all about doing this, right? Without maybe even trying to understand the physics behind it, they're saying: you give me a bag of words, I want to know whether it's going to be spam or not. You give me a bunch of economic indicators, I want you to tell me how much I should be selling my car for. You give me a bunch of measurements on some patient, I want you to predict how this person is going to respond to my drug, and things like this. All right, and often we actually don't have much modeling intuition about what the relationship between x and y is, and this linear thing is basically the simplest function we can think of. Arguably, linear functions are the simplest functions that are not trivial. Otherwise, we would just predict y to be a constant, meaning it does not depend on x. But if you want it to depend on x, then linear functions are basically as simple as it gets. It turns out, amazingly, this does the trick quite often.
So for example, if you look at economics, you might want to assume that the demand is a linear function of the price. So if your price is zero, there's going to be a certain demand. And as the price increases, the demand is going to move. Do you think b is going to be positive or negative here? What? Typically, it's negative, unless we're talking about maybe luxury goods, where, you know, the more expensive, the more people actually want it. If we're talking about actual economic demand, that's almost certainly negative. The relationship doesn't have to be obviously linear, either; sometimes you can transform it into something linear. So for example, you have this multiplicative relationship, PV = nRT, which is the ideal gas law. Say you want to predict what the pressure is going to be as a function of the volume and the temperature. Well, let's assume that n, the amount of gas, is fixed, and that the constant R is fixed. Then you take the log on each side.
So what that means is that log(PV) = log(nRT). So log P + log V = log(nR) + log T. We said that R is constant and n is fixed, so log(nR) is actually a constant; I'm going to call it a. And then that means that log P = a - log V + log T. OK? And so in particular, if I write b = -1 and c = +1, this gives me the formula that I have here: log P = a + b log V + c log T. Now again, this is the ideal gas law. In practice, if I start recording pressure, and temperature, and volume, I might make measurement errors, there might be slightly different conditions, in such a way that I'm not going to get exactly this. And I'm just going to put this little twiddle to account for the fact that the points that I'm going to be recording for log pressure, log volume, and log temperature are not going to be exactly on one line. OK, they're going to be close. Actually, in those physics experiments, usually, they're very close, because the conditions are controlled in lab experiments.
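[The log-linearization done on the board can be checked numerically: taking logs of PV = nRT gives log P = a - log V + log T with a = log(nR). The values of n, R, V, and T below are illustrative assumptions for the check.]

```python
import numpy as np

# Ideal gas law PV = nRT, with n (amount of gas) and R held fixed;
# the numbers are illustrative, not from the lecture.
n_gas, R = 1.0, 8.314
a = np.log(n_gas * R)                 # the constant term a = log(nR)

V = np.array([1.0, 2.0, 5.0])         # volumes (arbitrary units)
T = np.array([300.0, 310.0, 320.0])   # temperatures
P = n_gas * R * T / V                 # pressure from the gas law

# The log-linear form derived on the board: log P = a - log V + log T
log_P_linear = a - np.log(V) + np.log(T)
print(np.allclose(np.log(P), log_P_linear))  # True
```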
So in those experiments the noise is very small. But for other cases, like demand and prices, it's not a law of physics, and so this can change. Even the linear structure is probably not clear, right; at some point, there's probably going to be some weird curvature happening. All right, so this slide is just to tell you: maybe you don't have, obviously, a linear relationship, but maybe you do if you start taking logs, exponentials, squares. You can sometimes take the product of two variables, things like this. So this is variable transformation, and it's mostly domain-specific, so we're not going to go into more detail. Any questions? All right, so now let's start thinking a little more about what these coefficients should be. Well, first, is everybody clear on why I don't put the little i here? Right, I don't put the little i because I'm just talking about a generic x and a generic y, but the observations are x1, y1, right.
So typically, on the blackboard I'm often going to write only x, y, but the data really is x1, y1, all the way to xn, yn. So those are the points in this two-dimensional plot. But I think of them as being independent copies of the pair (x, y). They have to contain their relationship. And so when I talk about the distribution of those random variables, I talk about the distribution of (x, y), and it's the same for each pair. All right, so the first thing you might want to ask is: if I have an infinite amount of data, what can I hope to get for a and b? If my sample size goes to infinity, then I should actually know exactly what the distribution of (x, y) is. And so there should be an a and a b that capture this linear relationship between y and x. And so in particular, we're going to ask for the population, or theoretical, values of a and b, and you'll see that you can actually compute them explicitly. So let's just try to find out how. So as I said, we have a bunch of points close to a line, and I'm trying to find the best fit.
All right, so this guy is not a good fit. This guy is not a good fit. And we know that this guy is a good fit, somehow. So we need to mathematically formulate the fact that this line here is better than this line here, or better than this line here. So what we're trying to do is to create a function that has values that are smaller for this curve and larger for these two curves. And the way we do it is by measuring the fit, and the fit is essentially the aggregate distance of all the points to the curve. And there are many ways I can measure the distance to a curve. So let's just open a parenthesis; we're going to do it for one point at a time. If I have a point, there are many ways I can measure its distance to the curve, right? I can measure it like that; that is one distance to the curve. I can measure it like that, by having a right angle here; that is another distance to the curve. Or I can measure it like that; that is yet another distance to the curve. There are many ways I can go about it.
It turns out that one of them is actually going to be fairly convenient for us, and that's the one that says: let's look at the square of the difference between y and the value of the line at x. So this is the curve, y = a + bx. Now, I'm going to think of this point as a random point, capital X, capital Y, so that it can be x1, y1 or x2, y2, et cetera. Now, I want to measure the distance. Can somebody tell me which of the three (the first one, the second one, or the third one) this formula, the expectation of (Y - a - bX) squared, is representing?
AUDIENCE: The second one.
PHILIPPE RIGOLLET: The second one, where I have the right angle? OK, does everybody agree with this? Anybody want to vote for something else? Yeah?
AUDIENCE: The third one?
PHILIPPE RIGOLLET: The third one? Does everybody agree with the third one? So by default, everybody's on the first one? Yeah, it is the vertical distance, actually.
And the reason is that if it was the one with the right angle, it would actually be a very complicated mathematical formula, so let's just see why. And by why, I mean y. OK, so this means that this is my x, and this is my y. All right, so that means that this point is (x, y). So what I'm measuring is the difference between y and a + b times x. This is the thing I'm going to take the expectation of; the square and then the expectation. So a + b times x: if this is the line, this is this point, so that's this value here. This value here is a + bx, right? So what I'm really measuring is the difference between y and a + bx, which is this distance here. And since I like things like the Pythagorean theorem, I'm actually going to put a square here before I take the expectation. So now this is a random variable, this random variable here. And I want a number, so I'm going to turn it into a deterministic number. And the way I do this is by taking the expectation.
And if you think of expectations as being close to averages, this is the same thing as saying: I want the y's, on average, to be close to a + bx, right? So we're doing it in expectation, but that's going to translate into doing it on average over all the points. All right, so this is the thing I want to measure. So that's this vertical distance. Yeah? OK. This is my fault, actually. Maybe we should close those shades. OK, I cannot do just one at a time, sorry. All right, so now that I have those vertical distances, I can ask; well, now I have this function, right? A function that takes two parameters, a and b, and maps them to the expectation of (y - (a + bx)) squared. Sorry, the square is here. And I could ask: well, this is a function that measures the fit of the parameters a and b, right? This function should be small. It's a function of a and b that measures how close the point (x, y) is to the line y = a + bx, where y is equal to a + bx in expectation. OK, agreed?
This is what we just said. Again, if you're not comfortable with the reason why you get expectations, just think about having data points and taking the average value of this guy. So it's basically an aggregate distance of the points to the line. OK, does everybody agree this is a legitimate measure? If all my points were on the line, if y was actually equal to a + bx for some a and b, then this function would be equal to 0 for the correct a and b, right? If they are not, well, it's going to depend on how much noise I'm getting, but it's still going to be minimized for the best one. So let's minimize this thing. So here, I don't make any; again, sorry. I don't make any assumption on the distribution of x or y. Here, I assume, somehow, that the variance of x is not equal to 0. Can somebody tell me why? Yeah?
AUDIENCE: Not really a question. On the slides, you have y minus a minus bx, quantity squared, expectation of that, and here you've written the square of the expectation.
PHILIPPE RIGOLLET: No, here I actually have the expectation of the square.
If I wanted to write the square of the expectation, I would just do this. So let's just make it clear. Right? Do you want me to put an extra set of parentheses? That's what you want me to do?
AUDIENCE: Yeah, it's just confusing with the [INAUDIBLE].
PHILIPPE RIGOLLET: OK, so that's the one that makes sense, the square of the expectation?
AUDIENCE: Yeah.
PHILIPPE RIGOLLET: Oh, the expectation of the square, sorry. Yeah, dyslexia. All right, any questions? Yeah?
AUDIENCE: Does this assume that the error is Gaussian?
PHILIPPE RIGOLLET: No.
AUDIENCE: I mean, in the sense that, if we knew that the error followed, say, an e to the minus x to the fourth distribution, would we want to minimize the expectation of the fourth power of y minus a minus bx in order to get the best fit?
PHILIPPE RIGOLLET: Why? So you know the answer to your question, so I just want you to use the words. Right, so why would you want to use the fourth power?
429 00:23:04,756 --> 00:23:06,429 AUDIENCE: Well, because, like, we 430 00:23:06,429 --> 00:23:08,137 want to more strongly penalize deviations 431 00:23:08,137 --> 00:23:11,518 because we'd expect very large deviations to be 432 00:23:11,518 --> 00:23:15,870 very rare, or more rare, than they would 433 00:23:15,870 --> 00:23:18,170 be with the Gaussian [INAUDIBLE] power. 434 00:23:18,170 --> 00:23:19,360 PHILIPPE RIGOLLET: Yeah, so that would be the maximum likelihood 435 00:23:19,360 --> 00:23:21,290 estimator that you're describing to me, right? 436 00:23:21,290 --> 00:23:22,850 I can actually write the likelihood 437 00:23:22,850 --> 00:23:25,340 of a pair of numbers a, b. 438 00:23:25,340 --> 00:23:26,847 And if I know this, that's actually 439 00:23:26,847 --> 00:23:28,430 what's going to come into it because I 440 00:23:28,430 --> 00:23:31,610 know that the density is going to come into play when 441 00:23:31,610 --> 00:23:32,740 I talk about that. 442 00:23:32,740 --> 00:23:34,580 But here, I'm just talking about-- 443 00:23:34,580 --> 00:23:36,350 this is a mechanical tool. 444 00:23:36,350 --> 00:23:39,640 I'm just saying, let's minimize the distance to the curve. 445 00:23:39,640 --> 00:23:42,320 Another thing I could have done is take the absolute value 446 00:23:42,320 --> 00:23:43,750 of this thing, for example. 447 00:23:43,750 --> 00:23:46,190 I just decided to take the square root before I did it. 448 00:23:46,190 --> 00:23:48,630 OK, so regardless of what I'm doing, 449 00:23:48,630 --> 00:23:50,600 I'm just taking the squares because that's just 450 00:23:50,600 --> 00:23:53,600 going to be convenient for me to do my computations for now. 451 00:23:53,600 --> 00:23:55,400 But we don't have any statistical model 452 00:23:55,400 --> 00:23:56,940 at this point. 453 00:23:56,940 --> 00:23:59,040 I didn't say anything-- that y follows this. 454 00:23:59,040 --> 00:24:00,320 X follows this.
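As a concrete version of this "mechanical tool," here is a small Python sketch (my own illustration, not from the lecture) of the empirical least-squares criterion: the average squared vertical distance of the data points to a candidate line y = a + bx.

```python
# Empirical version of the least-squares criterion E[(Y - (a + bX))^2]:
# the average squared vertical distance of the points to the line y = a + b*x.
def squared_loss(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Points lying exactly on the line y = 1 + 2x give zero loss,
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(squared_loss(1.0, 2.0, xs, ys))  # 0.0
# while a different line gives a strictly positive value.
print(squared_loss(0.0, 2.0, xs, ys))  # 1.0
```

Minimizing this over (a, b) is exactly the least-squares problem worked out next; the absolute value mentioned above would instead give a least-absolute-deviation criterion.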
455 00:24:00,320 --> 00:24:01,760 I'm just making minimal assumptions 456 00:24:01,760 --> 00:24:04,250 as we go, all right? 457 00:24:04,250 --> 00:24:06,140 So the variance of x is not equal to 0? 458 00:24:06,140 --> 00:24:07,270 Could somebody tell me why? 459 00:24:11,330 --> 00:24:14,490 What would my point cloud look like if the variance of x 460 00:24:14,490 --> 00:24:16,130 was equal to 0? 461 00:24:16,130 --> 00:24:18,122 Yeah, they would all be at the same point. 462 00:24:18,122 --> 00:24:20,580 So it's going to be hard for me to start fitting a line, 463 00:24:20,580 --> 00:24:21,180 right? 464 00:24:21,180 --> 00:24:24,100 I mean, best case scenario, I have this x. 465 00:24:24,100 --> 00:24:26,700 It has variance zero, so this is the expectation of x. 466 00:24:26,700 --> 00:24:31,020 And all my points have the same expectation, 467 00:24:31,020 --> 00:24:33,780 and so, yes, I could probably fit that line. 468 00:24:33,780 --> 00:24:38,340 But that wouldn't help very much for other x's. 469 00:24:38,340 --> 00:24:41,400 So I need a bit of variance so that things spread out 470 00:24:41,400 --> 00:24:42,010 a little bit. 471 00:24:47,440 --> 00:24:51,130 OK, I'm going to have to do this. 472 00:24:51,130 --> 00:24:52,370 I think it's just my-- 473 00:25:10,200 --> 00:25:13,460 All right, so I'm going to put a little bit of variance. 474 00:25:13,460 --> 00:25:15,960 And the other thing is here, I don't want to do much more, 475 00:25:15,960 --> 00:25:22,440 but I'm actually going to think of x as having mean zero. 476 00:25:22,440 --> 00:25:24,430 And the way I do this is as follows. 477 00:25:24,430 --> 00:25:30,570 Let's define x tilde, which is x minus the expectation of x. 478 00:25:30,570 --> 00:25:33,920 OK, so definitely the expectation of x tilde is what? 479 00:25:36,620 --> 00:25:38,110 Zero, OK.
480 00:25:38,110 --> 00:25:43,350 And so now I want to minimize in a, b, the expectation 481 00:25:43,350 --> 00:25:53,920 of y minus a plus bx, squared. 482 00:25:53,920 --> 00:26:03,810 And the way I'm going to do this is by turning x into x tilde 483 00:26:03,810 --> 00:26:07,060 and stuffing the extra-- 484 00:26:07,060 --> 00:26:12,760 and putting the extra expectation of x into the a. 485 00:26:12,760 --> 00:26:19,840 So I'm going to write this as an expectation of y minus a plus 486 00:26:19,840 --> 00:26:25,180 b expectation of x-- 487 00:26:25,180 --> 00:26:27,530 which I'm going to call a tilde-- 488 00:26:27,530 --> 00:26:30,300 and plus b x tilde. 489 00:26:33,930 --> 00:26:35,630 OK? 490 00:26:35,630 --> 00:26:38,920 And everybody agrees with this? 491 00:26:38,920 --> 00:26:41,490 So now I have two parameters, a tilde and b, 492 00:26:41,490 --> 00:26:44,350 and I'm going to pretend that now x tilde-- 493 00:26:44,350 --> 00:26:50,830 so now the role of x is played by x tilde, which is now 494 00:26:50,830 --> 00:26:53,020 a centered random variable. 495 00:26:53,020 --> 00:26:55,660 OK, so I'm going to call this guy a tilde, 496 00:26:55,660 --> 00:26:58,859 but for my computations I'm going to call it a. 497 00:26:58,859 --> 00:27:00,650 So how do I find the minimum of this thing? 498 00:27:05,114 --> 00:27:06,620 Derivative equal to zero, right? 499 00:27:06,620 --> 00:27:08,235 So here it's a quadratic thing. 500 00:27:08,235 --> 00:27:09,360 It's going to be like that. 501 00:27:09,360 --> 00:27:10,880 I take the derivative, set it to zero. 502 00:27:10,880 --> 00:27:13,130 So I'm first going to take the derivative with respect 503 00:27:13,130 --> 00:27:16,370 to a and set it equal to zero, so that's equivalent to saying 504 00:27:16,370 --> 00:27:18,320 that the expectation of-- 505 00:27:18,320 --> 00:27:21,315 well, here, I'm going to pick up a 2-- 506 00:27:21,315 --> 00:27:33,720 y minus a plus bx tilde is equal to zero.
507 00:27:33,720 --> 00:27:36,580 And then I also have that the derivative with respect to b is 508 00:27:36,580 --> 00:27:40,260 equal to zero, which is equivalent to the expectation 509 00:27:40,260 --> 00:27:42,100 of-- well, I have a negative sign somewhere, 510 00:27:42,100 --> 00:27:43,410 so let me put it here-- 511 00:27:43,410 --> 00:27:50,950 minus 2x tilde, y minus a plus bx tilde. 512 00:27:55,644 --> 00:27:58,910 OK, see that's why I don't want to put too many parentheses. 513 00:28:03,140 --> 00:28:05,741 OK. 514 00:28:05,741 --> 00:28:07,490 So I just took the derivative with respect 515 00:28:07,490 --> 00:28:09,920 to a, which is just basically the square, 516 00:28:09,920 --> 00:28:12,569 and then I have a negative 1 that comes out from inside. 517 00:28:12,569 --> 00:28:14,360 And then I take the derivative with respect 518 00:28:14,360 --> 00:28:17,010 to b, and since b has x tilde 519 00:28:17,010 --> 00:28:19,340 as a factor, it comes out as well. 520 00:28:19,340 --> 00:28:24,420 All right, so the minus 2's really won't matter for me. 521 00:28:24,420 --> 00:28:26,706 And so now I have two equations. 522 00:28:26,706 --> 00:28:28,580 The first equation, while it's pretty simple, 523 00:28:28,580 --> 00:28:31,955 it's just telling me that the expectation of y minus a 524 00:28:31,955 --> 00:28:33,710 is equal to zero. 525 00:28:33,710 --> 00:28:41,870 So what I know is that a is equal to the expectation of y. 526 00:28:41,870 --> 00:28:44,060 And really that was a tilde, which 527 00:28:44,060 --> 00:28:47,870 implies that the a I want is actually 528 00:28:47,870 --> 00:29:00,690 equal to the expectation of y minus b 529 00:29:00,690 --> 00:29:05,030 times the expectation of x. 530 00:29:05,030 --> 00:29:05,530 OK? 531 00:29:10,240 --> 00:29:13,450 Just because a tilde is a plus b times the expectation of x. 532 00:29:16,830 --> 00:29:19,360 So that's for my a. 533 00:29:19,360 --> 00:29:22,180 And then for my b, I use the second one.
534 00:29:22,180 --> 00:29:27,990 So the second one tells me that the expectation of x tilde times y 535 00:29:27,990 --> 00:29:32,430 is equal to a plus b times the expectation of x tilde, 536 00:29:32,430 --> 00:29:33,520 which is zero, right? 537 00:29:38,640 --> 00:29:39,460 OK? 538 00:29:39,460 --> 00:29:41,630 But this a is actually a tilde in this problem, 539 00:29:41,630 --> 00:29:47,210 so it's actually a plus b expectation of x. 540 00:29:51,900 --> 00:29:53,890 Now, this is the expectation of the product 541 00:29:53,890 --> 00:29:57,480 of two random variables, but x tilde is centered, right? 542 00:29:57,480 --> 00:30:00,670 It's x minus expectation of x, so this thing is actually 543 00:30:00,670 --> 00:30:03,640 equal to the covariance between x and y 544 00:30:03,640 --> 00:30:05,140 by definition of covariance. 545 00:30:09,130 --> 00:30:11,840 So now I have everything I need, right. 546 00:30:11,840 --> 00:30:14,110 How do I just-- 547 00:30:14,110 --> 00:30:16,520 I'm sorry about that. 548 00:30:16,520 --> 00:30:18,330 So I have everything I need. 549 00:30:18,330 --> 00:30:22,560 Now I have two equations with two unknowns, 550 00:30:22,560 --> 00:30:25,110 and all I have to do is to basically plug it in. 551 00:30:25,110 --> 00:30:29,460 So it's essentially telling me that the covariance of xy-- 552 00:30:29,460 --> 00:30:31,980 so the first equation tells me that the covariance of xy 553 00:30:31,980 --> 00:30:36,750 is equal to a plus b expectation of x, but a is expectation of y 554 00:30:36,750 --> 00:30:39,640 minus b expectation of x. 555 00:30:39,640 --> 00:30:45,113 So it's-- well, actually, maybe I should start with b. 556 00:30:54,780 --> 00:30:56,010 Oh, sorry. 557 00:30:56,010 --> 00:30:59,580 OK, I forgot one thing. 558 00:30:59,580 --> 00:31:00,750 This is not true, right. 559 00:31:00,750 --> 00:31:02,516 I forgot this term.
560 00:31:02,516 --> 00:31:05,850 x tilde multiplies x tilde here, so what 561 00:31:05,850 --> 00:31:07,680 I'm left with is x tilde-- 562 00:31:07,680 --> 00:31:11,320 it's minus b times the expectation of x tilde squared. 563 00:31:11,320 --> 00:31:14,800 So that's actually minus b times the variance of x 564 00:31:14,800 --> 00:31:17,970 tilde because x tilde is already centered, 565 00:31:17,970 --> 00:31:19,760 which is actually the variance of x. 566 00:31:23,850 --> 00:31:29,790 So now I have that this thing is actually a plus b expectation 567 00:31:29,790 --> 00:31:36,570 of x minus b variance of x. 568 00:31:36,570 --> 00:31:42,180 And I also have that a is equal to expectation 569 00:31:42,180 --> 00:31:45,960 of y minus b expectation of x. 570 00:31:53,720 --> 00:31:58,100 So if I sum the two, those guys are going to cancel. 571 00:31:58,100 --> 00:32:00,740 Those guys are going to cancel. 572 00:32:00,740 --> 00:32:05,630 And so what I'm going to be left with is covariance of xy 573 00:32:05,630 --> 00:32:10,570 is equal to expectation of x, expectation of y, 574 00:32:10,570 --> 00:32:12,610 and then I'm left with this term here, minus 575 00:32:12,610 --> 00:32:14,050 b times the variance of x. 576 00:32:17,070 --> 00:32:20,171 And so that tells me that b-- 577 00:32:20,171 --> 00:32:21,796 why do I still have the variance there? 578 00:32:34,692 --> 00:32:37,668 AUDIENCE: So is the covariance really 579 00:32:37,668 --> 00:32:43,124 the expectation of x tilde times y minus expectation of y? 580 00:32:43,124 --> 00:32:46,092 Because y is not centered, correct? 581 00:32:46,092 --> 00:32:47,092 PHILIPPE RIGOLLET: Yeah. 582 00:32:47,092 --> 00:32:48,814 AUDIENCE: OK, but x is still centered. 583 00:32:48,814 --> 00:32:50,980 PHILIPPE RIGOLLET: But x is still centered, right. 584 00:32:50,980 --> 00:32:52,700 So you just need to have one that's 585 00:32:52,700 --> 00:32:53,830 centered for this to work.
586 00:32:57,187 --> 00:32:58,520 Right, I mean, you can check it. 587 00:32:58,520 --> 00:33:00,144 But basically when you're going to have 588 00:33:00,144 --> 00:33:02,877 the product of the expectations, you only need one of the two 589 00:33:02,877 --> 00:33:03,960 in the product to be zero. 590 00:33:03,960 --> 00:33:04,920 So the product is zero. 591 00:33:09,090 --> 00:33:11,020 OK, why do I keep my-- 592 00:33:11,020 --> 00:33:14,542 so I get a, a, and then the b expectation. 593 00:33:14,542 --> 00:33:16,750 OK, so that's probably earlier that I made a mistake. 594 00:33:25,620 --> 00:33:29,140 So I get-- so this was a tilde. 595 00:33:29,140 --> 00:33:30,548 Let's just be clear about the-- 596 00:33:40,508 --> 00:33:43,350 So that tells me that a tilde-- 597 00:33:43,350 --> 00:33:45,570 maybe it's not super fair of me to-- 598 00:33:48,310 --> 00:33:50,426 yeah, OK, I think I know where I made a mistake. 599 00:33:50,426 --> 00:33:51,550 I should not have centered. 600 00:33:51,550 --> 00:33:54,760 I wanted to make my life easier, and I should not 601 00:33:54,760 --> 00:33:55,960 have done that. 602 00:33:55,960 --> 00:33:59,140 And the reason is a tilde depends on b, 603 00:33:59,140 --> 00:34:01,780 so when I take the derivative with respect 604 00:34:01,780 --> 00:34:04,840 to b, what I'm left with here-- 605 00:34:04,840 --> 00:34:06,880 since a tilde depends on b, when I 606 00:34:06,880 --> 00:34:09,370 take the derivative of this guy, I actually 607 00:34:09,370 --> 00:34:12,550 don't get a tilde here, but I really get-- 608 00:34:17,570 --> 00:34:20,896 so again, this was not-- 609 00:34:20,896 --> 00:34:21,960 so that's the first one. 610 00:34:30,389 --> 00:34:33,800 This is actually x here-- 611 00:34:33,800 --> 00:34:38,050 because when I take the derivative with respect to b. 
612 00:34:38,050 --> 00:34:40,929 And so now, what I'm left with is that the expectation-- so 613 00:34:40,929 --> 00:34:43,929 yeah, I'm basically left with nothing that helps. 614 00:34:43,929 --> 00:34:46,300 So I'm sorry about that. 615 00:34:46,300 --> 00:34:49,929 Let's start from the beginning because this is not 616 00:34:49,929 --> 00:34:53,090 getting us anywhere, and a fix is not going to help. 617 00:34:53,090 --> 00:34:55,370 So let's just do it again. 618 00:34:55,370 --> 00:34:56,320 Sorry about that. 619 00:34:56,320 --> 00:34:59,230 So let's not center anything and just do brute force 620 00:34:59,230 --> 00:35:01,120 because we're going to-- 621 00:35:01,120 --> 00:35:04,820 b x squared. 622 00:35:04,820 --> 00:35:07,270 All right. 623 00:35:07,270 --> 00:35:09,520 Setting the partial with respect to a 624 00:35:09,520 --> 00:35:11,920 equal to zero, my minus 2 625 00:35:11,920 --> 00:35:13,060 is going to cancel, right. 626 00:35:13,060 --> 00:35:14,851 So I'm going to actually forget about this. 627 00:35:14,851 --> 00:35:17,980 So it's actually telling me that the expectation 628 00:35:17,980 --> 00:35:25,660 of y minus a plus bx is equal to zero, which 629 00:35:25,660 --> 00:35:31,090 is equivalent to a plus b expectation of x, is 630 00:35:31,090 --> 00:35:33,775 equal to the expectation of y. 631 00:35:33,775 --> 00:35:35,650 Now, if I take the derivative with respect to 632 00:35:35,650 --> 00:35:38,830 b and set it equal to zero, this is telling me 633 00:35:38,830 --> 00:35:41,656 that the expectation of-- 634 00:35:41,656 --> 00:35:43,780 well, it's the same thing except that this time I'm 635 00:35:43,780 --> 00:35:45,280 going to pull out an x.
636 00:35:52,470 --> 00:35:54,310 This guy is equal to zero-- 637 00:35:54,310 --> 00:35:56,660 this guy is not here-- 638 00:35:56,660 --> 00:36:03,650 and so that implies that the expectation of xy 639 00:36:03,650 --> 00:36:09,560 is equal to a times the expectation of x, 640 00:36:09,560 --> 00:36:16,726 plus b times the expectation of x squared. 641 00:36:16,726 --> 00:36:17,226 OK? 642 00:36:21,540 --> 00:36:26,720 All right, so the first one is actually not giving me much, 643 00:36:26,720 --> 00:36:29,700 so I need to actually work with the two of those guys. 644 00:36:29,700 --> 00:36:31,470 So I'm going to take the first-- 645 00:36:31,470 --> 00:36:33,690 so let me rewrite those two equations that I have. 646 00:36:33,690 --> 00:36:40,830 I have a plus b, e of x is equal to e of y. 647 00:36:40,830 --> 00:36:43,092 And then I have e of xy. 648 00:36:50,970 --> 00:37:01,160 OK, and now what I do is that I multiply this guy. 649 00:37:01,160 --> 00:37:03,230 So I want to cancel one of those things, right? 650 00:37:03,230 --> 00:37:04,455 So what I'm going to-- 651 00:37:12,197 --> 00:37:13,780 so I'm going to take this guy, and I'm 652 00:37:13,780 --> 00:37:19,030 going to multiply it by e of x and take the difference. 653 00:37:19,030 --> 00:37:26,330 So I do times e of x, and then I take the sum of those two, 654 00:37:26,330 --> 00:37:28,840 and then those two terms are going to cancel. 655 00:37:28,840 --> 00:37:33,550 So then that tells me that b times e 656 00:37:33,550 --> 00:37:45,180 of x squared, plus the expectation of xy is equal to-- 657 00:37:45,180 --> 00:37:48,423 so this guy is the one that cancelled. 658 00:37:53,850 --> 00:37:56,570 Then I get this guy here, expectation 659 00:37:56,570 --> 00:38:02,450 of x times the expectation of y, plus the guy that 660 00:38:02,450 --> 00:38:04,070 remains here-- 661 00:38:04,070 --> 00:38:08,752 which is b times the expectation of x squared.
662 00:38:11,920 --> 00:38:16,220 So here I have b expectation of x, the whole thing squared. 663 00:38:16,220 --> 00:38:18,400 And here I have b expectation of x square. 664 00:38:18,400 --> 00:38:22,440 So if I pull this guy here, what do I get? 665 00:38:22,440 --> 00:38:26,140 b times the variance of x, OK? 666 00:38:26,140 --> 00:38:28,180 So I'm going to move here. 667 00:38:28,180 --> 00:38:31,160 And this guy here, when I move this guy here, 668 00:38:31,160 --> 00:38:32,980 I get the expectation of x times y, 669 00:38:32,980 --> 00:38:35,590 minus the expectation of x times the expectation of y. 670 00:38:35,590 --> 00:38:40,540 So this is actually telling me that the covariance of x and y 671 00:38:40,540 --> 00:38:45,450 is equal to b times the variance of x. 672 00:38:45,450 --> 00:38:48,840 And so then that tells me that b is 673 00:38:48,840 --> 00:38:55,519 equal to covariance of xy divided by the variance of x. 674 00:38:55,519 --> 00:38:57,310 And that's why I actually need the variance 675 00:38:57,310 --> 00:39:01,690 of x to be non-zero because I couldn't do that otherwise. 676 00:39:01,690 --> 00:39:03,190 And because if it was, it would mean 677 00:39:03,190 --> 00:39:04,890 that b should be plus infinity, which 678 00:39:04,890 --> 00:39:08,220 is what the limit of this guy is when the variance goes 679 00:39:08,220 --> 00:39:11,200 to zero or negative infinity. 680 00:39:11,200 --> 00:39:14,410 I can not sort them out. 681 00:39:14,410 --> 00:39:16,130 All right, so I'm sorry about the mess, 682 00:39:16,130 --> 00:39:19,070 but that should be more clear. 683 00:39:19,070 --> 00:39:21,410 Then a, of course, you can write it 684 00:39:21,410 --> 00:39:23,240 by plugging in the value of b, so you 685 00:39:23,240 --> 00:39:27,030 know it's only a function of your distribution, right? 
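Numerically, the closed form just derived, b equals the covariance of x and y over the variance of x, and a equals the mean of y minus b times the mean of x, can be checked with a short Python sketch (the function name and data are my own, for illustration), with the expectations replaced by sample averages:

```python
def fit_line(xs, ys):
    # b = Cov(x, y) / Var(x),  a = mean(y) - b * mean(x),
    # with population expectations replaced by sample averages.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    if var_x == 0:
        # This is exactly why we need Var(x) != 0 above.
        raise ValueError("variance of x is zero: slope is not identifiable")
    b = cov_xy / var_x
    return my - b * mx, b

# Data lying exactly on the line y = -1 + 3x is recovered exactly.
a, b = fit_line([0.0, 1.0, 2.0], [-1.0, 2.0, 5.0])
print(round(a, 9), round(b, 9))  # -1.0 3.0
```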
686 00:39:27,030 --> 00:39:29,240 So what are the characteristics of the distribution-- 687 00:39:29,240 --> 00:39:31,031 so a distribution can have a bunch of things. 688 00:39:31,031 --> 00:39:34,330 It can have moments of order 4, of order 26. 689 00:39:34,330 --> 00:39:36,590 It can have heavy tails or light tails. 690 00:39:36,590 --> 00:39:39,320 But when you compute least squares, 691 00:39:39,320 --> 00:39:41,900 the only things that matter are the variance 692 00:39:41,900 --> 00:39:45,320 of x, the expectation of the individual ones-- 693 00:39:45,320 --> 00:39:50,300 and really what captures how y changes when you change x, 694 00:39:50,300 --> 00:39:51,590 is captured in the covariance. 695 00:39:51,590 --> 00:39:54,510 The rest is really just normalization. 696 00:39:54,510 --> 00:39:58,550 It's just telling you, I want things to cross the y-axis 697 00:39:58,550 --> 00:39:59,360 at the right place. 698 00:39:59,360 --> 00:40:02,330 I want things to cross the x-axis at the right place. 699 00:40:02,330 --> 00:40:05,720 But the slope is really captured by how much more covariance 700 00:40:05,720 --> 00:40:08,330 you have relative to the variance of x. 701 00:40:08,330 --> 00:40:12,350 So this is essentially setting the scale for the x-axis, 702 00:40:12,350 --> 00:40:15,410 and this is telling you for a unit scale, 703 00:40:15,410 --> 00:40:20,460 this is the unit of y that you're changing. 704 00:40:20,460 --> 00:40:23,600 OK, so we have explicit forms. 705 00:40:23,600 --> 00:40:26,300 And what I could do, if I wanted to estimate those things, 706 00:40:26,300 --> 00:40:32,510 is just say, well again, we have expectations, right?
707 00:40:32,510 --> 00:40:36,050 The expectation of xy minus the product of the expectations, 708 00:40:36,050 --> 00:40:38,510 I could replace expectations by averages 709 00:40:38,510 --> 00:40:40,310 and get an empirical covariance just 710 00:40:40,310 --> 00:40:42,710 like we can replace the expectations for the variance 711 00:40:42,710 --> 00:40:44,720 and get a sample variance. 712 00:40:44,720 --> 00:40:47,300 And this is basically what we're going to be doing. 713 00:40:47,300 --> 00:40:49,470 All right, this is essentially what you want. 714 00:40:49,470 --> 00:40:51,950 The problem is that if you view it that way, 715 00:40:51,950 --> 00:40:54,860 you sort of prevent yourself from being able to solve 716 00:40:54,860 --> 00:40:56,510 the multivariate problem. 717 00:40:56,510 --> 00:40:58,430 Because it's only in the univariate problem 718 00:40:58,430 --> 00:41:00,930 that you have closed form solutions for your problem. 719 00:41:00,930 --> 00:41:03,080 But if you actually go to multivariate, 720 00:41:03,080 --> 00:41:05,510 this is not where you want to replace expectations 721 00:41:05,510 --> 00:41:06,230 by averages. 722 00:41:06,230 --> 00:41:09,120 You actually want to replace expectation by averages here. 723 00:41:12,520 --> 00:41:14,950 And once you do it here, then you 724 00:41:14,950 --> 00:41:17,920 can actually just solve the minimization problem. 725 00:41:23,240 --> 00:41:29,840 OK, so one thing that arises from this guy 726 00:41:29,840 --> 00:41:35,795 is that this is an interesting formula. 727 00:41:40,640 --> 00:41:43,740 All right, think about it. 728 00:41:43,740 --> 00:42:00,190 If I have that y is a plus bx plus some noise. 729 00:42:00,190 --> 00:42:02,680 Things are no longer on the line. 730 00:42:02,680 --> 00:42:08,470 I have that y is equal to a plus bx plus some noise, which 731 00:42:08,470 --> 00:42:11,210 is usually denoted by epsilon. 732 00:42:11,210 --> 00:42:12,910 So that's the distribution, right?
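The distinction just made, plugging averages into the closed-form solution versus plugging averages into the criterion and then minimizing, can be sketched in Python with NumPy (my own illustration on simulated data); in one dimension the two routes agree, but only the second generalizes to several explanatory variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=500)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=500)

# Route 1: plug averages into the closed form b = Cov(x, y) / Var(x).
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a1 = y.mean() - b1 * x.mean()

# Route 2: plug averages into the criterion itself, i.e. minimize
# (1/n) * sum_i (y_i - a - b*x_i)^2 directly (least squares on [1, x]).
X = np.column_stack([np.ones_like(x), x])
a2, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose([a1, b1], [a2, b2]))  # True
```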
733 00:42:12,910 --> 00:42:15,760 If I tell you the distribution of x, and I 734 00:42:15,760 --> 00:42:17,470 say y is a plus bx plus epsilon-- 735 00:42:17,470 --> 00:42:18,940 I tell you the distribution of y, 736 00:42:18,940 --> 00:42:21,190 and if I assume that those two are independent, 737 00:42:21,190 --> 00:42:23,860 you have a distribution on y. 738 00:42:23,860 --> 00:42:27,364 So what happens is that I can actually always say-- well, you 739 00:42:27,364 --> 00:42:28,780 know, this is equivalent to saying 740 00:42:28,780 --> 00:42:35,560 that epsilon is equal to y minus a plus bx, right? 741 00:42:35,560 --> 00:42:37,540 I can always write this as just-- 742 00:42:37,540 --> 00:42:40,320 I mean, as a tautology. 743 00:42:40,320 --> 00:42:42,069 But here, for those guys-- 744 00:42:42,069 --> 00:42:43,360 this is not for any guy, right. 745 00:42:43,360 --> 00:42:45,770 This is really for the best fit, a 746 00:42:45,770 --> 00:42:50,170 and b, those ones that satisfy this gradient is 747 00:42:50,170 --> 00:42:51,610 equal to zero thing. 748 00:42:51,610 --> 00:42:55,330 Then what we had is that the expectation of epsilon 749 00:42:55,330 --> 00:42:59,380 was equal to expectation of y minus a plus 750 00:42:59,380 --> 00:43:03,430 b expectation of x by linearity of the expectation, which 751 00:43:03,430 --> 00:43:05,560 was equal to zero. 752 00:43:05,560 --> 00:43:10,180 So for this best fit we have zero. 753 00:43:10,180 --> 00:43:13,630 Now, the covariance between x and y-- 754 00:43:17,190 --> 00:43:20,530 Between, sorry, x and epsilon, is what? 755 00:43:20,530 --> 00:43:23,420 Well, it's the covariance between x-- 756 00:43:23,420 --> 00:43:27,540 and well, epsilon was y minus a plus bx.
757 00:43:30,100 --> 00:43:33,240 Now, the covariance is bilinear, so what I have 758 00:43:33,240 --> 00:43:35,640 is that the covariance of this is 759 00:43:35,640 --> 00:43:38,760 the covariance of x and y-- 760 00:43:38,760 --> 00:43:41,790 sorry, of x and y, minus the variance-- well, 761 00:43:41,790 --> 00:43:50,220 minus a plus b, covariance of x and x, 762 00:43:50,220 --> 00:43:54,720 which is the variance of x? 763 00:43:59,050 --> 00:44:03,510 Covariance of xy minus a plus b variance of x. 764 00:44:12,384 --> 00:44:13,300 OK, I didn't write it. 765 00:44:13,300 --> 00:44:16,080 So here I have covariance of xy is 766 00:44:16,080 --> 00:44:17,910 equal to b variance of x, right? 767 00:44:34,070 --> 00:44:35,270 Covariance of xy. 768 00:44:35,270 --> 00:44:38,057 Yeah, that's because they cannot do that with the covariance. 769 00:44:44,030 --> 00:44:46,520 Yeah, I have those averages again. 770 00:44:46,520 --> 00:44:48,320 No, because this is centered, right? 771 00:44:48,320 --> 00:44:51,000 Sorry, this is centered, so this is actually 772 00:44:51,000 --> 00:44:56,760 equal to the expectation of x times y minus a plus bx. 773 00:45:01,527 --> 00:45:03,110 The covariance is equal to the product 774 00:45:03,110 --> 00:45:05,750 just because this term inside is actually centered. 775 00:45:05,750 --> 00:45:09,980 So this is the expectation of x times y 776 00:45:09,980 --> 00:45:20,100 minus a times the expectation of x, minus 777 00:45:20,100 --> 00:45:23,013 b times the expectation of x squared. 778 00:45:32,200 --> 00:45:34,720 Well, actually maybe I should not really go too far. 779 00:45:38,894 --> 00:45:40,560 So this is actually the one that I need. 780 00:45:40,560 --> 00:45:47,300 But if I stop here, this is actually equal to zero, right. 781 00:45:47,300 --> 00:45:49,095 Those are the same equations. 782 00:45:52,065 --> 00:45:53,050 OK? 783 00:45:53,050 --> 00:45:53,550 Yeah?
784 00:45:53,550 --> 00:45:55,516 AUDIENCE: What are we doing right now? 785 00:45:55,516 --> 00:45:57,140 PHILIPPE RIGOLLET: So we're just saying 786 00:45:57,140 --> 00:46:01,070 that if I actually believe that this best fit was the one that 787 00:46:01,070 --> 00:46:02,990 gave me the right parameters, what would 788 00:46:02,990 --> 00:46:05,804 that imply on the noise itself, on this epsilon? 789 00:46:05,804 --> 00:46:07,220 So here we're actually just trying 790 00:46:07,220 --> 00:46:10,070 to find some necessary condition for the noise to hold-- 791 00:46:10,070 --> 00:46:11,030 for the noise. 792 00:46:11,030 --> 00:46:14,540 And so those conditions are, that first, the expectation 793 00:46:14,540 --> 00:46:15,290 is zero. 794 00:46:15,290 --> 00:46:17,090 That's what we've got here. 795 00:46:17,090 --> 00:46:20,480 And then, that the covariance between the noise and x 796 00:46:20,480 --> 00:46:22,900 has to be zero as well. 797 00:46:22,900 --> 00:46:24,770 OK, so those are actually conditions 798 00:46:24,770 --> 00:46:26,360 that the noise must satisfy. 799 00:46:26,360 --> 00:46:29,450 But the noise was just not really defined as noise itself. 800 00:46:29,450 --> 00:46:31,550 We were just saying, OK, if we're 801 00:46:31,550 --> 00:46:35,230 going to put some assumptions on the epsilon, what 802 00:46:35,230 --> 00:46:36,110 do we better have? 803 00:46:36,110 --> 00:46:38,360 So the first one is that it's centered, which is good, 804 00:46:38,360 --> 00:46:41,150 because otherwise, the noise would shift everything. 805 00:46:41,150 --> 00:46:45,620 So now when you look at a linear regression model-- 806 00:46:45,620 --> 00:46:48,590 typically, if you open a book, it doesn't start by saying, 807 00:46:48,590 --> 00:46:50,920 let the noise be the difference between y 808 00:46:50,920 --> 00:46:52,940 and what I actually want y to be. 809 00:46:52,940 --> 00:46:57,210 It says let y be a plus bx plus epsilon. 
810 00:46:57,210 --> 00:47:02,120 So conversely, if we assume that this is the model that we have, 811 00:47:02,120 --> 00:47:04,340 then we're going to have to assume that epsilon-- 812 00:47:04,340 --> 00:47:06,298 we're going to assume that epsilon is centered, 813 00:47:06,298 --> 00:47:10,840 and that the covariance between x and epsilon is zero. 814 00:47:10,840 --> 00:47:13,760 Actually, often, we're going to assume much more. 815 00:47:13,760 --> 00:47:17,600 And one way to ensure that those two things are satisfied 816 00:47:17,600 --> 00:47:19,940 is to assume that x is independent of epsilon, 817 00:47:19,940 --> 00:47:21,290 for example. 818 00:47:21,290 --> 00:47:23,940 If you assume that x is independent of epsilon, 819 00:47:23,940 --> 00:47:28,332 of course the covariance is going to be zero. 820 00:47:28,332 --> 00:47:30,720 Or we might assume that the conditional expectation 821 00:47:30,720 --> 00:47:35,450 of epsilon, given x, is equal to zero, then that implies that. 822 00:47:35,450 --> 00:47:38,710 OK, now the fact that it's centered is one thing. 823 00:47:38,710 --> 00:47:43,500 So if we make this assumption, the only thing it's telling us 824 00:47:43,500 --> 00:47:47,700 is that those ab's that come-- right, we started from there. 825 00:47:47,700 --> 00:47:51,240 y is equal to a plus bx plus some epsilon for some a, 826 00:47:51,240 --> 00:47:51,960 for some b. 827 00:47:51,960 --> 00:47:55,890 What it turns out is that those a's and b's are actually 828 00:47:55,890 --> 00:47:58,680 the ones that you would get by solving this expectation 829 00:47:58,680 --> 00:48:00,690 of square thing. 
830 00:48:00,690 --> 00:48:02,610 All right, so when you asked-- 831 00:48:02,610 --> 00:48:04,530 back when you were following-- 832 00:48:04,530 --> 00:48:07,170 so when you asked, you know, why don't we 833 00:48:07,170 --> 00:48:10,290 take the square, for example, or the power 834 00:48:10,290 --> 00:48:12,210 4, or something like this-- 835 00:48:12,210 --> 00:48:15,990 then here, I'm saying, well, if I have y is equal to a plus bx, 836 00:48:15,990 --> 00:48:19,230 I don't actually need to put too many assumptions on epsilon. 837 00:48:19,230 --> 00:48:22,320 If epsilon is actually satisfying those two things, 838 00:48:22,320 --> 00:48:25,620 expectation is equal to zero and the covariance 839 00:48:25,620 --> 00:48:28,912 with x is equal to zero, then the right a and b 840 00:48:28,912 --> 00:48:30,870 that I'm looking for are actually the ones that 841 00:48:30,870 --> 00:48:32,120 come with the square-- 842 00:48:32,120 --> 00:48:36,750 not with power 4 or power 25. 843 00:48:36,750 --> 00:48:39,300 So those are actually pretty weak assumptions. 844 00:48:39,300 --> 00:48:41,510 If we want to do inference, we're 845 00:48:41,510 --> 00:48:43,350 going to have to assume slightly more. 846 00:48:43,350 --> 00:48:45,690 If we want to use T-distributions at some point, 847 00:48:45,690 --> 00:48:47,520 for example, and we will, we're going 848 00:48:47,520 --> 00:48:50,800 to have to assume that epsilon has a Gaussian distribution. 849 00:48:50,800 --> 00:48:53,700 So if you want to start doing more statistics beyond just 850 00:48:53,700 --> 00:48:56,550 like doing this least squares thing, which is minimizing 851 00:48:56,550 --> 00:48:58,350 the squared criterion, you're actually 852 00:48:58,350 --> 00:48:59,933 going to have to put more assumptions. 853 00:48:59,933 --> 00:49:01,710 But right now, we did not need them. 854 00:49:01,710 --> 00:49:04,210 We only need that epsilon has mean zero and covariance 855 00:49:04,210 --> 00:49:04,998 zero with x.
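These two necessary conditions on the noise, mean zero and covariance zero with x, hold exactly in-sample for the least-squares residuals, which can be verified with a short NumPy sketch (simulated data and noise level are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=1000)
y = 0.5 + 1.5 * x + rng.normal(0.0, 0.3, size=1000)

# Least-squares fit from the closed form derived above.
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()

# The residuals epsilon_i = y_i - (a + b*x_i) satisfy both conditions
# exactly in-sample (up to floating-point error), by the first-order
# conditions of the minimization -- no Gaussian assumption needed.
eps = y - (a + b * x)
print(abs(eps.mean()) < 1e-10)        # True: sample mean of the noise is zero
print(abs(np.mean(x * eps)) < 1e-10)  # True: sample covariance with x is zero
```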
856 00:49:08,750 --> 00:49:13,040 OK, so that was basically probabilistic, right. 857 00:49:13,040 --> 00:49:14,450 If I were to do probability and I 858 00:49:14,450 --> 00:49:17,090 were trying to model the relationship between two 859 00:49:17,090 --> 00:49:20,330 random variables, x and y, in the form 860 00:49:20,330 --> 00:49:24,320 y is a plus bx plus some noise, this is what would come out. 861 00:49:24,320 --> 00:49:25,640 Everything was expectations. 862 00:49:25,640 --> 00:49:27,290 There was no data involved. 863 00:49:27,290 --> 00:49:33,620 So now let's go to the data problem, which is now, 864 00:49:33,620 --> 00:49:35,540 I do not know what those expectations are. 865 00:49:35,540 --> 00:49:38,240 In particular, I don't know what the covariance of x and y is, 866 00:49:38,240 --> 00:49:40,610 and I don't know what the expectation of x 867 00:49:40,610 --> 00:49:42,950 and the expectation of y are. 868 00:49:42,950 --> 00:49:44,570 So I have data to do that. 869 00:49:44,570 --> 00:49:45,880 So how am I going to do this? 870 00:49:49,244 --> 00:49:50,660 Well, I'm just going to say, well, 871 00:49:50,660 --> 00:49:57,570 if I have x1, y1, ..., xn, yn, and I'm going 872 00:49:57,570 --> 00:49:59,781 to assume that they're i.i.d. 873 00:49:59,781 --> 00:50:01,530 And I'm actually going to assume that they 874 00:50:01,530 --> 00:50:02,820 have some model, right. 875 00:50:02,820 --> 00:50:06,570 So I'm going to assume that I have that a-- 876 00:50:06,570 --> 00:50:09,150 so that Yi follows the same model. 877 00:50:14,620 --> 00:50:17,000 So epsilon i, right. And I will 878 00:50:17,000 --> 00:50:23,610 say that the expectation of epsilon i is zero and covariance of xi, 879 00:50:23,610 --> 00:50:25,630 epsilon i is equal to zero. 880 00:50:25,630 --> 00:50:28,880 So I'm going to put the same model on all the data. 881 00:50:28,880 --> 00:50:31,420 So you can see that a is not ai, and b is not bi. 882 00:50:31,420 --> 00:50:32,380 It's the same.
883 00:50:32,380 --> 00:50:34,090 So as my data increases, I should 884 00:50:34,090 --> 00:50:36,850 be able to recover the correct things-- 885 00:50:36,850 --> 00:50:39,430 as the size of my data increases. 886 00:50:39,430 --> 00:50:43,030 OK, so this is what the statistical problem looks like. 887 00:50:43,030 --> 00:50:45,250 You're given the points. 888 00:50:45,250 --> 00:50:47,350 There is a true line from which these points 889 00:50:47,350 --> 00:50:48,557 were generated, right. 890 00:50:48,557 --> 00:50:49,390 There was this line. 891 00:50:49,390 --> 00:50:54,250 There was a true ab that I used to draw this plot, 892 00:50:54,250 --> 00:50:55,190 and that was the line. 893 00:50:55,190 --> 00:50:59,320 So first I picked an x, say uniformly at random 894 00:50:59,320 --> 00:51:02,110 on this interval, 0 to 2. 895 00:51:02,110 --> 00:51:03,610 I said that was this one. 896 00:51:03,610 --> 00:51:06,800 Then I said well, I want y to be a plus bx, 897 00:51:06,800 --> 00:51:08,500 so it should be here, but then I'm 898 00:51:08,500 --> 00:51:10,840 going to add some noise epsilon to move away 899 00:51:10,840 --> 00:51:13,270 from this line. 900 00:51:13,270 --> 00:51:16,970 And here, actually, we got two points that fall right 901 00:51:16,970 --> 00:51:18,070 on this line. 902 00:51:18,070 --> 00:51:20,170 So there's basically two epsilons 903 00:51:20,170 --> 00:51:22,330 that were small enough that the dots actually 904 00:51:22,330 --> 00:51:24,720 look like they're on the line. 905 00:51:24,720 --> 00:51:27,060 Everybody's clear about what I'm drawing? 906 00:51:27,060 --> 00:51:28,810 So now of course if you're a statistician, 907 00:51:28,810 --> 00:51:29,620 you don't see this. 908 00:51:29,620 --> 00:51:30,810 You only see this. 909 00:51:30,810 --> 00:51:32,610 And you have to recover this guy, 910 00:51:32,610 --> 00:51:34,260 and it's going to look like this. 
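The data-generating picture just described can be simulated in a few lines. This is only an illustrative sketch: the particular values a = 1.0, b = 2.0, the Gaussian noise level 0.5, and the sample size 100 are made up, not values from the lecture.

```python
import numpy as np

# Simulate the picture described above: pick x uniformly at random on
# [0, 2], compute a + b*x, then add noise epsilon to move off the line.
# The values a=1.0, b=2.0, noise sd 0.5, and n=100 are made-up choices.
rng = np.random.default_rng(0)
a, b, n = 1.0, 2.0, 100
x = rng.uniform(0.0, 2.0, size=n)     # x drawn uniformly on [0, 2]
eps = rng.normal(0.0, 0.5, size=n)    # mean-zero noise, independent of x
y = a + b * x + eps                   # the statistician only sees (x, y)
```

A scatterplot of (x, y) then gives exactly the cloud of points around the hidden true line that the lecture is describing.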
911 00:51:34,260 --> 00:51:36,550 You're going to have an estimated line, which 912 00:51:36,550 --> 00:51:37,780 is the red one. 913 00:51:37,780 --> 00:51:42,610 And the blue line, which is the true one, the one that 914 00:51:42,610 --> 00:51:44,230 actually generated the data. 915 00:51:44,230 --> 00:51:46,810 And your question is, well, this line corresponds 916 00:51:46,810 --> 00:51:48,967 to some parameters a hat and b hat, 917 00:51:48,967 --> 00:51:51,550 how could I make sure that those two lines-- how far those two 918 00:51:51,550 --> 00:51:52,060 lines are? 919 00:51:52,060 --> 00:51:53,620 And one way to address this question is 920 00:51:53,620 --> 00:51:57,920 to say how far is a from a hat, and how far is b from b hat? 921 00:51:57,920 --> 00:51:58,785 OK? 922 00:51:58,785 --> 00:52:00,660 Another question, of course, that you may ask 923 00:52:00,660 --> 00:52:04,470 is, how do you find a hat and b hat? 924 00:52:04,470 --> 00:52:07,530 And as you can see, it's basically the same thing. 925 00:52:07,530 --> 00:52:15,210 Remember, what was a-- so b was the covariance between x 926 00:52:15,210 --> 00:52:21,240 and y divided by the variance of x, right? 927 00:52:21,240 --> 00:52:22,410 We can rewrite this. 928 00:52:22,410 --> 00:52:26,430 The expectation of xy minus expectation 929 00:52:26,430 --> 00:52:30,060 of x times the expectation of y, divided 930 00:52:30,060 --> 00:52:35,580 by expectation of x squared minus expectation of x, 931 00:52:35,580 --> 00:52:37,540 the whole thing squared. 932 00:52:37,540 --> 00:52:39,040 OK? 933 00:52:39,040 --> 00:52:42,910 If you look at the expression for b hat, 934 00:52:42,910 --> 00:52:47,670 I basically replaced all the expectations by bars. 935 00:52:47,670 --> 00:52:49,800 So I said, well, this guy I'm going 936 00:52:49,800 --> 00:52:53,480 to estimate by an average. 937 00:52:53,480 --> 00:52:59,970 So that's the xy bar, and it's 1 over n, 938 00:52:59,970 --> 00:53:03,025 sum from i equal 1 to n of Xi times Yi. 939 00:53:05,555 --> 00:53:08,380 x bar, of course, is just the one that we're used to. 940 00:53:12,690 --> 00:53:14,970 And same for y bar. 941 00:53:14,970 --> 00:53:20,580 X squared bar, the one that's here, 942 00:53:20,580 --> 00:53:22,290 is the average of the squares. 943 00:53:22,290 --> 00:53:24,426 And x bar square is the square of the average. 944 00:53:39,510 --> 00:53:44,070 OK, so you just basically replace this guy by x bar, 945 00:53:44,070 --> 00:53:47,820 this guy by y bar, this guy by x square bar, 946 00:53:47,820 --> 00:53:52,350 and this guy by x bar and no square. 947 00:53:52,350 --> 00:53:54,810 OK, so that's basically one way to do it. 948 00:53:54,810 --> 00:53:56,340 Everywhere you see an expectation, 949 00:53:56,340 --> 00:53:58,740 you replace it by an average. 950 00:53:58,740 --> 00:54:02,070 That's the usual statistical hammer. 951 00:54:02,070 --> 00:54:04,720 You can actually be slightly more subtle about this. 952 00:54:09,980 --> 00:54:12,420 And as an exercise, I invite you-- 953 00:54:12,420 --> 00:54:14,940 just to make sure that you know how to do this computation, 954 00:54:14,940 --> 00:54:17,400 it's going to be exactly the same kind of computations 955 00:54:17,400 --> 00:54:18,840 that we've done. 956 00:54:18,840 --> 00:54:20,670 But as an exercise, you can check 957 00:54:20,670 --> 00:54:23,311 that if you actually look at say, well, 958 00:54:23,311 --> 00:54:25,810 what I wanted to minimize here, I had an expectation, right? 959 00:54:32,720 --> 00:54:35,660 And I said, let's minimize this thing. 960 00:54:35,660 --> 00:54:41,800 Well, let's replace this by an average first. 961 00:54:51,630 --> 00:54:54,270 And now minimize. 962 00:54:54,270 --> 00:54:57,100 OK, so if I do this, it turns out 963 00:54:57,100 --> 00:55:00,160 I'm going to actually get the same result. 
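The two routes just described can both be sketched in code: the plug-in recipe (replace every expectation in the formula for a and b by a sample average), and replacing the expectation by an average first and then minimizing. They land on the same answer. The true parameters (1, 2) and the noise level below are made-up values for illustration.

```python
import numpy as np

# Made-up simulated data: y = 1 + 2x + noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=500)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=500)

# Plug-in estimators: every expectation becomes a sample average.
xy_bar = np.mean(x * y)              # replaces E[XY]
x_bar, y_bar = x.mean(), y.mean()    # replace E[X], E[Y]
x2_bar = np.mean(x ** 2)             # replaces E[X^2]
b_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
a_hat = y_bar - b_hat * x_bar

# Alternative route: replace the expectation by the average in the
# squared-error criterion, then minimize (here by plain gradient
# descent on the average loss).  It converges to the same (a, b).
t = np.zeros(2)                      # t = (a, b), starting at zero
for _ in range(5000):
    r = y - (t[0] + t[1] * x)        # residuals at the current t
    t -= 0.1 * np.array([-2 * r.mean(), -2 * (r * x).mean()])
```

Both `t` and `(a_hat, b_hat)` end up at the same point, which is exactly the "same result" claim made here.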
964 00:55:00,160 --> 00:55:03,940 The minimum of the average is basically-- 965 00:55:03,940 --> 00:55:06,160 when I replace the average by-- sorry, 966 00:55:06,160 --> 00:55:09,040 when I replace the expectation by the average 967 00:55:09,040 --> 00:55:11,817 and then minimize, it's the same thing 968 00:55:11,817 --> 00:55:13,900 as first minimizing and then replacing expectations 969 00:55:13,900 --> 00:55:17,510 by averages in this case. 970 00:55:17,510 --> 00:55:21,764 Again, this is a much more general principle 971 00:55:21,764 --> 00:55:23,180 because if you don't have a closed 972 00:55:23,180 --> 00:55:27,530 form for the minimum like for some, say, likelihood problems, 973 00:55:27,530 --> 00:55:30,579 well, you might not actually have a possibility 974 00:55:30,579 --> 00:55:32,870 to just look at what the formula looks like-- see where 975 00:55:32,870 --> 00:55:35,480 the expectations show up-- and then just plug in the averages 976 00:55:35,480 --> 00:55:36,380 instead. 977 00:55:36,380 --> 00:55:39,170 So this is the one you want to keep in mind. 978 00:55:39,170 --> 00:55:41,000 And again, as an exercise. 979 00:55:47,000 --> 00:55:48,870 OK, so here, you take the expectation 980 00:55:48,870 --> 00:55:52,980 and replace it by averages. 981 00:55:52,980 --> 00:55:57,800 And then that's the same answer, and I encourage 982 00:55:57,800 --> 00:56:00,080 you to solve the exercise. 983 00:56:00,080 --> 00:56:03,770 OK, everybody's clear that this is actually the same expression 984 00:56:03,770 --> 00:56:07,140 for a hat and b hat that we had before for a and b 985 00:56:07,140 --> 00:56:12,460 when we replace the expectations by averages? 986 00:56:12,460 --> 00:56:16,960 Here, by the way, I minimize the sum rather than the average. 987 00:56:16,960 --> 00:56:19,708 It's clear to everyone that this is the same thing, right? 988 00:56:22,680 --> 00:56:23,180 Yep? 
989 00:56:23,180 --> 00:56:27,148 AUDIENCE: [INAUDIBLE] sum replacing it [INAUDIBLE] 990 00:56:27,148 --> 00:56:29,628 minimize the expectation, I'm assuming 991 00:56:29,628 --> 00:56:31,612 it's switched with the derivative 992 00:56:31,612 --> 00:56:33,596 on the expectation [INAUDIBLE]. 993 00:56:37,592 --> 00:56:39,050 PHILIPPE RIGOLLET: So we did switch 994 00:56:39,050 --> 00:56:43,640 the derivative and the expectation before you came, 995 00:56:43,640 --> 00:56:44,140 I think. 996 00:56:47,890 --> 00:56:49,810 All right, so indeed, the picture 997 00:56:49,810 --> 00:56:52,150 was the one that we said, so visually, this 998 00:56:52,150 --> 00:56:53,380 is what we're doing. 999 00:56:53,380 --> 00:56:55,780 We're looking among all the lines. 1000 00:56:55,780 --> 00:56:58,822 For each line, we compute this distance. 1001 00:56:58,822 --> 00:57:00,280 So if I give you another line there 1002 00:57:00,280 --> 00:57:01,759 would be another set of arrows. 1003 00:57:01,759 --> 00:57:02,800 You look at their length. 1004 00:57:02,800 --> 00:57:03,610 You square it. 1005 00:57:03,610 --> 00:57:05,520 And then you sum it all, and you find 1006 00:57:05,520 --> 00:57:08,080 the line that has the minimum sum of squared lengths 1007 00:57:08,080 --> 00:57:09,364 of the arrows. 1008 00:57:09,364 --> 00:57:11,780 All right, and those are the arrows that we're looking at. 1009 00:57:11,780 --> 00:57:14,710 But again, you could actually think of other distances, 1010 00:57:14,710 --> 00:57:17,307 and you would actually get different-- 1011 00:57:17,307 --> 00:57:19,390 you could actually get different solutions, right. 1012 00:57:19,390 --> 00:57:22,644 So there's something called mean absolute deviation, 1013 00:57:22,644 --> 00:57:24,310 which rather than minimizing this thing, 1014 00:57:24,310 --> 00:57:27,490 is actually minimizing the sum from i equal 1 to n 1015 00:57:27,490 --> 00:57:33,970 of the absolute value of Yi minus a plus bXi. 1016 00:57:33,970 --> 00:57:36,160 And that's not something for which 1017 00:57:36,160 --> 00:57:39,190 you're going to have a closed form, as you can imagine. 1018 00:57:39,190 --> 00:57:42,010 You might have something that's sort of implicit, 1019 00:57:42,010 --> 00:57:44,647 but you can actually still solve it numerically. 1020 00:57:44,647 --> 00:57:46,230 And this is something that people also 1021 00:57:46,230 --> 00:57:50,478 like to use but way, way less than the least squares one. 1022 00:57:50,478 --> 00:57:52,174 AUDIENCE: [INAUDIBLE] 1023 00:57:52,174 --> 00:57:53,840 PHILIPPE RIGOLLET: What did I just write? 1024 00:57:53,840 --> 00:57:56,600 AUDIENCE: [INAUDIBLE] 1025 00:57:56,600 --> 00:58:02,230 The sum of the absolute values of Yi minus a plus bXi. 1026 00:58:02,230 --> 00:58:04,432 So it's the same except I don't square here. 1027 00:58:07,820 --> 00:58:08,320 OK? 1028 00:58:11,250 --> 00:58:18,330 So arguably, you know, predicting a demand 1029 00:58:18,330 --> 00:58:21,780 based on price is a fairly naive problem. 1030 00:58:21,780 --> 00:58:23,787 Typically, what we have is a bunch of data 1031 00:58:23,787 --> 00:58:25,620 that we've collected, and we're hoping that, 1032 00:58:25,620 --> 00:58:29,460 together, they can help us do a better prediction. 1033 00:58:29,460 --> 00:58:31,890 All right, so maybe I don't have only the price, 1034 00:58:31,890 --> 00:58:35,670 but maybe I have a bunch of other social indicators. 1035 00:58:35,670 --> 00:58:40,484 Maybe I know the competition, the price of the competition. 1036 00:58:40,484 --> 00:58:42,150 And maybe I know a bunch of other things 1037 00:58:42,150 --> 00:58:43,980 that are actually relevant. 1038 00:58:43,980 --> 00:58:48,030 And so I'm trying to find a way to combine a bunch of points, 1039 00:58:48,030 --> 00:58:50,880 a bunch of measures. 
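The mean absolute deviation criterion mentioned a moment ago has no closed form, but, as stated, it can still be solved numerically. Here is a minimal sketch on made-up data, using a plain subgradient method that starts from the least squares fit and keeps the best point seen so far; this is one simple choice among many possible numerical methods, not the standard algorithm for this problem.

```python
import numpy as np

# Made-up simulated data: y = 1 + 2x + noise.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 2.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=200)

def mad(t):
    # Mean absolute deviation criterion: sum_i |y_i - (a + b x_i)|.
    return float(np.sum(np.abs(y - (t[0] + t[1] * x))))

# Least squares fit, used only as a starting point.
b_ls = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a_ls = y.mean() - b_ls * x.mean()

best = np.array([a_ls, b_ls])
best_val = mad(best)
t = best.copy()
for k in range(1, 20001):
    s = np.sign(y - (t[0] + t[1] * x))       # signs of the residuals
    g = -np.array([s.sum(), (s * x).sum()])  # a subgradient of mad at t
    t = t - 0.001 / np.sqrt(k) * g           # diminishing step sizes
    v = mad(t)
    if v < best_val:                         # keep the best iterate seen
        best, best_val = t.copy(), v
```

By construction `best` has a mean absolute deviation at least as small as the least squares fit's, and on data like this it lands close to the same line.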
1040 00:58:50,880 --> 00:58:52,540 There's a nice example that I like, 1041 00:58:52,540 --> 00:58:56,370 which is people were trying to measure something 1042 00:58:56,370 --> 00:59:00,750 related to your body mass index, so basically 1043 00:59:00,750 --> 00:59:04,820 the volume of your-- the density of your body. 1044 00:59:04,820 --> 00:59:07,380 And the way you can do this is by just, really, 1045 00:59:07,380 --> 00:59:10,170 weighing someone and also putting them 1046 00:59:10,170 --> 00:59:13,920 in some cubic meter of water and see how much overflows. 1047 00:59:13,920 --> 00:59:15,750 And then you have both the volume 1048 00:59:15,750 --> 00:59:20,850 and the mass of this person, and you 1049 00:59:20,850 --> 00:59:23,370 can start computing density. 1050 00:59:23,370 --> 00:59:25,860 But as you can imagine, you know, 1051 00:59:25,860 --> 00:59:27,684 I would not personally like to go to a gym 1052 00:59:27,684 --> 00:59:29,600 when the first thing they ask me is to just go 1053 00:59:29,600 --> 00:59:33,240 in a bucket of water, and so people 1054 00:59:33,240 --> 00:59:36,840 try to find ways to measure this based on other indicators that 1055 00:59:36,840 --> 00:59:38,110 are much easier to measure. 1056 00:59:38,110 --> 00:59:41,040 For example, I don't know, the length of my forearm, 1057 00:59:41,040 --> 00:59:45,090 and the circumference of my head, and maybe my belly 1058 00:59:45,090 --> 00:59:46,870 would probably be more appropriate here. 1059 00:59:46,870 --> 00:59:48,870 And so you know, they just try to find something 1060 00:59:48,870 --> 00:59:50,340 that actually makes sense. 1061 00:59:50,340 --> 00:59:52,094 And so there's actually a nice example 1062 00:59:52,094 --> 00:59:53,760 where you can show that if you measure-- 1063 00:59:53,760 --> 00:59:55,050 I think one of the most significant 1064 00:59:55,050 --> 00:59:56,860 was with the circumference of your wrist. 
1065 00:59:56,860 --> 01:00:02,070 This is actually a very good indicator of your body density. 1066 01:00:02,070 --> 01:00:06,780 And it turns out that if you stuff all the bunch of things 1067 01:00:06,780 --> 01:00:09,240 together, you might actually get a very good formula that 1068 01:00:09,240 --> 01:00:10,840 explains things. 1069 01:00:10,840 --> 01:00:12,390 All right, so what we're going to do 1070 01:00:12,390 --> 01:00:14,406 is rather than saying we have only one x 1071 01:00:14,406 --> 01:00:15,780 to explain y's, let's say we have 1072 01:00:15,780 --> 01:00:19,510 20 x's that we're trying to combine to explain y. 1073 01:00:19,510 --> 01:00:22,410 And again, just like assuming something of the form, 1074 01:00:22,410 --> 01:00:26,107 y is a plus b times x was the simplest thing we could do, 1075 01:00:26,107 --> 01:00:28,440 here we're just going to assume that we have y is a plus 1076 01:00:28,440 --> 01:00:31,650 b1, x1 plus b2, x2, plus b3, x3. 1077 01:00:31,650 --> 01:00:33,690 And we can write it in a vector form 1078 01:00:33,690 --> 01:00:39,210 by writing that Yi is Xi transposed b, which 1079 01:00:39,210 --> 01:00:42,770 is now a vector plus epsilon i. 1080 01:00:42,770 --> 01:00:44,520 OK, and here, on the board, I'm going 1081 01:00:44,520 --> 01:00:46,980 to have a hard time doing boldface, 1082 01:00:46,980 --> 01:00:52,360 but all these things are vectors except for y, 1083 01:00:52,360 --> 01:00:53,520 which is a number. 1084 01:00:53,520 --> 01:00:54,450 Yi is a number. 1085 01:00:54,450 --> 01:00:57,780 It's always the value of my y-axis. 1086 01:00:57,780 --> 01:00:59,930 So even if my x-axis lives on-- 1087 01:00:59,930 --> 01:01:04,350 this is x1, and this is x2, y is really just the real valued 1088 01:01:04,350 --> 01:01:05,249 function. 1089 01:01:05,249 --> 01:01:07,290 And so I'm going to get a bunch of points, x1,y1, 1090 01:01:07,290 --> 01:01:10,380 and I'm going to see how much they respond. 
1091 01:01:10,380 --> 01:01:13,560 So for example, my body density is y, 1092 01:01:13,560 --> 01:01:16,562 and then all the x's are a bunch of other things. 1093 01:01:16,562 --> 01:01:17,270 Agreed with that? 1094 01:01:17,270 --> 01:01:20,870 So this is an equation that holds on the real line, 1095 01:01:20,870 --> 01:01:27,390 but this guy here is in R^p, and this guy's in R^p. 1096 01:01:30,080 --> 01:01:33,550 It's actually common to call b beta 1097 01:01:33,550 --> 01:01:38,650 when it's a vector, and that's the usual linear regression 1098 01:01:38,650 --> 01:01:39,370 notation. 1099 01:01:39,370 --> 01:01:42,470 Y is x beta plus epsilon. 1100 01:01:42,470 --> 01:01:45,780 So x's are called explanatory variables. 1101 01:01:45,780 --> 01:01:50,600 y is called the explained variable, or dependent variable, 1102 01:01:50,600 --> 01:01:52,000 or response variable. 1103 01:01:52,000 --> 01:01:53,050 It has a bunch of names. 1104 01:01:53,050 --> 01:01:55,877 You can use whatever you feel more comfortable with. 1105 01:01:55,877 --> 01:01:57,460 It should actually be explicit, right, 1106 01:01:57,460 --> 01:01:58,668 so that's all you care about. 1107 01:02:01,100 --> 01:02:05,840 Now, what we typically do is that rather-- so you 1108 01:02:05,840 --> 01:02:07,840 notice here, that there's actually no intercept. 1109 01:02:07,840 --> 01:02:10,840 If I actually fold that back down to one dimension, 1110 01:02:10,840 --> 01:02:13,210 there's actually a is equal to zero, right? 1111 01:02:13,210 --> 01:02:18,350 If I go back to p is equal to 1, that 1112 01:02:18,350 --> 01:02:22,430 would imply that Yi is, well, say, beta times 1113 01:02:22,430 --> 01:02:24,979 Xi plus epsilon i. 1114 01:02:24,979 --> 01:02:27,020 And that's not good, I want to have an intercept. 
1115 01:02:27,020 --> 01:02:29,480 And the way I do this, rather than writing 1116 01:02:29,480 --> 01:02:31,910 a plus this, and you know, just have 1117 01:02:31,910 --> 01:02:35,420 like an overload of notation, what I am actually doing 1118 01:02:35,420 --> 01:02:37,670 is that I fold back. 1119 01:02:37,670 --> 01:02:40,750 I fold my intercept back into my x. 1120 01:02:43,460 --> 01:02:46,190 And so if I measure 20 variables, 1121 01:02:46,190 --> 01:02:48,080 I'm going to create a 21st variable, which 1122 01:02:48,080 --> 01:02:49,700 is always equal to 1. 1123 01:02:49,700 --> 01:02:52,650 OK, so you should think of x as being 1, 1124 01:02:52,650 --> 01:02:58,120 and then x1 up to xp. 1125 01:02:58,120 --> 01:03:00,790 And sorry, xp minus 1, I guess. 1126 01:03:00,790 --> 01:03:02,293 OK, and now this is in R^p. 1127 01:03:05,590 --> 01:03:07,900 I'm always going to assume that the first one is 1. 1128 01:03:07,900 --> 01:03:09,250 I can always do that. 1129 01:03:09,250 --> 01:03:11,320 If I have a table of data-- 1130 01:03:11,320 --> 01:03:15,940 if my data is given to me in an Excel spreadsheet-- 1131 01:03:15,940 --> 01:03:19,990 and here I have the density that I measured on my data, 1132 01:03:19,990 --> 01:03:22,940 and then maybe here I have the height, 1133 01:03:22,940 --> 01:03:25,544 and here I have the wrist circumference. 1134 01:03:25,544 --> 01:03:26,710 And I have all these things. 1135 01:03:26,710 --> 01:03:31,100 All I have to do is to create another column here of ones, 1136 01:03:31,100 --> 01:03:34,180 and I just put 1-1-1-1-1. 1137 01:03:34,180 --> 01:03:37,090 OK, that's all I have to do to create this guy. 1138 01:03:37,090 --> 01:03:39,190 Agreed? 1139 01:03:39,190 --> 01:03:43,940 And now my x is going to be just one of those rows. 1140 01:03:43,940 --> 01:03:46,190 So this is Xi, this entire row. 1141 01:03:46,190 --> 01:03:47,622 And this entry here is Yi. 
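The spreadsheet picture just described (add a column of ones so the intercept rides along inside beta) can be sketched as follows. The height and wrist-circumference numbers are invented for illustration, not data from the lecture.

```python
import numpy as np

# Fold the intercept into x: prepend a column of ones, so the first
# coordinate of beta plays the role of the intercept a.  The two
# measurement columns (say, height and wrist circumference) are
# invented numbers, purely for illustration.
measurements = np.array([[1.82, 17.1],
                         [1.75, 16.4],
                         [1.91, 18.0]])
ones = np.ones((measurements.shape[0], 1))   # the 1-1-1-1-1 column
X = np.hstack([ones, measurements])
# Each row of X is now one observation x_i = (1, x_i^1, ..., x_i^{p-1}).
```

Each row of `X` is the vector Xi from the board, and the corresponding entry of the response column is Yi.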
1142 01:03:54,430 --> 01:03:56,920 So now, for my noise coefficients, 1143 01:03:56,920 --> 01:03:59,300 I'm still going to ask for the same thing 1144 01:03:59,300 --> 01:04:04,090 except that here, the covariance is not between x-- 1145 01:04:04,090 --> 01:04:07,210 between one random variable and another random variable. 1146 01:04:07,210 --> 01:04:10,930 It's between a random vector and a random variable. 1147 01:04:10,930 --> 01:04:13,130 OK, how do I measure the covariance between a vector 1148 01:04:13,130 --> 01:04:14,594 and a random variable? 1149 01:04:23,866 --> 01:04:25,840 AUDIENCE: [INAUDIBLE] 1150 01:04:25,840 --> 01:04:29,002 PHILIPPE RIGOLLET: Yeah, so basically-- 1151 01:04:29,002 --> 01:04:31,380 AUDIENCE: [INAUDIBLE] 1152 01:04:31,380 --> 01:04:33,630 PHILIPPE RIGOLLET: Yeah, I mean, the covariance vector 1153 01:04:33,630 --> 01:04:36,171 is equal to 0 is the same thing as [INAUDIBLE] equal to zero, 1154 01:04:36,171 --> 01:04:39,270 but yeah, this is basically thought of entry-wise. 1155 01:04:39,270 --> 01:04:41,820 For each coordinate of x, I want that the covariance 1156 01:04:41,820 --> 01:04:47,430 between epsilon and this coordinate of x is equal to 0. 1157 01:04:47,430 --> 01:04:50,370 So I'm just asking this for all coordinates. 1158 01:04:50,370 --> 01:04:52,020 Again, in most instances, we're going 1159 01:04:52,020 --> 01:04:53,520 to think that epsilon is independent 1160 01:04:53,520 --> 01:04:56,310 of x, and that's something we can understand without thinking 1161 01:04:56,310 --> 01:04:59,022 about coordinates. 1162 01:04:59,022 --> 01:05:00,471 Yep? 1163 01:05:00,471 --> 01:05:03,852 AUDIENCE: [INAUDIBLE] like what if beta equals alpha 1164 01:05:03,852 --> 01:05:04,818 [INAUDIBLE]? 1165 01:05:06,774 --> 01:05:09,190 PHILIPPE RIGOLLET: I'm sorry, can you repeat the question? 1166 01:05:09,190 --> 01:05:09,773 I didn't hear. 1167 01:05:09,773 --> 01:05:12,140 AUDIENCE: Is this the parameter of beta, a parameter? 
1168 01:05:12,140 --> 01:05:13,100 PHILIPPE RIGOLLET: Yeah, beta is the parameter 1169 01:05:13,100 --> 01:05:14,141 we're looking for, right. 1170 01:05:14,141 --> 01:05:18,485 Just like it was the pair ab has become the whole vector of beta 1171 01:05:18,485 --> 01:05:19,394 now. 1172 01:05:19,394 --> 01:05:20,810 AUDIENCE: And what's [INAUDIBLE]?? 1173 01:05:22,720 --> 01:05:25,219 PHILIPPE RIGOLLET: Well, can you think of an intercept 1174 01:05:25,219 --> 01:05:26,260 of a function that take-- 1175 01:05:26,260 --> 01:05:28,630 I mean, there is one actually. 1176 01:05:28,630 --> 01:05:30,370 There's the one for which betas-- 1177 01:05:30,370 --> 01:05:31,840 all the betas that don't correspond 1178 01:05:31,840 --> 01:05:35,200 to the vector of all ones, so the intercept 1179 01:05:35,200 --> 01:05:38,469 is really the weight that I put on this guy. 1180 01:05:38,469 --> 01:05:40,510 That's the beta that's going to come to this guy, 1181 01:05:40,510 --> 01:05:44,310 but we don't really talk about intercept. 1182 01:05:44,310 --> 01:05:49,210 So if x lives in two dimensions, the way 1183 01:05:49,210 --> 01:05:50,950 you want to think about this is you 1184 01:05:50,950 --> 01:05:54,420 take a sheet of paper like that, so now I 1185 01:05:54,420 --> 01:05:57,080 have points that live in three dimensions. 1186 01:05:57,080 --> 01:05:59,320 So let's say one direction here is x1. 1187 01:05:59,320 --> 01:06:02,710 This direction is x2, and this direction is y. 1188 01:06:02,710 --> 01:06:04,960 And so what's going to happen is that I'm 1189 01:06:04,960 --> 01:06:07,120 going to have my points that live in this three 1190 01:06:07,120 --> 01:06:08,710 dimensional space. 1191 01:06:08,710 --> 01:06:10,180 And what I'm trying to do when I'm 1192 01:06:10,180 --> 01:06:12,580 trying to do a linear model for those guys-- 1193 01:06:12,580 --> 01:06:13,990 when I assume a linear model. 
1194 01:06:13,990 --> 01:06:17,380 What I assume is that there's a plane in those three 1195 01:06:17,380 --> 01:06:17,950 dimensions. 1196 01:06:17,950 --> 01:06:20,170 So think of this guy as going everywhere, 1197 01:06:20,170 --> 01:06:23,920 and there's a plane close to which all my points should be. 1198 01:06:23,920 --> 01:06:26,320 That's what's happening in two dimensions. 1199 01:06:26,320 --> 01:06:29,930 If you see higher dimensions then congratulations to you, 1200 01:06:29,930 --> 01:06:30,975 but I can't. 1201 01:06:33,530 --> 01:06:36,470 But you know, you can definitely formalize that fairly easily 1202 01:06:36,470 --> 01:06:38,405 mathematically and just talk about vectors. 1203 01:06:40,940 --> 01:06:44,200 So now here, if I talk about the least squares error estimator, 1204 01:06:44,200 --> 01:06:47,470 or just the least squares estimator of beta, 1205 01:06:47,470 --> 01:06:49,990 it's simply the same thing as before. 1206 01:06:49,990 --> 01:06:52,460 Just like we said-- 1207 01:06:52,460 --> 01:06:56,750 so remember, you should think of beta 1208 01:06:56,750 --> 01:06:59,930 as being the pair a b generalized. 1209 01:06:59,930 --> 01:07:05,060 So we said, oh, we wanted to minimize the expectation of y 1210 01:07:05,060 --> 01:07:13,640 minus a plus bx squared, right? 1211 01:07:13,640 --> 01:07:16,910 Now, so that's for p is equal to 1. 1212 01:07:16,910 --> 01:07:19,510 Now for p greater than or equal to 2, 1213 01:07:19,510 --> 01:07:28,760 we're just going to write it as y minus x transpose beta 1214 01:07:28,760 --> 01:07:29,260 squared. 1215 01:07:34,210 --> 01:07:37,900 OK, so I'm just trying to minimize this quantity. 1216 01:07:37,900 --> 01:07:40,857 Of course, I don't have access to this, 1217 01:07:40,857 --> 01:07:42,940 so what I'm going to do is I'm going to replace 1218 01:07:42,940 --> 01:07:44,881 my expectation by an average. 
1219 01:07:51,010 --> 01:07:54,890 So here I'm using the notation t because beta is the true one, 1220 01:07:54,890 --> 01:07:56,960 and I don't want you to just-- 1221 01:07:56,960 --> 01:07:59,960 so here, I have a variable t that's just moving around. 1222 01:07:59,960 --> 01:08:02,390 And so now I'm going to take the square of this thing. 1223 01:08:02,390 --> 01:08:08,450 And when I minimize this over all t in R^p, the argmin, 1224 01:08:08,450 --> 01:08:19,584 the minimum is attained at beta hat, which is my estimator. 1225 01:08:19,584 --> 01:08:20,084 OK? 1226 01:08:25,359 --> 01:08:29,337 So if I want to actually compute-- 1227 01:08:29,337 --> 01:08:29,837 yeah? 1228 01:08:29,837 --> 01:08:31,420 AUDIENCE: I'm sorry, on the last slide 1229 01:08:31,420 --> 01:08:36,422 did we require the expectation of [INAUDIBLE] to be zero? 1230 01:08:36,422 --> 01:08:38,380 PHILIPPE RIGOLLET: You mean the previous slide? 1231 01:08:38,380 --> 01:08:38,963 AUDIENCE: Yes. 1232 01:08:38,963 --> 01:08:40,262 [INAUDIBLE] 1233 01:08:40,262 --> 01:08:42,720 PHILIPPE RIGOLLET: So again, I'm just defining an estimator 1234 01:08:42,720 --> 01:08:45,053 just like I would tell you, just take the estimator that 1235 01:08:45,053 --> 01:08:46,539 has coordinates 4 everywhere. 1236 01:08:46,539 --> 01:08:48,984 AUDIENCE: So I'm saying, like, on that slide, we say 1237 01:08:48,984 --> 01:08:51,918 the noise terms we want to satisfy the covariance condition. 1238 01:08:51,918 --> 01:08:55,830 Do we also want them to satisfy expectation of each 1239 01:08:55,830 --> 01:08:56,808 noise term zero? 1240 01:09:07,827 --> 01:09:09,660 PHILIPPE RIGOLLET: And so the answer is yes. 1241 01:09:09,660 --> 01:09:13,050 I was just trying to think if this was captured. 
1242 01:09:13,050 --> 01:09:15,180 So it is not captured in this guy 1243 01:09:15,180 --> 01:09:17,700 because this is just telling me that the expectation 1244 01:09:17,700 --> 01:09:23,750 of epsilon i Xi minus the expectation of epsilon i times the expectation of Xi is equal to zero. 1245 01:09:23,750 --> 01:09:27,380 OK, so yes I need to have that epsilon has mean zero-- 1246 01:09:27,380 --> 01:09:29,130 let's assume that expectation of epsilon 1247 01:09:29,130 --> 01:09:31,545 is zero for this problem. 1248 01:09:43,640 --> 01:09:45,374 And we're going to need something, 1249 01:09:45,374 --> 01:09:47,540 some sort of assumption about the variance being 1250 01:09:47,540 --> 01:09:51,060 not equal to zero, right, but this is going to come up later. 1251 01:09:51,060 --> 01:09:54,710 So let's think for one second about doing the same approach 1252 01:09:54,710 --> 01:09:55,490 as we did before. 1253 01:09:55,490 --> 01:09:57,320 Take the partial derivative with respect 1254 01:09:57,320 --> 01:09:59,279 to the first coordinate of t, with respect 1255 01:09:59,279 --> 01:10:01,070 to the second coordinate of t, with respect 1256 01:10:01,070 --> 01:10:03,320 to the third coordinate of t, et cetera. 1257 01:10:03,320 --> 01:10:04,610 So that's what we did before. 1258 01:10:04,610 --> 01:10:07,460 We had two equations, and we reconciled them 1259 01:10:07,460 --> 01:10:10,190 because it was fairly easy to solve, right? 1260 01:10:10,190 --> 01:10:11,826 But in general, what's going to happen 1261 01:10:11,826 --> 01:10:13,700 is we're going to have a system of equations. 1262 01:10:13,700 --> 01:10:17,150 We're going to have a system of p equations, one for each 1263 01:10:17,150 --> 01:10:19,340 of the coordinates of t. 1264 01:10:19,340 --> 01:10:23,960 And we're going to have p unknowns, each coordinate of t. 1265 01:10:23,960 --> 01:10:26,559 And so we're going to have this system to solve-- 1266 01:10:26,559 --> 01:10:28,850 actually, it turns out it's going to be a linear system. 
1267 01:10:28,850 --> 01:10:29,960 But it's not going to be something 1268 01:10:29,960 --> 01:10:32,543 that we're going to be able to solve coordinate by coordinate. 1269 01:10:32,543 --> 01:10:34,020 It's going to be annoying to solve. 1270 01:10:34,020 --> 01:10:36,820 You know, you can guess what's going to happen, right. 1271 01:10:36,820 --> 01:10:40,700 Here, it involved the covariance between x and epsilon, right. 1272 01:10:40,700 --> 01:10:43,910 That's what it involved to understand-- 1273 01:10:43,910 --> 01:10:47,540 sorry, the correlation between x and y 1274 01:10:47,540 --> 01:10:50,660 to understand what the solution of this problem was. 1275 01:10:50,660 --> 01:10:52,070 In this case, there's going to be 1276 01:10:52,070 --> 01:10:57,930 not only the covariance between x1 and y, x2 and y, x3, et 1277 01:10:57,930 --> 01:10:59,510 cetera, all the way to xp and y. 1278 01:10:59,510 --> 01:11:02,960 There's also going to be all the cross covariances between xj 1279 01:11:02,960 --> 01:11:04,077 and xk. 1280 01:11:04,077 --> 01:11:05,660 And so this is going to be a nightmare 1281 01:11:05,660 --> 01:11:08,210 to solve, like, in this system. 1282 01:11:08,210 --> 01:11:12,100 And what we do is that we go on to using a matrix notation, 1283 01:11:12,100 --> 01:11:14,600 so that when we take derivatives, 1284 01:11:14,600 --> 01:11:16,340 we talk about gradients, and then we 1285 01:11:16,340 --> 01:11:20,390 can invert matrices and solve linear systems in a somewhat 1286 01:11:20,390 --> 01:11:23,330 formal manner by just saying that, if I want to solve 1287 01:11:23,330 --> 01:11:27,230 the system ax equals b-- 1288 01:11:27,230 --> 01:11:28,760 rather than actually solving this 1289 01:11:28,760 --> 01:11:30,440 for each coordinate of x individually, 1290 01:11:30,440 --> 01:11:33,770 I just say that x is equal to a inverse times b. 
1291 01:11:33,770 --> 01:11:37,490 So that's really why we're going to the equation one, 1292 01:11:37,490 --> 01:11:40,730 because we have a formalism to write that x 1293 01:11:40,730 --> 01:11:42,260 is the solution of the system. 1294 01:11:42,260 --> 01:11:43,843 I'm not telling you that this is going 1295 01:11:43,843 --> 01:11:48,110 to be easy to solve numerically, but at least I can write it. 1296 01:11:48,110 --> 01:11:51,307 And so here's how it goes. 1297 01:11:51,307 --> 01:11:52,390 I have a bunch of vectors. 1298 01:11:55,540 --> 01:11:56,790 So what are my vectors, right? 1299 01:11:56,790 --> 01:11:57,875 So I have x1-- 1300 01:11:57,875 --> 01:11:59,250 oh, by the way, I didn't actually 1301 01:11:59,250 --> 01:12:01,320 mention that when I put the subscript, 1302 01:12:01,320 --> 01:12:03,660 I'm talking about the observation. 1303 01:12:03,660 --> 01:12:05,118 And when I put the superscript, I'm 1304 01:12:05,118 --> 01:12:07,110 talking about the coordinates, right? 1305 01:12:07,110 --> 01:12:13,290 So I have x1, which is equal to 1, x1 1, up to 1306 01:12:13,290 --> 01:12:19,965 x1 p; x2, which is 1, 1307 01:12:19,965 --> 01:12:32,380 x2 1, up to x2 p; all the way to xn, which is 1, xn 1, up to xn p. 1308 01:12:32,380 --> 01:12:35,210 All right, so those are n observed x's, and then I 1309 01:12:35,210 --> 01:12:40,870 have y1, y2, up to yn, that come paired with those guys. 1310 01:12:40,870 --> 01:12:42,510 OK? 1311 01:12:42,510 --> 01:12:44,640 So the first thing is that I'm going 1312 01:12:44,640 --> 01:12:46,290 to stack those guys into some vector 1313 01:12:46,290 --> 01:12:47,520 that I'm going to call y. 1314 01:12:47,520 --> 01:12:49,710 So maybe I should put an arrow for the purpose 1315 01:12:49,710 --> 01:12:53,310 of the blackboard, and it's just y1 to yn. 1316 01:12:53,310 --> 01:12:56,720 OK, so this is a vector in R^n. 
1317 01:12:56,720 --> 01:12:59,150 Now, if I want to stack those guys together, 1318 01:12:59,150 --> 01:13:03,449 I can either create a long vector of size n times p, 1319 01:13:03,449 --> 01:13:05,990 but the problem is that I lose track of who's a coordinate 1320 01:13:05,990 --> 01:13:08,815 and who's an observation. 1321 01:13:08,815 --> 01:13:10,190 And so it's actually nicer for me 1322 01:13:10,190 --> 01:13:12,840 to just put those guys next to each other 1323 01:13:12,840 --> 01:13:15,320 and create one new variable. 1324 01:13:15,320 --> 01:13:18,020 And so the way I'm going to do this is-- rather than actually 1325 01:13:18,020 --> 01:13:22,220 stacking those guys like that, I'm taking their transposes 1326 01:13:22,220 --> 01:13:24,530 and stacking them as the rows of a matrix. 1327 01:13:24,530 --> 01:13:26,870 OK, so I'm going to create a matrix, which 1328 01:13:26,870 --> 01:13:28,700 here is denoted typically by-- 1329 01:13:28,700 --> 01:13:31,295 I'm going to write x double bar. 1330 01:13:31,295 --> 01:13:33,420 And here, I'm going to actually just-- so since I'm 1331 01:13:33,420 --> 01:13:35,940 taking those guys like this, the first column 1332 01:13:35,940 --> 01:13:37,010 is going to be only ones. 1333 01:13:40,510 --> 01:13:41,950 And then I'm going to have-- 1334 01:13:41,950 --> 01:13:47,130 well, x1^1, ..., x1^p in the first row. 1335 01:13:47,130 --> 01:13:52,890 And in the last row, I'm going to have xn^1, ..., xn^p. 1336 01:13:52,890 --> 01:13:57,690 OK, so here the number of rows is n, and the number of columns 1337 01:13:57,690 --> 01:13:58,800 is p. 1338 01:13:58,800 --> 01:14:02,352 One row per observation, one column per coordinate. 1339 01:14:05,010 --> 01:14:10,710 And again, I make your life miserable because this really 1340 01:14:10,710 --> 01:14:13,380 should be p minus 1 because I already used 1341 01:14:13,380 --> 01:14:15,850 the first one for this guy. 1342 01:14:15,850 --> 01:14:16,820 I'm sorry about that. 
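[Editor's note: the bookkeeping described on the board, one row per observation, a first column of ones for the intercept, can be sketched in NumPy with hypothetical sizes; none of this code is from the lecture.]

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3   # hypothetical: n observations, p total columns including the intercept

# Raw explanatory variables: one row per observation, one column per coordinate.
X_raw = rng.normal(size=(n, p - 1))

# Design matrix: stack each x_i^T as a row, prepending a 1 for the intercept,
# so the first column is all ones.
X = np.hstack([np.ones((n, 1)), X_raw])
```

This matches the convention in the lecture: with the intercept column occupying the first slot, the remaining coordinates run only up to p minus 1.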
1343 01:14:16,820 --> 01:14:18,400 It's a bit painful. 1344 01:14:18,400 --> 01:14:20,490 So usually we don't even write what's in there. 1345 01:14:20,490 --> 01:14:21,948 So we don't have to think about it. 1346 01:14:21,948 --> 01:14:23,970 Those are just vectors of size p. 1347 01:14:23,970 --> 01:14:25,380 OK? 1348 01:14:25,380 --> 01:14:27,740 So now that I created this thing, 1349 01:14:27,740 --> 01:14:31,340 I can actually just basically stack up all my models. 1350 01:14:31,340 --> 01:14:39,270 So Yi equals Xi transpose beta plus epsilon i for all i 1351 01:14:39,270 --> 01:14:41,430 equal 1 to n. 1352 01:14:41,430 --> 01:14:44,010 This transforms into-- this is equivalent to saying 1353 01:14:44,010 --> 01:14:47,610 that the vector y is equal to the matrix x 1354 01:14:47,610 --> 01:14:51,150 times beta plus a vector epsilon, 1355 01:14:51,150 --> 01:14:57,940 where epsilon is just epsilon 1 to epsilon n, right. 1356 01:14:57,940 --> 01:14:59,830 So I have just this system, which 1357 01:14:59,830 --> 01:15:02,000 I write in matrix form, which really just consists 1358 01:15:02,000 --> 01:15:04,900 of stacking up all these equations on top of each other. 1359 01:15:10,195 --> 01:15:12,820 So now that I have this model-- this is the usual least squares 1360 01:15:12,820 --> 01:15:13,330 model. 1361 01:15:13,330 --> 01:15:16,150 And here, when I want to write my least squares criterion 1362 01:15:16,150 --> 01:15:17,500 in terms of matrices, right? 1363 01:15:17,500 --> 01:15:19,041 My least squares criterion, remember, 1364 01:15:19,041 --> 01:15:27,010 was sum from i equal 1 to n of Yi minus Xi transpose beta 1365 01:15:27,010 --> 01:15:28,210 squared. 1366 01:15:28,210 --> 01:15:31,060 Well, here it's really just the sum 1367 01:15:31,060 --> 01:15:35,260 of the squares of the coordinates of the vector 1368 01:15:35,260 --> 01:15:37,540 y minus x beta. 
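[Editor's note: stacking the n scalar equations Yi = Xi transpose beta + epsilon i into the single matrix equation y = X beta + epsilon can be checked numerically. This NumPy sketch uses made-up coefficients and noise; nothing here comes from the lecture.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Design matrix with an intercept column of ones (hypothetical data).
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
beta = np.array([2.0, -1.5, 0.7])   # hypothetical coefficient vector
eps = 0.1 * rng.normal(size=n)      # noise vector (epsilon 1, ..., epsilon n)

# One matrix-vector product replaces all n scalar equations at once.
y = X @ beta + eps
```

Each coordinate of y agrees with the corresponding scalar equation Yi = Xi transpose beta + epsilon i, which is exactly what the stacking says.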
1369 01:15:37,540 --> 01:15:40,380 So this is actually equal to the norm squared 1370 01:15:40,380 --> 01:15:43,090 of y minus x beta. 1371 01:15:46,382 --> 01:15:47,340 That's just the Euclidean norm. 1372 01:15:47,340 --> 01:15:49,470 The norm squared is, by definition, the sum of the squares 1373 01:15:49,470 --> 01:15:51,720 of the coordinates. 1374 01:15:51,720 --> 01:15:53,885 And so now I can actually talk about minimizing 1375 01:15:53,885 --> 01:15:56,090 a norm squared, and here it's going 1376 01:15:56,090 --> 01:15:58,160 to be easier for me to take derivatives. 1377 01:15:58,160 --> 01:16:01,300 All right, so we'll do that next time.
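[Editor's note: the identity between the coordinate-wise sum of squares and the squared norm, and the minimization that the next lecture takes up via gradients, can both be sketched in NumPy. This is the editor's illustration with made-up data; np.linalg.lstsq is just one off-the-shelf way to compute the minimizer, not the lecture's method.]

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])    # hypothetical coefficients
y = X @ beta_true + 0.05 * rng.normal(size=n)

def criterion(beta):
    # Least squares criterion written as a squared Euclidean norm:
    # ||y - X beta||^2 = sum_i (Y_i - X_i^T beta)^2.
    return np.linalg.norm(y - X @ beta) ** 2

# The coordinate-wise sum of squares agrees with the norm form.
assert np.isclose(np.sum((y - X @ beta_true) ** 2), criterion(beta_true))

# The least squares estimator minimizes the criterion.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Since beta_hat is the minimizer, its criterion value is no larger than at any other beta, including the true one.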