The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: [INAUDIBLE] minus xi transpose t. I just pick whatever notation I want for a variable, and let's say it's t. So that's the least squares estimator. And it turns out that, as I said last time, it's going to be convenient to think of those things as matrices. So here, I already have vectors. I've already gone from one dimension, just real-valued random variables, to random vectors when I think of each xi. But if I start stacking them together, I'm going to have vectors and matrices that show up.

So the first vector I'm getting is y, which is just a vector where I have y1 to yn. So that's a boldface vector. Then I have X, which is a matrix where-- well, the first coordinate is always 1. So I have 1, and then x1 up to xp minus 1, and that's for observation 1. And then I have the same thing all the way down for observation n.

OK, everybody understands what this is? So I'm just basically stacking up all the xi's. So the i-th row is xi transpose. I am just stacking them up. And so if I want to write all these things to be true for each of them, all I need to do is to write a vector epsilon, which is epsilon 1 to epsilon n. And what I'm going to have is that y, the boldface vector, now is equal to the matrix X times the vector beta plus the vector epsilon. And it's really just exactly saying what's there, because for two-- so this is a vector, right? This is a vector. And what is the dimension of this vector? n, so this is n observations.
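As a concrete picture of this stacking, here is a small NumPy sketch-- not from the lecture, with made-up numbers-- that builds the boldface vector y, the design matrix X whose i-th row is xi transpose with a leading 1, and the vector equation y = X beta + epsilon.

```python
# A minimal sketch (illustrative data only) of the matrix notation: each row of
# the design matrix X is x_i^T = (1, x_{i,1}, ..., x_{i,p-1}), and the model
# stacks the n scalar equations y_i = x_i^T beta + eps_i into y = X beta + eps.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4                                  # hypothetical sample size and dimension

features = rng.normal(size=(n, p - 1))         # the covariates x_{i,1}, ..., x_{i,p-1}
X = np.column_stack([np.ones(n), features])    # prepend the constant 1 in each row
beta = np.array([2.0, -1.0, 0.5, 3.0])         # a made-up "true" coefficient vector
eps = rng.normal(scale=0.3, size=n)            # noise vector epsilon in R^n

y = X @ beta + eps                             # the vector equation y = X beta + eps
print(X.shape, y.shape)                        # (100, 4) (100,)
```

Each of the n rows of X carries one observation, which is exactly the statement that the n scalar equations hold simultaneously.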
And for all these-- for two vectors to be equal, I need all the coordinates to be equal, and that's exactly the same thing as saying that this holds for i equals 1 to n.

But now, when I have this, I can actually rewrite the sum for t equals-- sorry, for i equals 1 to n, of yi minus xi transpose beta, squared. This turns out to be equal to the Euclidean norm of the vector y minus the matrix X times beta, squared. And I'm going to put a 2 here so we know we're talking about the Euclidean norm. This just means this is the Euclidean norm. That's the one we've seen before when we talked about chi squared-- the norm is the square root of the sum of the squares of the coefficients, but here I have an extra square. So it's really just the sum of the squares of the coefficients, which is this. And here are the coefficients.

So now that I write this thing like that, then minimizing-- so my goal here, now, is going to be to solve the minimum over t in R^p of the norm of y minus X times t, squared. And just like we did for one dimension, we can actually write optimality conditions for this. I mean, this is a function. So this is a function from R^p to R. And if I want to minimize it, all I have to do is to take its gradient and set it equal to 0. So to find the minimum, set the gradient to 0.

So that's where it becomes a little complicated. Now I'm going to have to take the gradient of this norm. It might be a little annoying to do. But actually, what's nice about those things-- I mean, I remember that it was a bit annoying to learn. It's just basically rules of calculus that you don't use that much. But essentially, you can actually expand this norm. And you will see that the rules are basically the same as in one dimension; you just have to be careful about the fact that matrices do not commute. So let's expand this thing.
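The identity between the sum of squared residuals and the squared Euclidean norm is easy to sanity-check numerically; the sketch below (again illustrative, with arbitrary data) just compares the two expressions for some candidate t.

```python
# A quick numerical check (illustrative only) that the residual sum of squares
# sum_i (y_i - x_i^T t)^2 equals the squared Euclidean norm ||y - X t||_2^2,
# which is the objective the least squares estimator minimizes.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)
t = rng.normal(size=p)                                 # any candidate coefficient vector t

rss = sum((y[i] - X[i] @ t) ** 2 for i in range(n))    # coordinate-by-coordinate sum
norm_form = np.linalg.norm(y - X @ t) ** 2             # matrix/vector form
print(np.isclose(rss, norm_form))                      # True
```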
y minus Xt, squared-- well, this is equal to the norm of y squared, plus the norm of Xt squared, plus 2 times y transpose Xt. That's just expanding the square in more dimensions. And this, I'm actually going to write as y squared plus-- so here, for the norm squared of this guy, I always have that the norm of x squared is equal to x transpose x. So I'm going to write this as x transpose x, so it's t transpose, X transpose X, t, plus 2 times y transpose Xt.

So now, if I'm going to take the gradient with respect to t, I have basically three terms, and each of them has some sort of a different nature. This term is linear in t, and it's going to differentiate the same way that I differentiate a times x. I'm just going to keep the a. This guy is quadratic-- t appears twice. And this guy, I'm going to pick up a 2, and it's going to differentiate just like when I differentiate a times x squared. It's 2 times ax. And this guy is a constant with respect to t, so it's going to differentiate to 0.

So when I compute the gradient-- now, of course, all of these rules that I give you, you can check by looking at the partial derivative with respect to each coordinate. But arguably, it's much faster to know the rules of differentiation. It's like if I gave you the function exponential of x and I said, what is the derivative, and you started writing, well, I'm going to write exponential of x plus h, minus exponential of x, divided by h, and let h go to 0. That's a bit painful.

AUDIENCE: Why did you transpose your-- why does x have to be [INAUDIBLE]?

PHILIPPE RIGOLLET: I'm sorry?

AUDIENCE: I was wondering why you times t times the [INAUDIBLE]?

PHILIPPE RIGOLLET: The transpose of AB is B transpose A transpose. If you're not sure about this, just make A and B have different sizes, and then you will see that there's some incompatibility.
I mean, there's basically only one way to not screw that one up, so that's easy to remember.

So if I take the gradient, then it's going to be equal to what? It's going to be 0, plus-- we said here, this is going to differentiate like-- so think a times x squared, so I'm going to have 2ax. So here, basically, this guy is going to go to 2 X transpose X t. Now, I could have made this one go away, but that's the same thing as saying that my gradient is-- I can think of my gradient as being either a horizontal vector or a vertical vector. So if I remove this guy, I'm thinking of my gradient as being horizontal. If I remove that guy, I'm thinking of my gradient as being vertical. And that's what I want to think of, typically-- vertical vectors, column vectors. And then this guy, well, it's like-- these guys, just think a times x. So the derivative is just a, so I'm going to keep only that part here. Sorry, I forgot a minus somewhere-- yeah, here. Minus 2 y transpose X.

And what I want is this thing to be equal to 0. So t, the optimal t, is called beta hat and satisfies-- well, I can cancel the 2's and put the minus on the other side, and what I get is that X transpose X t is equal to y transpose X. Yeah, that's not working for me. That's because when I took the derivative, I still need to make sure-- it's the same question of whether I want things to be columns or rows. So this is not a column. If I remove that guy, y transpose X is a row. So I'm just going to take the transpose of this guy to make things work, and this is just going to be X transpose y. And this guy is X transpose y, so that I have columns.

So this is just a linear equation in t. And I have to solve it, so it's of the form some matrix times t is equal to another vector. And so that's basically a linear system. And the way to solve it, at least formally, is to just take the inverse of the matrix on the left.
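The calculus step above can be checked numerically: the gradient of the norm of y minus Xt squared should match 2 X transpose X t minus 2 X transpose y, and setting it to zero gives the normal equations X transpose X t = X transpose y. A small sketch, with made-up data:

```python
# A sketch (not part of the lecture) checking the gradient rule numerically and
# then solving the normal equations X^T X t = X^T y.
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

f = lambda t: np.linalg.norm(y - X @ t) ** 2
t0 = rng.normal(size=p)

# finite-difference gradient vs. the closed form 2 X^T X t - 2 X^T y
h = 1e-6
fd_grad = np.array([(f(t0 + h * e) - f(t0 - h * e)) / (2 * h) for e in np.eye(p)])
closed_form = 2 * X.T @ X @ t0 - 2 * X.T @ y
print(np.allclose(fd_grad, closed_form, atol=1e-4))    # True

# solving the normal equations (preferable in practice to forming the inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(X.T @ X @ beta_hat, X.T @ y))        # True
```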
So if X transpose X is invertible, then-- sorry, beta hat is the t I want-- I get that beta hat is equal to X transpose X inverse, X transpose y. And that's the least squares estimator.

So here, I use this condition. I want it to be invertible so I can actually write its inverse. Here, I wrote, rank of X is equal to p. What is the difference? Well, there's basically no difference. Basically, here, I have to assume-- what is the size of the matrix X transpose X?

[INTERPOSING VOICES]

PHILIPPE RIGOLLET: Yeah, so what is the size?

AUDIENCE: p by p.

PHILIPPE RIGOLLET: p by p. So this matrix is invertible if it's of rank p, if you know what rank means. If you don't, just know that rank p means that it's invertible. So it's full rank and it's invertible. And the rank of X transpose X is actually just the rank of X, because this is the same matrix that you apply twice. And that's all it's saying. So if you're not comfortable with the notion of rank that you see here, just think of this condition as being the condition that X transpose X is invertible. And that's all it says.

What does it mean for it to be invertible? This was true-- we made no assumption up to this point. If X transpose X is not invertible, it means that there might be multiple solutions to this equation. In particular, for a matrix to not be invertible, it means that there's some vector v. So if X transpose X is not invertible, then this is equivalent to: there exists a vector v, which is not 0, such that X transpose X v is equal to 0. That's what it means to not be invertible.

So in particular, if beta hat is a solution-- so this equation is sometimes called the score equations, because the gradient is called the score, and you're just checking that the gradient is equal to 0. So if beta hat satisfies star, then so does beta hat plus lambda v, for all lambda in the real line. And the reason is because, well, if I start looking at-- what is X transpose X times beta hat plus lambda v?
Well, by linearity, this is just X transpose X beta hat, plus lambda X transpose X times v. But this guy is what? It's 0, just because that's what we assumed. We assumed that X transpose X v was equal to 0, so we're left only with this part, which, by star, is just X transpose y. So that means that X transpose X times beta hat plus lambda v is actually equal to X transpose y, which means that there's another solution, which is not just beta hat, but any move of beta hat along this direction v by any size.

So that's going to be an issue, because you're looking for one estimator, and there's not just one-- in this case, there's many. And so this is not going to be well-defined, and you're going to have some issues. So if you want to talk about the least squares estimator, you have to make this assumption.

What does it imply in terms of-- can I think of p being, say, 2n, for example, in this case? What happens if p is equal to 2n?

AUDIENCE: Well, then the rank of your matrix is only p/2.

PHILIPPE RIGOLLET: So the rank of your matrix is only p/2, so that means that this is actually not going to happen. I mean, it's not only p/2, it's at most p/2. It's at most the smallest of the two dimensions of your matrix. So if your matrix is n times 2n, it's at most n, which means that it's not going to be full rank, so it's not going to be invertible. So every time the dimension p is larger than the sample size, your matrix is not invertible, and you cannot talk about the least squares estimator. So that's something to keep in mind.

And it's actually a very simple thing. It's essentially saying, well, if p is larger than n, it means that you have more parameters to estimate than you have equations to estimate them. So you have this linear system. There's one equation per observation. Each row, which was each observation, was giving me one equation. But then the number of unknowns in this linear system is p, and I cannot solve linear systems that have more unknowns than they have equations.
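Here is a minimal illustration, with made-up numbers, of exactly this failure when p is larger than n: X transpose X is singular, a direction v in its null space exists, and adding any multiple of v to one solution gives another solution of the score equations.

```python
# A sketch of the non-uniqueness when p > n: X^T X is singular, any v with
# X^T X v = 0 can be added to a solution, and the score equations still hold.
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 10                                   # more unknowns than equations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

G = X.T @ X
print(np.linalg.matrix_rank(G))                # at most n = 5 < p, so G is singular

beta_hat = np.linalg.pinv(X) @ y               # one particular solution
_, _, Vt = np.linalg.svd(X)                    # a direction v in the null space of X
v = Vt[-1]
print(np.allclose(G @ v, 0))                   # True

for lam in (0.0, 1.0, -7.5):
    b = beta_hat + lam * v
    print(np.allclose(G @ b, X.T @ y))         # every such b solves the score equations
```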
And so that's basically what's happening. Now, in practice, if you think about what data sets look like these days-- for example, people are trying to express some phenotype. So a phenotype is something you can measure on people-- maybe the color of your eyes, or your height, or whether you have diabetes or not, things like this, so things that are macroscopic. And then they want to use the genotype to do that. They want to sequence your genome and try to use this to predict whether you're going to be responsive to a drug, or whether your eyes are going to be blue, or something like this.

Now, the data sets that you can have-- maybe, for a given study about some sort of disease, you will sequence the genome of maybe 100 people. n is equal to 100. p is basically the number of genes they're sequencing. This is of the order of 100,000. So you can imagine that this is a case where n is much, much smaller than p, and you cannot talk about the least squares estimator. There's plenty of them. There's not just one line like that, lambda times v, that you can move along. There's basically an entire space in which you can move, and so it's not well-defined.

So at the end of this class, I will give you a short introduction on how you do this. This actually represents more and more-- it becomes a more and more preponderant part of the data sets you have to deal with, because people just collect data. When I do the sequencing, the machine allows me to sequence 100,000 genes. I'm not going to stop at 100 because doctors are never going to have cohorts of more than 100 patients. So you just collect everything you can collect. And this is true for everything. Cars have sensors all over the place-- much more data than they actually use. We're creating, we're recording everything we can. And so we need some new techniques for that, and that's what high-dimensional statistics is trying to answer.
So this is way beyond the scope of this class, but towards the end, I will give you some hints about what can be done in this framework, because, well, this is the new reality we have to deal with. So here, we're in a case where p is less than n, and typically much smaller than n. So the kind of orders of magnitude you want to have is maybe p of order 10 and n of order 100, something like this. You can scale that, but maybe 10 times larger.

So maybe you cannot solve this for beta hat, but actually, you can talk about X times beta hat, even if p is larger than n. And the reason is that X times beta hat is actually something that's very well-defined. So what is X times beta hat? Remember, I started with the model. So if I look at this definition, essentially, what I had as the original thing was that the vector y was equal to X times beta plus the vector epsilon. That was my model. So beta is actually giving me something. Beta is actually some parameter, some coefficients that are interesting. But here, it means that the observations that I have are of the form X times beta plus some noise. So if I want to adjust for the noise, remove the noise, a good candidate to denoise is X times beta hat. X times beta hat is something that should actually be useful to me, which should be close to X times beta.

So in the one-dimensional case, what it means is that if I have-- let's say this is the true line, and these are my x's-- these are the true points on the real line, and then I have my little epsilons that just give me my observations that move around this line. So this is one of the epsilons, say epsilon i. Then, to say that I recovered the line, I can talk about recovering the right intercept or recovering the right slope for this line. Those are the two parameters that I need to recover.
But I can also say that I've actually found a set of points that's closer to being on the line-- that's closer to this set of points right here-- than the original crosses that I observed. So if we go back to the picture here, for example, what I could do is say, well, for this point here-- there was an x here-- rather than looking at this dot, which was my observation, I can say, well, now that I've estimated the red line, this point should really be here. And actually, I can move all these dots so that they're actually on the red line. And this should be a better value, something that has less noise than the original y value that I observed. It should be close to the true value that I should be seeing without the extra noise. So that's definitely something that could be of interest.

For example, in imaging-- so when you do imaging, y is basically an image. So think of a pixel image, and you just stack it into one long vector. And what you see is something that should look like some linear combination of some feature vectors, maybe. So people have created a bunch of features-- they're called, for example, Gabor frames or wavelet transforms-- so just well-known libraries of variables x such that when you take linear combinations of those guys, this should look like a bunch of images. And for your image, you don't care what the coefficients of the image are in these bases that you came up with. What you care about is the noise in the image, and so you really want to get X beta.

So if you want to estimate X beta, well, you can use X beta hat. What is X beta hat? Well, since beta hat is X transpose X inverse, X transpose y, this is X times X transpose X inverse, X transpose y. That's my estimator for X beta. Now, this thing, actually, I can define even if X is not full rank. So why is this thing interesting?
Well, there's a formula for this estimator, but actually, I can visualize what this thing is. So let's assume, for the sake of illustration, that n is equal to 3. So that means that y lives in a three-dimensional space. And so let's say it's here-- I have my, let's say, y here. And I also have a plane that's given by the vectors x1 transpose, x2 transpose-- sorry, that's not what I want to do. I'm going to say that n is equal to 3 and that p is equal to 2. So I basically have two vectors: 1, 1, 1, and another one-- let's assume that it's, for example, a, b, c. So those are my two vectors. This is x1, and this is x2. And those are my three observations for this guy.

So when I minimize this, I'm looking at the points which can be formed as linear combinations of the columns of X, and I'm trying to find the guy that's the closest to y. So what does it look like? Well, of the two points, 1, 1, 1 is going to be, say, here-- that's the point 1, 1, 1-- and let's say that a, b, c is this point. So now I have a line that goes through those two guys. That's not really-- let's say it's going through those two guys. And this is the line which can be formed by looking only at linear combinations. So this is the line of X times t, for t in R2. That's this entire line that you can get. Yeah, sorry, it's not just a line-- I also have to have t equal to 0, all those things. So that actually creates an entire plane, which is going to be really hard for me to represent. I don't know, maybe I shouldn't do it in these dimensions. So I'm going to do it like that.

So this plane here is the set of Xt for t in R2. So that's a two-dimensional plane-- it definitely goes through 0, and those are all these things. So think of a sheet of paper in three dimensions. Those are the things I can get. So now, what I'm going to have as y is not necessarily in this plane.
y is actually something in this plane-- X beta-- plus some epsilon. y is X beta plus epsilon. So I start from this plane, and then I have this epsilon that pushes me, maybe, outside of this plane. And what least squares is doing is saying, well, I know that epsilon should be fairly small, so the only thing I'm going to do that actually makes sense is to take y and find the point on this plane that's the closest to it. And that corresponds to doing an orthogonal projection of y onto this thing, and that's actually exactly X beta hat.

So in one dimension, just because this is actually a little hard-- in one dimension, that's if p is equal to 1. So let's say this is my point. And then I have y, which is in two dimensions, so this is all on the plane. The point that's right here is actually X beta hat. That's how you find X beta hat: you take your point y and you project it onto the linear span of the columns of X. And that's X beta hat.

This does not tell you exactly what beta should be. And if you know a little bit of linear algebra, it's pretty clear, because if you want to find beta hat, that means that you should be able to find the coordinates of a point in the system of columns of X. And if those guys are redundant, there are not going to be unique coordinates for these guys, so that's why it's actually not easy to find. But X beta hat is uniquely defined. It's a projection. Yeah?

AUDIENCE: And epsilon is the distance between the y and the--

PHILIPPE RIGOLLET: No, epsilon is the vector that goes from-- so there's a true X beta. That's the true one. It's not clear-- I mean, X beta hat is unlikely to be exactly equal to X beta. And then the epsilon is the one that starts from this line. It's the vector that pushes you away. So really, this is this vector. That's epsilon. So it's not a length.
The length of epsilon is the distance, but epsilon is just the actual vector that takes you from one to the other. So this is all in two dimensions, and it's probably much clearer than what's here.

And so here, I claim that this X beta hat-- so from this picture, I implicitly claim that forming this operator that takes y and maps it into this vector X times X transpose X inverse, X transpose y, this should actually be equal to the projection of y onto the linear span of the columns of X. That's what I just drew for you. And what it means is that this matrix must be the projection matrix.

So of course, anybody-- who knows linear algebra here? OK, wow. So what are the conditions that a projection matrix should be satisfying?

AUDIENCE: Squares to itself.

PHILIPPE RIGOLLET: Squares to itself, right. If I project twice, I'm not moving. If I keep on iterating the projection, once I'm in the space I'm projecting onto, I'm not moving. What else? Do they have to be symmetric, maybe?

AUDIENCE: If it's an orthogonal projection.

PHILIPPE RIGOLLET: Yeah, so this is an orthogonal projection. It has to be symmetric. And that's pretty much it. So from those things, you can actually get quite a bit. But what's interesting is that if you actually look at the eigenvalues of this matrix, they should be either 0 or 1, essentially. And they are 1 if the associated eigenvector is within this space, and 0 otherwise. And so that's basically what you can check. This is not an exercise in linear algebra, so I'm not going to go too much into those details. But this is essentially what you want to keep in mind. What's associated to orthogonal projections is Pythagoras' theorem. And that's something that's going to be useful for us.
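A short sketch, with arbitrary data, of the claims just made about the matrix H = X (X transpose X) inverse X transpose: it is symmetric, it squares to itself, its eigenvalues are 0 or 1, and, because the projection is orthogonal, the Pythagoras identity the lecture turns to next holds for y = Hy + (I - H)y.

```python
# A sketch (made-up data) of the orthogonal projection onto the column span of X.
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(H, H.T))                 # symmetric
print(np.allclose(H @ H, H))               # idempotent: projecting twice changes nothing

eigvals = np.linalg.eigvalsh(H)            # ascending order: n - p zeros, then p ones
print(np.allclose(eigvals[-p:], 1), np.allclose(eigvals[:-p], 0))

fitted = H @ y                             # X beta hat, the projection of y
residual = y - fitted                      # orthogonal to the column span of X
print(np.isclose(np.linalg.norm(y) ** 2,
                 np.linalg.norm(fitted) ** 2 + np.linalg.norm(residual) ** 2))
```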
What it's essentially telling us is that if I look at this norm squared-- sorry, this norm squared plus this norm squared is equal to this norm squared. And that one is the norm of y squared. So Pythagoras tells me that the norm of y squared is equal to the norm of X beta hat squared, plus the norm of y minus X beta hat squared. Agreed? It's just because I have a right angle here. So this plus this is equal to this.

So now, to define this, I made no assumption. Epsilon could be as wild as it wants. I was just crossing my fingers that epsilon was actually small enough that it would make sense to project onto the linear span, because I implicitly assumed that epsilon did not take me all the way there, so that, actually, it makes sense to project back. And so for that, I need to somehow make assumptions that epsilon is well-behaved and not completely wild-- that it's moving uniformly in all directions of the space. There's no privileged direction where it's always going; otherwise, I'm going to make a systematic error. And I need those epsilons to average out somehow.

So here are the assumptions we're going to be making so that we can actually do some statistical inference. The first one is that the design matrix is deterministic. So I started by saying I have xi, yi, and maybe they're independent. Here, they are, but the xi's I want to think of as deterministic. If they're not deterministic, I can condition on them, but otherwise, it's very difficult to think about this thing if I think of those entries as being random, because then I have the inverse of a random matrix, and things become very, very complicated. So we're going to think of those guys as being deterministic.

We're going to think of the model as being homoscedastic. And actually, let me come back to this in a second. Homoscedastic-- well, if you're trying to find the etymology of this word, "homo" means the same, "scedastic" means scaling.
So what I want to say is that the epsilons have the same scaling. And since my third assumption is that epsilon is Gaussian, then essentially, what I'm going to want is that they all share the same sigma squared. They're independent, so this is indeed sigma squared times the identity covariance matrix. And I want them to be centered, as well. That means that there's no direction that I'm always privileging when I'm moving away from my plane there.

So these are important conditions. It depends on how much inference you want to do. If you want to write t-tests, you need all these assumptions. But if you only want to say, for example, that your least squares estimator is consistent, you really just need the fact that epsilon has variance sigma squared. The fact that it's Gaussian won't matter, just like Gaussianity doesn't matter for the law of large numbers. Yeah?

AUDIENCE: So the first assumption is that x has to be deterministic, but I just made up this x1, x2--

PHILIPPE RIGOLLET: x is the matrix.

AUDIENCE: Yeah. So those are random variables, right?

PHILIPPE RIGOLLET: No, that's the assumption.

AUDIENCE: OK. So I mean, once we collect the data and put it in the matrix, it becomes deterministic. So maybe I'm missing something.

PHILIPPE RIGOLLET: Yeah. So this is for the purpose of the analysis. I can actually assume that-- I look at my data, and I think of it this way. So what is the difference between thinking of data as deterministic or thinking of it as random? When I talked about random data, the only assumptions that I made were about the distribution. I said, well, if my x is a random variable, I want it to have this variance and I want it to have, maybe, this distribution, things like this. Here, I'm actually making an assumption on the values that I see. I'm saying that the values that you give me are such that the matrix is actually invertible-- X transpose X will be invertible.
So I've never done that before-- assuming that some random variable, say some Gaussian random variable, was positive, for example. We don't do that, because there's always some probability that things don't happen if you make things random. And so here, it's basically a little stronger. I start my assumption by saying, the data that's given to me will actually satisfy those assumptions. And that means that I don't actually need to make some modeling assumption on this thing, because I'm putting in directly the assumption I want to see.

So here, either I know sigma squared or I don't know sigma squared. So is that clear? So essentially, I'm assuming that I have this model, where this guy, now, is deterministic, and this is some multivariate Gaussian with mean 0 and covariance matrix the identity of R^n. That's the model I'm assuming. And I'm observing this, and I'm given this matrix X.

Where does this make sense? You could say, well, if I think of my rows as being people and I'm collecting genes, it's a little intense to assume that I actually know, ahead of time, what I'm going to be seeing, and that those things are deterministic. That's true, but it still does not prevent the analysis from going through, for one. And second, a better example might be this imaging example that I described, where those x's are actually libraries-- libraries of patterns that people have created, maybe from deep learning nets, or something like this. They've created patterns, and they say that all images should be representable as a linear combination of those patterns. And those patterns are somewhere in books, so they're certainly deterministic. Everything that's actually written down in a book is as deterministic as it gets.

Any questions about those assumptions? Those are the things we're going to be working with. There's only three of them. One is about x.
Actually, there's really two of them. I mean, this guy already appears here. So there's two-- one on the noise, one on the x's. That's it. Those things allow us to do quite a bit. They allow me to write the distribution of beta hat, which is great, because when I know the distribution of my estimator, I know its fluctuations. If it's centered around the true parameter, I know that it's going to be fluctuating around the true parameter, and it should tell me what kind of distribution the fluctuations have. I actually know how to build confidence intervals, I know how to build tests, I know how to build everything. It's just like when I told you that, asymptotically, the empirical average was Gaussian with mean theta and a standard deviation that depended on n, et cetera-- that's basically the only thing I needed. And this is what I'm actually getting here.

So let me start with this statement. So remember, beta hat satisfied this, so I'm going to rewrite it here. So beta hat was equal to X transpose X, inverse, X transpose y. That was the definition that we found. And now, I also know that y was equal to X beta plus epsilon. So let me just replace y by X beta plus epsilon here. Yeah?

AUDIENCE: Isn't it x transpose x inverse x transpose y?

PHILIPPE RIGOLLET: Yes, x transpose. Thank you. So I'm going to replace y by X beta plus epsilon. So that's-- and here comes the magic. I have the inverse of a matrix, and then I have the original matrix. So this is actually the identity, times beta. And now this guy-- well, this is a Gaussian, because this is a Gaussian random vector, and I just multiply it by a deterministic matrix. So we're going to use the rule that if I have, say, epsilon, which is N(0, Sigma), then B times epsilon is N(0, ...)-- can somebody tell me what the covariance matrix of B epsilon is?
AUDIENCE: What is capital B in this case?

PHILIPPE RIGOLLET: It's just a matrix-- any matrix that I can premultiply-- that I can postmultiply with epsilon. Yeah?

AUDIENCE: b transpose b.

PHILIPPE RIGOLLET: b transpose?

AUDIENCE: Times b.

PHILIPPE RIGOLLET: And Sigma is gone?

AUDIENCE: Oh, times sigma, sorry.

PHILIPPE RIGOLLET: That's the matrix, right?

AUDIENCE: b transpose sigma b.

PHILIPPE RIGOLLET: Almost. Anybody want to take a guess at the last one? I think we've removed all other possibilities. It's B Sigma B transpose. So if you ever answered yes to the question, "do you know Gaussian random vectors," but you did not know that, there's a gap in your knowledge that you need to fill, because that's probably the most important property of Gaussian vectors. When you multiply them by matrices, you have a simple rule on how to update the covariance matrix.

So here, Sigma is the identity. And here, this is the matrix B that I had here. So what this is, basically, is some multivariate N, of course-- then I'm going to have 0-- and what I need to do is B times the identity times B transpose, which is just B B transpose. And what is it going to tell me? It's X transpose X-- sorry, that's inverse-- inverse, X transpose, and then the transpose of this guy, which is X, X transpose X inverse, transpose. But this matrix is symmetric, so I'm actually not going to take the transpose of this guy.

And again, magic shows up. The inverse times the matrix-- those two guys cancel, and so this is actually equal to beta plus some N(0, X transpose X inverse). Yeah?

AUDIENCE: I'm a little lost on the [INAUDIBLE]. So you define that as the B matrix, and what happens?

PHILIPPE RIGOLLET: So I just apply this rule, right?

AUDIENCE: Yeah.
782 00:42:55,720 --> 00:42:57,850 PHILIPPE RIGOLLET: So if I multiply a matrix 783 00:42:57,850 --> 00:43:01,840 by a Gaussian, then let's say this Gaussian had 784 00:43:01,840 --> 00:43:05,680 mean 0, which is the case of epsilon here, 785 00:43:05,680 --> 00:43:07,960 then the covariance matrix that I get 786 00:43:07,960 --> 00:43:10,330 is b times the original covariance matrix times b 787 00:43:10,330 --> 00:43:11,470 transpose. 788 00:43:11,470 --> 00:43:15,290 So all I did is write this matrix times the identity 789 00:43:15,290 --> 00:43:18,195 times this matrix transpose. 790 00:43:18,195 --> 00:43:20,320 And the identity, of course, doesn't play any role, 791 00:43:20,320 --> 00:43:21,240 so I can remove it. 792 00:43:21,240 --> 00:43:23,860 It's just this matrix, then the matrix transpose. 793 00:43:23,860 --> 00:43:25,370 And what happened? 794 00:43:25,370 --> 00:43:27,280 So what is the transpose of this matrix? 795 00:43:27,280 --> 00:43:32,710 So I used the fact that if I look at x transpose x inverse x 796 00:43:32,710 --> 00:43:39,160 transpose, and now I look at the whole transpose of this thing, 797 00:43:39,160 --> 00:43:40,510 that's actually equal 2. 798 00:43:40,510 --> 00:43:43,510 And I use the rule that ab transpose is b transpose 799 00:43:43,510 --> 00:43:46,030 a transpose-- let me finish-- 800 00:43:46,030 --> 00:43:51,925 and it's x x transpose x inverse. 801 00:43:55,151 --> 00:43:55,650 Yes? 802 00:43:55,650 --> 00:43:58,020 AUDIENCE: I thought the-- 803 00:43:58,020 --> 00:44:00,335 for epsilon, it was sigma squared. 804 00:44:00,335 --> 00:44:01,710 PHILIPPE RIGOLLET: Oh, thank you. 805 00:44:01,710 --> 00:44:03,610 There's a sigma squared somewhere. 806 00:44:03,610 --> 00:44:08,610 So this was sigma squared times the identity, so I can just 807 00:44:08,610 --> 00:44:10,566 pick up a sigma squared anywhere. 808 00:44:14,740 --> 00:44:28,560 So here, in our case, so for epsilon, this is sigma. 809 00:44:28,560 --> 00:44:30,000 Sigma squared times the identity, 810 00:44:30,000 --> 00:44:31,166 that's my covariance matrix. 811 00:44:33,920 --> 00:44:35,242 You seem perplexed. 812 00:44:35,242 --> 00:44:37,170 AUDIENCE: It's just a new idea for me 813 00:44:37,170 --> 00:44:41,754 to think of a maximum likelihood estimator as a random variable. 814 00:44:41,754 --> 00:44:43,420 PHILIPPE RIGOLLET: Oh, it should not be. 815 00:44:43,420 --> 00:44:45,722 Any estimator is a random variable. 816 00:44:45,722 --> 00:44:48,132 AUDIENCE: Oh, yeah, that's a good point. 817 00:44:48,132 --> 00:44:52,236 PHILIPPE RIGOLLET: [LAUGHS] And I have not 818 00:44:52,236 --> 00:44:54,110 told you that this was the maximum likelihood 819 00:44:54,110 --> 00:44:55,720 estimator just yet. 820 00:44:55,720 --> 00:44:58,910 The estimator is a random variable. 821 00:44:58,910 --> 00:45:00,890 There's a word-- some people use estimate just 822 00:45:00,890 --> 00:45:03,519 to differentiate the estimator while you're 823 00:45:03,519 --> 00:45:05,810 doing the analysis with random variables and the values 824 00:45:05,810 --> 00:45:09,477 when you plug in the numbers in there. 825 00:45:09,477 --> 00:45:12,060 But then, of course, people use estimate because it's shorter, 826 00:45:12,060 --> 00:45:14,660 so then it's confusing. 827 00:45:14,660 --> 00:45:17,990 So any questions about this computation? 828 00:45:17,990 --> 00:45:20,810 Did I forget any other Greek letter along the way? 829 00:45:20,810 --> 00:45:22,620 All right, I think we're good. 
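[A small numerical sanity check of the rule just used may help here: if epsilon ~ N(0, Sigma) and B is a fixed matrix, then B epsilon ~ N(0, B Sigma B transpose), and with B = (X^T X)^{-1} X^T and Sigma = sigma^2 times the identity this gives beta hat = beta + B epsilon with covariance sigma^2 (X^T X)^{-1}. The Python sketch below simulates this; the design matrix, true beta, sigma, and sample sizes are made-up illustrative values, not anything from the lecture.]

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 0.5                        # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5])                # arbitrary "true" parameter

B = np.linalg.inv(X.T @ X) @ X.T                 # beta_hat = B y = beta + B epsilon

reps = 20000
beta_hats = np.empty((reps, p))
for r in range(reps):
    eps = sigma * rng.normal(size=n)             # epsilon ~ N(0, sigma^2 I_n)
    beta_hats[r] = B @ (X @ beta + eps)

emp_cov = np.cov(beta_hats, rowvar=False)        # empirical covariance of beta_hat
theo_cov = sigma**2 * np.linalg.inv(X.T @ X)     # B (sigma^2 I) B^T = sigma^2 (X^T X)^{-1}
print(np.round(emp_cov, 4))
print(np.round(theo_cov, 4))
print(np.round(beta_hats.mean(axis=0), 3))       # close to beta: the estimator is centered
```

[The two covariance matrices should agree up to Monte Carlo error.]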
830 00:45:22,620 --> 00:45:26,225 So one thing that it says-- and actually, 831 00:45:26,225 --> 00:45:27,600 thank you for pointing this out-- 832 00:45:27,600 --> 00:45:30,540 I said there's actually a little hidden statement there. 833 00:45:30,540 --> 00:45:33,130 By the way, this answers this question. 834 00:45:33,130 --> 00:45:35,990 Beta hat is of the form beta plus something that's centered, 835 00:45:35,990 --> 00:45:39,484 so it's indeed of the form Gaussian with mean beta 836 00:45:39,484 --> 00:45:41,900 and covariance matrix sigma squared x transpose x inverse. 837 00:45:45,520 --> 00:45:47,640 So that's very nice. 838 00:45:47,640 --> 00:45:50,830 As long as x transpose x is not huge, 839 00:45:50,830 --> 00:45:55,900 I'm going to have something that is close to what I want. 840 00:45:55,900 --> 00:45:58,550 Oh, sorry, x transpose x inverse is not huge. 841 00:46:01,800 --> 00:46:05,670 So there's a hidden claim in there, 842 00:46:05,670 --> 00:46:08,640 which is that least squares estimator 843 00:46:08,640 --> 00:46:11,588 is equal to the maximum likelihood estimator. 844 00:46:15,500 --> 00:46:17,920 Why does the maximum likelihood estimator just 845 00:46:17,920 --> 00:46:19,770 enter the picture now? 846 00:46:19,770 --> 00:46:23,280 We've been talking about regression for the past 18 847 00:46:23,280 --> 00:46:24,450 slides. 848 00:46:24,450 --> 00:46:26,130 And we've been talking about estimators. 849 00:46:26,130 --> 00:46:29,070 And I just dumped on you the least squares estimator, 850 00:46:29,070 --> 00:46:31,830 but I never really came back to this thing that we know-- 851 00:46:31,830 --> 00:46:35,100 maybe the method of moments, or maybe the maximum likelihood 852 00:46:35,100 --> 00:46:35,930 estimator. 853 00:46:35,930 --> 00:46:37,930 It turns out that those two things are the same. 854 00:46:37,930 --> 00:46:41,880 But if I want to talk about a maximum likelihood estimator, 855 00:46:41,880 --> 00:46:43,140 I need to have a likelihood. 856 00:46:43,140 --> 00:46:46,160 In particular, I need to have a density. 857 00:46:46,160 --> 00:46:47,600 And so if I want a density, I have 858 00:46:47,600 --> 00:46:53,210 to make those assumptions, such as the epsilons have 859 00:46:53,210 --> 00:46:55,970 this Gaussian distribution. 860 00:46:55,970 --> 00:46:58,580 So why is this the maximum likelihood estimator? 861 00:46:58,580 --> 00:47:04,740 Well, remember, y is x transpose beta plus epsilon. 862 00:47:04,740 --> 00:47:07,530 So I actually have a bunch of data. 863 00:47:07,530 --> 00:47:14,390 So what is my model here? 864 00:47:14,390 --> 00:47:18,040 Well, its the family of Gaussians 865 00:47:18,040 --> 00:47:22,460 on n observations with mean x beta, variance sigma 866 00:47:22,460 --> 00:47:31,380 squared identity, and beta lives in rp. 867 00:47:31,380 --> 00:47:34,800 Here's my family of distributions. 868 00:47:34,800 --> 00:47:38,160 That's the possible distributions for y. 869 00:47:38,160 --> 00:47:41,500 And so in particular, I can write the density of y. 870 00:47:47,980 --> 00:47:48,760 Well, what is it? 871 00:47:48,760 --> 00:47:52,010 It's something that looks like p of x-- 872 00:47:52,010 --> 00:47:58,359 well, p of y, let's say, is equal to 1 over-- 873 00:47:58,359 --> 00:48:00,400 so now its going to be a little more complicated, 874 00:48:00,400 --> 00:48:17,740 but its sigma squared times 2 pi to the p/2 exponential minus 875 00:48:17,740 --> 00:48:26,840 norm of y minus x beta squared divided by 2 sigma squared. 
876 00:48:26,840 --> 00:48:29,780 So that's just the multivariate Gaussian density. 877 00:48:29,780 --> 00:48:30,890 I just wrote it. 878 00:48:30,890 --> 00:48:33,530 That's the density of a multivariate Gaussian 879 00:48:33,530 --> 00:48:36,740 with mean x beta and covariance matrix sigma squared 880 00:48:36,740 --> 00:48:37,700 times the identity. 881 00:48:37,700 --> 00:48:40,410 That's what it is. 882 00:48:40,410 --> 00:48:42,300 So you don't have to learn this by heart, 883 00:48:42,300 --> 00:48:47,100 but if you are familiar with the case where p is equal to 1, 884 00:48:47,100 --> 00:48:49,820 you can check that you recover what you're familiar with, 885 00:48:49,820 --> 00:48:54,811 and this makes sense as an extension. 886 00:48:59,730 --> 00:49:08,560 So now, I can actually write my log likelihood. 887 00:49:08,560 --> 00:49:14,880 How many observations do I have of this vector y? 888 00:49:23,710 --> 00:49:25,144 Do I have n observations of y? 889 00:49:30,580 --> 00:49:33,110 I have just one, right? 890 00:49:33,110 --> 00:49:36,830 Oh, sorry, I shouldn't have said p, this is n. 891 00:49:36,830 --> 00:49:38,510 Everything is in dimension n. 892 00:49:38,510 --> 00:49:42,700 So I can think of either having n independent observations 893 00:49:42,700 --> 00:49:44,180 of each coordinate, or I can think 894 00:49:44,180 --> 00:49:47,210 of having just one observation of the vector y. 895 00:49:47,210 --> 00:49:50,050 So when I write my log likelihood, 896 00:49:50,050 --> 00:49:54,850 it's just the log of the density at y. 897 00:50:09,090 --> 00:50:13,710 And that's the vector y, which I can 898 00:50:13,710 --> 00:50:18,990 write as minus n/2 log sigma squared 899 00:50:18,990 --> 00:50:28,690 2 pi minus 1 over 2 sigma squared norm of y minus x beta. 900 00:50:28,690 --> 00:50:30,310 And that's, again, my boldface y. 901 00:50:36,710 --> 00:50:39,222 And what is my maximum likelihood estimator? 902 00:50:44,470 --> 00:50:47,940 Well, this guy does not depend on beta. 903 00:50:47,940 --> 00:50:50,850 And this is just a constant factor in front of this guy. 904 00:50:50,850 --> 00:50:54,270 So it's the same thing as just minimizing, 905 00:50:54,270 --> 00:50:57,230 because I have a minus sign, over all beta and rp. 906 00:51:03,140 --> 00:51:05,580 y minus x beta squared, and that's my least squares 907 00:51:05,580 --> 00:51:06,570 estimator. 908 00:51:15,312 --> 00:51:17,270 Is there anything that's unclear on this board? 909 00:51:17,270 --> 00:51:17,910 Any question? 910 00:51:20,550 --> 00:51:23,230 So all I used was-- so I wrote my log likelihood, which 911 00:51:23,230 --> 00:51:25,860 is just the log of this expression 912 00:51:25,860 --> 00:51:28,750 where y is my observation. 913 00:51:28,750 --> 00:51:32,430 And that's indeed the observation that I have here. 914 00:51:32,430 --> 00:51:35,980 And that was just some constant minus some constant times 915 00:51:35,980 --> 00:51:37,960 this quantity that depends on beta. 916 00:51:37,960 --> 00:51:40,270 So maximizing this whole thing is the same thing 917 00:51:40,270 --> 00:51:42,810 as minimizing only this thing. 918 00:51:42,810 --> 00:51:44,620 The minimizers are the same. 919 00:51:44,620 --> 00:51:47,320 And so that tells me that I actually just 920 00:51:47,320 --> 00:51:49,000 have to minimize the squared norm 921 00:51:49,000 --> 00:51:51,710 to get my maximum likelihood estimator. 
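[As a sketch of the equivalence just argued (under Gaussian noise, maximizing the likelihood is the same as minimizing the squared norm), the snippet below minimizes the negative log-likelihood from the board numerically and compares the minimizer to the closed-form least squares solution. The data-generating choices are illustrative assumptions.]

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, sigma = 100, 3, 1.0                        # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + sigma * rng.normal(size=n)

def neg_log_likelihood(b):
    # -log p(y) = (n/2) log(2 pi sigma^2) + ||y - X b||^2 / (2 sigma^2)
    resid = y - X @ b
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)      # closed-form least squares

print(np.round(beta_mle, 6))
print(np.round(beta_ls, 6))                      # the two agree up to optimizer tolerance
```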
922 00:51:51,710 --> 00:51:55,060 But this used, heavily, the fact that I could actually 923 00:51:55,060 --> 00:52:03,450 write exactly what my density was, 924 00:52:03,450 --> 00:52:06,240 and that when I took the log of this thing, 925 00:52:06,240 --> 00:52:09,660 I had exactly the square norm that showed up. 926 00:52:09,660 --> 00:52:12,630 If I had a different density, if, for example, 927 00:52:12,630 --> 00:52:17,040 I assumed that my coordinates of epsilons were, say, iid 928 00:52:17,040 --> 00:52:18,720 double exponential random variables. 929 00:52:18,720 --> 00:52:21,240 So it's just half of an exponential on the positives. 930 00:52:21,240 --> 00:52:24,280 And half of an exponential on the negatives. 931 00:52:24,280 --> 00:52:27,342 So if I said that, then this would not 932 00:52:27,342 --> 00:52:28,800 have the square norm that shows up. 933 00:52:28,800 --> 00:52:31,057 This is really idiosyncratic to Gaussians. 934 00:52:31,057 --> 00:52:32,640 If I had something else, I would have, 935 00:52:32,640 --> 00:52:35,190 maybe, a different norm here, or something different 936 00:52:35,190 --> 00:52:39,420 that measures the difference between y and x beta. 937 00:52:39,420 --> 00:52:41,820 And that's how you come up with other maximum likelihood 938 00:52:41,820 --> 00:52:44,010 estimators that lead to other estimators that 939 00:52:44,010 --> 00:52:45,420 are not the least squares-- 940 00:52:45,420 --> 00:52:47,040 maybe the least absolute deviation, 941 00:52:47,040 --> 00:52:50,310 for example, or this fourth moment, 942 00:52:50,310 --> 00:52:52,890 for example, that you suggested last time. 943 00:52:52,890 --> 00:52:55,650 So I can come up with a bunch of different things, 944 00:52:55,650 --> 00:52:56,910 and they might be tied-- 945 00:52:56,910 --> 00:52:59,716 maybe I can come at them from the same perspective 946 00:52:59,716 --> 00:53:01,590 that I came at the least squares estimator. 947 00:53:01,590 --> 00:53:03,210 I said, let's just do something smart 948 00:53:03,210 --> 00:53:06,350 and check, then, that it's indeed the maximum likelihood 949 00:53:06,350 --> 00:53:08,040 estimator. 950 00:53:08,040 --> 00:53:11,250 Or I could just start with the modeling on-- 951 00:53:11,250 --> 00:53:13,260 and check, then, what happens-- 952 00:53:13,260 --> 00:53:15,840 what was the implicit assumption that I put on my noise. 953 00:53:15,840 --> 00:53:18,164 Or I could start with the assumption of the noise, 954 00:53:18,164 --> 00:53:19,830 compute the maximum likelihood estimator 955 00:53:19,830 --> 00:53:21,000 and see what it turns into. 956 00:53:24,660 --> 00:53:26,760 So that was the first thing. 957 00:53:26,760 --> 00:53:29,080 I've just proved to you the first line. 958 00:53:29,080 --> 00:53:31,950 And from there, you can get what you want. 959 00:53:31,950 --> 00:53:34,690 So all the other lines are going to follow. 960 00:53:34,690 --> 00:53:39,570 So what is beta hat-- so for example, let's look 961 00:53:39,570 --> 00:53:41,660 at the second line, the quadratic risk. 962 00:53:46,180 --> 00:53:49,270 Beta hat minus beta, from this formula, 963 00:53:49,270 --> 00:53:53,780 has a distribution, which is N of 0, 964 00:53:53,780 --> 00:53:58,369 and then I have x transpose x inverse. 965 00:53:58,369 --> 00:54:03,299 AUDIENCE: Wouldn't the dimension be p on the board? 966 00:54:07,250 --> 00:54:10,308 PHILIPPE RIGOLLET: Sorry, the dimension of what? 967 00:54:10,308 --> 00:54:11,769 AUDIENCE: Oh beta hat minus beta.
968 00:54:11,769 --> 00:54:13,287 Isn't beta only a p dimensional? 969 00:54:13,287 --> 00:54:15,620 PHILIPPE RIGOLLET: Oh, yeah, you're right, you're right. 970 00:54:15,620 --> 00:54:17,450 That was all p dimensional there. 971 00:54:22,170 --> 00:54:23,700 Yeah. 972 00:54:23,700 --> 00:54:28,220 So if b here, the matrix that I'm actually applying, 973 00:54:28,220 --> 00:54:30,810 has dimension p times n-- 974 00:54:30,810 --> 00:54:34,710 so even if epsilon was an n dimensional Gaussian vector, 975 00:54:34,710 --> 00:54:39,310 then b times epsilon is a p dimensional Gaussian vector 976 00:54:39,310 --> 00:54:39,980 now. 977 00:54:39,980 --> 00:54:42,720 So that's how I switch from p to n-- 978 00:54:42,720 --> 00:54:43,770 from n to p. 979 00:54:43,770 --> 00:54:45,120 Thank you. 980 00:54:45,120 --> 00:54:50,430 So you're right, this is beta hat minus beta is this guy. 981 00:54:50,430 --> 00:54:54,090 And so in particular, if I look at the expectation 982 00:54:54,090 --> 00:55:01,160 of the norm of beta hat minus beta squared, what is it? 983 00:55:01,160 --> 00:55:08,140 It's the expectation of the norm of some Gaussian vector. 984 00:55:12,100 --> 00:55:15,530 And so it turns out-- so maybe we don't have-- 985 00:55:15,530 --> 00:55:18,960 well, that's just also a property of a Gaussian vector. 986 00:55:18,960 --> 00:55:26,840 So if epsilon is n0 sigma, then the expectation 987 00:55:26,840 --> 00:55:34,576 of the norm of epsilon squared is just the trace of sigma. 988 00:55:37,910 --> 00:55:41,030 Actually, we can probably check this 989 00:55:41,030 --> 00:55:44,540 by saying that this is the sum from j equal 1 990 00:55:44,540 --> 00:55:51,128 to p of the expectation of beta hat j minus beta j squared. 991 00:55:54,310 --> 00:55:57,879 Since beta j squared is the expectation-- beta j 992 00:55:57,879 --> 00:55:59,170 is the expectation of beta hat. 993 00:55:59,170 --> 00:56:01,990 This is actually equal to the sum from j equal 1 994 00:56:01,990 --> 00:56:08,110 to p of the variance of beta hat j, 995 00:56:08,110 --> 00:56:11,950 just because this is the expectation of beta hat. 996 00:56:11,950 --> 00:56:15,590 And how do I read the variances in a covariance matrix? 997 00:56:15,590 --> 00:56:17,830 There are just the diagonal elements. 998 00:56:17,830 --> 00:56:25,390 So that's really just sigma jj. 999 00:56:25,390 --> 00:56:27,700 And so that's really equal to-- 1000 00:56:27,700 --> 00:56:29,470 so that's the sum of the diagonal elements 1001 00:56:29,470 --> 00:56:30,790 of this matrix. 1002 00:56:30,790 --> 00:56:33,960 Let's call it sigma. 1003 00:56:33,960 --> 00:56:40,020 So that's equal to the trace of x transpose x inverse. 1004 00:56:42,740 --> 00:56:45,364 The trace is the sum of the diagonal elements of a matrix. 1005 00:56:48,080 --> 00:56:49,700 And I still had something else. 1006 00:56:49,700 --> 00:56:52,070 I'm sorry, this was sigma squared. 1007 00:56:52,070 --> 00:56:54,200 I forget it all the time. 1008 00:56:54,200 --> 00:56:56,800 So the sigma squared comes out. 1009 00:56:56,800 --> 00:56:58,760 It's there. 1010 00:56:58,760 --> 00:57:01,275 And so the sigma squared comes out 1011 00:57:01,275 --> 00:57:02,900 because the trace is a linear operator. 
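[The risk identity being assembled here, E||beta hat minus beta||^2 = sigma^2 trace((X^T X)^{-1}), can also be checked by simulation. The sketch below averages the squared estimation error over many draws of the noise; all numerical values are illustrative assumptions.]

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 50, 4, 2.0                         # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
XtX_inv = np.linalg.inv(X.T @ X)

reps = 50000
sq_err = np.empty(reps)
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = XtX_inv @ (X.T @ y)
    sq_err[r] = np.sum((beta_hat - beta) ** 2)

print("Monte Carlo quadratic risk:", sq_err.mean())
print("sigma^2 * trace((X^T X)^{-1}):", sigma**2 * np.trace(XtX_inv))
```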
1012 00:57:02,900 --> 00:57:06,275 If I multiply all the entries of my matrix by the same number, 1013 00:57:06,275 --> 00:57:08,150 then all the diagonal elements are multiplied 1014 00:57:08,150 --> 00:57:09,775 by the same number, so when I sum them, 1015 00:57:09,775 --> 00:57:13,930 the sum is multiplied by the same number. 1016 00:57:13,930 --> 00:57:18,120 So that's for the quadratic risk of beta hat. 1017 00:57:18,120 --> 00:57:21,580 And now I need to tell you about x beta hat. 1018 00:57:21,580 --> 00:57:27,250 x beta hat was something that was actually telling me 1019 00:57:27,250 --> 00:57:30,370 that was the point that I reported on the red line 1020 00:57:30,370 --> 00:57:31,480 that I estimated. 1021 00:57:31,480 --> 00:57:32,800 That was my x beta hat. 1022 00:57:32,800 --> 00:57:40,310 That was my y minus the noise. 1023 00:57:40,310 --> 00:57:42,470 Now, this thing here-- 1024 00:57:42,470 --> 00:57:47,100 so remember, we had this line, and I had my observation. 1025 00:57:47,100 --> 00:57:51,370 And here, I'm really trying to measure this distance squared. 1026 00:57:51,370 --> 00:57:53,470 This distance is actually quite important for me 1027 00:57:53,470 --> 00:57:58,920 because it actually shows up in the Pythagoras theorem. 1028 00:57:58,920 --> 00:58:02,260 And so you could actually try to estimate this thing. 1029 00:58:02,260 --> 00:58:03,790 So what is the prediction error? 1030 00:58:12,900 --> 00:58:18,840 So we said we have y minus x beta hat, so that's 1031 00:58:18,840 --> 00:58:21,930 the norm of this thing we're trying to compute. 1032 00:58:21,930 --> 00:58:25,350 But let's write this for what it is for one second. 1033 00:58:25,350 --> 00:58:27,810 So we said that beta hat was x transpose 1034 00:58:27,810 --> 00:58:31,710 x inverse x transpose y, and we know that y is 1035 00:58:31,710 --> 00:58:35,950 x beta plus epsilon. 1036 00:58:35,950 --> 00:58:37,410 So let's write this-- 1037 00:58:40,620 --> 00:58:43,800 x beta plus epsilon plus x. 1038 00:58:57,000 --> 00:59:00,320 And actually, maybe I should not write it. 1039 00:59:00,320 --> 00:59:02,722 Let me keep the y for what it is now. 1040 00:59:07,140 --> 00:59:08,960 So that means that I have, essentially, 1041 00:59:08,960 --> 00:59:13,050 the identity of rn times y minus this matrix times y. 1042 00:59:13,050 --> 00:59:15,510 So I can factor y out, and that's 1043 00:59:15,510 --> 00:59:20,280 the identity of rn minus x x transpose 1044 00:59:20,280 --> 00:59:27,280 x inverse x transpose, the whole thing times y. 1045 00:59:32,760 --> 00:59:37,980 We call this matrix p because this was the projection matrix 1046 00:59:37,980 --> 00:59:41,540 onto the linear span of the x's. 1047 00:59:41,540 --> 00:59:46,120 So that means that if I take a point x and I apply p times x, 1048 00:59:46,120 --> 00:59:50,910 I'm projecting onto the linear span of the columns of x. 1049 00:59:50,910 --> 00:59:57,400 What happens if I do i minus p times x? 1050 00:59:57,400 --> 00:59:59,000 It's x minus px. 1051 01:00:01,540 --> 01:00:04,690 So if I look at the point on which-- 1052 01:00:04,690 --> 01:00:07,000 so this is the point on which I project. 1053 01:00:07,000 --> 01:00:08,660 This is x. 1054 01:00:08,660 --> 01:00:13,260 I project orthogonally to get p times x. 1055 01:00:13,260 --> 01:00:15,920 And so what it means is that this operator i 1056 01:00:15,920 --> 01:00:21,810 minus px is actually giving me this guy, this vector here-- 1057 01:00:21,810 --> 01:00:23,360 x minus p times x.
1058 01:00:30,790 --> 01:00:33,920 Let's say this is 0. 1059 01:00:33,920 --> 01:00:36,460 This means that this vector, I can put it here. 1060 01:00:36,460 --> 01:00:38,370 It's this vector here. 1061 01:00:38,370 --> 01:00:40,510 And that's actually the orthogonal projection 1062 01:00:40,510 --> 01:00:43,870 of x onto the orthogonal complement of the span 1063 01:00:43,870 --> 01:00:45,532 of the columns of x. 1064 01:00:45,532 --> 01:00:51,000 So if I project x, or if I look of x minus its projection, 1065 01:00:51,000 --> 01:00:55,730 I'm basically projecting onto two orthogonal spaces. 1066 01:00:55,730 --> 01:00:59,520 What I'm trying to say here is that this here 1067 01:00:59,520 --> 01:01:01,301 is another projection matrix p prime. 1068 01:01:04,460 --> 01:01:10,310 That is just the projection matrix onto the orthogonal-- 1069 01:01:10,310 --> 01:01:29,560 projection onto orthogonal of column span of x. 1070 01:01:29,560 --> 01:01:31,180 Orthogonal means the set of vectors 1071 01:01:31,180 --> 01:01:34,329 that's orthogonal to everyone in this linear space. 1072 01:01:37,050 --> 01:01:40,080 So now, when I'm doing this, this is exactly what-- 1073 01:01:40,080 --> 01:01:42,600 I mean, in a way, this is illustrating this Pythagoras 1074 01:01:42,600 --> 01:01:43,610 theorem. 1075 01:01:43,610 --> 01:01:47,190 And so when I want to compute the norm of this guy, the norm 1076 01:01:47,190 --> 01:01:49,560 squared of this guy, I'm really computing-- 1077 01:01:49,560 --> 01:01:52,810 if this is my y now, this is px of y, 1078 01:01:52,810 --> 01:01:55,738 I'm really controlling the norm squared of this thing. 1079 01:02:06,720 --> 01:02:08,850 So if I want to compute the norm squared-- 1080 01:02:42,540 --> 01:02:48,020 so I'm almost there. 1081 01:02:48,020 --> 01:02:52,840 So what am I projecting here onto the orthogonal projector? 1082 01:02:52,840 --> 01:02:55,340 So here, y, now, I know that y is 1083 01:02:55,340 --> 01:03:00,480 equal to x beta plus epsilon. 1084 01:03:00,480 --> 01:03:06,480 So when I look at this matrix p prime times y, 1085 01:03:06,480 --> 01:03:11,105 It's actually p prime times x beta plus p prime times 1086 01:03:11,105 --> 01:03:11,604 epsilon. 1087 01:03:14,380 --> 01:03:18,400 What's happening to p prime times x beta? 1088 01:03:18,400 --> 01:03:19,525 Let's look at this picture. 1089 01:03:23,400 --> 01:03:26,610 So we know that p prime takes any point here and projects it 1090 01:03:26,610 --> 01:03:29,350 orthogonally on this guy. 1091 01:03:29,350 --> 01:03:33,960 But x beta is actually a point that lives here. 1092 01:03:33,960 --> 01:03:36,790 It's something that's on the linear span. 1093 01:03:36,790 --> 01:03:39,660 So where do all the points that are on this line 1094 01:03:39,660 --> 01:03:43,035 get projected to? 1095 01:03:43,035 --> 01:03:43,970 AUDIENCE: The origin. 1096 01:03:43,970 --> 01:03:45,920 PHILIPPE RIGOLLET: The origin, to 0. 1097 01:03:45,920 --> 01:03:47,750 They all get projected to 0. 1098 01:03:47,750 --> 01:03:50,120 And that's because I'm basically projecting 1099 01:03:50,120 --> 01:03:54,872 something that's on the column span of x onto its orthogonal. 1100 01:03:54,872 --> 01:03:56,580 So that's always 0 that I'm getting here. 1101 01:04:02,410 --> 01:04:04,410 So when I apply p prime to y, I'm 1102 01:04:04,410 --> 01:04:08,610 really just applying p prime to epsilon. 
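[To make the projection argument concrete, here is a small check, on an arbitrary random design, that P' = I - X (X^T X)^{-1} X^T is symmetric and idempotent, that it sends anything in the column span of X to 0 (which is exactly why P' y reduces to P' epsilon), and that its trace is n - p. The design and dimensions below are illustrative.]

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 3                                     # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)

P = X @ np.linalg.inv(X.T @ X) @ X.T             # projection onto the column span of X
P_prime = np.eye(n) - P                          # projection onto its orthogonal complement

print(np.allclose(P_prime, P_prime.T))           # symmetric
print(np.allclose(P_prime @ P_prime, P_prime))   # idempotent: projecting twice = once
print(np.allclose(P_prime @ (X @ beta), 0))      # the column span of X is sent to 0
print(np.round(np.trace(P_prime), 6), n - p)     # trace = n - p
```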
1103 01:04:08,610 --> 01:04:10,590 So I know that now, this, actually, 1104 01:04:10,590 --> 01:04:18,480 is equal to the norm of some multivariate Gaussian. 1105 01:04:18,480 --> 01:04:20,092 What is the size of this Gaussian? 1106 01:04:22,980 --> 01:04:24,570 What is the size of this matrix? 1107 01:04:24,570 --> 01:04:25,820 Well, I actually had it there. 1108 01:04:25,820 --> 01:04:28,440 It's i n, so it's n dimensional. 1109 01:04:28,440 --> 01:04:31,236 So it's some n dimensional with mean 0. 1110 01:04:31,236 --> 01:04:32,610 And what is the covariance matrix 1111 01:04:32,610 --> 01:04:34,179 of p prime times epsilon? 1112 01:04:39,109 --> 01:04:40,588 AUDIENCE: p p transpose. 1113 01:04:40,588 --> 01:04:43,880 PHILIPPE RIGOLLET: Yeah, p prime p prime transpose, 1114 01:04:43,880 --> 01:04:48,500 which we just said p prime transpose is p, 1115 01:04:48,500 --> 01:04:49,610 so that's p squared. 1116 01:04:49,610 --> 01:04:51,740 And we see that when we project twice, 1117 01:04:51,740 --> 01:04:54,540 it's as if we projected only once. 1118 01:04:54,540 --> 01:05:00,090 So here, this is n0 p prime p prime transpose. 1119 01:05:00,090 --> 01:05:05,150 That's the formula for the covariance matrix. 1120 01:05:05,150 --> 01:05:09,990 But this guy is actually equal to p prime times p prime, 1121 01:05:09,990 --> 01:05:13,580 which is equal to p prime. 1122 01:05:13,580 --> 01:05:18,380 So now, what I'm looking for is the norm squared of the trace. 1123 01:05:18,380 --> 01:05:20,050 So that means that this whole thing here 1124 01:05:20,050 --> 01:05:22,270 is actually equal to the trace. 1125 01:05:22,270 --> 01:05:24,730 Oh, did I forget again a sigma squared? 1126 01:05:24,730 --> 01:05:28,160 Yeah, I forgot it only here, which is good news. 1127 01:05:28,160 --> 01:05:32,665 So I should assume that sigma squared is equal to 1. 1128 01:05:32,665 --> 01:05:34,270 So sigma squared's here. 1129 01:05:34,270 --> 01:05:36,430 And then what I'm left with is sigma squared 1130 01:05:36,430 --> 01:05:39,920 times the trace of p prime. 1131 01:05:45,780 --> 01:05:51,240 At some point, I mentioned that the eigenvalues of a projection 1132 01:05:51,240 --> 01:05:54,210 matrix were actually 0 or 1. 1133 01:05:54,210 --> 01:05:56,689 The trace is the sum of the eigenvalues. 1134 01:05:56,689 --> 01:05:58,230 So that means that the trace is going 1135 01:05:58,230 --> 01:06:03,720 to be an integer number as the number of non-0 eigenvalues. 1136 01:06:03,720 --> 01:06:05,170 And the non-0 eigenvalues are just 1137 01:06:05,170 --> 01:06:07,776 the dimension of the space onto which I'm projecting. 1138 01:06:10,490 --> 01:06:15,200 Now, I'm projecting from something of dimension n 1139 01:06:15,200 --> 01:06:19,520 onto the orthogonal of a space of dimension p. 1140 01:06:19,520 --> 01:06:21,860 What is the dimension of the orthogonal 1141 01:06:21,860 --> 01:06:23,720 of a space of dimension p when thought 1142 01:06:23,720 --> 01:06:26,546 of space in dimension n? 1143 01:06:26,546 --> 01:06:27,296 AUDIENCE: [? 1. ?] 1144 01:06:27,296 --> 01:06:28,765 PHILIPPE RIGOLLET: N minus p-- 1145 01:06:28,765 --> 01:06:32,980 that's the so-called rank theorem, I guess, as a name. 1146 01:06:32,980 --> 01:06:35,710 And so that's how I get this n minus p here. 1147 01:06:35,710 --> 01:06:40,071 This is really just equal to n minus p. 1148 01:06:40,071 --> 01:06:40,570 Yeah? 1149 01:06:40,570 --> 01:06:43,319 AUDIENCE: Here, we're taking the expectation of the whole thing. 
1150 01:06:43,319 --> 01:06:44,860 PHILIPPE RIGOLLET: Yes, you're right. 1151 01:06:44,860 --> 01:06:48,780 So that's actually the expectation 1152 01:06:48,780 --> 01:06:50,410 of this thing that's equal to that. 1153 01:06:50,410 --> 01:06:53,020 Absolutely. 1154 01:06:53,020 --> 01:06:55,150 But I actually have much better. 1155 01:06:55,150 --> 01:06:57,412 I know, even, that the norm that I'm looking at, 1156 01:06:57,412 --> 01:06:58,870 I know it's going to be this thing. 1157 01:06:58,870 --> 01:07:00,911 What is going to be the distribution of this guy? 1158 01:07:03,860 --> 01:07:06,830 Norm squared of a Gaussian, chi squared. 1159 01:07:06,830 --> 01:07:09,150 So there's going to be some chi squared that shows up. 1160 01:07:09,150 --> 01:07:10,650 And the number of degrees of freedom 1161 01:07:10,650 --> 01:07:12,940 is actually going to be also n minus p. 1162 01:07:12,940 --> 01:07:16,510 And maybe it's actually somewhere-- 1163 01:07:16,510 --> 01:07:20,560 yeah, right here-- n minus p times sigma hat 1164 01:07:20,560 --> 01:07:22,690 squared over sigma squared. 1165 01:07:22,690 --> 01:07:24,675 This is my sigma hat squared. 1166 01:07:24,675 --> 01:07:28,200 If I multiply n minus p, I'm left only with this thing, 1167 01:07:28,200 --> 01:07:31,136 and so that means that I get sigma squared times-- 1168 01:07:31,136 --> 01:07:33,010 because I always forget my sigma squared-- 1169 01:07:33,010 --> 01:07:34,870 I get sigma squared times this thing. 1170 01:07:34,870 --> 01:07:37,270 And it turns out that the square norm of this guy 1171 01:07:37,270 --> 01:07:39,412 is actually exactly chi squared with n minus p 1172 01:07:39,412 --> 01:07:40,226 degrees of freedom. 1173 01:07:43,370 --> 01:07:47,900 So in particular, so we know that the expectation 1174 01:07:47,900 --> 01:07:50,556 of this thing is equal to sigma squared times n minus p. 1175 01:07:50,556 --> 01:07:53,342 So if I divide both sides by n minus p, 1176 01:07:53,342 --> 01:07:55,550 I'm going to have that something whose expectation is 1177 01:07:55,550 --> 01:07:57,140 sigma squared. 1178 01:07:57,140 --> 01:07:59,140 And this something, I can actually compute. 1179 01:07:59,140 --> 01:08:02,090 It depends on y, and x that I know, 1180 01:08:02,090 --> 01:08:04,100 and beta hat that I've just estimated. 1181 01:08:04,100 --> 01:08:05,000 I know what n is. 1182 01:08:05,000 --> 01:08:07,520 And n and p are the dimensions of my matrix x. 1183 01:08:07,520 --> 01:08:11,120 So I'm actually given an estimator whose expectation 1184 01:08:11,120 --> 01:08:13,330 is sigma squared. 1185 01:08:13,330 --> 01:08:15,880 And so now, I actually have an unbiased estimator 1186 01:08:15,880 --> 01:08:17,430 of sigma squared. 1187 01:08:17,430 --> 01:08:19,269 That's this guy right here. 1188 01:08:19,269 --> 01:08:20,560 And it's actually super useful. 1189 01:08:23,470 --> 01:08:25,270 So those are called the-- 1190 01:08:25,270 --> 01:08:27,950 this is the normalized sum of square residuals. 1191 01:08:27,950 --> 01:08:29,340 These are called the residuals. 1192 01:08:29,340 --> 01:08:32,410 Those are whatever is residual when 1193 01:08:32,410 --> 01:08:36,580 I project my points onto the line that I've estimated. 1194 01:08:36,580 --> 01:08:40,870 And so in a way, those guys-- if you go back to this picture, 1195 01:08:40,870 --> 01:08:47,109 this was yi and this was xi transpose beta hat.
1196 01:08:47,109 --> 01:08:49,540 So if beta hat is close to beta, the difference 1197 01:08:49,540 --> 01:08:52,810 between yi and xi transpose beta should 1198 01:08:52,810 --> 01:08:55,870 be close to my epsilon i. 1199 01:08:55,870 --> 01:08:57,430 It's some sort of epsilon i hat. 1200 01:09:00,319 --> 01:09:02,590 Agreed? 1201 01:09:02,590 --> 01:09:04,960 And so that means that if I think 1202 01:09:04,960 --> 01:09:07,510 of those as being epsilon i hat, they 1203 01:09:07,510 --> 01:09:09,910 should be close to epsilon i, and so their norm 1204 01:09:09,910 --> 01:09:14,390 should be giving me something that looks like sigma squared. 1205 01:09:14,390 --> 01:09:16,359 And so that's why it actually makes sense. 1206 01:09:16,359 --> 01:09:18,790 It's just magical that everything works out together, 1207 01:09:18,790 --> 01:09:21,130 because I'm not projecting on the right line, 1208 01:09:21,130 --> 01:09:23,229 I'm actually projecting on the wrong line. 1209 01:09:23,229 --> 01:09:27,310 But in the end, things actually work out pretty well. 1210 01:09:27,310 --> 01:09:28,990 There's one thing-- so here, the theorem 1211 01:09:28,990 --> 01:09:31,779 is that this thing not only has the right expectation, 1212 01:09:31,779 --> 01:09:33,450 but also has a chi squared distribution. 1213 01:09:33,450 --> 01:09:34,700 That's what we just discussed. 1214 01:09:34,700 --> 01:09:36,250 So here, I'm just telling you this. 1215 01:09:36,250 --> 01:09:37,899 But it's not too hard to believe, 1216 01:09:37,899 --> 01:09:40,300 because it's actually the norm of some vector. 1217 01:09:40,300 --> 01:09:42,279 You could make this obvious, but again, I 1218 01:09:42,279 --> 01:09:44,800 didn't want to bring in too much linear algebra. 1219 01:09:44,800 --> 01:09:46,359 So to prove this, you actually have 1220 01:09:46,359 --> 01:09:48,899 to diagonalize the matrix p. 1221 01:09:48,899 --> 01:09:53,890 So you have to invoke the eigenvalue decomposition 1222 01:09:53,890 --> 01:09:56,600 and the fact that the norm is invariant by rotation. 1223 01:09:56,600 --> 01:09:59,440 So for those who are familiar with, what I can do 1224 01:09:59,440 --> 01:10:01,780 is just look at the decomposition of p 1225 01:10:01,780 --> 01:10:08,200 prime into ud u transpose where this is an orthogonal matrix, 1226 01:10:08,200 --> 01:10:10,630 and this is a diagonal matrix of eigenvalues. 1227 01:10:10,630 --> 01:10:13,312 And when I look at the norm squared of this thing, 1228 01:10:13,312 --> 01:10:14,770 I mean, I have, basically, the norm 1229 01:10:14,770 --> 01:10:20,200 squared of p prime times some epsilon. 1230 01:10:20,200 --> 01:10:26,300 It's the norm of ud u transpose epsilon squared. 1231 01:10:26,300 --> 01:10:28,550 The norm of a rotation of a vector 1232 01:10:28,550 --> 01:10:32,280 is the same as the norm of the vector, so this guy goes away. 1233 01:10:32,280 --> 01:10:34,140 This is not actually-- 1234 01:10:34,140 --> 01:10:36,140 I mean, you don't have to care about this if you 1235 01:10:36,140 --> 01:10:37,880 don't understand what I'm saying, so don't freak out. 1236 01:10:37,880 --> 01:10:39,810 This is really for those who follow. 1237 01:10:39,810 --> 01:10:42,211 What is the distribution of u transpose epsilon? 1238 01:10:45,899 --> 01:10:50,310 I take a Gaussian vector that has covariance matrix sigma 1239 01:10:50,310 --> 01:10:52,560 squared times the [? identity, ?] and I basically 1240 01:10:52,560 --> 01:10:54,100 rotate it. 
1241 01:10:54,100 --> 01:10:57,965 What is its distribution? 1242 01:10:57,965 --> 01:10:58,465 Yeah? 1243 01:10:58,465 --> 01:10:59,440 AUDIENCE: The same. 1244 01:10:59,440 --> 01:11:00,830 PHILIPPE RIGOLLET: It's the same. 1245 01:11:00,830 --> 01:11:02,950 It's completely invariant, because the Gaussian 1246 01:11:02,950 --> 01:11:04,700 thinks of all directions as being the same. 1247 01:11:04,700 --> 01:11:07,550 So it doesn't really matter if I take a Gaussian or a rotated 1248 01:11:07,550 --> 01:11:08,600 Gaussian. 1249 01:11:08,600 --> 01:11:10,190 So this is also a Gaussian, so I'm 1250 01:11:10,190 --> 01:11:11,800 going to call it epsilon prime. 1251 01:11:11,800 --> 01:11:15,110 And I am left with just the norm of epsilon prime. 1252 01:11:15,110 --> 01:11:23,730 So this is the sum of the dj's squared times epsilon 1253 01:11:23,730 --> 01:11:24,250 j squared. 1254 01:11:27,030 --> 01:11:30,060 And we just said that the eigenvalues of p 1255 01:11:30,060 --> 01:11:33,780 are either 0 or 1, because it's a projector. 1256 01:11:33,780 --> 01:11:36,090 And so here, I'm going to get only 0's and 1's. 1257 01:11:36,090 --> 01:11:39,300 So I'm really just summing a certain number 1258 01:11:39,300 --> 01:11:42,050 of epsilon i squared. 1259 01:11:42,050 --> 01:11:45,110 So squares of standard Gaussians-- 1260 01:11:45,110 --> 01:11:48,210 sorry, with a sigma squared somewhere. 1261 01:11:48,210 --> 01:11:50,850 And basically, how many am I summing? 1262 01:11:50,850 --> 01:11:55,530 Well, the n minus p, the number of non-0 eigenvalues 1263 01:11:55,530 --> 01:11:57,190 of p prime. 1264 01:11:57,190 --> 01:12:00,490 So that's how it shows up. 1265 01:12:00,490 --> 01:12:05,820 When you see this, what theorem am I using here? 1266 01:12:05,820 --> 01:12:06,650 Cochran's theorem. 1267 01:12:06,650 --> 01:12:07,650 This is this magic book. 1268 01:12:07,650 --> 01:12:09,420 I'm actually going to dump everything that I'm not going 1269 01:12:09,420 --> 01:12:11,160 to prove to you and say, oh, this is actually Cochran's. 1270 01:12:11,160 --> 01:12:12,870 No, Cochran's theorem is really just 1271 01:12:12,870 --> 01:12:15,870 telling me something about orthogonality of things, 1272 01:12:15,870 --> 01:12:17,712 and therefore, independence of things. 1273 01:12:17,712 --> 01:12:19,170 And Cochran's theorem was something 1274 01:12:19,170 --> 01:12:23,271 that I used when I wanted to use what? 1275 01:12:23,271 --> 01:12:27,280 That's something I used just one slide before. 1276 01:12:27,280 --> 01:12:28,887 Student t-test, right? 1277 01:12:28,887 --> 01:12:30,970 I used Cochran's theorem to see that the numerator 1278 01:12:30,970 --> 01:12:33,610 and the denominator of the student statistic 1279 01:12:33,610 --> 01:12:35,414 were independent of each other. 1280 01:12:35,414 --> 01:12:37,330 And this is exactly what I'm going to do here. 1281 01:12:40,170 --> 01:12:42,430 I'm going to actually write a test to test, maybe, 1282 01:12:42,430 --> 01:12:44,430 if the beta j's are equal to 0. 1283 01:12:44,430 --> 01:12:49,110 I'm going to form a numerator, which is beta hat minus beta. 1284 01:12:49,110 --> 01:12:50,310 This is normal. 1285 01:12:50,310 --> 01:12:53,287 And we know that beta hat has a Gaussian distribution. 1286 01:12:53,287 --> 01:12:54,870 I'm going to standardize by something 1287 01:12:54,870 --> 01:12:55,720 that makes sense to me. 1288 01:12:55,720 --> 01:12:56,940 And I'm not going to go into details, 1289 01:12:56,940 --> 01:12:58,200 because we're out of time.
1290 01:12:58,200 --> 01:12:59,866 But there's the sigma hat that shows up. 1291 01:12:59,866 --> 01:13:03,240 And then there's a gamma j, which takes into account 1292 01:13:03,240 --> 01:13:06,450 the fact that my x's-- 1293 01:13:06,450 --> 01:13:12,465 if I look at the distribution of beta, which is gone, I think-- 1294 01:13:12,465 --> 01:13:14,220 yeah, beta is gone. 1295 01:13:14,220 --> 01:13:16,020 Oh, yeah, that's where it is. 1296 01:13:16,020 --> 01:13:20,040 The covariance matrix depends on this matrix x transpose x. 1297 01:13:20,040 --> 01:13:22,110 So this will show up in the variance. 1298 01:13:22,110 --> 01:13:25,000 In particular, diagonal elements are going to play a role here. 1299 01:13:25,000 --> 01:13:26,850 And so that's what my gammas are. 1300 01:13:26,850 --> 01:13:30,880 The gamma j is the jth diagonal element of this matrix. 1301 01:13:30,880 --> 01:13:35,010 So we'll resume that on Tuesday, so 1302 01:13:35,010 --> 01:13:38,476 don't worry too much if this is going too fast. 1303 01:13:38,476 --> 01:13:40,350 I'm not supposed to cover it, but just so you 1304 01:13:40,350 --> 01:13:45,300 get a hint of why Cochran's theorem actually was useful. 1305 01:13:45,300 --> 01:13:51,690 So I don't know if we actually ended up recording. 1306 01:13:51,690 --> 01:13:53,410 I have your homework. 1307 01:13:53,410 --> 01:13:56,500 And as usual, I will give it to you outside.
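[As a closing sketch tying the last results of the lecture together (all numerical values are illustrative assumptions, not from the lecture): the snippet below forms sigma hat squared = ||y - X beta hat||^2 / (n - p) over many simulated data sets, checks that its average is close to sigma^2, and checks that (n - p) sigma hat squared / sigma^2 has the mean and variance of a chi squared with n - p degrees of freedom.]

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 30, 5, 1.5                         # illustrative choices only
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20000
sigma2_hat = np.empty(reps)
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = XtX_inv @ (X.T @ y)
    resid = y - X @ beta_hat                     # the residuals
    sigma2_hat[r] = resid @ resid / (n - p)      # unbiased estimator of sigma^2

print("mean of sigma^2 hat:", sigma2_hat.mean(), "vs sigma^2 =", sigma**2)

scaled = (n - p) * sigma2_hat / sigma**2         # should look like chi^2 with n - p dof
print("mean:", scaled.mean(), "(expect", n - p, ")")
print("variance:", scaled.var(), "(expect", 2 * (n - p), ")")
```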