The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So yes, before we start: this chapter will not be part of the midterm. Everything else will be, so all the way up to goodness of fit tests. And there will be some practice exams posted in the recitation section of the course that you will be working on. So the recitation tomorrow will be a review session for the midterm. I'll send an announcement by email.

So going back to our estimator: we studied the least squares estimator in the case where we had Gaussian observations. So we had something that looked like this: y = Xβ + ε. This was an equation in Rⁿ, for n observations. And then we wrote the least squares estimator β̂.

And for our purposes from here on, you see that you have this normal distribution, this p-variate Gaussian distribution. That means that, at some point, we made the assumption that ε was the n-dimensional Gaussian N(0, σ²Iₙ), with mean 0 and covariance σ² times the identity of Rⁿ -- which I kept forgetting about last time. I will try not to do that this time.

And so from this, we derived a bunch of properties of this least squares estimator β̂. And in particular, the key thing that everything was built on was that we could write β̂ as the true unknown β plus some multivariate Gaussian that was centered but had a weird covariance structure. So that was definitely p-dimensional, and its covariance was σ²(XᵀX)⁻¹. And the way we derived that was by having at least one cancellation between XᵀX and (XᵀX)⁻¹.
PHILIPPE RIGOLLET: So this is the basis for inference in linear regression. So in a way, that's correct, because what happened is that we used the fact that Xβ̂ -- once we have this β̂ -- is really just the projection of y onto the linear span of the columns of X, the column span of X. And so in particular, those things -- y minus Xβ̂ -- are called residuals. So that's the vector of residuals. What's the dimension of this vector?

AUDIENCE: n by 1.

PHILIPPE RIGOLLET: n by 1. So those things, we can write as ε̂. That's an estimate for this ε, because we just put a hat on β. And from this one, we could actually build an unbiased estimator σ̂² of σ², and that was this guy. And we showed that, indeed, the right normalization for this was n − p, because ‖y − Xβ̂‖², up to the scaling by σ², is actually a chi-squared with n − p degrees of freedom. So that's what we came up with.

And something I told you, which follows from Cochran's theorem -- we did not go into details about this. But essentially, one of them corresponds to the projection onto the linear span of the columns of X, and the other one corresponds to the projection onto the orthogonal complement of this guy, and we're in a Gaussian case -- and things that are orthogonal are actually independent in the Gaussian case. So from a geometric point of view, you can sort of understand everything. You think of your subspace, the linear span of the x's; sometimes you project onto this guy, sometimes you project onto its orthogonal complement. β̂ corresponds to the projection onto the linear span. ε̂ corresponds to the projection onto the orthogonal complement. And those things turn out to be independent, and that's why β̂ is independent of σ̂². So it's really just a statement about two linear spaces being orthogonal to each other.

So we left off on this slide last time.
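As a concrete companion to what's on the board, here is a minimal NumPy sketch of the least squares estimator, the residuals, and the unbiased variance estimator. The numbers (n, p, the true β, σ) are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical example: n observations, p columns, first column all ones.
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    beta_true = np.array([1.0, 2.0, 0.0])
    sigma = 0.5
    y = X @ beta_true + sigma * rng.normal(size=n)

    # Least squares estimator: beta_hat = (X^T X)^{-1} X^T y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Residuals eps_hat = y - X beta_hat, and the unbiased estimator
    # sigma_hat^2 = ||y - X beta_hat||^2 / (n - p).
    eps_hat = y - X @ beta_hat
    sigma2_hat = eps_hat @ eps_hat / (n - p)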
PHILIPPE RIGOLLET: And what I claim is that this thing here is actually -- oh, yeah -- the other thing we want to use. So that's good for β̂. But since we don't know what σ² is -- if we knew what σ² is, that would totally be enough for us. But we also need this extra thing: that (n − p)σ̂²/σ² follows a chi-squared with n − p degrees of freedom. And σ̂² is independent of β̂. So that's going to be something we need. So that's useful if σ² is unknown. And again, sometimes it might be known, if you're using some sort of measurement device for which it's written on the side of the box.

So from these two things, we're going to be able to do inference. And inference -- we said there are three pillars to inference. The first one is estimation, and we've been doing that so far. We've constructed this least squares estimator, which happens to be the maximum likelihood estimator in the Gaussian case. The two other things we do in inference are confidence intervals and tests. And we can do confidence intervals -- we're not going to do much, because we're going to talk about their cousin, which are tests. And that's really where the statistical inference comes in.

And here, we're going to be interested in a very specific kind of test for linear regression. And those are tests of the form β_j = 0 -- so the j-th coefficient of β is equal to 0, and that's going to be our null hypothesis -- versus H1, where β_j is, say, not equal to 0. And for the purposes of regression, unless you have lots of domain-specific knowledge, it won't be β_j positive or β_j negative. It's really nonzero that's interesting to you.

So why would I want to do this test? Well, if I expand this thing where I have y = Xβ + ε -- so what happens if I look, for example, at the first coordinates?
So I have that y is actually -- so say, y_i is equal to... Well, that's actually complicated. Let me write it like this:

y_i = β_0 + β_1 x_{i,1} + ... + β_{p−1} x_{i,p−1} + ε_i.

And that's true for all i's. So this first term is β_0 times 1 -- that was our first coordinate. So that's just expanding this -- going back to the scalar form rather than the matrix-vector form. That's what we're doing. When I write y = Xβ + ε, I assume that each of my y's can be represented as a linear combination of the x's, the first one being 1, plus some ε_i. Everybody agrees with this? What does it mean for β_j to be equal to 0? Yeah?

AUDIENCE: That x_j's not important.

PHILIPPE RIGOLLET: Yeah, that x_j doesn't even show up in this thing. So if β_j is equal to 0, that means that, essentially, we can remove the j-th coordinate, x_j, from all observations.

So for example, I'm a banker, and I'm trying to predict some score -- let's call it y -- without the noise. So I'm trying to predict what is going to be your score. And that's something that should be telling me how likely you are to reimburse your loan on time, or do you have late payments. Or actually, maybe these days bankers are looking at how much in late fees they will be collecting from you. Maybe that's what they are more after, rather than making sure that you reimburse everything. So they're trying to maximize this number of late fees. And they collect a bunch of things about you -- definitely your credit score, but maybe your zip code, profession, years of education, family status, a bunch of things. One might be your shoe size. And they want to know -- maybe shoe size is actually a good explanation for how much in fees they're going to be collecting from you.
PHILIPPE RIGOLLET: But as you can imagine, this would be a controversial thing to bring in, and people might want to test whether including shoe size is a good idea. And so they would just look at the j corresponding to shoe size and test whether shoe size should appear or not in this formula. And that's essentially the kind of thing that people are going to do.

Now, if I do genomics and I'm trying to predict the size, the girth, of a pumpkin for a competition based on some available genomic data, then I can test whether gene j, which is called -- I don't know -- pea snap 24 -- they always have these crazy names -- appears or not in this formula. Is the gene pea snap 24 going to be important or not for the size of the final pumpkin?

So those are definitely the important things. And definitely, we want to put β_j not equal to 0 as the alternative, because that's where scientific discovery shows up.

And so to do that -- well, we're in a Gaussian setup, so we know that even if we don't know what σ is, we can actually call for a t-test. So how did we build the t-test in general? Well, before, what we had was something that looked like θ̂ = θ + N(0, σ²/n) -- a Gaussian with something that depended on n, something like this, σ² over n. So that's what it looked like. Now what we have is that β̂ = β + N(0, σ²(XᵀX)⁻¹) -- some Gaussian again, but this time it's p-variate. So it's actually very similar, except that the matrix (XᵀX)⁻¹ is now replacing just this number, 1/n, but it's playing the same role.

So in particular, this implies that for every j from 1 to p -- what is the distribution of β̂_j? Well, β̂_j is actually equal to -- so all I have to do -- this is a system of p equations, and all I have to do is read off the j-th row. So it's telling me here, I'm going to read β̂_j. Here, I'm going to read β_j.
PHILIPPE RIGOLLET: And here, I need to read off: what is the distribution of the j-th coordinate of this guy? So this is a Gaussian vector, so we need to understand what its distribution is.

So how do I do this? Well, the observation that's actually useful for this -- maybe I shouldn't use the word observation in a stats class, so let's call it a claim. The interesting claim is that if I have a vector -- let's call it v -- then v_j = vᵀe_j, where e_j is the vector with 0, 0, 0, and then a 1 on the j-th coordinate, and then 0 elsewhere. That's the j-th coordinate. So that's the j-th vector of the canonical basis of Rᵖ.

So now that I have this form, I can see that, essentially, β̂_j is just e_jᵀ times this N(0, σ²(XᵀX)⁻¹). And now, I know what the distribution of the inner product between a Gaussian and a deterministic vector is. What is it? It's a Gaussian. So all I have to check is: e_jᵀ times N(0, σ²(XᵀX)⁻¹) -- well, what is this equal to in distribution? Well, this is going to be a one-dimensional thing. An inner product is just a real number. So it's going to be some Gaussian. The mean is going to be 0 in inner product with e_j, which is 0. What is the variance of this guy?

We actually used this before, except that e_j was not a vector, but a matrix. So the rule is that vᵀN(μ, Σ) is some N(vᵀμ, vᵀΣv). That's the rule for Gaussian vectors. It's just a property of Gaussian vectors.

So what do we have here? Well, e_j plays the role of v, and σ²(XᵀX)⁻¹ plays the role of Σ. So here, I'm left with e_jᵀ -- let me pull out the σ² here. But this thing is: what happens if I take a matrix, premultiply it by this vector e_j, and postmultiply it by this vector e_j?
PHILIPPE RIGOLLET: I'm claiming that this corresponds to only one single element of this matrix. Which one is it?

AUDIENCE: j.

PHILIPPE RIGOLLET: The j-th diagonal element. So this thing here is nothing but (XᵀX)⁻¹, and then we take its j-th diagonal element, the jj entry. Now, I cannot go any further. (XᵀX)⁻¹ can be a complicated matrix, and I do not know how to express its j-th diagonal element much better than this. Well -- no, actually, I don't. It involves basically all the coefficients. Yeah?

AUDIENCE: [INAUDIBLE] second e_j come from? So I get why e_j transpose [INAUDIBLE]. Where did the--

PHILIPPE RIGOLLET: From this rule? So you always pre- and postmultiply when you talk about the covariance, because if you did not, it would be a vector and not a scalar, for one. But in general, think of v as a matrix. It's still true even if v is a matrix that's compatible with premultiplying by some Gaussian.

Any other question? Yeah?

AUDIENCE: When you say claim, a vector v -- what is vector v?

PHILIPPE RIGOLLET: So for any vector v--

AUDIENCE: OK.

PHILIPPE RIGOLLET: Any other question? So now we've identified that the j-th coefficient of this Gaussian, which I can represent from the claim as e_jᵀ times this guy, is also a Gaussian that's centered. And its variance, now, is σ² times the j-th diagonal element of (XᵀX)⁻¹. So the conclusion is that β̂_j = β_j + N(0, σ²[(XᵀX)⁻¹]_jj). And I'm going to emphasize the fact that now it's one-dimensional, with mean 0 and variance σ² times the jj entry of (XᵀX)⁻¹.

Now, if you look at the last line of the second board and the first line on the first board, those are basically the same thing. β̂_j is my θ̂. β_j is my θ.
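A quick numerical check of this claim, continuing the hypothetical sketch from above: e_jᵀ M e_j picks out exactly the j-th diagonal entry of M, and γ_j = [(XᵀX)⁻¹]_jj is the variance factor of β̂_j.

    # Continuing the sketch above: e_j^T M e_j is the j-th diagonal entry of M,
    # so beta_hat[j] ~ N(beta[j], sigma^2 * gamma_j) with gamma_j = M[j, j].
    M = np.linalg.inv(X.T @ X)

    j = 1
    e_j = np.zeros(p)
    e_j[j] = 1.0
    assert np.isclose(e_j @ M @ e_j, M[j, j])

    gamma_j = M[j, j]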
PHILIPPE RIGOLLET: And the variance σ²/n is now σ² times this jj entry. Now, the inverse suggests that it looks like the inverse of n. So we're going to want to think of those guys as being some sort of 1/n kind of statement.

So from this, the fact that those two things are the same leads us to believe that we are now equipped to perform the task that we're trying to do, because under the null hypothesis, β_j is known -- it's equal to 0 -- so I can remove it. And I have to deal with the σ². If σ² is known, then I can just perform a regular Gaussian test using Gaussian quantiles. And if σ² is unknown, I'm going to basically replace σ by σ̂, and then I'm going to get my t-test.

Actually, for the purposes of your exam, I really suggest that you understand every single word I'm going to be saying now, because this is exactly what you're expected to know from the rest of this course: right now, I'm just going to apply exactly the same technique that we did for single-parameter estimation.

So what we have now is that under H0, β_j is equal to 0. Therefore, β̂_j follows some N(0, σ²γ_j). Just like I do in the slide, I'm going to call this γ_j. So γ_j is this j-th diagonal element of (XᵀX)⁻¹.

So that implies that β̂_j over σ -- oh, was it a square root? Yeah -- β̂_j over σ times the square root of γ_j follows some N(0, 1). So I can form my test, which is to reject if the absolute value of β̂_j divided by σ times the square root of γ_j is larger than what? Can somebody tell me what I want this to be larger than to reject?

AUDIENCE: q alpha.

PHILIPPE RIGOLLET: q_α. Everybody agrees? Of what? Of this guy, where the standard notation is that this is the quantile. Everybody agrees?

AUDIENCE: It's alpha over 2, I think.
375 00:22:02,756 --> 00:22:03,537 I think alpha's-- 376 00:22:03,537 --> 00:22:04,870 PHILIPPE RIGOLLET: Alpha over 2. 377 00:22:04,870 --> 00:22:06,520 So not everybody should be agreeing. 378 00:22:06,520 --> 00:22:08,765 Thank you, you're the first one to disagree with yourself, 379 00:22:08,765 --> 00:22:09,723 which is probably good. 380 00:22:12,111 --> 00:22:14,110 It's alpha over 2 because of the absolute value. 381 00:22:14,110 --> 00:22:15,670 I want to just be away from this guy, 382 00:22:15,670 --> 00:22:17,110 and that's because I have-- 383 00:22:17,110 --> 00:22:19,140 so the alpha over 2-- 384 00:22:19,140 --> 00:22:27,650 the sanity check should be that h1 is beta j not equal to 0. 385 00:22:27,650 --> 00:22:35,010 So that works if sigma is known, because I need to know sigma 386 00:22:35,010 --> 00:22:37,380 to be able to build my test. 387 00:22:37,380 --> 00:22:39,960 So if sigma is unknown, well, I can tell you, use this test, 388 00:22:39,960 --> 00:22:41,550 but you're going to be like, OK, when 389 00:22:41,550 --> 00:22:44,310 I'm going to have to plug in some numbers, 390 00:22:44,310 --> 00:22:45,810 I'm going to be stuck. 391 00:22:49,240 --> 00:22:59,570 But if sigma is unknown, we have sigma hat 392 00:22:59,570 --> 00:23:03,400 squared as an estimator. 393 00:23:03,400 --> 00:23:06,850 So let me write sigma squared here. 394 00:23:06,850 --> 00:23:12,050 So in particular, beta hat divided 395 00:23:12,050 --> 00:23:18,220 by sigma hat squared times square root gamma j-- something 396 00:23:18,220 --> 00:23:19,169 I can compute. 397 00:23:19,169 --> 00:23:20,210 Sorry, that's beta hat j. 398 00:23:23,070 --> 00:23:24,576 I can compute that thing. 399 00:23:24,576 --> 00:23:25,490 Agreed? 400 00:23:25,490 --> 00:23:27,230 Now I have sigma hat j. 401 00:23:27,230 --> 00:23:28,980 What I need to do is to be able to compute 402 00:23:28,980 --> 00:23:32,625 the distribution of this thing. 403 00:23:32,625 --> 00:23:37,880 So I know the distribution of beta hat j over the square root 404 00:23:37,880 --> 00:23:38,410 of gamma j. 405 00:23:38,410 --> 00:23:40,479 That's some Gaussian 0, 1. 406 00:23:40,479 --> 00:23:42,770 I don't know exactly what the distribution of sigma hat 407 00:23:42,770 --> 00:23:46,660 j squared is, but what I know is that that was actually written, 408 00:23:46,660 --> 00:23:54,790 maybe, here is that n minus p sigma hat squared over sigma 409 00:23:54,790 --> 00:23:59,550 squared follows some chi squared with n minus p 410 00:23:59,550 --> 00:24:01,350 degrees of freedom, and that it's actually 411 00:24:01,350 --> 00:24:06,590 independent of beta hat j. 412 00:24:06,590 --> 00:24:08,220 It's independent of beta hat, so it's 413 00:24:08,220 --> 00:24:10,030 independent of each of its coordinates. 414 00:24:10,030 --> 00:24:13,680 That was part of your homework where you had to-- 415 00:24:13,680 --> 00:24:15,900 some of you were confused by the fact that-- 416 00:24:15,900 --> 00:24:18,199 I mean, if you're independent of some big thing, 417 00:24:18,199 --> 00:24:19,740 you're independent of all the smaller 418 00:24:19,740 --> 00:24:20,948 components of this big thing. 419 00:24:20,948 --> 00:24:24,080 That's basically what you need to know. 
PHILIPPE RIGOLLET: And so now I can just write this as -- this is β̂_j divided by -- so now I want to make this guy appear, so it's β̂_j, and then σ̂² over σ² -- the σ̂² over σ² that, times n − p, is my chi-squared -- divided by the square root of γ_j. So that's what I want to see. Yeah?

AUDIENCE: Why do you have to stick the hat in the denominator? Shouldn't it be sigma?

PHILIPPE RIGOLLET: Yeah, so I write this. I decide to write this. I could have put a Mickey Mouse here. It just wouldn't make sense. I just decided to take this thing.

AUDIENCE: OK.

PHILIPPE RIGOLLET: OK. So now -- so I take this guy, and now I'm going to rewrite it as something I want, because if you don't know what σ is -- sorry, that's not sigma -- you mean the square?

AUDIENCE: Yeah.

PHILIPPE RIGOLLET: Oh, thank you. Yes, that's correct. [LAUGHS] OK, so if you don't know what σ is, you replace it by σ̂. That's the most natural thing to do. You just now want to find out what the distribution of this guy is.

So this is not exactly what I had. To be able to get this, I need to divide by σ² -- sorry, I need to--

AUDIENCE: Square root.

PHILIPPE RIGOLLET: I'm sorry.

AUDIENCE: Do we need a square root of the sigma hat [INAUDIBLE]?

PHILIPPE RIGOLLET: That's correct now. And now I have that -- sorry, I should not write it like that. That's not what I want. What I want is this. And to be able to get this guy, what I need is σ over σ̂ -- square root. And then I need to make this thing show up. So I need to have this n − p show up in the denominator. So to be able to get it, I need to multiply the entire thing by the square root of n − p. So this is just a tautology. I just squeezed in what I wanted.
PHILIPPE RIGOLLET: But now this whole thing here is actually of the form β̂_j divided by σ square root of γ_j, and then divided by the square root of σ̂² over σ² -- no, I don't want to divide it by square root of n minus p, sorry. And now it's times n − p divided by n − p.

And what is the distribution of this thing here? So I'm going to keep going here. So the distribution of this thing here is what? Well, this numerator -- what is its distribution?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, N(0, 1). It's actually still written over there. So that's our N(0, 1). What is the distribution of this guy? Sorry, I don't think you have color again. So what is the distribution of this guy? This is still written on the board.

AUDIENCE: Chi-squared.

PHILIPPE RIGOLLET: It's the chi-squared that I have right here. So that's a chi-squared with n − p degrees of freedom, divided by n − p. The only thing I need to check is that those two guys are independent, which is also what I have from here.

And so that implies that β̂_j divided by σ̂ square root of γ_j -- what is the distribution of this guy?

[INTERPOSING VOICES]

PHILIPPE RIGOLLET: A t with n − p degrees of freedom. Was that crystal clear for everyone? Was that so simple that it was boring to everyone? OK, good. That's the point at which you should be.

So now that I have this, I can read the quantiles of this guy. So my test becomes -- well, my rejection region: I reject if the absolute value of this new guy exceeds the quantile of order α/2, but this time, of a t with n − p degrees of freedom. And now you can actually see that the only difference between this test and that test, apart from replacing σ by σ̂, is that now I've moved from the quantiles of a Gaussian to the quantiles of a t_{n−p}.
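And in the running sketch, the resulting t-test is (same hypothetical data, γ_j, and α as above):

    # t-test for H0: beta_j = 0 with sigma unknown:
    # beta_hat_j / (sigma_hat * sqrt(gamma_j)) ~ t_{n-p} under H0.
    se_j = np.sqrt(sigma2_hat * gamma_j)
    t_stat = beta_hat[j] / se_j

    q = stats.t.ppf(1 - alpha / 2, df=n - p)   # quantile of order alpha/2 of t_{n-p}
    reject = abs(t_stat) > q
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)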
PHILIPPE RIGOLLET: What's actually interesting, from this perspective, is that the t_{n−p}, we know, has heavier tails than the Gaussian, but if the number of degrees of freedom reaches, maybe, 30 or 40, they're virtually the same. And here, the number of degrees of freedom is not given only by n; it's n − p. So if I have more and more parameters to estimate, this will result in heavier and heavier tails, and that's just to account for the fact that it's harder and harder to estimate the variance when I have a lot of parameters. That's basically where it's coming from.

So now let's move on to -- well, I don't know what, because this is not working anymore. So this is the simplest test. And actually, if you run any statistical software for least squares, the output in any of them will look like this. You will have a sequence of rows. And you're going to have an estimate for β_0, an estimate for β_1, et cetera. Here, you're going to have a bunch of things. And on this row, you're going to have the value here -- so that's going to be what's estimated by least squares. And then immediately after, there's going to be the value of this thing -- so let's call it t. And then there's going to be the p-value corresponding to this t. This is something that's just routinely coming out.

Oh -- and then there's, of course, the last column, for people who cannot read numbers, that's really just giving you little stars. They're not stickers, but that's close to it. And that's just saying: well, if I have three stars, I'm very significantly different from 0. If I have two stars, I'm moderately different from 0. And if I have one star, it means -- well, just give me another $1,000 and I will sign that it's actually different from 0.

So that's basically the kind of output. Everybody sees what I mean by that?
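For concreteness, this is the kind of table you get from, say, the statsmodels package in Python (one package choice among many; R's lm gives the same table, stars included):

    import statsmodels.api as sm

    # Reusing the hypothetical X and y from the sketch above; statsmodels
    # expects the intercept column to already be in X, which it is here.
    fit = sm.OLS(y, X).fit()

    # One row per coefficient: the least squares estimate, its standard
    # error, the t statistic, and the p-value for H0: beta_j = 0.
    print(fit.summary())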
PHILIPPE RIGOLLET: So what I mean -- what I'm trying to emphasize here -- is that those things are so routine when you run linear regression, because people stuff in maybe -- even if you have 200 observations, you're going to stuff in maybe 20 variables, p equals 20. That's still a big number to interpret what's going on. And it's nice for you if you can actually trim some fat out.

And so the problem is that when you start doing this, and then this, and then this, and then this -- the probability that you make a mistake in your test, the probability that you erroneously reject the null here, is 5%. Here, it's 5%. Here, it's 5%. Here, it's 5%. And at some point, if things happen with 5% chances and you keep on doing them over and over again, they're going to start to happen.

So you can see that basically what's happening is that you actually have an issue: if you start repeating those tests, you might not be at a 5% error at some point.

And so what do you do to prevent that? If you want to test all those β_j's simultaneously, you have to do what's called the Bonferroni correction. And the Bonferroni correction follows from what's called a union bound. A union bound is actually -- so if you're a computer scientist, you're very familiar with it. If you're a mathematician, that's just, essentially, the third axiom of probability that you've seen: that the probability of the union is less than the sum of the probabilities. That's the union bound. And you, of course, can generalize that to more than two events. And that's exactly what you're doing here.

So let's see how we would want to perform the Bonferroni correction to control the probability that they're all equal to 0 at the same time.

So recall -- if I want to perform this test over there, where I want to test H0: β_j = 0 for all j in some subset S -- so think of S included in {1, ..., p}. You can think of it as being all of {1, ..., p} if you want. It really doesn't matter. S is something that's given to you.
603 00:34:53,960 --> 00:34:55,790 Maybe you want to test the subset of them, 604 00:34:55,790 --> 00:34:57,890 but maybe you want to test all of them. 605 00:34:57,890 --> 00:35:04,540 Versus h1, beta j is not equal to 0 for some j in s. 606 00:35:07,850 --> 00:35:10,610 That's a test that tests all these things at once. 607 00:35:10,610 --> 00:35:13,880 And if you actually look at this table all at once, 608 00:35:13,880 --> 00:35:16,820 implicitly, you're performing this test for all of the rows, 609 00:35:16,820 --> 00:35:19,262 for s equal 1 to p. 610 00:35:19,262 --> 00:35:19,970 You will do that. 611 00:35:19,970 --> 00:35:23,120 Whether you like it or not, you will. 612 00:35:23,120 --> 00:35:27,110 So now let's look at what the probability of type I error 613 00:35:27,110 --> 00:35:28,100 looks like. 614 00:35:28,100 --> 00:35:31,270 So I want the probability of type 1 error, 615 00:35:31,270 --> 00:35:35,370 so that's the probably when h0 is true. 616 00:35:35,370 --> 00:35:41,930 Well, so let me call psi j the indicator that, say, beta j 617 00:35:41,930 --> 00:35:51,330 hat over sigma hat square root gamma j exceeds 618 00:35:51,330 --> 00:35:54,636 q alpha over 2 of tn minus p. 619 00:35:54,636 --> 00:35:56,760 So we know that those are the tests that I perform. 620 00:35:56,760 --> 00:35:59,160 Here, I just add this extra index j 621 00:35:59,160 --> 00:36:02,400 to tell me that I'm actually testing the j-th coefficient. 622 00:36:02,400 --> 00:36:06,490 So what I want is the probability that under the null 623 00:36:06,490 --> 00:36:12,450 so that those are all equal to 0 that beta j's-- 624 00:36:12,450 --> 00:36:16,620 that I will reject to the alternative for one of them. 625 00:36:16,620 --> 00:36:25,510 So that's psi 1 is equal to 1 or psi 2 626 00:36:25,510 --> 00:36:29,120 is equal to 1, all the way to psi-- 627 00:36:29,120 --> 00:36:31,474 well, let's just say that this is the entire thing, 628 00:36:31,474 --> 00:36:32,390 because it's annoying. 629 00:36:36,247 --> 00:36:37,830 I mean, you can check the slide if you 630 00:36:37,830 --> 00:36:39,150 want to do it more generally. 631 00:36:39,150 --> 00:36:44,140 But psi p is equal to-- 632 00:36:44,140 --> 00:36:48,850 or, or-- everybody agrees that this is the probability 633 00:36:48,850 --> 00:36:51,940 of type I error? 634 00:36:51,940 --> 00:36:54,010 So either I reject this one, or this one, 635 00:36:54,010 --> 00:36:55,757 or this one, or this one, or this one. 636 00:36:55,757 --> 00:36:58,090 And that's exactly when I'm going to reject at least one 637 00:36:58,090 --> 00:36:59,580 of them. 638 00:36:59,580 --> 00:37:08,550 So this is the probability of type I error. 639 00:37:08,550 --> 00:37:12,380 And what I want is to keep this guy less than alpha. 640 00:37:15,780 --> 00:37:17,730 But what I know is to control the probability 641 00:37:17,730 --> 00:37:20,190 that this guy is less than alpha, that this guy is 642 00:37:20,190 --> 00:37:22,820 less than alpha, that this guy is less than alpha. 643 00:37:22,820 --> 00:37:26,260 In particular, if all these guys are disjoint, 644 00:37:26,260 --> 00:37:29,530 then this could really be the sum of all these probabilities. 645 00:37:29,530 --> 00:37:42,400 So in the worst case, if psi j equals 1 intersected with psi k 646 00:37:42,400 --> 00:37:46,540 equals 1 is the empty set, so that means 647 00:37:46,540 --> 00:37:47,960 those are called disjoint sets. 648 00:37:51,210 --> 00:37:53,970 You've seen this terminology in probability, right? 
PHILIPPE RIGOLLET: So if those sets are disjoint for all j different from k, then this probability -- well, let me write it as star -- then star is equal to the probability under H0 that ψ_1 is equal to 1, plus ... plus the probability under H0 that ψ_p is equal to 1. Now, if I use this test with this α here, then this probability is equal to α. This probability is also equal to α. So the probability of type I error is actually not equal to α. It's equal to?

AUDIENCE: p alpha.

PHILIPPE RIGOLLET: p times α. So what is the solution here? Well, it's to run those guys not with α, but with α/p. And if I do this, then this guy is equal to α/p, this guy is equal to α/p. And so when I add those things up, I get p times α/p, which is just α.

So all I do is, rather than running each of the tests with probability of error α, I run each test at level α/p. That's actually very stringent. If you think about it for one second: even if you have only 5 variables -- p equals 5 -- and you started out wanting to do your tests at 5%, it forces you to do the test at 1% for each of those variables. If you have 10 variables -- I mean, that starts to be very stringent. So it's going to be harder and harder for you to conclude in favor of the alternative.

Now, one thing I need to tell you is that here I said, if they are disjoint, then those probabilities are equal. But if they are not disjoint, the union bound tells me that the probability of the union is less than the sum of the probabilities. And so now I'm not exactly equal to α, but I'm bounded by α. And that's why people are not super comfortable with the Bonferroni correction: because, in reality, you never think that those tests are going to be giving you completely disjoint things. I mean, why would it be?
PHILIPPE RIGOLLET: Why would it be that if this guy is equal to 1, then all the other ones are equal to 0? Why would that make any sense? So this is definitely conservative, but the problem is that we don't know how to do much better. I mean, we have a formula that tells you the probability of the union as some crazy sum that looks at all the intersections and all these little things. I mean, it's the generalization of: P(A or B) is equal to P(A) plus P(B) minus the probability of the intersection. But if you start doing this for more than two events, it's super complicated. The number of terms grows really fast. But most importantly, even if you go there, you still need to control the probability of the intersections. And those tests are not necessarily independent. If they were independent, then that would be easy: the probability of the intersection would be the product of the probabilities. But those things are super correlated, and so it doesn't really help.

And so we'll see, when we talk about high-dimensional stats towards the end, that there's something called the false discovery rate, which is essentially saying: listen, if I really define my probability of type I error as this -- if I want to make sure that I never make this kind of error -- I'm doomed. This is just not going to happen. But I can revise what my goals are in terms of the errors that I make, and then I will actually be able to do something. And what people look at is the false discovery rate. What we're controlling here is called the family-wise error rate, which is a stronger thing to control.

So this trick, which consists in replacing α by α over the number of times you're going to be performing your test -- or α over the number of terms in your union -- is actually called the Bonferroni correction. And that's something you use when you have what's called -- another keyword here is multiple testing -- when you're trying to do multiple tests simultaneously.
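In the running sketch, the Bonferroni correction for testing all p coordinates at once is a one-line change: run each t-test at level α/p instead of α.

    # Bonferroni: to keep the family-wise error rate below alpha when
    # testing all p coordinates, run each individual test at level alpha/p.
    se = np.sqrt(sigma2_hat * np.diag(M))          # per-coordinate standard errors
    t_stats = beta_hat / se

    q_bonf = stats.t.ppf(1 - alpha / (2 * p), df=n - p)
    reject_bonf = np.abs(t_stats) > q_bonf         # more stringent than q above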
PHILIPPE RIGOLLET: And if S is not all of {1, ..., p} -- well, you just divide by the number of tests that you are actually making. So if S is of size k, for some k less than p, you just divide α by k and not by p, of course. I mean, you can always divide by p, but you're going to make your life harder for no reason.

Any questions about the Bonferroni correction?

So, one thing that is maybe not as obvious as the test of β_j = 0 versus β_j not equal to 0 -- and in particular, what that means is that it's not going to come up as a software output without you even requesting it, because that one is so standard that it's just coming out. But there are other tests that you might think of, that might be more complicated and more tailored to your particular problem. And those tests are of the form G times β is equal to some λ.

So let's see: the test we've just done, β_j = 0 versus β_j not equal to 0, is actually equivalent to e_jᵀβ = 0 versus e_jᵀβ not equal to 0. That was our claim. But now I don't have to stop here. I don't have to multiply by a vector and test if it's equal to 0. I can actually replace this by some general matrix G, and replace this guy by some general vector λ. And I'm not telling you what the dimensions are, because they're general. I can take whatever I want. Take your favorite matrix, as long as the right side of the matrix can multiply β; and λ -- take it to have the number of rows of G. And then you can do that. I can always formulate this test.

What will this test encompass? Well, those are kind of weird tests. So you can think of things like: I want to test if β_2 + β_3 is equal to 0, for example. Maybe I want to test if β_5 − 2β_6 is equal to 23. Well, that's weird. But why would you want to test if β_2 + β_3 is equal to 0? Maybe you don't want to know if the -- you know that the effect of some gene is not 0.
786 00:45:50,720 --> 00:45:54,210 Maybe you know that this gene affects this trait, 787 00:45:54,210 --> 00:45:56,790 but you want to know if the effect of this gene 788 00:45:56,790 --> 00:45:59,262 is canceled by the effect of that gene. 789 00:45:59,262 --> 00:46:00,970 And this is the kind of thing that you're 790 00:46:00,970 --> 00:46:02,178 going to be testing for. 791 00:46:04,470 --> 00:46:06,150 Now, this guy is much more artificial, 792 00:46:06,150 --> 00:46:08,770 and I don't have a bedtime story to tell you around this. 793 00:46:08,770 --> 00:46:13,340 So those things can happen and can be much more complicated. 794 00:46:13,340 --> 00:46:15,180 Now, here, notice that the matrix g 795 00:46:15,180 --> 00:46:18,270 has one row in each of these examples. 796 00:46:18,270 --> 00:46:20,580 But if I want to test if those two things happen 797 00:46:20,580 --> 00:46:25,380 at the same time, then I can actually take a matrix with two rows, one per constraint. 798 00:46:25,380 --> 00:46:27,840 Another matrix that can be useful 799 00:46:27,840 --> 00:46:34,620 is g equals the identity of rp and lambda is equal to 0. 800 00:46:34,620 --> 00:46:39,530 What am I doing here in this case? 801 00:46:39,530 --> 00:46:41,480 What is this test testing? 802 00:46:41,480 --> 00:46:42,280 Sorry, this test. 803 00:46:44,959 --> 00:46:45,458 Yeah? 804 00:46:45,458 --> 00:46:46,820 AUDIENCE: Whether or not beta is 0. 805 00:46:46,820 --> 00:46:49,278 PHILIPPE RIGOLLET: Yeah, we're testing if the entire vector 806 00:46:49,278 --> 00:46:54,120 beta is equal to 0, because g times beta is equal to beta, 807 00:46:54,120 --> 00:46:56,100 and we're asking whether it's equal to 0. 808 00:47:00,375 --> 00:47:04,590 So the thing is, when you want to actually test 809 00:47:04,590 --> 00:47:07,140 if beta is equal to 0, you're actually 810 00:47:07,140 --> 00:47:09,510 testing if your entire model, everything you're 811 00:47:09,510 --> 00:47:12,070 doing in life, is just junk. 812 00:47:12,070 --> 00:47:13,920 This is just telling you, actually, 813 00:47:13,920 --> 00:47:17,090 forget about this y is x beta plus epsilon. 814 00:47:17,090 --> 00:47:18,360 y is really just epsilon. 815 00:47:18,360 --> 00:47:19,200 There's nothing. 816 00:47:19,200 --> 00:47:21,810 There's just some big noise with some big variance, 817 00:47:21,810 --> 00:47:23,950 and there's nothing else. 818 00:47:23,950 --> 00:47:26,860 So it turns out that the statistical software 819 00:47:26,860 --> 00:47:30,970 output that I wrote here spits out an answer to this question. 820 00:47:30,970 --> 00:47:34,480 Just the last line, usually, is doing this test. 821 00:47:34,480 --> 00:47:36,642 Does your model even make sense? 822 00:47:36,642 --> 00:47:39,100 And it's probably for people to check whether they actually 823 00:47:39,100 --> 00:47:41,230 just mixed up their two data sets. 824 00:47:41,230 --> 00:47:43,450 Maybe they're actually trying to predict-- 825 00:47:43,450 --> 00:47:49,190 I don't know-- some credit score from genomic data, 826 00:47:49,190 --> 00:47:51,040 and so they just want to make sure, maybe, that's 827 00:47:51,040 --> 00:47:53,050 not the right thing. 828 00:47:53,050 --> 00:47:56,500 So it turns out that the machinery is exactly the same 829 00:47:56,500 --> 00:47:58,750 as the one we've just seen. 830 00:47:58,750 --> 00:48:00,380 So we actually start from here. 831 00:48:05,542 --> 00:48:06,500 So let me pull this up. 832 00:48:12,930 --> 00:48:15,000 So we start from here.
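[To make those example hypotheses concrete, here is one way they might be encoded as a matrix g and a vector lambda in numpy; the dimension p = 8 and the variable names are invented for illustration, with beta indexed from 1 as on the board.]

import numpy as np

p = 8  # dimension of beta, made up for the example

# one row of g per linear constraint on beta
g1 = np.zeros(p); g1[1] = 1.0; g1[2] = 1.0    # beta_2 + beta_3 = 0
g2 = np.zeros(p); g2[4] = 1.0; g2[5] = -2.0   # beta_5 - 2 beta_6 = 23

# testing both constraints at the same time: stack the rows
G = np.vstack([g1, g2])        # shape (2, p), so two constraints
lam = np.array([0.0, 23.0])    # one entry per row of G

# testing the whole model, beta = 0: g is the identity, lambda is 0
G_all = np.eye(p)
lam_all = np.zeros(p)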
833 00:48:15,000 --> 00:48:18,470 Beta hat was equal to beta plus this guy. 834 00:48:21,780 --> 00:48:23,640 And the first thing we did was to say, well, 835 00:48:23,640 --> 00:48:27,180 beta j is equal to this thing because, well, beta j was 836 00:48:27,180 --> 00:48:29,250 just ej transpose beta. 837 00:48:29,250 --> 00:48:32,616 So rather than taking ej here, let me just take g. 838 00:48:42,280 --> 00:48:45,220 Now, we said that for any vector-- 839 00:48:45,220 --> 00:48:47,840 well, that was trivial. 840 00:48:47,840 --> 00:48:50,350 So the thing we need to know is, what is this thing? 841 00:48:50,350 --> 00:48:55,110 Well, this thing here, what is this guy? 842 00:48:55,110 --> 00:48:59,870 It's also normal and the mean is 0. 843 00:48:59,870 --> 00:49:03,510 Again, that's just using properties of Gaussian vectors. 844 00:49:03,510 --> 00:49:06,430 And what is the covariance matrix? 845 00:49:06,430 --> 00:49:09,290 Let's call this guy sigma so that you can 846 00:49:09,290 --> 00:49:11,660 formulate an answer. 847 00:49:11,660 --> 00:49:14,230 So what is the distribution of-- what 848 00:49:14,230 --> 00:49:18,354 is the covariance of g times some Gaussian 0 sigma? 849 00:49:18,354 --> 00:49:20,290 AUDIENCE: g sigma g transpose. 850 00:49:20,290 --> 00:49:22,500 PHILIPPE RIGOLLET: g sigma g transpose, right? 851 00:49:22,500 --> 00:49:33,895 So that's g, x transpose x inverse, g transpose. 852 00:49:38,650 --> 00:49:41,780 Now, I'm not going to be able to go much farther. 853 00:49:41,780 --> 00:49:44,900 I mean, I made this very acute observation 854 00:49:44,900 --> 00:49:47,790 that ej transpose times a matrix times ej is the j-th diagonal 855 00:49:47,790 --> 00:49:48,290 element. 856 00:49:48,290 --> 00:49:50,450 Now, if I have a general matrix, the price to pay is that I 857 00:49:50,450 --> 00:49:52,949 cannot just shrink this thing any further because I'm trying 858 00:49:52,949 --> 00:49:54,640 to be abstract. 859 00:49:54,640 --> 00:49:56,487 And so I'm almost there. 860 00:49:56,487 --> 00:49:58,070 The only thing that happened last time 861 00:49:58,070 --> 00:50:00,050 is that when this was ej, 862 00:50:00,050 --> 00:50:03,380 we knew that this was equal to 0 under the null. 863 00:50:03,380 --> 00:50:08,790 But under the null, what is this equal to? 864 00:50:12,510 --> 00:50:13,440 AUDIENCE: Lambda. 865 00:50:13,440 --> 00:50:15,106 PHILIPPE RIGOLLET: Lambda, which I know. 866 00:50:15,106 --> 00:50:16,880 I mean, I wrote my thing. 867 00:50:16,880 --> 00:50:19,730 And in the couple instances I just showed you, 868 00:50:19,730 --> 00:50:22,700 including this one over there on top, lambda was equal to 0. 869 00:50:22,700 --> 00:50:24,620 But in general, it can be any lambda. 870 00:50:24,620 --> 00:50:27,890 But what's key about this lambda is that I actually know it. 871 00:50:27,890 --> 00:50:31,940 That's the hypothesis I'm formulating. 872 00:50:31,940 --> 00:50:34,340 So now I'm going to have to be a little more careful when 873 00:50:34,340 --> 00:50:36,650 I want to build the distribution of g beta hat. 874 00:50:36,650 --> 00:50:39,380 I need to actually subtract this lambda. 875 00:50:39,380 --> 00:50:40,970 So now we go from this, and we say, 876 00:50:40,970 --> 00:50:47,040 well, g beta hat minus lambda follows 877 00:50:47,040 --> 00:50:57,730 some n 0, sigma squared g, x transpose x 878 00:50:57,730 --> 00:51:00,660 inverse, g transpose. 879 00:51:04,070 --> 00:51:06,469 So that's true.
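[Collecting the boards into one display, with k denoting the number of rows of g, the claim so far is the following.]

\[
\hat{\beta} \sim \mathcal{N}_p\left(\beta,\ \sigma^2 (X^\top X)^{-1}\right)
\quad\Longrightarrow\quad
G\hat{\beta} - \lambda \sim \mathcal{N}_k\left(0,\ \sigma^2\, G (X^\top X)^{-1} G^\top\right)
\quad \text{under } H_0:\ G\beta = \lambda.
\]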
880 00:51:06,469 --> 00:51:08,510 Let's assume-- let's go straight to the case when 881 00:51:08,510 --> 00:51:10,410 we don't know what sigma is. 882 00:51:10,410 --> 00:51:11,970 So what I'm going to be interested in 883 00:51:11,970 --> 00:51:26,360 is g beta hat minus lambda divided by sigma hat. 884 00:51:26,360 --> 00:51:29,870 And that's going to follow some Gaussian that has this thing, 885 00:51:29,870 --> 00:51:37,660 g, x transpose x inverse, g transpose. 886 00:51:37,660 --> 00:51:40,780 So now, what did I do last time? 887 00:51:40,780 --> 00:51:45,010 So clearly, the quantiles of this distribution 888 00:51:45,010 --> 00:51:48,000 are-- well, OK, what is the size of this distribution? 889 00:51:48,000 --> 00:51:52,848 Well, I need to tell you that g is an-- 890 00:51:52,848 --> 00:51:54,724 what did I take here? 891 00:51:54,724 --> 00:51:57,180 AUDIENCE: 1 divided by sigma, not sigma hat. 892 00:51:57,180 --> 00:51:58,930 PHILIPPE RIGOLLET: Oh, yeah, you're right. 893 00:51:58,930 --> 00:52:00,440 So let me write it like this. 894 00:52:05,750 --> 00:52:15,800 Well, let me write it like this-- 895 00:52:15,800 --> 00:52:17,253 sigma squared over sigma. 896 00:52:21,659 --> 00:52:23,325 So let's forget about the size of g now. 897 00:52:23,325 --> 00:52:25,120 Let's just think of any general g. 898 00:52:27,730 --> 00:52:30,820 When g was a vector, what was nice 899 00:52:30,820 --> 00:52:35,410 is that this guy was just a scalar, just one number. 900 00:52:35,410 --> 00:52:38,012 And so if I wanted to get rid of this on the right-hand side, 901 00:52:38,012 --> 00:52:39,970 all I had to do was to divide it by this thing. 902 00:52:39,970 --> 00:52:41,464 We called it gamma j. 903 00:52:41,464 --> 00:52:43,630 And we just had to divide by square root of gamma j, 904 00:52:43,630 --> 00:52:45,820 and that would be gone. 905 00:52:45,820 --> 00:52:48,450 Now I have a matrix. 906 00:52:48,450 --> 00:52:50,100 So I need to get rid of this matrix 907 00:52:50,100 --> 00:52:55,016 somehow because, clearly, the quantiles of this distribution 908 00:52:55,016 --> 00:52:56,640 are not going to be written in the back 909 00:52:56,640 --> 00:52:59,170 of a book for any value of g and any value of x. 910 00:52:59,170 --> 00:53:01,660 So I need to standardize before I can read anything out 911 00:53:01,660 --> 00:53:03,860 of a table. 912 00:53:03,860 --> 00:53:04,820 So how do we do it? 913 00:53:04,820 --> 00:53:14,880 Well, we just form this guy here. 914 00:53:14,880 --> 00:53:18,770 So what we know is that if-- 915 00:53:18,770 --> 00:53:21,120 so here's the claim, again, another 916 00:53:21,120 --> 00:53:23,520 claim about Gaussian vectors. 917 00:53:23,520 --> 00:53:43,220 If x follows some n 0 sigma, then x transpose sigma inverse x 918 00:53:43,220 --> 00:53:44,596 follows some chi squared. 919 00:53:48,330 --> 00:53:51,930 And here, it's going to depend on what the dimension is here. 920 00:53:51,930 --> 00:53:56,160 So if I make this k by k, a k-dimensional Gaussian vector, 921 00:53:56,160 --> 00:53:57,497 this is chi squared k. 922 00:54:02,467 --> 00:54:04,455 Where have we used that before? 923 00:54:08,928 --> 00:54:09,922 Yeah? 924 00:54:09,922 --> 00:54:10,850 AUDIENCE: Wald's test. 925 00:54:10,850 --> 00:54:13,350 PHILIPPE RIGOLLET: Wald's test, that's exactly what we used. 926 00:54:13,350 --> 00:54:16,480 Wald's test had a chi squared that was showing up.
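[The claim itself takes one line to check: whitening the Gaussian vector turns the quadratic form into a sum of k squared standard Gaussians.]

\[
X \sim \mathcal{N}_k(0, \Sigma),\quad Z = \Sigma^{-1/2} X \sim \mathcal{N}_k(0, I_k)
\quad\Longrightarrow\quad
X^\top \Sigma^{-1} X = Z^\top Z = \sum_{j=1}^{k} Z_j^2 \sim \chi^2_k.
\]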
927 00:54:16,480 --> 00:54:18,430 And the way we made it show up was 928 00:54:18,430 --> 00:54:20,640 by taking the asymptotic variance, 929 00:54:20,640 --> 00:54:24,852 taking its inverse, which, in this framework, was called-- 930 00:54:24,852 --> 00:54:25,710 AUDIENCE: Fisher. 931 00:54:25,710 --> 00:54:27,300 PHILIPPE RIGOLLET: Fisher information. 932 00:54:27,300 --> 00:54:31,410 And then we pre- and post-multiplied by this thing. 933 00:54:31,410 --> 00:54:33,150 So this is the key. 934 00:54:33,150 --> 00:54:35,400 And so now, it tells me exactly, when 935 00:54:35,400 --> 00:54:38,190 I start from this guy that has this multivariate Gaussian 936 00:54:38,190 --> 00:54:40,050 distribution, it tells me how to turn it into something 937 00:54:40,050 --> 00:54:42,720 that has a distribution which is pivotal. 938 00:54:42,720 --> 00:54:45,849 Chi squared k is completely pivotal; it does not depend 939 00:54:45,849 --> 00:54:46,890 on anything I don't know. 940 00:55:03,810 --> 00:55:06,400 The way I go from here is by saying, well, now, 941 00:55:06,400 --> 00:55:13,380 I look at g beta hat minus lambda transpose, 942 00:55:13,380 --> 00:55:15,390 and now I need to look at the inverse 943 00:55:15,390 --> 00:55:16,600 of the matrix over there. 944 00:55:16,600 --> 00:55:29,950 So it's g, x transpose x inverse, g transpose, all inverse, times g beta 945 00:55:29,950 --> 00:55:32,510 hat minus lambda. 946 00:55:35,647 --> 00:55:36,855 This guy is going to follow-- 947 00:55:39,700 --> 00:55:42,891 well, here, I need to actually divide by sigma in this case-- 948 00:55:56,540 --> 00:56:00,560 a chi squared with k degrees of freedom, if g is k times p. 949 00:56:00,560 --> 00:56:04,370 So what I mean here is just that's the same k. 950 00:56:04,370 --> 00:56:07,250 The k that shows up is the number of constraints 951 00:56:07,250 --> 00:56:08,840 that I have in my test. 952 00:56:13,340 --> 00:56:20,690 So now, if I go from here to using sigma hat, 953 00:56:20,690 --> 00:56:23,180 the key thing to observe is that this guy is actually 954 00:56:23,180 --> 00:56:25,100 not a Gaussian. 955 00:56:25,100 --> 00:56:28,410 I'm not going to have a student t-distribution that shows up. 956 00:56:36,290 --> 00:57:03,850 So that implies that if I take the same thing, 957 00:57:03,850 --> 00:57:06,450 so now I just go from sigma to sigma hat, 958 00:57:06,450 --> 00:57:08,140 then this thing is of the form-- 959 00:57:12,620 --> 00:57:17,280 well, this chi squared k divided by the chi squared that shows 960 00:57:17,280 --> 00:57:20,590 up in the denominator of the t-distribution, 961 00:57:20,590 --> 00:57:28,270 which is square root of-- 962 00:57:28,270 --> 00:57:30,060 oh, I should not divide by sigma-- 963 00:57:30,060 --> 00:57:31,510 so this is sigma squared, right? 964 00:57:31,510 --> 00:57:32,567 AUDIENCE: Yeah. 965 00:57:32,567 --> 00:57:34,400 PHILIPPE RIGOLLET: So this is sigma squared. 966 00:57:34,400 --> 00:57:40,550 So this is of the form a chi squared k divided by a chi squared n 967 00:57:40,550 --> 00:57:44,180 minus p divided by n minus p. 968 00:57:44,180 --> 00:57:48,370 So that's the same denominator that I saw in my t-test. 969 00:57:48,370 --> 00:57:49,955 The numerator has changed, though. 970 00:57:49,955 --> 00:57:52,080 The numerator is now this chi squared and no longer 971 00:57:52,080 --> 00:57:52,580 a Gaussian. 972 00:57:55,430 --> 00:58:00,350 But this distribution is actually pivotal, as long 973 00:58:00,350 --> 00:58:02,210 as we can guarantee that there's no hidden 974 00:58:02,210 --> 00:58:08,550 parameter in the correlation between the two chi squareds.
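[Written out in one display, the two ingredients on the board are, under the null,]

\[
\frac{1}{\sigma^2}\,(G\hat{\beta} - \lambda)^\top \left(G (X^\top X)^{-1} G^\top\right)^{-1} (G\hat{\beta} - \lambda) \sim \chi^2_k,
\qquad
\frac{\hat{\sigma}^2}{\sigma^2} \sim \frac{\chi^2_{n-p}}{n-p},
\]

[so replacing sigma squared by sigma hat squared leaves a ratio of a chi squared k and an independent chi squared n minus p over n minus p; the unknown sigma squared cancels.]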
975 00:58:08,550 --> 00:58:13,470 So again, as with all statements of independence in this class, 976 00:58:13,470 --> 00:58:15,930 I will just give it to you for free. 977 00:58:15,930 --> 00:58:20,660 Those two things, I claim-- 978 00:58:20,660 --> 00:58:29,635 so, OK, let's just admit these are independent. 979 00:58:37,370 --> 00:58:38,730 We're almost there. 980 00:58:38,730 --> 00:58:41,627 This could be a distribution that's pivotal. 981 00:58:41,627 --> 00:58:43,960 But there's something that's a little unbalanced about it, 982 00:58:43,960 --> 00:58:46,160 which is that this guy is divided by its number of degrees 983 00:58:46,160 --> 00:58:48,980 of freedom, but this guy is not divided by its number 984 00:58:48,980 --> 00:58:50,670 of degrees of freedom. 985 00:58:50,670 --> 00:58:53,350 And so we just have to make the extra step 986 00:58:53,350 --> 00:58:57,280 that if I divide this guy by k, 987 00:58:57,280 --> 00:59:00,080 then the numerator becomes a chi squared k 988 00:59:00,080 --> 00:59:03,900 divided by k. 989 00:59:03,900 --> 00:59:05,442 And now it looks-- 990 00:59:05,442 --> 00:59:06,900 I mean, it doesn't change anything. 991 00:59:06,900 --> 00:59:09,020 I've just divided by a fixed number. 992 00:59:09,020 --> 00:59:11,200 But it just looks more elegant-- 993 00:59:11,200 --> 00:59:13,650 it's the ratio of two independent chi 994 00:59:13,650 --> 00:59:15,420 squareds that are individually divided 995 00:59:15,420 --> 00:59:16,920 by their numbers of degrees of freedom. 996 00:59:20,840 --> 00:59:31,100 And this has a name, and it's called the Fisher 997 00:59:31,100 --> 00:59:34,190 or F-distribution. 998 00:59:34,190 --> 00:59:40,740 So unlike William Gosset, who was not 999 00:59:40,740 --> 00:59:43,200 allowed to use his own name and used the name Student, 1000 00:59:43,200 --> 00:59:45,000 Fisher was allowed to use his own name, 1001 00:59:45,000 --> 00:59:47,220 and that's called the Fisher distribution. 1002 00:59:47,220 --> 00:59:52,470 And the Fisher distribution now has 2 parameters, 1003 00:59:52,470 --> 00:59:53,910 a pair of degrees of freedom-- 1004 00:59:53,910 --> 00:59:57,180 1 for the numerator and 1 for the denominator. 1005 00:59:57,180 --> 01:00:01,217 So F, for Fisher distribution-- 1006 01:00:07,430 --> 01:00:13,450 so F is equal to the ratio of a chi squared p over p 1007 01:00:13,450 --> 01:00:16,960 and a chi squared q over q. 1008 01:00:16,960 --> 01:00:27,320 So that's F p, q, where the 2 chi squareds are independent. 1009 01:00:32,970 --> 01:00:35,160 Is that clear what I'm defining here? 1010 01:00:35,160 --> 01:00:41,460 So this is basically what plays the role of the t-distribution 1011 01:00:41,460 --> 01:00:43,870 when you're testing more than 1 parameter at a time. 1012 01:00:43,870 --> 01:00:45,630 So you basically replace-- 1013 01:00:45,630 --> 01:00:47,190 the normal that was in the numerator, 1014 01:00:47,190 --> 01:00:49,023 you replace it by a chi squared because you're 1015 01:00:49,023 --> 01:00:51,780 testing if 2 vectors are simultaneously close. 1016 01:00:51,780 --> 01:00:55,340 And the way you do it is by looking at their squared norm. 1017 01:00:55,340 --> 01:00:57,800 And that's how the chi squared shows up. 1018 01:01:00,632 --> 01:01:08,240 Quick remark-- are those things really very different? 1019 01:01:08,240 --> 01:01:12,090 How can I relate a chi squared with a t-distribution?
1020 01:01:12,090 --> 01:01:19,151 Well, if t follows, say, a t-- 1021 01:01:19,151 --> 01:01:20,400 I don't know, let's call it q degrees of freedom. 1022 01:01:24,080 --> 01:01:28,330 So that means that t, let me look at-- 1023 01:01:28,330 --> 01:01:38,200 t is some n 0 1 divided by the square root of a chi 1024 01:01:38,200 --> 01:01:40,650 squared q over q. 1025 01:01:44,820 --> 01:01:48,926 That's the distribution of t. 1026 01:01:48,926 --> 01:01:51,300 So if I look at the square of the-- the distribution of t 1027 01:01:51,300 --> 01:01:53,600 squared-- 1028 01:01:53,600 --> 01:01:55,010 let me put it here-- 1029 01:01:58,300 --> 01:02:06,280 well, that's the square of some n 0 1 divided by a chi squared q over q. 1030 01:02:09,690 --> 01:02:11,900 Agreed? 1031 01:02:11,900 --> 01:02:13,470 I just removed the square root here, 1032 01:02:13,470 --> 01:02:15,810 and I took the square of the Gaussian. 1033 01:02:15,810 --> 01:02:20,030 But what is the distribution of the square of a Gaussian? 1034 01:02:20,030 --> 01:02:21,530 AUDIENCE: Chi squared with 1 degree. 1035 01:02:21,530 --> 01:02:25,140 PHILIPPE RIGOLLET: Chi squared with 1 degree of freedom. 1036 01:02:25,140 --> 01:02:27,284 So this is a chi squared with 1 degree of freedom. 1037 01:02:27,284 --> 01:02:28,700 And in particular, it's also a chi 1038 01:02:28,700 --> 01:02:31,836 squared with 1 degree of freedom divided by 1. 1039 01:02:31,836 --> 01:02:38,860 So t squared, in the end, has an F-distribution with 1 1040 01:02:38,860 --> 01:02:41,300 and q degrees of freedom. 1041 01:02:41,300 --> 01:02:43,589 So those two things are actually very similar. 1042 01:02:43,589 --> 01:02:45,130 The only thing that's going to change 1043 01:02:45,130 --> 01:02:48,280 is that, since we're actually looking at, typically, 1044 01:02:48,280 --> 01:02:51,164 absolute values of t when we do our tests, 1045 01:02:51,164 --> 01:02:52,830 it's going to be exactly the same thing. 1046 01:02:52,830 --> 01:02:54,330 The quantiles of one guy are going 1047 01:02:54,330 --> 01:02:56,496 to be, essentially, the square root of the quantiles 1048 01:02:56,496 --> 01:02:57,310 of the other guy. 1049 01:02:57,310 --> 01:03:00,390 That's all it's going to be. 1050 01:03:00,390 --> 01:03:07,360 So if my test is psi is equal to the indicator 1051 01:03:07,360 --> 01:03:16,010 that the absolute value of t exceeds q alpha over 2 of tq, for example, 1052 01:03:16,010 --> 01:03:19,990 then it's equal to the indicator that t squared 1053 01:03:19,990 --> 01:03:26,030 exceeds the square of q alpha over 2 of tq, 1054 01:03:26,030 --> 01:03:28,770 because I had the absolute value here, 1055 01:03:28,770 --> 01:03:33,110 which is equal to the indicator that t squared is 1056 01:03:33,110 --> 01:03:35,580 greater than q alpha over 2. 1057 01:03:35,580 --> 01:03:37,000 And now this time, it's an F1q. 1058 01:03:39,880 --> 01:03:42,340 So in a way, those two things belong to the same family. 1059 01:03:42,340 --> 01:03:44,680 They really are a natural generalization of each other. 1060 01:03:44,680 --> 01:03:47,310 I mean, at least the F-test is a generalization of the t-test. 1061 01:03:51,230 --> 01:03:54,480 And so now I can perform my test just like it's written here. 1062 01:03:54,480 --> 01:03:56,250 I just form this guy, and then I 1063 01:03:56,250 --> 01:03:58,860 compare it against the quantile of an F-distribution.
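[Here is a minimal numpy/scipy sketch of the whole F-test pipeline under the Gaussian model of the lecture; the function name and interface are invented for illustration.]

import numpy as np
from scipy import stats

def f_test(X, y, G, lam, alpha=0.05):
    # F-test of H0: G beta = lam in the model y = X beta + epsilon,
    # epsilon ~ N(0, sigma^2 I), with X of size n by p and G of size k by p.
    n, p = X.shape
    k = G.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                # least squares estimator
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)        # unbiased variance estimate
    d = G @ beta_hat - lam
    M = G @ XtX_inv @ G.T
    F = (d @ np.linalg.solve(M, d) / k) / sigma2_hat
    # the F-statistic is nonnegative, so we only look at the upper tail
    return F, F > stats.f.ppf(1 - alpha, k, n - p)

[And the square-root relation between the t and F tables is easy to check numerically; the numbers 30 and 0.05 are arbitrary.]

q, alpha = 30, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, q)     # two-sided t quantile
f_crit = stats.f.ppf(1 - alpha, 1, q)      # upper-tail F(1, q) quantile
assert abs(t_crit ** 2 - f_crit) < 1e-8    # t_crit squared equals f_crit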
1064 01:03:58,860 --> 01:04:01,440 Notice, there's no absolute value-- 1065 01:04:01,440 --> 01:04:04,740 oh, yeah, I forgot, this is actually 1066 01:04:04,740 --> 01:04:09,761 q alpha because the F-statistic is already positive. 1067 01:04:09,761 --> 01:04:11,760 So I'm not going to look between left and right, 1068 01:04:11,760 --> 01:04:15,240 I'm just going to look at whether it's too large or not. 1069 01:04:15,240 --> 01:04:18,030 So that's by definition. 1070 01:04:18,030 --> 01:04:19,380 So you can check-- 1071 01:04:19,380 --> 01:04:21,120 if you look at a table for Student 1072 01:04:21,120 --> 01:04:23,025 and you look at a table for F1q, 1073 01:04:23,025 --> 01:04:25,650 you're going to have to move from one column 1074 01:04:25,650 --> 01:04:26,610 to the other because you're going 1075 01:04:26,610 --> 01:04:28,475 to have to move from alpha over 2 to alpha, 1076 01:04:28,475 --> 01:04:31,670 but one is going to be the square root of the other one, 1077 01:04:31,670 --> 01:04:34,370 just like the chi squared is the square of the Gaussian. 1078 01:04:34,370 --> 01:04:36,828 I mean, if you look at the chi squared with 1 degree of freedom, 1079 01:04:36,828 --> 01:04:40,441 you will see the same thing as for the Gaussian. 1080 01:04:47,035 --> 01:04:53,594 So I'm actually going to start with the last one 1081 01:04:53,594 --> 01:04:55,760 because you've been asking a few questions about why 1082 01:04:55,760 --> 01:04:58,450 my design is deterministic. 1083 01:04:58,450 --> 01:04:59,660 So there are many answers. 1084 01:04:59,660 --> 01:05:01,955 Some are philosophical. 1085 01:05:01,955 --> 01:05:04,330 But one that's actually-- well, there's the one that says 1086 01:05:04,330 --> 01:05:07,106 you cannot do anything if you don't condition-- 1087 01:05:07,106 --> 01:05:09,730 if you don't have x, because all of the statements that we made 1088 01:05:09,730 --> 01:05:12,850 here, for example, just the fact that this is chi squared, 1089 01:05:12,850 --> 01:05:15,010 if those guys start to be random variables, 1090 01:05:15,010 --> 01:05:17,010 then it's clearly not going to be a chi squared. 1091 01:05:17,010 --> 01:05:19,000 I mean, it cannot be chi squared both when those guys are 1092 01:05:19,000 --> 01:05:20,624 deterministic and when they are random. 1093 01:05:20,624 --> 01:05:22,100 I mean, things change. 1094 01:05:22,100 --> 01:05:25,060 So that's just maybe [INAUDIBLE] check statement. 1095 01:05:25,060 --> 01:05:27,580 But I think the one that really matters is that-- 1096 01:05:27,580 --> 01:05:30,450 remember when we did the t-test, we 1097 01:05:30,450 --> 01:05:32,195 had this gamma j that showed up. 1098 01:05:32,195 --> 01:05:34,910 Gamma j was playing the role of the variance. 1099 01:05:34,910 --> 01:05:36,904 So here, the variance, you never think of-- 1100 01:05:36,904 --> 01:05:39,070 I mean, we'll talk about this in the Bayesian setup, 1101 01:05:39,070 --> 01:05:41,320 but so far, we haven't thought of the variance 1102 01:05:41,320 --> 01:05:42,390 as a random variable. 1103 01:05:42,390 --> 01:05:45,580 And so here, your x's really are the parameters of your data. 1104 01:05:45,580 --> 01:05:48,110 And the diagonal elements of x transpose x inverse 1105 01:05:48,110 --> 01:05:49,787 actually tell you what the variance is. 1106 01:05:49,787 --> 01:05:52,120 So that's also one reason why you should think of your x's 1107 01:05:52,120 --> 01:05:53,530 as being deterministic numbers.
1108 01:05:53,530 --> 01:05:55,450 They are, in a way, things that change 1109 01:05:55,450 --> 01:05:56,740 the geometry of your problem. 1110 01:05:56,740 --> 01:05:58,450 They just say, oh, let me look at it 1111 01:05:58,450 --> 01:06:01,180 from the perspective of x. 1112 01:06:01,180 --> 01:06:03,070 Actually, for that matter, we didn't really 1113 01:06:03,070 --> 01:06:06,000 spend much time commenting on what 1114 01:06:06,000 --> 01:06:09,730 is the effect of x on gamma. 1115 01:06:09,730 --> 01:06:19,910 So remember, gamma j, so that was the variance parameter. 1116 01:06:19,910 --> 01:06:23,780 So we should try to understand what x's lead to big variance 1117 01:06:23,780 --> 01:06:26,030 and what x's lead to small variance. 1118 01:06:26,030 --> 01:06:28,610 That would be nice. 1119 01:06:28,610 --> 01:06:31,550 Well, if this is the identity matrix-- 1120 01:06:31,550 --> 01:06:35,140 let's say the identity over n, which is the natural thing 1121 01:06:35,140 --> 01:06:38,620 to look at, because we want this thing to scale like 1/n-- 1122 01:06:38,620 --> 01:06:39,820 then this is just 1/n. 1123 01:06:39,820 --> 01:06:41,200 We're back to the original case. 1124 01:06:41,200 --> 01:06:41,700 Yes? 1125 01:06:41,700 --> 01:06:43,200 AUDIENCE: Shouldn't that be inverse? 1126 01:06:43,200 --> 01:06:45,500 PHILIPPE RIGOLLET: Yeah, thank you-- the inverse, yes. 1127 01:06:45,500 --> 01:06:48,590 So if this is the identity, then, well, the inverse 1128 01:06:48,590 --> 01:06:53,180 is-- let's say just this guy here is n times this guy. 1129 01:06:53,180 --> 01:06:56,210 So then the inverse is 1/n. 1130 01:06:56,210 --> 01:06:59,270 So in this case, that means that gamma j is equal to 1/n 1131 01:06:59,270 --> 01:07:02,240 and we're back to the theta hat theta 1132 01:07:02,240 --> 01:07:06,450 case, the basic one-dimensional thing. 1133 01:07:06,450 --> 01:07:11,390 What does it mean for a matrix when I take its-- 1134 01:07:11,390 --> 01:07:13,230 yeah, so that's of dimension p. 1135 01:07:13,230 --> 01:07:15,420 But when I take its transpose-- 1136 01:07:15,420 --> 01:07:17,394 so forget about the scaling by n right now. 1137 01:07:17,394 --> 01:07:19,060 This is just a matter of scaling things. 1138 01:07:19,060 --> 01:07:20,840 I can always multiply my x's so that I 1139 01:07:20,840 --> 01:07:22,584 have this thing that shows up. 1140 01:07:22,584 --> 01:07:24,750 But when I have a matrix, if I look at x transpose x 1141 01:07:24,750 --> 01:07:26,550 and I get something which is the identity, how 1142 01:07:26,550 --> 01:07:27,570 do I call this matrix? 1143 01:07:31,980 --> 01:07:32,970 AUDIENCE: Orthonormal? 1144 01:07:32,970 --> 01:07:34,470 PHILIPPE RIGOLLET: Orthogonal, yeah. 1145 01:07:34,470 --> 01:07:35,790 Orthonormal or orthogonal. 1146 01:07:35,790 --> 01:07:37,919 So you call this thing an orthogonal matrix. 1147 01:07:37,919 --> 01:07:39,960 And when it's an orthogonal matrix, what it means 1148 01:07:39,960 --> 01:07:42,540 is that the-- 1149 01:07:42,540 --> 01:07:46,230 so this matrix here, if you look at the matrix x transpose x, 1150 01:07:46,230 --> 01:07:48,390 the entries of this matrix are the inner products 1151 01:07:48,390 --> 01:07:49,890 between the columns of x. 1152 01:07:49,890 --> 01:07:51,240 That's what's happening. 1153 01:07:51,240 --> 01:07:52,800 You can write it out, and you will see 1154 01:07:52,800 --> 01:07:55,890 that the entries of this matrix are inner products.
1155 01:07:55,890 --> 01:07:59,860 If it's the identity, that means that you get some 1's 1156 01:07:59,860 --> 01:08:03,170 and a bunch of 0's, which means that all the inner products 1157 01:08:03,170 --> 01:08:05,910 between 2 different columns are actually 0. 1158 01:08:05,910 --> 01:08:07,980 What it means is that this matrix x 1159 01:08:07,980 --> 01:08:09,990 gives you an orthonormal basis for your space. 1160 01:08:09,990 --> 01:08:12,100 The columns form an orthonormal basis. 1161 01:08:12,100 --> 01:08:15,680 So they're basically as far from each other as they can be. 1162 01:08:15,680 --> 01:08:20,260 Now, if I start making those guys closer and closer, 1163 01:08:20,260 --> 01:08:21,939 then I'm starting to have some issues. 1164 01:08:21,939 --> 01:08:24,490 x transpose x is not going to be the identity. 1165 01:08:24,490 --> 01:08:27,330 I'm going to start to have some non-0 entries. 1166 01:08:27,330 --> 01:08:32,551 But if they all remain of norm 1, then-- 1167 01:08:32,551 --> 01:08:34,880 oh, sorry, so that's for the inverse. 1168 01:08:34,880 --> 01:08:37,899 So I first start putting some stuff here, which is non-0, 1169 01:08:37,899 --> 01:08:39,550 by taking my x's. 1170 01:08:39,550 --> 01:08:44,269 Rather than having this, I move to this. 1171 01:08:44,269 --> 01:08:46,310 Now I'm going to start seeing some non-0 entries. 1172 01:08:46,310 --> 01:08:49,410 And when I'm going to take the inverse of this matrix, 1173 01:08:49,410 --> 01:08:52,781 the diagonal elements are going to start to blow up. 1174 01:08:52,781 --> 01:08:56,010 Oh, sorry, the diagonals start to become smaller and smaller. 1175 01:08:56,010 --> 01:08:57,399 So when I take the inverse-- 1176 01:08:57,399 --> 01:09:01,399 no, sorry, the diagonal elements are going to blow up. 1177 01:09:01,399 --> 01:09:05,430 And so what it means is that the variance is going to blow up. 1178 01:09:05,430 --> 01:09:06,899 And that's essentially telling you 1179 01:09:06,899 --> 01:09:09,090 that if you get to choose your x's, you 1180 01:09:09,090 --> 01:09:12,582 want to take them as orthogonal as you can. 1181 01:09:12,582 --> 01:09:14,790 But if you don't, then you just have to deal with it, 1182 01:09:14,790 --> 01:09:18,950 and it will have a significant impact on your estimation 1183 01:09:18,950 --> 01:09:19,620 performance. 1184 01:09:19,620 --> 01:09:25,010 And that's also why, routinely, statistical software 1185 01:09:25,010 --> 01:09:26,885 is going to spit out this value here for you. 1186 01:09:26,885 --> 01:09:28,884 And you're going to have-- well, actually the square 1187 01:09:28,884 --> 01:09:30,410 root of this value. 1188 01:09:30,410 --> 01:09:32,440 And it's going to tell you, essentially-- 1189 01:09:32,440 --> 01:09:34,939 you're going to know how much randomness, how much variation 1190 01:09:34,939 --> 01:09:36,952 you have in this particular parameter 1191 01:09:36,952 --> 01:09:37,910 that you're estimating. 1192 01:09:37,910 --> 01:09:41,564 So if gamma j is large, then you're 1193 01:09:41,564 --> 01:09:43,189 going to have wide confidence intervals 1194 01:09:43,189 --> 01:09:45,740 and your tests are not going to reject very much. 1195 01:09:45,740 --> 01:09:47,110 And that's all captured by x. 1196 01:09:47,110 --> 01:09:48,109 That's what's important. 1197 01:09:48,109 --> 01:09:50,927 Everything, all of this, is completely captured by x. 1198 01:09:50,927 --> 01:09:52,760 Then, of course, there was the sigma squared 1199 01:09:52,760 --> 01:09:54,570 that showed up here.
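[Here is a tiny numpy experiment making that blow-up visible; the sample size and the amount of collinearity are made up for illustration.]

import numpy as np

n = 100
rng = np.random.default_rng(0)

def gamma_diag(X):
    # diagonal of (X^T X)^{-1}: the gamma_j's, up to the factor sigma^2
    return np.diag(np.linalg.inv(X.T @ X))

x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)

# near-orthogonal columns: both gamma_j's are on the order of 1/n
print(gamma_diag(np.column_stack([x1, x2])))

# nearly collinear columns: the gamma_j's blow up by orders of magnitude
print(gamma_diag(np.column_stack([x1, x1 + 0.01 * x2])))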
1200 01:09:54,570 --> 01:09:57,155 Actually, it was here, even in the definition of gamma j. 1201 01:09:57,155 --> 01:09:58,430 I forgot it. 1202 01:09:58,430 --> 01:10:00,440 What is the sigma squared police doing? 1203 01:10:00,440 --> 01:10:02,950 And so this thing was here as well, 1204 01:10:02,950 --> 01:10:04,850 and that's just exogenous. 1205 01:10:04,850 --> 01:10:06,269 It comes from the noise itself. 1206 01:10:06,269 --> 01:10:08,810 But there was this huge factor that came from the x's themselves. 1207 01:10:11,680 --> 01:10:13,960 So let's go back, now, to reading 1208 01:10:13,960 --> 01:10:15,320 this list in a linear fashion. 1209 01:10:15,320 --> 01:10:20,680 So I mean, you're MIT students, you've probably 1210 01:10:20,680 --> 01:10:25,480 heard that correlation does not imply causation many times. 1211 01:10:25,480 --> 01:10:27,145 Maybe you don't know what it means. 1212 01:10:27,145 --> 01:10:30,900 If you don't, that's OK, you just have to know the sentence. 1213 01:10:30,900 --> 01:10:32,420 No, what it means is that it's not 1214 01:10:32,420 --> 01:10:35,255 because I decided that something was going to be the x 1215 01:10:35,255 --> 01:10:36,630 and that something else was going 1216 01:10:36,630 --> 01:10:39,640 to be the y, that whatever relationship I'm finding 1217 01:10:39,640 --> 01:10:42,010 means that x implies y. 1218 01:10:42,010 --> 01:10:44,530 For example, even if I do genetics, genomics, 1219 01:10:44,530 --> 01:10:47,230 or whatever, I mean, I implicitly 1220 01:10:47,230 --> 01:10:49,630 assume that my genes are going to have 1221 01:10:49,630 --> 01:10:52,780 an effect on my outside look. 1222 01:10:52,780 --> 01:10:54,310 It could be the opposite. 1223 01:10:54,310 --> 01:10:55,720 I mean, who am I to say? 1224 01:10:55,720 --> 01:10:56,570 I'm not a biologist. 1225 01:10:56,570 --> 01:10:57,111 I don't know. 1226 01:10:57,111 --> 01:10:59,590 I haven't opened a biology book in 20 years. 1227 01:10:59,590 --> 01:11:02,140 So maybe, if I start hitting my head with a hammer, 1228 01:11:02,140 --> 01:11:04,720 I'm going to change my genetic material. 1229 01:11:04,720 --> 01:11:07,140 Probably not, but that's why-- 1230 01:11:07,140 --> 01:11:09,450 but causation definitely does not come from statistics. 1231 01:11:09,450 --> 01:11:11,690 So just know that that's a different thing-- 1232 01:11:11,690 --> 01:11:13,180 it's actually going to-- 1233 01:11:13,180 --> 01:11:14,690 it's not coming from there. 1234 01:11:14,690 --> 01:11:18,410 So actually, I remember, once, I gave an exam to students, 1235 01:11:18,410 --> 01:11:21,685 and there was an old data set on police expenditures, 1236 01:11:21,685 --> 01:11:23,920 I think, in Chicago in the '60s. 1237 01:11:23,920 --> 01:11:27,437 And they were trying to understand-- 1238 01:11:27,437 --> 01:11:28,270 no, it was on crime. 1239 01:11:28,270 --> 01:11:29,650 It was the crime data set. 1240 01:11:29,650 --> 01:11:31,700 And they were trying-- so the y variable was just 1241 01:11:31,700 --> 01:11:34,530 the rate of crime, and the x's were a bunch of things, 1242 01:11:34,530 --> 01:11:36,670 and one of them was police expenditures.
1243 01:11:36,670 --> 01:11:38,200 And if you ran the regression, you 1244 01:11:38,200 --> 01:11:41,050 would find that the coefficient in front of police expenditure 1245 01:11:41,050 --> 01:11:42,700 was a positive number, which means 1246 01:11:42,700 --> 01:11:45,690 that if you increase police expenditures, 1247 01:11:45,690 --> 01:11:48,400 that increases the crime. 1248 01:11:48,400 --> 01:11:52,800 I mean, that's what it means to have a positive coefficient. 1249 01:11:52,800 --> 01:11:55,410 Everybody agrees with this fact? 1250 01:11:55,410 --> 01:11:57,830 If beta j is 10, then it means that if I increase by $1 1251 01:11:57,830 --> 01:12:01,860 my police expenditure, I [INAUDIBLE] by 10 my crime, 1252 01:12:01,860 --> 01:12:04,160 everything else being kept equal. 1253 01:12:04,160 --> 01:12:06,140 Well, there were, I think, about 80% 1254 01:12:06,140 --> 01:12:09,230 of the students that were able to explain to me that if you 1255 01:12:09,230 --> 01:12:11,844 give more money to the police, then 1256 01:12:11,844 --> 01:12:13,010 the crime is going to rise. 1257 01:12:13,010 --> 01:12:14,780 Some people were like, well, the police 1258 01:12:14,780 --> 01:12:16,730 are making too much money, and they 1259 01:12:16,730 --> 01:12:19,264 don't think about their work, and they become lazy. 1260 01:12:19,264 --> 01:12:20,930 And I mean, people were really coming up 1261 01:12:20,930 --> 01:12:22,340 with some crazy things. 1262 01:12:22,340 --> 01:12:26,090 And what it just meant is that, no, it's not causation. 1263 01:12:26,090 --> 01:12:28,030 It's just, if you have more crime, 1264 01:12:28,030 --> 01:12:29,810 you give more money to your police. 1265 01:12:29,810 --> 01:12:31,370 That's what's happening. 1266 01:12:31,370 --> 01:12:33,800 And that's all there is. 1267 01:12:33,800 --> 01:12:35,750 So just be careful when you 1268 01:12:35,750 --> 01:12:38,360 draw some conclusions-- causation is a very important 1269 01:12:38,360 --> 01:12:39,410 thing to keep in mind. 1270 01:12:39,410 --> 01:12:43,280 And in practice, unless you have external reasons 1271 01:12:43,280 --> 01:12:45,680 for causality-- for example, with genetic material 1272 01:12:45,680 --> 01:12:52,040 and physical traits, we agree upon what 1273 01:12:52,040 --> 01:12:54,690 the direction of the arrow of causality is here. 1274 01:12:54,690 --> 01:12:57,845 There are places where you might not. 1275 01:12:57,845 --> 01:12:59,930 Now, finally, the normality of the noise-- 1276 01:12:59,930 --> 01:13:04,340 everything we did today required a Gaussian distribution 1277 01:13:04,340 --> 01:13:05,750 on the noise. 1278 01:13:05,750 --> 01:13:07,541 I mean, it's everywhere. 1279 01:13:07,541 --> 01:13:09,540 There's some Gaussian, there's some chi squared. 1280 01:13:09,540 --> 01:13:11,330 Everything came out of the Gaussian. 1281 01:13:11,330 --> 01:13:13,836 And for that, we needed this basic formula 1282 01:13:13,836 --> 01:13:15,710 for inference, which we derived from the fact 1283 01:13:15,710 --> 01:13:18,610 that the noise was Gaussian itself. 1284 01:13:18,610 --> 01:13:20,860 If we did not have that, the only thing we could write 1285 01:13:20,860 --> 01:13:24,370 is, beta hat is this number, or this vector. 1286 01:13:24,370 --> 01:13:27,980 We would not be able to say, the fluctuations of beta hat 1287 01:13:27,980 --> 01:13:28,615 are this guy. 1288 01:13:28,615 --> 01:13:30,472 We would not be able to do tests.
1289 01:13:30,472 --> 01:13:31,930 We would not be able to build, say, 1290 01:13:31,930 --> 01:13:34,160 confidence regions or anything. 1291 01:13:34,160 --> 01:13:38,150 And so this is an important condition that we need, 1292 01:13:38,150 --> 01:13:40,670 and that's what statistical software assumes by default. 1293 01:13:40,670 --> 01:13:44,870 But we now have a recipe for how to do those tests. 1294 01:13:44,870 --> 01:13:47,060 We can do it either visually, if we really 1295 01:13:47,060 --> 01:13:49,430 want to conclude that, yes, this is Gaussian, 1296 01:13:49,430 --> 01:13:51,350 using our normal Q-Q plots. 1297 01:13:51,350 --> 01:13:54,860 And we can also do it using our favorite tests. 1298 01:13:54,860 --> 01:13:56,750 What test should I be using to test that? 1299 01:14:01,540 --> 01:14:03,771 With two names? 1300 01:14:03,771 --> 01:14:04,270 Yeah? 1301 01:14:04,270 --> 01:14:06,957 AUDIENCE: Normal [INAUDIBLE]. 1302 01:14:06,957 --> 01:14:08,540 PHILIPPE RIGOLLET: Not the 2 Russians. 1303 01:14:08,540 --> 01:14:10,820 So I want a Russian and a Scandinavian person 1304 01:14:10,820 --> 01:14:12,722 for this one. 1305 01:14:12,722 --> 01:14:13,416 What's that? 1306 01:14:13,416 --> 01:14:14,540 AUDIENCE: Lillie-something? 1307 01:14:14,540 --> 01:14:16,290 PHILIPPE RIGOLLET: Yeah, Lillie-something. 1308 01:14:16,290 --> 01:14:18,660 So Kolmogorov Lillie-something test. 1309 01:14:18,660 --> 01:14:23,370 And [LAUGHS] so it's the Kolmogorov-Lilliefors test. 1310 01:14:23,370 --> 01:14:26,670 And because I'm testing if they're Gaussian, and I'm actually 1311 01:14:26,670 --> 01:14:28,140 not really making any assumption-- 1312 01:14:28,140 --> 01:14:30,510 I don't need to know what the variance is. 1313 01:14:30,510 --> 01:14:31,350 The mean is 0. 1314 01:14:31,350 --> 01:14:32,558 We saw that at the beginning. 1315 01:14:32,558 --> 01:14:34,680 It's 0 by construction, so we actually 1316 01:14:34,680 --> 01:14:37,590 don't need to think about the mean being 0 itself. 1317 01:14:37,590 --> 01:14:38,850 This just happens to be 0. 1318 01:14:38,850 --> 01:14:41,340 So we know that it's 0, but the variance, we don't know. 1319 01:14:41,340 --> 01:14:42,900 So we just want to know if it belongs 1320 01:14:42,900 --> 01:14:45,233 to the family of Gaussians, and so we need Kolmogorov- 1321 01:14:45,233 --> 01:14:46,660 Lilliefors for that. 1322 01:14:46,660 --> 01:14:49,650 And that's also one of the things that's spit out by statistical 1323 01:14:49,650 --> 01:14:52,680 software by default. When you run a linear regression, 1324 01:14:52,680 --> 01:14:54,670 actually, it spits out both Kolmogorov-Smirnov 1325 01:14:54,670 --> 01:14:59,118 and Kolmogorov-Lilliefors, probably contributing 1326 01:14:59,118 --> 01:15:01,860 to the widespread use of Kolmogorov-Smirnov when you 1327 01:15:01,860 --> 01:15:03,550 really shouldn't. 1328 01:15:03,550 --> 01:15:08,920 So next time, we will talk about more advanced topics 1329 01:15:08,920 --> 01:15:09,670 on regression. 1330 01:15:09,670 --> 01:15:11,780 But I think I'm going to stop here for today. 1331 01:15:11,780 --> 01:15:14,740 So again, tomorrow, sometime during the day, 1332 01:15:14,740 --> 01:15:16,780 at least before the recitation, you 1333 01:15:16,780 --> 01:15:20,740 will have a list of practice exercises that will be posted. 1334 01:15:20,740 --> 01:15:23,600 And if you go to the optional recitation, 1335 01:15:23,600 --> 01:15:26,190 you will have someone solving them.
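[To close the loop on that last check, here is a minimal sketch of testing the residuals with the Kolmogorov-Lilliefors test, using the lilliefors function from statsmodels; the simulated data and all the names are made up for illustration.]

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.standard_normal(n)     # Gaussian noise

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# Kolmogorov-Lilliefors: tests membership in the Gaussian family,
# with the variance left unspecified
stat, p_value = lilliefors(resid, dist='norm')
# a small p-value is evidence against Gaussian noise, in which case
# the t- and F-based inference above is on shaky ground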