The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So we've been talking about this chi square test. And the name chi square comes from the fact that we build a test statistic whose asymptotic distribution is given by the chi square distribution. Let's just give it another shot.

OK. This test. Who has actually ever encountered the chi square test outside of a stats classroom? All right, so some people have. It's a fairly common test that you might encounter. It was essentially designed to test, given some data with a fixed probability mass function, so a discrete distribution, whether the PMF was equal to a set value, p0, or different from p0. And the way the chi square arose here was by looking at Wald's test. Wald's test is the one that has the chi square as its limiting distribution. You invert the asymptotic covariance matrix, that is, you compute the Fisher information, which in this particular case does not exist for the multinomial distribution. But we found a trick: we removed the part that prevented the matrix from being invertible, and then we found this chi square distribution. In a way we have this test statistic, which you might have learned as a black box, a laundry-list item, but going through the math, which might have been slightly unpleasant, I acknowledge, really told you why you should use this particular normalization.

So since some of you requested more practical examples of how those things work, let me show you a couple. The first one: you want to answer the question, when should I be born to be successful?
Some people believe in the zodiac, and so Fortune magazine actually collected the signs of 256 heads of Fortune 500 companies; those were taken randomly. And you can see the count of the number of CEOs that have a particular zodiac sign. If this were completely uniformly distributed, you should get numbers around 256 divided by 12, which in this case is 21.33. And you can see that there are numbers that are probably in that vicinity, but look at this one. Pisces, that's 29. So who's a Pisces here? All right. All right, so give me your information and we'll meet again in 10 years.

So basically you might want to test whether the assumption that it's uniformly distributed is valid. Now this is clearly a random variable: I pick a random CEO and I measure what their zodiac sign is. So it's a probability distribution over, say, 12 zodiac signs, and I want to know if it's uniform or not. Uniform sounds like it should be the status quo, if you're reasonable, and maybe there's actually something that moves away from it. So we could ask: in view of these data, is there evidence that the distribution is different?

Here is another example where you might want to apply the chi square test. So as I said, the benchmark distribution for the zodiac signs was the uniform distribution, and that's usually the one I give you: 1 over k, ..., 1 over k, because that's sort of the central point for all distributions, the center of what we call the simplex. But you can have another benchmark that makes sense. For example, this is an actual dataset where 275 jurors were identified and their racial groups were collected, and you might want to know if juries in this country are actually representative of the actual population. And here of course the population is not uniformly distributed according to racial group.
And the way you actually do it is you go on Wikipedia, for example, and you look at the demographics of the United States, and you find that the proportion of white is 72%, black is 7%, Hispanic is 12%, and other is about 9%. So that's a total of 1. And this is what we actually measured for some jurors. So for this one, you can run the chi square test. You have the estimated proportions, which come from the first line, and you have the tested proportions, p0, which come from the second line, and you might want to check whether those actually correspond to each other. OK, so I'm not going to do it for you, but I invite you to run the test, compare the statistic to the quantiles of the appropriate chi square distribution, and see what you can conclude from those two things; a sketch of this computation in code follows.
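Here is a minimal sketch of how that computation might look in Python. Only the totals (256 CEOs, 275 jurors), the Pisces count of 29, and the benchmark proportions are quoted in the lecture, so all the other counts below are hypothetical placeholders for illustration.

```python
# Hypothetical chi-square goodness-of-fit computations for the two examples.
import numpy as np
from scipy import stats

# Zodiac example: H0 is the uniform PMF (1/12, ..., 1/12).
ceo_counts = np.array([23, 20, 18, 23, 20, 19, 18, 21, 19, 22, 24, 29])  # Pisces last
assert ceo_counts.sum() == 256
chi2_stat, p_value = stats.chisquare(ceo_counts)  # uniform expected counts by default
print(f"zodiac: chi2 = {chi2_stat:.2f}, p-value = {p_value:.3f}")

# Juror example: H0 is the census proportions, not the uniform distribution.
p0 = np.array([0.72, 0.07, 0.12, 0.09])      # white, black, Hispanic, other
juror_counts = np.array([205, 26, 25, 19])   # hypothetical counts, sum to 275
chi2_stat, p_value = stats.chisquare(juror_counts, f_exp=275 * p0)
print(f"jurors: chi2 = {chi2_stat:.2f}, p-value = {p_value:.3f}")

# Equivalent manual rejection rule at level alpha = 5%:
# reject H0 when the statistic exceeds the chi-square quantile with k - 1 dof.
print("5% quantile of chi2(11):", stats.chi2.ppf(0.95, df=11))
```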
All right. So this was the multinomial case. This is essentially what we did: we computed the MLE under the right constraint, and that gave us our test statistic that converges to the chi square distribution. If you've seen it before, that's all that was given to you. Now we know why the normalization here is p0 j, and not p0 j squared or square root of p0 j, or even 1. I mean, it's not obvious that this should be the right normalization, but we know it's what comes from taking the right normalization, which comes from the Fisher information. All right? OK.

So we've basically covered the chi square test. Are there any questions about the chi square test? And for those of you who were not here on Thursday, I'm really just-- do not pretend I just did it; that's something we did last Thursday. But are there any questions that arose when you were reading your notes, things that you didn't understand? Yes.

AUDIENCE: Is there like a formal name? Before, we had talked about how what we call the Fisher information [INAUDIBLE], still has the same [INAUDIBLE] because it's the same number.

PROFESSOR: So it's not the Fisher information. The Fisher information does not exist in this case, and so there's no appropriate name for this. It's the pseudoinverse of the asymptotic covariance matrix, and that's what it is. I don't know if I mentioned it last time, but there's an entire field that uses-- you know, for people who really aspire to differential geometry but are stuck in the stats department, there's this thing called information geometry, which essentially studies the manifolds associated to the Fisher information metric, the metric associated to the Fisher information. And those can of course be lower dimensional manifolds; the geometry is not only distorted, everything is forced to live on a lower dimension, which is what happens when your Fisher information does not exist. And so there's a bunch of things you can study, what this manifold looks like, et cetera. But no, there's no particular terminology here.

To be fair, within the scope of this class, the multinomial case is the only case where you typically see a lack of a Fisher information matrix. And that's just because we have this extra constraint that the sum of the parameters should be 1. If you have an extra constraint that removes one degree of freedom, this will happen inevitably. And so maybe what you can do is reparameterize. If I reparameterize everything as a function of p1 to p k minus 1, and then 1 minus their sum, this would not have happened, because I have only a (k minus 1)-dimensional space. So there are tricks around this to make the Fisher information exist if you want it to exist. Any other question? All right.
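To make the singularity concrete, here is the computation behind the statement that the Fisher information does not exist for the multinomial; the last display, the Fisher information of the reparameterized model, is the standard textbook expression, added here for reference.

```latex
% If X is the one-hot encoding of one multinomial draw with PMF p,
% its covariance matrix is
\[
  \Sigma \;=\; \operatorname{diag}(p) - p\,p^\top ,
\]
% and since the coordinates of p sum to one,
\[
  \Sigma \mathbf{1} \;=\; p - p\,(p^\top \mathbf{1}) \;=\; p - p \;=\; 0,
\]
% so \Sigma is singular: there is no Fisher information matrix in the
% full k-dimensional parameterization.  After reparameterizing by
% (p_1, \dots, p_{k-1}), the Fisher information exists and equals
\[
  I(p)_{ij} \;=\; \frac{\delta_{ij}}{p_i} + \frac{1}{p_k},
  \qquad 1 \le i, j \le k-1 .
\]
```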
So let's move on to Student's t-test. We mentioned it last time. You've probably used it more in the homework than in lectures, but quickly, this is essentially the test we use when the data actually comes from a normal distribution and there is no Central Limit Theorem to invoke. This is really to account for the fact that, for smaller sample sizes, it might not be exactly true that, when I look at xn bar minus mu divided by sigma, times square root of n, this thing has an N(0, 1) distribution approximately, by the Central Limit Theorem. That's for n large. But if n is small, it is still true that, when the data is N(mu, sigma squared), square root of n times xn bar minus mu over sigma is exactly N(0, 1); here it was approximate, for Gaussian data it is always true. But I don't know sigma in practice, right? Mu is fine; under the null, mu comes from my mu 0, that's the value that goes into the test statistic. But for sigma I'm inevitably going to have to find an estimator. And now, for small n, once I plug in the estimator, this is no longer true. What the t statistic is doing is essentially telling you what the distribution of this quantity is. So what you should say is that now this quantity has a t distribution with n minus 1 degrees of freedom. That's the laundry-list stats answer you would learn: it just says, look at a different table. But we actually defined what a t distribution is. A t distribution with d degrees of freedom is something that has the same distribution as an N(0, 1) divided by the square root of a chi square with d degrees of freedom divided by d, where those two are independent. And so what I need to check is that the quantity over there is of this form. OK? So let's look at the numerator: square root of n, times xn bar minus mu.
What is the distribution of this thing? Is it an N(0, 1)?

AUDIENCE: N(0, sigma squared)?

PROFESSOR: N(0, sigma squared), right. So I'm not going to put it here. If I want this to be N(0, 1), I need to divide by sigma; that's what we have over there. So that's my N(0, 1) that's going to play the role of the numerator here. If I want to go a little further, I need square root of n, and I need to find something here that looks like my square root of a chi square divided by-- yeah?

AUDIENCE: Really quick question. The equals sign with the d on top, that's just defined as?

PROFESSOR: No, that means equality in distribution. So, I don't know.

AUDIENCE: Then never mind.

PROFESSOR: Let's just write it like that, if you want. I mean, that's not really appropriate notation. Usually you write only one distribution on the right-hand side of this little symbol, not a complicated function of distributions. This is more to explain. OK, so usually the thing you should say is that t is equal to X divided by the square root of Z over d, where X has a normal distribution and Z has a chi square distribution with d degrees of freedom.

So what do we need here? Well, I need something which looks like my sigma hat, right? So somehow, inevitably, I'm going to need sigma hat. Now of course I need to divide this by sigma so that the sigma goes away. And so now this thing here-- sorry, I should move to the right, OK. And so this thing here, sigma hat, is the square root of Sn. And now I'm almost there. So this thing is actually equal to square root of n. But this thing here is actually not a-- so this thing here follows a distribution which is actually the square root of a chi square distribution divided by n.
Yeah, that's the square root of a chi square distribution with n minus 1 degrees of freedom, divided by n, because sigma hat squared, which is Sn, is equal to 1 over n times the sum from i equal 1 to n of xi minus x bar squared. And we just said that this part is a chi square distribution. We didn't just say it, we said it a few lectures back: this sum is a chi square distribution, and the presence of this x bar here removes one degree of freedom from the sum. OK, so this quantity has the same distribution as a chi square with n minus 1 degrees of freedom, divided by n.

So I still need to arrange this a little bit to get a t distribution. I should not see n here; I should see n minus 1, because the d in the denominator is the same as the degrees of freedom. So let me make the correction so that this actually happens. Well, if I write square root of n minus 1, as on the slide, times xn bar minus mu, divided by the square root of Sn, which is my sigma hat, then this follows an N(0, 1) divided by the square root of a chi square distribution with n minus 1 degrees of freedom, over n minus 1. And the fact that I multiply by square root of n minus 1 while I have the square root of n inside is essentially the same as dividing here by n minus 1. And that's my t distribution with n minus 1 degrees of freedom, just by definition of what this thing is. OK? All right. Yes?

AUDIENCE: Where'd you get the square root from?

PROFESSOR: This one? Oh sorry, that's sigma squared. Thank you. That's the estimator of the variance, not the estimator of the standard deviation, and when I want to divide, I divide by the standard deviation. Thank you. Any other question or remark?

AUDIENCE: Shouldn't you divide by sigma squared? The actual.
The estimator for the variance is equal to sigma squared times a chi square, right?

PROFESSOR: The estimator for the variance. Oh yes, you're right. So there's a sigma squared here. Is that what you're asking?

AUDIENCE: Yeah.

PROFESSOR: Yes, absolutely. And that's where it gets canceled here. OK? So this is really sigma squared times a chi square. The fact that it's sigma squared is just because I can pull out the sigma squared and think of those terms as N(0, 1).

All right. So that's my t distribution. Now that I actually have a pivotal distribution, what I do is form the statistic; here I called it Tn tilde. OK. And what is this thing? I know that this has a pivotal distribution. So, for example, I know that the probability that Tn tilde in absolute value exceeds some number, which I'm going to call q alpha over 2 of t n minus 1, is equal to alpha. Remember, the t distribution has the same shape as the Gaussian distribution. What I'm finding is, for this t distribution, some numbers q alpha over 2 of t n minus 1 and minus q alpha over 2 of t n minus 1, different from the Gaussian ones, such that the area under the curve is alpha over 2 on each side, so that the probability that the absolute value exceeds this number is equal to alpha. And that's what I'm going to use to reject the test.

So now my test becomes: for H0, mu is equal to some mu 0, versus H1, mu is not equal to mu 0, the rejection region is the set on which square root of n minus 1, times xn bar minus mu 0 this time, divided by square root of Sn, exceeds, in absolute value, q alpha over 2 of t n minus 1. So I reject when this statistic is large. It's the same as the Gaussian case, except that rather than reading my quantiles from the Gaussian table, I read them from the Student table. It's just the same thing, so they're just going to be a little bit farther out. This quantile is going to be a little bigger than the Gaussian one, because it's going to require a little more evidence in my data to be able to reject, since I have to account for the fluctuations of sigma hat. A small numerical sketch of this test follows.
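Here is a minimal sketch of the test as stated, on a made-up sample; mu0, alpha, and the data are illustrative choices. Note that library routines such as scipy.stats.ttest_1samp use the unbiased variance, with n minus 1 in the denominator, which gives exactly the same statistic as the square root of n minus 1 over square root of Sn form used on the board.

```python
# Minimal sketch of the one-sample two-sided t-test from the lecture.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, mu0, alpha = 15, 0.0, 0.05
x = rng.normal(loc=0.3, scale=1.0, size=n)   # made-up Gaussian sample

xbar = x.mean()
Sn = ((x - xbar) ** 2).mean()                # biased variance estimator (1/n)
Tn = np.sqrt(n - 1) * (xbar - mu0) / np.sqrt(Sn)

q = stats.t.ppf(1 - alpha / 2, df=n - 1)     # Student quantile q_{alpha/2}(t_{n-1})
print(f"Tn = {Tn:.3f}, cutoff = {q:.3f}, reject = {abs(Tn) > q}")

# Same statistic via the library (it uses the unbiased variance internally):
print(stats.ttest_1samp(x, popmean=mu0))

# The Student cutoff is a bit larger than the Gaussian one:
print("t quantile:", q, "vs normal quantile:", stats.norm.ppf(1 - alpha / 2))
```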
So of course Student's test is used everywhere. People use only t tests, right? If you look at any output, even if you had 500 observations, any statistical software output is going to say t test. And the reason you see t test is that somehow it feels like it's not asymptotic; you don't need to be particularly careful. And anyway, if n is equal to 500, since the two curves are basically on top of each other, it's the same thing, so it doesn't really change anything. So why not use the t test? It's not asymptotic. It doesn't require the Central Limit Theorem to kick in, and so in particular it can be run if you have 15 observations.

Of course, the drawback of the Student test is that it relies on the assumption that the sample is Gaussian, and that's something we really need to keep in mind. If you have a small sample size, there is no magic going on. It's not that the Student t test allows you to get rid of asymptotic normality; it sort of assumes normality is built in. It assumes that your data has a Gaussian distribution.

So if you have 15 observations, what are you going to do? You want to test if the mean is equal to 0 or not equal to 0, but you have only 15 observations. You have to somehow assume that your data is Gaussian. But if the data is given to you, this is not math: you actually have to check that it's Gaussian. And so we're going to have to find a test that, given some data, tells us whether it's Gaussian or not.
419 00:20:39,830 --> 00:20:42,320 If I have 15 observations, 8 of them 420 00:20:42,320 --> 00:20:46,089 are equal to plus 1 and 7 of them are equal to minus 1, 421 00:20:46,089 --> 00:20:47,630 then it's pretty unlikely that you're 422 00:20:47,630 --> 00:20:50,046 going to be able to conclude that your data has a Gaussian 423 00:20:50,046 --> 00:20:51,000 distribution. 424 00:20:51,000 --> 00:20:54,320 However, if you see some sort of spread around some value, 425 00:20:54,320 --> 00:20:56,120 you form a histogram maybe and it sort of 426 00:20:56,120 --> 00:20:57,710 looks like it's a Gaussian, you might 427 00:20:57,710 --> 00:20:59,120 want to say it's Gaussian. 428 00:20:59,120 --> 00:21:01,920 And so how do we make this more quantitative? 429 00:21:01,920 --> 00:21:05,390 Well, the sad answer to this question 430 00:21:05,390 --> 00:21:08,030 is that there will be some tests that make it quantitative, 431 00:21:08,030 --> 00:21:11,590 but here, if you think about it for one second, what is going 432 00:21:11,590 --> 00:21:13,030 to be your null hypothesis? 433 00:21:13,030 --> 00:21:15,930 Your null hypothesis, since it's one point, 434 00:21:15,930 --> 00:21:17,880 it's going to be that it's Gaussian, 435 00:21:17,880 --> 00:21:19,290 and then the alternative is going 436 00:21:19,290 --> 00:21:21,520 to be that it's not Gaussian. 437 00:21:21,520 --> 00:21:23,860 So what it means is that, for the first time 438 00:21:23,860 --> 00:21:26,140 in your statistician life, you're 439 00:21:26,140 --> 00:21:30,142 going to want to conclude that H0 is the true one. 440 00:21:30,142 --> 00:21:31,600 You're definitely not going to want 441 00:21:31,600 --> 00:21:34,016 to say that it's not Gaussian, because then everything you 442 00:21:34,016 --> 00:21:36,580 know is sort of falling apart. 443 00:21:36,580 --> 00:21:39,540 And so it's kind of a weird thing where 444 00:21:39,540 --> 00:21:41,430 you're sort of going to be seeking tests 445 00:21:41,430 --> 00:21:43,140 that have no power basically. 446 00:21:43,140 --> 00:21:46,240 You're going to want to test that, and that's the nature. 447 00:21:46,240 --> 00:21:49,140 The amount of alternatives, the number 448 00:21:49,140 --> 00:21:52,710 of ways you can be not Gaussian, is so huge 449 00:21:52,710 --> 00:21:56,569 that all tests are sort of bound to have very low power. 450 00:21:56,569 --> 00:21:58,860 And so that's why people are pretty happy with the idea 451 00:21:58,860 --> 00:22:00,420 that things are Gaussian, because it's 452 00:22:00,420 --> 00:22:01,961 very hard to find a test that's going 453 00:22:01,961 --> 00:22:04,790 to reject this hypothesis. 454 00:22:04,790 --> 00:22:08,479 And so we're even going to find some tests that are visual, 455 00:22:08,479 --> 00:22:10,020 where you're going to be able to say, 456 00:22:10,020 --> 00:22:12,800 well, sort of looks Gaussian to me. 457 00:22:12,800 --> 00:22:16,760 It allows you to deal with the borderline cases 458 00:22:16,760 --> 00:22:17,580 pretty efficiently. 459 00:22:17,580 --> 00:22:19,930 We'll see actually a particular example. 460 00:22:19,930 --> 00:22:22,280 All right, so this theory of testing 461 00:22:22,280 --> 00:22:24,470 whether data comes from a particular distribution 462 00:22:24,470 --> 00:22:26,930 is called goodness of fit. 463 00:22:26,930 --> 00:22:31,480 Is this distribution a good fit for my data? 464 00:22:31,480 --> 00:22:33,620 That's the goodness of fit test. 465 00:22:33,620 --> 00:22:36,110 We have just seen a goodness of fit test. 
What was it? Yeah, the chi square test, right? In the chi square test, we were given a candidate PMF and we were testing whether it was a good fit for our data. That was a goodness of fit test. So of course the multinomial is one example, but really what we have in the back of our mind is: I want to test if my data is Gaussian. That's basically the usual thing. And just as you always see the t test as the standard output from statistical software, whether you ask for it or not, there will be a test for normality, whether you ask for it or not, in any statistical software app.

All right. So a goodness of fit test looks as follows. There's a random variable X and you're given i.i.d. copies of X, X1 to Xn; they come from the same distribution. And you're going to ask the following question: does X have a standard normal distribution? For the t-test, that's definitely the kind of question you may want to ask. Does X have a uniform distribution on [0, 1]? That's different from the PMF 1 over k, ..., 1 over k; it's the continuous notion of uniformity.

And for example, there's actually a nice exercise, which is to look at p-values. So we've defined what p-values are, and a p-value is a number between 0 and 1, right? And you could ask yourself: what is the distribution of the p-value under the null? So the p-value is a random number. Let's look at the following test: H0, mu is equal to 0, versus H1, mu is not equal to 0. And I'm going to look at Xn bar minus mu, times square root of n, divided by sigma; let's say we know sigma for one second. Then the p-value is the probability that this is larger than square root of n times little xn bar, minus 0 actually in this case, divided by sigma, where little xn bar is the observed value.
So now you could say, well, how is that a random variable? It's just a number; it's just a probability of something. But then I can view this as a function of the observed value, which becomes a random variable when I plug the random average back in. So what I mean by this is that, if I say that Phi is the CDF of N(0, 1), the p-value is the probability that I exceed this value; so that's the probability that I'm either here or here, in the two tails.

AUDIENCE: [INAUDIBLE]

PROFESSOR: No, it's not, right?

AUDIENCE: [INAUDIBLE]

PROFESSOR: This is a big X and this is a small x. The small x is just where you plug in your data. The p-value is the probability that you see more evidence against your null than what you already have. OK, so now I can write it in terms of cumulative distribution functions. So this is what? It's basically 2 times Phi of minus square root of n times the absolute value of xn bar, divided by sigma. That's my p-value. If you give me data, I compute the average, plug it in there, and it spits out the p-value. Everybody agrees?

So now, if I start looking back, I say, well, where does this data come from? It could be a random variable; it came as the realization of this thing. So I can think of this value where now this is a random variable, because I just plugged a random variable in here. So now I view my p-value as a random variable. So I keep switching from small x to large X. Everybody agrees with what I'm doing here? I just wrote the p-value as a deterministic function of some deterministic number, and now the function stays deterministic but the number becomes random. And so I can think of this as some statistic of my data, and I could say, well, what is the distribution of this random variable?
558 00:27:26,560 --> 00:27:29,480 Now if my data is actually normally distributed, 559 00:27:29,480 --> 00:27:31,810 so I'm actually under the null, so 560 00:27:31,810 --> 00:27:37,570 under the null, that means that Xn bar times square root of n 561 00:27:37,570 --> 00:27:40,947 divided by sigma has what distribution? 562 00:27:48,335 --> 00:27:48,835 Normal? 563 00:27:56,540 --> 00:27:59,260 Well it was sigma, I assume I knew it. 564 00:27:59,260 --> 00:28:00,520 So it's N 0, 1, right? 565 00:28:00,520 --> 00:28:02,080 I divided by sigma here. 566 00:28:02,080 --> 00:28:03,010 OK? 567 00:28:03,010 --> 00:28:04,500 So now I have this random variable. 568 00:28:15,880 --> 00:28:24,012 And so my random variable is now 2 phi of minus absolute value 569 00:28:24,012 --> 00:28:24,595 of a Gaussian. 570 00:28:34,430 --> 00:28:40,300 And I'm actually interested in the distribution of this thing. 571 00:28:40,300 --> 00:28:41,620 I could ask that. 572 00:28:41,620 --> 00:28:43,150 Anybody has an idea of how you would 573 00:28:43,150 --> 00:28:45,017 want to tackle this thing? 574 00:28:45,017 --> 00:28:46,600 If I ask you, what is the distribution 575 00:28:46,600 --> 00:28:48,930 of a random variable, how do you tackle this question? 576 00:28:53,120 --> 00:28:54,360 There's basically two ways. 577 00:28:54,360 --> 00:28:55,880 One is to try to find something that 578 00:28:55,880 --> 00:29:02,090 looks like the expectation of h of x for all h. 579 00:29:02,090 --> 00:29:04,790 And you try to write this using change of variables 580 00:29:04,790 --> 00:29:09,260 and something that looks like integral of h of x p of x dx. 581 00:29:09,260 --> 00:29:12,540 And then you say, well, that's the density. 582 00:29:12,540 --> 00:29:15,290 If you can read this for any h, then that's 583 00:29:15,290 --> 00:29:16,970 the way you would do it. 584 00:29:16,970 --> 00:29:19,160 But there's a simpler way that does not 585 00:29:19,160 --> 00:29:21,805 involve changing variables, et cetera, 586 00:29:21,805 --> 00:29:23,930 you just try to compute the cumulative distribution 587 00:29:23,930 --> 00:29:25,250 function. 588 00:29:25,250 --> 00:29:26,900 So let's try to compute the probability 589 00:29:26,900 --> 00:29:34,850 that 2 phi minus N 0, 1, is less than t. 590 00:29:34,850 --> 00:29:38,130 And maybe we can find something we know. 591 00:29:38,130 --> 00:29:38,630 OK. 592 00:29:38,630 --> 00:29:39,713 Well that's equal to what? 593 00:29:39,713 --> 00:29:43,040 That's the probability that a minus N 0, 594 00:29:43,040 --> 00:29:45,968 well let's say that an N 0, 1-- 595 00:29:45,968 --> 00:29:57,590 sorry, N 0, 1 absolute value is greater than minus phi inverse 596 00:29:57,590 --> 00:29:58,600 of t over 2. 597 00:30:04,170 --> 00:30:05,477 And that's what? 598 00:30:05,477 --> 00:30:07,560 Well, it's just the same thing that we had before. 599 00:30:07,560 --> 00:30:12,990 It's equal to-- so if I look again, 600 00:30:12,990 --> 00:30:15,840 this is the probability that I'm actually on this side 601 00:30:15,840 --> 00:30:17,550 or that side of this number. 602 00:30:17,550 --> 00:30:18,650 And this number is what? 603 00:30:18,650 --> 00:30:25,840 It's minus phi of t over 2. 604 00:30:25,840 --> 00:30:27,080 Why do I have a minus here? 605 00:30:32,230 --> 00:30:33,880 That's fine, OK. 606 00:30:33,880 --> 00:30:36,220 So it's actually not this, it's actually the probability 607 00:30:36,220 --> 00:30:39,830 that my absolute value-- 608 00:30:39,830 --> 00:30:41,340 oh, because phi inverse. 
OK. Because Phi inverse is-- so I'm going to look at t between 0 and 1. Well, why between 0 and 1? The probability that something is less than t should range over the values that this random variable takes.

AUDIENCE: Negative absolute value is always less than [INAUDIBLE].

PROFESSOR: Yeah, you're right, thank you. Minus the absolute value is always a number less than or equal to 0, so the probability that the Gaussian is less than this number is always less than the probability that it's less than 0, which is 1/2, so 2 Phi of minus the absolute value takes values between 0 and 1, and t only has to be between 0 and 1. Thank you.

And so now, for t between 0 and 1, this number minus Phi inverse of t over 2 is positive, for the same reason as before. And so what is that probability? That's just, basically, 2 times Phi of Phi inverse of t over 2. That's just playing with the symmetry a little bit; you can look at the areas under the curve. And Phi of Phi inverse is the identity, so those two cancel, and this is equal to t. So which distribution has a cumulative distribution function which is equal to t for t between 0 and 1? That's the uniform distribution, right? So it means that this quantity follows a uniform distribution on the interval [0, 1].

And you can actually check that for any test you're going to come up with, this is going to be the case: your p-value under the null will have a uniform distribution. So now if somebody shows up and says, here's my test, it's awesome, it just works great, I'm not going to explain to you how I built it, it's a complicated statistic that involves moments of order 27; then I'm like, OK, you know, how am I going to check that your test statistic actually makes sense?
Well, one thing I can do is draw a bunch of samples, compute your test statistic, compute the p-value each time, and check whether my p-values have a uniform distribution on the interval [0, 1]. But for that I need a test that, given a bunch of observations, can tell me whether they're actually distributed uniformly on the interval [0, 1]. And again, one thing I could do is build a histogram and see if it looks like that of a uniform, but I could also try to be slightly more quantitative about this.

AUDIENCE: Why does the [INAUDIBLE] have to be for a [INAUDIBLE]?

PROFESSOR: For two tests?

AUDIENCE: For each test. Why does the p-value have to be normal? I mean, uniform.

PROFESSOR: It's uniform under the null. Because my test statistic was built under the null, and so I have to be able to plug in the right value in there; otherwise it's going to shift everything for this particular test.

AUDIENCE: At the beginning, your probability was of big Xn, that thing. That thing is the p-value.

PROFESSOR: That's the p-value, right? That's the definition of the p-value.

AUDIENCE: OK.

PROFESSOR: So it's the probability that my test statistic exceeds what I've actually observed.

AUDIENCE: So how you run the test is basically you have your observations and plug them into the cumulative distribution function for a normal, and then see if it falls under the given--

PROFESSOR: Yeah. So my p-value is just this number when I plug in the values that I observe here; that's one number. For every dataset you're going to give me, it's going to be one number. Now what I can do is generate a bunch of datasets of size n, like 200 of them. And then I'm going to have a new sample of, say, 200 p-values. And I want to test whether those p-values have a uniform distribution. OK? Because that's the distribution they should be having under the null.
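Here is a minimal sketch of that simulation, assuming the simple known-sigma test from the board; the sample sizes and the final Kolmogorov-Smirnov check are illustrative choices, not something prescribed in the lecture.

```python
# Simulate the p-value of the two-sided test of H0: mu = 0 (sigma known)
# over many datasets drawn under the null, and check uniformity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_datasets, n, sigma = 200, 50, 1.0

pvals = np.empty(n_datasets)
for i in range(n_datasets):
    x = rng.normal(loc=0.0, scale=sigma, size=n)      # data under H0
    # p-value = 2 * Phi(-sqrt(n) * |xbar| / sigma), as derived on the board
    pvals[i] = 2 * stats.norm.cdf(-np.sqrt(n) * abs(x.mean()) / sigma)

# One quantitative check of uniformity (previewing goodness of fit):
print(stats.kstest(pvals, "uniform"))   # large p-value: consistent with U(0, 1)
```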
All right? OK. This one we've already seen: does X have a PMF with 30%, 50%, and 20%? That's something I could try to test. That looks like your grade point distribution for this class. Well, not exactly, but it looks like it.

So all these things are known as goodness of fit tests. With a goodness of fit test, you want to know whether the data you have at hand follows the hypothesized distribution. So it's not a parametric test. It's not a test that says, is my mean equal to 25 or not, or is my proportion of heads larger than 1/2 or not. It's a test that says, is my distribution this particular thing. So I'm going to write them as goodness of fit, G-O-F here. You don't need parametric modeling to do that.

So how does it work? If I don't have any parametric modeling, I need something which is somewhat non-parametric, something that goes beyond computing the mean and the standard deviation, something that captures some intrinsic non-parametric aspect of my data. And just as in the computation we made here, where we checked that the CDF of my p-value is that of a uniform to conclude it's uniform, the cumulative distribution function has the property that it captures the entire distribution. Everything I need to know about my distribution is captured by the cumulative distribution function.

Now I have a data-driven way of computing an estimate for the cumulative distribution function, using the old statistical trick which consists of replacing expectations by averages. So as I said, the cumulative distribution function for any random variable is: F of t is the probability that X is less than or equal to t, which is equal to the expectation of the indicator that X is less than or equal to t. That's the definition of a probability.
And so here I'm just going to replace the expectation by the average; that's my usual statistical trick. And so my estimator Fn of t is 1 over n times the sum from i equal 1 to n of the indicators that Xi is less than or equal to t. This is called the empirical CDF. It's just the data version of the CDF: I replaced the expectation by an average.

Now, when I sum indicators, I'm actually counting the number of them that satisfy something. So this sum is the number of Xi's that are less than or equal to t, and when I divide by n, it's the proportion of observations that are less than or equal to t. That's what the empirical CDF is; that's what's written here, the number of data points that are less than or equal to t, over n. And so this is something that's trying to estimate the true CDF. And the law of large numbers actually tells me that, for any given t, if n is large enough, Fn of t should be close to F of t, because it's an average. This entire statistical trick, replacing expectations by averages, is justified by the law of large numbers; every time we used it, it was because the law of large numbers guaranteed that the average was close to the expectation.

OK. So the law of large numbers, the strong law, tells me that Fn of t converges almost surely to F of t. And that's just for any given t. Is there any question about this? That averages converge to expectations, that's the law of large numbers. And instead of almost surely we could say in probability; that would be the weak law of large numbers.

Now this is fine: for any given t, the average converges to the truth. It just happens that this random variable is indexed by t, and I could do it for t equals 1 or 2 or 25, and just check it again.
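Here is a minimal sketch of the empirical CDF and its pointwise convergence, on simulated standard Gaussian data; the sample sizes and the evaluation point t are illustrative.

```python
# Empirical CDF: Fn(t) = (1/n) * #{i : Xi <= t}, compared to the true CDF
# at a fixed point t, for growing n (pointwise law of large numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
t = 1.0
for n in (10, 100, 10_000):
    x = rng.normal(size=n)                 # simulated N(0, 1) sample
    Fn_t = np.mean(x <= t)                 # empirical CDF evaluated at t
    print(f"n = {n:6d}: Fn(t) = {Fn_t:.4f}, F(t) = {stats.norm.cdf(t):.4f}")
```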
799 00:40:19,770 --> 00:40:21,960 That's called a uniform result. I 800 00:40:21,960 --> 00:40:25,200 want this to hold for all t at the same time. 801 00:40:25,200 --> 00:40:28,720 And it may be the case that it works for each t individually 802 00:40:28,720 --> 00:40:31,200 but not for all t's at the same time. 803 00:40:31,200 --> 00:40:33,330 What could happen is that for t equals 1 804 00:40:33,330 --> 00:40:36,326 it converges at a certain rate, and for t equals 2 805 00:40:36,326 --> 00:40:37,950 it converges at a bit of a slower rate, 806 00:40:37,950 --> 00:40:41,010 and for t equals 3 at a slower rate and slower rate. 807 00:40:41,010 --> 00:40:43,770 And so as t goes to infinity, the rate is going to vanish 808 00:40:43,770 --> 00:40:45,360 and nothing is going to converge. 809 00:40:45,360 --> 00:40:46,290 That could happen. 810 00:40:46,290 --> 00:40:48,600 I could make this happen at a finite point. 811 00:40:48,600 --> 00:40:50,850 There are many ways I could make this happen. 812 00:40:50,850 --> 00:40:52,780 Let's see how that could work. 813 00:40:52,780 --> 00:40:54,780 I could say, well, actually no. 814 00:40:54,780 --> 00:40:59,115 I still need to have this at infinity for some reason. 815 00:40:59,115 --> 00:41:01,686 It turns out that this is still true uniformly, 816 00:41:01,686 --> 00:41:03,810 and this is actually a much more complicated result 817 00:41:03,810 --> 00:41:05,340 than the law of large numbers. 818 00:41:05,340 --> 00:41:07,740 It's called the Glivenko-Cantelli Theorem. 819 00:41:07,740 --> 00:41:09,270 And the Glivenko-Cantelli Theorem 820 00:41:09,270 --> 00:41:14,960 tells me that, for all t's at once, Fn converges to F. 821 00:41:14,960 --> 00:41:18,230 So let me just show you quickly why 822 00:41:18,230 --> 00:41:22,040 this is just a little bit stronger than the one 823 00:41:22,040 --> 00:41:25,900 that we had. 824 00:41:25,900 --> 00:41:29,120 If sup is confusing you, think of max. 825 00:41:29,120 --> 00:41:31,880 It's just the max over an infinite set. 826 00:41:31,880 --> 00:41:40,500 And so what we know is that Fn of t goes to F of t 827 00:41:40,500 --> 00:41:43,270 as n goes to infinity. 828 00:41:43,270 --> 00:41:45,560 And that's almost surely. 829 00:41:45,560 --> 00:41:48,360 And that's the law of large numbers. 830 00:41:48,360 --> 00:41:54,920 Which is equivalent to saying that Fn of t minus F of t as n 831 00:41:54,920 --> 00:41:59,700 goes to infinity converges almost surely to 0, right? 832 00:41:59,700 --> 00:42:01,823 This is the same thing. 833 00:42:01,823 --> 00:42:07,217 Now I want this to happen for all t's at once. 834 00:42:07,217 --> 00:42:09,300 So what I'm going to do-- oh, and this is actually 835 00:42:09,300 --> 00:42:11,174 equivalent to this. 836 00:42:11,174 --> 00:42:12,840 And so what I'm going to do is I'm going 837 00:42:12,840 --> 00:42:14,590 to make it a little stronger. 838 00:42:14,590 --> 00:42:16,990 So here the arrow only goes one way. 839 00:42:16,990 --> 00:42:20,660 And this is where I put the sup over t in R of Fn of t minus F of t. 840 00:42:26,847 --> 00:42:28,930 And you could actually show that this happens also 841 00:42:28,930 --> 00:42:29,513 almost surely. 842 00:42:35,500 --> 00:42:37,650 Now maybe almost surely is a bit more 843 00:42:37,650 --> 00:42:39,210 difficult to get a grasp on.
844 00:42:43,560 --> 00:42:48,630 Does anybody want to see, like why this statement for this sup 845 00:42:48,630 --> 00:42:51,030 is strictly stronger than the one that holds individually 846 00:42:51,030 --> 00:42:52,880 for all t's? 847 00:42:52,880 --> 00:42:54,086 You want to see that? 848 00:42:54,086 --> 00:42:54,960 OK, so let's do that. 849 00:42:54,960 --> 00:42:57,410 So forget about the almost surely for one second. 850 00:42:57,410 --> 00:42:59,660 Let's just do it in probability. 851 00:42:59,660 --> 00:43:09,690 The fact that Fn of t converges to F of t for all t, 852 00:43:09,690 --> 00:43:12,300 in probability means that this goes to 0 as n goes 853 00:43:12,300 --> 00:43:13,860 to infinity for any epsilon. 854 00:43:17,400 --> 00:43:19,150 For any epsilon and t we know we have this. 855 00:43:19,150 --> 00:43:22,529 That's the convergence in probability. 856 00:43:22,529 --> 00:43:24,070 Now what I want is to put a sup here. 857 00:43:28,408 --> 00:43:32,920 The probability that involves the sup 858 00:43:32,920 --> 00:43:38,080 might actually always stay larger than something, and never go to 0 859 00:43:38,080 --> 00:43:39,090 in some cases. 860 00:43:39,090 --> 00:43:42,410 It could be the case that for each given t, 861 00:43:42,410 --> 00:43:46,440 I can make n large enough so that this probability becomes 862 00:43:46,440 --> 00:43:47,640 small. 863 00:43:47,640 --> 00:43:49,950 But then maybe it's an n of t. 864 00:43:49,950 --> 00:43:53,750 So this here means that for any-- 865 00:43:53,750 --> 00:43:56,570 maybe I shouldn't put, let me put a delta here. 866 00:43:56,570 --> 00:44:02,460 So for any delta, for any t and for any epsilon, 867 00:44:02,460 --> 00:44:09,800 there exists n, which could depend on both epsilon 868 00:44:09,800 --> 00:44:15,920 and t, such that the probability of Fn of t 869 00:44:15,920 --> 00:44:25,110 minus F of t exceeding delta is less than epsilon. 870 00:44:25,110 --> 00:44:29,140 There exists an n and a delta. 871 00:44:29,140 --> 00:44:30,600 No, that's for all delta, sorry. 872 00:44:34,810 --> 00:44:36,100 So this is true. 873 00:44:36,100 --> 00:44:40,040 That's what this limit statement actually means. 874 00:44:40,040 --> 00:44:43,060 But it could be the case that now when I take the sup over t, 875 00:44:43,060 --> 00:44:47,380 maybe that n of t is something that looks like t. 876 00:44:50,480 --> 00:44:54,510 Or maybe, well, integer part of t. 877 00:44:54,510 --> 00:44:56,175 It could be, right? 878 00:44:56,175 --> 00:44:57,050 I don't say anything. 879 00:44:57,050 --> 00:44:59,710 It's just an n that depends on t. 880 00:44:59,710 --> 00:45:04,730 So if this n is just t, maybe t over epsilon, 881 00:45:04,730 --> 00:45:05,930 because I want epsilon. 882 00:45:05,930 --> 00:45:07,610 Something like this. 883 00:45:07,610 --> 00:45:09,470 Well that means that if I want this 884 00:45:09,470 --> 00:45:11,510 to hold for all t's at once, I'm going 885 00:45:11,510 --> 00:45:15,980 to have to go for the n that works for all t's at once. 886 00:45:15,980 --> 00:45:19,070 But there's no such n that works for all t's at once. 887 00:45:19,070 --> 00:45:21,830 The only n that works is infinity. 888 00:45:21,830 --> 00:45:24,350 And so I cannot make this happen for all of them. 889 00:45:24,350 --> 00:45:26,420 What Glivenko-Cantelli tells you 890 00:45:26,420 --> 00:45:29,090 is that this is actually not what happens here.
891 00:45:29,090 --> 00:45:33,650 For the n that depends on t, there's actually one largest n 892 00:45:33,650 --> 00:45:37,150 that works for all the t's at once, and that's it. 893 00:45:39,451 --> 00:45:39,950 OK. 894 00:45:39,950 --> 00:45:44,150 So just so you know why this is actually a stronger statement, 895 00:45:44,150 --> 00:45:48,880 and that's basically how it works. 896 00:45:48,880 --> 00:45:50,567 Any other question? 897 00:45:50,567 --> 00:45:51,067 Yeah. 898 00:45:51,067 --> 00:45:53,271 AUDIENCE: So what's the condition for this 899 00:45:53,271 --> 00:45:54,979 to hold? Does the random variable have to have 900 00:45:54,979 --> 00:45:57,179 a finite mean, finite variance? 901 00:45:57,179 --> 00:45:58,664 PROFESSOR: No. 902 00:45:58,664 --> 00:46:00,580 Well the random variable does have finite mean 903 00:46:00,580 --> 00:46:02,580 and finite variance, because the random variable 904 00:46:02,580 --> 00:46:03,580 is an indicator. 905 00:46:03,580 --> 00:46:04,830 So it has everything you want. 906 00:46:04,830 --> 00:46:06,621 This is one of the nicest random variables, 907 00:46:06,621 --> 00:46:08,410 this is a Bernoulli random variable. 908 00:46:08,410 --> 00:46:11,952 So here when I say law of large numbers, that this holds. 909 00:46:11,952 --> 00:46:12,910 Where did I write this? 910 00:46:12,910 --> 00:46:14,140 I think I erased it. 911 00:46:14,140 --> 00:46:15,440 Yeah, the one over there. 912 00:46:15,440 --> 00:46:16,570 This is actually the law of large numbers 913 00:46:16,570 --> 00:46:17,680 for Bernoulli random variables. 914 00:46:17,680 --> 00:46:18,930 They have everything you want. 915 00:46:18,930 --> 00:46:21,320 They're bounded. 916 00:46:21,320 --> 00:46:21,820 Yes. 917 00:46:21,820 --> 00:46:23,989 AUDIENCE: So I'm having trouble understanding 918 00:46:23,989 --> 00:46:25,194 the first statement. 919 00:46:25,194 --> 00:46:27,122 So it says, for all epsilon and all t, 920 00:46:27,122 --> 00:46:29,540 the probability of that-- 921 00:46:29,540 --> 00:46:31,040 PROFESSOR: So you mean this one? 922 00:46:31,040 --> 00:46:31,790 AUDIENCE: Yeah. 923 00:46:31,790 --> 00:46:34,760 PROFESSOR: For all epsilon and all t. 924 00:46:34,760 --> 00:46:36,110 So you fix them now. 925 00:46:36,110 --> 00:46:39,050 Then the probability that, sorry, that was delta. 926 00:46:39,050 --> 00:46:41,694 I changed this epsilon to delta at some point. 927 00:46:41,694 --> 00:46:44,930 AUDIENCE: And then what's the second line? 928 00:46:44,930 --> 00:46:49,860 PROFESSOR: Oh, so then the second line says that, 929 00:46:49,860 --> 00:46:53,330 so I'm just rewriting in terms of epsilon delta 930 00:46:53,330 --> 00:46:56,360 what this n goes to infinity means. 931 00:46:56,360 --> 00:47:01,880 So it means that for any t and delta, 932 00:47:01,880 --> 00:47:04,250 so that's the same as this guy here, 933 00:47:04,250 --> 00:47:06,284 then here I'm just going back to rewriting this. 934 00:47:06,284 --> 00:47:08,450 It says that for any epsilon there exists an n large 935 00:47:08,450 --> 00:47:11,990 enough such that, well, n larger than this thing 936 00:47:11,990 --> 00:47:14,370 basically, such that this thing is less than epsilon. 937 00:47:18,670 --> 00:47:21,420 So Glivenko-Cantelli tells us that not only is this thing 938 00:47:21,420 --> 00:47:25,150 a good idea pointwise, but it's also a good idea uniformly.
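[Editor's note: the uniform convergence that Glivenko-Cantelli promises is easy to watch numerically. Below is a small simulation sketch in Python with NumPy and SciPy; the N(0, 1) sampling distribution, the grid standing in for "all t," and the sample sizes are illustrative choices, not from the lecture.]

```python
import numpy as np
from scipy.stats import norm

# Approximate sup_t |F_n(t) - F(t)| on a fine grid for growing n,
# with X_1, ..., X_n drawn from N(0, 1), so F is the standard normal CDF.
rng = np.random.default_rng(1)
grid = np.linspace(-4.0, 4.0, 2001)
for n in (100, 1_000, 10_000):
    x = np.sort(rng.standard_normal(n))
    Fn = np.searchsorted(x, grid, side="right") / n  # F_n evaluated on the grid
    print(n, np.max(np.abs(Fn - norm.cdf(grid))))    # uniform distance shrinks in n
```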
939 00:47:25,150 --> 00:47:27,690 And all it's saying is if you actually 940 00:47:27,690 --> 00:47:30,300 were happy with just this result, you should 941 00:47:30,300 --> 00:47:32,290 be even happier with that result. 942 00:47:32,290 --> 00:47:34,570 And both of those results only tell you one thing. 943 00:47:34,570 --> 00:47:36,720 They're just telling you that the empirical CDF 944 00:47:36,720 --> 00:47:38,196 is a good estimator of the CDF. 945 00:47:41,600 --> 00:47:47,720 Now since those indicators are Bernoulli random variables, 946 00:47:47,720 --> 00:47:50,390 I can actually do even more. 947 00:47:50,390 --> 00:47:52,130 So let me get this guy here. 948 00:48:00,240 --> 00:48:14,220 OK so, those guys inside Fn of t, each indicator here, 949 00:48:14,220 --> 00:48:16,530 is a Bernoulli random variable. 950 00:48:16,530 --> 00:48:20,494 What is the parameter of this Bernoulli distribution? 951 00:48:20,494 --> 00:48:22,410 What is the probability that it takes value 1? 952 00:48:26,250 --> 00:48:26,989 AUDIENCE: F of t. 953 00:48:26,989 --> 00:48:28,030 PROFESSOR: F of t, right? 954 00:48:28,030 --> 00:48:30,484 It's just the probability that this thing happens, 955 00:48:30,484 --> 00:48:31,150 which is F of t. 956 00:48:34,020 --> 00:48:40,650 So in particular the variance of this guy 957 00:48:40,650 --> 00:48:42,540 is the variance of this Bernoulli. 958 00:48:42,540 --> 00:48:46,730 So it's F of t times 1 minus F of t. 959 00:48:46,730 --> 00:48:50,230 And I can use that in my Central Limit Theorem. 960 00:48:50,230 --> 00:48:51,760 And the Central Limit Theorem is just 961 00:48:51,760 --> 00:48:53,890 going to tell me that if I look at the average 962 00:48:53,890 --> 00:48:56,530 of random variables, I remove their mean, 963 00:48:56,530 --> 00:49:01,000 so I look at square root of n Fn of t, 964 00:49:01,000 --> 00:49:04,380 which I could really write as xn bar, right? 965 00:49:04,380 --> 00:49:06,730 That's really just an xn bar. 966 00:49:06,730 --> 00:49:08,320 Minus the expectation, which is F 967 00:49:08,320 --> 00:49:11,290 of t, that comes from this guy. 968 00:49:11,290 --> 00:49:16,000 Now if I divide by square root of the variance, that's 969 00:49:16,000 --> 00:49:18,950 my square root of p times 1 minus p. 970 00:49:18,950 --> 00:49:22,540 Then this guy, by the Central Limit Theorem, 971 00:49:22,540 --> 00:49:23,890 goes to some N 0, 1. 972 00:49:27,032 --> 00:49:28,740 Which is the same thing as you see there, 973 00:49:28,740 --> 00:49:30,865 except that the variance was put on the other side. 974 00:49:34,850 --> 00:49:36,170 OK. 975 00:49:36,170 --> 00:49:42,422 Do I have the same thing uniformly in t? 976 00:49:46,110 --> 00:49:48,630 Can I write something that holds uniformly in t? 977 00:49:48,630 --> 00:49:50,940 Well, if you think about it for one second 978 00:49:50,940 --> 00:49:53,520 it's unlikely it's going to go too well. 979 00:49:53,520 --> 00:49:55,650 In the sense that it's unlikely that the supremum 980 00:49:55,650 --> 00:49:58,530 of those random variables over t is going to also be a Gaussian. 981 00:50:02,800 --> 00:50:08,120 And the reason is that, well actually the reason 982 00:50:08,120 --> 00:50:10,790 is that this thing is actually a stochastic process indexed 983 00:50:10,790 --> 00:50:11,460 by t. 984 00:50:11,460 --> 00:50:14,870 A stochastic process is just a collection of random variables 985 00:50:14,870 --> 00:50:17,060 that's indexed by, let's say time.
986 00:50:17,060 --> 00:50:20,030 The one that's the most famous is Brownian motion, 987 00:50:20,030 --> 00:50:24,440 and it's basically a bunch of Gaussian increments. 988 00:50:24,440 --> 00:50:27,170 So when you go from t to just t a little after that, 989 00:50:27,170 --> 00:50:30,180 you add some Gaussian into the thing. 990 00:50:30,180 --> 00:50:33,770 And here it's basically the same thing that's happening. 991 00:50:33,770 --> 00:50:35,970 And you would sort of expect, since each of these guys 992 00:50:35,970 --> 00:50:37,610 is Gaussian, you would expect to see 993 00:50:37,610 --> 00:50:40,005 something that looks like a Brownian motion at the end. 994 00:50:40,005 --> 00:50:41,630 But it's not exactly a Brownian motion, 995 00:50:41,630 --> 00:50:43,671 it's something that's called the Brownian bridge. 996 00:50:43,671 --> 00:50:45,920 So if you've seen the Brownian motion, if I make 997 00:50:45,920 --> 00:50:49,160 it start at 0 for example, so this is the value 998 00:50:49,160 --> 00:50:50,280 of my Brownian motion. 999 00:50:50,280 --> 00:50:52,110 Let's write it. 1000 00:50:52,110 --> 00:50:56,370 So this is one path, one realization of Brownian motion. 1001 00:50:56,370 --> 00:50:59,350 Let's call it w of t as t increases. 1002 00:50:59,350 --> 00:51:04,430 So let's say it starts at 0 and looks like something like this. 1003 00:51:04,430 --> 00:51:06,540 So that's what Brownian motion looks like. 1004 00:51:06,540 --> 00:51:11,010 It's just something that's pretty nasty. 1005 00:51:11,010 --> 00:51:13,710 I mean it looks pretty nasty, it's not differentiable, et cetera, 1006 00:51:13,710 --> 00:51:19,110 but it's actually very benign in some average way. 1007 00:51:19,110 --> 00:51:21,150 So Brownian motion is just something, 1008 00:51:21,150 --> 00:51:25,820 you should view this as if I sum some random variables that 1009 00:51:25,820 --> 00:51:29,520 are Gaussian, and then I look at this from farther and farther, 1010 00:51:29,520 --> 00:51:31,980 it's going to look like this. 1011 00:51:31,980 --> 00:51:34,750 And so here I cannot have a Brownian motion in the end, 1012 00:51:34,750 --> 00:51:40,030 because what is the variance of Fn of t minus F of t at t is 1013 00:51:40,030 --> 00:51:40,673 equal to 1? 1014 00:51:43,780 --> 00:51:47,890 Sorry, at t is equal to infinity. 1015 00:51:47,890 --> 00:51:48,390 AUDIENCE: 0. 1016 00:51:48,390 --> 00:51:49,460 PROFESSOR: It's 0, right? 1017 00:51:49,460 --> 00:51:52,100 The variance goes from 0 at t is negative infinity, 1018 00:51:52,100 --> 00:51:56,940 because at negative infinity F of t is going to 0. 1019 00:51:56,940 --> 00:51:59,850 And as t goes to plus infinity, F of t 1020 00:51:59,850 --> 00:52:03,590 is going to 1, which means that the variance of this guy as t 1021 00:52:03,590 --> 00:52:06,320 goes from negative infinity to plus infinity 1022 00:52:06,320 --> 00:52:09,990 is pinned to be 0 on each side. 1023 00:52:09,990 --> 00:52:12,200 And so my Brownian motion cannot, 1024 00:52:12,200 --> 00:52:14,630 when I describe a Brownian motion I'm just adding more 1025 00:52:14,630 --> 00:52:16,880 and more entropy to the thing and it's going all over 1026 00:52:16,880 --> 00:52:20,450 the place, but here what I want is that as I go back it should 1027 00:52:20,450 --> 00:52:21,920 go back to essentially 0. 1028 00:52:21,920 --> 00:52:25,280 It should be pinned down to a specific value at the end. 1029 00:52:25,280 --> 00:52:27,322 And that's actually called the Brownian bridge.
1030 00:52:27,322 --> 00:52:29,030 It's a Brownian motion that's conditioned 1031 00:52:29,030 --> 00:52:32,546 to come back to where it started essentially. 1032 00:52:32,546 --> 00:52:35,170 Now you don't need to understand Brownian bridges to understand 1033 00:52:35,170 --> 00:52:36,780 what I'm going to be telling you. 1034 00:52:36,780 --> 00:52:39,040 The only thing I want to communicate to you 1035 00:52:39,040 --> 00:52:42,720 is that this guy here, when I say a Brownian bridge, 1036 00:52:42,720 --> 00:52:45,010 I can go to any probabilist and they can tell you 1037 00:52:45,010 --> 00:52:51,167 all the probability properties of this stochastic process. 1038 00:52:51,167 --> 00:52:52,750 They can tell me the probability that it 1039 00:52:52,750 --> 00:52:55,120 takes any value at any point. 1040 00:52:55,120 --> 00:52:57,370 In particular, they can tell me-- 1041 00:52:57,370 --> 00:53:01,040 the supremum between 0 and 1 of this guy, 1042 00:53:01,040 --> 00:53:03,230 they could tell me what the cumulative distribution 1043 00:53:03,230 --> 00:53:04,813 function of this thing is, can tell me 1044 00:53:04,813 --> 00:53:07,585 what the density of this thing is, can tell me everything. 1045 00:53:07,585 --> 00:53:09,710 So it means that if I want to compute probabilities 1046 00:53:09,710 --> 00:53:14,210 on this object here, which is the maximum value that this guy 1047 00:53:14,210 --> 00:53:17,565 can take over a certain period of time, which is basically 1048 00:53:17,565 --> 00:53:18,440 this random variable. 1049 00:53:18,440 --> 00:53:20,390 So if I look at the value here, it's 1050 00:53:20,390 --> 00:53:22,310 a random variable that fluctuates. 1051 00:53:22,310 --> 00:53:25,160 They can tell me where it is with high probability, can tell me 1052 00:53:25,160 --> 00:53:28,790 the quantiles of this thing, which is useful 1053 00:53:28,790 --> 00:53:31,440 because I can build a table and use it to compute my quantiles 1054 00:53:31,440 --> 00:53:34,480 and form tests from it. 1055 00:53:34,480 --> 00:53:36,100 So that's what actually is quite nice. 1056 00:53:36,100 --> 00:53:38,170 It says that if I look at the sup over t of square root of n 1057 00:53:38,170 --> 00:53:40,999 times Fn hat minus F, I get something 1058 00:53:40,999 --> 00:53:42,790 that looks like the sup of these Gaussians, 1059 00:53:42,790 --> 00:53:44,290 but it's not really the sup of Gaussians, 1060 00:53:44,290 --> 00:53:46,040 it's the sup of a Brownian bridge. 1061 00:53:46,040 --> 00:53:48,290 Now there's something you should be very careful about here. 1062 00:53:48,290 --> 00:53:49,210 I cheated a little bit. 1063 00:53:49,210 --> 00:53:51,251 I mean, I didn't cheat, I can do whatever I want. 1064 00:53:51,251 --> 00:53:55,730 But my notation might be a little confusing. 1065 00:53:55,730 --> 00:54:01,870 Everybody sees that this t here is not the same as this t here? 1066 00:54:01,870 --> 00:54:03,140 Can somebody see that? 1067 00:54:03,140 --> 00:54:05,690 Just because, first of all, this guy's between 0 and 1. 1068 00:54:05,690 --> 00:54:09,550 And this guy is in all of R. 1069 00:54:09,550 --> 00:54:12,760 What is this t here? 1070 00:54:12,760 --> 00:54:14,040 As a function of this t here? 1071 00:54:21,270 --> 00:54:23,770 This guy is F of this guy.
1072 00:54:23,770 --> 00:54:27,790 So really, if I want it to be completely transparent 1073 00:54:27,790 --> 00:54:32,750 and not try to save the keys of my keyboard, 1074 00:54:32,750 --> 00:54:42,460 I would read this as the sup over t of square root of n times Fn of t minus F of t 1075 00:54:42,460 --> 00:54:46,430 converging in distribution as n goes to infinity. 1076 00:54:46,430 --> 00:54:50,440 The supremum over t, again in R, so this guy is 1077 00:54:50,440 --> 00:54:52,655 for t in the entire real line, this guy 1078 00:54:52,655 --> 00:54:54,620 is for t in the entire real line. 1079 00:54:54,620 --> 00:54:58,440 But now I should write B of what? 1080 00:54:58,440 --> 00:55:00,710 F of t, exactly. 1081 00:55:00,710 --> 00:55:04,150 So really the t here is F of the original one. 1082 00:55:04,150 --> 00:55:06,570 And so that's a Brownian bridge, where, 1083 00:55:06,570 --> 00:55:09,870 as t ranges over the real line, F of t 1084 00:55:09,870 --> 00:55:11,670 goes from 0 to 1 and it looks like this. 1085 00:55:11,670 --> 00:55:16,100 A Brownian bridge at 0 is 0, at 1 it's 0. 1086 00:55:16,100 --> 00:55:18,470 And it does this. 1087 00:55:18,470 --> 00:55:20,580 But it doesn't stray too far because I condition 1088 00:55:20,580 --> 00:55:22,860 it to come back to this point. 1089 00:55:22,860 --> 00:55:26,600 That's what a Brownian bridge is. 1090 00:55:26,600 --> 00:55:28,450 OK. 1091 00:55:28,450 --> 00:55:33,527 So in particular, I can find a distribution for this guy. 1092 00:55:33,527 --> 00:55:35,610 And I can use this to build a test which is called 1093 00:55:35,610 --> 00:55:37,120 the Kolmogorov-Smirnov test. 1094 00:55:39,810 --> 00:55:40,895 The idea is the following. 1095 00:55:40,895 --> 00:55:44,875 It says, if I want to test some distribution 1096 00:55:44,875 --> 00:55:49,650 F0, some distribution that has a particular CDF F0, 1097 00:55:49,650 --> 00:55:52,360 and I plug it in under the null, then 1098 00:55:52,360 --> 00:55:55,420 this guy should have pretty much the same distribution 1099 00:55:55,420 --> 00:55:58,090 as the supremum of a Brownian bridge. 1100 00:55:58,090 --> 00:56:00,790 And so if I see this to be much larger than it should 1101 00:56:00,790 --> 00:56:02,980 be when it's the supremum of a Brownian bridge, 1102 00:56:02,980 --> 00:56:05,020 I'm actually going to reject my hypothesis. 1103 00:56:08,270 --> 00:56:09,290 So here's the test. 1104 00:56:09,290 --> 00:56:17,100 I want to test whether H0, F is equal to F0, 1105 00:56:17,100 --> 00:56:22,850 and you will see that most of the goodness of fit tests 1106 00:56:22,850 --> 00:56:24,950 are formulated mathematically in terms 1107 00:56:24,950 --> 00:56:26,960 of the cumulative distribution function. 1108 00:56:26,960 --> 00:56:29,600 I could formulate them in terms of probability density 1109 00:56:29,600 --> 00:56:33,270 functions, or just write x follows N 0, 1, 1110 00:56:33,270 --> 00:56:34,950 but that's the way we write it. 1111 00:56:34,950 --> 00:56:37,880 We formulate them in terms of cumulative distribution 1112 00:56:37,880 --> 00:56:39,650 function because that's what we have 1113 00:56:39,650 --> 00:56:42,320 a handle on through the empirical cumulative 1114 00:56:42,320 --> 00:56:44,330 distribution function. 1115 00:56:44,330 --> 00:56:50,300 And then it's versus H1, F is not equal to F0. 1116 00:56:50,300 --> 00:56:52,370 So now I have my empirical CDF. 1117 00:56:52,370 --> 00:56:54,650 And I hope that for all t's, Fn of t 1118 00:56:54,650 --> 00:56:57,900 should be close to F0 of t.
1119 00:56:57,900 --> 00:57:00,330 Let me write it like this. 1120 00:57:00,330 --> 00:57:03,740 I put it on the exponent because otherwise that 1121 00:57:03,740 --> 00:57:06,650 would be the empirical distribution function based 1122 00:57:06,650 --> 00:57:07,970 on zero observations. 1123 00:57:11,060 --> 00:57:14,011 Now I form the following test statistic. 1124 00:57:21,450 --> 00:57:24,280 So my test statistic is tn, which 1125 00:57:24,280 --> 00:57:28,120 is the supremum over t in the real line of square root 1126 00:57:28,120 --> 00:57:34,494 of n Fn of t minus F of t, sorry, F0 of t. 1127 00:57:34,494 --> 00:57:35,660 So I can compute everything. 1128 00:57:35,660 --> 00:57:37,450 I know this from the data, and this 1129 00:57:37,450 --> 00:57:39,930 is the one that comes from my null hypothesis. 1130 00:57:39,930 --> 00:57:41,939 So I can compute this thing. 1131 00:57:41,939 --> 00:57:43,480 And I know that if this is true, this 1132 00:57:43,480 --> 00:57:46,180 should actually be the supremum of a Brownian bridge. 1133 00:57:46,180 --> 00:57:48,940 Pretty much. 1134 00:57:48,940 --> 00:58:01,620 And so the Kolmogorov-Smirnov test is simply, 1135 00:58:01,620 --> 00:58:09,080 reject if this guy, tn, in absolute value, 1136 00:58:09,080 --> 00:58:10,690 no actually not in absolute value. 1137 00:58:10,690 --> 00:58:13,590 This is just already absolute valued. 1138 00:58:13,590 --> 00:58:14,960 Then this guy should be what? 1139 00:58:14,960 --> 00:58:20,580 It should be larger than the quantile q alpha over 2 1140 00:58:20,580 --> 00:58:21,540 that I have. 1141 00:58:21,540 --> 00:58:24,870 But now rather than putting an N 0, 1, or a tn, 1142 00:58:24,870 --> 00:58:30,016 this here is whatever notation I have for the supremum 1143 00:58:30,016 --> 00:58:31,952 of a Brownian bridge. 1144 00:58:40,860 --> 00:58:43,710 Just like I did for any pivotal distribution. 1145 00:58:43,710 --> 00:58:45,900 That was the same recipe every single time. 1146 00:58:45,900 --> 00:58:47,970 I formed the test statistic such that 1147 00:58:47,970 --> 00:58:51,330 the asymptotic distribution did not depend on anything I don't know, 1148 00:58:51,330 --> 00:58:54,300 and then I would just reject when this pivotal statistic 1149 00:58:54,300 --> 00:58:56,080 was larger than something. 1150 00:58:56,080 --> 00:58:56,845 Yes? 1151 00:58:56,845 --> 00:58:59,635 AUDIENCE: I'm not really sure why Brownian bridge appears. 1152 00:59:02,900 --> 00:59:05,542 PROFESSOR: Do you know what a Brownian bridge is, or? 1153 00:59:05,542 --> 00:59:06,500 AUDIENCE: Only vaguely. 1154 00:59:06,500 --> 00:59:07,100 PROFESSOR: OK. 1155 00:59:07,100 --> 00:59:14,320 So this thing here, think of it as being a Gaussian. 1156 00:59:14,320 --> 00:59:18,110 So for all t you have a Gaussian distribution. 1157 00:59:18,110 --> 00:59:27,270 Now a Brownian motion, so if I had a Brownian motion 1158 00:59:27,270 --> 00:59:28,770 I need to tell you what the-- 1159 00:59:28,770 --> 00:59:30,300 so it's basically, a Brownian motion 1160 00:59:30,300 --> 00:59:31,730 is something that looks like this. 1161 00:59:31,730 --> 00:59:34,180 It's some random variable that's indexed by t. 1162 00:59:34,180 --> 00:59:38,610 I want, say, the expectation of Xt to be equal to 0 1163 00:59:38,610 --> 00:59:40,640 for all t. 1164 00:59:40,640 --> 00:59:42,960 And what I want is that the increments 1165 00:59:42,960 --> 00:59:44,640 have a certain distribution.
1166 00:59:44,640 --> 00:59:53,700 So what I want is that Xt minus Xs 1167 00:59:53,700 --> 00:59:58,050 follows some distribution which is N 0, t minus s. 1168 00:59:58,050 --> 01:00:00,750 So the increments are bigger as I go farther, 1169 01:00:00,750 --> 01:00:02,580 in terms of variability. 1170 01:00:02,580 --> 01:00:05,880 And I also want some covariance structure between the two. 1171 01:00:05,880 --> 01:00:10,320 So what I want is that the covariance between Xs and Xt 1172 01:00:10,320 --> 01:00:12,750 is actually equal to the minimum of s and t. 1173 01:00:18,520 --> 01:00:21,660 Yeah, maybe. 1174 01:00:21,660 --> 01:00:23,220 Yeah, that should be there. 1175 01:00:23,220 --> 01:00:26,040 So this is, you open a probability book, that's 1176 01:00:26,040 --> 01:00:27,370 what it's going to look like. 1177 01:00:27,370 --> 01:00:31,710 So in particular, you can see, if I put 0 here 1178 01:00:31,710 --> 01:00:34,390 and X0 is equal to 0, it has 0 variance. 1179 01:00:34,390 --> 01:00:38,180 So in particular, it means that Xt, 1180 01:00:38,180 --> 01:00:39,920 if I look only at the t-th one, it 1181 01:00:39,920 --> 01:00:43,110 has some normal distribution with variance t. 1182 01:00:43,110 --> 01:00:46,050 So this is something that just blows up. 1183 01:00:46,050 --> 01:00:49,230 So this guy here looks like it's going 1184 01:00:49,230 --> 01:00:50,730 to be a Brownian motion because when 1185 01:00:50,730 --> 01:00:53,700 I look at the left-hand side it has a normal distribution. 1186 01:00:53,700 --> 01:00:55,950 Now there's a bunch of other things you need to check. 1187 01:00:55,950 --> 01:00:58,325 It's the fact that you have this covariance, for example, 1188 01:00:58,325 --> 01:01:00,090 which I did not tell you. 1189 01:01:00,090 --> 01:01:03,300 But it sure looks somewhat like that. 1190 01:01:03,300 --> 01:01:07,590 And in particular, when I look at the normal with mean 0 1191 01:01:07,590 --> 01:01:10,440 and variance here, then it's clear 1192 01:01:10,440 --> 01:01:12,420 that this guy does not have a variance that's 1193 01:01:12,420 --> 01:01:16,560 going to go to infinity just like the variance of this guy. 1194 01:01:16,560 --> 01:01:21,620 We know that the variance is forced to be back to 0. 1195 01:01:21,620 --> 01:01:23,290 And so in particular we have something 1196 01:01:23,290 --> 01:01:28,270 that has mean 0 always, whose variance has to be 0 at 0-- 1197 01:01:28,270 --> 01:01:31,870 sorry, at t equals negative infinity, 1198 01:01:31,870 --> 01:01:34,890 and, since F of t goes to 1 at t equals plus infinity, 1199 01:01:34,890 --> 01:01:36,920 a variance 0 at t equals plus infinity as well, 1200 01:01:36,920 --> 01:01:40,420 and so I have to basically force it to be equal to 0 at each end. 1201 01:01:40,420 --> 01:01:42,400 So the Brownian motion here tends 1202 01:01:42,400 --> 01:01:44,830 to just go to infinity somewhere, 1203 01:01:44,830 --> 01:01:47,110 whereas this guy forces it to come back. 1204 01:01:47,110 --> 01:01:48,700 Now everything I described to you 1205 01:01:48,700 --> 01:01:52,720 is on the scale negative infinity to plus infinity, 1206 01:01:52,720 --> 01:01:56,620 but since everything depends on F of t, 1207 01:01:56,620 --> 01:01:58,360 I can actually just put that back 1208 01:01:58,360 --> 01:02:02,300 onto a scale, which is 0 to 1, by a simple change of variable. 1209 01:02:02,300 --> 01:02:06,814 It's called change of time for the Brownian motion. 1210 01:02:06,814 --> 01:02:07,314 OK?
1211 01:02:07,314 --> 01:02:08,302 Yeah. 1212 01:02:08,302 --> 01:02:09,784 AUDIENCE: So does a Brownian bridge 1213 01:02:09,784 --> 01:02:13,242 have a variance at each point that's proportional? 1214 01:02:13,242 --> 01:02:15,382 Like it starts at 0 variance and then 1215 01:02:15,382 --> 01:02:17,688 goes to 1/4 variance in the middle 1216 01:02:17,688 --> 01:02:21,146 and then goes back to 0 variance? 1217 01:02:21,146 --> 01:02:23,924 Like in the same parabolic shape? 1218 01:02:23,924 --> 01:02:24,590 PROFESSOR: Yeah. 1219 01:02:24,590 --> 01:02:26,180 I mean, definitely. 1220 01:02:26,180 --> 01:02:29,873 I mean by symmetry you can probably infer all the things. 1221 01:02:29,873 --> 01:02:31,706 AUDIENCE: Well I can imagine Brownian bridge 1222 01:02:31,706 --> 01:02:34,904 with a variance that starts at 0 and stays, like, 1223 01:02:34,904 --> 01:02:38,809 the shape of the variance as you move along. 1224 01:02:38,809 --> 01:02:40,600 PROFESSOR: Yeah, so I don't know if-- there 1225 01:02:40,600 --> 01:02:43,205 is an explicit formula for this, and it's simple. 1226 01:02:43,205 --> 01:02:45,830 That's what I can tell you, but I don't know, 1227 01:02:45,830 --> 01:02:47,756 off the top of my head, what the explicit formula is. 1228 01:02:47,756 --> 01:02:49,708 AUDIENCE: But would it have to match this F 1229 01:02:49,708 --> 01:02:53,112 of t times 1 minus F of t structure? 1230 01:02:53,112 --> 01:02:53,612 Or not? 1231 01:02:53,612 --> 01:02:54,278 PROFESSOR: Yeah. 1232 01:02:56,052 --> 01:02:58,510 AUDIENCE: Or does the fact that we're taking the supremum-- 1233 01:02:58,510 --> 01:02:59,700 PROFESSOR: No. 1234 01:02:59,700 --> 01:03:03,390 Well the Brownian bridge, this is the supremum-- you're right. 1235 01:03:03,390 --> 01:03:06,700 So this will be this form for the variance for sure, 1236 01:03:06,700 --> 01:03:08,700 because this is only marginal distributions that 1237 01:03:08,700 --> 01:03:10,920 don't take-- right, the process is not just 1238 01:03:10,920 --> 01:03:13,360 what is the distribution at each instant t. 1239 01:03:13,360 --> 01:03:15,990 It's also how do those distributions interact 1240 01:03:15,990 --> 01:03:17,657 with each other in terms of covariance. 1241 01:03:17,657 --> 01:03:19,740 For the marginal distributions at each instant t, 1242 01:03:19,740 --> 01:03:22,950 you're right, the variance is F of t times 1 minus F of t. 1243 01:03:22,950 --> 01:03:25,170 We're not going to escape that. 1244 01:03:25,170 --> 01:03:27,390 But then the covariance structure between those guys 1245 01:03:27,390 --> 01:03:29,250 is a little more complicated. 1246 01:03:29,250 --> 01:03:30,330 But yes, you're right. 1247 01:03:30,330 --> 01:03:32,201 For the marginals that's enough. 1248 01:03:32,201 --> 01:03:32,701 Yeah? 1249 01:03:32,701 --> 01:03:34,701 AUDIENCE: So the supremum of the Brownian bridge 1250 01:03:34,701 --> 01:03:38,180 is a number between 0 and 10, let's just say. 1251 01:03:38,180 --> 01:03:40,244 PROFESSOR: Yeah, it could be infinity. 1252 01:03:40,244 --> 01:03:43,226 AUDIENCE: So it's not symmetrical with respect to 0, 1253 01:03:43,226 --> 01:03:45,214 so why are we doing alpha over 2? 1254 01:03:56,170 --> 01:03:57,900 PROFESSOR: OK. 1255 01:03:57,900 --> 01:03:58,640 Did I say that? 1256 01:03:58,640 --> 01:03:59,230 Yeah.
1257 01:03:59,230 --> 01:04:01,974 Because here I didn't say the supremum of the absolute value 1258 01:04:01,974 --> 01:04:03,890 of a Brownian bridge, I just said the supremum 1259 01:04:03,890 --> 01:04:04,765 of a Brownian bridge. 1260 01:04:04,765 --> 01:04:08,640 But you're right, let's just do this like that. 1261 01:04:08,640 --> 01:04:11,070 And then it's probably cleaner. 1262 01:04:14,580 --> 01:04:17,210 So yeah, actually well it should be q alpha. 1263 01:04:17,210 --> 01:04:19,630 So this is basically, you're right. 1264 01:04:19,630 --> 01:04:22,210 So think of it as being one-sided. 1265 01:04:22,210 --> 01:04:25,960 And there's actually no symmetry for the supremum. 1266 01:04:25,960 --> 01:04:29,074 I mean the supremum is not symmetric around 0, 1267 01:04:29,074 --> 01:04:29,740 so you're right. 1268 01:04:29,740 --> 01:04:33,630 I should not use alpha over 2, thank you. 1269 01:04:33,630 --> 01:04:35,690 Any other question? 1270 01:04:35,690 --> 01:04:36,950 This should be alpha. 1271 01:04:36,950 --> 01:04:37,815 Yeah. 1272 01:04:37,815 --> 01:04:39,940 I mean those slides were written with 1 minus alpha 1273 01:04:39,940 --> 01:04:42,490 and I have not replaced all instances of 1 minus alpha 1274 01:04:42,490 --> 01:04:43,980 by alpha. 1275 01:04:43,980 --> 01:04:45,392 I mean, except this guy, tilde. 1276 01:04:45,392 --> 01:04:47,100 Well, depends on how you want to call it. 1277 01:04:47,100 --> 01:04:50,520 But this is still, the probability that Z exceeds 1278 01:04:50,520 --> 01:04:53,550 this guy should be alpha. 1279 01:04:53,550 --> 01:04:54,160 OK? 1280 01:04:54,160 --> 01:04:55,910 And this can be found in tables. 1281 01:04:55,910 --> 01:05:00,370 And we can compute the p-value just like we did before. 1282 01:05:00,370 --> 01:05:02,320 But we have to simulate it because it's not 1283 01:05:02,320 --> 01:05:04,236 going to depend on the cumulative distribution 1284 01:05:04,236 --> 01:05:06,890 function of a Gaussian, like it did for the usual Gaussian 1285 01:05:06,890 --> 01:05:07,740 test. 1286 01:05:07,740 --> 01:05:09,656 That's something that's more complicated, 1287 01:05:09,656 --> 01:05:11,030 and typically you don't even try. 1288 01:05:11,030 --> 01:05:14,210 You get the statistical software to do it for you. 1289 01:05:14,210 --> 01:05:17,650 So just let me skip a few lines. 1290 01:05:17,650 --> 01:05:20,150 This is what the table looks like for the Kolmogorov-Smirnov 1291 01:05:20,150 --> 01:05:21,430 test. 1292 01:05:21,430 --> 01:05:25,690 So it just tells you, what is your number of observations, n. 1293 01:05:25,690 --> 01:05:28,270 Then you want alpha to be equal to 5%, say. 1294 01:05:28,270 --> 01:05:30,320 Let's say you have nine observations. 1295 01:05:30,320 --> 01:05:34,240 So if square root of n times the sup of the absolute value of Fn of t minus F of t 1296 01:05:34,240 --> 01:05:36,610 exceeds this thing, you reject. 1297 01:05:46,060 --> 01:05:47,620 Well, what's pretty clear from this test 1298 01:05:47,620 --> 01:05:49,390 is that it looks very nice, and I tell 1299 01:05:49,390 --> 01:05:50,680 you this is how you build it. 1300 01:05:50,680 --> 01:05:52,577 But if you think about it for one second, 1301 01:05:52,577 --> 01:05:54,160 it's actually really an annoying thing 1302 01:05:54,160 --> 01:05:57,760 to build because you have to take the supremum over t. 1303 01:05:57,760 --> 01:06:01,150 This depends on computing a supremum, which in practice 1304 01:06:01,150 --> 01:06:03,070 might be super cumbersome.
1305 01:06:03,070 --> 01:06:05,360 I don't want to have to compute this for all values of t 1306 01:06:05,360 --> 01:06:07,784 and then to take the maximum of those guys. 1307 01:06:07,784 --> 01:06:09,950 It turns out, and that's actually quite nice, that we 1308 01:06:09,950 --> 01:06:11,720 don't have to actually do this. 1309 01:06:11,720 --> 01:06:14,060 What does the empirical distribution function 1310 01:06:14,060 --> 01:06:15,474 look like? 1311 01:06:15,474 --> 01:06:23,350 Well, this thing, remember Fn of t by definition was-- 1312 01:06:23,350 --> 01:06:25,590 so let me go to the slide that's relevant. 1313 01:06:25,590 --> 01:06:27,290 So Fn of t looks like this. 1314 01:06:38,320 --> 01:06:41,590 So what it means is that when t is between two observations, 1315 01:06:41,590 --> 01:06:44,390 then this guy is actually keeping the same value. 1316 01:06:44,390 --> 01:06:48,210 So if I put my observations on the real line here. 1317 01:06:48,210 --> 01:06:49,940 So let's say I have one observation here, 1318 01:06:49,940 --> 01:06:51,782 one observation here, one observation here, 1319 01:06:51,782 --> 01:06:53,740 one observation here, and one observation here, 1320 01:06:53,740 --> 01:06:55,270 for simplicity. 1321 01:06:55,270 --> 01:06:57,730 Then this guy is basically, up to this normalization, 1322 01:06:57,730 --> 01:07:01,820 counting how many observations I have that are less than or equal to t. 1323 01:07:01,820 --> 01:07:05,020 So since I normalize by n, I know that the smallest number 1324 01:07:05,020 --> 01:07:10,480 here is going to be 0, and the largest number here 1325 01:07:10,480 --> 01:07:13,300 is going to be 1. 1326 01:07:13,300 --> 01:07:14,980 So let's say this looks like this. 1327 01:07:14,980 --> 01:07:18,290 This is the value 1. 1328 01:07:18,290 --> 01:07:21,800 At the value, since I take it less than or equal to, 1329 01:07:21,800 --> 01:07:24,530 when I'm at Xi, I'm actually counting it. 1330 01:07:24,530 --> 01:07:26,570 So the jump happens at Xi. 1331 01:07:26,570 --> 01:07:29,000 So that's the first observation, and then I jump. 1332 01:07:29,000 --> 01:07:30,860 By how much do I jump? 1333 01:07:33,650 --> 01:07:35,510 Yeah? 1334 01:07:35,510 --> 01:07:38,670 One over n, right? 1335 01:07:38,670 --> 01:07:41,732 And then this value belongs to the right. 1336 01:07:41,732 --> 01:07:42,690 And then I do it again. 1337 01:07:50,850 --> 01:07:54,534 I know it's not going to work out for me, but we'll see. 1338 01:07:54,534 --> 01:07:55,950 Oh no actually, I did pretty well. 1339 01:08:00,790 --> 01:08:04,130 This is what my empirical cumulative distribution function looks like. 1340 01:08:04,130 --> 01:08:05,630 Now if you look on this slide, there 1341 01:08:05,630 --> 01:08:07,870 is this weird notation where I start putting now 1342 01:08:07,870 --> 01:08:10,390 my indices in parentheses. 1343 01:08:10,390 --> 01:08:13,350 X parenthesis 1, X parenthesis 2, et cetera. 1344 01:08:13,350 --> 01:08:15,910 Those are called the order statistics. 1345 01:08:15,910 --> 01:08:18,729 It's just because, when my data is given 1346 01:08:18,729 --> 01:08:20,422 to me, I just call the first observation 1347 01:08:20,422 --> 01:08:21,880 the one that's on top of the table, 1348 01:08:21,880 --> 01:08:24,640 but it doesn't have to be the smallest value. 1349 01:08:24,640 --> 01:08:28,029 So it might be that this is X1 and that this is X2, 1350 01:08:28,029 --> 01:08:31,510 and then this is X3, X4, and X5. 1351 01:08:31,510 --> 01:08:33,374 These might be my observations.
1352 01:08:33,374 --> 01:08:35,290 So what I do is that I relabel them in such a way 1353 01:08:35,290 --> 01:08:38,109 that, this is actually, I call this guy X parenthesis 1, 1354 01:08:38,109 --> 01:08:40,569 which is just really X3. 1355 01:08:40,569 --> 01:08:46,810 This is X2, X3, X4, and X5. 1356 01:08:46,810 --> 01:08:48,790 These are my reordered observations 1357 01:08:48,790 --> 01:08:52,029 in such a way that the smallest one is indexed by one 1358 01:08:52,029 --> 01:08:54,013 and the largest one is indexed by n. 1359 01:08:58,439 --> 01:09:01,200 So now this is actually quite nice, 1360 01:09:01,200 --> 01:09:04,170 because what I'm trying to do is to find the largest 1361 01:09:04,170 --> 01:09:07,210 deviation from this guy to the true cumulative distribution 1362 01:09:07,210 --> 01:09:07,710 function. 1363 01:09:07,710 --> 01:09:09,460 The true cumulative distribution function, 1364 01:09:09,460 --> 01:09:11,729 let's say it's Gaussian, looks like this. 1365 01:09:15,340 --> 01:09:19,120 It's something continuous, for a symmetric distribution 1366 01:09:19,120 --> 01:09:22,520 it crosses this axis at 1/2, and that's what it looks like. 1367 01:09:22,520 --> 01:09:25,330 And the Kolmogorov-Smirnov test is just 1368 01:09:25,330 --> 01:09:31,470 telling me how far apart do those two curves get 1369 01:09:31,470 --> 01:09:35,069 in the worst possible case? 1370 01:09:35,069 --> 01:09:37,380 So in particular here, where are they the farthest? 1371 01:09:37,380 --> 01:09:40,729 Clearly that's this point. 1372 01:09:40,729 --> 01:09:42,490 And so up to rescaling, this is the value 1373 01:09:42,490 --> 01:09:44,510 I'm going to be interested in. 1374 01:09:44,510 --> 01:09:49,130 That's how they get as far as possible from each other. 1375 01:09:49,130 --> 01:09:52,399 Here, something just happened, right? 1376 01:09:52,399 --> 01:09:54,695 The farthest distance that I got was exactly 1377 01:09:54,695 --> 01:09:55,970 at one of those dots. 1378 01:09:58,480 --> 01:10:01,600 It turns out it is enough to look at those dots. 1379 01:10:01,600 --> 01:10:04,660 And the reason is, well because after this dot 1380 01:10:04,660 --> 01:10:08,460 and until the next jump, this guy does not change, 1381 01:10:08,460 --> 01:10:11,230 but this guy increases. 1382 01:10:11,230 --> 01:10:15,220 And so the only point where they can be the farthest apart 1383 01:10:15,220 --> 01:10:19,720 is either to the left of a jump or to the right of a jump. 1384 01:10:19,720 --> 01:10:22,540 That's the only place where they can be far from each other. 1385 01:10:22,540 --> 01:10:24,896 And that means I only need to look at the observations. 1386 01:10:24,896 --> 01:10:26,620 Everybody sees that? 1387 01:10:26,620 --> 01:10:29,470 The farthest points, the points at which those two curves are 1388 01:10:29,470 --> 01:10:31,300 the farthest from each other, have 1389 01:10:31,300 --> 01:10:34,000 to be at one of the observations. 1390 01:10:34,000 --> 01:10:37,790 And so rather than looking at a sup over all possible t's, 1391 01:10:37,790 --> 01:10:40,920 really all I need to do is to look at a maximum 1392 01:10:40,920 --> 01:10:43,036 only at my observations. 1393 01:10:46,390 --> 01:10:48,960 I just need to check at each of those points 1394 01:10:48,960 --> 01:10:51,150 whether they're far. 1395 01:10:51,150 --> 01:10:53,090 Now here, notice that this 1396 01:10:53,090 --> 01:10:57,530 is not written Fn of Xi. 1397 01:10:57,530 --> 01:11:01,410 The reason is because I actually know what Fn of Xi is.
1398 01:11:01,410 --> 01:11:05,320 Fn of the i-th order observation is just 1399 01:11:05,320 --> 01:11:08,680 the number of jumps I've had until this observation. 1400 01:11:08,680 --> 01:11:11,290 So here, I know that the value of Fn is 1 over n, 1401 01:11:11,290 --> 01:11:15,520 here it's 2 over n, 3 over n, 4 over n, 5 over n. 1402 01:11:15,520 --> 01:11:19,300 So I know that the values of Fn at my observations, 1403 01:11:19,300 --> 01:11:22,300 and those are actually the only values that Fn can take, 1404 01:11:22,300 --> 01:11:25,060 are an integer divided by n. 1405 01:11:25,060 --> 01:11:29,680 And that's why you see i minus 1 over n, or i over n. 1406 01:11:29,680 --> 01:11:32,520 This is the difference just before the jump, 1407 01:11:32,520 --> 01:11:34,450 and this is the difference at the jump. 1408 01:11:38,090 --> 01:11:42,800 So here the key message is that this is no longer 1409 01:11:42,800 --> 01:11:44,610 a supremum over all t's, but it's just 1410 01:11:44,610 --> 01:11:46,110 the maximum over i from 1 to n. 1411 01:11:46,110 --> 01:11:49,160 So I really have only 2n values to compute. 1412 01:11:49,160 --> 01:11:51,970 This value and this value for each observation, that's 2n 1413 01:11:51,970 --> 01:11:52,850 total. 1414 01:11:52,850 --> 01:11:55,760 I look at the maximum and that's actually the value. 1415 01:11:55,760 --> 01:11:58,907 And it's actually equal to tn. 1416 01:11:58,907 --> 01:11:59,990 It's not an approximation. 1417 01:11:59,990 --> 01:12:00,840 Those things are equal. 1418 01:12:00,840 --> 01:12:02,060 Those are just the only places where 1419 01:12:02,060 --> 01:12:03,143 those guys can be maximal. 1420 01:12:09,242 --> 01:12:10,715 Yes? 1421 01:12:10,715 --> 01:12:15,134 AUDIENCE: It seems like since the null hypothesis [INAUDIBLE] 1422 01:12:15,134 --> 01:12:17,589 the entire distribution of theta, 1423 01:12:17,589 --> 01:12:19,716 this is like strictly more powerful than just 1424 01:12:19,716 --> 01:12:23,000 doing it [INAUDIBLE]. 1425 01:12:23,000 --> 01:12:24,832 PROFESSOR: It's strictly less powerful. 1426 01:12:24,832 --> 01:12:27,784 AUDIENCE: Strictly less powerful. 1427 01:12:27,784 --> 01:12:30,490 But is there, is that like a big trade-off 1428 01:12:30,490 --> 01:12:32,018 that we're making when we do that? 1429 01:12:32,018 --> 01:12:33,934 Obviously we're not certain in the first place 1430 01:12:33,934 --> 01:12:35,309 that we want to assume normality. 1431 01:12:35,309 --> 01:12:37,624 Does it make sense to [INAUDIBLE], 1432 01:12:37,624 --> 01:12:39,592 the Gaussian [INAUDIBLE]. 1433 01:12:48,000 --> 01:12:50,420 PROFESSOR: So can you, I'm not sure what 1434 01:12:50,420 --> 01:12:51,400 question you're asking. 1435 01:12:51,400 --> 01:12:53,360 AUDIENCE: So when we're doing a normal test, 1436 01:12:53,360 --> 01:12:55,810 we're just asking questions about the mus, 1437 01:12:55,810 --> 01:12:57,280 the means of our distribution. 1438 01:12:57,280 --> 01:13:00,383 [INAUDIBLE] This one, it seems like it 1439 01:13:00,383 --> 01:13:02,670 would be both at the same time. 1440 01:13:02,670 --> 01:13:11,000 [INAUDIBLE] Is this decreasing power [INAUDIBLE]? 1441 01:13:11,000 --> 01:13:13,470 PROFESSOR: So remember, here in this test 1442 01:13:13,470 --> 01:13:16,140 we want to conclude to H0, in the other test we typically 1443 01:13:16,140 --> 01:13:17,670 want to conclude to H1. 1444 01:13:17,670 --> 01:13:21,150 So here we actually don't want power, in a way.
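[Editor's note: the reduction of the supremum to a maximum over order statistics, described a moment ago, is a few lines in code. Here is a sketch in Python with NumPy and SciPy; the helper name ks_statistic and the N(0, 1) example are the editor's choices. It is checked against scipy.stats.kstest, which reports the unnormalized sup distance D.]

```python
import numpy as np
from scipy.stats import norm, kstest

def ks_statistic(x, F0):
    """sqrt(n) * sup_t |F_n(t) - F0(t)|, computed at the order statistics.

    The sup can only be attained just before or at a jump of F_n, so it
    equals max over i of max(i/n - F0(x_(i)), F0(x_(i)) - (i-1)/n).
    """
    x = np.sort(np.asarray(x))  # order statistics x_(1) <= ... <= x_(n)
    n = len(x)
    i = np.arange(1, n + 1)
    F0x = F0(x)
    D = np.max(np.maximum(i / n - F0x, F0x - (i - 1) / n))
    return np.sqrt(n) * D

rng = np.random.default_rng(3)
x = rng.standard_normal(100)
print(ks_statistic(x, norm.cdf))                      # H0: F = N(0, 1), fully specified
print(np.sqrt(len(x)) * kstest(x, "norm").statistic)  # same value via SciPy's D
```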
1445 01:13:21,150 --> 01:13:24,619 And you have to also assume that doing a test on the mean 1446 01:13:24,619 --> 01:13:26,160 is probably not the only thing you're 1447 01:13:26,160 --> 01:13:27,576 going to end up doing on your data 1448 01:13:27,576 --> 01:13:31,472 after you actually establish that it's normally distributed. 1449 01:13:31,472 --> 01:13:33,180 Then you have the dataset, you've sort of 1450 01:13:33,180 --> 01:13:34,763 established it's normally distributed, 1451 01:13:34,763 --> 01:13:38,090 and then you can just run the arsenal of statistical studies. 1452 01:13:38,090 --> 01:13:39,540 And we're going to see regression 1453 01:13:39,540 --> 01:13:42,570 and all sorts of predictive things, which are not just 1454 01:13:42,570 --> 01:13:44,280 tests if the mean is equal to something. 1455 01:13:44,280 --> 01:13:45,660 Maybe you want to build a confidence interval 1456 01:13:45,660 --> 01:13:46,530 for the mean. 1457 01:13:46,530 --> 01:13:50,052 Then this is not a test, a confidence interval is not a test. 1458 01:13:50,052 --> 01:13:52,260 So you're going to have to first test if it's normal, 1459 01:13:52,260 --> 01:13:53,760 and then see if you can actually use 1460 01:13:53,760 --> 01:13:55,770 the quantiles of a Gaussian distribution or a t 1461 01:13:55,770 --> 01:13:59,760 distribution to build this confidence interval. 1462 01:13:59,760 --> 01:14:03,510 So in a way you should view this as, like, the flat fee 1463 01:14:03,510 --> 01:14:05,670 to enter the Gaussian world, and then you 1464 01:14:05,670 --> 01:14:09,072 can do whatever you want to do in the Gaussian world. 1465 01:14:09,072 --> 01:14:11,030 We'll see actually that your question goes back 1466 01:14:11,030 --> 01:14:14,750 to something that's a little important, which is: here 1467 01:14:14,750 --> 01:14:17,540 I said F0 is fully specified. 1468 01:14:17,540 --> 01:14:21,490 It's like an N 1, 5. 1469 01:14:21,490 --> 01:14:24,410 But I didn't say, is it normally distributed, 1470 01:14:24,410 --> 01:14:26,440 which is the question that everybody asks. 1471 01:14:26,440 --> 01:14:29,189 You're not asking, is it this particular normal distribution 1472 01:14:29,189 --> 01:14:31,480 with this particular mean and this particular variance. 1473 01:14:31,480 --> 01:14:32,860 So how would you do it in practice? 1474 01:14:32,860 --> 01:14:34,276 Well you would say, I'm just going 1475 01:14:34,276 --> 01:14:36,910 to replace the mean by the empirical mean and the variance 1476 01:14:36,910 --> 01:14:38,720 by the empirical variance. 1477 01:14:38,720 --> 01:14:41,710 But by doing that you're making a huge mistake because you 1478 01:14:41,710 --> 01:14:45,160 are sort of depriving your test of the possibility 1479 01:14:45,160 --> 01:14:46,967 to reject the Gaussian hypothesis just 1480 01:14:46,967 --> 01:14:49,300 based on the fact that the mean is wrong or the variance 1481 01:14:49,300 --> 01:14:49,930 is wrong. 1482 01:14:49,930 --> 01:14:52,600 You've already stuck to your data pretty well. 1483 01:14:52,600 --> 01:14:55,660 And so you're sort of like already 1484 01:14:55,660 --> 01:14:59,320 tilting the game in favor of H0 big time. 1485 01:14:59,320 --> 01:15:01,300 So there's actually a way to correct for this. 1486 01:15:03,930 --> 01:15:05,555 OK, so this is about pivotal statistics. 1487 01:15:05,555 --> 01:15:06,950 We've used this word many times. 1488 01:15:09,680 --> 01:15:12,272 And so that's how it goes. 1489 01:15:12,272 --> 01:15:13,730 I'm not going to go into this test.
1490 01:15:13,730 --> 01:15:16,640 It's really, this is a recipe on how you would actually 1491 01:15:16,640 --> 01:15:20,920 build the table that I showed you, this table. 1492 01:15:20,920 --> 01:15:23,660 This is basically the recipe on how to build it. 1493 01:15:23,660 --> 01:15:25,790 There's another recipe to build it, which is just 1494 01:15:25,790 --> 01:15:27,730 open a book at this page. 1495 01:15:27,730 --> 01:15:29,770 That's a little faster. 1496 01:15:29,770 --> 01:15:32,870 Or use software. 1497 01:15:32,870 --> 01:15:34,050 I just wanted to show you. 1498 01:15:34,050 --> 01:15:36,891 So let's just keep in mind, does anybody have a good memory? 1499 01:15:36,891 --> 01:15:38,390 Let's just keep in mind this number. 1500 01:15:38,390 --> 01:15:44,060 This is the threshold for the Kolmogorov-Smirnov statistic. 1501 01:15:44,060 --> 01:15:47,250 If I have 10 observations and I want to do it at 5%, 1502 01:15:47,250 --> 01:15:50,060 it's about 41%. 1503 01:15:50,060 --> 01:15:52,380 So that's the number that it should be larger than. 1504 01:15:52,380 --> 01:15:56,630 So it turns out that if you want to test if it's normal, and not 1505 01:15:56,630 --> 01:15:59,000 just the specific normal, this number 1506 01:15:59,000 --> 01:16:00,145 is going to be different. 1507 01:16:00,145 --> 01:16:01,520 Do you think the number I'm going 1508 01:16:01,520 --> 01:16:03,561 to read in a table that's appropriate for this is 1509 01:16:03,561 --> 01:16:05,630 going to be larger or smaller? 1510 01:16:05,630 --> 01:16:07,505 Who says larger? 1511 01:16:07,505 --> 01:16:09,130 AUDIENCE: Sorry, what was the question? 1512 01:16:09,130 --> 01:16:10,588 PROFESSOR: So the question is, this 1513 01:16:10,588 --> 01:16:20,270 is the number I should see if my test was, is X, say, N 0, 5. 1514 01:16:20,270 --> 01:16:20,770 Right? 1515 01:16:20,770 --> 01:16:25,630 That's a specific distribution with a specific F0. 1516 01:16:25,630 --> 01:16:27,810 So that's the number, I would build 1517 01:16:27,810 --> 01:16:29,630 the Kolmogorov-Smirnov statistic from this. 1518 01:16:29,630 --> 01:16:32,460 I would perform a test and check if my Kolmogorov-Smirnov 1519 01:16:32,460 --> 01:16:34,970 statistic tn is larger than this number or not. 1520 01:16:34,970 --> 01:16:36,450 If it's larger I'm going to reject. 1521 01:16:36,450 --> 01:16:40,940 Now I say, actually, I don't want to test if H0 is N 0, 5, 1522 01:16:40,940 --> 01:16:47,942 but that it's just an N mu, sigma squared for some mu and sigma squared. 1523 01:16:47,942 --> 01:16:50,400 And in particular I'm just going to plug in mu hat and sigma 1524 01:16:50,400 --> 01:16:52,680 hat into my F0, run the same statistic, 1525 01:16:52,680 --> 01:16:56,280 but compare it to a different number. 1526 01:16:56,280 --> 01:17:00,090 So the larger the number, the more or less 1527 01:17:00,090 --> 01:17:03,660 likely am I to reject? 1528 01:17:03,660 --> 01:17:05,730 The less likely I am to reject, right? 1529 01:17:05,730 --> 01:17:09,700 So if I just use that number, let's say 1530 01:17:09,700 --> 01:17:12,660 this is a large number, I would be more 1531 01:17:12,660 --> 01:17:14,077 tempted to say it's Gaussian. 1532 01:17:14,077 --> 01:17:15,660 And if you look at the table you would 1533 01:17:15,660 --> 01:17:18,300 get that if you make the appropriate correction 1534 01:17:18,300 --> 01:17:21,200 at the same number of observations, 10, 1535 01:17:21,200 --> 01:17:26,359 and the same level, you get 25% as opposed to 41%.
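[Editor's note: those two thresholds are easy to reproduce by Monte Carlo. The sketch below, in Python with NumPy and SciPy, is illustrative (seeds, simulation counts, and the use of the ddof=1 sample standard deviation are assumptions): it simulates the 5% critical value of sup_t |F_n(t) - F0(t)| for n = 10, once with F0 fully specified and once with mu hat and sigma hat plugged in, and should give roughly 0.41 and 0.26, matching the 41% and 25% quoted.]

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, n_sim = 10, 20_000
i = np.arange(1, n + 1)

def sup_distance(u):
    # sup_t |F_n(t) - F0(t)| from the sorted values u_i = F0(x_(i))
    return np.max(np.maximum(i / n - u, u - (i - 1) / n))

d_ks, d_lf = [], []
for _ in range(n_sim):
    x = np.sort(rng.standard_normal(n))
    d_ks.append(sup_distance(norm.cdf(x)))                            # F0 known
    d_lf.append(sup_distance(norm.cdf(x, x.mean(), x.std(ddof=1))))   # F0 estimated
print(np.quantile(d_ks, 0.95))  # about 0.41: the Kolmogorov-Smirnov threshold
print(np.quantile(d_lf, 0.95))  # about 0.26: the Kolmogorov-Lilliefors threshold
```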
1536 01:17:26,359 --> 01:17:28,650 That means that you're actually much more likely if you 1537 01:17:28,650 --> 01:17:32,670 use the appropriate test to reject the hypothesis that it's 1538 01:17:32,670 --> 01:17:34,680 normal, which is bad news, because that means 1539 01:17:34,680 --> 01:17:36,780 you don't have access to the Gaussian arsenal, 1540 01:17:36,780 --> 01:17:38,160 and nobody wants to do this. 1541 01:17:38,160 --> 01:17:40,920 So actually this is a mistake that people do a lot. 1542 01:17:40,920 --> 01:17:42,570 They use the Kolmogorov-Smirnov test 1543 01:17:42,570 --> 01:17:45,810 to test for normality without adjusting for the fact 1544 01:17:45,810 --> 01:17:48,210 that they've plugged in the estimated mean 1545 01:17:48,210 --> 01:17:50,100 and the estimated variance. 1546 01:17:50,100 --> 01:17:53,520 This leads to rejecting less often, right? 1547 01:17:53,520 --> 01:17:58,490 I mean this is almost half of the number that we had. 1548 01:17:58,490 --> 01:18:00,990 And then they can be happy and walk home 1549 01:18:00,990 --> 01:18:03,120 and say, well, I did the test and it was normal. 1550 01:18:03,120 --> 01:18:04,640 So this is actually a mistake that I 1551 01:18:04,640 --> 01:18:07,130 genuinely believe at least a quarter of the people 1552 01:18:07,130 --> 01:18:09,357 do make on purpose. 1553 01:18:09,357 --> 01:18:11,690 They just say, well I want it to be Gaussian so I'm just 1554 01:18:11,690 --> 01:18:13,760 going to make my life easier. 1555 01:18:13,760 --> 01:18:17,180 So this is the so-called Kolmogorov-Lilliefors test. 1556 01:18:17,180 --> 01:18:20,800 We'll talk about it, well not today for sure. 1557 01:18:20,800 --> 01:18:24,650 There are other statistics that you can use. 1558 01:18:24,650 --> 01:18:26,390 And the idea is to say, well, we want 1559 01:18:26,390 --> 01:18:28,280 to know if the empirical distribution 1560 01:18:28,280 --> 01:18:31,900 function, the empirical CDF, is close to the true CDF. 1561 01:18:31,900 --> 01:18:33,880 The way we did it is by forming the difference 1562 01:18:33,880 --> 01:18:36,240 and looking at the worst possible distance they can be apart. 1563 01:18:36,240 --> 01:18:39,880 That's called a sup norm, or L infinity norm, 1564 01:18:39,880 --> 01:18:42,140 in functional analysis. 1565 01:18:42,140 --> 01:18:44,020 So here, this is what it looked like. 1566 01:18:44,020 --> 01:18:46,630 The distance between Fn and F that we measured was just 1567 01:18:46,630 --> 01:18:48,520 the supremum distance over all t's. 1568 01:18:48,520 --> 01:18:51,100 That's one way to measure distance between two functions. 1569 01:18:51,100 --> 01:18:53,170 But there are infinitely many ways 1570 01:18:53,170 --> 01:18:54,880 to measure distance between functions. 1571 01:18:54,880 --> 01:18:56,840 One is something we're much more familiar with, 1572 01:18:56,840 --> 01:18:59,510 which is the squared L2-norm. 1573 01:18:59,510 --> 01:19:02,770 This is nice because this has like an inner product, 1574 01:19:02,770 --> 01:19:04,370 it has some nice properties. 1575 01:19:04,370 --> 01:19:06,740 And you could actually just, rather than taking the sup, 1576 01:19:06,740 --> 01:19:10,280 you could just integrate the squared distance. 1577 01:19:10,280 --> 01:19:14,485 And this is what leads to the Cramér-von Mises test. 1578 01:19:14,485 --> 01:19:15,860 And then there's another one that 1579 01:19:15,860 --> 01:19:18,770 says, well, maybe I don't want to integrate without weights.
1580 01:19:18,770 --> 01:19:22,230 Maybe I want to put weights that account for the variance. 1581 01:19:22,230 --> 01:19:24,500 And this guy is called Anderson-Darling. 1582 01:19:24,500 --> 01:19:26,810 For each of these tests you can check 1583 01:19:26,810 --> 01:19:29,660 that the asymptotic distribution is going to be pivotal, 1584 01:19:29,660 --> 01:19:32,420 which means that there will be a table at the back of some book 1585 01:19:32,420 --> 01:19:37,190 that tells you what the quantiles 1586 01:19:37,190 --> 01:19:38,730 of square root of n times this guy 1587 01:19:38,730 --> 01:19:40,333 are, asymptotically. 1588 01:19:40,333 --> 01:19:41,299 Yeah? 1589 01:19:41,299 --> 01:19:44,197 AUDIENCE: For the Kolmogorov-Smirnov test, 1590 01:19:44,197 --> 01:19:48,061 the table that shows the critical values 1591 01:19:48,061 --> 01:19:51,572 has a value for each different n. 1592 01:19:51,572 --> 01:19:53,390 But I thought we [INAUDIBLE]-- 1593 01:19:53,390 --> 01:19:54,150 PROFESSOR: Yeah. 1594 01:19:54,150 --> 01:19:56,649 So that's just to show you that asymptotically it's pivotal, 1595 01:19:56,649 --> 01:19:59,160 and I can point you to one specific thing. 1596 01:19:59,160 --> 01:20:02,842 But it turns out that this thing is actually pivotal for each n. 1597 01:20:02,842 --> 01:20:05,300 And that's why you have this recipe to construct the entire 1598 01:20:05,300 --> 01:20:08,690 thing, because it's actually not true for all possible n's. 1599 01:20:08,690 --> 01:20:10,700 Also, there's the n that shows up here. 1600 01:20:10,700 --> 01:20:13,040 So no, actually, this is something 1601 01:20:13,040 --> 01:20:14,090 you should have in mind. 1602 01:20:14,090 --> 01:20:18,350 So basically, let me strike what I just said. 1603 01:20:18,350 --> 01:20:20,330 This distribution 1604 01:20:20,330 --> 01:20:24,119 will not depend on F0 for any particular n. 1605 01:20:24,119 --> 01:20:25,910 It's just not going to be a Brownian bridge 1606 01:20:25,910 --> 01:20:28,130 but a finite-sample approximation of a Brownian 1607 01:20:28,130 --> 01:20:31,160 bridge, and you can simulate it by just drawing samples 1608 01:20:31,160 --> 01:20:33,500 from it, building a histogram, and constructing 1609 01:20:33,500 --> 01:20:35,286 the quantiles for this guy. 1610 01:20:35,286 --> 01:20:36,910 AUDIENCE: No one has actually developed 1611 01:20:36,910 --> 01:20:38,304 a table for Brownian-- 1612 01:20:38,304 --> 01:20:39,470 PROFESSOR: Oh, there is one. 1613 01:20:39,470 --> 01:20:42,570 That's the table, maybe. 1614 01:20:42,570 --> 01:20:46,670 Let's see if we see it at the bottom of the other table. 1615 01:20:46,670 --> 01:20:47,220 Yeah. 1616 01:20:47,220 --> 01:20:47,720 See? 1617 01:20:47,720 --> 01:20:48,997 Over 40, over 30. 1618 01:20:48,997 --> 01:20:50,580 So this is not the Kolmogorov-Smirnov, 1619 01:20:50,580 --> 01:20:52,710 but that's the Kolmogorov-Lilliefors. 1620 01:20:52,710 --> 01:20:54,900 Those numbers that you see here, they 1621 01:20:54,900 --> 01:20:57,060 are the numbers for the asymptotic thing, which is 1622 01:20:57,060 --> 01:20:59,192 some sort of Brownian bridge. 1623 01:20:59,192 --> 01:21:00,184 Yeah? 1624 01:21:00,184 --> 01:21:01,184 AUDIENCE: Two questions. 1625 01:21:01,184 --> 01:21:03,656 If I want to build the Kolmogorov-Smirnov test, 1626 01:21:03,656 --> 01:21:08,120 it says that F0 is required to be continuous. 1627 01:21:08,120 --> 01:21:10,104 PROFESSOR: Yeah.
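[Editor's sketch of the simulation recipe just described: draw samples under H0, build the distribution of the statistic, read off the quantile. The key fact is that under H0 with continuous F0, the values F0(X_i) are uniform on [0, 1], so the simulation does not depend on F0; n = 10 at level 5% is chosen to match the number quoted earlier.]

```python
# Monte Carlo approximation of the finite-sample null distribution
# of the Kolmogorov-Smirnov statistic, which is pivotal for each n.
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 10, 100_000
i = np.arange(1, n + 1)

draws = np.empty(n_sims)
for s in range(n_sims):
    u = np.sort(rng.uniform(size=n))  # F0(X_i) under H0
    draws[s] = max(np.max(i / n - u), np.max(u - (i - 1) / n))

# The 95% quantile is the 5%-level threshold; ~0.41 for n = 10.
print(np.quantile(draws, 0.95))
```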
1628 01:21:10,104 --> 01:21:13,576 AUDIENCE: [INAUDIBLE] If we have, like, probability 1629 01:21:13,576 --> 01:21:15,560 mass at a particular value. 1630 01:21:15,560 --> 01:21:18,060 Like some sort of discrete data. 1631 01:21:18,060 --> 01:21:20,769 PROFESSOR: So then you won't have this nice picture, right? 1632 01:21:20,769 --> 01:21:22,560 This can happen at any point, because you're 1633 01:21:22,560 --> 01:21:24,335 going to have discontinuities in F, 1634 01:21:24,335 --> 01:21:26,620 and those things can happen everywhere. 1635 01:21:26,620 --> 01:21:27,120 And then-- 1636 01:21:27,120 --> 01:21:29,034 AUDIENCE: Would the supremum still work? 1637 01:21:29,034 --> 01:21:30,700 PROFESSOR: You mean the Brownian bridge? 1638 01:21:30,700 --> 01:21:32,140 AUDIENCE: Yeah. 1639 01:21:32,140 --> 01:21:35,340 The Kolmogorov test doesn't say that you 1640 01:21:35,340 --> 01:21:37,382 have to be able to easily calculate the supremum. 1641 01:21:37,382 --> 01:21:39,256 PROFESSOR: No, no, no, but you still need it. 1642 01:21:39,256 --> 01:21:40,600 You still need it for-- 1643 01:21:40,600 --> 01:21:42,832 so there are some finite-sample versions of it 1644 01:21:42,832 --> 01:21:45,040 that you can use that are slightly more conservative, 1645 01:21:45,040 --> 01:21:47,740 which is in a way good news, because you're 1646 01:21:47,740 --> 01:21:50,250 going to conclude in favor of H0 more often. 1647 01:21:50,250 --> 01:21:52,930 And there's one, I forget the exact name, 1648 01:21:52,930 --> 01:21:57,172 it's the Dvoretzky-Kiefer-Wolfowitz 1649 01:21:57,172 --> 01:21:59,630 inequality, which is basically like Hoeffding's inequality. 1650 01:21:59,630 --> 01:22:01,510 So it's basically, up to worse constants, 1651 01:22:01,510 --> 01:22:04,900 telling you the same result as the Brownian bridge result, 1652 01:22:04,900 --> 01:22:06,850 and it's true all the time. 1653 01:22:06,850 --> 01:22:08,830 But for the exact asymptotic distribution, 1654 01:22:08,830 --> 01:22:11,467 you need continuity. 1655 01:22:11,467 --> 01:22:12,496 Yes. 1656 01:22:12,496 --> 01:22:13,912 AUDIENCE: So just a clarification. 1657 01:22:13,912 --> 01:22:15,868 So when we are testing with Kolmogorov, 1658 01:22:15,868 --> 01:22:19,902 we shouldn't test a particular mu and sigma squared? 1659 01:22:19,902 --> 01:22:22,110 PROFESSOR: Well, if you know what they are you can use 1660 01:22:22,110 --> 01:22:25,259 Kolmogorov-Smirnov, but if you don't know what they are 1661 01:22:25,259 --> 01:22:26,300 you're going to plug in-- 1662 01:22:26,300 --> 01:22:27,758 as soon as you're going to estimate 1663 01:22:27,758 --> 01:22:29,534 the mean and the variance from the data, 1664 01:22:29,534 --> 01:22:31,700 you should use the one we'll see next time, which is 1665 01:22:31,700 --> 01:22:33,200 called Kolmogorov-Lilliefors. 1666 01:22:33,200 --> 01:22:34,950 You don't have to think about it too much. 1667 01:22:34,950 --> 01:22:38,000 We'll talk about it on Thursday. 1668 01:22:38,000 --> 01:22:39,215 Any other question? 1669 01:22:39,215 --> 01:22:40,090 So we're out of time. 1670 01:22:40,090 --> 01:22:45,700 So I think we should stop here, and we'll resume on Thursday.
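[Editor's postscript on the inequality mentioned in that last exchange: the Dvoretzky-Kiefer-Wolfowitz bound states P(sup_t |Fn(t) - F(t)| > eps) <= 2 exp(-2 n eps^2) for every n and every F, continuous or not. A minimal sketch of inverting it at a given level, reusing the n = 10, 5% setting purely for comparison.]

```python
# Inverting the Dvoretzky-Kiefer-Wolfowitz bound
#   P(sup_t |Fn(t) - F(t)| > eps) <= 2 exp(-2 n eps^2)
# at level alpha gives a conservative stand-in for the KS threshold.
import math

def dkw_threshold(n: int, alpha: float) -> float:
    """Smallest eps for which the DKW bound is at most alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

# ~0.43 for n = 10 at 5%, slightly above the exact KS value of ~0.41,
# i.e. slightly more conservative, as the discussion above says.
print(dkw_threshold(10, 0.05))
```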