The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: We're talking about goodness-of-fit tests. A goodness-of-fit test asks: does my data come from a particular distribution? And why would we want to know this? Well, maybe we're interested, for example, in whether the zodiac signs of Fortune 500 CEOs are uniformly distributed. Or maybe we have slightly deeper endeavors, such as checking whether we can apply the t-test, by testing normality of our sample.

We saw the main standard test for this. It's called the Kolmogorov-Smirnov test, and people use it quite a bit; it's probably one of the most used tests out there. There are other versions of it that I mentioned in passing: the Cramer-von Mises test and the Anderson-Darling test. Now, how would you pick one of these tests? Well, they're always going to have their advantages and disadvantages. Kolmogorov-Smirnov is definitely the most widely used, I guess because it's a natural notion of distance between functions: you look at how far apart the two functions can be at each point, and you take the largest such gap. Cramer-von Mises involves an L2 distance, so if you're not used to Hilbert spaces or Euclidean notions, it's a little more complicated. And Anderson-Darling is definitely even more complicated. Now, each of these tests is going to be more powerful against different alternatives.
So unless you can really guess which alternative you're expecting to see — which you probably can't, because you're typically in a case where you want to declare H0 to be the correct one — it's really a matter of tossing a coin. Maybe you can run all three of them and just sleep better at night because all three of them have failed to reject, for example.

As I mentioned, one of the primary goals of goodness-of-fit testing is to check whether we can apply Student's test — whether the Student distribution is actually valid for our statistic. For that, we need normally distributed data. Now, as I've said several times, "normally distributed" is not a specific distribution; it's a family of distributions indexed by means and variances. And the way I would want to test whether my data is normally distributed is to look at the most natural normal, or Gaussian, distribution that my data could follow: the Gaussian distribution with the same mean as my data and the same empirical variance as my data. So I'm given some points X1, ..., Xn, and I'm asking: are those Gaussian? That's equivalent to asking: are they N(mu, sigma^2) for some mu and sigma^2? And of course, the natural choice is to take mu equal to mu hat, which is Xn bar, and sigma^2 equal to sigma hat squared, which is what we wrote Sn = (1/n) sum_{i=1}^n (Xi - Xn bar)^2.

So this is definitely the natural thing to test. And maybe you could actually just close your eyes and stuff that into a Kolmogorov-Smirnov test.
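To make that "close your eyes" plug-in statistic concrete, here is a minimal sketch in Python (my own illustration, not from the lecture; the helper name plugin_ks_statistic is made up). It uses the fact that the supremum over t of |F_n(t) - Phi(t)| is attained at the jump points of the empirical CDF, i.e., at the data points:

```python
import numpy as np
from scipy.stats import norm

def plugin_ks_statistic(x):
    """Sup-distance between the empirical CDF and the Gaussian CDF with
    plugged-in estimates mu_hat = X_bar and sigma_hat^2 = S_n (the 1/n one).
    The sup over t is attained at the jump points of F_n, i.e. at the data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    cdf = norm.cdf(np.sort(x), loc=x.mean(), scale=x.std())  # np.std uses ddof=0, matching S_n
    grid = np.arange(1, n + 1) / n
    return max(np.max(grid - cdf), np.max(cdf - (grid - 1 / n)))

rng = np.random.default_rng(0)
print(plugin_ks_statistic(rng.normal(2.0, 3.0, size=100)))
```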
But there are a few things here that don't work. The first one is that Donsker's theorem does not apply anymore. Donsker's theorem was the one that told us that, properly normalized, this thing converges to the supremum of a Brownian bridge — and that is no longer true. So that's one problem. But there's actually an even bigger point: this statistic, as we will check in a second, is still pivotal. It does not have a distribution that depends on the unknown parameters, which is sort of nice, at least under the null. However, its distribution is not the same as the one we had with fixed mu and sigma. The fact that the plugged-in parameters come from random variables is actually distorting the distribution itself, and in particular the quantiles are going to be distorted; we hinted at that last time.

So one other thing I need to tell you — and yes, that's where there's a word missing on the slide — is that we compute the quantiles for this test statistic. What I need to promise you is that these quantiles do not depend on any unknown parameter. And that's not clear a priori, right? I want to test whether my data has some Gaussian distribution. Under the null, all I know is that my Xi's are Gaussian with some mean mu and some variance sigma^2, which I don't know. So it could be the case that when I try to understand the distribution of this quantity under the null, it depends on mu and sigma, which I don't know. We need to check that this is not the case. And what's actually our redemption here is the supremum: the supremum is basically going to allow us to "sup out" mu and sigma^2.

So let's check that. What I'm interested in is this quantity: the supremum over t in R of the difference between F_n(t) and what I write as Phi_{mu hat, sigma hat^2}(t), where Phi_{mu hat, sigma hat^2} is the CDF of a Gaussian with mean mu hat and variance sigma hat squared.
And so in particular, Phi_{mu hat, sigma hat^2}(t) is the probability that some X is less than t, where X follows N(mu hat, sigma hat^2). By the usual translation-and-scaling trick that turns a Gaussian into a standard Gaussian, this means there exists some Z — standard Gaussian this time, so mean 0 and variance 1 — such that X = sigma hat Z + mu hat. Agreed? That's basically saying that X is Gaussian with mean mu hat and variance sigma hat squared; I'm not going to say the hats every single time. Actually, maybe I shouldn't use X here, because X is going to be my actual data — so let me write Y.

So now what is this guy? It implies that Phi_{mu hat, sigma hat^2}(t) is the probability that sigma hat Z + mu hat is less than t, which equals the probability that Z is less than (t - mu hat)/sigma hat. But since Z is standard normal, this is just the cumulative distribution function of a standard Gaussian, evaluated not at t but at (t - mu hat)/sigma hat. So in particular, what I get is that

Phi_{mu hat, sigma hat^2}(t) = Phi_{0,1}((t - mu hat)/sigma hat),

where Phi_{0,1} is just notation for the standard Gaussian CDF — usually we don't write the subscripts, but here it's more convenient. That's something you can quickly check: there's this nice way of writing the cumulative distribution function for any mean and any variance in terms of the cumulative distribution function with mean 0 and variance 1. Not too complicated.
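If you want to convince yourself of that identity numerically, here is a two-line sketch (the values are arbitrary illustration numbers, and scipy is assumed):

```python
from scipy.stats import norm

# Numerical check of Phi_{mu,sigma^2}(t) = Phi_{0,1}((t - mu)/sigma).
mu_hat, sigma_hat, t = 1.5, 2.0, 0.7                # arbitrary illustration values
print(norm.cdf(t, loc=mu_hat, scale=sigma_hat))     # Phi_{mu_hat, sigma_hat^2}(t)
print(norm.cdf((t - mu_hat) / sigma_hat))           # Phi_{0,1} at the standardized point
# Both lines print the same number, up to floating point.
```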
All right. So now I have this sup here, and what I can write is that this thing equals the sup over t in R of |(1/n) sum_{i=1}^n 1{Xi <= t} - Phi_{0,1}((t - mu hat)/sigma hat)| — I've just written out what F_n is. Now I actually want to make a change of variable, so I'm going to call this quantity u. To make my life easier, I'm going to make it appear on the other side as well: I replace the indicator by the indicator that (Xi - mu hat)/sigma hat <= (t - mu hat)/sigma hat, which is sort of useless at this point — I'm just making my formula more complicated. But now I see the same thing showing up on both sides, and I will call it u in both places.

So what that means is this: when t ranges from negative infinity to plus infinity, u = (t - mu hat)/sigma hat also ranges from negative infinity to plus infinity. So this sup over t I can write as the sup over u of |(1/n) sum_{i=1}^n 1{(Xi - mu hat)/sigma hat <= u} - Phi_{0,1}(u)|.

Now, let's pause for one second and see where we're going. We're trying to show that this thing does not depend on the unknown parameters mu and sigma, which are the mean and the variance of X under the null. To do that, we basically need to involve only quantities that are invariant under those values. The second term is fine: it's the standard Gaussian CDF, which depends on nothing — it doesn't involve sigma hat and mu hat anymore. But sigma hat and mu hat themselves will depend on mu and sigma, right? I mean, they're good estimators of those guys, so they should be pretty close to them.
So I need to make sure that I'm not actually doing anything wrong here. The key observation concerns (1/n) sum_{i=1}^n 1{(Xi - mu hat)/sigma hat <= u}, the first term inside that absolute value. Under the null, Xi follows N(mu, sigma^2) for some mu and sigma^2 that are unknown — they exist, I just don't know what they are. Then (Xi - mu hat)/sigma hat can be written as (sigma Zi + mu - mu hat)/sigma hat, where Zi = (Xi - mu)/sigma. That's the same trick I wrote before. Everybody agree? Just a standardization.

Now, once I write this, I can divide everybody by sigma — the top and the bottom — to get (Zi + (mu - mu hat)/sigma) / (sigma hat/sigma). So what I need to check is that the distribution of this guy does not depend on mu or sigma. That's what I claim.

What is the distribution of this indicator? It's a Bernoulli, right? So if I want to understand its distribution, all I need to do is compute its expectation, which is just the probability that the event happens. And that probability is actually not going to depend on mu and sigma. Here's why, piece by piece. Mu hat is Xn bar, so under the null mu hat follows N(mu, sigma^2/n) — that's the property of the average. So when I take (mu hat - mu)/sigma, what distribution does it have? It's still a normal — it's a linear transformation of a normal. What are the parameters?

AUDIENCE: 0, 1/n.
PHILIPPE RIGOLLET: Yeah, N(0, 1/n). And this does not depend on mu or sigma, right? Now I need to check that this other guy, sigma hat over sigma, does not depend on mu or sigma either. What is its distribution?

AUDIENCE: It's a chi-square, right?

PHILIPPE RIGOLLET: Yeah, it is a chi-square. So this is actually — n times sigma hat squared divided by sigma squared is a chi-square with n - 1 degrees of freedom. It does not depend on mu or sigma.

AUDIENCE: [INAUDIBLE]

AUDIENCE: Or sigma hat squared over sigma squared?

PHILIPPE RIGOLLET: Yeah, thank you. So it's the squared quantities that should appear in the ratio — let's write it like that; that's the proper way of writing it.

So now I have these two things, and neither of them depends on mu or sigma. There's just one more thing to check. What is it?

AUDIENCE: That they're independent?

PHILIPPE RIGOLLET: That they're independent, right. Because the dependence on mu and sigma could be hidden in the covariance. It could be the case that the marginal distribution of mu hat does not depend on mu or sigma, and the marginal distribution of sigma hat does not depend on mu or sigma, but their correlation depends on mu and sigma. But we also have independence: since mu hat is independent of sigma hat, the joint distribution of (mu hat - mu)/sigma and sigma hat/sigma does not depend on mu or sigma. Agreed? It's not in the individual ones, and it's not in the way they interact with each other. It's nowhere.

AUDIENCE: [INAUDIBLE] independence be [INAUDIBLE] theorem?

PHILIPPE RIGOLLET: Yeah — Cochran's theorem, right.
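Here is a quick Monte Carlo sanity check of those two distributional facts (a sketch of my own, not from the lecture; the constants are arbitrary). Whatever mu and sigma you simulate with, the answers come out the same:

```python
import numpy as np

# Check: (mu_hat - mu)/sigma should be N(0, 1/n), and n*sigma_hat^2/sigma^2
# should be chi-square with n - 1 degrees of freedom, for ANY mu and sigma.
rng = np.random.default_rng(0)
n, mu, sigma, reps = 20, 3.0, 2.0, 200_000
x = rng.normal(mu, sigma, size=(reps, n))

z = (x.mean(axis=1) - mu) / sigma        # should be N(0, 1/n)
print(z.mean(), z.var(), 1 / n)          # ~0 and ~0.05 vs 0.05

w = n * x.var(axis=1) / sigma**2         # x.var uses ddof=0, i.e. the 1/n estimator S_n
print(w.mean(), n - 1)                   # mean of chi^2_{n-1} is n - 1
# Rerunning with any other (mu, sigma) gives the same answers: the pair is pivotal.
```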
Cochran's theorem is something we've been using over and over again. And note that all of this is under the null: if my data is not Gaussian, none of it holds. I just used the fact that under the null the data is Gaussian for some mean mu and variance sigma squared. But that's all I care about — when I'm designing a test, I only care about the distribution under the null, at least to control the type I error. Then, to control the type II error, I cross my fingers pretty hard.

So this basically implies what's written on the board: this test statistic does not depend on any unknown parameters. It's pivotal. In particular, I could go to the back of a book and check whether there's a table for the quantiles of this thing — and indeed there is. This is the table that you see. Actually, this is not even in a book: it's in Lilliefors' original paper, from 1967, as you can tell from the typewriting. He probably was rolling some dice in his office back in the day — he simulated it, and this is how he computed those numbers. And here you also have a limiting distribution, which is not the sup of a Brownian bridge over [0, 1] — the one you would see for the Kolmogorov-Smirnov test — but something slightly different.

And as I said, these numbers are typically much smaller than the ones you would get for Kolmogorov-Smirnov. Remember, we got something around 0.5, I think — or maybe 0.41 — for the Kolmogorov-Smirnov test at the same table entry. That means that using the Kolmogorov-Lilliefors test, it's going to be harder for you not to reject for the same data. It might be the case that with one test you reject and with the other you fail to reject. But the ordering is always the same: if you fail to reject with Kolmogorov-Lilliefors, you will fail to reject with Kolmogorov-Smirnov.
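Lilliefors' simulation is easy to reproduce today. The sketch below (my own; the replication count is arbitrary) estimates the level-5% threshold for n = 50 by simulating the statistic under mu = 0, sigma = 1 — which pivotality makes legitimate, since the quantiles then apply for every mu and sigma:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def kl_statistic(x):
    """Kolmogorov-Smirnov distance with plugged-in mu_hat, sigma_hat."""
    n = len(x)
    cdf = norm.cdf(np.sort(x), loc=x.mean(), scale=x.std())
    grid = np.arange(1, n + 1) / n
    return max(np.max(grid - cdf), np.max(cdf - (grid - 1 / n)))

n, reps = 50, 20_000
stats = [kl_statistic(rng.normal(size=n)) for _ in range(reps)]
print(np.quantile(stats, 0.95))   # level-5% threshold for sample size 50
# This comes out clearly below the Kolmogorov-Smirnov threshold at the same entry.
```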
So that ordering is why people tend to close their eyes and prefer Kolmogorov-Smirnov: it just makes their life easier. This test is called Kolmogorov-Lilliefors — I think the spelling is slightly off on the slide, there's an I before the E — but that doesn't matter too much. Are there any questions? Yes?

AUDIENCE: Is there like a place you can point to like [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: So here is why it's actually a different distribution. If I knew what mu and sigma were, I would do exactly the same thing, but rather than having this average with mu hat and sigma hat, I would have the average with mu and sigma. So the key point is this. For the K-S test, what I would compare is (1/n) times a sum of Bernoullis — the indicators 1{(Xi - mu)/sigma <= t} — whose parameter is the probability that (Xi - mu)/sigma is less than t, which is just Phi_{0,1}(t); minus Phi_{0,1}(t); and then I take the sup over t. That's what I would have had, because this is exactly the right standardization: I remove the true mean and divide by the true standard deviation, so (Xi - mu)/sigma actually is a standard Gaussian, and that's why I'm allowed to use Phi_{0,1} here. Agreed? And these are Bernoullis because they're just indicators.

What happens in the Kolmogorov-Lilliefors test? Well, I still have a Bernoulli; the only thing that changes is that the parameter of the Bernoulli gets weird.
The parameter of the Bernoulli becomes the probability that (an N(0, 1) plus an N(0, 1/n)), divided by the square root of (a chi-square with n - 1 degrees of freedom divided by n), is less than t. The pieces coming from mu hat and sigma hat are independent of each other, but the leading N(0, 1) — which is Zi itself — is not necessarily independent of them. And why is this probability changing? Well, because the denominator is fluctuating a lot, and that makes the probability different. So that's basically where the difference comes from.

You can probably convince yourself quickly that this only makes the two terms — the empirical average and Phi_{0,1} — farther apart. And why farther apart? For a pretty clear reason. In the K-S case, the expectation of the Bernoulli is exactly that second term. Here, I think it's still true that the expectation of the Bernoulli is that term, but the fluctuations are going to be much bigger than those of a plain Bernoulli — because the first thing I do is draw a random parameter for my Bernoulli, and then I flip the Bernoulli. So the fluctuations are bigger than a Bernoulli's, and when I take the sup, I'm going to have to [INAUDIBLE] them. So it makes things farther apart, which makes it more likely for you to reject. Yeah?

AUDIENCE: You also said that if you compare the tables at the same level, the Lilliefors is at like 0.2 and the Smirnov is at 0.4.

PHILIPPE RIGOLLET: Yeah.

AUDIENCE: OK. So it means that with Lilliefors it's harder not to reject?

PHILIPPE RIGOLLET: It means that with Lilliefors it's harder not to reject, yes, because we reject when we're larger than the number. So with the number being smaller, for the same data we might be, right? Basically, it looks like this. Say this curve is the density of the test statistic for K-S, and then we have the density for Kolmogorov-Lilliefors, K-L.
And the density of K-L sits shifted like this. So if I want to squeeze probability alpha into the right tail of each, then this point is the quantile of order alpha of K-L, and that one is the quantile of order alpha of K-S — and the K-L quantile is the smaller of the two. Now you give me data, and what do I do with it? I check whether the statistic is larger than a number. If I apply K-S, I check whether I'm larger or smaller than the K-S quantile; if I apply Kolmogorov-Lilliefors, I check whether I'm larger or smaller than the K-L quantile. So over the entire range of values in between — and it is the same test statistic, I just plugged in mu hat and sigma hat — the two tests have different outcomes. And this is a big range in practice; I mean, it's pretty much at scale here. Any other questions? Yeah?

AUDIENCE: [INAUDIBLE] when n goes to infinity, the two tests become the same, right?

PHILIPPE RIGOLLET: Hmmm.

AUDIENCE: Looking at that formula--

PHILIPPE RIGOLLET: Yeah, they should become the same very far out. Let me see, though. So here we have, say, for 0.5 we get 0.886, and for — oh, I don't have it. Yeah, actually, sorry — you're right, you're totally right. These are the Brownian bridge values. Because in the limit, by Slutsky, the N(0, 1/n) term has no fluctuation and the chi-square term has no fluctuation; they're just pinned down, and it looks as if I had not replaced anything. In the limit, mu hat and sigma hat converge to mu and sigma much faster than the empirical CDF converges to the CDF, so those corrections become negligible. Actually, these are the numbers I showed you last time for the Brownian bridge, because I didn't have them for the Kolmogorov-Smirnov one.
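In software, that comparison is one line each way. The sketch below (my own, assuming scipy and statsmodels; the seed and sample are arbitrary) contrasts the naive plug-in K-S p-value with the Lilliefors-corrected one on the same Gaussian data:

```python
import numpy as np
from scipy.stats import kstest
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(2)
x = rng.normal(size=100)   # data that really is Gaussian

# "Eyes closed": plug in mu_hat, sigma_hat but keep the K-S null distribution.
# Because the K-S quantiles are too large once the parameters are estimated,
# this version almost never rejects — its p-values run too high.
print(kstest(x, 'norm', args=(x.mean(), x.std())))

# Lilliefors: same statistic, p-value from the corrected null distribution.
print(lilliefors(x, dist='norm'))
```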
So those are numerical ways of checking things, right? I give you data, you just crank the Kolmogorov-Smirnov test — usually you press F5 in MATLAB. But let's say you actually compute this entire thing: a number comes out, and you decide whether it's large enough or small enough. Of course, statistical software makes your life even simpler by spitting out a p-value — if you can compute quantiles, you can also compute p-values. So your life is fairly easy: red is bad, green is good, and off you go.

The problem is that those are numbers you want to rely on. But let's say you actually reject — say your p-value is just slightly below 5%. You could say, well, maybe I'm just going to change my threshold to 1%, but you might want to see what's actually happening. And for that you need a visual diagnostic. How do I check whether something departs from being normal, for example? How do I see why a distribution is not a uniform distribution, or not an exponential distribution? There are many, many ways to depart. If I have a supposedly exponential distribution and half of my values are negative, for example, there's a pretty obvious reason why it should not be exponential. But it could also be that just the tails are a little heavier, or there's more concentration at some point, or maybe it has two modes — things like this. And really, we don't believe the Gaussian is so important because of what it looks like close to 0.
What we like about the Gaussian is that the tails decay at this rate — exp(-x^2/2) — as we described in maybe the first lecture. In particular, if there were kinks near the center, it wouldn't matter too much; that's not what causes issues for the Gaussian. So what we want is a visual diagnostic that tells us whether the tails of my distribution are comparable to the tails of a Gaussian one, for example. And those are what's called quantile-quantile plots, or QQ plots. The basic QQ plots we're going to be using are the so-called normal QQ plots, which compare your data to a Gaussian, or normal, distribution — but in general you could compare your data to any distribution you want. And the way you do this is by comparing the quantiles of your data, the empirical quantiles, to the quantiles of the actual distribution you're trying to compare yourself to.

So this is, in a way, a visual way of performing these goodness-of-fit tests. And what's nice about a visual check is that there's room for debate: you can see something that somebody else cannot see — and you can always argue, because you want to say that things are Gaussian. We'll see some examples where you can actually claim it if you're good at debate, but where it's clearly not true.

All right. So this is a quick and easy check — something I do all the time. You give me data, this is one of the first things I run, so I can check whether I can enter the Gaussian world without compromising myself too much. And the idea is this: if my data comes from some F, and I know that F_n is close to F, then rather than computing some norm — a single number summarizing how far apart they are — I could plot the two functions and see whether they're far apart. So let's think for one second about what such a plot would look like.
Well, on the vertical axis everything happens between 0 and 1. Let's say my reference distribution is the Gaussian, so this curve is the CDF of N(0, 1). And on top of it I have the empirical CDF — remember, it's piecewise constant — so we get a piecewise constant curve for F_n. Just from this, even despite my bad drawing skills, it's clear that it's going to be hard for you to distinguish those two curves, even for a fairly large number of points. Because the problems are going to happen out in the tails, and there the two curves look pretty much the same. You might see differences in the middle, but we don't care too much about those differences.

So what's going to happen is that you want to compare those two things, and you basically have the information you want, but visually it just doesn't render very well, because you're not scaling things properly. The way we actually do it is by flipping things around: rather than comparing the plot of F_n to the plot of F, we compare the plot of F_n inverse to the plot of F inverse. Now, if F goes from the real line to the interval [0, 1], then F inverse goes from [0, 1] to the whole real line. So I'm going to be comparing things on "intervals" that are the entire real line.

Then at what values should I look at those inverses? Well, for F, if F is continuous, I can look at F inverse at any value I please: I have F, I pick a point u on the vertical axis, and F inverse of u is the value I read off on the horizontal axis. The problem is that with the piecewise constant empirical CDF, I need to decide what value to assign for anything that sits between two jumps.
I can choose whatever I want, but in practice it's just a choice I make myself. Maybe I decide it's this value, maybe that one — but for every level u strictly between two jumps, I'm going to get pretty much the same value out: the location of the jump. So rather than picking levels in between, I might as well pick only the levels at which a jump actually occurs, and those are exactly 1/n, 2/n, 3/n, all the way to n/n. That's exactly where the flat parts end; we jump by 1/n every time.

And that's exactly the recipe: look at the values 1/n, 2/n, 3/n, up to, say, (n-1)/n, and at those values compute the inverse of both the empirical CDF and the true CDF. Now, for the empirical CDF it's easy — I just told you, it's where the jumps occur. And the jumps occur exactly at my observations. Remember, I need to sort the observations to talk about them: the one where the i-th jump occurs is the i-th smallest observation, which we denoted by X sub (i). We had this notation: the data X1, ..., Xn gets sorted into X_(1) <= X_(2) <= ... <= X_(n), ordered from smallest to largest, and then we use the parenthesis subscripts. So in particular, F_n inverse of i/n is the location where the i-th jump occurs, which is the i-th smallest observation.
So for this guy, the y-values are fairly easy: they're basically my ordered observations. The x-values depend on the function F I'm trying to test. If it's the Gaussian, F inverse of i/n is just the quantile of order 1 - i/n — it's this q_{1 - i/n} that I need to compute, the inverse of the cumulative distribution function, which, given the formula for F, you can compute or estimate fairly well; it's something you can find in tables. Those are basically quantiles — inverses of CDFs are quantiles, right?

So that's what we're plotting. These are sometimes referred to as the theoretical quantiles — for the distribution we're trying to test — and the empirical quantiles — the ones that correspond to the empirical CDF. I'm making a plot where the x-axis is a quantile and the y-axis is a quantile, so I call it a quantile-quantile plot, or QQ plot — well, just say "quantile-quantile" ten times and you'll see why. Yeah?

AUDIENCE: [INAUDIBLE] have to have the [INAUDIBLE]?

PHILIPPE RIGOLLET: Well, we're back to the goodness-of-fit test, right? So you don't do it yourself — that's the simple answer. I'm just telling you what these plots, as spit out by software, are going to look like. Now, different software does different things. Say you want to test normality, as you asked: some software will plot F with the estimated parameters — the Gaussian CDF with mu hat and sigma hat — and that's fine. Some software will not do this; it will just use the standard Gaussian. But then it will have a different reference line.
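Before looking at what software does, it may help to see that the plot takes only a few lines to build by hand. Here is a sketch (my own, in Python; the (i - 0.5)/n plotting positions are one common convention for dodging the boundary problem at i = n that comes up later):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.normal(size=200)
n = len(x)

emp_q = np.sort(x)                                   # F_n^{-1}(i/n): the order statistics
theo_q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # standard normal quantiles

plt.scatter(theo_q, emp_q, s=10)
plt.plot(theo_q, theo_q)          # the 45-degree reference line
plt.xlabel("theoretical quantiles")
plt.ylabel("empirical quantiles")
plt.show()
```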
So what do we want to see here? What should happen if all my points actually come from a distribution that has CDF F? Well, since F_n should be close to F, F_n inverse should be close to F inverse, which means each plotted point should be close to the corresponding point on the diagonal. So ideally, if I picked the right F, I should see a plot where all my points are very close to the line y = x. There will be some fluctuations, but something very close to that line.

Now, that's if F is exactly the right one. If F is not exactly the right one — in particular, in the Gaussian case, if I plot against the quantiles of Phi_{0,1} while the data really has mean mu hat and standard deviation sigma hat — then, since we know that Phi_{mu hat, sigma hat^2}(t) = Phi_{0,1}((t - mu hat)/sigma hat), there's just a change of axis, a simple translation and scaling. That means the 45-degree line gets transformed into another line with a different slope and a different intercept. And some software will go with that line: it will just show you what the reference line should be, rather than putting everything back onto the 45-degree line.

AUDIENCE: So you're happy if you get any straight line?

PHILIPPE RIGOLLET: Any straight line, you're happy — well, depending on the software. Because if the software actually rescaled things using mu hat and sigma hat and you still find a different straight line, that's bad news — which is actually not going to happen. Well, if the data is crazy, it could, but it shouldn't be very crazy.
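Here is a quick sketch of that slope-and-intercept fact (my own illustration, with arbitrary parameter values): fitting a line through a QQ plot of N(mu, sigma^2) data against standard normal quantiles recovers roughly sigma and mu.

```python
import numpy as np
from scipy.stats import norm

# If the data is N(mu, sigma^2) but we plot against *standard* normal
# quantiles, the points still line up -- on the line y = sigma*x + mu
# rather than y = x, by the standardization identity above.
rng = np.random.default_rng(4)
mu, sigma, n = 5.0, 2.0, 1000
emp_q = np.sort(rng.normal(mu, sigma, size=n))
theo_q = norm.ppf((np.arange(1, n + 1) - 0.5) / n)

slope, intercept = np.polyfit(theo_q, emp_q, deg=1)
print(slope, intercept)   # close to sigma = 2 and mu = 5
```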
802 00:39:14,600 --> 00:39:20,380 So here in R, R actually does this funny trick where-- 803 00:39:20,380 --> 00:39:22,240 so here I did not actually plot the lines. 804 00:39:22,240 --> 00:39:23,573 I should actually add the lines. 805 00:39:23,573 --> 00:39:27,839 So the command is like qqnorm of my sample, right? 806 00:39:27,839 --> 00:39:28,880 And that's really simple. 807 00:39:28,880 --> 00:39:33,580 I just stack all my data into some vector, say, x. 808 00:39:33,580 --> 00:39:40,150 And I say qqnorm of x, and it just spits this thing out. 809 00:39:40,150 --> 00:39:40,720 OK? 810 00:39:40,720 --> 00:39:42,000 Very simple. 811 00:39:42,000 --> 00:39:44,262 But I could actually add another command, 812 00:39:44,262 --> 00:39:45,220 which I can't remember. 813 00:39:45,220 --> 00:39:50,670 I think it's like qqline, and it's just going 814 00:39:50,670 --> 00:39:52,980 to add the line on top of it. 815 00:39:52,980 --> 00:39:55,210 But if you see, actually what R does for us, 816 00:39:55,210 --> 00:39:58,830 it's actually doing the translation and scaling 817 00:39:58,830 --> 00:40:01,710 on the axes themselves. 818 00:40:01,710 --> 00:40:05,587 So it actually changes the x- and y-axes in such a 819 00:40:05,587 --> 00:40:07,170 way that when you look at your picture 820 00:40:07,170 --> 00:40:09,570 and you forget about what the meaning of the axes is, 821 00:40:09,570 --> 00:40:11,520 the relevant straight line is actually 822 00:40:11,520 --> 00:40:13,185 still the 45-degree line. 823 00:40:13,185 --> 00:40:17,605 That's because it's actually done the change of units for you. 824 00:40:17,605 --> 00:40:19,230 So you don't even have to see the line. 825 00:40:19,230 --> 00:40:21,630 You know, in your mind, that this is basically-- 826 00:40:21,630 --> 00:40:25,520 the reference line is still 45 degrees because that's 827 00:40:25,520 --> 00:40:27,050 the way the axes are made. 828 00:40:27,050 --> 00:40:29,940 But if I actually put my axes, right-- so here, for example, 829 00:40:29,940 --> 00:40:31,490 it goes from-- 830 00:40:31,490 --> 00:40:32,820 let's look at some-- 831 00:40:32,820 --> 00:40:36,310 well, OK, those are all square. 832 00:40:36,310 --> 00:40:38,810 Yeah, and that's probably because they actually have-- 833 00:40:38,810 --> 00:40:41,380 the samples are actually from a standard normal. 834 00:40:41,380 --> 00:40:43,245 So I did not make my life very easy 835 00:40:43,245 --> 00:40:45,120 to illustrate your question, but of course, I 836 00:40:45,120 --> 00:40:46,661 didn't know you were going to ask it. 837 00:40:46,661 --> 00:40:49,280 Next time, let's just prepare. 838 00:40:49,280 --> 00:40:50,760 Let's script more. 839 00:40:50,760 --> 00:40:52,620 We'll see another one in the next plot. 840 00:40:52,620 --> 00:40:54,540 But so here what you expect to see 841 00:40:54,540 --> 00:40:58,410 is that all the points should be on the 45-degree line, right? 842 00:40:58,410 --> 00:40:59,830 This should be the right one. 843 00:40:59,830 --> 00:41:02,850 And if you see, when I start having 10,000 samples, 844 00:41:02,850 --> 00:41:04,480 this is exactly what's happening. 845 00:41:04,480 --> 00:41:05,930 So this is as good as it gets. 846 00:41:05,930 --> 00:41:08,240 This is an N(0, 1) plotted against the theoretical 847 00:41:08,240 --> 00:41:10,300 quantiles of an N(0, 1). 848 00:41:10,300 --> 00:41:12,090 As good as it gets.
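The two commands just mentioned are indeed all it takes in R; here is a sketch of how panels like the ones on the slide could be generated (the sample sizes match the slide, the random samples of course don't):

    # R's built-in QQ plot: sample quantiles against standard normal quantiles.
    # qqline() adds a reference line through the first and third quartiles.
    par(mfrow = c(2, 2))
    for (n in c(10, 50, 100, 10000)) {
      x <- rnorm(n)     # swap in rt(n, df = 15) for the Student case coming next
      qqnorm(x, main = paste("n =", n))
      qqline(x)
    }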
849 00:41:12,090 --> 00:41:15,110 And if you see, for the second one, which is 50, 850 00:41:15,110 --> 00:41:16,610 sample size of size-- 851 00:41:16,610 --> 00:41:19,730 sample of size 50, there is some fudge factor, right? 852 00:41:19,730 --> 00:41:20,690 I mean, those things-- 853 00:41:20,690 --> 00:41:22,310 doesn't look like there's a straight line, right? 854 00:41:22,310 --> 00:41:24,851 It sort of appears that there are some weird things happening 855 00:41:24,851 --> 00:41:27,810 here at the lower tail. 856 00:41:27,810 --> 00:41:29,310 And the reason why this is happening 857 00:41:29,310 --> 00:41:32,400 is because we're trying to compare the tails, right? 858 00:41:32,400 --> 00:41:34,980 When I look at this picture, the only thing that goes wrong 859 00:41:34,980 --> 00:41:37,050 somehow is always at the tip, because those 860 00:41:37,050 --> 00:41:39,090 are sort of rare and extreme values, 861 00:41:39,090 --> 00:41:41,100 and they're sort of all over the place. 862 00:41:41,100 --> 00:41:44,610 And so things are never really super smooth and super clean. 863 00:41:44,610 --> 00:41:46,920 So this is what your best shot is. 864 00:41:46,920 --> 00:41:49,140 This is what you will ever hope to get. 865 00:41:49,140 --> 00:41:52,486 So size 10, right, so you have 10 points. 866 00:41:52,486 --> 00:41:54,360 Remember, we actually-- well, I didn't really 867 00:41:54,360 --> 00:41:56,220 tell you how to deal with the extreme cases. 868 00:41:56,220 --> 00:41:59,720 Because the problem is that F inverse of 1 for the true F 869 00:41:59,720 --> 00:42:01,050 is plus infinity. 870 00:42:01,050 --> 00:42:04,350 So you have to make some sort of weird boundary choices 871 00:42:04,350 --> 00:42:07,830 to decide what F inverse of 1 is, and it's something 872 00:42:07,830 --> 00:42:09,694 that's like somewhere. 873 00:42:09,694 --> 00:42:11,610 But you still want to put like 10 dots, right? 874 00:42:11,610 --> 00:42:15,450 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 dots. 875 00:42:15,450 --> 00:42:17,610 So I have 10 observations, you will see 10 dots. 876 00:42:17,610 --> 00:42:21,230 I have 50 observations, you will see 50 dots, right, 877 00:42:21,230 --> 00:42:22,230 because I have-- 878 00:42:22,230 --> 00:42:26,720 there are 1/n, 2/n, 3/n all the way to n/n. 879 00:42:26,720 --> 00:42:29,010 I didn't tell you the last one. 880 00:42:29,010 --> 00:42:29,510 OK. 881 00:42:29,510 --> 00:42:31,490 So this is when things go well, and this is 882 00:42:31,490 --> 00:42:32,881 when things should not go well. 883 00:42:32,881 --> 00:42:33,380 OK? 884 00:42:33,380 --> 00:42:35,030 So here, actually, the distribution 885 00:42:35,030 --> 00:42:37,670 is a Student's t with 15 degrees of freedom, 886 00:42:37,670 --> 00:42:41,180 which should depart somewhat from a Gaussian distribution. 887 00:42:41,180 --> 00:42:44,070 The tails should be heavier. 888 00:42:44,070 --> 00:42:47,700 And what you can see is basically the following, 889 00:42:47,700 --> 00:42:51,240 is that for 10 you actually see something that's crazy, right, 890 00:42:51,240 --> 00:42:52,980 if I do 10 observations. 891 00:42:52,980 --> 00:42:55,075 But if I do 50 observations, honestly, it's 892 00:42:55,075 --> 00:42:56,700 kind of hard to say that it's different 893 00:42:56,700 --> 00:42:58,560 from the standard normal. 894 00:42:58,560 --> 00:43:01,410 So you could still be happy with this for 100. 895 00:43:01,410 --> 00:43:03,840 And then this is what's happening for 10,000. 
896 00:43:03,840 --> 00:43:06,957 And even here it's not the beautiful straight line, 897 00:43:06,957 --> 00:43:08,790 but it feels like you would be still tempted 898 00:43:08,790 --> 00:43:11,580 to conclude that it's a beautiful straight line. 899 00:43:11,580 --> 00:43:13,800 So let's try to guess. 900 00:43:13,800 --> 00:43:18,420 So basically, there's-- for each of those sides there's two 901 00:43:18,420 --> 00:43:18,960 phenomena. 902 00:43:18,960 --> 00:43:22,080 Either it goes like this or it goes like this, 903 00:43:22,080 --> 00:43:24,960 and then it goes like this or it goes like this. 904 00:43:24,960 --> 00:43:28,200 Each side corresponds to the left tail, all the smallest 905 00:43:28,200 --> 00:43:29,360 values. 906 00:43:29,360 --> 00:43:30,360 So that's the left side. 907 00:43:30,360 --> 00:43:31,984 And that's the right side-- corresponds 908 00:43:31,984 --> 00:43:33,040 to the large values. 909 00:43:33,040 --> 00:43:33,540 OK? 910 00:43:33,540 --> 00:43:35,460 And so basically you can actually 911 00:43:35,460 --> 00:43:40,050 think of some sort of a table that tells you 912 00:43:40,050 --> 00:43:41,310 what your QQ plot looks like. 913 00:43:47,220 --> 00:43:48,650 And so let's say it looks-- 914 00:43:48,650 --> 00:43:50,840 so we have our reference 45-degree line. 915 00:43:50,840 --> 00:43:52,960 So let's say this is the QQ plot. 916 00:43:52,960 --> 00:43:54,820 That could be one thing. 917 00:43:54,820 --> 00:43:59,380 This could be the QQ plot where I have another thing. 918 00:43:59,380 --> 00:44:08,890 Then I can do this guy, and then I do this guy. 919 00:44:08,890 --> 00:44:10,690 So this is like this. 920 00:44:10,690 --> 00:44:11,642 OK? 921 00:44:11,642 --> 00:44:13,665 So those are the four cases. 922 00:44:13,665 --> 00:44:14,950 OK? 923 00:44:14,950 --> 00:44:19,000 And here what's changing is the right tail, 924 00:44:19,000 --> 00:44:20,970 and here what's changing is the-- 925 00:44:20,970 --> 00:44:24,820 and when I go from here to here, what changes is the left tail. 926 00:44:24,820 --> 00:44:26,851 Is that true? 927 00:44:26,851 --> 00:44:27,350 No, sorry. 928 00:44:27,350 --> 00:44:29,290 What changes here is the right tail, right? 929 00:44:29,290 --> 00:44:34,110 It's this part that changes from top to bottom. 930 00:44:34,110 --> 00:44:38,542 So here it's something about right tail, 931 00:44:38,542 --> 00:44:40,410 and here that's something about left tail. 932 00:44:44,060 --> 00:44:46,805 Everybody understands what I mean when I talk about tails? 933 00:44:46,805 --> 00:44:48,200 OK. 934 00:44:48,200 --> 00:44:50,120 And so here it's just going to be 935 00:44:50,120 --> 00:44:52,670 a question of whether the tails are heavier 936 00:44:52,670 --> 00:44:54,689 or lighter than the Gaussian. 937 00:44:54,689 --> 00:44:56,480 Everybody understand what I mean when I say 938 00:44:56,480 --> 00:44:58,640 heavy tails and light tails? 939 00:44:58,640 --> 00:44:59,600 OK. 940 00:44:59,600 --> 00:45:01,670 So right, so heavy tails just means 941 00:45:01,670 --> 00:45:04,880 that basically here the tails of this guy 942 00:45:04,880 --> 00:45:06,540 are heavier than the tails of this guy. 943 00:45:06,540 --> 00:45:08,785 So it means that if I draw them, they're going to be above. 944 00:45:08,785 --> 00:45:10,520 Actually, I'm going to keep this picture because it's 945 00:45:10,520 --> 00:45:11,811 going to be very useful for me. 
946 00:45:16,170 --> 00:45:19,650 When I plot the quantiles at the same-- so let's 947 00:45:19,650 --> 00:45:21,180 look at the right tail, for example. 948 00:45:21,180 --> 00:45:23,610 Right here my picture is for right tails. 949 00:45:23,610 --> 00:45:26,350 When I look at the quantiles of my theoretical distribution-- 950 00:45:26,350 --> 00:45:28,440 so here you can see, on the bottom curve 951 00:45:28,440 --> 00:45:31,420 we have the theoretical quantiles, 952 00:45:31,420 --> 00:45:34,800 and those are the empirical quantiles. 953 00:45:34,800 --> 00:45:39,090 If I look to the right here, are the theoretical quantiles 954 00:45:39,090 --> 00:45:41,770 larger or smaller than the empirical quantiles? 955 00:45:47,124 --> 00:45:48,290 Let me phrase it the other-- 956 00:45:48,290 --> 00:45:50,460 are the empirical quantiles larger or smaller 957 00:45:50,460 --> 00:45:53,250 than the theoretical quantiles? 958 00:45:53,250 --> 00:45:56,610 AUDIENCE: This is a graph of quantiles, right? 959 00:45:56,610 --> 00:45:59,072 So if it's [INAUDIBLE] it should be smaller. 960 00:45:59,072 --> 00:46:01,030 PHILIPPE RIGOLLET: It should be smaller, right? 961 00:46:01,030 --> 00:46:04,190 On this line, they are equal. 962 00:46:04,190 --> 00:46:07,180 So if I see the empirical quantile showing up here, 963 00:46:07,180 --> 00:46:10,510 it means that here the empirical quantile is less 964 00:46:10,510 --> 00:46:12,550 than the theoretical quantile. 965 00:46:12,550 --> 00:46:13,890 Agree? 966 00:46:13,890 --> 00:46:16,410 So that means that if I look at this thing-- 967 00:46:16,410 --> 00:46:18,540 and that's for the same values, right? 968 00:46:18,540 --> 00:46:22,440 So the quantiles are computed for the same values i/n. 969 00:46:22,440 --> 00:46:25,890 So it means that the empirical quantiles should be looking-- 970 00:46:25,890 --> 00:46:29,840 so that should be the empirical quantile, 971 00:46:29,840 --> 00:46:32,470 and that should be the theoretical quantile. 972 00:46:32,470 --> 00:46:34,390 Agreed? 973 00:46:34,390 --> 00:46:37,730 Those are the smaller values for the same alpha. 974 00:46:37,730 --> 00:46:41,300 So that implies that the tails-- 975 00:46:41,300 --> 00:46:43,880 the right tail, is it heavy or lighter-- 976 00:46:43,880 --> 00:46:45,530 heavier or lighter than the Gaussian? 977 00:46:50,390 --> 00:46:51,140 AUDIENCE: Lighter. 978 00:46:51,140 --> 00:46:52,200 PHILIPPE RIGOLLET: Lighter, right? 979 00:46:52,200 --> 00:46:54,033 Because those are the tails of the Gaussian. 980 00:46:54,033 --> 00:46:55,650 Those are my theoretical quantiles. 981 00:46:55,650 --> 00:46:59,580 That means that this is the tail of my empirical distribution. 982 00:46:59,580 --> 00:47:00,870 So they are actually lighter. 983 00:47:08,090 --> 00:47:09,250 OK? 984 00:47:09,250 --> 00:47:11,500 So here, if I look at this thing, 985 00:47:11,500 --> 00:47:18,240 this means that the right tail is actually light. 986 00:47:18,240 --> 00:47:20,800 And by light, I mean lighter than Gaussian. 987 00:47:20,800 --> 00:47:22,650 Heavy, I mean heavier than Gaussian. 988 00:47:22,650 --> 00:47:23,730 OK? 989 00:47:23,730 --> 00:47:27,150 OK, now we can probably do the entire thing. 990 00:47:27,150 --> 00:47:31,980 Well, if this is light, this is going to be heavy, right? 991 00:47:31,980 --> 00:47:33,520 That's when I'm above the curve. 992 00:47:36,820 --> 00:47:40,390 Exercise-- is this light or is this heavy, the first column? 993 00:47:46,970 --> 00:47:47,900 And it's OK.
994 00:47:47,900 --> 00:47:51,734 It should take you at least 30 seconds. 995 00:47:51,734 --> 00:47:53,570 AUDIENCE: [INAUDIBLE] different column? 996 00:47:53,570 --> 00:47:54,740 PHILIPPE RIGOLLET: Yeah, this column, right? 997 00:47:54,740 --> 00:47:56,240 So this is something that pertains-- 998 00:47:56,240 --> 00:47:59,080 this entire column is going to tell me whether the fact 999 00:47:59,080 --> 00:48:01,620 that this guy is above, does this 1000 00:48:01,620 --> 00:48:06,570 mean that I have lighter or heavier left tails? 1001 00:48:06,570 --> 00:48:09,050 AUDIENCE: Well, on the left, it's heavier. 1002 00:48:09,050 --> 00:48:11,150 PHILIPPE RIGOLLET: On the left, it's heavier. 1003 00:48:11,150 --> 00:48:12,090 OK. 1004 00:48:12,090 --> 00:48:12,672 I don't know. 1005 00:48:12,672 --> 00:48:14,130 Actually, I need to draw a picture. 1006 00:48:14,130 --> 00:48:17,348 You guys are probably faster than I am. 1007 00:48:17,348 --> 00:48:19,872 AUDIENCE: [INTERPOSING VOICES]. 1008 00:48:19,872 --> 00:48:21,330 PHILIPPE RIGOLLET: Actually, let me 1009 00:48:21,330 --> 00:48:23,400 check how much randomness is-- 1010 00:48:23,400 --> 00:48:26,430 who says it's lighter? 1011 00:48:26,430 --> 00:48:27,450 Who says it's heavier? 1012 00:48:27,450 --> 00:48:29,880 AUDIENCE: Yeah, but we're biased. 1013 00:48:29,880 --> 00:48:30,852 AUDIENCE: [INAUDIBLE] 1014 00:48:30,852 --> 00:48:32,018 PHILIPPE RIGOLLET: Yeah, OK. 1015 00:48:32,018 --> 00:48:33,610 AUDIENCE: [INAUDIBLE] 1016 00:48:33,610 --> 00:48:34,818 PHILIPPE RIGOLLET: All right. 1017 00:48:34,818 --> 00:48:36,760 So let's see if it's heavier. 1018 00:48:36,760 --> 00:48:40,786 So we're on the left tail, and so we have one looks like this, 1019 00:48:40,786 --> 00:48:41,910 one looks like that, right? 1020 00:48:45,410 --> 00:48:49,100 So we know here that I'm looking at this part here. 1021 00:48:49,100 --> 00:48:52,070 So it means that here my empirical quantile is larger 1022 00:48:52,070 --> 00:48:53,320 than the theoretical quantile. 1023 00:48:58,480 --> 00:49:00,350 OK? 1024 00:49:00,350 --> 00:49:02,030 So are my tails heavier or lighter? 1025 00:49:06,125 --> 00:49:07,280 They're lighter. 1026 00:49:07,280 --> 00:49:08,180 That was a bad bias. 1027 00:49:08,180 --> 00:49:10,299 AUDIENCE: [INAUDIBLE] 1028 00:49:10,299 --> 00:49:11,340 PHILIPPE RIGOLLET: Right? 1029 00:49:11,340 --> 00:49:14,660 It's below, so it's lighter. 1030 00:49:14,660 --> 00:49:19,100 Because the problem is that larger for the negative ones 1031 00:49:19,100 --> 00:49:22,068 means that it's smaller [INAUDIBLE],, right? 1032 00:49:22,068 --> 00:49:23,550 Yeah? 1033 00:49:23,550 --> 00:49:26,514 AUDIENCE: Sorry but, what exactly are these [INAUDIBLE]?? 1034 00:49:26,514 --> 00:49:28,984 If this is the inverse-- 1035 00:49:28,984 --> 00:49:32,936 if this is the inverse CDF, shouldn't everything-- 1036 00:49:32,936 --> 00:49:34,912 well, if this is the inverse CDF, 1037 00:49:34,912 --> 00:49:36,394 then you should only be inputting 1038 00:49:36,394 --> 00:49:38,864 values between 0 and 1 in it. 1039 00:49:38,864 --> 00:49:40,840 And-- 1040 00:49:40,840 --> 00:49:42,900 PHILIPPE RIGOLLET: Oh, did I put the inverse CDF? 1041 00:49:42,900 --> 00:49:46,814 AUDIENCE: Like on the previous slide, I think. 1042 00:49:46,814 --> 00:49:48,230 PHILIPPE RIGOLLET: No, the inverse 1043 00:49:48,230 --> 00:49:49,910 CDF, yeah, so I'm inputting-- 1044 00:49:49,910 --> 00:49:51,339 AUDIENCE: Oh, you're [INAUDIBLE].. 
1045 00:49:51,339 --> 00:49:53,630 PHILIPPE RIGOLLET: Yeah, so it's a scatter plot, right? 1046 00:49:53,630 --> 00:49:56,780 So each point is attached-- each point 1047 00:49:56,780 --> 00:49:59,990 is attached to 1/n, 2/n, 3/n. 1048 00:49:59,990 --> 00:50:01,600 Now, for each point I'm plotting, 1049 00:50:01,600 --> 00:50:05,060 that's my x-value, which maps a number between 0 and 1 1050 00:50:05,060 --> 00:50:09,690 back onto the entire real line, and my y-value is the same. 1051 00:50:09,690 --> 00:50:10,190 OK? 1052 00:50:10,190 --> 00:50:14,370 So what it means is that those two numbers, this is in the-- 1053 00:50:14,370 --> 00:50:17,330 this lives on the entire real line, not on the interval. 1054 00:50:17,330 --> 00:50:20,540 This lives on the entire real line, not in the interval. 1055 00:50:20,540 --> 00:50:26,630 And so my QQ plots take values on the entire real line, 1056 00:50:26,630 --> 00:50:28,660 entire real line, right? 1057 00:50:28,660 --> 00:50:31,915 So you think of it as a parameterized curve, where 1058 00:50:31,915 --> 00:50:34,610 the time steps are 1/n, 2/n, 3/n, 1059 00:50:34,610 --> 00:50:38,740 and I'm just like putting a dot every time I'm making one step. 1060 00:50:38,740 --> 00:50:41,470 OK? 1061 00:50:41,470 --> 00:50:43,540 OK, so what did we say? 1062 00:50:43,540 --> 00:50:46,356 That was lighter, right? 1063 00:50:46,356 --> 00:50:51,196 AUDIENCE: [INAUDIBLE] 1064 00:50:51,196 --> 00:50:54,110 PHILIPPE RIGOLLET: OK? 1065 00:50:54,110 --> 00:50:58,380 One of my favorite exercises is, here's a bunch of densities. 1066 00:50:58,380 --> 00:51:00,140 Here's a bunch of QQ plots. 1067 00:51:00,140 --> 00:51:04,490 Map the correct QQ plot to its own density. 1068 00:51:04,490 --> 00:51:05,980 All right? 1069 00:51:05,980 --> 00:51:09,220 And there won't be mingled lines that allow you to do that; 1070 00:51:09,220 --> 00:51:11,720 you just have to follow them, like the mazes at the back of cereal 1071 00:51:11,720 --> 00:51:13,070 boxes. 1072 00:51:13,070 --> 00:51:15,530 All right. 1073 00:51:15,530 --> 00:51:17,165 Are there any questions? 1074 00:51:17,165 --> 00:51:18,540 So there's two things 1075 00:51:18,540 --> 00:51:19,914 I'm trying to communicate here: 1076 00:51:19,914 --> 00:51:22,460 if you see a QQ plot, now you should understand, 1077 00:51:22,460 --> 00:51:28,350 one, how it was built, and two, whether it means that you have 1078 00:51:28,350 --> 00:51:30,520 heavier tails or lighter tails. 1079 00:51:30,520 --> 00:51:32,760 Now, let's look at this guy. 1080 00:51:32,760 --> 00:51:34,800 What should we see? 1081 00:51:34,800 --> 00:51:37,480 We should see heavy on the left and heavy on the right, right? 1082 00:51:37,480 --> 00:51:39,360 We know that this should be the case. 1083 00:51:39,360 --> 00:51:45,130 So this thing actually looks like this, and it sort of does, 1084 00:51:45,130 --> 00:51:46,250 right? 1085 00:51:46,250 --> 00:51:48,860 If I take this line going through here, 1086 00:51:48,860 --> 00:51:50,620 I can see that this guy's tipping here, 1087 00:51:50,620 --> 00:51:52,360 and this guy's dipping here. 1088 00:51:52,360 --> 00:51:57,670 But honestly-- actually, I can't remember exactly, but t 15, 1089 00:51:57,670 --> 00:52:01,570 if I plotted the density on top of the Gaussian, 1090 00:52:01,570 --> 00:52:02,776 you can see a difference.
1091 00:52:02,776 --> 00:52:04,900 But if I just gave it to you, it would be very hard 1092 00:52:04,900 --> 00:52:07,399 for you to tell me if there's an actual difference between t 1093 00:52:07,399 --> 00:52:08,950 15 and Gaussian, right? 1094 00:52:08,950 --> 00:52:11,076 Those things are actually very close. 1095 00:52:11,076 --> 00:52:12,700 And so in particular, here we're really 1096 00:52:12,700 --> 00:52:15,640 trying to recognize what the shape is, 1097 00:52:15,640 --> 00:52:16,140 right? 1098 00:52:16,140 --> 00:52:20,980 So t 15 compared to a standard Gaussian was different, 1099 00:52:20,980 --> 00:52:26,119 but t 15 compared to a Gaussian with a slightly larger variance 1100 00:52:26,119 --> 00:52:27,910 is not going to actually-- you're not going 1101 00:52:27,910 --> 00:52:29,090 to see much of a difference. 1102 00:52:29,090 --> 00:52:33,610 So in a way, such distributions are actually not 1103 00:52:33,610 --> 00:52:35,890 too far from the Gaussian, and it's not too-- 1104 00:52:35,890 --> 00:52:38,950 it's still pretty benign to conclude that this was actually 1105 00:52:38,950 --> 00:52:42,283 a Gaussian distribution because you can just use the variance 1106 00:52:42,283 --> 00:52:43,750 as a little bit of a buffer. 1107 00:52:43,750 --> 00:52:45,250 I'm not going to get really into how 1108 00:52:45,250 --> 00:52:50,500 you would use a t-distribution in a t-test, 1109 00:52:50,500 --> 00:52:54,420 because it's kind of like Inception, right? 1110 00:52:54,420 --> 00:52:58,150 So but you could pretend that your data actually 1111 00:52:58,150 --> 00:53:02,010 is t-distributed and then build a t-test from it, 1112 00:53:02,010 --> 00:53:03,570 but let's not say that. 1113 00:53:03,570 --> 00:53:05,490 Maybe that was a bad example. 1114 00:53:05,490 --> 00:53:08,280 But there's like other heavy-tailed distributions like the 1115 00:53:08,280 --> 00:53:10,825 Cauchy distribution, which doesn't even have a mean-- 1116 00:53:10,825 --> 00:53:12,450 it's not even integrable because that's 1117 00:53:12,450 --> 00:53:14,490 as heavy as the tails get. 1118 00:53:14,490 --> 00:53:18,760 And this you can really tell it's going to look like this. 1119 00:53:18,760 --> 00:53:22,010 It's going to be like pfft. 1120 00:53:22,010 --> 00:53:24,240 What does a uniform distribution look like? 1121 00:53:30,727 --> 00:53:32,210 Like this? 1122 00:53:32,210 --> 00:53:37,890 It's going to be-- it's going to look like a Gaussian one, 1123 00:53:37,890 --> 00:53:38,940 right? 1124 00:53:38,940 --> 00:53:41,030 So a uniform-- so this is my Gaussian. 1125 00:53:41,030 --> 00:53:43,130 A uniform is basically going to look like this, 1126 00:53:43,130 --> 00:53:46,260 once I take the right mean and the right variance, right? 1127 00:53:46,260 --> 00:53:48,480 So the tails are definitely lighter. 1128 00:53:48,480 --> 00:53:49,640 They're 0. 1129 00:53:49,640 --> 00:53:51,570 That's as light as it gets. 1130 00:53:51,570 --> 00:53:55,290 So the light-light is going to look like this S shape. 1131 00:53:55,290 --> 00:53:59,050 So an S-- a light-tailed distribution has this S shape. 1132 00:53:59,050 --> 00:53:59,820 OK? 1133 00:53:59,820 --> 00:54:02,520 What is the exponential going to look like? 1134 00:54:06,620 --> 00:54:08,500 So the exponential is positively supported. 1135 00:54:08,500 --> 00:54:10,430 It only has positive numbers. 1136 00:54:10,430 --> 00:54:11,750 So there's no left tail. 1137 00:54:11,750 --> 00:54:14,110 This is also as light as it gets.
1138 00:54:14,110 --> 00:54:16,480 But the right tail, is it heavier or lighter 1139 00:54:16,480 --> 00:54:17,230 than the Gaussian? 1140 00:54:17,230 --> 00:54:18,420 AUDIENCE: Heavier. 1141 00:54:18,420 --> 00:54:19,080 PHILIPPE RIGOLLET: It's heavier, right? 1142 00:54:19,080 --> 00:54:21,990 Its tail decays like e to the minus x rather than e to the minus 1143 00:54:21,990 --> 00:54:22,860 x squared. 1144 00:54:22,860 --> 00:54:23,760 So it's heavier. 1145 00:54:23,760 --> 00:54:27,620 So it means that on the left it's going to be light, 1146 00:54:27,620 --> 00:54:29,430 and on the right it's going to be heavy. 1147 00:54:29,430 --> 00:54:31,870 So it's going to be U-shaped. 1148 00:54:31,870 --> 00:54:32,370 OK? 1149 00:54:35,340 --> 00:54:37,100 That will be fine. 1150 00:54:37,100 --> 00:54:39,800 All right. 1151 00:54:39,800 --> 00:54:41,840 Any other question? 1152 00:54:41,840 --> 00:54:44,990 Again, two messages-- one more technical, 1153 00:54:44,990 --> 00:54:47,960 and one you can sort of fiddle with by looking at it. 1154 00:54:47,960 --> 00:54:49,670 You can definitely conclude that this 1155 00:54:49,670 --> 00:54:53,456 is OK enough to be Gaussian for your purposes. 1156 00:54:53,456 --> 00:54:53,956 Yeah? 1157 00:54:53,956 --> 00:54:59,591 AUDIENCE: So [INAUDIBLE] 1158 00:54:59,591 --> 00:55:01,340 PHILIPPE RIGOLLET: I did not hear the "if" 1159 00:55:01,340 --> 00:55:02,756 at the beginning of your sentence. 1160 00:55:06,431 --> 00:55:08,472 AUDIENCE: I would want to be lighter tail, right, 1161 00:55:08,472 --> 00:55:10,436 because that'll be-- it's easier to reject? 1162 00:55:10,436 --> 00:55:11,909 Is that correct? 1163 00:55:16,340 --> 00:55:20,272 PHILIPPE RIGOLLET: So what is your purpose as a-- 1164 00:55:20,272 --> 00:55:21,733 AUDIENCE: I want to-- 1165 00:55:21,733 --> 00:55:25,142 I have some [INAUDIBLE] right? 1166 00:55:25,142 --> 00:55:28,551 I want to be able to say I reject H0 [INAUDIBLE].. 1167 00:55:28,551 --> 00:55:29,525 PHILIPPE RIGOLLET: Yes. 1168 00:55:29,525 --> 00:55:32,203 AUDIENCE: So if you wanted to make it easier 1169 00:55:32,203 --> 00:55:35,002 to reject H0, then-- 1170 00:55:35,002 --> 00:55:37,210 PHILIPPE RIGOLLET: Yeah, in a way that's true, right? 1171 00:55:37,210 --> 00:55:40,440 So once you've actually factored in the mean and the variance, 1172 00:55:40,440 --> 00:55:43,190 the only thing that actually-- 1173 00:55:43,190 --> 00:55:43,690 right. 1174 00:55:43,690 --> 00:55:47,950 So if you have Gaussian tails or lighter-- even lighter tails, 1175 00:55:47,950 --> 00:55:51,460 then it's harder for you to explain deviations 1176 00:55:51,460 --> 00:55:52,780 from randomness only, right? 1177 00:55:52,780 --> 00:55:54,640 If you have a uniform distribution 1178 00:55:54,640 --> 00:55:56,250 and you see something which is-- 1179 00:55:56,250 --> 00:55:59,680 if you're uniform on 0, 1 plus some number and you see 25, 1180 00:55:59,680 --> 00:56:01,960 you know this number is not going to be 0, right? 1181 00:56:01,960 --> 00:56:04,120 So that's basically as good as it gets. 1182 00:56:04,120 --> 00:56:06,610 And there's basically some smooth interpolation 1183 00:56:06,610 --> 00:56:07,940 if you have lighter tails. 1184 00:56:07,940 --> 00:56:10,600 Now, if you start having something that has heavy tails, 1185 00:56:10,600 --> 00:56:12,880 then it's more likely that pure noise 1186 00:56:12,880 --> 00:56:15,880 will generate large observations and therefore discoveries.
1187 00:56:15,880 --> 00:56:19,160 So yes, lighter tails is definitely 1188 00:56:19,160 --> 00:56:21,440 the better-behaved noise. 1189 00:56:21,440 --> 00:56:22,520 Let's put it this way. 1190 00:56:22,520 --> 00:56:24,740 The lighter it is, the better behaved it is. 1191 00:56:24,740 --> 00:56:27,230 Now, this is good-- 1192 00:56:27,230 --> 00:56:30,140 this is good for some purposes, but when you want to compute 1193 00:56:30,140 --> 00:56:35,420 actual quantiles, like exact quantiles, 1194 00:56:35,420 --> 00:56:40,070 then it is true in general that the quantiles of lighter-tailed 1195 00:56:40,070 --> 00:56:42,520 distributions-- let's say 1196 00:56:42,520 --> 00:56:46,236 on the right tail-- 1197 00:56:46,236 --> 00:56:47,610 are going to be dominated 1198 00:56:47,610 --> 00:56:51,410 by those of a heavier distribution. 1199 00:56:51,410 --> 00:56:52,729 That is true far out in the tail. 1200 00:56:52,729 --> 00:56:54,020 But that's not always the case closer in. 1201 00:56:54,020 --> 00:56:54,980 And in particular, there's going to be 1202 00:56:54,980 --> 00:56:57,627 some sort of weird points where things are actually 1203 00:56:57,627 --> 00:56:59,960 changing depending on what level you're actually looking 1204 00:56:59,960 --> 00:57:01,964 at those things, maybe 5% or 10%, 1205 00:57:01,964 --> 00:57:04,130 in which case things might be changing a little bit. 1206 00:57:04,130 --> 00:57:06,171 But if you start going really towards the tail, 1207 00:57:06,171 --> 00:57:10,220 if you start looking at levels alpha which are 1% or 0.1%, 1208 00:57:10,220 --> 00:57:13,070 it is true that the domination always holds. 1209 00:57:13,070 --> 00:57:14,990 So if you see something 1210 00:57:14,990 --> 00:57:16,790 that looks light-tailed, you definitely 1211 00:57:16,790 --> 00:57:18,581 do not want to conclude that it's Gaussian. 1212 00:57:18,581 --> 00:57:21,080 You want to actually change your modeling so that it 1213 00:57:21,080 --> 00:57:23,240 makes your life even easier. 1214 00:57:23,240 --> 00:57:25,400 And you actually factor in the fact 1215 00:57:25,400 --> 00:57:27,830 that you can see that the noise is actually more benign 1216 00:57:27,830 --> 00:57:30,929 than you would like it to be. 1217 00:57:30,929 --> 00:57:31,429 OK? 1218 00:57:34,190 --> 00:57:35,440 Stretching fingers, that's it? 1219 00:57:35,440 --> 00:57:37,930 All right. 1220 00:57:37,930 --> 00:57:38,880 OK. 1221 00:57:38,880 --> 00:57:40,045 So I want to-- 1222 00:57:40,045 --> 00:57:43,380 I mentioned at some point that we had this chi-square test 1223 00:57:43,380 --> 00:57:45,270 that was showing up. 1224 00:57:45,270 --> 00:57:47,720 And I do not know what I did-- 1225 00:57:47,720 --> 00:57:49,260 let's just-- oh, yeah. 1226 00:57:49,260 --> 00:57:53,770 So we have this chi-square test that we worked on last time, 1227 00:57:53,770 --> 00:57:54,270 right? 1228 00:57:54,270 --> 00:57:57,420 So the way I introduced the chi-square test is by saying, 1229 00:57:57,420 --> 00:57:59,520 I am fascinated by this question. 1230 00:57:59,520 --> 00:58:01,380 Let's check if it's correct, OK? 1231 00:58:01,380 --> 00:58:04,230 Or something maybe slightly deeper-- 1232 00:58:04,230 --> 00:58:06,570 let's check if juries in this country 1233 00:58:06,570 --> 00:58:10,740 are representative of the racial distribution. 1234 00:58:10,740 --> 00:58:14,640 But you could actually-- those numbers here 1235 00:58:14,640 --> 00:58:16,046 come from a very specific thing.
1236 00:58:16,046 --> 00:58:16,920 That was the uniform. 1237 00:58:16,920 --> 00:58:17,878 That was our benchmark. 1238 00:58:17,878 --> 00:58:19,320 Here's the uniform. 1239 00:58:19,320 --> 00:58:21,690 And there was this guy, which was a benchmark, which 1240 00:58:21,690 --> 00:58:24,792 was the actual benchmark that we need to have for this problem. 1241 00:58:24,792 --> 00:58:27,000 And those things basically came out of my hat, right? 1242 00:58:27,000 --> 00:58:29,230 Those are numbers that exist. 1243 00:58:29,230 --> 00:58:33,120 But in practice, you actually make those numbers yourself. 1244 00:58:33,120 --> 00:58:36,360 And the way you do it is by saying, well, 1245 00:58:36,360 --> 00:58:39,760 if I have a binomial distribution 1246 00:58:39,760 --> 00:58:41,350 and I want to test if my data comes 1247 00:58:41,350 --> 00:58:42,969 from a binomial distribution, you 1248 00:58:42,969 --> 00:58:44,260 could ask this question, right? 1249 00:58:44,260 --> 00:58:45,580 You have a bunch of data. 1250 00:58:45,580 --> 00:58:48,070 I did not promise to you that this 1251 00:58:48,070 --> 00:58:50,920 was the sum of independent Bernoullis and [INAUDIBLE].. 1252 00:58:50,920 --> 00:58:53,800 And then you can actually check that it's 1253 00:58:53,800 --> 00:58:55,030 indeed a binomial. 1254 00:58:55,030 --> 00:58:57,580 If you think about where you've encountered binomials, 1255 00:58:57,580 --> 00:58:59,380 it was mostly when you were drawing balls 1256 00:58:59,380 --> 00:59:02,490 from urns, which you probably don't do that much in practice. 1257 00:59:02,490 --> 00:59:02,990 OK? 1258 00:59:02,990 --> 00:59:05,639 And so maybe one day you want to model things as a binomial, 1259 00:59:05,639 --> 00:59:07,430 or maybe you want to model it as a Poisson, 1260 00:59:07,430 --> 00:59:08,800 as a limiting binomial, right? 1261 00:59:08,800 --> 00:59:11,380 People tell you photons arrive-- 1262 00:59:11,380 --> 00:59:13,510 the number of photons hitting some surface 1263 00:59:13,510 --> 00:59:15,460 actually has a Poisson distribution, right? 1264 00:59:15,460 --> 00:59:18,330 That's where they arise a lot in imaging. 1265 00:59:18,330 --> 00:59:21,100 So I have a colleague who's taking pictures 1266 00:59:21,100 --> 00:59:23,714 of the sky at night, and he's like following stars, 1267 00:59:23,714 --> 00:59:26,380 just moving around with the rotation of the Earth. 1268 00:59:26,380 --> 00:59:28,637 And he has to do this for like eight hours 1269 00:59:28,637 --> 00:59:30,970 because he needs to get enough photons for this picture 1270 00:59:30,970 --> 00:59:32,150 to actually arise. 1271 00:59:32,150 --> 00:59:35,515 And he knows they arrive like a Poisson process, 1272 00:59:35,515 --> 00:59:39,830 and you know, chapter 7 of your probability class, I guess. 1273 00:59:39,830 --> 00:59:40,650 And there's 1274 00:59:40,650 --> 00:59:43,330 all these distributions 1275 00:59:43,330 --> 00:59:44,890 outside the classroom that you probably 1276 00:59:44,890 --> 00:59:46,724 want to check are actually correct. 1277 00:59:46,724 --> 00:59:49,139 And so the first one you might want to check, for example, 1278 00:59:49,139 --> 00:59:49,725 is a binomial. 1279 00:59:49,725 --> 00:59:52,540 So I give you a distribution, a binomial distribution 1280 00:59:52,540 --> 00:59:56,540 on, say, K trials, and you have some number p.
1281 00:59:56,540 --> 00:59:59,140 And here, I don't know typically what p should be, 1282 00:59:59,140 --> 01:00:01,824 but let's say I know it or estimate it from my data. 1283 01:00:01,824 --> 01:00:04,240 And here, since we're only going to deal with asymptotics, 1284 01:00:04,240 --> 01:00:07,000 just like it was the case for the Kolmogorov-Smirnov one, 1285 01:00:07,000 --> 01:00:08,860 in the asymptotics we're going to be 1286 01:00:08,860 --> 01:00:13,086 able to think of the estimated p as being the true p, OK, 1287 01:00:13,086 --> 01:00:15,340 under the null at least. 1288 01:00:15,340 --> 01:00:19,180 So therefore, for each outcome, I can actually tell you what 1289 01:00:19,180 --> 01:00:20,590 the probability that a binomial 1290 01:00:20,590 --> 01:00:21,340 takes this outcome is. 1291 01:00:21,340 --> 01:00:23,920 For a given K and a given p, I can tell you 1292 01:00:23,920 --> 01:00:25,690 exactly what a binomial should give you 1293 01:00:25,690 --> 01:00:27,800 as the probability for the outcome. 1294 01:00:27,800 --> 01:00:33,670 And that's what I actually use to replace the numbers 1/12, 1295 01:00:33,670 --> 01:00:41,290 1/12, 1/12, 1/12 or the numbers 0.72, 0.07, 0.12, 0.09. 1296 01:00:41,290 --> 01:00:43,420 All these numbers I can actually compute 1297 01:00:43,420 --> 01:00:45,640 using the probabilities of a binomial, right? 1298 01:00:45,640 --> 01:00:52,600 So I know, for example, that the probability that a binomial np 1299 01:00:52,600 --> 01:01:02,830 is equal to, say, K is n choose K p to the K 1 minus p 1300 01:01:02,830 --> 01:01:05,895 to the n minus K. OK? 1301 01:01:05,895 --> 01:01:07,300 I mean, so these are numbers. 1302 01:01:07,300 --> 01:01:08,800 If you give me p and you give me n, 1303 01:01:08,800 --> 01:01:12,710 I can compute those numbers for all K from 0 to n. 1304 01:01:12,710 --> 01:01:14,604 And from this I can actually build a table. 1305 01:01:22,060 --> 01:01:22,560 All right? 1306 01:01:22,560 --> 01:01:25,600 So for each K-- 1307 01:01:25,600 --> 01:01:26,340 0. 1308 01:01:26,340 --> 01:01:31,020 So K is here, and from 0, 1, et cetera, 1309 01:01:31,020 --> 01:01:35,640 all the way to n, I can compute the true probability, which 1310 01:01:35,640 --> 01:01:40,680 is the probability that my binomial np is equal to 0, 1311 01:01:40,680 --> 01:01:45,130 the probability that my binomial is equal to 1, et cetera, 1312 01:01:45,130 --> 01:01:46,440 all the way to n. 1313 01:01:46,440 --> 01:01:47,610 I can compute those numbers. 1314 01:01:47,610 --> 01:01:50,560 Those are actually going to be exact numbers, right? 1315 01:01:50,560 --> 01:01:52,952 I just plug in the formula that I had. 1316 01:01:52,952 --> 01:01:54,660 And then I'm going to have some observed probabilities. 1317 01:02:01,900 --> 01:02:05,460 So that's going to be p hat, 0, and that's basically 1318 01:02:05,460 --> 01:02:12,430 the proportion of 0's, right? 1319 01:02:12,430 --> 01:02:16,542 So here you have to remember it's not a one-time experiment 1320 01:02:16,542 --> 01:02:18,250 like you do in probability where you say, 1321 01:02:18,250 --> 01:02:22,390 I'm going to draw n balls from an urn, 1322 01:02:22,390 --> 01:02:24,100 and I'm counting how many-- 1323 01:02:24,100 --> 01:02:25,150 how many I have. 1324 01:02:25,150 --> 01:02:25,990 This is statistics. 1325 01:02:25,990 --> 01:02:28,990 I need to be able to do this experiment many times 1326 01:02:28,990 --> 01:02:31,910 so I can actually, in the end, get an idea of what 1327 01:02:31,910 --> 01:02:33,810 the proportions, the p's, are.
1328 01:02:33,810 --> 01:02:36,100 So you have not just one binomial, 1329 01:02:36,100 --> 01:02:38,300 but you have n binomials. 1330 01:02:38,300 --> 01:02:40,380 Well, maybe I should not use n twice. 1331 01:02:40,380 --> 01:02:42,080 So that's why it's the K here, right? 1332 01:02:42,080 --> 01:02:44,140 So I have a binomial [INAUDIBLE] at Kp 1333 01:02:44,140 --> 01:02:46,405 and I just see n of those guys. 1334 01:02:46,405 --> 01:02:48,280 And with these n guys, I can actually 1335 01:02:48,280 --> 01:02:50,072 estimate those probabilities. 1336 01:02:50,072 --> 01:02:51,530 And what I'm going to want to check 1337 01:02:51,530 --> 01:02:53,280 is if those two probabilities are actually 1338 01:02:53,280 --> 01:02:54,520 close to each other. 1339 01:02:54,520 --> 01:02:57,980 But I already know how to do this. 1340 01:02:57,980 --> 01:02:58,480 All right? 1341 01:02:58,480 --> 01:03:00,130 So here I'm going to test whether P 1342 01:03:00,130 --> 01:03:02,810 is in some parametric family, for example, 1343 01:03:02,810 --> 01:03:06,700 binomial or not binomial. 1344 01:03:06,700 --> 01:03:09,630 And testing-- if I know that it's a binomial [INAUDIBLE],, 1345 01:03:09,630 --> 01:03:12,870 and I basically just have to test if P is the right thing. 1346 01:03:12,870 --> 01:03:14,460 OK? 1347 01:03:14,460 --> 01:03:17,710 Oh, sorry, I'm actually lying to you here. 1348 01:03:17,710 --> 01:03:18,210 OK. 1349 01:03:18,210 --> 01:03:19,793 I don't want to test if it's binomial. 1350 01:03:19,793 --> 01:03:24,220 I want to test the parameter of the binomial here. 1351 01:03:24,220 --> 01:03:24,840 OK? 1352 01:03:24,840 --> 01:03:28,330 So I know-- no, sorry, [INAUDIBLE] sorry. 1353 01:03:28,330 --> 01:03:28,830 OK. 1354 01:03:28,830 --> 01:03:30,960 So I want to know if I'm in some family, 1355 01:03:30,960 --> 01:03:34,380 the family of binomials, or not in the family of binomials. 1356 01:03:34,380 --> 01:03:35,280 OK? 1357 01:03:35,280 --> 01:03:36,910 Well, that's what I want to do. 1358 01:03:36,910 --> 01:03:39,690 And so here H0 is basically equivalent to testing 1359 01:03:39,690 --> 01:03:42,750 if the pj's are the pj's that come from the binomial. 1360 01:03:42,750 --> 01:03:46,170 And the pj's here are the probabilities that I get. 1361 01:03:46,170 --> 01:03:50,180 This is the probability that I get j successes. 1362 01:03:50,180 --> 01:03:51,180 That's my pj. 1363 01:03:51,180 --> 01:03:54,370 That's the j-th value here. 1364 01:03:54,370 --> 01:03:54,870 OK? 1365 01:03:54,870 --> 01:03:57,600 So this is the example, and we know how to do this. 1366 01:03:57,600 --> 01:04:00,290 We construct p hat, which is the estimated 1367 01:04:00,290 --> 01:04:03,230 proportion of successes from the observations. 1368 01:04:03,230 --> 01:04:05,750 So here now I have n trials. 1369 01:04:05,750 --> 01:04:08,390 This is the actual maximum likelihood estimator. 1370 01:04:08,390 --> 01:04:12,230 This becomes a multinomial experiment, right? 1371 01:04:12,230 --> 01:04:13,430 So it's kind of confusing. 1372 01:04:13,430 --> 01:04:17,010 We have a multinomial experiment for a binomial distribution. 1373 01:04:17,010 --> 01:04:19,520 The binomial here is just a recipe 1374 01:04:19,520 --> 01:04:21,740 to create some test probabilities. 1375 01:04:21,740 --> 01:04:22,654 That's all it is. 1376 01:04:22,654 --> 01:04:24,320 The binomial here doesn't really matter. 1377 01:04:24,320 --> 01:04:26,539 It's really to create the test probabilities.
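As a sketch of how that table gets filled in, with a hypothetical sample of counts and dbinom() supplying the binomial probabilities:

    # True vs. observed probabilities for a Binomial(K, p) fit (illustration only)
    K <- 5
    y <- rbinom(200, size = K, prob = 0.3)   # n = 200 hypothetical observations
    n <- length(y)
    p.hat  <- mean(y) / K                    # MLE of p from the counts
    true.p <- dbinom(0:K, size = K, prob = p.hat)    # row of true probabilities
    obs.p  <- tabulate(y + 1, nbins = K + 1) / n     # observed proportion of each value 0..K
    rbind(true = true.p, observed = obs.p)

The statistic defined next is built from exactly these two rows.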
1378 01:04:26,539 --> 01:04:28,830 And then I'm going to define this test statistic, which 1379 01:04:28,830 --> 01:04:36,420 is known as the chi-square statistic, right? 1380 01:04:36,420 --> 01:04:37,910 This was the chi-square test. 1381 01:04:37,910 --> 01:04:41,490 We just looked at the sum of the squares of the differences. 1382 01:04:41,490 --> 01:04:45,004 Inverting the covariance matrix-- or using the Fisher information, 1383 01:04:45,004 --> 01:04:46,920 removing the part that was not invertible-- 1384 01:04:46,920 --> 01:04:50,410 led us to actually use this particular value here, 1385 01:04:50,410 --> 01:04:54,325 and then we had to multiply by n. 1386 01:04:54,325 --> 01:04:55,040 OK? 1387 01:04:55,040 --> 01:04:59,710 And that, we know, converges to what? 1388 01:04:59,710 --> 01:05:01,510 A chi-square distribution. 1389 01:05:01,510 --> 01:05:03,260 So I'm not going to go through this again. 1390 01:05:03,260 --> 01:05:05,218 I'm just telling you you can use the chi-square 1391 01:05:05,218 --> 01:05:08,090 that we've seen, where we just came up with the numbers we 1392 01:05:08,090 --> 01:05:09,020 were testing. 1393 01:05:09,020 --> 01:05:12,350 Those numbers that were in this row for the true probabilities, 1394 01:05:12,350 --> 01:05:14,224 we came up with them out of thin air. 1395 01:05:14,224 --> 01:05:15,890 And now I'm telling you you can actually 1396 01:05:15,890 --> 01:05:19,010 come up with those guys from a binomial distribution 1397 01:05:19,010 --> 01:05:20,905 or a Poisson distribution or whatever 1398 01:05:20,905 --> 01:05:22,196 distribution you're happy with. 1399 01:05:26,004 --> 01:05:26,956 Any question? 1400 01:05:30,300 --> 01:05:31,970 So now I'm creating this thing, and I 1401 01:05:31,970 --> 01:05:34,790 can apply the entire theory that I have for the chi-square 1402 01:05:34,790 --> 01:05:36,710 and, in particular, that this thing converges 1403 01:05:36,710 --> 01:05:38,846 to a chi-square. 1404 01:05:38,846 --> 01:05:40,970 But if you see, there's something that's different. 1405 01:05:40,970 --> 01:05:42,186 What is different? 1406 01:05:45,640 --> 01:05:47,850 The degrees of freedom. 1407 01:05:47,850 --> 01:05:51,990 And if you think about it, again, the meaning of degrees 1408 01:05:51,990 --> 01:05:52,510 of freedom. 1409 01:05:52,510 --> 01:05:54,020 What does this word-- 1410 01:05:54,020 --> 01:05:55,810 these words actually mean? 1411 01:05:55,810 --> 01:05:57,960 It means, well, to which extent can I 1412 01:05:57,960 --> 01:05:59,340 play around with those values? 1413 01:05:59,340 --> 01:06:01,230 What are the possible values that I can get? 1414 01:06:01,230 --> 01:06:03,990 If I'm not equal to this particular value I'm testing, 1415 01:06:03,990 --> 01:06:07,500 how many directions can I be different from this guy? 1416 01:06:07,500 --> 01:06:10,650 And when we had a given set of values, 1417 01:06:10,650 --> 01:06:13,170 we could be any other set of values, right? 1418 01:06:13,170 --> 01:06:16,140 So here, I had this-- 1419 01:06:16,140 --> 01:06:19,890 I'm going to represent-- this is the set of all probability 1420 01:06:19,890 --> 01:06:23,910 distributions of vectors of size K.
So here, 1421 01:06:23,910 --> 01:06:25,830 if I look at one point in this set, 1422 01:06:25,830 --> 01:06:29,530 this is something that looks like p1 through pK such that 1423 01:06:29,530 --> 01:06:30,360 their sum-- 1424 01:06:30,360 --> 01:06:36,520 such that they're non-negative, and the sum p1 through pK 1425 01:06:36,520 --> 01:06:37,190 is equal to 1. 1426 01:06:37,190 --> 01:06:37,690 OK? 1427 01:06:37,690 --> 01:06:40,000 So I have all those points here. 1428 01:06:40,000 --> 01:06:41,900 OK? 1429 01:06:41,900 --> 01:06:44,930 So this is basically the set that I had before. 1430 01:06:44,930 --> 01:06:47,210 I was testing whether I was equal to this one guy, 1431 01:06:47,210 --> 01:06:48,980 or if I was anything else. 1432 01:06:48,980 --> 01:06:51,157 And there's many ways I can be anything else. 1433 01:06:51,157 --> 01:06:53,240 What matters, of course, is what's around this guy 1434 01:06:53,240 --> 01:06:55,970 that I could actually confuse myself with. 1435 01:06:55,970 --> 01:06:58,050 But there's many ways I can move around this guy. 1436 01:06:58,050 --> 01:07:00,670 Agreed? 1437 01:07:00,670 --> 01:07:04,710 Now I'm actually just testing something very specific. 1438 01:07:04,710 --> 01:07:06,840 I'm saying, well, now the p's that I 1439 01:07:06,840 --> 01:07:09,180 have had to come from this-- have 1440 01:07:09,180 --> 01:07:13,560 to be constructed from this formula, this parametric family 1441 01:07:13,560 --> 01:07:14,840 P of theta. 1442 01:07:14,840 --> 01:07:20,130 And there's a fixed way for-- let's say this is theta, 1443 01:07:20,130 --> 01:07:23,340 so I have a theta here. 1444 01:07:23,340 --> 01:07:26,150 There's not that many ways this can actually give me 1445 01:07:26,150 --> 01:07:28,430 a set of probabilities, right? 1446 01:07:28,430 --> 01:07:31,110 I have to move to another theta to actually start 1447 01:07:31,110 --> 01:07:32,510 being confused. 1448 01:07:32,510 --> 01:07:34,940 And so here the number of degrees of freedom 1449 01:07:34,940 --> 01:07:39,200 is basically, how can I move along this family? 1450 01:07:39,200 --> 01:07:41,630 And so here, this is all the points, 1451 01:07:41,630 --> 01:07:43,160 but there might be just the subset 1452 01:07:43,160 --> 01:07:45,750 of the points that looks like this, just this curve, 1453 01:07:45,750 --> 01:07:48,680 not the whole of this thing. 1454 01:07:48,680 --> 01:07:56,210 And those guys on this curve are the p thetas, 1455 01:07:56,210 --> 01:08:00,020 and that's for all thetas, when theta runs across capital Theta. 1456 01:08:00,020 --> 01:08:03,060 So in a way, this is just a much smaller dimensional thing. 1457 01:08:03,060 --> 01:08:04,700 It's a much smaller object. 1458 01:08:04,700 --> 01:08:06,860 Those are only the ones that I can 1459 01:08:06,860 --> 01:08:13,100 create that are exactly of this very specific parametric form. 1460 01:08:13,100 --> 01:08:15,410 And of course, not all are of this form. 1461 01:08:15,410 --> 01:08:19,270 Not all PMFs are of this form. 1462 01:08:19,270 --> 01:08:20,939 And so that is going to have an effect 1463 01:08:20,939 --> 01:08:24,060 on what my PMF is going to be-- 1464 01:08:24,060 --> 01:08:28,830 sorry, on what my-- 1465 01:08:28,830 --> 01:08:33,689 sorry, what my degrees of freedom are going to be.
1466 01:08:33,689 --> 01:08:39,149 Because when this thing is very small, that means when-- 1467 01:08:39,149 --> 01:08:41,170 that's happening when theta is actually, 1468 01:08:41,170 --> 01:08:44,670 say, a one-dimensional space, then there's still 1469 01:08:44,670 --> 01:08:46,470 many ways I can escape, right? 1470 01:08:46,470 --> 01:08:48,450 I can be different from this guy in pretty 1471 01:08:48,450 --> 01:08:50,939 much every other direction, except for those two 1472 01:08:50,939 --> 01:08:53,910 directions, just when I move from here 1473 01:08:53,910 --> 01:08:56,050 or when I move in this direction. 1474 01:08:56,050 --> 01:09:00,120 But now if this thing becomes bigger, 1475 01:09:00,120 --> 01:09:03,399 your theta is, say, two dimensional, 1476 01:09:03,399 --> 01:09:06,090 then when I'm here it's becoming harder 1477 01:09:06,090 --> 01:09:07,229 for me to not be that guy. 1478 01:09:07,229 --> 01:09:08,812 If I want to move away from it, then I 1479 01:09:08,812 --> 01:09:11,460 have to move away from the board. 1480 01:09:11,460 --> 01:09:15,018 And so that means that the bigger the dimension 1481 01:09:15,018 --> 01:09:18,590 of my theta, the smaller the degrees of freedom 1482 01:09:18,590 --> 01:09:24,810 that I have, OK, because moving out of this parametric family 1483 01:09:24,810 --> 01:09:27,490 is actually very difficult for me. 1484 01:09:27,490 --> 01:09:30,930 So if you think, for example, as an extreme case, 1485 01:09:30,930 --> 01:09:36,580 the parametric family that I have is basically all PMFs, 1486 01:09:36,580 --> 01:09:38,069 all of them, right? 1487 01:09:38,069 --> 01:09:39,710 So that's a stupid parametric family. 1488 01:09:39,710 --> 01:09:41,890 I'm indexed by the distribution itself, 1489 01:09:41,890 --> 01:09:43,810 but it's still finite dimensional. 1490 01:09:43,810 --> 01:09:46,810 Then here, I have basically no degrees of freedom. 1491 01:09:46,810 --> 01:09:48,220 There's no way I can actually not 1492 01:09:48,220 --> 01:09:51,250 be that guy, because this is everything I have. 1493 01:09:51,250 --> 01:09:54,220 And so you don't really have to understand 1494 01:09:54,220 --> 01:09:59,050 how the computation comes out of these notions of dimension 1495 01:09:59,050 --> 01:10:01,300 and what I mean by the dimension of this curved space. 1496 01:10:01,300 --> 01:10:05,170 But really, what's important is that as the dimension of theta 1497 01:10:05,170 --> 01:10:09,350 becomes bigger, I have less degrees of freedom 1498 01:10:09,350 --> 01:10:11,640 to be away from this family. 1499 01:10:11,640 --> 01:10:13,730 This family becomes big, and it's very hard for me 1500 01:10:13,730 --> 01:10:14,990 to violate this. 1501 01:10:14,990 --> 01:10:17,210 So it's actually shrinking the number of degrees 1502 01:10:17,210 --> 01:10:18,907 of freedom of my chi-square. 1503 01:10:18,907 --> 01:10:20,490 And that's all you need to understand. 1504 01:10:20,490 --> 01:10:23,240 When d increases, the number of degrees of freedom decreases. 1505 01:10:23,240 --> 01:10:27,304 And I'd like you to have an idea of why this is somewhat 1506 01:10:27,304 --> 01:10:28,928 true, and this is basically the picture 1507 01:10:28,928 --> 01:10:30,068 you should have in mind. 1508 01:10:33,240 --> 01:10:33,740 OK. 1509 01:10:33,740 --> 01:10:35,920 So now once I have done this, I can just construct. 1510 01:10:35,920 --> 01:10:37,290 So here I need to check. 1511 01:10:37,290 --> 01:10:39,178 So what is d in the case of the binomial?
1512 01:10:42,590 --> 01:10:43,090 AUDIENCE: 1. 1513 01:10:43,090 --> 01:10:43,570 PHILIPPE RIGOLLET: 1, right? 1514 01:10:43,570 --> 01:10:44,980 It's just a one-dimensional thing. 1515 01:10:44,980 --> 01:10:46,396 And for most of the examples we're 1516 01:10:46,396 --> 01:10:48,440 going to have it's going to be one dimensional. 1517 01:10:48,440 --> 01:10:49,360 So we have this weird thing. 1518 01:10:49,360 --> 01:10:51,430 We're going to have K minus 2 degrees of freedom. 1519 01:10:54,580 --> 01:10:59,640 So now I have this thing, and I have this asymptotic. 1520 01:10:59,640 --> 01:11:02,310 And then I can just basically use a test that has-- 1521 01:11:02,310 --> 01:11:04,610 that uses the fact that the asymptotic distribution 1522 01:11:04,610 --> 01:11:05,110 is this. 1523 01:11:05,110 --> 01:11:06,870 So I compute my quantiles out of this. 1524 01:11:06,870 --> 01:11:08,210 Again, I made the same mistake. 1525 01:11:08,210 --> 01:11:11,490 This should be q alpha, and this should be q alpha. 1526 01:11:11,490 --> 01:11:13,110 So that's just the tail probability 1527 01:11:13,110 --> 01:11:16,699 is equal to alpha when I'm on the right of q alpha. 1528 01:11:16,699 --> 01:11:18,240 And so those are the tail probability 1529 01:11:18,240 --> 01:11:20,730 of the appropriate chi-square with the appropriate number 1530 01:11:20,730 --> 01:11:22,030 of degrees of freedom. 1531 01:11:22,030 --> 01:11:24,880 And so I can compute p-values, and I can do whatever I want. 1532 01:11:24,880 --> 01:11:25,380 OK? 1533 01:11:25,380 --> 01:11:28,510 So then I just like [INAUDIBLE] my testing machinery. 1534 01:11:28,510 --> 01:11:29,010 OK? 1535 01:11:29,010 --> 01:11:34,960 So now I know how to test if I'm a binomial distribution or not. 1536 01:11:34,960 --> 01:11:38,080 Again here, testing if I'm a binomial distribution 1537 01:11:38,080 --> 01:11:40,660 is not a simple goodness of fit. 1538 01:11:40,660 --> 01:11:43,040 It's a composite one where I can actually-- 1539 01:11:43,040 --> 01:11:45,910 there's many ways I can be a binomial distribution 1540 01:11:45,910 --> 01:11:48,260 because there's as many as there is theta. 1541 01:11:48,260 --> 01:11:51,700 And so I'm actually plugging in the theta hat, which is 1542 01:11:51,700 --> 01:11:54,380 estimated from the data, right? 1543 01:11:54,380 --> 01:11:57,370 And here, since everything's happening in the asymptotics, 1544 01:11:57,370 --> 01:12:00,790 I'm not claiming that Tn has a pivotal distribution 1545 01:12:00,790 --> 01:12:01,849 for finite n. 1546 01:12:01,849 --> 01:12:02,890 That's actually not true. 1547 01:12:02,890 --> 01:12:04,514 It's going to depend like crazy on what 1548 01:12:04,514 --> 01:12:06,150 the actual distribution is. 1549 01:12:06,150 --> 01:12:08,170 But asymptotically, I have a chi-square, 1550 01:12:08,170 --> 01:12:11,539 which obviously does not depend on anything [INAUDIBLE].. 1551 01:12:11,539 --> 01:12:13,511 OK? 1552 01:12:13,511 --> 01:12:14,497 Yeah? 1553 01:12:14,497 --> 01:12:19,920 AUDIENCE: So in general, for the binomial [INAUDIBLE] trials. 1554 01:12:19,920 --> 01:12:23,371 But in the general case, the number of-- 1555 01:12:23,371 --> 01:12:26,315 the size of our PMF is the number of [INAUDIBLE].. 1556 01:12:26,315 --> 01:12:27,315 PHILIPPE RIGOLLET: Yeah. 1557 01:12:27,315 --> 01:12:29,287 AUDIENCE: So let's say that I was also 1558 01:12:29,287 --> 01:12:32,738 uncertain about what K was so that I don't 1559 01:12:32,738 --> 01:12:37,668 know how big my [INAUDIBLE] is. 
1560 01:12:37,668 --> 01:12:48,580 [INAUDIBLE] 1561 01:12:48,580 --> 01:12:50,090 PHILIPPE RIGOLLET: That is correct. 1562 01:12:50,090 --> 01:12:54,670 And thank you for this beautiful segue into my next slide. 1563 01:12:54,670 --> 01:12:56,290 So we can actually deal with the case 1564 01:12:56,290 --> 01:12:57,640 not only where it's infinite, which 1565 01:12:57,640 --> 01:12:58,870 would be the case of Poisson. 1566 01:12:58,870 --> 01:13:00,244 I mean, nobody believes I'm going 1567 01:13:00,244 --> 01:13:02,620 to get an infinite number of photons 1568 01:13:02,620 --> 01:13:04,210 in a finite amount of time. 1569 01:13:04,210 --> 01:13:08,140 But we just don't want to have to say there's got to be a-- 1570 01:13:08,140 --> 01:13:09,910 this is the largest possible number. 1571 01:13:09,910 --> 01:13:10,870 We don't want to have to do that. 1572 01:13:10,870 --> 01:13:13,078 Because if you start doing this and the probabilities 1573 01:13:13,078 --> 01:13:16,370 become close to 0, things become degenerate and it's an issue. 1574 01:13:16,370 --> 01:13:18,220 So what we do is we bin. 1575 01:13:18,220 --> 01:13:19,890 We just bin stuff. 1576 01:13:19,890 --> 01:13:20,550 OK? 1577 01:13:20,550 --> 01:13:23,860 And so maybe if I have a binomial distribution 1578 01:13:23,860 --> 01:13:28,400 with, say, 200,000 possible values, 1579 01:13:28,400 --> 01:13:32,082 then it's actually maybe not the level of precision 1580 01:13:32,082 --> 01:13:33,040 I want to look at this. 1581 01:13:33,040 --> 01:13:33,870 Maybe I want to bin. 1582 01:13:33,870 --> 01:13:35,411 Maybe I want to say, let's just think 1583 01:13:35,411 --> 01:13:37,450 of all things that are between 0 and 100 1584 01:13:37,450 --> 01:13:40,765 to be the same thing, between 100 and 200 the same thing, 1585 01:13:40,765 --> 01:13:41,710 et cetera. 1586 01:13:41,710 --> 01:13:44,064 And so in fact, I'm actually going to bin. 1587 01:13:44,064 --> 01:13:46,480 I don't even have to think about things that are discrete. 1588 01:13:46,480 --> 01:13:49,120 I can even think about continuous cases. 1589 01:13:49,120 --> 01:13:51,850 And so if I want to test if I have a Gaussian distribution, 1590 01:13:51,850 --> 01:13:55,420 for example, I can just approximate that by some, 1591 01:13:55,420 --> 01:13:59,590 say, piecewise constant function that just says that, 1592 01:13:59,590 --> 01:14:03,370 well, if I have a Gaussian distribution like this, 1593 01:14:03,370 --> 01:14:06,484 I'm going to bin it like this. 1594 01:14:06,484 --> 01:14:08,650 And I'm going to say, well, the probability that I'm 1595 01:14:08,650 --> 01:14:10,150 less than this value is this. 1596 01:14:10,150 --> 01:14:12,692 The probability that I'm between this and this value is this. 1597 01:14:12,692 --> 01:14:14,650 The probability I'm between this and this value 1598 01:14:14,650 --> 01:14:18,370 is this, and then this and then this, right? 1599 01:14:18,370 --> 01:14:19,650 And now I've turned-- 1600 01:14:19,650 --> 01:14:24,240 I've discretized, effectively, my Gaussian into a PMF. 1601 01:14:24,240 --> 01:14:26,140 The value-- this is p1. 1602 01:14:26,140 --> 01:14:28,510 The value here is p1. 1603 01:14:28,510 --> 01:14:30,460 This is p2. 1604 01:14:30,460 --> 01:14:32,800 This is p3. 1605 01:14:32,800 --> 01:14:35,230 This is p4. 1606 01:14:35,230 --> 01:14:39,150 This is p5 and p6, right? 1607 01:14:39,150 --> 01:14:41,920 I have discretized my Gaussian into six possible values. 
1608 01:14:41,920 --> 01:14:46,650 That's just the probability that they fall into a certain bin. 1609 01:14:46,650 --> 01:14:47,865 And we can do this-- 1610 01:14:47,865 --> 01:14:51,590 if you don't know what K is, just stop at 10. 1611 01:14:51,590 --> 01:14:54,360 You look at your data quickly and you say, well, you know, 1612 01:14:54,360 --> 01:15:00,180 I have so few of them that are-- like I see maybe one 8, one 11, 1613 01:15:00,180 --> 01:15:01,590 and one 15. 1614 01:15:01,590 --> 01:15:03,270 Well, everything that's between 8 and 20, 1615 01:15:03,270 --> 01:15:05,130 I'm just going to put in one bin. 1616 01:15:05,130 --> 01:15:07,020 Because what else are you going to do? 1617 01:15:07,020 --> 01:15:09,490 I mean, you just don't have enough observations. 1618 01:15:09,490 --> 01:15:11,710 And so what we do is we just bin everything. 1619 01:15:11,710 --> 01:15:14,460 So here I'm going to actually be slightly abstract. 1620 01:15:14,460 --> 01:15:16,922 Our bins are going to be intervals Aj. 1621 01:15:16,922 --> 01:15:18,880 So here-- they don't even have to be intervals. 1622 01:15:18,880 --> 01:15:21,930 I could go crazy and just call the bin this guy 1623 01:15:21,930 --> 01:15:23,370 and this guy, right? 1624 01:15:23,370 --> 01:15:27,110 That would make no sense, but I could do that. 1625 01:15:27,110 --> 01:15:30,620 And then I'm-- and of course, you can do whatever you want, 1626 01:15:30,620 --> 01:15:33,180 but there are going to be some consequences for the conclusions 1627 01:15:33,180 --> 01:15:34,490 that you can draw, right? 1628 01:15:34,490 --> 01:15:35,906 All you're going to be able to say 1629 01:15:35,906 --> 01:15:38,790 is that my distribution does not look like it 1630 01:15:38,790 --> 01:15:40,800 could be binned in this way. 1631 01:15:40,800 --> 01:15:42,570 That's all you're going to be able to say. 1632 01:15:42,570 --> 01:15:46,800 So if you decide to just put all the negative numbers in one bin 1633 01:15:46,800 --> 01:15:48,357 and all the positive numbers in another, then it's 1634 01:15:48,357 --> 01:15:50,190 going to be very hard for you to distinguish 1635 01:15:50,190 --> 01:15:52,314 a Gaussian from a random variable that takes values 1636 01:15:52,314 --> 01:15:54,110 of minus 1 and plus 1 only. 1637 01:15:54,110 --> 01:15:57,490 You need to just be reasonable. 1638 01:15:57,490 --> 01:15:57,990 OK? 1639 01:15:57,990 --> 01:16:00,720 So now my pj's become the probability 1640 01:16:00,720 --> 01:16:02,590 that my random variable falls into bin j. 1641 01:16:06,600 --> 01:16:10,290 So that's pj of theta under the parametric distribution. 1642 01:16:10,290 --> 01:16:14,270 For the true one, whether it's parametric or not, I have a pj. 1643 01:16:14,270 --> 01:16:15,870 And then I have p hat j, which is 1644 01:16:15,870 --> 01:16:19,030 the proportion of observations that fall in this bin. 1645 01:16:19,030 --> 01:16:19,530 All right? 1646 01:16:19,530 --> 01:16:21,030 So I have a bunch of observations. 1647 01:16:21,030 --> 01:16:23,250 I count how many of them fall in this bin. 1648 01:16:23,250 --> 01:16:26,130 I divide by n, and that tells me what my estimated 1649 01:16:26,130 --> 01:16:29,410 probability for this bin is. 1650 01:16:29,410 --> 01:16:31,444 And theta hat, well, it's the same as before.
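[The empirical side is just counting. As a sketch, with the same hypothetical bin edges as above and a made-up sample:]

```python
# \hat p_j: the fraction of observations landing in bin j.
import numpy as np

x = np.random.randn(200)                                        # hypothetical sample
edges = np.array([-np.inf, -2.0, -1.0, 0.0, 1.0, 2.0, np.inf])

counts, _ = np.histogram(x, bins=edges)     # how many observations fall in each bin
p_hat = counts / len(x)                     # estimated bin probabilities
```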
1651 01:16:31,444 --> 01:16:32,860 If I'm in a parametric family, I'm 1652 01:16:32,860 --> 01:16:35,151 just estimating theta hat, maybe with the maximum likelihood 1653 01:16:35,151 --> 01:16:37,690 estimator, plugging it in, and estimating 1654 01:16:37,690 --> 01:16:39,700 those pj's of theta hat. 1655 01:16:39,700 --> 01:16:43,390 From this, I form my chi-square, and I have exactly 1656 01:16:43,390 --> 01:16:45,230 the same thing as before. 1657 01:16:45,230 --> 01:16:48,680 So the answer to your question is, yes, you bin. 1658 01:16:48,680 --> 01:16:51,690 And it's the answer to even more questions. 1659 01:16:51,690 --> 01:16:53,390 That's how you can actually 1660 01:16:53,390 --> 01:16:56,420 use the chi-square test to test for normality. 1661 01:16:56,420 --> 01:16:58,850 Now here it's going to be slightly weaker, 1662 01:16:58,850 --> 01:17:00,800 because there's only an asymptotic theory, 1663 01:17:00,800 --> 01:17:03,920 whereas Kolmogorov-Smirnov and Kolmogorov-Lilliefors actually 1664 01:17:03,920 --> 01:17:06,230 work even for finite samples. 1665 01:17:06,230 --> 01:17:08,600 For the chi-square test, it's only asymptotic. 1666 01:17:08,600 --> 01:17:11,300 So you just pretend you actually know what the parameters are. 1667 01:17:11,300 --> 01:17:15,250 You just stuff in a theta hat-- a mu hat 1668 01:17:15,250 --> 01:17:16,670 and a sigma squared hat. 1669 01:17:16,670 --> 01:17:19,280 And you just cross your fingers 1670 01:17:19,280 --> 01:17:21,020 that n is large enough for everything 1671 01:17:21,020 --> 01:17:24,161 to have converged by the time you make your decision. 1672 01:17:24,161 --> 01:17:24,660 OK? 1673 01:17:24,660 --> 01:17:28,440 And then this is a copy/paste, with the same error actually 1674 01:17:28,440 --> 01:17:31,710 as the previous slide, where you just build your test based 1675 01:17:31,710 --> 01:17:34,560 on whether or not you exceed some quantile, 1676 01:17:34,560 --> 01:17:37,721 and you can also compute some p-value. 1677 01:17:37,721 --> 01:17:38,220 OK? 1678 01:17:38,220 --> 01:17:39,120 AUDIENCE: The error? 1679 01:17:39,120 --> 01:17:40,328 PHILIPPE RIGOLLET: I'm sorry? 1680 01:17:40,328 --> 01:17:41,559 AUDIENCE: What's the error? 1681 01:17:41,559 --> 01:17:43,100 PHILIPPE RIGOLLET: What is the error? 1682 01:17:43,100 --> 01:17:45,575 AUDIENCE: You said [INAUDIBLE] copy/paste [INAUDIBLE]. 1683 01:17:45,575 --> 01:17:47,450 PHILIPPE RIGOLLET: Oh, the error is that this 1684 01:17:47,450 --> 01:17:48,520 should be q alpha, right? 1685 01:17:48,520 --> 01:17:49,190 AUDIENCE: OK. 1686 01:17:49,190 --> 01:17:51,273 PHILIPPE RIGOLLET: I've been calling this q alpha. 1687 01:17:51,273 --> 01:17:53,459 I mean, that's my personal choice, 1688 01:17:53,459 --> 01:17:54,500 because I don't want to-- 1689 01:17:54,500 --> 01:17:55,820 I only use q alpha. 1690 01:17:55,820 --> 01:17:59,644 So I only use quantiles where alpha is to the right. 1691 01:17:59,644 --> 01:18:01,310 That's what statisticians do-- probabilists 1692 01:18:01,310 --> 01:18:02,970 would use this notation. 1693 01:18:07,041 --> 01:18:07,540 OK. 1694 01:18:07,540 --> 01:18:10,000 And so some questions, right? 1695 01:18:10,000 --> 01:18:11,820 So of course, in practice you're going 1696 01:18:11,820 --> 01:18:13,650 to have some issues, which translate into questions like, 1697 01:18:13,650 --> 01:18:16,010 well, how do you pick this guy, this K?
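[Putting the pieces together, here is a hedged sketch of the whole chi-square normality test just described: plug in mu hat and sigma squared hat, bin, and compare to a chi-square with K minus 1 minus 2 degrees of freedom. The function name, the bin edges, and the data are my own choices, and as the lecture stresses, the guarantee is only asymptotic.]

```python
import numpy as np
from scipy.stats import norm, chi2

def chisq_normality_test(x, edges, alpha=0.05):
    n = len(x)
    mu_hat, sigma_hat = x.mean(), x.std()        # plug-in estimates; x.std() uses the
                                                 # 1/n variance Sn, and d = 2 parameters
    p_theta = np.diff(norm.cdf(edges, loc=mu_hat, scale=sigma_hat))
    counts, _ = np.histogram(x, bins=edges)
    p_hat = counts / n
    T_n = n * np.sum((p_hat - p_theta) ** 2 / p_theta)
    df = len(p_theta) - 1 - 2                    # K bins, minus 1, minus 2 parameters
    p_value = chi2.sf(T_n, df)
    return T_n, p_value, p_value < alpha         # True means reject normality

x = np.random.randn(500)                                     # hypothetical sample
edges = np.array([-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf])    # 5 bins, so df = 2
print(chisq_normality_test(x, edges))
```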
1698 01:18:16,010 --> 01:18:17,610 So I gave you some sort of a-- 1699 01:18:17,610 --> 01:18:19,810 I mean, the way we discussed, right? 1700 01:18:19,810 --> 01:18:23,220 You have 8 and 10 and 20, and then it's ad hoc. 1701 01:18:23,220 --> 01:18:27,120 And so whether you want to stop K at 20 1702 01:18:27,120 --> 01:18:29,610 or you want to bin those guys is really up to you. 1703 01:18:29,610 --> 01:18:31,050 And there are going to be some considerations 1704 01:18:31,050 --> 01:18:32,591 about the particular problem at hand. 1705 01:18:32,591 --> 01:18:34,180 I mean, is it coarse-- too coarse 1706 01:18:34,180 --> 01:18:38,070 for your problem to decide that the observations between 8 1707 01:18:38,070 --> 01:18:39,644 and 20 are the same? 1708 01:18:39,644 --> 01:18:40,560 It's really up to you. 1709 01:18:40,560 --> 01:18:42,476 Maybe that's actually making a huge difference 1710 01:18:42,476 --> 01:18:45,420 in terms of what phenomenon you're looking at. 1711 01:18:45,420 --> 01:18:46,770 The choice of the bins, right? 1712 01:18:46,770 --> 01:18:48,450 So here there are actually some sorts 1713 01:18:48,450 --> 01:18:51,870 of rules, which are: don't use only one bin, 1714 01:18:51,870 --> 01:18:55,200 and don't make the bins too small-- 1715 01:18:55,200 --> 01:18:57,710 make sure there's at least one observation per bin, right? 1716 01:18:57,710 --> 01:18:59,010 And it's basically the same kind of rules 1717 01:18:59,010 --> 01:19:00,360 that you would use to build a histogram. 1718 01:19:00,360 --> 01:19:02,280 If you were to build a histogram for your data, 1719 01:19:02,280 --> 01:19:03,780 you still want to make sure that you 1720 01:19:03,780 --> 01:19:05,030 bin in an appropriate fashion. 1721 01:19:05,030 --> 01:19:05,530 OK? 1722 01:19:05,530 --> 01:19:08,052 And there are a bunch of rules of thumb. 1723 01:19:08,052 --> 01:19:09,510 Every time you ask someone, they're 1724 01:19:09,510 --> 01:19:11,176 going to have a different rule of thumb, 1725 01:19:11,176 --> 01:19:13,850 so just make your own. 1726 01:19:13,850 --> 01:19:17,580 And then there's the computation of pj 1727 01:19:17,580 --> 01:19:19,530 of theta, which might be a bit complicated 1728 01:19:19,530 --> 01:19:21,450 because, in this case, I would have 1729 01:19:21,450 --> 01:19:24,030 to integrate the Gaussian between this number 1730 01:19:24,030 --> 01:19:25,270 and this number. 1731 01:19:25,270 --> 01:19:27,120 So for this case, I could just say, well, 1732 01:19:27,120 --> 01:19:30,150 it's the difference of the CDF at that value and that value, 1733 01:19:30,150 --> 01:19:31,440 and then be happy with it. 1734 01:19:31,440 --> 01:19:33,606 But you can imagine that you have some slightly more 1735 01:19:33,606 --> 01:19:34,574 crazy distributions. 1736 01:19:34,574 --> 01:19:36,240 You're going to have to compute 1737 01:19:36,240 --> 01:19:39,630 some integrals that might be unpleasant to compute. 1738 01:19:39,630 --> 01:19:40,180 OK? 1739 01:19:40,180 --> 01:19:41,846 And in particular, I said it's the difference 1740 01:19:41,846 --> 01:19:44,680 of the PDF between that value and that value-- sorry, 1741 01:19:44,680 --> 01:19:47,722 the CDF between that value and that value. That is true. 1742 01:19:47,722 --> 01:19:49,180 But it's not like you actually have 1743 01:19:49,180 --> 01:19:52,480 tables that give you the CDF at any value you like, right?
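[For what it's worth, one common rule of thumb, not the lecture's, and everyone has their own as just noted, is to widen or merge bins until every expected count n times pj of theta hat is at least around 5. A tiny sketch with made-up numbers:]

```python
import numpy as np

def bins_ok(n, p_theta, min_expected=5.0):
    # Flag a binning whose expected counts n * p_j are too small for the
    # chi-square asymptotics to be trustworthy; the threshold 5 is just
    # one common convention, not a theorem.
    expected = n * np.asarray(p_theta)
    return expected, bool(np.all(expected >= min_expected))

expected, ok = bins_ok(200, [0.05, 0.25, 0.40, 0.25, 0.05])
print(expected, ok)   # smallest expected count is 10 here, so this binning passes
```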
1744 01:19:52,480 --> 01:19:54,445 You have to sort of-- 1745 01:19:54,445 --> 01:19:56,212 well, there might be tables to some degree, 1746 01:19:56,212 --> 01:19:58,420 but typically you are going to have to use a computer 1747 01:19:58,420 --> 01:20:01,050 to do that. 1748 01:20:01,050 --> 01:20:01,550 OK? 1749 01:20:01,550 --> 01:20:05,270 And so for example, you could do the Poisson. 1750 01:20:05,270 --> 01:20:07,489 If I had time, if I had more than one minute, 1751 01:20:07,489 --> 01:20:08,780 I would actually do it for you. 1752 01:20:08,780 --> 01:20:10,340 But it's basically the same. 1753 01:20:10,340 --> 01:20:12,560 With the Poisson, you are going to have an infinite tail, 1754 01:20:12,560 --> 01:20:14,018 and you just say, at some point I'm 1755 01:20:14,018 --> 01:20:16,560 going to cut off everything that's larger than some value. 1756 01:20:16,560 --> 01:20:17,060 All right? 1757 01:20:17,060 --> 01:20:20,727 So you can play around, right? 1758 01:20:20,727 --> 01:20:23,310 I say, well, if you have extra knowledge about what you expect 1759 01:20:23,310 --> 01:20:26,000 to see, maybe you can cut at a certain number 1760 01:20:26,000 --> 01:20:30,530 and then just fold all the largest values, from K minus 1 1761 01:20:30,530 --> 01:20:35,630 to infinity, so that you actually 1762 01:20:35,630 --> 01:20:37,891 have everything in one large bin. 1763 01:20:37,891 --> 01:20:38,390 OK? 1764 01:20:38,390 --> 01:20:39,980 That's the entire tail. 1765 01:20:39,980 --> 01:20:42,350 And that's the way people do it in insurance companies, 1766 01:20:42,350 --> 01:20:42,869 for example. 1767 01:20:42,869 --> 01:20:45,410 They assume that the number of accidents you're going to have 1768 01:20:45,410 --> 01:20:47,300 follows a Poisson distribution. 1769 01:20:47,300 --> 01:20:48,620 They have to fit it to you. 1770 01:20:48,620 --> 01:20:49,680 Or at least they have to fit it 1771 01:20:49,680 --> 01:20:52,970 to your pool of insured people. 1772 01:20:52,970 --> 01:20:56,390 So they just slice you up by your 1773 01:20:56,390 --> 01:20:58,187 relevant characteristics, and then they 1774 01:20:58,187 --> 01:21:00,270 want to estimate what the Poisson distribution is. 1775 01:21:00,270 --> 01:21:03,760 And basically, they can do a chi-square test 1776 01:21:03,760 --> 01:21:06,980 to check if it's indeed a Poisson distribution. 1777 01:21:06,980 --> 01:21:07,480 All right. 1778 01:21:07,480 --> 01:21:10,070 So that will be it for today. 1779 01:21:10,070 --> 01:21:11,330 And so I'll be-- 1780 01:21:11,330 --> 01:21:13,800 I'll have your homework--
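[Since the lecture ran out of time for the Poisson example, here is a hedged sketch of what it would look like: estimate lambda, cut the infinite tail at some K, and fold the whole tail probability into one last bin, as described above. The cut point and the data are made up.]

```python
import numpy as np
from scipy.stats import poisson, chi2

x = np.random.poisson(3.0, size=300)   # hypothetical counts (e.g., accident numbers)
n = len(x)
lam_hat = x.mean()                     # MLE of lambda
K = 8                                  # cut point, chosen by the user

# PMF on {0, ..., K-1}, plus the folded tail P(X >= K) as one large bin
p_theta = np.append(poisson.pmf(np.arange(K), lam_hat),
                    poisson.sf(K - 1, lam_hat))

counts = np.bincount(np.minimum(x, K), minlength=K + 1)  # values >= K land in bin K
p_hat = counts / n

T_n = n * np.sum((p_hat - p_theta) ** 2 / p_theta)
df = (K + 1) - 1 - 1                   # K + 1 bins, minus 1, minus 1 parameter
print(T_n, chi2.sf(T_n, df))           # statistic and asymptotic p-value
```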