The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: --124. If I were to repeat this 1,000 times, so every one of those 1,000 times I collect 124 data points, and then I do it again and again and again, then on average the number I get should be close to the true parameter that I'm looking for. The fluctuations that are due to the fact that I get a different sample every time should somewhat vanish. And so what I want is to have a small bias, hopefully a zero bias. If this thing is 0, then we say that the estimator is unbiased.

So this is definitely a property that we are going to be looking for in an estimator: we try to find estimators that are unbiased. But we'll see that it's actually maybe not enough, so unbiasedness should not be something you lose sleep over. Something that's slightly better is the risk, really the quadratic risk, which is an expectation: if I have an estimator theta hat, I'm going to look at the expectation of (theta hat n minus theta) squared. And what we showed last time is that, by adding and removing the expectation of theta hat inside, this thing can be decomposed as the square of the bias plus the variance, which is just the expectation of (theta hat minus its expectation) squared. That came from the fact that when I added and removed the expectation of theta hat in there, the cross terms cancel. All right. So that was the bias squared, and this is the variance.

And so, for example, if the quadratic risk goes to 0, then that means that theta hat converges to theta in the L2 sense.
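For reference, here is the add-and-subtract computation being described at the board, written out using only the lecture's own definitions:

```latex
\mathbb{E}\big[(\hat\theta_n-\theta)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat\theta_n]-\theta\big)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}\Big[\big(\hat\theta_n-\mathbb{E}[\hat\theta_n]\big)^2\Big]}_{\text{variance}},
\quad\text{since the cross term } 2\big(\mathbb{E}[\hat\theta_n]-\theta\big)\,
\mathbb{E}\big[\hat\theta_n-\mathbb{E}[\hat\theta_n]\big]=0.
```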
And here we know that if we want this to go to 0, since it's the sum of two nonnegative terms, we need both the bias to go to 0 and the variance to go to 0, so we need to control both of those things. And there is usually an inherent trade-off between getting a small bias and getting a small variance. If you reduce one too much, then the other one is going to increase, or the opposite. That happens a lot, but not so much, actually, in this class.

So let's just look at a couple of examples. So am I planning-- yeah. So, examples. Say, for example, X1 through Xn are iid Bernoulli, and I'm going to write the parameter as theta so that we keep the same notation. Then theta hat-- what is the theta hat that we proposed many times? It's just Xn bar, the average of the Xi's. So what is the bias of this guy? Well, to know the bias, I just have to subtract theta from the expectation. What is the expectation of Xn bar? Well, by linearity of the expectation, it's just the average of the expectations. But since all my Xi's are Bernoulli with the same theta, each of these guys is actually equal to theta. So this thing is actually theta, which means that this is unbiased, right?

Now, what is the variance of this guy? So if you forgot the properties of the variance for sums of independent random variables, now it's time to wake up. So we have the variance of something that looks like 1 over n times the sum from i equals 1 to n of the Xi. So it's of the form: variance of a constant times a random variable. So the first thing I'm going to do is pull out the constant. But we know that the variance lives on the square scale, so when I pull a constant out of the variance, it comes out with a square. The variance of a times X is a squared times the variance of X, so this is equal to 1 over n squared times the variance of the sum.

So now we want to do what we always want to do.
So we have the variance of the sum. We would like somehow to say that this is the sum of the variances. In general, we are not allowed to say that, but here we are, because my Xi's are actually independent. So this is actually equal to 1 over n squared times the sum from i equals 1 to n of the variance of each of the Xi's. And that's by independence, so this is basic probability.

And now, what is the variance of the Xi's? Again, they all have the same distribution, so the variance of Xi is the same as the variance of X1. And so each of those guys has variance what? What is the variance of a Bernoulli? We've said it once. It's theta times (1 minus theta). And so now I'm going to have the sum of n times a constant, so I get n times the constant divided by n squared, so one of the n's is going to cancel. And so the whole thing here is actually equal to theta(1 minus theta) divided by n.

So if I'm interested in the quadratic risk-- and again, I should just say risk, because this is the only risk we're going to be actually looking at. Yeah, this parenthesis should really stop here; I really wanted to put "quadratic" in parentheses. So the risk of this guy is what? Well, it's the expectation of (Xn bar minus theta) squared. And we know it's the square of the bias, which we know is 0, so it's 0 squared, plus the variance, which is theta(1 minus theta) divided by n. So it's just theta(1 minus theta) divided by n.

So this is just summarizing the performance of an estimator, which is a random variable. I mean, it's complicated. If I really wanted to describe it, I would tell you the entire distribution of this random variable. But what I'm doing now is saying, well, let's just take this random variable, remove theta from it, and see how small the fluctuations around theta-- the squared fluctuations around theta-- are in expectation. So that's what the quadratic risk is doing.
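A minimal simulation sketch of this computation: the true theta, the sample size, and the number of repetitions below are arbitrary illustrative choices, not values from the lecture. It checks empirically that the sample mean of Bernoullis is unbiased, that its variance is theta(1 minus theta)/n, and that the risk splits as bias squared plus variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 124, 20_000  # arbitrary illustrative values

# Each row is one repetition of the experiment: n Bernoulli(theta) draws,
# and the estimator is the sample mean of that row.
theta_hat = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)

bias = theta_hat.mean() - theta            # ~0: the estimator is unbiased
variance = theta_hat.var()                 # ~theta*(1-theta)/n
risk = np.mean((theta_hat - theta) ** 2)   # equals bias**2 + variance

print(bias, variance, theta * (1 - theta) / n, risk)
```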
And in a way, this decomposition as the sum of the bias squared and the variance is really accounting for two things. The bias asks: even if I had an infinite amount of observations, is this thing doing the right thing? And the variance asks: for a finite number of observations, what are the fluctuations?

All right. Now you can see that those two things, bias and variance, are actually very different. So I don't have any colors here, so you're going to have to really follow the order in which I draw those curves. All right. So I'm going to give you three candidate estimators for theta.

The first one is definitely Xn bar. That will be a good candidate estimator. The second one is going to be 0.5, because after all, why should I bother, right? So for example, if I ask you to predict the score of some candidate in some election, then since you know it's going to be very close to 0.5, you might as well just throw out 0.5, and you're not going to be very far from reality. And it's actually going to cost you zero time and zero dollars to come up with that. So sometimes maybe just a good old guess is actually doing the job for you. Of course, for presidential elections or something like this, it's not very helpful if your prediction just tells you 0.5. But for anything whose answer is close to 1/2, that would be a good way to generate a guess. For a coin, for example, if I give you a coin, you never know-- maybe it's slightly biased. But unless something crazy is happening with its structure, a good guess, just from inspecting it, is 0.5, without trying to collect information.

And let's find another one, which is, well, you know, I could have a lot of observations. But I'm recording couples kissing, and I'm on a budget.
I don't have time to travel all around the world and collect data on people. So really, I'm just going to look at the first couple and go home. So my other estimator is just going to be X1. I just take the first observation, 0 or 1, and that's it.

So now I want to actually understand what the behavior of those guys is. All right. So we know that for this guy, Xn bar, the bias is 0 and the variance is equal to theta(1 minus theta) divided by n. What is the bias of this guy, 0.5?

AUDIENCE: 0.5.

AUDIENCE: 0.5 minus theta?

PHILIPPE RIGOLLET: 0.5 minus theta, right. So the bias is 0.5 minus theta. What is the variance of this guy? What is the variance of 0.5?

AUDIENCE: It's 0.

PHILIPPE RIGOLLET: 0, right. It's just a deterministic number, so there are no fluctuations for this guy. What about X1? Well, just for simplicity, I can think of it as being X1 bar, the average of itself, so that wherever I saw an n for the first estimator, I can replace it by 1, and that will give me my formula. So the bias is still going to be 0. And the variance is going to be equal to theta(1 minus theta).

So now I have those three estimators. Well, if I compare X1 and Xn bar, then clearly I have 0 bias in both cases. That's good. And the variance is actually n times smaller when I use my n observations than when I don't. So on these two fronts, you can look at the two numbers and say, well, the first number is the same, and the second number is better for Xn bar, so I will definitely go for Xn bar compared to X1. So X1 is gone. But not 0.5. Its variance is 0, which always beats the variance of Xn bar. And if I look at the bias, it's actually really not that bad. It's 0.5 minus theta.
In particular, if theta is 0.5, then this guy is strictly better. And so you can actually now look at what the quadratic risk looks like. So here, what I'm going to do is take my true theta, which is going to range between 0 and 1. And we know that those risks are functions of theta, so I can only understand them if I plot them as functions of theta. So the y-axis is going to be the risk.

So what is the risk of the estimator 0.5? This one is easy. Well, it's 0 plus the square of (0.5 minus theta). So we know that at theta equals 0.5, it's going to be 0. And then it's a square. So at 0, it's going to be 0.25, and at 1, it's going to be 0.25 as well. So it looks like this. Well, actually, sorry, let me put the 0.5 where it should be. OK. So this curve here is the risk of 0.5. So when theta is very close to 0.5, I'm very happy. When theta gets farther, it's a little bit annoying.

And then here, I want to plot the risk of Xn bar. Now, the thing with the risk of this guy is that it depends on n. So I will just pick some n that I'm happy with, just so that I can actually draw a curve. Otherwise, I would have to plot one curve per value of n. So let's just say, for example, that n is equal to 10. And so now I need to plot the function theta(1 minus theta) divided by 10. We know that theta(1 minus theta) is a curve that goes like this. It takes its maximum value, 1/4, at 1/2, and it's 0 at the two ends. So really, if n were equal to 1, this is what the variance would look like. The bias doesn't count in the risk for this estimator. Yeah?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Sure. Can you move? All right. Are you guys good? All right.
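The picture being drawn on the board can be reproduced with a few lines. A small sketch, using n = 10 as chosen in the lecture (the grid resolution is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 200)
n = 10  # the value picked in the lecture

risk_mean = theta * (1 - theta) / n   # Xn bar: bias 0, variance theta(1-theta)/n
risk_half = (0.5 - theta) ** 2        # constant 0.5: bias 0.5-theta, variance 0
risk_x1 = theta * (1 - theta)         # X1: bias 0, variance theta(1-theta)

plt.plot(theta, risk_mean, label=r"$\bar{X}_n$ ($n=10$)")
plt.plot(theta, risk_half, label=r"constant $0.5$")
plt.plot(theta, risk_x1, label=r"$X_1$")
plt.xlabel(r"$\theta$"); plt.ylabel("quadratic risk"); plt.legend()
plt.show()
```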
So now I have this picture. And I know the parabola goes up to 0.25. And there's a place where those curves cross. So if you're sure-- let's say you're talking about a presidential election, you know that those things are going to be really close to 0.5. Maybe you're actually better off predicting 0.5 if you know it's not going to go too far. But that's for one observation, so that's the risk of X1. If I look at the risk of Xn bar, all I'm doing is crushing this curve down towards 0. So as n increases, it's going to look more and more like this: it's the same curve divided by n.

And so now I can start to understand that, for different values of theta, theta is going to have to be very close to 1/2 before I can say that Xn bar is worse than the naive estimator 0.5. Yeah?

AUDIENCE: Sorry. I know you explained a little bit before, but can you just-- what is an intuitive definition of risk? What is it actually describing?

PHILIPPE RIGOLLET: So, well, when you have an unbiased estimator, it's simple: it's just the variance, because the theta that you have in the definition of the risk, if you're unbiased, is really the expectation of theta hat. So that's really just the variance. So the risk is telling you how much fluctuation I have around my expectation, if unbiased. But in general, it's telling you how much fluctuation I have, on average, around theta. So if you understand the notion of variance as being--

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: What?

AUDIENCE: Like variance on average.

PHILIPPE RIGOLLET: No.

AUDIENCE: No.

PHILIPPE RIGOLLET: It's just like variance.

AUDIENCE: Oh, OK.
PHILIPPE RIGOLLET: So when you-- I mean, if you claim you understand what variance is, it's telling you the expected squared fluctuation around the expectation of my random variable. It's just telling you, on average, how far I'm going to be. And you take the square because you want to cancel the signs; otherwise, you're going to get 0.

AUDIENCE: Oh, OK.

PHILIPPE RIGOLLET: And here it's saying, well, I really don't care what the expectation of theta hat is. What I want to get to is theta, so I'm looking at the expectation of the squared fluctuations around theta itself. If I'm unbiased, it coincides with the variance. But if I'm biased, then I have to account for the fact that I'm really not computing the--

AUDIENCE: OK. OK. Thanks.

PHILIPPE RIGOLLET: OK? All right. Are there any questions? So here, what I really want to illustrate is that the risk itself is a function of theta most of the time. And so for different thetas, some estimators are going to be better than others. But there's also an entire range of estimators in between, ones with some bias that can be traded against variance. Here, you see, you have no bias but the variance can be large, or you have a bias but the variance is 0. So you can actually have this trade-off, and in general you can find things along the entire range.

So those trade-offs between bias and variance are usually much better illustrated when we're talking about multivariate parameters. If I look at a parameter which is the mean of some multivariate Gaussian, so an entire vector, then I can make the bias bigger by, for example, forcing all the coordinates of my estimator to be the same.
So here, I'm going to get some bias, but the variance is actually going to be much better, because I get to average all the coordinates for this guy. And so really, the bias/variance trade-off shows up when you have multiple parameters to estimate-- a vector of parameters, a multivariate parameter. The bias increases when you're trying to pool information across the different components in order to get a lower variance. The more you average, the lower the variance. That's exactly what we've illustrated: as n increases, the variance decreases, like 1 over n, or theta(1 minus theta) over n. And this is how it happens in general. In this class, it's mostly one-dimensional parameter estimation, so it's going to be a little harder to illustrate that. But if you do, for example, non-parametric estimation, that's all you do: there are just bias/variance trade-offs all the time. And in between, when you have high-dimensional parametric estimation, that happens a lot as well.

OK. So I'm just going to go quickly through these two remaining slides, because we've actually seen them. But I just wanted you to have somewhere a formal definition of what a confidence interval is. So we fix a statistical model for n observations, X1 to Xn. The parameter theta here is one-dimensional: the parameter set Theta is a subset of the real line, and that's why I talk about intervals. An interval is a subset of the line. If I had a subset of R2, for example, that would no longer be called an interval but a region-- well, we could say a set, a confidence set, but people like to say confidence region. So an interval is just a one-dimensional confidence region. And it has to be an interval as well.

So, a confidence interval of level 1 minus alpha. The quality of a confidence interval is referred to as its level. It takes value 1 minus alpha for some positive alpha.
And so the level of the confidence interval is between 0 and 1. The closer to 1 it is, the better the confidence interval; the closer to 0, the worse.

So a confidence interval is a random interval. The bounds of this interval depend on random data-- just like we had X bar plus or minus 1 over square root of n, for example, or 2 over square root of n. This X bar was the random thing that made those bounds fluctuate. So I have an interval, and I have its boundaries, but the boundaries are not allowed to depend on the unknown parameter. Otherwise, it's not a confidence interval, just like an estimator that depends on the unknown parameter is not an estimator. The confidence interval has to be something that I can compute once I collect data.

And so what I want is that-- so there's this notation that may look weird. The way I write it, it's the probability that I contains theta. You're used to seeing "theta belongs to I." But here, I really want to emphasize that the randomness is in I. And so the way you actually say it when you read this formula is: the probability that I contains theta is at least 1 minus alpha. So it had better be close to 1. You want 1 minus alpha to be very close to 1, because it's really telling you that, whatever random interval I'm giving you, my error bars are actually covering the right theta.

And I want this to be true. But since I don't know what my parameter theta is, I want this to hold for all possible values of the parameter that nature may have come up with. So theta changes here, so the distribution of the interval is actually changing with theta, hopefully, and theta is changing with this guy. So regardless of the value of theta that I'm getting, I want the probability that the interval contains theta to be larger than 1 minus alpha. So I'll come back to it in a second.
I just want to say that here, we can also talk about asymptotic level. And that's typically when you use the central limit theorem to compute this guy. Then you're not guaranteed that the coverage is at least 1 minus alpha for every n, but in the limit it's larger than 1 minus alpha. So maybe for each fixed n it's not true, but as n goes to infinity, it becomes true. If you want this to hold for every n, you actually need to use things such as Hoeffding's inequality, which we described at some point, and which holds for every n. So as a rule of thumb, if you use the central limit theorem, you're dealing with a confidence interval of asymptotic level 1 minus alpha. And the reason is that you actually use the quantiles of the Gaussian distribution that comes from the central limit theorem. If you use Hoeffding's, for example, you might actually get a confidence interval that's valid even non-asymptotically. That's just a regular confidence interval.

So this is the formal definition. It's a bit of a mouthful. But the best way to understand confidence intervals is to build them. Now, at some point I said-- and I think it was part of the homework-- so here, I really say the probability that the confidence interval contains the true parameter is at least 1 minus alpha. And that's because this confidence interval is still a random variable. Now, if I start plugging in numbers instead of the random variables X1 to Xn-- if I start putting 1, 0, 0, 1, 0, 0, 1, like I did for the kiss example-- then in this case, the realized interval is going to be something like [0.42, 0.65]. And for this guy, the probability that theta belongs to it is not 1 minus alpha. It's either 0, if theta is not in there, or 1, if it is in there.
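Both points can be seen in a simulation: across many repetitions, the fraction of random intervals covering the true p is at least 1 minus alpha, while each single realized interval either contains p or it doesn't. A minimal sketch, assuming arbitrary illustrative values of p, n, and alpha, and using the conservative interval that gets built a couple of paragraphs below:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, n, alpha, reps = 0.35, 124, 0.05, 20_000  # arbitrary illustrative values
q = norm.ppf(1 - alpha / 2)                  # q_{alpha/2}, ~1.96 for alpha=0.05

xbar = rng.binomial(1, p, size=(reps, n)).mean(axis=1)

# Conservative CLT interval: bound p(1-p) by its maximum value 1/4.
half = q * np.sqrt(0.25 / n)
covered = (xbar - half <= p) & (p <= xbar + half)
print(covered.mean())  # fraction of intervals containing p: at least ~0.95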
So here is the example that we had. Let's look back at our favorite example, the average of Bernoulli random variables-- we've studied that, so maybe this is the third time already. So the sample average, Xn bar, is a strongly consistent estimator of p. That was one of the properties that we wanted. Strongly consistent means that as n goes to infinity, it converges almost surely to the true parameter. That's the strong law of large numbers. It is consistent also: because it's strongly consistent, it also converges in probability, which makes it consistent. It's unbiased-- we've seen that. We've actually computed its quadratic risk.

And now, thanks to the central limit theorem, we built a confidence interval at asymptotic level 1 minus alpha. And here is how we did it. Let me just go through it again. So the central limit theorem tells us that square root of n times (Xn bar minus p), divided by square root of p(1 minus p), converges in distribution, as n goes to infinity, to a standard normal distribution. So what it means is that if I look at the probability, under the true p, that the absolute value of square root of n times (Xn bar minus p) divided by square root of p(1 minus p) is at most q alpha over 2, then in the limit as n goes to infinity-- and I'm going to use the same notation-- this is actually going to be equal to 1 minus alpha. That's exactly what I did last time. This is by definition of the quantile of a standard Gaussian and of a limit in distribution. The probability computed on this guy converges, in the limit, to the probability computed on the limiting Gaussian. And we know that the probability that the absolute value of some N(0, 1) variable is less than q alpha over 2 is 1 minus alpha.
And so in particular, since it's equal in the limit, I can put a "larger than or equal to," which guarantees my asymptotic confidence level. And I just solve for p. So this is equivalent to saying: the limit, as n goes to infinity, of the probability that p is between Xn bar minus q alpha over 2 times square root of p(1 minus p) divided by square root of n, and Xn bar plus q alpha over 2 times square root of p(1 minus p) divided by square root of n, is larger than or equal to 1 minus alpha. And so there you go. I have my confidence interval.

Except that's not one, right? We just said that the bounds of a confidence interval may not depend on the unknown parameter. And here, they do. And so we actually came up with two ways of getting rid of this. Since we only need this probability to be at least 1 minus alpha-- and this thing, as we said, is really an equality-- every time I make the left end smaller and the right end larger, I only increase the probability. And so the first trick is to take the largest possible value of p(1 minus p), which makes the interval as large as possible. So I replace p(1 minus p) by its upper bound, which is 1/4. As we said, p(1 minus p), the function, looks like this, so I just take the value at 1/2, which is 1/4.

Or, I can use Slutsky and replace p by Xn bar-- that's the same as just replacing p by Xn bar here. And by Slutsky, we know that this ratio still converges to a standard Gaussian. We've seen that when we saw Slutsky as an example. And those two things work because I'm taking the limit and I only care about the asymptotic confidence level, so I can plug in consistent estimators in there, such as Xn bar in place of p. And that gives me another confidence interval. All right.
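Both recipes are a one-liner each once you have the data. A minimal sketch, where the 0/1 data vector below is hypothetical (it is not the actual kiss dataset):

```python
import numpy as np
from scipy.stats import norm

x = np.array([1, 0, 0, 1, 0, 0, 1])  # hypothetical 0/1 observations
n, xbar = len(x), x.mean()
q = norm.ppf(1 - 0.05 / 2)           # q_{alpha/2} for alpha = 0.05

# Method 1: conservative, bound p(1-p) by 1/4.
h1 = q * np.sqrt(0.25 / n)
# Method 2: Slutsky plug-in, replace p by Xn bar.
h2 = q * np.sqrt(xbar * (1 - xbar) / n)

print((xbar - h1, xbar + h1))  # conservative interval (always the wider one)
print((xbar - h2, xbar + h2))  # plug-in interval
```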
So by now, hopefully after doing it three times, you should really, really be comfortable with creating this confidence interval. We did it three times in class. I think you probably did it another couple of times in your homework. So just make sure you're comfortable with this. All right. That's one of the basic things you would want to know. Are there any questions? Yes.

AUDIENCE: So Slutsky holds for any single response set p. But Xn converges [INAUDIBLE].

PHILIPPE RIGOLLET: So that's not Slutsky, right?

AUDIENCE: That's [INAUDIBLE].

PHILIPPE RIGOLLET: So Slutsky is about combining two types of convergence. Slutsky tells you that if you have Xn that converges to X in distribution, and Yn that converges to Y in probability, where Y is a constant, then you can multiply Xn and Yn and get that the limit in distribution is the product of X and Y. And here we have the constant, which is 1. But I did that already, right? Using Slutsky to replace p by Xn bar-- we've done that last time, or maybe a couple of lectures ago, actually. Yeah?

AUDIENCE: So I guess these statements are [INAUDIBLE].

PHILIPPE RIGOLLET: That's correct.

AUDIENCE: So could we, like, figure out [INAUDIBLE] can we set a finite [INAUDIBLE].

PHILIPPE RIGOLLET: So of course, the short answer is no. But here's how you would go about thinking about which method is better. There's always the more conservative method. With the first one, the only thing you're losing is the rate of convergence of the central limit theorem. So if n is large enough that the central limit theorem approximation is very good, then that's all you're going to be losing. Of course, the price you pay is that your confidence interval is wider than it would be if you were to use Slutsky for this particular problem-- typically wider.
649 00:30:32,600 --> 00:30:37,140 Actually, it is always wider, because Xn bar-- 650 00:30:37,140 --> 00:30:41,120 1 minus Xn bar is always less than 1/4 as well. 651 00:30:41,120 --> 00:30:45,920 And so that's the first thing you-- 652 00:30:45,920 --> 00:30:51,380 so Slutsky basically adds your relying on the central limit-- 653 00:30:51,380 --> 00:30:53,570 your relying on the asymptotics again. 654 00:30:53,570 --> 00:30:56,180 Now of course, you don't want to be conservative, 655 00:30:56,180 --> 00:30:59,060 because you actually want to squeeze as much from your data 656 00:30:59,060 --> 00:30:59,930 as you can. 657 00:30:59,930 --> 00:31:04,040 So it depends on how comfortable and how critical it is for you 658 00:31:04,040 --> 00:31:06,410 to put valid error bars. 659 00:31:06,410 --> 00:31:07,940 If they're valid in the asymptotics, 660 00:31:07,940 --> 00:31:09,710 then maybe you're actually going to go with Slutsky 661 00:31:09,710 --> 00:31:11,918 so it actually gives you slightly narrower confidence 662 00:31:11,918 --> 00:31:16,060 intervals and so you feel like you're a little more-- 663 00:31:16,060 --> 00:31:17,869 you have a more precise answer. 664 00:31:17,869 --> 00:31:19,910 Now, if you really need to be super-conservative, 665 00:31:19,910 --> 00:31:23,390 then you're actually going to go with the P1 minus P. 666 00:31:23,390 --> 00:31:25,790 Actually, if you need to be even more conservative, 667 00:31:25,790 --> 00:31:28,850 you are going to go with Hoeffding's so you don't even 668 00:31:28,850 --> 00:31:31,412 have to rely on the asymptotics level at all. 669 00:31:31,412 --> 00:31:32,870 But then you're confidence interval 670 00:31:32,870 --> 00:31:35,000 becomes twice as wide and twice as wide 671 00:31:35,000 --> 00:31:37,960 and it becomes wider and wider as you go. 672 00:31:37,960 --> 00:31:39,859 So depends on-- 673 00:31:39,859 --> 00:31:41,650 I mean, there's a lot of data in statistics 674 00:31:41,650 --> 00:31:46,310 which is gauging how critical it is for you to output 675 00:31:46,310 --> 00:31:48,380 valid error bounds or if they're really just here 676 00:31:48,380 --> 00:31:51,620 to be indicative of the precision of the estimator you 677 00:31:51,620 --> 00:31:55,396 gave from a more qualitative perspective. 678 00:31:55,396 --> 00:31:57,540 AUDIENCE: So the error there is [INAUDIBLE]?? 679 00:31:57,540 --> 00:31:58,540 PHILIPPE RIGOLLET: Yeah. 680 00:31:58,540 --> 00:32:01,220 So here, there's basically a bunch of errors. 681 00:32:01,220 --> 00:32:04,280 There's one that's-- so there's a theorem called Berry-Esseen 682 00:32:04,280 --> 00:32:09,830 that quantifies how far this probability is from 1 minus 683 00:32:09,830 --> 00:32:12,670 alpha, but the constants are terrible. 684 00:32:12,670 --> 00:32:14,510 So it's not very helpful, but it tells you 685 00:32:14,510 --> 00:32:17,330 as n grows how smaller this thing grows-- 686 00:32:17,330 --> 00:32:18,320 becomes smaller. 687 00:32:18,320 --> 00:32:20,330 And then for Slutsky, again you're 688 00:32:20,330 --> 00:32:22,790 multiplying something that converges by something that 689 00:32:22,790 --> 00:32:24,827 fluctuates around 1, so you need to understand 690 00:32:24,827 --> 00:32:25,910 how this thing fluctuates. 691 00:32:25,910 --> 00:32:28,070 Now, there's something that shows up. 
Basically, what is the slope of the function 1 over square root of x(1 minus x) around the value you're interested in? If this function is super sharp, then small fluctuations of Xn bar around its expectation are going to lead to really high fluctuations of the function itself. So if you're looking at f of Xn bar, and f around, say, the true p-- if f is really sharp like that, then if you move a little bit here, you're going to move really a lot on the y-axis. So the function you're interested in here is 1 over square root of x(1 minus x). What does this function look like around the point where you think p, the true parameter, is? Its derivative is really what matters. OK? Any other questions?

OK. So this is important, because now we're going to switch to the real "let's do some hardcore computation" type of things. All right.

So in this chapter, we're going to talk about maximum likelihood estimation. Who has already seen maximum likelihood estimation? OK. And who knows what a convex function is? OK. So we'll do a little bit of a reminder on those things. When we do maximum likelihood estimation, the likelihood is a function, so we need to maximize a function. That's basically what we need to do. And if I give you a function, you need to know how to maximize it. Sometimes, you have closed-form solutions: you can take the derivative, set it equal to 0, and solve. But sometimes, you actually need to resort to algorithms to do that. And there's an entire industry doing that. We'll briefly touch upon it, but it is definitely not the focus of this class. OK.
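A minimal sketch of the two routes just mentioned: the function below is a made-up smooth concave example, and scipy's generic solver stands in for the "entire industry" of numerical optimizers; neither is from the lecture.

```python
from scipy.optimize import minimize_scalar

# A made-up concave function; calculus gives the closed form:
# f'(x) = 2 - 2x = 0, so the maximizer is x = 1.
def f(x):
    return 2 * x - x ** 2

# The algorithmic route: maximize f by numerically minimizing -f.
res = minimize_scalar(lambda x: -f(x))
print(res.x)  # ~1.0, matching the derivative-equals-zero solution
```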
So before diving directly into the definition of the likelihood and of the maximum likelihood estimator, what I'm going to try to do is give you some insight into what we're actually doing when we do maximum likelihood estimation.

So remember, we have a model on a sample space E and some candidate distributions P theta. And really, your goal is to estimate a true theta star, the one that generated some data X1 to Xn in an iid fashion. But this theta star is really a proxy: knowing it means we actually understand the distribution itself. The goal of knowing theta star is so that you actually know P theta star. Otherwise-- well, sometimes we said theta has some meaning itself, but really you want to know what the distribution is. And so your goal is to come up with a distribution-- hopefully one that comes from the family of P thetas-- that's close to P theta star.

So, what does it mean to have two distributions that are close? It means that when you compute probabilities under one distribution, you should get pretty much the same probabilities under the other distribution. So what we can do is say, well, now I have two candidate distributions. Theta hat leads to a candidate distribution P theta hat, and the true theta star leads to the true distribution P theta star, according to which my data was drawn. That's my candidate-- as a statistician, I'm supposed to come up with a good candidate-- and this is the truth.

And what I want is that if you give me the candidate distribution, then when I compute probabilities with it, I know what the probabilities under the truth are, pretty much. And so really what I want is that if I compute a probability under theta hat of some interval [a, b], it should be pretty close to the probability under theta star of [a, b]. And more generally, if I take the union of two intervals, I want this to be true. If I take half-lines-- from 0 to infinity, for example-- I want this to be true, things like this. I want this to be true for all of them at once.
782 00:36:55,470 --> 00:36:58,500 If I take just half-lines, I want this to be true from 0 783 00:36:58,500 --> 00:37:00,900 to infinity, for example, things like this. 784 00:37:00,900 --> 00:37:03,550 I want this to be true for all of them at once. 785 00:37:03,550 --> 00:37:07,620 And so what I do is that I write A for a probability event. 786 00:37:07,620 --> 00:37:11,520 And I want that P hat of A is close to P star of A 787 00:37:11,520 --> 00:37:15,517 for any event A in the sample space. 788 00:37:15,517 --> 00:37:17,100 Does that sound like a reasonable goal 789 00:37:17,100 --> 00:37:18,994 for a statistician? 790 00:37:18,994 --> 00:37:20,910 So in particular, if I want those to be close, 791 00:37:20,910 --> 00:37:22,784 I want the absolute value of their difference 792 00:37:22,784 --> 00:37:23,680 to be close to 0. 793 00:37:26,220 --> 00:37:28,140 And this turns out to be-- 794 00:37:28,140 --> 00:37:31,875 if I want this to hold for all possible A's, I 795 00:37:31,875 --> 00:37:35,460 have all possible events, so I'm going to actually maximize over 796 00:37:35,460 --> 00:37:36,100 these events. 797 00:37:36,100 --> 00:37:37,516 And I'm going to look at the worst 798 00:37:37,516 --> 00:37:41,160 possible event on which theta hat can depart from theta star. 799 00:37:41,160 --> 00:37:43,170 And so rather than defining it specifically 800 00:37:43,170 --> 00:37:44,790 for theta hat and theta star, I'm 801 00:37:44,790 --> 00:37:47,910 just going to say, well, if you give me two probability 802 00:37:47,910 --> 00:37:51,420 measures, P theta and P theta prime, 803 00:37:51,420 --> 00:37:53,100 I want to know how close they are. 804 00:37:53,100 --> 00:37:55,080 Well, if I want to measure how close they 805 00:37:55,080 --> 00:37:58,980 are by how they can differ when I measure 806 00:37:58,980 --> 00:38:01,920 the probability of some event, I'm 807 00:38:01,920 --> 00:38:04,800 just looking at the absolute value of the difference 808 00:38:04,800 --> 00:38:06,180 of the probabilities and I'm just 809 00:38:06,180 --> 00:38:09,240 maximizing over the worst possible event that might 810 00:38:09,240 --> 00:38:11,380 actually make them differ. 811 00:38:11,380 --> 00:38:13,040 Agreed? 812 00:38:13,040 --> 00:38:14,360 That's a pretty strong notion. 813 00:38:14,360 --> 00:38:17,720 So if the total variation between P theta and P theta prime 814 00:38:17,720 --> 00:38:22,390 is small, it means that for all possible A's that you give me, 815 00:38:22,390 --> 00:38:25,590 then P theta of A is going to be close to P 816 00:38:25,590 --> 00:38:30,820 theta prime of A, because if-- 817 00:38:30,820 --> 00:38:33,820 let's say I just found the bound on the total variation 818 00:38:33,820 --> 00:38:41,911 distance, which is 0.01. 819 00:38:41,911 --> 00:38:42,410 All right. 820 00:38:42,410 --> 00:38:46,110 So that means that this is going to be larger 821 00:38:46,110 --> 00:39:00,940 than the max over A of P theta of A minus P theta prime of A, 822 00:39:00,940 --> 00:39:04,550 which means that for any A-- 823 00:39:04,550 --> 00:39:06,990 actually, let me write P theta hat and P theta star, 824 00:39:06,990 --> 00:39:10,611 like we said, theta hat and theta star.
825 00:39:10,611 --> 00:39:12,860 And so if I have a bound, say, on the total variation, 826 00:39:12,860 --> 00:39:19,270 which is 0.01, that means that P theta hat-- 827 00:39:19,270 --> 00:39:23,950 every time I compute a probability on P theta hat, 828 00:39:23,950 --> 00:39:29,295 it's basically in the interval P theta star of A, 829 00:39:29,295 --> 00:39:34,790 the one that I really wanted to compute, plus or minus 0.01. 830 00:39:34,790 --> 00:39:36,790 This has nothing to do with a confidence interval. 831 00:39:36,790 --> 00:39:38,165 This is just telling me how far I 832 00:39:38,165 --> 00:39:41,280 am from the value I'm actually trying to compute. 833 00:39:41,280 --> 00:39:44,750 And that's true for all A. And that's key. 834 00:39:44,750 --> 00:39:47,400 That's where this max comes into play. 835 00:39:47,400 --> 00:39:49,310 It just says, I want this bound to hold 836 00:39:49,310 --> 00:39:50,870 for all possible A's at once. 837 00:39:55,300 --> 00:39:58,142 So this is actually a very well-known distance 838 00:39:58,142 --> 00:39:59,350 between probability measures. 839 00:39:59,350 --> 00:40:00,766 It's the total variation distance. 840 00:40:00,766 --> 00:40:04,880 It's extremely central to probabilistic analysis. 841 00:40:04,880 --> 00:40:07,120 And it essentially tells you that every time-- 842 00:40:07,120 --> 00:40:09,160 if two probability distributions are close, 843 00:40:09,160 --> 00:40:11,560 then it means that every time I compute a probability 844 00:40:11,560 --> 00:40:15,160 under P theta but I really actually 845 00:40:15,160 --> 00:40:17,290 have data from P theta prime, then 846 00:40:17,290 --> 00:40:21,710 the error is no larger than the total variation. 847 00:40:21,710 --> 00:40:23,470 OK. 848 00:40:23,470 --> 00:40:29,460 So this is maybe not the most convenient way 849 00:40:29,460 --> 00:40:30,870 of finding a distance. 850 00:40:30,870 --> 00:40:32,130 I mean, how are you going-- 851 00:40:32,130 --> 00:40:34,500 in reality, how are you going to compute this maximum 852 00:40:34,500 --> 00:40:35,640 over all possible events? 853 00:40:35,640 --> 00:40:36,931 I mean, it's just crazy, right? 854 00:40:36,931 --> 00:40:38,430 There's an infinite number of them. 855 00:40:38,430 --> 00:40:41,340 It's much larger than the number of intervals, for example, 856 00:40:41,340 --> 00:40:43,050 so it's a bit annoying. 857 00:40:43,050 --> 00:40:46,800 And so there's actually a way to compress it 858 00:40:46,800 --> 00:40:50,834 by just looking at, basically, a function distance or vector 859 00:40:50,834 --> 00:40:53,250 distance between probability mass functions or probability 860 00:40:53,250 --> 00:40:55,510 density functions. 861 00:40:55,510 --> 00:40:58,150 So I'm going to start with the discrete version 862 00:40:58,150 --> 00:40:59,280 of the total variation. 863 00:40:59,280 --> 00:41:03,282 So throughout this chapter, I will 864 00:41:03,282 --> 00:41:05,490 distinguish between discrete random variables 865 00:41:05,490 --> 00:41:07,530 and continuous random variables. 866 00:41:07,530 --> 00:41:08,651 It really doesn't matter. 867 00:41:08,651 --> 00:41:10,650 All it means is that when I talk about discrete, 868 00:41:10,650 --> 00:41:12,606 I will talk about probability mass functions. 869 00:41:12,606 --> 00:41:13,980 And when I talk about continuous, 870 00:41:13,980 --> 00:41:16,600 I will talk about probability density functions.
871 00:41:16,600 --> 00:41:20,030 When I talk about probability mass functions, 872 00:41:20,030 --> 00:41:21,510 I talk about sums. 873 00:41:21,510 --> 00:41:24,900 When I talk about probability density functions, 874 00:41:24,900 --> 00:41:26,730 I talk about integrals. 875 00:41:26,730 --> 00:41:30,090 But they're all the same thing, really. 876 00:41:30,090 --> 00:41:32,475 So let's start with the probability mass function. 877 00:41:32,475 --> 00:41:34,350 Everybody remembers what the probability mass 878 00:41:34,350 --> 00:41:37,980 function of a discrete random variable is. 879 00:41:37,980 --> 00:41:42,180 This is the function that tells me for each possible value 880 00:41:42,180 --> 00:41:43,720 that it can take, the probability 881 00:41:43,720 --> 00:41:46,410 that it takes this value. 882 00:41:46,410 --> 00:41:53,200 So the Probability Mass Function, PMF, 883 00:41:53,200 --> 00:41:57,310 is just the function that, for all x in the sample space, 884 00:41:57,310 --> 00:42:01,420 tells me the probability that my random variable is 885 00:42:01,420 --> 00:42:03,970 equal to this little value. 886 00:42:03,970 --> 00:42:09,091 And I will denote it by P sub theta of X. 887 00:42:09,091 --> 00:42:10,840 So what I want is, of course, that the sum 888 00:42:10,840 --> 00:42:12,250 of the probabilities is 1. 889 00:42:17,620 --> 00:42:20,460 And I want them to be non-negative. 890 00:42:20,460 --> 00:42:23,110 Actually, typically we will assume that they are positive. 891 00:42:23,110 --> 00:42:27,410 Otherwise, we can just remove this x from the sample space. 892 00:42:27,410 --> 00:42:31,700 And so then I have the total variation distance, I mean, 893 00:42:31,700 --> 00:42:35,470 it's supposed to be the maximum over all sets of-- 894 00:42:35,470 --> 00:42:39,850 of subsets A of E, of the probability 895 00:42:39,850 --> 00:42:43,130 under theta of A minus the probability under theta prime of A-- 896 00:42:43,130 --> 00:42:44,630 it's complicated, but really there's 897 00:42:44,630 --> 00:42:46,130 this beautiful formula that tells me 898 00:42:46,130 --> 00:42:50,410 that if I look at the total variation between P theta 899 00:42:50,410 --> 00:42:54,520 and P theta prime, it's actually equal to just 1/2 900 00:42:54,520 --> 00:43:04,402 of the sum for all X in E of the absolute difference between P 901 00:43:04,402 --> 00:43:12,151 theta of X and P theta prime of X. 902 00:43:12,151 --> 00:43:13,650 So that's something you can compute. 903 00:43:13,650 --> 00:43:16,110 If I give you two probability mass functions, 904 00:43:16,110 --> 00:43:19,660 you can compute this immediately. 905 00:43:19,660 --> 00:43:24,020 But if I give you just the densities 906 00:43:24,020 --> 00:43:26,460 and the original distribution, the original definition 907 00:43:26,460 --> 00:43:28,200 where you have to max over all possible events, 908 00:43:28,200 --> 00:43:29,575 it's not clear you're going to be 909 00:43:29,575 --> 00:43:31,140 able to do that very quickly. 910 00:43:31,140 --> 00:43:35,335 So this is really the one you can work with. 911 00:43:35,335 --> 00:43:36,960 But the other one is really telling you 912 00:43:36,960 --> 00:43:37,830 what it is doing for you. 913 00:43:37,830 --> 00:43:39,829 It's controlling the difference of probabilities 914 00:43:39,829 --> 00:43:41,077 you can compute on any event. 915 00:43:41,077 --> 00:43:42,660 But here, it's just telling you, well, 916 00:43:42,660 --> 00:43:46,370 if you do it for each simple event, it's little x.
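Since the discrete formula is directly computable, here is a minimal sketch of it, my own and not from the lecture, for two Poisson PMFs; truncating the infinite sum at k = 100 is an assumption, harmless here because the tail mass beyond it is negligible for these parameters.

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    # P(X = k) for X ~ Poisson(lam)
    return exp(-lam) * lam ** k / factorial(k)

def tv_discrete(lam1, lam2, kmax=100):
    # TV = (1/2) * sum over x of |p_theta(x) - p_theta'(x)|, truncated at kmax
    return 0.5 * sum(abs(poisson_pmf(lam1, k) - poisson_pmf(lam2, k))
                     for k in range(kmax + 1))

print(tv_discrete(1.0, 1.5))  # a number between 0 and 1
```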
917 00:43:46,370 --> 00:43:49,080 It's actually simple. 918 00:43:49,080 --> 00:43:53,150 Now, if we have continuous random variables-- so 919 00:43:53,150 --> 00:43:56,060 by the way, I didn't mention, but discrete means Bernoulli, 920 00:43:56,060 --> 00:43:59,420 binomial-- but not only those that have finite support, 921 00:43:59,420 --> 00:44:02,260 like Bernoulli has support of size 2, 922 00:44:02,260 --> 00:44:05,760 binomial (n, p) has support of size n plus 1-- 923 00:44:05,760 --> 00:44:08,570 there are n plus 1 possible values it can take-- but also Poisson. 924 00:44:08,570 --> 00:44:10,570 Poisson distribution can take an infinite number 925 00:44:10,570 --> 00:44:13,510 of values-- 926 00:44:13,510 --> 00:44:16,100 all the non-negative integers. 927 00:44:16,100 --> 00:44:18,000 And so now we have also the continuous ones, 928 00:44:18,000 --> 00:44:19,384 such as Gaussian, exponential. 929 00:44:19,384 --> 00:44:21,300 And what characterizes those guys is that they 930 00:44:21,300 --> 00:44:24,230 have a probability density. 931 00:44:24,230 --> 00:44:26,630 So the density, remember the way I 932 00:44:26,630 --> 00:44:28,820 use my density is when I want to compute 933 00:44:28,820 --> 00:44:31,910 the probability of belonging to some event A. 934 00:44:31,910 --> 00:44:37,010 The probability of X falling into some subset A of the real line 935 00:44:37,010 --> 00:44:40,280 is simply the integral of the density on this set. 936 00:44:40,280 --> 00:44:43,940 That's the famous area under the curve thing. 937 00:44:43,940 --> 00:44:49,196 So since for each possible value, the probability at X-- 938 00:44:49,196 --> 00:44:51,350 so I hope you remember that stuff. 939 00:44:51,350 --> 00:44:57,890 That's just probably something that you 940 00:44:57,890 --> 00:44:59,210 must remember from probability. 941 00:44:59,210 --> 00:45:02,120 But essentially, we know that the probability that X is equal 942 00:45:02,120 --> 00:45:04,820 to little x is 0 for a continuous random variable, 943 00:45:04,820 --> 00:45:06,830 for all possible X's. 944 00:45:06,830 --> 00:45:09,030 There's just none of them that actually gets weight. 945 00:45:09,030 --> 00:45:11,321 So what we have to do is to describe the fact that it's 946 00:45:11,321 --> 00:45:12,980 in some little region. 947 00:45:12,980 --> 00:45:18,830 So the probability that it's in some interval, say, a, b, this 948 00:45:18,830 --> 00:45:25,550 is the integral between a and b of f theta of X, dx. 949 00:45:25,550 --> 00:45:28,379 So I have this density, such as the Gaussian one. 950 00:45:28,379 --> 00:45:30,545 And the probability that I belong to the interval a, 951 00:45:30,545 --> 00:45:36,920 b is just the area under the curve between a and b. 952 00:45:36,920 --> 00:45:43,880 If you don't remember that, please take immediate remedial action. 953 00:45:43,880 --> 00:45:48,920 So this function f, just like P, is non-negative. 954 00:45:48,920 --> 00:45:51,890 And rather than summing to 1, it integrates to 1 955 00:45:51,890 --> 00:45:55,119 when I integrate it over the entire sample space E. 956 00:45:55,119 --> 00:45:56,660 And now the total variation, well, it 957 00:45:56,660 --> 00:45:58,130 takes basically the same form. 958 00:45:58,130 --> 00:46:00,230 I said that you essentially replace sums 959 00:46:00,230 --> 00:46:03,264 by integrals when you're dealing with densities.
960 00:46:03,264 --> 00:46:05,180 And here, it's just saying, rather than having 961 00:46:05,180 --> 00:46:07,220 1/2 of the sum of the absolute values, 962 00:46:07,220 --> 00:46:09,860 you have 1/2 of the integral of the absolute value 963 00:46:09,860 --> 00:46:11,530 of the difference. 964 00:46:11,530 --> 00:46:15,310 Again, if I give you two densities 965 00:46:15,310 --> 00:46:18,280 and if you're not too bad at calculus, which you will often 966 00:46:18,280 --> 00:46:21,490 be, because there's lots of them you can actually not compute. 967 00:46:21,490 --> 00:46:24,400 But if I gave you, for example, two Gaussian densities, 968 00:46:24,400 --> 00:46:27,330 exponential minus x squared, blah, blah, blah, and I say, 969 00:46:27,330 --> 00:46:29,080 just compute the total variation distance, 970 00:46:29,080 --> 00:46:30,957 you could actually write it as an integral. 971 00:46:30,957 --> 00:46:33,040 Now, whether you can actually reduce this integral 972 00:46:33,040 --> 00:46:35,470 to some particular number is another story. 973 00:46:35,470 --> 00:46:38,860 But you could technically do it. 974 00:46:38,860 --> 00:46:41,695 So now, you have actually a handle on this thing 975 00:46:41,695 --> 00:46:43,660 and you could technically ask Mathematica, 976 00:46:43,660 --> 00:46:45,280 whereas asking Mathematica to take 977 00:46:45,280 --> 00:46:48,280 the max over all possible events is going to be difficult. 978 00:46:48,280 --> 00:46:48,780 All right. 979 00:46:48,780 --> 00:46:55,240 So the total variation has some properties. 980 00:46:55,240 --> 00:46:59,560 So let's keep on the board the definition that 981 00:46:59,560 --> 00:47:05,410 involves, say, the densities. 982 00:47:05,410 --> 00:47:06,910 So think Gaussian in your mind. 983 00:47:06,910 --> 00:47:09,530 And you have two Gaussians, one with mean theta 984 00:47:09,530 --> 00:47:10,810 and one with mean theta prime. 985 00:47:10,810 --> 00:47:13,143 And I'm looking at the total variation between those two 986 00:47:13,143 --> 00:47:14,560 guys. 987 00:47:14,560 --> 00:47:20,030 So if I look at P theta minus-- 988 00:47:20,030 --> 00:47:20,530 sorry. 989 00:47:20,530 --> 00:47:25,800 TV between P theta and P theta prime, this 990 00:47:25,800 --> 00:47:31,110 is equal to 1/2 of the integral between f theta, f theta prime. 991 00:47:31,110 --> 00:47:32,490 And when I don't write it-- 992 00:47:32,490 --> 00:47:34,800 so I don't write the X, dx but it's there. 993 00:47:34,800 --> 00:47:38,432 And then I integrate over E. 994 00:47:38,432 --> 00:47:39,890 So what is this thing doing for me? 995 00:47:39,890 --> 00:47:41,480 It's just saying, well, if I have-- so 996 00:47:41,480 --> 00:47:42,438 think of two Gaussians. 997 00:47:42,438 --> 00:47:44,940 For example, I have one that's here and one that's here. 998 00:47:47,610 --> 00:47:51,670 So this is let's say f theta, f theta prime. 999 00:47:51,670 --> 00:47:52,750 This guy is doing what? 1000 00:47:52,750 --> 00:47:55,980 It's computing the absolute value of the difference 1001 00:47:55,980 --> 00:47:57,910 between f and f theta prime. 1002 00:47:57,910 --> 00:48:01,980 You can check for yourself that graphically, this I 1003 00:48:01,980 --> 00:48:05,931 can represent as an area not under the curve, 1004 00:48:05,931 --> 00:48:10,300 but between the curves. 1005 00:48:10,300 --> 00:48:11,760 So this is this guy. 1006 00:48:16,370 --> 00:48:20,040 Now, this guy is really the integral of the absolute value. 
1007 00:48:20,040 --> 00:48:22,570 So this thing here, this area, this 1008 00:48:22,570 --> 00:48:25,224 is 2 times the total variation. 1009 00:48:28,240 --> 00:48:29,980 The scaling 1/2 really doesn't matter. 1010 00:48:29,980 --> 00:48:32,790 It's just if I want to have an actual correspondence 1011 00:48:32,790 --> 00:48:36,350 between the maximum and the other guy, I have to do this. 1012 00:48:39,630 --> 00:48:41,290 So this is what it looks like. 1013 00:48:41,290 --> 00:48:42,910 So we have this definition. 1014 00:48:42,910 --> 00:48:48,279 And so we have a couple of properties that come into this. 1015 00:48:48,279 --> 00:48:49,820 The first one is that it's symmetric. 1016 00:48:49,820 --> 00:48:51,860 TV of P theta and P theta prime is 1017 00:48:51,860 --> 00:48:55,970 the same as the TV between P theta prime and P theta. 1018 00:48:55,970 --> 00:48:59,710 Well, that's pretty obvious from this definition. 1019 00:48:59,710 --> 00:49:02,090 I just flip those two, I get the same number. 1020 00:49:02,090 --> 00:49:05,297 It's actually also true if I take the maximum. 1021 00:49:05,297 --> 00:49:07,630 Those things are completely symmetric in theta and theta 1022 00:49:07,630 --> 00:49:08,350 prime. 1023 00:49:08,350 --> 00:49:10,620 You can just flip them. 1024 00:49:10,620 --> 00:49:11,830 It's non-negative. 1025 00:49:11,830 --> 00:49:15,640 Is that clear to everyone that this thing is non-negative? 1026 00:49:15,640 --> 00:49:20,530 I integrate an absolute value, so this thing 1027 00:49:20,530 --> 00:49:22,640 is going to give me some non-negative number. 1028 00:49:22,640 --> 00:49:24,598 And so if I integrate this non-negative number, 1029 00:49:24,598 --> 00:49:26,670 it's going to be a non-negative number. 1030 00:49:26,670 --> 00:49:29,230 The fact also that it's an area tells me 1031 00:49:29,230 --> 00:49:32,680 that it's going to be non-negative. 1032 00:49:32,680 --> 00:49:36,900 The nice thing is that if TV is equal to zero, then 1033 00:49:36,900 --> 00:49:42,490 the two distributions, the two probabilities are the same. 1034 00:49:42,490 --> 00:49:46,540 That means that for every A, P theta of A 1035 00:49:46,540 --> 00:49:49,050 is equal to P theta prime of A. Now, 1036 00:49:49,050 --> 00:49:50,860 there's two ways to see that. 1037 00:49:50,860 --> 00:49:53,140 The first one is to say that if this integral is 1038 00:49:53,140 --> 00:49:56,650 equal to 0, that means that for almost all X, 1039 00:49:56,650 --> 00:49:58,240 f theta is equal to f theta prime. 1040 00:49:58,240 --> 00:50:01,390 The only way I can integrate a non-negative and get 0 1041 00:50:01,390 --> 00:50:05,390 is that it's 0 pretty much everywhere. 1042 00:50:05,390 --> 00:50:07,550 And so what it means is that the two densities 1043 00:50:07,550 --> 00:50:09,530 have to be the same pretty much everywhere, 1044 00:50:09,530 --> 00:50:11,546 which means that the distributions are the same. 1045 00:50:11,546 --> 00:50:13,670 But this is not really the way you want to do this, 1046 00:50:13,670 --> 00:50:15,128 because you have to understand what 1047 00:50:15,128 --> 00:50:16,850 pretty much everywhere means-- 1048 00:50:16,850 --> 00:50:18,760 which I should really say almost everywhere. 1049 00:50:18,760 --> 00:50:20,570 That's the formal way of saying it. 1050 00:50:20,570 --> 00:50:22,280 But let's go to this definition-- 1051 00:50:24,830 --> 00:50:26,160 which is gone. 1052 00:50:26,160 --> 00:50:26,660 Yeah. 1053 00:50:26,660 --> 00:50:28,670 That's the one here. 
1054 00:50:28,670 --> 00:50:35,230 The max of those two guys, if this maximum is equal to 0-- 1055 00:50:35,230 --> 00:50:39,220 I have a maximum of non-negative numbers, their absolute values. 1056 00:50:39,220 --> 00:50:42,090 If their maximum is equal to 0, well, 1057 00:50:42,090 --> 00:50:44,490 they better be all equal to 0, because if one is not 1058 00:50:44,490 --> 00:50:47,470 equal to 0, then the maximum is not equal to 0. 1059 00:50:47,470 --> 00:50:50,170 So those two guys, for those two things 1060 00:50:50,170 --> 00:50:52,180 to be-- for the maximum to be equal to 0, 1061 00:50:52,180 --> 00:50:54,220 then each of the individual absolute values 1062 00:50:54,220 --> 00:50:57,430 has to be equal to 0, which means that the probability here 1063 00:50:57,430 --> 00:51:03,730 is equal to this probability here for every event A. 1064 00:51:03,730 --> 00:51:04,960 So those two things-- 1065 00:51:04,960 --> 00:51:06,070 this is nice, right? 1066 00:51:06,070 --> 00:51:08,410 That's called definiteness. 1067 00:51:08,410 --> 00:51:10,900 The total variation equal to 0 implies that P theta 1068 00:51:10,900 --> 00:51:12,210 is equal to P theta prime. 1069 00:51:12,210 --> 00:51:14,350 So that's really some notion of distance, right? 1070 00:51:14,350 --> 00:51:16,060 That's what we want. 1071 00:51:16,060 --> 00:51:17,980 If this thing being small implied 1072 00:51:17,980 --> 00:51:20,350 that P theta could be all over the place compared 1073 00:51:20,350 --> 00:51:24,270 to P theta prime, that would not help very much. 1074 00:51:24,270 --> 00:51:26,580 Now, there's also the triangle inequality 1075 00:51:26,580 --> 00:51:28,710 that follows immediately from the triangle 1076 00:51:28,710 --> 00:51:32,730 inequality inside this guy. 1077 00:51:32,730 --> 00:51:35,654 If I squeeze in some f theta prime prime in there, 1078 00:51:35,654 --> 00:51:37,320 I'm going to use the triangle inequality 1079 00:51:37,320 --> 00:51:39,486 and get the triangle inequality for the whole thing. 1080 00:51:42,392 --> 00:51:42,892 Yeah? 1081 00:51:42,892 --> 00:51:45,287 AUDIENCE: The fact that you need two definitions 1082 00:51:45,287 --> 00:51:48,640 of the [INAUDIBLE], is it something 1083 00:51:48,640 --> 00:51:50,090 obvious or is it complete? 1084 00:51:50,090 --> 00:51:52,930 PHILIPPE RIGOLLET: I'll do it for you now. 1085 00:51:52,930 --> 00:51:56,530 So let's just prove that those two things are actually 1086 00:51:56,530 --> 00:51:58,756 giving me the same definition. 1087 00:52:00,956 --> 00:52:02,830 So what I'm going to do is I'm actually going 1088 00:52:02,830 --> 00:52:04,420 to start with the second one. 1089 00:52:04,420 --> 00:52:05,420 And I'm going to write-- 1090 00:52:05,420 --> 00:52:07,253 I'm going to start with the density version. 1091 00:52:07,253 --> 00:52:10,300 But as an exercise, you can do it for the PMF version 1092 00:52:10,300 --> 00:52:11,347 if you prefer. 1093 00:52:11,347 --> 00:52:13,180 So I'm going to start with the fact that f-- 1094 00:52:20,240 --> 00:52:23,810 so I'm going to write f and g so I don't have to write f theta and f theta prime. 1095 00:52:23,810 --> 00:52:27,490 So think of this as being f sub theta, and think of this guy 1096 00:52:27,490 --> 00:52:29,180 as being f sub theta prime. 1097 00:52:29,180 --> 00:52:32,240 I just don't want to have to write indices all the time. 1098 00:52:32,240 --> 00:52:34,970 So I'm going to start with this thing, the integral of f 1099 00:52:34,970 --> 00:52:38,870 of X minus g of X dx.
1100 00:52:38,870 --> 00:52:41,910 The first thing I'm going to do is this is an absolute value, 1101 00:52:41,910 --> 00:52:45,170 so either the number in the absolute value is positive 1102 00:52:45,170 --> 00:52:47,390 and I actually kept it like that, or it's negative 1103 00:52:47,390 --> 00:52:48,760 and I flipped its sign. 1104 00:52:48,760 --> 00:52:51,600 So let's just split between those two cases. 1105 00:52:51,600 --> 00:52:55,460 So this thing is equal to 1/2 the integral of-- 1106 00:52:55,460 --> 00:53:00,350 so let me actually write the set A star as 1107 00:53:00,350 --> 00:53:09,240 being the set of X's such that f of X is larger than g of X. 1108 00:53:09,240 --> 00:53:11,340 So that's the set on which the difference is 1109 00:53:11,340 --> 00:53:13,060 going to be positive, and on the complement the difference is 1110 00:53:13,060 --> 00:53:14,370 going to be negative. 1111 00:53:14,370 --> 00:53:17,082 So this, again, is equivalent to f 1112 00:53:17,082 --> 00:53:23,280 of X minus g of X being positive. 1113 00:53:23,280 --> 00:53:23,780 OK. 1114 00:53:23,780 --> 00:53:24,488 Everybody agrees? 1115 00:53:24,488 --> 00:53:26,330 So this is the set I'm interested in. 1116 00:53:29,040 --> 00:53:31,830 So now I'm going to split my integral into two parts, 1117 00:53:31,830 --> 00:53:38,250 on A star and its complement. So on A star, f is larger than g, 1118 00:53:38,250 --> 00:53:40,666 so the absolute value is just the difference itself. 1119 00:53:45,150 --> 00:53:48,980 So here I put parentheses rather than absolute values. 1120 00:53:48,980 --> 00:53:54,330 And then I have plus 1/2 of the integral on the complement. 1121 00:53:54,330 --> 00:53:57,940 What are you guys used to for writing the complement-- the C 1122 00:53:57,940 --> 00:54:01,005 or the bar? 1123 00:54:01,005 --> 00:54:01,991 The C? 1124 00:54:05,450 --> 00:54:08,320 And so here on the complement, then f is less than g, 1125 00:54:08,320 --> 00:54:17,810 so this is actually really g of X minus f of X, dx. 1126 00:54:17,810 --> 00:54:19,550 Everybody's with me here? 1127 00:54:19,550 --> 00:54:20,900 So I just said-- 1128 00:54:20,900 --> 00:54:23,390 I mean, those are just rewriting what the definition 1129 00:54:23,390 --> 00:54:24,560 of the absolute value is. 1130 00:54:33,290 --> 00:54:33,830 OK. 1131 00:54:33,830 --> 00:54:38,120 So now there's nice things that I know about f and g. 1132 00:54:38,120 --> 00:54:40,880 And the two nice things is that the integral of f is equal to 1 1133 00:54:40,880 --> 00:54:42,790 and the integral of g is equal to 1. 1134 00:54:46,270 --> 00:54:53,614 This implies that the integral of f minus g is equal to what? 1135 00:54:53,614 --> 00:54:54,526 AUDIENCE: 0. 1136 00:54:54,526 --> 00:54:56,400 PHILIPPE RIGOLLET: 0. 1137 00:54:56,400 --> 00:54:59,060 And so now that means that if I want 1138 00:54:59,060 --> 00:55:04,130 to just go from the integral here on A complement 1139 00:55:04,130 --> 00:55:05,690 to the integral on A-- 1140 00:55:05,690 --> 00:55:08,780 or on A star, complement to the integral of A star, 1141 00:55:08,780 --> 00:55:11,700 I just have to flip the sign. 1142 00:55:11,700 --> 00:55:14,920 So that implies that an integral on A star 1143 00:55:14,920 --> 00:55:21,198 complement of g of X minus f of X, 1144 00:55:21,198 --> 00:55:25,830 dx, this is simply equal to the integral on A star 1145 00:55:25,830 --> 00:55:30,250 of f of X minus g of X, dx. 1146 00:55:40,880 --> 00:55:41,780 All right. 1147 00:55:41,780 --> 00:55:46,100 So now this guy becomes this guy over there.
1148 00:55:46,100 --> 00:55:50,050 So I have 1/2 of this plus 1/2 of the same guy, 1149 00:55:50,050 --> 00:55:55,720 so that means that 1/2 of the integral of f 1150 00:55:55,720 --> 00:55:57,450 minus g, absolute value-- 1151 00:55:57,450 --> 00:55:59,810 so that was my original definition, 1152 00:55:59,810 --> 00:56:03,890 this thing is actually equal to the integral on A star 1153 00:56:03,890 --> 00:56:10,379 of f of X minus g of X, dx. 1154 00:56:14,160 --> 00:56:21,440 And this is simply equal to P of A star-- 1155 00:56:21,440 --> 00:56:26,160 so, say, Pf of A star minus Pg of A star. 1156 00:56:34,160 --> 00:56:36,810 Which one is larger than the other one? 1157 00:56:41,610 --> 00:56:43,540 AUDIENCE: [INAUDIBLE] 1158 00:56:43,540 --> 00:56:44,600 PHILIPPE RIGOLLET: It is. 1159 00:56:44,600 --> 00:56:45,951 Just look at this board. 1160 00:56:45,951 --> 00:56:47,406 AUDIENCE: [INAUDIBLE] 1161 00:56:47,406 --> 00:56:48,406 PHILIPPE RIGOLLET: What? 1162 00:56:48,406 --> 00:56:49,880 AUDIENCE: [INAUDIBLE] 1163 00:56:49,880 --> 00:56:50,510 PHILIPPE RIGOLLET: The first one has 1164 00:56:50,510 --> 00:56:51,980 to be larger, because this thing is actually 1165 00:56:51,980 --> 00:56:53,271 equal to a non-negative number. 1166 00:56:59,590 --> 00:57:01,990 So now I have this absolute value of two things, 1167 00:57:01,990 --> 00:57:04,150 and so I'm closer to the actual definition. 1168 00:57:04,150 --> 00:57:06,910 But I still need to show you that this thing is 1169 00:57:06,910 --> 00:57:09,010 the maximum value. 1170 00:57:09,010 --> 00:57:17,710 So this is definitely at most the maximum over A of Pf 1171 00:57:17,710 --> 00:57:21,670 of A minus Pg of A. 1172 00:57:21,670 --> 00:57:24,290 That's certainly true. 1173 00:57:24,290 --> 00:57:24,830 Right? 1174 00:57:24,830 --> 00:57:27,850 We agree with this? 1175 00:57:27,850 --> 00:57:30,620 Because this is just for one specific A, 1176 00:57:30,620 --> 00:57:34,930 and I'm bounding it by the maximum over all possible A. 1177 00:57:34,930 --> 00:57:36,932 So that's clearly true. 1178 00:57:36,932 --> 00:57:38,640 So now I have to go the other way around. 1179 00:57:38,640 --> 00:57:44,370 I have to show you that the max is actually this guy, A star. 1180 00:57:44,370 --> 00:57:45,640 So why would that be true? 1181 00:57:45,640 --> 00:57:49,180 Well, let's just inspect this thing over there. 1182 00:57:49,180 --> 00:57:50,730 So we want to show that if I take 1183 00:57:50,730 --> 00:57:53,490 any other A in this integral than this guy A star, 1184 00:57:53,490 --> 00:57:56,580 it's actually going to decrease its value. 1185 00:57:56,580 --> 00:57:57,720 So we have this function. 1186 00:57:57,720 --> 00:57:59,303 I'm going to call this function delta. 1187 00:58:02,314 --> 00:58:03,730 And what we have is-- so let's say 1188 00:58:03,730 --> 00:58:04,920 this function looks like this. 1189 00:58:04,920 --> 00:58:06,836 Now it's the difference between two densities. 1190 00:58:06,836 --> 00:58:09,500 It doesn't have to integrate-- it doesn't 1191 00:58:09,500 --> 00:58:10,500 have to be non-negative. 1192 00:58:10,500 --> 00:58:12,420 But it certainly has to integrate to 0. 1193 00:58:15,510 --> 00:58:18,440 And so now I take this thing. 1194 00:58:18,440 --> 00:58:22,126 And the A star, what is the set A star here? 1195 00:58:22,126 --> 00:58:25,640 The set A star is the set over which the function 1196 00:58:25,640 --> 00:58:27,645 delta is non-negative. 1197 00:58:36,340 --> 00:58:37,590 So that's just the definition.
1198 00:58:37,590 --> 00:58:41,660 A star was the set over which f minus g was positive, 1199 00:58:41,660 --> 00:58:44,430 and f minus g was just called delta. 1200 00:58:44,430 --> 00:58:47,720 So what it means is that what I'm really integrating 1201 00:58:47,720 --> 00:58:50,810 is delta on this set. 1202 00:58:50,810 --> 00:58:53,570 So it's this area under the curve, 1203 00:58:53,570 --> 00:58:55,230 just on the positive things. 1204 00:58:55,230 --> 00:58:57,830 Agreed? 1205 00:58:57,830 --> 00:59:03,290 So now let's just make some tiny variations around this guy. 1206 00:59:03,290 --> 00:59:08,150 If I take A to be larger than A star-- 1207 00:59:08,150 --> 00:59:10,280 so let me add, for example, this part here. 1208 00:59:12,920 --> 00:59:15,680 That means that when I compute my integral, 1209 00:59:15,680 --> 00:59:18,067 I'm removing this area under the curve. 1210 00:59:18,067 --> 00:59:18,650 It's negative. 1211 00:59:18,650 --> 00:59:20,520 The integral here is negative. 1212 00:59:20,520 --> 00:59:25,160 So if I start adding something to A, the value goes lower. 1213 00:59:25,160 --> 00:59:29,060 If I start removing something from A, like say this guy, 1214 00:59:29,060 --> 00:59:32,450 I'm actually removing this value from the integral. 1215 00:59:32,450 --> 00:59:33,320 So there's no way. 1216 00:59:33,320 --> 00:59:34,370 I'm actually stuck. 1217 00:59:34,370 --> 00:59:37,100 This A star is the one that actually maximizes 1218 00:59:37,100 --> 00:59:39,830 the integral of this function. 1219 00:59:39,830 --> 00:59:49,470 So we used the fact that for any function, 1220 00:59:49,470 --> 00:59:59,180 say delta, the integral over A of delta 1221 00:59:59,180 --> 01:00:02,712 is less than the integral over the set of X's 1222 01:00:02,712 --> 01:00:07,670 such that delta of X is non-negative of delta of X, dx. 1223 01:00:10,280 --> 01:00:13,518 And that's an obvious fact, just by picture, say. 1224 01:00:18,498 --> 01:00:24,972 And that's true for all A. Yeah? 1225 01:00:24,972 --> 01:00:28,956 AUDIENCE: [INAUDIBLE] could you use 1226 01:00:28,956 --> 01:00:33,106 like a portion under the axis as like less than 1227 01:00:33,106 --> 01:00:34,845 or equal to the portion above the axis? 1228 01:00:34,845 --> 01:00:36,470 PHILIPPE RIGOLLET: It's actually equal. 1229 01:00:36,470 --> 01:00:39,005 We know that the integral of f minus g-- 1230 01:00:39,005 --> 01:00:41,580 the integral of delta is 0. 1231 01:00:41,580 --> 01:00:47,344 So there's actually exactly the same area above and below. 1232 01:00:47,344 --> 01:00:49,880 But yeah, you're right. 1233 01:00:49,880 --> 01:00:51,349 You could go to the extreme cases. 1234 01:00:51,349 --> 01:00:51,890 You're right. 1235 01:00:57,470 --> 01:00:57,970 No. 1236 01:00:57,970 --> 01:01:00,490 It would actually still be true, even if there was-- 1237 01:01:00,490 --> 01:01:02,720 if this was a constant, that would still be true. 1238 01:01:02,720 --> 01:01:05,500 Here, I never use the fact that the integral is equal to 0. 1239 01:01:11,380 --> 01:01:15,560 I could shift this function by 1 so that the integral of delta 1240 01:01:15,560 --> 01:01:18,230 is equal to 1, and it would still 1241 01:01:18,230 --> 01:01:21,000 be true that it's maximized when I take A to be 1242 01:01:21,000 --> 01:01:24,892 the set where it's positive. 1243 01:01:24,892 --> 01:01:27,350 Just need to make sure that there is someplace where it is, 1244 01:01:27,350 --> 01:01:28,390 but that's about it.
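As a numerical sanity check on the identity just proved, here is a sketch of my own, not from the lecture, for two unit-variance Gaussians with means 0 and 1 (arbitrary choices): it evaluates both sides on a grid, the 1/2 integral of |f minus g| and Pf of A star minus Pg of A star, and they agree.

```python
import numpy as np

def gauss_pdf(x, mu):
    # density of N(mu, 1)
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

x = np.linspace(-10.0, 10.0, 200_001)   # grid wide enough to hold essentially all the mass
dx = x[1] - x[0]
f, g = gauss_pdf(x, 0.0), gauss_pdf(x, 1.0)

lhs = 0.5 * np.sum(np.abs(f - g)) * dx   # (1/2) * integral of |f - g|
a_star = f >= g                          # the set A* = {x : f(x) >= g(x)}
rhs = np.sum((f - g)[a_star]) * dx       # P_f(A*) - P_g(A*)
print(lhs, rhs)                          # both approximately 0.383
```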
1245 01:01:33,390 --> 01:01:36,981 Of course, we used this before, when we made this thing. 1246 01:01:36,981 --> 01:01:38,730 But just the last argument, this last fact 1247 01:01:38,730 --> 01:01:39,646 does not require that. 1248 01:01:43,820 --> 01:01:44,320 All right. 1249 01:01:44,320 --> 01:01:47,030 So now we have this notion of-- 1250 01:01:47,030 --> 01:01:48,358 I need the-- 1251 01:01:52,531 --> 01:01:53,030 OK. 1252 01:01:53,030 --> 01:01:57,450 So we have this notion of distance 1253 01:01:57,450 --> 01:01:58,830 between probability measures. 1254 01:01:58,830 --> 01:02:00,940 I mean, these things are exactly what-- 1255 01:02:00,940 --> 01:02:03,780 if I were to be in a formal math class and I said, 1256 01:02:03,780 --> 01:02:06,060 here are the axioms that a distance should satisfy, 1257 01:02:06,060 --> 01:02:08,640 those are exactly those things. 1258 01:02:08,640 --> 01:02:10,150 If it's not satisfying these things, 1259 01:02:10,150 --> 01:02:13,800 it's called a pseudo-distance or quasi-distance or just a metric 1260 01:02:13,800 --> 01:02:15,770 or nothing at all, honestly. 1261 01:02:15,770 --> 01:02:16,590 So it's a distance. 1262 01:02:16,590 --> 01:02:18,930 It's symmetric, non-negative, equal to 0 1263 01:02:18,930 --> 01:02:21,720 if and only if the two arguments are equal, and 1264 01:02:21,720 --> 01:02:25,870 it satisfies the triangle inequality. 1265 01:02:25,870 --> 01:02:28,860 And so that means that we have this actual total variation 1266 01:02:28,860 --> 01:02:31,140 distance between probability distributions. 1267 01:02:31,140 --> 01:02:36,510 And here is now a statistical strategy to implement our goal. 1268 01:02:36,510 --> 01:02:38,190 Remember, our goal was to spit out 1269 01:02:38,190 --> 01:02:41,940 a theta hat such that P theta 1270 01:02:41,940 --> 01:02:45,700 hat was close to P theta star. 1271 01:02:45,700 --> 01:02:48,940 So hopefully, we were trying to minimize the total variation 1272 01:02:48,940 --> 01:02:51,580 distance between P theta hat and P theta star. 1273 01:02:51,580 --> 01:02:55,090 Now, we cannot do that, because just by this fact, this slide, 1274 01:02:55,090 --> 01:02:57,340 if we wanted to do that directly, we would just take-- 1275 01:02:57,340 --> 01:02:59,830 well, let's take theta hat equals theta star and that will 1276 01:02:59,830 --> 01:03:00,880 give me the value 0. 1277 01:03:00,880 --> 01:03:03,196 And that's the minimum possible value we can take. 1278 01:03:03,196 --> 01:03:04,570 The problem is that we don't know 1279 01:03:04,570 --> 01:03:07,342 what the total variation is to something that we don't know. 1280 01:03:07,342 --> 01:03:09,550 We know how to compute total variations if I give you 1281 01:03:09,550 --> 01:03:10,660 the two arguments. 1282 01:03:10,660 --> 01:03:12,560 But here, one of the arguments is not known. 1283 01:03:12,560 --> 01:03:16,370 P theta star is not known to us, so we need to estimate it. 1284 01:03:16,370 --> 01:03:18,910 And so here is the strategy. 1285 01:03:18,910 --> 01:03:21,760 Just build an estimator of the total variation 1286 01:03:21,760 --> 01:03:24,580 distance between P theta and P theta star 1287 01:03:24,580 --> 01:03:27,250 for all candidate theta, all possible theta 1288 01:03:27,250 --> 01:03:30,240 in capital theta. 1289 01:03:30,240 --> 01:03:33,390 Now, if this is a good estimate, then when I minimize it, 1290 01:03:33,390 --> 01:03:37,230 I should get something that's close to P theta star.
1291 01:03:37,230 --> 01:03:38,220 So here's the strategy. 1292 01:03:38,220 --> 01:03:40,980 This is my function that maps theta 1293 01:03:40,980 --> 01:03:44,340 to the total variation between P theta and P theta star. 1294 01:03:44,340 --> 01:03:47,010 I know it's minimized at theta star. 1295 01:03:47,010 --> 01:03:51,090 That's definitely TV of P-- and the value here, the y-axis 1296 01:03:51,090 --> 01:03:53,300 should say 0. 1297 01:03:53,300 --> 01:03:54,800 And so I don't know this guy, so I'm 1298 01:03:54,800 --> 01:03:56,810 going to estimate it by some estimator that 1299 01:03:56,810 --> 01:03:57,680 comes from my data. 1300 01:03:57,680 --> 01:04:00,590 Hopefully, the more data I have, the better this estimator is. 1301 01:04:00,590 --> 01:04:03,391 And I'm going to try to minimize this estimator now. 1302 01:04:03,391 --> 01:04:05,390 And if the two things are close, then the minima 1303 01:04:05,390 --> 01:04:07,470 should be close. 1304 01:04:07,470 --> 01:04:09,560 That's a pretty good estimation strategy. 1305 01:04:09,560 --> 01:04:11,370 The problem is that it's very unclear 1306 01:04:11,370 --> 01:04:13,810 how you would build this estimator of TV, 1307 01:04:13,810 --> 01:04:18,710 of the Total Variation. 1308 01:04:18,710 --> 01:04:21,410 So building estimators, as I said, 1309 01:04:21,410 --> 01:04:25,160 typically consists in replacing expectations by averages. 1310 01:04:25,160 --> 01:04:29,130 But there's no simple way of expressing the total variation 1311 01:04:29,130 --> 01:04:31,230 distance as an expectation with respect 1312 01:04:31,230 --> 01:04:33,840 to theta star of anything. 1313 01:04:33,840 --> 01:04:36,060 So what we're going to do is we're 1314 01:04:36,060 --> 01:04:38,190 going to move from total variation distance 1315 01:04:38,190 --> 01:04:41,040 to another notion of distance that sort of has 1316 01:04:41,040 --> 01:04:43,020 the same properties and the same feeling 1317 01:04:43,020 --> 01:04:47,040 and the same motivations as the total variation distance. 1318 01:04:47,040 --> 01:04:49,650 But for this guy, we will be able to build 1319 01:04:49,650 --> 01:04:51,420 an estimate for it, because it's actually 1320 01:04:51,420 --> 01:04:53,929 going to be of the form expectation of something. 1321 01:04:53,929 --> 01:04:55,470 And we're going to be able to replace 1322 01:04:55,470 --> 01:05:00,280 the expectation by an average and then minimize this average. 1323 01:05:00,280 --> 01:05:04,290 So this surrogate for total variation distance 1324 01:05:04,290 --> 01:05:07,510 is actually called the Kullback-Leibler divergence. 1325 01:05:07,510 --> 01:05:09,760 And why we call it divergence is because it's actually 1326 01:05:09,760 --> 01:05:11,740 not a distance. 1327 01:05:11,740 --> 01:05:14,760 It's not going to be symmetric to start with. 1328 01:05:14,760 --> 01:05:17,400 So this Kullback-Leibler or even KL divergence-- 1329 01:05:17,400 --> 01:05:20,790 I will just refer to it as KL-- 1330 01:05:20,790 --> 01:05:22,860 is actually just more convenient. 1331 01:05:22,860 --> 01:05:27,480 But it has some roots coming from information theory, which 1332 01:05:27,480 --> 01:05:29,170 I will not delve into. 1333 01:05:29,170 --> 01:05:31,450 But if any of you is actually a Course 6 student, 1334 01:05:31,450 --> 01:05:32,970 I'm sure you've seen that in some-- 1335 01:05:32,970 --> 01:05:37,980 I don't know-- course that has any content on information 1336 01:05:37,980 --> 01:05:39,060 theory.
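Schematically, the strategy described on this slide is the following, in the notation used so far (the hat denotes a quantity built from the sample; this is a restatement, not new material):

```latex
% the ideal, but unavailable, plug-in strategy:
\hat{\theta} \;=\; \operatorname*{argmin}_{\theta \in \Theta}\;
  \widehat{\mathrm{TV}}\bigl(\mathbf{P}_{\theta},\, \mathbf{P}_{\theta^{*}}\bigr)
```

Since no good estimator of the total variation is available, the next step runs the same program with the Kullback-Leibler divergence in place of TV, precisely because KL can be written as an expectation and hence estimated by an average.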
1337 01:05:39,060 --> 01:05:39,560 All right. 1338 01:05:39,560 --> 01:05:42,380 So the KL divergence between two probability measures, P theta 1339 01:05:42,380 --> 01:05:43,790 and P theta prime-- 1340 01:05:43,790 --> 01:05:47,810 and here, as I said, it's not going to be the symmetric, 1341 01:05:47,810 --> 01:05:49,680 so it's very important for you to specify 1342 01:05:49,680 --> 01:05:51,930 which order you say it is, between P theta and P theta 1343 01:05:51,930 --> 01:05:52,429 prime. 1344 01:05:52,429 --> 01:05:55,060 It's different from saying between P theta prime and P 1345 01:05:55,060 --> 01:05:56,510 theta. 1346 01:05:56,510 --> 01:05:58,550 And so we denote it by KL. 1347 01:05:58,550 --> 01:06:04,010 And so remember, before we had either the sum or the integral 1348 01:06:04,010 --> 01:06:07,910 of 1/2 of the distance-- absolute value of the distance 1349 01:06:07,910 --> 01:06:10,550 between the PMFs and 1/2 of the absolute values 1350 01:06:10,550 --> 01:06:17,900 of the distances between the probability density functions. 1351 01:06:17,900 --> 01:06:19,940 And then we replace this absolute value 1352 01:06:19,940 --> 01:06:24,740 of the distance divided by 2 by this weird function. 1353 01:06:24,740 --> 01:06:28,100 This function is P theta, log P theta, 1354 01:06:28,100 --> 01:06:30,290 divided by P theta prime. 1355 01:06:30,290 --> 01:06:31,880 That's the function. 1356 01:06:31,880 --> 01:06:34,710 That's a weird function. 1357 01:06:34,710 --> 01:06:35,210 OK. 1358 01:06:35,210 --> 01:06:38,360 So this was what we had. 1359 01:06:40,960 --> 01:06:41,590 That's the TV. 1360 01:06:44,670 --> 01:06:48,120 And the KL, if I use the same notation, f and g, 1361 01:06:48,120 --> 01:06:57,315 is integral of f of X, log of f of X over g of X, dx. 1362 01:07:01,088 --> 01:07:04,280 It's a bit different. 1363 01:07:04,280 --> 01:07:09,120 And I go from discrete to continuous using an integral. 1364 01:07:09,120 --> 01:07:10,240 Everybody can read this. 1365 01:07:10,240 --> 01:07:11,365 Everybody's fine with this. 1366 01:07:11,365 --> 01:07:15,780 Is there any uncertainty about the actual definition here? 1367 01:07:15,780 --> 01:07:17,480 So here I go straight to the definition, 1368 01:07:17,480 --> 01:07:19,910 which is just plugging the functions 1369 01:07:19,910 --> 01:07:22,190 into some integral and compute. 1370 01:07:22,190 --> 01:07:24,670 So I don't bother with maxima or anything. 1371 01:07:24,670 --> 01:07:26,400 I mean, there is something like that, 1372 01:07:26,400 --> 01:07:29,885 but it's certainly not as natural as the total variation. 1373 01:07:29,885 --> 01:07:30,875 Yes? 1374 01:07:30,875 --> 01:07:33,845 AUDIENCE: The total variation, [INAUDIBLE].. 1375 01:07:38,732 --> 01:07:40,440 PHILIPPE RIGOLLET: Yes, just because it's 1376 01:07:40,440 --> 01:07:42,280 hard to build anything from total variation, 1377 01:07:42,280 --> 01:07:43,500 because I don't know it. 1378 01:07:43,500 --> 01:07:45,835 So it's very difficult. But if you can actually-- 1379 01:07:45,835 --> 01:07:47,910 and even computing it between two Gaussians, 1380 01:07:47,910 --> 01:07:49,680 just try it for yourself. 1381 01:07:49,680 --> 01:07:52,740 And please stop doing it after at most six minutes, 1382 01:07:52,740 --> 01:07:54,730 because you won't be able to do it. 
1383 01:07:54,730 --> 01:07:56,730 And so it's just very hard to manipulate; 1384 01:07:56,730 --> 01:07:59,070 this integral of absolute values of differences 1385 01:07:59,070 --> 01:08:01,230 between probability density functions, at least 1386 01:08:01,230 --> 01:08:02,771 for the probability density functions 1387 01:08:02,771 --> 01:08:04,860 we're used to manipulating, is actually a nightmare. 1388 01:08:04,860 --> 01:08:08,370 And so people prefer KL, because for the Gaussian, 1389 01:08:08,370 --> 01:08:10,770 this is going to be theta minus theta prime squared over 2. 1390 01:08:10,770 --> 01:08:12,580 And then we're going to be happy. 1391 01:08:12,580 --> 01:08:15,720 And so those things are much easier to manipulate. 1392 01:08:15,720 --> 01:08:18,029 But it's really-- the total variation 1393 01:08:18,029 --> 01:08:20,162 is telling you how far in the worst case 1394 01:08:20,162 --> 01:08:21,370 the two probabilities can be. 1395 01:08:21,370 --> 01:08:23,220 This is really the intrinsic notion 1396 01:08:23,220 --> 01:08:25,380 of closeness between probabilities. 1397 01:08:25,380 --> 01:08:27,229 So that's really the one-- if we could, 1398 01:08:27,229 --> 01:08:30,202 that's the one we would go after. 1399 01:08:30,202 --> 01:08:32,160 Sometimes people will compute them numerically, 1400 01:08:32,160 --> 01:08:34,785 so that they can say, oh, here's the total variation distance I 1401 01:08:34,785 --> 01:08:36,899 have between those two things. 1402 01:08:36,899 --> 01:08:38,670 And then you actually know that that 1403 01:08:38,670 --> 01:08:41,460 means they are close, because the absolute value-- if I tell 1404 01:08:41,460 --> 01:08:44,370 you total variation is 0.01, like we did here, 1405 01:08:44,370 --> 01:08:46,319 it has a very specific meaning. 1406 01:08:46,319 --> 01:08:49,762 If I tell you the KL divergence is 0.01, 1407 01:08:49,762 --> 01:08:50,970 it's not clear what it means. 1408 01:08:55,130 --> 01:08:55,760 OK. 1409 01:08:55,760 --> 01:08:58,109 So what are the properties? 1410 01:08:58,109 --> 01:09:00,870 The KL divergence between P theta and P theta prime 1411 01:09:00,870 --> 01:09:03,170 is different from the KL divergence between P theta 1412 01:09:03,170 --> 01:09:05,569 prime and P theta in general. 1413 01:09:05,569 --> 01:09:07,640 Of course, in general, because if theta 1414 01:09:07,640 --> 01:09:11,029 is equal to theta prime, then this certainly is true. 1415 01:09:11,029 --> 01:09:14,600 So there's cases when it's not true. 1416 01:09:14,600 --> 01:09:17,090 The KL divergence is non-negative. 1417 01:09:17,090 --> 01:09:19,742 Who knows Jensen's inequality here? 1418 01:09:19,742 --> 01:09:21,450 That should be a subset of the people who 1419 01:09:21,450 --> 01:09:25,310 raised their hand when I asked what a convex function is. 1420 01:09:25,310 --> 01:09:26,090 All right. 1421 01:09:26,090 --> 01:09:27,890 So you know what Jensen's inequality is. 1422 01:09:27,890 --> 01:09:30,490 This is Jensen's-- the proof is just one step 1423 01:09:30,490 --> 01:09:33,840 Jensen's inequality, which we will not go into in detail. 1424 01:09:33,840 --> 01:09:35,569 But that's basically an inequality 1425 01:09:35,569 --> 01:09:38,233 involving expectation of a convex function 1426 01:09:38,233 --> 01:09:40,399 of a random variable compared to the convex function 1427 01:09:40,399 --> 01:09:42,065 of the expectation of a random variable. 1428 01:09:45,460 --> 01:09:48,580 If you know Jensen, have fun and prove it.
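For anyone who wants to take up the invitation, here is the one-step Jensen argument, a standard proof sketched in the continuous notation used above (with g/f read where f is positive; the discrete case is identical with sums in place of integrals):

```latex
\mathrm{KL}(f, g)
  \;=\; \int f \,\log\frac{f}{g}
  \;=\; \mathbb{E}_{f}\!\left[-\log\frac{g(X)}{f(X)}\right]
  \;\ge\; -\log \mathbb{E}_{f}\!\left[\frac{g(X)}{f(X)}\right]
  \;=\; -\log \int g
  \;=\; -\log 1
  \;=\; 0,
```

where the inequality is Jensen's, applied to the convex function minus log.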
1429 01:09:48,580 --> 01:09:51,729 What's really nice is that if the KL is equal to 0, 1430 01:09:51,729 --> 01:09:55,220 then the two distributions are the same. 1431 01:09:55,220 --> 01:09:57,170 And that's something we're looking for. 1432 01:09:57,170 --> 01:09:59,020 Everything else we're happy to throw out. 1433 01:09:59,020 --> 01:10:00,478 And actually, if you pay attention, 1434 01:10:00,478 --> 01:10:03,500 we're actually really throwing out everything else. 1435 01:10:03,500 --> 01:10:05,060 So they're not symmetric. 1436 01:10:05,060 --> 01:10:08,530 It does not satisfy the triangle inequality in general. 1437 01:10:08,530 --> 01:10:12,790 But it's non-negative and it's 0 if and only if the two 1438 01:10:12,790 --> 01:10:13,922 distributions are the same. 1439 01:10:13,922 --> 01:10:15,130 And that's all we care about. 1440 01:10:15,130 --> 01:10:17,129 And that's what we call a divergence rather than 1441 01:10:17,129 --> 01:10:21,910 a distance, and divergence will be enough for our purposes. 1442 01:10:21,910 --> 01:10:24,080 And actually, this asymmetry, the fact 1443 01:10:24,080 --> 01:10:26,570 that it's not flipping-- the first time I saw it, 1444 01:10:26,570 --> 01:10:27,380 I was just annoyed. 1445 01:10:27,380 --> 01:10:29,225 I was like, can we just like, I don't 1446 01:10:29,225 --> 01:10:31,550 know, take the average of the KL between P theta 1447 01:10:31,550 --> 01:10:34,270 and P theta prime and P theta prime and P theta, 1448 01:10:34,270 --> 01:10:36,290 you would think maybe you could do this. 1449 01:10:36,290 --> 01:10:39,590 You just symmetrize it by just taking the average of the two 1450 01:10:39,590 --> 01:10:41,480 possible values it can take. 1451 01:10:41,480 --> 01:10:44,930 The problem is that this will still not satisfy the triangle 1452 01:10:44,930 --> 01:10:45,500 inequality. 1453 01:10:45,500 --> 01:10:48,290 And there's no way basically to turn it into something 1454 01:10:48,290 --> 01:10:49,850 that is a distance. 1455 01:10:49,850 --> 01:10:52,350 But the divergence is doing a pretty good thing for us. 1456 01:10:52,350 --> 01:10:55,790 And this is what will allow us to estimate it and basically 1457 01:10:55,790 --> 01:11:03,160 overcome what we could not do with the total variation. 1458 01:11:03,160 --> 01:11:06,410 So the first thing that you want to notice 1459 01:11:06,410 --> 01:11:08,120 is the total variation distance-- 1460 01:11:08,120 --> 01:11:10,130 the KL divergence, sorry, is actually 1461 01:11:10,130 --> 01:11:12,470 an expectation of something. 1462 01:11:12,470 --> 01:11:15,260 Look at what it is here. 1463 01:11:15,260 --> 01:11:20,420 It's the integral of some function against a density. 1464 01:11:20,420 --> 01:11:25,230 That's exactly the definition of an expectation, right? 1465 01:11:25,230 --> 01:11:29,950 So this is the expectation of this particular function 1466 01:11:29,950 --> 01:11:31,730 with respect to this density f. 1467 01:11:31,730 --> 01:11:35,650 So in particular, if I call this density f-- if I say, 1468 01:11:35,650 --> 01:11:38,400 I want the true distribution to be the first argument, 1469 01:11:38,400 --> 01:11:39,920 this is an expectation with respect 1470 01:11:39,920 --> 01:11:42,310 to the true distribution from which my data is actually 1471 01:11:42,310 --> 01:11:45,760 drawn of the log of this ratio. 1472 01:11:45,760 --> 01:11:46,870 So ha ha. 1473 01:11:46,870 --> 01:11:47,700 I'm a statistician. 1474 01:11:47,700 --> 01:11:49,300 Now I have an expectation.
1475 01:11:49,300 --> 01:11:51,430 I can replace it by an average, because I have data 1476 01:11:51,430 --> 01:11:52,524 from this distribution. 1477 01:11:52,524 --> 01:11:54,940 And I could actually replace the expectation by an average 1478 01:11:54,940 --> 01:11:56,680 and try to minimize here. 1479 01:11:56,680 --> 01:11:57,959 The problem is that-- 1480 01:11:57,959 --> 01:12:00,250 actually the star here should be in front of the theta, 1481 01:12:00,250 --> 01:12:01,150 not of the P, right? 1482 01:12:01,150 --> 01:12:04,460 That's P theta star, not P star theta. 1483 01:12:04,460 --> 01:12:05,960 But here, I still cannot compute it, 1484 01:12:05,960 --> 01:12:08,510 because I have this P theta star that shows up. 1485 01:12:08,510 --> 01:12:10,220 I don't know what it is. 1486 01:12:10,220 --> 01:12:13,500 And that's now where the log plays a role. 1487 01:12:13,500 --> 01:12:15,050 If you actually pay attention, I said 1488 01:12:15,050 --> 01:12:16,940 you can use Jensen to prove all this stuff. 1489 01:12:16,940 --> 01:12:21,110 You could actually replace the log by any concave function. 1490 01:12:21,110 --> 01:12:22,440 That would be an f divergence. 1491 01:12:22,440 --> 01:12:24,030 That's called an f divergence. 1492 01:12:24,030 --> 01:12:26,950 But the log itself has a very, very specific property, 1493 01:12:26,950 --> 01:12:29,790 which allows us to say that the log of the ratio 1494 01:12:29,790 --> 01:12:33,290 is the difference of the logs. 1495 01:12:33,290 --> 01:12:38,620 Now, this thing here does not depend on theta. 1496 01:12:38,620 --> 01:12:43,010 If I think of this KL divergence as a function of theta, 1497 01:12:43,010 --> 01:12:45,239 then the first part is actually a constant. 1498 01:12:45,239 --> 01:12:47,530 If I change theta, this thing is never going to change. 1499 01:12:47,530 --> 01:12:49,980 It depends only on theta star. 1500 01:12:49,980 --> 01:12:51,480 So if I look at this function KL-- 1501 01:13:03,200 --> 01:13:05,500 so if I look at the function, theta maps 1502 01:13:05,500 --> 01:13:11,450 to KL P theta star, P theta, it's 1503 01:13:11,450 --> 01:13:15,400 of the form expectation with respect to theta star, 1504 01:13:15,400 --> 01:13:23,780 log of P theta star of X. And then I 1505 01:13:23,780 --> 01:13:29,610 have minus expectation with respect to theta star of log 1506 01:13:29,610 --> 01:13:33,340 of P theta of X. 1507 01:13:33,340 --> 01:13:38,900 Now as I said, this thing here, this second expectation 1508 01:13:38,900 --> 01:13:39,950 is a function of theta. 1509 01:13:39,950 --> 01:13:42,381 When theta changes, this thing is going to change. 1510 01:13:42,381 --> 01:13:43,380 And that's a good thing. 1511 01:13:43,380 --> 01:13:45,754 We want something that reflects how close theta and theta 1512 01:13:45,754 --> 01:13:46,537 star are. 1513 01:13:46,537 --> 01:13:48,120 But this thing is not going to change. 1514 01:13:48,120 --> 01:13:49,620 This is a fixed value. 1515 01:13:49,620 --> 01:13:53,125 Actually, it's the negative entropy of P theta star. 1516 01:13:53,125 --> 01:13:54,500 And if you've heard of KL, you've 1517 01:13:54,500 --> 01:13:55,583 probably heard of entropy. 1518 01:13:55,583 --> 01:13:58,820 And that's what-- it's basically minus the entropy. 1519 01:13:58,820 --> 01:14:01,310 And that's a quantity that just depends on theta star. 1520 01:14:01,310 --> 01:14:03,470 But it's just the number.
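In symbols, the decomposition just described is the following, writing p for the PMF or the density and C(theta star) for the first term, the constant in theta (this is a restatement of the board computation, nothing new):

```latex
\mathrm{KL}\bigl(\mathbf{P}_{\theta^{*}}, \mathbf{P}_{\theta}\bigr)
  \;=\; \mathbb{E}_{\theta^{*}}\bigl[\log p_{\theta^{*}}(X)\bigr]
        \;-\; \mathbb{E}_{\theta^{*}}\bigl[\log p_{\theta}(X)\bigr]
  \;=\; \underbrace{C(\theta^{*})}_{\text{does not depend on }\theta}
        \;-\; \mathbb{E}_{\theta^{*}}\bigl[\log p_{\theta}(X)\bigr].
```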
1521 01:14:03,470 --> 01:14:05,030 I could compute this number if I told 1522 01:14:05,030 --> 01:14:07,130 you this is N(theta star, 1). 1523 01:14:07,130 --> 01:14:09,450 You could compute this. 1524 01:14:09,450 --> 01:14:11,640 So now I'm going to try to minimize 1525 01:14:11,640 --> 01:14:14,500 the estimate of this function. 1526 01:14:14,500 --> 01:14:16,870 And minimizing a function or a function plus a constant 1527 01:14:16,870 --> 01:14:18,800 is the same thing. 1528 01:14:18,800 --> 01:14:20,840 I'm just shifting the function here or here, 1529 01:14:20,840 --> 01:14:23,560 but it's the same minimizer. 1530 01:14:23,560 --> 01:14:24,060 OK. 1531 01:14:24,060 --> 01:14:28,910 So the function that maps theta to KL of P theta star 1532 01:14:28,910 --> 01:14:32,370 to P theta is of the form constant minus this expectation 1533 01:14:32,370 --> 01:14:35,810 of a log of P theta. 1534 01:14:35,810 --> 01:14:38,070 Everybody agrees? 1535 01:14:38,070 --> 01:14:40,610 Are there any questions about this? 1536 01:14:40,610 --> 01:14:42,740 Are there any remarks, including I 1537 01:14:42,740 --> 01:14:46,230 have no idea what's happening right now? 1538 01:14:46,230 --> 01:14:46,730 OK. 1539 01:14:46,730 --> 01:14:47,700 We're good? 1540 01:14:47,700 --> 01:14:48,200 Yeah. 1541 01:14:48,200 --> 01:14:50,160 AUDIENCE: So when you're actually employing this method, 1542 01:14:50,160 --> 01:14:52,610 how do you know which theta to use as theta star and which 1543 01:14:52,610 --> 01:14:53,142 isn't? 1544 01:14:53,142 --> 01:14:55,600 PHILIPPE RIGOLLET: So this is not a method just yet, right? 1545 01:14:55,600 --> 01:14:57,550 I'm just describing to you what the KL divergence 1546 01:14:57,550 --> 01:14:58,720 between two distributions is. 1547 01:14:58,720 --> 01:15:00,130 If you really wanted to compute it, 1548 01:15:00,130 --> 01:15:01,930 you would need to know what P theta star is 1549 01:15:01,930 --> 01:15:02,770 and what P theta is. 1550 01:15:02,770 --> 01:15:03,467 AUDIENCE: Right. 1551 01:15:03,467 --> 01:15:06,050 PHILIPPE RIGOLLET: And so here, I'm just saying at some point, 1552 01:15:06,050 --> 01:15:07,650 we still-- so here, you see-- 1553 01:15:07,650 --> 01:15:09,280 so now let's move on one step. 1554 01:15:09,280 --> 01:15:12,570 I don't know the expectation with respect to theta star. 1555 01:15:12,570 --> 01:15:15,904 But I have data that comes from distribution P theta star. 1556 01:15:15,904 --> 01:15:17,820 So the expectation by the law of large numbers 1557 01:15:17,820 --> 01:15:19,950 should be close to the average. 1558 01:15:19,950 --> 01:15:23,670 And so what I'm doing is I'm replacing any-- 1559 01:15:23,670 --> 01:15:27,390 I can actually-- this is a very standard estimation method. 1560 01:15:27,390 --> 01:15:30,360 You write something as an expectation with respect 1561 01:15:30,360 --> 01:15:34,380 to the data-generating process of some function. 1562 01:15:34,380 --> 01:15:37,349 And then you replace this by the average of this function. 1563 01:15:37,349 --> 01:15:38,890 And the law of large numbers tells me 1564 01:15:38,890 --> 01:15:41,326 that those two quantities should actually be close. 1565 01:15:41,326 --> 01:15:43,820 Now, it doesn't mean that's going to be the end of the day, 1566 01:15:43,820 --> 01:15:44,319 right. 1567 01:15:44,319 --> 01:15:46,950 When we did Xn bar, that was the end of the day. 1568 01:15:46,950 --> 01:15:47,900 We had an expectation. 1569 01:15:47,900 --> 01:15:49,850 We replaced it by an average.
1570 01:15:49,850 --> 01:15:51,170 And then we were gone. 1571 01:15:51,170 --> 01:15:53,376 But here, we still have to do something, 1572 01:15:53,376 --> 01:15:55,250 because this is not telling me what theta is. 1573 01:15:55,250 --> 01:15:58,070 Now I still have to minimize this average. 1574 01:15:58,070 --> 01:16:04,370 So this is now my candidate estimator for KL, KL hat. 1575 01:16:04,370 --> 01:16:06,170 And that's the one where I said, well, it's 1576 01:16:06,170 --> 01:16:07,897 going to be of the form of constant. 1577 01:16:07,897 --> 01:16:09,230 And this constant, I don't know. 1578 01:16:09,230 --> 01:16:09,771 You're right. 1579 01:16:09,771 --> 01:16:11,586 I have no idea what this constant is. 1580 01:16:11,586 --> 01:16:13,640 It depends on P theta star. 1581 01:16:13,640 --> 01:16:16,310 But then I have minus something that I can completely compute. 1582 01:16:16,310 --> 01:16:20,170 If you give me data and theta, I can compute this entire thing. 1583 01:16:20,170 --> 01:16:25,670 And now what I claim is that the minimizer of f or f plus-- 1584 01:16:25,670 --> 01:16:28,950 f of X or f of X plus 4 are the same thing, 1585 01:16:28,950 --> 01:16:32,200 or say 4 plus f of X. I'm just shifting 1586 01:16:32,200 --> 01:16:34,260 the plot of my function up and down, 1587 01:16:34,260 --> 01:16:36,340 but the minimizer stays exactly where it is. 1588 01:16:39,590 --> 01:16:41,075 If I have a function-- 1589 01:16:43,750 --> 01:16:45,284 so now I have a function of theta. 1590 01:16:51,620 --> 01:16:56,100 This is KL hat of P theta star, P theta. 1591 01:16:56,100 --> 01:16:58,831 And it's of the form-- it's a function like this. 1592 01:16:58,831 --> 01:17:00,330 I don't know where this function is. 1593 01:17:00,330 --> 01:17:06,880 It might very well be this function or this function. 1594 01:17:06,880 --> 01:17:10,870 Every time it's a translation on the y-axis of all these guys. 1595 01:17:10,870 --> 01:17:14,690 And the value that I translated by depends on theta star. 1596 01:17:14,690 --> 01:17:15,970 I don't know what it is. 1597 01:17:15,970 --> 01:17:19,600 But what I claim is that the minimizer is always this guy, 1598 01:17:19,600 --> 01:17:22,428 regardless of what the value is. 1599 01:17:22,428 --> 01:17:25,290 OK? 1600 01:17:25,290 --> 01:17:28,560 So when I say constant, it's a constant with respect to theta. 1601 01:17:28,560 --> 01:17:29,670 It's an unknown constant. 1602 01:17:29,670 --> 01:17:32,490 But it's with respect to theta, so without loss of generality, 1603 01:17:32,490 --> 01:17:36,840 I can assume that this constant is 0 for my purposes, 1604 01:17:36,840 --> 01:17:38,040 or 25 if you prefer. 1605 01:17:41,171 --> 01:17:41,670 All right. 1606 01:17:41,670 --> 01:17:46,420 So we'll just keep going on this property next time. 1607 01:17:46,420 --> 01:17:49,359 And we'll see how from here we can move on to-- 1608 01:17:49,359 --> 01:17:51,900 the likelihood is actually going to come out of this formula. 1609 01:17:51,900 --> 01:17:53,450 Thanks.
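To preview where this is heading, here is a minimal end-to-end sketch of the strategy on a Bernoulli model. It is my own illustration anticipating the next lecture, not a quote from it: replace the expectation in the KL, up to its unknown constant, by an average over the sample, and minimize over theta. Minimizing the average of minus log p theta of Xi is exactly maximizing the likelihood, which is the formula the lecture ends by pointing toward.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.3                               # unknown in practice; fixed here to check
x = rng.binomial(1, theta_star, size=5_000)    # iid Bernoulli(theta star) sample

def kl_hat_up_to_constant(theta, x):
    # estimate of KL(P_theta*, P_theta) minus its unknown constant C(theta*):
    # -(1/n) * sum_i log p_theta(X_i), with p_theta(x) = theta^x * (1 - theta)^(1 - x)
    return -np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

grid = np.linspace(0.01, 0.99, 99)             # candidate thetas
values = [kl_hat_up_to_constant(t, x) for t in grid]
theta_hat = grid[int(np.argmin(values))]
print(theta_hat, x.mean())                     # both close to theta_star = 0.3
```

The unknown constant shifts the whole curve up or down but leaves the minimizer untouched, which is precisely the point made at the end of the lecture.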