1
00:00:00,120 --> 00:00:02,460
The following content is
provided under a Creative
2
00:00:02,460 --> 00:00:03,880
Commons license.
3
00:00:03,880 --> 00:00:06,090
Your support will help
MIT OpenCourseWare
4
00:00:06,090 --> 00:00:10,180
continue to offer high-quality
educational resources for free.
5
00:00:10,180 --> 00:00:12,720
To make a donation or to
view additional materials
6
00:00:12,720 --> 00:00:16,650
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:16,650 --> 00:00:17,880
at ocw.mit.edu.
8
00:00:20,524 --> 00:00:21,940
PHILIPPE RIGOLLET:
So today, we're
9
00:00:21,940 --> 00:00:24,820
going to close this
chapter, this short chapter,
10
00:00:24,820 --> 00:00:26,200
on Bayesian inference.
11
00:00:26,200 --> 00:00:28,990
Again, this was just
an overview of what you
12
00:00:28,990 --> 00:00:32,259
can do in Bayesian inference.
13
00:00:32,259 --> 00:00:34,630
And last time, we
started defining
14
00:00:34,630 --> 00:00:36,260
what's called Jeffreys priors.
15
00:00:36,260 --> 00:00:36,760
Right?
16
00:00:36,760 --> 00:00:38,560
So when you do
Bayesian inference,
17
00:00:38,560 --> 00:00:41,620
you have to introduce a
prior on your parameter.
18
00:00:41,620 --> 00:00:43,660
And we said that
usually, it's something
19
00:00:43,660 --> 00:00:45,820
that encodes your domain
knowledge about where
20
00:00:45,820 --> 00:00:47,130
the parameter could be.
21
00:00:47,130 --> 00:00:49,030
But there's also some
principle way to do it,
22
00:00:49,030 --> 00:00:51,155
if you want to do Bayesian
inference without really
23
00:00:51,155 --> 00:00:53,420
having to think about it.
24
00:00:53,420 --> 00:00:56,260
And for example, one
of the natural priors
25
00:00:56,260 --> 00:00:58,080
were those non-informative
priors, right?
26
00:00:58,080 --> 00:00:59,740
If you were on a
compact set, it's
27
00:00:59,740 --> 00:01:01,570
a uniform prior of this set.
28
00:01:01,570 --> 00:01:04,239
If you're on an infinite set,
you can still think of taking
29
00:01:04,239 --> 00:01:06,520
the constant prior.
30
00:01:06,520 --> 00:01:09,280
And that's called a flat prior.
That's always equal to 1.
31
00:01:09,280 --> 00:01:13,300
And that's an improper prior
if you are on an infinite set
32
00:01:13,300 --> 00:01:14,830
or proportional to one.
33
00:01:14,830 --> 00:01:17,860
And so another prior
that you can think of,
34
00:01:17,860 --> 00:01:20,230
in the case where you have
a Fisher information, which
35
00:01:20,230 --> 00:01:23,200
is well-defined, is something
called Jeffreys prior.
36
00:01:23,200 --> 00:01:25,600
And this prior is
a prior which is
37
00:01:25,600 --> 00:01:28,150
proportional to square root of
the determinant of the Fisher
38
00:01:28,150 --> 00:01:29,780
information matrix.
39
00:01:29,780 --> 00:01:31,750
And if you're in
one dimension, it's
40
00:01:31,750 --> 00:01:37,750
basically proportional to
a square root of the Fisher
41
00:01:37,750 --> 00:01:40,750
information coefficient,
which we know, for example,
42
00:01:40,750 --> 00:01:44,170
is the asymptotic variance
of the maximum likelihood
43
00:01:44,170 --> 00:01:45,370
estimator.
44
00:01:45,370 --> 00:01:48,010
And it turns out
that it's basically--
45
00:01:48,010 --> 00:01:50,330
So square root of this
thing is basically
46
00:01:50,330 --> 00:01:54,160
one over the standard deviation
of the maximum likelihood
47
00:01:54,160 --> 00:01:55,150
estimator.
48
00:01:55,150 --> 00:01:56,690
And so you can
compute this, right?
49
00:01:56,690 --> 00:01:59,944
So you can compute for the
maximum likelihood estimator.
50
00:01:59,944 --> 00:02:01,360
We know that the
variance is going
51
00:02:01,360 --> 00:02:09,910
to be p(1 - p)
in the Bernoulli
52
00:02:09,910 --> 00:02:11,200
statistical experiment.
53
00:02:11,200 --> 00:02:13,510
So you get this one over the
square root of this thing.
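As a quick numerical sanity check (not part of the lecture; the code is illustrative), the Bernoulli Jeffreys prior pi(p) proportional to 1 over the square root of p(1 - p) is exactly the Beta(1/2, 1/2) density, whose normalizing constant is the number pi:

```python
import math

def jeffreys_density_bernoulli(p):
    """Jeffreys prior for Bernoulli(p): pi(p) proportional to
    sqrt(I(p)) = 1 / sqrt(p * (1 - p)).
    Up to normalization this is the Beta(1/2, 1/2) density."""
    return 1.0 / math.sqrt(p * (1.0 - p))

# The normalizing constant is math.pi (the Beta(1/2, 1/2) normalizer),
# which we can check with a crude midpoint-rule integration.
n = 100000
total = sum(jeffreys_density_bernoulli((i + 0.5) / n) / n for i in range(n))
print(total)  # close to pi, about 3.14
```

The midpoint rule avoids the (integrable) singularities at 0 and 1, so the crude sum gets within a few thousandths of pi.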
54
00:02:13,510 --> 00:02:16,720
And for example, in
the Gaussian setting,
55
00:02:16,720 --> 00:02:19,880
you actually have the
Fisher information,
56
00:02:19,880 --> 00:02:22,000
even in the multi-variate
one, is actually
57
00:02:22,000 --> 00:02:24,752
going to be something
like the identity matrix.
58
00:02:24,752 --> 00:02:25,960
So this is proportional to 1.
59
00:02:25,960 --> 00:02:29,530
It's the improper prior that
you get, in this case, OK?
60
00:02:29,530 --> 00:02:31,690
Meaning that, for
the Gaussian setting,
61
00:02:31,690 --> 00:02:33,880
no place where you
center your Gaussian
62
00:02:33,880 --> 00:02:36,020
is actually better
than any other.
63
00:02:36,020 --> 00:02:36,520
All right.
64
00:02:36,520 --> 00:02:40,130
So we basically
left on this slide,
65
00:02:40,130 --> 00:02:43,570
where we saw that
Jeffreys priors satisfy
66
00:02:43,570 --> 00:02:46,170
a reparametrization
invariance-- they're invariant
67
00:02:46,170 --> 00:02:49,180
by transformation of
your parameter, which
68
00:02:49,180 --> 00:02:51,920
is a desirable property.
69
00:02:51,920 --> 00:02:57,217
And the way it works: it says that, well,
if I have my prior on theta,
70
00:02:57,217 --> 00:02:59,050
and then I suddenly
decide that theta is not
71
00:02:59,050 --> 00:03:01,720
the parameter I want to use
to parameterize my problem,
72
00:03:01,720 --> 00:03:04,640
actually what I want
is phi of theta.
73
00:03:04,640 --> 00:03:07,840
So think, for example, as theta
being the mean of a Gaussian,
74
00:03:07,840 --> 00:03:11,140
and phi of theta as
being the mean cubed.
75
00:03:11,140 --> 00:03:11,920
OK?
76
00:03:11,920 --> 00:03:15,520
This is a one-to-one
map phi, right?
77
00:03:15,520 --> 00:03:20,185
So for example, if I want to
go from theta to theta cubed,
78
00:03:20,185 --> 00:03:22,840
and now I decide that this is
the actual parameter that I
79
00:03:22,840 --> 00:03:26,200
want, well, then it means
that, on this parameter,
80
00:03:26,200 --> 00:03:29,110
my original prior is going
to induce another prior.
81
00:03:29,110 --> 00:03:30,970
And here, it says,
well, this prior
82
00:03:30,970 --> 00:03:33,200
is actually also Jeffreys prior.
83
00:03:33,200 --> 00:03:33,700
OK?
84
00:03:33,700 --> 00:03:35,450
So it's essentially
telling you that,
85
00:03:35,450 --> 00:03:38,410
for this new parametrization,
if you take Jeffreys prior, then
86
00:03:38,410 --> 00:03:41,201
you actually go back to having
exactly something that's
87
00:03:41,201 --> 00:03:43,450
of the form square root
of determinant of the Fisher
88
00:03:43,450 --> 00:03:45,116
information, but this
thing with respect
89
00:03:45,116 --> 00:03:47,810
to your new
parametrization. All right.
90
00:03:47,810 --> 00:03:50,360
And so why is this true?
91
00:03:50,360 --> 00:03:53,440
Well, it's just this
change of variable theorem.
92
00:03:53,440 --> 00:03:58,330
So it's essentially telling
you that, if you call--
93
00:03:58,330 --> 00:04:08,850
let's call p-- well, let's call
pi tilde of eta the prior over eta.
94
00:04:08,850 --> 00:04:11,130
And you have pi of
theta as the prior
95
00:04:11,130 --> 00:04:18,040
over theta. Then, since eta
is of the form phi of theta,
96
00:04:18,040 --> 00:04:26,620
just by change of variable,
so that's essentially
97
00:04:26,620 --> 00:04:33,070
a probability result. It
says that pi tilde of eta
98
00:04:33,070 --> 00:04:42,790
is equal to pi of
theta times d
99
00:04:42,790 --> 00:04:48,860
theta over d eta and--
100
00:04:55,706 --> 00:04:57,189
sorry, is that the one?
101
00:04:57,189 --> 00:04:58,730
Sorry, I'm going to
have to write it,
102
00:04:58,730 --> 00:04:59,938
because I always forget this.
103
00:05:05,209 --> 00:05:07,380
So if I take a function--
104
00:05:14,380 --> 00:05:14,960
OK.
105
00:05:14,960 --> 00:05:16,400
So what I want is to check.
106
00:05:38,340 --> 00:05:41,870
OK, so I want a function
h of eta that I can use here.
107
00:05:41,870 --> 00:05:48,480
And what I know is that
this is h of phi of theta.
108
00:05:48,480 --> 00:05:48,980
All right?
109
00:05:48,980 --> 00:05:51,810
So sorry, eta is
phi of theta, right?
110
00:05:51,810 --> 00:05:53,471
Yeah.
111
00:05:53,471 --> 00:05:54,970
So what I'm going
to do is I'm going
112
00:05:54,970 --> 00:06:09,130
to do the change of variable,
theta is phi inverse of eta.
113
00:06:09,130 --> 00:06:14,120
So eta is phi of
theta, which means
114
00:06:14,120 --> 00:06:20,540
that d eta is equal to d--
115
00:06:20,540 --> 00:06:26,020
well, to phi prime
of theta d theta.
116
00:06:26,020 --> 00:06:31,464
So when I'm going to write this,
I'm going to get integral of h.
117
00:06:31,464 --> 00:06:33,470
Actually, let me
write this, as I
118
00:06:33,470 --> 00:06:36,980
am more comfortable
writing this as e
119
00:06:36,980 --> 00:06:40,031
with respect to eta of h of eta.
120
00:06:40,031 --> 00:06:40,530
OK?
121
00:06:40,530 --> 00:06:44,580
So that's just with eta
drawn from the prior.
122
00:06:44,580 --> 00:06:47,670
And I want to write this as
the integral of h of eta times
123
00:06:47,670 --> 00:06:49,080
some function, right?
124
00:06:49,080 --> 00:06:58,580
So this is the
integral of h of phi
125
00:06:58,580 --> 00:07:03,556
of theta pi of theta d theta.
126
00:07:03,556 --> 00:07:06,150
Now, I'm going to do
my change of variable.
127
00:07:06,150 --> 00:07:09,290
So this is going to be
the integral of h of eta.
128
00:07:09,290 --> 00:07:16,420
And then pi of phi of--
129
00:07:16,420 --> 00:07:20,290
so theta is phi inverse of eta.
130
00:07:20,290 --> 00:07:27,390
And then d eta is phi
prime of theta d theta, OK?
131
00:07:27,390 --> 00:07:30,210
And so what is pi at phi inverse of eta?
132
00:07:30,210 --> 00:07:32,120
So this thing is proportional.
133
00:07:32,120 --> 00:07:33,750
So we're in, say,
dimension 1, so it's
134
00:07:33,750 --> 00:07:38,420
proportional of square root
of the Fisher information.
135
00:07:38,420 --> 00:07:39,920
And the Fisher
information, we know,
136
00:07:39,920 --> 00:07:44,630
is the expectation of the square
of the derivative of the log
137
00:07:44,630 --> 00:07:45,770
likelihood, right?
138
00:07:45,770 --> 00:07:48,740
So this is square root
of the expectation
139
00:07:48,740 --> 00:08:03,650
of d over d theta of log of--
140
00:08:03,650 --> 00:08:06,010
well, now, I need the density.
141
00:08:06,010 --> 00:08:10,050
Well, let's just
call it l of theta.
142
00:08:10,050 --> 00:08:17,030
And I want this to be taken
at phi inverse of eta squared.
143
00:08:19,980 --> 00:08:21,480
And then what I pick up is the--
144
00:08:23,771 --> 00:08:25,770
so I'm going to put
everything under the square.
145
00:08:25,770 --> 00:08:31,460
So I get phi prime of
theta squared d theta.
146
00:08:31,460 --> 00:08:33,260
OK?
147
00:08:33,260 --> 00:08:35,090
So now, I have the
expectation of a square.
148
00:08:35,090 --> 00:08:38,539
This does not depend, so this
is-- sorry, this is l of theta.
149
00:08:38,539 --> 00:08:42,307
This is the expectation of
l of theta of an x, right?
150
00:08:42,307 --> 00:08:44,390
That's for some variable,
and the expectation here
151
00:08:44,390 --> 00:08:45,710
is with respect to x.
152
00:08:45,710 --> 00:08:49,824
That's just the definition
of the Fisher information.
153
00:08:49,824 --> 00:08:52,240
So now I'm going to squeeze
this guy into the expectation.
154
00:08:52,240 --> 00:08:53,260
It does not depend on x.
155
00:08:53,260 --> 00:08:55,412
It just acts as a constant.
156
00:08:55,412 --> 00:08:57,370
And so what I have now
is that this is actually
157
00:08:57,370 --> 00:08:59,760
proportional to
the integral of h
158
00:08:59,760 --> 00:09:05,320
eta times the square root of
the expectation with respect
159
00:09:05,320 --> 00:09:06,600
to x of what?
160
00:09:06,600 --> 00:09:10,540
Well, here, I have d over
d theta of log of l of theta.
161
00:09:10,540 --> 00:09:15,620
And here, this guy is really
d eta over d theta, right?
162
00:09:19,524 --> 00:09:21,480
Agree?
163
00:09:21,480 --> 00:09:24,720
So now, what I'm really left
with-- so I get d over d theta
164
00:09:24,720 --> 00:09:25,520
times d--
165
00:09:25,520 --> 00:09:28,047
sorry, times d theta over d eta.
166
00:09:42,980 --> 00:09:51,396
so that's just d over
d eta of log of l of eta at x.
167
00:10:00,198 --> 00:10:04,370
And then this guy is now
becoming d eta, right?
168
00:10:04,370 --> 00:10:06,590
OK, so this was a mess.
169
00:10:09,710 --> 00:10:12,320
This is a complete mess, because
I actually want to use phi.
170
00:10:12,320 --> 00:10:14,150
I should not actually
introduce phi at all.
171
00:10:14,150 --> 00:10:21,930
I should just talk about d eta
over d theta type of things.
172
00:10:21,930 --> 00:10:24,370
And then that would actually
make my life so much easier.
173
00:10:24,370 --> 00:10:25,002
OK.
174
00:10:25,002 --> 00:10:26,710
I'm not going to spend
more time on this.
175
00:10:26,710 --> 00:10:28,210
This is really just
the idea, right?
176
00:10:28,210 --> 00:10:30,170
You have square root
of a square in there.
177
00:10:30,170 --> 00:10:31,480
And then, when you do
your change of variable,
178
00:10:31,480 --> 00:10:32,710
you just pick up a square.
179
00:10:32,710 --> 00:10:35,750
You just pick up
something in here.
180
00:10:35,750 --> 00:10:38,110
And so you just move
this thing in there.
181
00:10:38,110 --> 00:10:38,920
You get a square.
182
00:10:38,920 --> 00:10:40,400
It goes inside the square.
183
00:10:40,400 --> 00:10:42,280
And so your derivative
of the log likelihood
184
00:10:42,280 --> 00:10:44,488
with respect to theta becomes
a derivative of the log
185
00:10:44,488 --> 00:10:46,240
likelihood with respect to eta.
186
00:10:46,240 --> 00:10:48,850
And that's the only thing
that's happening here.
187
00:10:48,850 --> 00:10:52,478
I'm just being super
sloppy, for some reason.
188
00:10:52,478 --> 00:10:54,612
OK.
189
00:10:54,612 --> 00:10:56,570
And then, of course, now,
what you're left with
190
00:10:56,570 --> 00:10:59,442
is that this is really
just proportional.
191
00:10:59,442 --> 00:11:00,650
Well, this is actually equal.
192
00:11:00,650 --> 00:11:02,150
Everything is
proportional, but this
193
00:11:02,150 --> 00:11:05,090
is equal to the Fisher
information tilde with respect
194
00:11:05,090 --> 00:11:07,050
to eta now.
195
00:11:07,050 --> 00:11:07,550
Right?
196
00:11:07,550 --> 00:11:09,630
You're doing this
with respect to eta.
197
00:11:09,630 --> 00:11:17,010
And so that's your new
prior with respect to eta.
198
00:11:17,010 --> 00:11:17,510
OK.
199
00:11:17,510 --> 00:11:21,800
So one thing that
you want to do,
200
00:11:21,800 --> 00:11:23,870
once you have-- so
remember, when you actually
201
00:11:23,870 --> 00:11:26,600
compute your
posterior, right-- rather
202
00:11:26,600 --> 00:11:29,330
than having-- so you
start with a prior,
203
00:11:29,330 --> 00:11:32,090
and you have some observations,
let's say, x1 to xn.
204
00:11:36,190 --> 00:11:41,540
When you do Bayesian
inference, rather than spitting
205
00:11:41,540 --> 00:11:45,450
out just some theta hat, which
is an estimator for theta,
206
00:11:45,450 --> 00:11:48,565
you actually spit out an
entire posterior distribution--
207
00:11:53,220 --> 00:11:57,040
pi of theta, given x1 xn.
208
00:11:57,040 --> 00:11:57,540
OK?
209
00:11:57,540 --> 00:11:59,460
So there's an
entire distribution
210
00:11:59,460 --> 00:12:01,110
on the parameter theta.
211
00:12:01,110 --> 00:12:04,290
And you can actually use this
to perform inference, rather
212
00:12:04,290 --> 00:12:06,150
than just having one number.
213
00:12:06,150 --> 00:12:06,950
OK?
214
00:12:06,950 --> 00:12:09,300
And so you could actually
build confidence regions
215
00:12:09,300 --> 00:12:10,540
from this thing.
216
00:12:10,540 --> 00:12:11,040
OK.
217
00:12:11,040 --> 00:12:16,600
And so a Bayesian
confidence interval--
218
00:12:16,600 --> 00:12:21,480
so if your set of parameters
is included in the real line,
219
00:12:21,480 --> 00:12:23,880
then you can actually--
it's not even guaranteed
220
00:12:23,880 --> 00:12:25,740
to be an interval.
221
00:12:25,740 --> 00:12:33,350
So let me call it a confidence
region, so a Bayesian
222
00:12:33,350 --> 00:12:40,090
confidence region, OK?
223
00:12:40,090 --> 00:12:43,360
So it's just a random subset.
224
00:12:43,360 --> 00:12:47,810
So let's call it r,
included in capital theta.
225
00:12:47,810 --> 00:12:49,750
And when you have the
deterministic one,
226
00:12:49,750 --> 00:12:53,650
we had a definition, which was
with respect to the randomness
227
00:12:53,650 --> 00:12:54,880
of the data, right?
228
00:12:54,880 --> 00:12:57,850
That's how you actually
had a random subset.
229
00:12:57,850 --> 00:12:59,740
So you had a random
confidence interval.
230
00:12:59,740 --> 00:13:02,200
Here, it's actually
conditioned on the data,
231
00:13:02,200 --> 00:13:03,640
but with respect
to the randomness
232
00:13:03,640 --> 00:13:06,531
that you actually get from
your posterior distribution.
233
00:13:06,531 --> 00:13:07,030
OK?
234
00:13:07,030 --> 00:13:16,760
So such that the
probability that your theta
235
00:13:16,760 --> 00:13:18,350
belongs to this
confidence region,
236
00:13:18,350 --> 00:13:24,500
given x1 xn is, say,
at least 1 minus alpha.
237
00:13:24,500 --> 00:13:27,040
Let's just take it
equal to 1 minus alpha.
238
00:13:27,040 --> 00:13:34,530
OK so that's a confidence
region at level 1 minus alpha.
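As an illustrative sketch (the numbers are hypothetical, not from the lecture): with n = 20 Bernoulli observations, 7 successes, and the Beta(1/2, 1/2) Jeffreys prior, the posterior is Beta(7.5, 13.5), and a Bayesian confidence region at level 1 minus alpha can be read off by cutting alpha/2 of the posterior mass from each tail:

```python
# Posterior Beta(7.5, 13.5) from 7 successes in 20 trials under the
# Jeffreys Beta(1/2, 1/2) prior (hypothetical example numbers).
a, b = 7 + 0.5, 13 + 0.5

def beta_pdf(p, a, b):
    # Unnormalized Beta(a, b) density; the grid sum below normalizes it.
    return p ** (a - 1) * (1 - p) ** (b - 1)

m = 10000
grid = [(i + 0.5) / m for i in range(m)]
weights = [beta_pdf(p, a, b) for p in grid]
total = sum(weights)
probs = [w / total for w in weights]

def quantile(q):
    # Smallest grid point with cumulative posterior mass at least q.
    acc = 0.0
    for p, w in zip(grid, probs):
        acc += w
        if acc >= q:
            return p
    return grid[-1]

# Equal-tailed 95% region: 95% posterior probability that theta is inside.
lo, hi = quantile(0.025), quantile(0.975)
print(lo, hi)  # roughly (0.17, 0.57)
```

The probability statement here is with respect to the posterior, conditioned on the data, which is exactly the definition above.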
239
00:13:34,530 --> 00:13:36,240
OK, so that's one way.
240
00:13:36,240 --> 00:13:38,770
So why would you actually--
241
00:13:38,770 --> 00:13:41,390
when I actually implement
Bayesian inference,
242
00:13:41,390 --> 00:13:44,480
I'm actually spitting out
that entire distribution.
243
00:13:44,480 --> 00:13:47,540
I need to summarize this thing
to communicate it, right?
244
00:13:47,540 --> 00:13:49,730
I cannot just say this
is this entire function.
245
00:13:49,730 --> 00:13:51,230
I want to know where
are the regions
246
00:13:51,230 --> 00:13:54,344
of high probability, where my
parameter is supposed to be?
247
00:13:54,344 --> 00:13:56,510
And so here, when I have
this thing, what I actually
248
00:13:56,510 --> 00:13:58,010
want to have is
something that says,
249
00:13:58,010 --> 00:14:00,200
well, I want to
summarize this thing
250
00:14:00,200 --> 00:14:03,680
into some subset of the
real line, in which I'm
251
00:14:03,680 --> 00:14:08,120
sure that the area under the
curve, here, of my posterior
252
00:14:08,120 --> 00:14:11,734
is actually 1 minus alpha.
253
00:14:11,734 --> 00:14:13,400
And there's many ways
to do this, right?
254
00:14:16,790 --> 00:14:22,450
So one way to do this is
to look at level sets.
255
00:14:27,870 --> 00:14:29,550
And so rather than
actually-- so let's
256
00:14:29,550 --> 00:14:32,220
say my posterior
looks like this.
257
00:14:32,220 --> 00:14:35,760
I know, for example, if I
have a Gaussian distribution,
258
00:14:35,760 --> 00:14:38,230
I can actually take my posterior
to be-- my posterior is
259
00:14:38,230 --> 00:14:39,480
actually going to be Gaussian.
260
00:14:43,060 --> 00:14:50,760
And what I can do is to try
to cut it here on the y-axis
261
00:14:50,760 --> 00:14:54,910
so that now, the area under
the curve, when I cut here,
262
00:14:54,910 --> 00:14:59,430
is actually 1 minus alpha.
263
00:14:59,430 --> 00:15:02,080
OK, so I have some
threshold tau.
264
00:15:02,080 --> 00:15:05,490
If tau goes to plus
infinity, then I'm
265
00:15:05,490 --> 00:15:07,380
going to have that this
area under the curve
266
00:15:07,380 --> 00:15:10,380
here is going to--
267
00:15:18,012 --> 00:15:19,920
AUDIENCE: [INAUDIBLE]
268
00:15:19,920 --> 00:15:21,786
PHILIPPE RIGOLLET: Well, no.
269
00:15:21,786 --> 00:15:23,160
So the area under
the curve, when
270
00:15:23,160 --> 00:15:24,810
tau is going to
plus infinity, think
271
00:15:24,810 --> 00:15:27,892
of the case when
tau is just right here.
272
00:15:27,892 --> 00:15:29,280
AUDIENCE: [INAUDIBLE]
273
00:15:29,280 --> 00:15:32,150
PHILIPPE RIGOLLET: So this is
actually going to 0, right?
274
00:15:32,150 --> 00:15:33,530
And so I start here.
275
00:15:33,530 --> 00:15:36,290
And then I start going down
and down and down and down,
276
00:15:36,290 --> 00:15:39,440
until I actually get something
which is equal to 1 minus
277
00:15:39,440 --> 00:15:40,160
alpha.
278
00:15:40,160 --> 00:15:44,000
And if tau is going down to 0,
then my area under the curve
279
00:15:44,000 --> 00:15:44,750
is going to--
280
00:15:48,240 --> 00:15:51,604
if tau is here, I'm
cutting nowhere.
281
00:15:51,604 --> 00:15:52,770
And so I'm getting 1, right?
282
00:15:56,160 --> 00:15:56,980
Agree?
283
00:15:56,980 --> 00:16:00,540
Think of, when tau
is very close to 0,
284
00:16:00,540 --> 00:16:02,876
I'm cutting
very far down here.
285
00:16:02,876 --> 00:16:04,750
And so I'm getting some
area under the curve,
286
00:16:04,750 --> 00:16:06,000
which is almost everything.
287
00:16:06,000 --> 00:16:08,100
And so it's going to 1--
as tau goes down to 0.
288
00:16:08,100 --> 00:16:09,960
Yeah?
289
00:16:09,960 --> 00:16:12,882
AUDIENCE: Does this only
work for [INAUDIBLE]
290
00:16:12,882 --> 00:16:14,340
PHILIPPE RIGOLLET:
No, it does not.
291
00:16:14,340 --> 00:16:17,160
I mean-- so this is a picture.
292
00:16:17,160 --> 00:16:20,277
So those two things work
for all of them, right?
293
00:16:20,277 --> 00:16:22,110
But when you have a
bimodal posterior, actually,
294
00:16:22,110 --> 00:16:23,526
this is actually
when things start
295
00:16:23,526 --> 00:16:24,990
to become interesting, right?
296
00:16:24,990 --> 00:16:30,600
So when we built a frequentist
confidence interval,
297
00:16:30,600 --> 00:16:34,590
it was always of the form x
bar plus or minus something.
298
00:16:34,590 --> 00:16:36,510
But now, if I start to
have a posterior that
299
00:16:36,510 --> 00:16:40,230
looks like this, what I'm
going to start cutting off,
300
00:16:40,230 --> 00:16:41,370
I'm going to have two--
301
00:16:41,370 --> 00:16:44,550
I mean, my confidence
region is going
302
00:16:44,550 --> 00:16:47,740
to be the union of
those two things, right?
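The level-set construction just described can be sketched numerically (the bimodal posterior here is an assumed example, a mixture of two normals, not from the lecture): lower the threshold tau until the mass above it is 1 minus alpha, and for a two-bump density the region comes out as a union of two intervals, one around each mode.

```python
import math

def posterior(x):
    # Hypothetical posterior: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
    def phi(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return 0.5 * phi(x, -2.0, 0.5) + 0.5 * phi(x, 2.0, 0.5)

alpha = 0.05
n_cells = 12000
h = 12.0 / n_cells
cells = [-6.0 + (i + 0.5) * h for i in range(n_cells)]

# Keep the highest-density cells until they hold 1 - alpha of the mass;
# the density of the last cell kept plays the role of the cutoff tau.
mass, region = 0.0, []
for x in sorted(cells, key=posterior, reverse=True):
    region.append(x)
    mass += posterior(x) * h
    if mass >= 1 - alpha:
        break
region.sort()

# The kept cells split into runs; a jump bigger than 2h marks a break
# between intervals of the highest-posterior-density region.
gaps = [b - a for a, b in zip(region, region[1:]) if b - a > 2 * h]
print(len(gaps) + 1)  # number of intervals in the region: 2 here
```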
303
00:16:47,740 --> 00:16:50,700
And it really reflects
the fact that there
304
00:16:50,700 --> 00:16:51,820
is this bimodal thing.
305
00:16:51,820 --> 00:16:53,486
It's going to say,
well, with high probability,
306
00:16:53,486 --> 00:16:56,840
I'm actually going to
be either here or here.
307
00:16:56,840 --> 00:16:59,840
Now, the meaning here of a
Bayesian confidence region
308
00:16:59,840 --> 00:17:02,570
and the confidence interval are
completely distinct notions,
309
00:17:02,570 --> 00:17:03,260
right?
310
00:17:03,260 --> 00:17:06,140
And I'm going to work
out an example with you
311
00:17:06,140 --> 00:17:08,673
so that we can actually
see that sometimes--
312
00:17:08,673 --> 00:17:10,089
I mean, both of
them, actually you
313
00:17:10,089 --> 00:17:11,839
can come up with
some crazy paradoxes.
314
00:17:11,839 --> 00:17:13,609
So since we don't
have that much time,
315
00:17:13,609 --> 00:17:17,339
I will actually talk to you
about why, in some instances,
316
00:17:17,339 --> 00:17:19,819
it's actually a good idea to
think of Bayesian confidence
317
00:17:19,819 --> 00:17:22,369
intervals rather than
frequentist ones.
318
00:17:22,369 --> 00:17:25,609
So before we go into
more details about what
319
00:17:25,609 --> 00:17:27,440
those Bayesian
confidence intervals are,
320
00:17:27,440 --> 00:17:29,570
let's remind
ourselves what does it
321
00:17:29,570 --> 00:17:33,110
mean to have a frequentist
confidence interval?
322
00:17:33,110 --> 00:17:33,610
Right?
323
00:17:46,460 --> 00:17:46,960
OK.
324
00:17:46,960 --> 00:17:49,690
So when I have a frequentist
confidence interval,
325
00:17:49,690 --> 00:17:59,290
let's say something like x bar n
minus 1.96 sigma over root n
326
00:17:59,290 --> 00:18:06,136
and x bar n plus 1.96
sigma over root n,
327
00:18:06,136 --> 00:18:07,510
so that's the
confidence interval
328
00:18:07,510 --> 00:18:10,720
that you get for the
mean of some Gaussian
329
00:18:10,720 --> 00:18:16,390
with known variance
equal to sigma squared, OK.
330
00:18:16,390 --> 00:18:18,460
So what we know is that
the meaning of this
331
00:18:18,460 --> 00:18:20,410
is the probability
that theta belongs
332
00:18:20,410 --> 00:18:25,870
to this is equal to 95%, right?
333
00:18:25,870 --> 00:18:27,340
And this, more
generally, you can
334
00:18:27,340 --> 00:18:29,620
think of being q alpha over 2.
335
00:18:29,620 --> 00:18:33,040
And what you're going to get
is 1 minus alpha here, OK?
336
00:18:33,040 --> 00:18:34,280
So what does it mean here?
337
00:18:34,280 --> 00:18:37,480
Well, it looks very much
like what we have here,
338
00:18:37,480 --> 00:18:39,970
except that we're not
conditioning on x1 xn.
339
00:18:39,970 --> 00:18:40,720
And we should not.
340
00:18:40,720 --> 00:18:43,830
Because there was a question
like that in the midterm--
341
00:18:43,830 --> 00:18:47,590
if I condition on x1 xn, this
probability is either 0 or 1.
342
00:18:47,590 --> 00:18:48,610
OK?
343
00:18:48,610 --> 00:18:50,170
Because once I
condition-- so here,
344
00:18:50,170 --> 00:18:52,170
this probability, actually,
here is with respect
345
00:18:52,170 --> 00:18:55,010
to the randomness in x1 xn.
346
00:18:55,010 --> 00:18:56,040
So if I condition--
347
00:18:58,860 --> 00:19:04,890
so let's build this thing,
r freq, for frequentist.
348
00:19:07,830 --> 00:19:11,930
Well, given x1 xn--
349
00:19:11,930 --> 00:19:13,940
and actually, I don't
need to know x1 xn really.
350
00:19:13,940 --> 00:19:16,420
What I need to know
is what xn bar is.
351
00:19:16,420 --> 00:19:18,140
Well, this thing now is what?
352
00:19:18,140 --> 00:19:22,200
It's 1, if theta is
in r, and it's 0,
353
00:19:22,200 --> 00:19:27,110
if theta is not in r, right?
354
00:19:27,110 --> 00:19:28,010
That's all there is.
355
00:19:28,010 --> 00:19:29,900
This is a deterministic
confidence interval,
356
00:19:29,900 --> 00:19:32,360
once I condition x1 xn.
357
00:19:32,360 --> 00:19:33,270
So I have a number.
358
00:19:33,270 --> 00:19:35,720
The average is maybe 3.
359
00:19:35,720 --> 00:19:36,790
And so I get 3.
360
00:19:36,790 --> 00:19:41,900
Either theta is between 3
minus 0.5 and 3 plus 0.5,
361
00:19:41,900 --> 00:19:42,840
or it's not.
362
00:19:42,840 --> 00:19:44,000
And so there's basically--
363
00:19:44,000 --> 00:19:45,470
I mean, I write
it as probability,
364
00:19:45,470 --> 00:19:47,303
but it's really not a
probabilistic statement.
365
00:19:47,303 --> 00:19:49,160
It's either it's true or not.
366
00:19:49,160 --> 00:19:50,240
Agreed?
367
00:19:50,240 --> 00:19:52,580
So what does it mean to have
a frequentist confidence
368
00:19:52,580 --> 00:19:53,550
interval?
369
00:19:53,550 --> 00:19:55,270
It means that if I were--
370
00:19:55,270 --> 00:19:58,660
and here is where the word
frequentist comes from--
371
00:19:58,660 --> 00:20:02,840
it says that if I repeat this
experiment over and over,
372
00:20:02,840 --> 00:20:06,700
meaning that on Monday, I
collect a sample of size n,
373
00:20:06,700 --> 00:20:09,260
and I build a
confidence interval,
374
00:20:09,260 --> 00:20:12,260
and then on Tuesday, I collect
another sample of size n,
375
00:20:12,260 --> 00:20:13,890
and I build a
confidence interval,
376
00:20:13,890 --> 00:20:17,000
and on Wednesday, I do this
again and again, what's going
377
00:20:17,000 --> 00:20:18,510
to happen is the following.
378
00:20:18,510 --> 00:20:21,530
I'm going to have my true
theta that lives here.
379
00:20:21,530 --> 00:20:23,900
And then on Monday, this
is the confidence interval
380
00:20:23,900 --> 00:20:25,470
that I build.
381
00:20:25,470 --> 00:20:28,802
OK, so this is the real line.
382
00:20:28,802 --> 00:20:31,260
The true theta is here, and
this is the confidence interval
383
00:20:31,260 --> 00:20:32,300
I build on Monday.
384
00:20:32,300 --> 00:20:32,800
All right?
385
00:20:32,800 --> 00:20:37,530
So x bar was here, and this
is my confidence interval.
386
00:20:37,530 --> 00:20:41,540
On Tuesday, I build this
confidence interval maybe.
387
00:20:41,540 --> 00:20:44,640
x bar was closer to
theta, but smaller.
388
00:20:44,640 --> 00:20:49,820
But then on Wednesday, I build
this confidence interval.
389
00:20:49,820 --> 00:20:50,880
I'm not here.
390
00:20:50,880 --> 00:20:51,920
It's not in there.
391
00:20:51,920 --> 00:20:53,681
And that's this case.
392
00:20:53,681 --> 00:20:54,180
Right?
393
00:20:54,180 --> 00:20:56,100
It happens that it's
just not in there.
394
00:20:56,100 --> 00:20:57,930
And then on Thursday,
I build another one.
395
00:20:57,930 --> 00:21:01,300
I almost miss it, but
I'm in there, et cetera.
396
00:21:01,300 --> 00:21:04,430
Maybe here-- here, I miss again.
397
00:21:04,430 --> 00:21:07,490
And so what it means to have a
confidence interval-- so what
398
00:21:07,490 --> 00:21:12,131
does it mean to have a
confidence interval at 95%?
399
00:21:12,131 --> 00:21:15,610
AUDIENCE: [INAUDIBLE]
400
00:21:15,610 --> 00:21:18,150
PHILIPPE RIGOLLET: Yeah, so
it means that, if I repeat this,
401
00:21:18,150 --> 00:21:19,800
the frequency of times--
402
00:21:19,800 --> 00:21:21,720
hence, the word
frequentist-- at which
403
00:21:21,720 --> 00:21:24,150
I'm actually going
to overlap that,
404
00:21:24,150 --> 00:21:26,910
I'm actually going to
contain theta, should be 95%.
405
00:21:26,910 --> 00:21:28,890
That's what frequentist means.
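That repeated-experiment frequency can be checked directly by simulation (an illustrative sketch with made-up values for theta, sigma, and n; not from the lecture): each "day" we draw a fresh sample, build the known-variance Gaussian interval, and record whether it caught the true theta.

```python
import math
import random

random.seed(0)                     # hypothetical example values below
theta, sigma, n = 3.0, 1.0, 50
half = 1.96 * sigma / math.sqrt(n)  # half-width of the 95% interval
trials, covered = 2000, 0
for _ in range(trials):
    # One "day": a sample of size n and its interval xbar +/- half.
    xbar = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    if xbar - half <= theta <= xbar + half:
        covered += 1
print(covered / trials)  # the frequency of coverage, close to 0.95
```

On any single day the interval either contains theta or it doesn't; only the long-run frequency is pinned at 95%.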
406
00:21:28,890 --> 00:21:31,740
So it's just a matter
of trusting that.
407
00:21:31,740 --> 00:21:35,690
So on one given thing, one
given realization of your data,
408
00:21:35,690 --> 00:21:36,970
it's not telling you anything.
409
00:21:36,970 --> 00:21:38,460
[INAUDIBLE] it's there or not.
410
00:21:38,460 --> 00:21:42,530
So it's not really
something that's actually
411
00:21:42,530 --> 00:21:46,430
something that assesses the
confidence of your decision,
412
00:21:46,430 --> 00:21:48,230
such as whether theta is in there or not.
413
00:21:48,230 --> 00:21:50,360
It's something that
assesses the confidence
414
00:21:50,360 --> 00:21:52,410
you have in the method
that you're using.
415
00:21:52,410 --> 00:21:54,170
If you were to repeat
it over and again,
416
00:21:54,170 --> 00:21:56,470
it'd be the same thing.
417
00:21:56,470 --> 00:21:58,850
It would be 95% of the
time correct, right?
418
00:21:58,850 --> 00:22:02,570
So for example, we know
that we could build a test.
419
00:22:02,570 --> 00:22:04,940
So it's pretty clear
that you can actually
420
00:22:04,940 --> 00:22:09,020
build a test for whether
theta is equal to theta naught
421
00:22:09,020 --> 00:22:10,705
or not equal to
theta naught, by just
422
00:22:10,705 --> 00:22:13,080
checking whether theta naught
is in a confidence interval
423
00:22:13,080 --> 00:22:13,780
or not.
424
00:22:13,780 --> 00:22:15,530
And what it means is
that, if you actually
425
00:22:15,530 --> 00:22:21,170
are doing those tests at 5%,
that means that 5% of the time,
426
00:22:21,170 --> 00:22:23,440
if you do this over and
again, 5% of the time
427
00:22:23,440 --> 00:22:24,610
you're going to be wrong.
428
00:22:24,610 --> 00:22:27,640
I mentioned my wife
does market research.
429
00:22:27,640 --> 00:22:31,930
And she does maybe, I don't
know, 100,000 tests a year.
430
00:22:31,930 --> 00:22:34,210
And if they do
all of them at 1%,
431
00:22:34,210 --> 00:22:37,550
then it means that 1% of the
time, which is a lot of time,
432
00:22:37,550 --> 00:22:38,050
right?
433
00:22:38,050 --> 00:22:40,840
When you do 100,000 a
year, it's 1,000 of them
434
00:22:40,840 --> 00:22:41,755
are actually wrong.
435
00:22:41,755 --> 00:22:44,611
OK, I mean, she's
actually hedging
436
00:22:44,611 --> 00:22:47,110
against the fact that 1% of
them are going to be wrong.
437
00:22:47,110 --> 00:22:49,109
That's 1,000 of them that
are going to be wrong.
438
00:22:49,109 --> 00:22:52,890
Just like, if you do this
100,000 times at 95%,
439
00:22:52,890 --> 00:22:54,910
5,000 of those guys
are actually not going
440
00:22:54,910 --> 00:22:56,360
to be the correct ones.
441
00:22:56,360 --> 00:22:56,860
OK?
442
00:22:56,860 --> 00:22:58,600
So I mean, it's kind of scary.
443
00:22:58,600 --> 00:23:01,300
But that's the way it is.
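The multiple-testing arithmetic is easy to check by simulation (a minimal sketch, assuming every null hypothesis is true and the tests are independent):

```python
import random

random.seed(0)

n_tests = 100_000
alpha = 0.01

# A test on a true null rejects (is "wrong") with probability alpha
false_rejections = sum(random.random() < alpha for _ in range(n_tests))

expected = alpha * n_tests  # 1,000 expected false rejections
```

At level 5% instead, the same count would hover around 5,000, as in the lecture.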
444
00:23:01,300 --> 00:23:03,730
So that's what the frequentist
interpretation of this is.
445
00:23:03,730 --> 00:23:07,720
Now, as I mentioned, when we
started this Bayesian chapter,
446
00:23:07,720 --> 00:23:10,930
I said, Bayesian
statistics converge to--
447
00:23:10,930 --> 00:23:14,800
I mean, Bayesian decisions
and Bayesian methods converge
448
00:23:14,800 --> 00:23:16,510
to frequentist methods.
449
00:23:16,510 --> 00:23:18,590
When the sample size
is large enough,
450
00:23:18,590 --> 00:23:20,610
they lead to the same decisions.
451
00:23:20,610 --> 00:23:22,930
And in general, they
need not be the same,
452
00:23:22,930 --> 00:23:24,970
but they tend to
actually, when the sample
453
00:23:24,970 --> 00:23:27,830
size is large enough, to
have the same behavior.
454
00:23:27,830 --> 00:23:30,850
Think about, for
example, the posterior
455
00:23:30,850 --> 00:23:34,450
that you have
in the Gaussian case, right?
456
00:23:34,450 --> 00:23:36,420
We said that, in
the Gaussian case,
457
00:23:36,420 --> 00:23:38,020
what you're going
to see is that it's
458
00:23:38,020 --> 00:23:40,240
as if you had an extra
observation which
459
00:23:40,240 --> 00:23:43,230
was essentially
given by your prior.
460
00:23:43,230 --> 00:23:44,570
OK?
461
00:23:44,570 --> 00:23:50,830
And now, what's going to happen
is that, when this is just one
462
00:23:50,830 --> 00:23:53,470
observation among n
plus 1, it's really
463
00:23:53,470 --> 00:23:55,720
going to be totally
drowned out, and you
464
00:23:55,720 --> 00:23:58,390
won't see it when the
sample size grows larger.
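For the Gaussian case, the posterior mean under a conjugate N(mu0, tau2) prior makes the "one extra observation" effect explicit; when tau2 equals sigma2, the prior counts as a single data point and gets drowned out as n grows. A sketch with made-up numbers:

```python
def posterior_mean(xbar, n, sigma2, mu0, tau2):
    # N(mu0, tau2) prior on theta, X_i ~ N(theta, sigma2) iid:
    # posterior mean = (n*xbar/sigma2 + mu0/tau2) / (n/sigma2 + 1/tau2)
    return (n * xbar / sigma2 + mu0 / tau2) / (n / sigma2 + 1 / tau2)

# With tau2 == sigma2, the prior acts like one extra observation at mu0
small = posterior_mean(xbar=10.0, n=1, sigma2=1.0, mu0=0.0, tau2=1.0)
large = posterior_mean(xbar=10.0, n=1000, sigma2=1.0, mu0=0.0, tau2=1.0)
```

With one observation the prior pulls the estimate halfway to mu0; with a thousand, it is almost invisible.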
465
00:23:58,390 --> 00:24:00,400
So Bayesian methods are
particularly useful when
466
00:24:00,400 --> 00:24:02,190
you have a small sample size.
467
00:24:02,190 --> 00:24:05,680
And when you have a small sample
size, the effect of the prior
468
00:24:05,680 --> 00:24:06,980
is going to be bigger.
469
00:24:06,980 --> 00:24:08,950
But most importantly,
you're not going
470
00:24:08,950 --> 00:24:10,810
to have to repeat this
thing over and again.
471
00:24:10,810 --> 00:24:11,830
And you're going
to have a meaning.
472
00:24:11,830 --> 00:24:13,180
You're going to have
to have something
473
00:24:13,180 --> 00:24:15,138
that has a meaning for
this particular data set
474
00:24:15,138 --> 00:24:16,150
that you have.
475
00:24:16,150 --> 00:24:19,900
When I said that the probability
that theta belongs to r--
476
00:24:19,900 --> 00:24:22,810
and here, I'm going to specify
the fact that it's a Bayesian
477
00:24:22,810 --> 00:24:24,740
confidence region,
like this one--
478
00:24:24,740 --> 00:24:27,490
this is actually
conditionally on the data
479
00:24:27,490 --> 00:24:29,490
that you've collected.
480
00:24:29,490 --> 00:24:32,110
It says, given this data, given
the points that you have--
481
00:24:32,110 --> 00:24:34,540
just put in some numbers,
if you want, in there--
482
00:24:34,540 --> 00:24:36,460
it's actually telling
you the probability
483
00:24:36,460 --> 00:24:39,430
that theta belongs to
this Bayesian thing,
484
00:24:39,430 --> 00:24:41,750
to this Bayesian
confidence region.
485
00:24:41,750 --> 00:24:44,230
Here, since I have
conditioned on x1 xn,
486
00:24:44,230 --> 00:24:46,840
this probability is really
just with respect to theta
487
00:24:46,840 --> 00:24:51,660
drawn from the prior, right?
488
00:24:51,660 --> 00:24:54,150
And so now, it has a
slightly different meaning.
489
00:24:54,150 --> 00:24:57,170
It's just telling
you that when--
490
00:24:57,170 --> 00:24:59,570
it's really making a
statement about where
491
00:24:59,570 --> 00:25:03,870
the regions of high probability
of your posterior are.
492
00:25:03,870 --> 00:25:05,050
Now, why is that useful?
493
00:25:05,050 --> 00:25:11,600
Well, there's actually
an interesting story that
494
00:25:11,600 --> 00:25:13,980
goes behind Bayesian methods.
495
00:25:13,980 --> 00:25:17,240
Does anybody know the story of
the USS, I think it's Scorpion?
496
00:25:17,240 --> 00:25:18,610
Do you know the story?
497
00:25:18,610 --> 00:25:22,770
So that was an American
vessel that disappeared.
498
00:25:22,770 --> 00:25:25,490
I think it was close to
Bermuda or something.
499
00:25:25,490 --> 00:25:28,790
But you can tell the story
of the Malaysian Airlines,
500
00:25:28,790 --> 00:25:31,640
except that I don't think
it's such a successful story.
501
00:25:31,640 --> 00:25:33,770
But the idea was
essentially, we're
502
00:25:33,770 --> 00:25:36,050
trying to find where
this thing happened.
503
00:25:36,050 --> 00:25:39,800
And of course, this
is a one-time thing.
504
00:25:39,800 --> 00:25:41,686
You need something
that works once.
505
00:25:41,686 --> 00:25:44,060
You need something that works
for this particular vessel.
506
00:25:44,060 --> 00:25:46,601
And you don't care, if you go
to the Navy, and you tell them,
507
00:25:46,601 --> 00:25:48,320
well, here's a method.
508
00:25:48,320 --> 00:25:51,730
And for 95 out of 100 vessels
that you're going to lose,
509
00:25:51,730 --> 00:25:53,350
we're going to be
able to find it.
510
00:25:53,350 --> 00:25:57,230
And they want this to work
for this particular one.
511
00:25:57,230 --> 00:25:59,750
And so they were
looking, and they were
512
00:25:59,750 --> 00:26:02,200
diving in different places.
513
00:26:02,200 --> 00:26:04,710
And suddenly, they
brought in this guy.
514
00:26:04,710 --> 00:26:05,460
I forget his name.
515
00:26:05,460 --> 00:26:08,960
I mean, there's a whole story
about this on Wikipedia.
516
00:26:08,960 --> 00:26:10,612
And he started
collecting the data
517
00:26:10,612 --> 00:26:13,070
that they had from different
dives and maybe from currents.
518
00:26:13,070 --> 00:26:14,569
And he started to
put everything in.
519
00:26:14,569 --> 00:26:17,540
And he said, OK, what is
the posterior distribution
520
00:26:17,540 --> 00:26:21,140
of the location of the
vessel, given all the things
521
00:26:21,140 --> 00:26:22,340
that I've seen?
522
00:26:22,340 --> 00:26:23,390
And what have you seen?
523
00:26:23,390 --> 00:26:25,280
Well, you've seen that it's
not here, it's not there,
524
00:26:25,280 --> 00:26:26,071
and it's not there.
525
00:26:26,071 --> 00:26:29,360
And you've also seen that the
currents were going that way,
526
00:26:29,360 --> 00:26:30,786
and the winds were
going that way.
527
00:26:30,786 --> 00:26:32,660
And you can actually
put in some modeling
528
00:26:32,660 --> 00:26:33,890
to understand this.
529
00:26:33,890 --> 00:26:37,940
Now, given this, for this
particular data that you have,
530
00:26:37,940 --> 00:26:41,420
you can actually think of having
a two-dimensional density that
531
00:26:41,420 --> 00:26:44,650
tells you where it's more
likely that the vessel is.
532
00:26:44,650 --> 00:26:46,400
And where are you going
to be looking for?
533
00:26:46,400 --> 00:26:48,097
Well, if it's a
multimodal distribution,
534
00:26:48,097 --> 00:26:50,180
you're just going to go
to the highest mode first,
535
00:26:50,180 --> 00:26:52,190
because that's where it's
the most likely to be.
536
00:26:52,190 --> 00:26:53,600
And maybe it's not
there, so you're just
537
00:26:53,600 --> 00:26:55,250
going to update your
posterior, based on the fact
538
00:26:55,250 --> 00:26:56,791
that it's not there,
and do it again.
539
00:26:56,791 --> 00:26:59,270
And actually, after
two dives, I think,
540
00:26:59,270 --> 00:27:01,010
he actually found the thing.
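The search procedure just described, dive at the posterior mode and, if the wreck is not found, update and renormalize, can be sketched as a grid update (the cells, prior weights, and detection probability d below are all hypothetical):

```python
# Prior over four candidate cells, e.g. from currents/winds modeling (made up)
probs = [0.35, 0.30, 0.20, 0.15]
d = 0.8  # probability a dive detects the wreck when searching the right cell

def failed_search_update(probs, k, d):
    # Bayes update after an unsuccessful dive in cell k:
    # P(cell i | miss) is proportional to P(miss | cell i) * P(cell i)
    miss = [p * (1 - d) if i == k else p for i, p in enumerate(probs)]
    total = sum(miss)
    return [m / total for m in miss]

k = probs.index(max(probs))      # dive at the posterior mode first
probs = failed_search_update(probs, k, d)
```

After the failed dive, the searched cell's mass drops and a new mode emerges, which is where the next dive goes.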
541
00:27:01,010 --> 00:27:03,122
And that's exactly where
Bayesian statistics
542
00:27:03,122 --> 00:27:03,830
start to kick in.
543
00:27:03,830 --> 00:27:08,570
Because you put a lot of
knowledge into your model,
544
00:27:08,570 --> 00:27:11,340
but you also can actually factor
in a bunch of information,
545
00:27:11,340 --> 00:27:11,840
right?
546
00:27:11,840 --> 00:27:13,460
The model, he had
to build a model
547
00:27:13,460 --> 00:27:17,360
that was actually taking into
account the currents and the winds.
548
00:27:17,360 --> 00:27:20,780
And what you can have
as a guarantee is that,
549
00:27:20,780 --> 00:27:22,610
when you talk about
the probability
550
00:27:22,610 --> 00:27:27,346
that this vessel is
in this location,
551
00:27:27,346 --> 00:27:28,970
given what you've
observed in the past,
552
00:27:28,970 --> 00:27:30,140
it actually has some sense.
553
00:27:30,140 --> 00:27:34,610
Whereas, if you were to
use a frequentist approach,
554
00:27:34,610 --> 00:27:35,810
then there's no probability.
555
00:27:35,810 --> 00:27:38,660
Either it's underneath this
position or it's not, right?
556
00:27:38,660 --> 00:27:41,520
So that's actually where
it starts to make sense.
557
00:27:41,520 --> 00:27:43,370
And so you can
actually build this.
558
00:27:43,370 --> 00:27:44,930
And there's actually
a lot of methods
559
00:27:44,930 --> 00:27:47,300
for search that
560
00:27:47,300 --> 00:27:48,979
are based on Bayesian methods.
561
00:27:48,979 --> 00:27:50,520
I think, for example,
the Higgs boson search
562
00:27:50,520 --> 00:27:51,920
was based on a lot
of Bayesian methods,
563
00:27:51,920 --> 00:27:54,050
because this is something
you need to find [INAUDIBLE],,
564
00:27:54,050 --> 00:27:54,549
right?
565
00:27:54,549 --> 00:27:57,330
I mean, there was a lot of
prior that has to be built in.
566
00:27:57,330 --> 00:27:57,830
OK.
567
00:27:57,830 --> 00:27:59,621
So now, you build this
confidence interval.
568
00:27:59,621 --> 00:28:02,300
And the nicest way to do
it is to use level sets.
569
00:28:02,300 --> 00:28:05,210
But again, just like for
Gaussians, I mean, if I had,
570
00:28:05,210 --> 00:28:12,290
even in the Gaussian
case, I decided
571
00:28:12,290 --> 00:28:16,110
to go at x bar plus
or minus something,
572
00:28:16,110 --> 00:28:19,500
but I could go at something
that's completely asymmetric.
573
00:28:19,500 --> 00:28:21,467
So what's happening is
that here, this method
574
00:28:21,467 --> 00:28:23,550
guarantees that you're
going to have the narrowest
575
00:28:23,550 --> 00:28:24,800
possible confidence intervals.
576
00:28:24,800 --> 00:28:27,480
That's essentially what
it's telling you, OK?
577
00:28:27,480 --> 00:28:31,890
Because every time I'm choosing
a point, starting from here,
578
00:28:31,890 --> 00:28:36,170
I'm actually putting as much
area under the curve as I can.
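Choosing the region by level sets of the posterior density, so that each point added carries as much area under the curve as possible, is what makes the region narrowest. A discretized sketch, with a standard Gaussian standing in for the posterior:

```python
import numpy as np

# Discretized posterior density on a grid (standard Gaussian as an example)
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
dens = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Lower the level until the set {dens >= level} captures 95% of the mass:
# equivalently, add grid points in decreasing order of density
order = np.argsort(dens)[::-1]
mass = np.cumsum(dens[order]) * dx
keep = order[: int(np.searchsorted(mass, 0.95)) + 1]
hpd = (x[keep].min(), x[keep].max())  # close to (-1.96, 1.96) here
```

For a symmetric unimodal posterior this recovers the familiar symmetric interval; for a skewed or multimodal posterior the level-set region is genuinely asymmetric, or not an interval at all.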
579
00:28:36,170 --> 00:28:38,660
All right.
580
00:28:38,660 --> 00:28:41,737
So those are called Bayesian
confidence regions.
581
00:28:41,737 --> 00:28:43,320
Oh yeah, and I
promised you that we're
582
00:28:43,320 --> 00:28:46,500
going to work on some
example that actually
583
00:28:46,500 --> 00:28:50,940
gives a meaning to what I just
told you, with actual numbers.
584
00:28:50,940 --> 00:28:56,790
So this is something that's
taken from Wasserman's book.
585
00:28:56,790 --> 00:29:01,140
And also, it's
coming from a paper,
586
00:29:01,140 --> 00:29:03,780
from a stats paper,
from [? Wolpert ?] and I
587
00:29:03,780 --> 00:29:05,760
don't know who, from the '80s.
588
00:29:05,760 --> 00:29:07,760
And essentially,
this is how it works.
589
00:29:07,760 --> 00:29:10,680
So assume that you have
n equals 2 observations.
590
00:29:14,320 --> 00:29:18,780
And you have y1, so those
observations are y1--
591
00:29:18,780 --> 00:29:20,680
no, sorry, let's
call them x1, which
592
00:29:20,680 --> 00:29:26,000
is theta, plus epsilon 1 and x2,
which is theta plus epsilon 2,
593
00:29:26,000 --> 00:29:31,060
where epsilon 1 and
epsilon 2 are iid.
594
00:29:31,060 --> 00:29:33,280
And the probability
that epsilon i is equal
595
00:29:33,280 --> 00:29:35,110
to plus 1 is equal
to the probability
596
00:29:35,110 --> 00:29:38,440
that epsilon i is equal to
minus 1 is equal to 1/2.
597
00:29:38,440 --> 00:29:44,550
OK, so it's just the uniform
sign plus minus 1, OK?
598
00:29:44,550 --> 00:29:46,590
Now, let's think
about-- so you're trying
599
00:29:46,590 --> 00:29:47,970
to do some inference on theta.
600
00:29:47,970 --> 00:29:50,261
Maybe you actually want to
find some inference on theta
601
00:29:50,261 --> 00:29:51,825
that's actually based on--
602
00:29:51,825 --> 00:29:55,660
and that's based only
on the x1 and x2.
603
00:29:55,660 --> 00:29:56,430
OK?
604
00:29:56,430 --> 00:29:58,750
So I'm going to actually
build a confidence interval.
605
00:29:58,750 --> 00:30:01,110
But what I really
want to build is a--
606
00:30:03,594 --> 00:30:05,010
but let's start
thinking about how
607
00:30:05,010 --> 00:30:07,780
I would find an estimator
for those two things.
608
00:30:07,780 --> 00:30:09,970
Well, what values am I
going to be getting, right?
609
00:30:09,970 --> 00:30:13,750
So I'm going to get either
theta plus 1 or theta minus 1.
610
00:30:13,750 --> 00:30:15,610
And actually, I can
get basically four
611
00:30:15,610 --> 00:30:19,260
different observations, right?
612
00:30:19,260 --> 00:30:21,516
Sorry, four different
pairs of observations--
613
00:30:30,760 --> 00:30:32,410
theta plus 1, theta plus 1; theta plus 1, theta minus 1;
theta minus 1, theta plus 1; or theta minus 1, theta minus 1.
614
00:30:32,410 --> 00:30:33,170
Agreed?
615
00:30:33,170 --> 00:30:37,340
Those are the four possible
observations that I can get.
616
00:30:37,340 --> 00:30:38,970
Agreed?
617
00:30:38,970 --> 00:30:42,924
Either they're both equal to
plus 1, both equal to minus 1,
618
00:30:42,924 --> 00:30:44,340
or one of the two
is equal to plus
619
00:30:44,340 --> 00:30:46,950
1, the other one to
minus 1, or the epsilons.
620
00:30:46,950 --> 00:30:47,580
OK.
621
00:30:47,580 --> 00:30:49,730
So those are the four
observations I can get.
622
00:30:49,730 --> 00:30:56,010
So in particular, if
they take the same value,
623
00:30:56,010 --> 00:30:59,390
then you know it's either
theta plus 1 or theta minus 1,
624
00:30:59,390 --> 00:31:02,100
and if they take a different
value, I know one of them
625
00:31:02,100 --> 00:31:04,555
is theta plus 1, and one
is actually theta minus 1.
626
00:31:04,555 --> 00:31:07,180
So in particular, if I take the
average of those two guys, when
627
00:31:07,180 --> 00:31:09,138
they take different
values, I know I'm actually
628
00:31:09,138 --> 00:31:10,850
getting theta right.
629
00:31:10,850 --> 00:31:14,441
So let's build a
confidence region.
630
00:31:14,441 --> 00:31:16,940
OK, so I'm actually going to
take a confidence region, which
631
00:31:16,940 --> 00:31:18,810
is just a singleton.
632
00:31:21,662 --> 00:31:23,120
And I'm going to
say the following.
633
00:31:23,120 --> 00:31:32,460
Well, if x1 is equal to x2, I'm
just going to take x1 minus 1,
634
00:31:32,460 --> 00:31:33,320
OK?
635
00:31:33,320 --> 00:31:34,790
So I'm just saying,
well, I'm never
636
00:31:34,790 --> 00:31:37,310
going to be able to resolve
whether it's plus 1 or minus 1
637
00:31:37,310 --> 00:31:38,864
that actually gives
me the best one,
638
00:31:38,864 --> 00:31:41,030
so I'm just going to take
a dive and say, well, it's
639
00:31:41,030 --> 00:31:42,594
just plus 1.
640
00:31:42,594 --> 00:31:44,860
OK?
641
00:31:44,860 --> 00:31:47,710
And then, if they're
different, then here,
642
00:31:47,710 --> 00:31:50,830
I can do much better.
643
00:31:50,830 --> 00:31:52,929
I'm going to actually
just take the average.
644
00:31:56,282 --> 00:31:58,200
OK?
645
00:31:58,200 --> 00:32:08,360
Now, what I claim is that
this is a confidence region--
646
00:32:08,360 --> 00:32:10,370
and by default, when
I don't mention it,
647
00:32:10,370 --> 00:32:16,190
this is a frequentist
confidence region--
648
00:32:16,190 --> 00:32:18,740
at level 75%.
649
00:32:21,050 --> 00:32:21,550
OK?
650
00:32:21,550 --> 00:32:23,100
So let's just check that.
651
00:32:23,100 --> 00:32:24,685
To check that this
is correct, I need
652
00:32:24,685 --> 00:32:27,460
to check that the probability
under the realization of x1
653
00:32:27,460 --> 00:32:30,940
and x2, that theta belongs,
is one of those two guys,
654
00:32:30,940 --> 00:32:33,291
is actually equal to 0.75.
655
00:32:33,291 --> 00:32:33,790
Yes?
656
00:32:33,790 --> 00:32:36,529
AUDIENCE: What are
the [INAUDIBLE]
657
00:32:36,529 --> 00:32:39,070
PHILIPPE RIGOLLET: Well, it's
just the frequentist confidence
658
00:32:39,070 --> 00:32:41,842
interval that does not
need to be an interval.
659
00:32:41,842 --> 00:32:44,050
Actually, in this case, it's
going to be an interval.
660
00:32:44,050 --> 00:32:46,602
But that's just what it means.
661
00:32:46,602 --> 00:32:50,055
Yeah, region for Bayesian
was just because--
662
00:32:50,055 --> 00:32:51,430
I mean, the
confidence intervals,
663
00:32:51,430 --> 00:32:53,320
when we're frequentist,
we tend to make them
664
00:32:53,320 --> 00:32:54,606
intervals, because we want--
665
00:32:54,606 --> 00:32:56,980
but when you're Bayesian, and
you're doing this level set
666
00:32:56,980 --> 00:32:58,180
thing, you cannot
really guarantee,
667
00:32:58,180 --> 00:33:00,460
unless its [INAUDIBLE] is
going to be an interval.
668
00:33:00,460 --> 00:33:02,720
So region is just a way to
not have to say interval,
669
00:33:02,720 --> 00:33:03,430
in case it's not.
670
00:33:06,080 --> 00:33:06,640
OK.
671
00:33:06,640 --> 00:33:08,490
So I have this thing.
672
00:33:08,490 --> 00:33:11,440
So what I need to check is
the probability that theta
673
00:33:11,440 --> 00:33:13,000
is in one of those
two things, right?
674
00:33:13,000 --> 00:33:16,060
So what I need to find is
the probability that theta
675
00:33:16,060 --> 00:33:24,220
is in [INAUDIBLE] Well, x1 minus
1 and x1 is not equal to x2.
676
00:33:24,220 --> 00:33:26,840
And those are disjoint events,
so it's plus the probability
677
00:33:26,840 --> 00:33:35,980
that theta is in x1
plus x2 over 2 and x1--
678
00:33:35,980 --> 00:33:37,580
sorry, that's equal.
679
00:33:37,580 --> 00:33:39,700
That's different.
680
00:33:39,700 --> 00:33:40,200
OK.
681
00:33:40,200 --> 00:33:42,780
And OK, just before we actually
finish the computation,
682
00:33:42,780 --> 00:33:44,730
why do I have 75%?
683
00:33:44,730 --> 00:33:46,920
75% is 3/4.
684
00:33:46,920 --> 00:33:48,930
So it means that
we have four cases.
685
00:33:48,930 --> 00:33:52,020
And essentially, I did
not account for one case.
686
00:33:52,020 --> 00:33:52,650
And it's true.
687
00:33:52,650 --> 00:33:56,040
I did not account
for this case, when
688
00:33:56,040 --> 00:34:01,060
the both of the epsilon
i's are equal to minus 1.
689
00:34:01,060 --> 00:34:01,560
Right?
690
00:34:01,560 --> 00:34:03,393
So this is essentially
the one I'm not going
691
00:34:03,393 --> 00:34:04,620
to be able to account for.
692
00:34:04,620 --> 00:34:06,040
And so we'll see
that in a second.
693
00:34:06,040 --> 00:34:09,310
So in this case, we know
that everything goes great.
694
00:34:09,310 --> 00:34:09,810
Right?
695
00:34:09,810 --> 00:34:11,080
So in this case, this is--
696
00:34:11,080 --> 00:34:11,580
OK.
697
00:34:11,580 --> 00:34:13,831
Well, let's just start
from the first line.
698
00:34:13,831 --> 00:34:15,330
So the first line
is the probability
699
00:34:15,330 --> 00:34:20,290
that theta is equal to x1 minus
1 and those two are equal.
700
00:34:20,290 --> 00:34:28,440
So this is the probability
that theta is equal to--
701
00:34:28,440 --> 00:34:36,260
well, this is theta
plus epsilon 1 minus 1.
702
00:34:36,260 --> 00:34:43,409
And epsilon 1 is equal
to epsilon 2, right?
703
00:34:43,409 --> 00:34:45,290
Because I can remove
the theta from here,
704
00:34:45,290 --> 00:34:47,780
and I can actually remove
the theta from here,
705
00:34:47,780 --> 00:34:50,765
so that this guy here is
just epsilon 1 is equal to 1.
706
00:34:50,765 --> 00:34:52,407
So when I intersect
with this guy,
707
00:34:52,407 --> 00:34:54,740
it's actually the same thing
as epsilon 1 is equal to 1,
708
00:34:54,740 --> 00:34:56,530
as well--
709
00:34:56,530 --> 00:34:59,780
epsilon 2 is equal
to 1, as well, OK?
710
00:34:59,780 --> 00:35:05,240
So this first thing is actually
equal to the probability
711
00:35:05,240 --> 00:35:10,780
that epsilon 1 is equal to 1
and epsilon 2 is equal to 1,
712
00:35:10,780 --> 00:35:14,180
which is equal to what?
713
00:35:14,180 --> 00:35:15,570
AUDIENCE: [INAUDIBLE]
714
00:35:15,570 --> 00:35:17,070
PHILIPPE RIGOLLET:
Yeah, 1/4, right?
715
00:35:17,070 --> 00:35:19,870
So that's just the
first case over there.
716
00:35:19,870 --> 00:35:21,020
They're independent.
717
00:35:21,020 --> 00:35:23,420
Now, I still need to
do the second one.
718
00:35:23,420 --> 00:35:24,650
So this case is what?
719
00:35:24,650 --> 00:35:28,890
Well, when those things are
equal, x1 plus x2 over 2
720
00:35:28,890 --> 00:35:29,390
is what?
721
00:35:29,390 --> 00:35:31,920
Well, I get theta
plus epsilon 1 plus epsilon 2 over 2.
722
00:35:31,920 --> 00:35:33,800
So that's just equal
to the probability
723
00:35:33,800 --> 00:35:39,620
that epsilon 1 plus epsilon
2 over 2 is equal to 0
724
00:35:39,620 --> 00:35:43,600
and epsilon 1 is
different from epsilon 2.
725
00:35:43,600 --> 00:35:44,100
Agreed?
726
00:35:46,860 --> 00:35:49,797
I just removed the thetas from
these equations, because I can.
727
00:35:49,797 --> 00:35:51,380
They're just on both
sides every time.
728
00:35:54,810 --> 00:35:55,310
OK.
729
00:35:55,310 --> 00:35:56,482
And so that means what?
730
00:35:56,482 --> 00:35:58,440
That means that the second
part-- so this thing
731
00:35:58,440 --> 00:36:02,120
is actually equal to
1/4 plus the probability
732
00:36:02,120 --> 00:36:05,350
that epsilon 1 plus epsilon
2 over 2 is equal to 0.
733
00:36:05,350 --> 00:36:06,544
I can remove the 2.
734
00:36:06,544 --> 00:36:08,460
So this is just the
probability that one is 1,
735
00:36:08,460 --> 00:36:10,560
and the other one
is minus 1, right?
736
00:36:10,560 --> 00:36:12,510
So that's equal
to the probability
737
00:36:12,510 --> 00:36:17,820
that epsilon 1 is equal to 1 and
epsilon 2 is equal to minus 1
738
00:36:17,820 --> 00:36:21,360
plus the probability that
epsilon 1 is equal to minus 1
739
00:36:21,360 --> 00:36:24,447
and epsilon 2 is
equal to plus 1, OK?
740
00:36:24,447 --> 00:36:25,780
Because they're disjoint events.
741
00:36:25,780 --> 00:36:28,080
So I can break them
into the sum of the two.
742
00:36:28,080 --> 00:36:32,310
And each of those guys is also
one of the atomic parts of it.
743
00:36:32,310 --> 00:36:33,960
It's one of the basic things.
744
00:36:33,960 --> 00:36:36,011
And so each of those
guys has probability 1/4.
745
00:36:36,011 --> 00:36:38,010
And so here, we can really
see that we accounted
746
00:36:38,010 --> 00:36:41,910
for everything, except for the
case when epsilon 1 was equal
747
00:36:41,910 --> 00:36:44,730
to minus 1, and epsilon
2 was equal to minus 1.
748
00:36:44,730 --> 00:36:45,570
So this is 1/4.
749
00:36:45,570 --> 00:36:46,380
This is 1/4.
750
00:36:46,380 --> 00:36:49,850
So the whole thing
is equal to 3/4.
751
00:36:49,850 --> 00:36:56,060
So now, what we have is that
the probability that epsilon 1
752
00:36:56,060 --> 00:36:57,350
is in--
753
00:36:57,350 --> 00:37:03,230
so the probability that theta
belongs to this confidence
754
00:37:03,230 --> 00:37:06,280
region is equal to 3/4.
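The 3/4 coverage just computed can be confirmed by brute force (a minimal simulation sketch; theta is fixed arbitrarily at 3):

```python
import random

random.seed(1)
theta = 3
trials = 100_000
hits = 0

for _ in range(trials):
    x1 = theta + random.choice([-1, 1])
    x2 = theta + random.choice([-1, 1])
    # The region: {x1 - 1} if x1 == x2, else {(x1 + x2) / 2}
    guess = x1 - 1 if x1 == x2 else (x1 + x2) / 2
    hits += (guess == theta)

coverage = hits / trials  # close to 0.75
```

Only the (minus 1, minus 1) draw of the epsilons makes the region miss, which is exactly the one unaccounted case out of four.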
755
00:37:06,280 --> 00:37:07,990
And that's very nice.
756
00:37:07,990 --> 00:37:09,740
But the thing is some
people are sort of--
757
00:37:09,740 --> 00:37:12,650
I mean, it's not super nice
to be able to see this,
758
00:37:12,650 --> 00:37:17,510
because, in a way, I know that,
if I observe x1 and x2 that
759
00:37:17,510 --> 00:37:24,050
are different, I know
for sure that theta,
760
00:37:24,050 --> 00:37:25,882
that I actually got
the right theta, right?
761
00:37:25,882 --> 00:37:27,590
That this confidence
interval is actually
762
00:37:27,590 --> 00:37:31,370
happening with probability 1.
763
00:37:31,370 --> 00:37:34,700
And the problem is
that I do not know--
764
00:37:34,700 --> 00:37:37,640
I cannot make this precise
with the notion of frequentist
765
00:37:37,640 --> 00:37:39,230
confidence intervals.
766
00:37:39,230 --> 00:37:39,730
OK?
767
00:37:39,730 --> 00:37:41,396
Because frequentist
confidence intervals
768
00:37:41,396 --> 00:37:43,810
have to account for the
fact that, in the future,
769
00:37:43,810 --> 00:37:47,810
it might not be the case
that x1 and x2 are different.
770
00:37:47,810 --> 00:37:53,360
So Bayesian confidence
regions, by definition--
771
00:37:53,360 --> 00:37:54,530
well, they're all gone--
772
00:37:54,530 --> 00:37:57,387
but they are conditioned
on the data that I have.
773
00:37:57,387 --> 00:37:58,470
And so that's what I want.
774
00:37:58,470 --> 00:38:00,800
I want to be able to make
this statement conditionally
775
00:38:00,800 --> 00:38:02,640
on the data that I have.
776
00:38:02,640 --> 00:38:03,140
OK.
777
00:38:03,140 --> 00:38:06,450
So if I want to be able
to make this statement,
778
00:38:06,450 --> 00:38:08,450
if I want to build a
Bayesian confidence region,
779
00:38:08,450 --> 00:38:10,520
I'm going to have to
put a prior on theta.
780
00:38:10,520 --> 00:38:12,050
So without loss of generality--
781
00:38:12,050 --> 00:38:16,520
I mean, maybe with--
but let's assume
782
00:38:16,520 --> 00:38:25,980
that pi is a prior on theta.
783
00:38:25,980 --> 00:38:31,540
And let's assume that pi
of j is strictly positive
784
00:38:31,540 --> 00:38:35,920
for all integers
j equal, say, 0--
785
00:38:35,920 --> 00:38:42,770
well, actually, for all j in the
integers, positive or negative.
786
00:38:42,770 --> 00:38:43,270
OK.
787
00:38:43,270 --> 00:38:46,870
So that's a pretty weak
assumption on my prior.
788
00:38:46,870 --> 00:38:52,901
I'm just assuming that
theta is some integer.
789
00:38:52,901 --> 00:38:57,290
And now, let's build our
Bayesian confidence region.
790
00:38:57,290 --> 00:38:59,540
Well, if I want to build a
Bayesian confidence region,
791
00:38:59,540 --> 00:39:01,520
I need to understand what
my posterior is going to be.
792
00:39:01,520 --> 00:39:02,089
OK?
793
00:39:02,089 --> 00:39:04,630
And if I want to understand what
my posterior is going to be,
794
00:39:04,630 --> 00:39:11,530
I actually need to build
a likelihood, right?
795
00:39:11,530 --> 00:39:16,370
So we know that it's the
product of the likelihood
796
00:39:16,370 --> 00:39:20,740
and of the prior divided by--
797
00:39:20,740 --> 00:39:21,240
OK.
798
00:39:31,140 --> 00:39:32,850
So what is my likelihood?
799
00:39:32,850 --> 00:39:35,540
So my likelihood
is the probability
800
00:39:35,540 --> 00:39:40,580
of x1 x2, given theta.
801
00:39:40,580 --> 00:39:41,240
Right?
802
00:39:41,240 --> 00:39:45,010
That's what the
likelihood should be.
803
00:39:45,010 --> 00:39:49,840
And now let's say
that actually, just
804
00:39:49,840 --> 00:39:51,910
to make things a
little simpler, let
805
00:39:51,910 --> 00:40:07,230
us assume that x1 is
equal to, I don't know, 5,
806
00:40:07,230 --> 00:40:11,180
and x2 is equal to 7.
807
00:40:11,180 --> 00:40:12,540
OK?
808
00:40:12,540 --> 00:40:16,350
So I'm not going to take the
case where they're actually
809
00:40:16,350 --> 00:40:19,180
equal to each other, because
I know that, in this case,
810
00:40:19,180 --> 00:40:20,550
x1 and x2 are different.
811
00:40:20,550 --> 00:40:23,970
I know I'm going to actually
nail exactly what theta is,
812
00:40:23,970 --> 00:40:26,540
by looking at the average
of those guys, right?
813
00:40:26,540 --> 00:40:30,630
Here, it must be that
theta is equal to 6.
814
00:40:30,630 --> 00:40:34,491
So what I want is to compute
the likelihood at 5 and 7, OK?
815
00:40:38,419 --> 00:40:42,350
And what is this likelihood?
816
00:40:42,350 --> 00:40:53,950
Well, if theta is
equal to 6, that's
817
00:40:53,950 --> 00:41:00,010
just the probability that I
will observe 5 and 7, right?
818
00:41:00,010 --> 00:41:01,910
So what is the probability
I observe 5 and 7?
819
00:41:04,610 --> 00:41:05,510
Yeah?
820
00:41:05,510 --> 00:41:06,672
1?
821
00:41:06,672 --> 00:41:08,499
AUDIENCE: 1/4.
822
00:41:08,499 --> 00:41:10,040
PHILIPPE RIGOLLET:
That's 1/4, right?
823
00:41:10,040 --> 00:41:15,260
As the probability, I have
minus 1 for the first epsilon 1,
824
00:41:15,260 --> 00:41:15,760
right?
825
00:41:15,760 --> 00:41:17,260
So this is, if theta is 6.
826
00:41:17,260 --> 00:41:23,080
This is the probability that
epsilon 1 is equal to minus 1,
827
00:41:23,080 --> 00:41:28,790
and epsilon 2 is equal to
plus 1, which is equal to 1/4.
828
00:41:28,790 --> 00:41:31,520
So this probability is 1/4.
829
00:41:31,520 --> 00:41:35,560
If theta is different from
6, what is this probability?
830
00:41:35,560 --> 00:41:37,630
So if theta is different
from 6, since we
831
00:41:37,630 --> 00:41:41,210
know that we've only
loaded the integers--
832
00:41:41,210 --> 00:41:46,770
so if theta has to
be another integer,
833
00:41:46,770 --> 00:41:49,214
what is the probability
that I see 5 and 7?
834
00:41:49,214 --> 00:41:49,731
AUDIENCE: 0.
835
00:41:49,731 --> 00:41:50,606
PHILIPPE RIGOLLET: 0.
836
00:41:53,860 --> 00:41:55,190
So that's my likelihood.
837
00:41:55,190 --> 00:42:00,210
And if I want to know
what my posterior is,
838
00:42:00,210 --> 00:42:03,340
well, it's just
pi of theta times
839
00:42:03,340 --> 00:42:10,240
p of 5, 7, given theta, divided
by the sum over all T's, say,
840
00:42:10,240 --> 00:42:11,890
in Z. Right?
841
00:42:11,890 --> 00:42:14,590
So now, I just need to
normalize this thing.
842
00:42:14,590 --> 00:42:21,950
So of pi of T, p of
5, 7, given T. Agreed?
843
00:42:24,730 --> 00:42:27,350
That's just the definition
of the posterior.
844
00:42:27,350 --> 00:42:30,330
But when I sum
these guys, there's
845
00:42:30,330 --> 00:42:34,780
only one that counts,
because, for those things,
846
00:42:34,780 --> 00:42:38,140
we know that this is actually
equal to 0 for every T,
847
00:42:38,140 --> 00:42:41,470
except for when T is equal to 6.
848
00:42:41,470 --> 00:42:45,380
So this entire sum
here is actually
849
00:42:45,380 --> 00:42:54,310
equal to pi of 6
times p of 5, 6--
850
00:42:54,310 --> 00:43:03,360
sorry, 5, 7, of 5, 7,
given that theta
851
00:43:03,360 --> 00:43:08,370
is equal to 6, which we
know is equal to 1/4.
852
00:43:08,370 --> 00:43:10,630
And I did not tell
you what pi of 6 was.
853
00:43:16,840 --> 00:43:18,070
But it's the same thing here.
854
00:43:18,070 --> 00:43:21,020
The posterior for any
theta that's not 6
855
00:43:21,020 --> 00:43:23,520
is actually going to be-- this
guy's going to be equal to 0.
856
00:43:23,520 --> 00:43:26,130
So I really don't
care what this guy is.
857
00:43:26,130 --> 00:43:29,270
So what it means is that
my posterior becomes what?
858
00:43:33,870 --> 00:43:40,290
It becomes the
posterior pi of theta,
859
00:43:40,290 --> 00:43:46,970
given 5 and 7 is equal to--
well, when theta is not
860
00:43:46,970 --> 00:43:49,090
equal to 6, this is actually 0.
861
00:43:49,090 --> 00:43:52,450
So regardless of what I do here,
I get something which is 0.
862
00:43:55,120 --> 00:43:58,000
And if theta is equal
to 6, what I get
863
00:43:58,000 --> 00:44:02,500
is pi of 6 times
p of 5, 7, given 6,
864
00:44:02,500 --> 00:44:05,560
which I've just computed
here, which is 1/4 divided
865
00:44:05,560 --> 00:44:08,140
by pi of 6 times 1/4.
866
00:44:08,140 --> 00:44:10,640
So it's the ratio of two
things that are identical.
867
00:44:10,640 --> 00:44:13,360
So I get 1.
868
00:44:13,360 --> 00:44:16,570
So now, my posterior
tells me that, given
869
00:44:16,570 --> 00:44:22,440
that I observe 5
and 7, theta has
870
00:44:22,440 --> 00:44:27,690
to be 1 with probability-- has
to be 6 with probability 1.
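The posterior computation can be replayed numerically: for any prior putting positive mass on every integer, observing x1 = 5 and x2 = 7 forces all posterior mass onto theta = 6 (the truncated uniform prior below is a made-up stand-in):

```python
def likelihood(x1, x2, theta):
    # X_i = theta + eps_i, with eps_i uniform on {-1, +1}, independent
    p1 = 0.5 if abs(x1 - theta) == 1 else 0.0
    p2 = 0.5 if abs(x2 - theta) == 1 else 0.0
    return p1 * p2

# Any prior positive on the integers works; take a truncated uniform
support = range(-50, 51)
prior = {t: 1 / 101 for t in support}

x1, x2 = 5, 7
unnorm = {t: prior[t] * likelihood(x1, x2, t) for t in support}
z = sum(unnorm.values())
posterior = {t: u / z for t, u in unnorm.items()}
# The prior value at 6 cancels in the ratio: posterior mass at 6 is 1
```

The prior weight pi(6) appears in both numerator and denominator and cancels, which is why its exact value never mattered in the lecture's argument.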
871
00:44:27,690 --> 00:44:32,850
So now, I say that this
thing here-- so now, this
872
00:44:32,850 --> 00:44:34,590
is not something
that actually makes
873
00:44:34,590 --> 00:44:37,440
sense when I talk about
frequentist confidence
874
00:44:37,440 --> 00:44:38,310
intervals.
875
00:44:38,310 --> 00:44:40,560
They don't really make sense,
to talk about confidence
876
00:44:40,560 --> 00:44:42,330
intervals, given something.
877
00:44:42,330 --> 00:44:44,100
And so now, given that
I observe 5 and 7,
878
00:44:44,100 --> 00:44:46,224
I know that the probability
that theta equals 6 is 1.
879
00:44:46,224 --> 00:44:50,310
And in this sense, the
Bayesian confidence interval
880
00:44:50,310 --> 00:44:54,699
is actually more meaningful.
881
00:44:54,699 --> 00:44:56,990
So one thing I want to actually
say about this Bayesian
882
00:44:56,990 --> 00:44:58,466
confidence interval
is that it's--
883
00:45:01,100 --> 00:45:03,181
I mean, here, it's equal
to the value 1, right?
884
00:45:03,181 --> 00:45:05,180
So it really encompasses
the thing that we want.
885
00:45:05,180 --> 00:45:06,763
But the fact that
we actually computed
886
00:45:06,763 --> 00:45:09,140
it using the Bayesian
posterior and the Bayesian rule
887
00:45:09,140 --> 00:45:10,806
did not really matter
for this argument.
888
00:45:10,806 --> 00:45:12,980
All I just said was
that it had a prior.
889
00:45:12,980 --> 00:45:15,080
But just what I
want to illustrate
890
00:45:15,080 --> 00:45:17,930
is the fact that we can
actually give a meaning
891
00:45:17,930 --> 00:45:21,740
to the probability that
theta is equal to 6,
892
00:45:21,740 --> 00:45:23,390
given that I see 5 and 7.
893
00:45:23,390 --> 00:45:26,780
Whereas, we cannot really
in the other cases.
894
00:45:26,780 --> 00:45:28,490
And we don't have
to be particularly
895
00:45:28,490 --> 00:45:31,740
precise in the prior on theta
to be able to give theta this--
896
00:45:31,740 --> 00:45:32,930
to give this meaning.
897
00:45:32,930 --> 00:45:35,062
OK?
898
00:45:35,062 --> 00:45:36,038
All right.
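The discrete update just done on the board can be sketched numerically. A minimal sketch, assuming (since the model is not restated here) that X equals theta plus or minus 1, each with probability 1/2, which is what gives p(5, 7 given 6) = 1/4; the prior values below are made up, and only the fact that pi(6) > 0 matters:

```python
from fractions import Fraction

def likelihood(x, theta):
    # Assumed model: X = theta - 1 or theta + 1, each with probability 1/2,
    # so p(x | theta) = 1/2 exactly when |x - theta| = 1.
    return Fraction(1, 2) if abs(x - theta) == 1 else Fraction(0)

def posterior(data, prior):
    # Bayes' rule on a discrete parameter grid: pi(theta | data) is
    # proportional to pi(theta) * prod_i p(x_i | theta).
    unnorm = {}
    for theta, p in prior.items():
        w = p
        for x in data:
            w *= likelihood(x, theta)
        unnorm[theta] = w
    total = sum(unnorm.values())
    return {theta: w / total for theta, w in unnorm.items()}

# Any prior with pi(6) > 0 gives the same answer: pi(6) cancels in the ratio.
prior = {theta: Fraction(1, 11) for theta in range(11)}
post = posterior([5, 7], prior)
print(post[6])  # 1: theta = 6 is the only value that can produce both 5 and 7
```

Swapping in any other prior with positive mass at 6 leaves post[6] equal to 1, which is the point being made here.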
899
00:45:38,966 --> 00:45:43,130
So now, as I said, I think
the main power of Bayesian
900
00:45:43,130 --> 00:45:45,980
inference is that it spits out
the posterior distribution,
901
00:45:45,980 --> 00:45:48,830
and not just the single
number, like frequentists
902
00:45:48,830 --> 00:45:50,030
would give you.
903
00:45:50,030 --> 00:45:55,070
Then we can decorate theta
hat, our point estimate,
904
00:45:55,070 --> 00:45:56,570
with maybe some
confidence interval.
905
00:45:56,570 --> 00:45:58,400
Maybe we can do
a bunch of tests.
906
00:45:58,400 --> 00:46:01,070
But at the end of the
day, we just have,
907
00:46:01,070 --> 00:46:02,624
essentially, one number, right?
908
00:46:02,624 --> 00:46:04,040
Then maybe we can
understand where
909
00:46:04,040 --> 00:46:07,310
the fluctuations of this number
are in a frequentist setup.
910
00:46:07,310 --> 00:46:11,760
But the Bayesian
framework is essentially
911
00:46:11,760 --> 00:46:13,059
giving you a natural method.
912
00:46:13,059 --> 00:46:15,517
And you can interpret it in
terms of the probabilities that
913
00:46:15,517 --> 00:46:17,400
are associated to the prior.
914
00:46:17,400 --> 00:46:21,180
But you can actually
also try to make some--
915
00:46:21,180 --> 00:46:25,840
so a Bayesian, if you
give me any prior,
916
00:46:25,840 --> 00:46:29,040
you're going to actually build
an estimator from this prior,
917
00:46:29,040 --> 00:46:30,515
maybe from the posterior.
918
00:46:30,515 --> 00:46:32,890
And maybe it's going to have
some frequentist properties.
919
00:46:32,890 --> 00:46:35,181
And that's what's really nice
about [? Bayesians, ?] is
920
00:46:35,181 --> 00:46:36,700
that you can
actually try to give
921
00:46:36,700 --> 00:46:39,340
some frequentist properties
of Bayesian methods, that
922
00:46:39,340 --> 00:46:42,224
are built using
Bayesian methodology.
923
00:46:42,224 --> 00:46:44,140
But you cannot really
go the other way around.
924
00:46:44,140 --> 00:46:46,449
If I give you a
frequentist methodology,
925
00:46:46,449 --> 00:46:48,490
how are you going to say
something about the fact
926
00:46:48,490 --> 00:46:51,620
that there's a prior
going on, et cetera?
927
00:46:51,620 --> 00:46:53,457
And so this is actually
an area where
928
00:46:53,457 --> 00:46:55,790
there's actually some research
going on.
929
00:46:55,790 --> 00:46:58,147
They call it Bayesian
posterior concentration.
930
00:46:58,147 --> 00:46:59,980
And one of the things--
so there's something
931
00:46:59,980 --> 00:47:01,990
called the Bernstein-von
Mises theorem.
932
00:47:01,990 --> 00:47:03,910
And those are a
class of theorems,
933
00:47:03,910 --> 00:47:06,790
and those are essentially
methods that tell you, well,
934
00:47:06,790 --> 00:47:10,690
if I actually run
a Bayesian method,
935
00:47:10,690 --> 00:47:12,647
and I look at the
posterior that I get--
936
00:47:12,647 --> 00:47:14,230
it's going to be
something like this--
937
00:47:14,230 --> 00:47:16,540
but now, I try to study this
from a frequentist point of view,
938
00:47:16,540 --> 00:47:18,289
there's actually a
true parameter theta
939
00:47:18,289 --> 00:47:20,390
somewhere, the true one.
940
00:47:20,390 --> 00:47:21,640
There's no prior for this guy.
941
00:47:21,640 --> 00:47:23,410
This is just one fixed number.
942
00:47:23,410 --> 00:47:25,120
Is it true that as
my sample size is
943
00:47:25,120 --> 00:47:27,610
going to go to infinity,
then this thing is going
944
00:47:27,610 --> 00:47:29,860
to concentrate around theta?
945
00:47:29,860 --> 00:47:31,990
And the rate of
concentration of this thing,
946
00:47:31,990 --> 00:47:35,440
the size of this width,
the standard deviation
947
00:47:35,440 --> 00:47:38,290
of this thing, is something
that should decay maybe
948
00:47:38,290 --> 00:47:40,850
like 1 over square root of
n, or something like this.
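This 1 over square root of n width can be illustrated with the conjugate Beta posterior for Bernoulli data. A hedged sketch: the true p = 0.4 and the sample sizes are made up, and a Jeffreys Beta(1/2, 1/2) prior is assumed:

```python
import math
import random

def beta_std(a, b):
    # Closed-form standard deviation of a Beta(a, b) distribution.
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

random.seed(0)
p_true = 0.4  # made-up true parameter for the frequentist thought experiment
for n in (100, 400, 1600):
    s = sum(1 for _ in range(n) if random.random() < p_true)
    # Posterior width under a Beta(1/2, 1/2) prior: quadrupling n roughly
    # halves the width, i.e. the 1 over square root of n rate.
    print(n, round(beta_std(0.5 + s, 0.5 + n - s), 4))
```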
949
00:47:40,850 --> 00:47:43,349
And the rate of
posterior concentration,
950
00:47:43,349 --> 00:47:45,890
when you characterize it, it's
called the Bernstein-von Mises
951
00:47:45,890 --> 00:47:46,390
theorem.
952
00:47:46,390 --> 00:47:47,830
And so people are
looking at this
953
00:47:47,830 --> 00:47:49,566
in some non-parametric cases.
954
00:47:49,566 --> 00:47:51,190
You can do it in
pretty much everything
955
00:47:51,190 --> 00:47:52,190
we've been doing before.
956
00:47:52,190 --> 00:47:55,690
You can do it for non-parametric
regression estimation
957
00:47:55,690 --> 00:47:56,794
or density estimation.
958
00:47:56,794 --> 00:47:58,210
You can do it for,
of course-- you
959
00:47:58,210 --> 00:48:01,340
can do it for sparse
estimation, if you want.
960
00:48:01,340 --> 00:48:01,840
OK.
961
00:48:01,840 --> 00:48:04,967
So you can actually
compute the procedure and--
962
00:48:08,620 --> 00:48:09,290
yeah.
963
00:48:09,290 --> 00:48:12,660
And so you can think of it as
being just a method somehow.
964
00:48:12,660 --> 00:48:14,970
Now, the estimator
I'm talking about-- so
965
00:48:14,970 --> 00:48:18,210
that's just a general Bayesian
posterior concentration.
966
00:48:18,210 --> 00:48:20,430
But you can also
try to understand
967
00:48:20,430 --> 00:48:22,710
what is the property
of something that's
968
00:48:22,710 --> 00:48:24,210
extracted from this posterior.
969
00:48:24,210 --> 00:48:26,130
And one thing that
we actually describe
970
00:48:26,130 --> 00:48:28,310
was, for example,
well, given this guy,
971
00:48:28,310 --> 00:48:30,060
maybe it's a good idea
to think about what
972
00:48:30,060 --> 00:48:32,370
the mean of this
thing is, right?
973
00:48:32,370 --> 00:48:35,040
So there's going to
be some theta hat,
974
00:48:35,040 --> 00:48:41,460
which is just the integral of
theta times pi of theta, given x1 through xn--
975
00:48:41,460 --> 00:48:43,860
so that's my posterior--
976
00:48:43,860 --> 00:48:44,380
d theta.
977
00:48:44,380 --> 00:48:44,880
Right?
978
00:48:44,880 --> 00:48:46,500
So that's the posterior mean.
979
00:48:46,500 --> 00:48:48,750
That's the expected
value with respect
980
00:48:48,750 --> 00:48:50,880
to the posterior distribution.
981
00:48:50,880 --> 00:48:53,640
And I want to know how
does this thing behave,
982
00:48:53,640 --> 00:48:56,670
how close it is to a
true theta if I actually
983
00:48:56,670 --> 00:48:58,370
am in a frequency setup.
984
00:48:58,370 --> 00:48:59,784
So that's the posterior mean.
985
00:49:04,260 --> 00:49:08,450
But this is not the only thing
I can actually spit out, right?
986
00:49:08,450 --> 00:49:09,980
This is definitely
uniquely defined.
987
00:49:09,980 --> 00:49:13,490
If you give me a
distribution, I can actually
988
00:49:13,490 --> 00:49:15,170
spit out its posterior mean.
989
00:49:15,170 --> 00:49:17,480
But I can also think of
the posterior median.
990
00:49:21,450 --> 00:49:23,237
But now, if this
is not continuous,
991
00:49:23,237 --> 00:49:24,570
you might have some uncertainty.
992
00:49:24,570 --> 00:49:26,570
Maybe the median is
not uniquely defined,
993
00:49:26,570 --> 00:49:29,180
and so maybe that's not
something you use as much.
994
00:49:29,180 --> 00:49:31,690
Maybe you can actually talk
about the posterior mode.
995
00:49:35,160 --> 00:49:38,040
All right, so for example, if
your posterior density looks
996
00:49:38,040 --> 00:49:40,020
like this, then
maybe you just want
997
00:49:40,020 --> 00:49:43,600
to summarize your
posterior with this number.
998
00:49:43,600 --> 00:49:46,080
So clearly, in this case,
it's not such a good idea,
999
00:49:46,080 --> 00:49:48,270
because you completely
forget about this mode.
1000
00:49:48,270 --> 00:49:49,811
But maybe that's
what you want to do.
1001
00:49:49,811 --> 00:49:53,400
Maybe you want to focus
on the most peaked mode.
1002
00:49:53,400 --> 00:49:58,524
And this is actually called
maximum a posteriori.
1003
00:49:58,524 --> 00:49:59,940
As I said, maybe
you want a sample
1004
00:49:59,940 --> 00:50:03,240
from this posterior
distribution.
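The posterior mean, median, and mode just listed can all be read off one grid approximation of the posterior density. A minimal sketch with made-up numbers (a Beta posterior with parameters 7.5 and 3.5, i.e. seven 1s in ten Bernoulli draws under a Beta(1/2, 1/2) prior):

```python
def beta_density(p, a, b):
    # Unnormalized Beta(a, b) density; the constant cancels in every summary.
    return p ** (a - 1) * (1 - p) ** (b - 1)

def summaries(a, b, grid_size=100_000):
    # Grid approximations of the posterior mean, median, and mode (MAP).
    ps = [(i + 0.5) / grid_size for i in range(grid_size)]
    ws = [beta_density(p, a, b) for p in ps]
    total = sum(ws)
    mean = sum(p * w for p, w in zip(ps, ws)) / total
    acc, median = 0.0, None
    for p, w in zip(ps, ws):  # median: where the CDF crosses 1/2
        acc += w
        if acc >= total / 2:
            median = p
            break
    mode = ps[max(range(grid_size), key=ws.__getitem__)]
    return mean, median, mode

mean, median, mode = summaries(7.5, 3.5)
print(round(mean, 3))  # close to the exact value 7.5 / 11, about 0.682
```

The same grid recipe works for any posterior you can evaluate up to a constant, not just the Beta.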
1005
00:50:03,240 --> 00:50:06,420
OK, and so in all these cases,
these Bayesian estimators
1006
00:50:06,420 --> 00:50:09,000
will depend on the
prior distribution.
1007
00:50:09,000 --> 00:50:11,610
And the hope is that, as
the sample size grows,
1008
00:50:11,610 --> 00:50:14,130
you won't see its effect anymore.
1009
00:50:14,130 --> 00:50:14,630
OK.
1010
00:50:14,630 --> 00:50:20,840
So to conclude, let's just
do a couple of experiments.
1011
00:50:20,840 --> 00:50:22,340
So if I look at--
1012
00:50:25,200 --> 00:50:26,011
did we do this?
1013
00:50:26,011 --> 00:50:26,510
Yes.
1014
00:50:26,510 --> 00:50:30,398
So for example, so let's
focus on the posterior mean.
1015
00:50:34,366 --> 00:50:45,394
And we know-- so remember
in experiment one--
1016
00:50:45,394 --> 00:50:48,100
[INAUDIBLE] example
one, what we had
1017
00:50:48,100 --> 00:50:56,000
was x1 xn that were
[? iid, ?] Bernoulli p,
1018
00:50:56,000 --> 00:51:06,410
and the prior I put on p was
a beta with parameters a and a.
1019
00:51:06,410 --> 00:51:07,160
OK?
1020
00:51:07,160 --> 00:51:09,830
And if I go back to
what we computed,
1021
00:51:09,830 --> 00:51:12,740
you can actually compute
the posterior of this thing.
1022
00:51:12,740 --> 00:51:15,000
And we know that it's
actually going to be--
1023
00:51:15,000 --> 00:51:17,390
sorry, that was uniform?
1024
00:51:17,390 --> 00:51:18,620
Where is-- yeah.
1025
00:51:18,620 --> 00:51:31,170
So what we get is that
the posterior, this thing
1026
00:51:31,170 --> 00:51:36,630
is actually going to be
a beta with parameter
1027
00:51:36,630 --> 00:51:42,640
a plus the sum, so a
plus the number of 1s
1028
00:51:42,640 --> 00:51:44,770
and a plus the number of 0s.
1029
00:51:48,590 --> 00:51:49,870
OK?
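This conjugate update is literally just counting. A minimal sketch of the Beta(a, a)-prior update for Bernoulli data (the data below are made up):

```python
def beta_posterior(xs, a):
    # Bernoulli likelihood with a Beta(a, a) prior gives a Beta posterior
    # whose parameters are a plus the number of 1s and a plus the number of 0s.
    ones = sum(xs)
    zeros = len(xs) - ones
    return a + ones, a + zeros

print(beta_posterior([1, 0, 1, 1, 0], a=0.5))  # (3.5, 2.5)
```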
1030
00:51:49,870 --> 00:51:53,840
And the beta was just
something that looked like--
1031
00:51:56,480 --> 00:52:00,500
the density was p to the
a minus 1, times 1 minus p to the a minus 1.
1032
00:52:05,440 --> 00:52:05,940
OK?
1033
00:52:05,940 --> 00:52:11,130
So if I want to understand
the posterior mean,
1034
00:52:11,130 --> 00:52:13,950
I need to be able to compute
the expectation of a beta,
1035
00:52:13,950 --> 00:52:16,620
and then maybe plug
in a for a plus
1036
00:52:16,620 --> 00:52:17,980
this guy and minus this guy.
1037
00:52:17,980 --> 00:52:18,480
OK.
1038
00:52:18,480 --> 00:52:21,770
So actually, let me do this.
1039
00:52:21,770 --> 00:52:22,270
OK.
1040
00:52:22,270 --> 00:52:23,930
So what is the expectation?
1041
00:52:26,337 --> 00:52:27,920
So what I want is
something that looks
1042
00:52:27,920 --> 00:52:34,820
like the integral between 0
and 1 of p times p to the
1043
00:52:34,820 --> 00:52:42,320
a minus 1, times 1 minus
p to the b minus 1.
1044
00:52:42,320 --> 00:52:43,590
Do we agree that this--
1045
00:52:43,590 --> 00:52:46,290
and then there's a
normalizing constant.
1046
00:52:46,290 --> 00:52:49,270
Let's call it c.
1047
00:52:49,270 --> 00:52:49,770
OK?
1048
00:52:53,200 --> 00:52:56,330
So this is what I
need to compute.
1049
00:52:56,330 --> 00:52:57,640
So that's c of a and b.
1050
00:53:00,257 --> 00:53:01,840
Do we agree that
this is the posterior
1051
00:53:01,840 --> 00:53:08,651
mean with respect to a beta
with parameters a and b?
1052
00:53:08,651 --> 00:53:09,150
Right?
1053
00:53:09,150 --> 00:53:13,334
I just integrate p
against the density.
1054
00:53:13,334 --> 00:53:14,750
So what does this
thing look like?
1055
00:53:14,750 --> 00:53:18,550
Well, I can actually
move this guy in here.
1056
00:53:18,550 --> 00:53:23,402
And here, I'm going to
have a plus 1 minus 1.
1057
00:53:23,402 --> 00:53:26,366
OK?
1058
00:53:26,366 --> 00:53:29,360
So the problem is that
this thing is actually--
1059
00:53:29,360 --> 00:53:31,360
the constant is going to
play a big role, right?
1060
00:53:31,360 --> 00:53:33,100
Because this is
essentially equal
1061
00:53:33,100 --> 00:53:40,270
to c(a + 1, b)
divided by c(a, b), where
1062
00:53:40,270 --> 00:53:42,220
c(a + 1, b) is just
the normalizing
1063
00:53:42,220 --> 00:53:46,340
constant of a beta(a + 1, b).
1064
00:53:46,340 --> 00:53:48,729
So I need to know the ratio
of those two constants.
1065
00:53:58,320 --> 00:53:59,660
And this is not something--
1066
00:53:59,660 --> 00:54:01,680
I mean, this is just
a calculus exercise.
1067
00:54:01,680 --> 00:54:06,820
So in this case,
what you get is--
1068
00:54:06,820 --> 00:54:08,640
sorry.
1069
00:54:08,640 --> 00:54:09,750
In this case, you get--
1070
00:54:12,560 --> 00:54:34,940
well, OK, so we get
essentially a divided by,
1071
00:54:34,940 --> 00:54:37,990
I think, it's a plus b.
1072
00:54:37,990 --> 00:54:38,940
Yeah, it's a plus b.
1073
00:54:41,856 --> 00:54:43,314
So that's this quantity.
1074
00:54:47,188 --> 00:54:47,688
OK?
1075
00:54:51,100 --> 00:54:56,520
And when I plug in a to be this
guy and b to be this guy, what
1076
00:54:56,520 --> 00:55:02,520
I get is a plus sum of the xi.
1077
00:55:02,520 --> 00:55:06,240
And then I get a plus this
guy, a plus n minus this guy.
1078
00:55:06,240 --> 00:55:07,720
So those two guys
go away, and I'm
1079
00:55:07,720 --> 00:55:14,050
left with 2a plus n,
which does not work.
1080
00:55:14,050 --> 00:55:15,240
No, that actually works.
1081
00:55:15,240 --> 00:55:18,520
And so now what I do, I
can actually divide and get
1082
00:55:18,520 --> 00:55:19,850
this thing, over there.
1083
00:55:19,850 --> 00:55:20,350
OK.
1084
00:55:20,350 --> 00:55:23,380
So what you can see, the reason
why this thing has been divided
1085
00:55:23,380 --> 00:55:27,730
is that you can really see
that, as n goes to infinity,
1086
00:55:27,730 --> 00:55:30,120
then this thing behaves
like xn bar, which
1087
00:55:30,120 --> 00:55:31,650
is our frequentist estimator.
1088
00:55:31,650 --> 00:55:34,200
The effect of a is
actually going away.
1089
00:55:34,200 --> 00:55:37,530
The effect of the prior, which
is completely captured by a,
1090
00:55:37,530 --> 00:55:40,440
is going away as n
goes to infinity.
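The washing-out can be checked numerically with the formula just derived, (a plus the sum of the xi) over (2a plus n). A sketch where the prior parameter a = 5 and the true p = 0.3 are made up:

```python
import random

def posterior_mean(xs, a):
    # Posterior mean (a + sum x_i) / (2a + n) for a Beta(a, a) prior.
    return (a + sum(xs)) / (2 * a + len(xs))

random.seed(0)
p_true = 0.3
for n in (10, 100, 10_000):
    xs = [1 if random.random() < p_true else 0 for _ in range(n)]
    # The gap between the posterior mean and the frequentist xn bar
    # shrinks as n grows, regardless of the choice of a.
    print(n, round(posterior_mean(xs, a=5.0), 4), round(sum(xs) / n, 4))
```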
1091
00:55:40,440 --> 00:55:42,440
Is there any question?
1092
00:55:47,440 --> 00:55:48,850
You guys have a question.
1093
00:55:48,850 --> 00:55:50,202
What is it?
1094
00:55:50,202 --> 00:55:51,551
Do you have a question?
1095
00:55:51,551 --> 00:55:53,426
AUDIENCE: Yeah, on the
board, is that divided
1096
00:55:53,426 --> 00:55:56,259
by some [INAUDIBLE] stuff?
1097
00:55:56,259 --> 00:55:58,050
PHILIPPE RIGOLLET: Is
that divided by what?
1098
00:55:58,050 --> 00:56:00,555
AUDIENCE: That a over a plus
b, and then you just expanded--
1099
00:56:00,555 --> 00:56:01,930
PHILIPPE RIGOLLET:
Oh yeah, yeah,
1100
00:56:01,930 --> 00:56:05,220
then I said that this
is equal to this, right.
1101
00:56:05,220 --> 00:56:15,690
So that's for a becomes a plus
sum of the xi's, and b becomes
1102
00:56:15,690 --> 00:56:20,391
a plus n minus sum of the xi's.
1103
00:56:20,391 --> 00:56:20,890
OK.
1104
00:56:20,890 --> 00:56:22,508
So that's just for
the posterior one.
1105
00:56:22,508 --> 00:56:26,264
AUDIENCE: What's [INAUDIBLE]
1106
00:56:26,264 --> 00:56:27,430
PHILIPPE RIGOLLET: This guy?
1107
00:56:27,430 --> 00:56:28,070
AUDIENCE: Yeah.
1108
00:56:28,070 --> 00:56:28,740
PHILIPPE RIGOLLET: 2a.
1109
00:56:28,740 --> 00:56:29,281
AUDIENCE: 2a.
1110
00:56:29,281 --> 00:56:30,150
Oh, OK.
1111
00:56:30,150 --> 00:56:31,191
PHILIPPE RIGOLLET: Right.
1112
00:56:31,191 --> 00:56:34,885
So I get a plus a plus n.
1113
00:56:34,885 --> 00:56:37,960
And then those two guys cancel.
1114
00:56:37,960 --> 00:56:38,460
OK?
1115
00:56:38,460 --> 00:56:41,380
And that's what you have here.
1116
00:56:41,380 --> 00:56:44,920
So for a is equal to 1/2--
1117
00:56:44,920 --> 00:56:47,020
and I claim that this
is Jeffreys prior.
1118
00:56:47,020 --> 00:56:53,950
Because remember, Jeffreys was
[INAUDIBLE] was square root
1119
00:56:53,950 --> 00:56:56,100
and was proportional to
1 over the square root of p, 1 minus
1120
00:56:56,100 --> 00:57:01,050
p, which I can write as p to the
minus 1/2, 1 minus p to the minus 1/2.
1121
00:57:01,050 --> 00:57:03,501
So it's just the case
a is equal to 1/2.
1122
00:57:03,501 --> 00:57:04,000
OK.
1123
00:57:04,000 --> 00:57:07,660
So if I use Jeffreys prior, I
just plug in a equals to 1/2,
1124
00:57:07,660 --> 00:57:10,530
and this is what I get.
1125
00:57:10,530 --> 00:57:12,630
OK?
1126
00:57:12,630 --> 00:57:14,880
So those things are going
to have an impact again when
1127
00:57:14,880 --> 00:57:16,150
n is moderately large.
1128
00:57:16,150 --> 00:57:19,090
For large n, those things,
whether you take Jeffreys prior
1129
00:57:19,090 --> 00:57:20,710
or you take whatever
a you prefer,
1130
00:57:20,710 --> 00:57:23,130
it's going to have
no impact whatsoever.
1131
00:57:23,130 --> 00:57:26,894
But if n is of the
order of 10, say,
1132
00:57:26,894 --> 00:57:28,810
then you're going to
start to see some impact,
1133
00:57:28,810 --> 00:57:30,351
depending on what
a you want to pick.
1134
00:57:33,540 --> 00:57:34,040
OK.
1135
00:57:34,040 --> 00:57:38,390
And then in the second
example, well, here we actually
1136
00:57:38,390 --> 00:57:42,560
computed the posterior
to be this guy.
1137
00:57:42,560 --> 00:57:45,544
Well, here, I can just read off
what the expectation is, right?
1138
00:57:45,544 --> 00:57:47,210
I mean, I don't have
to actually compute
1139
00:57:47,210 --> 00:57:48,970
the expectation of a Gaussian.
1140
00:57:48,970 --> 00:57:50,650
It's just that xn bar.
1141
00:57:50,650 --> 00:57:52,660
And so in this case,
there's actually no--
1142
00:57:52,660 --> 00:57:57,190
I mean, when I have a
non-informative prior
1143
00:57:57,190 --> 00:58:01,750
for a Gaussian, then I
have basically xn bar.
1144
00:58:01,750 --> 00:58:04,390
As you can see, actually, this
is an interesting example.
1145
00:58:04,390 --> 00:58:06,490
When I actually look
at the posterior,
1146
00:58:06,490 --> 00:58:09,190
it's not something that cost
me a lot to communicate to you,
1147
00:58:09,190 --> 00:58:10,037
right?
1148
00:58:10,037 --> 00:58:12,370
There's one symbol here, one
symbol here, and one symbol
1149
00:58:12,370 --> 00:58:13,330
here.
1150
00:58:13,330 --> 00:58:17,950
I tell you the posterior is
a Gaussian with mean xn bar
1151
00:58:17,950 --> 00:58:19,660
and variance 1/n.
1152
00:58:19,660 --> 00:58:23,530
When I actually turn
that into a posterior mean,
1153
00:58:23,530 --> 00:58:26,264
I'm dropping all
this information.
1154
00:58:26,264 --> 00:58:27,930
I'm just giving you
the first parameter.
1155
00:58:27,930 --> 00:58:30,150
So you can see there's
actually much more information
1156
00:58:30,150 --> 00:58:35,100
in the posterior than there
is in the posterior mean.
1157
00:58:35,100 --> 00:58:37,210
The posterior mean
is just a point.
1158
00:58:37,210 --> 00:58:39,930
It's not telling me how
confident I am in this point.
1159
00:58:39,930 --> 00:58:41,950
And this thing is
actually very interesting.
1160
00:58:41,950 --> 00:58:42,450
OK.
1161
00:58:42,450 --> 00:58:44,283
So you can talk about
the posterior variance
1162
00:58:44,283 --> 00:58:45,880
that's associated to it, right?
1163
00:58:45,880 --> 00:58:47,516
You can talk about,
as an output,
1164
00:58:47,516 --> 00:58:49,890
you could give the posterior
mean and posterior variance.
1165
00:58:49,890 --> 00:58:53,311
And those things are
actually interesting.
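Reporting the posterior mean together with the posterior variance is especially cheap in the Gaussian example above. A sketch assuming the improper flat prior and known unit variance, with a made-up true theta = 2:

```python
import random

def gaussian_posterior(xs):
    # With a flat (improper) prior on theta and X_i ~ N(theta, 1),
    # the posterior is N(xn bar, 1/n): two numbers summarize it fully.
    n = len(xs)
    return sum(xs) / n, 1.0 / n

random.seed(1)
xs = [random.gauss(2.0, 1.0) for _ in range(400)]
mean, var = gaussian_posterior(xs)
print(round(mean, 2), var)  # mean near 2, posterior variance exactly 1/400
```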
1166
00:58:53,311 --> 00:58:53,810
All right.
1167
00:58:53,810 --> 00:58:56,370
So I think this is it.
1168
00:58:56,370 --> 00:59:05,360
So as I said, in general,
just like in this case,
1169
00:59:05,360 --> 00:59:07,980
the impact of the prior
is being washed away
1170
00:59:07,980 --> 00:59:10,310
as the sample size
goes to infinity.
1171
00:59:10,310 --> 00:59:12,860
Just well, like here, there's
no impact of the prior.
1172
00:59:12,860 --> 00:59:14,500
It was a non-informative one.
1173
00:59:14,500 --> 00:59:17,780
But if you actually had an
informative one, [? CF ?]
1174
00:59:17,780 --> 00:59:18,683
homework-- yeah?
1175
00:59:18,683 --> 00:59:19,650
AUDIENCE: [INAUDIBLE]
1176
00:59:19,650 --> 00:59:21,150
PHILIPPE RIGOLLET: Yeah,
so [? CF ?] homework,
1177
00:59:21,150 --> 00:59:23,358
you would actually see an
impact of the prior, which,
1178
00:59:23,358 --> 00:59:25,890
again, would be washed away
as your sample size increases.
1179
00:59:25,890 --> 00:59:26,820
Here, it goes away.
1180
00:59:26,820 --> 00:59:29,610
You just get xn bar.
1181
00:59:29,610 --> 00:59:31,830
And actually, in
these cases, you
1182
00:59:31,830 --> 00:59:35,580
see that the posterior
distribution converges
1183
00:59:35,580 --> 00:59:37,560
to-- sorry, the
Bayesian estimator
1184
00:59:37,560 --> 00:59:39,510
is asymptotically normal.
1185
00:59:39,510 --> 00:59:43,471
This is different from the
distribution of the posterior,
1186
00:59:43,471 --> 00:59:43,970
right?
1187
00:59:43,970 --> 00:59:45,886
This is just the posterior
mean, which happens
1188
00:59:45,886 --> 00:59:47,480
to be asymptotically normal.
1189
00:59:47,480 --> 00:59:49,595
But the posterior
may not have a--
1190
00:59:49,595 --> 00:59:53,000
I mean, here, the
posterior is a beta, right?
1191
00:59:53,000 --> 00:59:55,020
I mean, it's not normal.
1192
00:59:55,020 --> 00:59:57,210
OK, so there's
different-- those things
1193
00:59:57,210 --> 00:59:59,556
are two different things.
1194
00:59:59,556 --> 01:00:01,548
Your question?
1195
01:00:01,548 --> 01:00:04,487
AUDIENCE: What was
the prior [INAUDIBLE]
1196
01:00:04,487 --> 01:00:05,820
PHILIPPE RIGOLLET: All 1, right?
1197
01:00:05,820 --> 01:00:06,986
That was the improper prior.
1198
01:00:06,986 --> 01:00:08,896
AUDIENCE: OK.
1199
01:00:08,896 --> 01:00:12,563
And so that would give you the
same thing as [INAUDIBLE], not
1200
01:00:12,563 --> 01:00:13,790
just the proportion.
1201
01:00:13,790 --> 01:00:15,373
PHILIPPE RIGOLLET:
Well, I mean, yeah.
1202
01:00:15,373 --> 01:00:17,600
So it's essentially
telling you that--
1203
01:00:17,600 --> 01:00:23,390
so we said that, when you
have a non-informative prior,
1204
01:00:23,390 --> 01:00:25,760
essentially, the maximum
likelihood is the maximum
1205
01:00:25,760 --> 01:00:26,879
a posteriori, right?
1206
01:00:26,879 --> 01:00:28,670
But in this case,
there's so much symmetry,
1207
01:00:28,670 --> 01:00:30,560
that it just so happens that
the posterior
1208
01:00:30,560 --> 01:00:32,370
is completely symmetric
around its maximum.
1209
01:00:32,370 --> 01:00:34,809
So it means that the expectation
is equal to the maximum,
1210
01:00:34,809 --> 01:00:35,600
to [INAUDIBLE] max.
1211
01:00:40,957 --> 01:00:41,931
Yeah?
1212
01:00:41,931 --> 01:00:43,392
AUDIENCE: I read
somewhere that one
1213
01:00:43,392 --> 01:00:45,340
of the issues with
Bayesian methods
1214
01:00:45,340 --> 01:00:46,801
is that we choose
the wrong prior,
1215
01:00:46,801 --> 01:00:49,723
and it could mess
up your results.
1216
01:00:49,723 --> 01:00:51,370
PHILIPPE RIGOLLET:
Yeah, but hence,
1217
01:00:51,370 --> 01:00:53,980
do not pick the wrong prior.
1218
01:00:53,980 --> 01:00:55,244
I mean, of course, it would.
1219
01:00:55,244 --> 01:00:57,160
I mean, it would mess
up your res-- of course.
1220
01:00:57,160 --> 01:00:58,810
I mean, you're putting
extra information.
1221
01:00:58,810 --> 01:01:00,601
But you could say the
same thing by saying,
1222
01:01:00,601 --> 01:01:03,670
well, the issue with
frequentist method
1223
01:01:03,670 --> 01:01:06,730
is that, if you mess up the
choice of your likelihood,
1224
01:01:06,730 --> 01:01:09,424
then it's going to
mess up your output.
1225
01:01:09,424 --> 01:01:11,590
So here, you just have two
chances of messing it up,
1226
01:01:11,590 --> 01:01:12,250
right?
1227
01:01:12,250 --> 01:01:14,440
You have the-- well, it's gone.
1228
01:01:14,440 --> 01:01:17,920
So you have the product of
the likelihood and the prior,
1229
01:01:17,920 --> 01:01:20,350
and you have one
more chance to--
1230
01:01:20,350 --> 01:01:22,420
but it's true, if you
assume that the model is
1231
01:01:22,420 --> 01:01:25,960
right, then, of course,
finding the wrong prior could
1232
01:01:25,960 --> 01:01:28,520
completely mess up things
if your prior, for example,
1233
01:01:28,520 --> 01:01:30,780
has no support on
the true parameter.
1234
01:01:30,780 --> 01:01:34,715
But if your prior has a positive
weight on the true parameter
1235
01:01:34,715 --> 01:01:38,140
as n goes to infinity--
1236
01:01:38,140 --> 01:01:40,640
I mean, OK, I cannot speak
for all counterexamples
1237
01:01:40,640 --> 01:01:41,480
in the world.
1238
01:01:41,480 --> 01:01:44,450
But I'm sure, under minor
technical conditions,
1239
01:01:44,450 --> 01:01:46,550
you can guarantee
that your posterior
1240
01:01:46,550 --> 01:01:48,530
mean is going to
converge to what
1241
01:01:48,530 --> 01:01:49,742
you need it to converge to.
1242
01:01:53,678 --> 01:01:54,662
Any other question?
1243
01:01:57,881 --> 01:01:58,380
All right.
1244
01:01:58,380 --> 01:02:07,650
So I think this closes the more
traditional mathematical-- not
1245
01:02:07,650 --> 01:02:11,490
mathematical, but traditional
statistics part of this class.
1246
01:02:11,490 --> 01:02:14,310
And from here on, we'll
talk about more multivariate
1247
01:02:14,310 --> 01:02:17,740
statistics, starting with
principal component analysis.
1248
01:02:17,740 --> 01:02:19,800
So that's more like when
you have multiple data.
1249
01:02:19,800 --> 01:02:22,650
We started, in a way, to talk
about multivariate statistics
1250
01:02:22,650 --> 01:02:25,320
when we talked about
multivariate regression.
1251
01:02:25,320 --> 01:02:28,180
But we'll move on to
principal component analysis.
1252
01:02:28,180 --> 01:02:30,690
I'll talk a bit about
multiple testing.
1253
01:02:30,690 --> 01:02:32,400
I haven't made up my
mind yet about what
1254
01:02:32,400 --> 01:02:34,350
we'll really talk about in December.
1255
01:02:34,350 --> 01:02:36,480
But I want to make
sure that you have
1256
01:02:36,480 --> 01:02:41,310
a taste and a flavor of what is
interesting in statistics
1257
01:02:41,310 --> 01:02:44,341
these days, especially as you
go towards more [INAUDIBLE]
1258
01:02:44,341 --> 01:02:46,590
learning type of questions,
where really, the focus is
1259
01:02:46,590 --> 01:02:48,619
on prediction rather
than the modeling itself.
1260
01:02:48,619 --> 01:02:50,160
We'll talk about
logistic regression,
1261
01:02:50,160 --> 01:02:52,800
as well, for example,
which is generalized
1262
01:02:52,800 --> 01:02:55,470
linear models, which is just
the generalization in the case
1263
01:02:55,470 --> 01:03:00,480
that y does not take values in
the whole real line, maybe 0,1,
1264
01:03:00,480 --> 01:03:03,360
for example, for regression.
1265
01:03:03,360 --> 01:03:03,960
All right.
1266
01:03:03,960 --> 01:03:05,510
Thanks.