The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So today we'll actually just do a brief chapter on Bayesian statistics. There are entire courses on Bayesian statistics, there are entire books on Bayesian statistics, there are entire careers in Bayesian statistics. So admittedly, I'm not going to be able to do it justice and tell you all the interesting things that are happening in Bayesian statistics. But I think it's important as a statistician to know what it is and how it works, because it's actually a weapon of choice for many practitioners, and because it allows them to incorporate their knowledge about a problem in a fairly systematic manner.

So if you look at, say, the Bayesian statistics literature, it's huge. And so here I give you sort of a range of what you can expect to see in Bayesian statistics, from your second edition of a traditional book, to something that involves computation, to some things that involve risk thinking. And there's a lot of Bayesian thinking. There are a lot of things talking about, sort of, the philosophy of thinking Bayesian. This book, for example, seems to be one of them. This book is definitely one of them. This one represents a broad literature on Bayesian statistics for applications, for example, in the social sciences. But even in large-scale machine learning, there's a lot of Bayesian statistics happening, in particular using something called Bayesian nonparametrics, or hierarchical Bayesian modeling. So we do have some experts at MIT in CSAIL. Tamara Broderick, for example, is a person who does quite a bit of interesting work on Bayesian nonparametrics.
And if that's something you want to know more about, I urge you to go and talk to her.

So before we go into more advanced things, we need to start with what the Bayesian approach is. What do Bayesians do, and how is it different from what we've been doing so far? To understand the difference between Bayesians and what we've been doing so far, we first need to put a name on what we've been doing so far. It's called frequentist statistics. People usually say Bayesian versus frequentist statistics, but by "versus" I don't mean that they are naturally in opposition. Actually, you will often see the same method come out of both approaches.

So let's see how we did it. The first thing: we had data. We observed some data, and we assumed that this data was generated randomly. The reason we did that is because it allows us to leverage tools from probability. So, say by nature, measurements, you do a survey, you get some data. Then we made some assumptions on the data generating process. For example, we assumed the observations were iid. That was one of the recurring things. Sometimes we assumed they were Gaussian, if we wanted to use, say, a t-test. Maybe we did some nonparametric statistics, where we assumed a smooth function, or maybe a linear regression function. So that was our modeling. And this was basically a way to say, well, we're not going to allow arbitrary distributions for the data that we have, but maybe a small set of distributions indexed by some small set of parameters, for example, or at least to remove some of the possibilities. Otherwise, there's nothing we can learn. And so, for example, this was associated to some parameter of interest, say theta, or beta in the regression model. Then we had this unknown parameter, and we wanted to find it. We wanted to either estimate it or test it, or maybe find a confidence interval for it.
So far I should not have said anything that's new. But this last sentence is actually what's going to be different in the Bayesian approach. In particular, this "unknown but fixed" thing is what's going to change.

In the Bayesian approach, we still assume that we observe some random data, but the generating process is slightly different. It's sort of a two-layer process. There's one process that generates the parameter, and then one process that, given this parameter, generates the data. So as for what the first layer does: nobody really believes that there's some random process happening that generates what is going to be the true expected number of people who turn their head to the right when they kiss. But this is actually going to be something that makes it easy for us to incorporate what we call prior belief. We'll see an example in a second. But often, you actually have a prior belief about what this parameter should be. When we did, say, least squares, we looked over all of the vectors in all of R^p, including the ones that have coefficients equal to 50 million. Those are things that we might be able to rule out. We might be able to rule things out on a much smaller scale. For example, I'm not an expert on turning your head to the right or to the left, but maybe you can rule out that almost everybody is turning their head in the same direction, or that almost everybody is turning their head in the other direction.

So we have this prior belief. And this belief is going to play, hopefully, a less and less important role as we collect more and more data. But if we have a smaller amount of data, we might want to be able to use this information rather than just shooting in the dark. And so the idea is to have this prior belief, and then we want to update this prior belief into what's called the posterior belief after we've seen some data.
Maybe I believe that something should be in some range, but maybe after I see data, it's comforting me in my beliefs, so my belief becomes stronger. So belief encompasses basically what you think and how strongly you think it. That's what I call belief. For example, if I have a belief about some parameter theta, maybe my belief is telling me where theta should be and how strongly I believe in it, in the sense that I have a very narrow region where theta could be. The posterior belief is, well, you see some data, and maybe you're more confident or less confident about what you've seen. Maybe you've shifted your belief a little bit. And so that's what we're going to try to see: how to do this in a principled manner.

To understand this better, there's nothing better than an example. So let's talk about another stupid statistical question, which is: let's try to understand p. Of course, I'm not going to talk about politics from now on. So let's talk about p, the proportion of women in the population.

And so what I could do is collect some data X1, ..., Xn and assume that they are Bernoulli with some parameter p unknown, so p is in [0, 1]. OK, let's assume that those guys are iid. So this is just an indicator, for each of my collected data points, of whether the person I randomly sample is a woman: if so, I get a one; if it's a man, I get a zero.

Now, I sample these people randomly, and I do know their gender. The frequentist approach was just saying, OK, let's just estimate p hat as Xn bar. And then we could do some tests. So here, there's a test: I want to test maybe whether p is equal to 0.5 or not. That sounds like a pretty reasonable thing to test. But we might also want to estimate p. And here, this is a case where we definitely have a prior belief about what p should be. We are pretty confident that p is not going to be 0.7.
We actually believe that it should be extremely close to one half, but maybe not exactly. Maybe this population is not the population of the world. Maybe this is the population of, say, some college, and we want to understand whether this college is half women or not. Maybe we know it's going to be close to one half, but we're not quite sure.

We're going to want to integrate that knowledge. So I could integrate it in a blunt manner by saying: discard the data and declare that p is equal to one half. But maybe that's just a little too much. So how do I trade off between using the data and combining it with this prior knowledge? In many instances, essentially what's going to happen is that this one half is going to act like one new observation. So if you have five observations, this is just a sixth observation, which will play a role. If you have a million observations, you're going to have a million and one; it's not going to play so much of a role. That's basically how it goes.

But definitely not always, because we'll see that if I take my prior to be a point mass at one half, it's basically as if I were discarding my data. So essentially, there's also your ability to encode how strongly you believe in this prior. And if you believe infinitely more in the prior than you believe in the data you collected, then it's not going to act like just one more observation.

The Bayesian approach is a tool to, one, mathematically include our prior belief into statistical procedures. Maybe I have this prior knowledge, but if I'm a medical doctor, it's not clear to me how I'm going to turn this into some principled way of building estimators. And the second goal is going to be to update this prior belief into a posterior belief by using the data.

How do I do this? At some point, I sort of suggested that there are two layers. One is where you draw the parameter at random.
And two, once you have the parameter, conditionally on this parameter, you draw your data. Nobody believes this is actually happening, that nature is just rolling dice for us and choosing parameters at random. But this idea that the parameter comes from some distribution actually captures very well how you would encode your prior. How would you say, "my belief is as follows"? Well, here's an example about p. I'm 90% sure that p is between 0.4 and 0.6. And I'm 95% sure that p is between 0.3 and 0.8. So essentially, I have these possible values of p, and what I know is that there's 90% here, between 0.4 and 0.6. And then I have 0.3 and 0.8, and I know that I'm 95% sure that I'm in there.

If you remember, this sort of looks like the kind of pictures I drew when I had some Gaussian, for example, and I said, oh, here we have 90% of the observations, and here we have 95% of the observations. So in a way, if I were able to tell you all those ranges for all possible values, then I would essentially describe a probability distribution for p. And what I'm saying is that p is going to have this kind of shape. Of course, if I only tell you these two pieces of information, that with 90% I'm between here and here, and with 95% I'm between here and here, then there are many ways I can accomplish that. I could have something that looks like this, maybe. It could be like this. There are many ways I can have this. Some of them are definitely going to be mathematically more convenient than others. And hopefully, we're going to have things that I can parameterize very well. Because if I tell you it's this guy, then there are basically one, two, three, four, five, six, seven parameters. So I probably don't want something that has seven parameters. But maybe I can say, oh, it's a Gaussian, and all I have to do is tell you where it's centered and what the standard deviation is.
So the idea of using this two-layer thing, where we think of the parameter p as being drawn from some distribution, is really just a way for us to capture this information, our prior belief being, well, there's this percentage of chances that it's there. And "percentage of chances": I'm deliberately not using the word probability here. So it's really a way to get close to this. That's why I say the true parameter is not random, but the Bayesian approach acts as if it were random, and then just spits out a procedure out of this thought process, this thought experiment.

So when you practice Bayesian statistics a lot, you start developing automatisms, things that you do without really thinking about them. Just like when you're a statistician, the first thing you do is ask, can I think of this data as being Gaussian, for example? When you're Bayesian, you're thinking, OK, I have a set of parameters. So here, I can describe my parameter as being theta in general, living in some big parameter space, capital Theta. But what spaces did we encounter? Well, we encountered the real line. We encountered the interval [0, 1] for Bernoullis. And we encountered the positive real line for exponential distributions, et cetera. And so, if I want to put a prior on those spaces, I'm going to have to have a usual set of tools for this guy, a usual set of tools for this guy, a usual set of tools for this guy. And by a usual set of tools, I mean a family of distributions that's supported on that space.

So in particular, this is the space in which my parameter, which I usually denote by p for Bernoulli, lives. And so what I need is to find a distribution on the interval [0, 1], just like this guy. The problem with the Gaussian is that it's not on the interval [0, 1]. It's going to spill out at the ends, and it's not going to be something that works for me.
And so I need to think about distributions that are preferably continuous (why would I restrict myself to discrete distributions?) and that are actually convenient. And for Bernoulli, the one that's basically the main tool everybody uses is the so-called beta distribution.

So the beta distribution has two parameters. X follows a beta with parameters a and b if it has a density f(x) equal to x^(a-1) (1 - x)^(b-1) if x is in the interval [0, 1], and 0 for all other x's. OK?

Why is that a good thing? Well, it's a density that's on the interval [0, 1], for sure. But now I have these two parameters, and the set of shapes that I can get by tweaking those two parameters is incredible. It's going to be a unimodal distribution; it's still fairly nice. It's not going to be something that goes like this and this. Because, if you think about it, what would it mean if your prior distribution on the interval [0, 1] had this shape? It would mean that maybe you think p is here, or maybe you think p is here, or maybe you think p is here, which essentially means that you think p can come from three different phenomena. And there are other models, called mixtures, that directly account for the fact that maybe there are several phenomena aggregated in your data set. But if you think that your data set is sort of pure, and that everything comes from the same phenomenon, you want something that looks like this, or maybe looks like this, or maybe is sort of symmetric. You want to be able to get all this stuff. Maybe you want something that says, well, if I'm talking about p being the proportion of women in the whole world, you want something that's really spiked around one half, almost a point mass, because, you know, let's agree that 0.5 is the actual number. So you want something that says, OK, maybe I'm wrong, but I'm sure I'm not going to be really that far off.
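As a rough numerical illustration of how the two parameters control these shapes, here is a minimal sketch; it assumes numpy and scipy are available, and the (a, b) pairs are arbitrary choices, not anything from the lecture.

```python
# How the beta parameters control the shape of a prior on [0, 1].
# The (a, b) pairs below are illustrative only.
import numpy as np
from scipy.stats import beta

grid = np.linspace(0.0, 1.0, 5)  # a coarse grid: 0, 0.25, 0.5, 0.75, 1
for a, b in [(1, 1), (2, 2), (10, 10), (2, 5)]:
    pdf_on_grid = beta.pdf(grid, a, b)                    # normalized density values
    mass_center = beta.cdf(0.6, a, b) - beta.cdf(0.4, a, b)  # prior mass on [0.4, 0.6]
    print(f"Beta({a},{b}): density on grid {np.round(pdf_on_grid, 2)}, "
          f"P(0.4 <= p <= 0.6) = {mass_center:.2f}")
```

Larger equal parameters concentrate the prior more and more tightly around one half, which is exactly the "spiked around one half" behavior being described.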
So you want something that's really pointy. But if it's something you've never checked, and again I cannot make references at this point, but something where you might have some uncertainty that it should be around one half, maybe you want something that allows you to say, well, I think it's more likely to be around one half, but there are still some fluctuations that are possible.

And in particular here, I talk about p where the two parameters a and b are actually the same; I call them both a. One is called scale, the other one is called shape. Oh, sorry, this is not a density as written; it actually has to be normalized. When you integrate this thing, you get some quantity that depends on a and b, namely the beta function, which is a combination of gamma functions, and that's why it's called the beta distribution. That's the definition of the beta function: what you get when you integrate this thing. You just have to normalize by it; it's just a number that depends on a and b.

So here, if you take a equal to b, you have something that is essentially symmetric around one half. Because what does it look like? Well, my density f(x) is going to be what? It's going to be my constant times x times (1 - x), all to the power a minus 1. And this function, x times (1 - x), looks like this. We've drawn it before; it showed up as the variance of my Bernoulli. So we know it's something that takes its maximum at one half. And now I'm just taking a power of this guy, so I'm really just distorting this thing in some fairly symmetric manner.

So this is the distribution that we actually take for p. Notice that this is kind of weird. First of all, this is probably the first time in this entire course that something has a distribution when it's actually a lowercase letter.
That's something you have to deal with, because we've been using lowercase letters for parameters, and now we want them to have a distribution. So that's what's going to happen. This is called the prior distribution. So really, I should write something like: f of p is equal to a constant times p^(a-1) (1 - p)^(a-1). Well no, actually I should not, because then it's confusing. One piece of notation that I'm going to use, when I have a constant here and I don't want to make it explicit (and we'll see in a second why I don't need to make it explicit): I'm going to write f(x) is proportional to x^(a-1) (1 - x)^(a-1). That's just to say, equal to some constant that does not depend on x, times this thing.

So if we continue with our experiment, where I'm drawing this data X1, ..., Xn, which is Bernoulli(p): if p has some distribution, it's not clear what it means to have a Bernoulli with some random parameter. So what I'm going to do is first draw my p. Let's say I get a number, 0.52. And then I'm going to draw my data conditionally on p. So here comes the first and last flowchart of this class. Nature first draws p; p follows some Beta(a, a). Then I condition on p, and then I draw X1, ..., Xn that are iid Bernoulli(p). Does everybody understand the process of generating this data? You first draw a parameter, and then you just flip those independent biased coins with this particular p. There's this layered thing.

Now, conditionally on p: so here I have this prior about p, which was this thing. So this is just the thought process again; it's not anything that actually happens in practice. This is my way of thinking about how the data was generated. And from this, I'm going to try to come up with some procedure. Just like, if your estimator is the average of the data, you don't have to understand probability to say that my estimator is the average of the data.
Anyone outside this room understands that the average is a good estimator for some average behavior, and they don't need to think of the data as being a random variable, et cetera. So, same thing, basically.

In this case, you can see that the posterior distribution is still a beta. What it means is that I had this thing, then I observed my data, and then I continue, and here I'm going to update my prior into some posterior distribution, pi. And here, this guy is actually also a beta. My posterior distribution for p is also a beta distribution, with the parameters that are on this slide, and I'll have the space to reproduce them. So I start at the beginning of this flowchart with a prior on p, I'm going to get some observations, and then I'm going to update what my posterior is.

What is beautiful in Bayesian statistics is that, as soon as you have this posterior distribution, it's essentially capturing all the information about the data that you want for p. And it's not just a point; it's not just an average. It's actually an entire distribution for the possible values of theta. And it's not the same thing as saying, well, if theta hat is equal to Xn bar, in the Gaussian case I know that this has some mean mu and then maybe it has variance sigma squared over n. That's not what I mean by "this is my posterior distribution." This is not what I mean. That would come from this guy, the Gaussian thing and the central limit theorem. What I mean is this guy, and this came exclusively from the prior distribution. If I had another prior, I would not necessarily get a beta distribution as the output.

So when I have the same family of distributions at the beginning and at the end of this flowchart, I say that the beta is a conjugate prior, meaning I put in a beta as a prior and I get a beta as a posterior. And that's why betas are so popular.
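The parameters referred to on the slide are the ones spelled out just below (the first parameter picks up the number of ones, the second the number of zeros). As a minimal sketch of the whole two-layer story, assuming numpy is available and with arbitrary choices of a, n, and the seed:

```python
# Two-layer generative story for the Bernoulli/beta example, plus the conjugate update:
# nature draws p ~ Beta(a, a), then X_1, ..., X_n are iid Bernoulli(p),
# and the posterior is again a beta with updated parameters.
import numpy as np

rng = np.random.default_rng(0)
a, n = 3.0, 50

p = rng.beta(a, a)                    # layer 1: draw the parameter from the prior
x = rng.binomial(1, p, size=n)        # layer 2: draw the data given p

post_a = a + x.sum()                  # first parameter: a + number of ones
post_b = a + n - x.sum()              # second parameter: a + number of zeros
post_mean = post_a / (post_a + post_b)

print(f"p drawn by 'nature': {p:.3f}")
print(f"posterior is Beta({post_a:.0f}, {post_b:.0f}), posterior mean {post_mean:.3f}")
```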
Conjugate priors are really nice, because you know that whatever you put in, what you're going to get in the end is a beta. So all you have to think about is the parameters. You don't have to work out again what the posterior is going to look like, what its PDF is going to be. You just have to check what the parameters are. And there are families of conjugate priors: Gaussian gives Gaussian, for example. There's a bunch of them. And this is what drives people to use specific priors as opposed to others. It has nice mathematical properties. Nobody believes that p is really distributed according to a beta, but it's flexible enough and super convenient mathematically.

Now, let's see for one second before we actually go any further. I didn't mention it, but a and b here are both positive numbers; they can be anything positive. So here, what I did is that I updated a into a plus the sum of my data, and b into b plus n minus the sum of my data. So essentially, a becomes a plus the number of ones. Well, that's when I have a and a: the first parameter becomes itself plus the number of ones, and the second one becomes itself plus the number of zeros.

And just as a sanity check, what does this mean? If a goes to zero, what is the beta when a goes to 0? We can actually read this from here. Actually, let's take a goes to... no. Sorry, let's just skip this. I'll do it when we talk about non-informative priors, because it's a little too messy.

How do we do this? How did I get this posterior distribution, given the prior? How do I update it? Well, this is where Bayes comes into Bayesian statistics. You've heard this word, Bayes, before, and the way you've heard it is in the Bayes formula. What was the Bayes formula? The Bayes formula was telling you that the probability of A given B was equal to something that depended on the probability of B given A. That's what it was.
You can actually either remember the formula or you can remember the definition. And this is: the probability of A and B, divided by the probability of B. So this is P(B given A) times P(A), divided by P(B). That's what Bayes' formula is telling you. Agreed?

So now what I want is something that tells me how this is going to work. What is going to play the role of those events, A and B? Well, one of them is going to be the distribution of my parameter theta given that I see the data. And the other is going to tell me what the distribution of the data is, given that I know what my parameter theta is. But that part, if this is the data and this is the parameter theta, is what we've been doing all along. The distribution of the data given the parameter here was n iid Bernoulli(p); I knew exactly what their joint probability mass function was.

So we said that this is going to be my data and this is going to be my parameter. So that means that this is the probability of my data given the parameter, and this is the probability of the parameter. What is this? What did we call this? This is the prior; it's just the distribution of my parameter. Now what is this? Well, this is just the distribution of the data itself. This is essentially the distribution of this data if it were not conditioned on p. If I don't condition on p, this data is a bunch of iid Bernoullis with some parameter, but the parameter is random. So for different realizations of this data set, I'm going to get different parameters for the Bernoulli. And so that leads to some sort of convolution. It's not really a convolution in this case, but it's some sort of composition of distributions: I have the randomness that comes from here, and then the randomness that comes from realizing the Bernoulli. That's just the marginal distribution.
It actually might be painful to figure out what this marginal distribution is. In a way, it's sort of a mixture, and it's not super nice. But we'll see that this actually won't matter for us. It's going to be some number; it's going to be there. But it won't matter to us what it is, because it does not depend on the parameter, and that's all that matters to us.

Let's put some names on those things; this was very informal. So let's put some actual names on what we call the prior. What is the formal definition of a prior, what is the formal definition of a posterior, and what are the rules to update them?

So I'm going to have my data, which is going to be X1, ..., Xn. Let's say they are iid, but they don't actually have to be. And so I'm going to have, given theta. When I say "given," it's either "given" like I did in the first part of this course, in all previous chapters, or "conditionally on." If you're thinking like a Bayesian, what I really mean is conditionally on this random parameter; it's as if it were a fixed number. They're going to have a distribution: X1, ..., Xn is going to have some distribution. Let's assume for now it's a PDF, p_n of X1, ..., Xn, and I'm going to write "given theta" like this, with a bar. So for example, what is this? Let's say this is a PDF. It could be a PMF. Everything I say, I'm going to think of them as being PDFs. I'm going to combine PDFs with PDFs, but I could combine a PDF with a PMF, a PMF with a PDF, or a PMF with a PMF. So everywhere you see a D, it could be an M.

Now I have those things. So what does it mean? Here is an example. X1, ..., Xn are iid N(theta, 1). Now I know exactly what the joint PDF of this thing is. It means that p_n of X1, ..., Xn given theta is equal to what? Well, it's (1 over the square root of 2 pi) to the power n, times e to the minus the sum from i equal 1 to n of (xi minus theta) squared over 2. So that's just the joint distribution of n iid N(theta, 1) random variables. That's my p_n given theta. Now, this is what we denoted by f sub theta before.
We had the subscript before, but now we just put a bar before theta, because we want to remember that this is actually conditioned on theta. But this is just notation. You should just think of this as being the usual thing that you get from some statistical model. So that's going to be p_n.

Theta has a prior distribution, pi. So think of it as either a PDF or a PMF again. For example, pi of theta was what? Well, it was some constant times theta^(a-1) (1 - theta)^(a-1). So it has some prior distribution, and that's another PDF.

So now I'm given the distribution of my X's given theta, and I'm given the distribution of my theta. I'm given this guy, that's this guy, and I'm given that guy, which is my pi. So that's my p_n of X1, ..., Xn given theta, and that's my pi of theta. And this last one, well, this is just the integral of p_n of X1, ..., Xn given theta, times pi of theta, d theta, over all possible values of theta. That's just integrating out my theta: I compute the marginal distribution by integrating. That's just basic probability, conditional probabilities. And if I had a PMF, I would just sum over the values of theta.

Now, what I want to find (that was the prior distribution) is the posterior distribution. It's pi of theta given X1, ..., Xn. If I use Bayes' rule, I know that this is p_n of X1, ..., Xn given theta, times pi of theta, and then divided by the distribution of those guys, which I will write as the integral over theta of p_n of X1, ..., Xn given theta, times pi of theta, d theta.

Is everybody with me, still? If you're not comfortable with this, it means that you probably need to go reread a couple of pages on conditional densities and conditional PMFs from your probability class. There's really not much there. It's just a matter of being able to define those quantities, f, the density of x given y. This is just what's called a conditional density.
You need to understand what this object is and how it relates to the joint distribution of x and y, or maybe the distribution of x, or the distribution of y. But it's the same rules. One way to remember this is that it's exactly the same rule as for events: when you see a bar, it's the same thing as the probability of this and this guy, so for densities it's just a comma, divided by the probability of the second guy. That's it. So if you remember this, you can just do some pattern matching and see what I just wrote here.

Now, I can compute every single one of these pieces. This is something I get from my modeling. So I did not write this; it's not written in the slides. But I gave a name to this guy: that was my prior distribution. And that was my posterior distribution. In chapter three, maybe, what did we call this guy, the one that does not have a name and that's in the box? What did we call it?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: It is the joint distribution of the Xi's. And we gave it a name.

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: It's the likelihood, right? This is exactly the likelihood. This was the likelihood of theta.

And this is something that's very important to remember, and that really reminds you that these things, maximum likelihood estimation and Bayesian estimation, are really not that different, because your posterior is really just your likelihood times something that puts some weights on the thetas, depending on where you think theta should be. If I had, say, a maximum likelihood estimate, and my likelihood in theta looked like this, but my prior on theta looked like this, and I said, oh, I really want thetas that are like this, then what's going to happen is that I'm going to turn this into some posterior that looks like this. So I'm really just reweighting. As for this posterior, this term here is a constant that does not depend on theta, right? Agreed? I integrated over theta, so theta is gone.
So forget about this guy. I have, basically, that the posterior distribution, up to scaling (because it has to be a probability density and not just any function that's positive), is the product of these two. It's a weighted version of my likelihood; that's all it is. I'm just weighting the likelihood using my prior belief on theta.

And so, given this, a natural estimator, if you follow the maximum likelihood principle, would be the maximum of this posterior. Agreed? That would basically be doing exactly what maximum likelihood estimation is telling you. So it turns out that you can; it's called Maximum A Posteriori, or MAP, and I won't talk much about this. That's Maximum A Posteriori: theta hat is the arg max of pi of theta given X1, ..., Xn.

And it sounds like it's OK. I give you a density, and you say, OK, I have a density over all values of my parameter, and you're asking me to summarize it into one number; I'm just going to take the most likely value. But you could summarize it otherwise. You could take the average. You could take the median. You could take a bunch of numbers. And the beauty of Bayesian statistics is that you don't have to take any particular number. You have an entire posterior distribution. This is not only telling you where theta is; it's actually telling you more, if what you report is the posterior itself.

Now, let's say theta is p, between 0 and 1. If my posterior distribution looks like this, or my posterior distribution looks like this, then those two guys have, one, the same mode (this is the same value), and they're symmetric, so they also have the same mean. So these two posterior distributions give me the same summary in one number. However, clearly one is much more confident than the other one. So I might as well just report the whole posterior as my solution.
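A minimal sketch of this last point, assuming scipy is available; the two posteriors below are made-up beta distributions chosen so that they share the same mode and mean but have very different spread.

```python
# Two beta posteriors with the same mode and mean but very different spread,
# illustrating why a single summary number loses information.
from scipy.stats import beta

for post_a, post_b in [(3, 3), (300, 300)]:
    mode = (post_a - 1) / (post_a + post_b - 2)   # MAP of a Beta(a, b), valid when a, b > 1
    mean = post_a / (post_a + post_b)             # posterior mean
    lo, hi = beta.interval(0.95, post_a, post_b)  # central 95% region under the posterior
    print(f"Beta({post_a},{post_b}): mode {mode:.2f}, mean {mean:.2f}, "
          f"95% of the mass in [{lo:.2f}, {hi:.2f}]")
```

Both report 0.5 as the mode and the mean, but the second concentrates almost all of its mass in a much narrower interval.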
784 00:42:04,010 --> 00:42:05,180 You can do even better. 785 00:42:05,180 --> 00:42:09,560 People actually do things such as drawing a random number 786 00:42:09,560 --> 00:42:10,600 from this distribution. 787 00:42:10,600 --> 00:42:12,940 Say, this is my number. 788 00:42:12,940 --> 00:42:14,440 That's kind of dangerous, but you 789 00:42:14,440 --> 00:42:15,690 can imagine you could do this. 790 00:42:20,730 --> 00:42:22,140 This is what works. 791 00:42:22,140 --> 00:42:23,680 That's what we went through. 792 00:42:23,680 --> 00:42:28,650 So here, as you notice, I don't care so much about this part 793 00:42:28,650 --> 00:42:30,240 here. 794 00:42:30,240 --> 00:42:32,240 Because it does not depend on theta. 795 00:42:32,240 --> 00:42:35,190 So I know that given the product of those two things, 796 00:42:35,190 --> 00:42:37,650 this thing is only the constant that I need to divide by 797 00:42:37,650 --> 00:42:40,050 so that when I integrate this thing over theta, 798 00:42:40,050 --> 00:42:41,460 it integrates to one. 799 00:42:41,460 --> 00:42:45,540 Because this has to be a probability density on theta. 800 00:42:45,540 --> 00:42:47,910 I can write this and just forget about that part. 801 00:42:47,910 --> 00:42:52,280 And that's what's written on the top of this slide. 802 00:42:52,280 --> 00:42:57,920 This notation, this sort of weird alpha, or, I don't know, 803 00:42:57,920 --> 00:42:59,780 infinity sign propped to the right, 804 00:42:59,780 --> 00:43:02,330 whatever you want to call this thing, 805 00:43:02,330 --> 00:43:04,700 is actually just really emphasizing the fact 806 00:43:04,700 --> 00:43:06,310 that I don't care. 807 00:43:06,310 --> 00:43:12,490 I write it because I can, but you know what it is. 808 00:43:17,314 --> 00:43:19,480 In some instances, you have to compute the integral. 809 00:43:19,480 --> 00:43:21,640 In some instances, you don't have to compute the integral. 810 00:43:21,640 --> 00:43:23,200 And a lot of Bayesian computation 811 00:43:23,200 --> 00:43:25,600 is about saying, OK, it's actually 812 00:43:25,600 --> 00:43:27,146 really hard to compute this integral, 813 00:43:27,146 --> 00:43:28,270 so I'd rather not do it. 814 00:43:28,270 --> 00:43:31,450 So let me try to find some methods that will allow me 815 00:43:31,450 --> 00:43:33,789 to sample from the posterior distribution, 816 00:43:33,789 --> 00:43:35,080 without having to compute this. 817 00:43:35,080 --> 00:43:37,720 And that's what's called Markov chain Monte 818 00:43:37,720 --> 00:43:40,580 Carlo, or MCMC, and that's exactly what those methods are doing. 819 00:43:40,580 --> 00:43:42,370 They're just using only ratios of things 820 00:43:42,370 --> 00:43:44,130 like that, for different thetas. 821 00:43:44,130 --> 00:43:45,890 Which means that if you take ratios, 822 00:43:45,890 --> 00:43:47,860 the normalizing constant is gone and you don't 823 00:43:47,860 --> 00:43:50,810 need to find this integral. 824 00:43:50,810 --> 00:43:53,015 So we won't go into those details at all. 825 00:43:53,015 --> 00:43:54,890 That would be the purpose of an entire course 826 00:43:54,890 --> 00:43:56,630 on Bayesian inference. 827 00:43:56,630 --> 00:43:59,570 Actually, even Bayesian computation 828 00:43:59,570 --> 00:44:02,154 would be an entire course on its own. 829 00:44:02,154 --> 00:44:03,820 And there are some very interesting things 830 00:44:03,820 --> 00:44:05,778 that are going on there, at the interface of stats 831 00:44:05,778 --> 00:44:06,890 and computation.
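[MCMC is only mentioned in passing here and is not part of this course, but a minimal random-walk Metropolis sketch shows the point being made: the accept/reject step only ever uses ratios of the unnormalized posterior, so the normalizing integral is never computed. The target below is the same Bernoulli-with-Beta(a, a)-prior posterior, with made-up data:]

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up Bernoulli data and a Beta(a, a) prior, for illustration only.
    x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
    a = 2.0
    n, s = len(x), x.sum()

    def log_unnorm_post(p):
        """Log of (likelihood times prior), without the normalizing constant."""
        if p <= 0.0 or p >= 1.0:
            return -np.inf
        return (s + a - 1) * np.log(p) + (n - s + a - 1) * np.log(1 - p)

    samples, p_cur = [], 0.5
    for _ in range(20_000):
        p_prop = p_cur + 0.1 * rng.standard_normal()   # symmetric random-walk proposal
        # Acceptance uses only the RATIO of unnormalized posteriors,
        # so the integral over theta never appears.
        if np.log(rng.uniform()) < log_unnorm_post(p_prop) - log_unnorm_post(p_cur):
            p_cur = p_prop
        samples.append(p_cur)

    print(np.mean(samples[5_000:]))   # should be close to (s + a) / (n + 2a)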
832 00:44:10,054 --> 00:44:12,470 So let's go back to our example and see if we can actually 833 00:44:12,470 --> 00:44:13,636 compute any of those things. 834 00:44:13,636 --> 00:44:17,420 Because it's very nice to give you some data, some formulas. 835 00:44:17,420 --> 00:44:19,990 Let's see if we can actually do it. 836 00:44:19,990 --> 00:44:23,810 In particular, can I actually recover this claim 837 00:44:23,810 --> 00:44:31,250 that the posterior associated to a beta prior with a Bernoulli 838 00:44:31,250 --> 00:44:35,780 likelihood is actually giving me a beta again? 839 00:44:35,780 --> 00:44:36,710 What was my prior? 840 00:44:42,670 --> 00:44:45,970 So p was following a beta A, A, which 841 00:44:45,970 --> 00:44:48,320 means that p has this density. 842 00:44:53,620 --> 00:44:56,610 That was pi of theta. 843 00:44:56,610 --> 00:44:59,580 Well, I'm going to write this as pi of p-- 844 00:44:59,580 --> 00:45:05,800 it was proportional to p to the A minus 1, times 1 minus p 845 00:45:05,800 --> 00:45:08,806 to the A minus 1. 846 00:45:08,806 --> 00:45:11,430 So that's the first ingredient I need to compute my posterior. 847 00:45:11,430 --> 00:45:14,370 I really need only two, if I only want it up to a constant. 848 00:45:14,370 --> 00:45:16,234 The second one was p n, the likelihood. 849 00:45:20,710 --> 00:45:22,620 We've computed that many times. 850 00:45:22,620 --> 00:45:25,610 And we had even a nice compact way of writing it, 851 00:45:25,610 --> 00:45:32,570 which was that p n of X1, ..., Xn, given the parameter p-- 852 00:45:32,570 --> 00:45:36,850 so the joint density of my data, given p, that's my likelihood. 853 00:45:36,850 --> 00:45:38,730 The likelihood of p was what? 854 00:45:38,730 --> 00:45:41,230 Well, it was p to the sum of the Xi's, 855 00:45:44,030 --> 00:45:46,300 1 minus p to the n minus sum of the Xi's. 856 00:45:50,990 --> 00:45:53,750 Anybody wants me to parse this more? 857 00:45:53,750 --> 00:45:56,060 Or do you remember seeing that from maximum likelihood 858 00:45:56,060 --> 00:45:57,060 estimation? 859 00:45:57,060 --> 00:45:57,697 Yeah? 860 00:45:57,697 --> 00:46:02,929 AUDIENCE: [INAUDIBLE] 861 00:46:02,929 --> 00:46:04,970 PHILIPPE RIGOLLET: That's what conditioning does. 862 00:46:10,838 --> 00:46:15,239 AUDIENCE: [INAUDIBLE] previous slide. 863 00:46:15,239 --> 00:46:19,151 [INAUDIBLE] bottom there, it says D pi of t. 864 00:46:19,151 --> 00:46:23,570 Shouldn't it be dt pi of t? 865 00:46:23,570 --> 00:46:25,300 PHILIPPE RIGOLLET: So D pi of T is 866 00:46:25,300 --> 00:46:29,110 a measure theoretic notation, which I used without thinking. 867 00:46:29,110 --> 00:46:32,380 And I should not, because I can see it upsets you. 868 00:46:32,380 --> 00:46:35,050 D pi of T is just a natural way to say 869 00:46:35,050 --> 00:46:38,170 that I integrate against whatever I'm 870 00:46:38,170 --> 00:46:43,930 given for the prior on theta. 871 00:46:43,930 --> 00:46:48,820 In particular, if my prior on theta is just the mix of a PDF and a point 872 00:46:48,820 --> 00:46:51,430 mass, maybe I say that my p takes 873 00:46:51,430 --> 00:46:54,400 value 0.5 with probability 0.5, 874 00:46:54,400 --> 00:46:58,900 and then is uniform on the interval with probability 0.5. 875 00:46:58,900 --> 00:47:01,930 For this, I neither have a PDF nor a PMF. 876 00:47:01,930 --> 00:47:04,150 But I can still talk about integrating with respect 877 00:47:04,150 --> 00:47:04,930 to this, right?
878 00:47:04,930 --> 00:47:08,530 It's going to look like, if I take a function f of T, 879 00:47:08,530 --> 00:47:14,480 D pi of T is going to be one half of f of one half. 880 00:47:14,480 --> 00:47:16,480 That's the point mass with probability one half, 881 00:47:16,480 --> 00:47:17,560 at one half. 882 00:47:17,560 --> 00:47:23,230 Plus one half of the integral between 0 and 1 of f of T, dT. 883 00:47:23,230 --> 00:47:26,980 This is just the notation, which is actually, funnily enough, 884 00:47:26,980 --> 00:47:29,360 interchangeable with pi of dT. 885 00:47:32,460 --> 00:47:34,890 But if you have a density, it's really 886 00:47:34,890 --> 00:47:39,801 just the density, pi of T, dT, 887 00:47:39,801 --> 00:47:41,940 if pi is really a density. The D pi notation is for 888 00:47:41,940 --> 00:47:44,120 when pi is a measure and not a density. 889 00:47:46,820 --> 00:47:49,700 Everybody else, forget about this. 890 00:47:49,700 --> 00:47:51,627 This is not something you should really 891 00:47:51,627 --> 00:47:52,710 worry about at this point. 892 00:47:52,710 --> 00:47:55,719 This is more graduate level probability classes. 893 00:47:55,719 --> 00:47:57,260 But yeah, it's called measure theory. 894 00:47:57,260 --> 00:47:59,160 And that's when you think of pi as being a measure 895 00:47:59,160 --> 00:47:59,980 in an abstract fashion. 896 00:47:59,980 --> 00:48:01,896 You don't have to worry whether it's a density 897 00:48:01,896 --> 00:48:04,000 or not, or whether it has a density. 898 00:48:08,350 --> 00:48:10,250 So everybody is OK with this? 899 00:48:15,530 --> 00:48:17,390 Now I need to compute my posterior. 900 00:48:17,390 --> 00:48:23,120 And as I said, my posterior is really 901 00:48:23,120 --> 00:48:25,550 just the product of the likelihood weighted 902 00:48:25,550 --> 00:48:28,970 by the prior. 903 00:48:28,970 --> 00:48:33,030 Hopefully, at this stage of your education, 904 00:48:33,030 --> 00:48:35,390 you can multiply two functions. 905 00:48:35,390 --> 00:48:37,580 So what's happening is, if I multiply this guy 906 00:48:37,580 --> 00:48:41,300 with this guy, p gets this guy to the power 907 00:48:41,300 --> 00:48:42,860 this guy plus this guy. 908 00:48:53,810 --> 00:49:00,020 And then 1 minus p gets the power n minus sum of Xi's. 909 00:49:00,020 --> 00:49:02,900 So this is always from i equal 1 to n. 910 00:49:02,900 --> 00:49:04,390 And then plus A minus 1 as well. 911 00:49:10,010 --> 00:49:15,560 This is up to constant, because I still need to solve this. 912 00:49:15,560 --> 00:49:17,259 And I could try to do it. 913 00:49:17,259 --> 00:49:18,800 But I really don't have to, because I 914 00:49:18,800 --> 00:49:24,380 know that if my density has this form, then 915 00:49:24,380 --> 00:49:25,532 it's a beta distribution. 916 00:49:25,532 --> 00:49:26,990 And then I can just go on Wikipedia 917 00:49:26,990 --> 00:49:29,021 and see what should be the normalization factor. 918 00:49:29,021 --> 00:49:31,020 But I know it's going to be a beta distribution. 919 00:49:31,020 --> 00:49:34,020 It's actually the beta with these parameters. 920 00:49:34,020 --> 00:49:39,210 So this is really my beta with parameter sum of Xi, 921 00:49:39,210 --> 00:49:43,580 i equal 1 to n, plus A minus 1. 922 00:49:43,580 --> 00:49:46,130 And then the second parameter is n minus sum 923 00:49:46,130 --> 00:49:49,806 of the Xi's plus A minus 1. 924 00:49:54,980 --> 00:49:59,030 I just wrote what was here. 925 00:49:59,030 --> 00:50:01,580 What happened to my one?
926 00:50:01,580 --> 00:50:02,920 Oh no, sorry. 927 00:50:02,920 --> 00:50:05,640 Beta has the power minus 1. 928 00:50:05,640 --> 00:50:08,847 So that's the parameter of the beta. 929 00:50:08,847 --> 00:50:10,430 And this is the parameter of the beta. 930 00:50:15,127 --> 00:50:16,210 Beta is over there, right? 931 00:50:16,210 --> 00:50:19,852 So I just replace A by what I see. 932 00:50:19,852 --> 00:50:22,290 A is just becoming this guy plus this guy, 933 00:50:22,290 --> 00:50:26,400 and this guy plus this guy. 934 00:50:26,400 --> 00:50:28,662 Everybody is comfortable with this computation? 935 00:50:34,170 --> 00:50:38,850 We just agreed that beta priors for Bernoulli observations 936 00:50:38,850 --> 00:50:42,540 are certainly convenient. 937 00:50:42,540 --> 00:50:44,457 Because they are just conjugate, and we know 938 00:50:44,457 --> 00:50:46,290 what is going to come out in the end. 939 00:50:46,290 --> 00:50:48,899 It's going to be a beta as well. 940 00:50:48,899 --> 00:50:50,190 I just claimed it was convenient. 941 00:50:50,190 --> 00:50:52,890 It was certainly convenient to compute this, right? 942 00:50:52,890 --> 00:50:55,741 There was certainly some compatibility 943 00:50:55,741 --> 00:50:57,990 when I had to multiply this function by that function. 944 00:50:57,990 --> 00:51:00,916 And you can imagine that things could go much more wrong 945 00:51:00,916 --> 00:51:03,540 than just having p to some power times p to some other power, and 1 minus p 946 00:51:03,540 --> 00:51:06,390 to some power times 1 minus p to some other power. 947 00:51:06,390 --> 00:51:09,280 Things were nice. 948 00:51:09,280 --> 00:51:12,410 Now this is nice, but I can also question the following things. 949 00:51:12,410 --> 00:51:14,330 Why beta, for one? 950 00:51:14,330 --> 00:51:17,840 The beta tells me something. 951 00:51:17,840 --> 00:51:20,636 That's convenient, but then how do I pick A? 952 00:51:20,636 --> 00:51:27,500 I know that A should definitely capture where 953 00:51:27,500 --> 00:51:30,200 I want to have my p most likely located. 954 00:51:30,200 --> 00:51:32,390 But it actually also captures 955 00:51:32,390 --> 00:51:34,580 the variance of my beta. 956 00:51:34,580 --> 00:51:36,740 And so choosing different A's is going 957 00:51:36,740 --> 00:51:37,950 to give me different functions. 958 00:51:37,950 --> 00:51:43,050 If I have A and B, if I started with the beta with parameters A and B, 959 00:51:43,050 --> 00:51:48,110 if I started with a B here, I would just pick up the B here. 960 00:51:48,110 --> 00:51:49,862 Agreed? 961 00:51:49,862 --> 00:51:51,320 And that would just be symmetric. 962 00:51:51,320 --> 00:51:53,270 But they're going to capture mean and variance 963 00:51:53,270 --> 00:51:53,853 of this thing. 964 00:51:53,853 --> 00:51:56,030 And so how do I pick those guys? 965 00:51:56,030 --> 00:51:59,437 If I'm a doctor and you're asking me, 966 00:51:59,437 --> 00:52:01,520 what do you think the chances of this drug working 967 00:52:01,520 --> 00:52:03,230 in this kind of patient are? 968 00:52:03,230 --> 00:52:06,080 And I have to spit out the parameters of a beta for you, 969 00:52:06,080 --> 00:52:08,630 it might be a bit of a complicated thing to do. 970 00:52:08,630 --> 00:52:10,720 So how do you do this, especially for new problems? 971 00:52:10,720 --> 00:52:14,750 So by now, people have actually mastered 972 00:52:14,750 --> 00:52:19,290 the art of coming up with how to formulate those numbers.
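[One way to get a feel for what choosing A (and B) does is to look at the prior mean and standard deviation they imply, and at the conjugate update that was just computed, Beta(sum of Xi + A, n - sum of Xi + B). A small sketch with made-up counts; the specific values of a and b below are just examples:]

    from scipy import stats

    n, s = 10, 7          # made-up data: n Bernoulli trials, s of them equal to one
    for a, b in [(1, 1), (2, 2), (10, 10), (2, 8)]:
        prior = stats.beta(a, b)
        post = stats.beta(s + a, n - s + b)   # conjugate update from the computation above
        print(f"Beta({a},{b}) prior: mean={prior.mean():.2f}, sd={prior.std():.2f}"
              f" -> posterior: mean={post.mean():.2f}, sd={post.std():.2f}")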
973 00:52:19,290 --> 00:52:21,660 But in new problems that come up, how do you do this? 974 00:52:21,660 --> 00:52:23,840 What happens if you want to use Bayesian methods, 975 00:52:23,840 --> 00:52:30,140 but you actually do not know what you expect to see? 976 00:52:30,140 --> 00:52:33,260 To be fair, before we started this class, I hope all of you 977 00:52:33,260 --> 00:52:36,870 had no idea whether people tend to bend their head to the right 978 00:52:36,870 --> 00:52:38,172 or to the left before kissing. 979 00:52:38,172 --> 00:52:40,130 Because if you did, well, you have too much time 980 00:52:40,130 --> 00:52:42,130 on your hands and I should double your homework. 981 00:52:44,390 --> 00:52:46,970 So in this case, maybe you still want 982 00:52:46,970 --> 00:52:48,830 to use the Bayesian machinery. 983 00:52:48,830 --> 00:52:50,980 Maybe you just want to do something nice. 984 00:52:50,980 --> 00:52:53,512 It's nice, right? I mean, it worked out pretty well. 985 00:52:53,512 --> 00:52:54,470 What do you want to do? 986 00:52:54,470 --> 00:52:56,870 Well, you actually want to use some priors that 987 00:52:56,870 --> 00:53:00,170 carry no information, that basically do not prefer 988 00:53:00,170 --> 00:53:02,750 any theta to another theta. 989 00:53:02,750 --> 00:53:05,435 Now, you could read this slide, or you 990 00:53:05,435 --> 00:53:06,560 could look at this formula. 991 00:53:10,010 --> 00:53:14,920 We just said that this pi here was just here 992 00:53:14,920 --> 00:53:18,220 to weigh some thetas more than others, depending 993 00:53:18,220 --> 00:53:19,870 on our prior belief. 994 00:53:19,870 --> 00:53:21,400 If our prior belief does not want 995 00:53:21,400 --> 00:53:24,880 to put any preference towards some thetas over others, 996 00:53:24,880 --> 00:53:26,332 what do I do? 997 00:53:26,332 --> 00:53:27,655 AUDIENCE: [INAUDIBLE] 998 00:53:27,655 --> 00:53:29,462 PHILIPPE RIGOLLET: Yeah, I remove it. 999 00:53:29,462 --> 00:53:31,420 And the way to remove something we multiply by 1000 00:53:31,420 --> 00:53:32,650 is just to replace it by one. 1001 00:53:32,650 --> 00:53:35,100 That's really what we're doing. 1002 00:53:35,100 --> 00:53:38,560 If this was a constant not depending on theta, 1003 00:53:38,560 --> 00:53:41,400 then that would mean that we're not preferring any theta. 1004 00:53:41,400 --> 00:53:44,370 And we're looking at the likelihood, 1005 00:53:44,370 --> 00:53:46,560 but not as a function that we're trying to maximize. 1006 00:53:46,560 --> 00:53:50,220 It is a function that we normalize in such a way 1007 00:53:50,220 --> 00:53:52,570 that it's actually a distribution. 1008 00:53:52,570 --> 00:53:54,782 So if pi is not here, 1009 00:53:54,782 --> 00:53:56,740 this is really just taking the likelihood, 1010 00:53:56,740 --> 00:53:57,990 which is a positive function. 1011 00:53:57,990 --> 00:53:59,970 It may not integrate to 1, so I normalize it 1012 00:53:59,970 --> 00:54:02,330 so that it integrates to 1. 1013 00:54:02,330 --> 00:54:05,120 And then I just say, well, this is my posterior distribution. 1014 00:54:05,120 --> 00:54:06,770 Now I could just maximize this thing 1015 00:54:06,770 --> 00:54:09,180 and spit out my maximum likelihood estimator. 1016 00:54:09,180 --> 00:54:10,850 But I can also integrate and find 1017 00:54:10,850 --> 00:54:12,350 what the expectation of this guy is. 1018 00:54:12,350 --> 00:54:14,210 I can find what the median of this guy is.
1019 00:54:14,210 --> 00:54:16,370 I can sample data from this guy. 1020 00:54:16,370 --> 00:54:19,430 I can understand what the variance of this guy is. 1021 00:54:19,430 --> 00:54:21,830 Which is something we did not do when we just did 1022 00:54:21,830 --> 00:54:24,800 maximum likelihood estimation, because given a function, all 1023 00:54:24,800 --> 00:54:27,998 we cared about was the arg max of this function. 1024 00:54:31,680 --> 00:54:36,120 These priors are called uninformative. 1025 00:54:36,120 --> 00:54:43,440 This is just replacing this number by one, or by a constant. 1026 00:54:43,440 --> 00:54:45,020 Because it still has to be a density. 1027 00:54:49,236 --> 00:54:50,610 If I have a bounded set, I'm just 1028 00:54:50,610 --> 00:54:52,950 looking for the uniform distribution 1029 00:54:52,950 --> 00:54:56,580 on this bounded set, the one that puts constant one 1030 00:54:56,580 --> 00:54:59,200 over the size of this thing. 1031 00:54:59,200 --> 00:55:01,590 But if I have an unbounded set, what 1032 00:55:01,590 --> 00:55:03,870 is the density that takes a constant value 1033 00:55:03,870 --> 00:55:07,555 on the entire real line, for example? 1034 00:55:07,555 --> 00:55:08,430 What is this density? 1035 00:55:13,200 --> 00:55:16,550 AUDIENCE: [INAUDIBLE] 1036 00:55:16,550 --> 00:55:18,530 PHILIPPE RIGOLLET: Doesn't exist, right? 1037 00:55:18,530 --> 00:55:20,990 It just doesn't exist. 1038 00:55:20,990 --> 00:55:22,770 The way you can think of it is a Gaussian 1039 00:55:22,770 --> 00:55:24,860 with the variance going to infinity, maybe, 1040 00:55:24,860 --> 00:55:26,289 or something like this. 1041 00:55:26,289 --> 00:55:27,830 But you can think of it in many ways. 1042 00:55:27,830 --> 00:55:32,330 You can think of the limit of the uniform between minus T 1043 00:55:32,330 --> 00:55:34,250 and T, with T going to infinity. 1044 00:55:34,250 --> 00:55:36,480 But this thing is actually zero. 1045 00:55:36,480 --> 00:55:39,530 There's nothing there. 1046 00:55:39,530 --> 00:55:41,990 You can actually still talk about this. 1047 00:55:41,990 --> 00:55:44,390 You could always talk about this thing, where 1048 00:55:44,390 --> 00:55:46,550 you think of this guy as being a constant, 1049 00:55:46,550 --> 00:55:49,080 remove this thing from this equation, and just say, 1050 00:55:49,080 --> 00:55:51,320 well, my posterior is just the likelihood 1051 00:55:51,320 --> 00:55:54,680 divided by the integral of the likelihood over theta. 1052 00:55:54,680 --> 00:55:58,650 And if theta is the entire real line, so be it. 1053 00:55:58,650 --> 00:56:00,390 As long as this integral converges, 1054 00:56:00,390 --> 00:56:01,890 you can still talk about this stuff. 1055 00:56:04,460 --> 00:56:06,300 This is what's called an improper prior. 1056 00:56:09,140 --> 00:56:11,990 An improper prior is just a non-negative function defined 1057 00:56:11,990 --> 00:56:17,390 on the thetas, but it does not have to integrate to one, 1058 00:56:17,390 --> 00:56:18,170 or to anything finite. 1059 00:56:20,900 --> 00:56:22,700 If I integrate the function equal to 1 1060 00:56:22,700 --> 00:56:24,330 on the entire real line, what do I get? 1061 00:56:27,800 --> 00:56:28,520 Infinity. 1062 00:56:32,390 --> 00:56:35,960 It's not a proper prior, and it's called an improper prior. 1063 00:56:35,960 --> 00:56:39,380 And those improper priors are usually 1064 00:56:39,380 --> 00:56:42,830 what you see when you start to want non-informative priors 1065 00:56:42,830 --> 00:56:44,360 on infinite sets of thetas.
1066 00:56:44,360 --> 00:56:46,880 That's just the nature of it. 1067 00:56:46,880 --> 00:56:50,020 You should think of them as being the uniform distribution 1068 00:56:50,020 --> 00:56:52,550 on some infinite set, if that thing were to exist. 1069 00:56:56,360 --> 00:57:01,070 Let's see some examples of non-informative priors. 1070 00:57:01,070 --> 00:57:04,410 If I'm in the interval 0, 1, this is a bounded set. 1071 00:57:04,410 --> 00:57:07,730 So I can talk about the uniform prior 1072 00:57:07,730 --> 00:57:10,600 on the interval 0, 1 for a parameter p of a Bernoulli. 1073 00:57:26,380 --> 00:57:28,000 If I want to talk about this, then it 1074 00:57:28,000 --> 00:57:35,910 means that my prior is: p follows some uniform on the interval 1075 00:57:35,910 --> 00:57:37,570 0, 1. 1076 00:57:37,570 --> 00:57:48,940 So that means that f of x is 1 if x is in 0, 1, and 0 otherwise. 1077 00:57:48,940 --> 00:57:52,000 There is actually not even a normalization needed. 1078 00:57:52,000 --> 00:57:53,860 This thing integrates to 1. 1079 00:57:53,860 --> 00:57:56,137 And so now, if I look at my likelihood, 1080 00:57:56,137 --> 00:57:57,220 it's still the same thing. 1081 00:57:57,220 --> 00:58:04,510 So my posterior becomes pi of theta, given X1, ..., Xn. 1082 00:58:04,510 --> 00:58:07,022 That's my posterior. 1083 00:58:07,022 --> 00:58:08,480 I don't write the likelihood again, 1084 00:58:08,480 --> 00:58:09,830 because we still have it-- 1085 00:58:09,830 --> 00:58:11,583 well, we don't have it here anymore. 1086 00:58:15,440 --> 00:58:17,940 The likelihood is given here. 1087 00:58:17,940 --> 00:58:20,930 Copy, paste over there. 1088 00:58:20,930 --> 00:58:23,069 The posterior is just this thing times 1. 1089 00:58:23,069 --> 00:58:24,360 So you will see it in a second. 1090 00:58:24,360 --> 00:58:28,570 So it's p to the power sum of the Xi's, 1 minus p 1091 00:58:28,570 --> 00:58:31,970 to the power n minus sum of the Xi's. 1092 00:58:31,970 --> 00:58:36,380 And then it's multiplied by 1, and then divided by this 1093 00:58:36,380 --> 00:58:42,250 integral between 0 and 1 of p, sum of the Xi's, 1094 00:58:42,250 --> 00:58:47,870 1 minus p, n minus sum of the Xi's, 1095 00:58:47,870 --> 00:58:51,866 dp, which does not depend on p. 1096 00:58:51,866 --> 00:58:53,990 And I really don't care what this thing actually is. 1097 00:58:58,900 --> 00:59:03,550 That's the posterior of p. 1098 00:59:03,550 --> 00:59:06,280 And now I can see, well, what is this? 1099 00:59:06,280 --> 00:59:12,870 It's actually just the beta with parameters 1100 00:59:12,870 --> 00:59:14,120 this guy plus 1 1101 00:59:19,670 --> 00:59:21,680 and this guy plus 1. 1102 00:59:34,430 --> 00:59:38,057 I didn't tell you what the expectation of a beta was. 1103 00:59:38,057 --> 00:59:39,890 We don't know what the expectation of a beta 1104 00:59:39,890 --> 00:59:42,200 is, agreed? 1105 00:59:42,200 --> 00:59:45,980 If I wanted to find, say, the expectation of this thing, that 1106 00:59:45,980 --> 00:59:47,990 would be some good estimator. We know 1107 00:59:47,990 --> 00:59:49,902 that the maximum of this guy-- what 1108 00:59:49,902 --> 00:59:51,110 is the maximum of this thing? 1109 00:59:54,880 --> 00:59:57,937 Well, it's just this thing, it's the average of the Xi's. 1110 00:59:57,937 --> 00:59:59,770 That's just the maximum likelihood estimator 1111 00:59:59,770 --> 01:00:00,353 for Bernoulli. 1112 01:00:00,353 --> 01:00:01,702 We know it's the average.
1113 01:00:01,702 --> 01:00:03,910 Do you think if I take the expectation of this thing, 1114 01:00:03,910 --> 01:00:05,295 I'm going to get the average? 1115 01:00:13,864 --> 01:00:15,780 So actually, I'm not going to get the average. 1116 01:00:15,780 --> 01:00:19,790 I'm going to get this guy plus this guy, divided by n plus 2. 1117 01:00:27,246 --> 01:00:28,870 Let's look at what this thing is doing. 1118 01:00:28,870 --> 01:00:34,364 It's looking at the number of ones and it's adding one. 1119 01:00:34,364 --> 01:00:36,280 And this guy is looking at the number of zeros 1120 01:00:36,280 --> 01:00:39,190 and it's adding one. 1121 01:00:39,190 --> 01:00:41,910 Why is it adding this one? 1122 01:00:41,910 --> 01:00:42,840 What's going on here? 1123 01:00:47,510 --> 01:00:52,040 This is going to matter mostly when the number of ones 1124 01:00:52,040 --> 01:00:56,060 is actually zero, or the number of zeros is zero. 1125 01:00:56,060 --> 01:01:00,000 Because what it does is just push the estimate away from zero. 1126 01:01:00,000 --> 01:01:03,020 And why is that something that this Bayesian method actually 1127 01:01:03,020 --> 01:01:04,600 does for you automatically? 1128 01:01:04,600 --> 01:01:06,530 It's because when we put this non-informative 1129 01:01:06,530 --> 01:01:11,169 prior on p, which was uniform on the interval 0, 1, 1130 01:01:11,169 --> 01:01:12,960 in particular, we know that the probability 1131 01:01:12,960 --> 01:01:16,690 that p is equal to 0 is zero. 1132 01:01:16,690 --> 01:01:19,180 And the probability that p is equal to 1 is zero. 1133 01:01:19,180 --> 01:01:21,880 And so the problem is that if I did not 1134 01:01:21,880 --> 01:01:24,520 add this 1, then with some positive probability 1135 01:01:24,520 --> 01:01:28,120 I would be allowed to spit out a 1136 01:01:28,120 --> 01:01:30,640 p hat which was equal to 0. 1137 01:01:30,640 --> 01:01:33,280 If by chance, let's say I have n is equal to 3, 1138 01:01:33,280 --> 01:01:37,750 and I get only 0, 0, 0, that could happen with probability 1139 01:01:37,750 --> 01:01:41,470 1 minus p to the cube, which is positive. 1140 01:01:46,360 --> 01:01:47,880 That's not something that I want. 1141 01:01:47,880 --> 01:01:49,359 And I'm using my prior. 1142 01:01:49,359 --> 01:01:51,900 My prior is not informative, but somehow it captures the fact 1143 01:01:51,900 --> 01:01:53,550 that I don't want to believe p is going 1144 01:01:53,550 --> 01:01:56,110 to be either equal to 0 or 1. 1145 01:01:56,110 --> 01:01:59,790 So that's sort of taken care of here. 1146 01:01:59,790 --> 01:02:05,640 So let's move away a little bit from the Bernoulli example, 1147 01:02:05,640 --> 01:02:06,310 shall we? 1148 01:02:06,310 --> 01:02:08,120 I think we've seen enough of it. 1149 01:02:08,120 --> 01:02:10,860 And so let's talk about the Gaussian model. 1150 01:02:10,860 --> 01:02:12,690 Let's say I want to do Gaussian inference. 1151 01:02:17,859 --> 01:02:19,650 I want to do inference in a Gaussian model, 1152 01:02:19,650 --> 01:02:20,730 using Bayesian methods. 1153 01:02:30,600 --> 01:02:39,840 What I want is that X1, ..., Xn are, say, N 0, 1 iid. 1154 01:02:44,720 --> 01:02:47,770 Sorry, N theta, 1, iid conditionally on theta. 1155 01:02:50,630 --> 01:02:56,300 That means that p n of X1, ..., Xn, given theta, 1156 01:02:56,300 --> 01:02:58,670 is equal to exactly what I wrote before. 1157 01:02:58,670 --> 01:03:04,760 So 1 over square root of 2 pi, to the n, exponential of minus one half 1158 01:03:04,760 --> 01:03:09,579 sum of Xi minus theta, squared.
1159 01:03:09,579 --> 01:03:11,120 So that's just the joint distribution 1160 01:03:11,120 --> 01:03:13,410 of my Gaussians with mean theta. 1161 01:03:13,410 --> 01:03:14,810 And now the question is, what 1162 01:03:14,810 --> 01:03:17,540 is the posterior distribution? 1163 01:03:17,540 --> 01:03:22,500 Well, here I said, let's use the uninformative prior, 1164 01:03:22,500 --> 01:03:23,840 which is an improper prior. 1165 01:03:23,840 --> 01:03:25,490 It puts weight on everyone. 1166 01:03:25,490 --> 01:03:29,310 That's the so-called uniform on the entire real line. 1167 01:03:29,310 --> 01:03:31,190 So that's certainly not a density. 1168 01:03:31,190 --> 01:03:34,360 But I can still just use this. 1169 01:03:34,360 --> 01:03:40,430 So all I need to do is take this, divided 1170 01:03:40,430 --> 01:03:44,690 by whatever normalizes this thing. 1171 01:03:44,690 --> 01:03:47,900 But if you look at this, essentially here is what I 1172 01:03:47,900 --> 01:03:49,530 want to understand. 1173 01:03:49,530 --> 01:03:52,470 So this is proportional to the exponential of 1174 01:03:52,470 --> 01:03:55,040 minus one half sum from i equal 1 1175 01:03:55,040 --> 01:03:58,950 to n of Xi minus theta, squared. 1176 01:03:58,950 --> 01:04:01,370 And now I want to see this thing as a density, 1177 01:04:01,370 --> 01:04:03,560 not on the Xi's but on theta. 1178 01:04:06,420 --> 01:04:10,120 What I want is a density on theta. 1179 01:04:10,120 --> 01:04:13,650 So it looks like I have chances of getting something 1180 01:04:13,650 --> 01:04:16,800 that looks like a Gaussian. 1181 01:04:16,800 --> 01:04:19,500 To have a Gaussian, I would need to see minus one half. 1182 01:04:19,500 --> 01:04:21,660 And then I would need to see theta minus something 1183 01:04:21,660 --> 01:04:25,230 here, not just the sum of something minus thetas. 1184 01:04:25,230 --> 01:04:29,820 So I need to work a little bit more, 1185 01:04:29,820 --> 01:04:31,475 to expand the square here. 1186 01:04:31,475 --> 01:04:32,850 So this thing here is going to be 1187 01:04:32,850 --> 01:04:37,330 equal to exponential of minus one half sum from i equal 1 1188 01:04:37,330 --> 01:04:45,280 to n of Xi squared minus 2 Xi theta plus theta squared. 1189 01:05:10,590 --> 01:05:13,590 Now what I'm going to do is, everything, remember, 1190 01:05:13,590 --> 01:05:15,870 is up to this little sign. 1191 01:05:15,870 --> 01:05:19,710 So every time I see a term that does not depend on theta, 1192 01:05:19,710 --> 01:05:22,250 I can just push it in there and just make it disappear. 1193 01:05:22,250 --> 01:05:24,550 Agreed? 1194 01:05:24,550 --> 01:05:28,420 This term here, exponential of minus one half sum of Xi 1195 01:05:28,420 --> 01:05:31,661 squared, does it depend on theta? 1196 01:05:31,661 --> 01:05:32,160 No. 1197 01:05:32,160 --> 01:05:33,420 So I'm just pushing it there. 1198 01:05:33,420 --> 01:05:34,530 This guy, yes. 1199 01:05:34,530 --> 01:05:35,970 And the other one, yes. 1200 01:05:35,970 --> 01:05:45,020 So this is proportional to exponential of sum of the Xi's 1201 01:05:45,020 --> 01:05:47,780 times theta-- and then I pulled out my theta, the minus one half 1202 01:05:47,780 --> 01:05:50,150 canceled with the minus 2-- 1203 01:05:50,150 --> 01:05:56,460 and then I have minus one half sum from i 1204 01:05:56,460 --> 01:05:58,180 equal 1 to n of theta squared. 1205 01:06:01,480 --> 01:06:03,460 Agreed? 1206 01:06:03,460 --> 01:06:05,350 So now what this thing looks like, 1207 01:06:05,350 --> 01:06:09,570 this looks very much like some theta minus something, squared.
1208 01:06:09,570 --> 01:06:15,110 This thing here is really just n over 2 times theta. 1209 01:06:18,520 --> 01:06:21,740 Sorry, times theta squared. 1210 01:06:21,740 --> 01:06:25,120 So now what I need to do is to write this in the form theta 1211 01:06:25,120 --> 01:06:26,230 minus something-- 1212 01:06:26,230 --> 01:06:31,820 let's call it mu-- squared, divided by 2 sigma squared. 1213 01:06:31,820 --> 01:06:34,160 I want to turn this into that, maybe up to terms 1214 01:06:34,160 --> 01:06:36,510 that do not depend on theta. 1215 01:06:36,510 --> 01:06:39,062 That's what I'm going to try to do. 1216 01:06:39,062 --> 01:06:40,770 So that's called completing the square. 1217 01:06:40,770 --> 01:06:42,010 That's some exercise you do. 1218 01:06:42,010 --> 01:06:44,260 You've probably done it already in the homework. 1219 01:06:44,260 --> 01:06:46,560 And that's something you do a lot when 1220 01:06:46,560 --> 01:06:48,750 you do Bayesian statistics, in particular. 1221 01:06:48,750 --> 01:06:50,010 So let's do this. 1222 01:06:50,010 --> 01:06:51,910 What is the leading term going to be? 1223 01:06:51,910 --> 01:06:54,160 Theta squared is going to be multiplied by this thing. 1224 01:06:54,160 --> 01:06:57,130 So I'm going to pull out my n over 2. 1225 01:06:57,130 --> 01:07:03,070 And then I'm going to write this as minus n over 2 times something. 1226 01:07:03,070 --> 01:07:06,220 And then I'm going to write theta minus something, squared. 1227 01:07:06,220 --> 01:07:08,890 And this something is going to be one half of what 1228 01:07:08,890 --> 01:07:10,160 I see in the cross-product. 1229 01:07:12,966 --> 01:07:14,590 I need to actually pull this thing out. 1230 01:07:14,590 --> 01:07:18,340 So let me write it like that first. 1231 01:07:18,340 --> 01:07:21,860 So that's theta squared. 1232 01:07:21,860 --> 01:07:30,680 And then I'm going to write it as minus 2 times 1 over n, sum 1233 01:07:30,680 --> 01:07:36,980 from i equal 1 to n of the Xi's, times theta. 1234 01:07:36,980 --> 01:07:39,874 That's exactly just a rewriting of what we had before. 1235 01:07:39,874 --> 01:07:41,540 And that should look much more familiar. 1236 01:07:44,990 --> 01:07:49,700 A squared minus 2 A B, and then I'm missing something. 1237 01:07:49,700 --> 01:07:51,860 So this thing, I'm going to be able to rewrite 1238 01:07:51,860 --> 01:07:57,930 as theta minus Xn bar, squared. 1239 01:07:57,930 --> 01:08:00,720 But then I need to remove the square of Xn bar. 1240 01:08:00,720 --> 01:08:01,740 Because it's not here. 1241 01:08:09,210 --> 01:08:11,297 So I just complete the square. 1242 01:08:11,297 --> 01:08:13,880 And then I actually really don't care what this thing actually 1243 01:08:13,880 --> 01:08:16,899 was, because it's going to go again into the little alpha 1244 01:08:16,899 --> 01:08:18,416 sign over there. 1245 01:08:18,416 --> 01:08:19,790 So this thing eventually is going 1246 01:08:19,790 --> 01:08:24,620 to be proportional to exponential 1247 01:08:24,620 --> 01:08:31,090 of minus n over 2 times theta minus Xn bar, squared. 1248 01:08:31,090 --> 01:08:33,370 And so we know that if this is a density that's 1249 01:08:33,370 --> 01:08:44,100 proportional to this guy, it has to be some N with mean Xn bar. 1250 01:08:44,100 --> 01:08:47,520 And the variance-- this guy over here, this n, 1251 01:08:47,520 --> 01:08:49,318 is supposed to be 1 over sigma squared. 1252 01:08:49,318 --> 01:08:50,609 So the variance is really just 1 over n.
1253 01:08:53,870 --> 01:09:01,740 So the posterior distribution is a Gaussian 1254 01:09:01,740 --> 01:09:05,819 centered at the average of my observations, 1255 01:09:05,819 --> 01:09:08,430 and with variance 1 over n. 1256 01:09:13,307 --> 01:09:14,140 Everybody's with me? 1257 01:09:16,740 --> 01:09:19,779 Why am I saying this? This was the output of some computation. 1258 01:09:19,779 --> 01:09:21,450 But it sort of makes sense, right? 1259 01:09:21,450 --> 01:09:24,210 It's really telling me that the more observations I have, 1260 01:09:24,210 --> 01:09:26,250 the more concentrated this posterior is. 1261 01:09:26,250 --> 01:09:27,819 Concentrated around what? 1262 01:09:27,819 --> 01:09:30,529 Well, around this Xn bar. 1263 01:09:30,529 --> 01:09:33,140 That looks like something we've sort of seen before. 1264 01:09:33,140 --> 01:09:35,420 But it does not have the same meaning, somehow. 1265 01:09:35,420 --> 01:09:37,580 This is really just the posterior distribution. 1266 01:09:40,490 --> 01:09:43,160 It's sort of a sanity check that I have this 1 over n 1267 01:09:43,160 --> 01:09:44,139 when I have Xn bar. 1268 01:09:44,139 --> 01:09:45,680 But it's not the same thing as saying 1269 01:09:45,680 --> 01:09:48,429 that the variance of Xn bar was 1 over n, like we had before. 1270 01:09:55,670 --> 01:09:59,390 As an exercise, I would recommend, 1271 01:09:59,390 --> 01:10:10,140 if you don't get it, just try pi of theta 1272 01:10:10,140 --> 01:10:15,290 to be equal to some N of mu, 1. 1273 01:10:18,120 --> 01:10:22,350 Here, the prior that we used was completely non-informative. 1274 01:10:22,350 --> 01:10:25,594 What happens if I take my prior to be some Gaussian, which 1275 01:10:25,594 --> 01:10:27,510 is centered at mu and has the same variance 1276 01:10:27,510 --> 01:10:30,120 as the other guys? 1277 01:10:30,120 --> 01:10:32,204 So what's going to happen here is that we're 1278 01:10:32,204 --> 01:10:33,120 going to put a weight, 1279 01:10:33,120 --> 01:10:34,536 and everything that's away from mu 1280 01:10:34,536 --> 01:10:38,469 is going to actually get less weight. 1281 01:10:38,469 --> 01:10:40,260 I want to know how I'm going to be updating 1282 01:10:40,260 --> 01:10:41,850 this prior into a posterior. 1283 01:10:44,520 --> 01:10:47,040 Everybody sees what I'm saying here? 1284 01:10:47,040 --> 01:10:50,040 So that means that pi of theta has the density proportional 1285 01:10:50,040 --> 01:10:55,680 to exponential of minus one half theta minus mu, squared. 1286 01:10:55,680 --> 01:11:00,540 So I need to multiply my likelihood with this, 1287 01:11:00,540 --> 01:11:01,849 and then see. 1288 01:11:01,849 --> 01:11:03,390 It's actually going to be a Gaussian. 1289 01:11:03,390 --> 01:11:04,774 This is also a conjugate prior. 1290 01:11:04,774 --> 01:11:06,440 It's going to spit out another Gaussian. 1291 01:11:06,440 --> 01:11:09,390 You're going to have to complete a square again, and just check 1292 01:11:09,390 --> 01:11:10,814 what it's actually giving you. 1293 01:11:10,814 --> 01:11:12,480 And so, spoiler alert, it's going to look 1294 01:11:12,480 --> 01:11:14,790 like you get an extra observation, which is actually 1295 01:11:14,790 --> 01:11:15,360 equal to mu. 1296 01:11:18,800 --> 01:11:22,440 It's going to be the average of n plus 1 observations, 1297 01:11:22,440 --> 01:11:24,110 the first n ones being X1 to Xn, 1298 01:11:24,110 --> 01:11:27,530 and then the last one being mu. 1299 01:11:27,530 --> 01:11:30,860 And it sort of makes sense.
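[If you want to check these two Gaussian answers numerically, here is a small sketch with made-up data. It computes the posterior on a grid for both the flat improper prior and the N(mu, 1) prior, and compares with the closed forms. The 1/(n + 1) posterior variance in the conjugate case is not stated above, but it falls out of the same completing-the-square step:]

    import numpy as np

    rng = np.random.default_rng(1)
    theta_true, mu, n = 1.3, 0.0, 20          # made-up truth and an N(mu, 1) prior
    x = theta_true + rng.standard_normal(n)   # X1, ..., Xn ~ N(theta, 1)

    theta = np.linspace(-3, 5, 20_001)
    dt = theta[1] - theta[0]
    loglik = -0.5 * ((x[:, None] - theta[None, :]) ** 2).sum(axis=0)

    for name, logprior in [("flat improper prior", 0.0),
                           ("N(mu, 1) prior", -0.5 * (theta - mu) ** 2)]:
        logpost = loglik + logprior
        w = np.exp(logpost - logpost.max())
        post = w / (w.sum() * dt)                      # normalized posterior on the grid
        mean = (theta * post).sum() * dt
        var = ((theta - mean) ** 2 * post).sum() * dt
        print(name, mean, var)

    # Closed forms: flat prior -> N(Xn bar, 1/n);
    # N(mu, 1) prior -> mean (sum of Xi + mu)/(n + 1), variance 1/(n + 1).
    print(x.mean(), 1 / n, (x.sum() + mu) / (n + 1), 1 / (n + 1))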
1300 01:11:30,860 --> 01:11:34,700 That's actually a fairly simple exercise. 1301 01:11:34,700 --> 01:11:36,441 Rather than going into more computation, 1302 01:11:36,441 --> 01:11:37,940 this is something you can definitely 1303 01:11:37,940 --> 01:11:41,510 do when you're in the comfort of your room. 1304 01:11:41,510 --> 01:11:43,910 I want to talk about other types of priors. 1305 01:11:43,910 --> 01:11:47,330 The first thing I said is, there's this beta prior 1306 01:11:47,330 --> 01:11:50,390 that I just pulled out of my hat and that was just convenient. 1307 01:11:50,390 --> 01:11:52,860 Then there was this non-informative prior. 1308 01:11:52,860 --> 01:11:53,720 It was convenient. 1309 01:11:53,720 --> 01:11:56,300 It was non-informative, so if you don't know anything 1310 01:11:56,300 --> 01:11:58,950 else, maybe that's what you want to do. 1311 01:11:58,950 --> 01:12:01,940 The question is, are there any other priors that 1312 01:12:01,940 --> 01:12:04,490 are sort of principled and generic, in the sense 1313 01:12:04,490 --> 01:12:08,600 that the uninformative prior was generic, right? 1314 01:12:08,600 --> 01:12:11,400 It was equal to 1, that's as generic as it gets. 1315 01:12:11,400 --> 01:12:14,190 So is there anything else that's generic as well? 1316 01:12:14,190 --> 01:12:17,180 Well, there are these priors that are called Jeffreys priors. 1317 01:12:17,180 --> 01:12:20,540 And the Jeffreys prior is proportional to the square root 1318 01:12:20,540 --> 01:12:23,290 of the determinant of the Fisher information at theta. 1319 01:12:26,360 --> 01:12:28,600 This is actually a weird thing to do. 1320 01:12:28,600 --> 01:12:31,380 It says, look at your model. 1321 01:12:31,380 --> 01:12:34,152 Your model is going to have a Fisher information. 1322 01:12:34,152 --> 01:12:34,985 Let's say it exists. 1323 01:12:38,150 --> 01:12:39,957 Because we know it does not always exist. 1324 01:12:39,957 --> 01:12:41,540 For example, in the multinomial model, 1325 01:12:41,540 --> 01:12:44,660 we didn't have a Fisher information. 1326 01:12:44,660 --> 01:12:46,670 The determinant of a matrix is somehow 1327 01:12:46,670 --> 01:12:48,800 measuring the size of a matrix. 1328 01:12:48,800 --> 01:12:50,540 If you don't trust me, just think 1329 01:12:50,540 --> 01:12:53,870 about the matrix being of size one by one, 1330 01:12:53,870 --> 01:12:56,910 then the determinant is just the number that you have there. 1331 01:12:56,910 --> 01:13:00,770 And so this is really something that looks like the Fisher 1332 01:13:00,770 --> 01:13:01,670 information. 1333 01:13:04,374 --> 01:13:06,290 It's proportional to the amount of information 1334 01:13:06,290 --> 01:13:09,620 that you have at a certain point. 1335 01:13:09,620 --> 01:13:12,310 And so what my prior is saying is, well, 1336 01:13:12,310 --> 01:13:14,280 I want to put more weight on those thetas that 1337 01:13:14,280 --> 01:13:17,050 are going to just extract more information from the data. 1338 01:13:20,510 --> 01:13:22,760 You can actually compute those things. 1339 01:13:22,760 --> 01:13:26,215 In the first example, Jeffreys prior 1340 01:13:26,215 --> 01:13:28,360 is something that looks like this. 1341 01:13:28,360 --> 01:13:30,230 In one dimension, the Fisher information 1342 01:13:30,230 --> 01:13:33,476 is essentially one over the variance. 1343 01:13:33,476 --> 01:13:35,600 So that's just 1 over the square root of the variance, 1344 01:13:35,600 --> 01:13:37,550 because I have the square root.
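[For the Bernoulli case this can be made concrete: the Fisher information is 1/(p(1 - p)), so the Jeffreys prior is proportional to 1 over the square root of p(1 - p). That function is, up to its normalizing constant, a Beta(1/2, 1/2) density — a fact not spelled out on the slide — so the conjugate update still applies. A small sketch with made-up counts:]

    from scipy import stats

    # Jeffreys prior for Bernoulli(p): proportional to 1 / sqrt(p (1 - p)),
    # i.e. a Beta(1/2, 1/2) density (proper, even though it blows up at 0 and 1).
    jeffreys = stats.beta(0.5, 0.5)

    n, s = 10, 7                              # made-up data: n trials, s ones
    post = stats.beta(s + 0.5, n - s + 0.5)   # conjugate update, as for any beta prior

    print(jeffreys.pdf(0.5), jeffreys.pdf(0.01))  # more weight pushed toward the boundary
    print(post.mean())                            # compare with s/n and (s + 1)/(n + 2)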
1345 01:13:37,550 --> 01:13:45,770 And when I have the Jeffreys prior in the Gaussian 1346 01:13:45,770 --> 01:13:48,770 case, this is the identity matrix 1347 01:13:48,770 --> 01:13:50,840 that I would have for the Fisher information. 1348 01:13:50,840 --> 01:13:52,580 The determinant of the identity is 1. 1349 01:13:52,580 --> 01:13:56,180 So square root of 1 is 1, and so I would basically get 1. 1350 01:13:56,180 --> 01:13:59,170 And that gives me my improper prior, my uninformative prior 1351 01:13:59,170 --> 01:14:01,020 that I had. 1352 01:14:01,020 --> 01:14:03,690 So the uninformative prior 1 is fine. 1353 01:14:03,690 --> 01:14:06,780 Clearly, all the thetas carry the same information 1354 01:14:06,780 --> 01:14:08,160 in the Gaussian model. 1355 01:14:08,160 --> 01:14:10,200 Whether I translate it here or here, 1356 01:14:10,200 --> 01:14:12,120 it's pretty clear none of them is actually 1357 01:14:12,120 --> 01:14:13,140 better than the other. 1358 01:14:13,140 --> 01:14:16,530 But clearly, for the Bernoulli case, 1359 01:14:16,530 --> 01:14:22,560 the p's that are closer to the boundary carry 1360 01:14:22,560 --> 01:14:23,940 more information. 1361 01:14:23,940 --> 01:14:26,250 I sort of like those guys, because they just 1362 01:14:26,250 --> 01:14:27,757 carry more information. 1363 01:14:27,757 --> 01:14:29,340 So what I do is, I take this function, 1364 01:14:29,340 --> 01:14:30,300 p times 1 minus p. 1365 01:14:30,300 --> 01:14:34,170 Remember, it's something that looks like this, 1366 01:14:34,170 --> 01:14:35,390 on the interval 0, 1. 1367 01:14:38,710 --> 01:14:40,979 This guy, 1 over square root of p times 1 minus p, 1368 01:14:40,979 --> 01:14:42,395 is something that looks like this. 1369 01:14:45,780 --> 01:14:47,620 Agreed? 1370 01:14:47,620 --> 01:14:49,780 What it's doing is, it sort of wants to push 1371 01:14:49,780 --> 01:14:54,586 towards the p's that actually carry more information. 1372 01:14:54,586 --> 01:14:56,210 Whether you want to bias your inference that 1373 01:14:56,210 --> 01:14:59,120 way or not is something you need to think about. 1374 01:14:59,120 --> 01:15:01,550 When you put a prior on your data, on your parameter, 1375 01:15:01,550 --> 01:15:06,140 you're sort of biasing your inference towards this prior idea. 1376 01:15:06,140 --> 01:15:07,700 That's maybe not such a good idea 1377 01:15:07,700 --> 01:15:13,160 when you have some p that's actually close to one half, 1378 01:15:13,160 --> 01:15:13,820 for example. 1379 01:15:13,820 --> 01:15:14,960 You're actually saying, no, I don't 1380 01:15:14,960 --> 01:15:16,610 want to see a p that's close to one half. 1381 01:15:16,610 --> 01:15:18,350 Just make a decision, one way or another. 1382 01:15:18,350 --> 01:15:19,699 But just make a decision. 1383 01:15:19,699 --> 01:15:20,990 So it's forcing you to do that. 1384 01:15:23,690 --> 01:15:26,090 Jeffreys priors-- I'm running out of time, 1385 01:15:26,090 --> 01:15:29,850 so I don't want to go into too much detail. 1386 01:15:29,850 --> 01:15:31,670 We'll probably stop here, actually. 1387 01:15:44,570 --> 01:15:47,810 So Jeffreys priors have this very nice property. 1388 01:15:47,810 --> 01:15:51,740 It's that they actually do not care about the parameterization 1389 01:15:51,740 --> 01:15:53,150 of your space. 1390 01:15:53,150 --> 01:15:56,360 If you actually have p, and you suddenly 1391 01:15:56,360 --> 01:15:58,850 decide that p is not the right parameter for Bernoulli, 1392 01:15:58,850 --> 01:16:00,740 but it's p squared.
1393 01:16:00,740 --> 01:16:03,200 You could decide to parameterize this by p squared. 1394 01:16:03,200 --> 01:16:05,840 Maybe your doctor is actually much more able 1395 01:16:05,840 --> 01:16:08,840 to formulate some prior assumption on p squared, 1396 01:16:08,840 --> 01:16:09,800 rather than p. 1397 01:16:09,800 --> 01:16:11,100 You never know. 1398 01:16:11,100 --> 01:16:14,390 And so what happens is that Jeffreys priors 1399 01:16:14,390 --> 01:16:15,990 are invariant under this. 1400 01:16:15,990 --> 01:16:18,560 And the reason is because the information carried by p 1401 01:16:18,560 --> 01:16:21,130 is the same as the information carried by p squared, somehow. 1402 01:16:28,822 --> 01:16:30,280 They're essentially the same thing. 1403 01:16:32,950 --> 01:16:34,630 You need to have a one-to-one map, 1404 01:16:34,630 --> 01:16:37,896 where basically for each parameter before, 1405 01:16:37,896 --> 01:16:39,020 you have another parameter. 1406 01:16:39,020 --> 01:16:40,810 Let's call eta the new parameter. 1407 01:16:45,790 --> 01:16:50,380 The PDF of the new prior, indexed by eta this time, 1408 01:16:50,380 --> 01:16:52,990 is actually also a Jeffreys prior. 1409 01:16:52,990 --> 01:16:55,174 But this time, the new Fisher information 1410 01:16:55,174 --> 01:16:57,340 is not the Fisher information with respect to theta. 1411 01:16:57,340 --> 01:17:00,010 It's the Fisher information associated 1412 01:17:00,010 --> 01:17:03,130 to the statistical model indexed by eta. 1413 01:17:03,130 --> 01:17:08,110 So essentially, when you change the parameterization 1414 01:17:08,110 --> 01:17:10,600 of your model, you still get the Jeffreys prior 1415 01:17:10,600 --> 01:17:12,820 for the new parameterization. 1416 01:17:12,820 --> 01:17:15,020 Which is, in a way, a desirable property. 1417 01:17:19,410 --> 01:17:21,920 Jeffreys priors are just uninformative priors, 1418 01:17:21,920 --> 01:17:24,140 or priors you want to use when you 1419 01:17:24,140 --> 01:17:26,480 want a systematic way of picking a prior, without really thinking about what 1420 01:17:26,480 --> 01:17:27,396 to pick for your model. 1421 01:17:35,440 --> 01:17:37,060 I'll finish this next time. 1422 01:17:37,060 --> 01:17:39,910 And we'll talk about Bayesian confidence regions. 1423 01:17:39,910 --> 01:17:41,620 We'll talk about Bayesian estimation. 1424 01:17:41,620 --> 01:17:44,074 Once I have a posterior, what do I get? 1425 01:17:44,074 --> 01:17:45,490 And basically, the only message is 1426 01:17:45,490 --> 01:17:47,860 going to be that you might want to integrate 1427 01:17:47,860 --> 01:17:48,910 against the posterior. 1428 01:17:48,910 --> 01:17:51,490 Find the posterior, the expectation of your posterior 1429 01:17:51,490 --> 01:17:52,130 distribution. 1430 01:17:52,130 --> 01:17:54,010 That's a good point estimator for theta. 1431 01:17:56,860 --> 01:18:01,020 We'll just do a couple of computations.