The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: It doesn't want to run Flash Player, so I had to run them on Chrome.

All right, so let's move on to our second chapter. And hopefully, in this chapter, you will feel a little better if you felt like it was going a bit fast in the first chapter. The main reason we went fast, especially in terms of confidence intervals: some of you came and asked me, what do you mean by "this is a confidence interval"? What does it mean that it's in there with probability 95%, et cetera? I went really fast because I didn't want to give you a first week doing probability only, without understanding what the statistical context for it was. So hopefully, for all these things that we've done in terms of probability, you actually know why we've been doing them.
And so we're basically going to go back to what we were doing, maybe start with some statistical setup. But the goal of this lecture is really to go back again to what we've seen, from a purely statistical perspective. All right? So the first thing we're going to do is explain why we're doing statistical modeling. So in practice, you have data, you observe a bunch of points, and here I gave you some numbers, for example. So here's a partial data set with the number of siblings, including self, that was collected from college students a few years back. I was teaching a class like yours, and I asked students to go and fill out a Google form and tell me a bunch of things. And one of the questions was: including yourself, how many siblings do you have? And so they gave me this list of numbers. And there are many ways I can think of this list of numbers, right? I could think of it as just a discrete distribution on the set of numbers starting at 1. I know there's not going to be an answer which is less than 1, unless, well, someone doesn't understand the question.
But all the answers I should get are positive integers: 1, 2, 3, et cetera. And there probably is an upper bound, but I don't know it off the top of my head. So maybe I should say 100. Maybe I should say 15. It depends, right? And I think the largest number I got for this was 6. So here you can see you have pretty standard families, you know, lots of 1s, 2s, and 3s. What statistical modeling does is try to compress this information, which I could otherwise describe only in a very naive way. So let's start with the basic, usual statistical setup. I will start, as on many of the boards, with X1, ..., Xn, random variables. And what I'm going to assume, as we said, is that typically those guys are IID. And they have some distribution, so they all share the same distribution. And the fact that they're IID is so that I can actually do statistics. Statistics means looking at some global averaging so that I can get a sense of what the global behavior is for the population, right?
If I start assuming that those things are not identically distributed, that they all live on their own (my sequence of numbers is your number of siblings, the shoe size of this person, the depth of the Charles River, and I start measuring a bunch of stuff), there's nothing I can actually put together. I need to have something that's cohesive. And so here, I collected some data that was cohesive. And so the goal here, the first thing, is to say what the distribution is that I actually have here. So I could be very general. I could just say it's some distribution P. And let's say those are random variables, not random vectors; I could collect entire vectors about students, but let's say those are just random variables. And so now I can start making assumptions on this distribution P. What can I say about a distribution? Well, maybe if those numbers are continuous, for example, I could assume they have a density, a probability density function. That's already an assumption. Maybe I could start to assume that their probability density function is smooth.
That's another assumption. Maybe I could assume that it's piecewise constant. That's even better, right? And those things make my life simpler and simpler, because what I do by making these successive assumptions is reduce the degrees of freedom of the space in which I am actually searching for the distribution. And so what we want is something which is small enough that we can actually have some averaging going on, but also big enough that it has a chance of containing a distribution that makes sense for us. So let's start with the simplest possible example, which is when the Xi's belong to {0, 1}. And as I said, here we don't have a choice. The distribution of those guys has to be Bernoulli. And since they are IID, they all share the same p. So that's definitely the simplest possible thing I could think of. They are just Bernoulli(p). And so all I would have to figure out in this case is p. And this is the simplest case. And unsurprisingly, it has the simplest answer, right?
We will come back to this example when we study maximum likelihood estimators or method of moments estimators. But at the end of the day, what we will do is always the naive estimator you would come up with: the proportion of 1s. And this will be, in pretty much all respects, the best estimator you can think of. All right? So then we're going to try to assess its performance. And we saw how to do that in the first chapter as well. So this problem here, somehow, is completely understood. We'll come back to it, but there's nothing fancy that is going to happen. But now I could have some more complicated things. For example, in the example of the students, my Xi's belong to the sequence of integers 1, 2, 3, et cetera, which is also denoted by N, maybe without the 0 if you don't want to put 0 in there: the positive integers. Or I could actually just put in some prior knowledge about how much time humans have to have families. But maybe some people thought of their college mates as being their brothers and sisters.
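Since that proportion-of-1s estimator keeps coming back, here is a minimal numeric sketch of it (the 0/1 data below is hypothetical, not the class survey): for IID Bernoulli(p) observations, the sample proportion of 1s, which is just the sample mean, estimates p.

```python
# Estimate the Bernoulli parameter p by the proportion of 1s in the
# sample (equivalently, the sample mean of 0/1 observations).
def estimate_p(xs):
    """xs: list of 0/1 observations; returns the proportion of 1s."""
    return sum(xs) / len(xs)

# Hypothetical 0/1 data, e.g. answers to a yes/no question.
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
p_hat = estimate_p(sample)
print(p_hat)  # 0.7
```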
And one student would actually put 465 siblings, because we're all good friends. Or maybe they actually think that all their Facebook contacts are their siblings. And so you never know what's going to happen. So maybe you want to account for this, but maybe you know that people are reasonable, and they will actually give you something like this. Now intuitively, maybe you would say, well, why would you bother doing this if you're not really sure about the 20? But I think that probably all of you intuitively guessed that it is a good idea to put in this kind of assumption rather than allowing for any number in the first place, because this eventually gets injected into the precision of our estimators. If I allow anything, it's going to be more complicated for me to get an accurate estimator. If I know that the numbers are either 1 or 2, then I'm actually going to be slightly more accurate as well. Because if, for example, somebody puts a 5, I can remove it, and then it's not going to corrupt my estimator.
All right, so now let's say we agree that we have numbers. And here I put seven numbers, OK? So I just said, well, let's assume that the numbers I'm going to get are going to be 1 all the way to this number that I denote by "larger than or equal to 7", which is a placeholder for any number larger than or equal to 7, OK? Because maybe I don't want to distinguish between people that have 9 or 25 siblings. OK, and so now this is a distribution on seven possible values, a discrete distribution. And you know from your probability class that the way you describe this distribution is using the probability mass function, or PMF. So that's how we describe a discrete distribution. And the PMF is just a list of numbers, right? So as I wrote here, you have a list of numbers. In one row, you write the possible values that your random variable can take, and in the other you write the probability that your random variable takes each value. So the possible values are 1, 2, 3, all the way to larger than or equal to 7. And then I'm trying to estimate those numbers. Right?
If I give you those numbers, at least up to this compression of all the numbers that are larger than or equal to 7, you have the full description of your distribution. And that is the ultimate goal of statistics, right? The ultimate goal of statistics is to say what distribution your data came from, because that's basically the best you're going to be able to do. Now admittedly, if I started looking at the fraction of 1s, and the fraction of 2s, and the fraction of 3s, et cetera, I would eventually get those numbers. Just like looking at the fraction of 1s gave me a good estimate for p in the Bernoulli case, it would do the same in this case, right? It's a pretty intuitive idea. It's just the law of large numbers. Everybody agrees with that? If I look at the proportion of 1s, the proportion of 2s, the proportion of 3s, that should give me something that gets closer and closer to what I want as my sample size increases. But the problem is when my sample size is not huge: here I have seven numbers to estimate.
And if I have 20 observations, the ratio is not really in my favor. With 20 observations to estimate seven parameters, some of them are going to be pretty off, typically the ones for the large values. If you have only 20 students, look at the list of numbers. I don't know how many numbers I have, but it's probably close to 20, maybe 15 or something. And if you look at this list, nobody has four or more siblings, right? There's no such person. So that means that from this data set, my estimates (those numbers I denote by, say, p1, p2, p3, et cetera), the estimate p4 hat, would be equal to what from this data? 0, right? And p5 hat would be equal to 0, and p6 hat would be equal to 0. And the estimate of p-larger-than-or-equal-to-7 would be equal to 0. That would be my estimate from this data set. So maybe this is not ideal. Maybe I want to pull some information from the people who have fewer siblings to try to make a guess which is probably slightly better for the larger values, right?
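To see the issue concretely, here is a minimal sketch with made-up sibling counts (not the actual survey data): estimating each of the seven probabilities by its raw sample proportion assigns exactly 0 to every value that nobody happened to report.

```python
from collections import Counter

def empirical_pmf(data, kmax=7):
    """Estimate P(X = k) for k = 1, ..., kmax by raw sample proportions,
    lumping all values >= kmax into the last bin."""
    capped = [min(x, kmax) for x in data]
    counts = Counter(capped)
    n = len(data)
    return {k: counts.get(k, 0) / n for k in range(1, kmax + 1)}

# Hypothetical class answers: nobody reports 4 or more siblings.
data = [2, 1, 2, 3, 1, 1, 2, 2, 3, 1, 2, 1, 3, 2, 2]
pmf_hat = empirical_pmf(data)
print(pmf_hat[4], pmf_hat[7])  # 0.0 0.0
```

The estimates for the unseen values 4 through "larger than or equal to 7" all come out to exactly 0, even though those proportions are surely positive in the population.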
It's pretty clear that on average there are more than 0: the proportion of the population of households that have four children or more is definitely more than 0, all right? So that means my data set is not representative there, and what I'm going to try to do is find a model that uses the data I have for the smaller values, the ones I can observe, and pushes that information up to the other ones. And so what we can do is reduce those parameters to something that's well understood. And this is part of the modeling that I talked about in the first place. Now, how do you succinctly describe a count of something? Well, one thing that you do is use the Poisson distribution, right? Why Poisson? There are many reasons. Again, that's part of statistical modeling. But once you know that you have a count of something that can be modeled by a Poisson, why not try a Poisson? You could just fit a Poisson. And the Poisson is something that looks like this. And I guess you've all seen it.
If X follows a Poisson distribution with parameter lambda, then the probability that X is equal to little x is lambda to the x, over x factorial, times e to the minus lambda: P(X = x) = (lambda^x / x!) e^(-lambda). OK? And if you did the sheet that I gave you on the first day, you can check those numbers. So this is, of course, for x = 0, 1, et cetera, right? So x ranges over the natural integers. And if you sum this thing from x = 0 to infinity, you get e to the lambda. And so the exponentials cancel, and you have a sum which is equal to 1, which is indeed a PMF. But what's key about this PMF is that it never takes the value 0. This thing is always strictly positive. So whatever value of lambda I find from this data will give me something that's certainly more interesting than just putting the value 0. But more importantly, rather than having to estimate seven parameters and, as a consequence, having to estimate 1, 2, 3, 4 of them as being equal to 0, I have only one parameter to estimate, which is lambda. The problem with doing this is that now lambda may not be something as simple to find as computing the average number. Right?
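A quick numeric check of those two facts, as a sketch using only the standard library: the Poisson PMF is strictly positive at every x, and its values sum to 1.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = lam**x / x! * exp(-lam), for x = 0, 1, 2, ..."""
    return lam ** x / math.factorial(x) * math.exp(-lam)

lam = 2.0
# Strictly positive everywhere, even far beyond the observed values.
assert all(poisson_pmf(x, lam) > 0 for x in range(50))
# The total mass is 1 (truncating the infinite sum at x = 100).
total = sum(poisson_pmf(x, lam) for x in range(100))
print(round(total, 10))  # 1.0
```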
In this case, it will be. But in many instances, it's actually not clear that, with this parametrization by lambda that I chose, I'm going to be able to estimate lambda just by computing the average number that I get. Here it will be the case. But when it's not, remember the example of the exponential we did in the last lecture: we could use the delta method and things like that to estimate it. All right, so here's modeling 101. The purpose of modeling is to restrict the space of possible distributions to a subspace that's actually plausible, but much simpler for me to estimate. So we went from all distributions on seven values, which is a large space (that's a lot of things), to something which is just one number. And this number is positive. Any questions about the purpose of doing this? OK, so we're going to have to do a little bit of formalism now. This is a statistics classroom, and I'm not going to want to talk about the Poisson model specifically every single time. I'm going to want to talk about generic models.
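In this Poisson case, the average does work: the sample mean is the natural estimate of lambda (and it turns out to coincide with the maximum likelihood and method-of-moments estimators studied later). A minimal sketch with hypothetical counts:

```python
def fit_poisson_lambda(data):
    """Estimate the Poisson parameter lambda by the sample mean."""
    return sum(data) / len(data)

# Hypothetical sibling counts, not the actual survey data.
data = [2, 1, 2, 3, 1, 1, 2, 2, 3, 1]
lam_hat = fit_poisson_lambda(data)
print(lam_hat)  # 1.8
```

One number summarizes the whole fitted distribution: plugging lam_hat into the PMF gives a positive estimated probability for every count, including the values never observed.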
And then you're going to be able to plug in your favorite word (Poisson, binomial, exponential, uniform, all these words that you've seen), you're going to be able to plug it in there. But we're just going to have some generic notation and some generic terminology for a statistical model. All right? So here is the formal definition. I'm going to go through it with you. OK, so the definition is that of a statistical model. Sorry, that's a statistical experiment, I should say. So a statistical experiment is actually just a pair: E, which is a set, and a family of distributions P theta, where theta ranges in some set capital Theta. OK? So I hope you're up to date with your Greek letters: the small theta goes with the capital Theta. And I don't have the best handwriting, so if you don't see something, just ask me. And so now each of these guys is a probability distribution. All right?
So for example, this could be a Poisson with parameter theta, or a Bernoulli with parameter theta, OK, or an exponential with parameter, I don't know, 1 over theta squared if you want. OK, but they're just indexed by theta. And for each theta, this completely describes the distribution. It could be more complicated. This theta could be a pair, a (mu, sigma squared). And that could give you some normal N(mu, sigma squared). OK, so anything where, rather than giving you a full distribution, I can compress it into a parameter. But it could be worse. It could be this guy here, right? Theta could be (p1, ..., p-larger-than-or-equal-to-7). And my distribution could just be something that has PMF (p1, ..., p-larger-than-or-equal-to-7). That's another parameter. This one is seven-dimensional. This one is two-dimensional. And all these other guys are just one-dimensional. All of these are parameters. Is that clear?
What's important here is that once they give you theta, you know exactly all the probabilities associated with this random variable. You know its distribution perfectly. So this is the definition. Is that clear? Is there a question about this distribution... about this definition, sorry? All right. So really, the key thing is the statistical model associated to a statistical experiment. OK? So let's just see some examples. That's probably better because, again, the formalism on its own is never really clear. Actually, that's the next slide. OK, so there are two things we need to assume. The purpose of a statistical model is that once I estimate the parameter, I know exactly what distribution the data has, OK? So it means that I could potentially have several parameters that give me the same distribution, and that would still be fine, because I could estimate one guy, or I could estimate the other guy, and I would still recover the underlying distribution of my data.
The problem is that this creates really annoying theoretical problems: things don't work, the algorithms won't work, the guarantees won't work. And so what we typically assume is that the model is so-called well-specified. Sorry, that's not well-specified; I'm jumping ahead of myself. OK, well-specified means that the distribution of your data is actually one of those guys. OK? So, some vocabulary: well-specified means that, for my observations X, there exists a theta in capital Theta such that X follows P sub theta. I should put a double bar there. OK, so that's what well-specified means. It means that the distribution of your actual data is just one of those guys. This is a bit strong of an assumption. It's strong in the sense that... I don't know if you've heard this saying. I can tell you who it's attributed to, but that probably means that this person did not come up with it. The saying is that all models are wrong, but some of them are useful.
441 00:20:37,890 --> 00:20:40,690 All right, so all models are wrong 442 00:20:40,690 --> 00:20:44,470 means that maybe it's not true that this Poisson distribution 443 00:20:44,470 --> 00:20:47,965 that I assume for the number of siblings for college students-- 444 00:20:47,965 --> 00:20:50,350 maybe that's not perfectly correct. 445 00:20:50,350 --> 00:20:53,210 Maybe there's a spike at three, right? 446 00:20:53,210 --> 00:20:55,900 Maybe there's a spike at one, because you know, 447 00:20:55,900 --> 00:20:58,180 maybe those are slightly more educated families. 448 00:20:58,180 --> 00:20:59,260 They have fewer children. 449 00:20:59,260 --> 00:21:02,260 Maybe this is actually not exactly perfect. 450 00:21:02,260 --> 00:21:04,496 But it's probably good enough for our purposes. 451 00:21:04,496 --> 00:21:05,870 And when we make this assumption, 452 00:21:05,870 --> 00:21:07,750 we're actually assuming that the data really 453 00:21:07,750 --> 00:21:09,756 comes from a Poisson model. 454 00:21:09,756 --> 00:21:11,380 There is a lot of research that goes on 455 00:21:11,380 --> 00:21:14,380 about misspecified models and that tells you 456 00:21:14,380 --> 00:21:16,686 how well you're doing with the model that's 457 00:21:16,686 --> 00:21:18,310 closest to the actual distribution. 458 00:21:18,310 --> 00:21:19,630 So that's pretty much it. 459 00:21:19,630 --> 00:21:21,049 Yeah? 460 00:21:21,049 --> 00:21:22,470 AUDIENCE: [INAUDIBLE]. 461 00:21:24,130 --> 00:21:25,970 PHILIPPE RIGOLLET: So my data-- 462 00:21:25,970 --> 00:21:29,620 so it's always the way I denote one 463 00:21:29,620 --> 00:21:31,480 of the generic observations, right? 464 00:21:31,480 --> 00:21:36,100 So my observations are x1 through xn. 465 00:21:36,100 --> 00:21:39,670 And they're IID with distribution p-- 466 00:21:39,670 --> 00:21:40,990 always. 467 00:21:40,990 --> 00:21:42,610 So x is just one of those guys. 468 00:21:42,610 --> 00:21:46,720 I don't want to write x5 or x4.
469 00:21:46,720 --> 00:21:47,490 They're IID. 470 00:21:47,490 --> 00:21:49,840 So they all have the same distribution. 471 00:21:49,840 --> 00:21:54,780 So OK-- no, no, no. 472 00:21:54,780 --> 00:21:55,490 They're all IID. 473 00:21:55,490 --> 00:21:57,490 So they all have the same p theta. 474 00:21:57,490 --> 00:21:59,150 They'll have the same p, which means 475 00:21:59,150 --> 00:22:00,840 they'll have the same p theta. 476 00:22:00,840 --> 00:22:02,250 So I can pick any one of them. 477 00:22:02,250 --> 00:22:05,470 So I'd just remove the index just so we're clear. 478 00:22:05,470 --> 00:22:06,940 OK? 479 00:22:06,940 --> 00:22:09,580 So when I write x, I just mean think of x1. 480 00:22:09,580 --> 00:22:10,540 Right, they're IID. 481 00:22:10,540 --> 00:22:12,607 I can pick whichever I want. 482 00:22:12,607 --> 00:22:13,690 I'm not going to write x1. 483 00:22:13,690 --> 00:22:14,648 It's going to be weird. 484 00:22:17,070 --> 00:22:18,470 OK? 485 00:22:18,470 --> 00:22:19,670 Is that clear? 486 00:22:19,670 --> 00:22:20,780 OK. 487 00:22:20,780 --> 00:22:26,522 So this particular theta is called the true parameter. 488 00:22:34,240 --> 00:22:37,060 Sometimes, since we're going to want to vary theta, 489 00:22:37,060 --> 00:22:41,610 we might denote it by theta star as opposed 490 00:22:41,610 --> 00:22:43,750 to theta hat, which is always our estimator. 491 00:22:43,750 --> 00:22:47,250 But I'll keep it to be theta for now. 492 00:22:47,250 --> 00:22:50,280 And so the aim of this statistical experiment 493 00:22:50,280 --> 00:22:52,500 is to estimate theta so that once I actually 494 00:22:52,500 --> 00:22:56,010 plug in theta in the form of my distribution, for example, 495 00:22:56,010 --> 00:22:58,020 I could plug in theta here. 496 00:22:58,020 --> 00:23:01,600 So theta here was actually lambda.
497 00:23:01,600 --> 00:23:03,600 So once I estimate this guy, I would plug it in, 498 00:23:03,600 --> 00:23:06,183 and I would know the probability that my random variable takes 499 00:23:06,183 --> 00:23:09,240 any value, by just putting the lambda hat and the lambda hat 500 00:23:09,240 --> 00:23:10,700 here. 501 00:23:10,700 --> 00:23:11,240 OK? 502 00:23:11,240 --> 00:23:12,960 So my goal is going to be to estimate 503 00:23:12,960 --> 00:23:16,080 this guy so that I can actually compute those distributions. 504 00:23:16,080 --> 00:23:18,520 But actually, we'll see, for example, 505 00:23:18,520 --> 00:23:21,540 when we talk about regression that this parameter actually 506 00:23:21,540 --> 00:23:23,340 has a meaning in many instances. 507 00:23:23,340 --> 00:23:26,670 And so just knowing the parameter itself 508 00:23:26,670 --> 00:23:30,360 intuitively or say more-- 509 00:23:30,360 --> 00:23:33,680 let's say more so than just computing probabilities, 510 00:23:33,680 --> 00:23:36,480 will actually tell us something about the process. 511 00:23:36,480 --> 00:23:38,880 For example, we're going to run linear regression. 512 00:23:38,880 --> 00:23:40,260 And when we do linear regression, 513 00:23:40,260 --> 00:23:41,490 there's going to be some coefficients 514 00:23:41,490 --> 00:23:42,810 in the linear regression. 515 00:23:42,810 --> 00:23:44,250 And the value of this coefficient 516 00:23:44,250 --> 00:23:47,460 is actually telling me what is the sensitivity of the response 517 00:23:47,460 --> 00:23:50,620 that I'm looking at to this particular input. 518 00:23:50,620 --> 00:23:51,120 All right? 519 00:23:51,120 --> 00:23:52,950 So just knowing if this number is larger 520 00:23:52,950 --> 00:23:55,050 or if this number is small is actually 521 00:23:55,050 --> 00:23:58,180 going to be useful for us to just look at this guy. 522 00:23:58,180 --> 00:23:58,680 All right? 
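The plug-in idea is easy to sketch in a few lines of Python (the sibling counts below are invented for illustration): estimate lambda by the sample mean, then plug lambda hat into the Poisson PMF to compute any probability you want.

```python
import math

def poisson_pmf(lam, k):
    # P(X = k) for X ~ Poisson(lam): e^(-lam) * lam^k / k!
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical sibling counts collected from students.
data = [2, 1, 3, 2, 1, 2, 4, 1, 2, 2]

# Plug-in: estimate lambda by the sample mean...
lam_hat = sum(data) / len(data)

# ...then plug lambda hat into the PMF to estimate any probability.
prob_two_siblings = poisson_pmf(lam_hat, 2)
print(lam_hat, round(prob_two_siblings, 3))
```

Once lambda hat is computed, every probability in the model follows from it; that is exactly what "the parameter characterizes the distribution" buys you.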
523 00:23:58,680 --> 00:24:00,485 So there's going to be some instances where 524 00:24:00,485 --> 00:24:01,610 it's going to be important. 525 00:24:01,610 --> 00:24:04,026 Sometimes we're going to want to know if this parameter is 526 00:24:04,026 --> 00:24:07,260 larger or smaller than something or if it's equal to something 527 00:24:07,260 --> 00:24:08,647 or not equal to something. 528 00:24:08,647 --> 00:24:10,730 And those things are also important-- for example, 529 00:24:10,730 --> 00:24:13,091 if theta actually measures the true-- 530 00:24:13,091 --> 00:24:13,590 right? 531 00:24:13,590 --> 00:24:16,380 So theta is the true unknown parameter-- true efficacy 532 00:24:16,380 --> 00:24:18,010 of a drug. 533 00:24:18,010 --> 00:24:18,510 OK? 534 00:24:18,510 --> 00:24:21,720 Let's say I want to know what the true efficacy of a drug is. 535 00:24:21,720 --> 00:24:25,084 And what I'm going to want to know is maybe it's a score. 536 00:24:25,084 --> 00:24:27,500 Maybe I'm going to want to know if theta is larger than 2. 537 00:24:27,500 --> 00:24:30,166 Maybe I want to know if theta is the average number of siblings. 538 00:24:30,166 --> 00:24:32,080 Is this true number larger than 2 or not? 539 00:24:32,080 --> 00:24:32,580 Right? 540 00:24:32,580 --> 00:24:37,410 Maybe I am interested in knowing if college students come from-- 541 00:24:37,410 --> 00:24:40,217 so maybe from a sociological perspective, 542 00:24:40,217 --> 00:24:42,300 I'm interested in knowing if college students come 543 00:24:42,300 --> 00:24:45,375 from households with more than two children. 544 00:24:45,375 --> 00:24:47,070 All right, so those can be the questions 545 00:24:47,070 --> 00:24:48,814 that I may ask myself. 546 00:24:48,814 --> 00:24:50,730 I'm going to want to know maybe theta is going 547 00:24:50,730 --> 00:24:51,940 to be equal to 1/2 or not. 
548 00:24:51,940 --> 00:24:54,810 So maybe for a drug efficacy, is it completely 549 00:24:54,810 --> 00:24:57,640 standard-- maybe for elections. 550 00:24:57,640 --> 00:24:59,220 Is the proportion of the population 551 00:24:59,220 --> 00:25:02,380 that is going to vote for this particular candidate 552 00:25:02,380 --> 00:25:03,420 equal to 0.5? 553 00:25:03,420 --> 00:25:05,584 Or is it different from 0.5? 554 00:25:05,584 --> 00:25:07,250 OK, and I can think of different things. 555 00:25:07,250 --> 00:25:09,025 When I'm talking about the regression, 556 00:25:09,025 --> 00:25:11,400 I'm going to want to test if this coefficient is actually 557 00:25:11,400 --> 00:25:13,650 0 or not, because if it's 0, it means 558 00:25:13,650 --> 00:25:17,200 that the variable that's in front of it actually goes out. 559 00:25:17,200 --> 00:25:18,900 And so those are things we're testing. 560 00:25:18,900 --> 00:25:22,050 Actually having this very specific yes/no answer 561 00:25:22,050 --> 00:25:26,760 is going to give me a huge intuition or huge understanding 562 00:25:26,760 --> 00:25:29,850 of what's going on in the phenomenon that I observe. 563 00:25:29,850 --> 00:25:32,850 But actually, since the questions are so precise, 564 00:25:32,850 --> 00:25:34,204 it's going to be much more-- 565 00:25:34,204 --> 00:25:36,370 I'm going to be much better at answering them rather 566 00:25:36,370 --> 00:25:38,520 than giving you an estimate for theta 567 00:25:38,520 --> 00:25:41,240 with some confidence around it. 568 00:25:41,240 --> 00:25:44,870 All right, it's sort of the same principle as trying to reduce. 569 00:25:44,870 --> 00:25:46,620 What you're trying to do as a statistician 570 00:25:46,620 --> 00:25:49,830 is to inject as much knowledge about the question and about 571 00:25:49,830 --> 00:25:52,740 the problem that you can so that the data has 572 00:25:52,740 --> 00:25:54,450 to do a minimal job. 
573 00:25:54,450 --> 00:25:58,300 And henceforth, you actually need less data. 574 00:25:58,300 --> 00:26:00,720 So from now on, we will always assume-- 575 00:26:00,720 --> 00:26:03,030 and this is because this is an intro stats class-- 576 00:26:03,030 --> 00:26:05,550 we will always assume that theta-- 577 00:26:05,550 --> 00:26:09,180 the set of parameters is a subset of r to the d. 578 00:26:09,180 --> 00:26:11,940 That means that theta is a vector 579 00:26:11,940 --> 00:26:16,320 with a finite number of coordinates. 580 00:26:16,320 --> 00:26:17,970 Why do I say this? 581 00:26:17,970 --> 00:26:20,280 Well, this is called a parametric model. 582 00:26:20,280 --> 00:26:31,750 So it's called a parametric model or sometimes 583 00:26:31,750 --> 00:26:35,022 parametric statistics. 584 00:26:35,022 --> 00:26:37,480 Actually, we don't really talk about parametric statistics. 585 00:26:37,480 --> 00:26:40,330 But we talk a lot about nonparametric statistics 586 00:26:40,330 --> 00:26:42,340 or a non-parametric model. 587 00:26:42,340 --> 00:26:45,852 Can somebody think of a model which is non-parametric? 588 00:26:53,090 --> 00:26:56,190 For example, in the siblings example, 589 00:26:56,190 --> 00:27:01,160 if I did not cap the number of siblings to 7, 590 00:27:01,160 --> 00:27:06,350 but I let this list go to infinity, 591 00:27:06,350 --> 00:27:09,530 I would have an infinite number of parameters to estimate. 592 00:27:09,530 --> 00:27:12,207 Very likely, the last ones would be 0. 593 00:27:12,207 --> 00:27:14,540 But still, I would have an infinite number of parameters 594 00:27:14,540 --> 00:27:15,295 to estimate. 595 00:27:15,295 --> 00:27:17,400 So this would not be a parametric model 596 00:27:17,400 --> 00:27:19,430 if I just let this list of things 597 00:27:19,430 --> 00:27:21,740 to be estimated be infinite.
598 00:27:21,740 --> 00:27:24,580 But there are other classes that are actually infinite 599 00:27:24,580 --> 00:27:26,990 and cannot be represented by vectors. 600 00:27:26,990 --> 00:27:29,700 For example, functions, right? 601 00:27:29,700 --> 00:27:38,870 If I tell you my model, p sub f, is just 602 00:27:38,870 --> 00:27:43,880 the distributions of x-- the probability distributions-- 603 00:27:43,880 --> 00:27:48,187 that have density f, right? 604 00:27:48,187 --> 00:27:50,270 So what I know is that the density is non-negative 605 00:27:50,270 --> 00:27:52,100 and that it integrates to one, right? 606 00:27:52,100 --> 00:27:54,900 That's all I know about densities. 607 00:27:54,900 --> 00:27:57,620 Well, f is not something you're going 608 00:27:57,620 --> 00:28:01,250 to be able to describe with a finite number of values, right? 609 00:28:01,250 --> 00:28:03,610 The set of all possible functions is a huge set. 610 00:28:03,610 --> 00:28:08,730 It's certainly not representable by 10 numbers. 611 00:28:08,730 --> 00:28:12,470 And so non-parametric estimation is typically 612 00:28:12,470 --> 00:28:14,600 when you actually want to parametrize this 613 00:28:14,600 --> 00:28:17,600 by a large class of functions. 614 00:28:17,600 --> 00:28:20,810 And so for example, histograms are the prime tool 615 00:28:20,810 --> 00:28:22,970 of non-parametric estimation, because when 616 00:28:22,970 --> 00:28:24,470 you fit a histogram to data, you're 617 00:28:24,470 --> 00:28:26,929 trying to estimate the density of your data, 618 00:28:26,929 --> 00:28:28,470 but you're not trying to represent it 619 00:28:28,470 --> 00:28:31,920 as a finite number of points. 620 00:28:31,920 --> 00:28:35,040 That's really-- I mean, effectively, 621 00:28:35,040 --> 00:28:36,390 you have to represent it, right? 622 00:28:36,390 --> 00:28:38,480 So you actually truncate somewhere and just 623 00:28:38,480 --> 00:28:40,670 say those things are not going to matter. 624 00:28:40,670 --> 00:28:41,360 All right?
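A histogram density estimate can be sketched by hand in Python (the data and the number of bins below are arbitrary choices made for illustration): bin the observations and normalize the counts so the resulting step function integrates to 1.

```python
def histogram_density(data, bins=10):
    """Crude nonparametric density estimate: bin counts
    normalized so that the step function integrates to 1."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # clamp the right endpoint into the last bin
        i = min(int((x - lo) / width), bins - 1)
        counts[i] += 1
    n = len(data)
    return [c / (n * width) for c in counts], width

data = [0.1, 0.2, 0.25, 0.5, 0.55, 0.6, 0.9, 1.0]
heights, width = histogram_density(data, bins=4)
# The step function integrates to 1, like a density should.
print(sum(h * width for h in heights))
```

The "parameters" here are the bar heights, and their number grows as you refine the bins, which is exactly why this sits on the nonparametric side.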
625 00:28:41,360 --> 00:28:44,690 But really the key thing is that this is non-parametric 626 00:28:44,690 --> 00:28:47,360 where you have a potentially infinite number of parameters. 627 00:28:47,360 --> 00:28:49,490 Whereas we're going to only talk about finite ones. 628 00:28:49,490 --> 00:28:53,790 And actually, the dimension in the overwhelming majority of cases 629 00:28:53,790 --> 00:28:55,320 is going to be 1. 630 00:28:55,320 --> 00:28:58,404 So theta is going to be a subset of r1. 631 00:28:58,404 --> 00:29:00,320 OK, we're going to be interested in estimating 632 00:29:00,320 --> 00:29:03,770 one parameter just like the parameter of a Poisson 633 00:29:03,770 --> 00:29:05,760 or the parameter of an exponential-- 634 00:29:05,760 --> 00:29:07,460 the parameter of Bernoulli. 635 00:29:07,460 --> 00:29:09,539 But for example, really, we're going 636 00:29:09,539 --> 00:29:11,330 to be interested in estimating mu and sigma 637 00:29:11,330 --> 00:29:12,730 square for the normal. 638 00:29:17,880 --> 00:29:19,850 So here are some statistical models. 639 00:29:19,850 --> 00:29:20,350 All right? 640 00:29:20,350 --> 00:29:23,040 So I'm going to go through them with you. 641 00:29:31,360 --> 00:29:35,050 So if I tell you I observe-- 642 00:29:35,050 --> 00:29:38,940 I'm interested in understanding-- 643 00:29:38,940 --> 00:29:42,040 I'm still [INAUDIBLE] I'm interested in understanding 644 00:29:42,040 --> 00:29:44,680 the proportion of people who kiss by bending 645 00:29:44,680 --> 00:29:46,240 their head to the right. 646 00:29:46,240 --> 00:29:50,050 And for that, I collected n observations. 647 00:29:50,050 --> 00:29:53,050 And I'm interested in making some inference 648 00:29:53,050 --> 00:29:54,880 in the statistical model. 649 00:29:54,880 --> 00:29:58,000 My question to you is, what is the statistical model?
650 00:29:58,000 --> 00:30:00,050 Well, if you want to read the statistical model, 651 00:30:00,050 --> 00:30:02,170 you're going to have to write this E-- 652 00:30:02,170 --> 00:30:03,960 oh, sorry, I never told you what E was. 653 00:30:03,960 --> 00:30:06,610 OK, well actually just go to the examples, 654 00:30:06,610 --> 00:30:09,350 and then you'll know what E is. 655 00:30:09,350 --> 00:30:14,532 So you're going to have to write to me an E and a p theta, OK? 656 00:30:14,532 --> 00:30:16,240 So let's start with the Bernoulli trials. 657 00:30:25,180 --> 00:30:29,480 So this e here is called the sample space. 658 00:30:33,290 --> 00:30:37,040 And in normal people's words, 659 00:30:37,040 --> 00:30:44,980 it just means the space or the set in which x lives-- 660 00:30:44,980 --> 00:30:48,200 and back to your question, x is just a generic observation. 661 00:30:51,620 --> 00:30:56,110 OK, and hopefully, this is the smallest set you can think of. 662 00:30:56,110 --> 00:30:58,560 OK, so for example, for Bernoulli trials, 663 00:30:58,560 --> 00:31:01,270 I'm going to observe a sequence of 0's and 1's. 664 00:31:01,270 --> 00:31:04,360 So my experiment is going to be-- as written on the board, 665 00:31:04,360 --> 00:31:06,880 is going to be 1, 0, 1. 666 00:31:06,880 --> 00:31:08,657 And then the probability distributions 667 00:31:08,657 --> 00:31:10,240 are going to be, well, it's just going 668 00:31:10,240 --> 00:31:13,290 to be the Bernoulli distributions indexed 669 00:31:13,290 --> 00:31:14,240 by p, right? 670 00:31:14,240 --> 00:31:17,050 So rather than writing p sub p, I'm 671 00:31:17,050 --> 00:31:20,650 going to write it as Bernoulli p, 672 00:31:20,650 --> 00:31:24,145 because it's clear what I mean when I write that. 673 00:31:24,145 --> 00:31:25,537 Is everybody happy? 674 00:31:25,537 --> 00:31:27,370 Actually, I need to tell you something more. 675 00:31:27,370 --> 00:31:28,880 This is a family of distributions.
676 00:31:28,880 --> 00:31:29,932 So I need p. 677 00:31:29,932 --> 00:31:31,390 And maybe I don't want to have a p 678 00:31:31,390 --> 00:31:33,370 that takes value 0 or 1, right? 679 00:31:33,370 --> 00:31:34,390 It doesn't make sense. 680 00:31:34,390 --> 00:31:37,660 I would probably not look at this problem 681 00:31:37,660 --> 00:31:40,660 if I anticipated that everybody would kiss to the right. 682 00:31:40,660 --> 00:31:43,280 Or that everybody would kiss to the left. 683 00:31:43,280 --> 00:31:45,400 So I am going to assume that p is in 0, 1, 684 00:31:45,400 --> 00:31:47,930 but excluding 0 and 1. 685 00:31:47,930 --> 00:31:48,430 OK? 686 00:31:48,430 --> 00:31:52,494 So that's the statistical model for a Bernoulli trial. 687 00:32:00,250 --> 00:32:03,180 OK, now the next one, what do we have? 688 00:32:03,180 --> 00:32:03,991 Exponential. 689 00:32:03,991 --> 00:32:04,490 OK? 690 00:32:09,630 --> 00:32:12,684 OK, so when I have exponential distributions, 691 00:32:12,684 --> 00:32:14,850 what is the support of the exponential distribution? 692 00:32:14,850 --> 00:32:17,150 What value is it going to take? 693 00:32:20,520 --> 00:32:23,190 0 to infinity, right? 694 00:32:23,190 --> 00:32:26,700 So what I have is that my sample space 695 00:32:26,700 --> 00:32:28,740 is the values that my random variable can take. 696 00:32:28,740 --> 00:32:34,290 So it's-- well, actually I can remove the 0 again-- 697 00:32:34,290 --> 00:32:37,140 0 to plus infinity. 698 00:32:37,140 --> 00:32:39,450 And then the family of distributions 699 00:32:39,450 --> 00:32:43,320 that I have are exponential with parameter lambda. 700 00:32:43,320 --> 00:32:45,090 And again, maybe you've seen me switching 701 00:32:45,090 --> 00:32:49,147 from p, to lambda, to theta, to mu, to sigma square. 702 00:32:49,147 --> 00:32:50,730 Honestly, you can do whatever you want.
703 00:32:50,730 --> 00:32:53,430 But it's just that it's customary to have this particular group 704 00:32:53,430 --> 00:32:54,740 of letters. 705 00:32:54,740 --> 00:32:55,620 OK? 706 00:32:55,620 --> 00:32:58,950 And so the parameters of an exponential 707 00:32:58,950 --> 00:33:02,210 are just positive numbers. 708 00:33:02,210 --> 00:33:02,710 OK? 709 00:33:02,710 --> 00:33:08,714 And that's my exponential model. 710 00:33:08,714 --> 00:33:09,630 What is the third one? 711 00:33:09,630 --> 00:33:11,960 Can somebody tell me? 712 00:33:11,960 --> 00:33:12,960 Poisson, OK? 713 00:33:16,080 --> 00:33:20,230 OK, so Poisson-- is a Poisson random variable 714 00:33:20,230 --> 00:33:21,545 discrete or continuous? 715 00:33:27,720 --> 00:33:29,740 Go back to your probability. 716 00:33:29,740 --> 00:33:34,150 All right, so the answer being the opposite of continuous-- 717 00:33:34,150 --> 00:33:36,790 good job. 718 00:33:36,790 --> 00:33:38,471 All right, so it's going to be-- 719 00:33:38,471 --> 00:33:39,720 what value can a Poisson take? 720 00:33:43,157 --> 00:33:44,490 All the natural integers, right? 721 00:33:44,490 --> 00:33:47,434 So 0, 1, 2, 3, all the way to infinity. 722 00:33:47,434 --> 00:33:48,850 We don't have any control over this. 723 00:33:48,850 --> 00:33:53,830 So I'm going to write this as n without 0. 724 00:33:53,830 --> 00:33:55,882 I think in the slides, it's n-star maybe. 725 00:33:55,882 --> 00:33:57,340 Actually, no, it can take value 0. 726 00:33:57,340 --> 00:33:57,840 I'm sorry. 727 00:33:57,840 --> 00:33:59,630 This actually takes value 0 quite a lot. 728 00:33:59,630 --> 00:34:03,190 That's typically, in many instances, actually the mode. 729 00:34:03,190 --> 00:34:05,780 So it's n, and then I'm going to write it 730 00:34:05,780 --> 00:34:08,500 as Poisson with parameter-- well, 731 00:34:08,500 --> 00:34:11,350 here it's again lambda as a parameter. 732 00:34:11,350 --> 00:34:13,492 And lambda can take any positive value.
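One way to make the pair of sample space and family of distributions concrete in code (purely an illustration, with made-up names, not standard library objects) is to record each model as a sample-space description plus a predicate saying which parameter values are allowed.

```python
from collections import namedtuple

# Hypothetical representation of a statistical model: a sample-space
# description and a predicate for the allowed parameter values.
Model = namedtuple("Model", ["sample_space", "param_ok"])

bernoulli = Model("{0, 1}", lambda p: 0 < p < 1)
exponential = Model("(0, inf)", lambda lam: lam > 0)
poisson = Model("N = {0, 1, 2, ...}", lambda lam: lam > 0)

# p = 0 and p = 1 are excluded from the Bernoulli model:
print(bernoulli.param_ok(0.5), bernoulli.param_ok(1.0))
```

The point is only that each model is fully specified by these two pieces: where the observations live, and which parameters index the family.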
733 00:34:13,492 --> 00:34:13,992 OK? 734 00:34:17,469 --> 00:34:21,280 And that's where you can actually see that the model 735 00:34:21,280 --> 00:34:23,719 that we had for the siblings-- right? 736 00:34:23,719 --> 00:34:27,270 So let me actually just squeeze in the siblings model here. 737 00:34:31,230 --> 00:34:35,920 So that was the bad model that I had in the first place 738 00:34:35,920 --> 00:34:37,210 when I actually kept this. 739 00:34:37,210 --> 00:34:39,106 Let's say we just kept it at 7. 740 00:34:39,106 --> 00:34:40,730 Forget about larger than or equal to 7. 741 00:34:40,730 --> 00:34:42,290 We just assumed it was 7. 742 00:34:42,290 --> 00:34:43,749 What was our sample space? 743 00:34:54,228 --> 00:34:56,699 We said 7. 744 00:34:56,699 --> 00:35:01,530 So it's 1, 2, up to 7, right? 745 00:35:01,530 --> 00:35:04,770 Those were the possible values that this thing would take. 746 00:35:04,770 --> 00:35:06,160 And then what was my-- 747 00:35:06,160 --> 00:35:07,330 what's my parameter space? 748 00:35:10,550 --> 00:35:12,750 So it's going to be a nightmare to write. 749 00:35:12,750 --> 00:35:14,210 But I'm going to write it. 750 00:35:14,210 --> 00:35:18,380 OK, so I'm going to write it as something like the probability 751 00:35:18,380 --> 00:35:22,470 that x is equal to k is equal to p sub k. 752 00:35:26,210 --> 00:35:27,540 OK? 753 00:35:27,540 --> 00:35:33,480 And that's going to be for p. 754 00:35:33,480 --> 00:35:36,150 OK, so that's for all k's, right? 755 00:35:36,150 --> 00:35:38,890 Or for k equal 1 to 7. 756 00:35:38,890 --> 00:35:44,999 And here the index is the set of parameters p1 to p7. 757 00:35:44,999 --> 00:35:47,040 And I know a little more about those guys, right? 758 00:35:47,040 --> 00:35:49,670 I know they're going to be non-negative-- 759 00:35:49,670 --> 00:35:50,910 p sub j non-negative. 760 00:35:50,910 --> 00:35:52,320 And I know that they sum to 1.
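For this discrete model, the natural estimator of p1 through p7 is just the vector of empirical frequencies; a quick Python sketch with invented sibling counts:

```python
from collections import Counter

# Hypothetical sibling counts, capped at 7 as in the lecture.
data = [2, 1, 3, 2, 2, 1, 4, 2, 3, 1]

counts = Counter(data)
n = len(data)
# p_hat[k-1] estimates P(X = k) for k = 1, ..., 7.
p_hat = [counts.get(k, 0) / n for k in range(1, 8)]

# The estimate lands inside the parameter space:
# non-negative entries that sum to 1.
print(p_hat, sum(p_hat))
```

Note that the estimated vector automatically satisfies the two constraints that define the parameter space, non-negativity and summing to 1.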
761 00:35:57,960 --> 00:36:01,770 OK, so maybe writing this, you start 762 00:36:01,770 --> 00:36:05,880 seeing why we like those Poisson and exponential 763 00:36:05,880 --> 00:36:08,010 short notations, because I actually don't have 764 00:36:08,010 --> 00:36:09,530 to write the PMF of a Poisson. 765 00:36:09,530 --> 00:36:10,870 The Poisson is really just this. 766 00:36:10,870 --> 00:36:12,570 But I call it Poisson so I don't have 767 00:36:12,570 --> 00:36:14,600 to rewrite this all the time. 768 00:36:14,600 --> 00:36:17,620 And so here, I did not use a particular form. 769 00:36:17,620 --> 00:36:19,810 So I just have this thing, and that's what it is. 770 00:36:19,810 --> 00:36:24,940 The set of parameters is the set of non-negative numbers-- 771 00:36:24,940 --> 00:36:28,560 p1 to p7-- 772 00:36:28,560 --> 00:36:31,010 that sum to 1, right? 773 00:36:31,010 --> 00:36:34,110 And so this is just a list of numbers 774 00:36:34,110 --> 00:36:37,240 that are non-negative and sum up to 1. 775 00:36:37,240 --> 00:36:39,971 So that's my parameter space. 776 00:36:39,971 --> 00:36:40,470 OK? 777 00:36:40,470 --> 00:36:42,280 So here, that's my theta. 778 00:36:42,280 --> 00:36:45,360 This whole thing here-- 779 00:36:45,360 --> 00:36:47,551 this is my capital theta. 780 00:36:47,551 --> 00:36:48,051 OK? 781 00:36:51,947 --> 00:36:53,760 So that's just the set of parameters 782 00:36:53,760 --> 00:36:55,530 that theta-- the set of parameters 783 00:36:55,530 --> 00:36:58,440 that theta is allowed to take. 784 00:36:58,440 --> 00:37:01,890 OK, and finally, we're going to end with the star of all, 785 00:37:01,890 --> 00:37:03,970 and that's the normal distribution.
786 00:37:03,970 --> 00:37:06,960 And in the normal distribution, you still 787 00:37:06,960 --> 00:37:10,080 have also some flexibility in terms of choices, 788 00:37:10,080 --> 00:37:13,710 because then naturally, the normal distribution 789 00:37:13,710 --> 00:37:16,740 is parametrized by-- 790 00:37:16,740 --> 00:37:19,380 the normal distribution is parametrized by two parameters, 791 00:37:19,380 --> 00:37:19,880 right? 792 00:37:19,880 --> 00:37:20,709 Mean and variance. 793 00:37:26,200 --> 00:37:30,450 So what values can a Gaussian random variable take? 794 00:37:30,450 --> 00:37:33,980 The entire real line, right? 795 00:37:33,980 --> 00:37:35,990 And the set of parameters that it 796 00:37:35,990 --> 00:37:42,200 can take-- so this is going to be n, mu, sigma square. 797 00:37:42,200 --> 00:37:46,190 And mu is going to be positive. 798 00:37:46,190 --> 00:37:49,080 And sigma square is going-- 799 00:37:49,080 --> 00:37:51,720 sorry, mu is going to be in r. 800 00:37:51,720 --> 00:37:55,070 And sigma square is going to be positive. 801 00:37:55,070 --> 00:37:57,414 OK, so again here, that's the way 802 00:37:57,414 --> 00:37:58,580 you're supposed to write it. 803 00:37:58,580 --> 00:38:03,260 If you really want to identify what theta is, 804 00:38:03,260 --> 00:38:08,760 well, theta formally is the set of mu, sigma square such that-- 805 00:38:08,760 --> 00:38:15,821 well, they're in r times 0, infinity, right? 806 00:38:19,120 --> 00:38:22,200 That's just to be formal, but this does the job just fine. 807 00:38:22,200 --> 00:38:22,700 OK? 808 00:38:22,700 --> 00:38:25,820 You don't have to be super formal. 809 00:38:25,820 --> 00:38:28,500 OK, that's not three. 810 00:38:28,500 --> 00:38:30,170 That's like five. 811 00:38:30,170 --> 00:38:32,120 Actually, I just want to write another one. 812 00:38:32,120 --> 00:38:35,130 Let's call it 5-bis. 813 00:38:35,130 --> 00:38:41,030 And 5-bis is just Gaussian with known variance.
814 00:38:46,760 --> 00:38:50,230 And this arises a lot in labs when 815 00:38:50,230 --> 00:38:51,550 you have measurement error-- 816 00:38:51,550 --> 00:38:55,870 when you actually receive your measurement device. 817 00:38:55,870 --> 00:38:57,910 This thing has been tested by the manufacturer 818 00:38:57,910 --> 00:39:00,940 so much that it actually comes on the side of the box. 819 00:39:00,940 --> 00:39:04,030 It says that the standard deviation of your measurements 820 00:39:04,030 --> 00:39:07,574 is going to be 0.23. 821 00:39:07,574 --> 00:39:09,490 OK, and actually why they do this is because they 822 00:39:09,490 --> 00:39:11,290 can brag about accuracy, right? 823 00:39:11,290 --> 00:39:13,510 That's how they sell you this particular device. 824 00:39:13,510 --> 00:39:16,480 And so you actually know exactly what sigma square is. 825 00:39:16,480 --> 00:39:20,230 So once you actually get your data in the lab, 826 00:39:20,230 --> 00:39:22,210 you actually only have to estimate mu, 827 00:39:22,210 --> 00:39:25,530 because sigma comes on the label. 828 00:39:25,530 --> 00:39:28,660 So now, what is your statistical model? 829 00:39:28,660 --> 00:39:33,190 Well, the numbers I'm collecting are still in r. 830 00:39:33,190 --> 00:39:42,034 But now, the model that I have is n, mu, sigma squared. 831 00:39:42,034 --> 00:39:46,010 But the parameter space is not mu in r and sigma positive. 832 00:39:46,010 --> 00:39:46,897 It's just mu in r. 833 00:39:54,530 --> 00:39:58,449 And to be a little more emphatic about this, 834 00:39:58,449 --> 00:39:59,990 this is enough to describe it, right? 835 00:39:59,990 --> 00:40:02,300 Because if sigma is the sigma that 836 00:40:02,300 --> 00:40:04,580 was specified by the manufacturer, 837 00:40:04,580 --> 00:40:07,340 then this is the sigma you want. 838 00:40:07,340 --> 00:40:10,710 But you can actually write sigma is equal to-- 839 00:40:10,710 --> 00:40:15,420 sigma square is equal to sigma square manufacturer.
840 00:40:15,420 --> 00:40:15,920 Right? 841 00:40:15,920 --> 00:40:18,860 You can just fix it to be this particular value. 842 00:40:18,860 --> 00:40:21,230 Or maybe you don't want to write that index that's 843 00:40:21,230 --> 00:40:22,007 the manufacturer. 844 00:40:22,007 --> 00:40:23,590 And so you just say, well, the sigma-- 845 00:40:23,590 --> 00:40:24,740 when I write sigma squared, what I mean 846 00:40:24,740 --> 00:40:26,490 is the sigma square from the manufacturer. 847 00:40:26,490 --> 00:40:27,404 Yeah? 848 00:40:27,404 --> 00:40:29,368 AUDIENCE: [INAUDIBLE] 849 00:40:35,320 --> 00:40:37,080 PHILIPPE RIGOLLET: Yeah. 850 00:40:37,080 --> 00:40:39,597 For a particular measuring device? 851 00:40:39,597 --> 00:40:42,180 You know, you're in a lab, and you have some measuring device. 852 00:40:42,180 --> 00:40:45,260 I don't know-- something that measures 853 00:40:45,260 --> 00:40:48,042 tensile strength of something. 854 00:40:48,042 --> 00:40:49,750 And it's just going to measure something. 855 00:40:49,750 --> 00:40:51,480 And it will naturally make errors. 856 00:40:51,480 --> 00:40:53,865 But it's been tested so much by the manufacturer 857 00:40:53,865 --> 00:40:55,770 and calibrated by them. 858 00:40:55,770 --> 00:40:57,807 They know it's not going to be perfect. 859 00:40:57,807 --> 00:40:59,265 But they knew exactly what error it 860 00:40:59,265 --> 00:41:00,690 was making, because they've actually tried it 861 00:41:00,690 --> 00:41:02,231 on things for which they exactly knew 862 00:41:02,231 --> 00:41:04,431 what the tensile strength was. 863 00:41:04,431 --> 00:41:05,385 OK? 864 00:41:05,385 --> 00:41:06,339 Yeah. 865 00:41:06,339 --> 00:41:07,770 AUDIENCE: [INAUDIBLE] 866 00:41:09,155 --> 00:41:10,155 PHILIPPE RIGOLLET: This? 867 00:41:10,155 --> 00:41:11,600 AUDIENCE: [INAUDIBLE] 868 00:41:11,600 --> 00:41:13,898 PHILIPPE RIGOLLET: Oh, like that's pointing to-- 869 00:41:13,898 --> 00:41:14,886 5 prime? 870 00:41:19,340 --> 00:41:21,260 OK?
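A sketch of the known-variance setup in Python (the measurements are invented, and 1.96 is the usual 95% Gaussian quantile): sigma comes from the label, so only mu is estimated, and the width of a confidence interval is known exactly rather than estimated.

```python
import math

SIGMA = 0.23  # known standard deviation, from the manufacturer's label

# Hypothetical lab measurements.
data = [4.98, 5.12, 5.03, 4.91, 5.07]

n = len(data)
mu_hat = sum(data) / n  # the only parameter left to estimate

# 95% confidence interval: sigma is known, so no need to estimate it.
half_width = 1.96 * SIGMA / math.sqrt(n)
print(mu_hat - half_width, mu_hat + half_width)
```

Compare with the full Gaussian model, where sigma squared would also have to be estimated from the same data before any interval could be formed.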
871 00:41:21,260 --> 00:41:24,230 And we can come up with other examples, right? 872 00:41:24,230 --> 00:41:26,030 So for example, here's another one. 873 00:41:30,380 --> 00:41:33,350 So the names don't really matter, right? 874 00:41:33,350 --> 00:41:34,670 I call it the siblings model. 875 00:41:34,670 --> 00:41:37,662 But you won't find the siblings model in the textbook, right? 876 00:41:37,662 --> 00:41:38,870 So I wouldn't worry too much. 877 00:41:38,870 --> 00:41:41,810 But for example, let's say you have something-- so 878 00:41:41,810 --> 00:41:42,710 let's call it 6. 879 00:41:42,710 --> 00:41:45,700 You have-- I don't know-- 880 00:41:45,700 --> 00:41:54,240 a truncated-- and that's the name I just came up with. 881 00:41:54,240 --> 00:41:57,490 But it's actually not exactly describing what I want. 882 00:41:57,490 --> 00:42:03,510 But let's say I observe y, which is the indicator of x larger 883 00:42:03,510 --> 00:42:11,460 than say 5 when x follows some exponential with parameter 884 00:42:11,460 --> 00:42:13,181 lambda. 885 00:42:13,181 --> 00:42:13,680 OK? 886 00:42:13,680 --> 00:42:15,570 This is what I get to observe. 887 00:42:15,570 --> 00:42:18,990 I only observe if my waiting time 888 00:42:18,990 --> 00:42:20,610 was more than five minutes, because I 889 00:42:20,610 --> 00:42:23,160 see somebody coming out of the Kendall Station 890 00:42:23,160 --> 00:42:24,380 being really upset. 891 00:42:24,380 --> 00:42:26,310 And all I record is that I've been waiting 892 00:42:26,310 --> 00:42:27,770 for more than five minutes. 893 00:42:27,770 --> 00:42:29,460 And that's all I get to record. 894 00:42:29,460 --> 00:42:29,960 OK? 895 00:42:29,960 --> 00:42:31,109 That happens a lot. 896 00:42:31,109 --> 00:42:32,400 These are called censored data. 897 00:42:32,400 --> 00:42:34,960 I should probably not call it truncated, 898 00:42:34,960 --> 00:42:36,712 but this should be censored. 899 00:42:36,712 --> 00:42:38,140 OK?
900 00:42:38,140 --> 00:42:40,620 You see a lot of censored data when you ask people 901 00:42:40,620 --> 00:42:42,290 how much they make. 902 00:42:42,290 --> 00:42:45,330 They say, well, more than five figures. 903 00:42:45,330 --> 00:42:47,720 And that's all they want to tell you. 904 00:42:47,720 --> 00:42:48,380 OK? 905 00:42:48,380 --> 00:42:54,410 And so you see a lot of censored data in survival analysis, 906 00:42:54,410 --> 00:42:55,560 right? 907 00:42:55,560 --> 00:42:58,620 You are trying to understand how long your patients are going 908 00:42:58,620 --> 00:43:01,720 to live after some surgery, OK? 909 00:43:01,720 --> 00:43:05,970 And maybe you're not going to keep people alive, 910 00:43:05,970 --> 00:43:07,491 and you're not going to actually be 911 00:43:07,491 --> 00:43:09,490 in touch in their family every day and ask them, 912 00:43:09,490 --> 00:43:10,920 is the guy still alive? 913 00:43:10,920 --> 00:43:12,750 And so what you can do is just you 914 00:43:12,750 --> 00:43:15,540 ask people maybe five years after your study 915 00:43:15,540 --> 00:43:18,060 and say, please, come in. 916 00:43:18,060 --> 00:43:20,970 And you will just happen to have some people say, well, you 917 00:43:20,970 --> 00:43:22,470 know, the person is deceased. 918 00:43:22,470 --> 00:43:25,560 And you will only be able to know that the person deceased 919 00:43:25,560 --> 00:43:27,780 less than five years ago. 920 00:43:27,780 --> 00:43:31,980 But you only see what happens after that, OK? 921 00:43:31,980 --> 00:43:34,080 And so this is this truncated and censored data. 922 00:43:34,080 --> 00:43:35,940 It happens all the time just because you 923 00:43:35,940 --> 00:43:39,750 don't have the ability to do better than that. 924 00:43:39,750 --> 00:43:42,380 So this could happen here. 925 00:43:42,380 --> 00:43:45,270 So what is my physical experiment, right? 
926 00:43:45,270 --> 00:43:47,650 So here, I should probably write this like this, 927 00:43:47,650 --> 00:43:50,380 because I just told you that my observations are going to be x, 928 00:43:50,380 --> 00:43:52,720 but there is some unknown y. 929 00:43:52,720 --> 00:43:54,210 I will never get to see this y. 930 00:43:54,210 --> 00:43:57,230 I only get to see the x. 931 00:43:57,230 --> 00:43:58,700 What is my statistical experiment? 932 00:43:58,700 --> 00:44:00,215 Please help me. 933 00:44:00,215 --> 00:44:02,460 So is it the real line? 934 00:44:02,460 --> 00:44:04,850 My sample space-- is it the real line? 935 00:44:09,270 --> 00:44:12,410 Sorry, who does not know what this means? 936 00:44:12,410 --> 00:44:13,440 I'm sorry. 937 00:44:13,440 --> 00:44:15,450 OK. 938 00:44:15,450 --> 00:44:18,460 So this is called an indicator. 939 00:44:18,460 --> 00:44:20,586 So I read it as-- 940 00:44:20,586 --> 00:44:23,940 if I write it well, that's a one with a double bar. 941 00:44:23,940 --> 00:44:26,070 You can also write i if you prefer 942 00:44:26,070 --> 00:44:28,200 if you don't feel like writing one in double bars. 943 00:44:28,200 --> 00:44:31,235 And it's one of say-- 944 00:44:31,235 --> 00:44:32,610 I'm going to write it like that-- 945 00:44:32,610 --> 00:44:43,590 1 of a is equal to 1 if a is true and 0 if a is false. 946 00:44:43,590 --> 00:44:44,880 OK? 947 00:44:44,880 --> 00:44:48,370 So that means that if y is larger than 5, this thing is 1. 948 00:44:48,370 --> 00:44:52,350 And if y is not larger than 5, this thing is 0. 949 00:44:52,350 --> 00:44:53,754 OK. 950 00:44:53,754 --> 00:44:56,247 So that's called an indicator-- 951 00:45:00,143 --> 00:45:01,604 indicator function. 952 00:45:06,480 --> 00:45:10,760 It's very useful to just turn anything into a 0 or 1. 953 00:45:10,760 --> 00:45:14,387 So now that I'm here, what is my sample space? 954 00:45:17,380 --> 00:45:18,800 0, 1. 
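The indicator just defined is easy to mirror in code. A minimal Python sketch (the function name and the waiting-time value are made up for illustration, not from the lecture):

```python
# Sketch of the indicator function 1{A}: it maps the truth value
# of an event A to 0 or 1. The name "indicator" is our own choice.

def indicator(event_is_true):
    """1{A} = 1 if A is true, 0 if A is false."""
    return 1 if event_is_true else 0

# Hypothetical waiting time y, censored at five minutes:
y = 7.3
x = indicator(y >= 5)  # the censored observation
print(x)  # prints 1, since we waited more than five minutes
```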
955 00:45:18,800 --> 00:45:21,610 Well, whatever values I told you this thing 956 00:45:21,610 --> 00:45:24,247 was taking-- that's the thing you should have put-- 957 00:45:24,247 --> 00:45:26,580 if I had told you it was taking values 6 or 7, that 958 00:45:26,580 --> 00:45:29,760 would be your sample space, OK? 959 00:45:29,760 --> 00:45:33,220 OK, so it takes values 0, 1. 960 00:45:33,220 --> 00:45:37,060 And then what is the probability here? 961 00:45:37,060 --> 00:45:38,410 What should I write here? 962 00:45:38,410 --> 00:45:40,243 What should you write without even thinking? 963 00:45:44,062 --> 00:45:45,020 Yeah. 964 00:45:45,020 --> 00:45:47,070 So let's assume there's two seconds 965 00:45:47,070 --> 00:45:48,830 before the end of the exam. 966 00:45:48,830 --> 00:45:50,164 You're going to write Bernoulli. 967 00:45:50,164 --> 00:45:52,663 And that's when you're going to start checking if I'm going 968 00:45:52,663 --> 00:45:54,174 to give you extra time, OK? 969 00:45:54,174 --> 00:45:55,840 So you write Bernoulli without thinking, 970 00:45:55,840 --> 00:45:57,330 because it's taking values 0, 1. 971 00:45:57,330 --> 00:45:59,760 So you just write Bernoulli, but you still have to tell me 972 00:45:59,760 --> 00:46:04,110 what possible parameters this thing is taking, right? 973 00:46:04,110 --> 00:46:06,450 So I'm going to write it p, because I don't know. 974 00:46:06,450 --> 00:46:09,080 And then p takes value-- 975 00:46:09,080 --> 00:46:11,370 OK, so sorry. 976 00:46:11,370 --> 00:46:14,980 I could write it like that. 977 00:46:14,980 --> 00:46:16,910 Right? 978 00:46:16,910 --> 00:46:21,260 That would be perfectly valid, but actually, no. 979 00:46:21,260 --> 00:46:23,390 It's not any p. 980 00:46:23,390 --> 00:46:26,330 The p is the probability that an exponential lambda 981 00:46:26,330 --> 00:46:27,560 is larger than 5. 982 00:46:27,560 --> 00:46:30,530 And maybe I want to have lambda as a parameter. 
983 00:46:30,530 --> 00:46:33,450 OK, so what I need to actually compute is, 984 00:46:33,450 --> 00:46:38,180 what is the probability that y is larger than 5-- 985 00:46:38,180 --> 00:46:40,725 when y is this exponential lambda, 986 00:46:40,725 --> 00:46:42,350 which means that what I need to compute 987 00:46:42,350 --> 00:46:46,414 is the integral between 5 and infinity of-- 988 00:46:46,414 --> 00:46:47,388 what is it? 989 00:46:47,388 --> 00:46:49,823 1 over lambda. 990 00:46:49,823 --> 00:46:52,745 How did I define it in this class? 991 00:46:52,745 --> 00:46:54,206 Did I change it-- what? 992 00:46:54,206 --> 00:46:57,150 AUDIENCE: [INAUDIBLE]. 993 00:46:57,150 --> 00:46:59,090 PHILIPPE RIGOLLET: Yeah, right, right, right. 994 00:46:59,090 --> 00:46:59,760 Yeah. 995 00:46:59,760 --> 00:47:04,230 Lambda e to the minus lambda x dx, right? 996 00:47:04,230 --> 00:47:07,760 So that's what I need to compute. 997 00:47:07,760 --> 00:47:09,580 What is this? 998 00:47:09,580 --> 00:47:11,678 Yeah, so what is the value of this integral? 999 00:47:14,666 --> 00:47:16,658 Can you integrate this? 1000 00:47:25,622 --> 00:47:28,112 AUDIENCE: [INAUDIBLE] 1001 00:47:32,594 --> 00:47:33,610 PHILIPPE RIGOLLET: OK? 1002 00:47:33,610 --> 00:47:35,984 And again, you can cancel this, right? 1003 00:47:35,984 --> 00:47:37,650 So when I'm going to integrate this guy, 1004 00:47:37,650 --> 00:47:39,070 those guys are going to cancel. 1005 00:47:39,070 --> 00:47:40,900 I'm going to get 0 at infinity. 1006 00:47:40,900 --> 00:47:42,929 I'm going to get a 5 for this guy. 1007 00:47:42,929 --> 00:47:45,470 And well, I know it's going to be a positive number, so I'm not 1008 00:47:45,470 --> 00:47:46,890 really going to bother with the signs, 1009 00:47:46,890 --> 00:47:48,556 because I know that's what it should be. 1010 00:47:48,556 --> 00:47:51,850 OK, so I get e to the minus 5 lambda. 
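The closed form just derived, P(X > 5) = e^(-5 lambda), can be sanity-checked by simulation. A short sketch, using a made-up rate lambda = 0.3:

```python
import math
import random

lam = 0.3  # a hypothetical rate lambda for the Exp(lambda) waiting time

# Closed form from the integral above: P(X > 5) = exp(-5 * lambda)
p_closed = math.exp(-5 * lam)

# Monte Carlo check using exponential draws
random.seed(0)
n = 200_000
p_mc = sum(random.expovariate(lam) > 5 for _ in range(n)) / n

print(p_closed, p_mc)  # the two should agree to roughly two decimal places
```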
1011 00:47:51,850 --> 00:47:55,231 And so that means that I can actually write this like that-- 1012 00:47:57,973 --> 00:48:01,710 and now parametrize this thing by lambda positive. 1013 00:48:01,710 --> 00:48:02,210 OK? 1014 00:48:02,210 --> 00:48:05,480 So what I did here is I changed the parametrization from p 1015 00:48:05,480 --> 00:48:06,890 to lambda. 1016 00:48:06,890 --> 00:48:07,490 Why? 1017 00:48:07,490 --> 00:48:10,550 Well, because maybe if I know this is happening, 1018 00:48:10,550 --> 00:48:13,400 maybe I am actually interested in reporting lambda 1019 00:48:13,400 --> 00:48:15,910 to MBTA, for example. 1020 00:48:15,910 --> 00:48:20,375 Maybe I'm actually trying to estimate 1 over lambda, so 1021 00:48:20,375 --> 00:48:22,857 that I know it is-- 1022 00:48:22,857 --> 00:48:24,440 well, lambda is actually the intensity 1023 00:48:24,440 --> 00:48:26,819 of arrival of my Poisson process, right? 1024 00:48:26,819 --> 00:48:27,860 I have a Poisson process. 1025 00:48:27,860 --> 00:48:31,357 That's how my trains are coming in. 1026 00:48:31,357 --> 00:48:32,690 And so I'm interested in lambda. 1027 00:48:32,690 --> 00:48:34,314 So I will parametrize things by lambda. 1028 00:48:34,314 --> 00:48:35,680 So the thing I get is lambda. 1029 00:48:35,680 --> 00:48:37,010 You can play with this, right? 1030 00:48:37,010 --> 00:48:39,051 I mean, I could parametrize this by 1 over lambda 1031 00:48:39,051 --> 00:48:42,960 and put 1 over lambda here if I want it. 1032 00:48:42,960 --> 00:48:46,650 But you know, the context of your problem 1033 00:48:46,650 --> 00:48:50,321 will tell you exactly how to parametrize this. 1034 00:48:50,321 --> 00:48:50,820 OK? 1035 00:48:53,890 --> 00:48:59,200 So what else did I want to tell you? 1036 00:48:59,200 --> 00:49:00,688 OK, let's do a final one. 
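Because p = e^(-5 lambda) is an invertible function of lambda, an estimate of p from the censored 0/1 observations can be turned back into an estimate of lambda. A sketch under a made-up true rate (the inversion is implicit in the reparametrization above; estimators are only introduced later in the lecture):

```python
import math
import random

lam_true = 0.4  # hypothetical true rate; in practice this is unknown
random.seed(1)
n = 100_000

# We only ever see the censored indicators x_i = 1{y_i > 5}.
xs = [1 if random.expovariate(lam_true) > 5 else 0 for _ in range(n)]

p_hat = sum(xs) / n              # estimates p = exp(-5 * lambda)
lam_hat = -math.log(p_hat) / 5   # invert the reparametrization

print(lam_hat)  # close to the true rate 0.4
```

With n this large, p_hat is bounded away from 0, so the logarithm is safe to take.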
1037 00:49:13,660 --> 00:49:17,800 By the way, are you guys OK with Poisson, exponential, 1038 00:49:17,800 --> 00:49:21,060 Bernoulli-- 1039 00:49:21,060 --> 00:49:22,930 I don't know, binomial, normal-- 1040 00:49:22,930 --> 00:49:24,155 all these things. 1041 00:49:24,155 --> 00:49:25,780 I'm not going to go back to it, but I'm 1042 00:49:25,780 --> 00:49:26,863 going to use them heavily. 1043 00:49:26,863 --> 00:49:29,435 So just spend five minutes on Wikipedia 1044 00:49:29,435 --> 00:49:31,870 if you forgot about what those things are. 1045 00:49:31,870 --> 00:49:35,140 Usually, you must have seen them in your probability class. 1046 00:49:35,140 --> 00:49:36,670 So these should not be crazy names. 1047 00:49:36,670 --> 00:49:38,410 And again, I'm not expecting you to. 1048 00:49:38,410 --> 00:49:40,804 I don't remember what the density of an exponential is. 1049 00:49:40,804 --> 00:49:42,220 So it would be pretty unfair of me 1050 00:49:42,220 --> 00:49:44,069 to actually ask you to remember what it is. 1051 00:49:44,069 --> 00:49:45,610 Even for the Gaussian, I don't expect 1052 00:49:45,610 --> 00:49:46,760 you to remember what it is. 1053 00:49:46,760 --> 00:49:51,550 But I want you to remember that if I add 5 to a Gaussian, then 1054 00:49:51,550 --> 00:49:54,490 I have a Gaussian with mean mu plus 5, and similarly if I multiply it 1055 00:49:54,490 --> 00:49:55,800 by something, right? 1056 00:49:55,800 --> 00:49:59,170 You need to know how to operate those things. 1057 00:49:59,170 --> 00:50:02,110 But knowing complicated densities 1058 00:50:02,110 --> 00:50:04,290 is definitely not part of the game. 1059 00:50:04,290 --> 00:50:05,740 OK? 1060 00:50:05,740 --> 00:50:09,591 So let's do a final one. 1061 00:50:09,591 --> 00:50:11,090 I don't know what number I have now. 1062 00:50:11,090 --> 00:50:12,298 I'm going to just do uniform. 1063 00:50:14,999 --> 00:50:15,790 That's another one. 1064 00:50:15,790 --> 00:50:18,370 Everybody knows what uniform is? 
1065 00:50:18,370 --> 00:50:19,360 So it's uniform, right? 1066 00:50:19,360 --> 00:50:22,810 So I'm going to have x, which my observations are 1067 00:50:22,810 --> 00:50:27,140 going to be uniform on the interval 0 theta, right? 1068 00:50:27,140 --> 00:50:30,100 So if I want to define a uniform distribution 1069 00:50:30,100 --> 00:50:32,575 for a random variable, I have to tell you which interval 1070 00:50:32,575 --> 00:50:35,200 or which set I want it to be uniform on. 1071 00:50:35,200 --> 00:50:38,210 And so here I'm telling you is the interval 0 theta. 1072 00:50:38,210 --> 00:50:41,270 And so what is going to be my sample space? 1073 00:50:41,270 --> 00:50:42,204 AUDIENCE: [INAUDIBLE] 1074 00:50:42,204 --> 00:50:44,480 PHILIPPE RIGOLLET: I'm sorry? 1075 00:50:44,480 --> 00:50:44,980 0 to theta. 1076 00:50:47,620 --> 00:50:50,770 And then what is my probability distribution? 1077 00:50:50,770 --> 00:50:52,228 My family of parameters? 1078 00:50:57,100 --> 00:51:00,000 So well, I can write it like this, right? 1079 00:51:00,000 --> 00:51:03,444 Uniform theta, right? 1080 00:51:03,444 --> 00:51:06,408 And theta let's say is positive. 1081 00:51:09,866 --> 00:51:12,336 Can somebody tell me what's wrong with what I wrote? 1082 00:51:18,780 --> 00:51:20,592 This makes no sense. 1083 00:51:20,592 --> 00:51:21,554 Tell me why. 1084 00:51:24,440 --> 00:51:26,845 Yeah? 1085 00:51:26,845 --> 00:51:30,292 Yeah, this set depends on theta, and why is that a problem? 1086 00:51:30,292 --> 00:51:32,188 AUDIENCE: [INAUDIBLE] 1087 00:51:36,869 --> 00:51:38,410 PHILIPPE RIGOLLET: There is no theta. 1088 00:51:38,410 --> 00:51:40,990 Right now, there's the families of theta. 1089 00:51:40,990 --> 00:51:43,430 Which one did you pick here? 
1090 00:51:43,430 --> 00:51:46,540 Right, this is just something that's indexed by theta, 1091 00:51:46,540 --> 00:51:49,090 but I could have very well written it as, you know, 1092 00:51:49,090 --> 00:51:51,860 just not being Greek for a second, 1093 00:51:51,860 --> 00:51:55,652 I could have just written this as t rather than theta. 1094 00:51:55,652 --> 00:51:56,860 That would be the same thing. 1095 00:51:56,860 --> 00:51:59,349 And then what the hell is theta? 1096 00:51:59,349 --> 00:52:00,640 There's no such thing as theta. 1097 00:52:00,640 --> 00:52:02,230 We don't know what the parameter is. 1098 00:52:02,230 --> 00:52:04,314 This sample space should work for every parameter. 1099 00:52:04,314 --> 00:52:05,980 And so that means that I actually am not 1100 00:52:05,980 --> 00:52:07,200 allowed to pick this theta. 1101 00:52:07,200 --> 00:52:10,060 I'm actually-- just for the reason that there is no 1102 00:52:10,060 --> 00:52:12,056 parameter to put on the left-hand side-- 1103 00:52:12,056 --> 00:52:13,180 there should not be, right? 1104 00:52:13,180 --> 00:52:14,910 So you just said, well, there's a problem because the parameter 1105 00:52:14,910 --> 00:52:16,030 is on the left-hand side. 1106 00:52:16,030 --> 00:52:17,405 But there's not even a parameter. 1107 00:52:17,405 --> 00:52:19,630 I'm describing the family of possible parameters. 1108 00:52:19,630 --> 00:52:22,060 There is no one that you can actually plug in. 1109 00:52:22,060 --> 00:52:24,352 So this should really be 1. 1110 00:52:24,352 --> 00:52:25,810 And I'm going to go back to writing 1111 00:52:25,810 --> 00:52:29,776 this as theta because that's pretty standard. 1112 00:52:29,776 --> 00:52:31,750 Is that clear for everyone? 
1113 00:52:31,750 --> 00:52:37,780 I cannot just pick one and put it in there and just take the-- 1114 00:52:37,780 --> 00:52:40,600 before I run my experiments, I could potentially 1115 00:52:40,600 --> 00:52:42,370 get numbers that are all the way up to 1, 1116 00:52:42,370 --> 00:52:45,206 because I don't know what theta is going to be ahead of time. 1117 00:52:45,206 --> 00:52:47,740 Now, if somebody promised to me that theta 1118 00:52:47,740 --> 00:52:49,690 was going to be less than 0.5, that would be-- 1119 00:52:49,690 --> 00:52:50,980 sorry, why do I put 1 here? 1120 00:52:56,740 --> 00:52:58,490 I could put theta between 0 and 1. 1121 00:52:58,490 --> 00:53:00,150 But if somebody is going to promise me, for example, 1122 00:53:00,150 --> 00:53:01,649 if theta is going to be less than 1, 1123 00:53:01,649 --> 00:53:03,599 then you expect to put 0, 1. 1124 00:53:03,599 --> 00:53:04,565 All right? 1125 00:53:08,912 --> 00:53:12,310 Is that clear? 1126 00:53:12,310 --> 00:53:15,410 OK, so now you know how to answer the question-- 1127 00:53:15,410 --> 00:53:18,390 what is the statistical model? 1128 00:53:18,390 --> 00:53:20,140 And again, within the scope of this class, 1129 00:53:20,140 --> 00:53:23,110 you will not be asked to just come up with a model right that 1130 00:53:23,110 --> 00:53:24,260 will just tell you. 1131 00:53:24,260 --> 00:53:26,390 Poisson would be probably be a good idea here. 1132 00:53:26,390 --> 00:53:28,681 And then you would just have to trust me that indeed it 1133 00:53:28,681 --> 00:53:30,460 would be a good idea. 1134 00:53:30,460 --> 00:53:35,230 All right, so what I started talking about 20 minutes ago-- 1135 00:53:35,230 --> 00:53:38,350 so it's definitely ahead of myself 1136 00:53:38,350 --> 00:53:40,000 is the notion-- so that's when I was 1137 00:53:40,000 --> 00:53:41,290 talking about well-specified. 
1138 00:53:41,290 --> 00:53:44,650 Remember, well-specified says that the true distribution 1139 00:53:44,650 --> 00:53:47,080 is one of the distributions in this parametric families 1140 00:53:47,080 --> 00:53:48,310 of distribution. 1141 00:53:48,310 --> 00:53:50,050 The true distribution of my siblings 1142 00:53:50,050 --> 00:53:52,570 is actually a Poisson with some parameters. 1143 00:53:52,570 --> 00:53:56,800 And all I need to figure out is what this parameter is. 1144 00:53:56,800 --> 00:53:58,560 When I started saying that, I said, well, 1145 00:53:58,560 --> 00:53:59,970 but then that could be that there 1146 00:53:59,970 --> 00:54:01,428 are several parameters that give me 1147 00:54:01,428 --> 00:54:03,010 the same distribution, right? 1148 00:54:03,010 --> 00:54:07,840 It could be the case that Poisson 5 and Poisson 17 1149 00:54:07,840 --> 00:54:09,910 are exactly the same distributions when 1150 00:54:09,910 --> 00:54:13,540 I started putting those numbers in the formula which I erased, 1151 00:54:13,540 --> 00:54:14,140 OK? 1152 00:54:14,140 --> 00:54:18,070 So it could be the case that two different numbers would give me 1153 00:54:18,070 --> 00:54:20,360 exactly the same probabilities. 1154 00:54:20,360 --> 00:54:24,600 And in this case, we see that the model is not identifiable. 1155 00:54:24,600 --> 00:54:26,680 I mean, the parameter is not identifiable. 1156 00:54:26,680 --> 00:54:29,410 I cannot identify the parameter, even if you actually gave me 1157 00:54:29,410 --> 00:54:32,080 an infinite amount of data, which means that I could 1158 00:54:32,080 --> 00:54:34,750 actually estimate exactly the PMF. 1159 00:54:34,750 --> 00:54:37,277 I might not be able to go back, because there would 1160 00:54:37,277 --> 00:54:38,860 be several candidates, and I would not 1161 00:54:38,860 --> 00:54:41,360 be able to tell you which one it was in the first place. 1162 00:54:41,360 --> 00:54:41,860 OK? 
1163 00:54:41,860 --> 00:54:45,310 So what we want is that this function-- 1164 00:54:45,310 --> 00:54:49,720 theta maps to p theta is injective. 1165 00:54:49,720 --> 00:54:51,067 And that can sound fancy. 1166 00:54:54,410 --> 00:54:57,560 What I really mean is that if theta 1167 00:54:57,560 --> 00:55:01,580 is different from theta prime, then p of theta 1168 00:55:01,580 --> 00:55:04,100 is different from p of theta prime. 1169 00:55:04,100 --> 00:55:07,580 Or, if you prefer to think about the contrapositive of this, 1170 00:55:07,580 --> 00:55:11,960 this is the same as saying that if p theta gives me 1171 00:55:11,960 --> 00:55:15,320 the same distribution as p theta prime, 1172 00:55:15,320 --> 00:55:17,480 then that implies that theta must 1173 00:55:17,480 --> 00:55:20,180 be equal to theta prime. 1174 00:55:20,180 --> 00:55:24,330 Logically, those two things are equivalent, right? 1175 00:55:24,330 --> 00:55:26,780 So that's what this means. 1176 00:55:26,780 --> 00:55:37,130 So this is-- we say that the parameter is identifiable 1177 00:55:37,130 --> 00:55:41,414 or identified-- it doesn't really matter-- 1178 00:55:41,414 --> 00:55:42,386 in this model. 1179 00:55:49,170 --> 00:55:50,920 And this is something we're going to want. 1180 00:55:50,920 --> 00:55:51,920 OK? 1181 00:55:51,920 --> 00:55:54,980 So in all the examples that I gave you, 1182 00:55:54,980 --> 00:55:57,090 those parameters are completely identified. 1183 00:55:57,090 --> 00:55:57,590 Right? 1184 00:55:57,590 --> 00:55:58,410 If I tell you-- 1185 00:55:58,410 --> 00:56:01,440 I mean, if those things are in a probability textbook, 1186 00:56:01,440 --> 00:56:03,920 it means that they were probably thought through, right? 
1187 00:56:03,920 --> 00:56:06,290 So when I say exponential lambda, 1188 00:56:06,290 --> 00:56:09,987 I'm really talking about one specific distribution and not-- 1189 00:56:09,987 --> 00:56:11,820 there's not another lambda going to give you 1190 00:56:11,820 --> 00:56:13,910 exactly the same distribution. 1191 00:56:13,910 --> 00:56:15,020 OK so that was the case. 1192 00:56:15,020 --> 00:56:17,150 And you can check that, but it's a little annoying. 1193 00:56:17,150 --> 00:56:19,306 So I would probably not do it. 1194 00:56:19,306 --> 00:56:20,930 But rather than doing this, let me just 1195 00:56:20,930 --> 00:56:24,210 give you some examples where it would not be the case. 1196 00:56:24,210 --> 00:56:25,980 Again, here's an example, if I take xi-- 1197 00:56:31,220 --> 00:56:36,980 so now I'm back to just using this indicator function-- 1198 00:56:36,980 --> 00:56:39,080 but now for a Gaussian. 1199 00:56:39,080 --> 00:56:42,030 So what I observe is x is the indicator 1200 00:56:42,030 --> 00:56:44,020 that y is, what did we say? 1201 00:56:44,020 --> 00:56:44,914 Positive. 1202 00:56:48,050 --> 00:56:49,350 OK? 1203 00:56:49,350 --> 00:56:51,458 So this is a Bernoulli random variable, right? 1204 00:56:56,400 --> 00:56:57,709 And it has some parameter p. 1205 00:56:57,709 --> 00:56:59,250 But p now is going to depend-- sorry, 1206 00:56:59,250 --> 00:57:04,890 and here y is n mu sigma square. 1207 00:57:04,890 --> 00:57:09,090 So the p, the probability that this thing is positive, 1208 00:57:09,090 --> 00:57:10,020 is actually-- 1209 00:57:10,020 --> 00:57:11,390 I don't think I put the 0. 1210 00:57:11,390 --> 00:57:13,010 Oh, yeah, because I have mu. 1211 00:57:13,010 --> 00:57:15,935 OK, so this distribution-- this p the probability 1212 00:57:15,935 --> 00:57:17,560 that it's positive is just the probably 1213 00:57:17,560 --> 00:57:19,715 that some Gaussian is positive. 1214 00:57:19,715 --> 00:57:22,410 And it will depend on mu and sigma, right? 
1215 00:57:22,410 --> 00:57:31,680 Because if I draw a 0, and I draw my Gaussian around mu, 1216 00:57:31,680 --> 00:57:35,430 then the probability of this Bernoulli being 1 1217 00:57:35,430 --> 00:57:39,790 is really the area under the curve here. 1218 00:57:39,790 --> 00:57:40,770 Right? 1219 00:57:40,770 --> 00:57:42,827 And this thing-- well, if mu is very large, 1220 00:57:42,827 --> 00:57:44,160 it's going to become very large. 1221 00:57:44,160 --> 00:57:48,540 If mu is very small, it's going to become very small. 1222 00:57:48,540 --> 00:57:51,961 And if sigma changes, it's also going to affect it-- 1223 00:57:51,961 --> 00:57:53,970 is that clear for everyone? 1224 00:57:53,970 --> 00:57:56,040 But we can actually compute this, right? 1225 00:57:56,040 --> 00:57:59,610 So the parameter p that I'm looking for here 1226 00:57:59,610 --> 00:58:01,650 as a function of mu and sigma is simply 1227 00:58:01,650 --> 00:58:06,270 the probability that some y is non-negative, 1228 00:58:06,270 --> 00:58:12,150 which is the probability that y minus mu divided by sigma 1229 00:58:12,150 --> 00:58:16,790 is larger than minus mu divided by sigma. 1230 00:58:16,790 --> 00:58:20,160 But when you studied probability, is that some operation you 1231 00:58:20,160 --> 00:58:22,170 were used to making? 1232 00:58:22,170 --> 00:58:26,010 Removing the mean and dividing by the standard deviation? 1233 00:58:26,010 --> 00:58:28,116 What is the effect of doing that on a Gaussian 1234 00:58:28,116 --> 00:58:30,655 random variable? 1235 00:58:30,655 --> 00:58:32,030 Yeah, so you normalize it, right? 1236 00:58:32,030 --> 00:58:33,490 And you standardize it. 1237 00:58:33,490 --> 00:58:34,900 You make it a standard Gaussian. 1238 00:58:34,900 --> 00:58:36,880 You remove the mean. 1239 00:58:36,880 --> 00:58:38,470 It becomes a mean-0 Gaussian. 1240 00:58:38,470 --> 00:58:41,170 And you scale the variance for it to become 1. 
1241 00:58:41,170 --> 00:58:43,019 So when you have a Gaussian, remove the mean 1242 00:58:43,019 --> 00:58:44,560 and divide by the standard deviation, 1243 00:58:44,560 --> 00:58:46,570 it becomes a standard Gaussian-- 1244 00:58:46,570 --> 00:58:50,975 which means this thing has an N(0, 1) distribution, 1245 00:58:50,975 --> 00:58:53,350 which is the one you can read the quantiles of at the end 1246 00:58:53,350 --> 00:58:54,640 of the book. 1247 00:58:54,640 --> 00:58:55,140 Right? 1248 00:58:55,140 --> 00:58:57,020 And that's exactly what we did. 1249 00:58:57,020 --> 00:58:57,520 OK? 1250 00:58:57,520 --> 00:59:00,190 So now you have the probability that some standard Gaussian 1251 00:59:00,190 --> 00:59:04,366 exceeds negative mu over sigma, which 1252 00:59:04,366 --> 00:59:06,490 I can write in terms of the cumulative distribution 1253 00:59:06,490 --> 00:59:07,894 function, capital phi-- 1254 00:59:14,720 --> 00:59:16,560 like we did in the first lecture. 1255 00:59:16,560 --> 00:59:19,070 So if I do this cumulative distribution function, 1256 00:59:19,070 --> 00:59:21,012 what is this probability in terms of phi? 1257 00:59:25,431 --> 00:59:26,413 [INAUDIBLE]? 1258 00:59:26,413 --> 00:59:28,307 AUDIENCE: [INAUDIBLE]. 1259 00:59:28,307 --> 00:59:30,640 PHILIPPE RIGOLLET: Well, that's what your name tag says. 1260 00:59:33,400 --> 00:59:34,366 1 minus-- 1261 00:59:34,366 --> 00:59:36,034 AUDIENCE: [INAUDIBLE]. 1262 00:59:36,034 --> 00:59:37,700 PHILIPPE RIGOLLET: 1 minus mu over sigma. 1263 00:59:37,700 --> 00:59:39,786 What happens with phi in our-- 1264 00:59:39,786 --> 00:59:43,450 do you think I defined this for fun? 1265 00:59:43,450 --> 00:59:50,210 1 minus phi of minus mu over sigma, right? 1266 00:59:50,210 --> 00:59:50,710 Right? 1267 00:59:50,710 --> 00:59:52,320 Because this is 1 minus the probability 1268 00:59:52,320 --> 00:59:53,361 that it's less than this. 
1269 00:59:53,361 --> 00:59:55,180 And this is exactly the definition 1270 00:59:55,180 --> 00:59:57,915 of the cumulative distribution function. 1271 00:59:57,915 --> 01:00:04,080 So in particular, this thing only depends on mu over sigma. 1272 01:00:04,080 --> 01:00:05,640 Agreed? 1273 01:00:05,640 --> 01:00:09,630 So in particular, if I had 2 mu over 2 sigma, 1274 01:00:09,630 --> 01:00:11,240 p would remain unchanged. 1275 01:00:11,240 --> 01:00:15,340 If I have 12 mu over 12 sigma, this thing 1276 01:00:15,340 --> 01:00:18,370 would remain unchanged, which means 1277 01:00:18,370 --> 01:00:22,780 that p does not change if I scale mu 1278 01:00:22,780 --> 01:00:25,410 and sigma by the same factor. 1279 01:00:25,410 --> 01:00:28,840 So even by observing x an 1280 01:00:28,840 --> 01:00:32,090 infinite number of times, so that I can actually get exactly what p is, 1281 01:00:32,090 --> 01:00:34,870 I'm never going to be able to get mu and sigma separately. 1282 01:00:34,870 --> 01:00:37,842 All I'm going to be able to get is mu over sigma. 1283 01:00:37,842 --> 01:00:41,730 So here, we say that mu sigma-- 1284 01:00:41,730 --> 01:00:43,120 the parameter mu sigma-- 1285 01:00:43,120 --> 01:00:46,056 or actually each of them individually-- those guys-- 1286 01:00:50,310 --> 01:00:51,769 they're not identifiable. 1287 01:00:58,810 --> 01:01:03,180 But the parameter mu over sigma is identifiable. 1288 01:01:09,180 --> 01:01:13,820 So if I wanted to write a statistical model in which 1289 01:01:13,820 --> 01:01:15,620 the parameter is identifiable-- 1290 01:01:25,660 --> 01:01:32,440 I would write 0, 1 Bernoulli. 1291 01:01:32,440 --> 01:01:41,289 And then I would write 1 minus phi of minus mu over sigma. 1292 01:01:41,289 --> 01:01:42,830 And then I would take two parameters, 1293 01:01:42,830 --> 01:01:48,940 which are mu in R and sigma squared positive. 1294 01:01:48,940 --> 01:01:52,244 So let's write sigma positive. 
1295 01:01:52,244 --> 01:01:52,744 Right? 1296 01:01:56,970 --> 01:01:59,440 No, this is not identifiable. 1297 01:01:59,440 --> 01:02:02,848 I cannot write those two guys as being two things different. 1298 01:02:12,010 --> 01:02:22,050 Instead, what I want to write is 0, 1, Bernoulli 1 minus-- 1299 01:02:26,026 --> 01:02:30,002 and now my parameter-- 1300 01:02:30,002 --> 01:02:37,610 I forgot this-- my parameter is mu over sigma. 1301 01:02:37,610 --> 01:02:41,180 Can somebody tell me where mu over sigma lives? 1302 01:02:41,180 --> 01:02:42,670 What values can this thing take? 1303 01:02:46,630 --> 01:02:48,115 Any real value, right? 1304 01:02:53,070 --> 01:02:55,920 OK, so now I've done this definitely out of convenience, 1305 01:02:55,920 --> 01:02:56,430 right? 1306 01:02:56,430 --> 01:02:59,130 Because that was the only thing I was able to identify-- 1307 01:02:59,130 --> 01:03:01,020 the ratio of mu over sigma. 1308 01:03:01,020 --> 01:03:04,260 But it's still something that has some meaning. 1309 01:03:04,260 --> 01:03:06,570 It's the normalized mean. 1310 01:03:06,570 --> 01:03:08,880 It really tells me what the mean is compared 1311 01:03:08,880 --> 01:03:10,200 to the standard deviation. 1312 01:03:10,200 --> 01:03:13,980 So in some models, in reality, in some real applications, 1313 01:03:13,980 --> 01:03:16,050 this actually might have a good meaning. 1314 01:03:16,050 --> 01:03:17,940 It's just telling me how big the mean 1315 01:03:17,940 --> 01:03:22,620 is compared to the standard fluctuations of this model. 1316 01:03:22,620 --> 01:03:24,936 But I won't be able to get more than that. 1317 01:03:24,936 --> 01:03:25,436 Agreed? 1318 01:03:30,630 --> 01:03:32,500 All right? 1319 01:03:32,500 --> 01:03:37,510 So now that we've set a parametric model, 1320 01:03:37,510 --> 01:03:40,580 let's try to see what our goals are going to be. 1321 01:03:40,580 --> 01:03:41,080 OK? 
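The scaling argument above, that p = 1 - phi(-mu/sigma) depends on (mu, sigma) only through the ratio mu/sigma, is easy to check numerically. A sketch using made-up values of mu and sigma:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p(mu, sigma):
    """P(Y >= 0) for Y ~ N(mu, sigma^2), i.e. 1 - Phi(-mu / sigma)."""
    return 1.0 - Phi(-mu / sigma)

# Scaling (mu, sigma) by the same factor leaves p unchanged:
print(p(1.0, 2.0), p(2.0, 4.0), p(12.0, 24.0))  # all three are identical
```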
1322 01:03:41,080 --> 01:03:44,740 So now we have a sample and a statistical model. 1323 01:03:44,740 --> 01:03:47,560 And we want to estimate the parameter theta, 1324 01:03:47,560 --> 01:03:49,510 and I could say, well, you know what? 1325 01:03:49,510 --> 01:03:51,190 I don't have time for this analysis. 1326 01:03:51,190 --> 01:03:53,390 Collecting data is going to take me a while. 1327 01:03:53,390 --> 01:03:55,010 So I'm just going to mmm-- 1328 01:03:55,010 --> 01:03:57,509 and I'm going to say that mu over sigma is 4. 1329 01:03:57,509 --> 01:03:59,050 And I'm just going to give it to you. 1330 01:03:59,050 --> 01:04:00,970 And maybe you will tell me, yeah, 1331 01:04:00,970 --> 01:04:02,140 it's not very good, right? 1332 01:04:02,140 --> 01:04:04,870 So we need some measure of performance 1333 01:04:04,870 --> 01:04:05,770 of a given parameter. 1334 01:04:05,770 --> 01:04:09,310 We need to be able to evaluate if eyeballing the problem 1335 01:04:09,310 --> 01:04:11,680 is worse than actually collecting 1336 01:04:11,680 --> 01:04:13,030 a large amount of data. 1337 01:04:13,030 --> 01:04:16,150 We need to know if even if I come up with an estimator that 1338 01:04:16,150 --> 01:04:18,520 actually sort of uses the data, does it 1339 01:04:18,520 --> 01:04:20,620 make an efficient use of the data? 1340 01:04:20,620 --> 01:04:22,750 Would I actually need 10 times more observations 1341 01:04:22,750 --> 01:04:24,279 to achieve the same accuracy? 1342 01:04:24,279 --> 01:04:25,820 To be able to answer these questions, 1343 01:04:25,820 --> 01:04:28,390 well, I need to define what accuracy means. 1344 01:04:28,390 --> 01:04:30,550 And accuracy is something that sort of makes sense. 1345 01:04:30,550 --> 01:04:31,716 It says, well, I want theta hat 1346 01:04:31,716 --> 01:04:33,520 to be close to theta. 1347 01:04:33,520 --> 01:04:35,344 And theta hat is a random variable. 
1348 01:04:35,344 --> 01:04:36,760 So I'm going to have to understand 1349 01:04:36,760 --> 01:04:38,218 what it means for a random variable 1350 01:04:38,218 --> 01:04:40,570 to be close to a deterministic number. 1351 01:04:40,570 --> 01:04:44,354 And so, what is an estimator of a parameter, right? 1352 01:04:44,354 --> 01:04:46,770 So I have an estimator, and I said it's a random variable. 1353 01:04:49,692 --> 01:04:51,640 And the formal definition-- 1354 01:04:59,920 --> 01:05:10,470 so an estimator is a measurable function of the data. 1355 01:05:10,470 --> 01:05:12,570 So when I write theta hat, and that 1356 01:05:12,570 --> 01:05:18,060 will typically be my notation for an estimator, right? 1357 01:05:18,060 --> 01:05:24,820 I should really write theta hat of x1, ..., xn. 1358 01:05:24,820 --> 01:05:25,380 OK? 1359 01:05:25,380 --> 01:05:26,930 That's what an estimator is. 1360 01:05:26,930 --> 01:05:28,620 If you want to know what an estimator is, 1361 01:05:28,620 --> 01:05:30,400 this is a measurable function of the data. 1362 01:05:30,400 --> 01:05:35,187 And it's actually also known as a statistic. 1363 01:05:37,767 --> 01:05:39,350 And you know, 1364 01:05:39,350 --> 01:05:43,340 I see it every time I have, like, 1365 01:05:43,340 --> 01:05:47,250 you know, a dinner with normal people. 1366 01:05:47,250 --> 01:05:48,630 And I say I'm a statistician. 1367 01:05:48,630 --> 01:05:50,660 Oh, yeah, I really like baseball. 1368 01:05:50,660 --> 01:05:53,210 And they talk to me about batting averages. 1369 01:05:53,210 --> 01:05:54,119 That's not what I do. 1370 01:05:54,119 --> 01:05:55,910 But for them, that's what it is, and that's 1371 01:05:55,910 --> 01:05:58,010 because in a way, that's what a statistic is. 1372 01:05:58,010 --> 01:06:00,450 A batting average is a statistic. 1373 01:06:00,450 --> 01:06:02,240 OK, and so here are some examples. 1374 01:06:02,240 --> 01:06:04,250 You can take the average xn bar.
1375 01:06:04,250 --> 01:06:06,400 You can take the maximum of your observation. 1376 01:06:06,400 --> 01:06:07,370 That's a statistic. 1377 01:06:07,370 --> 01:06:08,990 You can take the first one. 1378 01:06:08,990 --> 01:06:10,820 You can take the first one plus log of 1 1379 01:06:10,820 --> 01:06:12,830 plus the absolute value of the last one. 1380 01:06:12,830 --> 01:06:15,980 You can do whatever you want; that will be an estimator. 1381 01:06:15,980 --> 01:06:17,780 Some of them are clearly going to be bad. 1382 01:06:17,780 --> 01:06:20,090 But that's still a statistic, and you can do this. 1383 01:06:20,090 --> 01:06:24,610 Now, when I say measurable, I always have-- 1384 01:06:24,610 --> 01:06:26,277 so you know, graduate students sometimes 1385 01:06:26,277 --> 01:06:28,943 ask me like, yeah, how do I know if this estimator is measurable 1386 01:06:28,943 --> 01:06:29,480 or not. 1387 01:06:29,480 --> 01:06:31,710 And usually, my answer is, well, if I give you data, 1388 01:06:31,710 --> 01:06:32,866 can you compute it? 1389 01:06:32,866 --> 01:06:35,240 And they say, yeah, and I'm like, well, then it's measurable. 1390 01:06:35,240 --> 01:06:38,390 That's a very good rule to check if you can actually-- 1391 01:06:38,390 --> 01:06:40,970 if something is actually measurable. 1392 01:06:40,970 --> 01:06:42,560 When is this thing non-measurable? 1393 01:06:42,560 --> 01:06:44,750 It's when it's implicitly defined. 1394 01:06:44,750 --> 01:06:46,700 OK, and in particular, the things 1395 01:06:46,700 --> 01:06:48,880 that give you problems are-- 1396 01:06:52,370 --> 01:06:53,525 sup or inf. 1397 01:06:53,525 --> 01:06:55,820 Anybody knows what a sup or an inf is? 1398 01:06:55,820 --> 01:06:57,560 It's like a max or a min. 1399 01:06:57,560 --> 01:06:59,720 But it's not always attained. 1400 01:06:59,720 --> 01:07:02,120 OK, so if I have x1--
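The example statistics listed a moment ago can be written out concretely. This is a small illustrative sketch with made-up numbers, not part of the lecture:

```python
from math import log

# A statistic is any (measurable) function of the data x1, ..., xn.
x = [2.0, 5.0, 1.0, 3.0, 4.0]   # made-up sample

xbar = sum(x) / len(x)               # the average xn bar
xmax = max(x)                        # the maximum observation
first = x[0]                         # just the first observation
fancy = x[0] + log(1 + abs(x[-1]))   # first one + log(1 + |last one|)

# All four are statistics; some of them (like `first`) are clearly
# bad estimators of the mean, but they are statistics nonetheless.
```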
1401 01:07:02,120 --> 01:07:06,800 So if I look at the infimum of the function 1402 01:07:06,800 --> 01:07:11,610 f of x for x on the real line-- sorry, 1403 01:07:11,610 --> 01:07:13,880 let's say x on the interval from 1 to infinity. 1404 01:07:13,880 --> 01:07:16,970 And f of x is equal to 1 over x. 1405 01:07:16,970 --> 01:07:18,070 Right? 1406 01:07:18,070 --> 01:07:20,030 Then the infimum is the smallest value 1407 01:07:20,030 --> 01:07:22,850 it can take, except that it doesn't really 1408 01:07:22,850 --> 01:07:28,270 take it at 0, right, because 1 over x is going to 0. 1409 01:07:28,270 --> 01:07:30,240 But it's never really getting there. 1410 01:07:30,240 --> 01:07:32,590 So we just call the inf 0. 1411 01:07:32,590 --> 01:07:34,340 But it's not a value that it ever takes. 1412 01:07:34,340 --> 01:07:37,952 And these things might actually be complicated to compute. 1413 01:07:37,952 --> 01:07:40,160 And so that's when you actually have problems, right? 1414 01:07:40,160 --> 01:07:41,870 When the limit is not-- 1415 01:07:41,870 --> 01:07:44,030 you're not really quite reaching the limit. 1416 01:07:44,030 --> 01:07:47,481 You won't have this problem in general, but just so you know, 1417 01:07:47,481 --> 01:07:48,980 an estimator is not really anything. 1418 01:07:48,980 --> 01:07:51,440 It has to actually be measurable. 1419 01:07:51,440 --> 01:07:54,630 OK, so the first thing we want to know-- I mentioned it-- 1420 01:07:54,630 --> 01:07:57,690 so an estimator is a statistic which does not depend on theta, 1421 01:07:57,690 --> 01:07:58,670 of course. 1422 01:07:58,670 --> 01:08:01,430 So if I give you the data, you have to be able to compute it. 1423 01:08:01,430 --> 01:08:04,250 And that should not require knowing any unknown 1424 01:08:04,250 --> 01:08:06,840 parameters. 1425 01:08:06,840 --> 01:08:11,070 OK, so an estimator is said to be consistent.
1426 01:08:11,070 --> 01:08:13,790 When my data-- when I collect more and more data, this thing 1427 01:08:13,790 --> 01:08:16,130 is getting closer and closer to the true parameter. 1428 01:08:16,130 --> 01:08:16,629 All right? 1429 01:08:16,629 --> 01:08:20,080 And we said that eyeballing and saying that it's going to be 4 1430 01:08:20,080 --> 01:08:21,770 is not really something that's probably 1431 01:08:21,770 --> 01:08:22,907 going to be consistent. 1432 01:08:22,907 --> 01:08:24,740 But we can have things that are consistent 1433 01:08:24,740 --> 01:08:28,850 but that are converging to theta at different speeds. 1434 01:08:28,850 --> 01:08:29,816 OK? 1435 01:08:29,816 --> 01:08:32,479 And we know also that this is a random variable. 1436 01:08:32,479 --> 01:08:33,649 It converges to something. 1437 01:08:33,649 --> 01:08:35,390 And there might be some different notions 1438 01:08:35,390 --> 01:08:36,556 of convergence that kick in. 1439 01:08:36,556 --> 01:08:38,060 And actually there are. 1440 01:08:38,060 --> 01:08:40,850 And we say that it's weakly consistent if it converges 1441 01:08:40,850 --> 01:08:43,670 in probability and strongly consistent 1442 01:08:43,670 --> 01:08:46,310 if it converges almost surely. 1443 01:08:46,310 --> 01:08:46,970 OK? 1444 01:08:46,970 --> 01:08:48,890 And this is just vocabulary. 1445 01:08:48,890 --> 01:08:50,581 It won't make a big difference. 1446 01:08:50,581 --> 01:08:51,080 OK? 1447 01:08:51,080 --> 01:08:56,228 So we will typically say it's consistent with any of the two. 1448 01:08:56,228 --> 01:08:57,707 AUDIENCE: [INAUDIBLE]. 1449 01:09:02,637 --> 01:09:07,488 PHILIPPE RIGOLLET: Well, so in parametric statistics, 1450 01:09:07,488 --> 01:09:09,529 it's actually a little difficult to come up with. 1451 01:09:09,529 --> 01:09:15,022 But in non-parametric ones, I could just say, if I had xi, 1452 01:09:15,022 --> 01:09:24,180 yi, and I know that yi is f of xi plus noise epsilon i.
1453 01:09:24,180 --> 01:09:26,800 And I know that f belongs to some class of functions, 1454 01:09:26,800 --> 01:09:27,930 let's say-- 1455 01:09:27,930 --> 01:09:31,310 [INAUDIBLE] class of smooth functions-- it's massive. 1456 01:09:31,310 --> 01:09:33,810 And now, I'm going to actually find the following estimator. 1457 01:09:33,810 --> 01:09:35,279 I'm going to take the average. 1458 01:09:35,279 --> 01:09:36,945 So I'm going to do least squares, right? 1459 01:09:40,310 --> 01:09:41,720 So I just check. 1460 01:09:41,720 --> 01:09:44,300 I'm trying to minimize the distance of each of my f of xi 1461 01:09:44,300 --> 01:09:45,979 to my yi. 1462 01:09:45,979 --> 01:09:49,700 And now, I want to find the smallest of them. 1463 01:09:49,700 --> 01:09:56,240 So if I look at the infimum here, then the question is-- 1464 01:09:56,240 --> 01:09:57,590 so that could be-- 1465 01:09:57,590 --> 01:09:59,750 well, that's not really an estimator for f. 1466 01:09:59,750 --> 01:10:02,630 But it's an estimator for the smallest possible value. 1467 01:10:02,630 --> 01:10:04,550 And so for example, this is actually 1468 01:10:04,550 --> 01:10:07,340 an estimator for the variance sigma squared. 1469 01:10:07,340 --> 01:10:09,990 This might not be attained, and this might not 1470 01:10:09,990 --> 01:10:13,710 be measurable if f is massive. 1471 01:10:13,710 --> 01:10:16,285 All right, so that's the infimum over some class f of x. 1472 01:10:16,285 --> 01:10:18,150 OK? 1473 01:10:18,150 --> 01:10:20,810 So it's always things that are defined implicitly. 1474 01:10:20,810 --> 01:10:24,982 If it's an average, for example, it's completely measurable. 1475 01:10:24,982 --> 01:10:27,467 OK? 1476 01:10:27,467 --> 01:10:28,958 Any other question?
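To see the sup/inf point from the discussion above numerically, here is a tiny sketch of my own (not from the lecture): the infimum of f(x) = 1/x over x >= 1 is 0, yet no x ever attains it.

```python
def f(x):
    # f(x) = 1/x on [1, infinity); its infimum is 0, never attained.
    return 1.0 / x

# On any finite grid the minimum stays strictly positive...
grid_min = min(f(x) for x in range(1, 10001))   # = 1/10000

# ...and it only creeps toward 0 as the grid extends to the right,
# which is exactly why an inf or sup can be delicate to "compute".
```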
1477 01:10:31,950 --> 01:10:37,809 OK, so we know that the first thing we might want to check, 1478 01:10:37,809 --> 01:10:40,350 and that's definitely something we want about estimators that 1479 01:10:40,350 --> 01:10:43,020 is consistent, because all consistency tells 1480 01:10:43,020 --> 01:10:45,924 us is that just as I collect more and more data, 1481 01:10:45,924 --> 01:10:47,382 my estimator is going to get closer 1482 01:10:47,382 --> 01:10:51,300 and closer to the parameter. 1483 01:10:51,300 --> 01:10:52,930 There's other things we can look at. 1484 01:10:52,930 --> 01:10:55,600 For each possible value of n-- now, right now, 1485 01:10:55,600 --> 01:11:00,560 I have a finite number of observations-- 1486 01:11:00,560 --> 01:11:01,850 25. 1487 01:11:01,850 --> 01:11:04,450 And I want to know something about my estimator. 1488 01:11:04,450 --> 01:11:08,672 The first thing I want to check is maybe if in average, right? 1489 01:11:08,672 --> 01:11:09,880 So this is a random variable. 1490 01:11:09,880 --> 01:11:11,860 Is this random variable in average 1491 01:11:11,860 --> 01:11:14,540 going to be close to theta or not? 1492 01:11:14,540 --> 01:11:17,260 And so the difference how far I am from theta 1493 01:11:17,260 --> 01:11:20,140 is actually called the bias. 1494 01:11:20,140 --> 01:11:28,400 So the bias of an estimator is the expectation of theta hat 1495 01:11:28,400 --> 01:11:31,640 minus the value that I hope it gets, which is theta. 1496 01:11:31,640 --> 01:11:38,642 If this thing is equal to 0, we say that theta hat is unbiased. 1497 01:11:42,030 --> 01:11:44,680 And unbiased estimators are things that people 1498 01:11:44,680 --> 01:11:46,967 are looking for in general. 1499 01:11:46,967 --> 01:11:49,300 The problem is that there's lots of unbiased estimators. 
1500 01:11:49,300 --> 01:11:52,829 And so it might be misleading to look for unbiasedness 1501 01:11:52,829 --> 01:11:54,370 when that's not really the only thing 1502 01:11:54,370 --> 01:11:55,870 you should be looking for. 1503 01:11:55,870 --> 01:11:58,690 OK, so what does it mean to be unbiased? 1504 01:11:58,690 --> 01:12:00,432 Maybe for this particular round of data 1505 01:12:00,432 --> 01:12:02,140 you collected, you're actually pretty far 1506 01:12:02,140 --> 01:12:04,240 from the true parameter. 1507 01:12:04,240 --> 01:12:08,440 But one thing that actually-- 1508 01:12:08,440 --> 01:12:12,580 what it means is that if I redid this experiment over, and over, 1509 01:12:12,580 --> 01:12:16,870 and over again, and I averaged all the values of my estimators 1510 01:12:16,870 --> 01:12:19,770 that I got, then this would actually be the right-- 1511 01:12:19,770 --> 01:12:21,371 the true parameter. 1512 01:12:21,371 --> 01:12:21,870 OK. 1513 01:12:21,870 --> 01:12:22,990 That's what it means. 1514 01:12:22,990 --> 01:12:25,360 If I were to repeat this experiment, 1515 01:12:25,360 --> 01:12:27,610 in average, I would actually get the right thing. 1516 01:12:27,610 --> 01:12:30,300 But you don't get to repeat the experiment. 1517 01:12:30,300 --> 01:12:33,070 OK, just a remark about estimators, 1518 01:12:33,070 --> 01:12:34,910 look at this estimator-- xn bar. 1519 01:12:34,910 --> 01:12:35,410 Right? 1520 01:12:35,410 --> 01:12:36,670 Think of the kiss example. 1521 01:12:36,670 --> 01:12:39,605 I'm looking at the average of my observations. 1522 01:12:39,605 --> 01:12:41,980 And I want to know what the expectation of this thing is. 1523 01:12:44,670 --> 01:12:45,780 OK? 1524 01:12:45,780 --> 01:12:56,843 Now, this guy is by linearity of the expectation, 1525 01:12:56,843 --> 01:12:59,710 it is this, right? 1526 01:12:59,710 --> 01:13:03,850 But my data is identically distributed.
1527 01:13:03,850 --> 01:13:07,060 So in particular, all the xi's have the same expectation, 1528 01:13:07,060 --> 01:13:09,070 right? 1529 01:13:09,070 --> 01:13:10,734 Everybody agrees with this. 1530 01:13:10,734 --> 01:13:12,150 When it's identically distributed, 1531 01:13:12,150 --> 01:13:14,870 they'll get the same expectation. 1532 01:13:14,870 --> 01:13:17,950 So what it means is that this guy's here-- 1533 01:13:17,950 --> 01:13:22,210 they're all equal to the expectation of x1. 1534 01:13:22,210 --> 01:13:23,710 Right? 1535 01:13:23,710 --> 01:13:25,980 So what it means is that these guys-- 1536 01:13:25,980 --> 01:13:28,280 I have the average of the same number. 1537 01:13:28,280 --> 01:13:31,830 So this is actually the expectation of x1. 1538 01:13:31,830 --> 01:13:32,517 OK? 1539 01:13:32,517 --> 01:13:33,100 And it's true. 1540 01:13:33,100 --> 01:13:36,836 In the kiss example, this was p. 1541 01:13:36,836 --> 01:13:37,780 And this is p-- 1542 01:13:40,620 --> 01:13:43,200 the probability of turning your head right. 1543 01:13:43,200 --> 01:13:43,840 OK? 1544 01:13:43,840 --> 01:13:45,670 So those two things are the same. 1545 01:13:45,670 --> 01:13:50,020 In particular, that means that xn bar and just x1 1546 01:13:50,020 --> 01:13:54,120 have the same bias. 1547 01:13:54,120 --> 01:13:56,100 So that should probably illustrate to you 1548 01:13:56,100 --> 01:13:59,400 that bias is not something that really is telling you 1549 01:13:59,400 --> 01:14:02,930 the entire picture, Right? 1550 01:14:02,930 --> 01:14:05,350 I can take only one of my observations-- 1551 01:14:05,350 --> 01:14:06,484 Bernoulli 0, 1. 1552 01:14:06,484 --> 01:14:07,900 This thing will have the same bias 1553 01:14:07,900 --> 01:14:10,880 as if I average 1,000 of them. 1554 01:14:10,880 --> 01:14:13,350 But the bias is really telling you where I am in average. 1555 01:14:13,350 --> 01:14:16,261 But it's really not telling me what fluctuations I'm getting. 
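The claim that x1 and xn bar have the same bias but very different fluctuations can be checked by simulation. A sketch with arbitrary made-up parameters, not from the lecture:

```python
import random

random.seed(1)
p = 0.4            # arbitrary "true" Bernoulli parameter
n, reps = 50, 20_000

sum_xbar, sum_x1 = 0.0, 0.0
for _ in range(reps):
    xs = [1.0 if random.random() < p else 0.0 for _ in range(n)]
    sum_xbar += sum(xs) / n   # the average of the sample
    sum_x1 += xs[0]           # just the first observation

bias_xbar = sum_xbar / reps - p   # ~ 0: xn bar is unbiased
bias_x1 = sum_x1 / reps - p       # ~ 0 as well: same bias as xn bar
```

Both estimated biases come out near zero, even though a single observation fluctuates between 0 and 1 while the average of 50 stays close to p; the bias alone cannot distinguish them.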
1556 01:14:16,261 --> 01:14:18,510 And so if you want to start having fluctuations coming 1557 01:14:18,510 --> 01:14:20,460 into the picture, we actually have 1558 01:14:20,460 --> 01:14:22,350 to look at the risk or the quadratic risk 1559 01:14:22,350 --> 01:14:23,852 of the estimator. 1560 01:14:23,852 --> 01:14:25,560 And so the quadratic risk is defined as 1561 01:14:25,560 --> 01:14:28,590 the expectation of the square distance between theta hat 1562 01:14:28,590 --> 01:14:30,770 and theta. 1563 01:14:30,770 --> 01:14:33,155 OK? 1564 01:14:33,155 --> 01:14:34,520 So let's look at this. 1565 01:14:42,360 --> 01:14:43,830 So the quadratic risk-- 1566 01:14:47,270 --> 01:14:48,710 sometimes it's denoted-- people 1567 01:14:48,710 --> 01:14:57,590 call it the l2 risk of theta hat, of course. 1568 01:14:57,590 --> 01:14:59,950 I'm sorry for maintaining such an ugly board. 1569 01:14:59,950 --> 01:15:00,910 [INAUDIBLE] this stuff. 1570 01:15:09,460 --> 01:15:10,960 OK, so I look at the square distance 1571 01:15:10,960 --> 01:15:12,070 between theta hat and theta. 1572 01:15:12,070 --> 01:15:14,530 This is still-- this is a function of a random variable. 1573 01:15:14,530 --> 01:15:16,250 So it's a random variable as well. 1574 01:15:16,250 --> 01:15:19,081 And now I'm looking at the expectation of this guy. 1575 01:15:19,081 --> 01:15:23,060 That's the definition. 1576 01:15:23,060 --> 01:15:25,560 I claim that when this thing goes to 0, then 1577 01:15:25,560 --> 01:15:28,350 my estimator is actually going to be consistent. 1578 01:15:28,350 --> 01:15:30,298 Everybody agrees with this? 1579 01:15:37,116 --> 01:15:47,715 So if it goes to zero as n goes to infinity-- and here, 1580 01:15:47,715 --> 01:15:50,090 I don't need to tell you what kind of convergence I have, 1581 01:15:50,090 --> 01:15:51,715 because this is just a number, right? 1582 01:15:51,715 --> 01:15:52,640 It's an expectation.
1583 01:15:52,640 --> 01:15:57,260 So it's a regular, usual calculus-style convergence. 1584 01:15:57,260 --> 01:16:03,326 Then that implies that theta hat is actually weakly consistent. 1585 01:16:07,294 --> 01:16:09,774 What did I use to tell you this? 1586 01:16:14,740 --> 01:16:17,450 Yeah, this is the convergence in l2. 1587 01:16:17,450 --> 01:16:19,950 This actually is strictly equivalent. 1588 01:16:19,950 --> 01:16:26,264 This is by definition saying that theta hat converges in l2 1589 01:16:26,264 --> 01:16:29,360 to theta. 1590 01:16:29,360 --> 01:16:31,190 And we know that convergence in l2 1591 01:16:31,190 --> 01:16:37,305 implies convergence in probability to theta. 1592 01:16:37,305 --> 01:16:38,180 That was the picture. 1593 01:16:38,180 --> 01:16:40,160 We're going up. 1594 01:16:40,160 --> 01:16:42,498 And this is actually equivalent to consistency 1595 01:16:42,498 --> 01:16:46,329 by definition-- weak consistency. 1596 01:16:46,329 --> 01:16:48,370 OK, so this is actually telling you a little more 1597 01:16:48,370 --> 01:16:50,380 because this guy here-- 1598 01:16:50,380 --> 01:16:52,810 they are both unbiased. 1599 01:16:52,810 --> 01:16:55,010 So xn bar is unbiased. 1600 01:16:55,010 --> 01:16:56,470 X1 is unbiased. 1601 01:16:56,470 --> 01:16:58,060 But x1 is certainly not consistent, 1602 01:16:58,060 --> 01:17:01,150 because the more data I collect, I'm not even doing anything 1603 01:17:01,150 --> 01:17:01,650 with it. 1604 01:17:01,650 --> 01:17:04,120 I'm just taking the first data point you're giving to me. 1605 01:17:04,120 --> 01:17:05,650 So they're both unbiased. 1606 01:17:05,650 --> 01:17:07,420 But this one is not consistent. 1607 01:17:07,420 --> 01:17:09,930 And this one we'll see is actually consistent. 1608 01:17:09,930 --> 01:17:11,370 xn bar is consistent. 1609 01:17:11,370 --> 01:17:14,358 And actually, we've seen that last time. 1610 01:17:14,358 --> 01:17:15,816 And that's because of the?
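The step from the risk going to 0 to weak consistency can be spelled out; it is Markov's inequality applied to the squared deviation:

```latex
\mathbb{P}\left(|\hat{\theta}_n - \theta| \ge \varepsilon\right)
  = \mathbb{P}\left((\hat{\theta}_n - \theta)^2 \ge \varepsilon^2\right)
  \le \frac{\mathbb{E}\left[(\hat{\theta}_n - \theta)^2\right]}{\varepsilon^2}
  \xrightarrow[n \to \infty]{} 0
  \quad \text{for every } \varepsilon > 0,
```

which is exactly convergence in probability of theta hat to theta, that is, weak consistency.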
1611 01:17:19,704 --> 01:17:23,752 What guarantees the fact that xn bar is consistent? 1612 01:17:23,752 --> 01:17:25,210 AUDIENCE: The law of large numbers. 1613 01:17:25,210 --> 01:17:26,110 PHILIPPE RIGOLLET: The law of large numbers, right? 1614 01:17:26,110 --> 01:17:27,780 Actually, it's strongly consistent 1615 01:17:27,780 --> 01:17:29,880 if you have a strong law of large numbers. 1616 01:17:29,880 --> 01:17:35,580 OK, so just in the last two minutes, 1617 01:17:35,580 --> 01:17:39,920 I want to tell you a little bit about how this risk is linked 1618 01:17:39,920 --> 01:17:43,350 to the bias and the variance: the quadratic risk is equal to the bias 1619 01:17:43,350 --> 01:17:44,840 squared plus the variance. 1620 01:17:44,840 --> 01:17:48,030 So let's see what I mean by this. 1621 01:17:48,030 --> 01:17:50,030 So I'm going to forget about the absolute values, 1622 01:17:50,030 --> 01:17:50,988 since we have a square. 1623 01:17:50,988 --> 01:17:54,600 I don't really need them. 1624 01:17:54,600 --> 01:17:57,530 If theta hat was unbiased, this thing 1625 01:17:57,530 --> 01:18:01,619 would be the expectation of theta hat. 1626 01:18:01,619 --> 01:18:02,660 It might not be the case. 1627 01:18:02,660 --> 01:18:06,464 So let me see how I can actually see-- put the bias in there. 1628 01:18:06,464 --> 01:18:07,880 Well, one way to do this is to see 1629 01:18:07,880 --> 01:18:10,170 that this is equal to the expectation of theta 1630 01:18:10,170 --> 01:18:13,110 hat minus the expectation of theta hat, 1631 01:18:13,110 --> 01:18:17,160 plus the expectation of theta hat minus theta. 1632 01:18:21,450 --> 01:18:22,620 OK? 1633 01:18:22,620 --> 01:18:24,860 I just removed and added the same thing. 1634 01:18:24,860 --> 01:18:27,030 So I didn't change anything. 1635 01:18:27,030 --> 01:18:29,840 Now, this guy is my bias, right? 1636 01:18:32,570 --> 01:18:34,680 So now let me expand the square.
1637 01:18:34,680 --> 01:18:37,680 So what I get is the expectation of the square of theta 1638 01:18:37,680 --> 01:18:39,295 hat minus its expectation. 1639 01:18:42,480 --> 01:18:45,850 I should put some square brackets-- 1640 01:18:45,850 --> 01:18:50,410 plus two times the cross-product. 1641 01:18:50,410 --> 01:18:52,900 So the cross-product is what? Expectation 1642 01:18:52,900 --> 01:18:59,740 of theta hat minus the expectation of theta hat times 1643 01:18:59,740 --> 01:19:03,516 expectation of theta hat minus theta. 1644 01:19:07,250 --> 01:19:08,892 And then I have the last square. 1645 01:19:17,830 --> 01:19:22,890 Expectation of theta hat minus theta squared. 1646 01:19:22,890 --> 01:19:24,240 OK? 1647 01:19:24,240 --> 01:19:27,100 So square, cross-product, square. 1648 01:19:27,100 --> 01:19:29,980 Everybody is with me? 1649 01:19:29,980 --> 01:19:32,830 Now this guy here-- 1650 01:19:32,830 --> 01:19:35,070 if you pay attention, this thing is the expectation 1651 01:19:35,070 --> 01:19:36,070 of some random variable. 1652 01:19:36,070 --> 01:19:38,450 So it's a deterministic number. 1653 01:19:38,450 --> 01:19:39,800 Theta is the true parameter. 1654 01:19:39,800 --> 01:19:41,810 It's a deterministic number. 1655 01:19:41,810 --> 01:19:44,020 So what I can do is pull this entire thing out 1656 01:19:44,020 --> 01:19:52,090 of the expectation like this and compute the expectation only 1657 01:19:52,090 --> 01:19:53,607 with respect to that part. 1658 01:19:53,607 --> 01:19:56,529 But what is the expectation of this thing? 1659 01:19:59,460 --> 01:20:00,240 It's zero, right? 1660 01:20:00,240 --> 01:20:02,323 The expectation of theta hat minus the expectation 1661 01:20:02,323 --> 01:20:03,852 of theta hat is 0. 1662 01:20:03,852 --> 01:20:07,900 So this entire thing is equal to 0.
1663 01:20:07,900 --> 01:20:12,200 So now when I actually collect back my quadratic terms-- 1664 01:20:12,200 --> 01:20:15,980 my two squared terms in this expansion-- 1665 01:20:15,980 --> 01:20:18,650 what I get is that the expectation 1666 01:20:18,650 --> 01:20:21,800 of theta hat minus theta squared is 1667 01:20:21,800 --> 01:20:26,450 equal to the expectation of theta hat minus expectation 1668 01:20:26,450 --> 01:20:32,220 of theta hat squared plus the square of expectation 1669 01:20:32,220 --> 01:20:35,650 of theta hat minus theta. 1670 01:20:40,550 --> 01:20:41,190 Right? 1671 01:20:41,190 --> 01:20:42,600 So those are just the two-- 1672 01:20:42,600 --> 01:20:46,560 the first and the last term of the previous equality. 1673 01:20:46,560 --> 01:20:48,690 Now, here I have the expectation of the square 1674 01:20:48,690 --> 01:20:50,481 of the difference between a random variable 1675 01:20:50,481 --> 01:20:52,440 and its expectation. 1676 01:20:52,440 --> 01:20:56,100 This is otherwise known as the variance, right? 1677 01:20:56,100 --> 01:21:03,660 So this is actually equal to the variance of theta hat. 1678 01:21:03,660 --> 01:21:05,690 And well, this was the bias. 1679 01:21:05,690 --> 01:21:07,200 We already said that's there. 1680 01:21:07,200 --> 01:21:09,385 So this whole thing is the bias squared. 1681 01:21:12,531 --> 01:21:13,030 OK? 1682 01:21:13,030 --> 01:21:15,550 And hence the quadratic risk is the sum 1683 01:21:15,550 --> 01:21:18,010 of the variance and the squared bias. 1684 01:21:18,010 --> 01:21:18,940 Why squared bias? 1685 01:21:18,940 --> 01:21:21,130 Well, because otherwise, you would be adding dollars 1686 01:21:21,130 --> 01:21:22,799 to dollars squared. 1687 01:21:22,799 --> 01:21:24,715 So you need to add dollars squared and dollars 1688 01:21:24,715 --> 01:21:27,870 squared so that this thing is actually homogeneous.
1689 01:21:27,870 --> 01:21:30,910 So if x is in dollars, then the bias is in dollars, 1690 01:21:30,910 --> 01:21:32,870 but the variance is in dollars squared. 1691 01:21:32,870 --> 01:21:35,036 OK, and the square here forced you to put everything 1692 01:21:35,036 --> 01:21:36,150 on the square scale. 1693 01:21:36,150 --> 01:21:39,190 All right, so what's nice is that if the quadratic risk goes 1694 01:21:39,190 --> 01:21:42,820 to 0, then since I have the sum of two positive terms, 1695 01:21:42,820 --> 01:21:45,040 both of them have to go to 0. 1696 01:21:45,040 --> 01:21:46,930 That means that my variance is going to 0-- 1697 01:21:46,930 --> 01:21:48,790 very little fluctuations. 1698 01:21:48,790 --> 01:21:51,580 And my bias is also going to 0, which means that I'm actually 1699 01:21:51,580 --> 01:21:53,260 going to be on target once I reduce 1700 01:21:53,260 --> 01:21:55,450 my fluctuations, because it's one thing to reduce 1701 01:21:55,450 --> 01:21:56,732 the fluctuations. 1702 01:21:56,732 --> 01:21:58,690 But if I'm not on target, it's an issue, right? 1703 01:21:58,690 --> 01:22:03,422 For example, the estimator for the value 4 has no variance. 1704 01:22:03,422 --> 01:22:05,380 Every time I'm going to repeat the experiments, 1705 01:22:05,380 --> 01:22:07,510 I'm going to get 4, 4, 4, 4-- 1706 01:22:07,510 --> 01:22:08,890 variance is 0. 1707 01:22:08,890 --> 01:22:10,390 But the bias is bad. 1708 01:22:10,390 --> 01:22:12,880 The bias is 4 minus theta. 1709 01:22:12,880 --> 01:22:17,140 And if theta is far from 4, that's not doing very well. 1710 01:22:17,140 --> 01:22:21,420 OK, so next week, we will-- 1711 01:22:21,420 --> 01:22:25,060 we'll talk about what is a good estimate-- 1712 01:22:25,060 --> 01:22:26,740 how estimators change if they have 1713 01:22:26,740 --> 01:22:32,440 high variance or low variance or high bias and low bias. 1714 01:22:32,440 --> 01:22:35,640 And we'll talk about confidence intervals as well.
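As a closing aside not from the lecture, the bias-variance decomposition derived above can be sanity-checked by simulation. The estimator below is an arbitrary, deliberately biased one made up for illustration:

```python
import random

random.seed(2)
theta = 1.0        # arbitrary "true" parameter
reps = 50_000

def theta_hat():
    # Mean of 10 Gaussian draws, shifted by 0.5 to force a bias of 0.5.
    return sum(random.gauss(theta, 1.0) for _ in range(10)) / 10 + 0.5

vals = [theta_hat() for _ in range(reps)]
m = sum(vals) / reps
risk = sum((v - theta) ** 2 for v in vals) / reps   # E[(theta hat - theta)^2]
var = sum((v - m) ** 2 for v in vals) / reps        # variance of theta hat
bias_sq = (m - theta) ** 2                          # squared bias, ~ 0.25

# In-sample, risk = variance + bias^2 holds exactly (up to rounding),
# mirroring the decomposition on the board: here variance ~ 1/10.
```

The "eyeball" estimator that always answers 4 is the extreme case: its variance term is 0, and its entire risk is the squared bias (4 minus theta) squared.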