1 00:00:01,610 --> 00:00:03,980 The following content is provided under a Creative 2 00:00:03,980 --> 00:00:05,370 Commons license. 3 00:00:05,370 --> 00:00:07,580 Your support will help MIT OpenCourseWare 4 00:00:07,580 --> 00:00:11,670 continue to offer high-quality educational resources for free. 5 00:00:11,670 --> 00:00:14,210 To make a donation or to view additional materials 6 00:00:14,210 --> 00:00:18,170 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,170 --> 00:00:19,370 at ocw.mit.edu. 8 00:00:23,309 --> 00:00:25,850 PHILIPPE RIGOLLET: OK, so the course you're currently sitting 9 00:00:25,850 --> 00:00:27,670 in is 18.650. 10 00:00:27,670 --> 00:00:29,570 And it's called Fundamentals of Statistics. 11 00:00:29,570 --> 00:00:33,200 And until last spring, it was still called Statistics 12 00:00:33,200 --> 00:00:34,250 for Applications. 13 00:00:34,250 --> 00:00:37,160 It turned out that really, based on the content, "Fundamentals 14 00:00:37,160 --> 00:00:42,399 of Statistics" was a more appropriate title. 15 00:00:42,399 --> 00:00:43,940 I'll tell you a little bit about what 16 00:00:43,940 --> 00:00:46,231 we're going to be covering in class, what this class is 17 00:00:46,231 --> 00:00:48,470 about, what it's not about. 18 00:00:48,470 --> 00:00:50,580 I realize there's several offerings 19 00:00:50,580 --> 00:00:52,490 in statistics on campus. 20 00:00:52,490 --> 00:00:56,249 So I want to make sure that you've chosen the right one. 21 00:00:56,249 --> 00:00:58,040 And I also understand that for some of you, 22 00:00:58,040 --> 00:01:01,190 it's a matter of scheduling. 23 00:01:01,190 --> 00:01:03,220 I need to actually throw out a disclaimer. 24 00:01:03,220 --> 00:01:05,940 I tend to speak too fast. 25 00:01:05,940 --> 00:01:07,700 I'm aware that. 26 00:01:07,700 --> 00:01:10,580 Someone in the back, just do like that when you 27 00:01:10,580 --> 00:01:12,560 have no idea what I'm saying. 
28 00:01:12,560 --> 00:01:14,520 Hopefully, I will repeat myself many times. 29 00:01:14,520 --> 00:01:15,979 So if you average over time, you'll 30 00:01:15,979 --> 00:01:17,353 see that statistics will tell you 31 00:01:17,353 --> 00:01:19,700 that you will get the right message that I was actually 32 00:01:19,700 --> 00:01:22,760 trying to send. 33 00:01:22,760 --> 00:01:26,060 All right, so what are the goals of this class? 34 00:01:26,060 --> 00:01:28,510 The first one is basically to give you an introduction. 35 00:01:28,510 --> 00:01:31,910 No one here is expected to have seen statistics before, 36 00:01:31,910 --> 00:01:33,470 but as you will see, you are expected 37 00:01:33,470 --> 00:01:34,770 to have seen probability. 38 00:01:34,770 --> 00:01:36,830 And usually, you do see some statistics 39 00:01:36,830 --> 00:01:38,270 in a probability course. 40 00:01:38,270 --> 00:01:39,990 So I'm sure some of you have some ideas, 41 00:01:39,990 --> 00:01:42,090 but I won't expect anything. 42 00:01:42,090 --> 00:01:44,120 And we'll be using mathematics. 43 00:01:44,120 --> 00:01:48,110 Math class, so there's going to be a bunch of equations-- 44 00:01:48,110 --> 00:01:52,532 not so much real data and statistical thinking. 45 00:01:52,532 --> 00:01:54,740 We're going to try to provide theoretical guarantees. 46 00:01:54,740 --> 00:01:58,130 We have two estimators that are available for me-- 47 00:01:58,130 --> 00:02:01,580 how theory guides me to choose between the best of them, 48 00:02:01,580 --> 00:02:06,320 how certain can I be of my guarantees or prediction? 49 00:02:06,320 --> 00:02:08,549 It's one thing to just spit out a number. 50 00:02:08,549 --> 00:02:10,590 It's another thing to put some error bars around it. 51 00:02:10,590 --> 00:02:14,510 And we'll see how to build error bars, for example. 52 00:02:14,510 --> 00:02:16,530 You will have your own applications. 
53 00:02:16,530 --> 00:02:19,160 I'm happy to answer questions about specific applications. 54 00:02:19,160 --> 00:02:21,980 But rather than trying to tailor applications 55 00:02:21,980 --> 00:02:24,650 to an entire institute, I think we're 56 00:02:24,650 --> 00:02:28,040 going to work with pretty standard applications, 57 00:02:28,040 --> 00:02:32,510 mostly not very serious ones. 58 00:02:32,510 --> 00:02:36,470 And hopefully, you'll be able to take the main principles back 59 00:02:36,470 --> 00:02:39,290 with you and apply them to your particular problem. 60 00:02:39,290 --> 00:02:43,250 What I'm hoping that you will get out of this class is that 61 00:02:43,250 --> 00:02:46,130 when you have a real-life situation-- and by "real life", 62 00:02:46,130 --> 00:02:48,890 I mean mostly at MIT, so some people probably would not call 63 00:02:48,890 --> 00:02:50,240 that real life-- 64 00:02:50,240 --> 00:02:52,370 the goal is to formulate a statistical problem 65 00:02:52,370 --> 00:02:53,950 in mathematical terms. 66 00:02:53,950 --> 00:02:56,930 If I want to say, is a drug effective, 67 00:02:56,930 --> 00:02:58,620 that's not in mathematical terms. 68 00:02:58,620 --> 00:03:00,570 I have to find out which measure I want 69 00:03:00,570 --> 00:03:03,470 to have to call it effective. 70 00:03:03,470 --> 00:03:06,780 Maybe it's over a certain period of time. 71 00:03:06,780 --> 00:03:08,990 So there's a lot of things that you actually need. 72 00:03:08,990 --> 00:03:10,489 And I'm not really going to tell you 73 00:03:10,489 --> 00:03:13,461 how to go from the application to the point you need to be. 74 00:03:13,461 --> 00:03:14,960 But I will certainly describe to you 75 00:03:14,960 --> 00:03:19,160 what point you need to be at if you want to start applying 76 00:03:19,160 --> 00:03:21,320 statistical methodology. 
77 00:03:21,320 --> 00:03:23,750 Then once you understand what kind of question 78 00:03:23,750 --> 00:03:24,714 you want to answer-- 79 00:03:24,714 --> 00:03:26,630 do I want a yes/no answer, do I want a number, 80 00:03:26,630 --> 00:03:29,300 do I want error bars, do I want to make predictions 81 00:03:29,300 --> 00:03:32,330 five years into the future, do I have side information, 82 00:03:32,330 --> 00:03:34,784 or do I not have side information, all those things-- 83 00:03:34,784 --> 00:03:36,200 based on that, hopefully, you will 84 00:03:36,200 --> 00:03:38,480 have a catalog of statistical methods 85 00:03:38,480 --> 00:03:44,570 that you're going to be able to use and apply in the wild. 86 00:03:44,570 --> 00:03:49,370 And also, no statistical method is perfect. 87 00:03:49,370 --> 00:03:52,470 Some of them, people have agreed upon over the years, 88 00:03:52,470 --> 00:03:54,980 and people understand that this is the standard. 89 00:03:54,980 --> 00:03:57,050 But I want you to be able to understand 90 00:03:57,050 --> 00:03:59,360 what the limitations are, and when you make conclusions 91 00:03:59,360 --> 00:04:03,020 based on data, that those conclusions might be erroneous, 92 00:04:03,020 --> 00:04:03,890 for example. 93 00:04:03,890 --> 00:04:09,220 All right, more practically, my goal here is to have you ready. 94 00:04:09,220 --> 00:04:12,916 So who has taken, for example, a machine-learning class here? 95 00:04:12,916 --> 00:04:15,820 All right, so many of you, actually-- maybe a third 96 00:04:15,820 --> 00:04:19,230 have taken a machine-learning class. 97 00:04:19,230 --> 00:04:21,600 So statistics has somewhat evolved into machine 98 00:04:21,600 --> 00:04:22,840 learning in recent years. 99 00:04:22,840 --> 00:04:24,529 And my goal is to take you there. 100 00:04:24,529 --> 00:04:26,820 So machine learning has a strong algorithmic component. 
101 00:04:26,820 --> 00:04:29,310 So maybe some of you have taken a machine-learning class 102 00:04:29,310 --> 00:04:31,890 that displays mostly the algorithmic component. 103 00:04:31,890 --> 00:04:33,930 But there's also a statistical component. 104 00:04:33,930 --> 00:04:37,140 The machine learns from data. 105 00:04:37,140 --> 00:04:39,090 So this is a statistical track. 106 00:04:39,090 --> 00:04:43,770 And there are some statistical machine-learning classes 107 00:04:43,770 --> 00:04:44,800 that you can take here. 108 00:04:44,800 --> 00:04:47,250 They're offered at the graduate level, I believe. 109 00:04:47,250 --> 00:04:51,430 But I want you to be ready to be able to take those classes, 110 00:04:51,430 --> 00:04:53,520 having the statistical fundamentals to understand 111 00:04:53,520 --> 00:04:54,270 what you're doing. 112 00:04:54,270 --> 00:04:58,230 And then you're going to be able to expand to broader and more 113 00:04:58,230 --> 00:05:00,780 sophisticated methods. 114 00:05:00,780 --> 00:05:05,250 Lectures are here from 11:00 to 12:30 on Tuesday and Thursday. 115 00:05:05,250 --> 00:05:07,270 Victor-Emmanuel will also be-- 116 00:05:07,270 --> 00:05:09,260 and you can call him Victor-- 117 00:05:09,260 --> 00:05:12,250 will also be holding mandatory recitation. 118 00:05:12,250 --> 00:05:15,030 So please go on Stellar and pick your recitation. 119 00:05:15,030 --> 00:05:19,770 It's either 3:00 to 4:00 or 4:00 to 5:00 on Wednesdays. 120 00:05:19,770 --> 00:05:22,890 And it's going to be mostly focused on problem-solving. 121 00:05:22,890 --> 00:05:28,510 They're mandatory in the sense that we're allowed to do this, 122 00:05:28,510 --> 00:05:32,840 but they're not going to cover entirely new material. 123 00:05:32,840 --> 00:05:35,760 But they might cover some techniques 124 00:05:35,760 --> 00:05:39,180 that might save you some time when it comes to the exam. 125 00:05:39,180 --> 00:05:41,460 So you might get by. 
126 00:05:41,460 --> 00:05:44,250 Attendance is not going to be taken or anything like this. 127 00:05:44,250 --> 00:05:47,760 But I highly recommend that you go, 128 00:05:47,760 --> 00:05:49,420 because, well, they're mandatory. 129 00:05:49,420 --> 00:05:52,830 So you cannot really complain that something was taught only 130 00:05:52,830 --> 00:05:54,390 in recitation. 131 00:05:54,390 --> 00:05:56,760 So please register on Stellar for which 132 00:05:56,760 --> 00:05:59,520 of the two recitations you would like to be in. 133 00:05:59,520 --> 00:06:03,230 They're capped at 40, so first come, first served. 134 00:06:03,230 --> 00:06:05,600 Homework will be due weekly. 135 00:06:05,600 --> 00:06:08,030 There's a total of 11 problem sets. 136 00:06:08,030 --> 00:06:09,090 I realize this is a lot. 137 00:06:09,090 --> 00:06:11,090 Hopefully, we'll keep them light. 138 00:06:11,090 --> 00:06:15,350 I just want you to not rush too much. 139 00:06:15,350 --> 00:06:17,390 The 10 best will be kept, and this 140 00:06:17,390 --> 00:06:20,640 will count for a total of 30% of the final grade. 141 00:06:20,640 --> 00:06:25,280 They are due Mondays at 8:00 PM on Stellar. 142 00:06:25,280 --> 00:06:28,670 And this is a new thing. 143 00:06:28,670 --> 00:06:31,590 We're not going to use the boxes outside of the math department. 144 00:06:31,590 --> 00:06:34,520 We're going to use only PDF files. 145 00:06:34,520 --> 00:06:37,790 Well, you're always welcome to type them and practice 146 00:06:37,790 --> 00:06:40,430 your LaTeX or Word typing. 147 00:06:40,430 --> 00:06:42,830 I also understand that this can be a bit of a strain, 148 00:06:42,830 --> 00:06:45,800 so just write them down on a piece of paper, 149 00:06:45,800 --> 00:06:48,800 use your iPhone, and take a picture of it. 
150 00:06:48,800 --> 00:06:50,810 Dropbox has a nice, new-- 151 00:06:50,810 --> 00:06:53,149 so try to find something that puts a lot of contrast, 152 00:06:53,149 --> 00:06:55,190 especially if you use pencil, because we're going 153 00:06:55,190 --> 00:06:56,840 to check if they're readable. 154 00:06:56,840 --> 00:07:01,160 And this is your responsibility to have a readable file. 155 00:07:01,160 --> 00:07:02,450 I've had over the years-- 156 00:07:02,450 --> 00:07:03,920 not at MIT, I must admit-- but I've 157 00:07:03,920 --> 00:07:06,170 had students who actually write the doc file 158 00:07:06,170 --> 00:07:08,180 and think that converting it to a PDF 159 00:07:08,180 --> 00:07:11,360 consists in erasing the extension doc 160 00:07:11,360 --> 00:07:12,680 and replacing it by PDF. 161 00:07:12,680 --> 00:07:13,850 This is not how it works. 162 00:07:17,450 --> 00:07:19,790 So I'm sure you will figure it out. 163 00:07:19,790 --> 00:07:21,620 Please try to keep them letter-sized. 164 00:07:21,620 --> 00:07:23,480 This is not a strict requirement, 165 00:07:23,480 --> 00:07:26,330 but I don't want to see thumbnails, either. 166 00:07:26,330 --> 00:07:28,370 You are allowed to have two late homeworks. 167 00:07:28,370 --> 00:07:31,430 And by late, I mean 24 hours late. 168 00:07:31,430 --> 00:07:32,510 No questions asked. 169 00:07:32,510 --> 00:07:34,630 You submit them, this will be counted. 170 00:07:34,630 --> 00:07:36,380 You don't have to send an email to warn us 171 00:07:36,380 --> 00:07:39,350 or anything like this. 172 00:07:39,350 --> 00:07:42,500 Beyond that, even that you have one slack 173 00:07:42,500 --> 00:07:46,490 for one 0 grade and slack for two late homeworks, 174 00:07:46,490 --> 00:07:49,820 you're going to have to come up with a very good explanation 175 00:07:49,820 --> 00:07:52,310 why you need actually more extensions than that, if you 176 00:07:52,310 --> 00:07:53,060 ever do. 
177 00:07:53,060 --> 00:07:54,643 And particularly, you're going to have 178 00:07:54,643 --> 00:07:58,220 to keep track of why you've used your three options before. 179 00:07:58,220 --> 00:08:00,830 There's going to be two midterms. 180 00:08:00,830 --> 00:08:03,740 One is October 3, and one is November 7. 181 00:08:03,740 --> 00:08:05,840 They're both going to be in class for the duration 182 00:08:05,840 --> 00:08:06,860 of the lecture. 183 00:08:06,860 --> 00:08:09,370 When I say they last for an hour and 20 minutes, 184 00:08:09,370 --> 00:08:11,390 it does not mean that if you arrive 10 minutes 185 00:08:11,390 --> 00:08:12,889 before the end of lecture, you still 186 00:08:12,889 --> 00:08:14,210 get an hour and 20 minutes. 187 00:08:14,210 --> 00:08:17,780 It will end at the end of lecture time. 188 00:08:17,780 --> 00:08:19,580 For this as well, no pressure. 189 00:08:19,580 --> 00:08:22,010 Only the best of the two will be kept. 190 00:08:22,010 --> 00:08:26,120 And this grade will count for 30% of the grade. 191 00:08:26,120 --> 00:08:29,120 This will be closed-books and closed-notes. 192 00:08:29,120 --> 00:08:30,941 The purpose is for you to-- yes? 193 00:08:30,941 --> 00:08:33,108 AUDIENCE: How many midterms did you say there are? 194 00:08:33,108 --> 00:08:34,066 PHILIPPE RIGOLLET: Two. 195 00:08:34,066 --> 00:08:36,010 AUDIENCE: You said the best of the two will be kept? 196 00:08:36,010 --> 00:08:37,885 PHILIPPE RIGOLLET: I said the best of the two 197 00:08:37,885 --> 00:08:39,218 will be kept, yes. 198 00:08:39,218 --> 00:08:42,896 AUDIENCE: So both the midterms will be kept? 199 00:08:42,896 --> 00:08:45,270 PHILIPPE RIGOLLET: The best of the two, not the best two. 200 00:08:45,270 --> 00:08:45,811 AUDIENCE: Oh. 201 00:08:50,757 --> 00:08:53,340 PHILIPPE RIGOLLET: We will add them, multiply the number by 9, 202 00:08:53,340 --> 00:08:54,298 and that will be the grade. 203 00:08:54,298 --> 00:08:55,410 No. 
204 00:08:55,410 --> 00:08:59,670 I am trying to be nice, there's just a limit to what I can do. 205 00:08:59,670 --> 00:09:02,340 All right, so the goal is for you to learn things 206 00:09:02,340 --> 00:09:04,020 and to be familiar with them. 207 00:09:04,020 --> 00:09:05,700 In the final, you will be allowed 208 00:09:05,700 --> 00:09:07,170 to have your notes with you. 209 00:09:07,170 --> 00:09:09,300 But the midterms are also a way for you 210 00:09:09,300 --> 00:09:11,760 to develop some mechanism so that you don't actually waste 211 00:09:11,760 --> 00:09:14,160 too much time on things that you should be able to do 212 00:09:14,160 --> 00:09:16,092 without thinking too much. 213 00:09:16,092 --> 00:09:17,550 You will be allowed a cheat sheet, 214 00:09:17,550 --> 00:09:20,940 because, well, you can always forget something. 215 00:09:20,940 --> 00:09:23,790 And it will be a two-sided letter-size sheet, 216 00:09:23,790 --> 00:09:27,790 and you can practice writing as small as you want. 217 00:09:27,790 --> 00:09:30,510 And you can put whatever you want on this cheat sheet. 218 00:09:30,510 --> 00:09:33,037 All right, the final will be decided by the registrar. 219 00:09:33,037 --> 00:09:34,620 It's going to be three hours, and it's 220 00:09:34,620 --> 00:09:35,832 going to count for 40%. 221 00:09:35,832 --> 00:09:38,040 You cannot bring books, but you can bring your notes. 222 00:09:38,040 --> 00:09:38,556 Yes. 223 00:09:38,556 --> 00:09:40,306 AUDIENCE: I noticed that the midterm dates 224 00:09:40,306 --> 00:09:41,704 aren't dated in the syllabus. 225 00:09:41,704 --> 00:09:43,120 So I wanted to make sure you know. 226 00:09:43,120 --> 00:09:44,453 PHILIPPE RIGOLLET: They are not? 227 00:09:44,453 --> 00:09:45,470 AUDIENCE: Yeah-- 228 00:09:45,470 --> 00:09:45,890 PHILIPPE RIGOLLET: Oh, yeah, there's 229 00:09:45,890 --> 00:09:47,973 a "1" that's missing on both of them, isn't there? 230 00:09:59,030 --> 00:10:03,180 Yeah, let's figure that out. 
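[Editorial sketch] The grading scheme described so far (best 10 of 11 problem sets for 30%, the better of the two midterms for 30%, the final for 40%) can be written out as a small calculation. This is only an illustration: the function name `course_grade` and the assumption that every score is on a 0-100 scale are not from the lecture.

```python
def course_grade(homeworks, midterms, final):
    """Combine scores into a course grade.

    Assumption (not from the lecture): every score is on a 0-100 scale.
    """
    if len(homeworks) != 11 or len(midterms) != 2:
        raise ValueError("expected 11 problem sets and 2 midterms")
    # Keep the 10 best problem sets, i.e. drop the single lowest score.
    best_ten = sorted(homeworks, reverse=True)[:10]
    homework_average = sum(best_ten) / 10
    # "The best of the two, not the best two": only one midterm counts.
    best_midterm = max(midterms)
    return 0.30 * homework_average + 0.30 * best_midterm + 0.40 * final


if __name__ == "__main__":
    # One missed problem set and one weak midterm are both forgiven.
    print(course_grade([100] * 10 + [0], [50, 100], 100))
```

Under these weights, a missed problem set or one bad midterm costs nothing as long as the other scores are strong.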
231 00:10:03,180 --> 00:10:05,510 The syllabus is the true one. 232 00:10:05,510 --> 00:10:07,422 The slides are so that we can discuss, 233 00:10:07,422 --> 00:10:08,880 but the ones that are on the syllabus 234 00:10:08,880 --> 00:10:09,879 are the ones that count. 235 00:10:09,879 --> 00:10:13,050 And I think they're also posted on the calendar on Stellar 236 00:10:13,050 --> 00:10:14,909 as well. 237 00:10:14,909 --> 00:10:15,700 Any other question? 238 00:10:20,890 --> 00:10:23,380 OK, so the pre-reqs here-- 239 00:10:23,380 --> 00:10:28,100 and who has looked at the first problem set already? 240 00:10:28,100 --> 00:10:30,940 OK, so those hands that are raised 241 00:10:30,940 --> 00:10:34,160 realize that there is a true prerequisite of probability 242 00:10:34,160 --> 00:10:36,300 for this class. 243 00:10:36,300 --> 00:10:40,840 It can be at the level of 18.600 or 6.041. 244 00:10:40,840 --> 00:10:42,060 I should say "B" now. 245 00:10:42,060 --> 00:10:44,780 It's two classes. 246 00:10:44,780 --> 00:10:48,230 I will require you to know some calculus 247 00:10:48,230 --> 00:10:51,170 and have some notions of linear algebra, 248 00:10:51,170 --> 00:10:53,510 such as, what is a matrix, what is a vector, how 249 00:10:53,510 --> 00:10:55,730 do you multiply those things together, 250 00:10:55,730 --> 00:10:58,400 some notion of what orthonormal vectors are. 251 00:11:01,280 --> 00:11:03,350 We'll talk about eigenvectors and eigenvalues, 252 00:11:03,350 --> 00:11:05,010 but I'll remind you of all of that. 253 00:11:05,010 --> 00:11:07,550 So this is not a strict pre-req. 254 00:11:07,550 --> 00:11:09,910 But if you've taken it, for example, 255 00:11:09,910 --> 00:11:12,080 it doesn't hurt to go back to your notes 256 00:11:12,080 --> 00:11:13,940 when we get closer to this chapter 257 00:11:13,940 --> 00:11:15,950 on principal component analysis. 
258 00:11:15,950 --> 00:11:19,520 The chapters, as they're listed in the syllabus, are in order, 259 00:11:19,520 --> 00:11:22,430 so you will see when it actually comes. 260 00:11:22,430 --> 00:11:24,830 There's no required textbook. 261 00:11:24,830 --> 00:11:29,506 And I know you tend to not like that. 262 00:11:29,506 --> 00:11:31,880 You like to have your textbook to know where you're going 263 00:11:31,880 --> 00:11:32,810 and what we're doing. 264 00:11:32,810 --> 00:11:34,670 I'm sorry, it's just this class. 265 00:11:34,670 --> 00:11:36,950 Either I would have to go to a mathematical statistics 266 00:11:36,950 --> 00:11:38,970 textbook, which is just too much, 267 00:11:38,970 --> 00:11:43,010 or to go to a more engineering-type statistics 268 00:11:43,010 --> 00:11:45,380 class, which is just too little. 269 00:11:45,380 --> 00:11:49,160 So hopefully, the problems will be enough 270 00:11:49,160 --> 00:11:50,690 for you to practice. The recitations 271 00:11:50,690 --> 00:11:52,940 will have some problems to solve as well. 272 00:11:52,940 --> 00:11:55,520 And the material will be posted on the slides. 273 00:11:55,520 --> 00:11:57,320 So you should have everything you need. 274 00:11:57,320 --> 00:11:58,799 There's plenty of resources online 275 00:11:58,799 --> 00:12:00,590 if you want to expand on a particular topic 276 00:12:00,590 --> 00:12:03,080 or read it as said by somebody else. 277 00:12:03,080 --> 00:12:07,690 The book that I recommend in the syllabus 278 00:12:07,690 --> 00:12:11,510 is this book called All of Statistics by Wasserman. 279 00:12:11,510 --> 00:12:13,580 Mainly because of the title, I'm guessing 280 00:12:13,580 --> 00:12:16,700 it has all of it in it. 281 00:12:16,700 --> 00:12:18,530 It's pretty broad. 282 00:12:18,530 --> 00:12:20,030 There's actually not that many. 283 00:12:20,030 --> 00:12:22,910 It's more of an intro-grad level. 284 00:12:22,910 --> 00:12:27,890 But it's not very deep, but you see a lot of the overview. 
285 00:12:27,890 --> 00:12:29,480 Certainly, what we're going to cover 286 00:12:29,480 --> 00:12:30,979 will be a subset of what's in there. 287 00:12:34,190 --> 00:12:35,690 The slides will be posted on Stellar 288 00:12:35,690 --> 00:12:38,420 before lectures before we start a new chapter 289 00:12:38,420 --> 00:12:41,960 and after we're done with the chapter, with the annotations, 290 00:12:41,960 --> 00:12:46,784 and also, with the typos corrected, like for the exam. 291 00:12:46,784 --> 00:12:48,200 There will be some video lectures. 292 00:12:48,200 --> 00:12:52,550 Again, the first one will be posted on OCW from last year. 293 00:12:52,550 --> 00:12:54,950 But all of them will be available on Stellar-- 294 00:12:54,950 --> 00:12:57,770 of course, modulo technical problems. 295 00:12:57,770 --> 00:13:00,317 But this is an automated system. 296 00:13:00,317 --> 00:13:02,150 And hopefully, it will work out well for us. 297 00:13:02,150 --> 00:13:04,490 So if you somehow have to miss a lecture, 298 00:13:04,490 --> 00:13:08,270 you can always catch it up by watching it. 299 00:13:08,270 --> 00:13:10,760 You can also play it at speed 0.75 300 00:13:10,760 --> 00:13:12,530 in case I end up speaking too fast, 301 00:13:12,530 --> 00:13:15,470 but I think I've managed myself so far-- 302 00:13:15,470 --> 00:13:19,690 so just last warning. 303 00:13:19,690 --> 00:13:22,910 All right, why should you study statistics? 304 00:13:22,910 --> 00:13:27,362 Well, if you read the news, you will see a lot of statistics. 305 00:13:27,362 --> 00:13:28,570 I mentioned machine learning. 306 00:13:28,570 --> 00:13:32,269 It's built on a lot of statistics. 307 00:13:32,269 --> 00:13:34,060 If I were to teach this class 10 years ago, 308 00:13:34,060 --> 00:13:37,480 I would have to explain to you that data collection and making 309 00:13:37,480 --> 00:13:40,510 decisions based on data was something that made sense. 
310 00:13:40,510 --> 00:13:43,180 But now, it's almost in our life. 311 00:13:43,180 --> 00:13:47,470 We're used to this idea that data helps in making decisions. 312 00:13:47,470 --> 00:13:51,680 And people use data to conduct studies. 313 00:13:51,680 --> 00:13:55,344 So here, I found a bunch of press titles that-- 314 00:13:55,344 --> 00:13:57,760 I think the key word I was looking for was "study finds"-- 315 00:13:57,760 --> 00:13:58,635 if I want to do this. 316 00:13:58,635 --> 00:14:01,640 So I actually did not bother doing it again this year. 317 00:14:01,640 --> 00:14:04,540 This is all 2016, 2016, 2016. 318 00:14:04,540 --> 00:14:07,030 But the key word that I look for is usually "study finds"-- 319 00:14:07,030 --> 00:14:08,350 so a new study finds-- 320 00:14:08,350 --> 00:14:10,710 traffic is bad for your health. 321 00:14:10,710 --> 00:14:13,740 So we had to wait for 2016 for data to tell us that. 322 00:14:13,740 --> 00:14:18,490 And there's a bunch of other slightly more interesting ones. 323 00:14:18,490 --> 00:14:20,530 For example, one that you might find interesting 324 00:14:20,530 --> 00:14:24,280 is that this study finds that students benefit from waiting 325 00:14:24,280 --> 00:14:26,000 to declare a major. 326 00:14:26,000 --> 00:14:28,300 Now, there's a bunch of press titles. 327 00:14:28,300 --> 00:14:33,385 There's one in the MIT News that finds brain connections, 328 00:14:33,385 --> 00:14:34,870 key to reading. 329 00:14:34,870 --> 00:14:37,750 And so here, we have an idea of what happened there. 330 00:14:37,750 --> 00:14:39,490 Some data was collected. 331 00:14:39,490 --> 00:14:42,580 Some scientific hypothesis was formulated. 332 00:14:42,580 --> 00:14:47,920 And then the data was here to try to prove or disprove 333 00:14:47,920 --> 00:14:49,360 this scientific hypothesis. 334 00:14:49,360 --> 00:14:51,600 That's the usual scientific process. 
335 00:14:51,600 --> 00:14:55,300 And we need to understand how the scientific process goes, 336 00:14:55,300 --> 00:14:58,030 because some of those things might be actually questionable. 337 00:14:58,030 --> 00:15:02,140 Who is 100% sure that study finds that students-- 338 00:15:02,140 --> 00:15:04,600 do you think that you benefit from waiting 339 00:15:04,600 --> 00:15:05,710 to declare a major? 340 00:15:05,710 --> 00:15:09,370 Right, I would be skeptical about this. 341 00:15:09,370 --> 00:15:13,360 I would be like, I don't want to wait to declare a major. 342 00:15:13,360 --> 00:15:15,130 So what kind of thing can we bring? 343 00:15:15,130 --> 00:15:17,170 Well, maybe this study studied people 344 00:15:17,170 --> 00:15:18,370 that were different from me. 345 00:15:18,370 --> 00:15:21,126 Or maybe the study finds that this 346 00:15:21,126 --> 00:15:22,750 is beneficial for a majority of people. 347 00:15:22,750 --> 00:15:23,890 I'm not a majority. 348 00:15:23,890 --> 00:15:24,724 I'm just one person. 349 00:15:24,724 --> 00:15:26,098 There's a bunch of things that we 350 00:15:26,098 --> 00:15:28,540 need to understand what those things actually mean. 351 00:15:28,540 --> 00:15:30,248 And we'll see that those are actually not 352 00:15:30,248 --> 00:15:31,511 statements about individuals. 353 00:15:31,511 --> 00:15:33,760 They're not even statements about the cohort of people 354 00:15:33,760 --> 00:15:35,050 they've actually looked at. 355 00:15:35,050 --> 00:15:37,270 They're statements about a parameter 356 00:15:37,270 --> 00:15:40,860 of a distribution that was used to model 357 00:15:40,860 --> 00:15:43,720 the benefit of waiting. 358 00:15:43,720 --> 00:15:45,040 So there's a lot of questions. 359 00:15:45,040 --> 00:15:46,840 And there are a lot of layers that come into this. 
360 00:15:46,840 --> 00:15:49,131 And we're going to want to understand what was going on 361 00:15:49,131 --> 00:15:52,992 in there and try to peel it off and understand what assumptions 362 00:15:52,992 --> 00:15:53,950 have been put in there. 363 00:15:53,950 --> 00:15:59,650 Even though it looks like a totally legit study, out 364 00:15:59,650 --> 00:16:01,759 of those studies, statistically, I 365 00:16:01,759 --> 00:16:04,050 think there's going to be one that's going to be wrong. 366 00:16:04,050 --> 00:16:05,140 Well, maybe not one. 367 00:16:05,140 --> 00:16:07,930 But if I put a long list of those, 368 00:16:07,930 --> 00:16:10,150 there would be a few that would actually be wrong. 369 00:16:10,150 --> 00:16:12,840 If I put 20, there would definitely be one that's wrong. 370 00:16:12,840 --> 00:16:13,840 So you have to see that. 371 00:16:13,840 --> 00:16:16,600 Every time you see 20 studies, one is probably wrong. 372 00:16:16,600 --> 00:16:19,000 When there are studies about drug effects, 373 00:16:19,000 --> 00:16:21,260 out of a list of 100, one would be wrong. 374 00:16:21,260 --> 00:16:23,590 So we'll see what that means and what I mean by that. 375 00:16:26,840 --> 00:16:30,980 Of course, not only studies that make discoveries 376 00:16:30,980 --> 00:16:32,930 are actually making the press titles. 377 00:16:32,930 --> 00:16:36,890 There's also the press that talks about things 378 00:16:36,890 --> 00:16:38,480 that make no sense. 379 00:16:38,480 --> 00:16:41,600 I love this first experiment-- the salmon experiment. 
380 00:16:41,600 --> 00:16:44,240 Actually, it was a grad student who 381 00:16:44,240 --> 00:16:47,720 came to a neuroscience poster session, 382 00:16:47,720 --> 00:16:50,780 pulled out this poster, and explained 383 00:16:50,780 --> 00:16:53,210 the scientific experiment that he was conducting, 384 00:16:53,210 --> 00:16:59,510 which consisted in taking a previously frozen and thawed 385 00:16:59,510 --> 00:17:02,060 salmon, putting it in an MRI, showing it 386 00:17:02,060 --> 00:17:07,760 pictures of violent images, and recording its brain activity. 387 00:17:07,760 --> 00:17:11,780 And he was able to discover a few voxels that were activated 388 00:17:11,780 --> 00:17:14,180 by those violent images. 389 00:17:14,180 --> 00:17:16,339 And can somebody tell me what happened here? 390 00:17:19,200 --> 00:17:23,530 Was the salmon responding to the violent activity? 391 00:17:23,530 --> 00:17:26,020 Basically, this is just a statistical fluke. 392 00:17:26,020 --> 00:17:27,730 That's just randomness at play. 393 00:17:27,730 --> 00:17:29,936 There's so many voxels that are recorded, 394 00:17:29,936 --> 00:17:31,310 and there's so many fluctuations. 395 00:17:31,310 --> 00:17:32,630 There's always a little bit of noise 396 00:17:32,630 --> 00:17:34,588 when you're in those things, that some of them, 397 00:17:34,588 --> 00:17:36,472 just by chance, got lit up. 398 00:17:36,472 --> 00:17:38,680 And so we need to understand how to correct for that. 399 00:17:38,680 --> 00:17:40,210 In this particular instance, we need 400 00:17:40,210 --> 00:17:43,600 to have tools that tell us that, well, finding three voxels that 401 00:17:43,600 --> 00:17:47,020 are activated for that many voxels 402 00:17:47,020 --> 00:17:50,200 that you can find in the salmon's brain 403 00:17:50,200 --> 00:17:53,030 is just too small of a number. 404 00:17:53,030 --> 00:17:56,050 Maybe we need to find a clump of 20 of them, for example. 
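[Editorial sketch] The "one in 20 studies" remark and the salmon example rest on the same arithmetic. The sketch below assumes that each test is independent and has a 5% chance of a false positive; both are assumptions of this illustration rather than statements from the lecture, and the function names and test counts are hypothetical. The last function shows the Bonferroni correction, one standard way to correct for many tests and not necessarily the tool this course will use.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of tests that light up by chance alone."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests is a false positive."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test level that keeps the overall false-positive chance below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical test counts; the lecture gives no voxel count for the salmon.
    for n in (1, 20, 100, 10_000):
        print(
            n,
            expected_false_positives(n),
            round(prob_at_least_one_false_positive(n), 3),
            bonferroni_threshold(n),
        )
```

Under these assumptions, 20 tests give one false positive on average, and the chance of at least one is about 0.64. With thousands of voxels, a handful lighting up by chance is close to certain, which is why a threshold that shrinks with the number of tests is needed.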
405 00:17:56,050 --> 00:17:57,670 All right, so we're going to have 406 00:17:57,670 --> 00:18:02,710 mathematical tools that help us find those particular numbers. 407 00:18:02,710 --> 00:18:07,780 I don't know if you ever saw this one by John Oliver 408 00:18:07,780 --> 00:18:11,230 about phacking. 409 00:18:11,230 --> 00:18:14,110 Or actually, it said p-hacking. 410 00:18:14,110 --> 00:18:17,710 Basically, what John Oliver is saying 411 00:18:17,710 --> 00:18:20,410 is actually a full-length-- like there's long segments on this. 412 00:18:20,410 --> 00:18:24,220 And he was explaining how there's a sociology question 413 00:18:24,220 --> 00:18:28,930 here about how there's a huge incentive for scientists 414 00:18:28,930 --> 00:18:30,005 to publish results. 415 00:18:30,005 --> 00:18:31,630 You're not going to say, you know what? 416 00:18:31,630 --> 00:18:33,400 This year, I found nothing. 417 00:18:33,400 --> 00:18:35,277 And so people are trying to find things. 418 00:18:35,277 --> 00:18:36,860 And just by searching, it's as if they 419 00:18:36,860 --> 00:18:39,070 were searching for all the voxels in a brain 420 00:18:39,070 --> 00:18:41,920 until they find one that was just lit up by chance. 421 00:18:41,920 --> 00:18:43,690 And so they just run all these studies. 422 00:18:43,690 --> 00:18:46,960 And at some point, one will be right just out of chance. 423 00:18:46,960 --> 00:18:49,630 And so we have to be very careful about doing this. 424 00:18:49,630 --> 00:18:52,435 There's much more complicated problems associated 425 00:18:52,435 --> 00:18:53,920 to what's called p-hacking, which 426 00:18:53,920 --> 00:18:57,970 consists of violating the basic assumptions, in particular, 427 00:18:57,970 --> 00:19:00,130 looking at the data, and then formulating 428 00:19:00,130 --> 00:19:02,260 your scientific assumption based on data, 429 00:19:02,260 --> 00:19:03,400 and then going back to it. 430 00:19:03,400 --> 00:19:04,358 Your idea doesn't work. 
431 00:19:04,358 --> 00:19:05,810 Let's just formulate another one. 432 00:19:05,810 --> 00:19:07,670 And if you are doing this, all bets are off. 433 00:19:10,127 --> 00:19:11,710 The theory that we're going to develop 434 00:19:11,710 --> 00:19:14,170 is actually for a very clean use of data, which 435 00:19:14,170 --> 00:19:16,450 might be a little unpleasant. 436 00:19:16,450 --> 00:19:19,930 If you've had an army of graduate students collecting 437 00:19:19,930 --> 00:19:21,730 genomic data for a year, for example, 438 00:19:21,730 --> 00:19:23,920 maybe you don't want to say, well, 439 00:19:23,920 --> 00:19:25,810 I had one hypothesis that didn't work. 440 00:19:25,810 --> 00:19:27,940 Let's throw all the data into the trash. 441 00:19:27,940 --> 00:19:30,650 And so we need to find ways to be able to do this. 442 00:19:30,650 --> 00:19:34,750 And there's actually a course being taught at BU. 443 00:19:34,750 --> 00:19:37,180 It's still in its early stages, but it's something 444 00:19:37,180 --> 00:19:39,205 called "adaptive data analysis" that will allow 445 00:19:39,205 --> 00:19:42,615 you to do these kinds of things. 446 00:19:42,615 --> 00:19:43,569 Questions? 447 00:19:46,910 --> 00:19:49,870 OK, so of course, statistics is not 448 00:19:49,870 --> 00:19:52,480 just for you to be able to read the press. 449 00:19:52,480 --> 00:19:56,350 Statistics will probably be used in whatever career 450 00:19:56,350 --> 00:19:58,570 path you choose for yourself. 451 00:19:58,570 --> 00:20:03,070 It started in the 10th century in the Netherlands for hydrology. 452 00:20:03,070 --> 00:20:06,410 The Netherlands is basically under water, under sea level. 453 00:20:06,410 --> 00:20:08,237 And so they wanted to build some dikes. 454 00:20:08,237 --> 00:20:09,820 But once you're going to build a dike, 455 00:20:09,820 --> 00:20:11,890 you want to make sure that it's going to sustain 456 00:20:11,890 --> 00:20:13,875 some tides and some floods.
457 00:20:13,875 --> 00:20:15,250 And so in particular, they wanted 458 00:20:15,250 --> 00:20:19,799 to build dikes that were high enough, but not too high. 459 00:20:19,799 --> 00:20:21,340 You could always say, well, I'm going 460 00:20:21,340 --> 00:20:25,750 to build a 500-meter dike, and then I'm going to be safe. 461 00:20:25,750 --> 00:20:27,580 You want something that's based on data. 462 00:20:27,580 --> 00:20:28,510 You want to make sure. 463 00:20:28,510 --> 00:20:30,400 And so in particular, what did they do? 464 00:20:30,400 --> 00:20:33,700 Well, they collected data for previous floods. 465 00:20:33,700 --> 00:20:36,220 And then they just found a dike that 466 00:20:36,220 --> 00:20:37,720 was going to cover all these things. 467 00:20:37,720 --> 00:20:40,060 Now, if you look at the data they probably had, 468 00:20:40,060 --> 00:20:41,590 maybe it was scarce. 469 00:20:41,590 --> 00:20:43,484 Maybe they had 10 data points. 470 00:20:43,484 --> 00:20:44,900 And so for those data points, then 471 00:20:44,900 --> 00:20:47,622 maybe they wanted to sort of interpolate 472 00:20:47,622 --> 00:20:50,080 between those points, maybe extrapolate for the larger one. 473 00:20:50,080 --> 00:20:51,810 Based on what they've seen, maybe they 474 00:20:51,810 --> 00:20:53,740 have chances of seeing something which 475 00:20:53,740 --> 00:20:56,530 is even larger than everything they've seen before. 476 00:20:56,530 --> 00:21:00,710 And that's exactly the goal of statistical modeling-- 477 00:21:00,710 --> 00:21:04,030 being able to extrapolate beyond the data that you have, 478 00:21:04,030 --> 00:21:08,050 guessing what you have not seen yet might happen. 479 00:21:08,050 --> 00:21:10,390 When you buy insurance for your car, 480 00:21:10,390 --> 00:21:14,050 or your apartment, or your phone, 481 00:21:14,050 --> 00:21:16,240 there is a premium that you have to pay. 
482 00:21:16,240 --> 00:21:17,800 And this premium has been determined 483 00:21:17,800 --> 00:21:21,370 based on how much you are, in expectation, going 484 00:21:21,370 --> 00:21:22,860 to cost the insurance. 485 00:21:22,860 --> 00:21:25,870 It says, OK, this person has, say, a 10% chance 486 00:21:25,870 --> 00:21:28,200 of breaking their iPhone. 487 00:21:28,200 --> 00:21:30,047 An iPhone costs that much to repair, 488 00:21:30,047 --> 00:21:31,630 so I'm going to charge them that much. 489 00:21:31,630 --> 00:21:34,690 And then I'm going to add an extra dollar for my time. 490 00:21:34,690 --> 00:21:36,850 That's basically how those things are determined. 491 00:21:36,850 --> 00:21:39,220 And so this is using statistics. 492 00:21:39,220 --> 00:21:42,910 This is basically where statistics is probably 493 00:21:42,910 --> 00:21:43,630 mostly used. 494 00:21:43,630 --> 00:21:45,610 I was personally trained as an actuary. 495 00:21:45,610 --> 00:21:48,160 And that's me being a statistician at an insurance 496 00:21:48,160 --> 00:21:50,270 company. 497 00:21:50,270 --> 00:21:55,460 Clinical trials-- this is also one of the earliest success 498 00:21:55,460 --> 00:21:58,040 stories of statistics. 499 00:21:58,040 --> 00:21:59,570 It's actually now widespread. 500 00:21:59,570 --> 00:22:04,430 Every time a new drug is approved for market by the FDA, 501 00:22:04,430 --> 00:22:08,570 it requires a very strict regimen of testing with data, 502 00:22:08,570 --> 00:22:10,407 and control group, and treatment group, 503 00:22:10,407 --> 00:22:11,990 and how many people you need in there, 504 00:22:11,990 --> 00:22:17,390 and what kind of significance you need for those things. 505 00:22:17,390 --> 00:22:19,405 In particular, those things look like this, 506 00:22:19,405 --> 00:22:21,507 so now it's 5,000 patients.
507 00:22:21,507 --> 00:22:23,090 It depends on what kind of drug it is, 508 00:22:23,090 --> 00:22:26,630 but for, say, 100 patients, 56 were cured, 509 00:22:26,630 --> 00:22:29,150 and 44 showed no improvement. 510 00:22:29,150 --> 00:22:31,310 Does the FDA consider that this is a good number? 511 00:22:31,310 --> 00:22:37,970 Do they have a table for how many patients were cured? 512 00:22:37,970 --> 00:22:39,380 Is there a placebo effect? 513 00:22:39,380 --> 00:22:41,060 Do I need a control group of people that 514 00:22:41,060 --> 00:22:42,680 are actually getting a placebo? 515 00:22:42,680 --> 00:22:44,130 It's not clear, all these things. 516 00:22:44,130 --> 00:22:46,462 And so there's a lot of things to put into place. 517 00:22:46,462 --> 00:22:48,170 And there's a lot of floating parameters. 518 00:22:48,170 --> 00:22:50,100 So hopefully, we're going to be able to use 519 00:22:50,100 --> 00:22:52,040 statistical modeling to shrink it down 520 00:22:52,040 --> 00:22:55,190 to a small number of parameters to be able to ask 521 00:22:55,190 --> 00:22:56,480 very simple questions. 522 00:22:56,480 --> 00:22:59,630 "Is a drug effective" is not a mathematical question. 523 00:22:59,630 --> 00:23:02,420 But "Is p larger than 0.5?" 524 00:23:02,420 --> 00:23:04,411 is a mathematical question. And that's 525 00:23:04,411 --> 00:23:05,910 essentially what we're going to be doing. 526 00:23:05,910 --> 00:23:08,710 We're going to take this, is a drug effective, and reduce it to, 527 00:23:08,710 --> 00:23:12,460 is a variable larger than 0.5? 528 00:23:12,460 --> 00:23:15,040 Now, of course genetics are using that. 529 00:23:15,040 --> 00:23:19,120 That's typically actually the same size of data 530 00:23:19,120 --> 00:23:21,100 that you would see for fMRI data. 531 00:23:21,100 --> 00:23:25,480 So this is actually a study that I found. 532 00:23:25,480 --> 00:23:29,890 You have about 4,000 cases of Alzheimer's and 8,000 controls.
533 00:23:29,890 --> 00:23:32,700 So people without Alzheimer's-- that's what's called a control. 534 00:23:32,700 --> 00:23:34,480 That's something just to make sure 535 00:23:34,480 --> 00:23:38,770 that you can see the difference with people 536 00:23:38,770 --> 00:23:42,610 that are not affected by either a drug or a disease. 537 00:23:42,610 --> 00:23:46,854 Is the gene APOE associated with Alzheimer's disease? 538 00:23:46,854 --> 00:23:49,270 Everybody can see why this would be an important question. 539 00:23:49,270 --> 00:23:50,228 We now have CRISPR. 540 00:23:50,228 --> 00:23:52,570 It's targeted to very specific genes. 541 00:23:52,570 --> 00:23:55,554 If we could edit it, or knock it down, or knock it 542 00:23:55,554 --> 00:23:57,220 up, or boost it, maybe we could actually 543 00:23:57,220 --> 00:23:58,178 have an impact on that. 544 00:23:58,178 --> 00:24:00,111 So those are very important questions, 545 00:24:00,111 --> 00:24:02,360 because we have the technology to target those things. 546 00:24:02,360 --> 00:24:04,630 But we need the answers about what those things are. 547 00:24:04,630 --> 00:24:07,557 And there's a bunch of other questions. 548 00:24:07,557 --> 00:24:09,890 The minute you're going to talk to biologists and say, 549 00:24:09,890 --> 00:24:10,300 I can do that. 550 00:24:10,300 --> 00:24:11,390 They're going to say, OK, are there 551 00:24:11,390 --> 00:24:12,880 any other genes within the genes, 552 00:24:12,880 --> 00:24:15,046 or any particular SNPs that I can actually look at? 553 00:24:15,046 --> 00:24:17,084 And they're looking at very different questions. 554 00:24:17,084 --> 00:24:19,000 And when you start asking all these questions, 555 00:24:19,000 --> 00:24:22,040 you have to be careful, because you're reusing your data again. 556 00:24:22,040 --> 00:24:26,920 And it might lead you to wrong conclusions. 557 00:24:26,920 --> 00:24:29,530 And those are all over the place, those things.
558 00:24:29,530 --> 00:24:32,170 And that's why they go all the way to John Oliver talking 559 00:24:32,170 --> 00:24:33,567 about them. 560 00:24:33,567 --> 00:24:35,025 Any questions about those examples? 561 00:24:37,609 --> 00:24:38,900 So this is really a motivation. 562 00:24:38,900 --> 00:24:40,358 Again, we're not going to just take 563 00:24:40,358 --> 00:24:46,350 this data set of those cases and look at them in detail. 564 00:24:46,350 --> 00:24:49,370 So what is common to all these examples? 565 00:24:49,370 --> 00:24:50,960 Like, why do we have to use statistics 566 00:24:50,960 --> 00:24:53,120 for all those things? 567 00:24:53,120 --> 00:24:55,650 Well, there's the randomness of the data. 568 00:24:55,650 --> 00:24:59,300 There's some effect that we just don't understand-- 569 00:24:59,300 --> 00:25:02,750 for example, the randomness associated with the lining up 570 00:25:02,750 --> 00:25:04,010 of some voxels. 571 00:25:04,010 --> 00:25:06,650 Or the fact that as far as the insurance 572 00:25:06,650 --> 00:25:09,050 is concerned whether you're going to break your iPhone 573 00:25:09,050 --> 00:25:10,790 or not is essentially a coin toss. 574 00:25:10,790 --> 00:25:11,810 Fully, it's biased. 575 00:25:11,810 --> 00:25:14,780 But it's a coin toss. 576 00:25:14,780 --> 00:25:16,770 From the perspective of the statistician, 577 00:25:16,770 --> 00:25:18,590 those things are actually random events. 578 00:25:18,590 --> 00:25:20,550 And we need to tame this randomness, 579 00:25:20,550 --> 00:25:21,800 to understand this randomness. 580 00:25:21,800 --> 00:25:23,722 Is this going to be a lot of randomness? 581 00:25:23,722 --> 00:25:25,430 Or is it going to be a little randomness? 582 00:25:25,430 --> 00:25:26,846 Is it going to be something that's 583 00:25:26,846 --> 00:25:29,715 like, out of their people-- 584 00:25:32,990 --> 00:25:35,040 let's see, for example, for the floods. 
585 00:25:35,040 --> 00:25:38,000 Were the floods that I saw consistently almost 586 00:25:38,000 --> 00:25:40,790 the same size? 587 00:25:40,790 --> 00:25:43,310 Was it almost a rounding error, or are they just 588 00:25:43,310 --> 00:25:44,366 really widespread? 589 00:25:44,366 --> 00:25:45,990 All these things, we need to understand 590 00:25:45,990 --> 00:25:48,320 so we can understand how to build those dikes 591 00:25:48,320 --> 00:25:54,390 or how to make decisions based on those data. 592 00:25:54,390 --> 00:25:58,130 And we need to understand this randomness. 593 00:25:58,130 --> 00:26:01,550 OK, so the associated questions to randomness 594 00:26:01,550 --> 00:26:03,110 were actually hidden in the text. 595 00:26:03,110 --> 00:26:05,680 So we talked about the notion of average. 596 00:26:05,680 --> 00:26:08,672 Right, so as far as the insurance is concerned, 597 00:26:08,672 --> 00:26:10,880 they want to know on average what the probability is. 598 00:26:10,880 --> 00:26:13,970 Like, what is your chance of actually breaking your iPhone? 599 00:26:13,970 --> 00:26:18,330 And that's what came in this notion of fair premium. 600 00:26:18,330 --> 00:26:21,472 There's this notion of quantifying chance. 601 00:26:21,472 --> 00:26:23,430 We don't want to talk maybe only about average; 602 00:26:23,430 --> 00:26:26,280 maybe you want to cover, say, 99% of the floods. 603 00:26:26,280 --> 00:26:31,920 So we need to know what is the height of a flood that's 604 00:26:31,920 --> 00:26:34,350 higher than 99% of the floods. 605 00:26:34,350 --> 00:26:36,960 But maybe there's 1% of them, you know. 606 00:26:36,960 --> 00:26:38,790 When doomsday comes, doomsday comes. 607 00:26:38,790 --> 00:26:40,470 Right, we're not going to pay for it. 608 00:26:40,470 --> 00:26:43,187 All right, so that's most of the floods. 609 00:26:43,187 --> 00:26:45,270 And then there's questions of significance, right?
610 00:26:45,270 --> 00:26:47,880 So you know I give this example, a second ago 611 00:26:47,880 --> 00:26:50,070 about clinical trials. 612 00:26:50,070 --> 00:26:51,180 I give you some numbers. 613 00:26:51,180 --> 00:26:55,350 Clearly the drug cured more people than it did not. 614 00:26:55,350 --> 00:26:58,020 But does it mean that it's significantly good, 615 00:26:58,020 --> 00:26:59,220 or was this just by chance. 616 00:26:59,220 --> 00:27:01,950 Maybe it's just that these people just recovered. 617 00:27:01,950 --> 00:27:04,710 It's like you know curing a common cold. 618 00:27:04,710 --> 00:27:06,840 And you feel like, oh I got cured. 619 00:27:06,840 --> 00:27:09,540 But it's really you waited five days and then you got cured. 620 00:27:09,540 --> 00:27:11,910 All right, so there's this notion of significance, 621 00:27:11,910 --> 00:27:12,780 of variability. 622 00:27:12,780 --> 00:27:15,150 All these things are actually notions 623 00:27:15,150 --> 00:27:18,270 that describe randomness and quantify randomness 624 00:27:18,270 --> 00:27:19,560 into simple things. 625 00:27:19,560 --> 00:27:21,630 Randomness is a very complicated beast. 626 00:27:21,630 --> 00:27:24,390 But we can summarize it into things that we understand. 627 00:27:24,390 --> 00:27:27,630 Just like I am a complicated object. 628 00:27:27,630 --> 00:27:29,610 I'm made of molecules, and made of genes, 629 00:27:29,610 --> 00:27:31,290 and made of very complicated things. 630 00:27:31,290 --> 00:27:34,890 But I can be summarized as my name, my email address, 631 00:27:34,890 --> 00:27:37,270 my height and my weight, and maybe for most of you, 632 00:27:37,270 --> 00:27:39,100 this is basically enough. 633 00:27:39,100 --> 00:27:41,220 You will recognize me without having 634 00:27:41,220 --> 00:27:45,240 to do a biopsy on me every time you see me. 
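[Editor's sketch, not part of the lecture] One way to make the significance question concrete for the clinical-trial numbers quoted earlier (56 cured out of 100). The sketch assumes each patient is cured independently with the same probability p, and asks how likely it is to see at least 56 cures if p were exactly 0.5. The function name binomial_tail is an illustrative choice, and this is only one possible formalization, not the method the course or the FDA uses.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n_patients, n_cured = 100, 56
tail = binomial_tail(n_patients, n_cured, 0.5)
print(f"P(at least {n_cured} cured out of {n_patients} if p = 0.5) = {tail:.3f}")
```

Under these assumptions the tail probability comes out above 0.1, so 56 out of 100 is not, by itself, strong evidence that p is larger than 0.5.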
635 00:27:45,240 --> 00:27:49,020 All right, so, to understand randomness 636 00:27:49,020 --> 00:27:51,450 you have to go through probability. 637 00:27:51,450 --> 00:27:53,370 Probability is the study of randomness. 638 00:27:53,370 --> 00:27:54,130 That's what it is. 639 00:27:54,130 --> 00:27:57,750 That's what the first sentence that a lecturer in probability 640 00:27:57,750 --> 00:27:58,890 will say. 641 00:27:58,890 --> 00:28:02,040 And so that's why I need the pre-requisite, because this 642 00:28:02,040 --> 00:28:04,830 is what we're going to use to describe the randomness. 643 00:28:04,830 --> 00:28:07,830 We'll see in a second how it interacts with statistics. 644 00:28:07,830 --> 00:28:10,590 So sometimes, and actually probably most of the time 645 00:28:10,590 --> 00:28:13,460 throughout your semester on probability, 646 00:28:13,460 --> 00:28:15,810 randomness was very well understood. 647 00:28:15,810 --> 00:28:18,420 When you saw a probability problem, here 648 00:28:18,420 --> 00:28:19,860 was the chance of this happening, 649 00:28:19,860 --> 00:28:21,720 here was the chance of that happening. 650 00:28:21,720 --> 00:28:23,820 Maybe you had more complicated questions 651 00:28:23,820 --> 00:28:26,720 that you had some basic elements to answer. 652 00:28:26,720 --> 00:28:32,127 For example, the probability that I have HBO is this much. 653 00:28:32,127 --> 00:28:34,710 And the probability that I watch Game of Thrones is that much. 654 00:28:34,710 --> 00:28:38,160 And given that I play basketball what is the probability-- 655 00:28:38,160 --> 00:28:39,870 you had all these crazy questions, 656 00:28:39,870 --> 00:28:42,360 but you were able to build them. 657 00:28:42,360 --> 00:28:45,340 But all the basic numbers were given to you. 658 00:28:45,340 --> 00:28:48,240 Statistics will be about finding those basic numbers. 
659 00:28:48,240 --> 00:28:51,940 All right so some examples that you've probably seen 660 00:28:51,940 --> 00:28:55,950 were dice, cards, roulette, flipping coins. 661 00:28:55,950 --> 00:28:57,700 All of these things are things that you've 662 00:28:57,700 --> 00:28:59,060 seen in a probability class. 663 00:28:59,060 --> 00:29:00,726 And the reason is because it's very easy 664 00:29:00,726 --> 00:29:02,680 to describe the probability of each outcome. 665 00:29:02,680 --> 00:29:05,620 For a die we know that each face is going 666 00:29:05,620 --> 00:29:07,150 to come with probability 1/6. 667 00:29:07,150 --> 00:29:09,910 Now I'm not going to go into a debate of whether this 668 00:29:09,910 --> 00:29:12,160 is pure randomness or this is determinism. 669 00:29:12,160 --> 00:29:14,940 I think as a model for actual randomness 670 00:29:14,940 --> 00:29:18,220 a die is a pretty good model, flipping a coin 671 00:29:18,220 --> 00:29:20,170 is a pretty good model. 672 00:29:20,170 --> 00:29:22,600 So those are actually a good thing. 673 00:29:22,600 --> 00:29:24,650 So the questions that you would see, for example, 674 00:29:24,650 --> 00:29:26,320 in probabilities are the following. 675 00:29:26,320 --> 00:29:27,640 I roll one die. 676 00:29:27,640 --> 00:29:31,300 Alice gets $1 if the number of dots is at most three. 677 00:29:31,300 --> 00:29:35,230 Bob gets $2 if the number of dots is at most two. 678 00:29:35,230 --> 00:29:37,510 Do you want to be Alice or Bob given that your goal is 679 00:29:37,510 --> 00:29:40,965 actually to make money? 680 00:29:40,965 --> 00:29:43,910 Yeah, you want to be Bob, right? 681 00:29:43,910 --> 00:29:45,410 So let's see why. 682 00:29:45,410 --> 00:29:47,300 So if you look at the expectation of what 683 00:29:47,300 --> 00:29:48,500 Alice makes. 684 00:29:48,500 --> 00:29:51,140 So let's call it a. 685 00:29:51,140 --> 00:29:56,080 This is $1, with probability 1/2. 686 00:29:56,080 --> 00:29:59,630 So 3/6, that's 1/2.
687 00:29:59,630 --> 00:30:02,820 And the expectation of what Bob makes, 688 00:30:02,820 --> 00:30:11,370 this is $2 with probability 2/6, and that's 2/3. 689 00:30:11,370 --> 00:30:13,230 Which is definitely larger than 1/2. 690 00:30:13,230 --> 00:30:17,109 So Bob's expectation is actually a bit higher. 691 00:30:17,109 --> 00:30:18,900 So those are the kind of questions that you 692 00:30:18,900 --> 00:30:19,941 may ask with probability. 693 00:30:19,941 --> 00:30:21,930 I described to you exactly, you use the fact 694 00:30:21,930 --> 00:30:25,337 that the die would get at most three dots, 695 00:30:25,337 --> 00:30:26,420 with probability one half. 696 00:30:26,420 --> 00:30:27,090 We knew that. 697 00:30:27,090 --> 00:30:29,880 And I didn't have to describe to you what was going on there. 698 00:30:29,880 --> 00:30:32,460 You didn't have to collect data about a die. 699 00:30:32,460 --> 00:30:34,620 Same thing, you roll two dice. 700 00:30:34,620 --> 00:30:36,180 You choose a number between 2 and 12 701 00:30:36,180 --> 00:30:42,100 and you win $100 if you choose the sum of the two dice. 702 00:30:42,100 --> 00:30:45,052 Which number do you pick? 703 00:30:45,052 --> 00:30:46,812 What? 704 00:30:46,812 --> 00:30:47,312 AUDIENCE: 7. 705 00:30:47,312 --> 00:30:48,186 PHILIPPE RIGOLLET: 7. 706 00:30:48,186 --> 00:30:48,800 Why 7? 707 00:30:48,800 --> 00:30:50,430 AUDIENCE: It's the most likely. 708 00:30:50,430 --> 00:30:52,638 PHILIPPE RIGOLLET: That's the most likely one, right? 709 00:30:52,638 --> 00:30:56,800 So your gain here will be $100 times the probability 710 00:30:56,800 --> 00:31:00,220 that the sum of the two dice, let's say x plus y, 711 00:31:00,220 --> 00:31:03,430 is equal to your little z where a little z is 712 00:31:03,430 --> 00:31:04,478 the number you pick. 713 00:31:07,830 --> 00:31:10,350 So 7 is the most likely to happen 714 00:31:10,350 --> 00:31:14,580 and that's the one that maximizes this function of z.
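[Editor's sketch, not part of the lecture] The two dice computations above can be checked by enumerating outcomes with exact fractions. The payout thresholds (Alice paid when the die shows at most 3, Bob when it shows at most 2) are an assumption chosen to match the probabilities 3/6 and 2/6 used in the calculation; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p_face = Fraction(1, 6)  # fair die: each face has probability 1/6

# Expected payouts: Alice gets $1 on {1, 2, 3}, Bob gets $2 on {1, 2}.
alice = sum(p_face * 1 for d in faces if d <= 3)
bob = sum(p_face * 2 for d in faces if d <= 2)
print("E[Alice] =", alice, " E[Bob] =", bob)

# Distribution of the sum x + y of two independent fair dice.
sum_dist = {}
for x, y in product(faces, repeat=2):
    sum_dist[x + y] = sum_dist.get(x + y, 0) + Fraction(1, 36)

# The pick z that maximizes 100 * P(x + y = z).
best_z = max(sum_dist, key=sum_dist.get)
print("most likely sum:", best_z, "with probability", sum_dist[best_z])
print("expected gain for that pick:", 100 * sum_dist[best_z])
```

The enumeration gives 1/2 for Alice and 2/3 for Bob, and 7 as the most likely sum with probability 1/6, in agreement with the answers given in the lecture.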
715 00:31:14,580 --> 00:31:17,112 And for this you need to study a more complicated function. 716 00:31:17,112 --> 00:31:18,820 But it's a function that involves two dice. 717 00:31:18,820 --> 00:31:21,720 But you can compute the probability that x plus y 718 00:31:21,720 --> 00:31:26,490 is equal to z, for every z between 2 and 12. 719 00:31:26,490 --> 00:31:29,010 So you know exactly what the probabilities are 720 00:31:29,010 --> 00:31:30,930 and that's how you start probability. 721 00:31:35,450 --> 00:31:38,660 So here that's exactly what I said. 722 00:31:38,660 --> 00:31:43,550 You have a very simple process that describes basic events. 723 00:31:43,550 --> 00:31:45,074 Probability 1/6 for each of them. 724 00:31:45,074 --> 00:31:46,490 And then you can build up on that, 725 00:31:46,490 --> 00:31:49,120 and understand probabilities of more complicated events. 726 00:31:49,120 --> 00:31:50,750 You can throw some money in there. 727 00:31:50,750 --> 00:31:52,780 You can build functions. 728 00:31:52,780 --> 00:31:56,960 You can do very complicated things building on that. 729 00:31:56,960 --> 00:31:59,780 Now if I was a statistician, a statistician 730 00:31:59,780 --> 00:32:01,640 would be the guy who just arrived on earth, 731 00:32:01,640 --> 00:32:03,473 had never seen a die and needs to understand 732 00:32:03,473 --> 00:32:05,840 that a die comes up with probability 1/6 on each side. 733 00:32:05,840 --> 00:32:08,030 And the way he would do it is just to roll the die 734 00:32:08,030 --> 00:32:12,440 until he gets some counts and tries to estimate those. 735 00:32:12,440 --> 00:32:14,060 And maybe that guy would come and say, 736 00:32:14,060 --> 00:32:16,440 well, you know, actually, the probability 737 00:32:16,440 --> 00:32:23,500 that I get a 1 is 1/6 plus 0.001 and the probability 738 00:32:23,500 --> 00:32:27,650 that I get a 2 is 1/6 minus 0.005. 739 00:32:27,650 --> 00:32:29,910 And there would be some fluctuations around this.
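[Editor's sketch, not part of the lecture] A small simulation of the statistician who has never seen a die: roll it many times and estimate each face's probability from the counts. The number of rolls (10,000), the seed, and the function name are arbitrary illustrative choices, and the die is simulated as fair.

```python
import random
from collections import Counter

def estimate_face_probabilities(n_rolls, seed=0):
    """Estimate P(face) for each face from the frequencies in n_rolls simulated rolls."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

estimates = estimate_face_probabilities(10_000)
for face, p_hat in estimates.items():
    print(f"face {face}: estimate {p_hat:.4f}, deviation from 1/6: {p_hat - 1/6:+.4f}")
```

Each estimate lands near 1/6 but not exactly on it, which is the kind of small fluctuation described above.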
740 00:32:29,910 --> 00:32:31,868 And it's going to be his role as a statistician 741 00:32:31,868 --> 00:32:33,500 to say, listen, this is too complicated 742 00:32:33,500 --> 00:32:34,880 of a model for this thing. 743 00:32:34,880 --> 00:32:36,694 And these should all be the same numbers. 744 00:32:36,694 --> 00:32:39,110 Just looking at data, they should be all the same numbers. 745 00:32:39,110 --> 00:32:40,150 And that's part of the modeling. 746 00:32:40,150 --> 00:32:41,930 You make some simplifying assumptions 747 00:32:41,930 --> 00:32:46,590 that essentially make your questions more accurate. 748 00:32:46,590 --> 00:32:48,230 Now, of course, if your model is wrong, 749 00:32:48,230 --> 00:32:50,120 if it's not true that all the faces arrive 750 00:32:50,120 --> 00:32:54,410 with the same probability, then you have a model error here. 751 00:32:54,410 --> 00:32:56,146 So we will be making model errors. 752 00:32:56,146 --> 00:32:57,770 But that's going to be the price to pay 753 00:32:57,770 --> 00:32:59,645 to be able to extract anything from our data. 754 00:33:02,860 --> 00:33:07,299 So for more complicated processes, 755 00:33:07,299 --> 00:33:09,840 so of course nobody's going to waste their time rolling dice. 756 00:33:09,840 --> 00:33:11,370 I mean, I'm sure you might have done 757 00:33:11,370 --> 00:33:13,860 this in AP stat or something. 758 00:33:13,860 --> 00:33:18,070 But the need is to estimate parameters from data. 759 00:33:18,070 --> 00:33:19,830 All right, so for more complicated things 760 00:33:19,830 --> 00:33:27,350 you might want to estimate some density parameter 761 00:33:27,350 --> 00:33:29,400 on a particular set of material. 762 00:33:29,400 --> 00:33:31,620 And for this maybe you need to beam something to it, 763 00:33:31,620 --> 00:33:33,449 and measure how fast it's coming back. 764 00:33:33,449 --> 00:33:35,490 And you're going to have some measurement errors. 
765 00:33:35,490 --> 00:33:37,281 And maybe you need to do that several times 766 00:33:37,281 --> 00:33:39,750 and you have a model for the physical process that's 767 00:33:39,750 --> 00:33:40,960 actually going on. 768 00:33:40,960 --> 00:33:42,750 And physics is usually a very good way 769 00:33:42,750 --> 00:33:46,560 to get models from an engineering perspective. 770 00:33:46,560 --> 00:33:49,320 But there's models for sociology where we 771 00:33:49,320 --> 00:33:52,260 have no physical system, right. 772 00:33:52,260 --> 00:33:53,670 God knows how people interact. 773 00:33:53,670 --> 00:33:55,650 And maybe I'm going to say that the way 774 00:33:55,650 --> 00:33:59,870 I make friends is by first flipping a coin in my pocket. 775 00:33:59,870 --> 00:34:01,560 And with probability 2/3, I'm going 776 00:34:01,560 --> 00:34:02,730 to make my friend at work. 777 00:34:02,730 --> 00:34:04,410 And with probability 1/3 I'm going 778 00:34:04,410 --> 00:34:05,850 to make my friend at soccer. 779 00:34:05,850 --> 00:34:07,850 And once I make my friends at soccer-- 780 00:34:07,850 --> 00:34:09,750 I decide to make my friend at soccer. 781 00:34:09,750 --> 00:34:11,400 Then I will face someone who's flipping 782 00:34:11,400 --> 00:34:14,780 the same coin with maybe slightly different parameters. 783 00:34:14,780 --> 00:34:16,230 But those things actually exist. 784 00:34:16,230 --> 00:34:18,750 There's models about how friendships are formed. 785 00:34:18,750 --> 00:34:22,020 And the one I described is called 786 00:34:22,020 --> 00:34:23,177 the mixed-membership model. 787 00:34:23,177 --> 00:34:25,260 So those are models that are sort of hypothesized. 788 00:34:25,260 --> 00:34:29,250 And they're more reasonable than taking into account 789 00:34:29,250 --> 00:34:31,650 all the things that made you meet that person 790 00:34:31,650 --> 00:34:34,409 at that particular time.
791 00:34:34,409 --> 00:34:38,580 So the goal here-- so based on data now, 792 00:34:38,580 --> 00:34:41,370 once we have the model is going to be reduced to maybe two, 793 00:34:41,370 --> 00:34:42,989 three, four parameters, depending 794 00:34:42,989 --> 00:34:44,500 on how complex the model is. 795 00:34:44,500 --> 00:34:48,060 And then your goal will be to estimate those parameters. 796 00:34:48,060 --> 00:34:51,610 So sometimes the randomness we have here is real. 797 00:34:51,610 --> 00:34:56,940 So there's some true randomness in some surveys. 798 00:34:56,940 --> 00:34:58,740 If I pick a random student, as long 799 00:34:58,740 --> 00:35:00,870 as I believe that my random number generator that 800 00:35:00,870 --> 00:35:04,200 will pick your random ID is actually random, 801 00:35:04,200 --> 00:35:06,180 there is something random about you. 802 00:35:06,180 --> 00:35:07,950 The student that I pick at random 803 00:35:07,950 --> 00:35:09,390 will be a random student. 804 00:35:09,390 --> 00:35:13,540 The person that I call on the phone is a random person. 805 00:35:13,540 --> 00:35:16,380 So there's some randomness that I can build into my system 806 00:35:16,380 --> 00:35:20,310 by drawing something from a random number generator. 807 00:35:20,310 --> 00:35:22,512 A biased coin is a random thing. 808 00:35:22,512 --> 00:35:24,220 It's not a very interesting random thing. 809 00:35:24,220 --> 00:35:26,100 But it is a random thing. 810 00:35:26,100 --> 00:35:29,040 Again, if I wash out the fact that it actually 811 00:35:29,040 --> 00:35:30,510 is a deterministic mechanism. 812 00:35:30,510 --> 00:35:33,230 But at a certain accuracy, a certain granularity, 813 00:35:33,230 --> 00:35:36,090 this can be thought of as a truly random experiment. 814 00:35:36,090 --> 00:35:39,180 Measurement error for example, if you buy some measurement 815 00:35:39,180 --> 00:35:39,850 device, 816 00:35:39,850 --> 00:35:42,760 or some optics device, for example.
817 00:35:42,760 --> 00:35:45,020 You will have like standard deviation and things that 818 00:35:45,020 --> 00:35:46,080 come on the side of the box. 819 00:35:46,080 --> 00:35:48,621 And it tells you, this will be making some measurement error. 820 00:35:48,621 --> 00:35:51,480 And it's usually thermal noise maybe, or things like this. 821 00:35:51,480 --> 00:35:54,570 And those are very accurately described 822 00:35:54,570 --> 00:35:56,430 by some random phenomenon. 823 00:35:56,430 --> 00:36:01,260 But sometimes, and I'd say most times, there's no randomness. 824 00:36:01,260 --> 00:36:02,520 There's no randomness. 825 00:36:02,520 --> 00:36:06,810 It's not like you breaking your iPhone is a random event. 826 00:36:06,810 --> 00:36:09,390 This is just something that we sweep-- 827 00:36:09,390 --> 00:36:11,940 randomness is a big rug under which we sweep 828 00:36:11,940 --> 00:36:13,650 everything we don't understand. 829 00:36:13,650 --> 00:36:15,900 And we just hope that on average we've 830 00:36:15,900 --> 00:36:18,911 captured the average effect of what's going on. 831 00:36:18,911 --> 00:36:20,910 And the rest of it might fluctuate to the right, 832 00:36:20,910 --> 00:36:22,470 might fluctuate to the left. 833 00:36:22,470 --> 00:36:26,010 But what remains is just sort of randomness 834 00:36:26,010 --> 00:36:27,660 that can be averaged out. 835 00:36:27,660 --> 00:36:31,170 So, of course, this is where the leap of faith is. 836 00:36:31,170 --> 00:36:33,480 We do not know whether we were correct in doing this. 837 00:36:33,480 --> 00:36:35,957 Maybe we make some huge systematic biases 838 00:36:35,957 --> 00:36:36,540 by doing this. 839 00:36:36,540 --> 00:36:39,330 Maybe we forget a very important component. 840 00:36:39,330 --> 00:36:42,450 Right, for example, if I have-- 841 00:36:42,450 --> 00:36:45,816 I don't know, let's think of something-- 842 00:36:49,220 --> 00:36:51,080 a drug for breast cancer.
843 00:36:51,080 --> 00:36:52,880 All right, and I throw out the fact 844 00:36:52,880 --> 00:36:55,640 that my patient is either a man or a woman. 845 00:36:55,640 --> 00:36:57,530 I'm going to have some serious model biases. 846 00:36:57,530 --> 00:36:58,030 Right. 847 00:36:58,030 --> 00:37:00,645 So if I say I'm going to collect a random patient. 848 00:37:00,645 --> 00:37:02,270 And say I'm going to start doing this. 849 00:37:02,270 --> 00:37:04,790 There's some information that I really need, clearly, 850 00:37:04,790 --> 00:37:06,880 to build into my model. 851 00:37:06,880 --> 00:37:10,730 And so the model should be complicated enough, but not too 852 00:37:10,730 --> 00:37:11,420 complicated. 853 00:37:11,420 --> 00:37:13,220 Right so it should take into account things 854 00:37:13,220 --> 00:37:17,375 that will systematically be important. 855 00:37:17,375 --> 00:37:19,700 So, in particular, the simple rule of thumb 856 00:37:19,700 --> 00:37:24,620 is, when you have a complicated process, 857 00:37:24,620 --> 00:37:26,780 you can think of it as being a simple process 858 00:37:26,780 --> 00:37:28,205 and some random noise. 859 00:37:28,205 --> 00:37:30,320 Now, again, the random noise is everything 860 00:37:30,320 --> 00:37:33,320 you don't understand about the complicated process. 861 00:37:33,320 --> 00:37:37,810 And the simple process is everything you actually do. 862 00:37:37,810 --> 00:37:40,600 So good modeling, and this is not 863 00:37:40,600 --> 00:37:43,920 what we'll be seeing in this class, 864 00:37:43,920 --> 00:37:46,890 consists in choosing plausible simple models. 865 00:37:46,890 --> 00:37:50,241 And this requires a tremendous amount of domain knowledge. 866 00:37:50,241 --> 00:37:52,240 And that's why we're not doing it in this class. 867 00:37:52,240 --> 00:37:54,240 This is not something where I can make a blanket statement 868 00:37:54,240 --> 00:37:55,620 about making good modeling.
869 00:37:55,620 --> 00:37:58,020 You need to know, if I were a statistician working 870 00:37:58,020 --> 00:38:00,990 on a study, I would have to grill the person in front 871 00:38:00,990 --> 00:38:04,840 of me, the expert, for two hours to know, but how about this? 872 00:38:04,840 --> 00:38:05,970 How about that? 873 00:38:05,970 --> 00:38:06,990 How does this work? 874 00:38:06,990 --> 00:38:08,940 So it requires to understand a lot of things. 875 00:38:08,940 --> 00:38:14,070 There's this famous statistician to whom this sentence is 876 00:38:14,070 --> 00:38:16,640 attributed, and it's probably not his then, 877 00:38:16,640 --> 00:38:21,191 but Tukey said that he loves being a statistician, 878 00:38:21,191 --> 00:38:23,190 because you get to play in everybody's backyard. 879 00:38:23,190 --> 00:38:25,440 Right, so you get to go and see people. 880 00:38:25,440 --> 00:38:28,260 And you get to understand, at least to a certain extent, what 881 00:38:28,260 --> 00:38:29,430 their problems are. 882 00:38:29,430 --> 00:38:31,110 Enough that you can actually build 883 00:38:31,110 --> 00:38:33,400 a reasonable model for what they're actually doing. 884 00:38:33,400 --> 00:38:34,740 So you get to do some sociology. 885 00:38:34,740 --> 00:38:35,865 You get to do some biology. 886 00:38:35,865 --> 00:38:37,167 You get to do some engineering. 887 00:38:37,167 --> 00:38:39,000 And you get to do a lot of different things. 888 00:38:39,000 --> 00:38:40,625 Right, so he was actually at some point 889 00:38:40,625 --> 00:38:46,180 predicting the presidential election. 890 00:38:46,180 --> 00:38:48,720 So, you see, you get to do a lot of different things. 891 00:38:48,720 --> 00:38:50,850 But it requires a lot of time to understand 892 00:38:50,850 --> 00:38:52,230 what problem you're working on. 893 00:38:52,230 --> 00:38:54,480 And if you have a particular application in mind 894 00:38:54,480 --> 00:38:56,644 you're the best person to actually understand this. 
895 00:38:56,644 --> 00:38:58,560 So I'm just going to give you the basic tools. 896 00:39:07,660 --> 00:39:11,705 So this is the circle of trust. 897 00:39:11,705 --> 00:39:14,010 No, this is really just a simple graphic 898 00:39:14,010 --> 00:39:15,840 that tells you what's going on. 899 00:39:15,840 --> 00:39:19,170 When you do probability, you're given the truth. 900 00:39:19,170 --> 00:39:24,330 Somebody tells you what die God is rolling. 901 00:39:24,330 --> 00:39:27,490 So you know exactly what the parameters of the problems are. 902 00:39:27,490 --> 00:39:29,490 And what you're trying to do is to describe what 903 00:39:29,490 --> 00:39:31,410 the outcomes are going to be. 904 00:39:31,410 --> 00:39:34,260 You can say, if you're rolling a fair die, 905 00:39:34,260 --> 00:39:36,480 you're going to have 1/6 of the time in your data 906 00:39:36,480 --> 00:39:37,699 you're going to have one. 907 00:39:37,699 --> 00:39:39,740 1/6 of the time you're going to have to have two. 908 00:39:39,740 --> 00:39:42,156 And so you can describe-- if I told you what the truth is, 909 00:39:42,156 --> 00:39:44,340 you could actually go into a computer, 910 00:39:44,340 --> 00:39:46,080 either generate some data. 911 00:39:46,080 --> 00:39:51,030 Or you could describe to me some more macro properties 912 00:39:51,030 --> 00:39:52,350 of what the data would be like. 913 00:39:52,350 --> 00:39:54,600 Oh, I would see a bunch of numbers 914 00:39:54,600 --> 00:39:57,150 that would be centered around 35, if I 915 00:39:57,150 --> 00:40:00,182 drew from a Gaussian distribution centered at 35. 916 00:40:00,182 --> 00:40:01,890 Right, you would know this kind of thing. 917 00:40:01,890 --> 00:40:07,400 I would know that it's very unlikely that if my Gaussian 918 00:40:07,400 --> 00:40:08,940 has standard deviation-- 919 00:40:08,940 --> 00:40:13,830 is centered on 0, say, with standard deviation 3. 
920 00:40:13,830 --> 00:40:17,380 It's very unlikely that I will see numbers below minus 10 921 00:40:17,380 --> 00:40:18,750 or above 10, right? 922 00:40:18,750 --> 00:40:21,060 You know this, that you basically will not see them. 923 00:40:21,060 --> 00:40:25,310 So you know from the truth, from the distribution 924 00:40:25,310 --> 00:40:27,810 of a random variable that does not have mu or sigma, but really 925 00:40:27,810 --> 00:40:28,860 numbers there. 926 00:40:28,860 --> 00:40:31,980 You know what data you're going to be having. 927 00:40:31,980 --> 00:40:33,895 Statistics is about going backwards. 928 00:40:33,895 --> 00:40:37,590 It's saying, if I have some data, what was 929 00:40:37,590 --> 00:40:39,420 the truth that generated it. 930 00:40:39,420 --> 00:40:41,790 And since there are so many possible truths, 931 00:40:41,790 --> 00:40:44,040 modeling says you have to pick one 932 00:40:44,040 --> 00:40:47,190 of the simpler possible truths, so that you can average out. 933 00:40:47,190 --> 00:40:49,720 Statistics basically means averaging. 934 00:40:49,720 --> 00:40:51,540 You're averaging when you do statistics. 935 00:40:51,540 --> 00:40:54,150 And averaging means that if I say 936 00:40:54,150 --> 00:40:56,790 that I received-- so if I collect 937 00:40:56,790 --> 00:40:58,230 all your GPAs, for example. 938 00:40:58,230 --> 00:41:01,860 And my model is that the possible GPAs 939 00:41:01,860 --> 00:41:03,450 are any possible numbers. 940 00:41:03,450 --> 00:41:05,297 And anybody can have any possible GPA. 941 00:41:05,297 --> 00:41:06,880 This is going to be a serious problem. 942 00:41:06,880 --> 00:41:09,420 But if I can summarize those GPAs into two numbers, 943 00:41:09,420 --> 00:41:11,100 say, mean and standard deviation, 944 00:41:11,100 --> 00:41:13,020 then I have a pretty good description of what 945 00:41:13,020 --> 00:41:15,360 is going on, rather than having 946 00:41:15,360 --> 00:41:16,730 to predict the full list.
947 00:41:16,730 --> 00:41:18,880 Right, if I learn a full list of GPAs and I say, 948 00:41:18,880 --> 00:41:20,282 well this was the distribution. 949 00:41:20,282 --> 00:41:22,740 Then it's not going to be of any use for me to predict what 950 00:41:22,740 --> 00:41:25,232 the GPA would be of some random student walking in, 951 00:41:25,232 --> 00:41:26,190 or something like this. 952 00:41:30,210 --> 00:41:34,680 So just to finish my rant about probability versus statistics, 953 00:41:34,680 --> 00:41:37,230 this is a question you would see in a probability-- this 954 00:41:37,230 --> 00:41:40,080 is a probabilistic question, and this is a statistical question. 955 00:41:40,080 --> 00:41:42,690 The probabilistic question is, previous studies 956 00:41:42,690 --> 00:41:45,150 showed that the drug was 80% effective. 957 00:41:45,150 --> 00:41:46,230 So you know that. 958 00:41:46,230 --> 00:41:48,340 This is the effectiveness of the drug. 959 00:41:48,340 --> 00:41:49,480 It's given to you. 960 00:41:49,480 --> 00:41:51,480 This is how your problem starts. 961 00:41:51,480 --> 00:41:54,930 Then we can anticipate that, for a study on 100 patients, 962 00:41:54,930 --> 00:41:57,030 on average, 80 will be cured. 963 00:41:57,030 --> 00:42:00,660 And at least 65 will be cured with 99% chances. 964 00:42:00,660 --> 00:42:03,140 So again these are not-- 965 00:42:03,140 --> 00:42:05,994 I'm not predicting on 100 patients exactly the number 966 00:42:05,994 --> 00:42:07,410 of them that are going to be cured. 967 00:42:07,410 --> 00:42:08,970 And the number of them that are not. 968 00:42:08,970 --> 00:42:11,160 But I'm actually sort of predicting 969 00:42:11,160 --> 00:42:13,140 what things are going to look like on average, 970 00:42:13,140 --> 00:42:17,610 or some macro properties of what my data sets will look like.
971 00:42:17,610 --> 00:42:19,590 So with 99 percent chances, that means 972 00:42:19,590 --> 00:42:23,550 that in 99.99% of the data sets you will 973 00:42:23,550 --> 00:42:25,770 draw from this particular draw. 974 00:42:25,770 --> 00:42:30,870 99.99% of the cohort of 100 patients to whom you administer 975 00:42:30,870 --> 00:42:34,650 this drug, I will be able to conclude that at least 65 976 00:42:34,650 --> 00:42:41,010 of them will be cured, on 99.99% of those data sets. 977 00:42:41,010 --> 00:42:42,630 So that's a pretty accurate prediction 978 00:42:42,630 --> 00:42:45,060 of what's going to happen. 979 00:42:45,060 --> 00:42:46,260 Statistics is the opposite. 980 00:42:46,260 --> 00:42:49,620 It says, well, I just know that 78 out of 100 were cured. 981 00:42:49,620 --> 00:42:50,890 I have only one data set. 982 00:42:50,890 --> 00:42:53,340 I cannot make predictions for all data sets. 983 00:42:53,340 --> 00:42:57,199 But I can go back to the probability, 984 00:42:57,199 --> 00:42:59,490 make some inference about what my probability will look 985 00:42:59,490 --> 00:43:03,360 like, and then say, OK, then I can make those predictions 986 00:43:03,360 --> 00:43:04,150 later on. 987 00:43:04,150 --> 00:43:08,580 So when I start with 78/100 then maybe 988 00:43:08,580 --> 00:43:11,100 I'm actually, in this case, I just don't know. 989 00:43:11,100 --> 00:43:16,410 My best guess here is that I'm confident I 990 00:43:16,410 --> 00:43:19,800 have to add the extra error that I am making by predicting 991 00:43:19,800 --> 00:43:25,650 that here, the drug is not 80% effective but 78% effective. 992 00:43:25,650 --> 00:43:27,570 And I need some error bars around this, 993 00:43:27,570 --> 00:43:30,960 that will hopefully contain 80%, and then based on those error 994 00:43:30,960 --> 00:43:34,460 bars I'm going to make slightly less precise predictions 995 00:43:34,460 --> 00:43:35,456 for the future.
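The probabilistic question above can be checked numerically. The sketch below is only an illustration under an assumption the lecture does not state explicitly at this point: that each of the 100 patients is cured independently with probability 0.8, so the number cured follows a Binomial(100, 0.8) distribution. The helper name binomial_tail_at_least is hypothetical.

```python
from math import comb


def binomial_tail_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


n, p = 100, 0.8  # cohort size and known effectiveness, as in the lecture

# Expected number of cured patients under the assumed binomial model.
expected_cured = n * p

# Probability that at least 65 of the 100 patients are cured.
prob_at_least_65 = binomial_tail_at_least(65, n, p)

print(f"expected cured: {expected_cured:.0f}")
print(f"P(at least 65 cured): {prob_at_least_65:.6f}")
```

Going from the known 0.8 to statements about the data is the probability direction. The statistical direction starts from an observed count such as 78 out of 100 and has to estimate the 0.8.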
996 00:43:39,440 --> 00:43:44,510 So, to conclude, so this was, why statistics? 997 00:43:44,510 --> 00:43:46,530 So what is this course about? 998 00:43:46,530 --> 00:43:48,230 It's about understanding the mathematics 999 00:43:48,230 --> 00:43:50,230 behind statistical methods. 1000 00:43:50,230 --> 00:43:51,590 It's more of a tool. 1001 00:43:51,590 --> 00:43:54,680 We're not going to have fun and talk about algebraic geometry 1002 00:43:54,680 --> 00:43:57,440 just for fun in the middle of it. 1003 00:43:57,440 --> 00:44:01,610 So it justifies quantitative statements given some modeling 1004 00:44:01,610 --> 00:44:03,500 assumptions, that we will, in this class, 1005 00:44:03,500 --> 00:44:06,730 mostly admit that the modeling assumptions are correct. 1006 00:44:06,730 --> 00:44:08,480 In the first part-- in this introduction, 1007 00:44:08,480 --> 00:44:10,442 we will go through them because it's 1008 00:44:10,442 --> 00:44:12,650 very easy to forget what assumptions we are actually 1009 00:44:12,650 --> 00:44:13,310 making. 1010 00:44:13,310 --> 00:44:15,650 But this will be a pretty standard thing. 1011 00:44:15,650 --> 00:44:18,680 The words you will hear a lot are IID-- 1012 00:44:18,680 --> 00:44:20,630 independent and identically distributed-- 1013 00:44:20,630 --> 00:44:23,120 that means that your data is basically all the same. 1014 00:44:23,120 --> 00:44:28,070 And one data point is not impacting another data point. 1015 00:44:28,070 --> 00:44:30,320 Hopefully we can describe some interesting mathematics 1016 00:44:30,320 --> 00:44:31,400 arising in statistics. 1017 00:44:31,400 --> 00:44:33,500 You know, if you've taken linear algebra, 1018 00:44:33,500 --> 00:44:36,590 maybe we can explain to you why. 1019 00:44:36,590 --> 00:44:38,630 If you've done some calculus, maybe we 1020 00:44:38,630 --> 00:44:40,040 can do some interesting calculus.
1021 00:44:40,040 --> 00:44:42,980 We'll see how in the spirit of applied math 1022 00:44:42,980 --> 00:44:45,630 those things answer interesting questions. 1023 00:44:45,630 --> 00:44:49,550 And basically we'll try to carve out a math toolbox that's 1024 00:44:49,550 --> 00:44:52,300 useful for statistics. 1025 00:44:52,300 --> 00:44:55,280 And maybe you can extend it to more sophisticated methods 1026 00:44:55,280 --> 00:44:57,080 that we did not cover in this class. 1027 00:44:57,080 --> 00:44:59,180 In particular in the machine learning class, 1028 00:44:59,180 --> 00:45:02,120 hopefully you'll be able to have some statistical intuition 1029 00:45:02,120 --> 00:45:04,590 about what is going on. 1030 00:45:04,590 --> 00:45:06,680 So what this course is not about, 1031 00:45:06,680 --> 00:45:09,950 it's not about spending a lot of time looking at data sets, 1032 00:45:09,950 --> 00:45:13,310 and trying to understand some statistical thinking 1033 00:45:13,310 --> 00:45:14,330 kind of questions. 1034 00:45:14,330 --> 00:45:16,550 So this is more of an applied statistical perspective 1035 00:45:16,550 --> 00:45:19,760 on things, or more modeling. 1036 00:45:19,760 --> 00:45:22,150 So I'm going to typically give you the model. 1037 00:45:22,150 --> 00:45:23,510 And say this is a model. 1038 00:45:23,510 --> 00:45:26,480 And this is how we're going to build an estimator 1039 00:45:26,480 --> 00:45:28,440 in the framework of this model. 1040 00:45:28,440 --> 00:45:30,820 So for example, 18.075, to a certain extent, 1041 00:45:30,820 --> 00:45:32,945 is called "Statistical Thinking and Data Analysis." 1042 00:45:32,945 --> 00:45:36,500 So I'm hoping there is some statistical thinking in there. 1043 00:45:36,500 --> 00:45:38,840 We will not talk about software implementation. 1044 00:45:38,840 --> 00:45:42,270 Unfortunately, there's just too little time in a semester. 1045 00:45:42,270 --> 00:45:45,840 There's other courses that are giving you some overview.
1046 00:45:45,840 --> 00:45:49,355 So the main software these days-- R 1047 00:45:49,355 --> 00:45:54,830 is the leading software I'd say in statistics, both in academia 1048 00:45:54,830 --> 00:45:58,100 and industry, lots of packages, one every day 1049 00:45:58,100 --> 00:46:00,140 that's probably coming out. 1050 00:46:00,140 --> 00:46:03,410 But there's other things, right, so now Python is probably 1051 00:46:03,410 --> 00:46:09,110 catching up with all these scikit-learn packages that 1052 00:46:09,110 --> 00:46:10,130 are coming up. 1053 00:46:10,130 --> 00:46:14,300 Julia has some statistics in there, 1054 00:46:14,300 --> 00:46:17,867 but really, if you were to learn a statistical software, 1055 00:46:17,867 --> 00:46:19,325 let's say you love doing this, this 1056 00:46:19,325 --> 00:46:21,590 would be the one that would prove most useful for you 1057 00:46:21,590 --> 00:46:22,190 in the future. 1058 00:46:22,190 --> 00:46:26,700 It does not scale super well to high dimensional data. 1059 00:46:26,700 --> 00:46:28,550 So there is a class in IDSS that actually 1060 00:46:28,550 --> 00:46:31,080 uses R. It's called IDS.012, I think 1061 00:46:31,080 --> 00:46:36,520 it's called "Statistics, Computation, and Applications," 1062 00:46:36,520 --> 00:46:37,850 or something like this. 1063 00:46:37,850 --> 00:46:40,130 I'm also preparing, with Peter Kempthorne, 1064 00:46:40,130 --> 00:46:42,538 a course called "Computational Statistics." 1065 00:46:47,380 --> 00:46:50,830 It's going to be offered this Spring as a special topics. 1066 00:46:50,830 --> 00:46:55,880 And so Peter Kempthorne will be teaching it. 1067 00:46:55,880 --> 00:46:58,620 And this class will actually focus 1068 00:46:58,620 --> 00:47:00,807 on using R. And even beyond that, 1069 00:47:00,807 --> 00:47:02,390 it's not just going to be about using.
1070 00:47:02,390 --> 00:47:04,285 It's going to be about understanding-- 1071 00:47:04,285 --> 00:47:05,910 just the same way we're going to see 1072 00:47:05,910 --> 00:47:07,980 how math helps you do statistics, 1073 00:47:07,980 --> 00:47:09,870 it's going to help you see how math helps you 1074 00:47:09,870 --> 00:47:12,531 do algorithms for statistics. 1075 00:47:12,531 --> 00:47:15,030 All right, so we'll talk about maximum likelihood estimator. 1076 00:47:15,030 --> 00:47:16,910 We'll need to maximize some function. 1077 00:47:16,910 --> 00:47:19,120 There's an optimization toolbox to do that. 1078 00:47:19,120 --> 00:47:20,970 And we'll see how we can have specialized 1079 00:47:20,970 --> 00:47:22,800 for statistics for that, and what 1080 00:47:22,800 --> 00:47:25,340 are the principles behind it. 1081 00:47:25,340 --> 00:47:26,990 And you know, of course, if you've 1082 00:47:26,990 --> 00:47:29,230 taken AP stats you probably think that stats 1083 00:47:29,230 --> 00:47:31,250 is boring to death because it was just 1084 00:47:31,250 --> 00:47:34,310 a long laundry-list that spent a lot of time on t-test. 1085 00:47:34,310 --> 00:47:37,010 I'm pretty sure we're not going to talk about t-test, well, 1086 00:47:37,010 --> 00:47:38,480 maybe once. 1087 00:47:38,480 --> 00:47:42,200 But this is not a matter of saying you're going to do this. 1088 00:47:42,200 --> 00:47:43,670 And this is a slight variant of it. 1089 00:47:43,670 --> 00:47:46,160 We're going to really try to understand what's going on. 1090 00:47:46,160 --> 00:47:49,820 So, admittedly, you have not chosen the simplest way 1091 00:47:49,820 --> 00:47:52,580 to get an A in statistics on campus. 1092 00:47:52,580 --> 00:47:54,470 All right, this is not the easiest class. 1093 00:47:54,470 --> 00:47:56,000 It might be challenging at times, 1094 00:47:56,000 --> 00:47:59,674 but I can promise you that you will maybe suffer.
1095 00:47:59,674 --> 00:48:01,340 But you will learn something by the time 1096 00:48:01,340 --> 00:48:02,381 you're out of this class. 1097 00:48:02,381 --> 00:48:04,550 This will not be a waste of your time. 1098 00:48:04,550 --> 00:48:06,560 And you will be able to understand, 1099 00:48:06,560 --> 00:48:09,230 and not having to remember by heart how those things actually 1100 00:48:09,230 --> 00:48:10,680 work. 1101 00:48:10,680 --> 00:48:13,750 Are there any questions? 1102 00:48:13,750 --> 00:48:16,910 Anybody want to go to other stats class on campus? 1103 00:48:16,910 --> 00:48:18,860 Maybe it's not too late. 1104 00:48:18,860 --> 00:48:21,360 OK. 1105 00:48:21,360 --> 00:48:25,280 So let's do some statistics. 1106 00:48:25,280 --> 00:48:29,600 So I see the time now and it's 11:56, 1107 00:48:29,600 --> 00:48:31,800 so we have another 30 minutes. 1108 00:48:31,800 --> 00:48:35,880 I will typically give you a three, 1109 00:48:35,880 --> 00:48:37,820 four minute break if you want to stretch, 1110 00:48:37,820 --> 00:48:39,278 if you want to run to the bathroom, 1111 00:48:39,278 --> 00:48:45,220 if you want to check your texts or Instagram. 1112 00:48:45,220 --> 00:48:47,510 There was very little content in this class, 1113 00:48:47,510 --> 00:48:49,010 hopefully it was entertaining enough 1114 00:48:49,010 --> 00:48:51,170 that you don't need the break. 1115 00:48:51,170 --> 00:48:55,510 But just in the future, so you know you will have a break. 1116 00:48:55,510 --> 00:49:01,670 So statistics, this is how it starts, I'm French, what can 1117 00:49:01,670 --> 00:49:05,750 I say I need to put some French words. 1118 00:49:05,750 --> 00:49:08,450 So this is not how office hours are going to go down. 1119 00:49:12,410 --> 00:49:16,370 Anybody know this sculpture by a Rodin, The Kiss. 1120 00:49:16,370 --> 00:49:18,520 Maybe probably The Thinker is more famous. 1121 00:49:18,520 --> 00:49:20,240 But this is actually a pretty famous one. 
1122 00:49:20,240 --> 00:49:23,830 But is it really this one, or is it this one. 1123 00:49:23,830 --> 00:49:26,504 Anybody knows which one it is? 1124 00:49:26,504 --> 00:49:27,910 This one? 1125 00:49:27,910 --> 00:49:28,450 Or this one? 1126 00:49:28,450 --> 00:49:30,628 AUDIENCE: The previous. 1127 00:49:30,628 --> 00:49:32,532 PHILIPPE RIGOLLET: What's that? 1128 00:49:32,532 --> 00:49:33,490 AUDIENCE: This one. 1129 00:49:33,490 --> 00:49:33,990 PHILIPPE RIGOLLET: It's this one. 1130 00:49:33,990 --> 00:49:35,340 AUDIENCE: Final answer. 1131 00:49:35,340 --> 00:49:39,240 PHILIPPE RIGOLLET: Yeah, who votes for this one. 1132 00:49:39,240 --> 00:49:40,380 OK. 1133 00:49:40,380 --> 00:49:42,451 Who votes for that one? 1134 00:49:42,451 --> 00:49:42,950 Thank you. 1135 00:49:42,950 --> 00:49:45,770 I love that you do not want to pronounce yourself with no data 1136 00:49:45,770 --> 00:49:47,160 actually to make any decision. 1137 00:49:47,160 --> 00:49:49,640 This is a total coin toss right. 1138 00:49:49,640 --> 00:49:51,340 Turns out that there is data, and there 1139 00:49:51,340 --> 00:49:53,930 is in the very serious journal Nature, 1140 00:49:53,930 --> 00:49:56,210 someone published a very serious paper which 1141 00:49:56,210 --> 00:49:58,235 actually looks pretty serious. 1142 00:49:58,235 --> 00:50:00,110 If you look at it, it's like "Human Behavior: 1143 00:50:00,110 --> 00:50:02,510 Adult persistence of head-turning asymmetry," 1144 00:50:02,510 --> 00:50:04,930 is a lot of fancy words in there. 1145 00:50:04,930 --> 00:50:07,310 And this, I'm not kidding you, this study 1146 00:50:07,310 --> 00:50:09,830 is about collecting data of people kissing, 1147 00:50:09,830 --> 00:50:12,080 and knowing if they bend their head to the right 1148 00:50:12,080 --> 00:50:14,130 or if they bend their head to the left. 1149 00:50:14,130 --> 00:50:15,630 And that's all it is.
1150 00:50:15,630 --> 00:50:21,560 And so a neonatal right-side preference 1151 00:50:21,560 --> 00:50:25,510 makes a surprising romantic reappearance in later life. 1152 00:50:25,510 --> 00:50:27,830 There's an explanation for it. 1153 00:50:27,830 --> 00:50:32,820 All right, so if we follow this Nature which one is the one. 1154 00:50:32,820 --> 00:50:33,540 This one? 1155 00:50:33,540 --> 00:50:34,920 Or this one? 1156 00:50:34,920 --> 00:50:35,980 This one, right? 1157 00:50:35,980 --> 00:50:38,900 Head to the right. 1158 00:50:38,900 --> 00:50:41,250 And to be fair, for this class I was like, 1159 00:50:41,250 --> 00:50:46,130 oh, I'm going to go and show them what Google Images does. 1160 00:50:46,130 --> 00:50:49,400 When you Google kissing couple, it's 1161 00:50:49,400 --> 00:50:51,740 inappropriate after maybe the first picture. 1162 00:50:51,740 --> 00:50:53,540 And so I cannot show you this. 1163 00:50:53,540 --> 00:50:55,370 But you know you can check for yourself. 1164 00:50:55,370 --> 00:50:57,140 Though I would argue, so this person 1165 00:50:57,140 --> 00:51:00,890 here actually went out in airports 1166 00:51:00,890 --> 00:51:06,740 and took pictures of strangers kissing and collecting data. 1167 00:51:06,740 --> 00:51:10,970 And can somebody guess why did he just not stay home 1168 00:51:10,970 --> 00:51:12,875 and collect data from Google Images 1169 00:51:12,875 --> 00:51:17,300 by just googling kissing couples. 1170 00:51:17,300 --> 00:51:19,850 What's wrong with this data? 1171 00:51:19,850 --> 00:51:22,754 I didn't know actually before I actually went on Google Images. 1172 00:51:22,754 --> 00:51:23,920 AUDIENCE: It can be altered? 1173 00:51:23,920 --> 00:51:24,540 PHILIPPE RIGOLLET: What was that? 1174 00:51:24,540 --> 00:51:25,730 AUDIENCE: It can be altered. 1175 00:51:25,730 --> 00:51:26,600 PHILIPPE RIGOLLET: It can be altered. 1176 00:51:26,600 --> 00:51:28,220 But, you know, who would want to do this? 
1177 00:51:28,220 --> 00:51:29,420 I mean there's no particular reason why 1178 00:51:29,420 --> 00:51:31,920 you would want to flip an image before putting it out there. 1179 00:51:31,920 --> 00:51:34,610 I mean, you might, but you know maybe they 1180 00:51:34,610 --> 00:51:38,858 want to hide the brand of your Gap shirt or something. 1181 00:51:38,858 --> 00:51:42,260 AUDIENCE: I guess the people who post pictures of themselves 1182 00:51:42,260 --> 00:51:44,587 kissing on Google Images are not representative 1183 00:51:44,587 --> 00:51:45,670 of the general population. 1184 00:51:45,670 --> 00:51:47,626 PHILIPPE RIGOLLET: Yeah, that's very true. 1185 00:51:47,626 --> 00:51:49,250 And actually it's even worse than that. 1186 00:51:49,250 --> 00:51:51,237 The people who post pictures of themselves, 1187 00:51:51,237 --> 00:51:52,820 are not posting pictures of themselves 1188 00:51:52,820 --> 00:51:54,290 or putting pictures of the people 1189 00:51:54,290 --> 00:51:55,760 that they took a picture of. 1190 00:51:55,760 --> 00:51:59,367 And there usually is a stock watermark on this. 1191 00:51:59,367 --> 00:52:00,700 And it's basically stock images. 1192 00:52:00,700 --> 00:52:03,440 Those are actors, and so they've been directed to kiss 1193 00:52:03,440 --> 00:52:06,060 and this is not a natural thing to do. 1194 00:52:06,060 --> 00:52:08,060 And actually, if you go to Google Images-- and I 1195 00:52:08,060 --> 00:52:10,602 encourage you to do this, unless you 1196 00:52:10,602 --> 00:52:12,310 don't want to see inappropriate pictures, 1197 00:52:12,310 --> 00:52:14,420 and they're mightily inappropriate. 1198 00:52:14,420 --> 00:52:19,520 And basically you will see that this study is actually not 1199 00:52:19,520 --> 00:52:20,310 working at all. 1200 00:52:20,310 --> 00:52:21,470 I mean, I looked briefly. 1201 00:52:21,470 --> 00:52:22,920 I didn't actually collect numbers. 
1202 00:52:22,920 --> 00:52:26,460 But I didn't find a particular tendency to bend right. 1203 00:52:26,460 --> 00:52:28,770 If anything, it was actually probably the opposite. 1204 00:52:28,770 --> 00:52:31,380 And it's because those people were directed to do it. 1205 00:52:31,380 --> 00:52:34,975 They just don't actually think about doing it. 1206 00:52:34,975 --> 00:52:36,350 And also because I think you need 1207 00:52:36,350 --> 00:52:38,870 to justify writing in your paper more than, 1208 00:52:38,870 --> 00:52:41,720 I sat in front of my computer. 1209 00:52:41,720 --> 00:52:46,820 So again, this first sentence here, 1210 00:52:46,820 --> 00:52:49,420 a neonatal right-side preference-- 1211 00:52:49,420 --> 00:52:51,630 "is there a right side preference?" 1212 00:52:51,630 --> 00:52:53,770 is not a mathematical question. 1213 00:52:53,770 --> 00:52:57,760 But we can start saying, let's blah, and put some variables, 1214 00:52:57,760 --> 00:52:59,990 and ask questions about those variables. 1215 00:52:59,990 --> 00:53:02,540 So you know x is actually not a variable that's 1216 00:53:02,540 --> 00:53:04,870 used very much in statistics for parameters. 1217 00:53:04,870 --> 00:53:07,129 But p is one, for parameter. 1218 00:53:07,129 --> 00:53:09,420 And so you're going to take your parameter of interest, 1219 00:53:09,420 --> 00:53:12,100 p, here is going to be the proportion of couples. 1220 00:53:12,100 --> 00:53:13,930 And that's among all couples. 1221 00:53:13,930 --> 00:53:17,530 So here, if you talk about statistical thinking, 1222 00:53:17,530 --> 00:53:20,560 there would be a question about what population this would 1223 00:53:20,560 --> 00:53:22,080 actually be representative of. 1224 00:53:22,080 --> 00:53:24,190 Usually this is a call to your-- 1225 00:53:26,890 --> 00:53:30,280 sorry, I should not forget this word it's important for you. 1226 00:53:33,290 --> 00:53:34,720 OK, I forget this word.
1227 00:53:34,720 --> 00:53:38,170 So this is-- 1228 00:53:38,170 --> 00:53:43,094 OK, 1229 00:53:43,094 --> 00:53:44,510 So if you look at this proportion, 1230 00:53:44,510 --> 00:53:46,660 maybe these couples that are in the study 1231 00:53:46,660 --> 00:53:49,210 might be representative only of couples in airports. 1232 00:53:49,210 --> 00:53:51,700 Maybe they actually put on a show for the other passengers. 1233 00:53:51,700 --> 00:53:52,240 Who knows? 1234 00:53:52,240 --> 00:53:54,342 You know, like, oh, let's just do it as well. 1235 00:53:54,342 --> 00:53:56,050 And just like the people in Google Images 1236 00:53:56,050 --> 00:53:57,175 they are actually doing it. 1237 00:53:57,175 --> 00:53:58,930 So maybe you want to just restrict it. 1238 00:53:58,930 --> 00:54:01,120 But of course clearly if it's appearing in Nature, 1239 00:54:01,120 --> 00:54:04,420 it should not be only about couples in airports. 1240 00:54:04,420 --> 00:54:07,810 It's supposedly representative of all couples in the world. 1241 00:54:07,810 --> 00:54:10,210 And so here let's just keep it vague, 1242 00:54:10,210 --> 00:54:12,610 but you need to keep in mind what population 1243 00:54:12,610 --> 00:54:14,710 this is actually making a statement about. 1244 00:54:14,710 --> 00:54:20,120 So you have this full population of people in the world. 1245 00:54:20,120 --> 00:54:23,130 Right, so those are all the couples. 1246 00:54:23,130 --> 00:54:27,380 And this person went ahead and collected data 1247 00:54:27,380 --> 00:54:29,667 about a bunch of them. 1248 00:54:29,667 --> 00:54:31,750 And we know that, in this thing, there's basically 1249 00:54:31,750 --> 00:54:33,521 a proportion of them, that's like p, 1250 00:54:33,521 --> 00:54:35,520 and that's the proportion of them that's bending 1251 00:54:35,520 --> 00:54:36,570 their head to the right. 1252 00:54:36,570 --> 00:54:40,050 And so everybody on this side is bending their heads right. 
1253 00:54:40,050 --> 00:54:41,550 And hopefully we can actually sample 1254 00:54:41,550 --> 00:54:42,716 this thing uniformly. 1255 00:54:42,716 --> 00:54:44,640 That's basically the process that's going on. 1256 00:54:44,640 --> 00:54:47,250 So this is the statistical experiment. 1257 00:54:47,250 --> 00:54:49,589 We're going to observe n kissing couples. 1258 00:54:49,589 --> 00:54:51,880 So here we're going to put as many variables as we can. 1259 00:54:51,880 --> 00:54:53,505 So we don't have to stick with numbers. 1260 00:54:53,505 --> 00:54:55,630 And then we'll just plug in the numbers. 1261 00:54:55,630 --> 00:54:58,450 n kissing couples, and n is also, in statistics, 1262 00:54:58,450 --> 00:55:04,320 by the way, n is the size of your sample 99.9% of the time. 1263 00:55:04,320 --> 00:55:06,720 And collect the value of each outcome. 1264 00:55:06,720 --> 00:55:07,530 So we want numbers. 1265 00:55:07,530 --> 00:55:08,696 We don't want right or left. 1266 00:55:08,696 --> 00:55:12,300 So we're going to code them by 0 and 1, pretty naturally. 1267 00:55:12,300 --> 00:55:16,890 And then we're going to estimate p which is unknown. 1268 00:55:16,890 --> 00:55:18,229 So p is this area. 1269 00:55:18,229 --> 00:55:19,770 And we're going to estimate it simply 1270 00:55:19,770 --> 00:55:24,570 by the proportion of rights. So the proportion of crosses 1271 00:55:24,570 --> 00:55:27,030 that actually fell in the right side. 1272 00:55:29,660 --> 00:55:33,340 So in this study what you will find 1273 00:55:33,340 --> 00:55:36,460 is that the numbers that were collected 1274 00:55:36,460 --> 00:55:43,690 were 124 couples, and that, out of those 124, 80 of them 1275 00:55:43,690 --> 00:55:46,570 turned their head to the right. 1276 00:55:46,570 --> 00:55:49,240 So, p hat is a proportion. 1277 00:55:49,240 --> 00:55:50,020 How do we do it? 1278 00:55:50,020 --> 00:55:51,880 Well, you don't need statistics for that.
1279 00:55:51,880 --> 00:55:54,790 You're going to see 80 divided by 124. 1280 00:55:54,790 --> 00:55:57,580 And you will find that in this particular study 1281 00:55:57,580 --> 00:56:00,139 64.5% of the couples were bending 1282 00:56:00,139 --> 00:56:01,180 their heads to the right. 1283 00:56:01,180 --> 00:56:03,040 That's a pretty large number, right? 1284 00:56:03,040 --> 00:56:07,180 The question is if I picked another 124 couples, maybe 1285 00:56:07,180 --> 00:56:10,240 at different airports, different times, would I see the same number? 1286 00:56:10,240 --> 00:56:11,920 Would this number be all over the place? 1287 00:56:11,920 --> 00:56:14,650 Would it be sometimes very close to 120, or sometimes 1288 00:56:14,650 --> 00:56:15,850 very close to 10? 1289 00:56:15,850 --> 00:56:20,020 Or would it be-- is this number actually fluctuating a lot. 1290 00:56:20,020 --> 00:56:26,870 And so, hopefully not too much, 64.5 percent is definitely 1291 00:56:26,870 --> 00:56:28,670 much larger than 50%. 1292 00:56:28,670 --> 00:56:31,652 And so there seems to be this preference. 1293 00:56:31,652 --> 00:56:33,110 Now we're going to have to quantify 1294 00:56:33,110 --> 00:56:34,276 how much of this preference. 1295 00:56:34,276 --> 00:56:38,240 Is this number significantly larger than 50%? 1296 00:56:38,240 --> 00:56:41,090 So if our data, for example, was just three couples. 1297 00:56:41,090 --> 00:56:43,370 I'm just going there, I'm going to Logan. 1298 00:56:43,370 --> 00:56:45,440 I call it, I do right, left, right. 1299 00:56:45,440 --> 00:56:47,570 And then I see-- 1300 00:56:47,570 --> 00:56:53,060 see what's the name of the fish place there? 1301 00:56:53,060 --> 00:56:56,399 I go to Wahlburgers at Logan and I'm like, 1302 00:56:56,399 --> 00:56:57,440 OK, I'm done for the day. 1303 00:56:57,440 --> 00:56:58,610 I collect this data. 1304 00:56:58,610 --> 00:57:02,120 I go home, and I'm like, wow, 66.7% to the right.
1305 00:57:02,120 --> 00:57:03,380 That's a pretty big number. 1306 00:57:03,380 --> 00:57:06,650 It's even farther from 50% than this other guy. 1307 00:57:06,650 --> 00:57:08,315 So I'm doing even better. 1308 00:57:08,315 --> 00:57:10,190 But of course you know that this is not true. 1309 00:57:10,190 --> 00:57:12,237 Three couples is definitely not representative. 1310 00:57:12,237 --> 00:57:13,820 If I stopped at the first one, I would 1311 00:57:13,820 --> 00:57:18,360 have actually-- at the first two, I would have even 100%. 1312 00:57:18,360 --> 00:57:21,290 So the question that statistics is going to help us answer is, 1313 00:57:21,290 --> 00:57:23,514 how large should the sample be? 1314 00:57:23,514 --> 00:57:25,805 For some reason, I don't know if you guys receive this, 1315 00:57:25,805 --> 00:57:27,554 I'm affiliated with the Broad Institute, 1316 00:57:27,554 --> 00:57:30,050 and since then I receive one email per day 1317 00:57:30,050 --> 00:57:32,240 that says, sample size determination-- 1318 00:57:32,240 --> 00:57:33,950 how large should your sample be? 1319 00:57:33,950 --> 00:57:36,040 Like, I know how large my sample should be. 1320 00:57:36,040 --> 00:57:39,530 I've taken 18.650 multiple times. 1321 00:57:39,530 --> 00:57:43,240 And so I know, but the question is-- is 124 1322 00:57:43,240 --> 00:57:45,270 a large enough number or not? 1323 00:57:45,270 --> 00:57:47,540 Well, the answer is actually, as usual, it depends. 1324 00:57:47,540 --> 00:57:51,310 It will depend on the true unknown value of p. 1325 00:57:51,310 --> 00:57:56,590 But from those particular values that we got, so 124 and-- 1326 00:57:56,590 --> 00:57:57,870 how many couples were there? 1327 00:57:57,870 --> 00:57:58,750 80? 1328 00:57:58,750 --> 00:58:02,240 We can actually ask some questions. 1329 00:58:02,240 --> 00:58:07,750 So here we said that 80 was larger than 50-- 1330 00:58:07,750 --> 00:58:12,310 was allowing us to conclude at 64.5%.
1331 00:58:12,310 --> 00:58:17,520 So it could be one reason to say that it was larger than 50%. 1332 00:58:17,520 --> 00:58:23,230 50% of 124 is 62. 1333 00:58:23,230 --> 00:58:24,910 So the question is, would I 1334 00:58:24,910 --> 00:58:28,750 be willing to make this conclusion at 63? 1335 00:58:28,750 --> 00:58:30,630 Is that a number that would convince you? 1336 00:58:30,630 --> 00:58:34,030 Who would be convinced by 63? 1337 00:58:34,030 --> 00:58:35,465 Who would be convinced by 72? 1338 00:58:38,410 --> 00:58:40,404 Who would be convinced by 75? 1339 00:58:40,404 --> 00:58:42,820 Hopefully the number of hands that are raised should grow. 1340 00:58:46,120 --> 00:58:48,820 Who would be convinced by 80? 1341 00:58:48,820 --> 00:58:51,190 All right, so basically those numbers actually 1342 00:58:51,190 --> 00:58:52,720 don't come from nowhere. 1343 00:58:52,720 --> 00:58:56,740 This 72 would be the number that you would need for a study-- 1344 00:58:56,740 --> 00:58:58,662 most statistical studies would be the number 1345 00:58:58,662 --> 00:58:59,620 that they would retain. 1346 00:58:59,620 --> 00:59:01,870 That's for 124. 1347 00:59:01,870 --> 00:59:04,360 You would need to see 72 that turn their head 1348 00:59:04,360 --> 00:59:07,390 right to actually make this conclusion. 1349 00:59:07,390 --> 00:59:08,977 And then 75-- 1350 00:59:08,977 --> 00:59:11,560 So we'll see that there's many ways to come to this conclusion 1351 00:59:11,560 --> 00:59:12,934 because, as you can see, this was 1352 00:59:12,934 --> 00:59:15,250 published in Nature with 80. 1353 00:59:15,250 --> 00:59:15,920 So that was OK. 1354 00:59:15,920 --> 00:59:17,503 So 80 is actually a very large number. 1355 00:59:17,503 --> 00:59:20,116 This is 99 point-- 1356 00:59:20,116 --> 00:59:24,250 this 99% -- no, so this is 95% confidence. 1357 00:59:24,250 --> 00:59:26,290 This is 99% confidence. 1358 00:59:26,290 --> 00:59:29,330 And this is 99.9% confidence.
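The thresholds 72, 75 and 80 quoted above can be reproduced with a short sketch. It assumes something the lecture does not state at this point: that they come from a one-sided normal approximation to the number of right-turning couples when p is 1/2. The helper name `threshold` is hypothetical.

```python
import math
from statistics import NormalDist

def threshold(n, level, p0=0.5):
    """Smallest count above the one-sided normal cutoff when the true p is p0."""
    mean = n * p0                        # expected count under p0 (62 for n = 124)
    sd = math.sqrt(n * p0 * (1 - p0))    # standard deviation of the count under p0
    z = NormalDist().inv_cdf(level)      # standard normal quantile at this level
    return math.ceil(mean + z * sd)

# Assumed confidence levels: 95%, 99% and 99.9%, as listed in the lecture.
thresholds = [threshold(124, level) for level in (0.95, 0.99, 0.999)]
print(thresholds)  # [72, 75, 80]
```

Under this assumption, asking for more confidence pushes the required count further away from 62, which is why 80 is the conservative choice.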
1359 00:59:29,330 --> 00:59:34,210 So if you said 80 you're a very conservative person. 1360 00:59:34,210 --> 00:59:36,930 Starting at 72, you can start making this conclusion. 1361 00:59:39,820 --> 00:59:41,260 To understand this, we need to do 1362 00:59:41,260 --> 00:59:45,880 our little mathematical kitchen here, 1363 00:59:45,880 --> 00:59:49,810 and we need to do some modeling. 1364 00:59:49,810 --> 00:59:51,460 So we need to understand by modeling-- 1365 00:59:51,460 --> 00:59:55,000 we need to understand what random process we think 1366 00:59:55,000 --> 00:59:57,190 this data is generated from. 1367 00:59:57,190 --> 00:59:59,180 So it's going to have some unknown parameters, 1368 00:59:59,180 --> 01:00:00,190 unlike in probability. 1369 01:00:00,190 --> 01:00:02,800 But we need to have just basically everything written 1370 01:00:02,800 --> 01:00:04,900 except for the values of the parameters. 1371 01:00:04,900 --> 01:00:08,530 When I said a die is coming uniformly with probability 1/6 1372 01:00:08,530 --> 01:00:12,220 then I need to have, say maybe with probability-- maybe 1373 01:00:12,220 --> 01:00:14,260 I should say here are six numbers, 1374 01:00:14,260 --> 01:00:18,550 and I need to just fill in those numbers. 1375 01:00:18,550 --> 01:00:23,140 So for i equal 1 to n, I'm going to define 1376 01:00:23,140 --> 01:00:27,140 Ri to be the indicator. 1377 01:00:27,140 --> 01:00:29,950 An indicator is just something that takes value 1 if something 1378 01:00:29,950 --> 01:00:31,330 is true, and 0 if not. 1379 01:00:31,330 --> 01:00:34,450 So it's an indicator that the i-th couple 1380 01:00:34,450 --> 01:00:36,820 turns their head to the right. 1381 01:00:36,820 --> 01:00:39,580 So, Ri, so it's indexed by i. 1382 01:00:39,580 --> 01:00:42,290 And it's 1 if the i-th couple turns their head to the right, 1383 01:00:42,290 --> 01:00:45,820 and 0 if it's-- 1384 01:00:45,820 --> 01:00:48,640 well actually, I guess they can probably kiss straight, right?
1385 01:00:48,640 --> 01:00:51,500 So that would be weird, but they might be able to do this. 1386 01:00:51,500 --> 01:00:54,280 So let's say not right. 1387 01:00:54,280 --> 01:00:56,710 Then the estimator of p, we said, was p hat. 1388 01:00:56,710 --> 01:00:58,360 It was just the ratio of two numbers. 1389 01:00:58,360 --> 01:01:02,410 But really what it is is I count, I sum those Ri's. 1390 01:01:02,410 --> 01:01:07,360 Since I only add those that take value 1, what this is is-- 1391 01:01:07,360 --> 01:01:10,330 this sum here is actually just counting the number of 1's. 1392 01:01:10,330 --> 01:01:13,810 Which is another way to say it's counting the number of couples 1393 01:01:13,810 --> 01:01:15,620 that are kissing to the right. 1394 01:01:15,620 --> 01:01:18,520 And here I don't even have to tell you anything 1395 01:01:18,520 --> 01:01:20,230 about the numbers or anything. 1396 01:01:20,230 --> 01:01:21,940 I can only keep track of-- 1397 01:01:21,940 --> 01:01:24,240 first couple is a 0 second couple is a 1, 1398 01:01:24,240 --> 01:01:25,251 third couple is 0. 1399 01:01:25,251 --> 01:01:27,250 The data set-- you can actually find it online-- 1400 01:01:27,250 --> 01:01:29,646 is actually a sequence of 0's and 1's. 1401 01:01:29,646 --> 01:01:31,270 Now clearly for the question that we're 1402 01:01:31,270 --> 01:01:32,950 asking about this proportion, I don't 1403 01:01:32,950 --> 01:01:34,980 need to keep track of all this information. 1404 01:01:34,980 --> 01:01:37,150 All I need to keep track of is the number 1405 01:01:37,150 --> 01:01:39,130 of 0's and the number of 1's. 1406 01:01:39,130 --> 01:01:42,340 Those are completely interchangeable. 1407 01:01:42,340 --> 01:01:44,440 There's no time effect in this. 1408 01:01:44,440 --> 01:01:48,370 The first couple is no different than the 15th couple. 1409 01:01:48,370 --> 01:01:50,380 So we call this Rn bar. 1410 01:01:50,380 --> 01:01:52,960 That's going to be a very standard notation that we use. 
1411 01:01:52,960 --> 01:01:55,780 R might be replaced by other letters like x-- 1412 01:01:55,780 --> 01:01:58,150 so xn bar, yn bar. 1413 01:01:58,150 --> 01:02:00,700 And this thing essentially means that I 1414 01:02:00,700 --> 01:02:04,630 average the R's, or the Ri's, over n of them. 1415 01:02:04,630 --> 01:02:06,100 And the bar means the average. 1416 01:02:06,100 --> 01:02:09,460 So I divide by n the total number of 1's. 1417 01:02:09,460 --> 01:02:13,630 So here this sum was equal to 80 in our example and n 1418 01:02:13,630 --> 01:02:16,820 was equal to 124. 1419 01:02:16,820 --> 01:02:18,370 Now this is an estimator. 1420 01:02:18,370 --> 01:02:20,380 So an estimator is different from an estimate. 1421 01:02:20,380 --> 01:02:21,580 An estimate is a number. 1422 01:02:21,580 --> 01:02:23,680 My estimate was 64.5%. 1423 01:02:23,680 --> 01:02:29,502 My estimator is this thing where I keep all the variables free. 1424 01:02:29,502 --> 01:02:31,210 And in particular, I keep those variables 1425 01:02:31,210 --> 01:02:34,720 to be random because I'm going to think of a random couple 1426 01:02:34,720 --> 01:02:37,360 kissing left or right as the outcome of a random process, 1427 01:02:37,360 --> 01:02:41,320 just like flipping a coin and getting heads or tails. 1428 01:02:41,320 --> 01:02:43,626 And so this thing here is a random variable, Ri. 1429 01:02:43,626 --> 01:02:46,250 And this average is, of course, an average of random variables. 1430 01:02:46,250 --> 01:02:47,500 It's itself a random variable. 1431 01:02:47,500 --> 01:02:49,330 So an estimator is a random variable. 1432 01:02:49,330 --> 01:02:51,972 An estimate is the realization of a random variable, 1433 01:02:51,972 --> 01:02:53,680 or, in other words, is the value that you 1434 01:02:53,680 --> 01:02:56,140 get for this random variable once you plug in the numbers 1435 01:02:56,140 --> 01:02:58,960 that you've collected.
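The distinction between estimator and estimate can be sketched in a few lines of Python. The function name `sample_mean` and the ordering of the 0's and 1's in `r` are assumptions made for illustration; only the counts (80 ones among 124 observations) come from the lecture.

```python
def sample_mean(observations):
    """Estimator: a rule that maps any sample of 0's and 1's to a number."""
    return sum(observations) / len(observations)

# Hypothetical ordering of the data; only the counts are from the lecture.
r = [1] * 80 + [0] * 44

# Estimate: the value the estimator takes on the data actually collected.
estimate = sample_mean(r)
print(f"{100 * estimate:.1f}%")  # 64.5%
```

The function is the estimator, since it can be applied to any sample; the printed number is the estimate for this particular sample.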
1436 01:02:58,960 --> 01:03:01,030 So I can talk about the accuracy of an estimator. 1437 01:03:01,030 --> 01:03:02,470 Accuracy means what? 1438 01:03:02,470 --> 01:03:04,950 Well, what would we want for an estimator? 1439 01:03:04,950 --> 01:03:06,892 Maybe we don't want it to fluctuate too much. 1440 01:03:06,892 --> 01:03:07,850 It's a random variable. 1441 01:03:07,850 --> 01:03:11,050 So I'm talking about the accuracy of a random variable. 1442 01:03:11,050 --> 01:03:13,740 So maybe I don't want it to be too volatile. 1443 01:03:13,740 --> 01:03:16,440 I could have one estimator which would be-- 1444 01:03:16,440 --> 01:03:20,010 just throw out 122 couples, keep only 2 1445 01:03:20,010 --> 01:03:21,420 and average those two numbers. 1446 01:03:21,420 --> 01:03:23,040 That's definitely a worse estimator 1447 01:03:23,040 --> 01:03:25,140 than keeping all of the 124. 1448 01:03:25,140 --> 01:03:26,641 So I need to find a way to say that. 1449 01:03:26,641 --> 01:03:28,140 And what I'm going to be able to say 1450 01:03:28,140 --> 01:03:30,060 is that the number is going to be fluctuating. 1451 01:03:30,060 --> 01:03:31,669 If I take another two couples, I'm 1452 01:03:31,669 --> 01:03:33,210 probably going to get 1453 01:03:33,210 --> 01:03:34,590 a completely different number. 1454 01:03:34,590 --> 01:03:38,400 But if I take another 124 couples two days later, 1455 01:03:38,400 --> 01:03:40,260 maybe I'm going to have a number that's 1456 01:03:40,260 --> 01:03:43,140 very close to 64.5%. 1457 01:03:43,140 --> 01:03:43,969 So that's one way. 1458 01:03:43,969 --> 01:03:46,260 The other thing we would like about this estimator is 1459 01:03:46,260 --> 01:03:47,767 actually-- 1460 01:03:47,767 --> 01:03:49,350 maybe it's not too volatile-- but also 1461 01:03:49,350 --> 01:03:54,300 we want it to be close to the number that we're looking for. 1462 01:03:54,300 --> 01:03:55,410 Here is an estimator.
1463 01:03:55,410 --> 01:03:57,150 It's a beautiful variable. 1464 01:03:57,150 --> 01:04:00,330 72%, that's an estimator. 1465 01:04:00,330 --> 01:04:02,520 Go out there, just do your favorite study 1466 01:04:02,520 --> 01:04:06,310 about drug performance. 1467 01:04:06,310 --> 01:04:10,440 And then they're going to call you, MIT student taking 1468 01:04:10,440 --> 01:04:12,011 statistics, they say, so how are you 1469 01:04:12,011 --> 01:04:13,260 going to build your estimator? 1470 01:04:13,260 --> 01:04:15,450 We've collected those 5,000 or something like that. 1471 01:04:15,450 --> 01:04:17,160 I'm just going to spit out 72%. 1472 01:04:17,160 --> 01:04:19,200 Whatever the data says, that's an estimator. 1473 01:04:19,200 --> 01:04:21,510 It's a stupid estimator but it is an estimator. 1474 01:04:21,510 --> 01:04:23,360 But this estimator is not volatile at all. 1475 01:04:23,360 --> 01:04:25,390 Every time you're going to have a new study, 1476 01:04:25,390 --> 01:04:27,640 even if you change fields, it's still going to be 72%. 1477 01:04:27,640 --> 01:04:29,190 This is beautiful. 1478 01:04:29,190 --> 01:04:31,590 And the problem is that's probably not 1479 01:04:31,590 --> 01:04:34,164 very close to the value you're actually trying to estimate. 1480 01:04:34,164 --> 01:04:35,080 So we need two things. 1481 01:04:35,080 --> 01:04:36,996 We need our estimator to be a random variable. 1482 01:04:36,996 --> 01:04:39,240 So think in terms of densities. 1483 01:04:39,240 --> 01:04:42,740 We want the density to be pretty narrow. 1484 01:04:42,740 --> 01:04:46,240 We want this thing to have very little-- 1485 01:04:46,240 --> 01:04:52,570 so this is definitely better than this. 1486 01:04:52,570 --> 01:04:55,830 But also, we want the number that we're interested in, p, 1487 01:04:55,830 --> 01:04:57,475 to be very close to this-- 1488 01:04:57,475 --> 01:05:00,260 to be close to the values that this thing is likely to take.
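A small simulation can make these two requirements concrete. It is only a sketch: `P_TRUE = 0.645` is a hypothetical stand-in for the unknown p, and the names `mean_of_n` and `constant` are invented for the illustration. It compares the center and the spread of three estimators: the average of 2 couples, the average of 124 couples, and the constant 72%.

```python
import random
import statistics

random.seed(0)
P_TRUE = 0.645  # hypothetical "true" p, chosen only for this simulation
REPS = 2000     # number of simulated repetitions of each study

def mean_of_n(n):
    """Sample mean of n simulated Bernoulli(P_TRUE) observations."""
    return sum(random.random() < P_TRUE for _ in range(n)) / n

def constant(_n):
    """The data-ignoring estimator from the lecture: always 72%."""
    return 0.72

results = {}
for name, estimator, n in [("mean of 2", mean_of_n, 2),
                           ("mean of 124", mean_of_n, 124),
                           ("constant 72%", constant, 124)]:
    draws = [estimator(n) for _ in range(REPS)]
    # Store (center, spread) of the estimator across repetitions.
    results[name] = (statistics.mean(draws), statistics.pstdev(draws))
    print(name, results[name])
```

In this sketch the constant estimator has zero spread but is centered away from `P_TRUE`, while the two averages are centered near `P_TRUE` and the one based on 124 couples fluctuates much less than the one based on 2.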
1489 01:05:00,260 --> 01:05:04,020 If p is here, this is not very good for us. 1490 01:05:04,020 --> 01:05:06,872 So that's basically the things we're going to be looking at. 1491 01:05:06,872 --> 01:05:08,580 The first one is referred to as variance. 1492 01:05:08,580 --> 01:05:10,163 The second one is referred to as bias. 1493 01:05:10,163 --> 01:05:14,380 Those things come up all over statistics. 1494 01:05:14,380 --> 01:05:16,810 So we need to understand a model. 1495 01:05:16,810 --> 01:05:20,907 So here's the model that we have for this particular problem. 1496 01:05:20,907 --> 01:05:22,990 So we need to make assumptions on the observations 1497 01:05:22,990 --> 01:05:23,650 that we see. 1498 01:05:23,650 --> 01:05:25,620 So we said we're going to assume that the random variable-- 1499 01:05:25,620 --> 01:05:27,160 that's not too much of a leap of faith. 1500 01:05:27,160 --> 01:05:29,243 We're just sweeping under the rug everything 1501 01:05:29,243 --> 01:05:31,180 we don't understand about those couples. 1502 01:05:31,180 --> 01:05:33,130 And the assumption that we make is 1503 01:05:33,130 --> 01:05:36,000 that Ri is a random variable. 1504 01:05:36,000 --> 01:05:38,120 This one you will forget very soon. 1505 01:05:38,120 --> 01:05:41,950 The second one is that each of the Ri's is-- 1506 01:05:41,950 --> 01:05:45,310 so it's a random variable that takes value 0 or 1. 1507 01:05:45,310 --> 01:05:47,024 Can anybody suggest the distribution 1508 01:05:47,024 --> 01:05:48,065 for this random variable? 1509 01:05:48,065 --> 01:05:49,363 AUDIENCE: Bernoulli. 1510 01:05:49,363 --> 01:05:50,363 PHILIPPE RIGOLLET: What? 1511 01:05:50,363 --> 01:05:51,500 AUDIENCE: Bernoulli. 1512 01:05:51,500 --> 01:05:51,980 PHILIPPE RIGOLLET: Bernoulli, right? 1513 01:05:51,980 --> 01:05:53,210 And it's actually beautiful. 1514 01:05:53,210 --> 01:05:56,770 This is where you have to do the least statistical modeling.
1515 01:05:56,770 --> 01:05:59,164 A random variable that takes value 0 or 1 1516 01:05:59,164 --> 01:06:00,080 is always a Bernoulli. 1517 01:06:00,080 --> 01:06:02,581 That's the simplest variable you can ever think of. 1518 01:06:02,581 --> 01:06:04,580 Any variable that takes only two possible values 1519 01:06:04,580 --> 01:06:06,155 can be reduced to a Bernoulli. 1520 01:06:06,155 --> 01:06:10,421 OK, so this is a Bernoulli. 1521 01:06:10,421 --> 01:06:12,420 And here we make the assumption that it actually 1522 01:06:12,420 --> 01:06:16,140 takes parameter p. 1523 01:06:16,140 --> 01:06:17,850 And there's an assumption here. 1524 01:06:17,850 --> 01:06:21,427 Can anybody tell me what the assumption is? 1525 01:06:21,427 --> 01:06:22,780 AUDIENCE: It's the same. 1526 01:06:22,780 --> 01:06:24,530 PHILIPPE RIGOLLET: Yeah, it's the same, right? 1527 01:06:24,530 --> 01:06:26,194 I could have said p i, but it's p. 1528 01:06:26,194 --> 01:06:28,110 And that's where I'm going to be able to start 1529 01:06:28,110 --> 01:06:29,360 getting to do some statistics. 1530 01:06:29,360 --> 01:06:31,930 It's that I'm going to start to be able to pool information 1531 01:06:31,930 --> 01:06:32,896 across all my guys. 1532 01:06:32,896 --> 01:06:34,270 If I assume that they're all p i's 1533 01:06:34,270 --> 01:06:36,340 completely uncoupled with each other, 1534 01:06:36,340 --> 01:06:37,480 then I'm in trouble. 1535 01:06:37,480 --> 01:06:39,741 There's nothing I can actually get. 1536 01:06:39,741 --> 01:06:41,740 And then I'm going to assume that those guys are 1537 01:06:41,740 --> 01:06:42,940 mutually independent. 1538 01:06:42,940 --> 01:06:45,530 And most of the time they will just say independent. 1539 01:06:45,530 --> 01:06:48,760 Meaning that, it's not like all these guys called each other 1540 01:06:48,760 --> 01:06:50,110 and it's actually a flash mob. 1541 01:06:50,110 --> 01:06:53,200 And they were like, let's all turn our head to the left.
1542 01:06:53,200 --> 01:06:54,790 And then this is definitely not going 1543 01:06:54,790 --> 01:06:59,910 to give you a valid conclusion. 1544 01:06:59,910 --> 01:07:02,400 So, again, randomness is a way of modeling lack 1545 01:07:02,400 --> 01:07:03,210 of information. 1546 01:07:03,210 --> 01:07:05,409 Here there is a way to figure it out. 1547 01:07:05,409 --> 01:07:07,200 Maybe I could have followed all those guys, 1548 01:07:07,200 --> 01:07:09,040 and known exactly what they were-- maybe 1549 01:07:09,040 --> 01:07:11,670 I could have looked at pictures of them in the womb 1550 01:07:11,670 --> 01:07:14,354 and guessed how they were turning-- by the way that's 1551 01:07:14,354 --> 01:07:16,020 one of the conclusions, they're guessing 1552 01:07:16,020 --> 01:07:17,436 that we turn our head to the right 1553 01:07:17,436 --> 01:07:21,180 because our head is turned to the right in the womb. 1554 01:07:21,180 --> 01:07:24,840 So we don't know what goes on in the kissers' minds. 1555 01:07:24,840 --> 01:07:26,730 And there's, you know, physics, sociology. 1556 01:07:26,730 --> 01:07:28,521 There's a lot of things that could help us, 1557 01:07:28,521 --> 01:07:31,220 but it's just too complicated to keep track of, 1558 01:07:31,220 --> 01:07:34,550 or too expensive in many instances. 1559 01:07:34,550 --> 01:07:37,310 Now again, the nicest part of this modeling 1560 01:07:37,310 --> 01:07:39,560 was the fact that the Ri's take only two values, which 1561 01:07:39,560 --> 01:07:41,900 means that this conclusion that they were Bernoulli 1562 01:07:41,900 --> 01:07:43,220 was totally free for us. 1563 01:07:43,220 --> 01:07:45,530 Once we know it's a random variable, it's a Bernoulli. 1564 01:07:45,530 --> 01:07:47,720 Now they could have been, as we said, 1565 01:07:47,720 --> 01:07:51,910 they could have been a Bernoulli with parameter p i.
1566 01:07:51,910 --> 01:07:55,420 For each i, I could have put a different parameter, 1567 01:07:55,420 --> 01:07:57,672 but I just don't have enough information. 1568 01:07:57,672 --> 01:07:58,630 What would I have said? 1569 01:07:58,630 --> 01:08:00,921 I would say, well the first couple turned to the right. 1570 01:08:00,921 --> 01:08:04,600 p1 has to be 1, that's my best guess. 1571 01:08:04,600 --> 01:08:06,910 The second couple kissed to the left, 1572 01:08:06,910 --> 01:08:10,690 well, p2 should be 0, that's my best guess. 1573 01:08:10,690 --> 01:08:14,710 And so basically I need to be 1574 01:08:14,710 --> 01:08:16,240 able to average my information. 1575 01:08:16,240 --> 01:08:19,649 And the way I do it is by coupling all these guys, 1576 01:08:19,649 --> 01:08:22,490 the p i's, to be the same p for all i. 1577 01:08:22,490 --> 01:08:23,840 OK, does it make sense? 1578 01:08:23,840 --> 01:08:28,984 Here what I am assuming is that my population is homogeneous. 1579 01:08:28,984 --> 01:08:29,609 Maybe it's not. 1580 01:08:29,609 --> 01:08:31,484 Maybe I could actually look at a finer grain, 1581 01:08:31,484 --> 01:08:35,529 but I'm basically making a statement about a population. 1582 01:08:35,529 --> 01:08:41,670 And so maybe you kiss to the left, and then you're not-- 1583 01:08:41,670 --> 01:08:44,010 I'm not making a statement about a person individually, 1584 01:08:44,010 --> 01:08:47,702 I'm making a statement about the overall population. 1585 01:08:47,702 --> 01:08:49,660 Now independence is probably reasonable, right? 1586 01:08:49,660 --> 01:08:53,080 This person just went and, you know, can seriously 1587 01:08:53,080 --> 01:08:56,109 hope that these couples did not communicate with each other. 1588 01:08:56,109 --> 01:08:59,830 Or that, you know, Tanya did not text that we should all 1589 01:08:59,830 --> 01:09:01,810 turn our head to the left now.
1590 01:09:01,810 --> 01:09:05,050 And there's no external stimulus that forces people 1591 01:09:05,050 --> 01:09:08,069 to do something different. 1592 01:09:08,069 --> 01:09:15,810 OK, so-- sorry about that. 1593 01:09:15,810 --> 01:09:19,899 Since we have a bit less than 10 minutes, 1594 01:09:19,899 --> 01:09:22,260 let's do a little bit of exercises, is that OK with you? 1595 01:09:22,260 --> 01:09:24,950 So I just have some exercises so we can see what 1596 01:09:24,950 --> 01:09:26,580 an exercise is going to look like. 1597 01:09:26,580 --> 01:09:30,090 This is sort of similar to the exercises you will see with me. 1598 01:09:30,090 --> 01:09:31,674 We should do it together, OK? 1599 01:09:31,674 --> 01:09:32,840 So now we're going to have-- 1600 01:09:32,840 --> 01:09:33,479 I have a test. 1601 01:09:36,630 --> 01:09:42,950 So that's an exam in probability. 1602 01:09:42,950 --> 01:09:44,290 OK. 1603 01:09:44,290 --> 01:09:50,899 And I'm going to have 15 students in this test. 1604 01:09:50,899 --> 01:09:53,550 And hopefully, this should be 15 grades 1605 01:09:53,550 --> 01:09:57,210 that are representative of the grades of a large class. 1606 01:09:57,210 --> 01:10:00,660 Right, so if you go to, you know, 18.600, it's a large class, 1607 01:10:00,660 --> 01:10:02,430 there's definitely more than 15 students. 1608 01:10:02,430 --> 01:10:04,750 And maybe, just by sampling 15 students at random, 1609 01:10:04,750 --> 01:10:08,010 I want to have an idea of what my grade distribution will 1610 01:10:08,010 --> 01:10:09,090 look like. 1611 01:10:09,090 --> 01:10:13,230 I'm grading them, I want to make an educated guess. 1612 01:10:13,230 --> 01:10:15,480 So I'm going to make some modeling assumptions 1613 01:10:15,480 --> 01:10:16,680 for those guys. 1614 01:10:16,680 --> 01:10:22,830 So here, 15 students and the grades are x1 to x15. 1615 01:10:22,830 --> 01:10:26,050 Just like we had R1, R2, all the way to R124.
1616 01:10:26,050 --> 01:10:27,790 Those were my Ri's. 1617 01:10:27,790 --> 01:10:29,610 And so now I have my xi's. 1618 01:10:29,610 --> 01:10:33,270 And I'm going to assume that xi follows 1619 01:10:33,270 --> 01:10:39,060 a Gaussian or normal distribution with mean mu 1620 01:10:39,060 --> 01:10:40,720 and variance sigma squared. 1621 01:10:40,720 --> 01:10:43,710 Now this is modeling, right? 1622 01:10:43,710 --> 01:10:45,720 Nobody told me this; there's no physical process that 1623 01:10:45,720 --> 01:10:46,471 makes this happen. 1624 01:10:46,471 --> 01:10:48,761 We know that there's something called the central limit 1625 01:10:48,761 --> 01:10:50,720 theorem in the background that says that things 1626 01:10:50,720 --> 01:10:53,385 tend to be Gaussian, but this is really a matter of convenience. 1627 01:10:53,385 --> 01:10:55,090 Actually this is, if you think about it, 1628 01:10:55,090 --> 01:10:57,381 this is terrible because this puts non-zero probability 1629 01:10:57,381 --> 01:10:58,260 on negative scores. 1630 01:10:58,260 --> 01:11:00,711 I'm definitely not going to get a negative score. 1631 01:11:00,711 --> 01:11:02,460 But, you know, it's good enough because, you 1632 01:11:02,460 --> 01:11:05,310 know, the probability is non-zero but it's probably 10 1633 01:11:05,310 --> 01:11:06,240 to the minus 12. 1634 01:11:06,240 --> 01:11:10,480 So I would be very unlucky to see a negative score. 1635 01:11:10,480 --> 01:11:24,740 So here's the list of grades, so I have 65, 41, 70, 90, 58, 82, 1636 01:11:24,740 --> 01:11:28,860 76, 78-- 1637 01:11:28,860 --> 01:11:35,660 maybe I should have done it with 8 --59, 59-- 1638 01:11:35,660 --> 01:11:47,360 sitting next to each other --84, 89, 134, 51, and 72. 1639 01:11:47,360 --> 01:11:51,440 So those are the scores that I got. 1640 01:11:51,440 --> 01:11:53,930 There were clearly some bonus points over there. 1641 01:11:53,930 --> 01:12:05,134 And the question is, find an estimator for mu.
1642 01:12:05,134 --> 01:12:06,300 What is my estimator for mu? 1643 01:12:09,120 --> 01:12:11,070 Well, an estimator, again, is something that 1644 01:12:11,070 --> 01:12:12,510 depends on the random variables. 1645 01:12:12,510 --> 01:12:15,180 All right, so mu is the expectation, right? 1646 01:12:15,180 --> 01:12:22,510 So a good estimator is definitely the average score, 1647 01:12:22,510 --> 01:12:24,460 just like we had the average of the Ri's. 1648 01:12:24,460 --> 01:12:28,210 Now the xi's no longer need to be 0's and 1's, so it's not 1649 01:12:28,210 --> 01:12:31,060 going to boil down to being the number of 1's divided 1650 01:12:31,060 --> 01:12:32,880 by the total number. 1651 01:12:32,880 --> 01:12:41,070 Now if I'm looking for an estimate, 1652 01:12:41,070 --> 01:12:43,680 well, I need to actually sum those numbers 1653 01:12:43,680 --> 01:12:45,630 and divide them by 15. 1654 01:12:45,630 --> 01:12:47,472 So my estimate is going to be 1/15 times-- 1655 01:12:47,472 --> 01:12:49,430 then I'm going to start summing those numbers-- 1656 01:12:49,430 --> 01:12:51,430 65 plus all the way to 72. 1657 01:12:54,320 --> 01:13:06,340 OK, and I can do it, and it's 67.5. 1658 01:13:06,340 --> 01:13:08,810 This is my estimate. 1659 01:13:08,810 --> 01:13:13,810 Now if I want to compute a standard deviation-- 1660 01:13:13,810 --> 01:13:18,690 so let's say estimate for sigma. 1661 01:13:21,889 --> 01:13:23,180 You've seen that before, right? 1662 01:13:23,180 --> 01:13:24,690 An estimate for sigma is what? 1663 01:13:24,690 --> 01:13:27,150 An estimate for sigma, we'll see methods to do this, 1664 01:13:27,150 --> 01:13:31,360 but sigma squared is the variance, 1665 01:13:31,360 --> 01:13:35,941 or is the expectation of x minus the expectation of x, squared. 1666 01:13:38,980 --> 01:13:40,480 And the problem is that I don't know 1667 01:13:40,480 --> 01:13:42,130 what those expectations are. 1668 01:13:42,130 --> 01:13:47,260 And so I'm going to do what 99.9% of statistics is.
1669 01:13:47,260 --> 01:13:49,710 And what is statistics about? 1670 01:13:49,710 --> 01:13:51,350 What's my motto? 1671 01:13:51,350 --> 01:13:54,590 Statistics is about replacing expectations with averages. 1672 01:13:54,590 --> 01:13:57,020 That's what all of statistics is about. 1673 01:13:57,020 --> 01:14:00,740 There's 300 pages in a purple book called All of Statistics 1674 01:14:00,740 --> 01:14:01,940 that tells you this. 1675 01:14:01,940 --> 01:14:03,740 All right, and then you do something fancy. 1676 01:14:03,740 --> 01:14:05,570 Maybe you minimize something after you 1677 01:14:05,570 --> 01:14:07,100 replace the expectation. 1678 01:14:07,100 --> 01:14:08,780 Maybe you need to plug in other stuff. 1679 01:14:08,780 --> 01:14:10,696 But really, every time you see an expectation, 1680 01:14:10,696 --> 01:14:12,110 you replace it by an average. 1681 01:14:12,110 --> 01:14:13,470 OK let's do this. 1682 01:14:13,470 --> 01:14:16,550 So sigma squared hat will be what? 1683 01:14:16,550 --> 01:14:20,420 It's going to be 1 over n, sum from i equals 1 to n 1684 01:14:20,420 --> 01:14:22,790 of xi minus-- 1685 01:14:22,790 --> 01:14:25,580 well, here I need to replace my expectation by an average, 1686 01:14:25,580 --> 01:14:27,230 which is really this average. 1687 01:14:27,230 --> 01:14:31,600 I'm going to call it mu hat squared. 1688 01:14:31,600 --> 01:14:34,310 There, you have replaced my expectation with average. 1689 01:14:34,310 --> 01:14:38,150 OK so the golden thing is, take your expectation 1690 01:14:38,150 --> 01:14:39,460 and replace it with this. 1691 01:14:45,580 --> 01:14:49,240 Frame it, get a tattoo, I don't care but that's what it is. 1692 01:14:49,240 --> 01:14:53,600 If you remember one thing from this class, that's what it is. 
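The recipe of replacing expectations with averages can be sketched as follows. The function name `plug_in_variance` and the sample `xs` are made up for illustration (these are not the grades from the lecture), and the sketch uses the 1 over n version written on the board.

```python
import math

def plug_in_variance(xs):
    """Replace each expectation in E[(X - E[X])^2] by an average over the sample."""
    n = len(xs)
    mu_hat = sum(xs) / n                              # average in place of E[X]
    return sum((x - mu_hat) ** 2 for x in xs) / n     # average in place of the outer E

# Made-up numbers for illustration only.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sigma2_hat = plug_in_variance(xs)   # estimate of sigma squared
sigma_hat = math.sqrt(sigma2_hat)   # estimate of sigma
print(sigma2_hat, sigma_hat)        # 4.0 2.0
```

Both expectations in the definition of the variance are handled the same way: the inner one becomes mu hat, and the outer one becomes the average of the squared deviations.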
1693 01:14:53,600 --> 01:14:56,690 Now you can be fancy, if you look at your calculator, 1694 01:14:56,690 --> 01:14:59,660 it's going to put an n minus 1 here because it 1695 01:14:59,660 --> 01:15:00,535 wants to be unbiased. 1696 01:15:00,535 --> 01:15:02,410 And those are things we are going to come to. 1697 01:15:02,410 --> 01:15:04,400 But let's say right now we stick to this. 1698 01:15:04,400 --> 01:15:06,440 And then when I plug in my numbers, 1699 01:15:06,440 --> 01:15:14,040 I'm going to get an estimate for sigma, 1700 01:15:14,040 --> 01:15:17,210 which is the square root of the estimator 1701 01:15:17,210 --> 01:15:18,810 once I plug in the numbers. 1702 01:15:18,810 --> 01:15:21,622 And you can check that the number you get will be 18. 1703 01:15:27,410 --> 01:15:31,060 So those are basic things and if you've taken any AP stats 1704 01:15:31,060 --> 01:15:32,810 this should be completely standard to you. 1705 01:15:35,800 --> 01:15:39,350 Now I have another list, and I don't have time to see it. 1706 01:15:42,029 --> 01:15:43,070 It doesn't really matter. 1707 01:15:49,264 --> 01:15:50,430 OK, we'll do that next time. 1708 01:15:50,430 --> 01:15:51,200 This is fine. 1709 01:15:51,200 --> 01:15:55,241 We'll see another list of numbers and see-- 1710 01:15:55,241 --> 01:15:57,240 we're going to think about modeling assumptions. 1711 01:15:57,240 --> 01:15:59,610 The goal of this exercise is not to compute those things, 1712 01:15:59,610 --> 01:16:01,610 it's really to think about modeling assumptions. 1713 01:16:01,610 --> 01:16:04,002 Is it reasonable to think that things are IID?
1714 01:16:04,002 --> 01:16:05,460 Is it reasonable to think that they 1715 01:16:05,460 --> 01:16:07,751 have all the same parameters, that they're independent, 1716 01:16:07,751 --> 01:16:08,880 et cetera, 1717 01:16:08,880 --> 01:16:16,850 OK so one thing that I wanted to add is, probably by tonight, 1718 01:16:16,850 --> 01:16:18,810 so I will try to use-- 1719 01:16:18,810 --> 01:16:20,625 in the spirit of-- 1720 01:16:20,625 --> 01:16:22,250 I don't know what's starting to happen. 1721 01:16:22,250 --> 01:16:26,190 In the spirit of using my iPad and fancy things, 1722 01:16:26,190 --> 01:16:29,250 I will try to post some videos of-- for in particular, 1723 01:16:29,250 --> 01:16:33,840 who has never used a statistical table to read, say, 1724 01:16:33,840 --> 01:16:37,860 the quantiles of a Gaussian distribution? 1725 01:16:37,860 --> 01:16:39,480 OK, so there's several of you. 1726 01:16:39,480 --> 01:16:42,600 This is a simple but boring exercise. 1727 01:16:42,600 --> 01:16:44,850 I will just post a video on how to do this, 1728 01:16:44,850 --> 01:16:46,785 and you will be able to find it on Stellar. 1729 01:16:46,785 --> 01:16:48,660 It's going to take five minutes, and then you 1730 01:16:48,660 --> 01:16:50,770 will know everything there is to know about those things 1731 01:16:50,770 --> 01:16:53,154 but that's something you need for the first problem set. 1732 01:16:53,154 --> 01:16:54,570 By the way, so the problem set has 1733 01:16:54,570 --> 01:16:57,180 30 exercises in probability. 1734 01:16:57,180 --> 01:16:59,510 You need to do 15. 1735 01:16:59,510 --> 01:17:01,470 And you only need to turn in 15. 1736 01:17:01,470 --> 01:17:03,760 You can turn in all of 30 if you want. 1737 01:17:03,760 --> 01:17:07,230 But you need to know, by the time we hit those things, 1738 01:17:07,230 --> 01:17:08,760 you need to know-- 1739 01:17:08,760 --> 01:17:11,370 well actually, by next week you need to know what's in there. 
1740 01:17:11,370 --> 01:17:13,970 So if you don't have time to do all the homework, 1741 01:17:13,970 --> 01:17:15,720 and then go back to your probability class 1742 01:17:15,720 --> 01:17:19,250 to figure out how to do it, just do 15 easy ones that you can do. 1743 01:17:19,250 --> 01:17:20,250 And return those things. 1744 01:17:20,250 --> 01:17:21,827 But go back to your probability class 1745 01:17:21,827 --> 01:17:23,910 and make sure that you know how to do all of them. 1746 01:17:23,910 --> 01:17:25,650 Those are pretty basic questions, 1747 01:17:25,650 --> 01:17:28,290 and those are things that I'm not going to slow down on. 1748 01:17:28,290 --> 01:17:30,900 So you need to remember that the expectation of the product 1749 01:17:30,900 --> 01:17:32,316 of independent random variables is 1750 01:17:32,316 --> 01:17:34,560 the product of the expectations. 1751 01:17:34,560 --> 01:17:36,810 Expectation of the sum is the sum of the expectations. 1752 01:17:36,810 --> 01:17:38,775 This kind of thing, which is a little silly, 1753 01:17:38,775 --> 01:17:40,690 but it just requires you to practice. 1754 01:17:40,690 --> 01:17:42,150 So, just have fun. 1755 01:17:42,150 --> 01:17:43,440 Those are simple exercises. 1756 01:17:43,440 --> 01:17:46,620 You will have fun remembering your probability class. 1757 01:17:46,620 --> 01:17:49,620 All right, so I'll see you on Tuesday-- 1758 01:17:49,620 --> 01:17:51,410 or Monday.