1 00:00:00 --> 00:00:00 2 00:00:00 --> 00:00:02 The following content is provided under a 3 00:00:02 --> 00:00:03 Creative Commons license. 4 00:00:03 --> 00:00:06 Your support will help MIT OpenCourseWare continue to 5 00:00:06 --> 00:00:10 offer high quality educational resources for free. 6 00:00:10 --> 00:00:13 To make a donation or view additional materials from 7 00:00:13 --> 00:00:17 hundreds of MIT courses, visit MIT OpenCourseWare 8 00:00:17 --> 00:00:19 at ocw.mit.edu. 9 00:00:19 --> 00:00:26 PROFESSOR: So, if you remember, just before the break, as long 10 00:00:26 --> 00:00:31 ago as it was, we had looked at the problem of fitting 11 00:00:31 --> 00:00:35 curves to data. 12 00:00:35 --> 00:00:42 And the example we had seen, is that it's often possible, in 13 00:00:42 --> 00:00:49 fact, usually possible, to find a good fit to old values. 14 00:00:49 --> 00:00:51 What we looked at was, we looked at a small number of 15 00:00:51 --> 00:00:54 points, we took a high degree polynomial, sure enough, 16 00:00:54 --> 00:00:57 we got a great fit. 17 00:00:57 --> 00:01:02 The difficulty was, a great fit to old values does 18 00:01:02 --> 00:01:24 not necessarily imply a good fit to new values. 19 00:01:24 --> 00:01:31 And in general, that's somewhat worrisome. 20 00:01:31 --> 00:01:34 So now I want to spend a little bit of time looking at some 21 00:01:34 --> 00:01:39 tools, that we can use to better understand the notion 22 00:01:39 --> 00:01:45 of, when we have a bunch of points, what do they look like? 23 00:01:45 --> 00:01:47 How does the variation work? 24 00:01:47 --> 00:01:51 This gets back to a concept that we've used a number of 25 00:01:51 --> 00:01:59 times, which is a notion of a distribution.
26 00:01:59 --> 00:02:04 Remember, the whole logic behind our idea of using 27 00:02:04 --> 00:02:11 simulation, or polling, or any kind of statistical technique, 28 00:02:11 --> 00:02:16 is the assumption that the values we would draw were 29 00:02:16 --> 00:02:21 representative of the values of the larger population. 30 00:02:21 --> 00:02:25 We're sampling some subset of the population, and we're 31 00:02:25 --> 00:02:29 assuming that that sample is representative of the 32 00:02:29 --> 00:02:31 greater population. 33 00:02:31 --> 00:02:33 We talked about several different issues 34 00:02:33 --> 00:02:36 related to that. 35 00:02:36 --> 00:02:40 I now want to look at that a little bit more formally. 36 00:02:40 --> 00:02:47 And we'll start with the very old problem of rolling dice. 37 00:02:47 --> 00:02:50 I presume you've all seen what a pair of dice 38 00:02:50 --> 00:02:51 look like, right? 39 00:02:51 --> 00:02:54 They've got the numbers 1 through 6 on them, you roll 40 00:02:54 --> 00:02:57 them and something comes up. 41 00:02:57 --> 00:03:01 If you haven't seen it, if you look at the very back, at the 42 00:03:01 --> 00:03:04 back page of the handout today, you'll see a picture 43 00:03:04 --> 00:03:06 of a very old die. 44 00:03:06 --> 00:03:11 Some time from the fourth to the second century BC. 45 00:03:11 --> 00:03:15 Looks remarkably like a modern dice, except it's not made 46 00:03:15 --> 00:03:18 out of plastic, it's made out of bones. 47 00:03:18 --> 00:03:20 And in fact, if you were interested in the history of 48 00:03:20 --> 00:03:23 gambling, or if you happen to play with dice, people 49 00:03:23 --> 00:03:25 do call them bones. 50 00:03:25 --> 00:03:28 And that just dates back to the fact that the original 51 00:03:28 --> 00:03:30 ones were made that way. 
52 00:03:30 --> 00:03:35 And in fact, what we'll see is, that in the history of 53 00:03:35 --> 00:03:39 probability and statistics, an awful lot of the math that we 54 00:03:39 --> 00:03:42 take for granted today, came from people's attempts to 55 00:03:42 --> 00:03:46 understand various games of chance. 56 00:03:46 --> 00:03:50 So, let's look at it. 57 00:03:50 --> 00:03:53 So we'll look at this program. 58 00:03:53 --> 00:04:05 You should have this in the front of the handout. 59 00:04:05 --> 00:04:08 So I'm going to start with a fair dice. 60 00:04:08 --> 00:04:11 That is to say, when you roll it, it's equally probable that 61 00:04:11 --> 00:04:18 you get 1, 2, 3, 4, 5, or 6. 62 00:04:18 --> 00:04:20 And I'm going to throw a pair. 63 00:04:20 --> 00:04:23 You can see it's very simple. 64 00:04:23 --> 00:04:29 I'll take d 1, first die is random dot choice from vals 65 00:04:29 --> 00:04:34 1. d 2 will be random dot choice from vals 2. 66 00:04:34 --> 00:04:40 So I'm going to pass it in two sets of possible values, and 67 00:04:40 --> 00:04:45 randomly choose one or the other, and then return them. 68 00:04:45 --> 00:04:49 And the way I'll conduct a trial is, I'll take some 69 00:04:49 --> 00:04:53 number of throws, and two different kinds of dice. 70 00:04:53 --> 00:04:58 Throws will be the empty set, actually, yeah. 71 00:04:58 --> 00:05:01 And then I'll just do it. 72 00:05:01 --> 00:05:06 For i in range number of throws, d 1, d 2 is equal to 73 00:05:06 --> 00:05:08 throw a pair, and then I'll append it, and then 74 00:05:08 --> 00:05:12 I'll return it. 75 00:05:12 --> 00:05:14 Very simple, right? 76 00:05:14 --> 00:05:19 Could hardly imagine a simpler little program. 77 00:05:19 --> 00:05:21 And then, we'll analyze it. 78 00:05:21 --> 00:05:22 And we're going to analyze it. 79 00:05:22 --> 00:05:25 Well, first let's analyze it one way, and then we'll look at 80 00:05:25 --> 00:05:41 something slightly different. 
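The program described here can be sketched roughly as follows. This is a reconstruction from the spoken description, not the handout's actual code: the names `throw_pair`, `conduct_trials`, and `tally_sums` are assumptions, as is the choice to store the total of each throw rather than the pair itself.

```python
import random
from collections import Counter

# A fair die: each face is equally likely to be chosen.
FAIR_DIE = (1, 2, 3, 4, 5, 6)


def throw_pair(vals1, vals2):
    """Throw two dice, each described by the sequence of its face values."""
    d1 = random.choice(vals1)
    d2 = random.choice(vals2)
    return d1, d2


def conduct_trials(num_throws, die1, die2):
    """Return a list holding the total of each of num_throws throws."""
    throws = []
    for _ in range(num_throws):
        d1, d2 = throw_pair(die1, die2)
        throws.append(d1 + d2)
    return throws


def tally_sums(throws):
    """Count how often each total came up (a text stand-in for a histogram)."""
    return Counter(throws)


if __name__ == "__main__":
    counts = tally_sums(conduct_trials(100_000, FAIR_DIE, FAIR_DIE))
    for total in sorted(counts):
        print(total, counts[total])
```

Passing the face values in as arguments, as the lecture does, is what later makes it easy to swap in a different die without touching the rest of the program.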
81 00:05:41 --> 00:05:43 I'm going to conduct some number of trials 82 00:05:43 --> 00:05:46 with two fair die. 83 00:05:46 --> 00:05:50 Then I'm going to make a histogram, because I happen 84 00:05:50 --> 00:05:55 to know that there are only 11 possible values, 85 00:05:55 --> 00:06:01 I'll make 11 bins. 86 00:06:01 --> 00:06:03 You may not have seen this locution here, 87 00:06:03 --> 00:06:06 Pylab dot x ticks. 88 00:06:06 --> 00:06:10 That's telling it where to put the markers on the x-axis, 89 00:06:10 --> 00:06:12 and what they should be. 90 00:06:12 --> 00:06:16 In this case 2 through 12, and then I'll label it. 91 00:06:16 --> 00:06:29 So let's run this program. 92 00:06:29 --> 00:06:34 And here we see the distribution of values. 93 00:06:34 --> 00:06:38 So we see that I get more 7s than anything else, 94 00:06:38 --> 00:06:42 and fewer 2s and 12s. 95 00:06:42 --> 00:06:47 Snake eyes and boxcars to you gamblers. 96 00:06:47 --> 00:06:51 And it's a beautiful distribution, in some sense. 97 00:06:51 --> 00:06:53 I ran it enough trials. 98 00:06:53 --> 00:07:01 This kind of distribution is called normal. 99 00:07:01 --> 00:07:07 Also sometimes called Gaussian, after the mathematician Gauss. 100 00:07:07 --> 00:07:11 Sometimes called the bell curve, because in 101 00:07:11 --> 00:07:16 someone's imagination it looks like a bell. 102 00:07:16 --> 00:07:22 We see these things all the time. 103 00:07:22 --> 00:07:27 They're called normal, or sometimes even natural, because 104 00:07:27 --> 00:07:30 it's probably the most commonly observed probability 105 00:07:30 --> 00:07:35 distribution in nature. 106 00:07:35 --> 00:07:38 First documented, although I'm sure not first 107 00:07:38 --> 00:07:44 seen, by deMoivre and Laplace in the 1700s. 108 00:07:44 --> 00:07:48 And then in the 1800s, Gauss used it to analyze 109 00:07:48 --> 00:07:51 astronomical data. 
110 00:07:51 --> 00:07:53 And it got to be called, in that case, the 111 00:07:53 --> 00:07:57 Gaussian distribution. 112 00:07:57 --> 00:07:58 So where do we see it occur? 113 00:07:58 --> 00:08:00 We see it occurring all over the place. 114 00:08:00 --> 00:08:04 We certainly see it rolling dice. 115 00:08:04 --> 00:08:09 We see it occur in things like the distribution of heights. 116 00:08:09 --> 00:08:12 If we were to take the height of all the students at MIT and 117 00:08:12 --> 00:08:17 plot the distribution, I would be astonished if it didn't 118 00:08:17 --> 00:08:18 look more or less like that. 119 00:08:18 --> 00:08:22 It would be a normal distribution. 120 00:08:22 --> 00:08:24 A lot of things in the same height. 121 00:08:24 --> 00:08:26 Now presumably, we'd have to round off to the nearest 122 00:08:26 --> 00:08:29 millimeter or something. 123 00:08:29 --> 00:08:35 And a few really tall people, and a few really short people. 124 00:08:35 --> 00:08:38 It's just astonishing in nature how often we 125 00:08:38 --> 00:08:40 look at these things. 126 00:08:40 --> 00:08:44 The graph looks exactly like that, or similar to that. 127 00:08:44 --> 00:08:47 The shape is roughly that. 128 00:08:47 --> 00:08:53 The normal distribution can be described, interestingly 129 00:08:53 --> 00:08:57 enough, with just two numbers. 130 00:08:57 --> 00:09:13 The mean and the standard deviation. 131 00:09:13 --> 00:09:16 So if I give you those two numbers, you can 132 00:09:16 --> 00:09:18 draw that curve. 133 00:09:18 --> 00:09:21 Now you might not be able to label, you couldn't label the 134 00:09:21 --> 00:09:25 axes, because how would you know how many trials 135 00:09:25 --> 00:09:26 I did, right? 136 00:09:26 --> 00:09:29 Whether I did 100, or 1,000 or a million, but the shape 137 00:09:29 --> 00:09:34 would always be the same. 
138 00:09:34 --> 00:09:39 And if I were to, instead of doing, 100,000 throws of the 139 00:09:39 --> 00:09:44 dice, as I did here, I did a million, the label on the 140 00:09:44 --> 00:09:47 y-axis would change, but the shape would be 141 00:09:47 --> 00:09:51 absolutely identical. 142 00:09:51 --> 00:09:54 This is what's called a stable distribution. 143 00:09:54 --> 00:10:14 As you change the scale, the shape doesn't change. 144 00:10:14 --> 00:10:19 So the mean tells us where it's centered, and the standard 145 00:10:19 --> 00:10:23 deviation, basically, is a measure of statistical 146 00:10:23 --> 00:10:42 dispersion. 147 00:10:42 --> 00:10:48 It tells us how widely spread the points of the data set are. 148 00:10:48 --> 00:10:52 If many points are going to be very close to the mean, then 149 00:10:52 --> 00:10:57 the standard deviation is what, big or small? 150 00:10:57 --> 00:10:59 Pardon? 151 00:10:59 --> 00:11:04 Small. 152 00:11:04 --> 00:11:09 If they're spread out, and it's kind of a flat bell, then the 153 00:11:09 --> 00:11:13 standard deviation will be large. 154 00:11:13 --> 00:11:16 And I'm sure you've all seen standard deviations. 155 00:11:16 --> 00:11:19 We give exams, and we say here's the mean, here's 156 00:11:19 --> 00:11:22 the standard deviation. 157 00:11:22 --> 00:11:25 And the notion is, that's trying to tell you what the 158 00:11:25 --> 00:11:29 average score was, and how spread out they are. 159 00:11:29 --> 00:11:35 Now as it happens, rarely do we have exams that actually 160 00:11:35 --> 00:11:37 fall on a bell curve. 161 00:11:37 --> 00:11:38 Like this. 162 00:11:38 --> 00:11:44 So in a way, don't be deceived by thinking that we're really 163 00:11:44 --> 00:11:48 giving you a good measure of the dispersion, in the 164 00:11:48 --> 00:11:52 sense, that we would get with the bell curve. 
165 00:11:52 --> 00:11:58 So the standard deviation does have a formal value, usually 166 00:11:58 --> 00:12:11 written sigma, and it's the square root of the expectation of x squared minus 167 00:12:11 --> 00:12:19 the expectation of x, and then I take all of this, the expectation 168 00:12:19 --> 00:12:26 of x, right, squared. 169 00:12:26 --> 00:12:29 So don't worry much about this, but what I'm basically 170 00:12:29 --> 00:12:35 doing is, x is all of the values I have. 171 00:12:35 --> 00:12:42 And I can square each of the values and average those, and then I subtract 172 00:12:42 --> 00:12:50 from that the average of the values, squaring that, and take the square root. 173 00:12:50 --> 00:12:56 What's more important than this formula, for most uses, is what 174 00:12:56 --> 00:13:03 people think of as the -- why didn't I write it down 175 00:13:03 --> 00:13:07 -- this is interesting. 176 00:13:07 --> 00:13:10 There is a some number -- I see why, I did write it down, it 177 00:13:10 --> 00:13:12 just got printed on two-sided. 178 00:13:12 --> 00:13:26 Is the Empirical Rule. 179 00:13:26 --> 00:13:35 And this applies for normal distributions. 180 00:13:35 --> 00:13:38 So anyone know how much of the data that you should expect to 181 00:13:38 --> 00:13:44 fall within one standard deviation of the mean? 182 00:13:44 --> 00:13:49 68. 183 00:13:49 --> 00:13:59 So 68% within one, 95% of the data falls within two, 184 00:13:59 --> 00:14:06 and almost all of the data within three. 185 00:14:06 --> 00:14:11 These values are approximations, by the way. 186 00:14:11 --> 00:14:20 So, this is, really 95% falls within 1.96 standard 187 00:14:20 --> 00:14:23 deviations, it's not two. 188 00:14:23 --> 00:14:28 But this gives you a sense of how spread out it is. 189 00:14:28 --> 00:14:33 And again, this applies only for a normal distribution.
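As a minimal sketch of the two ideas above, the code below computes a standard deviation as the square root of the mean of the squares minus the square of the mean, and then checks the Empirical Rule on normally distributed samples. The function names `std_dev` and `fraction_within` are illustrative assumptions, and the samples come from the standard library's `random.gauss` rather than from any data used in the lecture.

```python
import math
import random


def std_dev(values):
    """Population standard deviation: sqrt(E[x^2] - (E[x])^2)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(v * v for v in values) / n
    # max(..., 0.0) guards against a tiny negative value caused by rounding.
    return math.sqrt(max(mean_of_squares - mean * mean, 0.0))


def fraction_within(values, num_std):
    """Fraction of values lying within num_std standard deviations of the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = std_dev(values)
    inside = sum(1 for v in values if abs(v - mean) <= num_std * sd)
    return inside / n


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(100_000)]
    for k in (1, 2, 3):
        print(k, "standard deviation(s):", round(fraction_within(sample, k), 3))
```

Running `fraction_within` on data that is not normally distributed (for example, uniform values from `random.random`) is a quick way to see that the 68/95 figures need not hold there.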
190 00:14:33 --> 00:14:37 If you compute the standard deviation this way, and apply 191 00:14:37 --> 00:14:40 it to something other than a normal distribution, there's 192 00:14:40 --> 00:14:52 no reason to expect that Empirical Rule will hold. 193 00:14:52 --> 00:14:55 OK people with me on this? 194 00:14:55 --> 00:14:58 It's amazing to me how many people in society talk about 195 00:14:58 --> 00:15:02 standard deviations without actually knowing what they are. 196 00:15:02 --> 00:15:05 And there's another way to look at the same data, 197 00:15:05 --> 00:15:07 or almost the same data. 198 00:15:07 --> 00:15:22 Since it's a random experiment, it won't be exactly the same. 199 00:15:22 --> 00:15:28 So as before, we had the distribution, and fortunately 200 00:15:28 --> 00:15:30 it looks pretty much like the last one. 201 00:15:30 --> 00:15:32 We would've expected that. 202 00:15:32 --> 00:15:35 And I've now done something, another way of looking at the 203 00:15:35 --> 00:15:41 same information, really, is, I printed, I plotted, 204 00:15:41 --> 00:15:46 the probabilities of different values. 205 00:15:46 --> 00:15:50 So we can see here that the probability of getting 206 00:15:50 --> 00:15:59 a 7 is about 0.17 or something like that. 207 00:15:59 --> 00:16:03 Now, since I threw 100,000 die, it's not surprising that the 208 00:16:03 --> 00:16:11 probability of 0.17 looks about the same as 17,000 over here. 209 00:16:11 --> 00:16:15 But it's just a different way of looking at things. 210 00:16:15 --> 00:16:21 Right, had I thrown some different number, it might 211 00:16:21 --> 00:16:23 have been harder to visualize what the probability 212 00:16:23 --> 00:16:26 distribution looked like. 213 00:16:26 --> 00:16:28 But we often do talk about that. 214 00:16:28 --> 00:16:35 How probable is a certain value? 215 00:16:35 --> 00:16:39 People who design games of chance, by the way, something 216 00:16:39 --> 00:16:39 I've been meaning to say. 
217 00:16:39 --> 00:16:43 You'll notice down here there's just, when we want to save 218 00:16:43 --> 00:16:46 these things, there's this little icon that's a floppy 219 00:16:46 --> 00:16:48 disk, to indicate store. 220 00:16:48 --> 00:16:52 And I thought maybe many of you'd never seen a floppy disk, 221 00:16:52 --> 00:16:53 so I decided to bring one in. 222 00:16:53 --> 00:16:55 You've seen the icons. 223 00:16:55 --> 00:16:58 And probably by the time most of you got, any you ever 224 00:16:58 --> 00:17:01 had a machine with a quote floppy drive? 225 00:17:01 --> 00:17:04 Did it actually flop the disk? 226 00:17:04 --> 00:17:07 No, they were pretty rigid, but the old floppy disks 227 00:17:07 --> 00:17:09 were really floppy. 228 00:17:09 --> 00:17:12 Hence they got the name. 229 00:17:12 --> 00:17:16 And, you know it's kind of like a giant size version. 230 00:17:16 --> 00:17:19 And it's amazing how people will probably continue to 231 00:17:19 --> 00:17:22 talk about floppy disks as long as they talk about 232 00:17:22 --> 00:17:24 dialing a telephone. 233 00:17:24 --> 00:17:27 And probably none of you've ever actually dialed a phone, 234 00:17:27 --> 00:17:29 for that matter, just pushed buttons . 235 00:17:29 --> 00:17:33 But they used to have dials that you would twirl. 236 00:17:33 --> 00:17:34 Anyway, I just thought everyone should at least 237 00:17:34 --> 00:17:37 see a floppy disk once. 238 00:17:37 --> 00:17:40 This is, by the way, a very good way to get data security. 239 00:17:40 --> 00:17:42 There's probably no way in the world to read the information 240 00:17:42 --> 00:17:48 on this disk anymore. 241 00:17:48 --> 00:17:54 All right, as I said people who design games of 242 00:17:54 --> 00:17:58 chance understand these probabilities very well. 243 00:17:58 --> 00:18:02 So I'm gonna now look at, show how we can understand these 244 00:18:02 --> 00:18:06 things in some other ways of popular example. 245 00:18:06 --> 00:18:09 A game of dice. 
246 00:18:09 --> 00:18:13 Has anyone here ever played the game called craps? 247 00:18:13 --> 00:18:16 Did you win or lose money? 248 00:18:16 --> 00:18:18 You won. 249 00:18:18 --> 00:18:21 All right, you beat the odds. 250 00:18:21 --> 00:18:23 Well, it's a very popular game, and I'm going 251 00:18:23 --> 00:18:25 to explain it to you. 252 00:18:25 --> 00:18:28 As you will see, this is not an endorsement of gambling, 253 00:18:28 --> 00:18:30 because one of the things you will notice is, you are likely 254 00:18:30 --> 00:18:32 to lose money if you do this. 255 00:18:32 --> 00:18:35 So I tend not, I don't do it. 256 00:18:35 --> 00:18:45 All right, so how does a game of craps work? 257 00:18:45 --> 00:18:50 You start by rolling two dice. 258 00:18:50 --> 00:18:55 If you get a 7 or an 11, the roller, we'll call that 259 00:18:55 --> 00:19:01 the shooter, you win. 260 00:19:01 --> 00:19:08 If you get a 2, 3, or a 12, you lose. 261 00:19:08 --> 00:19:11 I'm assuming here, you're betting what's called 262 00:19:11 --> 00:19:12 the pass line. 263 00:19:12 --> 00:19:14 There are different ways to bet, this is the most 264 00:19:14 --> 00:19:17 common way to bet, we'll just deal with that. 265 00:19:17 --> 00:19:27 If it's not any of these, what you get is, otherwise, the 266 00:19:27 --> 00:19:37 number becomes what's called the point. 267 00:19:37 --> 00:19:41 Once you've got the point, you keep rolling the dice until 268 00:19:41 --> 00:19:44 1 of 2 things happens. 269 00:19:44 --> 00:19:56 You get a 7, in which case you lose, or you get the point, 270 00:19:56 --> 00:20:04 in which case you win. 271 00:20:04 --> 00:20:07 So it's a pretty simple game. 272 00:20:07 --> 00:20:12 Very popular game. 273 00:20:12 --> 00:20:14 So I've implemented.
274 00:20:14 --> 00:20:18 So one of the interesting things about this is, if you 275 00:20:18 --> 00:20:22 try and actually figure out what the probabilities are 276 00:20:22 --> 00:20:25 using pencil and paper, you can, but it gets a 277 00:20:25 --> 00:20:33 little bit involved. 278 00:20:33 --> 00:20:35 Gets involved because you have to, all right, what are the 279 00:20:35 --> 00:20:38 odds of winning or losing on the first throw? 280 00:20:38 --> 00:20:41 Well, you can compute those pretty easily, and you can see 281 00:20:41 --> 00:20:43 that you'd actually win more than you lose, on 282 00:20:43 --> 00:20:44 the first throw. 283 00:20:44 --> 00:20:48 But if you look at the distribution of 7s and 11s and 284 00:20:48 --> 00:20:51 2s, 3s, and 12s, you add them up, you'll see, well, this is 285 00:20:51 --> 00:20:53 more likely than this. 286 00:20:53 --> 00:20:55 But then you say, suppose I don't get those. 287 00:20:55 --> 00:20:58 What's the likelihood of getting each other possible 288 00:20:58 --> 00:21:02 point value, and then given that point value, what's the 289 00:21:02 --> 00:21:04 likelihood of getting that before a 7? 290 00:21:04 --> 00:21:07 And you can do it, but it gets very tedious. 291 00:21:07 --> 00:21:13 So those of us who are inclined to think computationally, and I 292 00:21:13 --> 00:21:17 hope by now that's all of you as well as me, say well, 293 00:21:17 --> 00:21:20 instead of doing the probabilities by hand, I'm just 294 00:21:20 --> 00:21:22 going to write a little program. 295 00:21:22 --> 00:21:27 And it's a program that took me maybe 10 minutes to write. 296 00:21:27 --> 00:21:34 You can see it's quite small, I did it yesterday. 297 00:21:34 --> 00:21:39 So the first function here is craps, it returns true if the 298 00:21:39 --> 00:21:43 shooter wins by betting the pass line. 299 00:21:43 --> 00:21:46 And it's just does what I said. 
300 00:21:46 --> 00:21:49 Rolls them, if the total is 7 or 11, it returns 301 00:21:49 --> 00:21:51 true, you win. 302 00:21:51 --> 00:21:54 If it's 2, 3, or 12, it returns false, you lose. 303 00:21:54 --> 00:21:58 Otherwise the point becomes the total. 304 00:21:58 --> 00:22:01 And then while true, I'll just keep rolling. 305 00:22:01 --> 00:22:07 Until either, if the total gets the point, I return true. 306 00:22:07 --> 00:22:09 Or if the total's equal 7, I return false. 307 00:22:09 --> 00:22:12 And that's it. 308 00:22:12 --> 00:22:15 So essentially I just took these rules, typed them 309 00:22:15 --> 00:22:20 down, and I had my game. 310 00:22:20 --> 00:22:24 And then I'll simulate it with some number of bets. 311 00:22:24 --> 00:22:28 Keeping track of the numbers of wins and losses. 312 00:22:28 --> 00:22:31 Just by incrementing 1 or the other, depending upon whether 313 00:22:31 --> 00:22:34 I return true or false. 314 00:22:34 --> 00:22:36 I'm going to, just to show what we do, print the 315 00:22:36 --> 00:22:39 number of wins and losses. 316 00:22:39 --> 00:22:44 And then compute, how does the house do? 317 00:22:44 --> 00:22:47 Not the gambler, but the person who's running 318 00:22:47 --> 00:22:49 the game, the casino. 319 00:22:49 --> 00:22:53 Or in other circumstances, other places. 320 00:22:53 --> 00:22:56 And then we'll see how that goes. 321 00:22:56 --> 00:23:00 And I'll try it with 100,000 games. 322 00:23:00 --> 00:23:03 Now, this is more than 100,000 rolls of the dice, right? 323 00:23:03 --> 00:23:08 Because if I don't get a 7 or 11, I keep rolling. 324 00:23:08 --> 00:23:15 So before I do it, I'll ask the easy question first. 325 00:23:15 --> 00:23:21 Who thinks the casino wins more often than the player? 326 00:23:21 --> 00:23:26 Who thinks the player wins more often than the casino? 327 00:23:26 --> 00:23:29 Well, very logical, casinos are not in the business 328 00:23:29 --> 00:23:32 of giving away money.
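The rules just described translate almost line for line into code. What follows is a sketch written from the spoken description, not the handout's program; the names `roll`, `craps`, and `sim_craps`, and the idea of passing the dice in as optional arguments, are assumptions made for illustration.

```python
import random

FAIR_DIE = (1, 2, 3, 4, 5, 6)


def roll(die1, die2):
    """Return the total of one throw of the two dice."""
    return random.choice(die1) + random.choice(die2)


def craps(die1=FAIR_DIE, die2=FAIR_DIE):
    """Return True if the shooter wins a pass-line bet, False otherwise."""
    total = roll(die1, die2)
    if total in (7, 11):
        return True   # win on the first throw
    if total in (2, 3, 12):
        return False  # lose on the first throw
    point = total
    while True:
        total = roll(die1, die2)
        if total == point:
            return True   # made the point
        if total == 7:
            return False  # a 7 came up before the point


def sim_craps(num_bets, die1=FAIR_DIE, die2=FAIR_DIE):
    """Play num_bets games and return (wins, losses) for the shooter."""
    wins = losses = 0
    for _ in range(num_bets):
        if craps(die1, die2):
            wins += 1
        else:
            losses += 1
    return wins, losses


if __name__ == "__main__":
    wins, losses = sim_craps(100_000)
    print("shooter wins:", wins, "shooter losses:", losses)
    print("house wins", round(100 * losses / (wins + losses), 3), "% of bets")
```

Because each run is random, the printed percentages will differ a little from run to run, which is why the lecture repeats the experiment rather than trusting a single number.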
329 00:23:32 --> 00:23:34 So now the more interesting question. 330 00:23:34 --> 00:23:38 How steep do you think the odds are in the house's favor? 331 00:23:38 --> 00:23:43 Anyone want to guess? 332 00:23:43 --> 00:23:45 Actually pretty thin. 333 00:23:45 --> 00:23:57 Let's run it and see. 334 00:23:57 --> 00:24:00 So what we see is, the house wins 50, in this case 335 00:24:00 --> 00:24:04 50.424% of the time. 336 00:24:04 --> 00:24:06 Not a lot. 337 00:24:06 --> 00:24:08 On the other hand, if people bet 100,000, 338 00:24:08 --> 00:24:14 the house wins 848. 339 00:24:14 --> 00:24:23 Now, 100,000 is actually a small number. 340 00:24:23 --> 00:24:25 Let's get rid of these, should have gotten rid of these 341 00:24:25 --> 00:24:53 figures, you don't need to see them every time. 342 00:24:53 --> 00:24:58 We'll keep one figure, just for fun. 343 00:24:58 --> 00:25:04 Let's try it again. 344 00:25:04 --> 00:25:06 Probably get a little different answer. 345 00:25:06 --> 00:25:11 Considerably different, but still, less than 51% of 346 00:25:11 --> 00:25:13 the time in this trial. 347 00:25:13 --> 00:25:16 But you can see that the house is slowly but surely going to 348 00:25:16 --> 00:25:20 get rich playing this game. 349 00:25:20 --> 00:25:24 Now let's ask the other interesting question. 350 00:25:24 --> 00:25:29 Just for fun, suppose we want to cheat. 351 00:25:29 --> 00:25:32 Now, I realize none of you would ever do that. 352 00:25:32 --> 00:25:36 But let's consider using a pair of loaded dice. 353 00:25:36 --> 00:25:40 So there's a long history, well you can imagine when you looked 354 00:25:40 --> 00:25:44 at that old bone I showed you, that it wasn't exactly fair. 355 00:25:44 --> 00:25:47 That some sides were a little heavier than others, and in 356 00:25:47 --> 00:25:52 fact you didn't get, say, a 5 exactly 1/6 of the time.
357 00:25:52 --> 00:25:55 And therefore, if you were using your own dice, instead of 358 00:25:55 --> 00:25:57 somebody else's, and you knew what was most likely, 359 00:25:57 --> 00:26:00 you might do better. 360 00:26:00 --> 00:26:02 Well, the modern version of that is, people do cheat by 361 00:26:02 --> 00:26:07 putting little weights in dice, to just make tiny changes in 362 00:26:07 --> 00:26:11 the probability of one number or another coming up. 363 00:26:11 --> 00:26:14 So let's do that. 364 00:26:14 --> 00:26:16 And let's first ask the question, well, what would 365 00:26:16 --> 00:26:20 be a nice way to do that? 366 00:26:20 --> 00:26:24 It's very easy here. 367 00:26:24 --> 00:26:37 If we look at it, all I've done is, I've changed the 368 00:26:37 --> 00:26:40 distribution of values, so instead of here being 1, 2, 369 00:26:40 --> 00:26:46 3, 4, 5, and 6, it's 1, 2, 3, 4, 5, 5, and 6. 370 00:26:46 --> 00:26:51 I snuck in an extra 5 on one of the two dice. 371 00:26:51 --> 00:26:57 So this has changed the odds of rolling a 5 from 1 in 372 00:26:57 --> 00:27:04 6 to roughly 3 in 12. 373 00:27:04 --> 00:27:09 Now 1/6, which is 2/12, vs. 3/12, it's not 374 00:27:09 --> 00:27:10 a big difference. 375 00:27:10 --> 00:27:13 And you can imagine, if you were sitting there watching it, 376 00:27:13 --> 00:27:17 you wouldn't notice that 5 was coming up a little bit more 377 00:27:17 --> 00:27:19 often than you expected. 378 00:27:19 --> 00:27:21 Normally. 379 00:27:21 --> 00:27:23 Close enough that you wouldn't notice it. 380 00:27:23 --> 00:27:28 But let's see if, what difference it makes? 381 00:27:28 --> 00:27:30 What difference do you think it will make here? 382 00:27:30 --> 00:27:36 First of all, is it going to be better or worse for the player? 383 00:27:36 --> 00:27:39 Who thinks better? 384 00:27:39 --> 00:27:42 Who thinks worse? 385 00:27:42 --> 00:27:45 Who thinks they haven't a clue? 386 00:27:45 --> 00:27:48 All right, we have an honest man. 
387 00:27:48 --> 00:27:52 Where is Diogenes when we need him? 388 00:27:52 --> 00:27:55 The reward for honesty. 389 00:27:55 --> 00:27:58 I could reward you and wake him up at the same time. 390 00:27:58 --> 00:28:01 It's good. 391 00:28:01 --> 00:28:16 All right, well, let's see what happens. 392 00:28:16 --> 00:28:20 All right, so suddenly, the odds have swung in 393 00:28:20 --> 00:28:23 favor of the player. 394 00:28:23 --> 00:28:28 This tiny little change has now made it likely that the player 395 00:28:28 --> 00:28:33 wins money, instead of the house. 396 00:28:33 --> 00:28:34 So what's the point? 397 00:28:34 --> 00:28:38 The point is not, you should go out and try and cheat casinos, 398 00:28:38 --> 00:28:42 because you'll probably find an unpleasant consequence of that. 399 00:28:42 --> 00:28:47 The point is that, once I've written this simulation, I 400 00:28:47 --> 00:28:52 can play thought experiments in a very easy way. 401 00:28:52 --> 00:28:53 So-called what if games. 402 00:28:53 --> 00:28:54 What if we did this? 403 00:28:54 --> 00:28:56 What if we did that? 404 00:28:56 --> 00:28:59 And it's trivial to do those kinds of things. 405 00:28:59 --> 00:29:02 And that's one of the reasons we typically do try and 406 00:29:02 --> 00:29:04 write these simulations. 407 00:29:04 --> 00:29:09 So that we can experiment with things. 408 00:29:09 --> 00:29:11 Are there any other experiments people would like to 409 00:29:11 --> 00:29:13 perform while we're here? 410 00:29:13 --> 00:29:18 Any other sets of die you might like to try? 411 00:29:18 --> 00:29:22 All right, someone give me a suggestion of something that 412 00:29:22 --> 00:29:24 might work in the house's favor. 413 00:29:24 --> 00:29:27 Suppose a casino wanted to cheat. 414 00:29:27 --> 00:29:28 What do you think would help them out? 415 00:29:28 --> 00:29:28 Yeah? 416 00:29:28 --> 00:29:32 STUDENT: Increase prevalence of 1, instead of 5?
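What-if questions like these can also be answered exactly, which is the pencil-and-paper calculation done by machine instead of by simulation. The sketch below is an assumption-laden illustration, not part of the lecture's program: `sum_distribution` and `pass_line_win_probability` are made-up names, and a loaded die is modeled, as in the lecture, by listing a face more than once. Note that a seven-entry list such as `(1, 2, 3, 4, 5, 5, 6)` gives that die a 2-in-7 chance of showing a 5.

```python
from collections import Counter
from fractions import Fraction


def sum_distribution(die1, die2):
    """Exact probability of each total, treating every listed face as equally likely."""
    counts = Counter(a + b for a in die1 for b in die2)
    outcomes = len(die1) * len(die2)
    return {total: Fraction(count, outcomes) for total, count in counts.items()}


def pass_line_win_probability(die1, die2):
    """Exact probability that the shooter wins a pass-line bet."""
    dist = sum_distribution(die1, die2)
    zero = Fraction(0)
    p_seven = dist.get(7, zero)
    # Immediate win on 7 or 11.
    win = p_seven + dist.get(11, zero)
    for point in (4, 5, 6, 8, 9, 10):
        p_point = dist.get(point, zero)
        if p_point:
            # Once the point is set, only the point or a 7 ends the game, so
            # the chance of making the point is p_point / (p_point + p_seven).
            win += p_point * p_point / (p_point + p_seven)
    return win


if __name__ == "__main__":
    fair = (1, 2, 3, 4, 5, 6)
    experiments = {
        "two fair dice": (fair, fair),
        "one die with an extra 5": (fair, (1, 2, 3, 4, 5, 5, 6)),
        "one die with an extra 1": (fair, (1, 1, 2, 3, 4, 5, 6)),
    }
    for label, (d1, d2) in experiments.items():
        p = pass_line_win_probability(d1, d2)
        print(f"{label}: shooter wins with probability {float(p):.4f}")
```

Using `Fraction` keeps the arithmetic exact, so small differences between dice are not blurred by random variation the way they can be in a single simulation run.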
417 00:29:32 --> 00:29:34 PROFESSOR: All right, so let's see if we increase the 418 00:29:34 --> 00:29:52 probability of 1, what it does? 419 00:29:52 --> 00:29:58 Yep, clearly helped the house out, didn't it? 420 00:29:58 --> 00:30:04 So that would be a good thing for the house. 421 00:30:04 --> 00:30:08 Again, you know, three key strokes and we get to try it. 422 00:30:08 --> 00:30:15 It's really a very nice kind of thing to be able to do. 423 00:30:15 --> 00:30:19 OK, this works nicely. 424 00:30:19 --> 00:30:21 We'll get normal distributions. 425 00:30:21 --> 00:30:24 We can look at some things. 426 00:30:24 --> 00:30:26 There are two other kinds of distributions I 427 00:30:26 --> 00:30:27 want to talk about. 428 00:30:27 --> 00:30:40 We can get rid of this distraction. 429 00:30:40 --> 00:30:43 As you can imagine, I played a lot with these things, just 430 00:30:43 --> 00:30:52 cause it was fun once I had it. 431 00:30:52 --> 00:30:57 You have these in your handout. 432 00:30:57 --> 00:31:01 So the one on the upper right, is the Gaussian, 433 00:31:01 --> 00:31:07 or normal, distribution we've been talking about. 434 00:31:07 --> 00:31:12 As I said earlier, quite common, we see it a lot. 435 00:31:12 --> 00:31:17 The upper left is what's called a, and these, by the way, all 436 00:31:17 --> 00:31:20 of these distributions are symmetric, just in this 437 00:31:20 --> 00:31:23 particular picture. 438 00:31:23 --> 00:31:26 How do you spell symmetric, one or two m's? 439 00:31:26 --> 00:31:29 I help here. 440 00:31:29 --> 00:31:30 That right? 441 00:31:30 --> 00:31:32 OK, thank you. 442 00:31:32 --> 00:31:35 And they're symmetric in the sense that, if you take the 443 00:31:35 --> 00:31:40 mean, it looks the same on both sides of the mean. 444 00:31:40 --> 00:31:43 Now in general, you can have asymmetric 445 00:31:43 --> 00:31:48 distributions as well. 446 00:31:48 --> 00:31:51 But for simplicity, we'll here look at symmetric ones. 
447 00:31:51 --> 00:31:54 So we've seen the bell curve, and then on the upper left is 448 00:31:54 --> 00:32:04 what's called the uniform. 449 00:32:04 --> 00:32:09 In a uniform distribution, each value in the range 450 00:32:09 --> 00:32:15 is equally likely. 451 00:32:15 --> 00:32:18 So to characterize it, you only need to give 452 00:32:18 --> 00:32:21 the range of values. 453 00:32:21 --> 00:32:26 I say the values range from 0 to 100, and it tells me 454 00:32:26 --> 00:32:30 everything I know about the uniform distribution. 455 00:32:30 --> 00:32:35 Each value in that will occur the same number of times. 456 00:32:35 --> 00:32:39 Have we seen a uniform distribution? 457 00:32:39 --> 00:32:46 What have we seen that's uniform here? 458 00:32:46 --> 00:32:46 Pardon? 459 00:32:46 --> 00:32:48 STUDENT: Playing dice. 460 00:32:48 --> 00:32:49 PROFESSOR: Playing dice. 461 00:32:49 --> 00:32:51 Exactly right. 462 00:32:51 --> 00:32:57 Each roll of the die was equally likely. 463 00:32:57 --> 00:32:59 Between 1 and 6. 464 00:32:59 --> 00:33:03 So, we got a normal distribution when I summed 465 00:33:03 --> 00:33:07 them, but if I gave you the distribution of a single 466 00:33:07 --> 00:33:11 die, it would have been uniform, right? 467 00:33:11 --> 00:33:14 So there's an interesting lesson there. 468 00:33:14 --> 00:33:18 One die, the distribution was uniform, but when I summed 469 00:33:18 --> 00:33:27 them, I ended up getting a normal distribution. 470 00:33:27 --> 00:33:29 So where else do we see them? 471 00:33:29 --> 00:33:34 In principle, lottery winners are uniformly distributed. 472 00:33:34 --> 00:33:38 Each number is equally likely to come up. 473 00:33:38 --> 00:33:41 To a first approximation, birthdays are uniformly 474 00:33:41 --> 00:33:44 distributed, things like that. 475 00:33:44 --> 00:33:49 But, in fact, they rarely arise in nature. 
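The contrast between one die and a sum of dice can be checked by counting outcomes directly. This is a small sketch under stated assumptions: `sum_counts` and `text_histogram` are hypothetical helper names, and the text bars stand in for a plotted histogram. For two dice the exact shape is a triangle that only resembles a bell; summing more dice brings the shape closer to a normal curve.

```python
from collections import Counter
from itertools import product


def sum_counts(num_dice, faces=(1, 2, 3, 4, 5, 6)):
    """Count how many equally likely outcomes produce each total."""
    return Counter(sum(outcome) for outcome in product(faces, repeat=num_dice))


def text_histogram(counts, width=40):
    """Render the counts as rows of '#' characters, scaled to the largest count."""
    peak = max(counts.values())
    rows = []
    for total in sorted(counts):
        bar = "#" * round(width * counts[total] / peak)
        rows.append(f"{total:3d} {bar}")
    return "\n".join(rows)


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"Sum of {n} fair dice:")
        print(text_histogram(sum_counts(n)))
        print()
```

Passing a different `faces` tuple lets the same functions show how a loaded die distorts the shape.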
476 00:33:49 --> 00:33:52 You'll hardly ever run a physics experiment, or a 477 00:33:52 --> 00:33:55 biology experiment, or anything like that, and come up with 478 00:33:55 --> 00:33:58 a uniform distribution. 479 00:33:58 --> 00:34:04 Nor do they arise very often in complex systems. 480 00:34:04 --> 00:34:07 So if you look at what happens in financial markets, 481 00:34:07 --> 00:34:10 none of the interesting distributions are uniform. 482 00:34:10 --> 00:34:14 You know, the prices of stocks, for example, are clearly 483 00:34:14 --> 00:34:16 not uniformly distributed. 484 00:34:16 --> 00:34:19 Up days and down days in the stock market are not 485 00:34:19 --> 00:34:22 uniformly distributed. 486 00:34:22 --> 00:34:28 Winners of football games are not uniformly distributed. 487 00:34:28 --> 00:34:32 People seem to like to use them in games of chance, because 488 00:34:32 --> 00:34:35 they seem fair, but mostly you see them only in 489 00:34:35 --> 00:34:39 invented things, rather than real things. 490 00:34:39 --> 00:34:43 The third kind of distribution, the one in the bottom, is the 491 00:34:43 --> 00:34:55 exponential distribution. 492 00:34:55 --> 00:34:59 That's actually quite common in the real world. 493 00:34:59 --> 00:35:05 It's often used, for example, to model arrival times. 494 00:35:05 --> 00:35:08 If you want to model the frequency at which, say, 495 00:35:08 --> 00:35:16 automobiles arrive to get on the Mass Turnpike, you would find 496 00:35:16 --> 00:35:20 that the arrivals are exponential. 497 00:35:20 --> 00:35:25 What we see with an exponential is, things fall off much more 498 00:35:25 --> 00:35:34 steeply around the mean than with the normal distribution. 499 00:35:34 --> 00:35:40 All right, that make sense? 500 00:35:40 --> 00:35:43 What else is exponentially distributed? 501 00:35:43 --> 00:35:48 Requests for web pages are often exponentially 502 00:35:48 --> 00:35:49 distributed. 503 00:35:49 --> 00:35:52 The amount of traffic at a website.
504 00:35:52 --> 00:35:55 How frequently they arrive. 505 00:35:55 --> 00:35:59 We'll see much more starting next week, or maybe even 506 00:35:59 --> 00:36:03 starting Thursday, about exponential distributions, as 507 00:36:03 --> 00:36:06 we go on with a final case study that we'll be dealing 508 00:36:06 --> 00:36:09 with in the course. 509 00:36:09 --> 00:36:16 You can think of each of these, by the way, as increasing 510 00:36:16 --> 00:36:19 order of predictability. 511 00:36:19 --> 00:36:23 Uniform distribution means the result is most unpredictable, 512 00:36:23 --> 00:36:27 it could be anything. 513 00:36:27 --> 00:36:34 A normal distribution says, well, it's pretty predictable. 514 00:36:34 --> 00:36:37 Again, depending on the standard deviation. 515 00:36:37 --> 00:36:44 If you guess the mean, you're pretty close to right. 516 00:36:44 --> 00:36:47 The exponential is very predictable. 517 00:36:47 --> 00:36:55 Most of the answers are right around the mean. 518 00:36:55 --> 00:36:57 Now there are many other distributions, there are Pareto 519 00:36:57 --> 00:37:01 distributions which have fat tails, there are fractal 520 00:37:01 --> 00:37:05 distributions, there are all sorts of things. 521 00:37:05 --> 00:37:09 We won't go into those details. 522 00:37:09 --> 00:37:13 Now, I hope you didn't find this short excursion into 523 00:37:13 --> 00:37:17 statistics either too boring or too confusing. 524 00:37:17 --> 00:37:20 The point was not to teach you statistics or probability, 525 00:37:20 --> 00:37:24 we have multiple courses to do that. 526 00:37:24 --> 00:37:26 But to give you some tools that would help improve 527 00:37:26 --> 00:37:31 your intuition in thinking about data. 528 00:37:31 --> 00:37:36 In closing, I want to give a few words about 529 00:37:36 --> 00:37:38 the misuse of data. 530 00:37:38 --> 00:37:43 Since I think we misuse data an awful lot.
531 00:37:43 --> 00:37:57 So, point number 0, as in the most important, is beware of 532 00:37:57 --> 00:38:25 people who give you properties of data, but not the data. 533 00:38:25 --> 00:38:29 We see that sort of thing all the time. 534 00:38:29 --> 00:38:34 Where people come in, and they say, OK, here it is, here's the 535 00:38:34 --> 00:38:37 mean value of the quiz, and here's the standard deviation 536 00:38:37 --> 00:38:42 of the quiz, and that just doesn't really tell you where 537 00:38:42 --> 00:38:44 you stand, in some sense. 538 00:38:44 --> 00:38:47 Because it's probably not normally distributed. 539 00:38:47 --> 00:38:50 You want to see the data. 540 00:38:50 --> 00:38:53 At the very least, if you see the data, you can then say, 541 00:38:53 --> 00:38:57 yeah, it is normally distributed, so the standard 542 00:38:57 --> 00:39:01 deviation is meaningful, or not meaningful. 543 00:39:01 --> 00:39:14 So, whenever you can, try and get, at least, to see the data. 544 00:39:14 --> 00:39:18 So that's 1, or 0. 545 00:39:18 --> 00:39:24 1 is, well, all right. 546 00:39:24 --> 00:39:38 I'm going to test your Latin. 547 00:39:38 --> 00:39:41 Cum hoc ergo propter hoc. 548 00:39:41 --> 00:39:43 All right. 549 00:39:43 --> 00:39:49 I need a Latin scholar to translate this. 550 00:39:49 --> 00:39:53 Did not one of you take Latin in high school? 551 00:39:53 --> 00:39:55 We have someone who did. 552 00:39:55 --> 00:39:55 Go ahead. 553 00:39:55 --> 00:39:59 STUDENT: I think it means, with this, therefore, 554 00:39:59 --> 00:40:01 because of this. 555 00:40:01 --> 00:40:03 PROFESSOR: Exactly right. 556 00:40:03 --> 00:40:08 With this, therefore, because of this. 557 00:40:08 --> 00:40:09 I'm glad that at least one person has a 558 00:40:09 --> 00:40:12 classical education. 559 00:40:12 --> 00:40:16 I don't, by the way. 560 00:40:16 --> 00:40:21 Essentially what this is telling us, is that correlation 561 00:40:21 --> 00:40:39 does not imply causation. 
562 00:40:39 --> 00:40:42 So sometimes two things go together. 563 00:40:42 --> 00:40:45 They both go up, they both go down. 564 00:40:45 --> 00:40:47 And people jump to the conclusion that one 565 00:40:47 --> 00:40:49 causes the other. 566 00:40:49 --> 00:40:53 That there's a cause and effect relationship. 567 00:40:53 --> 00:40:57 That is just not true. 568 00:40:57 --> 00:41:01 It's what's called a logical fallacy. 569 00:41:01 --> 00:41:04 So we see some examples of this. 570 00:41:04 --> 00:41:05 And you can get into big trouble. 571 00:41:05 --> 00:41:08 So here's a very interesting one. 572 00:41:08 --> 00:41:11 There was a very widely reported epidemiological study, 573 00:41:11 --> 00:41:14 that's a medical study where you get statistics about 574 00:41:14 --> 00:41:16 large populations. 575 00:41:16 --> 00:41:20 And it showed that women, who are taking hormone replacement 576 00:41:20 --> 00:41:24 therapy, were found to have a lower incidence of coronary 577 00:41:24 --> 00:41:28 heart disease than women who didn't. 578 00:41:28 --> 00:41:32 This was a big study of a lot of women. 579 00:41:32 --> 00:41:38 This led doctors to propose that hormone replacement 580 00:41:38 --> 00:41:41 therapy for middle aged women was protective against 581 00:41:41 --> 00:41:44 coronary heart disease. 582 00:41:44 --> 00:41:49 And in fact, in response to this, a large number of medical 583 00:41:49 --> 00:41:50 societies recommended this. 584 00:41:50 --> 00:41:55 And a large number of women were given this therapy. 585 00:41:55 --> 00:42:00 Later, controlled trials showed that in fact, hormone 586 00:42:00 --> 00:42:04 replacement therapy in women caused a small and significant 587 00:42:04 --> 00:42:10 increase in coronary heart disease. 588 00:42:10 --> 00:42:13 So they had taken the fact that these were correlated, said one 589 00:42:13 --> 00:42:17 causes the other, made a prescription, and it turned 590 00:42:17 --> 00:42:20 out to be the wrong one. 
591 00:42:20 --> 00:42:25 Now, how could this be? 592 00:42:25 --> 00:42:27 How could this be? 593 00:42:27 --> 00:42:33 It turned out that the women in the original study who were 594 00:42:33 --> 00:42:39 taking the hormone replacement therapy, tended to be from a 595 00:42:39 --> 00:42:43 higher socioeconomic group than those who didn't. 596 00:42:43 --> 00:42:46 Because the therapy was not covered by insurance, so the 597 00:42:46 --> 00:42:48 women who took it were wealthy. 598 00:42:48 --> 00:42:52 Turns out wealthy people do a lot of other things that are 599 00:42:52 --> 00:42:54 protective of their hearts. 600 00:42:54 --> 00:42:57 And, therefore, are in general healthier than poor people. 601 00:42:57 --> 00:42:59 This is not a surprise. 602 00:42:59 --> 00:43:02 Rich people are healthier than poor people. 603 00:43:02 --> 00:43:09 And so in fact, it was this third variable that was 604 00:43:09 --> 00:43:13 actually the meaningful one. 605 00:43:13 --> 00:43:30 This is what is called in statistics, a lurking variable. 606 00:43:30 --> 00:43:34 Both of the things they were looking at in this study, who 607 00:43:34 --> 00:43:39 took the therapy, and who had a heart disease, each of those 608 00:43:39 --> 00:43:42 was correlated with the lurking variable of 609 00:43:42 --> 00:43:46 socioeconomic position. 610 00:43:46 --> 00:43:51 And so, in effect, there was no cause and effect relationship. 611 00:43:51 --> 00:43:54 And once they did another study, in which the lurking 612 00:43:54 --> 00:43:59 variable was controlled, and they looked at heart disease 613 00:43:59 --> 00:44:02 among rich women separately from poor women, with this 614 00:44:02 --> 00:44:07 therapy, they discovered that therapy was not good. 615 00:44:07 --> 00:44:10 It was, in fact, harmful. 616 00:44:10 --> 00:44:16 So this is a very important moral to remember. 617 00:44:16 --> 00:44:21 When you look at correlations, don't assume cause and effect. 
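The way a lurking variable can reverse a conclusion is easy to reproduce with a small table of counts. The sketch below is only an illustration: every number in it is hypothetical and invented for this example (none of it is the study's data), and the names GROUPS and disease_rate are assumptions. In the made-up population, wealthy women are both more likely to take the therapy and less likely to have heart disease, so the therapy looks protective overall even though it is slightly harmful within each wealth group.

```python
# Hypothetical counts, invented for illustration only (not the study's data):
# (wealth, takes_therapy) -> (number of women, number with heart disease)
GROUPS = {
    ("wealthy", True):  (8000, 440),
    ("wealthy", False): (2000, 100),
    ("poor", True):     (2000, 220),
    ("poor", False):    (8000, 800),
}

def disease_rate(therapy, wealth=None):
    """Heart disease rate for a therapy group, optionally within one wealth level."""
    women = cases = 0
    for (w, t), (n, sick) in GROUPS.items():
        if t == therapy and (wealth is None or w == wealth):
            women += n
            cases += sick
    return cases / women

if __name__ == "__main__":
    # Pooled comparison: ignores wealth, so the lurking variable is hidden.
    print("all women:  therapy %.3f  no therapy %.3f"
          % (disease_rate(True), disease_rate(False)))
    # Stratified comparison: controls for wealth by looking at each group alone.
    for wealth in ("wealthy", "poor"):
        print("%-10s  therapy %.3f  no therapy %.3f"
              % (wealth + ":", disease_rate(True, wealth), disease_rate(False, wealth)))
```

Looking at each wealth group separately is the simplest form of controlling for the lurking variable, which is what the later study is described as doing.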
618 00:44:21 --> 00:44:25 And don't assume that there isn't a lurking variable that 619 00:44:25 --> 00:44:29 really is the dominant factor. 620 00:44:29 --> 00:44:35 So that's one statistical, a second statistical worry. 621 00:44:35 --> 00:44:48 Number 2 is, beware of what's called non-response bias. 622 00:44:48 --> 00:44:52 Which is another fancy way of saying, non-representative 623 00:44:52 --> 00:45:05 samples. 624 00:45:05 --> 00:45:08 No one doing a study beyond the trivial can sample 625 00:45:08 --> 00:45:11 everybody or everything. 626 00:45:11 --> 00:45:17 And only mind readers can be sure of what they've missed. 627 00:45:17 --> 00:45:18 Unless, of course, people choose to miss 628 00:45:18 --> 00:45:20 things on purpose. 629 00:45:20 --> 00:45:22 Which you also see. 630 00:45:22 --> 00:45:24 And that brings me to my next anecdote. 631 00:45:24 --> 00:45:31 A former professor at the University of Nebraska, who 632 00:45:31 --> 00:45:34 later headed a group called The Family Research Institute, 633 00:45:34 --> 00:45:37 which some of you may have heard about, claimed that gay 634 00:45:37 --> 00:45:43 men have an average life expectancy of 43 years. 635 00:45:43 --> 00:45:46 And they did a study full of statistics showing 636 00:45:46 --> 00:45:49 that this was the case. 637 00:45:49 --> 00:45:53 And the key was, they calculated the figure by 638 00:45:53 --> 00:45:57 checking gay newspapers for obituaries and news 639 00:45:57 --> 00:46:00 stories about deaths. 640 00:46:00 --> 00:46:04 So they went through the gay newspapers, took a list of 641 00:46:04 --> 00:46:08 everybody whose obituary appeared, how old they were 642 00:46:08 --> 00:46:13 when they died, took the average, and said it was 43. 643 00:46:13 --> 00:46:15 Then they did a bunch of statistics, with all sorts of 644 00:46:15 --> 00:46:19 tests, showing how, you know, what the curves look like, the 645 00:46:19 --> 00:46:21 distributions, and the significance.
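Averaging ages taken from obituaries, as in the method just described, only counts people who have already died. The sketch below mimics that with a made-up population; it is an illustration under loudly stated assumptions, not a reconstruction of the study. The lifespan distribution (mean 75, standard deviation 12), the 60-year window of births, the seed, and the function name simulate are all hypothetical. Because nobody in the simulated group was born more than 60 years ago, every recorded death is necessarily at an age below 60, no matter how long people in the group will eventually live.

```python
import random

def simulate(population=100_000, window_years=60, seed=0):
    """Compare the true mean lifespan with the mean age seen in 'obituaries'."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    all_lifespans = []
    observed_deaths = []
    for _ in range(population):
        lifespan = max(0.0, rng.gauss(75, 12))          # assumed eventual lifespan
        years_since_birth = rng.uniform(0, window_years)  # assumed time since birth
        all_lifespans.append(lifespan)
        if lifespan <= years_since_birth:
            # This person has already died, so an obituary could exist.
            observed_deaths.append(lifespan)
    true_mean = sum(all_lifespans) / len(all_lifespans)
    obituary_mean = sum(observed_deaths) / len(observed_deaths)
    return true_mean, obituary_mean, len(observed_deaths)

if __name__ == "__main__":
    true_mean, obituary_mean, n = simulate()
    print("true mean lifespan:        %.1f" % true_mean)
    print("mean age in %d obituaries: %.1f" % (n, obituary_mean))
```

The arithmetic on the obituary sample is perfectly valid; the sample itself is what pushes the average down.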
646 00:46:21 --> 00:46:25 All the math was valid. 647 00:46:25 --> 00:46:28 The problem was, it was a very unrepresentative sample. 648 00:46:28 --> 00:46:30 What was the most unrepresentative 649 00:46:30 --> 00:46:32 thing about it? 650 00:46:32 --> 00:46:33 Somebody? 651 00:46:33 --> 00:46:38 STUDENT: Not all deaths have obituaries. 652 00:46:38 --> 00:46:40 PROFESSOR: Well, that's one thing. 653 00:46:40 --> 00:46:41 That's certainly true. 654 00:46:41 --> 00:46:44 But what else? 655 00:46:44 --> 00:46:48 Well, not all gay people are dead, right? 656 00:46:48 --> 00:46:51 So if you're looking at obituaries, you're in fact only 657 00:46:51 --> 00:46:53 getting -- I'm sure that's what you were planning 658 00:46:53 --> 00:46:56 to say -- sorry. 659 00:46:56 --> 00:46:59 You're only getting the people who are dead, so it's clearly 660 00:46:59 --> 00:47:02 going to make the number look smaller, right? 661 00:47:02 --> 00:47:05 Furthermore, you're only getting the ones that were 662 00:47:05 --> 00:47:10 reported in newspapers, the newspapers are typically urban, 663 00:47:10 --> 00:47:18 rather than out in rural areas, so it turns out, it's also 664 00:47:18 --> 00:47:23 biased against gays who chose not to come out of the closet, and 665 00:47:23 --> 00:47:26 therefore didn't appear in these. 666 00:47:26 --> 00:47:29 Lots and lots of problems with this. 667 00:47:29 --> 00:47:31 Believe it or not, this paper was published in 668 00:47:31 --> 00:47:34 a reputable journal. 669 00:47:34 --> 00:47:38 And someone checked all the math, but missed the fact that 670 00:47:38 --> 00:47:49 all of that was irrelevant because the sample was wrong. 671 00:47:49 --> 00:47:50 Data enhancement. 672 00:47:50 --> 00:47:53 It even sounds bad, right? 673 00:47:53 --> 00:47:56 You run an experiment, you get your data, and you enhance it.
674 00:47:56 --> 00:47:59 It's kind of like when you ran those physics experiments in 675 00:47:59 --> 00:48:02 high school, and you got answers that you knew didn't 676 00:48:02 --> 00:48:05 match the theory, so you fudged the data? 677 00:48:05 --> 00:48:08 I know none of you would have ever done that, but some 678 00:48:08 --> 00:48:09 people have been known to. 679 00:48:09 --> 00:48:12 That's not actually what this means. 680 00:48:12 --> 00:48:16 What this means is, reading more into the data than 681 00:48:16 --> 00:48:19 it actually implies. 682 00:48:19 --> 00:48:23 So well-meaning people are often the guiltiest here. 683 00:48:23 --> 00:48:26 So here's another one of my favorites. 684 00:48:26 --> 00:48:29 For example, there are people who try to scare 685 00:48:29 --> 00:48:31 us into driving safely. 686 00:48:31 --> 00:48:33 Driving safely is a good thing. 687 00:48:33 --> 00:48:36 By telling us about holiday deaths. 688 00:48:36 --> 00:48:39 So you'll read things like, 400 killed on the highways 689 00:48:39 --> 00:48:42 over long weekend. 690 00:48:42 --> 00:48:47 It sounds really bad, until you observe the fact that roughly 691 00:48:47 --> 00:48:51 400 people are killed in any 3-day period. 692 00:48:51 --> 00:48:56 And in fact, it's no higher on the holiday weekends. 693 00:48:56 --> 00:48:58 I'll bet you all thought more people got killed 694 00:48:58 --> 00:48:59 on holiday weekends. 695 00:48:59 --> 00:49:02 Well, typically not. 696 00:49:02 --> 00:49:05 They just report how many died, but they don't tell you the 697 00:49:05 --> 00:49:12 context, say, oh, by the way, take any 3-day period. 698 00:49:12 --> 00:49:16 So the moral there is, you really want to place 699 00:49:16 --> 00:49:20 the data in context. 700 00:49:20 --> 00:49:24 Data taken out of context without comparison is 701 00:49:24 --> 00:49:27 usually meaningless. 702 00:49:27 --> 00:49:39 Another variant of this is extrapolation. 703 00:49:39 --> 00:49:42 A commonly quoted statistic.
704 00:49:42 --> 00:49:45 Most auto accidents happen within 10 miles of home. 705 00:49:45 --> 00:49:48 Anyone here heard that statistic? 706 00:49:48 --> 00:49:53 It's true, but what does it mean? 707 00:49:53 --> 00:49:55 Well, people tend to say it means, it's dangerous to 708 00:49:55 --> 00:49:57 drive when you're near home. 709 00:49:57 --> 00:50:02 But in fact, most driving is done within 10 miles of home. 710 00:50:02 --> 00:50:06 Furthermore, we don't actually know where home is. 711 00:50:06 --> 00:50:09 Home is where the car is supposedly garaged on the 712 00:50:09 --> 00:50:12 state registration forms. 713 00:50:12 --> 00:50:15 So, data enhancement would suggest that I should 714 00:50:15 --> 00:50:18 register my car in Alaska. 715 00:50:18 --> 00:50:20 And then I would never be driving within 10 miles of 716 00:50:20 --> 00:50:24 home, and I would be much safer. 717 00:50:24 --> 00:50:32 Well, it's probably not a fact. 718 00:50:32 --> 00:50:37 So there are all sorts of things on that. 719 00:50:37 --> 00:50:39 Well, I think I will come back to this, because I have a 720 00:50:39 --> 00:50:42 couple more good stories which I hate not to give you. 721 00:50:42 --> 00:50:45 So we'll come back on Thursday and look at a couple 722 00:50:45 --> 00:50:48 more things that can go wrong with statistics. 723 00:50:48 --> 00:50:48
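The point about accidents within 10 miles of home comes down to dividing by the amount of driving. The sketch below is a small worked example with hypothetical figures invented for illustration (the counts, the mileage, and the names ACCIDENTS, MILES_DRIVEN, share_of_accidents, and accidents_per_million_miles are all assumptions). In this made-up data most accidents do happen near home, yet the rate per mile driven is identical near home and far from it, because most of the driving happens near home too.

```python
# Hypothetical figures, invented for illustration only.
ACCIDENTS = {"near_home": 700, "far_from_home": 300}                   # accident counts
MILES_DRIVEN = {"near_home": 7_000_000, "far_from_home": 3_000_000}    # miles driven

def share_of_accidents(zone):
    """Fraction of all accidents that happened in the given zone."""
    return ACCIDENTS[zone] / sum(ACCIDENTS.values())

def accidents_per_million_miles(zone):
    """Accident rate in the zone, normalized by how much driving was done there."""
    return ACCIDENTS[zone] * 1_000_000 / MILES_DRIVEN[zone]

if __name__ == "__main__":
    for zone in ACCIDENTS:
        print("%-14s share of accidents %.2f, per million miles %.1f"
              % (zone, share_of_accidents(zone), accidents_per_million_miles(zone)))
```

The raw share is the headline number; the per-mile rate is the same data placed in context.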