The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Last time, we started looking in more detail at some of the statistical basics. These are the basis for a lot of the tools and techniques that we're going to be learning about throughout the term, especially things like statistical process control, statistical design of experiments, robust optimization, yield modeling, and so on. And so we're going to pick up more or less where we left off. We talked a bit about the normal distribution, and what I want to do is talk a little bit more about a few of the assumptions and why it's so common that we use it for describing some of the kinds of data that we looked at last time. We went through a fairly substantial number of different examples and saw variation in time, variation across different parameter sets, and so on.

Just to remind us, the standard normal is just a mean-centered, unit-variance version of the normal. So if we have x as our data, and we subtract off the mean and then normalize to the standard deviation--z = (x - mu)/sigma--we get a unit normal variable. It's another random variable z that has a distribution that is marked out in terms of numbers of standard deviations. And so this is our normal distribution. Some nice properties that we mentioned last time are that it has only two parameters: the mean and the variance (or standard deviation) completely describe the normal distribution. Another property is that it's symmetric about the mean. We actually will use that property quite a bit in terms of manipulating some of the table values that one would look up for the proportion of the distribution that's out in either of the tails. It's perhaps obvious, but we actually do use that. We'll come back to that a little bit later.
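A minimal sketch of this standardization in Python (NumPy assumed; the data values here are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical measurements (e.g., a critical dimension in microns).
x = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0])

# Standardize: subtract the mean, scale by the standard deviation.
# If x is normally distributed, z follows the standard (unit) normal.
z = (x - x.mean()) / x.std(ddof=1)

print(z.mean())       # ~0 by construction
print(z.std(ddof=1))  # 1 by construction
```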
But what I wanted to start with is talking a little bit more about this assumption, if you dive into it, that we are using a normal distribution very often. And the questions are: why? How good of an approximation is that in most cases? When can we use it? When might we be motivated to use it?

What we did last time is we did a couple of things where we looked at some of the data. In particular, we did histogram--binning--kinds of plots of variations. And that would often motivate, based on the general shape, that a normal distribution looked appropriate. One can, I guess, do a curve fit to the histogram. Would you ever try to do that? So imagine that you actually had, say, the tops of the bins for the distribution. So maybe I had bins like this, where sometimes I had these as values--something like this. Now, would you actually try to do a normal distribution curve fit to that? In other words, if you said, what I'm going to try to do is minimize the errors between these points and the normal distribution, does that seem like a reasonable thing to do?

AUDIENCE: It's all driven by the size of your tails.

PROFESSOR: Yeah, there are some gotchas, certainly, with any histogram. The point was that the shape of this distribution--if you've ever played around, especially with interactive tools where you can bin and plot out distributions, if you were to change the size of your bins, you have this disturbing effect where the shape of your distribution sometimes changes a little bit out from under you. So if you change the bins, you may well end up with something where, all of a sudden, this one was low and now it's high, and the next one is a little bit low, and this one's up here if your bins are a little bit wider. So that might be a concern, but that's actually not the point that I'm after. Would you curve fit to this distribution to fit a normal to your data?
AUDIENCE: Well, you said that the normal distribution is described by the mean and standard deviation. So you might as well just take the mean and standard deviation of your data.

PROFESSOR: Beautiful.

AUDIENCE: And use that.

PROFESSOR: Right. Right, especially-- I guess the only circumstance I can imagine where it might make sense to curve fit is if you didn't have the raw data--you only had the bins. That's kind of strange. I think in most cases, you would, in fact, have the raw data. And then you simply calculate the mean and the standard deviation. Now, one thing we want to do, and we'll get to it a little bit today, is why that's a reasonable thing to do--to actually go in and calculate the mean and standard deviation. Why is that a good estimator for the underlying parameters of this distribution--the true mean and the true variance?

There are other things you can certainly do to check. If you had your data, and you calculated the mean and standard deviation, then you can plot your Gaussian on top of that distribution. And that, I think, is a reasonable thing to do as a quick visual check to see how well it seems to map, as well as a quick check that your calculations were reasonable--that nothing strange went wrong in your numerical calculation of those parameters. Now, there are a couple of other things that one can do to quickly check the assumption visually, and then a couple of very nice additional tools that I'll mention here for checking assumptions in a little bit more sophisticated way, visually or numerically. But one thing you can certainly do is look at the location of your data and just do a quick comparison of the percentage of data that you would expect in different bands of the data. We'll do a few more examples there, so that we know what percentage of the data we expect in the plus/minus 1 sigma band, for example, or what percentage of the data we would typically expect out in the 3 sigma tails. And so you can do a quick calculation and comparison of the percentage of data in each of these different bands and see: is that matching up to what we would expect from a normal distribution? This actually gets very close to the idea of confidence intervals, which we'll formalize a little bit more.
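A minimal sketch of both checks in Python--fitting the normal by its sample moments, overlaying it on the histogram, and comparing the fraction of data inside the plus/minus k sigma bands against the theoretical values (NumPy, SciPy, and Matplotlib assumed; the data are simulated for illustration):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated stand-in for real process data.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=0.1, size=500)

# "Fitting" a normal: just compute the sample mean and standard deviation.
mu, s = x.mean(), x.std(ddof=1)

# Quick visual check: overlay the implied Gaussian on the histogram.
plt.hist(x, bins=30, density=True, alpha=0.5)
grid = np.linspace(x.min(), x.max(), 200)
plt.plot(grid, stats.norm.pdf(grid, mu, s))
plt.show()

# Band check: fraction within +/-1, 2, 3 sigma vs. ~68.3%, 95.4%, 99.7%.
for k in (1, 2, 3):
    frac = np.mean(np.abs(x - mu) <= k * s)
    print(f"within {k} sigma: {frac:.3f} "
          f"(normal predicts {2 * stats.norm.cdf(k) - 1:.3f})")
```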
Now, there are a couple of additional things I've listed here. One is you can look at the kurtosis, or do a quick calculation of kurtosis, which is a higher-order statistical moment than the mean or the variance. In fact, if you look at the definition of the kurtosis, it's an expectation of the fourth moment--or rather, a calculation of a normalized version of the fourth moment. And for a perfectly normal distribution, this kurtosis value would be 1. And then as the distribution changes its shape--gets more peaked or less peaked, following other common distributions--it starts to deviate substantially from k equals 1. And in fact, this is a quick little tool to use sometimes if you're not sure, number one, if it's normal, and number two, if it's not normal, what distribution it might follow.

If you look here, this is a nice plot, although I didn't break out what all of these different distributions are. This is just a plot, normalized to the standard deviation of the data, of a set of different distributions. And the black one here is N. So this is--let me do the black one here, right? This is the N distribution. That's our Gaussian, with a kurtosis-- well, I guess you've got to look a little bit carefully at the definition.
Actually, I think if I go back to the previous page, which is one that Dave had, this definition for sample data essentially, as n gets very large, subtracts off 3. So I believe then, in this case, the kurtosis for a normal distribution is actually more like 0. These two definitions, if you look them up, I'm not sure are exactly the same. Rarely would you actually use this one; you're going to actually use this definition, which basically subtracts off a value. This goes with the plot on the next page. So they are slightly different definitions, I believe. So in that case, that's subtracting off a 3, and for the normal distribution, it ends up with a value of about 0.

Now, what's nice is, as you get some of these distributions, such as the Laplace distribution--this very peaked one right here--the kurtosis value goes up. It's an indication of a more peaked distribution. The logistic distribution, which we might talk about a little bit later--it's one that comes up occasionally with some quality or discrete kinds of distributions--has a kurtosis of 1.2. And the interesting one here also is the uniform distribution, which is less sharply peaked than a Gaussian. It actually has a negative kurtosis, with that subtraction of the 3 off it at the end. So you might find that a useful tool. I've rarely used kurtosis actually as an indicator, but I want to mention it to you because it is out there, at least as a hint at looking at some different distributions.

A more useful tool-- yeah, question?

AUDIENCE: So there's two different formulas? Because--

PROFESSOR: Well--

AUDIENCE: What you said, or--

PROFESSOR: Yeah, so this is for sample data. And I think if you were to actually go in-- I mean, essentially this-- I have not checked this. This was some definitions from previous class notes.
I do believe, when I did a quick lookup on what kurtosis is, that this is a better definition in terms of actual calculation formulas that you can use for calculating it. This is to give you the sense. I mean, it's sort of lurking in here: you can see the expectation operation down in here, and then the normalization to the standard deviation. In this case, this has to be your calculated standard deviation; this is the abstract one. So if you actually poke around, you will find in the literature more than one definition of kurtosis. My point was that this is what I would use if you want to use the plot on the next page, in terms of coming up with a number that might also indicate if there's a different distribution that you might look at. So it's related to the fourth moment.
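For reference, the two conventions in play here are Pearson kurtosis, the normalized fourth central moment E[(x - mu)^4]/sigma^4, which is 3 for a normal, and excess kurtosis, the same quantity minus 3, which is 0 for a normal. A quick numerical sketch, assuming SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)  # large sample from a normal

# Pearson kurtosis: normalized fourth central moment; ~3 for a normal.
pearson = np.mean((x - x.mean())**4) / x.std()**4
print(pearson)                                    # ~3.0

# Excess (Fisher) kurtosis subtracts 3: ~0 for a normal,
# positive for peaked distributions, negative for flat ones.
print(stats.kurtosis(x))                          # ~0.0
print(stats.kurtosis(rng.laplace(size=100_000)))  # ~3.0 (peaked)
print(stats.kurtosis(rng.uniform(size=100_000)))  # ~-1.2 (flat)
```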
A more useful tool--and this is one that I actually do use--is probability plots, or quantile-quantile plots. There's a section in Montgomery on that, as well as different toolboxes we'll be able to use to generate these things. And so here's an example of a quantile-quantile plot.

What I've started doing with the lecture notes on the web is to put up an early draft as early as I can for the next couple of weeks of lecture notes. But then, as I'm editing and adding in, I'll have the most up-to-date one. So if you've got slides, you may be missing a couple of these. If you printed them out before 9:00 or 10:00 PM last night, I think these got updated about that time. So this plot, for example, was not in the early draft of the slides. And I'll try to indicate that with a little "draft" on the web page if they're still early drafts of the slides.

So what are quantile-quantile plots? These are a little bit subtle to explain, so let me give it a shot, and then if you have questions, let me know. Normally, it's going to be generated by your statistics package. There are hand ways to do it, and I'll refer you to Montgomery for practice with actually trying to generate them by hand if you had to. But here's the basic idea. What we're plotting is the actual data that you've got. On the y-axis, you'll be plotting your data in terms of a normalized distribution. So you would center to the mean and then scale to your standard deviation--think of these as unit standard deviations. You simply find that as the y location for your data.

Then what you're plotting on the x-axis is the normal theoretical--I'm not sure I'd use the word quantiles here--but your normal theoretical standard deviation for that number of data points that you would have had, and the location for each of those data points. So imagine this is 50 data points--I'm not sure exactly how many data points this is. If you were to take 50 data points, draw them from a normal distribution, order them, and put them where you would expect on a normal distribution, what you would have is many more data points near 0. And as you get further and further out, 1 out of 50 times or 1 out of 25 times, you would expect to find a data point about, whatever it is, 2 or 2.1 standard deviations away.

In other words, if I were to compare the actual location of that data point, in terms of its value within my sample distribution of 50, against where it would fall if I just drew 50 data points randomly, that would be its location. Then what I can do is plot that coordinate for that data point. So what you end up with is taking all of your data, if you will, and sorting it from low to high. And then, starting at the center in some sense, you work outward from the center, ordering the data--comparing the number of standard deviations away from the mean that its index in your sorted data says it should be, against how far that data point actually was from your sample mean.
And what that gives you, if it were perfect and there were not any sort of noise in your data, is this perfect matching line: every data point falls where you would expect it to. Now, in your actual data, you're going to see some deviations from that. But what this is basically doing is an expansion of your data out in the tails and a compression of your data near the center, to be able to tell you how closely your data is following the assumed distribution.

And for this case here, we plotted the location of 50 data points, assuming it was a normal distribution--so that's where my x values were coming from. And as you can see here, the data pretty much nicely follows this distribution. You get a few little things that look like it's wandering or trailing off a little bit. And then you also often look out here in the tails. And you find, even out here over two standard deviations away, it looks like I've got pretty good fidelity to those tails. I might have values that are a little bit further away from the mean than I might expect from a normal distribution, but it's pretty close. So this is the kind of plot that you would expect to see for data that did, in fact, follow a normal distribution.

All right, so I know that's confusing. Are there questions that people have on what this--

AUDIENCE: Yes.

PROFESSOR: Yeah?

AUDIENCE: I have a question. So for each point, you get the y-axis from the sample value from your data.

PROFESSOR: Right.

AUDIENCE: And how do you get x? Do you get it based on the probability of that--pulling from your sample distribution, you refer to the theoretical normal distribution with the same probability, and then you get the x-axis?

PROFESSOR: Yes, very, very close. So for the y-axis, you've got it exactly right. For the x-axis, what's interesting is you don't actually use the values of your data.
You just use its index location in a sample of the size that you've got. In other words, if I had a million points, I would look at the lowest one. And I would expect that to be--in a normal distribution, I would look at where the probability, the number of standard deviations, is such that 1 out of 500,000 points is that far away from the mean. So I would look up the inverse probability on a normal distribution of being--of where 1 in 500,000 falls, whatever that small probability is. So I basically look at a tabulated normal probability, going backwards from where that index--my smallest point--was. And then I could do that for every point in my sample to figure out where its location should be on the x-axis.

So here's another example. Maybe this gives you a feel, because these q-q plots--the quantile-quantile plots--can actually be used with other distributions as well. They are not always q-q norm plots; they can be applied to whatever assumed probability distribution you might want to investigate. So here's an example where we again took the data, and the theoretical quantiles are lining up assuming a normal distribution. But in this example that I'm showing here, the data actually came from--let me erase, let me get rid of all this--the data actually came from an exponential distribution. So this is an example where I would have assumed things were coming from a Gaussian--this is still for the normal quantiles. But with an exponential, an e-to-the-minus-x kind of distribution, what you end up with are a lot of data values that are much larger, much further away from the mean, than you would expect from a Gaussian. So this would be an example where the normal q-q plot doesn't seem to match up. It's telling me my data really is not following along the normal distribution line.
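A sketch of constructing such a normal q-q plot by hand, assuming SciPy; it reproduces the situation above, with data secretly drawn from an exponential. The (i - 0.5)/n plotting position used here is one common convention among several:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=50)  # data secretly from an exponential

# y-axis: sorted data, centered and scaled to sample standard deviations.
y = np.sort((x - x.mean()) / x.std(ddof=1))

# x-axis: theoretical normal quantile for each sorted index, via the
# inverse normal CDF evaluated at plotting positions (i - 0.5)/n.
n = len(x)
p = (np.arange(1, n + 1) - 0.5) / n
q = stats.norm.ppf(p)

plt.plot(q, y, "o")  # exponential data wanders off the line in the tails
plt.plot(q, q, "-")  # 45-degree reference line
plt.xlabel("theoretical normal quantiles")
plt.ylabel("standardized data quantiles")
plt.show()

# For an exponential q-q plot instead, swap in stats.expon.ppf(p)
# (suitably standardized) for the theoretical quantiles.
```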
Now, I didn't pull a plot. But one could then ask the question--maybe you'd look up kurtosis, or maybe you look back at your data and say, I think maybe an exponential distribution is really what this is following. How would you plot that on a q-q norm?

AUDIENCE: Question.

PROFESSOR: Yeah?

AUDIENCE: Why doesn't the line go through 0, 0?

PROFESSOR: This is a good question. These don't appear to be mean-centered to me. So there's something weird on the plot.

AUDIENCE: So the line should be for a normal distribution, not fitting.

PROFESSOR: Yeah, this does not-- I think what has happened here is these are not quite mean-centered and normalized, because-- well, in terms of 0, 0 falling on the plot, that's not happening. So I'm a little-- I'm not sure exactly what's going on there.

AUDIENCE: Wouldn't the plotting function take the mean of the data, et cetera? It's centering the conceptual normal on the data mean.

PROFESSOR: Yes, it should. And that's what I'm saying--this plot, I don't think, is correctly mean-centered, because then 0, 0, by definition, has to fall on it.

AUDIENCE: Right.

PROFESSOR: Oh, that's what you're saying.

AUDIENCE: No, I was saying you could take the mean of the data, so that the normal you're plotting is aligned with that data.

PROFESSOR: Right. But I'm saying, here's my y data, and my zero mean is not-- I don't have any data lower than the mean, and therefore that doesn't make any sense. So this is not mean-centered correctly.

AUDIENCE: It looks to me like the mean of the data is at slightly less than 1 on the plot, so that coincides with the mean.

PROFESSOR: But if I mean-center and scale, then the mean of my data--by definition, that ought to be at 0, right?

AUDIENCE: Oh, I see.
I don't think you're shifting the data, though.

PROFESSOR: When you mean-center, yeah, you're shifting.

AUDIENCE: Oh, I think you're shifting, but conceptually, you're not shifting the data. You're shifting the normal that you're saying might correspond to the data.

PROFESSOR: No. In the standard q-q norm plot, you mean-center. You actually take your data, you mean-center it, you normalize it to the calculated sample standard deviation, and you plot that. And that does not look quite like what they've done here. I think these are still normalized to the standard deviation, but it's not quite mean-centered. But in some sense, that doesn't actually matter in terms of the data following along the line. It's still indicating; that would just be a shift.

AUDIENCE: You said the data hasn't been normalized, or hasn't been mean-centered. But if it's an exponential distribution, can you still normalize it?

PROFESSOR: In this first use of such a plot, you would be testing the question. You don't know yet that it's exponential--you just have data, and you're testing: does it fall on the normal line? So you would still follow that procedure. We'll look at an exponential distribution in a minute. And of course, every distribution has a mean, so you can always mean-center. Similarly, every distribution has a variance that you can calculate. The neat thing about the exponential is that the mean and the standard deviation are the same. But that's not entering in here; there's something else weird. So there's the risk of pulling a plot off at 9:50 at night--I hadn't noticed that it doesn't look correctly mean-centered. But the additional point I wanted to make is that I could actually take this same data and produce a different plot--not a normal q-q plot, but an exponential q-q plot.
And if I were doing that, what I would do is take my data, still plotted, hopefully mean-centered, in numbers of standard deviations away. But then along this axis, I would calculate the location in numbers of standard deviations based on the probability of an exponential distribution, not based on the probability of that index location in a normal distribution. So I would basically say, for my 50 data points, where do I expect the 25th data point larger than the mean to occur in that distribution? I have to go 2.1 normalized standard deviations away in order to get to that probability. So it takes my same y data, but it plots it so that, if it really is exponential, my data should follow along a one-to-one correspondence line.

So you don't often see these q-q plots used from the perspective of different distributions, but you can use them. What you often will see is really this: q-q norm plots. And they're lovely plots--a wonderful tool to use, because you're actually seeing all of your data. It's got all of your actual data; it's showing you that it corresponds roughly to a normal distribution; and it's also giving you very nice information about, essentially, your variance or standard deviation.

And there are variants of these plots that you will often see in the literature, especially the semiconductor literature, dealing with large numbers of samples coming from different kinds of measurements. So for example, if you want to make contact resistance measurements for literally thousands of contacts and present that data very succinctly, you will see families of q-q norm plots. Maybe you did a bunch of contacts at a particular size--you would plot them like this. And then maybe you had another data set, where you had attempted to pattern those contacts slightly larger or slightly smaller. And you would often see then another-- oops, that's not very straight, is it?
It's meant to be another underlying set of data. But you might find your data looking something like this. And that kind of plot is really useful for showing that there is a mean shift, a mean difference, between your data sets, but also that the variance is different in the two cases. Now, exactly what you're plotting here might be a little bit different. You might actually not plot quite normalized data--you might use it in an unnormalized fashion. Here, you might plot this not in terms of standard deviations, but rather keep it in the quantiles, or the probability of being that far away--the probability of that x value. So for example, you will often see these kinds of plots which would show things like 0.001, 0.01, 0.1, or something like that, getting up to--I guess 0.5 would be the equivalent for the mean--and then you start going larger: 0.9, 0.99, 0.999. In other words, you might actually plot--I should have put these on the x value--the probability that you would find a data point that far away, as opposed to implied probabilities in terms of numbers of standard deviations.

So there are some really cool variants of these plots that are very useful. And I think we'll see some of these when we talk a little bit about yield and some other distributions.

AUDIENCE: I have a question.

PROFESSOR: Yeah.

AUDIENCE: Yeah, after I have the q-q plot, how can I tell the confidence level that I have, to say whether or not my data is normally distributed?

PROFESSOR: So the q-q plot does not actually tell you confidence intervals, on either the hypothesis that it's normally distributed or on the parameter estimates. There are some formal statistical tests where you can test that hypothesis of normality.
And essentially, you can use those from your-- you're never going to hand-calculate some of those statistics and the probability associated with a derived statistic for normality; you'll use your statistics package for that. This gives you a good visual indication. But to actually test--is it normal, or what is the probability that the data is non-normal?--that's a different question. And then today, we will start talking about confidence intervals on the mean and the variance, which you also would not use the q-q norm plot to generate. So in fact, let's get to that, because that's-- yeah?

AUDIENCE: For that plot, can you use regression to see how far it is from the normal?

PROFESSOR: Well, first off, again, if you were actually trying to estimate the parameters of normality, you would just use the data and calculate the sample mean and sample standard deviation. I think what you are essentially posing here is: could I go in and look at these deviations and do some, I don't know, sum of squared values of those deviations? That's actually getting really close to calculating a statistic. Call it a W, some number--a W statistic that I would form based on the sum of squared deviations on one of these plots, or maybe a sum of absolute distance deviations. Now I've got a statistic W, and that's getting really close to the kinds of statistical tests that one would run to ask the question of normality. I don't actually know what formula is used in coming up with a W value, and what the normality tests are. But that's the kernel of the idea: you actually look at your data and form an aggregate value for that statistic, that W statistic. So for example, if it was a sum of absolute values, for a sample of size 50, and that W is very near 0, then you have high confidence that it's a normal distribution. But as W gets bigger, that would seem to indicate more and more likelihood that it's not normal. And that's exactly the kind of thing that's going on in the formal statistical tests for normality.
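The W described here is, in spirit, the Shapiro-Wilk test statistic, though the real test's convention differs from the hypothetical one above: Shapiro-Wilk's W is near 1 (not 0) for normal data, and in practice you read off the p-value. A minimal sketch, assuming SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Normal data: W close to 1, large p-value -> consistent with normality.
w, p = stats.shapiro(rng.normal(size=50))
print(w, p)

# Exponential data: smaller W, tiny p-value -> normality rejected.
w, p = stats.shapiro(rng.exponential(size=50))
print(w, p)
```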
So here, we've given you a few tools for being able to look at the data and get a feel for whether it's normal or not. But it hasn't answered the question: how come, so often, we're using a normal distribution when we're actually looking at manufacturing data or other kinds of experimental data? And a really important thing is the following observation--the following fact. Suppose we are forming a sum of independent observations of a random variable. So x has some underlying distribution, and it doesn't actually matter what that underlying distribution is. But I form n independent observations of that random variable, and then I look at the distribution of the sum--x1 plus x2 plus all n random variables. The fascinating fact is that the sum of independent random variables tends towards a normal distribution. This is the central limit theorem.

So here's a neat little example. Suppose my underlying distribution is in fact something like a uniform distribution, and I'm, say, pulling off samples of x1, samples of x2, and so on, each from a uniform distribution, and forming the sum of 20 of them--each one of these is, I guess, 1,000 points in this example. I essentially take the sum of all of these random variables and form a new random variable. The new random variable tends towards a normal distribution with some mean and variance.

Some of you I saw in 2853. And I had a nice link to a website--I'll actually dig that up and post it for this class. It's the SticiGui website, an interactive statistics package out of UC Berkeley. And it's really fun.
You can actually form these kinds of sums of random variables out of different underlying distributions, plot them, and start to see how close the sum, or the normalized sum, of these distributions is to a normal. So there are some very, very nice interactive tools that you can play with.

Now, an important point here: suppose I'm calculating the mean--so I'm calculating an x bar across my data. And I've got 100 samples, each drawn--and I'm assuming I'm drawing them from the same underlying distribution, whatever that may be. What is the distribution of the sample mean? Well, if you look at the formula for the sample mean, it's not exactly a sum of your data--it's the sum of your data, then divided by n. It's summed from i equals 1 up to whatever n is, over your individual samples. So it is a sum, with a constant out front. But the point is, by appealing to the central limit theorem, the sample mean distribution--the PDF associated with the sample mean--always tends towards the normal distribution.

So we're going to come back to this idea of sampling, and what the distribution is for sample statistics, a little bit later. But more generally, very often what we're doing is pulling data out of a process that is itself already, by the physics of the process, highly averaged. And therefore, it's averaging lots of perhaps other underlying strange or difficult physics. But in aggregate, that averaging nature of the data itself--not the operation that we perform, but each individual underlying data point, each individual x sub i--may have some averaging by the physics going on underneath it that will help drive it towards being a normal distribution itself.

So just to remind you, the central limit theorem is probably the most used, and perhaps most often abused, appeal to why we're using normal distributions very often. It is still good to test it. But there is a good reason why, very often, our data does come up as normal distributions.
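A minimal sketch of that central limit theorem demonstration, assuming NumPy and Matplotlib--summing 20 uniform random variables, 1,000 times over, and looking at the shape of the result:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# 1,000 realizations of the sum of 20 independent uniform variables.
# Each x_i is uniform (decidedly non-normal), yet the sum's histogram
# comes out bell-shaped, as the central limit theorem predicts.
sums = rng.uniform(0, 1, size=(1000, 20)).sum(axis=1)

plt.hist(sums, bins=30, density=True)
plt.title("Sum of 20 uniforms: approximately normal")
plt.show()
```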
So I want to talk a little bit now about sampling, because we are very often using actual measurements and data to try to get estimates for--or, more generally, to build a model of--our random process, and to estimate parameters of that random process. And we've said, in general, p sub x is unknown. Always plot your raw data, first and foremost. Very often, the raw data will suggest a distribution, or histograms may provide some insight. So for example, a very quick histogram will very often show you the difference between a normal distribution and a uniform distribution: if the data is evenly spread, and I don't have this falloff in the tails, that's very important. And then we can also use things like the q-q norm plot to test some of those things. So the first job is to come up with what likely distribution you want to use. Nine times out of 10, a normal distribution will be appropriate. And then the second thing is to estimate the parameters of the distribution.

And the normal distribution, again, to remind you, has just these two parameters, mean and variance. And now what we want to do is estimate them. Now, everybody is used to the formulas. We've got the formulas right here for calculating, from your sample--your limited number of pieces of data--what a few important statistics or characteristics of that data are, like the sample mean, or average, and the sample variance.
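Those are the usual estimators--x bar = (1/n) sum(x_i), and s^2 = (1/(n-1)) sum((x_i - x bar)^2), the n-1 denominator being the standard unbiased form for the sample variance. A quick sketch, assuming NumPy, checking the library calls against the formulas:

```python
import numpy as np

x = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0])  # hypothetical data
n = len(x)

xbar = x.sum() / n                    # sample mean
s2 = ((x - xbar)**2).sum() / (n - 1)  # sample variance, n-1 denominator

# The NumPy one-liners agree; ddof=1 selects the n-1 denominator.
assert np.isclose(xbar, x.mean())
assert np.isclose(s2, x.var(ddof=1))
print(xbar, s2)
```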
764 00:44:08,550 --> 00:44:11,930 They have sampling distributions 765 00:44:11,930 --> 00:44:15,560 that tell you the likelihood of observing 766 00:44:15,560 --> 00:44:17,660 particular values of them-- 767 00:44:17,660 --> 00:44:23,570 that establish bounds for, if I had a different sample, 768 00:44:23,570 --> 00:44:25,940 how close you think the new sample, 769 00:44:25,940 --> 00:44:28,910 still drawn from the underlying parent distribution, 770 00:44:28,910 --> 00:44:31,820 would actually lie to the particular sample 771 00:44:31,820 --> 00:44:33,820 that I just drew. 772 00:44:33,820 --> 00:44:35,570 So I'm going to explain that in a few more 773 00:44:35,570 --> 00:44:38,420 slides or several slides here. 774 00:44:38,420 --> 00:44:44,180 But the key idea is it's really easy to calculate 775 00:44:44,180 --> 00:44:46,080 a couple of these moments-- 776 00:44:46,080 --> 00:44:49,040 the mean and the variance. 777 00:44:49,040 --> 00:44:51,140 For the normal distribution, that 778 00:44:51,140 --> 00:44:56,790 tells you everything for an estimate of your raw data. 779 00:44:56,790 --> 00:44:58,760 But then I want to get to the more subtle idea 780 00:44:58,760 --> 00:45:00,468 so that we can start talking about things 781 00:45:00,468 --> 00:45:02,870 like confidence intervals. 782 00:45:02,870 --> 00:45:13,700 And a simple example to give you a little bit of a feel 783 00:45:13,700 --> 00:45:18,080 for this here is if I were to ask you 784 00:45:18,080 --> 00:45:21,350 what distribution applies to the sample 785 00:45:21,350 --> 00:45:26,770 mean, where does that come from? 786 00:45:26,770 --> 00:45:29,910 Where does this notion of a distribution associated 787 00:45:29,910 --> 00:45:33,130 with the sample mean arise? 788 00:45:33,130 --> 00:45:36,570 So if we look at the formula for the sample mean 789 00:45:36,570 --> 00:45:39,330 and expand it out, in some sense we've 790 00:45:39,330 --> 00:45:43,860 got just a sum of independent random variables, 791 00:45:43,860 --> 00:45:48,870 like we were talking about with the central limit theorem. 792 00:45:48,870 --> 00:45:50,710 There are different constants in here. 793 00:45:50,710 --> 00:45:53,220 And in this case, for the sample mean statistic, 794 00:45:53,220 --> 00:45:56,910 all of the constants are the same, which is just 1 795 00:45:56,910 --> 00:45:59,370 over the total number of data points or sample points 796 00:45:59,370 --> 00:46:00,630 that I've got. 797 00:46:00,630 --> 00:46:05,790 Now, you can go back to the definition of expectation 798 00:46:05,790 --> 00:46:09,180 that we talked about earlier and do the expectation 799 00:46:09,180 --> 00:46:13,870 operator across this and do expectation math. 800 00:46:13,870 --> 00:46:21,740 So the expectation of ax is equal to just that constant 801 00:46:21,740 --> 00:46:26,380 times the expectation of the underlying random variable. 802 00:46:26,380 --> 00:46:33,690 So the 1 over n simply comes out to the left. 803 00:46:33,690 --> 00:46:39,370 And if I were to ask, what is the mean of the PDF 804 00:46:39,710 --> 00:46:46,730 associated with x bar, it is going to be 1 over n times 805 00:46:46,730 --> 00:46:49,810 n times the mean-- the same mean. 806 00:46:49,810 --> 00:46:51,610 Now what else is going on here is 807 00:46:51,610 --> 00:46:55,710 if you look at the standard deviation of x bar-- 808 00:46:55,710 --> 00:46:57,010 I hope you guys can see that. 809 00:46:57,010 --> 00:47:00,580 There's a variance of x bar in here.
810 00:47:00,580 --> 00:47:05,360 So that's an x and a bar, which I just-- 811 00:47:05,360 --> 00:47:10,080 the pen doesn't line up exactly with the screen. 812 00:47:10,080 --> 00:47:14,310 You can also do the expectation operator for-- 813 00:47:14,310 --> 00:47:18,790 oops, not the expectation, but the variance operator. 814 00:47:18,790 --> 00:47:24,210 And if you do the mathematics on variance of some ax, 815 00:47:24,210 --> 00:47:30,950 that's equal to a squared times the variance of the underlying 816 00:47:30,950 --> 00:47:32,280 variable. 817 00:47:32,280 --> 00:47:35,720 And if you follow that math through for the definition 818 00:47:35,720 --> 00:47:39,260 of x bar and relate that to the variance 819 00:47:39,260 --> 00:47:45,140 of each of these x sub i's, what you find is that the variance-- 820 00:47:45,140 --> 00:47:47,690 I get an n times-- 821 00:47:50,550 --> 00:47:53,170 I'm summing n of these random variables. 822 00:47:53,170 --> 00:47:57,090 So I've got n times-- 823 00:47:57,090 --> 00:48:01,340 1 over n is the constant in here. 824 00:48:01,340 --> 00:48:04,400 So I get an n times 1 over n squared times the underlying 825 00:48:04,400 --> 00:48:07,150 variance of my x. 826 00:48:07,150 --> 00:48:11,260 So I get a cancellation, and the variance then 827 00:48:11,260 --> 00:48:16,235 of my x bar is just equal to what I've shown here, 828 00:48:16,235 --> 00:48:20,810 1 over n times the variance of the underlying distribution. 829 00:48:20,810 --> 00:48:24,220 So what's interesting here is if I start 830 00:48:24,220 --> 00:48:27,280 to ask about the distributions associated 831 00:48:27,280 --> 00:48:30,250 with what are the mean and the variance 832 00:48:30,250 --> 00:48:34,060 of the normal distribution associated with x bar-- 833 00:48:36,990 --> 00:48:40,380 what is the mean of an x bar that I would typically 834 00:48:40,380 --> 00:48:43,260 observe from lots of samples of my underlying distribution? 835 00:48:43,260 --> 00:48:45,300 What is a variance I would observe? 836 00:48:45,300 --> 00:48:49,350 It's related to the underlying distribution, 837 00:48:49,350 --> 00:48:51,250 but it's not exactly the same. 838 00:48:51,250 --> 00:48:54,330 I've got a new random variable, an x bar, 839 00:48:54,330 --> 00:48:57,010 that has a different mean and variance. 840 00:48:57,010 --> 00:49:01,720 It's got the same mean in this case, 841 00:49:01,720 --> 00:49:04,480 but the variance is actually scaled. 842 00:49:04,480 --> 00:49:07,330 And this is extremely useful because the variance 843 00:49:07,330 --> 00:49:12,430 of my averaging means that I'm getting a tighter 844 00:49:12,430 --> 00:49:13,720 distribution-- 845 00:49:13,720 --> 00:49:20,080 a narrower or smaller variance compared 846 00:49:20,080 --> 00:49:22,030 to the underlying distribution. 847 00:49:22,030 --> 00:49:24,280 I'm going to show you that in a little bit more 848 00:49:24,280 --> 00:49:29,590 of a graphical fashion a little bit later because this is-- 849 00:49:29,590 --> 00:49:33,020 that's a preview to this whole idea of sampling, 850 00:49:33,020 --> 00:49:36,180 which is really critical. 851 00:49:36,180 --> 00:49:38,770 We've already talked about this.
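[A numerical check on that algebra, as a sketch assuming NumPy; the parent variance and sample size here are arbitrary illustrations. The empirical variance of the sample mean lands at sigma squared over n, exactly as the n times 1-over-n-squared cancellation predicts:]

```python
import numpy as np

rng = np.random.default_rng(1)

n = 25        # sample size (illustrative)
sigma2 = 4.0  # parent variance (illustrative)

# 20,000 independent samples of size n from a normal parent.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(20_000, n))
xbar = samples.mean(axis=1)

print("parent variance:     ", sigma2)
print("empirical Var(x-bar):", xbar.var())  # ~0.16
print("theory, sigma2 / n:  ", sigma2 / n)  # n * (1/n)^2 * sigma2 = 0.16
```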
852 00:49:38,770 --> 00:49:47,610 So the key thing here is to get to this notion of sampling 853 00:49:47,610 --> 00:49:50,250 distributions, what are the key distributions arising 854 00:49:50,250 --> 00:49:53,970 from the fact that I'm drawing multiple pieces of data 855 00:49:53,970 --> 00:49:56,160 from a parent distribution, and then 856 00:49:56,160 --> 00:49:58,240 calculating things about that? 857 00:49:58,240 --> 00:50:01,380 So we'll get to some of these key distributions 858 00:50:01,380 --> 00:50:03,360 besides the normal distribution. 859 00:50:03,360 --> 00:50:06,820 We'll actually talk about these next class. 860 00:50:06,820 --> 00:50:11,580 But what we want to do is go back and get a little bit more 861 00:50:11,580 --> 00:50:13,890 feel for not only the normal distribution, 862 00:50:13,890 --> 00:50:15,330 but a few other distributions that 863 00:50:15,330 --> 00:50:18,330 often arise in manufacturing, and then also start 864 00:50:18,330 --> 00:50:22,860 talking about these notions of where the data actually lies. 865 00:50:22,860 --> 00:50:26,460 What are the probabilities of data falling out in the tails? 866 00:50:26,460 --> 00:50:28,470 And using that then to start to get 867 00:50:28,470 --> 00:50:31,170 towards the idea of building confidence intervals 868 00:50:31,170 --> 00:50:34,230 and where we think the real mean of our underlying parent 869 00:50:34,230 --> 00:50:36,060 distribution sits. 870 00:50:36,060 --> 00:50:39,180 Next class, we'll also get to hypothesis tests, 871 00:50:39,180 --> 00:50:42,180 which arise naturally and actually start 872 00:50:42,180 --> 00:50:46,020 to get really close to statistical process control 873 00:50:46,020 --> 00:50:51,570 charting, which is one of the fundamental tools 874 00:50:51,570 --> 00:50:54,070 of manufacturing control. 875 00:50:54,070 --> 00:50:56,620 So what I'm going to do here is go back-- 876 00:50:56,620 --> 00:51:00,400 this is the plan for the next-- 877 00:51:00,400 --> 00:51:02,945 the rest of today and starting into tomorrow. 878 00:51:02,945 --> 00:51:04,570 We're going to go back, just remind you 879 00:51:04,570 --> 00:51:07,300 of some of the discrete variable distributions, 880 00:51:07,300 --> 00:51:09,520 then talk about some of those, 881 00:51:09,520 --> 00:51:13,930 which are more applicable to attribute modeling or yield 882 00:51:13,930 --> 00:51:15,570 modeling, sort of discrete things. 883 00:51:15,570 --> 00:51:17,320 Then we'll come back and talk a little bit 884 00:51:17,320 --> 00:51:20,500 about the continuous distributions, 885 00:51:20,500 --> 00:51:26,130 and then also touch on how you manipulate 886 00:51:26,130 --> 00:51:27,480 some of these distributions. 887 00:51:31,670 --> 00:51:36,180 Discrete distributions-- has everyone seen the Bernoulli distribution 888 00:51:36,180 --> 00:51:36,680 before? 889 00:51:39,500 --> 00:51:40,280 Good. 890 00:51:40,280 --> 00:51:43,920 This is like the simplest distribution-- 891 00:51:43,920 --> 00:51:46,980 the very simplest. 892 00:51:46,980 --> 00:51:48,300 You do a trial. 893 00:51:48,300 --> 00:51:49,290 You do an experiment. 894 00:51:49,290 --> 00:51:53,000 It can only have two outcomes, success or failure. 895 00:51:53,000 --> 00:51:55,910 You get to label what success is. 896 00:51:55,910 --> 00:51:59,210 We'll label a success with the random variable 897 00:51:59,210 --> 00:52:03,500 taking on the value of 1 and failure taking on 0. 898 00:52:03,500 --> 00:52:04,970 I could flip that.
899 00:52:04,970 --> 00:52:07,730 You can start to see already a little bit of inkling 900 00:52:07,730 --> 00:52:09,830 of yield in here. 901 00:52:09,830 --> 00:52:11,720 Does the thing work or not? 902 00:52:11,720 --> 00:52:14,660 The very simplest, coarsest, crudest kind 903 00:52:14,660 --> 00:52:18,710 of model for functionality, and the probability or statistics 904 00:52:18,710 --> 00:52:22,700 associated with that is, does the thing work or not? 905 00:52:22,700 --> 00:52:27,230 And often, we talk about what is the probability 906 00:52:27,230 --> 00:52:31,100 that the thing is functioning at the end of the line? 907 00:52:31,100 --> 00:52:33,980 Maybe that's 0.95. 908 00:52:33,980 --> 00:52:37,880 So 95% of the time, I think I've got yielding parts out. 909 00:52:37,880 --> 00:52:43,050 For any one experiment, one outcome, 910 00:52:43,050 --> 00:52:48,030 I've simply got a p and 1 minus p probability associated 911 00:52:48,030 --> 00:52:48,550 with that. 912 00:52:48,550 --> 00:52:52,510 And the PDF can be expressed as shown here. 913 00:52:52,510 --> 00:52:55,110 Now we can go in and use our expectation operations 914 00:52:55,110 --> 00:52:58,740 for discrete random variables and calculate what 915 00:52:58,740 --> 00:53:00,150 the mean and the variance are. 916 00:53:00,150 --> 00:53:03,570 And those have nice, closed-form expressions 917 00:53:03,570 --> 00:53:05,085 for those two outcomes. 918 00:53:07,660 --> 00:53:09,420 So that's the Bernoulli. 919 00:53:09,420 --> 00:53:11,340 Now the second easiest-- 920 00:53:11,340 --> 00:53:14,390 although it can actually look a little 921 00:53:14,390 --> 00:53:15,860 confusing at first glance. 922 00:53:15,860 --> 00:53:17,720 But the second easiest distribution 923 00:53:17,720 --> 00:53:20,060 is the binomial distribution because it's 924 00:53:20,060 --> 00:53:22,340 saying that I'm simply taking that success 925 00:53:22,340 --> 00:53:25,970 or failure with a fixed probability p 926 00:53:25,970 --> 00:53:28,740 and running repeated trials of that. 927 00:53:28,740 --> 00:53:33,740 So now I'm flipping my coin, say, which has-- 928 00:53:33,740 --> 00:53:35,870 perhaps it's a weighted coin, and it comes up 929 00:53:35,870 --> 00:53:39,110 heads with probability p that's not 0.5. 930 00:53:39,110 --> 00:53:42,680 Maybe it's 0.9. 931 00:53:42,680 --> 00:53:44,810 But now I'm doing that repeated times. 932 00:53:44,810 --> 00:53:47,660 I'm doing that n times. 933 00:53:47,660 --> 00:53:51,260 Now what's the probability of having n successes? 934 00:53:55,260 --> 00:53:58,680 Or let me state that again. 935 00:53:58,680 --> 00:54:01,770 What's the probability of having x successes 936 00:54:01,770 --> 00:54:04,300 when I ran n repeated trials? 937 00:54:04,300 --> 00:54:05,940 So n is the number of trials. 938 00:54:09,260 --> 00:54:11,560 So if I ran 100 trials, the probability 939 00:54:11,560 --> 00:54:16,660 that I had exactly x equal to 7 successes 940 00:54:16,660 --> 00:54:18,490 is given by this formula, here. 941 00:54:18,490 --> 00:54:21,110 And you can actually see this lurking in here. 942 00:54:21,110 --> 00:54:23,710 How do I have 7 successes? 943 00:54:23,710 --> 00:54:27,120 Well, that meant p, the probability 944 00:54:27,120 --> 00:54:30,960 of having a success, had to come up exactly 7 times. 945 00:54:30,960 --> 00:54:32,935 And the rest of the times-- 946 00:54:32,935 --> 00:54:36,780 if I was running 100 trials, the other 93 trials 947 00:54:36,780 --> 00:54:39,610 all had to be failures.
948 00:54:39,610 --> 00:54:42,640 So I've simply got the product of all of those probabilities. 949 00:54:42,640 --> 00:54:44,790 And then we've got the combinatorics, 950 00:54:44,790 --> 00:54:49,020 the n choose x, which tells me how many different orderings 951 00:54:49,020 --> 00:54:53,850 could have occurred by which I would get the 7 successes 952 00:54:53,850 --> 00:54:58,220 and 93 failures for n equals 100. 953 00:54:58,220 --> 00:55:02,600 So that's simply the different numbers of combinations 954 00:55:02,600 --> 00:55:04,610 that can come up with that. 955 00:55:04,610 --> 00:55:08,830 So the notation here, by the way, that we would often use-- 956 00:55:08,830 --> 00:55:12,190 and I already snuck it in some other places-- 957 00:55:12,190 --> 00:55:18,520 is this little tilde symbol here we're 958 00:55:18,520 --> 00:55:24,340 using to read as "is distributed as some distribution." 959 00:55:24,340 --> 00:55:27,400 And I'm using the big B to indicate the binomial 960 00:55:27,400 --> 00:55:30,010 distribution, which has associated with it 961 00:55:30,010 --> 00:55:34,000 the underlying Bernoulli probability-- 962 00:55:34,000 --> 00:55:35,860 success for any one trial-- 963 00:55:35,860 --> 00:55:39,940 and then the number of repeated trials. 964 00:55:42,600 --> 00:55:45,580 So this is a discrete probability. 965 00:55:45,580 --> 00:55:51,160 What's the probability that x could take on 0.7? 966 00:55:51,160 --> 00:55:51,960 0, right? 967 00:55:51,960 --> 00:55:56,380 It's the number of successes out of this. 968 00:55:56,380 --> 00:55:58,330 And here are some examples that just give you 969 00:55:58,330 --> 00:56:00,640 a little bit of a feel for what the binomial 970 00:56:00,640 --> 00:56:04,180 distribution looks like. 971 00:56:04,180 --> 00:56:07,060 This is the number of successes plotted 972 00:56:07,060 --> 00:56:10,600 as a histogram for some values. 973 00:56:13,560 --> 00:56:15,250 I think that this is-- 974 00:56:15,250 --> 00:56:19,047 if you try it, I think this is a live spreadsheet. 975 00:56:19,047 --> 00:56:21,630 So actually, if you double-click on this from your PowerPoint, 976 00:56:21,630 --> 00:56:27,660 it may bring up the underlying Excel spreadsheet. 977 00:56:27,660 --> 00:56:30,600 So you can actually play with some of the parameters in this. 978 00:56:30,600 --> 00:56:34,960 I don't remember what either p or n was for this. 979 00:56:34,960 --> 00:56:37,080 But you can start to see, it's really-- 980 00:56:37,080 --> 00:56:45,000 it does not look quite normal because you can never have 981 00:56:45,000 --> 00:56:47,620 negative numbers of successes. 982 00:56:47,620 --> 00:56:50,400 It's always truncated. 983 00:56:50,400 --> 00:56:55,770 And you get these very non-normal kinds 984 00:56:55,770 --> 00:56:56,860 of distributions. 985 00:56:56,860 --> 00:56:58,930 This is a binomial distribution. 986 00:56:58,930 --> 00:57:01,890 But its location and its shape can change somewhat 987 00:57:01,890 --> 00:57:04,810 as you play with p and n. 988 00:57:04,810 --> 00:57:10,450 By the way, up here-- this is just the cumulative probability 989 00:57:10,450 --> 00:57:13,710 function, just saying the probability 990 00:57:13,710 --> 00:57:19,750 that I've got x less than or equal to some value. 991 00:57:19,750 --> 00:57:22,170 So that's also shown. 992 00:57:22,170 --> 00:57:25,260 This is also shown in the histogram, normalized 993 00:57:25,260 --> 00:57:29,250 to the fraction of products.
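[The 7-successes-in-100-trials number above is easy to reproduce directly from the n-choose-x formula; here is a small sketch assuming SciPy is available. The success probability p = 0.1 is an assumed illustration, since the slides' actual p and n are not recorded in the transcript:]

```python
from math import comb

from scipy import stats

n, x, p = 100, 7, 0.1  # trials, successes, success probability (p assumed)

# Binomial PMF by hand: C(n, x) * p^x * (1 - p)^(n - x)
pmf = comb(n, x) * p**x * (1 - p) ** (n - x)
print(pmf)                       # ~0.0889
print(stats.binom.pmf(x, n, p))  # same value from the library

# The cumulative curve mentioned above is P(X <= x):
print(stats.binom.cdf(x, n, p))  # ~0.206
```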
994 00:57:29,250 --> 00:57:33,650 And so now, you can start to look at calculating. 995 00:57:33,650 --> 00:57:36,365 If this were my data, and I simply-- 996 00:57:36,365 --> 00:57:38,390 it was actually coming from a line 997 00:57:38,390 --> 00:57:44,030 where I was looking at the probability of any one part 998 00:57:44,030 --> 00:57:47,000 succeeding or not, I could start to ask questions 999 00:57:47,000 --> 00:57:52,440 about the probability of seeing, out of 1,000 products 1000 00:57:52,440 --> 00:57:55,680 coming off the line, some number of defects 1001 00:57:55,680 --> 00:57:57,720 or some number of failed products. 1002 00:57:57,720 --> 00:58:01,628 You can appeal to the binomial distribution for that. 1003 00:58:01,628 --> 00:58:03,420 Now this is all still pretty coarse, right? 1004 00:58:03,420 --> 00:58:05,700 It's just a very simplified model-- 1005 00:58:05,700 --> 00:58:10,420 failure or success for yield. 1006 00:58:10,420 --> 00:58:12,460 Now another discrete distribution 1007 00:58:12,460 --> 00:58:20,170 is a Poisson distribution or also sometimes referred 1008 00:58:20,170 --> 00:58:24,160 to as an exponential distribution, 1009 00:58:24,160 --> 00:58:28,060 although terminology there sometimes varies, 1010 00:58:28,060 --> 00:58:33,870 depending on whether people are including this component 1011 00:58:33,870 --> 00:58:34,800 or not. 1012 00:58:34,800 --> 00:58:39,330 But the formal definition for the Poisson distribution 1013 00:58:39,330 --> 00:58:40,330 is shown here. 1014 00:58:40,330 --> 00:58:43,300 Now it continues to be a discrete distribution. 1015 00:58:43,300 --> 00:58:45,570 So I'm asking, what is the probability associated 1016 00:58:45,570 --> 00:58:52,810 with observing x taking on actual discrete integer values? 1017 00:58:52,810 --> 00:59:03,800 But this is a very nice distribution associated with 1018 00:59:03,800 --> 00:59:09,320 the kinds of operations that many of you saw in 2.850 or 2.8-- 1019 00:59:09,320 --> 00:59:11,600 yeah, 2.853. 1020 00:59:11,600 --> 00:59:14,900 The arrival times in queuing networks 1021 00:59:14,900 --> 00:59:18,380 will often be Poisson distributed. 1022 00:59:18,380 --> 00:59:21,980 But it also can come up when we are dealing 1023 00:59:21,980 --> 00:59:25,640 with very large numbers associated 1024 00:59:25,640 --> 00:59:30,290 with the binomial distribution as a very good approximation 1025 00:59:30,290 --> 00:59:32,670 to the binomial. 1026 00:59:32,670 --> 00:59:34,320 And this turns out to be really nice, 1027 00:59:34,320 --> 00:59:37,200 because if you actually go back to the binomial formula 1028 00:59:37,200 --> 00:59:43,990 and try to calculate it for situations where, say, n 1029 00:59:43,990 --> 00:59:48,640 or x are very, very large, or p or 1 minus 1030 00:59:48,640 --> 00:59:50,440 p is very, very small or very large, 1031 00:59:50,440 --> 00:59:53,170 very close to either 0 or 1, you end up 1032 00:59:53,170 --> 00:59:57,220 with some problems, some numerical problems. 1033 00:59:57,220 --> 01:00:01,120 Because if you actually try to calculate it for, let's say, 1034 01:00:01,120 --> 01:00:08,170 p is equal to 0.0001, or maybe 1 minus p is equal to that. 1035 01:00:08,170 --> 01:00:10,540 Let's say you had really, really high yield. 1036 01:00:14,420 --> 01:00:17,030 And I take that, so if that's 1 minus p-- 1037 01:00:17,030 --> 01:00:20,330 and I'm doing this for a sample of size a million. 1038 01:00:20,330 --> 01:00:25,410 I've got 0.0001 to the one millionth power.
1039 01:00:25,410 --> 01:00:29,220 And numerically, you start losing the digits. 1040 01:00:29,220 --> 01:00:31,660 You can hardly keep track of that. 1041 01:00:31,660 --> 01:00:33,990 But I might be asking, what is the probability 1042 01:00:33,990 --> 01:00:36,910 of some substantial number of failures? 1043 01:00:36,910 --> 01:00:39,180 And this, the combinatorics, end up 1044 01:00:39,180 --> 01:00:41,470 being a really, really large number. 1045 01:00:41,470 --> 01:00:46,530 So overall, the overall probability 1046 01:00:46,530 --> 01:00:50,760 of seeing 10 failures out of a million parts 1047 01:00:50,760 --> 01:00:52,570 might be substantial. 1048 01:00:52,570 --> 01:00:56,250 But to calculate it, you can't do it numerically, 1049 01:00:56,250 --> 01:00:59,490 because I've got a huge number times a really small number. 1050 01:00:59,490 --> 01:01:01,470 I get overflow or underflow. 1051 01:01:01,470 --> 01:01:04,470 And I can't actually calculate it. 1052 01:01:04,470 --> 01:01:10,080 What's useful is that in those kinds of situations, where, 1053 01:01:10,080 --> 01:01:15,260 say, n and p together-- the product of those things-- 1054 01:01:15,260 --> 01:01:17,750 are reasonable-size numbers, then 1055 01:01:17,750 --> 01:01:21,120 the Poisson distribution is a very, very good approximation. 1056 01:01:21,120 --> 01:01:25,730 And this applies to things where you have very, say, 1057 01:01:25,730 --> 01:01:28,240 low probability. 1058 01:01:28,240 --> 01:01:30,410 So p might be very small. 1059 01:01:30,410 --> 01:01:36,650 But I'm asking-- or I have many, many opportunities 1060 01:01:36,650 --> 01:01:41,810 to observe that very low-likelihood event. 1061 01:01:41,810 --> 01:01:45,290 So an example here that comes up in semiconductor manufacturing 1062 01:01:45,290 --> 01:01:51,630 are things like the probability of observing some number 1063 01:01:51,630 --> 01:01:53,340 of defects on a wafer. 1064 01:01:53,340 --> 01:01:56,610 The likelihood of seeing a point defect on any one location 1065 01:01:56,610 --> 01:01:58,470 is very, very, very small. 1066 01:01:58,470 --> 01:02:00,600 But I've got lots and lots of area 1067 01:02:00,600 --> 01:02:02,970 on the wafer-- lots and lots of opportunity 1068 01:02:02,970 --> 01:02:05,610 for the appearance of that small defect. 1069 01:02:05,610 --> 01:02:10,620 And so you can start to talk about the product 1070 01:02:10,620 --> 01:02:14,370 of those things or a rate per unit area 1071 01:02:14,370 --> 01:02:16,470 that starts to become reasonable. 1072 01:02:20,170 --> 01:02:24,010 Another example is the number of misprints on a page of a book. 1073 01:02:24,010 --> 01:02:26,560 You don't expect any one character 1074 01:02:26,560 --> 01:02:31,720 in a book to actually be a misprint. 1075 01:02:31,720 --> 01:02:35,260 But over the entire aggregate number of pages in your book, 1076 01:02:35,260 --> 01:02:37,180 you expect some number of misprints. 1077 01:02:37,180 --> 01:02:39,580 And the statistics that go with that 1078 01:02:39,580 --> 01:02:42,112 are typically Poisson distributed. 1079 01:02:42,112 --> 01:02:44,580 And I already mentioned that the mean and the variance, 1080 01:02:44,580 --> 01:02:48,150 if you actually apply those formulas to this distribution, 1081 01:02:48,150 --> 01:02:50,490 come out to the fascinating fact that they 1082 01:02:50,490 --> 01:02:52,950 are numerically the same value. 1083 01:02:52,950 --> 01:02:56,130 By the way, units-wise, they're not.
1084 01:02:56,130 --> 01:03:00,360 But x is an integer and-- 1085 01:03:00,360 --> 01:03:01,500 oops. 1086 01:03:01,500 --> 01:03:03,640 That should be x, by the way. 1087 01:03:03,640 --> 01:03:04,240 Come on. 1088 01:03:04,240 --> 01:03:04,830 Cut that out. 1089 01:03:10,710 --> 01:03:11,240 There we go. 1090 01:03:13,840 --> 01:03:16,660 So here are some example Poisson distributions. 1091 01:03:16,660 --> 01:03:20,410 You can start to see one here for a mean of 5. 1092 01:03:20,410 --> 01:03:23,830 It looks close to the binomial distribution 1093 01:03:23,830 --> 01:03:25,360 that I showed you earlier. 1094 01:03:25,360 --> 01:03:29,570 And then as the mean here, the lambda parameter, 1095 01:03:29,570 --> 01:03:31,340 is increasing, you can 1096 01:03:31,340 --> 01:03:36,560 start to see this distribution shifting to the right. 1097 01:03:36,560 --> 01:03:38,660 We said lambda is the mean. 1098 01:03:38,660 --> 01:03:43,280 It's also a characteristic of the variance. 1099 01:03:43,280 --> 01:03:48,690 The variance is also equal to lambda. 1100 01:03:48,690 --> 01:03:55,170 So that will also broaden out for larger 1101 01:03:55,170 --> 01:03:58,280 values of lambda. 1102 01:03:58,280 --> 01:04:02,300 There's another observation in here which is useful. 1103 01:04:02,300 --> 01:04:04,610 What are they starting to look like for large lambdas? 1104 01:04:08,342 --> 01:04:09,050 AUDIENCE: Normal. 1105 01:04:09,050 --> 01:04:10,130 PROFESSOR: Normal, right. 1106 01:04:10,130 --> 01:04:12,260 If you looked at that, it doesn't look very normally 1107 01:04:12,260 --> 01:04:12,830 distributed. 1108 01:04:12,830 --> 01:04:14,000 It's truncated. 1109 01:04:14,000 --> 01:04:15,920 It's a little bit skewed. 1110 01:04:15,920 --> 01:04:23,270 But another approximation is for large lambda, that also tends 1111 01:04:23,270 --> 01:04:25,560 towards a normal distribution. 1112 01:04:25,560 --> 01:04:28,640 So very often, you've got this succession 1113 01:04:28,640 --> 01:04:31,340 of approximations, where you might take a binomial, 1114 01:04:31,340 --> 01:04:32,960 approximate it as a Poisson. 1115 01:04:32,960 --> 01:04:37,280 But then for large numbers, a normal distribution also 1116 01:04:37,280 --> 01:04:42,130 can be a useful approximation. 
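[That succession of approximations can be checked numerically; a sketch assuming SciPy, with an illustrative million-part lot and a one-in-100,000 failure probability, so that lambda = n p = 10. The Poisson PMF matches the binomial, and the library sidesteps the overflow/underflow problem by working with logarithms internally:]

```python
from scipy import stats

n, p = 1_000_000, 1e-5  # many opportunities, tiny probability (illustrative)
lam = n * p             # lambda = 10 expected failures

x = 10                  # probability of exactly 10 failures
print(stats.binom.pmf(x, n, p))   # exact binomial:        ~0.125110
print(stats.poisson.pmf(x, lam))  # Poisson approximation: ~0.125110

# And for large lambda, the Poisson itself tends towards a normal
# with mean lambda and variance lambda:
big = 400
print(stats.poisson.pmf(big, big))                     # ~0.01994
print(stats.norm.pdf(big, loc=big, scale=big ** 0.5))  # ~0.01995
```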
1133 01:05:31,380 --> 01:05:36,060 But I'm highlighting the uniform distribution 1134 01:05:36,060 --> 01:05:38,850 because there are a couple of very standard questions, 1135 01:05:38,850 --> 01:05:42,975 that if you have a known PDF or CDF, 1136 01:05:42,975 --> 01:05:45,600 these are the kinds of questions that you're going to be asking 1137 01:05:45,600 --> 01:05:47,310 again and again and again. 1138 01:05:47,310 --> 01:05:48,780 And they're nice and intuitive off 1139 01:05:48,780 --> 01:05:51,390 of the uniform distribution. 1140 01:05:51,390 --> 01:05:54,400 When we get to the normal and other distributions, 1141 01:05:54,400 --> 01:05:56,580 they're not quite as intuitive. 1142 01:05:56,580 --> 01:06:01,530 But seeing them here for the uniform first, I think, helps. 1143 01:06:01,530 --> 01:06:03,480 One of the typical kinds of questions 1144 01:06:03,480 --> 01:06:08,760 is I want to know, what is the probability that some x is 1145 01:06:08,760 --> 01:06:12,330 less than or equal to some value if I were to draw it 1146 01:06:12,330 --> 01:06:15,350 from this underlying distribution-- 1147 01:06:15,350 --> 01:06:17,160 here, from the uniform distribution? 1148 01:06:17,160 --> 01:06:21,890 And so one could ask that using either 1149 01:06:21,890 --> 01:06:26,480 the PDF or the Cumulative Density Function. 1150 01:06:26,480 --> 01:06:28,430 And sometimes, one or the other, if they're 1151 01:06:28,430 --> 01:06:33,020 tabulated or available to you, is easier to use. 1152 01:06:33,020 --> 01:06:36,770 Clearly, if this is a Probability Density Function 1153 01:06:36,770 --> 01:06:41,390 here, I can ask it in terms of the interval question. 1154 01:06:41,390 --> 01:06:46,050 Oops, excuse me-- the interval question right here, and say, 1155 01:06:46,050 --> 01:06:53,410 well, the probability that x is less than or equal to that x1 1156 01:06:53,410 --> 01:06:56,950 is simply the integration up of that probability. 1157 01:06:56,950 --> 01:06:59,020 And you can do that numerically or just 1158 01:06:59,020 --> 01:07:01,990 by hand on such a simple distribution. 1159 01:07:01,990 --> 01:07:06,070 But that is actually exactly the value that 1160 01:07:06,070 --> 01:07:08,860 is tabulated on the Cumulative Density Function. 1161 01:07:08,860 --> 01:07:12,140 That's the definition of the Cumulative Density Function. 1162 01:07:12,140 --> 01:07:17,060 So if you've got the CDF, you simply look it up and say, 1163 01:07:17,060 --> 01:07:23,440 f of x1 is equal to whatever your value is 1164 01:07:23,440 --> 01:07:28,990 for that probability function. 1165 01:07:28,990 --> 01:07:32,180 Now similarly, you can also ask the question, 1166 01:07:32,180 --> 01:07:35,770 what is the probability that x sits within some range, 1167 01:07:35,770 --> 01:07:39,890 say, between x1 and x2? 1168 01:07:39,890 --> 01:07:41,510 And again, you can do that either 1169 01:07:41,510 --> 01:07:45,620 off of the underlying density function, just 1170 01:07:45,620 --> 01:07:47,510 integrating and saying, yes, x has 1171 01:07:47,510 --> 01:07:51,560 to lie between those values, and integrate up the density. 1172 01:07:51,560 --> 01:07:56,840 Or you can recognize that the probability that x 1173 01:07:56,840 --> 01:08:00,770 is less than x2 is simply that value 1174 01:08:00,770 --> 01:08:05,500 and subtract off that the probability that x 1175 01:08:05,500 --> 01:08:08,410 was less than x1 is that.
1176 01:08:08,410 --> 01:08:11,650 And so therefore, the difference between those two 1177 01:08:11,650 --> 01:08:16,899 corresponds to the integration on the underlying Probability 1178 01:08:16,899 --> 01:08:19,460 Density Function. 1179 01:08:19,460 --> 01:08:20,600 So that's pretty easy. 1180 01:08:20,600 --> 01:08:22,590 That should be pretty clear. 1181 01:08:22,590 --> 01:08:25,880 Let's talk about that also for the normal distribution 1182 01:08:25,880 --> 01:08:28,640 because some of those values are not 1183 01:08:28,640 --> 01:08:30,960 as easy to integrate up by hand. 1184 01:08:30,960 --> 01:08:34,160 In fact, there exist no closed-form formulas. 1185 01:08:34,160 --> 01:08:36,050 But they are tabulated for you. 1186 01:08:36,050 --> 01:08:39,149 And that's where going to the table 1187 01:08:39,149 --> 01:08:41,880 on the normal distribution for things like f of x 1188 01:08:41,880 --> 01:08:42,750 are going to-- 1189 01:08:42,750 --> 01:08:46,140 is an operation that you will actually perform quite a bit 1190 01:08:46,140 --> 01:08:50,160 when you're manipulating normal distributions. 1191 01:08:50,160 --> 01:08:51,560 So here's another plot. 1192 01:08:51,560 --> 01:08:54,979 We've already talked, or I've shown other examples here 1193 01:08:54,979 --> 01:08:56,600 of the normal distribution. 1194 01:08:56,600 --> 01:08:59,390 I've tagged off on this plot for us 1195 01:08:59,390 --> 01:09:03,170 a few useful little numbers to have as rules of thumb. 1196 01:09:03,170 --> 01:09:08,149 This is actually, I think, a moderately useful page 1197 01:09:08,149 --> 01:09:13,069 to print out and have off on the side for your use. 1198 01:09:13,069 --> 01:09:15,529 In particular, what I'm showing here 1199 01:09:15,529 --> 01:09:20,180 is for the normal distribution you've got a formula. 1200 01:09:20,180 --> 01:09:21,965 You're hardly ever going to actually plug 1201 01:09:21,965 --> 01:09:23,870 in values for the formula. 1202 01:09:23,870 --> 01:09:27,260 But if you look out plus 1 standard deviation, 1203 01:09:27,260 --> 01:09:30,439 plus 2 standard deviation, on the PDF, 1204 01:09:30,439 --> 01:09:33,080 I've tried to indicate here how rapidly 1205 01:09:33,080 --> 01:09:37,140 the value of that probability density falls off. 1206 01:09:37,140 --> 01:09:41,390 So for example, at one standard deviation, I'm at about 60% 1207 01:09:41,390 --> 01:09:42,290 of the peak. 1208 01:09:42,290 --> 01:09:45,740 At two standard deviations, I'm down to about 13.5% 1209 01:09:45,740 --> 01:09:48,029 of the peak. 1210 01:09:48,029 --> 01:09:53,100 Now rather than asking what the relative probabilities 1211 01:09:53,100 --> 01:09:58,930 of these things are, you're actually more often asking, what is-- 1212 01:09:58,930 --> 01:10:01,450 how much-- what is the integrated probability 1213 01:10:01,450 --> 01:10:07,120 density of the random variable out in some tail 1214 01:10:07,120 --> 01:10:09,340 or in some central region? 1215 01:10:09,340 --> 01:10:12,670 And that's where the Cumulative Density Function is really 1216 01:10:12,670 --> 01:10:14,780 the one that you want to use. 1217 01:10:14,780 --> 01:10:20,550 And so what I'm showing here is out for some number 1218 01:10:20,550 --> 01:10:25,170 of standard deviations-- this is mu minus 3 standard deviation. 1219 01:10:25,170 --> 01:10:28,860 This is saying the probability that x is less 1220 01:10:28,860 --> 01:10:36,290 than mu minus 3 sigma is exactly that value. 1221 01:10:36,290 --> 01:10:42,720 That equals f of mu minus 3 sigma.
1222 01:10:42,720 --> 01:10:44,550 And I simply look that up. 1223 01:10:44,550 --> 01:10:51,120 And that's about 0.00135, or roughly 0.1% of your data 1224 01:10:51,120 --> 01:10:55,230 should fall below minus 3 sigma on the left side 1225 01:10:55,230 --> 01:10:58,050 of your distribution. 1226 01:10:58,050 --> 01:11:01,290 And then I've tabulated that for two standard deviations, one 1227 01:11:01,290 --> 01:11:03,060 standard deviation. 1228 01:11:03,060 --> 01:11:05,640 By the way, what's the probability, now 1229 01:11:05,640 --> 01:11:10,400 that I've marked it up, that your data falls less 1230 01:11:10,400 --> 01:11:11,060 than your mean? 1231 01:11:13,740 --> 01:11:14,610 50%. 1232 01:11:14,610 --> 01:11:17,520 It's a symmetric distribution. 1233 01:11:17,520 --> 01:11:21,750 And so, in fact, you could then ask also 1234 01:11:21,750 --> 01:11:26,250 the question, what's the probability that my data is 1235 01:11:26,250 --> 01:11:29,850 all the way from my left tail up to two standard deviations 1236 01:11:29,850 --> 01:11:31,380 above the mean? 1237 01:11:31,380 --> 01:11:33,680 And that's 97.7%. 1238 01:11:33,680 --> 01:11:36,480 But I want to also point out these-- 1239 01:11:36,480 --> 01:11:43,810 this cumulative distribution is also anti-symmetric around the mean. 1240 01:11:43,810 --> 01:11:50,560 So this value and this value sum to 1. 1241 01:11:50,560 --> 01:11:53,050 So in other words, 1 minus whatever 1242 01:11:53,050 --> 01:11:57,910 is out in the upper tail is equal to the probability 1243 01:11:57,910 --> 01:12:00,280 of being below the lower tail. 1244 01:12:04,630 --> 01:12:09,510 So what's tabulated is not mu minus numbers 1245 01:12:09,510 --> 01:12:11,100 of standard deviations. 1246 01:12:11,100 --> 01:12:12,450 But what will often-- 1247 01:12:12,450 --> 01:12:16,440 what is actually tabulated is the standardized or unit 1248 01:12:16,440 --> 01:12:20,310 normal distribution-- again, the mean-centered version, 1249 01:12:20,310 --> 01:12:22,260 where I subtract off the mean and divide 1250 01:12:22,260 --> 01:12:25,210 by the standard deviation. 1251 01:12:25,210 --> 01:12:33,000 And that gives a PDF and a CDF that is universal. 1252 01:12:33,000 --> 01:12:39,370 And that is what will often be then tabulated 1253 01:12:39,370 --> 01:12:45,220 as the unit normal Cumulative Density Function. 1254 01:12:45,220 --> 01:12:47,770 In some sense, that's what I actually showed on this plot, 1255 01:12:47,770 --> 01:12:51,040 by just labeling it as a function of mu and standard 1256 01:12:51,040 --> 01:12:52,090 deviations. 1257 01:12:52,090 --> 01:12:57,160 But now when you normalize, that becomes in units of z as 0 1258 01:12:57,160 --> 01:13:01,570 and the numbers of standard deviations off on the side. 1259 01:13:01,570 --> 01:13:05,100 Now, if you look at the back of Montgomery, 1260 01:13:05,100 --> 01:13:06,860 there is a whole bunch of these tables. 1261 01:13:06,860 --> 01:13:09,360 And you'll be using these tables in some of the problem sets 1262 01:13:09,360 --> 01:13:10,980 and so on. 1263 01:13:10,980 --> 01:13:14,520 And there is a table for the unit normal. 1264 01:13:14,520 --> 01:13:20,790 And in particular, what's tabulated 1265 01:13:20,790 --> 01:13:24,780 is this Cumulative Density Function for the unit normal.
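[In place of the table at the back of Montgomery, a library CDF call does the same lookup; a quick sketch assuming SciPy is available:]

```python
from scipy.stats import norm

# Standardize first: z = (x - mu) / sigma, then look up the unit normal CDF.
print(norm.cdf(-3.0))  # ~0.00135: fraction falling below mu - 3 sigma
print(norm.cdf(0.0))   # 0.5: half the data falls below the mean
print(norm.cdf(2.0))   # ~0.9772: from the left tail up to mu + 2 sigma

# The anti-symmetry used above: F(-z) + F(+z) = 1
print(norm.cdf(-2.0) + norm.cdf(2.0))  # 1.0
```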
1266 01:13:24,780 --> 01:13:26,910 And we have a little bit of terminology 1267 01:13:26,910 --> 01:13:28,590 here that I want to alert you to, 1268 01:13:28,590 --> 01:13:32,460 because we often talk about percentage points off 1269 01:13:32,460 --> 01:13:36,240 of some distribution or percentage points of the unit 1270 01:13:36,240 --> 01:13:38,700 normal, as pictured here. 1271 01:13:38,700 --> 01:13:45,510 And what we're talking about is relating percentages 1272 01:13:45,510 --> 01:13:48,660 of my distribution that are in some location, usually 1273 01:13:48,660 --> 01:13:52,980 the tails, to numbers of standard deviations 1274 01:13:52,980 --> 01:13:57,810 that I have to go in order to apportion that amount over 1275 01:13:57,810 --> 01:14:00,330 in the tails or in the central regions. 1276 01:14:00,330 --> 01:14:06,420 So a very typical question I might ask is, how many z's-- 1277 01:14:06,420 --> 01:14:11,900 how many "unit standard deviations," how many z's-- 1278 01:14:11,900 --> 01:14:16,610 do I have to go away from the mean 1279 01:14:16,610 --> 01:14:22,160 in order to get some alpha or some percentage 1280 01:14:22,160 --> 01:14:27,500 of the distribution located out in those tails? 1281 01:14:27,500 --> 01:14:30,260 So for example, I might say I want 1282 01:14:30,260 --> 01:14:38,580 the 20% percentage 1283 01:14:38,580 --> 01:14:46,890 point, the 0.2 total probability that my data sits in the two tails. 1284 01:14:46,890 --> 01:14:52,590 So for a total probability of 0.2 that 1285 01:14:52,590 --> 01:14:55,140 a portion of my data is out in either 1286 01:14:55,140 --> 01:15:03,480 of the tails, farther away than some z, that means 10% 1287 01:15:03,480 --> 01:15:04,800 is in each of the tails. 1288 01:15:04,800 --> 01:15:08,520 And I'm asking the question, how far-- 1289 01:15:08,520 --> 01:15:11,130 how many standard deviations do I 1290 01:15:11,130 --> 01:15:13,770 have to go to get 10% in the left tail 1291 01:15:13,770 --> 01:15:17,200 and 10% out in the right tail? 1292 01:15:17,200 --> 01:15:19,600 So I'm essentially asking the question, 1293 01:15:19,600 --> 01:15:24,550 on the cumulative unit 1294 01:15:24,550 --> 01:15:28,230 normal Probability Distribution Function, 1295 01:15:28,230 --> 01:15:30,060 how many z's do I have to go to get 1296 01:15:30,060 --> 01:15:33,540 half of that alpha probability 1297 01:15:33,540 --> 01:15:36,600 into each of the tails? 1298 01:15:36,600 --> 01:15:40,890 One observation here is that these things are, again, 1299 01:15:40,890 --> 01:15:42,240 anti-symmetric. 1300 01:15:42,240 --> 01:15:46,140 So I can also ask the question either looking 1301 01:15:46,140 --> 01:15:49,940 at just the right tail or the left tail. 1302 01:15:49,940 --> 01:15:54,510 And then you can do the inverse operation using the table. 1303 01:15:54,510 --> 01:15:56,360 So I'm actually asking the question, what 1304 01:15:56,360 --> 01:15:58,980 is the z associated with that? 1305 01:15:58,980 --> 01:16:02,280 And I'm looking up on this plot. 1306 01:16:02,280 --> 01:16:06,980 So I might ask, OK, I need 10% there in the tail. 1307 01:16:06,980 --> 01:16:09,590 How many z's does that correspond to? 1308 01:16:09,590 --> 01:16:12,380 And to get 10% out in that left tail, 1309 01:16:12,380 --> 01:16:17,090 I've got to go out 1.28 standard deviations off to the left. 1310 01:16:17,090 --> 01:16:22,050 That's the operation that one would look up in the table.
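[That inverse operation is the percentage-point, or quantile, lookup; a sketch assuming SciPy, for the 20% two-tailed example just described:]

```python
from scipy.stats import norm

alpha = 0.20                    # total probability split across both tails

# Left-tail lookup: 10% of the distribution falls below this z.
z_left = norm.ppf(alpha / 2)
print(z_left)                   # ~-1.2816 standard deviations

# By anti-symmetry, the right-tail percentage point is just the negative:
print(norm.ppf(1 - alpha / 2))  # ~+1.2816
```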
1311 01:16:22,050 --> 01:16:28,910 So very often, you would get to these kinds of lookups, 1312 01:16:28,910 --> 01:16:39,590 where you're relating the probability alpha of your data 1313 01:16:39,590 --> 01:16:44,470 lying below that number of standard deviations 1314 01:16:44,470 --> 01:16:46,810 and what that corresponding standard deviation is. 1315 01:16:51,440 --> 01:16:56,510 So I didn't copy one of the tables out of Montgomery, 1316 01:16:56,510 --> 01:17:00,900 but you'll get some practice with that on the problem sets. 1317 01:17:00,900 --> 01:17:03,110 Now, there's other related operations 1318 01:17:03,110 --> 01:17:04,860 you can do once you have that. 1319 01:17:04,860 --> 01:17:09,650 So for example, now I can ask, what is the probability 1320 01:17:09,650 --> 01:17:12,590 not just that data lies out in the tail, 1321 01:17:12,590 --> 01:17:16,280 but what are the probabilities that it also or instead lies 1322 01:17:16,280 --> 01:17:17,660 in the middle region? 1323 01:17:17,660 --> 01:17:20,960 They're all the same kinds of operations. 1324 01:17:20,960 --> 01:17:25,190 And so for example, here's a quick tabulation 1325 01:17:25,190 --> 01:17:28,130 for three different kinds of examples, 1326 01:17:28,130 --> 01:17:30,620 where I'm asking not what is out in the tails, 1327 01:17:30,620 --> 01:17:35,420 but I'm asking what is within the center plus/minus 1 sigma 1328 01:17:35,420 --> 01:17:37,220 region of the data? 1329 01:17:37,220 --> 01:17:40,010 And if you look very carefully, I'm 1330 01:17:40,010 --> 01:17:44,060 using exactly these Cumulative Density 1331 01:17:44,060 --> 01:17:45,950 Functions for the unit normal. 1332 01:17:45,950 --> 01:17:47,750 This is for a unit normal. 1333 01:17:50,930 --> 01:17:53,570 And looking out, what's the cumulative probability 1334 01:17:53,570 --> 01:17:54,680 over in the left tail? 1335 01:17:54,680 --> 01:17:55,550 The right tail? 1336 01:17:55,550 --> 01:17:57,170 Doing those observations. 1337 01:17:57,170 --> 01:17:59,840 But these are also very nice rules of thumb 1338 01:17:59,840 --> 01:18:07,740 to have ready for you, which is saying within plus/minus 1 1339 01:18:07,740 --> 01:18:12,810 standard deviation in the normal, 68% of your data 1340 01:18:12,810 --> 01:18:16,070 is going to fall in that 1 sigma region. 1341 01:18:16,070 --> 01:18:21,540 If I expand out to 2 sigma, 1342 01:18:21,540 --> 01:18:26,140 now roughly 95% of my data should fall in there. 1343 01:18:26,140 --> 01:18:29,340 And if I expand out even further to the 3 sigma, 1344 01:18:29,340 --> 01:18:33,690 that's 99.7% of your data 1345 01:18:33,690 --> 01:18:39,900 that should fall within those center plus/minus 3 standard deviations. 1346 01:18:43,220 --> 01:18:45,170 So the percentage points out there, 1347 01:18:45,170 --> 01:18:48,950 the part that falls outside of that, is about 1348 01:18:48,950 --> 01:18:50,900 3 in 1,000. 1349 01:18:50,900 --> 01:18:54,590 We'll come back to this when we see statistical process control 1350 01:18:54,590 --> 01:18:58,470 and control charts because you may have run into these control 1351 01:18:58,470 --> 01:18:58,970 charts. 1352 01:18:58,970 --> 01:19:03,980 We're often plotting the 3 sigma control limits.
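[Those central-coverage rules of thumb, including the 3-in-1,000 control-limit fraction picked up again just below, all come from the same CDF manipulation, F(+k) minus F(-k); a quick sketch assuming SciPy:]

```python
from scipy.stats import norm

for k in (1, 2, 3):
    inside = norm.cdf(k) - norm.cdf(-k)  # probability within +/- k sigma
    print(f"+/-{k} sigma: {inside:.4f} inside, {1 - inside:.4f} outside")

# +/-1 sigma: 0.6827 inside; +/-2 sigma: 0.9545; +/-3 sigma: 0.9973,
# leaving about 3 in 1,000 outside the 3 sigma limits.
```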
1353 01:19:03,980 --> 01:19:05,690 And essentially what we're saying 1354 01:19:05,690 --> 01:19:10,070 is only a very small fraction of my data-- 1355 01:19:10,070 --> 01:19:13,100 3 out of 1,000, if I'm using plus/minus 1356 01:19:13,100 --> 01:19:14,870 3 sigma control limits. 1357 01:19:14,870 --> 01:19:18,140 3 out of 1,000 points of my data, by random chance 1358 01:19:18,140 --> 01:19:22,775 alone, should be falling outside of those 3 sigma bounds. 1359 01:19:25,440 --> 01:19:33,150 So that starts to get us close to statistical process control. 1360 01:19:33,150 --> 01:19:36,470 So what we're going to do next time 1361 01:19:36,470 --> 01:19:41,030 is start to look a little bit more closely at statistics. 1362 01:19:41,030 --> 01:19:46,110 When I do form, again, things like the sample 1363 01:19:46,110 --> 01:19:53,990 mean, or I form the sample standard deviation or sample 1364 01:19:53,990 --> 01:19:58,880 variance from my data, those themselves 1365 01:19:58,880 --> 01:20:02,000 have these probability densities associated with them. 1366 01:20:02,000 --> 01:20:05,810 And from that, we're going to be able to go backwards 1367 01:20:05,810 --> 01:20:13,040 and essentially work to try to understand things 1368 01:20:13,040 --> 01:20:16,490 about the underlying process distribution, the parent 1369 01:20:16,490 --> 01:20:20,700 probability distribution function, associated with that. 1370 01:20:20,700 --> 01:20:22,910 So we're going to have to understand 1371 01:20:22,910 --> 01:20:29,990 more complicated PDFs than the normal distribution 1372 01:20:29,990 --> 01:20:32,090 because things like the sample variance 1373 01:20:32,090 --> 01:20:34,730 is not going to be normally distributed. 1374 01:20:34,730 --> 01:20:39,020 It's going to have its own bizarre distribution-- 1375 01:20:39,020 --> 01:20:41,250 in this case, the chi-square distribution. 1376 01:20:41,250 --> 01:20:44,330 So we'll return to looking at some additional distributions, 1377 01:20:44,330 --> 01:20:47,730 but these same manipulations will come up again. 1378 01:20:47,730 --> 01:20:51,590 And what we're ultimately going to want to be able to do is 1379 01:20:51,590 --> 01:20:54,890 make inferences about the underlying distribution-- 1380 01:20:54,890 --> 01:20:56,660 the parent process-- 1381 01:20:56,660 --> 01:20:59,510 what its mean is, what its variance is, 1382 01:20:59,510 --> 01:21:02,990 based on the calculated sample mean and sample variance 1383 01:21:02,990 --> 01:21:06,080 that we might be using, and then also make 1384 01:21:06,080 --> 01:21:10,040 inferences about the likelihood that the true mean 1385 01:21:10,040 --> 01:21:12,200 lies in certain ranges. 1386 01:21:12,200 --> 01:21:15,380 Or to put it another way, next time, 1387 01:21:15,380 --> 01:21:20,360 we'll also be talking about confidence intervals. 1388 01:21:20,360 --> 01:21:22,610 So we'll see you on Thursday. 1389 01:21:22,610 --> 01:21:30,500 Watch for the message from Hayden about tours and enjoy.