The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So as I was saying, what we want to do is get up through the use of some of these statistical distributions for making hypothesis tests and understanding the probabilities associated with hypotheses such as "this point belongs to this distribution or that distribution." And that will set the ground for talking about statistical process control and SPC charting, where you're asking the question of a new piece of data off of the manufacturing line: does that piece of data come from the in-control distribution, or does it come from some out-of-control distribution? So it's all about probabilities on SPC charts. And we want to build up the rest of the machinery that we need for that today.

To do that, one of the subtle things that we have to understand a bit more about is sampling and sampling distributions. And really what we're dealing with here is the use of statistics on observed data. I have this philosophical picture of what I think of as the meaning of statistics. Our real goal in statistics is to reason about, think about, and be able to argue about processes-- in our case, real manufacturing processes-- when there's uncertainty in those processes. There is noise. There are other things we don't know. But the key idea in statistics is we are getting some evidence. We're getting some data. And what we want to be able to do is use that data to start to infer things back about the underlying population, the underlying process or distribution.

So there are some preconditions in here. A lot of what I said here is we're reasoning based on evidence from observed data.
But that really means we are fundamentally taking a probability model of what's going on. And we talked last time, for example, about assumptions with normal distributions and the parameters of normal distributions. What we're going to do today is focus a little bit more on evidence coming from finite sets of observations drawn from that population, and then the calculations we do on those-- simple calculations, like calculating the sample mean. And then we have this number, this sample mean. What's it really telling us? What can we infer back about the underlying distribution-- what the true mean of the underlying population is? A little bit later, we'll flesh this out more.

But already, even as we start building these simple arguments based on our data, we have an underlying implicit model of the process. It may be a purely probabilistic model, saying it has a certain mean and a Gaussian distribution-- a normal, or a uniform, or a Poisson. There is a model there. And so we have to keep in mind that it is only a model. A little bit later, we'll also build up other kinds of functional relationships when we get to things like response surface modeling. But for now, these are relatively simple models, mostly focused on the probabilistic or stochastic nature of the process.

So here's the plan for today. What we're going to do is talk a little bit about sampling distributions. We touched on this a little bit last time when we talked about the distribution of the sum of random variables and the central limit theorem, where the sum or the average always tends towards the normal. In some of the cases, we're going to be calculating things like the sample s squared, the sample variance, that are not going to be normally distributed. They will have other statistical shapes or statistical distributions, such as the chi-squared. There will be other cases where the Student t-distribution is applicable.
So we want to get a sense of these sampling distributions and understand how to use those to make not only point estimates-- that is, our best guess of things like the underlying population mean-- but also confidence intervals-- where, with some probability, we think the true mean lies, or where, with some probability, we think the true variance lies, based on one set of observations. So that's where the sampling distributions come into play. And we'll talk about the effect of sample size on that, as well as what kinds of inferences-- these point and confidence interval inferences-- we can make. And then, again, we're leading up towards hypothesis testing. And then really, this will be for next time: we'll dive into SPC charts.

So here's how we typically are using sampling. We have some underlying-- I'll refer to it as the population distribution, or sometimes the parent distribution. It's the set or universe of all possible parts, say, coming off your manufacturing line, or all possible observations. What we're typically going to do is just draw some finite number of samples, some n samples of the process output-- so some x sub i drawn from a parent distribution with some PDF p. And what we're going to be doing is calculating the sample mean, sample variance, and other sorts of sample statistics.

A key point here is that the underlying process, that basic variable x, has a probability distribution function associated with it. This new variable x bar that we calculate, this statistic that we calculate, also has a probability density function associated with it. And it's a different one than the parent one. And so what we'll need to understand is what those probability distributions are that arise from sampling, and then how to work backwards from those to make inferences about the parent.

Now, a quick thing.
I guess there are both definitions on this slide, but also a quick thing about definitions, terminology, and notation that I like to use. In particular, I'm, again, distinguishing between the population or parent distribution and these sample statistics. Typically, when I talk about "truth" or the population as a whole, we're using Greek variables like mu, sigma, and rho sub xy for the correlation coefficient. And those expectations, those different moments, are calculated over the entire population. Typically we're doing those analytically, if we have a closed-form description of what the population is. In contrast, I'm going to typically use Roman characters-- x bar, s, and r sub xy, for example-- to indicate the finite sample statistics calculated from some n number of observations. So that's when we have a finite, discrete number of observations. And we have simple formulas for the calculation of those statistics.

A little bit later in the term, we will come back and start to look in particular at covariance and correlation between two different random variables, some x and y. Those are especially important when we're looking for functional dependencies. Right now, we're simply looking at one set of data or one population, one random variable x. So we'll focus on univariate stuff today.

There is a term, "random sampling," that actually has a technical definition that I want to point out. It's very close to the intuitive notion here, but it is a little bit stronger in its requirements. We said sampling is this act of taking some finite observations out of a population. Random sampling is when every observation that we pull is identically distributed-- has the same PDF associated with it-- and is independent from any other sample that we pull from that population.
And this would not always naturally be the case-- if you had, for example, a finite population, and you pulled out a sample, held it in your hand, recorded it, and then pulled out another sample. Imagine that you've got a bag with 17 blue and red marbles in it. And I pull a marble out, and it's red. I hold it in my hand, and I pull another marble out. Do you think I'm sampling from the same underlying distribution? No, because I did not replace that original marble. So now the mix of blue and red marbles is different within that bag, and the probability is different. It is not identical and independent anymore. The observation that I made first, based on the first draw, changes the probability for later draws-- there is dependence, as well as no longer an identical distribution.

So when we do random sampling, as I'm defining it here-- and random sampling for calculation of some of these sampling distributions-- we're assuming that if it's coming from a finite population, you would always put the observation back in and do another sample from the same pool. Typically what you're often doing is assuming there's no connection from one to the other, and the same process physics is operable from one point in time to the next. So we are typically making this IID, this Independent and Identically Distributed, assumption.
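[A minimal simulation sketch of the marble-bag point, under an assumption: the lecture only says the bag holds 17 blue and red marbles, so the 10 red / 7 blue split below is a hypothetical illustration.]

```python
# Sampling with vs. without replacement from a small finite population.
# Without replacement, the second draw depends on the first: the draws
# are no longer independent and identically distributed (IID).
import numpy as np

rng = np.random.default_rng(0)
bag = np.array([1] * 10 + [0] * 7)  # 1 = red, 0 = blue; 17 marbles (assumed split)

for replace in (True, False):
    draws = np.array([rng.choice(bag, size=2, replace=replace)
                      for _ in range(50_000)])
    first_red = draws[:, 0] == 1
    p = draws[first_red, 1].mean()  # P(second is red | first was red)
    print(f"replace={replace}:  P(2nd red | 1st red) ~ {p:.3f}")
    # with replacement ~ 10/17 = 0.588; without ~ 9/16 = 0.563
```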
And then we're going to, again, as I said, calculate some statistics from those. Ultimately, when you have a sample-- say a sample of size 14, drawn from a big population-- you calculate x bar. What do you get? A number. You get an actual number, because I observed those 14 things and measured length or whatever it was that I was measuring on them. And so a key point here is that the statistic is a function of the sample and the sample data. And so it's actually a value that you can compute. If I do that-- I grab one sample, I calculate that x bar-- I've got one number.

If I were to go back and draw another sample from that distribution, I get a different number. And so if I keep going back and drawing multiple, multiple samples, that's how you build up a distribution function associated with that statistic, that calculation. So that's where this notion of a statistic-- x bar or whatever-- as a random variable also comes into play. For any one sample it's a number. But when I go and take multiple samples, multiple sets of n, now I build up a distribution function associated with those.

I'm going to switch here to-- this is here on the web. I mentioned last time this very nice website. I don't even know what the acronym stands for-- this SticiGui. It's out of the Department of Statistics at Berkeley. It's got a lot of different-- I guess sort of an online course kind of thing. But what I really like in this is the Tools tab. So if I go to that Tools tab-- let me do that-- it's got a number of these little Java utilities online. And one that I want to look at here first is sampling distributions. So let's see. Let this load.

So here's an example of sampling from some a priori distribution. And this is actually drawing from a uniform distribution with discrete values 0, 1, 2, 3, and 4. So that's our underlying true population, and they all have equal probabilities. And what I'm going to do is draw a sample down here at the bottom. It's a sample of size 5. So I'm going to do random sampling with replacement-- I'm going to draw five independent and identically distributed samples out of that underlying parent distribution. And then I'm going to calculate some statistic. What I want to do is to actually calculate the sample mean. So there in blue is our underlying population. Let me take one sample of size 5, calculate the mean, and plot it.
There it is. It's a mean of 1.4. Let me take another sample. I take another sample-- do you think the value is going to be 1.4 again? It might be.

AUDIENCE: Might be.

PROFESSOR: But probably not, right? Let's see what happens. There it is-- 2.4. Let me do a few more. So the green bars are popping up, as I think I've done something like 1, 2, 3, 4, 5, 6-- something like 8 different samples, each of size 5, and plotted the mean. Now, to speed things up, I can keep taking more and more samples. What distribution do you think this is trending to?

AUDIENCE: Normal.

PROFESSOR: Normal. Down here at the bottom, I can take samples that are a little bit larger. Or-- excuse me-- the tool lets me set how many samples I'm taking, so I don't have to just take one sample of five and plot it. I can take 10 samples, each of size 5, and plot them. So it's just speeding up my button clicks so that we can get a little bit better shape on that.

So there's the point. That's a very fascinating point. I find it fascinating that I can sample from a non-normal distribution, take the average-- the sample average, x bar-- and over lots and lots of sampling, I get a normal distribution. What else? What other observations or what other points might you make about that green distribution? What do you think is true about it? There's a really important fact which motivates why we can't just calculate x bars all the time and believe the numbers that come out of an x bar calculation.

AUDIENCE: It's centered around 2.

PROFESSOR: It's centered around 2. Out of the numbers 0, 1, 2, 3, and 4, what do you think the average is-- the true average? 2. So one thing that's very nice about the sample mean is that it trends toward the true population mean. It's unbiased.
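[A short sketch of the applet demo in code-- not the SticiGui tool itself: draw many samples of size 5, with replacement, from the discrete uniform population {0, 1, 2, 3, 4} and look at the distribution of the sample mean.]

```python
# Build up the sampling distribution of x-bar by repeated sampling.
import numpy as np

rng = np.random.default_rng(0)
population = np.array([0, 1, 2, 3, 4])  # true population mean is 2.0

n = 5                # size of each sample
n_samples = 100_000  # number of samples ("button clicks")

x_bar = rng.choice(population, size=(n_samples, n)).mean(axis=1)

print("one x-bar:", x_bar[0])               # a single sample gives just a number
print("mean of all x-bars:", x_bar.mean())  # ~2.0: the sample mean is unbiased
print(np.histogram(x_bar, bins=21)[0])      # counts trace out a bell shape
```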
That is, if I were to take enough samples, the average-- the mean of all of these sample averages-- is equal to the true underlying population mean. It's unbiased. It doesn't have a bias or delta, a fixed offset error, in it. It is an unbiased estimator. So I can take lots and build that up.

It turns out there's another thing that's true, which I don't want to go into and don't want to try to prove. But it turns out that the sample mean is not only unbiased, it's also the minimum error estimator. So on average, it's the best estimator of the mean that you can use as a statistic, meaning its distribution in some sense is the narrowest. The x bar distribution is the narrowest estimator you can have for trying to estimate the population mean based on your samples.

Now, another important thing that comes up here is that at least a few of the times, I got a sample mean that was 0.6. Is it wrong? If you do just one sample, it's quite possible that, out of this set, I drew a sample of size 5 and got a value of 0.6. That's all the data you have. What's your best guess for the true mean of the underlying population? That 0.6, whatever that value was. But now there is some spread on it. And so if you're wise, you would also start to want to hedge your bets a little bit here, right? You want to be able to say: my best guess is 0.6, but I know I'm only drawing a sample of size 5, so I know there is, in fact, this kind of Gaussian spread, and I think the true mean probably lies within some range of that. And so you would like to have this confidence interval idea. We'll get back to that a little bit later. In fact, there's another very nice little tool in here for illustrating confidence intervals that we'll use at that point.

I want to do one more thing, and then we'll go back to the lecture slides.
One of the neat things you can do with this tool-- and it's lots of fun for you guys to connect up with it and play-- is you can change the sample size. Let's say you wanted a better or a tighter estimate for the x bar. You're not happy with the idea that sometimes, with fairly substantial probability, you might be off by plus or minus 1-- that you have a substantial probability of guessing the sample mean to be more than one value away from the true population mean. What might you do to try to improve your likelihood of being closer to the true mean when you're doing sampling?

AUDIENCE: More samples.

PROFESSOR: More samples? I guess you could do more samples. But in some sense, really, taking one sample of size 5 and another sample of size 5 is like one sample of size 10. Larger samples.

AUDIENCE: Oh, yeah, larger samples.

PROFESSOR: Larger samples. So if I do that here-- instead of samples of size 5, let's do a modest increase first and take samples of size 10. See what happens now. OK, that's good. I'm taking a lot of samples here-- I've taken several hundred samples, each of size 10. And sure enough, that distribution is a little bit tighter. Let's see if I take a really big sample, a sample of size 100. Yeah, looking a lot tighter.

So one question is-- we know that as I take larger samples, the distribution gets tighter-- one of the things we want to do is understand how much tighter it gets as a function of the sample size. So it turns out-- let me go back now-- it turns out that if I'm sampling from a parent distribution, the variance in the estimate of that x bar-- the variance of x bar itself-- shrinks with sample size n. The variance in fact scales as 1 over n. It scales inversely proportional to the size of the sample.
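[A sketch checking that 1-over-n scaling empirically: the standard deviation of x-bar should track sigma over root n as the sample size grows.]

```python
# Spread of the sample mean vs. sample size n.
import numpy as np

rng = np.random.default_rng(1)
population = np.array([0, 1, 2, 3, 4])
sigma = population.std()  # true population sigma = sqrt(2)

for n in (5, 10, 100):
    x_bar = rng.choice(population, size=(50_000, n)).mean(axis=1)
    print(f"n={n:3d}   sd of x-bar = {x_bar.std():.4f}   "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```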
That's true always as you take larger samples. For the special case where my underlying population in fact has a true probability distribution function that is normal, it turns out that x bar is not just trending towards the normal but is itself, even for very small numbers of samples, also a normal distribution. So in that little demo I showed you, drawing from a uniform distribution, for large enough n's-- large enough samples-- the mean does trend towards a Gaussian. But it's an even stronger statement, a stronger relationship, if the underlying population is itself normal.

So let's say we start with an underlying random variable, an underlying process x, that has some mean and some variance. Now if I take samples of size 1 and plot out the distribution, what do you think it looks like?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yeah. I'm just repeating-- I'm replicating my underlying distribution, right? So for the special case of a sample of size 1, if I do that long enough, I build up the same distribution. But now, if I take larger samples, even a little bit larger with n equals 2, again, we get that effect that we saw with the SticiGui of the narrowing of the distribution, the PDF associated with the x bar. And in particular, the PDF or Probability Distribution Function associated with x bar is exactly normal, with the same mean-- it's unbiased-- and with reduced variance. The variance goes as 1 over n. So we start with the population distribution here, and we end up with a sample mean distribution that is a different PDF. Everybody clear on this?

So key points-- the statistic itself is a random variable and has its own probability distribution function. Now what we want to do is reason about the underlying population based on those observed statistics.
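[Stated compactly, the result described here is:]

$$x_i \sim N(\mu, \sigma^2) \quad\Longrightarrow\quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \;\sim\; N\!\left(\mu, \frac{\sigma^2}{n}\right)$$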
Somebody's cell phone is going crazy. Not mine. Everybody hear that click? Can you even hear that click in Singapore? Yeah? All right. Hopefully that will go away in a second.

So once we know the sampling distribution, say, for x bar, now we can argue about the probabilities associated with observing particular values of x bar. We can make observations or arguments about how much probability is out in the tails of these things. And then we can invert backwards and reason about the actual population mean. And again, we're after not only the point estimates, our best guess, but also interval estimates-- confidence intervals where we think the actual value is going to lie. And these are critically dependent on probability calculations with the sampling distribution.

So here's an example. Suppose that we start out with some assumptions-- some a priori beliefs about the distribution of some parameter. In particular, we're interested in the thickness of some part. We don't know the mean of it. But based on maybe lots and lots of historical data, we do believe we know a couple of things. We know its variance-- the standard deviation is 10. So let's just assume that we know the standard deviation. And the second thing we know is that the thickness of these parts is normally distributed. Those are our starting assumptions, our a priori assumptions.

Now what we do is we go and we draw 50 different random parts, with the IID assumption, and we calculate the average thickness from those. And I'll tell you, of those n equals 50 samples, the actual sample mean that comes out from that one sample of size 50 is 113.5. There you go. You're blessed with that piece of data.

Now the first question here, based on what we've seen, is: what is the distribution of the mean of the thickness? What is the PDF associated with t bar? Everybody should know this. What's t bar distributed as?

AUDIENCE: It's normal.
PROFESSOR: It's normal, right.

AUDIENCE: Centered around the mean.

PROFESSOR: Centered around the mean, so it would have the same unknown mu. And what would its variance be?

AUDIENCE: 2.

PROFESSOR: 2, very good. So it has the same mean, and the variance scales as 1 over n. We had 50 samples, so the variance goes down by that factor.

One quick notation point here: when we use this notation of normal with mu and sigma squared, I try to be very consistent and put the mean and the variance in there. You will sometimes find different texts and different writers putting the mean and the standard deviation. So you always want to confirm that, because one is the square of the other. So be a little bit careful on that. I try to be consistent and have the second parameter be the variance.
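[In that N(mean, variance) notation, the answer the class just gave works out as:]

$$\bar{T} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) = N\!\left(\mu, \frac{10^2}{50}\right) = N(\mu, 2)$$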
So that was a first easy question. We know that based on sampling theory: we know the distribution function for the sample mean. Now the key question is, how do we use that to reason about the actual population mean? Well, the best guess is easy already. But the more subtle question that we've been talking about is, where do we think the true mean of the population lies, based on this one observation? What range do we think the true mean has, with some degree of confidence? Do you think it's plus or minus 2 around that mean? Do you think it's plus or minus 20 around that mean? If I were to ask you to bet your life on what the true mean is, you would want to be able to say, with some degree of confidence, it's actually within this amount of distance.

I have to say one more thing, because if I said it's within some amount of distance of that-- well, with non-zero probability, that thickness could take on values all the way from plus infinity, if it's truly normally distributed, all the way down to-- not quite negative infinity, because a thickness is truncated at 0. So it's still an approximate model. So if I just asked you to bet your life-- tell me where you think the true mean is-- if you wanted a 100% chance of saving your life, you'd say it could be anything. So when we're talking about confidence intervals, I have to give you another piece of bounding information. I want the range: how far away from that one observation of the mean do I need to be, with some probability? With 95% confidence-- 95% of the time-- where do we think the true mean would lie? What that means is, if I were to go and take another 50 samples and calculate the mean again, we have that distribution. And what we're looking for is that 95% central region of the PDF associated with x bar, which is where, 95% of the time, the mean is actually going to lie.

So that gets us, pictorially and formulaically, to this notion of the confidence interval and how we actually go about calculating it. What we've got in this situation is that the variance is known, so I'm not trying to estimate the variance. I'm just trying to reason about the mean. And I want to estimate it to some confidence interval. You always have this chance of being wrong when you talk confidence intervals. You've got some alpha probability that the true mean is even further away than you think in your interval. But you're trying to quantify that and bound that. So we typically talk about, say, an alpha of 5% or maybe 1% probability of being outside of your interval. So there's this alpha probability of error associated with any confidence interval.

So that's that second piece of data I had to give you. The first is we want to know this range-- what the size is. So the way this works is we're wanting to know, based on our calculated x bar from our sample of size n, where the true mean actually lies.
So we know what we're doing is saying that the true mean, mu, is going to be bounded by going some portion of the distribution to the left of x bar and some portion of the distribution to the right, until we get the 1 minus alpha. So this area in here is the 1 minus alpha-- the 95%, say, central component of that distribution. And then we're evenly spreading the error part, the alpha, into two alpha-over-2's, one on each side-- saying, for a 95% confidence interval, I've got a 2.5% chance that the true mean is a little bit further off to the left and a 2.5% chance that it's a little further off to the right. I guess in this picture here I'm doing an 80% confidence interval, with a total alpha error risk-- error probability-- of 0.2.

And so the question then becomes, how far do I have to go out? And we know that from the basic probability manipulations for a normal distribution you guys have been dealing with already. The whole question is, how many unit standard deviations of a unit normal do I have to go? How many z's out do I have to go until I have exactly alpha over 2 out here in the tail? So for example, here I've got to go out 1.28 standard deviations to the left in order to have just that alpha over 2 in the left tail, and similarly to the right.

Now, notice that we're also unnormalizing. The z is how many unit standard deviations of the unit Gaussian you go out to get that probability in the tails. But what we wanted to do is reason about the location of the true population mean. And so we have to do a little bit of unnormalization and say: z alpha gave me the number of unit normals. Now, in terms of my actual population variance or population standard deviation, what does that correspond to? And this is where the sample size also comes into play. We were reasoning about the distribution associated with x bar. And the x bar is scaled-- it shrunk by that square root of n in terms of the standard deviation. So when I expand it back out to the number of standard deviations in my population, I have to divide back out by that root n.

So what we've got is the rationale for being able to use the PDF associated with x bar, calculate probabilities off of the tails, and get finally to this nice formula, which you'll see in Montgomery-- you'll see in all of the textbooks. It's a wonderful note to have on your one-page set of notes or cheat sheet for taking quizzes in this class and elsewhere. This is the interval-- the confidence interval formula-- for the location of the true mean when the variance is known.
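[Written out, this is the standard known-variance interval he is pointing to:]

$$P\!\left(-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha
\quad\Longrightarrow\quad
\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$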
So any questions on that? We actually want to return to our example and see what numbers pop out, because I want to know-- we knew x bar was 113.5, but I actually want to know, what is the 95% confidence interval for that? And so we can simply go back to our second question. Use the fact that you guys told me what the distribution of t bar was: normal around our unknown mu, with the variance scaled-- 100 over 50. So now, for a 95% confidence interval, where is the true mean?

So I've pictured it here. We've got this red curve which, again, goes with the PDF associated with t bar. And I want the plus/minus z alpha over 2, the alpha being 0.05-- that's my probability of being wrong-- to get to a 0.95 confidence interval. So how many z's do I have to go out to have 95% in the center? We actually showed some examples. If you remember, last time we looked at plus/minus 1 sigma, plus/minus 2 sigma, plus/minus 3 sigma for a Gaussian. And it's actually a very close approximation that plus/minus 2 sigma is 95% of a distribution. That's a good rule of thumb to remember. It's actually 1.96, not quite 2. But about plus/minus 2 sigma has 95%. So you'll often see 95% confidence intervals graphically shown that way.

So we need about 1.96 standard deviations. Now that translates to a confidence interval that tells us, as a function of n, where we think the true population mean is, based on the sample size that we had. The compression that we got because of sampling gets us that tighter standard deviation. And I've got a symmetric plus/minus 2.77 for my 95% confidence interval.
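[A minimal sketch of that worked example, using the numbers from the lecture (n = 50, t-bar = 113.5, sigma = 10) and SciPy's normal quantile function.]

```python
# 95% confidence interval for the true mean, variance known.
import math
from scipy.stats import norm

n, t_bar, sigma = 50, 113.5, 10.0
alpha = 0.05

z = norm.ppf(1 - alpha / 2)            # ~1.96 unit normals for 95%
half_width = z * sigma / math.sqrt(n)  # unnormalize by sigma/sqrt(n)

print(f"z_(alpha/2) = {z:.3f}")                 # 1.960
print(f"95% CI: {t_bar} +/- {half_width:.2f}")  # 113.5 +/- 2.77
```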
Now, notice that all you had to do here was be told what the actual calculated t bar was, what the underlying variance was, and the size of your sample. I didn't even have to actually give you a list of all those values, right? But I did have to tell you the sample size. If the sample size changed, that PDF would narrow or widen, and your confidence interval would narrow or widen, right? So any questions on where we are now? It's all seeming pretty clear?

So this is the relatively easy part, because it's dealing with normal distributions. This notion of sampling is a little bit subtle, because there is a different PDF, and you've got to know how that scales with the sample size. Now I'm going to throw a few different curves at you-- the different curves being probability distribution functions different from normal distributions. I'm going to briefly cover three of them, and all three are ones that we will actually be using in multiple scenarios in the statistical analysis techniques and tools that we're using.

The first one is a relatively easy step, and that's to look at the Student t-distribution. I'll come back to this. But basically, if we go back to the example I gave you: I said we assumed we knew, based on lots of past history, what the underlying variance was on the thickness of our parts.
What if you don't know that? What if you have to estimate that, too? Well, if you had to estimate it, you'd probably use the sample standard deviation-- that formula-- and come up with an estimate. It turns out that when you do that, the additional uncertainty about the underlying variance means that the right distribution for arguing about the mean, when you didn't know the underlying variance, is no longer a normal distribution. It's actually a t-distribution, and we'll talk about that. It's very close to, or looks qualitatively close to, a normal distribution, but we do want to cover it.

And then the next two have to do not with the mean, but with arguing about the variance. If I calculate a sample variance from a distribution-- I calculate s squared using the formula for a sample of size 50-- I get a number. I do that lots and lots of times, and I trace out a PDF. The PDF associated with the values of sample variance calculated from that sample is a chi-squared distribution. So we'll talk about what that shape looks like.

And then, once we've got a variance that we've calculated from a sample-- a very strange distribution is the F distribution, which is the distribution of the ratio of two variances drawn from normally distributed sample data. Good heavens. Why would you ever be calculating ratios of variances? What a weird distribution. Why would you ever calculate ratios of variances? Where might that come up? There's at least a couple of cases-- one that's kind of subtle, but one that's pretty obvious.

AUDIENCE: I think you're thinking about the variation of the actual population, which varies from your sample.

PROFESSOR: Certainly-- the variance associated with a sample of smaller size than your true population. So that's exactly true. That's one important area.
787 00:43:18,850 --> 00:43:23,020 The fact that sample size enters into spread and things 788 00:43:23,020 --> 00:43:24,280 is very important. 789 00:43:24,280 --> 00:43:26,920 That actually will come up more in the chi-squared. 790 00:43:26,920 --> 00:43:30,280 But I think a second very obvious place is 791 00:43:30,280 --> 00:43:32,590 I make a change to a process. 792 00:43:32,590 --> 00:43:34,750 And I'm maybe not trying to mean center it. 793 00:43:34,750 --> 00:43:37,390 I'm trying to get a reduced variance process. 794 00:43:37,390 --> 00:43:40,570 I want to know, is this process better or not? 795 00:43:40,570 --> 00:43:42,670 Is its variance smaller? 796 00:43:42,670 --> 00:43:45,430 So the ratio of those two variances 797 00:43:45,430 --> 00:43:48,550 is something I might be very, very interested in. 798 00:43:48,550 --> 00:43:50,890 I want to look at those and see, well, 799 00:43:50,890 --> 00:43:52,510 I did get a smaller variance. 800 00:43:52,510 --> 00:43:54,730 It's half the size. 801 00:43:54,730 --> 00:43:57,580 Do I have confidence that the true population variance 802 00:43:57,580 --> 00:43:59,597 is really smaller or not? 803 00:43:59,597 --> 00:44:01,180 And so that's where the F distribution 804 00:44:01,180 --> 00:44:02,450 is going to come into play. 805 00:44:02,450 --> 00:44:05,623 So we want to be able to manipulate and deal 806 00:44:05,623 --> 00:44:06,540 with that one as well. 807 00:44:11,880 --> 00:44:18,670 Let me do the student t-distribution first. 808 00:44:18,670 --> 00:44:19,990 Actually, I can't do that. 809 00:44:19,990 --> 00:44:22,540 Let me do the chi-squared distribution first. 810 00:44:22,540 --> 00:44:24,130 For the formal definition of the t, 811 00:44:24,130 --> 00:44:26,860 I need the chi-squared, even though conceptually, 812 00:44:26,860 --> 00:44:28,660 it doesn't really matter. 813 00:44:28,660 --> 00:44:32,320 So let's talk about the chi-squared distribution first. 814 00:44:32,320 --> 00:44:39,580 If I start out with truly normally distributed data, 815 00:44:39,580 --> 00:44:44,150 a unit normal, mean 0, variance 1. 816 00:44:44,150 --> 00:44:50,670 And now, I take a sum of n of these unit 817 00:44:50,670 --> 00:44:54,290 normals, each one of which is squared. 818 00:44:54,290 --> 00:44:56,410 So each x sub i is normally distributed. 819 00:44:56,410 --> 00:44:59,170 I do this weird operation where I take that sample. 820 00:44:59,170 --> 00:45:05,170 I square it. I take another draw or another random variable, 821 00:45:05,170 --> 00:45:08,830 also from the same distribution, square that, and then take 822 00:45:08,830 --> 00:45:14,020 the sum of n of those squared random variables to create 823 00:45:14,020 --> 00:45:17,420 a new random variable y. 824 00:45:17,420 --> 00:45:22,430 y is the sum of squared unit normal random variables. 825 00:45:22,430 --> 00:45:26,560 Then I get this chi-squared distribution. 826 00:45:26,560 --> 00:45:29,080 The distribution of this new random variable y 827 00:45:29,080 --> 00:45:33,400 is chi-squared with n degrees of freedom. 828 00:45:33,400 --> 00:45:36,625 Good heavens, what a weird thing to be doing. 829 00:45:36,625 --> 00:45:38,740 Why would you be taking random variables, 830 00:45:38,740 --> 00:45:42,270 squaring them, and taking sums of them? 831 00:45:42,270 --> 00:45:45,710 Well, think back to the formula. 832 00:45:45,710 --> 00:45:47,420 Let's see if I can do this. 833 00:45:47,420 --> 00:45:48,590 What page is that? 834 00:45:48,590 --> 00:45:50,930 Anybody got it there?
835 00:45:50,930 --> 00:45:52,600 8? 836 00:45:52,600 --> 00:45:54,700 There we go, page 5. 837 00:45:54,700 --> 00:46:01,670 Look back at this formula for sample standard deviation. 838 00:46:04,870 --> 00:46:08,660 First off, I'm subtracting the mean off of some sample. 839 00:46:08,660 --> 00:46:13,010 So now I've got a 0 mean variable. 840 00:46:13,010 --> 00:46:15,430 Now I'm taking squares of them. 841 00:46:15,430 --> 00:46:17,950 Well, that sounds kind of like this squaring operation. 842 00:46:17,950 --> 00:46:21,830 And then I'm taking a big sum of them. 843 00:46:21,830 --> 00:46:24,050 That sounds a lot like this operation I was just 844 00:46:24,050 --> 00:46:26,250 describing for chi-squared. 845 00:46:26,250 --> 00:46:31,850 So this creation of a new random variable, this s squared here, 846 00:46:31,850 --> 00:46:37,070 is very closely related to-- 847 00:46:37,070 --> 00:46:38,970 that didn't work. 848 00:46:38,970 --> 00:46:41,250 There we go-- very closely related 849 00:46:41,250 --> 00:46:45,310 to the definition of chi-squared. 850 00:46:45,310 --> 00:46:47,370 Now the chi-squared, the PDF associated 851 00:46:47,370 --> 00:46:51,450 with the chi-squared, looks kind of funky. 852 00:46:51,450 --> 00:46:53,730 It's clearly not normally distributed, right? 853 00:46:53,730 --> 00:46:55,380 It's kind of skewed. 854 00:46:55,380 --> 00:47:01,240 Notice it's got a long tail out here 855 00:47:01,240 --> 00:47:04,450 to the right for large values. 856 00:47:04,450 --> 00:47:08,870 Because it's a sum of squared values, it can't be negative. 857 00:47:08,870 --> 00:47:09,730 So it's truncated. 858 00:47:09,730 --> 00:47:14,190 There's nothing-- it can't be smaller than 0. 859 00:47:14,190 --> 00:47:18,690 Another really weird thing is that the maximal probability 860 00:47:18,690 --> 00:47:26,680 value is not equal to the mean of the distribution. 861 00:47:26,680 --> 00:47:28,540 That's kind of interesting. 862 00:47:28,540 --> 00:47:30,430 And there's another really interesting fact 863 00:47:30,430 --> 00:47:33,400 that is truly useful and occasionally 864 00:47:33,400 --> 00:47:36,310 comes up on problem sets and that sort of thing. 865 00:47:36,310 --> 00:47:39,940 The mean, the expected value of the chi-squared distribution 866 00:47:39,940 --> 00:47:44,580 with degrees of freedom n, is n. 867 00:47:44,580 --> 00:47:48,180 So as I have larger numbers of variables, 868 00:47:48,180 --> 00:47:53,850 the sum of that larger number keeps getting bigger. 869 00:47:53,850 --> 00:47:58,370 So that makes sense when you think about it. 870 00:47:58,370 --> 00:48:01,590 So the point here is when we actually 871 00:48:01,590 --> 00:48:08,490 do that calculation of a sample variance 872 00:48:08,490 --> 00:48:12,750 or a sample standard deviation, the PDF 873 00:48:12,750 --> 00:48:15,060 associated with that is actually related 874 00:48:15,060 --> 00:48:17,250 to this chi-squared distribution. 875 00:48:17,250 --> 00:48:19,480 Now there were some other constants in there. 876 00:48:19,480 --> 00:48:21,280 They're scaling factors. 877 00:48:21,280 --> 00:48:23,940 So for example, we did a mean shift by x bar, 878 00:48:23,940 --> 00:48:26,400 but we didn't normalize to the true variance, 879 00:48:26,400 --> 00:48:28,430 because we didn't know it. 880 00:48:28,430 --> 00:48:31,500 So there is this relationship or a scaling factor 881 00:48:31,500 --> 00:48:34,080 before we get to the chi-squared distribution.
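As a small check on that definition, here is a simulation sketch (mine, not from the lecture; the choice of n = 5 and the seed are arbitrary) that builds y as a sum of n squared unit normals and confirms the expected value is n.

```python
# Sketch: y = sum of n squared unit normals is chi-squared with n
# degrees of freedom, and E[chi2_n] = n. Values here are assumptions.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, reps = 5, 100_000
x = rng.standard_normal((reps, n))   # unit normal draws, reps samples of n
y = (x ** 2).sum(axis=1)             # each row: sum of n squared normals
print(y.mean())                      # ~5, matching E[chi2_n] = n
print(chi2.mean(n))                  # exactly 5 from the distribution itself
```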
882 00:48:34,080 --> 00:48:40,140 We also had this other n minus 1 factor back on the calculation 883 00:48:40,140 --> 00:48:41,220 of the sample-- 884 00:48:41,220 --> 00:48:44,400 sample standard deviation or sample variance. 885 00:48:44,400 --> 00:48:48,870 So we have to do a little bit of moving variables around 886 00:48:48,870 --> 00:48:51,430 to get to a chi-squared distribution. 887 00:48:51,430 --> 00:48:56,600 Another important point is that the-- 888 00:48:56,600 --> 00:48:59,090 let me clean up some of this-- 889 00:48:59,090 --> 00:49:03,980 is that the sample variance is actually 890 00:49:03,980 --> 00:49:09,722 related to a chi-squared with n minus 1 degrees of freedom. 891 00:49:09,722 --> 00:49:11,930 And I really don't want to go into a whole discussion 892 00:49:11,930 --> 00:49:16,260 of degrees of freedom because it's a little bit subtle. 893 00:49:16,260 --> 00:49:17,960 But this reminds me of another point 894 00:49:17,960 --> 00:49:20,510 that I didn't make back on slide 8. 895 00:49:24,390 --> 00:49:25,840 Get me to 8, please. 896 00:49:25,840 --> 00:49:26,520 There we go. 897 00:49:26,520 --> 00:49:29,235 Oops, not 48, 8. 898 00:49:29,235 --> 00:49:30,660 Oh, it wasn't 8. 899 00:49:30,660 --> 00:49:31,320 Where was it? 900 00:49:31,320 --> 00:49:32,670 4, 5. 901 00:49:32,670 --> 00:49:34,670 There we go. 902 00:49:34,670 --> 00:49:40,500 Back here on this, notice that when we calculate sample mean, 903 00:49:40,500 --> 00:49:42,240 we used 1 over n. 904 00:49:42,240 --> 00:49:44,610 But when we calculate sample variance, 905 00:49:44,610 --> 00:49:47,770 we always use 1 over n minus 1. 906 00:49:47,770 --> 00:49:48,520 Why do we do that? 907 00:49:53,080 --> 00:50:00,170 It turns out that if you need or want an unbiased estimator 908 00:50:00,170 --> 00:50:04,150 for a sample variance, you need to normalize by 1 over n minus 1-- 909 00:50:04,150 --> 00:50:08,310 that is, divide by n minus 1, not n. 910 00:50:08,310 --> 00:50:10,140 Now, as n gets very large, the difference 911 00:50:10,140 --> 00:50:11,400 doesn't really matter. 912 00:50:11,400 --> 00:50:15,890 But you can go through some statistical proofs 913 00:50:15,890 --> 00:50:21,800 to show that the best unbiased estimator needs that n minus 1. 914 00:50:21,800 --> 00:50:26,210 Now the other thing that's going on in this formula 915 00:50:26,210 --> 00:50:29,120 is we were subtracting off the mean. 916 00:50:29,120 --> 00:50:33,210 And in this case, we were also estimating the mean. 917 00:50:33,210 --> 00:50:35,420 So we're using up essentially one degree 918 00:50:35,420 --> 00:50:41,660 of freedom out of our data to calculate the sample mean, 919 00:50:41,660 --> 00:50:45,560 leaving us only n minus 1 degrees of freedom 920 00:50:45,560 --> 00:50:51,990 really in the remaining data to allow variance around the mean. 921 00:50:51,990 --> 00:50:56,030 So I'm not going to go into much more detail, 922 00:50:56,030 --> 00:51:01,400 other than to simply say the fact is, when we're calculating 923 00:51:01,400 --> 00:51:04,370 sample standard deviation, we're actually calculating 924 00:51:04,370 --> 00:51:10,520 two random variables or two statistics, x bar and variance. 925 00:51:10,520 --> 00:51:14,900 And so you would need-- you essentially 926 00:51:14,900 --> 00:51:19,190 don't have complete independence between those two things. 927 00:51:19,190 --> 00:51:23,410 You use up one degree of freedom for one of those. 928 00:51:23,410 --> 00:51:26,110 Let's use this.
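To see the n minus 1 point numerically, here is a minimal simulation sketch (my own; the true variance of 100 and samples of size 5 are illustrative assumptions, not lecture data): averaged over many repeated samples, the divide-by-n estimator systematically undershoots, while divide-by-(n minus 1) recovers the true variance.

```python
# Sketch: compare dividing by n versus n - 1 over many repeated samples.
# Assumed values: true sigma^2 = 100, sample size n = 5.
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 100.0, 5, 200_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
dev2 = (x - x.mean(axis=1, keepdims=True)) ** 2   # squared deviations from x-bar
print(dev2.sum(axis=1).mean() / n)        # ~80: biased low by (n - 1)/n
print(dev2.sum(axis=1).mean() / (n - 1))  # ~100: the unbiased version
```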
929 00:51:26,110 --> 00:51:29,010 Before we use this, just to give you a qualitative feel, 930 00:51:29,010 --> 00:51:30,190 here's-- 931 00:51:30,190 --> 00:51:35,410 again, plotted a few different chi-squared distributions. 932 00:51:35,410 --> 00:51:38,260 When n is very small, it becomes very skewed. 933 00:51:38,260 --> 00:51:40,960 It's quite interesting. 934 00:51:40,960 --> 00:51:47,720 Again, the mean you can see for n equals 3 here is 3. 935 00:51:47,720 --> 00:51:50,140 It's this blue curve. 936 00:51:50,140 --> 00:51:53,110 And as n increases, the distribution 937 00:51:53,110 --> 00:51:54,010 shifts to the right. 938 00:51:54,010 --> 00:51:55,190 The mean shifts to the right. 939 00:51:55,190 --> 00:51:57,700 But it also spreads out, which kind of makes sense. 940 00:51:57,700 --> 00:51:59,860 If I've got more and more random variables, 941 00:51:59,860 --> 00:52:04,090 and I'm looking at the variance and estimating 942 00:52:04,090 --> 00:52:07,450 that sum of random variables, its spread 943 00:52:07,450 --> 00:52:11,160 is going to get large. 944 00:52:11,160 --> 00:52:17,780 And another observation is that as n gets larger and larger, 945 00:52:17,780 --> 00:52:21,790 this also trends towards a normal distribution, 946 00:52:21,790 --> 00:52:26,010 which for very large n can be a useful fact. 947 00:52:26,010 --> 00:52:27,780 I want to actually go in and use-- 948 00:52:30,740 --> 00:52:37,450 not that one-- use this chi-squared distribution 949 00:52:37,450 --> 00:52:43,750 to ask another question on that thickness example. 950 00:52:43,750 --> 00:52:45,580 I actually want to know, what's 951 00:52:45,580 --> 00:52:51,190 the best guess for the variance of my thickness of parts? 952 00:52:51,190 --> 00:52:54,190 And better than that, what's a confidence interval for where 953 00:52:54,190 --> 00:52:57,250 I think the true variance lies, based on just this one 954 00:52:57,250 --> 00:53:00,520 number for sample variance, based on my sample of size n 955 00:53:00,520 --> 00:53:02,280 equals 50. 956 00:53:02,280 --> 00:53:09,390 And this is where we do the same kind of a formula for the range 957 00:53:09,390 --> 00:53:12,750 where we think the true variance lies, 958 00:53:12,750 --> 00:53:19,350 based on our observation from one sample of the sample standard 959 00:53:19,350 --> 00:53:20,490 deviation. 960 00:53:20,490 --> 00:53:22,830 And this is using that relationship 961 00:53:22,830 --> 00:53:27,300 between the chi-squared distribution and s 962 00:53:27,300 --> 00:53:30,120 squared and the true underlying variance. 963 00:53:30,120 --> 00:53:32,160 So if you go back to one of those formulas, 964 00:53:32,160 --> 00:53:34,890 what I did was took-- 965 00:53:34,890 --> 00:53:36,600 sigma squared was lying out here. 966 00:53:36,600 --> 00:53:40,390 I moved it up here and divided the chi-squared down here. 967 00:53:40,390 --> 00:53:43,050 So this is essentially right in here 968 00:53:43,050 --> 00:53:47,070 that equivalence that we said before about how s squared was 969 00:53:47,070 --> 00:53:50,790 distributed as a chi-squared with n 970 00:53:50,790 --> 00:53:52,230 minus 1 degrees of freedom. 971 00:53:55,430 --> 00:53:59,600 So what we've got is a bound-- 972 00:53:59,600 --> 00:54:02,790 let me get rid of all this gook-- 973 00:54:02,790 --> 00:54:06,150 a bound, upper and lower bound, on where we think, 974 00:54:06,150 --> 00:54:09,590 again, the true variance is, based on our calculated s 975 00:54:09,590 --> 00:54:10,720 squared.
976 00:54:10,720 --> 00:54:14,850 And what we're doing again is putting some alpha over 2 probability 977 00:54:14,850 --> 00:54:17,360 of being wrong in each of the tails. 978 00:54:17,360 --> 00:54:18,930 I want the central part. 979 00:54:18,930 --> 00:54:22,350 I want the 95% central part of where we 980 00:54:22,350 --> 00:54:27,060 think the true variance lies. 981 00:54:27,060 --> 00:54:33,750 Now an interesting point here is chi-squared is asymmetric. 982 00:54:33,750 --> 00:54:37,260 So if you ever see somebody going off and writing, 983 00:54:37,260 --> 00:54:40,200 I think the true variance is equal to s 984 00:54:40,200 --> 00:54:45,240 squared plus or minus 14.2, that should 985 00:54:45,240 --> 00:54:47,415 be a great, big red flag. 986 00:54:51,360 --> 00:54:54,330 It's somebody who doesn't know what they're talking about. 987 00:54:54,330 --> 00:54:56,180 Well, maybe they have a huge sample size, 988 00:54:56,180 --> 00:54:58,650 and they're appealing to a normal distribution. 989 00:54:58,650 --> 00:55:04,520 But what they're probably doing here is something very wrong. 990 00:55:04,520 --> 00:55:07,790 Because the chi-squared distribution is not symmetric, 991 00:55:07,790 --> 00:55:11,340 I have my best point estimate of s squared. 992 00:55:11,340 --> 00:55:15,200 And then I'm going to go a different distance 993 00:55:15,200 --> 00:55:17,360 to the left and a different distance to the right. 994 00:55:17,360 --> 00:55:20,930 So here's, still for our same example, 995 00:55:20,930 --> 00:55:25,370 the chi-squared distribution for n, a sample size of 50. 996 00:55:25,370 --> 00:55:28,880 So this is a chi-squared with 49 degrees of freedom. 997 00:55:28,880 --> 00:55:32,270 And again, I want 2.5% in the left tail 998 00:55:32,270 --> 00:55:34,640 and 2.5% in the right tail. 999 00:55:34,640 --> 00:55:38,210 And so if I apply that formula, and I have to look up 1000 00:55:38,210 --> 00:55:44,110 chi-squared with 0.025 and 49 degrees of freedom, 1001 00:55:44,110 --> 00:55:49,590 and then the chi-squared where I need to know-- 1002 00:55:49,590 --> 00:55:54,210 I want 97.5%-- everything except just alpha over 2 1003 00:55:54,210 --> 00:55:56,940 out to the right. 1004 00:55:56,940 --> 00:55:59,150 The s squareds are the same in both cases. 1005 00:55:59,150 --> 00:56:01,100 My n minus 1 is the same. 1006 00:56:01,100 --> 00:56:06,320 But because these values, the chi-squareds, are not equal-- 1007 00:56:06,320 --> 00:56:07,520 whoops. 1008 00:56:07,520 --> 00:56:10,340 I guess I got these flipped. 1009 00:56:10,340 --> 00:56:12,080 Actually, when you look at the tables 1010 00:56:12,080 --> 00:56:17,540 at the back of Montgomery or May and Spanos, 1011 00:56:17,540 --> 00:56:18,890 be careful on the definition. 1012 00:56:18,890 --> 00:56:20,420 They often show you a little plot 1013 00:56:20,420 --> 00:56:23,060 that looks a lot like this. 1014 00:56:23,060 --> 00:56:27,200 And they shade in what their percentage points are. 1015 00:56:27,200 --> 00:56:31,410 And sometimes they go from the right, sometimes from the left. 1016 00:56:31,410 --> 00:56:33,700 But the point was when you actually 1017 00:56:33,700 --> 00:56:36,840 look that up, you get different values 1018 00:56:36,840 --> 00:56:38,310 for the left and the right. 1019 00:56:38,310 --> 00:56:42,620 And when you divide those out, you get a range-- 1020 00:56:42,620 --> 00:56:44,440 get that out of the way. 1021 00:56:44,440 --> 00:56:49,611 You get a range finally for where your true variance lies.
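Since the quantile lookup is the only fiddly part, here is a minimal sketch of that interval calculation (mine, not the slide's), using this example's n of 50 and the sample variance of 102.3 that comes up just below; note the asymmetry of the resulting bounds about the point estimate.

```python
# Sketch of the chi-squared confidence interval for a variance.
# Values from this example: n = 50, s^2 = 102.3, 95% confidence.
from scipy.stats import chi2

n, s2, alpha = 50, 102.3, 0.05
lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, n - 1)  # divide by the larger quantile
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, n - 1)      # divide by the smaller quantile
print(lower, upper)   # roughly 71 to 159, close to the slide's 71.4 and 158.1
```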
1022 00:56:49,611 --> 00:56:53,270 AUDIENCE: So is that through a [INAUDIBLE] or estimates 1023 00:56:53,270 --> 00:56:58,850 of variance or from chi-square distribution, or is that-- 1024 00:56:58,850 --> 00:57:03,750 PROFESSOR: The point is that all estimates-- 1025 00:57:03,750 --> 00:57:08,370 well, it's strictly true if I'm drawing from a population that 1026 00:57:08,370 --> 00:57:09,960 is normally distributed. 1027 00:57:09,960 --> 00:57:11,910 But to an approximation, no matter 1028 00:57:11,910 --> 00:57:18,180 what, any time I'm calculating a variance, 1029 00:57:18,180 --> 00:57:21,210 the variance tends to be chi-squared distributed. 1030 00:57:21,210 --> 00:57:23,190 So it's always going to be these kinds 1031 00:57:23,190 --> 00:57:24,690 of chi-squared calculations. 1032 00:57:27,580 --> 00:57:30,250 So it's not that the chi-squared was a special case. 1033 00:57:30,250 --> 00:57:34,120 It's the PDF that you should always 1034 00:57:34,120 --> 00:57:36,580 associate with s squared. 1035 00:57:41,310 --> 00:57:45,090 And notice here, we had 102.3. 1036 00:57:45,090 --> 00:57:47,310 That's our best guess. 1037 00:57:47,310 --> 00:57:53,070 And we had 71.4 and 158.1 for the range on the variance. 1038 00:57:58,507 --> 00:58:00,215 I always find this a little bit shocking. 1039 00:58:02,840 --> 00:58:04,790 A sample size of 50? 1040 00:58:04,790 --> 00:58:08,940 I took 50 samples, right? 1041 00:58:08,940 --> 00:58:16,920 And I had-- my underlying variance, I guess, was 100. 1042 00:58:16,920 --> 00:58:19,170 But I took a lot of samples. 1043 00:58:19,170 --> 00:58:20,970 And it always shocks me a little bit 1044 00:58:20,970 --> 00:58:24,480 how big the range is on the estimate of variance 1045 00:58:24,480 --> 00:58:27,000 coming out of this. 1046 00:58:27,000 --> 00:58:30,420 Here, my estimate of variance is 102.3. 1047 00:58:30,420 --> 00:58:31,980 Well, that's at least reassuring, 1048 00:58:31,980 --> 00:58:34,560 because that's close to the example that I gave here, 1049 00:58:34,560 --> 00:58:40,780 where a priori, I thought it was 100. 1050 00:58:40,780 --> 00:58:43,630 I just basically popped that out. 1051 00:58:43,630 --> 00:58:47,140 What's shocking is I can go down to 71. 1052 00:58:47,140 --> 00:58:54,880 That's like 30% lower than that, or up to 158, which is about 55% higher 1053 00:58:54,880 --> 00:58:57,770 than my point estimate. 1054 00:58:57,770 --> 00:59:01,190 And a really important thing just to know qualitatively 1055 00:59:01,190 --> 00:59:04,610 is that estimating a mean is pretty easy. 1056 00:59:04,610 --> 00:59:06,680 And actually, as sample size grows, 1057 00:59:06,680 --> 00:59:09,770 you can get pretty good tight estimates of mean. 1058 00:59:09,770 --> 00:59:13,785 But the estimates of variance are hard. 1059 00:59:13,785 --> 00:59:17,260 You need a lot of data to estimate 1060 00:59:17,260 --> 00:59:21,650 that second-order statistic. 1061 00:59:21,650 --> 00:59:24,490 And so we get big spreads in variance. 1062 00:59:24,490 --> 00:59:26,890 So you've got to be really careful in your reasoning 1063 00:59:26,890 --> 00:59:28,180 about variances. 1064 00:59:28,180 --> 00:59:31,060 And that'll bring us back to the F-statistic a little bit later. 1065 00:59:36,150 --> 00:59:40,260 So let me go back now to the student t-distribution.
1066 00:59:40,260 --> 00:59:44,550 And it has a formula and a formal definition 1067 00:59:44,550 --> 00:59:49,560 here, which is if I start out with a random variable z that 1068 00:59:49,560 --> 00:59:52,270 is a unit normal. 1069 00:59:52,270 --> 00:59:57,040 And then I divide it by the square root of a random variable that 1070 00:59:57,040 --> 01:00:02,620 is chi-squared with k degrees of freedom, divided by k, 1071 01:00:02,620 --> 01:00:06,310 I get a new distribution, a new variable t, 1072 01:00:06,310 --> 01:00:11,390 that is a t-distribution with k degrees of freedom. 1073 01:00:11,390 --> 01:00:13,310 And it's the same question. 1074 01:00:13,310 --> 01:00:15,290 My god, why would you do such a cruel thing 1075 01:00:15,290 --> 01:00:18,410 to a random variable-- divide it by a chi-squared random 1076 01:00:18,410 --> 01:00:21,470 variable and some constant k? 1077 01:00:21,470 --> 01:00:26,400 And the answer is that's essentially 1078 01:00:26,400 --> 01:00:33,270 what we're doing when we are normalizing data 1079 01:00:33,270 --> 01:00:38,120 like this, when instead of normalizing 1080 01:00:38,120 --> 01:00:40,430 to the true underlying population 1081 01:00:40,430 --> 01:00:42,560 variance, 1082 01:00:42,560 --> 01:00:49,220 I'm also having to estimate not only the mean, 1083 01:00:49,220 --> 01:00:54,460 but also estimate the population standard deviation. 1084 01:00:54,460 --> 01:00:56,515 We already said, what is s? 1085 01:00:56,515 --> 01:00:59,020 s squared is chi-squared distributed. 1086 01:00:59,020 --> 01:01:04,400 So s is the square root of a chi-squared distributed quantity. 1087 01:01:04,400 --> 01:01:08,050 So buried in this unit normalization 1088 01:01:08,050 --> 01:01:11,620 that we like to do to get to a probability distribution 1089 01:01:11,620 --> 01:01:13,390 function-- we can talk about confidence 1090 01:01:13,390 --> 01:01:14,980 intervals on the mean. 1091 01:01:14,980 --> 01:01:18,190 We subtract off some mean, and then we 1092 01:01:18,190 --> 01:01:21,040 normalize to s over root n. 1093 01:01:21,040 --> 01:01:23,600 But s itself is this chi-squared quantity. 1094 01:01:23,600 --> 01:01:27,400 So it's really closely related to the operations 1095 01:01:27,400 --> 01:01:33,040 that we do when we are normalizing our sample data, 1096 01:01:33,040 --> 01:01:37,130 when we also had to estimate the standard deviation. 1097 01:01:37,130 --> 01:01:42,580 So the way to think about the t-distribution 1098 01:01:42,580 --> 01:01:46,330 is it's really close to the normal distribution, 1099 01:01:46,330 --> 01:01:47,920 except it's perturbed a little bit, 1100 01:01:47,920 --> 01:01:51,460 because we didn't really know the underlying variance. 1101 01:01:51,460 --> 01:01:53,710 We're having to estimate it also. 1102 01:01:53,710 --> 01:01:57,910 So here's some pictures, some examples. 1103 01:01:57,910 --> 01:02:03,450 The red is the unit normal distribution. 1104 01:02:03,450 --> 01:02:10,390 And now for different sizes of sample, so for an n equals 3, 1105 01:02:10,390 --> 01:02:14,520 you have this little blue distribution. 1106 01:02:14,520 --> 01:02:20,070 That's the t-distribution with degrees of freedom 3. 1107 01:02:20,070 --> 01:02:23,580 Notice that it's a little bit wider 1108 01:02:23,580 --> 01:02:27,540 than the normal distribution, reflecting a little bit 1109 01:02:27,540 --> 01:02:30,360 less certainty on really the location 1110 01:02:30,360 --> 01:02:33,270 of that random variable.
1111 01:02:33,270 --> 01:02:35,840 Now as n gets bigger, so we've got 1112 01:02:35,840 --> 01:02:40,800 an n equals 10 example in here in the green, 1113 01:02:40,800 --> 01:02:42,710 the chi-square-- or the t-distribution 1114 01:02:42,710 --> 01:02:43,910 gets a little bit tighter. 1115 01:02:43,910 --> 01:02:47,750 And for n equals 100, it's basically almost lying right 1116 01:02:47,750 --> 01:02:50,370 on top of the normal distribution. 1117 01:02:50,370 --> 01:02:54,800 So what the t is reflecting is a little additional uncertainty 1118 01:02:54,800 --> 01:02:58,700 because we didn't know sigma squared. 1119 01:02:58,700 --> 01:03:01,880 I had to calculate s squared from that same sample 1120 01:03:01,880 --> 01:03:03,830 distribution. 1121 01:03:03,830 --> 01:03:06,230 So that's all that's really going on there. 1122 01:03:06,230 --> 01:03:10,040 If we then say, OK, I want to get back 1123 01:03:10,040 --> 01:03:11,870 to a confidence interval. 1124 01:03:11,870 --> 01:03:14,360 But now, I don't know the variance, 1125 01:03:14,360 --> 01:03:18,180 and I have to estimate that also from my data. 1126 01:03:18,180 --> 01:03:22,320 We have essentially the same confidence interval formula, 1127 01:03:22,320 --> 01:03:26,570 the only difference being instead of z 1128 01:03:26,570 --> 01:03:29,930 related to the unit normal distribution, 1129 01:03:29,930 --> 01:03:33,200 we have numbers of standard deviations 1130 01:03:33,200 --> 01:03:36,200 on the t-distribution that we're arguing about, 1131 01:03:36,200 --> 01:03:39,860 again, reflecting that that t is a little bit wider. 1132 01:03:39,860 --> 01:03:43,130 But it's essentially exactly the same thinking, 1133 01:03:43,130 --> 01:03:46,910 just recognizing that now, the sampling distribution for x 1134 01:03:46,910 --> 01:03:50,520 bar when variance is unknown-- 1135 01:03:50,520 --> 01:03:52,200 is not a normal. 1136 01:03:52,200 --> 01:03:53,580 It's a t-distribution. 1137 01:03:56,300 --> 01:03:58,830 But all the other operations are exactly the same. 1138 01:03:58,830 --> 01:04:02,930 We look for what alpha error we're willing to accept, 1139 01:04:02,930 --> 01:04:06,920 what our chance of being wrong on our bounding of the interval 1140 01:04:06,920 --> 01:04:12,020 is, and then allocating that to the left and the right; 1141 01:04:12,020 --> 01:04:14,150 figuring out how many standard deviations over we 1142 01:04:14,150 --> 01:04:18,650 go on, not the underlying population distribution, 1143 01:04:18,650 --> 01:04:20,430 but our sampling distribution. 1144 01:04:20,430 --> 01:04:25,280 So we still get the benefits of increasing n getting tighter. 1145 01:04:25,280 --> 01:04:27,760 But we just do that all on the t-distribution. 1146 01:04:27,760 --> 01:04:30,270 AUDIENCE: So this is-- this will be necessary for small sample 1147 01:04:30,270 --> 01:04:31,300 sizes. 1148 01:04:31,300 --> 01:04:33,250 PROFESSOR: Exactly. 1149 01:04:33,250 --> 01:04:35,110 So the point or the question was this 1150 01:04:35,110 --> 01:04:38,230 is only necessary for small sample sizes. 1151 01:04:38,230 --> 01:04:42,670 And that's exactly right because of the effect 1152 01:04:42,670 --> 01:04:45,910 that we see back with the t-distribution getting 1153 01:04:45,910 --> 01:04:51,820 very close in approximation to the normal distribution for n 1154 01:04:51,820 --> 01:04:53,800 becoming appreciable. 1155 01:04:53,800 --> 01:04:55,930 I've heard different kinds of rules of thumb.
1156 01:04:55,930 --> 01:04:58,930 Some people like to say for n about 25, 1157 01:04:58,930 --> 01:05:02,030 you're pretty close to a normal distribution. 1158 01:05:02,030 --> 01:05:05,260 Some people like to draw it at n equals 40. 1159 01:05:05,260 --> 01:05:10,420 It really depends on what kind of accuracy you're after. 1160 01:05:10,420 --> 01:05:13,750 But you can be substantially wrong for very small sample 1161 01:05:13,750 --> 01:05:17,140 sizes-- of sample size 5, which is a natural sample 1162 01:05:17,140 --> 01:05:21,200 size you would often use in some manufacturing scenarios. 1163 01:05:21,200 --> 01:05:24,400 So you do have to be aware for very small n 1164 01:05:24,400 --> 01:05:27,630 to use the t-distribution. 1165 01:05:27,630 --> 01:05:30,390 This was an example where we had n equals 50 1166 01:05:30,390 --> 01:05:32,130 in our part thickness example. 1167 01:05:32,130 --> 01:05:34,530 Let's see how different things pop out 1168 01:05:34,530 --> 01:05:37,840 if we use the t-distribution or the normal distribution. 1169 01:05:37,840 --> 01:05:39,840 So let's go back to our example. 1170 01:05:39,840 --> 01:05:43,440 But now, let's say we don't know either the variance 1171 01:05:43,440 --> 01:05:45,270 or the mean. 1172 01:05:45,270 --> 01:05:48,000 Both of them are unknown. 1173 01:05:48,000 --> 01:05:50,130 We already calculated the sample mean. 1174 01:05:50,130 --> 01:05:52,620 We had 113.5. 1175 01:05:52,620 --> 01:05:55,890 And now I'll tell you-- 1176 01:05:55,890 --> 01:05:58,080 I guess I already gave you this number previously. 1177 01:05:58,080 --> 01:06:01,380 But I'll tell you that we apply the sample variance 1178 01:06:01,380 --> 01:06:06,650 formula to the data, and out pops the number 102.3. 1179 01:06:06,650 --> 01:06:09,350 So again, that's the sample variance, 1180 01:06:09,350 --> 01:06:14,950 your best estimate of the underlying variance. 1181 01:06:14,950 --> 01:06:17,410 So these are your point estimates. 1182 01:06:17,410 --> 01:06:19,990 But now, I want to go back to the question, where's 1183 01:06:19,990 --> 01:06:23,320 the confidence interval on where we think the true mean would 1184 01:06:23,320 --> 01:06:25,240 be 95% of the time? 1185 01:06:25,240 --> 01:06:28,060 Well, now we have to use the t-distribution. 1186 01:06:28,060 --> 01:06:32,770 When we do that with 49 degrees of freedom, 1187 01:06:32,770 --> 01:06:35,890 again, n minus 1, because we're using up 1 for calculation 1188 01:06:35,890 --> 01:06:37,810 of the sample mean. 1189 01:06:37,810 --> 01:06:42,630 Now we have this slightly different formula. 1190 01:06:42,630 --> 01:06:45,960 Here, we can use the plus/minus, because the t-distribution, 1191 01:06:45,960 --> 01:06:50,010 like the normal distribution, is symmetric. 1192 01:06:50,010 --> 01:06:55,420 So I've got plus or minus some number of units, z's. 1193 01:06:55,420 --> 01:06:57,610 In this case, it's unit t's because 1194 01:06:57,610 --> 01:07:01,360 the operative distribution is the t-distribution. 1195 01:07:01,360 --> 01:07:03,140 I plug that in. 1196 01:07:03,140 --> 01:07:08,530 Notice that for 2.5% in each of the tails, 1197 01:07:08,530 --> 01:07:12,190 the t-distribution is slightly wider. 1198 01:07:12,190 --> 01:07:13,870 Remember, back with the unit normal, 1199 01:07:13,870 --> 01:07:19,420 we said plus or minus 1.96 standard deviations is 95%. 1200 01:07:19,420 --> 01:07:22,060 For the t, you got to go a little bit further-- 1201 01:07:22,060 --> 01:07:24,610 2.01.
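Here is a minimal sketch of that comparison (mine, not the slide's), using the numbers from this example: x bar of 113.5, s squared of 102.3, and n of 50.

```python
# Sketch: compare the normal and t critical values and the resulting
# 95% confidence interval half-widths for this thickness example.
import numpy as np
from scipy.stats import norm, t

n, xbar, s2, alpha = 50, 113.5, 102.3, 0.05
se = np.sqrt(s2 / n)                     # estimated standard error of x-bar
z_crit = norm.ppf(1 - alpha / 2)         # ~1.96
t_crit = t.ppf(1 - alpha / 2, n - 1)     # ~2.01 with 49 degrees of freedom
print(z_crit * se, t_crit * se)          # ~2.80 vs ~2.88: slightly wider with t
```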
1202 01:07:24,610 --> 01:07:26,290 Not a big difference-- 1203 01:07:26,290 --> 01:07:27,460 2.01. 1204 01:07:27,460 --> 01:07:29,110 And when you come out with that, you 1205 01:07:29,110 --> 01:07:34,490 get a slightly wider confidence interval. 1206 01:07:34,490 --> 01:07:35,910 I'm less confident. 1207 01:07:35,910 --> 01:07:40,370 I got to go further to get to my 95% confidence on the range 1208 01:07:40,370 --> 01:07:42,680 because I'm also estimating the variance. 1209 01:07:42,680 --> 01:07:45,290 So in this case, the difference is pretty much negligible. 1210 01:07:45,290 --> 01:07:47,480 And if I had a sample of size 50, 1211 01:07:47,480 --> 01:07:50,600 I would probably just use the normal distribution. 1212 01:07:50,600 --> 01:07:53,410 And that's a good example, showing that the difference 1213 01:07:53,410 --> 01:07:56,410 is 5 parts out of 200. 1214 01:07:56,410 --> 01:07:58,180 It's really quite small. 1215 01:08:02,350 --> 01:08:04,450 One more distribution I want to mention-- 1216 01:08:04,450 --> 01:08:05,950 we're not going to use it much here. 1217 01:08:05,950 --> 01:08:09,220 I think I've already described it briefly-- 1218 01:08:09,220 --> 01:08:11,200 is this F distribution. 1219 01:08:11,200 --> 01:08:14,620 And this arises if I have one random variable that 1220 01:08:14,620 --> 01:08:16,270 is chi-squared distributed. 1221 01:08:16,270 --> 01:08:19,000 I take another random variable that's chi-squared distributed. 1222 01:08:19,000 --> 01:08:21,729 And I form a new random variable R 1223 01:08:21,729 --> 01:08:24,880 that is the ratio of those two, each normalized 1224 01:08:24,880 --> 01:08:29,740 to the degrees of freedom or the number of variables 1225 01:08:29,740 --> 01:08:32,680 that went into each of those chi-squared distributed 1226 01:08:32,680 --> 01:08:33,729 variables. 1227 01:08:33,729 --> 01:08:40,529 And that is an F with u and v degrees of freedom. 1228 01:08:40,529 --> 01:08:48,359 Again, this comes up when we're looking at things like ratios 1229 01:08:48,359 --> 01:08:52,920 and want to reason about ratios of true population variances, 1230 01:08:52,920 --> 01:09:00,390 based on observations of sample variances. 1231 01:09:00,390 --> 01:09:05,210 And the key place where that might come up that I mentioned 1232 01:09:05,210 --> 01:09:07,970 is experimental design cases. 1233 01:09:07,970 --> 01:09:10,520 So this is an injection molding example, 1234 01:09:10,520 --> 01:09:12,950 where you might be looking at two different process 1235 01:09:12,950 --> 01:09:16,800 conditions-- a low hold time and a high hold time. 1236 01:09:16,800 --> 01:09:19,340 And there may be other things varying, maybe even 1237 01:09:19,340 --> 01:09:23,479 other variables varying, that cause there to be a spread. 1238 01:09:23,479 --> 01:09:25,640 Or there's just natural variation 1239 01:09:25,640 --> 01:09:27,350 in the two populations. 1240 01:09:27,350 --> 01:09:30,370 And you might ask questions like, 1241 01:09:30,370 --> 01:09:33,210 are these two variances different? 1242 01:09:33,210 --> 01:09:36,090 Did I improve the variance with that process condition change? 1243 01:09:38,810 --> 01:09:40,370 Maybe-- maybe not. 1244 01:09:40,370 --> 01:09:42,979 Certainly not obvious here, so you might 1245 01:09:42,979 --> 01:09:44,352 have a very low confidence.
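As one hedged sketch of how such a variance comparison might be run (the sample sizes and variances below are assumptions for illustration, not the injection molding data): form the ratio of the two sample variances and see whether it falls inside the central 95% band of the appropriate F distribution.

```python
# Sketch: is the ratio of two sample variances consistent with the
# true variances being equal? Assumed numbers, not lecture data.
from scipy.stats import f

n1, n2 = 20, 20                       # assumed sample sizes
s2_low, s2_high = 4.0, 9.0            # assumed sample variances
ratio = s2_high / s2_low
# Central 95% band for the ratio when the true variances are equal:
lo = f.ppf(0.025, n1 - 1, n2 - 1)     # ~0.40 with 19 and 19 degrees of freedom
hi = f.ppf(0.975, n1 - 1, n2 - 1)     # ~2.53
print(ratio, lo, hi)  # 2.25 lies inside the band: no confident difference
```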
1246 01:09:44,352 --> 01:09:46,310 So we're going to go and use the F distribution 1247 01:09:46,310 --> 01:09:50,510 a little bit later when we do analysis of experiments, 1248 01:09:50,510 --> 01:09:54,050 especially where you're looking to try to make inferences 1249 01:09:54,050 --> 01:09:55,970 about whether there are differences 1250 01:09:55,970 --> 01:09:59,750 between a couple of populations. 1251 01:09:59,750 --> 01:10:04,010 And again, because we're dealing with variances, 1252 01:10:04,010 --> 01:10:07,520 there's a huge spread that arises naturally 1253 01:10:07,520 --> 01:10:12,200 in these distributions, purely by chance. 1254 01:10:12,200 --> 01:10:15,470 This is a good place to re-emphasize 1255 01:10:15,470 --> 01:10:19,550 that a lot of what's going on here in random sampling 1256 01:10:19,550 --> 01:10:23,060 is there's spread in the observations that you get. 1257 01:10:23,060 --> 01:10:25,830 So here's a very simple numerical example. 1258 01:10:25,830 --> 01:10:30,900 If I start with a variable that is unit normal, 1259 01:10:30,900 --> 01:10:36,120 and I'm just going to take two samples, sets of size n 1260 01:10:36,120 --> 01:10:38,540 equals 20. 1261 01:10:38,540 --> 01:10:40,690 So I'm taking two different samples, 1262 01:10:40,690 --> 01:10:42,680 same underlying population. 1263 01:10:42,680 --> 01:10:45,400 I'm not making a process change, say. 1264 01:10:45,400 --> 01:10:48,680 I'm just taking two samples, each of size 20. 1265 01:10:48,680 --> 01:10:51,320 By chance, when I take that first sample, 1266 01:10:51,320 --> 01:10:55,950 I calculate a particular sample variance, s squared. 1267 01:10:55,950 --> 01:10:57,620 And by chance, I calculate another one 1268 01:10:57,620 --> 01:10:58,970 for the second sample. 1269 01:10:58,970 --> 01:11:03,470 And if I form the ratio of those two, what typical range am 1270 01:11:03,470 --> 01:11:07,580 I going to observe in the ratio of those two variances? 1271 01:11:07,580 --> 01:11:10,400 For example, what ratio might I observe 1272 01:11:10,400 --> 01:11:13,370 95% of the time or what range? 1273 01:11:13,370 --> 01:11:15,810 And that's the F distribution. 1274 01:11:15,810 --> 01:11:20,930 In fact, if I look at the upper and lower 1275 01:11:20,930 --> 01:11:26,810 bound on the range of that ratio for a 95% confidence 1276 01:11:26,810 --> 01:11:31,280 interval for this ratio of two samples of size 20, 1277 01:11:31,280 --> 01:11:36,260 I can go anywhere from 2.5 to 0.4 in that ratio. 1278 01:11:39,120 --> 01:11:41,040 That's with samples of size 20. 1279 01:11:41,040 --> 01:11:43,540 That's a huge range, right? 1280 01:11:43,540 --> 01:11:46,900 Imagine, 2 and 1/2 times bigger variance over here, 1281 01:11:46,900 --> 01:11:48,790 compared to over here. 1282 01:11:48,790 --> 01:11:51,950 And that occurs purely by chance. 1283 01:11:51,950 --> 01:11:57,070 So 95% of the time, I might have ratios within that range. 1284 01:11:57,070 --> 01:11:59,800 But 5% of the time, I'll even observe 1285 01:11:59,800 --> 01:12:02,350 ratios that are bigger or even smaller 1286 01:12:02,350 --> 01:12:04,540 than those extreme points. 1287 01:12:04,540 --> 01:12:07,140 So you've got to be really careful in reasoning 1288 01:12:07,140 --> 01:12:08,160 about variances. 1289 01:12:10,990 --> 01:12:13,210 So we're mostly there. 1290 01:12:13,210 --> 01:12:15,340 The last thing I want to do here is 1291 01:12:15,340 --> 01:12:20,260 draw the relationship of some of these to hypothesis tests.
1292 01:12:20,260 --> 01:12:23,590 And that gets us very close to some of the Shewhart hypotheses 1293 01:12:23,590 --> 01:12:26,020 that are the basis for control charts 1294 01:12:26,020 --> 01:12:28,490 that we'll talk about in the next lecture. 1295 01:12:28,490 --> 01:12:32,110 But I do want to get the basic idea in the last five, 1296 01:12:32,110 --> 01:12:36,740 10 minutes on what a statistical hypothesis test is 1297 01:12:36,740 --> 01:12:39,440 and how that relates to some of these confidence intervals 1298 01:12:39,440 --> 01:12:42,000 that we've been talking about. 1299 01:12:42,000 --> 01:12:44,870 So the basic idea we've been doing with these means is 1300 01:12:44,870 --> 01:12:48,350 we've been hypothesizing that the mean has some distribution, 1301 01:12:48,350 --> 01:12:50,600 say a normal distribution. 1302 01:12:50,600 --> 01:12:54,500 And then when we talked about this confidence interval, 1303 01:12:54,500 --> 01:12:57,740 I would say, accept or reject the hypothesis 1304 01:12:57,740 --> 01:13:03,200 that the mean was within some range with some probability. 1305 01:13:03,200 --> 01:13:06,920 We can extend that to asking other questions 1306 01:13:06,920 --> 01:13:09,080 or other hypotheses, and then looking 1307 01:13:09,080 --> 01:13:11,150 at the probabilities associated with it, 1308 01:13:11,150 --> 01:13:14,300 and saying, with some degree of confidence, 1309 01:13:14,300 --> 01:13:15,980 I believe the hypothesis. 1310 01:13:15,980 --> 01:13:19,460 Or I have enough evidence to counter it. 1311 01:13:19,460 --> 01:13:23,990 And a typical example might be a null hypothesis, 1312 01:13:23,990 --> 01:13:31,100 often referred to as H0, that the mean is some 1313 01:13:31,100 --> 01:13:34,070 a priori mean, some mu 0. 1314 01:13:34,070 --> 01:13:37,580 Then, based on this sample 1315 01:13:37,580 --> 01:13:39,890 that I'm drawing from the population, 1316 01:13:39,890 --> 01:13:41,780 I have this alternative hypothesis 1317 01:13:41,780 --> 01:13:43,020 that the mean has changed. 1318 01:13:43,020 --> 01:13:44,810 It's no longer the same mean. 1319 01:13:44,810 --> 01:13:48,917 Do I have enough evidence to say with some degree of confidence 1320 01:13:48,917 --> 01:13:50,000 that the mean has changed? 1321 01:13:52,760 --> 01:13:55,610 And it's a little tricky because there's 1322 01:13:55,610 --> 01:13:57,470 all these probabilities associated 1323 01:13:57,470 --> 01:14:00,450 with random sampling. 1324 01:14:00,450 --> 01:14:03,260 So I observe a particular value with some deviation. 1325 01:14:03,260 --> 01:14:10,130 How do I know to what degree there's an actual shift, 1326 01:14:10,130 --> 01:14:13,210 say, in the mean or not? 1327 01:14:13,210 --> 01:14:14,380 So let's look at this. 1328 01:14:14,380 --> 01:14:16,840 What we do is we form the hypothesis. 1329 01:14:16,840 --> 01:14:19,180 We then look at the probabilities associated 1330 01:14:19,180 --> 01:14:22,840 with the two cases, and then based on those probabilities, 1331 01:14:22,840 --> 01:14:25,720 say with some degree of confidence, 1332 01:14:25,720 --> 01:14:28,250 I choose one or the other. 1333 01:14:28,250 --> 01:14:31,420 And what's important is there's always the chance of being 1334 01:14:31,420 --> 01:14:33,940 wrong, making an error-- 1335 01:14:33,940 --> 01:14:38,170 those alpha errors out in the tails, for example-- 1336 01:14:38,170 --> 01:14:39,500 with that decision. 1337 01:14:39,500 --> 01:14:42,850 So that's where this confidence level comes in.
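Here is a minimal sketch of that two-sided test on the mean, in the known-variance, normal case; all numbers below are assumptions for illustration, not lecture data.

```python
# Sketch: two-sided z-test of H0: mean = mu0 against a changed mean.
# Assumed values: mu0 = 100, known sigma = 10, n = 50, observed x-bar.
import numpy as np
from scipy.stats import norm

mu0, sigma, n = 100.0, 10.0, 50
xbar = 103.2                              # assumed observed sample mean
z = (xbar - mu0) / (sigma / np.sqrt(n))   # standardized deviation of x-bar
p_value = 2 * (1 - norm.cdf(abs(z)))      # chance of a deviation this large
print(z, p_value)                         # reject H0 at alpha = 0.05 if p < 0.05
```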
1338 01:14:42,850 --> 01:14:45,430 So let's say we're looking at this test. 1339 01:14:45,430 --> 01:14:48,920 We're asking-- the null hypothesis 1340 01:14:48,920 --> 01:14:54,380 is I have a normal distribution with some a priori mean 1341 01:14:54,380 --> 01:14:56,250 and some a priori variance. 1342 01:14:56,250 --> 01:14:58,040 I'm going to draw a new sample. 1343 01:14:58,040 --> 01:15:02,880 And based on that, I want to decide 1344 01:15:02,880 --> 01:15:07,290 whether a shift has occurred-- 1345 01:15:07,290 --> 01:15:11,170 whether the data comes from that distribution or not. 1346 01:15:11,170 --> 01:15:12,960 And so what we're going to do is use 1347 01:15:12,960 --> 01:15:16,560 essentially this same confidence interval idea 1348 01:15:16,560 --> 01:15:21,730 and say, say to 95% confidence, 95% of the time, 1349 01:15:21,730 --> 01:15:26,370 if my value lies in the central part of that distribution, 1350 01:15:26,370 --> 01:15:30,870 I'm going to accept the-- 1351 01:15:30,870 --> 01:15:33,210 well, in this case, the null hypothesis 1352 01:15:33,210 --> 01:15:37,840 that my new sample still comes from that same distribution. 1353 01:15:37,840 --> 01:15:41,580 So that would be my 95%, my 1 minus alpha, if alpha 1354 01:15:41,580 --> 01:15:43,390 is a 5% error. 1355 01:15:43,390 --> 01:15:46,500 But if I observe a sample mean, say, 1356 01:15:46,500 --> 01:15:51,230 or I observe a piece of data that lies out here, 1357 01:15:51,230 --> 01:15:53,450 I'm going to reject the null hypothesis. 1358 01:15:53,450 --> 01:15:55,130 I'm going to say instead, that's 1359 01:15:55,130 --> 01:15:58,910 such an unlikely event by chance 1360 01:15:58,910 --> 01:16:02,660 that I think it instead indicates something has changed. 1361 01:16:02,660 --> 01:16:04,270 Something has changed in the process. 1362 01:16:04,270 --> 01:16:07,213 And we'll call that the region of rejection. 1363 01:16:10,340 --> 01:16:14,090 So again, already you can see one kind of error 1364 01:16:14,090 --> 01:16:16,160 that's likely to pop up. 1365 01:16:16,160 --> 01:16:18,710 There is this alpha, 1366 01:16:18,710 --> 01:16:21,840 a significance level to the test, 1367 01:16:21,840 --> 01:16:24,470 very similar to the confidence interval 1368 01:16:24,470 --> 01:16:28,290 idea and the alpha error associated with that. 1369 01:16:28,290 --> 01:16:31,840 So right away, you see there's one kind of error-- 1370 01:16:31,840 --> 01:16:35,240 it's referred to as a type I error-- 1371 01:16:35,240 --> 01:16:37,060 on these kinds of hypothesis tests. 1372 01:16:37,060 --> 01:16:40,900 We're rejecting the null hypothesis out 1373 01:16:40,900 --> 01:16:44,155 here in the tails with some probability alpha.
1382 01:17:12,400 --> 01:17:15,608 I'm claiming this is evidence that something changed 1383 01:17:15,608 --> 01:17:16,900 when, in fact, nothing changed. 1384 01:17:16,900 --> 01:17:19,150 I just got unlucky, right? 1385 01:17:19,150 --> 01:17:22,300 So the first type of error that you can make 1386 01:17:22,300 --> 01:17:25,570 is this type I error. 1387 01:17:28,160 --> 01:17:31,970 It's also sometimes referred to as producer error, 1388 01:17:31,970 --> 01:17:33,890 producer risk. 1389 01:17:33,890 --> 01:17:35,390 You're the manufacturer. 1390 01:17:35,390 --> 01:17:38,390 You reject your part because your-- 1391 01:17:38,390 --> 01:17:40,970 or you reject a batch, say, because your sample 1392 01:17:40,970 --> 01:17:42,710 was way out here in the tail. 1393 01:17:42,710 --> 01:17:45,980 You're taking the risk of rejecting and throwing away 1394 01:17:45,980 --> 01:17:50,900 good product, even though it really was good. 1395 01:17:50,900 --> 01:17:54,740 If I took more samples, it would go back and really indicate 1396 01:17:54,740 --> 01:17:55,820 what was going on-- 1397 01:17:55,820 --> 01:17:58,010 that the product was still good. 1398 01:17:58,010 --> 01:18:01,250 So it's also sometimes referred to as producer risk. 1399 01:18:01,250 --> 01:18:04,950 But there's another possible error. 1400 01:18:04,950 --> 01:18:11,200 There is an error associated with the case where the distribution shifted 1401 01:18:11,200 --> 01:18:12,880 or changed. 1402 01:18:12,880 --> 01:18:16,210 I still accepted it based on a random sample 1403 01:18:16,210 --> 01:18:17,950 from the different distribution that 1404 01:18:17,950 --> 01:18:20,800 happened to fall in my original region of acceptance. 1405 01:18:20,800 --> 01:18:24,640 And that's referred to as a type II error-- 1406 01:18:24,640 --> 01:18:28,305 it has a probability associated with it called beta. 1407 01:18:28,305 --> 01:18:30,055 We've been talking all about these alphas. 1408 01:18:30,055 --> 01:18:31,720 Well, there's also a beta. 1409 01:18:31,720 --> 01:18:37,510 It's also sometimes referred to as a consumer's risk. 1410 01:18:37,510 --> 01:18:40,690 The manufacturer did a little inspection. 1411 01:18:40,690 --> 01:18:43,270 The mean happened to fall in the region of acceptance. 1412 01:18:43,270 --> 01:18:44,940 He shipped it. 1413 01:18:44,940 --> 01:18:48,030 Turns out, by bad chance, it actually just happened 1414 01:18:48,030 --> 01:18:49,500 to fall in the good region. 1415 01:18:49,500 --> 01:18:54,470 It really is coming from a bad distribution. 1416 01:18:54,470 --> 01:18:55,610 So let's look at that. 1417 01:18:55,610 --> 01:18:57,770 What is this beta? 1418 01:18:57,770 --> 01:18:59,990 Well, for the type II errors, we essentially 1419 01:18:59,990 --> 01:19:05,120 have to hypothesize a shift of some size, some little delta. 1420 01:19:05,120 --> 01:19:08,330 And then we assess the probabilities 1421 01:19:08,330 --> 01:19:12,590 that I'm drawing from the tail of that shifted distribution 1422 01:19:12,590 --> 01:19:14,630 and just happen to fall over here 1423 01:19:14,630 --> 01:19:20,040 in this region of acceptance for our good distribution. 1424 01:19:20,040 --> 01:19:23,210 So this is the probability associated 1425 01:19:23,210 --> 01:19:24,470 with our null hypothesis. 1426 01:19:24,470 --> 01:19:26,720 This is our starting distribution. 1427 01:19:26,720 --> 01:19:29,600 Our alternative hypothesis here is 1428 01:19:29,600 --> 01:19:32,375 that I had a plus delta shift in the mean.
1429 01:19:34,950 --> 01:19:38,900 So this is our possible new operative distribution. 1430 01:19:38,900 --> 01:19:41,040 And in fact, for a type II error, 1431 01:19:41,040 --> 01:19:43,860 this shifted distribution is actually the one at work. 1432 01:19:43,860 --> 01:19:46,950 Remember, this is the region of acceptance. 1433 01:19:46,950 --> 01:19:50,690 So I'm claiming this is good. 1434 01:19:50,690 --> 01:19:54,110 But if the population actually shifted over there 1435 01:19:54,110 --> 01:19:56,570 to the right, notice off on the left 1436 01:19:56,570 --> 01:20:01,560 here we've got this whole tail, where 1437 01:20:01,560 --> 01:20:04,020 if I drew from the shifted distribution, 1438 01:20:04,020 --> 01:20:07,380 I've got that tail, that lightly shaded blue tail, falling 1439 01:20:07,380 --> 01:20:09,930 in the region of acceptance, where I would say it's 1440 01:20:09,930 --> 01:20:14,140 a good distribution and erroneously accept. 1441 01:20:14,140 --> 01:20:19,000 And one can simply apply the same probabilities to basically 1442 01:20:19,000 --> 01:20:21,280 go in and calculate-- 1443 01:20:21,280 --> 01:20:26,830 just integrate up and do the cumulative normal distribution 1444 01:20:26,830 --> 01:20:32,200 function to calculate what that tail is. 1445 01:20:32,200 --> 01:20:36,410 So it's all the same probabilities. 1446 01:20:36,410 --> 01:20:40,510 So the applications of this are really going to be 1447 01:20:40,510 --> 01:20:42,470 in hypothesis testing. 1448 01:20:42,470 --> 01:20:44,020 This would be shifts of the mean. 1449 01:20:44,020 --> 01:20:47,470 You can start to see worrying about monitoring your process 1450 01:20:47,470 --> 01:20:50,140 and seeing if something changed in your process, 1451 01:20:50,140 --> 01:20:53,260 a shift occurred, and being able to detect that. 1452 01:20:53,260 --> 01:20:55,270 And that gets us to control charting 1453 01:20:55,270 --> 01:20:58,160 that we'll do next time. 1454 01:20:58,160 --> 01:21:00,730 So this is all pretty much the same stuff. 1455 01:21:00,730 --> 01:21:03,250 And now this is a peek ahead. 1456 01:21:03,250 --> 01:21:06,250 You'll see process control. 1457 01:21:06,250 --> 01:21:09,010 And we'll talk about repeated samples 1458 01:21:09,010 --> 01:21:13,840 in time coming from the same distribution next time. 1459 01:21:13,840 --> 01:21:16,240 So we will see you on Thursday. 1460 01:21:16,240 --> 01:21:20,980 And we will dive into Shewhart control charts.
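To close the loop on that beta calculation, here is a minimal sketch of integrating the shifted distribution over the acceptance region, as described above; the shift delta and all other numbers are assumptions for illustration, not lecture data.

```python
# Sketch of the type II (beta) error: the probability that the sample
# mean from a mean-shifted process still lands in the acceptance region.
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 100.0, 10.0, 50, 0.05   # assumed in-control process
delta = 3.0                                    # hypothesized mean shift
se = sigma / np.sqrt(n)                        # standard error of x-bar
half = norm.ppf(1 - alpha / 2) * se
lo, hi = mu0 - half, mu0 + half                # acceptance region for x-bar
# Integrate the shifted sampling distribution over that region:
beta = norm.cdf(hi, mu0 + delta, se) - norm.cdf(lo, mu0 + delta, se)
print(beta)   # probability of erroneously accepting the shifted process
```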