DUANE BONING: OK, so last time we continued our discussion of design of experiments, especially looking at fractional factorial designs, some of the aliasing patterns that come up, and how that interplays with model construction: in particular, what terms of a model you can include and what you can't, as well as a few ideas on different kinds of patterns, things like the central composite pattern as well as fractional or full factorial.

What I want to do today is pick up a little bit more on response surface modeling, or RSM. We've already touched on some of this, but there are a couple of things I've alluded to that we haven't really shown you, things like how one gets confidence intervals on the estimates of coefficients in the model. Just like when we were doing estimation of statistical distributions, we said we want more than just an estimate of the mean or of the variance of a process; we would like to know in what range, say with 90% confidence, we think the true mean or true variance lies. Similarly, when we're fitting models and model coefficients, we'd like some notion of the range in which the true model coefficients likely lie, based on the data that we have. So I want to go over that a little bit.

And then we'll start talking about using these models for process optimization: combining a little bit of the response surface methodology with design of experiments, both in sequential fashion and in iterative fashion, where one might adapt the model on the fly based on additional experiments in order to drive the process, to seek out and find an optimum in the process. So that's the plan. I've noted here a reading assignment.
You can read all of chapter 8; it's actually interesting, but what I'm mostly focused on are the first three sections in May and Spanos, which talk about process modeling. They cover a lot of the material here on response surface models, model fitting, a little bit of regression, and then also using these things for optimization. There are also a couple of chapters with a little more advanced material on principal component analysis, which we may come back to a little bit later. OK, so that's the plan.

Here's a list of some of the fundamentals of regression. When we were talking about factorial and fractional factorial design, especially analyses formed out of contrasts, that simplified method using differences of different collections of the data, we found those were very useful, quick ways to estimate model effects, to fill those into ANOVA tables and decide if those effects are significant, and then also to relate those effects to model coefficient estimation.

I want to talk a little bit about the alternative perspective, which is regression as a way of fitting those coefficients. And we've already done some of that. What I'm going to illustrate here is our basic assumption, and what falls out of using minimization of squared error to fit, or estimate, the coefficients in a model. And I want to talk a little bit more about estimation. We've already touched on estimation using the normal equations, but especially I want to talk again about the variance in these coefficients, things like the confidence intervals for fitted coefficients.

I'm going to do this mostly in the context of a simplified perspective, a one-parameter model: I just have one input and one output. We'll build it up to a simple linear model, but all of these ideas also carry through for polynomial regression and for multiple inputs.
But I think it's a little bit easier to see and discuss in the context of a simplified model. We also talked the last couple of times about lack of fit, and I have a little example that carries us through the development of a model, looking for lack of fit, seeing lack of fit, and extending the model. So there's a small example embedded in here. In fact, that small example might look familiar to those of you who saw or took 2.853; it's actually the same model that I described in a very condensed lecture there on regression.

It's also important, I think, for us to get a little bit of terminology. You've probably run into measures of model goodness, an overall summary measure like R-squared, that attempt to capture how good the model is at describing what's going on with your data. Once one has done the ANOVA analysis, it's actually quite easy to calculate both the goodness-of-fit R-squared and the adjusted R-squared, as shown here, because both of these R-squared measures and the ANOVA look at the amount of variation in your data and the amount of variation expressed in your model, and use those to summarize how good the model is.

The first measure, this R-squared, is basically asking: if I were to simply model my output as the mean, how much better does a model that has more than the mean in it do in explaining the data? So essentially what we do is look at the sum of squared deviations around the mean; this is the total sum of squared deviations around the mean. And then we ask: how much of the sum of squared deviations is explained by the model, compared to the total deviations around the mean? What fraction of those is captured in the model?

So in other words, suppose there's really nothing going on except a flat dependency; that is, there is no slope with x.
As I vary x, nothing changes; then this simplified notion of R-squared is basically saying there is no dependency on x, and therefore the model explains essentially nothing. Now, it's funny, because we are ignoring the fact that you might also be fitting the mean value. But the notion captured in the R-squared is dependence on the input, dependence on x.

Now, the big gotcha with this simple measure is that I can always add model coefficients and fit more of my data, or at least I can do that ignoring replication. For example, we saw that with a two-input, one-output model and a full factorial, if I just have those four corner points, I can fit up to a second-order model with the interaction terms. With four data points, I could fit the mean, the two first-order terms, and the interaction: exactly four coefficients. And in that case, what would the R-squared be if I fit my data with all four coefficients? One. I would fit the data perfectly. Again, this is without replication. And therefore I'd have an R-squared of 1.

Now, is that really a perfect model? Well, kind of, but what you've done is use all of the degrees of freedom in the data to fit the model. We also don't have any notion of replication, which isn't really completely captured. So one way of penalizing ourselves for the use of these additional model terms is to take a different perspective, referred to as the adjusted R-squared, which essentially looks at the residual data. Rather than the deviations captured by the model, it's asking: what deviations are not captured by the model? What residuals would I have? This also has the side effect of penalizing us for the use of additional model coefficients, because we use up degrees of freedom in the model when we add model coefficients.
So very often people talk about the adjusted R-squared as the fair comparison between models, especially between a simplified model with fewer coefficients and a more complicated model with more coefficients. Essentially what we do is form the ratio of the mean square error of the residuals over the total mean square variance, if you will, captured by deviations around the mean, and then subtract that from 1. The way I like to think about it is: I start with the perfect model, and then consider the residual error, which could include both replication error and lack-of-fit error. Whatever fraction of the variation I don't capture (the residual sum of squared deviations divided by its degrees of freedom, my mean square error estimate, relative to my estimate of the true total variance around the mean), that's what I'm not modeling.

So essentially what we're doing is simply looking at what's not expressed in the model. The model can never capture pure replication error, so it's got that variance in it, but it might also have lack of fit in it.

Most statistical packages will report both of these numbers; you can also calculate them. But generally I like the adjusted R-squared as the better measure, in part because it feels to me a little more conceptual and comprehensive in terms of telling me what's not captured in the model: how much of the pure variation going on in the data is not in the model. However, you have to be really careful interpreting what that R-squared is telling you. It's not necessarily telling you that your model is good or bad. You might have a perfect model whose apparent R-squared is limited by inherent noise in the data.
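To make these two measures concrete, here is a minimal sketch in Python; the data, the predictions, and the coefficient count `p` are hypothetical stand-ins for some fitted model, just for illustration:

```python
import numpy as np

# Hypothetical observations and model predictions, for illustration only.
y     = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # observed outputs
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # predictions from some fitted model
n, p = len(y), 2                               # p = number of fitted coefficients

ss_total = np.sum((y - y.mean())**2)   # total SS of deviations around the mean
ss_resid = np.sum((y - y_hat)**2)      # SS not captured by the model

r2 = 1 - ss_resid / ss_total
# Adjusted R^2 divides each SS by its degrees of freedom,
# penalizing the model for each extra coefficient it uses.
r2_adj = 1 - (ss_resid / (n - p)) / (ss_total / (n - 1))
print(r2, r2_adj)
```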
To that last point: if, underlying everything, I've got a true systematic dependency, but I also have pure replication variance, that variance is going to limit how good your R-squared can possibly be, even if your model were perfect in terms of capturing the systematic dependency. I think there was a question lurking there in Singapore.

AUDIENCE: Yes. So for R-squared and adjusted R-squared, the closer those values are to 1, the better the model is?

DUANE BONING: Yes, definitely. Definitely closer to 1 is better.

AUDIENCE: OK.

DUANE BONING: But you have to be a little careful in interpreting, because even--

AUDIENCE: But Professor, what you just said is that the R-squared mixes both the error of the model and the error of the noise, so you can't really differentiate between those two.

DUANE BONING: That's right, that's right. And that's where a lack-of-fit analysis, and we'll go in and do one of those as well, is still important for being able to differentiate between those two sources of imperfection in the model. Yeah?

AUDIENCE: Also, you mentioned the second R-squared also being [INAUDIBLE].

DUANE BONING: Right.

AUDIENCE: If your main concern is fit, and having more coefficients is cheap, would you prefer R-squared or adjusted R-squared?

DUANE BONING: So the question is what I would prefer if fitting additional coefficients is cheap.

AUDIENCE: And fit is more important.

DUANE BONING: And fit is more important. I think I would still essentially look at the adjusted R-squared as the more representative description of the trade-off: adding coefficients improves my fit, but the adjusted R-squared doesn't get as much better, and in fact, if I start overfitting, it will tend to degrade slightly.
However, what I think is a better mechanism for actually making the decision about whether to include coefficients is the analysis of variance, looking at the significance of those model coefficients, both the significance and the magnitude. So I would tend to do the regression analysis together with the ANOVA. The R-squared is a nice aggregate measure, but it's not the thing that drives my decision-making so much. So I hope that helps. We'll see some examples of R-squared values that come out of some analyses.

Now, we said that regression, at least as it is most commonly used, is driven by minimization of a squared error measure. And this is just trying to illustrate what I'm talking about: where the residuals, the differences between my model and my data, may come from in the simple 1D case. We've already talked a bit about this, but I'm using a very, very simple model here, which has only one term. It doesn't even have a constant offset. It simply has a direct linear dependence of the output on the input. And I'm saying that the true model does have some noise in it, which is normally distributed. And I'm fitting, or estimating, that with some coefficient, a little b.

So this is my fit through my data, minimizing the squared deviations, or I'd like it to minimize the squared deviations. And again, we're saying that any difference between the model prediction and the data, essentially y-hat sub i minus y sub i for that data point, is a residual. That's an error. And it can come from two factors again: either lack of fit in the model, or the underlying noise in the data.

Now, last time, or maybe even two lectures ago, we talked about the use of regression numerically, if you will, or algebraically, to estimate this beta with the best b, based on minimization of the sum of squared errors. So we take each one of those residuals, square it, and then sum that over all of our data.
And it turns out what we're trying to do is find the b, the estimate beta-hat, that minimizes that sum of squared deviations. What's nice with linear models is that there's an algebraic way to find the b that does that minimization for us. But I also want to remind you that lurking inside of that minimization is an estimate of the total sum of squared residuals, SSR, which is what's lurking back there in that R-squared and adjusted R-squared. And if I divide that by the degrees of freedom, nu sub R, then I've also got my estimate of the variance in the underlying model, assuming no lack of fit.

So we said that with least squares estimation, I can form the set of linear equations. And requiring that the residuals be orthogonal to the input, the sum of the products of the residuals and the inputs should be 0. When you carry through the algebra for that, out pops the formula for the slope coefficient given our data: simply the sum of the products x sub i times y sub i over the sum of the x sub i squared. And as I said, here's our estimate of the underlying variance. That's our best, unbiased estimate of the process variance. In this case, we're only fitting one model coefficient, so I've got my total number of data points n, and the degrees of freedom are just n minus 1, since I've only got one model coefficient.

Now, the interesting thing that I've alluded to in a previous lecture but haven't shown you: I want more than just the best estimate of b. I'd like to have a confidence interval on b. Given the spread in the data and an underlying normal noise assumption, what do I think the range, say a 95% confidence interval, might be on my estimate of b? We can do that very simply by taking the formula for b and doing our variance-of-b calculation on that formula. It's just variance math, and that's what's broken out here.
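(For reference, the formulas being described, reconstructed from the definitions above for the no-intercept model \( y_i = \beta x_i + \epsilon_i \), are

$$
\hat{b} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \hat{b}\,x_i\right)^2,
\qquad
\operatorname{Var}(\hat{b}) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2},
$$

where \( \hat{b} \) comes from the orthogonality condition \( \sum_i x_i \,(y_i - \hat{b}\,x_i) = 0 \).)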
If I expand the b summation into a sum of those individual terms, I can then apply my normal variance math. Thinking of each of these elements as some constant times an underlying variable, the variance of that sum of terms is the value of each constant squared times the variance of each of those underlying variables. And when you go and do that, what you get is another formula down here for the variance in that coefficient b, based on the data that you've got.

So once I've got that, I've got my estimate for the variance, an estimate of what one standard deviation in b would be. And then you can express that with whatever confidence interval you want. So I might write that typically as b plus or minus one standard error, one standard deviation in b. One standard deviation, I can't remember, what does that correspond to, typically? About a 90% confidence interval? Plus or minus one standard deviation? 68%, thank you. The one I always remember is two standard errors; that's about 95% confidence. So if you want a 95% confidence interval, now you know how to formulate that; the multiplier might be 1.96 or whatever it is.

So there you have, falling nicely out of the basic mathematical formulation for minimizing the sum of squares, both the best estimate for your slope and a confidence interval on the slope.

By the way, if you base that on a relatively small number of data points, you should probably use a t distribution rather than a normal distribution. So that might change the 1.96 for a 95% confidence interval that we're used to.

This also lets us now go back and think again about another perspective on analysis of variance. In fact, you played with this a little bit, or saw it in a slightly different form, on the quiz.
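Pulling the slope estimate and its confidence interval together, here is a minimal sketch with hypothetical data; it uses the t distribution, as just recommended for small samples:

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) data for the no-intercept model y = b*x + noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
n = len(x)

b = np.sum(x * y) / np.sum(x**2)        # least-squares slope
resid = y - b * x
s2 = np.sum(resid**2) / (n - 1)         # noise variance estimate (one-coefficient fit)
se_b = np.sqrt(s2 / np.sum(x**2))       # standard error: Var(b) = sigma^2 / sum(x_i^2)

t_crit = stats.t.ppf(0.975, df=n - 1)   # roughly 1.96 for large n; wider when n is small
ci = (b - t_crit * se_b, b + t_crit * se_b)
print(b, ci)
# If this interval includes 0, the slope is not significant at the 5% level.
```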
There are two ways of thinking about the significance of whether some slope or model coefficient should be included in the model. The basic hypothesis question is: do I have enough evidence to suggest that that slope term is non-zero? If it might be 0, to some degree of confidence, then I shouldn't include it.

So one way of doing it is the ANOVA, with the ratio of variances in the F test. The other way is basically looking at the confidence interval for beta, say the 95% confidence interval. If that interval includes 0, that says that more than 5% of the time, based on just random variation in the data, I might have a zero coefficient there, in which case I cannot say that it is significantly different from 0. So you can also make the determination about whether to include a model coefficient based on your confidence interval for each individual term.

That's just alluding back to what we already know, but trying to make sure you see the connection, the alternative ways of looking at it: either in the ANOVA table, or, if you want to look at individual coefficient terms, the confidence intervals on those individual coefficients.

OK, let's do an example. Here's a very simple set of data. We've got some input, some x value; call that "age". And some y values; call that "income". And if I just plot the data -- let me get the data up here -- actually, what I've done here is use JMP. I don't know how many of you have played with JMP, but I love it because it's nice and interactive. It does a lot of regression analysis and lets me explore the data fairly interactively; I like it a lot better than Excel for doing some of these analyses. I think in an earlier problem set we gave you a pointer to where you can run it on Athena and so on.

And what this is doing is basically my analysis of variance for a very simple linear model without a constant term.
So I've just got one model coefficient. The table looks at the sum of squares and the mean square, looks at the residual from the remaining data points, and forms an F. That F ratio is huge, about 1,000, and the probability of observing that large an F by chance is minuscule. So I have great confidence that there is in fact a slope. And if I look down here at my income leverage residual versus the age parameter, which is basically just y sub i plotted against x sub i, I see a definite trend.

Now, in this nice plot, the solid line is my best fit, but it has also plotted, with the dashed lines, the confidence interval on the output; I think it's a 95% confidence interval on the output. Now, I told you how to get an estimate of the confidence interval for our b term. How do we get a confidence interval on the output? Well, what we're going to need to do is also carry the variance calculation through our formula for y and see how uncertainty in our data propagates through to uncertainty in our output.

But before we do that, we can also see here in the JMP output things like the parameter estimate for our age dependence. Our best guess for the age dependence is simply 0.5. It also shows us things like the standard error in these typical ANOVA tables, which we've ignored in the past if you've been looking at these. That can be used directly, as we talked about, to give me a confidence interval, depending on whatever level of alpha I want to estimate these things at. And it's also computing an individual t ratio for each of the coefficients. I've only got one here, but it's basically doing a one-by-one assessment of each of my model coefficients to see if it's significant. And in fact it is significant; it ends up being exactly the same probability, though that's not really shown here. Essentially, the t test and the F test are identical in this simple example.
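As a small check on that last point, here is a sketch showing that the per-coefficient t ratio squared equals the ANOVA F ratio for a single-coefficient model. The numbers are synthetic stand-ins; the lecture's actual age/income data isn't reproduced here:

```python
import numpy as np

# Synthetic stand-in data: income = 0.5 * age + noise.
rng = np.random.default_rng(0)
age = np.linspace(20, 60, 15)
income = 0.5 * age + rng.normal(0, 1.0, size=age.size)
n = age.size

b = np.sum(age * income) / np.sum(age**2)   # slope of the no-intercept fit
resid = income - b * age
ms_resid = np.sum(resid**2) / (n - 1)       # residual mean square

t_ratio = b / np.sqrt(ms_resid / np.sum(age**2))   # per-coefficient t test
F = (np.sum((b * age)**2) / 1) / ms_resid          # ANOVA F with 1 model df

print(t_ratio**2, F)   # identical for a single-coefficient model
```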
AUDIENCE: [INAUDIBLE] some subset of data, wouldn't it make sense to hold out part of the data and then use some for testing the model, seeing if it actually has predictive power? Because if you use that entire data set, then essentially--

DUANE BONING: That's an interesting point. So what you're saying is: if you have a fair amount of data, how about holding out some of the data, fitting the model on some portion of it, and then using the held-back data to test the model? I think, especially when you do nonlinear models, and I don't mean just polynomial but some other nonlinear dependence, that cross-validation is extremely common and very useful. Here, you could do that. And essentially what I think that's doing is allowing you to do a lack-of-fit versus noise estimate. In other words, conceptually what you're doing there is saying: here's what my model would have predicted, here's my data point, and there's a residual that I'm going to attribute to a mix of underlying random noise and model lack of fidelity.

I think it's more common to go ahead and use all of your data, because then you've got your aggregate measures and can run all of your tests with the highest resolution possible. But I suspect there's actually a relationship that's very close in there. I think it's a little better to use all of the data, because the more data you have, the better your estimates of underlying process variance are, so you can better differentiate lack of fit from noise. But I haven't thought about that very much, especially for the simple linear cases. It's an interesting approach.

So I want to come back to this lack of fit versus pure error, because we talked about often being able to do multiple runs at the same x values. In the data I've shown you here, we actually have a difficulty distinguishing between model lack of fit and underlying variance.
I had to basically make an assumption that my underlying model was truly linear. I'm basically assuming, if I go back even further here (where did my data go?), a model of the form y sub i equals beta times x sub i plus epsilon sub i. Why? I have really nothing except ideas of parsimony, simple models in general, and perhaps prior knowledge of the physics of the process to really say this is the form of the model. If you look at my data, why couldn't my model be that instead? It may well be. It might have a very complicated structure. That might be true. The problem is that in this happenstance data I don't have any replicates to give me an independent notion of underlying repeated-measurement noise, separate from model form.

And so that goes back to what we said: if we have multiple runs at the same x values, especially if we design an experiment so that we do that and we aren't just using this sort of happenstance data, then we can decompose the total residual error into lack-of-fit and pure replicate error, and start to be able to distinguish between model structure error and pure replication error. And we talked previously about being able to form the F test: the ratio of the variance explained by deviations of the model prediction from the replicate data, over the pure replication error, and then seeing how likely it would be to observe that ratio, using the F test in the ANOVA. We'll come back to that a little bit in an example.

This is a quick one. The previous example was a pure linear term without even a constant offset. We can also do models that have both a slope term and a constant term. And this is simply formulated here as a mean-centered model. If I take my data and ask what happens when x is at its mean, this term is 0. So a is not really an intercept; the a coefficient is the model value when x is at its mean.
I could similarly formulate it so that the a coefficient would be the value when x is 0. The point is that the same approach for estimating both a linear term and a constant offset term applies, and the same notion of getting not only estimates but also confidence intervals, based on the variances of those coefficients, applies. So we can also use this to get confidence intervals, not only on the slope term but also on the offset term.

Now, what's also nice is we can do the same math and look at the variance in our prediction of the output. I already alluded to that with the confidence intervals on that plot of y versus x in the one set of data. If I say, OK, this is my best estimate of the underlying linear model with an offset term, and I just do my variance math on it, I've got the variance of a sum of terms. And if you carry through that math: at each x sub i, the quantity x sub i minus x bar is just a constant, since x bar is a constant and x sub i is a constant. So in the variance math, when I look at the variance of that term, I've got that constant squared out in front of the variance of my b. We already calculated what the variance of the a term and the variance of the b term were; I can plug those in and get an overall estimate of the variance of each of the predicted y sub i terms in my model. And once I've got that single standard error, my single standard deviation, I can use the t or the normal distribution to get a confidence interval on the output.

So it's the same thing we did on the coefficients. I can also use it to tell me what kind of spread, what confidence I have in where the true output should lie when I'm predicting for any x value.
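Here is a minimal sketch of that propagation for the mean-centered model y = a + b(x - xbar), with hypothetical data; the variance formula in it is the standard result of the math just described:

```python
import numpy as np
from scipy import stats

# Hypothetical data for the mean-centered model y = a + b*(x - xbar) + noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 2.9, 4.1, 4.8, 6.2, 6.8])
n, xbar = len(x), x.mean()

sxx = np.sum((x - xbar)**2)
b = np.sum((x - xbar) * y) / sxx                       # slope
a = y.mean()                                           # model value at the mean of x
s2 = np.sum((y - (a + b * (x - xbar)))**2) / (n - 2)   # two fitted coefficients

# Variance of the predicted mean response at new points x0:
x0 = np.linspace(x.min() - 1, x.max() + 1, 50)
var_yhat = s2 * (1.0 / n + (x0 - xbar)**2 / sxx)

t_crit = stats.t.ppf(0.975, df=n - 2)
y0 = a + b * (x0 - xbar)
lower = y0 - t_crit * np.sqrt(var_yhat)   # 95% band: narrowest at x0 = xbar,
upper = y0 + t_crit * np.sqrt(var_yhat)   # widening as x0 moves away from the mean
```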
Now there's an interesting aspect to this. If I look at any particular x input value, notice that the denominator here is a sum over all of my data, so it ends up being just a constant; it doesn't change. But depending on what x I'm looking at, where I am in x, the size of the numerator term changes.

So for example, if I look at my mean, where my x sub i is equal to x bar, that numerator term goes to 0. And essentially what I've got in that case is that at the mean of my data, the variance in my output estimate is basically just related to the random noise in the data. But then as I get further and further from the mean, my confidence interval on my output spreads.

So what you will often see on data (this axis is my x data and this is my y) is that near the center of your data you've got the narrowest confidence intervals, and the further away I get in x from x bar, if I draw the dashed lines for a 95% confidence interval on the output, the wider my prediction interval becomes. Even though I may still be interpolating within the data I've got, my variance does spread as I get further and further away. Just an interesting fact.

All right, we're almost ready to do a polynomial example. I just want to point out, as we talked about previously, that we can include not only a constant term and a linear term; we can also include terms with a square polynomial, for example curvature in an x-squared term. One important fact: this is still linear in the coefficients. And what this means is that the least squares approach, least squares minimization, still applies. So you can still do least squares minimization to estimate your beta coefficients. And essentially what you do mechanically, say in something like Excel, is create an additional "fake" column of data, just taking your x.
You can almost think of it as equating that with an x2: think of the original input as x1, and build a new data column by taking each of your x values and squaring it; that becomes a new x sub 2 input. Then all you're doing is a linear fit in these multiple coefficients. So it looks exactly like what we did for multiple inputs, even when we have additional higher-order terms in x squared.

So let's look at a simple example that pulls these threads together: look at confidence, but also look at the case where I've got some replicate data, so we can get a little experience with this lack-of-fit idea. In this case, importantly, we've got runs where I've replicated my x values. I've got two runs with 20 grams of some kind of growth supplement, so I've got two different output values at that point. And I've got another point that is triply replicated.

What I'd like to do is try to fit a model, and here what we've got in the picture is an inkling, a foreshadowing, of some of the kinds of models and issues we might consider. If we look, I think you can see it here, the basic data is in black; these are the data points. So this is just my output; there's my triply replicated data; there's my x data. First off, I could try to fit that with just a mean. That's the red line, the pure mean of my data. The green line is a first-order fit, a slope coefficient and the mean, so two model terms. And you can see already that's not going to be a very good model. What we've got here, with the replicates, is enough data to perhaps detect that using our ANOVA machinery, and then perhaps build up to a second-order model; we can already get a sense that a quadratic model is going to fit the data a lot better.
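Mechanically, the "extra column" trick looks like this. The dose/response numbers here are hypothetical, echoing the replicated structure of the growth-supplement example only in spirit:

```python
import numpy as np

# Hypothetical data with replicated x values: two runs at 20 g, three at 30 g.
x = np.array([10.0, 15.0, 20.0, 20.0, 25.0, 30.0, 30.0, 30.0])
y = np.array([5.2, 8.9, 11.8, 12.4, 13.1, 12.2, 11.8, 12.5])

# Build the extra "fake" column: treat x**2 as a second input, then do an
# ordinary linear least-squares fit in the coefficients.
X = np.column_stack([np.ones_like(x), x, x**2])
(a, b1, b2), _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(a, b1, b2)   # mean/offset, linear, and quadratic coefficients
```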
Now, suppose I just try it. First off, you should always plot your actual data so you have a feel for what kind of model is going to be needed. If you were to actually plot this data, you would already see that you probably need a quadratic model, so you might go ahead and include that term up front. But let's say we had not done that, and we just tried to fit it with a very simple linear model. If we go through and do the ANOVA, now, because we do have replicated runs, I can split my overall residual sum of squared deviations into a lack-of-fit term and a sum of squared deviations of my replicated data around their own means. And I can then form a ratio of those two things.

What I've got is deviations from my model that are much larger. So this is a deviation; actually, that's not a good example, right there the deviation from the model is quite small. If I look right here, for example, this is my deviation from the model, and I don't have any replicate data there. Right here, I've got deviation from the linear model, and then I've also got pure replicate error. And you can start to see that the deviation from my model's best prediction is much, much larger. That's what shows up in this ratio of the two variances. If you follow through with the F test, that big a ratio is highly unlikely to occur by chance, given the noise spread. So if you actually go in and do the lack-of-fit analysis, it's already raising big red flags. Here's my red flag saying: look out, look out, you've got a lot of evidence of lack of fit.

What's interesting in this example is that if I were to just look at the significance of the individual model terms, it pops out that the mean is highly significant but the slope term is not.
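(For reference, the decomposition that raises that red flag can be written as follows: with n total runs at m distinct x settings, p fitted coefficients, and replicate group means \( \bar{y}_j \),

$$
SS_{\mathrm{resid}} = SS_{\mathrm{LOF}} + SS_{\mathrm{PE}},
\qquad
SS_{\mathrm{PE}} = \sum_{j=1}^{m}\sum_{k}\left(y_{jk} - \bar{y}_j\right)^2,
\qquad
F = \frac{SS_{\mathrm{LOF}} / (m - p)}{SS_{\mathrm{PE}} / (n - m)},
$$

and a large F signals lack of fit.)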
751 00:44:45,930 --> 00:44:47,280 So this would say-- 752 00:44:47,280 --> 00:44:48,960 if I weren't looking at lack of fit 753 00:44:48,960 --> 00:44:51,870 and paying attention to that red flag, 754 00:44:51,870 --> 00:44:57,270 I might be tempted to say a very wrong thing. 755 00:44:57,270 --> 00:45:00,570 I might be tempted to say there is a significant estimate 756 00:45:00,570 --> 00:45:07,710 of the mean that's non-zero, but given the spread in my data, 757 00:45:07,710 --> 00:45:10,530 I cannot conclude that there is a linear dependence 758 00:45:10,530 --> 00:45:12,390 on my input. 759 00:45:12,390 --> 00:45:18,090 My linear dependence on x could be 0. 760 00:45:18,090 --> 00:45:20,370 In other words, with that green line 761 00:45:20,370 --> 00:45:25,820 right here, that's a small slope that, given 762 00:45:25,820 --> 00:45:30,740 the spread in my data, is not justified to actually estimate 763 00:45:30,740 --> 00:45:32,060 as anything other than 0. 764 00:45:35,720 --> 00:45:36,690 Interesting, huh? 765 00:45:39,350 --> 00:45:41,930 So you really need to look at both. 766 00:45:41,930 --> 00:45:44,120 I'd have to be very careful because 767 00:45:44,120 --> 00:45:46,910 the extra explanatory power of the linear term 768 00:45:46,910 --> 00:45:48,860 is very, very minimal here. 769 00:45:48,860 --> 00:45:50,690 So I might think, OK, so I've really 770 00:45:50,690 --> 00:45:53,210 got no dependence at all, when what I've really 771 00:45:53,210 --> 00:45:54,380 got is lack of fit. 772 00:45:57,820 --> 00:45:58,620 Is that making sense? 773 00:46:01,210 --> 00:46:04,780 So what I might then do is say, OK, I am paying attention 774 00:46:04,780 --> 00:46:05,710 to that big red flag. 775 00:46:05,710 --> 00:46:06,790 I've got lack of fit. 776 00:46:06,790 --> 00:46:13,460 Maybe I better add a quadratic term, refit my data. 777 00:46:13,460 --> 00:46:19,300 So now if I look at the ANOVA for my model 778 00:46:19,300 --> 00:46:22,150 with the mean, with a term for the linear coefficient, 779 00:46:22,150 --> 00:46:25,960 and one for the quadratic, now what do I get? 780 00:46:25,960 --> 00:46:29,620 I return to breaking apart my residual 781 00:46:29,620 --> 00:46:34,330 and now looking and seeing how much deviation is there 782 00:46:34,330 --> 00:46:38,650 due to lack of fit compared to underlying replicate variance. 783 00:46:38,650 --> 00:46:40,880 And now that ratio is very small. 784 00:46:40,880 --> 00:46:44,620 So now I no longer have any evidence 785 00:46:44,620 --> 00:46:47,770 of lack of fit. That's good. 786 00:46:47,770 --> 00:46:50,320 And now I can return to deciding 787 00:46:50,320 --> 00:46:54,820 whether individual terms are significant. 788 00:46:54,820 --> 00:46:59,380 Now, we don't see the full F test here; it's an incomplete ANOVA. 789 00:46:59,380 --> 00:47:01,750 But what we would basically find here 790 00:47:01,750 --> 00:47:05,830 is the mean term is significant, the quadratic term 791 00:47:05,830 --> 00:47:08,197 is significant. 792 00:47:08,197 --> 00:47:09,280 How about the linear term? 793 00:47:12,160 --> 00:47:14,180 It's still not significant. 794 00:47:14,180 --> 00:47:17,620 So in fact, we've got a mean and a square term 795 00:47:17,620 --> 00:47:20,590 but no dependence on the linear term. 796 00:47:20,590 --> 00:47:22,090 You will typically see that.
797 00:47:22,090 --> 00:47:26,770 In fact, if these terms are truly orthogonal, 798 00:47:26,770 --> 00:47:29,590 then if I add terms, it should not change my estimates 799 00:47:29,590 --> 00:47:31,300 for the other terms. 800 00:47:31,300 --> 00:47:34,570 That's not quite true if you throw those missing terms 801 00:47:34,570 --> 00:47:36,820 into the noise. 802 00:47:36,820 --> 00:47:40,750 But the basic point here is I've now actually captured 803 00:47:40,750 --> 00:47:47,610 the dependence on x with this quadratic term. 804 00:47:47,610 --> 00:47:49,660 So you can do exactly the same thing. 805 00:47:49,660 --> 00:47:54,210 This is the same data using Excel. 806 00:47:54,210 --> 00:47:56,730 And you get the same kind of a table 807 00:47:56,730 --> 00:47:59,970 here with an x term and x squared term. 808 00:47:59,970 --> 00:48:03,390 And what's interesting here is you can also 809 00:48:03,390 --> 00:48:06,780 go in and look at estimates of the coefficients, 810 00:48:06,780 --> 00:48:11,010 the standard error, 95% confidence intervals. 811 00:48:11,010 --> 00:48:15,630 And I guess actually if you were to look at that 95% confidence 812 00:48:15,630 --> 00:48:18,960 interval for that x term, it looks like it actually 813 00:48:18,960 --> 00:48:22,170 is likely to be non-zero. 814 00:48:22,170 --> 00:48:24,810 So I did get that right. 815 00:48:28,040 --> 00:48:31,740 So actually you probably should include that term, 816 00:48:31,740 --> 00:48:34,530 even though the ratio is a little bit smaller. 817 00:48:34,530 --> 00:48:36,470 It is still significant. 818 00:48:36,470 --> 00:48:38,720 Now I also put this one up because it's also 819 00:48:38,720 --> 00:48:43,880 got estimates of your R-squared and adjusted R-squared, 820 00:48:43,880 --> 00:48:49,370 which gives you a nice feel. 821 00:48:49,370 --> 00:48:53,840 With an R-squared of around 0.9, 0.95, you start to feel 822 00:48:53,840 --> 00:48:54,462 pretty good 823 00:48:54,462 --> 00:48:55,670 about your model. 824 00:48:58,660 --> 00:49:00,880 So I don't know if you played around with Excel. 825 00:49:00,880 --> 00:49:06,850 So again, I encourage JMP, but if you do need to use Excel, 826 00:49:06,850 --> 00:49:08,620 there is-- 827 00:49:08,620 --> 00:49:12,400 under the data analysis tool, if you pull that down, 828 00:49:12,400 --> 00:49:15,010 you will also see the regression analysis. 829 00:49:15,010 --> 00:49:19,240 And it will let you indicate what your output columns are 830 00:49:19,240 --> 00:49:21,040 and what your input columns are. 831 00:49:21,040 --> 00:49:24,040 And it does just the least squares regression, pops out 832 00:49:24,040 --> 00:49:26,920 your ANOVA table for you. 833 00:49:26,920 --> 00:49:29,410 In that case, you actually have to construct 834 00:49:29,410 --> 00:49:32,800 by hand your x 835 00:49:32,800 --> 00:49:35,350 squared data if you want a polynomial fit. 836 00:49:35,350 --> 00:49:37,870 And that's what I've just illustrated here. 837 00:49:37,870 --> 00:49:41,200 You can't simply, unfortunately, at least in the version 838 00:49:41,200 --> 00:49:45,040 of Excel I have, say I want to try a polynomial model up 839 00:49:45,040 --> 00:49:49,450 to some order and have it just know to do that 840 00:49:49,450 --> 00:49:51,400 on the polynomial input data.
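That is, you build the x squared column yourself and regress on it. A minimal sketch of that step, which also reproduces the kinds of numbers such a tool reports -- coefficient standard errors, 95% confidence intervals, R-squared, and adjusted R-squared -- on the same stand-in data:

```python
import numpy as np
from scipy import stats

x = np.array([10.0, 15.0, 20.0, 20.0, 25.0, 25.0, 25.0, 30.0])
y = np.array([73.0, 87.0, 94.0, 91.0, 95.0, 94.0, 97.0, 92.0])
X = np.column_stack([np.ones_like(x), x, x**2])    # hand-built x^2 column

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
n, p = X.shape
resid = y - X @ beta
s2 = resid @ resid / (n - p)                       # residual variance estimate

# Var(beta_hat) = s^2 (X'X)^-1; standard errors are the sqrt of the diagonal.
cov = s2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))
t_crit = stats.t.ppf(0.975, n - p)                 # 95% two-sided
for name, b, s in zip(["const", "x", "x^2"], beta, se):
    print(f"{name}: {b:.4f} +/- {t_crit * s:.4f}")

# R^2 and adjusted R^2, as a regression tool would report them.
ss_tot = ((y - y.mean())**2).sum()
r2 = 1 - (resid @ resid) / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)
print("R^2 =", r2, " adj R^2 =", r2_adj)
```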
841 00:49:51,400 --> 00:49:53,470 You actually have to create columns 842 00:49:53,470 --> 00:49:55,030 for each of the model coefficients 843 00:49:55,030 --> 00:49:56,478 that you want to estimate. 844 00:50:00,150 --> 00:50:03,780 Here's the same polynomial regression using the JMP 845 00:50:03,780 --> 00:50:07,470 package, again, with all of the lack of fit 846 00:50:07,470 --> 00:50:12,360 versus pure error, the x and x squared terms, 847 00:50:12,360 --> 00:50:16,440 t ratios, all of that, but basically the same analysis 848 00:50:16,440 --> 00:50:20,850 with the second order included. 849 00:50:20,850 --> 00:50:23,400 OK, so with that, I'm about to move on 850 00:50:23,400 --> 00:50:24,780 to process optimization. 851 00:50:24,780 --> 00:50:30,000 But I'd like to take any questions on regression, 852 00:50:30,000 --> 00:50:33,100 confidence intervals, confidence intervals on inputs, confidence 853 00:50:33,100 --> 00:50:34,200 intervals on outputs. 854 00:50:34,200 --> 00:50:35,430 Is that all? 855 00:50:35,430 --> 00:50:37,770 It's starting to feel-- 856 00:50:37,770 --> 00:50:40,950 are you confident in your understanding 857 00:50:40,950 --> 00:50:42,255 of confidence intervals? 858 00:50:42,255 --> 00:50:43,020 Yeah, question? 859 00:50:43,020 --> 00:50:44,660 AUDIENCE: What 860 00:50:44,660 --> 00:50:49,220 do you do if your inputs are correlated? 861 00:50:49,220 --> 00:50:51,350 DUANE BONING: OK, so the question was, what do you 862 00:50:51,350 --> 00:50:53,300 do if your inputs are correlated. 863 00:50:55,890 --> 00:51:02,540 So what is assumed in all of these fits 864 00:51:02,540 --> 00:51:05,090 is essentially you've got orthogonality. 865 00:51:05,090 --> 00:51:07,800 If we go back to the tables we were forming 866 00:51:07,800 --> 00:51:09,740 with full factorial and so on, we're 867 00:51:09,740 --> 00:51:13,950 assuming that each of your columns are orthogonal, 868 00:51:13,950 --> 00:51:17,570 which is to say we're assuming each of your coefficients 869 00:51:17,570 --> 00:51:22,580 in each of your different terms are uncorrelated or orthogonal. 870 00:51:22,580 --> 00:51:27,860 If they are orthogonal, and you do a least squares regression-- 871 00:51:27,860 --> 00:51:31,670 or if they are not orthogonal, that is, they are correlated, 872 00:51:31,670 --> 00:51:33,480 what happens? 873 00:51:33,480 --> 00:51:37,100 Well, what happens is you've got two model coefficients 874 00:51:37,100 --> 00:51:41,120 both trying to explain some amount of the same data. 875 00:51:41,120 --> 00:51:43,200 And they fight against each other. 876 00:51:43,200 --> 00:51:48,980 And it's almost random how that 877 00:51:48,980 --> 00:51:51,650 true underlying effect gets apportioned between, 878 00:51:51,650 --> 00:51:54,390 say, a beta 1 and a beta 2 term. 879 00:51:54,390 --> 00:51:56,630 In fact, with very, very tiny little perturbations, 880 00:51:56,630 --> 00:52:00,170 you can get a different mix of beta 1 and beta 2. 881 00:52:00,170 --> 00:52:03,230 And it turns out you might still be 882 00:52:03,230 --> 00:52:05,300 OK in terms of predicting an output 883 00:52:05,300 --> 00:52:08,400 because at least your model has both of them in there. 884 00:52:08,400 --> 00:52:11,390 But it really screws up your ability to decide 885 00:52:11,390 --> 00:52:17,980 whether that model term is significant or not.
886 00:52:17,980 --> 00:52:21,820 What you need to do is transform your data 887 00:52:21,820 --> 00:52:25,060 to get it into an orthogonal form 888 00:52:25,060 --> 00:52:28,750 to get rid of the correlation-- to basically create new 889 00:52:28,750 --> 00:52:32,980 model coefficients and new explanatory values, 890 00:52:32,980 --> 00:52:39,190 fake x values, that don't have the correlation in them. 891 00:52:39,190 --> 00:52:42,940 And the classic tool for doing that 892 00:52:42,940 --> 00:52:49,390 is principal component analysis, or some transformation 893 00:52:49,390 --> 00:52:55,900 of the data to a different basis than your original x1, x2, x3 894 00:52:55,900 --> 00:52:59,220 inputs. 895 00:52:59,220 --> 00:53:02,580 We might talk a little bit about multivariable things. 896 00:53:02,580 --> 00:53:08,100 I think we did a little bit with multivariate statistics and T 897 00:53:08,100 --> 00:53:12,240 squared charts and so on, but essentially 898 00:53:12,240 --> 00:53:15,180 a principal component or some other kind of transformation 899 00:53:15,180 --> 00:53:17,430 is needed on the data in order to then 900 00:53:17,430 --> 00:53:20,640 have individual coefficients that 901 00:53:20,640 --> 00:53:23,200 are not duplicating each other. 902 00:53:23,200 --> 00:53:27,060 If you look, I think it's section 8 point-- 903 00:53:27,060 --> 00:53:29,010 maybe 8.4-- 904 00:53:29,010 --> 00:53:31,800 the next one after what I assigned as a reading, that 905 00:53:31,800 --> 00:53:33,660 talks about principal component analysis 906 00:53:33,660 --> 00:53:36,270 and how you do that in process modeling. 907 00:53:36,270 --> 00:53:38,070 So you can read that section. 908 00:53:38,070 --> 00:53:42,510 It's actually very good, very interesting. 909 00:53:42,510 --> 00:53:47,990 Other questions on regression? 910 00:53:47,990 --> 00:53:48,490 Yeah? 911 00:53:48,490 --> 00:53:50,073 AUDIENCE: If there is a big difference 912 00:53:50,073 --> 00:53:52,360 between R-squared and adjusted R-squared, what 913 00:53:52,360 --> 00:53:54,010 is that telling us? 914 00:53:54,010 --> 00:53:58,860 In this case, it's essentially [INAUDIBLE] 0.9 and 0.8, 915 00:53:58,860 --> 00:54:01,192 or 0.7 [INAUDIBLE]. 916 00:54:03,413 --> 00:54:04,830 DUANE BONING: Yes, so the question 917 00:54:04,830 --> 00:54:06,330 is what if you have big differences 918 00:54:06,330 --> 00:54:09,570 between R-squared and adjusted R-squared. 919 00:54:09,570 --> 00:54:13,710 I think it's essentially telling you 920 00:54:13,710 --> 00:54:17,930 that the influence of additional model coefficients 921 00:54:17,930 --> 00:54:24,350 is really important, both-- 922 00:54:24,350 --> 00:54:26,060 this is very qualitative. 923 00:54:26,060 --> 00:54:27,860 But essentially, it's telling you 924 00:54:27,860 --> 00:54:31,850 there's more going on than just the mean response. 925 00:54:31,850 --> 00:54:34,490 So you're seeing a little bit of a mix of both-- 926 00:54:34,490 --> 00:54:37,370 the penalty of adding more model coefficients, 927 00:54:37,370 --> 00:54:40,580 but it's also telling you there's 928 00:54:40,580 --> 00:54:45,530 likely additional structure that you need the model 929 00:54:45,530 --> 00:54:47,090 to capture. 930 00:54:47,090 --> 00:54:48,410 But that's pretty qualitative. 931 00:54:48,410 --> 00:54:50,780 I think basically it's signaling that there's 932 00:54:50,780 --> 00:54:52,430 more than just mean-- 933 00:54:52,430 --> 00:54:54,230 mean deviations going on.
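To illustrate the earlier point about correlated inputs, here is a rough sketch of that decorrelation idea -- not the full principal component treatment in May and Spanos -- using synthetic inputs where x2 largely tracks x1:

```python
import numpy as np

# Hypothetical correlated inputs: x2 largely tracks x1, plus a little noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=50)
y = 2.0 * x1 + rng.normal(scale=0.5, size=50)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                  # center before rotating

# Principal components via SVD: rows of Vt are the new orthogonal axes.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                       # the "fake x" values -- uncorrelated

# Regress y on the scores instead of the raw, correlated inputs.
Z = np.column_stack([np.ones(len(y)), scores])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print("coefficients on PC1, PC2:", beta[1:])
```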
934 00:54:57,480 --> 00:55:01,110 It sounded like there was a microphone 935 00:55:01,110 --> 00:55:03,054 question in Singapore? 936 00:55:03,054 --> 00:55:05,030 AUDIENCE: Question on slide 50. 937 00:55:09,650 --> 00:55:14,450 You mentioned we should not only see the mean but 938 00:55:14,450 --> 00:55:17,900 also focus on the lack of fit and the pure error. 939 00:55:17,900 --> 00:55:20,630 So why do you say that if we only see the mean, 940 00:55:20,630 --> 00:55:22,460 we may say it's a good model? 941 00:55:22,460 --> 00:55:23,990 Can you explain that again? 942 00:55:23,990 --> 00:55:25,365 DUANE BONING: Yeah, actually what 943 00:55:25,365 --> 00:55:27,830 I was saying in this example is that if I only 944 00:55:27,830 --> 00:55:34,710 looked at the mean, I might be hesitant to include any model 945 00:55:34,710 --> 00:55:37,120 terms beyond the mean. 946 00:55:37,120 --> 00:55:42,100 So I might not actually think it's a good model at all. 947 00:55:42,100 --> 00:55:45,990 So that part of your question, I'm not sure I quite understood 948 00:55:45,990 --> 00:55:49,260 or quite agreed with. 949 00:55:49,260 --> 00:55:50,280 But I do-- 950 00:55:50,280 --> 00:55:53,400 I guess maybe I'm just repeating myself, 951 00:55:53,400 --> 00:55:57,270 I think it is really critical to look for lack of fit 952 00:55:57,270 --> 00:55:59,730 because you need both perspectives. 953 00:55:59,730 --> 00:56:05,790 You need to look not only at model coefficients and terms 954 00:56:05,790 --> 00:56:08,430 and whether they should be included in the model, 955 00:56:08,430 --> 00:56:12,870 but you also have to be alert to whether you are missing terms. 956 00:56:16,010 --> 00:56:19,040 That's what the lack of fit enables you to do. 957 00:56:19,040 --> 00:56:22,490 This first one is basically saying the terms that are there, 958 00:56:22,490 --> 00:56:23,630 are they significant? 959 00:56:26,630 --> 00:56:29,110 So in some sense, this one is basically just leading 960 00:56:29,110 --> 00:56:32,920 you to throw away coefficients and throw away model terms. 961 00:56:32,920 --> 00:56:36,400 And this number two, the lack of fit, is telling you, 962 00:56:36,400 --> 00:56:38,650 hey, wait a second, there's stuff going on in the data 963 00:56:38,650 --> 00:56:40,120 that you're not explaining that's 964 00:56:40,120 --> 00:56:43,840 different than random noise, so maybe you 965 00:56:43,840 --> 00:56:46,600 should add model terms. 966 00:56:46,600 --> 00:56:48,670 And so you need both perspectives. 967 00:56:52,380 --> 00:56:58,310 OK, so I think we're ready to move on and look a little bit 968 00:56:58,310 --> 00:57:00,260 at process optimization. 969 00:57:00,260 --> 00:57:03,710 I want to touch on the most natural use of these sorts 970 00:57:03,710 --> 00:57:08,130 of models, which is we define an experimental design, 971 00:57:08,130 --> 00:57:10,280 we go gather the data, we build a model, 972 00:57:10,280 --> 00:57:12,260 and then we start playing with the model. 973 00:57:12,260 --> 00:57:15,620 I think of that as offline use of the model, 974 00:57:15,620 --> 00:57:19,160 using it to try to identify an optimal point.
975 00:57:19,160 --> 00:57:22,880 But it's not purely offline because I 976 00:57:22,880 --> 00:57:26,540 want to make the point that if you're predicting an optimum, 977 00:57:26,540 --> 00:57:30,770 you probably want to go back and run some confirming experiments 978 00:57:30,770 --> 00:57:34,640 and use those back with your physical process 979 00:57:34,640 --> 00:57:37,940 to check your model and maybe even iterate and improve 980 00:57:37,940 --> 00:57:39,050 your model. 981 00:57:39,050 --> 00:57:41,160 So that's one natural approach. 982 00:57:41,160 --> 00:57:42,470 And the other is-- 983 00:57:42,470 --> 00:57:47,070 that should be online use. 984 00:57:47,070 --> 00:57:52,440 So another clever approach is to actually build simplified 985 00:57:52,440 --> 00:57:56,340 models in a little part of the space, use that to tell me 986 00:57:56,340 --> 00:58:00,300 what direction to move in exploring my overall process 987 00:58:00,300 --> 00:58:05,340 space, and then dynamically build and improve my model. 988 00:58:05,340 --> 00:58:09,480 This is for the case when my real goal is getting to an optimum-- 989 00:58:09,480 --> 00:58:12,840 not having the perfect model covering all of my space, 990 00:58:12,840 --> 00:58:15,090 but rather getting to an optimum point. 991 00:58:15,090 --> 00:58:17,640 So I want to touch on both of these ideas, ways 992 00:58:17,640 --> 00:58:20,640 of using these sorts of simplified response surface 993 00:58:20,640 --> 00:58:22,650 models. 994 00:58:22,650 --> 00:58:27,250 And part of the point here is one important use 995 00:58:27,250 --> 00:58:31,450 of these models really is trying to find an optimal process 996 00:58:31,450 --> 00:58:35,230 output, or find the inputs that give me an optimal process 997 00:58:35,230 --> 00:58:36,310 output. 998 00:58:36,310 --> 00:58:39,160 And that optimal process output may 999 00:58:39,160 --> 00:58:41,170 have multiple characteristics about it 1000 00:58:41,170 --> 00:58:44,360 that are important for us. 1001 00:58:44,360 --> 00:58:48,220 One is I want to be close to a target value. 1002 00:58:48,220 --> 00:58:53,530 But the other is we may also want small sensitivity, 1003 00:58:53,530 --> 00:58:56,180 small deviations in my output. 1004 00:58:56,180 --> 00:58:58,480 And if we go back to our variation equation, 1005 00:58:58,480 --> 00:59:02,770 that may mean I want small deviations around noise factors 1006 00:59:02,770 --> 00:59:05,710 that I'm not controlling. 1007 00:59:05,710 --> 00:59:11,670 And I may also want relatively small sensitivity 1008 00:59:11,670 --> 00:59:13,950 even to some of my input parameters 1009 00:59:13,950 --> 00:59:15,810 because I'm going to fix them in my process. 1010 00:59:15,810 --> 00:59:20,500 And I'm not dynamically or in a feedback loop changing them. 1011 00:59:20,500 --> 00:59:24,690 So in some cases, I want this to also be small. 1012 00:59:24,690 --> 00:59:27,000 So we'll talk a little bit about ways 1013 00:59:27,000 --> 00:59:30,750 to mix in these and other objectives. 1014 00:59:30,750 --> 00:59:33,060 For right now, I'm going to mostly focus 1015 00:59:33,060 --> 00:59:40,710 on, say, trying to meet some set of target mean values.
1016 00:59:40,710 --> 00:59:43,710 But I can make the point that you can generalize 1017 00:59:43,710 --> 00:59:47,730 what I'm going to be talking about here by thinking 1018 00:59:47,730 --> 00:59:51,150 of some objective function, or some cost function, 1019 00:59:51,150 --> 00:59:55,620 or some goodness function that actually mixes in together 1020 00:59:55,620 --> 00:59:57,630 multiple objectives. 1021 00:59:57,630 --> 01:00:00,240 So for some of the objectives, you might have a cost function 1022 01:00:00,240 --> 01:00:06,120 that penalizes for deviations from the target, 1023 01:00:06,120 --> 01:00:08,220 or maybe a sum of squared deviations 1024 01:00:08,220 --> 01:00:10,930 from the target if I have multiple outputs. 1025 01:00:10,930 --> 01:00:15,450 It may also penalize me for larger x's-- 1026 01:00:15,450 --> 01:00:18,810 larger inputs-- because there's more cost 1027 01:00:18,810 --> 01:00:24,000 associated with using more gas if I have a higher gas 1028 01:00:24,000 --> 01:00:25,860 flow in some process. 1029 01:00:25,860 --> 01:00:27,840 And then I can also include other things, 1030 01:00:27,840 --> 01:00:34,410 like terms that penalize for sensitivity, these delta y's, 1031 01:00:34,410 --> 01:00:36,450 sensitivity in the output. 1032 01:00:36,450 --> 01:00:40,890 And I can keep throwing additional things in. 1033 01:00:40,890 --> 01:00:46,770 So if I've got in general some complicated objective function, 1034 01:00:46,770 --> 01:00:51,120 if I can formulate that and actually model, 1035 01:00:51,120 --> 01:00:57,120 either empirically or analytically, that cost 1036 01:00:57,120 --> 01:01:00,300 function as a function of my input, 1037 01:01:00,300 --> 01:01:04,650 or as a function utilizing the models that I already have, 1038 01:01:04,650 --> 01:01:07,710 I can then formulate an optimization function 1039 01:01:07,710 --> 01:01:09,180 or an optimization problem where I 1040 01:01:09,180 --> 01:01:11,130 might be trying to minimize that cost 1041 01:01:11,130 --> 01:01:15,563 or minimize that objective. 1042 01:01:15,563 --> 01:01:16,980 Or maybe I'm trying to maximize it 1043 01:01:16,980 --> 01:01:20,070 because I think of it as really a goodness function rather 1044 01:01:20,070 --> 01:01:21,690 than a penalty function. 1045 01:01:21,690 --> 01:01:26,460 But overall, I've got some complicated form for J 1046 01:01:26,460 --> 01:01:29,310 as a function of my factors. 1047 01:01:29,310 --> 01:01:32,820 My factors might be my actual inputs, 1048 01:01:32,820 --> 01:01:38,010 but they may also be noise factors, other factors that I 1049 01:01:38,010 --> 01:01:41,840 haven't explicitly modeled. 1050 01:01:41,840 --> 01:01:48,150 And we'll talk about robustness next week, or not next week, 1051 01:01:48,150 --> 01:01:49,830 on Thursday. 1052 01:01:49,830 --> 01:01:51,360 But right now, I just want to talk 1053 01:01:51,360 --> 01:01:57,690 about adjusting or searching for good input factors 1054 01:01:57,690 --> 01:02:03,430 to minimize or maximize some cost function with constraints. 1055 01:02:03,430 --> 01:02:05,850 So in general, you can think about different approaches 1056 01:02:05,850 --> 01:02:06,690 for this. 1057 01:02:06,690 --> 01:02:10,020 If I've got a full expression for y 1058 01:02:10,020 --> 01:02:16,530 as some function of x, and maybe J is some function of y, 1059 01:02:16,530 --> 01:02:21,570 I've overall got some function for my cost 1060 01:02:21,570 --> 01:02:24,240 as a function of my inputs.
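A minimal sketch of such a composite objective, where the fitted response model, the target, and all the weights are made-up stand-ins:

```python
import numpy as np

TARGET = 100.0    # hypothetical target output

def y_model(x):
    """Stand-in fitted response surface: y = b0 + b1*x + b2*x^2."""
    b0, b1, b2 = 40.0, 4.0, -0.07   # made-up coefficients
    return b0 + b1 * x + b2 * x**2

def cost(x, w_target=1.0, w_input=0.05, w_sens=0.5, dx=0.1):
    """Penalize target misses, expensive inputs, and local sensitivity."""
    miss = (y_model(x) - TARGET)**2                             # off-target
    usage = x                                                   # e.g. gas flow
    sens = ((y_model(x + dx) - y_model(x - dx)) / (2 * dx))**2  # |dy/dx|^2
    return w_target * miss + w_input * usage + w_sens * sens

# Crude grid search over the allowable input range.
xs = np.linspace(0.0, 50.0, 501)
print("best x:", xs[np.argmin([cost(x) for x in xs])])
```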
1061 01:02:24,240 --> 01:02:27,690 Then I can go in and try to minimize-- 1062 01:02:27,690 --> 01:02:32,910 really, set dJ/dx to 0-- and, 1063 01:02:32,910 --> 01:02:35,130 with some assumptions of convexity, 1064 01:02:35,130 --> 01:02:38,430 I can find an overall minimum, or at least a local minimum 1065 01:02:38,430 --> 01:02:40,480 or maximum, of that function. 1066 01:02:40,480 --> 01:02:43,080 So that's if I've got a full expression. 1067 01:02:43,080 --> 01:02:46,230 And we'll explore that a little bit. 1068 01:02:46,230 --> 01:02:48,870 Another approach is more of an incremental approach. 1069 01:02:48,870 --> 01:02:50,970 Rather than having the full expression 1070 01:02:50,970 --> 01:02:54,420 and leaping right to the optimum point 1071 01:02:54,420 --> 01:02:58,720 based on a local minimum or local maximum, 1072 01:02:58,720 --> 01:03:00,540 I may have to search for it. 1073 01:03:00,540 --> 01:03:04,255 I may have to iteratively explore the space. 1074 01:03:04,255 --> 01:03:05,880 And we'll talk a little bit about these 1075 01:03:05,880 --> 01:03:10,200 with hill climbing or steepest ascent and descent kinds 1076 01:03:10,200 --> 01:03:10,855 of problems. 1077 01:03:10,855 --> 01:03:12,480 And I've already mentioned a little bit 1078 01:03:12,480 --> 01:03:15,360 of this online versus offline. 1079 01:03:15,360 --> 01:03:17,460 So here's the simplest picture for one 1080 01:03:17,460 --> 01:03:19,230 of these optimization problems. 1081 01:03:19,230 --> 01:03:22,950 I've got my input x, and I've got my output y. 1082 01:03:22,950 --> 01:03:31,470 And what I'm looking for is a maximum for my output y. 1083 01:03:31,470 --> 01:03:33,810 And maybe here my cost function 1084 01:03:33,810 --> 01:03:39,850 is simply J equal to y, something like that. 1085 01:03:39,850 --> 01:03:43,500 So I'm not differentiating here too much between y and J. 1086 01:03:43,500 --> 01:03:45,450 I'm just simply saying what I'm looking 1087 01:03:45,450 --> 01:03:50,520 for is the overall maximum for this output. 1088 01:03:50,520 --> 01:03:55,650 And one knows from basic calculus 1089 01:03:55,650 --> 01:03:59,400 that the maximum will occur-- 1090 01:03:59,400 --> 01:04:02,310 unless I hit some constraints or some boundary 1091 01:04:02,310 --> 01:04:05,670 cases --will occur when I've got zero 1092 01:04:05,670 --> 01:04:09,780 slope in that function. 1093 01:04:09,780 --> 01:04:11,370 So how do I find it? 1094 01:04:11,370 --> 01:04:15,970 Well, one approach is, again, this analytic approach. 1095 01:04:15,970 --> 01:04:18,760 If I have a full expression, I can simply 1096 01:04:18,760 --> 01:04:20,680 recognize that that maximum occurs 1097 01:04:20,680 --> 01:04:26,320 where there is zero slope, solve for the x such 1098 01:04:26,320 --> 01:04:32,440 that the slope is 0, and I directly get to the answer. 1099 01:04:32,440 --> 01:04:35,920 But in order to do that, I need a full analytic model. 1100 01:04:35,920 --> 01:04:40,330 To do that, I need perhaps relatively small 1101 01:04:40,330 --> 01:04:43,900 or accurate increments in x, or assumptions 1102 01:04:43,900 --> 01:04:45,730 on the model form. 1103 01:04:45,730 --> 01:04:50,695 And especially if I have relatively sparse data points, 1104 01:04:50,695 --> 01:04:54,190 if I had say just these data points, 1105 01:04:54,190 --> 01:04:58,510 it's quite easy to miss the true optimum because 1106 01:04:58,510 --> 01:05:06,370 of noise or imperfections in my model fit.
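The analytic step itself is one line of algebra: for a fitted quadratic y = b0 + b1 x + b2 x^2, setting dy/dx = b1 + 2 b2 x to 0 gives x* = -b1 / (2 b2), a maximum when b2 is negative. A small sketch with hypothetical coefficients and an explicit check against the allowable input range:

```python
# Coefficients from a (hypothetical) fitted quadratic y = b0 + b1*x + b2*x^2.
b0, b1, b2 = 40.0, 4.0, -0.07

# dy/dx = b1 + 2*b2*x = 0  =>  x* = -b1 / (2*b2); a maximum since b2 < 0.
x_star = -b1 / (2 * b2)
y_star = b0 + b1 * x_star + b2 * x_star**2
print(f"optimum at x = {x_star:.2f}, predicted y = {y_star:.2f}")

# Respect input constraints: clip to the allowable range before trusting it.
x_lo, x_hi = 0.0, 50.0
x_star = min(max(x_star, x_lo), x_hi)
```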
1107 01:05:06,370 --> 01:05:09,570 So it can actually be a little bit tricky with small amounts 1108 01:05:09,570 --> 01:05:14,430 of data to find that, if I fit an overall analytic model 1109 01:05:14,430 --> 01:05:16,650 to a very small number of data points. 1110 01:05:18,900 --> 01:05:23,940 An alternative is a little bit of an iterative or a search 1111 01:05:23,940 --> 01:05:31,020 process, where we might actually add data and explore-- 1112 01:05:31,020 --> 01:05:37,770 either explore with experiments or explore a model-- in a smaller 1113 01:05:37,770 --> 01:05:43,770 space in each case, and sort of seek out the optimum point. 1114 01:05:43,770 --> 01:05:46,530 And the simple conceptual idea 1115 01:05:46,530 --> 01:05:50,340 here is that in some regions of my space, 1116 01:05:50,340 --> 01:05:54,660 I may have very good model fits, 1117 01:05:54,660 --> 01:05:56,730 with much less error than trying 1118 01:05:56,730 --> 01:06:00,060 to fit this overall quadratic to a small number of data points. 1119 01:06:00,060 --> 01:06:02,310 I may have relatively good model fit 1120 01:06:02,310 --> 01:06:04,275 in smaller regions of the space. 1121 01:06:06,990 --> 01:06:09,110 Remember that confidence interval on the output? 1122 01:06:09,110 --> 01:06:13,190 I said as we get further and further away from, say, 1123 01:06:13,190 --> 01:06:18,440 the center of our data, my confidence interval 1124 01:06:18,440 --> 01:06:21,380 on my output prediction gets wider and wider. 1125 01:06:21,380 --> 01:06:24,860 If I shrink my space, I get better estimates 1126 01:06:24,860 --> 01:06:27,480 of my model in a local space. 1127 01:06:27,480 --> 01:06:29,190 And so one approach here is to say, 1128 01:06:29,190 --> 01:06:30,900 I'm going to look in a local space 1129 01:06:30,900 --> 01:06:34,490 and get a good estimate of what the slope is. 1130 01:06:34,490 --> 01:06:39,270 Maybe it's a reduced order model that's only linear. 1131 01:06:39,270 --> 01:06:42,760 So I'm not even trying to fit additional curvature. 1132 01:06:42,760 --> 01:06:46,410 And then use that to say my output y 1133 01:06:46,410 --> 01:06:51,000 is increasing in this direction with x increasing. 1134 01:06:51,000 --> 01:06:55,200 And use that to project forward a small amount 1135 01:06:55,200 --> 01:07:00,030 and suggest a new x value to try. 1136 01:07:00,030 --> 01:07:06,660 So I'm projecting forward and taking additional steps to explore. 1137 01:07:06,660 --> 01:07:09,870 If I then do that and build an additional linear model-- 1138 01:07:09,870 --> 01:07:16,220 whoa --build an additional linear model here, 1139 01:07:16,220 --> 01:07:18,530 it might suggest another small step. 1140 01:07:18,530 --> 01:07:23,540 And as my linear model starts to have a slope term that shrinks, 1141 01:07:23,540 --> 01:07:26,690 that's telling me I'm getting somewhere closer 1142 01:07:26,690 --> 01:07:31,370 to an optimum point, or at least a local optimum point. 1143 01:07:31,370 --> 01:07:35,720 And at that point, that's signaling me that if I really 1144 01:07:35,720 --> 01:07:39,260 want improved accuracy at that point in space, 1145 01:07:39,260 --> 01:07:43,770 to really zero in on the maximum, I can do two things. 1146 01:07:43,770 --> 01:07:47,760 One is to still constrain my search space. 1147 01:07:47,760 --> 01:07:52,550 But also in this region, it's quite likely that my-- 1148 01:07:55,580 --> 01:07:57,240 it's quite likely-- 1149 01:07:57,240 --> 01:07:59,360 I don't want this.
1150 01:07:59,360 --> 01:08:00,900 I don't know what that was. 1151 01:08:00,900 --> 01:08:05,860 Oh, wow, something funky happened. 1152 01:08:05,860 --> 01:08:10,540 In this space, it's just like with that curvature model 1153 01:08:10,540 --> 01:08:12,520 that I showed you earlier: the linear term 1154 01:08:12,520 --> 01:08:14,800 is probably no longer very significant. 1155 01:08:14,800 --> 01:08:17,470 I really need the quadratic term. 1156 01:08:17,470 --> 01:08:20,500 So I might fit locally a quadratic model just 1157 01:08:20,500 --> 01:08:24,069 near the optimum, which allows me in a restricted space 1158 01:08:24,069 --> 01:08:27,160 to get an accurate model that really lets me zero in 1159 01:08:27,160 --> 01:08:31,180 on the optimum point. 1160 01:08:31,180 --> 01:08:33,490 So out here, a linear model might be good 1161 01:08:33,490 --> 01:08:34,870 enough; up in here, 1162 01:08:34,870 --> 01:08:37,840 I may need a beta 0 plus a beta 2 x 1163 01:08:37,840 --> 01:08:42,790 squared term, maybe still also with a linear term 1164 01:08:42,790 --> 01:08:44,510 here as well. 1165 01:08:44,510 --> 01:08:46,510 But I can basically build the model dynamically, 1166 01:08:46,510 --> 01:08:50,189 getting an accurate model near the optimum point. 1167 01:08:53,189 --> 01:08:55,689 Now, I showed you this in 1D, with a single input, 1168 01:08:55,689 --> 01:08:58,439 but you can also do this with two inputs, 1169 01:08:58,439 --> 01:09:02,760 where I've got a 3D model if this is an x1, this is an x2, 1170 01:09:02,760 --> 01:09:04,890 and this is a y. 1171 01:09:04,890 --> 01:09:07,020 But you can essentially think the same thing. 1172 01:09:07,020 --> 01:09:13,770 If I start out here in this space, locally it's linear. 1173 01:09:13,770 --> 01:09:17,490 I can use that to suggest the next step 1174 01:09:17,490 --> 01:09:22,790 to take, using a simplified linear model in this region. 1175 01:09:22,790 --> 01:09:29,390 And then as I hill climb up, as I get close to the optimum, 1176 01:09:29,390 --> 01:09:33,729 then again, now near the optimum, 1177 01:09:33,729 --> 01:09:37,189 in my x1 and x2 I may need a quadratic model 1178 01:09:37,189 --> 01:09:38,990 in those two inputs. 1179 01:09:38,990 --> 01:09:42,560 But I can extend the same idea to hill climbing 1180 01:09:42,560 --> 01:09:45,800 not only in one input, but two inputs, three inputs, 1181 01:09:45,800 --> 01:09:51,220 multiple inputs, in order to get to an optimum point. 1182 01:09:51,220 --> 01:09:52,720 So essentially what we're doing here 1183 01:09:52,720 --> 01:09:55,120 is, again, linear gradient modeling-- 1184 01:09:55,120 --> 01:09:59,590 it is often useful to still include an interaction term. 1185 01:09:59,590 --> 01:10:02,330 But essentially we're doing exactly that same thing. 1186 01:10:02,330 --> 01:10:05,840 And if my model itself is linear, 1187 01:10:05,840 --> 01:10:07,145 an interesting thing happens. 1188 01:10:10,780 --> 01:10:12,400 Where is my overall optimum? 1189 01:10:12,400 --> 01:10:17,270 If I'm trying to maximize y, 1190 01:10:17,270 --> 01:10:19,900 where's my maximum y going to occur? 1191 01:10:19,900 --> 01:10:23,390 It will always occur on a boundary, when I hit a limit 1192 01:10:23,390 --> 01:10:27,080 of my input x's. 1193 01:10:27,080 --> 01:10:30,880 So an important thing that I haven't talked much about 1194 01:10:30,880 --> 01:10:34,560 is also the notion of additional constraints.
1195 01:10:34,560 --> 01:10:39,460 We may be driving to an interior point like in this model, 1196 01:10:39,460 --> 01:10:41,460 but it's also possible that we may 1197 01:10:41,460 --> 01:10:46,410 be driving to either a corner point or some other boundary 1198 01:10:46,410 --> 01:10:52,020 point because of a constraint on my allowable ranges 1199 01:10:52,020 --> 01:10:53,040 for my x inputs. 1200 01:10:56,040 --> 01:10:58,110 There is another piece of terminology 1201 01:10:58,110 --> 01:11:02,490 that's sometimes used for these kinds of searches, 1202 01:11:02,490 --> 01:11:04,800 either steepest ascent or steepest descent, 1203 01:11:04,800 --> 01:11:07,620 whether you're climbing or looking for a local minimum. 1204 01:11:07,620 --> 01:11:10,680 And the basic point is, when I've got that simplified 1205 01:11:10,680 --> 01:11:15,470 linear model, perhaps with the linear interaction term 1206 01:11:15,470 --> 01:11:19,340 as well, you can think about the local gradient with respect 1207 01:11:19,340 --> 01:11:24,650 to x1 or the local gradient with respect to x2. 1208 01:11:24,650 --> 01:11:29,060 And now when you make your step, what you often want to do 1209 01:11:29,060 --> 01:11:34,460 is make the step in the overall steepest ascent or descent direction, 1210 01:11:34,460 --> 01:11:39,270 changing both your x1 and x2 parameters at the same time. 1211 01:11:39,270 --> 01:11:44,830 So this is simply showing that when I move 1212 01:11:44,830 --> 01:11:49,420 and hill climb, I may change x1 and x2 proportionally, 1213 01:11:49,420 --> 01:11:53,320 depending on the relative slope in those two coefficients. 1214 01:11:53,320 --> 01:11:54,970 And it's relatively easy once I've 1215 01:11:54,970 --> 01:11:57,880 got that model to decide what direction is 1216 01:11:57,880 --> 01:12:00,880 the overall steepest one. 1217 01:12:00,880 --> 01:12:04,820 Another point here is that with quadratic terms, 1218 01:12:04,820 --> 01:12:09,230 you can have complicated functions where your minima may 1219 01:12:09,230 --> 01:12:13,250 occur in the interior of the space, or your maxima 1220 01:12:13,250 --> 01:12:15,020 in the interior of the space. 1221 01:12:15,020 --> 01:12:20,450 But you can also have hyperbolic or inverse polynomial 1222 01:12:20,450 --> 01:12:25,040 kinds of relationships where, again, you 1223 01:12:25,040 --> 01:12:28,970 may have local minima or maxima with respect to one variable 1224 01:12:28,970 --> 01:12:31,790 depending on what you're doing with the other variable. 1225 01:12:31,790 --> 01:12:34,820 Or you may also have places where you end up with a maximum 1226 01:12:34,820 --> 01:12:36,930 again at your constraint points. 1227 01:12:36,930 --> 01:12:42,660 So in your search, you've got to account for both. 1228 01:12:42,660 --> 01:12:46,700 So I can summarize what we've done here 1229 01:12:46,700 --> 01:12:50,030 with a combined procedure for design of experiments 1230 01:12:50,030 --> 01:12:53,330 and optimization in the iterative fashion; 1231 01:12:53,330 --> 01:13:02,450 at the end, I'll allude to an evolutionary or incremental 1232 01:13:02,450 --> 01:13:04,050 kind of version. 1233 01:13:04,050 --> 01:13:08,540 So this is a summary of the last two or three lectures boiled 1234 01:13:08,540 --> 01:13:13,100 down into a reminder-- a summary of the basic process 1235 01:13:13,100 --> 01:13:16,610 or procedure for doing DOE and optimization.
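Pulling those hill-climbing ideas together -- a tiny local factorial, a linear fit for the local slopes, a step proportional to those slopes, and a stop when the slopes shrink -- here is a rough sketch in which the process, the noise level, and the step sizes are all hypothetical:

```python
import numpy as np

def process(x1, x2, rng):
    """Hypothetical noisy response with an interior maximum (made up)."""
    return 50 + 4*x1 + 6*x2 - 0.2*x1**2 - 0.3*x2**2 + rng.normal(scale=0.2)

rng = np.random.default_rng(2)
x = np.array([2.0, 2.0])            # starting operating point
h, step = 0.5, 1.0                  # local design half-width, step length

for _ in range(30):
    # Tiny 2^2 factorial around the current point, then a local linear fit.
    d = np.array([[-h, -h], [-h, h], [h, -h], [h, h]])
    X = np.column_stack([np.ones(4), d])            # columns: 1, dx1, dx2
    y = np.array([process(*(x + di), rng) for di in d])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    grad = beta[1:]                                 # local slopes (b1, b2)
    if np.linalg.norm(grad) < 0.3:                  # slopes have shrunk
        break                                       # time for a local quadratic
    x = x + step * grad / np.linalg.norm(grad)      # steepest-ascent step

print("stopped near:", x)
```

The step moves x1 and x2 together, in proportion to their fitted slopes; near the stopping point, a local quadratic fit would zero in on the optimum.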
1236 01:13:16,610 --> 01:13:18,770 We said originally our goal here is 1237 01:13:18,770 --> 01:13:22,070 to build a model, to do a design of experiments. 1238 01:13:22,070 --> 01:13:23,870 I do want to emphasize that depends 1239 01:13:23,870 --> 01:13:26,840 on some knowledge of the process, a little bit 1240 01:13:26,840 --> 01:13:28,790 of knowledge either experience based 1241 01:13:28,790 --> 01:13:30,980 or in the physics of the process. 1242 01:13:30,980 --> 01:13:33,380 Because you need that in order to do things 1243 01:13:33,380 --> 01:13:39,280 like decide what the important inputs are likely to be. 1244 01:13:39,280 --> 01:13:43,030 Now there are things you can do with the DOE to confirm that 1245 01:13:43,030 --> 01:13:46,630 or to expand your knowledge, like factor screening 1246 01:13:46,630 --> 01:13:47,420 experiments. 1247 01:13:47,420 --> 01:13:49,600 We talked about fractional factorials 1248 01:13:49,600 --> 01:13:52,390 with large numbers of factors, where you're just 1249 01:13:52,390 --> 01:13:56,080 trying to decide whether there is a main effect associated 1250 01:13:56,080 --> 01:13:57,550 with that factor. 1251 01:13:57,550 --> 01:14:01,700 But up front, defining the inputs is very important. 1252 01:14:01,700 --> 01:14:05,890 We also need to define limits on the inputs. 1253 01:14:05,890 --> 01:14:08,800 What space do we want to explore and build 1254 01:14:08,800 --> 01:14:11,380 a model over in our design of experiments? 1255 01:14:13,960 --> 01:14:17,350 So overall, we're going to need to first 1256 01:14:17,350 --> 01:14:19,210 decide on a DOE. 1257 01:14:19,210 --> 01:14:21,040 We'd go and run our experiments. 1258 01:14:21,040 --> 01:14:23,950 And then we're going to construct our response surface 1259 01:14:23,950 --> 01:14:24,610 model. 1260 01:14:24,610 --> 01:14:26,980 And if we're using it for the optimization, 1261 01:14:26,980 --> 01:14:28,630 I also want to make the point that you 1262 01:14:28,630 --> 01:14:32,470 need to think early on about what your overall optimization 1263 01:14:32,470 --> 01:14:36,130 or penalty function is, because that may strongly 1264 01:14:36,130 --> 01:14:42,130 influence your DOE and maybe even your factor selection. 1265 01:14:42,130 --> 01:14:45,760 So for example, if you believe that you're really 1266 01:14:45,760 --> 01:14:53,650 going to need an optimization that folds in things like noise 1267 01:14:53,650 --> 01:14:57,730 in addition to just trying to get to a target, 1268 01:14:57,730 --> 01:15:02,500 that can have a profound effect on the DOE that you explore. 1269 01:15:02,500 --> 01:15:05,420 And we'll talk about that on Thursday, 1270 01:15:05,420 --> 01:15:07,930 where you might do additional small experiments 1271 01:15:07,930 --> 01:15:10,600 at each point in the DOE in order 1272 01:15:10,600 --> 01:15:15,040 to build a sensitivity model of that delta y 1273 01:15:15,040 --> 01:15:18,790 as a function of some additional noise factors. 1274 01:15:18,790 --> 01:15:22,210 So depending on what it is you're 1275 01:15:22,210 --> 01:15:26,350 trying to achieve with your model, that can of course-- 1276 01:15:26,350 --> 01:15:29,920 I guess it's obvious-- affect 1277 01:15:29,920 --> 01:15:32,920 the structure of your model and the design of experiments 1278 01:15:32,920 --> 01:15:34,780 that you want to do. 1279 01:15:34,780 --> 01:15:36,880 So we've already talked about a lot of this.
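For the "decide on a DOE" step, generating the design matrix itself is mechanical; a minimal sketch of a two-level full factorial with replicated center points, where the factor names and ranges are made up:

```python
from itertools import product

# Hypothetical factors and their low/high settings.
factors = {"temp": (150.0, 200.0), "pressure": (1.0, 2.0), "flow": (10.0, 30.0)}

# 2^k full factorial: every combination of low and high levels.
runs = [dict(zip(factors, combo))
        for combo in product(*[(lo, hi) for lo, hi in factors.values()])]

# Add replicated center points to support curvature and pure-error checks.
center = {name: (lo + hi) / 2 for name, (lo, hi) in factors.items()}
runs += [center] * 3

# In practice you would also randomize the run order to block against drift.
for i, run in enumerate(runs, 1):
    print(i, run)
```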
1280 01:15:36,880 --> 01:15:39,730 Again, in summary, your DOE includes 1281 01:15:39,730 --> 01:15:43,600 decisions about what likely terms you think 1282 01:15:43,600 --> 01:15:46,090 might be in there, based on your knowledge of the physics. 1283 01:15:46,090 --> 01:15:48,040 Is it going to be mostly linear? 1284 01:15:48,040 --> 01:15:50,680 Might there be quadratic terms? 1285 01:15:50,680 --> 01:15:53,110 That can influence again the selection 1286 01:15:53,110 --> 01:15:55,810 of the high, low, and center points. 1287 01:15:55,810 --> 01:15:58,690 Do you need center points? Do you need three levels 1288 01:15:58,690 --> 01:16:01,610 for all factors? And so on. 1289 01:16:01,610 --> 01:16:04,480 And you also need to think about things like the noise factors. 1290 01:16:04,480 --> 01:16:08,020 We talked about these nuisance factors, if you will, 1291 01:16:08,020 --> 01:16:09,650 or additional noise factors, 1292 01:16:09,650 --> 01:16:13,420 so that you might randomize or block against those. 1293 01:16:13,420 --> 01:16:16,120 If they're not going to be explicitly in the model, 1294 01:16:16,120 --> 01:16:19,000 you don't want them aliasing with or confounding 1295 01:16:19,000 --> 01:16:22,800 with the terms you actually have. 1296 01:16:22,800 --> 01:16:25,860 The response surface modeling is actually a pretty easy piece, 1297 01:16:25,860 --> 01:16:29,730 especially if you use things like the regression 1298 01:16:29,730 --> 01:16:31,950 and the ANOVA approach. 1299 01:16:31,950 --> 01:16:35,760 Again, you can use contrasts, if you've 1300 01:16:35,760 --> 01:16:38,010 got a highly structured design of experiments, 1301 01:16:38,010 --> 01:16:41,030 for very rapid estimation of those terms. 1302 01:16:41,030 --> 01:16:43,590 But overall, the emphasis here is 1303 01:16:43,590 --> 01:16:49,260 you're trying to determine if there's significant variation 1304 01:16:49,260 --> 01:16:52,530 in your data, are individual terms significant, 1305 01:16:52,530 --> 01:16:53,950 are you missing terms. 1306 01:16:53,950 --> 01:16:57,180 So that lack of fit is extremely important. 1307 01:16:57,180 --> 01:17:01,590 And there's often a very interesting interplay 1308 01:17:01,590 --> 01:17:05,010 with the regression modeling. 1309 01:17:05,010 --> 01:17:08,640 In fact, an approach we haven't talked about much, 1310 01:17:08,640 --> 01:17:11,280 but it's essentially inherent in what 1311 01:17:11,280 --> 01:17:16,680 we've been talking about here, is also referred to as 1312 01:17:16,680 --> 01:17:20,910 step-wise 1313 01:17:20,910 --> 01:17:22,680 regression. 1314 01:17:22,680 --> 01:17:25,920 And some of the interactive tools like JMP 1315 01:17:25,920 --> 01:17:28,440 actually explicitly support this, 1316 01:17:28,440 --> 01:17:32,500 where one factor at a time, you look and say, 1317 01:17:32,500 --> 01:17:34,590 I would like to add a term or drop 1318 01:17:34,590 --> 01:17:38,340 a term based on cutoff decision points on significance, 1319 01:17:38,340 --> 01:17:39,260 and so on. 1320 01:17:39,260 --> 01:17:43,500 So you can build up an appropriate regression model 1321 01:17:43,500 --> 01:17:48,270 by dropping or adding terms as needed.
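A hedged sketch of that add/drop idea -- here backward elimination on t-test p-values, which is only one of several step-wise variants -- using the same stand-in data as before:

```python
import numpy as np
from scipy import stats

def p_values(X, y):
    """Two-sided t-test p-values for each OLS coefficient."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / se), n - p)

# Hypothetical candidate terms for a one-input polynomial model.
x = np.array([10.0, 15.0, 20.0, 20.0, 25.0, 25.0, 25.0, 30.0])
y = np.array([73.0, 87.0, 94.0, 91.0, 95.0, 94.0, 97.0, 92.0])
terms = {"const": np.ones_like(x), "x": x, "x^2": x**2}

# Backward elimination: drop the least significant non-constant term
# until everything remaining clears the cutoff.
names, cutoff = list(terms), 0.10
while len(names) > 1:
    X = np.column_stack([terms[n] for n in names])
    pv = dict(zip(names, p_values(X, y)))
    worst = max((n for n in names if n != "const"), key=pv.get)
    if pv[worst] <= cutoff:
        break
    names.remove(worst)

print("kept terms:", names)
```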
1322 01:17:48,270 --> 01:17:52,380 And we talked about this at a fairly high level, 1323 01:17:52,380 --> 01:17:55,230 about the optimization procedure, and again, just 1324 01:17:55,230 --> 01:17:58,500 ideas of defining your penalty function 1325 01:17:58,500 --> 01:18:01,380 and then searching for your optimum 1326 01:18:01,380 --> 01:18:04,270 either piece-wise or analytically. 1327 01:18:04,270 --> 01:18:06,470 I'll come back to this in just a second. 1328 01:18:06,470 --> 01:18:08,530 But I do want to emphasize that once you've 1329 01:18:08,530 --> 01:18:18,940 come to some expected optimum point, 1330 01:18:18,940 --> 01:18:24,370 you really should check that and confirm it, 1331 01:18:24,370 --> 01:18:30,100 because you're building your estimate of your model 1332 01:18:30,100 --> 01:18:32,270 based on relatively limited data, 1333 01:18:32,270 --> 01:18:33,910 especially in the factorial models, 1334 01:18:33,910 --> 01:18:37,690 perhaps with only one interior point or center point, based 1335 01:18:37,690 --> 01:18:41,050 mostly on extremal data. 1336 01:18:41,050 --> 01:18:43,600 And especially if you've driven your optimum 1337 01:18:43,600 --> 01:18:49,940 to some interior point using, say, the analytic form 1338 01:18:49,940 --> 01:18:53,240 of the response surface model rather than iteratively 1339 01:18:53,240 --> 01:18:57,530 or incrementally, you're making a lot of big assumptions 1340 01:18:57,530 --> 01:19:01,430 about the shape of the model right near your optimum, 1341 01:19:01,430 --> 01:19:04,790 like it's convex right at that optimum point. 1342 01:19:04,790 --> 01:19:09,230 So you really ought to go in and do a confirming experiment 1343 01:19:09,230 --> 01:19:13,960 right at or right near your optimum 1344 01:19:13,960 --> 01:19:16,480 in order to really test the model 1345 01:19:16,480 --> 01:19:21,740 and consider model error right at that point. 1346 01:19:21,740 --> 01:19:25,220 And that might actually drive you to improving the model 1347 01:19:25,220 --> 01:19:31,080 or exploring slightly different space right near that optimum. 1348 01:19:31,080 --> 01:19:32,810 Now, the one last thing I just want 1349 01:19:32,810 --> 01:19:37,340 to allude to is an alternative approach 1350 01:19:37,340 --> 01:19:43,770 here, which is often starting with some data point in a small space 1351 01:19:43,770 --> 01:19:47,760 and building your model iteratively or adaptively. 1352 01:19:47,760 --> 01:19:50,230 And next week, at the end of next week, 1353 01:19:50,230 --> 01:19:52,680 we'll have a guest lecturer, Dan Frey, 1354 01:19:52,680 --> 01:19:56,520 who has actually studied one-factor-at-a-time 1355 01:19:56,520 --> 01:20:00,870 incremental exploration and model building 1356 01:20:00,870 --> 01:20:04,260 for the purpose of optimization a great deal. 1357 01:20:04,260 --> 01:20:07,920 So he's going to lead us through an alternative approach 1358 01:20:07,920 --> 01:20:11,280 to actually doing full factorial models-- 1359 01:20:11,280 --> 01:20:15,270 trying to find the optimum by not defining up front 1360 01:20:15,270 --> 01:20:19,150 the whole DOE and running the whole thing, 1361 01:20:19,150 --> 01:20:23,790 but rather just walking around your multifactor space 1362 01:20:23,790 --> 01:20:27,600 in order to try to find the optimum point.
1363 01:20:27,600 --> 01:20:33,990 And that has some relationship to another approach that 1364 01:20:33,990 --> 01:20:38,820 is also in May and Spanos, in section 8.5, which I've just 1365 01:20:38,820 --> 01:20:42,000 mentioned to you but don't expect that you actually 1366 01:20:42,000 --> 01:20:45,810 have to know a lot about, which is evolutionary optimization. 1367 01:20:45,810 --> 01:20:49,830 That would say, build a local model, use that again 1368 01:20:49,830 --> 01:20:52,740 in a hill-climbing fashion to suggest where you 1369 01:20:52,740 --> 01:20:56,170 want to go for your next point. 1370 01:20:56,170 --> 01:20:59,130 Maybe in fact you simply pick one of those corners. 1371 01:20:59,130 --> 01:21:01,800 And then you build a DOE model around that. 1372 01:21:01,800 --> 01:21:04,920 And it might suggest you move your process 1373 01:21:04,920 --> 01:21:08,370 to another corner, in which case you build another model, 1374 01:21:08,370 --> 01:21:13,710 and so on, so that you can walk or evolutionarily 1375 01:21:13,710 --> 01:21:17,490 arrive at an optimum point in your process, 1376 01:21:17,490 --> 01:21:21,200 building local models along the way. 1377 01:21:21,200 --> 01:21:24,650 OK, so next time, the one additional topic 1378 01:21:24,650 --> 01:21:28,340 I want to mention in this space of optimization and process 1379 01:21:28,340 --> 01:21:32,370 optimization and DOE is this notion of robustness. 1380 01:21:32,370 --> 01:21:34,640 I'll allude to actually building models 1381 01:21:34,640 --> 01:21:37,640 that include the variance in them 1382 01:21:37,640 --> 01:21:40,440 and not just the overall output. 1383 01:21:40,440 --> 01:21:45,410 So we'll come back to that on Thursday-- enjoy. 1384 01:21:45,410 --> 01:21:47,750 In the meantime, I think you've got the problem set that 1385 01:21:47,750 --> 01:21:48,833 is due on Thursday. 1386 01:21:48,833 --> 01:21:50,750 And it's going to let you explore a little bit 1387 01:21:50,750 --> 01:21:53,840 more some of these DOE and response surface model kinds 1388 01:21:53,840 --> 01:21:54,800 of things. 1389 01:21:54,800 --> 01:21:56,500 So we'll see you on Thursday.