WEBVTT
00:00:00.000 --> 00:00:02.430
The following content is
provided under a Creative
00:00:02.430 --> 00:00:03.730
Commons license.
00:00:03.730 --> 00:00:06.030
Your support will help
MIT OpenCourseWare
00:00:06.030 --> 00:00:10.060
continue to offer high quality
educational resources for free.
00:00:10.060 --> 00:00:12.690
To make a donation or to
view additional materials
00:00:12.690 --> 00:00:16.560
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.560 --> 00:00:17.904
at ocw.mit.edu.
00:00:21.790 --> 00:00:25.810
DUANE BONING: OK,
so last time we
00:00:25.810 --> 00:00:28.600
continued with our discussion
of design of experiments
00:00:28.600 --> 00:00:32.830
and especially looking at
fractional factorial designs,
00:00:32.830 --> 00:00:35.260
some of the aliasing
patterns that come up,
00:00:35.260 --> 00:00:39.310
and how that interplays
with model construction,
00:00:39.310 --> 00:00:42.640
in particular, what terms
of a model you can include,
00:00:42.640 --> 00:00:44.770
what you can't
include, as well as
00:00:44.770 --> 00:00:49.340
a few ideas on different
kinds of patterns,
00:00:49.340 --> 00:00:51.790
things like the central
composite pattern as well as
00:00:51.790 --> 00:00:54.070
fractional or full factorial.
00:00:54.070 --> 00:00:56.560
What I want to do today is
pick up a little bit more
00:00:56.560 --> 00:01:00.130
on response surface
modeling, or RSM.
00:01:00.130 --> 00:01:01.790
We've already touched
on some of these,
00:01:01.790 --> 00:01:03.748
but there's a couple of
things I've alluded to,
00:01:03.748 --> 00:01:06.310
but we haven't really
shown you, things
00:01:06.310 --> 00:01:10.120
like how one gets confidence
intervals on the estimates
00:01:10.120 --> 00:01:12.400
of coefficients in the model.
00:01:12.400 --> 00:01:15.190
Just like when we were
doing some estimation
00:01:15.190 --> 00:01:17.140
of statistical
distributions, we would
00:01:17.140 --> 00:01:20.980
say we want more than just
an estimate of the mean
00:01:20.980 --> 00:01:23.080
or an estimate of the
variance of a process.
00:01:23.080 --> 00:01:27.640
We would like to know the range
in which, we might say,
00:01:27.640 --> 00:01:31.330
with 90% confidence, we think
the true mean or true variance
00:01:31.330 --> 00:01:32.470
lies.
00:01:32.470 --> 00:01:36.040
Similarly, when we're fitting
models and model coefficients,
00:01:36.040 --> 00:01:38.230
we'd like some
notion of what range
00:01:38.230 --> 00:01:42.160
we think the true model
coefficients likely lie in,
00:01:42.160 --> 00:01:44.060
based on the data that we have.
00:01:44.060 --> 00:01:45.880
So I want to go over
that a little bit.
00:01:45.880 --> 00:01:49.180
And then we'll start talking
about using these models
00:01:49.180 --> 00:01:53.950
for process optimization,
so combining
00:01:53.950 --> 00:01:56.980
a little bit of the
response surface methodology
00:01:56.980 --> 00:02:01.240
with design of experiments
both in sequential fashion
00:02:01.240 --> 00:02:04.660
and in iterative fashion,
where one might adapt
00:02:04.660 --> 00:02:09.460
the model on the fly or based
on the additional experiments
00:02:09.460 --> 00:02:13.360
in order to drive the process
or try to seek out and find
00:02:13.360 --> 00:02:15.290
an optimum in the process.
00:02:15.290 --> 00:02:16.570
So that's the plan.
00:02:16.570 --> 00:02:21.430
I've noted here a
reading assignment.
00:02:21.430 --> 00:02:23.440
You can read all of chapter 8.
00:02:23.440 --> 00:02:26.890
It's actually interesting,
but what I'm mostly focused on
00:02:26.890 --> 00:02:30.070
are the first three
sections in May and Spanos,
00:02:30.070 --> 00:02:32.240
which talks about
process modeling.
00:02:32.240 --> 00:02:34.240
So it's covering a
lot of the material
00:02:34.240 --> 00:02:38.980
here on response surface models,
model fitting, a little bit
00:02:38.980 --> 00:02:40.990
of regression, and
then also using
00:02:40.990 --> 00:02:42.870
these things for optimization.
00:02:42.870 --> 00:02:45.340
So a couple of chapters
that have a little bit more
00:02:45.340 --> 00:02:49.040
advanced material on
principal component analysis,
00:02:49.040 --> 00:02:52.850
which we may come back
to a little bit later.
00:02:52.850 --> 00:02:54.160
OK, so that's the plan.
00:02:58.380 --> 00:03:02.210
Here's a list of some of the
fundamentals of regression.
00:03:02.210 --> 00:03:04.685
When we were talking
about fractional factorial
00:03:04.685 --> 00:03:11.180
and factorial design, especially
those formed out of contrasts,
00:03:11.180 --> 00:03:15.260
that simplified method
using differences
00:03:15.260 --> 00:03:18.660
in different
collections of the data,
00:03:18.660 --> 00:03:20.930
we found that those
were very useful,
00:03:20.930 --> 00:03:25.190
quick ways to be able to
estimate model effects,
00:03:25.190 --> 00:03:29.600
to fill those into ANOVA tables
and decide if those effects are
00:03:29.600 --> 00:03:32.210
significant, and then
also the relationship
00:03:32.210 --> 00:03:36.680
of those for model
coefficient estimation.
00:03:36.680 --> 00:03:40.350
I want to talk a
little bit about
00:03:40.350 --> 00:03:42.690
the alternative
perspective, which
00:03:42.690 --> 00:03:47.130
is regression as a way for
fitting those coefficients.
00:03:47.130 --> 00:03:49.700
And we've already
done some of that.
00:03:49.700 --> 00:03:51.690
What I'm going to
illustrate here
00:03:51.690 --> 00:03:55.860
is our basic assumption
and what falls out
00:03:55.860 --> 00:04:01.050
of using minimization of
least square or squared error
00:04:01.050 --> 00:04:05.130
estimates in order to
fit the coefficients
00:04:05.130 --> 00:04:07.260
or estimate the
coefficients in a model.
00:04:07.260 --> 00:04:12.570
And I want to talk a little
bit more about estimation.
00:04:12.570 --> 00:04:14.160
We've already
touched on estimation
00:04:14.160 --> 00:04:15.990
using the normal equations.
00:04:15.990 --> 00:04:19.500
But especially I want to talk
about the variance again,
00:04:19.500 --> 00:04:22.710
in these coefficients, things
like the confidence intervals
00:04:22.710 --> 00:04:26.920
for fitting of coefficients.
00:04:26.920 --> 00:04:31.930
I'm going to do this here mostly
in the context of a simplified
00:04:31.930 --> 00:04:34.810
perspective, a one
parameter model.
00:04:34.810 --> 00:04:37.240
I just have one
input and one output.
00:04:37.240 --> 00:04:39.220
And we'll do the simplest model.
00:04:39.220 --> 00:04:41.650
We'll build it up to
a simple linear model,
00:04:41.650 --> 00:04:44.410
but all of these ideas
also carry through
00:04:44.410 --> 00:04:49.360
for polynomial regression
when I've got multiple inputs.
00:04:49.360 --> 00:04:50.860
But I think it's a
little bit easier
00:04:50.860 --> 00:04:54.850
to see and discuss in the
context of a simplified--
00:04:54.850 --> 00:04:56.660
a simplified model.
00:04:56.660 --> 00:04:59.650
And we also talked
last time a bit--
00:04:59.650 --> 00:05:01.780
the last couple of
times about lack of fit.
00:05:01.780 --> 00:05:04.480
And I have a little
example that carries us
00:05:04.480 --> 00:05:08.440
through the development of a
model looking for lack of fit
00:05:08.440 --> 00:05:12.460
or seeing lack of fit
and extending the model.
00:05:12.460 --> 00:05:16.640
So it's got a small
example embedded in here.
00:05:16.640 --> 00:05:18.190
In fact, that
small example might
00:05:18.190 --> 00:05:24.790
look familiar to those of
you who saw or took 2.853.
00:05:24.790 --> 00:05:28.120
It's actually the
same model that I
00:05:28.120 --> 00:05:32.350
described in a very condensed
lecture there on regression.
00:05:36.100 --> 00:05:39.640
It's also important,
I think, for us to get
00:05:39.640 --> 00:05:41.320
a little bit of terminology.
00:05:41.320 --> 00:05:44.590
You've probably
run into measures
00:05:44.590 --> 00:05:50.470
of model goodness, an overall
summary measure of R-squared
00:05:50.470 --> 00:05:53.320
that is an attempt to
capture how good the model is
00:05:53.320 --> 00:05:57.980
in describing what's
going on with your data.
00:05:57.980 --> 00:06:01.240
So once one has done the ANOVA
analysis, it's actually quite
00:06:01.240 --> 00:06:06.850
easy to calculate both the
goodness of fit R-squared
00:06:06.850 --> 00:06:11.230
and the adjusted R-squared as
shown here because they both
00:06:11.230 --> 00:06:12.850
depend on--
00:06:12.850 --> 00:06:17.350
both of these R-squared
measures and the ANOVA look
00:06:17.350 --> 00:06:22.270
at the amount of
variation in your data
00:06:22.270 --> 00:06:25.060
and the amount of variation
expressed in your model
00:06:25.060 --> 00:06:30.070
and use those to summarize
how good the model is.
00:06:30.070 --> 00:06:32.530
So the first measure,
this R-squared,
00:06:32.530 --> 00:06:35.320
is basically just
looking and saying
00:06:35.320 --> 00:06:43.170
if I were to simply model
my output as the mean, how
00:06:43.170 --> 00:06:46.830
much better does a model that
has more than the mean in it
00:06:46.830 --> 00:06:50.170
do in explaining the data?
00:06:50.170 --> 00:06:52.200
So essentially
what we do is look
00:06:52.200 --> 00:06:57.260
at the sum of squared
deviations around the mean.
00:06:57.260 --> 00:07:02.040
So this is the total sum of squared
deviations around the mean.
00:07:02.040 --> 00:07:05.700
And then we say OK,
how much sum of squared
00:07:05.700 --> 00:07:17.860
deviations based on the
model, so the amount explained
00:07:17.860 --> 00:07:21.820
in the model, compared
to the total deviations
00:07:21.820 --> 00:07:22.510
around the mean?
00:07:22.510 --> 00:07:28.780
What fraction of those
is captured in the model?
00:07:28.780 --> 00:07:32.050
So in other words,
if there's really
00:07:32.050 --> 00:07:36.940
nothing going on except a
flat dependency, that is,
00:07:36.940 --> 00:07:38.575
there is no slope with x.
00:07:38.575 --> 00:07:42.060
As I vary x, nothing
changes in the model,
00:07:42.060 --> 00:07:44.890
then this simplified
notion of R-squared
00:07:44.890 --> 00:07:49.570
is basically saying there
is no dependency on x.
00:07:49.570 --> 00:07:52.630
And therefore, I'm
going to explain
00:07:52.630 --> 00:07:54.850
with the model
essentially nothing.
00:07:54.850 --> 00:07:57.850
Now it's funny because
we are ignoring the fact
00:07:57.850 --> 00:08:01.420
that you might also be
fitting the mean value.
00:08:01.420 --> 00:08:04.270
But the notion captured in
the R-squared is dependence
00:08:04.270 --> 00:08:08.650
on the input, dependence on x.
00:08:08.650 --> 00:08:12.040
Now the big gotcha with
this simple measure
00:08:12.040 --> 00:08:16.450
is I can always add
model coefficients
00:08:16.450 --> 00:08:19.900
and fit more of my
data, or at least I
00:08:19.900 --> 00:08:25.460
can do that ignoring
replication in the model.
00:08:25.460 --> 00:08:30.980
For example, we saw with a,
say, a two input, one output
00:08:30.980 --> 00:08:35.450
model and a full factorial, if
I just have those four corner
00:08:35.450 --> 00:08:39.169
points, I can fit up
to a second order model
00:08:39.169 --> 00:08:40.590
with the interaction terms.
00:08:40.590 --> 00:08:42.230
If I have four data
points, I could
00:08:42.230 --> 00:08:45.350
fit the mean, first order,
first order, and interaction
00:08:45.350 --> 00:08:47.660
with exactly four coefficients.
00:08:47.660 --> 00:08:50.510
And in that case, what
would the R-squared
00:08:50.510 --> 00:08:55.920
be if I fit my data with
all four coefficients?
00:08:55.920 --> 00:08:56.760
1.
00:08:56.760 --> 00:08:59.170
I would fit the data perfectly.
00:08:59.170 --> 00:09:00.710
Again, this is
without replication.
00:09:00.710 --> 00:09:02.790
And I would fit
the data perfectly.
00:09:05.300 --> 00:09:07.940
And therefore I'd have
an R-squared of 1.
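To make that saturated fit concrete, here is a minimal sketch in Python (numpy assumed; the four response values are hypothetical, my illustration rather than the lecture's):

    # Four corner points of a 2x2 full factorial, no replicates.
    import numpy as np

    x1 = np.array([-1, -1, 1, 1])
    x2 = np.array([-1, 1, -1, 1])
    y = np.array([3.0, 5.0, 4.0, 8.0])   # hypothetical responses

    # Four coefficients (mean, x1, x2, x1*x2) for four points:
    # the model is saturated and the fit is exact.
    X = np.column_stack([np.ones(4), x1, x2, x1 * x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)

    resid = y - X @ b
    r2 = 1 - resid @ resid / np.sum((y - y.mean())**2)
    print(r2)   # essentially 1: all degrees of freedom were used by the fit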
00:09:07.940 --> 00:09:11.670
Now is that really
a perfect model?
00:09:11.670 --> 00:09:15.550
Well, kind of, but
what you've done
00:09:15.550 --> 00:09:18.190
is you've used all of
the degrees of freedom
00:09:18.190 --> 00:09:22.060
in the data to actually fit
or use them to fit the model.
00:09:22.060 --> 00:09:28.030
We also don't have any notion of
replication, which isn't really
00:09:28.030 --> 00:09:29.350
completely captured.
00:09:29.350 --> 00:09:33.400
So one way of
penalizing ourselves
00:09:33.400 --> 00:09:37.180
for the use of these
additional model terms
00:09:37.180 --> 00:09:42.340
is to essentially have a
different perspective referred
00:09:42.340 --> 00:09:46.450
to as the adjusted
R-squared, which essentially
00:09:46.450 --> 00:09:48.580
looks at the residual data.
00:09:48.580 --> 00:09:51.790
Rather than the deviations
captured by the model,
00:09:51.790 --> 00:09:55.210
it's looking at OK,
what deviations are not
00:09:55.210 --> 00:09:56.650
captured by the model?
00:09:56.650 --> 00:09:59.530
What residual data
would I have, which
00:09:59.530 --> 00:10:03.010
also has a side effect of
essentially penalizing us
00:10:03.010 --> 00:10:06.100
for the use of additional
model coefficients
00:10:06.100 --> 00:10:09.910
because we use up
degrees of freedom
00:10:09.910 --> 00:10:15.550
in the model when we
add model coefficients.
00:10:15.550 --> 00:10:18.310
So very often people talk
about the adjusted R-squared
00:10:18.310 --> 00:10:21.640
as this fair comparison
between models, especially
00:10:21.640 --> 00:10:25.270
between models where I may have
a simplified model with fewer
00:10:25.270 --> 00:10:27.820
coefficients and a more
complicated model with more
00:10:27.820 --> 00:10:28.990
coefficients.
00:10:28.990 --> 00:10:32.050
And essentially what
we do is form it
00:10:32.050 --> 00:10:37.480
as the ratio of the mean
square error of the residuals
00:10:37.480 --> 00:10:42.010
over the total mean square
variance, if you will,
00:10:42.010 --> 00:10:45.250
captured by deviations
around the mean
00:10:45.250 --> 00:10:47.710
and then subtract
that all from 1.
00:10:47.710 --> 00:10:50.770
And the way I like to think
about it is essentially,
00:10:50.770 --> 00:10:52.690
I start with the perfect model.
00:10:52.690 --> 00:10:57.550
And then any
residual error, which
00:10:57.550 --> 00:11:03.250
could include both replication
error and lack of fit error,
00:11:03.250 --> 00:11:08.320
whatever percentage error that
I don't capture in the data--
00:11:08.320 --> 00:11:10.750
the sum of squared
deviations divided
00:11:10.750 --> 00:11:15.700
by the degrees of freedom,
that's my mean square error
00:11:15.700 --> 00:11:19.180
estimate or my estimate
of the true total variance
00:11:19.180 --> 00:11:22.380
around the mean.
00:11:22.380 --> 00:11:26.430
Whatever fraction of that
that is in the residual,
00:11:26.430 --> 00:11:29.310
that's what I'm not modeling.
00:11:29.310 --> 00:11:32.940
So essentially what
we're doing is simply
00:11:32.940 --> 00:11:35.520
looking at what's not
expressed in the model.
00:11:35.520 --> 00:11:39.250
And the model can never
capture pure replication error,
00:11:39.250 --> 00:11:41.970
so it's got that variance, but
it might also have lack of fit
00:11:41.970 --> 00:11:42.470
in it.
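As a minimal sketch of the two measures just described (Python with numpy; this is my illustration, not something from the lecture):

    import numpy as np

    def r2_measures(y, y_hat, p):
        """y: data, y_hat: model predictions, p: number of fitted coefficients."""
        n = len(y)
        ss_total = np.sum((y - np.mean(y))**2)   # deviations around the mean
        ss_resid = np.sum((y - y_hat)**2)        # deviations not captured
        r2 = 1 - ss_resid / ss_total
        # Adjusted R-squared: each sum of squares divided by its degrees
        # of freedom, so extra coefficients are penalized.
        r2_adj = 1 - (ss_resid / (n - p)) / (ss_total / (n - 1))
        return r2, r2_adj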
00:11:45.550 --> 00:11:49.090
So most statistical
packages will report out
00:11:49.090 --> 00:11:50.350
both of these numbers.
00:11:50.350 --> 00:11:52.070
You can also calculate them.
00:11:52.070 --> 00:11:57.910
But generally I like
the R-squared
00:11:57.910 --> 00:12:01.660
adjusted as a better measure.
00:12:01.660 --> 00:12:04.360
In part because it feels
to me a little bit more
00:12:04.360 --> 00:12:08.230
conceptual and comprehensive,
in terms of telling me
00:12:08.230 --> 00:12:10.300
what's not captured
in the model,
00:12:10.300 --> 00:12:15.670
how much pure variation
going on in the data
00:12:15.670 --> 00:12:17.360
is not in the model.
00:12:17.360 --> 00:12:21.370
However, you have to be really
careful interpreting what
00:12:21.370 --> 00:12:24.220
that R-squared is telling you.
00:12:24.220 --> 00:12:28.870
It's not necessarily telling you
that your model is good or bad.
00:12:28.870 --> 00:12:35.160
You might have a perfect model
given inherent noise factors
00:12:35.160 --> 00:12:37.060
in the model.
00:12:37.060 --> 00:12:39.990
So for example, if
underlying everything,
00:12:39.990 --> 00:12:42.640
I've got a true
systematic dependency,
00:12:42.640 --> 00:12:45.840
but I also have pure
replication variance,
00:12:45.840 --> 00:12:49.170
that's going to limit how good
your R-squared can possibly
00:12:49.170 --> 00:12:53.610
be even if your model were
perfect in terms of capturing
00:12:53.610 --> 00:12:57.430
the systematic dependency.
00:12:57.430 --> 00:13:02.270
I think there was a question
lurking there in Singapore.
00:13:02.270 --> 00:13:07.190
AUDIENCE: Yes, so for
R-squared and adjusted R-squared,
00:13:07.190 --> 00:13:14.220
the closer those values are to
1, the better the model is?
00:13:14.220 --> 00:13:16.084
DUANE BONING: Yes, definitely.
00:13:16.084 --> 00:13:17.000
AUDIENCE: OK.
00:13:17.000 --> 00:13:19.442
DUANE BONING:
Definitely 1 is better.
00:13:19.442 --> 00:13:20.900
But you have to be
a little careful
00:13:20.900 --> 00:13:23.300
in interpreting because even--
00:13:23.300 --> 00:13:26.390
AUDIENCE: What you just--
00:13:26.390 --> 00:13:28.160
no but what, Professor,
you just said
00:13:28.160 --> 00:13:33.140
is that the R-squared includes both
the error of the model and also
00:13:33.140 --> 00:13:34.525
the error of the noise.
00:13:34.525 --> 00:13:35.900
So you can't really
differentiate
00:13:35.900 --> 00:13:36.870
between these two.
00:13:36.870 --> 00:13:39.860
DUANE BONING: That's
right, that's right.
00:13:39.860 --> 00:13:42.560
And that's where a lack of
fit analysis-- and we'll
00:13:42.560 --> 00:13:44.570
go in and do one of
those as well --is also
00:13:44.570 --> 00:13:47.180
still important for being
able to try to differentiate
00:13:47.180 --> 00:13:51.620
between those two sources of
imperfection in the model.
00:13:51.620 --> 00:13:53.390
Yeah?
00:13:53.390 --> 00:13:58.910
AUDIENCE: Also you mentioned
the second R-squared also being
00:13:58.910 --> 00:14:00.927
[INAUDIBLE].
00:14:00.927 --> 00:14:01.760
DUANE BONING: Right.
00:14:01.760 --> 00:14:03.890
AUDIENCE: Your
main concern is fit
00:14:03.890 --> 00:14:07.220
and having more
coefficients is cheap,
00:14:07.220 --> 00:14:12.280
would you prefer R-squared
or adjusted R-squared?
00:14:12.280 --> 00:14:14.230
DUANE BONING: So the
question is what would I
00:14:14.230 --> 00:14:17.830
prefer if the number of-- if
fitting additional coefficients
00:14:17.830 --> 00:14:18.532
is cheap.
00:14:18.532 --> 00:14:19.990
AUDIENCE: And fit
is more important
00:14:19.990 --> 00:14:23.380
DUANE BONING: And fit
is more important.
00:14:23.380 --> 00:14:29.160
I think I would still
essentially think
00:14:29.160 --> 00:14:33.870
of adjusted R-squared as somewhat of a
more representative description
00:14:33.870 --> 00:14:41.850
of the trade off between adding
coefficients, improving my fit.
00:14:41.850 --> 00:14:46.750
But also my adjusted R-squared
doesn't get as much better.
00:14:46.750 --> 00:14:49.690
And in fact, if I
start overfitting,
00:14:49.690 --> 00:14:53.690
it will tend to degrade
slightly my adjusted R-squared.
00:14:53.690 --> 00:14:55.930
However, what I think
a better mechanism
00:14:55.930 --> 00:14:59.770
for actually making the decision
about whether to include
00:14:59.770 --> 00:15:03.280
coefficients or not is
an analysis of variance
00:15:03.280 --> 00:15:06.250
and looking at the
significance of those model
00:15:06.250 --> 00:15:11.370
coefficients, both the
significance and the magnitude.
00:15:11.370 --> 00:15:15.860
So I would tend to do more the
regression analysis together
00:15:15.860 --> 00:15:17.000
with the ANOVA.
00:15:17.000 --> 00:15:20.060
And the R-squared is a
nice aggregate measure,
00:15:20.060 --> 00:15:24.280
but it's not the thing that
drives my decision-making so
00:15:24.280 --> 00:15:27.308
much, so I hope that helps.
00:15:27.308 --> 00:15:29.350
So we'll see some examples
of some R-squared that
00:15:29.350 --> 00:15:32.680
come out of some analysis.
00:15:32.680 --> 00:15:38.170
Now we said that regression,
at least as it is
00:15:38.170 --> 00:15:44.060
most commonly used, is driven
by minimization of a least
00:15:44.060 --> 00:15:46.780
squares, or squared
error, measure.
00:15:46.780 --> 00:15:48.820
And this is just
trying to illustrate
00:15:48.820 --> 00:15:51.790
what I'm talking
about here with where
00:15:51.790 --> 00:15:57.730
the residuals, the differences
between my model and my data,
00:15:57.730 --> 00:16:00.917
may come from in
the simple 1D case.
00:16:00.917 --> 00:16:02.500
We've already talked
a bit about this,
00:16:02.500 --> 00:16:05.180
but I'm using a very,
very simple model here,
00:16:05.180 --> 00:16:06.555
which has only one term.
00:16:06.555 --> 00:16:10.540
It doesn't even have
a constant offset.
00:16:10.540 --> 00:16:15.100
It's simply got a linear, direct
linear dependence of the output
00:16:15.100 --> 00:16:16.210
on the input.
00:16:16.210 --> 00:16:19.120
And I'm saying
that the true model
00:16:19.120 --> 00:16:23.600
does have some noise in it,
which is normally distributed.
00:16:23.600 --> 00:16:26.470
And I'm fitting that
or estimating that
00:16:26.470 --> 00:16:29.500
with some coefficient,
a little b.
00:16:29.500 --> 00:16:35.370
And so this is my fit
through my data minimizing
00:16:35.370 --> 00:16:37.380
the squared
deviations, or I'd like
00:16:37.380 --> 00:16:39.510
to minimize the
squared deviations.
00:16:39.510 --> 00:16:42.360
And again, we're saying that any
differences between the model
00:16:42.360 --> 00:16:48.360
prediction, essentially the
y hat sub i minus the y sub i
00:16:48.360 --> 00:16:51.180
for that data point,
that's a residual.
00:16:51.180 --> 00:16:54.090
That's an error.
00:16:54.090 --> 00:16:58.320
And it can come from two factors
again, either lack of fit
00:16:58.320 --> 00:17:04.770
in the model or because of the
underlying noise in the data.
00:17:04.770 --> 00:17:08.230
Now last time, or
maybe even 2 times ago,
00:17:08.230 --> 00:17:11.760
we talked about the use
of regression numerically,
00:17:11.760 --> 00:17:16.589
if you will or
algebraically, to estimate
00:17:16.589 --> 00:17:20.460
this beta with the best
b based on a minimization
00:17:20.460 --> 00:17:24.250
of the sum of squared errors.
00:17:24.250 --> 00:17:27.839
So we take each one of
those residuals, square it,
00:17:27.839 --> 00:17:30.460
and then we sum that
over all of our data.
00:17:30.460 --> 00:17:31.950
And it turns out
what we're trying
00:17:31.950 --> 00:17:35.550
to do is find the
beta hat, the b that
00:17:35.550 --> 00:17:38.100
estimates the beta
hat that minimizes
00:17:38.100 --> 00:17:40.380
that sum of squared deviations.
00:17:40.380 --> 00:17:43.390
And what's nice
with linear models
00:17:43.390 --> 00:17:48.400
is there's an algebraic
way to find actually what b
00:17:48.400 --> 00:17:50.870
does that minimization for us.
00:17:50.870 --> 00:17:53.830
But I also want
to just remind you
00:17:53.830 --> 00:17:58.300
that lurking inside
of that minimization
00:17:58.300 --> 00:18:06.880
is an estimate of the total
sum of squared residuals, SSR,
00:18:06.880 --> 00:18:12.280
what's lurking back there in
that R and R-squared adjusted.
00:18:12.280 --> 00:18:14.440
And then if I divide
that out again
00:18:14.440 --> 00:18:20.560
by the degrees of
freedom, nu sub R,
00:18:20.560 --> 00:18:26.370
then I've got also my estimate
of variance in the underlying
00:18:26.370 --> 00:18:28.240
model assuming no lack of fit.
00:18:32.320 --> 00:18:34.600
So we said with least
squares estimation,
00:18:34.600 --> 00:18:39.140
I can form the set
of linear equations.
00:18:39.140 --> 00:18:42.940
And assuming that the residuals
are normal, or orthogonal
00:18:42.940 --> 00:18:50.680
to the input, then the sum
of the product of our residual
00:18:50.680 --> 00:18:53.200
and the input should sum to 0.
00:18:53.200 --> 00:18:55.480
And when you carry through
the algebra for that,
00:18:55.480 --> 00:19:01.180
out plops the formula for
the slope coefficient given
00:19:01.180 --> 00:19:04.750
our data, simply the sum
of the product of my x sub
00:19:04.750 --> 00:19:10.090
i times y sub i over the sum of
my x sub i squared, it's funky.
00:19:10.090 --> 00:19:13.150
And as I said,
here's our estimate
00:19:13.150 --> 00:19:16.030
of the underlying variance.
00:19:16.030 --> 00:19:20.140
That's our best estimate,
unbiased, best estimate
00:19:20.140 --> 00:19:21.820
of the process variance.
00:19:21.820 --> 00:19:24.920
And in this case, we're only
fitting one model coefficient.
00:19:24.920 --> 00:19:28.270
So I've got my total
number set of data
00:19:28.270 --> 00:19:31.780
and then I've just
got n minus 1,
00:19:31.780 --> 00:19:35.750
since I've only got
one model coefficient.
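A sketch of that one-parameter estimate in code (numpy assumed; the data is hypothetical):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([0.6, 0.9, 1.7, 1.9, 2.6])

    b = np.sum(x * y) / np.sum(x**2)       # slope minimizing squared error
    resid = y - b * x
    s2 = np.sum(resid**2) / (len(x) - 1)   # unbiased variance estimate:
                                           # n - 1 dof, one coefficient fitted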
00:19:35.750 --> 00:19:37.580
Now the interesting
thing that I've
00:19:37.580 --> 00:19:40.790
alluded to in a previous
lecture but haven't
00:19:40.790 --> 00:19:46.640
shown you is I want more than
just the best estimate of b.
00:19:46.640 --> 00:19:50.050
I'd like to have a
confidence interval on b.
00:19:50.050 --> 00:19:55.060
Given the spread in the data
and an underlying normal noise
00:19:55.060 --> 00:19:58.750
model or noise assumption,
what do I think the range,
00:19:58.750 --> 00:20:04.300
say a 95% confidence interval,
might be on my estimation of b?
00:20:04.300 --> 00:20:08.860
And we can do that very simply
by taking the formula for b
00:20:08.860 --> 00:20:16.990
and simply doing our variance of
b calculation on that formula.
00:20:16.990 --> 00:20:18.970
It's just variance math.
00:20:18.970 --> 00:20:21.640
And that's what's broken
out here in the y.
00:20:21.640 --> 00:20:28.480
If I expand out the b
summation into a sum
00:20:28.480 --> 00:20:33.040
of those individual terms, I can
then apply my normal variance
00:20:33.040 --> 00:20:35.270
math here.
00:20:35.270 --> 00:20:39.580
And what I've got for the
variance of that sum--
00:20:39.580 --> 00:20:41.500
just thinking of each
of these elements
00:20:41.500 --> 00:20:46.750
has some constant, then the
variance of that sum of terms
00:20:46.750 --> 00:20:49.780
is the variance of
the constant squared--
00:20:49.780 --> 00:20:53.080
or excuse me, the value of
the constant squared times
00:20:53.080 --> 00:20:55.910
the variance of each of
those underlying variables.
00:20:55.910 --> 00:20:58.840
And when you go and do
that, what you've got
00:20:58.840 --> 00:21:04.570
is another formula down
here for the variance
00:21:04.570 --> 00:21:13.110
in that coefficient b based
on the data that you've got.
00:21:13.110 --> 00:21:16.810
So once I've got
that up here, I've
00:21:16.810 --> 00:21:19.270
got my estimate
for the variance.
00:21:19.270 --> 00:21:23.110
Now we've got an estimate of
what 1 standard deviation would
00:21:23.110 --> 00:21:25.390
be in the coefficient estimate.
00:21:25.390 --> 00:21:29.350
And then you can express that
based on whatever confidence
00:21:29.350 --> 00:21:30.220
interval you want.
00:21:30.220 --> 00:21:34.000
So I might write that
typically as b plus or minus
00:21:34.000 --> 00:21:39.070
1 standard error, 1
standard deviation in b.
00:21:39.070 --> 00:21:40.900
1 standard deviation--
00:21:40.900 --> 00:21:44.700
I can't remember, what does that
correspond to, typically?
00:21:44.700 --> 00:21:49.600
Got about a 90%
confidence interval?
00:21:49.600 --> 00:21:52.090
Plus or minus 1
standard deviation?
00:21:52.090 --> 00:21:56.030
67%, thank you.
00:21:56.030 --> 00:21:59.030
The one I always remember
is two standard errors.
00:21:59.030 --> 00:22:01.770
That's about 95% confidence.
00:22:01.770 --> 00:22:06.680
So if you wanted a 95%
confidence interval, now
00:22:06.680 --> 00:22:08.720
you know how to formulate that.
00:22:08.720 --> 00:22:11.480
It might be 1.96
or whatever it is.
00:22:15.690 --> 00:22:18.930
So there you have
nicely falling out
00:22:18.930 --> 00:22:22.860
of the basic mathematical
formulation for minimizing
00:22:22.860 --> 00:22:25.770
the sum of squares
both the best estimate
00:22:25.770 --> 00:22:28.920
for your slope and a confidence
interval to the slope.
00:22:31.830 --> 00:22:37.530
By the way, if you base that
on a relatively small number
00:22:37.530 --> 00:22:39.390
of data points,
you should probably
00:22:39.390 --> 00:22:45.070
use a t distribution rather
than a normal distribution.
00:22:45.070 --> 00:22:51.390
So it might change my 1.96
for a 95% confidence interval,
00:22:51.390 --> 00:22:55.020
as we're used to.
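Continuing that sketch (scipy assumed for the t quantile; same hypothetical data): the variance of b is sigma squared over the sum of x sub i squared, with s squared standing in for sigma squared:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([0.6, 0.9, 1.7, 1.9, 2.6])

    b = np.sum(x * y) / np.sum(x**2)
    s2 = np.sum((y - b * x)**2) / (len(x) - 1)

    se_b = np.sqrt(s2 / np.sum(x**2))       # standard error of b
    t = stats.t.ppf(0.975, df=len(x) - 1)   # ~2.78 here; -> 1.96 as n grows
    print(b - t * se_b, b + t * se_b)       # 95% confidence interval on b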
00:22:55.020 --> 00:22:57.990
So this also lets us
now go back and do--
00:22:57.990 --> 00:23:01.590
think again about
another perspective
00:23:01.590 --> 00:23:03.232
on analysis of variance.
00:23:03.232 --> 00:23:05.190
In fact, you guys played
with this a little bit
00:23:05.190 --> 00:23:10.920
or saw this in a slightly
different form on the quiz.
00:23:10.920 --> 00:23:17.250
There's two ways of thinking
about the significance of
00:23:17.250 --> 00:23:22.740
whether some slope coefficient
or model coefficient should
00:23:22.740 --> 00:23:24.600
be included in the model.
00:23:24.600 --> 00:23:27.150
The basic hypothesis
is are we saying
00:23:27.150 --> 00:23:30.510
do I have enough evidence to
suggest that that slope term is
00:23:30.510 --> 00:23:32.490
non-zero?
00:23:32.490 --> 00:23:36.990
If it might be 0 to some
degree of confidence,
00:23:36.990 --> 00:23:40.260
then I shouldn't include it.
00:23:40.260 --> 00:23:42.180
So one way of doing
it is the ANOVA
00:23:42.180 --> 00:23:45.765
with the ratio of
variances in the F test.
00:23:45.765 --> 00:23:49.440
The other way is basically
looking at the confidence
00:23:49.440 --> 00:23:54.220
interval for beta, say the
95% confidence interval,
00:23:54.220 --> 00:23:58.580
and if that intersects 0,
that says that more than 5%
00:23:58.580 --> 00:24:02.090
of the time based on just
random variation in the data,
00:24:02.090 --> 00:24:05.490
I might have a 0
coefficient there,
00:24:05.490 --> 00:24:08.750
in which case I cannot say that
it is significantly different
00:24:08.750 --> 00:24:10.820
than 0.
00:24:10.820 --> 00:24:12.800
So you can make
that determination
00:24:12.800 --> 00:24:15.410
about whether you should include
the model coefficient based
00:24:15.410 --> 00:24:19.870
on your confidence interval for
each individual term as well.
00:24:22.610 --> 00:24:24.800
So that's just alluding
back to what we already know
00:24:24.800 --> 00:24:28.220
but just trying to make
sure you see the connection
00:24:28.220 --> 00:24:31.100
or alternative ways of looking
at it either in the ANOVA
00:24:31.100 --> 00:24:34.880
table, or if you want to look
at individual coefficient
00:24:34.880 --> 00:24:37.760
terms, the confidence
intervals on those individual
00:24:37.760 --> 00:24:39.620
coefficients.
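A small sketch of that equivalence (scipy assumed; the estimate, standard error, and degrees of freedom are made up): the coefficient is significant exactly when its confidence interval excludes 0, and for a single coefficient the ANOVA F is the square of the t ratio.

    from scipy import stats

    b, se_b, df = 0.5, 0.05, 8         # hypothetical estimate, std error, dof

    t_ratio = b / se_b                 # t test on the individual coefficient
    F = t_ratio**2                     # the same test as the one-term ANOVA F

    t_crit = stats.t.ppf(0.975, df)
    ci = (b - t_crit * se_b, b + t_crit * se_b)
    print(not (ci[0] <= 0 <= ci[1]))   # CI excludes 0 <=> reject beta = 0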
00:24:39.620 --> 00:24:42.930
OK, let's do an example.
00:24:42.930 --> 00:24:45.620
Here's a very
simple set of data.
00:24:45.620 --> 00:24:50.870
We've got some input, some
x value, call that "age".
00:24:50.870 --> 00:24:52.850
And some y values.
00:24:52.850 --> 00:24:56.350
Call that "income".
00:24:56.350 --> 00:24:59.180
And if I just plot the data,
let me get the data up here--
00:25:03.460 --> 00:25:06.645
actually, what I've
done here is used JMP.
00:25:06.645 --> 00:25:08.770
I don't know how many of
you have played with JMP,
00:25:08.770 --> 00:25:11.500
but I love JMP because
it's nice and interactive.
00:25:11.500 --> 00:25:14.860
It does a lot of
regression analysis,
00:25:14.860 --> 00:25:18.350
lets me explore the data
fairly interactively,
00:25:18.350 --> 00:25:21.550
I like it a lot better
than Excel for doing
00:25:21.550 --> 00:25:22.840
some of these analyses.
00:25:22.840 --> 00:25:25.360
I think in an
earlier problem set,
00:25:25.360 --> 00:25:27.700
we did give you a pointer
to where you could run that
00:25:27.700 --> 00:25:32.240
on Athena and so on.
00:25:32.240 --> 00:25:34.750
And what this is doing is
basically looking and doing
00:25:34.750 --> 00:25:39.820
my analysis of variance for
a very simple linear model
00:25:39.820 --> 00:25:42.830
without a constant term.
00:25:42.830 --> 00:25:45.490
So I've just got one
model coefficient, looks
00:25:45.490 --> 00:25:48.550
at the sum of squares,
the mean square,
00:25:48.550 --> 00:25:51.550
looks at the residual with
the remaining data point
00:25:51.550 --> 00:25:53.170
and forms an F.
00:25:53.170 --> 00:25:54.970
That F ratio is huge.
00:25:54.970 --> 00:25:59.800
It's 1,000, and the probability
of observing that large of an F
00:25:59.800 --> 00:26:01.510
is minuscule.
00:26:01.510 --> 00:26:06.850
So I have great confidence
that in fact there is a slope.
00:26:06.850 --> 00:26:10.660
And if I look down here
at my income leverage
00:26:10.660 --> 00:26:14.620
residual versus
the age parameter,
00:26:14.620 --> 00:26:21.510
I can see this is basically
just y sub i versus x sub i.
00:26:21.510 --> 00:26:25.240
I see a definite trend there.
00:26:25.240 --> 00:26:30.040
Now what this nice plot
has done is the solid line
00:26:30.040 --> 00:26:33.520
is my best fit, but it
has also plotted for us,
00:26:33.520 --> 00:26:39.810
with the dashed line, the
confidence interval on the output.
00:26:39.810 --> 00:26:46.150
I think it's a 95% confidence
interval on the output as well.
00:26:46.150 --> 00:26:50.040
Now I told you how to get an
estimate on the confidence
00:26:50.040 --> 00:26:52.050
interval for our b term.
00:26:52.050 --> 00:26:54.660
How do we get a confidence
interval on the output term?
00:26:57.570 --> 00:26:59.430
Well, what we're
going to need to do
00:26:59.430 --> 00:27:03.420
is also do the variance
calculations on our y formula
00:27:03.420 --> 00:27:06.870
and see how
uncertainty in our data
00:27:06.870 --> 00:27:12.570
also propagates through to
uncertainty in our output.
00:27:12.570 --> 00:27:14.130
But before we do
that, we can also
00:27:14.130 --> 00:27:17.130
see here in the
JUMP output things
00:27:17.130 --> 00:27:23.940
like the parameter estimates
for our age dependence.
00:27:23.940 --> 00:27:26.910
So here's our best guess
for the age dependence
00:27:26.910 --> 00:27:31.140
is a simple 0.5 estimate.
00:27:31.140 --> 00:27:34.410
And it is also showing us
things like the standard error
00:27:34.410 --> 00:27:38.010
in these typical ANOVA tables,
which we've ignored in the past
00:27:38.010 --> 00:27:39.450
if you've been looking at these.
00:27:39.450 --> 00:27:41.730
But those can also be
used directly,
00:27:41.730 --> 00:27:44.670
as we talked about, to give
me a confidence interval,
00:27:44.670 --> 00:27:50.010
depending on what level of
error, whatever level of alpha,
00:27:50.010 --> 00:27:52.860
I want to be able to
estimate those things.
00:27:52.860 --> 00:27:56.190
And it's also looking
at an individual t ratio
00:27:56.190 --> 00:27:57.930
for each of the coefficients.
00:27:57.930 --> 00:28:00.420
I've only got one here,
but it's basically
00:28:00.420 --> 00:28:04.620
doing a one by one assessment
of each of my model coefficients
00:28:04.620 --> 00:28:08.530
to see if it's significant.
00:28:08.530 --> 00:28:11.830
And in fact, it's
significant, since it
00:28:11.830 --> 00:28:15.460
ends up being exactly the same
probability, not really shown
00:28:15.460 --> 00:28:17.540
here.
00:28:17.540 --> 00:28:19.750
Essentially, the t
test and the F test
00:28:19.750 --> 00:28:25.125
are identical in
this simple example.
00:28:25.125 --> 00:28:32.550
AUDIENCE: [INAUDIBLE] think
of some subset of data,
00:28:32.550 --> 00:28:35.830
wouldn't it make sense to have
[INAUDIBLE] part of the data
00:28:35.830 --> 00:28:41.010
then use some for testing like
the model and seeing if it
00:28:41.010 --> 00:28:45.120
actually has a prediction
because if you use that entire
00:28:45.120 --> 00:28:48.890
data set then essentially--
00:28:48.890 --> 00:28:50.640
DUANE BONING: That's
an interesting point.
00:28:50.640 --> 00:28:53.630
So what you're saying
is how about the idea
00:28:53.630 --> 00:28:56.180
if you have a fair amount
of data of holding out
00:28:56.180 --> 00:28:59.640
some of the data, fitting
some portion of it,
00:28:59.640 --> 00:29:05.110
and using the held back data to
sort of test the model.
00:29:05.110 --> 00:29:16.260
And I think, especially when
you do nonlinear models--
00:29:16.260 --> 00:29:17.790
and I don't mean
just polynomial,
00:29:17.790 --> 00:29:20.070
but I mean some other
nonlinear dependence
00:29:20.070 --> 00:29:24.360
--that cross validation is
extremely common and very
00:29:24.360 --> 00:29:25.740
useful.
00:29:25.740 --> 00:29:28.870
Here, you could do that.
00:29:28.870 --> 00:29:31.440
And essentially what
I think that's doing
00:29:31.440 --> 00:29:39.240
is allowing you to do a lack
of fit versus noise estimate.
00:29:39.240 --> 00:29:41.260
In other words,
what you're doing,
00:29:41.260 --> 00:29:43.530
I think conceptually,
there is saying here's
00:29:43.530 --> 00:29:45.900
what my model would
have predicted.
00:29:45.900 --> 00:29:47.370
Here's my data point.
00:29:47.370 --> 00:29:52.380
There's a residual that I'm
going to attribute maybe--
00:29:52.380 --> 00:29:55.830
again, it's to a mix of
random noise underlying
00:29:55.830 --> 00:30:00.060
but also model lack of fidelity.
00:30:04.110 --> 00:30:09.090
I think it's more common to go
ahead and use all of your data
00:30:09.090 --> 00:30:12.780
because then you've got
your aggregate measures
00:30:12.780 --> 00:30:16.740
and can run all of your tests
with the highest resolution
00:30:16.740 --> 00:30:18.600
possible.
00:30:18.600 --> 00:30:22.440
But I suspect there's
actually a relationship that's
00:30:22.440 --> 00:30:25.235
very close in there.
00:30:25.235 --> 00:30:27.360
I think it's a little better
to use all of the data
00:30:27.360 --> 00:30:30.330
because the more data you
have, the better your estimates
00:30:30.330 --> 00:30:32.610
of underlying
process variance are
00:30:32.610 --> 00:30:37.260
so you can better differentiate
lack of fit from noise.
00:30:37.260 --> 00:30:40.090
But I haven't thought
about that very much,
00:30:40.090 --> 00:30:42.900
especially for the
simple linear cases.
00:30:42.900 --> 00:30:45.090
It's an interesting approach.
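A rough sketch of the hold-out idea raised in that exchange (numpy assumed; the data and split are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 20)
    y = 0.5 * x + rng.normal(0, 0.3, size=x.size)    # true slope plus noise

    idx = rng.permutation(x.size)
    train, test = idx[:15], idx[15:]                 # hold back 5 points

    b = np.sum(x[train] * y[train]) / np.sum(x[train]**2)
    test_mse = np.mean((y[test] - b * x[test])**2)   # out-of-sample error;
                                                     # compare to training residuals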
00:30:51.490 --> 00:30:55.060
So I want to come back
to this lack of fit
00:30:55.060 --> 00:30:58.660
versus the pure error
because we talked
00:30:58.660 --> 00:31:02.590
about often being able to do
multiple runs at the same x
00:31:02.590 --> 00:31:03.670
values.
00:31:03.670 --> 00:31:06.580
In this data here
that I've shown you,
00:31:06.580 --> 00:31:10.930
we actually have a
difficulty in distinguishing
00:31:10.930 --> 00:31:15.460
between model lack of fit
and underlying variance.
00:31:15.460 --> 00:31:18.130
I had to basically
make an assumption
00:31:18.130 --> 00:31:21.670
that my underlying
model was truly linear.
00:31:21.670 --> 00:31:24.940
And then I'm basically
assuming, if I
00:31:24.940 --> 00:31:27.155
go back even further here--
00:31:27.155 --> 00:31:28.030
where did my data go?
00:31:30.730 --> 00:31:36.310
--I'm basically assuming a y
sub i is equal to a
00:31:36.310 --> 00:31:41.140
beta x sub i plus
epsilon sub i model.
00:31:41.140 --> 00:31:50.080
Why not-- I have really nothing
except ideas of parsimony,
00:31:50.080 --> 00:31:56.170
simple models in general
and perhaps prior knowledge
00:31:56.170 --> 00:31:59.320
of the physics of the
process to really say this
00:31:59.320 --> 00:32:01.120
is the form of the model.
00:32:01.120 --> 00:32:09.280
If you look at my data, why
couldn't my model be that?
00:32:09.280 --> 00:32:10.930
It may well be.
00:32:10.930 --> 00:32:13.330
It might have a very
complicated structure.
00:32:13.330 --> 00:32:15.980
That might be true.
00:32:15.980 --> 00:32:17.620
The problem is I don't have--
00:32:17.620 --> 00:32:24.080
in this random data, I
don't have any replicates
00:32:24.080 --> 00:32:26.780
to be able to give me
an independent notion
00:32:26.780 --> 00:32:34.310
of underlying repeated
variance noise from model form.
00:32:34.310 --> 00:32:35.930
And so that goes
back to what we said
00:32:35.930 --> 00:32:40.190
is if we have multiple runs at
the same x values, especially
00:32:40.190 --> 00:32:44.360
if we design an experiment
so that we do that,
00:32:44.360 --> 00:32:47.630
and we aren't using this
sort of happenstance data,
00:32:47.630 --> 00:32:50.840
then we can decompose
the total residual error
00:32:50.840 --> 00:32:55.460
into that lack of fit and
pure replicate error and start
00:32:55.460 --> 00:33:00.530
to be able to distinguish
between model structure
00:33:00.530 --> 00:33:05.600
and pure replication error.
00:33:05.600 --> 00:33:07.540
And so we talked
previously about being
00:33:07.540 --> 00:33:11.950
able to form the F
test, the ratio of variance
00:33:11.950 --> 00:33:16.840
explained by deviations
from model prediction
00:33:16.840 --> 00:33:21.190
in the replicate
data over total error
00:33:21.190 --> 00:33:25.360
and then seeing how likely it
would be to observe that ratio
00:33:25.360 --> 00:33:30.228
and use the F test in
the ANOVA test for that.
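Here is a sketch of that decomposition (scipy assumed; the replicated data is hypothetical): the residual sum of squares splits into pure replicate error plus lack of fit, and their mean squares form the F ratio.

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 4.0])   # replicates at 1 and 3
    y = np.array([1.1, 1.4, 3.8, 8.6, 9.1, 8.9, 15.9])  # roughly quadratic data

    b = np.sum(x * y) / np.sum(x**2)                # straight-line fit y = b x
    ss_resid = np.sum((y - b * x)**2)

    # Pure error: replicate deviations around their own level means.
    levels = np.unique(x)
    ss_pe = sum(np.sum((y[x == v] - np.mean(y[x == v]))**2) for v in levels)
    ss_lof = ss_resid - ss_pe

    df_pe = len(x) - len(levels)                    # n - m
    df_lof = len(levels) - 1                        # m - p, one coefficient here
    F = (ss_lof / df_lof) / (ss_pe / df_pe)
    p_value = 1 - stats.f.cdf(F, df_lof, df_pe)     # small p flags lack of fit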
00:33:30.228 --> 00:33:32.520
And we'll come back to that
a little bit in an example.
00:33:36.510 --> 00:33:37.590
This is a quick one.
00:33:37.590 --> 00:33:41.790
I showed you an example here
where the previous example
00:33:41.790 --> 00:33:46.110
was a pure linear term without
even a constant offset.
00:33:46.110 --> 00:33:50.820
We can also do models
that have both a slope
00:33:50.820 --> 00:33:53.040
term and a constant term.
00:33:53.040 --> 00:33:57.540
And this is simply formulated
here as a mean-centered model.
00:33:57.540 --> 00:34:04.170
If I were to take my data in
and say when x was at its mean,
00:34:04.170 --> 00:34:05.610
this term would be 0.
00:34:05.610 --> 00:34:08.290
So this is not
really an intercept.
00:34:08.290 --> 00:34:13.320
This is saying my a coefficient
is the value when x is at its mean.
00:34:13.320 --> 00:34:17.100
I could similarly formulate
it so that the coefficient
00:34:17.100 --> 00:34:21.199
would be when x was 0.
00:34:21.199 --> 00:34:23.780
The point being that
the same approach
00:34:23.780 --> 00:34:29.360
for estimating both a linear
term and a constant offset term
00:34:29.360 --> 00:34:30.500
can apply.
00:34:30.500 --> 00:34:39.500
And the same notion of not
only getting estimates but also
00:34:39.500 --> 00:34:44.719
getting confidence
intervals based on variances
00:34:44.719 --> 00:34:48.920
in those coefficients applies.
00:34:48.920 --> 00:34:53.060
So we can also use this to
get confidence intervals,
00:34:53.060 --> 00:34:57.120
not only on the slope term but
also on the variance term--
00:34:57.120 --> 00:34:58.820
I mean the offset term.
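A sketch of that mean-centered two-parameter fit (numpy assumed, hypothetical data): y = a + b (x - x bar), where a is the prediction at the mean of x rather than a true intercept.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.1, 1.8, 3.1, 3.6, 5.2])

    xc = x - np.mean(x)
    a = np.mean(y)                         # coefficient at x = x bar
    b = np.sum(xc * y) / np.sum(xc**2)     # slope around the mean

    resid = y - (a + b * xc)
    s2 = np.sum(resid**2) / (len(x) - 2)   # two coefficients fitted now
    se_a = np.sqrt(s2 / len(x))            # standard errors give confidence
    se_b = np.sqrt(s2 / np.sum(xc**2))     # intervals for both terms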
00:35:03.680 --> 00:35:10.370
Now we can also, what's nice is,
do the same math now and look
00:35:10.370 --> 00:35:14.570
at a variance in our
prediction of the output.
00:35:14.570 --> 00:35:17.000
I already alluded to that with
these confidence intervals
00:35:17.000 --> 00:35:23.430
on that plot of y versus
x in that one set of data.
00:35:23.430 --> 00:35:26.960
And if I basically am saying,
OK, this is my best estimate--
00:35:26.960 --> 00:35:28.250
this was my--
00:35:28.250 --> 00:35:31.280
this is equal to the a
coefficient --this is my best
00:35:31.280 --> 00:35:36.050
estimate of the
underlying linear model
00:35:36.050 --> 00:35:40.080
with an offset term, and I just
do my variance math on this,
00:35:40.080 --> 00:35:43.320
I've got a variance of
some of these terms.
00:35:43.320 --> 00:35:45.650
And if you carry
through that math,
00:35:45.650 --> 00:35:49.250
this is just a constant
at each x sub i.
00:35:49.250 --> 00:35:52.560
Since x bar is a constant,
x sub i is a constant.
00:35:52.560 --> 00:35:54.140
So in the variance
math, when I look
00:35:54.140 --> 00:35:56.270
at the variance
of this term, it's
00:35:56.270 --> 00:35:59.405
the variance of this times
the variance of this.
00:35:59.405 --> 00:36:01.430
This is a constant
term, so I've got
00:36:01.430 --> 00:36:05.750
that constant squared out in
front of the variance of my b.
00:36:05.750 --> 00:36:07.550
We already calculated
what the variance
00:36:07.550 --> 00:36:10.520
of the a term and the variance
of the b term were.
00:36:10.520 --> 00:36:14.000
I can plug those in and
get an overall estimate
00:36:14.000 --> 00:36:20.870
of the variance of each of
my y sub i terms in my model.
00:36:20.870 --> 00:36:22.640
And based on--
once I've got that
00:36:22.640 --> 00:36:26.420
for the single standard error,
my single standard deviation,
00:36:26.420 --> 00:36:30.470
I can use the t or the normal
to get a confidence interval
00:36:30.470 --> 00:36:31.310
on the output.
00:36:35.110 --> 00:36:37.470
So it's the same thing we
did on the coefficients.
00:36:37.470 --> 00:36:41.940
I can also do it to tell me what
kind of spread, what confidence
00:36:41.940 --> 00:36:46.700
do I have in where the true
output should lie when I'm
00:36:46.700 --> 00:36:49.580
predicting for
any x value, where
00:36:49.580 --> 00:36:52.850
I think the actual true
output y would lie.
00:36:52.850 --> 00:36:55.220
Now there's an
interesting aspect
00:36:55.220 --> 00:37:03.870
to this, which is if I look
at any given x sub i input,
00:37:03.870 --> 00:37:10.240
any particular x input value,
notice OK, that's right here.
00:37:10.240 --> 00:37:13.480
I plug-in for my
particular i of interest.
00:37:13.480 --> 00:37:17.360
Notice that the denominator here
was a sum over all of my data.
00:37:17.360 --> 00:37:19.160
So that ends up being
just a constant.
00:37:19.160 --> 00:37:20.950
It doesn't change.
00:37:20.950 --> 00:37:24.160
But depending on what
x I'm looking at,
00:37:24.160 --> 00:37:29.015
where I am on the x, the
size of this changes.
00:37:32.010 --> 00:37:36.030
So for example, if
I look at my mean,
00:37:36.030 --> 00:37:39.630
if I look where my x
sub i is equal to x bar,
00:37:39.630 --> 00:37:42.840
that numerator term goes to 0.
00:37:42.840 --> 00:37:45.690
And essentially what
I've got in that case
00:37:45.690 --> 00:37:49.260
is at the mean of my
data, my estimation
00:37:49.260 --> 00:37:53.490
is basically-- my variance
in my output estimate
00:37:53.490 --> 00:37:58.200
is basically just related to
the random noise in the data.
00:37:58.200 --> 00:38:02.610
But then as I get further
and further from the mean,
00:38:02.610 --> 00:38:05.355
my confidence interval
in my output spreads.
00:38:07.990 --> 00:38:11.110
So what you will
often see on data--
00:38:11.110 --> 00:38:16.360
this was x data and
this is my y --is
00:38:16.360 --> 00:38:20.170
near the center of
your data, you've
00:38:20.170 --> 00:38:22.660
got the narrowest
confidence intervals.
00:38:22.660 --> 00:38:24.980
And as I get further
and further away,
00:38:24.980 --> 00:38:30.610
if I were to use the dash for
a 95% confidence on the output,
00:38:30.610 --> 00:38:35.500
the further away that I
get in x from my x bar,
00:38:35.500 --> 00:38:38.575
the wider my prediction
error becomes.
00:38:41.590 --> 00:38:43.840
Even though I'm still
may be interpolating over
00:38:43.840 --> 00:38:48.720
the data I've got, my
variance does spread
00:38:48.720 --> 00:38:57.240
as I get further and further
away, just an interesting fact.
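Carrying the same variance math through the mean-centered prediction gives Var(y hat) = s^2 (1/n + (x - x bar)^2 / sum of (x_i - x bar)^2), which is exactly the widening just described. A sketch (scipy assumed, same hypothetical data as before):

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.1, 1.8, 3.1, 3.6, 5.2])
    xc = x - np.mean(x)
    a, b = np.mean(y), np.sum(xc * y) / np.sum(xc**2)
    s2 = np.sum((y - (a + b * xc))**2) / (len(x) - 2)

    def ci_halfwidth(x_new, conf=0.95):
        var_yhat = s2 * (1 / len(x) + (x_new - np.mean(x))**2 / np.sum(xc**2))
        t = stats.t.ppf(0.5 + conf / 2, df=len(x) - 2)
        return t * np.sqrt(var_yhat)

    print(ci_halfwidth(3.0), ci_halfwidth(5.0))   # narrowest at x bar = 3,
                                                  # wider away from it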
00:38:57.240 --> 00:39:01.410
All right, we're almost ready
to do a polynomial example.
00:39:01.410 --> 00:39:05.130
I just want to point out we
talked about this previously.
00:39:05.130 --> 00:39:09.060
We can also do not only
a constant term but also
00:39:09.060 --> 00:39:11.310
a linear term.
00:39:11.310 --> 00:39:14.850
We can do terms that include
this square polynomial,
00:39:14.850 --> 00:39:18.720
for example, include
curvature in the x squared.
00:39:18.720 --> 00:39:22.110
One important fact
is this is still
00:39:22.110 --> 00:39:25.500
linear in the coefficients.
00:39:28.130 --> 00:39:32.960
And what this means is the
least squares approach--
00:39:32.960 --> 00:39:36.240
least squares minimization,
still applies.
00:39:36.240 --> 00:39:37.790
So you can still
do least squares
00:39:37.790 --> 00:39:40.670
minimization to estimate
your beta coefficients.
00:39:40.670 --> 00:39:43.460
And essentially what
you do mechanically,
00:39:43.460 --> 00:39:46.400
say in something
like Excel, is create
00:39:46.400 --> 00:39:51.440
that additional fake column
of data, just taking your x.
00:39:51.440 --> 00:39:55.670
You can almost think of this
as equating that with an x2,
00:39:55.670 --> 00:39:59.780
think of this as an x1, and
building your data column,
00:39:59.780 --> 00:40:03.530
taking each of your x
coefficients, squaring it,
00:40:03.530 --> 00:40:06.680
and that becomes a
new x sub 2 input.
00:40:06.680 --> 00:40:08.090
And then all you're
doing is just
00:40:08.090 --> 00:40:12.000
a linear fit now in these
multiple coefficients.
00:40:12.000 --> 00:40:13.760
So it looks exactly
the same like we
00:40:13.760 --> 00:40:19.940
did for multiple inputs, even if
we have additional higher order
00:40:19.940 --> 00:40:21.635
terms in the x squared.
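A sketch of that fake-column trick (numpy assumed, hypothetical data): squaring x gives a second input column, and the fit stays linear in the coefficients.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 3.9, 9.1, 15.8, 25.3])        # roughly quadratic data

    # Columns: constant, x1 = x, x2 = x squared.
    X = np.column_stack([np.ones_like(x), x, x**2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary least squares
    y_hat = X @ beta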
00:40:25.630 --> 00:40:28.990
So let's look at a
simple example here.
00:40:28.990 --> 00:40:31.990
Pull these threads together,
look at confidence,
00:40:31.990 --> 00:40:34.390
but also look at
it in the case when
00:40:34.390 --> 00:40:38.860
I've got some replicate data so
we can get a little experience
00:40:38.860 --> 00:40:41.080
with this lack of fit idea.
00:40:41.080 --> 00:40:47.020
And so in this case, we've
got importantly here cases
00:40:47.020 --> 00:40:50.900
where I've replicated
my x values.
00:40:50.900 --> 00:40:55.960
So I've got two runs with 20
grams of some kind of growth
00:40:55.960 --> 00:40:57.230
supplement.
00:40:57.230 --> 00:41:00.268
And so I've got two different
output values at that point.
00:41:00.268 --> 00:41:01.810
And I've got another
point where I've
00:41:01.810 --> 00:41:09.570
got three replicates, triply
replicated set of data.
00:41:09.570 --> 00:41:12.720
And what I'd like to do
is try to fit a model
00:41:12.720 --> 00:41:16.500
and here what we've
got in the picture is
00:41:16.500 --> 00:41:19.350
an inkling or a foreshadowing
of some of the kinds of models
00:41:19.350 --> 00:41:23.580
we might consider and some of
the issues we might consider.
00:41:23.580 --> 00:41:24.480
If we look--
00:41:24.480 --> 00:41:28.690
I think you can see it here
--the basic data here in black,
00:41:28.690 --> 00:41:30.940
these are the data points.
00:41:30.940 --> 00:41:32.900
So this is just my output.
00:41:32.900 --> 00:41:34.920
There's my triply
replicated data.
00:41:34.920 --> 00:41:36.870
There is my x data.
00:41:36.870 --> 00:41:39.000
First off, I could try
to fit that with a mean.
00:41:39.000 --> 00:41:40.360
That's just the red line.
00:41:40.360 --> 00:41:43.920
That's the pure just
mean of my data.
00:41:43.920 --> 00:41:45.930
The green line here
is a first order
00:41:45.930 --> 00:41:49.350
fit to just a slope
coefficient and the mean,
00:41:49.350 --> 00:41:51.720
so two model terms.
00:41:51.720 --> 00:41:53.880
And you can see
already that's not
00:41:53.880 --> 00:41:56.100
going to be a very good model.
00:41:56.100 --> 00:41:58.920
And what we've got is enough
data here with the replicates
00:41:58.920 --> 00:42:01.320
to perhaps be able
to detect that using
00:42:01.320 --> 00:42:05.340
our machinery of ANOVA, and
then perhaps then build that
00:42:05.340 --> 00:42:10.860
into a second order model that
we can already get a sense is
00:42:10.860 --> 00:42:13.260
going to be a quadratic
model that fits the data
00:42:13.260 --> 00:42:15.615
a lot better.
00:42:15.615 --> 00:42:17.610
Now, if I were to just try it--
00:42:17.610 --> 00:42:19.590
let's say I didn't already--
00:42:19.590 --> 00:42:21.770
first off you should always
plot your actual data
00:42:21.770 --> 00:42:24.870
so you have a feel for
what kind of a model
00:42:24.870 --> 00:42:26.860
is going to be needed.
00:42:26.860 --> 00:42:29.370
So if you were to actually plot
that data, you would already
00:42:29.370 --> 00:42:32.670
see you probably need
a quadratic model.
00:42:32.670 --> 00:42:37.080
So you might go ahead and
up front, include that term.
00:42:37.080 --> 00:42:40.740
But let's say we had
not done that, we'd just
00:42:40.740 --> 00:42:42.910
tried to fit it with
a very simple model,
00:42:42.910 --> 00:42:44.640
a simple linear model.
00:42:44.640 --> 00:42:46.890
And if we go through
and do the ANOVA,
00:42:46.890 --> 00:42:50.580
now because we do have
replicated runs,
00:42:50.580 --> 00:42:55.140
I can split my overall residual
sum of squared deviations
00:42:55.140 --> 00:42:58.420
into a lack of fit term.
00:42:58.420 --> 00:42:59.880
That's a sum of
squared deviations
00:42:59.880 --> 00:43:02.430
just from my replicates--
00:43:02.430 --> 00:43:07.110
or my total deviation from my
model from my replicated data.
00:43:07.110 --> 00:43:10.890
And I can formulate then a
ratio of those two things.
00:43:10.890 --> 00:43:15.150
And what I've got is
deviations from my model
00:43:15.150 --> 00:43:17.140
that are much larger.
00:43:17.140 --> 00:43:21.597
So this is a deviation.
00:43:21.597 --> 00:43:22.430
It's not a good one.
00:43:22.430 --> 00:43:24.650
Actually, right there the
deviation from the model
00:43:24.650 --> 00:43:27.740
is quite small.
00:43:27.740 --> 00:43:30.620
If I were to look right
here, for example,
00:43:30.620 --> 00:43:34.070
this is my deviation
from the model.
00:43:34.070 --> 00:43:36.560
I don't have any
replicate data there.
00:43:36.560 --> 00:43:39.770
Right here, I've got deviation
from the linear model.
00:43:39.770 --> 00:43:44.710
And then I've got
pure replicate error.
00:43:44.710 --> 00:43:50.380
And you can start to see that
the deviations from my best
00:43:50.380 --> 00:43:53.440
estimate prediction of the
model are much, much larger.
00:43:53.440 --> 00:43:57.940
And that's what shows up in
this ratio of the two variances.
00:43:57.940 --> 00:44:00.970
If you do that and follow
through with the F,
00:44:00.970 --> 00:44:03.910
that's highly unlikely--
that big of a ratio
00:44:03.910 --> 00:44:09.580
is highly unlikely to occur by
chance given the noise spread.
00:44:09.580 --> 00:44:12.910
So if you actually go in and
do the lack of fit analysis,
00:44:12.910 --> 00:44:16.370
it's already setting
up big red flags.
00:44:16.370 --> 00:44:18.610
Here's my red flag saying,
look out, look out.
00:44:18.610 --> 00:44:23.530
You've got a lot of
evidence of a lack of fit.
00:44:23.530 --> 00:44:26.080
What's interesting
in this example is
00:44:26.080 --> 00:44:30.010
if I were to just look
at the significance
00:44:30.010 --> 00:44:36.980
of the individual model
terms, this pops out in fact
00:44:36.980 --> 00:44:40.510
that the mean is
highly significant
00:44:40.510 --> 00:44:42.025
but the slope term is not.
00:44:45.930 --> 00:44:47.280
So this would say--
00:44:47.280 --> 00:44:48.960
if I weren't looking
at lack of fit
00:44:48.960 --> 00:44:51.870
and paying attention
to that red flag,
00:44:51.870 --> 00:44:57.270
I might be tempted to
say a very wrong thing.
00:44:57.270 --> 00:45:00.570
I might be tempted to say
there is a significant estimate
00:45:00.570 --> 00:45:07.710
of the mean that's non-zero,
but given the spread in my data,
00:45:07.710 --> 00:45:10.530
I cannot conclude that
there is a linear dependence
00:45:10.530 --> 00:45:12.390
on my input.
00:45:12.390 --> 00:45:18.090
My linear dependence
on x could be 0.
00:45:18.090 --> 00:45:20.370
In other words,
with that green line
00:45:20.370 --> 00:45:25.820
right here, that's a
small slope that given
00:45:25.820 --> 00:45:30.740
the spread in my data is not
justified to actually estimate
00:45:30.740 --> 00:45:32.060
as anything other than 0.
00:45:35.720 --> 00:45:36.690
Interesting, huh?
00:45:39.350 --> 00:45:41.930
So you really need
to look at both.
00:45:41.930 --> 00:45:44.120
I'd have to be very
careful because
00:45:44.120 --> 00:45:46.910
the extra explanatory
power of the linear term
00:45:46.910 --> 00:45:48.860
is very, very minimal here.
00:45:48.860 --> 00:45:50.690
So I might think
OK, so I've really
00:45:50.690 --> 00:45:53.210
got no dependence at
all, when what I've really
00:45:53.210 --> 00:45:54.380
got is lack of fit.
00:45:57.820 --> 00:45:58.620
That making sense?
00:46:01.210 --> 00:46:04.780
So what I might then do is
say, OK, I am paying attention
00:46:04.780 --> 00:46:05.710
to that big red flag.
00:46:05.710 --> 00:46:06.790
I've got lack of fit.
00:46:06.790 --> 00:46:13.460
Maybe I better add a
quadratic term, refit my data.
00:46:13.460 --> 00:46:19.300
So now if I look at
the ANOVA for my model
00:46:19.300 --> 00:46:22.150
with the mean with a term
for the linear coefficient
00:46:22.150 --> 00:46:25.960
and one for the quadratic,
now what do I get?
00:46:25.960 --> 00:46:29.620
And I return to breaking
apart my residual
00:46:29.620 --> 00:46:34.330
and now looking and seeing
how much deviation is there
00:46:34.330 --> 00:46:38.650
due to lack of fit compared to
underlying replicate variance.
00:46:38.650 --> 00:46:40.880
And now that ratio
is very small.
00:46:40.880 --> 00:46:44.620
So now I don't have
any longer any evidence
00:46:44.620 --> 00:46:47.770
of lack of fit, that's good.
00:46:47.770 --> 00:46:50.320
And now I can return
to deciding about
00:46:50.320 --> 00:46:54.820
whether individual
terms are significant.
00:46:54.820 --> 00:46:59.380
And we don't see the full F
test here; it's an incomplete ANOVA.
00:46:59.380 --> 00:47:01.750
But what we would
basically find here
00:47:01.750 --> 00:47:05.830
is the mean term is
significant, the quadratic term
00:47:05.830 --> 00:47:08.197
is significant.
00:47:08.197 --> 00:47:09.280
How about the linear term?
00:47:12.160 --> 00:47:14.180
It's still not significant.
00:47:14.180 --> 00:47:17.620
So in fact, we've got a
mean and a square term
00:47:17.620 --> 00:47:20.590
but no dependence
on the linear term.
00:47:20.590 --> 00:47:22.090
You will typically see that.
00:47:22.090 --> 00:47:26.770
In fact, these-- if these
terms are truly orthogonal,
00:47:26.770 --> 00:47:29.590
if I add the terms, it should
not change my estimates
00:47:29.590 --> 00:47:31.300
for the other terms.
00:47:31.300 --> 00:47:34.570
That's not quite true if you
throw those missing terms
00:47:34.570 --> 00:47:36.820
into noise factors.
00:47:36.820 --> 00:47:40.750
But the basic point here is
I've now actually captured
00:47:40.750 --> 00:47:47.610
the dependence on x
with this quadratic term.
00:47:47.610 --> 00:47:49.660
So you can do exactly
the same thing.
00:47:49.660 --> 00:47:54.210
This is the same
data using Excel.
00:47:54.210 --> 00:47:56.730
And you get the
same kind of a table
00:47:56.730 --> 00:47:59.970
here with an x term
and x squared term.
00:47:59.970 --> 00:48:03.390
And what's interesting
here is you can also
00:48:03.390 --> 00:48:06.780
go in and look at estimates
of the coefficients,
00:48:06.780 --> 00:48:11.010
the standard error, 95%
confidence intervals.
00:48:11.010 --> 00:48:15.630
And I guess actually if you were
to look at that 95% confidence
00:48:15.630 --> 00:48:18.960
interval for that x term,
looks like it actually
00:48:18.960 --> 00:48:22.170
is likely to be non-zero.
00:48:22.170 --> 00:48:24.810
So I did get that right.
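(For reference, a minimal sketch of getting coefficient estimates, standard errors, and 95% confidence intervals for such a quadratic fit, assuming the statsmodels package; the data here are invented.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 20)
y = 5.0 + 0.1 * x + 2.0 * x**2 + rng.normal(0, 0.5, x.size)  # made-up data

X = sm.add_constant(np.column_stack([x, x**2]))  # columns: 1, x, x^2
fit = sm.OLS(y, X).fit()
print(fit.params)      # estimates of b0, b1, b2
print(fit.bse)         # standard errors
print(fit.conf_int())  # 95% CIs; an interval spanning 0 suggests the
                       # term may not be significant
```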
00:48:28.040 --> 00:48:31.740
So actually you probably
should include that term,
00:48:31.740 --> 00:48:34.530
even though the ratio
is a little bit smaller.
00:48:34.530 --> 00:48:36.470
It is still significant.
00:48:36.470 --> 00:48:38.720
Now I also put this one
up because it's also
00:48:38.720 --> 00:48:43.880
got estimates of your R-squared
and adjusted R-squared,
00:48:43.880 --> 00:48:49.370
where it's giving
you a nice feel.
00:48:49.370 --> 00:48:53.840
R-squared of around 0.9, 0.95,
you start to feel pretty good
00:48:53.840 --> 00:48:54.462
about
00:48:54.462 --> 00:48:55.670
your model.
00:48:58.660 --> 00:49:00.880
So I don't know if you
played around with Excel.
00:49:00.880 --> 00:49:06.850
So again, I encourage JMP, but
if you do need to use Excel,
00:49:06.850 --> 00:49:08.620
there is--
00:49:08.620 --> 00:49:12.400
under the data analysis
tool if you pull that down,
00:49:12.400 --> 00:49:15.010
you will also see the
regression analysis.
00:49:15.010 --> 00:49:19.240
And it will let you indicate
what your output columns are
00:49:19.240 --> 00:49:21.040
and what your input columns are.
00:49:21.040 --> 00:49:24.040
And it does just the least
squares regression, pops out
00:49:24.040 --> 00:49:26.920
your ANOVA table for you.
00:49:26.920 --> 00:49:29.410
In that case, you
actually have to construct
00:49:29.410 --> 00:49:32.800
by hand your x squared
00:49:32.800 --> 00:49:35.350
data if you
want a polynomial fit.
00:49:35.350 --> 00:49:37.870
And that's what I've
just illustrated here.
00:49:37.870 --> 00:49:41.200
You can't simply, unfortunately,
at least in the version
00:49:41.200 --> 00:49:45.040
of Excel I have, say I want
to try a polynomial model up
00:49:45.040 --> 00:49:49.450
to some order and have
it just know to do that
00:49:49.450 --> 00:49:51.400
on the polynomial input data.
00:49:51.400 --> 00:49:53.470
You actually have
to create columns
00:49:53.470 --> 00:49:55.030
for each of the
model coefficients
00:49:55.030 --> 00:49:56.478
that you want to estimate.
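(The same build-the-columns-yourself idea, sketched in Python with numpy on made-up data; least squares only ever sees the columns you construct.)

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 7.1, 13.2, 20.8])  # invented response

# One explicit column per coefficient to estimate: intercept, x, x^2
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # b0, b1, b2 of the quadratic fit
```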
00:50:00.150 --> 00:50:03.780
Here's the same polynomial
regression using the JMP
00:50:03.780 --> 00:50:07.470
package, again, with
all of the lack of fit
00:50:07.470 --> 00:50:12.360
versus pure error, the
x and x squared terms,
00:50:12.360 --> 00:50:16.440
t ratios, all of that, but
basically the same analysis
00:50:16.440 --> 00:50:20.850
with the second order included.
00:50:20.850 --> 00:50:23.400
OK so with that, I'm going
to-- about to move on
00:50:23.400 --> 00:50:24.780
to process optimization.
00:50:24.780 --> 00:50:30.000
But I'd like to take any
questions on regression,
00:50:30.000 --> 00:50:33.100
confidence intervals, confidence
intervals on inputs, confidence
00:50:33.100 --> 00:50:34.200
intervals on outputs.
00:50:34.200 --> 00:50:35.430
Is that all?
00:50:35.430 --> 00:50:37.770
It's starting to feel--
00:50:37.770 --> 00:50:40.950
are you confident in
your understanding
00:50:40.950 --> 00:50:42.255
of confidence intervals?
00:50:42.255 --> 00:50:43.020
Yeah, question?
00:50:43.020 --> 00:50:44.660
AUDIENCE: I definitely don't know what
00:50:44.660 --> 00:50:49.220
you do if your inputs
are correlated.
00:50:49.220 --> 00:50:51.350
DUANE BONING: OK so the
question was, what do you
00:50:51.350 --> 00:50:53.300
do if your inputs
are correlated.
00:50:55.890 --> 00:51:02.540
So what is assumed
in all of these fits
00:51:02.540 --> 00:51:05.090
is essentially you've
got orthogonality.
00:51:05.090 --> 00:51:07.800
If we go back to the
tables we were forming
00:51:07.800 --> 00:51:09.740
with full factorial
and so on, we're
00:51:09.740 --> 00:51:13.950
assuming that each of your
columns are orthogonal,
00:51:13.950 --> 00:51:17.570
which is to say we're assuming
each of your coefficients
00:51:17.570 --> 00:51:22.580
in each of your different terms
are uncorrelated or orthogonal.
00:51:22.580 --> 00:51:27.860
If they are orthogonal, and you
do a least squares regression--
00:51:27.860 --> 00:51:31.670
or if they are not orthogonal,
that is, they are correlated,
00:51:31.670 --> 00:51:33.480
what happens?
00:51:33.480 --> 00:51:37.100
Well, what happens is you've
got two model coefficients
00:51:37.100 --> 00:51:41.120
both trying to explain some
amount of the same data.
00:51:41.120 --> 00:51:43.200
And they fight
against each other.
00:51:43.200 --> 00:51:48.980
And it's almost
random how the effect
00:51:48.980 --> 00:51:51.650
that-- that true underlying
effect gets apportioned between
00:51:51.650 --> 00:51:54.390
say a beta 1 and a beta 2 term.
00:51:54.390 --> 00:51:56.630
In fact, with very, very tiny
little perturbations,
00:51:56.630 --> 00:52:00.170
and you can get a different
mix of beta 1 and beta 2.
00:52:00.170 --> 00:52:03.230
And it turns out
you might still be
00:52:03.230 --> 00:52:05.300
OK in terms of
predicting an output
00:52:05.300 --> 00:52:08.400
because at least your model
has both of them in there.
00:52:08.400 --> 00:52:11.390
But it really screws up
your ability to decide
00:52:11.390 --> 00:52:17.980
is that model term
significant or not.
00:52:17.980 --> 00:52:21.820
What you need to do
is transform your data
00:52:21.820 --> 00:52:25.060
to get it into an
orthogonal form
00:52:25.060 --> 00:52:28.750
to get rid of the correlation
to basically create new
00:52:28.750 --> 00:52:32.980
model coefficients and
new explanatory values--
00:52:32.980 --> 00:52:39.190
fake x values that don't
have the correlation in them.
00:52:39.190 --> 00:52:42.940
And the classic
tool for doing that
00:52:42.940 --> 00:52:49.390
is principal component
analysis or some transformation
00:52:49.390 --> 00:52:55.900
of the data to a different basis
than your original x1, x2, x3
00:52:55.900 --> 00:52:59.220
coefficients.
00:52:59.220 --> 00:53:02.580
We might talk a little bit
about multivariable things.
00:53:02.580 --> 00:53:08.100
I think we did a little bit with
multivariate statistics and T-squared
00:53:08.100 --> 00:53:12.240
charts and so on,
but essentially
00:53:12.240 --> 00:53:15.180
a principal components or some
other kind of transformation
00:53:15.180 --> 00:53:17.430
is needed on the
data in order to then
00:53:17.430 --> 00:53:20.640
have individual
coefficients that
00:53:20.640 --> 00:53:23.200
are not duplicating each other.
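(A hedged sketch of that decorrelation step using only numpy; the nearly collinear inputs here are invented for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=100)  # almost a copy of x1
y = 3.0 * x1 + 2.0 * x2 + rng.normal(0, 0.1, size=100)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                 # center the inputs
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                           # orthogonal scores: the new "x's"

# Regression on Z: coefficients no longer fight over shared variance
A = np.column_stack([np.ones(len(y)), Z])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)
```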
00:53:23.200 --> 00:53:27.060
If you look, I think it's
chapter section 8 point--
00:53:27.060 --> 00:53:29.010
maybe 8.4.
00:53:29.010 --> 00:53:31.800
The next one after what I
assigned as a reading, that
00:53:31.800 --> 00:53:33.660
talks about principal
component analysis
00:53:33.660 --> 00:53:36.270
and how you do that
in process modeling.
00:53:36.270 --> 00:53:38.070
So you can read that section.
00:53:38.070 --> 00:53:42.510
It's actually very
good, very interesting.
00:53:42.510 --> 00:53:47.990
Other questions on regression?
00:53:47.990 --> 00:53:48.490
Yeah?
00:53:48.490 --> 00:53:50.073
AUDIENCE: If there
is a big difference
00:53:50.073 --> 00:53:52.360
between R-squared and
adjusted R-squared, what
00:53:52.360 --> 00:53:54.010
is that telling us?
00:53:54.010 --> 00:53:58.860
In this case, it's essentially
[INAUDIBLE] 0.9 and 0.8,
00:53:58.860 --> 00:54:01.192
or 0.7 [INAUDIBLE].
00:54:03.413 --> 00:54:04.830
DUANE BONING: Yes,
so the question
00:54:04.830 --> 00:54:06.330
is what if you have
big differences
00:54:06.330 --> 00:54:09.570
between R-squared and
adjusted R-squared.
00:54:09.570 --> 00:54:13.710
I think it's
essentially telling you
00:54:13.710 --> 00:54:17.930
that the influence of
additional model coefficients
00:54:17.930 --> 00:54:24.350
is really important, both--
00:54:24.350 --> 00:54:26.060
this is very qualitative.
00:54:26.060 --> 00:54:27.860
But essentially,
it's telling you
00:54:27.860 --> 00:54:31.850
there's more going on
than just the mean response.
00:54:31.850 --> 00:54:34.490
So you're seeing a little
bit of a mix of both--
00:54:34.490 --> 00:54:37.370
the penalty of adding
more model coefficients,
00:54:37.370 --> 00:54:40.580
but it's also
telling you there's
00:54:40.580 --> 00:54:45.530
likely additional structure
that you needed in order
00:54:45.530 --> 00:54:47.090
to use that.
00:54:47.090 --> 00:54:48.410
But that's pretty qualitative.
00:54:48.410 --> 00:54:50.780
I think basically it's
signaling that there's
00:54:50.780 --> 00:54:52.430
more than just mean--
00:54:52.430 --> 00:54:54.230
mean deviations going on.
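(The standard adjusted R-squared formula behind that comparison, as a small sketch; n is the number of observations, p the number of fitted coefficients including the intercept.)

```python
def adjusted_r2(r2, n, p):
    # Penalizes R^2 for each extra coefficient; a big gap between R^2
    # and this value hints the added terms buy little explanatory power.
    return 1 - (1 - r2) * (n - 1) / (n - p)

print(adjusted_r2(0.9, n=10, p=3))  # 0.9 shrinks to about 0.87
```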
00:54:57.480 --> 00:55:01.110
It sounded like there
was a microphone
00:55:01.110 --> 00:55:03.054
question in Singapore?
00:55:03.054 --> 00:55:05.030
AUDIENCE: Question on slide 50.
00:55:09.650 --> 00:55:14.450
You mentioned we should
only see the mean which
00:55:14.450 --> 00:55:17.900
also focused on the lack
of fit and the pure error.
00:55:17.900 --> 00:55:20.630
So why do you say that if
we only see the mean,
00:55:20.630 --> 00:55:22.460
we may say it's a good model?
00:55:22.460 --> 00:55:23.990
Can you explain that again?
00:55:23.990 --> 00:55:25.365
DUANE BONING:
Yeah, actually what
00:55:25.365 --> 00:55:27.830
I was saying in this
example is that if I only
00:55:27.830 --> 00:55:34.710
looked at the mean, I might be
hesitant to include any model
00:55:34.710 --> 00:55:37.120
terms beyond the mean.
00:55:37.120 --> 00:55:42.100
So I might not actually think
it's a good model at all.
00:55:42.100 --> 00:55:45.990
So that part of your question,
I'm not sure I quite understood
00:55:45.990 --> 00:55:49.260
or quite agreed with.
00:55:49.260 --> 00:55:50.280
But I do--
00:55:50.280 --> 00:55:53.400
I guess maybe I'm
just repeating myself,
00:55:53.400 --> 00:55:57.270
I think it is really critical
to look for lack of fit
00:55:57.270 --> 00:55:59.730
because you need
both perspectives.
00:55:59.730 --> 00:56:05.790
You need to look not only at
model coefficients and terms,
00:56:05.790 --> 00:56:08.430
and whether they should
be included in the model,
00:56:08.430 --> 00:56:12.870
but you also have to be
alert: am I missing terms?
00:56:16.010 --> 00:56:19.040
That's what the lack of
fit enables you to do.
00:56:19.040 --> 00:56:22.490
This is basically saying
the terms that are there,
00:56:22.490 --> 00:56:23.630
are they significant?
00:56:26.630 --> 00:56:29.110
So in some sense, this one
is basically just leading
00:56:29.110 --> 00:56:32.920
you to throw away coefficients
and throw away model terms.
00:56:32.920 --> 00:56:36.400
And this number two, the
lack of fit, is telling you,
00:56:36.400 --> 00:56:38.650
hey wait a second, there's
stuff going on in the model
00:56:38.650 --> 00:56:40.120
that you're not
explaining that's
00:56:40.120 --> 00:56:43.840
different than random
noise, so maybe you
00:56:43.840 --> 00:56:46.600
should add model terms.
00:56:46.600 --> 00:56:48.670
And so you need
both perspectives.
00:56:52.380 --> 00:56:58.310
OK so I think we're ready to
move on and look a little bit
00:56:58.310 --> 00:57:00.260
at process optimization.
00:57:00.260 --> 00:57:03.710
I want to touch on the most
natural use of these sorts
00:57:03.710 --> 00:57:08.130
of models, which is we define
an experimental design,
00:57:08.130 --> 00:57:10.280
we go gather the data,
we build a model,
00:57:10.280 --> 00:57:12.260
and then we start
playing with the model.
00:57:12.260 --> 00:57:15.620
I think of that as
offline use of the model,
00:57:15.620 --> 00:57:19.160
using it to try to
identify an optimal point.
00:57:19.160 --> 00:57:22.880
But it's not purely
offline because I
00:57:22.880 --> 00:57:26.540
want to make the point that if
you're predicting an optimum,
00:57:26.540 --> 00:57:30.770
you probably want to go back and
run some confirming experiments
00:57:30.770 --> 00:57:34.640
and use those back with
your physical process
00:57:34.640 --> 00:57:37.940
to check your model and maybe
even iterate and improve
00:57:37.940 --> 00:57:39.050
your model.
00:57:39.050 --> 00:57:41.160
So that's one natural approach.
00:57:41.160 --> 00:57:42.470
And the other is--
00:57:42.470 --> 00:57:47.070
that should be online use.
00:57:47.070 --> 00:57:52.440
So another clever approach
is actually build simplified
00:57:52.440 --> 00:57:56.340
models in a little part of
the space, use that to tell me
00:57:56.340 --> 00:58:00.300
what direction to move in
exploring my overall process
00:58:00.300 --> 00:58:05.340
space, and then dynamically
build and improve my model.
00:58:05.340 --> 00:58:09.480
In the case when my real goal
is getting to an optimum,
00:58:09.480 --> 00:58:12.840
not having the perfect model
covering all of my space
00:58:12.840 --> 00:58:15.090
but rather getting to
an optimum point.
00:58:15.090 --> 00:58:17.640
So I want to touch on
both of these ideas, ways
00:58:17.640 --> 00:58:20.640
of using these sort of
simplified response surface
00:58:20.640 --> 00:58:22.650
models.
00:58:22.650 --> 00:58:27.250
And part of the point
here is one important use
00:58:27.250 --> 00:58:31.450
of these models really is trying
to find an optimal process
00:58:31.450 --> 00:58:35.230
output or find the inputs that
give me an optimal process
00:58:35.230 --> 00:58:36.310
output.
00:58:36.310 --> 00:58:39.160
And that optimal
process output may
00:58:39.160 --> 00:58:41.170
have multiple
characteristics about it
00:58:41.170 --> 00:58:44.360
that are important for us.
00:58:44.360 --> 00:58:48.220
One is I want to be
close to a target value.
00:58:48.220 --> 00:58:53.530
But the other is we may
also want small sensitivity,
00:58:53.530 --> 00:58:56.180
small deviations in my output.
00:58:56.180 --> 00:58:58.480
And if we go back to
our variation equation,
00:58:58.480 --> 00:59:02.770
that may mean I want small
deviations around noise factors
00:59:02.770 --> 00:59:05.710
that I'm not controlling.
00:59:05.710 --> 00:59:11.670
And I may also want
relatively small sensitivity
00:59:11.670 --> 00:59:13.950
even to some of my
input parameters
00:59:13.950 --> 00:59:15.810
because I'm going to
fix them in my process.
00:59:15.810 --> 00:59:20.500
And I'm not dynamically or in
a feedback loop changing them.
00:59:20.500 --> 00:59:24.690
So in some cases, I want
this to also be small.
00:59:24.690 --> 00:59:27.000
So we'll talk a
little bit about ways
00:59:27.000 --> 00:59:30.750
to mix in these and
other objectives.
00:59:30.750 --> 00:59:33.060
For right now, I'm
going to mostly focus
00:59:33.060 --> 00:59:40.710
on say trying to meet some
set of target mean values.
00:59:40.710 --> 00:59:43.710
But I can make the
point you can generalize
00:59:43.710 --> 00:59:47.730
what I'm going to be talking
about here by thinking
00:59:47.730 --> 00:59:51.150
of some objective function,
or some cost function,
00:59:51.150 --> 00:59:55.620
or some goodness function that
actually mixes in together
00:59:55.620 --> 00:59:57.630
multiple objectives.
00:59:57.630 --> 01:00:00.240
So some of the objectives,
you might have a cost function
01:00:00.240 --> 01:00:06.120
that penalizes for
deviations from the target
01:00:06.120 --> 01:00:08.220
or maybe sum of
squared deviations
01:00:08.220 --> 01:00:10.930
if I have multiple
outputs from the target.
01:00:10.930 --> 01:00:15.450
It may also penalize me
for larger x's because--
01:00:15.450 --> 01:00:18.810
larger input because
there's more cost
01:00:18.810 --> 01:00:24.000
associated with using more
gas if I have a higher gas
01:00:24.000 --> 01:00:25.860
flow in some process.
01:00:25.860 --> 01:00:27.840
And then I can also
include other things
01:00:27.840 --> 01:00:34.410
like terms that penalize for
sensitivity, these delta y's,
01:00:34.410 --> 01:00:36.450
sensitivity to the output.
01:00:36.450 --> 01:00:40.890
And I can keep throwing
additional things in.
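(A hypothetical sketch of such a composite objective; the weights and the model and sensitivity functions are illustrative assumptions, not anything from the lecture.)

```python
import numpy as np

def J(x, model, sensitivity, target, w1=1.0, w2=0.1, w3=0.5):
    y = model(x)                        # predicted output(s) at input x
    dev = np.sum((y - target) ** 2)     # squared deviation from target
    usage = np.sum(x ** 2)              # cost of larger inputs (e.g. gas flow)
    sens = np.sum(sensitivity(x) ** 2)  # penalty on the delta-y sensitivities
    return w1 * dev + w2 * usage + w3 * sens
```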
01:00:40.890 --> 01:00:46.770
So if I've got in general some
complicated objective function,
01:00:46.770 --> 01:00:51.120
if I can formulate
that and actually model
01:00:51.120 --> 01:00:57.120
either empirically or
analytically that cost
01:00:57.120 --> 01:01:00.300
function as a
function of my input
01:01:00.300 --> 01:01:04.650
or as a function utilizing the
models that I already have,
01:01:04.650 --> 01:01:07.710
I can then formulate an
optimization function
01:01:07.710 --> 01:01:09.180
or an optimization
problem where I
01:01:09.180 --> 01:01:11.130
might be trying to
minimize that cost
01:01:11.130 --> 01:01:15.563
or minimize that objective.
01:01:15.563 --> 01:01:16.980
Or maybe I'm trying
to maximize it
01:01:16.980 --> 01:01:20.070
because I think of it as really
a goodness function rather
01:01:20.070 --> 01:01:21.690
than a penalty function.
01:01:21.690 --> 01:01:26.460
But overall, I've got some
complicated form for J
01:01:26.460 --> 01:01:29.310
as a function of my factors.
01:01:29.310 --> 01:01:32.820
Or my factors might
be my actual input,
01:01:32.820 --> 01:01:38.010
but they may also be noise
factors, other factors that I
01:01:38.010 --> 01:01:41.840
haven't explicitly modeled.
01:01:41.840 --> 01:01:48.150
And we'll talk about robustness
next week, or not next week,
01:01:48.150 --> 01:01:49.830
on Thursday.
01:01:49.830 --> 01:01:51.360
But right now, I
just want to talk
01:01:51.360 --> 01:01:57.690
about adjusting or searching
for good input factors
01:01:57.690 --> 01:02:03.430
to minimize or maximize some
cost function with constraints.
01:02:03.430 --> 01:02:05.850
So in general, you can think
about different approaches
01:02:05.850 --> 01:02:06.690
for this.
01:02:06.690 --> 01:02:10.020
If I've got a full
expression for y
01:02:10.020 --> 01:02:16.530
as some function of x and
maybe J is some function of y,
01:02:16.530 --> 01:02:21.570
I have got some
overall function for my cost
01:02:21.570 --> 01:02:24.240
as a function of my inputs.
01:02:24.240 --> 01:02:27.690
Then I can go in
and try to minimize,
01:02:27.690 --> 01:02:32.910
really dJ/dx and find--
01:02:32.910 --> 01:02:35.130
with some assumptions
of monotonicity,
01:02:35.130 --> 01:02:38.430
I can find an overall minimum
or at least a local minimum
01:02:38.430 --> 01:02:40.480
or maximum to that function.
01:02:40.480 --> 01:02:43.080
So that's if I've got
a full expression.
01:02:43.080 --> 01:02:46.230
And we'll explore
that a little bit.
01:02:46.230 --> 01:02:48.870
Another approach is more
of an incremental approach.
01:02:48.870 --> 01:02:50.970
Rather than having
the full expression
01:02:50.970 --> 01:02:54.420
and leaping right
to the optimum point
01:02:54.420 --> 01:02:58.720
based on a local minimum
or local maximum,
01:02:58.720 --> 01:03:00.540
I may have to search for it.
01:03:00.540 --> 01:03:04.255
I may have to iteratively
explore the space.
01:03:04.255 --> 01:03:05.880
And we'll talk a
little bit about these
01:03:05.880 --> 01:03:10.200
with hill climbing or steepest
ascent and descent kinds
01:03:10.200 --> 01:03:10.855
of problems.
01:03:10.855 --> 01:03:12.480
And I've already
mentioned a little bit
01:03:12.480 --> 01:03:15.360
of this online versus offline.
01:03:15.360 --> 01:03:17.460
So here's the simplest
picture for one
01:03:17.460 --> 01:03:19.230
of these optimization problems.
01:03:19.230 --> 01:03:22.950
I've got my input x, and
I've got my output y.
01:03:22.950 --> 01:03:31.470
And what I'm looking for is
a maximum for my output y.
01:03:31.470 --> 01:03:33.810
And maybe here simply
my cost function
01:03:33.810 --> 01:03:39.850
is simply J or J is equal
to y, something like that.
01:03:39.850 --> 01:03:43.500
So I'm not differentiating
here too much between y and J.
01:03:43.500 --> 01:03:45.450
I'm just simply saying
what I'm looking
01:03:45.450 --> 01:03:50.520
for is the overall
maximum for this output.
01:03:50.520 --> 01:03:55.650
And one knows from basic
geometry, basic algebra
01:03:55.650 --> 01:03:59.400
that the maximum will occur--
01:03:59.400 --> 01:04:02.310
unless I hit some
constraints or some boundary
01:04:02.310 --> 01:04:05.670
cases --will occur
when I've got zero
01:04:05.670 --> 01:04:09.780
slope in that function.
01:04:09.780 --> 01:04:11.370
So how do I find it?
01:04:11.370 --> 01:04:15.970
Well, one approach is, again,
this analytic approach.
01:04:15.970 --> 01:04:18.760
If I have a full
expression, I can simply
01:04:18.760 --> 01:04:20.680
recognize that
that the optimum occurs
01:04:20.680 --> 01:04:26.320
where there is zero slope,
solve for the x such
01:04:26.320 --> 01:04:32.440
that the slope is 0, and
I directly get to the answer.
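(A minimal sketch of that analytic route for a fitted quadratic y = b0 + b1*x + b2*x^2: set the slope dy/dx = b1 + 2*b2*x to zero and solve. The coefficients are invented.)

```python
b0, b1, b2 = 2.0, 4.0, -1.5    # b2 < 0, so the zero-slope point is a maximum
x_star = -b1 / (2.0 * b2)      # solve b1 + 2*b2*x = 0
y_star = b0 + b1 * x_star + b2 * x_star**2
print(x_star, y_star)          # predicted optimum at x* = 4/3
```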
01:04:32.440 --> 01:04:35.920
But in order to do that, I
need a full analytic model.
01:04:35.920 --> 01:04:40.330
To do that, I needed
perhaps relatively small
01:04:40.330 --> 01:04:43.900
or good accurate increments
in x or assumptions
01:04:43.900 --> 01:04:45.730
on the model form.
01:04:45.730 --> 01:04:50.695
And especially if I have
relatively sparse data points,
01:04:50.695 --> 01:04:54.190
if I had say just
these data points,
01:04:54.190 --> 01:04:58.510
it's quite easy to miss
the true optimum because
01:04:58.510 --> 01:05:06.370
of noise or imperfections
in my model fit.
01:05:06.370 --> 01:05:09.570
So it can actually be a little
bit tricky with small amounts
01:05:09.570 --> 01:05:14.430
of data to find that if I
fit an overall analytic model
01:05:14.430 --> 01:05:16.650
to a very small
number of data points.
01:05:18.900 --> 01:05:23.940
An alternative is a little bit
of an iterative or a search
01:05:23.940 --> 01:05:31.020
process where we might actually
add data or explore a model,
01:05:31.020 --> 01:05:37.770
either explore experiments or
explore a model in a smaller
01:05:37.770 --> 01:05:43.770
space in each case and sort of
seek to find the optimum point.
01:05:43.770 --> 01:05:46.530
And the simple
conceptual idea
01:05:46.530 --> 01:05:50.340
here is in some
regions of my space,
01:05:50.340 --> 01:05:54.660
I may have very good
model fits--
01:05:54.660 --> 01:05:56.730
with much less
error than trying
01:05:56.730 --> 01:06:00.060
to fit this overall quadratic to
a small number of data points.
01:06:00.060 --> 01:06:02.310
I may have relatively
good model fit
01:06:02.310 --> 01:06:04.275
in smaller regions of the space.
01:06:06.990 --> 01:06:09.110
Remember that confidence
interval on the output?
01:06:09.110 --> 01:06:13.190
I said as we get further
and further away from say
01:06:13.190 --> 01:06:18.440
the central moments of our
data, my confidence interval
01:06:18.440 --> 01:06:21.380
on my output prediction
gets wider and wider.
01:06:21.380 --> 01:06:24.860
If I shrink my space,
I get better estimates
01:06:24.860 --> 01:06:27.480
of my model in a local space.
01:06:27.480 --> 01:06:29.190
And so one approach
here is to say,
01:06:29.190 --> 01:06:30.900
I'm going to look
in a local space
01:06:30.900 --> 01:06:34.490
get a good estimate
of what the slope is.
01:06:34.490 --> 01:06:39.270
Maybe it's a reduced order
model that's only linear.
01:06:39.270 --> 01:06:42.760
So I'm not even trying to
fit additional curvature.
01:06:42.760 --> 01:06:46.410
And then use that
to say my output y
01:06:46.410 --> 01:06:51.000
is increasing in this
direction with x increasing.
01:06:51.000 --> 01:06:55.200
And use that to project
forward a small amount
01:06:55.200 --> 01:07:00.030
and suggest a new
x value to try.
01:07:00.030 --> 01:07:06.660
So it's projecting
additional steps to explore.
01:07:06.660 --> 01:07:09.870
If I then do that and build
an additional linear model--
01:07:09.870 --> 01:07:16.220
whoa --build an additional
linear model here,
01:07:16.220 --> 01:07:18.530
it might suggest
another small step.
01:07:18.530 --> 01:07:23.540
And as my linear model starts to
have a slope term that shrinks,
01:07:23.540 --> 01:07:26.690
that's telling me I'm
getting something closer
01:07:26.690 --> 01:07:31.370
to an optimum point or at
least a local optimum point.
01:07:31.370 --> 01:07:35.720
And at that point that's
signaling me that if I really
01:07:35.720 --> 01:07:39.260
want improved accuracy
at that point in space,
01:07:39.260 --> 01:07:43.770
to really zero in on the
maximum, I can do two things.
01:07:43.770 --> 01:07:47.760
One is to still constrain
my search space.
01:07:47.760 --> 01:07:52.550
But also in this region,
it's quite likely that my--
01:07:55.580 --> 01:07:57.240
it's quite likely--
01:07:57.240 --> 01:07:59.360
I don't want this.
01:07:59.360 --> 01:08:00.900
I don't know what that was.
01:08:00.900 --> 01:08:05.860
Oh, wow, something
funky happened.
01:08:05.860 --> 01:08:10.540
In this space, it's just like
with that curvature model
01:08:10.540 --> 01:08:12.520
that I showed you
earlier, the linear term
01:08:12.520 --> 01:08:14.800
is probably no longer
very significant.
01:08:14.800 --> 01:08:17.470
I really need the
quadratic term.
01:08:17.470 --> 01:08:20.500
So I might fit locally
a quadratic model just
01:08:20.500 --> 01:08:24.069
near the optimum which allows
me in a restricted space
01:08:24.069 --> 01:08:27.160
to get an accurate model
that really lets me zero in
01:08:27.160 --> 01:08:31.180
on the optimum point.
01:08:31.180 --> 01:08:33.490
So out here, a linear
model might be good
01:08:33.490 --> 01:08:34.870
enough up in here.
01:08:34.870 --> 01:08:37.840
I may need a beta
0 plus a beta 2 x
01:08:37.840 --> 01:08:42.790
squared term, maybe still
also with a linear term
01:08:42.790 --> 01:08:44.510
here as well.
01:08:44.510 --> 01:08:46.510
But I can basically
build dynamically
01:08:46.510 --> 01:08:50.189
the model getting an accurate
model near the optimum point.
01:08:53.189 --> 01:08:55.689
Now, I showed you this
in 1D, the one-input case,
01:08:55.689 --> 01:08:58.439
but you can also do
this with two inputs,
01:08:58.439 --> 01:09:02.760
where I've got a 3D model if
this is an x1, this is an x2,
01:09:02.760 --> 01:09:04.890
and this is a y.
01:09:04.890 --> 01:09:07.020
But you can essentially
think the same thing.
01:09:07.020 --> 01:09:13.770
If I start out here in this
space, locally it's linear.
01:09:13.770 --> 01:09:17.490
I can use that to
suggest the next step
01:09:17.490 --> 01:09:22.790
to take using a simplified
linear model in this region.
01:09:22.790 --> 01:09:29.390
And then as I hill climb up,
as I get close to the optimum,
01:09:29.390 --> 01:09:33.729
then again now near
the optimum, I need--
01:09:33.729 --> 01:09:37.189
in my x1 and x2, I may
need a quadratic model
01:09:37.189 --> 01:09:38.990
in those two coefficients.
01:09:38.990 --> 01:09:42.560
But I can extend the same
idea to hill climbing
01:09:42.560 --> 01:09:45.800
not only in one input, but
two inputs, three inputs,
01:09:45.800 --> 01:09:51.220
multiple inputs in order
to get to an optimum point.
01:09:51.220 --> 01:09:52.720
So essentially what
we're doing here
01:09:52.720 --> 01:09:55.120
is, again, linear
gradient modeling,
01:09:55.120 --> 01:09:59.590
where it is often useful to still
include an interaction term.
01:09:59.590 --> 01:10:02.330
But essentially we're doing
exactly that same thing.
01:10:02.330 --> 01:10:05.840
And if my model
itself is linear,
01:10:05.840 --> 01:10:07.145
an interesting thing happens.
01:10:10.780 --> 01:10:12.400
Where is my overall optimum?
01:10:12.400 --> 01:10:17.270
If I'm trying to
get to maximized y,
01:10:17.270 --> 01:10:19.900
where's my maximum
y going to occur?
01:10:19.900 --> 01:10:23.390
It will always occur on a
boundary when I hit a limit
01:10:23.390 --> 01:10:27.080
of my input x's.
01:10:27.080 --> 01:10:30.880
So an important thing that
I haven't talked much about
01:10:30.880 --> 01:10:34.560
is also the notion of
additional constraints.
01:10:34.560 --> 01:10:39.460
We may be driving to an interior
point like in this model,
01:10:39.460 --> 01:10:41.460
but it's also
possible that we may
01:10:41.460 --> 01:10:46.410
be driving to either a corner
point or some other boundary
01:10:46.410 --> 01:10:52.020
point because of a constraint
on my allowable ranges
01:10:52.020 --> 01:10:53.040
for my x inputs.
01:10:56.040 --> 01:10:58.110
There is another
piece of terminology
01:10:58.110 --> 01:11:02.490
that's sometimes used for
these kinds of searches,
01:11:02.490 --> 01:11:04.800
either steepest ascent
or steepest descent,
01:11:04.800 --> 01:11:07.620
whether you're climbing or
looking for a local minimum.
01:11:07.620 --> 01:11:10.680
And the basic point is when
I've got that simplified
01:11:10.680 --> 01:11:15.470
linear model perhaps with
the linear interaction term
01:11:15.470 --> 01:11:19.340
as well, you can think about
the local gradient with respect
01:11:19.340 --> 01:11:24.650
to x1 or the local gradient
with respect to x2.
01:11:24.650 --> 01:11:29.060
And now when you make your
step, what you often want to do
01:11:29.060 --> 01:11:34.460
is make the step in the overall
steepest descent direction,
01:11:34.460 --> 01:11:39.270
changing both your x1 and x2
parameter at the same time.
01:11:39.270 --> 01:11:44.830
So this is simply
showing when I move
01:11:44.830 --> 01:11:49.420
and hill climb, I may change
x1 and x2 proportionally
01:11:49.420 --> 01:11:53.320
depending on the relative slope
in those two coefficients.
01:11:53.320 --> 01:11:54.970
And it's relatively
easy once I've
01:11:54.970 --> 01:11:57.880
got that model to
decide what direction is
01:11:57.880 --> 01:12:00.880
the overall steepest descent.
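(A sketch of that steepest-ascent move: estimate the local slopes around the current point, then step both inputs in proportion to them. The process function, probe spacing, and step size are assumptions for illustration.)

```python
import numpy as np

def local_slopes(process, x0, d=0.1):
    # Finite-difference probes standing in for a small local experiment
    b1 = (process(x0 + [d, 0]) - process(x0 - [d, 0])) / (2 * d)
    b2 = (process(x0 + [0, d]) - process(x0 - [0, d])) / (2 * d)
    return np.array([b1, b2])

f = lambda x: -(x[0] - 1.0) ** 2 - (x[1] - 2.0) ** 2  # hill with peak at (1, 2)
x = np.array([0.0, 0.0])
for _ in range(30):
    g = local_slopes(f, x)   # the local linear model's slopes
    x = x + 0.25 * g         # change x1 and x2 proportionally to the slopes
print(x)                     # converges toward the maximum near (1, 2)
```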
01:12:00.880 --> 01:12:04.820
Another point here is
that with quadratic terms,
01:12:04.820 --> 01:12:09.230
you can have complicated
functions where your minima may
01:12:09.230 --> 01:12:13.250
occur in the interior of
the space or your maxima
01:12:13.250 --> 01:12:15.020
in the interior of the space.
01:12:15.020 --> 01:12:20.450
But you can also have
hyperbolic or inverse polynomial
01:12:20.450 --> 01:12:25.040
kinds of relationships
where, again, you
01:12:25.040 --> 01:12:28.970
may have local minima or maxima
with respect to one variable
01:12:28.970 --> 01:12:31.790
depending on what you're
doing with the other variable.
01:12:31.790 --> 01:12:34.820
Or you may also have places
where you end up with a maxima
01:12:34.820 --> 01:12:36.930
again at your constraint points.
01:12:36.930 --> 01:12:42.660
So in your search, you've
got to account for both.
01:12:42.660 --> 01:12:46.700
So I can summarize
what we've done here
01:12:46.700 --> 01:12:50.030
with a combined procedure
for design of experiments
01:12:50.030 --> 01:12:53.330
and optimization in either
the iterative fashion
01:12:53.330 --> 01:13:02.450
or at the end, I'll allude to
an evolutionary or incremental
01:13:02.450 --> 01:13:04.050
kind of version.
01:13:04.050 --> 01:13:08.540
So this is a summary of the last
two or three lectures boiled
01:13:08.540 --> 01:13:13.100
down into a reminder-- a
summary of the basic process
01:13:13.100 --> 01:13:16.610
or procedure for doing
DOE and optimization.
01:13:16.610 --> 01:13:18.770
We said originally
our goal here is
01:13:18.770 --> 01:13:22.070
to build a model, to do
a design of experiments.
01:13:22.070 --> 01:13:23.870
I do want to
emphasize that depends
01:13:23.870 --> 01:13:26.840
on some knowledge of the
process, a little bit
01:13:26.840 --> 01:13:28.790
of knowledge either
experience based
01:13:28.790 --> 01:13:30.980
or in the physics
of the process.
01:13:30.980 --> 01:13:33.380
Because you need that
in order to do things like
01:13:33.380 --> 01:13:39.280
decide what the important
inputs are likely to be.
01:13:39.280 --> 01:13:43.030
Now there are things you can
do with the DOE to confirm that
01:13:43.030 --> 01:13:46.630
or to expand your knowledge,
like factor screening
01:13:46.630 --> 01:13:47.420
experiments.
01:13:47.420 --> 01:13:49.600
We talked about
fractional factorial
01:13:49.600 --> 01:13:52.390
with large numbers of
coefficients where you're just
01:13:52.390 --> 01:13:56.080
trying to decide is there
a main effect associated
01:13:56.080 --> 01:13:57.550
with that factor.
01:13:57.550 --> 01:14:01.700
But up front, defining the
inputs is very important.
01:14:01.700 --> 01:14:05.890
We also need to define
limits on the inputs.
01:14:05.890 --> 01:14:08.800
What space do we want
to explore and build
01:14:08.800 --> 01:14:11.380
a model over in our
design of experiments?
01:14:13.960 --> 01:14:17.350
So overall, we're going to
need to first build our--
01:14:17.350 --> 01:14:19.210
decide on a DOE.
01:14:19.210 --> 01:14:21.040
We'd go and run our experiments.
01:14:21.040 --> 01:14:23.950
And then we're going to
construct our response surface
01:14:23.950 --> 01:14:24.610
model.
01:14:24.610 --> 01:14:26.980
And if we're using it
for the optimization,
01:14:26.980 --> 01:14:28.630
I also want to make
the point that you
01:14:28.630 --> 01:14:32.470
need to think early on about
what your overall optimization
01:14:32.470 --> 01:14:36.130
or penalty function is
because that may strongly
01:14:36.130 --> 01:14:42.130
influence your DOE and maybe
even your factor selection.
01:14:42.130 --> 01:14:45.760
So for example, if you
believe that you're really
01:14:45.760 --> 01:14:53.650
going to need an optimization
that folds in things like noise
01:14:53.650 --> 01:14:57.730
in addition to just
trying to get to a target,
01:14:57.730 --> 01:15:02.500
that can have a profound effect
on the DOE that you explore.
01:15:02.500 --> 01:15:05.420
And we'll talk about
that on Thursday,
01:15:05.420 --> 01:15:07.930
where you might do
additional small experiments
01:15:07.930 --> 01:15:10.600
at each point in
the DOE in order
01:15:10.600 --> 01:15:15.040
to build a sensitivity
model of that delta y
01:15:15.040 --> 01:15:18.790
as a function of some
additional noise factors.
01:15:18.790 --> 01:15:22.210
So depending on
what it is you're
01:15:22.210 --> 01:15:26.350
trying to achieve with your
model, that can of course,
01:15:26.350 --> 01:15:29.920
I guess it's obvious,
that can affect
01:15:29.920 --> 01:15:32.920
the structure of your model
and the design of experiments
01:15:32.920 --> 01:15:34.780
that you want to do.
01:15:34.780 --> 01:15:36.880
So we've already talked
about a lot of this.
01:15:36.880 --> 01:15:39.730
Again in summary,
your DOE includes
01:15:39.730 --> 01:15:43.600
decisions about what
likely terms you think
01:15:43.600 --> 01:15:46.090
might be in there based on
your knowledge of the physics.
01:15:46.090 --> 01:15:48.040
Is it going to be mostly linear?
01:15:48.040 --> 01:15:50.680
Might there be quadratic terms?
01:15:50.680 --> 01:15:53.110
That can influence
again the selection
01:15:53.110 --> 01:15:55.810
of the high, low, and center points.
01:15:55.810 --> 01:15:58.690
Do you need center points,
do you need three levels
01:15:58.690 --> 01:16:01.610
for all factors, and so on.
01:16:01.610 --> 01:16:04.480
And you also need to think about
things like the noise factors.
01:16:04.480 --> 01:16:08.020
We talked about these
nuisance factors, if you will,
01:16:08.020 --> 01:16:09.650
or additional noise factors.
01:16:09.650 --> 01:16:13.420
So that you might randomize
or block against those.
01:16:13.420 --> 01:16:16.120
If they're not going to be
explicitly in the model,
01:16:16.120 --> 01:16:19.000
you don't want them
aliasing with or confounding
01:16:19.000 --> 01:16:22.800
with the terms you actually had.
01:16:22.800 --> 01:16:25.860
The response surface modeling
is actually a pretty easy piece,
01:16:25.860 --> 01:16:29.730
especially if you use
things like the regression
01:16:29.730 --> 01:16:31.950
and the ANOVA approach.
01:16:31.950 --> 01:16:35.760
Again, you can use
contrast, if you've
01:16:35.760 --> 01:16:38.010
got a highly structured
design and experiment
01:16:38.010 --> 01:16:41.030
for very rapid estimation
of those terms.
01:16:41.030 --> 01:16:43.590
But overall, the
emphasis here is
01:16:43.590 --> 01:16:49.260
you're trying to determine if
there's significant variation
01:16:49.260 --> 01:16:52.530
in your data, are individual
terms significant,
01:16:52.530 --> 01:16:53.950
are you missing terms.
01:16:53.950 --> 01:16:57.180
So that lack of fit is
extremely important.
01:16:57.180 --> 01:17:01.590
And there's often a very
interesting interplay
01:17:01.590 --> 01:17:05.010
with the regression modeling.
01:17:05.010 --> 01:17:08.640
In fact, an approach we
haven't talked about much,
01:17:08.640 --> 01:17:11.280
but it's essentially
inherent in what
01:17:11.280 --> 01:17:16.680
we've been talking about
here is also referred to as--
01:17:16.680 --> 01:17:20.910
I think it's-- not
piece-wise, step-wise,
01:17:20.910 --> 01:17:22.680
step-wise regression.
01:17:22.680 --> 01:17:25.920
And some of the
interactive tools like JMP
01:17:25.920 --> 01:17:28.440
actually explicitly
support this,
01:17:28.440 --> 01:17:32.500
where one factor at a
time, you look and say,
01:17:32.500 --> 01:17:34.590
I would like to
add a term or drop
01:17:34.590 --> 01:17:38.340
a term based on cut off decision
points, on significance,
01:17:38.340 --> 01:17:39.260
and so on.
01:17:39.260 --> 01:17:43.500
So you can build up an
appropriate regression model
01:17:43.500 --> 01:17:48.270
by dropping or adding
terms as needed.
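(A rough sketch of forward stepwise selection by p-value, assuming statsmodels; the 0.05 cutoff and the dict-of-columns interface are illustrative assumptions, and tools like JMP implement more refined versions.)

```python
import numpy as np
import statsmodels.api as sm

def forward_stepwise(y, candidates, alpha=0.05):
    """candidates: dict mapping term name -> column of values (made-up interface)."""
    chosen = {}
    improved = True
    while improved:
        improved = False
        for name, col in candidates.items():
            if name in chosen:
                continue
            X = sm.add_constant(np.column_stack(list(chosen.values()) + [col]))
            if sm.OLS(y, X).fit().pvalues[-1] < alpha:  # is the new term significant?
                chosen[name] = col
                improved = True
                break
    return chosen  # the terms that earned their way into the model
```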
01:17:48.270 --> 01:17:52.380
And we talked about this at a
fairly high level
01:17:52.380 --> 01:17:55.230
about the optimization
procedure and again, just
01:17:55.230 --> 01:17:58.500
ideas of defining
your penalty function
01:17:58.500 --> 01:18:01.380
and then searching
for your optimization
01:18:01.380 --> 01:18:04.270
either piece-wise
or analytically.
01:18:04.270 --> 01:18:06.470
I'll come back to
this in just a second.
01:18:06.470 --> 01:18:08.530
But I do want to
emphasize that once you've
01:18:08.530 --> 01:18:18.940
come to some expected
optimum point,
01:18:18.940 --> 01:18:24.370
you really should check
that and confirm that often
01:18:24.370 --> 01:18:30.100
because you're building
your estimate of your model
01:18:30.100 --> 01:18:32.270
based on relatively
limited data,
01:18:32.270 --> 01:18:33.910
especially in the
factorial models
01:18:33.910 --> 01:18:37.690
perhaps with only one interior
point or center point based
01:18:37.690 --> 01:18:41.050
on mostly extremal data.
01:18:41.050 --> 01:18:43.600
And especially if you've
driven your optimum
01:18:43.600 --> 01:18:49.940
to some interior point
using say the analytic model
01:18:49.940 --> 01:18:53.240
of the response surface
model rather than iteratively
01:18:53.240 --> 01:18:57.530
or incrementally, you're
making a lot of big assumptions
01:18:57.530 --> 01:19:01.430
about the shape of the model
right near your optimum,
01:19:01.430 --> 01:19:04.790
like it's convex right
at that optimum point.
01:19:04.790 --> 01:19:09.230
So you really ought to go in
and do a confirming experiment
01:19:09.230 --> 01:19:13.960
right at or right
near your optimum
01:19:13.960 --> 01:19:16.480
in order to really
test the model
01:19:16.480 --> 01:19:21.740
and consider model error
right at that point.
01:19:21.740 --> 01:19:25.220
And that might actually drive
you to improving the model
01:19:25.220 --> 01:19:31.080
or exploring slightly different
space right near that optimum.
01:19:31.080 --> 01:19:32.810
Now the one last
thing I just want
01:19:32.810 --> 01:19:37.340
to allude to is an
alternative approach
01:19:37.340 --> 01:19:43.770
here is often starting with
some data point in a small space
01:19:43.770 --> 01:19:47.760
and building your model
iteratively or adaptively.
01:19:47.760 --> 01:19:50.230
And next week, at
the end of next week,
01:19:50.230 --> 01:19:52.680
we'll have a guest
lecturer, Dan Frey,
01:19:52.680 --> 01:19:56.520
who has actually studied
one factor at a time
01:19:56.520 --> 01:20:00.870
incremental exploration
and model building
01:20:00.870 --> 01:20:04.260
for the purpose of
optimization a great deal.
01:20:04.260 --> 01:20:07.920
So he's going to lead us
through an alternative approach
01:20:07.920 --> 01:20:11.280
of actually doing
full factorial models
01:20:11.280 --> 01:20:15.270
but trying to find the optimum
by not defining up front
01:20:15.270 --> 01:20:19.150
the whole DOE and
running the whole thing,
01:20:19.150 --> 01:20:23.790
but rather just walking
around your multifactor space
01:20:23.790 --> 01:20:27.600
in order to try to
find the optimum point.
01:20:27.600 --> 01:20:33.990
And that has some relationship
to another approach that
01:20:33.990 --> 01:20:38.820
is also in May and Spanos in
chapter 8.5 which I've just
01:20:38.820 --> 01:20:42.000
mentioned to you but don't
expect that you actually
01:20:42.000 --> 01:20:45.810
have to know a lot about, which
is evolutionary optimization.
01:20:45.810 --> 01:20:49.830
Which would say build a
local model, use that again
01:20:49.830 --> 01:20:52.740
in a hill climbing fashion
to suggest where you
01:20:52.740 --> 01:20:56.170
want to go for your next point.
01:20:56.170 --> 01:20:59.130
Maybe in fact you simply
pick one of those corners.
01:20:59.130 --> 01:21:01.800
And then you build a
DOE model around that.
01:21:01.800 --> 01:21:04.920
And it might suggest
you move your process
01:21:04.920 --> 01:21:08.370
to another corner, in which
case you build another model
01:21:08.370 --> 01:21:13.710
and so on, so that you
can walk or evolutionarily
01:21:13.710 --> 01:21:17.490
arrive at an optimum
point in your process,
01:21:17.490 --> 01:21:21.200
building local
models along the way.
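(A hedged sketch of that evolutionary walk: evaluate a tiny 2x2 design around the current point and move to the best corner. The process function and box size are invented for illustration.)

```python
import numpy as np

def evop_walk(process, x0, half_width=0.5, max_moves=20):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_moves):
        # Corners of a small 2^2 factorial centered on the current point
        corners = [x + half_width * np.array([s1, s2])
                   for s1 in (-1, 1) for s2 in (-1, 1)]
        best = max(corners, key=process)
        if process(best) <= process(x):   # no corner improves: stop walking
            break
        x = best                          # move the process to that corner
    return x

f = lambda x: -(x[0] - 2.0) ** 2 - (x[1] - 1.0) ** 2
print(evop_walk(f, [0.0, 0.0]))  # walks to the optimum near (2, 1)
```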
01:21:21.200 --> 01:21:24.650
OK so next time, the
one additional topic
01:21:24.650 --> 01:21:28.340
I want to mention in this space
of optimization and process
01:21:28.340 --> 01:21:32.370
optimization and DOE is
this notion of robustness.
01:21:32.370 --> 01:21:34.640
I'll allude to actually
building models
01:21:34.640 --> 01:21:37.640
that include the
variance in them
01:21:37.640 --> 01:21:40.440
and not just the overall output.
01:21:40.440 --> 01:21:45.410
So we'll come back to that
on Thursday and enjoy.
01:21:45.410 --> 01:21:47.750
In the meantime, I think
you've got the problem that
01:21:47.750 --> 01:21:48.833
is due on Thursday.
01:21:48.833 --> 01:21:50.750
And it's going to let
you explore a little bit
01:21:50.750 --> 01:21:53.840
more some of these DOE and
response surface model kinds
01:21:53.840 --> 01:21:54.800
of things.
01:21:54.800 --> 01:21:56.500
So we'll see you on Thursday.