WEBVTT
00:00:00.000 --> 00:00:02.430
The following content is
provided under a Creative
00:00:02.430 --> 00:00:03.730
Commons license.
00:00:03.730 --> 00:00:06.030
Your support will help
MIT OpenCourseWare
00:00:06.030 --> 00:00:10.060
continue to offer high quality
educational resources for free.
00:00:10.060 --> 00:00:12.690
To make a donation or to
view additional materials
00:00:12.690 --> 00:00:16.560
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.560 --> 00:00:17.904
at ocw.mit.edu.
00:00:21.640 --> 00:00:24.880
PROFESSOR: Last time, we started
looking in more detail at some
00:00:24.880 --> 00:00:26.990
of the statistical basics.
00:00:26.990 --> 00:00:29.650
These are the basis for a lot
of the tools and techniques
00:00:29.650 --> 00:00:33.170
that we're going to be learning
about throughout the term,
00:00:33.170 --> 00:00:36.520
especially things like
statistical process control,
00:00:36.520 --> 00:00:39.160
statistical design
of experiments,
00:00:39.160 --> 00:00:43.630
robust optimization,
yield modeling, and so on.
00:00:43.630 --> 00:00:48.910
And so we're going to pick up
more or less where we left off.
00:00:48.910 --> 00:00:52.360
We talked a bit about
the normal distribution.
00:00:52.360 --> 00:00:55.660
And what I want to do is talk
a little bit more about a few
00:00:55.660 --> 00:00:59.530
of the assumptions and why
it's so common that we use it
00:00:59.530 --> 00:01:02.290
for describing some
of the kinds of data
00:01:02.290 --> 00:01:04.069
that we looked at last time.
00:01:04.069 --> 00:01:07.750
So we went through a
fairly substantial number
00:01:07.750 --> 00:01:16.050
of different examples and saw
variation in time, variation
00:01:16.050 --> 00:01:19.710
across different
parameter sets, and so on.
00:01:19.710 --> 00:01:23.160
Just to remind us, here's
the-- the standard normal
00:01:23.160 --> 00:01:27.720
is just a mean centered.
00:01:27.720 --> 00:01:31.170
So if we have x as our data,
and we subtract off the mean,
00:01:31.170 --> 00:01:34.080
and then normalize to
the standard deviation,
00:01:34.080 --> 00:01:36.510
we get a unit normal variable.
00:01:36.510 --> 00:01:39.690
It's another random
variable z that
00:01:39.690 --> 00:01:44.130
has a distribution that is
marked out in terms of numbers
00:01:44.130 --> 00:01:46.150
of standard deviation.
00:01:46.150 --> 00:01:49.470
And so this is our
normal distribution.
00:01:49.470 --> 00:01:52.320
Some nice properties that
we mentioned last time
00:01:52.320 --> 00:01:55.630
are that it only
has two parameters.
00:01:55.630 --> 00:02:00.210
So that completely describes the
normal distribution, the mean,
00:02:00.210 --> 00:02:03.270
and the variance or
standard deviation.
00:02:03.270 --> 00:02:07.620
Other properties are it's
symmetric about the mean.
00:02:07.620 --> 00:02:11.460
We actually will use that
property quite a bit in terms
00:02:11.460 --> 00:02:15.710
of manipulating
some of the table
00:02:15.710 --> 00:02:19.640
values that one would
look up for the proportion
00:02:19.640 --> 00:02:22.800
of the distribution that's
out in either of the tail.
00:02:22.800 --> 00:02:26.200
So it's perhaps obvious,
but we actually do use that.
00:02:26.200 --> 00:02:28.960
We'll come back to that
a little bit later.
00:02:28.960 --> 00:02:31.430
But what I wanted to
start with a little bit
00:02:31.430 --> 00:02:35.770
is talking a little bit
more about this assumption,
00:02:35.770 --> 00:02:40.960
if you dive into it, that we
are using a normal distribution
00:02:40.960 --> 00:02:42.160
very often.
00:02:42.160 --> 00:02:44.500
And the questions are, why?
00:02:47.650 --> 00:02:54.070
And how good of an approximation
is that in most cases?
00:02:54.070 --> 00:02:56.100
When can we use it?
00:02:56.100 --> 00:02:58.530
When might we be
motivated to use it?
00:02:58.530 --> 00:03:01.050
And what we did last time
is we did a couple of things
00:03:01.050 --> 00:03:03.110
where we looked at
some of the data.
00:03:03.110 --> 00:03:05.430
In particular, we
did histogram--
00:03:05.430 --> 00:03:10.320
binning kinds of
plots of variations.
00:03:10.320 --> 00:03:16.590
And that would often motivate,
based on a general shape,
00:03:16.590 --> 00:03:22.930
that normal distribution
looked appropriate.
00:03:22.930 --> 00:03:27.610
One can, I guess, do a
curve fit to the histogram.
00:03:30.840 --> 00:03:33.720
Would you ever try to do that?
00:03:33.720 --> 00:03:35.400
So imagine that you
actually had, say,
00:03:35.400 --> 00:03:40.470
the tops of the bins
for the distribution.
00:03:44.100 --> 00:03:48.140
So maybe I had bins like
this, where sometimes I
00:03:48.140 --> 00:03:52.075
had these as values--
00:03:56.980 --> 00:03:59.386
something like this.
00:03:59.386 --> 00:04:04.120
Now, would you actually try
to do a normal distribution
00:04:04.120 --> 00:04:06.490
curve fit to that?
00:04:06.490 --> 00:04:08.350
In other words,
if you said, what
00:04:08.350 --> 00:04:11.630
I'm going to try
to do is minimize
00:04:11.630 --> 00:04:17.465
the errors between these points
and the normal distribution,
00:04:17.465 --> 00:04:19.839
does that seem like a
reasonable thing to do?
00:04:19.839 --> 00:04:24.390
AUDIENCE: It's all driven
by the size of your tails.
00:04:24.390 --> 00:04:26.580
PROFESSOR: Yeah, there are
some gotchas, certainly,
00:04:26.580 --> 00:04:28.320
with any histogram.
00:04:28.320 --> 00:04:31.950
The point was that the
shape of this distribution--
00:04:31.950 --> 00:04:34.050
if you've ever played
around, especially
00:04:34.050 --> 00:04:37.170
with interactive tools, where
you can bin and plot out
00:04:37.170 --> 00:04:42.220
distributions, if you were to
change the size of your bins,
00:04:42.220 --> 00:04:44.650
you have this
disturbing effect where
00:04:44.650 --> 00:04:48.640
the shape of your distribution
sometimes changes a little bit
00:04:48.640 --> 00:04:49.670
out from under you.
00:04:49.670 --> 00:04:51.610
So if you change the
bins, you may well
00:04:51.610 --> 00:04:55.210
end up with something that--
00:04:55.210 --> 00:04:57.760
all of a sudden, this one
was low, and now it's high.
00:04:57.760 --> 00:05:00.250
And the next one is
a little bit low.
00:05:00.250 --> 00:05:05.570
And this one's up here if your
bins are a little bit wider.
00:05:05.570 --> 00:05:07.580
So that might be a concern,
but that's actually
00:05:07.580 --> 00:05:11.270
not the point that I'm after.
00:05:11.270 --> 00:05:16.640
Would you curve fit to this
distribution to fit a normal
00:05:16.640 --> 00:05:17.770
to your data?
00:05:20.554 --> 00:05:23.240
AUDIENCE: Well, you said
that normal distribution
00:05:23.240 --> 00:05:25.790
is described by a standard
deviation of the mean.
00:05:25.790 --> 00:05:30.140
So you might as well just take
the mean and standard deviation
00:05:30.140 --> 00:05:30.900
of your data.
00:05:30.900 --> 00:05:32.062
PROFESSOR: Beautiful.
00:05:32.062 --> 00:05:33.020
AUDIENCE: And use that.
00:05:33.020 --> 00:05:33.728
PROFESSOR: Right.
00:05:33.728 --> 00:05:36.350
Right, especially-- I
guess the only circumstance
00:05:36.350 --> 00:05:38.600
I can imagine where it might
make sense to curve fit
00:05:38.600 --> 00:05:41.340
is you didn't have the raw data.
00:05:41.340 --> 00:05:43.800
You only had the bins.
00:05:43.800 --> 00:05:45.240
That's kind of strange.
00:05:45.240 --> 00:05:47.080
I think in most cases,
you would, in fact,
00:05:47.080 --> 00:05:47.910
have the raw data.
00:05:47.910 --> 00:05:52.440
And then you simply
calculate the mean
00:05:52.440 --> 00:05:53.920
and the standard deviation.
00:05:53.920 --> 00:05:57.720
Now one thing we want to do and
we'll get to a little bit today
00:05:57.720 --> 00:06:00.580
is why that's a
reasonable thing to do--
00:06:00.580 --> 00:06:02.762
to actually go in and
calculate the mean
00:06:02.762 --> 00:06:03.720
and standard deviation.
00:06:03.720 --> 00:06:11.130
Why is that a good estimator
for the true mean and underlying
00:06:11.130 --> 00:06:13.740
parameters for this
distribution-- the true meaning
00:06:13.740 --> 00:06:16.460
of the true variance?
00:06:16.460 --> 00:06:20.780
There's other things you can
do certainly to check and see.
00:06:20.780 --> 00:06:22.880
If you had your data,
and you calculated
00:06:22.880 --> 00:06:25.340
the mean and standard
deviation, then you
00:06:25.340 --> 00:06:31.960
can plot perhaps your Gaussian
on top of that distribution.
00:06:31.960 --> 00:06:33.940
And that, I think,
is a reasonable thing
00:06:33.940 --> 00:06:38.590
to do as a quick check visually
to see how well it seems
00:06:38.590 --> 00:06:40.630
to map, as well as a
quick check that you
00:06:40.630 --> 00:06:45.550
had reasonable calculations,
not something strange go wrong
00:06:45.550 --> 00:06:48.220
just in your
numerical calculation
00:06:48.220 --> 00:06:49.690
of those parameters.
00:06:49.690 --> 00:06:51.190
Now there's a couple
of other things
00:06:51.190 --> 00:06:56.230
that one can do to check
quickly visually the assumption.
00:06:56.230 --> 00:07:01.720
And then there's a couple of
very nice additional tools
00:07:01.720 --> 00:07:05.260
that I'll mention here for
either checking assumptions
00:07:05.260 --> 00:07:08.440
or visually--
00:07:08.440 --> 00:07:10.120
in a little bit more
sophisticated way,
00:07:10.120 --> 00:07:13.730
visually or numerically check
a couple of those assumptions.
00:07:13.730 --> 00:07:15.430
But one thing you
can certainly do
00:07:15.430 --> 00:07:20.470
is look at the
location of your data
00:07:20.470 --> 00:07:23.830
and just do a quick comparison
between the percentage of data
00:07:23.830 --> 00:07:28.120
that you would expect in
different bands of this data.
00:07:28.120 --> 00:07:33.310
And we'll do a little
bit more examples
00:07:33.310 --> 00:07:36.170
there so that we know what
percentage of the data
00:07:36.170 --> 00:07:40.930
we expect in the plus/minus
1 sigma band, for example,
00:07:40.930 --> 00:07:45.430
or what percentage of the data
we would typically expect out
00:07:45.430 --> 00:07:48.620
in the 3 sigma
tails of the data.
00:07:48.620 --> 00:07:51.593
And so you can do a quick
calculation and comparison
00:07:51.593 --> 00:07:54.010
of the percentage of data in
each of these different bands
00:07:54.010 --> 00:07:57.730
and see, is that matching
up to what we would expect
00:07:57.730 --> 00:08:00.550
from a normal distribution?
00:08:00.550 --> 00:08:03.250
This actually gets
very close to the idea
00:08:03.250 --> 00:08:07.570
of confidence
intervals that we'll
00:08:07.570 --> 00:08:09.670
formalize a little bit more.
00:08:09.670 --> 00:08:11.530
Now there's a couple
of additional things
00:08:11.530 --> 00:08:12.890
I've listed here.
00:08:12.890 --> 00:08:16.570
One is you can look
at the kurtosis
00:08:16.570 --> 00:08:21.220
or do a quick calculation
of kurtosis, which
00:08:21.220 --> 00:08:24.250
is a higher order
statistical moment
00:08:24.250 --> 00:08:26.950
than the mean or the variance.
00:08:26.950 --> 00:08:32.140
In fact, if you look at the
definition of the kurtosis,
00:08:32.140 --> 00:08:37.140
it's an expectation
of the fourth moment.
00:08:37.140 --> 00:08:40.169
Or it is a calculation--
a normalized version
00:08:40.169 --> 00:08:41.520
of the fourth moment.
00:08:41.520 --> 00:08:44.880
And for a perfectly
normal distribution,
00:08:44.880 --> 00:08:48.210
this kurtosis value would be 1.
00:08:48.210 --> 00:08:51.270
And then as the distribution
changes its shape,
00:08:51.270 --> 00:08:55.230
either gets more peaked
or less peaked following
00:08:55.230 --> 00:08:57.840
other distributions, other
common distributions,
00:08:57.840 --> 00:09:04.380
then it starts to deviate
substantially from k equals 1.
00:09:04.380 --> 00:09:06.870
And in fact, this is
a quick little tool
00:09:06.870 --> 00:09:11.670
to use sometimes if you're
not sure, well, number one,
00:09:11.670 --> 00:09:12.780
if it's normal.
00:09:12.780 --> 00:09:15.625
And number two, if it's not
normal, what distribution might
00:09:15.625 --> 00:09:16.125
it follow?
00:09:21.200 --> 00:09:24.600
If you look here,
this is a nice plot,
00:09:24.600 --> 00:09:27.170
although I didn't
break out what all
00:09:27.170 --> 00:09:32.490
of these different
distributions are.
00:09:32.490 --> 00:09:36.450
This is just a plot normalized
to standard deviation
00:09:36.450 --> 00:09:41.430
of the data of a set of
different distributions.
00:09:41.430 --> 00:09:44.430
And the black one here is n.
00:09:44.430 --> 00:09:48.210
So this is-- let me do
the black one here, right?
00:09:48.210 --> 00:09:49.620
This is the n distribution.
00:09:49.620 --> 00:09:55.090
That's our Gaussian
with a kurtosis.
00:09:55.090 --> 00:09:58.840
Well, I guess you got to
look a little bit carefully
00:09:58.840 --> 00:10:00.070
at the definition.
00:10:00.070 --> 00:10:05.390
Actually, I think if I go back
to the previous page, which
00:10:05.390 --> 00:10:10.070
is one that Dave had, this
definition for sample data
00:10:10.070 --> 00:10:15.980
is essentially, as n gets very
large, this subtracts off 3.
00:10:15.980 --> 00:10:18.480
So that I believe
then, in this case,
00:10:18.480 --> 00:10:20.990
this kurtosis for a
normal distribution
00:10:20.990 --> 00:10:23.210
is actually more like 0.
00:10:23.210 --> 00:10:27.500
These two definitions
you might look up.
00:10:27.500 --> 00:10:29.900
I'm not sure are
exactly the same.
00:10:29.900 --> 00:10:33.050
Rarely would you
actually use this one.
00:10:33.050 --> 00:10:38.450
You're going to actually
use this definition, which
00:10:38.450 --> 00:10:41.960
basically subtracts off a value.
00:10:41.960 --> 00:10:45.590
This goes with the
plot on the next page.
00:10:51.050 --> 00:10:53.300
So they are slightly different
definitions, I believe.
00:10:56.890 --> 00:11:00.580
So in that case, that's
subtracting off a 3.
00:11:00.580 --> 00:11:02.350
For the normal
distribution, it ends up
00:11:02.350 --> 00:11:05.110
with a value of about 0.
00:11:05.110 --> 00:11:07.450
Now what's nice
is as you get some
00:11:07.450 --> 00:11:10.810
of these distribution, such
as the Laplace distribution,
00:11:10.810 --> 00:11:17.000
this very peaked one right here,
the kurtosis value goes up.
00:11:17.000 --> 00:11:21.770
It's an indication of a
more peaked distribution.
00:11:21.770 --> 00:11:23.780
The logistic distribution,
which we might
00:11:23.780 --> 00:11:25.520
talk about a little bit later--
00:11:25.520 --> 00:11:30.590
it's one that comes up
occasionally with some quality
00:11:30.590 --> 00:11:32.870
or discrete kinds
of distributions--
00:11:32.870 --> 00:11:35.150
has a kurtosis of 1.2.
00:11:35.150 --> 00:11:37.280
And the interesting
one here also
00:11:37.280 --> 00:11:40.910
is the uniform
distribution, which is less
00:11:40.910 --> 00:11:42.950
sharply peaked than a Gaussian.
00:11:42.950 --> 00:11:45.890
And it actually has a negative
kurtosis with that subtraction
00:11:45.890 --> 00:11:49.410
of the 3 off it at the end.
00:11:49.410 --> 00:11:55.430
So you might find
that as a useful tool.
00:11:55.430 --> 00:11:59.330
I've rarely used kurtosis
actually as an indicator.
00:11:59.330 --> 00:12:03.800
But I want to mention
it to you because it
00:12:03.800 --> 00:12:07.880
is out there as at
least a hint at looking
00:12:07.880 --> 00:12:10.190
at some different distributions.
00:12:10.190 --> 00:12:11.540
A more useful tool--
00:12:11.540 --> 00:12:13.427
yeah, question?
00:12:13.427 --> 00:12:15.260
AUDIENCE: So there's
two different formulas?
00:12:15.260 --> 00:12:16.487
Because--
00:12:16.487 --> 00:12:17.195
PROFESSOR: Well--
00:12:17.195 --> 00:12:18.590
AUDIENCE: What you said or--
00:12:18.590 --> 00:12:20.450
PROFESSOR: Yeah, so
this is for sample data.
00:12:20.450 --> 00:12:24.570
And I think if you were
to actually go in--
00:12:24.570 --> 00:12:26.610
I mean, essentially this--
00:12:29.510 --> 00:12:31.265
I have not checked this.
00:12:31.265 --> 00:12:37.420
This was some definitions
from previous class notes.
00:12:37.420 --> 00:12:42.070
I do believe when I did a
quick lookup on what kurtosis
00:12:42.070 --> 00:12:48.160
is, I believe this is a
better definition in terms
00:12:48.160 --> 00:12:50.800
of actual calculation
formulas that you
00:12:50.800 --> 00:12:53.590
can use for calculating it.
00:12:53.590 --> 00:12:54.898
This is to give you the sense.
00:12:54.898 --> 00:12:56.440
I mean, it's sort
of lurking in here.
00:12:56.440 --> 00:13:01.370
You can see the expectation
operation down in here,
00:13:01.370 --> 00:13:04.630
and then the normalization
to the standard deviation.
00:13:04.630 --> 00:13:07.420
In this case, this has to
be your calculated standard
00:13:07.420 --> 00:13:09.490
deviation.
00:13:09.490 --> 00:13:12.430
This is the abstracted one.
00:13:17.910 --> 00:13:22.590
So if you actually poke around,
you will find in the literature
00:13:22.590 --> 00:13:26.300
more than one
definition of kurtosis.
00:13:26.300 --> 00:13:29.740
My point was that
this is what I would
00:13:29.740 --> 00:13:34.240
use if you want to use the
plot on the next page in terms
00:13:34.240 --> 00:13:36.940
of coming up with a number that
might also indicate if there's
00:13:36.940 --> 00:13:38.950
a different distribution
that you might look at.
00:13:41.800 --> 00:13:43.365
So it's related to
the fourth moment.
00:13:46.960 --> 00:13:48.250
A more useful tool--
00:13:48.250 --> 00:13:52.510
and this is one that
I actually do use--
00:13:52.510 --> 00:13:56.920
is probability or
quantile-quantile plots.
00:13:56.920 --> 00:14:03.880
And there's a section in
Montgomery on that, as well
00:14:03.880 --> 00:14:05.740
as different toolboxes.
00:14:05.740 --> 00:14:07.880
We'll be able to
generate these things.
00:14:07.880 --> 00:14:11.440
And so here's an example for
a quantile-quantile plot.
00:14:16.790 --> 00:14:20.360
What I've started doing on
the lecture notes on the web
00:14:20.360 --> 00:14:22.430
is put up an early
draft as early
00:14:22.430 --> 00:14:26.270
as I can for the next couple
of weeks of lecture notes.
00:14:26.270 --> 00:14:30.290
But then as I'm
editing and adding in,
00:14:30.290 --> 00:14:32.220
I'll have the most
up-to-date one.
00:14:32.220 --> 00:14:35.240
So if you've got slides, you may
be missing a couple of these.
00:14:35.240 --> 00:14:40.850
If you printed them out before
9:00 or 10:00 PM last night,
00:14:40.850 --> 00:14:43.940
I think these got
updated about that time.
00:14:43.940 --> 00:14:45.860
So this plot, for
example, was not
00:14:45.860 --> 00:14:47.990
in the early draft
of the slides.
00:14:47.990 --> 00:14:50.570
And I'll try to indicate
that with a little "draft"
00:14:50.570 --> 00:14:56.080
on the web page if they're still
early drafts of the slides.
00:14:56.080 --> 00:14:57.930
So what are
quantile-quantile plots?
00:14:57.930 --> 00:15:02.520
These are a little bit subtle
in terms of explaining.
00:15:02.520 --> 00:15:04.830
So let me try to give
it a shot at explaining.
00:15:04.830 --> 00:15:08.650
And then if you have
questions, let me know.
00:15:08.650 --> 00:15:14.130
And then normally, it's going to
be generated by your statistics
00:15:14.130 --> 00:15:14.830
package.
00:15:14.830 --> 00:15:16.800
There are hand ways
to do it, and I'll
00:15:16.800 --> 00:15:19.320
refer you to
Montgomery for practice
00:15:19.320 --> 00:15:23.010
with actually trying to generate
them if you had to by hand.
00:15:23.010 --> 00:15:24.480
But here's the basic idea.
00:15:24.480 --> 00:15:31.970
What we're plotting is the
actual data that you've got.
00:15:31.970 --> 00:15:37.340
And in the y-axis, you'll be
plotting your data in terms
00:15:37.340 --> 00:15:42.000
of normalized distribution.
00:15:42.000 --> 00:15:46.580
So you would normalize to the
mean or center to the mean
00:15:46.580 --> 00:15:49.230
and then scale it to
your standard deviation.
00:15:49.230 --> 00:15:54.200
So think of these as
unit standard deviations.
00:15:54.200 --> 00:15:59.480
So you simply find that as
your y location for your data.
00:16:02.240 --> 00:16:06.560
Then what you're plotting
on the x-axis is the normal
00:16:06.560 --> 00:16:07.970
theoretical--
00:16:07.970 --> 00:16:09.680
I'm not sure I'd
use quantiles here--
00:16:09.680 --> 00:16:13.730
but your normal theoretical
standard deviation
00:16:13.730 --> 00:16:17.210
for that number of data
points that you would have had
00:16:17.210 --> 00:16:20.370
and the location for each
of those data points.
00:16:20.370 --> 00:16:24.020
So imagine this
is 50 data points.
00:16:24.020 --> 00:16:26.810
I'm not sure exactly how
many data points this is.
00:16:26.810 --> 00:16:30.110
If you were to
take 50 data points
00:16:30.110 --> 00:16:35.330
and draw 50 data points
from a normal distribution
00:16:35.330 --> 00:16:39.590
or order them and put them
where you would expect
00:16:39.590 --> 00:16:43.580
on a normal distribution,
what you would have is many
00:16:43.580 --> 00:16:45.110
more data points near 0.
00:16:45.110 --> 00:16:51.320
And as you get further and
further out, 1 out of 50 times
00:16:51.320 --> 00:16:53.570
or 1 out of 25 times,
you would expect
00:16:53.570 --> 00:16:57.650
to find a data point
about whatever it is--
00:16:57.650 --> 00:17:02.190
2, 2.1 standard deviations away.
00:17:02.190 --> 00:17:08.180
In other words, if I were to
compare the actual location
00:17:08.180 --> 00:17:14.060
of that data point
in terms of its value
00:17:14.060 --> 00:17:16.760
within my sample
distribution of 50,
00:17:16.760 --> 00:17:22.069
compared to if I just drew
randomly 50 data points,
00:17:22.069 --> 00:17:25.109
that would be its location.
00:17:25.109 --> 00:17:30.060
Then what I can do is plot
that coordinate for that data.
00:17:30.060 --> 00:17:34.110
So what you end up with is
taking all of your data,
00:17:34.110 --> 00:17:34.640
if you will.
00:17:34.640 --> 00:17:37.520
You sort it from low to high.
00:17:37.520 --> 00:17:42.670
And then starting at the
center, in some sense,
00:17:42.670 --> 00:17:44.890
you start working
outward from the center,
00:17:44.890 --> 00:17:48.100
ordering the data
from the location
00:17:48.100 --> 00:17:54.280
of its index in your
sorted data from number
00:17:54.280 --> 00:17:57.910
of standard deviations away
from the mean that it would be,
00:17:57.910 --> 00:18:01.090
compared to how far
that data point actually
00:18:01.090 --> 00:18:04.210
was away from your sample mean.
00:18:04.210 --> 00:18:07.480
And what that gives
you, if it were perfect,
00:18:07.480 --> 00:18:13.270
and there was not any sort
of noise in your data,
00:18:13.270 --> 00:18:16.390
that would give you this
perfect matching line.
00:18:16.390 --> 00:18:21.418
Every data point falls where
you would expect it to.
00:18:21.418 --> 00:18:22.960
Now in your actual
data, you're going
00:18:22.960 --> 00:18:25.030
to see some
deviations from that.
00:18:25.030 --> 00:18:31.950
But what this is basically doing
is a compression of your data
00:18:31.950 --> 00:18:34.410
or an expansion of your
data out in the tails,
00:18:34.410 --> 00:18:37.470
but a compression of
your data near the center
00:18:37.470 --> 00:18:43.980
to be able to basically tell
you how closely your data here
00:18:43.980 --> 00:18:47.550
is following the
assumed distribution.
00:18:50.860 --> 00:18:56.930
And for this case here,
we plotted the location
00:18:56.930 --> 00:19:00.480
of 50 data points, assuming
it was a normal distribution.
00:19:00.480 --> 00:19:03.840
So that's where my x
values were coming from.
00:19:03.840 --> 00:19:05.330
And as you can
see here, the data
00:19:05.330 --> 00:19:08.870
pretty much nicely
follows this distribution.
00:19:08.870 --> 00:19:13.460
You get a few little things
that look like it's wandering
00:19:13.460 --> 00:19:16.350
or trailing off a little bit.
00:19:16.350 --> 00:19:19.080
And then you also often
look out here in the tails.
00:19:19.080 --> 00:19:23.790
And you find even out here for
over two standard deviations
00:19:23.790 --> 00:19:25.890
away, it looks like I've
got pretty good fidelity
00:19:25.890 --> 00:19:27.210
to those tails.
00:19:27.210 --> 00:19:30.780
I might have values that
are a little bit further
00:19:30.780 --> 00:19:33.690
away from the mean
than I might expect
00:19:33.690 --> 00:19:37.120
from a normal distribution,
but it's pretty close.
00:19:37.120 --> 00:19:38.850
So this is the kind
of plot that you
00:19:38.850 --> 00:19:42.270
would expect to see for
data that, in fact, followed
00:19:42.270 --> 00:19:45.260
a normal distribution.
00:19:45.260 --> 00:19:48.380
All right, so I know
that's confusing.
00:19:48.380 --> 00:19:51.680
Are there questions that
people have on what this--
00:19:51.680 --> 00:19:52.303
AUDIENCE: Yes.
00:19:52.303 --> 00:19:52.970
PROFESSOR: Yeah?
00:19:52.970 --> 00:19:55.350
AUDIENCE: I have a question.
00:19:55.350 --> 00:19:59.480
So for each point,
you get the y-axis
00:19:59.480 --> 00:20:03.632
by the sampling
value from your data.
00:20:03.632 --> 00:20:04.340
PROFESSOR: Right.
00:20:04.340 --> 00:20:06.000
AUDIENCE: And how do you get x?
00:20:06.000 --> 00:20:09.080
Do you get it based on
the probability of that,
00:20:09.080 --> 00:20:11.720
simply pulling from
your sample distribution
00:20:11.720 --> 00:20:15.920
that you referred to the
theoretical normal distribution
00:20:15.920 --> 00:20:18.120
with the same probability,
then you get the y--
00:20:18.120 --> 00:20:18.620
x-axis?
00:20:21.200 --> 00:20:22.640
PROFESSOR: Yes,
very, very close.
00:20:22.640 --> 00:20:26.180
So for the y-axis, you've
got it exactly right.
00:20:26.180 --> 00:20:28.700
For the x-axis,
what's interesting is
00:20:28.700 --> 00:20:32.300
you don't actually use
the values of your data.
00:20:32.300 --> 00:20:37.580
You just use its index location
in a sample of the size
00:20:37.580 --> 00:20:39.020
that you've got.
00:20:39.020 --> 00:20:43.700
In other words, if I
had a million points,
00:20:43.700 --> 00:20:45.590
I would look at the lowest.
00:20:45.590 --> 00:20:49.850
And I would expect that to be--
00:20:49.850 --> 00:20:52.550
in a normal
distribution, I would
00:20:52.550 --> 00:20:55.670
look at where the
probability, the number
00:20:55.670 --> 00:20:59.300
of standard deviations where
1 out of 500,000 points
00:20:59.300 --> 00:21:01.830
is that far away from the mean.
00:21:01.830 --> 00:21:05.210
So I would look up the
inverse probability
00:21:05.210 --> 00:21:09.110
on a normal
distribution of being--
00:21:09.110 --> 00:21:12.320
of where 1 in 5--
00:21:12.320 --> 00:21:14.240
500,000-- point-- what?
00:21:14.240 --> 00:21:18.050
0.02 to the whatever.
00:21:18.050 --> 00:21:24.080
So I basically look on a
tabulated normal probability
00:21:24.080 --> 00:21:29.750
plot, going backwards
from where that index--
00:21:29.750 --> 00:21:31.130
my smallest point was.
00:21:31.130 --> 00:21:33.140
And then I could do
that for every point
00:21:33.140 --> 00:21:36.500
in my sample to figure
out what the probability
00:21:36.500 --> 00:21:39.470
for its location should
be on the x-axis.
00:21:44.210 --> 00:21:46.240
So here's another example.
00:21:46.240 --> 00:21:49.570
Maybe this gives you a feel
because these q-q plots--
00:21:49.570 --> 00:21:52.000
the quantile-quantile
plots-- can actually
00:21:52.000 --> 00:21:54.290
be used with other
distributions as well.
00:21:54.290 --> 00:21:57.200
They are not always
q-q norm plots.
00:21:57.200 --> 00:21:59.950
They can be applied
to whatever assumed
00:21:59.950 --> 00:22:03.400
probability distribution you
might want to investigate.
00:22:07.200 --> 00:22:13.590
So here's an example where
we again took the data.
00:22:13.590 --> 00:22:17.330
But in this case, the
theoretical quantiles
00:22:17.330 --> 00:22:20.180
are actually lining up.
00:22:20.180 --> 00:22:22.760
I'm assuming a
normal distribution.
00:22:22.760 --> 00:22:25.640
But in this example
that I'm showing here,
00:22:25.640 --> 00:22:27.670
the data actually came from--
00:22:27.670 --> 00:22:28.406
let me erase.
00:22:28.406 --> 00:22:31.010
Let me get rid of all this.
00:22:31.010 --> 00:22:35.750
The data actually came from
an exponential distribution.
00:22:35.750 --> 00:22:39.050
So this is an example where
I would have assumed things
00:22:39.050 --> 00:22:44.270
were coming from a Gaussian.
00:22:44.270 --> 00:22:46.760
So this is still for
the normal quantiles.
00:22:46.760 --> 00:22:51.740
But with an exponential
and e to the minus
00:22:51.740 --> 00:22:56.210
x or an e to the x kind of
distribution, what you end up
00:22:56.210 --> 00:22:59.390
with are a lot of
data values that
00:22:59.390 --> 00:23:02.930
are much larger, much
further away from the mean,
00:23:02.930 --> 00:23:05.390
than you would expect
from a Gaussian.
00:23:05.390 --> 00:23:12.870
And you also get a bunch of
data that's much larger than you
00:23:12.870 --> 00:23:16.410
would expect from a Gaussian.
00:23:16.410 --> 00:23:20.430
So this would be an example
here, where the normal q-q
00:23:20.430 --> 00:23:22.560
plot doesn't seem to match up.
00:23:22.560 --> 00:23:24.810
It's telling me my
data really is not
00:23:24.810 --> 00:23:28.300
following along the
normal distribution line.
00:23:28.300 --> 00:23:29.860
Now, I didn't pull a plot.
00:23:29.860 --> 00:23:33.220
But one could then
ask the question--
00:23:33.220 --> 00:23:35.310
maybe you'd look up kurtosis.
00:23:35.310 --> 00:23:38.590
Or maybe you look back
at your data and say,
00:23:38.590 --> 00:23:41.680
I think maybe an exponential
distribution is really
00:23:41.680 --> 00:23:43.150
what this is following.
00:23:43.150 --> 00:23:46.308
How would you plot
that on a q-q norm?
00:23:46.308 --> 00:23:47.100
AUDIENCE: Question.
00:23:47.100 --> 00:23:47.580
PROFESSOR: Yeah?
00:23:47.580 --> 00:23:49.538
AUDIENCE: Why doesn't
the line go through 0, 0?
00:23:51.392 --> 00:23:52.850
PROFESSOR: This is
a good question.
00:23:52.850 --> 00:23:59.140
These don't appear to
be mean-centered to me.
00:23:59.140 --> 00:24:01.120
So there's something
weird on the plot.
00:24:01.120 --> 00:24:04.410
AUDIENCE: So the line should be
for a normal distribution, not
00:24:04.410 --> 00:24:04.910
fitting.
00:24:04.910 --> 00:24:06.243
PROFESSOR: Yeah, this does not--
00:24:06.243 --> 00:24:13.990
I think what has happened
here is these are not quite
00:24:13.990 --> 00:24:19.180
mean-centered and
normalized because--
00:24:19.180 --> 00:24:25.450
well, so in terms of 0,
0 following on the plot,
00:24:25.450 --> 00:24:28.080
that that's not happening.
00:24:28.080 --> 00:24:29.290
So I'm a little--
00:24:29.290 --> 00:24:32.815
I'm not sure exactly
what's going on there.
00:24:32.815 --> 00:24:34.690
AUDIENCE: We need the
closing function taking
00:24:34.690 --> 00:24:36.590
the mean of the data et cetera.
00:24:36.590 --> 00:24:40.187
It's a conceptual
normal on the data mean.
00:24:40.187 --> 00:24:41.270
PROFESSOR: Yes, it should.
00:24:41.270 --> 00:24:45.760
And that's what I'm saying,
is this plot I don't think is
00:24:45.760 --> 00:24:50.020
correctly mean-centered
because it should then--
00:24:50.020 --> 00:24:54.190
0, 0, by definition,
has to fall.
00:24:54.190 --> 00:24:55.752
AUDIENCE: Right.
00:24:55.752 --> 00:24:57.460
PROFESSOR: Oh, that's
what you're saying.
00:24:57.460 --> 00:25:00.043
AUDIENCE: No, I was saying you
could take the mean of the data
00:25:00.043 --> 00:25:02.620
you send to the normal
that you're plotting
00:25:02.620 --> 00:25:03.730
is aligned with that data.
00:25:06.449 --> 00:25:07.157
PROFESSOR: Right.
00:25:07.157 --> 00:25:13.090
But I'm saying here's my y
data, and my 0 mean is not--
00:25:13.090 --> 00:25:15.220
I don't have any negative--
00:25:15.220 --> 00:25:17.290
I don't have any data
lower than the mean.
00:25:17.290 --> 00:25:20.510
And therefore, that
doesn't make any sense.
00:25:20.510 --> 00:25:22.900
So this is not
mean-centered correctly.
00:25:27.292 --> 00:25:30.940
AUDIENCE: It looks to me
like the mean of the data
00:25:30.940 --> 00:25:33.811
does give us a slightly less
than 1 in the point data,
00:25:33.811 --> 00:25:36.246
so that coincides with the mean.
00:25:40.730 --> 00:25:43.520
PROFESSOR: But if I
mean-center and scale to 0,
00:25:43.520 --> 00:25:47.510
then the mean of my data have--
00:25:47.510 --> 00:25:51.020
by definition, that
ought to be at 0, right?
00:25:51.020 --> 00:25:51.920
AUDIENCE: Oh, I see.
00:25:51.920 --> 00:25:53.880
I don't think you're
shifting the data, though.
00:25:58.240 --> 00:26:01.560
PROFESSOR: When you mean-center,
yeah, you're shifting.
00:26:01.560 --> 00:26:03.270
AUDIENCE: Oh, I think
you're shifting,
00:26:03.270 --> 00:26:05.865
but I think conceptually,
you're not shifting the data.
00:26:05.865 --> 00:26:07.490
You're shifting the
normal, that you're
00:26:07.490 --> 00:26:10.087
saying my might
correspond to the data.
00:26:10.087 --> 00:26:10.670
PROFESSOR: No.
00:26:10.670 --> 00:26:15.860
In a normal-- in the standard
q-q norm plot, you mean-center.
00:26:15.860 --> 00:26:18.320
You actually take your
data, you mean-center it,
00:26:18.320 --> 00:26:21.410
you normalize it to the
calculated sample standard
00:26:21.410 --> 00:26:23.450
deviation and plot that.
00:26:23.450 --> 00:26:26.060
And that does not look like
quite what they've done here.
00:26:26.060 --> 00:26:28.280
I think these are
still normalized
00:26:28.280 --> 00:26:29.930
to standard
deviation, but I think
00:26:29.930 --> 00:26:32.055
it's not quite mean-centered.
00:26:32.055 --> 00:26:33.680
But in some sense
that doesn't actually
00:26:33.680 --> 00:26:38.450
matter in terms of the data
following along the line.
00:26:38.450 --> 00:26:39.980
It's still indicating.
00:26:39.980 --> 00:26:41.730
That would just be a shift.
00:26:41.730 --> 00:26:43.350
That would be a shift.
00:26:43.350 --> 00:26:46.880
AUDIENCE: You said the
data hasn't been normalized
00:26:46.880 --> 00:26:48.320
or hasn't been mean-centered.
00:26:48.320 --> 00:26:51.740
But if it's an
exponential distribution,
00:26:51.740 --> 00:26:54.770
can you still normalize a bit?
00:26:59.060 --> 00:27:04.190
PROFESSOR: In this first
use of such a plot,
00:27:04.190 --> 00:27:06.200
you would be testing
the question.
00:27:06.200 --> 00:27:09.380
Did your data-- you don't know
yet that it's exponential.
00:27:09.380 --> 00:27:11.070
You just have data,
and you're testing.
00:27:11.070 --> 00:27:14.280
Does it fall on the normal line?
00:27:14.280 --> 00:27:16.720
So you would still
follow that procedure.
00:27:16.720 --> 00:27:20.600
We'll look at an exponential
distribution in a minute.
00:27:20.600 --> 00:27:23.420
And of course, every
distribution has a mean.
00:27:23.420 --> 00:27:25.970
So you can always mean-center.
00:27:25.970 --> 00:27:29.870
Similarly, every
distribution has a variance
00:27:29.870 --> 00:27:31.680
that you can calculate.
00:27:31.680 --> 00:27:33.680
The neat thing about the
exponential is the mean
00:27:33.680 --> 00:27:36.050
and the variance are the same.
00:27:36.050 --> 00:27:37.670
But that's not entering in here.
00:27:37.670 --> 00:27:39.170
There's something else weird.
00:27:39.170 --> 00:27:44.570
So there's the risk of pulling
a plot off at 9:50 at night.
00:27:44.570 --> 00:27:46.510
I hadn't noticed that the--
00:27:46.510 --> 00:27:50.800
it doesn't look
correctly mean-centered.
00:27:50.800 --> 00:27:52.630
But the additional
point I wanted to make
00:27:52.630 --> 00:27:55.600
is I could actually
take this same data.
00:27:55.600 --> 00:27:59.830
I could produce a different
plot, not a normal q-q plot,
00:27:59.830 --> 00:28:03.620
but an exponential q-q plot.
00:28:03.620 --> 00:28:09.550
And if I were doing that, in
that case, what I would do
00:28:09.550 --> 00:28:18.530
is take my data, still plot
it hopefully mean-centered,
00:28:18.530 --> 00:28:20.990
and then number of
standard deviations away.
00:28:23.570 --> 00:28:29.860
But then along this axis, I
would calculate the location
00:28:29.860 --> 00:28:32.320
in numbers of
standard deviations
00:28:32.320 --> 00:28:36.580
based on the probability of
an exponential distribution,
00:28:36.580 --> 00:28:40.660
not based on the probability
of that index location
00:28:40.660 --> 00:28:42.800
in a normal distribution.
00:28:42.800 --> 00:28:46.150
So I would basically say,
for my 50 data points,
00:28:46.150 --> 00:28:52.060
I expect the 25th data
point larger than the mean
00:28:52.060 --> 00:28:56.800
to occur in that distribution.
00:28:56.800 --> 00:29:05.290
I have to go 2.1 normalized
standard deviations away
00:29:05.290 --> 00:29:07.850
in order to get to
that probability.
00:29:07.850 --> 00:29:11.200
So that it takes my
same y data, but then it
00:29:11.200 --> 00:29:18.330
plots along the line, where
if it really is exponential,
00:29:18.330 --> 00:29:25.490
my data should follow along
a 1 to 1 correspondence line.
00:29:25.490 --> 00:29:30.110
So you don't often see
the use of these q-q plots
00:29:30.110 --> 00:29:33.320
from the perspective of
different distributions,
00:29:33.320 --> 00:29:35.000
but you can use them.
00:29:35.000 --> 00:29:37.940
What you often will
see is really this.
00:29:37.940 --> 00:29:40.460
You'll see q-q norm plots.
00:29:40.460 --> 00:29:42.200
And they're lovely plots.
00:29:42.200 --> 00:29:44.000
It's a wonderful tool to do--
00:29:44.000 --> 00:29:46.790
use-- because you're actually
seeing all of your data.
00:29:49.620 --> 00:29:52.040
It's got all of
your actual data.
00:29:52.040 --> 00:29:54.310
It's showing you that
it corresponds roughly
00:29:54.310 --> 00:29:56.900
to a normal distribution.
00:29:56.900 --> 00:30:00.100
It's also giving you
very nice information
00:30:00.100 --> 00:30:07.930
about essentially your
variance or standard deviation.
00:30:07.930 --> 00:30:11.410
And there are variants of
these plots that you will often
00:30:11.410 --> 00:30:14.680
see in the literature,
especially the semiconductor
00:30:14.680 --> 00:30:20.020
literature, dealing with large
numbers of samples coming
00:30:20.020 --> 00:30:22.370
from different kinds
of measurement.
00:30:22.370 --> 00:30:25.750
So for example, if you want
to make contact resistance
00:30:25.750 --> 00:30:30.910
measurements for literally
thousands of contacts and very
00:30:30.910 --> 00:30:33.520
succinctly present
that data, you
00:30:33.520 --> 00:30:37.990
will see families
of q-q norm plots.
00:30:37.990 --> 00:30:41.140
So for example, maybe you
did a bunch of contacts
00:30:41.140 --> 00:30:42.520
at a particular size.
00:30:42.520 --> 00:30:44.170
You would plot them like this.
00:30:44.170 --> 00:30:46.060
And then maybe you
had another data set,
00:30:46.060 --> 00:30:51.460
where you had attempted to
pattern those contacts slightly
00:30:51.460 --> 00:30:53.500
larger, slightly smaller.
00:30:53.500 --> 00:30:56.560
And you would often
see then another--
00:30:56.560 --> 00:30:59.720
oops, that's not
very straight, is it?
00:30:59.720 --> 00:31:05.450
It's meant to be another
underlying set of data.
00:31:05.450 --> 00:31:09.880
But you might find your data
looking something like this.
00:31:09.880 --> 00:31:12.540
And that kind of plot is
really useful for showing
00:31:12.540 --> 00:31:15.300
that there is a mean
shift, a mean difference
00:31:15.300 --> 00:31:16.540
between your data.
00:31:16.540 --> 00:31:19.980
But also, the variance is
different in the two cases.
00:31:19.980 --> 00:31:22.187
Now, exactly what
you're plotting here
00:31:22.187 --> 00:31:23.520
might be a little bit different.
00:31:23.520 --> 00:31:27.570
You might actually not
plot quite normalized data.
00:31:27.570 --> 00:31:33.960
You might actually use it
in an unnormalized fashion.
00:31:33.960 --> 00:31:38.550
Here, you might plot this not
in terms of standard deviations,
00:31:38.550 --> 00:31:40.640
but rather actual--
00:31:40.640 --> 00:31:43.950
keep it in the quantiles
or the probability
00:31:43.950 --> 00:31:46.740
of being that far away--
00:31:46.740 --> 00:31:51.000
probability of that x--
00:31:51.000 --> 00:31:52.470
that's weird.
00:31:52.470 --> 00:31:54.550
The probability of that x value.
00:31:54.550 --> 00:31:56.940
So for example, you will
often see these kinds
00:31:56.940 --> 00:32:10.590
of plots which would show
things like 0.001, 0.01, 0.1, 1,
00:32:10.590 --> 00:32:16.050
or something like
that, getting up to--
00:32:16.050 --> 00:32:21.480
I guess 0.5 would be the
equivalent for the mean.
00:32:21.480 --> 00:32:24.240
And then you start
going larger--
00:32:24.240 --> 00:32:29.640
0.9, 0.99, 0.999.
00:32:29.640 --> 00:32:33.070
In other words, you
might actually plot as--
00:32:33.070 --> 00:32:35.620
I should have put
these on the x value--
00:32:35.620 --> 00:32:38.980
the probability that
you would find a data
00:32:38.980 --> 00:32:46.210
point that far away as opposed
to implied probabilities
00:32:46.210 --> 00:32:48.947
in terms of number of
standard deviations.
00:32:48.947 --> 00:32:50.530
So there are some
really cool variants
00:32:50.530 --> 00:32:52.552
of these plots that
are very useful.
00:32:52.552 --> 00:32:54.010
And I think we'll
see some of these
00:32:54.010 --> 00:32:56.050
when we talk a little
bit about yield
00:32:56.050 --> 00:32:59.923
and some other distributions.
00:32:59.923 --> 00:33:01.090
AUDIENCE: I have a question?
00:33:01.090 --> 00:33:02.670
PROFESSOR: Yeah.
00:33:02.670 --> 00:33:04.600
AUDIENCE: Yeah, after
I have the q-q plots,
00:33:04.600 --> 00:33:07.210
how can I tell the
confidence level
00:33:07.210 --> 00:33:10.540
that I have to say whether
or not my data is normally
00:33:10.540 --> 00:33:11.480
distributed?
00:33:11.480 --> 00:33:13.570
PROFESSOR: So the q-q
plot does not actually
00:33:13.570 --> 00:33:17.980
tell you confidence intervals on
either the hypothesis that it's
00:33:17.980 --> 00:33:21.550
normally distributed
or confidence intervals
00:33:21.550 --> 00:33:24.970
on the parameter estimate.
00:33:24.970 --> 00:33:28.270
There are some formal
statistical tests
00:33:28.270 --> 00:33:32.290
where you can test that
hypothesis of normality.
00:33:32.290 --> 00:33:40.890
And essentially, you can
use those from your--
00:33:40.890 --> 00:33:44.100
never going to hand-calculate
some of those statistics,
00:33:44.100 --> 00:33:45.660
and then the
probability associated
00:33:45.660 --> 00:33:48.540
with a derived statistic
associated with normality.
00:33:48.540 --> 00:33:51.960
You'll use your statistics
package for that.
00:33:51.960 --> 00:33:54.540
This gives you a good
visual indication.
00:33:54.540 --> 00:33:57.090
But to actually
test, is it normal?
00:33:57.090 --> 00:34:00.840
Or what is the probability
that the data is non-normal?
00:34:00.840 --> 00:34:02.430
That's a different question.
00:34:02.430 --> 00:34:05.550
And then today, we
will start talking
00:34:05.550 --> 00:34:09.000
about confidence intervals
on the mean and the variance,
00:34:09.000 --> 00:34:14.770
which you also would not use
the q-q norm plot to generate.
00:34:14.770 --> 00:34:17.840
So in fact, let's get
to that because that's--
00:34:17.840 --> 00:34:18.340
yeah?
00:34:18.340 --> 00:34:20.530
AUDIENCE: For that plot,
can you use regression
00:34:20.530 --> 00:34:23.774
to see how far it
is from the normal?
00:34:27.590 --> 00:34:30.760
PROFESSOR: Well, first off,
again, if you were actually
00:34:30.760 --> 00:34:33.100
trying to estimate the
parameters of normality,
00:34:33.100 --> 00:34:36.760
you would just use the data
and calculate the sample mean
00:34:36.760 --> 00:34:38.560
and sample standard deviation.
00:34:38.560 --> 00:34:40.570
I think if you are--
00:34:40.570 --> 00:34:43.520
essentially what you
are posing here is,
00:34:43.520 --> 00:34:49.989
could I go in and look
at these deviations
00:34:49.989 --> 00:34:52.900
and do some, I don't know,
sum of squared values
00:34:52.900 --> 00:34:54.159
of those deviations?
00:34:54.159 --> 00:34:57.640
That's actually getting
really close to calculating
00:34:57.640 --> 00:35:00.040
a statistic.
00:35:00.040 --> 00:35:05.470
Call it a W or some
number, a W statistic
00:35:05.470 --> 00:35:09.130
that I would form based on sum
of squared deviations on one
00:35:09.130 --> 00:35:11.380
of these plots or
some other-- maybe
00:35:11.380 --> 00:35:16.740
it's a sum of absolute
distance deviations.
00:35:16.740 --> 00:35:19.160
Now I've got a
statistic W, and that's
00:35:19.160 --> 00:35:22.010
getting really close to the
kinds of statistical tests
00:35:22.010 --> 00:35:26.450
that one would run to ask
the question of normality.
00:35:26.450 --> 00:35:30.170
I don't actually know
what the formula is
00:35:30.170 --> 00:35:37.610
used in coming up with
a W value and then what
00:35:37.610 --> 00:35:38.840
the normality tests are.
00:35:38.840 --> 00:35:40.640
But that's the
kernel of the idea,
00:35:40.640 --> 00:35:42.620
is to actually
look at your data,
00:35:42.620 --> 00:35:48.840
form an aggregate value for that
statistic, that W statistic.
00:35:48.840 --> 00:35:51.740
So for example, if it was
sum of absolute values,
00:35:51.740 --> 00:35:56.565
and it-- for a sample of size
50, and that W is very near 0,
00:35:56.565 --> 00:35:58.190
then you have high
confidence that it's
00:35:58.190 --> 00:35:59.550
a normal distribution.
00:35:59.550 --> 00:36:02.390
But as W gets bigger,
that would seem
00:36:02.390 --> 00:36:04.340
to indicate more
and more likelihood
00:36:04.340 --> 00:36:05.720
that it's not normal.
00:36:05.720 --> 00:36:07.430
And that's exactly
the kind of thing
00:36:07.430 --> 00:36:11.390
that's going on in the formal
statistical test for normality.
00:36:19.220 --> 00:36:20.750
So here, we've given
you a few tools
00:36:20.750 --> 00:36:23.060
for being able to
look at the data,
00:36:23.060 --> 00:36:26.730
get a feel for is
it normal or not.
00:36:26.730 --> 00:36:29.970
But it hasn't
answered the question,
00:36:29.970 --> 00:36:34.190
how come so often we're using a
normal distribution when we're
00:36:34.190 --> 00:36:35.750
actually looking
at manufacturing
00:36:35.750 --> 00:36:38.480
data or other kinds
of experimental data?
00:36:38.480 --> 00:36:43.670
And a really important thing
is the following observation--
00:36:43.670 --> 00:36:50.790
the following fact-- that
if we are forming a sum
00:36:50.790 --> 00:36:54.910
of independent observations
of a random variable--
00:36:54.910 --> 00:37:01.720
so x has some
underlying distribution.
00:37:01.720 --> 00:37:03.940
And it doesn't
actually matter what
00:37:03.940 --> 00:37:07.510
the underlying distribution is.
00:37:07.510 --> 00:37:11.470
But I form n
independent observations
00:37:11.470 --> 00:37:13.240
of that random variable.
00:37:13.240 --> 00:37:14.980
And then I look at
the distribution
00:37:14.980 --> 00:37:24.870
of the sum of x1 plus x2
plus all n random variables.
00:37:24.870 --> 00:37:27.780
The fascinating
fact is that the sum
00:37:27.780 --> 00:37:29.880
of independent random
variables tends
00:37:29.880 --> 00:37:32.860
towards a normal distribution.
00:37:32.860 --> 00:37:34.320
This is the central
limit theorem.
00:37:37.930 --> 00:37:42.660
So here's a neat little example.
00:37:42.660 --> 00:37:49.130
If my underlying distribution
is in fact something
00:37:49.130 --> 00:37:51.890
like a uniform distribution,
and if I'm, say,
00:37:51.890 --> 00:37:57.050
pulling off 20 samples of
x1 and 20 samples of x2
00:37:57.050 --> 00:37:58.925
from a different
uniform distribution,
00:37:58.925 --> 00:38:02.360
and I form, say,
100 samples of, I
00:38:02.360 --> 00:38:04.700
guess, 100 sets-- each
one of these is, I guess,
00:38:04.700 --> 00:38:08.700
1,000 points in this example.
00:38:08.700 --> 00:38:12.320
But I essentially
take the sum of all
00:38:12.320 --> 00:38:17.210
of these random variables and
form a new random variable.
00:38:17.210 --> 00:38:22.550
The new random variable tends
towards a normal distribution
00:38:22.550 --> 00:38:25.360
with some mean and variance.
00:38:28.000 --> 00:38:31.180
Some of you I saw in 2853.
00:38:31.180 --> 00:38:34.360
And I had a nice
link to a website.
00:38:34.360 --> 00:38:38.360
And I'll actually dig that up
and post it for this class.
00:38:38.360 --> 00:38:43.270
It's the SticiGui website.
00:38:43.270 --> 00:38:46.510
It's a statistics--
interactive statistic
00:38:46.510 --> 00:38:48.610
package out of UC Berkeley.
00:38:48.610 --> 00:38:49.670
And it's really fun.
00:38:49.670 --> 00:38:51.400
You can actually
form these kinds
00:38:51.400 --> 00:38:56.290
of sums of random variables
out of different underlying
00:38:56.290 --> 00:38:59.500
distributions and
plot them and start
00:38:59.500 --> 00:39:04.870
to see how close the sum
or the normalized sum
00:39:04.870 --> 00:39:09.430
of these distributions
are to a normal.
00:39:09.430 --> 00:39:12.520
So there's some very, very
nice interactive tools
00:39:12.520 --> 00:39:14.380
that you can play with.
00:39:14.380 --> 00:39:22.390
Now, an important point here is
if I'm calculating the mean--
00:39:22.390 --> 00:39:26.520
so I'm calculating an
x bar across my data.
00:39:26.520 --> 00:39:31.330
And I've got 100
samples, each drawn--
00:39:31.330 --> 00:39:34.240
and I'm assuming I'm drawing
it from the same underlying
00:39:34.240 --> 00:39:36.850
distribution,
whatever that may be.
00:39:36.850 --> 00:39:41.290
What is the distribution
of the sample mean?
00:39:43.910 --> 00:39:47.290
Well, if you look at the
formula for the sample mean,
00:39:47.290 --> 00:39:49.780
it's not exactly a
sum of your data.
00:39:49.780 --> 00:39:52.480
It's the sum of your data,
then divided by 1, right?
00:39:52.480 --> 00:39:56.680
It's summed from i
equals 1 to whatever n
00:39:56.680 --> 00:40:00.640
is of your individual samples.
00:40:00.640 --> 00:40:04.270
So it is a sum, and then
with a constant out front.
00:40:04.270 --> 00:40:08.830
But the point is, by appealing
to the central limit theorem,
00:40:08.830 --> 00:40:13.720
the sample mean distribution,
the PDF associated
00:40:13.720 --> 00:40:19.180
with a sample mean,
always tends towards
00:40:19.180 --> 00:40:22.510
the normal distribution.
00:40:22.510 --> 00:40:24.850
So we're going to come back
to this idea of sampling
00:40:24.850 --> 00:40:27.660
and what the distribution is for
sample statistics a little bit
00:40:27.660 --> 00:40:28.780
later.
00:40:28.780 --> 00:40:32.280
But more generally, very
often what we're doing
00:40:32.280 --> 00:40:37.410
is pulling data out of a process
that in itself is already,
00:40:37.410 --> 00:40:41.540
by the physics of the
process, highly averaged.
00:40:41.540 --> 00:40:43.550
And therefore,
it's averaging lots
00:40:43.550 --> 00:40:46.160
of perhaps other
underlying strange physics
00:40:46.160 --> 00:40:48.860
or difficult physics.
00:40:48.860 --> 00:40:54.990
But in aggregate, that averaging
nature of the data itself--
00:40:54.990 --> 00:40:57.050
not the operation
that we perform,
00:40:57.050 --> 00:41:01.130
but each individual
underlying data point--
00:41:01.130 --> 00:41:03.110
each individual x sub i--
00:41:03.110 --> 00:41:05.900
underneath of that may
have some averaging
00:41:05.900 --> 00:41:07.940
by the physics
going on that will
00:41:07.940 --> 00:41:13.980
help to drive it towards itself
being a normal distribution.
00:41:13.980 --> 00:41:16.710
So just to remind you,
the central limit theorem
00:41:16.710 --> 00:41:24.740
is probably the most used and
perhaps often abused appeal
00:41:24.740 --> 00:41:28.430
to why we're using normal
distributions very often.
00:41:28.430 --> 00:41:30.470
It is still good to test it.
00:41:30.470 --> 00:41:35.540
But there is a good reason why
very often, our data does come
00:41:35.540 --> 00:41:38.783
up as normal distributions.
00:41:38.783 --> 00:41:40.200
So I want to talk
a little bit now
00:41:40.200 --> 00:41:43.140
about sampling because
we are very often using
00:41:43.140 --> 00:41:46.950
actual measurements and
data to try to get estimates
00:41:46.950 --> 00:41:53.760
for, or more generally, build
a model of our random process
00:41:53.760 --> 00:41:57.030
and estimate parameters
of that random process.
00:41:57.030 --> 00:42:00.840
And we've said in general,
p sub x is unknown.
00:42:00.840 --> 00:42:05.820
The data-- always plot your
raw data first and foremost.
00:42:05.820 --> 00:42:10.530
And very often, the raw data
will suggest a distribution.
00:42:10.530 --> 00:42:17.700
Or then histograms may
provide some insight.
00:42:17.700 --> 00:42:22.180
So for example, a
very quick histogram
00:42:22.180 --> 00:42:24.820
will very often give
you the difference
00:42:24.820 --> 00:42:28.720
between a normal distribution
and a uniform distribution.
00:42:28.720 --> 00:42:31.660
If it's evenly
falling, and I don't
00:42:31.660 --> 00:42:36.230
have this falloff in the
tails, that's very important.
00:42:36.230 --> 00:42:39.280
And then we can also use
things like the q-q norm plot
00:42:39.280 --> 00:42:41.210
to test some of those things.
00:42:41.210 --> 00:42:48.430
So the first job is to come
up with what likely [COUGH]
00:42:48.430 --> 00:42:51.640
distribution you want to use.
00:42:51.640 --> 00:42:55.900
Nine times out of 10,
normal distribution
00:42:55.900 --> 00:42:57.370
will be appropriate.
00:42:57.370 --> 00:43:00.560
And then the second
thing is to estimate
00:43:00.560 --> 00:43:03.880
parameters of the distribution.
00:43:07.010 --> 00:43:09.290
And the normal distribution,
again, to remind you,
00:43:09.290 --> 00:43:11.780
just has these two
parameters, mean and variance.
00:43:11.780 --> 00:43:15.200
And now what we want
to do is estimate them.
00:43:15.200 --> 00:43:18.540
Now, everybody is
used to the formulas.
00:43:18.540 --> 00:43:20.390
We've got the
formulas right here
00:43:20.390 --> 00:43:24.200
for calculating from your
sample, your limited number
00:43:24.200 --> 00:43:28.180
of pieces of data,
what things are--
00:43:28.180 --> 00:43:31.900
what a few important statistics
are or characteristics
00:43:31.900 --> 00:43:35.385
are of that data, like the
sample mean or the average,
00:43:35.385 --> 00:43:36.385
and the sample variance.
00:43:39.190 --> 00:43:42.550
But what I want to give you
a feel for today, perhaps
00:43:42.550 --> 00:43:47.770
the most subtle idea,
an important idea
00:43:47.770 --> 00:43:51.550
for interpretation, for
establishment of confidence
00:43:51.550 --> 00:43:53.980
intervals, for actually
being able to say where
00:43:53.980 --> 00:43:57.010
you think the real values lie--
00:43:57.010 --> 00:44:01.250
the subtle idea is
that these themselves
00:44:01.250 --> 00:44:07.130
are statistics that have their
own PDF, their own Probability
00:44:07.130 --> 00:44:08.550
Density Function.
00:44:08.550 --> 00:44:11.930
They have a sample
statistics that
00:44:11.930 --> 00:44:15.560
fall that tell you the
likelihood of observing
00:44:15.560 --> 00:44:17.660
particular values of them--
00:44:17.660 --> 00:44:23.570
that establish bounds for where,
if I had a different sample,
00:44:23.570 --> 00:44:25.940
how close you think
the new sample,
00:44:25.940 --> 00:44:28.910
still drawn from the
underlying parent distribution,
00:44:28.910 --> 00:44:31.820
would actually lie to
the particular sample
00:44:31.820 --> 00:44:33.820
that I just drew.
00:44:33.820 --> 00:44:35.570
So I'm going to explain
that in a few more
00:44:35.570 --> 00:44:38.420
slides or several slides here.
00:44:38.420 --> 00:44:44.180
But the key idea is it's
really easy to calculate
00:44:44.180 --> 00:44:46.080
a couple of these moments--
00:44:46.080 --> 00:44:49.040
the mean and the variance.
00:44:49.040 --> 00:44:51.140
For the normal
distribution, that
00:44:51.140 --> 00:44:56.790
tells you everything for an
estimate of your raw data.
00:44:56.790 --> 00:44:58.760
But then I want to get
to the more subtle idea
00:44:58.760 --> 00:45:00.468
so that we can start
talking about things
00:45:00.468 --> 00:45:02.870
like confidence intervals.
00:45:02.870 --> 00:45:13.700
And a simple example to give
you a little bit of a feel
00:45:13.700 --> 00:45:18.080
for this here is if
I were to ask you
00:45:18.080 --> 00:45:21.350
what distribution
applies to the sample
00:45:21.350 --> 00:45:26.770
mean, where does that come from?
00:45:26.770 --> 00:45:29.910
Where does this notion of
a distribution associated
00:45:29.910 --> 00:45:33.130
with the sample mean arise?
00:45:33.130 --> 00:45:36.570
So if we look at the
formula for the sample mean
00:45:36.570 --> 00:45:39.330
and expand it out,
in some sense we've
00:45:39.330 --> 00:45:43.860
got just a sum of
independent random variables,
00:45:43.860 --> 00:45:48.870
like we were talking about
with the central limit theorem.
00:45:48.870 --> 00:45:50.710
There are different
constants in here.
00:45:50.710 --> 00:45:53.220
And in this case, for the
sample mean statistic,
00:45:53.220 --> 00:45:56.910
all of the constants are
the same, which is just 1
00:45:56.910 --> 00:45:59.370
over the total number of
data points or sample points
00:45:59.370 --> 00:46:00.630
that I've got.
00:46:00.630 --> 00:46:05.790
Now, you can go back to the
definition of expectation
00:46:05.790 --> 00:46:09.180
that we talked about earlier
and do the expectation
00:46:09.180 --> 00:46:13.870
operator across this
and do expectation math.
00:46:13.870 --> 00:46:21.740
So the expectation of ax is
equal to just that constant
00:46:21.740 --> 00:46:26.380
times the expectation of the
underlying random variable.
00:46:26.380 --> 00:46:33.690
So the 1 over n simply
comes out to the left.
00:46:33.690 --> 00:46:39.710
And if I were to ask, what
is the mean of the PDF
00:46:39.710 --> 00:46:46.730
associated with x bar, it
is going to be 1 over n--
00:46:46.730 --> 00:46:49.810
the same mean.
00:46:49.810 --> 00:46:51.610
Now what else is
going on here is
00:46:51.610 --> 00:46:55.710
if you look at the standard
deviation of x bar--
00:46:55.710 --> 00:46:57.010
I hope you guys can see that.
00:46:57.010 --> 00:47:00.580
There's a variance
of x bar in here.
00:47:00.580 --> 00:47:05.360
So that's an x and a
bar, which I just--
00:47:05.360 --> 00:47:10.080
the pen doesn't line up
exactly with the screen.
00:47:10.080 --> 00:47:14.310
You can also do the
expectation operator for--
00:47:14.310 --> 00:47:18.790
oops, not the expectation,
but the variance operator.
00:47:18.790 --> 00:47:24.210
And if you do the mathematics
on variance of some ax,
00:47:24.210 --> 00:47:30.950
that's equal to a squared,
the variance of the underlying
00:47:30.950 --> 00:47:32.280
variable.
00:47:32.280 --> 00:47:35.720
And if you follow that math
through for the definition
00:47:35.720 --> 00:47:39.260
of x bar and relate
that to the variance
00:47:39.260 --> 00:47:45.140
of each of these x sub i's, what
you find is that the variance--
00:47:45.140 --> 00:47:47.690
I get an n times--
00:47:50.550 --> 00:47:53.170
I'm summing n of these
random variables.
00:47:53.170 --> 00:47:57.090
So I've got n times--
00:47:57.090 --> 00:48:01.340
1 over n is the
constant in here.
00:48:01.340 --> 00:48:04.400
So I get a 1 over n squared
times the underlying
00:48:04.400 --> 00:48:07.150
variance of my x.
00:48:07.150 --> 00:48:11.260
So that I get a cancellation,
and the variance then
00:48:11.260 --> 00:48:16.235
of my x bar is just equal
to what I've shown here,
00:48:16.235 --> 00:48:20.810
a 1 over n of the variance of
the underlying distribution.
00:48:20.810 --> 00:48:24.220
So what's interesting
here is if I start
00:48:24.220 --> 00:48:27.280
to ask about the
distributions associated
00:48:27.280 --> 00:48:30.250
with what are the
mean and the variance
00:48:30.250 --> 00:48:34.060
of the normal distribution
associated with x bar--
00:48:36.990 --> 00:48:40.380
what is the mean of an x
bar that I would typically
00:48:40.380 --> 00:48:43.260
observe from lots of samples
of my underlying distribution?
00:48:43.260 --> 00:48:45.300
What is a variance
I would observe?
00:48:45.300 --> 00:48:49.350
It's related to the
underlying distribution,
00:48:49.350 --> 00:48:51.250
but it's not exactly the same.
00:48:51.250 --> 00:48:54.330
I've got a new random
variable, an x bar,
00:48:54.330 --> 00:48:57.010
that has a different
mean and variance.
00:48:57.010 --> 00:49:01.720
It's got the same
mean in this case,
00:49:01.720 --> 00:49:04.480
but the variance
is actually scaled.
00:49:04.480 --> 00:49:07.330
And this is extremely
useful because the variance
00:49:07.330 --> 00:49:12.430
of my averaging means
that I'm getting a tighter
00:49:12.430 --> 00:49:13.720
distribution--
00:49:13.720 --> 00:49:20.080
a narrower or smaller
variance compared
00:49:20.080 --> 00:49:22.030
to the underlying distribution.
00:49:22.030 --> 00:49:24.280
I'm going to show you
that in a little bit more
00:49:24.280 --> 00:49:29.590
of a graphical fashion a little
bit later because this is--
00:49:29.590 --> 00:49:33.020
that's a preview to this
whole idea of sampling,
00:49:33.020 --> 00:49:36.180
which is really critical.
00:49:36.180 --> 00:49:38.770
We've already talked about this.
00:49:38.770 --> 00:49:47.610
So the key thing here is to
get to this notion of sampling
00:49:47.610 --> 00:49:50.250
distributions, what are the
key distributions arising
00:49:50.250 --> 00:49:53.970
from the fact that I'm drawing
multiple pieces of data
00:49:53.970 --> 00:49:56.160
from a parent
distribution, and then
00:49:56.160 --> 00:49:58.240
calculating things about that?
00:49:58.240 --> 00:50:01.380
So we'll get to some of
these key distributions
00:50:01.380 --> 00:50:03.360
besides the normal distribution.
00:50:03.360 --> 00:50:06.820
We'll actually talk
about these next class.
00:50:06.820 --> 00:50:11.580
But what we want to do is go
back and get a little bit more
00:50:11.580 --> 00:50:13.890
feel for not only the
normal distribution,
00:50:13.890 --> 00:50:15.330
but a few other
distributions that
00:50:15.330 --> 00:50:18.330
often arise in manufacturing,
and then also start
00:50:18.330 --> 00:50:22.860
talking about these notions of
where the data actually lies.
00:50:22.860 --> 00:50:26.460
What are the probabilities of
data falling out in the tails?
00:50:26.460 --> 00:50:28.470
And using that then
to start to get
00:50:28.470 --> 00:50:31.170
towards the idea of building
confidence intervals
00:50:31.170 --> 00:50:34.230
and where we think the real
mean of our underlying parent
00:50:34.230 --> 00:50:36.060
distribution sits.
00:50:36.060 --> 00:50:39.180
Next class, we'll also
get to hypotheses tests,
00:50:39.180 --> 00:50:42.180
which arise naturally
and actually start
00:50:42.180 --> 00:50:46.020
to get really close to
statistical process control
00:50:46.020 --> 00:50:51.570
charting, which is one
of the fundamental tools
00:50:51.570 --> 00:50:54.070
of manufacturing control.
00:50:54.070 --> 00:50:56.620
So what I'm going to
do here is go back--
00:50:56.620 --> 00:51:00.400
this is the plan for the next--
00:51:00.400 --> 00:51:02.945
the rest of today and
starting into tomorrow.
00:51:02.945 --> 00:51:04.570
We're going to go
back, just remind you
00:51:04.570 --> 00:51:07.300
of some of the discrete
variable distributions,
00:51:07.300 --> 00:51:09.520
then talk about some of the--
00:51:09.520 --> 00:51:13.930
which are more applicable to
attribute modeling or yield
00:51:13.930 --> 00:51:15.570
modeling, sort of
discrete things.
00:51:15.570 --> 00:51:17.320
Then we'll come back
and talk a little bit
00:51:17.320 --> 00:51:20.500
about the continuous
distributions,
00:51:20.500 --> 00:51:26.130
and then also touch
on how you manipulate
00:51:26.130 --> 00:51:27.480
some of these distributions.
00:51:31.670 --> 00:51:36.180
Discrete distributions-- people
seen the Bernoulli distribution
00:51:36.180 --> 00:51:36.680
before?
00:51:39.500 --> 00:51:40.280
Good.
00:51:40.280 --> 00:51:43.920
This is like the
simplest distribution--
00:51:43.920 --> 00:51:46.980
the very simplest.
00:51:46.980 --> 00:51:48.300
You do a trial.
00:51:48.300 --> 00:51:49.290
You do an experiment.
00:51:49.290 --> 00:51:53.000
Can only have two outcomes,
success or failure.
00:51:53.000 --> 00:51:55.910
You get to label
what success is.
00:51:55.910 --> 00:51:59.210
We'll label a success
with the random variable
00:51:59.210 --> 00:52:03.500
taking on the value of 1
and failure taking on 0.
00:52:03.500 --> 00:52:04.970
I could flip that.
00:52:04.970 --> 00:52:07.730
You can start to see already
a little bit of inkling
00:52:07.730 --> 00:52:09.830
of yield in here.
00:52:09.830 --> 00:52:11.720
Does the thing work or not?
00:52:11.720 --> 00:52:14.660
The very simplest,
coarsest, crudest kind
00:52:14.660 --> 00:52:18.710
of model for functionality, and
the probability or statistics
00:52:18.710 --> 00:52:22.700
associated with that is,
does the thing work or not?
00:52:22.700 --> 00:52:27.230
And often, we talk about
what is the probability
00:52:27.230 --> 00:52:31.100
that the thing is functioning
at the end of the line?
00:52:31.100 --> 00:52:33.980
Maybe that's 0.95.
00:52:33.980 --> 00:52:37.880
So 95% of the time, I think
I've got yielding parts out.
00:52:37.880 --> 00:52:43.050
For any one experiment,
one outcome,
00:52:43.050 --> 00:52:48.030
I've simply got a p and 1
minus p probability associated
00:52:48.030 --> 00:52:48.550
with that.
00:52:48.550 --> 00:52:52.510
And the PDF can be
expressed as shown here.
00:52:52.510 --> 00:52:55.110
Now we can go in and use
our expectation operations
00:52:55.110 --> 00:52:58.740
for discrete random
variables and calculate what
00:52:58.740 --> 00:53:00.150
the mean and the variance are.
00:53:00.150 --> 00:53:03.570
And those have nice,
closed form functions
00:53:03.570 --> 00:53:05.085
for those two outcomes.
00:53:07.660 --> 00:53:09.420
So that's the Bernoulli.
00:53:09.420 --> 00:53:11.340
Now the second easiest--
00:53:11.340 --> 00:53:14.390
although it can
actually look a little
00:53:14.390 --> 00:53:15.860
confusing at first glance.
00:53:15.860 --> 00:53:17.720
But the second
easiest distribution
00:53:17.720 --> 00:53:20.060
is the binomial
distribution because it's
00:53:20.060 --> 00:53:22.340
saying that I'm simply
taking that success
00:53:22.340 --> 00:53:25.970
or failure with a
fixed probability p
00:53:25.970 --> 00:53:28.740
and running repeated
trials of that.
00:53:28.740 --> 00:53:33.740
So now I'm flipping my
coin, say, which has--
00:53:33.740 --> 00:53:35.870
perhaps it's a weighted
coin, and it comes up
00:53:35.870 --> 00:53:39.110
heads with probability
p that's not 0.5.
00:53:39.110 --> 00:53:42.680
Maybe it's 0.9.
00:53:42.680 --> 00:53:44.810
But now I'm doing
that repeated times.
00:53:44.810 --> 00:53:47.660
I'm doing that n times.
00:53:47.660 --> 00:53:51.260
Now what's the probability
of having n successes?
00:53:55.260 --> 00:53:58.680
Or let me state that again.
00:53:58.680 --> 00:54:01.770
What's the probability
of having x successes
00:54:01.770 --> 00:54:04.300
when I ran n repeated trials?
00:54:04.300 --> 00:54:05.940
So n is the number of trials.
00:54:09.260 --> 00:54:11.560
So if I ran 100
trials, the probability
00:54:11.560 --> 00:54:16.660
that I had exactly x
equals to 7 successes
00:54:16.660 --> 00:54:18.490
is given by this formula, here.
00:54:18.490 --> 00:54:21.110
And you can actually see
this lurking in here.
00:54:21.110 --> 00:54:23.710
How do I have 7 successes?
00:54:23.710 --> 00:54:27.120
Well, that meant
p, the probability
00:54:27.120 --> 00:54:30.960
of having a success, had
to come up exactly 7 times.
00:54:30.960 --> 00:54:32.935
And the rest of the times--
00:54:32.935 --> 00:54:36.780
if I was running 100
trials, the other 93 trials
00:54:36.780 --> 00:54:39.610
all had to be failures.
00:54:39.610 --> 00:54:42.640
So I've simply got the product
of all of those probabilities.
00:54:42.640 --> 00:54:44.790
And then we've got
the combinatorics,
00:54:44.790 --> 00:54:49.020
the n choose x, which tells me
how many different orderings
00:54:49.020 --> 00:54:53.850
could have occurred by which
I would get the 7 successes
00:54:53.850 --> 00:54:58.220
and 93 failures
for n equals 100.
00:54:58.220 --> 00:55:02.600
So that's simply the different
numbers of combinations
00:55:02.600 --> 00:55:04.610
that can come up with that.
00:55:04.610 --> 00:55:08.830
So the notation here, by the
way, that we would often use--
00:55:08.830 --> 00:55:12.190
and I already snuck it
in some other places--
00:55:12.190 --> 00:55:18.520
is this little tilde
symbol here we're
00:55:18.520 --> 00:55:24.340
using to read as "is distributed
as some distribution."
00:55:24.340 --> 00:55:27.400
And I'm using the big B
to indicate the binomial
00:55:27.400 --> 00:55:30.010
distribution, which
has associated with it
00:55:30.010 --> 00:55:34.000
the underlying
Bernoulli probability--
00:55:34.000 --> 00:55:35.860
success for any one trial--
00:55:35.860 --> 00:55:39.940
and then the number
of repeated trials.
00:55:42.600 --> 00:55:45.580
So this is a
discrete probability.
00:55:45.580 --> 00:55:51.160
What's the probability
that x could take on 0.7?
00:55:51.160 --> 00:55:51.960
0, right?
00:55:51.960 --> 00:55:56.380
It's the number of
successes out of this.
00:55:56.380 --> 00:55:58.330
And here are some examples
that just give you
00:55:58.330 --> 00:56:00.640
a little bit of a feel
for what the Bernoulli
00:56:00.640 --> 00:56:04.180
distribution looks like.
00:56:04.180 --> 00:56:07.060
This is the number
of successes plotted
00:56:07.060 --> 00:56:10.600
as a histogram for some values.
00:56:13.560 --> 00:56:15.250
I think that this is--
00:56:15.250 --> 00:56:19.047
if you try it, I think
this is a live spreadsheet.
00:56:19.047 --> 00:56:21.630
So actually, if you double-click
on this from your PowerPoint,
00:56:21.630 --> 00:56:27.660
it may bring up the
underlying Excel spreadsheet.
00:56:27.660 --> 00:56:30.600
So you can actually play with
some of the parameters in this.
00:56:30.600 --> 00:56:34.960
I don't remember what
either p or n was for this.
00:56:34.960 --> 00:56:37.080
But you can start to
see, it's really--
00:56:37.080 --> 00:56:45.000
it does not look quite normal
because you can never have
00:56:45.000 --> 00:56:47.620
negative numbers of successes.
00:56:47.620 --> 00:56:50.400
It's always truncated.
00:56:50.400 --> 00:56:55.770
And you get these
very non-normal kinds
00:56:55.770 --> 00:56:56.860
of distributions.
00:56:56.860 --> 00:56:58.930
This is a binomial distribution.
00:56:58.930 --> 00:57:01.890
But its location and its
shape can change somewhat
00:57:01.890 --> 00:57:04.810
as you play with p and n.
00:57:04.810 --> 00:57:10.450
By the way, up here-- this is
just the cumulative probability
00:57:10.450 --> 00:57:13.710
function, just saying
the probability
00:57:13.710 --> 00:57:19.750
that I've got x less than
or equal to some value.
00:57:19.750 --> 00:57:22.170
So that's also shown.
00:57:22.170 --> 00:57:25.260
So then this is also in
this histogram, normalized
00:57:25.260 --> 00:57:29.250
to the fraction of products.
00:57:29.250 --> 00:57:33.650
And so now, you can start
to look at calculating.
00:57:33.650 --> 00:57:36.365
If this were my
data, and I simply--
00:57:36.365 --> 00:57:38.390
it was actually
coming from a line
00:57:38.390 --> 00:57:44.030
where I was looking at the
probability of any one part
00:57:44.030 --> 00:57:47.000
succeeding or not, I could
start to ask questions
00:57:47.000 --> 00:57:52.440
about the probability of
seeing, out of 1,000 products
00:57:52.440 --> 00:57:55.680
coming off the line,
some number of defects
00:57:55.680 --> 00:57:57.720
or some number of
failed products.
00:57:57.720 --> 00:58:01.628
You can appeal to the binomial
distribution for that.
00:58:01.628 --> 00:58:03.420
Now this is all still
pretty coarse, right?
00:58:03.420 --> 00:58:05.700
It's just a very
simplified model--
00:58:05.700 --> 00:58:10.420
failure or success for yield.
00:58:10.420 --> 00:58:12.460
Now another discrete
distribution
00:58:12.460 --> 00:58:20.170
is a Poisson distribution
or also sometimes referred
00:58:20.170 --> 00:58:24.160
to as an exponential
distribution,
00:58:24.160 --> 00:58:28.060
although terminology
there sometimes varies,
00:58:28.060 --> 00:58:33.870
depending on whether people
are including this component
00:58:33.870 --> 00:58:34.800
or not.
00:58:34.800 --> 00:58:39.330
But the formal definition
for the Poisson distribution
00:58:39.330 --> 00:58:40.330
is shown here.
00:58:40.330 --> 00:58:43.300
Now it continues to be
a discrete distribution.
00:58:43.300 --> 00:58:45.570
So I'm asking, what is
the probability associated
00:58:45.570 --> 00:58:52.810
with observing x taking on
actual discrete integer values?
00:58:52.810 --> 00:59:03.800
But this is a very nice
distribution associated with
00:59:03.800 --> 00:59:09.320
kinds of operations that many
of you saw in 2.850 or 2.8--
00:59:09.320 --> 00:59:11.600
yeah, 2.853.
00:59:11.600 --> 00:59:14.900
The arrival times
in queuing networks
00:59:14.900 --> 00:59:18.380
will often be
Poisson distributed.
00:59:18.380 --> 00:59:21.980
But it also can come
up when we are dealing
00:59:21.980 --> 00:59:25.640
with very large
numbers associated
00:59:25.640 --> 00:59:30.290
with the binomial distribution
as a very good approximation
00:59:30.290 --> 00:59:32.670
to the binomial.
00:59:32.670 --> 00:59:34.320
And this turns out
to be really nice,
00:59:34.320 --> 00:59:37.200
because if you actually go
back to the binomial formula
00:59:37.200 --> 00:59:43.990
and try to calculate it for
situations where, say, n
00:59:43.990 --> 00:59:48.640
or x are very, very
large, or p or 1 minus
00:59:48.640 --> 00:59:50.440
p is very, very
small or very large,
00:59:50.440 --> 00:59:53.170
very close to either
0 or 1, you end up
00:59:53.170 --> 00:59:57.220
with some problems,
some numerical problems.
00:59:57.220 --> 01:00:01.120
Because if you actually try to
calculate it for, let's say,
01:00:01.120 --> 01:00:08.170
p is equal to 0.0001, or maybe
1 minus p is equal to that.
01:00:08.170 --> 01:00:10.540
Let's say you had really,
really high yield.
01:00:14.420 --> 01:00:17.030
And I take that, so
if that's 1 minus p--
01:00:17.030 --> 01:00:20.330
and I'm doing this for a
sample of size a million.
01:00:20.330 --> 01:00:25.410
I've got 0.0001 to the
one millionth power.
01:00:25.410 --> 01:00:29.220
And numerically, you
start losing the digits.
01:00:29.220 --> 01:00:31.660
You can't hardly
keep track of that.
01:00:31.660 --> 01:00:33.990
But I might be asking,
what is the probability
01:00:33.990 --> 01:00:36.910
of some substantial
number of failures?
01:00:36.910 --> 01:00:39.180
And this, the
combinatorics, end up
01:00:39.180 --> 01:00:41.470
being a really,
really large number.
01:00:41.470 --> 01:00:46.530
So overall, the
overall probability
01:00:46.530 --> 01:00:50.760
of seeing 10 failures
out of a million parts
01:00:50.760 --> 01:00:52.570
might be substantial.
01:00:52.570 --> 01:00:56.250
But to calculate it, you
can't do it numerically,
01:00:56.250 --> 01:00:59.490
because I've got a huge number
times a really small number.
01:00:59.490 --> 01:01:01.470
I get overflow or underflow.
01:01:01.470 --> 01:01:04.470
And I can't actually
calculate it.
01:01:04.470 --> 01:01:10.080
What's useful is in those
kinds of situations, where,
01:01:10.080 --> 01:01:15.260
say, n and p together-- the
product of those things--
01:01:15.260 --> 01:01:17.750
are reasonable-size
numbers, then
01:01:17.750 --> 01:01:21.120
the Poisson distribution is a
very, very good approximation.
01:01:21.120 --> 01:01:25.730
And this applies to things
where you have very, say,
01:01:25.730 --> 01:01:28.240
low probability.
01:01:28.240 --> 01:01:30.410
So p might be very small.
01:01:30.410 --> 01:01:36.650
But I'm asking-- or I have
many, many opportunities
01:01:36.650 --> 01:01:41.810
to observe that very
low-likelihood event.
01:01:41.810 --> 01:01:45.290
So an example here that comes up
in semiconductor manufacturing
01:01:45.290 --> 01:01:51.630
are things like the probability
of observing some number
01:01:51.630 --> 01:01:53.340
of defects on a wafer.
01:01:53.340 --> 01:01:56.610
The likelihood of seeing a
point defect on any one location
01:01:56.610 --> 01:01:58.470
is very, very, very small.
01:01:58.470 --> 01:02:00.600
But I've got lots
and lots of area
01:02:00.600 --> 01:02:02.970
on the wafer-- lots
and lots of opportunity
01:02:02.970 --> 01:02:05.610
for the appearance
of that small defect.
01:02:05.610 --> 01:02:10.620
And so you can start to
talk about the product
01:02:10.620 --> 01:02:14.370
of those things or
a rate per unit area
01:02:14.370 --> 01:02:16.470
that starts to
become reasonable.
01:02:20.170 --> 01:02:24.010
Another example is the number of
misprints on a page of a book.
01:02:24.010 --> 01:02:26.560
You don't expect for
any one character
01:02:26.560 --> 01:02:31.720
on a book for that to
actually be a misprint.
01:02:31.720 --> 01:02:35.260
But over the entire aggregate
number of pages in your book,
01:02:35.260 --> 01:02:37.180
you expect some
number of misprints.
01:02:37.180 --> 01:02:39.580
And the statistics
that go with that
01:02:39.580 --> 01:02:42.112
are typically
Poisson distributed.
01:02:42.112 --> 01:02:44.580
And I already mentioned that
the mean and the variance,
01:02:44.580 --> 01:02:48.150
if you actually apply those
formulas to this distribution,
01:02:48.150 --> 01:02:50.490
come out to the
fascinating fact that they
01:02:50.490 --> 01:02:52.950
are numerically the same value.
01:02:52.950 --> 01:02:56.130
By the way, units-wise,
they're not.
01:02:56.130 --> 01:03:00.360
But x is an integer and--
01:03:00.360 --> 01:03:01.500
oops.
01:03:01.500 --> 01:03:03.640
That should be x, by the way.
01:03:03.640 --> 01:03:04.240
Come on.
01:03:04.240 --> 01:03:04.830
Cut that out.
01:03:10.710 --> 01:03:11.240
There we go.
01:03:13.840 --> 01:03:16.660
So here are some example
Poisson distributions.
01:03:16.660 --> 01:03:20.410
You can start to see one
here for a mean of 5.
01:03:20.410 --> 01:03:23.830
It looks close to the
binomial distribution
01:03:23.830 --> 01:03:25.360
that I showed you earlier.
01:03:25.360 --> 01:03:29.570
And then as the mean
here is increasing,
01:03:29.570 --> 01:03:31.340
and the lambda
parameter, you can
01:03:31.340 --> 01:03:36.560
start to see this distribution
shifting to the right.
01:03:36.560 --> 01:03:38.660
We said lambda is the mean.
01:03:38.660 --> 01:03:43.280
It's also a characteristic
of the variance.
01:03:43.280 --> 01:03:48.690
The variance is also
equal to lambda.
01:03:48.690 --> 01:03:55.170
So that will also broaden
out for larger numbers of--
01:03:55.170 --> 01:03:58.280
or larger values of lambda.
01:03:58.280 --> 01:04:02.300
There's another observation
in here which is useful.
01:04:02.300 --> 01:04:04.610
What are they starting to
look like for large lambdas?
01:04:08.342 --> 01:04:09.050
AUDIENCE: Normal.
01:04:09.050 --> 01:04:10.130
PROFESSOR: Normal, right.
01:04:10.130 --> 01:04:12.260
If you looked at that, it
doesn't look very normal
01:04:12.260 --> 01:04:12.830
distributed.
01:04:12.830 --> 01:04:14.000
It's truncated.
01:04:14.000 --> 01:04:15.920
It's a little bit skewed.
01:04:15.920 --> 01:04:23.270
But another approximation is for
large lambda, that also tends
01:04:23.270 --> 01:04:25.560
towards a normal distribution.
01:04:25.560 --> 01:04:28.640
So very often, you've got
this success or succession
01:04:28.640 --> 01:04:31.340
of approximations, where
you might take a binomial,
01:04:31.340 --> 01:04:32.960
approximate it as a Poisson.
01:04:32.960 --> 01:04:37.280
But then for large numbers,
a normal distribution also
01:04:37.280 --> 01:04:42.130
can be a useful approximation.
01:04:42.130 --> 01:04:46.860
So let's go back to the
continuous distributions,
01:04:46.860 --> 01:04:49.230
the normal and the uniform.
01:04:49.230 --> 01:04:52.920
And here, I want to start
getting to actually how you use
01:04:52.920 --> 01:04:58.960
or calculate probabilities of
observations in certain ranges
01:04:58.960 --> 01:05:02.520
and in particular things, like
the probabilities of observing
01:05:02.520 --> 01:05:04.840
things out in the tails.
01:05:04.840 --> 01:05:06.930
So here's a continuous
distribution
01:05:06.930 --> 01:05:09.730
that has a probability
density function.
01:05:09.730 --> 01:05:10.950
This is the normal--
01:05:10.950 --> 01:05:12.510
excuse me, the
uniform distribution
01:05:12.510 --> 01:05:17.400
that has the same probability
density for values
01:05:17.400 --> 01:05:19.510
in some range.
01:05:19.510 --> 01:05:22.380
And then I've also
indicated with a capital F
01:05:22.380 --> 01:05:27.060
our cumulative density
function for that.
01:05:27.060 --> 01:05:30.630
So this is just reminding you of
a little bit of the terminology
01:05:30.630 --> 01:05:31.380
there.
01:05:31.380 --> 01:05:36.060
But I'm highlighting
the uniform distribution
01:05:36.060 --> 01:05:38.850
because there's a couple
of very standard questions,
01:05:38.850 --> 01:05:42.975
that if you have a
known PDF or CDF,
01:05:42.975 --> 01:05:45.600
these are the kinds of questions
that you're going to be asking
01:05:45.600 --> 01:05:47.310
again and again and again.
01:05:47.310 --> 01:05:48.780
And they're nice
and intuitive off
01:05:48.780 --> 01:05:51.390
of the uniform distribution.
01:05:51.390 --> 01:05:54.400
When we get to the normal
and other distributions,
01:05:54.400 --> 01:05:56.580
they're not quite as intuitive.
01:05:56.580 --> 01:06:01.530
But seeing them here for the
uniform first, I think, helps.
01:06:01.530 --> 01:06:03.480
One of the typical
kinds of questions
01:06:03.480 --> 01:06:08.760
is I want to know, what is
the probability that some x is
01:06:08.760 --> 01:06:12.330
less than or equal to some
value if I were to draw it
01:06:12.330 --> 01:06:15.350
from this underlying
distribution--
01:06:15.350 --> 01:06:17.160
from a normal distribution?
01:06:17.160 --> 01:06:21.890
And so one could ask
that using either
01:06:21.890 --> 01:06:26.480
the PDF or the Cumulative
Density Function.
01:06:26.480 --> 01:06:28.430
And sometimes, one or
the other, if they're
01:06:28.430 --> 01:06:33.020
tabulated or available
to you, is easier to use.
01:06:33.020 --> 01:06:36.770
Clearly, if this is a
Probability Density Function
01:06:36.770 --> 01:06:41.390
here, I can ask it in terms
of the interval question.
01:06:41.390 --> 01:06:46.050
Oops, excuse me-- the interval
question right here, and say,
01:06:46.050 --> 01:06:53.410
well, the probability that x is
less than or equal to that x1
01:06:53.410 --> 01:06:56.950
is simply the integration
up of that probability.
01:06:56.950 --> 01:06:59.020
And you can do that
numerically or just
01:06:59.020 --> 01:07:01.990
by hand on such a
simple distribution.
01:07:01.990 --> 01:07:06.070
But the point that is actually
exactly the value that
01:07:06.070 --> 01:07:08.860
is tabulated on the
Cumulative Density Function.
01:07:08.860 --> 01:07:12.140
That's the definition of the
Cumulative Density Function.
01:07:12.140 --> 01:07:17.060
So if you've got the CDF, you
simply look it up and say,
01:07:17.060 --> 01:07:23.440
what is f of x1 equal to
whatever your value is
01:07:23.440 --> 01:07:28.990
for that probability function?
01:07:28.990 --> 01:07:32.180
Now similarly, you can
also ask the question,
01:07:32.180 --> 01:07:35.770
what is the probability that
x sits within some range,
01:07:35.770 --> 01:07:39.890
say, between x1 and x2?
01:07:39.890 --> 01:07:41.510
And again, you
can do that either
01:07:41.510 --> 01:07:45.620
off of the underlying
density function, just
01:07:45.620 --> 01:07:47.510
integrating and
saying, yes, x has
01:07:47.510 --> 01:07:51.560
to lie between those values,
and integrate up the density.
01:07:51.560 --> 01:07:56.840
Or you can recognize that
the probability that x
01:07:56.840 --> 01:08:00.770
is less than x2 is
simply that value
01:08:00.770 --> 01:08:05.500
and subtract off that
the probability that x
01:08:05.500 --> 01:08:08.410
was less than x1 is that.
01:08:08.410 --> 01:08:11.650
And so therefore, the
difference between those two
01:08:11.650 --> 01:08:16.899
corresponds to the integration
on the underlying Probability
01:08:16.899 --> 01:08:19.460
Density Function.
01:08:19.460 --> 01:08:20.600
So that's pretty easy.
01:08:20.600 --> 01:08:22.590
That should be pretty clear.
01:08:22.590 --> 01:08:25.880
Let's talk about that also
for the normal distribution
01:08:25.880 --> 01:08:28.640
because some of
those values are not
01:08:28.640 --> 01:08:30.960
as easy to integrate up by hand.
01:08:30.960 --> 01:08:34.160
In fact, there exist no
closed-form formulas.
01:08:34.160 --> 01:08:36.050
But they are tabulated for you.
01:08:36.050 --> 01:08:39.149
And that's where
going to the table
01:08:39.149 --> 01:08:41.880
on the normal distribution
for things like f of x
01:08:41.880 --> 01:08:42.750
are going to--
01:08:42.750 --> 01:08:46.140
is an operation that you will
actually perform quite a bit
01:08:46.140 --> 01:08:50.160
when you're manipulating
normal distributions.
01:08:50.160 --> 01:08:51.560
So here's another plot.
01:08:51.560 --> 01:08:54.979
We've already talked, or I've
shown other examples here
01:08:54.979 --> 01:08:56.600
of the normal distribution.
01:08:56.600 --> 01:08:59.390
I've tagged off on
this plot for us
01:08:59.390 --> 01:09:03.170
a few useful little numbers
to have as rules of thumb.
01:09:03.170 --> 01:09:08.149
This is actually, I think,
a moderately useful page
01:09:08.149 --> 01:09:13.069
to print out and have off
on the side for your use.
01:09:13.069 --> 01:09:15.529
In particular, what
I'm showing here
01:09:15.529 --> 01:09:20.180
is for the normal distribution
you've got a formula.
01:09:20.180 --> 01:09:21.965
You're hardly ever
going to actually plug
01:09:21.965 --> 01:09:23.870
in values for the formula.
01:09:23.870 --> 01:09:27.260
But if you look out plus
1 standard deviation,
01:09:27.260 --> 01:09:30.439
plus 2 standard
deviation, on the PDF,
01:09:30.439 --> 01:09:33.080
I've tried to indicate
here how rapidly
01:09:33.080 --> 01:09:37.140
the value of that probability
density falls off.
01:09:37.140 --> 01:09:41.390
So for example, one standard
deviation, I'm about 60%
01:09:41.390 --> 01:09:42.290
the peak.
01:09:42.290 --> 01:09:45.740
Two standard deviations,
I'm down to about 13.5%
01:09:45.740 --> 01:09:48.029
of the peak.
01:09:48.029 --> 01:09:53.100
Now more often than asking what
is the relative probabilities
01:09:53.100 --> 01:09:58.930
of these things, you're actually
more often asking, what is--
01:09:58.930 --> 01:10:01.450
how much-- what is the
integrated probability
01:10:01.450 --> 01:10:07.120
density of the random
variable out in some tail
01:10:07.120 --> 01:10:09.340
or in some central region?
01:10:09.340 --> 01:10:12.670
And that's where the Cumulative
Density Function is really
01:10:12.670 --> 01:10:14.780
the one that you want to use.
01:10:14.780 --> 01:10:20.550
And so what I'm showing
here is out for some number
01:10:20.550 --> 01:10:25.170
of standard deviations-- this is
mu minus 3 standard deviation.
01:10:25.170 --> 01:10:28.860
This is saying the
probability that x is less
01:10:28.860 --> 01:10:36.290
than mu minus 3 sigma
is exactly that value.
01:10:36.290 --> 01:10:42.720
That equals f of
mi minus 3 sigma.
01:10:42.720 --> 01:10:44.550
And I simply look that up.
01:10:44.550 --> 01:10:51.120
And that's about 0.00135, or
less than 0.1% of your data
01:10:51.120 --> 01:10:55.230
should fall less than 3
sigma off the left side
01:10:55.230 --> 01:10:58.050
of your distribution.
01:10:58.050 --> 01:11:01.290
And then I've tabulated that
for two standard deviations, one
01:11:01.290 --> 01:11:03.060
standard deviation.
01:11:03.060 --> 01:11:05.640
By the way, what's
the probability, now
01:11:05.640 --> 01:11:10.400
that I've marked it up,
that your data falls less
01:11:10.400 --> 01:11:11.060
than your mean?
01:11:13.740 --> 01:11:14.610
50%.
01:11:14.610 --> 01:11:17.520
It's a symmetric distribution.
01:11:17.520 --> 01:11:21.750
And so, in fact, you
could then ask also
01:11:21.750 --> 01:11:26.250
the question, what's the
probability that my data is
01:11:26.250 --> 01:11:29.850
all the way from my left tail
up to two standard deviations
01:11:29.850 --> 01:11:31.380
above the mean?
01:11:31.380 --> 01:11:33.680
And that's 97.7%.
01:11:33.680 --> 01:11:36.480
But I want to also
point out these--
01:11:36.480 --> 01:11:43.810
this distribution itself is also
anti-symmetric around the mean.
01:11:43.810 --> 01:11:50.560
So this value and
this value sum to 1.
01:11:50.560 --> 01:11:53.050
So in other words,
1 minus whatever
01:11:53.050 --> 01:11:57.910
is out in the upper tail
is equal to the probability
01:11:57.910 --> 01:12:00.280
of being below the lower tail.
01:12:04.630 --> 01:12:09.510
So what's tabulated is
not mu minus numbers
01:12:09.510 --> 01:12:11.100
of standard deviations.
01:12:11.100 --> 01:12:12.450
But what will often--
01:12:12.450 --> 01:12:16.440
what is actually tabulated
are the standardized or unit
01:12:16.440 --> 01:12:20.310
normal distribution-- again,
the mean-centered version,
01:12:20.310 --> 01:12:22.260
where I subtract off
the mean and divide
01:12:22.260 --> 01:12:25.210
by the standard deviation.
01:12:25.210 --> 01:12:33.000
And that gives a PDF and
a CDF that is universal.
01:12:33.000 --> 01:12:39.370
And that is what will
often be then tabulated
01:12:39.370 --> 01:12:45.220
as the unit normal
Cumulative Density Function.
01:12:45.220 --> 01:12:47.770
In some sense, that's what I
actually showed on this plot,
01:12:47.770 --> 01:12:51.040
by just labeling it as a
function of mu and standard
01:12:51.040 --> 01:12:52.090
deviations.
01:12:52.090 --> 01:12:57.160
But now when you normalize,
that becomes in units of z as 0
01:12:57.160 --> 01:13:01.570
and the numbers of standard
deviations off on the side.
01:13:01.570 --> 01:13:05.100
Now, if you look at
the back of Montgomery,
01:13:05.100 --> 01:13:06.860
there is a whole
bunch of these tables.
01:13:06.860 --> 01:13:09.360
And you'll be using these tables
in some of the problem sets
01:13:09.360 --> 01:13:10.980
and so on.
01:13:10.980 --> 01:13:14.520
And there is a table
for the unit normal.
01:13:14.520 --> 01:13:20.790
And in particular,
what's tabulated
01:13:20.790 --> 01:13:24.780
is this Cumulative Density
Function for the unit normal.
01:13:24.780 --> 01:13:26.910
And we have a little
bit of terminology
01:13:26.910 --> 01:13:28.590
here that I want
to alert you to,
01:13:28.590 --> 01:13:32.460
because we often talk
about percentage points off
01:13:32.460 --> 01:13:36.240
of some distribution or
percentage points of the unit
01:13:36.240 --> 01:13:38.700
normal, as pictured here.
01:13:38.700 --> 01:13:45.510
And what we're talking about
is relating percentages
01:13:45.510 --> 01:13:48.660
of my distribution that are
in some location, usually
01:13:48.660 --> 01:13:52.980
the tails, to numbers
of standard deviations
01:13:52.980 --> 01:13:57.810
that I have to go in order
to apportion that amount over
01:13:57.810 --> 01:14:00.330
in the tails or in
the central regions.
01:14:00.330 --> 01:14:06.420
So a very typical question I
might ask is, how many z's--
01:14:06.420 --> 01:14:11.900
how many "unit standard
deviations," how many z's--
01:14:11.900 --> 01:14:16.610
do I have to go
away from the mean
01:14:16.610 --> 01:14:22.160
in order to get some
alpha or some percentage
01:14:22.160 --> 01:14:27.500
of the distribution
located out in those tails?
01:14:27.500 --> 01:14:30.260
So for example, I
might say I want
01:14:30.260 --> 01:14:38.580
the 20%, 20th
percentile percentage
01:14:38.580 --> 01:14:46.890
point, the 0.2 probability that
my data sits in the two tails.
01:14:46.890 --> 01:14:52.590
So for a total probability that
all my data or the remain--
01:14:52.590 --> 01:14:55.140
the portion of my
data is on either
01:14:55.140 --> 01:15:03.480
of the tails, some further away
than some z, that means 10%
01:15:03.480 --> 01:15:04.800
is in each of the tails.
01:15:04.800 --> 01:15:08.520
And I'm asking the
question, how far--
01:15:08.520 --> 01:15:11.130
how many standard
deviations do I
01:15:11.130 --> 01:15:13.770
have to go to get
10% in the left tail
01:15:13.770 --> 01:15:17.200
and 10% out in the right tail?
01:15:17.200 --> 01:15:19.600
So I'm essentially
asking the question,
01:15:19.600 --> 01:15:24.550
what is the probability
on the cumulative unit
01:15:24.550 --> 01:15:28.230
normal Probability Distribution
Function to get to--
01:15:28.230 --> 01:15:30.060
how many z's do I
have to go to get
01:15:30.060 --> 01:15:33.540
to half of that
alpha probability
01:15:33.540 --> 01:15:36.600
being in each of the tails?
01:15:36.600 --> 01:15:40.890
One observation here is that
these things are, again,
01:15:40.890 --> 01:15:42.240
anti-symmetric.
01:15:42.240 --> 01:15:46.140
So I can also ask the
question either looking
01:15:46.140 --> 01:15:49.940
just the right tail
or the left tail.
01:15:49.940 --> 01:15:54.510
And then you can do the inverse
operation using the table.
01:15:54.510 --> 01:15:56.360
So I'm actually asking
the question, what
01:15:56.360 --> 01:15:58.980
is the z associated with that?
01:15:58.980 --> 01:16:02.280
And I'm looking up on this plot.
01:16:02.280 --> 01:16:06.980
So I might ask, OK, I need
10% there in the tail.
01:16:06.980 --> 01:16:09.590
How many z's does
that correspond to?
01:16:09.590 --> 01:16:12.380
And to get 10% out
in that left tail,
01:16:12.380 --> 01:16:17.090
I got to go out 1.28 standard
deviations off to the left.
01:16:17.090 --> 01:16:22.050
That's the operation that one
would look up in the table.
01:16:22.050 --> 01:16:28.910
So very often, you would get
to these kinds of lookups,
01:16:28.910 --> 01:16:39.590
where you're relating the
probability alpha of your data
01:16:39.590 --> 01:16:44.470
lying below that number
of standard deviations
01:16:44.470 --> 01:16:46.810
and what that corresponding
standard deviation is.
01:16:51.440 --> 01:16:56.510
So I didn't copy one of the
tables out of Montgomery,
01:16:56.510 --> 01:17:00.900
but you'll get some practice
with that on the problem sets.
01:17:00.900 --> 01:17:03.110
Now, there's other
related operations
01:17:03.110 --> 01:17:04.860
you can do once you have that.
01:17:04.860 --> 01:17:09.650
So for example, now I can
ask, what is the probability
01:17:09.650 --> 01:17:12.590
not just that data
lies out in the tail,
01:17:12.590 --> 01:17:16.280
but what are the probabilities
that it also or instead lies
01:17:16.280 --> 01:17:17.660
in the middle region?
01:17:17.660 --> 01:17:20.960
They're all the same
kinds of operations.
01:17:20.960 --> 01:17:25.190
And so for example,
here's a quick tabulation
01:17:25.190 --> 01:17:28.130
for three different
kinds of examples,
01:17:28.130 --> 01:17:30.620
where I'm asking not
what is out in the tails,
01:17:30.620 --> 01:17:35.420
but I'm asking what is within
the center plus/minus 1 sigma
01:17:35.420 --> 01:17:37.220
region of the data?
01:17:37.220 --> 01:17:40.010
And if you look
very carefully, I'm
01:17:40.010 --> 01:17:44.060
using exactly these
Cumulative Probability Density
01:17:44.060 --> 01:17:45.950
functions for the unit normal.
01:17:45.950 --> 01:17:47.750
This is for a unit normal.
01:17:50.930 --> 01:17:53.570
And looking out, what's
the cumulative probability
01:17:53.570 --> 01:17:54.680
over in the left tail?
01:17:54.680 --> 01:17:55.550
The right tail?
01:17:55.550 --> 01:17:57.170
Doing those observations.
01:17:57.170 --> 01:17:59.840
But these are also very
nice rules of thumb
01:17:59.840 --> 01:18:07.740
to have ready for you, which
is saying within plus/minus 1
01:18:07.740 --> 01:18:12.810
standard deviation in the
normal, 68% of your data
01:18:12.810 --> 01:18:16.070
is going to fall in
that 1 sigma region.
01:18:16.070 --> 01:18:21.540
In the case of if I
expand out to 2 sigma,
01:18:21.540 --> 01:18:26.140
now I've got 95% of my data
should fall roughly in there.
01:18:26.140 --> 01:18:29.340
And if I expand out even
further to the 3 sigma,
01:18:29.340 --> 01:18:33.690
that's the 99.7% of your
data would be falling--
01:18:33.690 --> 01:18:39.900
should fall within those center
three standard deviations.
01:18:43.220 --> 01:18:45.170
So the percentage
points out there,
01:18:45.170 --> 01:18:48.950
the part that falls outside
of that, is about 3 and a--
01:18:48.950 --> 01:18:50.900
3 and 1,000.
01:18:50.900 --> 01:18:54.590
We'll come back to this when we
see statistical process control
01:18:54.590 --> 01:18:58.470
and control charts because you
may have run into these control
01:18:58.470 --> 01:18:58.970
charts.
01:18:58.970 --> 01:19:03.980
We're often plotting the
3 sigma control limits.
01:19:03.980 --> 01:19:05.690
And essentially
what we're saying
01:19:05.690 --> 01:19:10.070
is only a very small
fraction of my data--
01:19:10.070 --> 01:19:13.100
3 out of 1,000, if
I'm using plus/minus
01:19:13.100 --> 01:19:14.870
3 sigma control limits.
01:19:14.870 --> 01:19:18.140
3 out of 1,000 points of
my data, by random chance
01:19:18.140 --> 01:19:22.775
alone, should be falling
outside of those 3 sigma bounds.
01:19:25.440 --> 01:19:33.150
So that starts to get as close
to statistical process control.
01:19:33.150 --> 01:19:36.470
So what we're going
to do next time
01:19:36.470 --> 01:19:41.030
is start to look a little bit
more closely at statistics.
01:19:41.030 --> 01:19:46.110
When I do form, again,
things like the sample
01:19:46.110 --> 01:19:53.990
mean, or I form the sample
standard deviation or sample
01:19:53.990 --> 01:19:58.880
variance from my
data, those themselves
01:19:58.880 --> 01:20:02.000
have these probability
densities associated with them.
01:20:02.000 --> 01:20:05.810
And from that, we're going
to be able to go backwards
01:20:05.810 --> 01:20:13.040
and essentially work to
try to understand things
01:20:13.040 --> 01:20:16.490
about the underlying process
distribution, the parent
01:20:16.490 --> 01:20:20.700
probability distribution
function, associated with that.
01:20:20.700 --> 01:20:22.910
So we're going to
have to understand
01:20:22.910 --> 01:20:29.990
more complicated PDFs than
the normal distribution
01:20:29.990 --> 01:20:32.090
because things like
the sample variance
01:20:32.090 --> 01:20:34.730
is not going to be
normally distributed.
01:20:34.730 --> 01:20:39.020
It's going to have its
own bizarre distribution--
01:20:39.020 --> 01:20:41.250
in this case, the
chi-square distribution.
01:20:41.250 --> 01:20:44.330
So we'll return to looking at
some additional distributions,
01:20:44.330 --> 01:20:47.730
but these same manipulations
will come up again.
01:20:47.730 --> 01:20:51.590
And what we're ultimately going
to want to be able to do is
01:20:51.590 --> 01:20:54.890
make inferences about the
underlying distribution--
01:20:54.890 --> 01:20:56.660
the parent process--
01:20:56.660 --> 01:20:59.510
what its mean is,
what its variance is,
01:20:59.510 --> 01:21:02.990
based on the calculated sample
mean and sample variance
01:21:02.990 --> 01:21:06.080
that we might be using,
and then also make
01:21:06.080 --> 01:21:10.040
inferences about the
likelihood that the true mean
01:21:10.040 --> 01:21:12.200
lies in certain ranges.
01:21:12.200 --> 01:21:15.380
Or to put it another
way, next time,
01:21:15.380 --> 01:21:20.360
we'll also be talking
about confidence intervals.
01:21:20.360 --> 01:21:22.610
So we'll see you on Thursday.
01:21:22.610 --> 01:21:30.500
Watch for the message from
Hayden about tours and enjoy.