WEBVTT
00:00:00.000 --> 00:00:02.430
The following content is
provided under a Creative
00:00:02.430 --> 00:00:03.730
Commons license.
00:00:03.730 --> 00:00:06.030
Your support will help
MIT OpenCourseWare
00:00:06.030 --> 00:00:10.060
continue to offer high quality
educational resources for free.
00:00:10.060 --> 00:00:12.660
To make a donation or to
view additional materials
00:00:12.660 --> 00:00:16.560
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:16.560 --> 00:00:17.874
at ocw.mit.edu.
00:00:21.838 --> 00:00:23.880
PROFESSOR: So as I was
saying, what we want to do
00:00:23.880 --> 00:00:28.270
is get up through use of some of
these statistical distributions
00:00:28.270 --> 00:00:32.729
for making hypothesis tests and
understanding the relationship
00:00:32.729 --> 00:00:40.670
of probabilities associated
with hypotheses such as a point
00:00:40.670 --> 00:00:43.880
belongs to this distribution
or that distribution.
00:00:43.880 --> 00:00:46.610
And that will set the
ground for talking
00:00:46.610 --> 00:00:50.510
about statistical process
control and SPC charting, where
00:00:50.510 --> 00:00:53.330
you're asking the question
of a new piece of data
00:00:53.330 --> 00:00:56.150
off of the manufacturing
line, does that piece of data
00:00:56.150 --> 00:00:59.180
come from the in-control
distribution?
00:00:59.180 --> 00:01:02.850
Or does it come from some
out-of-control distribution?
00:01:02.850 --> 00:01:06.290
So it's all about
probabilities on SPC charts.
00:01:06.290 --> 00:01:10.880
And we want to build up
the rest of the machinery
00:01:10.880 --> 00:01:12.230
that we need for that today.
00:01:15.310 --> 00:01:18.490
To do that, one of
the subtle things
00:01:18.490 --> 00:01:20.680
that we have to understand
a bit more about
00:01:20.680 --> 00:01:24.190
is sampling and
sampling distributions.
00:01:24.190 --> 00:01:26.350
And really what we're
dealing with here
00:01:26.350 --> 00:01:30.790
is issues of the
use of statistics
00:01:30.790 --> 00:01:32.680
dealing with observed data.
00:01:32.680 --> 00:01:35.080
And I have this
philosophical picture
00:01:35.080 --> 00:01:37.390
of what I think of as the
meaning of statistics.
00:01:37.390 --> 00:01:41.590
And that's our real goal in
statistics: to reason about,
00:01:41.590 --> 00:01:47.520
think about, and be able
to argue about processes--
00:01:47.520 --> 00:01:51.760
in our case, real
manufacturing processes--
00:01:51.760 --> 00:01:53.950
when there's uncertainty
in those processes.
00:01:53.950 --> 00:01:55.030
There is noise.
00:01:55.030 --> 00:01:57.070
There's other things
we don't know.
00:01:57.070 --> 00:01:58.750
But the key idea
in statistics is
00:01:58.750 --> 00:02:00.040
we are getting some evidence.
00:02:00.040 --> 00:02:01.300
We're getting some data.
00:02:01.300 --> 00:02:02.740
And what we want
to be able to do
00:02:02.740 --> 00:02:07.000
is use that data to start
to infer things back
00:02:07.000 --> 00:02:10.060
about the underlying
population, the underlying
00:02:10.060 --> 00:02:12.110
process or distribution.
00:02:12.110 --> 00:02:16.780
So there are some
preconditions in here.
00:02:16.780 --> 00:02:21.100
A lot of what I said here
is we're reasoning based
00:02:21.100 --> 00:02:23.770
on evidence from observed data.
00:02:23.770 --> 00:02:27.220
But that really means we
are taking fundamentally
00:02:27.220 --> 00:02:31.160
a probability model
of what's going on.
00:02:31.160 --> 00:02:33.580
And we talked last
time, for example,
00:02:33.580 --> 00:02:39.160
about assumptions with normal
distributions and parameters
00:02:39.160 --> 00:02:41.220
of normal distributions.
00:02:41.220 --> 00:02:44.260
And what we're going to do
today is focus a little bit more
00:02:44.260 --> 00:02:49.720
on evidence coming from
finite sets of observations,
00:02:49.720 --> 00:02:53.260
drawn from that population,
and then calculations we
00:02:53.260 --> 00:02:54.840
do on that--
00:02:54.840 --> 00:02:58.200
simple calculations, like
calculating the sample mean.
00:02:58.200 --> 00:03:01.020
And then we have this
number, this sample mean.
00:03:01.020 --> 00:03:02.740
What's it really telling us?
00:03:02.740 --> 00:03:07.320
What can we infer back about
the underlying distribution--
00:03:07.320 --> 00:03:12.130
what the true mean of the
underlying population is?
00:03:12.130 --> 00:03:16.960
And then a little bit later,
we'll flesh this out more.
00:03:16.960 --> 00:03:22.360
But already, even as we start
building these simple arguments
00:03:22.360 --> 00:03:25.570
based on our data, we have
an underlying implicit model
00:03:25.570 --> 00:03:26.660
of the process.
00:03:26.660 --> 00:03:28.900
It may be a purely
probabilistic model,
00:03:28.900 --> 00:03:32.830
saying it has a certain mean
and a Gaussian distribution
00:03:32.830 --> 00:03:34.780
or a certain mean in a normal--
00:03:34.780 --> 00:03:37.570
or a uniform or a Poisson.
00:03:37.570 --> 00:03:39.010
There is a model there.
00:03:39.010 --> 00:03:44.680
And so we have to keep in
mind that it is only a model.
00:03:44.680 --> 00:03:47.140
A little bit later,
we'll also build up
00:03:47.140 --> 00:03:49.630
other kinds of
functional relationships
00:03:49.630 --> 00:03:52.210
when we get to things like
response surface modeling.
00:03:52.210 --> 00:03:55.090
But for now, these are
relatively simple models,
00:03:55.090 --> 00:04:00.160
mostly focused on the
probabilistic or stochastic
00:04:00.160 --> 00:04:02.330
nature of that.
00:04:02.330 --> 00:04:05.370
So here's the plan for today.
00:04:05.370 --> 00:04:08.190
What we're going to do
is talk a little bit
00:04:08.190 --> 00:04:11.530
about sampling distributions.
00:04:11.530 --> 00:04:13.590
We touched on this a
little bit last time
00:04:13.590 --> 00:04:15.285
when we talked about
the distribution
00:04:15.285 --> 00:04:21.310
of the sum of random variables
and the central limit theorem,
00:04:21.310 --> 00:04:24.090
where the sum or
the average always
00:04:24.090 --> 00:04:26.190
tends towards the normal.
00:04:26.190 --> 00:04:30.410
In some of the cases, we're
going to be calculating things,
00:04:30.410 --> 00:04:33.080
like the sample s
squared, the sample
00:04:33.080 --> 00:04:36.570
variance, that are not going
to be normally distributed.
00:04:36.570 --> 00:04:38.690
They will have other
statistical shapes
00:04:38.690 --> 00:04:43.670
or statistical distributions
such as the chi-squared.
00:04:43.670 --> 00:04:46.970
There will be other cases where
the student t-distribution is
00:04:46.970 --> 00:04:47.480
operable.
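That chi-squared point can be sketched numerically. Below is a small Python simulation (standard library only; the sigma, n, and repetition count are illustrative assumptions): for a normal parent, the scaled sample variance (n-1)s^2/sigma^2 follows a chi-squared distribution with n-1 degrees of freedom, so its average over many samples should sit near n-1.

```python
import random
import statistics

random.seed(3)

sigma = 2.0   # illustrative parent standard deviation (assumed)
n = 6         # illustrative sample size (assumed)

# Draw many samples of size n from a normal parent and compute the
# scaled sample variance (n - 1) * s^2 / sigma^2 for each one.
scaled = [
    (n - 1) * statistics.variance(
        [random.gauss(0.0, sigma) for _ in range(n)]
    ) / sigma**2
    for _ in range(40_000)
]

# This statistic follows a chi-squared distribution with n - 1 degrees
# of freedom, whose mean is n - 1 = 5, so the average should be near 5.
print(round(statistics.mean(scaled), 1))
```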
00:04:47.480 --> 00:04:51.650
So we want to get a sense of
these sampling distributions
00:04:51.650 --> 00:04:56.450
and understand how to use
those to make not only point
00:04:56.450 --> 00:04:58.720
estimates--
00:04:58.720 --> 00:05:01.930
that is our best guess of things
like the underlying population
00:05:01.930 --> 00:05:02.540
mean--
00:05:02.540 --> 00:05:06.100
but also confidence intervals--
where, with some probability,
00:05:06.100 --> 00:05:09.820
we think the true mean lies or
where, with some probability,
00:05:09.820 --> 00:05:12.250
we think the true
variance lies based
00:05:12.250 --> 00:05:14.810
on one set of observations.
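The confidence-interval idea can be made concrete with a tiny numeric sketch. The data and the "known" sigma below are hypothetical, and this is the known-variance z-interval, not the t-based interval that comes up later with small samples.

```python
import math
import statistics

# Hypothetical observations and an assumed known population sigma.
data = [2.1, 1.9, 2.4, 2.0, 1.8]
sigma = 0.25
n = len(data)

x_bar = statistics.mean(data)              # point estimate of the mean
half_width = 1.96 * sigma / math.sqrt(n)   # 95% z-interval half-width

# With roughly 95% confidence, the true mean lies in this interval.
print((round(x_bar - half_width, 3), round(x_bar + half_width, 3)))
```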
00:05:14.810 --> 00:05:19.420
So that's where the sampling
distributions come into play.
00:05:19.420 --> 00:05:23.440
And we'll talk about the
effects of sample size on that
00:05:23.440 --> 00:05:26.440
as well as things like
what kind of inferences,
00:05:26.440 --> 00:05:29.920
these point and confidence
interval inferences
00:05:29.920 --> 00:05:33.500
we can make on those.
00:05:33.500 --> 00:05:37.250
And then, again, leading up
towards hypothesis testing.
00:05:37.250 --> 00:05:42.380
And then really, this
will be for next time.
00:05:42.380 --> 00:05:44.315
We'll dive into SPC charts.
00:05:49.150 --> 00:05:52.480
So here's how we typically
are using sampling.
00:05:52.480 --> 00:05:54.460
We have some underlying--
00:05:54.460 --> 00:05:58.600
I'll refer to it as the
population distribution
00:05:58.600 --> 00:06:00.850
or sometimes the
parent distribution.
00:06:00.850 --> 00:06:05.140
It's the set or universe
of all possible parts,
00:06:05.140 --> 00:06:07.150
say, coming off your
manufacturing line
00:06:07.150 --> 00:06:10.600
or all possible observations.
00:06:10.600 --> 00:06:13.210
What we're typically
going to do is just draw
00:06:13.210 --> 00:06:16.390
some number, some finite
number, of samples,
00:06:16.390 --> 00:06:20.600
some n samples of
the process output--
00:06:20.600 --> 00:06:25.360
so some x sub i drawn
from a parent distribution
00:06:25.360 --> 00:06:28.090
with some PDF p.
00:06:28.090 --> 00:06:30.430
And what we're going to
be doing is calculating
00:06:30.430 --> 00:06:33.190
the sample mean, sample
variance, and other sorts
00:06:33.190 --> 00:06:35.130
of sample statistics.
00:06:35.130 --> 00:06:39.670
A key point here is
the underlying process.
00:06:39.670 --> 00:06:43.900
That basic variable x has
a probability distribution
00:06:43.900 --> 00:06:46.820
function associated with it.
00:06:46.820 --> 00:06:50.890
This new variable x
bar that we calculate,
00:06:50.890 --> 00:06:52.960
this statistic
that we calculate,
00:06:52.960 --> 00:06:56.770
also has a probability density
function associated with it.
00:06:56.770 --> 00:06:59.320
And it's a different
one than the parent one.
00:06:59.320 --> 00:07:01.360
And so what we'll
need to understand
00:07:01.360 --> 00:07:05.350
is what those
probability distributions
00:07:05.350 --> 00:07:07.810
are that arise from
sampling, and then
00:07:07.810 --> 00:07:10.900
how to work backwards from
those to make inferences
00:07:10.900 --> 00:07:12.860
about the parent.
00:07:12.860 --> 00:07:13.850
Now, a quick thing.
00:07:13.850 --> 00:07:16.340
I guess there's both
definitions on this slide,
00:07:16.340 --> 00:07:23.750
but also a quick thing about
definitions or terminology
00:07:23.750 --> 00:07:28.290
or notation that I like to use.
00:07:28.290 --> 00:07:31.430
And in particular, I'm,
again, distinguishing
00:07:31.430 --> 00:07:36.080
between the population or
parent distribution, and then
00:07:36.080 --> 00:07:38.740
these sample statistics.
00:07:38.740 --> 00:07:42.820
And typically when I talk
about "truth" or the population
00:07:42.820 --> 00:07:47.110
as a whole, we're
using Greek variables
00:07:47.110 --> 00:07:54.850
like mu, sigma, and rho_xy for
the correlation coefficient.
00:07:54.850 --> 00:07:58.720
And those expectations,
those different moments,
00:07:58.720 --> 00:08:02.470
are calculated over
the entire population.
00:08:02.470 --> 00:08:04.360
Typically we're doing
those analytically
00:08:04.360 --> 00:08:08.080
if we have a closed
form description of what
00:08:08.080 --> 00:08:10.670
the population is.
00:08:10.670 --> 00:08:13.340
In contrast, I'm
going to typically use
00:08:13.340 --> 00:08:15.470
Roman characters--
00:08:15.470 --> 00:08:20.390
x, s, and r_xy, for example--
00:08:20.390 --> 00:08:24.680
to indicate the finite
sample statistics
00:08:24.680 --> 00:08:29.630
calculated from some n
number of observations.
00:08:29.630 --> 00:08:34.159
And so that's when we have
a finite discrete number
00:08:34.159 --> 00:08:36.230
of observations.
00:08:36.230 --> 00:08:39.409
And we have simple formulas
for the calculation
00:08:39.409 --> 00:08:42.450
of those statistics.
00:08:42.450 --> 00:08:44.130
A little bit later
in the term, we
00:08:44.130 --> 00:08:48.690
will come back and start to
look in particular at covariance
00:08:48.690 --> 00:08:51.760
and correlation between two
different random variables,
00:08:51.760 --> 00:08:53.218
some x and y.
00:08:53.218 --> 00:08:55.260
Those are especially
important when we're looking
00:08:55.260 --> 00:08:58.020
for functional dependencies.
00:08:58.020 --> 00:09:01.530
Right now, we're simply
looking at one set of data
00:09:01.530 --> 00:09:05.700
or one population,
one random variable x.
00:09:05.700 --> 00:09:08.700
So we'll focus on
univariate stuff today.
00:09:13.150 --> 00:09:15.550
There is a term,
"random sampling,"
00:09:15.550 --> 00:09:18.100
that actually has a
technical definition that I
00:09:18.100 --> 00:09:23.470
want to point out that's very
close to the intuitive notion
00:09:23.470 --> 00:09:24.160
here.
00:09:24.160 --> 00:09:26.440
But it actually is a
little bit stronger
00:09:26.440 --> 00:09:29.890
in requirements
for its definition.
00:09:29.890 --> 00:09:33.760
We said sampling is this act of
taking some finite observations
00:09:33.760 --> 00:09:35.500
out of a population.
00:09:35.500 --> 00:09:41.260
Random sampling is when every
observation that we pull
00:09:41.260 --> 00:09:45.970
is identically distributed,
has the same PDF associated
00:09:45.970 --> 00:09:50.740
with it, and is independent
from any other sample that we
00:09:50.740 --> 00:09:54.460
pull from that population.
00:09:54.460 --> 00:09:57.250
And this would not always
naturally be the case--
00:09:57.250 --> 00:10:01.000
if you had, for example,
finite populations,
00:10:01.000 --> 00:10:03.490
and you pulled out a sample,
held it in your hand,
00:10:03.490 --> 00:10:08.560
recorded it, pulled out
another sample, for example.
00:10:08.560 --> 00:10:15.830
Imagine that you've got a bag
with 17 blue and red marbles in it.
00:10:15.830 --> 00:10:19.840
And I pull a marble
out, and it's red.
00:10:19.840 --> 00:10:24.670
I hold it in my hand, and
I pull another marble out.
00:10:24.670 --> 00:10:27.400
Do you think I'm sampling
from the same underlying
00:10:27.400 --> 00:10:28.390
distribution?
00:10:28.390 --> 00:10:32.530
No, because I did not
replace that original marble.
00:10:32.530 --> 00:10:34.930
So now the mix of
blue and red marbles
00:10:34.930 --> 00:10:37.630
is different within that
bag, and the probability
00:10:37.630 --> 00:10:38.420
is different.
00:10:38.420 --> 00:10:43.000
It is not identical and
independent anymore.
00:10:43.000 --> 00:10:46.240
The observation that I made
first, based on the first draw,
00:10:46.240 --> 00:10:51.670
changes the probability
for later draws, changes--
00:10:51.670 --> 00:10:56.050
there is dependence
as well as no longer
00:10:56.050 --> 00:10:58.600
an identical distribution.
00:10:58.600 --> 00:11:04.030
So when we do random sampling,
as I'm defining it here,
00:11:04.030 --> 00:11:06.040
and random sampling
for calculation
00:11:06.040 --> 00:11:08.350
of some of these
sampling distributions,
00:11:08.350 --> 00:11:11.510
we're assuming if it's coming
from a finite population,
00:11:11.510 --> 00:11:14.440
you would always put
the observation back in
00:11:14.440 --> 00:11:18.640
and do another sample
from the same pool.
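The marble example can be worked out with exact probabilities. The 10/7 split of the 17 marbles below is an assumption for illustration; only the total of 17 is given.

```python
from fractions import Fraction

blue, red = 10, 7          # assumed split of the 17 marbles
total = blue + red

# Without replacement: the first draw changes the bag, so the second
# draw is neither independent of the first nor identically distributed.
p_red_first = Fraction(red, total)
p_red_second_given_red_first = Fraction(red - 1, total - 1)
assert p_red_second_given_red_first != p_red_first

# With replacement: the marble goes back, every draw sees the same bag,
# and the draws are IID -- the "random sampling" assumption.
p_red_second_with_replacement = Fraction(red, total)
assert p_red_second_with_replacement == p_red_first
```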
00:11:18.640 --> 00:11:20.770
Typically what
you're often doing
00:11:20.770 --> 00:11:22.600
is assuming there's
no connection from one
00:11:22.600 --> 00:11:24.640
to the other, and the
same process, physics,
00:11:24.640 --> 00:11:27.950
is operable from one
point in time to the next.
00:11:27.950 --> 00:11:31.570
So we are typically making
this IID, this Independent
00:11:31.570 --> 00:11:37.690
and Identically
Distributed assumption.
00:11:37.690 --> 00:11:40.120
And then we're going to,
again, as I said, calculate
00:11:40.120 --> 00:11:42.250
some statistics from those.
00:11:42.250 --> 00:11:44.170
Ultimately, when
you have a sample--
00:11:44.170 --> 00:11:48.130
a sample of size 14, say, drawn
from a big population.
00:11:48.130 --> 00:11:49.750
You calculate x bar.
00:11:49.750 --> 00:11:50.890
What do you get?
00:11:50.890 --> 00:11:51.580
A number.
00:11:51.580 --> 00:11:56.560
You get an actual number
because I observed those 14
00:11:56.560 --> 00:11:58.900
things, measured length
or whatever it was
00:11:58.900 --> 00:12:02.720
that I was measuring on those.
00:12:02.720 --> 00:12:05.500
And so a key point here
is that the statistic
00:12:05.500 --> 00:12:10.010
is a function of the
sample and the sample data.
00:12:10.010 --> 00:12:13.710
And so it's actually a
value that you can compute.
00:12:13.710 --> 00:12:18.690
If I do that, I grab one
sample, I calculate that x bar,
00:12:18.690 --> 00:12:21.200
I've got one number.
00:12:21.200 --> 00:12:24.640
If I were to go back
and draw another sample
00:12:24.640 --> 00:12:27.850
from that distribution,
I get a different number.
00:12:27.850 --> 00:12:29.700
And so if I keep
going back and drawing
00:12:29.700 --> 00:12:31.860
multiple, multiple
samples, that's
00:12:31.860 --> 00:12:35.280
how you build up a distribution
function associated
00:12:35.280 --> 00:12:39.490
with that statistic,
that calculation.
00:12:39.490 --> 00:12:43.590
So that's where this notion of
statistics, x bar or whatever,
00:12:43.590 --> 00:12:48.030
as a random variable
also comes into play.
00:12:48.030 --> 00:12:50.330
For any one sample
it's a number.
00:12:50.330 --> 00:12:55.440
But when I go and take multiple
samples, multiple sets, of n,
00:12:55.440 --> 00:13:01.160
now I build up a distribution
function associated with those.
00:13:01.160 --> 00:13:04.870
I'm going to switch here to--
00:13:04.870 --> 00:13:05.950
or do I want to?
00:13:05.950 --> 00:13:06.670
No.
00:13:06.670 --> 00:13:08.740
I'm going to switch to--
00:13:08.740 --> 00:13:11.680
this is here on the web.
00:13:11.680 --> 00:13:15.835
I mentioned last time
this very nice website.
00:13:18.780 --> 00:13:20.430
I don't even know
what the acronym
00:13:20.430 --> 00:13:22.620
stands for-- this SticiGui.
00:13:22.620 --> 00:13:25.650
It's out of the Department
of Statistics at Berkeley.
00:13:25.650 --> 00:13:28.720
It's got a lot of different--
00:13:28.720 --> 00:13:31.960
I guess sort of an online
course kind of thing.
00:13:31.960 --> 00:13:37.160
But what I really like
in this is the Tools tab.
00:13:37.160 --> 00:13:40.050
So if I go to that Tools tab--
00:13:40.050 --> 00:13:41.840
let me do that--
00:13:41.840 --> 00:13:45.920
it's got a number of these
little Java utilities online.
00:13:45.920 --> 00:13:48.560
And one that I want
to look at here first
00:13:48.560 --> 00:13:51.030
is sampling distributions.
00:13:51.030 --> 00:13:51.650
So let's see.
00:13:51.650 --> 00:13:52.400
Let this load.
00:13:55.900 --> 00:13:57.910
Loading up Java, here.
00:13:57.910 --> 00:14:02.550
So here's an example
of sampling from some
00:14:02.550 --> 00:14:04.390
a priori distribution.
00:14:04.390 --> 00:14:09.310
And this is actually drawing
from a uniform distribution
00:14:09.310 --> 00:14:13.540
with discrete values,
0, 1, 2, 3, and 4.
00:14:13.540 --> 00:14:16.150
So that's our underlying
true population,
00:14:16.150 --> 00:14:18.740
and they all have
equal probabilities.
00:14:18.740 --> 00:14:21.460
And what I'm going to
do is calculate a--
00:14:21.460 --> 00:14:24.780
I'm going to draw a sample
down here at the bottom.
00:14:24.780 --> 00:14:26.400
It's a sample of size 5.
00:14:26.400 --> 00:14:30.360
So I'm going to do random
sampling with replacement.
00:14:30.360 --> 00:14:33.060
So I'm going to draw five
independent and identically
00:14:33.060 --> 00:14:36.950
distributed samples out of that
underlying parent distribution.
00:14:36.950 --> 00:14:39.420
And then I'm going to
calculate some statistic.
00:14:39.420 --> 00:14:42.670
What I want to do is to actually
calculate the sample mean.
00:14:42.670 --> 00:14:46.930
So there in blue is our
underlying population.
00:14:46.930 --> 00:14:50.820
Let me take one sample of
size 5, calculate the mean,
00:14:50.820 --> 00:14:51.660
and plot it.
00:14:51.660 --> 00:14:52.260
There it is.
00:14:52.260 --> 00:14:55.080
It's a mean of 1.4.
00:14:55.080 --> 00:14:56.510
Let me take another sample.
00:14:56.510 --> 00:14:57.690
I take another sample.
00:14:57.690 --> 00:15:01.190
Do you think the value
is going to be 1.4 again?
00:15:01.190 --> 00:15:01.970
It might be.
00:15:01.970 --> 00:15:02.930
AUDIENCE: Might be.
00:15:02.930 --> 00:15:04.490
PROFESSOR: But
probably not, right?
00:15:04.490 --> 00:15:05.960
Let's see what happens.
00:15:05.960 --> 00:15:08.060
There it is-- 2.4.
00:15:08.060 --> 00:15:09.020
Let me do a few more.
00:15:12.370 --> 00:15:14.460
So the green bars
are popping up,
00:15:14.460 --> 00:15:18.240
as I think I've done something
like 1, 2, 3, 4, 5, 6--
00:15:18.240 --> 00:15:21.510
something like 8 different
samples, each of size 5,
00:15:21.510 --> 00:15:23.980
plotted the mean.
00:15:23.980 --> 00:15:25.980
Now to speed things
up, I can keep
00:15:25.980 --> 00:15:28.440
taking more and more samples.
00:15:28.440 --> 00:15:31.040
What distribution do you
think this is trending to?
00:15:31.040 --> 00:15:31.860
AUDIENCE: Normal.
00:15:31.860 --> 00:15:33.098
PROFESSOR: Normal.
00:15:33.098 --> 00:15:34.890
Down here at the bottom,
I can take samples
00:15:34.890 --> 00:15:36.910
that are a little bit larger.
00:15:36.910 --> 00:15:38.590
Or let me take--
00:15:38.590 --> 00:15:41.020
excuse me, we take--
the thing tells me
00:15:41.020 --> 00:15:44.080
how many samples I'm taking,
so I don't have to just take
00:15:44.080 --> 00:15:45.790
one sample of five, plot it.
00:15:45.790 --> 00:15:50.060
I can take 10 samples,
each of size 5, and plot them.
00:15:50.060 --> 00:15:53.710
So it's just speeding
up my button clicks
00:15:53.710 --> 00:15:57.205
so that we can get a little
bit better shape on that.
00:15:57.205 --> 00:15:58.080
So there's the point.
00:15:58.080 --> 00:15:59.700
That's a very fascinating point.
00:15:59.700 --> 00:16:01.560
I find it fascinating
that I can sample
00:16:01.560 --> 00:16:05.310
from a non-normal
distribution, take the average,
00:16:05.310 --> 00:16:11.930
the sample average, x bar, and
over lots and lots of sampling,
00:16:11.930 --> 00:16:15.460
I get a normal distribution.
00:16:15.460 --> 00:16:16.030
What else?
00:16:16.030 --> 00:16:18.640
What other observations
or what other points
00:16:18.640 --> 00:16:23.930
might you make about
that green distribution?
00:16:23.930 --> 00:16:27.930
What do you think is true
about that green distribution?
00:16:27.930 --> 00:16:29.850
There's a really
important fact which
00:16:29.850 --> 00:16:33.750
motivates why we can't
calculate x bars all the time
00:16:33.750 --> 00:16:35.430
and believe the
numbers that come out
00:16:35.430 --> 00:16:37.710
of an x bar calculation.
00:16:41.070 --> 00:16:42.990
AUDIENCE: It's
centered around 2.
00:16:42.990 --> 00:16:44.410
PROFESSOR: It's
centered around 2.
00:16:44.410 --> 00:16:46.440
Out of the numbers
0, 1, 2, 3, and 4,
00:16:46.440 --> 00:16:49.590
what do you think the average
is-- the true average?
00:16:49.590 --> 00:16:51.030
2.
00:16:51.030 --> 00:16:58.360
So one thing that's very
nice about the sample mean
00:16:58.360 --> 00:17:04.130
is that it trends toward
the true population mean.
00:17:04.130 --> 00:17:05.630
It's unbiased.
00:17:05.630 --> 00:17:11.390
That if I were to
take enough samples,
00:17:11.390 --> 00:17:18.630
the average of or the mean of
all of these sample averages
00:17:18.630 --> 00:17:22.200
is equal to the true
underlying population mean.
00:17:22.200 --> 00:17:22.980
It's unbiased.
00:17:22.980 --> 00:17:25.680
Doesn't have a bias or
delta, a fixed delta,
00:17:25.680 --> 00:17:28.200
a fixed offset error in it.
00:17:28.200 --> 00:17:30.690
It is an unbiased estimator.
00:17:30.690 --> 00:17:35.040
So I can take lots
and build that up.
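That demo is easy to mimic in a few lines of Python (standard library only; the sample counts and seed are arbitrary): draw many samples of size 5, with replacement, from the discrete uniform population {0, 1, 2, 3, 4} and average the resulting sample means.

```python
import random
import statistics

random.seed(0)

population = [0, 1, 2, 3, 4]   # discrete uniform parent, true mean 2
n = 5                          # sample size, as in the demo
num_samples = 10_000           # how many times we repeat the sampling

# random.choices samples with replacement, i.e. IID draws.
sample_means = [
    statistics.mean(random.choices(population, k=n))
    for _ in range(num_samples)
]

# Any single x_bar is just a number (1.4, 2.4, ...), but the average of
# many of them converges on the true population mean: x_bar is unbiased.
print(round(statistics.mean(sample_means), 1))
```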
00:17:35.040 --> 00:17:36.630
Turns out there's
another thing that's
00:17:36.630 --> 00:17:40.190
true which I don't
want to go into
00:17:40.190 --> 00:17:42.140
and don't want to try to prove.
00:17:42.140 --> 00:17:48.890
But it turns out that the sample
mean is also not only unbiased,
00:17:48.890 --> 00:17:52.250
but it's also the
minimum error estimator.
00:17:52.250 --> 00:17:56.540
So on average, it's the
best estimator of the mean
00:17:56.540 --> 00:18:01.400
that you can use as a statistic,
meaning its distribution
00:18:01.400 --> 00:18:03.260
in some sense is the narrowest.
00:18:03.260 --> 00:18:06.770
The x bar distribution is
the narrowest estimator
00:18:06.770 --> 00:18:13.370
you can have for trying to
estimate the true population mean
00:18:13.370 --> 00:18:16.700
based on your observations.
00:18:16.700 --> 00:18:19.160
Now another important
thing that comes up
00:18:19.160 --> 00:18:20.720
here is at least a
few of the times,
00:18:20.720 --> 00:18:27.940
I got a sample
mean that was 0.6.
00:18:27.940 --> 00:18:28.780
Is it wrong?
00:18:33.930 --> 00:18:36.560
If you do just one sample,
it's quite possible,
00:18:36.560 --> 00:18:40.700
out of this set of values 0 to 4,
I drew a sample of size 5.
00:18:40.700 --> 00:18:43.190
I might have gotten
a value of 0.6.
00:18:43.190 --> 00:18:45.080
That's all the data you have.
00:18:45.080 --> 00:18:47.300
What's your best guess
for the true mean
00:18:47.300 --> 00:18:50.200
of the underlying population?
00:18:50.200 --> 00:18:53.320
That 0.6, whatever
that value was.
00:18:53.320 --> 00:18:55.890
But now there is
some spread on it.
00:18:55.890 --> 00:18:59.490
And so if you're
wise, you would also
00:18:59.490 --> 00:19:02.700
start to want to hedge your
bets a little bit here, right?
00:19:02.700 --> 00:19:06.760
You want to be able to
say, my best guess is 0.6.
00:19:06.760 --> 00:19:10.510
But I think I'm only
drawing a sample of size 5.
00:19:10.510 --> 00:19:15.400
So I know there is, in fact,
this kind of Gaussian spread.
00:19:15.400 --> 00:19:17.320
And I think the
true mean probably
00:19:17.320 --> 00:19:19.910
lies within some range of that.
00:19:19.910 --> 00:19:23.020
And so you would like to have
this confidence interval idea.
00:19:23.020 --> 00:19:26.290
We'll get back to that
a little bit later.
00:19:26.290 --> 00:19:29.620
In fact, there's another
very nice little tool in here
00:19:29.620 --> 00:19:32.170
for illustrating
confidence intervals
00:19:32.170 --> 00:19:35.788
that we'll use at that point.
00:19:35.788 --> 00:19:37.580
I want to do one more
thing, and then we'll
00:19:37.580 --> 00:19:39.830
go back to the lecture slides.
00:19:39.830 --> 00:19:42.140
One of the neat things
you can do with this tool,
00:19:42.140 --> 00:19:43.670
and it's lots of
fun for you guys
00:19:43.670 --> 00:19:46.670
to connect up with
and play with,
00:19:46.670 --> 00:19:48.320
is you can change
the sample size.
00:19:50.830 --> 00:19:55.230
Let's say you wanted a
better or a tighter estimate
00:19:55.230 --> 00:19:57.132
for the x bar.
00:19:57.132 --> 00:19:58.590
You're not happy
with the idea that
00:19:58.590 --> 00:20:02.100
sometimes, with fairly
substantial probability,
00:20:02.100 --> 00:20:05.610
you might be off
by plus or minus 1.
00:20:05.610 --> 00:20:08.190
You have a substantial
probability
00:20:08.190 --> 00:20:12.450
of estimating, say, the--
00:20:12.450 --> 00:20:17.220
or guessing the sample mean
to be more than one value away
00:20:17.220 --> 00:20:20.400
from the true population mean.
00:20:20.400 --> 00:20:22.410
What might you do
to try to improve
00:20:22.410 --> 00:20:27.510
your likelihood of being closer
to the true mean when you're
00:20:27.510 --> 00:20:28.220
doing sampling?
00:20:28.220 --> 00:20:29.520
AUDIENCE: More samples.
00:20:29.520 --> 00:20:31.420
PROFESSOR: More samples.
00:20:31.420 --> 00:20:33.010
More samples?
00:20:33.010 --> 00:20:35.420
I guess you could
do more samples.
00:20:35.420 --> 00:20:39.400
But in some sense, really, that
taking one sample of size 5
00:20:39.400 --> 00:20:40.900
and another sample
of size 5, that's
00:20:40.900 --> 00:20:44.380
like one sample of size 10.
00:20:44.380 --> 00:20:45.040
Larger samples.
00:20:45.040 --> 00:20:46.540
AUDIENCE: Oh, yeah,
larger samples.
00:20:46.540 --> 00:20:47.870
PROFESSOR: Larger samples.
00:20:47.870 --> 00:20:54.700
So if I do that
here, let's take--
00:20:54.700 --> 00:20:59.170
instead of samples of size
5, let's do a modest increase
00:20:59.170 --> 00:21:02.420
first and take
samples of size 10.
00:21:02.420 --> 00:21:04.560
See what happens now.
00:21:04.560 --> 00:21:06.650
Oops, let me just do--
00:21:06.650 --> 00:21:08.860
OK, that's good.
00:21:08.860 --> 00:21:10.570
I'm taking a lot
of samples here.
00:21:10.570 --> 00:21:13.960
I've taken several hundred
samples, each of size 10.
00:21:13.960 --> 00:21:15.430
And sure enough,
that distribution
00:21:15.430 --> 00:21:17.350
is a little bit tighter.
00:21:17.350 --> 00:21:22.650
Let's say if I took a really
big sample, sample of size 100.
00:21:22.650 --> 00:21:26.180
Yeah, looking a lot tighter.
00:21:26.180 --> 00:21:29.360
So one question is, we know
as I take larger samples,
00:21:29.360 --> 00:21:30.920
the distribution gets tighter.
00:21:30.920 --> 00:21:33.440
One of the things we
want to do is understand
00:21:33.440 --> 00:21:40.260
how much tighter do they get as
a function of the sample size?
00:21:40.260 --> 00:21:43.220
So it turns out-- let
me go back now to--
00:21:52.270 --> 00:21:56.920
it turns out that if I'm
sampling from a parent
00:21:56.920 --> 00:22:01.060
distribution, the variance in
the estimate of that x bar,
00:22:01.060 --> 00:22:06.190
or the PDF, the variance
of x bar itself,
00:22:06.190 --> 00:22:08.710
shrinks with size n.
00:22:08.710 --> 00:22:13.030
And the variance in
fact scales as 1 over n.
00:22:13.030 --> 00:22:16.615
It scales inversely proportional
to the size of the sample.
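That 1/n scaling is easy to check empirically. A sketch (standard library only; the sample sizes and repetition count are assumed for illustration) using the same discrete uniform parent as before, whose population variance is 2.0:

```python
import random
import statistics

random.seed(1)

population = [0, 1, 2, 3, 4]
true_var = statistics.pvariance(population)   # sigma^2 = 2.0

def var_of_x_bar(n, reps=20_000):
    """Empirical variance of the sample mean for samples of size n."""
    means = [
        statistics.mean(random.choices(population, k=n))
        for _ in range(reps)
    ]
    return statistics.pvariance(means)

# Var(x_bar) should track sigma^2 / n as the sample size grows.
for n in (5, 10, 100):
    print(n, round(var_of_x_bar(n), 3), true_var / n)
```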
00:22:20.090 --> 00:22:24.650
That's true always as you take
larger numbers of samples.
00:22:24.650 --> 00:22:28.010
For this special case, if
my underlying population
00:22:28.010 --> 00:22:31.040
is in fact really a true--
00:22:31.040 --> 00:22:33.590
has a true probability
distribution function
00:22:33.590 --> 00:22:37.050
that was normal,
then it turns out
00:22:37.050 --> 00:22:41.010
that x bar is not just
trending towards the normal,
00:22:41.010 --> 00:22:45.180
but is itself, even for very
small numbers of samples, also
00:22:45.180 --> 00:22:47.280
a normal distribution.
00:22:47.280 --> 00:22:48.780
So in that little
demo I showed you,
00:22:48.780 --> 00:22:50.700
drawing from a
uniform distribution,
00:22:50.700 --> 00:22:53.400
for large enough n's,
large enough samples,
00:22:53.400 --> 00:22:55.530
large enough numbers
of samples, the mean
00:22:55.530 --> 00:22:58.320
does trend towards a Gaussian.
00:22:58.320 --> 00:23:01.830
But it's even a stronger
statement, a stronger
00:23:01.830 --> 00:23:05.970
relationship, if the underlying
population is itself normal.
00:23:05.970 --> 00:23:09.240
So let's say we start with an
underlying random variable,
00:23:09.240 --> 00:23:12.330
an underlying
process x, that has
00:23:12.330 --> 00:23:14.640
some mean and some variance.
00:23:14.640 --> 00:23:20.160
Now if I take samples of size 1
and plot out the distribution,
00:23:20.160 --> 00:23:22.070
what do you think it looks like?
00:23:22.070 --> 00:23:24.013
AUDIENCE: [INAUDIBLE]
00:23:24.013 --> 00:23:24.680
PROFESSOR: Yeah.
00:23:24.680 --> 00:23:25.560
I'm just repeating.
00:23:25.560 --> 00:23:30.920
I'm replicating my underlying
distribution, right?
00:23:30.920 --> 00:23:34.250
So part of the special
case of a sample of size 1,
00:23:34.250 --> 00:23:39.080
if I do that long enough, I
build up the same distribution.
00:23:39.080 --> 00:23:42.830
But now, if I take larger
numbers of samples,
00:23:42.830 --> 00:23:45.415
even a little bit
with n equals 2,
00:23:45.415 --> 00:23:46.790
again, we get that
effect that we
00:23:46.790 --> 00:23:51.530
saw with the SticiGui of the
narrowing of the distribution,
00:23:51.530 --> 00:23:55.250
PDF associated with the x bar.
00:23:55.250 --> 00:24:02.720
And in particular, the PDF or
the Probability Distribution
00:24:02.720 --> 00:24:06.410
Function associated
with x bar is exactly
00:24:06.410 --> 00:24:09.050
normal with the same mean--
00:24:09.050 --> 00:24:12.440
it's unbiased-- and
with reduced variance.
00:24:12.440 --> 00:24:15.830
So the variance
goes as 1 over n.
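A minimal check of that exact result for a normal parent (mu, sigma, n, and the repetition count are illustrative assumptions): even for a small n, the simulated x bar has mean mu and variance sigma^2 / n.

```python
import random
import statistics

random.seed(2)

mu, sigma = 10.0, 3.0   # assumed parent parameters
n = 4                   # deliberately small sample size

# x_bar from a normal parent is itself exactly normal, with the same
# mean (unbiased) and variance sigma^2 / n, even for small n.
means = [
    statistics.mean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(50_000)
]

print(round(statistics.mean(means), 1))       # should be near mu = 10
print(round(statistics.pvariance(means), 1))  # should be near 9/4 = 2.25
```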
00:24:15.830 --> 00:24:20.490
So we start with the
population distribution here,
00:24:20.490 --> 00:24:23.010
and we end up with a
sample mean distribution
00:24:23.010 --> 00:24:25.770
that is a different PDF.
00:24:25.770 --> 00:24:28.480
Everybody clear on this?
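[A minimal simulation of the point just made, assuming a normal population with made-up numbers (mu = 100, sigma = 10): the empirical spread of x bar tracks sigma over root n, so the variance of the sample mean goes as 1 over n.]

```python
import random
import statistics

# Illustrative sketch (not lecture code): draw many samples of size n
# from a normal population and measure the spread of the sample mean.
# MU and SIGMA are assumed values for the demo.

random.seed(0)
MU, SIGMA = 100.0, 10.0

def sample_mean_sd(n, trials=5000):
    """Empirical standard deviation of x-bar over many samples of size n."""
    means = [
        statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# Theory says sd(x-bar) = SIGMA / sqrt(n): the variance goes as 1/n.
for n in (1, 2, 8, 32):
    print(n, round(sample_mean_sd(n), 2), round(SIGMA / n ** 0.5, 2))
```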
00:24:28.480 --> 00:24:32.270
So key points--
a statistic itself is
00:24:32.270 --> 00:24:34.940
a random variable; it has its
own probability distribution
00:24:34.940 --> 00:24:35.810
function.
00:24:35.810 --> 00:24:41.460
Now what we want to do is reason
about the underlying population
00:24:41.460 --> 00:24:44.280
based on those
observed statistics.
00:24:44.280 --> 00:24:47.810
Somebody's cell
phone is going crazy.
00:24:47.810 --> 00:24:48.430
Not mine.
00:24:53.187 --> 00:24:54.270
Everybody hear that click?
00:24:54.270 --> 00:24:57.550
Can you even hear that
click in Singapore?
00:24:57.550 --> 00:24:58.070
Yeah?
00:24:58.070 --> 00:24:58.660
All right.
00:25:02.300 --> 00:25:04.040
Hopefully that will
go away in a second.
00:25:09.080 --> 00:25:13.350
So once we know the sampling
distribution, say, for x bar,
00:25:13.350 --> 00:25:16.910
now we can argue about the
probabilities associated
00:25:16.910 --> 00:25:20.810
with observing particular
values of x bar.
00:25:20.810 --> 00:25:23.420
We can make observations
or arguments
00:25:23.420 --> 00:25:26.540
about how much probability's out
in the tails of these things.
00:25:26.540 --> 00:25:29.420
And then we can invert
backwards and reason
00:25:29.420 --> 00:25:31.850
about the actual
population mean.
00:25:34.640 --> 00:25:36.950
And again, we're after
not only the point
00:25:36.950 --> 00:25:41.570
estimates, our best guess,
but also interval estimates--
00:25:41.570 --> 00:25:45.950
confidence intervals where
we think the actual value
00:25:45.950 --> 00:25:47.300
is going to lie.
00:25:47.300 --> 00:25:51.050
And these are critically
dependent on probability
00:25:51.050 --> 00:25:56.010
calculations of the
sampling distribution.
00:25:56.010 --> 00:25:59.090
So here's an example.
00:25:59.090 --> 00:26:02.680
So suppose that we start
out with some assumptions.
00:26:02.680 --> 00:26:08.940
We start out with some a priori
beliefs about the distribution
00:26:08.940 --> 00:26:10.020
of some parameter.
00:26:10.020 --> 00:26:15.220
In particular, we're interested
in the thickness of some part.
00:26:15.220 --> 00:26:16.750
We don't know the mean of it.
00:26:16.750 --> 00:26:19.940
But based on maybe lots and
lots of historical data,
00:26:19.940 --> 00:26:24.940
we do believe we do
know a couple of things.
00:26:24.940 --> 00:26:27.040
We know its variance.
00:26:27.040 --> 00:26:28.950
The standard deviation was 10.
00:26:28.950 --> 00:26:32.782
So let's just assume that we
know the standard deviation.
00:26:32.782 --> 00:26:34.240
And we also know--
the second thing
00:26:34.240 --> 00:26:38.410
is that the thickness of these
parts is normally distributed.
00:26:38.410 --> 00:26:40.060
Those are our
starting assumptions,
00:26:40.060 --> 00:26:43.150
our a priori assumptions.
00:26:43.150 --> 00:26:46.930
Now what we do is we go, and we
draw 50 different random parts
00:26:46.930 --> 00:26:50.470
with the IID assumption.
00:26:50.470 --> 00:26:55.210
And we calculate the average
thickness from those.
00:26:55.210 --> 00:26:57.970
And I'll tell you,
of those n equals
00:26:57.970 --> 00:27:00.550
50 samples, the
actual sample mean
00:27:00.550 --> 00:27:06.490
that comes out from that one
sample of size 50 is 113.5.
00:27:06.490 --> 00:27:08.090
There you go.
00:27:08.090 --> 00:27:10.820
You're blessed with
that piece of data.
00:27:10.820 --> 00:27:13.070
Now the first question here,
based on what we've seen,
00:27:13.070 --> 00:27:16.520
is what is the distribution
of the mean of the thickness?
00:27:16.520 --> 00:27:19.610
What is the PDF
associated with t bar?
00:27:19.610 --> 00:27:21.600
Everybody should know this.
00:27:21.600 --> 00:27:24.590
What's t bar distributed as?
00:27:29.738 --> 00:27:30.680
AUDIENCE: It's normal.
00:27:30.680 --> 00:27:32.602
PROFESSOR: It's normal, right.
00:27:32.602 --> 00:27:34.060
AUDIENCE: Centered
around the mean.
00:27:34.060 --> 00:27:35.810
PROFESSOR: Centered
around the mean, so it
00:27:35.810 --> 00:27:38.470
would have the same unknown mu.
00:27:38.470 --> 00:27:40.250
And what would its variance be?
00:27:40.250 --> 00:27:40.750
AUDIENCE: 2.
00:27:40.750 --> 00:27:41.840
AUDIENCE: 2.
00:27:41.840 --> 00:27:44.960
PROFESSOR: 2, very good.
00:27:44.960 --> 00:27:50.630
So it has the same mean, and
the variance scales as 1 over n.
00:27:50.630 --> 00:27:55.860
So we had 50 samples, so
the variance goes down
00:27:55.860 --> 00:27:59.760
by that factor.
00:27:59.760 --> 00:28:04.530
One quick notation
point here is when
00:28:04.530 --> 00:28:12.530
we use this notation of normal
with mu and sigma squared,
00:28:12.530 --> 00:28:16.940
I try to be very consistent and
put the mean and the variance
00:28:16.940 --> 00:28:17.640
in there.
00:28:17.640 --> 00:28:19.490
You will sometimes
find different texts
00:28:19.490 --> 00:28:21.890
and different
writers or whatever
00:28:21.890 --> 00:28:25.230
putting the mean and
the standard deviation.
00:28:25.230 --> 00:28:29.030
So you always want to confirm
that, because one's a square,
00:28:29.030 --> 00:28:31.410
and one's the square
root of the other.
00:28:31.410 --> 00:28:33.500
So be a little bit careful--
00:28:33.500 --> 00:28:35.210
a little bit careful on that.
00:28:35.210 --> 00:28:43.040
I try to be consistent and
have that be the variance.
00:28:43.040 --> 00:28:44.820
So that was a first
easy question.
00:28:44.820 --> 00:28:47.150
We know that based
on sampling theory.
00:28:47.150 --> 00:28:51.853
We know the distribution
function for the sample mean.
00:28:51.853 --> 00:28:53.270
Now the key question
is, how do we
00:28:53.270 --> 00:28:57.960
use that to reason about
the actual population mean?
00:28:57.960 --> 00:29:01.050
Well, it's really easy
already-- the best guess.
00:29:01.050 --> 00:29:03.660
But the more subtle question
that we've been talking about
00:29:03.660 --> 00:29:08.080
is, where do we think the
true mean of the population
00:29:08.080 --> 00:29:12.310
lies based on this
one observation?
00:29:12.310 --> 00:29:15.550
What range do we think
the true mean has
00:29:15.550 --> 00:29:19.330
with some degree of confidence?
00:29:19.330 --> 00:29:23.463
Do you think it's plus or
minus 2 around that mean?
00:29:23.463 --> 00:29:25.630
Do you think it's plus or
minus 20 around that mean?
00:29:25.630 --> 00:29:32.200
If I were to ask you to bet your
life on what the true mean is,
00:29:32.200 --> 00:29:35.170
you would want to be able to say
with some degree of confidence,
00:29:35.170 --> 00:29:38.590
it's actually within
this amount of distance.
00:29:41.280 --> 00:29:43.640
I have to say one more
thing, because if I
00:29:43.640 --> 00:29:46.640
said it's within some
amount of distance of that,
00:29:46.640 --> 00:29:50.360
well, with non-zero
probability, that thickness
00:29:50.360 --> 00:29:55.040
could take on values all the
way from plus infinity, if it's
00:29:55.040 --> 00:30:00.160
truly normally distributed,
all the way to not quite
00:30:00.160 --> 00:30:03.190
negative infinity, because
this is a thickness-- it stops at 0.
00:30:03.190 --> 00:30:05.960
So it's still an
approximate model.
00:30:05.960 --> 00:30:08.630
So if I just asked
you, bet your life.
00:30:08.630 --> 00:30:11.360
Tell me where you
think the true mean is,
00:30:11.360 --> 00:30:14.810
if you wanted 100% chance of
saving your life, you'd say,
00:30:14.810 --> 00:30:16.680
it could be anything.
00:30:16.680 --> 00:30:18.940
So I have to give
you, when we're
00:30:18.940 --> 00:30:21.370
talking about confidence
intervals, another piece
00:30:21.370 --> 00:30:22.840
of bounding information.
00:30:22.840 --> 00:30:24.550
I want the range.
00:30:24.550 --> 00:30:28.010
How far away from that one
observation of the mean
00:30:28.010 --> 00:30:32.530
do I need to be with
some probability?
00:30:32.530 --> 00:30:36.820
95% confidence or
95% of the time,
00:30:36.820 --> 00:30:39.760
where do we think the
true mean would lie?
00:30:39.760 --> 00:30:43.930
What that means is if I were
to go and calculate another 50
00:30:43.930 --> 00:30:46.900
samples and calculate
the mean, again, we
00:30:46.900 --> 00:30:47.980
have that distribution.
00:30:47.980 --> 00:30:53.440
And what we're looking for
is that 95% central region
00:30:53.440 --> 00:30:56.230
of the PDF associated
with x bar, which
00:30:56.230 --> 00:31:02.700
is where 95% of the time, the
mean is actually going to lie.
00:31:02.700 --> 00:31:08.640
So that gets us pictorially
and formulaically here
00:31:08.640 --> 00:31:11.220
to this notion of the
confidence interval
00:31:11.220 --> 00:31:13.920
and how we actually go
about calculating that.
00:31:13.920 --> 00:31:18.150
What we're asking-- what
we've got in this situation
00:31:18.150 --> 00:31:21.120
is the variance is
known, so I'm not
00:31:21.120 --> 00:31:22.650
trying to estimate the variance.
00:31:22.650 --> 00:31:26.590
I'm just trying to
reason about the mean.
00:31:26.590 --> 00:31:31.210
And I want to estimate it to
some percent, some confidence
00:31:31.210 --> 00:31:32.500
interval.
00:31:32.500 --> 00:31:37.000
You always have this
chance of being wrong
00:31:37.000 --> 00:31:38.860
when you talk
confidence intervals.
00:31:38.860 --> 00:31:41.230
You've got some
alpha probability
00:31:41.230 --> 00:31:43.450
that the true mean is
even further away than you
00:31:43.450 --> 00:31:45.340
think in your interval.
00:31:45.340 --> 00:31:47.650
But you're trying to
quantify that and bound that.
00:31:47.650 --> 00:31:52.720
So we typically talk about,
say, an alpha of 5% or maybe 1%
00:31:52.720 --> 00:31:58.270
probability of being
outside of your interval.
00:31:58.270 --> 00:32:00.250
So there's this
alpha probability
00:32:00.250 --> 00:32:06.820
of error associated with
any confidence interval.
00:32:06.820 --> 00:32:09.880
So that's that second piece
of data I had to give you.
00:32:09.880 --> 00:32:13.350
The first is we want to know
this range-- what the size is.
00:32:13.350 --> 00:32:17.940
So the way this works is
we're wanting to know,
00:32:17.940 --> 00:32:22.110
based on our calculated
x bar from our sample
00:32:22.110 --> 00:32:28.820
of size n, where the
true mean actually lies.
00:32:28.820 --> 00:32:31.220
So we know what
we're doing is saying
00:32:31.220 --> 00:32:36.710
that the true mean, mu, is
going to be bounded on the left
00:32:36.710 --> 00:32:42.590
by the x bar, but then
going some portion
00:32:42.590 --> 00:32:44.930
of the distribution to
the left and some portion
00:32:44.930 --> 00:32:52.040
of the distribution to the right
until we get the 1 minus alpha.
00:32:52.040 --> 00:32:57.340
So this area in here is the
1 minus alpha-- the 95%, say,
00:32:57.340 --> 00:33:00.370
central component of
that distribution.
00:33:00.370 --> 00:33:04.750
And then we're evenly spreading
the error part, the alpha,
00:33:04.750 --> 00:33:07.420
into 2 alpha over
2's on each side,
00:33:07.420 --> 00:33:10.870
saying I've got for a
95% confidence interval,
00:33:10.870 --> 00:33:15.670
a 2.5% chance that the true
mean is a little bit further off
00:33:15.670 --> 00:33:18.070
to the left and a 2.5% chance
that it's a little further
00:33:18.070 --> 00:33:19.340
off to the right.
00:33:19.340 --> 00:33:22.660
I guess in this picture here
I'm doing an 80% confidence
00:33:22.660 --> 00:33:28.240
interval with a total alpha
error risk, error probability,
00:33:28.240 --> 00:33:31.460
of 0.2.
00:33:31.460 --> 00:33:33.620
And so the question
then becomes,
00:33:33.620 --> 00:33:35.360
how far do I have to go out?
00:33:35.360 --> 00:33:39.350
And we know that from the
basic probability manipulations
00:33:39.350 --> 00:33:41.450
from a normal
distribution you guys
00:33:41.450 --> 00:33:44.390
have been dealing with already.
00:33:44.390 --> 00:33:51.190
The whole question is, how
many unit standard deviations
00:33:51.190 --> 00:33:53.480
of a unit normal
do I have to go?
00:33:53.480 --> 00:33:58.420
How many z's out do I have to
go until I have exactly alpha
00:33:58.420 --> 00:34:02.590
over 2 out here in the tail?
00:34:02.590 --> 00:34:04.980
So for example,
we might know what
00:34:04.980 --> 00:34:07.560
this is going to do
is I've got to go out
00:34:07.560 --> 00:34:13.350
1.28 standard
deviations to the left
00:34:13.350 --> 00:34:18.524
in order to be able to
have just that alpha over 2
00:34:18.524 --> 00:34:23.389
to the left of that tail,
and similarly to the right.
00:34:23.389 --> 00:34:30.460
Now, notice that we're
also unnormalizing.
00:34:30.460 --> 00:34:32.300
The z is the normal--
00:34:32.300 --> 00:34:36.070
how many z's you get to,
out of the unit Gaussian,
00:34:36.070 --> 00:34:38.469
the probability
out in the tails.
00:34:38.469 --> 00:34:41.110
But what we wanted to do is
reason about the location
00:34:41.110 --> 00:34:43.690
of the true population.
00:34:43.690 --> 00:34:46.580
We want to know the
true population mean.
00:34:46.580 --> 00:34:52.989
And so we have to do a little
bit of unnormalization and say,
00:34:52.989 --> 00:34:56.860
z alpha gave me the
number of unit normals.
00:34:56.860 --> 00:35:00.700
Now, in terms of my
actual population variance
00:35:00.700 --> 00:35:03.040
or population
standard deviation,
00:35:03.040 --> 00:35:05.580
what does that correspond to?
00:35:05.580 --> 00:35:11.100
And this is where the sample
size also comes into play.
00:35:11.100 --> 00:35:14.730
We were reasoning about
the distribution associated
00:35:14.730 --> 00:35:17.210
with x bar.
00:35:17.210 --> 00:35:19.280
And the x bar is scaled.
00:35:19.280 --> 00:35:22.100
It shrunk by that
square root of n
00:35:22.100 --> 00:35:24.300
in terms of the
standard deviation.
00:35:24.300 --> 00:35:29.000
So when I expand it back
out, I'm counting number of--
00:35:29.000 --> 00:35:34.910
first off, together, this is
number of standard deviations
00:35:34.910 --> 00:35:36.980
in my x bar.
00:35:36.980 --> 00:35:42.140
And then when I
expand that further
00:35:42.140 --> 00:35:46.010
out to the number of standard
deviations in my population,
00:35:46.010 --> 00:35:48.710
I have to divide back
out by that root n.
00:35:52.520 --> 00:35:57.780
So what we've got is
the rationale for being
00:35:57.780 --> 00:36:01.920
able to use the PDF associated
with x bar, calculate
00:36:01.920 --> 00:36:03.660
probabilities off
of the tails,
00:36:03.660 --> 00:36:08.170
and get finally to this nice--
00:36:08.170 --> 00:36:12.930
this is my fast way
to erase everything--
00:36:12.930 --> 00:36:17.200
get back to my nice distribution
here or a nice formula, which
00:36:17.200 --> 00:36:21.700
you'll see in Montgomery, you'll
see in all of the textbooks.
00:36:21.700 --> 00:36:28.270
It's a wonderful note to have
on your one page set of notes
00:36:28.270 --> 00:36:30.310
or cheat sheet
for taking quizzes
00:36:30.310 --> 00:36:32.200
in this class and elsewhere.
00:36:32.200 --> 00:36:35.530
This is the interval, the
confidence interval formula,
00:36:35.530 --> 00:36:38.890
for the location of
the true mean when
00:36:38.890 --> 00:36:39.850
the variance was known.
00:36:44.370 --> 00:36:46.290
So any questions on that?
00:36:46.290 --> 00:36:51.030
We actually want to
return to our example
00:36:51.030 --> 00:36:54.150
and see what numbers pop
out because I want to know--
00:36:54.150 --> 00:36:57.420
we knew x bar was 113.5.
00:36:57.420 --> 00:37:01.080
But I actually want to know,
what is the 95% confidence
00:37:01.080 --> 00:37:02.950
interval for that?
00:37:02.950 --> 00:37:05.550
And so we can simply go
back to our second question.
00:37:05.550 --> 00:37:08.800
Use the fact that we had--
00:37:08.800 --> 00:37:12.750
you guys told me what the
distribution of t bar was--
00:37:12.750 --> 00:37:15.990
normal with our unknown mu.
00:37:15.990 --> 00:37:20.730
And the variance was
scaled, 100 over 50.
00:37:20.730 --> 00:37:24.170
So now for a 95%
confidence interval,
00:37:24.170 --> 00:37:28.700
what is the true mean?
00:37:28.700 --> 00:37:30.080
So I've pictured it here.
00:37:30.080 --> 00:37:33.560
And what we're
saying is we want--
00:37:33.560 --> 00:37:36.800
we've got this red
curve which, again,
00:37:36.800 --> 00:37:42.330
goes with this PDF
associated with t bar.
00:37:42.330 --> 00:37:46.380
And I want the plus/minus
z alpha over 2, the alpha
00:37:46.380 --> 00:37:48.570
being 0.05.
00:37:48.570 --> 00:37:51.090
That's my probability
of being wrong to get
00:37:51.090 --> 00:37:54.370
to a 0.95 confidence interval.
00:37:54.370 --> 00:38:01.400
So how many z's do I have to go
out to have 95% in the center?
00:38:01.400 --> 00:38:03.580
We actually showed
some examples.
00:38:03.580 --> 00:38:05.390
If you remember,
last time we looked
00:38:05.390 --> 00:38:08.480
at plus/minus 1 sigma,
plus/minus 2 sigma,
00:38:08.480 --> 00:38:11.040
plus/minus 3 sigma
for a Gaussian.
00:38:11.040 --> 00:38:14.270
And it's actually a very
close approximation.
00:38:14.270 --> 00:38:18.350
That plus/minus 2 sigma
is 95% of a distribution.
00:38:18.350 --> 00:38:20.930
That's a good rule
of thumb to remember.
00:38:20.930 --> 00:38:24.680
It's actually 1.96, not quite 2.
00:38:24.680 --> 00:38:29.660
But about plus/minus
2 sigma has 95%.
00:38:29.660 --> 00:38:34.670
So you'll often see 95%
confidence intervals
00:38:34.670 --> 00:38:36.140
graphically shown.
00:38:36.140 --> 00:38:40.380
So we need about 1.96
standard deviations.
00:38:40.380 --> 00:38:46.300
Now that translates to
a confidence interval
00:38:46.300 --> 00:38:50.020
that tells us, as
a function of n,
00:38:50.020 --> 00:38:52.990
the distribution for where
we think the true population
00:38:52.990 --> 00:38:55.600
is, based on the sample
size that we had.
00:38:55.600 --> 00:38:59.860
The compression that we
got because of sampling
00:38:59.860 --> 00:39:02.650
gets us that tighter
standard deviation.
00:39:02.650 --> 00:39:08.080
And I've got a symmetric
plus/minus 2.77
00:39:08.080 --> 00:39:11.560
for my 95% confidence interval.
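[The interval arithmetic can be checked in a couple of lines, using the example's numbers (x bar = 113.5, sigma = 10, n = 50, z = 1.96). The function name is the editor's, not the lecture's.]

```python
import math

# Known-sigma confidence interval: x-bar +/- z * sigma / sqrt(n)
def ci_known_sigma(xbar, sigma, n, z=1.96):
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

lo, hi = ci_known_sigma(113.5, 10.0, 50)
print(round((hi - lo) / 2, 2))       # half-width, about 2.77
print(round(lo, 2), round(hi, 2))    # the 95% interval around 113.5
```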
00:39:11.560 --> 00:39:13.330
Now, notice that all
you had to do here
00:39:13.330 --> 00:39:16.600
was be told what the
actual calculated t bar was
00:39:16.600 --> 00:39:20.450
and what the
underlying variance was
00:39:20.450 --> 00:39:22.005
and the size of your sample.
00:39:22.005 --> 00:39:23.630
I didn't even have
to actually give you
00:39:23.630 --> 00:39:25.620
a list of all those
values, right?
00:39:30.250 --> 00:39:32.510
But I did have to tell
you the sample size.
00:39:32.510 --> 00:39:37.780
If sample size changed, that
PDF would narrow or widen,
00:39:37.780 --> 00:39:43.450
and your confidence interval
would narrow or widen, right?
00:39:43.450 --> 00:39:46.936
So any questions to
where we are now?
00:39:46.936 --> 00:39:48.790
It's all seeming pretty clear?
00:39:52.220 --> 00:39:55.830
So this is the
relatively easy part
00:39:55.830 --> 00:39:58.260
because it's dealing with
normal distributions.
00:39:58.260 --> 00:40:00.990
This notion of sampling
is a little bit subtle
00:40:00.990 --> 00:40:02.550
because there is
a different PDF,
00:40:02.550 --> 00:40:05.760
and you got to know how that
scales with the sample size.
00:40:05.760 --> 00:40:10.670
Now I'm going to throw a
few different curves at you,
00:40:10.670 --> 00:40:12.950
the different curves being
different probability
00:40:12.950 --> 00:40:17.180
distribution functions
than normal distributions.
00:40:17.180 --> 00:40:21.200
And I'm going to briefly
cover three of them,
00:40:21.200 --> 00:40:24.320
and all three of them
are ones that we actually
00:40:24.320 --> 00:40:27.790
will be using in
multiple scenarios
00:40:27.790 --> 00:40:34.180
in statistical analysis
and statistical techniques
00:40:34.180 --> 00:40:35.980
and tools that we're using.
00:40:35.980 --> 00:40:40.270
The first one is a
relatively easy step,
00:40:40.270 --> 00:40:43.900
and that's to look at the
student t distribution.
00:40:43.900 --> 00:40:44.860
I'll come back to this.
00:40:44.860 --> 00:40:47.800
But basically, if we go back
to the example I gave you.
00:40:47.800 --> 00:40:51.310
I said, we assumed we knew,
based on, I don't know,
00:40:51.310 --> 00:40:55.360
lots of past history what
the underlying variance was
00:40:55.360 --> 00:40:57.410
on the thickness of our parts.
00:40:57.410 --> 00:40:58.690
What if you don't know that?
00:40:58.690 --> 00:41:02.467
What if you have to
estimate that, too?
00:41:02.467 --> 00:41:04.050
Well, if you had to
estimate it, you'd
00:41:04.050 --> 00:41:06.900
probably use sample standard
deviation, that formula,
00:41:06.900 --> 00:41:08.700
and come up with an estimate.
00:41:08.700 --> 00:41:12.930
It turns out when you do that,
that additional uncertainty
00:41:12.930 --> 00:41:15.330
on what the
underlying variance is
00:41:15.330 --> 00:41:17.730
means that the right
distribution for arguing
00:41:17.730 --> 00:41:21.810
about the mean when you didn't
know the underlying variance
00:41:21.810 --> 00:41:23.580
is no longer a
normal distribution.
00:41:23.580 --> 00:41:27.470
It's actually a t-distribution,
and we'll talk about that.
00:41:27.470 --> 00:41:30.030
It's a slightly different--
it's very close to
00:41:30.030 --> 00:41:35.470
or looks qualitatively close
to a normal distribution,
00:41:35.470 --> 00:41:37.640
but we do want to cover that.
00:41:37.640 --> 00:41:42.540
And then more have to
do with not the mean,
00:41:42.540 --> 00:41:45.330
but arguing about the variance.
00:41:45.330 --> 00:41:48.900
If I calculate sample
variance from a distribution,
00:41:48.900 --> 00:41:55.110
I calculate s squared using the
formula for a sample of size
00:41:55.110 --> 00:41:57.180
50, I get a number.
00:41:57.180 --> 00:41:58.630
I do that lots
and lots of times.
00:41:58.630 --> 00:42:00.900
I trace out a PDF.
00:42:00.900 --> 00:42:04.680
The PDF associated
with the values
00:42:04.680 --> 00:42:08.130
of sample variance
calculated from that sample
00:42:08.130 --> 00:42:10.380
is a chi-squared distribution.
00:42:13.000 --> 00:42:16.260
So we'll talk about what
that shape looks like.
00:42:16.260 --> 00:42:19.800
And then we've got a
variance that we've
00:42:19.800 --> 00:42:21.870
calculated from a sample.
00:42:21.870 --> 00:42:25.570
And a very strange distribution
is the F distribution,
00:42:25.570 --> 00:42:30.360
which is the distribution
of the ratio of two normally
00:42:30.360 --> 00:42:34.920
distributed variances or two
variances drawn from normally
00:42:34.920 --> 00:42:37.340
distributed sample data.
00:42:37.340 --> 00:42:38.030
Good heavens.
00:42:38.030 --> 00:42:39.590
Why would you ever
be calculating
00:42:39.590 --> 00:42:41.870
ratios of variances?
00:42:41.870 --> 00:42:45.300
What a weird distribution.
00:42:45.300 --> 00:42:48.970
Why would you ever calculate
ratios of variances?
00:42:48.970 --> 00:42:51.090
Where might that come up?
00:42:51.090 --> 00:42:54.540
There's at least a couple
of cases-- one that's
00:42:54.540 --> 00:42:56.989
kind of subtle, but one
that's pretty obvious.
00:42:56.989 --> 00:42:58.630
AUDIENCE: I think
it's-- you're thinking
00:42:58.630 --> 00:43:03.953
about the variation of the
actual population, which
00:43:03.953 --> 00:43:05.598
varies from your sample.
00:43:08.230 --> 00:43:10.270
PROFESSOR: Certainly,
the variance
00:43:10.270 --> 00:43:12.250
associated with a
sample of smaller
00:43:12.250 --> 00:43:15.440
size than your true population.
00:43:15.440 --> 00:43:17.110
So that's exactly true.
00:43:17.110 --> 00:43:18.850
That's one important area.
00:43:18.850 --> 00:43:23.020
The fact of sample size
entering into spread and things
00:43:23.020 --> 00:43:24.280
is very important.
00:43:24.280 --> 00:43:26.920
That actually will come up
more in the chi-squared.
00:43:26.920 --> 00:43:30.280
But I think a second
very obvious place is
00:43:30.280 --> 00:43:32.590
I make a change to a process.
00:43:32.590 --> 00:43:34.750
And I'm maybe not trying
to mean center it.
00:43:34.750 --> 00:43:37.390
I'm trying to get a
reduced variance process.
00:43:37.390 --> 00:43:40.570
I want to know, is this
process better or not?
00:43:40.570 --> 00:43:42.670
Is its variance smaller?
00:43:42.670 --> 00:43:45.430
So the ratio of
those two variances
00:43:45.430 --> 00:43:48.550
are something I might be
very, very interested in.
00:43:48.550 --> 00:43:50.890
I want to look at
those and see, well,
00:43:50.890 --> 00:43:52.510
I did get a smaller variance.
00:43:52.510 --> 00:43:54.730
It's half as small.
00:43:54.730 --> 00:43:57.580
Do I have confidence that
the true population variance
00:43:57.580 --> 00:43:59.597
is really smaller or not?
00:43:59.597 --> 00:44:01.180
And so that's where
the F distribution
00:44:01.180 --> 00:44:02.450
is going to come into play.
00:44:02.450 --> 00:44:05.623
So we want to be able
to manipulate and deal
00:44:05.623 --> 00:44:06.540
with that one as well.
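[To see why a ratio of sample variances is worth having a distribution for, here is an editor's simulation sketch: when two processes truly have the same variance, the ratio of their sample variances hovers around 1, and the F distribution describes how far from 1 is still plausible by chance.]

```python
import random

# Illustrative sketch (not lecture code): ratio of two sample variances,
# each computed from normally distributed data with the same true variance.

random.seed(3)

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n = 50
ratios = []
for _ in range(5000):
    old = [random.gauss(0.0, 1.0) for _ in range(n)]
    new = [random.gauss(0.0, 1.0) for _ in range(n)]
    ratios.append(sample_variance(new) / sample_variance(old))

# With equal true variances the ratio concentrates near 1; its mean is
# (n-1)/(n-3) for these degrees of freedom, just above 1.
print(round(sum(ratios) / len(ratios), 2))
```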
00:44:11.880 --> 00:44:18.670
Let me do the student
t-distribution first.
00:44:18.670 --> 00:44:19.990
Actually, I can't do that.
00:44:19.990 --> 00:44:22.540
Let me do the chi-squared
distribution first.
00:44:22.540 --> 00:44:24.130
For the formal
definition of the t,
00:44:24.130 --> 00:44:26.860
I need the chi-squared,
even though conceptually,
00:44:26.860 --> 00:44:28.660
it doesn't really matter.
00:44:28.660 --> 00:44:32.320
So let's talk about the
chi-squared distribution first.
00:44:32.320 --> 00:44:39.580
If I start out with truly
normally distributed data
00:44:39.580 --> 00:44:44.150
a unit normal--
mean 0, variance 1.
00:44:44.150 --> 00:44:50.670
And now, I take a sum
of n of these unit
00:44:50.670 --> 00:44:54.290
normals, each one
of which is squared.
00:44:54.290 --> 00:44:56.410
So each x sub i is
normally distributed.
00:44:56.410 --> 00:44:59.170
I do this weird operation
where I take that sample.
00:44:59.170 --> 00:45:05.170
I square it, I take another
draw or another random variable,
00:45:05.170 --> 00:45:08.830
also from the same distribution,
square that, and then take
00:45:08.830 --> 00:45:14.020
the sum of n of those squared
random variables to create
00:45:14.020 --> 00:45:17.420
a new random variable y.
00:45:17.420 --> 00:45:22.430
y is the sum of squared unit
normal random variables.
00:45:22.430 --> 00:45:26.560
Then I get this
chi-squared distribution.
00:45:26.560 --> 00:45:29.080
The distribution of this
new random variable y
00:45:29.080 --> 00:45:33.400
is chi-squared with
n degrees of freedom.
00:45:33.400 --> 00:45:36.625
Good heavens, what a
weird thing to be doing.
00:45:36.625 --> 00:45:38.740
Why would you be taking
random variables,
00:45:38.740 --> 00:45:42.270
squaring them, and
taking sums of them?
00:45:42.270 --> 00:45:45.710
Well, think back to the formula.
00:45:45.710 --> 00:45:47.420
Let's see if I can do this.
00:45:47.420 --> 00:45:48.590
What page is that?
00:45:48.590 --> 00:45:50.930
Anybody got it there?
00:45:50.930 --> 00:45:52.600
8?
00:45:52.600 --> 00:45:54.700
There we go, page 5.
00:45:54.700 --> 00:46:01.670
Look back at this formula for
sample standard deviation.
00:46:04.870 --> 00:46:08.660
First off, I'm subtracting
the mean off of some sample.
00:46:08.660 --> 00:46:13.010
So now I've got a
0 mean variable.
00:46:13.010 --> 00:46:15.430
Now I'm taking squares of them.
00:46:15.430 --> 00:46:17.950
Well, that sounds kind of
like this squaring operation.
00:46:17.950 --> 00:46:21.830
And then I'm taking
a big sum of them.
00:46:21.830 --> 00:46:24.050
That sounds a lot like
this operation I was just
00:46:24.050 --> 00:46:26.250
describing for chi-squared.
00:46:26.250 --> 00:46:31.850
So this creation of a new random
variable, this s squared here,
00:46:31.850 --> 00:46:37.070
is very closely related to--
00:46:37.070 --> 00:46:38.970
that didn't work.
00:46:38.970 --> 00:46:41.250
There we go-- very
closely related
00:46:41.250 --> 00:46:45.310
to the definition
of chi-squared.
00:46:45.310 --> 00:46:47.370
Now the chi-squared,
the PDF associated
00:46:47.370 --> 00:46:51.450
with the chi-squared,
looks kind of funky.
00:46:51.450 --> 00:46:53.730
It's clearly not normally
distributed, right?
00:46:53.730 --> 00:46:55.380
It's kind of skewed.
00:46:55.380 --> 00:47:01.240
Notice it's got a
long tail out here
00:47:01.240 --> 00:47:04.450
to the right for large values.
00:47:04.450 --> 00:47:08.870
Because it's a sum of squared
values, it can't be negative.
00:47:08.870 --> 00:47:09.730
So it's truncated.
00:47:09.730 --> 00:47:14.190
There's nothing-- can't
be smaller than 0.
00:47:14.190 --> 00:47:18.690
Another really weird thing is
that the maximal probability
00:47:18.690 --> 00:47:26.680
value is not equal to the
mean of the distribution.
00:47:26.680 --> 00:47:28.540
That's kind of interesting.
00:47:28.540 --> 00:47:30.430
And there's another
really interesting fact
00:47:30.430 --> 00:47:33.400
that is truly useful
and occasionally
00:47:33.400 --> 00:47:36.310
comes up on problem sets
and that sort of thing.
00:47:36.310 --> 00:47:39.940
The mean, the expected value
of the chi-squared distribution
00:47:39.940 --> 00:47:44.580
with degrees of freedom n, is n.
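[A short sketch of that fact (editor's illustrative code): build y as a sum of n squared draws from a unit normal, so y is chi-squared with n degrees of freedom, and check that its mean comes out near n.]

```python
import random
import statistics

random.seed(1)

def chi_squared_draw(n):
    """One draw of y = sum of n squared unit-normal variables."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))

n = 5
draws = [chi_squared_draw(n) for _ in range(20000)]
print(round(statistics.fmean(draws), 2))  # should be near n = 5

# The distribution is skewed: it cannot go below 0, and its peak (mode)
# sits at n - 2 for n >= 2, below the mean.
```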
00:47:44.580 --> 00:47:48.180
So as I have larger
numbers of variables,
00:47:48.180 --> 00:47:53.850
the sum of that larger
number keeps getting bigger.
00:47:53.850 --> 00:47:58.370
So that makes sense
when you think about it.
00:47:58.370 --> 00:48:01.590
So the point here
is when we actually
00:48:01.590 --> 00:48:08.490
do that calculation of a sample
standard or a sample variance
00:48:08.490 --> 00:48:12.750
or a sample standard
deviation, the PDF
00:48:12.750 --> 00:48:15.060
associated with that
is actually related
00:48:15.060 --> 00:48:17.250
to this chi-squared
distribution.
00:48:17.250 --> 00:48:19.480
Now there were some
other constants in there.
00:48:19.480 --> 00:48:21.280
They're scaling factors.
00:48:21.280 --> 00:48:23.940
So for example, we did
a mean shift x bar,
00:48:23.940 --> 00:48:26.400
but we didn't normalize
to the true variance,
00:48:26.400 --> 00:48:28.430
because we didn't know it.
00:48:28.430 --> 00:48:31.500
So there is this relationship
or a scaling factor
00:48:31.500 --> 00:48:34.080
before we get to the
chi-squared distribution.
00:48:34.080 --> 00:48:40.140
We also had this other n minus
1 factor back on the calculation
00:48:40.140 --> 00:48:41.220
of the sample--
00:48:41.220 --> 00:48:44.400
sample standard or
sample variance.
00:48:44.400 --> 00:48:48.870
So we have to do a little bit
of moving variables around
00:48:48.870 --> 00:48:51.430
to get to a chi-squared
distribution.
00:48:51.430 --> 00:48:56.600
Another important
point is that the--
00:48:56.600 --> 00:48:59.090
let me clean up some of this--
00:48:59.090 --> 00:49:03.980
is that the sample
variance is actually
00:49:03.980 --> 00:49:09.722
related to a chi-squared with
n minus 1 degrees of freedom.
00:49:09.722 --> 00:49:11.930
And I really don't want to
go into a whole discussion
00:49:11.930 --> 00:49:16.260
of degrees of freedom because
it's a little bit subtle.
00:49:16.260 --> 00:49:17.960
But this reminds
me of another point
00:49:17.960 --> 00:49:20.510
that I didn't make
back on slide 8.
00:49:24.390 --> 00:49:25.840
Get me to 8, please.
00:49:25.840 --> 00:49:26.520
There we go.
00:49:26.520 --> 00:49:29.235
Oops, not 48, 8.
00:49:29.235 --> 00:49:30.660
Oh, it wasn't 8.
00:49:30.660 --> 00:49:31.320
Where was it?
00:49:31.320 --> 00:49:32.670
4, 5.
00:49:32.670 --> 00:49:34.670
There we go.
00:49:34.670 --> 00:49:40.500
Back here on this, notice that
when we calculate sample mean,
00:49:40.500 --> 00:49:42.240
we used 1 over n.
00:49:42.240 --> 00:49:44.610
But when we calculate
sample variance,
00:49:44.610 --> 00:49:47.770
we always use 1 over n minus 1.
00:49:47.770 --> 00:49:48.520
Why do we do that?
00:49:53.080 --> 00:50:00.170
It turns out that if you need
or want an unbiased estimator
00:50:00.170 --> 00:50:04.150
for a sample variance, you need
to multiply by 1 over n minus 1
00:50:04.150 --> 00:50:08.310
or divide by n minus 1, not n.
00:50:08.310 --> 00:50:10.140
Now, as n gets very
large, the difference
00:50:10.140 --> 00:50:11.400
doesn't really matter.
00:50:11.400 --> 00:50:15.890
But you can go through
some statistical proofs
00:50:15.890 --> 00:50:21.800
to show that the best unbiased
estimator needs that n minus 1.
00:50:21.800 --> 00:50:26.210
Now the other thing that's
going on in this formula
00:50:26.210 --> 00:50:29.120
is we were subtracting
off the mean.
00:50:29.120 --> 00:50:33.210
And in this case, we were
also estimating the mean.
00:50:33.210 --> 00:50:35.420
So we're using up
essentially one degree
00:50:35.420 --> 00:50:41.660
of freedom out of our data
to calculate the sample mean,
00:50:41.660 --> 00:50:45.560
leaving us only n minus
1 degrees of freedom
00:50:45.560 --> 00:50:51.990
really in the remaining data to
allow variance around the mean.
00:50:51.990 --> 00:50:56.030
So I'm not going to go
into much more detail,
00:50:56.030 --> 00:51:01.400
other than to simply say the
fact is, when we're calculating
00:51:01.400 --> 00:51:04.370
sample standard deviation,
we're actually calculating
00:51:04.370 --> 00:51:10.520
two random variables or two
statistics, x bar and variance.
00:51:10.520 --> 00:51:14.900
And so you would
need-- you essentially
00:51:14.900 --> 00:51:19.190
don't have complete independence
between those two things.
00:51:19.190 --> 00:51:23.410
You use up one degree of
freedom for one of those.
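A small simulation makes the 1/(n - 1) point concrete (a sketch under arbitrary assumptions: the seed, sigma = 10, and samples of size 5 are all mine). Averaged over many trials, dividing the sum of squares by n underestimates the true variance by the factor (n - 1)/n, while dividing by n - 1 recovers it.

```python
# A small simulation (not from the lecture) illustrating why the sample
# variance uses 1/(n-1): averaging many estimates, the 1/(n-1) version
# recovers the true variance while the 1/n version underestimates it.
import random

random.seed(0)
true_var = 100.0          # population variance (sigma = 10)
n, trials = 5, 200_000    # small samples make the bias visible

sum_biased = sum_unbiased = 0.0
for _ in range(trials):
    x = [random.gauss(0.0, 10.0) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    sum_biased += ss / n          # divides by n: biased low
    sum_unbiased += ss / (n - 1)  # divides by n - 1: unbiased

print(sum_biased / trials)    # ~80, i.e. (n-1)/n times the true 100
print(sum_unbiased / trials)  # ~100
```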
00:51:23.410 --> 00:51:26.110
Let's use this.
00:51:26.110 --> 00:51:29.010
Before we use this, just to
give you a qualitative feel,
00:51:29.010 --> 00:51:30.190
here's--
00:51:30.190 --> 00:51:35.410
again, plotted a few different
chi-squared distributions.
00:51:35.410 --> 00:51:38.260
When n is very small,
it becomes very skewed.
00:51:38.260 --> 00:51:40.960
It's quite interesting.
00:51:40.960 --> 00:51:47.720
Again, the mean you can see
for n equals 3 here is 3.
00:51:47.720 --> 00:51:50.140
It's this blue curve.
00:51:50.140 --> 00:51:53.110
And as n increases,
the distribution
00:51:53.110 --> 00:51:54.010
shifts to the right.
00:51:54.010 --> 00:51:55.190
The mean shifts to the right.
00:51:55.190 --> 00:51:57.700
But it also spreads out,
which kind of makes sense.
00:51:57.700 --> 00:51:59.860
If I've got more and
more random variables,
00:51:59.860 --> 00:52:04.090
and I'm looking at the
variance and estimating
00:52:04.090 --> 00:52:07.450
that sum of random
variables, its spread
00:52:07.450 --> 00:52:11.160
is going to get large.
00:52:11.160 --> 00:52:17.780
And another observation is that
as n gets larger and larger,
00:52:17.780 --> 00:52:21.790
this also trends towards
a normal distribution,
00:52:21.790 --> 00:52:26.010
which for very large n
can be a useful fact.
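That large-n limit is easy to check numerically (a sketch; using scipy is my own choice, not something the lecture specifies): a chi-squared with k degrees of freedom has mean k and variance 2k, and for large k its quantiles approach those of a normal with that same mean and variance.

```python
# Hedged numerical check of the large-n limit: compare a chi-squared
# quantile for large k against the matching N(k, 2k) quantile.
from math import sqrt
from scipy.stats import chi2, norm

k = 1000
q_chi2 = chi2.ppf(0.975, k)
q_norm = norm.ppf(0.975, loc=k, scale=sqrt(2 * k))
print(q_chi2, q_norm)  # within a fraction of a percent of each other
```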
00:52:26.010 --> 00:52:27.780
I want to actually
go in and use--
00:52:30.740 --> 00:52:37.450
not that one-- use this
chi-squared distribution
00:52:37.450 --> 00:52:43.750
to ask another question
on that thickness example.
00:52:43.750 --> 00:52:45.580
I'd actually want
to know, what's
00:52:45.580 --> 00:52:51.190
the best guess for the variance
of my thickness of parts?
00:52:51.190 --> 00:52:54.190
And better than that, what's a
confidence interval for where
00:52:54.190 --> 00:52:57.250
I think the true variance
lies, based on just this one
00:52:57.250 --> 00:53:00.520
number for sample variance,
based on my sample of size n
00:53:00.520 --> 00:53:02.280
equals 50.
00:53:02.280 --> 00:53:09.390
And this is where we do the same
kind of a formula for the range
00:53:09.390 --> 00:53:12.750
where we think the
true variance lies,
00:53:12.750 --> 00:53:19.350
based on our observation from
one sample of sample standard
00:53:19.350 --> 00:53:20.490
deviation.
00:53:20.490 --> 00:53:22.830
And this is using
that relationship
00:53:22.830 --> 00:53:27.300
between the chi-squared
distribution and s
00:53:27.300 --> 00:53:30.120
squared and the true
underlying variance.
00:53:30.120 --> 00:53:32.160
So if you go back to
one of those formulas,
00:53:32.160 --> 00:53:34.890
what I did was took--
00:53:34.890 --> 00:53:36.600
sigma squared was
lying out here.
00:53:36.600 --> 00:53:40.390
I moved it up here and divided
the chi-squared down here.
00:53:40.390 --> 00:53:43.050
So this is essentially
right in here
00:53:43.050 --> 00:53:47.070
that equivalence that we said
before about how s squared was
00:53:47.070 --> 00:53:50.790
distributed as a
chi-squared with n
00:53:50.790 --> 00:53:52.230
minus 1 degrees of freedom.
00:53:55.430 --> 00:53:59.600
So what we've got is a bound--
00:53:59.600 --> 00:54:02.790
let me get rid of
all this gook--
00:54:02.790 --> 00:54:06.150
a bound, upper and lower
bound, on where we think,
00:54:06.150 --> 00:54:09.590
again, the true variance is,
based on our calculated s
00:54:09.590 --> 00:54:10.720
squared.
00:54:10.720 --> 00:54:14.850
And what we're doing again is
putting some alpha probability
00:54:14.850 --> 00:54:17.360
of being wrong in
each of the tails.
00:54:17.360 --> 00:54:18.930
I want the central part.
00:54:18.930 --> 00:54:22.350
I want the 95% central
part of where we
00:54:22.350 --> 00:54:27.060
think the true variance lies.
00:54:27.060 --> 00:54:33.750
Now an interesting point here
is chi-squared is asymmetric.
00:54:33.750 --> 00:54:37.260
So if you ever see somebody
going off and writing,
00:54:37.260 --> 00:54:40.200
I think the true
variance is equal to s
00:54:40.200 --> 00:54:45.240
squared plus or minus
14.2, that should
00:54:45.240 --> 00:54:47.415
be a great, big red flag.
00:54:51.360 --> 00:54:54.330
It's somebody who doesn't know
what they're talking about.
00:54:54.330 --> 00:54:56.180
Well, maybe they have
a huge sample size,
00:54:56.180 --> 00:54:58.650
and they're appealing to
a normal distribution.
00:54:58.650 --> 00:55:04.520
But what they're probably doing
here is something very wrong.
00:55:04.520 --> 00:55:07.790
Because the chi-squared
distribution is not symmetric,
00:55:07.790 --> 00:55:11.340
I have my best point
estimate of s squared.
00:55:11.340 --> 00:55:15.200
And then I'm going to
go a different distance
00:55:15.200 --> 00:55:17.360
to the left and a different
distance to the right.
00:55:17.360 --> 00:55:20.930
So here's, still for
our same example,
00:55:20.930 --> 00:55:25.370
the chi-squared distribution
for n, a sample size of 50.
00:55:25.370 --> 00:55:28.880
So this is a chi-squared
with 49 degrees of freedom.
00:55:28.880 --> 00:55:32.270
And again, I want
2.5% in the left tail
00:55:32.270 --> 00:55:34.640
and 2.5% in the right tail.
00:55:34.640 --> 00:55:38.210
And so if I apply that
formula, and I have to look up
00:55:38.210 --> 00:55:44.110
chi-squared with 0.025
and 49 degrees of freedom,
00:55:44.110 --> 00:55:49.590
and then the chi-squared
where I need to know--
00:55:49.590 --> 00:55:54.210
I want 97.5%-- everything
except just alpha over 2
00:55:54.210 --> 00:55:56.940
out to the right.
00:55:56.940 --> 00:55:59.150
The s squareds are the
same in both cases.
00:55:59.150 --> 00:56:01.100
My n minus 1 is the same.
00:56:01.100 --> 00:56:06.320
But because these values, the
chi-squareds, are not equal--
00:56:06.320 --> 00:56:07.520
whoops.
00:56:07.520 --> 00:56:10.340
I guess I got these flipped.
00:56:10.340 --> 00:56:12.080
Actually, when you
look at the tables
00:56:12.080 --> 00:56:17.540
at the back of Montgomery
or May and Spanos,
00:56:17.540 --> 00:56:18.890
be careful on the definition.
00:56:18.890 --> 00:56:20.420
They often show
you a little plot
00:56:20.420 --> 00:56:23.060
that looks a lot like this.
00:56:23.060 --> 00:56:27.200
And they shade in what
their percentage points are.
00:56:27.200 --> 00:56:31.410
And sometimes they go from the
right, sometimes from the left.
00:56:31.410 --> 00:56:33.700
But the point was
when you actually
00:56:33.700 --> 00:56:36.840
look that up, you
get different values
00:56:36.840 --> 00:56:38.310
for the left and the right.
00:56:38.310 --> 00:56:42.620
And when you divide those
out, you get a range--
00:56:42.620 --> 00:56:44.440
get that out of the way.
00:56:44.440 --> 00:56:49.611
You get a range finally for
where your true variance lies.
00:56:49.611 --> 00:56:53.270
AUDIENCE: So is that through
a [INAUDIBLE] or estimates
00:56:53.270 --> 00:56:58.850
of variance or from chi-square
distribution, or is that--
00:56:58.850 --> 00:57:03.750
PROFESSOR: The point
is that all estimates--
00:57:03.750 --> 00:57:08.370
well, it's strictly true if I'm
drawing from a population that
00:57:08.370 --> 00:57:09.960
is normally distributed.
00:57:09.960 --> 00:57:11.910
But an approximation
is no matter
00:57:11.910 --> 00:57:18.180
what, any time I'm
calculating a variance,
00:57:18.180 --> 00:57:21.210
the variance tends to be
chi-squared distributed.
00:57:21.210 --> 00:57:23.190
So it's always going
to be these kinds
00:57:23.190 --> 00:57:24.690
of chi-squared calculations.
00:57:27.580 --> 00:57:30.250
So it's not that the
chi-squared was a special case.
00:57:30.250 --> 00:57:34.120
It's the PDF that
you should always
00:57:34.120 --> 00:57:36.580
associate with s squared.
00:57:41.310 --> 00:57:45.090
And notice here, we had 102.3.
00:57:45.090 --> 00:57:47.310
That's our best guess.
00:57:47.310 --> 00:57:53.070
And we had 71.4 and 158.1
for the range on the variance.
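Those bounds can be reproduced from the formula just described — (n - 1) s^2 divided by the right- and left-tail chi-squared percentage points. Here is a sketch using scipy (my own choice of tool; the lecture looks these values up in printed tables), for n = 50 and s^2 = 102.3:

```python
# Sketch of the 95% confidence interval for the true variance, given
# n = 50 and sample variance s^2 = 102.3 (the lecture's numbers).
from scipy.stats import chi2

n, s2, alpha = 50, 102.3, 0.05
dof = n - 1
lo = dof * s2 / chi2.ppf(1 - alpha / 2, dof)  # divide by the RIGHT-tail point
hi = dof * s2 / chi2.ppf(alpha / 2, dof)      # divide by the LEFT-tail point
print(round(lo, 1), round(hi, 1))  # lower bound near 71.4
```

The upper bound comes out near 158.9 rather than the quoted 158.1; a small discrepancy like that typically traces back to table rounding.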
00:57:58.507 --> 00:58:00.215
I always find this a
little bit shocking.
00:58:02.840 --> 00:58:04.790
A sample size of 50?
00:58:04.790 --> 00:58:08.940
I took 50 samples, right?
00:58:08.940 --> 00:58:16.920
And I had-- my underlying
variance, I guess, was 100.
00:58:16.920 --> 00:58:19.170
But I took a lot of samples.
00:58:19.170 --> 00:58:20.970
And it always shocks
me a little bit
00:58:20.970 --> 00:58:24.480
how big the range is on
the estimate of variance
00:58:24.480 --> 00:58:27.000
coming out of this.
00:58:27.000 --> 00:58:30.420
Here, my estimate of
variance is 102.3.
00:58:30.420 --> 00:58:31.980
Well, that's at
least reassuring,
00:58:31.980 --> 00:58:34.560
because that's close to the
example that I gave here,
00:58:34.560 --> 00:58:40.780
where a priori, I
thought it was 100.
00:58:40.780 --> 00:58:43.630
I just basically
popped that out.
00:58:43.630 --> 00:58:47.140
What's shocking is
I can go down to 71.
00:58:47.140 --> 00:58:54.880
That's like 30% lower than that,
or 158, which is about 55% higher
00:58:54.880 --> 00:58:57.770
than my point estimate.
00:58:57.770 --> 00:59:01.190
And a really important thing
just to know qualitatively
00:59:01.190 --> 00:59:04.610
is that estimating a
mean is pretty easy.
00:59:04.610 --> 00:59:06.680
And actually, as
sample size grows,
00:59:06.680 --> 00:59:09.770
you can get pretty good
tight estimates of mean.
00:59:09.770 --> 00:59:13.785
But the estimates of
variance are hard.
00:59:13.785 --> 00:59:17.260
You need a lot of
data to estimate
00:59:17.260 --> 00:59:21.650
that second-order statistic.
00:59:21.650 --> 00:59:24.490
And so we get big
spreads in variance.
00:59:24.490 --> 00:59:26.890
So you've got to be really
careful in your reasoning
00:59:26.890 --> 00:59:28.180
about variances.
00:59:28.180 --> 00:59:31.060
And that'll bring us back to the
F-statistic a little bit later.
00:59:36.150 --> 00:59:40.260
So let me go back now to
the student t-distribution.
00:59:40.260 --> 00:59:44.550
And it has a formula
and a formal definition
00:59:44.550 --> 00:59:49.560
here, which is if I start out
with a random variable z, that
00:59:49.560 --> 00:59:52.270
is the unit normal.
00:59:52.270 --> 00:59:57.040
And then I divide it by
a random variable that
00:59:57.040 --> 01:00:02.620
is chi-squared with k degrees
of freedom, divided by k,
01:00:02.620 --> 01:00:06.310
I get a new distribution,
a new variable t,
01:00:06.310 --> 01:00:11.390
that is a t-distribution
with k degrees of freedom.
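That construction — z divided by the square root of a chi-squared over its degrees of freedom — can be simulated directly (a sketch; the seed and k = 3 are arbitrary choices of mine). The resulting variable should show noticeably heavier tails than the unit normal itself.

```python
# Simulation of the t construction: t = z / sqrt(W / k) with
# z ~ N(0, 1) and W ~ chi-squared(k). For small k, |t| exceeds 1.96
# far more often than |z| does, i.e. the t has heavier tails.
import random

random.seed(1)
k, trials = 3, 100_000
count_t = count_z = 0
for _ in range(trials):
    z = random.gauss(0.0, 1.0)
    w = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))  # chi-squared, k dof
    t = z / (w / k) ** 0.5
    count_t += abs(t) > 1.96
    count_z += abs(z) > 1.96

print(count_z / trials)  # close to 0.05 for the unit normal
print(count_t / trials)  # noticeably larger: t with 3 dof has fat tails
```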
01:00:11.390 --> 01:00:13.310
And it's the same question.
01:00:13.310 --> 01:00:15.290
My god, why would you
do such a cruel thing
01:00:15.290 --> 01:00:18.410
to a random variable-- divide
it by a chi-squared random
01:00:18.410 --> 01:00:21.470
variable and some constant k?
01:00:21.470 --> 01:00:26.400
And the answer is
that's essentially
01:00:26.400 --> 01:00:33.270
what we're doing when
we are normalizing data
01:00:33.270 --> 01:00:38.120
like this, when
instead of normalizing
01:00:38.120 --> 01:00:40.430
to the true
underlying population
01:00:40.430 --> 01:00:42.560
variance, we normalize
to the sample variance--
01:00:42.560 --> 01:00:49.220
I'm also having to
estimate not only the mean,
01:00:49.220 --> 01:00:54.460
but also estimate the
population standard deviation.
01:00:54.460 --> 01:00:56.515
We already said, what is s?
01:00:56.515 --> 01:00:59.020
s squared is
chi-squared distributed.
01:00:59.020 --> 01:01:04.400
So s is a square root of a
chi-squared distribution.
01:01:04.400 --> 01:01:08.050
So buried in this
unit normalization
01:01:08.050 --> 01:01:11.620
that we like to do to get to
a probability distribution
01:01:11.620 --> 01:01:13.390
function-- we can
talk about confidence
01:01:13.390 --> 01:01:14.980
intervals on the mean.
01:01:14.980 --> 01:01:18.190
We subtract off some
mean, and then we
01:01:18.190 --> 01:01:21.040
normalize to s over root n.
01:01:21.040 --> 01:01:23.600
But s itself is
this chi-squared.
01:01:23.600 --> 01:01:27.400
So it's really closely
related to the operations
01:01:27.400 --> 01:01:33.040
that we do when we are
normalizing our sample data,
01:01:33.040 --> 01:01:37.130
when we also had to estimate
the standard deviation.
01:01:37.130 --> 01:01:42.580
So the way to think
about the t-distribution
01:01:42.580 --> 01:01:46.330
is it's really close to
the normal distribution,
01:01:46.330 --> 01:01:47.920
except it's perturbed
a little bit,
01:01:47.920 --> 01:01:51.460
because we didn't really
know the underlying variance.
01:01:51.460 --> 01:01:53.710
We're having to
estimate it also.
01:01:53.710 --> 01:01:57.910
So here's some
pictures, some examples.
01:01:57.910 --> 01:02:03.450
The red is the unit
normal distribution.
01:02:03.450 --> 01:02:10.390
And now for different sizes of
sample, so for an n equals 3,
01:02:10.390 --> 01:02:14.520
you have this little
blue distribution.
01:02:14.520 --> 01:02:20.070
That's the t-distribution
with degrees of freedom 3.
01:02:20.070 --> 01:02:23.580
Notice that it's
a little bit wider
01:02:23.580 --> 01:02:27.540
than the normal distribution,
reflecting a little bit
01:02:27.540 --> 01:02:30.360
less certainty on
really the location
01:02:30.360 --> 01:02:33.270
of that random variable.
01:02:33.270 --> 01:02:35.840
Now as n gets
bigger, so we've got
01:02:35.840 --> 01:02:40.800
an n equals 10 example
in here in the green,
01:02:40.800 --> 01:02:42.710
the chi-square-- or
the t-distribution
01:02:42.710 --> 01:02:43.910
gets a little bit tighter.
01:02:43.910 --> 01:02:47.750
And for n equals 100, it's
basically almost lying right
01:02:47.750 --> 01:02:50.370
on top of the
normal distribution.
01:02:50.370 --> 01:02:54.800
So what the t is reflecting is
a little additional uncertainty
01:02:54.800 --> 01:02:58.700
because we didn't
know sigma squared.
01:02:58.700 --> 01:03:01.880
I had to calculate s squared
from that same sample
01:03:01.880 --> 01:03:03.830
distribution.
01:03:03.830 --> 01:03:06.230
So that's all that's
really going on there.
01:03:06.230 --> 01:03:10.040
If we then say, OK,
I want to get back
01:03:10.040 --> 01:03:11.870
to a confidence interval.
01:03:11.870 --> 01:03:14.360
But now, I don't
know the variance,
01:03:14.360 --> 01:03:18.180
and I have to estimate
that also from my data.
01:03:18.180 --> 01:03:22.320
We have essentially the same
confidence interval formula,
01:03:22.320 --> 01:03:26.570
the only difference
being instead of z
01:03:26.570 --> 01:03:29.930
related to the unit
normal distribution,
01:03:29.930 --> 01:03:33.200
we have numbers of
standard deviations
01:03:33.200 --> 01:03:36.200
on the t-distribution
that we're arguing about,
01:03:36.200 --> 01:03:39.860
again, reflecting that that
t is a little bit wider.
01:03:39.860 --> 01:03:43.130
But it's essentially
exactly the same thinking,
01:03:43.130 --> 01:03:46.910
just recognizing that now, the
sampling distribution for x
01:03:46.910 --> 01:03:50.520
bar when variance is unknown--
01:03:50.520 --> 01:03:52.200
is not a normal.
01:03:52.200 --> 01:03:53.580
It's a t-distribution.
01:03:56.300 --> 01:03:58.830
But all the other operations
are exactly the same.
01:03:58.830 --> 01:04:02.930
We look for what alpha error
we're willing to accept,
01:04:02.930 --> 01:04:06.920
what our chance of being wrong
on our bounding of the interval
01:04:06.920 --> 01:04:12.020
is, and then allocating that
to the left and the right;
01:04:12.020 --> 01:04:14.150
figuring out how many
units normal over we
01:04:14.150 --> 01:04:18.650
go on not the underlying
population distribution,
01:04:18.650 --> 01:04:20.430
but our sampling distribution.
01:04:20.430 --> 01:04:25.280
So we still get the benefits of
increasing n getting tighter.
01:04:25.280 --> 01:04:27.760
But we just do that all
on the t-distribution.
01:04:27.760 --> 01:04:30.270
AUDIENCE: So this is-- will
be necessary for small sample
01:04:30.270 --> 01:04:31.300
sizes.
01:04:31.300 --> 01:04:33.250
PROFESSOR: Exactly.
01:04:33.250 --> 01:04:35.110
So the point or the
question was this
01:04:35.110 --> 01:04:38.230
is only necessary for
small sample sizes.
01:04:38.230 --> 01:04:42.670
And that's exactly right
because of the effect
01:04:42.670 --> 01:04:45.910
that we see back with the
t-distribution getting
01:04:45.910 --> 01:04:51.820
very close in approximation to
the normal distribution for n
01:04:51.820 --> 01:04:53.800
becoming appreciable.
01:04:53.800 --> 01:04:55.930
I've heard different
kinds of rules of thumb.
01:04:55.930 --> 01:04:58.930
Some people like to
say for n about 25,
01:04:58.930 --> 01:05:02.030
you're pretty close to
a normal distribution.
01:05:02.030 --> 01:05:05.260
Some people like to
draw it at n equals 40.
01:05:05.260 --> 01:05:10.420
It really depends on what
kind of accuracy you're after.
01:05:10.420 --> 01:05:13.750
But you can be substantially
wrong for very small sample
01:05:13.750 --> 01:05:17.140
sizes-- say sample size 5,
which is a natural sample
01:05:17.140 --> 01:05:21.200
size you would often use in
some manufacturing scenarios.
01:05:21.200 --> 01:05:24.400
So you do have to be
aware for very small n
01:05:24.400 --> 01:05:27.630
to use the t-distribution.
01:05:27.630 --> 01:05:30.390
This was an example
where we had n equals 50
01:05:30.390 --> 01:05:32.130
in our part thickness example.
01:05:32.130 --> 01:05:34.530
Let's see how different
things pop out
01:05:34.530 --> 01:05:37.840
if we use the t-distribution
or the normal distribution.
01:05:37.840 --> 01:05:39.840
So let's go back to our example.
01:05:39.840 --> 01:05:43.440
But now, let's say we don't
know either the variance
01:05:43.440 --> 01:05:45.270
or the mean.
01:05:45.270 --> 01:05:48.000
Both of them are unknown.
01:05:48.000 --> 01:05:50.130
We already calculated
the sample mean.
01:05:50.130 --> 01:05:52.620
We had 113.5.
01:05:52.620 --> 01:05:55.890
And now I'll tell you--
01:05:55.890 --> 01:05:58.080
I guess I already gave you
this number previously.
01:05:58.080 --> 01:06:01.380
But I'll tell you that we
apply the sample variance
01:06:01.380 --> 01:06:06.650
formula to the data, and
out pops the number 102.3.
01:06:06.650 --> 01:06:09.350
So again, that's
your best estimate
01:06:09.350 --> 01:06:14.950
of the sample variance.
01:06:14.950 --> 01:06:17.410
So these are your
point estimates.
01:06:17.410 --> 01:06:19.990
But now, I want to go back
to the question, where's
01:06:19.990 --> 01:06:23.320
the confidence interval on where
we think the true mean would
01:06:23.320 --> 01:06:25.240
be 95% of the time?
01:06:25.240 --> 01:06:28.060
Well, now we have to
use the t-distribution.
01:06:28.060 --> 01:06:32.770
When we do that with
49 degrees of freedom,
01:06:32.770 --> 01:06:35.890
again, n minus 1, because we're
using up 1 for calculation
01:06:35.890 --> 01:06:37.810
of the sample mean.
01:06:37.810 --> 01:06:42.630
Now we have this slightly
different formula.
01:06:42.630 --> 01:06:45.960
Here, we can use the plus/minus,
because the t-distribution,
01:06:45.960 --> 01:06:50.010
like the normal
distribution, is symmetric.
01:06:50.010 --> 01:06:55.420
So I've got plus or minus
some number of unit z's.
01:06:55.420 --> 01:06:57.610
In this case, it's
unit t's because
01:06:57.610 --> 01:07:01.360
the operative distribution
is the t-distribution.
01:07:01.360 --> 01:07:03.140
I plug that in.
01:07:03.140 --> 01:07:08.530
Notice that for 2.5%
in each of the tail,
01:07:08.530 --> 01:07:12.190
the t-distribution
is slightly wider.
01:07:12.190 --> 01:07:13.870
Remember, back with
the unit normal,
01:07:13.870 --> 01:07:19.420
we said 1.96 plus or minus
standard deviations is 95%.
01:07:19.420 --> 01:07:22.060
For the t, you got to go
a little bit further--
01:07:22.060 --> 01:07:24.610
2.01.
01:07:24.610 --> 01:07:26.290
Not a big difference--
01:07:26.290 --> 01:07:27.460
2.01.
01:07:27.460 --> 01:07:29.110
And when you come
out with that, you
01:07:29.110 --> 01:07:34.490
get a slightly wider
confidence interval.
01:07:34.490 --> 01:07:35.910
I'm less confident.
01:07:35.910 --> 01:07:40.370
I got to go further to get to
my 95% confidence on the range
01:07:40.370 --> 01:07:42.680
because I'm also estimating the variance.
01:07:42.680 --> 01:07:45.290
So in this case, the difference
is pretty much negligible.
01:07:45.290 --> 01:07:47.480
And if I had a
sample of size 50,
01:07:47.480 --> 01:07:50.600
I would probably just use
the normal distribution.
01:07:50.600 --> 01:07:53.410
And that's a good example,
showing that difference
01:07:53.410 --> 01:07:56.410
is 5 parts out of 200.
01:07:56.410 --> 01:07:58.180
It's really quite small.
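The comparison just made can be sketched numerically (scipy is my own choice of tool here): x-bar = 113.5, s^2 = 102.3, n = 50, building the 95% interval once from the unit normal and once from the t with 49 degrees of freedom.

```python
# Sketch of the mean confidence interval comparison: z-based vs.
# t-based, for xbar = 113.5, s^2 = 102.3, n = 50 (the lecture's numbers).
from math import sqrt
from scipy.stats import norm, t

n, xbar, s2 = 50, 113.5, 102.3
se = sqrt(s2 / n)

z_star = norm.ppf(0.975)      # ~1.96
t_star = t.ppf(0.975, n - 1)  # ~2.01, slightly wider

ci_z = (xbar - z_star * se, xbar + z_star * se)
ci_t = (xbar - t_star * se, xbar + t_star * se)
print(ci_z, ci_t)  # the t interval is only marginally wider at n = 50
```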
01:08:02.350 --> 01:08:04.450
One more distribution
I want to mention--
01:08:04.450 --> 01:08:05.950
we're not going to
use it much here.
01:08:05.950 --> 01:08:09.220
I think I've already
described it briefly--
01:08:09.220 --> 01:08:11.200
is this F distribution.
01:08:11.200 --> 01:08:14.620
And this arises if I have
one random variable that
01:08:14.620 --> 01:08:16.270
is chi-squared distributed.
01:08:16.270 --> 01:08:19.000
I take another random variable
that's chi-squared distributed.
01:08:19.000 --> 01:08:21.729
And I form a new
random variable R
01:08:21.729 --> 01:08:24.880
that is the ratio of
those two, each normalized
01:08:24.880 --> 01:08:29.740
to the degrees of freedom
or the number of variables
01:08:29.740 --> 01:08:32.680
that went into each of those
chi-squared distributed
01:08:32.680 --> 01:08:33.729
variables.
01:08:33.729 --> 01:08:40.529
And that is an F with u
and v degrees of freedom.
01:08:40.529 --> 01:08:48.359
Again, this comes up when we're
looking at things like ratios
01:08:48.359 --> 01:08:52.920
and want to reason about ratios
of true population variances,
01:08:52.920 --> 01:09:00.390
based on observations
of sample variances.
01:09:00.390 --> 01:09:05.210
And the key place where that
might come up that I mentioned
01:09:05.210 --> 01:09:07.970
is experimental design cases.
01:09:07.970 --> 01:09:10.520
So this is an injection
molding example,
01:09:10.520 --> 01:09:12.950
where you might be looking
at two different process
01:09:12.950 --> 01:09:16.800
conditions-- a low hold
time and a high hold time.
01:09:16.800 --> 01:09:19.340
And there may be other
things varying, maybe even
01:09:19.340 --> 01:09:23.479
other variables varying, that
cause there to be a spread.
01:09:23.479 --> 01:09:25.640
Or there's just
natural variation
01:09:25.640 --> 01:09:27.350
in the two populations.
01:09:27.350 --> 01:09:30.370
And you might ask
questions like,
01:09:30.370 --> 01:09:33.210
are these two
variances different?
01:09:33.210 --> 01:09:36.090
Did I improve the variance with
that process condition change?
01:09:38.810 --> 01:09:40.370
Maybe-- maybe not.
01:09:40.370 --> 01:09:42.979
Certainly not obvious
here, so you might
01:09:42.979 --> 01:09:44.352
have a very low confidence.
01:09:44.352 --> 01:09:46.310
So we're going to go and
use the F distribution
01:09:46.310 --> 01:09:50.510
a little bit later when we
do analysis of experiments,
01:09:50.510 --> 01:09:54.050
especially where you're looking
to try to make inferences
01:09:54.050 --> 01:09:55.970
about whether there
is differences
01:09:55.970 --> 01:09:59.750
between a couple of populations.
01:09:59.750 --> 01:10:04.010
And again, because we're
dealing with variances,
01:10:04.010 --> 01:10:07.520
there's a huge spread
that arises naturally
01:10:07.520 --> 01:10:12.200
in these distributions,
purely by chance.
01:10:12.200 --> 01:10:15.470
This is a good place
to re-emphasize
01:10:15.470 --> 01:10:19.550
that a lot of what's going
on here in random sampling
01:10:19.550 --> 01:10:23.060
is there's spread in the
observations that you get.
01:10:23.060 --> 01:10:25.830
So here's a very simple
numerical example.
01:10:25.830 --> 01:10:30.900
If I start with a variable
that is unit normal,
01:10:30.900 --> 01:10:36.120
and I'm just going to take
two samples, sets of size n
01:10:36.120 --> 01:10:38.540
equals 20.
01:10:38.540 --> 01:10:40.690
So I'm taking two
different samples,
01:10:40.690 --> 01:10:42.680
same underlying population.
01:10:42.680 --> 01:10:45.400
I'm not making a
process change, say.
01:10:45.400 --> 01:10:48.680
I'm just taking two
samples, each of size 20.
01:10:48.680 --> 01:10:51.320
By chance, when I take
that first sample size,
01:10:51.320 --> 01:10:55.950
I calculate a particular
sample variance, s squared.
01:10:55.950 --> 01:10:57.620
And by chance, I
calculate another one
01:10:57.620 --> 01:10:58.970
for the second sample.
01:10:58.970 --> 01:11:03.470
And if I form the ratio of
those two, what typical range am
01:11:03.470 --> 01:11:07.580
I going to observe in the
ratio of those two variances?
01:11:07.580 --> 01:11:10.400
For example, what
ratio might I observe
01:11:10.400 --> 01:11:13.370
95% of the time or what range?
01:11:13.370 --> 01:11:15.810
And that's the F distribution.
01:11:15.810 --> 01:11:20.930
In fact, if I look at
the upper and lower
01:11:20.930 --> 01:11:26.810
bound on the range of that
ratio for a 95% confidence
01:11:26.810 --> 01:11:31.280
interval for this ratio
of two samples of size 20,
01:11:31.280 --> 01:11:36.260
I can go anywhere from
2.5 to 0.4 in that ratio.
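That 2.5-to-0.4 range can be recovered from the F distribution with (19, 19) degrees of freedom — a sketch using scipy (my assumption for the tool; a printed F table gives the same numbers):

```python
# Sketch of the ratio-of-variances bounds: two samples of size 20 drawn
# from the SAME normal population, 95% central range for s1^2 / s2^2
# from the F distribution with (19, 19) degrees of freedom.
from scipy.stats import f

n1 = n2 = 20
d1, d2 = n1 - 1, n2 - 1
hi = f.ppf(0.975, d1, d2)  # ~2.5
lo = f.ppf(0.025, d1, d2)  # ~0.4 (= 1/hi when the dof are equal)
print(round(lo, 2), round(hi, 2))
```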
01:11:39.120 --> 01:11:41.040
That's with samples of size 20.
01:11:41.040 --> 01:11:43.540
That's a huge range, right?
01:11:43.540 --> 01:11:46.900
Imagine, 2 and 1/2 times
bigger variance over here,
01:11:46.900 --> 01:11:48.790
compared to over here.
01:11:48.790 --> 01:11:51.950
And that occurs
purely by chance.
01:11:51.950 --> 01:11:57.070
So 95% of the time, I
might have ratios within that.
01:11:57.070 --> 01:11:59.800
But 5% of the time,
I'll even observe
01:11:59.800 --> 01:12:02.350
ratios that are
bigger or even smaller
01:12:02.350 --> 01:12:04.540
than those extreme points.
01:12:04.540 --> 01:12:07.140
So you've got to be really
careful in reasoning
01:12:07.140 --> 01:12:08.160
about variances.
01:12:10.990 --> 01:12:13.210
So we're mostly there.
01:12:13.210 --> 01:12:15.340
The last thing I
want to do here is
01:12:15.340 --> 01:12:20.260
draw the relationship of some
of these to hypothesis tests.
01:12:20.260 --> 01:12:23.590
And that gets us very close to
some of the Shewhart hypotheses
01:12:23.590 --> 01:12:26.020
that are the basis
for control charts
01:12:26.020 --> 01:12:28.490
that we'll talk about
in the next lecture.
01:12:28.490 --> 01:12:32.110
But I do want to get the
basic idea in the last five,
01:12:32.110 --> 01:12:36.740
10 minutes on what a
statistical hypothesis is
01:12:36.740 --> 01:12:39.440
and how that relates to some
of these confidence intervals
01:12:39.440 --> 01:12:42.000
that we've been talking about.
01:12:42.000 --> 01:12:44.870
So the basic idea we've been
doing with these means is
01:12:44.870 --> 01:12:48.350
we've been hypothesizing that
the mean has some distribution,
01:12:48.350 --> 01:12:50.600
say a normal distribution.
01:12:50.600 --> 01:12:54.500
And then when we talked about
this confidence interval,
01:12:54.500 --> 01:12:57.740
I would say, accept or
reject the hypothesis
01:12:57.740 --> 01:13:03.200
that the mean was within some
range with some probability.
01:13:03.200 --> 01:13:06.920
We can extend that to
asking other questions
01:13:06.920 --> 01:13:09.080
or other hypotheses,
and then looking
01:13:09.080 --> 01:13:11.150
at the probabilities
associated with it,
01:13:11.150 --> 01:13:14.300
and saying, with some
degree of confidence,
01:13:14.300 --> 01:13:15.980
I believe the hypothesis.
01:13:15.980 --> 01:13:19.460
Or I have enough
evidence to counter it.
01:13:19.460 --> 01:13:23.990
And a typical example
might be a null hypothesis,
01:13:23.990 --> 01:13:31.100
often referred to as H0,
that the mean is some
01:13:31.100 --> 01:13:34.070
a priori mean, some mu 0.
01:13:34.070 --> 01:13:37.580
The null hypothesis is based
on this sample, this sample
01:13:37.580 --> 01:13:39.890
that I'm drawing
from the population.
01:13:39.890 --> 01:13:41.780
I have this
alternative hypothesis
01:13:41.780 --> 01:13:43.020
that the mean has changed.
01:13:43.020 --> 01:13:44.810
It's no longer the same mean.
01:13:44.810 --> 01:13:48.917
Do I have enough evidence to say
with some degree of confidence
01:13:48.917 --> 01:13:50.000
that the mean has changed?
01:13:52.760 --> 01:13:55.610
And it's a little
tricky because there's
01:13:55.610 --> 01:13:57.470
all these probabilities
associated
01:13:57.470 --> 01:14:00.450
with random sampling.
01:14:00.450 --> 01:14:03.260
So I observe a particular
value with some deviation.
01:14:03.260 --> 01:14:10.130
How do I know to what
degree there's an actual shift,
01:14:10.130 --> 01:14:13.210
say, in the mean or not?
01:14:13.210 --> 01:14:14.380
So let's look at this.
01:14:14.380 --> 01:14:16.840
What we do is we
form the hypothesis.
01:14:16.840 --> 01:14:19.180
We then look at the
probabilities associated
01:14:19.180 --> 01:14:22.840
with the two cases, and then
based on those probabilities,
01:14:22.840 --> 01:14:25.720
say with some degree
of confidence,
01:14:25.720 --> 01:14:28.250
I choose one or the other.
01:14:28.250 --> 01:14:31.420
And what's important is there's
always the chance of being
01:14:31.420 --> 01:14:33.940
wrong, making an error--
01:14:33.940 --> 01:14:38.170
those alpha errors out in
the tails, for example--
01:14:38.170 --> 01:14:39.500
with that decision.
01:14:39.500 --> 01:14:42.850
So that's where this
confidence level comes in.
01:14:42.850 --> 01:14:45.430
So let's say we're
looking at this test.
01:14:45.430 --> 01:14:48.920
We're asking-- the
null hypothesis
01:14:48.920 --> 01:14:54.380
is I have a normal distribution
with some a priori mean
01:14:54.380 --> 01:14:56.250
and some a priori variance.
01:14:56.250 --> 01:14:58.040
I'm going to draw a new sample.
01:14:58.040 --> 01:15:02.880
And based on that, I
want to either decide
01:15:02.880 --> 01:15:07.290
that a shift has occurred
or that the data--
01:15:07.290 --> 01:15:11.170
or not-- that the data comes
from that distribution or not.
01:15:11.170 --> 01:15:12.960
And so what we're
going to do is use
01:15:12.960 --> 01:15:16.560
essentially this same
confidence interval idea
01:15:16.560 --> 01:15:21.730
and say, say to 95%
confidence, 95% of the time,
01:15:21.730 --> 01:15:26.370
if my value lies in the central
part of that distribution,
01:15:26.370 --> 01:15:30.870
I'm going to accept the--
01:15:30.870 --> 01:15:33.210
well, in this case,
the null hypothesis
01:15:33.210 --> 01:15:37.840
that my new sample still comes
from that same distribution.
01:15:37.840 --> 01:15:41.580
So that would be my 95%,
my 1 minus alpha, if alpha
01:15:41.580 --> 01:15:43.390
is a 5% error.
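That acceptance region can be sketched in a few lines using Python's standard-library `statistics.NormalDist`; the in-control mean, sigma, and sample size here are illustrative values, not from the lecture:

```python
from statistics import NormalDist
from math import sqrt

# Illustrative (made-up) values: in-control mean, standard
# deviation, and sample size.
mu0, sigma, n = 10.0, 2.0, 4
alpha = 0.05  # significance level: 5% type I error

# Two-sided acceptance region for the sample mean under H0.
# The sample mean of n draws has standard deviation sigma/sqrt(n).
z = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
se = sigma / sqrt(n)
lower, upper = mu0 - z * se, mu0 + z * se

print(f"accept H0 if sample mean is in [{lower:.3f}, {upper:.3f}]")
```

A new sample mean inside `[lower, upper]` keeps the null hypothesis; one outside it falls in the region of rejection.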
01:15:43.390 --> 01:15:46.500
But if I observe a
sample mean, say,
01:15:46.500 --> 01:15:51.230
or I observe a piece of
data that lies out here,
01:15:51.230 --> 01:15:53.450
I'm going to reject
the null hypothesis.
01:15:53.450 --> 01:15:55.130
I'm going to say,
rather than
01:15:55.130 --> 01:15:58.910
assuming I've got an unlikely
event by chance,
01:15:58.910 --> 01:16:02.660
I think this instead indicates
something has changed.
01:16:02.660 --> 01:16:04.270
Something has changed
in the process.
01:16:04.270 --> 01:16:07.213
And we'll call that the
region of rejection.
01:16:10.340 --> 01:16:14.090
So again, already you
can see one kind of error
01:16:14.090 --> 01:16:16.160
that's likely to pop up.
01:16:16.160 --> 01:16:18.710
There is a confidence
interval, this alpha.
01:16:18.710 --> 01:16:21.840
There is a significance
level to the test,
01:16:21.840 --> 01:16:24.470
very similar to the
confidence interval
01:16:24.470 --> 01:16:28.290
idea and the alpha error
associated with that.
01:16:28.290 --> 01:16:31.840
So right away, you see
there's one kind of error--
01:16:31.840 --> 01:16:35.240
it's referred to
as a type I error--
01:16:35.240 --> 01:16:37.060
on these kinds of
hypothesis tests.
01:16:37.060 --> 01:16:40.900
We're rejecting the
null hypothesis out
01:16:40.900 --> 01:16:44.155
here in the tails with
some probability alpha.
01:16:47.675 --> 01:16:50.740
If I observed a point
out there in the tails,
01:16:50.740 --> 01:16:54.650
even if that population
or that distribution
01:16:54.650 --> 01:16:57.360
is still operative,
the null hypothesis is, in fact, true.
01:16:57.360 --> 01:17:01.250
My samples are still coming
from that distribution.
01:17:01.250 --> 01:17:06.040
But I happened to draw a
sample way out in the tail.
01:17:06.040 --> 01:17:08.590
And I said, well,
that was unlikely.
01:17:08.590 --> 01:17:10.630
That was unlikely
in this picture.
01:17:10.630 --> 01:17:12.400
I'm rejecting the
null hypothesis.
01:17:12.400 --> 01:17:15.608
I'm claiming this is evidence
that something changed
01:17:15.608 --> 01:17:16.900
when, in fact, nothing changed.
01:17:16.900 --> 01:17:19.150
I just got unlucky, right?
01:17:19.150 --> 01:17:22.300
So the first type of
error that you can make
01:17:22.300 --> 01:17:25.570
is this type I error.
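A hypothetical simulation makes that type I risk concrete (the mean, sigma, and sample size are made-up values): even when samples really do come from the in-control distribution, the test rejects roughly alpha of the time purely by bad luck.

```python
import random
from statistics import NormalDist, fmean

random.seed(0)
mu0, sigma, n = 10.0, 2.0, 4   # illustrative in-control process
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)
se = sigma / (n ** 0.5)

trials = 20_000
rejections = 0
for _ in range(trials):
    # Draw a sample of n points from the TRUE in-control distribution.
    xbar = fmean(random.gauss(mu0, sigma) for _ in range(n))
    # Reject H0 when the sample mean lands out in a tail.
    if abs(xbar - mu0) > z * se:
        rejections += 1

print(f"empirical type I rate: {rejections / trials:.3f}")
```

The printed rate comes out close to 0.05: those are the good batches the producer throws away.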
01:17:28.160 --> 01:17:31.970
It's also sometimes referred
to as producer error,
01:17:31.970 --> 01:17:33.890
producer risk.
01:17:33.890 --> 01:17:35.390
You're the manufacturer.
01:17:35.390 --> 01:17:38.390
You reject your
part because your--
01:17:38.390 --> 01:17:40.970
or you reject a batch,
say, because your sample
01:17:40.970 --> 01:17:42.710
was way out here in the tail.
01:17:42.710 --> 01:17:45.980
You're taking the risk of
rejecting and throwing away
01:17:45.980 --> 01:17:50.900
good product, even though
it really was good.
01:17:50.900 --> 01:17:54.740
If I took more samples, it would
go back and really indicate
01:17:54.740 --> 01:17:55.820
what was going on--
01:17:55.820 --> 01:17:58.010
that the product was still good.
01:17:58.010 --> 01:18:01.250
So it's also sometimes
referred to as producer risk.
01:18:01.250 --> 01:18:04.950
But there's another
possible error.
01:18:04.950 --> 01:18:11.200
There is an error associated
with the case where the distribution
01:18:11.200 --> 01:18:12.880
shifted or changed.
01:18:12.880 --> 01:18:16.210
I still accepted it
based on a random sample
01:18:16.210 --> 01:18:17.950
from the different
distribution that
01:18:17.950 --> 01:18:20.800
happened to fall in
the acceptance region of my original distribution.
01:18:20.800 --> 01:18:24.640
And that's referred
to as type II error--
01:18:24.640 --> 01:18:28.305
has a probability associated
with that called beta.
01:18:28.305 --> 01:18:30.055
We've been talking all
about these alphas.
01:18:30.055 --> 01:18:31.720
Well, there's also a beta.
01:18:31.720 --> 01:18:37.510
It's also sometimes referred
to as a consumer's risk.
01:18:37.510 --> 01:18:40.690
The manufacturer did
a little inspection.
01:18:40.690 --> 01:18:43.270
The mean happened to fall
in the region of acceptance.
01:18:43.270 --> 01:18:44.940
He shipped it.
01:18:44.940 --> 01:18:48.030
Turns out, by bad chance
it just happened
01:18:48.030 --> 01:18:49.500
to fall in the good region.
01:18:49.500 --> 01:18:54.470
It really is coming
from a bad distribution.
01:18:54.470 --> 01:18:55.610
So let's look at that.
01:18:55.610 --> 01:18:57.770
What is this beta?
01:18:57.770 --> 01:18:59.990
Well, for the type II
errors, we essentially
01:18:59.990 --> 01:19:05.120
have to hypothesize a shift of
some size, some little delta.
01:19:05.120 --> 01:19:08.330
And then we assess
the probabilities
01:19:08.330 --> 01:19:12.590
that I'm drawing from the tail
of that shifted distribution
01:19:12.590 --> 01:19:14.630
and just happen
to fall over here
01:19:14.630 --> 01:19:20.040
in this region of acceptance
for our good distribution.
01:19:20.040 --> 01:19:23.210
So this is the
probability associated
01:19:23.210 --> 01:19:24.470
with our null hypothesis.
01:19:24.470 --> 01:19:26.720
This is our starting
distribution.
01:19:26.720 --> 01:19:29.600
Our alternative
hypothesis here is
01:19:29.600 --> 01:19:32.375
that I had a plus delta
shift in the mean.
01:19:34.950 --> 01:19:38.900
So this is our
possible new operative distribution.
01:19:38.900 --> 01:19:41.040
And in fact, for
a type II error,
01:19:41.040 --> 01:19:43.860
this shifted distribution is actually at work.
01:19:43.860 --> 01:19:46.950
Remember, this is the
region of acceptance.
01:19:46.950 --> 01:19:50.690
So I'm claiming this is good.
01:19:50.690 --> 01:19:54.110
But if the population
actually shifted over there
01:19:54.110 --> 01:19:56.570
to the right, notice
off on the left
01:19:56.570 --> 01:20:01.560
here we've got this
whole tail, where
01:20:01.560 --> 01:20:04.020
if I drew from the
shifted distribution,
01:20:04.020 --> 01:20:07.380
I've got that tail, that lightly
shaded blue tail, falling
01:20:07.380 --> 01:20:09.930
in the region of acceptance,
where I would say it's
01:20:09.930 --> 01:20:14.140
a good distribution
and erroneously accept.
01:20:14.140 --> 01:20:19.000
And one can simply apply the
same probabilities to basically
01:20:19.000 --> 01:20:21.280
go in and calculate--
01:20:21.280 --> 01:20:26.830
just integrate up and do the
cumulative normal distribution
01:20:26.830 --> 01:20:32.200
function to calculate
what that tail is.
01:20:32.200 --> 01:20:36.410
So it's all the
same probabilities.
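A minimal sketch of that beta calculation, using the cumulative normal exactly as described: integrate the shifted distribution over the region of acceptance. The in-control mean, sigma, sample size, and shift delta are illustrative, not values from the lecture.

```python
from statistics import NormalDist
from math import sqrt

# Illustrative values: in-control process and a hypothesized
# +delta shift in the mean (the alternative hypothesis).
mu0, sigma, n, delta = 10.0, 2.0, 4, 1.5
alpha = 0.05

se = sigma / sqrt(n)
z = NormalDist().inv_cdf(1 - alpha / 2)
lower, upper = mu0 - z * se, mu0 + z * se   # region of acceptance under H0

# beta = P(sample mean falls in the acceptance region | mean = mu0 + delta):
# the tail of the shifted distribution that spills into "accept".
shifted = NormalDist(mu0 + delta, se)
beta = shifted.cdf(upper) - shifted.cdf(lower)
power = 1 - beta   # probability of correctly detecting the shift

print(f"beta = {beta:.3f}, power = {power:.3f}")
```

With these numbers beta comes out near 0.68, i.e. a shift of this size would slip past a single sample most of the time; larger samples or larger shifts drive beta down.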
01:20:36.410 --> 01:20:40.510
So the applications of this
are really going to be uses
01:20:40.510 --> 01:20:42.470
of hypothesis testing.
01:20:42.470 --> 01:20:44.020
This would be
shifts of the mean.
01:20:44.020 --> 01:20:47.470
You can start to see worrying
about monitoring your process
01:20:47.470 --> 01:20:50.140
and seeing if something
changed in your process,
01:20:50.140 --> 01:20:53.260
a shift occurred, and
being able to detect that.
01:20:53.260 --> 01:20:55.270
And that gets us
to control charting
01:20:55.270 --> 01:20:58.160
that we'll do next time.
01:20:58.160 --> 01:21:00.730
So this is all pretty
much the same stuff.
01:21:00.730 --> 01:21:03.250
And now this is a peek ahead.
01:21:03.250 --> 01:21:06.250
You'll see process control.
01:21:06.250 --> 01:21:09.010
And we'll talk about
repeated samples
01:21:09.010 --> 01:21:13.840
in time coming from the
same distribution next time.
01:21:13.840 --> 01:21:16.240
So we will see you on Thursday.
01:21:16.240 --> 01:21:20.980
And we will dive into
Shewhart control charts.