The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So as I was saying, what we want to do is get up through the use of some of these statistical distributions for making hypothesis tests and understanding the probabilities associated with hypotheses such as "this point belongs to this distribution or that distribution." And that will set the ground for talking about statistical process control and SPC charting, where you're asking the question of a new piece of data off of the manufacturing line: does that piece of data come from the in-control distribution, or does it come from some out-of-control distribution? So it's all about probabilities on SPC charts. And we want to build up the rest of the machinery that we need for that today.

To do that, one of the subtle things that we have to understand a bit more about is sampling and sampling distributions. And really what we're dealing with here is the use of statistics on observed data. I have this philosophical picture of what I think of as the meaning of statistics. Our real goal in statistics is to reason about, think about, and be able to argue about processes-- in our case, real manufacturing processes-- when there's uncertainty in those processes. There is noise. There are other things we don't know. But the key idea in statistics is we are getting some evidence. We're getting some data. And what we want to be able to do is use that data to start to infer things back about the underlying population, the underlying process or distribution.

So there are some preconditions in here. A lot of what I said here is we're reasoning based on evidence from observed data.
But that really means we are fundamentally taking a probability model of what's going on. And we talked last time, for example, about assumptions with normal distributions and the parameters of normal distributions. What we're going to do today is focus a little bit more on evidence coming from finite sets of observations drawn from that population, and then the calculations we do on those-- simple calculations, like calculating the sample mean. And then we have this number, this sample mean. What's it really telling us? What can we infer back about the underlying distribution-- what the true mean of the underlying population is? A little bit later, we'll flesh this out more.

But already, even as we start building these simple arguments based on our data, we have an underlying implicit model of the process. It may be a purely probabilistic model, saying it has a certain mean and a Gaussian distribution-- a normal, or a uniform, or a Poisson. There is a model there. And so we have to keep in mind that it is only a model. A little bit later, we'll also build up other kinds of functional relationships when we get to things like response surface modeling. But for now, these are relatively simple models, mostly focused on the probabilistic or stochastic nature of the process.

So here's the plan for today. What we're going to do is talk a little bit about sampling distributions. We touched on this a little bit last time when we talked about the distribution of the sum of random variables and the central limit theorem, where the sum or the average always tends towards the normal. In some of the cases, we're going to be calculating things like the sample s squared, the sample variance, that are not going to be normally distributed. They will have other statistical shapes or statistical distributions, such as the chi-squared. There will be other cases where the Student t-distribution is applicable.
So we want to get a sense of these sampling distributions and understand how to use those to make not only point estimates-- that is, our best guess of things like the underlying population mean-- but also confidence intervals-- where, with some probability, we think the true mean lies, or where, with some probability, we think the true variance lies, based on one set of observations. So that's where the sampling distributions come into play. And we'll talk about the effect of sample size on that, as well as what kinds of inferences-- these point and confidence interval inferences-- we can make. And then, again, we're leading up towards hypothesis testing. And then really, this will be for next time: we'll dive into SPC charts.

So here's how we typically are using sampling. We have some underlying-- I'll refer to it as the population distribution, or sometimes the parent distribution. It's the set or universe of all possible parts, say, coming off your manufacturing line, or all possible observations. What we're typically going to do is just draw some finite number of samples, some n samples of the process output-- so some x sub i drawn from a parent distribution with some PDF p. And what we're going to be doing is calculating the sample mean, sample variance, and other sorts of sample statistics.

A key point here is that the underlying process, that basic variable x, has a probability distribution function associated with it. This new variable x bar that we calculate, this statistic that we calculate, also has a probability density function associated with it. And it's a different one than the parent one. And so what we'll need to understand is what those probability distributions are that arise from sampling, and then how to work backwards from those to make inferences about the parent.

Now, a quick thing.
I guess there are both definitions on this slide, but also a quick thing about definitions, terminology, and notation that I like to use. In particular, I'm, again, distinguishing between the population or parent distribution and these sample statistics. Typically, when I talk about "truth" or the population as a whole, we're using Greek variables like mu, sigma, and rho sub xy for the correlation coefficient. And those expectations, those different moments, are calculated over the entire population. Typically we're doing those analytically, if we have a closed-form description of what the population is. In contrast, I'm going to typically use Roman characters-- x bar, s, and r sub xy, for example-- to indicate the finite sample statistics calculated from some n number of observations. So that's when we have a finite, discrete number of observations. And we have simple formulas for the calculation of those statistics.

A little bit later in the term, we will come back and start to look in particular at covariance and correlation between two different random variables, some x and y. Those are especially important when we're looking for functional dependencies. Right now, we're simply looking at one set of data or one population, one random variable x. So we'll focus on univariate stuff today.

There is a term, "random sampling," that actually has a technical definition that I want to point out. It's very close to the intuitive notion here, but it is a little bit stronger in its requirements. We said sampling is this act of taking some finite observations out of a population. Random sampling is when every observation that we pull is identically distributed-- has the same PDF associated with it-- and is independent from any other sample that we pull from that population.
And this would not always naturally be the case-- if you had, for example, a finite population, and you pulled out a sample, held it in your hand, recorded it, and then pulled out another sample. Imagine that you've got a bag with 17 blue and red marbles in it. And I pull a marble out, and it's red. I hold it in my hand, and I pull another marble out. Do you think I'm sampling from the same underlying distribution? No, because I did not replace that original marble. So now the mix of blue and red marbles is different within that bag, and the probability is different. It is not identical and independent anymore. The observation that I made first, based on the first draw, changes the probability for later draws-- there is dependence, as well as no longer an identical distribution.

So when we do random sampling, as I'm defining it here-- and random sampling for calculation of some of these sampling distributions-- we're assuming that if it's coming from a finite population, you would always put the observation back in and do another sample from the same pool. Typically what you're often doing is assuming there's no connection from one to the other, and the same process physics is operable from one point in time to the next. So we are typically making this IID, this Independent and Identically Distributed, assumption.
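[A minimal simulation sketch of the marble-bag point, under an assumption: the lecture only says the bag holds 17 blue and red marbles, so the 10 red / 7 blue split below is a hypothetical illustration.]

```python
# Sampling with vs. without replacement from a small finite population.
# Without replacement, the second draw depends on the first: the draws
# are no longer independent and identically distributed (IID).
import numpy as np

rng = np.random.default_rng(0)
bag = np.array([1] * 10 + [0] * 7)  # 1 = red, 0 = blue; 17 marbles (assumed split)

for replace in (True, False):
    draws = np.array([rng.choice(bag, size=2, replace=replace)
                      for _ in range(50_000)])
    first_red = draws[:, 0] == 1
    p = draws[first_red, 1].mean()  # P(second is red | first was red)
    print(f"replace={replace}:  P(2nd red | 1st red) ~ {p:.3f}")
    # with replacement ~ 10/17 = 0.588; without ~ 9/16 = 0.563
```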
And then we're going to, again, as I said, calculate some statistics from those. Ultimately, when you have a sample-- say a sample of size 14, drawn from a big population-- you calculate x bar. What do you get? A number. You get an actual number, because I observed those 14 things and measured length or whatever it was that I was measuring on them. And so a key point here is that the statistic is a function of the sample and the sample data. And so it's actually a value that you can compute. If I do that-- I grab one sample, I calculate that x bar-- I've got one number.

If I were to go back and draw another sample from that distribution, I get a different number. And so if I keep going back and drawing multiple, multiple samples, that's how you build up a distribution function associated with that statistic, that calculation. So that's where this notion of a statistic-- x bar or whatever-- as a random variable also comes into play. For any one sample it's a number. But when I go and take multiple samples, multiple sets of n, now I build up a distribution function associated with those.

I'm going to switch here to-- this is here on the web. I mentioned last time this very nice website. I don't even know what the acronym stands for-- this SticiGui. It's out of the Department of Statistics at Berkeley. It's got a lot of different-- I guess sort of an online course kind of thing. But what I really like in this is the Tools tab. So if I go to that Tools tab-- let me do that-- it's got a number of these little Java utilities online. And one that I want to look at here first is sampling distributions. So let's see. Let this load.

So here's an example of sampling from some a priori distribution. And this is actually drawing from a uniform distribution with discrete values 0, 1, 2, 3, and 4. So that's our underlying true population, and they all have equal probabilities. And what I'm going to do is draw a sample down here at the bottom. It's a sample of size 5. So I'm going to do random sampling with replacement-- I'm going to draw five independent and identically distributed samples out of that underlying parent distribution. And then I'm going to calculate some statistic. What I want to do is to actually calculate the sample mean. So there in blue is our underlying population. Let me take one sample of size 5, calculate the mean, and plot it.
There it is. It's a mean of 1.4. Let me take another sample. I take another sample-- do you think the value is going to be 1.4 again? It might be.

AUDIENCE: Might be.

PROFESSOR: But probably not, right? Let's see what happens. There it is-- 2.4. Let me do a few more. So the green bars are popping up, as I think I've done something like 1, 2, 3, 4, 5, 6-- something like 8 different samples, each of size 5, and plotted the mean. Now, to speed things up, I can keep taking more and more samples. What distribution do you think this is trending to?

AUDIENCE: Normal.

PROFESSOR: Normal. Down here at the bottom, I can take samples that are a little bit larger. Or-- excuse me-- the tool lets me set how many samples I'm taking, so I don't have to just take one sample of five and plot it. I can take 10 samples, each of size 5, and plot them. So it's just speeding up my button clicks so that we can get a little bit better shape on that.

So there's the point. That's a very fascinating point. I find it fascinating that I can sample from a non-normal distribution, take the average-- the sample average, x bar-- and over lots and lots of sampling, I get a normal distribution. What else? What other observations or what other points might you make about that green distribution? What do you think is true about it? There's a really important fact which motivates why we can't just calculate x bars all the time and believe the numbers that come out of an x bar calculation.

AUDIENCE: It's centered around 2.

PROFESSOR: It's centered around 2. Out of the numbers 0, 1, 2, 3, and 4, what do you think the average is-- the true average? 2. So one thing that's very nice about the sample mean is that it trends toward the true population mean. It's unbiased.
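[A short sketch of the applet demo in code-- not the SticiGui tool itself: draw many samples of size 5, with replacement, from the discrete uniform population {0, 1, 2, 3, 4} and look at the distribution of the sample mean.]

```python
# Build up the sampling distribution of x-bar by repeated sampling.
import numpy as np

rng = np.random.default_rng(0)
population = np.array([0, 1, 2, 3, 4])  # true population mean is 2.0

n = 5                # size of each sample
n_samples = 100_000  # number of samples ("button clicks")

x_bar = rng.choice(population, size=(n_samples, n)).mean(axis=1)

print("one x-bar:", x_bar[0])               # a single sample gives just a number
print("mean of all x-bars:", x_bar.mean())  # ~2.0: the sample mean is unbiased
print(np.histogram(x_bar, bins=21)[0])      # counts trace out a bell shape
```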
That is, if I were to take enough samples, the average-- the mean of all of these sample averages-- is equal to the true underlying population mean. It's unbiased. It doesn't have a bias or delta, a fixed offset error, in it. It is an unbiased estimator. So I can take lots and build that up.

It turns out there's another thing that's true, which I don't want to go into and don't want to try to prove. But it turns out that the sample mean is not only unbiased, it's also the minimum error estimator. So on average, it's the best estimator of the mean that you can use as a statistic, meaning its distribution in some sense is the narrowest. The x bar distribution is the narrowest estimator you can have for trying to estimate the population mean based on your samples.

Now, another important thing that comes up here is that at least a few of the times, I got a sample mean that was 0.6. Is it wrong? If you do just one sample, it's quite possible that, out of this set, I drew a sample of size 5 and got a value of 0.6. That's all the data you have. What's your best guess for the true mean of the underlying population? That 0.6, whatever that value was. But now there is some spread on it. And so if you're wise, you would also start to want to hedge your bets a little bit here, right? You want to be able to say: my best guess is 0.6, but I know I'm only drawing a sample of size 5, so I know there is, in fact, this kind of Gaussian spread, and I think the true mean probably lies within some range of that. And so you would like to have this confidence interval idea. We'll get back to that a little bit later. In fact, there's another very nice little tool in here for illustrating confidence intervals that we'll use at that point.

I want to do one more thing, and then we'll go back to the lecture slides.
One of the neat things you can do with this tool-- and it's lots of fun for you guys to connect up with it and play-- is you can change the sample size. Let's say you wanted a better or a tighter estimate for the x bar. You're not happy with the idea that sometimes, with fairly substantial probability, you might be off by plus or minus 1-- that you have a substantial probability of guessing the sample mean to be more than one value away from the true population mean. What might you do to try to improve your likelihood of being closer to the true mean when you're doing sampling?

AUDIENCE: More samples.

PROFESSOR: More samples? I guess you could do more samples. But in some sense, really, taking one sample of size 5 and another sample of size 5 is like one sample of size 10. Larger samples.

AUDIENCE: Oh, yeah, larger samples.

PROFESSOR: Larger samples. So if I do that here-- instead of samples of size 5, let's do a modest increase first and take samples of size 10. See what happens now. OK, that's good. I'm taking a lot of samples here-- I've taken several hundred samples, each of size 10. And sure enough, that distribution is a little bit tighter. Let's see if I take a really big sample, a sample of size 100. Yeah, looking a lot tighter.

So one question is-- we know that as I take larger samples, the distribution gets tighter-- one of the things we want to do is understand how much tighter it gets as a function of the sample size. So it turns out-- let me go back now-- it turns out that if I'm sampling from a parent distribution, the variance in the estimate of that x bar-- the variance of x bar itself-- shrinks with sample size n. The variance in fact scales as 1 over n. It scales inversely proportional to the size of the sample.
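[A sketch checking that 1-over-n scaling empirically: the standard deviation of x-bar should track sigma over root n as the sample size grows.]

```python
# Spread of the sample mean vs. sample size n.
import numpy as np

rng = np.random.default_rng(1)
population = np.array([0, 1, 2, 3, 4])
sigma = population.std()  # true population sigma = sqrt(2)

for n in (5, 10, 100):
    x_bar = rng.choice(population, size=(50_000, n)).mean(axis=1)
    print(f"n={n:3d}   sd of x-bar = {x_bar.std():.4f}   "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```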
That's true always as you take larger samples. For the special case where my underlying population in fact has a true probability distribution function that is normal, it turns out that x bar is not just trending towards the normal but is itself, even for very small numbers of samples, also a normal distribution. So in that little demo I showed you, drawing from a uniform distribution, for large enough n's-- large enough samples-- the mean does trend towards a Gaussian. But it's an even stronger statement, a stronger relationship, if the underlying population is itself normal.

So let's say we start with an underlying random variable, an underlying process x, that has some mean and some variance. Now if I take samples of size 1 and plot out the distribution, what do you think it looks like?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yeah. I'm just repeating-- I'm replicating my underlying distribution, right? So for the special case of a sample of size 1, if I do that long enough, I build up the same distribution. But now, if I take larger samples, even a little bit larger with n equals 2, again, we get that effect that we saw with the SticiGui of the narrowing of the distribution, the PDF associated with the x bar. And in particular, the PDF or Probability Distribution Function associated with x bar is exactly normal, with the same mean-- it's unbiased-- and with reduced variance. The variance goes as 1 over n. So we start with the population distribution here, and we end up with a sample mean distribution that is a different PDF. Everybody clear on this?

So key points-- the statistic itself is a random variable and has its own probability distribution function. Now what we want to do is reason about the underlying population based on those observed statistics.
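[Stated compactly, the result described here is:]

$$x_i \sim N(\mu, \sigma^2) \quad\Longrightarrow\quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \;\sim\; N\!\left(\mu, \frac{\sigma^2}{n}\right)$$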
Somebody's cell phone is going crazy. Not mine. Everybody hear that click? Can you even hear that click in Singapore? Yeah? All right. Hopefully that will go away in a second.

So once we know the sampling distribution, say, for x bar, now we can argue about the probabilities associated with observing particular values of x bar. We can make observations or arguments about how much probability is out in the tails of these things. And then we can invert backwards and reason about the actual population mean. And again, we're after not only the point estimates, our best guess, but also interval estimates-- confidence intervals where we think the actual value is going to lie. And these are critically dependent on probability calculations with the sampling distribution.

So here's an example. Suppose that we start out with some assumptions-- some a priori beliefs about the distribution of some parameter. In particular, we're interested in the thickness of some part. We don't know the mean of it. But based on maybe lots and lots of historical data, we do believe we know a couple of things. We know its variance-- the standard deviation is 10. So let's just assume that we know the standard deviation. And the second thing we know is that the thickness of these parts is normally distributed. Those are our starting assumptions, our a priori assumptions.

Now what we do is we go and we draw 50 different random parts, with the IID assumption, and we calculate the average thickness from those. And I'll tell you, of those n equals 50 samples, the actual sample mean that comes out from that one sample of size 50 is 113.5. There you go. You're blessed with that piece of data.

Now the first question here, based on what we've seen, is: what is the distribution of the mean of the thickness? What is the PDF associated with t bar? Everybody should know this. What's t bar distributed as?

AUDIENCE: It's normal.
PROFESSOR: It's normal, right.

AUDIENCE: Centered around the mean.

PROFESSOR: Centered around the mean, so it would have the same unknown mu. And what would its variance be?

AUDIENCE: 2.

PROFESSOR: 2, very good. So it has the same mean, and the variance scales as 1 over n. We had 50 samples, so the variance goes down by that factor.

One quick notation point here: when we use this notation of normal with mu and sigma squared, I try to be very consistent and put the mean and the variance in there. You will sometimes find different texts and different writers putting the mean and the standard deviation. So you always want to confirm that, because one is the square of the other. So be a little bit careful on that. I try to be consistent and have the second parameter be the variance.
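[In that N(mean, variance) notation, the answer the class just gave works out as:]

$$\bar{T} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) = N\!\left(\mu, \frac{10^2}{50}\right) = N(\mu, 2)$$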
So that was a first easy question. We know that based on sampling theory: we know the distribution function for the sample mean. Now the key question is, how do we use that to reason about the actual population mean? Well, the best guess is easy already. But the more subtle question that we've been talking about is, where do we think the true mean of the population lies, based on this one observation? What range do we think the true mean has, with some degree of confidence? Do you think it's plus or minus 2 around that mean? Do you think it's plus or minus 20 around that mean? If I were to ask you to bet your life on what the true mean is, you would want to be able to say, with some degree of confidence, it's actually within this amount of distance.

I have to say one more thing, because if I said it's within some amount of distance of that-- well, with non-zero probability, that thickness could take on values all the way from plus infinity, if it's truly normally distributed, all the way down to-- not quite negative infinity, because a thickness is truncated at 0. So it's still an approximate model. So if I just asked you to bet your life-- tell me where you think the true mean is-- if you wanted a 100% chance of saving your life, you'd say it could be anything. So when we're talking about confidence intervals, I have to give you another piece of bounding information. I want the range: how far away from that one observation of the mean do I need to be, with some probability? With 95% confidence-- 95% of the time-- where do we think the true mean would lie? What that means is, if I were to go and take another 50 samples and calculate the mean again, we have that distribution. And what we're looking for is that 95% central region of the PDF associated with x bar, which is where, 95% of the time, the mean is actually going to lie.

So that gets us, pictorially and formulaically, to this notion of the confidence interval and how we actually go about calculating it. What we've got in this situation is that the variance is known, so I'm not trying to estimate the variance. I'm just trying to reason about the mean. And I want to estimate it to some confidence interval. You always have this chance of being wrong when you talk confidence intervals. You've got some alpha probability that the true mean is even further away than you think in your interval. But you're trying to quantify that and bound that. So we typically talk about, say, an alpha of 5% or maybe 1% probability of being outside of your interval. So there's this alpha probability of error associated with any confidence interval.

So that's that second piece of data I had to give you. The first is we want to know this range-- what the size is. So the way this works is we're wanting to know, based on our calculated x bar from our sample of size n, where the true mean actually lies.
So we know what we're doing is saying that the true mean, mu, is going to be bounded by going some portion of the distribution to the left of x bar and some portion of the distribution to the right, until we get the 1 minus alpha. So this area in here is the 1 minus alpha-- the 95%, say, central component of that distribution. And then we're evenly spreading the error part, the alpha, into two alpha-over-2's, one on each side-- saying, for a 95% confidence interval, I've got a 2.5% chance that the true mean is a little bit further off to the left and a 2.5% chance that it's a little further off to the right. I guess in this picture here I'm doing an 80% confidence interval, with a total alpha error risk-- error probability-- of 0.2.

And so the question then becomes, how far do I have to go out? And we know that from the basic probability manipulations for a normal distribution you guys have been dealing with already. The whole question is, how many unit standard deviations of a unit normal do I have to go? How many z's out do I have to go until I have exactly alpha over 2 out here in the tail? So for example, here I've got to go out 1.28 standard deviations to the left in order to have just that alpha over 2 in the left tail, and similarly to the right.

Now, notice that we're also unnormalizing. The z is how many unit standard deviations of the unit Gaussian you go out to get that probability in the tails. But what we wanted to do is reason about the location of the true population mean. And so we have to do a little bit of unnormalization and say: z alpha gave me the number of unit normals. Now, in terms of my actual population variance or population standard deviation, what does that correspond to? And this is where the sample size also comes into play. We were reasoning about the distribution associated with x bar. And the x bar is scaled-- it shrunk by that square root of n in terms of the standard deviation. So when I expand it back out to the number of standard deviations in my population, I have to divide back out by that root n.

So what we've got is the rationale for being able to use the PDF associated with x bar, calculate probabilities off of the tails, and get finally to this nice formula, which you'll see in Montgomery-- you'll see in all of the textbooks. It's a wonderful note to have on your one-page set of notes or cheat sheet for taking quizzes in this class and elsewhere. This is the interval-- the confidence interval formula-- for the location of the true mean when the variance is known.
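[Written out, this is the standard known-variance interval he is pointing to:]

$$P\!\left(-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha
\quad\Longrightarrow\quad
\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$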
So any questions on that? We actually want to return to our example and see what numbers pop out, because I want to know-- we knew x bar was 113.5, but I actually want to know, what is the 95% confidence interval for that? And so we can simply go back to our second question. Use the fact that you guys told me what the distribution of t bar was: normal around our unknown mu, with the variance scaled-- 100 over 50. So now, for a 95% confidence interval, where is the true mean?

So I've pictured it here. We've got this red curve which, again, goes with the PDF associated with t bar. And I want the plus/minus z alpha over 2, the alpha being 0.05-- that's my probability of being wrong-- to get to a 0.95 confidence interval. So how many z's do I have to go out to have 95% in the center? We actually showed some examples. If you remember, last time we looked at plus/minus 1 sigma, plus/minus 2 sigma, plus/minus 3 sigma for a Gaussian. And it's actually a very close approximation that plus/minus 2 sigma is 95% of a distribution. That's a good rule of thumb to remember. It's actually 1.96, not quite 2. But about plus/minus 2 sigma has 95%. So you'll often see 95% confidence intervals graphically shown that way.

So we need about 1.96 standard deviations. Now that translates to a confidence interval that tells us, as a function of n, where we think the true population mean is, based on the sample size that we had. The compression that we got because of sampling gets us that tighter standard deviation. And I've got a symmetric plus/minus 2.77 for my 95% confidence interval.
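[A minimal sketch of that worked example, using the numbers from the lecture (n = 50, t-bar = 113.5, sigma = 10) and SciPy's normal quantile function.]

```python
# 95% confidence interval for the true mean, variance known.
import math
from scipy.stats import norm

n, t_bar, sigma = 50, 113.5, 10.0
alpha = 0.05

z = norm.ppf(1 - alpha / 2)            # ~1.96 unit normals for 95%
half_width = z * sigma / math.sqrt(n)  # unnormalize by sigma/sqrt(n)

print(f"z_(alpha/2) = {z:.3f}")                 # 1.960
print(f"95% CI: {t_bar} +/- {half_width:.2f}")  # 113.5 +/- 2.77
```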
Now, notice that all you had to do here was be told what the actual calculated t bar was, what the underlying variance was, and the size of your sample. I didn't even have to actually give you a list of all those values, right? But I did have to tell you the sample size. If the sample size changed, that PDF would narrow or widen, and your confidence interval would narrow or widen, right? So any questions on where we are now? It's all seeming pretty clear?

So this is the relatively easy part, because it's dealing with normal distributions. This notion of sampling is a little bit subtle, because there is a different PDF, and you've got to know how that scales with the sample size. Now I'm going to throw a few different curves at you-- the different curves being probability distribution functions different from normal distributions. I'm going to briefly cover three of them, and all three are ones that we will actually be using in multiple scenarios in the statistical analysis techniques and tools that we're using.

The first one is a relatively easy step, and that's to look at the Student t-distribution. I'll come back to this. But basically, if we go back to the example I gave you: I said we assumed we knew, based on lots of past history, what the underlying variance was on the thickness of our parts.
What if you don't know that? What if you have to estimate that, too? Well, if you had to estimate it, you'd probably use the sample standard deviation-- that formula-- and come up with an estimate. It turns out that when you do that, the additional uncertainty about the underlying variance means that the right distribution for arguing about the mean, when you didn't know the underlying variance, is no longer a normal distribution. It's actually a t-distribution, and we'll talk about that. It's very close to, or looks qualitatively close to, a normal distribution, but we do want to cover it.

And then the next two have to do not with the mean, but with arguing about the variance. If I calculate a sample variance from a distribution-- I calculate s squared using the formula for a sample of size 50-- I get a number. I do that lots and lots of times, and I trace out a PDF. The PDF associated with the values of sample variance calculated from that sample is a chi-squared distribution. So we'll talk about what that shape looks like.

And then, once we've got a variance that we've calculated from a sample-- a very strange distribution is the F distribution, which is the distribution of the ratio of two variances drawn from normally distributed sample data. Good heavens. Why would you ever be calculating ratios of variances? What a weird distribution. Why would you ever calculate ratios of variances? Where might that come up? There's at least a couple of cases-- one that's kind of subtle, but one that's pretty obvious.

AUDIENCE: I think you're thinking about the variation of the actual population, which varies from your sample.

PROFESSOR: Certainly-- the variance associated with a sample of smaller size than your true population. So that's exactly true. That's one important area.
787 00:43:18,850 --> 00:43:23,020 The fact that sample size enters into spread and things 788 00:43:23,020 --> 00:43:24,280 is very important. 789 00:43:24,280 --> 00:43:26,920 That actually will come up more in the chi-squared. 790 00:43:26,920 --> 00:43:30,280 But I think a second very obvious place is 791 00:43:30,280 --> 00:43:32,590 I make a change to a process. 792 00:43:32,590 --> 00:43:34,750 And I'm maybe not trying to mean center it. 793 00:43:34,750 --> 00:43:37,390 I'm trying to get a reduced variance process. 794 00:43:37,390 --> 00:43:40,570 I want to know, is this process better or not? 795 00:43:40,570 --> 00:43:42,670 Is its variance smaller? 796 00:43:42,670 --> 00:43:45,430 So the ratio of those two variances 797 00:43:45,430 --> 00:43:48,550 is something I might be very, very interested in. 798 00:43:48,550 --> 00:43:50,890 I want to look at those and see, well, 799 00:43:50,890 --> 00:43:52,510 I did get a smaller variance. 800 00:43:52,510 --> 00:43:54,730 It's half the size. 801 00:43:54,730 --> 00:43:57,580 Do I have confidence that the true population variance 802 00:43:57,580 --> 00:43:59,597 is really smaller or not? 803 00:43:59,597 --> 00:44:01,180 And so that's where the F distribution 804 00:44:01,180 --> 00:44:02,450 is going to come into play. 805 00:44:02,450 --> 00:44:05,623 So we want to be able to manipulate and deal 806 00:44:05,623 --> 00:44:06,540 with that one as well. 807 00:44:11,880 --> 00:44:18,670 Let me do the student t-distribution first. 808 00:44:18,670 --> 00:44:19,990 Actually, I can't do that. 809 00:44:19,990 --> 00:44:22,540 Let me do the chi-squared distribution first. 810 00:44:22,540 --> 00:44:24,130 For the formal definition of the t, 811 00:44:24,130 --> 00:44:26,860 I need the chi-squared, even though conceptually, 812 00:44:26,860 --> 00:44:28,660 it doesn't really matter. 813 00:44:28,660 --> 00:44:32,320 So let's talk about the chi-squared distribution first. 814 00:44:32,320 --> 00:44:39,580 If I start out with truly normally distributed data, 815 00:44:39,580 --> 00:44:44,150 a unit normal, mean 0, variance 1. 816 00:44:44,150 --> 00:44:50,670 And now, I take a sum of n of these unit 817 00:44:50,670 --> 00:44:54,290 normals, each one of which is squared. 818 00:44:54,290 --> 00:44:56,410 So each x sub i is normally distributed. 819 00:44:56,410 --> 00:44:59,170 I do this weird operation where I take that sample. 820 00:44:59,170 --> 00:45:05,170 I square it. I take another draw or another random variable, 821 00:45:05,170 --> 00:45:08,830 also from the same distribution, square that, and then take 822 00:45:08,830 --> 00:45:14,020 the sum of n of those squared random variables to create 823 00:45:14,020 --> 00:45:17,420 a new random variable y. 824 00:45:17,420 --> 00:45:22,430 y is the sum of squared unit normal random variables. 825 00:45:22,430 --> 00:45:26,560 Then I get this chi-squared distribution. 826 00:45:26,560 --> 00:45:29,080 The distribution of this new random variable y 827 00:45:29,080 --> 00:45:33,400 is chi-squared with n degrees of freedom. 828 00:45:33,400 --> 00:45:36,625 Good heavens, what a weird thing to be doing. 829 00:45:36,625 --> 00:45:38,740 Why would you be taking random variables, 830 00:45:38,740 --> 00:45:42,270 squaring them, and taking sums of them? 831 00:45:42,270 --> 00:45:45,710 Well, think back to the formula. 832 00:45:45,710 --> 00:45:47,420 Let's see if I can do this. 833 00:45:47,420 --> 00:45:48,590 What page is that? 834 00:45:48,590 --> 00:45:50,930 Anybody got it there?
835 00:45:50,930 --> 00:45:52,600 8? 836 00:45:52,600 --> 00:45:54,700 There we go, page 5. 837 00:45:54,700 --> 00:46:01,670 Look back at this formula for sample standard deviation. 838 00:46:04,870 --> 00:46:08,660 First off, I'm subtracting the mean off of some sample. 839 00:46:08,660 --> 00:46:13,010 So now I've got a 0 mean variable. 840 00:46:13,010 --> 00:46:15,430 Now I'm taking squares of them. 841 00:46:15,430 --> 00:46:17,950 Well, that sounds kind of like this squaring operation. 842 00:46:17,950 --> 00:46:21,830 And then I'm taking a big sum of them. 843 00:46:21,830 --> 00:46:24,050 That sounds a lot like this operation I was just 844 00:46:24,050 --> 00:46:26,250 describing for chi-squared. 845 00:46:26,250 --> 00:46:31,850 So this creation of a new random variable, this s squared here, 846 00:46:31,850 --> 00:46:37,070 is very closely related to-- 847 00:46:37,070 --> 00:46:38,970 that didn't work. 848 00:46:38,970 --> 00:46:41,250 There we go-- very closely related 849 00:46:41,250 --> 00:46:45,310 to the definition of chi-squared. 850 00:46:45,310 --> 00:46:47,370 Now the chi-squared, the PDF associated 851 00:46:47,370 --> 00:46:51,450 with the chi-squared, looks kind of funky. 852 00:46:51,450 --> 00:46:53,730 It's clearly not normally distributed, right? 853 00:46:53,730 --> 00:46:55,380 It's kind of skewed. 854 00:46:55,380 --> 00:47:01,240 Notice it's got a long tail out here 855 00:47:01,240 --> 00:47:04,450 to the right for large values. 856 00:47:04,450 --> 00:47:08,870 Because it's a sum of squared values, it can't be negative. 857 00:47:08,870 --> 00:47:09,730 So it's truncated. 858 00:47:09,730 --> 00:47:14,190 There's nothing-- it can't be smaller than 0. 859 00:47:14,190 --> 00:47:18,690 Another really weird thing is that the maximal probability 860 00:47:18,690 --> 00:47:26,680 value is not equal to the mean of the distribution. 861 00:47:26,680 --> 00:47:28,540 That's kind of interesting. 862 00:47:28,540 --> 00:47:30,430 And there's another really interesting fact 863 00:47:30,430 --> 00:47:33,400 that is truly useful and occasionally 864 00:47:33,400 --> 00:47:36,310 comes up on problem sets and that sort of thing. 865 00:47:36,310 --> 00:47:39,940 The mean, the expected value of the chi-squared distribution 866 00:47:39,940 --> 00:47:44,580 with degrees of freedom n, is n. 867 00:47:44,580 --> 00:47:48,180 So as I have larger numbers of variables, 868 00:47:48,180 --> 00:47:53,850 the sum of that larger number keeps getting bigger. 869 00:47:53,850 --> 00:47:58,370 So that makes sense when you think about it. 870 00:47:58,370 --> 00:48:01,590 So the point here is when we actually 871 00:48:01,590 --> 00:48:08,490 do that calculation of a sample variance 872 00:48:08,490 --> 00:48:12,750 or a sample standard deviation, the PDF 873 00:48:12,750 --> 00:48:15,060 associated with that is actually related 874 00:48:15,060 --> 00:48:17,250 to this chi-squared distribution. 875 00:48:17,250 --> 00:48:19,480 Now there were some other constants in there. 876 00:48:19,480 --> 00:48:21,280 They're scaling factors. 877 00:48:21,280 --> 00:48:23,940 So for example, we did a mean shift by x bar, 878 00:48:23,940 --> 00:48:26,400 but we didn't normalize to the true variance, 879 00:48:26,400 --> 00:48:28,430 because we didn't know it. 880 00:48:28,430 --> 00:48:31,500 So there is this relationship or a scaling factor 881 00:48:31,500 --> 00:48:34,080 before we get to the chi-squared distribution.
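As a small check on that definition, here is a simulation sketch (mine, not from the lecture; the choice of n = 5 and the seed are arbitrary) that builds y as a sum of n squared unit normals and confirms the expected value is n.

```python
# Sketch: y = sum of n squared unit normals is chi-squared with n
# degrees of freedom, and E[chi2_n] = n. Values here are assumptions.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, reps = 5, 100_000
x = rng.standard_normal((reps, n))   # unit normal draws, reps samples of n
y = (x ** 2).sum(axis=1)             # each row: sum of n squared normals
print(y.mean())                      # ~5, matching E[chi2_n] = n
print(chi2.mean(n))                  # exactly 5 from the distribution itself
```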
882 00:48:34,080 --> 00:48:40,140 We also had this other n minus 1 factor back on the calculation 883 00:48:40,140 --> 00:48:41,220 of the sample-- 884 00:48:41,220 --> 00:48:44,400 sample standard deviation or sample variance. 885 00:48:44,400 --> 00:48:48,870 So we have to do a little bit of moving variables around 886 00:48:48,870 --> 00:48:51,430 to get to a chi-squared distribution. 887 00:48:51,430 --> 00:48:56,600 Another important point is that the-- 888 00:48:56,600 --> 00:48:59,090 let me clean up some of this-- 889 00:48:59,090 --> 00:49:03,980 is that the sample variance is actually 890 00:49:03,980 --> 00:49:09,722 related to a chi-squared with n minus 1 degrees of freedom. 891 00:49:09,722 --> 00:49:11,930 And I really don't want to go into a whole discussion 892 00:49:11,930 --> 00:49:16,260 of degrees of freedom because it's a little bit subtle. 893 00:49:16,260 --> 00:49:17,960 But this reminds me of another point 894 00:49:17,960 --> 00:49:20,510 that I didn't make back on slide 8. 895 00:49:24,390 --> 00:49:25,840 Get me to 8, please. 896 00:49:25,840 --> 00:49:26,520 There we go. 897 00:49:26,520 --> 00:49:29,235 Oops, not 48, 8. 898 00:49:29,235 --> 00:49:30,660 Oh, it wasn't 8. 899 00:49:30,660 --> 00:49:31,320 Where was it? 900 00:49:31,320 --> 00:49:32,670 4, 5. 901 00:49:32,670 --> 00:49:34,670 There we go. 902 00:49:34,670 --> 00:49:40,500 Back here on this, notice that when we calculate sample mean, 903 00:49:40,500 --> 00:49:42,240 we used 1 over n. 904 00:49:42,240 --> 00:49:44,610 But when we calculate sample variance, 905 00:49:44,610 --> 00:49:47,770 we always use 1 over n minus 1. 906 00:49:47,770 --> 00:49:48,520 Why do we do that? 907 00:49:53,080 --> 00:50:00,170 It turns out that if you need or want an unbiased estimator 908 00:50:00,170 --> 00:50:04,150 for a sample variance, you need to normalize by 1 over n minus 1-- 909 00:50:04,150 --> 00:50:08,310 that is, divide by n minus 1, not n. 910 00:50:08,310 --> 00:50:10,140 Now, as n gets very large, the difference 911 00:50:10,140 --> 00:50:11,400 doesn't really matter. 912 00:50:11,400 --> 00:50:15,890 But you can go through some statistical proofs 913 00:50:15,890 --> 00:50:21,800 to show that the best unbiased estimator needs that n minus 1. 914 00:50:21,800 --> 00:50:26,210 Now the other thing that's going on in this formula 915 00:50:26,210 --> 00:50:29,120 is we were subtracting off the mean. 916 00:50:29,120 --> 00:50:33,210 And in this case, we were also estimating the mean. 917 00:50:33,210 --> 00:50:35,420 So we're using up essentially one degree 918 00:50:35,420 --> 00:50:41,660 of freedom out of our data to calculate the sample mean, 919 00:50:41,660 --> 00:50:45,560 leaving us only n minus 1 degrees of freedom 920 00:50:45,560 --> 00:50:51,990 really in the remaining data to allow variance around the mean. 921 00:50:51,990 --> 00:50:56,030 So I'm not going to go into much more detail, 922 00:50:56,030 --> 00:51:01,400 other than to simply say the fact is, when we're calculating 923 00:51:01,400 --> 00:51:04,370 sample standard deviation, we're actually calculating 924 00:51:04,370 --> 00:51:10,520 two random variables or two statistics, x bar and variance. 925 00:51:10,520 --> 00:51:14,900 And so you would need-- you essentially 926 00:51:14,900 --> 00:51:19,190 don't have complete independence between those two things. 927 00:51:19,190 --> 00:51:23,410 You use up one degree of freedom for one of those. 928 00:51:23,410 --> 00:51:26,110 Let's use this.
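To see the n minus 1 point numerically, here is a minimal simulation sketch (my own; the true variance of 100 and samples of size 5 are illustrative assumptions, not lecture data): averaged over many repeated samples, the divide-by-n estimator systematically undershoots, while divide-by-(n minus 1) recovers the true variance.

```python
# Sketch: compare dividing by n versus n - 1 over many repeated samples.
# Assumed values: true sigma^2 = 100, sample size n = 5.
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 100.0, 5, 200_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
dev2 = (x - x.mean(axis=1, keepdims=True)) ** 2   # squared deviations from x-bar
print(dev2.sum(axis=1).mean() / n)        # ~80: biased low by (n - 1)/n
print(dev2.sum(axis=1).mean() / (n - 1))  # ~100: the unbiased version
```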
929 00:51:26,110 --> 00:51:29,010 Before we use this, just to give you a qualitative feel, 930 00:51:29,010 --> 00:51:30,190 here's-- 931 00:51:30,190 --> 00:51:35,410 again, plotted a few different chi-squared distributions. 932 00:51:35,410 --> 00:51:38,260 When n is very small, it becomes very skewed. 933 00:51:38,260 --> 00:51:40,960 It's quite interesting. 934 00:51:40,960 --> 00:51:47,720 Again, the mean you can see for n equals 3 here is 3. 935 00:51:47,720 --> 00:51:50,140 It's this blue curve. 936 00:51:50,140 --> 00:51:53,110 And as n increases, the distribution 937 00:51:53,110 --> 00:51:54,010 shifts to the right. 938 00:51:54,010 --> 00:51:55,190 The mean shifts to the right. 939 00:51:55,190 --> 00:51:57,700 But it also spreads out, which kind of makes sense. 940 00:51:57,700 --> 00:51:59,860 If I've got more and more random variables, 941 00:51:59,860 --> 00:52:04,090 and I'm looking at the variance and estimating 942 00:52:04,090 --> 00:52:07,450 that sum of random variables, its spread 943 00:52:07,450 --> 00:52:11,160 is going to get large. 944 00:52:11,160 --> 00:52:17,780 And another observation is that as n gets larger and larger, 945 00:52:17,780 --> 00:52:21,790 this also trends towards a normal distribution, 946 00:52:21,790 --> 00:52:26,010 which for very large n can be a useful fact. 947 00:52:26,010 --> 00:52:27,780 I want to actually go in and use-- 948 00:52:30,740 --> 00:52:37,450 not that one-- use this chi-squared distribution 949 00:52:37,450 --> 00:52:43,750 to ask another question on that thickness example. 950 00:52:43,750 --> 00:52:45,580 I actually want to know, what's 951 00:52:45,580 --> 00:52:51,190 the best guess for the variance of my thickness of parts? 952 00:52:51,190 --> 00:52:54,190 And better than that, what's a confidence interval for where 953 00:52:54,190 --> 00:52:57,250 I think the true variance lies, based on just this one 954 00:52:57,250 --> 00:53:00,520 number for sample variance, based on my sample of size n 955 00:53:00,520 --> 00:53:02,280 equals 50. 956 00:53:02,280 --> 00:53:09,390 And this is where we do the same kind of a formula for the range 957 00:53:09,390 --> 00:53:12,750 where we think the true variance lies, 958 00:53:12,750 --> 00:53:19,350 based on our observation from one sample of the sample standard 959 00:53:19,350 --> 00:53:20,490 deviation. 960 00:53:20,490 --> 00:53:22,830 And this is using that relationship 961 00:53:22,830 --> 00:53:27,300 between the chi-squared distribution and s 962 00:53:27,300 --> 00:53:30,120 squared and the true underlying variance. 963 00:53:30,120 --> 00:53:32,160 So if you go back to one of those formulas, 964 00:53:32,160 --> 00:53:34,890 what I did was took-- 965 00:53:34,890 --> 00:53:36,600 sigma squared was lying out here. 966 00:53:36,600 --> 00:53:40,390 I moved it up here and divided the chi-squared down here. 967 00:53:40,390 --> 00:53:43,050 So this is essentially right in here 968 00:53:43,050 --> 00:53:47,070 that equivalence that we said before about how s squared was 969 00:53:47,070 --> 00:53:50,790 distributed as a chi-squared with n 970 00:53:50,790 --> 00:53:52,230 minus 1 degrees of freedom. 971 00:53:55,430 --> 00:53:59,600 So what we've got is a bound-- 972 00:53:59,600 --> 00:54:02,790 let me get rid of all this gook-- 973 00:54:02,790 --> 00:54:06,150 a bound, upper and lower bound, on where we think, 974 00:54:06,150 --> 00:54:09,590 again, the true variance is, based on our calculated s 975 00:54:09,590 --> 00:54:10,720 squared.
976 00:54:10,720 --> 00:54:14,850 And what we're doing again is putting some alpha over 2 probability 977 00:54:14,850 --> 00:54:17,360 of being wrong in each of the tails. 978 00:54:17,360 --> 00:54:18,930 I want the central part. 979 00:54:18,930 --> 00:54:22,350 I want the 95% central part of where we 980 00:54:22,350 --> 00:54:27,060 think the true variance lies. 981 00:54:27,060 --> 00:54:33,750 Now an interesting point here is chi-squared is asymmetric. 982 00:54:33,750 --> 00:54:37,260 So if you ever see somebody going off and writing, 983 00:54:37,260 --> 00:54:40,200 I think the true variance is equal to s 984 00:54:40,200 --> 00:54:45,240 squared plus or minus 14.2, that should 985 00:54:45,240 --> 00:54:47,415 be a great, big red flag. 986 00:54:51,360 --> 00:54:54,330 It's somebody who doesn't know what they're talking about. 987 00:54:54,330 --> 00:54:56,180 Well, maybe they have a huge sample size, 988 00:54:56,180 --> 00:54:58,650 and they're appealing to a normal distribution. 989 00:54:58,650 --> 00:55:04,520 But what they're probably doing here is something very wrong. 990 00:55:04,520 --> 00:55:07,790 Because the chi-squared distribution is not symmetric, 991 00:55:07,790 --> 00:55:11,340 I have my best point estimate of s squared. 992 00:55:11,340 --> 00:55:15,200 And then I'm going to go a different distance 993 00:55:15,200 --> 00:55:17,360 to the left and a different distance to the right. 994 00:55:17,360 --> 00:55:20,930 So here's, still for our same example, 995 00:55:20,930 --> 00:55:25,370 the chi-squared distribution for n, a sample size of 50. 996 00:55:25,370 --> 00:55:28,880 So this is a chi-squared with 49 degrees of freedom. 997 00:55:28,880 --> 00:55:32,270 And again, I want 2.5% in the left tail 998 00:55:32,270 --> 00:55:34,640 and 2.5% in the right tail. 999 00:55:34,640 --> 00:55:38,210 And so if I apply that formula, and I have to look up 1000 00:55:38,210 --> 00:55:44,110 chi-squared with 0.025 and 49 degrees of freedom, 1001 00:55:44,110 --> 00:55:49,590 and then the chi-squared where I need to know-- 1002 00:55:49,590 --> 00:55:54,210 I want 97.5%-- everything except just alpha over 2 1003 00:55:54,210 --> 00:55:56,940 out to the right. 1004 00:55:56,940 --> 00:55:59,150 The s squareds are the same in both cases. 1005 00:55:59,150 --> 00:56:01,100 My n minus 1 is the same. 1006 00:56:01,100 --> 00:56:06,320 But because these values, the chi-squareds, are not equal-- 1007 00:56:06,320 --> 00:56:07,520 whoops. 1008 00:56:07,520 --> 00:56:10,340 I guess I got these flipped. 1009 00:56:10,340 --> 00:56:12,080 Actually, when you look at the tables 1010 00:56:12,080 --> 00:56:17,540 at the back of Montgomery or May and Spanos, 1011 00:56:17,540 --> 00:56:18,890 be careful on the definition. 1012 00:56:18,890 --> 00:56:20,420 They often show you a little plot 1013 00:56:20,420 --> 00:56:23,060 that looks a lot like this. 1014 00:56:23,060 --> 00:56:27,200 And they shade in what their percentage points are. 1015 00:56:27,200 --> 00:56:31,410 And sometimes they go from the right, sometimes from the left. 1016 00:56:31,410 --> 00:56:33,700 But the point was when you actually 1017 00:56:33,700 --> 00:56:36,840 look that up, you get different values 1018 00:56:36,840 --> 00:56:38,310 for the left and the right. 1019 00:56:38,310 --> 00:56:42,620 And when you divide those out, you get a range-- 1020 00:56:42,620 --> 00:56:44,440 get that out of the way. 1021 00:56:44,440 --> 00:56:49,611 You get a range finally for where your true variance lies.
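Since the quantile lookup is the only fiddly part, here is a minimal sketch of that interval calculation (mine, not the slide's), using this example's n of 50 and the sample variance of 102.3 that comes up just below; note the asymmetry of the resulting bounds about the point estimate.

```python
# Sketch of the chi-squared confidence interval for a variance.
# Values from this example: n = 50, s^2 = 102.3, 95% confidence.
from scipy.stats import chi2

n, s2, alpha = 50, 102.3, 0.05
lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, n - 1)  # divide by the larger quantile
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, n - 1)      # divide by the smaller quantile
print(lower, upper)   # roughly 71 to 159, close to the slide's 71.4 and 158.1
```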
1022 00:56:49,611 --> 00:56:53,270 AUDIENCE: So is that through a [INAUDIBLE] or estimates 1023 00:56:53,270 --> 00:56:58,850 of variance or from chi-square distribution, or is that-- 1024 00:56:58,850 --> 00:57:03,750 PROFESSOR: The point is that all estimates-- 1025 00:57:03,750 --> 00:57:08,370 well, it's strictly true if I'm drawing from a population that 1026 00:57:08,370 --> 00:57:09,960 is normally distributed. 1027 00:57:09,960 --> 00:57:11,910 But to an approximation, no matter 1028 00:57:11,910 --> 00:57:18,180 what, any time I'm calculating a variance, 1029 00:57:18,180 --> 00:57:21,210 the variance tends to be chi-squared distributed. 1030 00:57:21,210 --> 00:57:23,190 So it's always going to be these kinds 1031 00:57:23,190 --> 00:57:24,690 of chi-squared calculations. 1032 00:57:27,580 --> 00:57:30,250 So it's not that the chi-squared was a special case. 1033 00:57:30,250 --> 00:57:34,120 It's the PDF that you should always 1034 00:57:34,120 --> 00:57:36,580 associate with s squared. 1035 00:57:41,310 --> 00:57:45,090 And notice here, we had 102.3. 1036 00:57:45,090 --> 00:57:47,310 That's our best guess. 1037 00:57:47,310 --> 00:57:53,070 And we had 71.4 and 158.1 for the range on the variance. 1038 00:57:58,507 --> 00:58:00,215 I always find this a little bit shocking. 1039 00:58:02,840 --> 00:58:04,790 A sample size of 50? 1040 00:58:04,790 --> 00:58:08,940 I took 50 samples, right? 1041 00:58:08,940 --> 00:58:16,920 And I had-- my underlying variance, I guess, was 100. 1042 00:58:16,920 --> 00:58:19,170 But I took a lot of samples. 1043 00:58:19,170 --> 00:58:20,970 And it always shocks me a little bit 1044 00:58:20,970 --> 00:58:24,480 how big the range is on the estimate of variance 1045 00:58:24,480 --> 00:58:27,000 coming out of this. 1046 00:58:27,000 --> 00:58:30,420 Here, my estimate of variance is 102.3. 1047 00:58:30,420 --> 00:58:31,980 Well, that's at least reassuring, 1048 00:58:31,980 --> 00:58:34,560 because that's close to the example that I gave here, 1049 00:58:34,560 --> 00:58:40,780 where a priori, I thought it was 100. 1050 00:58:40,780 --> 00:58:43,630 I just basically popped that out. 1051 00:58:43,630 --> 00:58:47,140 What's shocking is I can go down to 71. 1052 00:58:47,140 --> 00:58:54,880 That's like 30% lower than that, or up to 158, which is about 55% higher 1053 00:58:54,880 --> 00:58:57,770 than my point estimate. 1054 00:58:57,770 --> 00:59:01,190 And a really important thing just to know qualitatively 1055 00:59:01,190 --> 00:59:04,610 is that estimating a mean is pretty easy. 1056 00:59:04,610 --> 00:59:06,680 And actually, as sample size grows, 1057 00:59:06,680 --> 00:59:09,770 you can get pretty good tight estimates of mean. 1058 00:59:09,770 --> 00:59:13,785 But the estimates of variance are hard. 1059 00:59:13,785 --> 00:59:17,260 You need a lot of data to estimate 1060 00:59:17,260 --> 00:59:21,650 that second-order statistic. 1061 00:59:21,650 --> 00:59:24,490 And so we get big spreads in variance. 1062 00:59:24,490 --> 00:59:26,890 So you've got to be really careful in your reasoning 1063 00:59:26,890 --> 00:59:28,180 about variances. 1064 00:59:28,180 --> 00:59:31,060 And that'll bring us back to the F-statistic a little bit later. 1065 00:59:36,150 --> 00:59:40,260 So let me go back now to the student t-distribution.
1066 00:59:40,260 --> 00:59:44,550 And it has a formula and a formal definition 1067 00:59:44,550 --> 00:59:49,560 here, which is if I start out with a random variable z that 1068 00:59:49,560 --> 00:59:52,270 is a unit normal. 1069 00:59:52,270 --> 00:59:57,040 And then I divide it by the square root of a random variable that 1070 00:59:57,040 --> 01:00:02,620 is chi-squared with k degrees of freedom, divided by k, 1071 01:00:02,620 --> 01:00:06,310 I get a new distribution, a new variable t, 1072 01:00:06,310 --> 01:00:11,390 that is a t-distribution with k degrees of freedom. 1073 01:00:11,390 --> 01:00:13,310 And it's the same question. 1074 01:00:13,310 --> 01:00:15,290 My god, why would you do such a cruel thing 1075 01:00:15,290 --> 01:00:18,410 to a random variable-- divide it by a chi-squared random 1076 01:00:18,410 --> 01:00:21,470 variable and some constant k? 1077 01:00:21,470 --> 01:00:26,400 And the answer is that's essentially 1078 01:00:26,400 --> 01:00:33,270 what we're doing when we are normalizing data 1079 01:00:33,270 --> 01:00:38,120 like this, when instead of normalizing 1080 01:00:38,120 --> 01:00:40,430 to the true underlying population 1081 01:00:40,430 --> 01:00:42,560 variance, 1082 01:00:42,560 --> 01:00:49,220 I'm also having to estimate not only the mean, 1083 01:00:49,220 --> 01:00:54,460 but also estimate the population standard deviation. 1084 01:00:54,460 --> 01:00:56,515 We already said, what is s? 1085 01:00:56,515 --> 01:00:59,020 s squared is chi-squared distributed. 1086 01:00:59,020 --> 01:01:04,400 So s is the square root of a chi-squared distributed quantity. 1087 01:01:04,400 --> 01:01:08,050 So buried in this unit normalization 1088 01:01:08,050 --> 01:01:11,620 that we like to do to get to a probability distribution 1089 01:01:11,620 --> 01:01:13,390 function-- we can talk about confidence 1090 01:01:13,390 --> 01:01:14,980 intervals on the mean. 1091 01:01:14,980 --> 01:01:18,190 We subtract off some mean, and then we 1092 01:01:18,190 --> 01:01:21,040 normalize to s over root n. 1093 01:01:21,040 --> 01:01:23,600 But s itself is this chi-squared quantity. 1094 01:01:23,600 --> 01:01:27,400 So it's really closely related to the operations 1095 01:01:27,400 --> 01:01:33,040 that we do when we are normalizing our sample data, 1096 01:01:33,040 --> 01:01:37,130 when we also had to estimate the standard deviation. 1097 01:01:37,130 --> 01:01:42,580 So the way to think about the t-distribution 1098 01:01:42,580 --> 01:01:46,330 is it's really close to the normal distribution, 1099 01:01:46,330 --> 01:01:47,920 except it's perturbed a little bit, 1100 01:01:47,920 --> 01:01:51,460 because we didn't really know the underlying variance. 1101 01:01:51,460 --> 01:01:53,710 We're having to estimate it also. 1102 01:01:53,710 --> 01:01:57,910 So here's some pictures, some examples. 1103 01:01:57,910 --> 01:02:03,450 The red is the unit normal distribution. 1104 01:02:03,450 --> 01:02:10,390 And now for different sizes of sample, so for an n equals 3, 1105 01:02:10,390 --> 01:02:14,520 you have this little blue distribution. 1106 01:02:14,520 --> 01:02:20,070 That's the t-distribution with degrees of freedom 3. 1107 01:02:20,070 --> 01:02:23,580 Notice that it's a little bit wider 1108 01:02:23,580 --> 01:02:27,540 than the normal distribution, reflecting a little bit 1109 01:02:27,540 --> 01:02:30,360 less certainty on really the location 1110 01:02:30,360 --> 01:02:33,270 of that random variable.
1111 01:02:33,270 --> 01:02:35,840 Now as n gets bigger, so we've got 1112 01:02:35,840 --> 01:02:40,800 an n equals 10 example in here in the green, 1113 01:02:40,800 --> 01:02:42,710 the chi-square-- or the t-distribution 1114 01:02:42,710 --> 01:02:43,910 gets a little bit tighter. 1115 01:02:43,910 --> 01:02:47,750 And for n equals 100, it's basically almost lying right 1116 01:02:47,750 --> 01:02:50,370 on top of the normal distribution. 1117 01:02:50,370 --> 01:02:54,800 So what the t is reflecting is a little additional uncertainty 1118 01:02:54,800 --> 01:02:58,700 because we didn't know sigma squared. 1119 01:02:58,700 --> 01:03:01,880 I had to calculate s squared from that same sample 1120 01:03:01,880 --> 01:03:03,830 distribution. 1121 01:03:03,830 --> 01:03:06,230 So that's all that's really going on there. 1122 01:03:06,230 --> 01:03:10,040 If we then say, OK, I want to get back 1123 01:03:10,040 --> 01:03:11,870 to a confidence interval. 1124 01:03:11,870 --> 01:03:14,360 But now, I don't know the variance, 1125 01:03:14,360 --> 01:03:18,180 and I have to estimate that also from my data. 1126 01:03:18,180 --> 01:03:22,320 We have essentially the same confidence interval formula, 1127 01:03:22,320 --> 01:03:26,570 the only difference being instead of z 1128 01:03:26,570 --> 01:03:29,930 related to the unit normal distribution, 1129 01:03:29,930 --> 01:03:33,200 we have numbers of standard deviations 1130 01:03:33,200 --> 01:03:36,200 on the t-distribution that we're arguing about, 1131 01:03:36,200 --> 01:03:39,860 again, reflecting that that t is a little bit wider. 1132 01:03:39,860 --> 01:03:43,130 But it's essentially exactly the same thinking, 1133 01:03:43,130 --> 01:03:46,910 just recognizing that now, the sampling distribution for x 1134 01:03:46,910 --> 01:03:50,520 bar when variance is unknown-- 1135 01:03:50,520 --> 01:03:52,200 is not a normal. 1136 01:03:52,200 --> 01:03:53,580 It's a t-distribution. 1137 01:03:56,300 --> 01:03:58,830 But all the other operations are exactly the same. 1138 01:03:58,830 --> 01:04:02,930 We look for what alpha error we're willing to accept, 1139 01:04:02,930 --> 01:04:06,920 what our chance of being wrong on our bounding of the interval 1140 01:04:06,920 --> 01:04:12,020 is, and then allocating that to the left and the right; 1141 01:04:12,020 --> 01:04:14,150 figuring out how many standard deviations over we 1142 01:04:14,150 --> 01:04:18,650 go on, not the underlying population distribution, 1143 01:04:18,650 --> 01:04:20,430 but our sampling distribution. 1144 01:04:20,430 --> 01:04:25,280 So we still get the benefits of increasing n getting tighter. 1145 01:04:25,280 --> 01:04:27,760 But we just do that all on the t-distribution. 1146 01:04:27,760 --> 01:04:30,270 AUDIENCE: So this is-- this will be necessary for small sample 1147 01:04:30,270 --> 01:04:31,300 sizes. 1148 01:04:31,300 --> 01:04:33,250 PROFESSOR: Exactly. 1149 01:04:33,250 --> 01:04:35,110 So the point or the question was this 1150 01:04:35,110 --> 01:04:38,230 is only necessary for small sample sizes. 1151 01:04:38,230 --> 01:04:42,670 And that's exactly right because of the effect 1152 01:04:42,670 --> 01:04:45,910 that we see back with the t-distribution getting 1153 01:04:45,910 --> 01:04:51,820 very close in approximation to the normal distribution for n 1154 01:04:51,820 --> 01:04:53,800 becoming appreciable. 1155 01:04:53,800 --> 01:04:55,930 I've heard different kinds of rules of thumb.
1156 01:04:55,930 --> 01:04:58,930 Some people like to say for n about 25, 1157 01:04:58,930 --> 01:05:02,030 you're pretty close to a normal distribution. 1158 01:05:02,030 --> 01:05:05,260 Some people like to draw it at n equals 40. 1159 01:05:05,260 --> 01:05:10,420 It really depends on what kind of accuracy you're after. 1160 01:05:10,420 --> 01:05:13,750 But you can be substantially wrong for very small sample 1161 01:05:13,750 --> 01:05:17,140 sizes-- of sample size 5, which is a natural sample 1162 01:05:17,140 --> 01:05:21,200 size you would often use in some manufacturing scenarios. 1163 01:05:21,200 --> 01:05:24,400 So you do have to be aware for very small n 1164 01:05:24,400 --> 01:05:27,630 to use the t-distribution. 1165 01:05:27,630 --> 01:05:30,390 This was an example where we had n equals 50 1166 01:05:30,390 --> 01:05:32,130 in our part thickness example. 1167 01:05:32,130 --> 01:05:34,530 Let's see how different things pop out 1168 01:05:34,530 --> 01:05:37,840 if we use the t-distribution or the normal distribution. 1169 01:05:37,840 --> 01:05:39,840 So let's go back to our example. 1170 01:05:39,840 --> 01:05:43,440 But now, let's say we don't know either the variance 1171 01:05:43,440 --> 01:05:45,270 or the mean. 1172 01:05:45,270 --> 01:05:48,000 Both of them are unknown. 1173 01:05:48,000 --> 01:05:50,130 We already calculated the sample mean. 1174 01:05:50,130 --> 01:05:52,620 We had 113.5. 1175 01:05:52,620 --> 01:05:55,890 And now I'll tell you-- 1176 01:05:55,890 --> 01:05:58,080 I guess I already gave you this number previously. 1177 01:05:58,080 --> 01:06:01,380 But I'll tell you that we apply the sample variance 1178 01:06:01,380 --> 01:06:06,650 formula to the data, and out pops the number 102.3. 1179 01:06:06,650 --> 01:06:09,350 So again, that's the sample variance, 1180 01:06:09,350 --> 01:06:14,950 your best estimate of the underlying variance. 1181 01:06:14,950 --> 01:06:17,410 So these are your point estimates. 1182 01:06:17,410 --> 01:06:19,990 But now, I want to go back to the question, where's 1183 01:06:19,990 --> 01:06:23,320 the confidence interval on where we think the true mean would 1184 01:06:23,320 --> 01:06:25,240 be 95% of the time? 1185 01:06:25,240 --> 01:06:28,060 Well, now we have to use the t-distribution. 1186 01:06:28,060 --> 01:06:32,770 When we do that with 49 degrees of freedom, 1187 01:06:32,770 --> 01:06:35,890 again, n minus 1, because we're using up 1 for calculation 1188 01:06:35,890 --> 01:06:37,810 of the sample mean. 1189 01:06:37,810 --> 01:06:42,630 Now we have this slightly different formula. 1190 01:06:42,630 --> 01:06:45,960 Here, we can use the plus/minus, because the t-distribution, 1191 01:06:45,960 --> 01:06:50,010 like the normal distribution, is symmetric. 1192 01:06:50,010 --> 01:06:55,420 So I've got plus or minus some number of units, z's. 1193 01:06:55,420 --> 01:06:57,610 In this case, it's unit t's because 1194 01:06:57,610 --> 01:07:01,360 the operative distribution is the t-distribution. 1195 01:07:01,360 --> 01:07:03,140 I plug that in. 1196 01:07:03,140 --> 01:07:08,530 Notice that for 2.5% in each of the tails, 1197 01:07:08,530 --> 01:07:12,190 the t-distribution is slightly wider. 1198 01:07:12,190 --> 01:07:13,870 Remember, back with the unit normal, 1199 01:07:13,870 --> 01:07:19,420 we said plus or minus 1.96 standard deviations is 95%. 1200 01:07:19,420 --> 01:07:22,060 For the t, you got to go a little bit further-- 1201 01:07:22,060 --> 01:07:24,610 2.01.
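Here is a minimal sketch of that comparison (mine, not the slide's), using the numbers from this example: x bar of 113.5, s squared of 102.3, and n of 50.

```python
# Sketch: compare the normal and t critical values and the resulting
# 95% confidence interval half-widths for this thickness example.
import numpy as np
from scipy.stats import norm, t

n, xbar, s2, alpha = 50, 113.5, 102.3, 0.05
se = np.sqrt(s2 / n)                     # estimated standard error of x-bar
z_crit = norm.ppf(1 - alpha / 2)         # ~1.96
t_crit = t.ppf(1 - alpha / 2, n - 1)     # ~2.01 with 49 degrees of freedom
print(z_crit * se, t_crit * se)          # ~2.80 vs ~2.88: slightly wider with t
```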
1202 01:07:24,610 --> 01:07:26,290 Not a big difference-- 1203 01:07:26,290 --> 01:07:27,460 2.01. 1204 01:07:27,460 --> 01:07:29,110 And when you come out with that, you 1205 01:07:29,110 --> 01:07:34,490 get a slightly wider confidence interval. 1206 01:07:34,490 --> 01:07:35,910 I'm less confident. 1207 01:07:35,910 --> 01:07:40,370 I got to go further to get to my 95% confidence on the range 1208 01:07:40,370 --> 01:07:42,680 because I'm also estimating the variance. 1209 01:07:42,680 --> 01:07:45,290 So in this case, the difference is pretty much negligible. 1210 01:07:45,290 --> 01:07:47,480 And if I had a sample of size 50, 1211 01:07:47,480 --> 01:07:50,600 I would probably just use the normal distribution. 1212 01:07:50,600 --> 01:07:53,410 And that's a good example, showing that the difference 1213 01:07:53,410 --> 01:07:56,410 is 5 parts out of 200. 1214 01:07:56,410 --> 01:07:58,180 It's really quite small. 1215 01:08:02,350 --> 01:08:04,450 One more distribution I want to mention-- 1216 01:08:04,450 --> 01:08:05,950 we're not going to use it much here. 1217 01:08:05,950 --> 01:08:09,220 I think I've already described it briefly-- 1218 01:08:09,220 --> 01:08:11,200 is this F distribution. 1219 01:08:11,200 --> 01:08:14,620 And this arises if I have one random variable that 1220 01:08:14,620 --> 01:08:16,270 is chi-squared distributed. 1221 01:08:16,270 --> 01:08:19,000 I take another random variable that's chi-squared distributed. 1222 01:08:19,000 --> 01:08:21,729 And I form a new random variable R 1223 01:08:21,729 --> 01:08:24,880 that is the ratio of those two, each normalized 1224 01:08:24,880 --> 01:08:29,740 to the degrees of freedom or the number of variables 1225 01:08:29,740 --> 01:08:32,680 that went into each of those chi-squared distributed 1226 01:08:32,680 --> 01:08:33,729 variables. 1227 01:08:33,729 --> 01:08:40,529 And that is an F with u and v degrees of freedom. 1228 01:08:40,529 --> 01:08:48,359 Again, this comes up when we're looking at things like ratios 1229 01:08:48,359 --> 01:08:52,920 and want to reason about ratios of true population variances, 1230 01:08:52,920 --> 01:09:00,390 based on observations of sample variances. 1231 01:09:00,390 --> 01:09:05,210 And the key place where that might come up that I mentioned 1232 01:09:05,210 --> 01:09:07,970 is experimental design cases. 1233 01:09:07,970 --> 01:09:10,520 So this is an injection molding example, 1234 01:09:10,520 --> 01:09:12,950 where you might be looking at two different process 1235 01:09:12,950 --> 01:09:16,800 conditions-- a low hold time and a high hold time. 1236 01:09:16,800 --> 01:09:19,340 And there may be other things varying, maybe even 1237 01:09:19,340 --> 01:09:23,479 other variables varying, that cause there to be a spread. 1238 01:09:23,479 --> 01:09:25,640 Or there's just natural variation 1239 01:09:25,640 --> 01:09:27,350 in the two populations. 1240 01:09:27,350 --> 01:09:30,370 And you might ask questions like, 1241 01:09:30,370 --> 01:09:33,210 are these two variances different? 1242 01:09:33,210 --> 01:09:36,090 Did I improve the variance with that process condition change? 1243 01:09:38,810 --> 01:09:40,370 Maybe-- maybe not. 1244 01:09:40,370 --> 01:09:42,979 Certainly not obvious here, so you might 1245 01:09:42,979 --> 01:09:44,352 have a very low confidence.
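As one hedged sketch of how such a variance comparison might be run (the sample sizes and variances below are assumptions for illustration, not the injection molding data): form the ratio of the two sample variances and see whether it falls inside the central 95% band of the appropriate F distribution.

```python
# Sketch: is the ratio of two sample variances consistent with the
# true variances being equal? Assumed numbers, not lecture data.
from scipy.stats import f

n1, n2 = 20, 20                       # assumed sample sizes
s2_low, s2_high = 4.0, 9.0            # assumed sample variances
ratio = s2_high / s2_low
# Central 95% band for the ratio when the true variances are equal:
lo = f.ppf(0.025, n1 - 1, n2 - 1)     # ~0.40 with 19 and 19 degrees of freedom
hi = f.ppf(0.975, n1 - 1, n2 - 1)     # ~2.53
print(ratio, lo, hi)  # 2.25 lies inside the band: no confident difference
```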
1246 01:09:44,352 --> 01:09:46,310 So we're going to go and use the F distribution 1247 01:09:46,310 --> 01:09:50,510 a little bit later when we do analysis of experiments, 1248 01:09:50,510 --> 01:09:54,050 especially where you're looking to try to make inferences 1249 01:09:54,050 --> 01:09:55,970 about whether there are differences 1250 01:09:55,970 --> 01:09:59,750 between a couple of populations. 1251 01:09:59,750 --> 01:10:04,010 And again, because we're dealing with variances, 1252 01:10:04,010 --> 01:10:07,520 there's a huge spread that arises naturally 1253 01:10:07,520 --> 01:10:12,200 in these distributions, purely by chance. 1254 01:10:12,200 --> 01:10:15,470 This is a good place to re-emphasize 1255 01:10:15,470 --> 01:10:19,550 that a lot of what's going on here in random sampling 1256 01:10:19,550 --> 01:10:23,060 is there's spread in the observations that you get. 1257 01:10:23,060 --> 01:10:25,830 So here's a very simple numerical example. 1258 01:10:25,830 --> 01:10:30,900 If I start with a variable that is unit normal, 1259 01:10:30,900 --> 01:10:36,120 and I'm just going to take two samples, sets of size n 1260 01:10:36,120 --> 01:10:38,540 equals 20. 1261 01:10:38,540 --> 01:10:40,690 So I'm taking two different samples, 1262 01:10:40,690 --> 01:10:42,680 same underlying population. 1263 01:10:42,680 --> 01:10:45,400 I'm not making a process change, say. 1264 01:10:45,400 --> 01:10:48,680 I'm just taking two samples, each of size 20. 1265 01:10:48,680 --> 01:10:51,320 By chance, when I take that first sample, 1266 01:10:51,320 --> 01:10:55,950 I calculate a particular sample variance, s squared. 1267 01:10:55,950 --> 01:10:57,620 And by chance, I calculate another one 1268 01:10:57,620 --> 01:10:58,970 for the second sample. 1269 01:10:58,970 --> 01:11:03,470 And if I form the ratio of those two, what typical range am 1270 01:11:03,470 --> 01:11:07,580 I going to observe in the ratio of those two variances? 1271 01:11:07,580 --> 01:11:10,400 For example, what ratio might I observe 1272 01:11:10,400 --> 01:11:13,370 95% of the time or what range? 1273 01:11:13,370 --> 01:11:15,810 And that's the F distribution. 1274 01:11:15,810 --> 01:11:20,930 In fact, if I look at the upper and lower 1275 01:11:20,930 --> 01:11:26,810 bound on the range of that ratio for a 95% confidence 1276 01:11:26,810 --> 01:11:31,280 interval for this ratio of two samples of size 20, 1277 01:11:31,280 --> 01:11:36,260 I can go anywhere from 2.5 to 0.4 in that ratio. 1278 01:11:39,120 --> 01:11:41,040 That's with samples of size 20. 1279 01:11:41,040 --> 01:11:43,540 That's a huge range, right? 1280 01:11:43,540 --> 01:11:46,900 Imagine, 2 and 1/2 times bigger variance over here, 1281 01:11:46,900 --> 01:11:48,790 compared to over here. 1282 01:11:48,790 --> 01:11:51,950 And that occurs purely by chance. 1283 01:11:51,950 --> 01:11:57,070 So 95% of the time, I might have ratios within that range. 1284 01:11:57,070 --> 01:11:59,800 But 5% of the time, I'll even observe 1285 01:11:59,800 --> 01:12:02,350 ratios that are bigger or even smaller 1286 01:12:02,350 --> 01:12:04,540 than those extreme points. 1287 01:12:04,540 --> 01:12:07,140 So you've got to be really careful in reasoning 1288 01:12:07,140 --> 01:12:08,160 about variances. 1289 01:12:10,990 --> 01:12:13,210 So we're mostly there. 1290 01:12:13,210 --> 01:12:15,340 The last thing I want to do here is 1291 01:12:15,340 --> 01:12:20,260 draw the relationship of some of these to hypothesis tests.
1292 01:12:20,260 --> 01:12:23,590 And that gets us very close to some of the Shewhart hypotheses 1293 01:12:23,590 --> 01:12:26,020 that are the basis for control charts 1294 01:12:26,020 --> 01:12:28,490 that we'll talk about in the next lecture. 1295 01:12:28,490 --> 01:12:32,110 But I do want to get the basic idea in the last five, 1296 01:12:32,110 --> 01:12:36,740 10 minutes on what a statistical hypothesis test is 1297 01:12:36,740 --> 01:12:39,440 and how that relates to some of these confidence intervals 1298 01:12:39,440 --> 01:12:42,000 that we've been talking about. 1299 01:12:42,000 --> 01:12:44,870 So the basic idea we've been doing with these means is 1300 01:12:44,870 --> 01:12:48,350 we've been hypothesizing that the mean has some distribution, 1301 01:12:48,350 --> 01:12:50,600 say a normal distribution. 1302 01:12:50,600 --> 01:12:54,500 And then when we talked about this confidence interval, 1303 01:12:54,500 --> 01:12:57,740 I would say, accept or reject the hypothesis 1304 01:12:57,740 --> 01:13:03,200 that the mean was within some range with some probability. 1305 01:13:03,200 --> 01:13:06,920 We can extend that to asking other questions 1306 01:13:06,920 --> 01:13:09,080 or other hypotheses, and then looking 1307 01:13:09,080 --> 01:13:11,150 at the probabilities associated with it, 1308 01:13:11,150 --> 01:13:14,300 and saying, with some degree of confidence, 1309 01:13:14,300 --> 01:13:15,980 I believe the hypothesis. 1310 01:13:15,980 --> 01:13:19,460 Or I have enough evidence to counter it. 1311 01:13:19,460 --> 01:13:23,990 And a typical example might be a null hypothesis, 1312 01:13:23,990 --> 01:13:31,100 often referred to as H0, that the mean is some 1313 01:13:31,100 --> 01:13:34,070 a priori mean, some mu 0. 1314 01:13:34,070 --> 01:13:37,580 Then, based on this sample 1315 01:13:37,580 --> 01:13:39,890 that I'm drawing from the population, 1316 01:13:39,890 --> 01:13:41,780 I have this alternative hypothesis 1317 01:13:41,780 --> 01:13:43,020 that the mean has changed. 1318 01:13:43,020 --> 01:13:44,810 It's no longer the same mean. 1319 01:13:44,810 --> 01:13:48,917 Do I have enough evidence to say with some degree of confidence 1320 01:13:48,917 --> 01:13:50,000 that the mean has changed? 1321 01:13:52,760 --> 01:13:55,610 And it's a little tricky because there's 1322 01:13:55,610 --> 01:13:57,470 all these probabilities associated 1323 01:13:57,470 --> 01:14:00,450 with random sampling. 1324 01:14:00,450 --> 01:14:03,260 So I observe a particular value with some deviation. 1325 01:14:03,260 --> 01:14:10,130 How do I know to what degree there's an actual shift, 1326 01:14:10,130 --> 01:14:13,210 say, in the mean or not? 1327 01:14:13,210 --> 01:14:14,380 So let's look at this. 1328 01:14:14,380 --> 01:14:16,840 What we do is we form the hypothesis. 1329 01:14:16,840 --> 01:14:19,180 We then look at the probabilities associated 1330 01:14:19,180 --> 01:14:22,840 with the two cases, and then based on those probabilities, 1331 01:14:22,840 --> 01:14:25,720 say with some degree of confidence, 1332 01:14:25,720 --> 01:14:28,250 I choose one or the other. 1333 01:14:28,250 --> 01:14:31,420 And what's important is there's always the chance of being 1334 01:14:31,420 --> 01:14:33,940 wrong, making an error-- 1335 01:14:33,940 --> 01:14:38,170 those alpha errors out in the tails, for example-- 1336 01:14:38,170 --> 01:14:39,500 with that decision. 1337 01:14:39,500 --> 01:14:42,850 So that's where this confidence level comes in.
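Here is a minimal sketch of that two-sided test on the mean, in the known-variance, normal case; all numbers below are assumptions for illustration, not lecture data.

```python
# Sketch: two-sided z-test of H0: mean = mu0 against a changed mean.
# Assumed values: mu0 = 100, known sigma = 10, n = 50, observed x-bar.
import numpy as np
from scipy.stats import norm

mu0, sigma, n = 100.0, 10.0, 50
xbar = 103.2                              # assumed observed sample mean
z = (xbar - mu0) / (sigma / np.sqrt(n))   # standardized deviation of x-bar
p_value = 2 * (1 - norm.cdf(abs(z)))      # chance of a deviation this large
print(z, p_value)                         # reject H0 at alpha = 0.05 if p < 0.05
```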
1338 01:14:42,850 --> 01:14:45,430 So let's say we're looking at this test. 1339 01:14:45,430 --> 01:14:48,920 We're asking-- the null hypothesis 1340 01:14:48,920 --> 01:14:54,380 is I have a normal distribution with some a priori mean 1341 01:14:54,380 --> 01:14:56,250 and some a priori variance. 1342 01:14:56,250 --> 01:14:58,040 I'm going to draw a new sample. 1343 01:14:58,040 --> 01:15:02,880 And based on that, I want to decide 1344 01:15:02,880 --> 01:15:07,290 whether a shift has occurred-- 1345 01:15:07,290 --> 01:15:11,170 whether the data comes from that distribution or not. 1346 01:15:11,170 --> 01:15:12,960 And so what we're going to do is use 1347 01:15:12,960 --> 01:15:16,560 essentially this same confidence interval idea 1348 01:15:16,560 --> 01:15:21,730 and say, say to 95% confidence, 95% of the time, 1349 01:15:21,730 --> 01:15:26,370 if my value lies in the central part of that distribution, 1350 01:15:26,370 --> 01:15:30,870 I'm going to accept the-- 1351 01:15:30,870 --> 01:15:33,210 well, in this case, the null hypothesis 1352 01:15:33,210 --> 01:15:37,840 that my new sample still comes from that same distribution. 1353 01:15:37,840 --> 01:15:41,580 So that would be my 95%, my 1 minus alpha, if alpha 1354 01:15:41,580 --> 01:15:43,390 is a 5% error. 1355 01:15:43,390 --> 01:15:46,500 But if I observe a sample mean, say, 1356 01:15:46,500 --> 01:15:51,230 or I observe a piece of data that lies out here, 1357 01:15:51,230 --> 01:15:53,450 I'm going to reject the null hypothesis. 1358 01:15:53,450 --> 01:15:55,130 I'm going to say instead, that's 1359 01:15:55,130 --> 01:15:58,910 such an unlikely event by chance 1360 01:15:58,910 --> 01:16:02,660 that I think it instead indicates something has changed. 1361 01:16:02,660 --> 01:16:04,270 Something has changed in the process. 1362 01:16:04,270 --> 01:16:07,213 And we'll call that the region of rejection. 1363 01:16:10,340 --> 01:16:14,090 So again, already you can see one kind of error 1364 01:16:14,090 --> 01:16:16,160 that's likely to pop up. 1365 01:16:16,160 --> 01:16:18,710 There is this alpha, 1366 01:16:18,710 --> 01:16:21,840 a significance level to the test, 1367 01:16:21,840 --> 01:16:24,470 very similar to the confidence interval 1368 01:16:24,470 --> 01:16:28,290 idea and the alpha error associated with that. 1369 01:16:28,290 --> 01:16:31,840 So right away, you see there's one kind of error-- 1370 01:16:31,840 --> 01:16:35,240 it's referred to as a type I error-- 1371 01:16:35,240 --> 01:16:37,060 on these kinds of hypothesis tests. 1372 01:16:37,060 --> 01:16:40,900 We're rejecting the null hypothesis out 1373 01:16:40,900 --> 01:16:44,155 here in the tails with some probability alpha.
1382 01:17:12,400 --> 01:17:15,608 I'm claiming this is evidence that something changed 1383 01:17:15,608 --> 01:17:16,900 when, in fact, nothing changed. 1384 01:17:16,900 --> 01:17:19,150 I just got unlucky, right? 1385 01:17:19,150 --> 01:17:22,300 So the first type of error that you can make 1386 01:17:22,300 --> 01:17:25,570 is this type I error. 1387 01:17:28,160 --> 01:17:31,970 It's also sometimes referred to as producer error, 1388 01:17:31,970 --> 01:17:33,890 producer risk. 1389 01:17:33,890 --> 01:17:35,390 You're the manufacturer. 1390 01:17:35,390 --> 01:17:38,390 You reject your part because your-- 1391 01:17:38,390 --> 01:17:40,970 or you reject a batch, say, because your sample 1392 01:17:40,970 --> 01:17:42,710 was way out here in the tail. 1393 01:17:42,710 --> 01:17:45,980 You're taking the risk of rejecting and throwing away 1394 01:17:45,980 --> 01:17:50,900 good product, even though it really was good. 1395 01:17:50,900 --> 01:17:54,740 If I took more samples, it would go back and really indicate 1396 01:17:54,740 --> 01:17:55,820 what was going on-- 1397 01:17:55,820 --> 01:17:58,010 that the product was still good. 1398 01:17:58,010 --> 01:18:01,250 So it's also sometimes referred to as producer risk. 1399 01:18:01,250 --> 01:18:04,950 But there's another possible error. 1400 01:18:04,950 --> 01:18:11,200 There is an error associated with the case where the distribution shifted 1401 01:18:11,200 --> 01:18:12,880 or changed. 1402 01:18:12,880 --> 01:18:16,210 I still accepted it based on a random sample 1403 01:18:16,210 --> 01:18:17,950 from the different distribution that 1404 01:18:17,950 --> 01:18:20,800 happened to fall in my original region of acceptance. 1405 01:18:20,800 --> 01:18:24,640 And that's referred to as a type II error-- 1406 01:18:24,640 --> 01:18:28,305 it has a probability associated with it called beta. 1407 01:18:28,305 --> 01:18:30,055 We've been talking all about these alphas. 1408 01:18:30,055 --> 01:18:31,720 Well, there's also a beta. 1409 01:18:31,720 --> 01:18:37,510 It's also sometimes referred to as a consumer's risk. 1410 01:18:37,510 --> 01:18:40,690 The manufacturer did a little inspection. 1411 01:18:40,690 --> 01:18:43,270 The mean happened to fall in the region of acceptance. 1412 01:18:43,270 --> 01:18:44,940 He shipped it. 1413 01:18:44,940 --> 01:18:48,030 Turns out, by bad chance, it actually just happened 1414 01:18:48,030 --> 01:18:49,500 to fall in the good region. 1415 01:18:49,500 --> 01:18:54,470 It really is coming from a bad distribution. 1416 01:18:54,470 --> 01:18:55,610 So let's look at that. 1417 01:18:55,610 --> 01:18:57,770 What is this beta? 1418 01:18:57,770 --> 01:18:59,990 Well, for the type II errors, we essentially 1419 01:18:59,990 --> 01:19:05,120 have to hypothesize a shift of some size, some little delta. 1420 01:19:05,120 --> 01:19:08,330 And then we assess the probabilities 1421 01:19:08,330 --> 01:19:12,590 that I'm drawing from the tail of that shifted distribution 1422 01:19:12,590 --> 01:19:14,630 and just happen to fall over here 1423 01:19:14,630 --> 01:19:20,040 in this region of acceptance for our good distribution. 1424 01:19:20,040 --> 01:19:23,210 So this is the probability associated 1425 01:19:23,210 --> 01:19:24,470 with our null hypothesis. 1426 01:19:24,470 --> 01:19:26,720 This is our starting distribution. 1427 01:19:26,720 --> 01:19:29,600 Our alternative hypothesis here is 1428 01:19:29,600 --> 01:19:32,375 that I had a plus delta shift in the mean.
1429 01:19:34,950 --> 01:19:38,900 So this is our possible new operative distribution. 1430 01:19:38,900 --> 01:19:41,040 And in fact, for a type II error, 1431 01:19:41,040 --> 01:19:43,860 this shifted distribution is actually the one at work. 1432 01:19:43,860 --> 01:19:46,950 Remember, this is the region of acceptance. 1433 01:19:46,950 --> 01:19:50,690 So I'm claiming this is good. 1434 01:19:50,690 --> 01:19:54,110 But if the population actually shifted over there 1435 01:19:54,110 --> 01:19:56,570 to the right, notice off on the left 1436 01:19:56,570 --> 01:20:01,560 here we've got this whole tail, where 1437 01:20:01,560 --> 01:20:04,020 if I drew from the shifted distribution, 1438 01:20:04,020 --> 01:20:07,380 I've got that tail, that lightly shaded blue tail, falling 1439 01:20:07,380 --> 01:20:09,930 in the region of acceptance, where I would say it's 1440 01:20:09,930 --> 01:20:14,140 a good distribution and erroneously accept. 1441 01:20:14,140 --> 01:20:19,000 And one can simply apply the same probabilities to basically 1442 01:20:19,000 --> 01:20:21,280 go in and calculate-- 1443 01:20:21,280 --> 01:20:26,830 just integrate up and do the cumulative normal distribution 1444 01:20:26,830 --> 01:20:32,200 function to calculate what that tail is. 1445 01:20:32,200 --> 01:20:36,410 So it's all the same probabilities. 1446 01:20:36,410 --> 01:20:40,510 So the applications of this are really going to be 1447 01:20:40,510 --> 01:20:42,470 in hypothesis testing. 1448 01:20:42,470 --> 01:20:44,020 This would be shifts of the mean. 1449 01:20:44,020 --> 01:20:47,470 You can start to see worrying about monitoring your process 1450 01:20:47,470 --> 01:20:50,140 and seeing if something changed in your process, 1451 01:20:50,140 --> 01:20:53,260 a shift occurred, and being able to detect that. 1452 01:20:53,260 --> 01:20:55,270 And that gets us to control charting 1453 01:20:55,270 --> 01:20:58,160 that we'll do next time. 1454 01:20:58,160 --> 01:21:00,730 So this is all pretty much the same stuff. 1455 01:21:00,730 --> 01:21:03,250 And now this is a peek ahead. 1456 01:21:03,250 --> 01:21:06,250 You'll see process control. 1457 01:21:06,250 --> 01:21:09,010 And we'll talk about repeated samples 1458 01:21:09,010 --> 01:21:13,840 in time coming from the same distribution next time. 1459 01:21:13,840 --> 01:21:16,240 So we will see you on Thursday. 1460 01:21:16,240 --> 01:21:20,980 And we will dive into Shewhart control charts.
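To close the loop on that beta calculation, here is a minimal sketch of integrating the shifted distribution over the acceptance region, as described above; the shift delta and all other numbers are assumptions for illustration, not lecture data.

```python
# Sketch of the type II (beta) error: the probability that the sample
# mean from a mean-shifted process still lands in the acceptance region.
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 100.0, 10.0, 50, 0.05   # assumed in-control process
delta = 3.0                                    # hypothesized mean shift
se = sigma / np.sqrt(n)                        # standard error of x-bar
half = norm.ppf(1 - alpha / 2) * se
lo, hi = mu0 - half, mu0 + half                # acceptance region for x-bar
# Integrate the shifted sampling distribution over that region:
beta = norm.cdf(hi, mu0 + delta, se) - norm.cdf(lo, mu0 + delta, se)
print(beta)   # probability of erroneously accepting the shifted process
```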