The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Last time, we started looking in more detail at some of the statistical basics. These are the basis for a lot of the tools and techniques that we're going to be learning about throughout the term, especially things like statistical process control, statistical design of experiments, robust optimization, yield modeling, and so on. And so we're going to pick up more or less where we left off. We talked a bit about the normal distribution, and what I want to do is talk a little bit more about a few of the assumptions and why it's so common that we use it for describing some of the kinds of data that we looked at last time. We went through a fairly substantial number of different examples and saw variation in time, variation across different parameter sets, and so on.

Just to remind us, the standard normal is just a mean-centered, unit-variance version of the normal. So if we have x as our data, and we subtract off the mean and then normalize to the standard deviation--z = (x - mu)/sigma--we get a unit normal variable. It's another random variable z that has a distribution that is marked out in terms of numbers of standard deviations. And so this is our normal distribution. Some nice properties that we mentioned last time are that it has only two parameters: the mean and the variance (or standard deviation) completely describe the normal distribution. Another property is that it's symmetric about the mean. We actually will use that property quite a bit in terms of manipulating some of the table values that one would look up for the proportion of the distribution that's out in either of the tails. It's perhaps obvious, but we actually do use that. We'll come back to that a little bit later.
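A minimal sketch of this standardization in Python (NumPy assumed; the data values here are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical measurements (e.g., a critical dimension in microns).
x = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0])

# Standardize: subtract the mean, scale by the standard deviation.
# If x is normally distributed, z follows the standard (unit) normal.
z = (x - x.mean()) / x.std(ddof=1)

print(z.mean())       # ~0 by construction
print(z.std(ddof=1))  # 1 by construction
```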
But what I wanted to start with is talking a little bit more about this assumption, if you dive into it, that we are using a normal distribution very often. And the questions are: why? How good of an approximation is that in most cases? When can we use it? When might we be motivated to use it?

What we did last time is we did a couple of things where we looked at some of the data. In particular, we did histogram--binning--kinds of plots of variations. And that would often motivate, based on the general shape, that a normal distribution looked appropriate. One can, I guess, do a curve fit to the histogram. Would you ever try to do that? So imagine that you actually had, say, the tops of the bins for the distribution. So maybe I had bins like this, where sometimes I had these as values--something like this. Now, would you actually try to do a normal distribution curve fit to that? In other words, if you said, what I'm going to try to do is minimize the errors between these points and the normal distribution, does that seem like a reasonable thing to do?

AUDIENCE: It's all driven by the size of your tails.

PROFESSOR: Yeah, there are some gotchas, certainly, with any histogram. The point was that the shape of this distribution--if you've ever played around, especially with interactive tools where you can bin and plot out distributions, if you were to change the size of your bins, you have this disturbing effect where the shape of your distribution sometimes changes a little bit out from under you. So if you change the bins, you may well end up with something where, all of a sudden, this one was low and now it's high, and the next one is a little bit low, and this one's up here if your bins are a little bit wider. So that might be a concern, but that's actually not the point that I'm after. Would you curve fit to this distribution to fit a normal to your data?
AUDIENCE: Well, you said that the normal distribution is described by the mean and standard deviation. So you might as well just take the mean and standard deviation of your data.

PROFESSOR: Beautiful.

AUDIENCE: And use that.

PROFESSOR: Right. Right, especially-- I guess the only circumstance I can imagine where it might make sense to curve fit is if you didn't have the raw data--you only had the bins. That's kind of strange. I think in most cases, you would, in fact, have the raw data. And then you simply calculate the mean and the standard deviation. Now, one thing we want to do, and we'll get to it a little bit today, is why that's a reasonable thing to do--to actually go in and calculate the mean and standard deviation. Why is that a good estimator for the underlying parameters of this distribution--the true mean and the true variance?

There are other things you can certainly do to check. If you had your data, and you calculated the mean and standard deviation, then you can plot your Gaussian on top of that distribution. And that, I think, is a reasonable thing to do as a quick visual check to see how well it seems to map, as well as a quick check that your calculations were reasonable--that nothing strange went wrong in your numerical calculation of those parameters. Now, there are a couple of other things that one can do to quickly check the assumption visually, and then a couple of very nice additional tools that I'll mention here for checking assumptions in a little bit more sophisticated way, visually or numerically. But one thing you can certainly do is look at the location of your data and just do a quick comparison of the percentage of data that you would expect in different bands of the data. We'll do a few more examples there, so that we know what percentage of the data we expect in the plus/minus 1 sigma band, for example, or what percentage of the data we would typically expect out in the 3 sigma tails. And so you can do a quick calculation and comparison of the percentage of data in each of these different bands and see: is that matching up to what we would expect from a normal distribution? This actually gets very close to the idea of confidence intervals, which we'll formalize a little bit more.
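A minimal sketch of both checks in Python--fitting the normal by its sample moments, overlaying it on the histogram, and comparing the fraction of data inside the plus/minus k sigma bands against the theoretical values (NumPy, SciPy, and Matplotlib assumed; the data are simulated for illustration):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated stand-in for real process data.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=0.1, size=500)

# "Fitting" a normal: just compute the sample mean and standard deviation.
mu, s = x.mean(), x.std(ddof=1)

# Quick visual check: overlay the implied Gaussian on the histogram.
plt.hist(x, bins=30, density=True, alpha=0.5)
grid = np.linspace(x.min(), x.max(), 200)
plt.plot(grid, stats.norm.pdf(grid, mu, s))
plt.show()

# Band check: fraction within +/-1, 2, 3 sigma vs. ~68.3%, 95.4%, 99.7%.
for k in (1, 2, 3):
    frac = np.mean(np.abs(x - mu) <= k * s)
    print(f"within {k} sigma: {frac:.3f} "
          f"(normal predicts {2 * stats.norm.cdf(k) - 1:.3f})")
```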
Now, there are a couple of additional things I've listed here. One is you can look at the kurtosis, or do a quick calculation of kurtosis, which is a higher-order statistical moment than the mean or the variance. In fact, if you look at the definition of the kurtosis, it's an expectation of the fourth moment--or rather, a calculation of a normalized version of the fourth moment. And for a perfectly normal distribution, this kurtosis value would be 1. And then as the distribution changes its shape--gets more peaked or less peaked, following other common distributions--it starts to deviate substantially from k equals 1. And in fact, this is a quick little tool to use sometimes if you're not sure, number one, if it's normal, and number two, if it's not normal, what distribution it might follow.

If you look here, this is a nice plot, although I didn't break out what all of these different distributions are. This is just a plot, normalized to the standard deviation of the data, of a set of different distributions. And the black one here is N. So this is--let me do the black one here, right? This is the N distribution. That's our Gaussian, with a kurtosis-- well, I guess you've got to look a little bit carefully at the definition.
Actually, I think if I go back to the previous page, which is one that Dave had, this definition for sample data essentially, as n gets very large, subtracts off 3. So I believe then, in this case, the kurtosis for a normal distribution is actually more like 0. These two definitions, if you look them up, I'm not sure are exactly the same. Rarely would you actually use this one; you're going to actually use this definition, which basically subtracts off a value. This goes with the plot on the next page. So they are slightly different definitions, I believe. So in that case, that's subtracting off a 3, and for the normal distribution, it ends up with a value of about 0.

Now, what's nice is, as you get some of these distributions, such as the Laplace distribution--this very peaked one right here--the kurtosis value goes up. It's an indication of a more peaked distribution. The logistic distribution, which we might talk about a little bit later--it's one that comes up occasionally with some quality or discrete kinds of distributions--has a kurtosis of 1.2. And the interesting one here also is the uniform distribution, which is less sharply peaked than a Gaussian. It actually has a negative kurtosis, with that subtraction of the 3 off it at the end. So you might find that a useful tool. I've rarely used kurtosis actually as an indicator, but I want to mention it to you because it is out there, at least as a hint at looking at some different distributions.

A more useful tool-- yeah, question?

AUDIENCE: So there's two different formulas? Because--

PROFESSOR: Well--

AUDIENCE: What you said, or--

PROFESSOR: Yeah, so this is for sample data. And I think if you were to actually go in-- I mean, essentially this-- I have not checked this. This was some definitions from previous class notes.
I do believe, when I did a quick lookup on what kurtosis is, that this is a better definition in terms of actual calculation formulas that you can use for calculating it. This is to give you the sense. I mean, it's sort of lurking in here: you can see the expectation operation down in here, and then the normalization to the standard deviation. In this case, this has to be your calculated standard deviation; this is the abstract one. So if you actually poke around, you will find in the literature more than one definition of kurtosis. My point was that this is what I would use if you want to use the plot on the next page, in terms of coming up with a number that might also indicate if there's a different distribution that you might look at. So it's related to the fourth moment.
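For reference, the two conventions in play here are Pearson kurtosis, the normalized fourth central moment E[(x - mu)^4]/sigma^4, which is 3 for a normal, and excess kurtosis, the same quantity minus 3, which is 0 for a normal. A quick numerical sketch, assuming SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)  # large sample from a normal

# Pearson kurtosis: normalized fourth central moment; ~3 for a normal.
pearson = np.mean((x - x.mean())**4) / x.std()**4
print(pearson)                                    # ~3.0

# Excess (Fisher) kurtosis subtracts 3: ~0 for a normal,
# positive for peaked distributions, negative for flat ones.
print(stats.kurtosis(x))                          # ~0.0
print(stats.kurtosis(rng.laplace(size=100_000)))  # ~3.0 (peaked)
print(stats.kurtosis(rng.uniform(size=100_000)))  # ~-1.2 (flat)
```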
A more useful tool--and this is one that I actually do use--is probability plots, or quantile-quantile plots. There's a section in Montgomery on that, as well as different toolboxes we'll be able to use to generate these things. And so here's an example of a quantile-quantile plot.

What I've started doing with the lecture notes on the web is to put up an early draft as early as I can for the next couple of weeks of lecture notes. But then, as I'm editing and adding in, I'll have the most up-to-date one. So if you've got slides, you may be missing a couple of these. If you printed them out before 9:00 or 10:00 PM last night, I think these got updated about that time. So this plot, for example, was not in the early draft of the slides. And I'll try to indicate that with a little "draft" on the web page if they're still early drafts of the slides.

So what are quantile-quantile plots? These are a little bit subtle to explain, so let me give it a shot, and then if you have questions, let me know. Normally, it's going to be generated by your statistics package. There are hand ways to do it, and I'll refer you to Montgomery for practice with actually trying to generate them by hand if you had to. But here's the basic idea. What we're plotting is the actual data that you've got. On the y-axis, you'll be plotting your data in terms of a normalized distribution. So you would center to the mean and then scale to your standard deviation--think of these as unit standard deviations. You simply find that as the y location for your data.

Then what you're plotting on the x-axis is the normal theoretical--I'm not sure I'd use the word quantiles here--but your normal theoretical standard deviation for that number of data points that you would have had, and the location for each of those data points. So imagine this is 50 data points--I'm not sure exactly how many data points this is. If you were to take 50 data points, draw them from a normal distribution, order them, and put them where you would expect on a normal distribution, what you would have is many more data points near 0. And as you get further and further out, 1 out of 50 times or 1 out of 25 times, you would expect to find a data point about, whatever it is, 2 or 2.1 standard deviations away.

In other words, if I were to compare the actual location of that data point, in terms of its value within my sample distribution of 50, against where it would fall if I just drew 50 data points randomly, that would be its location. Then what I can do is plot that coordinate for that data point. So what you end up with is taking all of your data, if you will, and sorting it from low to high. And then, starting at the center in some sense, you work outward from the center, ordering the data--comparing the number of standard deviations away from the mean that its index in your sorted data says it should be, against how far that data point actually was from your sample mean.
And what that gives you, if it were perfect and there were not any sort of noise in your data, is this perfect matching line: every data point falls where you would expect it to. Now, in your actual data, you're going to see some deviations from that. But what this is basically doing is an expansion of your data out in the tails and a compression of your data near the center, to be able to tell you how closely your data is following the assumed distribution.

And for this case here, we plotted the location of 50 data points, assuming it was a normal distribution--so that's where my x values were coming from. And as you can see here, the data pretty much nicely follows this distribution. You get a few little things that look like it's wandering or trailing off a little bit. And then you also often look out here in the tails. And you find, even out here over two standard deviations away, it looks like I've got pretty good fidelity to those tails. I might have values that are a little bit further away from the mean than I might expect from a normal distribution, but it's pretty close. So this is the kind of plot that you would expect to see for data that did, in fact, follow a normal distribution.

All right, so I know that's confusing. Are there questions that people have on what this--

AUDIENCE: Yes.

PROFESSOR: Yeah?

AUDIENCE: I have a question. So for each point, you get the y-axis from the sample value from your data.

PROFESSOR: Right.

AUDIENCE: And how do you get x? Do you get it based on the probability of that--pulling from your sample distribution, you refer to the theoretical normal distribution with the same probability, and then you get the x-axis?

PROFESSOR: Yes, very, very close. So for the y-axis, you've got it exactly right. For the x-axis, what's interesting is you don't actually use the values of your data.
You just use its index location in a sample of the size that you've got. In other words, if I had a million points, I would look at the lowest one. And I would expect that to be--in a normal distribution, I would look at where the probability, the number of standard deviations, is such that 1 out of 500,000 points is that far away from the mean. So I would look up the inverse probability on a normal distribution of being--of where 1 in 500,000 falls, whatever that small probability is. So I basically look at a tabulated normal probability, going backwards from where that index--my smallest point--was. And then I could do that for every point in my sample to figure out where its location should be on the x-axis.

So here's another example. Maybe this gives you a feel, because these q-q plots--the quantile-quantile plots--can actually be used with other distributions as well. They are not always q-q norm plots; they can be applied to whatever assumed probability distribution you might want to investigate. So here's an example where we again took the data, and the theoretical quantiles are lining up assuming a normal distribution. But in this example that I'm showing here, the data actually came from--let me erase, let me get rid of all this--the data actually came from an exponential distribution. So this is an example where I would have assumed things were coming from a Gaussian--this is still for the normal quantiles. But with an exponential, an e-to-the-minus-x kind of distribution, what you end up with are a lot of data values that are much larger, much further away from the mean, than you would expect from a Gaussian. So this would be an example where the normal q-q plot doesn't seem to match up. It's telling me my data really is not following along the normal distribution line.
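A sketch of constructing such a normal q-q plot by hand, assuming SciPy; it reproduces the situation above, with data secretly drawn from an exponential. The (i - 0.5)/n plotting position used here is one common convention among several:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=50)  # data secretly from an exponential

# y-axis: sorted data, centered and scaled to sample standard deviations.
y = np.sort((x - x.mean()) / x.std(ddof=1))

# x-axis: theoretical normal quantile for each sorted index, via the
# inverse normal CDF evaluated at plotting positions (i - 0.5)/n.
n = len(x)
p = (np.arange(1, n + 1) - 0.5) / n
q = stats.norm.ppf(p)

plt.plot(q, y, "o")  # exponential data wanders off the line in the tails
plt.plot(q, q, "-")  # 45-degree reference line
plt.xlabel("theoretical normal quantiles")
plt.ylabel("standardized data quantiles")
plt.show()

# For an exponential q-q plot instead, swap in stats.expon.ppf(p)
# (suitably standardized) for the theoretical quantiles.
```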
Now, I didn't pull a plot. But one could then ask the question--maybe you'd look up kurtosis, or maybe you look back at your data and say, I think maybe an exponential distribution is really what this is following. How would you plot that on a q-q norm?

AUDIENCE: Question.

PROFESSOR: Yeah?

AUDIENCE: Why doesn't the line go through 0, 0?

PROFESSOR: This is a good question. These don't appear to be mean-centered to me. So there's something weird on the plot.

AUDIENCE: So the line should be for a normal distribution, not fitting.

PROFESSOR: Yeah, this does not-- I think what has happened here is these are not quite mean-centered and normalized, because-- well, in terms of 0, 0 falling on the plot, that's not happening. So I'm a little-- I'm not sure exactly what's going on there.

AUDIENCE: Wouldn't the plotting function take the mean of the data, et cetera? It's centering the conceptual normal on the data mean.

PROFESSOR: Yes, it should. And that's what I'm saying--this plot, I don't think, is correctly mean-centered, because then 0, 0, by definition, has to fall on it.

AUDIENCE: Right.

PROFESSOR: Oh, that's what you're saying.

AUDIENCE: No, I was saying you could take the mean of the data, so that the normal you're plotting is aligned with that data.

PROFESSOR: Right. But I'm saying, here's my y data, and my zero mean is not-- I don't have any data lower than the mean, and therefore that doesn't make any sense. So this is not mean-centered correctly.

AUDIENCE: It looks to me like the mean of the data is at slightly less than 1 on the plot, so that coincides with the mean.

PROFESSOR: But if I mean-center and scale, then the mean of my data--by definition, that ought to be at 0, right?

AUDIENCE: Oh, I see.
I don't think you're shifting the data, though.

PROFESSOR: When you mean-center, yeah, you're shifting.

AUDIENCE: Oh, I think you're shifting, but conceptually, you're not shifting the data. You're shifting the normal that you're saying might correspond to the data.

PROFESSOR: No. In the standard q-q norm plot, you mean-center. You actually take your data, you mean-center it, you normalize it to the calculated sample standard deviation, and you plot that. And that does not look quite like what they've done here. I think these are still normalized to the standard deviation, but it's not quite mean-centered. But in some sense, that doesn't actually matter in terms of the data following along the line. It's still indicating; that would just be a shift.

AUDIENCE: You said the data hasn't been normalized, or hasn't been mean-centered. But if it's an exponential distribution, can you still normalize it?

PROFESSOR: In this first use of such a plot, you would be testing the question. You don't know yet that it's exponential--you just have data, and you're testing: does it fall on the normal line? So you would still follow that procedure. We'll look at an exponential distribution in a minute. And of course, every distribution has a mean, so you can always mean-center. Similarly, every distribution has a variance that you can calculate. The neat thing about the exponential is that the mean and the standard deviation are the same. But that's not entering in here; there's something else weird. So there's the risk of pulling a plot off at 9:50 at night--I hadn't noticed that it doesn't look correctly mean-centered. But the additional point I wanted to make is that I could actually take this same data and produce a different plot--not a normal q-q plot, but an exponential q-q plot.
And if I were doing that, what I would do is take my data, still plotted, hopefully mean-centered, in numbers of standard deviations away. But then along this axis, I would calculate the location in numbers of standard deviations based on the probability of an exponential distribution, not based on the probability of that index location in a normal distribution. So I would basically say, for my 50 data points, where do I expect the 25th data point larger than the mean to occur in that distribution? I have to go 2.1 normalized standard deviations away in order to get to that probability. So it takes my same y data, but it plots it so that, if it really is exponential, my data should follow along a one-to-one correspondence line.

So you don't often see these q-q plots used from the perspective of different distributions, but you can use them. What you often will see is really this: q-q norm plots. And they're lovely plots--a wonderful tool to use, because you're actually seeing all of your data. It's got all of your actual data; it's showing you that it corresponds roughly to a normal distribution; and it's also giving you very nice information about, essentially, your variance or standard deviation.

And there are variants of these plots that you will often see in the literature, especially the semiconductor literature, dealing with large numbers of samples coming from different kinds of measurements. So for example, if you want to make contact resistance measurements for literally thousands of contacts and present that data very succinctly, you will see families of q-q norm plots. Maybe you did a bunch of contacts at a particular size--you would plot them like this. And then maybe you had another data set, where you had attempted to pattern those contacts slightly larger or slightly smaller. And you would often see then another-- oops, that's not very straight, is it?
It's meant to be another underlying set of data. But you might find your data looking something like this. And that kind of plot is really useful for showing that there is a mean shift, a mean difference, between your data sets, but also that the variance is different in the two cases. Now, exactly what you're plotting here might be a little bit different. You might actually not plot quite normalized data--you might use it in an unnormalized fashion. Here, you might plot this not in terms of standard deviations, but rather keep it in the quantiles, or the probability of being that far away--the probability of that x value. So for example, you will often see these kinds of plots which would show things like 0.001, 0.01, 0.1, or something like that, getting up to--I guess 0.5 would be the equivalent for the mean--and then you start going larger: 0.9, 0.99, 0.999. In other words, you might actually plot--I should have put these on the x value--the probability that you would find a data point that far away, as opposed to implied probabilities in terms of numbers of standard deviations.

So there are some really cool variants of these plots that are very useful. And I think we'll see some of these when we talk a little bit about yield and some other distributions.

AUDIENCE: I have a question.

PROFESSOR: Yeah.

AUDIENCE: Yeah, after I have the q-q plot, how can I tell the confidence level that I have, to say whether or not my data is normally distributed?

PROFESSOR: So the q-q plot does not actually tell you confidence intervals, on either the hypothesis that it's normally distributed or on the parameter estimates. There are some formal statistical tests where you can test that hypothesis of normality.
And essentially, you can use those from your-- you're never going to hand-calculate some of those statistics and the probability associated with a derived statistic for normality; you'll use your statistics package for that. This gives you a good visual indication. But to actually test--is it normal, or what is the probability that the data is non-normal?--that's a different question. And then today, we will start talking about confidence intervals on the mean and the variance, which you also would not use the q-q norm plot to generate. So in fact, let's get to that, because that's-- yeah?

AUDIENCE: For that plot, can you use regression to see how far it is from the normal?

PROFESSOR: Well, first off, again, if you were actually trying to estimate the parameters of normality, you would just use the data and calculate the sample mean and sample standard deviation. I think what you are essentially posing here is: could I go in and look at these deviations and do some, I don't know, sum of squared values of those deviations? That's actually getting really close to calculating a statistic. Call it a W, some number--a W statistic that I would form based on the sum of squared deviations on one of these plots, or maybe a sum of absolute distance deviations. Now I've got a statistic W, and that's getting really close to the kinds of statistical tests that one would run to ask the question of normality. I don't actually know what formula is used in coming up with a W value, and what the normality tests are. But that's the kernel of the idea: you actually look at your data and form an aggregate value for that statistic, that W statistic. So for example, if it was a sum of absolute values, for a sample of size 50, and that W is very near 0, then you have high confidence that it's a normal distribution. But as W gets bigger, that would seem to indicate more and more likelihood that it's not normal. And that's exactly the kind of thing that's going on in the formal statistical tests for normality.
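The W described here is, in spirit, the Shapiro-Wilk test statistic, though the real test's convention differs from the hypothetical one above: Shapiro-Wilk's W is near 1 (not 0) for normal data, and in practice you read off the p-value. A minimal sketch, assuming SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Normal data: W close to 1, large p-value -> consistent with normality.
w, p = stats.shapiro(rng.normal(size=50))
print(w, p)

# Exponential data: smaller W, tiny p-value -> normality rejected.
w, p = stats.shapiro(rng.exponential(size=50))
print(w, p)
```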
So here, we've given you a few tools for being able to look at the data and get a feel for whether it's normal or not. But it hasn't answered the question: how come, so often, we're using a normal distribution when we're actually looking at manufacturing data or other kinds of experimental data? And a really important thing is the following observation--the following fact. Suppose we are forming a sum of independent observations of a random variable. So x has some underlying distribution, and it doesn't actually matter what that underlying distribution is. But I form n independent observations of that random variable, and then I look at the distribution of the sum--x1 plus x2 plus all n random variables. The fascinating fact is that the sum of independent random variables tends towards a normal distribution. This is the central limit theorem.

So here's a neat little example. Suppose my underlying distribution is in fact something like a uniform distribution, and I'm, say, pulling off samples of x1, samples of x2, and so on, each from a uniform distribution, and forming the sum of 20 of them--each one of these is, I guess, 1,000 points in this example. I essentially take the sum of all of these random variables and form a new random variable. The new random variable tends towards a normal distribution with some mean and variance.

Some of you I saw in 2853. And I had a nice link to a website--I'll actually dig that up and post it for this class. It's the SticiGui website, an interactive statistics package out of UC Berkeley. And it's really fun.
You can actually form these kinds of sums of random variables out of different underlying distributions, plot them, and start to see how close the sum, or the normalized sum, of these distributions is to a normal. So there are some very, very nice interactive tools that you can play with.

Now, an important point here: suppose I'm calculating the mean--so I'm calculating an x bar across my data. And I've got 100 samples, each drawn--and I'm assuming I'm drawing them from the same underlying distribution, whatever that may be. What is the distribution of the sample mean? Well, if you look at the formula for the sample mean, it's not exactly a sum of your data--it's the sum of your data, then divided by n. It's summed from i equals 1 up to whatever n is, over your individual samples. So it is a sum, with a constant out front. But the point is, by appealing to the central limit theorem, the sample mean distribution--the PDF associated with the sample mean--always tends towards the normal distribution.

So we're going to come back to this idea of sampling, and what the distribution is for sample statistics, a little bit later. But more generally, very often what we're doing is pulling data out of a process that is itself already, by the physics of the process, highly averaged. And therefore, it's averaging lots of perhaps other underlying strange or difficult physics. But in aggregate, that averaging nature of the data itself--not the operation that we perform, but each individual underlying data point, each individual x sub i--may have some averaging by the physics going on underneath it that will help drive it towards being a normal distribution itself.

So just to remind you, the central limit theorem is probably the most used, and perhaps most often abused, appeal to why we're using normal distributions very often. It is still good to test it. But there is a good reason why, very often, our data does come up as normal distributions.
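A minimal sketch of that central limit theorem demonstration, assuming NumPy and Matplotlib--summing 20 uniform random variables, 1,000 times over, and looking at the shape of the result:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# 1,000 realizations of the sum of 20 independent uniform variables.
# Each x_i is uniform (decidedly non-normal), yet the sum's histogram
# comes out bell-shaped, as the central limit theorem predicts.
sums = rng.uniform(0, 1, size=(1000, 20)).sum(axis=1)

plt.hist(sums, bins=30, density=True)
plt.title("Sum of 20 uniforms: approximately normal")
plt.show()
```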
So I want to talk a little bit now about sampling, because we are very often using actual measurements and data to try to get estimates for--or, more generally, to build a model of--our random process, and to estimate parameters of that random process. And we've said, in general, p sub x is unknown. Always plot your raw data, first and foremost. Very often, the raw data will suggest a distribution, or histograms may provide some insight. So for example, a very quick histogram will very often show you the difference between a normal distribution and a uniform distribution: if the data is evenly spread, and I don't have this falloff in the tails, that's very important. And then we can also use things like the q-q norm plot to test some of those things. So the first job is to come up with what likely distribution you want to use. Nine times out of 10, a normal distribution will be appropriate. And then the second thing is to estimate the parameters of the distribution.

And the normal distribution, again, to remind you, has just these two parameters, mean and variance. And now what we want to do is estimate them. Now, everybody is used to the formulas. We've got the formulas right here for calculating, from your sample--your limited number of pieces of data--what a few important statistics or characteristics of that data are, like the sample mean, or average, and the sample variance.
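Those are the usual estimators--x bar = (1/n) sum(x_i), and s^2 = (1/(n-1)) sum((x_i - x bar)^2), the n-1 denominator being the standard unbiased form for the sample variance. A quick sketch, assuming NumPy, checking the library calls against the formulas:

```python
import numpy as np

x = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0])  # hypothetical data
n = len(x)

xbar = x.sum() / n                    # sample mean
s2 = ((x - xbar)**2).sum() / (n - 1)  # sample variance, n-1 denominator

# The NumPy one-liners agree; ddof=1 selects the n-1 denominator.
assert np.isclose(xbar, x.mean())
assert np.isclose(s2, x.var(ddof=1))
print(xbar, s2)
```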
764 00:44:08,550 --> 00:44:11,930 They have sampling distributions 765 00:44:11,930 --> 00:44:15,560 that tell you the likelihood of observing 766 00:44:15,560 --> 00:44:17,660 particular values of them-- 767 00:44:17,660 --> 00:44:23,570 that establish bounds for, if I had a different sample, 768 00:44:23,570 --> 00:44:25,940 how close you think the new sample, 769 00:44:25,940 --> 00:44:28,910 still drawn from the underlying parent distribution, 770 00:44:28,910 --> 00:44:31,820 would actually lie to the particular sample 771 00:44:31,820 --> 00:44:33,820 that I just drew. 772 00:44:33,820 --> 00:44:35,570 So I'm going to explain that in a few more 773 00:44:35,570 --> 00:44:38,420 slides or several slides here. 774 00:44:38,420 --> 00:44:44,180 But the key idea is it's really easy to calculate 775 00:44:44,180 --> 00:44:46,080 a couple of these moments-- 776 00:44:46,080 --> 00:44:49,040 the mean and the variance. 777 00:44:49,040 --> 00:44:51,140 For the normal distribution, that 778 00:44:51,140 --> 00:44:56,790 tells you everything for an estimate of your raw data. 779 00:44:56,790 --> 00:44:58,760 But then I want to get to the more subtle idea 780 00:44:58,760 --> 00:45:00,468 so that we can start talking about things 781 00:45:00,468 --> 00:45:02,870 like confidence intervals. 782 00:45:02,870 --> 00:45:13,700 And a simple example to give you a little bit of a feel 783 00:45:13,700 --> 00:45:18,080 for this here is if I were to ask you 784 00:45:18,080 --> 00:45:21,350 what distribution applies to the sample 785 00:45:21,350 --> 00:45:26,770 mean, where does that come from? 786 00:45:26,770 --> 00:45:29,910 Where does this notion of a distribution associated 787 00:45:29,910 --> 00:45:33,130 with the sample mean arise? 788 00:45:33,130 --> 00:45:36,570 So if we look at the formula for the sample mean 789 00:45:36,570 --> 00:45:39,330 and expand it out, in some sense we've 790 00:45:39,330 --> 00:45:43,860 got just a sum of independent random variables, 791 00:45:43,860 --> 00:45:48,870 like we were talking about with the central limit theorem. 792 00:45:48,870 --> 00:45:50,710 There are different constants in here. 793 00:45:50,710 --> 00:45:53,220 And in this case, for the sample mean statistic, 794 00:45:53,220 --> 00:45:56,910 all of the constants are the same, which is just 1 795 00:45:56,910 --> 00:45:59,370 over the total number of data points or sample points 796 00:45:59,370 --> 00:46:00,630 that I've got. 797 00:46:00,630 --> 00:46:05,790 Now, you can go back to the definition of expectation 798 00:46:05,790 --> 00:46:09,180 that we talked about earlier and do the expectation 799 00:46:09,180 --> 00:46:13,870 operator across this and do expectation math. 800 00:46:13,870 --> 00:46:21,740 So the expectation of ax is equal to just that constant 801 00:46:21,740 --> 00:46:26,380 times the expectation of the underlying random variable. 802 00:46:26,380 --> 00:46:33,690 So the 1 over n simply comes out to the left. 803 00:46:33,690 --> 00:46:39,370 And if I were to ask, what is the mean of the PDF 804 00:46:39,710 --> 00:46:46,730 associated with x bar, it is going to be 1 over n times 805 00:46:46,730 --> 00:46:49,810 n times the mean-- the same mean. 806 00:46:49,810 --> 00:46:51,610 Now what else is going on here is 807 00:46:51,610 --> 00:46:55,710 if you look at the standard deviation of x bar-- 808 00:46:55,710 --> 00:46:57,010 I hope you guys can see that. 809 00:46:57,010 --> 00:47:00,580 There's a variance of x bar in here.
810 00:47:00,580 --> 00:47:05,360 So that's an x and a bar, which I just-- 811 00:47:05,360 --> 00:47:10,080 the pen doesn't line up exactly with the screen. 812 00:47:10,080 --> 00:47:14,310 You can also do the expectation operator for-- 813 00:47:14,310 --> 00:47:18,790 oops, not the expectation, but the variance operator. 814 00:47:18,790 --> 00:47:24,210 And if you do the mathematics on variance of some ax, 815 00:47:24,210 --> 00:47:30,950 that's equal to a squared times the variance of the underlying 816 00:47:30,950 --> 00:47:32,280 variable. 817 00:47:32,280 --> 00:47:35,720 And if you follow that math through for the definition 818 00:47:35,720 --> 00:47:39,260 of x bar and relate that to the variance 819 00:47:39,260 --> 00:47:45,140 of each of these x sub i's, what you find is that the variance-- 820 00:47:45,140 --> 00:47:47,690 I get an n times-- 821 00:47:50,550 --> 00:47:53,170 I'm summing n of these random variables. 822 00:47:53,170 --> 00:47:57,090 So I've got n times-- 823 00:47:57,090 --> 00:48:01,340 1 over n is the constant in here. 824 00:48:01,340 --> 00:48:04,400 So I get an n times 1 over n squared times the underlying 825 00:48:04,400 --> 00:48:07,150 variance of my x. 826 00:48:07,150 --> 00:48:11,260 So I get a cancellation, and the variance then 827 00:48:11,260 --> 00:48:16,235 of my x bar is just equal to what I've shown here, 828 00:48:16,235 --> 00:48:20,810 1 over n times the variance of the underlying distribution. 829 00:48:20,810 --> 00:48:24,220 So what's interesting here is if I start 830 00:48:24,220 --> 00:48:27,280 to ask about the distributions associated 831 00:48:27,280 --> 00:48:30,250 with what are the mean and the variance 832 00:48:30,250 --> 00:48:34,060 of the normal distribution associated with x bar-- 833 00:48:36,990 --> 00:48:40,380 what is the mean of an x bar that I would typically 834 00:48:40,380 --> 00:48:43,260 observe from lots of samples of my underlying distribution? 835 00:48:43,260 --> 00:48:45,300 What is a variance I would observe? 836 00:48:45,300 --> 00:48:49,350 It's related to the underlying distribution, 837 00:48:49,350 --> 00:48:51,250 but it's not exactly the same. 838 00:48:51,250 --> 00:48:54,330 I've got a new random variable, an x bar, 839 00:48:54,330 --> 00:48:57,010 that has a different mean and variance. 840 00:48:57,010 --> 00:49:01,720 It's got the same mean in this case, 841 00:49:01,720 --> 00:49:04,480 but the variance is actually scaled. 842 00:49:04,480 --> 00:49:07,330 And this is extremely useful because the variance 843 00:49:07,330 --> 00:49:12,430 of my averaging means that I'm getting a tighter 844 00:49:12,430 --> 00:49:13,720 distribution-- 845 00:49:13,720 --> 00:49:20,080 a narrower or smaller variance compared 846 00:49:20,080 --> 00:49:22,030 to the underlying distribution. 847 00:49:22,030 --> 00:49:24,280 I'm going to show you that in a little bit more 848 00:49:24,280 --> 00:49:29,590 of a graphical fashion a little bit later because this is-- 849 00:49:29,590 --> 00:49:33,020 that's a preview to this whole idea of sampling, 850 00:49:33,020 --> 00:49:36,180 which is really critical. 851 00:49:36,180 --> 00:49:38,770 We've already talked about this.
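[A numerical check on that algebra, as a sketch assuming NumPy; the parent variance and sample size here are arbitrary illustrations. The empirical variance of the sample mean lands at sigma squared over n, exactly as the n times 1-over-n-squared cancellation predicts:]

```python
import numpy as np

rng = np.random.default_rng(1)

n = 25        # sample size (illustrative)
sigma2 = 4.0  # parent variance (illustrative)

# 20,000 independent samples of size n from a normal parent.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(20_000, n))
xbar = samples.mean(axis=1)

print("parent variance:     ", sigma2)
print("empirical Var(x-bar):", xbar.var())  # ~0.16
print("theory, sigma2 / n:  ", sigma2 / n)  # n * (1/n)^2 * sigma2 = 0.16
```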
852 00:49:38,770 --> 00:49:47,610 So the key thing here is to get to this notion of sampling 853 00:49:47,610 --> 00:49:50,250 distributions, what are the key distributions arising 854 00:49:50,250 --> 00:49:53,970 from the fact that I'm drawing multiple pieces of data 855 00:49:53,970 --> 00:49:56,160 from a parent distribution, and then 856 00:49:56,160 --> 00:49:58,240 calculating things about that? 857 00:49:58,240 --> 00:50:01,380 So we'll get to some of these key distributions 858 00:50:01,380 --> 00:50:03,360 besides the normal distribution. 859 00:50:03,360 --> 00:50:06,820 We'll actually talk about these next class. 860 00:50:06,820 --> 00:50:11,580 But what we want to do is go back and get a little bit more 861 00:50:11,580 --> 00:50:13,890 feel for not only the normal distribution, 862 00:50:13,890 --> 00:50:15,330 but a few other distributions that 863 00:50:15,330 --> 00:50:18,330 often arise in manufacturing, and then also start 864 00:50:18,330 --> 00:50:22,860 talking about these notions of where the data actually lies. 865 00:50:22,860 --> 00:50:26,460 What are the probabilities of data falling out in the tails? 866 00:50:26,460 --> 00:50:28,470 And using that then to start to get 867 00:50:28,470 --> 00:50:31,170 towards the idea of building confidence intervals 868 00:50:31,170 --> 00:50:34,230 and where we think the real mean of our underlying parent 869 00:50:34,230 --> 00:50:36,060 distribution sits. 870 00:50:36,060 --> 00:50:39,180 Next class, we'll also get to hypothesis tests, 871 00:50:39,180 --> 00:50:42,180 which arise naturally and actually start 872 00:50:42,180 --> 00:50:46,020 to get really close to statistical process control 873 00:50:46,020 --> 00:50:51,570 charting, which is one of the fundamental tools 874 00:50:51,570 --> 00:50:54,070 of manufacturing control. 875 00:50:54,070 --> 00:50:56,620 So what I'm going to do here is go back-- 876 00:50:56,620 --> 00:51:00,400 this is the plan for the next-- 877 00:51:00,400 --> 00:51:02,945 the rest of today and starting into tomorrow. 878 00:51:02,945 --> 00:51:04,570 We're going to go back, just remind you 879 00:51:04,570 --> 00:51:07,300 of some of the discrete variable distributions, 880 00:51:07,300 --> 00:51:09,520 then talk about some of those, 881 00:51:09,520 --> 00:51:13,930 which are more applicable to attribute modeling or yield 882 00:51:13,930 --> 00:51:15,570 modeling, sort of discrete things. 883 00:51:15,570 --> 00:51:17,320 Then we'll come back and talk a little bit 884 00:51:17,320 --> 00:51:20,500 about the continuous distributions, 885 00:51:20,500 --> 00:51:26,130 and then also touch on how you manipulate 886 00:51:26,130 --> 00:51:27,480 some of these distributions. 887 00:51:31,670 --> 00:51:36,180 Discrete distributions-- has everyone seen the Bernoulli distribution 888 00:51:36,180 --> 00:51:36,680 before? 889 00:51:39,500 --> 00:51:40,280 Good. 890 00:51:40,280 --> 00:51:43,920 This is like the simplest distribution-- 891 00:51:43,920 --> 00:51:46,980 the very simplest. 892 00:51:46,980 --> 00:51:48,300 You do a trial. 893 00:51:48,300 --> 00:51:49,290 You do an experiment. 894 00:51:49,290 --> 00:51:53,000 It can only have two outcomes, success or failure. 895 00:51:53,000 --> 00:51:55,910 You get to label what success is. 896 00:51:55,910 --> 00:51:59,210 We'll label a success with the random variable 897 00:51:59,210 --> 00:52:03,500 taking on the value of 1 and failure taking on 0. 898 00:52:03,500 --> 00:52:04,970 I could flip that.
899 00:52:04,970 --> 00:52:07,730 You can start to see already a little bit of inkling 900 00:52:07,730 --> 00:52:09,830 of yield in here. 901 00:52:09,830 --> 00:52:11,720 Does the thing work or not? 902 00:52:11,720 --> 00:52:14,660 The very simplest, coarsest, crudest kind 903 00:52:14,660 --> 00:52:18,710 of model for functionality, and the probability or statistics 904 00:52:18,710 --> 00:52:22,700 associated with that is, does the thing work or not? 905 00:52:22,700 --> 00:52:27,230 And often, we talk about what is the probability 906 00:52:27,230 --> 00:52:31,100 that the thing is functioning at the end of the line? 907 00:52:31,100 --> 00:52:33,980 Maybe that's 0.95. 908 00:52:33,980 --> 00:52:37,880 So 95% of the time, I think I've got yielding parts out. 909 00:52:37,880 --> 00:52:43,050 For any one experiment, one outcome, 910 00:52:43,050 --> 00:52:48,030 I've simply got a p and 1 minus p probability associated 911 00:52:48,030 --> 00:52:48,550 with that. 912 00:52:48,550 --> 00:52:52,510 And the PDF can be expressed as shown here. 913 00:52:52,510 --> 00:52:55,110 Now we can go in and use our expectation operations 914 00:52:55,110 --> 00:52:58,740 for discrete random variables and calculate what 915 00:52:58,740 --> 00:53:00,150 the mean and the variance are. 916 00:53:00,150 --> 00:53:03,570 And those have nice, closed-form expressions 917 00:53:03,570 --> 00:53:05,085 for those two outcomes. 918 00:53:07,660 --> 00:53:09,420 So that's the Bernoulli. 919 00:53:09,420 --> 00:53:11,340 Now the second easiest-- 920 00:53:11,340 --> 00:53:14,390 although it can actually look a little 921 00:53:14,390 --> 00:53:15,860 confusing at first glance. 922 00:53:15,860 --> 00:53:17,720 But the second easiest distribution 923 00:53:17,720 --> 00:53:20,060 is the binomial distribution because it's 924 00:53:20,060 --> 00:53:22,340 saying that I'm simply taking that success 925 00:53:22,340 --> 00:53:25,970 or failure with a fixed probability p 926 00:53:25,970 --> 00:53:28,740 and running repeated trials of that. 927 00:53:28,740 --> 00:53:33,740 So now I'm flipping my coin, say, which has-- 928 00:53:33,740 --> 00:53:35,870 perhaps it's a weighted coin, and it comes up 929 00:53:35,870 --> 00:53:39,110 heads with probability p that's not 0.5. 930 00:53:39,110 --> 00:53:42,680 Maybe it's 0.9. 931 00:53:42,680 --> 00:53:44,810 But now I'm doing that repeated times. 932 00:53:44,810 --> 00:53:47,660 I'm doing that n times. 933 00:53:47,660 --> 00:53:51,260 Now what's the probability of having n successes? 934 00:53:55,260 --> 00:53:58,680 Or let me state that again. 935 00:53:58,680 --> 00:54:01,770 What's the probability of having x successes 936 00:54:01,770 --> 00:54:04,300 when I ran n repeated trials? 937 00:54:04,300 --> 00:54:05,940 So n is the number of trials. 938 00:54:09,260 --> 00:54:11,560 So if I ran 100 trials, the probability 939 00:54:11,560 --> 00:54:16,660 that I had exactly x equal to 7 successes 940 00:54:16,660 --> 00:54:18,490 is given by this formula, here. 941 00:54:18,490 --> 00:54:21,110 And you can actually see this lurking in here. 942 00:54:21,110 --> 00:54:23,710 How do I have 7 successes? 943 00:54:23,710 --> 00:54:27,120 Well, that meant p, the probability 944 00:54:27,120 --> 00:54:30,960 of having a success, had to come up exactly 7 times. 945 00:54:30,960 --> 00:54:32,935 And the rest of the times-- 946 00:54:32,935 --> 00:54:36,780 if I was running 100 trials, the other 93 trials 947 00:54:36,780 --> 00:54:39,610 all had to be failures.
948 00:54:39,610 --> 00:54:42,640 So I've simply got the product of all of those probabilities. 949 00:54:42,640 --> 00:54:44,790 And then we've got the combinatorics, 950 00:54:44,790 --> 00:54:49,020 the n choose x, which tells me how many different orderings 951 00:54:49,020 --> 00:54:53,850 could have occurred by which I would get the 7 successes 952 00:54:53,850 --> 00:54:58,220 and 93 failures for n equals 100. 953 00:54:58,220 --> 00:55:02,600 So that's simply the different numbers of combinations 954 00:55:02,600 --> 00:55:04,610 that can come up with that. 955 00:55:04,610 --> 00:55:08,830 So the notation here, by the way, that we would often use-- 956 00:55:08,830 --> 00:55:12,190 and I already snuck it in some other places-- 957 00:55:12,190 --> 00:55:18,520 is this little tilde symbol here we're 958 00:55:18,520 --> 00:55:24,340 using to read as "is distributed as some distribution." 959 00:55:24,340 --> 00:55:27,400 And I'm using the big B to indicate the binomial 960 00:55:27,400 --> 00:55:30,010 distribution, which has associated with it 961 00:55:30,010 --> 00:55:34,000 the underlying Bernoulli probability-- 962 00:55:34,000 --> 00:55:35,860 success for any one trial-- 963 00:55:35,860 --> 00:55:39,940 and then the number of repeated trials. 964 00:55:42,600 --> 00:55:45,580 So this is a discrete probability. 965 00:55:45,580 --> 00:55:51,160 What's the probability that x could take on 0.7? 966 00:55:51,160 --> 00:55:51,960 0, right? 967 00:55:51,960 --> 00:55:56,380 It's the number of successes out of this. 968 00:55:56,380 --> 00:55:58,330 And here are some examples that just give you 969 00:55:58,330 --> 00:56:00,640 a little bit of a feel for what the binomial 970 00:56:00,640 --> 00:56:04,180 distribution looks like. 971 00:56:04,180 --> 00:56:07,060 This is the number of successes plotted 972 00:56:07,060 --> 00:56:10,600 as a histogram for some values. 973 00:56:13,560 --> 00:56:15,250 I think that this is-- 974 00:56:15,250 --> 00:56:19,047 if you try it, I think this is a live spreadsheet. 975 00:56:19,047 --> 00:56:21,630 So actually, if you double-click on this from your PowerPoint, 976 00:56:21,630 --> 00:56:27,660 it may bring up the underlying Excel spreadsheet. 977 00:56:27,660 --> 00:56:30,600 So you can actually play with some of the parameters in this. 978 00:56:30,600 --> 00:56:34,960 I don't remember what either p or n was for this. 979 00:56:34,960 --> 00:56:37,080 But you can start to see, it's really-- 980 00:56:37,080 --> 00:56:45,000 it does not look quite normal because you can never have 981 00:56:45,000 --> 00:56:47,620 negative numbers of successes. 982 00:56:47,620 --> 00:56:50,400 It's always truncated. 983 00:56:50,400 --> 00:56:55,770 And you get these very non-normal kinds 984 00:56:55,770 --> 00:56:56,860 of distributions. 985 00:56:56,860 --> 00:56:58,930 This is a binomial distribution. 986 00:56:58,930 --> 00:57:01,890 But its location and its shape can change somewhat 987 00:57:01,890 --> 00:57:04,810 as you play with p and n. 988 00:57:04,810 --> 00:57:10,450 By the way, up here-- this is just the cumulative probability 989 00:57:10,450 --> 00:57:13,710 function, just saying the probability 990 00:57:13,710 --> 00:57:19,750 that I've got x less than or equal to some value. 991 00:57:19,750 --> 00:57:22,170 So that's also shown. 992 00:57:22,170 --> 00:57:25,260 This is also shown in the histogram, normalized 993 00:57:25,260 --> 00:57:29,250 to the fraction of products.
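[The 7-successes-in-100-trials number above is easy to reproduce directly from the n-choose-x formula; here is a small sketch assuming SciPy is available. The success probability p = 0.1 is an assumed illustration, since the slides' actual p and n are not recorded in the transcript:]

```python
from math import comb

from scipy import stats

n, x, p = 100, 7, 0.1  # trials, successes, success probability (p assumed)

# Binomial PMF by hand: C(n, x) * p^x * (1 - p)^(n - x)
pmf = comb(n, x) * p**x * (1 - p) ** (n - x)
print(pmf)                       # ~0.0889
print(stats.binom.pmf(x, n, p))  # same value from the library

# The cumulative curve mentioned above is P(X <= x):
print(stats.binom.cdf(x, n, p))  # ~0.206
```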
994 00:57:29,250 --> 00:57:33,650 And so now, you can start to look at calculating. 995 00:57:33,650 --> 00:57:36,365 If this were my data, and I simply-- 996 00:57:36,365 --> 00:57:38,390 it was actually coming from a line 997 00:57:38,390 --> 00:57:44,030 where I was looking at the probability of any one part 998 00:57:44,030 --> 00:57:47,000 succeeding or not, I could start to ask questions 999 00:57:47,000 --> 00:57:52,440 about the probability of seeing, out of 1,000 products 1000 00:57:52,440 --> 00:57:55,680 coming off the line, some number of defects 1001 00:57:55,680 --> 00:57:57,720 or some number of failed products. 1002 00:57:57,720 --> 00:58:01,628 You can appeal to the binomial distribution for that. 1003 00:58:01,628 --> 00:58:03,420 Now this is all still pretty coarse, right? 1004 00:58:03,420 --> 00:58:05,700 It's just a very simplified model-- 1005 00:58:05,700 --> 00:58:10,420 failure or success for yield. 1006 00:58:10,420 --> 00:58:12,460 Now another discrete distribution 1007 00:58:12,460 --> 00:58:20,170 is a Poisson distribution or also sometimes referred 1008 00:58:20,170 --> 00:58:24,160 to as an exponential distribution, 1009 00:58:24,160 --> 00:58:28,060 although terminology there sometimes varies, 1010 00:58:28,060 --> 00:58:33,870 depending on whether people are including this component 1011 00:58:33,870 --> 00:58:34,800 or not. 1012 00:58:34,800 --> 00:58:39,330 But the formal definition for the Poisson distribution 1013 00:58:39,330 --> 00:58:40,330 is shown here. 1014 00:58:40,330 --> 00:58:43,300 Now it continues to be a discrete distribution. 1015 00:58:43,300 --> 00:58:45,570 So I'm asking, what is the probability associated 1016 00:58:45,570 --> 00:58:52,810 with observing x taking on actual discrete integer values? 1017 00:58:52,810 --> 00:59:03,800 But this is a very nice distribution associated with 1018 00:59:03,800 --> 00:59:09,320 the kinds of operations that many of you saw in 2.850 or 2.8-- 1019 00:59:09,320 --> 00:59:11,600 yeah, 2.853. 1020 00:59:11,600 --> 00:59:14,900 The arrival times in queuing networks 1021 00:59:14,900 --> 00:59:18,380 will often be Poisson distributed. 1022 00:59:18,380 --> 00:59:21,980 But it also can come up when we are dealing 1023 00:59:21,980 --> 00:59:25,640 with very large numbers associated 1024 00:59:25,640 --> 00:59:30,290 with the binomial distribution as a very good approximation 1025 00:59:30,290 --> 00:59:32,670 to the binomial. 1026 00:59:32,670 --> 00:59:34,320 And this turns out to be really nice, 1027 00:59:34,320 --> 00:59:37,200 because if you actually go back to the binomial formula 1028 00:59:37,200 --> 00:59:43,990 and try to calculate it for situations where, say, n 1029 00:59:43,990 --> 00:59:48,640 or x are very, very large, or p or 1 minus 1030 00:59:48,640 --> 00:59:50,440 p is very, very small or very large, 1031 00:59:50,440 --> 00:59:53,170 very close to either 0 or 1, you end up 1032 00:59:53,170 --> 00:59:57,220 with some problems, some numerical problems. 1033 00:59:57,220 --> 01:00:01,120 Because if you actually try to calculate it for, let's say, 1034 01:00:01,120 --> 01:00:08,170 p is equal to 0.0001, or maybe 1 minus p is equal to that. 1035 01:00:08,170 --> 01:00:10,540 Let's say you had really, really high yield. 1036 01:00:14,420 --> 01:00:17,030 And I take that, so if that's 1 minus p-- 1037 01:00:17,030 --> 01:00:20,330 and I'm doing this for a sample of size a million. 1038 01:00:20,330 --> 01:00:25,410 I've got 0.0001 to the one millionth power.
1039 01:00:25,410 --> 01:00:29,220 And numerically, you start losing the digits. 1040 01:00:29,220 --> 01:00:31,660 You can hardly keep track of that. 1041 01:00:31,660 --> 01:00:33,990 But I might be asking, what is the probability 1042 01:00:33,990 --> 01:00:36,910 of some substantial number of failures? 1043 01:00:36,910 --> 01:00:39,180 And this, the combinatorics, end up 1044 01:00:39,180 --> 01:00:41,470 being a really, really large number. 1045 01:00:41,470 --> 01:00:46,530 So overall, the overall probability 1046 01:00:46,530 --> 01:00:50,760 of seeing 10 failures out of a million parts 1047 01:00:50,760 --> 01:00:52,570 might be substantial. 1048 01:00:52,570 --> 01:00:56,250 But to calculate it, you can't do it numerically, 1049 01:00:56,250 --> 01:00:59,490 because I've got a huge number times a really small number. 1050 01:00:59,490 --> 01:01:01,470 I get overflow or underflow. 1051 01:01:01,470 --> 01:01:04,470 And I can't actually calculate it. 1052 01:01:04,470 --> 01:01:10,080 What's useful is that in those kinds of situations, where, 1053 01:01:10,080 --> 01:01:15,260 say, n and p together-- the product of those things-- 1054 01:01:15,260 --> 01:01:17,750 are reasonable-size numbers, then 1055 01:01:17,750 --> 01:01:21,120 the Poisson distribution is a very, very good approximation. 1056 01:01:21,120 --> 01:01:25,730 And this applies to things where you have very, say, 1057 01:01:25,730 --> 01:01:28,240 low probability. 1058 01:01:28,240 --> 01:01:30,410 So p might be very small. 1059 01:01:30,410 --> 01:01:36,650 But I'm asking-- or I have many, many opportunities 1060 01:01:36,650 --> 01:01:41,810 to observe that very low-likelihood event. 1061 01:01:41,810 --> 01:01:45,290 So an example here that comes up in semiconductor manufacturing 1062 01:01:45,290 --> 01:01:51,630 are things like the probability of observing some number 1063 01:01:51,630 --> 01:01:53,340 of defects on a wafer. 1064 01:01:53,340 --> 01:01:56,610 The likelihood of seeing a point defect on any one location 1065 01:01:56,610 --> 01:01:58,470 is very, very, very small. 1066 01:01:58,470 --> 01:02:00,600 But I've got lots and lots of area 1067 01:02:00,600 --> 01:02:02,970 on the wafer-- lots and lots of opportunity 1068 01:02:02,970 --> 01:02:05,610 for the appearance of that small defect. 1069 01:02:05,610 --> 01:02:10,620 And so you can start to talk about the product 1070 01:02:10,620 --> 01:02:14,370 of those things or a rate per unit area 1071 01:02:14,370 --> 01:02:16,470 that starts to become reasonable. 1072 01:02:20,170 --> 01:02:24,010 Another example is the number of misprints on a page of a book. 1073 01:02:24,010 --> 01:02:26,560 You don't expect any one character 1074 01:02:26,560 --> 01:02:31,720 in a book to actually be a misprint. 1075 01:02:31,720 --> 01:02:35,260 But over the entire aggregate number of pages in your book, 1076 01:02:35,260 --> 01:02:37,180 you expect some number of misprints. 1077 01:02:37,180 --> 01:02:39,580 And the statistics that go with that 1078 01:02:39,580 --> 01:02:42,112 are typically Poisson distributed. 1079 01:02:42,112 --> 01:02:44,580 And I already mentioned that the mean and the variance, 1080 01:02:44,580 --> 01:02:48,150 if you actually apply those formulas to this distribution, 1081 01:02:48,150 --> 01:02:50,490 come out to the fascinating fact that they 1082 01:02:50,490 --> 01:02:52,950 are numerically the same value. 1083 01:02:52,950 --> 01:02:56,130 By the way, units-wise, they're not.
1084 01:02:56,130 --> 01:03:00,360 But x is an integer and-- 1085 01:03:00,360 --> 01:03:01,500 oops. 1086 01:03:01,500 --> 01:03:03,640 That should be x, by the way. 1087 01:03:03,640 --> 01:03:04,240 Come on. 1088 01:03:04,240 --> 01:03:04,830 Cut that out. 1089 01:03:10,710 --> 01:03:11,240 There we go. 1090 01:03:13,840 --> 01:03:16,660 So here are some example Poisson distributions. 1091 01:03:16,660 --> 01:03:20,410 You can start to see one here for a mean of 5. 1092 01:03:20,410 --> 01:03:23,830 It looks close to the binomial distribution 1093 01:03:23,830 --> 01:03:25,360 that I showed you earlier. 1094 01:03:25,360 --> 01:03:29,570 And then as the mean here, the lambda parameter, 1095 01:03:29,570 --> 01:03:31,340 is increasing, you can 1096 01:03:31,340 --> 01:03:36,560 start to see this distribution shifting to the right. 1097 01:03:36,560 --> 01:03:38,660 We said lambda is the mean. 1098 01:03:38,660 --> 01:03:43,280 It's also a characteristic of the variance. 1099 01:03:43,280 --> 01:03:48,690 The variance is also equal to lambda. 1100 01:03:48,690 --> 01:03:55,170 So that will also broaden out for larger 1101 01:03:55,170 --> 01:03:58,280 values of lambda. 1102 01:03:58,280 --> 01:04:02,300 There's another observation in here which is useful. 1103 01:04:02,300 --> 01:04:04,610 What are they starting to look like for large lambdas? 1104 01:04:08,342 --> 01:04:09,050 AUDIENCE: Normal. 1105 01:04:09,050 --> 01:04:10,130 PROFESSOR: Normal, right. 1106 01:04:10,130 --> 01:04:12,260 If you looked at that, it doesn't look very normally 1107 01:04:12,260 --> 01:04:12,830 distributed. 1108 01:04:12,830 --> 01:04:14,000 It's truncated. 1109 01:04:14,000 --> 01:04:15,920 It's a little bit skewed. 1110 01:04:15,920 --> 01:04:23,270 But another approximation is for large lambda, that also tends 1111 01:04:23,270 --> 01:04:25,560 towards a normal distribution. 1112 01:04:25,560 --> 01:04:28,640 So very often, you've got this succession 1113 01:04:28,640 --> 01:04:31,340 of approximations, where you might take a binomial, 1114 01:04:31,340 --> 01:04:32,960 approximate it as a Poisson. 1115 01:04:32,960 --> 01:04:37,280 But then for large numbers, a normal distribution also 1116 01:04:37,280 --> 01:04:42,130 can be a useful approximation. 
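[That succession of approximations can be checked numerically; a sketch assuming SciPy, with an illustrative million-part lot and a one-in-100,000 failure probability, so that lambda = n p = 10. The Poisson PMF matches the binomial, and the library sidesteps the overflow/underflow problem by working with logarithms internally:]

```python
from scipy import stats

n, p = 1_000_000, 1e-5  # many opportunities, tiny probability (illustrative)
lam = n * p             # lambda = 10 expected failures

x = 10                  # probability of exactly 10 failures
print(stats.binom.pmf(x, n, p))   # exact binomial:        ~0.125110
print(stats.poisson.pmf(x, lam))  # Poisson approximation: ~0.125110

# And for large lambda, the Poisson itself tends towards a normal
# with mean lambda and variance lambda:
big = 400
print(stats.poisson.pmf(big, big))                     # ~0.01994
print(stats.norm.pdf(big, loc=big, scale=big ** 0.5))  # ~0.01995
```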
1133 01:05:31,380 --> 01:05:36,060 But I'm highlighting the uniform distribution 1134 01:05:36,060 --> 01:05:38,850 because there are a couple of very standard questions, 1135 01:05:38,850 --> 01:05:42,975 that if you have a known PDF or CDF, 1136 01:05:42,975 --> 01:05:45,600 these are the kinds of questions that you're going to be asking 1137 01:05:45,600 --> 01:05:47,310 again and again and again. 1138 01:05:47,310 --> 01:05:48,780 And they're nice and intuitive off 1139 01:05:48,780 --> 01:05:51,390 of the uniform distribution. 1140 01:05:51,390 --> 01:05:54,400 When we get to the normal and other distributions, 1141 01:05:54,400 --> 01:05:56,580 they're not quite as intuitive. 1142 01:05:56,580 --> 01:06:01,530 But seeing them here for the uniform first, I think, helps. 1143 01:06:01,530 --> 01:06:03,480 One of the typical kinds of questions 1144 01:06:03,480 --> 01:06:08,760 is I want to know, what is the probability that some x is 1145 01:06:08,760 --> 01:06:12,330 less than or equal to some value if I were to draw it 1146 01:06:12,330 --> 01:06:15,350 from this underlying distribution-- 1147 01:06:15,350 --> 01:06:17,160 here, from the uniform distribution? 1148 01:06:17,160 --> 01:06:21,890 And so one could ask that using either 1149 01:06:21,890 --> 01:06:26,480 the PDF or the Cumulative Density Function. 1150 01:06:26,480 --> 01:06:28,430 And sometimes, one or the other, if they're 1151 01:06:28,430 --> 01:06:33,020 tabulated or available to you, is easier to use. 1152 01:06:33,020 --> 01:06:36,770 Clearly, if this is a Probability Density Function 1153 01:06:36,770 --> 01:06:41,390 here, I can ask it in terms of the interval question. 1154 01:06:41,390 --> 01:06:46,050 Oops, excuse me-- the interval question right here, and say, 1155 01:06:46,050 --> 01:06:53,410 well, the probability that x is less than or equal to that x1 1156 01:06:53,410 --> 01:06:56,950 is simply the integration up of that probability. 1157 01:06:56,950 --> 01:06:59,020 And you can do that numerically or just 1158 01:06:59,020 --> 01:07:01,990 by hand on such a simple distribution. 1159 01:07:01,990 --> 01:07:06,070 But that is actually exactly the value that 1160 01:07:06,070 --> 01:07:08,860 is tabulated on the Cumulative Density Function. 1161 01:07:08,860 --> 01:07:12,140 That's the definition of the Cumulative Density Function. 1162 01:07:12,140 --> 01:07:17,060 So if you've got the CDF, you simply look it up and say, 1163 01:07:17,060 --> 01:07:23,440 f of x1 is equal to whatever your value is 1164 01:07:23,440 --> 01:07:28,990 for that probability function. 1165 01:07:28,990 --> 01:07:32,180 Now similarly, you can also ask the question, 1166 01:07:32,180 --> 01:07:35,770 what is the probability that x sits within some range, 1167 01:07:35,770 --> 01:07:39,890 say, between x1 and x2? 1168 01:07:39,890 --> 01:07:41,510 And again, you can do that either 1169 01:07:41,510 --> 01:07:45,620 off of the underlying density function, just 1170 01:07:45,620 --> 01:07:47,510 integrating and saying, yes, x has 1171 01:07:47,510 --> 01:07:51,560 to lie between those values, and integrate up the density. 1172 01:07:51,560 --> 01:07:56,840 Or you can recognize that the probability that x 1173 01:07:56,840 --> 01:08:00,770 is less than x2 is simply that value 1174 01:08:00,770 --> 01:08:05,500 and subtract off that the probability that x 1175 01:08:05,500 --> 01:08:08,410 was less than x1 is that.
1176 01:08:08,410 --> 01:08:11,650 And so therefore, the difference between those two 1177 01:08:11,650 --> 01:08:16,899 corresponds to the integration on the underlying Probability 1178 01:08:16,899 --> 01:08:19,460 Density Function. 1179 01:08:19,460 --> 01:08:20,600 So that's pretty easy. 1180 01:08:20,600 --> 01:08:22,590 That should be pretty clear. 1181 01:08:22,590 --> 01:08:25,880 Let's talk about that also for the normal distribution 1182 01:08:25,880 --> 01:08:28,640 because some of those values are not 1183 01:08:28,640 --> 01:08:30,960 as easy to integrate up by hand. 1184 01:08:30,960 --> 01:08:34,160 In fact, there exist no closed-form formulas. 1185 01:08:34,160 --> 01:08:36,050 But they are tabulated for you. 1186 01:08:36,050 --> 01:08:39,149 And that's where going to the table 1187 01:08:39,149 --> 01:08:41,880 on the normal distribution for things like f of x 1188 01:08:41,880 --> 01:08:42,750 are going to-- 1189 01:08:42,750 --> 01:08:46,140 is an operation that you will actually perform quite a bit 1190 01:08:46,140 --> 01:08:50,160 when you're manipulating normal distributions. 1191 01:08:50,160 --> 01:08:51,560 So here's another plot. 1192 01:08:51,560 --> 01:08:54,979 We've already talked, or I've shown other examples here 1193 01:08:54,979 --> 01:08:56,600 of the normal distribution. 1194 01:08:56,600 --> 01:08:59,390 I've tagged off on this plot for us 1195 01:08:59,390 --> 01:09:03,170 a few useful little numbers to have as rules of thumb. 1196 01:09:03,170 --> 01:09:08,149 This is actually, I think, a moderately useful page 1197 01:09:08,149 --> 01:09:13,069 to print out and have off on the side for your use. 1198 01:09:13,069 --> 01:09:15,529 In particular, what I'm showing here 1199 01:09:15,529 --> 01:09:20,180 is for the normal distribution you've got a formula. 1200 01:09:20,180 --> 01:09:21,965 You're hardly ever going to actually plug 1201 01:09:21,965 --> 01:09:23,870 in values for the formula. 1202 01:09:23,870 --> 01:09:27,260 But if you look out plus 1 standard deviation, 1203 01:09:27,260 --> 01:09:30,439 plus 2 standard deviation, on the PDF, 1204 01:09:30,439 --> 01:09:33,080 I've tried to indicate here how rapidly 1205 01:09:33,080 --> 01:09:37,140 the value of that probability density falls off. 1206 01:09:37,140 --> 01:09:41,390 So for example, at one standard deviation, I'm at about 60% 1207 01:09:41,390 --> 01:09:42,290 of the peak. 1208 01:09:42,290 --> 01:09:45,740 At two standard deviations, I'm down to about 13.5% 1209 01:09:45,740 --> 01:09:48,029 of the peak. 1210 01:09:48,029 --> 01:09:53,100 Now rather than asking what the relative probabilities 1211 01:09:53,100 --> 01:09:58,930 of these things are, you're actually more often asking, what is-- 1212 01:09:58,930 --> 01:10:01,450 how much-- what is the integrated probability 1213 01:10:01,450 --> 01:10:07,120 density of the random variable out in some tail 1214 01:10:07,120 --> 01:10:09,340 or in some central region? 1215 01:10:09,340 --> 01:10:12,670 And that's where the Cumulative Density Function is really 1216 01:10:12,670 --> 01:10:14,780 the one that you want to use. 1217 01:10:14,780 --> 01:10:20,550 And so what I'm showing here is out for some number 1218 01:10:20,550 --> 01:10:25,170 of standard deviations-- this is mu minus 3 standard deviation. 1219 01:10:25,170 --> 01:10:28,860 This is saying the probability that x is less 1220 01:10:28,860 --> 01:10:36,290 than mu minus 3 sigma is exactly that value. 1221 01:10:36,290 --> 01:10:42,720 That equals f of mu minus 3 sigma.
1222 01:10:42,720 --> 01:10:44,550 And I simply look that up. 1223 01:10:44,550 --> 01:10:51,120 And that's about 0.00135, or roughly 0.1% of your data 1224 01:10:51,120 --> 01:10:55,230 should fall below minus 3 sigma on the left side 1225 01:10:55,230 --> 01:10:58,050 of your distribution. 1226 01:10:58,050 --> 01:11:01,290 And then I've tabulated that for two standard deviations, one 1227 01:11:01,290 --> 01:11:03,060 standard deviation. 1228 01:11:03,060 --> 01:11:05,640 By the way, what's the probability, now 1229 01:11:05,640 --> 01:11:10,400 that I've marked it up, that your data falls less 1230 01:11:10,400 --> 01:11:11,060 than your mean? 1231 01:11:13,740 --> 01:11:14,610 50%. 1232 01:11:14,610 --> 01:11:17,520 It's a symmetric distribution. 1233 01:11:17,520 --> 01:11:21,750 And so, in fact, you could then ask also 1234 01:11:21,750 --> 01:11:26,250 the question, what's the probability that my data is 1235 01:11:26,250 --> 01:11:29,850 all the way from my left tail up to two standard deviations 1236 01:11:29,850 --> 01:11:31,380 above the mean? 1237 01:11:31,380 --> 01:11:33,680 And that's 97.7%. 1238 01:11:33,680 --> 01:11:36,480 But I want to also point out these-- 1239 01:11:36,480 --> 01:11:43,810 this cumulative distribution is also anti-symmetric around the mean. 1240 01:11:43,810 --> 01:11:50,560 So this value and this value sum to 1. 1241 01:11:50,560 --> 01:11:53,050 So in other words, 1 minus whatever 1242 01:11:53,050 --> 01:11:57,910 is out in the upper tail is equal to the probability 1243 01:11:57,910 --> 01:12:00,280 of being below the lower tail. 1244 01:12:04,630 --> 01:12:09,510 So what's tabulated is not mu minus numbers 1245 01:12:09,510 --> 01:12:11,100 of standard deviations. 1246 01:12:11,100 --> 01:12:12,450 But what will often-- 1247 01:12:12,450 --> 01:12:16,440 what is actually tabulated is the standardized or unit 1248 01:12:16,440 --> 01:12:20,310 normal distribution-- again, the mean-centered version, 1249 01:12:20,310 --> 01:12:22,260 where I subtract off the mean and divide 1250 01:12:22,260 --> 01:12:25,210 by the standard deviation. 1251 01:12:25,210 --> 01:12:33,000 And that gives a PDF and a CDF that is universal. 1252 01:12:33,000 --> 01:12:39,370 And that is what will often be then tabulated 1253 01:12:39,370 --> 01:12:45,220 as the unit normal Cumulative Density Function. 1254 01:12:45,220 --> 01:12:47,770 In some sense, that's what I actually showed on this plot, 1255 01:12:47,770 --> 01:12:51,040 by just labeling it as a function of mu and standard 1256 01:12:51,040 --> 01:12:52,090 deviations. 1257 01:12:52,090 --> 01:12:57,160 But now when you normalize, that becomes in units of z as 0 1258 01:12:57,160 --> 01:13:01,570 and the numbers of standard deviations off on the side. 1259 01:13:01,570 --> 01:13:05,100 Now, if you look at the back of Montgomery, 1260 01:13:05,100 --> 01:13:06,860 there is a whole bunch of these tables. 1261 01:13:06,860 --> 01:13:09,360 And you'll be using these tables in some of the problem sets 1262 01:13:09,360 --> 01:13:10,980 and so on. 1263 01:13:10,980 --> 01:13:14,520 And there is a table for the unit normal. 1264 01:13:14,520 --> 01:13:20,790 And in particular, what's tabulated 1265 01:13:20,790 --> 01:13:24,780 is this Cumulative Density Function for the unit normal.
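[In place of the table at the back of Montgomery, a library CDF call does the same lookup; a quick sketch assuming SciPy is available:]

```python
from scipy.stats import norm

# Standardize first: z = (x - mu) / sigma, then look up the unit normal CDF.
print(norm.cdf(-3.0))  # ~0.00135: fraction falling below mu - 3 sigma
print(norm.cdf(0.0))   # 0.5: half the data falls below the mean
print(norm.cdf(2.0))   # ~0.9772: from the left tail up to mu + 2 sigma

# The anti-symmetry used above: F(-z) + F(+z) = 1
print(norm.cdf(-2.0) + norm.cdf(2.0))  # 1.0
```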
1266 01:13:24,780 --> 01:13:26,910 And we have a little bit of terminology 1267 01:13:26,910 --> 01:13:28,590 here that I want to alert you to, 1268 01:13:28,590 --> 01:13:32,460 because we often talk about percentage points off 1269 01:13:32,460 --> 01:13:36,240 of some distribution or percentage points of the unit 1270 01:13:36,240 --> 01:13:38,700 normal, as pictured here. 1271 01:13:38,700 --> 01:13:45,510 And what we're talking about is relating percentages 1272 01:13:45,510 --> 01:13:48,660 of my distribution that are in some location, usually 1273 01:13:48,660 --> 01:13:52,980 the tails, to numbers of standard deviations 1274 01:13:52,980 --> 01:13:57,810 that I have to go in order to apportion that amount over 1275 01:13:57,810 --> 01:14:00,330 in the tails or in the central regions. 1276 01:14:00,330 --> 01:14:06,420 So a very typical question I might ask is, how many z's-- 1277 01:14:06,420 --> 01:14:11,900 how many "unit standard deviations," how many z's-- 1278 01:14:11,900 --> 01:14:16,610 do I have to go away from the mean 1279 01:14:16,610 --> 01:14:22,160 in order to get some alpha or some percentage 1280 01:14:22,160 --> 01:14:27,500 of the distribution located out in those tails? 1281 01:14:27,500 --> 01:14:30,260 So for example, I might say I want 1282 01:14:30,260 --> 01:14:38,580 the 20% percentage 1283 01:14:38,580 --> 01:14:46,890 point, the 0.2 total probability that my data sits in the two tails. 1284 01:14:46,890 --> 01:14:52,590 So for a total probability of 0.2 that 1285 01:14:52,590 --> 01:14:55,140 a portion of my data is out in either 1286 01:14:55,140 --> 01:15:03,480 of the tails, farther away than some z, that means 10% 1287 01:15:03,480 --> 01:15:04,800 is in each of the tails. 1288 01:15:04,800 --> 01:15:08,520 And I'm asking the question, how far-- 1289 01:15:08,520 --> 01:15:11,130 how many standard deviations do I 1290 01:15:11,130 --> 01:15:13,770 have to go to get 10% in the left tail 1291 01:15:13,770 --> 01:15:17,200 and 10% out in the right tail? 1292 01:15:17,200 --> 01:15:19,600 So I'm essentially asking the question, 1293 01:15:19,600 --> 01:15:24,550 on the cumulative unit 1294 01:15:24,550 --> 01:15:28,230 normal Probability Distribution Function, 1295 01:15:28,230 --> 01:15:30,060 how many z's do I have to go to get 1296 01:15:30,060 --> 01:15:33,540 half of that alpha probability 1297 01:15:33,540 --> 01:15:36,600 into each of the tails? 1298 01:15:36,600 --> 01:15:40,890 One observation here is that these things are, again, 1299 01:15:40,890 --> 01:15:42,240 anti-symmetric. 1300 01:15:42,240 --> 01:15:46,140 So I can also ask the question either looking 1301 01:15:46,140 --> 01:15:49,940 at just the right tail or the left tail. 1302 01:15:49,940 --> 01:15:54,510 And then you can do the inverse operation using the table. 1303 01:15:54,510 --> 01:15:56,360 So I'm actually asking the question, what 1304 01:15:56,360 --> 01:15:58,980 is the z associated with that? 1305 01:15:58,980 --> 01:16:02,280 And I'm looking up on this plot. 1306 01:16:02,280 --> 01:16:06,980 So I might ask, OK, I need 10% there in the tail. 1307 01:16:06,980 --> 01:16:09,590 How many z's does that correspond to? 1308 01:16:09,590 --> 01:16:12,380 And to get 10% out in that left tail, 1309 01:16:12,380 --> 01:16:17,090 I've got to go out 1.28 standard deviations off to the left. 1310 01:16:17,090 --> 01:16:22,050 That's the operation that one would look up in the table.
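[That inverse operation is the percentage-point, or quantile, lookup; a sketch assuming SciPy, for the 20% two-tailed example just described:]

```python
from scipy.stats import norm

alpha = 0.20                    # total probability split across both tails

# Left-tail lookup: 10% of the distribution falls below this z.
z_left = norm.ppf(alpha / 2)
print(z_left)                   # ~-1.2816 standard deviations

# By anti-symmetry, the right-tail percentage point is just the negative:
print(norm.ppf(1 - alpha / 2))  # ~+1.2816
```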
1311 01:16:22,050 --> 01:16:28,910 So very often, you would get to these kinds of lookups, 1312 01:16:28,910 --> 01:16:39,590 where you're relating the probability alpha of your data 1313 01:16:39,590 --> 01:16:44,470 lying below that number of standard deviations 1314 01:16:44,470 --> 01:16:46,810 and what that corresponding standard deviation is. 1315 01:16:51,440 --> 01:16:56,510 So I didn't copy one of the tables out of Montgomery, 1316 01:16:56,510 --> 01:17:00,900 but you'll get some practice with that on the problem sets. 1317 01:17:00,900 --> 01:17:03,110 Now, there's other related operations 1318 01:17:03,110 --> 01:17:04,860 you can do once you have that. 1319 01:17:04,860 --> 01:17:09,650 So for example, now I can ask, what is the probability 1320 01:17:09,650 --> 01:17:12,590 not just that data lies out in the tail, 1321 01:17:12,590 --> 01:17:16,280 but what are the probabilities that it also or instead lies 1322 01:17:16,280 --> 01:17:17,660 in the middle region? 1323 01:17:17,660 --> 01:17:20,960 They're all the same kinds of operations. 1324 01:17:20,960 --> 01:17:25,190 And so for example, here's a quick tabulation 1325 01:17:25,190 --> 01:17:28,130 for three different kinds of examples, 1326 01:17:28,130 --> 01:17:30,620 where I'm asking not what is out in the tails, 1327 01:17:30,620 --> 01:17:35,420 but I'm asking what is within the center plus/minus 1 sigma 1328 01:17:35,420 --> 01:17:37,220 region of the data? 1329 01:17:37,220 --> 01:17:40,010 And if you look very carefully, I'm 1330 01:17:40,010 --> 01:17:44,060 using exactly these Cumulative Density 1331 01:17:44,060 --> 01:17:45,950 Functions for the unit normal. 1332 01:17:45,950 --> 01:17:47,750 This is for a unit normal. 1333 01:17:50,930 --> 01:17:53,570 And looking out, what's the cumulative probability 1334 01:17:53,570 --> 01:17:54,680 over in the left tail? 1335 01:17:54,680 --> 01:17:55,550 The right tail? 1336 01:17:55,550 --> 01:17:57,170 Doing those observations. 1337 01:17:57,170 --> 01:17:59,840 But these are also very nice rules of thumb 1338 01:17:59,840 --> 01:18:07,740 to have ready for you, which is saying within plus/minus 1 1339 01:18:07,740 --> 01:18:12,810 standard deviation in the normal, 68% of your data 1340 01:18:12,810 --> 01:18:16,070 is going to fall in that 1 sigma region. 1341 01:18:16,070 --> 01:18:21,540 If I expand out to 2 sigma, 1342 01:18:21,540 --> 01:18:26,140 now roughly 95% of my data should fall in there. 1343 01:18:26,140 --> 01:18:29,340 And if I expand out even further to the 3 sigma, 1344 01:18:29,340 --> 01:18:33,690 that's 99.7% of your data 1345 01:18:33,690 --> 01:18:39,900 that should fall within those center plus/minus 3 standard deviations. 1346 01:18:43,220 --> 01:18:45,170 So the percentage points out there, 1347 01:18:45,170 --> 01:18:48,950 the part that falls outside of that, is about 1348 01:18:48,950 --> 01:18:50,900 3 in 1,000. 1349 01:18:50,900 --> 01:18:54,590 We'll come back to this when we see statistical process control 1350 01:18:54,590 --> 01:18:58,470 and control charts because you may have run into these control 1351 01:18:58,470 --> 01:18:58,970 charts. 1352 01:18:58,970 --> 01:19:03,980 We're often plotting the 3 sigma control limits.
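[Those central-coverage rules of thumb, including the 3-in-1,000 control-limit fraction picked up again just below, all come from the same CDF manipulation, F(+k) minus F(-k); a quick sketch assuming SciPy:]

```python
from scipy.stats import norm

for k in (1, 2, 3):
    inside = norm.cdf(k) - norm.cdf(-k)  # probability within +/- k sigma
    print(f"+/-{k} sigma: {inside:.4f} inside, {1 - inside:.4f} outside")

# +/-1 sigma: 0.6827 inside; +/-2 sigma: 0.9545; +/-3 sigma: 0.9973,
# leaving about 3 in 1,000 outside the 3 sigma limits.
```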
1353 01:19:03,980 --> 01:19:05,690 And essentially what we're saying 1354 01:19:05,690 --> 01:19:10,070 is only a very small fraction of my data-- 1355 01:19:10,070 --> 01:19:13,100 3 out of 1,000, if I'm using plus/minus 1356 01:19:13,100 --> 01:19:14,870 3 sigma control limits. 1357 01:19:14,870 --> 01:19:18,140 3 out of 1,000 points of my data, by random chance 1358 01:19:18,140 --> 01:19:22,775 alone, should be falling outside of those 3 sigma bounds. 1359 01:19:25,440 --> 01:19:33,150 So that starts to get us close to statistical process control. 1360 01:19:33,150 --> 01:19:36,470 So what we're going to do next time 1361 01:19:36,470 --> 01:19:41,030 is start to look a little bit more closely at statistics. 1362 01:19:41,030 --> 01:19:46,110 When I do form, again, things like the sample 1363 01:19:46,110 --> 01:19:53,990 mean, or I form the sample standard deviation or sample 1364 01:19:53,990 --> 01:19:58,880 variance from my data, those themselves 1365 01:19:58,880 --> 01:20:02,000 have these probability densities associated with them. 1366 01:20:02,000 --> 01:20:05,810 And from that, we're going to be able to go backwards 1367 01:20:05,810 --> 01:20:13,040 and essentially work to try to understand things 1368 01:20:13,040 --> 01:20:16,490 about the underlying process distribution, the parent 1369 01:20:16,490 --> 01:20:20,700 probability distribution function, associated with that. 1370 01:20:20,700 --> 01:20:22,910 So we're going to have to understand 1371 01:20:22,910 --> 01:20:29,990 more complicated PDFs than the normal distribution 1372 01:20:29,990 --> 01:20:32,090 because things like the sample variance 1373 01:20:32,090 --> 01:20:34,730 is not going to be normally distributed. 1374 01:20:34,730 --> 01:20:39,020 It's going to have its own bizarre distribution-- 1375 01:20:39,020 --> 01:20:41,250 in this case, the chi-square distribution. 1376 01:20:41,250 --> 01:20:44,330 So we'll return to looking at some additional distributions, 1377 01:20:44,330 --> 01:20:47,730 but these same manipulations will come up again. 1378 01:20:47,730 --> 01:20:51,590 And what we're ultimately going to want to be able to do is 1379 01:20:51,590 --> 01:20:54,890 make inferences about the underlying distribution-- 1380 01:20:54,890 --> 01:20:56,660 the parent process-- 1381 01:20:56,660 --> 01:20:59,510 what its mean is, what its variance is, 1382 01:20:59,510 --> 01:21:02,990 based on the calculated sample mean and sample variance 1383 01:21:02,990 --> 01:21:06,080 that we might be using, and then also make 1384 01:21:06,080 --> 01:21:10,040 inferences about the likelihood that the true mean 1385 01:21:10,040 --> 01:21:12,200 lies in certain ranges. 1386 01:21:12,200 --> 01:21:15,380 Or to put it another way, next time, 1387 01:21:15,380 --> 01:21:20,360 we'll also be talking about confidence intervals. 1388 01:21:20,360 --> 01:21:22,610 So we'll see you on Thursday. 1389 01:21:22,610 --> 01:21:30,500 Watch for the message from Hayden about tours and enjoy.