1
00:00:00,000 --> 00:00:00,040
2
00:00:00,040 --> 00:00:02,460
The following content is
provided under a Creative
3
00:00:02,460 --> 00:00:03,870
Commons license.
4
00:00:03,870 --> 00:00:06,910
Your support will help MIT
OpenCourseWare continue to
5
00:00:06,910 --> 00:00:10,560
offer high-quality educational
resources for free.
6
00:00:10,560 --> 00:00:13,460
To make a donation or view
additional materials from
7
00:00:13,460 --> 00:00:19,290
hundreds of MIT courses, visit
MIT OpenCourseWare at
8
00:00:19,290 --> 00:00:20,540
ocw.mit.edu.
9
00:00:20,540 --> 00:00:22,560
10
00:00:22,560 --> 00:00:25,340
PROFESSOR: We're going to finish
today our discussion of
11
00:00:25,340 --> 00:00:27,460
limit theorems.
12
00:00:27,460 --> 00:00:30,340
I'm going to remind you what the
central limit theorem is,
13
00:00:30,340 --> 00:00:33,460
which we introduced
briefly last time.
14
00:00:33,460 --> 00:00:37,230
We're going to discuss what
exactly it says and its
15
00:00:37,230 --> 00:00:38,780
implications.
16
00:00:38,780 --> 00:00:42,100
And then we're going to apply
it to a couple of examples,
17
00:00:42,100 --> 00:00:45,520
mostly on the binomial
distribution.
18
00:00:45,520 --> 00:00:49,950
OK, so the situation is that
we are dealing with a large
19
00:00:49,950 --> 00:00:52,420
number of independent,
identically
20
00:00:52,420 --> 00:00:55,000
distributed random variables.
21
00:00:55,000 --> 00:00:58,270
And we want to look at the sum
of them and say something
22
00:00:58,270 --> 00:01:00,510
about the distribution
of the sum.
23
00:01:00,510 --> 00:01:03,310
24
00:01:03,310 --> 00:01:06,910
We might want to say that
the sum is distributed
25
00:01:06,910 --> 00:01:10,510
approximately as a normal random
variable, although,
26
00:01:10,510 --> 00:01:12,750
formally, this is
not quite right.
27
00:01:12,750 --> 00:01:16,330
As n goes to infinity, the
distribution of the sum
28
00:01:16,330 --> 00:01:20,000
becomes very spread out, and
it doesn't converge to a
29
00:01:20,000 --> 00:01:21,830
limiting distribution.
30
00:01:21,830 --> 00:01:24,930
In order to get an interesting
limit, we need first to take
31
00:01:24,930 --> 00:01:28,150
the sum and standardize it.
32
00:01:28,150 --> 00:01:32,267
By standardizing it, what we
mean is to subtract the mean
33
00:01:32,267 --> 00:01:38,060
and then divide by the
standard deviation.
34
00:01:38,060 --> 00:01:41,320
Now, the mean is, of course, n
times the expected value of
35
00:01:41,320 --> 00:01:43,080
each one of the X's.
36
00:01:43,080 --> 00:01:45,130
And the standard deviation
is the
37
00:01:45,130 --> 00:01:46,610
square root of the variance.
38
00:01:46,610 --> 00:01:50,530
The variance is n times sigma
squared, where sigma squared is the
39
00:01:50,530 --> 00:01:52,180
variance of the X's --
40
00:01:52,180 --> 00:01:53,400
so that's the standard
deviation.
41
00:01:53,400 --> 00:01:56,330
And after we do this, we obtain
a random variable that
42
00:01:56,330 --> 00:02:01,100
has 0 mean -- it's centered
-- and the
43
00:02:01,100 --> 00:02:03,230
variance is equal to 1.
44
00:02:03,230 --> 00:02:07,240
And so the variance stays the
same, no matter how large n is
45
00:02:07,240 --> 00:02:08,500
going to be.
46
00:02:08,500 --> 00:02:12,660
So the distribution of Zn keeps
changing with n, but it
47
00:02:12,660 --> 00:02:14,080
cannot change too much.
48
00:02:14,080 --> 00:02:15,240
It stays in place.
49
00:02:15,240 --> 00:02:19,550
The mean is 0, and the width
remains also roughly the same
50
00:02:19,550 --> 00:02:22,000
because the variance is 1.
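The standardization the professor describes can be sketched in code (a minimal simulation; the choice of exponential X's is an illustrative assumption, not from the lecture):

```python
import math
import random
import statistics

random.seed(0)

# Illustrative choice: the X's are Exponential(1), so mu = 1 and sigma = 1.
n = 30
mu, sigma = 1.0, 1.0

def zn_sample():
    """One sample of Zn = (Sn - n*mu) / (sigma * sqrt(n))."""
    sn = sum(random.expovariate(1.0) for _ in range(n))
    return (sn - n * mu) / (sigma * math.sqrt(n))

# No matter what n is, Zn has mean 0 and variance 1.
samples = [zn_sample() for _ in range(20000)]
print(round(statistics.mean(samples), 2))       # close to 0
print(round(statistics.pvariance(samples), 2))  # close to 1
```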
51
00:02:22,000 --> 00:02:25,820
The surprising thing is that, as
n grows, that distribution
52
00:02:25,820 --> 00:02:31,250
of Zn kind of settles in a
certain asymptotic shape.
53
00:02:31,250 --> 00:02:33,620
And that's the shape
of a standard
54
00:02:33,620 --> 00:02:35,290
normal random variable.
55
00:02:35,290 --> 00:02:37,580
So standard normal means
that it has 0
56
00:02:37,580 --> 00:02:39,930
mean and unit variance.
57
00:02:39,930 --> 00:02:43,850
More precisely, what the central
limit theorem tells us
58
00:02:43,850 --> 00:02:46,560
is a relation between the
cumulative distribution
59
00:02:46,560 --> 00:02:49,430
function of Zn and
the cumulative
60
00:02:49,430 --> 00:02:51,990
distribution function of
the standard normal.
61
00:02:51,990 --> 00:02:56,620
So for any given number, c,
the probability that Zn is
62
00:02:56,620 --> 00:03:01,140
less than or equal to c, in the
limit, becomes the same as
63
00:03:01,140 --> 00:03:04,090
the probability that the
standard normal becomes less
64
00:03:04,090 --> 00:03:05,760
than or equal to c.
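This convergence of CDFs can be checked by simulation (a sketch, not from the lecture; uniform X's are an arbitrary illustrative choice, and the standard normal CDF is computed from the error function):

```python
import math
import random

def phi(c):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

random.seed(1)
n, trials, c = 20, 50000, 1.0
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)   # mean and std of Uniform(0, 1)

# Estimate P(Zn <= c) by Monte Carlo.
hits = 0
for _ in range(trials):
    sn = sum(random.random() for _ in range(n))
    zn = (sn - n * mu) / (sigma * math.sqrt(n))
    hits += zn <= c

estimate = hits / trials
print(round(estimate, 2), round(phi(c), 2))  # both close to 0.84
```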
65
00:03:05,760 --> 00:03:08,800
And of course, this is useful
because these probabilities
66
00:03:08,800 --> 00:03:11,960
are available from the normal
tables, whereas the
67
00:03:11,960 --> 00:03:15,850
distribution of Zn might be a
very complicated expression if
68
00:03:15,850 --> 00:03:19,520
you were to calculate
it exactly.
69
00:03:19,520 --> 00:03:22,960
So some comments about the
central limit theorem.
70
00:03:22,960 --> 00:03:27,860
First thing is that it's quite
amazing that it's universal.
71
00:03:27,860 --> 00:03:31,970
It doesn't matter what the
distribution of the X's is.
72
00:03:31,970 --> 00:03:35,970
It can be any distribution
whatsoever, as long as it has
73
00:03:35,970 --> 00:03:39,070
finite mean and finite
variance.
74
00:03:39,070 --> 00:03:42,170
And when you go and do your
approximations using the
75
00:03:42,170 --> 00:03:44,520
central limit theorem, the only
thing that you need to
76
00:03:44,520 --> 00:03:47,580
know about the distribution
of the X's are the
77
00:03:47,580 --> 00:03:49,130
mean and the variance.
78
00:03:49,130 --> 00:03:52,410
You need those in order
to standardize Sn.
79
00:03:52,410 --> 00:03:55,910
I mean -- to subtract the mean
and divide by the standard
80
00:03:55,910 --> 00:03:56,810
deviation --
81
00:03:56,810 --> 00:03:59,120
you need to know the mean
and the variance.
82
00:03:59,120 --> 00:04:02,350
But these are the only things
that you need to know in order
83
00:04:02,350 --> 00:04:03,600
to apply it.
84
00:04:03,600 --> 00:04:06,060
85
00:04:06,060 --> 00:04:08,730
In addition, it's
a very accurate
86
00:04:08,730 --> 00:04:10,650
computational shortcut.
87
00:04:10,650 --> 00:04:14,660
So the distribution of these
Zn's, in principle, you can
88
00:04:14,660 --> 00:04:18,130
calculate it by convolution of
the distribution of the X's
89
00:04:18,130 --> 00:04:20,050
with itself many, many times.
90
00:04:20,050 --> 00:04:23,720
But this is tedious, and if you
try to do it analytically,
91
00:04:23,720 --> 00:04:26,570
it might be a very complicated
expression.
92
00:04:26,570 --> 00:04:29,910
Whereas by just appealing to the
standard normal table for
93
00:04:29,910 --> 00:04:33,870
the standard normal random
variable, things are done in a
94
00:04:33,870 --> 00:04:35,360
very quick way.
95
00:04:35,360 --> 00:04:39,070
So it's a nice computational
shortcut if you don't want to
96
00:04:39,070 --> 00:04:42,210
get an exact answer to a
probability problem.
97
00:04:42,210 --> 00:04:47,480
Now, at a more philosophical
level, it justifies why we are
98
00:04:47,480 --> 00:04:50,930
really interested in normal
random variables.
99
00:04:50,930 --> 00:04:55,230
Whenever you have a phenomenon
which is noisy, and the noise
100
00:04:55,230 --> 00:05:00,420
that you observe is created by
adding lots of little
101
00:05:00,420 --> 00:05:03,820
pieces of randomness that are
independent of each other, the
102
00:05:03,820 --> 00:05:06,840
overall effect that you're
going to observe can be
103
00:05:06,840 --> 00:05:10,240
described by a normal
random variable.
104
00:05:10,240 --> 00:05:16,810
So in a classic example that
goes 100 years back or so,
105
00:05:16,810 --> 00:05:19,840
suppose that you have a fluid,
and inside that fluid, there's
106
00:05:19,840 --> 00:05:23,340
a little particle of dust
or whatever that's
107
00:05:23,340 --> 00:05:24,950
suspended in there.
108
00:05:24,950 --> 00:05:28,380
That little particle gets
hit by molecules
109
00:05:28,380 --> 00:05:30,000
completely at random --
110
00:05:30,000 --> 00:05:32,730
and so what you're going to see
is that particle kind of
111
00:05:32,730 --> 00:05:36,020
moving randomly inside
that liquid.
112
00:05:36,020 --> 00:05:40,260
Now that random motion, if you
ask, after one second, how
113
00:05:40,260 --> 00:05:45,520
much is my particle displaced,
let's say, along
114
00:05:45,520 --> 00:05:47,170
the x direction.
115
00:05:47,170 --> 00:05:50,960
That displacement is very, very
well modeled by a normal
116
00:05:50,960 --> 00:05:51,960
random variable.
117
00:05:51,960 --> 00:05:55,710
And the reason is that the
position of that particle is
118
00:05:55,710 --> 00:06:00,160
decided by the cumulative effect
of lots of random hits
119
00:06:00,160 --> 00:06:04,480
by molecules that hit
that particle.
120
00:06:04,480 --> 00:06:11,630
So that's a sort of celebrated
physical model that goes under
121
00:06:11,630 --> 00:06:15,000
the name of Brownian motion.
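The Brownian-motion story can be sketched with a toy random walk (the numbers of kicks and the step size below are made-up illustrative values): the net displacement is a sum of many tiny independent kicks, so by the central limit theorem it is approximately normal, with mean 0 and variance equal to the number of kicks times the step size squared.

```python
import random
import statistics

random.seed(2)
kicks = 1000    # molecular hits per second -- illustrative number
step = 0.1      # displacement per hit -- illustrative number

def displacement():
    """Net x-displacement after many independent +/- kicks."""
    return sum(random.choice((-step, step)) for _ in range(kicks))

samples = [displacement() for _ in range(5000)]

# CLT prediction: approximately normal, mean 0, variance kicks * step**2 = 10.
print(round(statistics.mean(samples), 1))       # close to 0
print(round(statistics.pvariance(samples), 1))  # close to 10
```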
122
00:06:15,000 --> 00:06:18,100
And it's the same model that
some people use to describe
123
00:06:18,100 --> 00:06:20,300
the movement in the
financial markets.
124
00:06:20,300 --> 00:06:24,660
The argument might go that the
movement of prices has to do
125
00:06:24,660 --> 00:06:28,300
with lots of little decisions
and lots of little events by
126
00:06:28,300 --> 00:06:31,310
many, many different
actors that are
127
00:06:31,310 --> 00:06:32,890
involved in the market.
128
00:06:32,890 --> 00:06:37,440
So the distribution of stock
prices might be well described
129
00:06:37,440 --> 00:06:39,740
by normal random variables.
130
00:06:39,740 --> 00:06:43,780
At least that's what people
wanted to believe until
131
00:06:43,780 --> 00:06:45,160
somewhat recently.
132
00:06:45,160 --> 00:06:48,300
Now, the evidence is that,
actually, these distributions
133
00:06:48,300 --> 00:06:52,210
are a little more heavy-tailed
in the sense that extreme
134
00:06:52,210 --> 00:06:55,630
events are a little more likely
to occur that what
135
00:06:55,630 --> 00:06:58,040
normal random variables would
seem to indicate.
136
00:06:58,040 --> 00:07:03,110
But as a first model, again,
it could be a plausible
137
00:07:03,110 --> 00:07:07,300
argument to have, at least as
a starting model, one that
138
00:07:07,300 --> 00:07:10,200
involves normal random
variables.
139
00:07:10,200 --> 00:07:13,020
So this is the philosophical
side of things.
140
00:07:13,020 --> 00:07:15,820
On the more accurate,
mathematical side, it's
141
00:07:15,820 --> 00:07:18,290
important to appreciate
exactly what kind of
142
00:07:18,290 --> 00:07:21,250
statement the central
limit theorem is.
143
00:07:21,250 --> 00:07:25,460
It's a statement about the
convergence of the CDF of
144
00:07:25,460 --> 00:07:27,940
these standardized random
variables to
145
00:07:27,940 --> 00:07:29,840
the CDF of a normal.
146
00:07:29,840 --> 00:07:32,470
So it's a statement about
convergence of CDFs.
147
00:07:32,470 --> 00:07:36,580
It's not a statement about
convergence of PMFs, or
148
00:07:36,580 --> 00:07:39,100
convergence of PDFs.
149
00:07:39,100 --> 00:07:42,160
Now, if one makes additional
mathematical assumptions,
150
00:07:42,160 --> 00:07:44,880
there are variations of the
central limit theorem that
151
00:07:44,880 --> 00:07:47,220
talk about PDFs and PMFs.
152
00:07:47,220 --> 00:07:51,930
But in general, that's not
necessarily the case.
153
00:07:51,930 --> 00:07:54,610
And I'm going to illustrate
this with--
154
00:07:54,610 --> 00:07:58,890
I have a plot here which
is not in your slides.
155
00:07:58,890 --> 00:08:04,700
But just to make the point,
consider two different
156
00:08:04,700 --> 00:08:06,710
discrete distributions.
157
00:08:06,710 --> 00:08:09,820
This discrete distribution
takes values 1, 4, 7.
158
00:08:09,820 --> 00:08:13,470
159
00:08:13,470 --> 00:08:16,110
This discrete distribution can
take values 1, 2, 4, 6, and 7.
160
00:08:16,110 --> 00:08:18,720
161
00:08:18,720 --> 00:08:24,270
So this one has sort of a
periodicity of 3, this one,
162
00:08:24,270 --> 00:08:27,960
the range of values is a little
more interesting.
163
00:08:27,960 --> 00:08:30,910
The numbers in these two
distributions are cooked up so
164
00:08:30,910 --> 00:08:34,380
that they have the same mean
and the same variance.
165
00:08:34,380 --> 00:08:38,970
Now, what I'm going to do is
to take eight independent
166
00:08:38,970 --> 00:08:44,090
copies of the random variable
and plot the PMF of the sum of
167
00:08:44,090 --> 00:08:45,980
eight random variables.
168
00:08:45,980 --> 00:08:51,520
Now, if I plot the PMF of the
sum of 8 of these, I get the
169
00:08:51,520 --> 00:08:59,690
plot, which corresponds to these
bullets in this diagram.
170
00:08:59,690 --> 00:09:03,040
If I take 8 random variables,
according to this
171
00:09:03,040 --> 00:09:07,270
distribution, and add them up
and compute their PMF, the PMF
172
00:09:07,270 --> 00:09:10,310
I get is the one denoted
here by the X's.
173
00:09:10,310 --> 00:09:15,630
The two PMFs look really
different, at least, when you
174
00:09:15,630 --> 00:09:16,890
eyeball them.
175
00:09:16,890 --> 00:09:23,500
On the other hand, if you were
to plot the CDFs of them, then
176
00:09:23,500 --> 00:09:34,000
the CDFs, if you compare them
with the normal CDF, which is
177
00:09:34,000 --> 00:09:38,390
this continuous curve, the CDF,
of course, it goes up in
178
00:09:38,390 --> 00:09:41,870
steps because we're looking at
discrete random variables.
179
00:09:41,870 --> 00:09:47,600
But it's very close
to the normal CDF.
180
00:09:47,600 --> 00:09:52,000
And if, instead of n equal to
8, we were to take 16, then
181
00:09:52,000 --> 00:09:54,480
the agreement would
be even better.
182
00:09:54,480 --> 00:09:59,850
So in terms of CDFs, when we add
8 or 16 of these, we get
183
00:09:59,850 --> 00:10:01,930
very close to the normal CDF.
184
00:10:01,930 --> 00:10:05,080
We would get essentially the
same picture if I were to take
185
00:10:05,080 --> 00:10:06,850
8 or 16 of these.
186
00:10:06,850 --> 00:10:11,730
So the CDFs sit, essentially, on
top of each other, although
187
00:10:11,730 --> 00:10:14,400
the two PMFs look
quite different.
188
00:10:14,400 --> 00:10:17,230
So this is to appreciate that,
formally speaking, we only
189
00:10:17,230 --> 00:10:22,470
have a statement about
CDFs, not about PMFs.
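This point can be reproduced numerically. The exact probabilities in the lecture's plot aren't given, so the two PMFs below are hypothetical stand-ins, chosen so that both have mean 4 and variance 6 on the supports {1, 4, 7} and {1, 2, 4, 6, 7} mentioned above:

```python
def convolve(p, q):
    """PMF of X + Y for independent X ~ p and Y ~ q (dicts: value -> prob)."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, 0.0) + px * qy
    return out

def sum_of(p, n):
    """PMF of the sum of n independent copies of p."""
    out = {0: 1.0}
    for _ in range(n):
        out = convolve(out, p)
    return out

def cdf(p):
    """CDF of a PMF, as a dict: value -> P(X <= value)."""
    total, out = 0.0, {}
    for v in sorted(p):
        total += p[v]
        out[v] = total
    return out

# Hypothetical PMFs, both with mean 4 and variance 6.
a = {1: 1 / 3, 4: 1 / 3, 7: 1 / 3}
b = {1: 0.25, 2: 0.1875, 4: 0.125, 6: 0.1875, 7: 0.25}

sa, sb = sum_of(a, 8), sum_of(b, 8)
ca, cb = cdf(sa), cdf(sb)

# Pointwise, the PMFs of the two sums look very different (the first sum
# lives on a lattice of step 3), yet the CDFs nearly agree.
support = sorted(set(sa) & set(sb))
pmf_gap = max(abs(sa[v] - sb[v]) for v in support)
cdf_gap = max(abs(ca[v] - cb[v]) for v in support)
print(cdf_gap < pmf_gap)   # the CDFs agree better than the PMFs
```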
190
00:10:22,470 --> 00:10:26,980
Now in practice, how do you use
the central limit theorem?
191
00:10:26,980 --> 00:10:30,550
Well, it tells us that we can
calculate probabilities by
192
00:10:30,550 --> 00:10:32,810
treating Zn as if it
were a standard
193
00:10:32,810 --> 00:10:34,550
normal random variable.
194
00:10:34,550 --> 00:10:38,280
Now Zn is a linear
function of Sn.
195
00:10:38,280 --> 00:10:43,120
Conversely, Sn is a linear
function of Zn.
196
00:10:43,120 --> 00:10:45,680
Linear functions of normals
are normal.
197
00:10:45,680 --> 00:10:49,450
So if I pretend that Zn is
normal, it's essentially the
198
00:10:49,450 --> 00:10:53,230
same as if we pretend
that Sn is normal.
199
00:10:53,230 --> 00:10:55,560
And so we can calculate
probabilities that have to do
200
00:10:55,560 --> 00:10:59,830
with Sn as if Sn were normal.
201
00:10:59,830 --> 00:11:03,850
Now, the central limit theorem
does not tell us that Sn is
202
00:11:03,850 --> 00:11:05,120
approximately normal.
203
00:11:05,120 --> 00:11:08,860
The formal statement is about
Zn, but, practically speaking,
204
00:11:08,860 --> 00:11:11,150
when you use the result,
you can just
205
00:11:11,150 --> 00:11:14,650
pretend that Sn is normal.
206
00:11:14,650 --> 00:11:18,620
Finally, it's a limit theorem,
so it tells us about what
207
00:11:18,620 --> 00:11:21,240
happens when n goes
to infinity.
208
00:11:21,240 --> 00:11:23,880
If we are to use it in practice,
of course, n is not
209
00:11:23,880 --> 00:11:25,120
going to be infinity.
210
00:11:25,120 --> 00:11:28,320
Maybe n is equal to 15.
211
00:11:28,320 --> 00:11:32,130
Can we use a limit theorem when
n is a small number, as
212
00:11:32,130 --> 00:11:34,020
small as 15?
213
00:11:34,020 --> 00:11:36,980
Well, it turns out that it's
a very good approximation.
214
00:11:36,980 --> 00:11:41,420
Even for quite small values
of n, it gives us
215
00:11:41,420 --> 00:11:43,770
very accurate answers.
216
00:11:43,770 --> 00:11:49,710
So n on the order of 15, or
20, or so gives us very good
217
00:11:49,710 --> 00:11:51,790
results in practice.
218
00:11:51,790 --> 00:11:54,820
There are no good theorems
that will give us hard
219
00:11:54,820 --> 00:11:58,550
guarantees because the quality
of the approximation does
220
00:11:58,550 --> 00:12:03,490
depend on the details of the
distribution of the X's.
221
00:12:03,490 --> 00:12:07,510
If the X's have a distribution
that, from the outset, looks a
222
00:12:07,510 --> 00:12:13,200
little bit like the normal, then
for small values of n,
223
00:12:13,200 --> 00:12:15,700
you are going to see,
essentially, a normal
224
00:12:15,700 --> 00:12:16,980
distribution for the sum.
225
00:12:16,980 --> 00:12:20,030
If the distribution of the X's
is very different from the
226
00:12:20,030 --> 00:12:23,350
normal, it's going to take a
larger value of n for the
227
00:12:23,350 --> 00:12:25,770
central limit theorem
to take effect.
228
00:12:25,770 --> 00:12:29,960
So let's illustrate this with
a few representative plots.
229
00:12:29,960 --> 00:12:32,600
230
00:12:32,600 --> 00:12:36,460
So here, we're starting with a
discrete uniform distribution
231
00:12:36,460 --> 00:12:39,580
that goes from 1 to 8.
232
00:12:39,580 --> 00:12:44,200
Let's add 2 of these random
variables, 2 random variables
233
00:12:44,200 --> 00:12:47,870
with this PMF, and find
the PMF of the sum.
234
00:12:47,870 --> 00:12:52,570
This is a convolution of 2
discrete uniforms, and I
235
00:12:52,570 --> 00:12:54,960
believe you have seen this
exercise before.
236
00:12:54,960 --> 00:12:59,030
When you convolve this with
itself, you get a triangle.
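That triangle is easy to verify directly (a small sketch, using the discrete uniform on 1 through 8 from the slide):

```python
def convolve(p, q):
    """PMF of the sum of independent random variables with PMFs p and q."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, 0.0) + px * qy
    return out

uniform = {k: 1 / 8 for k in range(1, 9)}   # discrete uniform on 1..8
two = convolve(uniform, uniform)            # PMF of the sum of 2 copies

probs = [two[v] for v in sorted(two)]       # support is 2..16
print(sorted(two)[probs.index(max(probs))])  # peak of the triangle: 9
print(probs == probs[::-1])                  # symmetric: True
```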
237
00:12:59,030 --> 00:13:04,400
So this is the PMF for the sum
of two discrete uniforms.
238
00:13:04,400 --> 00:13:05,370
Now let's continue.
239
00:13:05,370 --> 00:13:07,980
Let's convolve this
with itself.
240
00:13:07,980 --> 00:13:10,750
This is going to give
us the PMF of a sum
241
00:13:10,750 --> 00:13:13,740
of 4 discrete uniforms.
242
00:13:13,740 --> 00:13:17,930
And we get this, which starts
looking like a normal.
243
00:13:17,930 --> 00:13:23,450
If we go to n equal to 32, then
it looks, essentially,
244
00:13:23,450 --> 00:13:25,270
exactly like a normal.
245
00:13:25,270 --> 00:13:27,850
And it's an excellent
approximation.
246
00:13:27,850 --> 00:13:32,290
So this is the PMF of the sum
of 32 discrete random
247
00:13:32,290 --> 00:13:36,560
variables with this uniform
distribution.
248
00:13:36,560 --> 00:13:42,190
If we start with a PMF which
is not symmetric--
249
00:13:42,190 --> 00:13:44,640
this one is symmetric
around the mean.
250
00:13:44,640 --> 00:13:47,630
But if we start with a PMF which
is non-symmetric, so
251
00:13:47,630 --> 00:13:53,780
this one, here, is a truncated
geometric PMF, then things do
252
00:13:53,780 --> 00:13:58,960
not work out as nicely when
I add 8 of these.
253
00:13:58,960 --> 00:14:03,640
That is, if I convolve this
with itself 8 times, I get
254
00:14:03,640 --> 00:14:08,600
this PMF, which maybe resembles
a little bit to the
255
00:14:08,600 --> 00:14:09,800
normal one.
256
00:14:09,800 --> 00:14:13,050
But you can really tell that
it's different from the normal
257
00:14:13,050 --> 00:14:16,640
if you focus at the details
here and there.
258
00:14:16,640 --> 00:14:19,930
Here it sort of rises sharply.
259
00:14:19,930 --> 00:14:23,420
Here it tails off
a bit slower.
260
00:14:23,420 --> 00:14:27,660
So there's an asymmetry here
that's present, and which is a
261
00:14:27,660 --> 00:14:29,340
consequence of the
asymmetry of the
262
00:14:29,340 --> 00:14:31,710
distribution we started with.
263
00:14:31,710 --> 00:14:35,320
If we go to 16, it looks a
little better, but still you
264
00:14:35,320 --> 00:14:39,600
can see the asymmetry between
this tail and that tail.
265
00:14:39,600 --> 00:14:43,030
If you get to 32 there's still a
little bit of asymmetry, but
266
00:14:43,030 --> 00:14:48,520
at least now it starts looking
like a normal distribution.
267
00:14:48,520 --> 00:14:54,270
So the moral from these plots
is that it might vary, a
268
00:14:54,270 --> 00:14:57,360
little bit, what kind of values
of n you need before
269
00:14:57,360 --> 00:15:00,070
you get the really good
approximation.
270
00:15:00,070 --> 00:15:04,520
But for values of n in the range
20 to 30 or so, usually
271
00:15:04,520 --> 00:15:07,340
you expect to get a pretty
good approximation.
272
00:15:07,340 --> 00:15:10,180
At least that's what the visual
inspection of these
273
00:15:10,180 --> 00:15:13,330
graphs tells us.
274
00:15:13,330 --> 00:15:16,560
So now that we know that we have
a good approximation in
275
00:15:16,560 --> 00:15:18,460
our hands, let's use it.
276
00:15:18,460 --> 00:15:21,890
Let's use it by revisiting an
example from last time.
277
00:15:21,890 --> 00:15:24,480
This is the polling problem.
278
00:15:24,480 --> 00:15:28,360
We're interested in the fraction
of population that
279
00:15:28,360 --> 00:15:30,220
has a certain habit.
280
00:15:30,220 --> 00:15:33,680
And we try to find what f is.
281
00:15:33,680 --> 00:15:38,120
And the way we do it is by
polling people at random and
282
00:15:38,120 --> 00:15:40,600
recording the answers that they
give, whether they have
283
00:15:40,600 --> 00:15:42,340
the habit or not.
284
00:15:42,340 --> 00:15:45,250
So for each person, we get a
Bernoulli random variable.
285
00:15:45,250 --> 00:15:52,050
With probability f, a person is
going to respond 1, or yes,
286
00:15:52,050 --> 00:15:55,080
so this is with probability f.
287
00:15:55,080 --> 00:15:58,490
And with the remaining
probability 1-f, the person
288
00:15:58,490 --> 00:16:00,390
responds no.
289
00:16:00,390 --> 00:16:04,520
We record this number, which
is how many people answered
290
00:16:04,520 --> 00:16:06,800
yes, divided by the total
number of people.
291
00:16:06,800 --> 00:16:10,740
That's the fraction of the
population that we asked.
292
00:16:10,740 --> 00:16:16,980
This is the fraction inside our
sample that answered yes.
293
00:16:16,980 --> 00:16:21,410
And as we discussed last time,
you might start with some
294
00:16:21,410 --> 00:16:23,210
specs for the poll.
295
00:16:23,210 --> 00:16:25,660
And the specs have
two parameters--
296
00:16:25,660 --> 00:16:29,400
the accuracy that you want and
the confidence that you want
297
00:16:29,400 --> 00:16:33,620
to have that you did really
obtain the desired accuracy.
298
00:16:33,620 --> 00:16:40,550
So the spec here is that we
want probability 95% that our
299
00:16:40,550 --> 00:16:46,400
estimate is within 1 percentage
point of the true answer.
300
00:16:46,400 --> 00:16:48,940
So the event of interest
is this.
301
00:16:48,940 --> 00:16:53,640
That's that the result of the poll,
its distance from the true
302
00:16:53,640 --> 00:16:59,150
answer, is bigger
than 1 percentage point.
303
00:16:59,150 --> 00:17:02,000
And we're interested in
calculating or approximating
304
00:17:02,000 --> 00:17:04,140
this particular probability.
305
00:17:04,140 --> 00:17:08,000
So we want to do it using the
central limit theorem.
306
00:17:08,000 --> 00:17:13,050
And one way of arranging the
mechanics of this calculation
307
00:17:13,050 --> 00:17:17,880
is to take the event of interest
and massage it by
308
00:17:17,880 --> 00:17:21,400
subtracting and dividing things
from both sides of this
309
00:17:21,400 --> 00:17:27,510
inequality so that you bring
into the picture the
310
00:17:27,510 --> 00:17:31,600
standardized random variable,
the Zn, and then apply the
311
00:17:31,600 --> 00:17:33,900
central limit theorem.
312
00:17:33,900 --> 00:17:38,550
So the event of interest, let
me write it in full, Mn is
313
00:17:38,550 --> 00:17:42,280
this quantity, so I'm putting it
here, minus f, which is the
314
00:17:42,280 --> 00:17:44,410
same as nf divided by n.
315
00:17:44,410 --> 00:17:46,980
So this is the same
as that event.
316
00:17:46,980 --> 00:17:49,840
We're going to calculate the
probability of this.
317
00:17:49,840 --> 00:17:52,460
This is not exactly in the form
in which we apply the
318
00:17:52,460 --> 00:17:53,430
central limit theorem.
319
00:17:53,430 --> 00:17:56,570
To apply the central limit
theorem, we need, down here,
320
00:17:56,570 --> 00:17:59,660
to have sigma square root n.
321
00:17:59,660 --> 00:18:03,100
So how can I put sigma
square root n here?
322
00:18:03,100 --> 00:18:07,350
I can divide both sides of
this inequality by sigma.
323
00:18:07,350 --> 00:18:10,970
And then I can take a factor of
square root n from here and
324
00:18:10,970 --> 00:18:13,240
send it to the other side.
325
00:18:13,240 --> 00:18:15,660
So this event is the
same as that event.
326
00:18:15,660 --> 00:18:19,190
This will happen if and only
if that will happen.
327
00:18:19,190 --> 00:18:23,670
So calculating the probability
of this event here is the same
328
00:18:23,670 --> 00:18:27,110
as calculating the probability
that this event happens.
329
00:18:27,110 --> 00:18:30,870
And now we are in business
because the random variable
330
00:18:30,870 --> 00:18:36,510
that we got in here is Zn, or
the absolute value of Zn, and
331
00:18:36,510 --> 00:18:41,480
we're talking about the
probability that Zn, absolute
332
00:18:41,480 --> 00:18:45,660
value of Zn, is bigger than
a certain number.
333
00:18:45,660 --> 00:18:50,310
Since Zn is to be approximated
by a standard normal random
334
00:18:50,310 --> 00:18:54,560
variable, our approximation is
going to be, instead of asking
335
00:18:54,560 --> 00:18:59,040
for Zn being bigger than this
number, we will ask for Z,
336
00:18:59,040 --> 00:19:02,500
absolute value of Z, being
bigger than this number.
337
00:19:02,500 --> 00:19:05,640
So this is the probability that
we want to calculate.
338
00:19:05,640 --> 00:19:09,730
And now Z is a standard normal
random variable.
339
00:19:09,730 --> 00:19:12,760
There's a small difficulty,
the one that we also
340
00:19:12,760 --> 00:19:14,310
encountered last time.
341
00:19:14,310 --> 00:19:18,110
And the difficulty is that the
standard deviation, sigma, of
342
00:19:18,110 --> 00:19:20,720
the Xi's is not known.
343
00:19:20,720 --> 00:19:24,560
Sigma is equal to f times--
344
00:19:24,560 --> 00:19:30,090
sigma, in this example, is the square
root of f times (1-f), and the only
345
00:19:30,090 --> 00:19:32,690
thing that we know about sigma
is that it's going to be a
346
00:19:32,690 --> 00:19:35,010
number less than 1/2.
347
00:19:35,010 --> 00:19:39,810
348
00:19:39,810 --> 00:19:45,180
OK, so we're going to have to
use an inequality here.
349
00:19:45,180 --> 00:19:48,890
We're going to use a
conservative value of sigma,
350
00:19:48,890 --> 00:19:54,120
the value of sigma equal to 1/2
and use that instead of
351
00:19:54,120 --> 00:19:55,760
the exact value of sigma.
352
00:19:55,760 --> 00:19:59,100
And this gives us an inequality
going this way.
353
00:19:59,100 --> 00:20:03,710
Let's just make sure why the
inequality goes this way.
354
00:20:03,710 --> 00:20:06,683
We got, on our axis,
two numbers.
355
00:20:06,683 --> 00:20:12,390
356
00:20:12,390 --> 00:20:21,650
One number is 0.01 square
root n divided by sigma.
357
00:20:21,650 --> 00:20:27,870
And the other number is
0.02 square root of n.
358
00:20:27,870 --> 00:20:30,840
And my claim is that the numbers
are related to each
359
00:20:30,840 --> 00:20:32,930
other in this particular way.
360
00:20:32,930 --> 00:20:33,500
Why is this?
361
00:20:33,500 --> 00:20:35,410
Sigma is less than 1/2.
362
00:20:35,410 --> 00:20:39,580
So 1/sigma is bigger than 2.
363
00:20:39,580 --> 00:20:44,020
So since 1/sigma is bigger than
2, this means that this
364
00:20:44,020 --> 00:20:47,740
number sits to the right
of that number.
365
00:20:47,740 --> 00:20:51,950
So here we have the probability
that Z is bigger
366
00:20:51,950 --> 00:20:54,820
than this number.
367
00:20:54,820 --> 00:20:59,060
The probability of falling out
there is less than the
368
00:20:59,060 --> 00:21:03,060
probability of falling
in this interval.
369
00:21:03,060 --> 00:21:06,170
So that's what that last
inequality is saying--
370
00:21:06,170 --> 00:21:09,330
this probability is smaller
than that probability.
371
00:21:09,330 --> 00:21:12,010
This is the probability that
we're interested in, but since
372
00:21:12,010 --> 00:21:16,490
we don't know sigma, we take the
conservative value, and we
373
00:21:16,490 --> 00:21:21,610
use an upper bound in terms
of the probability of this
374
00:21:21,610 --> 00:21:23,730
interval here.
375
00:21:23,730 --> 00:21:26,920
And now we are in business.
376
00:21:26,920 --> 00:21:30,980
We can start using our normal
tables to calculate
377
00:21:30,980 --> 00:21:33,140
probabilities of interest.
378
00:21:33,140 --> 00:21:40,300
So for example, let's say that
we take n to be 10,000.
379
00:21:40,300 --> 00:21:42,370
How is the calculation
going to go?
380
00:21:42,370 --> 00:21:45,860
We want to calculate the
probability that the absolute
381
00:21:45,860 --> 00:21:52,920
value of Z is bigger than 0.2
times 1000, which is the
382
00:21:52,920 --> 00:21:56,530
probability that the absolute
value of Z is larger than or
383
00:21:56,530 --> 00:21:58,490
equal to 2.
384
00:21:58,490 --> 00:22:00,500
And here let's do
some mechanics,
385
00:22:00,500 --> 00:22:03,300
just to stay in shape.
386
00:22:03,300 --> 00:22:05,860
The probability that you're
larger than or equal to 2 in
387
00:22:05,860 --> 00:22:09,290
absolute value, since the normal
is symmetric around the
388
00:22:09,290 --> 00:22:13,590
mean, this is going to be twice
the probability that Z
389
00:22:13,590 --> 00:22:16,560
is larger than or equal to 2.
390
00:22:16,560 --> 00:22:22,330
Can we use the cumulative
distribution function of Z to
391
00:22:22,330 --> 00:22:23,300
calculate this?
392
00:22:23,300 --> 00:22:26,100
Well, almost. The cumulative
gives us probabilities of
393
00:22:26,100 --> 00:22:28,910
being less than something, not
bigger than something.
394
00:22:28,910 --> 00:22:33,480
So we need one more step and
write this as 1 minus the
395
00:22:33,480 --> 00:22:38,420
probability that Z is less
than or equal to 2.
396
00:22:38,420 --> 00:22:41,620
And this probability, now,
you can read off
397
00:22:41,620 --> 00:22:43,770
from the normal tables.
398
00:22:43,770 --> 00:22:46,460
And the normal tables will
tell you that this
399
00:22:46,460 --> 00:22:52,840
probability is 0.9772.
400
00:22:52,840 --> 00:22:54,520
And you do get an answer.
401
00:22:54,520 --> 00:23:02,530
And the answer is 0.0456.
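This table lookup can be reproduced with the error function instead of the printed table (a small sketch; erf gives 0.97725 where the table shows 0.9772, hence 0.0455 rather than 0.0456):

```python
import math

def phi(x):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n = 10000
c = 0.02 * math.sqrt(n)         # worst-case bound: sigma <= 1/2 turns 0.01/sigma into 0.02
p_error = 2.0 * (1.0 - phi(c))  # P(|Z| >= c) = 2 P(Z >= c), by symmetry

print(round(c, 1))              # 2.0
print(round(p_error, 4))        # 0.0455
```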
402
00:23:02,530 --> 00:23:05,220
OK, so we tried 10,000.
403
00:23:05,220 --> 00:23:10,990
And we find that our probability
of error is 4.5%, so we're
404
00:23:10,990 --> 00:23:15,710
doing better than the
spec that we had.
405
00:23:15,710 --> 00:23:19,490
So this tells us that maybe
we have some leeway.
406
00:23:19,490 --> 00:23:24,070
Maybe we can use a smaller
sample size and still stay
407
00:23:24,070 --> 00:23:26,030
within our specs.
408
00:23:26,030 --> 00:23:29,630
Let's try to find how much
we can push the envelope.
409
00:23:29,630 --> 00:23:34,716
How much smaller
can we take n?
410
00:23:34,716 --> 00:23:37,890
To answer that question, we
need to do this kind of
411
00:23:37,890 --> 00:23:40,790
calculation, essentially,
going backwards.
412
00:23:40,790 --> 00:23:46,420
We're going to fix this number
to be 0.05 and work backwards
413
00:23:46,420 --> 00:23:49,130
here to find--
414
00:23:49,130 --> 00:23:50,770
did I make a mistake here?
415
00:23:50,770 --> 00:23:51,770
10,000.
416
00:23:51,770 --> 00:23:53,700
So I'm missing a 0 here.
417
00:23:53,700 --> 00:23:57,440
418
00:23:57,440 --> 00:24:07,540
Ah, but I'm taking the square
root, so it's 100.
419
00:24:07,540 --> 00:24:11,080
Where did the 0.02
come in from?
420
00:24:11,080 --> 00:24:12,020
Ah, from here.
421
00:24:12,020 --> 00:24:15,870
OK, all right.
422
00:24:15,870 --> 00:24:19,330
0.02 times 100, that
gives us 2.
423
00:24:19,330 --> 00:24:22,130
OK, all right.
424
00:24:22,130 --> 00:24:24,240
Very good, OK.
425
00:24:24,240 --> 00:24:27,570
So we'll have to do this
calculation now backwards,
426
00:24:27,570 --> 00:24:33,510
figure out if this is 0.05,
what kind of number we're
427
00:24:33,510 --> 00:24:41,380
going to need here and then
here, and from this we will be
428
00:24:41,380 --> 00:24:45,240
able to tell what value
of n do we need.
429
00:24:45,240 --> 00:24:53,670
OK, so we want to find n such
that the probability that Z is
430
00:24:53,670 --> 00:25:04,870
bigger than 0.02 square
root n is 0.05.
431
00:25:04,870 --> 00:25:09,320
OK, so Z is a standard normal
random variable.
432
00:25:09,320 --> 00:25:16,810
And we want the probability
that we are
433
00:25:16,810 --> 00:25:18,640
outside this range.
434
00:25:18,640 --> 00:25:21,940
We want the probability of
those two tails together.
435
00:25:21,940 --> 00:25:24,960
436
00:25:24,960 --> 00:25:26,920
Those two tails together
should have
437
00:25:26,920 --> 00:25:29,990
probability of 0.05.
438
00:25:29,990 --> 00:25:33,280
This means that this tail,
by itself, should have
439
00:25:33,280 --> 00:25:36,900
probability 0.025.
440
00:25:36,900 --> 00:25:45,960
And this means that this
probability should be 0.975.
441
00:25:45,960 --> 00:25:52,350
Now, if this probability
is to be 0.975, what
442
00:25:52,350 --> 00:25:54,970
should that number be?
443
00:25:54,970 --> 00:25:59,980
You go to the normal tables,
and you find which is the
444
00:25:59,980 --> 00:26:03,190
entry that corresponds
to that number.
445
00:26:03,190 --> 00:26:07,020
I actually brought a normal
table with me.
446
00:26:07,020 --> 00:26:12,740
And 0.975 is down here.
447
00:26:12,740 --> 00:26:15,420
And it tells you that
the number that
448
00:26:15,420 --> 00:26:19,820
corresponds to it is 1.96.
449
00:26:19,820 --> 00:26:24,890
So this tells us that
this number
450
00:26:24,890 --> 00:26:31,790
should be equal to 1.96.
451
00:26:31,790 --> 00:26:36,380
And now, from here, you
do the calculations.
452
00:26:36,380 --> 00:26:47,510
And you find that n is 9604.
453
00:26:47,510 --> 00:26:53,200
So with a sample of 10,000, we
got probability of error 4.5%.
454
00:26:53,200 --> 00:26:57,910
With a slightly smaller sample
size of 9,600, we can get the
455
00:26:57,910 --> 00:27:01,880
probability of a mistake
to be 0.05, which
456
00:27:01,880 --> 00:27:04,070
was exactly our spec.
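[Editor's note: the backwards calculation can be sketched in a few lines. Here 1.96 is the table entry for 0.975 quoted above, and 0.02 is the accuracy spec from the example.]

```python
# Solve 0.02 * sqrt(n) = 1.96 for n.
z = 1.96          # normal table: P(Z <= 1.96) = 0.975
accuracy = 0.02   # the accuracy spec from the example
n = round((z / accuracy) ** 2)   # sqrt(n) = 98, so n = 98^2
print(n)          # 9604
```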
457
00:27:04,070 --> 00:27:07,450
So these are essentially the two
ways that you're going to
458
00:27:07,450 --> 00:27:09,830
be using the central
limit theorem.
459
00:27:09,830 --> 00:27:12,690
Either you're given n and
you try to calculate
460
00:27:12,690 --> 00:27:13,610
probabilities.
461
00:27:13,610 --> 00:27:15,590
Or you're given the
probabilities, and you want to
462
00:27:15,590 --> 00:27:18,210
work backwards to
find n itself.
463
00:27:18,210 --> 00:27:20,990
464
00:27:20,990 --> 00:27:27,710
So in this example, the random
variable that we dealt with
465
00:27:27,710 --> 00:27:30,450
was, of course, a binomial
random variable.
466
00:27:30,450 --> 00:27:38,590
The Xi's were Bernoulli,
so the sum of
467
00:27:38,590 --> 00:27:40,950
the Xi's was binomial.
468
00:27:40,950 --> 00:27:44,100
So the central limit theorem
certainly applies to the
469
00:27:44,100 --> 00:27:45,950
binomial distribution.
470
00:27:45,950 --> 00:27:49,440
To be more precise, of course,
it applies to the standardized
471
00:27:49,440 --> 00:27:52,730
version of the binomial
random variable.
472
00:27:52,730 --> 00:27:55,140
So here's what we did,
essentially, in
473
00:27:55,140 --> 00:27:57,300
the previous example.
474
00:27:57,300 --> 00:28:00,690
We fixed the number p, which is
the probability of success
475
00:28:00,690 --> 00:28:02,010
in our experiments.
476
00:28:02,010 --> 00:28:06,550
p corresponds to f in the
previous example.
477
00:28:06,550 --> 00:28:10,570
Let every Xi be a Bernoulli
random variable, and our
478
00:28:10,570 --> 00:28:13,790
standing assumption is that
these random variables are
479
00:28:13,790 --> 00:28:15,040
independent.
480
00:28:15,040 --> 00:28:17,580
481
00:28:17,580 --> 00:28:20,730
When we add them, we get a
random variable that has a
482
00:28:20,730 --> 00:28:22,030
binomial distribution.
483
00:28:22,030 --> 00:28:25,220
We know the mean and the
variance of the binomial, so
484
00:28:25,220 --> 00:28:29,130
we take Sn, we subtract the
mean, which is this, divide by
485
00:28:29,130 --> 00:28:30,470
the standard deviation.
486
00:28:30,470 --> 00:28:32,790
The central limit theorem tells
us that the cumulative
487
00:28:32,790 --> 00:28:36,130
distribution function of this
random variable converges to that of a standard
488
00:28:36,130 --> 00:28:39,860
normal random variable
in the limit.
489
00:28:39,860 --> 00:28:43,730
So let's do one more example
of a calculation.
490
00:28:43,730 --> 00:28:47,160
Let's take n to be--
491
00:28:47,160 --> 00:28:50,110
let's choose some specific
numbers to work with.
492
00:28:50,110 --> 00:28:52,950
493
00:28:52,950 --> 00:28:58,300
So in this example, first thing
to do is to find the
494
00:28:58,300 --> 00:29:02,390
expected value of Sn,
which is n times p.
495
00:29:02,390 --> 00:29:04,150
It's 18.
496
00:29:04,150 --> 00:29:08,100
Then we need to write down
the standard deviation.
497
00:29:08,100 --> 00:29:12,430
498
00:29:12,430 --> 00:29:16,530
The variance of Sn is the
sum of the variances.
499
00:29:16,530 --> 00:29:19,940
It's np times (1-p).
500
00:29:19,940 --> 00:29:25,920
And in this particular example,
p times (1-p) is 1/4,
501
00:29:25,920 --> 00:29:28,320
n is 36, so this is 9.
502
00:29:28,320 --> 00:29:33,120
And that tells us that the
standard deviation of this n
503
00:29:33,120 --> 00:29:34,370
is equal to 3.
504
00:29:34,370 --> 00:29:37,170
505
00:29:37,170 --> 00:29:40,650
So what we're going to do is to
take the event of interest,
506
00:29:40,650 --> 00:29:46,400
which is Sn less than 21, and
rewrite it in a way that
507
00:29:46,400 --> 00:29:48,910
involves the standardized
random variable.
508
00:29:48,910 --> 00:29:51,700
So to do that, we need
to subtract the mean.
509
00:29:51,700 --> 00:29:55,680
So we write this as Sn-3
should be less
510
00:29:55,680 --> 00:29:58,460
than or equal to 21-3.
511
00:29:58,460 --> 00:30:00,360
This is the same event.
512
00:30:00,360 --> 00:30:02,890
And then divide by the standard
deviation, which is
513
00:30:02,890 --> 00:30:06,450
3, and we end up with this.
514
00:30:06,450 --> 00:30:08,300
So the event itself of--
515
00:30:08,300 --> 00:30:09,550
AUDIENCE: [INAUDIBLE].
516
00:30:09,550 --> 00:30:13,700
517
00:30:13,700 --> 00:30:24,150
PROFESSOR: Should subtract 18, yes, which
gives me a much nicer
518
00:30:24,150 --> 00:30:26,640
number out here, which is 1.
519
00:30:26,640 --> 00:30:31,650
So the event of interest, that
Sn is less than 21, is the
520
00:30:31,650 --> 00:30:37,330
same as the event that a
standard normal random
521
00:30:37,330 --> 00:30:41,580
variable is less than
or equal to 1.
522
00:30:41,580 --> 00:30:44,690
And once more, you can look this
up at the normal tables.
523
00:30:44,690 --> 00:30:50,690
And you find that the answer
that you get is 0.8413.
524
00:30:50,690 --> 00:30:53,390
Now it's interesting to compare
this answer that we
525
00:30:53,390 --> 00:30:57,230
got through the central limit
theorem with the exact answer.
526
00:30:57,230 --> 00:31:01,920
The exact answer involves the
exact binomial distribution.
527
00:31:01,920 --> 00:31:08,780
What we have here is the
binomial probability that Sn
528
00:31:08,780 --> 00:31:10,970
is equal to k.
529
00:31:10,970 --> 00:31:15,230
Sn being equal to k is given
by this formula.
530
00:31:15,230 --> 00:31:22,610
And we add, over all values for
k going from 0 up to 21,
531
00:31:22,610 --> 00:31:28,670
we write two lines of code to
calculate this sum, and we get
532
00:31:28,670 --> 00:31:32,530
the exact answer,
which is 0.8785.
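[Editor's note: the "two lines of code" for the exact answer can be sketched like this, with n = 36 and p = 1/2 as in the example on the board.]

```python
from math import comb

n, p = 36, 0.5
# Exact binomial probability P(Sn <= 21): sum the PMF for k = 0, ..., 21.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(22))
print(round(exact, 4))   # close to 0.8785, the value quoted in lecture
```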
533
00:31:32,530 --> 00:31:35,760
So there's a pretty good
agreement between the two,
534
00:31:35,760 --> 00:31:38,600
although you wouldn't
call it
535
00:31:38,600 --> 00:31:40,395
necessarily excellent agreement.
536
00:31:40,395 --> 00:31:45,080
537
00:31:45,080 --> 00:31:47,060
Can we do a little
better than that?
538
00:31:47,060 --> 00:31:51,570
539
00:31:51,570 --> 00:31:53,750
OK.
540
00:31:53,750 --> 00:31:56,510
It turns out that we can.
541
00:31:56,510 --> 00:31:58,625
And here's the idea.
542
00:31:58,625 --> 00:32:02,300
543
00:32:02,300 --> 00:32:07,750
So our random variable
Sn has a mean of 18.
544
00:32:07,750 --> 00:32:09,540
It has a binomial
distribution.
545
00:32:09,540 --> 00:32:14,050
It's described by a PMF that has
a shape roughly like this
546
00:32:14,050 --> 00:32:16,690
and which keeps going on.
547
00:32:16,690 --> 00:32:20,960
Using the central limit
theorem is basically
548
00:32:20,960 --> 00:32:26,650
pretending that Sn is
normal with the
549
00:32:26,650 --> 00:32:28,650
right mean and variance.
550
00:32:28,650 --> 00:32:35,200
So since Zn has
0 mean and unit variance, we
551
00:32:35,200 --> 00:32:38,850
approximate it with Z, which
has 0 mean and unit variance.
552
00:32:38,850 --> 00:32:42,190
If you were to pretend that
Sn is normal, you would
553
00:32:42,190 --> 00:32:45,407
approximate it with a normal
that has the correct mean and
554
00:32:45,407 --> 00:32:46,250
correct variance.
555
00:32:46,250 --> 00:32:49,390
So it would still be
centered at 18.
556
00:32:49,390 --> 00:32:53,800
And it would have the same
variance as the binomial PMF.
557
00:32:53,800 --> 00:32:57,350
So using the central limit
theorem essentially means that
558
00:32:57,350 --> 00:33:00,420
we keep the mean and the
variance what they are but we
559
00:33:00,420 --> 00:33:03,960
pretend that our distribution
is normal.
560
00:33:03,960 --> 00:33:06,780
We want to calculate the
probability that Sn is less
561
00:33:06,780 --> 00:33:09,590
than or equal to 21.
562
00:33:09,590 --> 00:33:14,310
I pretend that my random
variable is normal, so I draw
563
00:33:14,310 --> 00:33:18,680
a line here and I calculate
the area under the normal
564
00:33:18,680 --> 00:33:22,000
curve going up to 21.
565
00:33:22,000 --> 00:33:23,500
That's essentially
what we did.
566
00:33:23,500 --> 00:33:26,260
567
00:33:26,260 --> 00:33:29,730
Now, a smart person comes
around and says, Sn is a
568
00:33:29,730 --> 00:33:31,360
discrete random variable.
569
00:33:31,360 --> 00:33:34,750
So the event that Sn is less
than or equal to 21 is the
570
00:33:34,750 --> 00:33:38,480
same as Sn being strictly less
than 22 because nothing in
571
00:33:38,480 --> 00:33:41,240
between can happen.
572
00:33:41,240 --> 00:33:43,700
So I'm going to use the
central limit theorem
573
00:33:43,700 --> 00:33:48,290
approximation by pretending
again that Sn is normal and
574
00:33:48,290 --> 00:33:51,650
finding the probability of this
event while pretending
575
00:33:51,650 --> 00:33:53,720
that Sn is normal.
576
00:33:53,720 --> 00:33:57,870
So what this person would do
would be to draw a line here,
577
00:33:57,870 --> 00:34:02,780
at 22, and calculate the area
under the normal curve
578
00:34:02,780 --> 00:34:05,490
all the way to 22.
579
00:34:05,490 --> 00:34:06,700
Who is right?
580
00:34:06,700 --> 00:34:08,820
Which one is better?
581
00:34:08,820 --> 00:34:15,639
Well neither, but we can do
better than both if we sort of
582
00:34:15,639 --> 00:34:17,949
split the difference.
583
00:34:17,949 --> 00:34:21,969
So another way of writing the
same event for Sn is to write
584
00:34:21,969 --> 00:34:25,940
it as Sn being less than 21.5.
585
00:34:25,940 --> 00:34:29,570
In terms of the discrete random
variable Sn, all three
586
00:34:29,570 --> 00:34:32,239
of these are exactly
the same event.
587
00:34:32,239 --> 00:34:35,090
But when you do the continuous
approximation, they give you
588
00:34:35,090 --> 00:34:36,250
different probabilities.
589
00:34:36,250 --> 00:34:39,760
It's a matter of whether you
integrate the area under the
590
00:34:39,760 --> 00:34:46,159
normal curve up to here, up to
the midway point, or up to 22.
591
00:34:46,159 --> 00:34:50,659
It turns out that integrating
up to the midpoint is what
592
00:34:50,659 --> 00:34:54,469
gives us the better
numerical results.
593
00:34:54,469 --> 00:34:59,170
So we take here 21 and 1/2,
and we integrate the area
594
00:34:59,170 --> 00:35:01,170
under the normal curve
up to here.
595
00:35:01,170 --> 00:35:14,100
596
00:35:14,100 --> 00:35:18,560
So let's do this calculation
and see what we get.
597
00:35:18,560 --> 00:35:21,330
What would we change here?
598
00:35:21,330 --> 00:35:27,730
Instead of 21, we would
now write 21 and 1/2.
599
00:35:27,730 --> 00:35:32,810
This 18 becomes, no, that
18 stays what it is.
600
00:35:32,810 --> 00:35:36,890
But this 21 becomes
21 and 1/2.
601
00:35:36,890 --> 00:35:44,790
And so this one becomes
1 + 0.5/3.
602
00:35:44,790 --> 00:35:48,210
This is 1.17.
603
00:35:48,210 --> 00:35:51,980
So we now look up into the
normal tables and ask for the
604
00:35:51,980 --> 00:36:00,000
probability that Z is
less than 1.17.
605
00:36:00,000 --> 00:36:06,070
So this here gets approximated
by the probability that the
606
00:36:06,070 --> 00:36:09,240
standard normal is
less than 1.17.
607
00:36:09,240 --> 00:36:15,960
And the normal tables will
tell us this is 0.879.
608
00:36:15,960 --> 00:36:23,550
Going back to the previous
slide, what we got this time
609
00:36:23,550 --> 00:36:30,310
with this improved approximation
is 0.879.
610
00:36:30,310 --> 00:36:33,730
This is a really good
approximation
611
00:36:33,730 --> 00:36:35,730
of the correct number.
612
00:36:35,730 --> 00:36:39,160
This is what we got
using the 21.
613
00:36:39,160 --> 00:36:42,360
This is what we get using
the 21 and 1/2.
614
00:36:42,360 --> 00:36:45,940
And it's an approximation that's
sort of right on-- a
615
00:36:45,940 --> 00:36:48,350
very good one.
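[Editor's note: to see the three choices side by side, here is a small sketch that integrates the normal curve up to 21, up to the midpoint 21.5, and up to 22, using the mean 18 and standard deviation 3 from the example.]

```python
import math

def phi(x):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mean, sd = 18.0, 3.0
# Three ways of drawing the cutoff for the event {Sn <= 21}.
approx = {c: phi((c - mean) / sd) for c in (21.0, 21.5, 22.0)}
for cutoff, value in approx.items():
    print(cutoff, round(value, 4))
# The midpoint 21.5 lands closest to the exact binomial answer 0.8785.
```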
616
00:36:48,350 --> 00:36:54,120
The moral from this numerical
example is that doing this
617
00:36:54,120 --> 00:37:00,933
1/2 correction does give
us better approximations.
618
00:37:00,933 --> 00:37:06,070
619
00:37:06,070 --> 00:37:12,010
In fact, we can use this 1/2
idea to even calculate
620
00:37:12,010 --> 00:37:14,340
individual probabilities.
621
00:37:14,340 --> 00:37:17,130
So suppose you want to
approximate the probability
622
00:37:17,130 --> 00:37:21,010
that Sn is equal to 19.
623
00:37:21,010 --> 00:37:25,620
If you were to pretend that Sn
is normal and calculate this
624
00:37:25,620 --> 00:37:28,470
probability, the probability
that the normal random
625
00:37:28,470 --> 00:37:31,670
variable is equal to 19 is 0.
626
00:37:31,670 --> 00:37:34,150
So you don't get an interesting
answer.
627
00:37:34,150 --> 00:37:37,610
You get a more interesting
answer by writing this event,
628
00:37:37,610 --> 00:37:41,460
19 as being the same as the
event of falling between 18
629
00:37:41,460 --> 00:37:45,910
and 1/2 and 19 and 1/2 and using
the normal approximation
630
00:37:45,910 --> 00:37:48,230
to calculate this probability.
631
00:37:48,230 --> 00:37:51,890
In terms of our previous
picture, this corresponds to
632
00:37:51,890 --> 00:37:53,140
the following.
633
00:37:53,140 --> 00:37:59,400
634
00:37:59,400 --> 00:38:04,650
We are interested in the
probability that
635
00:38:04,650 --> 00:38:07,130
Sn is equal to 19.
636
00:38:07,130 --> 00:38:11,230
So we're interested in the
height of this bar.
637
00:38:11,230 --> 00:38:15,720
We're going to consider the area
under the normal curve
638
00:38:15,720 --> 00:38:21,500
going from here to here,
and use this area as an
639
00:38:21,500 --> 00:38:25,110
approximation for the height
of that particular bar.
640
00:38:25,110 --> 00:38:30,670
So what we're basically doing
is, we take the probability
641
00:38:30,670 --> 00:38:33,830
under the normal curve that's
assigned over a continuum of
642
00:38:33,830 --> 00:38:38,280
values and attribute it to
different discrete values.
643
00:38:38,280 --> 00:38:43,510
Whatever is above the midpoint
gets attributed to 19.
644
00:38:43,510 --> 00:38:45,640
Whatever is below that
midpoint gets
645
00:38:45,640 --> 00:38:47,250
attributed to 18.
646
00:38:47,250 --> 00:38:54,280
So this green area is our
approximation of the value of
647
00:38:54,280 --> 00:38:56,500
the PMF at 19.
648
00:38:56,500 --> 00:39:00,740
So similarly, if you wanted to
approximate the value of the
649
00:39:00,740 --> 00:39:04,440
PMF at this point, you would
take this interval and
650
00:39:04,440 --> 00:39:06,580
integrate the area
under the normal
651
00:39:06,580 --> 00:39:09,350
curve over that interval.
652
00:39:09,350 --> 00:39:13,410
It turns out that this gives a
very good approximation of the
653
00:39:13,410 --> 00:39:15,660
PMF of the binomial.
654
00:39:15,660 --> 00:39:22,580
And actually, this was the
context in which the central
655
00:39:22,580 --> 00:39:26,310
limit theorem was proved in
the first place, when this
656
00:39:26,310 --> 00:39:27,990
business started.
657
00:39:27,990 --> 00:39:33,060
So this business goes back
a few hundred years.
658
00:39:33,060 --> 00:39:35,700
And the central limit theorem
was first proved by
659
00:39:35,700 --> 00:39:39,420
considering the PMF of a
binomial random variable when
660
00:39:39,420 --> 00:39:41,840
p is equal to 1/2.
661
00:39:41,840 --> 00:39:45,590
People did the algebra, and they
found out that the exact
662
00:39:45,590 --> 00:39:49,700
expression for the PMF is quite
well approximated by
663
00:39:49,700 --> 00:39:51,980
the expression that you would
get from a normal
664
00:39:51,980 --> 00:39:53,380
distribution.
665
00:39:53,380 --> 00:39:57,510
Then the proof was extended to
binomials for more general
666
00:39:57,510 --> 00:39:59,690
values of p.
667
00:39:59,690 --> 00:40:04,220
So here we talk about this as
a refinement of the general
668
00:40:04,220 --> 00:40:07,480
central limit theorem, but,
historically, that refinement
669
00:40:07,480 --> 00:40:09,830
was where the whole business
got started
670
00:40:09,830 --> 00:40:11,820
in the first place.
671
00:40:11,820 --> 00:40:18,700
All right, so let's go through
the mechanics of approximating
672
00:40:18,700 --> 00:40:21,970
the probability that
Sn is equal to 19--
673
00:40:21,970 --> 00:40:23,810
exactly 19.
674
00:40:23,810 --> 00:40:27,340
As we said, we're going to write
this event as an event
675
00:40:27,340 --> 00:40:31,040
that covers an interval of unit
length from 18 and 1/2 to
676
00:40:31,040 --> 00:40:31,970
19 and 1/2.
677
00:40:31,970 --> 00:40:33,730
This is the event of interest.
678
00:40:33,730 --> 00:40:37,070
First step is to massage the
event of interest so that it
679
00:40:37,070 --> 00:40:40,010
involves our Zn random
variable.
680
00:40:40,010 --> 00:40:43,290
So subtract 18 from all sides.
681
00:40:43,290 --> 00:40:46,860
Divide by the standard deviation
of 3 from all sides.
682
00:40:46,860 --> 00:40:50,850
That's the equivalent
representation of the event.
683
00:40:50,850 --> 00:40:54,200
This is our standardized
random variable Zn.
684
00:40:54,200 --> 00:40:56,950
These are just these numbers.
685
00:40:56,950 --> 00:41:00,530
And to do an approximation, we
want to find the probability
686
00:41:00,530 --> 00:41:04,380
of this event, but Zn is
approximately normal, so we
687
00:41:04,380 --> 00:41:08,030
plug in here the Z, which
is the standard normal.
688
00:41:08,030 --> 00:41:10,150
So we want to find the
probability that the standard
689
00:41:10,150 --> 00:41:12,890
normal falls inside
this interval.
690
00:41:12,890 --> 00:41:15,630
You find these using CDFs
because this is the
691
00:41:15,630 --> 00:41:18,760
probability that you're
less than this but
692
00:41:18,760 --> 00:41:22,370
not less than that.
693
00:41:22,370 --> 00:41:25,370
So it's a difference between two
cumulative probabilities.
694
00:41:25,370 --> 00:41:27,400
Then, you look up your
normal tables.
695
00:41:27,400 --> 00:41:30,560
You find two numbers for these
quantities, and, finally, you
696
00:41:30,560 --> 00:41:35,140
get a numerical answer for an
individual entry of the PMF of
697
00:41:35,140 --> 00:41:36,480
the binomial.
698
00:41:36,480 --> 00:41:39,350
This is a pretty good
approximation, it turns out.
699
00:41:39,350 --> 00:41:42,910
If you were to do the
calculations using the exact
700
00:41:42,910 --> 00:41:47,130
formula, you would
get something
701
00:41:47,130 --> 00:41:49,360
which is pretty close--
702
00:41:49,360 --> 00:41:52,800
an error in the third digit--
703
00:41:52,800 --> 00:41:56,980
this is pretty good.
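[Editor's note: a sketch of this PMF approximation for the same numbers, n = 36 and p = 1/2, so mean 18 and standard deviation 3.]

```python
import math
from math import comb

def phi(x):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 36, 0.5
mean, sd = 18.0, 3.0
exact = comb(n, 19) * p**19 * (1 - p)**(n - 19)            # binomial PMF at 19
approx = phi((19.5 - mean) / sd) - phi((18.5 - mean) / sd)  # area from 18.5 to 19.5
print(round(exact, 4), round(approx, 4))   # they differ only in the third digit or so
```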
704
00:41:56,980 --> 00:41:59,650
So I guess what we did here
with our discussion of the
705
00:41:59,650 --> 00:42:04,560
binomial slightly contradicts
what I said before--
706
00:42:04,560 --> 00:42:07,330
that the central limit theorem
is a statement about
707
00:42:07,330 --> 00:42:09,240
cumulative distribution
functions.
708
00:42:09,240 --> 00:42:13,240
In general, it doesn't tell you
what to do to approximate
709
00:42:13,240 --> 00:42:15,270
PMFs themselves.
710
00:42:15,270 --> 00:42:17,440
And that's indeed the
case in general.
711
00:42:17,440 --> 00:42:20,220
On the other hand, for the
special case of a binomial
712
00:42:20,220 --> 00:42:23,610
distribution, the central limit
theorem approximation,
713
00:42:23,610 --> 00:42:28,200
with this 1/2 correction, is a
very good approximation even
714
00:42:28,200 --> 00:42:29,560
for the individual PMF.
715
00:42:29,560 --> 00:42:33,290
716
00:42:33,290 --> 00:42:40,210
All right, so we spent quite
a bit of time on mechanics.
717
00:42:40,210 --> 00:42:46,050
So let's spend the last few
minutes today thinking a bit
718
00:42:46,050 --> 00:42:47,930
and look at a small puzzle.
719
00:42:47,930 --> 00:42:51,390
720
00:42:51,390 --> 00:42:54,240
So the puzzle is
the following.
721
00:42:54,240 --> 00:43:02,460
Consider a Poisson process that
runs over a unit interval.
722
00:43:02,460 --> 00:43:07,770
And where the arrival
rate is equal to 1.
723
00:43:07,770 --> 00:43:09,790
So this is the unit interval.
724
00:43:09,790 --> 00:43:12,720
And let X be the number
of arrivals.
725
00:43:12,720 --> 00:43:15,430
726
00:43:15,430 --> 00:43:19,930
And this is Poisson,
with mean 1.
727
00:43:19,930 --> 00:43:25,000
728
00:43:25,000 --> 00:43:28,160
Now, let me take this interval
and divide it
729
00:43:28,160 --> 00:43:30,650
into n little pieces.
730
00:43:30,650 --> 00:43:34,270
So each piece has length 1/n.
731
00:43:34,270 --> 00:43:41,225
And let Xi be the number
of arrivals during
732
00:43:41,225 --> 00:43:43,490
the i-th little interval.
733
00:43:43,490 --> 00:43:48,000
734
00:43:48,000 --> 00:43:51,630
OK, what do we know about
the random variables Xi?
735
00:43:51,630 --> 00:43:55,260
Well, they are themselves
Poisson.
736
00:43:55,260 --> 00:43:58,490
It's a number of arrivals
during a small interval.
737
00:43:58,490 --> 00:44:02,340
We also know that when n is
big, so the length of the
738
00:44:02,340 --> 00:44:08,190
interval is small, these Xi's
are approximately Bernoulli,
739
00:44:08,190 --> 00:44:11,730
with mean 1/n.
740
00:44:11,730 --> 00:44:13,970
I guess it doesn't matter whether
we model them as
741
00:44:13,970 --> 00:44:15,720
Bernoulli or not.
742
00:44:15,720 --> 00:44:19,660
What matters is that the
Xi's are independent.
743
00:44:19,660 --> 00:44:20,970
Why are they independent?
744
00:44:20,970 --> 00:44:24,410
Because, in a Poisson process,
disjoint intervals are
745
00:44:24,410 --> 00:44:26,770
independent of each other.
746
00:44:26,770 --> 00:44:28,955
So the Xi's are independent.
747
00:44:28,955 --> 00:44:31,840
748
00:44:31,840 --> 00:44:35,570
And they also have the
same distribution.
749
00:44:35,570 --> 00:44:40,360
And we have that X, the total
number of arrivals, is the sum
750
00:44:40,360 --> 00:44:41,610
of the Xi's.
751
00:44:41,610 --> 00:44:44,470
752
00:44:44,470 --> 00:44:49,510
So the central limit theorem
tells us that, approximately,
753
00:44:49,510 --> 00:44:53,670
the sum of independent,
identically distributed random
754
00:44:53,670 --> 00:44:57,720
variables, when we have lots
of these random variables,
755
00:44:57,720 --> 00:45:01,530
behaves like a normal
random variable.
756
00:45:01,530 --> 00:45:07,475
So by using this decomposition
of X into a sum of i.i.d
757
00:45:07,475 --> 00:45:11,540
random variables, and by using
values of n that are bigger
758
00:45:11,540 --> 00:45:16,540
and bigger, by taking the limit,
it should follow that X
759
00:45:16,540 --> 00:45:19,510
has a normal distribution.
760
00:45:19,510 --> 00:45:22,120
On the other hand, we know
that X has a Poisson
761
00:45:22,120 --> 00:45:23,370
distribution.
762
00:45:23,370 --> 00:45:25,270
763
00:45:25,270 --> 00:45:32,640
So something must be wrong
in this argument here.
764
00:45:32,640 --> 00:45:34,900
Can we really use the
central limit
765
00:45:34,900 --> 00:45:38,330
theorem in this situation?
766
00:45:38,330 --> 00:45:41,300
So what do we need for the
central limit theorem?
767
00:45:41,300 --> 00:45:44,160
We need to have independent,
identically
768
00:45:44,160 --> 00:45:46,700
distributed random variables.
769
00:45:46,700 --> 00:45:49,060
We have it here.
770
00:45:49,060 --> 00:45:53,410
We want them to have a finite
mean and finite variance.
771
00:45:53,410 --> 00:45:57,610
We also have it here, means
variances are finite.
772
00:45:57,610 --> 00:46:02,050
What is another assumption that
was never made explicit,
773
00:46:02,050 --> 00:46:04,080
but essentially was there?
774
00:46:04,080 --> 00:46:07,680
775
00:46:07,680 --> 00:46:13,260
Or in other words, what is the
flaw in this argument that
776
00:46:13,260 --> 00:46:15,520
uses the central limit
theorem here?
777
00:46:15,520 --> 00:46:16,770
Any thoughts?
778
00:46:16,770 --> 00:46:24,110
779
00:46:24,110 --> 00:46:29,640
So in the central limit theorem,
we said, consider--
780
00:46:29,640 --> 00:46:34,820
fix a probability distribution,
and let the Xi's
781
00:46:34,820 --> 00:46:38,280
be distributed according to that
probability distribution,
782
00:46:38,280 --> 00:46:42,935
and add a larger and larger
number of Xi's.
783
00:46:42,935 --> 00:46:47,410
But the underlying, unstated
assumption is that we fix the
784
00:46:47,410 --> 00:46:49,490
distribution of the Xi's.
785
00:46:49,490 --> 00:46:52,810
As we let n increase,
the statistics of
786
00:46:52,810 --> 00:46:55,930
each Xi do not change.
787
00:46:55,930 --> 00:46:59,010
Whereas here, I'm playing
a trick on you.
788
00:46:59,010 --> 00:47:03,700
As I'm taking more and more
random variables, I'm actually
789
00:47:03,700 --> 00:47:07,850
changing what those random
variables are.
790
00:47:07,850 --> 00:47:12,960
When I take a larger n, the Xi's
are random variables with
791
00:47:12,960 --> 00:47:15,720
a different mean and
different variance.
792
00:47:15,720 --> 00:47:19,800
So I'm adding more of these, but
at the same time, in this
793
00:47:19,800 --> 00:47:23,420
example, I'm changing
their distributions.
794
00:47:23,420 --> 00:47:26,380
That's something that doesn't
fit the setting of the central
795
00:47:26,380 --> 00:47:27,000
limit theorem.
796
00:47:27,000 --> 00:47:29,910
In the central limit theorem,
you first fix the distribution
797
00:47:29,910 --> 00:47:31,200
of the X's.
798
00:47:31,200 --> 00:47:35,290
You keep it fixed, and then you
consider adding more and
799
00:47:35,290 --> 00:47:38,950
more according to that
particular fixed distribution.
800
00:47:38,950 --> 00:47:40,020
So that's the catch.
801
00:47:40,020 --> 00:47:42,240
That's why the central limit
theorem does not
802
00:47:42,240 --> 00:47:43,970
apply to this situation.
803
00:47:43,970 --> 00:47:46,230
And we're lucky that it
doesn't apply because,
804
00:47:46,230 --> 00:47:50,220
otherwise, we would have a huge
contradiction destroying
805
00:47:50,220 --> 00:47:52,770
probability theory.
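[Editor's note: one concrete way to see why the normal limit fails here. If X is the sum of n independent Bernoulli(1/n)'s, then P(X = 0) = (1 - 1/n)^n, which converges to e^{-1}, the Poisson(1) probability of zero arrivals, rather than vanishing the way a normal limit would require. A minimal sketch:]

```python
import math

# P(X = 0) for the sum of n independent Bernoulli(1/n) variables.
for n in (10, 100, 10000):
    print(n, round((1 - 1 / n) ** n, 4))
print("Poisson(1) at 0:", round(math.exp(-1), 4))   # 0.3679
```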
806
00:47:52,770 --> 00:48:02,240
OK, but now that still
leaves us with a
807
00:48:02,240 --> 00:48:05,040
little bit of a dilemma.
808
00:48:05,040 --> 00:48:08,510
Suppose that, here, essentially
we're adding
809
00:48:08,510 --> 00:48:12,815
independent Bernoulli
random variables.
810
00:48:12,815 --> 00:48:22,650
811
00:48:22,650 --> 00:48:25,300
So the issue is that the central
limit theorem has to
812
00:48:25,300 --> 00:48:28,920
do with asymptotics as
n goes to infinity.
813
00:48:28,920 --> 00:48:34,260
And if we consider a binomial,
and somebody gives us specific
814
00:48:34,260 --> 00:48:38,870
numbers about the parameters of
that binomial, it might not
815
00:48:38,870 --> 00:48:40,830
necessarily be obvious
what kind of
816
00:48:40,830 --> 00:48:42,790
approximation do we use.
817
00:48:42,790 --> 00:48:45,660
In particular, we do have two
different approximations for
818
00:48:45,660 --> 00:48:47,100
the binomial.
819
00:48:47,100 --> 00:48:51,610
If we fix p, then the binomial
is the sum of Bernoulli's that
820
00:48:51,610 --> 00:48:54,930
come from a fixed distribution,
we consider more
821
00:48:54,930 --> 00:48:56,450
and more of these.
822
00:48:56,450 --> 00:48:58,990
When we add them, the central
limit theorem tells us that we
823
00:48:58,990 --> 00:49:01,190
get the normal distribution.
824
00:49:01,190 --> 00:49:04,430
There's another sort of limit,
which has the flavor of this
825
00:49:04,430 --> 00:49:10,770
example, in which we still deal
with a binomial, sum of n
826
00:49:10,770 --> 00:49:11,170
Bernoulli's.
827
00:49:11,170 --> 00:49:14,310
We let that sum, the
number of the
828
00:49:14,310 --> 00:49:16,090
Bernoulli's go to infinity.
829
00:49:16,090 --> 00:49:18,890
But each Bernoulli has a
probability of success that
830
00:49:18,890 --> 00:49:23,830
goes to 0, and we do this in a
way so that np, the expected
831
00:49:23,830 --> 00:49:27,090
number of successes,
stays finite.
832
00:49:27,090 --> 00:49:30,660
This is the situation that we
dealt with when we first
833
00:49:30,660 --> 00:49:32,960
defined our Poisson process.
834
00:49:32,960 --> 00:49:37,540
We have a very, very large
number, so lots of time slots,
835
00:49:37,540 --> 00:49:40,920
but during each time slot,
there's a tiny probability of
836
00:49:40,920 --> 00:49:42,950
obtaining an arrival.
837
00:49:42,950 --> 00:49:48,460
Under that setting, in discrete
time, we have a
838
00:49:48,460 --> 00:49:51,670
binomial distribution, or
Bernoulli process, but when we
839
00:49:51,670 --> 00:49:54,530
take the limit, we obtain the
Poisson process and the
840
00:49:54,530 --> 00:49:56,470
Poisson approximation.
841
00:49:56,470 --> 00:49:58,510
So these are two equally valid
842
00:49:58,510 --> 00:50:00,550
approximations of the binomial.
843
00:50:00,550 --> 00:50:03,300
But they're valid in different
asymptotic regimes.
844
00:50:03,300 --> 00:50:06,180
In one regime, we fixed p,
let n go to infinity.
845
00:50:06,180 --> 00:50:09,360
In the other regime, we let
both n and p change
846
00:50:09,360 --> 00:50:11,540
simultaneously.
847
00:50:11,540 --> 00:50:14,240
Now, in real life, you're
never dealing with the
848
00:50:14,240 --> 00:50:15,290
limiting situations.
849
00:50:15,290 --> 00:50:17,870
You're dealing with
actual numbers.
850
00:50:17,870 --> 00:50:21,820
So if somebody tells you that
the numbers are like this,
851
00:50:21,820 --> 00:50:25,160
then you should probably say
that this is the situation
852
00:50:25,160 --> 00:50:27,380
that fits the Poisson
description--
853
00:50:27,380 --> 00:50:30,180
large number of slots with
each slot having a tiny
854
00:50:30,180 --> 00:50:32,460
probability of success.
855
00:50:32,460 --> 00:50:36,890
On the other hand, if p is
something like this, and n is
856
00:50:36,890 --> 00:50:40,460
500, then you expect to get
the distribution for the
857
00:50:40,460 --> 00:50:41,680
number of successes.
858
00:50:41,680 --> 00:50:45,740
It's going to have a mean of 50
and to have a fair amount
859
00:50:45,740 --> 00:50:47,280
of spread around there.
860
00:50:47,280 --> 00:50:50,150
It turns out that the normal
approximation would be better
861
00:50:50,150 --> 00:50:51,500
in this context.
862
00:50:51,500 --> 00:50:57,120
As a rule of thumb, if n times p
is bigger than 10 or 20, you
863
00:50:57,120 --> 00:50:59,320
can start using the normal
approximation.
864
00:50:59,320 --> 00:51:04,310
If n times p is a small number,
then you prefer to use
865
00:51:04,310 --> 00:51:06,090
the Poisson approximation.
866
00:51:06,090 --> 00:51:08,840
But there's no hard theorems
or rules about
867
00:51:08,840 --> 00:51:11,650
how to go about this.
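[Editor's note: the rule of thumb can be checked numerically. This sketch compares both approximations of a binomial PMF in the two regimes; n = 500 with mean 50 matches the lecture's second case under the assumption p = 0.1, while n = 100, p = 0.01 is an illustrative small-np choice, not a number from the board.]

```python
import math
from math import comb

def phi(x):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def normal_pmf(n, p, k):
    # de Moivre-Laplace approximation with the 1/2 correction from the lecture
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    return phi((k + 0.5 - mean) / sd) - phi((k - 0.5 - mean) / sd)

# Small np: the Poisson approximation is the closer one.
print(binom_pmf(100, 0.01, 2), poisson_pmf(1.0, 2), normal_pmf(100, 0.01, 2))
# np around 50: the normal approximation is the closer one.
print(binom_pmf(500, 0.1, 50), poisson_pmf(50.0, 50), normal_pmf(500, 0.1, 50))
```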
868
00:51:11,650 --> 00:51:15,440
OK, so from next time we're
going to switch gears again.
869
00:51:15,440 --> 00:51:17,830
And we're going to put together
everything we learned
870
00:51:17,830 --> 00:51:20,620
in this class to start solving
inference problems.
871
00:51:20,620 --> 00:51:22,050