WEBVTT

00:00:00.520 --> 00:00:03.450
An important application of the
central limit theorem is

00:00:03.450 --> 00:00:07.190
in the approximate calculation
of the binomial probabilities.

00:00:07.190 --> 00:00:09.270
Here is what is involved.

00:00:09.270 --> 00:00:11.060
We start with random
variables--

00:00:11.060 --> 00:00:11.840
Xi--

00:00:11.840 --> 00:00:13.280
that are independent.

00:00:13.280 --> 00:00:14.800
And they have the same
distribution.

00:00:14.800 --> 00:00:17.050
They're all Bernoulli
with parameter p.

00:00:17.050 --> 00:00:20.550
We add n of those random
variables, and the resulting

00:00:20.550 --> 00:00:24.450
random variable, Sn, we know
that it has a binomial PMF

00:00:24.450 --> 00:00:26.110
with parameters n and p.

00:00:26.110 --> 00:00:30.010
We also know its mean, and
we do know its variance.

00:00:30.010 --> 00:00:33.430
What the central limit theorem
tells us, in this case, since

00:00:33.430 --> 00:00:35.790
we're dealing with the sum of
independent identically

00:00:35.790 --> 00:00:38.640
distributed random variables,
is the following.

00:00:38.640 --> 00:00:41.340
If we take this random variable
here that we have

00:00:41.340 --> 00:00:45.610
been denoting by Zn, which is a
standardized version of Sn--

00:00:45.610 --> 00:00:48.550
we subtract the mean of Sn and
divide by the standard

00:00:48.550 --> 00:00:49.630
deviation--

00:00:49.630 --> 00:00:53.710
this random variable has a CDF
that approaches, as n goes to

00:00:53.710 --> 00:00:57.070
infinity, the CDF of
a standard normal.

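NOTE
In symbols, the statement above can be restated as follows. This is a restatement, not part of the spoken lecture; it uses the standard binomial mean np and variance np(1-p), and z stands for an arbitrary real number.
```latex
Z_n = \frac{S_n - np}{\sqrt{np(1-p)}}, \qquad
\lim_{n \to \infty} \mathbf{P}(Z_n \le z) = \Phi(z) \quad \text{for every } z.
```
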
00:00:57.070 --> 00:01:00.500
So let us use what we now
know to calculate some

00:01:00.500 --> 00:01:01.910
probabilities.

00:01:01.910 --> 00:01:03.300
Let us fix some parameters.

00:01:03.300 --> 00:01:04.349
n is 36.

00:01:04.349 --> 00:01:05.620
p is 0.5.

00:01:05.620 --> 00:01:08.260
And we wish to calculate the
probability that Sn is less

00:01:08.260 --> 00:01:10.200
than or equal to 21.

00:01:10.200 --> 00:01:13.930
Now, in this case, we can
calculate it exactly using the

00:01:13.930 --> 00:01:15.420
binomial formula.

00:01:15.420 --> 00:01:18.830
The probability of being less
than or equal to 21 is the sum

00:01:18.830 --> 00:01:22.710
of the probabilities of all
the numbers from 0 to 21.

00:01:22.710 --> 00:01:26.670
And this is the probability
of obtaining a number k.

00:01:26.670 --> 00:01:29.950
And by calculating this
expression, we obtain this

00:01:29.950 --> 00:01:32.880
number, which is the
exact answer.

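NOTE
A minimal sketch, not part of the lecture: the exact binomial sum described above, computed with the Python standard library. The function name binomial_cdf is an illustrative choice.
```python
from math import comb
def binomial_cdf(k, n, p):
    """P(S_n <= k) for S_n ~ Binomial(n, p): sum the PMF over 0, 1, ..., k."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))
# Parameters from the lecture: n = 36, p = 0.5, event {S_n <= 21}
exact = binomial_cdf(21, 36, 0.5)
print(round(exact, 4))  # 0.8785, the exact answer quoted in the lecture
```
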
00:01:32.880 --> 00:01:35.950
Now, let us proceed using the
central limit theorem.

00:01:35.950 --> 00:01:40.039
We are interested in this
probability, but we will use

00:01:40.039 --> 00:01:43.590
the fact about the CDF of this
related random variable.

00:01:43.590 --> 00:01:45.720
So the first step is
to calculate n

00:01:45.720 --> 00:01:47.850
times p, which is 18.

00:01:47.850 --> 00:01:50.470
The second step is to calculate
this denominator

00:01:50.470 --> 00:01:54.600
here, which in our case
evaluates to 3.

00:01:54.600 --> 00:01:58.580
Now, since we know something
about the CDF of this random

00:01:58.580 --> 00:02:02.450
variable, what we need to do
is to take this event and

00:02:02.450 --> 00:02:05.830
rewrite it in terms of
this random variable.

00:02:05.830 --> 00:02:11.460
So we have the event of
interest, which is that Sn is

00:02:11.460 --> 00:02:14.150
less than or equal to 21.

00:02:14.150 --> 00:02:17.460
This is the same as the event
that Sn minus 18 is less than

00:02:17.460 --> 00:02:19.770
or equal to 21 minus 18.

00:02:19.770 --> 00:02:22.800
And it's the same as this event
here, where we divide

00:02:22.800 --> 00:02:25.829
both sides by 3.

00:02:25.829 --> 00:02:29.520
Now, what we have here is the
probability that this random

00:02:29.520 --> 00:02:33.060
variable Zn is less than
or equal to 1.

00:02:33.060 --> 00:02:37.079
But now, Zn is approximately a
standard normal, so we can use

00:02:37.079 --> 00:02:40.920
here the CDF of the standard
normal distribution,

00:02:40.920 --> 00:02:42.350
which is Phi of 1.

00:02:42.350 --> 00:02:45.550
And at this point, we look at
the tables for the normal

00:02:45.550 --> 00:02:46.590
distribution.

00:02:46.590 --> 00:02:48.440
We'll find this entry here.

00:02:48.440 --> 00:02:54.079
And this gives us an
answer of 0.8413.

00:02:54.079 --> 00:02:56.340
This is a pretty good
approximation of the exact

00:02:56.340 --> 00:02:59.340
answer, which is 0.8785.

00:02:59.340 --> 00:03:01.460
But it is not a great
approximation.

00:03:01.460 --> 00:03:04.150
It is off by about four
percentage points.

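NOTE
A sketch of the calculation just described, assuming the standard normal CDF is computed from the error function instead of read from a table. The helper name std_normal_cdf is hypothetical.
```python
from math import erf, sqrt
def std_normal_cdf(z):
    """Standard normal CDF Phi(z), expressed through the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
n, p = 36, 0.5
mean = n * p                    # step 1: n times p = 18
std = sqrt(n * p * (1 - p))     # step 2: the denominator = 3
# Rewrite {S_n <= 21} as {Z_n <= (21 - 18) / 3} and treat Z_n as standard normal
approx = std_normal_cdf((21 - mean) / std)
print(round(approx, 4))  # 0.8413, i.e. Phi(1)
```
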
00:03:04.150 --> 00:03:07.960
Can we do better than that?

00:03:07.960 --> 00:03:11.550
It turns out that we can get
a better approximation.

00:03:11.550 --> 00:03:14.160
And let us see how
this can be done.

00:03:14.160 --> 00:03:17.320
Recall that we approximated
this probability using the

00:03:17.320 --> 00:03:20.380
central limit theorem and found
this numerical value.

00:03:20.380 --> 00:03:23.690
But we make an observation that
this probability is equal

00:03:23.690 --> 00:03:25.300
to this probability here.

00:03:25.300 --> 00:03:26.450
Why is that?

00:03:26.450 --> 00:03:28.550
Sn is an integer random
variable.

00:03:28.550 --> 00:03:31.790
Therefore, if I tell you that
it is strictly less than 22,

00:03:31.790 --> 00:03:35.640
I'm also telling you that
it is 21 or less.

00:03:35.640 --> 00:03:39.960
Therefore, this event here is
the same as that event here.

00:03:39.960 --> 00:03:42.480
And therefore, their
probabilities are the same.

00:03:42.480 --> 00:03:46.630
So instead of using the central
limit approximation to

00:03:46.630 --> 00:03:49.630
calculate this probability,
let us follow the same

00:03:49.630 --> 00:03:53.160
procedure but try to calculate
this probability here.

00:03:53.160 --> 00:03:56.820
And this probability here
is equal to the

00:03:56.820 --> 00:04:00.880
probability that Sn minus--

00:04:00.880 --> 00:04:03.690
we subtract the mean, divide
by the standard

00:04:03.690 --> 00:04:05.520
deviation of Sn--

00:04:05.520 --> 00:04:11.740
is strictly less than 22 minus
18 divided by 3, which is the

00:04:11.740 --> 00:04:16.060
probability that the random
variable that we denote by Zn,

00:04:16.060 --> 00:04:19.630
which is this expression here,
is strictly less than 22

00:04:19.630 --> 00:04:21.120
minus 18 over 3.

00:04:21.120 --> 00:04:23.910
And this is 1.33.

00:04:23.910 --> 00:04:27.380
Now, at this point, we pretend
that Zn is a standard normal

00:04:27.380 --> 00:04:28.210
random variable--

00:04:28.210 --> 00:04:31.510
the probability that the
standard normal is less than a

00:04:31.510 --> 00:04:32.760
certain number.

00:04:32.760 --> 00:04:37.800
This is the standard normal CDF
evaluated at that number.

00:04:37.800 --> 00:04:43.020
And then we look in the
normal tables at 1.33 and we

00:04:43.020 --> 00:04:49.490
find this value of 0.9082.

00:04:49.490 --> 00:04:52.580
Now, we compare this value
with the exact

00:04:52.580 --> 00:04:54.720
answer for this problem.

00:04:54.720 --> 00:04:58.159
And we see that we
again missed it.

00:04:58.159 --> 00:05:02.370
Using this approximation to
this quantity gave us an

00:05:02.370 --> 00:05:04.510
underestimate of this number.

00:05:04.510 --> 00:05:07.430
Now, we obtained an
overestimate.

00:05:07.430 --> 00:05:10.420
The true value is somewhere
in the middle.

00:05:10.420 --> 00:05:13.250
So this suggests that we may
want to do something that

00:05:13.250 --> 00:05:17.560
combines these two alternative
choices here.

00:05:17.560 --> 00:05:20.750
But before doing that, it's
good to understand what

00:05:20.750 --> 00:05:24.350
exactly we have been
doing all along.

00:05:24.350 --> 00:05:27.000
What we're doing is
the following.

00:05:27.000 --> 00:05:31.370
We have the PMF of the binomial
centered at 18, which

00:05:31.370 --> 00:05:32.250
is the mean.

00:05:32.250 --> 00:05:34.250
It's a discrete random
variable.

00:05:34.250 --> 00:05:37.530
But when we use the central
limit theorem, we pretend that

00:05:37.530 --> 00:05:43.190
the binomial is normal, while
we keep the same mean

00:05:43.190 --> 00:05:44.440
and variance.

00:05:46.720 --> 00:05:50.130
Now, when we calculate
probabilities, if we want to

00:05:50.130 --> 00:05:54.550
find the discrete probability
that Sn is less than or equal

00:05:54.550 --> 00:05:59.020
to 21, which is the sum of these
probabilities, what we

00:05:59.020 --> 00:06:05.380
do is we look at the area
under the normal

00:06:05.380 --> 00:06:09.610
PDF from 21 and below.

00:06:09.610 --> 00:06:14.700
In the alternative approach,
when we use the central limit

00:06:14.700 --> 00:06:18.180
theorem to approximate the
probability of this event, we

00:06:18.180 --> 00:06:24.100
go to 22, and we look at the
event of falling below 22.

00:06:24.100 --> 00:06:30.060
This means that we're looking at
the area from 22 and lower.

00:06:30.060 --> 00:06:36.650
So in one approach, this
particular region is not used

00:06:36.650 --> 00:06:37.690
in the calculation.

00:06:37.690 --> 00:06:39.250
That's what we did here.

00:06:39.250 --> 00:06:42.560
But in the second approach, it
was used in the calculation.

00:06:42.560 --> 00:06:45.690
Should it be used or not?

00:06:45.690 --> 00:06:52.180
It makes more sense to use only
part of this solid region

00:06:52.180 --> 00:06:55.150
and assign it to the calculation
of the probability

00:06:55.150 --> 00:06:57.470
of being at 21 or less.

00:06:57.470 --> 00:07:01.890
Namely, we can take the midpoint
here, where the midpoint

00:07:01.890 --> 00:07:07.690
is at 21.5, and calculate
the area under the

00:07:07.690 --> 00:07:13.340
normal PDF only going
up to 21.5.

00:07:13.340 --> 00:07:17.170
What this amounts to is looking
at this particular

00:07:17.170 --> 00:07:18.420
event here.

00:07:18.420 --> 00:07:21.520
Now, this event is, of course,
identical to this event that

00:07:21.520 --> 00:07:25.130
we have been considering,
because again, Sn is a

00:07:25.130 --> 00:07:29.470
discrete random variable that
takes integer values.

00:07:29.470 --> 00:07:32.510
But when we approximate it by
a normal, it does make a

00:07:32.510 --> 00:07:34.530
difference whether we
write the event

00:07:34.530 --> 00:07:36.840
this way or that way.

00:07:36.840 --> 00:07:40.570
So here, we're going to obtain
the probability that the

00:07:40.570 --> 00:07:43.760
standardized version
of Sn is less than--

00:07:43.760 --> 00:07:46.180
We follow the same calculation,
but now we have

00:07:46.180 --> 00:07:52.100
21.5 minus 18 divided by 3.

00:07:52.100 --> 00:07:56.730
And this number here is 1.17.

00:07:56.730 --> 00:08:01.550
And using the central limit
theorem calculation, this is

00:08:01.550 --> 00:08:08.960
the CDF of the standard normal
evaluated at 1.17, which we

00:08:08.960 --> 00:08:12.090
can go and look up in the
normal table to find

00:08:12.090 --> 00:08:16.960
the value of 0.8790.

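NOTE
A sketch, not from the lecture, comparing the three cutoffs discussed so far: 21, 22, and the midpoint 21.5. It uses unrounded z-values, so the results differ slightly from table lookups at 1.33 and 1.17. The helper name std_normal_cdf is hypothetical.
```python
from math import erf, sqrt
def std_normal_cdf(z):
    """Standard normal CDF Phi(z), expressed through the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
mean, std = 18.0, 3.0  # n*p and sqrt(n*p*(1-p)) for n = 36, p = 0.5
# Three ways of writing the same discrete event {S_n <= 21}
under = std_normal_cdf((21.0 - mean) / std)      # cutoff at 21: underestimate
over = std_normal_cdf((22.0 - mean) / std)       # cutoff at 22: overestimate
corrected = std_normal_cdf((21.5 - mean) / std)  # midpoint cutoff at 21.5
for label, value in [("21", under), ("22", over), ("21.5", corrected)]:
    print(f"cutoff {label}: {value:.4f}")
```
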
00:08:16.960 --> 00:08:21.840
And now, we notice that this
value is remarkably close to

00:08:21.840 --> 00:08:23.320
the true value.

00:08:23.320 --> 00:08:26.730
It is much better as an
approximation than what we

00:08:26.730 --> 00:08:31.980
obtained using either this
choice or that choice.

00:08:31.980 --> 00:08:37.270
And since this approximation
is so good, we may consider

00:08:37.270 --> 00:08:41.370
even using it to approximate
individual probabilities of

00:08:41.370 --> 00:08:43.350
the binomial PMF.

00:08:43.350 --> 00:08:46.130
Let's see what that takes.

00:08:46.130 --> 00:08:49.580
Let us try to approximate, as
an example, the probability

00:08:49.580 --> 00:08:53.680
that Sn takes a value
of exactly 19.

00:08:53.680 --> 00:08:58.610
So what we will do will be to
write the event that Sn is

00:08:58.610 --> 00:09:08.770
equal to 19 as the event that Sn
lies between 18.5 and 19.5.

00:09:08.770 --> 00:09:12.010
In terms of the picture that
we were discussing before,

00:09:12.010 --> 00:09:15.290
what we are doing, essentially,
is to take the

00:09:15.290 --> 00:09:23.560
area under the normal PDF that
extends from 18.5 to 19.5 and

00:09:23.560 --> 00:09:27.640
declare that this area
corresponds to the discrete

00:09:27.640 --> 00:09:32.220
event that our binomial random
variable takes a value of 19.

00:09:32.220 --> 00:09:35.660
Similarly, if we wanted to
calculate approximately the

00:09:35.660 --> 00:09:40.060
value of the probability that
Sn takes a value of 21, we

00:09:40.060 --> 00:09:43.200
would consider the area
under the normal PDF

00:09:43.200 --> 00:09:46.890
from 20.5 to 21.5.

00:09:46.890 --> 00:09:49.630
So let us now continue
with this approach.

00:09:49.630 --> 00:09:54.940
We do the usual calculations,
which is to express this event

00:09:54.940 --> 00:09:57.420
in terms of standardized
values.

00:09:57.420 --> 00:10:02.080
That is, we subtract throughout
the mean of Sn and

00:10:02.080 --> 00:10:04.420
divide by standard deviation.

00:10:04.420 --> 00:10:08.560
So what we obtain here is the
standardized version of Sn.

00:10:08.560 --> 00:10:15.340
And that has to be, now, less
than or equal to 19.5 minus 18

00:10:15.340 --> 00:10:19.980
divided by 3, which is the
probability that our

00:10:19.980 --> 00:10:30.430
standardized random variable
lies between 0.17 and 0.5.

00:10:30.430 --> 00:10:35.230
And now, if we pretend that Zn
is a standard normal random

00:10:35.230 --> 00:10:38.000
variable, which is what the
central limit theorem

00:10:38.000 --> 00:10:42.170
suggests, this is going to be
equal to the probability that

00:10:42.170 --> 00:10:48.060
the standard normal is less than
or equal to 0.5 minus the

00:10:48.060 --> 00:10:53.530
probability that it
is less than 0.17.

00:10:53.530 --> 00:10:57.750
And if we look up those entries
in the normal tables,

00:10:57.750 --> 00:11:05.750
what we find is an answer of
0.6915 minus this number,

00:11:05.750 --> 00:11:10.090
which evaluates to 0.124.

00:11:10.090 --> 00:11:14.070
And what is the exact answer if
we were to use the binomial

00:11:14.070 --> 00:11:15.740
probability formulas?

00:11:15.740 --> 00:11:20.720
The exact answer is remarkably
close to what we obtained in

00:11:20.720 --> 00:11:23.070
our approximation.

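NOTE
A sketch, not from the lecture, that checks the approximation of P(S_n = 19) against the exact binomial probability. It uses unrounded z-values, so the approximation differs slightly from the table-based 0.124. The helper name std_normal_cdf is hypothetical.
```python
from math import comb, erf, sqrt
def std_normal_cdf(z):
    """Standard normal CDF Phi(z), expressed through the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
n, p = 36, 0.5
mean, std = n * p, sqrt(n * p * (1 - p))  # 18 and 3
# Treat {S_n = 19} as {18.5 <= S_n <= 19.5} and use the normal area over that interval
approx = std_normal_cdf((19.5 - mean) / std) - std_normal_cdf((18.5 - mean) / std)
# Exact binomial PMF at k = 19 for comparison
exact = comb(n, 19) * p**19 * (1 - p)**(n - 19)
print(f"approx = {approx:.4f}, exact = {exact:.4f}")
```
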
00:11:23.070 --> 00:11:26.720
This example illustrates a more
general fact that this

00:11:26.720 --> 00:11:30.090
approach of calculating
individual entries of the

00:11:30.090 --> 00:11:34.130
binomial PMF gives very
accurate answers.

00:11:34.130 --> 00:11:36.370
And in fact, there are
theorems, there are

00:11:36.370 --> 00:11:40.260
theoretical results to this
effect, that tell us that this

00:11:40.260 --> 00:11:42.460
way of approximating--

00:11:42.460 --> 00:11:45.590
asymptotically, as n goes
to infinity and

00:11:45.590 --> 00:11:47.380
in a certain regime--

00:11:47.380 --> 00:11:50.140
does give us very accurate
approximations.