WEBVTT
00:00:00.040 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.870
Commons license.
00:00:03.870 --> 00:00:06.910
Your support will help MIT
OpenCourseWare continue to
00:00:06.910 --> 00:00:10.560
offer high-quality educational
resources for free.
00:00:10.560 --> 00:00:13.460
To make a donation or view
additional materials from
00:00:13.460 --> 00:00:19.290
hundreds of MIT courses, visit
MIT OpenCourseWare at
00:00:19.290 --> 00:00:20.540
ocw.mit.edu.
00:00:22.560 --> 00:00:25.340
PROFESSOR: We're going to finish
today our discussion of
00:00:25.340 --> 00:00:27.460
limit theorems.
00:00:27.460 --> 00:00:30.340
I'm going to remind you what the
central limit theorem is,
00:00:30.340 --> 00:00:33.460
which we introduced
briefly last time.
00:00:33.460 --> 00:00:37.230
We're going to discuss what
exactly it says and its
00:00:37.230 --> 00:00:38.780
implications.
00:00:38.780 --> 00:00:42.100
And then we're going to apply it
to a couple of examples,
00:00:42.100 --> 00:00:45.520
mostly on the binomial
distribution.
00:00:45.520 --> 00:00:49.950
OK, so the situation is that
we are dealing with a large
00:00:49.950 --> 00:00:52.420
number of independent,
identically
00:00:52.420 --> 00:00:55.000
distributed random variables.
00:00:55.000 --> 00:00:58.270
And we want to look at the sum
of them and say something
00:00:58.270 --> 00:01:00.510
about the distribution
of the sum.
00:01:03.310 --> 00:01:06.910
We might want to say that
the sum is distributed
00:01:06.910 --> 00:01:10.510
approximately as a normal random
variable, although,
00:01:10.510 --> 00:01:12.750
formally, this is
not quite right.
00:01:12.750 --> 00:01:16.330
As n goes to infinity, the
distribution of the sum
00:01:16.330 --> 00:01:20.000
becomes very spread out, and
it doesn't converge to a
00:01:20.000 --> 00:01:21.830
limiting distribution.
00:01:21.830 --> 00:01:24.930
In order to get an interesting
limit, we need first to take
00:01:24.930 --> 00:01:28.150
the sum and standardize it.
00:01:28.150 --> 00:01:32.267
By standardizing it, what we
mean is to subtract the mean
00:01:32.267 --> 00:01:38.060
and then divide by the
standard deviation.
00:01:38.060 --> 00:01:41.320
Now, the mean is, of course, n
times the expected value of
00:01:41.320 --> 00:01:43.080
each one of the X's.
00:01:43.080 --> 00:01:45.130
And the standard deviation
is the
00:01:45.130 --> 00:01:46.610
square root of the variance.
00:01:46.610 --> 00:01:50.530
The variance is n times sigma
squared, where sigma is the
00:01:50.530 --> 00:01:52.180
standard deviation of the X's --
00:01:52.180 --> 00:01:53.400
so that's the standard
deviation.
00:01:53.400 --> 00:01:56.330
And after we do this, we obtain
a random variable that
00:01:56.330 --> 00:02:01.100
has 0 mean -- it's centered
-- and the
00:02:01.100 --> 00:02:03.230
variance is equal to 1.
00:02:03.230 --> 00:02:07.240
And so the variance stays the
same, no matter how large n is
00:02:07.240 --> 00:02:08.500
going to be.
00:02:08.500 --> 00:02:12.660
So the distribution of Zn keeps
changing with n, but it
00:02:12.660 --> 00:02:14.080
cannot change too much.
00:02:14.080 --> 00:02:15.240
It stays in place.
00:02:15.240 --> 00:02:19.550
The mean is 0, and the width
remains also roughly the same
00:02:19.550 --> 00:02:22.000
because the variance is 1.
00:02:22.000 --> 00:02:25.820
The surprising thing is that, as
n grows, that distribution
00:02:25.820 --> 00:02:31.250
of Zn kind of settles in a
certain asymptotic shape.
00:02:31.250 --> 00:02:33.620
And that's the shape
of a standard
00:02:33.620 --> 00:02:35.290
normal random variable.
00:02:35.290 --> 00:02:37.580
So standard normal means
that it has 0
00:02:37.580 --> 00:02:39.930
mean and unit variance.
00:02:39.930 --> 00:02:43.850
More precisely, what the central
limit theorem tells us
00:02:43.850 --> 00:02:46.560
is a relation between the
cumulative distribution
00:02:46.560 --> 00:02:49.430
function of Zn and
the cumulative
00:02:49.430 --> 00:02:51.990
distribution function of
the standard normal.
00:02:51.990 --> 00:02:56.620
So for any given number, c,
the probability that Zn is
00:02:56.620 --> 00:03:01.140
less than or equal to c, in the
limit, becomes the same as
00:03:01.140 --> 00:03:04.090
the probability that the
standard normal becomes less
00:03:04.090 --> 00:03:05.760
than or equal to c.
00:03:05.760 --> 00:03:08.800
And of course, this is useful
because these probabilities
00:03:08.800 --> 00:03:11.960
are available from the normal
tables, whereas the
00:03:11.960 --> 00:03:15.850
distribution of Zn might be a
very complicated expression if
00:03:15.850 --> 00:03:19.520
you were to calculate
it exactly.
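As a numerical sketch of this statement (an illustration, not part of the lecture; the choice of Uniform(0,1) summands is arbitrary), one can simulate the standardized sum Zn and compare the fraction of samples with Zn less than or equal to c against the standard normal CDF:

```python
import math
import random

# Illustration only: estimate P(Zn <= c) for the standardized sum of
# n iid Uniform(0,1) random variables, and compare it with the
# standard normal CDF, as the central limit theorem predicts.
random.seed(0)

def normal_cdf(c):
    # CDF of a standard normal, written via the error function
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

n, trials, c = 30, 20000, 1.0
mu = 0.5                        # mean of Uniform(0,1)
sigma = math.sqrt(1.0 / 12.0)   # standard deviation of Uniform(0,1)

hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (sigma * math.sqrt(n))   # standardize the sum
    if z <= c:
        hits += 1

print(hits / trials)   # close to normal_cdf(1.0), about 0.84
```

Even with n = 30, the empirical fraction should agree with the normal value to roughly two decimal places.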
00:03:19.520 --> 00:03:22.960
So some comments about the
central limit theorem.
00:03:22.960 --> 00:03:27.860
First thing is that it's quite
amazing that it's universal.
00:03:27.860 --> 00:03:31.970
It doesn't matter what the
distribution of the X's is.
00:03:31.970 --> 00:03:35.970
It can be any distribution
whatsoever, as long as it has
00:03:35.970 --> 00:03:39.070
finite mean and finite
variance.
00:03:39.070 --> 00:03:42.170
And when you go and do your
approximations using the
00:03:42.170 --> 00:03:44.520
central limit theorem, the only
thing that you need to
00:03:44.520 --> 00:03:47.580
know about the distribution
of the X's are the
00:03:47.580 --> 00:03:49.130
mean and the variance.
00:03:49.130 --> 00:03:52.410
You need those in order
to standardize Sn.
00:03:52.410 --> 00:03:55.910
I mean -- to subtract the mean
and divide by the standard
00:03:55.910 --> 00:03:56.810
deviation --
00:03:56.810 --> 00:03:59.120
you need to know the mean
and the variance.
00:03:59.120 --> 00:04:02.350
But these are the only things
that you need to know in order
00:04:02.350 --> 00:04:03.600
to apply it.
00:04:06.060 --> 00:04:08.730
In addition, it's
a very accurate
00:04:08.730 --> 00:04:10.650
computational shortcut.
00:04:10.650 --> 00:04:14.660
So the distribution of these
Zn's, in principle, you can
00:04:14.660 --> 00:04:18.130
calculate it by convolution of
the distribution of the X's
00:04:18.130 --> 00:04:20.050
with itself many, many times.
00:04:20.050 --> 00:04:23.720
But this is tedious, and if you
try to do it analytically,
00:04:23.720 --> 00:04:26.570
it might be a very complicated
expression.
00:04:26.570 --> 00:04:29.910
Whereas by just appealing to the
standard normal table for
00:04:29.910 --> 00:04:33.870
the standard normal random
variable, things are done in a
00:04:33.870 --> 00:04:35.360
very quick way.
00:04:35.360 --> 00:04:39.070
So it's a nice computational
shortcut if you don't want to
00:04:39.070 --> 00:04:42.210
get an exact answer to a
probability problem.
00:04:42.210 --> 00:04:47.480
Now, at a more philosophical
level, it justifies why we are
00:04:47.480 --> 00:04:50.930
really interested in normal
random variables.
00:04:50.930 --> 00:04:55.230
Whenever you have a phenomenon
which is noisy, and the noise
00:04:55.230 --> 00:05:00.420
that you observe is created by
adding lots of little
00:05:00.420 --> 00:05:03.820
pieces of randomness that are
independent of each other, the
00:05:03.820 --> 00:05:06.840
overall effect that you're
going to observe can be
00:05:06.840 --> 00:05:10.240
described by a normal
random variable.
00:05:10.240 --> 00:05:16.810
So in a classic example that
goes 100 years back or so,
00:05:16.810 --> 00:05:19.840
suppose that you have a fluid,
and inside that fluid, there's
00:05:19.840 --> 00:05:23.340
a little particle of dust
or whatever that's
00:05:23.340 --> 00:05:24.950
suspended in there.
00:05:24.950 --> 00:05:28.380
That little particle gets
hit by molecules
00:05:28.380 --> 00:05:30.000
completely at random --
00:05:30.000 --> 00:05:32.730
and so what you're going to see
is that particle kind of
00:05:32.730 --> 00:05:36.020
moving randomly inside
that liquid.
00:05:36.020 --> 00:05:40.260
Now that random motion, if you
ask, after one second, how
00:05:40.260 --> 00:05:45.520
much is my particle displaced,
let's say, in the x-axis along
00:05:45.520 --> 00:05:47.170
the x direction.
00:05:47.170 --> 00:05:50.960
That displacement is very, very
well modeled by a normal
00:05:50.960 --> 00:05:51.960
random variable.
00:05:51.960 --> 00:05:55.710
And the reason is that the
position of that particle is
00:05:55.710 --> 00:06:00.160
decided by the cumulative effect
of lots of random hits
00:06:00.160 --> 00:06:04.480
by molecules that hit
that particle.
00:06:04.480 --> 00:06:11.630
So that's a sort of celebrated
physical model that goes under
00:06:11.630 --> 00:06:15.000
the name of Brownian motion.
00:06:15.000 --> 00:06:18.100
And it's the same model that
some people use to describe
00:06:18.100 --> 00:06:20.300
the movement in the
financial markets.
00:06:20.300 --> 00:06:24.660
The argument might go that the
movement of prices has to do
00:06:24.660 --> 00:06:28.300
with lots of little decisions
and lots of little events by
00:06:28.300 --> 00:06:31.310
many, many different
actors that are
00:06:31.310 --> 00:06:32.890
involved in the market.
00:06:32.890 --> 00:06:37.440
So the distribution of stock
prices might be well described
00:06:37.440 --> 00:06:39.740
by normal random variables.
00:06:39.740 --> 00:06:43.780
At least that's what people
wanted to believe until
00:06:43.780 --> 00:06:45.160
somewhat recently.
00:06:45.160 --> 00:06:48.300
Now, the evidence is that,
actually, these distributions
00:06:48.300 --> 00:06:52.210
are a little more heavy-tailed
in the sense that extreme
00:06:52.210 --> 00:06:55.630
events are a little more likely
to occur than what
00:06:55.630 --> 00:06:58.040
normal random variables would
seem to indicate.
00:06:58.040 --> 00:07:03.110
But as a first model, again,
it could be a plausible
00:07:03.110 --> 00:07:07.300
argument to have, at least as
a starting model, one that
00:07:07.300 --> 00:07:10.200
involves normal random
variables.
00:07:10.200 --> 00:07:13.020
So this is the philosophical
side of things.
00:07:13.020 --> 00:07:15.820
On the more accurate,
mathematical side, it's
00:07:15.820 --> 00:07:18.290
important to appreciate
exactly what kind of
00:07:18.290 --> 00:07:21.250
statement the central
limit theorem is.
00:07:21.250 --> 00:07:25.460
It's a statement about the
convergence of the CDF of
00:07:25.460 --> 00:07:27.940
these standardized random
variables to
00:07:27.940 --> 00:07:29.840
the CDF of a normal.
00:07:29.840 --> 00:07:32.470
So it's a statement about
convergence of CDFs.
00:07:32.470 --> 00:07:36.580
It's not a statement about
convergence of PMFs, or
00:07:36.580 --> 00:07:39.100
convergence of PDFs.
00:07:39.100 --> 00:07:42.160
Now, if one makes additional
mathematical assumptions,
00:07:42.160 --> 00:07:44.880
there are variations of the
central limit theorem that
00:07:44.880 --> 00:07:47.220
talk about PDFs and PMFs.
00:07:47.220 --> 00:07:51.930
But in general, that's not
necessarily the case.
00:07:51.930 --> 00:07:54.610
And I'm going to illustrate
this with--
00:07:54.610 --> 00:07:58.890
I have a plot here which
is not in your slides.
00:07:58.890 --> 00:08:04.700
But just to make the point,
consider two different
00:08:04.700 --> 00:08:06.710
discrete distributions.
00:08:06.710 --> 00:08:09.820
This discrete distribution
takes values 1, 4, 7.
00:08:13.470 --> 00:08:16.110
This discrete distribution can
take values 1, 2, 4, 6, and 7.
00:08:18.720 --> 00:08:24.270
So this one has sort of a
periodicity of 3; for this one,
00:08:24.270 --> 00:08:27.960
the range of values is a little
more interesting.
00:08:27.960 --> 00:08:30.910
The numbers in these two
distributions are cooked up so
00:08:30.910 --> 00:08:34.380
that they have the same mean
and the same variance.
00:08:34.380 --> 00:08:38.970
Now, what I'm going to do is
to take eight independent
00:08:38.970 --> 00:08:44.090
copies of the random variable
and plot the PMF of the sum of
00:08:44.090 --> 00:08:45.980
eight random variables.
00:08:45.980 --> 00:08:51.520
Now, if I plot the PMF of the
sum of 8 of these, I get the
00:08:51.520 --> 00:08:59.690
plot, which corresponds to these
bullets in this diagram.
00:08:59.690 --> 00:09:03.040
If I take 8 random variables,
according to this
00:09:03.040 --> 00:09:07.270
distribution, and add them up
and compute their PMF, the PMF
00:09:07.270 --> 00:09:10.310
I get is the one denoted
here by the X's.
00:09:10.310 --> 00:09:15.630
The two PMFs look really
different, at least, when you
00:09:15.630 --> 00:09:16.890
eyeball them.
00:09:16.890 --> 00:09:23.500
On the other hand, if you were
to plot the CDFs of them, then
00:09:23.500 --> 00:09:34.000
the CDFs, when you compare them
with the normal CDF, which is
00:09:34.000 --> 00:09:38.390
this continuous curve, go up
in steps, of course,
00:09:38.390 --> 00:09:41.870
because we're looking at
discrete random variables.
00:09:41.870 --> 00:09:47.600
But it's very close
to the normal CDF.
00:09:47.600 --> 00:09:52.000
And if we, instead of n equal to
8, we were to take 16, then
00:09:52.000 --> 00:09:54.480
the agreement would
be even better.
00:09:54.480 --> 00:09:59.850
So in terms of CDFs, when we add
8 or 16 of these, we get
00:09:59.850 --> 00:10:01.930
very close to the normal CDF.
00:10:01.930 --> 00:10:05.080
We would get essentially the
same picture if I were to take
00:10:05.080 --> 00:10:06.850
8 or 16 of these.
00:10:06.850 --> 00:10:11.730
So the CDFs sit, essentially, on
top of each other, although
00:10:11.730 --> 00:10:14.400
the two PMFs look
quite different.
00:10:14.400 --> 00:10:17.230
So this is to appreciate that,
formally speaking, we only
00:10:17.230 --> 00:10:22.470
have a statement about
CDFs, not about PMFs.
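To make this concrete in code: the lecture does not give the exact probabilities of the five-point distribution, so the ones below are hypothetical, chosen (as in the lecture) so that both PMFs have mean 4 and variance 6. Convolving each with itself three times gives the PMF of a sum of 8 copies; the two PMFs differ, but their CDFs are close.

```python
# Hypothetical PMFs (the lecture's exact numbers are not given),
# matched to have mean 4 and variance 6.
pmf_a = {1: 1/3, 4: 1/3, 7: 1/3}
pmf_b = {1: 1/4, 2: 3/16, 4: 1/8, 6: 3/16, 7: 1/4}

def convolve(p, q):
    # PMF of X + Y for independent X ~ p, Y ~ q (dicts: value -> prob)
    r = {}
    for x, px in p.items():
        for y, qy in q.items():
            r[x + y] = r.get(x + y, 0.0) + px * qy
    return r

def sum_of_copies(pmf, doublings):
    # Convolve a PMF with itself repeatedly: 2**doublings iid copies
    for _ in range(doublings):
        pmf = convolve(pmf, pmf)
    return pmf

sum_a = sum_of_copies(pmf_a, 3)   # sum of 8 copies
sum_b = sum_of_copies(pmf_b, 3)

def cdf(pmf, t):
    return sum(p for v, p in pmf.items() if v <= t)

# The PMFs differ visibly (sum_a lives on 8, 11, 14, ...; sum_b is
# denser), but the CDFs track each other closely.
gap = max(abs(cdf(sum_a, t) - cdf(sum_b, t)) for t in range(8, 57))
print(gap)
```

The maximum CDF gap stays small even at n = 8, while the two PMFs have visibly different supports.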
00:10:22.470 --> 00:10:26.980
Now in practice, how do you use
the central limit theorem?
00:10:26.980 --> 00:10:30.550
Well, it tells us that we can
calculate probabilities by
00:10:30.550 --> 00:10:32.810
treating Zn as if it
were a standard
00:10:32.810 --> 00:10:34.550
normal random variable.
00:10:34.550 --> 00:10:38.280
Now Zn is a linear
function of Sn.
00:10:38.280 --> 00:10:43.120
Conversely, Sn is a linear
function of Zn.
00:10:43.120 --> 00:10:45.680
Linear functions of normals
are normal.
00:10:45.680 --> 00:10:49.450
So if I pretend that Zn is
normal, it's essentially the
00:10:49.450 --> 00:10:53.230
same as if we pretend
that Sn is normal.
00:10:53.230 --> 00:10:55.560
And so we can calculate
probabilities that have to do
00:10:55.560 --> 00:10:59.830
with Sn as if Sn were normal.
00:10:59.830 --> 00:11:03.850
Now, the central limit theorem
does not tell us that Sn is
00:11:03.850 --> 00:11:05.120
approximately normal.
00:11:05.120 --> 00:11:08.860
The formal statement is about
Zn, but, practically speaking,
00:11:08.860 --> 00:11:11.150
when you use the result,
you can just
00:11:11.150 --> 00:11:14.650
pretend that Sn is normal.
00:11:14.650 --> 00:11:18.620
Finally, it's a limit theorem,
so it tells us about what
00:11:18.620 --> 00:11:21.240
happens when n goes
to infinity.
00:11:21.240 --> 00:11:23.880
If we are to use it in practice,
of course, n is not
00:11:23.880 --> 00:11:25.120
going to be infinity.
00:11:25.120 --> 00:11:28.320
Maybe n is equal to 15.
00:11:28.320 --> 00:11:32.130
Can we use a limit theorem when
n is a small number, as
00:11:32.130 --> 00:11:34.020
small as 15?
00:11:34.020 --> 00:11:36.980
Well, it turns out that it's
a very good approximation.
00:11:36.980 --> 00:11:41.420
Even for quite small values
of n, it gives us
00:11:41.420 --> 00:11:43.770
very accurate answers.
00:11:43.770 --> 00:11:49.710
So n of the order of 15, or
20, or so gives us very good
00:11:49.710 --> 00:11:51.790
results in practice.
00:11:51.790 --> 00:11:54.820
There are no good theorems
that will give us hard
00:11:54.820 --> 00:11:58.550
guarantees because the quality
of the approximation does
00:11:58.550 --> 00:12:03.490
depend on the details of the
distribution of the X's.
00:12:03.490 --> 00:12:07.510
If the X's have a distribution
that, from the outset, looks a
00:12:07.510 --> 00:12:13.200
little bit like the normal, then
for small values of n,
00:12:13.200 --> 00:12:15.700
you are going to see,
essentially, a normal
00:12:15.700 --> 00:12:16.980
distribution for the sum.
00:12:16.980 --> 00:12:20.030
If the distribution of the X's
is very different from the
00:12:20.030 --> 00:12:23.350
normal, it's going to take a
larger value of n for the
00:12:23.350 --> 00:12:25.770
central limit theorem
to take effect.
00:12:25.770 --> 00:12:29.960
So let's illustrate this with
a few representative plots.
00:12:32.600 --> 00:12:36.460
So here, we're starting with a
discrete uniform distribution
00:12:36.460 --> 00:12:39.580
that goes from 1 to 8.
00:12:39.580 --> 00:12:44.200
Let's add 2 of these random
variables, 2 random variables
00:12:44.200 --> 00:12:47.870
with this PMF, and find
the PMF of the sum.
00:12:47.870 --> 00:12:52.570
This is a convolution of 2
discrete uniforms, and I
00:12:52.570 --> 00:12:54.960
believe you have seen this
exercise before.
00:12:54.960 --> 00:12:59.030
When you convolve this with
itself, you get a triangle.
00:12:59.030 --> 00:13:04.400
So this is the PMF for the sum
of two discrete uniforms.
00:13:04.400 --> 00:13:05.370
Now let's continue.
00:13:05.370 --> 00:13:07.980
Let's convolve this
with itself.
00:13:07.980 --> 00:13:10.750
This is going to give
us the PMF of a sum
00:13:10.750 --> 00:13:13.740
of 4 discrete uniforms.
00:13:13.740 --> 00:13:17.930
And we get this, which starts
looking like a normal.
00:13:17.930 --> 00:13:23.450
If we go to n equal to 32, then
it looks, essentially,
00:13:23.450 --> 00:13:25.270
exactly like a normal.
00:13:25.270 --> 00:13:27.850
And it's an excellent
approximation.
00:13:27.850 --> 00:13:32.290
So this is the PMF of the sum
of 32 discrete random
00:13:32.290 --> 00:13:36.560
variables with this uniform
distribution.
00:13:36.560 --> 00:13:42.190
If we start with a PMF which
is not symmetric--
00:13:42.190 --> 00:13:44.640
this one is symmetric
around the mean.
00:13:44.640 --> 00:13:47.630
But if we start with a PMF which
is non-symmetric, so
00:13:47.630 --> 00:13:53.780
this one here is a truncated
geometric PMF, then things do
00:13:53.780 --> 00:13:58.960
not work out as nicely when
I add 8 of these.
00:13:58.960 --> 00:14:03.640
That is, if I convolve this
with itself 8 times, I get
00:14:03.640 --> 00:14:08.600
this PMF, which maybe resembles
a little bit the
00:14:08.600 --> 00:14:09.800
normal one.
00:14:09.800 --> 00:14:13.050
But you can really tell that
it's different from the normal
00:14:13.050 --> 00:14:16.640
if you focus on the details
here and there.
00:14:16.640 --> 00:14:19.930
Here it sort of rises sharply.
00:14:19.930 --> 00:14:23.420
Here it tails off
a bit slower.
00:14:23.420 --> 00:14:27.660
So there's an asymmetry here
that's present, which is a
00:14:27.660 --> 00:14:29.340
consequence of the
asymmetry of the
00:14:29.340 --> 00:14:31.710
distribution we started with.
00:14:31.710 --> 00:14:35.320
If we go to 16, it looks a
little better, but still you
00:14:35.320 --> 00:14:39.600
can see the asymmetry between
this tail and that tail.
00:14:39.600 --> 00:14:43.030
If you get to 32, there's still a
little bit of asymmetry, but
00:14:43.030 --> 00:14:48.520
at least now it starts looking
like a normal distribution.
00:14:48.520 --> 00:14:54.270
So the moral from these plots
is that it might vary, a
00:14:54.270 --> 00:14:57.360
little bit, what kind of values
of n you need before
00:14:57.360 --> 00:15:00.070
you get a really good
approximation.
00:15:00.070 --> 00:15:04.520
But for values of n in the range
20 to 30 or so, usually
00:15:04.520 --> 00:15:07.340
you expect to get a pretty
good approximation.
00:15:07.340 --> 00:15:10.180
At least that's what the visual
inspection of these
00:15:10.180 --> 00:15:13.330
graphs tells us.
00:15:13.330 --> 00:15:16.560
So now that we know that we have
a good approximation in
00:15:16.560 --> 00:15:18.460
our hands, let's use it.
00:15:18.460 --> 00:15:21.890
Let's use it by revisiting an
example from last time.
00:15:21.890 --> 00:15:24.480
This is the polling problem.
00:15:24.480 --> 00:15:28.360
We're interested in the fraction
of the population that
00:15:28.360 --> 00:15:30.220
has a certain habit.
00:15:30.220 --> 00:15:33.680
And we try to find what f is.
00:15:33.680 --> 00:15:38.120
And the way we do it is by
polling people at random and
00:15:38.120 --> 00:15:40.600
recording the answers that they
give, whether they have
00:15:40.600 --> 00:15:42.340
the habit or not.
00:15:42.340 --> 00:15:45.250
So for each person, we get a
Bernoulli random variable.
00:15:45.250 --> 00:15:52.050
With probability f, a person is
going to respond 1, or yes,
00:15:52.050 --> 00:15:55.080
so this is with probability f.
00:15:55.080 --> 00:15:58.490
And with the remaining
probability 1-f, the person
00:15:58.490 --> 00:16:00.390
responds no.
00:16:00.390 --> 00:16:04.520
We record this number, which
is how many people answered
00:16:04.520 --> 00:16:06.800
yes, divided by the total
number of people.
00:16:06.800 --> 00:16:10.740
That's the fraction of the
population that we asked.
00:16:10.740 --> 00:16:16.980
This is the fraction inside our
sample that answered yes.
00:16:16.980 --> 00:16:21.410
And as we discussed last time,
you might start with some
00:16:21.410 --> 00:16:23.210
specs for the poll.
00:16:23.210 --> 00:16:25.660
And the specs have
two parameters--
00:16:25.660 --> 00:16:29.400
the accuracy that you want and
the confidence that you want
00:16:29.400 --> 00:16:33.620
to have that you did really
obtain the desired accuracy.
00:16:33.620 --> 00:16:40.550
So the spec here is that we
want probability 95% that our
00:16:40.550 --> 00:16:46.400
estimate is within 1 percentage
point of the true answer.
00:16:46.400 --> 00:16:48.940
So the event of interest
is this.
00:16:48.940 --> 00:16:53.640
That is, the result of the poll
differs from the true
00:16:53.640 --> 00:16:59.150
answer by more
than 1 percentage point.
00:16:59.150 --> 00:17:02.000
And we're interested in
calculating or approximating
00:17:02.000 --> 00:17:04.140
this particular probability.
00:17:04.140 --> 00:17:08.000
So we want to do it using the
central limit theorem.
00:17:08.000 --> 00:17:13.050
And one way of arranging the
mechanics of this calculation
00:17:13.050 --> 00:17:17.880
is to take the event of interest
and massage it by
00:17:17.880 --> 00:17:21.400
subtracting and dividing things
from both sides of this
00:17:21.400 --> 00:17:27.510
inequality so that you bring
into the picture the
00:17:27.510 --> 00:17:31.600
standardized random variable,
the Zn, and then apply the
00:17:31.600 --> 00:17:33.900
central limit theorem.
00:17:33.900 --> 00:17:38.550
So the event of interest, let
me write it in full, Mn is
00:17:38.550 --> 00:17:42.280
this quantity, so I'm putting it
here, minus f, which is the
00:17:42.280 --> 00:17:44.410
same as nf divided by n.
00:17:44.410 --> 00:17:46.980
So this is the same
as that event.
00:17:46.980 --> 00:17:49.840
We're going to calculate the
probability of this.
00:17:49.840 --> 00:17:52.460
This is not exactly in the form
in which we apply the
00:17:52.460 --> 00:17:53.430
central limit theorem.
00:17:53.430 --> 00:17:56.570
To apply the central limit
theorem, we need, down here,
00:17:56.570 --> 00:17:59.660
to have sigma square root n.
00:17:59.660 --> 00:18:03.100
So how can I put sigma
square root n here?
00:18:03.100 --> 00:18:07.350
I can divide both sides of
this inequality by sigma.
00:18:07.350 --> 00:18:10.970
And then I can take a factor of
square root n from here and
00:18:10.970 --> 00:18:13.240
send it to the other side.
00:18:13.240 --> 00:18:15.660
So this event is the
same as that event.
00:18:15.660 --> 00:18:19.190
This will happen if and only
if that will happen.
00:18:19.190 --> 00:18:23.670
So calculating the probability
of this event here is the same
00:18:23.670 --> 00:18:27.110
as calculating the probability
that this event happens.
00:18:27.110 --> 00:18:30.870
And now we are in business
because the random variable
00:18:30.870 --> 00:18:36.510
that we got in here is Zn, or
the absolute value of Zn, and
00:18:36.510 --> 00:18:41.480
we're talking about the
probability that Zn, absolute
00:18:41.480 --> 00:18:45.660
value of Zn, is bigger than
a certain number.
00:18:45.660 --> 00:18:50.310
Since Zn is to be approximated
by a standard normal random
00:18:50.310 --> 00:18:54.560
variable, our approximation is
going to be, instead of asking
00:18:54.560 --> 00:18:59.040
for Zn being bigger than this
number, we will ask for Z,
00:18:59.040 --> 00:19:02.500
absolute value of Z, being
bigger than this number.
00:19:02.500 --> 00:19:05.640
So this is the probability that
we want to calculate.
00:19:05.640 --> 00:19:09.730
And now Z is a standard normal
random variable.
00:19:09.730 --> 00:19:12.760
There's a small difficulty,
the one that we also
00:19:12.760 --> 00:19:14.310
encountered last time.
00:19:14.310 --> 00:19:18.110
And the difficulty is that the
standard deviation, sigma, of
00:19:18.110 --> 00:19:20.720
the Xi's is not known.
00:19:20.720 --> 00:19:24.560
Sigma is equal to f times--
00:19:24.560 --> 00:19:30.090
sigma squared, in this example, is f
times (1-f), and the only
00:19:30.090 --> 00:19:32.690
thing that we know about sigma
is that it's going to be a
00:19:32.690 --> 00:19:35.010
number less than 1/2.
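As a quick check of that bound (illustration only): for a Bernoulli(f), sigma = sqrt(f(1-f)), which is maximized at f = 1/2, where it equals exactly 1/2.

```python
# Scan f over a grid and confirm that sigma = sqrt(f * (1 - f)) never
# exceeds 1/2, with the maximum attained at f = 1/2.
fs = [i / 1000 for i in range(1001)]
sigmas = [(f * (1 - f)) ** 0.5 for f in fs]
print(max(sigmas))                     # 0.5
print(fs[sigmas.index(max(sigmas))])   # 0.5
```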
00:19:39.810 --> 00:19:45.180
OK, so we're going to have to
use an inequality here.
00:19:45.180 --> 00:19:48.890
We're going to use a
conservative value of sigma,
00:19:48.890 --> 00:19:54.120
the value of sigma equal to 1/2
and use that instead of
00:19:54.120 --> 00:19:55.760
the exact value of sigma.
00:19:55.760 --> 00:19:59.100
And this gives us an inequality
going this way.
00:19:59.100 --> 00:20:03.710
Let's just make sure why the
inequality goes this way.
00:20:03.710 --> 00:20:06.683
We got, on our axis,
two numbers.
00:20:12.390 --> 00:20:21.650
One number is 0.01 square
root n divided by sigma.
00:20:21.650 --> 00:20:27.870
And the other number is
0.02 square root of n.
00:20:27.870 --> 00:20:30.840
And my claim is that the numbers
are related to each
00:20:30.840 --> 00:20:32.930
other in this particular way.
00:20:32.930 --> 00:20:33.500
Why is this?
00:20:33.500 --> 00:20:35.410
Sigma is less than 1/2.
00:20:35.410 --> 00:20:39.580
So 1/sigma is bigger than 2.
00:20:39.580 --> 00:20:44.020
So since 1/sigma is bigger than
2, this means that this
00:20:44.020 --> 00:20:47.740
number sits to the right
of that number.
00:20:47.740 --> 00:20:51.950
So here we have the probability
that Z is bigger
00:20:51.950 --> 00:20:54.820
than this number.
00:20:54.820 --> 00:20:59.060
The probability of falling out
there is less than the
00:20:59.060 --> 00:21:03.060
probability of falling
in this interval.
00:21:03.060 --> 00:21:06.170
So that's what that last
inequality is saying--
00:21:06.170 --> 00:21:09.330
this probability is smaller
than that probability.
00:21:09.330 --> 00:21:12.010
This is the probability that
we're interested in, but since
00:21:12.010 --> 00:21:16.490
we don't know sigma, we take the
conservative value, and we
00:21:16.490 --> 00:21:21.610
use an upper bound in terms
of the probability of this
00:21:21.610 --> 00:21:23.730
interval here.
00:21:23.730 --> 00:21:26.920
And now we are in business.
00:21:26.920 --> 00:21:30.980
We can start using our normal
tables to calculate
00:21:30.980 --> 00:21:33.140
probabilities of interest.
00:21:33.140 --> 00:21:40.300
So for example, let's say that
we take n to be 10,000.
00:21:40.300 --> 00:21:42.370
How is the calculation
going to go?
00:21:42.370 --> 00:21:45.860
We want to calculate the
probability that the absolute
00:21:45.860 --> 00:21:52.920
value of Z is bigger than 0.02
times 100, which is the
00:21:52.920 --> 00:21:56.530
probability that the absolute
value of Z is larger than or
00:21:56.530 --> 00:21:58.490
equal to 2.
00:21:58.490 --> 00:22:00.500
And here let's do
some mechanics,
00:22:00.500 --> 00:22:03.300
just to stay in shape.
00:22:03.300 --> 00:22:05.860
The probability that you're
larger than or equal to 2 in
00:22:05.860 --> 00:22:09.290
absolute value, since the normal
is symmetric around the
00:22:09.290 --> 00:22:13.590
mean, this is going to be twice
the probability that Z
00:22:13.590 --> 00:22:16.560
is larger than or equal to 2.
00:22:16.560 --> 00:22:22.330
Can we use the cumulative
distribution function of Z to
00:22:22.330 --> 00:22:23.300
calculate this?
00:22:23.300 --> 00:22:26.100
Well, almost. The cumulative
gives us probabilities of
00:22:26.100 --> 00:22:28.910
being less than something, not
bigger than something.
00:22:28.910 --> 00:22:33.480
So we need one more step and
write this as 1 minus the
00:22:33.480 --> 00:22:38.420
probability that Z is less
than or equal to 2.
00:22:38.420 --> 00:22:41.620
And this probability, now,
you can read off
00:22:41.620 --> 00:22:43.770
from the normal tables.
00:22:43.770 --> 00:22:46.460
And the normal tables will
tell you that this
00:22:46.460 --> 00:22:52.840
probability is 0.9772.
00:22:52.840 --> 00:22:54.520
And you do get an answer.
00:22:54.520 --> 00:23:02.530
And the answer is 0.0456.
00:23:02.530 --> 00:23:05.220
OK, so we tried 10,000.
00:23:05.220 --> 00:23:10.990
And we find that our probability
of error is 4.5%, so we're
00:23:10.990 --> 00:23:15.710
doing better than the
spec that we had.
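The arithmetic above can be redone in a few lines, using the error function in place of the normal table (which is why the last digit differs slightly from the table-based 0.0456):

```python
import math

def normal_cdf(c):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

n = 10_000
threshold = 0.02 * math.sqrt(n)                # 0.02 * 100 = 2
p_error = 2.0 * (1.0 - normal_cdf(threshold))  # two symmetric tails
print(round(p_error, 4))                       # 0.0455
```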
00:23:15.710 --> 00:23:19.490
So this tells us that maybe
we have some leeway.
00:23:19.490 --> 00:23:24.070
Maybe we can use a smaller
sample size and still stay
00:23:24.070 --> 00:23:26.030
within our specs.
00:23:26.030 --> 00:23:29.630
Let's try to find how much
we can push the envelope.
00:23:29.630 --> 00:23:34.716
How much smaller
can we take n?
00:23:34.716 --> 00:23:37.890
To answer that question, we
need to do this kind of
00:23:37.890 --> 00:23:40.790
calculation, essentially,
going backwards.
00:23:40.790 --> 00:23:46.420
We're going to fix this number
to be 0.05 and work backwards
00:23:46.420 --> 00:23:49.130
here to find--
00:23:49.130 --> 00:23:50.770
did I make a mistake here?
00:23:50.770 --> 00:23:51.770
10,000.
00:23:51.770 --> 00:23:53.700
So I'm missing a 0 here.
00:23:57.440 --> 00:24:07.540
Ah, but I'm taking the square
root, so it's 100.
00:24:07.540 --> 00:24:11.080
Where did the 0.02
come in from?
00:24:11.080 --> 00:24:12.020
Ah, from here.
00:24:12.020 --> 00:24:15.870
OK, all right.
00:24:15.870 --> 00:24:19.330
0.02 times 100, that
gives us 2.
00:24:19.330 --> 00:24:22.130
OK, all right.
00:24:22.130 --> 00:24:24.240
Very good, OK.
00:24:24.240 --> 00:24:27.570
So we'll have to do this
calculation now backwards,
00:24:27.570 --> 00:24:33.510
figure out if this is 0.05,
what kind of number we're
00:24:33.510 --> 00:24:41.380
going to need here and then
here, and from this we will be
00:24:41.380 --> 00:24:45.240
able to tell what value
of n do we need.
00:24:45.240 --> 00:24:53.670
OK, so we want to find n such
that the probability that Z is
00:24:53.670 --> 00:25:04.870
bigger than 0.02 square
root n is 0.05.
00:25:04.870 --> 00:25:09.320
OK, so Z is a standard normal
random variable.
00:25:09.320 --> 00:25:16.810
And we want the probability
that we are
00:25:16.810 --> 00:25:18.640
outside this range.
00:25:18.640 --> 00:25:21.940
We want the probability of
those two tails together.
00:25:24.960 --> 00:25:26.920
Those two tails together
should have
00:25:26.920 --> 00:25:29.990
probability of 0.05.
00:25:29.990 --> 00:25:33.280
This means that this tail,
by itself, should have
00:25:33.280 --> 00:25:36.900
probability 0.025.
00:25:36.900 --> 00:25:45.960
And this means that this
probability should be 0.975.
00:25:45.960 --> 00:25:52.350
Now, if this probability
is to be 0.975, what
00:25:52.350 --> 00:25:54.970
should that number be?
00:25:54.970 --> 00:25:59.980
You go to the normal tables,
and you find which is the
00:25:59.980 --> 00:26:03.190
entry that corresponds
to that number.
00:26:03.190 --> 00:26:07.020
I actually brought a normal
table with me.
00:26:07.020 --> 00:26:12.740
And 0.975 is down here.
00:26:12.740 --> 00:26:15.420
And it tells you that
the number that
00:26:15.420 --> 00:26:19.820
corresponds to it is 1.96.
00:26:19.820 --> 00:26:24.890
So this tells us that
this number
00:26:24.890 --> 00:26:31.790
should be equal to 1.96.
00:26:31.790 --> 00:26:36.380
And now, from here, you
do the calculations.
00:26:36.380 --> 00:26:47.510
And you find that n is 9604.
00:26:47.510 --> 00:26:53.200
So with a sample of 10,000, we
got probability of error 4.5%.
00:26:53.200 --> 00:26:57.910
With a slightly smaller sample
size of 9,600, we can get the
00:26:57.910 --> 00:27:01.880
probability of a mistake
to be 0.05, which
00:27:01.880 --> 00:27:04.070
was exactly our spec.
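The backwards calculation can be sketched in Python; this is an editorial illustration, not the lecture's own code, and `statistics.NormalDist` stands in for the normal table. The coefficient 0.02 and the target 0.05 are the numbers from the example above.

```python
import math
from statistics import NormalDist

accuracy_coeff = 0.02   # the 0.02 multiplying sqrt(n) in P(|Z| > 0.02 sqrt(n))
error_prob = 0.05       # target probability for the two tails together

# Two tails of 0.025 each, so we need the z with P(Z <= z) = 0.975.
z = NormalDist().inv_cdf(1 - error_prob / 2)   # about 1.96, as in the table

# Solve 0.02 * sqrt(n) = z for n, rounding up to the next integer.
n = math.ceil((z / accuracy_coeff) ** 2)
print(z, n)   # n comes out to 9604
```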
00:27:04.070 --> 00:27:07.450
So these are essentially the two
ways that you're going to
00:27:07.450 --> 00:27:09.830
be using the central
limit theorem.
00:27:09.830 --> 00:27:12.690
Either you're given n and
you try to calculate
00:27:12.690 --> 00:27:13.610
probabilities.
00:27:13.610 --> 00:27:15.590
Or you're given the
probabilities, and you want to
00:27:15.590 --> 00:27:18.210
work backwards to
find n itself.
00:27:20.990 --> 00:27:27.710
So in this example, the random
variable that we dealt with
00:27:27.710 --> 00:27:30.450
was, of course, a binomial
random variable.
00:27:30.450 --> 00:27:38.590
The Xi's were Bernoulli,
so the sum of
00:27:38.590 --> 00:27:40.950
the Xi's was binomial.
00:27:40.950 --> 00:27:44.100
So the central limit theorem
certainly applies to the
00:27:44.100 --> 00:27:45.950
binomial distribution.
00:27:45.950 --> 00:27:49.440
To be more precise, of course,
it applies to the standardized
00:27:49.440 --> 00:27:52.730
version of the binomial
random variable.
00:27:52.730 --> 00:27:55.140
So here's what we did,
essentially, in
00:27:55.140 --> 00:27:57.300
the previous example.
00:27:57.300 --> 00:28:00.690
We fixed the number p, which is
the probability of success
00:28:00.690 --> 00:28:02.010
in our experiments.
00:28:02.010 --> 00:28:06.550
p corresponds to f in the
previous example.
00:28:06.550 --> 00:28:10.570
We let every Xi be a Bernoulli
random variable, and our
00:28:10.570 --> 00:28:13.790
standing assumption is that
these random variables are
00:28:13.790 --> 00:28:15.040
independent.
00:28:17.580 --> 00:28:20.730
When we add them, we get a
random variable that has a
00:28:20.730 --> 00:28:22.030
binomial distribution.
00:28:22.030 --> 00:28:25.220
We know the mean and the
variance of the binomial, so
00:28:25.220 --> 00:28:29.130
we take Sn, we subtract the
mean, which is this, divide by
00:28:29.130 --> 00:28:30.470
the standard deviation.
00:28:30.470 --> 00:28:32.790
The central limit theorem tells
us that the cumulative
00:28:32.790 --> 00:28:36.130
distribution function of this
random variable approaches that of a standard
00:28:36.130 --> 00:28:39.860
normal random variable
in the limit.
00:28:39.860 --> 00:28:43.730
So let's do one more example
of a calculation.
00:28:43.730 --> 00:28:47.160
Let's take n to be--
00:28:47.160 --> 00:28:50.110
let's choose some specific
numbers to work with.
00:28:52.950 --> 00:28:58.300
So in this example, the first thing
to do is to find the
00:28:58.300 --> 00:29:02.390
expected value of Sn,
which is n times p.
00:29:02.390 --> 00:29:04.150
It's 18.
00:29:04.150 --> 00:29:08.100
Then we need to write down
the standard deviation.
00:29:12.430 --> 00:29:16.530
The variance of Sn is the
sum of the variances.
00:29:16.530 --> 00:29:19.940
It's np times (1-p).
00:29:19.940 --> 00:29:25.920
And in this particular example,
p times (1-p) is 1/4,
00:29:25.920 --> 00:29:28.320
n is 36, so this is 9.
00:29:28.320 --> 00:29:33.120
And that tells us that the
standard deviation of Sn
00:29:33.120 --> 00:29:34.370
is equal to 3.
00:29:37.170 --> 00:29:40.650
So what we're going to do is to
take the event of interest,
00:29:40.650 --> 00:29:46.400
which is Sn less than 21, and
rewrite it in a way that
00:29:46.400 --> 00:29:48.910
involves the standardized
random variable.
00:29:48.910 --> 00:29:51.700
So to do that, we need
to subtract the mean.
00:29:51.700 --> 00:29:55.680
So we write this as Sn-3
should be less
00:29:55.680 --> 00:29:58.460
than or equal to 21-3.
00:29:58.460 --> 00:30:00.360
This is the same event.
00:30:00.360 --> 00:30:02.890
And then divide by the standard
deviation, which is
00:30:02.890 --> 00:30:06.450
3, and we end up with this.
00:30:06.450 --> 00:30:08.300
So the event itself of--
00:30:08.300 --> 00:30:09.550
AUDIENCE: [INAUDIBLE].
00:30:13.700 --> 00:30:24.150
PROFESSOR: Should subtract 18, yes, which
gives me a much nicer
00:30:24.150 --> 00:30:26.640
number out here, which is 1.
00:30:26.640 --> 00:30:31.650
So the event of interest, that
Sn is less than 21, is the
00:30:31.650 --> 00:30:37.330
same as the event that a
standard normal random
00:30:37.330 --> 00:30:41.580
variable is less than
or equal to 1.
00:30:41.580 --> 00:30:44.690
And once more, you can look this
up at the normal tables.
00:30:44.690 --> 00:30:50.690
And you find that the answer
that you get is 0.8413.
00:30:50.690 --> 00:30:53.390
Now it's interesting to compare
this answer that we
00:30:53.390 --> 00:30:57.230
got through the central limit
theorem with the exact answer.
00:30:57.230 --> 00:31:01.920
The exact answer involves the
exact binomial distribution.
00:31:01.920 --> 00:31:08.780
What we have here is the
binomial probability that Sn
00:31:08.780 --> 00:31:10.970
is equal to k.
00:31:10.970 --> 00:31:15.230
Sn being equal to k is given
by this formula.
00:31:15.230 --> 00:31:22.610
And we add, over all values for
k going from 0 up to 21,
00:31:22.610 --> 00:31:28.670
we write two lines of code to
calculate this sum, and we get
00:31:28.670 --> 00:31:32.530
the exact answer,
which is 0.8785.
00:31:32.530 --> 00:31:35.760
So there's a pretty good
agreement between the two,
00:31:35.760 --> 00:31:38.600
although you wouldn't
call that
00:31:38.600 --> 00:31:40.395
necessarily excellent agreement.
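The "two lines of code" for the exact answer can be sketched as follows (an editorial illustration; n = 36 and p = 1/2 are the numbers used in this example), together with the uncorrected CLT value P(Z less than or equal to 1):

```python
from math import comb
from statistics import NormalDist

n, p = 36, 0.5

# Exact: P(Sn <= 21) = sum over k = 0..21 of C(n, k) p^k (1-p)^(n-k)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(22))

# Plain CLT approximation: P(Z <= (21 - 18)/3) = P(Z <= 1)
clt = NormalDist().cdf((21 - 18) / 3)

print(round(exact, 4), round(clt, 4))   # 0.8785 and 0.8413
```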
00:31:45.080 --> 00:31:47.060
Can we do a little
better than that?
00:31:51.570 --> 00:31:53.750
OK.
00:31:53.750 --> 00:31:56.510
It turns out that we can.
00:31:56.510 --> 00:31:58.625
And here's the idea.
00:32:02.300 --> 00:32:07.750
So our random variable
Sn has a mean of 18.
00:32:07.750 --> 00:32:09.540
It has a binomial
distribution.
00:32:09.540 --> 00:32:14.050
It's described by a PMF that has
a shape roughly like this
00:32:14.050 --> 00:32:16.690
and which keeps going on.
00:32:16.690 --> 00:32:20.960
Using the central limit
theorem is basically
00:32:20.960 --> 00:32:26.650
pretending that Sn is
normal with the
00:32:26.650 --> 00:32:28.650
right mean and variance.
00:32:28.650 --> 00:32:35.200
So, since Zn has
zero mean and unit variance, we
00:32:35.200 --> 00:32:38.850
approximate it with Z, which
has zero mean and unit variance.
00:32:38.850 --> 00:32:42.190
If you were to pretend that
Sn is normal, you would
00:32:42.190 --> 00:32:45.407
approximate it with a normal
that has the correct mean and
00:32:45.407 --> 00:32:46.250
correct variance.
00:32:46.250 --> 00:32:49.390
So it would still be
centered at 18.
00:32:49.390 --> 00:32:53.800
And it would have the same
variance as the binomial PMF.
00:32:53.800 --> 00:32:57.350
So using the central limit
theorem essentially means that
00:32:57.350 --> 00:33:00.420
we keep the mean and the
variance what they are but we
00:33:00.420 --> 00:33:03.960
pretend that our distribution
is normal.
00:33:03.960 --> 00:33:06.780
We want to calculate the
probability that Sn is less
00:33:06.780 --> 00:33:09.590
than or equal to 21.
00:33:09.590 --> 00:33:14.310
I pretend that my random
variable is normal, so I draw
00:33:14.310 --> 00:33:18.680
a line here and I calculate
the area under the normal
00:33:18.680 --> 00:33:22.000
curve going up to 21.
00:33:22.000 --> 00:33:23.500
That's essentially
what we did.
00:33:26.260 --> 00:33:29.730
Now, a smart person comes
around and says, Sn is a
00:33:29.730 --> 00:33:31.360
discrete random variable.
00:33:31.360 --> 00:33:34.750
So the event that Sn is less
than or equal to 21 is the
00:33:34.750 --> 00:33:38.480
same as Sn being strictly less
than 22 because nothing in
00:33:38.480 --> 00:33:41.240
between can happen.
00:33:41.240 --> 00:33:43.700
So I'm going to use the
central limit theorem
00:33:43.700 --> 00:33:48.290
approximation by pretending
again that Sn is normal and
00:33:48.290 --> 00:33:51.650
finding the probability of this
event while pretending
00:33:51.650 --> 00:33:53.720
that Sn is normal.
00:33:53.720 --> 00:33:57.870
So what this person would do
would be to draw a line here,
00:33:57.870 --> 00:34:02.780
at 22, and calculate the area
under the normal curve
00:34:02.780 --> 00:34:05.490
all the way to 22.
00:34:05.490 --> 00:34:06.700
Who is right?
00:34:06.700 --> 00:34:08.820
Which one is better?
00:34:08.820 --> 00:34:15.639
Well neither, but we can do
better than both if we sort of
00:34:15.639 --> 00:34:17.949
split the difference.
00:34:17.949 --> 00:34:21.969
So another way of writing the
same event for Sn is to write
00:34:21.969 --> 00:34:25.940
it as Sn being less than 21.5.
00:34:25.940 --> 00:34:29.570
In terms of the discrete random
variable Sn, all three
00:34:29.570 --> 00:34:32.239
of these are exactly
the same event.
00:34:32.239 --> 00:34:35.090
But when you do the continuous
approximation, they give you
00:34:35.090 --> 00:34:36.250
different probabilities.
00:34:36.250 --> 00:34:39.760
It's a matter of whether you
integrate the area under the
00:34:39.760 --> 00:34:46.159
normal curve up to here, up to
the midway point, or up to 22.
00:34:46.159 --> 00:34:50.659
It turns out that integrating
up to the midpoint is what
00:34:50.659 --> 00:34:54.469
gives us the better
numerical results.
00:34:54.469 --> 00:34:59.170
So we take here 21 and 1/2,
and we integrate the area
00:34:59.170 --> 00:35:01.170
under the normal curve
up to here.
00:35:14.100 --> 00:35:18.560
So let's do this calculation
and see what we get.
00:35:18.560 --> 00:35:21.330
What would we change here?
00:35:21.330 --> 00:35:27.730
Instead of 21, we would
now write 21 and 1/2.
00:35:27.730 --> 00:35:32.810
This 18 becomes, no, that
18 stays what it is.
00:35:32.810 --> 00:35:36.890
But this 21 becomes
21 and 1/2.
00:35:36.890 --> 00:35:44.790
And so this one becomes
1 + 0.5/3.
00:35:44.790 --> 00:35:48.210
This is 1.17.
00:35:48.210 --> 00:35:51.980
So we now look up into the
normal tables and ask for the
00:35:51.980 --> 00:36:00.000
probability that Z is
less than 1.17.
00:36:00.000 --> 00:36:06.070
So this here gets approximated
by the probability that the
00:36:06.070 --> 00:36:09.240
standard normal is
less than 1.17.
00:36:09.240 --> 00:36:15.960
And the normal tables will
tell us this is 0.879.
00:36:15.960 --> 00:36:23.550
Going back to the previous
slide, what we got this time
00:36:23.550 --> 00:36:30.310
with this improved approximation
is 0.879.
00:36:30.310 --> 00:36:33.730
This is a really good
approximation
00:36:33.730 --> 00:36:35.730
of the correct number.
00:36:35.730 --> 00:36:39.160
This is what we got
using the 21.
00:36:39.160 --> 00:36:42.360
This is what we get using
the 21 and 1/2.
00:36:42.360 --> 00:36:45.940
And it's an approximation that's
sort of right on-- a
00:36:45.940 --> 00:36:48.350
very good one.
00:36:48.350 --> 00:36:54.120
The moral from this numerical
example is that doing this
00:36:54.120 --> 00:37:00.933
1/2 correction does give
us better approximations.
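The 1/2 correction amounts to moving the integration limit from 21 to 21.5 (again an editorial sketch; the table lookup in the lecture rounds the argument 1.1667 to 1.17):

```python
from statistics import NormalDist

mean, sd = 18, 3   # mean and standard deviation of Sn in this example

# P(Sn <= 21) with the 1/2 correction: integrate the normal curve up to 21.5
corrected = NormalDist().cdf((21.5 - mean) / sd)   # P(Z <= 1.1667)
print(round(corrected, 3))   # 0.878, very close to the exact 0.8785
```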
00:37:06.070 --> 00:37:12.010
In fact, we can use this 1/2
idea to even calculate
00:37:12.010 --> 00:37:14.340
individual probabilities.
00:37:14.340 --> 00:37:17.130
So suppose you want to
approximate the probability
00:37:17.130 --> 00:37:21.010
that Sn equal to 19.
00:37:21.010 --> 00:37:25.620
If you were to pretend that Sn
is normal and calculate this
00:37:25.620 --> 00:37:28.470
probability, the probability
that the normal random
00:37:28.470 --> 00:37:31.670
variable is equal to 19 is 0.
00:37:31.670 --> 00:37:34.150
So you don't get an interesting
answer.
00:37:34.150 --> 00:37:37.610
You get a more interesting
answer by writing this event,
00:37:37.610 --> 00:37:41.460
19 as being the same as the
event of falling between 18
00:37:41.460 --> 00:37:45.910
and 1/2 and 19 and 1/2 and using
the normal approximation
00:37:45.910 --> 00:37:48.230
to calculate this probability.
00:37:48.230 --> 00:37:51.890
In terms of our previous
picture, this corresponds to
00:37:51.890 --> 00:37:53.140
the following.
00:37:59.400 --> 00:38:04.650
We are interested in the
probability that
00:38:04.650 --> 00:38:07.130
Sn is equal to 19.
00:38:07.130 --> 00:38:11.230
So we're interested in the
height of this bar.
00:38:11.230 --> 00:38:15.720
We're going to consider the area
under the normal curve
00:38:15.720 --> 00:38:21.500
going from here to here,
and use this area as an
00:38:21.500 --> 00:38:25.110
approximation for the height
of that particular bar.
00:38:25.110 --> 00:38:30.670
So what we're basically doing
is, we take the probability
00:38:30.670 --> 00:38:33.830
under the normal curve that's
assigned over a continuum of
00:38:33.830 --> 00:38:38.280
values and attributed it to
different discrete values.
00:38:38.280 --> 00:38:43.510
Whatever is above the midpoint
gets attributed to 19.
00:38:43.510 --> 00:38:45.640
Whatever is below that
midpoint gets
00:38:45.640 --> 00:38:47.250
attributed to 18.
00:38:47.250 --> 00:38:54.280
So this green area is our
approximation of the value of
00:38:54.280 --> 00:38:56.500
the PMF at 19.
00:38:56.500 --> 00:39:00.740
So similarly, if you wanted to
approximate the value of the
00:39:00.740 --> 00:39:04.440
PMF at this point, you would
take this interval and
00:39:04.440 --> 00:39:06.580
integrate the area
under the normal
00:39:06.580 --> 00:39:09.350
curve over that interval.
00:39:09.350 --> 00:39:13.410
It turns out that this gives a
very good approximation of the
00:39:13.410 --> 00:39:15.660
PMF of the binomial.
00:39:15.660 --> 00:39:22.580
And actually, this was the
context in which the central
00:39:22.580 --> 00:39:26.310
limit theorem was proved in
the first place, when this
00:39:26.310 --> 00:39:27.990
business started.
00:39:27.990 --> 00:39:33.060
So this business goes back
a few hundred years.
00:39:33.060 --> 00:39:35.700
And the central limit theorem
was first proved by
00:39:35.700 --> 00:39:39.420
considering the PMF of a
binomial random variable when
00:39:39.420 --> 00:39:41.840
p is equal to 1/2.
00:39:41.840 --> 00:39:45.590
People did the algebra, and they
found out that the exact
00:39:45.590 --> 00:39:49.700
expression for the PMF is quite
well approximated by
00:39:49.700 --> 00:39:51.980
the expression that you would
get from a normal
00:39:51.980 --> 00:39:53.380
distribution.
00:39:53.380 --> 00:39:57.510
Then the proof was extended to
binomials for more general
00:39:57.510 --> 00:39:59.690
values of p.
00:39:59.690 --> 00:40:04.220
So here we talk about this as
a refinement of the general
00:40:04.220 --> 00:40:07.480
central limit theorem, but,
historically, that refinement
00:40:07.480 --> 00:40:09.830
was where the whole business
got started
00:40:09.830 --> 00:40:11.820
in the first place.
00:40:11.820 --> 00:40:18.700
All right, so let's go through
the mechanics of approximating
00:40:18.700 --> 00:40:21.970
the probability that
Sn is equal to 19--
00:40:21.970 --> 00:40:23.810
exactly 19.
00:40:23.810 --> 00:40:27.340
As we said, we're going to write
this event as an event
00:40:27.340 --> 00:40:31.040
that covers an interval of unit
length from 18 and 1/2 to
00:40:31.040 --> 00:40:31.970
19 and 1/2.
00:40:31.970 --> 00:40:33.730
This is the event of interest.
00:40:33.730 --> 00:40:37.070
First step is to massage the
event of interest so that it
00:40:37.070 --> 00:40:40.010
involves our Zn random
variable.
00:40:40.010 --> 00:40:43.290
So subtract 18 from all sides.
00:40:43.290 --> 00:40:46.860
Divide all sides by the
standard deviation, which is 3.
00:40:46.860 --> 00:40:50.850
That's the equivalent
representation of the event.
00:40:50.850 --> 00:40:54.200
This is our standardized
random variable Zn.
00:40:54.200 --> 00:40:56.950
These are just these numbers.
00:40:56.950 --> 00:41:00.530
And to do an approximation, we
want to find the probability
00:41:00.530 --> 00:41:04.380
of this event, but Zn is
approximately normal, so we
00:41:04.380 --> 00:41:08.030
plug in here the Z, which
is the standard normal.
00:41:08.030 --> 00:41:10.150
So we want to find the
probability that the standard
00:41:10.150 --> 00:41:12.890
normal falls inside
this interval.
00:41:12.890 --> 00:41:15.630
You find these using CDFs
because this is the
00:41:15.630 --> 00:41:18.760
probability that you're
less than this but
00:41:18.760 --> 00:41:22.370
not less than that.
00:41:22.370 --> 00:41:25.370
So it's a difference between two
cumulative probabilities.
00:41:25.370 --> 00:41:27.400
Then, you look up your
normal tables.
00:41:27.400 --> 00:41:30.560
You find two numbers for these
quantities, and, finally, you
00:41:30.560 --> 00:41:35.140
get a numerical answer for an
individual entry of the PMF of
00:41:35.140 --> 00:41:36.480
the binomial.
00:41:36.480 --> 00:41:39.350
This is a pretty good
approximation, it turns out.
00:41:39.350 --> 00:41:42.910
If you were to do the
calculations using the exact
00:41:42.910 --> 00:41:47.130
formula, you would
get something
00:41:47.130 --> 00:41:49.360
which is pretty close--
00:41:49.360 --> 00:41:52.800
an error in the third digit--
00:41:52.800 --> 00:41:56.980
this is pretty good.
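A sketch of this PMF calculation (editorial, not from the lecture), comparing the continuity-corrected normal value against the exact binomial entry for n = 36, p = 1/2, k = 19:

```python
from math import comb
from statistics import NormalDist

n, p, k = 36, 0.5, 19
mean = n * p                    # 18
sd = (n * p * (1 - p)) ** 0.5   # 3

# P(Sn = 19) ~ P(18.5 <= normal <= 19.5): a difference of two CDF values
Z = NormalDist()
approx = Z.cdf((k + 0.5 - mean) / sd) - Z.cdf((k - 0.5 - mean) / sd)

# Exact binomial PMF entry for comparison
exact = comb(n, k) * p**k * (1 - p)**(n - k)

print(round(approx, 4), round(exact, 4))   # 0.1253 vs 0.1251
```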
00:41:56.980 --> 00:41:59.650
So I guess what we did here
with our discussion of the
00:41:59.650 --> 00:42:04.560
binomial slightly contradicts
what I said before--
00:42:04.560 --> 00:42:07.330
that the central limit theorem
is a statement about
00:42:07.330 --> 00:42:09.240
cumulative distribution
functions.
00:42:09.240 --> 00:42:13.240
In general, it doesn't tell you
what to do to approximate
00:42:13.240 --> 00:42:15.270
PMFs themselves.
00:42:15.270 --> 00:42:17.440
And that's indeed the
case in general.
00:42:17.440 --> 00:42:20.220
On the other hand, for the
special case of a binomial
00:42:20.220 --> 00:42:23.610
distribution, the central limit
theorem approximation,
00:42:23.610 --> 00:42:28.200
with this 1/2 correction, is a
very good approximation even
00:42:28.200 --> 00:42:29.560
for the individual PMF.
00:42:33.290 --> 00:42:40.210
All right, so we spent quite
a bit of time on mechanics.
00:42:40.210 --> 00:42:46.050
So let's spend the last few
minutes today thinking a bit
00:42:46.050 --> 00:42:47.930
and look at a small puzzle.
00:42:51.390 --> 00:42:54.240
So the puzzle is
the following.
00:42:54.240 --> 00:43:02.460
Consider a Poisson process that
runs over a unit interval.
00:43:02.460 --> 00:43:07.770
And where the arrival
rate is equal to 1.
00:43:07.770 --> 00:43:09.790
So this is the unit interval.
00:43:09.790 --> 00:43:12.720
And let X be the number
of arrivals.
00:43:15.430 --> 00:43:19.930
And this is Poisson,
with mean 1.
00:43:25.000 --> 00:43:28.160
Now, let me take this interval
and divide it
00:43:28.160 --> 00:43:30.650
into n little pieces.
00:43:30.650 --> 00:43:34.270
So each piece has length 1/n.
00:43:34.270 --> 00:43:41.225
And let Xi be the number
of arrivals during
00:43:41.225 --> 00:43:43.490
the i-th little interval.
00:43:48.000 --> 00:43:51.630
OK, what do we know about
the random variables Xi?
00:43:51.630 --> 00:43:55.260
It's that they are themselves
Poisson.
00:43:55.260 --> 00:43:58.490
It's a number of arrivals
during a small interval.
00:43:58.490 --> 00:44:02.340
We also know that when n is
big, so the length of the
00:44:02.340 --> 00:44:08.190
interval is small, these Xi's
are approximately Bernoulli,
00:44:08.190 --> 00:44:11.730
with mean 1/n.
00:44:11.730 --> 00:44:13.970
I guess it doesn't matter whether
we model them as
00:44:13.970 --> 00:44:15.720
Bernoulli or not.
00:44:15.720 --> 00:44:19.660
What matters is that the
Xi's are independent.
00:44:19.660 --> 00:44:20.970
Why are they independent?
00:44:20.970 --> 00:44:24.410
Because, in a Poisson process,
disjoint intervals are
00:44:24.410 --> 00:44:26.770
independent of each other.
00:44:26.770 --> 00:44:28.955
So the Xi's are independent.
00:44:31.840 --> 00:44:35.570
And they also have the
same distribution.
00:44:35.570 --> 00:44:40.360
And we have that X, the total
number of arrivals, is the sum
00:44:40.360 --> 00:44:41.610
of the Xi's.
00:44:44.470 --> 00:44:49.510
So the central limit theorem
tells us that, approximately,
00:44:49.510 --> 00:44:53.670
the sum of independent,
identically distributed random
00:44:53.670 --> 00:44:57.720
variables, when we have lots
of these random variables,
00:44:57.720 --> 00:45:01.530
behaves like a normal
random variable.
00:45:01.530 --> 00:45:07.475
So by using this decomposition
of X into a sum of i.i.d
00:45:07.475 --> 00:45:11.540
random variables, and by using
values of n that are bigger
00:45:11.540 --> 00:45:16.540
and bigger, by taking the limit,
it should follow that X
00:45:16.540 --> 00:45:19.510
has a normal distribution.
00:45:19.510 --> 00:45:22.120
On the other hand, we know
that X has a Poisson
00:45:22.120 --> 00:45:23.370
distribution.
00:45:25.270 --> 00:45:32.640
So something must be wrong
in this argument here.
00:45:32.640 --> 00:45:34.900
Can we really use the
central limit
00:45:34.900 --> 00:45:38.330
theorem in this situation?
00:45:38.330 --> 00:45:41.300
So what do we need for the
central limit theorem?
00:45:41.300 --> 00:45:44.160
We need to have independent,
identically
00:45:44.160 --> 00:45:46.700
distributed random variables.
00:45:46.700 --> 00:45:49.060
We have it here.
00:45:49.060 --> 00:45:53.410
We want them to have a finite
mean and finite variance.
00:45:53.410 --> 00:45:57.610
We also have it here: means and
variances are finite.
00:45:57.610 --> 00:46:02.050
What is another assumption that
was never made explicit,
00:46:02.050 --> 00:46:04.080
but essentially was there?
00:46:07.680 --> 00:46:13.260
Or in other words, what is the
flaw in this argument that
00:46:13.260 --> 00:46:15.520
uses the central limit
theorem here?
00:46:15.520 --> 00:46:16.770
Any thoughts?
00:46:24.110 --> 00:46:29.640
So in the central limit theorem,
we said, consider--
00:46:29.640 --> 00:46:34.820
fix a probability distribution,
and let the Xi's
00:46:34.820 --> 00:46:38.280
be distributed according to that
probability distribution,
00:46:38.280 --> 00:46:42.935
and add a larger and larger
number of Xi's.
00:46:42.935 --> 00:46:47.410
But the underlying, unstated
assumption is that we fix the
00:46:47.410 --> 00:46:49.490
distribution of the Xi's.
00:46:49.490 --> 00:46:52.810
As we let n increase,
the statistics of
00:46:52.810 --> 00:46:55.930
each Xi do not change.
00:46:55.930 --> 00:46:59.010
Whereas here, I'm playing
a trick on you.
00:46:59.010 --> 00:47:03.700
As I'm taking more and more
random variables, I'm actually
00:47:03.700 --> 00:47:07.850
changing what those random
variables are.
00:47:07.850 --> 00:47:12.960
When I take a larger n, the Xi's
are random variables with
00:47:12.960 --> 00:47:15.720
a different mean and
different variance.
00:47:15.720 --> 00:47:19.800
So I'm adding more of these, but
at the same time, in this
00:47:19.800 --> 00:47:23.420
example, I'm changing
their distributions.
00:47:23.420 --> 00:47:26.380
That's something that doesn't
fit the setting of the central
00:47:26.380 --> 00:47:27.000
limit theorem.
00:47:27.000 --> 00:47:29.910
In the central limit theorem,
you first fix the distribution
00:47:29.910 --> 00:47:31.200
of the X's.
00:47:31.200 --> 00:47:35.290
You keep it fixed, and then you
consider adding more and
00:47:35.290 --> 00:47:38.950
more according to that
particular fixed distribution.
00:47:38.950 --> 00:47:40.020
So that's the catch.
00:47:40.020 --> 00:47:42.240
That's why the central limit
theorem does not
00:47:42.240 --> 00:47:43.970
apply to this situation.
00:47:43.970 --> 00:47:46.230
And we're lucky that it
doesn't apply because,
00:47:46.230 --> 00:47:50.220
otherwise, we would have a huge
contradiction destroying
00:47:50.220 --> 00:47:52.770
probability theory.
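A quick numerical check (an editorial aside) of why it is fortunate that the CLT does not apply here: X is Poisson with mean 1, and a normal with the same mean and variance is nowhere close to it. Comparing P(X = 0) under both:

```python
from math import exp
from statistics import NormalDist

# X ~ Poisson(1): P(X = 0) = e^{-1}
poisson_p0 = exp(-1)

# A normal with the same mean and variance (both 1), with the 1/2 correction
normal = NormalDist(mu=1.0, sigma=1.0)
normal_p0 = normal.cdf(0.5) - normal.cdf(-0.5)

print(round(poisson_p0, 4), round(normal_p0, 4))   # 0.3679 vs 0.2417
```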
00:47:52.770 --> 00:48:02.240
OK, but now that still
leaves us with a
00:48:02.240 --> 00:48:05.040
little bit of a dilemma.
00:48:05.040 --> 00:48:08.510
Suppose that, here, essentially
we're adding
00:48:08.510 --> 00:48:12.815
independent Bernoulli
random variables.
00:48:22.650 --> 00:48:25.300
So the issue is that the central
limit theorem has to
00:48:25.300 --> 00:48:28.920
do with asymptotics as
n goes to infinity.
00:48:28.920 --> 00:48:34.260
And if we consider a binomial,
and somebody gives us specific
00:48:34.260 --> 00:48:38.870
numbers about the parameters of
that binomial, it might not
00:48:38.870 --> 00:48:40.830
necessarily be obvious
what kind of
00:48:40.830 --> 00:48:42.790
approximation to use.
00:48:42.790 --> 00:48:45.660
In particular, we do have two
different approximations for
00:48:45.660 --> 00:48:47.100
the binomial.
00:48:47.100 --> 00:48:51.610
If we fix p, then the binomial
is the sum of Bernoulli's that
00:48:51.610 --> 00:48:54.930
come from a fixed distribution,
we consider more
00:48:54.930 --> 00:48:56.450
and more of these.
00:48:56.450 --> 00:48:58.990
When we add them, the central
limit theorem tells us that we
00:48:58.990 --> 00:49:01.190
get the normal distribution.
00:49:01.190 --> 00:49:04.430
There's another sort of limit,
which has the flavor of this
00:49:04.430 --> 00:49:10.770
example, in which we still deal
with a binomial, sum of n
00:49:10.770 --> 00:49:11.170
Bernoulli's.
00:49:11.170 --> 00:49:14.310
We let that sum, the
number of the
00:49:14.310 --> 00:49:16.090
Bernoulli's go to infinity.
00:49:16.090 --> 00:49:18.890
But each Bernoulli has a
probability of success that
00:49:18.890 --> 00:49:23.830
goes to 0, and we do this in a
way so that np, the expected
00:49:23.830 --> 00:49:27.090
number of successes,
stays finite.
00:49:27.090 --> 00:49:30.660
This is the situation that we
dealt with when we first
00:49:30.660 --> 00:49:32.960
defined our Poisson process.
00:49:32.960 --> 00:49:37.540
We have a very, very large
number, lots of time slots,
00:49:37.540 --> 00:49:40.920
but during each time slot,
there's a tiny probability of
00:49:40.920 --> 00:49:42.950
obtaining an arrival.
00:49:42.950 --> 00:49:48.460
Under that setting, in discrete
time, we have a
00:49:48.460 --> 00:49:51.670
binomial distribution, or
Bernoulli process, but when we
00:49:51.670 --> 00:49:54.530
take the limit, we obtain the
Poisson process and the
00:49:54.530 --> 00:49:56.470
Poisson approximation.
00:49:56.470 --> 00:49:58.510
So these are two equally valid
00:49:58.510 --> 00:50:00.550
approximations of the binomial.
00:50:00.550 --> 00:50:03.300
But they're valid in different
asymptotic regimes.
00:50:03.300 --> 00:50:06.180
In one regime, we fixed p,
let n go to infinity.
00:50:06.180 --> 00:50:09.360
In the other regime, we let
both n and p change
00:50:09.360 --> 00:50:11.540
simultaneously.
00:50:11.540 --> 00:50:14.240
Now, in real life, you're
never dealing with the
00:50:14.240 --> 00:50:15.290
limiting situations.
00:50:15.290 --> 00:50:17.870
You're dealing with
actual numbers.
00:50:17.870 --> 00:50:21.820
So if somebody tells you that
the numbers are like this,
00:50:21.820 --> 00:50:25.160
then you should probably say
that this is the situation
00:50:25.160 --> 00:50:27.380
that fits the Poisson
description--
00:50:27.380 --> 00:50:30.180
large number of slots with
each slot having a tiny
00:50:30.180 --> 00:50:32.460
probability of success.
00:50:32.460 --> 00:50:36.890
On the other hand, if p is
something like this, and n is
00:50:36.890 --> 00:50:40.460
500, then you expect to get
the distribution for the
00:50:40.460 --> 00:50:41.680
number of successes.
00:50:41.680 --> 00:50:45.740
It's going to have a mean of 50
and to have a fair amount
00:50:45.740 --> 00:50:47.280
of spread around there.
00:50:47.280 --> 00:50:50.150
It turns out that the normal
approximation would be better
00:50:50.150 --> 00:50:51.500
in this context.
00:50:51.500 --> 00:50:57.120
As a rule of thumb, if n times p
is bigger than 10 or 20, you
00:50:57.120 --> 00:50:59.320
can start using the normal
approximation.
00:50:59.320 --> 00:51:04.310
If n times p is a small number,
then you prefer to use
00:51:04.310 --> 00:51:06.090
the Poisson approximation.
00:51:06.090 --> 00:51:08.840
But there's no hard theorems
or rules about
00:51:08.840 --> 00:51:11.650
how to go about this.
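The rule of thumb can be sketched numerically (an editorial illustration; the specific numbers n = 100, p = 0.01 and n = 500, p = 0.1 are illustrative choices, not necessarily the ones on the lecture slide), comparing the exact binomial PMF against both approximations in each regime:

```python
from math import comb, exp, factorial
from statistics import NormalDist

def binom_pmf(n, p, k):
    """Exact binomial probability P(Sn = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """Poisson probability with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

def normal_pmf(n, p, k):
    """Normal approximation with the 1/2 correction."""
    mu, sd = n * p, (n * p * (1 - p)) ** 0.5
    dist = NormalDist(mu, sd)
    return dist.cdf(k + 0.5) - dist.cdf(k - 0.5)

# Small np: huge number of slots, tiny success probability -> Poisson regime
n, p, k = 100, 0.01, 0
print(binom_pmf(n, p, k), poisson_pmf(n * p, k), normal_pmf(n, p, k))

# Large np: mean of 50 with a fair amount of spread -> normal regime
n, p, k = 500, 0.1, 50
print(binom_pmf(n, p, k), poisson_pmf(n * p, k), normal_pmf(n, p, k))
```

In the first regime the Poisson value lands much closer to the exact PMF; in the second the normal approximation does, matching the np-bigger-than-10-or-20 rule of thumb.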
00:51:11.650 --> 00:51:15.440
OK, so from next time we're
going to switch gears again.
00:51:15.440 --> 00:51:17.830
And we're going to put together
everything we learned
00:51:17.830 --> 00:51:20.620
in this class to start solving
inference problems.