WEBVTT
00:00:00.040 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.870
Commons license.
00:00:03.870 --> 00:00:06.910
Your support will help MIT
OpenCourseWare continue to
00:00:06.910 --> 00:00:10.560
offer high quality educational
resources for free.
00:00:10.560 --> 00:00:13.460
To make a donation or view
additional materials from
00:00:13.460 --> 00:00:17.390
hundreds of MIT courses, visit
MIT OpenCourseWare at
00:00:17.390 --> 00:00:18.640
ocw.mit.edu.
00:00:21.860 --> 00:00:25.130
JOHN TSITSIKLIS: We're going
to start a new unit today.
00:00:25.130 --> 00:00:29.320
So we will be talking about
limit theorems.
00:00:29.320 --> 00:00:33.580
So just to introduce the topic,
let's think of the
00:00:33.580 --> 00:00:35.560
following situation.
00:00:35.560 --> 00:00:37.580
There's a population
of penguins down
00:00:37.580 --> 00:00:38.970
at the South Pole.
00:00:38.970 --> 00:00:42.740
And if you were to pick a
penguin at random and measure
00:00:42.740 --> 00:00:46.930
their height, the expected value
of their height would be
00:00:46.930 --> 00:00:50.020
the average of the heights of
the different penguins in the
00:00:50.020 --> 00:00:50.970
population.
00:00:50.970 --> 00:00:53.430
So suppose when you
pick one, every
00:00:53.430 --> 00:00:55.210
penguin is equally likely.
00:00:55.210 --> 00:00:58.020
Then the expected value is just
the average of all the
00:00:58.020 --> 00:00:59.340
penguins out there.
00:00:59.340 --> 00:01:01.650
So your boss asks you to
find out what the
00:01:01.650 --> 00:01:03.020
expected value is.
00:01:03.020 --> 00:01:04.980
One way would be to
go and measure
00:01:04.980 --> 00:01:06.540
each and every penguin.
00:01:06.540 --> 00:01:08.600
That might be a little
time consuming.
00:01:08.600 --> 00:01:13.120
So alternatively, what you can
do is to go and pick penguins
00:01:13.120 --> 00:01:17.450
at random, pick a few of them,
let's say a number n of them.
00:01:17.450 --> 00:01:20.420
So you measure the height
of each one.
00:01:20.420 --> 00:01:25.920
And then you calculate the
average of the heights of
00:01:25.920 --> 00:01:29.050
those penguins that you
have collected.
00:01:29.050 --> 00:01:33.100
So this is your estimate
of the expected value.
00:01:33.100 --> 00:01:41.010
Now, we call this the sample
mean, which is the mean value,
00:01:41.010 --> 00:01:44.430
but within the sample that
you have collected.
00:01:44.430 --> 00:01:48.090
This is something that sort
of feels the same as the
00:01:48.090 --> 00:01:52.140
expected value, which
is again, the mean.
00:01:52.140 --> 00:01:54.400
But the expected value's a
different kind of mean.
00:01:54.400 --> 00:01:57.870
The expected value is the mean
over the entire population,
00:01:57.870 --> 00:02:01.680
whereas the sample mean is the
average over the smaller
00:02:01.680 --> 00:02:03.940
sample that you have measured.
00:02:03.940 --> 00:02:06.330
The expected value
is a number.
00:02:06.330 --> 00:02:09.220
The sample mean is a
random variable.
00:02:09.220 --> 00:02:11.720
It's a random variable because
the sample you have
00:02:11.720 --> 00:02:15.010
collected is random.
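This point can be made concrete with a short simulation sketch. The penguin population and heights below are made up purely for illustration; the point is that each repetition of the same sampling procedure produces a different value of the sample mean.

```python
import random

random.seed(0)

# A hypothetical penguin population (heights in cm); the true
# expected value is just the population average.
population = [random.gauss(70, 5) for _ in range(100000)]
true_mean = sum(population) / len(population)

def sample_mean(n):
    # Pick n penguins uniformly at random (with replacement)
    # and average their heights: this is the sample mean M_n.
    sample = random.choices(population, k=n)
    return sum(sample) / n

# Two runs of the same procedure give different answers:
# the sample mean is a random variable, not a number.
m1, m2 = sample_mean(50), sample_mean(50)
print(true_mean, m1, m2)
```

Both runs land near the true mean, but they disagree with each other, which is exactly the sense in which the sample mean is random.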
00:02:15.010 --> 00:02:18.520
Now, we think that this is a
reasonable way of estimating
00:02:18.520 --> 00:02:19.900
the expectation.
00:02:19.900 --> 00:02:25.710
So in the limit as n goes to
infinity, it's plausible that
00:02:25.710 --> 00:02:29.170
the sample mean, the estimate
that we are constructing,
00:02:29.170 --> 00:02:33.790
should somehow get close
to the expected value.
00:02:33.790 --> 00:02:34.560
What does this mean?
00:02:34.560 --> 00:02:36.160
What does it mean
to get close?
00:02:36.160 --> 00:02:37.620
In what sense?
00:02:37.620 --> 00:02:39.440
And is this statement true?
00:02:39.440 --> 00:02:44.160
This is the kind of statement
that we deal with when dealing
00:02:44.160 --> 00:02:45.710
with limit theorems.
00:02:45.710 --> 00:02:49.500
That's the subject of limit
theorems: what happens if
00:02:49.500 --> 00:02:52.020
you're dealing with lots and
lots of random variables, and
00:02:52.020 --> 00:02:54.620
perhaps take averages
and so on.
00:02:54.620 --> 00:02:57.280
So why do we bother
about this?
00:02:57.280 --> 00:03:01.200
Well, if you're in the sampling
business, it would be
00:03:01.200 --> 00:03:04.870
reassuring to know that this
particular way of estimating
00:03:04.870 --> 00:03:06.880
the expected value
actually gets you
00:03:06.880 --> 00:03:08.850
close to the true answer.
00:03:08.850 --> 00:03:11.890
There's also a higher level
reason, which is a little more
00:03:11.890 --> 00:03:13.660
abstract and mathematical.
00:03:13.660 --> 00:03:17.110
So probability problems are easy
to deal with if you
00:03:17.110 --> 00:03:20.040
have in your hands one or
two random variables.
00:03:20.040 --> 00:03:23.520
You can write down their mass
functions, joint density
00:03:23.520 --> 00:03:24.930
functions, and so on.
00:03:24.930 --> 00:03:27.500
You can calculate on paper
or on a computer,
00:03:27.500 --> 00:03:29.430
you can get the answers.
00:03:29.430 --> 00:03:33.510
Probability problems become
computationally intractable if
00:03:33.510 --> 00:03:36.760
you're dealing, let's say, with
100 random variables and
00:03:36.760 --> 00:03:40.280
you're trying to get the exact
answers for anything.
00:03:40.280 --> 00:03:43.050
So in principle, the same
formulas that we have, they
00:03:43.050 --> 00:03:44.230
still apply.
00:03:44.230 --> 00:03:47.360
But they involve summations
over large ranges of
00:03:47.360 --> 00:03:48.830
combinations of indices.
00:03:48.830 --> 00:03:51.310
And that makes life extremely
difficult.
00:03:51.310 --> 00:03:55.100
But when you push the envelope
and you go to a situation
00:03:55.100 --> 00:03:58.480
where you're dealing with a
very, very large number of
00:03:58.480 --> 00:04:02.130
variables, then you can
start taking limits.
00:04:02.130 --> 00:04:05.200
And when you take limits,
wonderful things happen.
00:04:05.200 --> 00:04:08.030
Many formulas start simplifying,
and you can
00:04:08.030 --> 00:04:11.770
actually get useful answers by
considering those limits.
00:04:11.770 --> 00:04:15.450
And that's sort of the big
reason why looking at limit
00:04:15.450 --> 00:04:17.820
theorems is a useful
thing to do.
00:04:17.820 --> 00:04:20.990
So what we're going to do today,
first we're going to
00:04:20.990 --> 00:04:27.110
start with a useful, simple tool
that allows us to relate
00:04:27.110 --> 00:04:30.290
probabilities with
expected values.
00:04:30.290 --> 00:04:33.230
The Markov inequality is the
first inequality we're going
00:04:33.230 --> 00:04:33.840
to write down.
00:04:33.840 --> 00:04:37.650
And then using that, we're going
to get Chebyshev's
00:04:37.650 --> 00:04:39.760
inequality, a related
inequality.
00:04:39.760 --> 00:04:43.760
Then we need to define what
we mean by convergence when we
00:04:43.760 --> 00:04:45.270
talk about random variables.
00:04:45.270 --> 00:04:48.310
It's a notion that's a
generalization of the notion
00:04:48.310 --> 00:04:51.000
of the usual convergence
of limits of
00:04:51.000 --> 00:04:52.690
a sequence of numbers.
00:04:52.690 --> 00:04:55.710
And once we have our notion of
convergence, we're going to
00:04:55.710 --> 00:05:00.860
see that, indeed, the sample
mean converges to the true
00:05:00.860 --> 00:05:04.380
mean, converges to the expected
value of the X's.
00:05:04.380 --> 00:05:08.840
And this statement is called the
weak law of large numbers.
00:05:08.840 --> 00:05:11.650
The reason it's called the weak
law is because there's
00:05:11.650 --> 00:05:14.640
also a strong law, which is
a statement with the same
00:05:14.640 --> 00:05:16.570
flavor, but with a somewhat
different
00:05:16.570 --> 00:05:18.410
mathematical content.
00:05:18.410 --> 00:05:20.790
But it's a little more abstract,
and we will not be
00:05:20.790 --> 00:05:21.680
getting into this.
00:05:21.680 --> 00:05:26.070
So the weak law is all that
you're going to get.
00:05:26.070 --> 00:05:28.570
All right.
00:05:28.570 --> 00:05:31.050
So now we start our
digression.
00:05:31.050 --> 00:05:38.220
And our first tool will be the
so-called Markov inequality.
00:05:45.050 --> 00:05:48.040
So let's take a random variable
that's always
00:05:48.040 --> 00:05:48.870
non-negative.
00:05:48.870 --> 00:05:51.790
No matter what, it takes
no negative values.
00:05:51.790 --> 00:05:53.710
To keep things simple,
let's assume it's a
00:05:53.710 --> 00:05:55.500
discrete random variable.
00:05:55.500 --> 00:05:59.770
So the expected value is the sum
over all possible values
00:05:59.770 --> 00:06:01.460
that the random variable
can take.
00:06:04.440 --> 00:06:06.600
The values that the random
variable can take are
00:06:06.600 --> 00:06:10.850
weighted according to their
corresponding probabilities.
00:06:10.850 --> 00:06:13.700
Now, this is a sum
over all x's.
00:06:13.700 --> 00:06:16.640
But x takes non-negative
values.
00:06:16.640 --> 00:06:19.780
And the PMF is also
non-negative.
00:06:19.780 --> 00:06:24.310
So if I take a sum over fewer
things, I'm going to get a
00:06:24.310 --> 00:06:25.550
smaller value.
00:06:25.550 --> 00:06:29.180
So the sum when I add over
everything is less than or
00:06:29.180 --> 00:06:33.255
equal to the sum that I will get
if I only add those terms
00:06:33.255 --> 00:06:35.620
that are bigger than
a certain constant.
00:06:38.600 --> 00:06:45.140
Now, if I'm adding over x's that
are bigger than a, the x
00:06:45.140 --> 00:06:48.630
that shows up up there
will always be larger
00:06:48.630 --> 00:06:50.490
than or equal to a.
00:06:50.490 --> 00:06:52.370
So we get this inequality.
00:06:58.170 --> 00:06:59.980
And now, a is a constant.
00:06:59.980 --> 00:07:02.870
I can pull it outside
the summation.
00:07:02.870 --> 00:07:05.320
And then I'm left with the
probabilities of all the x's
00:07:05.320 --> 00:07:06.990
that are bigger than a.
00:07:06.990 --> 00:07:08.850
And that's just the
probability of
00:07:08.850 --> 00:07:10.250
being bigger than a.
00:07:15.540 --> 00:07:18.140
OK, so that's the Markov
inequality.
00:07:18.140 --> 00:07:23.800
It basically tells us that the
expected value is larger than
00:07:23.800 --> 00:07:26.240
or equal to this number.
00:07:26.240 --> 00:07:30.260
It relates expected values
to probabilities.
00:07:30.260 --> 00:07:34.660
It tells us that if the expected
value is small, then
00:07:34.660 --> 00:07:39.250
the probability that x is big
is also going to be small.
00:07:39.250 --> 00:07:42.240
So it translates a statement
about smallness of expected
00:07:42.240 --> 00:07:46.205
values to a statement about
smallness of probabilities.
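The chain of inequalities just described can be checked numerically in a few lines of Python. The PMF below is made up purely for illustration; any non-negative discrete random variable would do.

```python
# Numerical check of the Markov inequality for a non-negative
# discrete random variable: E[X] >= a * P(X >= a).
# The PMF here is made up purely for illustration.
pmf = {0: 0.3, 1: 0.4, 2: 0.2, 5: 0.1}

# E[X]: each value weighted by its probability.
expected_value = sum(x * p for x, p in pmf.items())

def markov_bound_holds(a):
    # P(X >= a): total probability of the values at least a.
    tail = sum(p for x, p in pmf.items() if x >= a)
    return expected_value >= a * tail

checks = [markov_bound_holds(a) for a in (1, 2, 3, 5)]
print(expected_value, checks)
```

Dropping the terms below a, and then replacing each remaining x by a, only ever shrinks the sum, which is why every check comes out true.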
00:07:49.020 --> 00:07:49.930
OK.
00:07:49.930 --> 00:07:54.210
What we actually need is a
somewhat different version of
00:07:54.210 --> 00:07:57.240
this same statement.
00:07:57.240 --> 00:08:03.010
And what we're going to do is to
apply this inequality to a
00:08:03.010 --> 00:08:08.150
non-negative random variable
of a special type.
00:08:08.150 --> 00:08:13.330
And you can think of applying
this same calculation to a
00:08:13.330 --> 00:08:18.800
random variable of this form, (X
minus mu)-squared, where mu
00:08:18.800 --> 00:08:21.870
is the expected value of X.
00:08:21.870 --> 00:08:24.075
Now, this is a non-negative
random variable.
00:08:35.419 --> 00:08:37.919
So, the expected value of this
random variable, which is the
00:08:37.919 --> 00:08:42.220
variance, by following the same
thinking as we had in
00:08:42.220 --> 00:08:52.880
that derivation up to there, is
bigger than the probability
00:08:52.880 --> 00:08:58.210
that this random variable
is bigger than some--
00:08:58.210 --> 00:09:04.760
let me use a-squared
instead of an a--
00:09:04.760 --> 00:09:06.585
times the value a-squared.
00:09:12.420 --> 00:09:16.310
So now of course, this
probability is the same as the
00:09:16.310 --> 00:09:23.440
probability that the absolute
value of X minus mu is bigger
00:09:23.440 --> 00:09:27.190
than a, times a-squared.
00:09:27.190 --> 00:09:34.860
And this side is equal to the
variance of X. So this relates
00:09:34.860 --> 00:09:40.890
the variance of X to the
probability that our random
00:09:40.890 --> 00:09:45.200
variable is far away
from its mean.
00:09:45.200 --> 00:09:50.590
If the variance is small, then
it means that the probability
00:09:50.590 --> 00:09:54.635
of being far away from the
mean is also small.
00:09:57.240 --> 00:10:02.220
So I derived this by applying
the Markov inequality to this
00:10:02.220 --> 00:10:04.950
particular non-negative
random variable.
00:10:04.950 --> 00:10:09.500
Or just to reinforce, perhaps,
the message, and increase your
00:10:09.500 --> 00:10:13.450
confidence in this inequality,
let's just look at the
00:10:13.450 --> 00:10:16.980
derivation once more, where I'm
going, here, to start from
00:10:16.980 --> 00:10:20.890
first principles, but use the
same idea as the one that was
00:10:20.890 --> 00:10:23.480
used in the proof out here.
00:10:23.480 --> 00:10:23.685
OK.
00:10:23.685 --> 00:10:26.920
So just for variety, now let's
think of X as being a
00:10:26.920 --> 00:10:28.760
continuous random variable.
00:10:28.760 --> 00:10:31.520
The derivation is the same
whether it's discrete or
00:10:31.520 --> 00:10:32.510
continuous.
00:10:32.510 --> 00:10:35.990
So by definition, the variance
is the integral, is this
00:10:35.990 --> 00:10:38.130
particular integral.
00:10:38.130 --> 00:10:43.920
Now, the integral is going to
become smaller if,
00:10:43.920 --> 00:10:47.130
instead of integrating over
the full range, I only
00:10:47.130 --> 00:10:51.070
integrate over x's that are
far away from the mean.
00:10:51.070 --> 00:10:52.700
So mu is the mean.
00:10:52.700 --> 00:10:54.345
Think of c as some big number.
00:10:59.670 --> 00:11:02.210
These are x's that are far
away from the mean to the
00:11:02.210 --> 00:11:05.410
left, from minus infinity
to mu minus c.
00:11:05.410 --> 00:11:09.030
And these are the x's that are
far away from the mean on the
00:11:09.030 --> 00:11:11.210
positive side.
00:11:11.210 --> 00:11:13.420
So by integrating over
fewer stuff, I'm
00:11:13.420 --> 00:11:15.580
getting a smaller integral.
00:11:15.580 --> 00:11:21.970
Now, for any x in this range,
this distance, x minus mu, is
00:11:21.970 --> 00:11:23.220
at least c.
00:11:23.220 --> 00:11:26.320
So that squared is at
least c squared.
00:11:26.320 --> 00:11:28.910
So this term over this
range of integration
00:11:28.910 --> 00:11:30.520
is at least c squared.
00:11:30.520 --> 00:11:33.250
So I can take it outside
the integral.
00:11:33.250 --> 00:11:36.400
And I'm left just with the
integral of the density.
00:11:36.400 --> 00:11:38.480
Same thing on the other side.
00:11:38.480 --> 00:11:41.770
And so what factors out is
this term c squared.
00:11:41.770 --> 00:11:45.360
And inside, we're left with the
probability of being to
00:11:45.360 --> 00:11:49.060
the left of mu minus c, and then
the probability of being
00:11:49.060 --> 00:11:52.310
to the right of mu plus c,
which is the same as the
00:11:52.310 --> 00:11:55.370
probability that the absolute
value of the distance from the
00:11:55.370 --> 00:11:58.770
mean is larger than
or equal to c.
00:11:58.770 --> 00:12:04.820
So that's the same inequality
that we proved there, except
00:12:04.820 --> 00:12:06.060
that here I'm using c.
00:12:06.060 --> 00:12:10.530
There I used a, but it's
exactly the same one.
00:12:10.530 --> 00:12:12.960
This inequality is maybe easier
to understand if you
00:12:12.960 --> 00:12:16.790
take that term and send it
to the other side and
00:12:16.790 --> 00:12:18.780
write it in this form.
00:12:18.780 --> 00:12:20.010
What does it tell us?
00:12:20.010 --> 00:12:25.750
It tells us that if c is a big
number, then the
00:12:25.750 --> 00:12:30.750
probability of being more than
c away from the mean is going
00:12:30.750 --> 00:12:32.330
to be a small number.
00:12:32.330 --> 00:12:34.780
When c is big, this is small.
00:12:34.780 --> 00:12:35.970
Now, this is intuitive.
00:12:35.970 --> 00:12:38.290
The variance is a measure
of the spread of the
00:12:38.290 --> 00:12:40.960
distribution, how wide it is.
00:12:40.960 --> 00:12:43.960
It tells us that if the
variance is small, the
00:12:43.960 --> 00:12:46.320
distribution is not very wide.
00:12:46.320 --> 00:12:49.020
And mathematically, this
translates to this statement
00:12:49.020 --> 00:12:52.360
that when the variance is small,
the probability of
00:12:52.360 --> 00:12:54.880
being far away is going
to be small.
00:12:54.880 --> 00:12:58.370
And the further away you're
looking, that is, if c is a
00:12:58.370 --> 00:13:00.330
bigger number, that probability
00:13:00.330 --> 00:13:01.765
also becomes small.
00:13:04.930 --> 00:13:07.880
Maybe an even more intuitive way
to think about the content
00:13:07.880 --> 00:13:13.230
of this inequality is to,
instead of c, use the number
00:13:13.230 --> 00:13:16.910
k sigma, where k is positive
and sigma is
00:13:16.910 --> 00:13:18.530
the standard deviation.
00:13:18.530 --> 00:13:22.670
So let's just plug k sigma
in the place of c.
00:13:22.670 --> 00:13:25.300
So this becomes k
sigma squared.
00:13:25.300 --> 00:13:27.130
These sigma squared's cancel.
00:13:27.130 --> 00:13:29.770
We're left with 1
over k-squared.
00:13:29.770 --> 00:13:31.690
Now, what is this?
00:13:31.690 --> 00:13:36.260
This is the event that you are
k standard deviations away
00:13:36.260 --> 00:13:37.770
from the mean.
00:13:37.770 --> 00:13:40.600
So for example, this statement
here tells you that if you
00:13:40.600 --> 00:13:44.900
look at the test scores from a
quiz, what fraction of the
00:13:44.900 --> 00:13:49.900
class are 3 standard deviations
away from the mean?
00:13:49.900 --> 00:13:53.000
It's possible, but it's not
going to be a lot of people.
00:13:53.000 --> 00:13:57.930
It's going to be at most 1/9
of the class that can be 3
00:13:57.930 --> 00:14:02.190
standard deviations or more
away from the mean.
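The k-standard-deviation form of the bound, P(|X - mu| >= k sigma) <= 1/k-squared, can be checked by simulation. The exponential distribution below is an assumed example chosen only because its mean and standard deviation are easy to state; the bound itself holds for any distribution with finite variance.

```python
import random

random.seed(1)

# Chebyshev: P(|X - mu| >= k*sigma) <= 1 / k**2 for any
# distribution with finite variance. Checked here on an
# exponential(1) distribution, which has mean 1 and std 1.
mu, sigma = 1.0, 1.0
samples = [random.expovariate(1.0) for _ in range(200000)]

def tail_fraction(k):
    # Empirical probability of being k standard deviations
    # or more away from the mean.
    far = sum(1 for x in samples if abs(x - mu) >= k * sigma)
    return far / len(samples)

# At k = 3, at most 1/9 of the probability can be that far out;
# for this particular distribution the true value is much smaller.
for k in (2, 3, 4):
    print(k, tail_fraction(k), 1 / k**2)
```

The bound is often loose, as here, but its strength is that it needs nothing beyond a finite mean and variance.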
00:14:02.190 --> 00:14:05.250
So the Chebyshev inequality
is a really useful one.
00:14:07.860 --> 00:14:11.300
It comes in handy whenever you
want to relate probabilities
00:14:11.300 --> 00:14:12.800
and expected values.
00:14:12.800 --> 00:14:16.390
So if you know that your
expected value or, in
00:14:16.390 --> 00:14:19.260
particular, your variance
is small, this tells you
00:14:19.260 --> 00:14:23.080
something about tail
probabilities.
00:14:23.080 --> 00:14:25.530
So this is the end of our
first digression.
00:14:25.530 --> 00:14:28.320
We have this inequality
in our hands.
00:14:28.320 --> 00:14:31.170
Our second digression is
to talk about limits.
00:14:34.680 --> 00:14:37.190
We want to eventually talk
about limits of random
00:14:37.190 --> 00:14:39.750
variables, but as a warm up,
we're going to start with
00:14:39.750 --> 00:14:42.440
limits of sequences.
00:14:42.440 --> 00:14:47.670
So you're given a sequence
of numbers, a1,
00:14:47.670 --> 00:14:50.500
a2, a3, and so on.
00:14:50.500 --> 00:14:54.160
And we want to define the
notion that a sequence
00:14:54.160 --> 00:14:56.470
converges to a number.
00:14:56.470 --> 00:15:04.710
You sort of know what this
means, but let's just go
00:15:04.710 --> 00:15:06.510
through it some more.
00:15:06.510 --> 00:15:09.890
So here's a.
00:15:09.890 --> 00:15:16.200
We have our sequence of
values as n increases.
00:15:16.200 --> 00:15:20.290
What we mean by the sequence
converging to a is
00:15:20.290 --> 00:15:23.550
that when you look at those
values, they get closer and
00:15:23.550 --> 00:15:25.140
closer to a.
00:15:25.140 --> 00:15:29.570
So this value here is your
typical a sub n.
00:15:29.570 --> 00:15:33.880
They get closer and closer to
a, and they stay closer.
00:15:33.880 --> 00:15:36.860
So let's try to make
that more precise.
00:15:36.860 --> 00:15:40.750
What it means is let's
fix a sense of what
00:15:40.750 --> 00:15:42.250
it means to be close.
00:15:42.250 --> 00:15:47.540
Let me look at an interval that
goes from a - epsilon to
00:15:47.540 --> 00:15:50.340
a + epsilon.
00:15:50.340 --> 00:15:57.280
Then if my sequence converges
to a, this means that as n
00:15:57.280 --> 00:16:02.810
increases, eventually the values
of the sequence that I
00:16:02.810 --> 00:16:06.420
get stay inside this band.
00:16:06.420 --> 00:16:10.430
Since they converge to a, this
means that eventually they
00:16:10.430 --> 00:16:14.130
will be smaller than
a + epsilon and
00:16:14.130 --> 00:16:16.310
bigger than a - epsilon.
00:16:16.310 --> 00:16:21.320
So convergence means that
given a band of positive
00:16:21.320 --> 00:16:25.690
length around the number a,
the values of the sequence
00:16:25.690 --> 00:16:28.720
that you get eventually
get inside and
00:16:28.720 --> 00:16:31.300
stay inside that band.
00:16:31.300 --> 00:16:34.060
So that's sort of the picture
definition of
00:16:34.060 --> 00:16:35.840
what convergence means.
00:16:35.840 --> 00:16:40.460
So now let's translate this into
a mathematical statement.
00:16:40.460 --> 00:16:45.610
Given a band of positive length,
no matter how wide
00:16:45.610 --> 00:16:50.690
that band is or how narrow it
is, so for every epsilon
00:16:50.690 --> 00:16:56.500
positive, eventually the
sequence gets inside the band.
00:16:56.500 --> 00:16:58.460
What does eventually mean?
00:16:58.460 --> 00:17:01.410
There exists a time,
so that after that
00:17:01.410 --> 00:17:03.510
time something happens.
00:17:03.510 --> 00:17:07.230
And the something that happens
is that after that time, we
00:17:07.230 --> 00:17:09.520
are inside that band.
00:17:09.520 --> 00:17:12.060
So this is a formal mathematical
definition, which
00:17:12.060 --> 00:17:17.250
actually translates what I was
telling in the wordy way
00:17:17.250 --> 00:17:20.140
before, and showing in
terms of the picture.
00:17:20.140 --> 00:17:25.140
Given a certain band, even if
it's narrow, eventually, after
00:17:25.140 --> 00:17:28.520
a certain time n0, the values
of the sequence are going to
00:17:28.520 --> 00:17:30.240
stay inside this band.
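The epsilon / n0 definition can be made concrete on a small example. The sequence below, a_n = 2 + (-1)^n / n, is made up for illustration; it converges to a = 2, and a narrower band simply forces a larger n0.

```python
import math

# The epsilon / n0 definition of convergence, on the illustrative
# sequence a_n = 2 + (-1)**n / n, which converges to a = 2.
def a(n):
    return 2 + (-1)**n / n

def find_n0(epsilon):
    # Since |a_n - 2| = 1/n, the sequence is inside the band
    # (2 - eps, 2 + eps) for every n > 1/eps, so n0 = ceil(1/eps).
    return math.ceil(1 / epsilon)

# Check: after time n0, the sequence stays inside the band.
for eps in (0.1, 0.01, 0.001):
    n0 = find_n0(eps)
    inside = all(abs(a(n) - 2) < eps for n in range(n0 + 1, n0 + 2000))
    print(eps, n0, inside)
```

Shrinking epsilon from 0.1 to 0.001 raises the required waiting time n0 from 10 to 1000, which is the "you may have to wait longer" point made above.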
00:17:30.240 --> 00:17:35.770
Now, if I were to take epsilon
to be very small, this thing
00:17:35.770 --> 00:17:38.130
would still be true that
eventually I'm going to get
00:17:38.130 --> 00:17:42.400
inside of the band, except that
I may have to wait longer
00:17:42.400 --> 00:17:45.770
for the values to
get inside here.
00:17:45.770 --> 00:17:48.400
All right, that's what it means
for a deterministic
00:17:48.400 --> 00:17:51.350
sequence to converge
to something.
00:17:51.350 --> 00:17:54.150
Now, how about random
variables.
00:17:54.150 --> 00:17:57.340
What does it mean for a sequence
of random variables
00:17:57.340 --> 00:18:00.280
to converge to a number?
00:18:00.280 --> 00:18:02.600
We're just going to twist
a little bit of the word
00:18:02.600 --> 00:18:03.310
definition.
00:18:03.310 --> 00:18:08.390
For numbers, we said that
eventually the numbers get
00:18:08.390 --> 00:18:10.180
inside that band.
00:18:10.180 --> 00:18:13.270
But if instead of numbers we
have random variables with a
00:18:13.270 --> 00:18:18.080
certain distribution, so here
instead of a_n we're dealing
00:18:18.080 --> 00:18:20.750
with a random variable that has
a distribution, let's say,
00:18:20.750 --> 00:18:26.650
of this kind, what we want is
that this distribution gets
00:18:26.650 --> 00:18:31.460
inside this band, so it gets
concentrated inside here.
00:18:31.460 --> 00:18:33.150
What does it mean that
the distribution
00:18:33.150 --> 00:18:34.850
gets inside this band?
00:18:34.850 --> 00:18:36.910
I mean a random variable
has a distribution.
00:18:36.910 --> 00:18:40.130
It may have some tails, so
maybe not the entire
00:18:40.130 --> 00:18:43.920
distribution gets concentrated
inside of the band.
00:18:43.920 --> 00:18:48.660
But we want that more and more
of this distribution is
00:18:48.660 --> 00:18:50.820
concentrated in this band.
00:18:50.820 --> 00:18:51.730
So that --
00:18:51.730 --> 00:18:53.130
in a sense that --
00:18:53.130 --> 00:18:57.070
the probability of falling
outside the band converges to
00:18:57.070 --> 00:19:00.410
0 -- becomes smaller
and smaller.
00:19:00.410 --> 00:19:05.660
So in words, we're going to say
that the sequence of random
00:19:05.660 --> 00:19:09.070
variables or a sequence of
probability distributions,
00:19:09.070 --> 00:19:12.060
that would be the same,
converges to a particular
00:19:12.060 --> 00:19:15.070
number a if the following
is true.
00:19:15.070 --> 00:19:22.320
If I consider a small band
around a, then the probability
00:19:22.320 --> 00:19:26.300
that my random variable falls
outside this band, which is
00:19:26.300 --> 00:19:29.530
the area under this curve,
this probability becomes
00:19:29.530 --> 00:19:32.620
smaller and smaller as
n goes to infinity.
00:19:32.620 --> 00:19:35.370
The probability of being
outside this band
00:19:35.370 --> 00:19:38.570
converges to 0.
00:19:38.570 --> 00:19:40.620
So that's the intuitive idea.
00:19:40.620 --> 00:19:45.080
So in the beginning, maybe our
distribution is sitting
00:19:45.080 --> 00:19:46.590
everywhere.
00:19:46.590 --> 00:19:49.490
As n increases, the distribution
starts to get
00:19:49.490 --> 00:19:51.560
concentrated inside the band.
00:19:51.560 --> 00:19:57.300
When n is even bigger, our
distribution is even more
00:19:57.300 --> 00:20:00.310
inside that band, so that these
outside probabilities
00:20:00.310 --> 00:20:02.460
become smaller and smaller.
00:20:02.460 --> 00:20:03.860
So the corresponding
mathematical
00:20:03.860 --> 00:20:06.760
statement is the following.
00:20:06.760 --> 00:20:13.730
I fix a band around
a, a +/- epsilon.
00:20:13.730 --> 00:20:18.170
Given that band, the probability
of falling outside
00:20:18.170 --> 00:20:21.350
this band, this probability
converges to 0.
00:20:21.350 --> 00:20:23.600
Or another way to say it is
that the limit of this
00:20:23.600 --> 00:20:26.560
probability is equal to 0.
00:20:26.560 --> 00:20:29.720
If you were to translate this
into a complete mathematical
00:20:29.720 --> 00:20:31.800
statement, you would have
to write down the
00:20:31.800 --> 00:20:34.150
following messy thing.
00:20:34.150 --> 00:20:37.220
For every epsilon positive --
00:20:37.220 --> 00:20:39.480
that's this statement --
00:20:39.480 --> 00:20:41.240
the limit is 0.
00:20:41.240 --> 00:20:44.610
What does it mean that the
limit of something is 0?
00:20:44.610 --> 00:20:47.670
We flip back to the
previous slide.
00:20:47.670 --> 00:20:48.110
Why?
00:20:48.110 --> 00:20:51.430
Because a probability
is a number.
00:20:51.430 --> 00:20:54.720
So here we're talking about
a sequence of numbers
00:20:54.720 --> 00:20:56.340
converging to 0.
00:20:56.340 --> 00:20:58.190
What does it mean for a
sequence of numbers to
00:20:58.190 --> 00:20:59.180
converge to 0?
00:20:59.180 --> 00:21:05.320
It means that for any epsilon
prime positive, there exists
00:21:05.320 --> 00:21:11.230
some n0 such that for every
n bigger than n0 the
00:21:11.230 --> 00:21:12.770
following is true --
00:21:12.770 --> 00:21:16.450
that this probability
is less than or
00:21:16.450 --> 00:21:17.860
equal to epsilon prime.
00:21:20.610 --> 00:21:27.660
So the mathematical statement
is a little hard to parse.
00:21:27.660 --> 00:21:32.270
For every size of that band,
you then take the
00:21:32.270 --> 00:21:34.990
definition of what it means for
the limit of a sequence of
00:21:34.990 --> 00:21:37.720
numbers to converge to 0.
00:21:37.720 --> 00:21:42.340
But it's a lot easier to
describe this in words and,
00:21:42.340 --> 00:21:45.010
basically, think in terms
of this picture.
00:21:45.010 --> 00:21:48.690
That as n increases, the
probability of falling outside
00:21:48.690 --> 00:21:51.305
those bands just becomes
smaller and smaller.
00:21:51.305 --> 00:21:56.590
So the statement is that our
distribution gets concentrated
00:21:56.590 --> 00:22:01.340
in arbitrarily narrow little
bands around that
00:22:01.340 --> 00:22:05.050
particular number a.
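For a continuous example of this definition, the tail probability can be computed exactly rather than pictured. The distribution below is an assumed illustration: X_n normal with mean a and standard deviation 1/n, so that for every fixed epsilon the probability of falling outside the band shrinks to 0 as n grows.

```python
import math

# Convergence in probability on an assumed example:
# X_n ~ Normal(a, (1/n)^2). For any fixed epsilon > 0,
# P(|X_n - a| >= epsilon) = 2 * (1 - Phi(epsilon * n)),
# where Phi is the standard normal CDF.
def tail_probability(n, epsilon):
    phi = 0.5 * (1 + math.erf(epsilon * n / math.sqrt(2)))
    return 2 * (1 - phi)

eps = 0.5
tails = [tail_probability(n, eps) for n in (1, 2, 5, 10)]
print(tails)  # shrinks toward 0 as n grows
```

Note that the whole distribution never sits entirely inside the band; the tails are just squeezed toward probability 0, which is all the definition asks for.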
00:22:05.050 --> 00:22:05.350
OK.
00:22:05.350 --> 00:22:07.790
So let's look at an example.
00:22:07.790 --> 00:22:11.660
Suppose a random variable Yn has
a discrete distribution of
00:22:11.660 --> 00:22:13.720
this particular type.
00:22:13.720 --> 00:22:17.150
Does it converge to something?
00:22:17.150 --> 00:22:19.570
Well, the probability
distribution of this random
00:22:19.570 --> 00:22:22.370
variable gets concentrated
at 0 --
00:22:22.370 --> 00:22:26.520
there's more and more
probability of being at 0.
00:22:26.520 --> 00:22:29.710
If I fix a band around 0 --
00:22:29.710 --> 00:22:34.850
so if I take the band from minus
epsilon to epsilon and
00:22:34.850 --> 00:22:36.520
look at that band--
00:22:36.520 --> 00:22:42.350
the probability of falling
outside this band is 1/n.
00:22:42.350 --> 00:22:45.780
As n goes to infinity, that
probability goes to 0.
00:22:45.780 --> 00:22:50.550
So in this case, we do
have convergence.
00:22:50.550 --> 00:22:56.780
And Yn converges in probability
to the number 0.
00:22:56.780 --> 00:23:00.310
So this just captures the
fact, obvious from this
00:23:00.310 --> 00:23:03.680
picture, that more and more of
our probability distribution
00:23:03.680 --> 00:23:07.630
gets concentrated around 0,
as n goes to infinity.
00:23:07.630 --> 00:23:10.330
Now, an interesting thing to
notice is the following, that
00:23:10.330 --> 00:23:15.390
even though Yn converges to 0,
if you were to write down the
00:23:15.390 --> 00:23:20.440
expected value for Yn,
what would it be?
00:23:20.440 --> 00:23:24.410
It's going to be n times the
probability of this value,
00:23:24.410 --> 00:23:26.240
which is 1/n.
00:23:26.240 --> 00:23:29.230
So the expected value
turns out to be 1.
00:23:29.230 --> 00:23:34.300
And if you were to look at the
expected value of Yn-squared,
00:23:34.300 --> 00:23:38.190
this would be 0
00:23:38.190 --> 00:23:41.770
times this probability, and
then n-squared times this
00:23:41.770 --> 00:23:45.720
probability, which
is equal to n.
00:23:45.720 --> 00:23:49.850
And this actually goes
to infinity.
00:23:49.850 --> 00:23:53.580
So we have this, perhaps,
strange situation where a
00:23:53.580 --> 00:23:58.030
random variable goes to 0, but
the expected value of this
00:23:58.030 --> 00:24:01.140
random variable does
not go to 0.
00:24:01.140 --> 00:24:04.570
And the second moment of that
random variable actually goes
00:24:04.570 --> 00:24:05.790
to infinity.
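The example just worked out on the board can be written down exactly: P(Yn = 0) = 1 - 1/n and P(Yn = n) = 1/n, so the tail probability vanishes while the mean stays at 1 and the second moment grows without bound.

```python
# The lecture's example: P(Y_n = 0) = 1 - 1/n, P(Y_n = n) = 1/n.
# Y_n converges to 0 in probability, yet its mean stays at 1
# and its second moment blows up.
def tail_prob(n, epsilon):
    # P(|Y_n| >= epsilon) for 0 < epsilon: only the value n
    # contributes, as long as epsilon <= n.
    return 1 / n if epsilon <= n else 0.0

def mean(n):
    return 0 * (1 - 1 / n) + n * (1 / n)        # = 1 for every n

def second_moment(n):
    return 0**2 * (1 - 1 / n) + n**2 * (1 / n)  # = n, grows without bound

print([tail_prob(n, 0.5) for n in (10, 100, 1000)])
print(mean(1000), second_moment(1000))
```

The small tail probability 1/n sits at the ever more distant value n, which is exactly how it keeps a disproportionate grip on the moments.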
00:24:05.790 --> 00:24:08.740
So this tells us that
convergence in probability
00:24:08.740 --> 00:24:11.380
tells you something,
but it doesn't tell
00:24:11.380 --> 00:24:13.310
you the whole story.
00:24:13.310 --> 00:24:17.260
Convergence to 0 of a random
variable doesn't imply
00:24:17.260 --> 00:24:20.630
anything about convergence
of expected values or of
00:24:20.630 --> 00:24:23.420
variances and so on.
00:24:23.420 --> 00:24:26.060
So the reason is that
convergence in probability
00:24:26.060 --> 00:24:28.470
tells you that this
tail probability
00:24:28.470 --> 00:24:30.400
here is very small.
00:24:30.400 --> 00:24:34.440
But it doesn't tell you how
far does this tail go.
00:24:34.440 --> 00:24:39.390
As in this example, the tail
probability is small, but that
00:24:39.390 --> 00:24:43.410
tail sits far away, so it
gives a disproportionate
00:24:43.410 --> 00:24:45.950
contribution to the expected
value or the
00:24:45.950 --> 00:24:47.200
expected value of the square.
00:24:53.340 --> 00:24:53.650
OK.
00:24:53.650 --> 00:24:59.000
So now we've got everything that
we need to go back to the
00:24:59.000 --> 00:25:02.900
sample mean and study
its properties.
00:25:02.900 --> 00:25:05.460
So the setup is
that we have a
00:25:05.460 --> 00:25:07.320
sequence of random variables.
00:25:07.320 --> 00:25:08.350
They're independent.
00:25:08.350 --> 00:25:10.450
They have the same
distribution.
00:25:10.450 --> 00:25:12.790
And we assume that they
have a finite mean
00:25:12.790 --> 00:25:14.480
and a finite variance.
00:25:14.480 --> 00:25:18.430
We're looking at the
sample mean.
00:25:18.430 --> 00:25:21.670
Now in principle, you can
calculate the probability
00:25:21.670 --> 00:25:25.090
distribution of the sample mean,
because we know how to
00:25:25.090 --> 00:25:26.950
find the distributions
of sums of
00:25:26.950 --> 00:25:28.320
independent random variables.
00:25:28.320 --> 00:25:31.030
You use the convolution
formula over and over.
00:25:31.030 --> 00:25:32.870
But this is pretty
complicated, so
00:25:32.870 --> 00:25:34.730
let's not look at that.
00:25:34.730 --> 00:25:38.920
Let's just look at expected
values, variances, and the
00:25:38.920 --> 00:25:42.610
probabilities that the sample
mean is far away
00:25:42.610 --> 00:25:44.310
from the true mean.
00:25:44.310 --> 00:25:47.470
So what is the expected value
of this random variable?
00:25:47.470 --> 00:25:51.260
The expected value of a sum of
random variables is the sum of
00:25:51.260 --> 00:25:52.510
the expected values.
00:25:56.320 --> 00:26:00.320
And then we have this factor
of n in the denominator.
00:26:00.320 --> 00:26:07.040
Each one of these expected
values is mu, so we get mu.
00:26:07.040 --> 00:26:13.960
So the sample mean, the average
value of this Mn in
00:26:13.960 --> 00:26:18.570
expectation is the same as
the true mean inside our
00:26:18.570 --> 00:26:20.620
population.
00:26:20.620 --> 00:26:26.560
Now here, this is a fine
conceptual point: there are two
00:26:26.560 --> 00:26:29.920
kinds of averages involved
when you write down this
00:26:29.920 --> 00:26:31.280
expression.
00:26:31.280 --> 00:26:33.310
We understand that
expectations are
00:26:33.310 --> 00:26:36.470
some kind of average.
00:26:36.470 --> 00:26:40.250
The sample mean is also an
average over the values that
00:26:40.250 --> 00:26:42.240
we have observed.
00:26:42.240 --> 00:26:45.220
But it's two different
kinds of averages.
00:26:45.220 --> 00:26:50.460
The sample mean is the average
of the heights of the penguins
00:26:50.460 --> 00:26:54.330
that we collected over
a single expedition.
00:26:54.330 --> 00:26:59.600
The expected value is to be
thought of as follows, my
00:26:59.600 --> 00:27:02.060
probabilistic experiment
is one expedition
00:27:02.060 --> 00:27:04.160
to the South Pole.
00:27:04.160 --> 00:27:09.760
Expected value here means
thinking on the average over a
00:27:09.760 --> 00:27:12.620
huge number of expeditions.
00:27:12.620 --> 00:27:16.270
So my expedition is a random
experiment, I collect random
00:27:16.270 --> 00:27:18.520
samples, and I record Mn.
00:27:21.230 --> 00:27:27.170
The average result of an
expedition is what we would
00:27:27.170 --> 00:27:31.060
get if we were to carry out
a zillion expeditions and
00:27:31.060 --> 00:27:35.050
average the averages that we
get at each particular
00:27:35.050 --> 00:27:36.090
expedition.
00:27:36.090 --> 00:27:39.860
So this Mn is the average during
a single expedition.
00:27:39.860 --> 00:27:44.090
This expectation is the average
over an imagined
00:27:44.090 --> 00:27:46.125
infinite sequence
of expeditions.
00:27:49.760 --> 00:27:52.830
And of course, the other thing
to always keep in mind is that
00:27:52.830 --> 00:27:56.910
expectations give you numbers,
whereas the sample mean is
00:27:56.910 --> 00:28:00.210
actually a random variable.
00:28:00.210 --> 00:28:00.486
All right.
00:28:00.486 --> 00:28:03.310
So this random variable,
how random is it?
00:28:03.310 --> 00:28:05.610
How big is its variance?
00:28:05.610 --> 00:28:10.040
So the variance of a sum of
random variables is the sum of
00:28:10.040 --> 00:28:12.370
the variances.
00:28:12.370 --> 00:28:16.610
But since we're dividing by n,
when you calculate variances
00:28:16.610 --> 00:28:19.580
this brings in a factor
of n-squared.
00:28:19.580 --> 00:28:21.215
So the variance is sigma-squared
over n.
00:28:24.340 --> 00:28:26.870
And in particular, the variance
of the sample mean
00:28:26.870 --> 00:28:28.620
becomes smaller and smaller.
00:28:28.620 --> 00:28:31.170
It means that when you estimate
that average height
00:28:31.170 --> 00:28:34.570
of penguins, if you take a
large sample, then your
00:28:34.570 --> 00:28:37.530
estimate is not going
to be too random.
00:28:37.530 --> 00:28:41.120
The randomness in your estimates
become small if you
00:28:41.120 --> 00:28:43.250
have a large sample size.
00:28:43.250 --> 00:28:46.090
Having a large sample size kind
of removes the randomness
00:28:46.090 --> 00:28:47.930
from your experiment.
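This shrinking randomness is easy to see numerically. A sketch under an assumed model (fair Bernoulli samples, so sigma-squared = 1/4); the lecture's penguin heights could have any distribution with finite variance, and the same 1/sqrt(n) shrinkage would apply:

```python
import random

def sample_mean_spread(n, trials=2000, seed=0):
    # Empirical standard deviation of the sample mean of n Bernoulli(1/2)
    # samples; theory predicts sigma/sqrt(n) = 1/(2*sqrt(n)).
    rng = random.Random(seed)
    means = [sum(rng.random() < 0.5 for _ in range(n)) / n
             for _ in range(trials)]
    m = sum(means) / trials
    var = sum((x - m) ** 2 for x in means) / trials
    return var ** 0.5

print(sample_mean_spread(100))  # roughly 0.05 = 1/(2*sqrt(100))
print(sample_mean_spread(400))  # roughly half that: quadruple n, halve the spread
```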
00:28:47.930 --> 00:28:52.690
Now let's apply the Chebyshev
inequality to say something
00:28:52.690 --> 00:28:56.020
about tail probabilities
for the sample mean.
00:28:56.020 --> 00:28:59.610
The probability that you are
more than epsilon away from
00:28:59.610 --> 00:29:03.650
the true mean is less than or
equal to the variance of this
00:29:03.650 --> 00:29:07.030
quantity divided by this
number squared.
00:29:07.030 --> 00:29:09.860
So that's just the translation
of the Chebyshev inequality to
00:29:09.860 --> 00:29:12.320
the particular context
we've got here.
00:29:12.320 --> 00:29:13.590
We found the variance.
00:29:13.590 --> 00:29:15.100
It's sigma-squared over n.
00:29:15.100 --> 00:29:18.340
So we end up with
this expression.
00:29:18.340 --> 00:29:20.490
So what does this
expression do?
00:29:25.570 --> 00:29:32.370
For any given epsilon, if
I fix epsilon, then this
00:29:32.370 --> 00:29:36.630
probability, which is less
than sigma-squared over n
00:29:36.630 --> 00:29:40.550
epsilon-squared, converges to
0 as n goes to infinity.
00:29:44.730 --> 00:29:48.050
And this is just the definition
of convergence in
00:29:48.050 --> 00:29:49.690
probability.
00:29:49.690 --> 00:29:54.310
If this happens, that the
probability of being more than
00:29:54.310 --> 00:29:57.590
epsilon away from the mean, that
probability goes to 0,
00:29:57.590 --> 00:30:01.510
and this is true no matter how
I choose my epsilon, then by
00:30:01.510 --> 00:30:04.490
definition we have convergence
in probability.
00:30:04.490 --> 00:30:08.050
So we have proved that the
sample mean converges in
00:30:08.050 --> 00:30:11.430
probability to the true mean.
00:30:11.430 --> 00:30:16.210
And this is what the weak law
of large numbers tells us.
00:30:16.210 --> 00:30:21.060
So in some vague sense, it
tells us that the sample
00:30:21.060 --> 00:30:24.350
means, when you take the
average of many, many
00:30:24.350 --> 00:30:28.150
measurements in your sample,
then the sample mean is a good
00:30:28.150 --> 00:30:31.870
estimate of the true mean in the
sense that it approaches
00:30:31.870 --> 00:30:36.380
the true mean as your sample
size increases.
00:30:36.380 --> 00:30:39.220
It approaches the true mean,
but of course in a very
00:30:39.220 --> 00:30:42.540
specific sense, in probability,
according to this
00:30:42.540 --> 00:30:46.550
notion of convergence
that we have used.
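The bound that drives this argument, sigma-squared over (n times epsilon-squared), can be tabulated in a couple of lines. A sketch with an assumed sigma-squared of 1:

```python
def chebyshev_bound(sigma2, n, eps):
    # Chebyshev applied to the sample mean:
    # P(|Mn - mu| >= eps) <= sigma^2 / (n * eps^2)
    return sigma2 / (n * eps ** 2)

# For fixed eps, the bound goes to 0 as n grows -- which is exactly
# convergence in probability of the sample mean to the true mean.
for n in [100, 10000, 1000000]:
    print(n, chebyshev_bound(1.0, n, 0.1))
```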
00:30:46.550 --> 00:30:51.060
So since we're talking about
sampling, let's go over an
00:30:51.060 --> 00:30:56.150
example, which is the typical
situation faced by someone
00:30:56.150 --> 00:30:58.110
who's constructing a poll.
00:30:58.110 --> 00:31:02.680
So you're interested in some
property of the population.
00:31:02.680 --> 00:31:05.590
So what fraction of
the population
00:31:05.590 --> 00:31:08.380
prefers Coke to Pepsi?
00:31:08.380 --> 00:31:11.080
So there's a number f, which
is that fraction of the
00:31:11.080 --> 00:31:12.460
population.
00:31:12.460 --> 00:31:16.260
And so this is an
exact number.
00:31:16.260 --> 00:31:20.250
So if, out of a population of 100
million, 20 million prefer
00:31:20.250 --> 00:31:25.590
Coke, then f would be 0.2.
00:31:25.590 --> 00:31:27.970
We want to find out what
that fraction is.
00:31:27.970 --> 00:31:30.590
We cannot ask everyone.
00:31:30.590 --> 00:31:34.250
What we're going to do is to
take a random sample of people
00:31:34.250 --> 00:31:37.300
and ask them for their
preferences.
00:31:37.300 --> 00:31:42.690
So the ith person either says
yes for Coke or no.
00:31:42.690 --> 00:31:46.430
And we record that by putting
a 1 each time that we get a
00:31:46.430 --> 00:31:49.160
yes answer.
00:31:49.160 --> 00:31:51.850
And then we form the average
of these x's.
00:31:51.850 --> 00:31:53.070
What is this average?
00:31:53.070 --> 00:31:57.000
It's the number of 1's that
we got divided by n.
00:31:57.000 --> 00:32:02.590
So this is a fraction, but
calculated only on the basis
00:32:02.590 --> 00:32:04.880
of the sample that we have.
00:32:04.880 --> 00:32:10.260
So you can think of this as
being an estimate, f_hat,
00:32:10.260 --> 00:32:13.120
based on the sample
that we have.
00:32:13.120 --> 00:32:17.155
Now, even though we used the
lower case letter here, this
00:32:17.155 --> 00:32:20.590
f_hat is, of course,
a random variable.
00:32:20.590 --> 00:32:23.300
f is a number.
00:32:23.300 --> 00:32:27.570
This is the true fraction in
the overall population.
00:32:27.570 --> 00:32:30.380
f_hat is the estimate
that we get by using
00:32:30.380 --> 00:32:32.300
our particular sample.
00:32:32.300 --> 00:32:32.410
OK.
00:32:32.410 --> 00:32:38.760
So your boss told you, I need to
know what f is, but go and
00:32:38.760 --> 00:32:40.150
do some sampling.
00:32:40.150 --> 00:32:42.720
What are you going to respond?
00:32:42.720 --> 00:32:46.360
Unless I ask everyone in the
whole population, there's no
00:32:46.360 --> 00:32:51.180
way for me to know f exactly.
00:32:51.180 --> 00:32:51.890
Right?
00:32:51.890 --> 00:32:54.560
There's no way.
00:32:54.560 --> 00:32:59.040
OK, so the boss tells you, well
OK, then tell me f
00:32:59.040 --> 00:33:00.860
within an accuracy.
00:33:00.860 --> 00:33:10.910
I want an answer from you,
that's your answer, which is
00:33:10.910 --> 00:33:14.930
close to the correct answer
within 1 percentage point.
00:33:14.930 --> 00:33:20.260
So if the true f is 0.4, your
answer should be somewhere
00:33:20.260 --> 00:33:22.500
between 0.39 and 0.41.
00:33:22.500 --> 00:33:25.520
I want a really accurate
answer.
00:33:25.520 --> 00:33:27.580
What are you going to say?
00:33:27.580 --> 00:33:31.360
Well, there's no guarantee
that my answer
00:33:31.360 --> 00:33:33.230
will be within 1 percentage point.
00:33:33.230 --> 00:33:37.320
Maybe I'm unlucky and I just
happen to sample the wrong set
00:33:37.320 --> 00:33:40.450
of people and my answer
comes out to be wrong.
00:33:40.450 --> 00:33:45.800
So I cannot give you a hard
guarantee that this inequality
00:33:45.800 --> 00:33:47.240
will be satisfied.
00:33:47.240 --> 00:33:51.990
But perhaps, I can give you a
guarantee that this inequality
00:33:51.990 --> 00:33:55.680
will be satisfied, this accuracy
requirement will be
00:33:55.680 --> 00:33:59.340
satisfied, with high
confidence.
00:33:59.340 --> 00:34:02.520
That is, there's going to be
a small probability that
00:34:02.520 --> 00:34:04.420
things go wrong, that
I'm unlucky
00:34:04.420 --> 00:34:07.030
and I used a bad sample.
00:34:07.030 --> 00:34:10.750
But leaving aside that small
probability of being unlucky,
00:34:10.750 --> 00:34:13.989
my answer will be accurate
within the accuracy
00:34:13.989 --> 00:34:16.100
requirement that you have.
00:34:16.100 --> 00:34:20.500
So these two numbers are the
usual specs that one has when
00:34:20.500 --> 00:34:22.010
designing polls.
00:34:22.010 --> 00:34:27.370
So this number is the accuracy
that we want.
00:34:27.370 --> 00:34:29.300
It's the desired accuracy.
00:34:29.300 --> 00:34:35.239
And this number has to do with
the confidence that we want.
00:34:35.239 --> 00:34:40.210
So 1 minus that number, we could
call it the confidence
00:34:40.210 --> 00:34:43.500
that we want out
of our sample.
00:34:43.500 --> 00:34:47.820
So this is really 1
minus confidence.
00:34:47.820 --> 00:34:51.830
So now your job is to figure out
how large an n, how large
00:34:51.830 --> 00:34:56.219
a sample should you be using, in
order to satisfy the specs
00:34:56.219 --> 00:34:59.060
that your boss gave you.
00:34:59.060 --> 00:35:02.560
All you know at this stage is
the Chebyshev inequality.
00:35:02.560 --> 00:35:05.210
So you just try to use it.
00:35:05.210 --> 00:35:09.780
The probability of getting an
answer that's more than 0.01
00:35:09.780 --> 00:35:14.780
away from the true answer is, by
Chebyshev's inequality, the
00:35:14.780 --> 00:35:20.170
variance of this random variable
divided by this
00:35:20.170 --> 00:35:21.540
number squared.
00:35:21.540 --> 00:35:25.870
The variance, as we argued
a little earlier, is the
00:35:25.870 --> 00:35:29.190
variance of the x's
divided by n.
00:35:29.190 --> 00:35:31.830
So we get this expression.
00:35:31.830 --> 00:35:35.230
So we would like this
number to be less
00:35:35.230 --> 00:35:38.330
than or equal to 0.05.
00:35:38.330 --> 00:35:41.620
OK, here we hit a little
bit of a difficulty.
00:35:41.620 --> 00:35:49.040
The variance, (sigma_x)-squared,
what is it?
00:35:49.040 --> 00:35:54.010
(Sigma_x)-squared is, if you
remember the variance of a
00:35:54.010 --> 00:35:58.010
Bernoulli random variable,
this quantity.
00:35:58.010 --> 00:35:59.730
But we don't know it.
00:35:59.730 --> 00:36:02.880
f is what we're trying to
estimate in the first place.
00:36:02.880 --> 00:36:06.790
So the variance is not known,
so I cannot plug in a number
00:36:06.790 --> 00:36:08.080
inside here.
00:36:08.080 --> 00:36:12.340
What I can do is to be
conservative and use an upper
00:36:12.340 --> 00:36:14.050
bound of the variance.
00:36:14.050 --> 00:36:17.280
How large can this number get?
00:36:17.280 --> 00:36:20.090
Well, you can plot
f times (1-f).
00:36:25.950 --> 00:36:26.750
It's a parabola.
00:36:26.750 --> 00:36:29.420
It has a root at 0 and at 1.
00:36:29.420 --> 00:36:34.450
So the maximum value is going to
be, by symmetry, at 1/2 and
00:36:34.450 --> 00:36:39.350
when f is 1/2, then this
variance becomes 1/4.
00:36:39.350 --> 00:36:42.340
So I don't know
(sigma_x)-squared, but I'm
00:36:42.340 --> 00:36:45.480
going to use the worst case
value for (sigma_x)-squared,
00:36:45.480 --> 00:36:48.480
which is 1/4.
00:36:48.480 --> 00:36:53.320
And this is now an inequality
that I know to be always true.
00:36:53.320 --> 00:36:56.910
I've got my specs, and my specs
tell me that I want this
00:36:56.910 --> 00:36:59.800
number to be less than 0.05.
00:36:59.800 --> 00:37:04.980
And given what I know, the best
thing I can do is to say,
00:37:04.980 --> 00:37:07.860
OK, I'm going to take
this number and make
00:37:07.860 --> 00:37:14.070
it less than 0.05.
00:37:14.070 --> 00:37:20.860
If I choose my n so that this
is less than 0.05, then I'm
00:37:20.860 --> 00:37:24.890
certain that this probability
is also less than 0.05.
00:37:24.890 --> 00:37:28.720
What does it take for this
inequality to be true?
00:37:28.720 --> 00:37:36.370
You can solve for n here, and
you find that to satisfy this
00:37:36.370 --> 00:37:40.780
inequality, n should be larger
than or equal to 50,000.
00:37:40.780 --> 00:37:44.250
So you can just let n
be equal to 50,000.
00:37:44.250 --> 00:37:47.920
So the Chebyshev inequality
tells us that if you take n
00:37:47.920 --> 00:37:51.940
equal to 50,000, then by the
Chebyshev inequality, we're
00:37:51.940 --> 00:37:57.850
guaranteed to satisfy the specs
that we were given.
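The sample-size calculation just carried out can be packaged as a short sketch, with the worst-case Bernoulli variance of 1/4 built in:

```python
def chebyshev_sample_size(eps, delta):
    # We need the Chebyshev bound 1/(4*n*eps^2) <= delta,
    # i.e. n >= 1 / (4 * eps^2 * delta), using the worst-case
    # Bernoulli variance f*(1-f) <= 1/4.
    return 1 / (4 * eps ** 2 * delta)

print(chebyshev_sample_size(0.01, 0.05))  # about 50,000, as in the lecture
print(chebyshev_sample_size(0.03, 0.05))  # about 5,556 -- roughly 9x smaller
```

Relaxing the accuracy from 1 percentage point to 3 shrinks the required n by the square of that ratio, which is where the factor of roughly 10 mentioned next comes from.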
00:37:57.850 --> 00:37:57.960
OK.
00:37:57.960 --> 00:38:03.950
Now, 50,000 is a bit of
a large sample size.
00:38:03.950 --> 00:38:05.980
Right?
00:38:05.980 --> 00:38:09.490
If you read anything in the
newspapers where they say so
00:38:09.490 --> 00:38:13.230
much of the voters think this
and that, this was determined
00:38:13.230 --> 00:38:19.830
on the basis of a sample of
1,200 likely voters or so.
00:38:19.830 --> 00:38:23.430
So the numbers that you will
typically see in these news
00:38:23.430 --> 00:38:27.590
items about polling, they
usually involve sample sizes
00:38:27.590 --> 00:38:30.080
about the 1,000 or so.
00:38:30.080 --> 00:38:35.250
You will never see a sample
size of 50,000.
00:38:35.250 --> 00:38:37.230
That's too much.
00:38:37.230 --> 00:38:41.670
So where can we cut
some corners?
00:38:41.670 --> 00:38:46.390
Well, we can cut corners
basically in three places.
00:38:46.390 --> 00:38:49.950
This requirement is a
little too tight.
00:38:49.950 --> 00:38:53.530
Newspaper stories will usually
tell you, we have an accuracy
00:38:53.530 --> 00:38:58.800
of +/- 3 percentage points,
instead of 1 percentage point.
00:38:58.800 --> 00:39:03.770
And because this number comes up
as a square, by making it 3
00:39:03.770 --> 00:39:09.000
percentage points instead of 1, it saves
you a factor of 10.
00:39:09.000 --> 00:39:12.790
Then, the five percent
confidence, I guess that's
00:39:12.790 --> 00:39:15.180
usually OK.
00:39:15.180 --> 00:39:19.400
If we use that factor of 10
that we gain from here,
00:39:19.400 --> 00:39:23.730
then we get a sample
size of 5,000.
00:39:23.730 --> 00:39:25.980
And that's, again,
a little too big.
00:39:25.980 --> 00:39:28.140
So where can we fix things?
00:39:28.140 --> 00:39:31.140
Well, it turns out that this
inequality that we're using
00:39:31.140 --> 00:39:34.660
here, Chebyshev's inequality,
is just an inequality.
00:39:34.660 --> 00:39:36.890
It's not that tight.
00:39:36.890 --> 00:39:38.850
It's not very accurate.
00:39:38.850 --> 00:39:42.800
Maybe there's a better way of
calculating or estimating this
00:39:42.800 --> 00:39:46.760
quantity, which is smaller
than this.
00:39:46.760 --> 00:39:49.770
And using a more accurate
inequality or a more accurate
00:39:49.770 --> 00:39:55.320
bound, then we can convince
ourselves that we can settle
00:39:55.320 --> 00:39:57.800
with a smaller sample size.
00:39:57.800 --> 00:40:01.770
This more accurate kind of
inequality comes out of a
00:40:01.770 --> 00:40:04.140
different limit theorem,
which is the next limit
00:40:04.140 --> 00:40:06.030
theorem we're going
to consider.
00:40:06.030 --> 00:40:08.310
We're going to start the
discussion today, but we're
00:40:08.310 --> 00:40:12.150
going to continue with
it next week.
00:40:12.150 --> 00:40:18.750
Before I tell you exactly what
that other limit theorem says,
00:40:18.750 --> 00:40:20.800
let me give you the
big picture of
00:40:20.800 --> 00:40:24.760
what's involved here.
00:40:24.760 --> 00:40:29.170
We're dealing with sums of
i.i.d random variables.
00:40:29.170 --> 00:40:32.300
Each X has a distribution
of its own.
00:40:34.840 --> 00:40:41.190
So suppose that X has a
distribution which is
00:40:41.190 --> 00:40:43.090
something like this.
00:40:43.090 --> 00:40:48.560
This is the density of X. If I
add lots of X's together, what
00:40:48.560 --> 00:40:51.460
kind of distribution
do I expect?
00:40:51.460 --> 00:40:55.170
The mean is going to be
n times the mean of an
00:40:55.170 --> 00:41:00.560
individual X. So if this is mu,
I'm going to get a mean of
00:41:00.560 --> 00:41:02.730
n times mu.
00:41:02.730 --> 00:41:06.620
But my variance will
also increase.
00:41:06.620 --> 00:41:08.050
When I add the random
variables,
00:41:08.050 --> 00:41:10.190
I'm adding the variances.
00:41:10.190 --> 00:41:13.370
So since the variance increases,
we're going to get
00:41:13.370 --> 00:41:17.610
a distribution that's
pretty wide.
00:41:17.610 --> 00:41:23.240
So this is the density of X1
plus all the way to Xn.
00:41:23.240 --> 00:41:27.640
So as n increases, my
distribution shifts, because
00:41:27.640 --> 00:41:28.770
the mean is positive.
00:41:28.770 --> 00:41:30.610
So I keep adding things.
00:41:30.610 --> 00:41:33.870
And also, my distribution
becomes wider and wider.
00:41:33.870 --> 00:41:36.080
The variance increases.
00:41:36.080 --> 00:41:39.260
Well, we considered a different
scaling.
00:41:39.260 --> 00:41:42.980
We considered a scaled version of
this quantity when we looked
00:41:42.980 --> 00:41:46.180
at the weak law of
large numbers.
00:41:46.180 --> 00:41:49.580
In the weak law of large
numbers, we take this random
00:41:49.580 --> 00:41:52.140
variable and divide it by n.
00:41:52.140 --> 00:41:56.300
And what the weak law tells us
is that we're going to get a
00:41:56.300 --> 00:42:01.050
distribution that's very highly
concentrated around the
00:42:01.050 --> 00:42:03.650
true mean, which is mu.
00:42:03.650 --> 00:42:07.520
So this here would be the
density of X1 plus all the way to
00:42:07.520 --> 00:42:12.630
Xn divided by n.
00:42:12.630 --> 00:42:16.660
Because I've divided by n, the
mean has become the original
00:42:16.660 --> 00:42:19.410
mean, which is mu.
00:42:19.410 --> 00:42:22.620
But the weak law of large
numbers tells us that the
00:42:22.620 --> 00:42:26.650
distribution of this random
variable is very concentrated
00:42:26.650 --> 00:42:27.810
around the mean.
00:42:27.810 --> 00:42:29.850
So we get a distribution
that's very
00:42:29.850 --> 00:42:31.520
narrow, of this kind.
00:42:31.520 --> 00:42:34.230
In the limit, this distribution
becomes one
00:42:34.230 --> 00:42:37.570
that's just concentrated
on top of mu.
00:42:37.570 --> 00:42:40.930
So it's sort of a degenerate
distribution.
00:42:40.930 --> 00:42:46.070
So these are two extremes, no
scaling for the sum, a scaling
00:42:46.070 --> 00:42:47.740
where we divide by n.
00:42:47.740 --> 00:42:50.680
In this extreme, we get the
trivial case of a distribution
00:42:50.680 --> 00:42:52.860
that flattens out completely.
00:42:52.860 --> 00:42:56.070
In this scaling, we get a
distribution that gets
00:42:56.070 --> 00:42:59.150
concentrated around
a single point.
00:42:59.150 --> 00:43:02.030
Again, we look at some
intermediate scaling that
00:43:02.030 --> 00:43:04.050
makes things more interesting.
00:43:04.050 --> 00:43:09.700
Things do become interesting
if we scale by dividing the
00:43:09.700 --> 00:43:14.520
sum by square root of n instead
of dividing by n.
00:43:14.520 --> 00:43:17.210
What effect does this have?
00:43:17.210 --> 00:43:22.510
When we scale by dividing by
square root of n, the variance
00:43:22.510 --> 00:43:28.050
of Sn over square root of n is
going to be the variance of Sn
00:43:28.050 --> 00:43:30.760
divided by n.
00:43:30.760 --> 00:43:32.780
That's how variances behave.
00:43:32.780 --> 00:43:37.370
The variance of Sn is n
sigma-squared, divide by n,
00:43:37.370 --> 00:43:41.330
which is sigma squared, which
means that when we scale in
00:43:41.330 --> 00:43:45.940
this particular way,
as n changes, the
00:43:45.940 --> 00:43:48.230
variance doesn't change.
00:43:48.230 --> 00:43:50.300
So the width of our
distribution
00:43:50.300 --> 00:43:52.190
will be sort of constant.
00:43:52.190 --> 00:43:56.360
The distribution changes shape,
but it doesn't become
00:43:56.360 --> 00:43:59.910
narrower as was the case here.
00:43:59.910 --> 00:44:04.550
It doesn't become wider, kind
of keeps the same width.
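The three scalings being contrasted (no scaling, dividing by n, dividing by square root of n) can be compared with a tiny sketch of the variance formula Var(Sn / n^p) = n sigma-squared / n^(2p):

```python
def var_of_scaled_sum(n, sigma2, power):
    # Var(Sn / n**power) = Var(Sn) / n**(2*power) = n * sigma2 / n**(2*power)
    return n * sigma2 / n ** (2 * power)

for n in [10, 1000, 100000]:
    # power 0 (no scaling): variance grows like n
    # power 1 (divide by n): variance shrinks to 0
    # power 1/2 (divide by sqrt(n)): variance stays fixed at sigma2
    print(n,
          var_of_scaled_sum(n, 1.0, 0),
          var_of_scaled_sum(n, 1.0, 1),
          var_of_scaled_sum(n, 1.0, 0.5))
```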
00:44:04.550 --> 00:44:09.260
So perhaps in the limit, this
distribution is going to take
00:44:09.260 --> 00:44:11.080
an interesting shape.
00:44:11.080 --> 00:44:14.170
And that's indeed the case.
00:44:14.170 --> 00:44:19.800
So let's do what
we did before.
00:44:19.800 --> 00:44:25.110
So we're looking at the sum, and
we want to divide the sum
00:44:25.110 --> 00:44:28.860
by something that goes like
square root of n.
00:44:28.860 --> 00:44:33.140
So the variance of Sn
is n sigma squared.
00:44:33.140 --> 00:44:38.240
The standard deviation of Sn
is the square root of that.
00:44:38.240 --> 00:44:39.570
It's this number.
00:44:39.570 --> 00:44:43.930
So effectively, we're scaling
by order of square root n.
00:44:43.930 --> 00:44:47.570
Now, I'm doing another
thing here.
00:44:47.570 --> 00:44:52.350
If my random variable has a
positive mean, then this
00:44:52.350 --> 00:44:55.470
quantity is going to
have a mean that's
00:44:55.470 --> 00:44:56.950
positive and growing.
00:44:56.950 --> 00:44:59.450
It's going to be shifting
to the right.
00:44:59.450 --> 00:45:01.350
Why is that?
00:45:01.350 --> 00:45:04.370
Sn has a mean that's
proportional to n.
00:45:04.370 --> 00:45:09.510
When I divide by square root n,
then it means that the mean
00:45:09.510 --> 00:45:11.990
scales like square root of n.
00:45:11.990 --> 00:45:14.740
So my distribution would
still keep shifting
00:45:14.740 --> 00:45:16.720
after I do this division.
00:45:16.720 --> 00:45:20.860
I want to keep my distribution
in place, so I subtract out
00:45:20.860 --> 00:45:23.920
the mean of Sn.
00:45:23.920 --> 00:45:29.580
So what we're doing here is
a standard technique or
00:45:29.580 --> 00:45:32.670
transformation where you take
a random variable and you
00:45:32.670 --> 00:45:34.890
so-called standardize it.
00:45:34.890 --> 00:45:38.500
I remove the mean of that random
variable and I divide
00:45:38.500 --> 00:45:40.100
by the standard deviation.
00:45:40.100 --> 00:45:43.030
This results in a random
variable that has 0 mean and
00:45:43.030 --> 00:45:44.960
unit variance.
00:45:44.960 --> 00:45:49.880
What Zn measures is the
following, Zn tells me how
00:45:49.880 --> 00:45:55.520
many standard deviations am
I away from the mean.
00:45:55.520 --> 00:45:59.380
Sn minus (n times expected value
of X) tells me how much
00:45:59.380 --> 00:46:02.980
is Sn away from the
mean value of Sn.
00:46:02.980 --> 00:46:06.250
And by dividing by the standard
deviation of Sn --
00:46:06.250 --> 00:46:09.830
this tells me how many standard
deviations away from
00:46:09.830 --> 00:46:12.550
the mean am I.
00:46:12.550 --> 00:46:15.360
So we're going to look at this
random variable, which is just
00:46:15.360 --> 00:46:17.260
a transformation Zn.
00:46:17.260 --> 00:46:20.840
It's a linear transformation
of Sn.
00:46:20.840 --> 00:46:24.740
And we're going to compare
this random variable to a
00:46:24.740 --> 00:46:27.230
standard normal random
variable.
00:46:27.230 --> 00:46:30.610
So a standard normal is the
random variable that you are
00:46:30.610 --> 00:46:35.200
familiar with, given by the
usual formula, and for which
00:46:35.200 --> 00:46:37.400
we have tables.
00:46:37.400 --> 00:46:40.400
This Zn has 0 mean and
unit variance.
00:46:40.400 --> 00:46:44.220
So in that respect, it has the
same statistics as the
00:46:44.220 --> 00:46:45.655
standard normal.
00:46:45.655 --> 00:46:48.960
The distribution of Zn
could be anything --
00:46:48.960 --> 00:46:50.770
can be pretty messy.
00:46:50.770 --> 00:46:53.320
But there is this amazing
theorem called the central
00:46:53.320 --> 00:46:58.250
limit theorem that tells us that
the distribution of Zn
00:46:58.250 --> 00:47:01.930
approaches the distribution of
the standard normal in the
00:47:01.930 --> 00:47:06.270
following sense, the
probabilities that you can
00:47:06.270 --> 00:47:07.080
calculate --
00:47:07.080 --> 00:47:07.930
of this type --
00:47:07.930 --> 00:47:10.350
that you can calculate
for Zn --
00:47:10.350 --> 00:47:13.330
in the limit become the same as
the probabilities that you
00:47:13.330 --> 00:47:17.590
would get from the standard
normal tables for Z.
00:47:17.590 --> 00:47:19.750
It's a statement about
the cumulative
00:47:19.750 --> 00:47:21.960
distribution functions.
00:47:21.960 --> 00:47:25.060
This quantity, as a function
of c, is the cumulative
00:47:25.060 --> 00:47:27.920
distribution function of
the random variable Zn.
00:47:27.920 --> 00:47:30.860
This is the cumulative
distribution function of the
00:47:30.860 --> 00:47:32.190
standard normal.
00:47:32.190 --> 00:47:34.530
The central limit theorem tells
us that the cumulative
00:47:34.530 --> 00:47:39.340
distribution function of the
sum of a number of random
00:47:39.340 --> 00:47:43.040
variables, after they're
appropriately standardized,
00:47:43.040 --> 00:47:46.480
approaches the cumulative
distribution function of the
00:47:46.480 --> 00:47:50.580
standard normal distribution.
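This convergence of CDFs can be seen numerically. A sketch under an assumed distribution (Uniform(0,1) summands, so mu = 1/2 and sigma-squared = 1/12); the theorem itself applies to any i.i.d. X's with finite variance:

```python
import math
import random

def standardized_sum_cdf(n, c, trials=20000, seed=1):
    # Empirical P(Zn <= c), where Zn = (Sn - n*mu) / (sqrt(n)*sigma)
    # and the Xi are Uniform(0,1): mu = 1/2, sigma^2 = 1/12 (assumed example).
    rng = random.Random(seed)
    mu, sigma = 0.5, (1 / 12) ** 0.5
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        z = (s - n * mu) / (math.sqrt(n) * sigma)
        hits += z <= c
    return hits / trials

def normal_cdf(c):
    # Standard normal CDF, written in terms of the error function.
    return 0.5 * (1 + math.erf(c / math.sqrt(2)))

print(standardized_sum_cdf(30, 1.0), normal_cdf(1.0))  # both near 0.84
```

Even for n = 30, the empirical CDF of Zn already sits close to the standard normal table value.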
00:47:50.580 --> 00:47:53.620
In particular, this tells
us that we can calculate
00:47:53.620 --> 00:47:59.480
probabilities for Zn when n is
large by calculating instead
00:47:59.480 --> 00:48:02.800
probabilities for Z. And that's
going to be a good
00:48:02.800 --> 00:48:04.020
approximation.
00:48:04.020 --> 00:48:07.670
Probabilities for Z are easy to
calculate because they're
00:48:07.670 --> 00:48:09.250
well tabulated.
00:48:09.250 --> 00:48:12.820
So we get a very nice shortcut
for calculating
00:48:12.820 --> 00:48:14.990
probabilities for Zn.
00:48:14.990 --> 00:48:17.990
Now, it's not Zn that you're
interested in.
00:48:17.990 --> 00:48:20.890
What you're interested
in is Sn.
00:48:20.890 --> 00:48:23.820
And Sn --
00:48:23.820 --> 00:48:29.080
inverting this relation
here --
00:48:29.080 --> 00:48:38.330
Sn is square root n sigma
Zn plus n expected
00:48:38.330 --> 00:48:42.602
value of X. All right.
00:48:42.602 --> 00:48:46.620
Now, if you can calculate
probabilities for Zn, even
00:48:46.620 --> 00:48:49.380
approximately, then you can
certainly calculate
00:48:49.380 --> 00:48:53.290
probabilities for Sn, because
one is a linear
00:48:53.290 --> 00:48:55.206
function of the other.
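Concretely, a probability for Sn comes from inverting the standardization, as just described. A sketch (the coin-flip numbers at the end are illustrative, not from the lecture):

```python
import math

def normal_cdf(c):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(c / math.sqrt(2)))

def approx_prob_sn_le(a, n, mu, sigma):
    # CLT approximation: since Sn = sqrt(n)*sigma*Zn + n*mu,
    # P(Sn <= a) ~ P(Z <= (a - n*mu) / (sqrt(n)*sigma)).
    c = (a - n * mu) / (math.sqrt(n) * sigma)
    return normal_cdf(c)

# Illustrative example: 100 fair coin flips (mu = 0.5, sigma = 0.5).
print(approx_prob_sn_le(55, 100, 0.5, 0.5))  # about 0.84
```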
00:48:55.206 --> 00:48:58.710
And we're going to do a little
bit of that next time.
00:48:58.710 --> 00:49:02.220
You're going to get, also, some
practice in recitation.
00:49:02.220 --> 00:49:04.975
At a more vague level, you could
describe the central
00:49:04.975 --> 00:49:08.270
limit theorem as saying the
following, when n is large,
00:49:08.270 --> 00:49:12.160
you can pretend that Zn is
a standard normal random
00:49:12.160 --> 00:49:15.440
variable and do the calculations
as if Zn was
00:49:15.440 --> 00:49:16.680
standard normal.
00:49:16.680 --> 00:49:21.530
Now, pretending that Zn is
normal is the same as
00:49:21.530 --> 00:49:25.900
pretending that Sn is normal,
because Sn is a linear
00:49:25.900 --> 00:49:27.700
function of Zn.
00:49:27.700 --> 00:49:30.400
And we know that linear
functions of normal random
00:49:30.400 --> 00:49:32.140
variables are normal.
00:49:32.140 --> 00:49:36.290
So the central limit theorem
essentially tells us that we
00:49:36.290 --> 00:49:40.070
can pretend that Sn is a normal
random variable and do
00:49:40.070 --> 00:49:44.760
the calculations just as if it
were a normal random variable.
00:49:44.760 --> 00:49:47.020
Mathematically speaking though,
the central limit
00:49:47.020 --> 00:49:50.480
theorem does not talk about
the distribution of Sn,
00:49:50.480 --> 00:49:54.940
because the distribution of Sn
becomes degenerate in the
00:49:54.940 --> 00:49:57.650
limit, just a very flat
and long thing.
00:49:57.650 --> 00:49:59.810
So strictly speaking
mathematically, it's a
00:49:59.810 --> 00:50:03.060
statement about cumulative
distributions of Zn's.
00:50:03.060 --> 00:50:06.420
Practically, the way you use it
is by just pretending that
00:50:06.420 --> 00:50:08.415
Sn is normal.
00:50:08.415 --> 00:50:09.400
Very good.
00:50:09.400 --> 00:50:11.080
Enjoy the Thanksgiving Holiday.