WEBVTT

00:00:00.040 --> 00:00:02.460
The following content is
provided under a Creative

00:00:02.460 --> 00:00:03.870
Commons license.

00:00:03.870 --> 00:00:06.910
Your support will help MIT
OpenCourseWare continue to

00:00:06.910 --> 00:00:10.560
offer high quality educational
resources for free.

00:00:10.560 --> 00:00:13.460
To make a donation or view
additional materials from

00:00:13.460 --> 00:00:17.390
hundreds of MIT courses, visit
MIT OpenCourseWare at

00:00:17.390 --> 00:00:18.640
ocw.mit.edu.

00:00:21.860 --> 00:00:25.130
JOHN TSITSIKLIS: We're going
to start today a new unit.

00:00:25.130 --> 00:00:29.320
So we will be talking about
limit theorems.

00:00:29.320 --> 00:00:33.580
So just to introduce the topic,
let's think of the

00:00:33.580 --> 00:00:35.560
following situation.

00:00:35.560 --> 00:00:37.580
There's a population
of penguins down

00:00:37.580 --> 00:00:38.970
at the South Pole.

00:00:38.970 --> 00:00:42.740
And if you were to pick a
penguin at random and measure

00:00:42.740 --> 00:00:46.930
their height, the expected value
of their height would be

00:00:46.930 --> 00:00:50.020
the average of the heights of
the different penguins in the

00:00:50.020 --> 00:00:50.970
population.

00:00:50.970 --> 00:00:53.430
So suppose when you
pick one, every

00:00:53.430 --> 00:00:55.210
penguin is equally likely.

00:00:55.210 --> 00:00:58.020
Then the expected value is just
the average of all the

00:00:58.020 --> 00:00:59.340
penguins out there.

00:00:59.340 --> 00:01:01.650
So your boss asks you to
find out what the

00:01:01.650 --> 00:01:03.020
expected value is.

00:01:03.020 --> 00:01:04.980
One way would be to
go and measure

00:01:04.980 --> 00:01:06.540
each and every penguin.

00:01:06.540 --> 00:01:08.600
That might be a little
time consuming.

00:01:08.600 --> 00:01:13.120
So alternatively, what you can
do is to go and pick penguins

00:01:13.120 --> 00:01:17.450
at random, pick a few of them,
let's say a number n of them.

00:01:17.450 --> 00:01:20.420
So you measure the height
of each one.

00:01:20.420 --> 00:01:25.920
And then you calculate the
average of the heights of

00:01:25.920 --> 00:01:29.050
those penguins that you
have collected.

00:01:29.050 --> 00:01:33.100
So this is your estimate
of the expected value.
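
None of this is in the lecture itself, but the estimate can be sketched in a few lines of Python, with a made-up population of heights:

```python
import random

# Hypothetical population of penguin heights in centimeters (made-up numbers).
population = [61.0, 63.5, 58.2, 60.1, 62.3, 59.8, 64.0, 60.7]

# The expected value: since every penguin is equally likely to be picked,
# it is just the average over the whole population.
expected_value = sum(population) / len(population)

def sample_mean(n):
    """The sample mean M_n: pick n penguins at random and average their heights."""
    sample = [random.choice(population) for _ in range(n)]
    return sum(sample) / n

random.seed(0)  # fixed seed so the sketch is reproducible
estimate = sample_mean(1000)
print(expected_value, estimate)  # the estimate lands near the true mean
```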

00:01:33.100 --> 00:01:41.010
Now, we called this the sample
mean, which is the mean value,

00:01:41.010 --> 00:01:44.430
but within the sample that
you have collected.

00:01:44.430 --> 00:01:48.090
This is something that sort
of feels the same as the

00:01:48.090 --> 00:01:52.140
expected value, which
is again, the mean.

00:01:52.140 --> 00:01:54.400
But the expected value's a
different kind of mean.

00:01:54.400 --> 00:01:57.870
The expected value is the mean
over the entire population,

00:01:57.870 --> 00:02:01.680
whereas the sample mean is the
average over the smaller

00:02:01.680 --> 00:02:03.940
sample that you have measured.

00:02:03.940 --> 00:02:06.330
The expected value
is a number.

00:02:06.330 --> 00:02:09.220
The sample mean is a
random variable.

00:02:09.220 --> 00:02:11.720
It's a random variable because
the sample you have

00:02:11.720 --> 00:02:15.010
collected is random.

00:02:15.010 --> 00:02:18.520
Now, we think that this is a
reasonable way of estimating

00:02:18.520 --> 00:02:19.900
the expectation.

00:02:19.900 --> 00:02:25.710
So in the limit as n goes to
infinity, it's plausible that

00:02:25.710 --> 00:02:29.170
the sample mean, the estimate
that we are constructing,

00:02:29.170 --> 00:02:33.790
should somehow get close
to the expected value.

00:02:33.790 --> 00:02:34.560
What does this mean?

00:02:34.560 --> 00:02:36.160
What does it mean
to get close?

00:02:36.160 --> 00:02:37.620
In what sense?

00:02:37.620 --> 00:02:39.440
And is this statement true?

00:02:39.440 --> 00:02:44.160
This is the kind of statement
that we deal with when dealing

00:02:44.160 --> 00:02:45.710
with limit theorems.

00:02:45.710 --> 00:02:49.500
That's the subject of limit
theorems: what happens when

00:02:49.500 --> 00:02:52.020
you're dealing with lots and
lots of random variables, and

00:02:52.020 --> 00:02:54.620
perhaps take averages
and so on.

00:02:54.620 --> 00:02:57.280
So why do we bother
with this?

00:02:57.280 --> 00:03:01.200
Well, if you're in the sampling
business, it would be

00:03:01.200 --> 00:03:04.870
reassuring to know that this
particular way of estimating

00:03:04.870 --> 00:03:06.880
the expected value
actually gets you

00:03:06.880 --> 00:03:08.850
close to the true answer.

00:03:08.850 --> 00:03:11.890
There's also a higher level
reason, which is a little more

00:03:11.890 --> 00:03:13.660
abstract and mathematical.

00:03:13.660 --> 00:03:17.110
So probability problems are easy
to deal with if you have

00:03:17.110 --> 00:03:20.040
in your hands one or
two random variables.

00:03:20.040 --> 00:03:23.520
You can write down their mass
functions, joint density

00:03:23.520 --> 00:03:24.930
functions, and so on.

00:03:24.930 --> 00:03:27.500
You can calculate on paper
or on a computer,

00:03:27.500 --> 00:03:29.430
you can get the answers.

00:03:29.430 --> 00:03:33.510
Probability problems become
computationally intractable if

00:03:33.510 --> 00:03:36.760
you're dealing, let's say, with
100 random variables and

00:03:36.760 --> 00:03:40.280
you're trying to get the exact
answers for anything.

00:03:40.280 --> 00:03:43.050
So in principle, the same
formulas that we have, they

00:03:43.050 --> 00:03:44.230
still apply.

00:03:44.230 --> 00:03:47.360
But they involve summations
over large ranges of

00:03:47.360 --> 00:03:48.830
combinations of indices.

00:03:48.830 --> 00:03:51.310
And that makes life extremely
difficult.

00:03:51.310 --> 00:03:55.100
But when you push the envelope
and you go to a situation

00:03:55.100 --> 00:03:58.480
where you're dealing with a
very, very large number of

00:03:58.480 --> 00:04:02.130
variables, then you can
start taking limits.

00:04:02.130 --> 00:04:05.200
And when you take limits,
wonderful things happen.

00:04:05.200 --> 00:04:08.030
Many formulas start simplifying,
and you can

00:04:08.030 --> 00:04:11.770
actually get useful answers by
considering those limits.

00:04:11.770 --> 00:04:15.450
And that's sort of the big
reason why looking at limit

00:04:15.450 --> 00:04:17.820
theorems is a useful
thing to do.

00:04:17.820 --> 00:04:20.990
So what we're going to do today,
first we're going to

00:04:20.990 --> 00:04:27.110
start with a useful, simple tool
that allows us to relate

00:04:27.110 --> 00:04:30.290
probabilities with
expected values.

00:04:30.290 --> 00:04:33.230
The Markov inequality is the
first inequality we're going

00:04:33.230 --> 00:04:33.840
to write down.

00:04:33.840 --> 00:04:37.650
And then using that, we're going
to get Chebyshev's

00:04:37.650 --> 00:04:39.760
inequality, a related
inequality.

00:04:39.760 --> 00:04:43.760
Then we need to define what
we mean by convergence when we

00:04:43.760 --> 00:04:45.270
talk about random variables.

00:04:45.270 --> 00:04:48.310
It's a notion that's a
generalization of the notion

00:04:48.310 --> 00:04:51.000
of the usual convergence
of limits of

00:04:51.000 --> 00:04:52.690
a sequence of numbers.

00:04:52.690 --> 00:04:55.710
And once we have our notion of
convergence, we're going to

00:04:55.710 --> 00:05:00.860
see that, indeed, the sample
mean converges to the true

00:05:00.860 --> 00:05:04.380
mean, converges to the expected
value of the X's.

00:05:04.380 --> 00:05:08.840
And this statement is called the
weak law of large numbers.

00:05:08.840 --> 00:05:11.650
The reason it's called the weak
law is because there's

00:05:11.650 --> 00:05:14.640
also a strong law, which is
a statement with the same

00:05:14.640 --> 00:05:16.570
flavor, but with a somewhat
different

00:05:16.570 --> 00:05:18.410
mathematical content.

00:05:18.410 --> 00:05:20.790
But it's a little more abstract,
and we will not be

00:05:20.790 --> 00:05:21.680
getting into this.

00:05:21.680 --> 00:05:26.070
So the weak law is all that
you're going to get.

00:05:26.070 --> 00:05:28.570
All right.

00:05:28.570 --> 00:05:31.050
So now we start our
digression.

00:05:31.050 --> 00:05:38.220
And our first tool will be the
so-called Markov inequality.

00:05:45.050 --> 00:05:48.040
So let's take a random variable
that's always

00:05:48.040 --> 00:05:48.870
non-negative.

00:05:48.870 --> 00:05:51.790
No matter what, it takes
no negative values.

00:05:51.790 --> 00:05:53.710
To keep things simple,
let's assume it's a

00:05:53.710 --> 00:05:55.500
discrete random variable.

00:05:55.500 --> 00:05:59.770
So the expected value is the sum
over all possible values

00:05:59.770 --> 00:06:01.460
that a random variable
can take.

00:06:04.440 --> 00:06:06.600
The values that the random
variable can take,

00:06:06.600 --> 00:06:10.850
weighted according to their
corresponding probabilities.

00:06:10.850 --> 00:06:13.700
Now, this is a sum
over all x's.

00:06:13.700 --> 00:06:16.640
But x takes non-negative
values.

00:06:16.640 --> 00:06:19.780
And the PMF is also
non-negative.

00:06:19.780 --> 00:06:24.310
So if I take a sum over fewer
things, I'm going to get a

00:06:24.310 --> 00:06:25.550
smaller value.

00:06:25.550 --> 00:06:29.180
So the sum when I add over
everything is less than or

00:06:29.180 --> 00:06:33.255
equal to the sum that I will get
if I only add those terms

00:06:33.255 --> 00:06:35.620
that are bigger than
a certain constant.

00:06:38.600 --> 00:06:45.140
Now, if I'm adding over x's that
are bigger than a, the x

00:06:45.140 --> 00:06:48.630
that shows up up there
will always be larger

00:06:48.630 --> 00:06:50.490
than or equal to a.

00:06:50.490 --> 00:06:52.370
So we get this inequality.

00:06:58.170 --> 00:06:59.980
And now, a is a constant.

00:06:59.980 --> 00:07:02.870
I can pull it outside
the summation.

00:07:02.870 --> 00:07:05.320
And then I'm left with the
probabilities of all the x's

00:07:05.320 --> 00:07:06.990
that are bigger than a.

00:07:06.990 --> 00:07:08.850
And that's just the
probability of

00:07:08.850 --> 00:07:10.250
being bigger than a.
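
Written out as one chain, the derivation just given is, for a non-negative discrete X and a constant a > 0:

```latex
\mathbf{E}[X] \;=\; \sum_{x} x\,p_X(x)
\;\ge\; \sum_{x \ge a} x\,p_X(x)
\;\ge\; \sum_{x \ge a} a\,p_X(x)
\;=\; a\,\mathbf{P}(X \ge a)
```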

00:07:15.540 --> 00:07:18.140
OK, so that's the Markov
inequality.

00:07:18.140 --> 00:07:23.800
It basically tells us that the
expected value is larger than

00:07:23.800 --> 00:07:26.240
or equal to this number.

00:07:26.240 --> 00:07:30.260
It relates expected values
to probabilities.

00:07:30.260 --> 00:07:34.660
It tells us that if the expected
value is small, then

00:07:34.660 --> 00:07:39.250
the probability that x is big
is also going to be small.

00:07:39.250 --> 00:07:42.240
So it translates a statement
about smallness of expected

00:07:42.240 --> 00:07:46.205
values to a statement about
smallness of probabilities.
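
As a quick sanity check of the inequality (with an arbitrary made-up distribution, not one from the lecture):

```python
# An arbitrary non-negative discrete random variable (made-up numbers).
values = [0, 1, 2, 5, 10]
probs = [0.3, 0.3, 0.2, 0.15, 0.05]

expected_x = sum(v * p for v, p in zip(values, probs))    # E[X]

a = 5.0
p_tail = sum(p for v, p in zip(values, probs) if v >= a)  # P(X >= a)

# Markov: E[X] >= a * P(X >= a)
print(expected_x, a * p_tail)
```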

00:07:49.020 --> 00:07:49.930
OK.

00:07:49.930 --> 00:07:54.210
What we actually need is a
somewhat different version of

00:07:54.210 --> 00:07:57.240
this same statement.

00:07:57.240 --> 00:08:03.010
And what we're going to do is to
apply this inequality to a

00:08:03.010 --> 00:08:08.150
non-negative random variable
of a special type.

00:08:08.150 --> 00:08:13.330
And you can think of applying
this same calculation to a

00:08:13.330 --> 00:08:18.800
random variable of this form, (X
minus mu)-squared, where mu

00:08:18.800 --> 00:08:21.870
is the expected value of X.

00:08:21.870 --> 00:08:24.075
Now, this is a non-negative
random variable.

00:08:35.419 --> 00:08:37.919
So, the expected value of this
random variable, which is the

00:08:37.919 --> 00:08:42.220
variance, by following the same
thinking as we had in

00:08:42.220 --> 00:08:52.880
that derivation up to there, is
bigger than the probability

00:08:52.880 --> 00:08:58.210
that this random variable
is bigger than some--

00:08:58.210 --> 00:09:04.760
let me use a-squared
instead of an a--

00:09:04.760 --> 00:09:06.585
times the value a-squared.

00:09:12.420 --> 00:09:16.310
So now of course, this
probability is the same as the

00:09:16.310 --> 00:09:23.440
probability that the absolute
value of X minus mu is bigger

00:09:23.440 --> 00:09:27.190
than a, times a-squared.

00:09:27.190 --> 00:09:34.860
And this side is equal to the
variance of X. So this relates

00:09:34.860 --> 00:09:40.890
the variance of X to the
probability that our random

00:09:40.890 --> 00:09:45.200
variable is far away
from its mean.

00:09:45.200 --> 00:09:50.590
If the variance is small, then
it means that the probability

00:09:50.590 --> 00:09:54.635
of being far away from the
mean is also small.
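
The same kind of exact check works for this Chebyshev form, P(|X - mu| >= a) <= var(X)/a^2, again with a made-up distribution:

```python
# An arbitrary discrete random variable, this time with negative values too.
values = [-2, 0, 1, 4]
probs = [0.2, 0.4, 0.3, 0.1]

mu = sum(v * p for v, p in zip(values, probs))               # E[X]
var = sum((v - mu) ** 2 * p for v, p in zip(values, probs))  # var(X)

a = 2.0
p_far = sum(p for v, p in zip(values, probs) if abs(v - mu) >= a)

# Chebyshev: P(|X - mu| >= a) <= var(X) / a^2
print(p_far, var / a ** 2)
```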

00:09:57.240 --> 00:10:02.220
So I derived this by applying
the Markov inequality to this

00:10:02.220 --> 00:10:04.950
particular non-negative
random variable.

00:10:04.950 --> 00:10:09.500
Or just to reinforce, perhaps,
the message, and increase your

00:10:09.500 --> 00:10:13.450
confidence in this inequality,
let's just look at the

00:10:13.450 --> 00:10:16.980
derivation once more, where I'm
going, here, to start from

00:10:16.980 --> 00:10:20.890
first principles, but use the
same idea as the one that was

00:10:20.890 --> 00:10:23.480
used in the proof out here.

00:10:23.480 --> 00:10:23.685
OK.

00:10:23.685 --> 00:10:26.920
So just for variety, now let's
think of X as being a

00:10:26.920 --> 00:10:28.760
continuous random variable.

00:10:28.760 --> 00:10:31.520
The derivation is the same
whether it's discrete or

00:10:31.520 --> 00:10:32.510
continuous.

00:10:32.510 --> 00:10:35.990
So by definition, the variance
is the integral, is this

00:10:35.990 --> 00:10:38.130
particular integral.

00:10:38.130 --> 00:10:43.920
Now, the integral is going to
become smaller if I integrate,

00:10:43.920 --> 00:10:47.130
instead of integrating over
the full range, I only

00:10:47.130 --> 00:10:51.070
integrate over x's that are
far away from the mean.

00:10:51.070 --> 00:10:52.700
So mu is the mean.

00:10:52.700 --> 00:10:54.345
Think of c as some big number.

00:10:59.670 --> 00:11:02.210
These are x's that are far
away from the mean to the

00:11:02.210 --> 00:11:05.410
left, from minus infinity
to mu minus c.

00:11:05.410 --> 00:11:09.030
And these are the x's that are
far away from the mean on the

00:11:09.030 --> 00:11:11.210
positive side.

00:11:11.210 --> 00:11:13.420
So by integrating over
less stuff, I'm

00:11:13.420 --> 00:11:15.580
getting a smaller integral.

00:11:15.580 --> 00:11:21.970
Now, for any x in this range,
this distance, x minus mu, is

00:11:21.970 --> 00:11:23.220
at least c.

00:11:23.220 --> 00:11:26.320
So that squared is at
least c squared.

00:11:26.320 --> 00:11:28.910
So this term over this
range of integration

00:11:28.910 --> 00:11:30.520
is at least c squared.

00:11:30.520 --> 00:11:33.250
So I can take it outside
the integral.

00:11:33.250 --> 00:11:36.400
And I'm left just with the
integral of the density.

00:11:36.400 --> 00:11:38.480
Same thing on the other side.

00:11:38.480 --> 00:11:41.770
And so what factors out is
this term c squared.

00:11:41.770 --> 00:11:45.360
And inside, we're left with the
probability of being to

00:11:45.360 --> 00:11:49.060
the left of mu minus c, and then
the probability of being

00:11:49.060 --> 00:11:52.310
to the right of mu plus c,
which is the same as the

00:11:52.310 --> 00:11:55.370
probability that the absolute
value of the distance from the

00:11:55.370 --> 00:11:58.770
mean is larger than
or equal to c.
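
In symbols, the continuous-case derivation just sketched is:

```latex
\operatorname{var}(X)
= \int_{-\infty}^{\infty} (x-\mu)^2 f_X(x)\,dx
\;\ge\; \int_{-\infty}^{\mu-c} (x-\mu)^2 f_X(x)\,dx
      + \int_{\mu+c}^{\infty} (x-\mu)^2 f_X(x)\,dx
\;\ge\; c^2\,\mathbf{P}\bigl(|X-\mu| \ge c\bigr)
```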

00:11:58.770 --> 00:12:04.820
So that's the same inequality
that we proved there, except

00:12:04.820 --> 00:12:06.060
that here I'm using c.

00:12:06.060 --> 00:12:10.530
There I used a, but it's
exactly the same one.

00:12:10.530 --> 00:12:12.960
This inequality is maybe easier
to understand if you

00:12:12.960 --> 00:12:16.790
take that term and send it
to the other side and

00:12:16.790 --> 00:12:18.780
write it in this form.

00:12:18.780 --> 00:12:20.010
What does it tell us?

00:12:20.010 --> 00:12:25.750
It tells us that if c is a big
number, it tells us that the

00:12:25.750 --> 00:12:30.750
probability of being more than
c away from the mean is going

00:12:30.750 --> 00:12:32.330
to be a small number.

00:12:32.330 --> 00:12:34.780
When c is big, this is small.

00:12:34.780 --> 00:12:35.970
Now, this is intuitive.

00:12:35.970 --> 00:12:38.290
The variance is a measure
of the spread of the

00:12:38.290 --> 00:12:40.960
distribution, how wide it is.

00:12:40.960 --> 00:12:43.960
It tells us that if the
variance is small, the

00:12:43.960 --> 00:12:46.320
distribution is not very wide.

00:12:46.320 --> 00:12:49.020
And mathematically, this
translates to this statement

00:12:49.020 --> 00:12:52.360
that when the variance is small,
the probability of

00:12:52.360 --> 00:12:54.880
being far away is going
to be small.

00:12:54.880 --> 00:12:58.370
And the further away you're
looking, that is, if c is a

00:12:58.370 --> 00:13:00.330
bigger number, that probability

00:13:00.330 --> 00:13:01.765
also becomes small.

00:13:04.930 --> 00:13:07.880
Maybe an even more intuitive way
to think about the content

00:13:07.880 --> 00:13:13.230
of this inequality is to,
instead of c, use the number

00:13:13.230 --> 00:13:16.910
k sigma, where k is positive
and sigma is

00:13:16.910 --> 00:13:18.530
the standard deviation.

00:13:18.530 --> 00:13:22.670
So let's just plug k sigma
in the place of c.

00:13:22.670 --> 00:13:25.300
So this becomes
(k sigma)-squared.

00:13:25.300 --> 00:13:27.130
These sigma squared's cancel.

00:13:27.130 --> 00:13:29.770
We're left with 1
over k-squared.

00:13:29.770 --> 00:13:31.690
Now, what is this?

00:13:31.690 --> 00:13:36.260
This is the event that you are
k standard deviations away

00:13:36.260 --> 00:13:37.770
from the mean.

00:13:37.770 --> 00:13:40.600
So for example, this statement
here tells you that if you

00:13:40.600 --> 00:13:44.900
look at the test scores from a
quiz, what fraction of the

00:13:44.900 --> 00:13:49.900
class are 3 standard deviations
away from the mean?

00:13:49.900 --> 00:13:53.000
It's possible, but it's not
going to be a lot of people.

00:13:53.000 --> 00:13:57.930
It's going to be at most, 1/9
of the class that can be 3

00:13:57.930 --> 00:14:02.190
standard deviations or more
away from the mean.
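
The 1/k^2 form is easy to encode; here k = 3 reproduces the 1/9 bound from the quiz example (the function name is mine, not the lecture's):

```python
def chebyshev_bound(k):
    """Upper bound on P(|X - mu| >= k sigma): plug c = k sigma into
    P(|X - mu| >= c) <= sigma^2 / c^2; the sigma^2's cancel."""
    return 1.0 / k ** 2

# At most 1/9 of a class can score 3 or more standard deviations from the mean.
print(chebyshev_bound(3))
```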

00:14:02.190 --> 00:14:05.250
So the Chebyshev inequality
is a really useful one.

00:14:07.860 --> 00:14:11.300
It comes in handy whenever you
want to relate probabilities

00:14:11.300 --> 00:14:12.800
and expected values.

00:14:12.800 --> 00:14:16.390
So if you know that your
expected values or, in

00:14:16.390 --> 00:14:19.260
particular, that your variance
is small, this tells you

00:14:19.260 --> 00:14:23.080
something about tail
probabilities.

00:14:23.080 --> 00:14:25.530
So this is the end of our
first digression.

00:14:25.530 --> 00:14:28.320
We have this inequality
in our hands.

00:14:28.320 --> 00:14:31.170
Our second digression is
to talk about limits.

00:14:34.680 --> 00:14:37.190
We want to eventually talk
about limits of random

00:14:37.190 --> 00:14:39.750
variables, but as a warm up,
we're going to start with

00:14:39.750 --> 00:14:42.440
limits of sequences.

00:14:42.440 --> 00:14:47.670
So you're given a sequence
of numbers, a1,

00:14:47.670 --> 00:14:50.500
a2, a3, and so on.

00:14:50.500 --> 00:14:54.160
And we want to define the
notion that a sequence

00:14:54.160 --> 00:14:56.470
converges to a number.

00:14:56.470 --> 00:15:04.710
You sort of know what this
means, but let's just go

00:15:04.710 --> 00:15:06.510
through it some more.

00:15:06.510 --> 00:15:09.890
So here's a.

00:15:09.890 --> 00:15:16.200
We have our sequence of
values as n increases.

00:15:16.200 --> 00:15:20.290
What we mean by the sequence
converging to a is

00:15:20.290 --> 00:15:23.550
that when you look at those
values, they get closer and

00:15:23.550 --> 00:15:25.140
closer to a.

00:15:25.140 --> 00:15:29.570
So this value here is your
typical a sub n.

00:15:29.570 --> 00:15:33.880
They get closer and closer to
a, and they stay closer.

00:15:33.880 --> 00:15:36.860
So let's try to make
that more precise.

00:15:36.860 --> 00:15:40.750
What it means is let's
fix a sense of what

00:15:40.750 --> 00:15:42.250
it means to be close.

00:15:42.250 --> 00:15:47.540
Let me look at an interval that
goes from a - epsilon to

00:15:47.540 --> 00:15:50.340
a + epsilon.

00:15:50.340 --> 00:15:57.280
Then if my sequence converges
to a, this means that as n

00:15:57.280 --> 00:16:02.810
increases, eventually the values
of the sequence that I

00:16:02.810 --> 00:16:06.420
get stay inside this band.

00:16:06.420 --> 00:16:10.430
Since they converge to a, this
means that eventually they

00:16:10.430 --> 00:16:14.130
will be smaller than
a + epsilon and

00:16:14.130 --> 00:16:16.310
bigger than a - epsilon.

00:16:16.310 --> 00:16:21.320
So convergence means that
given a band of positive

00:16:21.320 --> 00:16:25.690
length around the number a,
the values of the sequence

00:16:25.690 --> 00:16:28.720
that you get eventually
get inside and

00:16:28.720 --> 00:16:31.300
stay inside that band.

00:16:31.300 --> 00:16:34.060
So that's sort of the picture
definition of

00:16:34.060 --> 00:16:35.840
what convergence means.

00:16:35.840 --> 00:16:40.460
So now let's translate this into
a mathematical statement.

00:16:40.460 --> 00:16:45.610
Given a band of positive length,
no matter how wide

00:16:45.610 --> 00:16:50.690
that band is or how narrow it
is, so for every epsilon

00:16:50.690 --> 00:16:56.500
positive, eventually the
sequence gets inside the band.

00:16:56.500 --> 00:16:58.460
What does eventually mean?

00:16:58.460 --> 00:17:01.410
There exists a time,
so that after that

00:17:01.410 --> 00:17:03.510
time something happens.

00:17:03.510 --> 00:17:07.230
And the something that happens
is that after that time, we

00:17:07.230 --> 00:17:09.520
are inside that band.

00:17:09.520 --> 00:17:12.060
So this is a formal mathematical
definition, which

00:17:12.060 --> 00:17:17.250
actually translates what I was
telling in the wordy way

00:17:17.250 --> 00:17:20.140
before, and showing in
terms of the picture.

00:17:20.140 --> 00:17:25.140
Given a certain band, even if
it's narrow, eventually, after

00:17:25.140 --> 00:17:28.520
a certain time n0, the values
of the sequence are going to

00:17:28.520 --> 00:17:30.240
stay inside this band.

00:17:30.240 --> 00:17:35.770
Now, if I were to take epsilon
to be very small, this thing

00:17:35.770 --> 00:17:38.130
would still be true that
eventually I'm going to get

00:17:38.130 --> 00:17:42.400
inside of the band, except that
I may have to wait longer

00:17:42.400 --> 00:17:45.770
for the values to
get inside here.

00:17:45.770 --> 00:17:48.400
All right, that's what it means
for a deterministic

00:17:48.400 --> 00:17:51.350
sequence to converge
to something.
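
The epsilon/n0 definition can be exercised on a concrete sequence; a_n = 1/n converging to 0 is my own example, not one from the lecture:

```python
def find_n0(a, target, epsilon, n_max=10_000_000):
    """Smallest n0 with |a(n) - target| <= epsilon; for a monotone sequence
    like 1/n, once a term enters the band, all later terms stay inside."""
    for n in range(1, n_max + 1):
        if abs(a(n) - target) <= epsilon:
            return n
    raise ValueError("band not reached within n_max terms")

# a_n = 1/n converges to 0: for the band of half-width 0.01 around 0,
# the sequence enters at n0 = 100; for a narrower band you wait longer.
print(find_n0(lambda n: 1.0 / n, 0.0, 0.01))
print(find_n0(lambda n: 1.0 / n, 0.0, 0.001))
```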

00:17:51.350 --> 00:17:54.150
Now, how about random
variables.

00:17:54.150 --> 00:17:57.340
What does it mean for a sequence
of random variables

00:17:57.340 --> 00:18:00.280
to converge to a number?

00:18:00.280 --> 00:18:02.600
We're just going to twist
a little bit of the word

00:18:02.600 --> 00:18:03.310
definition.

00:18:03.310 --> 00:18:08.390
For numbers, we said that
eventually the numbers get

00:18:08.390 --> 00:18:10.180
inside that band.

00:18:10.180 --> 00:18:13.270
But if instead of numbers we
have random variables with a

00:18:13.270 --> 00:18:18.080
certain distribution, so here
instead of a_n we're dealing

00:18:18.080 --> 00:18:20.750
with a random variable that has
a distribution, let's say,

00:18:20.750 --> 00:18:26.650
of this kind, what we want is
that this distribution gets

00:18:26.650 --> 00:18:31.460
inside this band, so it gets
concentrated inside here.

00:18:31.460 --> 00:18:33.150
What does it mean that
the distribution

00:18:33.150 --> 00:18:34.850
gets inside this band?

00:18:34.850 --> 00:18:36.910
I mean a random variable
has a distribution.

00:18:36.910 --> 00:18:40.130
It may have some tails, so
maybe not the entire

00:18:40.130 --> 00:18:43.920
distribution gets concentrated
inside of the band.

00:18:43.920 --> 00:18:48.660
But we want that more and more
of this distribution is

00:18:48.660 --> 00:18:50.820
concentrated in this band.

00:18:50.820 --> 00:18:51.730
So that --

00:18:51.730 --> 00:18:53.130
in a sense that --

00:18:53.130 --> 00:18:57.070
the probability of falling
outside the band converges to

00:18:57.070 --> 00:19:00.410
0 -- becomes smaller
and smaller.

00:19:00.410 --> 00:19:05.660
So in words, we're going to say
that the sequence of random

00:19:05.660 --> 00:19:09.070
variables or a sequence of
probability distributions,

00:19:09.070 --> 00:19:12.060
that would be the same,
converges to a particular

00:19:12.060 --> 00:19:15.070
number a if the following
is true.

00:19:15.070 --> 00:19:22.320
If I consider a small band
around a, then the probability

00:19:22.320 --> 00:19:26.300
that my random variable falls
outside this band, which is

00:19:26.300 --> 00:19:29.530
the area under this curve,
this probability becomes

00:19:29.530 --> 00:19:32.620
smaller and smaller as
n goes to infinity.

00:19:32.620 --> 00:19:35.370
The probability of being
outside this band

00:19:35.370 --> 00:19:38.570
converges to 0.

00:19:38.570 --> 00:19:40.620
So that's the intuitive idea.

00:19:40.620 --> 00:19:45.080
So in the beginning, maybe our
distribution is sitting

00:19:45.080 --> 00:19:46.590
everywhere.

00:19:46.590 --> 00:19:49.490
As n increases, the distribution
starts to get

00:19:49.490 --> 00:19:51.560
concentrating inside the band.

00:19:51.560 --> 00:19:57.300
When n is even bigger, our
distribution is even more

00:19:57.300 --> 00:20:00.310
inside that band, so that these
outside probabilities

00:20:00.310 --> 00:20:02.460
become smaller and smaller.

00:20:02.460 --> 00:20:03.860
So the corresponding
mathematical

00:20:03.860 --> 00:20:06.760
statement is the following.

00:20:06.760 --> 00:20:13.730
I fix a band around
a, a +/- epsilon.

00:20:13.730 --> 00:20:18.170
Given that band, the probability
of falling outside

00:20:18.170 --> 00:20:21.350
this band, this probability
converges to 0.

00:20:21.350 --> 00:20:23.600
Or another way to say it is
that the limit of this

00:20:23.600 --> 00:20:26.560
probability is equal to 0.

00:20:26.560 --> 00:20:29.720
If you were to translate this
into a complete mathematical

00:20:29.720 --> 00:20:31.800
statement, you would have
to write down the

00:20:31.800 --> 00:20:34.150
following messy thing.

00:20:34.150 --> 00:20:37.220
For every epsilon positive --

00:20:37.220 --> 00:20:39.480
that's this statement --

00:20:39.480 --> 00:20:41.240
the limit is 0.

00:20:41.240 --> 00:20:44.610
What does it mean that the
limit of something is 0?

00:20:44.610 --> 00:20:47.670
We flip back to the
previous slide.

00:20:47.670 --> 00:20:48.110
Why?

00:20:48.110 --> 00:20:51.430
Because a probability
is a number.

00:20:51.430 --> 00:20:54.720
So here we're talking about
a sequence of numbers

00:20:54.720 --> 00:20:56.340
convergent to 0.

00:20:56.340 --> 00:20:58.190
What does it mean for a
sequence of numbers to

00:20:58.190 --> 00:20:59.180
converge to 0?

00:20:59.180 --> 00:21:05.320
It means that for any epsilon
prime positive, there exists

00:21:05.320 --> 00:21:11.230
some n0 such that for every
n bigger than n0 the

00:21:11.230 --> 00:21:12.770
following is true --

00:21:12.770 --> 00:21:16.450
that this probability
is less than or

00:21:16.450 --> 00:21:17.860
equal to epsilon prime.

00:21:20.610 --> 00:21:27.660
So the mathematical statement
is a little hard to parse.

00:21:27.660 --> 00:21:32.270
You quantify over every size of
that band, and then you take the

00:21:32.270 --> 00:21:34.990
definition of what it means for
the limit of a sequence of

00:21:34.990 --> 00:21:37.720
numbers to converge to 0.

00:21:37.720 --> 00:21:42.340
But it's a lot easier to
describe this in words and,

00:21:42.340 --> 00:21:45.010
basically, think in terms
of this picture.

00:21:45.010 --> 00:21:48.690
That as n increases, the
probability of falling outside

00:21:48.690 --> 00:21:51.305
those bands just become
smaller and smaller.

00:21:51.305 --> 00:21:56.590
So the statement is that our
distribution gets concentrated

00:21:56.590 --> 00:22:01.340
in arbitrarily narrow little
bands around that

00:22:01.340 --> 00:22:05.050
particular number a.

00:22:05.050 --> 00:22:05.350
OK.

00:22:05.350 --> 00:22:07.790
So let's look at an example.

00:22:07.790 --> 00:22:11.660
Suppose a random variable Yn has
a discrete distribution of

00:22:11.660 --> 00:22:13.720
this particular type.

00:22:13.720 --> 00:22:17.150
Does it converge to something?

00:22:17.150 --> 00:22:19.570
Well, the probability
distribution of this random

00:22:19.570 --> 00:22:22.370
variable gets concentrated
at 0 --

00:22:22.370 --> 00:22:26.520
there's more and more
probability of being at 0.

00:22:26.520 --> 00:22:29.710
If I fix a band around 0 --

00:22:29.710 --> 00:22:34.850
so if I take the band from minus
epsilon to epsilon and

00:22:34.850 --> 00:22:36.520
look at that band--

00:22:36.520 --> 00:22:42.350
the probability of falling
outside this band is 1/n.

00:22:42.350 --> 00:22:45.780
As n goes to infinity, that
probability goes to 0.

00:22:45.780 --> 00:22:50.550
So in this case, we do
have convergence.

00:22:50.550 --> 00:22:56.780
And Yn converges in probability
to the number 0.
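
The tail probability in this example can be written down exactly (my own encoding of the distribution on the slide: Yn = n with probability 1/n, and Yn = 0 otherwise):

```python
def p_outside_band(n, epsilon):
    """P(|Yn - 0| > epsilon) for Yn = n w.p. 1/n and Yn = 0 otherwise.
    For any epsilon with 0 < epsilon < n, the only value outside the
    band around 0 is the value n itself, which has probability 1/n."""
    assert 0 < epsilon < n
    return 1.0 / n

# The tail probability shrinks to 0: convergence in probability to 0.
for n in (10, 100, 1000):
    print(n, p_outside_band(n, 0.5))
```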

00:22:56.780 --> 00:23:00.310
So this just captures the
fact, obvious from this

00:23:00.310 --> 00:23:03.680
picture, that more and more of
our probability distribution

00:23:03.680 --> 00:23:07.630
gets concentrated around 0,
as n goes to infinity.

00:23:07.630 --> 00:23:10.330
Now, an interesting thing to
notice is the following, that

00:23:10.330 --> 00:23:15.390
even though Yn converges to 0,
if you were to write down the

00:23:15.390 --> 00:23:20.440
expected value for Yn,
what would it be?

00:23:20.440 --> 00:23:24.410
It's going to be n times the
probability of this value,

00:23:24.410 --> 00:23:26.240
which is 1/n.

00:23:26.240 --> 00:23:29.230
So the expected value
turns out to be 1.

00:23:29.230 --> 00:23:34.300
And if you were to look at the
expected value of Yn-squared,

00:23:34.300 --> 00:23:38.190
this would be 0

00:23:38.190 --> 00:23:41.770
times this probability, and
then n-squared times this

00:23:41.770 --> 00:23:45.720
probability, which
is equal to n.

00:23:45.720 --> 00:23:49.850
And this actually goes
to infinity.

00:23:49.850 --> 00:23:53.580
So we have this, perhaps,
strange situation where a

00:23:53.580 --> 00:23:58.030
random variable goes to 0, but
the expected value of this

00:23:58.030 --> 00:24:01.140
random variable does
not go to 0.

00:24:01.140 --> 00:24:04.570
And the second moment of that
random variable actually goes

00:24:04.570 --> 00:24:05.790
to infinity.

00:24:05.790 --> 00:24:08.740
So this tells us that
convergence in probability

00:24:08.740 --> 00:24:11.380
tells you something,
but it doesn't tell

00:24:11.380 --> 00:24:13.310
you the whole story.

00:24:13.310 --> 00:24:17.260
Convergence to 0 of a random
variable doesn't imply

00:24:17.260 --> 00:24:20.630
anything about convergence
of expected values or of

00:24:20.630 --> 00:24:23.420
variances and so on.
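
The two moments just computed come straight from that two-point distribution (again Yn = n with probability 1/n, and 0 otherwise):

```python
def moments(n):
    """E[Yn] and E[Yn^2] for Yn = n w.p. 1/n and Yn = 0 otherwise."""
    pn = 1.0 / n
    mean = n * pn          # always 1: the rare value n exactly offsets 1/n
    second = n ** 2 * pn   # equals n: the second moment blows up
    return mean, second

# Mean stays 1 and the second moment grows, even though Yn -> 0 in probability.
for n in (10, 100, 1000):
    print(n, moments(n))
```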

00:24:23.420 --> 00:24:26.060
So the reason is that
convergence in probability

00:24:26.060 --> 00:24:28.470
tells you that this
tail probability

00:24:28.470 --> 00:24:30.400
here is very small.

00:24:30.400 --> 00:24:34.440
But it doesn't tell you how
far does this tail go.

00:24:34.440 --> 00:24:39.390
As in this example, the tail
probability is small, but that

00:24:39.390 --> 00:24:43.410
tail is far away, so it
gives a disproportionate

00:24:43.410 --> 00:24:45.950
contribution to the expected
value or the

00:24:45.950 --> 00:24:47.200
expected value squared.

00:24:53.340 --> 00:24:53.650
OK.

00:24:53.650 --> 00:24:59.000
So now we've got everything that
we need to go back to the

00:24:59.000 --> 00:25:02.900
sample mean and study
its properties.

00:25:02.900 --> 00:25:05.460
So the setting is
that we have a

00:25:05.460 --> 00:25:07.320
sequence of random variables.

00:25:07.320 --> 00:25:08.350
They're independent.

00:25:08.350 --> 00:25:10.450
They have the same
distribution.

00:25:10.450 --> 00:25:12.790
And we assume that they
have a finite mean

00:25:12.790 --> 00:25:14.480
and a finite variance.

00:25:14.480 --> 00:25:18.430
We're looking at the
sample mean.

00:25:18.430 --> 00:25:21.670
Now in principle, you can
calculate the probability

00:25:21.670 --> 00:25:25.090
distribution of the sample mean,
because we know how to

00:25:25.090 --> 00:25:26.950
find the distributions
of sums of

00:25:26.950 --> 00:25:28.320
independent random variables.

00:25:28.320 --> 00:25:31.030
You use the convolution
formula over and over.

00:25:31.030 --> 00:25:32.870
But this is pretty
complicated, so

00:25:32.870 --> 00:25:34.730
let's not look at that.

00:25:34.730 --> 00:25:38.920
Let's just look at expected
values, variances, and the

00:25:38.920 --> 00:25:42.610
probabilities that the sample
mean is far away

00:25:42.610 --> 00:25:44.310
from the true mean.

00:25:44.310 --> 00:25:47.470
So what is the expected value
of this random variable?

00:25:47.470 --> 00:25:51.260
The expected value of a sum of
random variables is the sum of

00:25:51.260 --> 00:25:52.510
the expected values.

00:25:56.320 --> 00:26:00.320
And then we have this factor
of n in the denominator.

00:26:00.320 --> 00:26:07.040
Each one of these expected
values is mu, so we get mu.

00:26:07.040 --> 00:26:13.960
So the sample mean, the average
value of this Mn in

00:26:13.960 --> 00:26:18.570
expectation is the same as
the true mean inside our

00:26:18.570 --> 00:26:20.620
population.

00:26:20.620 --> 00:26:26.560
Now here, this is a fine
conceptual point -- there are two

00:26:26.560 --> 00:26:29.920
kinds of averages involved
when you write down this

00:26:29.920 --> 00:26:31.280
expression.

00:26:31.280 --> 00:26:33.310
We understand that
expectations are

00:26:33.310 --> 00:26:36.470
some kind of average.

00:26:36.470 --> 00:26:40.250
The sample mean is also an
average over the values that

00:26:40.250 --> 00:26:42.240
we have observed.

00:26:42.240 --> 00:26:45.220
But it's two different
kinds of averages.

00:26:45.220 --> 00:26:50.460
The sample mean is the average
of the heights of the penguins

00:26:50.460 --> 00:26:54.330
that we collected over
a single expedition.

00:26:54.330 --> 00:26:59.600
The expected value is to be
thought of as follows: my

00:26:59.600 --> 00:27:02.060
probabilistic experiment
is one expedition

00:27:02.060 --> 00:27:04.160
to the South Pole.

00:27:04.160 --> 00:27:09.760
Expected value here means
thinking of the average over a

00:27:09.760 --> 00:27:12.620
huge number of expeditions.

00:27:12.620 --> 00:27:16.270
So my expedition is a random
experiment, I collect random

00:27:16.270 --> 00:27:18.520
samples, and I record Mn.

00:27:21.230 --> 00:27:27.170
The average result of an
expedition is what we would

00:27:27.170 --> 00:27:31.060
get if we were to carry out
a zillion expeditions and

00:27:31.060 --> 00:27:35.050
average the averages that we
get at each particular

00:27:35.050 --> 00:27:36.090
expedition.

00:27:36.090 --> 00:27:39.860
So this Mn is the average during
a single expedition.

00:27:39.860 --> 00:27:44.090
This expectation is the average
over an imagined

00:27:44.090 --> 00:27:46.125
infinite sequence
of expeditions.

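A quick simulation makes the two kinds of averages concrete; the population mean and spread below are made-up numbers for illustration:

```python
import random

random.seed(0)
MU = 1.0   # hypothetical true mean height (illustrative)

def expedition(n):
    """One expedition: sample n 'penguins' and return the sample mean M_n."""
    heights = [random.gauss(MU, 0.3) for _ in range(n)]
    return sum(heights) / n

one_trip = expedition(50)                                     # a single random M_n
many_trips = sum(expedition(50) for _ in range(2000)) / 2000  # average over imagined expeditions
print(one_trip, many_trips)  # the second number hugs MU much more tightly
```

The first average is random from trip to trip; averaging it over many trips recovers the expectation.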
00:27:49.760 --> 00:27:52.830
And of course, the other thing
to always keep in mind is that

00:27:52.830 --> 00:27:56.910
expectations give you numbers,
whereas the sample mean is

00:27:56.910 --> 00:28:00.210
actually a random variable.

00:28:00.210 --> 00:28:00.486
All right.

00:28:00.486 --> 00:28:03.310
So this random variable,
how random is it?

00:28:03.310 --> 00:28:05.610
How big is its variance?

00:28:05.610 --> 00:28:10.040
So the variance of a sum of
random variables is the sum of

00:28:10.040 --> 00:28:12.370
the variances.

00:28:12.370 --> 00:28:16.610
But since we're dividing by n,
when you calculate variances

00:28:16.610 --> 00:28:19.580
this brings in a factor
of n-squared.

00:28:19.580 --> 00:28:21.215
So the variance is sigma-squared
over n.

00:28:24.340 --> 00:28:26.870
And in particular, the variance
of the sample mean

00:28:26.870 --> 00:28:28.620
becomes smaller and smaller.

00:28:28.620 --> 00:28:31.170
It means that when you estimate
that average height

00:28:31.170 --> 00:28:34.570
of penguins, if you take a
large sample, then your

00:28:34.570 --> 00:28:37.530
estimate is not going
to be too random.

00:28:37.530 --> 00:28:41.120
The randomness in your estimates
becomes small if you

00:28:41.120 --> 00:28:43.250
have a large sample size.

00:28:43.250 --> 00:28:46.090
Having a large sample size kind
of removes the randomness

00:28:46.090 --> 00:28:47.930
from your experiment.

00:28:47.930 --> 00:28:52.690
Now let's apply the Chebyshev
inequality to say something

00:28:52.690 --> 00:28:56.020
about tail probabilities
for the sample mean.

00:28:56.020 --> 00:28:59.610
The probability that you are
more than epsilon away from

00:28:59.610 --> 00:29:03.650
the true mean is less than or
equal to the variance of this

00:29:03.650 --> 00:29:07.030
quantity divided by this
number squared.

00:29:07.030 --> 00:29:09.860
So that's just the translation
of the Chebyshev inequality to

00:29:09.860 --> 00:29:12.320
the particular context
we've got here.

00:29:12.320 --> 00:29:13.590
We found the variance.

00:29:13.590 --> 00:29:15.100
It's sigma-squared over n.

00:29:15.100 --> 00:29:18.340
So we end up with
this expression.

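As a sanity check, one can compare the empirical tail probability of the sample mean against the bound sigma-squared over (n times epsilon-squared); the Gaussian samples below are purely illustrative:

```python
import random

random.seed(1)
mu, sigma, n, eps = 0.0, 1.0, 100, 0.3   # illustrative parameters

def sample_mean():
    return sum(random.gauss(mu, sigma) for _ in range(n)) / n

trials = 5000
tail = sum(abs(sample_mean() - mu) >= eps for _ in range(trials)) / trials
bound = sigma**2 / (n * eps**2)   # the Chebyshev bound on the tail probability
print(tail, bound)  # the empirical tail probability sits below the bound
```

Here the bound is loose, which foreshadows the discussion of tighter approximations later in the lecture.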
00:29:18.340 --> 00:29:20.490
So what does this
expression do?

00:29:25.570 --> 00:29:32.370
For any given epsilon, if
I fix epsilon, then this

00:29:32.370 --> 00:29:36.630
probability, which is less
than sigma-squared over n

00:29:36.630 --> 00:29:40.550
epsilon-squared, converges to
0 as n goes to infinity.

00:29:44.730 --> 00:29:48.050
And this is just the definition
of convergence in

00:29:48.050 --> 00:29:49.690
probability.

00:29:49.690 --> 00:29:54.310
If this happens, that the
probability of being more than

00:29:54.310 --> 00:29:57.590
epsilon away from the mean, that
probability goes to 0,

00:29:57.590 --> 00:30:01.510
and this is true no matter how
I choose my epsilon, then by

00:30:01.510 --> 00:30:04.490
definition we have convergence
in probability.

00:30:04.490 --> 00:30:08.050
So we have proved that the
sample mean converges in

00:30:08.050 --> 00:30:11.430
probability to the true mean.

00:30:11.430 --> 00:30:16.210
And this is what the weak law
of large numbers tells us.

00:30:16.210 --> 00:30:21.060
So in some vague sense, it
tells us that the sample

00:30:21.060 --> 00:30:24.350
means, when you take the
average of many, many

00:30:24.350 --> 00:30:28.150
measurements in your sample,
then the sample mean is a good

00:30:28.150 --> 00:30:31.870
estimate of the true mean in the
sense that it approaches

00:30:31.870 --> 00:30:36.380
the true mean as your sample
size increases.

00:30:36.380 --> 00:30:39.220
It approaches the true mean,
but of course in a very

00:30:39.220 --> 00:30:42.540
specific sense, in probability,
according to this

00:30:42.540 --> 00:30:46.550
notion of convergence
that we have used.

00:30:46.550 --> 00:30:51.060
So since we're talking about
sampling, let's go over an

00:30:51.060 --> 00:30:56.150
example, which is the typical
situation faced by someone

00:30:56.150 --> 00:30:58.110
who's constructing a poll.

00:30:58.110 --> 00:31:02.680
So you're interested in some
property of the population.

00:31:02.680 --> 00:31:05.590
So what fraction of
the population

00:31:05.590 --> 00:31:08.380
prefers Coke to Pepsi?

00:31:08.380 --> 00:31:11.080
So there's a number f, which
is that fraction of the

00:31:11.080 --> 00:31:12.460
population.

00:31:12.460 --> 00:31:16.260
And so this is an
exact number.

00:31:16.260 --> 00:31:20.250
So if, out of a population of 100
million, 20 million prefer

00:31:20.250 --> 00:31:25.590
Coke, then f would be 0.2.

00:31:25.590 --> 00:31:27.970
We want to find out what
that fraction is.

00:31:27.970 --> 00:31:30.590
We cannot ask everyone.

00:31:30.590 --> 00:31:34.250
What we're going to do is to
take a random sample of people

00:31:34.250 --> 00:31:37.300
and ask them for their
preferences.

00:31:37.300 --> 00:31:42.690
So the ith person either says
yes for Coke or no.

00:31:42.690 --> 00:31:46.430
And we record that by putting
a 1 each time that we get a

00:31:46.430 --> 00:31:49.160
yes answer.

00:31:49.160 --> 00:31:51.850
And then we form the average
of these x's.

00:31:51.850 --> 00:31:53.070
What is this average?

00:31:53.070 --> 00:31:57.000
It's the number of 1's that
we got divided by n.

00:31:57.000 --> 00:32:02.590
So this is a fraction, but
calculated only on the basis

00:32:02.590 --> 00:32:04.880
of the sample that we have.

00:32:04.880 --> 00:32:10.260
So you can think of this as
being an estimate, f_hat,

00:32:10.260 --> 00:32:13.120
based on the sample
that we have.

00:32:13.120 --> 00:32:17.155
Now, even though we used the
lower case letter here, this

00:32:17.155 --> 00:32:20.590
f_hat is, of course,
a random variable.

00:32:20.590 --> 00:32:23.300
f is a number.

00:32:23.300 --> 00:32:27.570
This is the true fraction in
the overall population.

00:32:27.570 --> 00:32:30.380
f_hat is the estimate
that we get by using

00:32:30.380 --> 00:32:32.300
our particular sample.

00:32:32.300 --> 00:32:32.410
OK.

00:32:32.410 --> 00:32:38.760
So your boss told you, I need to
know what f is, but go and

00:32:38.760 --> 00:32:40.150
do some sampling.

00:32:40.150 --> 00:32:42.720
What are you going to respond?

00:32:42.720 --> 00:32:46.360
Unless I ask everyone in the
whole population, there's no

00:32:46.360 --> 00:32:51.180
way for me to know f exactly.

00:32:51.180 --> 00:32:51.890
Right?

00:32:51.890 --> 00:32:54.560
There's no way.

00:32:54.560 --> 00:32:59.040
OK, so the boss tells you, well
OK, then tell me f

00:32:59.040 --> 00:33:00.860
within an accuracy.

00:33:00.860 --> 00:33:10.910
I want an answer from you,
that's your answer, which is

00:33:10.910 --> 00:33:14.930
close to the correct answer
within 1 percentage point.

00:33:14.930 --> 00:33:20.260
So if the true f is 0.4, your
answer should be somewhere

00:33:20.260 --> 00:33:22.500
between 0.39 and 0.41.

00:33:22.500 --> 00:33:25.520
I want a really accurate
answer.

00:33:25.520 --> 00:33:27.580
What are you going to say?

00:33:27.580 --> 00:33:31.360
Well, there's no guarantee
that my answer

00:33:31.360 --> 00:33:33.230
will be within 1 percentage point.

00:33:33.230 --> 00:33:37.320
Maybe I'm unlucky and I just
happen to sample the wrong set

00:33:37.320 --> 00:33:40.450
of people and my answer
comes out to be wrong.

00:33:40.450 --> 00:33:45.800
So I cannot give you a hard
guarantee that this inequality

00:33:45.800 --> 00:33:47.240
will be satisfied.

00:33:47.240 --> 00:33:51.990
But perhaps, I can give you a
guarantee that this inequality

00:33:51.990 --> 00:33:55.680
will be satisfied, this accuracy
requirement will be

00:33:55.680 --> 00:33:59.340
satisfied, with high
confidence.

00:33:59.340 --> 00:34:02.520
That is, there's going to be
a small probability that

00:34:02.520 --> 00:34:04.420
things go wrong, that
I'm unlucky

00:34:04.420 --> 00:34:07.030
and I use a bad sample.

00:34:07.030 --> 00:34:10.750
But leaving aside that small
probability of being unlucky,

00:34:10.750 --> 00:34:13.989
my answer will be accurate
within the accuracy

00:34:13.989 --> 00:34:16.100
requirement that you have.

00:34:16.100 --> 00:34:20.500
So these two numbers are the
usual specs that one has when

00:34:20.500 --> 00:34:22.010
designing polls.

00:34:22.010 --> 00:34:27.370
So this number is the accuracy
that we want.

00:34:27.370 --> 00:34:29.300
It's the desired accuracy.

00:34:29.300 --> 00:34:35.239
And this number has to do with
the confidence that we want.

00:34:35.239 --> 00:34:40.210
So 1 minus that number, we could
call it the confidence

00:34:40.210 --> 00:34:43.500
that we want out
of our sample.

00:34:43.500 --> 00:34:47.820
So this is really 1
minus confidence.

00:34:47.820 --> 00:34:51.830
So now your job is to figure out
how large an n, how large

00:34:51.830 --> 00:34:56.219
a sample should you be using, in
order to satisfy the specs

00:34:56.219 --> 00:34:59.060
that your boss gave you.

00:34:59.060 --> 00:35:02.560
All you know at this stage is
the Chebyshev inequality.

00:35:02.560 --> 00:35:05.210
So you just try to use it.

00:35:05.210 --> 00:35:09.780
The probability of getting an
answer that's more than 0.01

00:35:09.780 --> 00:35:14.780
away from the true answer is, by
Chebyshev's inequality, the

00:35:14.780 --> 00:35:20.170
variance of this random variable
divided by this

00:35:20.170 --> 00:35:21.540
number squared.

00:35:21.540 --> 00:35:25.870
The variance, as we argued
a little earlier, is the

00:35:25.870 --> 00:35:29.190
variance of the x's
divided by n.

00:35:29.190 --> 00:35:31.830
So we get this expression.

00:35:31.830 --> 00:35:35.230
So we would like this
number to be less

00:35:35.230 --> 00:35:38.330
than or equal to 0.05.

00:35:38.330 --> 00:35:41.620
OK, here we hit a little
bit of a difficulty.

00:35:41.620 --> 00:35:49.040
The variance, (sigma_x)-squared,
what is it?

00:35:49.040 --> 00:35:54.010
(Sigma_x)-squared is, if you
remember the variance of a

00:35:54.010 --> 00:35:58.010
Bernoulli random variable,
is this quantity.

00:35:58.010 --> 00:35:59.730
But we don't know it.

00:35:59.730 --> 00:36:02.880
f is what we're trying to
estimate in the first place.

00:36:02.880 --> 00:36:06.790
So the variance is not known,
so I cannot plug in a number

00:36:06.790 --> 00:36:08.080
inside here.

00:36:08.080 --> 00:36:12.340
What I can do is to be
conservative and use an upper

00:36:12.340 --> 00:36:14.050
bound of the variance.

00:36:14.050 --> 00:36:17.280
How large can this number get?

00:36:17.280 --> 00:36:20.090
Well, you can plot
f times (1-f).

00:36:25.950 --> 00:36:26.750
It's a parabola.

00:36:26.750 --> 00:36:29.420
It has a root at 0 and at 1.

00:36:29.420 --> 00:36:34.450
So the maximum value is going to
be, by symmetry, at 1/2 and

00:36:34.450 --> 00:36:39.350
when f is 1/2, then this
variance becomes 1/4.

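That the worst case sits at f = 1/2 is easy to confirm with a one-line numerical scan:

```python
# f*(1-f) over a fine grid of f in [0, 1]; the parabola peaks at f = 1/2.
values = [(k / 1000) * (1 - k / 1000) for k in range(1001)]
print(max(values))  # 0.25, attained at k = 500, i.e. f = 0.5
```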
00:36:39.350 --> 00:36:42.340
So I don't know
(sigma_x)-squared, but I'm

00:36:42.340 --> 00:36:45.480
going to use the worst case
value for (sigma_x)-squared,

00:36:45.480 --> 00:36:48.480
which is 1/4.

00:36:48.480 --> 00:36:53.320
And this is now an inequality
that I know to be always true.

00:36:53.320 --> 00:36:56.910
I've got my specs, and my specs
tell me that I want this

00:36:56.910 --> 00:36:59.800
number to be less than 0.05.

00:36:59.800 --> 00:37:04.980
And given what I know, the best
thing I can do is to say,

00:37:04.980 --> 00:37:07.860
OK, I'm going to take
this number and make

00:37:07.860 --> 00:37:14.070
it less than 0.05.

00:37:14.070 --> 00:37:20.860
If I choose my n so that this
is less than 0.05, then I'm

00:37:20.860 --> 00:37:24.890
certain that this probability
is also less than 0.05.

00:37:24.890 --> 00:37:28.720
What does it take for this
inequality to be true?

00:37:28.720 --> 00:37:36.370
You can solve for n here, and
you find that to satisfy this

00:37:36.370 --> 00:37:40.780
inequality, n should be larger
than or equal to 50,000.

00:37:40.780 --> 00:37:44.250
So you can just let n
be equal to 50,000.

00:37:44.250 --> 00:37:47.920
So the Chebyshev inequality
tells us that if you take n

00:37:47.920 --> 00:37:51.940
equal to 50,000, then by the
Chebyshev inequality, we're

00:37:51.940 --> 00:37:57.850
guaranteed to satisfy the specs
that we were given.

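The arithmetic that leads to n = 50,000 can be packaged as a small helper; the second call previews the effect of relaxing the accuracy to 3 percentage points, which comes up shortly:

```python
import math

def chebyshev_sample_size(eps, delta, var_bound=0.25):
    """Smallest n with var_bound / (n * eps**2) <= delta, per Chebyshev."""
    return math.ceil(var_bound / (delta * eps**2))

print(chebyshev_sample_size(0.01, 0.05))  # 50000
print(chebyshev_sample_size(0.03, 0.05))  # 5556, roughly a factor of 9 smaller
```

The `var_bound=0.25` default encodes the worst-case Bernoulli variance f(1-f) at f = 1/2.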
00:37:57.850 --> 00:37:57.960
OK.

00:37:57.960 --> 00:38:03.950
Now, 50,000 is a bit of
a large sample size.

00:38:03.950 --> 00:38:05.980
Right?

00:38:05.980 --> 00:38:09.490
If you read anything in the
newspapers where they say so

00:38:09.490 --> 00:38:13.230
many of the voters think this
and that, this was determined

00:38:13.230 --> 00:38:19.830
on the basis of a sample of
1,200 likely voters or so.

00:38:19.830 --> 00:38:23.430
So the numbers that you will
typically see in these news

00:38:23.430 --> 00:38:27.590
items about polling, they
usually involve sample sizes

00:38:27.590 --> 00:38:30.080
of around 1,000 or so.

00:38:30.080 --> 00:38:35.250
You will never see a sample
size of 50,000.

00:38:35.250 --> 00:38:37.230
That's too much.

00:38:37.230 --> 00:38:41.670
So where can we cut
some corners?

00:38:41.670 --> 00:38:46.390
Well, we can cut corners
basically in three places.

00:38:46.390 --> 00:38:49.950
This requirement is a
little too tight.

00:38:49.950 --> 00:38:53.530
Newspaper stories will usually
tell you, we have an accuracy

00:38:53.530 --> 00:38:58.800
of +/- 3 percentage points,
instead of 1 percentage point.

00:38:58.800 --> 00:39:03.770
And because this number comes up
as a square, by making it 3

00:39:03.770 --> 00:39:09.000
percentage points instead of 1
saves you roughly a factor of 10.

00:39:09.000 --> 00:39:12.790
Then, the five percent
confidence, I guess that's

00:39:12.790 --> 00:39:15.180
usually OK.

00:39:15.180 --> 00:39:19.400
If we use that factor of 10
that we gain from here,

00:39:19.400 --> 00:39:23.730
then we get a sample size
of about 5,000.

00:39:23.730 --> 00:39:25.980
And that's, again,
a little too big.

00:39:25.980 --> 00:39:28.140
So where can we fix things?

00:39:28.140 --> 00:39:31.140
Well, it turns out that this
inequality that we're using

00:39:31.140 --> 00:39:34.660
here, Chebyshev's inequality,
is just an inequality.

00:39:34.660 --> 00:39:36.890
It's not that tight.

00:39:36.890 --> 00:39:38.850
It's not very accurate.

00:39:38.850 --> 00:39:42.800
Maybe there's a better way of
calculating or estimating this

00:39:42.800 --> 00:39:46.760
quantity, which is smaller
than this.

00:39:46.760 --> 00:39:49.770
And using a more accurate
inequality or a more accurate

00:39:49.770 --> 00:39:55.320
bound, then we can convince
ourselves that we can settle

00:39:55.320 --> 00:39:57.800
with a smaller sample size.

00:39:57.800 --> 00:40:01.770
This more accurate kind of
inequality comes out of a

00:40:01.770 --> 00:40:04.140
different limit theorem,
which is the next limit

00:40:04.140 --> 00:40:06.030
theorem we're going
to consider.

00:40:06.030 --> 00:40:08.310
We're going to start the
discussion today, but we're

00:40:08.310 --> 00:40:12.150
going to continue with
it next week.

00:40:12.150 --> 00:40:18.750
Before I tell you exactly what
that other limit theorem says,

00:40:18.750 --> 00:40:20.800
let me give you the
big picture of

00:40:20.800 --> 00:40:24.760
what's involved here.

00:40:24.760 --> 00:40:29.170
We're dealing with sums of
i.i.d random variables.

00:40:29.170 --> 00:40:32.300
Each X has a distribution
of its own.

00:40:34.840 --> 00:40:41.190
So suppose that X has a
distribution which is

00:40:41.190 --> 00:40:43.090
something like this.

00:40:43.090 --> 00:40:48.560
This is the density of X. If I
add lots of X's together, what

00:40:48.560 --> 00:40:51.460
kind of distribution
do I expect?

00:40:51.460 --> 00:40:55.170
The mean is going to be
n times the mean of an

00:40:55.170 --> 00:41:00.560
individual X. So if this is mu,
I'm going to get a mean of

00:41:00.560 --> 00:41:02.730
n times mu.

00:41:02.730 --> 00:41:06.620
But my variance will
also increase.

00:41:06.620 --> 00:41:08.050
When I add the random
variables,

00:41:08.050 --> 00:41:10.190
I'm adding the variances.

00:41:10.190 --> 00:41:13.370
So since the variance increases,
we're going to get

00:41:13.370 --> 00:41:17.610
a distribution that's
pretty wide.

00:41:17.610 --> 00:41:23.240
So this is the density of X1
plus all the way to Xn.

00:41:23.240 --> 00:41:27.640
So as n increases, my
distribution shifts, because

00:41:27.640 --> 00:41:28.770
the mean is positive.

00:41:28.770 --> 00:41:30.610
So I keep adding things.

00:41:30.610 --> 00:41:33.870
And also, my distribution
becomes wider and wider.

00:41:33.870 --> 00:41:36.080
The variance increases.

00:41:36.080 --> 00:41:39.260
Well, we studied a different
scaling.

00:41:39.260 --> 00:41:42.980
We studied a scaled version of
this quantity when we looked

00:41:42.980 --> 00:41:46.180
at the weak law of
large numbers.

00:41:46.180 --> 00:41:49.580
In the weak law of large
numbers, we take this random

00:41:49.580 --> 00:41:52.140
variable and divide it by n.

00:41:52.140 --> 00:41:56.300
And what the weak law tells us
is that we're going to get a

00:41:56.300 --> 00:42:01.050
distribution that's very highly
concentrated around the

00:42:01.050 --> 00:42:03.650
true mean, which is mu.

00:42:03.650 --> 00:42:07.520
So this here would be the
density of X1 plus

00:42:07.520 --> 00:42:12.630
all the way to Xn, divided by n.

00:42:12.630 --> 00:42:16.660
Because I've divided by n, the
mean has become the original

00:42:16.660 --> 00:42:19.410
mean, which is mu.

00:42:19.410 --> 00:42:22.620
But the weak law of large
numbers tells us that the

00:42:22.620 --> 00:42:26.650
distribution of this random
variable is very concentrated

00:42:26.650 --> 00:42:27.810
around the mean.

00:42:27.810 --> 00:42:29.850
So we get a distribution
that's very

00:42:29.850 --> 00:42:31.520
narrow, of this kind.

00:42:31.520 --> 00:42:34.230
In the limit, this distribution
becomes one

00:42:34.230 --> 00:42:37.570
that's just concentrated
on top of mu.

00:42:37.570 --> 00:42:40.930
So it's sort of a degenerate
distribution.

00:42:40.930 --> 00:42:46.070
So these are two extremes, no
scaling for the sum, a scaling

00:42:46.070 --> 00:42:47.740
where we divide by n.

00:42:47.740 --> 00:42:50.680
In this extreme, we get the
trivial case of a distribution

00:42:50.680 --> 00:42:52.860
that flattens out completely.

00:42:52.860 --> 00:42:56.070
In this scaling, we get a
distribution that gets

00:42:56.070 --> 00:42:59.150
concentrated around
a single point.

00:42:59.150 --> 00:43:02.030
Instead, we look at some
intermediate scaling that

00:43:02.030 --> 00:43:04.050
makes things more interesting.

00:43:04.050 --> 00:43:09.700
Things do become interesting
if we scale by dividing the

00:43:09.700 --> 00:43:14.520
sum by square root of n instead
of dividing by n.

00:43:14.520 --> 00:43:17.210
What effect does this have?

00:43:17.210 --> 00:43:22.510
When we scale by dividing by
square root of n, the variance

00:43:22.510 --> 00:43:28.050
of Sn over square root of n is
going to be the variance of Sn

00:43:28.050 --> 00:43:30.760
divided by n.

00:43:30.760 --> 00:43:32.780
That's how variances behave.

00:43:32.780 --> 00:43:37.370
The variance of Sn is n
sigma-squared, divided by n,

00:43:37.370 --> 00:43:41.330
which is sigma squared, which
means that when we scale in

00:43:41.330 --> 00:43:45.940
this particular way,
as n changes, the

00:43:45.940 --> 00:43:48.230
variance doesn't change.

00:43:48.230 --> 00:43:50.300
So the width of our
distribution

00:43:50.300 --> 00:43:52.190
will be sort of constant.

00:43:52.190 --> 00:43:56.360
The distribution changes shape,
but it doesn't become

00:43:56.360 --> 00:43:59.910
narrower as was the case here.

00:43:59.910 --> 00:44:04.550
It doesn't become wider; it kind
of keeps the same width.

00:44:04.550 --> 00:44:09.260
So perhaps in the limit, this
distribution is going to take

00:44:09.260 --> 00:44:11.080
an interesting shape.

00:44:11.080 --> 00:44:14.170
And that's indeed the case.

00:44:14.170 --> 00:44:19.800
So let's do what
we did before.

00:44:19.800 --> 00:44:25.110
So we're looking at the sum, and
we want to divide the sum

00:44:25.110 --> 00:44:28.860
by something that goes like
square root of n.

00:44:28.860 --> 00:44:33.140
So the variance of Sn
is n sigma squared.

00:44:33.140 --> 00:44:38.240
The standard deviation of Sn
is the square root of that.

00:44:38.240 --> 00:44:39.570
It's this number.

00:44:39.570 --> 00:44:43.930
So effectively, we're scaling
by order of square root n.

00:44:43.930 --> 00:44:47.570
Now, I'm doing another
thing here.

00:44:47.570 --> 00:44:52.350
If my random variable has a
positive mean, then this

00:44:52.350 --> 00:44:55.470
quantity is going to
have a mean that's

00:44:55.470 --> 00:44:56.950
positive and growing.

00:44:56.950 --> 00:44:59.450
It's going to be shifting
to the right.

00:44:59.450 --> 00:45:01.350
Why is that?

00:45:01.350 --> 00:45:04.370
Sn has a mean that's
proportional to n.

00:45:04.370 --> 00:45:09.510
When I divide by square root n,
then it means that the mean

00:45:09.510 --> 00:45:11.990
scales like square root of n.

00:45:11.990 --> 00:45:14.740
So my distribution would
still keep shifting

00:45:14.740 --> 00:45:16.720
after I do this division.

00:45:16.720 --> 00:45:20.860
I want to keep my distribution
in place, so I subtract out

00:45:20.860 --> 00:45:23.920
the mean of Sn.

00:45:23.920 --> 00:45:29.580
So what we're doing here is
a standard technique or

00:45:29.580 --> 00:45:32.670
transformation where you take
a random variable and you

00:45:32.670 --> 00:45:34.890
so-called standardize it.

00:45:34.890 --> 00:45:38.500
I remove the mean of that random
variable and I divide

00:45:38.500 --> 00:45:40.100
by the standard deviation.

00:45:40.100 --> 00:45:43.030
This results in a random
variable that has 0 mean and

00:45:43.030 --> 00:45:44.960
unit variance.

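Standardizing is a mechanical transformation; here is a sketch, with an arbitrary illustrative distribution for the X's:

```python
import random
import statistics

random.seed(2)
mu, sigma, n = 2.0, 1.5, 400   # illustrative parameters for the X's

def standardized_sum():
    """Z_n = (S_n - n*mu) / (sqrt(n) * sigma)."""
    s_n = sum(random.gauss(mu, sigma) for _ in range(n))
    return (s_n - n * mu) / (n ** 0.5 * sigma)

zs = [standardized_sum() for _ in range(3000)]
print(statistics.mean(zs), statistics.variance(zs))  # close to 0 and 1 respectively
```

Whatever the distribution of the X's, the resulting Zn always has mean 0 and unit variance.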
00:45:44.960 --> 00:45:49.880
What Zn measures is the
following: Zn tells me how

00:45:49.880 --> 00:45:55.520
many standard deviations am
I away from the mean.

00:45:55.520 --> 00:45:59.380
Sn minus (n times expected value
of X) tells me how much

00:45:59.380 --> 00:46:02.980
is Sn away from the
mean value of Sn.

00:46:02.980 --> 00:46:06.250
And by dividing by the standard
deviation of Sn --

00:46:06.250 --> 00:46:09.830
this tells me how many standard
deviations away from

00:46:09.830 --> 00:46:12.550
the mean am I.

00:46:12.550 --> 00:46:15.360
So we're going to look at this
random variable, which is just

00:46:15.360 --> 00:46:17.260
a transformation Zn.

00:46:17.260 --> 00:46:20.840
It's a linear transformation
of Sn.

00:46:20.840 --> 00:46:24.740
And we're going to compare
this random variable to a

00:46:24.740 --> 00:46:27.230
standard normal random
variable.

00:46:27.230 --> 00:46:30.610
So a standard normal is the
random variable that you are

00:46:30.610 --> 00:46:35.200
familiar with, given by the
usual formula, and for which

00:46:35.200 --> 00:46:37.400
we have tables for it.

00:46:37.400 --> 00:46:40.400
This Zn has 0 mean and
unit variance.

00:46:40.400 --> 00:46:44.220
So in that respect, it has the
same statistics as the

00:46:44.220 --> 00:46:45.655
standard normal.

00:46:45.655 --> 00:46:48.960
The distribution of Zn
could be anything --

00:46:48.960 --> 00:46:50.770
can be pretty messy.

00:46:50.770 --> 00:46:53.320
But there is this amazing
theorem called the central

00:46:53.320 --> 00:46:58.250
limit theorem that tells us that
the distribution of Zn

00:46:58.250 --> 00:47:01.930
approaches the distribution of
the standard normal in the

00:47:01.930 --> 00:47:06.270
following sense: the
probabilities that you can

00:47:06.270 --> 00:47:07.080
calculate --

00:47:07.080 --> 00:47:07.930
of this type --

00:47:07.930 --> 00:47:10.350
that you can calculate
for Zn --

00:47:10.350 --> 00:47:13.330
in the limit become the same as
the probabilities that you

00:47:13.330 --> 00:47:17.590
would get from the standard
normal tables for Z.

00:47:17.590 --> 00:47:19.750
It's a statement about
the cumulative

00:47:19.750 --> 00:47:21.960
distribution functions.

00:47:21.960 --> 00:47:25.060
This quantity, as a function
of c, is the cumulative

00:47:25.060 --> 00:47:27.920
distribution function of
the random variable Zn.

00:47:27.920 --> 00:47:30.860
This is the cumulative
distribution function of the

00:47:30.860 --> 00:47:32.190
standard normal.

00:47:32.190 --> 00:47:34.530
The central limit theorem tells
us that the cumulative

00:47:34.530 --> 00:47:39.340
distribution function of the
sum of a number of random

00:47:39.340 --> 00:47:43.040
variables, after they're
appropriately standardized,

00:47:43.040 --> 00:47:46.480
approaches the cumulative
distribution function of the

00:47:46.480 --> 00:47:50.580
standard normal distribution.

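The statement can be seen numerically even when the X's are as non-normal as coin flips; everything below (Bernoulli summands, n = 200) is an illustrative choice:

```python
import math
import random

random.seed(3)
p, n = 0.5, 200
mu, sigma = p, math.sqrt(p * (1 - p))

def z_n():
    s = sum(random.random() < p for _ in range(n))  # S_n for Bernoulli(p) summands
    return (s - n * mu) / (math.sqrt(n) * sigma)

def phi(c):
    """CDF of the standard normal."""
    return 0.5 * (1 + math.erf(c / math.sqrt(2)))

zs = [z_n() for _ in range(20000)]
for c in (-1.0, 0.0, 1.0):
    empirical = sum(z <= c for z in zs) / len(zs)
    print(c, round(empirical, 3), round(phi(c), 3))  # the two columns roughly agree
```

The empirical CDF of Zn tracks the standard normal CDF even though each X is a 0/1 variable.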
00:47:50.580 --> 00:47:53.620
In particular, this tells
us that we can calculate

00:47:53.620 --> 00:47:59.480
probabilities for Zn when n is
large by calculating instead

00:47:59.480 --> 00:48:02.800
probabilities for Z. And that's
going to be a good

00:48:02.800 --> 00:48:04.020
approximation.

00:48:04.020 --> 00:48:07.670
Probabilities for Z are easy to
calculate because they're

00:48:07.670 --> 00:48:09.250
well tabulated.

00:48:09.250 --> 00:48:12.820
So we get a very nice shortcut
for calculating

00:48:12.820 --> 00:48:14.990
probabilities for Zn.

00:48:14.990 --> 00:48:17.990
Now, it's not Zn that you're
interested in.

00:48:17.990 --> 00:48:20.890
What you're interested
in is Sn.

00:48:20.890 --> 00:48:23.820
And Sn --

00:48:23.820 --> 00:48:29.080
inverting this relation
here --

00:48:29.080 --> 00:48:38.330
Sn is square root of n, times
sigma, times Zn, plus n times the expected

00:48:38.330 --> 00:48:42.602
value of X. All right.

00:48:42.602 --> 00:48:46.620
Now, if you can calculate
probabilities for Zn, even

00:48:46.620 --> 00:48:49.380
approximately, then you can
certainly calculate

00:48:49.380 --> 00:48:53.290
probabilities for Sn, because
one is a linear

00:48:53.290 --> 00:48:55.206
function of the other.

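In practice, a probability question about Sn becomes a question about Zn, which is then read off the normal CDF. A sketch (the coin-flip numbers in the usage line are illustrative):

```python
import math

def approx_prob_sn_le(s, n, mu, sigma):
    """P(S_n <= s), approximated by Phi((s - n*mu) / (sqrt(n)*sigma)) via the CLT."""
    c = (s - n * mu) / (math.sqrt(n) * sigma)
    return 0.5 * (1 + math.erf(c / math.sqrt(2)))

# e.g. 100 fair coin flips: P(at most 55 heads) via the CLT approximation
print(approx_prob_sn_le(55, 100, 0.5, 0.5))  # about 0.84, i.e. Phi(1)
```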
00:48:55.206 --> 00:48:58.710
And we're going to do a little
bit of that next time.

00:48:58.710 --> 00:49:02.220
You're going to get, also, some
practice in recitation.

00:49:02.220 --> 00:49:04.975
At a more vague level, you could
describe the central

00:49:04.975 --> 00:49:08.270
limit theorem as saying the
following: when n is large,

00:49:08.270 --> 00:49:12.160
you can pretend that Zn is
a standard normal random

00:49:12.160 --> 00:49:15.440
variable and do the calculations
as if Zn was

00:49:15.440 --> 00:49:16.680
standard normal.

00:49:16.680 --> 00:49:21.530
Now, pretending that Zn is
normal is the same as

00:49:21.530 --> 00:49:25.900
pretending that Sn is normal,
because Sn is a linear

00:49:25.900 --> 00:49:27.700
function of Zn.

00:49:27.700 --> 00:49:30.400
And we know that linear
functions of normal random

00:49:30.400 --> 00:49:32.140
variables are normal.

00:49:32.140 --> 00:49:36.290
So the central limit theorem
essentially tells us that we

00:49:36.290 --> 00:49:40.070
can pretend that Sn is a normal
random variable and do

00:49:40.070 --> 00:49:44.760
the calculations just as if it
were a normal random variable.

00:49:44.760 --> 00:49:47.020
Mathematically speaking though,
the central limit

00:49:47.020 --> 00:49:50.480
theorem does not talk about
the distribution of Sn,

00:49:50.480 --> 00:49:54.940
because the distribution of Sn
becomes degenerate in the

00:49:54.940 --> 00:49:57.650
limit, just a very flat
and long thing.

00:49:57.650 --> 00:49:59.810
So strictly speaking
mathematically, it's a

00:49:59.810 --> 00:50:03.060
statement about cumulative
distributions of Zn's.

00:50:03.060 --> 00:50:06.420
Practically, the way you use it
is by just pretending that

00:50:06.420 --> 00:50:08.415
Sn is normal.

00:50:08.415 --> 00:50:09.400
Very good.

00:50:09.400 --> 00:50:11.080
Enjoy the Thanksgiving Holiday.