WEBVTT

00:00:02.050 --> 00:00:04.500
We now revisit the polling
problem that we

00:00:04.500 --> 00:00:06.390
started earlier.

00:00:06.390 --> 00:00:09.210
When we first looked at that
problem, we used the Chebyshev

00:00:09.210 --> 00:00:13.820
inequality to obtain certain
bounds and numerical results.

00:00:13.820 --> 00:00:18.000
What we want to do now is
instead to use a central limit

00:00:18.000 --> 00:00:22.110
theorem-type approximation,
which we hope will be

00:00:22.110 --> 00:00:24.920
more accurate and more
informative.

00:00:24.920 --> 00:00:27.350
Let us remind ourselves
of the setting.

00:00:27.350 --> 00:00:30.940
We want to estimate a certain
number, p, which is the

00:00:30.940 --> 00:00:33.510
fraction of the population
that will vote yes in a

00:00:33.510 --> 00:00:34.960
certain referendum.

00:00:34.960 --> 00:00:39.010
And we estimate p by picking a
sample out of the population.

00:00:39.010 --> 00:00:40.650
We pick n people.

00:00:40.650 --> 00:00:44.360
We pick them randomly, uniformly
over the population

00:00:44.360 --> 00:00:46.090
and independently.

00:00:46.090 --> 00:00:49.360
For each one of the people in
the sample, we ask them if

00:00:49.360 --> 00:00:52.520
they will vote yes or no,
and then we record their

00:00:52.520 --> 00:00:56.360
answers in Bernoulli random
variables, Xi.

00:00:56.360 --> 00:00:59.590
So by the assumptions that we
have made, these Xi's are

00:00:59.590 --> 00:01:03.950
independent Bernoulli random
variables, and their mean is

00:01:03.950 --> 00:01:06.580
equal to p.

00:01:06.580 --> 00:01:09.760
We count how many X's
were equal to 1.

00:01:09.760 --> 00:01:11.000
That's the number of yeses.

00:01:11.000 --> 00:01:13.700
We divide by n, and that gives
us the fraction in the

00:01:13.700 --> 00:01:16.070
sample who have
responded yes.

00:01:16.070 --> 00:01:18.610
This is the sample
mean of the X's.

00:01:18.610 --> 00:01:21.210
And we use this sample
mean to estimate the

00:01:21.210 --> 00:01:24.200
unknown fraction p.
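
NOTE
The sampling procedure described so far can be sketched in Python as a small simulation.
This is only an illustration under stated assumptions: the true fraction p = 0.4, the
helper name sample_mean, and the fixed seed are all hypothetical and not part of the lecture.
```python
import random
def sample_mean(p: float, n: int, seed: int = 0) -> float:
    """Simulate n independent Bernoulli(p) answers and return their sample mean M_n."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    yeses = sum(1 for _ in range(n) if rng.random() < p)  # number of X_i equal to 1
    return yeses / n
# Assumed values: p = 0.4 (unknown in a real poll) and n = 10,000 respondents.
estimate = sample_mean(0.4, 10_000)
print(estimate, abs(estimate - 0.4))  # the estimate and its estimation error
```
In a real poll p is unknown, so the error cannot be printed like this; it can only be bounded probabilistically.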

00:01:24.200 --> 00:01:28.960
We would like the error in our
estimation to be small, that

00:01:28.960 --> 00:01:31.990
is the difference between the
sample mean and the true value

00:01:31.990 --> 00:01:35.280
p to be small, less,
let's say, than

00:01:35.280 --> 00:01:37.030
one percentage point.

00:01:37.030 --> 00:01:40.300
Now there's no way of
guaranteeing that this spec

00:01:40.300 --> 00:01:44.070
will be met with certainty,
unless we sample almost

00:01:44.070 --> 00:01:45.780
everyone in the population.

00:01:45.780 --> 00:01:49.740
But what we can do instead
is to ask that these

00:01:49.740 --> 00:01:54.520
specifications are violated with
only a small probability.

00:01:54.520 --> 00:01:58.150
So we look at the probability
that our estimation error is

00:01:58.150 --> 00:02:01.290
larger than what we want.

00:02:01.290 --> 00:02:04.310
This is the case where we do
not meet the specs, and we

00:02:04.310 --> 00:02:07.410
would like this probability
to be small.

00:02:07.410 --> 00:02:11.880
One possible question is what
the value of n should be in

00:02:11.880 --> 00:02:13.590
order to meet the specs.

00:02:13.590 --> 00:02:16.310
But in order to do any
calculations, we first need a

00:02:16.310 --> 00:02:19.520
way of approximating
this probability.

00:02:19.520 --> 00:02:22.700
We will do that using the
central limit theorem.

00:02:22.700 --> 00:02:26.070
The central limit theorem
involves this standardized

00:02:26.070 --> 00:02:30.730
version of the random variable
Sn, where Sn stands for the

00:02:30.730 --> 00:02:33.390
sum of the X's.

00:02:33.390 --> 00:02:35.090
We know that this random
variable is

00:02:35.090 --> 00:02:36.620
approximately normal.

00:02:36.620 --> 00:02:41.050
And what we want to do now is to
take this event and rewrite

00:02:41.050 --> 00:02:44.490
it in an equivalent way that
involves this random

00:02:44.490 --> 00:02:46.900
variable Zn.

00:02:46.900 --> 00:02:48.030
Let us start.

00:02:48.030 --> 00:02:52.590
First, we note that here we have a
mu and a sigma, so we should

00:02:52.590 --> 00:02:54.280
know what these are.

00:02:54.280 --> 00:02:58.480
For a Bernoulli random variable,
the mean is what we

00:02:58.480 --> 00:03:03.340
already wrote down, and sigma
is the square root of p

00:03:03.340 --> 00:03:04.930
times 1 minus p.

00:03:07.850 --> 00:03:09.930
Now let's look at this event.

00:03:09.930 --> 00:03:17.050
Mn is the same as Sn/n,
by definition.

00:03:17.050 --> 00:03:22.980
And we can write p in this
form, minus n times

00:03:22.980 --> 00:03:26.790
p divided by n.

00:03:26.790 --> 00:03:30.220
And we want this quantity
to be larger

00:03:30.220 --> 00:03:33.150
than or equal to 0.01.

00:03:33.150 --> 00:03:35.550
So this event here
is identical to

00:03:35.550 --> 00:03:37.490
that event up there.

00:03:37.490 --> 00:03:40.940
This starts to look like
this expression.

00:03:40.940 --> 00:03:42.900
p is the same as mu.

00:03:42.900 --> 00:03:45.340
But there is a little bit
of a difference in

00:03:45.340 --> 00:03:47.190
the denominator terms.

00:03:47.190 --> 00:03:49.820
So let's see what we can do.

00:03:49.820 --> 00:03:57.860
Let's take this same event but
multiply both sides of the

00:03:57.860 --> 00:04:00.380
inequality by a square
root of n.

00:04:00.380 --> 00:04:04.490
This causes this denominator
term to become just square

00:04:04.490 --> 00:04:09.470
root of n, and we get a square
root of n term in the

00:04:09.470 --> 00:04:12.260
numerator on the other side.

00:04:12.260 --> 00:04:15.030
This is an equivalent
description of the event.

00:04:15.030 --> 00:04:20.720
Now we can divide both sides
of this inequality by sigma,

00:04:20.720 --> 00:04:25.580
that is, multiply the denominators
on both sides by sigma,

00:04:25.580 --> 00:04:28.930
and we obtain this equivalent
representation.

00:04:28.930 --> 00:04:36.110
But now we notice that here we
do have the random variable Zn

00:04:36.110 --> 00:04:37.860
that we wanted.

00:04:37.860 --> 00:04:42.050
And so we managed to express
this event in terms of the

00:04:42.050 --> 00:04:44.000
random variable Zn.

00:04:44.000 --> 00:04:47.960
In particular what we have is
that this probability is the

00:04:47.960 --> 00:04:53.380
same as the probability that
the absolute value of Zn is

00:04:53.380 --> 00:04:58.030
larger than or equal to
0.01 square root of

00:04:58.030 --> 00:05:02.140
n divided by sigma.

00:05:02.140 --> 00:05:06.080
Then we can use the central
limit theorem approximation to

00:05:06.080 --> 00:05:09.360
approximate this probability
by the corresponding

00:05:09.360 --> 00:05:13.630
probability where we now use
a standard normal random

00:05:13.630 --> 00:05:20.100
variable instead of the
Zn random variable.

00:05:20.100 --> 00:05:23.860
So here, Z stands for a
standard normal random

00:05:23.860 --> 00:05:28.780
variable with mean 0 and
variance equal to 1.
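
NOTE
The chain of equivalent events traced on the slide can be written out as follows.
This is a reconstruction from the spoken steps, assuming the notation M_n = S_n / n,
Z_n = (S_n - n p) / (sigma sqrt(n)), and 0 < p < 1 so that sigma is positive.
```latex
\begin{align*}
\mathbf{P}\bigl(|M_n - p| \ge 0.01\bigr)
&= \mathbf{P}\Bigl(\Bigl|\frac{S_n - np}{n}\Bigr| \ge 0.01\Bigr) \\
&= \mathbf{P}\Bigl(\Bigl|\frac{S_n - np}{\sqrt{n}}\Bigr| \ge 0.01\sqrt{n}\Bigr) \\
&= \mathbf{P}\Bigl(\Bigl|\frac{S_n - np}{\sigma\sqrt{n}}\Bigr| \ge \frac{0.01\sqrt{n}}{\sigma}\Bigr) \\
&= \mathbf{P}\Bigl(|Z_n| \ge \frac{0.01\sqrt{n}}{\sigma}\Bigr)
\approx \mathbf{P}\Bigl(|Z| \ge \frac{0.01\sqrt{n}}{\sigma}\Bigr),
\qquad \sigma = \sqrt{p(1-p)}.
\end{align*}
```
Only the last step is an approximation; every earlier line is an exact restatement of the same event.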

00:05:28.780 --> 00:05:31.720
Let us now continue on a new
slide so that we have some

00:05:31.720 --> 00:05:33.010
working space.

00:05:33.010 --> 00:05:37.520
And here is the result that
we have derived so far.

00:05:37.520 --> 00:05:40.630
If somebody gives us the value
of n, we would like to be able

00:05:40.630 --> 00:05:44.100
to calculate this probability
using this approximation.

00:05:44.100 --> 00:05:49.040
However, there's a slight
difficulty because sigma is a

00:05:49.040 --> 00:05:55.210
function of p,
and p is not known.

00:05:55.210 --> 00:05:58.670
However, as we discussed when
we first started the polling

00:05:58.670 --> 00:06:02.460
problem, we do know that
sigma is always less

00:06:02.460 --> 00:06:04.520
than or equal to 1/2.

00:06:04.520 --> 00:06:09.340
And this suggests that we could
use here the worst-case

00:06:09.340 --> 00:06:13.000
value of the standard deviation,
replace sigma by

00:06:13.000 --> 00:06:17.290
1/2 and instead look at
this probability here.

00:06:17.290 --> 00:06:19.990
How are these two probabilities
related?

00:06:19.990 --> 00:06:22.800
Which direction does
the inequality go?

00:06:22.800 --> 00:06:25.500
A sketch will be useful here.

00:06:25.500 --> 00:06:33.200
Z is a standard normal, and
it's centered at 0.

00:06:33.200 --> 00:06:39.650
Somewhere here, we have a value
of 0.02 square root n.

00:06:39.650 --> 00:06:45.210
And somewhere further out, we
have the value of 0.01 square

00:06:45.210 --> 00:06:49.040
root n divided by sigma.

00:06:49.040 --> 00:06:51.860
Why are these two values
ordered this way?

00:06:51.860 --> 00:06:58.450
Since sigma is at most 1/2, 1
over sigma is at least 2.

00:06:58.450 --> 00:07:01.170
So this expression here
is at least as big as

00:07:01.170 --> 00:07:04.180
this expression there.

00:07:04.180 --> 00:07:09.250
Since the inequality goes this
way, now we can compare these

00:07:09.250 --> 00:07:12.230
two events.

00:07:12.230 --> 00:07:16.570
The probability that Z is larger
in absolute value than this

00:07:16.570 --> 00:07:22.170
number, is the probability of
this tail of the distribution.

00:07:22.170 --> 00:07:26.540
And we will have a similar
probability from the other end

00:07:26.540 --> 00:07:29.280
of the tail of the
distribution.

00:07:29.280 --> 00:07:31.880
Here we're talking about the
probability of being larger

00:07:31.880 --> 00:07:36.159
than or equal to this number,
which would correspond only to

00:07:36.159 --> 00:07:39.830
this part of the tail and,
similarly, a small part of the

00:07:39.830 --> 00:07:42.200
tail from the other side.

00:07:42.200 --> 00:07:45.590
The blue event is smaller
than the red event.

00:07:45.590 --> 00:07:49.159
This is the probability of the
blue event, so it's going to

00:07:49.159 --> 00:07:54.970
be no larger than the
probability of the red event.
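
NOTE
The comparison of the two tail events can be checked numerically with a short Python sketch.
The helper name two_sided_tail and the trial values of p are assumptions made for illustration;
NormalDist comes from the Python standard library (version 3.8 or later).
```python
from math import sqrt
from statistics import NormalDist
def two_sided_tail(a: float) -> float:
    """P(|Z| >= a) for a standard normal Z, using the symmetry of the two tails."""
    return 2.0 * (1.0 - NormalDist().cdf(a))
n = 10_000
worst_case = two_sided_tail(0.02 * sqrt(n))  # sigma replaced by its largest value, 1/2
for p in (0.1, 0.3, 0.5):  # hypothetical values of the unknown p
    sigma = sqrt(p * (1.0 - p))
    with_true_sigma = two_sided_tail(0.01 * sqrt(n) / sigma)
    print(p, with_true_sigma, worst_case)
```
For every p the threshold 0.01 sqrt(n) / sigma is at least 0.02 sqrt(n), so the tail computed with the true sigma should never exceed the worst-case tail.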

00:07:54.970 --> 00:07:58.690
Now if somebody gives us a value
of n, we should be able

00:07:58.690 --> 00:08:00.780
to calculate this probability.

00:08:00.780 --> 00:08:03.740
How do we calculate it?

00:08:03.740 --> 00:08:07.480
The probability that the
absolute value is above a

00:08:07.480 --> 00:08:13.600
certain number is equal to the
probability of this tail plus

00:08:13.600 --> 00:08:15.570
the probability of that tail.

00:08:15.570 --> 00:08:18.960
But because of the symmetry of
the normal distribution, this

00:08:18.960 --> 00:08:24.450
is twice the probability of
each one of the tails.

00:08:24.450 --> 00:08:26.390
What is the probability
of this tail?

00:08:26.390 --> 00:08:30.740
It's 1 minus the probability
of whatever is below that.

00:08:30.740 --> 00:08:32.669
So it's 1 minus.

00:08:32.669 --> 00:08:36.270
And the probability of being
below that, this is the

00:08:36.270 --> 00:08:44.850
standard normal CDF evaluated
at 0.02 square root n.

00:08:44.850 --> 00:08:48.870
So we do have now an expression
for the desired

00:08:48.870 --> 00:08:52.890
probability, or at least a
bound for it, which is

00:08:52.890 --> 00:08:57.640
expressed in terms of the
standard normal CDF.

00:08:57.640 --> 00:09:01.970
If somebody gives you a value
of n, you can plug it in here.

00:09:01.970 --> 00:09:06.410
If n is 10,000, then square
root of n is 100.

00:09:06.410 --> 00:09:10.640
And this number becomes
equal to 2.

00:09:10.640 --> 00:09:14.730
And so in this case, what we
obtain is that the probability

00:09:14.730 --> 00:09:19.680
of interest is less than
or equal to 2 times 1

00:09:19.680 --> 00:09:23.230
minus Phi of 2.

00:09:23.230 --> 00:09:27.960
Now we invoke the standard
normal table.

00:09:27.960 --> 00:09:32.730
From the normal table, we obtain
that this quantity is

00:09:32.730 --> 00:09:45.060
equal to twice 1 minus 0.9772,
which evaluates to 0.046.

00:09:45.060 --> 00:09:53.300
So if we use 10,000 people in
our sample, then we will get

00:09:53.300 --> 00:09:57.280
an accuracy of one percentage
point with very high

00:09:57.280 --> 00:09:58.460
probability.

00:09:58.460 --> 00:10:02.620
The probability that we do not
meet the specification so that

00:10:02.620 --> 00:10:05.620
the accuracy that we get is
worse than one percentage

00:10:05.620 --> 00:10:08.660
point, that probability
is quite small.

00:10:08.660 --> 00:10:12.450
It's 0.046.

00:10:12.450 --> 00:10:18.080
That is 4 and something
percent.
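
NOTE
The table lookup for n = 10,000 can be reproduced in Python as a rough check.
The function name error_probability_bound is hypothetical; the formula is the bound
2 (1 - Phi(0.02 sqrt(n))) from the lecture, generalized to an arbitrary accuracy as an assumption.
```python
from math import sqrt
from statistics import NormalDist
def error_probability_bound(n: int, accuracy: float = 0.01) -> float:
    """CLT-based bound 2 * (1 - Phi(2 * accuracy * sqrt(n))), using sigma <= 1/2."""
    return 2.0 * (1.0 - NormalDist().cdf(2.0 * accuracy * sqrt(n)))
bound = error_probability_bound(10_000)
print(round(bound, 3))  # agrees with the table-based value 0.046
```
The unrounded value differs slightly from the table-based one because the table gives Phi(2) to only four decimal places.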

00:10:18.080 --> 00:10:19.880
This is pretty good.

00:10:19.880 --> 00:10:25.070
And suppose that your boss now
tells you, I only want the

00:10:25.070 --> 00:10:30.610
probability of not meeting
the specs to be 5%.

00:10:30.610 --> 00:10:34.670
You look at this result, and
you say, with 10,000, I

00:10:34.670 --> 00:10:38.810
achieved a probability
of a large error

00:10:38.810 --> 00:10:42.210
that's less than 5%.

00:10:42.210 --> 00:10:45.990
This means that I probably have
some leeway and that I

00:10:45.990 --> 00:10:49.820
can reduce the size
of my sample.

00:10:49.820 --> 00:10:52.690
What could the size of
the sample be and

00:10:52.690 --> 00:10:56.060
still meet those specs?

00:10:56.060 --> 00:10:59.630
What we're trying to do here
is that we have this

00:10:59.630 --> 00:11:04.880
approximation for the
probability of interest, and

00:11:04.880 --> 00:11:07.500
we want to set this probability

00:11:07.500 --> 00:11:15.250
to a value of 0.05.

00:11:15.250 --> 00:11:18.860
Then we want to ask, what is
the value of n that will

00:11:18.860 --> 00:11:22.900
result in this particular
probability of

00:11:22.900 --> 00:11:26.060
not meeting the specs?

00:11:26.060 --> 00:11:28.090
Now we can do the algebra.

00:11:28.090 --> 00:11:33.080
And we find that this
corresponds to requiring that

00:11:33.080 --> 00:11:42.320
Phi of 0.02 square root n
be equal to 0.975.

00:11:42.320 --> 00:11:45.380
What's the interpretation
of this?

00:11:45.380 --> 00:11:49.500
We want to choose n so that
the probability of the two

00:11:49.500 --> 00:11:53.780
tails is 5%.

00:11:53.780 --> 00:11:58.620
This means that we want this
probability here to be 2 and

00:11:58.620 --> 00:12:00.510
1/2 percent.

00:12:00.510 --> 00:12:03.240
This means that the probability
of whatever is to

00:12:03.240 --> 00:12:13.550
the left of this number should
be 0.975, including the tail.

00:12:13.550 --> 00:12:18.540
This means, again, that we have
to look at the standard

00:12:18.540 --> 00:12:25.150
normal table and ask, what's the
value for which the CDF is

00:12:25.150 --> 00:12:28.190
equal to 0.975?

00:12:28.190 --> 00:12:34.230
So we look around, and we find
0.975 to be here, and it

00:12:34.230 --> 00:12:38.130
corresponds to 1.96.

00:12:38.130 --> 00:12:42.750
This tells us that 0.02
square root n

00:12:42.750 --> 00:12:48.840
should be equal to 1.96.

00:12:48.840 --> 00:12:54.240
Then we solve for n, and we find
that the value of n is

00:12:54.240 --> 00:13:01.110
9,604, which is indeed some
reduction from the 10,000 that

00:13:01.110 --> 00:13:02.360
we had originally.
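
NOTE
Solving for n can also be done in Python by inverting the normal CDF instead of reading the table.
This is a sketch: the function name required_sample_size is hypothetical, and inv_cdf returns
a value slightly below the rounded table entry 1.96, so n is rounded up to the next integer.
```python
from math import ceil
from statistics import NormalDist
def required_sample_size(accuracy: float, error_probability: float) -> int:
    """Smallest n with 2 * (1 - Phi(2 * accuracy * sqrt(n))) <= error_probability."""
    z = NormalDist().inv_cdf(1.0 - error_probability / 2.0)  # about 1.96 for 5%
    return ceil((z / (2.0 * accuracy)) ** 2)
print(required_sample_size(0.01, 0.05))  # 9604, matching the lecture
```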

00:13:04.450 --> 00:13:08.020
How does this relate
to the real world?

00:13:08.020 --> 00:13:12.330
When you read newspapers about
polls, you will never see

00:13:12.330 --> 00:13:16.100
sample sizes that are
about 10,000.

00:13:16.100 --> 00:13:20.290
You will usually see sample
sizes of the order of 1,000,

00:13:20.290 --> 00:13:22.530
sometimes even smaller.

00:13:22.530 --> 00:13:24.400
How can they do that?

00:13:24.400 --> 00:13:28.100
Well, they can do that because
the specs that they impose are

00:13:28.100 --> 00:13:31.140
not as tight as the specs
that we have here.

00:13:31.140 --> 00:13:35.100
Usually, they tell you that
the results are accurate

00:13:35.100 --> 00:13:38.850
within three percentage points,
let's say, instead of

00:13:38.850 --> 00:13:40.600
one percentage point.

00:13:40.600 --> 00:13:46.420
And if you move from 0.01 to 0.03
and repeat those

00:13:46.420 --> 00:13:50.090
calculations, you will find that
a sample size of about

00:13:50.090 --> 00:13:53.690
1,000 will actually do.
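
NOTE
Repeating the calculation with an accuracy of 0.03 gives the following worked equation.
It assumes the same 5% error probability and the same worst-case value sigma = 1/2 as before;
the transcript does not state which error probability newspaper polls use.
```latex
\[
\mathbf{P}\bigl(|M_n - p| \ge 0.03\bigr) \lesssim 2\bigl(1 - \Phi(0.06\sqrt{n})\bigr) = 0.05
\quad\Longrightarrow\quad
0.06\sqrt{n} = 1.96
\quad\Longrightarrow\quad
n = \Bigl(\frac{1.96}{0.06}\Bigr)^{2} \approx 1067.
\]
```
Tripling the allowed error divides the required sample size by 9, which is why 9,604 drops to roughly 1,000.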