WEBVTT

00:00:01.030 --> 00:00:03.150
In this segment we introduce
the concept of

00:00:03.150 --> 00:00:05.120
a confidence interval.

00:00:05.120 --> 00:00:08.109
The starting point is that an
estimate, the value of an

00:00:08.109 --> 00:00:11.110
estimator, does not tell
the whole story.

00:00:11.110 --> 00:00:13.910
One option is to also provide
the standard error of the

00:00:13.910 --> 00:00:17.290
estimator, but a more common
practice is to report a

00:00:17.290 --> 00:00:18.910
confidence interval.

00:00:18.910 --> 00:00:20.040
What is that?

00:00:20.040 --> 00:00:22.060
We'll introduce the notion
of a confidence

00:00:22.060 --> 00:00:23.740
interval through a story.

00:00:23.740 --> 00:00:25.910
You're working for a
polling company.

00:00:25.910 --> 00:00:31.520
You carry out a poll, and then
you go and report to your boss

00:00:31.520 --> 00:00:38.000
that my estimate is this
particular number.

00:00:38.000 --> 00:00:41.620
And then your boss says, I
appreciate the five digit

00:00:41.620 --> 00:00:45.510
accuracy, but are your
conclusions that accurate?

00:00:45.510 --> 00:00:49.240
You go back to your desk, you do
some more calculations, and

00:00:49.240 --> 00:00:55.880
then you tell your boss, here is
a 95% confidence interval.

00:00:58.480 --> 00:01:01.930
Your boss tells you, that looks
great, but what does

00:01:01.930 --> 00:01:03.570
that exactly mean?

00:01:03.570 --> 00:01:06.980
You go back to your textbook,
you pull out the definition,

00:01:06.980 --> 00:01:09.420
and you reply as follows.

00:01:09.420 --> 00:01:13.130
Well, a 95% confidence
interval--

00:01:13.130 --> 00:01:17.550
so here I am letting
alpha to be 5%--

00:01:17.550 --> 00:01:22.800
a 95% confidence interval is
an interval that has the

00:01:22.800 --> 00:01:24.360
following property.

00:01:24.360 --> 00:01:27.710
That the unknown value of the
parameter that we're trying to

00:01:27.710 --> 00:01:32.060
estimate falls inside
this interval with

00:01:32.060 --> 00:01:35.320
probability at least 95%.

00:01:35.320 --> 00:01:39.390
And if you wish, I could also
let alpha be equal to 1%, in

00:01:39.390 --> 00:01:43.630
which case I could give you
a 99% confidence interval.

00:01:43.630 --> 00:01:46.220
Your boss replies, no,
that sounds good.

00:01:46.220 --> 00:01:49.920
A 95% interval sounds fine.

00:01:49.920 --> 00:01:53.310
And your boss goes out to
the press, holds a press

00:01:53.310 --> 00:01:58.740
conference, and reports that
the true value of the

00:01:58.740 --> 00:02:04.490
parameter lies inside this
range, inside the reported

00:02:04.490 --> 00:02:08.627
confidence interval with
probability at least 95%.

00:02:11.680 --> 00:02:14.610
Does this statement
make sense?

00:02:14.610 --> 00:02:16.220
Actually, no.

00:02:16.220 --> 00:02:20.300
This statement is the most
common misconception of what a

00:02:20.300 --> 00:02:22.740
confidence interval is.

00:02:22.740 --> 00:02:24.790
To see why this statement
does not make

00:02:24.790 --> 00:02:27.030
sense, look at it carefully.

00:02:27.030 --> 00:02:29.880
We're talking about the
probability of something.

00:02:29.880 --> 00:02:33.460
But that something does not
involve anything random.

00:02:33.460 --> 00:02:36.300
0.3 and 0.52 are just numbers.

00:02:36.300 --> 00:02:41.079
And theta is also a number which
we do not know what it

00:02:41.079 --> 00:02:42.816
is, but it is not random.

00:02:42.816 --> 00:02:44.650
It is a constant.

00:02:44.650 --> 00:02:49.350
So this statement is incorrect
on a purely syntactic basis.

00:02:49.350 --> 00:02:52.370
I mean the true parameter theta
either is inside this

00:02:52.370 --> 00:02:54.190
interval, or it is not.

00:02:54.190 --> 00:02:57.640
But there's nothing random here,
and so this statement

00:02:57.640 --> 00:02:59.980
does not make sense.

00:02:59.980 --> 00:03:04.070
Instead, let us look carefully
at this definition.

00:03:04.070 --> 00:03:06.810
This statement does make
sense because it

00:03:06.810 --> 00:03:09.230
involves random variables.

00:03:09.230 --> 00:03:13.150
The lower and the upper end of
the confidence interval are

00:03:13.150 --> 00:03:17.140
quantities that are determined
by the data, and therefore

00:03:17.140 --> 00:03:18.530
they are random.

00:03:18.530 --> 00:03:21.400
So we do have random
variables in here.

00:03:21.400 --> 00:03:25.160
And so it makes sense to talk
about probabilities.

00:03:25.160 --> 00:03:29.590
To really understand what's
going on, think as follows.

00:03:29.590 --> 00:03:33.860
We're dealing with a poll that
is trying to estimate some

00:03:33.860 --> 00:03:36.860
unknown value, theta.

00:03:36.860 --> 00:03:40.860
You carry out the poll, and you
come up with a confidence

00:03:40.860 --> 00:03:42.990
interval based on the data.

00:03:42.990 --> 00:03:46.820
You might be lucky, and your
confidence interval happens to

00:03:46.820 --> 00:03:49.100
capture the true parameter.

00:03:49.100 --> 00:03:52.579
You carry the poll one more
time, maybe on another day.

00:03:52.579 --> 00:03:55.170
You come up with another
confidence interval.

00:03:55.170 --> 00:03:58.360
And you're again lucky, and it
captures the true parameter.

00:03:58.360 --> 00:04:01.260
You carry it on another day,
and you come up with a

00:04:01.260 --> 00:04:02.550
confidence interval.

00:04:02.550 --> 00:04:06.770
And maybe the data that you
got were kind of skewed.

00:04:06.770 --> 00:04:10.010
You were unlucky, and your
confidence interval does not

00:04:10.010 --> 00:04:12.750
capture the true parameter.

00:04:12.750 --> 00:04:18.870
Having a 95% confidence interval
means that 95% of the

00:04:18.870 --> 00:04:24.980
time, 95% of the polls that you
carry out will capture the

00:04:24.980 --> 00:04:26.380
true parameter.

00:04:26.380 --> 00:04:30.190
So the word 95% really talks
about your method of

00:04:30.190 --> 00:04:32.490
constructing confidence
intervals.

00:04:32.490 --> 00:04:35.850
It's a method that 95%
of the time will

00:04:35.850 --> 00:04:37.590
capture the true parameter.

00:04:37.590 --> 00:04:42.380
It is not a statement about the
actual numbers that you

00:04:42.380 --> 00:04:46.040
are reporting on one
specific poll.

00:04:46.040 --> 00:04:49.190
So it is important to keep this
in mind, and to always

00:04:49.190 --> 00:04:53.700
interpret confidence intervals
the correct way.

00:04:53.700 --> 00:04:56.909
So how does one come up with
confidence intervals?

00:04:56.909 --> 00:04:59.240
The most common method
is based on normal

00:04:59.240 --> 00:05:02.140
approximations, as we
will be seeing next.