WEBVTT

00:00:00.000 --> 00:00:01.930
FEMALE SPEAKER: The following
content is provided under a

00:00:01.930 --> 00:00:03.670
Creative Commons license.

00:00:03.670 --> 00:00:06.640
Your support will help MIT
OpenCourseWare continue to

00:00:06.640 --> 00:00:09.980
offer high-quality educational
resources for free.

00:00:09.980 --> 00:00:12.820
To make a donation or to view
additional materials from

00:00:12.820 --> 00:00:15.246
hundreds of MIT courses, visit
MIT OpenCourseWare at

00:00:15.246 --> 00:00:16.496
ocw.mit.edu.

00:00:20.825 --> 00:00:23.110
PROFESSOR: OK, so my name is Ben
Olken and we're going to

00:00:23.110 --> 00:00:25.640
be talking about how to think
about sample size for

00:00:25.640 --> 00:00:26.780
randomized evaluations.

00:00:26.780 --> 00:00:29.250
And more generally, the
point of this lecture is not

00:00:29.250 --> 00:00:32.810
just about sample size but we've
spent a lot of time,

00:00:32.810 --> 00:00:34.650
like in the last lecture, for
example, thinking about the

00:00:34.650 --> 00:00:35.800
data we're going to collect.

00:00:35.800 --> 00:00:37.000
Then the question is, well,
what are we going to

00:00:37.000 --> 00:00:38.120
do with that data?

00:00:38.120 --> 00:00:41.550
And so it's about sample size
but also more generally, we're

00:00:41.550 --> 00:00:44.540
going to talk about how do we
analyze data in the context of

00:00:44.540 --> 00:00:45.504
an experiment.

00:00:45.504 --> 00:00:46.310
OK.

00:00:46.310 --> 00:00:50.970
So as I said, where we're going
to end up at the end of

00:00:50.970 --> 00:00:53.630
this lecture is, how big
a sample do we need?

00:00:53.630 --> 00:00:55.670
But in order to think about how
big a sample we need, we

00:00:55.670 --> 00:00:58.020
need to understand a little more
about how do we actually

00:00:58.020 --> 00:01:01.070
analyze this data.

00:01:01.070 --> 00:01:04.690
When we say, how large does a
sample need to be to credibly

00:01:04.690 --> 00:01:09.110
detect a given treatment effect,
we're going to need to

00:01:09.110 --> 00:01:12.170
be a little more precise about
what we mean by credibly and

00:01:12.170 --> 00:01:15.950
particularly think a little bit
about the statistics that

00:01:15.950 --> 00:01:18.470
are involved in thinking
through--

00:01:18.470 --> 00:01:19.180
evaluate--

00:01:19.180 --> 00:01:21.210
understanding these
experiments.

00:01:21.210 --> 00:01:23.290
And particularly, when we say
something that's credibly

00:01:23.290 --> 00:01:25.870
different, what we mean is
that we can be reasonably

00:01:25.870 --> 00:01:27.660
sure, and I'll be a little bit
more precise about what we

00:01:27.660 --> 00:01:30.110
mean by that, that the
difference between the two

00:01:30.110 --> 00:01:32.430
different groups-- the treatment
and control group--

00:01:32.430 --> 00:01:34.770
didn't just occur by random
chance, right?

00:01:34.770 --> 00:01:37.500
That there's really something
that we'll call statistically

00:01:37.500 --> 00:01:41.260
significantly different between
these two groups, OK?

00:01:41.260 --> 00:01:43.840
And when we think about
randomizing, right?

00:01:43.840 --> 00:01:47.350
So we've talked about which
groups get the treatment and

00:01:47.350 --> 00:01:51.290
which get the control, that's
going to mean that we expect

00:01:51.290 --> 00:01:53.610
the two groups to be similar
if there was no treatment

00:01:53.610 --> 00:01:55.260
effect because the only
difference between them is

00:01:55.260 --> 00:01:56.550
that they were randomized.

00:01:56.550 --> 00:02:00.060
But there's going to be some
variation in the outcomes

00:02:00.060 --> 00:02:02.780
between the two different
groups, OK?

00:02:02.780 --> 00:02:05.160
And so randomization is going
to remove the bias.

00:02:05.160 --> 00:02:07.290
It's going to mean that the
groups-- we expect the two

00:02:07.290 --> 00:02:08.919
different groups to be
the same, but there

00:02:08.919 --> 00:02:10.520
still could be noise.

00:02:10.520 --> 00:02:12.990
So in some sense, another way of
thinking about this lecture

00:02:12.990 --> 00:02:15.170
is that this lecture is
all about the noise.

00:02:15.170 --> 00:02:18.160
And how big a sample do we
need for the noise to be

00:02:18.160 --> 00:02:20.800
sufficiently small for us to
actually credibly detect the

00:02:20.800 --> 00:02:24.400
differences between the two
different groups, OK?

00:02:24.400 --> 00:02:26.760
So that's what we're going to
talk about is basically, how

00:02:26.760 --> 00:02:28.960
large is large so we can
get rid of the noise?

00:02:28.960 --> 00:02:30.595
And let me say, by the way, that
we've got an hour and a

00:02:30.595 --> 00:02:33.310
half, but you should feel free
to interrupt with questions or

00:02:33.310 --> 00:02:35.270
whatever if I say something
that's not clear because

00:02:35.270 --> 00:02:36.650
there's a lot of material that
we're going to be going

00:02:36.650 --> 00:02:38.980
through pretty quickly.

00:02:38.980 --> 00:02:40.390
OK.

00:02:40.390 --> 00:02:45.050
So when we think about how big
our sample needs to be--

00:02:45.050 --> 00:02:49.050
remember, the whole point is how
big does our sample have

00:02:49.050 --> 00:02:52.530
to be to remove the noise that's
going to be in our data?

00:02:52.530 --> 00:02:56.050
And when we think about that, we
think essentially about how

00:02:56.050 --> 00:02:58.190
noisy our data is, right?

00:02:58.190 --> 00:03:02.230
So how big a sample we need is
going to be determined by how

00:03:02.230 --> 00:03:05.810
noisy is the data and also how
big an effect we're looking

00:03:05.810 --> 00:03:06.650
for, right?

00:03:06.650 --> 00:03:11.770
So if the data is really noisy
but the effect is enormous,

00:03:11.770 --> 00:03:13.200
then we don't need as
big of a sample.

00:03:13.200 --> 00:03:15.310
But if the effect we're looking
for is really small

00:03:15.310 --> 00:03:17.550
relative to the noise in the
data, we're going to need a

00:03:17.550 --> 00:03:18.260
bigger sample.

00:03:18.260 --> 00:03:20.690
So actually, in some sense it's
the comparison between the

00:03:20.690 --> 00:03:22.450
effect size and how
noisy the data is.

00:03:22.450 --> 00:03:24.330
It's the ratio between
these things

00:03:24.330 --> 00:03:26.520
that's really important.

00:03:26.520 --> 00:03:28.920
Other factors that we're going
to talk about are, did we do a

00:03:28.920 --> 00:03:31.160
baseline survey before
we started?

00:03:31.160 --> 00:03:33.400
Because a baseline can
essentially help us reduce the

00:03:33.400 --> 00:03:35.950
noise in some sense.

00:03:35.950 --> 00:03:37.520
We're going to talk about
whether individual responses

00:03:37.520 --> 00:03:38.560
are correlated with
each other.

00:03:38.560 --> 00:03:43.790
So for example, if we were to
randomize a whole group of

00:03:43.790 --> 00:03:47.050
people into a given treatment,
that group might be similar in

00:03:47.050 --> 00:03:48.100
lots of other respects.

00:03:48.100 --> 00:03:50.610
So you can't really count that
whole group as if they were

00:03:50.610 --> 00:03:52.430
all independent observations
because they might be

00:03:52.430 --> 00:03:52.880
correlated.

00:03:52.880 --> 00:03:54.890
For example, you all just
took my lecture.

00:03:54.890 --> 00:03:57.710
So if you all were put in the
same treatment group, you all

00:03:57.710 --> 00:04:00.085
were exposed to the treatment
but you also all were exposed

00:04:00.085 --> 00:04:01.030
to my lecture and so you're not

00:04:01.030 --> 00:04:03.970
necessarily independent events.

00:04:03.970 --> 00:04:05.650
And there are some other issues
in terms of the design

00:04:05.650 --> 00:04:07.270
of the experiment that we'll
talk about that can help

00:04:07.270 --> 00:04:10.050
affect sample size as well, like
stratification, control

00:04:10.050 --> 00:04:12.860
variables, baseline data, et
cetera, which we're going to

00:04:12.860 --> 00:04:14.790
talk about, OK?

00:04:14.790 --> 00:04:17.640
So the way we're going to go in
this lecture is, I'm going

00:04:17.640 --> 00:04:20.620
to start off with some basics
about, what does it mean to

00:04:20.620 --> 00:04:23.280
test a hypothesis
statistically?

00:04:23.280 --> 00:04:25.440
And then when we get into
hypothesis testing, there are

00:04:25.440 --> 00:04:28.670
two different types of
errors that we're

00:04:28.670 --> 00:04:29.280
going to talk about.

00:04:29.280 --> 00:04:32.640
They're helpfully named type
I and type II errors.

00:04:32.640 --> 00:04:34.960
And you have to be careful not
to make a type III error,

00:04:34.960 --> 00:04:38.630
which is to confuse a type
I and a type II error.

00:04:38.630 --> 00:04:42.790
So we'll talk about
what those are.

00:04:42.790 --> 00:04:45.215
Then we'll talk about standard
errors and significance, which

00:04:45.215 --> 00:04:48.380
is, how do we think about more
formally what these different

00:04:48.380 --> 00:04:49.630
types of errors are?

00:04:54.060 --> 00:04:55.150
We'll talk about power.

00:04:55.150 --> 00:04:56.500
We'll talk about the
effect size.

00:04:56.500 --> 00:04:58.470
And then, finally, the factors
that influence power, OK?

00:04:58.470 --> 00:04:59.710
So this is all the stuff
we're going to go

00:04:59.710 --> 00:05:01.950
through, all right?

00:05:01.950 --> 00:05:09.108
So in order to understand
the basic concepts of--

00:05:09.108 --> 00:05:11.420
when we're talking about
hypothesis testing, we need to

00:05:11.420 --> 00:05:13.285
think a little about
probabilities, OK?

00:05:13.285 --> 00:05:15.480
Because all this comes down,
essentially, to some basic

00:05:15.480 --> 00:05:18.600
analysis about probability.

00:05:18.600 --> 00:05:21.040
So for example, suppose you had
a professional-- and the

00:05:21.040 --> 00:05:24.390
intuition here is that the more
observations we get, the

00:05:24.390 --> 00:05:27.200
more we can understand the
true probability that

00:05:27.200 --> 00:05:28.380
something occurred--

00:05:28.380 --> 00:05:30.700
whether the true probability
that something occurred was

00:05:30.700 --> 00:05:33.620
due to a real difference in
the underlying process or

00:05:33.620 --> 00:05:34.510
whether it was just
random chance.

00:05:34.510 --> 00:05:37.500
So for example, consider
the following example.

00:05:37.500 --> 00:05:41.100
So suppose you're faced with a
professional gambler who told

00:05:41.100 --> 00:05:44.550
you that she could get heads
most of the time.

00:05:44.550 --> 00:05:46.860
OK, so you might think this is
a reasonable claim or an

00:05:46.860 --> 00:05:48.880
unreasonable claim, but this is
what they're claiming and

00:05:48.880 --> 00:05:50.020
you want to see if
this is true.

00:05:50.020 --> 00:05:53.160
So they toss the coin and
they get heads, right?

00:05:53.160 --> 00:05:56.400
So can we learn anything
from that?

00:05:56.400 --> 00:05:59.070
Well, probably not because
anyone, even with a fair coin,

00:05:59.070 --> 00:06:01.160
50% of the time, they would get
heads if they tossed it.

00:06:01.160 --> 00:06:03.840
So we really can't infer
anything from this one.

00:06:03.840 --> 00:06:06.540
What if you saw that they did it five
times and they got heads,

00:06:06.540 --> 00:06:09.440
heads, tails, heads, heads.

00:06:09.440 --> 00:06:12.000
Well, can you infer anything
about that?

00:06:12.000 --> 00:06:12.610
Well, maybe.

00:06:12.610 --> 00:06:15.000
You can start to say, well, this
seems less likely to have

00:06:15.000 --> 00:06:16.110
occurred just by
random chance.

00:06:16.110 --> 00:06:17.930
But you know there's
only five tosses.

00:06:17.930 --> 00:06:19.750
What's the chance that someone
with an even coin

00:06:19.750 --> 00:06:20.540
can get four heads?

00:06:20.540 --> 00:06:24.610
Well, we could calculate that if
we knew the probabilities.

00:06:24.610 --> 00:06:26.140
And it's certainly not
impossible that this could

00:06:26.140 --> 00:06:27.650
occur, right?

00:06:27.650 --> 00:06:29.860
And now, what if they got
20 tosses, right?

00:06:29.860 --> 00:06:32.210
Well, now you're starting to get
information, although in

00:06:32.210 --> 00:06:34.600
this particular example,
it was closer to 50-50.

00:06:34.600 --> 00:06:37.520
So now you have 12
versus eight.

00:06:37.520 --> 00:06:39.050
Could that have occurred
by random chance?

00:06:39.050 --> 00:06:40.870
Well, maybe it could
have, right?

00:06:40.870 --> 00:06:44.000
Because it's pretty
close to 50-50.

00:06:44.000 --> 00:06:46.530
And now, suppose you had 100
tosses or suppose you had

00:06:46.530 --> 00:06:50.900
1,000 tosses with 609 heads
and 391 tails, right?

00:06:50.900 --> 00:06:56.250
So as you're getting more and
more data, right, you're much

00:06:56.250 --> 00:06:58.600
more likely to say something
is meaningful.

00:06:58.600 --> 00:07:01.600
So if you saw this data, for
example, the odds that it could

00:07:01.600 --> 00:07:03.545
occur by random chance
are pretty high.

00:07:03.545 --> 00:07:07.610
But if you saw this data with
609 heads and 391 tails out of

00:07:07.610 --> 00:07:10.180
1,000 tosses, it's actually
pretty unlikely that this

00:07:10.180 --> 00:07:12.450
would occur just by
random chance, OK?
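
NOTE
Editorial sketch, not part of the lecture: one way to put numbers on the coin-toss
examples above, assuming independent tosses of a fair coin. The function name
tail_prob is hypothetical; it sums exact binomial probabilities of seeing at least
the observed number of heads.
```python
from math import comb
def tail_prob(heads, tosses, p=0.5):
    """P(at least `heads` heads in `tosses` independent tosses with P(heads) = p)."""
    return sum(comb(tosses, k) * p**k * (1 - p)**(tosses - k)
               for k in range(heads, tosses + 1))
# The four situations described in the lecture: 1 toss, 5 tosses, 20 tosses, 1,000 tosses.
for heads, tosses in [(1, 1), (4, 5), (12, 20), (609, 1000)]:
    print(f"{heads}/{tosses} heads or more with a fair coin: {tail_prob(heads, tosses):.3g}")
```
Under these assumptions, 4 of 5 heads and 12 of 20 heads each happen by chance
roughly a fifth to a quarter of the time, while 609 of 1,000 is far rarer than
one in a billion.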

00:07:12.450 --> 00:07:17.330
And so this shows you, as you
get more data you can actually

00:07:17.330 --> 00:07:19.500
say, how likely was this outcome
to have occurred by

00:07:19.500 --> 00:07:20.970
random chance?

00:07:20.970 --> 00:07:23.540
And the more data you have, the
more likely you're going

00:07:23.540 --> 00:07:25.850
to be able to conclude that
actually, this difference you

00:07:25.850 --> 00:07:28.790
observed was actually due to
something that the person was

00:07:28.790 --> 00:07:32.500
doing and not just due to what
would happen randomly.

00:07:32.500 --> 00:07:36.490
And in some sense, all of
statistics is basically this

00:07:36.490 --> 00:07:40.640
intuition, which is, you take
the data you observe and you

00:07:40.640 --> 00:07:43.510
calculate what is the chance
that the data I observe could

00:07:43.510 --> 00:07:45.710
have occurred just
by random chance.

00:07:45.710 --> 00:07:48.690
And if the chance that the data
I observed could have

00:07:48.690 --> 00:07:51.680
happened just by random chance
is really unlikely, then you

00:07:51.680 --> 00:07:54.280
say, well then it must've been
that your program actually had

00:07:54.280 --> 00:07:56.250
an effect, OK?

00:07:56.250 --> 00:07:58.600
Does that make sense?

00:07:58.600 --> 00:08:00.285
That's the basic idea,
essentially, of all of

00:08:00.285 --> 00:08:01.820
statistics is, what's the
probability that this thing

00:08:01.820 --> 00:08:02.710
could have happened randomly?

00:08:02.710 --> 00:08:06.090
And if it's unlikely, then
probably there was something

00:08:06.090 --> 00:08:08.730
else going on.

00:08:08.730 --> 00:08:11.860
Here's another example.

00:08:11.860 --> 00:08:16.110
So what this example shows is,
now suppose you have a second

00:08:16.110 --> 00:08:18.880
gambler who had 1,000 tosses
and they had 530

00:08:18.880 --> 00:08:20.130
heads and 470 tails.

00:08:23.350 --> 00:08:28.770
What this shows is that-- and
that's really a lot of data.

00:08:28.770 --> 00:08:31.660
But in some sense, what we can
learn about this data depends

00:08:31.660 --> 00:08:33.905
on what hypothesis we're
interested in.

00:08:33.905 --> 00:08:39.700
So if the gambler claimed they
obtained heads 70% of the

00:08:39.700 --> 00:08:43.210
time, we could probably say, no,
I don't think so, right?

00:08:43.210 --> 00:08:45.470
This is enough data that the
odds that you would get this

00:08:45.470 --> 00:08:48.330
data pattern if you had heads
70% of the time are really,

00:08:48.330 --> 00:08:50.440
really small, right?

00:08:50.440 --> 00:08:54.280
So we could say, I can
reject this claim.

00:08:54.280 --> 00:08:59.810
But suppose they said that
they claim they could get

00:08:59.810 --> 00:09:02.780
heads 54% of the time, OK?

00:09:02.780 --> 00:09:05.760
And you observe they got
heads 53% of the time.

00:09:05.760 --> 00:09:08.450
Well, you probably couldn't
reject this claim, right?

00:09:08.450 --> 00:09:11.180
Because this is similar enough
to this that if this was the

00:09:11.180 --> 00:09:14.410
truth, this could have occurred
by random chance.
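
NOTE
Editorial sketch, not part of the lecture: the same observed data (530 heads in
1,000 tosses) checked against the two claims, 70% and 54%. The helper name
prob_at_most is hypothetical, and independent tosses are assumed.
```python
from math import comb
def prob_at_most(heads, tosses, p):
    """P(`heads` or fewer heads in `tosses` tosses when the true P(heads) is p)."""
    return sum(comb(tosses, k) * p**k * (1 - p)**(tosses - k)
               for k in range(heads + 1))
# How surprising is a result as low as 530/1000 under each claimed heads rate?
p_if_70 = prob_at_most(530, 1000, 0.70)
p_if_54 = prob_at_most(530, 1000, 0.54)
print(f"claim 70%: P(530 or fewer heads) = {p_if_70:.3g}")
print(f"claim 54%: P(530 or fewer heads) = {p_if_54:.3g}")
```
Under these assumptions the result is astronomically unlikely if the true rate
were 70%, but it happens roughly a quarter of the time if the true rate were 54%,
which is why only the first claim can be rejected.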

00:09:14.410 --> 00:09:21.840
So in some sense, what we can
say based on the data depends

00:09:21.840 --> 00:09:26.180
on how far the data is from our
hypothesis and how much

00:09:26.180 --> 00:09:28.330
data we have.

00:09:28.330 --> 00:09:31.280
Does that make sense as
some basic intuition?

00:09:31.280 --> 00:09:32.530
OK.

00:09:34.680 --> 00:09:38.760
So how do we apply this
to an experiment?

00:09:38.760 --> 00:09:40.690
Well, at the end of the
experiment, what we're going

00:09:40.690 --> 00:09:42.680
to do is we're going
to compare the

00:09:42.680 --> 00:09:43.300
two different groups.

00:09:43.300 --> 00:09:44.060
We're going to compare
the treatment

00:09:44.060 --> 00:09:45.600
and the control group.

00:09:45.600 --> 00:09:47.240
And we're going to say--

00:09:47.240 --> 00:09:49.240
we're going to take a look at
the average, just like we were

00:09:49.240 --> 00:09:50.090
doing in the gambling example.

00:09:50.090 --> 00:09:51.870
We'll compare the average in
the treatment group and the

00:09:51.870 --> 00:09:54.930
average in the control, OK?

00:09:54.930 --> 00:09:56.400
And the difference is
the effect size.

00:09:59.590 --> 00:10:02.900
So for example, in this
particular case, in the

00:10:02.900 --> 00:10:05.070
Panchayat case, you'd look at,
for example, the mean number

00:10:05.070 --> 00:10:07.100
of wells you've got in the
village with the female

00:10:07.100 --> 00:10:09.195
leaders versus the mean number
of wells in the villages with

00:10:09.195 --> 00:10:11.060
the male leaders, OK?

00:10:11.060 --> 00:10:13.100
So that's in some sense our
estimate of how big the

00:10:13.100 --> 00:10:14.350
difference is.

00:10:19.130 --> 00:10:22.370
And the question is going to be,
how likely would we have

00:10:22.370 --> 00:10:25.200
been to observe this difference
between the

00:10:25.200 --> 00:10:28.050
treatment and the control group
if it was just due to

00:10:28.050 --> 00:10:29.080
random chance, OK?

00:10:29.080 --> 00:10:32.400
And that's what we need the
statistics to figure out.

00:10:32.400 --> 00:10:42.893
Now one of the reasons--

00:10:46.520 --> 00:10:49.990
so where does the
noise come from?

00:10:49.990 --> 00:10:53.920
In some sense, we're not going
to observe an infinite number

00:10:53.920 --> 00:10:56.030
of villages.

00:10:56.030 --> 00:10:57.990
Or we're not going to observe
all possible villages.

00:10:57.990 --> 00:11:00.600
In fact, even if we observe all
the villages that exist,

00:11:00.600 --> 00:11:02.110
we're not going to observe,
in some sense, all of the

00:11:02.110 --> 00:11:04.130
possible villages that could've
hypothetically

00:11:04.130 --> 00:11:06.640
existed if the villages were
replicated millions and

00:11:06.640 --> 00:11:08.580
millions of times.

00:11:08.580 --> 00:11:09.630
We're just going
to observe some

00:11:09.630 --> 00:11:12.550
finite number of villages.

00:11:12.550 --> 00:11:17.070
And so we're going to estimate
this mean by computing the

00:11:17.070 --> 00:11:21.820
mean in the villages that
we observed, OK?

00:11:21.820 --> 00:11:32.290
And if there are very few
villages, that mean that we're

00:11:32.290 --> 00:11:35.700
going to calculate is going to
be imprecise because if you

00:11:35.700 --> 00:11:38.460
took a different sample of
villages, you would get a

00:11:38.460 --> 00:11:41.080
slightly different mean, OK?

00:11:41.080 --> 00:11:43.010
If you sample an infinite number
of villages, you get

00:11:43.010 --> 00:11:44.770
the same thing every time.

00:11:44.770 --> 00:11:49.110
But suppose you only sampled
one village.

00:11:49.110 --> 00:11:50.950
Or suppose there was a million
villages out there and you

00:11:50.950 --> 00:11:52.380
sampled two, right?

00:11:52.380 --> 00:11:53.970
And you took the average, OK?

00:11:53.970 --> 00:11:56.120
If you sampled a different two
villages, just by random

00:11:56.120 --> 00:11:58.480
chance, you would get
a different average.

00:11:58.480 --> 00:12:00.880
And in some sense that's where
part of the noise in our data

00:12:00.880 --> 00:12:02.130
is coming from.
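
NOTE
Editorial sketch, not part of the lecture: a small simulation of sampling noise.
The population of villages (mean 50, SD 10, 100,000 villages) is invented for
illustration, and spread_of_sample_means is a hypothetical helper.
```python
import random
import statistics
random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical population: one outcome value per village, invented for illustration.
population = [random.gauss(50, 10) for _ in range(100_000)]
def spread_of_sample_means(sample_size, repeats=2000):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [statistics.fmean(random.sample(population, sample_size))
             for _ in range(repeats)]
    return statistics.stdev(means)
for n in (2, 20, 200):
    print(f"n={n:>3}: sample means vary with SD about {spread_of_sample_means(n):.2f}")
```
With only two villages per sample the average moves around a lot from one draw
to the next; with 200 villages it barely moves.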

00:12:08.370 --> 00:12:09.620
So for example--

00:12:11.860 --> 00:12:14.020
sorry.

00:12:14.020 --> 00:12:17.350
So in some sense, what we need
to know is, we need to know if

00:12:17.350 --> 00:12:20.480
these two groups-- it sort of
goes back to the same as

00:12:20.480 --> 00:12:24.760
before, if these two groups were
the same and I sampled

00:12:24.760 --> 00:12:27.950
them, what are the chances I
would get the difference that

00:12:27.950 --> 00:12:29.010
I observed by random chance?

00:12:29.010 --> 00:12:31.640
So for example, suppose
you observed these two

00:12:31.640 --> 00:12:33.670
distributions, OK?

00:12:33.670 --> 00:12:34.900
So this is your control
group and this is

00:12:34.900 --> 00:12:36.685
your treatment group.

00:12:36.685 --> 00:12:40.640
Now you can see there is some
noise in the data, right?

00:12:40.640 --> 00:12:44.170
This one is a mean of 50 and
this one is a mean of 60.

00:12:44.170 --> 00:12:45.060
And there's some--

00:12:45.060 --> 00:12:46.360
these are histograms, right?

00:12:46.360 --> 00:12:48.580
So this is the distribution of
the number of villages that

00:12:48.580 --> 00:12:52.930
you observed for each
possible outcome.

00:12:52.930 --> 00:12:55.540
So you can see here that there's
some noise, right?

00:12:55.540 --> 00:12:57.490
It's not that everyone here was
exactly 50 and everyone

00:12:57.490 --> 00:12:58.310
here was exactly 60.

00:12:58.310 --> 00:12:59.310
Some people were 45.

00:12:59.310 --> 00:13:02.030
Some were 55 or whatever.

00:13:02.030 --> 00:13:07.650
But if you look at these two
distributions, you could say

00:13:07.650 --> 00:13:10.730
it's pretty unlikely that if
these were actually drawn from

00:13:10.730 --> 00:13:12.780
the same distribution of
villages, all of the blue ones

00:13:12.780 --> 00:13:14.030
would be over here and
all the yellow ones

00:13:14.030 --> 00:13:16.210
would be over here.

00:13:16.210 --> 00:13:18.490
It's very unlikely that if these
were actually the same

00:13:18.490 --> 00:13:21.930
and you draw randomly, you get
this real bifurcation of these

00:13:21.930 --> 00:13:25.040
villages, OK?

00:13:28.680 --> 00:13:33.090
And where are we basing that
idea, that conclusion on?

00:13:33.090 --> 00:13:37.610
We're basing the conclusion on
the fact that there's not a

00:13:37.610 --> 00:13:40.710
lot of overlap, in some sense,
between these two groups.

00:13:40.710 --> 00:13:45.780
But now, what if you saw
this picture, right?

00:13:45.780 --> 00:13:48.411
What would you be able
to conclude?

00:13:48.411 --> 00:13:51.030
Well, it's a little
less clear.

00:13:51.030 --> 00:13:52.410
The mean is still the same.

00:13:52.410 --> 00:13:54.650
All the yellows still have an
average of 60 and all the

00:13:54.650 --> 00:13:56.000
blues have an average of 50.

00:13:56.000 --> 00:13:57.450
But there's a lot more
overlap between them.

00:13:57.450 --> 00:13:59.460
Now if we look at this, we can
sort of eyeball it and say,

00:13:59.460 --> 00:14:02.916
well, there's really a pretty
big difference even relative

00:14:02.916 --> 00:14:04.640
to the distributions there.

00:14:04.640 --> 00:14:06.060
So maybe we could conclude that

00:14:06.060 --> 00:14:06.790
they were really different.

00:14:06.790 --> 00:14:07.950
Maybe not.

00:14:07.950 --> 00:14:10.510
And what if we saw
this, right?

00:14:10.510 --> 00:14:12.340
This is still the same means.

00:14:12.340 --> 00:14:13.990
The yellows have a mean
of 60 and the blues

00:14:13.990 --> 00:14:15.225
have a mean of 50.

00:14:15.225 --> 00:14:17.590
But now they're so
interspersed that

00:14:17.590 --> 00:14:18.940
it's harder to know--

00:14:18.940 --> 00:14:20.890
it's possible, if you saw
pictures like this, you would

00:14:20.890 --> 00:14:24.170
say, well, yes, the yellows are
higher, but maybe this was

00:14:24.170 --> 00:14:28.990
just due to random chance, OK?

00:14:28.990 --> 00:14:32.790
So the purpose of these
graphs is to show you

00:14:32.790 --> 00:14:35.900
that in order-- so in all these
cases, we had the same difference

00:14:35.900 --> 00:14:37.110
in the mean outcomes.

00:14:37.110 --> 00:14:41.160
It was 60 versus 50 in all
three cases, right?

00:14:41.160 --> 00:14:44.440
But when you saw this graph, it
was quite clear that these

00:14:44.440 --> 00:14:46.210
two groups were really
different.

00:14:46.210 --> 00:14:49.040
When you saw this graph, it was
much harder to figure out

00:14:49.040 --> 00:14:50.740
if these two were really
different or if this was just

00:14:50.740 --> 00:14:53.340
due to random chance, OK?

00:14:53.340 --> 00:14:55.490
Does that make sense of
where we're going?

00:14:55.490 --> 00:14:58.510
And so, just to come back to
the same theme, all the

00:14:58.510 --> 00:15:02.350
statistics are going to do in
our case is going to help us

00:15:02.350 --> 00:15:06.710
figure out, are these
differences big enough, given

00:15:06.710 --> 00:15:10.600
the distribution of data we
have, how likely is it that

00:15:10.600 --> 00:15:11.670
the difference we observed
could have

00:15:11.670 --> 00:15:13.470
happened by random chance.
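
NOTE
Editorial sketch, not part of the lecture: the three histograms share the same
means (60 versus 50) and differ only in spread. The group size of 20 and the
SD values 2, 10 and 30 are assumptions chosen for illustration, and the p-value
uses a normal approximation rather than whatever test the course later adopts.
```python
from math import sqrt, erfc
def two_sided_p(mean_t, mean_c, sd, n_per_group):
    """Normal-approximation p-value for a difference in means (equal SD, equal group sizes)."""
    se_diff = sd * sqrt(2 / n_per_group)   # standard error of the difference in means
    z = abs(mean_t - mean_c) / se_diff     # difference measured in standard errors
    return erfc(z / sqrt(2))               # P(|Z| >= z) for a standard normal Z
# Same 10-point gap, three assumed levels of noise.
for sd in (2, 10, 30):
    print(f"SD={sd:>2}: p = {two_sided_p(60, 50, sd, n_per_group=20):.3g}")
```
Under these assumptions the same 10-point gap is overwhelming evidence at SD 2,
still clear at SD 10 (p about 0.002), and consistent with chance at SD 30
(p about 0.29).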

00:15:13.470 --> 00:15:16.570
And so intuitively, we can
look at this one and say,

00:15:16.570 --> 00:15:17.740
definitely different.

00:15:17.740 --> 00:15:19.510
And this one, maybe not sure.

00:15:19.510 --> 00:15:21.550
But if we want to be a little
more precise about that,

00:15:21.550 --> 00:15:23.390
that's where we need the
added statistics.

00:15:23.390 --> 00:15:25.190
AUDIENCE: Is the sample size
the same in both examples?

00:15:25.190 --> 00:15:28.210
PROFESSOR: Yeah, the sample
size is the same.

00:15:28.210 --> 00:15:29.580
Yeah, sample size is
exactly the same.

00:15:29.580 --> 00:15:34.240
So you can see that the numbers
go down because it's

00:15:34.240 --> 00:15:35.490
more spread out.

00:15:41.410 --> 00:15:44.320
All right.

00:15:44.320 --> 00:15:47.520
So in some sense, what are the
ingredients that we've talked

00:15:47.520 --> 00:15:49.300
about in terms of thinking
about whether you have a

00:15:49.300 --> 00:15:52.580
statistically significant
difference?

00:15:52.580 --> 00:15:54.180
If you think back to the gambler
example, we talked

00:15:54.180 --> 00:15:57.830
about the sample size
matters, right?

00:15:57.830 --> 00:16:04.170
So if we saw 1,000 tosses, we
had much more precision about

00:16:04.170 --> 00:16:08.100
our estimates than if we had
10 tosses or five tosses.

00:16:08.100 --> 00:16:10.850
The hypothesis you're testing
matters, right?

00:16:10.850 --> 00:16:17.050
Because the smaller an effect
size we're trying to detect,

00:16:17.050 --> 00:16:20.820
the more tosses we need in
the gambler example.

00:16:20.820 --> 00:16:23.660
If you're trying to detect a
really small difference, you

00:16:23.660 --> 00:16:26.180
need a ton of data, whereas if
you're trying to detect really

00:16:26.180 --> 00:16:31.130
extreme differences, you can
do it with less data, OK?

00:16:31.130 --> 00:16:33.830
And the third thing we saw is
the variability of the outcome

00:16:33.830 --> 00:16:34.890
matters, right?

00:16:34.890 --> 00:16:38.340
So the more noisy the outcome
is, the harder it is to know

00:16:38.340 --> 00:16:40.460
whether the differences that
we observe are due just to

00:16:40.460 --> 00:16:43.160
random chance or if they're
really due to some difference in

00:16:43.160 --> 00:16:44.480
the treatment versus
the control group.

00:16:48.690 --> 00:16:50.940
OK, so does this make sense?

00:16:50.940 --> 00:16:53.510
Before I go on, these are the
three ingredients that we're

00:16:53.510 --> 00:16:54.700
going to be playing with.

00:16:54.700 --> 00:16:55.740
Do these make sense?

00:16:55.740 --> 00:16:59.225
Do you have questions on this?

00:16:59.225 --> 00:17:00.710
OK.

00:17:00.710 --> 00:17:06.210
So you may have heard of
a confidence interval.

00:17:06.210 --> 00:17:06.990
How many of you guys
have heard of

00:17:06.990 --> 00:17:08.859
a confidence interval?

00:17:08.859 --> 00:17:10.109
OK.

00:17:12.290 --> 00:17:13.710
How many of you can state
the definition of

00:17:13.710 --> 00:17:16.079
a confidence interval?

00:17:16.079 --> 00:17:16.609
Thanks, Dan.

00:17:16.609 --> 00:17:17.859
I'm glad that you can.

00:17:21.060 --> 00:17:30.270
So what do we mean when we
say confidence interval?

00:17:30.270 --> 00:17:31.665
What we mean by a confidence
interval--

00:17:35.000 --> 00:17:37.060
so let's just go through what's
on the slide and then

00:17:37.060 --> 00:17:38.690
we can talk about it
a little more.

00:17:38.690 --> 00:17:41.760
So we're going to measure,
say, 100 people and we're

00:17:41.760 --> 00:17:43.560
going to come up with an
average length of 53

00:17:43.560 --> 00:17:44.880
centimeters.

00:17:44.880 --> 00:17:47.260
So we want to be able to say
something about how precise

00:17:47.260 --> 00:17:48.460
our estimate is.

00:17:48.460 --> 00:17:51.990
So we say the average
is 53 centimeters.

00:17:51.990 --> 00:17:56.810
How confident are we or how
precise are we that it's 53?

00:17:56.810 --> 00:17:59.600
And that's what a confidence
interval is trying to say.

00:17:59.600 --> 00:18:02.315
And a confidence interval,
essentially, tells us that

00:18:02.315 --> 00:18:04.920
with 95% probability--

00:18:04.920 --> 00:18:07.390
so we have a confidence interval
of 50-56 says that

00:18:07.390 --> 00:18:10.860
with 95% probability, the
true average length lies

00:18:10.860 --> 00:18:13.210
between 50 and 56.

00:18:13.210 --> 00:18:20.210
And so the precise definition
is that if you had a

00:18:20.210 --> 00:18:25.280
hypothesis that the true average
length was in this

00:18:25.280 --> 00:18:31.500
range with--

00:18:31.500 --> 00:18:34.900
no, I'm going to get it wrong.

00:18:34.900 --> 00:18:43.800
It says that if you had a
hypothesis that the true

00:18:43.800 --> 00:18:48.090
average was in here, it's within
95% probability that

00:18:48.090 --> 00:18:52.790
you would get the data
that you observe, OK?

00:18:52.790 --> 00:18:55.720
A converse way of saying it is
that the truth is somewhere in

00:18:55.720 --> 00:18:58.180
this range, right?

00:19:01.620 --> 00:19:03.600
You can be 95% certain that the
truth is somewhere within

00:19:03.600 --> 00:19:08.720
this range. So if you did 20 of
these tests, only one out

00:19:08.720 --> 00:19:10.930
of 20 times would the truth
be outside your confidence

00:19:10.930 --> 00:19:16.370
interval, OK?
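
NOTE
Editorial sketch, not part of the lecture: a simulation of the "one out of 20"
statement. It assumes normally distributed lengths with true mean 53 and SD 15
(an invented SD that gives an interval of roughly 50 to 56 with 100 people) and
uses the usual mean plus or minus 1.96 standard errors.
```python
import random
import statistics
from math import sqrt
random.seed(1)  # fixed seed so the sketch is repeatable
TRUE_MEAN, TRUE_SD, N = 53, 15, 100   # assumed values for illustration
def interval_covers_truth():
    """Draw one sample, build its 95% interval, report whether it contains the true mean."""
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / sqrt(N)   # estimated standard error of the mean
    return mean - 1.96 * se <= TRUE_MEAN <= mean + 1.96 * se
trials = 2000
coverage = sum(interval_covers_truth() for _ in range(trials)) / trials
print(f"share of intervals containing the true mean: {coverage:.3f}")
```
The share should land close to 0.95, so about one interval in 20 misses the
truth. Strictly, the 95% describes how often the procedure captures the truth
across repeated samples.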

00:19:16.370 --> 00:19:18.980
And so an approximate
interpretation of a confidence

00:19:18.980 --> 00:19:19.930
interval is--

00:19:19.930 --> 00:19:23.910
so we know that with the point
estimate of 53, we have some

00:19:23.910 --> 00:19:24.950
uncertainty about
that estimate.

00:19:24.950 --> 00:19:27.790
We think the average is 53, but
there's some uncertainty.

00:19:27.790 --> 00:19:32.210
And the confidence interval
says, well, it's 95% likely

00:19:32.210 --> 00:19:35.755
that the true answer is between
50 and 56, if that was

00:19:35.755 --> 00:19:41.300
the confidence interval, OK?

00:19:41.300 --> 00:19:45.730
So why is that useful for us?

00:19:45.730 --> 00:19:47.700
Well, our goal is
to figure out--

00:19:47.700 --> 00:19:49.280
we don't care, actually, what
our estimate of the

00:19:49.280 --> 00:19:50.080
program's effect is.

00:19:50.080 --> 00:19:51.990
We care what the true effect
of a program is, right?

00:19:51.990 --> 00:19:52.880
So we did some intervention.

00:19:52.880 --> 00:19:54.540
Like, for example, we had a
female Panchayat leader

00:19:54.540 --> 00:19:56.870
instead of a male Panchayat
leader and we want to figure

00:19:56.870 --> 00:20:02.740
out what the actual difference
that that intervention made is

00:20:02.740 --> 00:20:04.230
in the world.

00:20:04.230 --> 00:20:08.930
We're going to observe some
sample of Panchayats and we'll

00:20:08.930 --> 00:20:11.500
look at the difference
in that sample.

00:20:11.500 --> 00:20:13.490
And we want to know how much
can we learn about the true

00:20:13.490 --> 00:20:15.650
program effect from
what we estimated.

00:20:15.650 --> 00:20:17.580
And the confidence interval
basically tells us that with

00:20:17.580 --> 00:20:21.310
95% probability, the true
program effect is somewhere in

00:20:21.310 --> 00:20:24.020
the confidence interval, OK?

00:20:24.020 --> 00:20:25.270
Does that make sense?

00:20:29.790 --> 00:20:33.020
How many of you guys have heard
of the standard error?

00:20:33.020 --> 00:20:33.930
OK.

00:20:33.930 --> 00:20:38.620
So a standard error is related
to the confidence interval in

00:20:38.620 --> 00:20:45.430
that a standard error says that
if we have some estimate,

00:20:45.430 --> 00:20:49.840
you could imagine that if we
did the experiment again,

00:20:49.840 --> 00:20:52.610
essentially, with a new sample
of people that looked like the

00:20:52.610 --> 00:20:57.830
original sample of people, we
might get a slightly different

00:20:57.830 --> 00:20:59.360
point estimate because it's
a different sample.

00:21:02.680 --> 00:21:06.470
The standard error basically
says, what's the distribution

00:21:06.470 --> 00:21:10.810
of those possible estimates
that you could get, OK?

00:21:10.810 --> 00:21:13.640
So it says that basically, if
I did this experiment again,

00:21:13.640 --> 00:21:15.810
maybe I wouldn't get
53, I'd get 54.

00:21:15.810 --> 00:21:17.380
If I did it again,
maybe I'd get 52.

00:21:17.380 --> 00:21:19.020
If I did it again,
I might get 53.

00:21:19.020 --> 00:21:21.220
The standard error is
essentially the standard

00:21:21.220 --> 00:21:25.250
deviation of those possible
estimates that you could get.
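
NOTE
A minimal simulation sketch of the idea above: rerun the "experiment" many times and take the standard deviation of the point estimates. The population values (true mean 53, standard deviation 15, sample of 100) and the 2000 repetitions are assumptions chosen for illustration, not figures from the lecture. With these values the formula standard error is 1.5, so two standard errors either side of 53 would give 50 to 56.
```python
import random
import statistics
random.seed(0)  # fixed seed so the sketch is reproducible
TRUE_MEAN, TRUE_SD, N = 53.0, 15.0, 100  # assumed values, not from the lecture
def one_estimate():
    """Draw one fresh sample of N people and return its average."""
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    return statistics.fmean(sample)
estimates = [one_estimate() for _ in range(2000)]  # repeat the experiment 2000 times
simulated_se = statistics.stdev(estimates)  # spread of the point estimates
formula_se = TRUE_SD / N ** 0.5  # textbook standard error of a sample mean
print(round(simulated_se, 2), formula_se)
```
The two printed numbers should be close, which is the sense in which the standard error is the standard deviation of the estimates you could have gotten.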

00:21:25.250 --> 00:21:27.880
What that means in practice
is that--

00:21:31.870 --> 00:21:33.080
well, in practice, the standard
error is very related

00:21:33.080 --> 00:21:34.340
to the confidence interval.

00:21:34.340 --> 00:21:38.150
And basically, a good rule of
thumb is that a 95% confidence

00:21:38.150 --> 00:21:40.330
interval is about two
standard errors.

00:21:40.330 --> 00:21:43.400
So if you ever see an estimate
of the standard error, you can

00:21:43.400 --> 00:21:45.180
calculate the confidence
interval, essentially, by

00:21:45.180 --> 00:21:51.420
going up or down two standard
errors from the point

00:21:51.420 --> 00:21:54.020
estimate, OK?

00:21:54.020 --> 00:21:56.250
And the confidence interval
and standard error,

00:21:56.250 --> 00:21:57.980
essentially, are capturing
the same thing.

00:21:57.980 --> 00:22:00.190
They're both capturing--

00:22:00.190 --> 00:22:03.080
when I said we need statistics
to basically compute, how

00:22:03.080 --> 00:22:05.420
likely is it that we would get
these differences by random

00:22:05.420 --> 00:22:07.930
chance, those are all coming out
in the standard error and

00:22:07.930 --> 00:22:09.060
the confidence interval,
right?

00:22:09.060 --> 00:22:12.630
They're computed by both looking
at how noisy our data

00:22:12.630 --> 00:22:17.710
is, which is the variability
of the outcome, and how big

00:22:17.710 --> 00:22:19.410
our sample is, right?

00:22:19.410 --> 00:22:22.450
Because from these two things,
you can basically calculate

00:22:22.450 --> 00:22:26.570
how uncertain your estimate
would be.
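
NOTE
One standard way to write this down, assuming the estimate is a simple difference in means between a treatment group T and a control group C. The symbols s (sample standard deviation of the outcome) and n (group size) are introduced here for illustration and do not appear in the lecture.
```latex
\[
\mathrm{SE}\left(\bar{Y}_T - \bar{Y}_C\right)
  = \sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}},
\qquad
\text{95\% CI} \approx \left(\bar{Y}_T - \bar{Y}_C\right) \pm 2\,\mathrm{SE}
\]
```
Noisier outcomes (larger s) widen the interval, and larger samples (larger n) narrow it.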

00:22:26.570 --> 00:22:29.800
This is a lot of terminology
very quickly, but does this

00:22:29.800 --> 00:22:32.180
all make sense?

00:22:32.180 --> 00:22:35.380
Any questions on this?

00:22:35.380 --> 00:22:36.630
OK.

00:22:38.810 --> 00:22:41.550
So for example.

00:22:41.550 --> 00:22:46.005
So suppose we saw the sampled
women Pradhans had 7.13 years

00:22:46.005 --> 00:22:51.200
of education and the men had
9.92 years of education, OK?

00:22:51.200 --> 00:22:56.410
And you want to know, is the
truth that men have more

00:22:56.410 --> 00:22:58.860
education than women or is this
just a random artifact of

00:22:58.860 --> 00:23:00.980
our sample?

00:23:00.980 --> 00:23:06.640
So suppose you calculated that
the difference was 2.59.

00:23:06.640 --> 00:23:07.880
That's easy to calculate.

00:23:07.880 --> 00:23:11.520
And the standard error was 0.54
and the standard error

00:23:11.520 --> 00:23:13.650
was going to be calculated based
on both how much data

00:23:13.650 --> 00:23:17.250
you had and how noisy
the data was.

00:23:17.250 --> 00:23:19.660
You would compute that the 95%
confidence interval is between

00:23:19.660 --> 00:23:22.620
1.53 and 3.64, OK?

00:23:22.620 --> 00:23:24.960
So this means that with 95%
probability, the true

00:23:24.960 --> 00:23:27.570
difference in education rates
between men and women is

00:23:27.570 --> 00:23:30.300
between 1.53 and 3.64.

00:23:30.300 --> 00:23:33.870
So if you were interested in
testing the hypothesis that,

00:23:33.870 --> 00:23:38.570
in fact, men and women are the
same in education, you could

00:23:38.570 --> 00:23:40.860
say that I can reject
that hypothesis.

00:23:40.860 --> 00:23:43.285
With 95% probability, the
true difference is

00:23:43.285 --> 00:23:45.360
between 1.53 and 3.64--

00:23:45.360 --> 00:23:49.040
so zero is not in this
confidence interval, right?

00:23:49.040 --> 00:23:52.910
So we can reject the hypothesis
that there's no

00:23:52.910 --> 00:23:55.730
difference between
these two groups.

00:23:55.730 --> 00:23:58.130
Does that make sense?

00:23:58.130 --> 00:23:59.600
So doing another example.

00:23:59.600 --> 00:24:02.410
So in this example, suppose
that we saw that control

00:24:02.410 --> 00:24:04.790
children had an average test
score of 2.45 and the

00:24:04.790 --> 00:24:07.350
treatment had an average
test score of 2.5.

00:24:07.350 --> 00:24:10.060
So we saw a difference of
0.05 and the standard

00:24:10.060 --> 00:24:13.530
error was 0.26, OK?

00:24:13.530 --> 00:24:15.620
So in this case, you would say
well, the 95% confidence

00:24:15.620 --> 00:24:18.690
interval is minus 0.55.

00:24:18.690 --> 00:24:20.710
This is approximately two.

00:24:20.710 --> 00:24:21.820
It's not exactly two.

00:24:21.820 --> 00:24:22.630
Minus 0.55--

00:24:22.630 --> 00:24:24.140
oh, no, it is exactly
two in this example.

00:24:24.140 --> 00:24:28.850
Minus 0.55 to 0.46, OK?

00:24:28.850 --> 00:24:31.355
And here, you would say that
if we were introducing the

00:24:31.355 --> 00:24:33.170
hypothesis that the null
hypothesis is that the

00:24:33.170 --> 00:24:36.300
treatment had no effect on test
scores, you could not

00:24:36.300 --> 00:24:37.890
reject that null hypothesis,
right?

00:24:37.890 --> 00:24:40.990
Because an effect of zero
is within the confidence

00:24:40.990 --> 00:24:43.650
interval, OK?
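
NOTE
A minimal sketch of the calculation in the two examples above, assuming a symmetric interval of the form estimate plus or minus a multiplier times the standard error. The helper names ci95 and rejects_zero are hypothetical. The multiplier 1.96 is the normal-distribution value behind the "about two" rule of thumb.
```python
def ci95(estimate, standard_error, multiplier=1.96):
    """Return (low, high); multiplier=2 gives the rule of thumb."""
    half_width = multiplier * standard_error
    return estimate - half_width, estimate + half_width
def rejects_zero(interval):
    """True when zero lies outside the interval, so 'no effect' is rejected."""
    low, high = interval
    return not (low <= 0.0 <= high)
education = ci95(2.59, 0.54)    # difference in years of education
test_scores = ci95(0.05, 0.26)  # difference in test scores
print(education, rejects_zero(education))
print(test_scores, rejects_zero(test_scores))
```
Under these assumptions the education interval is roughly 1.53 to 3.65 and excludes zero, while the symmetric test-score interval around 0.05 is roughly -0.46 to 0.56 and includes zero.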

00:24:43.650 --> 00:24:45.610
So that's basically how we
use confidence intervals.

00:24:45.610 --> 00:24:46.100
Yeah.

00:24:46.100 --> 00:24:48.350
AUDIENCE: Shouldn't the two
points of that confidence

00:24:48.350 --> 00:24:54.370
interval be equidistant
from 2.59?

00:24:54.370 --> 00:24:58.310
PROFESSOR: From 0.05 you mean?

00:24:58.310 --> 00:24:58.730
Yeah.

00:24:58.730 --> 00:24:59.250
AUDIENCE: [INAUDIBLE]

00:24:59.250 --> 00:25:01.360
PROFESSOR: Yeah, I think--

00:25:01.360 --> 00:25:02.500
oh, over here?

00:25:02.500 --> 00:25:03.330
AUDIENCE: Yeah.

00:25:03.330 --> 00:25:07.035
PROFESSOR: So they actually
don't always have--

00:25:07.035 --> 00:25:09.750
so you raise a good point.

00:25:09.750 --> 00:25:11.310
So there may be some
math errors here.

00:25:11.310 --> 00:25:13.540
I think a more reasonable
estimate, by the way, is that

00:25:13.540 --> 00:25:15.970
this would have to be minus
0.05 for you to get

00:25:15.970 --> 00:25:16.540
something like this.

00:25:16.540 --> 00:25:20.620
AUDIENCE: But in the first
example, if 2.59 is the mean,

00:25:20.620 --> 00:25:23.885
is the difference--

00:25:23.885 --> 00:25:26.790
PROFESSOR: So it's
approximately

00:25:26.790 --> 00:25:27.380
the same, isn't it?

00:25:27.380 --> 00:25:29.120
AUDIENCE: I think it's
a little skewed--

00:25:29.120 --> 00:25:29.670
PROFESSOR: Yeah.

00:25:29.670 --> 00:25:30.620
AUDIENCE: On that side,
it is 2.64.

00:25:30.620 --> 00:25:31.530
PROFESSOR: OK.

00:25:31.530 --> 00:25:33.720
Yeah, so you raise
a good point.

00:25:33.720 --> 00:25:39.520
So when I said that a rule
of thumb is two times the

00:25:39.520 --> 00:25:42.030
standard error, that's
a rule of thumb.

00:25:42.030 --> 00:25:45.820
And in particular cases, you can
sometimes get asymmetric

00:25:45.820 --> 00:25:47.100
confidence intervals.

00:25:47.100 --> 00:25:49.670
So you're right that usually
they should be symmetric and

00:25:49.670 --> 00:25:51.600
probably, for simplicity, we
should have put up symmetric

00:25:51.600 --> 00:25:54.780
ones, but it can occur that
confidence intervals are

00:25:54.780 --> 00:25:57.080
asymmetric.

00:25:57.080 --> 00:25:58.440
For example, if you had a--

00:26:02.000 --> 00:26:05.410
yeah, depending on the
estimation, if you have

00:26:05.410 --> 00:26:07.270
truncation at zero--

00:26:07.270 --> 00:26:08.740
if you know for sure that there
can never be an outcome

00:26:08.740 --> 00:26:11.264
below zero, for example, then
you can get asymmetric

00:26:11.264 --> 00:26:11.738
confidence intervals.

00:26:11.738 --> 00:26:15.530
AUDIENCE: When the distribution
is not normal?

00:26:15.530 --> 00:26:16.770
PROFESSOR: Yeah.

00:26:16.770 --> 00:26:19.070
Exactly.

00:26:19.070 --> 00:26:24.570
But for most things that you'll
be investigating,

00:26:24.570 --> 00:26:26.070
usually they're going to be--

00:26:26.070 --> 00:26:26.950
AUDIENCE: Normal.

00:26:26.950 --> 00:26:28.520
PROFESSOR: Yeah, for outcomes
that are zero.

00:26:28.520 --> 00:26:29.190
One [UNINTELLIGIBLE]

00:26:29.190 --> 00:26:31.670
get non-normal, but
yes, in general,

00:26:31.670 --> 00:26:33.010
they are pretty symmetric.

00:26:33.010 --> 00:26:37.563
But they might not be
exactly symmetric.

00:26:37.563 --> 00:26:40.010
OK.

00:26:40.010 --> 00:26:43.340
So as I sort of was suggesting
as we were going through these

00:26:43.340 --> 00:26:46.470
examples, we're often interested
in testing the

00:26:46.470 --> 00:26:50.460
hypothesis that the effect size
is equal to zero, right?

00:26:50.460 --> 00:26:54.710
The classic hypothesis that you
typically want to know is, did

00:26:54.710 --> 00:26:59.605
my program do anything, right?

00:26:59.605 --> 00:27:02.640
And so, how do you test the
hypothesis that my program--

00:27:02.640 --> 00:27:04.470
so we want to know, did
my program have

00:27:04.470 --> 00:27:06.140
any effect at all?

00:27:06.140 --> 00:27:08.890
And so what we technically want
to do is we want to test

00:27:08.890 --> 00:27:11.960
what's called the null
hypothesis, that the program

00:27:11.960 --> 00:27:15.240
had an effect of nothing,
against an alternative

00:27:15.240 --> 00:27:18.616
hypothesis that the program
had some effect.

00:27:18.616 --> 00:27:23.370
So this is the typical test
that we want to do.

00:27:23.370 --> 00:27:27.670
Now you could say, actually,
I don't care about zero.

00:27:27.670 --> 00:27:30.970
I want to say that I know--
for example, this is the

00:27:30.970 --> 00:27:33.420
standard thing that we would do
in most policy evaluations

00:27:33.420 --> 00:27:34.670
that we're going to be doing.

00:27:37.050 --> 00:27:38.680
It doesn't have to be zero.

00:27:38.680 --> 00:27:40.740
Suppose you were doing a drug
trial and you knew that the

00:27:40.740 --> 00:27:42.850
best existing treatment
out there already has

00:27:42.850 --> 00:27:44.160
an effect of one.

00:27:44.160 --> 00:27:49.420
And so instead of comparing
to zero, you might be

00:27:49.420 --> 00:27:51.250
comparing to one.

00:27:51.250 --> 00:27:54.800
Is it actually better than the
best existing treatment?

00:27:54.800 --> 00:27:59.170
In most cases, we're usually
comparing to zero, OK?

00:27:59.170 --> 00:28:02.140
And usually, we have the
alternative hypothesis that

00:28:02.140 --> 00:28:03.140
the effect is just not zero.

00:28:03.140 --> 00:28:05.080
We're interested in anything
other than zero.

00:28:05.080 --> 00:28:08.000
Sometimes you can specify other
alternative hypotheses,

00:28:08.000 --> 00:28:11.040
that the effect is always
positive or always negative,

00:28:11.040 --> 00:28:13.590
but usually this is the classic
case, which is we're

00:28:13.590 --> 00:28:17.490
saying, we think this thing
had-- the null is no effect.

00:28:17.490 --> 00:28:19.460
We want to say, did this program
have an effect and

00:28:19.460 --> 00:28:23.760
we're interested in any
possible effect, OK?

00:28:23.760 --> 00:28:27.010
And hypothesis testing says,
when can I reject this null

00:28:27.010 --> 00:28:32.170
hypothesis in favor of
this alternative, OK?

00:28:36.600 --> 00:28:39.510
And as we saw, essentially,
the confidence interval is

00:28:39.510 --> 00:28:41.090
giving you a way to do that.

00:28:41.090 --> 00:28:44.700
It's saying, if the null is
outside the confidence

00:28:44.700 --> 00:28:47.370
interval, then I can
reject the null.

00:28:47.370 --> 00:28:47.610
Yeah.

00:28:47.610 --> 00:28:53.665
AUDIENCE: Surely, if we're
trying to assess the impact of

00:28:53.665 --> 00:28:57.220
an intervention, we're always
going to think it's positive.

00:28:57.220 --> 00:28:59.610
Or in general, because--

00:28:59.610 --> 00:29:02.180
I gave someone some money to
increase their income or not.

00:29:02.180 --> 00:29:04.830
We've got a pretty good idea
it's going to be positive.

00:29:04.830 --> 00:29:08.230
The probability it's negative
is pretty--

00:29:08.230 --> 00:29:09.845
PROFESSOR: Why do you--

00:29:09.845 --> 00:29:12.230
AUDIENCE: Yeah, why
do we change our

00:29:12.230 --> 00:29:12.510
significance level--

00:29:12.510 --> 00:29:12.720
[INTERPOSING VOICES]

00:29:12.720 --> 00:29:16.420
PROFESSOR: You ask
a great question.

00:29:16.420 --> 00:29:17.840
And I have to say this is
a bit of a source of

00:29:17.840 --> 00:29:20.380
frustration of mine.

00:29:20.380 --> 00:29:24.950
Let me give you a couple
different answers to that.

00:29:24.950 --> 00:29:26.360
Here's the thing.

00:29:26.360 --> 00:29:27.610
If you did that--

00:29:30.790 --> 00:29:34.260
if I said I can commit, before
I look at the data, that I

00:29:34.260 --> 00:29:37.720
only think it could be positive,
that would mean that

00:29:37.720 --> 00:29:40.700
if it's negative, no matter how
negative, you're going to

00:29:40.700 --> 00:29:43.430
say that was random
chance, OK?

00:29:43.430 --> 00:29:46.730
So it would require a fair
amount of commitment on you,

00:29:46.730 --> 00:29:50.300
on your part, as the
experimenter to say, if I get

00:29:50.300 --> 00:29:53.540
a negative result, no matter how
crazy that negative result

00:29:53.540 --> 00:29:57.310
is, I'm going to say that's
random chance, OK?

00:29:57.310 --> 00:30:04.170
And typically, what often
happens ex post is that people

00:30:04.170 --> 00:30:06.450
can't commit to actually
doing that.

00:30:06.450 --> 00:30:08.780
So suppose you did your
program and you--

00:30:08.780 --> 00:30:12.120
so I actually have a program
right now that I'm working on

00:30:12.120 --> 00:30:13.700
in Indonesia that's supposed to

00:30:13.700 --> 00:30:15.510
improve health and education.

00:30:15.510 --> 00:30:17.820
And it seems to be making
education worse.

00:30:17.820 --> 00:30:20.020
Now, we have no theory for why
this program should be making

00:30:20.020 --> 00:30:22.700
education worse, OK?

00:30:22.700 --> 00:30:25.235
But it certainly seems to
be there in the data.

00:30:25.235 --> 00:30:28.520
Now, if we had adopted your
approach, we wouldn't be

00:30:28.520 --> 00:30:30.630
entertaining the hypothesis that
it made education worse.

00:30:30.630 --> 00:30:33.215
We would say, even though it's
looking like this program is

00:30:33.215 --> 00:30:35.555
making education worse,
that must be random

00:30:35.555 --> 00:30:37.250
noise in the data.

00:30:37.250 --> 00:30:42.110
We're not going to treat that as
something potentially real.

00:30:42.110 --> 00:30:44.480
Ex post, though, you see this in
the data and you're likely

00:30:44.480 --> 00:30:47.530
to say, gee, man, that's a
really negative effect.

00:30:47.530 --> 00:30:49.460
Maybe the program was doing
something that I

00:30:49.460 --> 00:30:50.150
didn't think about.

00:30:50.150 --> 00:30:51.435
And in our case, actually, we're
starting to investigate

00:30:51.435 --> 00:30:53.560
and maybe it's because it was
health and education and we're

00:30:53.560 --> 00:30:56.910
sort of sucking resources away
from education into health.

00:30:56.910 --> 00:30:59.660
So it requires a lot of
commitment on your part, as

00:30:59.660 --> 00:31:04.500
the researcher, that if you get
these negative effects, to

00:31:04.500 --> 00:31:06.260
treat them as random noise.

00:31:06.260 --> 00:31:09.440
And I think that, because most
researchers, even though they

00:31:09.440 --> 00:31:11.640
would like to say they're
going to do that, if it

00:31:11.640 --> 00:31:13.370
happens that they get a really
negative effect, they're going

00:31:13.370 --> 00:31:15.030
to want to say, gee, that looks
like a negative effect.

00:31:15.030 --> 00:31:15.980
We're going to want
to investigate

00:31:15.980 --> 00:31:17.490
that, take that seriously.

00:31:17.490 --> 00:31:22.030
Because most people do that ex
post, the convention is that

00:31:22.030 --> 00:31:25.340
in most cases, to say we're
going to test against either

00:31:25.340 --> 00:31:26.545
hypothesis in either
direction.

00:31:26.545 --> 00:31:27.410
AUDIENCE: Except that
the approach--

00:31:27.410 --> 00:31:28.650
PROFESSOR: Does that
make sense?

00:31:28.650 --> 00:31:31.120
AUDIENCE: Your issue is do
I do this program or not.

00:31:31.120 --> 00:31:33.307
So it doesn't matter whether the
impact of the program is

00:31:33.307 --> 00:31:34.540
zero or negative.

00:31:34.540 --> 00:31:36.170
Even if it's zero, you're
saying that it's--

00:31:36.170 --> 00:31:37.080
PROFESSOR: You're absolutely
right.

00:31:37.080 --> 00:31:42.846
So if you were strict about it
and said, I'm going to do it

00:31:42.846 --> 00:31:45.430
if it's positive and not if it's
zero, then I think you

00:31:45.430 --> 00:31:47.990
are correct that, strictly
speaking, a one-sided

00:31:47.990 --> 00:31:49.640
hypothesis test will be correct
and it would give you

00:31:49.640 --> 00:31:50.260
some more power.

00:31:50.260 --> 00:31:52.210
AUDIENCE: So it would
give you power.

00:31:52.210 --> 00:31:53.090
PROFESSOR: Yeah, it would
give you more power.

00:31:53.090 --> 00:31:53.530
AUDIENCE: [UNINTELLIGIBLE]

00:31:53.530 --> 00:31:53.740
PROFESSOR: Right.

00:31:53.740 --> 00:31:57.220
And the reason it gives you
power is, remember, how does

00:31:57.220 --> 00:31:57.970
hypothesis testing work?

00:31:57.970 --> 00:31:59.520
It says, well, what is the
chance this outcome could have

00:31:59.520 --> 00:32:01.140
occurred 95--

00:32:01.140 --> 00:32:05.050
what would have occurred by
chance 95% of the time?

00:32:05.050 --> 00:32:06.810
When you do a two-sided
test, you say, OK--

00:32:06.810 --> 00:32:08.710
where's my chalkboard?

00:32:08.710 --> 00:32:09.692
Here.

00:32:09.692 --> 00:32:12.440
You imagine a normal
distribution of outcomes.

00:32:12.440 --> 00:32:14.910
You're going to say, well, the
95% is in the middle and

00:32:14.910 --> 00:32:18.790
anything in the tails is the
stuff that I'm going to

00:32:18.790 --> 00:32:20.360
[UNINTELLIGIBLE] by
non-random chance.

00:32:20.360 --> 00:32:22.250
Well, what you're doing with a
one-sided test is you're going

00:32:22.250 --> 00:32:24.960
to say, I'm going to take
that negative stuff--

00:32:24.960 --> 00:32:27.070
way out there negative stuff--
and I'm going to say that's

00:32:27.070 --> 00:32:28.530
also random chance.

00:32:28.530 --> 00:32:31.570
So I'm going to pick my 95%
all the way to the left.

00:32:31.570 --> 00:32:33.960
And that means that the 5%
that's not random chance is a

00:32:33.960 --> 00:32:35.870
little more to the right.

00:32:35.870 --> 00:32:36.900
Do you see what I'm saying?
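
NOTE
A small sketch of where the rejection cutoffs sit in the two cases, assuming the test statistic (estimate divided by its standard error) is approximately standard normal and the program is expected to have a positive effect. It uses only Python's standard library.
```python
from statistics import NormalDist
ALPHA = 0.05  # conventional significance level
z = NormalDist()  # standard normal distribution of the test statistic
two_sided_cutoff = z.inv_cdf(1 - ALPHA / 2)  # 2.5% in each tail
one_sided_cutoff = z.inv_cdf(1 - ALPHA)      # all 5% in the upper tail
print(round(two_sided_cutoff, 3), round(one_sided_cutoff, 3))
```
This prints 1.96 and 1.645. The one-sided test needs a smaller positive estimate to reject, which is where the extra power comes from, at the price of treating every negative estimate as chance.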

00:32:36.900 --> 00:32:38.700
But it requires that if--

00:32:38.700 --> 00:32:40.820
you're committing to, even if
you get really negative

00:32:40.820 --> 00:32:43.160
outcomes, asserting that they're
random chance, which

00:32:43.160 --> 00:32:45.680
is really, often, kind
of unbelievable.

00:32:45.680 --> 00:32:48.480
The other thing is that,
although this is technically

00:32:48.480 --> 00:32:51.040
the way hypothesis testing
is set up, the norms and

00:32:51.040 --> 00:32:54.050
conventions are that we all use
two-sided tests for these

00:32:54.050 --> 00:32:55.310
reasons I talked about.

00:32:55.310 --> 00:33:03.250
And so I can just tell you that,
practically speaking, I

00:33:03.250 --> 00:33:05.140
think if you do a one-sided
test, people are going to be

00:33:05.140 --> 00:33:09.930
skeptical because it may be that
you, actually, would do

00:33:09.930 --> 00:33:13.750
that, but I think most of
the time, people can't

00:33:13.750 --> 00:33:14.300
commit to do that.

00:33:14.300 --> 00:33:16.830
And so the standard has become
two-sided tests.

00:33:16.830 --> 00:33:17.700
But I certainly agree
with you.

00:33:17.700 --> 00:33:20.320
It's very frustrating because
one should be able to

00:33:20.320 --> 00:33:21.570
articulate one-sided
hypotheses.

00:33:24.420 --> 00:33:27.418
That's sort of a long answer,
but does that make sense?

00:33:27.418 --> 00:33:28.350
It's OK.

00:33:28.350 --> 00:33:30.280
OK, now, for those of you on
this side of the board, you

00:33:30.280 --> 00:33:32.945
won't be able to see, but
maybe if I need to write

00:33:32.945 --> 00:33:34.220
something on the board
it will be better.

00:33:34.220 --> 00:33:35.470
OK.

00:33:38.595 --> 00:33:39.760
So now we're going to talk
about type I and type II

00:33:39.760 --> 00:33:46.230
errors, which, as I mentioned,
are not helpfully named.

00:33:46.230 --> 00:33:47.650
OK.

00:33:47.650 --> 00:33:48.900
A type I error--

00:33:53.940 --> 00:33:56.780
so this is all about
probability, so nothing we can

00:33:56.780 --> 00:33:57.590
ever say for sure.

00:33:57.590 --> 00:34:01.250
We can always say that this
is more or less likely.

00:34:01.250 --> 00:34:03.240
And there's two different types
of errors we can make

00:34:03.240 --> 00:34:05.780
when we're doing these
probabilities or doing these

00:34:05.780 --> 00:34:06.980
assessments.

00:34:06.980 --> 00:34:09.969
The first error, and it's called
type I error, is we can

00:34:09.969 --> 00:34:12.760
conclude that there was an
effect when, in fact, there

00:34:12.760 --> 00:34:17.219
was no effect, OK?

00:34:17.219 --> 00:34:21.070
So when I said the 95%
confidence interval, that 95%

00:34:21.070 --> 00:34:23.199
is coming from our choice
about type I errors.

00:34:31.530 --> 00:34:32.790
So for example--

00:34:36.530 --> 00:34:38.550
a significance level is the
probability that you're going

00:34:38.550 --> 00:34:40.620
to falsely conclude the program
had an effect when, in

00:34:40.620 --> 00:34:42.960
fact, there was no effect, OK?

00:34:42.960 --> 00:34:47.020
And that's related to when you
say a 95% confidence interval,

00:34:47.020 --> 00:34:49.710
the remaining 5% is what we're
talking about here.

00:34:49.710 --> 00:34:53.980
That's the probability of making
a type I error, OK?

00:34:53.980 --> 00:34:55.030
And why is that?

00:34:55.030 --> 00:34:57.830
Well, we said there's a 95%
chance that it's going to be

00:34:57.830 --> 00:34:59.590
within this range.

00:34:59.590 --> 00:35:02.510
That means that just by random
chance, there's some chance it

00:35:02.510 --> 00:35:05.295
could be outside that
range, right?

00:35:05.295 --> 00:35:08.430
So if your confidence interval
was over here and zero was

00:35:08.430 --> 00:35:12.650
over here, you would say, well,
with 95% confidence, I'm

00:35:12.650 --> 00:35:14.630
going to assume the program had
an effect because zero is

00:35:14.630 --> 00:35:16.500
not within my confidence
interval.

00:35:16.500 --> 00:35:20.400
However, 5% of the time, the
true effect could be over here

00:35:20.400 --> 00:35:21.430
outside your confidence
interval.

00:35:21.430 --> 00:35:23.460
That's what a 95% confidence
interval means.

00:35:28.050 --> 00:35:33.230
So in some sense, that's
what we mean by a--

00:35:33.230 --> 00:35:35.000
so that's in some sense what
a type I error is.

00:35:35.000 --> 00:35:36.770
A type I error is the
probability that you're going

00:35:36.770 --> 00:35:46.000
to detect an effect when,
in fact, there's not.

00:35:46.000 --> 00:35:51.950
And so the typical levels that
you may see are 5%, 1% or 10%

00:35:51.950 --> 00:35:53.420
significance levels.

00:35:53.420 --> 00:35:55.930
And the way to think about those
significance levels is,

00:35:55.930 --> 00:35:58.650
if you see something that's
significant at the 10% level,

00:35:58.650 --> 00:36:01.820
that means that 10% of the time,
an effect of that size

00:36:01.820 --> 00:36:03.380
could've been just due
to random chance.

00:36:03.380 --> 00:36:06.860
Might not actually
be a true effect.

00:36:06.860 --> 00:36:10.440
And if you've heard of a
p-value, a p-value is exactly

00:36:10.440 --> 00:36:11.560
this number.

00:36:11.560 --> 00:36:14.880
A p-value basically says, what
is the probability that an

00:36:14.880 --> 00:36:18.650
effect this size or larger could
have occurred just by

00:36:18.650 --> 00:36:21.880
random chance, OK?
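
NOTE
A minimal sketch of a two-sided p-value, assuming the estimate divided by its standard error is approximately standard normal when the true effect is zero. The function name two_sided_p_value is hypothetical. The inputs reuse the education and test-score numbers from the earlier examples.
```python
from statistics import NormalDist
def two_sided_p_value(estimate, standard_error):
    """Chance of an estimate at least this far from zero if the true effect is zero."""
    z = abs(estimate / standard_error)
    return 2 * (1 - NormalDist().cdf(z))
education_p = two_sided_p_value(2.59, 0.54)    # education gap
test_score_p = two_sided_p_value(0.05, 0.26)   # test-score gap
print(education_p, test_score_p)
```
Under these assumptions the first p-value is far below 0.05 and the second is far above it, matching the reject and cannot-reject conclusions drawn from the confidence intervals.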

00:36:24.540 --> 00:36:27.280
So that's what's called
a type I error.

00:36:27.280 --> 00:36:34.620
And typically, there's no deep
reason why 5% is the normal

00:36:34.620 --> 00:36:36.580
level of type I errors that we
use, but it's kind of the

00:36:36.580 --> 00:36:39.160
convention.

00:36:39.160 --> 00:36:40.020
It's what everyone else uses.

00:36:40.020 --> 00:36:41.950
If you use something different,
people are going to

00:36:41.950 --> 00:36:42.820
look at you a little funny.

00:36:42.820 --> 00:36:47.430
So the conventions are we have
5%, 10%, and 1%, as these

00:36:47.430 --> 00:36:48.860
significance levels.

00:36:48.860 --> 00:36:55.030
And you might say, gee, 5%
or 10% seems pretty low.

00:36:55.030 --> 00:36:56.110
Maybe I would want
a bigger one.

00:36:56.110 --> 00:36:58.030
But on the other hand, if you
start thinking about it, that

00:36:58.030 --> 00:37:00.160
means that if you use 10%
significance, that means that

00:37:00.160 --> 00:37:02.876
one out of every 10 studies
is going to be wrong.

00:37:02.876 --> 00:37:05.710
Or if you had 10 different
outcomes in your data set, one

00:37:05.710 --> 00:37:08.660
out of every 10 would be
significant even just by

00:37:08.660 --> 00:37:09.910
random chance.
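
NOTE
A short arithmetic sketch of the multiple-outcomes point, assuming the program truly has no effect on any outcome and that the 10 outcomes are independent (an assumption made here for simplicity, not a claim from the lecture).
```python
ALPHA = 0.10      # significance level used for each outcome
N_OUTCOMES = 10   # number of outcomes tested
expected_false_positives = ALPHA * N_OUTCOMES  # average number flagged by chance
prob_at_least_one = 1 - (1 - ALPHA) ** N_OUTCOMES  # assumes independent outcomes
print(expected_false_positives, round(prob_at_least_one, 3))
```
On average one outcome in ten is flagged by chance, and under independence the chance that at least one is flagged is about 65%.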

00:37:12.750 --> 00:37:15.920
So the other type of error is
what's called, as I said,

00:37:15.920 --> 00:37:18.210
helpfully, a type II error.

00:37:18.210 --> 00:37:21.310
And a type II error says that
you fail to reject that the

00:37:21.310 --> 00:37:26.570
program had no effect when, in
fact, there was an effect, OK?

00:37:26.570 --> 00:37:30.870
So this is, the program did
something, but I can't pick it

00:37:30.870 --> 00:37:35.480
up in the data, OK?

00:37:35.480 --> 00:37:39.280
And we talk about the
power of a test.

00:37:39.280 --> 00:37:42.870
The power is basically the
opposite of a type II error.

00:37:42.870 --> 00:37:45.550
A power just says, what's the
probability that I will be

00:37:45.550 --> 00:37:48.730
able to find an effect
given that the actual

00:37:48.730 --> 00:37:52.490
effect is there, OK?

00:37:52.490 --> 00:37:57.070
So when we talk about how big
a sample size we need, what

00:37:57.070 --> 00:38:00.320
we're basically talking about
is, how much power are we

00:38:00.320 --> 00:38:02.250
going to have to detect
an effect?

00:38:02.250 --> 00:38:04.560
Or what's the probability that
given that a true effect is

00:38:04.560 --> 00:38:08.150
there, we're going to pick
it up in the data, OK?

00:38:08.150 --> 00:38:10.740
So here's an example of how
to think about power.

00:38:10.740 --> 00:38:13.960
If I ran the experiment 100
times-- not 100 samples, but

00:38:13.960 --> 00:38:16.380
if I ran the whole
thing 100 times--

00:38:16.380 --> 00:38:18.960
what percentage of the time and
in how many of these cases

00:38:18.960 --> 00:38:21.120
would I be able to say, reject
the hypothesis that men and

00:38:21.120 --> 00:38:24.280
women have the same education
at the 5% level if, in fact,

00:38:24.280 --> 00:38:28.010
they're different, OK?
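
NOTE
A minimal sketch of that thought experiment done analytically rather than by rerunning the study, assuming the estimate is normally distributed around the true effect with a known standard error. The function name power and its parameters are hypothetical.
```python
from statistics import NormalDist
def power(true_effect, standard_error, alpha=0.05):
    """Probability a two-sided test rejects 'no effect' when the effect is real."""
    z = NormalDist()
    cutoff = z.inv_cdf(1 - alpha / 2)        # about 1.96 at the 5% level
    shift = true_effect / standard_error     # true effect in standard-error units
    return z.cdf(shift - cutoff) + z.cdf(-shift - cutoff)
print(round(power(2.59, 0.54), 3))  # a large effect relative to its standard error
print(round(power(0.05, 0.26), 3))  # a small effect relative to its standard error
```
Multiplying the result by 100 gives the number of runs, out of 100, in which you would expect to reject. When the true effect is zero the function returns alpha, the type I error rate.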

00:38:28.010 --> 00:38:34.650
So this is a helpful graph
which basically plots the

00:38:34.650 --> 00:38:36.253
truth and what you're going
to conclude based

00:38:36.253 --> 00:38:37.730
on your data, OK?

00:38:37.730 --> 00:38:40.940
So suppose the truth is that
you had no effect and you

00:38:40.940 --> 00:38:43.570
conclude there's no effect, OK?

00:38:43.570 --> 00:38:47.010
Then you're happy.

00:38:47.010 --> 00:38:49.290
If there was an effect and you
conclude there was an effect,

00:38:49.290 --> 00:38:49.960
you're happy.

00:38:49.960 --> 00:38:52.330
So you want to be in one
of these two boxes.

00:38:52.330 --> 00:38:54.540
And the two types of errors you
can make-- so one type of

00:38:54.540 --> 00:38:56.770
error is over here, right?

00:38:56.770 --> 00:39:00.160
So if the truth is there was no
effect, but you concluded

00:39:00.160 --> 00:39:01.860
there was an effect, that
would be making a

00:39:01.860 --> 00:39:03.670
type I error, OK?

00:39:03.670 --> 00:39:05.080
And this is what we
talked about as size.

00:39:05.080 --> 00:39:09.075
So this one, we normally
fix this one at 5%.

00:39:09.075 --> 00:39:12.680
So it's only 5% of the time--

00:39:12.680 --> 00:39:15.505
if there's no effect, 5% of the
time you're going to end

00:39:15.505 --> 00:39:17.680
up here and 95% of the time
you're going to end up here.

00:39:17.680 --> 00:39:20.560
That's what a 95% confidence
interval is telling you.

00:39:20.560 --> 00:39:24.090
And the other thing is, suppose
that the thing had an

00:39:24.090 --> 00:39:29.310
effect but you couldn't find
it in the data, OK?

00:39:29.310 --> 00:39:30.850
That's what's called
a type II error.

00:39:30.850 --> 00:39:34.040
And that's, when we design our
experiments, we want to make

00:39:34.040 --> 00:39:36.400
sure that our samples are
sufficiently large that the

00:39:36.400 --> 00:39:40.710
probability you end up in this
box is not too big, OK?

00:39:40.710 --> 00:39:44.430
So that's a sense of what we
mean by the different types of

00:39:44.430 --> 00:39:45.720
mistakes or errors
you could make.

00:39:45.720 --> 00:39:46.181
Yeah.

00:39:46.181 --> 00:39:50.330
AUDIENCE: It's kind of
a stupid question.

00:39:50.330 --> 00:39:53.040
So power is the probability
that you are not

00:39:53.040 --> 00:39:54.495
making a type II error?

00:39:54.495 --> 00:39:55.465
PROFESSOR: Yes.

00:39:55.465 --> 00:39:58.470
AUDIENCE: So then power is the
probability that you're in the

00:39:58.470 --> 00:40:00.382
smiley face box,
that you are--

00:40:00.382 --> 00:40:00.860
[INTERPOSING VOICES]

00:40:00.860 --> 00:40:03.190
PROFESSOR: Yes.

00:40:03.190 --> 00:40:05.840
Power is the probability
you're over here.

00:40:05.840 --> 00:40:07.750
Yeah, we say power is related
to type II errors.

00:40:07.750 --> 00:40:08.580
Power is over here.

00:40:08.580 --> 00:40:09.930
This is the power.

00:40:09.930 --> 00:40:11.530
Power is conditional on
there being an effect.

00:40:11.530 --> 00:40:13.966
What's the probability you're
in this box, not this box?

00:40:17.390 --> 00:40:21.630
Probably should say one minus
power to be clearer.

00:40:21.630 --> 00:40:21.780
OK?

00:40:21.780 --> 00:40:23.527
Does that make sense?

00:40:23.527 --> 00:40:24.660
All right.

00:40:24.660 --> 00:40:27.230
So when we're designing
experiments, we typically fix

00:40:27.230 --> 00:40:31.520
this at conventional levels.

00:40:31.520 --> 00:40:34.450
And we choose our sample size
so that we get this, the

00:40:34.450 --> 00:40:36.500
power, or the probability that
you're in the happy face box

00:40:36.500 --> 00:40:39.190
over here to a reasonable level
given the effect size

00:40:39.190 --> 00:40:42.862
that we think we're
likely to get, OK?

00:40:46.348 --> 00:40:47.598
OK.

00:40:49.350 --> 00:40:53.652
Now, in some sense, the next two
things, standard errors,

00:40:53.652 --> 00:40:58.620
are about this box, size.

00:40:58.620 --> 00:41:03.245
And power is about this
box, or these boxes.

00:41:03.245 --> 00:41:05.470
Yeah.

00:41:05.470 --> 00:41:08.260
AUDIENCE: Why is power not also
the probability that you

00:41:08.260 --> 00:41:11.620
end up in the bottom right box
as opposed to the bottom left?

00:41:11.620 --> 00:41:12.990
PROFESSOR: Because
that's size.

00:41:12.990 --> 00:41:17.310
AUDIENCE: Isn't size also linked
to-- or power also

00:41:17.310 --> 00:41:17.790
linked to--

00:41:17.790 --> 00:41:21.985
PROFESSOR: No, they're all
related, but we typically--

00:41:24.860 --> 00:41:27.490
they're related in the
following way.

00:41:27.490 --> 00:41:31.190
We assert a size because when
we calculate our standard

00:41:31.190 --> 00:41:35.620
error-- our confidence
intervals, we pick how big or

00:41:35.620 --> 00:41:37.840
small we want the confidence
intervals to be.

00:41:37.840 --> 00:41:40.090
When we say a 95% confidence
interval, we're

00:41:40.090 --> 00:41:42.460
picking the size, OK?

00:41:42.460 --> 00:41:45.045
So this one, we get to choose.

00:41:45.045 --> 00:41:47.935
AUDIENCE: So it's not sample
size, it's size of the

00:41:47.935 --> 00:41:48.560
confidence interval?

00:41:48.560 --> 00:41:49.690
PROFESSOR: No.

00:41:49.690 --> 00:41:53.660
Yeah, this is size is a--

00:41:53.660 --> 00:41:55.430
yeah, it's the size of the
confidence interval.

00:41:55.430 --> 00:41:55.880
That's right.

00:41:55.880 --> 00:41:56.900
Sorry, it's not the
sample size.

00:41:56.900 --> 00:41:58.470
That's right.

00:41:58.470 --> 00:42:02.210
It's called the size of the
test in yet more confusing

00:42:02.210 --> 00:42:03.460
terminology.

00:42:05.474 --> 00:42:06.350
That's right.

00:42:06.350 --> 00:42:08.640
This is the size of the
confidence interval,

00:42:08.640 --> 00:42:09.120
essentially.

00:42:09.120 --> 00:42:11.090
And this one you pick,
and this one is

00:42:11.090 --> 00:42:13.140
determined by your data.

00:42:16.916 --> 00:42:19.276
OK?

00:42:19.276 --> 00:42:20.220
All right.

00:42:20.220 --> 00:42:25.000
OK, so now let's talk about this
part, which is standard

00:42:25.000 --> 00:42:26.710
errors and significance.

00:42:26.710 --> 00:42:29.540
It's all kind of related.

00:42:29.540 --> 00:42:32.530
All right, so we're going
to estimate the

00:42:32.530 --> 00:42:33.380
effect of our program.

00:42:33.380 --> 00:42:37.750
And we typically call that
beta, or beta hat.

00:42:37.750 --> 00:42:40.560
So the convention is that things
that are estimated, we

00:42:40.560 --> 00:42:42.800
put a little hat
over them, OK?

00:42:42.800 --> 00:42:45.550
So beta hat is going to be our
estimate of the program's

00:42:45.550 --> 00:42:46.760
effectiveness.

00:42:46.760 --> 00:42:49.510
This is our best guess as to the
difference between these

00:42:49.510 --> 00:42:51.120
two groups.

00:42:51.120 --> 00:42:55.100
So for example, this is the
average treatment test score

00:42:55.100 --> 00:42:56.380
minus the average control
test score.

00:42:59.450 --> 00:43:02.560
And then we're also going to
calculate our estimate of the

00:43:02.560 --> 00:43:04.540
standard error of
beta hat, right?

00:43:04.540 --> 00:43:06.130
And remember that the confidence
interval is about

00:43:06.130 --> 00:43:08.890
two times the standard error.

00:43:08.890 --> 00:43:10.880
So the standard error is going
to say how precise our

00:43:10.880 --> 00:43:13.440
estimate of beta hat is, which
is, remember, if we ran the

00:43:13.440 --> 00:43:15.590
experiment 100 times, what will
be the distributions of

00:43:15.590 --> 00:43:21.180
beta hats that we
would get, OK?

00:43:21.180 --> 00:43:23.870
And this depends on the sample
size and the noise in the

00:43:23.870 --> 00:43:25.980
data, right?

00:43:25.980 --> 00:43:28.910
And remember we went through
this already that here, in

00:43:28.910 --> 00:43:32.490
this case, the standard error of
how confident we would be--

00:43:32.490 --> 00:43:36.140
so the beta hat, in this case,
is going to be 10, and in this

00:43:36.140 --> 00:43:38.500
case, it's also going
to be 10, right?

00:43:38.500 --> 00:43:42.910
But here, these two things are
really precisely estimated, so

00:43:42.910 --> 00:43:46.540
our standard error of beta hat
is going to be very small

00:43:46.540 --> 00:43:49.100
because we're going to say we
have a very precise estimate

00:43:49.100 --> 00:43:50.920
of the difference
between them.

00:43:50.920 --> 00:43:52.280
And so the confidence
interval is also

00:43:52.280 --> 00:43:53.830
going to be very small.

00:43:53.830 --> 00:43:55.630
And here, there's lots of noise
in the data, so our

00:43:55.630 --> 00:43:58.570
estimate of the standard error
is going to be larger.

00:43:58.570 --> 00:44:00.490
So in both cases, beta
hat is the same.

00:44:00.490 --> 00:44:01.940
It's 10 in both cases.

00:44:01.940 --> 00:44:03.580
But the standard error
is very big here and

00:44:03.580 --> 00:44:07.260
very small here, OK?

00:44:07.260 --> 00:44:11.320
Now, when we calculate the
statistical significance, we

00:44:11.320 --> 00:44:14.120
use something called
a t-ratio.

00:44:14.120 --> 00:44:15.370
And the t-ratio--

00:44:19.370 --> 00:44:21.510
it's actually often called the
Student's t-ratio, which I

00:44:21.510 --> 00:44:22.810
thought was because
students used it.

00:44:22.810 --> 00:44:24.275
But it's actually named
after Mr. Student.

00:44:28.310 --> 00:44:30.770
It's the ratio of beta hat
to the standard error

00:44:30.770 --> 00:44:33.430
of beta hat, OK?

00:44:33.430 --> 00:44:38.040
And the reason that we happen to
use this ratio is that, if

00:44:38.040 --> 00:44:42.140
there is no effect, if the true beta
is actually zero, we know

00:44:42.140 --> 00:44:44.030
that this thing has a normal
distribution, so we can

00:44:44.030 --> 00:44:46.590
calculate the probabilities that
this thing is really big or

00:44:46.590 --> 00:44:49.010
really small, OK?

00:44:51.630 --> 00:44:54.090
So we calculate this ratio of
beta hat over the standard

00:44:54.090 --> 00:44:55.870
error of beta hat.

00:44:58.550 --> 00:45:01.030
It turns out that
if t is greater

00:45:01.030 --> 00:45:05.330
than, in absolute value--

00:45:05.330 --> 00:45:08.850
sorry, if the absolute value of
t, I should say, is greater

00:45:08.850 --> 00:45:10.940
than 1.96--

00:45:10.940 --> 00:45:16.380
so essentially, if it's bigger
than 2 or less than minus 2,

00:45:16.380 --> 00:45:18.920
we're going to reject the
hypothesis of equality at a

00:45:18.920 --> 00:45:20.810
5% significance level.

00:45:20.810 --> 00:45:21.850
And why is that?

00:45:21.850 --> 00:45:28.980
It's because it turns out, from
statistics, that if the

00:45:28.980 --> 00:45:31.350
truth is zero, OK?

00:45:31.350 --> 00:45:35.170
So if we're in the no effect
box and the truth is zero,

00:45:35.170 --> 00:45:39.180
this ratio, it turns out, will
have a normal distribution.

00:45:39.180 --> 00:45:40.990
And it just turns out from a
normal distribution that the

00:45:40.990 --> 00:45:45.590
probability of landing more than
1.96 away

00:45:45.590 --> 00:45:47.780
from zero, in either direction,
is 5%.

00:45:47.780 --> 00:45:51.880
That's just a fact about normal
distributions, OK?

00:45:51.880 --> 00:45:54.880
So if we calculate this ratio
and we say it's greater in

00:45:54.880 --> 00:45:57.900
absolute value than 1.96, we're
going to reject the

00:45:57.900 --> 00:46:00.320
hypothesis of equality
at the 5% level, OK?

00:46:00.320 --> 00:46:01.220
So we can reject zero.

00:46:01.220 --> 00:46:02.650
Zero is going to be outside of
our confidence interval.

00:46:02.650 --> 00:46:04.945
And if it's less than 1.96,
we're going to fail to reject

00:46:04.945 --> 00:46:06.945
it because zero is going to
be inside our confidence

00:46:06.945 --> 00:46:11.240
interval, OK?

00:46:11.240 --> 00:46:13.470
So in this case, for example,
the difference was 2.59.

00:46:13.470 --> 00:46:14.990
The standard error was 0.54.

00:46:14.990 --> 00:46:19.690
The t-ratio is about seven.

00:46:19.690 --> 00:46:21.490
No, it's about five.

00:46:21.490 --> 00:46:23.180
So we're definitely going
to be able to

00:46:23.180 --> 00:46:24.760
reject in this case.

00:46:24.760 --> 00:46:30.090
So we have a t-ratio
of about five, OK?
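
NOTE
Editor's sketch, not part of the lecture: the t-ratio arithmetic for the
numbers quoted above (difference 2.59, standard error 0.54). The helper
names are hypothetical, and the interval uses the normal-approximation
cutoff of 1.96 that the lecture describes.
```python
CRITICAL_VALUE = 1.96  # two-sided 5% cutoff under the normal approximation
def t_ratio(beta_hat: float, std_error: float) -> float:
    """Ratio of the estimated effect to its standard error."""
    return beta_hat / std_error
def confidence_interval(beta_hat: float, std_error: float) -> tuple:
    """Approximate 95% interval: estimate plus or minus 1.96 standard errors."""
    half_width = CRITICAL_VALUE * std_error
    return beta_hat - half_width, beta_hat + half_width
t = t_ratio(2.59, 0.54)
low, high = confidence_interval(2.59, 0.54)
reject_zero = abs(t) > CRITICAL_VALUE
print(f"t = {t:.2f}, interval = ({low:.2f}, {high:.2f}), reject zero: {reject_zero}")
```
Rejecting when the absolute t-ratio exceeds 1.96 is the same decision as
checking whether zero falls outside the interval.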

00:46:30.090 --> 00:46:35.180
So you may see this terminology
and this is where

00:46:35.180 --> 00:46:36.430
it's coming from.

00:46:36.430 --> 00:46:39.870
Now, there's an important point
to note here, which will

00:46:39.870 --> 00:46:42.380
come up later when we talk
about power calculations,

00:46:42.380 --> 00:46:50.180
which is, in some sense, that
the power that we have is

00:46:50.180 --> 00:46:53.980
determined by this ratio of
the point estimate to our

00:46:53.980 --> 00:46:55.620
standard error.

00:46:55.620 --> 00:46:58.530
And so this says, for example,
that if we kind of look at

00:46:58.530 --> 00:47:05.010
this a little more, that if you
have bigger betas, you can

00:47:05.010 --> 00:47:07.210
still detect effects for
a given standard--

00:47:07.210 --> 00:47:08.650
so if you fix the standard
error but you made beta

00:47:08.650 --> 00:47:11.070
bigger, you're more likely
to conclude there was a

00:47:11.070 --> 00:47:11.940
difference, right?

00:47:11.940 --> 00:47:15.950
So what's going to increase your
being able to conclude

00:47:15.950 --> 00:47:16.980
there was a difference?

00:47:16.980 --> 00:47:18.780
Either your effect size is
bigger or your standard error

00:47:18.780 --> 00:47:20.800
is smaller, mechanically.

00:47:25.500 --> 00:47:28.040
OK.

00:47:28.040 --> 00:47:32.230
So that's how we are going to
calculate being in this box.

00:47:32.230 --> 00:47:34.230
So how do we think about
power, which is the

00:47:34.230 --> 00:47:36.150
probability that we're
in this box?

00:47:36.150 --> 00:47:38.890
We had an effect and we're able
to detect that-- sorry,

00:47:38.890 --> 00:47:46.170
power's in this box-- that
we had an effect, OK?

00:47:46.170 --> 00:47:53.520
So when we're planning an
experiment, we can do some

00:47:53.520 --> 00:47:56.670
calculations to help us figure
out what that power is.

00:47:56.670 --> 00:48:00.450
What's the probability, if the
truth is a certain level, that

00:48:00.450 --> 00:48:02.180
we're going to be able to
pick it up in the data?

00:48:05.390 --> 00:48:06.640
And what do we need
to do that?

00:48:10.180 --> 00:48:12.550
We're going to have to specify
a null hypothesis, which is

00:48:12.550 --> 00:48:13.500
usually zero.

00:48:13.500 --> 00:48:15.030
We're going to be testing that
something's different than

00:48:15.030 --> 00:48:19.120
zero, the two groups are
the same, for example.

00:48:19.120 --> 00:48:19.940
We're going to have to pick our

00:48:19.940 --> 00:48:21.920
significance level, our size.

00:48:21.920 --> 00:48:23.170
And that, we almost
always pick at 5%.

00:48:26.000 --> 00:48:31.680
We're going to have to
pick an effect size.

00:48:31.680 --> 00:48:33.530
And we'll talk about what
exactly this means in a couple

00:48:33.530 --> 00:48:33.950
more slides.

00:48:33.950 --> 00:48:37.650
But when we calculate a power,
a power is for a given

00:48:37.650 --> 00:48:42.380
effect size, OK?

00:48:42.380 --> 00:48:43.800
And then we'll calculate
the power.

00:48:46.330 --> 00:48:50.600
So for example, suppose
that we did this

00:48:50.600 --> 00:48:52.660
and a power was 80%.

00:48:52.660 --> 00:48:54.780
That would mean that if we did
this experiment 100 times--

00:48:54.780 --> 00:48:56.860
not 100 times, but actually
repeated the whole experiment

00:48:56.860 --> 00:49:03.620
100 times, 80% of the times we
did this experiment, if the

00:49:03.620 --> 00:49:07.130
hypothesis is, in fact, false,
and instead, the truth is

00:49:07.130 --> 00:49:11.300
this, we would be able to reject
the null and conclude

00:49:11.300 --> 00:49:16.300
there was a true effect
80% of the time, OK?

00:49:16.300 --> 00:49:18.610
That's a little bit complicated,
but does that

00:49:18.610 --> 00:49:20.975
make sense, what we're going to
be trying to do with power?

00:49:25.250 --> 00:49:26.990
So we're going to fix
the effect size.

00:49:26.990 --> 00:49:29.505
So remember, we fix
the bottom box.

00:49:33.680 --> 00:49:36.040
When we calculate power, we
have to speculate not just

00:49:36.040 --> 00:49:37.020
effect versus no effect.

00:49:37.020 --> 00:49:40.090
We have to postulate just how
effective the program is.

00:49:40.090 --> 00:49:42.920
So we're going to say, suppose
that the effect size is 5%.

00:49:42.920 --> 00:49:48.890
The truth is 0.2, right?

00:49:48.890 --> 00:49:51.930
How big a sample would we need
to be in this box 80%

00:49:51.930 --> 00:49:54.390
of the time, OK?

00:49:54.390 --> 00:49:57.300
So when we say power,
that's what we mean.

00:49:57.300 --> 00:50:03.050
And when we calculate the size
of the experiments, you have

00:50:03.050 --> 00:50:07.300
to make a judgment call of how
big a power do you want.

00:50:07.300 --> 00:50:09.850
The typical powers that we use
when we do power calculations,

00:50:09.850 --> 00:50:12.430
are either 80% or 90%.

00:50:12.430 --> 00:50:14.670
So what does this mean?

00:50:14.670 --> 00:50:16.620
This means-- suppose
you did 80%.

00:50:16.620 --> 00:50:17.680
Or [UNINTELLIGIBLE] this.

00:50:17.680 --> 00:50:19.540
If you did 80%, that would
mean that if you ran your

00:50:19.540 --> 00:50:24.070
experiment 100 times and the
true effect was 0.2 in this

00:50:24.070 --> 00:50:27.220
case, you would be able
to pick up an effect,

00:50:27.220 --> 00:50:30.510
statistically 80 out
of those 100 times.

00:50:30.510 --> 00:50:32.100
20 out of 100 times,
you wouldn't.

00:50:37.280 --> 00:50:42.100
And the bigger your sample size,
the larger your power is

00:50:42.100 --> 00:50:47.070
going to be, OK?
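
NOTE
Editor's sketch, not part of the lecture: one common way to approximate
power for a two-arm comparison of means. It assumes equal-sized arms, the
same known standard deviation in both arms, and a normal approximation.
The function name and the example inputs (effect 0.2, sd 1) are
illustrative assumptions.
```python
from math import sqrt
from statistics import NormalDist
def power_two_arm(effect: float, sd: float, n_per_arm: int, alpha: float = 0.05) -> float:
    """Probability of rejecting 'no effect' when the true effect is `effect`."""
    z = NormalDist()
    # Standard error of a difference in means with n_per_arm units in each arm.
    std_error = sd * sqrt(2.0 / n_per_arm)
    critical = z.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 when alpha is 0.05
    shift = effect / std_error  # where the t-ratio is centered if the effect is real
    # Chance the t-ratio lands beyond either cutoff.
    return z.cdf(shift - critical) + z.cdf(-shift - critical)
for n in (50, 100, 200, 400, 800):
    print(n, round(power_two_arm(0.2, 1.0, n), 2))
```
Holding the effect size and the noise fixed, the printed power rises as
the sample in each arm grows.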

00:50:47.070 --> 00:50:50.410
Does that make sense so far?

00:50:50.410 --> 00:50:52.260
OK.

00:50:52.260 --> 00:50:54.720
Suppose you wanted to calculate
what our power is

00:50:54.720 --> 00:50:57.610
going to be.

00:50:57.610 --> 00:51:00.680
What are the things you
would need to know?

00:51:00.680 --> 00:51:02.590
You would need to know
your significance

00:51:02.590 --> 00:51:03.710
level, or your size.

00:51:03.710 --> 00:51:07.210
And as I said, this,
we just assume, OK?

00:51:07.210 --> 00:51:09.100
This is that bottom box.

00:51:09.100 --> 00:51:12.560
We're just going to assume
that it's 5%.

00:51:12.560 --> 00:51:14.290
And the lower it is,
the larger sample

00:51:14.290 --> 00:51:15.580
you're going to need.

00:51:15.580 --> 00:51:18.190
But this one is sort
of picked for you.

00:51:18.190 --> 00:51:21.250
We almost always use 5% because
that's the convention.

00:51:21.250 --> 00:51:22.738
That's what everyone
uses, essentially.

00:51:27.060 --> 00:51:29.720
The second thing you need to
know is the mean and the

00:51:29.720 --> 00:51:34.014
variance of the outcome in
the comparison group.

00:51:34.014 --> 00:51:37.050
So you need to know--

00:51:37.050 --> 00:51:40.600
so remember, all this power
calculation is going to depend

00:51:40.600 --> 00:51:44.820
on whether your sample looks
like this, really tight, or

00:51:44.820 --> 00:51:46.570
looks like this and
is very noisy.

00:51:46.570 --> 00:51:47.640
Because you obviously
need a much bigger

00:51:47.640 --> 00:51:49.390
sample here than here.

00:51:49.390 --> 00:51:51.920
So in order to do a power
calculation, you need to know,

00:51:51.920 --> 00:51:55.610
well, just what does the outcome
look like, right?

00:51:55.610 --> 00:51:57.920
Does the outcome really have
very narrow variance?

00:52:00.580 --> 00:52:03.110
Is everyone almost exactly the
same, in which case it's going

00:52:03.110 --> 00:52:04.020
to be very easy to
detect effects?

00:52:04.020 --> 00:52:09.620
Or is there a huge range of
people, in which case you're

00:52:09.620 --> 00:52:11.556
going to need bigger effects.

00:52:11.556 --> 00:52:13.120
Now, how do we get this?

00:52:13.120 --> 00:52:16.350
So this one, we just
conventionally set.

00:52:16.350 --> 00:52:17.960
This one, we have to
get somewhere.

00:52:17.960 --> 00:52:22.890
And we usually have to get it
from some other survey.

00:52:22.890 --> 00:52:26.730
So we have to find someone
that collected data in a

00:52:26.730 --> 00:52:27.970
similar population.

00:52:27.970 --> 00:52:30.700
Or sometimes we'll go and
collect data ourselves in that

00:52:30.700 --> 00:52:31.310
same population.

00:52:31.310 --> 00:52:33.820
Just a very small survey just
to get a sense of what this

00:52:33.820 --> 00:52:37.120
variable looks like, OK?

00:52:37.120 --> 00:52:41.010
And if the variability is big,
we're going to need a really

00:52:41.010 --> 00:52:42.650
big sample.

00:52:42.650 --> 00:52:44.323
And if the variability is really
small, we're going to

00:52:44.323 --> 00:52:45.325
need a small sample.

00:52:45.325 --> 00:52:49.010
And it's really important to do
this because you don't want

00:52:49.010 --> 00:52:51.530
to spend all your time and money
running an experiment

00:52:51.530 --> 00:52:53.930
only to turn out that there was
no hope of ever finding an

00:52:53.930 --> 00:52:59.650
effect because the power
was too small, right?

00:52:59.650 --> 00:52:59.955
Yeah.

00:52:59.955 --> 00:53:01.931
AUDIENCE: And this is in the
entire population, not just

00:53:01.931 --> 00:53:03.695
the comparison group, right?

00:53:03.695 --> 00:53:04.577
It says--

00:53:04.577 --> 00:53:06.920
PROFESSOR: Yeah, but before
you do your treatment, the

00:53:06.920 --> 00:53:08.540
comparison and the treatment
are the same.

00:53:08.540 --> 00:53:09.340
AUDIENCE: They are the same.

00:53:09.340 --> 00:53:10.030
PROFESSOR: Doesn't matter.

00:53:10.030 --> 00:53:10.740
AUDIENCE: So it's a baseline
population.

00:53:10.740 --> 00:53:11.520
PROFESSOR: Baseline
would be fine.

00:53:11.520 --> 00:53:14.400
Yeah.

00:53:14.400 --> 00:53:16.090
Before you do your treatment,
they're the same.

00:53:16.090 --> 00:53:20.860
So it doesn't matter, OK?

00:53:20.860 --> 00:53:24.660
And the third thing you need
is, you need to make an

00:53:24.660 --> 00:53:29.570
assumption about what effect
size you want to detect.

00:53:29.570 --> 00:53:30.820
And this one--

00:53:33.950 --> 00:53:37.530
sometimes you also have
to supply this.

00:53:37.530 --> 00:53:42.350
And the best way to think about
what effect size you

00:53:42.350 --> 00:53:47.880
want to put in here is you
want to say, what's the

00:53:47.880 --> 00:53:55.660
smallest effect that would
prompt a policy response, OK?

00:53:55.660 --> 00:53:57.520
So one could think about this,
for example, by doing a

00:53:57.520 --> 00:53:58.710
cost-benefit calculation,
right?

00:53:58.710 --> 00:54:01.910
You could say that we do a
cost-benefit calculation.

00:54:01.910 --> 00:54:03.590
This thing costs $100.

00:54:03.590 --> 00:54:06.120
If we don't get an effect of
0.1, it's just

00:54:06.120 --> 00:54:08.800
not worth $100, right?

00:54:08.800 --> 00:54:11.450
So that would be a good way of
coming up with how big an

00:54:11.450 --> 00:54:14.680
effect size you want here.

00:54:14.680 --> 00:54:16.460
And the idea, then, is if the
effect is any smaller than

00:54:16.460 --> 00:54:18.722
this, it's just not interesting
to distinguish it

00:54:18.722 --> 00:54:21.000
from zero, right?

00:54:21.000 --> 00:54:24.060
Suppose that the thing had a
true effect of 0.001, right?

00:54:24.060 --> 00:54:26.260
But if it was that small of an
effect, it could be completely

00:54:26.260 --> 00:54:26.880
cost ineffective.

00:54:26.880 --> 00:54:29.750
So say the thing has
at an effect of 0.001.

00:54:29.750 --> 00:54:32.130
Who cares, right?

00:54:32.130 --> 00:54:35.100
So you want to be thinking
about, from a policy

00:54:35.100 --> 00:54:37.330
perspective, what's the
smallest effect size you want

00:54:37.330 --> 00:54:39.925
to know, from a policy
perspective, in order to set

00:54:39.925 --> 00:54:41.710
your power calculations?

00:54:41.710 --> 00:54:42.140
Yeah.

00:54:42.140 --> 00:54:44.100
AUDIENCE: I have a question
back at the mean

00:54:44.100 --> 00:54:45.025
and variance thing.

00:54:45.025 --> 00:54:46.290
PROFESSOR: Oh, here.

00:54:46.290 --> 00:54:46.773
Yeah.

00:54:46.773 --> 00:54:47.739
AUDIENCE: Yeah.

00:54:47.739 --> 00:54:50.154
So in terms of the baseline
thing that you would collect--

00:54:50.154 --> 00:54:54.030
so I'm on the implementation
side of this, right?

00:54:54.030 --> 00:54:54.825
So we do projects.

00:54:54.825 --> 00:54:57.100
We collect baseline data.

00:54:57.100 --> 00:55:03.050
Now, the case that I'm thinking
of, the baseline data

00:55:03.050 --> 00:55:06.820
that we would collect might not
be exactly the same kind

00:55:06.820 --> 00:55:12.530
of data that we are looking
for in terms of our study.

00:55:12.530 --> 00:55:13.985
What kind of base-- how--

00:55:13.985 --> 00:55:14.670
PROFESSOR: Right, OK.

00:55:14.670 --> 00:55:18.180
So when we say baseline, there's
two different things

00:55:18.180 --> 00:55:20.380
we mean by baseline.

00:55:20.380 --> 00:55:22.820
For this case, this is not
strictly a baseline.

00:55:22.820 --> 00:55:26.340
This is just something about
what's your variable

00:55:26.340 --> 00:55:26.870
going to look like.

00:55:26.870 --> 00:55:27.880
Let me come back to
that in a sec.

00:55:27.880 --> 00:55:29.910
We also sometimes talk about
baselines that we are going to

00:55:29.910 --> 00:55:33.020
use of actually collecting the
actual outcome variable before

00:55:33.020 --> 00:55:34.930
we start the intervention,
right?

00:55:34.930 --> 00:55:37.750
Those are also useful, and we'll
talk about those in a

00:55:37.750 --> 00:55:38.970
couple slides.

00:55:38.970 --> 00:55:41.900
And those, one wants them to be
more similar, probably, to

00:55:41.900 --> 00:55:44.285
the actual variable you're
going to use.

00:55:44.285 --> 00:55:48.410
Now, for your case,
we often don't--

00:55:52.590 --> 00:55:56.640
the accuracy of your power
calculation depends pretty

00:55:56.640 --> 00:56:01.780
critically on how close this
mean and variance are to what

00:56:01.780 --> 00:56:04.100
you're going to actually
get in your data.

00:56:04.100 --> 00:56:07.990
And when you start in the
example that you guys are

00:56:07.990 --> 00:56:09.470
going to work on or that maybe
you've already started working

00:56:09.470 --> 00:56:11.560
on, you're going to
find that it's

00:56:11.560 --> 00:56:12.920
actually pretty sensitive.

00:56:12.920 --> 00:56:15.010
Turns out it's pretty
sensitive.

00:56:15.010 --> 00:56:19.870
So getting these wrong is
going to mean your power

00:56:19.870 --> 00:56:23.200
calculation is going
to be wrong.

00:56:23.200 --> 00:56:26.460
So that's sort of an argument
for saying you want this to be

00:56:26.460 --> 00:56:28.530
as good as possible.

00:56:28.530 --> 00:56:32.750
Now the flip side of that,
though, is you're going to

00:56:32.750 --> 00:56:36.710
find that these power
calculations are fairly

00:56:36.710 --> 00:56:40.640
sensitive to what effect size
you choose as well.

00:56:40.640 --> 00:56:44.500
So you're going to find that if
you go from an effect size

00:56:44.500 --> 00:56:46.870
of 0.2 to an effect size of 0.1,
you're going to need four

00:56:46.870 --> 00:56:48.960
times the sample.

00:56:48.960 --> 00:56:50.210
That's just the way the
math works out.
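
NOTE
Editor's sketch, not part of the lecture: why halving the effect size
quadruples the sample. This uses the textbook normal-approximation formula
for a two-arm comparison with equal arms, where the required sample per arm
is proportional to (sd / effect) squared. The function name and the default
80% power are assumptions for illustration.
```python
from math import ceil
from statistics import NormalDist
def n_per_arm(effect: float, sd: float = 1.0, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate units needed in each arm to detect `effect` with the given power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # cutoff for the chosen size
    z_power = z.inv_cdf(power)              # extra distance needed for the chosen power
    return 2.0 * ((z_alpha + z_power) * sd / effect) ** 2
n_big = n_per_arm(0.2)
n_small = n_per_arm(0.1)
print(ceil(n_big), ceil(n_small), round(n_small / n_big, 2))
```
Because the effect enters the formula squared in the denominator, the
ratio of the two sample sizes is 4 regardless of the sd, size, or power
chosen.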

00:56:54.820 --> 00:57:00.360
By which I'm going to mean that
I think that these power

00:57:00.360 --> 00:57:03.480
calculations are useful for
making sure you're in the

00:57:03.480 --> 00:57:07.550
right ballpark, but not
necessarily going to nail an

00:57:07.550 --> 00:57:09.250
exact number for you.

00:57:12.210 --> 00:57:17.050
All that's by way of saying that
you want to get-- because

00:57:17.050 --> 00:57:19.910
these things are so sensitive,
you want to get as close as

00:57:19.910 --> 00:57:22.770
possible to what's actually
going to be there.

00:57:22.770 --> 00:57:25.710
On the other hand you're going
to find the results are also

00:57:25.710 --> 00:57:29.080
so sensitive to the effect size
you want to detect that

00:57:29.080 --> 00:57:32.260
if this was a little bit off,
that might be a tradeoff you

00:57:32.260 --> 00:57:33.540
would be willing to live
with in practice.

00:57:33.540 --> 00:57:34.780
AUDIENCE: So, from my--

00:57:34.780 --> 00:57:36.870
PROFESSOR: Does that
make sense?

00:57:36.870 --> 00:57:40.630
AUDIENCE: Yeah, but it seems
like the effect size-- your

00:57:40.630 --> 00:57:44.690
estimate of your effect
size is this kind of--

00:57:44.690 --> 00:57:47.930
we've got all this science for
the calculation and yet your

00:57:47.930 --> 00:57:49.650
estimate of your effect
size is based on--

00:57:49.650 --> 00:57:50.780
PROFESSOR: You're absolutely
right.

00:57:50.780 --> 00:57:51.900
AUDIENCE: --getting that--

00:57:51.900 --> 00:57:52.690
PROFESSOR: Hold on, though.

00:57:52.690 --> 00:57:53.990
Let me back up a little
bit, though.

00:57:53.990 --> 00:57:55.370
You're right, except the--

00:57:55.370 --> 00:57:57.505
in some sense, the best way to
get estimates for your effect

00:57:57.505 --> 00:58:00.100
size is to look at similar
programs, OK?

00:58:00.100 --> 00:58:03.970
So now there are lots
of programs in

00:58:03.970 --> 00:58:05.130
education, for example.

00:58:05.130 --> 00:58:08.860
And they tend to find effect--

00:58:08.860 --> 00:58:11.840
I've now seen a bazillion things
that work on improving

00:58:11.840 --> 00:58:12.780
test scores.

00:58:12.780 --> 00:58:14.870
And I can tell you that
they tend to get--

00:58:14.870 --> 00:58:16.565
standardized effect size is the
effect size divided by the

00:58:16.565 --> 00:58:17.360
standard deviation.

00:58:17.360 --> 00:58:21.550
And they tend to get effect
sizes in the 0.1, 0.15, 0.2

00:58:21.550 --> 00:58:24.410
range, right?
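
NOTE
Editor's sketch, not part of the lecture: converting a raw effect into the
standardized effect size defined above (effect divided by the standard
deviation of the outcome). The test-score numbers are hypothetical and
chosen only so that the result falls in the 0.1 range mentioned.
```python
def standardized_effect(raw_effect: float, outcome_sd: float) -> float:
    """Effect expressed in standard deviations of the outcome."""
    if outcome_sd <= 0:
        raise ValueError("outcome_sd must be positive")
    return raw_effect / outcome_sd
# Hypothetical example: a 5-point gain on a test whose scores have an sd of 50.
effect_in_sd_units = standardized_effect(5.0, 50.0)
print(effect_in_sd_units)
```
Working in standard-deviation units is what lets effect sizes from tests
with different scales be compared.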

00:58:24.410 --> 00:58:26.660
So you can look at those and
say, well, I think that most

00:58:26.660 --> 00:58:30.040
other comparable interventions
are getting 0.1, so I'm going

00:58:30.040 --> 00:58:32.910
to use 0.1 as my effect size.

00:58:32.910 --> 00:58:34.490
So you're right if you're
just trying to sit here

00:58:34.490 --> 00:58:34.940
and introspect on--

00:58:34.940 --> 00:58:37.060
what my effect size is going
to be, it's very hard.

00:58:37.060 --> 00:58:40.580
But if you use comparable
studies to get a sense, then

00:58:40.580 --> 00:58:41.520
you can get a sense.

00:58:41.520 --> 00:58:42.990
And the other thing I mentioned
is, you can do

00:58:42.990 --> 00:58:45.640
cost-benefit analysis and
say, well, look--

00:58:45.640 --> 00:58:47.260
which is sort of another way
of saying it, If there are

00:58:47.260 --> 00:58:51.650
other things out there which
cost $100 per kid and get 0.1,

00:58:51.650 --> 00:58:54.150
then my thing, presumably, has
got to do at least as well as

00:58:54.150 --> 00:58:56.790
0.1 for $100-- suppose the other
thing also costs $100 a

00:58:56.790 --> 00:58:58.150
kid, I've got to do at
least as well as 0.1.

00:58:58.150 --> 00:58:59.760
Otherwise, I'd rather
do this other thing.

00:58:59.760 --> 00:59:01.640
So it's another way of getting
at the effect size.

00:59:01.640 --> 00:59:04.715
AUDIENCE: Could you, then, also
look at existing data in

00:59:04.715 --> 00:59:08.110
the literature for the mean
and variance thing,

00:59:08.110 --> 00:59:08.740
or do you have to--

00:59:08.740 --> 00:59:10.480
PROFESSOR: You could, but this
one is going to be more

00:59:10.480 --> 00:59:11.385
sensitive to your population.

00:59:11.385 --> 00:59:14.580
AUDIENCE: So it would just have
to be very well-matched

00:59:14.580 --> 00:59:16.165
to be able to use it.

00:59:16.165 --> 00:59:16.540
PROFESSOR: Right.

00:59:16.540 --> 00:59:19.240
I mean, look, if you don't have
it, you could do it to

00:59:19.240 --> 00:59:21.480
get a sense, but this is
one where the different

00:59:21.480 --> 00:59:23.730
populations are going to be
very different in terms of

00:59:23.730 --> 00:59:24.980
their mean and variance.

00:59:28.540 --> 00:59:31.550
In order to get an estimate of
this, you need a much, much,

00:59:31.550 --> 00:59:33.970
much smaller sample size than
you need to get an estimate of

00:59:33.970 --> 00:59:35.970
the overall treatment effect
of the program.

00:59:35.970 --> 00:59:39.720
So you can often do
a small survey--

00:59:39.720 --> 00:59:42.690
much, much, smaller than your
big survey, but a small survey

00:59:42.690 --> 00:59:44.810
just to get a sense of what
these things look like.

00:59:44.810 --> 00:59:47.070
And that can often be a very
worthwhile thing to do.

00:59:47.070 --> 00:59:49.115
AUDIENCE: I have a
related question.

00:59:51.680 --> 00:59:53.520
How often do you see--

00:59:53.520 --> 00:59:54.530
PROFESSOR: Oh, sorry.

00:59:54.530 --> 00:59:55.390
I just wanted to do one
other thing on this.

00:59:55.390 --> 00:59:57.550
I've had this come up in my
own experience, where I've

00:59:57.550 --> 01:00:01.050
done this small survey, and
found that the baseline

01:00:01.050 --> 01:00:02.350
situation was such that
the whole experiment

01:00:02.350 --> 01:00:03.010
didn't make any sense.

01:00:03.010 --> 01:00:05.150
And we just canceled
the experiment.

01:00:05.150 --> 01:00:06.390
And it can be really useful.

01:00:06.390 --> 01:00:11.050
If you say, if I do this and
my power is 0.01, for

01:00:11.050 --> 01:00:14.300
reasonable effect sizes,
this is pointless.

01:00:14.300 --> 01:00:16.110
So it can be worth it.

01:00:16.110 --> 01:00:16.520
Sorry.

01:00:16.520 --> 01:00:16.930
Go ahead.

01:00:16.930 --> 01:00:21.460
AUDIENCE: So to estimate the
effect size, have you seen

01:00:21.460 --> 01:00:24.960
people run small pilots in
different populations than

01:00:24.960 --> 01:00:28.010
they're eventually going to do
their impact evaluation to get

01:00:28.010 --> 01:00:30.670
a sense of what effect size are
they seeing with that same

01:00:30.670 --> 01:00:30.965
intervention?

01:00:30.965 --> 01:00:35.060
PROFESSOR: Not usually, because
you can't do a small

01:00:35.060 --> 01:00:37.500
pilot to get the effect
size, right?

01:00:37.500 --> 01:00:38.340
AUDIENCE: You're going
to see something--

01:00:38.340 --> 01:00:39.290
PROFESSOR: You've got to
do the whole thing.

01:00:39.290 --> 01:00:40.060
AUDIENCE: Yeah, yeah.

01:00:40.060 --> 01:00:40.310
PROFESSOR: Right?

01:00:40.310 --> 01:00:41.610
That's the whole point of the
power calculations is, in

01:00:41.610 --> 01:00:45.220
order to detect an effect of
that size, you need to do the

01:00:45.220 --> 01:00:45.610
whole sample.

01:00:45.610 --> 01:00:48.080
So a small pilot won't
really do it.

01:00:48.080 --> 01:00:49.556
AUDIENCE: OK.

01:00:49.556 --> 01:00:51.750
PROFESSOR: So it's not really
going to-- you could get a--

01:00:51.750 --> 01:00:54.290
no, I guess you really can't get
a sense because you would

01:00:54.290 --> 01:00:56.530
need the whole experiment to
detect the effect size.

01:00:59.490 --> 01:01:03.265
AUDIENCE: Don't you think that
there should be a lot more

01:01:03.265 --> 01:01:05.850
conversation about effect size
before things start?

01:01:05.850 --> 01:01:09.130
Because if you've got a
treatment, if you've got a

01:01:09.130 --> 01:01:17.110
program, and you can't have a
very-- and you've struggled to

01:01:17.110 --> 01:01:21.170
have a good conversation about
what is actually going to

01:01:21.170 --> 01:01:23.560
happen to the kids or what's
going to happen to the health

01:01:23.560 --> 01:01:25.840
or what's going to happen to
the income as a result of

01:01:25.840 --> 01:01:29.420
this, it really may be quite
telling that you really don't

01:01:29.420 --> 01:01:30.680
know what you're doing.

01:01:30.680 --> 01:01:34.690
That there isn't enough of
a theory behind your-- or

01:01:34.690 --> 01:01:37.920
practice or science or anything
behind what your

01:01:37.920 --> 01:01:38.900
program is.

01:01:38.900 --> 01:01:43.605
If people are not pretty
sure, what--

01:01:43.605 --> 01:01:45.010
PROFESSOR: I mean,
yes and no--

01:01:45.010 --> 01:01:48.170
AUDIENCE: And then, also, on
the resource allocation.

01:01:48.170 --> 01:01:52.670
Resource allocation, it just
seems to me, most of the time,

01:01:52.670 --> 01:01:57.105
if your ultimate client
is really probably the

01:01:57.105 --> 01:01:57.980
government, right?

01:01:57.980 --> 01:02:00.020
Because the government is the
one that's going to make the

01:02:00.020 --> 01:02:00.470
big resource allocations--

01:02:00.470 --> 01:02:02.760
PROFESSOR: It depends on who
you're working with.

01:02:02.760 --> 01:02:04.450
It could be an NGO, whoever.

01:02:04.450 --> 01:02:04.870
But yes.

01:02:04.870 --> 01:02:08.540
AUDIENCE: No, but an NGO is
doing something, usually, as a

01:02:08.540 --> 01:02:12.540
demonstration that, in fact,
if it works, then the

01:02:12.540 --> 01:02:14.130
government should do it.

01:02:14.130 --> 01:02:16.070
PROFESSOR: Not always, but
there's someone who,

01:02:16.070 --> 01:02:17.836
presumably, is going
to scale up.

01:02:17.836 --> 01:02:18.900
AUDIENCE: Right.

01:02:18.900 --> 01:02:21.750
And yes, businesses,
maybe, right?

01:02:21.750 --> 01:02:24.430
But I would say, 90% of the
time, it's going to be,

01:02:24.430 --> 01:02:26.360
ultimately, the government
needs to--

01:02:26.360 --> 01:02:27.480
PROFESSOR: Often, it's
the government.

01:02:27.480 --> 01:02:29.560
In India, for example, there
are NGOs who are--

01:02:29.560 --> 01:02:32.250
I don't know who's worked on
the Pratham reading thing.

01:02:32.250 --> 01:02:33.470
They're trying to teach--

01:02:33.470 --> 01:02:35.750
NGOs trying to teach millions
of kids to read, as an NGO.

01:02:35.750 --> 01:02:37.170
So sometimes NGOs
scale up too.

01:02:37.170 --> 01:02:40.880
But anyway, you're right that
there's an ultimate client

01:02:40.880 --> 01:02:41.660
who's interested in this.

01:02:41.660 --> 01:02:43.060
AUDIENCE: So then, having
a conversation

01:02:43.060 --> 01:02:45.160
very early on about--

01:02:45.160 --> 01:02:45.720
PROFESSOR: Yeah.

01:02:45.720 --> 01:02:46.410
Could be very useful.

01:02:46.410 --> 01:02:47.190
That's absolutely right.

01:02:47.190 --> 01:02:47.980
That's absolutely right.

01:02:47.980 --> 01:02:48.580
AUDIENCE: Because--

01:02:48.580 --> 01:02:50.770
PROFESSOR: Now, in terms of
your point about theory,

01:02:50.770 --> 01:02:52.560
though, yes and no.

01:02:52.560 --> 01:02:55.670
So I can design an experiment
that's supposed to teach kids

01:02:55.670 --> 01:02:57.210
how to read.

01:02:57.210 --> 01:03:00.550
I know the theory says it should
affect reading but I

01:03:00.550 --> 01:03:02.750
have no idea how much.

01:03:02.750 --> 01:03:03.090
And so--

01:03:03.090 --> 01:03:06.740
AUDIENCE: Wouldn't you say that
a significant percentage

01:03:06.740 --> 01:03:09.810
of the time, if it's a good
theory about reading, it

01:03:09.810 --> 01:03:10.760
actually should tell you?

01:03:10.760 --> 01:03:11.630
PROFESSOR: Not always.

01:03:11.630 --> 01:03:12.090
I mean--

01:03:12.090 --> 01:03:13.680
AUDIENCE: Well, then I'd
say it's not such a

01:03:13.680 --> 01:03:14.960
great theory, right?

01:03:14.960 --> 01:03:15.725
Wouldn't you--

01:03:15.725 --> 01:03:18.000
PROFESSOR: It's a little bit
semantic, but I think that a

01:03:18.000 --> 01:03:20.540
lot of times, I can--

01:03:20.540 --> 01:03:23.240
say I'm going to teach kids to
read a paragraph or whatever.

01:03:23.240 --> 01:03:26.570
But what percentage of the kids
is it going to work for?

01:03:26.570 --> 01:03:30.550
What percentage of the kids
are going to be affected?

01:03:30.550 --> 01:03:33.580
I think that using theory
to calculate how--

01:03:33.580 --> 01:03:35.060
I think theory can tell
you a lot what

01:03:35.060 --> 01:03:36.420
variables should be affected.

01:03:36.420 --> 01:03:38.620
And that's what we talked about
in the last lecture.

01:03:38.620 --> 01:03:41.110
I think theory can tell you what
the sign of those effects

01:03:41.110 --> 01:03:42.050
is likely to be.

01:03:42.050 --> 01:03:45.350
I think it's often putting a lot
of demands on your theory

01:03:45.350 --> 01:03:47.440
to have it tell you
the magnitude.

01:03:47.440 --> 01:03:48.595
And that's why you want
to do the experiment.

01:03:48.595 --> 01:03:51.540
AUDIENCE: And you just told me
that even beyond the theory,

01:03:51.540 --> 01:03:53.950
you say, well, but we did this
in one school and we saw it

01:03:53.950 --> 01:03:55.290
had this great thing,
but you're saying--

01:03:55.290 --> 01:03:55.790
[INTERPOSING VOICES]

01:03:55.790 --> 01:03:57.175
PROFESSOR: But your confidence
interval is going to be--

01:03:57.175 --> 01:03:58.190
well, it's not nothing.

01:03:58.190 --> 01:03:59.550
It's going to tell you
something, but your confidence

01:03:59.550 --> 01:04:00.310
interval is going
to be enormous.

01:04:00.310 --> 01:04:02.340
AUDIENCE: Right, nothing
that you could

01:04:02.340 --> 01:04:04.200
rely on to set a good--

01:04:04.200 --> 01:04:04.490
[INTERPOSING VOICES]

01:04:04.490 --> 01:04:07.296
PROFESSOR: Right, it gives you
a data point, but it's going

01:04:07.296 --> 01:04:09.174
to have a huge confidence
interval.

01:04:09.174 --> 01:04:13.330
AUDIENCE: I don't want to
belabor this, but if you think

01:04:13.330 --> 01:04:14.490
about it in business
terms, right?

01:04:14.490 --> 01:04:16.250
I want to go out and
raise some money.

01:04:16.250 --> 01:04:17.440
PROFESSOR: Yes, absolutely.

01:04:17.440 --> 01:04:17.790
[INTERPOSING VOICES]

01:04:17.790 --> 01:04:18.535
AUDIENCE: --something.

01:04:18.535 --> 01:04:20.520
And so, in order to raise that
money, I have to tell you

01:04:20.520 --> 01:04:22.680
that, in fact, you're going
to make this much money.

01:04:22.680 --> 01:04:23.360
PROFESSOR: Right.

01:04:23.360 --> 01:04:25.816
AUDIENCE: And, of course, it
could turn out to be wrong.

01:04:25.816 --> 01:04:28.483
But I have to tell you you're
going to get a 25% return on

01:04:28.483 --> 01:04:28.790
your money.

01:04:28.790 --> 01:04:30.575
And that means I have to
explain to you why this

01:04:30.575 --> 01:04:32.347
business is going to be successful,
how many people

01:04:32.347 --> 01:04:34.320
are going to buy it, how I'm
going to manage my costs down.

01:04:34.320 --> 01:04:36.750
So it's always curious to me
that, when you're talking

01:04:36.750 --> 01:04:41.000
about social interventions, that
I'm not having to make

01:04:41.000 --> 01:04:45.120
that same argument with that
same level of specificity,

01:04:45.120 --> 01:04:46.930
which means I've talked
about the effect size.

01:04:46.930 --> 01:04:50.540
Because I can't raise money if I
tell you, look, I might only

01:04:50.540 --> 01:04:53.695
make you 5% or we might shoot
the moon and make 100%.

01:04:53.695 --> 01:04:55.290
You'll say, thank
you very much.

01:04:55.290 --> 01:04:56.690
This person doesn't know
what their business is.

01:04:56.690 --> 01:04:58.340
I'm not going to give
them my money.

01:04:58.340 --> 01:04:59.730
PROFESSOR: Right.

01:04:59.730 --> 01:05:03.710
So you actually hit on exactly
what's on the next slide.

01:05:03.710 --> 01:05:07.110
Which is exactly what I was
going to say, which is, what

01:05:07.110 --> 01:05:09.000
you want to think about with
your effect size is exactly

01:05:09.000 --> 01:05:09.370
this thing.

01:05:09.370 --> 01:05:10.850
What's the cost of this
program versus

01:05:10.850 --> 01:05:12.170
the benefit it brings?

01:05:12.170 --> 01:05:15.220
And sometimes, what's the cost
vis-a-vis alternative uses of

01:05:15.220 --> 01:05:16.270
the money, right?

01:05:16.270 --> 01:05:17.730
And that's going to be a
conversation you're going to

01:05:17.730 --> 01:05:19.960
have with your client, which is
going to say, if the effect

01:05:19.960 --> 01:05:22.200
size was 0.1, I would do it.

01:05:22.200 --> 01:05:23.730
And then you say, OK, I'm going
to design an experiment

01:05:23.730 --> 01:05:26.910
to see if it's 0.1
or bigger, right?

01:05:26.910 --> 01:05:30.770
So I'm totally on
board with that.

01:05:30.770 --> 01:05:33.030
Because, as I was saying, if
the effect size is smaller

01:05:33.030 --> 01:05:35.610
than that, it still could be
positive, but if your client

01:05:35.610 --> 01:05:39.210
doesn't care, if it's not worth
the money at that level,

01:05:39.210 --> 01:05:41.210
then why do we need to design
a big experiment

01:05:41.210 --> 01:05:42.460
to pick that up?

01:05:44.230 --> 01:05:46.590
It's also worth noting this
is not your expected

01:05:46.590 --> 01:05:48.540
effect size, right?

01:05:48.540 --> 01:05:53.620
I could expect this thing to
have an effect of 0.2 but even

01:05:53.620 --> 01:05:55.040
if it was as low as 0.1,
it would still be

01:05:55.040 --> 01:05:56.430
worth doing, OK?

01:05:56.430 --> 01:05:58.170
And in that case, I might want
to design an experiment of

01:05:58.170 --> 01:06:02.920
0.1, right?

01:06:02.920 --> 01:06:05.580
Conversely, you guys can all
imagine the opposite, which is

01:06:05.580 --> 01:06:08.350
you could say, I expect
this thing to be 0.1,

01:06:08.350 --> 01:06:11.090
but maybe it's 0.2.

01:06:11.090 --> 01:06:12.020
Maybe it's actually--

01:06:12.020 --> 01:06:14.260
I'm not sure how good it is.

01:06:14.260 --> 01:06:14.850
I think it's OK.

01:06:14.850 --> 01:06:17.370
But maybe it could
be really great.

01:06:17.370 --> 01:06:19.550
And if it was really great, I
would want to adopt it, so I

01:06:19.550 --> 01:06:21.540
would design an experiment
to 0.2.

01:06:21.540 --> 01:06:25.120
So it's not the expected effect
size, it's what you

01:06:25.120 --> 01:06:26.758
would use to adopt
the program.

01:06:33.180 --> 01:06:35.180
When we talk about
effect sizes, we

01:06:35.180 --> 01:06:37.282
often talk about them--

01:06:37.282 --> 01:06:40.970
we talk about what we call
standardized effect size, OK?

01:06:46.020 --> 01:06:48.670
As I mentioned, how large an
effect you can detect depends

01:06:48.670 --> 01:06:51.010
on how variable your
sample is.

01:06:51.010 --> 01:06:53.830
So if everyone's the
same, it's very

01:06:53.830 --> 01:06:55.870
easy to pick up effects.

01:06:55.870 --> 01:06:58.770
And we often talk about
standardized effects are the

01:06:58.770 --> 01:07:01.050
effect size divided by the
standard deviation of the

01:07:01.050 --> 01:07:03.350
outcome, OK?

01:07:03.350 --> 01:07:05.250
So standard deviation of outcome
is the measure of how

01:07:05.250 --> 01:07:06.600
variable your outcome is.

01:07:06.600 --> 01:07:10.530
So we often express our effect
sizes relative to the standard

01:07:10.530 --> 01:07:12.680
deviation of the outcome, OK?

01:07:12.680 --> 01:07:14.310
And so when I was talking
about test scores, for

01:07:14.310 --> 01:07:16.770
example, test scores are usually
normalized to have a

01:07:16.770 --> 01:07:18.290
standard deviation of one.

01:07:18.290 --> 01:07:20.400
So this is actually how we
normally express things in

01:07:20.400 --> 01:07:22.510
terms of test scores, but we
could do it for anything.

01:07:22.510 --> 01:07:25.850
And so effect sizes of
0.1, 0.2 are small.

01:07:25.850 --> 01:07:26.910
0.4 are medium.

01:07:26.910 --> 01:07:28.150
0.5 are large.

01:07:28.150 --> 01:07:29.830
Now what do we mean by that?

01:07:29.830 --> 01:07:31.790
This is actually a very helpful
way of thinking about

01:07:31.790 --> 01:07:34.580
what a standardized effect
size is telling you.

01:07:34.580 --> 01:07:37.830
So a standardized effect size of
0.2, which is what we were

01:07:37.830 --> 01:07:43.350
saying was a modest one, means
that the average person in the

01:07:43.350 --> 01:07:47.980
treatment group, the median
or the mean person of the

01:07:47.980 --> 01:07:52.610
treatment group, had a better
outcome than 58% of the people

01:07:52.610 --> 01:07:54.930
in the control group.

01:07:54.930 --> 01:07:57.810
So remember, if it was zero,
it would be 50-50.

01:07:57.810 --> 01:07:58.840
It would be 50%, right?

01:07:58.840 --> 01:08:01.160
If there was no effect, the
distributions would line up

01:08:01.160 --> 01:08:03.400
and this person's in the
treatment group--

01:08:03.400 --> 01:08:04.590
the median person in the
treatment group would be

01:08:04.590 --> 01:08:09.150
better than 50% of the people
in the control group.

01:08:09.150 --> 01:08:11.680
So this is saying, instead of
lining up at exactly 50-50,

01:08:11.680 --> 01:08:15.920
it's lining up 58%-50%, OK?

01:08:15.920 --> 01:08:20.700
If you get an effect size of
0.5, which we were saying was

01:08:20.700 --> 01:08:24.490
a large effect, that means that
69% of the people in the

01:08:24.490 --> 01:08:26.720
treatment group are going to
be bigger than the median

01:08:26.720 --> 01:08:29.484
person in the control group.

01:08:29.484 --> 01:08:31.100
Sorry, it's the other
way around.

01:08:31.100 --> 01:08:32.490
The average member of the
intervention group is better

01:08:32.490 --> 01:08:36.310
than 69% of people in
the control group.

01:08:36.310 --> 01:08:37.950
So the distributions are
still overlapping.

01:08:37.950 --> 01:08:39.170
But now there's--

01:08:39.170 --> 01:08:42.170
the middle of the treatment
distribution is at the 69th

01:08:42.170 --> 01:08:45.779
percentile of the control.

01:08:45.779 --> 01:08:49.800
And a large effect of 0.8 would
mean that the median

01:08:49.800 --> 01:08:55.210
person in the treatment group
is at the 79th percentile of

01:08:55.210 --> 01:08:56.970
the control.
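Illustrative sketch (not part of the lecture): the percentile figures quoted here follow from the standard normal CDF, if one assumes that outcomes in both groups are normally distributed with the same standard deviation. The function name `control_percentile` is made up for this example.

```python
from statistics import NormalDist


def control_percentile(effect_size: float) -> float:
    """Share of the control group that the median treated person outperforms.

    Assumes both groups are normal with equal standard deviation, so the
    treated median sits `effect_size` standard deviations above the control
    median.
    """
    return NormalDist().cdf(effect_size)


for d in (0.0, 0.2, 0.5, 0.8):
    pct = 100 * control_percentile(d)
    print(f"effect size {d:.1f}: about the {pct:.0f}th percentile of control")
```

Under these assumptions the loop reproduces the 50, 58, 69 and 79 figures mentioned in the lecture.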

01:08:56.970 --> 01:08:58.580
That just gives you a sense of
when we're talking about

01:08:58.580 --> 01:09:02.180
standardized effect sizes, how
big we're talking about.

01:09:02.180 --> 01:09:04.990
And so you can see that
0.2, is actually--

01:09:04.990 --> 01:09:08.689
you can imagine is going to be
pretty hard to detect, right?

01:09:08.689 --> 01:09:10.800
If the median person in the
treatment group looks like the

01:09:10.800 --> 01:09:14.029
58th percentile of the control
group, that's going to be a

01:09:14.029 --> 01:09:18.800
case where those distributions
have a lot of overlap, right?

01:09:18.800 --> 01:09:20.450
And so this is going to be much
harder to detect than

01:09:20.450 --> 01:09:25.130
this case when the overlap
is much smaller.

01:09:25.130 --> 01:09:25.330
Yeah.

01:09:25.330 --> 01:09:28.950
AUDIENCE: So in your experience,
what do most

01:09:28.950 --> 01:09:30.826
people think their
effect size is?

01:09:30.826 --> 01:09:32.140
Where do they settle?

01:09:32.140 --> 01:09:34.080
They probably wouldn't
settle at 0.2?

01:09:34.080 --> 01:09:36.649
PROFESSOR: Actually, a lot of
people in a lot of educational

01:09:36.649 --> 01:09:36.989
interventions--

01:09:36.989 --> 01:09:38.680
AUDIENCE: That's enough
for them?

01:09:38.680 --> 01:09:40.439
PROFESSOR: Yeah.

01:09:40.439 --> 01:09:42.279
I would say the typical
intervention that people study

01:09:42.279 --> 01:09:44.370
that I've seen in education,
the effect size is in the

01:09:44.370 --> 01:09:50.284
0.15, 0.2 range.

01:09:50.284 --> 01:09:52.019
It turns out it's really hard
to move test scores.

01:09:52.019 --> 01:09:52.982
AUDIENCE: Yeah.

01:09:52.982 --> 01:09:56.570
PROFESSOR: So yeah, I
would say a lot of--

01:09:56.570 --> 01:09:58.410
but you'll see when you do the
power calculations, that to

01:09:58.410 --> 01:10:00.040
detect 0.2, you often need
a pretty big sample.
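Illustrative sketch (not part of the lecture): a rough idea of why 0.2 needs a big sample, using the textbook normal approximation for a two-arm trial. The assumptions are mine, not the speaker's: individual (non-clustered) randomization, equal arm sizes, a two-sided 5% test and 80% power. The function name `required_n_per_arm` is hypothetical, and this is not the software package referred to in the lecture.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_arm(effect_size: float, alpha: float = 0.05,
                       power: float = 0.80) -> int:
    """Approximate sample size per arm to detect a standardized effect size.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / effect_size^2.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


for d in (0.2, 0.4, 0.8):
    print(f"effect size {d}: about {required_n_per_arm(d)} people per arm")
```

Because the effect size enters squared, halving the effect you design for roughly quadruples the sample you need.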

01:10:03.810 --> 01:10:05.620
Look, it depends a lot on what
your intervention is, but I've

01:10:05.620 --> 01:10:09.270
seen a lot in that range.

01:10:09.270 --> 01:10:13.340
And I'm just trying to think
of an experiment I did.

01:10:13.340 --> 01:10:14.940
I can't think of it off hand.

01:10:14.940 --> 01:10:17.280
But yeah, I would say
a lot in this range.

01:10:17.280 --> 01:10:19.777
AUDIENCE: So would the converse
be true, that in

01:10:19.777 --> 01:10:22.890
fact, you don't see too
many that have a real

01:10:22.890 --> 01:10:25.962
large effect size?

01:10:25.962 --> 01:10:28.490
PROFESSOR: I would say it's
pretty rare that I see

01:10:28.490 --> 01:10:31.238
interventions that are 0.8.

01:10:31.238 --> 01:10:31.704
Yeah.

01:10:31.704 --> 01:10:34.160
AUDIENCE: Do you think it's
valuable that just because

01:10:34.160 --> 01:10:36.500
you're setting a low effect
size in designing your

01:10:36.500 --> 01:10:37.320
experiment, you're being
conservative.

01:10:37.320 --> 01:10:39.140
You can still pick up a
[UNINTELLIGIBLE] effect size--

01:10:39.140 --> 01:10:39.740
PROFESSOR: Of course.

01:10:39.740 --> 01:10:41.412
AUDIENCE: It's just in
the design process--

01:10:41.412 --> 01:10:41.830
[INTERPOSING VOICES]

01:10:41.830 --> 01:10:42.450
PROFESSOR: Right.

01:10:42.450 --> 01:10:44.520
This is the minimum thing
you could pick up.

01:10:44.520 --> 01:10:45.720
That's absolutely right.

01:10:45.720 --> 01:10:46.460
That's right.

01:10:46.460 --> 01:10:49.470
So right, if you design for 0.2
but, in fact, your thing

01:10:49.470 --> 01:10:54.150
is amazing and does 0.8, well,
there's no problem at all.

01:10:54.150 --> 01:10:56.520
You'll have a p-value
of 0.00 something.

01:10:56.520 --> 01:10:59.184
You'll have a very strong
[INAUDIBLE].

01:10:59.184 --> 01:11:01.900
It's a good point.

01:11:01.900 --> 01:11:03.510
OK.

01:11:03.510 --> 01:11:06.950
So how do we actually
calculate our power?

01:11:06.950 --> 01:11:10.390
So there's actually a very nice
software package, which,

01:11:10.390 --> 01:11:12.330
have you guys started
using this yet?

01:11:12.330 --> 01:11:12.810
Yeah?

01:11:12.810 --> 01:11:12.990
OK.

01:11:12.990 --> 01:11:14.110
AUDIENCE: I have a question.

01:11:14.110 --> 01:11:15.970
Can you just clarify something
before you go on?

01:11:15.970 --> 01:11:16.900
PROFESSOR: Yeah.

01:11:16.900 --> 01:11:20.208
AUDIENCE: So by rejecting a null
hypothesis, you won't be

01:11:20.208 --> 01:11:23.064
able to say what the expected
effect is, so you won't be

01:11:23.064 --> 01:11:24.254
able to necessarily quantify
the impact.

01:11:24.254 --> 01:11:26.865
PROFESSOR: No, that's
not quite right.

01:11:26.865 --> 01:11:27.697
AUDIENCE: OK.

01:11:27.697 --> 01:11:31.570
PROFESSOR: So you're going
to estimate your--

01:11:31.570 --> 01:11:34.260
you run your experiment, you're
going to get a beta,

01:11:34.260 --> 01:11:35.940
which is your estimate,
And you're going to

01:11:35.940 --> 01:11:38.470
get a standard error.

01:11:38.470 --> 01:11:42.110
You reject the null, which
means you say with 95%

01:11:42.110 --> 01:11:44.540
probability, I'm in my
confidence interval.

01:11:44.540 --> 01:11:48.360
So you know you're somewhere
in the confidence interval.

01:11:48.360 --> 01:11:50.990
And then beyond that, you have
an estimate of where in the

01:11:50.990 --> 01:11:52.360
confidence interval you are.

01:11:52.360 --> 01:11:54.210
And your best estimate for
where you are on the

01:11:54.210 --> 01:11:57.080
confidence interval is
your point estimate.

01:11:57.080 --> 01:11:58.230
Does that make sense?

01:11:58.230 --> 01:12:02.380
So in terms of thinking through
the cost-benefit or

01:12:02.380 --> 01:12:05.080
whatever, your best guess of the
effect of the program is

01:12:05.080 --> 01:12:07.470
your point estimate,
is your beta.

01:12:07.470 --> 01:12:10.480
If you wanted to be a little
more precise about it, you

01:12:10.480 --> 01:12:11.730
could say--

01:12:19.190 --> 01:12:25.090
so this is your estimate, this
is your beta hat, this is your

01:12:25.090 --> 01:12:26.810
confidence interval, right?

01:12:26.810 --> 01:12:29.970
Zero is over here, so you can
reject zero in this case.

01:12:29.970 --> 01:12:34.210
But, in fact, there's a
distribution of where your

01:12:34.210 --> 01:12:36.100
estimates are likely to be.

01:12:36.100 --> 01:12:37.870
And when we said it was 95%
confidence interval, that's

01:12:37.870 --> 01:12:41.100
because the probability of
being over here is 95%.

01:12:41.100 --> 01:12:43.710
But this says you're most likely
to be right here, but

01:12:43.710 --> 01:12:45.430
there's some probability
over here.

01:12:45.430 --> 01:12:48.980
You're more likely to be near
beta than you are to be very--

01:12:48.980 --> 01:12:51.160
it's not that you're equally
likely to be anywhere in your

01:12:51.160 --> 01:12:52.350
confidence interval.

01:12:52.350 --> 01:12:54.700
You're most likely to be right
near your point estimate.

01:12:54.700 --> 01:12:58.740
So, in fact, if you actually
cared about the range, you

01:12:58.740 --> 01:13:01.060
could say, well, what's the
probability I'm over here?

01:13:01.060 --> 01:13:02.410
And calculate that.

01:13:02.410 --> 01:13:03.340
What's the probability
I'm over here?

01:13:03.340 --> 01:13:06.130
And you could average them to
calculate the average benefit

01:13:06.130 --> 01:13:07.390
of your program.

01:13:07.390 --> 01:13:09.230
Usually, though, we don't bother
to do this and usually

01:13:09.230 --> 01:13:11.680
what we do is we say our best
estimate is that you're right

01:13:11.680 --> 01:13:12.250
at beta hat.

01:13:12.250 --> 01:13:14.372
That is our best estimate and
we calculate our estimate

01:13:14.372 --> 01:13:15.622
based on that.

01:13:18.480 --> 01:13:20.215
But in theory, you could use
the whole distribution

01:13:20.215 --> 01:13:22.392
[INAUDIBLE].
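Illustrative sketch (not part of the lecture): one way to put numbers on the picture being drawn. It treats the estimate as normally distributed around beta hat with the reported standard error, and reads that curve as a rough guide to where the true effect lies, which is an informal (flat-prior) interpretation. The values of `beta_hat` and `se` are invented for the example.

```python
from statistics import NormalDist


def confidence_interval(beta_hat: float, se: float, level: float = 0.95):
    """Normal-approximation confidence interval around the point estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return beta_hat - z * se, beta_hat + z * se


def prob_effect_above(threshold: float, beta_hat: float, se: float) -> float:
    """Mass of the normal curve centred on beta_hat that lies above threshold."""
    return 1.0 - NormalDist(mu=beta_hat, sigma=se).cdf(threshold)


beta_hat, se = 0.25, 0.10  # hypothetical estimate and standard error
low, high = confidence_interval(beta_hat, se)
print(f"95% interval: ({low:.3f}, {high:.3f})")  # zero is outside the interval
print(f"P(effect > 0.1) = {prob_effect_above(0.1, beta_hat, se):.3f}")
```

The curve is highest at beta hat, which is why the point estimate is the usual single best guess for a cost-benefit calculation.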

01:13:22.392 --> 01:13:23.642
OK.

01:13:27.600 --> 01:13:28.850
OK, so suppose we want--

01:13:28.850 --> 01:13:31.010
so how do we actually calculate
some of these?

01:13:31.010 --> 01:13:34.710
So using the software helps get
a sense, intuitively, of

01:13:34.710 --> 01:13:35.910
what these tradeoffs are
going to look like.

01:13:35.910 --> 01:13:37.690
And I don't know that I'll have
time to go through all

01:13:37.690 --> 01:13:41.320
this, but we'll go through
most of it, OK?

01:13:41.320 --> 01:13:43.960
So for example, so if you run
the software and look at power

01:13:43.960 --> 01:13:45.210
versus number of clusters--

01:13:50.156 --> 01:13:52.340
hold on.

01:13:52.340 --> 01:13:54.430
So how would you set this
up in the software?

01:13:54.430 --> 01:13:58.490
So we'll talk about clustered
effects in a sec.

01:13:58.490 --> 01:14:03.540
As we discussed, you have to
pick a significance level.

01:14:03.540 --> 01:14:05.930
You have to pick a standardized
effect size.

01:14:05.930 --> 01:14:07.360
That's what delta is
in the software.

01:14:07.360 --> 01:14:10.850
So we use 0.2, OK?

01:14:10.850 --> 01:14:12.620
In the software, it's always
a standardized effect size.

01:14:12.620 --> 01:14:13.650
You just divide by
your standard

01:14:13.650 --> 01:14:16.260
deviation of your outcome.

01:14:16.260 --> 01:14:18.400
That's why you know your
actual outcome variable

01:14:18.400 --> 01:14:19.670
because you know--
but I think the

01:14:19.670 --> 01:14:21.690
actual effect is whatever--

01:14:21.690 --> 01:14:23.590
people get one centimeter
longer-- in order to get a

01:14:23.590 --> 01:14:24.920
standardized effect size, I
need to know the standard

01:14:24.920 --> 01:14:27.290
deviation of my outcome
variable.
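Illustrative sketch (not part of the lecture): converting a raw effect into the standardized effect size that the software expects is a single division. The one-centimeter effect is the speaker's example; the standard deviation of 5 cm is a made-up number, and `standardized_effect` is a hypothetical helper name.

```python
def standardized_effect(raw_effect: float, outcome_sd: float) -> float:
    """Raw effect divided by the standard deviation of the outcome."""
    if outcome_sd <= 0:
        raise ValueError("outcome_sd must be positive")
    return raw_effect / outcome_sd


# Hypothetical: a 1 cm gain in height where heights have an SD of 5 cm.
delta = standardized_effect(raw_effect=1.0, outcome_sd=5.0)
print(f"standardized effect size: {delta}")
```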

01:14:27.290 --> 01:14:34.660
And the program is going to
give you the power as a

01:14:34.660 --> 01:14:39.180
function of your sample
size, OK?

01:14:39.180 --> 01:14:42.960
And one of the things that you
can see is that this is not

01:14:42.960 --> 01:14:46.410
necessarily a linear
relationship, right?

01:14:46.410 --> 01:14:51.020
So for example, here, we've
plotted a delta of--

01:14:51.020 --> 01:14:54.050
effect size of 0.2 and here's
an effect size of 0.4.

01:14:54.050 --> 01:14:58.930
So this says that with about 200
clusters, you're going to

01:14:58.930 --> 01:15:04.315
get to a power of 0.8 with the
effect size of 0.4, but you're

01:15:04.315 --> 01:15:07.010
still going to be at a
power of 0.2 with an

01:15:07.010 --> 01:15:08.670
effect size of 0.2.

01:15:08.670 --> 01:15:11.570
So the formulas are
complicated.

01:15:11.570 --> 01:15:13.940
They're not necessarily a linear
function of your power.
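Illustrative sketch (not part of the lecture): the curved shape of the power plot can be seen with a simple normal approximation. This version assumes individual randomization into two equal arms and a two-sided 5% test, so its numbers will not match the clustered design on the slide. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(effect_size: float, n_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power to detect a standardized effect size.

    The standard error of the difference in means, in standard-deviation
    units, is sqrt(2 / n_per_arm). The negligible chance of rejecting in the
    wrong direction is ignored.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    se = sqrt(2.0 / n_per_arm)
    return STD_NORMAL.cdf(effect_size / se - z_crit)


for n in (50, 100, 200, 400):
    p_small = approx_power(0.2, n)
    p_large = approx_power(0.4, n)
    print(f"n per arm = {n:3d}: power at 0.2 = {p_small:.2f}, at 0.4 = {p_large:.2f}")
```

Power rises along an S-shaped curve as the sample grows, and at any given sample size the larger effect is the easier one to detect.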

01:15:22.050 --> 01:15:24.750
When we think about power, we've
talked about a couple of

01:15:24.750 --> 01:15:29.260
things that influence our power
in terms of the variance

01:15:29.260 --> 01:15:30.800
of our outcome, right?

01:15:30.800 --> 01:15:32.890
The variance of our outcome,
how big our effect size is.

01:15:32.890 --> 01:15:34.180
And those are the basic
things that are going

01:15:34.180 --> 01:15:35.420
to affect our power.

01:15:35.420 --> 01:15:39.650
But there are things that we
can do in our experiment--

01:15:39.650 --> 01:15:41.360
in the way we design our
experiment that are also going

01:15:41.360 --> 01:15:44.470
to make our experiment more
or less powerful.

01:15:44.470 --> 01:15:45.790
And here are some of the
things that we can do.

01:15:49.720 --> 01:15:52.390
One thing that we can
do is we can think

01:15:52.390 --> 01:15:55.240
about having a cluster--

01:15:55.240 --> 01:15:57.330
so whether we randomize
at the individual

01:15:57.330 --> 01:16:03.320
level or in clusters, whether we
have a baseline, whether we

01:16:03.320 --> 01:16:06.570
use control variables or
stratification, and the type

01:16:06.570 --> 01:16:09.750
of hypothesis being tested.

01:16:09.750 --> 01:16:12.550
All four of these are things
that we're going to do that

01:16:12.550 --> 01:16:15.130
for a given outcome variable
and a given effect size, in

01:16:15.130 --> 01:16:17.750
some sense, are going to
affect how powerful our

01:16:17.750 --> 01:16:20.490
experiment is.

01:16:20.490 --> 01:16:22.800
In some sense--

01:16:22.800 --> 01:16:24.820
given that I may not have time
to finish everything, the one

01:16:24.820 --> 01:16:28.550
that I want to focus on is
the clustering issue.

01:16:28.550 --> 01:16:31.720
This is the one that is the
biggest for designing

01:16:31.720 --> 01:16:37.270
experiments, and it often
makes a big difference.

01:16:37.270 --> 01:16:43.640
So the intuition for clustering
is that--

01:16:43.640 --> 01:16:45.040
so what is clustering?

01:16:45.040 --> 01:16:49.480
Clustering is, instead of
randomizing-- suppose I want

01:16:49.480 --> 01:16:54.750
to do an experiment on whether
the J-PAL executive ed class

01:16:54.750 --> 01:16:56.980
improves your ability to--

01:16:56.980 --> 01:16:59.440
whether you took this lecture
improves your understanding of

01:16:59.440 --> 01:17:01.820
power calculation, OK?

01:17:01.820 --> 01:17:05.380
Suppose I randomly sampled this
half of the room and gave

01:17:05.380 --> 01:17:08.980
you my lecture and this half
was the control group.

01:17:08.980 --> 01:17:11.270
And I flipped a coin so I split
you in halves down the

01:17:11.270 --> 01:17:13.160
middle and I said, OK, I'm going
to flip a coin, which is

01:17:13.160 --> 01:17:14.410
control, which is treatment.

01:17:16.590 --> 01:17:20.020
You guys, presumably you all
sat with your friends, OK?

01:17:20.020 --> 01:17:23.280
So people on this side of the
room are going to be more like

01:17:23.280 --> 01:17:27.440
each other than people on that
side of the room, OK?

01:17:27.440 --> 01:17:32.620
So I didn't get an independent
sample, right?

01:17:32.620 --> 01:17:35.810
This group, their outcomes are
going to be correlated because

01:17:35.810 --> 01:17:37.510
some of you are friends
and have similar

01:17:37.510 --> 01:17:38.530
backgrounds and skills.

01:17:38.530 --> 01:17:41.350
And this group is going
to be correlated.

01:17:41.350 --> 01:17:44.490
On the other hand, suppose I had
gone through everyone and

01:17:44.490 --> 01:17:46.450
randomly flipped a coin for
every person and said,

01:17:46.450 --> 01:17:47.200
treatment or control, treatment
or control,

01:17:47.200 --> 01:17:49.920
treatment or control?

01:17:49.920 --> 01:17:53.540
In that case, I would've flipped
the coin 60 times and

01:17:53.540 --> 01:17:56.070
there would be no correlation
between who is in the control

01:17:56.070 --> 01:17:57.550
group and who is in the
treatment group because I

01:17:57.550 --> 01:17:59.690
wouldn't have been randomizing
you into the same groups

01:17:59.690 --> 01:18:02.550
together, OK?

01:18:02.550 --> 01:18:07.070
By doing the cluster design,
splitting you in half and then

01:18:07.070 --> 01:18:10.170
randomizing treatment versus
control or splitting you into

01:18:10.170 --> 01:18:11.600
groups of 10--

01:18:11.600 --> 01:18:12.920
you five, you 10, you 10.

01:18:12.920 --> 01:18:15.660
You 10, you 10, you 10, and
then flipping the coin.

01:18:15.660 --> 01:18:18.440
I have less variation, in some
sense, than if I had flipped

01:18:18.440 --> 01:18:19.890
the coin in individual--

01:18:19.890 --> 01:18:21.220
person by person--

01:18:21.220 --> 01:18:23.510
because those groups are
going to be correlated.

01:18:23.510 --> 01:18:25.580
They're going to have
similar outcomes.

01:18:25.580 --> 01:18:31.100
So the basic point is that your
power is going to be--

01:18:31.100 --> 01:18:33.790
the more times you flip the coin
to randomize treatment

01:18:33.790 --> 01:18:35.780
and control, essentially, the
more power you're going to

01:18:35.780 --> 01:18:37.950
have because the more your
different groups are going to

01:18:37.950 --> 01:18:40.160
be independent, OK?

01:18:40.160 --> 01:18:44.430
So to go through this again,
suppose you wanted to know--

01:18:44.430 --> 01:18:46.340
this is, in general,
about clustering.

01:18:46.340 --> 01:18:49.600
Suppose you wanted to know how
the outcome of the national

01:18:49.600 --> 01:18:51.180
elections are going to be.

01:18:51.180 --> 01:18:53.520
So you could either randomly
sample 50 people from the

01:18:53.520 --> 01:18:56.040
entire Indian population, or
you randomly pick five

01:18:56.040 --> 01:18:59.470
families and you ask 10
people per family what

01:18:59.470 --> 01:19:01.590
their opinions are.

01:19:01.590 --> 01:19:03.960
Clearly, this is going to give
you more information than this

01:19:03.960 --> 01:19:06.110
is because those family members
are going to be

01:19:06.110 --> 01:19:08.160
correlated, right?

01:19:08.160 --> 01:19:10.690
I have views like my wife and
like my father, et cetera.

01:19:10.690 --> 01:19:12.990
So we're not getting independent
views, whereas

01:19:12.990 --> 01:19:15.565
here, you're getting, really,
50 independent data points.

01:19:15.565 --> 01:19:16.910
And that's the same as
what we were talking

01:19:16.910 --> 01:19:19.230
about with the class.

01:19:19.230 --> 01:19:21.700
So this approach is going to
have more power than this

01:19:21.700 --> 01:19:24.132
approach because of the way
you did the sample.

01:19:24.132 --> 01:19:24.568
Yeah.

01:19:24.568 --> 01:19:26.465
AUDIENCE: So is the only reason
that you would cluster,

01:19:26.465 --> 01:19:30.370
then, just because you had to
because you had no choice--

01:19:30.370 --> 01:19:31.230
PROFESSOR: Yes.

01:19:31.230 --> 01:19:34.824
AUDIENCE: --for political
reasons or just feasibility.

01:19:34.824 --> 01:19:35.782
PROFESSOR: And cost.

01:19:35.782 --> 01:19:36.740
AUDIENCE: And cost.

01:19:36.740 --> 01:19:37.410
PROFESSOR: Yeah.

01:19:37.410 --> 01:19:38.810
AUDIENCE: Well, and the
level of intervention.

01:19:38.810 --> 01:19:40.800
PROFESSOR: Exactly.

01:19:40.800 --> 01:19:41.550
And we'll talk about that.

01:19:41.550 --> 01:19:44.040
There are lots of
reasons people--

01:19:44.040 --> 01:19:47.260
given this issue, people have
lots of good reasons for

01:19:47.260 --> 01:19:51.720
clustering, but the point is
that there are negative

01:19:51.720 --> 01:19:52.830
tradeoffs for sample size.

01:19:52.830 --> 01:19:56.360
AUDIENCE: About the clusters.

01:19:56.360 --> 01:20:03.260
If you flip the coin for all of
the class and then after,

01:20:03.260 --> 01:20:06.690
you decide that you will select
among the people that

01:20:06.690 --> 01:20:10.880
you have assigned, you will
select those seated--

01:20:10.880 --> 01:20:15.000
you will select half of those
seated on the left.

01:20:15.000 --> 01:20:16.290
Will that solve the problem--

01:20:16.290 --> 01:20:18.760
PROFESSOR: You select half of
the ones seated on the left?

01:20:18.760 --> 01:20:19.290
AUDIENCE: Yeah.

01:20:19.290 --> 01:20:21.670
PROFESSOR: Well, it's
a different issue.

01:20:21.670 --> 01:20:25.300
Suppose I first select the left
and now I go one by one,

01:20:25.300 --> 01:20:28.140
flip a coin of the people
on the left.

01:20:28.140 --> 01:20:30.270
I don't have the clustering
issue because I flipped the

01:20:30.270 --> 01:20:32.310
coin per person.

01:20:32.310 --> 01:20:34.860
But I have a different issue,
which is that the people I

01:20:34.860 --> 01:20:36.860
selected are not necessarily
representative of the whole

01:20:36.860 --> 01:20:40.450
population because I didn't pick
a representative sample.

01:20:40.450 --> 01:20:42.040
I picked the ones who happened
to sit over here.

01:20:42.040 --> 01:20:42.966
AUDIENCE: My question--

01:20:42.966 --> 01:20:45.320
PROFESSOR: So there's two
different issues.

01:20:45.320 --> 01:20:50.080
One is, essentially, how many
times you flip a coin is how

01:20:50.080 --> 01:20:53.000
much power you have, how
independent your thing is.

01:20:53.000 --> 01:20:55.950
The other issue is, is this
group here representative of

01:20:55.950 --> 01:20:57.700
the entire population?

01:20:57.700 --> 01:21:01.450
You might think that people who
sit near the window like

01:21:01.450 --> 01:21:03.310
to look at the river are
daydreamers and they're not as

01:21:03.310 --> 01:21:06.020
good at math as people who don't
sit near the window.

01:21:06.020 --> 01:21:09.850
And so I would get the effect of
my treatment on people who

01:21:09.850 --> 01:21:11.630
like to sit near the window and
aren't as good at math.

01:21:11.630 --> 01:21:13.310
And that might be a different
treatment effect than if I had

01:21:13.310 --> 01:21:14.530
done it over the whole room.

01:21:14.530 --> 01:21:18.286
So it's a different issue.

01:21:18.286 --> 01:21:24.946
AUDIENCE: Yeah, but my question
was you first draw a

01:21:24.946 --> 01:21:29.752
random number of people that you
assign to the treatment or

01:21:29.752 --> 01:21:32.040
to the control.

01:21:32.040 --> 01:21:37.580
And after that, within that
people, you now say, I will

01:21:37.580 --> 01:21:39.726
take half of those people--

01:21:39.726 --> 01:21:44.037
I will take half that are seated
on the left and half

01:21:44.037 --> 01:21:47.390
that are seated on the right.

01:21:47.390 --> 01:21:48.640
PROFESSOR: I'm not sure--

01:21:48.640 --> 01:21:50.710
let me come back to
your question.

01:21:50.710 --> 01:21:51.920
I'm not sure I fully understand
what you're saying.

01:21:51.920 --> 01:21:53.540
Maybe we can talk about
it afterwards.

01:21:53.540 --> 01:21:54.850
I think what you're saying may
be about stratification.

01:21:54.850 --> 01:21:57.550
Why don't we talk
about it later?

01:21:57.550 --> 01:22:00.050
Because we're running a
little short on time.

01:22:00.050 --> 01:22:00.820
In fact, can I borrow
someone's handouts?

01:22:00.820 --> 01:22:04.950
Because I want to make sure I
cover the most important stuff

01:22:04.950 --> 01:22:05.865
in the lecture.

01:22:05.865 --> 01:22:07.115
Let me just see where we are.

01:22:13.910 --> 01:22:15.251
OK.

01:22:15.251 --> 01:22:16.940
AUDIENCE: And if you
need to, you can

01:22:16.940 --> 01:22:18.017
take ten extra minutes.

01:22:18.017 --> 01:22:20.510
PROFESSOR: I may do that.

01:22:20.510 --> 01:22:22.620
I was going to ask you,
Mark, for permission.

01:22:22.620 --> 01:22:24.170
I just wanted to see
what I had left.

01:22:24.170 --> 01:22:25.050
OK.

01:22:25.050 --> 01:22:28.220
So where were we?

01:22:31.690 --> 01:22:31.940
Right.

01:22:31.940 --> 01:22:32.310
OK.

01:22:32.310 --> 01:22:33.000
Right.

01:22:33.000 --> 01:22:36.680
So as I was saying, when
possible, it's better to run

01:22:36.680 --> 01:22:39.460
a non-clustered design.

01:22:39.460 --> 01:22:42.780
And so a cluster randomized
trial is one in which the

01:22:42.780 --> 01:22:48.750
units that are randomized are
clusters of units rather than

01:22:48.750 --> 01:22:49.400
the individual units.

01:22:49.400 --> 01:22:51.640
So I randomized a whole cluster
at a time rather than

01:22:51.640 --> 01:22:54.030
individual person by person.

01:22:54.030 --> 01:22:55.950
And there are lots of common
examples of this.

01:22:55.950 --> 01:22:59.340
So the PROGRESA program, for
example, in Mexico was a

01:22:59.340 --> 01:23:00.410
conditional cash transfer
program.

01:23:00.410 --> 01:23:01.900
They randomized by village.

01:23:01.900 --> 01:23:03.470
Some villages were in, some
villages were out.

01:23:03.470 --> 01:23:06.640
If a village was in,
everybody was in.

01:23:06.640 --> 01:23:08.700
In the panchayat case
we talked about, it

01:23:08.700 --> 01:23:10.340
was basically a village.

01:23:10.340 --> 01:23:10.930
It was a panchayat.

01:23:10.930 --> 01:23:12.470
So the whole panchayat
was in or the whole

01:23:12.470 --> 01:23:14.460
panchayat was not in.

01:23:14.460 --> 01:23:17.135
In a lot of education
experiments, we randomize at

01:23:17.135 --> 01:23:18.240
the level of a school.

01:23:18.240 --> 01:23:20.280
Either the whole school is in
or the whole school is out.

01:23:20.280 --> 01:23:21.220
Sometimes you do
it as a class.

01:23:21.220 --> 01:23:23.400
A whole class in a school is
in [UNINTELLIGIBLE] is out.

01:23:23.400 --> 01:23:25.970
In this iron supplementation
example, it was by the family.

01:23:25.970 --> 01:23:29.590
So there's lots of cases where
you would do this kind of

01:23:29.590 --> 01:23:31.940
clustering.

01:23:31.940 --> 01:23:34.790
And there are lots of good
reasons, as I've mentioned,

01:23:34.790 --> 01:23:36.130
for doing clustering.

01:23:36.130 --> 01:23:40.470
So one reason is you're
worried about

01:23:40.470 --> 01:23:43.450
contamination, right?

01:23:43.450 --> 01:23:47.000
So for example, when they're
interested in deworming, worms

01:23:47.000 --> 01:23:49.100
are very easily--

01:23:49.100 --> 01:23:50.400
there's a lot of
cross-contamination.

01:23:50.400 --> 01:23:52.785
If one kid has worms, the next
kid who's also in school with

01:23:52.785 --> 01:23:53.940
him is likely to get worms.

01:23:53.940 --> 01:23:56.960
So if I just deworm half the
kids in the school, that's

01:23:56.960 --> 01:24:00.350
going to have very little effect
because my control--

01:24:00.350 --> 01:24:02.270
they're going to get
recontaminated by the kids who

01:24:02.270 --> 01:24:03.440
weren't dewormed, right?

01:24:03.440 --> 01:24:04.540
Or it could be the
other way around.

01:24:04.540 --> 01:24:06.720
It could be that if I deworm
half the kids, that's enough

01:24:06.720 --> 01:24:07.830
to knock worms out of
the population.

01:24:07.830 --> 01:24:09.750
The control group is
also affected.

01:24:09.750 --> 01:24:12.300
So you need to choose a level
of randomization where your

01:24:12.300 --> 01:24:14.300
treatment is going to affect
the treatment group and not

01:24:14.300 --> 01:24:16.120
affect the control group.

01:24:16.120 --> 01:24:18.700
So that's a very important
reason for cluster

01:24:18.700 --> 01:24:20.470
randomizing.

01:24:20.470 --> 01:24:23.960
Another reason is this
feasibility consideration.

01:24:23.960 --> 01:24:26.960
So it's often just for a
variety of reasons not

01:24:26.960 --> 01:24:31.370
feasible to give some people the
treatment and not others.

01:24:31.370 --> 01:24:35.060
Sometimes within a village, it's
hard to make some people

01:24:35.060 --> 01:24:38.100
eligible for a program
and others not.

01:24:38.100 --> 01:24:40.170
It's just sometimes hard to
treat people in the same place

01:24:40.170 --> 01:24:41.100
differently.

01:24:41.100 --> 01:24:43.050
And so that's often a reason
why we do cluster

01:24:43.050 --> 01:24:45.080
randomization.

01:24:45.080 --> 01:24:50.100
And some experiments naturally
just occur at a cluster level.

01:24:50.100 --> 01:24:52.710
So for example, if I want to do
something that affects an

01:24:52.710 --> 01:24:54.900
entire classroom,
like give out--

01:24:54.900 --> 01:24:57.685
suppose I want to train
a teacher, right?

01:24:57.685 --> 01:25:00.130
That obviously affects all the
kids in the teacher's class.

01:25:00.130 --> 01:25:03.280
There's no way to have that only
affect half the kids in

01:25:03.280 --> 01:25:04.130
the teacher's class.

01:25:04.130 --> 01:25:06.420
It's just a fact of life.

01:25:06.420 --> 01:25:09.510
So there are lots of good
reasons why we do cluster

01:25:09.510 --> 01:25:13.080
randomized designs even though
they have negative

01:25:13.080 --> 01:25:16.280
impacts on our power.

01:25:16.280 --> 01:25:20.490
So as I mentioned, the reason
the cluster has a negative

01:25:20.490 --> 01:25:22.820
impact on your power is
because the groups are

01:25:22.820 --> 01:25:24.560
correlated.

01:25:24.560 --> 01:25:27.070
The outcomes for the individuals
are correlated.

01:25:27.070 --> 01:25:28.720
So, for example, if all of the
villagers are exposed to the

01:25:28.720 --> 01:25:30.860
same weather, right?

01:25:30.860 --> 01:25:32.610
All villagers are exposed
to the same weather.

01:25:32.610 --> 01:25:35.360
So it could be that
the weather was

01:25:35.360 --> 01:25:36.570
really bad in this village.

01:25:36.570 --> 01:25:40.300
So all those people are going
to have a lower outcome, for

01:25:40.300 --> 01:25:41.550
example, than if the
weather was good.

01:25:44.910 --> 01:25:48.150
And so, in some sense, even if
there are 1,000 people in that

01:25:48.150 --> 01:25:50.920
village, they all got this
common shock, which is the

01:25:50.920 --> 01:25:53.860
negative weather, you don't
actually have 1,000

01:25:53.860 --> 01:25:56.336
independent observations in that
village because they have

01:25:56.336 --> 01:26:00.180
this common correlated
component, OK?

01:26:00.180 --> 01:26:05.230
And this common correlated
component we denote by the

01:26:05.230 --> 01:26:08.460
Greek letter rho, which is the
correlation of the units

01:26:08.460 --> 01:26:09.710
within the same cluster.

01:26:13.840 --> 01:26:20.110
So rho measures the correlation
between units in

01:26:20.110 --> 01:26:20.850
the same cluster.

01:26:20.850 --> 01:26:24.090
If rho is zero, then people in
the same cluster are just as

01:26:24.090 --> 01:26:25.420
if they were independent.

01:26:25.420 --> 01:26:27.060
There's no correlation.

01:26:27.060 --> 01:26:29.080
Just as if they had been not
in the same cluster.

01:26:29.080 --> 01:26:31.760
If rho is one, they're perfectly
correlated and it

01:26:31.760 --> 01:26:36.320
means they all have exactly
the same outcome, OK?

01:26:36.320 --> 01:26:39.150
So it's somewhere between
zero and one.

01:26:39.150 --> 01:26:42.990
And the lower the rho is, the
better you are if you're doing

01:26:42.990 --> 01:26:44.650
a cluster randomized design.

01:26:44.650 --> 01:26:45.570
And why is that?

01:26:45.570 --> 01:26:47.960
It's because the problem within
a clustered randomized

01:26:47.960 --> 01:26:50.230
design is, as I was saying, if
people were all exposed to the

01:26:50.230 --> 01:26:53.020
same weather, it's not as if
you had 1,000 independent

01:26:53.020 --> 01:26:54.720
people in that village.

01:26:54.720 --> 01:26:57.310
You effectively had fewer than
1,000 because they were

01:26:57.310 --> 01:26:58.150
correlated.

01:26:58.150 --> 01:27:02.030
And rho captures that effect--

01:27:02.030 --> 01:27:05.540
how much smaller, effectively,
is your sample, OK?

01:27:05.540 --> 01:27:09.340
And the bigger rho is, the
smaller your effective sample

01:27:09.340 --> 01:27:11.760
size is, OK?
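
NOTE
A minimal sketch of the effective-sample-size idea, not taken from the lecture slides. It assumes the standard design-effect formula for equal-size clusters, 1 + (m - 1) * rho; the function name is hypothetical.
```python
def effective_sample_size(n_clusters: int, cluster_size: int, rho: float) -> float:
    """Number of independent observations a clustered sample is worth."""
    # Design effect: how much correlation within clusters inflates the variance.
    design_effect = 1.0 + (cluster_size - 1) * rho
    return n_clusters * cluster_size / design_effect
# One village of 1,000 people who all share the same weather shock.
for rho in (0.0, 0.1, 1.0):
    print(rho, round(effective_sample_size(1, 1000, rho), 1))
```
With rho = 0 the village counts as 1,000 independent people; with rho = 1 it counts as a single observation.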

01:27:15.130 --> 01:27:16.910
And once again, when you do the
power calculations, you

01:27:16.910 --> 01:27:19.120
can play with this and you'll
note that small differences in

01:27:19.120 --> 01:27:21.540
rho make very big differences
in your power.

01:27:21.540 --> 01:27:23.380
And I'll show you the
formula in a sec.

01:27:23.380 --> 01:27:26.320
So often it's low, but it
can be substantial.

01:27:26.320 --> 01:27:29.370
So in some of these test score
cases, for example, it's

01:27:29.370 --> 01:27:34.060
between 0.2 and 0.6, which,
0.6 means that most of the

01:27:34.060 --> 01:27:38.840
differences are coming between
groups, not within groups.

01:27:38.840 --> 01:27:44.190
So the groups, really, are much
closer to one object.

01:27:44.190 --> 01:27:44.665
Yeah.

01:27:44.665 --> 01:27:49.370
AUDIENCE: What does
the 0.5 mean?

01:27:49.370 --> 01:27:56.458
Are you saying that in
Madagascar, the scores on math

01:27:56.458 --> 01:27:58.370
and language--

01:27:58.370 --> 01:28:00.200
PROFESSOR: It's the correlation

01:28:00.200 --> 01:28:03.470
coefficient, which is the--

01:28:07.070 --> 01:28:13.230
technically, I believe it's the
between variation divided

01:28:13.230 --> 01:28:14.110
by the total variation.

01:28:14.110 --> 01:28:15.580
I think that's the formula.

01:28:15.580 --> 01:28:16.490
Dan's shaking his head.

01:28:16.490 --> 01:28:17.250
Good.

01:28:17.250 --> 01:28:18.070
Excellent.

01:28:18.070 --> 01:28:20.430
A for me.

01:28:20.430 --> 01:28:24.910
It's what share of the variation
is coming between

01:28:24.910 --> 01:28:29.710
groups divided by the total
share of variation.

01:28:29.710 --> 01:28:33.365
So 0.5 means that, in some
sense, half of the variation

01:28:33.365 --> 01:28:35.330
in your sample is coming
between groups.
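
NOTE
A minimal sketch of rho as a variance share, following the "between variation divided by the total variation" description. The variance numbers are made up for illustration and the function name is hypothetical.
```python
def intracluster_correlation(var_between: float, var_within: float) -> float:
    """Share of total outcome variance that comes from differences between groups."""
    total = var_between + var_within
    return var_between / total
# Hypothetical test-score variances: half between schools, half within schools.
print(intracluster_correlation(50.0, 50.0))
# All variation within groups: members of a group are no more alike than strangers.
print(intracluster_correlation(0.0, 100.0))
```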

01:28:37.890 --> 01:28:38.802
AUDIENCE: OK.

01:28:38.802 --> 01:28:39.590
PROFESSOR: What?

01:28:39.590 --> 01:28:41.360
AUDIENCE: Isn't it within
[INAUDIBLE]?

01:28:41.360 --> 01:28:41.980
If a rho is--

01:28:41.980 --> 01:28:45.080
PROFESSOR: No, it's between.

01:28:45.080 --> 01:28:47.680
Because if rho is one, then
each group is one.

01:28:50.310 --> 01:28:51.590
Yeah, it's between.

01:28:55.880 --> 01:29:01.230
If it was zero, then they're
independent and it's saying

01:29:01.230 --> 01:29:03.884
that it's all coming
from within.

01:29:03.884 --> 01:29:06.139
Yeah.

01:29:06.139 --> 01:29:10.180
AUDIENCE: But here it's between
math and language

01:29:10.180 --> 01:29:15.290
scores of one kid or between
math plus language scores of

01:29:15.290 --> 01:29:16.970
two kids in the same group.

01:29:16.970 --> 01:29:20.130
AUDIENCE: Or is it math and
language scores in Madagascar

01:29:20.130 --> 01:29:21.770
are explained by--

01:29:21.770 --> 01:29:24.280
PROFESSOR: This says
the following.

01:29:24.280 --> 01:29:27.750
This was in Madagascar, they
sampled math and language

01:29:27.750 --> 01:29:31.170
schools by--

01:29:31.170 --> 01:29:34.650
they took math and language
scores for each kid by

01:29:34.650 --> 01:29:35.330
classroom--

01:29:35.330 --> 01:29:35.830
or by school.

01:29:35.830 --> 01:29:38.210
I think it was by school in
this particular case.

01:29:38.210 --> 01:29:40.110
Then they said, looking over
the whole sample that they

01:29:40.110 --> 01:29:44.030
looked at in Madagascar, what
percentage of the variation in

01:29:44.030 --> 01:29:47.620
test scores came
between schools

01:29:47.620 --> 01:29:49.870
relative to within schools.

01:29:49.870 --> 01:29:51.030
And they're saying
that half of the

01:29:51.030 --> 01:29:53.790
variation was between schools.

01:29:58.390 --> 01:29:59.640
OK.

01:30:09.190 --> 01:30:11.350
So how much does this hurt
us, essentially?

01:30:11.350 --> 01:30:15.460
So we need to adjust our
standard errors, given the

01:30:15.460 --> 01:30:19.640
fact that these things
are correlated.

01:30:19.640 --> 01:30:26.130
And this is the formula, which
is that for a given total

01:30:26.130 --> 01:30:28.780
sample size, if we have clusters
of size m-- so say we

01:30:28.780 --> 01:30:30.710
have 100 kids per school--

01:30:30.710 --> 01:30:34.070
and the intracluster correlation
coefficient is rho,

01:30:34.070 --> 01:30:37.160
the size of the smallest effect
we can detect increases

01:30:37.160 --> 01:30:39.940
by this formula compared to
a non-clustered design.

01:30:39.940 --> 01:30:43.880
So this shows you what
this looks like, OK?

01:30:43.880 --> 01:30:50.230
So suppose you had 100
kids per school, OK?

01:30:50.230 --> 01:30:53.270
Suppose you had 100 kids per
school and you randomized at

01:30:53.270 --> 01:30:56.310
the school level rather than
the individual level.

01:30:56.310 --> 01:30:59.040
If your correlation coefficient
was zero, it would

01:30:59.040 --> 01:31:00.460
be the same as if we randomized
at the individual

01:31:00.460 --> 01:31:02.450
level because they're totally
uncorrelated.

01:31:02.450 --> 01:31:05.095
Suppose your correlation
coefficient was 0.1--

01:31:05.095 --> 01:31:06.980
rho was 0.1.

01:31:06.980 --> 01:31:11.200
Then the smallest effect size
you could detect would be 3.3

01:31:11.200 --> 01:31:15.120
times larger than if you had
done an individual design.
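
NOTE
A minimal sketch of the inflation factor discussed here. The slide formula is not in the transcript; this assumes it is the standard sqrt(1 + (m - 1) * rho) for clusters of size m, which reproduces the 3.3 figure for 100 kids per school and rho = 0.1. The function name is hypothetical.
```python
import math
def mde_inflation(cluster_size: int, rho: float) -> float:
    """Factor by which the smallest detectable effect grows under clustering,
    holding the total sample size fixed."""
    return math.sqrt(1.0 + (cluster_size - 1) * rho)
# 100 kids per school, randomized at the school level.
for rho in (0.0, 0.05, 0.1, 0.5):
    print(f"rho={rho}: {mde_inflation(100, rho):.2f}")
```
The factor depends on both rho and the cluster size, so the same rho hurts much less with 10 kids per school than with 100.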

01:31:19.413 --> 01:31:22.860
So does that make sense
how to interpret this?

01:31:22.860 --> 01:31:27.150
And so this illustrates that,
even with very mild

01:31:27.150 --> 01:31:30.110
correlation coefficients-- and
we saw examples of those math

01:31:30.110 --> 01:31:31.610
test scores that
were like 0.5.

01:31:31.610 --> 01:31:34.300
This is only 0.1, but it already
means, in some sense,

01:31:34.300 --> 01:31:38.380
that your experiment
can detect things--

01:31:41.180 --> 01:31:42.980
if you had been able to
individually randomize, you

01:31:42.980 --> 01:31:44.480
would be able to detect things
that were three times as

01:31:44.480 --> 01:31:47.100
small, right?

01:31:47.100 --> 01:31:49.520
Now that's a combination of
the fact that you have the

01:31:49.520 --> 01:31:50.540
correlation coefficient
and the number

01:31:50.540 --> 01:31:51.930
of people per cluster.

01:31:51.930 --> 01:31:56.634
AUDIENCE: Then in the previous
slide, 0.5 does not mean half?

01:31:56.634 --> 01:32:00.198
PROFESSOR: No, 0.5 is the
correlation-- it's rho.

01:32:00.198 --> 01:32:01.820
AUDIENCE: No, in the

01:32:01.820 --> 01:32:02.495
PROFESSOR: It's rho.

01:32:02.495 --> 01:32:04.600
AUDIENCE: Then it does not mean
half of the difference--

01:32:04.600 --> 01:32:07.650
PROFESSOR: No, it's half
of the variance.

01:32:10.960 --> 01:32:11.930
Let me move on.

01:32:11.930 --> 01:32:15.365
We can talk about the
formula for that.

01:32:15.365 --> 01:32:16.615
OK.

01:32:20.040 --> 01:32:21.420
So what this means is, if the
experimental design is

01:32:21.420 --> 01:32:24.720
clustered, we now not only need
to consider all the other

01:32:24.720 --> 01:32:26.200
factors we talked about before,
we also need to

01:32:26.200 --> 01:32:29.660
consider this factor rho when
doing our power calculations.

01:32:29.660 --> 01:32:32.360
And rho is yet another thing we
can try to estimate based

01:32:32.360 --> 01:32:35.845
on our little survey of our
population to get a sense of

01:32:35.845 --> 01:32:37.095
what this rho is likely to be.

01:32:40.570 --> 01:32:46.150
And given this clustering issue,
it's very important not

01:32:46.150 --> 01:32:47.820
just that you have a big
enough number of people

01:32:47.820 --> 01:32:50.130
involved in your experiment, but
that you randomize across

01:32:50.130 --> 01:32:52.560
a big enough number
of groups, right?

01:32:52.560 --> 01:32:54.745
And the way I like to think
about it is, how many times did

01:32:54.745 --> 01:32:56.350
you flip the coin as to who
should be treatment and who

01:32:56.350 --> 01:32:57.600
should be control?

01:33:00.830 --> 01:33:03.540
And, in fact, it's usually the
case that the number of groups

01:33:03.540 --> 01:33:07.430
you have is often more important
than the total

01:33:07.430 --> 01:33:11.660
number of individuals that you
have because the individuals

01:33:11.660 --> 01:33:17.090
are correlated within
a group, OK?

01:33:17.090 --> 01:33:18.340
So moving on.

01:33:25.530 --> 01:33:26.890
So I'm going to flip
through this.

01:33:26.890 --> 01:33:28.050
This is mostly going over some
of this if you were doing the

01:33:28.050 --> 01:33:29.300
exercise quickly.

01:33:33.161 --> 01:33:35.120
OK.

01:33:35.120 --> 01:33:37.090
And so this chart--

01:33:41.860 --> 01:33:43.930
in your exercise shows you some
of the tradeoffs that you

01:33:43.930 --> 01:33:46.290
should think about when you're
trying to decide how you

01:33:46.290 --> 01:33:52.180
should trade off the number of
groups you have versus the

01:33:52.180 --> 01:33:54.460
number of people within
a group, OK?

01:33:54.460 --> 01:33:58.820
So in this particular case, a
group was a gram panchayat,

01:33:58.820 --> 01:34:01.740
and within a group there
were villages, OK?

01:34:01.740 --> 01:34:03.680
And there were different costs
involved in doing these

01:34:03.680 --> 01:34:04.390
different things, right?

01:34:04.390 --> 01:34:07.670
So going to the place involved
transportation costs to get to

01:34:07.670 --> 01:34:08.980
the gram panchayat.

01:34:08.980 --> 01:34:10.410
That, say, was a
couple of days.

01:34:10.410 --> 01:34:12.480
And then it took, like, half a
day, say, for every village

01:34:12.480 --> 01:34:13.620
you interviewed.

01:34:13.620 --> 01:34:16.450
So that said, there's some
cost of adding a new gram

01:34:16.450 --> 01:34:18.840
panchayat, but also some
marginal cost of adding

01:34:18.840 --> 01:34:21.860
an additional village per
gram panchayat, OK?

01:34:21.860 --> 01:34:25.750
So you could calculate, based
on all your parameters and

01:34:25.750 --> 01:34:29.360
power of 80% and whatever the
intracluster correlation is in

01:34:29.360 --> 01:34:31.710
this particular case, you could
say, well, if we had

01:34:31.710 --> 01:34:34.830
this many villages per gram
panchayat, how many gram

01:34:34.830 --> 01:34:36.550
panchayats would we need
and how many villages

01:34:36.550 --> 01:34:38.280
would we need, OK?

01:34:38.280 --> 01:34:39.460
So you can do this
set of exercises

01:34:39.460 --> 01:34:40.460
and you can say that--

01:34:40.460 --> 01:34:43.500
and you'll note, for example,
that as we reduce the number

01:34:43.500 --> 01:34:46.230
of gram panchayats we go to--
another way, as we add more

01:34:46.230 --> 01:34:49.040
villages per gram panchayat, the
total number of villages

01:34:49.040 --> 01:34:51.040
we need to survey goes up.

01:34:51.040 --> 01:34:54.030
And in this particular case, it
doesn't go up by that much

01:34:54.030 --> 01:34:57.290
because the intracluster
correlation is not that high.

01:34:57.290 --> 01:34:59.170
And you could actually do this
type of calculation and you

01:34:59.170 --> 01:35:01.910
could say, well, I know what
my costs are, right?

01:35:01.910 --> 01:35:03.510
I know what my costs of
going to this place are.

01:35:03.510 --> 01:35:06.420
And I can calculate which of
these designs is the cheapest

01:35:06.420 --> 01:35:08.780
design given what I
want to achieve.
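
NOTE
A rough sketch of the cost comparison described here, under stated assumptions. The two days per gram panchayat and half a day per village come from the example above; N_INDIVIDUAL and RHO are hypothetical placeholders, and the design-effect formula 1 + (m - 1) * rho is the standard one rather than anything shown in the transcript.
```python
import math
def clusters_needed(n_individual: int, per_cluster: int, rho: float) -> int:
    """Clusters required to match the power of n_individual independent units."""
    design_effect = 1.0 + (per_cluster - 1) * rho
    return math.ceil(n_individual * design_effect / per_cluster)
def survey_cost_days(n_clusters: int, per_cluster: int,
                     days_per_cluster: float = 2.0, days_per_unit: float = 0.5) -> float:
    """Fixed travel cost per cluster plus a marginal cost per unit surveyed."""
    return n_clusters * (days_per_cluster + per_cluster * days_per_unit)
N_INDIVIDUAL = 200  # hypothetical: villages needed if each were sampled independently
RHO = 0.05          # hypothetical intracluster correlation
designs = []
for m in range(1, 11):  # villages surveyed per gram panchayat
    j = clusters_needed(N_INDIVIDUAL, m, RHO)
    designs.append((survey_cost_days(j, m), m, j, j * m))
best = min(designs)  # tuples sort by cost first
print("cost_days, villages_per_gp, gram_panchayats, total_villages =", best)
```
Adding villages per gram panchayat raises the total number of villages but cuts travel, so the cheapest design is usually somewhere in the middle.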

01:35:08.780 --> 01:35:13.080
The other thing is, in this
case, the experiment was

01:35:13.080 --> 01:35:14.340
happening everywhere and
they were just trying

01:35:14.340 --> 01:35:15.680
to design the survey.

01:35:15.680 --> 01:35:17.690
But often when we're doing this,
we also need to pay for

01:35:17.690 --> 01:35:19.690
the intervention itself.

01:35:19.690 --> 01:35:22.860
And, at least in a lot of the
cases that I've worked with,

01:35:22.860 --> 01:35:26.940
the cost of actually doing the
intervention is much bigger

01:35:26.940 --> 01:35:29.150
than the cost of doing
the survey.

01:35:29.150 --> 01:35:33.460
And so, in that case, if you
always have to treat every

01:35:33.460 --> 01:35:37.670
village in the gram panchayat,
you can actually save a ton of

01:35:37.670 --> 01:35:39.670
money by going down in the
number of gram panchayats and

01:35:39.670 --> 01:35:41.930
surveying a lot more villages.

01:35:41.930 --> 01:35:44.520
But the whole point is there
are these tradeoffs and you

01:35:44.520 --> 01:35:47.910
need to, in deciding how you're
going to structure your

01:35:47.910 --> 01:35:49.370
experiment and how you're
going to structure your

01:35:49.370 --> 01:35:51.560
survey, you need to think
through what these tradeoffs

01:35:51.560 --> 01:35:54.380
are, make sure you have enough
power, given your estimates of

01:35:54.380 --> 01:35:56.150
your intracluster correlation
and sort of do the cost

01:35:56.150 --> 01:35:58.560
minimizing thing.

01:35:58.560 --> 01:36:01.320
OK, so in the last five minutes
or so, let me just

01:36:01.320 --> 01:36:04.880
highlight a couple of the other
issues that come up in

01:36:04.880 --> 01:36:07.210
thinking about power
calculations.

01:36:07.210 --> 01:36:10.970
So as I mentioned, the cluster
design is one of the most

01:36:10.970 --> 01:36:12.320
important ones.

01:36:12.320 --> 01:36:14.070
And the key thing is making
sure you have enough

01:36:14.070 --> 01:36:17.070
independent groups, where you
flip the coin to randomize

01:36:17.070 --> 01:36:19.300
between treatment and control
enough times.

01:36:19.300 --> 01:36:21.300
Some other things that matter
are baselines, control

01:36:21.300 --> 01:36:23.190
variables, and the hypothesis
being tested.

01:36:23.190 --> 01:36:26.150
So one minute on
each of those.

01:36:26.150 --> 01:36:29.810
A baseline has two uses--

01:36:29.810 --> 01:36:31.330
main uses.

01:36:31.330 --> 01:36:34.160
One use of a baseline is that it
lets you check whether the

01:36:34.160 --> 01:36:35.400
treatment and control
group look the

01:36:35.400 --> 01:36:37.780
same before you started.

01:36:37.780 --> 01:36:40.750
And if you randomized properly,
we know they should

01:36:40.750 --> 01:36:41.630
look similar.

01:36:41.630 --> 01:36:43.500
But you want to make sure that
your randomization was

01:36:43.500 --> 01:36:46.030
actually carried out the way it
was supposed to be and that

01:36:46.030 --> 01:36:48.970
it wasn't the case that people
were pulling out of the hat

01:36:48.970 --> 01:36:51.060
until they got a treatment or
something, that they were

01:36:51.060 --> 01:36:52.960
actually randomizing the way
they were supposed to.

01:36:52.960 --> 01:36:56.200
And having a baseline conducted
before you start can

01:36:56.200 --> 01:37:00.370
allow you to test that your
randomization is actually

01:37:00.370 --> 01:37:03.630
truly random and your groups
look balanced.

01:37:03.630 --> 01:37:06.070
The other thing is, the baseline
can actually help

01:37:06.070 --> 01:37:11.280
reduce the sample size you need,
but it requires you

01:37:11.280 --> 01:37:14.720
to do a survey before you start
the intervention, right?

01:37:14.720 --> 01:37:17.700
And the reason it can reduce
your sample size is that now,

01:37:17.700 --> 01:37:22.550
instead of just looking at, say,
test scores across kids,

01:37:22.550 --> 01:37:25.170
I can look at the change in test
scores from before versus

01:37:25.170 --> 01:37:27.460
after the experiment started.

01:37:27.460 --> 01:37:29.450
And if people are really
persistent, like if the people

01:37:29.450 --> 01:37:31.120
who did really well on the test
this year are likely to

01:37:31.120 --> 01:37:33.270
do really well on the test
next year, that can

01:37:33.270 --> 01:37:37.710
essentially reduce the variance
of your outcome.

01:37:37.710 --> 01:37:43.100
It can be that the variance of
difference in test scores can

01:37:43.100 --> 01:37:45.550
be a lot lower than the variance
in test scores.
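
NOTE
A minimal sketch of why persistence matters for a baseline. It assumes the before and after scores have the same variance sigma2 and correlation r over time, so that Var(after - before) = 2 * sigma2 * (1 - r). The function name is hypothetical.
```python
def variance_of_change(sigma2: float, r: float) -> float:
    """Variance of (after - before) when both scores have variance sigma2
    and are correlated over time with coefficient r."""
    return 2.0 * sigma2 * (1.0 - r)
SIGMA2 = 1.0  # variance of the test score in levels (normalized)
for r in (0.2, 0.5, 0.8, 0.95):
    ratio = variance_of_change(SIGMA2, r) / SIGMA2
    print(f"r={r}: variance of change is {ratio:.2f} times the variance in levels")
```
Under these assumptions, simple differencing only lowers the variance when r is above 0.5; for weakly persistent outcomes the change is noisier than the level.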

01:37:45.550 --> 01:37:47.490
And having a baseline can help
you for that reason.

01:37:51.390 --> 01:37:54.570
And as this slide points out,
your evaluation costs

01:37:54.570 --> 01:37:57.910
basically double because you
have to do two surveys, not

01:37:57.910 --> 01:38:01.450
one survey, but the costs of
the intervention go down

01:38:01.450 --> 01:38:03.620
because you can have a slightly
smaller sample.

01:38:03.620 --> 01:38:06.420
So if your intervention is
really expensive relative to

01:38:06.420 --> 01:38:08.580
your survey, this can
make a lot of sense.

01:38:08.580 --> 01:38:10.020
If your survey is really
expensive relative to your

01:38:10.020 --> 01:38:13.820
intervention, you might
not want to do this.

01:38:13.820 --> 01:38:17.830
And to figure out how this is
going to affect your power,

01:38:17.830 --> 01:38:21.010
you need to know yet another
fact, which is how correlated

01:38:21.010 --> 01:38:24.600
are people's outcomes
over time, right?

01:38:24.600 --> 01:38:27.040
What's the correlation between
how well I do on a test today

01:38:27.040 --> 01:38:28.240
and how well I do on
a test tomorrow?

01:38:28.240 --> 01:38:29.770
And some things are really
correlated and some things are

01:38:29.770 --> 01:38:30.900
not that correlated.

01:38:30.900 --> 01:38:32.640
And a baseline really helps you
on things that are really

01:38:32.640 --> 01:38:33.890
correlated.

01:38:38.570 --> 01:38:40.065
Another thing that can help
you is stratification.

01:38:42.740 --> 01:38:48.980
So what stratification can do
is, stratification says,

01:38:48.980 --> 01:38:50.150
suppose I--

01:38:50.150 --> 01:38:52.280
in some ways, it's conceptually
a little bit like

01:38:52.280 --> 01:38:54.750
a baseline, which is, suppose I
know that all of the people

01:38:54.750 --> 01:38:58.040
who live in this village tend
to have similar outcomes and

01:38:58.040 --> 01:38:59.430
all of the people who live in
this village tend to have

01:38:59.430 --> 01:38:59.880
similar outcomes.

01:38:59.880 --> 01:39:01.780
And all of the people who live
in this village tend to have

01:39:01.780 --> 01:39:03.960
similar outcomes.

01:39:03.960 --> 01:39:07.810
If I can then randomize within
village, I can compare the

01:39:07.810 --> 01:39:11.090
people in each village
to each other, OK?

01:39:11.090 --> 01:39:14.480
So if I'm looking within
village, if people in villages

01:39:14.480 --> 01:39:16.910
tend to be similar and I can
randomize within village, if I

01:39:16.910 --> 01:39:18.820
look within villages, the
difference between the

01:39:18.820 --> 01:39:19.710
treatment and the
control group is

01:39:19.710 --> 01:39:21.650
going to be less noisy.

01:39:21.650 --> 01:39:23.930
So stratifying is basically a
way of saying I'm going to

01:39:23.930 --> 01:39:27.270
make sure my sample is balanced
across the treatment

01:39:27.270 --> 01:39:29.620
and control groups within
certain subgroups of the

01:39:29.620 --> 01:39:30.680
population.

01:39:30.680 --> 01:39:32.920
And then I'm going to compare
within those subgroups of the

01:39:32.920 --> 01:39:35.340
population when I
do my analysis.

01:39:35.340 --> 01:39:36.630
And once again, we can
think of this as a

01:39:36.630 --> 01:39:38.120
way of reducing noise.

01:39:38.120 --> 01:39:41.100
That if people in similar
villages tend to be similar,

01:39:41.100 --> 01:39:43.250
if I only compare treatment and
control within the same

01:39:43.250 --> 01:39:48.160
village, the noise there
is going to be smaller.

01:39:48.160 --> 01:39:51.150
So in some sense,
it's similar.

01:39:51.150 --> 01:39:55.520
So some things we tend to
stratify by, if we know the

01:39:55.520 --> 01:39:57.660
baseline value of the outcome,
we can sometimes stratify by

01:39:57.660 --> 01:39:59.480
that because we know that the
effects are going to be

01:39:59.480 --> 01:40:02.720
similar for people who have very
similar baseline values.

01:40:02.720 --> 01:40:04.815
Or often, I think,
we tend to stratify

01:40:04.815 --> 01:40:04.980
geographically.

01:40:04.980 --> 01:40:07.420
So basically, we think that
people in certain areas tend

01:40:07.420 --> 01:40:09.170
to be similar so we're going
to make sure our treatments

01:40:09.170 --> 01:40:11.760
and controls are balanced in
those areas as a way of

01:40:11.760 --> 01:40:13.010
reducing noise.

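NOTE
A small simulation sketch of the noise reduction from stratifying by village. All numbers here (20 villages, 10 people each, the village and individual standard deviations, a true effect of 0.5) are made-up assumptions, and the function names are hypothetical.
```python
import random
import statistics
def one_estimate(rng, stratified, n_villages=20, per_village=10,
                 village_sd=2.0, noise_sd=1.0, effect=0.5):
    """Simulate one experiment; return the treatment-control difference in means."""
    # Untreated outcomes: a shared village component plus individual noise.
    outcomes = []
    for _ in range(n_villages):
        village_mean = rng.gauss(0.0, village_sd)
        outcomes.extend(village_mean + rng.gauss(0.0, noise_sd)
                        for _ in range(per_village))
    half = per_village // 2
    if stratified:
        # Treat exactly half of the people inside every village.
        treated = []
        for _ in range(n_villages):
            flags = [True] * half + [False] * (per_village - half)
            rng.shuffle(flags)
            treated.extend(flags)
    else:
        # Treat half of the whole sample, ignoring village.
        n_treated = n_villages * half
        treated = [True] * n_treated + [False] * (len(outcomes) - n_treated)
        rng.shuffle(treated)
    treat = [y + effect for y, d in zip(outcomes, treated) if d]
    control = [y for y, d in zip(outcomes, treated) if not d]
    return statistics.mean(treat) - statistics.mean(control)
def spread(stratified, reps=500, seed=0):
    """Standard deviation of the estimate across repeated simulated experiments."""
    rng = random.Random(seed)
    return statistics.stdev([one_estimate(rng, stratified) for _ in range(reps)])
print("unstratified:", round(spread(False), 3))
print("stratified:  ", round(spread(True), 3))
```
Under these assumptions the stratified estimate should come out noticeably less noisy, because the village component cancels when treatment and control are balanced within each village.
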
01:40:19.400 --> 01:40:23.680
And the final thing we want to
mention is the hypothesis

01:40:23.680 --> 01:40:30.080
being tested, which is, the more
things you want to test,

01:40:30.080 --> 01:40:32.930
the bigger your sample is
going to need to be.

01:40:32.930 --> 01:40:35.480
So for example, are we
interested in the difference

01:40:35.480 --> 01:40:37.170
between two treatments
as well as the

01:40:37.170 --> 01:40:39.280
treatment versus control?

01:40:39.280 --> 01:40:41.740
If so, we need a much bigger
sample because we not only

01:40:41.740 --> 01:40:43.620
need to be able to tell the
treatment versus the control

01:40:43.620 --> 01:40:45.490
but we also need to be able to
tell the two treatments from

01:40:45.490 --> 01:40:48.110
each other, right?

01:40:48.110 --> 01:40:50.600
So suppose you have two
different treatments.

01:40:50.600 --> 01:40:54.255
Are you interested in just the
overall effect of each of the

01:40:54.255 --> 01:40:54.760
two treatments?

01:40:54.760 --> 01:40:55.890
Or are you interested in
whether the treatments

01:40:55.890 --> 01:40:58.900
interact and produce a different
effect if they happen

01:40:58.900 --> 01:41:00.400
together, right?

01:41:00.400 --> 01:41:03.120
The more things you're
interested in, the bigger your

01:41:03.120 --> 01:41:05.100
sample needs to be because you
need to design your sample to

01:41:05.100 --> 01:41:08.450
be big enough to answer each of
these different questions.

01:41:08.450 --> 01:41:10.740
Another thing: for example, are you
interested in testing

01:41:10.740 --> 01:41:12.270
whether the effect is different
in different

01:41:12.270 --> 01:41:13.330
subpopulations?

01:41:13.330 --> 01:41:15.440
Do you just want to know the
average effect of your program

01:41:15.440 --> 01:41:17.380
or do you want to know if it was
different in rural areas

01:41:17.380 --> 01:41:19.170
versus urban areas?

01:41:19.170 --> 01:41:20.860
If you want to know if it's
different in rural versus

01:41:20.860 --> 01:41:22.500
urban areas, you're going to
need a big enough sample in

01:41:22.500 --> 01:41:24.846
rural areas and a big enough
sample in urban areas that you

01:41:24.846 --> 01:41:27.270
can compare the difference
between them.

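NOTE
A back-of-the-envelope sketch of why subgroup comparisons are costly. It assumes two equal-sized subgroups, a 50/50 treatment split, and a common outcome variance sigma2; none of these come from the lecture, and the function names are hypothetical.
```python
def var_overall_effect(sigma2, n):
    """Variance of a treatment-control difference in means, n people split 50/50."""
    return sigma2 / (n / 2) + sigma2 / (n / 2)
def var_subgroup_gap(sigma2, n):
    """Variance of (effect in group A) - (effect in group B), n / 2 people per group."""
    per_group = var_overall_effect(sigma2, n / 2)
    return per_group + per_group
# How much noisier is the rural-urban gap than the overall effect?
ratio = var_subgroup_gap(1.0, 1000) / var_overall_effect(1.0, 1000)
print(ratio)
```
Under these assumptions the gap between the two subgroup effects has about four times the variance of the overall effect, so detecting a gap of a given size takes roughly four times the sample needed to detect an overall effect of that size.
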
01:41:27.270 --> 01:41:34.550
So the more different things
that you want to test,

01:41:34.550 --> 01:41:36.840
obviously, the bigger
your experiment's

01:41:36.840 --> 01:41:37.760
going to need to be.

01:41:37.760 --> 01:41:39.400
And a lot of times, in actually
designing the

01:41:39.400 --> 01:41:41.660
experiment, this is something
that comes up all the time,

01:41:41.660 --> 01:41:44.250
that you will very quickly
figure out that the number of

01:41:44.250 --> 01:41:46.800
questions you would like to
answer is far bigger than the

01:41:46.800 --> 01:41:49.270
sample size you can afford.

01:41:49.270 --> 01:41:52.110
And one of the really important
conversations you

01:41:52.110 --> 01:41:53.160
need to have as you're
starting to design an

01:41:53.160 --> 01:41:57.490
experiment is, which are the
really critical questions that

01:41:57.490 --> 01:41:59.690
I really need to know
the answer to?

01:41:59.690 --> 01:42:02.210
So for example, in a project
I was recently doing in

01:42:02.210 --> 01:42:05.020
Indonesia, it turned out that
the government really wanted

01:42:05.020 --> 01:42:06.870
to know whether this program
would work differently in

01:42:06.870 --> 01:42:10.200
urban versus rural areas because
they had a view that

01:42:10.200 --> 01:42:11.210
urban areas are really
different.

01:42:11.210 --> 01:42:13.320
And they were willing to do
different programs in urban

01:42:13.320 --> 01:42:14.220
versus rural areas.

01:42:14.220 --> 01:42:16.810
So we designed our whole sample
to make sure we had

01:42:16.810 --> 01:42:21.110
enough sample in urban areas
and in rural areas that we

01:42:21.110 --> 01:42:22.550
could tell those two
things apart.

01:42:22.550 --> 01:42:24.650
That almost doubled the size
of the experiment, but the

01:42:24.650 --> 01:42:26.580
government thought that was
important enough that they

01:42:26.580 --> 01:42:29.410
really wanted to do that.

01:42:29.410 --> 01:42:32.200
The point here is that--

01:42:32.200 --> 01:42:33.930
that was the one they
wanted to focus on.

01:42:33.930 --> 01:42:34.690
There were a million
other things we

01:42:34.690 --> 01:42:35.780
could have done instead.

01:42:35.780 --> 01:42:39.670
And so it's really important
to think about, before you

01:42:39.670 --> 01:42:42.430
design the experiment, what the
few key things you want to

01:42:42.430 --> 01:42:45.530
test are because, as I said,
you're never going to have

01:42:45.530 --> 01:42:48.035
enough money to test all
the things you want.

01:42:48.035 --> 01:42:51.390
That's sort of a universal
truth.

01:42:51.390 --> 01:42:53.360
So just to conclude, we've talked
about in this lecture--

01:42:59.270 --> 01:43:02.050
going back to the basic
statistics of how you're going

01:43:02.050 --> 01:43:06.030
to analyze the experiment,
thinking about how noisy your

01:43:06.030 --> 01:43:08.486
outcome is going to be and how
you're going to compute your

01:43:08.486 --> 01:43:10.230
confidence intervals,
how big your effect

01:43:10.230 --> 01:43:11.900
size is going to be.

01:43:11.900 --> 01:43:15.140
That's all part of what goes into doing
a power calculation.

01:43:15.140 --> 01:43:16.870
You also need to do some
guess work, right?

01:43:16.870 --> 01:43:19.330
The power calculation is going
to require you to estimate how

01:43:19.330 --> 01:43:21.680
big your sample is
going to be--

01:43:21.680 --> 01:43:25.130
sorry, how much variance there's
going to be, what your

01:43:25.130 --> 01:43:25.980
effect size is going to be.

01:43:25.980 --> 01:43:28.410
You have to make some
assumptions.

01:43:28.410 --> 01:43:31.400
And a little bit of pilot
testing before the experiment

01:43:31.400 --> 01:43:33.680
begins can be really useful,
I think, mostly

01:43:33.680 --> 01:43:35.730
for thinking about--

01:43:35.730 --> 01:43:37.380
just collecting some data
can be useful to

01:43:37.380 --> 01:43:38.630
estimate these variances.

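NOTE
A minimal sketch of the power calculation summarized here, assuming a two-sided test of a difference in means, equal-sized arms, individual-level randomization, and a normal approximation. The defaults (5 percent significance, 80 percent power) and the function name are assumptions; the standard deviation would come from pilot or existing data.
```python
import math
from statistics import NormalDist
def sample_size_per_arm(effect, sd, alpha=0.05, power=0.80):
    """People needed in each arm to detect `effect` given outcome standard deviation `sd`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # extra margin for the desired power
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)
# Guessing an effect of 0.2 standard deviations:
print(sample_size_per_arm(effect=0.2, sd=1.0))
# Halving the guessed effect roughly quadruples the required sample:
print(sample_size_per_arm(effect=0.1, sd=1.0))
```
Because the answer moves so much with the guessed effect size and variance, it is worth rerunning the calculation over a range of plausible values.
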
01:43:41.040 --> 01:43:43.350
The power calculations can
help you think about this

01:43:43.350 --> 01:43:45.380
question of how many treatments
you can afford to

01:43:45.380 --> 01:43:50.720
have, and can I afford to do
three different versions of

01:43:50.720 --> 01:43:54.210
the program or do I really need
to just pick one or two?

01:43:54.210 --> 01:43:56.880
How do I make this tradeoff of
more clusters versus more

01:43:56.880 --> 01:43:57.830
observations per cluster?

01:43:57.830 --> 01:44:00.390
The power calculation can
be very helpful here.

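NOTE
A sketch of the clusters-versus-observations tradeoff, assuming the standard design-effect formula 1 + (m - 1) * icc, where m is people per cluster and icc is the intra-cluster correlation. The icc of 0.1 and the function names are illustrative assumptions.
```python
def design_effect(cluster_size, icc):
    """Factor by which clustering inflates the variance of the estimate."""
    return 1 + (cluster_size - 1) * icc
def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)
# The same 2,000 people arranged two ways:
print(round(effective_sample_size(100, 20, 0.1)))  # more clusters, fewer people in each
print(round(effective_sample_size(50, 40, 0.1)))   # fewer clusters, more people in each
```
With any positive icc, spreading a fixed number of people over more clusters gives a larger effective sample, which is the tradeoff to weigh against the cost of adding clusters.
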
01:44:00.390 --> 01:44:01.910
And the other thing, and in some
sense, the place I find

01:44:01.910 --> 01:44:04.050
power calculations the most
useful, especially because

01:44:04.050 --> 01:44:06.430
there is a bit of guesswork in
power calculations and you get

01:44:06.430 --> 01:44:07.750
rough rules of thumb.

01:44:07.750 --> 01:44:09.450
You don't get precise answers
because it depends on the

01:44:09.450 --> 01:44:09.960
assumptions.

01:44:09.960 --> 01:44:12.580
But what I find this really
useful for is judging whether this is

01:44:12.580 --> 01:44:14.090
feasible or not, right?

01:44:14.090 --> 01:44:17.790
Is this something where I'm
kind of in the right range

01:44:17.790 --> 01:44:20.630
where I think I can get
estimates, or where there's no

01:44:20.630 --> 01:44:23.000
chance, no matter how successful
this program is,

01:44:23.000 --> 01:44:25.900
that I'm going to be able to
pick it up in my data because

01:44:25.900 --> 01:44:28.730
the variable is just
way too noisy.

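NOTE
A sketch of the feasibility check described here: given the sample you can afford, what is the smallest effect you could detect? It assumes the same normal approximation for a two-arm comparison of means with equal arms; the function name and defaults are assumptions.
```python
import math
from statistics import NormalDist
def minimum_detectable_effect(n_per_arm, sd, alpha=0.05, power=0.80):
    """Smallest true effect detectable with the given power, in the units of `sd`."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sd * math.sqrt(2 / n_per_arm)
# A noisy outcome (sd = 1) with only 50 people per arm:
print(round(minimum_detectable_effect(50, 1.0), 2))
```
If no plausible version of the program could move the outcome by that much, the study is underpowered for that variable no matter how successful the program is.
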
01:44:28.730 --> 01:44:31.210
And it's really important that
you do the power calculation,

01:44:31.210 --> 01:44:32.750
particularly--

01:44:32.750 --> 01:44:35.200
both for structuring how to
design the experiment, but

01:44:35.200 --> 01:44:37.450
particularly to make sure you're
not going to waste a

01:44:37.450 --> 01:44:38.740
lot of time and money doing
something where you're going

01:44:38.740 --> 01:44:40.820
to have no hope of
picking it up.

01:44:44.220 --> 01:44:46.720
Because a study which is
underpowered is going to waste

01:44:46.720 --> 01:44:49.210
a lot of everyone's time and
it's very frustrating.

01:44:49.210 --> 01:44:50.970
Very frustrating for
everyone involved.

01:44:50.970 --> 01:44:53.440
So you want to make sure you do
this right before you start

01:44:53.440 --> 01:44:54.940
because otherwise, you're going
to end up spending a lot

01:44:54.940 --> 01:44:58.150
of time, money, and effort on
an experiment and

01:44:58.150 --> 01:45:01.060
not being able to conclude
much of anything.

01:45:01.060 --> 01:45:01.735
OK.

01:45:01.735 --> 01:45:02.985
Thanks very much.