WEBVTT

00:00:15.700 --> 00:00:19.300
DAVID SONTAG: So today's lecture
is going to be about causality.

00:00:22.102 --> 00:00:23.560
Who's heard about
causality before?

00:00:23.560 --> 00:00:24.227
Raise your hand.

00:00:27.130 --> 00:00:30.970
What's the number one thing
that you hear about when

00:00:30.970 --> 00:00:33.210
thinking about causality?

00:00:33.210 --> 00:00:33.710
Yeah?

00:00:33.710 --> 00:00:35.970
AUDIENCE: Correlation
does not imply causation.

00:00:35.970 --> 00:00:39.465
DAVID SONTAG: Correlation
does not imply causation.

00:00:39.465 --> 00:00:40.590
Anything else come to mind?

00:00:40.590 --> 00:00:41.490
That's what came to my mind.

00:00:41.490 --> 00:00:42.615
Anything else come to mind?

00:00:46.560 --> 00:00:48.660
So up until now in
the semester, we've

00:00:48.660 --> 00:00:50.760
been talking about purely
predictive questions.

00:00:50.760 --> 00:00:52.950
And for purely
predictive questions,

00:00:52.950 --> 00:00:56.160
one could argue that
correlation is good enough.

00:00:56.160 --> 00:00:57.780
If we have some
signals in our data

00:00:57.780 --> 00:01:00.355
that are predictive of
some outcome of interest,

00:01:00.355 --> 00:01:02.230
we want to be able to
take advantage of that.

00:01:02.230 --> 00:01:04.739
Whether it's
upstream, downstream,

00:01:04.739 --> 00:01:10.020
the causal directionality is
irrelevant for that purpose.

00:01:10.020 --> 00:01:12.030
Although even that
isn't quite true, right,

00:01:12.030 --> 00:01:16.050
because Pete and I have been
hinting throughout the semester

00:01:16.050 --> 00:01:19.260
that there are times
when the data changes

00:01:19.260 --> 00:01:23.370
on you, for example, when you go
from one institution to another

00:01:23.370 --> 00:01:26.100
or when you have non-stationarity.

00:01:26.100 --> 00:01:30.210
And in those situations,
having a deeper understanding

00:01:30.210 --> 00:01:32.220
about the data might
allow one to build

00:01:32.220 --> 00:01:35.790
in additional robustness to
that type of data set shift.

00:01:35.790 --> 00:01:37.598
But there are other
reasons as well why

00:01:37.598 --> 00:01:40.140
understanding something about
your underlying data generating

00:01:40.140 --> 00:01:41.795
processes can be
really important.

00:01:41.795 --> 00:01:43.170
It's because often,
the questions

00:01:43.170 --> 00:01:45.990
that we want to answer when
it comes to health care

00:01:45.990 --> 00:01:48.542
are not predictive questions,
they're causal questions.

00:01:48.542 --> 00:01:51.000
And so what I'll do now is I'll
walk through a few examples

00:01:51.000 --> 00:01:53.290
of what I mean by this.

00:01:53.290 --> 00:01:57.100
Let's start out with what we saw
in Lecture 4 and in Problem Set

00:01:57.100 --> 00:02:00.230
2, where we looked
at the question

00:02:00.230 --> 00:02:02.480
of how we can do early
detection of type 2 diabetes.

00:02:05.630 --> 00:02:08.300
You used Truven
MarketScan's data

00:02:08.300 --> 00:02:12.190
set to build a
risk stratification

00:02:12.190 --> 00:02:14.855
algorithm for
detecting who is going

00:02:14.855 --> 00:02:16.480
to be newly diagnosed
with diabetes one

00:02:16.480 --> 00:02:17.930
to three years from now.

00:02:17.930 --> 00:02:19.638
And if you think about
how one might then

00:02:19.638 --> 00:02:21.700
try to deploy that
algorithm, you

00:02:21.700 --> 00:02:25.090
might, for example, try to
get patients into the clinic

00:02:25.090 --> 00:02:28.200
to get them diagnosed.

00:02:28.200 --> 00:02:30.090
But the next set of
questions are usually

00:02:30.090 --> 00:02:31.490
about the "so what?" question.

00:02:31.490 --> 00:02:35.040
What are you going to do
based on that prediction?

00:02:35.040 --> 00:02:37.650
Once diagnosed, how
will you intervene?

00:02:37.650 --> 00:02:39.660
And at the end of the
day, the interesting goal

00:02:39.660 --> 00:02:41.412
is not one of how do
you find them early,

00:02:41.412 --> 00:02:43.620
but how do you prevent them
from developing diabetes?

00:02:43.620 --> 00:02:45.990
Or how do you prevent the
patient from developing

00:02:45.990 --> 00:02:47.130
complications of diabetes?

00:02:49.710 --> 00:02:54.550
And those are questions
about causality.

00:02:54.550 --> 00:02:56.500
Now, when we built
a predictive model

00:02:56.500 --> 00:02:57.983
and we introspected
the weights,

00:02:57.983 --> 00:02:59.900
we might have noticed
some interesting things.

00:02:59.900 --> 00:03:04.180
For example, if you looked at
the highest negative weights,

00:03:04.180 --> 00:03:07.430
which I'm not sure if we did
as part of the assignment

00:03:07.430 --> 00:03:10.050
but is something that I did
as part of my research study,

00:03:10.050 --> 00:03:12.100
you see that gastric
bypass surgery has

00:03:12.100 --> 00:03:16.330
the biggest negative weight.

00:03:16.330 --> 00:03:21.010
Does that mean that if you give
an obese person gastric bypass

00:03:21.010 --> 00:03:24.520
surgery, that will prevent
them from developing type 2

00:03:24.520 --> 00:03:26.080
diabetes?

00:03:26.080 --> 00:03:28.330
That's an example of a causal
question which is raised

00:03:28.330 --> 00:03:30.105
by this predictive model.

00:03:30.105 --> 00:03:31.480
But just by looking
at the weights

00:03:31.480 --> 00:03:34.810
alone, as I'll
show you this week,

00:03:34.810 --> 00:03:38.060
you won't be able to
correctly infer that there

00:03:38.060 --> 00:03:39.720
is a causal relationship.

00:03:39.720 --> 00:03:41.620
And so part of what
we will be doing

00:03:41.620 --> 00:03:45.070
is coming up with a mathematical
language for thinking

00:03:45.070 --> 00:03:47.020
about how does one
answer, is there

00:03:47.020 --> 00:03:49.350
a causal relationship here?

00:03:49.350 --> 00:03:51.750
Here's a second example.

00:03:51.750 --> 00:03:54.030
Right before spring break
we had a series of lectures

00:03:54.030 --> 00:03:57.120
about diagnosis,
particularly diagnosis

00:03:57.120 --> 00:04:00.570
from imaging data of
a variety of kinds,

00:04:00.570 --> 00:04:03.240
whether it be
radiology or pathology.

00:04:03.240 --> 00:04:05.400
And often, questions
are of this sort.

00:04:05.400 --> 00:04:07.530
Here is a woman's breast.

00:04:07.530 --> 00:04:09.210
She has breast cancer.

00:04:09.210 --> 00:04:12.490
Maybe you have an associated
pathology slide as well.

00:04:12.490 --> 00:04:16.800
And you want to know what is
the risk of this person dying

00:04:16.800 --> 00:04:19.500
in the next five years.

00:04:19.500 --> 00:04:23.340
So one can take a
deep learning model,

00:04:23.340 --> 00:04:26.700
learn to predict
what one observes.

00:04:26.700 --> 00:04:28.950
So for a patient in your
data set, you have the input

00:04:28.950 --> 00:04:30.510
and you have, let's
say, survival time.

00:04:30.510 --> 00:04:32.302
And you might use that
to predict something

00:04:32.302 --> 00:04:38.510
about how long it takes
from diagnosis to death.

00:04:38.510 --> 00:04:41.310
And based on those predictions,
you might take actions.

00:04:41.310 --> 00:04:48.290
For example, if you predict
that a patient is not risky,

00:04:48.290 --> 00:04:51.130
then you might
conclude that they

00:04:51.130 --> 00:04:54.010
don't need to get treatment.

00:04:54.010 --> 00:04:56.950
But that could be
really, really dangerous,

00:04:56.950 --> 00:05:01.450
and I'll just give
you one example

00:05:01.450 --> 00:05:03.580
of why that could be dangerous.

00:05:06.210 --> 00:05:08.210
These predictive models,
if you're learning them

00:05:08.210 --> 00:05:12.040
in this way, the outcome,
in this case let's

00:05:12.040 --> 00:05:14.630
say time to death, is
going to be affected

00:05:14.630 --> 00:05:17.000
by what's happened in between.

00:05:17.000 --> 00:05:19.880
So, for example,
this patient might

00:05:19.880 --> 00:05:22.700
have been receiving
treatment, and because

00:05:22.700 --> 00:05:26.510
of them receiving treatment in
between the time from diagnosis

00:05:26.510 --> 00:05:30.202
to death, it might have
prolonged their life.

00:05:30.202 --> 00:05:31.910
And so for this patient
in your data set,

00:05:31.910 --> 00:05:35.612
you might have observed that
they lived a very long time.

00:05:35.612 --> 00:05:37.320
But if you ignore what
happens in between

00:05:37.320 --> 00:05:41.850
and you simply learn to predict
Y from X, X being the input,

00:05:41.850 --> 00:05:43.850
then a new patient comes
along and you predict

00:05:43.850 --> 00:05:46.560
that new patient is going
to survive a long time,

00:05:46.560 --> 00:05:48.540
and it would be completely
the wrong conclusion

00:05:48.540 --> 00:05:50.857
to say that you don't need
to treat that patient.

00:05:50.857 --> 00:05:53.190
Because, in fact, the only
reason the patients like them

00:05:53.190 --> 00:05:54.773
in the training data
lived a long time

00:05:54.773 --> 00:05:57.400
is because they were treated.

00:05:57.400 --> 00:06:00.460
And so when it comes to this
field of machine learning

00:06:00.460 --> 00:06:03.850
and health care, we need
to think really carefully

00:06:03.850 --> 00:06:07.120
about these types of questions
because an error in the way

00:06:07.120 --> 00:06:09.080
that we formalize our
problem could kill people

00:06:09.080 --> 00:06:10.330
because of mistakes like this.

00:06:13.920 --> 00:06:16.350
Now, other questions
are ones about not how

00:06:16.350 --> 00:06:19.770
do we predict
outcomes but how do we

00:06:19.770 --> 00:06:23.360
guide treatment decisions.

00:06:23.360 --> 00:06:26.300
So, for example, as
data from pathology

00:06:26.300 --> 00:06:28.370
gets richer and
richer and richer,

00:06:28.370 --> 00:06:30.650
we might think that we
can now use computers

00:06:30.650 --> 00:06:33.860
to try to better predict
who is likely to benefit

00:06:33.860 --> 00:06:36.200
from a treatment than
humans could do alone.

00:06:38.790 --> 00:06:40.470
But the challenge
with using algorithms

00:06:40.470 --> 00:06:42.900
to do that is that people
respond differently

00:06:42.900 --> 00:06:45.840
to treatment, and the
data which is being

00:06:45.840 --> 00:06:51.450
used to guide treatment is
biased based on existing

00:06:51.450 --> 00:06:52.740
treatment guidelines.

00:06:55.460 --> 00:06:58.820
So, similarly to the previous
question, we could ask,

00:06:58.820 --> 00:07:01.890
what would happen if we trained
to predict past treatment

00:07:01.890 --> 00:07:02.390
decisions?

00:07:02.390 --> 00:07:04.015
This would be the
most naive way to try

00:07:04.015 --> 00:07:06.590
to use data to guide
treatment decisions.

00:07:06.590 --> 00:07:08.930
So maybe you see David
gets treatment A,

00:07:08.930 --> 00:07:11.450
John gets treatment B,
Juana gets treatment A.

00:07:11.450 --> 00:07:14.660
And you might ask then,
OK, a new patient comes in,

00:07:14.660 --> 00:07:17.473
what should this new
patient be treated with?

00:07:17.473 --> 00:07:18.890
And if you've just
learned a model

00:07:18.890 --> 00:07:21.470
to predict from what you
know about the treatment

00:07:21.470 --> 00:07:23.990
that David is likely
to get, then the best

00:07:23.990 --> 00:07:26.840
that you could hope
to do is to do as well

00:07:26.840 --> 00:07:29.850
as existing clinical practice.

00:07:29.850 --> 00:07:33.230
So if we want to go beyond
current clinical practice,

00:07:33.230 --> 00:07:35.780
for example, to recognize
that there is heterogeneity

00:07:35.780 --> 00:07:39.440
in treatment response, then
we have to somehow change

00:07:39.440 --> 00:07:44.090
the question that we're asking.

00:07:44.090 --> 00:07:46.070
I'll give you one
last example, which

00:07:46.070 --> 00:07:50.600
is perhaps a more traditional
question of, does X cause Y?

00:07:50.600 --> 00:07:53.240
For example, does
smoking cause lung cancer

00:07:53.240 --> 00:07:59.080
is a major question of
societal importance.

00:07:59.080 --> 00:08:02.950
Now, you might be familiar
with the traditional way

00:08:02.950 --> 00:08:05.170
of trying to answer questions
of this nature, which

00:08:05.170 --> 00:08:07.520
would be to do a randomized
controlled trial.

00:08:07.520 --> 00:08:09.372
Except this isn't
exactly the type

00:08:09.372 --> 00:08:11.830
of setting where you could do
randomized controlled trials.

00:08:11.830 --> 00:08:17.170
How would you feel if you were
a smoker and someone came up

00:08:17.170 --> 00:08:19.998
to you and said, you have
to stop smoking because I

00:08:19.998 --> 00:08:21.040
need to see what happens?

00:08:21.040 --> 00:08:23.230
Or how would you feel
if you were a non-smoker

00:08:23.230 --> 00:08:24.730
and someone came
up to you and said,

00:08:24.730 --> 00:08:27.880
you have to start smoking?

00:08:27.880 --> 00:08:31.930
That would be both not feasible
and completely unethical.

00:08:31.930 --> 00:08:33.909
And so if we want to
try to answer questions

00:08:33.909 --> 00:08:35.590
like this from data,
we need to start

00:08:35.590 --> 00:08:39.850
thinking about
how can we design,

00:08:39.850 --> 00:08:43.580
using observational
data, ways of answering

00:08:43.580 --> 00:08:45.230
questions like this.

00:08:45.230 --> 00:08:46.610
And the challenge
is that there's

00:08:46.610 --> 00:08:50.570
going to be bias in the data
because of who decides to smoke

00:08:50.570 --> 00:08:52.370
and who decides not to smoke.

00:08:52.370 --> 00:08:53.840
So, for example,
the most naive way

00:08:53.840 --> 00:08:55.310
you might try to
answer this question

00:08:55.310 --> 00:08:57.227
would be to look at the
conditional likelihood

00:08:57.227 --> 00:08:59.750
of getting lung
cancer among smokers

00:08:59.750 --> 00:09:01.790
and getting lung cancer
among non-smokers.

00:09:04.570 --> 00:09:07.060
But those numbers, as you'll
see in the next few slides,

00:09:07.060 --> 00:09:09.340
can be very misleading
because there

00:09:09.340 --> 00:09:12.070
might be confounding
factors, factors

00:09:12.070 --> 00:09:21.250
that would, for example, both
cause people to be a smoker

00:09:21.250 --> 00:09:25.470
and cause them to
develop lung cancer,

00:09:25.470 --> 00:09:28.980
which would create a difference
between these two numbers.

00:09:28.980 --> 00:09:30.660
And we'll have a
very concrete example

00:09:30.660 --> 00:09:32.920
of this in just a few minutes.
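
NOTE
A minimal numerical sketch of that confounding, with entirely invented
numbers: age drives both who smokes and the baseline cancer risk, so the
naive conditional comparison overstates the (true) 0.05 effect, while
stratifying on the confounder roughly recovers it.
import numpy as np
rng = np.random.default_rng(0)
n = 100_000
old = rng.random(n) < 0.5                         # confounder
smoker = rng.random(n) < np.where(old, 0.7, 0.2)  # older people smoke more here
cancer = rng.random(n) < 0.02 + 0.10 * old + 0.05 * smoker
naive = cancer[smoker].mean() - cancer[~smoker].mean()
strata = [cancer[smoker & (old == v)].mean()
          - cancer[~smoker & (old == v)].mean() for v in (True, False)]
print(naive, np.mean(strata))  # naive is inflated; stratified is close to 0.05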

00:09:32.920 --> 00:09:35.010
So to properly answer
all of these questions,

00:09:35.010 --> 00:09:37.470
one needs to be thinking
in terms of causal graphs.

00:09:37.470 --> 00:09:40.410
So rather than the
traditional setup in machine

00:09:40.410 --> 00:09:52.020
learning where you just
have inputs and outputs,

00:09:52.020 --> 00:09:53.930
now we need to have triplets.

00:09:53.930 --> 00:09:55.610
Rather than having
inputs and outputs,

00:09:55.610 --> 00:10:06.350
we need to be thinking
of inputs, interventions,

00:10:06.350 --> 00:10:09.200
and outcomes or outputs.

00:10:09.200 --> 00:10:13.530
So we now need to have
three quantities in mind.

00:10:13.530 --> 00:10:15.280
And we have to start
thinking about, well,

00:10:15.280 --> 00:10:18.740
what is the causal relationship
between these three?

00:10:18.740 --> 00:10:22.832
So for those of you who have
taken more graduate level

00:10:22.832 --> 00:10:24.290
machine learning
classes, you might

00:10:24.290 --> 00:10:28.350
be familiar with ideas
such as Bayesian networks.

00:10:28.350 --> 00:10:31.890
And when I went to
undergrad and grad school

00:10:31.890 --> 00:10:34.340
and I studied machine
learning, for the longest time

00:10:34.340 --> 00:10:36.590
I thought causal
inference had to do

00:10:36.590 --> 00:10:38.900
with learning causal graphs.

00:10:38.900 --> 00:10:41.720
So this is what I thought
causal inference was about.

00:10:41.720 --> 00:10:43.845
You have data of the
following nature--

00:10:46.560 --> 00:10:49.170
1, 0, 0, 1, dot, dot, dot.

00:10:51.950 --> 00:10:53.930
So here, there are
four random variables.

00:10:53.930 --> 00:10:56.690
I'm showing the realizations
of those four binary variables

00:10:56.690 --> 00:10:59.870
one per row, and you have
a data set like this.

00:10:59.870 --> 00:11:01.640
And I thought
causal inference had

00:11:01.640 --> 00:11:04.633
to do with taking data like
this and trying to figure out,

00:11:04.633 --> 00:11:06.050
is the underlying
Bayesian network

00:11:06.050 --> 00:11:14.490
that created that data, is it
X1 goes to X2 goes to X3 to X4?

00:11:14.490 --> 00:11:19.190
Or I'll say, this is X1,
that's X2, X3, and X4.

00:11:19.190 --> 00:11:27.687
Or maybe the causal graph
is X1, to X2, to X3, to X4.

00:11:27.687 --> 00:11:30.020
And trying to distinguish
between these different causal

00:11:30.020 --> 00:11:33.940
graphs from observational
data is one type of question

00:11:33.940 --> 00:11:36.840
that one can ask.

00:11:36.840 --> 00:11:40.020
And the one thing you learn
in traditional machine

00:11:40.020 --> 00:11:42.135
learning treatments of
this is that sometimes you

00:11:42.135 --> 00:11:44.010
can't distinguish between
these causal graphs

00:11:44.010 --> 00:11:45.210
from the data you have.

00:11:45.210 --> 00:11:49.050
For example, suppose you just
had two random variables.

00:11:49.050 --> 00:11:54.180
Because any distribution could
be represented by probability

00:11:54.180 --> 00:11:58.950
of X1 times probability
of X2 given X1,

00:11:58.950 --> 00:12:03.960
according to just the chain rule of
conditional probability,

00:12:03.960 --> 00:12:06.090
and similarly, any
distribution can be represented

00:12:06.090 --> 00:12:10.380
as the opposite, probability
of X2 times probability

00:12:10.380 --> 00:12:17.735
of X1 given X2, which would
look like this. The statement

00:12:17.735 --> 00:12:19.360
that one would make
is that if you just

00:12:19.360 --> 00:12:22.210
had data involving X1 and
X2, you couldn't distinguish

00:12:22.210 --> 00:12:25.930
between these two causal graphs,
X1 causes X2 or X2 causes X1.
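
NOTE
A quick check of that claim (a sketch, not from the lecture): any joint
over (X1, X2) factorizes both as p(x1) p(x2|x1) and as p(x2) p(x1|x2),
so observational likelihood alone cannot orient the edge.
import numpy as np
p = np.array([[0.1, 0.3],    # joint p(x1, x2); rows index x1,
              [0.4, 0.2]])   # columns index x2
p_x1 = p.sum(axis=1)
p_x2 = p.sum(axis=0)
assert np.allclose(p_x1[:, None] * (p / p_x1[:, None]), p)  # X1 -> X2 story
assert np.allclose(p_x2[None, :] * (p / p_x2[None, :]), p)  # X2 -> X1 story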

00:12:28.650 --> 00:12:31.320
And usually another
treatment would say, OK,

00:12:31.320 --> 00:12:33.810
but if you have a third variable
and you have a V structure

00:12:33.810 --> 00:12:40.590
or something like X1 goes
to X3 and X2 goes to X3,

00:12:40.590 --> 00:12:44.414
this you could distinguish from,
let's say, a chain structure.

00:12:47.380 --> 00:12:49.470
And then the final
answer to what

00:12:49.470 --> 00:12:51.180
is causal inference
from this philosophy

00:12:51.180 --> 00:12:54.750
would be something like, OK, if
you're in a setting like this

00:12:54.750 --> 00:12:57.150
and you can't distinguish
between X1 causes X2 or X2

00:12:57.150 --> 00:12:59.970
causes X1, then you
do some interventions,

00:12:59.970 --> 00:13:04.440
like you intervene on X1 and you
look to see what happens to X2,

00:13:04.440 --> 00:13:07.290
and that'll help you disentangle
these directions of causality.

00:13:09.612 --> 00:13:12.070
None of this is what we're
going to be talking about today.

00:13:14.950 --> 00:13:18.100
Today, we're going to be
talking about the simplest,

00:13:18.100 --> 00:13:20.500
simplest possible setting
you could imagine,

00:13:20.500 --> 00:13:22.030
that graph shown up there.

00:13:25.470 --> 00:13:29.130
You have three sets of
random variables, X,

00:13:29.130 --> 00:13:30.810
which is perhaps
a vector, so it's

00:13:30.810 --> 00:13:33.570
high dimensional, a
single random variable

00:13:33.570 --> 00:13:37.170
T, and a single
random variable Y.

00:13:37.170 --> 00:13:40.500
And we know the
causal graph here.

00:13:40.500 --> 00:13:42.090
We're going to
suppose that we know

00:13:42.090 --> 00:13:49.650
the directionality, that we
know that X might cause T

00:13:49.650 --> 00:13:54.660
and X and T might cause Y. And
the only thing we don't know

00:13:54.660 --> 00:13:58.180
is the strength of the edges.
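
NOTE
A minimal synthetic instance of this graph (all functional forms invented
for illustration): X influences both the treatment T and the outcome Y,
so the naive difference in means is biased away from the true effect.
import numpy as np
rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=n)                      # covariates (1-D here)
T = (rng.random(n) < 1 / (1 + np.exp(-2 * X))).astype(int)  # X -> T
Y = 0.5 * X + 1.0 * T + rng.normal(scale=0.1, size=n)       # X, T -> Y
print(Y[T == 1].mean() - Y[T == 0].mean())  # noticeably above the true 1.0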

00:13:58.180 --> 00:13:58.680
All right.

00:13:58.680 --> 00:14:00.730
And so now let's try to
think through this in context

00:14:00.730 --> 00:14:01.772
of the previous examples.

00:14:01.772 --> 00:14:02.780
Yeah, question?

00:14:02.780 --> 00:14:05.697
AUDIENCE: Just to make sure-- so
T does not affect X in any way?

00:14:05.697 --> 00:14:07.530
DAVID SONTAG: Correct,
that's the assumption

00:14:07.530 --> 00:14:10.150
we're going to make here.

00:14:10.150 --> 00:14:12.480
So let's try to
instantiate this.

00:14:12.480 --> 00:14:13.855
So we'll start
with this example.

00:14:17.300 --> 00:14:23.340
X might be what you know about
the patient at diagnosis.

00:14:23.340 --> 00:14:27.940
T, I'm going to assume for
the purposes of today's class,

00:14:27.940 --> 00:14:32.468
is a decision between two
different treatment plans.

00:14:32.468 --> 00:14:34.510
And I'm going to simplify
the state of the world.

00:14:34.510 --> 00:14:36.940
I'm going to say those
treatment plans only

00:14:36.940 --> 00:14:42.070
depend on what you know about
the patient at diagnosis.

00:14:42.070 --> 00:14:44.290
So at diagnosis, you
decide, I'm going

00:14:44.290 --> 00:14:46.780
to be giving them this
sequence of treatments

00:14:46.780 --> 00:14:49.750
at this three-month interval
or this other sequence

00:14:49.750 --> 00:14:52.000
of treatments at, maybe,
a four-month interval.

00:14:52.000 --> 00:14:53.680
And you make that decision
just based on diagnosis

00:14:53.680 --> 00:14:55.930
and you don't change it based
on anything you observe.

00:14:58.060 --> 00:15:03.140
Then the causal graph
of relevance there is,

00:15:03.140 --> 00:15:06.080
based on what you know about
the patient at diagnosis,

00:15:06.080 --> 00:15:07.910
which I'm going to
say X is a vector

00:15:07.910 --> 00:15:09.740
because maybe it's
based on images,

00:15:09.740 --> 00:15:11.240
your whole electronic
health record.

00:15:11.240 --> 00:15:14.180
There's a ton of data you have
on the patient at diagnosis.

00:15:14.180 --> 00:15:17.300
Based on that, you
make some decision

00:15:17.300 --> 00:15:19.310
about a treatment plan.

00:15:19.310 --> 00:15:23.390
I'm going to call that
T. T could be binary,

00:15:23.390 --> 00:15:27.290
a choice between two treatments,
it could be continuous,

00:15:27.290 --> 00:15:29.780
maybe you're deciding the
dosage of the treatment,

00:15:29.780 --> 00:15:33.380
or it could be
maybe even a vector.

00:15:33.380 --> 00:15:35.150
For today's lecture,
I'm going to suppose

00:15:35.150 --> 00:15:38.930
that T is just binary,
just involves two choices.

00:15:38.930 --> 00:15:41.120
But most of what
I'll tell you about

00:15:41.120 --> 00:15:45.440
will generalize to the setting
where T is non-binary as well.

00:15:45.440 --> 00:15:48.200
But critically,
I'm going to make

00:15:48.200 --> 00:15:49.880
the assumption for
today's lecture

00:15:49.880 --> 00:15:52.620
that you're not observing
new things in between.

00:15:52.620 --> 00:15:56.630
So, for example, in this
whole week's lecture,

00:15:56.630 --> 00:15:59.850
the following scenario
will not happen.

00:15:59.850 --> 00:16:05.520
Based on diagnosis, you make a
decision about treatment plan.

00:16:05.520 --> 00:16:08.220
Treatment plan starts,
you get new observations.

00:16:08.220 --> 00:16:09.720
Based on those new
observations, you

00:16:09.720 --> 00:16:11.428
realize that treatment
plan isn't working

00:16:11.428 --> 00:16:14.310
and change to another
treatment plan, and so on.

00:16:14.310 --> 00:16:17.440
So that scenario goes
by a different name,

00:16:17.440 --> 00:16:18.930
which is called
dynamic treatment

00:16:18.930 --> 00:16:22.710
regimes or off-policy
reinforcement learning,

00:16:22.710 --> 00:16:24.620
and that we'll learn
about next week.

00:16:24.620 --> 00:16:27.000
So for today's and
Thursday's lecture,

00:16:27.000 --> 00:16:28.830
we're going to suppose
you base on what

00:16:28.830 --> 00:16:31.650
you know about the patient at
this time, you make a decision,

00:16:31.650 --> 00:16:34.650
you execute the decision,
and you look at some outcome.

00:16:34.650 --> 00:16:38.250
So X causes T, not
the other way around.

00:16:38.250 --> 00:16:42.060
And that's pretty clear
because of our prior knowledge

00:16:42.060 --> 00:16:43.440
about this problem.

00:16:43.440 --> 00:16:46.350
It's not that the
treatment affects

00:16:46.350 --> 00:16:49.280
what their diagnosis was.

00:16:49.280 --> 00:16:52.760
And then there's the outcome
Y, and there, again, we

00:16:52.760 --> 00:16:55.190
suppose the outcome, what
happens to the patient, maybe

00:16:55.190 --> 00:16:59.810
survival time, for example, is
a function of what treatment

00:16:59.810 --> 00:17:03.740
they're getting and
aspects about that patient.

00:17:03.740 --> 00:17:05.119
So this is the causal graph.

00:17:05.119 --> 00:17:06.746
We know it.

00:17:06.746 --> 00:17:08.329
But we don't know,
does that treatment

00:17:08.329 --> 00:17:09.680
do anything to this patient?

00:17:09.680 --> 00:17:12.670
For whom does this
treatment help the most?

00:17:12.670 --> 00:17:14.420
And those are the types
of questions we're

00:17:14.420 --> 00:17:15.628
going to try to answer today.

00:17:18.185 --> 00:17:19.060
Is the setting clear?

00:17:31.390 --> 00:17:32.320
OK.

00:17:32.320 --> 00:17:36.030
Now, these questions
are not new questions.

00:17:36.030 --> 00:17:39.360
They've been studied
for decades in fields

00:17:39.360 --> 00:17:43.770
such as political science,
economics, statistics,

00:17:43.770 --> 00:17:45.967
biostatistics.

00:17:45.967 --> 00:17:48.300
And the reason why they're
studied in those other fields

00:17:48.300 --> 00:17:51.420
is because often you don't
have the ability to intervene,

00:17:51.420 --> 00:17:53.880
and one has to try to
answer these questions

00:17:53.880 --> 00:17:56.020
from observational data.

00:17:56.020 --> 00:18:01.140
For example, you might ask, what
will happen to the US economy

00:18:01.140 --> 00:18:05.880
if the Federal Reserve raises
US interest rates by 1%?

00:18:07.922 --> 00:18:10.130
When's the last time you
heard of the Federal Reserve

00:18:10.130 --> 00:18:13.100
doing a randomized
controlled trial?

00:18:13.100 --> 00:18:15.630
And even if they had done a
randomized controlled trial,

00:18:15.630 --> 00:18:18.130
for example, flipped a coin to
decide which way the interest

00:18:18.130 --> 00:18:21.377
rates would go, it wouldn't
be comparable had they

00:18:21.377 --> 00:18:23.960
done that experiment today to
if they had done that experiment

00:18:23.960 --> 00:18:26.210
two years from now because
the state of the world

00:18:26.210 --> 00:18:28.050
has changed in those years.

00:18:31.010 --> 00:18:33.620
Let's talk about
political science.

00:18:33.620 --> 00:18:38.990
I have close colleagues of mine
at NYU who look at Twitter,

00:18:38.990 --> 00:18:41.910
and they want to
ask questions like,

00:18:41.910 --> 00:18:44.210
how can we influence
elections, or how

00:18:44.210 --> 00:18:46.670
are elections influenced?

00:18:46.670 --> 00:18:54.060
So you might look at some
unnamed actors, possibly

00:18:54.060 --> 00:18:56.460
people supported by the
Russian government, who

00:18:56.460 --> 00:19:00.660
are posting to Twitter
or other social media.

00:19:00.660 --> 00:19:03.300
And you might ask the
question of, well,

00:19:03.300 --> 00:19:05.490
did that actually
influence the outcome

00:19:05.490 --> 00:19:08.378
of the previous
presidential election?

00:19:08.378 --> 00:19:09.920
Again, in that
scenario, it's one of,

00:19:09.920 --> 00:19:11.990
well, we have this
data, something

00:19:11.990 --> 00:19:15.110
happened in the
world, and we'd like

00:19:15.110 --> 00:19:17.420
to understand what was
the effect of that action,

00:19:17.420 --> 00:19:22.510
but we can't exactly go back
and replay to do something else.

00:19:22.510 --> 00:19:24.352
So these are fundamental
questions that

00:19:24.352 --> 00:19:26.560
appear all across the
sciences, and of course they're

00:19:26.560 --> 00:19:28.130
extremely relevant
in health care,

00:19:28.130 --> 00:19:30.130
and yet, we don't teach
them in our introduction

00:19:30.130 --> 00:19:32.050
to machine learning classes.

00:19:32.050 --> 00:19:34.930
We don't teach them in our
undergraduate computer science

00:19:34.930 --> 00:19:36.460
education.

00:19:36.460 --> 00:19:38.708
And I view this as a major
hole in our education,

00:19:38.708 --> 00:19:40.500
which is why we're
spending two weeks on it

00:19:40.500 --> 00:19:42.640
in this course, which
is still not enough.

00:19:46.070 --> 00:19:48.550
But what has changed
between these fields,

00:19:48.550 --> 00:19:51.480
and what is relevant
in health care?

00:19:51.480 --> 00:19:54.100
Well, the traditional way
in which these questions

00:19:54.100 --> 00:19:56.410
were asked in
statistics was one

00:19:56.410 --> 00:19:59.975
where you took a huge
amount of domain knowledge

00:19:59.975 --> 00:20:02.350
to, first of all, make sure
you're setting up the problem

00:20:02.350 --> 00:20:04.850
correctly, and that's always
going to be important.

00:20:04.850 --> 00:20:08.680
But then to think through what
are all of the factors that

00:20:08.680 --> 00:20:11.580
could influence the
treatment decisions,

00:20:11.580 --> 00:20:14.770
called the confounding factors.

00:20:14.770 --> 00:20:16.420
And the traditional
approach is one

00:20:16.420 --> 00:20:19.157
would write down 10,
20 different things,

00:20:19.157 --> 00:20:21.240
and make sure that you do
some analysis, including

00:20:21.240 --> 00:20:24.040
the analysis I'll tell you about
in today and Thursday's lecture

00:20:24.040 --> 00:20:27.500
using those 10 or 20 variables.

00:20:27.500 --> 00:20:30.120
But where this field
is going is one of now

00:20:30.120 --> 00:20:31.440
having high dimensional data.

00:20:31.440 --> 00:20:34.292
So I talked about how you
might have imaging data for X,

00:20:34.292 --> 00:20:36.750
you might have the whole entire
patient's electronic health

00:20:36.750 --> 00:20:38.540
record data.

00:20:38.540 --> 00:20:41.310
And the traditional approaches
that the statistics community

00:20:41.310 --> 00:20:46.020
used to work on no longer
work in this high dimensional

00:20:46.020 --> 00:20:46.657
setting.

00:20:46.657 --> 00:20:48.990
And so, in fact, it's actually
a really interesting area

00:20:48.990 --> 00:20:51.030
for research, one that my
lab is starting to work on

00:20:51.030 --> 00:20:53.160
and many other labs, where
we could ask, how can we

00:20:53.160 --> 00:20:56.490
bring machine learning
algorithms that are designed

00:20:56.490 --> 00:20:58.230
to work with high
dimensional data

00:20:58.230 --> 00:21:01.380
to answer these types of
causal inference questions?

00:21:01.380 --> 00:21:04.860
And in today's lecture, you'll
see one example of a reduction

00:21:04.860 --> 00:21:08.363
from causal inference
to machine learning,

00:21:08.363 --> 00:21:09.780
where we'll be
able to use machine

00:21:09.780 --> 00:21:13.110
learning to answer one of those
causal inference questions.

00:21:16.500 --> 00:21:19.640
So the first thing we need
is some language in order

00:21:19.640 --> 00:21:23.240
to formalize these notions.

00:21:23.240 --> 00:21:26.230
So I will work within what's
known as the Rubin-Neyman

00:21:26.230 --> 00:21:30.030
Causal Model, where
we talk about what

00:21:30.030 --> 00:21:31.930
are called potential outcomes.

00:21:31.930 --> 00:21:36.130
What would have happened under
this world or that world?

00:21:36.130 --> 00:21:38.850
We'll call Y 0,
and often it will

00:21:38.850 --> 00:21:42.160
be denoted as Y underscore
0, sometimes it'll

00:21:42.160 --> 00:21:47.290
be denoted as Y parentheses
0, and sometimes it'll

00:21:47.290 --> 00:21:59.990
be denoted as Y given
X comma do T equals 0.

00:21:59.990 --> 00:22:05.330
And all three of these
notations are equivalent.

00:22:05.330 --> 00:22:08.290
So Y0 corresponds
to what would

00:22:08.290 --> 00:22:10.730
have happened to this
individual if you gave them

00:22:10.730 --> 00:22:12.800
treatment 0.

00:22:12.800 --> 00:22:15.447
And Y1 is the potential
outcome of what

00:22:15.447 --> 00:22:17.780
would have happened to this
individual had you given them

00:22:17.780 --> 00:22:19.010
treatment one.

00:22:19.010 --> 00:22:23.340
So you could think about Y1
as being given the blue pill

00:22:23.340 --> 00:22:25.070
and Y0 as being
given the red pill.
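
NOTE
The three equivalent notations, written out in LaTeX (with the
do-operator acting on the treatment T):
Y_0 \;\equiv\; Y(0) \;\equiv\; \big( Y \mid X, \mathrm{do}(T = 0) \big),
\qquad
Y_1 \;\equiv\; Y(1) \;\equiv\; \big( Y \mid X, \mathrm{do}(T = 1) \big).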

00:22:28.400 --> 00:22:32.870
Now, once you can talk about
these states of the world,

00:22:32.870 --> 00:22:34.550
then one could start
to ask questions

00:22:34.550 --> 00:22:38.850
of what's better, the red
pill or the blue pill?

00:22:38.850 --> 00:22:41.000
And one can formalize
that notion mathematically

00:22:41.000 --> 00:22:43.640
in terms of what's called the
conditional average treatment

00:22:43.640 --> 00:22:45.800
effect, and this
also goes by the name

00:22:45.800 --> 00:22:48.970
of individual treatment effect.

00:22:48.970 --> 00:22:51.210
So it's going to take
as input Xi, which

00:22:51.210 --> 00:22:54.050
I'm going to denote as
the data that you had

00:22:54.050 --> 00:22:56.100
at baseline for the individual.

00:22:56.100 --> 00:23:00.600
It's the covariates, the
features for the individual.

00:23:00.600 --> 00:23:05.010
And one wants to know, well,
for this individual with what

00:23:05.010 --> 00:23:07.770
we know about them, what's the
difference between giving them

00:23:07.770 --> 00:23:11.430
treatment one or giving
them treatment zero?

00:23:11.430 --> 00:23:13.740
So mathematically, that
corresponds to a difference

00:23:13.740 --> 00:23:14.700
in expectations.

00:23:14.700 --> 00:23:20.340
It's a difference in
expectation of Y1 from Y0.

00:23:20.340 --> 00:23:22.860
Now, the reason why I'm
calling this an expectation

00:23:22.860 --> 00:23:26.340
is because I'm not going to
assume that Y1 and Y0 are

00:23:26.340 --> 00:23:31.780
deterministic
because maybe there's

00:23:31.780 --> 00:23:33.370
some bad luck component.

00:23:33.370 --> 00:23:36.820
Like, maybe a medication usually
works for this type of person,

00:23:36.820 --> 00:23:41.788
but with a flip of a coin,
sometimes it doesn't work.

00:23:41.788 --> 00:23:43.330
And so that's the
randomness that I'm

00:23:43.330 --> 00:23:47.980
referring to when I talk about
probability over Y1 given Xi.

00:23:47.980 --> 00:23:50.320
And so the CATE looks
at the difference

00:23:50.320 --> 00:23:51.910
in those two expectations.

00:23:51.910 --> 00:23:55.300
And then one can now talk about
what the average treatment

00:23:55.300 --> 00:23:59.570
effect is, which is the
difference between those two.

00:23:59.570 --> 00:24:04.480
So the average treatment effect
is now the expectation of--

00:24:04.480 --> 00:24:09.100
I'll say the expectation of
the CATE over the distribution

00:24:09.100 --> 00:24:14.980
of people, P of X.
Now, we're going

00:24:14.980 --> 00:24:17.900
to go through this in four
different ways in the next 10

00:24:17.900 --> 00:24:20.270
minutes, and then you're
going to go over it five more

00:24:20.270 --> 00:24:21.770
ways doing your
homework assignment,

00:24:21.770 --> 00:24:25.340
and you'll go over it two more
ways on Friday in recitation.

00:24:25.340 --> 00:24:27.387
So if you don't get it
just yet, stay with me,

00:24:27.387 --> 00:24:28.970
you'll get it by the
end of this week.
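
NOTE
The two definitions from this segment, gathered in one place:
\mathrm{CATE}(x_i) \;=\; \mathbb{E}\big[ Y_1 \mid x_i \big] \;-\; \mathbb{E}\big[ Y_0 \mid x_i \big],
\qquad
\mathrm{ATE} \;=\; \mathbb{E}_{x \sim p(x)}\big[ \mathrm{CATE}(x) \big] \;=\; \mathbb{E}[Y_1] - \mathbb{E}[Y_0].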

00:24:33.300 --> 00:24:38.870
Now, in the data that you
observe for an individual,

00:24:38.870 --> 00:24:41.700
all you see is what happened
under one of the interventions.

00:24:41.700 --> 00:24:45.650
So, for example, if the i'th
individual in your data set

00:24:45.650 --> 00:24:49.670
received treatment Ti equals
1, then what you observe,

00:24:49.670 --> 00:24:53.702
Yi is the potential outcome Y1.

00:24:53.702 --> 00:24:55.910
On the other hand, if the
individual in your data set

00:24:55.910 --> 00:24:58.880
received treatment
Ti equals 0, then

00:24:58.880 --> 00:25:00.770
what you observed
for that individual

00:25:00.770 --> 00:25:04.130
is the potential outcome Y0.

00:25:04.130 --> 00:25:08.750
So that's the observed
factual outcome.

00:25:08.750 --> 00:25:11.930
But one could also talk
about the counterfactual

00:25:11.930 --> 00:25:14.450
of what would have
happened to this person had

00:25:14.450 --> 00:25:17.190
the opposite treatment
been done for them.

00:25:17.190 --> 00:25:22.460
Notice that I just swapped each
Ti for 1 minus Ti, and so on.

00:25:22.460 --> 00:25:26.380
Now, the key challenge in the
field is that in your data set,

00:25:26.380 --> 00:25:29.470
you only observe the
factual outcomes.

00:25:29.470 --> 00:25:33.370
And when you want to reason
about the counterfactual,

00:25:33.370 --> 00:25:36.970
that's where you have to impute
this unobserved counterfactual

00:25:36.970 --> 00:25:38.720
outcome.

00:25:38.720 --> 00:25:40.640
And that is known as
the fundamental problem

00:25:40.640 --> 00:25:42.110
of causal inference,
that we only

00:25:42.110 --> 00:25:44.820
observe one of the two outcomes
for any individual in the data

00:25:44.820 --> 00:25:46.340
set.

00:25:46.340 --> 00:25:49.070
So let's look at a
very simple example.

00:25:49.070 --> 00:25:50.630
Here, individuals
are characterized

00:25:50.630 --> 00:25:54.400
by just one feature, their age.

00:25:54.400 --> 00:25:58.240
And these two curves
that I'm showing you

00:25:58.240 --> 00:26:00.010
are the potential
outcomes of what

00:26:00.010 --> 00:26:03.070
would happen to this
individual's blood pressure

00:26:03.070 --> 00:26:04.783
if you gave them
treatment zero, which

00:26:04.783 --> 00:26:06.700
is the blue curve, versus
treatment one, which

00:26:06.700 --> 00:26:08.990
is the red curve.

00:26:08.990 --> 00:26:09.490
All right.

00:26:09.490 --> 00:26:12.000
So let's dig in a
little bit deeper.

00:26:12.000 --> 00:26:13.830
For the blue curve,
we see people

00:26:13.830 --> 00:26:22.220
who received the control, what
I'm calling treatment zero,

00:26:22.220 --> 00:26:26.380
their blood pressure
was pretty low

00:26:26.380 --> 00:26:28.890
for individuals whose
age is low and for individuals

00:26:28.890 --> 00:26:30.930
whose age is high.

00:26:30.930 --> 00:26:35.250
But for middle age individuals,
their blood pressure

00:26:35.250 --> 00:26:40.050
on receiving treatment zero
is in the higher range.

00:26:40.050 --> 00:26:42.230
On the other hand,
for individuals

00:26:42.230 --> 00:26:44.760
who receive treatment
one, it's the red curve.

00:26:44.760 --> 00:26:47.940
So young people have much
higher, let's say, blood

00:26:47.940 --> 00:26:53.790
pressure under treatment one,
and, similarly, much older

00:26:53.790 --> 00:26:55.965
people.

00:26:55.965 --> 00:26:57.340
So then one could
ask, well, what

00:26:57.340 --> 00:26:59.757
about the difference between
these two potential outcomes?

00:26:59.757 --> 00:27:02.850
That is to say the CATE, the
Conditional Average Treatment

00:27:02.850 --> 00:27:06.810
Effect, is simply looking at the
distance between the blue curve

00:27:06.810 --> 00:27:08.910
and the red curve
for that individual.

00:27:08.910 --> 00:27:11.310
So for someone with
a specific age,

00:27:11.310 --> 00:27:14.640
let's say a young person
or a very old person,

00:27:14.640 --> 00:27:17.400
there's a very big difference
between giving treatment

00:27:17.400 --> 00:27:19.980
zero or giving treatment one.

00:27:19.980 --> 00:27:21.658
Whereas for a
middle aged person,

00:27:21.658 --> 00:27:22.950
there's very little difference.

00:27:22.950 --> 00:27:30.090
So, for example, if treatment
one was significantly cheaper

00:27:30.090 --> 00:27:32.890
than treatment zero,
then you might say,

00:27:32.890 --> 00:27:34.500
we'll give treatment one.

00:27:34.500 --> 00:27:37.020
Even though it's not quite
as good as treatment zero,

00:27:37.020 --> 00:27:39.660
it's so much cheaper and
the difference between them

00:27:39.660 --> 00:27:43.187
is so small, we'll
give treatment one anyway.

00:27:43.187 --> 00:27:45.270
But in order to make that
type of policy decision,

00:27:45.270 --> 00:27:46.560
one, of course,
has to understand

00:27:46.560 --> 00:27:47.700
that conditional
average treatment

00:27:47.700 --> 00:27:49.700
effect for that individual,
and that's something

00:27:49.700 --> 00:27:53.252
that we're going to want
to predict using data.
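
NOTE
A sketch of this picture with invented curve shapes: treatment zero is
worst for middle ages, treatment one is worst for the young and the old,
and the CATE is just the gap between the two curves at each age.
import numpy as np
age = np.linspace(20, 80, 200)
mu_y0 = 120 + 12 * np.exp(-((age - 50) / 15) ** 2)  # "blue": peaks mid-age
mu_y1 = 131 + 0.01 * (age - 50) ** 2                # "red": high at extremes
cate = mu_y1 - mu_y0            # big at the extremes, near zero in the middle
ate = cate.mean()               # average effect under a uniform age mix
print(cate[0], cate[100], ate)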

00:27:53.252 --> 00:27:54.710
Now, we don't always
get the luxury

00:27:54.710 --> 00:27:57.440
of having personalized
treatment recommendations.

00:27:57.440 --> 00:28:01.230
Sometimes we have
to give a policy.

00:28:01.230 --> 00:28:02.538
Like, for example--

00:28:02.538 --> 00:28:04.080
I took this example
out of my slides,

00:28:04.080 --> 00:28:06.018
but I'll give it to you anyway.

00:28:06.018 --> 00:28:07.560
The federal government
might come out

00:28:07.560 --> 00:28:12.885
with a guideline saying that
all men over the age of 50--

00:28:12.885 --> 00:28:14.010
I'm making up that number--

00:28:14.010 --> 00:28:19.320
need to get annual
prostate cancer screening.

00:28:19.320 --> 00:28:24.550
That's an example of a
very broad policy decision.

00:28:24.550 --> 00:28:28.450
You might ask, well, what is
the effect of that policy now

00:28:28.450 --> 00:28:31.940
applied over the
full population on,

00:28:31.940 --> 00:28:35.410
let's say, decreasing deaths
due to prostate cancer?

00:28:35.410 --> 00:28:37.990
And that would be
an example of asking

00:28:37.990 --> 00:28:40.655
about the average
treatment effect.

00:28:40.655 --> 00:28:42.280
So if you were to
average the red line,

00:28:42.280 --> 00:28:43.750
if you were to
average the blue line,

00:28:43.750 --> 00:28:45.460
you get those two dotted
lines I show there.

00:28:45.460 --> 00:28:46.750
And if you look at the
difference between them,

00:28:46.750 --> 00:28:48.250
that is the average
treatment effect

00:28:48.250 --> 00:28:50.350
between giving the
red intervention

00:28:50.350 --> 00:28:52.400
or giving the blue intervention.

00:28:52.400 --> 00:28:56.830
And if the average treatment
effect is very positive,

00:28:56.830 --> 00:29:00.370
you might say that, on
average, this intervention

00:29:00.370 --> 00:29:01.767
is a good intervention.

00:29:01.767 --> 00:29:03.850
If it's very negative, you
might say the opposite.

00:29:06.670 --> 00:29:08.980
Now, the challenge about
doing causal inference

00:29:08.980 --> 00:29:11.330
from observational data
is that, of course,

00:29:11.330 --> 00:29:14.620
we don't observe those
red and those blue curves,

00:29:14.620 --> 00:29:18.788
rather what we observe are
data points that might be

00:29:18.788 --> 00:29:20.080
distributed all over the place.

00:29:20.080 --> 00:29:23.120
Like, for example,
in this example,

00:29:23.120 --> 00:29:26.537
the blue treatment happens
to be given in the data more

00:29:26.537 --> 00:29:28.120
to young people, and
the red treatment

00:29:28.120 --> 00:29:31.540
happens to be given in the
data more to older people.

00:29:31.540 --> 00:29:33.530
And that can happen for
a variety of reasons.

00:29:33.530 --> 00:29:37.030
It can happen due to
access to medication.

00:29:37.030 --> 00:29:39.520
It can happen for
socioeconomic reasons.

00:29:39.520 --> 00:29:43.990
It could happen because existing
treatment guidelines say

00:29:43.990 --> 00:29:46.710
that old people should
receive treatment one

00:29:46.710 --> 00:29:49.330
and young people should
receive treatment zero.

00:29:49.330 --> 00:29:52.600
These are all reasons why
in your data who receives

00:29:52.600 --> 00:29:56.240
what treatment could
be biased in some way.

00:29:56.240 --> 00:30:00.025
And that's exactly what this
edge from X to T is modeling.

00:30:02.960 --> 00:30:04.380
But for each of
those people, you

00:30:04.380 --> 00:30:06.660
might want to know, well, what
would have happened if they

00:30:06.660 --> 00:30:07.952
had gotten the other treatment?

00:30:07.952 --> 00:30:10.230
And that's asking about
the counterfactual.

00:30:10.230 --> 00:30:13.980
So these dotted circles
are the counterfactuals

00:30:13.980 --> 00:30:17.300
for each of those observations.

00:30:17.300 --> 00:30:19.740
And by the way, you'll notice
that those dots are not

00:30:19.740 --> 00:30:21.990
on the curves, and the reason
they're not on the curve

00:30:21.990 --> 00:30:23.407
is because I'm
trying to point out

00:30:23.407 --> 00:30:25.930
that there could be some
stochasticity in the outcome.

00:30:25.930 --> 00:30:30.210
So the dotted lines are the
expected potential outcomes

00:30:30.210 --> 00:30:32.245
and the circles are the
realizations of them.

00:30:34.920 --> 00:30:35.490
All right.

00:30:35.490 --> 00:30:40.460
Everyone take out a calculator
or your computer or your phone,

00:30:40.460 --> 00:30:41.550
and I'll take out mine.

00:30:45.470 --> 00:30:49.090
This is not an opportunity to go
on Facebook, just to be clear.

00:30:49.090 --> 00:30:50.862
All you want is a calculator.

00:30:54.972 --> 00:30:56.930
My phone doesn't-- oh,
OK, it has a calculator.

00:30:56.930 --> 00:30:58.500
Good.

00:30:58.500 --> 00:31:00.450
All right.

00:31:00.450 --> 00:31:02.624
So we're going to do
a little exercise.

00:31:05.410 --> 00:31:08.530
Here's a data set on
the left-hand side.

00:31:08.530 --> 00:31:10.540
Each row is an individual.

00:31:10.540 --> 00:31:13.720
We're observing the
individual's age, gender,

00:31:13.720 --> 00:31:15.340
whether they exercise
regularly, which

00:31:15.340 --> 00:31:17.800
I'll say is a one or a zero,
and what treatment they got,

00:31:17.800 --> 00:31:21.750
which is A or B. On
the far right-hand side

00:31:21.750 --> 00:31:28.200
are their observed
glucose (sugar) levels, let's say,

00:31:28.200 --> 00:31:29.340
at the end of the year.

00:31:33.010 --> 00:31:37.960
Now, what we'd like to
have looks like this.

00:31:37.960 --> 00:31:42.460
So we'd like to know what would
have happened to this person's

00:31:42.460 --> 00:31:45.700
sugar levels had they
received medication A

00:31:45.700 --> 00:31:47.830
or had they received
medication B.

00:31:47.830 --> 00:31:52.630
But if you look at
the previous slide,

00:31:52.630 --> 00:31:56.560
we observed for each individual
that they got either A or B.

00:31:56.560 --> 00:31:58.480
And so we're only
going to know one

00:31:58.480 --> 00:32:00.920
of these columns
for each individual.

00:32:00.920 --> 00:32:03.430
So the first row, for
example, this individual

00:32:03.430 --> 00:32:05.980
received treatment
A, and so you'll

00:32:05.980 --> 00:32:11.650
see that I've taken
the observed sugar

00:32:11.650 --> 00:32:14.730
level for that individual,
and since they received

00:32:14.730 --> 00:32:17.860
treatment A, that
observed level represents

00:32:17.860 --> 00:32:21.550
the potential outcome Ya, or Y0.

00:32:21.550 --> 00:32:27.370
And that's why I have a 6,
which is bolded under Y0.

00:32:27.370 --> 00:32:30.370
And we don't know what
would have happened

00:32:30.370 --> 00:32:32.200
to that individual
had they received

00:32:32.200 --> 00:32:36.580
treatment B. So in this
case, some magical creature

00:32:36.580 --> 00:32:38.762
came to me and told me
their sugar levels would

00:32:38.762 --> 00:32:40.720
have been 5.5, but we
don't actually know that.

00:32:40.720 --> 00:32:42.070
It wasn't in the data.

00:32:42.070 --> 00:32:43.737
Let's look at the
next line just to make

00:32:43.737 --> 00:32:45.230
sure we get what I'm saying.

00:32:45.230 --> 00:32:47.050
So the second
individual actually

00:32:47.050 --> 00:32:53.450
received treatment B. Their
observed sugar level is 6.5.

00:32:53.450 --> 00:32:55.790
OK.

00:32:55.790 --> 00:32:58.160
Let's do a little survey.

00:32:58.160 --> 00:33:00.950
That 6.5 number, should
it be in this column?

00:33:00.950 --> 00:33:01.698
Raise your hand.

00:33:01.698 --> 00:33:02.990
Or should it be in this column?

00:33:02.990 --> 00:33:04.740
Raise your hand.

00:33:04.740 --> 00:33:05.240
All right.

00:33:05.240 --> 00:33:08.050
About half of you
got that right.

00:33:08.050 --> 00:33:11.040
Indeed, it goes to
the second column.

00:33:11.040 --> 00:33:14.080
And again, what we would like
to know is the counterfactual.

00:33:14.080 --> 00:33:16.260
What would have been
their sugar levels

00:33:16.260 --> 00:33:18.267
had they received medication A?

00:33:18.267 --> 00:33:20.100
Which we don't actually
observe in our data,

00:33:20.100 --> 00:33:22.440
but I'm going to
hypothesize is--

00:33:22.440 --> 00:33:25.562
suppose that someone
told me it was 7, then

00:33:25.562 --> 00:33:27.270
you would see that
value filled in there.

00:33:27.270 --> 00:33:30.610
That's the unobserved
counterfactual.

00:33:30.610 --> 00:33:31.110
All right.

00:33:31.110 --> 00:33:33.900
First of all, is
the setup clear?

00:33:33.900 --> 00:33:34.470
All right.

00:33:34.470 --> 00:33:37.990
Now here's when you
use your calculators.

00:33:37.990 --> 00:33:40.720
So we're going to
now demonstrate

00:33:40.720 --> 00:33:43.420
the difference between
a naive estimator

00:33:43.420 --> 00:33:47.590
of your average treatment effect
and the true average treatment

00:33:47.590 --> 00:33:48.850
effect.

00:33:48.850 --> 00:33:51.190
So what I want you
to do right now

00:33:51.190 --> 00:33:59.270
is to compute, first,
what is the average sugar

00:33:59.270 --> 00:34:07.270
level of the individuals who
got medication B. So for that,

00:34:07.270 --> 00:34:11.440
we're only going to
be using the red ones.

00:34:11.440 --> 00:34:17.050
So this is conditioning
on receiving medication B.

00:34:17.050 --> 00:34:24.340
And so this is equivalent
to going back to this one

00:34:24.340 --> 00:34:27.130
and saying, we're only going to
take the rows where individuals

00:34:27.130 --> 00:34:29.350
receive medication
B, and we're going

00:34:29.350 --> 00:34:34.110
to average their
observed sugar levels.

00:34:34.110 --> 00:34:36.530
And everyone should do that.

00:34:36.530 --> 00:34:37.530
What's the first number?

00:34:42.600 --> 00:35:02.370
6.5 plus-- I'm getting 7.875.

00:35:02.370 --> 00:35:08.790
This is for the
average sugar, given

00:35:08.790 --> 00:35:11.430
that they received
medication B. Is that

00:35:11.430 --> 00:35:12.680
what other people are getting?

00:35:12.680 --> 00:35:13.130
AUDIENCE: Yeah.

00:35:13.130 --> 00:35:13.838
DAVID SONTAG: OK.

00:35:13.838 --> 00:35:16.170
What about for
the second number?

00:35:16.170 --> 00:35:20.070
Average sugar, given A?

00:35:24.820 --> 00:35:26.058
I want you to compute it.

00:35:26.058 --> 00:35:27.850
And I'm going to ask
everyone to say it out

00:35:27.850 --> 00:35:29.123
loud in literally one minute.

00:35:29.123 --> 00:35:30.540
And if you get it
wrong, of course

00:35:30.540 --> 00:35:33.360
you're going to be embarrassed.

00:35:33.360 --> 00:35:34.360
I'm going to try myself.

00:35:53.090 --> 00:35:53.910
OK.

00:35:53.910 --> 00:35:55.493
On the count of
three, I want everyone

00:35:55.493 --> 00:35:57.680
to read out what
that third number is.

00:35:57.680 --> 00:36:00.020
One, two, three.

00:36:00.020 --> 00:36:04.100
ALL: 7.125.

00:36:04.100 --> 00:36:05.250
DAVID SONTAG: All right.

00:36:05.250 --> 00:36:05.750
Good.

00:36:05.750 --> 00:36:08.830
We can all do arithmetic.

00:36:08.830 --> 00:36:10.370
All right.

00:36:10.370 --> 00:36:11.595
Good.

00:36:11.595 --> 00:36:17.000
So, again, we're just
looking at the red numbers

00:36:17.000 --> 00:36:18.590
here, just the red numbers.

00:36:18.590 --> 00:36:20.960
So we just computed
that difference,

00:36:20.960 --> 00:36:24.280
which is point what?

00:36:24.280 --> 00:36:25.690
AUDIENCE: 0.75.

00:36:25.690 --> 00:36:27.500
DAVID SONTAG: 0.75?

00:36:27.500 --> 00:36:29.030
Yeah, that looks about right.

00:36:29.030 --> 00:36:29.670
Good.

00:36:29.670 --> 00:36:30.170
All right.

00:36:30.170 --> 00:36:33.220
So that's a positive number.

00:36:33.220 --> 00:36:37.260
Now let's do
something different.

00:36:37.260 --> 00:36:42.600
Now let's compute the actual
average treatment effect, which

00:36:42.600 --> 00:36:50.310
is we're now going to average
every number in this column,

00:36:50.310 --> 00:36:53.400
and we're going to average
every number in this column.

00:36:53.400 --> 00:36:56.880
So this is the
average sugar level

00:36:56.880 --> 00:37:00.030
under the potential outcome
of had the individual received

00:37:00.030 --> 00:37:03.590
treatment B, and this is
the average sugar level

00:37:03.590 --> 00:37:06.080
under the potential outcome
that the individual received

00:37:06.080 --> 00:37:12.350
treatment A. All right.

00:37:12.350 --> 00:37:13.550
Who's doing it?

00:37:13.550 --> 00:37:14.960
AUDIENCE: 0.75.

00:37:14.960 --> 00:37:16.385
DAVID SONTAG: 0.75 is what?

00:37:16.385 --> 00:37:17.760
AUDIENCE: The difference.

00:37:17.760 --> 00:37:19.010
DAVID SONTAG: How do you know?

00:37:19.010 --> 00:37:21.160
AUDIENCE: [INAUDIBLE]

00:37:21.160 --> 00:37:22.910
DAVID SONTAG: Wow, you're fast.

00:37:22.910 --> 00:37:23.410
OK.

00:37:23.410 --> 00:37:24.240
Let's see if you're right.

00:37:24.240 --> 00:37:25.190
I actually don't know.

00:37:25.190 --> 00:37:25.760
OK.

00:37:25.760 --> 00:37:26.930
The first one is 0.75.

00:37:26.930 --> 00:37:27.500
Good, we got that right.

00:37:27.500 --> 00:37:29.917
I intentionally didn't post
the slides to today's lecture.

00:37:32.240 --> 00:37:38.340
And the second
one is minus 0.75.

00:37:38.340 --> 00:37:38.840
All right.

00:37:38.840 --> 00:37:43.330
So now let's put us in the
shoes of a policymaker.

00:37:43.330 --> 00:37:47.135
The policymaker has to
decide, is it a good idea to--

00:37:47.135 --> 00:37:49.010
or let's say it's a
health insurance company.

00:37:49.010 --> 00:37:50.843
A health insurance
company is trying to decide,

00:37:50.843 --> 00:37:53.768
should I reimburse for
treatment B or not?

00:37:53.768 --> 00:37:55.310
Or should I simply
say, no, I'm never

00:37:55.310 --> 00:37:58.430
going to reimburse for treatment B
because it doesn't work well?

00:37:58.430 --> 00:38:02.300
So if they had done the
naive estimator, that

00:38:02.300 --> 00:38:04.860
would have been
the first example,

00:38:04.860 --> 00:38:10.540
then it would look
like medication B is--

00:38:10.540 --> 00:38:12.380
we want lower
numbers here, so it

00:38:12.380 --> 00:38:18.610
would look like medication B
is worse than medication A.

00:38:18.610 --> 00:38:21.730
And if you properly
estimated what

00:38:21.730 --> 00:38:24.580
the actual average
treatment effect is,

00:38:24.580 --> 00:38:26.890
you get the absolute
opposite conclusion.

00:38:26.890 --> 00:38:29.950
You conclude that medication B
is much better than medication

00:38:29.950 --> 00:38:33.660
A. It's just a simple
example to really illustrate

00:38:33.660 --> 00:38:35.970
the difference
between conditioning

00:38:35.970 --> 00:38:39.035
and actually computing
that counterfactual.

00:38:42.890 --> 00:38:43.390
OK.

00:38:43.390 --> 00:38:45.170
So hopefully now you're
starting to get it.

00:38:45.170 --> 00:38:47.795
And again, you're going to have
many more opportunities to work

00:38:47.795 --> 00:38:52.700
through these things in your
homework assignment and so on.

00:38:52.700 --> 00:38:55.550
So by now you should be
starting to wonder, how the hell

00:38:55.550 --> 00:38:57.620
could I do anything in
this state of the world?

00:38:57.620 --> 00:39:00.620
Because you don't actually
observe those black numbers.

00:39:00.620 --> 00:39:02.540
These are all unobserved.

00:39:02.540 --> 00:39:05.600
And clearly there
is bias in what

00:39:05.600 --> 00:39:07.100
the values should
be because of what

00:39:07.100 --> 00:39:08.790
I've been saying all along.

00:39:08.790 --> 00:39:11.163
So what can we do?

00:39:11.163 --> 00:39:12.830
Well, the first thing
we have to realize

00:39:12.830 --> 00:39:15.247
is that typically, this is an
impossible problem to solve.

00:39:15.247 --> 00:39:18.920
So your instincts
aren't wrong, and we're

00:39:18.920 --> 00:39:20.740
going to have to make
a ton of assumptions

00:39:20.740 --> 00:39:23.950
in order to do anything here.

00:39:23.950 --> 00:39:26.430
So the first assumption
is called SUTVA.

00:39:26.430 --> 00:39:27.930
I'm not even going
to talk about it.

00:39:27.930 --> 00:39:29.725
You can read about
that in your readings.

00:39:29.725 --> 00:39:31.350
I'll tell you about
the two assumptions

00:39:31.350 --> 00:39:34.890
that are a little bit
easier to describe.

00:39:34.890 --> 00:39:37.920
The first critical assumption
is that there are no unobserved

00:39:37.920 --> 00:39:39.840
confounding factors.

00:39:39.840 --> 00:39:41.370
Mathematically
what that's saying

00:39:41.370 --> 00:39:44.610
is that your potential
outcomes, Y0 and Y1,

00:39:44.610 --> 00:39:47.340
are conditionally independent
of the treatment decision given

00:39:47.340 --> 00:39:52.780
what you observe on
the individual, X.

00:39:52.780 --> 00:39:55.900
Now, this could
be a bit hard to--

00:39:55.900 --> 00:39:57.300
and that's called ignorability.

00:39:57.300 --> 00:39:59.008
And this can be a bit
hard to understand,

00:39:59.008 --> 00:40:01.950
so let me draw a picture.

00:40:01.950 --> 00:40:04.200
So X is your covariates, T
is your treatment decision.

00:40:04.200 --> 00:40:06.600
And now I've drawn for you
a slightly different graph.

00:40:06.600 --> 00:40:10.860
Over here I said X goes
to T, X and T go to Y.

00:40:10.860 --> 00:40:14.610
But now I don't have Y.
Instead, I have Y0 and Y1,

00:40:14.610 --> 00:40:16.662
and I don't have any
edge from T to them.

00:40:16.662 --> 00:40:18.120
And that's because
now I'm actually

00:40:18.120 --> 00:40:20.850
using the potential
outcomes notation.

00:40:20.850 --> 00:40:22.225
Y0 is a potential
outcome of what

00:40:22.225 --> 00:40:24.558
would have happened to this
individual had they received

00:40:24.558 --> 00:40:26.040
treatment 0, and
Y1 is what would

00:40:26.040 --> 00:40:28.950
have happened to this individual
if they received treatment one.

00:40:28.950 --> 00:40:31.395
And because you already know
what treatment the individual

00:40:31.395 --> 00:40:32.853
has received, it
doesn't make sense

00:40:32.853 --> 00:40:35.560
to talk about an edge
from T to those values.

00:40:35.560 --> 00:40:37.150
That's why there's
no edge there.

00:40:37.150 --> 00:40:39.150
So then you might wonder,
how could you possibly

00:40:39.150 --> 00:40:41.192
have a violation of this
conditional independence

00:40:41.192 --> 00:40:42.180
assumption?

00:40:42.180 --> 00:40:43.680
Well, before I give
you that answer,

00:40:43.680 --> 00:40:45.970
let me put some names
to these things.

00:40:45.970 --> 00:40:48.870
So we might think about X as
being the age, gender, weight,

00:40:48.870 --> 00:40:50.850
diet, and so on
of the individual.

00:40:50.850 --> 00:40:54.300
T might be a medication, like
an anti-hypertensive medication

00:40:54.300 --> 00:40:56.820
to try to lower a
patient's blood pressure.

00:40:56.820 --> 00:40:58.770
And these would be
the potential outcomes

00:40:58.770 --> 00:41:00.990
after those two medications.

00:41:00.990 --> 00:41:04.470
So an example of a
violation of ignorability

00:41:04.470 --> 00:41:10.970
is if there is something else,
some hidden variable h, which

00:41:10.970 --> 00:41:13.490
is not observed
and which affects

00:41:13.490 --> 00:41:15.470
both the decision
of what treatment

00:41:15.470 --> 00:41:17.750
the individual in
your data set receives

00:41:17.750 --> 00:41:20.545
and the potential outcomes.

00:41:20.545 --> 00:41:22.170
Now it should be
really clear that this

00:41:22.170 --> 00:41:24.378
would be a violation of that
conditional independence

00:41:24.378 --> 00:41:25.100
assumption.

00:41:25.100 --> 00:41:29.010
In this graph, Y0 and
Y1 are not conditionally

00:41:29.010 --> 00:41:32.760
independent of T
given X. All right.

00:41:32.760 --> 00:41:34.800
So what are these
hidden confounders?

00:41:34.800 --> 00:41:37.710
Well, they might be things,
for example, which really

00:41:37.710 --> 00:41:40.020
affect treatment decisions.

00:41:40.020 --> 00:41:42.420
So maybe there's a
treatment guideline

00:41:42.420 --> 00:41:44.400
saying that for
diabetic patients,

00:41:44.400 --> 00:41:47.700
they should receive
treatment zero, that that's

00:41:47.700 --> 00:41:50.350
the right thing to do.

00:41:50.350 --> 00:41:54.270
And so a violation
of this would be

00:41:54.270 --> 00:41:56.700
if the fact that the
patient's diabetic

00:41:56.700 --> 00:41:59.950
were not recorded in the
electronic health record.

00:41:59.950 --> 00:42:01.660
So you don't know--

00:42:01.660 --> 00:42:02.540
that's not up there.

00:42:02.540 --> 00:42:05.610
You don't know that,
in fact, the reason

00:42:05.610 --> 00:42:08.700
the patient received treatment
T was because of this h factor.

00:42:08.700 --> 00:42:10.450
And there's critically
another assumption,

00:42:10.450 --> 00:42:12.540
which is that h actually
affects the outcome,

00:42:12.540 --> 00:42:15.435
which is why you have these
edges from h to the Y's.

00:42:15.435 --> 00:42:17.310
If h were something
which might have affected

00:42:17.310 --> 00:42:21.620
treatment decision but not the
actual potential outcomes--

00:42:21.620 --> 00:42:23.530
and that can happen, of course.

00:42:23.530 --> 00:42:26.880
Things like gender can often
affect treatment decisions,

00:42:26.880 --> 00:42:32.570
but maybe, for some diseases,
it might not affect outcomes.

00:42:32.570 --> 00:42:36.270
In that situation it wouldn't
be a confounding factor

00:42:36.270 --> 00:42:38.540
because it doesn't
violate this assumption.

00:42:38.540 --> 00:42:40.290
And, in fact, one would
be able to come up

00:42:40.290 --> 00:42:42.960
with consistent estimators
of average treatment effect

00:42:42.960 --> 00:42:44.130
under that assumption.

00:42:44.130 --> 00:42:47.656
Where things go to hell is when
you have both of those edges.

00:42:47.656 --> 00:42:49.950
All right.

00:42:49.950 --> 00:42:51.790
So there can't be
any of these h's.

00:42:51.790 --> 00:42:53.540
You have to observe
all things that affect

00:42:53.540 --> 00:42:55.055
both treatment and outcomes.
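
NOTE
A hedged simulation of exactly this failure (the model and numbers are
made up): a hidden variable h, say an unrecorded diabetes flag, drives
both the treatment decision and the potential outcomes. Adjusting for the
observed X alone then gives a badly biased effect estimate.
  import numpy as np
  rng = np.random.default_rng(1)
  n = 200_000
  x = rng.integers(0, 2, n)   # observed covariate
  h = rng.integers(0, 2, n)   # hidden confounder, absent from the data
  t = (rng.uniform(0, 1, n) < 0.2 + 0.6 * h).astype(int)  # guideline: treat when h = 1
  y0 = 1.0 * h + 0.5 * x      # h also worsens the outcome itself
  y1 = y0 - 0.3               # true effect: -0.3 for everyone
  y = np.where(t == 1, y1, y0)
  # "Adjust" for X only: average the within-stratum arm differences.
  est = np.mean([y[(x == v) & (t == 1)].mean() - y[(x == v) & (t == 0)].mean()
                 for v in (0, 1)])
  print(f"X-adjusted estimate: {est:+.2f}  vs. true ATE: -0.30")  # comes out near +0.30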

00:42:57.848 --> 00:42:59.390
The second big
assumption-- oh, yeah.

00:42:59.390 --> 00:42:59.930
Question?

00:42:59.930 --> 00:43:02.098
AUDIENCE: In practice, how
good of a model is this?

00:43:02.098 --> 00:43:03.890
DAVID SONTAG: Of what
I'm showing you here?

00:43:03.890 --> 00:43:04.610
AUDIENCE: Yeah.

00:43:04.610 --> 00:43:06.015
DAVID SONTAG: For hypertension?

00:43:06.015 --> 00:43:06.640
AUDIENCE: Sure.

00:43:06.640 --> 00:43:07.848
DAVID SONTAG: I have no idea.

00:43:10.248 --> 00:43:11.790
But I think what
you're really trying

00:43:11.790 --> 00:43:14.248
to get at here in asking your
question, how good of a model

00:43:14.248 --> 00:43:17.540
is this, is, well, oh,
my god, how do I know

00:43:17.540 --> 00:43:19.200
if I've observed everything?

00:43:19.200 --> 00:43:20.100
Right?

00:43:20.100 --> 00:43:20.600
All right.

00:43:20.600 --> 00:43:22.017
And that's where
you need to start

00:43:22.017 --> 00:43:24.100
talking to domain experts.

00:43:24.100 --> 00:43:27.700
So this is my
starting place where

00:43:27.700 --> 00:43:31.450
I said, no, I'm not
going to attempt

00:43:31.450 --> 00:43:32.965
to fit the causal graph.

00:43:35.470 --> 00:43:37.478
I'm going to assume I
know the causal graph

00:43:37.478 --> 00:43:39.020
and just try to
estimate the effects.

00:43:39.020 --> 00:43:41.228
That's where this starts to
become really relevant.

00:43:41.228 --> 00:43:44.053
Because if you notice, this
is another causal graph, not

00:43:44.053 --> 00:43:45.220
the one I drew on the board.

00:43:48.100 --> 00:43:50.110
And so that's something
where, really,

00:43:50.110 --> 00:43:52.100
talking with domain
experts would be relevant.

00:43:52.100 --> 00:43:57.120
So if you say, OK, I'm going
to be studying hypertension

00:43:57.120 --> 00:44:00.870
and this is the data I've
observed on patients,

00:44:00.870 --> 00:44:04.980
well, you can then go to a
clinician, maybe a primary care

00:44:04.980 --> 00:44:08.220
doctor who often treats
patients with hypertension,

00:44:08.220 --> 00:44:10.530
and you say, OK, what usually
affects your treatment

00:44:10.530 --> 00:44:11.850
decisions?

00:44:11.850 --> 00:44:13.370
And you get a set
of variables out,

00:44:13.370 --> 00:44:15.660
and then you check
to make sure, am I

00:44:15.660 --> 00:44:17.610
observing all of those
variables, at least

00:44:17.610 --> 00:44:20.988
the variables that would
also affect outcomes?

00:44:20.988 --> 00:44:22.530
So, often, there's
going to be a back

00:44:22.530 --> 00:44:24.900
and forth in that conversation
to make sure that you've

00:44:24.900 --> 00:44:26.195
set up your problem correctly.

00:44:26.195 --> 00:44:27.570
And again, this
is one area where

00:44:27.570 --> 00:44:29.400
you see a critical
difference between the way

00:44:29.400 --> 00:44:31.067
that we do causal
inference from the way

00:44:31.067 --> 00:44:32.400
that we do machine learning.

00:44:32.400 --> 00:44:36.580
Machine learning, if there's
some unobserved variables,

00:44:36.580 --> 00:44:37.080
so what?

00:44:37.080 --> 00:44:38.880
I mean, maybe your predictive
accuracy isn't quite as good

00:44:38.880 --> 00:44:40.740
as it could have
been, but whatever.

00:44:40.740 --> 00:44:43.920
Here, your conclusions
could be completely wrong

00:44:43.920 --> 00:44:48.810
if you don't get those
confounding factors right.

00:44:48.810 --> 00:44:50.610
Now, in some of the
optional readings

00:44:50.610 --> 00:44:52.710
for Thursday's lecture--

00:44:52.710 --> 00:44:55.290
and we'll touch on it
very briefly on Thursday,

00:44:55.290 --> 00:44:57.300
but there's not much
time in this course--

00:44:57.300 --> 00:45:00.690
I'll talk about ways and
you'll read about ways

00:45:00.690 --> 00:45:03.780
to try to assess
robustness to violations

00:45:03.780 --> 00:45:05.430
of these assumptions.

00:45:05.430 --> 00:45:08.075
And those go by the name
of sensitivity analysis.

00:45:08.075 --> 00:45:10.200
So, for example, the type
of question you might ask

00:45:10.200 --> 00:45:12.300
is, how would my
conclusions have

00:45:12.300 --> 00:45:15.060
changed if there were a
confounding factor which

00:45:15.060 --> 00:45:17.860
was blah strong?

00:45:17.860 --> 00:45:23.210
And that's something that one
could try to answer from data,

00:45:23.210 --> 00:45:25.420
but it's really starting
to get beyond the scope

00:45:25.420 --> 00:45:26.510
of this course.

00:45:26.510 --> 00:45:28.052
So I'll give you
some readings on it,

00:45:28.052 --> 00:45:32.000
but I won't be able to talk
about it in the lecture.

00:45:32.000 --> 00:45:34.660
Now, the second major
assumption that one needs

00:45:34.660 --> 00:45:36.943
is what's known
as common support.

00:45:36.943 --> 00:45:38.610
And by the way, pay
close attention here

00:45:38.610 --> 00:45:43.680
because at the end of today's
lecture-- and if I forget,

00:45:43.680 --> 00:45:45.030
someone must remind me--

00:45:45.030 --> 00:45:49.650
I'm going to ask you where did
these two assumptions come up

00:45:49.650 --> 00:45:52.870
in the proof that I'm
about to give you.

00:45:52.870 --> 00:45:55.370
The first one I'm going to give
you will be a dead giveaway.

00:45:55.370 --> 00:45:57.840
So I'm going to answer to you
where ignorability comes up,

00:45:57.840 --> 00:45:59.423
but it's up to you
to figure out where

00:45:59.423 --> 00:46:01.560
does common support show up.

00:46:01.560 --> 00:46:02.670
So what is common support?

00:46:02.670 --> 00:46:07.560
Well, what common support
says is that there always

00:46:07.560 --> 00:46:11.440
must be some stochasticity
in the treatment decisions.

00:46:11.440 --> 00:46:17.270
For example, if in
your data patients only

00:46:17.270 --> 00:46:21.780
receive treatment A and no
patient receives treatment B,

00:46:21.780 --> 00:46:24.420
then you would never be able to
figure out the counterfactual,

00:46:24.420 --> 00:46:29.008
what would have happened if
patients receive treatment B.

00:46:29.008 --> 00:46:31.050
But what happens if it's
not quite that universal

00:46:31.050 --> 00:46:34.260
but maybe there are
classes of people?

00:46:34.260 --> 00:46:37.350
For some individuals X, let's
say, people with blue hair.

00:46:37.350 --> 00:46:42.450
People with blue hair always
receive treatment zero

00:46:42.450 --> 00:46:45.200
and they never
see treatment one.

00:46:45.200 --> 00:46:49.340
Well, for those people,
if for some reason

00:46:49.340 --> 00:46:50.990
something about them
having blue hair

00:46:50.990 --> 00:46:53.600
was also going to affect
how they would respond

00:46:53.600 --> 00:46:55.250
to the treatment,
then you wouldn't

00:46:55.250 --> 00:46:57.470
be able to answer anything
about the counterfactual

00:46:57.470 --> 00:46:59.660
for those individuals.

00:46:59.660 --> 00:47:03.560
This goes by the name of what's
called a propensity score.

00:47:03.560 --> 00:47:07.310
It's the probability of
receiving some treatment

00:47:07.310 --> 00:47:09.230
for each individual.

00:47:09.230 --> 00:47:14.150
And we're going to assume that
this propensity score is always

00:47:14.150 --> 00:47:17.150
bounded away from 0 and 1.

00:47:17.150 --> 00:47:20.000
So it's between epsilon
and 1 minus epsilon

00:47:20.000 --> 00:47:23.020
for some small epsilon.

00:47:23.020 --> 00:47:25.330
And violations of
that assumption

00:47:25.330 --> 00:47:28.000
are going to completely
invalidate all conclusions

00:47:28.000 --> 00:47:30.610
that we could draw
from the data.
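
NOTE
One way to probe this assumption in practice, sketched under the
assumption that scikit-learn is available (function and variable names
are illustrative): fit a propensity model e(x) = P(T = 1 | X = x) and
flag individuals whose estimated score falls outside [epsilon, 1 - epsilon].
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  def flag_overlap_violations(X, t, eps=0.05):
      """Estimated propensity scores plus a mask of individuals outside
      [eps, 1 - eps], where any estimator will have to extrapolate."""
      e_hat = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
      return e_hat, (e_hat < eps) | (e_hat > 1 - eps)
  # Synthetic example: treatment driven strongly by the first covariate.
  rng = np.random.default_rng(2)
  X = rng.normal(size=(5000, 3))
  t = (rng.uniform(size=5000) < 1 / (1 + np.exp(-3 * X[:, 0]))).astype(int)
  e_hat, bad = flag_overlap_violations(X, t)
  print(f"{bad.mean():.1%} of individuals fall outside [0.05, 0.95]")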

00:47:30.610 --> 00:47:31.520
All right.

00:47:31.520 --> 00:47:35.190
Now, in actual clinical
practice, you might wonder,

00:47:35.190 --> 00:47:37.010
can this ever hold?

00:47:37.010 --> 00:47:40.880
Because there are
clinical guidelines.

00:47:40.880 --> 00:47:43.867
Well, a couple of places where
you'll see this are as follows.

00:47:43.867 --> 00:47:46.450
First, often, there are settings
where we haven't the faintest

00:47:46.450 --> 00:47:49.720
idea how to treat patients,
like second line diabetes

00:47:49.720 --> 00:47:51.010
treatments.

00:47:51.010 --> 00:47:54.370
You know that the first thing
we start with is metformin.

00:47:54.370 --> 00:47:57.310
But if metformin doesn't help
control the patient's glucose

00:47:57.310 --> 00:48:00.490
values, there are several
second line diabetic treatments.

00:48:00.490 --> 00:48:03.100
And right now, we don't
really know which one to try.

00:48:03.100 --> 00:48:06.340
So a clinician might start
with treatments from one class.

00:48:06.340 --> 00:48:08.570
And if that's not working,
you try a different class,

00:48:08.570 --> 00:48:09.040
and so on.

00:48:09.040 --> 00:48:10.582
And it's a bit random
which class you

00:48:10.582 --> 00:48:13.550
start with for any one patient.

00:48:13.550 --> 00:48:16.600
In other settings, there might
be good clinical guidelines,

00:48:16.600 --> 00:48:18.860
but there is randomness
in other ways.

00:48:18.860 --> 00:48:25.500
For example, clinicians who
are trained on the west coast

00:48:25.500 --> 00:48:28.350
might be trained that this is
the right way to do things,

00:48:28.350 --> 00:48:30.420
and clinicians who are
trained in the east coast

00:48:30.420 --> 00:48:33.630
might be trained that this is
the right way to do things.

00:48:33.630 --> 00:48:37.860
And so even if any one
clinician's treatment decisions

00:48:37.860 --> 00:48:40.260
are deterministic
in some way, you'll

00:48:40.260 --> 00:48:43.530
see some stochasticity
now across clinicians.

00:48:43.530 --> 00:48:45.790
It's a bit subtle how to
use that in your analysis,

00:48:45.790 --> 00:48:49.160
but trust me, it can be done.

00:48:49.160 --> 00:48:51.680
So if you want to
do causal inference

00:48:51.680 --> 00:48:53.290
from observational
data, you're going

00:48:53.290 --> 00:48:56.570
to have to first start to
formalize things mathematically

00:48:56.570 --> 00:49:01.190
in terms of what is your X, what
is your T, what is your Y. You

00:49:01.190 --> 00:49:04.830
have to think through,
do these choices

00:49:04.830 --> 00:49:09.310
satisfy these assumptions
of ignorability and overlap?

00:49:09.310 --> 00:49:11.310
Some of these things you
can check in your data.

00:49:11.310 --> 00:49:13.770
Ignorability you can't
explicitly check in your data.

00:49:13.770 --> 00:49:19.580
But overlap, this thing,
you can test in your data.

00:49:19.580 --> 00:49:20.550
By the way, how?

00:49:20.550 --> 00:49:21.050
Any idea?

00:49:24.828 --> 00:49:26.370
Someone else who
hasn't spoken today.

00:49:31.320 --> 00:49:33.690
So just think back to
the previous example.

00:49:33.690 --> 00:49:41.220
You have this table of these X's
and treatment A or B and then

00:49:41.220 --> 00:49:42.750
sugar values.

00:49:42.750 --> 00:49:44.303
How would you test this?

00:49:44.303 --> 00:49:46.220
AUDIENCE: You could use
a frequentist approach

00:49:46.220 --> 00:49:48.550
and just count how
many things show up.

00:49:48.550 --> 00:49:51.880
And if there is zero, then you
could say that it's violated.

00:49:51.880 --> 00:49:52.770
DAVID SONTAG: Good.

00:49:52.770 --> 00:49:54.705
So you have this table.

00:49:54.705 --> 00:49:58.140
I'll just go back to that table.

00:49:58.140 --> 00:50:03.420
We have this table,
and these are your X's.

00:50:05.805 --> 00:50:07.680
Actually, we'll go back
to the previous slide

00:50:07.680 --> 00:50:08.972
where it's a bit easier to see.

00:50:13.930 --> 00:50:17.020
Here, we're going to ignore
the outcome, the sugar

00:50:17.020 --> 00:50:19.150
levels because,
remember, this only

00:50:19.150 --> 00:50:22.030
has to do with
probability of treatment

00:50:22.030 --> 00:50:23.770
given your covariates.

00:50:23.770 --> 00:50:25.588
The Y doesn't show
up here at all.

00:50:25.588 --> 00:50:27.130
So this thing on
the right-hand side,

00:50:27.130 --> 00:50:29.977
the observed sugar levels, is
irrelevant for this question.

00:50:29.977 --> 00:50:31.810
All we care about is
what goes on over here.

00:50:31.810 --> 00:50:32.740
So we look at this.

00:50:32.740 --> 00:50:35.100
These are your X's, and
this is your treatment.

00:50:35.100 --> 00:50:37.840
And you can look to
see, OK, here you

00:50:37.840 --> 00:50:42.010
have one 75-year-old
male who does exercise

00:50:42.010 --> 00:50:44.680
frequently and received
treatment A. Is there any one

00:50:44.680 --> 00:50:48.370
else in the data set who
is 75 years old and male,

00:50:48.370 --> 00:50:51.190
does exercise regularly
but received treatment B?

00:50:51.190 --> 00:50:52.450
Yes or no?

00:50:52.450 --> 00:50:53.580
No.

00:50:53.580 --> 00:50:54.080
Good.

00:50:54.080 --> 00:50:54.580
OK.

00:50:54.580 --> 00:50:59.360
So overlap is not satisfied
here, at least not empirically.
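
NOTE
The count-based check suggested from the audience, sketched with pandas
on a made-up table in the spirit of this slide (the column names are
invented): group by the covariates and list the strata where only one
treatment arm is ever observed.
  import pandas as pd
  df = pd.DataFrame({
      "age":       [75, 75, 68, 68, 75],
      "sex":       ["M", "M", "F", "F", "M"],
      "exercise":  [1, 1, 0, 0, 1],
      "treatment": ["A", "A", "A", "B", "A"],
  })
  arms_seen = df.groupby(["age", "sex", "exercise"])["treatment"].nunique()
  print(arms_seen[arms_seen < 2])  # strata with no empirical overlap
As the lecture notes next, exact matching like this is too coarse with
finite data, so in practice one bins or smooths the covariates first.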

00:50:59.360 --> 00:51:03.190
Now, you might argue that I'm
being a bit too coarse here.

00:51:03.190 --> 00:51:05.740
Well, what happens if
the individual is 74

00:51:05.740 --> 00:51:06.850
and received treatment B?

00:51:06.850 --> 00:51:08.200
Maybe that's close enough.

00:51:08.200 --> 00:51:09.820
So there starts to
become subtleties

00:51:09.820 --> 00:51:12.700
in assessing these things
when you have finite data.

00:51:12.700 --> 00:51:14.710
But it is something at
the fundamental level

00:51:14.710 --> 00:51:17.290
that you could start
to assess using data.

00:51:17.290 --> 00:51:19.870
As opposed to ignorability,
which you cannot test using

00:51:19.870 --> 00:51:21.290
data.

00:51:21.290 --> 00:51:21.790
All right.

00:51:21.790 --> 00:51:29.990
So you have to think about, are
these assumptions satisfied?

00:51:29.990 --> 00:51:34.160
And only once you start to think
through those questions can

00:51:34.160 --> 00:51:37.340
you start to do your analysis.

00:51:37.340 --> 00:51:41.460
And so that now brings me to
the next part of this lecture,

00:51:41.460 --> 00:51:45.260
which is how do we actually--
let's just now believe David,

00:51:45.260 --> 00:51:46.760
believe that these
assumptions hold.

00:51:46.760 --> 00:51:49.720
How do we do that
causal inference?

00:51:49.720 --> 00:51:50.220
Yeah?

00:51:50.220 --> 00:51:51.802
AUDIENCE: I just had a
question on [INAUDIBLE]..

00:51:51.802 --> 00:51:54.687
If you know that some patients,
for instance, healthy patients,

00:51:54.687 --> 00:51:56.270
are never going to
get any treatment,

00:51:56.270 --> 00:51:58.890
should we just remove
them, basically?

00:51:58.890 --> 00:52:00.500
DAVID SONTAG: So
the question is,

00:52:00.500 --> 00:52:04.710
what happens if you have
a violation of overlap?

00:52:04.710 --> 00:52:08.240
For example, you know that
healthy individuals never

00:52:08.240 --> 00:52:09.770
receive any treatment.

00:52:09.770 --> 00:52:11.520
Should you remove them
from your data set?

00:52:11.520 --> 00:52:14.020
Well, first of all, that has
to do with how do you formalize

00:52:14.020 --> 00:52:16.160
the question because not
receiving a treatment

00:52:16.160 --> 00:52:18.250
is a treatment.

00:52:18.250 --> 00:52:21.880
So that might be your control
arm, just to be clear.

00:52:21.880 --> 00:52:24.160
Now, if you're asking about
the difference between two

00:52:24.160 --> 00:52:26.200
treatments-- two different
classes of treatment

00:52:26.200 --> 00:52:34.000
for a condition, then often one
defines the relevant inclusion

00:52:34.000 --> 00:52:40.990
criteria in order to have
these conditions hold.

00:52:40.990 --> 00:52:44.830
For example, we could try to
redefine the set of individuals

00:52:44.830 --> 00:52:47.140
that we're asking about
so that overlap does hold.

00:52:47.140 --> 00:52:48.640
But then in that
situation, you have

00:52:48.640 --> 00:52:51.520
to just make sure that your
policy is also modified.

00:52:51.520 --> 00:52:54.640
You say, OK, I conclude that
the average treatment effect is

00:52:54.640 --> 00:52:57.740
blah for this type of people.

00:52:57.740 --> 00:52:59.830
OK?

00:52:59.830 --> 00:53:01.730
OK.

00:53:01.730 --> 00:53:05.560
So how could we possibly compute
the average treatment effect

00:53:05.560 --> 00:53:07.835
from data?

00:53:07.835 --> 00:53:09.960
Remember, average treatment
effect, mathematically,

00:53:09.960 --> 00:53:13.635
is the expectation of the difference
between potential outcomes, Y1 minus Y0.

00:53:16.860 --> 00:53:20.250
The key tool which we'll use
in order to estimate that

00:53:20.250 --> 00:53:22.203
is what's known as the
adjustment formula.

00:53:22.203 --> 00:53:24.370
This goes by many names in
the statistics community,

00:53:24.370 --> 00:53:26.960
such as the G-formula as well.

00:53:26.960 --> 00:53:30.460
Here, I'll give you
a derivation of it.

00:53:30.460 --> 00:53:34.790
We're first going to recognize
that this expectation is

00:53:34.790 --> 00:53:36.830
actually two
expectations in one.

00:53:36.830 --> 00:53:39.770
It's the expectation
over individuals X

00:53:39.770 --> 00:53:43.575
and it's the expectation over
potential outcomes Y given X.

00:53:43.575 --> 00:53:45.200
So I'm first just
going to write it out

00:53:45.200 --> 00:53:47.330
in terms of those
two expectations,

00:53:47.330 --> 00:53:50.870
and I'll write the expectations
related to X on the outside.

00:53:50.870 --> 00:53:54.760
That goes by name of law
of total expectation.

00:53:54.760 --> 00:53:58.750
This is trivial at this stage.

00:53:58.750 --> 00:54:02.230
And by the way, I'm just
writing out expectation of Y1.

00:54:02.230 --> 00:54:04.900
In a few minutes, I'll
show you expectation of Y0,

00:54:04.900 --> 00:54:07.840
but it's going to be
exactly analogous.

00:54:07.840 --> 00:54:11.980
Now, the next step is
where we use ignorability.

00:54:11.980 --> 00:54:15.540
I told you I was going
to give that one away.

00:54:15.540 --> 00:54:19.000
So remember, we said
that we're assuming

00:54:19.000 --> 00:54:23.740
that Y1 is conditionally
independent of the treatment

00:54:23.740 --> 00:54:34.210
T given X. What that
means is probability of Y1

00:54:34.210 --> 00:54:39.910
given X is equal to
probability of Y1

00:54:39.910 --> 00:54:43.750
given X comma T equals
whatever-- in this case

00:54:43.750 --> 00:54:46.750
I'll just say T equals 1.

00:54:46.750 --> 00:54:52.220
This is implied by Y1 being
conditionally independent of T

00:54:52.220 --> 00:54:54.050
given X.

00:54:54.050 --> 00:54:59.640
So I can just stick in
comma T equals 1 here,

00:54:59.640 --> 00:55:05.090
and that's explicitly because
of ignorability holding.

00:55:05.090 --> 00:55:08.570
But now we're in a really good
place because notice that--

00:55:08.570 --> 00:55:10.760
and here I've just done
some short notation.

00:55:10.760 --> 00:55:14.391
I'm just going to
hide this expectation.

00:55:17.550 --> 00:55:19.840
And by the way, you could
do the same for Y0--

00:55:19.840 --> 00:55:21.640
Y1, Y0.

00:55:21.640 --> 00:55:26.330
And now notice
that we can replace

00:55:26.330 --> 00:55:30.140
this average treatment effect
with now this expectation

00:55:30.140 --> 00:55:32.440
with respect to
all individuals X

00:55:32.440 --> 00:55:35.720
of the expectation of Y1 given
X comma T equals 1, and so on.
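
NOTE
The derivation just walked through, written out compactly (step 2 is the
ignorability step; step 3 replaces the potential outcome by the observed
outcome among the treated, which is what the data actually contains):
  \begin{aligned}
  \mathbb{E}[Y_1] &= \mathbb{E}_{x \sim p(x)}\big[\mathbb{E}[Y_1 \mid X = x]\big]
      && \text{(law of total expectation)} \\
  &= \mathbb{E}_{x \sim p(x)}\big[\mathbb{E}[Y_1 \mid X = x,\, T = 1]\big]
      && \text{(ignorability: } Y_1 \perp T \mid X\text{)} \\
  &= \mathbb{E}_{x \sim p(x)}\big[\mathbb{E}[Y \mid X = x,\, T = 1]\big],
  \end{aligned}
and analogously for Y_0, giving
  \mathrm{ATE} = \mathbb{E}_{x \sim p(x)}\big[\mathbb{E}[Y \mid X = x, T = 1]
      - \mathbb{E}[Y \mid X = x, T = 0]\big].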

00:55:39.500 --> 00:55:43.600
And these are mostly
quantities that we can now

00:55:43.600 --> 00:55:45.560
observe from our data.

00:55:45.560 --> 00:55:50.980
So, for example, we can
look at the individuals who

00:55:50.980 --> 00:55:54.550
received treatment one,
and for those individuals

00:55:54.550 --> 00:55:56.690
we have realizations of Y1.

00:55:56.690 --> 00:55:58.940
We can look at individuals
who receive treatment zero,

00:55:58.940 --> 00:56:02.497
and for those individuals
we have realizations of Y0.

00:56:02.497 --> 00:56:04.330
And we could just average
those realizations

00:56:04.330 --> 00:56:07.810
to get estimates of the
corresponding expectations.

00:56:07.810 --> 00:56:10.150
So these we can easily
estimate from our data.

00:56:13.020 --> 00:56:14.660
And so we've made progress.

00:56:14.660 --> 00:56:18.665
We can now estimate some
part of this from our data.

00:56:18.665 --> 00:56:20.040
But notice, there
are some things

00:56:20.040 --> 00:56:22.123
that we can't yet directly
estimate from our data.

00:56:22.123 --> 00:56:27.450
In particular, we can't
estimate expectation of Y0

00:56:27.450 --> 00:56:31.500
given X comma T equals 1
because we have no idea what

00:56:31.500 --> 00:56:34.620
would have happened to this
individual who actually

00:56:34.620 --> 00:56:37.170
got treatment one if they
had gotten treatment zero.

00:56:37.170 --> 00:56:39.060
So these we don't know.

00:56:42.210 --> 00:56:45.540
So these we don't know.

00:56:45.540 --> 00:56:47.620
Now, what is the trick
I'm planning on you?

00:56:47.620 --> 00:56:50.030
How does it help
that we can do this?

00:56:50.030 --> 00:56:52.300
Well, the key point is
that these quantities

00:56:52.300 --> 00:56:56.790
that we can estimate from
data show up in that term.

00:56:56.790 --> 00:56:59.910
In particular, if you
look at the individuals X

00:56:59.910 --> 00:57:04.230
that you've sampled from the
full set of individuals P of X,

00:57:04.230 --> 00:57:07.650
for that individual
X for which, in fact,

00:57:07.650 --> 00:57:11.310
we observed T equals 1, then we
can estimate expectation of Y1

00:57:11.310 --> 00:57:16.430
given X comma T equals
1, and similarly for Y0.

00:57:16.430 --> 00:57:19.080
But what we need to be able
to do is to extrapolate.

00:57:19.080 --> 00:57:22.995
Because empirically, we only
have samples from P of X

00:57:22.995 --> 00:57:24.620
given T equals 1, P
of X given T equals

00:57:24.620 --> 00:57:27.830
0 for those two potential
outcomes correspondingly.

00:57:27.830 --> 00:57:31.670
But we are going to also
get samples of X such

00:57:31.670 --> 00:57:33.620
that for those individuals
in your data set,

00:57:33.620 --> 00:57:36.650
you might have only
observed T equals 0.

00:57:36.650 --> 00:57:41.180
And to compute this formula,
you have to answer, for that X,

00:57:41.180 --> 00:57:44.360
what would it have been if
they got treatment equals one?

00:57:44.360 --> 00:57:46.283
So there are going to
be a set of individuals

00:57:46.283 --> 00:57:47.950
that we have to
extrapolate for in order

00:57:47.950 --> 00:57:50.275
to use this adjustment
formula for estimate.

00:57:52.780 --> 00:57:53.280
Yep?

00:57:53.280 --> 00:57:55.405
AUDIENCE: I thought because
common support is true,

00:57:55.405 --> 00:57:58.010
we have some patients that
received each treatment

00:57:58.010 --> 00:58:00.150
for a given type of X.

00:58:00.150 --> 00:58:02.110
DAVID SONTAG: Yes.

00:58:02.110 --> 00:58:06.850
But now-- so, yes, that's true.

00:58:09.460 --> 00:58:13.920
But that's a statement
about infinite data.

00:58:13.920 --> 00:58:17.010
And in reality, one
only has finite data.

00:58:17.010 --> 00:58:22.590
And so although common support
has to hold to some extent,

00:58:22.590 --> 00:58:25.500
you can't just build on
that to say that you always

00:58:25.500 --> 00:58:29.240
observe the counterfactual
for every individual,

00:58:29.240 --> 00:58:30.990
such as the pictures
I showed you earlier.

00:58:33.740 --> 00:58:36.340
So I'm going to leave this slide
up for just one more second

00:58:36.340 --> 00:58:38.260
to let it sink in and
see what it's saying.

00:58:41.660 --> 00:58:44.480
We started out from the goal of
computing the average treatment

00:58:44.480 --> 00:58:48.330
effect, expected
value of Y1 minus Y0.

00:58:48.330 --> 00:58:50.970
Using the adjustment
formula, we've

00:58:50.970 --> 00:58:54.750
gotten to now an equivalent
representation, which

00:58:54.750 --> 00:58:58.080
is now an expectation with
respect to all individuals

00:58:58.080 --> 00:59:03.310
sampling from P of X
of expected value of Y1

00:59:03.310 --> 00:59:05.800
given X comma T equals
1, expected value of Y0

00:59:05.800 --> 00:59:08.060
given X comma T equals 0.

00:59:08.060 --> 00:59:10.223
For some of the individuals,
you can observe this,

00:59:10.223 --> 00:59:12.140
and for some of them,
you have to extrapolate.

00:59:14.670 --> 00:59:18.547
So from here, there are
many ways that one can go.

00:59:18.547 --> 00:59:20.130
Hold your question
for a little while.

00:59:23.180 --> 00:59:25.940
So types of causal
inference methods

00:59:25.940 --> 00:59:27.500
that you will have
heard of include

00:59:27.500 --> 00:59:29.090
things like
covariance adjustment,

00:59:29.090 --> 00:59:32.120
propensity score re-weighting,
doubly robust estimators,

00:59:32.120 --> 00:59:34.830
matching, and so on.

00:59:34.830 --> 00:59:37.520
And those are the tools of
the causal inference trade.

00:59:37.520 --> 00:59:39.320
And in this course,
we're only going

00:59:39.320 --> 00:59:40.520
to talk about the first two.

00:59:40.520 --> 00:59:41.750
And in today's
lecture, we're only

00:59:41.750 --> 00:59:44.083
going to talk about the first
one, covariate adjustment.

00:59:44.083 --> 00:59:47.610
And on Thursday, we'll
talk about the second one.

00:59:47.610 --> 00:59:50.690
So covariate adjustment
is a very natural way

00:59:50.690 --> 00:59:54.505
to try to do that extrapolation.

00:59:54.505 --> 00:59:56.880
It also goes by the name, by
the way, of response surface

00:59:56.880 --> 00:59:57.500
modeling.

00:59:57.500 --> 00:59:59.042
What we're going to
do is we're going

00:59:59.042 --> 01:00:04.010
to learn a function f, which
takes as an input X and T,

01:00:04.010 --> 01:00:06.500
and its goal is to predict
Y. So intuitively, you

01:00:06.500 --> 01:00:10.790
should think about f as
this conditional probability

01:00:10.790 --> 01:00:12.620
distribution.

01:00:12.620 --> 01:00:19.140
It's predicting Y
given X and T. So

01:00:19.140 --> 01:00:23.340
T is going to be an input
to the machine learning

01:00:23.340 --> 01:00:25.830
algorithm, which is going
to predict what would be

01:00:25.830 --> 01:00:30.850
the potential outcome Y for this
individual described by features

01:00:30.850 --> 01:00:42.720
X1 through Xd
under intervention T.

01:00:42.720 --> 01:00:44.710
So this is just from
the previous slide.

01:00:44.710 --> 01:00:46.290
And what we're going
to do now are--

01:00:46.290 --> 01:00:50.640
this is now where we get the
reduction to machine learning--

01:00:50.640 --> 01:00:53.820
is we're going to use empirical
risk minimization, or maybe

01:00:53.820 --> 01:00:57.480
some regularized empirical risk
minimization, to fit a function

01:00:57.480 --> 01:01:02.070
f which approximates the
expected value of YT given

01:01:02.070 --> 01:01:03.910
capital T equals little t,

01:01:03.910 --> 01:01:07.830
comma X. And then once
you have that function,

01:01:07.830 --> 01:01:10.260
we're going to be able
to use that to estimate

01:01:10.260 --> 01:01:15.420
the average treatment effect
by just implementing now

01:01:15.420 --> 01:01:16.863
this formula here.

01:01:16.863 --> 01:01:19.196
So we're going to first take
an expectation with respect

01:01:19.196 --> 01:01:20.790
to the individuals
in the data set.

01:01:20.790 --> 01:01:22.440
So we're going to
approximate that

01:01:22.440 --> 01:01:25.890
with an empirical expectation
where we sum over the little n

01:01:25.890 --> 01:01:28.370
individuals in your data set.

01:01:28.370 --> 01:01:29.870
Then what we're
going to do is we're

01:01:29.870 --> 01:01:36.620
going to estimate the first
term, which is f of Xi comma 1

01:01:36.620 --> 01:01:39.590
because that is approximating
the expected value of Y1

01:01:39.590 --> 01:01:41.330
given T comma X--

01:01:41.330 --> 01:01:43.970
T equals 1 comma
X. And we're going

01:01:43.970 --> 01:01:47.240
to approximate the second
term, which is just plugging

01:01:47.240 --> 01:01:49.750
now 0 for T instead of 1.

01:01:49.750 --> 01:01:51.950
And we're going to take the
difference between them,

01:01:51.950 --> 01:01:54.630
and that will be our estimator
of the average treatment

01:01:54.630 --> 01:01:55.130
effect.
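
NOTE
A minimal sketch of this covariate-adjustment estimator (the choice of a
random forest for f is an illustrative assumption, not the lecture's
prescription; any regressor fit by empirical risk minimization works here):
  import numpy as np
  from sklearn.ensemble import RandomForestRegressor
  def fit_f_and_estimate_ate(X, t, y):
      """Learn f(x, t) ~ E[Y | X = x, T = t], then average f(x,1) - f(x,0)."""
      f = RandomForestRegressor(n_estimators=200, random_state=0)
      f.fit(np.column_stack([X, t]), y)  # T enters as one more input feature
      f1 = f.predict(np.column_stack([X, np.ones(len(X))]))   # imputed Y1 for all
      f0 = f.predict(np.column_stack([X, np.zeros(len(X))]))  # imputed Y0 for all
      return f, (f1 - f0).mean()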

01:02:00.357 --> 01:02:02.065
Here's a natural place
to ask a question.

01:02:07.210 --> 01:02:12.578
One thing you might wonder
is, in your data set,

01:02:12.578 --> 01:02:14.870
you actually did observe
something for that individual,

01:02:14.870 --> 01:02:15.660
right.

01:02:15.660 --> 01:02:20.550
Notice how your raw data
doesn't show up in this at all.

01:02:20.550 --> 01:02:23.250
Because I've done
machine learning,

01:02:23.250 --> 01:02:27.030
and then I've thrown
away the observed Y's,

01:02:27.030 --> 01:02:30.330
and I used this estimator.

01:02:30.330 --> 01:02:33.120
So what you could have done--
an alternative formula, which,

01:02:33.120 --> 01:02:35.530
by the way, is also a
consistent estimator,

01:02:35.530 --> 01:02:38.280
would have been to
use the observed

01:02:38.280 --> 01:02:41.760
Y for whatever the factual
is and the imputed Y

01:02:41.760 --> 01:02:44.642
for the counterfactual using f.

01:02:44.642 --> 01:02:46.350
That would have also

01:02:46.350 --> 01:02:48.690
been a consistent estimator for
the average treatment effect.

01:02:48.690 --> 01:02:49.732
You could've done either.
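
NOTE
The alternative just mentioned, as a sketch reusing an f fit as in the
previous note: keep the observed outcome for the arm each person actually
received, and impute only the counterfactual arm with f.
  import numpy as np
  def estimate_ate_keeping_factuals(f, X, t, y):
      y_cf = f.predict(np.column_stack([X, 1 - t]))  # imputed counterfactual
      y1 = np.where(t == 1, y, y_cf)                 # observed or imputed Y1
      y0 = np.where(t == 0, y, y_cf)                 # observed or imputed Y0
      return (y1 - y0).mean()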

01:02:53.790 --> 01:02:54.290
OK.

01:02:57.050 --> 01:02:59.360
Now, sometimes you're
not interested in just

01:02:59.360 --> 01:03:00.680
the average treatment
effect, but you're actually

01:03:00.680 --> 01:03:02.555
interested in understanding
the heterogeneity

01:03:02.555 --> 01:03:03.647
in the population.

01:03:03.647 --> 01:03:05.480
Well, this also now
gives you an opportunity

01:03:05.480 --> 01:03:08.460
to try to explore
that heterogeneity.

01:03:08.460 --> 01:03:10.520
So for each
individual Xi, you can

01:03:10.520 --> 01:03:12.530
look at just the
difference between what

01:03:12.530 --> 01:03:16.580
f predicts for
treatment one and what f

01:03:16.580 --> 01:03:17.930
predicts given treatment zero.

01:03:17.930 --> 01:03:19.070
And the difference
between those is

01:03:19.070 --> 01:03:21.195
your estimate of your
conditional average treatment

01:03:21.195 --> 01:03:21.695
effect.

01:03:21.695 --> 01:03:23.445
So, for example, if
you want to figure out

01:03:23.445 --> 01:03:25.460
for this individual, what
is the optimal policy,

01:03:25.460 --> 01:03:27.667
you might look to see is
CATE positive or negative,

01:03:27.667 --> 01:03:29.750
or is it greater than some
threshold, for example?
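
NOTE
The per-individual version, sketched with the same fitted f (the zero
threshold is an illustrative choice; the right cutoff depends on the
outcome's sign convention and on costs):
  import numpy as np
  def cate_and_policy(f, X, threshold=0.0):
      f1 = f.predict(np.column_stack([X, np.ones(len(X))]))
      f0 = f.predict(np.column_stack([X, np.zeros(len(X))]))
      cate = f1 - f0                  # estimated CATE for each individual
      return cate, cate > threshold   # e.g., treat when estimated effect > 0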

01:03:32.148 --> 01:03:33.440
So let's look at some pictures.

01:03:36.300 --> 01:03:39.030
Now what we're using is we're
using that function f in order

01:03:39.030 --> 01:03:41.190
to impute those counterfactuals.

01:03:41.190 --> 01:03:43.920
And now we have those
observed, and we can actually

01:03:43.920 --> 01:03:45.540
compute the CATE.

01:03:45.540 --> 01:03:48.120
And averaging over those,
you can estimate now

01:03:48.120 --> 01:03:51.060
the average treatment effect.

01:03:51.060 --> 01:03:51.560
Yep?

01:03:51.560 --> 01:03:53.180
AUDIENCE: How is f non-biased?

01:03:54.968 --> 01:03:55.760
DAVID SONTAG: Good.

01:03:55.760 --> 01:03:57.008
So where can this go wrong?

01:03:57.008 --> 01:03:58.550
So what do you mean
by biased, first?

01:03:58.550 --> 01:03:59.168
I'll ask that.

01:03:59.168 --> 01:04:00.710
AUDIENCE: For
instance, as we've seen

01:04:00.710 --> 01:04:04.820
in the paper like pneumonia
and people who have asthma,

01:04:04.820 --> 01:04:07.830
[INAUDIBLE]

01:04:08.717 --> 01:04:11.300
DAVID SONTAG: Oh, thank you so
much for bringing that back up.

01:04:11.300 --> 01:04:15.350
So you're referring
to one of the readings

01:04:15.350 --> 01:04:17.000
for the course
from several weeks

01:04:17.000 --> 01:04:20.180
ago, where we talked about using
just a pure machine learning

01:04:20.180 --> 01:04:24.903
algorithm to try to predict
outcomes in a hospital setting.

01:04:24.903 --> 01:04:26.570
In particular, what
happens for patients

01:04:26.570 --> 01:04:29.780
who have pneumonia in
the emergency department?

01:04:29.780 --> 01:04:32.600
And if you all remember,
there was this asthma example,

01:04:32.600 --> 01:04:36.320
where patients with
asthma were predicted

01:04:36.320 --> 01:04:41.090
to have better outcomes than
patients without asthma.

01:04:43.700 --> 01:04:45.188
And you're calling that bias.

01:04:45.188 --> 01:04:46.980
But you remember, when
I taught about this,

01:04:46.980 --> 01:04:48.610
I called it biased due
to a particular thing.

01:04:48.610 --> 01:04:49.735
What's the language I used?

01:04:52.990 --> 01:04:58.978
I said bias due to
intervention, maybe, is what I--

01:04:58.978 --> 01:05:00.520
I can't remember
exactly what I said.

01:05:00.520 --> 01:05:02.170
[LAUGHTER]

01:05:02.170 --> 01:05:03.400
I don't know.

01:05:03.400 --> 01:05:06.240
Make it up.

01:05:06.240 --> 01:05:08.960
Now a textbook will be written
with bias by intervention.

01:05:08.960 --> 01:05:09.460
OK.

01:05:09.460 --> 01:05:12.160
So the problem
there is that they

01:05:12.160 --> 01:05:16.193
didn't formulize the
prediction problem correctly.

01:05:16.193 --> 01:05:17.860
The question that
they should have asked

01:05:17.860 --> 01:05:20.920
is, for asthma patients--

01:05:23.440 --> 01:05:31.150
what you really want to ask is a
question of X and then T and Y,

01:05:31.150 --> 01:05:39.360
where T are the interventions
that are done for asthmatics.

01:05:45.090 --> 01:05:48.450
So the failure of that
paper is that it ignored

01:05:48.450 --> 01:05:51.360
the causal inference question
which was hidden in the data,

01:05:51.360 --> 01:05:54.120
and it just went to predict
Y given X marginalizing

01:05:54.120 --> 01:05:55.320
over T altogether.

01:05:55.320 --> 01:05:59.070
So T was never in
the predictive model.

01:05:59.070 --> 01:06:01.870
And said differently, they never
asked counterfactual questions

01:06:01.870 --> 01:06:04.200
of what would have happened
had you done a different T.

01:06:04.200 --> 01:06:06.640
And then they still used it
to try to guide some treatment

01:06:06.640 --> 01:06:07.140
decisions.

01:06:07.140 --> 01:06:09.840
Like, for example, should
you send this person home,

01:06:09.840 --> 01:06:12.173
or should you keep them for
careful monitoring or so on?

01:06:12.173 --> 01:06:14.415
So this is exactly
the same example

01:06:14.415 --> 01:06:16.260
as I gave in the
beginning of the lecture,

01:06:16.260 --> 01:06:19.020
where I said if you just
use a risk stratification

01:06:19.020 --> 01:06:23.430
model to make some decisions,
you run the risk that you're

01:06:23.430 --> 01:06:27.720
making the wrong decisions
because those predictions were

01:06:27.720 --> 01:06:30.360
biased by decisions
in your data.

01:06:30.360 --> 01:06:32.580
So that doesn't happen here
because we're explicitly

01:06:32.580 --> 01:06:35.320
accounting for T in
all of our analysis.

01:06:35.320 --> 01:06:35.980
Yep?

01:06:35.980 --> 01:06:38.330
AUDIENCE: In the data sets
that we've used, like MIMIC,

01:06:38.330 --> 01:06:39.922
how much treatment
information exists?

01:06:39.922 --> 01:06:41.880
DAVID SONTAG: So how much
treatment information

01:06:41.880 --> 01:06:42.380
is in MIMIC?

01:06:42.380 --> 01:06:44.880
A ton.

01:06:44.880 --> 01:06:48.240
In fact, one of the
readings for next week

01:06:48.240 --> 01:06:52.350
is going to be about trying to
understand how one could manage

01:06:52.350 --> 01:06:58.920
sepsis, which is a condition
caused by infection, which

01:06:58.920 --> 01:07:02.670
is managed by, for example,
giving broad spectrum

01:07:02.670 --> 01:07:05.850
antibiotics, giving
fluids, giving

01:07:05.850 --> 01:07:07.602
pressors and ventilators.

01:07:07.602 --> 01:07:09.060
And all of those
are interventions,

01:07:09.060 --> 01:07:11.227
and all those interventions
are recorded in the data

01:07:11.227 --> 01:07:13.590
so that one could then ask
counterfactual questions

01:07:13.590 --> 01:07:14.880
from the data, like
what would have happened

01:07:14.880 --> 01:07:16.170
if this patient
had they received

01:07:16.170 --> 01:07:17.545
a different set
of interventions?

01:07:17.545 --> 01:07:20.010
Would we have prolonged
their life, for example?

01:07:20.010 --> 01:07:24.383
And so in an intensive care unit
setting, most of the questions

01:07:24.383 --> 01:07:26.550
that we want to ask about,
not all, but many of them

01:07:26.550 --> 01:07:29.010
are about dynamic treatments
because it's not just

01:07:29.010 --> 01:07:30.698
a single treatment
but really about

01:07:30.698 --> 01:07:32.490
a sequence of
treatments responding

01:07:32.490 --> 01:07:34.053
to the current
patient condition.

01:07:34.053 --> 01:07:36.720
And so that's where we'll really
start to get into that material

01:07:36.720 --> 01:07:40.310
next week, not in
today's lecture.

01:07:40.310 --> 01:07:41.300
Yep?

01:07:41.300 --> 01:07:44.022
AUDIENCE: How do you make sure
that your f function really

01:07:44.022 --> 01:07:46.388
learned from the relationship
between T and the outcome?

01:07:46.388 --> 01:07:48.180
DAVID SONTAG: That's
a phenomenal question.

01:07:48.180 --> 01:07:50.810
Where were you
this whole course?

01:07:50.810 --> 01:07:51.810
Thank you for asking it.

01:07:51.810 --> 01:07:53.640
So I'll repeat it.

01:07:53.640 --> 01:07:56.100
How do you know that
your function f actually

01:07:56.100 --> 01:07:59.850
learned something about the
relationship between the input

01:07:59.850 --> 01:08:04.730
X and the treatment
T and the outcome?

01:08:04.730 --> 01:08:07.070
And that really gets
to the question of,

01:08:07.070 --> 01:08:09.410
is my reduction actually valid?

01:08:09.410 --> 01:08:19.979
So I've taken this
problem and I've

01:08:19.979 --> 01:08:23.350
reduced it to this machine
learning problem, where

01:08:23.350 --> 01:08:27.340
I take my data, and
literally I just

01:08:27.340 --> 01:08:29.770
learn a function f to
try to predict well

01:08:29.770 --> 01:08:32.550
the observations in the data.

01:08:32.550 --> 01:08:34.705
And how do we know that
that function f actually

01:08:34.705 --> 01:08:36.330
does a good job at
estimating something

01:08:36.330 --> 01:08:38.682
like average treatment effect?

01:08:38.682 --> 01:08:41.250
In fact, it might not.

01:08:41.250 --> 01:08:44.250
And this is where
things start to get

01:08:44.250 --> 01:08:47.460
really tricky, particularly
with high dimensional data.

01:08:47.460 --> 01:08:51.520
Because it could happen, for
example, that your treatment

01:08:51.520 --> 01:08:55.470
decision is only one of a huge
number of factors that affect

01:08:55.470 --> 01:08:59.130
the outcome Y. And it
could be that a much more

01:08:59.130 --> 01:09:02.130
important factor is hidden in
X. And because you don't have

01:09:02.130 --> 01:09:05.640
much data, and because you have
to regularize your learning

01:09:05.640 --> 01:09:08.100
algorithm, let's say, with L1
or L2 regularization or maybe

01:09:08.100 --> 01:09:10.590
early stopping if you're
using deep neural network,

01:09:10.590 --> 01:09:15.790
your algorithm might never learn
the actual dependence on T.

01:09:15.790 --> 01:09:19.859
It might learn just to
throw away T and just

01:09:19.859 --> 01:09:23.649
use X to predict Y.
And if that's the case,

01:09:23.649 --> 01:09:26.160
you will never be able to
infer these average treatment

01:09:26.160 --> 01:09:27.750
effects accurately.

01:09:27.750 --> 01:09:29.545
You'll have huge errors.
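
NOTE
A hedged demo of this failure mode (the data-generating process is
invented): with many strong covariates, few samples, and L1
regularization, the learned f can zero out the coefficient on T entirely,
so the plug-in estimate of the true, nonzero treatment effect collapses
toward 0 even though predictive error looks respectable.
  import numpy as np
  from sklearn.linear_model import Lasso
  rng = np.random.default_rng(3)
  n, d = 200, 50
  X = rng.normal(size=(n, d))
  t = (rng.uniform(size=n) < 0.5).astype(int)
  y = X @ rng.normal(2.0, 1.0, size=d) - 0.5 * t + rng.normal(size=n)  # true ATE -0.5
  f = Lasso(alpha=0.5).fit(np.column_stack([X, t]), y)
  print(f"coefficient on T: {f.coef_[-1]:+.3f}")  # often shrunk to exactly 0
  f1 = f.predict(np.column_stack([X, np.ones(n)]))
  f0 = f.predict(np.column_stack([X, np.zeros(n)]))
  print(f"plug-in ATE estimate: {(f1 - f0).mean():+.3f}  (truth: -0.500)")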

01:09:29.545 --> 01:09:31.170
And that gets back
to one of the slides

01:09:31.170 --> 01:09:33.990
that I skipped, where I
started out from this picture.

01:09:33.990 --> 01:09:36.990
This is the machine learning
picture saying, OK, a reduction

01:09:36.990 --> 01:09:38.729
to machine learning is--

01:09:38.729 --> 01:09:40.229
now you add an
additional feature,

01:09:40.229 --> 01:09:41.760
which is your
treatment decision,

01:09:41.760 --> 01:09:44.795
and you learn that
black box function f.

01:09:44.795 --> 01:09:46.920
But this is where machine
learning causal inference

01:09:46.920 --> 01:09:50.100
starts to differ because
we don't actually

01:09:50.100 --> 01:09:55.703
care about the quality
of predicting Y.

01:09:55.703 --> 01:09:57.495
We can measure your
root mean squared error

01:09:57.495 --> 01:10:00.630
in predicting Y given your
X's and T's, and that error

01:10:00.630 --> 01:10:02.550
might be low.

01:10:02.550 --> 01:10:05.400
But you can run into
these failure modes

01:10:05.400 --> 01:10:08.130
where it just completely
ignores T, for example.

01:10:08.130 --> 01:10:10.263
So T is special here.

01:10:10.263 --> 01:10:12.180
So really, the picture
we want to have in mind

01:10:12.180 --> 01:10:15.760
is that T is some
parameter of interest.

01:10:15.760 --> 01:10:19.500
We want to learn a model f
such that if we twiddle T,

01:10:19.500 --> 01:10:22.500
we can see how there is a
differential effect on Y based

01:10:22.500 --> 01:10:24.297
on twiddling T.
That's what we truly

01:10:24.297 --> 01:10:26.130
care about when we're
using machine learning

01:10:26.130 --> 01:10:28.290
for causal inference.

01:10:28.290 --> 01:10:30.150
And so that's really
the gap, that's

01:10:30.150 --> 01:10:32.930
the gap in our
understanding today.

01:10:32.930 --> 01:10:34.680
And it's really an
active area of research

01:10:34.680 --> 01:10:37.320
to figure out how do you change
the whole machine learning

01:10:37.320 --> 01:10:40.938
paradigm to recognize that when
you're using machine learning

01:10:40.938 --> 01:10:42.480
for causal inference,
you're actually

01:10:42.480 --> 01:10:44.995
interested in something
a little bit different.

01:10:44.995 --> 01:10:47.370
And by the way, that's a major
area of my lab's research,

01:10:47.370 --> 01:10:49.037
and we just published
a series of papers

01:10:49.037 --> 01:10:50.503
trying to answer that question.

01:10:50.503 --> 01:10:52.170
Beyond the scope of
this course, but I'm

01:10:52.170 --> 01:10:56.370
happy to send you those
papers if anyone's interested.

01:10:56.370 --> 01:11:00.880
So that type of question
is extremely important.

01:11:00.880 --> 01:11:04.740
It doesn't show up quite as much
when your X's aren't very high

01:11:04.740 --> 01:11:07.560
dimensional and where
things like regularization

01:11:07.560 --> 01:11:09.090
don't become important.

01:11:09.090 --> 01:11:11.310
But once your X becomes
high dimensional

01:11:11.310 --> 01:11:14.160
and once you want to start to
consider more and more complex

01:11:14.160 --> 01:11:16.050
f's during your
fitting, like you

01:11:16.050 --> 01:11:18.510
want to use deep neural
networks, for example,

01:11:18.510 --> 01:11:22.650
these differences in goals
become extremely important.

01:11:35.790 --> 01:11:37.930
So there are other ways
in which things can fail.

01:11:37.930 --> 01:11:43.205
So I want to give you
here an example where--

01:11:43.205 --> 01:11:44.580
shoot, I'm answering
my question.

01:11:46.930 --> 01:11:47.430
OK.

01:11:50.840 --> 01:11:52.520
No one saw that slide.

01:11:52.520 --> 01:11:54.650
Question-- where did
the overlap assumptions

01:11:54.650 --> 01:11:59.930
show up in our approach for
estimating average treatment

01:11:59.930 --> 01:12:02.705
effect using
covariate adjustment?

01:12:17.580 --> 01:12:19.115
Let me go back to the formula.

01:12:24.630 --> 01:12:27.743
Someone who hasn't
spoken today, hopefully.

01:12:27.743 --> 01:12:28.910
You can be wrong, it's fine.

01:12:31.430 --> 01:12:32.415
Yeah, in the back?

01:12:32.415 --> 01:12:34.290
AUDIENCE: Is it the
person with the same age

01:12:34.290 --> 01:12:37.520
receiving treatment
A and treatment B?

01:12:37.520 --> 01:12:43.917
DAVID SONTAG: So maybe you have
an individual with some age--

01:12:43.917 --> 01:12:45.500
we're going to want
to be able to look

01:12:45.500 --> 01:12:48.358
at the difference between what
f predicts for that individual

01:12:48.358 --> 01:12:50.150
if they got treatment
A versus treatment B,

01:12:50.150 --> 01:12:52.100
or one versus zero.

01:12:52.100 --> 01:12:57.500
And let me try to lead
this a little bit.

01:12:57.500 --> 01:12:59.000
And it might happen
in your data set

01:12:59.000 --> 01:13:04.310
that for individuals
like them, you only ever

01:13:04.310 --> 01:13:07.100
observe treatment one and
there's no one even remotely

01:13:07.100 --> 01:13:09.800
like them who you
observe treatment zero.

01:13:09.800 --> 01:13:13.690
So what's this function
going to output then

01:13:13.690 --> 01:13:17.290
when you input zero for
that second argument?

01:13:17.290 --> 01:13:19.850
Everyone say out loud.

01:13:19.850 --> 01:13:22.200
Garbage?

01:13:22.200 --> 01:13:22.700
Right?

01:13:22.700 --> 01:13:27.710
If in your data set you never
observed anyone even remotely

01:13:27.710 --> 01:13:31.730
similar to Xi who
received treatment zero,

01:13:31.730 --> 01:13:34.428
then this function is basically
undefined for that individual.

01:13:34.428 --> 01:13:36.470
I mean, yeah, your function
will output something

01:13:36.470 --> 01:13:41.610
because you fit it, but it's not
going to be the right answer.

01:13:41.610 --> 01:13:45.530
And so that's where this
assumption starts to show up.

01:13:45.530 --> 01:13:49.910
When one talks about the
sample complexity of learning

01:13:49.910 --> 01:13:53.030
these functions f to do
covariate adjustment,

01:13:53.030 --> 01:13:55.730
and when one talks
about the consistency

01:13:55.730 --> 01:13:57.140
of these arguments--
for example,

01:13:57.140 --> 01:13:58.640
you'd like to be
able to make claims

01:13:58.640 --> 01:14:01.220
that as the amount of
data grows to, let's

01:14:01.220 --> 01:14:04.430
say, infinity, that this is
the right answer-- gives you

01:14:04.430 --> 01:14:05.480
the right estimate.

01:14:05.480 --> 01:14:07.490
So that's the type
of proof which

01:14:07.490 --> 01:14:10.560
is often given in the
causal inference literature.

01:14:10.560 --> 01:14:13.920
Well, if you have overlap,
then as the amount of data

01:14:13.920 --> 01:14:18.138
goes to infinity, you
will observe someone,

01:14:18.138 --> 01:14:19.930
like the person who
received treatment one,

01:14:19.930 --> 01:14:22.120
you'll observe someone who
also received treatment zero.

01:14:22.120 --> 01:14:23.920
It might have taken you a huge
amount of data to get there

01:14:23.920 --> 01:14:26.110
because treatment zero
might have been much less

01:14:26.110 --> 01:14:27.520
likely than treatment one.

01:14:27.520 --> 01:14:30.970
But because the probability
of treatment zero is not zero,

01:14:30.970 --> 01:14:32.810
eventually you'll see
someone like that.

01:14:32.810 --> 01:14:34.477
And so eventually
you'll get enough data

01:14:34.477 --> 01:14:37.120
in order to learn a function
which can extrapolate correctly

01:14:37.120 --> 01:14:39.700
for that individual.

01:14:39.700 --> 01:14:43.930
And so that's where
overlap comes in

01:14:43.930 --> 01:14:46.450
in giving that type of
consistency argument.

01:14:46.450 --> 01:14:51.100
Of course, in reality, you
never have infinite data.

01:14:51.100 --> 01:14:54.280
And so these questions
about trade-offs

01:14:54.280 --> 01:14:56.050
between the amount
of data you have

01:14:56.050 --> 01:14:59.470
and the fact that
you never truly have

01:14:59.470 --> 01:15:02.350
empirical overlap with
a small amount of data,

01:15:02.350 --> 01:15:05.380
and answering when can
you extrapolate correctly

01:15:05.380 --> 01:15:07.750
despite that is the
critical question

01:15:07.750 --> 01:15:09.800
that one needs to answer,
but is, by the way,

01:15:09.800 --> 01:15:11.662
not studied very well
in the literature

01:15:11.662 --> 01:15:13.870
because people don't usually
think in terms of sample

01:15:13.870 --> 01:15:16.148
complexity in that field.

01:15:16.148 --> 01:15:18.190
That's where computer
scientists can start really

01:15:18.190 --> 01:15:20.560
to contribute to this
literature and bringing things

01:15:20.560 --> 01:15:22.060
that we often think
about in machine

01:15:22.060 --> 01:15:26.120
learning to this new topic.

01:15:26.120 --> 01:15:30.110
So I've got a couple
of minutes left.

01:15:30.110 --> 01:15:31.860
Are there any other
questions, or should I

01:15:31.860 --> 01:15:33.840
introduce some new
material in one minute?

01:15:33.840 --> 01:15:35.550
Yeah?

01:15:35.550 --> 01:15:38.160
AUDIENCE: So you said that
the average treatment effect

01:15:38.160 --> 01:15:40.020
estimator here is consistent.

01:15:40.020 --> 01:15:43.070
But does it matter if
we choose the wrong--

01:15:43.070 --> 01:15:46.830
do we have to choose some
functional form from the features

01:15:46.830 --> 01:15:47.413
to the effect?

01:15:47.413 --> 01:15:48.622
DAVID SONTAG: Great question.

01:15:48.622 --> 01:15:51.710
AUDIENCE: Is it consistent even
if we choose a completely wrong

01:15:51.710 --> 01:15:52.582
function or formula?

01:15:52.582 --> 01:15:53.290
DAVID SONTAG: No.

01:15:53.290 --> 01:15:53.910
AUDIENCE: That's
a different thing?

01:15:53.910 --> 01:15:54.520
DAVID SONTAG: No, no.

01:15:54.520 --> 01:15:56.103
You're asking all
the right questions.

01:15:56.103 --> 01:15:58.050
Good job today, everyone.

01:15:58.050 --> 01:16:00.090
So, no.

01:16:00.090 --> 01:16:03.750
If you walk through
that argument I made,

01:16:03.750 --> 01:16:04.830
I assume two things.

01:16:04.830 --> 01:16:06.660
First, that you observe
enough data such

01:16:06.660 --> 01:16:12.330
that you can have any chance
of extrapolating correctly.

01:16:12.330 --> 01:16:13.943
But then implicit
in that statement

01:16:13.943 --> 01:16:15.360
is that you're
choosing a function

01:16:15.360 --> 01:16:17.130
family which is
powerful enough that it

01:16:17.130 --> 01:16:19.060
can extrapolate correctly.

01:16:19.060 --> 01:16:21.255
So if your true function is--

01:16:24.010 --> 01:16:28.970
if you think back to this
figure I showed you here,

01:16:28.970 --> 01:16:30.830
if the true potential
outcome functions are

01:16:30.830 --> 01:16:34.250
these quadratic functions
and you're fitting them

01:16:34.250 --> 01:16:36.340
with a linear function,
then no matter

01:16:36.340 --> 01:16:37.840
how much data you
have you're always

01:16:37.840 --> 01:16:42.230
going to get wrong estimates
because this argument really

01:16:42.230 --> 01:16:45.080
requires that you're considering
more and more complex

01:16:45.080 --> 01:16:48.710
non-linearity as your
amount of data grows.

01:16:48.710 --> 01:16:51.050
So now here's a visual
depiction of what can go wrong

01:16:51.050 --> 01:16:53.490
if you don't have overlap.

01:16:53.490 --> 01:16:55.070
So now I've taken out--

01:16:55.070 --> 01:16:57.530
previously, I had one or two
red points over here and one

01:16:57.530 --> 01:17:00.080
or two blue points over here,
but I've taken those out.

01:17:00.080 --> 01:17:02.460
So in your data all you
have are these blue points

01:17:02.460 --> 01:17:03.335
and those red points.

01:17:06.500 --> 01:17:09.350
So all you have are
the points, and now one

01:17:09.350 --> 01:17:12.140
can learn as good functions,
as you can imagine, to try to,

01:17:12.140 --> 01:17:14.840
let's say, minimize the mean
squared error of predicting

01:17:14.840 --> 01:17:17.840
these blue points and minimize
the mean squared error

01:17:17.840 --> 01:17:19.400
of predicting those red points.

01:17:19.400 --> 01:17:21.530
And what you might get
out is something-- maybe

01:17:21.530 --> 01:17:23.130
you'll decide on
a linear function.

01:17:23.130 --> 01:17:26.090
That's as good as you
could do if all you

01:17:26.090 --> 01:17:28.940
have are those red points.

01:17:28.940 --> 01:17:30.910
And so even if you were
willing to consider

01:17:30.910 --> 01:17:34.010
more and more complex
hypothesis classes,

01:17:34.010 --> 01:17:36.950
here, if you tried to consider
a more complex hypothesis

01:17:36.950 --> 01:17:39.620
class than this line, you'd
probably just be overfitting

01:17:39.620 --> 01:17:41.360
to the data you have.

01:17:41.360 --> 01:17:44.750
And so you decide
on that line, which,

01:17:44.750 --> 01:17:47.480
because you had
no data over here,

01:17:47.480 --> 01:17:51.680
you don't even know that it's
not a good fit to the data.

01:17:51.680 --> 01:17:53.360
And then you notice
that you're getting

01:17:53.360 --> 01:17:54.485
completely wrong estimates.

01:17:54.485 --> 01:17:58.050
For example, if you asked about
the CATE for a young person,

01:17:58.050 --> 01:18:01.610
it would have the wrong sign
over here because the two

01:18:01.610 --> 01:18:03.200
lines flipped.

01:18:03.200 --> 01:18:07.760
So that's an example of how
one can start to get errors.
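
NOTE
A sketch of the picture just described (the quadratics below are invented
for the demo): each arm is observed on only one side of the age axis, the
true potential outcomes are quadratic, and the extrapolated linear fits
give a CATE with the wrong sign for a young person.
  import numpy as np
  y0_true = lambda a: 0.02 * (a - 50) ** 2 + 2.0    # outcome under t = 0
  y1_true = lambda a: 13.0 - 0.013 * (a - 45) ** 2  # outcome under t = 1
  age0 = np.linspace(20, 50, 50)   # t = 0 observed only for the young
  age1 = np.linspace(50, 80, 50)   # t = 1 observed only for the old
  line0 = np.polyfit(age0, y0_true(age0), 1)        # best line on each support
  line1 = np.polyfit(age1, y1_true(age1), 1)
  a = 25.0                                          # a young person
  cate_true = y1_true(a) - y0_true(a)
  cate_hat = np.polyval(line1, a) - np.polyval(line0, a)
  print(f"true CATE at age 25:      {cate_true:+.1f}")  # negative
  print(f"estimated CATE at age 25: {cate_hat:+.1f}")   # positive: wrong sign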

01:18:07.760 --> 01:18:10.790
And when we begin on
Thursday's lecture,

01:18:10.790 --> 01:18:13.610
we're going to pick up right
where we left off today,

01:18:13.610 --> 01:18:17.370
and I'll talk about this issue
a little bit more in detail.

01:18:17.370 --> 01:18:21.290
I'll talk about how, if one were
to learn a linear function, how

01:18:21.290 --> 01:18:23.090
one could actually,
under the assumption

01:18:23.090 --> 01:18:25.130
that the true potential
outcomes are linear,

01:18:25.130 --> 01:18:27.500
how one could actually
interpret the coefficients

01:18:27.500 --> 01:18:29.435
of that linear function
in a causal way

01:18:29.435 --> 01:18:31.310
under the very strong
assumption that the two

01:18:31.310 --> 01:18:32.700
potential outcomes are linear.

01:18:32.700 --> 01:18:35.180
So that's what we'll
return to on Thursday.