Lecture 14: Causal Inference, Part 1

Download the video from Internet Archive.

Prof. Sontag discusses causal inference, examples of causal questions, and how these guide treatment decisions. He explains the Rubin-Neyman causal model as a potential outcome framework.

Speaker: David Sontag

Lecture 14: Causal Inference, Part 1 slides (PDF - 2.2MB)

DAVID SONTAG: So today's lecture is going to be about causality. Who's heard about causality before? Raise your hand. What's the number one thing that you hear about when thinking about causality? Yeah?

AUDIENCE: Correlation does not imply causation.

DAVID SONTAG: Correlation does not imply causation. Anything else come to mind? That's what came to my mind. Anything else come to mind? So up until now in the semester, we've been talking about purely predictive questions. And for purely predictive questions, one could argue that correlation is good enough. If we have some signs in our data that are predictive of some outcome of interest, we want to be able to take advantage of that. Whether it's upstream, downstream, the causal directionality is irrelevant for that purpose.

Although even that isn't quite true, right, because Pete and I have been hinting throughout the semester that there are times when the data changes on you, for example, when you go from one institution to another or when you have non-stationarity. And in those situations, having a deeper understanding about the data might allow one to build in additional robustness to that type of data set shift. But there are other reasons as well why understanding something about your underlying data generating processes can be really important.

It's because often, the questions that we want to answer when it comes to health care are not predictive questions, they're causal questions. And so what I'll do now is I'll walk through a few examples of what I mean by this. Let's start out with what we saw in Lecture 4 and in Problem Set 2, where we looked at the question of how we can do early detection of type 2 diabetes. You used the Truven MarketScan data set to build a risk stratification algorithm for detecting who is going to be newly diagnosed with diabetes one to three years from now.

And if you think about how one might then try to deploy that algorithm, you might, for example, try to get patients into the clinic to get them diagnosed. But the next set of questions are usually about the so what question. What are you going to do based on that prediction? Once diagnosed, how will you intervene? And at the end of the day, the interesting goal is not one of how do you find them early, but how do you prevent them from developing diabetes? Or how do you prevent the patient from developing complications of diabetes?

And those are questions about causality. Now, when we built a predictive model and we inspected the weights, we might have noticed some interesting things. For example, if you looked at the highest negative weights, which I'm not sure if we did as part of the assignment but is something that I did as part of my research study, you see that gastric bypass surgery has the biggest negative weight. Does that mean that if you give an obese person gastric bypass surgery, that will prevent them from developing type 2 diabetes?

That's an example of a causal question which is raised by this predictive model. But just by looking at the weight alone, as I'll show you this week, you won't be able to correctly infer that there is a causal relationship. And so part of what we will be doing is coming up with a mathematical language for thinking about how does one answer, is there a causal relationship here? Here's a second example. Right before spring break we had a series of lectures about diagnosis, particularly diagnosis from imaging data of a variety of kinds, whether it be radiology or pathology.

And often, questions are of this sort. Here is a woman's breast. She has breast cancer. Maybe you have an associated pathology slide as well. And you want to know what is the risk of this person dying in the next five years. So one can take a deep learning model, learn to predict what one observes. So in the patient in your data set, you have the input and you have, let's say, survival time. And you might use that to predict something about how long it takes from diagnosis to death.

And based on those predictions, you might take actions. For example, if you predict that a patient is not risky, then you might conclude that they don't need to get treatment. But that could be really, really dangerous, and I'll just give you one example of why that could be dangerous. These predictive models, if you're learning them in this way, the outcome, in this case let's say time to death, is going to be affected by what's happened in between.

So, for example, this patient might have been receiving treatment, and because of them receiving treatment in between the time from diagnosis to death, it might have prolonged their life. And so for this patient in your data set, you might have observed that they lived a very long time. But if you ignore what happens in between and you simply learn to predict y from X, X being the input, then a new patient comes along and you predicted that new patient is going to survive a long time, and it would be completely the wrong conclusion to say that you don't need to treat that patient. Because, in fact, the only reason the patients like them in the training data lived a long time is because they were treated.

And so when it comes to this field of machine learning and health care, we need to think really carefully about these types of questions because an error in the way that we formalize our problem could kill people because of mistakes like this. Now, other questions are ones about not how do we predict outcomes but how do we guide treatment decisions. So, for example, as data from pathology gets richer and richer and richer, we might think that we can now use computers to try to better predict who is likely to benefit from a treatment than humans could do alone.

But the challenge with using algorithms to do that is that people respond differently to treatment, and the data which is being used to guide treatment is biased based on existing treatment guidelines. So, similarly, to the previous question, we could ask, what would happen if we trained to predict past treatment decisions? This would be the most naive way to try to use data to guide treatment decisions. So maybe you see David gets treatment A, John gets treatment B, Juana gets treatment A. And you might ask then, OK, a new patient comes in, what should this new patient be treated with?

And if you've just learned a model to predict from what you know about the treatment that David is likely to get, then the best that you could hope to do is to do as well as existing clinical practice. So if we want to go beyond current clinical practice, for example, to recognize that there is heterogeneity in treatment response, then we have to somehow change the question that we're asking. I'll give you one last example, which is perhaps a more traditional question of, does X cause y? For example, does smoking cause lung cancer is a major question of societal importance.

Now, you might be familiar with the traditional way of trying to answer questions of this nature, which would be to do a randomized controlled trial. Except this isn't exactly the type of setting where you could do randomized controlled trials. How would you feel if you were a smoker and someone came up to you and said, you have to stop smoking because I need to see what happens? Or how would you feel if you were a non-smoker and someone came up to you and said, you have to start smoking? That would be both not feasible and completely unethical.

And so if we want to try to answer questions like this from data, we need to start thinking about how can we design, using observational data, ways of answering questions like this. And the challenge is that there's going to be bias in the data because of who decides to smoke and who decides not to smoke. So, for example, the most naive way you might try to answer this question would be to look at the conditional likelihood of getting lung cancer among smokers and getting lung cancer among non-smokers. But those numbers, as you'll see in the next few slides, can be very misleading because there might be confounding factors, factors that would, for example, both cause people to be a smoker and cause them to develop lung cancer, which would drive a difference between these two numbers.

And we'll have a very concrete example of this in just a few minutes. So to properly answer all of these questions, one needs to be thinking in terms of causal graphs. So rather than the traditional setup in machine learning where you just have inputs and outputs, now we need to have triplets. Rather than having inputs and outputs, we need to be thinking of inputs, interventions, and outcomes or outputs. So we now need to have three quantities in mind. And we have to start thinking about, well, what is the causal relationship between these three?

So for those of you who have taken more graduate level machine learning classes, you might be familiar with ideas such as Bayesian networks. And when I went to undergrad and grad school and I studied machine learning, for the longest time I thought causal inference had to do with learning causal graphs. So this is what I thought causal inference was about. You have data of the following nature-- 1, 0, 0, 1, dot, dot, dot. So here, there are four random variables. I'm showing the realizations of those four binary variables one per row, and you have a data set like this.

And I thought causal inference had to do with taking data like this and trying to figure out, is the underlying Bayesian network that created that data, is it X1 goes to X2 goes to X3 goes to X4? Or I'll say, this is X1, that's X2, X3, and X4. Or maybe the causal graph is X1 to X2 to X3 to X4. And trying to distinguish between these different causal graphs from observational data is one type of question that one can ask. And the one thing you learn in traditional machine learning treatments of this is that sometimes you can't distinguish between these causal graphs from the data you have.

For example, suppose you just had two random variables. Any distribution can be represented as the probability of X1 times the probability of X2 given X1, by the chain rule of conditional probability. And similarly, any distribution can be represented the opposite way, as the probability of X2 times the probability of X1 given X2, which would look like this. So the statement that one would make is that if you just had data involving X1 and X2, you couldn't distinguish between these two causal graphs, X1 causes X2 or X2 causes X1.
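Concretely, the two factorizations being appealed to here are just the chain rule of probability, so both causal graphs can represent exactly the same set of joint distributions:

```latex
p(x_1, x_2) \;=\; p(x_1)\, p(x_2 \mid x_1) \;=\; p(x_2)\, p(x_1 \mid x_2)
```

Any data set of (X1, X2) pairs is therefore equally consistent with X1 causing X2 and with X2 causing X1.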

And usually another treatment would say, OK, but if you have a third variable and you have a V structure or something like X1 goes to X2, X1 goes to X3, this you could distinguish from, let's say, a chain structure. And then the final answer to what is causal inference from this philosophy would be something like, OK, if you're in a setting like this and you can't distinguish between X1 causes X2 or X2 causes X1, then you do some interventions, like you intervene on X1 and you look to see what happens to X2, and that'll help you disentangle these directions of causality.

None of this is what we're going to be talking about today. Today, we're going to be talking about the simplest, simplest possible setting you could imagine, that graph shown up there. You have three sets of random variables, X, which is perhaps a vector, so it's high dimensional, a single random variable T, and a single random variable Y. And we know the causal graph here. We're going to suppose that we know the directionality, that we know that X might cause T and X and T might cause Y. And the only thing we don't know is the strength of the edges.

All right. And so now let's try to think through this in context of the previous examples. Yeah, question?

AUDIENCE: Just to make sure-- so T does not affect X in any way?

DAVID SONTAG: Correct, that's the assumption we're going to make here. So let's try to instantiate this. So we'll start with this example. X might be what you know about the patient at diagnosis. T, I'm going to assume for the purposes of today's class, is a decision between two different treatment plans. And I'm going to simplify the state of the world. I'm going to say those treatment plans only depend on what you know about the patient at diagnosis.

So at diagnosis, you decide, I'm going to be giving them this sequence of treatments at this three-month interval or this other sequence of treatment at, maybe, that four-month interval. And you make that decision just based on diagnosis and you don't change it based on anything you observe. Then the causal graph of relevance there is, based on what you know about the patient at diagnosis, which I'm going to say X is a vector because maybe it's based on images, your whole electronic health record. There's a ton of data you have on the patient at diagnosis.

Based on that, you make some decision about a treatment plan. I'm going to call that T. T could be binary, a choice between two treatments, it could be continuous, maybe you're deciding the dosage of the treatment, or it could be maybe even a vector. For today's lecture, I'm going to suppose that T is just binary, just involves two choices. But most of what I'll tell you about will generalize to the setting where T is non-binary as well. But critically, I'm going to make the assumption for today's lecture that you're not observing new things in between.

So, for example, in this whole week's lecture, the following scenario will not happen. Based on diagnosis, you make a decision about treatment plan. Treatment plan starts, you got new observations. Based on those new observations, you realize that treatment plan isn't working and change to another treatment plan, and so on. So that scenario goes by a different name, which is called dynamic treatment regimes or off-policy reinforcement learning, and that we'll learn about next week.

So for today's and Thursday's lecture, we're going to suppose that, based on what you know about the patient at this time, you make a decision, you execute the decision, and you look at some outcome. So X causes T, not the other way around. And that's pretty clear because of our prior knowledge about this problem. It's not that the treatment affects what their diagnosis was. And then there's the outcome Y, and there, again, we suppose the outcome, what happens to the patient, maybe survival time, for example, is a function of what treatment they're getting and aspects about that patient.

So this is the causal graph. We know it. But we don't know, does that treatment do anything to this patient? For whom does this treatment help the most? And those are the types of questions we're going to try to answer today. Is the setting clear? OK. Now, these questions are not new questions. They've been studied for decades in fields such as political science, economics, statistics, biostatistics. And the reason why they're studied in those other fields is because often you don't have the ability to intervene, and one has to try to answer these questions from observational data.

For example, you might ask, what will happen to the US economy if the Federal Reserve raises US interest rates by 1%? When's the last time you heard of the Federal Reserve doing a randomized controlled trial? And even if they had done a randomized controlled trial, for example, flipped a coin to decide which way the interest rates would go, it wouldn't be comparable had they done that experiment today to if they had done that experiment two years from now because the state of the world has changed in those years.

Let's talk about political science. I have close colleagues of mine at NYU who look at Twitter, and they want to ask questions like, how can we influence elections, or how are elections influenced? So you might look at some unnamed actors, possibly people supported by the Russian government, who are posting to Twitter or their social media. And you might ask the question of, well, did that actually influence the outcome of the previous presidential election? Again, in that scenario, it's one of, well, we have this data, something happened in the world, and we'd like to understand what was the effect of that action, but we can't exactly go back and replay to do something else.

So these are fundamental questions that appear all across the sciences, and of course they're extremely relevant in health care, but yet, we don't teach them in our introduction to machine learning classes. We don't teach them in our undergraduate computer science education. And I view this as a major hole in our education, which is why we're spending two weeks on it in this course, which is still not enough. But what has changed between these fields, and what is relevant in health care?

Well, the traditional way in which these questions were asked in statistics were ones where you took a huge amount of domain knowledge to, first of all, make sure you're setting up the problem correctly, and that's always going to be important. But then to think through what are all of the factors that could influence the treatment decisions, called the confounding factors. And the traditional approach is one would write down 10, 20 different things, and make sure that you do some analysis, including the analyses I'll show you in today's and Thursday's lectures, using those 10 or 20 variables. But where this field is going is one of now having high dimensional data.

So I talked about how you might have imaging data for X, or you might have the patient's entire electronic health record as X. And the traditional approaches that the statistics community used to work on no longer work in this high dimensional setting. And so, in fact, it's actually a really interesting area for research, one that my lab is starting to work on and many other labs, where we could ask, how can we bring machine learning algorithms that are designed to work with high dimensional data to answer these types of causal inference questions? And in today's lecture, you'll see one example of reduction from causal inference to machine learning, where we'll be able to use machine learning to answer one of those causal inference questions.

So the first thing we need is some language in order to formalize these notions. So I will work within what's known as the Rubin-Neyman Causal Model, where we talk about what are called potential outcomes. What would have happened under this world or that world? We'll call this Y0, and often it will be denoted as Y underscore 0, sometimes it'll be denoted as Y parentheses 0, and sometimes it'll be denoted as Y given X comma do T equals 0. And all three of these notations are equivalent.

So Y0 corresponds to what would have happened to this individual if you gave them treatment 0. And Y1 is the potential outcome of what would have happened to this individual had you given them treatment one. So you could think about Y1 as being given the blue pill and Y0 as being given the red pill. Now, once you can talk about these states of the world, then one could start to ask questions of what's better, the red pill or the blue pill? And one can formalize that notion mathematically in terms of what's called the conditional average treatment effect, and this also goes by the name of individual treatment effect.

So it's going to take as input Xi, which I'm going to denote as the data that you had at baseline for the individual. These are the covariates, the features for the individual. And one wants to know, well, for this individual with what we know about them, what's the difference between giving them treatment one or giving them treatment zero? So mathematically, that corresponds to a difference in expectations. It's a difference between the expectation of Y1 and the expectation of Y0. Now, the reason why I'm calling this an expectation is because I'm not going to assume that Y1 and Y0 are deterministic because maybe there's some bad luck component. Like, maybe a medication usually works for this type of person, but with a flip of a coin, sometimes it doesn't work.

And so that's the randomness that I'm referring to when I talk about probability over Y1 given Xi. And so the CATE looks at the difference in those two expectations. And then one can now talk about what the average treatment effect is, which is the difference between those two. So the average treatment effect is now the expectation of-- I'll say the expectation of the CATE over the distribution of people, P of X. Now, we're going to go through this in four different ways in the next 10 minutes, and then you're going to go over it five more ways doing your homework assignment, and you'll go over it two more ways on Friday in recitation.
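In symbols, the two quantities just described, with expectations taken over the randomness in the potential outcomes, are:

```latex
\mathrm{CATE}(x_i) \;=\; \mathbb{E}\left[Y_1 \mid x_i\right] \;-\; \mathbb{E}\left[Y_0 \mid x_i\right],
\qquad
\mathrm{ATE} \;=\; \mathbb{E}_{x \sim p(x)}\!\left[\mathrm{CATE}(x)\right]
```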

So if you don't get it just yet, stay with me, you'll get it by the end of this week. Now, in the data that you observe for an individual, all you see is what happened under one of the interventions. So, for example, if the i'th individual in your data set received treatment Ti equals 1, then what you observe, Yi is the potential outcome Y1. On the other hand, if the individual in your data set received treatment Ti equals 0, then what you observed for that individual is the potential outcome Y0.

So that's the observed factual outcome. But one could also talk about the counterfactual of what would have happened to this person had the opposite treatment been done for them. Notice that I just swapped each Ti for 1 minus Ti, and so on. Now, the key challenge in the field is that in your data set, you only observe the factual outcomes. And when you want to reason about the counterfactual, that's where you have to impute this unobserved counterfactual outcome. And that is known as the fundamental problem of causal inference, that we only observe one of the two outcomes for any individual in the data set.
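Written out, with Ti in {0, 1} denoting the treatment that individual i actually received, the factual and counterfactual outcomes just described are:

```latex
y_i^{\mathrm{factual}} \;=\; T_i\, Y_1(x_i) + (1 - T_i)\, Y_0(x_i),
\qquad
y_i^{\mathrm{counterfactual}} \;=\; (1 - T_i)\, Y_1(x_i) + T_i\, Y_0(x_i)
```

The counterfactual expression is exactly the factual one with each Ti swapped for 1 minus Ti; only the first is ever present in the data.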

So let's look at a very simple example. Here, individuals are characterized by just one feature, their age. And these two curves that I'm showing you are the potential outcomes of what would happen to this individual's blood pressure if you gave them treatment zero, which is the blue curve, versus treatment one, which is the red curve. All right. So let's dig in a little bit deeper. For the blue curve, we see people who received the control, what I'm calling treatment zero, their blood pressure was pretty low for individuals whose age is low and for individuals whose age is high. But for middle-aged individuals, their blood pressure on receiving treatment zero is in the higher range.

On the other hand, for individuals who receive treatment one, it's the red curve. So young people have much higher, let's say, blood pressure under treatment one, and, similarly, much older people. So then one could ask, well, what about the difference between these two potential outcomes? That is to say the CATE, the Conditional Average Treatment Effect, is simply looking at the distance between the blue curve and the red curve for that individual. So for someone with a specific age, let's say a young person or a very old person, there's a very big difference between giving treatment zero or giving treatment one. Whereas for a middle aged person, there's very little difference.

So, for example, if treatment one was significantly cheaper than treatment zero, then you might say, we'll give treatment one. Even though it's not quite as good as treatment zero, but it's so much cheaper and the difference between them is so small, we'll give the other one. But in order to make that type of policy decision, one, of course, has to understand that conditional average treatment effect for that individual, and that's something that we're going to want to predict using data.
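As a toy numerical version of this picture, one could write down two expected-outcome curves over age and read off the CATE as their difference. The quadratic shapes and all numbers below are made up for illustration only; they are meant to mimic the qualitative story of the figure (curves nearly touching for middle-aged patients, far apart for young and old ones), not the actual slide.

```python
import numpy as np

# Hypothetical expected potential outcomes (e.g., blood pressure) by age.
# mu_0 plays the role of the blue curve (treatment 0), mu_1 the red curve
# (treatment 1); the specific formulas are invented for illustration.
def mu_0(age):
    # Control: highest for middle-aged patients.
    return 125.0 - 0.02 * (age - 50.0) ** 2

def mu_1(age):
    # Treatment: highest for young and old patients.
    return 124.0 + 0.03 * (age - 50.0) ** 2

def cate(age):
    # Conditional average treatment effect at this age:
    # E[Y_1 | age] - E[Y_0 | age].
    return mu_1(age) - mu_0(age)

# Large effect at the extremes, almost none in the middle.
print(cate(20.0))  # 44.0
print(cate(50.0))  # -1.0

# Average treatment effect if ages were, say, uniform on [20, 80]:
# the average of the CATE over the population distribution.
ages = np.linspace(20.0, 80.0, 61)
ate = cate(ages).mean()
```

The cheap-drug policy discussion above corresponds to looking at cate(age) person by person rather than at the single number ate.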

Now, we don't always get the luxury of having personalized treatment recommendations. Sometimes we have to give a policy. Like, for example-- I took this example out of my slides, but I'll give it to you anyway. The federal government might come out with a guideline saying that all men over the age of 50-- I'm making up that number-- need to get annual prostate cancer screening. That's an example of a very broad policy decision. You might ask, well, what is the effect of that policy now applied over the full population on, let's say, decreasing deaths due to prostate cancer? And that would be an example of asking about the average treatment effect.

So if you were to average the red line, if you were to average the blue line, you get those two dotted lines I show there. And if you look at the difference between them, that is the average treatment effect between giving the red intervention or giving the blue intervention. And if the average treatment effect is very positive, you might say that, on average, this intervention is a good intervention. If it's very negative, you might say the opposite. Now, the challenge about doing causal inference from observational data is that, of course, we don't observe those red and those blue curves, rather what we observe are data points that might be distributed all over the place.

Like, for example, in this example, the blue treatment happens to be given in the data more to young people, and the red treatment happens to be given in the data more to older people. And that can happen for a variety of reasons. It can happen due to access to medication. It can happen for socioeconomic reasons. It could happen because existing treatment guidelines say that old people should receive treatment one and young people should receive treatment zero. These are all reasons why in your data who receives what treatment could be biased in some way. And that's exactly what this edge from X to T is modeling.

But for each of those people, you might want to know, well, what would have happened if they had gotten the other treatment? And that's asking about the counterfactual. So these dotted circles are the counterfactuals for each of those observations. And by the way, you'll notice that those dots are not on the curves, and the reason they're not on the curve is because I'm trying to point out that there could be some stochasticity in the outcome. So the dotted lines are the expected potential outcomes and the circles are the realizations of them.

All right. Everyone take out a calculator or your computer or your phone, and I'll take out mine. This is not an opportunity to go on Facebook, just to be clear. All you want is a calculator. My phone doesn't-- oh, OK, it has a calculator. Good. All right. So we're going to do a little exercise. Here's a data set on the left-hand side. Each row is an individual. We're observing the individual's age, gender, whether they exercise regularly, which I'll say is a one or a zero, and what treatment they got, which is A or B. On the far right-hand side are their observed blood glucose (sugar) levels, let's say, at the end of the year.

Now, what we'd like to have looks like this. So we'd like to know what would have happened to this person's sugar levels had they received medication A or had they received medication B. But if you look at the previous slide, we observed for each individual that they got either A or B. And so we're only going to know one of these columns for each individual. So the first row, for example, this individual received treatment A, and so you'll see that I've taken the observed sugar level for that individual, and since they received treatment A, that observed level represents the potential outcome Ya, or Y0.

And that's why I have a 6, which is bolded under Y0. And we don't know what would have happened to that individual had they received treatment B. So in this case, some magical creature came to me and told me their sugar levels would have been 5.5, but we don't actually know that. It wasn't in the data. Let's look at the next line just to make sure we get what I'm saying. So the second individual actually received treatment B. Their observed sugar level is 6.5. OK.

Let's do a little survey. That 6.5 number, should it be in this column? Raise your hand. Or should it be in this column? Raise your hand. All right. About half of you got that right. Indeed, it goes to the second column. And again, what we would like to know is the counterfactual. What would have been their sugar levels had they received medication A? Which we don't actually observe in our data, but I'm going to hypothesize is-- suppose that someone told me it was 7, then you would see that value filled in there. That's the unobserved counterfactual.

All right. First of all, is the setup clear? All right. Now here's when you use your calculators. So we're going to now demonstrate the difference between a naive estimator of your average treatment effect and the true average treatment effect. So what I want you to do right now is to compute, first, what is the average sugar level of the individuals who got medication B. So for that, we're only going to be using the red ones. So this is conditioning on receiving medication B.

And so this is equivalent to going back to this one and saying, we're only going to take the rows where individuals receive medication B, and we're going to average their observed sugar levels. And everyone should do that. What's the first number? 6.5 plus-- I'm getting 7.875. This is for the average sugar, given that they received medication B. Is that what other people are getting?


DAVID SONTAG: OK. What about for the second number? Average sugar, given A? I want you to compute it. And I'm going to ask everyone to say it out loud in literally one minute. And if you get it wrong, of course you're going to be embarrassed. I'm going to try myself. OK. On the count of three, I want everyone to read out what that third number is. One, two, three.

ALL: 7.125.

DAVID SONTAG: All right. Good. We can all do arithmetic. All right. Good. So, again, we're just looking at the red numbers here, just the red numbers. So we just computed that difference, which is point what?


DAVID SONTAG: 0.75? Yeah, that looks about right. Good. All right. So that's a positive number. Now let's do something different. Now let's compute the actual average treatment effect, which is we're now going to average every number in this column, and we're going to average every number in this column. So this is the average sugar level under the potential outcome of had the individual received treatment B, and this is the average sugar level under the potential outcome that the individual received treatment A. All right. Who's doing it?


DAVID SONTAG: 0.75 is what?

AUDIENCE: The difference.

DAVID SONTAG: How do you know?


DAVID SONTAG: Wow, you're fast. OK. Let's see if you're right. I actually don't know. OK. The first one is 0.75. Good, we got that right. I intentionally didn't post the slides to today's lecture. And the second one is minus 0.75. All right. So now let's put us in the shoes of a policymaker. The policymaker has to decide, is it a good idea to-- or let's say it's a health insurance company. A health insurance company is trying to decide, should I reimburse for treatment B or not? Or should I simply say, no, I'm never going to reimburse for this treatment because it doesn't work well?

So if they had done the naive estimator, that would have been the first example, then it would look like medication B is-- we want lower numbers here, so it would look like medication B is worse than medication A. And if you properly estimated what the actual average treatment effect is, you get the absolute opposite conclusion. You conclude that medication B is much better than medication A. It's just a simple example to really illustrate the difference between conditioning and actually computing that counterfactual.
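The exercise can be reproduced in a few lines. The four rows of potential outcomes below are hypothetical, invented so that the group averages match the ones computed in class (7.875 versus 7.125, and an ATE of -0.75); they are not the actual table from the slides.

```python
import numpy as np

# Hypothetical potential outcomes for four individuals. In real data only
# one of y_a or y_b is ever observed per person, depending on treatment.
treatment = np.array(["A", "B", "A", "B"])
y_a = np.array([6.0, 7.0, 8.25, 8.0])   # sugar level under medication A
y_b = np.array([5.5, 6.5, 5.0, 9.25])   # sugar level under medication B

# The factual outcome is the potential outcome for the treatment received.
observed = np.where(treatment == "B", y_b, y_a)

# Naive estimator: condition on the treatment actually received.
naive = observed[treatment == "B"].mean() - observed[treatment == "A"].mean()

# True average treatment effect: average over ALL potential outcomes.
ate = y_b.mean() - y_a.mean()

print(naive)  # 0.75  -> B looks worse (higher sugar level)
print(ate)    # -0.75 -> B is actually better
```

The sign flip between the two numbers is exactly the conditioning-versus-counterfactual distinction: the naive estimator compares two differently selected groups, while the ATE compares each column against its own counterfactual.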

OK. So hopefully now you're starting to get it. And again, you're going to have many more opportunities to work through these things in your homework assignment and so on. So by now you should be starting to wonder, how the hell could I do anything in this state of the world? Because you don't actually observe those black numbers. These are all unobserved. And clearly there is bias in what the values should be because of what I've been saying all along. So what can we do?

Well, the first thing we have to realize is that typically, this is an impossible problem to solve. So your instincts aren't wrong, and we're going to have to make a ton of assumptions in order to do anything here. So the first assumption is called SUTVA. I'm not even going to talk about it. You can read about that in your readings. I'll tell you about the two assumptions that are a little bit easier to describe. The first critical assumption is that there are no unobserved confounding factors. Mathematically what that's saying is that your potential outcomes, Y0 and Y1, are conditionally independent of the treatment decision given what you observe on the individual, X.

Now, this could be a bit hard to-- and that's called ignorability. And this can be a bit hard to understand, so let me draw a picture. So X is your covariates, T is your treatment decision. And now I've drawn for you a slightly different graph. Over here I said X goes to T, X and T go to Y. But now I don't have Y. Instead, I have Y0 and Y1, and I don't have any edge from T to them. And that's because now I'm actually using the potential outcomes notation. Y0 is the potential outcome of what would have happened to this individual had they received treatment 0, and Y1 is what would have happened to this individual had they received treatment 1.

And because you already know what treatment the individual has received, it doesn't make sense to talk about an edge from T to those values. That's why there's no edge there. So then you might wonder, how could you possibly have a violation of this conditional independence assumption? Well, before I give you that answer, let me put some names to these things. So we might think about X as being the age, gender, weight, diet, and so on of the individual. T might be a medication, like an anti-hypertensive medication to try to lower a patient's blood pressure. And these would be the potential outcomes after those two medications.

So an example of a violation of ignorability is if there is something else, some hidden variable h, which is not observed and which affects both the decision of what treatment the individual in your data set receives and the potential outcomes. Now it should be really clear that this would be a violation of that conditional independence assumption. In this graph, Y0 and Y1 are not conditionally independent of T given X. All right. So what are these hidden confounders? Well, they might be things, for example, which really affect treatment decisions.

So maybe there's a treatment guideline saying that for diabetic patients, they should receive treatment zero, that that's the right thing to do. And so a violation of this would be if the fact that the patient's diabetic were not recorded in the electronic health record. So you don't know-- that's not up there. You don't know that, in fact, the reason the patient received treatment T was because of this h factor. And there's critically another assumption, which is that h actually affects the outcome, which is why you have these edges from h to the Y's.

If h were something which might have affected treatment decision but not the actual potential outcomes-- and that can happen, of course. Things like gender can often affect treatment decisions, but maybe, for some diseases, it might not affect outcomes. In that situation it wouldn't be a confounding factor because it doesn't violate this assumption. And, in fact, one would be able to come up with consistent estimators of average treatment effect under that assumption. Where things go to hell is when you have both of those edges. All right.
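A quick simulation makes the danger concrete. Everything below is invented for illustration: a hidden binary confounder h (think of the unrecorded diabetes diagnosis) raises both the chance of getting treatment 1 and the outcome itself, while the true treatment effect is exactly +1.0. The naive conditional difference comes out far from the truth:

```python
import random

random.seed(0)

# Sketch of confounding bias (all numbers invented): a hidden h affects both
# the treatment decision and the outcome. True treatment effect is +1.0.
TRUE_EFFECT = 1.0
n = 100_000
ys, ts = [], []
for _ in range(n):
    h = random.random() < 0.5                         # hidden confounder
    t = 1 if random.random() < (0.8 if h else 0.2) else 0
    y = TRUE_EFFECT * t + 3.0 * h + random.gauss(0, 0.1)
    ys.append(y)
    ts.append(t)

# Naive estimator: E[Y | T=1] - E[Y | T=0], ignoring h entirely.
mean_1 = sum(y for y, t in zip(ys, ts) if t == 1) / ts.count(1)
mean_0 = sum(y for y, t in zip(ys, ts) if t == 0) / ts.count(0)
naive = mean_1 - mean_0
print(round(naive, 2))   # roughly 2.8 -- badly biased away from the true 1.0
```

The bias here is exactly the h-to-Y edge leaking through the h-to-T edge: treated patients are disproportionately the h = 1 patients.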

So there can't be any of these h's. You have to observe all things that affect both treatment and outcomes. The second big assumption-- oh, yeah. Question?

AUDIENCE: In practice, how good of a model is this?

DAVID SONTAG: Of what I'm showing you here?


DAVID SONTAG: For hypertension?


DAVID SONTAG: I have no idea. But I think what you're really trying to get at here in asking your question, how good of a model is this, is, well, oh, my god, how do I know if I've observed everything? Right? All right. And that's where you need to start talking to domain experts. So this is my starting place where I said, no, I'm not going to attempt to fit the causal graph. I'm going to assume I know the causal graph and just try to estimate the effects. That's where this starts to become really relevant. Because if you notice, this is another causal graph, not the one I drew on the board.

And so that's something where, really, talking with domain experts would be relevant. So if you say, OK, I'm going to be studying hypertension and this is the data I've observed on patients, well, you can then go to a clinician, maybe a primary care doctor who often treats patients with hypertension, and you say, OK, what usually affects your treatment decisions? And you get a set of variables out, and then you check to make sure, am I observing all of those variables, at least the variables that would also affect outcomes? So, often, there's going to be a back and forth in that conversation to make sure that you've set up your problem correctly.

And again, this is one area where you see a critical difference between the way that we do causal inference and the way that we do machine learning. In machine learning, if there are some unobserved variables, so what? I mean, maybe your predictive accuracy isn't quite as good as it could have been, but whatever. Here, your conclusions could be completely wrong if you don't get those confounding factors right. Now, in some of the optional readings for Thursday's lecture-- and we'll touch on it very briefly on Thursday, but there's not much time in this course-- I'll talk about ways and you'll read about ways to try to assess robustness to violations of these assumptions. And those go by the name of sensitivity analysis.

So, for example, the type of question you might ask is, how would my conclusions have changed if there were a confounding factor which was blah strong? And that's something that one could try to answer from data, but it's really starting to get beyond the scope of this course. So I'll give you some readings on it, but I won't be able to talk about it in the lecture. Now, the second major assumption that one needs is what's known as common support. And by the way, pay close attention here because at the end of today's lecture-- and if I forget, someone must remind me-- I'm going to ask you where did these two assumptions come up in the proof that I'm about to give you.

The first one I'm going to give you will be a dead giveaway. So I'm going to answer to you where ignorability comes up, but it's up to you to figure out where does common support show up. So what is common support? Well, what common support says is that there always must be some stochasticity in the treatment decisions. For example, if in your data patients only receive treatment A and no patient receives treatment B, then you would never be able to figure out the counterfactual, what would have happened if patients receive treatment B.

But what happens if it's not quite that universal but maybe there are classes of people? Some individuals X-- let's say, people with blue hair. People with blue hair always receive treatment zero and never receive treatment one. Well, for those people, if for some reason something about them having blue hair was also going to affect how they would respond to the treatment, then you wouldn't be able to answer anything about the counterfactual for those individuals. This brings us to what's called the propensity score. It's the probability of receiving some treatment for each individual.

And we're going to assume that this propensity score is always bounded away from 0 and 1. So it's between epsilon and 1 minus epsilon for some small epsilon. And violations of that assumption are going to completely invalidate all conclusions that we could draw from the data. All right. Now, in actual clinical practice, you might wonder, can this ever hold? Because there are clinical guidelines. Well, a couple of places where you'll see this are as follows.

First, often, there are settings where we haven't the faintest idea how to treat patients, like second line diabetes treatments. You know that the first thing we start with is metformin. But if metformin doesn't help control the patient's glucose values, there are several second line diabetic treatments. And right now, we don't really know which one to try. So a clinician might start with treatments from one class. And if that's not working, you try a different class, and so on. And it's a bit random which class you start with for any one patient.

In other settings, there might be good clinical guidelines, but there is randomness in other ways. For example, clinicians who are trained on the west coast might be trained that this is the right way to do things, and clinicians who are trained on the east coast might be trained that this is the right way to do things. And so even if any one clinician's treatment decisions are deterministic in some way, you'll see some stochasticity now across clinicians. It's a bit subtle how to use that in your analysis, but trust me, it can be done.

So if you want to do causal inference from observational data, you're going to have to first start to formalize things mathematically in terms of what is your X, what is your T, what is your Y. You have to think through, do these choices satisfy these assumptions of ignorability and overlap? Some of these things you can check in your data. Ignorability you can't explicitly check in your data. But overlap, this thing, you can test in your data. By the way, how? Any idea? Someone else who hasn't spoken today.

So just think back to the previous example. You have this table of these X's and treatment A or B and then sugar values. How would you test this?

AUDIENCE: You could use a frequentist approach and just count how many things show up. And if there is zero, then you could say that it's violated.

DAVID SONTAG: Good. So you have this table. I'll just go back to that table. We have this table, and these are your X's. Actually, we'll go back to the previous slide where it's a bit easier to see. Here, we're going to ignore the outcome, the sugar levels, because, remember, this only has to do with probability of treatment given your covariates. The Y doesn't show up here at all. So this thing on the right-hand side, the observed sugar levels, is irrelevant for this question. All we care about is what goes on over here.

So we look at this. These are your X's, and this is your treatment. And you can look to see, OK, here you have one 75-year-old male who does exercise frequently and received treatment A. Is there any one else in the data set who is 75 years old and male, does exercise regularly but received treatment B? Yes or no? No. Good. OK. So overlap is not satisfied here, at least not empirically. Now, you might argue that I'm being a bit too coarse here.

Well, what happens if the individual is 74 and received treatment B? Maybe that's close enough. So there starts to become subtleties in assessing these things when you have finite data. But it is something at the fundamental level that you could start to assess using data. As opposed to ignorability, which you cannot test using data. All right. So you have to think about, are these assumptions satisfied? And only once you start to think through those questions can you start to do your analysis.
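The counting check just described can be sketched directly. The records below are toy data echoing the lecture's age/sex/exercise columns (the specific values are invented); for every observed covariate stratum we ask whether both treatments occur at least once:

```python
from collections import defaultdict

# Minimal empirical overlap check, in the spirit of the "frequentist
# counting" answer above. Toy records (invented): (age, sex, exercise,
# observed treatment).
records = [
    (75, "M", "regular", "A"),
    (75, "M", "regular", "A"),   # no 75/M/regular patient ever got B...
    (62, "F", "none",    "A"),
    (62, "F", "none",    "B"),   # ...but this stratum saw both arms
]

treatments_seen = defaultdict(set)
for *covariates, t in records:
    treatments_seen[tuple(covariates)].add(t)

violations = [x for x, ts in treatments_seen.items() if ts != {"A", "B"}]
print(violations)   # [(75, 'M', 'regular')] -> overlap fails empirically
```

As noted in the lecture, exact matching on strata is too coarse with finite data -- a 74-year-old might be "close enough" -- so in practice one smooths this by estimating the propensity score with a model rather than counting raw strata.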

And so that now brings me to the next part of this lecture, which is how do we actually-- let's just now believe David, believe that these assumptions hold. How do we do that causal inference? Yeah?

AUDIENCE: I just had a question on [INAUDIBLE]. If you know that some patients, for instance, healthy patients, are never going to get any treatment, should we just remove them, basically?

DAVID SONTAG: So the question is, what happens if you have a violation of overlap? For example, you know that healthy individuals never receive any treatment. Should you remove them from your data set? Well, first of all, that has to do with how do you formalize the question because not receiving a treatment is a treatment. So that might be your control arm, just to be clear. Now, if you're asking about the difference between two treatments-- two different classes of treatment for a condition, then often one defines the relevant inclusion criteria in order to have these conditions hold.

For example, we could try to redefine the set of individuals that we're asking about so that overlap does hold. But then in that situation, you have to just make sure that your policy is also modified. You say, OK, I conclude that the average treatment effect is blah for this type of people. OK? OK. So how could we possibly compute the average treatment effect from data? Remember, average treatment effect, mathematically, is the expectation between potential outcome Y1 minus Y0.

The key tool which we'll use in order to estimate that is what's known as the adjustment formula. This goes by many names in the statistics community, such as the G-formula as well. Here, I'll give you a derivation of it. We're first going to recognize that this expectation is actually two expectations in one. It's the expectation over individuals X and it's the expectation over potential outcomes Y given X. So I'm first just going to write it out in terms of those two expectations, and I'll write the expectation related to X on the outside. That goes by the name of the law of total expectation.

This is trivial at this stage. And by the way, I'm just writing out expectation of Y1. In a few minutes, I'll show you expectation of Y0, but it's going to be exactly analogous. Now, the next step is where we use ignorability. I told you I was going to give that one away. So remember, we said that we're assuming that Y1 is conditionally independent of the treatment T given X. What that means is probability of Y1 given X is equal to probability of Y1 given X comma T equals whatever-- in this case I'll just say T equals 1. This is implied by Y1 being conditionally independent of T given X.

So I can just stick in "comma T equals 1" here, and that's explicitly because of ignorability holding. But now we're in a really good place because notice that-- and here I've just used some shorthand notation. I'm just going to hide this expectation. And by the way, you could do the same for Y0-- Y1, Y0. And now notice that we can replace this average treatment effect with now this expectation with respect to all individuals X of the expectation of Y1 given X comma T equals 1, and so on. And these are mostly quantities that we can now observe from our data.
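The derivation just walked through, written out in symbols (using $Y_1 \perp T \mid X$ for ignorability, and analogously for $Y_0$):

```latex
\begin{aligned}
\mathbb{E}[Y_1]
  &= \mathbb{E}_{x \sim p(x)}\!\left[\, \mathbb{E}[Y_1 \mid X = x] \,\right]
     && \text{(law of total expectation)} \\
  &= \mathbb{E}_{x \sim p(x)}\!\left[\, \mathbb{E}[Y_1 \mid X = x,\, T = 1] \,\right]
     && \text{(ignorability: } Y_1 \perp T \mid X \text{)} \\[4pt]
\mathrm{ATE} = \mathbb{E}[Y_1 - Y_0]
  &= \mathbb{E}_{x \sim p(x)}\!\left[\, \mathbb{E}[Y_1 \mid X = x,\, T = 1]
     - \mathbb{E}[Y_0 \mid X = x,\, T = 0] \,\right]
\end{aligned}
```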

So, for example, we can look at the individuals who received treatment one, and for those individuals we have realizations of Y1. We can look at individuals who receive treatment zero, and for those individuals we have realizations of Y0. And we could just average those realizations to get estimates of the corresponding expectations. So these we can easily estimate from our data. And so we've made progress. We can now estimate some part of this from our data.

But notice, there are some things that we can't yet directly estimate from our data. In particular, we can't estimate expectation of Y0 given X comma T equals 1 because we have no idea what would have happened to this individual who actually got treatment one if they had gotten treatment zero. So these we don't know. Now, what is the trick I'm playing on you? How does it help that we can do this?

Well, the key point is that these quantities that we can estimate from data show up in that term. In particular, if you look at the individuals X that you've sampled from the full set of individuals P of X, for that individual X for which, in fact, we observed T equals 1, then we can estimate expectation of Y1 given X comma T equals 1, and similarly for Y0. But what we need to be able to do is to extrapolate. Because empirically, we only have samples from P of X given T equals 1, P of X given T equals 0 for those two potential outcomes correspondingly.

But we are also going to get samples of X such that for those individuals in your data set, you might have only observed T equals 0. And to compute this formula, you have to answer, for that X, what would it have been if they got treatment equals one? So there are going to be a set of individuals that we have to extrapolate for in order to use this adjustment formula for estimation. Yep?

AUDIENCE: I thought because common support is true, we have some patients that received each treatment for a given type of X.

DAVID SONTAG: Yes. But now-- so, yes, that's true. But that's a statement about infinite data. And in reality, one only has finite data. And so although common support has to hold to some extent, you can't just build on that to say that you always observe the counterfactual for every individual, such as the pictures I showed you earlier. So I'm going to leave this slide up for just one more second to let it sink in and see what it's saying.

We started out from the goal of computing the average treatment effect, expected value of Y1 minus Y0. Using the adjustment formula, we've gotten to now an equivalent representation, which is now an expectation with respect to all individuals sampling from P of X of expected value of Y1 given X comma T equals 1, expected value of Y0 given X comma T equals 0. For some of the individuals, you can observe this, and for some of them, you have to extrapolate. So from here, there are many ways that one can go. Hold your question for a little while.

So types of causal inference methods that you will have heard of include things like covariate adjustment, propensity score re-weighting, doubly robust estimators, matching, and so on. And those are the tools of the causal inference trade. And in this course, we're only going to talk about the first two. And in today's lecture, we're only going to talk about the first one, covariate adjustment. And on Thursday, we'll talk about the second one. So covariate adjustment is a very natural way to try to do that extrapolation. It also goes by the name, by the way, of response surface modeling.

What we're going to do is we're going to learn a function f, which takes as an input X and T, and its goal is to predict Y. So intuitively, you should think about f as this conditional probability distribution. It's predicting Y given X and T. So T is going to be an input to the machine learning algorithm, which is going to predict what would be the potential outcome Y for this individual described by features X1 through Xd under intervention T.

So this is just from the previous slide. And what we're going to do now-- this is where we get the reduction to machine learning-- is use empirical risk minimization, or maybe some regularized empirical risk minimization, to fit a function f which approximates the expected value of Y sub t given X comma capital T equals little t. And then once you have that function, we're going to be able to use that to estimate the average treatment effect by just implementing now this formula here.

So we're going to first take an expectation with respect to the individuals in the data set. So we're going to approximate that with an empirical expectation where we sum over the little n individuals in your data set. Then what we're going to do is we're going to estimate the first term, which is f of Xi comma 1, because that approximates the expected value of Y1 given T equals 1 comma X. And we're going to approximate the second term, which is just plugging in 0 for T instead of 1. And we're going to take the difference between them, and that will be our estimator of the average treatment effect.
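A minimal sketch of this reduction, under invented assumptions: synthetic data with a true effect of +1.5 and a confounded treatment assignment, a linear f of X and T fit by ordinary least squares, and then the plug-in estimator that averages f(x, 1) minus f(x, 0) over everyone. A real analysis would use a proper regression library; this is written in plain Python only so it runs anywhere:

```python
import random

random.seed(1)

# Synthetic, confounded data (all numbers invented): true effect is +1.5,
# and x affects both treatment assignment and outcome.
n = 5000
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    t = 1 if random.random() < (0.8 if x > 0 else 0.2) else 0
    y = 2.0 + 0.5 * x + 1.5 * t + random.gauss(0, 0.2)
    data.append((x, t, y))

def solve3(m, v):
    # Gauss-Jordan elimination for a well-conditioned 3x3 system.
    for i in range(3):
        p = m[i][i]
        m[i] = [e / p for e in m[i]]
        v[i] /= p
        for j in range(3):
            if j != i:
                f = m[j][i]
                m[j] = [a - f * b for a, b in zip(m[j], m[i])]
                v[j] -= f * v[i]
    return v

# Ordinary least squares for f(x, t) = a + b*x + c*t via normal equations.
feats = [(1.0, x, float(t)) for x, t, _ in data]
ys = [y for _, _, y in data]
mtm = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
mty = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(3)]
a, b, c = solve3(mtm, mty)

# Adjustment-formula estimate: (1/n) sum_i [ f(x_i, 1) - f(x_i, 0) ].
ate_hat = sum((a + b * x + c * 1) - (a + b * x + c * 0)
              for x, _, _ in data) / n
print(round(ate_hat, 1))   # close to the true effect of 1.5
```

For a linear f the plug-in estimate collapses to the coefficient c, but the averaging form above is the one that generalizes to non-linear f.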

Here's a natural place to ask a question. One thing you might wonder is, in your data set, you actually did observe something for that individual, right? Notice how your raw data doesn't show up in this at all. Because I've done machine learning, and then I've thrown away the observed Y's, and I used this estimator. So what you could have done-- an alternative formula, which, by the way, is also a consistent estimator, would have been to use the observed Y for whatever the factual is and the imputed Y for the counterfactual using f.

That would also have been a consistent estimator for the average treatment effect. You could've done either. OK. Now, sometimes you're not interested in just the average treatment effect, but you're actually interested in understanding the heterogeneity in the population. Well, this also now gives you an opportunity to try to explore that heterogeneity. So for each individual Xi, you can look at the difference between what f predicts for treatment one and what f predicts for treatment zero. That difference is your estimate of the conditional average treatment effect.
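The per-individual quantity just described, the conditional average treatment effect (CATE), in symbols, with the ATE as its average over the population:

```latex
\mathrm{CATE}(x) \;=\; \mathbb{E}\!\left[\, Y_1 - Y_0 \mid X = x \,\right]
  \;\approx\; f(x, 1) - f(x, 0),
\qquad
\mathrm{ATE} \;=\; \mathbb{E}_{x \sim p(x)}\!\left[\, \mathrm{CATE}(x) \,\right].
```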

So, for example, if you want to figure out for this individual, what is the optimal policy, you might look to see is CATE positive or negative, or is it greater than some threshold, for example? So let's look at some pictures. Now what we're using is we're using that function f in order to impute those counterfactuals. And now we have those observed, and we can actually compute the CATE. And averaging over those, you can estimate now the average treatment effect. Yep?

AUDIENCE: How is f non-biased?

DAVID SONTAG: Good. So where can this go wrong? So what do you mean by biased, first? I'll ask that.

AUDIENCE: For instance, as we've seen in the paper like pneumonia and people who have asthma, [INAUDIBLE]


DAVID SONTAG: Oh, thank you so much for bringing that back up. So you're referring to one of the readings for the course from several weeks ago, where we talked about using just a pure machine learning algorithm to try to predict outcomes in a hospital setting. In particular, what happens for patients who have pneumonia in the emergency department? And if you all remember, there was this asthma example, where patients with asthma were predicted to have better outcomes than patients without asthma.

And you're calling that bias. But you remember, when I taught about this, I called it biased due to a particular thing. What's the language I used? I said bias due to intervention, maybe, is what I-- I can't remember exactly what I said.


I don't know. Make it up. Now a textbook will be written with "bias by intervention." OK. So the problem there is that they didn't formalize the prediction problem correctly. The question that they should have asked is, for asthma patients-- what you really want to ask is a question of X and then T and Y, where T are the interventions that are done for asthmatics.

So the failure of that paper is that it ignored the causal inference question which was hidden in the data, and it just went to predict Y given X marginalizing over T altogether. So T was never in the predictive model. And said differently, they never asked counterfactual questions of what would have happened had you done a different T. And then they still used it to try to guide some treatment decisions. Like, for example, should you send this person home, or should you keep them for careful monitoring or so on? So this is exactly the same example as I gave in the beginning of the lecture, where I said if you just use a risk stratification model to make some decisions, you run the risk that you're making the wrong decisions because those predictions were biased by decisions in your data.

So that doesn't happen here because we're explicitly accounting for T in all of our analysis. Yep?

AUDIENCE: In the data sets that we've used, like MIMIC, how much treatment information exists?

DAVID SONTAG: So how much treatment information is in MIMIC? A ton. In fact, one of the readings for next week is going to be about trying to understand how one could manage sepsis, which is a condition caused by infection, which is managed by, for example, giving broad spectrum antibiotics, giving fluids, giving pressors and ventilators. And all of those are interventions, and all those interventions are recorded in the data so that one could then ask counterfactual questions from the data, like what would have happened to this patient had they received a different set of interventions? Would we have prolonged their life, for example?

And so in an intensive care unit setting, most of the questions that we want to ask about-- not all, but many of them-- are about dynamic treatments, because it's not just a single treatment but really a sequence of treatments responding to the current patient condition. And so that's where we'll really start to get into that material next week, not in today's lecture. Yep?

AUDIENCE: How do you make sure that your f function really learned from the relationship between T and the outcome?

DAVID SONTAG: That's a phenomenal question. Where were you this whole course? Thank you for asking it. So I'll repeat it. How do you know that your function f actually learned something about the relationship between the input X and the treatment T and the outcome? And that really gets to the question of, is my reduction actually valid? So I've taken this problem and I've reduced it to this machine learning problem, where I take my data, and literally I just learn a function f to try to predict well the observations in the data.

And how do we know that that function f actually does a good job at estimating something like average treatment effect? In fact, it might not. And this is where things start to get really tricky, particularly with high dimensional data. Because it could happen, for example, that your treatment decision is only one of a huge number of factors that affect the outcome Y. And it could be that a much more important factor is hidden in X. And because you don't have much data, and because you have to regularize your learning algorithm, let's say, with L1 or L2 regularization or maybe early stopping if you're using a deep neural network, your algorithm might never learn the actual dependence on T.

It might learn just to throw away T and just use X to predict Y. And if that's the case, you will never be able to infer these average treatment effects accurately. You'll have huge errors. And that gets back to one of the slides that I skipped, where I started out from this picture. This is the machine learning picture saying, OK, a reduction to machine learning is-- now you add an additional feature, which is your treatment decision, and you learn that black box function f. But this is where machine learning and causal inference start to differ because we don't actually care about the quality of predicting Y.

We can measure your root mean squared error in predicting Y given your X's and T's, and that error might be low. But you can run into these failure modes where it just completely ignores T, for example. So T is special here. So really, the picture we want to have in mind is that T is some parameter of interest. We want to learn a model f such that if we twiddle T, we can see how there is a differential effect on Y based on twiddling T. That's what we truly care about when we're using machine learning for causal inference.
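A tiny illustration of this failure mode, under invented assumptions: T is almost redundant given X (treatment assignment is nearly determined by X), so a model that drops T entirely predicts Y almost as well as the irreducible noise allows -- yet the treatment effect that model implies is exactly 0, while the true effect is 0.3:

```python
import random

random.seed(3)

# Synthetic data (numbers invented): t is nearly a function of x, the true
# treatment effect is 0.3, and the outcome noise has std 0.5.
TRUE_EFFECT = 0.3
NOISE_STD = 0.5
n = 20_000
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    t = 1 if random.random() < (0.9 if x > 0 else 0.1) else 0  # t tracks x
    y = 3.0 * x + TRUE_EFFECT * t + random.gauss(0, NOISE_STD)
    data.append((x, t, y))

# Fit y = a + b*x by least squares, ignoring t entirely.
xs = [x for x, _, _ in data]
ys = [y for _, _, y in data]
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

rmse = (sum((y - (a + b * x)) ** 2 for x, _, y in data) / n) ** 0.5
print(round(rmse, 2))   # barely worse than the irreducible 0.5 -- yet this
                        # model's implied effect of t on y is exactly 0
```

Low root mean squared error, wildly wrong causal conclusion: exactly the gap between the predictive objective and the causal one.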

And so that's really the gap, that's the gap in our understanding today. And it's really an active area of research to figure out how do you change the whole machine learning paradigm to recognize that when you're using machine learning for causal inference, you're actually interested in something a little bit different. And by the way, that's a major area of my lab's research, and we just published a series of papers trying to answer that question. Beyond the scope of this course, but I'm happy to send you those papers if anyone's interested.

So that type of question is extremely important. It doesn't show up quite as much when your X's aren't very high dimensional and where things like regularization don't become important. But once your X becomes high dimensional and once you want to start to consider more and more complex f's during your fitting, like you want to use deep neural networks, for example, these differences in goals become extremely important.

So there are other ways in which things can fail. So I want to give you here an example where-- shoot, I'm answering my question. OK. No one saw that slide. Question-- where did the overlap assumptions show up in our approach for estimating average treatment effect using covariate adjustment? Let me go back to the formula. Someone who hasn't spoken today, hopefully. You can be wrong, it's fine. Yeah, in the back?

AUDIENCE: Is it the part where, for someone of a given age, you only ever observe treatment A and never treatment B?

DAVID SONTAG: So maybe you have an individual with some age-- we're going to want to be able to look at the difference between what f predicts for that individual if they got treatment A versus treatment B, or one versus zero. And let me try to lead this a little bit. And it might happen in your data set that for individuals like them, you only ever observe treatment one and there's no one even remotely like them who you observe treatment zero. So what's this function going to output then when you input zero for that second argument? Everyone say out loud. Garbage? Right?

If in your data set you never observed anyone even remotely similar to Xi who received treatment zero, then this function is basically undefined for that individual. I mean, yeah, your function will output something because you fit it, but it's not going to be the right answer. And so that's where this assumption starts to show up. When one talks about the sample complexity of learning these functions f to do covariate adjustment, and when one talks about the consistency of these arguments-- for example, you'd like to be able to make claims that as the amount of data grows to, let's say, infinity, that this is the right answer-- gives you the right estimate. So that's the type of proof which is often given in the causal inference literature.

Well, if you have overlap, then as the amount of data goes to infinity, you will observe someone, like the person who received treatment one, you'll observe someone who also received treatment zero. It might have taken you a huge amount of data to get there because treatment zero might have been much less likely than treatment one. But because the probability of treatment zero is not zero, eventually you'll see someone like that. And so eventually you'll get enough data in order to learn a function which can extrapolate correctly for that individual.

And so that's where overlap comes in in giving that type of consistency argument. Of course, in reality, you never have infinite data. And so these questions about trade-offs between the amount of data you have and the fact that you never truly have empirical overlap with a small amount of data, and answering when you can extrapolate correctly despite that, is the critical question that one needs to answer, but is, by the way, not studied very well in the literature because people don't usually think in terms of sample complexity in that field. That's where computer scientists can really start to contribute to this literature, bringing things that we often think about in machine learning to this new topic.

So I've got a couple of minutes left. Are there any other questions, or should I introduce some new material in one minute? Yeah?

AUDIENCE: So you said that the average treatment effect estimator here is consistent. But does it matter if we choose the wrong-- do we have to choose some functional form of the features to the effect?

DAVID SONTAG: Great question.

AUDIENCE: Is it consistent even if we choose a completely wrong function or formula?


AUDIENCE: That's a different thing?

DAVID SONTAG: No, no. You're asking all the right questions. Good job today, everyone. So, no. If you walk through that argument I made, I assumed two things. First, that you observe enough data such that you have any chance of extrapolating correctly. But implicit in that statement is that you're choosing a function family which is powerful enough that it can extrapolate correctly. So if you think back to this figure I showed you here-- if the true potential outcome functions are these quadratic functions and you're fitting them with a linear function, then no matter how much data you have, you're always going to get wrong estimates, because this argument really requires that you consider more and more complex non-linearities as your amount of data grows.
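This misspecification point can be checked numerically. In the hypothetical setup below (invented for this sketch, not the lecture's actual figure), the true treatment effect is quadratic in x, treatment is randomized so overlap holds, and yet a linear model per arm estimates the CATE at x = 0 to be about 1/3 instead of the true 0-- and no amount of data fixes it.

```python
# Hedged sketch: a wrong (too simple) function class stays wrong
# even as n grows. Data-generating process invented for illustration.
import numpy as np

rng = np.random.default_rng(1)

def cate_at_zero(n):
    """Estimate CATE(x=0) with one linear regression per arm."""
    x = rng.uniform(-1, 1, n)
    t = rng.integers(0, 2, n)      # randomized, so overlap holds
    # True potential outcomes: Y0 = 0, Y1 = x**2, so true CATE(0) = 0.
    y = t * x**2 + rng.normal(0, 0.1, n)
    coefs = {}
    for arm in (0, 1):
        xa, ya = x[t == arm], y[t == arm]
        A = np.column_stack([np.ones_like(xa), xa])
        coefs[arm], *_ = np.linalg.lstsq(A, ya, rcond=None)
    # Predicted Y1 - Y0 at x = 0 is the difference of intercepts.
    return coefs[1][0] - coefs[0][0]

# The best linear fit to x**2 on [-1, 1] is the constant 1/3, so the
# estimate converges to roughly 1/3, not the true CATE(0) = 0.
for n in (1_000, 100_000):
    print(n, round(cate_at_zero(n), 2))
```

Because the bias comes from the function family rather than from sampling noise, growing n only makes the estimator more confidently wrong, which is exactly why the consistency argument requires richer non-linearities as the data grows.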

So now here's a visual depiction of what can go wrong if you don't have overlap. Previously, I had one or two red points over here and one or two blue points over here, but now I've taken those out. So in your data, all you have are these blue points and those red points. And now one can learn the best functions you can to try to, let's say, minimize the mean squared error of predicting the blue points and minimize the mean squared error of predicting the red points. And what you might get out is something like-- maybe you'll decide on a linear function. That's as good as you could do if all you have are those red points.

And so even if you were willing to consider more and more complex hypothesis classes, here, if you tried to consider a more complex hypothesis class than this line, you'd probably just be overfitting to the data you have. And so you decide on that line, which, because you had no data over here, you don't even know is not a good fit to the data. And then you end up getting completely wrong estimates. For example, if you asked about the CATE for a young person, it would have the wrong sign over here, because the two lines have flipped.
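A simulation in the same spirit as that figure (the specific functions and covariate ranges are invented for this sketch): treated outcomes are only observed at low x, controls only at high x, and the control line extrapolated across the gap gives the CATE at a "young" x the wrong sign.

```python
# Hedged sketch of the no-overlap failure mode: linear fits
# extrapolated across disjoint supports flip the sign of the CATE.
# Functions and ranges are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# No overlap: the two arms live on disjoint parts of covariate space.
x1 = rng.uniform(0.0, 0.4, n)   # treated, observed only at low x
x0 = rng.uniform(0.6, 1.0, n)   # control, observed only at high x

# True potential outcomes: Y1 = x (linear), Y0 = 0.3 + x**2 (curved).
y1 = x1 + rng.normal(0, 0.05, n)
y0 = 0.3 + x0**2 + rng.normal(0, 0.05, n)

def fit_line(xs, ys):
    A = np.column_stack([np.ones_like(xs), xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

c1, c0 = fit_line(x1, y1), fit_line(x0, y0)

# Query the CATE at x = 0.2 -- inside the treated region, far outside
# the control region, so the control line is pure extrapolation.
xq = 0.2
cate_hat = (c1[0] + c1[1] * xq) - (c0[0] + c0[1] * xq)
cate_true = xq - (0.3 + xq**2)
print(round(cate_hat, 2), round(cate_true, 2))  # opposite signs
```

The control line, fit only where x is large, undershoots the true curved Y0 at x = 0.2, so the estimated effect comes out positive while the true effect is negative-- the sign flip described above.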

So that's an example of how one can start to get errors. When we begin Thursday's lecture, we're going to pick up right where we left off today, and I'll talk about this issue in a bit more detail. I'll talk about how, if one were to learn a linear function, one could actually interpret the coefficients of that linear function in a causal way, under the very strong assumption that the two potential outcomes are linear. So that's what we'll return to on Thursday.