PROFESSOR: So I'm going to begin by trying to build some intuition for how one might be able to do staging from cross-sectional data, and we'll return to this question of combined staging and subtyping only much later.

So imagine that we had data that lived in one dimension. Here, each data point is an individual, we observe their data at just one point in time, and suppose we knew exactly which biomarker to look at. Right? So I gave you an example of that here, where you might look at some antibody expression level, and that might be what I call biomarker A. If you knew exactly what biomarker to look at, you might just put each person along this line, and you might conjecture that maybe on one side of the line is early disease, and on the other side of the line, maybe that's late disease. Why might that be a reasonable conjecture? What would be an alternative conjecture? Why don't you talk to your neighbors and see if you can come up with some alternative conjectures. Let's go.

All right, that's enough. Hopefully these are simple questions, so I won't give you too much time. All right, so what would be another conjecture? Again, our goal is this: we have one observation per individual, each individual is at some unknown stage of the disease, and we would like to be able to sort individuals into early and late stages of the disease. I gave you one conjecture of how to do that sorting; what would be another reasonable conjecture? Raise your hand. Yep?

AUDIENCE: That there's the different-- that they have different types of the same disease. They all have the same disease and it could-- just one of the subtypes might be sort of the--

PROFESSOR: Yeah. So you're going back to the example I gave here, where you could conflate these things. I want to stick with a simpler story; let's suppose there's only one subtype of the disease. What would be another way to sort the patients given this data, where the data is these points that you see here?
Yeah?

AUDIENCE: For any disease in the middle range, and then as you [INAUDIBLE]

PROFESSOR: OK, so maybe early disease is right over here, and when things get bad, the patient-- this biomarker starts to become abnormal, and abnormality, for whatever reason, might be to the right or to the left. Now, I think that is a conjecture one could have. I would argue that that's perhaps not a very natural conjecture, given what we know about common biomarkers that are measured from the human body and the way that they respond to disease progression. Unless you're in the situation of having multiple disease subtypes where, for example, the marker going to the right might correspond to one disease subtype and the marker going to the left might correspond to another disease subtype. What would be another conjecture? You're missing the easy one. Yeah, in the back.

AUDIENCE: Well, it might just be one where the high values are [INAUDIBLE] stage and low values are later ones?

PROFESSOR: Exactly. So this might be early disease and that might be late disease.

AUDIENCE: It says vice versa on this slide.

PROFESSOR: Oh, does it really? Oh shoot.

[LAUGHTER]

AUDIENCE: [INAUDIBLE]

PROFESSOR: Right, right, OK, OK. Thank you. Next time I'll take out that lower "vice versa."

[LAUGHTER]

That's why you weren't saying that. OK. OK, so this is good. Now I think we're all on the same page, and we have some idea of the assumptions one might need to make in order to actually do anything here. For example, we'll probably have to make some assumption about continuity: that there might be some gradual progression of the relevant biomarker from early to late, and it might be getting larger, or it might be getting smaller.
If it's indeed the scenario that we talked about earlier, where we said early disease might be here and late disease might go to either side, then in that case I think one could easily argue that, with just the information we have here, disease stage is unidentifiable, right? Because you wouldn't know where that transition point should be-- here, here, here, here, here? In fact, the same problem arises here: you don't know whether early disease is going this way or going that way. What would be one way to try to disentangle this, just to get us all on the same page? So suppose it was only going in this direction or in that direction; how could we figure out which is which? Yeah?

AUDIENCE: Maybe we had data on [INAUDIBLE] and other data about how much time we had taken--

PROFESSOR: Yeah. No, that's great. So maybe we have data on, let's say, death information, or even just age. And if we started from a very, very rough assumption that disease stage, let's say, grows monotonically with age, and if you made the additional assumption that the people who are coming in are uniformly drawn from across disease stages-- with those two assumptions alone you could, for example, look at the average age of individuals over here and the average age of individuals over here, and you'd say the side with the larger average age is the late-disease side. Or you could look at time to death: if, for each data point, you also knew how long until that individual died, you could look at the average time to death for these individuals versus those individuals and try to tease it apart in that way. That's what you meant.
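To make that heuristic concrete, here is a minimal sketch in Python, with simulated data and made-up variable names, of orienting a one-dimensional biomarker axis by comparing the average age on its two ends; the monotonicity-with-age and uniform-sampling assumptions above are what would justify the comparison.

```python
import numpy as np

def orient_axis(biomarker, age):
    """Decide which end of a 1-D biomarker axis is 'late' disease.

    Assumes (1) stage increases monotonically with age on average and
    (2) patients are sampled roughly uniformly across stages.
    """
    median = np.median(biomarker)
    low_side = age[biomarker <= median]
    high_side = age[biomarker > median]
    # The side with the larger mean age is conjectured to be late disease.
    if high_side.mean() > low_side.mean():
        return "high values = late disease"
    return "low values = late disease"

# Toy example: the biomarker drifts upward with age, plus noise.
rng = np.random.default_rng(0)
age = rng.uniform(40, 80, size=500)
biomarker = 0.05 * age + rng.normal(scale=0.5, size=500)
print(orient_axis(biomarker, age))
```

The same comparison could be run with time to death in place of age, exactly as described above.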
OK, so I'm just trying to give you some intuition for how this might be possible. What about if your data looked like this? So now you have two biomarkers. We've only gone up by one dimension, and we want to figure out where's early and where's late. It already starts to become more challenging, right?

So the intuition that I want you to have is that we're going to have to make some assumptions about disease progression, such as the ones we were just discussing, and we also have to get lucky in some way. For example, one way of getting lucky would be to have a whole lot of data. If you had a ton of data, and you made the additional assumption that your data lives on some low-dimensional manifold, where one side of the manifold is early disease and the other side of the manifold is late disease, then you might be able to discover that manifold from this data, and you might conjecture that the manifold is something like that trajectory that I'm outlining there with my hand. But for you to be able to do that, of course, you need to have enough data, all right? So there's going to be a trade-off: having only cross-sectional data might be OK so long as you have a whole lot of it, so you can fill in the spaces and really identify that manifold.

A different approach might be: maybe you don't have purely cross-sectional data, maybe you have two or three samples from each patient. And then you can color-code this. So you might say green is patient A, and this is the first time point from patient A, the second time point from patient A, the third and fourth time points from patient A. Red is patient B, and you have two time points for patient B. And blue here is patient C, and you have one, two, three time points from patient C. OK? Now again, it's not very dense data, we can't really draw curves out, but now we can start to get a sense of the ordering. And even though we're not in a dense setting like we were before, we'd still nonetheless be able to figure out that the manifold probably looks a little bit like this, right?
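As one hedged illustration of the "lots of cross-sectional data plus a manifold assumption" idea, the sketch below fits a one-dimensional coordinate to simulated two-biomarker data with scikit-learn's Isomap and then orients it with the average-age heuristic from before; the trajectory, noise levels, and neighborhood size are all made up.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(1)

# Simulated dense cross-sectional data: a hidden "stage" in [0, 1]
# drives two noisy biomarkers along a curved trajectory.
stage = rng.uniform(0, 1, size=2000)
biomarkers = np.column_stack([
    np.sin(2 * stage) + rng.normal(scale=0.05, size=stage.size),
    stage ** 2 + rng.normal(scale=0.05, size=stage.size),
])
age = 45 + 30 * stage + rng.normal(scale=5, size=stage.size)

# Recover a 1-D coordinate along the conjectured manifold.
coord = Isomap(n_neighbors=20, n_components=1).fit_transform(biomarkers).ravel()

# Orient the coordinate so larger values mean later disease, using the
# assumption that stage grows with age on average.
if np.corrcoef(coord, age)[0, 1] < 0:
    coord = -coord

print("correlation with true stage:", np.corrcoef(coord, stage)[0, 1].round(3))
```

With only a handful of time points per patient, as in the color-coded example, the within-patient ordering plays the role that density plays here.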
And so again, I'm just trying to build intuition around when disease progression modeling from cross-sectional data might be possible, but this is a wide-open field. So today I'll be telling you about a few algorithms that try to build on some of these intuitions for doing disease progression modeling, but they will break down very, very easily. They'll break down when the assumptions I gave you don't hold, they'll break down when your data is high dimensional, and they'll break down when your data looks like this, where you don't just have a single subtype but perhaps multiple subtypes. So this is a really very active area of research, and it's an area where I think we can make a lot of progress as a field in the next several years.

So I'll begin with one case study coming from my own work, where we developed an algorithm for learning from cross-sectional data, and we evaluated it in the context of chronic obstructive pulmonary disease, or COPD. COPD is a condition of the lungs typically caused by air pollution or smoking, and it has a reasonably good staging mechanism. One uses what's called a spirometry device to measure the lung function of an individual at any one point in time. So for example, you take this spirometry device, you stick it in your mouth, you breathe in, and then you exhale, and one measure is how long it takes to exhale all of your air, and that is going to be a measure of how good your lungs are. One can then take that measure of lung function and stage how severe the person's COPD is, and that goes by what's called the GOLD criteria.
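As a rough, simplified illustration only (the cutoffs below are the commonly cited spirometry thresholds, but the GOLD guidelines themselves, not this sketch, are the authoritative reference), a spirometry-based stage might be computed like this:

```python
def gold_stage(fev1_percent_predicted, fev1_fvc_ratio):
    """Approximate GOLD airflow-limitation grade from spirometry.

    fev1_percent_predicted: FEV1 as a percentage of the predicted value.
    fev1_fvc_ratio: post-bronchodilator FEV1/FVC.
    Thresholds are the commonly cited cutoffs, simplified for illustration.
    """
    if fev1_fvc_ratio >= 0.70:
        return 0  # airflow limitation not consistent with COPD
    if fev1_percent_predicted >= 80:
        return 1  # mild
    if fev1_percent_predicted >= 50:
        return 2  # moderate
    if fev1_percent_predicted >= 30:
        return 3  # severe
    return 4      # very severe

print(gold_stage(65, 0.62))  # -> 2
```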
So for example, in stage 1 of COPD, common treatments involve just vaccinations and using a short-acting bronchodilator only when needed. When the disease gets much more severe, like stage 4, then treatment is often recommended to include inhaled glucocorticosteroids if there are repeated exacerbations of the disease, long-term oxygen if respiratory failure occurs, and so on. So this is a disease that's reasonably well understood, because there exists a good staging mechanism.

And I would argue that when we want to understand how to do disease staging in a data-driven fashion, we should first start by working either with synthetic data, or with a disease where we have some idea of what the actual true disease staging is. That way, we can look to see what our algorithms recover in those scenarios, and whether it aligns with what we would expect, either from the way the data was generated or from the existing medical literature. And that's why we chose COPD: because it is well understood, because there's a wealth of literature on it, and because we have data on it which is much messier than the type of data that went into the original studies, and we could ask, could we come to the same conclusions as those original studies?

So in this work, we're going to use data from the electronic medical record. We're only going to look at a subset of the EMR-- in particular, diagnosis codes that are recorded for a patient at any point in time-- and we're going to assume that we do not have access to spirometry data at all. So we don't have any obvious way of staging the patient's disease.

The general approach is going to be to build a generative model for disease progression. At a very high level, this is a Markov model. It's a model that specifies the distribution of the patient's data, which is shown here at the bottom, as it evolves over time,
according to a number of hidden variables shown at the top: these S variables, which denote disease stages, and these X variables, which denote comorbidities that the patient might have at that point in time. These X and S variables are always assumed to be unobserved. So if you were to clump them together into one variable, this would look exactly like a hidden Markov model.

Moreover, we're not going to assume that we have a lot of longitudinal data for a patient. In particular, COPD evolves over 10 to 20 years, and the data that we'll be learning from here covers only a one- to three-year time range per patient. The challenge will be: can we take data in this one- to three-year time range and somehow stitch it together across large numbers of patients to get a picture of what the 20-year progression of the disease might look like? The way that we're going to do that is by learning the parameters of this probabilistic model. And then, from the parameters, we're going to either infer each patient's actual disease stage and thus sort them, or actually simulate data from this model to see what a 20-year trajectory might look like. Is the goal clear?

All right. So now what I'm going to do is step into this model piece by piece to tell you what each of these components is, and I'll start with the very topmost piece, shown here by the red box. This is the model of the patient's disease stage at each point in time. So this variable S1, for example, might denote the patient's disease stage in March 2011; S2 might denote the patient's disease stage in April 2011; S capital T might denote the patient's disease stage in June 2012. So we're going to have one random variable for each observation of the patient's data that we have. And notice that the observations of the patient's data might be at very irregular time intervals, and that's going to be OK with this approach, OK?
So notice that there is a one-month gap between S1 and S2, but a four-month gap between St minus 1 and St, OK? So we're going to model the patient's disease stage at each point in time where we have an observation for that patient.

S denotes a discrete disease stage in this model. So S might be a value from 1 up to 4, or maybe 1 up to 10, where 1 denotes an early disease stage and 4 or 10 denotes a much later disease stage.

If we have a sequence of observations per patient-- for example, we might have an observation in March and then in April, and we denote the corresponding disease stages by S1 and S2-- what this model is going to talk about is the probability distribution of transitioning from whatever the disease stage at S1 is to whatever the disease stage at S2 is. Now, because the time intervals between observations are not homogeneous, we have to have a transition distribution that takes that time gap into consideration. And to do that, we use what's known as a continuous-time Markov process.

Formally, we say that the transition distribution-- the probability of transitioning from stage i at time t minus 1 to stage j at time t, given as input the difference between the two time points, delta (so delta is the number of months between the two observations)-- is given by the matrix exponential of that time interval times a matrix Q: P(St = j | St-1 = i, delta) = [exp(delta Q)]ij. And here, the matrix Q gives us the parameters that we want to learn.

So let me contrast this with things that you might already be used to. In a typical hidden Markov model, you have St minus 1 going to St, and you might imagine parametrizing St given St minus 1 just by a lookup table.
So for example, if the number of states for each variable is 3, then you would have a 3-by-3 table where, for each state St minus 1, you have some probability of transitioning to the corresponding state St. So this might be something like 0.9, 0.9, 0.9 along the diagonal-- notice I'm putting a very large value along the diagonal, because over a small time period, a priori, we might believe that patients stay in the same disease stage. And then one might imagine that the probability of transitioning from stage 1 at time t minus 1 to stage 2 at time t might be something like 0.09, and the probability of skipping stage 2 and going directly to stage 3 from stage 1 might be something much smaller, like 0.01, OK? And we might imagine that the probability of going in the backwards direction-- going from stage 2 at time t minus 1 to, let's say, stage 1 at time t-- might be 0, all right? So you might imagine that this is actually the model, and what that's saying is something like: you never go in the backwards direction, you're more likely to transition to the stage immediately adjacent to your current stage, and you're very unlikely to skip a stage.

So that would be an example of how you would parametrize the transition distribution in a typical discrete-time Markov model. The story here is going to be different, specifically because the time intervals are not uniform. So intuitively, if a lot of time has passed between the two observations, then we want to allow for an accelerated process-- we want to allow for the fact that you might skip several stages by the time of the next observation, because so much time has passed. And that, intuitively, is what the scaling of this matrix Q by delta corresponds to. So the number of parameters in this parametrization is actually identical to the number of parameters in that parametrization, right?
So you have a matrix Q, which is given to you in essence by the number of states squared-- really, there's an additional redundancy there because it has to sum to 1, but that's irrelevant. So it's the same story here, but we're now going to parametrize the process by, in some sense, the probability of transitioning in an infinitesimally small time interval. If you were to take the derivative of this transition distribution as the time interval shrinks, and then integrate, over the time interval that was observed, the probability of transitioning from any state to any other state with that infinitesimally small transition probability, what you get out is exactly this form. This paper is in the optional readings for today's lecture, and you can read through it to get more intuition about the continuous-time Markov process.
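As a small illustration of the continuous-time transition just described, the sketch below builds a toy generator matrix Q and computes P(St = j | St-1 = i, delta) = [exp(delta Q)]ij with SciPy's matrix exponential; the numbers are invented and are not the learned COPD parameters.

```python
import numpy as np
from scipy.linalg import expm

# Toy generator (rate) matrix Q for 3 stages: rows sum to zero, and the
# off-diagonal entries are instantaneous transition rates. Backwards
# transitions are disallowed (zeros below the diagonal).
Q = np.array([
    [-0.10,  0.09,  0.01],
    [ 0.00, -0.08,  0.08],
    [ 0.00,  0.00,  0.00],   # stage 3 is absorbing in this toy example
])

def transition_matrix(Q, delta_months):
    """P(S_t = j | S_{t-1} = i, delta) = [exp(delta * Q)]_{ij}."""
    return expm(delta_months * Q)

# A one-month gap gives mostly self-transitions; a four-month gap puts more
# mass on later stages -- the delta-scaling intuition, with the same Q.
print(transition_matrix(Q, 1).round(3))
print(transition_matrix(Q, 4).round(3))
```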
Any questions so far? Yep?

AUDIENCE: Those Q are the same for both versions, or--

PROFESSOR: Yes. And this matrix Q is essentially the same for all patients. You might imagine, if there were disease subtypes-- which there aren't in this approach-- that Q might be different for each subtype. For example, you might transition between stages much more quickly for some subtypes than for others. Other questions? Yep?

AUDIENCE: So-- OK, so Q, you said, had-- it's just a number you fix beforehand? You kind of pre-specify these stages that you pick [INAUDIBLE]

PROFESSOR: Correct, yes. So you pre-specify the number of stages that you want to model, and there are many ways to try to choose that parameter. For example, you could look at held-out likelihood under this model when it's learned with different numbers of stages. You could use typical model selection techniques from machine learning as another approach, where you try to penalize complexity in some way. Or-- and this is what we found here, because of some of the other things that I'm about to tell you-- it doesn't actually matter that much. Similarly to when one does hard [INAUDIBLE] clustering or K-means clustering or learns a probabilistic topic model: if you use a very small number of topics or clusters, you tend to learn very coarse-grained topics or clusters; if you use a much larger number of topics, you tend to learn much more fine-grained topics. The same story is going to happen here. If you use a small number of disease stages, you're going to learn a very coarse-grained notion of disease stage; if you use more disease stages, you're going to learn a finer-grained notion; but the overall sorting of the patients is going to end up being very similar. To make that statement, though, we're going to need to make some additional assumptions, which I'm going to show you in a few minutes.

Any other questions? These are great questions. Yep?

AUDIENCE: So do we know the staging of the disease, because I--

PROFESSOR: No, and that's critical here. I'm assuming that these variables-- these S's-- are all hidden variables here. And the way that we're going to learn this model is by maximum likelihood estimation, where we marginalize over the hidden variables, just like you would do in any EM-type algorithm.

Any other questions?
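Coming back to the earlier question about picking the number of stages, one common recipe is sketched below; fit_progression_model and log_likelihood are hypothetical stand-ins for the EM-based learner just described, not actual functions from this work.

```python
def select_num_stages(train_seqs, heldout_seqs, candidate_ks, fit_progression_model):
    """Choose the number of stages by held-out log-likelihood.

    fit_progression_model(train_seqs, num_stages) is a hypothetical stand-in
    for the EM-based learner described above; it is assumed to return a model
    with a log_likelihood(seqs) method. A BIC-style complexity penalty could
    be used instead when no held-out set is available.
    """
    scores = {}
    for k in candidate_ks:
        model = fit_progression_model(train_seqs, num_stages=k)
        scores[k] = model.log_likelihood(heldout_seqs)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```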
All right, so what I've just shown you is the topmost part of the model; now I'm going to talk about a horizontal slice-- one of these time points. So if you were to look at one of those time points, what you would get out is this model. These X's are also hidden variables, and we have pre-specified them to characterize different axes along which we want to understand the patient's disease progression.

So in Thursday's lecture, we characterized the patient's disease subtype by just a single number, and similarly in this example the stage is just a single number, but we might want to understand what's really unique about each subtype-- sorry, what's really unique about each disease stage. So for example, how is the patient's endocrine function in that disease stage? How is the patient's psychiatric status in that disease stage? Has the patient developed lung cancer yet in that disease stage? And so on.

And so we're going to ask to be able to read out from this model according to these axes, and this will become very clear at the end of this section, where I show you a simulation of what 20 years looks like for COPD according to these quantities: when does the patient typically develop diabetes, when does the patient typically become depressed, when does the patient typically develop cancer, and so on. So these are the quantities with which we want to be able to talk about what happens to a patient at any one disease stage. But the challenge is, we never actually observe these quantities in the data that we have. Rather, all we observe are things like laboratory test results, diagnosis codes, procedures that have been performed, and so on, which I'm going to call the clinical findings at the bottom. And as we've been discussing throughout this course, one could think about things like diagnosis codes as giving you information about the disease status of the patient, but they're not one and the same as the diagnosis, because there's so much noise and bias that goes into the assigning of diagnosis codes for patients.

And so the way that we're going to model the raw data, as a function of these hidden variables that we want to characterize, is using what's known as a noisy-OR network.
So we're going to suppose that there is some generative distribution where the observations you see-- for example, diagnosis codes-- are likely to be observed as a function of whether the patient has these phenotypes or comorbidities, with some probability, and that probability is specified by these edge weights. So for example, a diagnosis code for diabetes is very likely to be observed in the patient's data if the patient truly has diabetes; but of course, it may not be recorded in the data for every single visit the patient has to a clinician-- there might be some visits to clinicians that have nothing to do with the patient's endocrine function and diabetes, and the diagnosis code might not be recorded for that visit. So it's going to be a noisy process, and that noise rate is going to be captured by that edge.

So part of the learning algorithm is going to be to learn the transition distributions-- for example, that Q matrix I showed you on the earlier slide-- but the other role of the learning algorithm is to learn all of the parameters of this noisy-OR distribution, namely these edge weights. So that's going to be discovered as part of the learning algorithm.

And a key question that you have to ask me is: if I know I want to read out from the model according to these axes, but these axes are never assumed to be explicitly observed in the data, how do I ground the learning algorithm to give meaning to these hidden variables? Because otherwise, if we left them unconstrained and did maximum likelihood estimation, just like in any factor analysis-type model, you might discover some factors here, but they might not be the factors you care about; and if the learning problem is not identifiable, as is often the case in unsupervised learning, then you might not discover what you're interested in.
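As a minimal sketch of the noisy-OR observation model just described, the code below generates binary diagnosis codes from binary comorbidity variables; the weights, dimensions, and leak probability are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy noisy-OR emission model: K binary comorbidity variables X generate
# D binary observations O (e.g., diagnosis codes). W[k, d] is the probability
# that comorbidity k "turns on" code d at a given visit; `leak` is a small
# background probability that a code appears for no modeled reason.
K, D = 3, 5
W = rng.uniform(0.0, 0.8, size=(K, D))
leak = 0.01

def sample_codes(x, W, leak):
    """Sample observed codes given a comorbidity indicator vector x (length K)."""
    # P(O_d = 0 | x) = (1 - leak) * prod_k (1 - W[k, d]) ** x_k
    p_off = (1 - leak) * np.prod((1 - W) ** x[:, None], axis=0)
    return (rng.uniform(size=D) < 1 - p_off).astype(int)

x = np.array([1, 0, 1])          # the patient truly has comorbidities 1 and 3
print(sample_codes(x, W, leak))  # noisy diagnosis codes for one visit
```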
So to ground the hidden variables, we used a technique that you already saw in an earlier lecture, lecture 8, called anchors. A domain expert is going to specify, for each one of the comorbidities, one or more anchors, which are observations that we conjecture could only have arisen from the corresponding hidden variable. So notice here that this diagnosis code, which is for type 2 diabetes, has an edge only from X1. That is an assumption that we're making in the learning algorithm: we are explicitly zeroing out all of the edges from all of the other comorbidities to O1. We're not going to pre-specify what this edge weight is-- we're going to allow for the fact that it might be noisy, that the code is not always observed even if the patient has diabetes-- but we are going to say that this observation could not be explained by any of the other comorbidities. And so for each one of the comorbidities or phenotypes that we want to model, we're going to specify some small number of anchors, which corresponds to a kind of sparsity assumption on that graph.

And these are the anchors that we chose. For asthma, we chose a diagnosis code corresponding to asthma; for lung cancer, we chose several diagnosis codes corresponding to lung cancer; for obesity, we chose a diagnosis code corresponding to morbid obesity; and so on. These are the ways that we're going to give meaning to the hidden variables, but as you'll see in just a few minutes, this does not pre-specify too much of the model-- the model is still going to learn a whole bunch of other interesting things.

By the way, the way that we actually came up with this set was by an iterative process. We specified some of the hidden variables to have anchors, but we also left some of them unanchored, meaning free variables. We ran our learning algorithm and, just like you would do in a topic model, we discovered that there were some phenotypes that really seemed to characterize a patient's disease progression.
Then, in order to really dig deeper, working collaboratively with a domain expert, we specified anchors for those and we iterated, and in this way we discovered the full set of interesting variables that we wanted to model. Yep?

AUDIENCE: Did you measure how good an anchor these were? Like, are some comorbidities better anchors than others?

PROFESSOR: Great. I think we'll answer that question in just a few minutes, when I show you what the learned graph looks like. Yep?

AUDIENCE: Were all the other weights in that X-to-O network 0? They weren't part of it here. So it looks like a pretty sparse--

PROFESSOR: They're explicitly nonzero, actually-- it's the opposite. So for an anchor, we say that it has only a single parent. Everything that's not an anchor can have arbitrarily many parents. Is that clear? OK. Yeah?

AUDIENCE: The anchors that you have in that table-- did you iterate on those yourself, or did the doctors say that these are the [INAUDIBLE]?

PROFESSOR: We started out with just a subset of these conditions, as things that we wanted to model-- things for which we wanted to understand what happens along the disease progression, according to these axes-- but just a subset of them originally. And then we included a few additional hidden variables that didn't have any anchors associated with them, and after doing unsupervised learning in just a preliminary development stage, they discovered some topics and we realized, oh shoot, we should have included those in there, and then we added them in with corresponding anchors. So you could think about this as an exploratory data analysis pipeline. Yep?

AUDIENCE: Is there a chance that these aren't anchors?

PROFESSOR: Yes. So there's definitely a chance that these may not be anchors, related to the question that was asked a second ago.
So for example, there might be some chance that the morbid obesity diagnosis code might be coded more often for a patient who has, let's say, asthma-- this is a bad example-- in which case, that would correspond to there truly existing an edge from asthma to this anchor, which would be a violation of the anchor assumption. So we chose these to make that unlikely, but it could happen. And it's not easily testable. So this is another example of an untestable assumption, just like the many other examples we've already seen in today's lecture and in the causal inference lectures. If we had some ground truth data-- for example, if we had done chart review for some number of patients and actually labeled these conditions-- then we could test that anchor assumption. But here, we're assuming that we don't actually know the ground truth for these conditions.

AUDIENCE: Is there a reason why you chose such high-level comorbidities? I imagine you could go more specific. Even, say, for diabetes, you could try to subtype the diabetes based on this other model, sort of use that as a single layer, but at least this model seems to choose [INAUDIBLE] high level. I was just curious about the reason.

PROFESSOR: Yes. So that was a design choice that we made. There are many, many directions for follow-up work, one of which would be to use a hierarchical model here, but we haven't gone in that direction. Another obvious direction for follow-up work would be to combine subtyping with this staging, by introducing another random variable-- let's say, the disease subtype-- and making everything a function of that.

OK, so I've talked about the vertical slice and I've talked about the topmost slice, but what I still need to tell you about is how these phenotypes relate to the disease stage. So for this, we use-- I don't remember the exact technical terminology-- a factored Markov model? Is that right, Pete?
Factorized Markov model-- I mean, this is a term that exists in the graphical models literature, but I don't remember it right now. So what we're saying is that we have one Markov chain per comorbidity-- each of these chains X1 up to XT. This one, we'll say, is the first chain, which I'll call diabetes. This is the second one, which I'll say is depression. We're going to assume that each one of these Markov chains is conditionally independent of the others given the disease stage. So it's the disease stage which ties everything together. So the graphical model looks like this, and so on, OK? In particular, there are no edges between, let's say, the diabetes variable and the depression variable. All correlation between these conditions is assumed to be mediated by the disease stage variable.

And that's a critical assumption that we had to make. Does anyone know why? What would go wrong if we didn't make that assumption? So for example, what would go wrong if we had something that looked like this-- X1,1, X1,2, X1,3 (what was my notation?)-- and suppose we had edges between them, a complete graph, and we also had, let's say, the S variable with edges to everything? What would happen in that case, where we're not assuming that the X's are conditionally independent given S?

I want you to think about this in terms of distributions. Remember, we're going to learn how to do disease progression by learning the parameters of this model. And so if we set up the learning problem in a way which is unidentifiable, then we're going to be screwed-- we're not going to be able to learn anything about disease progression. So what would happen in this situation? Someone who hasn't spoken today, ideally. Do any of you remember, from, let's say, an earlier machine learning class, what types of distributions a complete graph-- a complete Bayesian network-- can represent?
756 00:37:56,990 --> 00:38:01,140 So the answer is all distributions, 757 00:38:01,140 --> 00:38:03,300 because it corresponds to any factorization 758 00:38:03,300 --> 00:38:05,220 of the joint distribution. 759 00:38:05,220 --> 00:38:08,670 And so if you allowed these X variables 760 00:38:08,670 --> 00:38:11,020 to be fully connected to each other-- so for example, 761 00:38:11,020 --> 00:38:13,620 saying that depression depends on diabetes in addition 762 00:38:13,620 --> 00:38:16,050 to the stage-- then in fact, you don't even 763 00:38:16,050 --> 00:38:17,580 need this stage variable in here. 764 00:38:17,580 --> 00:38:20,340 The marginal-- you can fit any distribution 765 00:38:20,340 --> 00:38:23,460 on these X variables even without the S variable at all. 766 00:38:23,460 --> 00:38:26,160 And so the model could learn to simply ignore 767 00:38:26,160 --> 00:38:30,240 the S variable, which would be exactly not our goal, 768 00:38:30,240 --> 00:38:34,050 because our goal is to learn something about the disease 769 00:38:34,050 --> 00:38:37,690 stage, and in fact, we're going to be wanting 770 00:38:37,690 --> 00:38:41,430 to make assumptions on the progression of disease 771 00:38:41,430 --> 00:38:44,990 stage, which is going to help us learn. 772 00:38:44,990 --> 00:38:48,650 So by assuming conditional independence between these X 773 00:38:48,650 --> 00:38:51,770 variables, it's going to force all of the correlations 774 00:38:51,770 --> 00:38:54,230 to have to be mediated by that S variable, 775 00:38:54,230 --> 00:38:56,805 and it's going to remove some of that unidentifiability that 776 00:38:56,805 --> 00:38:59,840 would otherwise exist. 777 00:38:59,840 --> 00:39:01,760 It's a subtle but very important point. 778 00:39:05,520 --> 00:39:07,580 So the way that we're going to parametrize 779 00:39:07,580 --> 00:39:09,570 the conditional distribution-- so first of all, 780 00:39:09,570 --> 00:39:12,470 I'm going to assume these X's are all binary. 781 00:39:12,470 --> 00:39:14,420 So either the patient has diabetes 782 00:39:14,420 --> 00:39:16,900 or they don't have diabetes. 783 00:39:16,900 --> 00:39:18,490 I'm going to suppose that-- 784 00:39:18,490 --> 00:39:21,520 and this is, again, another assumption we're making-- 785 00:39:21,520 --> 00:39:27,430 I'm going to suppose that once you already have a comorbidity, 786 00:39:27,430 --> 00:39:28,430 then you always have it. 787 00:39:28,430 --> 00:39:31,632 So for example, once this is 1, then all subsequent ones 788 00:39:31,632 --> 00:39:32,590 are also going to be 1. 789 00:39:35,190 --> 00:39:37,710 Hold the questions for just a second. 790 00:39:37,710 --> 00:39:39,390 I'm also going to make an assumption 791 00:39:39,390 --> 00:39:42,420 that in later stages of the disease you're more likely to develop 792 00:39:42,420 --> 00:39:43,680 the comorbidity. 793 00:39:43,680 --> 00:39:46,860 So in particular, one can formalize that mathematically 794 00:39:46,860 --> 00:39:53,775 as the probability of X-- 795 00:39:56,360 --> 00:40:04,155 I'll just say Xt being 1 given S-- 796 00:40:04,155 --> 00:40:16,190 I'll say St equals little s, comma, Xt minus 1 equals 797 00:40:16,190 --> 00:40:22,010 0, and suppose that this is larger 798 00:40:22,010 --> 00:40:29,420 than or equal to the probability of Xt equals 1 given St equals 799 00:40:29,420 --> 00:40:45,200 s prime and Xt minus 1 equals 0, for all s prime less than s, 800 00:40:45,200 --> 00:40:46,190 OK?
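To make that structure concrete, here is a minimal sketch of the generative process just described: comorbidity chains that are conditionally independent given a stage chain, comorbidities that persist once acquired, and onset probabilities that are monotone in stage. Everything here (the sizes, the parameter values, and the forward-only transition matrix) is a hypothetical illustration, not the parametrization used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, N_STAGES = 12, 3, 6          # time steps, comorbidity chains, disease stages

# Stage transition matrix (rows sum to 1); assumed forward-only for this sketch.
A = np.triu(rng.random((N_STAGES, N_STAGES)))
A /= A.sum(axis=1, keepdims=True)

# p_onset[i, s] = P(X_{i,t} = 1 | S_t = s, X_{i,t-1} = 0).
# Sorting each row enforces the monotonicity assumption: later stages are
# at least as likely to develop comorbidity i.
p_onset = np.sort(rng.uniform(0.01, 0.4, size=(K, N_STAGES)), axis=1)

def sample_trajectory():
    """Sample (S, X) from the factored model: chains are independent given S."""
    S = np.zeros(T, dtype=int)
    X = np.zeros((K, T), dtype=int)
    for t in range(1, T):
        S[t] = rng.choice(N_STAGES, p=A[S[t - 1]])
        for i in range(K):
            if X[i, t - 1] == 1:
                X[i, t] = 1                          # comorbidities persist forever
            else:
                X[i, t] = rng.random() < p_onset[i, S[t]]
    return S, X

S, X = sample_trajectory()
print("stages:", S)
print("comorbidity indicators:\n", X)
```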
801 00:40:46,190 --> 00:40:48,580 So I'm saying, as you get further along in the disease 802 00:40:48,580 --> 00:40:52,450 stage, you're more likely to observe 803 00:40:52,450 --> 00:40:54,280 one of these complications. 804 00:40:56,828 --> 00:40:58,620 And again, this is an assumption that we're 805 00:40:58,620 --> 00:41:00,780 putting into the learning algorithm, 806 00:41:00,780 --> 00:41:04,525 but what we found is that these types of assumptions 807 00:41:04,525 --> 00:41:06,900 are really critical in order to learn disease progression 808 00:41:06,900 --> 00:41:11,270 models when you don't have a large amount of data. 809 00:41:11,270 --> 00:41:16,350 And note that this is just a linear inequality 810 00:41:16,350 --> 00:41:19,300 on the parameters of the model. 811 00:41:19,300 --> 00:41:24,690 And so one can use a convex optimization algorithm 812 00:41:24,690 --> 00:41:27,525 during learning-- 813 00:41:27,525 --> 00:41:29,400 during the maximum likelihood estimation step 814 00:41:29,400 --> 00:41:32,610 of this algorithm, we just put a linear inequality 815 00:41:32,610 --> 00:41:34,500 into the convex optimization problem 816 00:41:34,500 --> 00:41:37,170 to enforce this constraint. 817 00:41:37,170 --> 00:41:38,640 There are a couple of questions. 818 00:41:38,640 --> 00:41:40,890 AUDIENCE: Is there generally like a quick way to check 819 00:41:40,890 --> 00:41:43,965 whether a model is unidentifiable or-- 820 00:41:48,510 --> 00:41:51,480 PROFESSOR: So there are ways to try 821 00:41:51,480 --> 00:41:54,960 to detect to see if a model is unidentifiable. 822 00:41:54,960 --> 00:41:56,440 It's beyond the scope of the class, 823 00:41:56,440 --> 00:41:58,980 but I'll just briefly mention one of the techniques. 824 00:42:04,630 --> 00:42:08,530 So one could-- so you can ask the identifiability question 825 00:42:08,530 --> 00:42:10,570 by looking at moments of the distribution. 826 00:42:10,570 --> 00:42:12,430 For example, you could talk about it 827 00:42:12,430 --> 00:42:14,320 as a function of all of the observed 828 00:42:14,320 --> 00:42:17,227 moments of the distribution that you get from the data. 829 00:42:17,227 --> 00:42:19,810 Now the observed data here are not the S's and X's, but rather 830 00:42:19,810 --> 00:42:20,728 the O's. 831 00:42:20,728 --> 00:42:22,270 So you look at the joint distribution 832 00:42:22,270 --> 00:42:26,860 on the O's, and then you can ask questions about-- 833 00:42:26,860 --> 00:42:27,820 if I was to-- 834 00:42:27,820 --> 00:42:30,280 so suppose I was to choose a random set of parameters 835 00:42:30,280 --> 00:42:33,040 in the model, is there any way to do 836 00:42:33,040 --> 00:42:35,440 a perturbation of the parameters in the model which 837 00:42:35,440 --> 00:42:39,220 leaves the observed marginal distribution on the O's 838 00:42:39,220 --> 00:42:41,290 identical? 839 00:42:41,290 --> 00:42:44,830 And often when you're in the setting of non-identifiability, 840 00:42:44,830 --> 00:42:50,712 you can take the gradient of a function and see-- 841 00:42:50,712 --> 00:42:53,170 and you could sort of find that there is some wiggle space, 842 00:42:53,170 --> 00:42:56,590 and then you show that OK, there are actually-- 843 00:42:56,590 --> 00:42:59,530 this objective function is actually unidentifiable.
844 00:42:59,530 --> 00:43:01,690 Now that type of technique is widely 845 00:43:01,690 --> 00:43:03,940 used when studying what are known as method of moments 846 00:43:03,940 --> 00:43:05,730 algorithms or estimation algorithms 847 00:43:05,730 --> 00:43:08,650 in learning latent variable models, but they would be much, much 848 00:43:08,650 --> 00:43:13,892 harder to apply in this type of setting because first of all, 849 00:43:13,892 --> 00:43:15,350 these are much more complex models, 850 00:43:15,350 --> 00:43:17,330 and estimating the corresponding moments 851 00:43:17,330 --> 00:43:21,350 is going to be very hard because they're very high dimensional. 852 00:43:21,350 --> 00:43:25,088 And second, because they're-- 853 00:43:25,088 --> 00:43:27,130 I'm actually conflating two different things when 854 00:43:27,130 --> 00:43:28,760 I talk about identifiability. 855 00:43:28,760 --> 00:43:33,110 One statement is the infinite data identifiability, 856 00:43:33,110 --> 00:43:37,670 and the second question is your ability 857 00:43:37,670 --> 00:43:42,230 to actually learn a good model from a small amount of data, 858 00:43:42,230 --> 00:43:44,660 which is the sample complexity. 859 00:43:44,660 --> 00:43:48,440 And these constraints that I'm putting in, 860 00:43:48,440 --> 00:43:51,470 even if they don't affect the actual identifiability 861 00:43:51,470 --> 00:43:53,415 of the model, they could be extremely 862 00:43:53,415 --> 00:43:55,790 important for improving the sample complexity of the learning 863 00:43:55,790 --> 00:43:57,637 algorithm. 864 00:43:57,637 --> 00:43:58,720 Is there another question? 865 00:44:03,715 --> 00:44:05,090 So we evaluated this using a data 866 00:44:05,090 --> 00:44:10,050 set of almost 4,000 patients where, 867 00:44:10,050 --> 00:44:16,310 again, each patient we observed for only a couple of years-- 868 00:44:16,310 --> 00:44:18,310 one to three years. 869 00:44:18,310 --> 00:44:20,370 And the observations that we observed 870 00:44:20,370 --> 00:44:23,220 were 264 diagnosis codes, the presence 871 00:44:23,220 --> 00:44:25,950 or absence of each of those diagnosis codes 872 00:44:25,950 --> 00:44:28,080 during any three-month interval. 873 00:44:28,080 --> 00:44:32,760 Overall, there were almost 200,000 observations 874 00:44:32,760 --> 00:44:35,172 of diagnosis codes in this data set. 875 00:44:35,172 --> 00:44:36,630 The learning algorithm that we used 876 00:44:36,630 --> 00:44:38,650 was expectation maximization. 877 00:44:38,650 --> 00:44:40,900 Remember, there are a number of hidden variables here, 878 00:44:40,900 --> 00:44:42,330 and so if you want to maximize the likely-- 879 00:44:42,330 --> 00:44:44,372 if you want to learn the parameters that maximize 880 00:44:44,372 --> 00:44:46,230 the likelihood of those observations O, 881 00:44:46,230 --> 00:44:48,522 then you have to marginalize over those hidden variables, 882 00:44:48,522 --> 00:44:53,010 and EM is one way to try to find a local optimum of that 883 00:44:53,010 --> 00:44:59,500 likelihood function, with the key caveat that one has to do 884 00:44:59,500 --> 00:45:02,620 approximate inference during the E step here, 885 00:45:02,620 --> 00:45:05,500 because this model is not tractable, 886 00:45:05,500 --> 00:45:07,330 there's no closed-form-- 887 00:45:07,330 --> 00:45:10,090 for example, dynamic programming-- algorithm for doing posterior 888 00:45:10,090 --> 00:45:13,420 inference in this model, given its complexity.
889 00:45:13,420 --> 00:45:15,520 And so what we used was a Gibbs sampler 890 00:45:15,520 --> 00:45:17,690 to do approximate inference within that E step, 891 00:45:17,690 --> 00:45:19,090 and we used-- 892 00:45:19,090 --> 00:45:21,460 we did block sampling of the Markov chains 893 00:45:21,460 --> 00:45:23,590 where we combined a Gibbs sampler 894 00:45:23,590 --> 00:45:25,420 with a dynamic programming algorithm, which 895 00:45:25,420 --> 00:45:27,400 improved the mixing rate of the Markov chain 896 00:45:27,400 --> 00:45:30,460 for those of you who are familiar with those concepts. 897 00:45:30,460 --> 00:45:34,520 And in the M step of the learning algorithm, when 898 00:45:34,520 --> 00:45:38,150 one has to learn the parameters of the distribution, 899 00:45:38,150 --> 00:45:40,070 the only complex part of this model 900 00:45:40,070 --> 00:45:42,440 is the continuous time Markov process, 901 00:45:42,440 --> 00:45:44,330 and there's actually been previous literature 902 00:45:44,330 --> 00:45:47,540 from the physics community which shows how you can really-- 903 00:45:47,540 --> 00:45:49,610 which gives you analytic closed-form solutions 904 00:45:49,610 --> 00:45:53,480 for that M step of that continuous time Markov process. 905 00:45:53,480 --> 00:45:55,790 Now if I were to do this again today, 906 00:45:55,790 --> 00:45:58,530 I would do it a little bit differently. 907 00:45:58,530 --> 00:46:00,680 I would still think about modeling this problem 908 00:46:00,680 --> 00:46:03,230 in a very similar way, but I would 909 00:46:03,230 --> 00:46:06,530 do learning using a variational lower bound 910 00:46:06,530 --> 00:46:09,350 of the likelihood with a recognition network 911 00:46:09,350 --> 00:46:12,590 in order to very quickly get you a lower bound 912 00:46:12,590 --> 00:46:14,330 on the likelihood. 913 00:46:14,330 --> 00:46:16,580 And for those of you who are familiar 914 00:46:16,580 --> 00:46:18,200 with variational autoencoders, that's 915 00:46:18,200 --> 00:46:20,167 precisely the idea that is used there 916 00:46:20,167 --> 00:46:21,750 for learning variational autoencoders. 917 00:46:21,750 --> 00:46:22,880 So that's the way I would approach this 918 00:46:22,880 --> 00:46:23,880 if I was to do it again. 919 00:46:26,230 --> 00:46:29,730 There's just one or two other extensions I want to mention. 920 00:46:29,730 --> 00:46:32,500 The first one is something which we-- 921 00:46:32,500 --> 00:46:35,570 one more customization we made for COPD, 922 00:46:35,570 --> 00:46:38,900 which is that we enforced monotonic stage progression. 923 00:46:38,900 --> 00:46:41,930 So we said that-- 924 00:46:41,930 --> 00:46:45,195 so here I talked about a type of monotonicity 925 00:46:45,195 --> 00:46:47,070 in terms of the conditional distribution of X 926 00:46:47,070 --> 00:46:54,540 given S, but one could also put an assumption in the-- 927 00:46:54,540 --> 00:46:58,020 I already talked about that, but one could also 928 00:46:58,020 --> 00:47:01,005 put an assumption on P of S-- 929 00:47:04,520 --> 00:47:07,890 S of t given S of t minus 1, which is implicitly 930 00:47:07,890 --> 00:47:09,870 an assumption on Q, and I gave you 931 00:47:09,870 --> 00:47:12,333 a hint of how one might do that over here when 932 00:47:12,333 --> 00:47:15,000 I said that you might put 0's on the left-hand side, meaning you 933 00:47:15,000 --> 00:47:17,017 can never go to the left.
934 00:47:17,017 --> 00:47:18,600 And indeed, we did something like that 935 00:47:18,600 --> 00:47:20,903 here as well, which is another type of constraint. 936 00:47:24,220 --> 00:47:27,130 And finally, we regularized the learning problem 937 00:47:27,130 --> 00:47:30,130 by asking that the graph involving 938 00:47:30,130 --> 00:47:32,890 the conditions, the comorbidities, 939 00:47:32,890 --> 00:47:35,440 and the diagnosis codes be sparse, 940 00:47:35,440 --> 00:47:40,090 by putting a beta prior on those edge weights. 941 00:47:40,090 --> 00:47:42,340 So here's what one learned. 942 00:47:42,340 --> 00:47:43,900 So the first thing I'm going to do 943 00:47:43,900 --> 00:47:46,570 is I'm going to show you the-- 944 00:47:46,570 --> 00:47:48,640 we talked about how we specified anchors, 945 00:47:48,640 --> 00:47:51,550 but I told you that the anchors weren't the whole story. 946 00:47:51,550 --> 00:47:54,340 That we were able to infer much more interesting things 947 00:47:54,340 --> 00:47:57,130 about the hidden variables given all of the observations 948 00:47:57,130 --> 00:47:58,420 we have. 949 00:47:58,420 --> 00:48:01,630 So here I'm showing you several of the phenotypes that 950 00:48:01,630 --> 00:48:03,400 were learned by this unsupervised learning 951 00:48:03,400 --> 00:48:04,150 algorithm. 952 00:48:04,150 --> 00:48:06,480 First, the phenotype for kidney disease. 953 00:48:06,480 --> 00:48:10,660 In red here, I'm showing you the anchor variables 954 00:48:10,660 --> 00:48:15,010 that we chose for kidney disease, and what you'll notice 955 00:48:15,010 --> 00:48:16,250 are a couple of things. 956 00:48:16,250 --> 00:48:18,880 First, the weights-- which you should think about 957 00:48:18,880 --> 00:48:22,270 as being proportional in some way 958 00:48:22,270 --> 00:48:25,360 to how often you would see that diagnosis code given 959 00:48:25,360 --> 00:48:27,550 that the patient had kidney disease-- 960 00:48:27,550 --> 00:48:30,730 the weights are all far less than one, all right? 961 00:48:30,730 --> 00:48:34,690 So there is some noise in this process of when you observe 962 00:48:34,690 --> 00:48:37,340 a diagnosis code for a patient. 963 00:48:37,340 --> 00:48:39,090 The second thing you observe is that there 964 00:48:39,090 --> 00:48:46,080 are a number of other diagnosis codes that are observed to be-- 965 00:48:46,080 --> 00:48:52,050 which are explained by this kidney disease comorbidity, 966 00:48:52,050 --> 00:48:57,980 such as anemia, urinary tract infections, and so on, 967 00:48:57,980 --> 00:49:01,220 and that aligns well with what's known in the medical literature 968 00:49:01,220 --> 00:49:03,470 about kidney disease. 969 00:49:03,470 --> 00:49:06,050 Let's look at another example, for lung cancer. 970 00:49:06,050 --> 00:49:08,240 In red here I'm showing you the anchors 971 00:49:08,240 --> 00:49:10,250 that we had pre-specified for these, which 972 00:49:10,250 --> 00:49:14,030 means that these diagnosis codes can only be explained 973 00:49:14,030 --> 00:49:19,970 by the lung cancer comorbidity, and these 974 00:49:19,970 --> 00:49:22,550 are the noise rates that are learned for them, 975 00:49:22,550 --> 00:49:24,085 and that's everything else.
976 00:49:24,085 --> 00:49:26,660 And here's one more example of lung infection 977 00:49:26,660 --> 00:49:29,390 where there was only a single anchor that we specified, 978 00:49:29,390 --> 00:49:32,900 for pneumonia, and you see all of the other things that 979 00:49:32,900 --> 00:49:34,640 are automatically associated with it 980 00:49:34,640 --> 00:49:36,914 by the unsupervised learning algorithm. 981 00:49:36,914 --> 00:49:37,808 Yep? 982 00:49:37,808 --> 00:49:41,180 AUDIENCE: So how do you [INAUDIBLE] 983 00:49:41,180 --> 00:49:42,282 for the comorbidity, right? 984 00:49:42,282 --> 00:49:42,990 PROFESSOR: Right. 985 00:49:42,990 --> 00:49:45,490 So that's what the unsupervised learning algorithm is doing. 986 00:49:45,490 --> 00:49:47,850 So these weights are learned, and I'm showing you 987 00:49:47,850 --> 00:49:51,180 something like a point estimate of the parameters 988 00:49:51,180 --> 00:49:53,460 that are learned by the learning algorithm. 989 00:49:53,460 --> 00:49:54,630 AUDIENCE: And so we-- 990 00:49:54,630 --> 00:49:56,922 PROFESSOR: Just like if you're learning a Markov model, 991 00:49:56,922 --> 00:50:00,190 you learn some transition and [INAUDIBLE], same thing here. 992 00:50:00,190 --> 00:50:00,690 All right. 993 00:50:00,690 --> 00:50:02,190 And this should look a lot like what 994 00:50:02,190 --> 00:50:04,770 you would see when you do topic modeling on a text corpus, 995 00:50:04,770 --> 00:50:05,270 right? 996 00:50:05,270 --> 00:50:08,880 You would discover a topic-- this is analogous to a topic. 997 00:50:08,880 --> 00:50:11,390 It's a discrete topic, meaning it either occurs 998 00:50:11,390 --> 00:50:13,170 or it doesn't occur for a patient. 999 00:50:13,170 --> 00:50:15,838 And you would discover some word-topic distribution. 1000 00:50:15,838 --> 00:50:17,880 This is analogous to that word-topic distribution 1001 00:50:17,880 --> 00:50:20,430 for a topic in a topic model. 1002 00:50:24,090 --> 00:50:25,850 So one could then use the model to answer 1003 00:50:25,850 --> 00:50:28,280 a couple of the original questions we set out to answer. 1004 00:50:28,280 --> 00:50:32,750 The first one is: given a patient's data, 1005 00:50:32,750 --> 00:50:35,880 which I'm illustrating here on the bottom, 1006 00:50:35,880 --> 00:50:37,910 I have artificially separated it out 1007 00:50:37,910 --> 00:50:41,150 into three different comorbidities, 1008 00:50:41,150 --> 00:50:43,280 and a star denotes an observation 1009 00:50:43,280 --> 00:50:45,780 of that data type. 1010 00:50:45,780 --> 00:50:47,910 But this was artificially done by us, 1011 00:50:47,910 --> 00:50:50,180 it was not given to the learning algorithm. 1012 00:50:50,180 --> 00:50:54,890 One can infer when the patient initiated-- 1013 00:50:54,890 --> 00:50:58,610 started with-- each one of these comorbidities, and also, when-- 1014 00:50:58,610 --> 00:51:00,585 so for the full three-year time range 1015 00:51:00,585 --> 00:51:02,210 that we have data for the patient, what 1016 00:51:02,210 --> 00:51:04,670 stage of the disease was the patient in at any one 1017 00:51:04,670 --> 00:51:05,340 point in time? 1018 00:51:05,340 --> 00:51:08,450 So this model infers that the patient starts out in stage 1, 1019 00:51:08,450 --> 00:51:12,230 and about half a year through the data collection 1020 00:51:12,230 --> 00:51:15,290 process, transitioned into stage 2 of COPD.
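A much-simplified sketch of that kind of read-out follows, assuming a single discrete stage chain with known, made-up parameters and a single binary observation stream. The actual model has the full factored structure and uses Gibbs sampling for inference; this only illustrates how one can obtain a posterior over the stage at each three-month interval.

```python
import numpy as np

# Hypothetical parameters: forward-only stage transitions, an initial stage
# distribution, and a per-stage probability of observing a diagnosis code.
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])
pi = np.array([0.8, 0.15, 0.05])
emit = np.array([0.05, 0.3, 0.7])     # P(observation = 1 | stage)

obs = np.array([0, 0, 1, 0, 1, 1, 1])  # made-up three-month-interval observations

def stage_posteriors(obs):
    """Forward-backward smoothing: P(stage at time t | all observations)."""
    T, S = len(obs), len(pi)
    like = np.where(obs[:, None] == 1, emit[None, :], 1 - emit[None, :])
    alpha = np.zeros((T, S)); beta = np.ones((T, S))
    alpha[0] = pi * like[0]; alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * like[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (like[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

print(np.round(stage_posteriors(obs), 2))
```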
1021 00:51:18,400 --> 00:51:20,860 Another thing that one could do using this model 1022 00:51:20,860 --> 00:51:25,240 is to simulate from the model and answer the question of what 1023 00:51:25,240 --> 00:51:27,700 would, let's say, a 20-year trajectory of the disease look 1024 00:51:27,700 --> 00:51:28,986 like? 1025 00:51:28,986 --> 00:51:31,070 And here, I'm showing a 10-year trajectory. 1026 00:51:31,070 --> 00:51:33,280 And again, only one to three years of data 1027 00:51:33,280 --> 00:51:36,700 was used for any one patient during learning. 1028 00:51:36,700 --> 00:51:41,670 So this is the first time we see those X's, 1029 00:51:41,670 --> 00:51:43,620 those comorbidities, really start to show up 1030 00:51:43,620 --> 00:51:46,440 as being important as a way of reading out from the model. 1031 00:51:46,440 --> 00:51:49,690 Here, we've thrown away those O's, those diagnosis codes 1032 00:51:49,690 --> 00:51:53,357 altogether, we only care about what we conjecture is truly 1033 00:51:53,357 --> 00:51:55,440 happening to the patient, those X variables, which 1034 00:51:55,440 --> 00:51:58,300 are unobserved during training. 1035 00:51:58,300 --> 00:52:04,130 So what we conjecture is that kidney disease is very uncommon 1036 00:52:04,130 --> 00:52:08,810 in stage 1 of the disease, and increases slowly as you 1037 00:52:08,810 --> 00:52:13,160 transition from stage 2, stage 3, to stage 4 of the disease, 1038 00:52:13,160 --> 00:52:15,645 and then really bumps up towards stage 5 1039 00:52:15,645 --> 00:52:16,770 and stage 6 of the disease. 1040 00:52:16,770 --> 00:52:18,320 So you should read this as saying 1041 00:52:18,320 --> 00:52:22,280 that in stage 6 of the disease, over 60% 1042 00:52:22,280 --> 00:52:24,650 of people have kidney disease. 1043 00:52:24,650 --> 00:52:27,050 Now, the time intervals are here. 1044 00:52:27,050 --> 00:52:30,860 So how I've chosen these-- 1045 00:52:30,860 --> 00:52:35,420 where to put these triangles, I've 1046 00:52:35,420 --> 00:52:39,050 chosen them based on the average amount of time 1047 00:52:39,050 --> 00:52:42,560 it takes to transition from one stage to the next stage 1048 00:52:42,560 --> 00:52:45,490 according to the learned parameters of the model. 1049 00:52:45,490 --> 00:52:48,160 And so you see that stages 1, 2, 3, 1050 00:52:48,160 --> 00:52:50,890 and 4 take a long 1051 00:52:50,890 --> 00:52:54,440 amount of time to transition between those four stages, 1052 00:52:54,440 --> 00:52:56,280 and then there's a very small amount 1053 00:52:56,280 --> 00:52:58,030 of time between transitioning from stage 5 1054 00:52:58,030 --> 00:53:01,162 to stage 6 on average. 1055 00:53:01,162 --> 00:53:02,370 So that's for kidney disease. 1056 00:53:02,370 --> 00:53:06,400 One could also read this out for other comorbidities. 1057 00:53:06,400 --> 00:53:09,120 So in orange here-- in yellow here 1058 00:53:09,120 --> 00:53:11,910 is diabetes, in black here is musculoskeletal conditions, 1059 00:53:11,910 --> 00:53:15,300 and in red here is cardiovascular disease. 1060 00:53:15,300 --> 00:53:18,530 And so one of the interesting inferences 1061 00:53:18,530 --> 00:53:20,030 made by this learning algorithm is 1062 00:53:20,030 --> 00:53:25,700 that even in stage 1 of COPD, very early in the trajectory, 1063 00:53:25,700 --> 00:53:27,800 we are seeing patients with large amounts 1064 00:53:27,800 --> 00:53:29,318 of cardiovascular disease.
1065 00:53:29,318 --> 00:53:30,860 And again, this is something that one 1066 00:53:30,860 --> 00:53:32,277 can look at the medical literature 1067 00:53:32,277 --> 00:53:35,220 to see whether it aligns with what we expect. 1068 00:53:35,220 --> 00:53:39,740 And it does, so even in patients with mild to moderate COPD, 1069 00:53:39,740 --> 00:53:43,310 the leading cause of morbidity is cardiovascular disease. 1070 00:53:43,310 --> 00:53:45,770 Again, this is just a sanity check 1071 00:53:45,770 --> 00:53:49,610 that what this model is learning for a common disease actually 1072 00:53:49,610 --> 00:53:54,020 aligns with the medical knowledge. 1073 00:53:54,020 --> 00:53:58,970 So that's all I want to say about this probabilistic model 1074 00:53:58,970 --> 00:54:00,680 approach to disease progression modeling 1075 00:54:00,680 --> 00:54:02,060 from cross-sectional data. 1076 00:54:02,060 --> 00:54:03,560 I want you to hold your questions 1077 00:54:03,560 --> 00:54:05,435 so I can get through the rest of the material 1078 00:54:05,435 --> 00:54:07,760 and you can ask me after class. 1079 00:54:07,760 --> 00:54:10,740 So next I want to talk about these pseudo-time methods, 1080 00:54:10,740 --> 00:54:13,500 which are a very different approach for trying 1081 00:54:13,500 --> 00:54:17,300 to order patients from early to late disease stage. 1082 00:54:17,300 --> 00:54:19,700 These approaches were really popularized in the last five 1083 00:54:19,700 --> 00:54:25,340 years due to the explosion in single-cell sequencing 1084 00:54:25,340 --> 00:54:28,660 experiments in the biology community. 1085 00:54:28,660 --> 00:54:31,900 Single-cell sequencing is a way to really understand 1086 00:54:31,900 --> 00:54:36,010 not just what is the average gene expression, 1087 00:54:36,010 --> 00:54:39,520 but, on a cell-by-cell basis, 1088 00:54:39,520 --> 00:54:43,270 what is expressed in each cell. 1089 00:54:43,270 --> 00:54:45,690 So at a very high level, the way this works is 1090 00:54:45,690 --> 00:54:51,270 you take a solid tissue, you do a number of procedures 1091 00:54:51,270 --> 00:54:54,090 in order to isolate out individual cells 1092 00:54:54,090 --> 00:54:56,670 from that tissue, then you're going 1093 00:54:56,670 --> 00:55:00,960 to extract the RNA from those individual cells, 1094 00:55:00,960 --> 00:55:03,360 you go through another complex procedure 1095 00:55:03,360 --> 00:55:07,350 which somehow barcodes the RNA 1096 00:55:07,350 --> 00:55:10,920 from each individual cell, mixes it all together, 1097 00:55:10,920 --> 00:55:14,070 does sequencing of it, and then deconvolves 1098 00:55:14,070 --> 00:55:17,610 it so that you can see what was the original RNA 1099 00:55:17,610 --> 00:55:20,355 expression for each of the individual cells. 1100 00:55:23,442 --> 00:55:28,450 Now the goal of these pseudo-time algorithms 1101 00:55:28,450 --> 00:55:31,420 is to take that type of data and then 1102 00:55:31,420 --> 00:55:37,510 to attempt to align cells to some trajectory. 1103 00:55:37,510 --> 00:55:40,780 So if you look at the very top of this figure, part (a), 1104 00:55:40,780 --> 00:55:45,100 that's the picture that you should have in your mind.
1105 00:55:45,100 --> 00:55:50,180 In the real world, cells are evolving with time-- 1106 00:55:50,180 --> 00:56:00,940 for example, B cells will have a well-characterized evolution 1107 00:56:00,940 --> 00:56:04,300 between different cellular states, 1108 00:56:04,300 --> 00:56:07,010 and what we'd like to understand, 1109 00:56:07,010 --> 00:56:09,342 given that you have cross-sectional data-- so you 1110 00:56:09,342 --> 00:56:11,800 can imagine-- imagine you have a whole collection of cells, 1111 00:56:11,800 --> 00:56:14,080 each one in a different part, a different stage, 1112 00:56:14,080 --> 00:56:18,350 of differentiation, is: could you somehow 1113 00:56:18,350 --> 00:56:21,140 order them by where they were in different stages 1114 00:56:21,140 --> 00:56:22,448 of differentiation? 1115 00:56:22,448 --> 00:56:23,240 So that's the goal. 1116 00:56:23,240 --> 00:56:25,040 We want to take this-- 1117 00:56:25,040 --> 00:56:27,350 so there exists some true ordering that I'm 1118 00:56:27,350 --> 00:56:29,960 showing from dark to light. 1119 00:56:29,960 --> 00:56:32,343 The capture process is going to lose 1120 00:56:32,343 --> 00:56:34,760 the ordering information, because all we're doing 1121 00:56:34,760 --> 00:56:37,790 is getting a collection of cells that are in different stages. 1122 00:56:37,790 --> 00:56:40,910 And then we're going to use this pseudo-time method 1123 00:56:40,910 --> 00:56:44,720 to try to re-sort them so that you could figure out, 1124 00:56:44,720 --> 00:56:46,670 oh, these were the cells in the early stage 1125 00:56:46,670 --> 00:56:48,740 and these are the cells in the late stage. 1126 00:56:48,740 --> 00:56:50,730 And of course, there's an analogy 1127 00:56:50,730 --> 00:56:52,730 here to the pictures I showed you in the earlier 1128 00:56:52,730 --> 00:56:54,520 part of the lecture. 1129 00:56:54,520 --> 00:56:58,270 Once you have this alignment of cells into stages, 1130 00:56:58,270 --> 00:57:00,880 then you can answer some really exciting scientific questions. 1131 00:57:00,880 --> 00:57:05,380 For example, you could ask, of a variety of different genes, 1132 00:57:05,380 --> 00:57:08,650 which genes are expressed at which point in time. 1133 00:57:08,650 --> 00:57:11,050 So you might see that gene A is very highly 1134 00:57:11,050 --> 00:57:15,700 expressed very early in this cell's differentiation 1135 00:57:15,700 --> 00:57:18,875 and is not expressed very much towards the end, 1136 00:57:18,875 --> 00:57:20,875 and that might give you new biological insights. 1137 00:57:23,570 --> 00:57:26,120 So these methods could immediately 1138 00:57:26,120 --> 00:57:28,850 be applied, I believe, to disease progression modeling 1139 00:57:28,850 --> 00:57:32,870 where I want you to think about each cell as now a patient, 1140 00:57:32,870 --> 00:57:38,360 and that patient has a number of observations in this data. 1141 00:57:38,360 --> 00:57:41,660 Here the observations are the expression levels for that cell, 1142 00:57:41,660 --> 00:57:43,580 but in our data, the observations 1143 00:57:43,580 --> 00:57:46,640 might be symptoms that we observe for the patient, 1144 00:57:46,640 --> 00:57:48,618 for example. 1145 00:57:48,618 --> 00:57:50,660 And then the goal is, given those cross-sectional 1146 00:57:50,660 --> 00:57:54,050 observations, to sort them.
1147 00:57:54,050 --> 00:57:56,060 And once you have that sorting, then you 1148 00:57:56,060 --> 00:57:58,400 could answer scientific questions, 1149 00:57:58,400 --> 00:58:01,820 such as, as I mentioned, of a number of different genes, 1150 00:58:01,820 --> 00:58:04,430 which genes are expressed when. 1151 00:58:04,430 --> 00:58:08,060 So here, I'm showing you sort of the density of when 1152 00:58:08,060 --> 00:58:09,860 this particular gene is expressed 1153 00:58:09,860 --> 00:58:12,610 as a function of pseudo-time. 1154 00:58:12,610 --> 00:58:14,517 Analogously for disease progression modeling, 1155 00:58:14,517 --> 00:58:16,350 you should think of that as being a symptom. 1156 00:58:16,350 --> 00:58:19,340 You could ask, OK, suppose there is some true progression 1157 00:58:19,340 --> 00:58:25,580 of the disease, when do patients typically develop diabetes 1158 00:58:25,580 --> 00:58:27,360 or cardiovascular symptoms? 1159 00:58:27,360 --> 00:58:29,825 And so for cardiovascular, going back to the COPD example, 1160 00:58:29,825 --> 00:58:32,450 you might imagine that there's a peak very early in the disease 1161 00:58:32,450 --> 00:58:36,510 stage; for diabetes, it might be in a later disease stage. 1162 00:58:36,510 --> 00:58:37,210 All right? 1163 00:58:37,210 --> 00:58:39,220 So that-- is the analogy clear? 1164 00:58:42,060 --> 00:58:47,970 So this community, which has been developing methods 1165 00:58:47,970 --> 00:58:51,090 for studying single-cell gene expression data, 1166 00:58:51,090 --> 00:58:55,770 has just exploded in the last 10 years. 1167 00:58:55,770 --> 00:58:58,140 So I lost count of how many different methods there 1168 00:58:58,140 --> 00:59:03,420 are, but if I had to guess, I'd say 50 to 200 different methods 1169 00:59:03,420 --> 00:59:06,047 for this problem. 1170 00:59:06,047 --> 00:59:08,380 There was a paper, which is one of the optional readings 1171 00:59:08,380 --> 00:59:11,980 for today's lecture, that just came out earlier this month 1172 00:59:11,980 --> 00:59:14,860 which looks at a comparison of these different trajectory 1173 00:59:14,860 --> 00:59:18,220 inference methods, and this picture 1174 00:59:18,220 --> 00:59:19,943 gives a really interesting illustration 1175 00:59:19,943 --> 00:59:21,610 of what are some of the assumptions made 1176 00:59:21,610 --> 00:59:23,030 by these algorithms. 1177 00:59:23,030 --> 00:59:25,990 So for example, the first question, 1178 00:59:25,990 --> 00:59:28,660 when you try to figure out which method of these tons of methods 1179 00:59:28,660 --> 00:59:31,450 to use is, do you expect multiple disconnected 1180 00:59:31,450 --> 00:59:33,808 trajectories? 1181 00:59:33,808 --> 00:59:35,600 What might be a reason why you would expect 1182 00:59:35,600 --> 00:59:38,315 multiple disconnected trajectories for disease 1183 00:59:38,315 --> 00:59:39,190 progression modeling? 1184 00:59:41,890 --> 00:59:43,160 TAs should not answer. 1185 00:59:46,770 --> 00:59:48,695 AUDIENCE: [INAUDIBLE] 1186 00:59:48,695 --> 00:59:53,120 PROFESSOR: Different subtypes would be an example. 1187 00:59:53,120 --> 00:59:56,360 So suppose the answer is no, as we've 1188 00:59:56,360 --> 01:00:00,395 been assuming for most of this lecture, then you might ask, 1189 01:00:00,395 --> 01:00:03,020 OK, there might only be a single trajectory, because we're only 1190 01:00:03,020 --> 01:00:04,940 assuming a single disease subtype, 1191 01:00:04,940 --> 01:00:09,090 but do we expect a particular topology?
1192 01:00:09,090 --> 01:00:12,290 Now everything that we've been talking about up until now 1193 01:00:12,290 --> 01:00:14,750 has been a linear topology, meaning 1194 01:00:14,750 --> 01:00:16,520 there's a linear progression, there's 1195 01:00:16,520 --> 01:00:19,260 a notion of early and late stage, 1196 01:00:19,260 --> 01:00:21,020 but in fact, the linear trajectory 1197 01:00:21,020 --> 01:00:23,120 may not be realistic. 1198 01:00:23,120 --> 01:00:25,130 Maybe the trajectory looks a little bit more 1199 01:00:25,130 --> 01:00:27,420 like this bifurcation. 1200 01:00:27,420 --> 01:00:31,370 Maybe patients look the same very early in the disease 1201 01:00:31,370 --> 01:00:34,280 stage, but then suddenly something 1202 01:00:34,280 --> 01:00:36,470 might happen which causes some patients to go 1203 01:00:36,470 --> 01:00:39,350 this way and some patients to go that way. 1204 01:00:39,350 --> 01:00:44,260 Any idea what that might be in a clinical setting? 1205 01:00:44,260 --> 01:00:45,177 AUDIENCE: A treatment? 1206 01:00:45,177 --> 01:00:46,677 PROFESSOR: Treatments, that's great. 1207 01:00:46,677 --> 01:00:47,550 All right? 1208 01:00:47,550 --> 01:00:50,340 So maybe these patients got t equals 0, 1209 01:00:50,340 --> 01:00:52,875 and maybe these patients got t equals 1, and maybe 1210 01:00:52,875 --> 01:00:54,250 for whatever reason we don't even 1211 01:00:54,250 --> 01:00:56,208 have good data on what treatments patients got, 1212 01:00:56,208 --> 01:00:58,600 so we don't actually observe the treatment, right? 1213 01:00:58,600 --> 01:01:01,630 Then you might want to be able to discover that bifurcation 1214 01:01:01,630 --> 01:01:05,410 directly from the data, and that might suggest, going back 1215 01:01:05,410 --> 01:01:07,270 to the original source of the data, to ask, 1216 01:01:07,270 --> 01:01:09,970 what differentiated these patients at this point in time? 1217 01:01:09,970 --> 01:01:11,530 And you might discover, oh, there 1218 01:01:11,530 --> 01:01:13,530 was something in the data that we didn't record, 1219 01:01:13,530 --> 01:01:18,420 such as treatment, all right? 1220 01:01:18,420 --> 01:01:19,810 So there are a variety of methods 1221 01:01:19,810 --> 01:01:23,980 to try to infer these pseudo-times under a variety 1222 01:01:23,980 --> 01:01:25,240 of different assumptions. 1223 01:01:25,240 --> 01:01:26,740 What I'll do in the next few minutes 1224 01:01:26,740 --> 01:01:31,470 is just give you an inkling of how two of the methods work. 1225 01:01:31,470 --> 01:01:35,130 And I chose these to be representative examples. 1226 01:01:35,130 --> 01:01:38,330 The first example is an approach based 1227 01:01:38,330 --> 01:01:42,530 on building a minimum spanning tree. 1228 01:01:42,530 --> 01:01:45,140 And this algorithm I'm going to describe 1229 01:01:45,140 --> 01:01:46,910 goes by the name of Monocle. 1230 01:01:46,910 --> 01:01:50,870 It was published in 2014 in this paper by Trapnell et al, 1231 01:01:50,870 --> 01:01:53,750 but it builds very heavily on an earlier 1232 01:01:53,750 --> 01:01:58,210 paper, published in 2003, that I'm mostly citing here. 1233 01:01:58,210 --> 01:02:00,580 So the way that this algorithm works is as follows. 1234 01:02:00,580 --> 01:02:03,670 It starts with, as we've been assuming 1235 01:02:03,670 --> 01:02:06,310 all along, cross-sectional data, which lives 1236 01:02:06,310 --> 01:02:08,140 in some high-dimensional space. 1237 01:02:08,140 --> 01:02:10,480 I'm drawing that in the top-left here.
1238 01:02:10,480 --> 01:02:16,323 Each data point corresponds to some patient or some cell. 1239 01:02:16,323 --> 01:02:17,740 The first step of the algorithm is 1240 01:02:17,740 --> 01:02:19,270 to do dimensionality reduction. 1241 01:02:19,270 --> 01:02:21,010 And there are many ways to do dimensionality reduction. 1242 01:02:21,010 --> 01:02:22,900 You could do principal components analysis, 1243 01:02:22,900 --> 01:02:25,067 or, for example, you could do independent components 1244 01:02:25,067 --> 01:02:25,570 analysis. 1245 01:02:25,570 --> 01:02:27,927 This paper uses independent component analysis. 1246 01:02:27,927 --> 01:02:29,385 What ICA is going to do is 1247 01:02:29,385 --> 01:02:31,718 attempt to find a number of different components that 1248 01:02:31,718 --> 01:02:35,260 seem to be as independent from one another as possible. 1249 01:02:35,260 --> 01:02:37,420 Then you're going to represent the data now 1250 01:02:37,420 --> 01:02:41,140 in this low-dimensional space, and in many of these papers, 1251 01:02:41,140 --> 01:02:44,328 it's quite astonishing to me, they actually use dimension 2. 1252 01:02:44,328 --> 01:02:46,620 So they'll go all the way down to two-dimensional space 1253 01:02:46,620 --> 01:02:49,490 where you can actually plot all of the data. 1254 01:02:49,490 --> 01:02:51,990 It's not at all obvious to me why you would want to do that, 1255 01:02:51,990 --> 01:02:53,698 and for clinical data, I think that might 1256 01:02:53,698 --> 01:02:56,610 be a very poor choice. 1257 01:02:56,610 --> 01:03:02,070 Then what they do is they build a minimum spanning tree on all 1258 01:03:02,070 --> 01:03:04,920 of the patients or cells. 1259 01:03:04,920 --> 01:03:06,750 So the way that one does that is you 1260 01:03:06,750 --> 01:03:11,220 create a graph by drawing an edge between every pair 1261 01:03:11,220 --> 01:03:14,070 of nodes where the weight of the edge 1262 01:03:14,070 --> 01:03:17,625 is the Euclidean distance between those two points. 1263 01:03:20,130 --> 01:03:23,190 And then-- so for example, there is this edge from here to here, 1264 01:03:23,190 --> 01:03:26,130 there's an edge from here to here and so on. 1265 01:03:26,130 --> 01:03:28,740 And then given that weighted graph, 1266 01:03:28,740 --> 01:03:32,050 we're going to find the minimum spanning tree of that graph, 1267 01:03:32,050 --> 01:03:34,590 and what I'm showing you here is the minimum spanning tree 1268 01:03:34,590 --> 01:03:38,100 of the corresponding graph, OK? 1269 01:03:38,100 --> 01:03:39,920 Next, what one will do is go look 1270 01:03:39,920 --> 01:03:45,440 for the longest path in that tree. 1271 01:03:45,440 --> 01:03:48,670 Remember, finding the longest path in a graph-- 1272 01:03:48,670 --> 01:03:50,610 in an arbitrary graph-- is a hard problem; 1273 01:03:50,610 --> 01:03:53,030 it's closely related to the traveling salesman problem, 1274 01:03:53,030 --> 01:03:54,897 and it's NP-hard. 1275 01:03:54,897 --> 01:03:56,230 How do we get around that here? 1276 01:03:56,230 --> 01:03:58,230 Well we're not-- this is not an arbitrary graph, 1277 01:03:58,230 --> 01:04:01,190 this is actually a tree. 1278 01:04:01,190 --> 01:04:03,460 So here's the-- 1279 01:04:03,460 --> 01:04:08,420 here's an algorithm for finding the longest path. 1280 01:04:08,420 --> 01:04:11,360 I won't talk about that.
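For reference, here is a rough sketch of that pipeline on made-up two-dimensional data (so the dimensionality reduction step is already done): build a complete graph weighted by Euclidean distance, take its minimum spanning tree, and then find the longest path in the tree. The data, graph library, and path-finding trick (the standard two-sweep "diameter of a tree" approach) are all illustrative choices, not the implementation from the Monocle paper.

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
t_true = np.sort(rng.uniform(0, 1, 60))                        # hidden true ordering
points = np.c_[t_true, np.sin(3 * t_true)] + 0.03 * rng.normal(size=(60, 2))

D = squareform(pdist(points))                                  # pairwise Euclidean distances
G = nx.from_numpy_array(D)                                     # complete weighted graph
mst = nx.minimum_spanning_tree(G)                              # minimum spanning tree

def farthest(tree, src):
    """Node farthest from src in a weighted tree."""
    dist = nx.single_source_dijkstra_path_length(tree, src, weight="weight")
    return max(dist, key=dist.get)

# Longest path in a tree: farthest node from an arbitrary start, then farthest
# node from that one; the path between those two endpoints is the tree's diameter.
a = farthest(mst, 0)
b = farthest(mst, a)
backbone = nx.shortest_path(mst, a, b, weight="weight")        # ordered point indices

print("pseudo-time ordering along the backbone:", backbone[:10], "...")
```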
1281 01:04:11,360 --> 01:04:14,990 So one finds the longest path 1282 01:04:14,990 --> 01:04:17,820 in the tree, and then what one does is one says, 1283 01:04:17,820 --> 01:04:21,200 OK, one side of the path corresponds to, 1284 01:04:21,200 --> 01:04:23,750 let's say, early disease stage and the other side of the path 1285 01:04:23,750 --> 01:04:25,958 corresponds to late disease stage, 1286 01:04:25,958 --> 01:04:27,500 and it allows for the fact that there 1287 01:04:27,500 --> 01:04:28,820 might be some bifurcation. 1288 01:04:28,820 --> 01:04:30,480 So for example, you see here that there 1289 01:04:30,480 --> 01:04:33,530 is a bifurcation over here. 1290 01:04:33,530 --> 01:04:35,120 And as we discussed earlier, you have 1291 01:04:35,120 --> 01:04:37,158 to have some way of differentiating 1292 01:04:37,158 --> 01:04:39,200 what the beginning is and what the end should be, 1293 01:04:39,200 --> 01:04:41,325 and that's where some side information might become 1294 01:04:41,325 --> 01:04:43,280 useful. 1295 01:04:43,280 --> 01:04:45,590 So here's an illustration of applying that method 1296 01:04:45,590 --> 01:04:47,490 to some real data. 1297 01:04:47,490 --> 01:04:52,640 So every point here is a cell after doing 1298 01:04:52,640 --> 01:04:54,200 dimensionality reduction. 1299 01:04:54,200 --> 01:04:56,510 The edges between those points correspond 1300 01:04:56,510 --> 01:04:59,780 to the edges of the minimum spanning tree, 1301 01:04:59,780 --> 01:05:01,947 and now what the authors have done 1302 01:05:01,947 --> 01:05:04,280 is they've actually used some side information that they 1303 01:05:04,280 --> 01:05:07,490 had in order to color each of the nodes 1304 01:05:07,490 --> 01:05:11,390 based on what part of the cell differentiation process 1305 01:05:11,390 --> 01:05:14,480 is believed-- that cell is believed to be in. 1306 01:05:14,480 --> 01:05:18,140 And what one discovers is that, in fact, this 1307 01:05:18,140 --> 01:05:21,410 is very sensible: all of these points 1308 01:05:21,410 --> 01:05:25,160 are in a much earlier stage 1309 01:05:25,160 --> 01:05:27,110 than [INAUDIBLE] than these points, 1310 01:05:27,110 --> 01:05:33,050 and this is a sensible bifurcation. 1311 01:05:33,050 --> 01:05:35,600 Next I want to talk about a slightly different approach 1312 01:05:35,600 --> 01:05:36,298 to-- 1313 01:05:36,298 --> 01:05:38,090 this is the whole story, by the way, right? 1314 01:05:38,090 --> 01:05:41,542 It's conceptually a very, very simple approach. 1315 01:05:41,542 --> 01:05:43,750 Next I want to talk about a different approach, which 1316 01:05:43,750 --> 01:05:48,430 now tries to return back to the probabilistic approaches 1317 01:05:48,430 --> 01:05:51,490 that we had earlier in the lecture. 1318 01:05:51,490 --> 01:05:57,293 This new approach is going to be based on Gaussian processes. 1319 01:05:57,293 --> 01:05:59,460 So Gaussian processes have come up a couple of times 1320 01:05:59,460 --> 01:06:01,020 in lecture, but I've never actually 1321 01:06:01,020 --> 01:06:02,280 formally defined them for you. 1322 01:06:02,280 --> 01:06:04,655 So in order for what I'm going to say next to make sense, 1323 01:06:04,655 --> 01:06:08,230 I'm going to formally define for you what a Gaussian process is.
1324 01:06:08,230 --> 01:06:14,120 A Gaussian process mu, for a collection of time points 1325 01:06:14,120 --> 01:06:17,850 T1 through T capital N, is defined 1326 01:06:17,850 --> 01:06:23,310 by the joint distribution of mu at those time points, 1327 01:06:23,310 --> 01:06:26,125 which is a Gaussian distribution. 1328 01:06:26,125 --> 01:06:27,750 So we're going to say that the function 1329 01:06:27,750 --> 01:06:31,290 value for these N different time points is just 1330 01:06:31,290 --> 01:06:33,880 a Gaussian, which for the purpose of today's lecture 1331 01:06:33,880 --> 01:06:35,910 I'm going to assume is zero mean, 1332 01:06:35,910 --> 01:06:38,190 and where the covariance function is given 1333 01:06:38,190 --> 01:06:43,170 to you by this capital K, it's a covariance function of the time 1334 01:06:43,170 --> 01:06:46,180 points-- of the input points. 1335 01:06:46,180 --> 01:06:47,940 And so if you look-- 1336 01:06:47,940 --> 01:06:52,050 this has to be a matrix of dimension capital N 1337 01:06:52,050 --> 01:06:56,830 by capital N. And if you look at the (i1, i2) entry, 1338 01:06:56,830 --> 01:07:00,090 if you look at that entry of the matrix, 1339 01:07:00,090 --> 01:07:03,660 we're defining it to be given by the following kernel 1340 01:07:03,660 --> 01:07:04,620 function. 1341 01:07:04,620 --> 01:07:09,510 It's the exponential of the negative Euclidean 1342 01:07:09,510 --> 01:07:13,770 distance squared between those two time points. 1343 01:07:13,770 --> 01:07:15,450 Intuitively what this is saying is 1344 01:07:15,450 --> 01:07:17,760 that if you have two time points that 1345 01:07:17,760 --> 01:07:20,700 are very close to one another, then 1346 01:07:20,700 --> 01:07:23,730 this kernel function is going to be very large. 1347 01:07:23,730 --> 01:07:28,710 If you have two time points that are very far from one another, 1348 01:07:28,710 --> 01:07:32,140 then this is very large-- it's a very large negative number, 1349 01:07:32,140 --> 01:07:34,260 and so this is going to be very small. 1350 01:07:34,260 --> 01:07:36,010 So the kernel function for two inputs that 1351 01:07:36,010 --> 01:07:37,900 are very far from one another is very small; 1352 01:07:37,900 --> 01:07:39,358 the kernel function for inputs that 1353 01:07:39,358 --> 01:07:41,020 are very close to each other is large; 1354 01:07:41,020 --> 01:07:42,910 and thus, what we're saying here is 1355 01:07:42,910 --> 01:07:46,960 that there's going to be some correlation between nearby data 1356 01:07:46,960 --> 01:07:48,702 points. 1357 01:07:48,702 --> 01:07:50,660 And that's the way in which we're going to specify 1358 01:07:50,660 --> 01:07:53,480 a distribution over functions. 1359 01:07:53,480 --> 01:07:55,820 If one were to sample from this Gaussian 1360 01:07:55,820 --> 01:07:58,760 with a covariance function specified in the way I did, 1361 01:07:58,760 --> 01:08:03,130 what one gets out is something that looks like this. 1362 01:08:03,130 --> 01:08:07,560 So I'm assuming here that every-- 1363 01:08:07,560 --> 01:08:11,227 that these curves look really dense, 1364 01:08:11,227 --> 01:08:13,810 and that's because I'm assuming that N is extremely large here. 1365 01:08:13,810 --> 01:08:15,643 If N were small, let's say 3, there'd only 1366 01:08:15,643 --> 01:08:18,660 be three time points here, OK?
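Here is a small sketch of sampling functions from such a zero-mean Gaussian process. The exact scaling convention for the squared-exponential kernel (the 1/(2 l^2) factor and the length scale l) is an assumption for illustration; the lecture only describes it as the exponential of the negative squared distance between the two time points.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 100)                       # N input "time" points

def rbf_kernel(t, length_scale):
    """Squared-exponential covariance: large for nearby inputs, near zero far apart."""
    d2 = (t[:, None] - t[None, :]) ** 2
    return np.exp(-d2 / (2 * length_scale ** 2))

for l in (0.3, 3.0):                              # small l: spiky samples; large l: smooth
    K = rbf_kernel(t, l) + 1e-6 * np.eye(len(t))  # jitter for numerical stability
    samples = rng.multivariate_normal(np.zeros(len(t)), K, size=3)
    print(f"length scale {l}: drew {samples.shape[0]} sample functions over {samples.shape[1]} inputs")
```

Plotting the rows of `samples` against `t` would reproduce the picture being described: rapidly wiggling curves for the small length scale and slowly varying curves for the large one.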
1367 01:08:18,660 --> 01:08:23,340 And so you can make this distribution of functions 1368 01:08:23,340 --> 01:08:28,229 arbitrarily complex by playing with this little l-- 1369 01:08:28,229 --> 01:08:34,950 so for example, if you made a little l be very small, 1370 01:08:34,950 --> 01:08:37,319 then what you get are these really spiky functions 1371 01:08:37,319 --> 01:08:39,630 that I'm showing you in a very light color. 1372 01:08:39,630 --> 01:08:42,210 If you make a little l be very large, 1373 01:08:42,210 --> 01:08:45,920 you get these very smooth functions, right? 1374 01:08:45,920 --> 01:08:48,620 So this is a way to get a function-- this is a way 1375 01:08:48,620 --> 01:08:50,810 to get a distribution over functions 1376 01:08:50,810 --> 01:08:55,130 just by sampling from this Gaussian process. 1377 01:08:55,130 --> 01:08:57,770 What this paper from Campbell and Yau, published 1378 01:08:57,770 --> 01:09:01,870 two years ago in Computational Biology, 1379 01:09:01,870 --> 01:09:07,790 does is assume that the observations that you have 1380 01:09:07,790 --> 01:09:13,700 are drawn from a Gaussian distribution whose mean is 1381 01:09:13,700 --> 01:09:16,740 given to you by the Gaussian process. 1382 01:09:16,740 --> 01:09:27,130 So if you think back to the story that we drew earlier, 1383 01:09:27,130 --> 01:09:33,229 suppose that the data lived in just one dimension, 1384 01:09:33,229 --> 01:09:36,149 and suppose we actually knew the sorting of patients. 1385 01:09:36,149 --> 01:09:39,130 So we actually knew which patients 1386 01:09:39,130 --> 01:09:45,200 are very early in time, which patients are very late in time. 1387 01:09:45,200 --> 01:09:51,380 You might imagine that, for that single biomarker, biomarker A, 1388 01:09:51,380 --> 01:09:55,820 the function which tells you 1389 01:09:55,820 --> 01:09:58,820 what the biomarker's value is as a function of time 1390 01:09:58,820 --> 01:10:01,640 might be something like this, right? 1391 01:10:01,640 --> 01:10:04,970 Or maybe it's something like that, OK? 1392 01:10:04,970 --> 01:10:06,770 So it might be a function that's increasing 1393 01:10:06,770 --> 01:10:09,920 or a function that's decreasing. 1394 01:10:09,920 --> 01:10:12,740 And this function is precisely what 1395 01:10:12,740 --> 01:10:17,330 this mu, this Gaussian process, is meant to model. 1396 01:10:17,330 --> 01:10:20,120 The only difference is that now, one can model the Gaussian 1397 01:10:20,120 --> 01:10:23,250 process-- instead of just being a single dimension, 1398 01:10:23,250 --> 01:10:25,500 one could imagine having several different dimensions. 1399 01:10:25,500 --> 01:10:29,990 So this P denotes the number of dimensions, right? 1400 01:10:29,990 --> 01:10:31,490 Which corresponds, in some sense, 1401 01:10:31,490 --> 01:10:33,650 to the number of synthetic biomarkers 1402 01:10:33,650 --> 01:10:36,860 that you might conjecture exist. 1403 01:10:36,860 --> 01:10:40,370 Now here, we truly don't know the sorting of patients 1404 01:10:40,370 --> 01:10:43,020 into early versus late stage. 1405 01:10:43,020 --> 01:10:47,270 And so the time points T are themselves 1406 01:10:47,270 --> 01:10:49,610 assumed to be latent variables that 1407 01:10:49,610 --> 01:10:52,130 are drawn from a truncated normal distribution that 1408 01:10:52,130 --> 01:10:53,490 looks like this.
1409 01:10:53,490 --> 01:10:55,580 So you might make some assumption 1410 01:10:55,580 --> 01:10:58,970 that the time intervals for when a patient comes in 1411 01:10:58,970 --> 01:11:02,022 might be-- maybe patients typically come in right 1412 01:11:02,022 --> 01:11:03,480 in the middle of the disease stage, 1413 01:11:03,480 --> 01:11:05,355 or maybe you're assuming it's something flat, 1414 01:11:05,355 --> 01:11:07,750 and so patients come in throughout the disease stages. 1415 01:11:07,750 --> 01:11:10,760 But the time point itself is latent. 1416 01:11:10,760 --> 01:11:13,200 So now the generative process for the data is as follows. 1417 01:11:13,200 --> 01:11:15,860 You first sample a time point from 1418 01:11:15,860 --> 01:11:18,620 this truncated normal distribution-- 1419 01:11:18,620 --> 01:11:21,280 oh, and 1420 01:11:21,280 --> 01:11:23,030 at the very beginning 1421 01:11:23,030 --> 01:11:27,500 you sample this curve mu-- and then you 1422 01:11:27,500 --> 01:11:30,260 look to see what the value of mu is for the sampled time 1423 01:11:30,260 --> 01:11:33,440 point, and that gives you the expected value you should 1424 01:11:33,440 --> 01:11:36,740 expect to see for that patient. 1425 01:11:36,740 --> 01:11:41,670 And then one jointly optimizes 1426 01:11:41,670 --> 01:11:44,730 this to try to find 1427 01:11:44,730 --> 01:11:48,450 the curve mu which has the highest posterior 1428 01:11:48,450 --> 01:11:52,470 probability, and that is how you read out from the model 1429 01:11:52,470 --> 01:11:57,815 both what the latent progression looks 1430 01:11:57,815 --> 01:11:59,940 like, and if you look at the posterior distribution 1431 01:11:59,940 --> 01:12:02,065 over the T's that are inferred for each individual, 1432 01:12:02,065 --> 01:12:05,070 you get the inferred location along the trajectory 1433 01:12:05,070 --> 01:12:07,550 for each individual. 1434 01:12:07,550 --> 01:12:09,200 And I'll stop there. 1435 01:12:09,200 --> 01:12:11,200 I'll post the slides online for this last piece, 1436 01:12:11,200 --> 01:12:13,890 but I'll let you read the paper on your own.
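For reference, a much-simplified sketch of the generative story just described: each patient has a latent pseudo-time drawn from a truncated normal, a latent curve per dimension is drawn from a Gaussian process, and the observed biomarkers are noisy readings of those curves at the latent times. All values here (the truncation to [0, 1], the kernel scaling, the noise level, evaluating the curve on a fixed grid) are illustrative assumptions rather than the exact construction in the Campbell and Yau paper, which also has to infer the latent times and curves from the observations alone.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 2                                     # patients, "synthetic biomarker" dimensions

def trunc_normal(mean, sd, size):
    """Crude rejection sampler for a normal truncated to [0, 1]."""
    x = rng.normal(mean, sd, size)
    bad = (x < 0) | (x > 1)
    while np.any(bad):
        x[bad] = rng.normal(mean, sd, bad.sum())
        bad = (x < 0) | (x > 1)
    return x

t_latent = trunc_normal(0.5, 0.25, N)             # unobserved pseudo-times

def rbf(a, b, l=0.3):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * l ** 2))

grid = np.linspace(0, 1, 50)
K = rbf(grid, grid) + 1e-6 * np.eye(len(grid))
mu_grid = rng.multivariate_normal(np.zeros(len(grid)), K, size=P)   # one latent curve per dimension

# Observed data: each patient's latent curve value (nearest grid point) plus Gaussian noise.
idx = np.searchsorted(grid, t_latent).clip(0, len(grid) - 1)
Y = mu_grid[:, idx].T + 0.1 * rng.normal(size=(N, P))

print(Y.shape)   # inference would then recover t_latent and mu from Y alone
```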