1 00:00:01 --> 00:00:04 Today is my last class with you. Aw, I'm sorry, too. You guys are 2 00:00:04 --> 00:00:08 a lot of fun. This has actually been the most interactive 7.01 3 00:00:08 --> 00:00:12 I've ever had. Usually there are a couple of people who perk up and 4 00:00:12 --> 00:00:16 say things, but you guys are great because all sorts of people are 5 00:00:16 --> 00:00:20 willing to contribute. So, I've had a wonderful time and 6 00:00:20 --> 00:00:24 it certainly seems like you guys have learned a lot. 7 00:00:24 --> 00:00:28 What I'd like to do for my last lecture is pick up again a little 8 00:00:28 --> 00:00:32 bit like I did with genomics and try to give you a sense of where 9 00:00:32 --> 00:00:36 things are going. I always like doing this because I 10 00:00:36 --> 00:00:40 get to talk about things that are in none of the textbooks that, 11 00:00:40 --> 00:00:44 well, I mean, it's just stuff that many people working in the field 12 00:00:44 --> 00:00:48 don't necessarily know. And that's what's so much fun about 13 00:00:48 --> 00:00:52 teaching introductory biology is because it only takes a semester for 14 00:00:52 --> 00:00:56 you guys to get up to the point of at least being able to understand 15 00:00:56 --> 00:01:01 what's getting done on the cutting edge. 16 00:01:01 --> 00:01:05 Even if you might not yet be able to go off and practice it, 17 00:01:05 --> 00:01:09 you might need a little more experience for that, 18 00:01:09 --> 00:01:13 but you'd be surprised, it's not that much more. 19 00:01:13 --> 00:01:17 Take maybe Project Lab and you'll be able to start doing it already. 20 00:01:17 --> 00:01:21 It's really wonderful that it's possible to grasp what's going on. 
21 00:01:21 --> 00:01:25 And, in many ways, you guys may have an advantage in grasping what's 22 00:01:25 --> 00:01:29 going on because, as I've already hinted, 23 00:01:29 --> 00:01:33 biology's undergoing this remarkable transformation from being a purely 24 00:01:33 --> 00:01:37 laboratory-based science where each individual works on his or her own 25 00:01:37 --> 00:01:41 project to being an information-based science that 26 00:01:41 --> 00:01:45 involves an integration of vast amounts of data across the whole 27 00:01:45 --> 00:01:50 world and trying to learn things from this tremendous dataset. 28 00:01:50 --> 00:01:52 And, in that sense, I think the new students coming into 29 00:01:52 --> 00:01:55 the field have a distinct advantage over those who have been in it. 30 00:01:55 --> 00:01:58 And certainly the students who know mathematical and physical and 31 00:01:58 --> 00:02:01 chemical and other sorts of things, and aren't scared to write computer 32 00:02:01 --> 00:02:04 code when they need to write computer code have a really 33 00:02:04 --> 00:02:07 great advantage. So, anyway, all that by way of 34 00:02:07 --> 00:02:11 introduction. I want to talk about two subjects today of great interest 35 00:02:11 --> 00:02:14 to me. One is DNA variation and one is RNA variation. 36 00:02:14 --> 00:02:18 The variation of DNA sequence between individuals within a 37 00:02:18 --> 00:02:21 population, and in particular our population, and the other is RNA 38 00:02:21 --> 00:02:25 variation, the variation in RNA expression between different cell 39 00:02:25 --> 00:02:28 types, different tissues. And the work I'm going to talk 40 00:02:28 --> 00:02:32 about today is work that I, and my colleagues, have all been 41 00:02:32 --> 00:02:36 involved in. And it's stuff I know and love. 42 00:02:36 --> 00:02:40 So, feel free to ask questions about it. 
I may know the answers, 43 00:02:40 --> 00:02:44 but what's reasonably fun about these lectures is if I don't know 44 00:02:44 --> 00:02:48 the answers it's probably the case that the answers aren't known. 45 00:02:48 --> 00:02:52 So, that's good fun because it's stuff I really do know well, 46 00:02:52 --> 00:02:56 and I love. So, anyway, here's some DNA sequence. It's pretty boring. 47 00:02:56 --> 00:03:00 This is a chunk of sequence from, let's say, the human genome. 48 00:03:00 --> 00:03:04 How much does this differ between any two individuals? 49 00:03:04 --> 00:03:09 If I were to sequence any two chromosomes, any two copies of the 50 00:03:09 --> 00:03:14 chromosome from an individual in this class or two individuals on 51 00:03:14 --> 00:03:19 this planet, how much would they differ? The answer is that much. 52 00:03:19 --> 00:03:25 That's the average amount of difference between any two people on 53 00:03:25 --> 00:03:30 this planet. Not a lot. If you counted up, it is on average 54 00:03:30 --> 00:03:35 one nucleotide difference out of 1,000 nucleotides on average, somewhat 55 00:03:35 --> 00:03:41 less than one part in 1,000 or better than 99.9% identity 56 00:03:41 --> 00:03:46 between any two individuals. Now, that is a very small amount, 57 00:03:46 --> 00:03:51 not just in absolute terms, 99.9% identity is a lot, 58 00:03:51 --> 00:03:57 but in comparative terms with other species. If I take two chimpanzees 59 00:03:57 --> 00:04:02 in Africa, on average they will differ by about twice as much as any 60 00:04:02 --> 00:04:07 two random humans. And if I take two orangutans in 61 00:04:07 --> 00:04:12 Southeast Asia, they will on average differ by about 62 00:04:12 --> 00:04:17 eight times as much as any two humans on this planet. 63 00:04:17 --> 00:04:21 You guys think the orangutans all look the same. 64 00:04:21 --> 00:04:26 They think you all look the same, and they're right. So, why is this? 
65 00:04:26 --> 00:04:31 Why are humans amongst mammalian 66 00:04:31 --> 00:04:36 species relatively limited in the amount of variation? 67 00:04:36 --> 00:04:40 Well, it's a direct result of our population history. 68 00:04:40 --> 00:04:45 It turns out that the amount of variation that can be sustained in a 69 00:04:45 --> 00:04:50 population depends on two things. At equilibrium, if a population has 70 00:04:50 --> 00:04:55 constant size N for a very long time and a certain mutation rate, 71 00:04:55 --> 00:04:59 Mu, you can just write a piece of arithmetic that says, 72 00:04:59 --> 00:05:04 well, mutations are always arising due to new mutations in the 73 00:05:04 --> 00:05:09 population and mutations are being lost by genetic drift, 74 00:05:09 --> 00:05:14 just by random sampling from generation to generation. 75 00:05:14 --> 00:05:17 And those two processes, the creation of new mutations and 76 00:05:17 --> 00:05:21 the loss of mutations just due to random sampling in each generation, 77 00:05:21 --> 00:05:25 sets up an equilibrium, and the equilibrium defines an equation 78 00:05:25 --> 00:05:29 there, $\pi = 1 / (1 + (4N\mu)^{-1})$, which 79 00:05:29 --> 00:05:33 equation you have no need to memorize whatsoever and possibly 80 00:05:33 --> 00:05:36 even no need to write down. The important point is the concept, 81 00:05:36 --> 00:05:40 that if you know the number of organisms in the population and you 82 00:05:40 --> 00:05:43 know the mutation rate, those set up the balance of mutation 83 00:05:43 --> 00:05:47 and drift, and you can write down how polymorphic, 84 00:05:47 --> 00:05:51 how heterozygous random individuals should be at equilibrium. 85 00:05:51 --> 00:05:54 That is if the population has been at size N for a very long time. 86 00:05:54 --> 00:05:58 Well, the expected amount of heterozygosity for the 87 00:05:58 --> 00:06:02 human population -- Sorry. 
For a population of size 10,000 88 00:06:02 --> 00:06:06 would be about one nucleotide in 1300. We have exactly the amount of 89 00:06:06 --> 00:06:11 heterozygosity you would expect for a population of about 10,000 90 00:06:11 --> 00:06:15 individuals. Yeah, but wait, we're not a population of 10,000 91 00:06:15 --> 00:06:20 individuals. Why do we have the heterozygosity you would expect from 92 00:06:20 --> 00:06:25 a population of 10,000 individuals? We're six billion. 93 00:06:25 --> 00:06:31 It's a reflection of our history. 94 00:06:31 --> 00:06:35 Because remember I said that was the statement about what the 95 00:06:35 --> 00:06:38 population heterozygosity should be at equilibrium? 96 00:06:38 --> 00:06:42 We haven't been six billion people except very recently. 97 00:06:42 --> 00:06:45 The human population has undergone an exponential expansion. 98 00:06:45 --> 00:06:49 It used to be a relatively small size, and then it very recently 99 00:06:49 --> 00:06:52 underwent this huge exponential expansion. If you actually write 100 00:06:52 --> 00:06:56 down the equations, the amount of variation in our 101 00:06:56 --> 00:07:00 population was determined by that constant size for a very long time. 102 00:07:00 --> 00:07:03 And then a rapid exponential expansion that's basically taken 103 00:07:03 --> 00:07:07 place in a mere 3,000 generations, it's much too rapid 104 00:07:07 --> 00:07:11 to have any effect on the real variation in our population. 105 00:07:11 --> 00:07:15 What do I mean by that? What's the mutation rate per nucleotide in the 106 00:07:15 --> 00:07:18 human genome? It's on the order of two times ten to the minus eighth 107 00:07:18 --> 00:07:22 per generation. In a mere 3,000 generations, 108 00:07:22 --> 00:07:26 a tiny mutation rate like two times ten to the minus eighth is not going 109 00:07:26 --> 00:07:30 to be able to build up much more variation. 110 00:07:30 --> 00:07:32 So you might as well ignore the last 100,000 years or so. 
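The arithmetic in this passage can be checked with a short sketch. It assumes the standard neutral-equilibrium relation for expected heterozygosity, pi = 4*N*mu / (1 + 4*N*mu), which is algebraically the same as one over one plus the reciprocal of 4*N*mu. The function and variable names are illustrative, not from the lecture; the values of N, mu, and the 3,000 generations are the ones quoted above.

```python
def equilibrium_heterozygosity(pop_size, mutation_rate):
    """Expected per-nucleotide heterozygosity at mutation-drift equilibrium.

    Assumes a constant population size and neutral mutations.
    """
    theta = 4 * pop_size * mutation_rate
    return theta / (1 + theta)


N = 10_000           # ancestral population size quoted in the lecture
MU = 2e-8            # mutation rate per nucleotide per generation
GENERATIONS = 3_000  # length of the recent expansion, in generations

pi = equilibrium_heterozygosity(N, MU)
one_in = 1 / pi  # "one nucleotide in ..." form

# Per-site mutations accumulated along one lineage during the expansion.
new_mutations_per_site = GENERATIONS * MU

print(f"pi = {pi:.2e}, about one nucleotide in {one_in:.0f}")
print(f"new mutations per site in {GENERATIONS} generations: {new_mutations_per_site:.1e}")
```

With these inputs the sketch gives roughly one difference per 1,250 nucleotides, in line with the "about one in 1300" figure, while a single lineage picks up only about 6 new mutations per 100,000 sites over 3,000 generations, which is small next to the ancestral level.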
111 00:07:32 --> 00:07:34 They're irrelevant to how much variation we have. 112 00:07:34 --> 00:07:36 The variation we have was set by our ancestral population size. 113 00:07:36 --> 00:07:38 Now, don't get me wrong. Eventually it will equilibrate. 114 00:07:38 --> 00:07:40 A couple million years from now we will have a much higher variation in 115 00:07:40 --> 00:07:42 the human population as a function of our size, but the population 116 00:07:42 --> 00:07:44 variation we have today is set by the fact that humans derive from a 117 00:07:44 --> 00:07:47 founding population of about 10,000 individuals or so. 118 00:07:47 --> 00:07:52 And that means that the variation that you see in the human population 119 00:07:52 --> 00:07:57 is mostly ancestral variation, the variation that we all walked 120 00:07:57 --> 00:08:03 around with in Africa. And, in fact, that makes a 121 00:08:03 --> 00:08:08 prediction. That would say that if most of the variation in the human 122 00:08:08 --> 00:08:13 population is from the ancestral African founding population then if 123 00:08:13 --> 00:08:19 I go to any two villages around this world, in Japan or in Sweden or in 124 00:08:19 --> 00:08:24 Nigeria, the variants that I see will largely be identical. 125 00:08:24 --> 00:08:30 And that prediction has been well satisfied. 126 00:08:30 --> 00:08:34 Because when you go and look and you collect variation in Japan or Sweden 127 00:08:34 --> 00:08:38 or Africa and you compare it, 90% of the variants are common 128 00:08:38 --> 00:08:42 across the entire world. Most variation is common ancestral 129 00:08:42 --> 00:08:46 variation around the world, and only a minority of the variants 130 00:08:46 --> 00:08:50 are new local mutations restricted to individual populations. 
131 00:08:50 --> 00:08:54 This is so contrary to what people think because there's a natural 132 00:08:54 --> 00:08:58 tendency to kind of xenophobia, to imagine that world populations 133 00:08:58 --> 00:09:02 are very different in their genetic background. 134 00:09:02 --> 00:09:05 But, in point of fact, they're extremely similar. 135 00:09:05 --> 00:09:09 So, anyway, there's a limited amount of variation. 136 00:09:09 --> 00:09:13 That's why we have such little variation in the human population. 137 00:09:13 --> 00:09:17 Now, that variation, humans have a low rate of genetic variation. 138 00:09:17 --> 00:09:20 Most of the variants that are out there are due to common genetic 139 00:09:20 --> 00:09:24 variants, not rare variants. If I take your genome and I find a 140 00:09:24 --> 00:09:28 site of genetic variation at the point of heterozygosity in your 141 00:09:28 --> 00:09:32 genome, what's the probability that somebody else in this class also is 142 00:09:32 --> 00:09:36 heterozygous for that spot? It turns out that the odds are about 143 00:09:36 --> 00:09:40 95% that someone else in this class will also share that variant. 144 00:09:40 --> 00:09:44 So that the variants are not mostly rare, they're mostly common. 145 00:09:44 --> 00:09:48 And it turns out that some of this common variation, 146 00:09:48 --> 00:09:52 that is most of this variation is likely to be important in the risk 147 00:09:52 --> 00:09:56 of human genetic diseases. So human geneticists have gotten 148 00:09:56 --> 00:10:00 very excited about the following paradigm. 149 00:10:00 --> 00:10:03 If there's only a limited amount of genetic variation in the human 150 00:10:03 --> 00:10:06 population, actually, if you do the arithmetic, 151 00:10:06 --> 00:10:09 there are only about ten million sites of common variation in the 152 00:10:09 --> 00:10:12 human population, where common might be defined as 153 00:10:12 --> 00:10:15 more than about 1% in the population. 
There are only ten million sites. 154 00:10:15 --> 00:10:18 Folks are saying, well, why not enumerate them all? 155 00:10:18 --> 00:10:22 Let's just know them all, and then let's test each one for its 156 00:10:22 --> 00:10:25 risk of, say, conferring susceptibility to diabetes or heart 157 00:10:25 --> 00:10:28 disease or whatever? After all, ten million is not as 158 00:10:28 --> 00:10:32 big a number as it used to be. We now have the whole sequence of 159 00:10:32 --> 00:10:36 the human genome. Why not layer on the sequence of 160 00:10:36 --> 00:10:40 the human genome all common human genetic polymorphism? 161 00:10:40 --> 00:10:44 Now, that's a fairly outrageous idea but could be a very useful one. 162 00:10:44 --> 00:10:48 Some of these variants are important, by the way. 163 00:10:48 --> 00:10:52 We know that there are two nucleotides that vary in the gene 164 00:10:52 --> 00:10:56 apolipoprotein E on chromosome number 19. Apolipoprotein E is also 165 00:10:56 --> 00:11:00 an apolipoprotein like we talked about before with familial 166 00:11:00 --> 00:11:04 hypercholesterolemia. But, in fact, it turns out that 167 00:11:04 --> 00:11:08 apolipoprotein E is expressed in the brain. And it turns out, 168 00:11:08 --> 00:11:13 amongst other tissues, that it comes in three variants, 169 00:11:13 --> 00:11:18 the spelling T-T, T-C and C-C at those two particular spots. 170 00:11:18 --> 00:11:22 And if you happen to be homozygous for the E4 variant, 171 00:11:22 --> 00:11:27 homozygous for the E4 variant, you have about a 60% to 70% lifetime 172 00:11:27 --> 00:11:32 risk of Alzheimer's disease. In this class 13 of you are 173 00:11:32 --> 00:11:37 homozygous for E4 and have a high lifetime risk of Alzheimer's. 174 00:11:37 --> 00:11:42 And it would be fairly trivial to go across the street to anybody's 175 00:11:42 --> 00:11:47 lab and test that. 
Now, I don't particularly recommend 176 00:11:47 --> 00:11:52 it, and I haven't tested myself for this variant because there happens 177 00:11:52 --> 00:11:57 to be no particular therapy available today to delay the onset 178 00:11:57 --> 00:12:01 of Alzheimer's disease. And, therefore, 179 00:12:01 --> 00:12:05 I don't recommend finding out about that. But a number of 180 00:12:05 --> 00:12:08 pharmaceutical companies, knowing that this is a very 181 00:12:08 --> 00:12:11 important gene in the pathogenesis of Alzheimer's disease, 182 00:12:11 --> 00:12:15 are working on drugs to try to delay the pathogenesis using this 183 00:12:15 --> 00:12:18 information. And it may be the case that five or ten years from now 184 00:12:18 --> 00:12:21 people will begin to offer drugs that will delay the onset of 185 00:12:21 --> 00:12:25 Alzheimer's disease by delaying the interaction of apolipoprotein E with 186 00:12:25 --> 00:12:29 a target protein called tau, etc. So, this is an example of where a 187 00:12:29 --> 00:12:33 common variant in the population points us to the basis of a common 188 00:12:33 --> 00:12:37 disease and has important therapeutic implications. 189 00:12:37 --> 00:12:41 There are some other ones, for example. 5% of you carry a 190 00:12:41 --> 00:12:45 particular variant in your factor V gene which is in the clotting cascade. 191 00:12:45 --> 00:12:49 It's called the Leiden variant. Those 5% of you are going to account 192 00:12:49 --> 00:12:53 for 50% of the admissions to emergency rooms for deep venous 193 00:12:53 --> 00:12:57 clots, for example. The much higher risk of deep venous 194 00:12:57 --> 00:13:02 clots. And, in particular, 195 00:13:02 --> 00:13:06 there are significant issues if you have that variant and you are a 196 00:13:06 --> 00:13:11 woman taking birth control pills. Some of you are at higher 197 00:13:11 --> 00:13:16 risk for diabetes, type 2 adult onset diabetes. 
198 00:13:16 --> 00:13:20 There's a particular variant in the population that increased your risk 199 00:13:20 --> 00:13:25 for type 2 diabetes by about 30%. 85% of you have the high-risk 200 00:13:25 --> 00:13:30 factor, so you might as well figure you do. 201 00:13:30 --> 00:13:35 15% of you have a lower risk, et cetera. And one I'm particularly 202 00:13:35 --> 00:13:40 interested in here, it turns out that HIV virus gets 203 00:13:40 --> 00:13:46 into cells with a co-receptor encoded by a gene called CCR5. 204 00:13:46 --> 00:13:51 Well, it turns out that if we go across the European population, 205 00:13:51 --> 00:13:57 10% of all chromosomes of European ancestry have a deletion 206 00:13:57 --> 00:14:02 within the CCR5 gene. If 10% of all chromosomes have that 207 00:14:02 --> 00:14:06 deletion then 10% times 10%, 1% of all individuals are homozygous 208 00:14:06 --> 00:14:10 for that deletion. Those individuals are essentially 209 00:14:10 --> 00:14:15 immune to infection from HIV. They are not susceptible. It's not 210 00:14:15 --> 00:14:19 through immunity, it's through lack of a receptor. 211 00:14:19 --> 00:14:23 Yes? You certainly can. It's not hard. It's a specific known variant. 212 00:14:23 --> 00:14:28 You could test for it. Absolutely. 213 00:14:28 --> 00:14:31 Now, of course, that only helps the 1% of people who 214 00:14:31 --> 00:14:34 have that variant. But what it did do was point to the 215 00:14:34 --> 00:14:37 pharmaceutical industry that the interaction between the virus and 216 00:14:37 --> 00:14:41 that variant is essential. And now companies are developing 217 00:14:41 --> 00:14:44 drugs to block the interaction with that particular protein. 218 00:14:44 --> 00:14:48 And that tells you that it's an important protein. Yes? 219 00:14:48 --> 00:14:56 Over the whole world? 220 00:14:56 --> 00:15:00 I just specified European population for that one. 
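The "10% times 10%" step for the CCR5 deletion can be written out as a small sketch. It assumes random mating (Hardy-Weinberg proportions), which the lecture uses implicitly; the function name and genotype labels are illustrative only.

```python
def genotype_frequencies(deletion_freq):
    """Genotype frequencies under Hardy-Weinberg proportions.

    deletion_freq is the fraction of chromosomes carrying the deletion.
    Assumes random mating; labels are illustrative.
    """
    q = deletion_freq
    p = 1 - q
    return {
        "no deletion": p * p,
        "heterozygous": 2 * p * q,
        "homozygous deletion": q * q,
    }


# 10% of chromosomes of European ancestry carry the CCR5 deletion.
freqs = genotype_frequencies(0.10)
for genotype, freq in freqs.items():
    print(f"{genotype}: {freq:.0%}")
```

Under that assumption about 1% of individuals are homozygous for the deletion, matching the figure in the lecture, and about 18% carry a single copy.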
221 00:15:00 --> 00:15:03 That one, interestingly, is not found at as high a frequency 222 00:15:03 --> 00:15:06 outside of Europe, and no one knows why, 223 00:15:06 --> 00:15:09 whether that might have been due to an ancient selective event or a 224 00:15:09 --> 00:15:13 genetic drift. By contrast, the apolipoprotein E 225 00:15:13 --> 00:15:16 variant, at that frequency of about 3% of people being homozygous and 226 00:15:16 --> 00:15:19 being at risk for Alzheimer's, is about the same frequency 227 00:15:19 --> 00:15:23 everywhere in the world. So, there's a little bit of 228 00:15:23 --> 00:15:26 population variation in frequency. Now, the HIV variant is found 229 00:15:26 --> 00:15:30 elsewhere but at considerably lower frequencies there. 230 00:15:30 --> 00:15:33 And that's an interesting question as to what causes that variation. 231 00:15:33 --> 00:15:36 So the notion would be, I've given you a couple of interesting examples, 232 00:15:36 --> 00:15:40 but, look, there's only ten million variants. Just write them all down. 233 00:15:40 --> 00:15:43 Make one big Excel spreadsheet with ten million variants along the top 234 00:15:43 --> 00:15:47 and all the diseases along the rows, and let's just fill in the matrix 235 00:15:47 --> 00:15:50 and then we'll really, you know, this is the way people 236 00:15:50 --> 00:15:54 think in a post-genomic era. Now, could you do something like 237 00:15:54 --> 00:15:57 that? You would have to enumerate all of the single nucleotide 238 00:15:57 --> 00:16:01 polymorphisms, or SNPs we call them, 239 00:16:01 --> 00:16:05 single nucleotide polymorphisms. Now, to give you an idea of the 240 00:16:05 --> 00:16:09 magnitude of this problem, as recently as 1998, the number of 241 00:16:09 --> 00:16:13 SNPs that were known in the human genome was a couple hundred. 242 00:16:13 --> 00:16:17 But then a project has taken off. 
In 1998 an initial SNP map of the 243 00:16:17 --> 00:16:21 human genome was built here at MIT that had about 4,000 244 00:16:21 --> 00:16:25 of these variants. Then within the next year or so an 245 00:16:25 --> 00:16:29 international consortium was organized here and elsewhere to 246 00:16:29 --> 00:16:34 begin to collect more of these genetic variants. 247 00:16:34 --> 00:16:38 The goal was going to be to find 300,000 of them within a period of two 248 00:16:38 --> 00:16:42 years. In fact, that goal was blown away and within 249 00:16:42 --> 00:16:46 three years two million of the SNPs in the human population were found. 250 00:16:46 --> 00:16:51 And as of today, if you go on the Web, you'll find the database with 251 00:16:51 --> 00:16:55 about 7.8 million of the roughly ten million SNPs in the human population 252 00:16:55 --> 00:17:00 already known. Now, that isn't all ten million. 253 00:17:00 --> 00:17:03 And it takes a while to collect the last ones, you know, 254 00:17:03 --> 00:17:07 collecting the last ones is hard, but we're already over the hump of 255 00:17:07 --> 00:17:10 knowing the majority of common variation in the human population. 256 00:17:10 --> 00:17:14 Not just a sequence of the genome, but a database that already contains 257 00:17:14 --> 00:17:17 more than half of all common variation in the population. 258 00:17:17 --> 00:17:21 So, we could start building that Excel spreadsheet. 259 00:17:21 --> 00:17:24 Now, it turns out that it's even a little bit better than that because 260 00:17:24 --> 00:17:28 if we look at many chromosomes in the population, 261 00:17:28 --> 00:17:31 here are chromosomes in the population, it turns out that the 262 00:17:31 --> 00:17:35 common variants on each of those chromosomes tend to be correlated 263 00:17:35 --> 00:17:38 with each other. 
If I know your genotype at one 264 00:17:38 --> 00:17:41 variant, like over at this locus, I know your genotype at the next 265 00:17:41 --> 00:17:45 locus with reasonably high probability. There's a lot of local 266 00:17:45 --> 00:17:48 correlation. So, instead of looking like a scattered 267 00:17:48 --> 00:17:51 picture like that, it's more like this. 268 00:17:51 --> 00:17:55 If I know that you're red, red, red you're probably red, 269 00:17:55 --> 00:17:58 red, red over here. In other words, these variations occur in blocks 270 00:17:58 --> 00:18:01 that we called haplotypes. Here's real data. 271 00:18:01 --> 00:18:04 Across 111 kilobases of DNA there's a bunch of variants, 272 00:18:04 --> 00:18:08 but it turns out that the variants come in two basic flavors. 273 00:18:08 --> 00:18:11 98% of all chromosomes are either this, this, this, 274 00:18:11 --> 00:18:14 this, this or this, this, this, this, this. 275 00:18:14 --> 00:18:18 Then there tends to be sites of recombination that are actually 276 00:18:18 --> 00:18:21 hotspots of recombination where most of the recombination of the 277 00:18:21 --> 00:18:24 population is concentrated. And you get a couple of 278 00:18:24 --> 00:18:28 possibilities here. So, the human genome can kind of be 279 00:18:28 --> 00:18:31 broken up into these haplotypes. Blocks that might be 20, 280 00:18:31 --> 00:18:35 30, 40, sometimes 100 kilobases long in which within the block you tend 281 00:18:35 --> 00:18:39 to have a small number of haplotypes, or flavors as you might think of 282 00:18:39 --> 00:18:43 them, that define most of the chromosomes in the population. 283 00:18:43 --> 00:18:46 So, in fact, I don't actually need to know all the variants. 
284 00:18:46 --> 00:18:50 If they're so well correlated within a block, 285 00:18:50 --> 00:18:54 if I knew this block structure I would be able to pick a small number 286 00:18:54 --> 00:18:58 of SNPs that would serve as a proxy for that entire block of inheritance 287 00:18:58 --> 00:19:01 in the population. So, what you might want to do is 288 00:19:01 --> 00:19:04 determine that entire haplotype block structure of how they're 289 00:19:04 --> 00:19:08 related to each other, and pick out tag SNPs. 290 00:19:08 --> 00:19:11 And it turns out that in theory, a mere 300,000 or so of them would 291 00:19:11 --> 00:19:14 suffice to proxy for most of the genome. So, you might want to 292 00:19:14 --> 00:19:18 declare an international project, an international haplotype map 293 00:19:18 --> 00:19:21 project to create a haplotype map of the human genome. 294 00:19:21 --> 00:19:24 And indeed, such a project was declared about a year and a half ago 295 00:19:24 --> 00:19:28 through some instigation of scientists at a number of places, 296 00:19:28 --> 00:19:31 including here. And this is a $100 million project 297 00:19:31 --> 00:19:35 involving six different countries. And, it is already more than 298 00:19:35 --> 00:19:39 halfway done with the task, and it's very likely that by the 299 00:19:39 --> 00:19:42 middle of next year, we will have a pretty good haplotype 300 00:19:42 --> 00:19:46 map, not just knowing all the variation, but knowing the 301 00:19:46 --> 00:19:50 correlation between that variation, being able to break up the genome 302 00:19:50 --> 00:19:53 into these blocks. By the next time I teach 7.01, 303 00:19:53 --> 00:19:57 I should be able to show a haplotype map of the whole human genome 304 00:19:57 --> 00:20:01 already. That will allow you to start undertaking systematic studies 305 00:20:01 --> 00:20:05 of inheritance for different diseases across populations. 306 00:20:05 --> 00:20:08 And in fact, people are already doing things like that. 
307 00:20:08 --> 00:20:12 Here's an example of a study done here at MIT like this, 308 00:20:12 --> 00:20:15 where to study inflammatory bowel disease, there was evidence that 309 00:20:15 --> 00:20:19 there might be a particular region of the genome that contained it, 310 00:20:19 --> 00:20:22 and haplotypes were determined across this, and blah, 311 00:20:22 --> 00:20:26 blah, blah, blah, blah, blah, blah. And this red haplotype 312 00:20:26 --> 00:20:29 here turns out to confer high risk, about a two-and-a-half-fold or higher 313 00:20:29 --> 00:20:33 risk of inflammatory bowel disease. 314 00:20:33 --> 00:20:36 And it sits over some genes involved in immune responses, 315 00:20:36 --> 00:20:40 certain cytokine genes and all that. And, things like this have been 316 00:20:40 --> 00:20:44 done for type 2 diabetes, schizophrenia, cardiovascular 317 00:20:44 --> 00:20:47 disease, just right now at the moment, a dozen or two examples. 318 00:20:47 --> 00:20:51 But I think we're set for an explosion in this kind of work. 319 00:20:51 --> 00:20:55 In addition, you can use this information to do things beyond 320 00:20:55 --> 00:20:59 medical genetics. You can use it for history and 321 00:20:59 --> 00:21:03 anthropology as well. It turns out rather interestingly, 322 00:21:03 --> 00:21:07 that since the human population originated in Africa and spread out 323 00:21:07 --> 00:21:12 from Africa all the way around the world arriving at different places 324 00:21:12 --> 00:21:17 in different times, you can trace those migrations by 325 00:21:17 --> 00:21:21 virtue of rare genetic variants that arose along the way, 326 00:21:21 --> 00:21:26 and let you, like a trail of breadcrumbs, see the migrations. 
327 00:21:26 --> 00:21:30 So, for example, there are certain rare genetic variants that we can 328 00:21:30 --> 00:21:35 see in a South American Indian tribe, and we can actually see that they 329 00:21:35 --> 00:21:40 came along this route because we can see that residual of that. 330 00:21:40 --> 00:21:45 In fact, we can do things with this like take a look at Native American 331 00:21:45 --> 00:21:50 individuals and determine that they cluster into three distinct genetic 332 00:21:50 --> 00:21:55 groups that represent three distinct migrations over the land bridge. 333 00:21:55 --> 00:22:00 And, you can assign them to these different migrations. 334 00:22:00 --> 00:22:03 You can do this on the basis of mitochondrial genotype, 335 00:22:03 --> 00:22:06 etc. You can also, for example, determine when people talk about the 336 00:22:06 --> 00:22:09 out of Africa migration, there's now increasing evidence that 337 00:22:09 --> 00:22:13 there really were two, one that went this way over the land, 338 00:22:13 --> 00:22:16 and one that went this way following along the coast into southeast Asia. 339 00:22:16 --> 00:22:19 And, it looks like we're now beginning to get enough evidence of 340 00:22:19 --> 00:22:22 these two separate migrations by virtue of the genetic breadcrumbs 341 00:22:22 --> 00:22:26 that they have left along the way. 342 00:22:26 --> 00:22:30 So, it's really a very fascinating thing of how much you can 343 00:22:30 --> 00:22:34 reconstruct from looking at genetic variation, both the common variation 344 00:22:34 --> 00:22:38 that allows us to recognize medical risk, and the rare genetic variation 345 00:22:38 --> 00:22:43 that provides much more individual trails of things. 346 00:22:43 --> 00:22:47 None of this is perfect yet. There's lots to learn. 
But I think 347 00:22:47 --> 00:22:51 anthropologists are finding that the existing human population has a 348 00:22:51 --> 00:22:55 tremendous amount of its own history embedded in the pattern of genetic 349 00:22:55 --> 00:23:00 variation across the world. You can do other things. 350 00:23:00 --> 00:23:04 I won't spend much time on this. Well, I'll take a moment on this, 351 00:23:04 --> 00:23:09 right? There's some very interesting work of a post-doctoral 352 00:23:09 --> 00:23:13 fellow here at MIT named Pardis Sabeti who has been trying to ask, 353 00:23:13 --> 00:23:18 can we see in the genetic variation in the population, 354 00:23:18 --> 00:23:22 signatures, patterns of ancient selection, or even recent selection 355 00:23:22 --> 00:23:27 in the human population? Now, hang onto your seats, 356 00:23:27 --> 00:23:32 because this will get just slightly tricky. 357 00:23:32 --> 00:23:35 But, hang on. It's only a couple of slides. Here was her idea. 358 00:23:35 --> 00:23:39 You see, when a mutation arises in the population, 359 00:23:39 --> 00:23:43 it usually dies out, right? Any new mutation just 360 00:23:43 --> 00:23:47 typically dies out. But, sometimes by chance it drifts 361 00:23:47 --> 00:23:50 up to a high frequency. Random events happen. But it 362 00:23:50 --> 00:23:54 usually takes a long time to do that. If some random mutation happens, 363 00:23:54 --> 00:23:58 and it happens to drift up to high frequency with no selection on it, 364 00:23:58 --> 00:24:02 then on average it takes a long time to do so. 365 00:24:02 --> 00:24:05 If you want, I could write a stochastic differential equation 366 00:24:05 --> 00:24:09 that would say that, but just take your gut feeling that 367 00:24:09 --> 00:24:12 if something has no selection on it and it's a rare event that'll drift 368 00:24:12 --> 00:24:16 up, when it drifts up it's kind of a slow process. It was a slow process. 
369 00:24:16 --> 00:24:20 Then over the course of time that it took to drift to high frequency, 370 00:24:20 --> 00:24:23 a lot of genetic recombination would have had to have occurred over many 371 00:24:23 --> 00:24:27 generations. And the correlation between the genotype at that spot 372 00:24:27 --> 00:24:31 and genotypes at other loci would break down. 373 00:24:31 --> 00:24:34 And there would only be short-range correlation. So, 374 00:24:34 --> 00:24:38 in other words, the amount of correlation between knowing the 375 00:24:38 --> 00:24:41 genotype here and the genotype here, maybe allele A here and a C here. 376 00:24:41 --> 00:24:45 That is an indication of time. It's a clock almost. It's like 377 00:24:45 --> 00:24:49 radioactive decay, right, that genetic recombination 378 00:24:49 --> 00:24:52 scrambles up the correlations. And, if something's old, the 379 00:24:52 --> 00:24:56 correlations go over short distances. But suppose that something happened. 380 00:24:56 --> 00:25:00 Some mutation happened that was very advantageous. 381 00:25:00 --> 00:25:03 Then, it would have risen to high frequency quickly because it was 382 00:25:03 --> 00:25:07 under selection. If it did so quickly, 383 00:25:07 --> 00:25:11 then the long-range correlations would not have had time to break 384 00:25:11 --> 00:25:15 down, and we'd have a smoking gun. A smoking gun would be that there 385 00:25:15 --> 00:25:18 would be a long-range correlation around that locus, 386 00:25:18 --> 00:25:22 much longer than you would expect across the genome. 387 00:25:22 --> 00:25:26 Things even out at this distance would show correlation with that, 388 00:25:26 --> 00:25:30 indicating that this was a recent event. 389 00:25:30 --> 00:25:34 So, we just measure across the genome, and look for this telltale 390 00:25:34 --> 00:25:39 sign of common variants that have very long range correlation that 391 00:25:39 --> 00:25:44 indicate that they're very recent. 
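The "clock" idea can be made concrete with a rough sketch. It uses the textbook relation that the correlation (linkage disequilibrium) between two sites shrinks by a factor of (1 - r) each generation, where r is the recombination fraction between them. The numbers below are illustrative assumptions rather than figures from the lecture: about 1% recombination per million bases, so r is roughly 0.001 for sites 100 kilobases apart, and 25 years per generation.

```python
def remaining_correlation(recomb_fraction, generations):
    """Fraction of the initial two-site correlation left after some generations.

    Textbook decay D_t = D_0 * (1 - r)**t; ignores drift, selection,
    and recombination hotspots.
    """
    return (1 - recomb_fraction) ** generations


R_100KB = 0.001  # assumed recombination fraction across ~100 kb

# A variant that rose to high frequency in ~10,000 years (400 generations).
recent = remaining_correlation(R_100KB, 400)

# A variant that drifted up slowly over ~500,000 years (20,000 generations).
old = remaining_correlation(R_100KB, 20_000)

print(f"recent variant: {recent:.2f} of the correlation remains at 100 kb")
print(f"old variant:    {old:.1e} of the correlation remains at 100 kb")
```

Under these assumptions a variant that swept up within the last 400 generations still keeps about two thirds of its correlation with sites 100 kilobases away, while for a variant that took 20,000 generations essentially none is left. That contrast is the long-range signature described above.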
So, a plot of the allele frequency, 392 00:25:44 --> 00:25:49 common variants, sorry, if something has a common high frequency and 393 00:25:49 --> 00:25:54 long-range correlation, you wouldn't expect that by chance. 394 00:25:54 --> 00:25:58 So, something that was common in its 395 00:25:58 --> 00:26:02 frequency and had long-range correlation would be a signature of 396 00:26:02 --> 00:26:06 positive selection. So anyway, Pardis had this idea, 397 00:26:06 --> 00:26:09 and she tried it out with some interesting mutations, 398 00:26:09 --> 00:26:13 some mutations that confer resistance to malaria, 399 00:26:13 --> 00:26:17 one well-known mutation causing resistance to malaria called G6PD 400 00:26:17 --> 00:26:21 and another one that she herself had proposed as a mutation causing 401 00:26:21 --> 00:26:24 resistance to malaria, variants in the CD40 ligand gene. 402 00:26:24 --> 00:26:28 And to make a long story short, both the known and her newly 403 00:26:28 --> 00:26:32 predicted variant showed this telltale property of having a high 404 00:26:32 --> 00:26:36 frequency and very long range correlation. 405 00:26:36 --> 00:26:40 Well that's very interesting because she was able to show that each of 406 00:26:40 --> 00:26:44 these mutations probably was the result of positive selection. 407 00:26:44 --> 00:26:49 But what you could do in principle is test every variant in the human 408 00:26:49 --> 00:26:53 genome this way: take any variant, look at its frequency, and compare 409 00:26:53 --> 00:26:58 it to the long range correlation around it, and test every single 410 00:26:58 --> 00:27:02 variant in the human population to see which ones might be the result 411 00:27:02 --> 00:27:06 of positive selection.
Now, when she proposed this, 412 00:27:06 --> 00:27:09 this was about a year and a half ago or two years ago, 413 00:27:09 --> 00:27:12 this was a pretty nutty idea because you would need all the variants in 414 00:27:12 --> 00:27:15 the human population, and you would need all this 415 00:27:15 --> 00:27:18 correlation information. But in fact, as I say, that 416 00:27:18 --> 00:27:21 information's almost upon us, and I believe that this experiment, 417 00:27:21 --> 00:27:24 this analysis to look for all strong positive selection in the human 418 00:27:24 --> 00:27:27 genome will in fact be done in the course of the next 12 months. 419 00:27:27 --> 00:27:30 So, I'm hoping by next year I can actually report on a genome-wide 420 00:27:30 --> 00:27:33 search for all the signatures of positive selection. 421 00:27:33 --> 00:27:36 Now, this doesn't detect all positive selection. 422 00:27:36 --> 00:27:39 It will detect sufficiently strong positive selection going back pretty 423 00:27:39 --> 00:27:42 much only over the last 10,000 years. When you do the 424 00:27:42 --> 00:27:45 arithmetic, that's how much power you have. Of course, 425 00:27:45 --> 00:27:48 10,000 years has been a pretty interesting time for the human 426 00:27:48 --> 00:27:52 population, right? The time of civilization and 427 00:27:52 --> 00:27:55 population density, and infectious diseases, 428 00:27:55 --> 00:27:58 and all that, and I think we'll have an interesting window into 429 00:27:58 --> 00:28:02 the change in diet. All of that should come out of 430 00:28:02 --> 00:28:06 something like this. So, there's a lot of really cool 431 00:28:06 --> 00:28:10 information in DNA variation to be had. All right, 432 00:28:10 --> 00:28:14 that's one half. The other half of what I would like to talk about is 433 00:28:14 --> 00:28:18 totally different. It's not about inherited DNA 434 00:28:18 --> 00:28:22 variation. It's about somatic differences between tissues in RNA 435 00:28:22 --> 00:28:26 variation.
So, let's shift gears. 436 00:28:26 --> 00:28:30 RNA variation: let me start by giving you an example here. 437 00:28:30 --> 00:28:36 These are cells from two different patients with acute leukemia. 438 00:28:36 --> 00:28:43 Can you spot the difference between these? Yep? More like bunches of 439 00:28:43 --> 00:28:49 grapes and all that. Yeah, it turns out that's just a 440 00:28:49 --> 00:28:56 reflection of the field of view you have if you move over 441 00:28:56 --> 00:29:02 to look like that. But I mean, that's good. 442 00:29:02 --> 00:29:07 It's just that it turns out that that isn't actually a distinction 443 00:29:07 --> 00:29:12 when you look at more fields. Anything else? Yep? White blood 444 00:29:12 --> 00:29:16 cells look different. They look broken. There's more of 445 00:29:16 --> 00:29:21 them in this field of view. But you look at 100 fields of view 446 00:29:21 --> 00:29:26 and it turns out that's not it either. Well, the reason you're having 447 00:29:26 --> 00:29:31 trouble spotting any difference is that highly trained pathologists 448 00:29:31 --> 00:29:35 can't find any difference either. They generally agree there's no 449 00:29:35 --> 00:29:39 difference between these two if you look at enough fields of view. 450 00:29:39 --> 00:29:43 But you can convince yourself if you look that you see things there. 451 00:29:43 --> 00:29:46 But these actually are two very different kinds of leukemia. 452 00:29:46 --> 00:29:50 And, these patients have to be treated very differently. 453 00:29:50 --> 00:29:54 But, pathologists cannot determine which leukemia it is just by looking 454 00:29:54 --> 00:29:57 at the microscope, it turns out.
This is the work of this man, 455 00:29:57 --> 00:30:01 Sidney Farber, namesake of the Dana-Farber Cancer Institute here in 456 00:30:01 --> 00:30:05 Boston, who in the 1950s began noticing that patients with 457 00:30:05 --> 00:30:08 leukemias, some of them seemed different in the way they responded 458 00:30:08 --> 00:30:12 to a certain treatment, and he said, look, I think there's 459 00:30:12 --> 00:30:16 some underlying classification of these leukemias, 460 00:30:16 --> 00:30:19 but I can't get any reliable way to tell it in the microscope. 461 00:30:19 --> 00:30:23 And he put many years into working this out, first by noticing certain 462 00:30:23 --> 00:30:27 differences in enzymes in the cells, and then people noticed certain 463 00:30:27 --> 00:30:31 things in cell surface markers, and some chromosomal rearrangements. 464 00:30:31 --> 00:30:34 And nowadays, there are a bunch of tests that can be done by a 465 00:30:34 --> 00:30:38 pathologist when a patient comes in with acute leukemia to determine 466 00:30:38 --> 00:30:42 whether they have AML or ALL. But it turns out that you can't do 467 00:30:42 --> 00:30:46 it by looking. You have to do some kind of 468 00:30:46 --> 00:30:50 immunohistochemical test of some sort in order to do that. 469 00:30:50 --> 00:30:54 So this is a triumph of diagnosis. After 40 years of work, we can now 470 00:30:54 --> 00:30:58 correctly classify patients as AML or ALL. And they get the 471 00:30:58 --> 00:31:02 appropriate treatment. And if they don't get the right 472 00:31:02 --> 00:31:06 treatment, they have a much higher chance of dying. 473 00:31:06 --> 00:31:10 And if they do get the right treatment, they have a much higher 474 00:31:10 --> 00:31:14 chance of living. So, this is great. 475 00:31:14 --> 00:31:18 There's only one problem with the story. It took 40 years, 476 00:31:18 --> 00:31:22 40 years to sort this out. That's a long time. Couldn't we do 477 00:31:22 --> 00:31:26 better?
Surely these cells know what they are. 478 00:31:26 --> 00:31:30 Surely we could just ask them what they are. Well, here's the idea. 479 00:31:30 --> 00:31:33 Suppose we could ask each cell, please tell us every gene that you 480 00:31:33 --> 00:31:37 have turned on, and the level to which you have that 481 00:31:37 --> 00:31:40 gene expressed. In other words, 482 00:31:40 --> 00:31:44 let us summarize each cell, each tumor by a description of its 483 00:31:44 --> 00:31:47 complete pattern of gene expression across the 22,000 genes in the human genome. 484 00:31:47 --> 00:31:51 Let's write down the level of expression, X1 up to X22,000 485 00:31:51 --> 00:31:54 for each of the 22,000 genes of the genome. So, 486 00:31:54 --> 00:31:58 every tumor becomes a point in 22,000 dimensional space, right? 487 00:31:58 --> 00:32:01 Now clearly, if we had every tumor described as a point in 22,000 488 00:32:01 --> 00:32:05 dimensional space, we ought to be able to sort out which tumors are 489 00:32:05 --> 00:32:09 similar to each other, right? Well, it turns out you can 490 00:32:09 --> 00:32:13 do that now. These are gene chips, one of several technologies by which 491 00:32:13 --> 00:32:17 on a piece of glass are put little spots, each of which contains a 492 00:32:17 --> 00:32:21 piece of DNA, a unique DNA sequence. Actually, many copies of that DNA 493 00:32:21 --> 00:32:25 sequence are there. Each of these is a 25 base long DNA 494 00:32:25 --> 00:32:29 sequence, and I can design this so whatever DNA sequence you 495 00:32:29 --> 00:32:32 want is in each spot. The way that's done is with the same 496 00:32:32 --> 00:32:36 photolithographic techniques that are used to make microprocessors.
497 00:32:36 --> 00:32:40 People have worked out a chemistry where through a mask, 498 00:32:40 --> 00:32:44 you shine a light, photodeprotect certain pixels; the pixels that are 499 00:32:44 --> 00:32:48 photodeprotected you can chemically attach an A, then re-protect the 500 00:32:48 --> 00:32:52 surface. Use a light. Chemically photodeprotect certain 501 00:32:52 --> 00:32:56 spots. Wash on a C. And in this fashion, 502 00:32:56 --> 00:33:00 since you can randomly address the spots by light, 503 00:33:00 --> 00:33:04 and then chemically add bases to whatever spots are deprotected, 504 00:33:04 --> 00:33:08 you can simultaneously construct hundreds of thousands of spots each 505 00:33:08 --> 00:33:12 containing its own unique specified oligonucleotide sequence. 506 00:33:12 --> 00:33:16 And you can get them in little plastic chips. 507 00:33:16 --> 00:33:20 And then if you want, all you do is you take a tumor. 508 00:33:20 --> 00:33:24 You grind it up. You prepare RNA. You fluorescently label the RNA 509 00:33:24 --> 00:33:28 with some appropriate fluorescent dye. You squirt it into the chip. 510 00:33:28 --> 00:33:31 You wash it back and forth. You rock it back and forth, 511 00:33:31 --> 00:33:35 wash it out, and stick it in a laser scanner. And it'll see how much 512 00:33:35 --> 00:33:38 fluorescence is stuck to each spot. And bingo: you get a readout of the 513 00:33:38 --> 00:33:42 level of gene expression. I guess each spot, you should 514 00:33:42 --> 00:33:45 design it so that this spot has an oligonucleotide complementary to 515 00:33:45 --> 00:33:49 gene number one. And the next one, 516 00:33:49 --> 00:33:53 an oligonucleotide matching by Crick-Watson base pairing 517 00:33:53 --> 00:33:56 complementary to gene number two and gene number three. 518 00:33:56 --> 00:34:00 So, if I knew all the genes in the genome, I could make a detector spot 519 00:34:00 --> 00:34:03 for each gene in the genome. 
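The mask-by-mask construction described above can be sketched as a small simulation. This is a simplified illustration, not the manufacturer's actual process: it assumes that at each position every spot needing a given base is exposed by one mask and then receives that base, and it ignores real chemistry details such as synthesis direction and coupling efficiency. The function name `plan_synthesis` and the three-base example probes are hypothetical.

```python
def plan_synthesis(probes, alphabet="ACGT"):
    """Return (steps, built): `steps` is a list of (base, mask) pairs, where
    `mask` is the set of spot indices lit in that step, and `built` is the
    sequence that ends up on each spot after all steps are applied."""
    longest = max(len(p) for p in probes)
    built = [""] * len(probes)
    steps = []
    for position in range(longest):
        for base in alphabet:
            # Light only the spots whose next base is `base`.
            mask = {i for i, p in enumerate(probes)
                    if position < len(p) and p[position] == base}
            if not mask:
                continue  # no spot needs this base now, so skip the step
            steps.append((base, mask))
            for i in mask:
                built[i] += base  # wash on the base; only lit spots take it
    return steps, built


if __name__ == "__main__":
    probes = ["ACG", "AGG", "TCG"]  # toy probes; real ones are 25 bases long
    steps, built = plan_synthesis(probes)
    for base, mask in steps:
        print(f"expose spots {sorted(mask)}, add {base}")
    print(built == probes)
```

In this sketch the number of steps is at most four times the probe length, however many spots there are, which is why hundreds of thousands of spots can be built at once.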
And of course we know essentially 520 00:34:03 --> 00:34:07 all the genes in the genome. So you can make those detector 521 00:34:07 --> 00:34:10 spots and you can buy them. So, you can now get a readout of 522 00:34:10 --> 00:34:13 all the, I mean, this is like so cool because when I 523 00:34:13 --> 00:34:17 started teaching 7.01, which wasn't that long ago because I 524 00:34:17 --> 00:34:20 ain't that old still, the way people did an analysis of 525 00:34:20 --> 00:34:23 gene expression is they used primitive technologies where they 526 00:34:23 --> 00:34:27 would analyze one gene at a time, certain things called northern blots 527 00:34:27 --> 00:34:30 and things like that, right? And, you know, 528 00:34:30 --> 00:34:34 you'd put in a lot of work and you get the expression level of a gene, 529 00:34:34 --> 00:34:37 whereas now you can get the expression of all the genes 530 00:34:37 --> 00:34:41 simultaneously, and it's pretty mind boggling that 531 00:34:41 --> 00:34:44 you can do that. How do you analyze data like that? 532 00:34:44 --> 00:34:48 So, we still use northern blots. It's true. So, 533 00:34:48 --> 00:34:51 every tumor becomes a vector, and we get a vector corresponding to 534 00:34:51 --> 00:34:55 each tumor. So, this line here is the first tumor, 535 00:34:55 --> 00:34:59 the second tumor, the third tumor, the fourth tumor. 536 00:34:59 --> 00:35:02 The columns here correspond to genes. There are 22,000 537 00:35:02 --> 00:35:06 columns in this matrix, and I've shown a certain subset of 538 00:35:06 --> 00:35:10 the columns because these genes here have the interesting property that 539 00:35:10 --> 00:35:14 they tend to be high red in the ALL tumors, and they tend to be low blue 540 00:35:14 --> 00:35:18 in the AML tumors, whereas these genes here have the 541 00:35:18 --> 00:35:22 opposite property. They tend to be low blue in the ALL 542 00:35:22 --> 00:35:26 tumors and high red in the AML tumors.
These genes do a pretty 543 00:35:26 --> 00:35:30 good job of telling apart these tumors. 544 00:35:30 --> 00:35:35 So, here's a new tumor. Patient came in. We analyzed the 545 00:35:35 --> 00:35:40 RNA, squirted it on the chip. Can somebody classify that? Louder? 546 00:35:40 --> 00:35:45 AML. Next? Next? Congratulations, you're 547 00:35:45 --> 00:35:50 pathologists. Very good. That's right, you can do that. 548 00:35:50 --> 00:35:56 It works. And in fact, in the study that was done that was 549 00:35:56 --> 00:36:01 published about this, the computer was able to get it 550 00:36:01 --> 00:36:05 right 100% of the time. Not bad. So now you say, 551 00:36:05 --> 00:36:09 wait, wait, wait, but you're cheating. 552 00:36:09 --> 00:36:12 You're giving it a whole bunch of knowns. Once I have a whole bunch 553 00:36:12 --> 00:36:15 of knowns it's not so hard to classify a new tumor. 554 00:36:15 --> 00:36:19 What Sidney Farber did was he discovered in the first place that 555 00:36:19 --> 00:36:22 there existed two subtypes. Surely that's harder than 556 00:36:22 --> 00:36:26 classifying when you're given a bunch of knowns. And 557 00:36:26 --> 00:36:29 that's true. So, suppose instead, 558 00:36:29 --> 00:36:33 I didn't tell you in advance which were AMLs and which were ALLs, 559 00:36:33 --> 00:36:37 and I just gave you vectors corresponding to a large number of 560 00:36:37 --> 00:36:41 tumors, do you think you would be able to sort out that they actually 561 00:36:41 --> 00:36:49 fell into two clusters? 562 00:36:49 --> 00:36:53 Could you by computer tell that there's one class and the other 563 00:36:53 --> 00:36:57 class? Turns out that you can. Now, I've made it a little easier 564 00:36:57 --> 00:37:02 by not listing most of the 22,000 columns here. 565 00:37:02 --> 00:37:06 But think about it. Every tumor is a point in 22,000 566 00:37:06 --> 00:37:10 dimensional space.
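Classifying a new tumor from a set of knowns, as the class just did by eye, can be sketched in a few lines. This is a minimal illustration using a nearest-centroid rule, which is a stand-in chosen for simplicity and not necessarily the method of the published study. The four "genes" and all expression values are invented toy data.

```python
import math


def centroid(vectors):
    """Average expression vector of a group of tumors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(new_tumor, known):
    """Assign `new_tumor` to the label whose centroid is closest.
    `known` maps a label to a list of expression vectors."""
    centers = {label: centroid(vectors) for label, vectors in known.items()}
    return min(centers, key=lambda label: distance(new_tumor, centers[label]))


if __name__ == "__main__":
    # Toy data: genes 1-2 are high in ALL, genes 3-4 are high in AML.
    known = {
        "ALL": [[9, 8, 1, 2], [8, 9, 2, 1]],
        "AML": [[1, 2, 9, 8], [2, 1, 8, 9]],
    }
    print(classify([2, 2, 7, 9], known))
```

The same rule works unchanged with 22,000 genes per vector; only the length of each list changes.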
If some of the tumors are similar, 567 00:37:10 --> 00:37:14 what can you say about those points in 22,000 dimensional space? 568 00:37:14 --> 00:37:18 They're going to be clumped together. They're near each other. 569 00:37:18 --> 00:37:22 So, just plot every tumor as a point in 22,000 dimensional space, 570 00:37:22 --> 00:37:26 and your question is, do the points tend to lie in two clumps up in 22,000 571 00:37:26 --> 00:37:30 dimensional space? And there's simple arithmetic you 572 00:37:30 --> 00:37:34 can learn using linear algebra to get some separating hyperplane and 573 00:37:34 --> 00:37:38 ask, do tumors lie on one side or the other? And, 574 00:37:38 --> 00:37:42 it turns out that procedures like that will quickly tell you that 575 00:37:42 --> 00:37:46 these tumors clump into two very clear clumps. They're not randomly 576 00:37:46 --> 00:37:50 distributed. And so, if you get these tumors, 577 00:37:50 --> 00:37:54 and you do gene expression on them and put the data into a computer, 578 00:37:54 --> 00:37:58 the amount of time it takes the computer to discover that there were 579 00:37:58 --> 00:38:02 actually two types of acute leukemia is about three seconds marked down 580 00:38:02 --> 00:38:06 from 40 years. That's good. So, you can reproduce the discovery 581 00:38:06 --> 00:38:10 of AML and ALL in three seconds. Now you know what the pathologists 582 00:38:10 --> 00:38:14 say about this. They say, oh, give me a break. 583 00:38:14 --> 00:38:18 It's shooting fish in a barrel. We know there was a distinction. 584 00:38:18 --> 00:38:22 Big deal that the computer can find the distinction. 585 00:38:22 --> 00:38:26 We knew that there was distinction there. I know the computer didn't 586 00:38:26 --> 00:38:30 know it and all that. Tell us something we don't know. 587 00:38:30 --> 00:38:35 That's a fair question. So it turns out that you can ask 588 00:38:35 --> 00:38:40 some more questions.
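Finding the two clumps without being told the labels can also be sketched. This is a minimal illustration using a two-cluster k-means procedure, which is a common stand-in here and not the separating-hyperplane arithmetic mentioned in the lecture or the method of any particular study. The starting rule (first point, plus the point farthest from it) and the toy data are assumptions made to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=20):
    """Split `points` into two clusters; returns a list of 0/1 labels."""
    first = points[0]
    farthest = max(points, key=lambda p: squared_distance(p, first))
    centers = [list(first), list(farthest)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each tumor to the nearer of the two centers.
        labels = [
            0 if squared_distance(p, centers[0]) <= squared_distance(p, centers[1]) else 1
            for p in points
        ]
        # Move each center to the average of the tumors assigned to it.
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Unlabeled toy tumors: the first three resemble each other, as do the last three.
    tumors = [
        [9, 8, 1, 2], [8, 9, 2, 1], [9, 9, 1, 1],
        [1, 2, 9, 8], [2, 1, 8, 9], [1, 1, 9, 9],
    ]
    print(two_means(tumors))
```

A procedure like this always returns two groups, so on real data one would also check that the split is much cleaner than what random data would give.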
You can say, suppose I take now 589 00:38:40 --> 00:38:45 just the ALLs. Are they a homogeneous class, 590 00:38:45 --> 00:38:50 or do they fall into two classes? It turns out that extending this 591 00:38:50 --> 00:38:55 work, folks here were able to show that we can further split that ALL 592 00:38:55 --> 00:39:00 class. There was a hint that you might be able to do so because 593 00:39:00 --> 00:39:06 there are some ALL patients who have disruptions of a gene called MLL. 594 00:39:06 --> 00:39:09 And this tends to be a little more common in infants, 595 00:39:09 --> 00:39:13 and tends to be associated with a poor prognosis. 596 00:39:13 --> 00:39:16 But it was really very unclear whether this was simply one of a 597 00:39:16 --> 00:39:20 zillion factoids about some leukemia patients, or whether this was a 598 00:39:20 --> 00:39:24 fundamental distinction. So, what happened was folks took a 599 00:39:24 --> 00:39:27 lot of ALL patients, got their expression profiles, 600 00:39:27 --> 00:39:31 and lo and behold it turned out that ALL itself broke into two very 601 00:39:31 --> 00:39:34 different clusters. This is an artist's rendition of a 602 00:39:34 --> 00:39:38 22,000 dimensional space. We can't afford a 22,000 603 00:39:38 --> 00:39:42 dimensional projector here, so we're just using two dimensions. 604 00:39:42 --> 00:39:46 But, the two forms of ALL were quite distinct from each other, 605 00:39:46 --> 00:39:50 and so actually ALL itself should be split up into two classes, 606 00:39:50 --> 00:39:54 ALL plus and minus, or ALL one and two, or MLL and ALL. 607 00:39:54 --> 00:39:58 And it turns out that these forms are quite different. 608 00:39:58 --> 00:40:02 They have different outcomes and should be treated differently. 609 00:40:02 --> 00:40:07 It also turns out that a particularly good distinction 610 00:40:07 --> 00:40:12 between these two subtypes of ALL is found by looking at this particular 611 00:40:12 --> 00:40:17 gene called the FLT3 kinase.
The FLT3 kinase gene, whatever 612 00:40:17 --> 00:40:23 that is, was of great interest because people know that they can 613 00:40:23 --> 00:40:28 make inhibitors against certain kinases. And so, 614 00:40:28 --> 00:40:33 it turned out that there is an inhibitor against FLT3 kinase, 615 00:40:33 --> 00:40:39 against this FLT3 kinase gene product. 616 00:40:39 --> 00:40:44 If you treat cells with that inhibitor, cells of this type die, 617 00:40:44 --> 00:40:49 and cells of this type are not affected. So in fact, 618 00:40:49 --> 00:40:54 there's a potential drug use of FLT3 kinase inhibitors in the MLL class of 619 00:40:54 --> 00:41:00 these leukemias, and folks are trying some clinical 620 00:41:00 --> 00:41:05 trials now. So, not only did the analysis of the 621 00:41:05 --> 00:41:09 gene expression point to two important sub-types of leukemias, 622 00:41:09 --> 00:41:14 but the analysis of the gene expression even suggested potential 623 00:41:14 --> 00:41:19 targets for therapy. So, I'll give you a bunch more 624 00:41:19 --> 00:41:23 examples. I have a bunch more examples like that there. 625 00:41:23 --> 00:41:28 They are examples of taking lymphomas and showing that they can 626 00:41:28 --> 00:41:33 be split into two different categories, examples of taking 627 00:41:33 --> 00:41:38 breast cancers into several categories, colon cancers.
628 00:41:38 --> 00:41:42 Basically what's going on right now is an attempt to reclassify cancers 629 00:41:42 --> 00:41:47 based not on what they look like in the microscope, 630 00:41:47 --> 00:41:51 and based not on what organ in the body they affect, 631 00:41:51 --> 00:41:56 but based on, molecularly, what their description is, because 632 00:41:56 --> 00:42:01 the molecular description, as Bob talked to you about with CML 633 00:42:01 --> 00:42:05 and with Gleevec, turns out to be a tremendously 634 00:42:05 --> 00:42:10 powerful way of classifying cancers because you're able to see what is 635 00:42:10 --> 00:42:15 the molecular defect and can make a molecular targeted therapy. 636 00:42:15 --> 00:42:20 So, these sorts of tools are quite cool, and I've got to say, 637 00:42:20 --> 00:42:25 in the last year we've begun using these expression tools not just to 638 00:42:25 --> 00:42:30 classify cancers, but to classify drugs. 639 00:42:30 --> 00:42:34 We've begun an interesting and somewhat crazy project to take all 640 00:42:34 --> 00:42:38 the FDA approved drugs, put them onto cell types, 641 00:42:38 --> 00:42:42 and see what they do, that is, get a signature, a fingerprint, 642 00:42:42 --> 00:42:46 a gene expression description of the action of a drug. 643 00:42:46 --> 00:42:50 And then we hope, here's the nutty idea, 644 00:42:50 --> 00:42:54 that we can look up in the computer which drugs do which things and 645 00:42:54 --> 00:42:58 might be useful for which diseases, because we'd put the diseases and 646 00:42:58 --> 00:43:02 the drugs on an equal footing. All of them would be described in 647 00:43:02 --> 00:43:06 terms of their gene expression patterns. So, 648 00:43:06 --> 00:43:10 I'll tell you one interesting example, OK? This is an interesting 649 00:43:10 --> 00:43:14 enough example. I don't even have slides for it yet.
650 00:43:14 --> 00:43:18 It turns out that these patients with ALL that I've been talking 651 00:43:18 --> 00:43:23 about, some of the patients with ALL will respond to the drug 652 00:43:23 --> 00:43:27 dexamethasone. Some won't. If you take patients 653 00:43:27 --> 00:43:31 who respond to dexamethasone, and patients who are resistant to 654 00:43:31 --> 00:43:35 dexamethasone, and you get their gene expression 655 00:43:35 --> 00:43:40 patterns, you can ask are there some genes that explain the difference? 656 00:43:40 --> 00:43:44 And you can get a certain gene signature, a list of, 657 00:43:44 --> 00:43:48 say, a dozen or so genes that do a pretty good job of classifying who's 658 00:43:48 --> 00:43:53 sensitive and who's resistant. Then you can go to this database I 659 00:43:53 --> 00:43:57 was telling you about of the action of many drugs and say, 660 00:43:57 --> 00:44:01 do we see any drugs whose effect would be to produce a signature 661 00:44:01 --> 00:44:06 of sensitivity? If we found a drug X, 662 00:44:06 --> 00:44:10 which when we put it on cells turned on those genes that correlate with 663 00:44:10 --> 00:44:14 being sensitive to dexamethasone, you could hallucinate the following 664 00:44:14 --> 00:44:18 really happy possibility that when you added that drug together with 665 00:44:18 --> 00:44:22 dexamethasone, you might be able to treat resistant 666 00:44:22 --> 00:44:26 patients because that drug could make them sensitive to dexamethasone, 667 00:44:26 --> 00:44:30 and that you could find that drug just by looking it up in 668 00:44:30 --> 00:44:35 a computer database. So, we tried it and we hit a drug. 669 00:44:35 --> 00:44:40 There was a certain drug that came up on the screen, 670 00:44:40 --> 00:44:45 yes? That's very much in the idea too. We found a drug that produced 671 00:44:45 --> 00:44:49 the signature sensitivity, and tested it in vitro. 
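The database lookup described above can be sketched as a simple scoring exercise. This is only an illustration of the idea, assuming each drug is stored as a mapping from gene name to expression change (positive meaning the drug turns the gene on) and that drugs are ranked by their average change over the sensitivity signature. The scoring rule, the gene names, the drug names, and all the numbers are hypothetical and are not taken from the actual project.

```python
def signature_score(drug_profile, signature_genes):
    """Average expression change the drug causes across the signature genes.
    Genes missing from the profile count as no change."""
    total = sum(drug_profile.get(gene, 0.0) for gene in signature_genes)
    return total / len(signature_genes)


def rank_drugs(database, signature_genes):
    """Drug names ordered from best to worst match to the signature."""
    return sorted(
        database,
        key=lambda name: signature_score(database[name], signature_genes),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical genes that are high in dexamethasone-sensitive patients.
    sensitivity_signature = ["gene_a", "gene_b", "gene_c"]
    # Hypothetical database: expression change caused by each drug.
    database = {
        "drug_X": {"gene_a": 2.0, "gene_b": 1.5, "gene_c": 1.8},
        "drug_Y": {"gene_a": -1.0, "gene_b": 0.2, "gene_c": 0.0},
        "drug_Z": {"gene_a": 0.5, "gene_d": 3.0},
    }
    print(rank_drugs(database, sensitivity_signature))
```

A top-ranked drug in a search like this is only a candidate; it still has to be tested in cells and then in patients.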
In vitro, 672 00:44:49 --> 00:44:54 if you take cells that are resistant and you add dexamethasone, 673 00:44:54 --> 00:44:59 nothing happens because they're resistant. If you add drug X, 674 00:44:59 --> 00:45:04 nothing happens. But if you add both drug X plus dexamethasone, 675 00:45:04 --> 00:45:08 the cells drop dead. It's now going into clinical trials 676 00:45:08 --> 00:45:12 in human patients. It turns out drug X is already a 677 00:45:12 --> 00:45:15 well FDA approved drug, so it can be tested in human 678 00:45:15 --> 00:45:19 patients right away, so it's going to be tested. 679 00:45:19 --> 00:45:22 So, the gene expression pattern was able to tell us to use a drug which 680 00:45:22 --> 00:45:26 actually had nothing to do with cancer uses in a cancer setting 681 00:45:26 --> 00:45:30 because it might do something helpful. 682 00:45:30 --> 00:45:33 Now, what's the point of all this? We can turn up the lights because I 683 00:45:33 --> 00:45:37 think I'm going to stop the slides there. The point of all of this, 684 00:45:37 --> 00:45:41 which is what I've made again, and I will make again, 685 00:45:41 --> 00:45:45 because you are the generation that's going to really live this, 686 00:45:45 --> 00:45:48 is that biology is becoming information. Now, 687 00:45:48 --> 00:45:52 don't get me wrong. It's not stopping being 688 00:45:52 --> 00:45:56 biochemistry. It's going to be biochemistry. It's not stopping 689 00:45:56 --> 00:46:00 being molecular biology. It's not stopping any of the things 690 00:46:00 --> 00:46:03 it was before. 
But it is also becoming 691 00:46:03 --> 00:46:07 information, that for the first time we're entering a world where we can 692 00:46:07 --> 00:46:11 collect vast amounts of information: all the genetic variants in a 693 00:46:11 --> 00:46:15 patient, all of the gene expression pattern in a cell, 694 00:46:15 --> 00:46:18 or all of the gene expression pattern induced by a drug, 695 00:46:18 --> 00:46:22 and that whatever question you're asking will be informed by being 696 00:46:22 --> 00:46:26 able to access that whole database. In no way does it decrease the role 697 00:46:26 --> 00:46:30 of the individual smart scientist working on his or her problem. 698 00:46:30 --> 00:46:32 To the contrary, the goal is to empower the 699 00:46:32 --> 00:46:35 individual smart scientist so that you have all of that information at 700 00:46:35 --> 00:46:38 your fingertips. There are databases scattered 701 00:46:38 --> 00:46:41 around the web that have sequences from different species, 702 00:46:41 --> 00:46:44 variations from the human population, all of these drug databases, 703 00:46:44 --> 00:46:47 etc., etc., etc., etc. It's a time of tremendous ferment, 704 00:46:47 --> 00:46:50 a little bit of chaos. You talk to people in the field, 705 00:46:50 --> 00:46:53 they say, we're getting deluged by data. We're getting crushed by the 706 00:46:53 --> 00:46:56 amount of data. I don't know what to do with all 707 00:46:56 --> 00:46:59 the data. There's only one solution for a 708 00:46:59 --> 00:47:02 field in that condition, and that is young scientists because 709 00:47:02 --> 00:47:05 the young scientists who come into the field are the ones who take for 710 00:47:05 --> 00:47:08 granted, of course we're going to have all these data. 711 00:47:08 --> 00:47:11 We love having all these data. This is just great, couldn't be 712 00:47:11 --> 00:47:14 happier to have all these data. We're not put off by it in the 713 00:47:14 --> 00:47:17 least. That's what's going on.
That's what's so important about 714 00:47:17 --> 00:47:20 your generation, and that's why I think it's really 715 00:47:20 --> 00:47:23 important that even though it's 7.01 and we're supposed to be teaching 716 00:47:23 --> 00:47:26 you the basics, it's important that you see this 717 00:47:26 --> 00:47:29 stuff because this is the change that's going on, 718 00:47:29 --> 00:47:32 and we're counting on this very much to drive a revolution in health, 719 00:47:32 --> 00:47:35 a revolution in biomedical research, and we're counting on you guys very 720 00:47:35 --> 00:47:39 much to drive that revolution. It has been a pleasure to teach you 721 00:47:39 --> 00:47:43 this term. I hope many of you will stay in touch, 722 00:47:43 --> 00:47:48 and some of you will go into biology, and even those of you who don't will 723 00:47:48 --> 00:47:53 know lots about it and enjoy it. Thank you very much. [APPLAUSE]