1 00:00:01 --> 00:00:06 Good morning. Welcome back. So, the Red Sox won, it's pretty 2 00:00:06 --> 00:00:13 convincing, yeah, very good. Yay Red Sox. 3 00:00:13 --> 00:00:20 So, as you can also tell, I have something of a cold, 4 00:00:20 --> 00:00:27 so I'll see if I, if my voice makes it through, but what I wanted to do 5 00:00:27 --> 00:00:34 today, if the voice allows, was to talk about genomics. 6 00:00:34 --> 00:00:38 Now, this is a little bit different than what we normally do in the 7 00:00:38 --> 00:00:42 class because, I work on genomics, 8 00:00:42 --> 00:00:46 it's something I'm extremely interested in. 9 00:00:46 --> 00:00:50 And so, what I wanted to do today, and I'll do it one more time before 10 00:00:50 --> 00:00:54 the end of the term, is to talk about research that's 11 00:00:54 --> 00:00:58 going on in genomics, give you a sense of what's really 12 00:00:58 --> 00:01:02 going on. I can assure you that what I say is not going to be in the 13 00:01:02 --> 00:01:05 text book, or any other text book. And, I'm not entirely sure how this 14 00:01:05 --> 00:01:08 might appear on an exam, so don't ask, because I'm really 15 00:01:08 --> 00:01:12 just going to talk about research that's going on today. 16 00:01:12 --> 00:01:15 And part of the purpose in doing that is to a, show you that it's 17 00:01:15 --> 00:01:18 possible for you to understand the kind of research that's going on in 18 00:01:18 --> 00:01:21 this field, and b, to excite you about what's going on 19 00:01:21 --> 00:01:25 in this field. So each year I pick different 20 00:01:25 --> 00:01:28 things to talk about, and I've picked a few things, 21 00:01:28 --> 00:01:32 and we'll see. So feel free to interrupt and to ask 22 00:01:32 --> 00:01:36 questions, and all of that, but this is very much more, sort of 23 00:01:36 --> 00:01:40 the edge of genomics, including stuff that's going on, 24 00:01:40 --> 00:01:44 you know, right now as we speak. So, we'll fire away. 
25 00:01:44 --> 00:01:48 So a little introductory stuff. I call this, we can actually keep 26 00:01:48 --> 00:01:52 the lights up, I think people, 27 00:01:52 --> 00:01:56 can people read that? Yeah, it's fine, good, 28 00:01:56 --> 00:02:00 so we'll leave the lights up and I can see people. 29 00:02:00 --> 00:02:04 So, I think the thing that sets apart this revolution of biology 30 00:02:04 --> 00:02:08 that we're living through right now, is the transformation of biology, 31 00:02:08 --> 00:02:12 not just from being the study of living organisms, 32 00:02:12 --> 00:02:16 to the study of chemicals and enzymes, to the study of molecules, 33 00:02:16 --> 00:02:20 but to the study of biology as information. That is what's 34 00:02:20 --> 00:02:24 distinctive about this decade, is the idea that the information 35 00:02:24 --> 00:02:28 sciences have begun to merge with biology, or biology merged with 36 00:02:28 --> 00:02:32 information sciences, and that it's having a profound 37 00:02:32 --> 00:02:36 effect on driving biomedicine. In both of the two talks I'll give, 38 00:02:36 --> 00:02:40 this one and near the end of the term, that will be the common theme, 39 00:02:40 --> 00:02:44 because I think that's the most important thing that's going on 40 00:02:44 --> 00:02:48 right now. Now, just to remind you, 41 00:02:48 --> 00:02:52 of course, the idea that biology is about information is an old one, 42 00:02:52 --> 00:02:56 it goes back to my hero, Gregor Mendel, with the recognition that 43 00:02:56 --> 00:03:00 information was passed from parent to offspring, according to rules. 44 00:03:00 --> 00:03:04 And, as you know, the history of biology in the 20th 45 00:03:04 --> 00:03:08 century can be read as the development of biology as information. 46 00:03:08 --> 00:03:12 The first quarter of the 20th century was the development of the 47 00:03:12 --> 00:03:16 idea that the information lives in chromosomes. 
The next quarter of 48 00:03:16 --> 00:03:20 the 20th century, the idea that the information of the 49 00:03:20 --> 00:03:24 chromosomes resides in the DNA double-helix, and that information 50 00:03:24 --> 00:03:28 was contained in this molecule, and somehow in its sequence, and 51 00:03:28 --> 00:03:31 you know all of this. And the next quarter of the 20th 52 00:03:31 --> 00:03:35 century, basically from 1950 to 1975, understanding how it is that the 53 00:03:35 --> 00:03:39 cell reads out that information, from DNA to RNA to protein, how it 54 00:03:39 --> 00:03:43 uses a genetic code to translate RNAs into proteins, 55 00:03:43 --> 00:03:46 and the development of the tools of recombinant DNA that made it 56 00:03:46 --> 00:03:50 possible for us to read out the information that the cell reads out. 57 00:03:50 --> 00:03:54 So that brought us ¾ of the way through the 20th century, 58 00:03:54 --> 00:03:58 with the ability to read out genetic information, at least in little ways, 59 00:03:58 --> 00:04:02 but they were little ways. You could write a PhD thesis, 60 00:04:02 --> 00:04:07 around that time, for sequencing 200 letters of DNA. 61 00:04:07 --> 00:04:12 That would be, you know, considered an amazingly exciting PhD 62 00:04:12 --> 00:04:17 thesis. The next quarter of the 20th century, the last quarter of 63 00:04:17 --> 00:04:22 the 20th century, was characterized by a voracious 64 00:04:22 --> 00:04:27 appetite to read as much of this information as possible. 65 00:04:27 --> 00:04:32 It started, first, with trying to read out the sequence of individual 66 00:04:32 --> 00:04:37 genes, then sets of genes, then genomes of small organisms, 67 00:04:37 --> 00:04:41 bacteria, medium-sized organisms. And then, you know, 68 00:04:41 --> 00:04:45 in a wonderful closure to the 20th century, the reading out of the 69 00:04:45 --> 00:04:48 nearly complete genetic information of the human being in the closing 70 00:04:48 --> 00:04:52 weeks of the 20th century. 
When you remember that, that Mendel 71 00:04:52 --> 00:04:55 was rediscovered in January of 1900, that's when the papers rediscovering 72 00:04:55 --> 00:04:59 Mendel came out, and you figure you've got perfect 73 00:04:59 --> 00:05:02 bookends from the rediscovery of Mendel in January of 1900, 74 00:05:02 --> 00:05:06 to the sequencing of the human genome in around 2000. 75 00:05:06 --> 00:05:09 You realize what a century can do. It's not bad, as centuries go, you 76 00:05:09 --> 00:05:12 know, to accomplish all that, and it gives you know, as students, 77 00:05:12 --> 00:05:15 you get a point estimate in time of what science knows, 78 00:05:15 --> 00:05:18 but you guys aren't old enough yet and haven't lived long enough yet, 79 00:05:18 --> 00:05:22 to measure the derivative, and see how rapidly it's changing. 80 00:05:22 --> 00:05:25 But just look at what happened over the course of that century, 81 00:05:25 --> 00:05:28 and then just project forward to what that can mean for 82 00:05:28 --> 00:05:32 the next century. So what that's done is it's brought 83 00:05:32 --> 00:05:36 us to the next picture. I have a picture in my head, 84 00:05:36 --> 00:05:40 of biology as a vast library of information, a library of 85 00:05:40 --> 00:05:44 information in which evolution has been taking patient notes. 86 00:05:44 --> 00:05:48 Evolution is a very good experimentalist, 87 00:05:48 --> 00:05:52 and it's a very patient note taker. Its notes, of course, are written 88 00:05:52 --> 00:05:56 in the genomes, and every day evolution wakes up, 89 00:05:56 --> 00:06:00 changes a few nucleotides, sees how the organism works, 90 00:06:00 --> 00:06:04 if it was an improvement, evolution keeps the notes, 91 00:06:04 --> 00:06:08 if it was disadvantageous, evolution discards the notes. 92 00:06:08 --> 00:06:11 That, by the way, for those of you working in labs, 93 00:06:11 --> 00:06:14 is no longer considered appropriate laboratory practice. 
94 00:06:14 --> 00:06:17 You're obliged to keep your laboratory notes from failed 95 00:06:17 --> 00:06:20 experiments, as well, but evolution got into this before 96 00:06:20 --> 00:06:23 those rules were codified, and so it discards the notes from 97 00:06:23 --> 00:06:26 unsuccessful experiments, and keeps the notes from the 98 00:06:26 --> 00:06:29 successful experiments. But nonetheless, we have all the 99 00:06:29 --> 00:06:32 notes from the successful experiments, and we can learn a 100 00:06:32 --> 00:06:35 tremendous amount from it. There's a volume on the shelf 101 00:06:35 --> 00:06:38 corresponding to each species on the planet. There's a volume on the 102 00:06:38 --> 00:06:41 shelf corresponding to each individual within each species, 103 00:06:41 --> 00:06:44 to each tissue within each individual within each species, 104 00:06:44 --> 00:06:47 and there's information there about the DNA sequence, 105 00:06:47 --> 00:06:50 about the RNA readouts, about the protein expression levels, 106 00:06:50 --> 00:06:53 and in principle, even if not yet in practice, we can pull down any 107 00:06:53 --> 00:06:56 volume we want, and interrogate it, 108 00:06:56 --> 00:06:59 and compare it for related species, for individuals within a species, 109 00:06:59 --> 00:07:02 some of whom might have a disease, some of whom might not, for 110 00:07:02 --> 00:07:06 different kinds of tissues treated in different ways. 111 00:07:06 --> 00:07:09 That is, I think, going to be a tremendous theme of 112 00:07:09 --> 00:07:12 biology going forward, and that's why it's a particular 113 00:07:12 --> 00:07:16 pleasure to teach biology at MIT, where you guys understand what that 114 00:07:16 --> 00:07:19 could mean, that fusion could mean. Now, this idea of extracting 115 00:07:19 --> 00:07:23 genomic information in large-scale, is a relatively new one. 
In the 116 00:07:23 --> 00:07:26 mid-1980's, the scientific community began debating what was a pretty 117 00:07:26 --> 00:07:30 radical idea, sequencing the human genome. 118 00:07:30 --> 00:07:33 This was floated in a couple of places, in 1984 at one meeting, 119 00:07:33 --> 00:07:37 somebody raised the idea, you've got to realize that sequencing itself, 120 00:07:37 --> 00:07:41 that sequencing DNA, only came from the late 70's, 121 00:07:41 --> 00:07:45 so within six, seven years of being able to sequence anything, 122 00:07:45 --> 00:07:49 people were now saying, let's sequence everything. 123 00:07:49 --> 00:07:52 That was a reasonably audacious thing to do, and it was 124 00:07:52 --> 00:07:56 controversial. There were many people who felt 125 00:07:56 --> 00:08:00 that the human genome project was a terrible idea, 126 00:08:00 --> 00:08:04 and with good reason, because the initial version of the 127 00:08:04 --> 00:08:08 human genome project was, kind of, a blunderbuss approach. 128 00:08:08 --> 00:08:11 It was, let's immediately mount a massive factory and start sequencing 129 00:08:11 --> 00:08:15 the human genome with the just horrible technologies of the 130 00:08:15 --> 00:08:19 mid-80's, with radioactive sequencing gels, 131 00:08:19 --> 00:08:22 and you know, lots and lots of people doing stuff. 132 00:08:22 --> 00:08:26 And so, you know, many people in science were, were concerned that an 133 00:08:26 --> 00:08:30 entire generation of students would need to be chained to the 134 00:08:30 --> 00:08:33 bench, sequencing DNA. Sydney Brenner, 135 00:08:33 --> 00:08:37 a great molecular biologist, proposed the whole thing be done at 136 00:08:37 --> 00:08:41 institutions [LAUGHTER], because you know, people could be 137 00:08:41 --> 00:08:45 sentenced to, 20 million bases, with time off for accuracy, or 138 00:08:45 --> 00:08:48 things like that [LAUGHTER]. 
And so what happened was, the 139 00:08:48 --> 00:08:52 scientific community came together well, in its best form. 140 00:08:52 --> 00:08:56 A group was put together by the National Academy of Sciences, 141 00:08:56 --> 00:09:00 who said, well look, this is a really good idea, 142 00:09:00 --> 00:09:04 but we also need a carefully thought-through program to do it. 143 00:09:04 --> 00:09:07 We need intermediate goals that will get us things that will advance the 144 00:09:07 --> 00:09:10 science along the way, we need to improve the technologies, 145 00:09:10 --> 00:09:13 and laid out a plan. The goals of that plan, to develop a genetic map, 146 00:09:13 --> 00:09:16 a map showing the locations of DNA polymorphisms, 147 00:09:16 --> 00:09:19 sites of variation, genetic markers, just like Sturtevant 148 00:09:19 --> 00:09:22 did with fruit flies, but to do it with humans, 149 00:09:22 --> 00:09:25 and with DNA sequence differences, to be used to trace inheritance. 150 00:09:25 --> 00:09:28 That, that genetic map could be used to map human diseases, 151 00:09:28 --> 00:09:31 and if all you accomplished was to get a genetic map of the human being, 152 00:09:31 --> 00:09:34 that would be a good thing. Then you could get a physical map of 153 00:09:34 --> 00:09:38 the human being, all the pieces of DNA overlapping 154 00:09:38 --> 00:09:41 each other, so that you would know if you had a genetic marker linked 155 00:09:41 --> 00:09:44 to cystic fibrosis, you would be able to get the piece 156 00:09:44 --> 00:09:48 of DNA that contains the gene. Then, if we managed to pull that 157 00:09:48 --> 00:09:51 off, we could get a sequence of the human genome, all three billion 158 00:09:51 --> 00:09:54 nucleotides, on the web, so that you could go to just any 159 00:09:54 --> 00:09:58 place on the genome, double-click, and up would pop the 160 00:09:58 --> 00:10:01 sequence. 
Now, you guys of course, 161 00:10:01 --> 00:10:04 don't laugh at that, but about eight years ago, when I would give talks 162 00:10:04 --> 00:10:07 about this, I would speak about, oh you'll be able to go double-click 163 00:10:07 --> 00:10:10 and up will pop the sequence, and of course, everybody thought 164 00:10:10 --> 00:10:13 that was really funny, and that, that was something people 165 00:10:13 --> 00:10:16 laughed at. But of course, you can just do that today, if 166 00:10:16 --> 00:10:19 anybody has a wireless you can just double-click, and up will pop the 167 00:10:19 --> 00:10:22 sequence. And then, of course, a complete inventory of 168 00:10:22 --> 00:10:25 all the genes within that sequence. And, very importantly, and from 169 00:10:25 --> 00:10:28 the very beginning, the notion that all this information 170 00:10:28 --> 00:10:31 should be completely, freely available to anybody, 171 00:10:31 --> 00:10:34 regardless of where they were, whether in academia, or industry, 172 00:10:34 --> 00:10:37 in first world, third world countries, that everybody should 173 00:10:37 --> 00:10:40 have free and unrestricted access to that information. 174 00:10:40 --> 00:10:43 So a plan was laid out, I won't go into the details here, 175 00:10:43 --> 00:10:46 but the plan was laid out that involved work constructing genetic 176 00:10:46 --> 00:10:49 maps, physical maps, sequence maps, in the human, 177 00:10:49 --> 00:10:53 the mouse, and some model organisms, including the bacteria, yeast, fruit 178 00:10:53 --> 00:10:56 flies, worms. And, quite remarkably, it largely went 179 00:10:56 --> 00:11:00 according to plan, over the course of about 15 years. 180 00:11:00 --> 00:11:03 A lot of people in the scientific community came together and took up 181 00:11:03 --> 00:11:06 different tasks. 
I should say, with some pride, 182 00:11:06 --> 00:11:09 that MIT was by far, one of the leading contributors to this effort, 183 00:11:09 --> 00:11:13 having been involved in essentially every stage of this, 184 00:11:13 --> 00:11:16 the genetic mapping of human and mouse, the physical mapping of human 185 00:11:16 --> 00:11:19 and mouse, and the sequencing of human and mouse, 186 00:11:19 --> 00:11:23 and having been the leading contributor to the latter, 187 00:11:23 --> 00:11:26 and it's not an accident because MIT's a marvelous environment in 188 00:11:26 --> 00:11:30 which to undertake this kind of research. 189 00:11:30 --> 00:11:33 It involved changing the way we do biology. Back in the mid-80's, 190 00:11:33 --> 00:11:37 when we sequenced DNA, we did it with radioactivity, 191 00:11:37 --> 00:11:40 remember I taught you how to sequence using radioactive label of 192 00:11:40 --> 00:11:44 a gel, and all that. That's how we did it, 193 00:11:44 --> 00:11:48 stood behind this plastic shield, and you loaded the gels. Of course, 194 00:11:48 --> 00:11:51 now it's done in a highly automated fashion. This is the production 195 00:11:51 --> 00:11:55 floor at the Broad Institute, which is here at MIT, where robots 196 00:11:55 --> 00:11:59 prepare all the DNA samples, so E. coli's grown up, and then you 197 00:11:59 --> 00:12:02 have to crack open the cells, purify the DNA, purify the plasmid, 198 00:12:02 --> 00:12:06 do a sequencing reaction, etc., etc. it's all done robotically there, 199 00:12:06 --> 00:12:10 and this is capable of processing, and does process, in a given day, 200 00:12:10 --> 00:12:13 about 200,000 samples per day. They then go, and this is all 201 00:12:13 --> 00:12:17 equipment designed by people here at MIT, and then commercially built for 202 00:12:17 --> 00:12:21 us. 
They then go to the back room where, actually, 203 00:12:21 --> 00:12:24 these are the previous generation of DNA sequencers, 204 00:12:24 --> 00:12:28 commercial detectors, those capillary detectors that have 205 00:12:28 --> 00:12:32 little lasers on them, there's a whole farm of them that 206 00:12:32 --> 00:12:36 sit there, and are able to get data out. 207 00:12:36 --> 00:12:39 In the course of a single day, we can now generate about 40 billion 208 00:12:39 --> 00:12:43 bases, I'm sorry, in the course of a single year we 209 00:12:43 --> 00:12:46 can generate about 40 billion bases of DNA sequence. 210 00:12:46 --> 00:12:50 The genome project itself, was a collaboration involving 20 211 00:12:50 --> 00:12:53 different groups around the world, groups in the United States, United 212 00:12:53 --> 00:12:57 Kingdom, France, Germany, and Japan, 213 00:12:57 --> 00:13:01 and China. They were of different sizes, they used different 214 00:13:01 --> 00:13:04 approaches, but everybody was committed to one common cause of 215 00:13:04 --> 00:13:08 producing this information, and making it freely available, 216 00:13:08 --> 00:13:11 and everybody worked together. And for the rest of my life, 217 00:13:11 --> 00:13:15 when it comes to Friday, at 11 o'clock, I will always think genome 218 00:13:15 --> 00:13:19 project, because we had a weekly conference call of all the groups in 219 00:13:19 --> 00:13:23 the world working on this Fridays, at eleven, and it was a fascinating 220 00:13:23 --> 00:13:26 experience, there were many, many years of that. So a draft 221 00:13:26 --> 00:13:30 sequence, a rough draft sequence of the human genome, 222 00:13:30 --> 00:13:34 was published in the year, in February of 2001, it was 223 00:13:34 --> 00:13:38 announced with some fanfare in June of 2000, but the real scientific 224 00:13:38 --> 00:13:42 paper came out in February of 2001. 
225 00:13:42 --> 00:13:45 This was not a perfect sequence of the human genome, 226 00:13:45 --> 00:13:48 by any means. We discovered about 90% of the sequence of the human 227 00:13:48 --> 00:13:51 genome. It still had about 150,000 gaps in it, it had errors. But, 228 00:13:51 --> 00:13:54 it still did have 90% of the sequence of the human genome. 229 00:13:54 --> 00:13:57 For the next three years, people worked very hard, 230 00:13:57 --> 00:14:00 and, as of last April, a finished sequence of the human 231 00:14:00 --> 00:14:03 genome was produced, and was published a couple weeks ago, 232 00:14:03 --> 00:14:06 and it contains, our best guess, about 99% 233 00:14:06 --> 00:14:09 of the human genome, and it still has about 343 gaps, 234 00:14:09 --> 00:14:12 they're, we know what they are, we know where they are, but they're 235 00:14:12 --> 00:14:16 not sequenceable with current technology. 236 00:14:16 --> 00:14:19 That's the “finished human genome”. What is it like? Well, this is a 237 00:14:19 --> 00:14:23 picture of the genome, do we have a pointer, yes, 238 00:14:23 --> 00:14:27 I see here we do have a pointer. This is your genome here, this is 239 00:14:27 --> 00:14:31 chromosome number 11, and I'll call attention to some 240 00:14:31 --> 00:14:34 interesting bits. So these colored lines here, 241 00:14:34 --> 00:14:38 represent genes, or gene-predictions, based on both, 242 00:14:38 --> 00:14:42 sequencing of the DNA, and mapping them back to the genome, 243 00:14:42 --> 00:14:46 as well as computer programs that analyze the genome. 244 00:14:46 --> 00:14:49 And, right here, you have a big pileup of lots of 245 00:14:49 --> 00:14:52 genes, very few genes over here. Lots of genes, few genes. Notice 246 00:14:52 --> 00:14:55 the places where there are lots of genes, match up with these 247 00:14:55 --> 00:14:58 light-grey bands, which are the light-grey bands of 248 00:14:58 --> 00:15:01 the microscope, on chromosomes. 
The places with 249 00:15:01 --> 00:15:04 very few genes match up with the dark bands in the chromosome. 250 00:15:04 --> 00:15:08 Do you know why that is, that the gene-rich regions are these 251 00:15:08 --> 00:15:12 light bands, and the gene-poor regions are the chromosome dark 252 00:15:12 --> 00:15:16 bands? Me neither. Nobody has a clue. It's really, 253 00:15:16 --> 00:15:20 it's really just one of these things. We had no reason to expect that 254 00:15:20 --> 00:15:24 we'd see these striking patterns, and other genomes, E. coli, doesn't 255 00:15:24 --> 00:15:28 have this dense, urban cluster, and these big, 256 00:15:28 --> 00:15:32 rural plains that are gene-poor. This is very weird, and it's 257 00:15:32 --> 00:15:35 distinctive to mammals. You'll also notice that the 258 00:15:35 --> 00:15:38 gene-rich regions, here, are rich in G's and C's, 259 00:15:38 --> 00:15:41 they have different distributions of some repeat elements, 260 00:15:41 --> 00:15:43 it's all sorts of weirdness that comes from just looking at the 261 00:15:43 --> 00:15:46 genome. The biggest weirdness was the number of genes, 262 00:15:46 --> 00:15:49 the count of genes is, our best guess, about 22,000 263 00:15:49 --> 00:15:52 genes, if I had to pick a number today, it would be our count of 264 00:15:52 --> 00:15:55 genes, and of course, that's down from the 100,000 265 00:15:55 --> 00:15:58 that was in some textbooks, and it's down from even 30 to 40,000 266 00:15:58 --> 00:16:01 that was in the genome paper of February, 2001. 267 00:16:01 --> 00:16:04 Our best guess is that it's really just about that range. 268 00:16:04 --> 00:16:07 Genes, themselves, are very interesting. 269 00:16:07 --> 00:16:11 When you look at, you know, if we only have 22,000 genes we know 270 00:16:11 --> 00:16:15 of, how do we manage to run a human being with so few genes? 
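The remark above that gene-rich regions are rich in G's and C's can be made concrete with a small calculation. What follows is only a minimal sketch, not the analysis used for the genome paper: the function names, the toy sequence, and the window size are all assumptions made for illustration, and real analyses work on windows of many thousands of bases.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq: str, window: int) -> list[float]:
    """GC fraction of consecutive, non-overlapping windows of the given size.

    A trailing partial window is ignored.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequence: GC-rich, AT-rich, GC-rich stretches.
    toy = "GCGCGGCCATATTAATGCGCCGGC"
    print(windowed_gc(toy, 8))
```

Plotting such window values along a chromosome next to gene density is one simple way to see the kind of correlation described in the lecture.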
271 00:16:15 --> 00:16:19 It is, by the way, probably fewer genes than the mustard weed, 272 00:16:19 --> 00:16:22 or Arabidopsis thaliana. So, what do we do? Well, humans, one 273 00:16:22 --> 00:16:26 thing we may take comfort in, is that we, although we only have 274 00:16:26 --> 00:16:30 about 22,000 genes, there's a lot of alternative 275 00:16:30 --> 00:16:34 splicing, on average the typical gene, on average, 276 00:16:34 --> 00:16:38 has about two alternative splice products. 277 00:16:38 --> 00:16:41 Some have many, some have few, but probably, 278 00:16:41 --> 00:16:45 when you're all done, those 22,000 genes may encode 70-80,000 279 00:16:45 --> 00:16:48 different proteins, and it could be more than that 280 00:16:48 --> 00:16:52 because we don't know all the alternative splice products, 281 00:16:52 --> 00:16:55 and what they do. But, if you ask, humans get credit for being really 282 00:16:55 --> 00:16:59 inventive or creative, for having lots of new genes that 283 00:16:59 --> 00:17:03 make us human, the answer is, no. 284 00:17:03 --> 00:17:06 Not only are humans not different in their gene complement from other 285 00:17:06 --> 00:17:10 mammals, mammals, as a group, really haven't invented 286 00:17:10 --> 00:17:14 that much, when you get down to it. Most of the recognizable sub-domains 287 00:17:14 --> 00:17:18 of proteins, proteins are built up of sub-domains, 288 00:17:18 --> 00:17:22 recognizable sequences that have certain motifs that fold up in 289 00:17:22 --> 00:17:26 certain ways, or carry out certain enzymatic functions. 
290 00:17:26 --> 00:17:30 And it looks like our genomes, our genes, are mixed-and-matched 291 00:17:30 --> 00:17:34 combinations of many domains that were invented a long time ago, 292 00:17:34 --> 00:17:38 in invertebrates and before, and that most of evolutionary innovation 293 00:17:38 --> 00:17:42 in the more complex, multi-cellular animals, 294 00:17:42 --> 00:17:46 has simply been mixing-and-matching these domains in new ways, 295 00:17:46 --> 00:17:50 to get slightly different functions. 296 00:17:50 --> 00:17:54 You don't get a lot of points for creativity, but it does seem to work. 297 00:17:54 --> 00:17:58 By far, the most derivative of all, and what characterizes our genome 298 00:17:58 --> 00:18:02 tremendously is, when a gene works, 299 00:18:02 --> 00:18:07 make extra copies of it, and let it diverge slightly, 300 00:18:07 --> 00:18:11 and take up new functions. Really, your genome is just characterized by 301 00:18:11 --> 00:18:15 large expansions of families, immunoglobulin-like genes, 302 00:18:15 --> 00:18:20 intermediate filament proteins holding together the cytoskeleton. 303 00:18:20 --> 00:18:23 There are 111 different keratin-like genes in your genome. 304 00:18:23 --> 00:18:26 They're all different, they do different things, 305 00:18:26 --> 00:18:29 but they all came from one gene that was copied, copied, 306 00:18:29 --> 00:18:33 copied, at random, randomly duplicated, and then diverged to 307 00:18:33 --> 00:18:36 take up new functions. Growth factors, flies and worms 308 00:18:36 --> 00:18:39 managed to get by just fine, thank you, with two growth factors 309 00:18:39 --> 00:18:43 of the TGF beta-class, whatever that is. You have 42 310 00:18:43 --> 00:18:46 growth factors of this TGF beta-class, all of which help 311 00:18:46 --> 00:18:50 communicate, cells communicate, in different ways. 312 00:18:50 --> 00:18:53 And then, of course, all the olfactory receptors. 
313 00:18:53 --> 00:18:57 In your genome, you have about 1,000 genes for olfactory, for smell 314 00:18:57 --> 00:19:00 receptors. This is what Richard Axel and Linda Buck won a Nobel 315 00:19:00 --> 00:19:04 Prize for this year, was their work on the olfactory 316 00:19:04 --> 00:19:07 receptors. Sad to say though, out of all your olfactory receptor 317 00:19:07 --> 00:19:11 genes, most of them are broken. They're mostly pseudogenes. 318 00:19:11 --> 00:19:15 It's not true in dogs and mice, who keep their olfactory receptor 319 00:19:15 --> 00:19:18 genes in pretty fine-working order, but it's very clear that in primates 320 00:19:18 --> 00:19:22 with color vision, our olfactory receptor genes have 321 00:19:22 --> 00:19:25 been going to seed. They've been piling up mutations, 322 00:19:25 --> 00:19:28 and there's no selective pressure to keep many of them. 323 00:19:28 --> 00:19:32 And, in fact, we've now shown, in a paper that will come out soon, 324 00:19:32 --> 00:19:35 that this process is accelerating dramatically in the last 7 million 325 00:19:35 --> 00:19:38 years since we diverged from chimps. And so, humans have almost 326 00:19:38 --> 00:19:42 completely lost interest in smell, that's not totally true, some of 327 00:19:42 --> 00:19:45 these olfactory receptors surely matter for various processes, 328 00:19:45 --> 00:19:48 but most of them are probably irrelevant right now. 329 00:19:48 --> 00:19:52 And so, anyway, that's the nature of the genes there. 330 00:19:52 --> 00:19:56 Anyway, another interesting fact that's worth mentioning about your 331 00:19:56 --> 00:20:01 genome is half of your genome consists of transposable elements, 332 00:20:01 --> 00:20:05 elements that simply duplicate themselves, and hop around the 333 00:20:05 --> 00:20:10 genome. Elements that are like viruses, they make a copy, 334 00:20:10 --> 00:20:14 sometimes in RNA, the RNA is copied back into DNA and slammed elsewhere 335 00:20:14 --> 00:20:19 in your genome. 
These elements, 336 00:20:19 --> 00:20:24 well the, there are four classes. 337 00:20:24 --> 00:20:27 Alu elements, LINE elements, retrovirus-like elements, all these 338 00:20:27 --> 00:20:30 go through RNA intermediates, and use reverse transcription. 339 00:20:30 --> 00:20:34 And then there's certain DNA transposons, that go through a DNA 340 00:20:34 --> 00:20:37 intermediate. The number of copies of the Alu element, 341 00:20:37 --> 00:20:40 the Alu element that's hopped around your genome, 342 00:20:40 --> 00:20:44 you have about a million, you have a million fossils of this 343 00:20:44 --> 00:20:47 element. You say, why is it there, and the answer is, 344 00:20:47 --> 00:20:50 because it's there. Because anything that knows how to make a 345 00:20:50 --> 00:20:54 copy of itself, and insert itself in the genome, 346 00:20:54 --> 00:20:57 you can't get rid of. You can consider it, 347 00:20:57 --> 00:21:00 if you wish, an infection, but half of your genome consists of 348 00:21:00 --> 00:21:03 an infection, with these kinds of transposable elements. 349 00:21:03 --> 00:21:12 Now that's it, yes? 350 00:21:12 --> 00:21:16 Well, it's very interesting, what's the effect? Well, they do, 351 00:21:16 --> 00:21:20 some of them are transcribed and, it's very interesting. 352 00:21:20 --> 00:21:24 Sometimes it's bad, one of them will hop into a gene and mutate it, 353 00:21:24 --> 00:21:28 and that's bad, that person will have a lethal mutation, 354 00:21:28 --> 00:21:32 but the genome has probably begun to use them, and count on their being 355 00:21:32 --> 00:21:36 there. So, when a bunch, when a transposable element goes in, 356 00:21:36 --> 00:21:40 and creates a spacing, if you, for example, if an engineering 357 00:21:40 --> 00:21:44 committee came in and cleaned up the genome by getting rid of all the 358 00:21:44 --> 00:21:48 transposable elements, it would surely not work. 
359 00:21:48 --> 00:21:51 Because we have evolutionarily come to count on the spacing there. 360 00:21:51 --> 00:21:55 It's sort of like, if in some very, some very messy attic, you put a cup 361 00:21:55 --> 00:21:58 of coffee down on top of a stack of papers, those papers may be utterly 362 00:21:58 --> 00:22:02 irrelevant, but now they're holding up that cup of coffee that you put 363 00:22:02 --> 00:22:06 down on it. And if you were to just, poof, magically get rid of them, 364 00:22:06 --> 00:22:10 the cup of coffee would come crashing to the ground. 365 00:22:10 --> 00:22:13 So, you know it, they're just there, 366 00:22:13 --> 00:22:17 taking up space. Now sometimes, even more than that, a few of them 367 00:22:17 --> 00:22:20 have actually been co-opted into being human genes. 368 00:22:20 --> 00:22:24 We know that a few of these transposable elements have mutated 369 00:22:24 --> 00:22:27 into being our genes that do something for us. 370 00:22:27 --> 00:22:31 And others of them may do things in affecting the general neighborhood 371 00:22:31 --> 00:22:35 with regard to transcription, and so, instead of it being a 372 00:22:35 --> 00:22:38 parasite, think of them as a symbiont, that's a genomic symbiont, 373 00:22:38 --> 00:22:42 which takes some advantage of us, and we may, you know, have worked 374 00:22:42 --> 00:22:46 out a compromise to take some advantage of it. 375 00:22:46 --> 00:22:49 Every time a copy is made of these, and it hops in the genome, some 376 00:22:49 --> 00:22:53 mutations may happen in the master element, but when it lands in the 377 00:22:53 --> 00:22:56 new place, we have a record of that hop. And if you reconstruct the 378 00:22:56 --> 00:23:00 sequence of the million Alu elements, you can see which ones are 379 00:23:00 --> 00:23:04 very close relatives of each other, and had to have hopped recently, and 380 00:23:04 --> 00:23:08 which ones are somewhat more distant relatives. 
381 00:23:08 --> 00:23:11 And you can build an evolutionary tree connecting all of the repeat 382 00:23:11 --> 00:23:14 elements that have hopped around your genome, and thereby attach a 383 00:23:14 --> 00:23:17 date to each of them, as to when they hopped. 384 00:23:17 --> 00:23:20 So it really is a fossil record, and you can figure out how many of 385 00:23:20 --> 00:23:23 them have been hopping at different times over history. 386 00:23:23 --> 00:23:27 And we can even make a plot of that, this is long ago, 387 00:23:27 --> 00:23:30 sometime here, some 30 million years ago, there was a huge explosion 388 00:23:30 --> 00:23:33 in transposition, of transposons, in our genome. 389 00:23:33 --> 00:23:36 We don't know why that happened, but it's very interesting, it does 390 00:23:36 --> 00:23:40 correspond to very interesting periods of primate evolution. 391 00:23:40 --> 00:23:43 And then, interestingly, there's been a huge crash, 392 00:23:43 --> 00:23:46 and transposition has dropped dramatically. We have no clue why 393 00:23:46 --> 00:23:49 this is, but we have a whole fossil record here of the rate of 394 00:23:49 --> 00:23:52 transposition of different kinds of repeat elements around our genome, 395 00:23:52 --> 00:23:55 and people are now starting to try to figure out what in the world this 396 00:23:55 --> 00:23:58 means. So all this is sort of there, inherent in the sequence, 397 00:23:58 --> 00:24:01 and if you want the sequence, as I say, you can go to the web and 398 00:24:01 --> 00:24:04 pull all this stuff now. So how do we understand the 399 00:24:04 --> 00:24:07 sequence? Well, I've told you a little bit about it, 400 00:24:07 --> 00:24:10 from the simple things that we've done, but there's a lot more that 401 00:24:10 --> 00:24:13 needs to be learned about the sequence, so what I really want to 402 00:24:13 --> 00:24:16 turn to, is how we're extracting information out of this sequence. 
403 00:24:16 --> 00:24:19 So, DNA sequence is long and boring, it's only marginally more 404 00:24:19 --> 00:24:23 interesting than reading your hard disk, because it has four letters, 405 00:24:23 --> 00:24:27 instead of ones and zeros, but it's, you know, well, it's pretty really 406 00:24:27 --> 00:24:30 boring if you take a look at it. How do you attach meaning to all 407 00:24:30 --> 00:24:34 this stuff? One of the most powerful ways is by comparison with 408 00:24:34 --> 00:24:38 other genomes. And so, comparing the human genome 409 00:24:38 --> 00:24:42 to the mouse genome is very informative in many ways. 410 00:24:42 --> 00:24:45 So, as soon as the human genome was far along, a portion of the 411 00:24:45 --> 00:24:49 international consortium, set to work getting a sequence of 412 00:24:49 --> 00:24:52 the mouse genome. And that was published in December 413 00:24:52 --> 00:24:56 of 2002. We have a nice map of the mouse genome, with all these things, 414 00:24:56 --> 00:24:59 it, too, shows these gene-rich regions, gene-poor regions, 415 00:24:59 --> 00:25:03 all sorts of funny things. And if we look closely at a portion of the 416 00:25:03 --> 00:25:06 human genome over here, I've picked about a million bases of 417 00:25:06 --> 00:25:10 the human genome, and we take any little spot in that 418 00:25:10 --> 00:25:14 million bases of the human genome, let's say over here. 419 00:25:14 --> 00:25:17 And we take half the DNA sequence corresponding to this spot, 420 00:25:17 --> 00:25:20 and we run it in the computer against the mouse genome, 421 00:25:20 --> 00:25:23 and ask where in the mouse genome do we get the best match for this, 422 00:25:23 --> 00:25:26 the best match to this is here. Now let's do it for this piece, 423 00:25:26 --> 00:25:29 here. The best match anywhere in the mouse genome lands in the same 424 00:25:29 --> 00:25:32 million bases here as the mouse genome. 
In fact, 425 00:25:32 --> 00:25:36 for every single sequence that we pull out from this million bases in 426 00:25:36 --> 00:25:39 the human genome, the best match is in this million 427 00:25:39 --> 00:25:42 bases of the mouse genome. That's very interesting. Why is 428 00:25:42 --> 00:25:45 that? Sorry? No, people do know. 429 00:25:45 --> 00:25:49 It, it was a good try, though. [LAUGHTER]. This million bases in 430 00:25:49 --> 00:25:52 the mouse genome, and this million bases in the human 431 00:25:52 --> 00:25:56 genome, represent the evolutionary descendents of a common million 432 00:25:56 --> 00:25:59 bases that occurred in our common ancestor 75-million years ago. 433 00:25:59 --> 00:26:03 This is a clear evidence of the evolution here, 434 00:26:03 --> 00:26:06 because we can see that this is a segment of DNA from our common 435 00:26:06 --> 00:26:10 ancestor that really hasn't undergone much rearrangement, 436 00:26:10 --> 00:26:14 and we can just line up the sequences and see. 437 00:26:14 --> 00:26:17 In fact, we can build a whole map across the mouse genome like this. 438 00:26:17 --> 00:26:20 For any bit of the mouse genome, I don't know, here's a bit on mouse 439 00:26:20 --> 00:26:24 chromosome 17, this whole stretch corresponds to a 440 00:26:24 --> 00:26:27 portion of human chromosome number eight. This stretch here, 441 00:26:27 --> 00:26:30 I don't know, this green color here on chromosome number six, 442 00:26:30 --> 00:26:34 corresponds to chromosome four in the human. And so, 443 00:26:34 --> 00:26:37 we can build a look-up table that says, for any portion of the human 444 00:26:37 --> 00:26:40 genome, what's the corresponding portion of the mouse genome that 445 00:26:40 --> 00:26:44 came from the same ancestor, has basically the same complement of 446 00:26:44 --> 00:26:47 genes in it. 
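The look-up table described above can be sketched as a small interval index. This is only an illustration: the chromosome pairings echo the two examples mentioned in the lecture, but every coordinate below is invented, and the names `build_index` and `mouse_region` are hypothetical rather than taken from any genome browser.

```python
from bisect import bisect_right

# (human chromosome, start, end, mouse chromosome); coordinates are made up
SEGMENTS = [
    ("chr8", 0, 3_000_000, "chr17"),
    ("chr4", 1_000_000, 5_000_000, "chr6"),
]

def build_index(segments):
    """Group conserved segments by human chromosome, sorted by start position."""
    index = {}
    for chrom, start, end, mouse in sorted(segments):
        index.setdefault(chrom, []).append((start, end, mouse))
    return index

def mouse_region(index, chrom, pos):
    """Return the mouse chromosome matching a human position, or None if unmapped."""
    blocks = index.get(chrom, [])
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i >= 0 and pos < blocks[i][1]:
        return blocks[i][2]
    return None

index = build_index(SEGMENTS)
print(mouse_region(index, "chr4", 2_500_000))
```

With only a few hundred segments genome-wide, a sorted list per chromosome plus a binary search is probably all the structure such a table needs.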
And there's only about 330 such 447 00:26:47 --> 00:26:50 regions that we need to cut-and-paste the human genome order 448 00:26:50 --> 00:26:53 to the mouse genome order, roughly speaking. There's a lot of 449 00:26:53 --> 00:26:56 little local rearrangements, but at this gross level. So now, 450 00:26:56 --> 00:26:59 if we go back more closely and we look at this, and we say, 451 00:26:59 --> 00:27:03 OK, so now we look at this region, we now know these two regions 452 00:27:03 --> 00:27:06 descend from a common ancestor, if we do a careful evolutionary 453 00:27:06 --> 00:27:09 analysis, lining up all the sequences, and see how 454 00:27:09 --> 00:27:12 well-preserved the sequences are, some are much better preserved than 455 00:27:12 --> 00:27:16 others. Evolution has been much more 456 00:27:16 --> 00:27:20 lovingly conserving some sequences than others, and so, 457 00:27:20 --> 00:27:24 let's now zoom in on a gene, this is a gene that goes by the name, 458 00:27:24 --> 00:27:28 PPAR-gamma, I'm fond of this gene but, it doesn't matter. If we look, I've 459 00:27:28 --> 00:27:32 indicated all the regions here, in which there's a heightened degree 460 00:27:32 --> 00:27:36 of conservation. The sequence is well-conserved here, 461 00:27:36 --> 00:27:40 here, here, here, here, here, here, and here, 462 00:27:40 --> 00:27:44 here, here, here, here, here. These correspond to the exons of 463 00:27:44 --> 00:27:48 the PPAR-gamma gene, they encode the protein of the gene, 464 00:27:48 --> 00:27:52 then the splicing goes like this, OK? These things here do not correspond 465 00:27:52 --> 00:27:56 to the exons. People have no idea what they are, 466 00:27:56 --> 00:28:00 in fact, this is not supposed to be here. The official textbook picture 467 00:28:00 --> 00:28:04 says, the vast majority of what matters for a gene, 468 00:28:04 --> 00:28:09 what evolution should preserve, is the exons plus the promoter. 469 00:28:09 --> 00:28:13 Here's the promoter. 
But in fact, what we found is that 470 00:28:13 --> 00:28:17 an awful lot more is being preserved. In fact, across the genome, 471 00:28:17 --> 00:28:21 our best estimate is there are about 500,000 conserved elements across 472 00:28:21 --> 00:28:26 the genome, and only 1/3 of them are protein-coding exons. 473 00:28:26 --> 00:28:30 That means 2/3 of the stuff evolution has been interested in, 474 00:28:30 --> 00:28:34 is not protein-coding exons, and the truth is, we do not know what it is, 475 00:28:34 --> 00:28:38 this was a very radical finding, when this mouse paper came out, 476 00:28:38 --> 00:28:43 about a year and a half, about two years ago now. 477 00:28:43 --> 00:28:47 What it must be, I think, but we're guessing, 478 00:28:47 --> 00:28:51 are regulatory signals, the structural elements in chromosomes, 479 00:28:51 --> 00:28:56 RNA genes, but there's an awful lot more of it than we had imagined. 480 00:28:56 --> 00:28:59 And now we're in this fascinating situation, 481 00:28:59 --> 00:29:03 where computational analysis has told us what's on evolution's mind, 482 00:29:03 --> 00:29:07 and now we have to go to the lab and figure out what in the world it does. 483 00:29:07 --> 00:29:10 But there's no doubt that it must do something, because evolution has 484 00:29:10 --> 00:29:14 preserved it quite well. Now, I oversimplified greatly in 485 00:29:14 --> 00:29:18 this discussion, let me first say, 486 00:29:18 --> 00:29:21 and I'll come back to that. We do know, if we take some of those 487 00:29:21 --> 00:29:24 elements, here's one, there's a 481 base-pair element 488 00:29:24 --> 00:29:27 that's 84% identical between human and mouse. You could write yourself 489 00:29:27 --> 00:29:31 a little statistical model to say that's way unusual to have something 490 00:29:31 --> 00:29:34 that's so well preserved. 
When Eddy Rubin and his colleagues 491 00:29:34 --> 00:29:37 from Berkeley made a knockout mouse that deleted that segment, 492 00:29:37 --> 00:29:40 this knockout mouse lost regulation of three different genes in the 493 00:29:40 --> 00:29:43 neighborhood, saying that this must be a regulatory sequence that 494 00:29:43 --> 00:29:47 affects multiple genes in the neighborhood. 495 00:29:47 --> 00:29:50 That's one, with about 300,000 such elements to go, in order to 496 00:29:50 --> 00:29:54 attach meaning to them. So doing this entirely by knocking 497 00:29:54 --> 00:29:58 out mice will be a slow process, one's going to need other ways to be 498 00:29:58 --> 00:30:02 able to attach meaning, but there's no doubt. Now, 499 00:30:02 --> 00:30:06 there's some other interesting papers where people have knocked 500 00:30:06 --> 00:30:10 some of these things out, and they've seen no effect on the 501 00:30:10 --> 00:30:14 mouse. They get a totally viable mouse. Can you conclude from that, 502 00:30:14 --> 00:30:18 that they have no function? Why not? The knockout mouse is viable. 503 00:30:18 --> 00:30:22 Could be redundant, it could even not be redundant, 504 00:30:22 --> 00:30:26 but yes, it could be redundant, but you couldn't knock out both of 505 00:30:26 --> 00:30:29 two things. It turns out, suppose knocking it 506 00:30:29 --> 00:30:33 out affected the mouse's viability by one part in ten to the third, 507 00:30:33 --> 00:30:37 so it was only 99.9% as fertile, would you be able to see that in the 508 00:30:37 --> 00:30:41 laboratory? No. Would that matter to evolution? 509 00:30:41 --> 00:30:44 It would be lethal, in an evolutionary sense. 510 00:30:44 --> 00:30:48 Such a mutation could never propagate through a population. 511 00:30:48 --> 00:30:52 One part in ten to the third is massive selection against it, from 512 00:30:52 --> 00:30:56 an evolutionary point of view, but almost undetectable in a 513 00:30:56 --> 00:31:00 laboratory batch. 
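The claim that a 0.1% fitness cost is invisible in the lab yet decisive over evolutionary time can be given rough numbers. The sketch below uses Kimura's diffusion approximation for the fixation probability of a new mutation, which is a standard population-genetics result and not something derived in the lecture. The population size of 10,000 is an arbitrary assumption chosen for illustration, and the effective population size is assumed equal to the census size.

```python
import math

def fixation_probability(s, population_size):
    """Kimura's approximation for a single new mutant in a diploid population.

    s is the selective advantage (negative for a harmful mutation).
    Very large populations can overflow math.expm1; this sketch ignores that.
    """
    if s == 0:
        return 1 / (2 * population_size)  # neutral case: pure drift
    return math.expm1(-2 * s) / math.expm1(-4 * population_size * s)

N = 10_000  # assumed population size, for illustration only
print("neutral:      ", fixation_probability(0.0, N))
print("0.1% harmful: ", fixation_probability(-0.001, N))
```

Under these assumptions the mildly harmful mutation is many orders of magnitude less likely to spread than a neutral one, which is the sense in which evolution has the more sensitive assay.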
Evolution has a far more sensitive 514 00:31:00 --> 00:31:04 assay than we do. Now, I won't go into detail, 515 00:31:04 --> 00:31:09 but for the mathematically inclined here, showing that there really were 516 00:31:09 --> 00:31:13 about 5% of the human genome under, under evolutionary selection, it was 517 00:31:13 --> 00:31:18 a complicated affair, because with only two genomes, 518 00:31:18 --> 00:31:23 what we really had to do, and if this doesn't make sense, ignore it. 519 00:31:23 --> 00:31:26 We looked at the background distribution of conservation of the 520 00:31:26 --> 00:31:29 genome in unimportant elements, in those repeat elements that we 521 00:31:29 --> 00:31:32 knew to be functionally broken. We looked at the overall 522 00:31:32 --> 00:31:35 conservation of the genome, and found that the overall genome 523 00:31:35 --> 00:31:38 has this rightward tail, by subtracting the distributions we 524 00:31:38 --> 00:31:41 were able to see how much excess conservation there was. 525 00:31:41 --> 00:31:44 That's because we only had two genomes, we had to draw inferences. 526 00:31:44 --> 00:31:47 If we had more genomes, like the mouse and the rat, 527 00:31:47 --> 00:31:50 and the dog and the-this-and-the-that, 528 00:31:50 --> 00:31:54 we would be able to extract signal from noise. 529 00:31:54 --> 00:31:57 We would be able to see right away, which bits were well-conserved, and 530 00:31:57 --> 00:32:01 we wouldn't have to do this as a sensitive statistical analysis. 531 00:32:01 --> 00:32:05 So, in fact, we need more mammalian genomes, so, so right now there's 532 00:32:05 --> 00:32:09 been a sequence of the rat genome in the past year or so, 533 00:32:09 --> 00:32:12 there's a sequence of the dog genome, we're writing up that paper now, 534 00:32:12 --> 00:32:16 but it's on the web already. 
There's a sequence of the chimpanzee 535 00:32:16 --> 00:32:20 genome, we're writing up a paper on that, in collaboration with our 536 00:32:20 --> 00:32:24 friends in the genome-sequencing community. 537 00:32:24 --> 00:32:27 We're currently sequencing a variety of other organisms, 538 00:32:27 --> 00:32:30 as well. And if you had enough organisms, you ought to be able to 539 00:32:30 --> 00:32:34 just line it up and say, what has evolution preserved, 540 00:32:34 --> 00:32:37 and figure out exactly which nucleotides matter, 541 00:32:37 --> 00:32:40 and which nucleotides don't, are allowed to drift freely, at the 542 00:32:40 --> 00:32:44 background rate. How far could you go with this? 543 00:32:44 --> 00:32:47 Well, we decided to try an interesting experiment. 544 00:32:47 --> 00:32:50 We said, since mammals are very big, and we're going to need a lot of 545 00:32:50 --> 00:32:54 genome sequences, how about we try a small organism, 546 00:32:54 --> 00:32:58 like yeast? What if we were to try to do this, 547 00:32:58 --> 00:33:02 this kind of evolutionary, genomic analysis on something like the yeast 548 00:33:02 --> 00:33:06 genome? And so, this is work that I'll describe, 549 00:33:06 --> 00:33:10 that was between a bunch of people here at MIT who do genome-sequencing, 550 00:33:10 --> 00:33:14 and a student in computer science, Manolis Kellis, who was a PhD student in 551 00:33:14 --> 00:33:18 computer science, he now just joined the faculty here 552 00:33:18 --> 00:33:21 at MIT in computer science. But it was a really great example of 553 00:33:21 --> 00:33:25 how biology and computer science could come together. 554 00:33:25 --> 00:33:28 So, the genome-sequencing folks sequenced three species related 555 00:33:28 --> 00:33:32 to our friend, the baker's yeast, Saccharomyces cerevisiae, 556 00:33:32 --> 00:33:35 workhorse of geneticists. 
These three different species are 557 00:33:35 --> 00:33:39 separated by different evolutionary distances, from Saccharomyces 558 00:33:39 --> 00:33:42 cerevisiae. When you line up their genomes, just like with human and 559 00:33:42 --> 00:33:46 mouse, you find the genes occur largely in the same order, 560 00:33:46 --> 00:33:49 and it's not hard to pick out, oh there's this gene there, there, 561 00:33:49 --> 00:33:53 it's all lined up, you've got these evolutionary segments, 562 00:33:53 --> 00:33:56 and very few rearrangements have occurred across these species, 563 00:33:56 --> 00:34:00 despite the fact that they're about 20 million years apart in history. 564 00:34:00 --> 00:34:05 But here's an interesting thing. When the yeast genome, 565 00:34:05 --> 00:34:11 Saccharomyces cerevisiae, was first published in 1995, 566 00:34:11 --> 00:34:16 the paper describing it reported 6,200 genes. Now, how did they know 567 00:34:16 --> 00:34:22 there were 6,200 genes? They ran a computer program looking 568 00:34:22 --> 00:34:28 for open reading frames. Any open reading frame, consecutive 569 00:34:28 --> 00:34:34 codons without a stop, that was sufficiently long was called a gene. 570 00:34:34 --> 00:34:37 But statistically, you could, by chance, 571 00:34:37 --> 00:34:41 just have a long stretch of codons without a stop codon. 572 00:34:41 --> 00:34:44 And so, if they saw 100 codons in a row, without a stop, 573 00:34:44 --> 00:34:48 they called it a gene, but it might just be chance. 574 00:34:48 --> 00:34:52 And they knew that, of course, they wrote that in the paper, but 575 00:34:52 --> 00:34:55 for many years, people then had 6,200 576 00:34:55 --> 00:34:59 open reading frames, which were the yeast's genes. 577 00:34:59 --> 00:35:02 Could evolution now tell us which ones of them were real and which 578 00:35:02 --> 00:35:06 weren't? Well, it turns out that evolution was 579 00:35:06 --> 00:35:10 tremendously powerful in doing that. 
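The worry raised above, that a long run of codons with no stop can turn up by chance, can be put in rough numbers. This is a minimal sketch that assumes random sequence with equal base frequencies, which is an assumption of the example and not of the original annotation. The function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_open_run(seq, frame=0):
    """Longest stretch of consecutive non-stop codons in one reading frame."""
    longest = current = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest

def chance_of_open_run(n_codons):
    """Probability that n random codons contain no stop (61 of 64 codons are not stops)."""
    return (61 / 64) ** n_codons

print(longest_open_run("ATGAAATAGCCC"))   # codons ATG AAA TAG CCC
print(chance_of_open_run(100))            # a bit under 1%
```

A chance of just under one percent per 100-codon window sounds small, but a genome offers a very large number of windows across six reading frames, so some purely accidental open reading frames are expected.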
580 00:35:10 --> 00:35:14 If you take something that's a well-known gene that has been 581 00:35:14 --> 00:35:19 extensively studied by yeast geneticists, you line it up across 582 00:35:19 --> 00:35:23 all four species, you almost never see deletions. 583 00:35:23 --> 00:35:28 And when you do see deletions, here in grey, they're always a 584 00:35:28 --> 00:35:33 multiple of three. Why are they a multiple of three? 585 00:35:33 --> 00:35:37 They preserve the reading frame. By contrast, if I take some clear, 586 00:35:37 --> 00:35:42 intergenic DNA, that's not protein-coding, 587 00:35:42 --> 00:35:47 and I compare it across these four species, I see lots and lots of 588 00:35:47 --> 00:35:52 frame-shifting deletions that occur. 589 00:35:52 --> 00:35:54 Evolution tolerates frame-shifting deletions there, and if I just write down 590 00:35:54 --> 00:35:57 the rates, frame-shifting deletions are 75x more common in intergenic 591 00:35:57 --> 00:36:00 DNA than in genic DNA. This provides a very powerful test. 592 00:36:00 --> 00:36:03 Run this test across the genome, looking for the density of frame- 593 00:36:03 --> 00:36:06 shifting deletions, any place that doesn't tolerate 594 00:36:06 --> 00:36:09 frame-shifting deletions is probably a real gene, anything that does 595 00:36:09 --> 00:36:12 tolerate them is probably not. When you sorted through all this, 596 00:36:12 --> 00:36:15 it turned out that 528 of the official yeast genes were clearly 597 00:36:15 --> 00:36:18 not real genes. They were just chock-a-block full 598 00:36:18 --> 00:36:22 of these frame-shifting deletions. And a bunch of others could be 599 00:36:22 --> 00:36:26 confirmed. 
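The reading-frame test described above can be sketched in a few lines. This is a simplified illustration and not the published method: it assumes each species' sequence is already aligned, with `-` marking deleted bases, and the zero-tolerance threshold in `looks_protein_coding` is a hypothetical choice.

```python
import re

def gap_lengths(aligned_seq):
    """Length of each run of '-' characters in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]

def frameshift_gap_count(aligned_seqs):
    """Count gaps, over all species, whose length is not a multiple of three."""
    return sum(1 for seq in aligned_seqs
               for n in gap_lengths(seq) if n % 3 != 0)

def looks_protein_coding(aligned_seqs, max_frameshift_gaps=0):
    """Crude call: coding regions should tolerate (almost) no frame-shifting gaps."""
    return frameshift_gap_count(aligned_seqs) <= max_frameshift_gaps

gene_like = ["ATGAAACCCGGG", "ATG---CCCGGG", "ATGAAA------"]
intergenic_like = ["ATGAAACCCGGG", "ATG-AACCCGGG", "AT--AACCCGGG"]
print(looks_protein_coding(gene_like), looks_protein_coding(intergenic_like))
```

A real analysis would work with a density of such gaps per aligned base and would have to allow for sequencing and alignment errors, which is why the threshold is left as a parameter.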
So the yeast gene count, and I won't tell you all the 600 00:36:26 --> 00:36:30 experimental and other evidence that shows this is right, 601 00:36:30 --> 00:36:34 but the yeast genome has now been revised downward to 5, 602 00:36:34 --> 00:36:38 00 genes, and we have great confidence that almost all of those 603 00:36:38 --> 00:36:42 are real genes, there are 20 whose origins 604 00:36:42 --> 00:36:46 we're not sure of, and new genes could be found in this 605 00:36:46 --> 00:36:50 way. Here's a really audacious thing. 606 00:36:50 --> 00:36:51 This graduate student in computer science said, I think, 607 00:36:51 --> 00:36:53 based on these other species, there was a mistake made in the 608 00:36:53 --> 00:36:55 sequencing of the first yeast, and that the reason these things are 609 00:36:55 --> 00:36:57 called two separate genes, is that somebody made a sequencing 610 00:36:57 --> 00:36:58 error that put a stop codon here, but I think these are really part of 611 00:36:58 --> 00:37:00 one gene. And so, somebody went back and re-sequenced 612 00:37:00 --> 00:37:02 some of these, and sure enough, 613 00:37:02 --> 00:37:04 he had correctly predicted that there had been a mistake made at 614 00:37:04 --> 00:37:06 that letter, and that these were in fact, a single gene. 615 00:37:06 --> 00:37:11 The computational analysis was incredibly powerful in this regard, 616 00:37:11 --> 00:37:17 it could go further than this, you could ask, could I also figure out 617 00:37:17 --> 00:37:23 the way genes are regulated in this fashion, could I work out the 618 00:37:23 --> 00:37:29 intergenic signals in the promoter regions? Remember that the lac 619 00:37:29 --> 00:37:35 repressor binds to a certain operator site, well, all of these regulatory 620 00:37:35 --> 00:37:41 proteins bind to different sequences, could we figure out what the 621 00:37:41 --> 00:37:46 sequences were, computationally? 
Well, if we look closely at an 622 00:37:46 --> 00:37:50 intergenic region, here's one where there's two genes being transcribed 623 00:37:50 --> 00:37:54 in opposite directions, GAL1 and GAL10, both involved in 624 00:37:54 --> 00:37:58 galactose metabolism, and there's a particular protein, 625 00:37:58 --> 00:38:03 a transcription factor here, called Gal4, in this region, 626 00:38:03 --> 00:38:07 and it has a particular sequence that it likes, 627 00:38:07 --> 00:38:11 CCG, 11 bases, GGC. So, that Gal4 site, we see, 628 00:38:11 --> 00:38:16 is very well preserved across all of the species. 629 00:38:16 --> 00:38:20 So, a known regulatory sequence is well-preserved, 630 00:38:20 --> 00:38:24 now let's look at that closely. This Gal4 binding site is a measly, 631 00:38:24 --> 00:38:29 crummy, six nucleotides of information. At random, 632 00:38:29 --> 00:38:33 it's going to occur in many places in the yeast genome, 633 00:38:33 --> 00:38:38 but not be a real, important Gal4 site, right? Some of them matter, some of 634 00:38:38 --> 00:38:42 them don't. How do we figure out which of these occurrences are real 635 00:38:42 --> 00:38:46 Gal4 sites? Well, if we look across all four species, what we find is that 636 00:38:46 --> 00:38:51 those occurrences that occur in promoter regions, 637 00:38:51 --> 00:38:55 are much more likely to be conserved by evolution than those 638 00:38:55 --> 00:39:00 that don't. So there's a special property here, 639 00:39:00 --> 00:39:04 conservation of the motif in the promoter regions. 640 00:39:04 --> 00:39:08 In fact, this particular sequence is four times more likely to be 641 00:39:08 --> 00:39:12 preserved when it occurs in a promoter region, 642 00:39:12 --> 00:39:16 than when it occurs in a coding region. And for a typical control 643 00:39:16 --> 00:39:20 region, the opposite is true. 
Since genes, since coding sequences 644 00:39:20 --> 00:39:24 are better preserved in general, for a randomly chosen sequence, I 645 00:39:24 --> 00:39:28 don't know, ATGGCAT, it's more likely to be preserved in 646 00:39:28 --> 00:39:32 coding regions than non-coding regions. 647 00:39:32 --> 00:39:35 So this Gal4 motif has a very funky property that, 648 00:39:35 --> 00:39:38 on average, it's 12x more likely than background, 649 00:39:38 --> 00:39:41 to be preserved when it occurs in a promoter. Now, 650 00:39:41 --> 00:39:44 that's a test you can apply to another motif, and another motif. 651 00:39:44 --> 00:39:47 In fact, you could, by computer, test all possible motifs, and ask 652 00:39:47 --> 00:39:50 which ones have that property? Make a scatter plot, most motifs 653 00:39:50 --> 00:39:53 are better conserved when they occur in coding regions, 654 00:39:53 --> 00:39:56 than when they occur in promoter regions, some however, 655 00:39:56 --> 00:40:00 are better preserved in promoter regions than in coding regions. 656 00:40:00 --> 00:40:04 Our friend, Gal4, is up there, but there are a lot 657 00:40:04 --> 00:40:09 more things like it, that are better preserved by 658 00:40:09 --> 00:40:14 evolution in promoters. You can make a list of them. You 659 00:40:14 --> 00:40:19 can get about 72 well-conserved, regulatory motifs and it turns out 660 00:40:19 --> 00:40:24 that 20 years of yeast work produced knowledge about things like the 661 00:40:24 --> 00:40:29 Gal4 site, and other sites. Almost all the known regulatory 662 00:40:29 --> 00:40:34 sites that had been discovered over the course of 20 years of 663 00:40:34 --> 00:40:39 experimental work appear on this list that falls out of the computer 664 00:40:39 --> 00:40:44 analysis of evolutionary comparison of genomes. 665 00:40:44 --> 00:40:48 You can actually go a step further, I'll hesitate to tell you, but I'll 666 00:40:48 --> 00:40:53 try anyway. 
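The motif conservation test described above, applied first to the Gal4 site and then to any candidate motif, can be sketched as follows. This is a toy version under strong assumptions: the alignments are gap-free, the first row is the reference species, and "conserved" means the motif matches at the same columns in every other species. The pattern is written the way the lecture reads it out (CCG, eleven bases, GGC) and should be checked against a curated motif database before any real use. The function name is hypothetical.

```python
import re

GAL4_AS_STATED = "CCG[ACGT]{11}GGC"  # motif as read out in the lecture; unverified

def conserved_fraction(motif_regex, aligned_seqs):
    """Fraction of motif hits in the reference that are matched in all other species."""
    pattern = re.compile(motif_regex)
    reference, others = aligned_seqs[0], aligned_seqs[1:]
    hits = conserved = 0
    # A lookahead lets overlapping occurrences be found as well.
    for m in re.finditer(f"(?=({motif_regex}))", reference):
        start, end = m.start(1), m.end(1)
        hits += 1
        if all(pattern.fullmatch(seq[start:end]) for seq in others):
            conserved += 1
    return conserved / hits if hits else 0.0

site = "CCG" + "A" * 11 + "GGC"
promoter = ["TT" + site + "TT", "TA" + site + "AT"]                 # site kept
coding = ["TT" + site + "TT", "TT" + site.replace("CCG", "CAG") + "TT"]  # site lost
print(conserved_fraction(GAL4_AS_STATED, promoter),
      conserved_fraction(GAL4_AS_STATED, coding))
```

Running the same function over all promoter alignments and all coding alignments, and taking the ratio of the two fractions, gives the kind of per-motif score that could be placed on the scatter plot mentioned above.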
If you wanted to find out, without knowing in advance, 667 00:40:53 --> 00:40:57 what these motifs were doing, what their biological function was, 668 00:40:57 --> 00:41:02 you can do that informationally, too. It turns out that if I take my 669 00:41:02 --> 00:41:06 motif, Gal4, and I ask, which genes does it occur in front 670 00:41:06 --> 00:41:11 of? Well, across Saccharomyces cerevisiae, you find this crummy 671 00:41:11 --> 00:41:15 little motif in many, many places because, as I said, 672 00:41:15 --> 00:41:20 most of it's just noise. But if I ask, which genes have this 673 00:41:20 --> 00:41:24 motif in all four species, these genes, there's a huge overlap 674 00:41:24 --> 00:41:28 with a class of genes involved in carbohydrate metabolism. 675 00:41:28 --> 00:41:33 So, if I didn't know in advance that the Gal4 motif was involved in 676 00:41:33 --> 00:41:37 regulating genes in carbohydrate metabolism, I could tell, 677 00:41:37 --> 00:41:41 just from the fact that the genes that have conserved it, 678 00:41:41 --> 00:41:46 are genes involved in carbohydrate metabolism. 679 00:41:46 --> 00:41:50 You can do that using all sorts of tricks, expression of genes, 680 00:41:50 --> 00:41:54 protein mass spec, blah, blah, blah, and the short answer is, for 681 00:41:54 --> 00:41:58 almost all of those motifs that you can find in the computer, 682 00:41:58 --> 00:42:02 by consulting public databases of sets of genes that are co-expressed, 683 00:42:02 --> 00:42:06 or have similar properties and all that, the computer can also offer 684 00:42:06 --> 00:42:10 you a pretty good hypothesis about what that motif is associated with. 685 00:42:10 --> 00:42:14 You can even go a step further than that. 
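One common way to decide whether an overlap like the one described above is "huge" is a hypergeometric test. The lecture does not say which statistic was used, so this is offered only as a plausible sketch, and all the gene counts in the example call are invented.

```python
from math import comb

def hypergeom_tail(total_genes, category_size, motif_genes, overlap):
    """Chance of seeing `overlap` or more category genes among `motif_genes`
    genes drawn at random, without replacement, from `total_genes`."""
    upper = min(category_size, motif_genes)
    favourable = sum(
        comb(category_size, k) * comb(total_genes - category_size, motif_genes - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, motif_genes)

# Invented numbers: 6,000 genes, 100 in the category, 30 with a conserved motif,
# 12 of which fall in the category.
print(hypergeom_tail(6000, 100, 30, 12))
```

A very small tail probability suggests the motif and the gene category are associated, which is the hypothesis the computer hands back to the experimentalist.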
You can begin to look at 686 00:42:14 --> 00:42:18 pairs of motifs, you can say, if I have a certain 687 00:42:18 --> 00:42:23 regulatory sequence, number one, and a second regulatory 688 00:42:23 --> 00:42:27 sequence, number two, do they tend to be preserved in 689 00:42:27 --> 00:42:31 front of the same genes as each other? Is their conservation 690 00:42:31 --> 00:42:36 correlated? And you can build a map of how these two 691 00:42:36 --> 00:42:40 guys tend to go together: when this guy's conserved, this guy tends to be 692 00:42:40 --> 00:42:44 conserved. And you can say, oh those proteins must be talking to 693 00:42:44 --> 00:42:48 each other, and you can read that off from the patterns of evolution, 694 00:42:48 --> 00:42:52 as well. There are two regulators, one called Sterile 12, 695 00:42:52 --> 00:42:57 one called Tec1. This computational analysis shows that they tend to 696 00:42:57 --> 00:43:01 co-occur in a conserved fashion, far more often than you'd expect by 697 00:43:01 --> 00:43:05 chance. And when you do the analysis, you find that those genes 698 00:43:05 --> 00:43:09 that just have a conserved Sterile 12 site, those genes tend to 699 00:43:09 --> 00:43:13 be involved in mating. Genes that just have a conserved 700 00:43:13 --> 00:43:16 instance of Tec1 tend to be involved in the budding of the yeast, 701 00:43:16 --> 00:43:20 and those genes that have conserved occurrences of both tend to be 702 00:43:20 --> 00:43:23 involved in filamentation. Now all that can be read out, 703 00:43:23 --> 00:43:26 which is way cool, this is not the way we used to do biology. 704 00:43:26 --> 00:43:30 Now don't get me wrong, there's a ton of experiments that underlie 705 00:43:30 --> 00:43:33 creating these databases, and there's a ton of experiments 706 00:43:33 --> 00:43:36 that have to be done to check any of these things. But what we have is 707 00:43:36 --> 00:43:40 one of the most powerful hypothesis generators that's ever 708 00:43:40 --> 00:43:44 been seen here. 
Evolution, by telling us what to 709 00:43:44 --> 00:43:48 focus on, is giving us, on a silver platter, hundreds of 710 00:43:48 --> 00:43:52 hypotheses about who's interacting with whom, and sending us back to 711 00:43:52 --> 00:43:56 the lab then, to test these hypotheses. Now, 712 00:43:56 --> 00:44:00 what are the implications of all of this for the human genome? 713 00:44:00 --> 00:44:04 Could we do this for the human genome? Well, 714 00:44:04 --> 00:44:08 these species, Saccharomyces cerevisiae, S. 715 00:44:08 --> 00:44:12 paradoxus, S. mikatae and S. bayanus, are they a good model for 716 00:44:12 --> 00:44:15 mammals? Well it turns out that their 717 00:44:15 --> 00:44:19 evolutionary distance from each other is the same as the distance of 718 00:44:19 --> 00:44:23 human to lemur, to dog, to mouse. 719 00:44:23 --> 00:44:27 So they were chosen with a purpose. Those are actually fairly good 720 00:44:27 --> 00:44:30 models for the human. So could we do exactly the same 721 00:44:30 --> 00:44:34 analysis for the human, for the entire human genome? 722 00:44:34 --> 00:44:38 If we had human, lemur, dog, and mouse, or basically four 723 00:44:38 --> 00:44:42 species: human, mouse, rat, and dog. 724 00:44:42 --> 00:44:46 Well, there's one little fly in the ointment. The human genome is 20x 725 00:44:46 --> 00:44:50 bigger than the yeast genome. If I want to analyze the whole 726 00:44:50 --> 00:44:54 human genome, I have a problem of signal-to-noise. 727 00:44:54 --> 00:44:58 The genome is 20x bigger, I've got 20x as much noise to get 728 00:44:58 --> 00:45:03 rid of. I won't walk you through it, but I need more evolutionary 729 00:45:03 --> 00:45:07 information to get rid of all that noise. And, you can do a simple 730 00:45:07 --> 00:45:11 calculation that says, my evolutionary tree needs to be 731 00:45:11 --> 00:45:15 bigger, its branch length needs to be bigger by about the natural log 732 00:45:15 --> 00:45:20 of 20, to get rid of 20 fold more noise. 
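The "simple calculation" mentioned above is not spelled out in the lecture. One way to see where a natural log could come from, offered here as an assumption and not as the lecturer's derivation, is a toy model in which an unconstrained base survives unchanged across a tree of total branch length t (in substitutions per site) with probability about exp(-t). Cutting chance conservation by a factor f then takes roughly ln(f) of extra branch length.

```python
import math

def extra_branch_length(noise_fold):
    """Extra branch length (substitutions per site) needed to cut chance
    conservation by `noise_fold`, under the toy exp(-t) model above."""
    return math.log(noise_fold)

def chance_conserved(branch_length):
    """Toy probability that a neutral base is unchanged across the whole tree."""
    return math.exp(-branch_length)

extra = extra_branch_length(20)
print(round(extra, 2))                        # close to 3 substitutions per site
print(chance_conserved(1.0) / chance_conserved(1.0 + extra))  # about 20-fold less noise
```

Under this model the answer does not depend on the starting tree: adding about three substitutions per site of branch length buys a twenty-fold reduction in accidental conservation.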
733 00:45:20 --> 00:45:24 And that would mean I'd need more species, I'd need about 16 species, 734 00:45:24 --> 00:45:28 or something like that to be able to do that. But if I built an 735 00:45:28 --> 00:45:32 evolutionary tree that had a branch length of four, 736 00:45:32 --> 00:45:36 that is, four substitutions per base across this evolutionary tree, 737 00:45:36 --> 00:45:40 as indicated by these colored lines here, I should have enough power to 738 00:45:40 --> 00:45:44 analyze the entire human genome, the way we just did the yeast genome. 739 00:45:44 --> 00:45:48 So we currently have human, chimp, mouse, rat, dog. As of this 740 00:45:48 --> 00:45:52 fall, in fact, right at the beginning of this term, 741 00:45:52 --> 00:45:56 the National Institutes of Health signed off on the sequencing of 742 00:45:56 --> 00:46:00 these additional eight mammals. These mammals are now in process, 743 00:46:00 --> 00:46:04 and in fact, the elephant is done, and the armadillo is in process, 744 00:46:04 --> 00:46:08 and the tree shrew, I think, is being caught at the moment. 745 00:46:08 --> 00:46:12 [LAUGHTER]. The ten-, don't talk about the tree 746 00:46:12 --> 00:46:18 shrews. The tenrec is actually being tested right now, 747 00:46:18 --> 00:46:24 etc, and all this is going on right now, as we speak, 748 00:46:24 --> 00:46:29 and I think that by next summer, we should have much of it, and 749 00:46:29 --> 00:46:35 certainly, by a year from now, we should have all this information 750 00:46:35 --> 00:46:41 to do such an analysis. That said, we're of course, 751 00:46:41 --> 00:46:47 very impatient people, you could just take the human, 752 00:46:47 --> 00:46:51 the mouse, the rat, and the dog. 
And I said that's not enough if you 753 00:46:51 --> 00:46:55 wanted to analyze the whole genome, but suppose you just wanted to 754 00:46:55 --> 00:46:59 analyze a portion of the genome, maybe about a yeast-size piece of 755 00:46:59 --> 00:47:03 the genome. Well, let's see, at 20,000 genes, I don't know, 756 00:47:03 --> 00:47:06 suppose I take, I don't know, two kilobases around each of 757 00:47:06 --> 00:47:10 20,000 genes, well, that's, you know, 40 megabases of DNA, it's only a 758 00:47:10 --> 00:47:14 couple-fold more than yeast. Maybe, if I just focus on a limited 759 00:47:14 --> 00:47:18 region around each promoter, I could start reading out these 760 00:47:18 --> 00:47:22 regulatory signals, with just four species. 761 00:47:22 --> 00:47:26 So in fact, a postdoctoral fellow has been working on this 762 00:47:26 --> 00:47:30 problem over the summer, and a little bit, too, through the 763 00:47:30 --> 00:47:34 spring and summer, together with Manolis Kellis, 764 00:47:34 --> 00:47:38 who's now in the computer science department. And I think we have a 765 00:47:38 --> 00:47:42 preliminary list for the human genome that's fallen out over the 766 00:47:42 --> 00:47:46 course of the past couple of months, and we're in the process, right now, 767 00:47:46 --> 00:47:50 of finishing up a paper that we're hoping to get submitted by Friday, 768 00:47:50 --> 00:47:54 with a preliminary list of regulatory signals in the human 769 00:47:54 --> 00:47:58 genome, read out from evolution of human, mouse, rat, and dog. 770 00:47:58 --> 00:48:01 It won't be everything, we don't have full power to pick up 771 00:48:01 --> 00:48:04 all possible signals, but we're picking up a lot of the 772 00:48:04 --> 00:48:08 signals, we're picking up a very large fraction of previously 773 00:48:08 --> 00:48:11 discovered signals, and lots more new signals, 774 00:48:11 --> 00:48:14 as well, are falling out of that analysis. 
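The back-of-the-envelope estimate above (two kilobases around each of 20,000 genes) can be checked with a few lines of arithmetic. This is only a sketch: the yeast genome size of roughly 12 megabases is an assumed approximate figure that is not given in the lecture, and the variable names are hypothetical.

```python
KB = 1_000        # bases per kilobase
MB = 1_000_000    # bases per megabase

genes = 20_000               # gene count quoted in the lecture
window_per_gene = 2 * KB     # two kilobases around each gene
yeast_genome = 12 * MB       # assumed approximate size, not from the lecture

# Total sequence to analyze if only promoter-proximal windows are kept.
target = genes * window_per_gene

print(f"promoter-proximal target: {target / MB:.0f} megabases")
print(f"fold relative to yeast: {target / yeast_genome:.1f}")
```

With these figures the target comes to 40 megabases, a small multiple of the assumed yeast genome size, which is why four species might already give usable power on that restricted region.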
So anyway, 775 00:48:14 --> 00:48:18 I can assure you that that's not in the textbooks because, 776 00:48:18 --> 00:48:21 actually, it hasn't been submitted yet. This other stuff I've 777 00:48:21 --> 00:48:25 described about the yeast analysis, if you do want to look it up, 778 00:48:25 --> 00:48:28 there's a paper in Nature about a year and change ago, 779 00:48:28 --> 00:48:32 Kellis et al., that describes this yeast work. This is what's going on. 780 00:48:32 --> 00:48:36 This is what's fun about teaching at MIT, is I can tell you this stuff, 781 00:48:36 --> 00:48:41 and you guys have a sense for the convergence that's going on in our 782 00:48:41 --> 00:48:45 field. In much of what I've tried to do, 783 00:48:45 --> 00:48:50 you know, in making the biology clear, I've talked about how the 784 00:48:50 --> 00:48:54 different directions, genetics, biochemistry, 785 00:48:54 --> 00:48:59 have converged together. What we're really seeing now is 786 00:48:59 --> 00:49:03 information sciences converging with that as well, and I've got to say, 787 00:49:03 --> 00:49:08 it's a tremendous amount of fun. See you on Monday, good 788 00:49:08 --> 00:49:13 luck on the quiz.