# Lecture 2: Local Alignment (BLAST) and Statistics


Description: In this lecture, Professor Burge reviews classical and next-generation sequencing. He then introduces local alignment (BLAST) and some of the associated statistics.

Instructor: Christopher Burge

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: All right. So today, we're going to briefly review classical sequencing and next-gen or second-gen sequencing, which sort of provides a lot of the data that the analytical methods we'll be talking about work on. And we'll then introduce local alignment a la BLAST and some of the statistics associated with that.

So just a few brief items on topic one. All right. So today, we're going to talk about sequencing first. Conventional-- or Sanger sequencing-- then next-gen or second-gen sequencing briefly. And then talk about local alignments.

So background for the sequencing part, the Metzker review covers everything you'll need. And for the alignment-- we'll talk about local alignment today, global alignment on Tuesday-- chapters four and five of the text cover it pretty well. So here's the text. If you haven't decided whether to get it or not, I'll have it up here. You can come flip through it after class.

Sequencing is mostly done at the level of DNA. Whether the original material was RNA or not, you usually convert it to DNA and sequence at the DNA level. So we'll often think about DNA as sort of a string. But it's important to remember that it actually has a three dimensional structure, as shown here. And often, it's helpful to think of it in sort of a two dimensional representation, where you think about the bases and their hydrogen bonding and so forth, as shown in the middle.

My mouse is not working today for some reason, but hopefully, we won't need it.

So the chemistry of sequencing is very closely related to the chemistry of the individual bases. And there are really three main types that are going to be relevant here. Ribonucleotides, deoxyribonucleotides, then for Sanger sequencing, dideoxyribonucleotides.

So who can tell me which of these structures corresponds to which of those names? And also, please let me know your name and I'll attempt to remember some of your names toward the end of the semester probably. So, which are which? Yes, what's your name?

AUDIENCE: I'm Simona. So the ribonucleotide is the top right. The deoxy is the one below it. And the dideoxy is the one to the left.

PROFESSOR: OK, so that is correct. So one way to keep these things in mind is the numbering of the carbons. So the carbons in the ribose sugar are numbered one through five: carbon 1 is the one where the base is attached. Two is here, which has an OH in RNA and just an H in DNA. And then three is very important. Four, and then five. So five connects to the phosphates, which then will connect the base to the sugar-phosphate backbone. And three is where you extend. That's where you're going to add the next base in a growing chain.

And so what will happen if you give DNA polymerase a template and some dideoxy nucleotides? It won't be able to extend because there's no 3-prime OH. And all the chemistry requires the OH. And so that's the basis of classical or Sanger sequencing, which Fred Sanger got the Nobel Prize for in the 1980s-- I think it was developed in the '70s-- and it's really the basis of most of the sequencing, or pretty much all the DNA sequencing up until the early 2000s before some newer technologies came about. And it takes advantage of this special property of dideoxy nucleotides that they terminate the growing chain.

So imagine we have a template DNA. So this is the molecule whose sequence we want to determine, shown there in black. We then have a primer. And notice the primer's written in the 5-prime to 3-prime direction. The ends would be primer sequences, and then primer-complementary sequences in the template. So you typically will have your template cloned-- this is in conventional sequencing-- cloned into some vector like a phage vector for sequencing, so you know the flanking sequences.

And then you do four sequencing reactions in conventional Sanger sequencing. And I know some of you have probably had this before. So let's take the first chemical reaction. The one here with ddGTP. So what would you put in that reaction? What are all the components of that reaction if you wanted to do conventional sequencing on this template? Anyone? What do you need and what does it accomplish? Yeah, what's your name?

AUDIENCE: I'm Tim.

PROFESSOR: Tim? Oh yeah, I know you, Tim. OK, go ahead.

AUDIENCE: So you need the four nucleotides-- the deoxynucleotides. You will need the dideoxynucleotides. In addition, you need all the other [INAUDIBLE]. You need polymerase. Generally, you need a buffer of some sort, [INAUDIBLE], to [INAUDIBLE].

PROFESSOR: Yeah, primer and template. Yeah. Great. That's good. It sounds like Tim could actually do this experiment. And what ratio would you put in? So you said you're going to put in all four conventional deoxynucleotides and then one dideoxynucleotide. So let's say dideoxy G just for simplicity here. So in what ratio would you put the dideoxynucleotide compared to the conventional nucleotides?

AUDIENCE: To lower the concentration.

PROFESSOR: Lower? Like how much lower?

AUDIENCE: Like, a lot lower.

PROFESSOR: Like maybe 1%?

AUDIENCE: Yeah.

PROFESSOR: Something like that. You want to put it a lot lower. And why is that so important?

AUDIENCE: Because you want the thing to be able to progress. Because you need enough of the ribonucleotide concentration so that [INAUDIBLE] every [INAUDIBLE] equivalent or excess and you're going to terminate [INAUDIBLE].

PROFESSOR: Right. So if you put equimolar deoxy G and dideoxy G, then it's going to be a 50% chance of terminating every time you hit a C in the template. So you're going to have half as much of the material at the second G, and a quarter as much at the third, and you're going to have vanishingly small amounts. So you're only going to be able to sequence the first few C's in the template. Exactly. So that's a very good point.
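A quick sketch of the arithmetic behind that answer: if a fraction p of the G pool is dideoxy, each C in the template terminates the chain with probability p, so the band intensity at the nth G position falls off geometrically. (This is an illustrative toy calculation, not a model of real polymerase kinetics.)

```python
def signal_at_nth_g(p, n):
    """Fraction of strands that terminate exactly at the nth G position,
    if a fraction p of the G pool is ddGTP (terminate with probability p
    at each G, having survived the previous n - 1)."""
    return p * (1 - p) ** (n - 1)

# Equimolar ddGTP (p = 0.5): usable signal dies within a handful of positions.
print(signal_at_nth_g(0.5, 1), signal_at_nth_g(0.5, 10))

# Dilute ddGTP (p = 0.01): band intensities stay comparable out to
# hundreds of positions, which is what lets you read long ladders.
print(signal_at_nth_g(0.01, 1), signal_at_nth_g(0.01, 500))
```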

So now let's imagine you do these four separate reactions. You typically would have radiolabeled primer so you can see your DNA. And then you would run it on some sort of gel. This is obviously not a real gel, but an idealized version. And then in the lane where you put dideoxy G, you would see the smallest products. So you read these guys from the bottom up.

And in this lane there is a very small product that's just one base longer than the primer here. And that's because there was a C there and it terminated there. And then the next C appears several bases later. So you have sort of a gap here.

And so you can see that the first base in the template would be the complement of G, or C. And the second base would be, you can see, the next smallest product is in this dideoxy T lane, therefore it would be A. And you just sort of snake your way up through the gel and read out the sequence. And this works well.
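That bottom-up reading procedure is simple enough to write down. A minimal sketch, with made-up band data (fragment length plus the ddNTP lane each band appeared in):

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def read_gel(bands):
    """Sort bands from shortest to longest (bottom of the gel up); the lane
    letters, in that order, spell the newly synthesized strand 5'->3'.
    The template is its complement, read 3'->5'."""
    ordered = sorted(bands)
    read = "".join(base for _, base in ordered)
    template = "".join(COMPLEMENT[b] for b in read)
    return read, template

# Hypothetical gel: four bands, one in each ddNTP lane.
bands = [(3, "T"), (1, "G"), (2, "A"), (4, "C")]
read, template = read_gel(bands)
print(read)      # GATC
print(template)  # CTAG
```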

So what does it actually look like in practice? Here are some actual sequencing gels. So you run four lanes. And on big polyacrylamide gels like this. Torbin, you ever run one of these?

AUDIENCE: Yes.

PROFESSOR: Yes? They're a big pain to cast. Run for several hours, I think. And you get these banding patterns. And what limits the sequence read length? So we normally call the sequence generated from one run of a sequencer a read. So that one attempt to sequence the template is called a read.

And you can see it's relatively easy to read the sequence toward the bottom, and then it gets harder as you go up. And so that's really what fundamentally limits the read length, is that the bands get closer and closer together. So they'll run inversely proportional to size with the small ones running faster. But then the difference between a 20 base product and a 21 might be significant. But the difference between a 500 base product and a 501 base product is going to be very small. And so you basically can't order the lanes anymore. And therefore, that's sort of what fundamentally limits it.
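One way to see this quantitatively: in a toy model where migration distance scales like 1/length, the gap between the bands for an n-base and an (n+1)-base product shrinks like 1/(n(n+1)). (An illustration only; real gel mobility is more complicated.)

```python
def relative_separation(n):
    """Gap between the bands for products of length n and n + 1, in a toy
    model where migration distance is proportional to 1/length."""
    return 1.0 / n - 1.0 / (n + 1)   # = 1 / (n * (n + 1))

print(relative_separation(20))    # easy to resolve near the bottom of the gel
print(relative_separation(500))   # ~600x smaller: the bands blur together
```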

All right. So here we had to run four lanes of a gel. Can anyone think of a more efficient way of doing Sanger sequencing? Is there any way to do it in one lane? Yeah, what's your name?

AUDIENCE: Adrian. You can use four different types of ddNTPs. Maybe like four different colors.

PROFESSOR: Four different colors. OK, so instead of using radiolabeling on the primer, you use a fluorophore on your dideoxynucleotides, for example. And then you can run them. Depending where that strand terminated, it'll be a different color. And you can run them all in one lane. OK, so that looks like that.

And so this was an important development, called dye-terminator sequencing, in the '90s. That was the basis of the ABI 3700 machine, which was really the workhorse of genome sequencing in the late '90s and early 2000s. Really what enabled the human genome to be sequenced.

And so one of the other innovations in this technology was that instead of having a big gel, they shrunk the gel. And then they just had a reader at the bottom. So the gel was shrunk to as thin as these little capillaries. I don't know if you can see these guys. But basically it's like a little thread here. And so each one of these is effectively-- oops! Oh no. No worries, this is not valuable. Ancient technology that I got for free from somebody.

So the DNA would be loaded at the top. There would be a little gel in each of these-- it's called capillary sequencing. And then it would run out the bottom, and there would be a detector which would detect the four different fluors and read out the sequence.

So this basically condensed the volume needed for sequencing. Any questions about conventional sequencing? Yes?

AUDIENCE: Where are the [INAUDIBLE] where you'd put the fluorescent flags? Like the topic from the [INAUDIBLE]?

PROFESSOR: Yeah, that's a good question. I don't actually remember. I think there are different options available. And sometimes with some of these reactions, you need to use modified polymerases that can tolerate these modified nucleotides. Yeah, so I don't remember that. It's a good question. I can look that up.

So how long can a conventional sequencer go? What's the read length? Anyone know? It's about, say, 600 or so. And so that's reasonably long. How long is a typical mammalian mRNA? Maybe two, three kb? So you have in a typical exon, maybe 150 bases or so. So you have a chunk. You don't generally get full length cDNA. But you get a chunk of a cDNA that's say, three, four exons in length. And that is actually generally sufficient to uniquely identify the gene locus that that read came from.

And so that was the basis of EST sequencing-- so-called Expressed Sequence Tag sequencing. And millions of these 600 base chunks of cDNA were generated and they have been quite useful over the years.

All right. So what is next-gen sequencing? So in next-gen sequencing, you only read one base at a time. So it's often a little bit slower. But it's really massively parallel. And that's the big advantage. And it's orders of magnitude cheaper per base than conventional sequencing. Like when it first came out, it was maybe two orders of magnitude cheaper. And now it's probably another four orders of magnitude.

So it really blows away conventional sequencing if the output that you care about is mostly proportional to the number of bases sequenced. If the output is proportional to the quality of the assembly or something, then there are applications where conventional sequencing is still very useful, because next-gen reads tend to be shorter. But in terms of just volume, it generates many, many more bases in one reaction.

And so the basic ideas are that you have your template DNA molecules-- typically tens of thousands for technologies like PacBio, or hundreds of millions for technologies like Illumina-- immobilized on some sort of surface, typically a flow cell. And there are either single molecule methods, where you have a single molecule of your template, or there are methods that locally amplify your template and produce, say, hundreds of identical copies in little clusters. And then you use modified nucleotides, often with fluorophores attached, to interrogate the next base at each of your template molecules, for hundreds and hundreds of millions of them.

And so there are several different technologies. We won't talk about all of them. We'll just talk about two or three that are interesting and widely used. And they differ depending on the DNA template, what types of modified nucleotides are used, and to some extent, in the imaging and the image analysis, which differs for single molecule methods, for example, compared to the ones that sequence a cluster.

So there's a table in the Metzker review. And so I've just told you that next-gen sequencing is so cheap. But then you see how much these machines cost, and you could buy lots of other interesting things with that kind of money. And I also want to emphasize that that's not even the full cost. So if you were to buy an Illumina GA2-- this would be like a couple years ago when the GA2 was the state of the art-- for half a million dollars, the reagents to run that thing continuously throughout the year would be over a million. So this actually underestimates the cost.

However, the cost per base is super, super low. Because they generate so much data at once. All right. So we'll talk about a couple of these.

The first next-gen sequencing technology to be published, and still used today, was from 454-- now Roche-- and it was based on what's called emulsion PCR. So they have these little beads; the little beads have adapter DNA molecules covalently attached. You incubate the beads with DNA, and you actually make an emulsion. So it's a water-in-oil emulsion.

So each bead, which is hydrophilic, is in the little bubble of water inside oil. And the reason for that is so that you do it at a template concentration that's low enough that only a single molecule of template is associated with each bead. So the oil then provides a barrier so that the DNA can't get transferred from one bead to another. So each bead will have a unique template molecule. You do sort of a local PCR-like reaction to amplify that DNA molecule on the bead, and then you do sequencing one base at a time using a luciferase based method that I'll show you on the next slide.

So Illumina technology differs in that instead of an emulsion, you're doing it on the surface of a flow cell. Again, you start with a single molecule of template. Your flow cell has these two types of adapters covalently attached. The template anneals to one of these adapters. You extend the adapter molecule with dNTPs and polymerase. Now you have the complement of your template; you denature.

Now you have the reverse complement of your template molecule covalently attached to the cell surface. And then at the other end there's the other adapter. And so what you can do is what's called bridge amplification, where that complement of the template molecule will bridge over and hybridize to the other adapter, and then you can extend that adapter. And now you've regenerated your original template. So now you have the complementary strand and the original strand, and you denature. And then each of those molecules can undergo subsequent rounds of bridge amplification to make clusters of typically several hundred thousand molecules. Is that clear? Question. Yeah, what's your name?

AUDIENCE: Stephanie. How do they get the adapters onto the template molecules?

PROFESSOR: How do you get the adapters onto the template molecules? So that's typically by DNA ligation. So we may cover that later. It depends. There are a few different protocols. So for example, if you're sequencing microRNAs, you typically would isolate the small RNAs and use RNA ligation to get the adapters on. And then you would do an RT step to get DNA.

With most other applications like RNA-seq or genome sequencing-- so with RNA-seq, you're starting from mRNA. You typically will isolate total RNA, do poly(A) selection, fragment your RNA to reduce the effects of secondary structure, and random prime with, like, random hexamers and RT enzyme. So that'll make little bits of cDNA, 200 bases long. You do second strand synthesis. Now you have double stranded cDNA fragments. And then you do, like, blunt end ligation to add the adapters. And then you denature so you have single strands.

AUDIENCE: I guess my question is how do you make sure that the two ends sandwiching the DNA are different as opposed to--

PROFESSOR: That the two ends are different. Yeah, that's a good question. I'll post some stuff about-- It's a good question. I don't want to sweep it under the rug. But I kind of want to move on. And I'll post a little bit about that.

All right, so we did 454 and Illumina. Helicos is sort of like Illumina sequencing except single molecule. So you have your template covalently attached to your substrate. You just anneal primer and start sequencing it. And there are major pros and cons of single molecule sequencing, which we can talk about.

And then the PacBio technology is fundamentally different in that the template is not actually covalently attached to the surface. The DNA polymerase is covalently attached to the surface and the template is sort of threaded into the polymerase. And this is a phage polymerase that's highly processive and strand displacing. And the template is often a circular molecule. And so you can actually read around the template multiple times, which turns out to be really useful in PacBio because the error rate is quite high for the sequencing.

So in the top, in the 454, you're measuring luciferase activity-- light. In Illumina, you're measuring fluorescence. Four different fluorescent tags, sort of like the four different tags we saw in Sanger sequencing. Helicos, it's a single tag, one base at a time. And in PacBio, you actually have a fluorescently labeled dNTP that has the label on-- it's actually a hexaphosphate-- it's got the label on the sixth phosphate.

So the dNTP is labeled. It enters the active site of the DNA polymerase. And the residence time is much longer if the base is actually going to get incorporated into that growing chain. And so you measure how much time you have a fluorescent signal. And if it's long, that means that that base must have incorporated into the DNA.

But then, the extension reaction itself will cleave off the last five phosphates and the fluorophore tag. And so you'll regenerate native DNA. So that's another difference. Whereas in Illumina sequencing, as we'll see, there's this reversible terminator chemistry. So the DNA is not native that you're synthesizing.

So this is just a little bit more on 454. Just some pretty pictures. I think I described that before. The key chemistry here is that you add one dNTP at a time. So only the subset of wells-- perhaps a quarter of them-- whose template has the complementary base free as the next one after the primer will undergo synthesis. And when they undergo synthesis, you release pyrophosphate.

And they have these enzymes attached to these little micro beads-- the orange beads-- sulfurylase and luciferase, that use pyrophosphate to basically generate light. And so then you have one of these beads in each well. You look at which wells lit up when we added dCTP. And they must have had G as the next base and so forth.

And there's no termination here. The only termination is because you're only adding one base at a time. So if you have a single G in the template, you'll add one C. But if you have two Gs in the template, you'll add two Cs. And in principle, you'll get twice as much light.

But then you have to sort of do some analysis after the fact to say, OK how much light do we have? And was that one G, two G, and so forth. And the amount of light is supposed to be linear up to about five or six Gs. But that's still a more error-prone step. And the most common type of error in 454 is actually insertions and deletions. Whereas in Illumina sequencing, it's substitutions.
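The homopolymer calling step can be caricatured in a line or two: if light is roughly linear in run length, the natural call is the nearest integer multiple of the one-base signal, and any noise that crosses a halfway point becomes an insertion or deletion. (A toy illustration, not 454's actual base caller.)

```python
def call_homopolymer(light, one_base_light):
    """Call a homopolymer run length from a pyrosequencing light burst,
    assuming signal is roughly linear in run length."""
    return max(0, round(light / one_base_light))

print(call_homopolymer(2.1, 1.0))  # 2: two Gs in the template
print(call_homopolymer(5.6, 1.0))  # 6: but for long runs the per-base
                                   # increment is relatively small, so noise
                                   # easily shifts the call, causing indels
```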

David actually encouraged me to talk more about sequencing errors and quality scores. And I need to do a little bit more background. But I may add that a little bit later in the semester.

OK, so in Illumina sequencing, you add all four dNTPs at the same time. But they're non-native. They have two major modifications. So one is that they're 3-prime blocked. That means that the 3-prime OH is not free. I'll show the chemical structure in a moment.

So you can't extend more than one base. You incorporate that one base, and the polymerase can't do anything more. And they're also tagged with four different fluors. So you add all four dNTPs at once. You let the polymerase incorporate them. And then you image the whole flow cell using two lasers and two filters.

So basically, to image the four fluors. So you have to sort of take four different pictures of each portion of the flow cell, and then the camera moves and you scan the whole cell. And so then, those clusters that incorporated a C, let's say, will show up in the green channel as spots. And those that incorporated an A, and so forth.

So you basically have these clusters, each of them represents a distinct template, and you read one base at a time. So, first you read the first base after the primer. So it's sequencing downwards into the template. And you read the first base so you know what the first base of all your clusters is. And then you reverse the termination. You cleave off that chemical group that was blocking the 3-prime OH so now it can extend again. And then you add the four dNTPs again, do another round of extension, and then image again, and so forth.
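The cycle just described (incorporate one blocked base, image, deblock, repeat) can be summarized as a toy simulation:

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def illumina_read(template, n_cycles):
    """Toy model of cyclic reversible termination: each cycle incorporates
    exactly one 3'-blocked, fluorescently tagged base (the complement of
    the next template base), images it, then cleaves the block and fluor
    so the next cycle can extend."""
    read = []
    for cycle in range(min(n_cycles, len(template))):
        base = COMPLEMENT[template[cycle]]  # only one base: the 3' end is blocked
        read.append(base)                   # 'imaging' records the fluor
        # deblocking/cleavage happens here, restoring a free 3'-OH
    return "".join(read)

print(illumina_read("TACG", 4))  # ATGC
```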

And so it takes a little while. Each round of imaging takes about an hour. So if you want to do 100 base single-end Illumina sequencing, it'll be running on the machine for about four days or so. Plus the time you have to build the clusters, which might be several hours on the day before.

So what is this? So actually the whole idea of blocking termination-- basically Sanger's idea-- is carried over here in Illumina sequencing with a little twist. And that's that you can reverse the termination. So if you look down here at the bottom, these are two different 3-prime terminators. Remember your carbon numbering. Carbon one, two, three. So this was the 3-prime OH, now it's got this methyl [INAUDIBLE], or whatever that is. I'm not much of a chemist, so you can look that one up.

And then here's another version. And this is sort of chemistry that can cleave this off when you're done. And then this whole thing here, hanging off the base, is the fluor. And you cleave that off as well. So you add this big complicated thing, you image it, and then you cleave off the fluor and cleave off the 3-prime block.

These are some actual sequencing images you would image in the four channels. They're actually black and white. These are pseudocolored. And then you can merge those and you can see all the clusters on the flow cell. So this is from a GA2 with the recommended cluster density back in the day, like a few years ago. And nowadays the image analysis software has gotten a lot better, so you can actually load the clusters more densely and therefore get more sequence out of the same area.

But imagine just millions and millions of these little clusters like this. Notice the clusters are not all the same size. Basically, you're doing PCR in situ, and so some molecules are easier to amplify by PCR than others. And that probably accounts for these variations in size.

So what is the current throughput? These data are accurate as of about, maybe, last year. So the HiSeq 2000 instrument is the highest-performance, widely used instrument. Now there's a 2500, but I think it's roughly similar. You have one flow cell. So a flow cell looks sort of like a glass slide, except that it has these channels carved in it, like eight little tubes inside the glass slide. And on the surfaces of those tubes is where the adapters are covalently attached. And so you have eight lanes, and you can sequence eight different things in those eight lanes. You could do yeast genome in one and fly RNA-seq in another, and so forth.

And these days, a single lane will produce something like 200 million reads. It's typically routine to get 200 million reads from a lane. Sometimes you can get more. You can do up to 100 bases, or 150 these days. On a MiSeq, which is a miniature version, you can do maybe 300 or more. And so that's a whole lot of sequence. So that's 160 billion bases of sequence from a single flow cell. And that will cost you-- a single lane-- maybe \$2,000 to \$3,000, depending where you're doing it. And the cost doesn't include the capital cost, that's just the reagent cost for running that.

So 160 billion-- the human genome is 3 billion, so you've now sequenced the human genome many times over there.
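The arithmetic behind those numbers, spelled out using the round figures from the lecture:

```python
reads_per_lane = 200_000_000   # ~200 million reads per lane
read_length = 100              # 100 base reads
lanes = 8                      # eight lanes per flow cell

total_bases = reads_per_lane * read_length * lanes
print(total_bases)  # 160 billion bases per flow cell

human_genome = 3_000_000_000
print(total_bases / human_genome)  # ~53x coverage of the human genome
```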

You can do more. So you can do paired-end sequencing, where you sequence both ends of your template. And that'll basically double the amount of sequence you get. And you can also, on this machine, do two flow cells at once. So you can actually double it beyond that.

And so for many applications, 160 billion bases is overkill. It's more than you need. Imagine you're doing bacterial genome sequencing. Bacterial genome might be five megabases or so. This is complete overkill. So you can do bar coding where you add little six base tags to different libraries, and then mix them together, introduce them to the machine, sequence the tags first or second, and then sequence the templates. And then you effectively sort them out later. And then do many samples in one lane. And that's what people most commonly do.
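A minimal sketch of that demultiplexing step. The barcodes and sample names here are made up, and real pipelines also tolerate sequencing errors in the tag, which is omitted:

```python
def demultiplex(reads, barcode_to_sample, tag_len=6):
    """Route each read to its sample based on the 6-base barcode it starts
    with; the tag is stripped off and the insert is kept. Reads with an
    unrecognized tag are dropped."""
    bins = {sample: [] for sample in barcode_to_sample.values()}
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        if tag in barcode_to_sample:
            bins[barcode_to_sample[tag]].append(insert)
    return bins

barcodes = {"ACGTAC": "yeast_genome", "TTGCAA": "fly_rnaseq"}  # hypothetical tags
reads = ["ACGTACGGGTTT", "TTGCAAAACCC", "ACGTACTTTAAA"]
bins = demultiplex(reads, barcodes)
print(bins["yeast_genome"])  # ['GGGTTT', 'TTTAAA']
print(bins["fly_rnaseq"])    # ['AACCC']
```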

So, questions about next-gen sequencing? There's a lot more to learn. I'm happy to talk about it more. It's very relevant to this class. But I'm sure it'll come up later in David's sections, so I don't want to take too much time on it.

So now, once you generate reads from an Illumina instrument or some other instrument, you'll want to align them to the genome. For example, if you're doing RNA-seq-- mapping reads that come from mRNA-- you'll want to know what genes they came from. So you need to map those reads back to the genome. What are some other reasons you might want to align sequences? Just in general, why is aligning sequences-- meaning, matching them up and finding individual bases or amino acid residues that match-- why is that useful? Diego?

AUDIENCE: You can assemble them if you want.

PROFESSOR: You can assemble them? Yes. So if you're doing genome sequencing, if you align them to each other and you find a whole stack that sort of align this way, you can then assemble and infer the existence of a longer sequence. That's a good point.

AUDIENCE: Julianne. Looking at homologs.

PROFESSOR: Looking at homologs. Right. So if you, for example, are doing disease gene mapping, you've identified a human gene of unknown function that's associated with a disease. Then you might want to search it against, say, the mouse database and find a homolog in mouse and then that might be what you would want to study further. You might want to then knock it out in mouse or mutate it or something. So those are some good reasons. There's others.

So we're going to first talk about local alignment, which is a type of alignment where you want to find shorter stretches of high similarity. You don't require alignment of the entire sequence. So there are certain situations where you might want to do that.

So here's an example. You are studying a recently discovered human non-coding RNA. As you can see, it's 45 bases. You want to see if there's a mouse homolog. You run it through NCBI BLAST, which as we said is sort of the Google search engine of molecular biology-- and you're going to get a chance to do it on problem set one-- and you get a hit that looks like this.

So notice, this is sort of BLAST notation. It says Q at the top. Q is for "query," that's the sequence you put in. S is "subject," that's the database you were searching against. You have coordinates, so 1 to 45. And then, in the subject, it happened to be base 403 to 447 in some mouse chromosome or something. And you can see that it's got some matching. But it also has some mismatches. So in all, there are 40 matches and five mismatches in the alignment.

So is that significant? Remember, the mouse genome is 2.7 billion bases long. It's big. So would you get a match this good by chance? So the question is really, should you trust this? Is this something you can confidently say, yes mouse is a homolog, and that's it? Or should you just be like, well, that's not better than I get by chance so I have no evidence of anything? Or is it sort of somewhere in between? And how would you tell? Yeah, what's your name?

AUDIENCE: Chris. You would want to figure out a scoring function for the alignment. And then, with that scoring function, you would find whether or not you have a significant match.

PROFESSOR: OK. So Chris says you want to define a scoring system and then use the scoring system to define statistical significance. Do want to suggest a scoring system? What's the simplest one you can think of?

AUDIENCE: Just if there's a match, you add a certain score. If it's a mismatch, you subtract a certain score.

PROFESSOR: So let's do that scoring system. So the notation that's often used is s(i,j). So s(i,i) would be a match between nucleotide i and another copy of nucleotide i. We'll call that plus 1 for a match. And s(i,j), where i and j are different, we'll give that a negative score. Minus 1. So this is i not equal to j.

So that's a scoring matrix. It's a four by four matrix with 1 on the diagonal and minus 1 everywhere else. And this is commonly used for DNA. And then there's a few other variations on this that are also used. So good, a scoring system. So then, how are we going to do the statistics? Any ideas? How do we know what's significant?
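That plus-1/minus-1 scheme, and the score of the BLAST hit from earlier, in a few lines (the example sequences are made up):

```python
def s(i, j, match=1, mismatch=-1):
    """The scoring matrix from the board: 1 on the diagonal (i == j),
    -1 everywhere else."""
    return match if i == j else mismatch

def ungapped_score(query, subject):
    """Score an ungapped alignment of two equal-length sequences."""
    return sum(s(a, b) for a, b in zip(query, subject))

print(ungapped_score("ACGTA", "ACGTT"))  # 4 matches - 1 mismatch = 3

# The mouse hit had 40 matches and 5 mismatches in 45 aligned bases:
print(40 * 1 + 5 * (-1))  # score 35
```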

AUDIENCE: The higher score would probably be a little more significant than a lower score. But the scale, I'm not sure--

PROFESSOR: The scale is not so obvious. Yes, question?

AUDIENCE: My name is Andrea. So if you shuffled the RNA, like permute the sequence, then we'll get the [INAUDIBLE] genome you get with that shuffled sequence. And the score is about the same as you'd get with the non-shuffled sequence [INAUDIBLE] about very significant scores.

PROFESSOR: Yeah, so that's a good idea. BLAST, as it turns out-- is pretty fast. So you could shuffle your RNA molecule, randomly permute the nucleotides many times, maybe even like 1,000 times, search each one against the mouse genome, and get a distribution of what's the best score-- the top score-- that you get against a genome, look at that distribution and say whether the score of the actual one is significantly higher than that distribution or just falls in the middle of that somewhere. And that's reasonable.
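The shuffle idea as a sketch, with a tiny query and database and a brute-force scorer standing in for BLAST against a real genome:

```python
import random

def best_score(query, database):
    """Best +1/-1 score of the full query placed at any offset in the
    database (ungapped, full-length windows only; brute force)."""
    n = len(query)
    return max(
        sum(1 if a == b else -1 for a, b in zip(query, database[i:i + n]))
        for i in range(len(database) - n + 1)
    )

def empirical_pvalue(query, database, trials=1000, seed=0):
    """Shuffle the query many times; the p-value estimate is the fraction
    of shuffles whose best score meets or beats the real one."""
    rng = random.Random(seed)
    real = best_score(query, database)
    letters = list(query)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if best_score("".join(letters), database) >= real:
            hits += 1
    return hits / trials

print(best_score("AC", "TTACGG"))  # 2: an exact 'AC' match exists
```

Doing those thousand searches against a 2.7 billion base genome is exactly the computation the analytical theory lets you skip.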

You can certainly do that, and it's not a bad thing to do. But it turns out there is an analytical theory here that you can use. And so that you can determine significance more quickly without doing so much computation. And that's what we'll talk about. But another issue, before we get to the statistics, is how do you actually find that alignment? How do you find the top scoring match in a mouse genome?

So let's suppose this guy is your RNA. OK, of course, we're using T's, but that's just because you usually sequence it at the DNA level. But imagine this is your RNA. It's very short. This is like 10 or so, I think. And this is your database. But it goes on a few billion more. Several more blackboards. And I want to come up with an algorithm that will find the highest scoring segment of this query sequence against this database.

Any ideas? So this would be like our first algorithm. And it's not terribly hard, so that's why it's a good one to start with. Not totally obvious either. Who can think of an algorithm or something, some operation that we can do on this sequence compared to this sequence-- in some way-- that will help us find the highest scoring match? I'm sorry. Yeah?

AUDIENCE: You have to consider insertion and deletion.

PROFESSOR: Yeah, OK. So we're going to keep it simple. That's true, in general. But we're going to keep it simple and just say no insertions and deletions. So we're going to look for an ungapped local alignment. So that's the algorithm that I want. First, no gaps. And then we'll do gaps on Tuesday. Tim?

AUDIENCE: You could just compare your [INAUDIBLE] to [INAUDIBLE] all across the database and turn off all the [INAUDIBLE] on that [INAUDIBLE], and then figure out [INAUDIBLE].

PROFESSOR: Yeah, OK. Pretty much. I mean, that's pretty much right. Although it's not quite as much of a description as you would need if you want to actually code that. Like, how would you actually do that? So, I want a description that is sort of more at the level of pseudocode. Like, here's how you would actually organize your code.

So, let's say you entertain the hypothesis that the alignment can be in different registers. The alignment can correspond to base one of the query and base one of the subject. Or it could be shifted. It could be an alignment where base 1 of the query matches these two, and so forth. So there's sort of different registers. So let's just consider one register first. The one where base 1 matches.

So let's just look at the matches between corresponding bases. I'm just going to make these little angle bracket guys here. Hopefully I won't make any mistakes. I'm going to take this. This is sort of implementing Tim's idea here. And then I'm going to look for each of these-- so consider it going down here. Now we're sort of looking at an alignment here. Is this a match or a mismatch?

That's a mismatch. That's a match. That's a mismatch. That's a mismatch. That's a match. Match. Match. Mismatch. Mismatch. Mismatch. So where is the top scoring match between the query and the subject? Tim? Anyone?

AUDIENCE: 6, 7, 8.

PROFESSOR: 6, 7, 8. Good. Oh--

AUDIENCE: 5, 6, 7.

PROFESSOR: 5, 6, 7. Right. Right here. You can see there's three in a row. Well, what about this? Why can't we add this to the match? What's the reason why it's not 2, 3, 4, 5, 6, 7?

AUDIENCE: Because the score for that is lower.

PROFESSOR: Because the score for that is lower. Right. We defined top scoring segment. You sum up the scores across the match. So you can have mismatches in there, but this will have a score of 3. And if you wanted to add these three bases, you would be adding negative 2 and plus 1, so it would reduce your score. So that would be worse.

Any ideas on how to do this in an automatic, algorithmic way? Yeah? What's your name?

AUDIENCE: Simon. So if you keep shifting the entire database, [INAUDIBLE].

PROFESSOR: OK so you keep shifting it over, and you generate one of these lines. But imagine my query was like 1,000 or something. And my database is like a billion. How do I look along here? And here it was obvious what the top scoring match is. But if I had two matches here, then we would've actually had a longer match here.

So in general, how do I find that that top match? For each of those registers, if you will, you'll have a thousand long diagonal here with 1's and minus 1's on it. How do I process those scores to find the top scoring segment? What's an algorithm to do that?

It's kind of intuitively obvious, but I want something where you define a variable and you update it, and you add to it, and subtract-- something like that, but something a computer could actually handle. Yeah? What was your name? Julianne?

AUDIENCE: Could you keep track of what the highest total score is, and then you keep going down the diagonal, And then you update it?

PROFESSOR: OK. You keep track of what the highest total score was?

AUDIENCE: Yeah. The highest test score.

PROFESSOR: The highest segment score? OK. I'm going to put this up here. And we'll define max s. That's the highest segment score we've achieved to date. And we'll initialize that to zero, let's say. Because if you had all mismatches, zero would be the correct answer. If your query was A's and your subject was T's. And then what do you do?

AUDIENCE: As you go down the diagonal, you keep track of--

PROFESSOR: Keep track of what?

AUDIENCE: So you look at 1 in 1 first. And then you go 1 in 2, and you find a score of zero. But that's higher than negative 1.

PROFESSOR: But the score of the maximum segment at that point, after base 2, is not zero. It's actually 1. Because you could have a segment of one base alignment. The cumulative score is zero. I think you're onto something here that may be also something useful to keep track of.

Let's do the cumulative score and then you tell me more. We'll define cumulative score variable. We'll initialize that to zero. And then we'll have some for loops that, as some of you have said, you want to loop through the subject. All the possible registers of the subject. So that would be maybe j equals 1 to subject length minus query length. Something like that. Don't worry too much about this. Again, this is not real code, obviously. It's pseudocode.

So then this will be, say, 1 to query length. And so this will be going along our diagonal. And we're going to plot the cumulative score. So here you would have an update where cumulative score plus equals the score of query position i matched against subject position j. And update that. So that's just cumulative score.

So what will it look like? So in this case, I'll just use this down here. So you have zero, 1, 2, minus 1, minus 2. So you'll start at position zero in the sequence. At position 1 you're down here at minus 1 because it was a mismatch.

Then at position 2, as you said, we're back up to zero. And then what happens? Go down to minus 1, down to minus 2. Then we go up three times in a row until we're up here to 1. And then we go down after that.

So where is your highest scoring match in this cumulative score plot? People said it was from 5 to 7. Yeah, question?

AUDIENCE: So would it be from like a local minimum to a local maximum?

PROFESSOR: Yeah. Exactly. So, what do you want to keep track of?

AUDIENCE: You want to keep track of the minimum and the maximum. And look for the range which you maximize to different--

PROFESSOR: Yeah, so this is now sort of more what I was looking for in terms of-- so this was the local minimum, and that's the local maximum. This is the score. That's your max s there. And you also want to keep track of where that happened in both the query and the subject. Does that make sense? So you would keep track of this running cumulative score variable. You keep track of the last minimum. The minimum that you've achieved so far. And so that would then be down here to minus 2.

And then when your cumulative score got up to plus 1, you always take that cumulative score, minus the last minimum cumulative score. That gives you a potential candidate for a high scoring segment. And if that is bigger than your current max high scoring segment, then you update it and you would update this. And then you would also have variables that would store where you are. And also, where did that last minimum occur.

So I'm not spelling it all out. I'm not going to give you all the variables. But this is an algorithm that would find the maximum score. Yeah, question?
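
The ideas collected so far-- a cumulative score, the minimum cumulative score seen so far, and a record of the best difference-- can be spelled out as runnable pseudocode. This is our own sketch of the ungapped scan with positions tracked as suggested; it is not the actual BLAST code:

```python
# Walk each diagonal (register); keep a cumulative score, the lowest
# cumulative score seen so far on that diagonal and where it occurred,
# and update the best segment whenever (cum - low) sets a new record.
def best_segment(query, subject, match=1, mismatch=-1):
    """Return (max_s, (q_start, q_end), (s_start, s_end)), 1-based, ungapped."""
    max_s = 0
    best_span = None
    for j in range(len(subject) - len(query) + 1):   # each register
        cum = 0          # cumulative score along this diagonal
        low = 0          # minimum cumulative score so far
        low_pos = -1     # query index (0-based) where that minimum occurred
        for i in range(len(query)):
            cum += match if query[i] == subject[j + i] else mismatch
            if cum - low > max_s:                    # new record segment
                max_s = cum - low
                best_span = ((low_pos + 2, i + 1),       # query 1-based span
                             (j + low_pos + 2, j + i + 1))  # subject span
            if cum < low:
                low, low_pos = cum, i
    return max_s, best_span
```

Note the segment starts one position after the last minimum, which is exactly the "local minimum to local maximum" observation from the plot.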

AUDIENCE: So you're keeping track of the global maximum, local minimum, so that you can accept the most recent local minimum following the global maximum?

PROFESSOR: I'm not sure I got all that. But you're keeping track of the cumulative score. The minimum that that cumulative score ever got to. And the maximum difference, the maximum that you ever in the past have gone up. Where you've had a net increment upwards.

Like here. So this variable here, this max s, it would be initialized to zero. When you got to here, your last minimum score would be minus 1. Your cumulative score would be zero. You would take the difference of those, and you'd be like, oh I've got a high scoring segment of score one. So I'm going to update that.

So now, that variable is now 1 at this point. Then you're going down, so you're not getting anything. You're just lowering this minimum cumulative score down to minus 2 here. And then when you get to here, now you check the cumulative score minus the last minimum. It's 1. That's a tie. We won't keep track of ties.

Now at here, that difference is 2. So now we've got a new record. So now we update this maximum score to 2, and the locations. And then we get here, now it's 3, and we update that. Does that make sense?

AUDIENCE: Imagine the first dip-- instead of going down to negative 1, it went down to negative 3.

PROFESSOR: Negative 3?

AUDIENCE: That first dip.

PROFESSOR: Right here? So we started back a little bit. So back here, like this?

AUDIENCE: No.

PROFESSOR: Down to negative 3?

AUDIENCE: No.

PROFESSOR: But how do we get to negative 3? Because our scoring is this way. You want this dip to minus 3?

AUDIENCE: No

PROFESSOR: This one minus 3? Imagine we are at minus 3 here?

AUDIENCE: Yeah. Imagine it dipped to minus 3. And then the next one dipped to higher than that, to minus 2. And then it went up to 1. And so, would the difference you look at be negative 2 to 1, or negative 3 to 1?

PROFESSOR: Like that, right? So, minus 3, let's say, minus 2, 1. Something like that. What do people think? Anyone want to--?

AUDIENCE: Minus 3 to 1.

PROFESSOR: Minus 3 to 1. It's the minimum that you ever got to. This might be a stronger match, but this is a higher scoring match. And we said we want higher scoring. So you would count that.

AUDIENCE: So you keep track of both the global minimum and the global maximum, and you take the difference between them.

PROFESSOR: You keep track of the global minimum and the current cumulative score, and you take the difference.

AUDIENCE: The global maximum--

PROFESSOR: It's not necessarily global maximum because we could be well below zero here. We could do like this. From here to here. So this is not the global maximum. This just happens to be, we went up a lot since our last minimum. So that's your high scoring segment. Does that make sense?

I haven't completely spelled it out. But I think you guys have given enough ideas here that there's sort of the core of an algorithm. And I encourage you to think this through afterwards and let me know if there are questions. And we could add an optional homework where I ask you to do this, that we've sometimes had in the past. It is a useful thing to look at.

This is not exactly how the BLAST algorithm works. It uses some tricks for faster speed. But this is sort of morally equivalent to BLAST in the sense that it has the same order of magnitude running time.

So this algorithm-- what is the running time in Big-O notation? So just for those who are non-CS people, when you use this Big-O notation, you're asking, how does the running time increase with the size of the input? And so what is the input? So we have two inputs. We have a query of length m, and let's say a subject of length n. So clearly, if those are bigger, it'll take longer to run. But when you compare different algorithms, you want to know how the run time depends on those lengths. Yes. What's your name?

AUDIENCE: Sally. m times n.

PROFESSOR: So with this, this is what you would call an order mn algorithm. And why is that? How can you see that?

AUDIENCE: You have two for loops And for each length, essentially, you're going through everything in the query. And then, for everything that you go through in the query, you would [INAUDIBLE].

PROFESSOR: Right. In this second for loop here, you're going through the query. And you're doing that nested inside of a for loop that's basically the length of the subject. And eventually you're going to have to compare every base in the query to every base in the subject. There's no way around that. And that takes some unit of time. And so the actual time will be proportional to that. So the bigger n gets and m gets, it's just proportional to the product. Does that make sense?

Or another way to think about it is, you're clearly going to have to do something on this diagonal. And then you're going to have to do something on this diagonal, and this one, and this one. And actually, you have to also check these ones here. And in the end, the total number of computations there is going to be this times that. You're basically doing a rectangle's worth of computations. Does that makes sense?

So that's not bad, right? It could be worse. It could be, like, mn squared or something like that. So that's basically why BLAST is fast.

So what do these things look like, in general? And what is the condition on our score for this algorithm to work? What if I gave a score of plus 1 for a match, and zero for a mismatch? Could we do this? Joe, you're shaking your head.

AUDIENCE: It would just be going up.

PROFESSOR: Yeah. The problem is, it might be flat for a while, but eventually it would go up. And it would just go up and up and up. And so your highest scoring segment would, most of the time, be something that started very near the beginning and ended very near the end. So that doesn't work. So you have to have a net negative drift. And the way that's formalized is the expected score has to be negative.

So why is the expected score negative in this scoring system that has plus 1 for a match, and minus 1 for a mismatch? Why does that work?

AUDIENCE: It should be wrong three quarters of the time.

PROFESSOR: Yeah. You'll have a mismatch three quarters of the time. So on average, you tend to drift down. And then you have these little excursions upwards, and those are your high scoring segments. Any questions about that?
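
The drift condition is easy to check numerically. Under uniform composition a match occurs a quarter of the time, so +1/-1 scoring gives an expected per-position score of 1/4 - 3/4 = -1/2, while +1/0 scoring drifts upward. A small sketch, with a helper name of our own:

```python
# Expected per-position score under a given match probability.
# Negative expectation ("drift down") is required for the local
# alignment statistics to apply; with 0 for a mismatch the walk
# drifts up and the top segment spans essentially everything.
def expected_score(match, mismatch, p_match=0.25):
    return p_match * match + (1 - p_match) * mismatch
```

For example, expected_score(1, -1) is -0.5, and expected_score(1, 0) is +0.25, which is why the plus-one/zero scheme fails.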

AUDIENCE: Question. Is there something better than m times n?

PROFESSOR: We've got some computer scientists here. David? Better than m times n? I don't think so, because you have to do all those comparisons. And so there's no way around that, so I don't think so. All right. But the constant-- you can do better on the constant than this algorithm thing.

AUDIENCE: With multiple queries--

PROFESSOR: With multiple queries, yeah. Then you can maybe do some hashing or find some-- to speed it up.

OK, so what about the statistics of this? So it turns out that Karlin and Altschul developed some theory for just exactly this problem. For searching a query sequence. It can be nucleotide or protein as long as you have integer scores and the average-- or the expected-- score is negative, then this theory tells you how often the highest score of all-- across the entire query database comparison-- exceeds a cut off x using a local alignment algorithm such as BLAST.

And it turns out that these scores follow what's called an extreme value or Gumbel distribution. And it has this kind of double exponential form here. So x is some cut off. So usually x would be the score that you actually observed when you searched your query against the database. That's the one you care about.

And then you want to know, what's the probability we would've seen something higher than that? Or you might do x is one less than the score you observed. So what's the chance we observed something the same, as good as this, or better? Does that make sense? And so this is going to be your P value then.

So the probability that S-- the score of the highest segment under a model where you have a random query against a random database of the same lengths-- exceeds x is 1 minus e to the minus KMN e to the minus lambda x. Where M and N are the lengths of the query and the database. x is the score. And then K and lambda are two positive parameters that depend actually on the details of your score matrix and the composition of your sequences.

And it turns out that lambda is really the one that matters. And you can see that because lambda is up there in that exponent multiplying x. So if you double lambda, that'll have a big effect on the answer. And K, it turns out, you can mostly ignore it for most purposes.

So as a formula, what does this thing look like? It looks like that. Kind of a funny shape. It sort of looks like a normal distribution a little bit, but then has a different shape on the right than the left. And how do you calculate this lambda? So I said that lambda is sort of the key to all this because of its uniquely important place in that formula, multiplying the score.

So it turns out that lambda is the unique positive solution to this equation here. So now it actually depends on the scoring matrix. So you see there's sij there. It depends on the composition of your query. That's the pi's. The composition of your subject, that's the rj's. You sum over the i and j equal to each of the four nucleotides. And that sum has to be 1. So there's a unique positive solution to this equation.

So how would we solve an equation like this? First of all, what kind of equation is this, given that we're going to set the sij, and we're going to just measure the pi and the rj? So those are all known constants, and lambda is what we're trying to solve for here. So what kind of an equation is this in lambda? Linear? Quadratic? Hyperbolic? Anybody know what this is?

So this is called a transcendental equation because you have different powers. That sounds kind of unpleasant. You don't take a class in transcendental equations probably. So in general, they're not possible to solve analytically when they get complicated. But in simple cases, you can solve them analytically. And in fact, let's just do one.

So let's take the simplest case, which would be that all the pi's are a quarter. All the ri's are a quarter. And we'll use the scoring system that we came up with before, where sii is 1, and sij is minus 1. If i does not equal j.

And so when we plug those in to that sum there, what do we get? We'll get four terms that are one quarter, times one quarter, times e to the lambda. There's four possible types of matches, right? They have probability one quarter times a quarter. That's pi and rj. And the e to the lambda sii is just e to the lambda because sii is 1. And then there's 12 terms that are one quarter, one quarter, e to the minus lambda. Because there's the minus 1 score. And that has to equal 1.

So cancel this, we'll multiply through by 4, maybe. So now we get e to the lambda plus 3 e to the minus lambda equals 1. It's still a transcendental equation, but it's looking a little simpler. Any ideas how to solve this for lambda? Sally?

AUDIENCE: Wouldn't the 1 be 4?

PROFESSOR: I'm sorry. 4. Thank you. Yeah, what's your name?

AUDIENCE: [INAUDIBLE] I think [INAUDIBLE] quadratic equation. If you multiply both sides by [INAUDIBLE] then [INAUDIBLE].

PROFESSOR: OK, so the claim is this is basically a quadratic equation. So you multiply both sides by e to the lambda. So then you get e to the 2 lambda plus 3. And then we're going to move this over and get minus 4 e to the lambda, equals zero. Is that good?

So how is it quadratic? What do you actually do to solve this?

AUDIENCE: Well, [INAUDIBLE].

PROFESSOR: Change the variable, x equals e to the lambda. Then it's quadratic in x. Solve for x. We all know how to solve quadratic equations. And then substitute that for lambda. OK, everyone got that?

If you use 16 different scores to represent all the different types of matches and mismatches, this will be very unpleasant. It's not unsolvable, it's just that you have to use computational numerical methods to solve it. But in simple cases where you just have a couple different types of scores, it will often be a quadratic equation.

All right. So let's suppose that we have a particular scoring system-- particular pi's, rj's-- and we have a value of lambda that satisfies those. So we've solved this quadratic equation for lambda. I think we get lambda equals natural log 3, something like that. Remember, it's a unique positive solution. Quadratic equations are two solutions, but there's going to be just one positive one. And then we have that value. It satisfies this equation.
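
The substitution x = e to the lambda turns e^lambda + 3 e^(-lambda) = 4 into the quadratic x^2 - 4x + 3 = 0, with roots 1 and 3; discarding x = 1 (which would give lambda = 0) leaves lambda = ln 3. A sketch of that computation:

```python
# Solve (1/4)e^lam + (3/4)e^(-lam) = 1 for the +1/-1 matrix with
# uniform base composition, via the substitution x = e^lam, which
# gives x^2 - 4x + 3 = 0.
import math

def solve_lambda_uniform():
    """Unique positive lambda for +1/-1 scoring, uniform composition."""
    a, b, c = 1.0, -4.0, 3.0
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    x = max(r for r in roots if r > 1)   # need e^lam > 1 for lambda > 0
    return math.log(x)
```

The quadratic has two roots, but only one gives a positive lambda, matching the "unique positive solution" claim.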

So then, what if we double the scores? Instead of plus 1 minus 1, we use plus 2 minus 2? What would then happen? You can see that the original version of lambda wouldn't necessarily still satisfy this equation. But if you think about it a little bit, you can figure out what new value of lambda would satisfy this equation.

We've solved for the lambda that solves with these scores. Now we're going to have new scores. sii prime equals 2. sij prime equals minus 2. What is lambda prime? The lambda that goes with these scores? Yeah, go ahead.

AUDIENCE: Half of the original?

PROFESSOR: Half of the original? Right. So you're saying that lambda prime equals lambda over 2. And why is that? Can you explain?

AUDIENCE: Because of the [INAUDIBLE].

PROFESSOR: Yeah, if you think about these terms in the sum, the s part is all doubling. So if you cut the lambda in half, the product will equal what it did before. And we haven't changed the pi's and rj's, so all those terms will be the same. So therefore, it will still satisfy that equation. So that's another way of thinking about it. Yes, you're correct.

So if you double the scores, lambda will be reduced by a factor of 2. So what does that tell us about lambda? What is it? What is its meaning? Yeah, go ahead, Jeff.

AUDIENCE: Scale of the distribution to the expectant score? Or the range score?

PROFESSOR: Yeah. It basically scales the scores. So we can have the same equation here used with arbitrary scoring. It just scales it. You can see the way it appears as a multiplicative factor in front of the score. So if you double all the scores, will that change what the highest scoring segment is? No, it won't change it because you'll have this cumulative thing. It just changes how you label the y-axis. It'll make it bigger, but it won't change what that is.

And if you look at this equation, it won't change the statistical significance. The x will double in value, because all the matches are now worth twice as much as what they were before. But lambda will be half as big, and so the product will be the same and therefore, the final probability will be the same. So it's just a scaling factor for using different scoring systems. Everyone got that?
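
The scaling argument can be checked numerically: doubling every score while halving lambda leaves each term of the Karlin-Altschul sum, and hence the sum itself, unchanged. Uniform composition is assumed in this sketch:

```python
# sum over i,j of pi*rj*exp(lam*s_ij) for a +match/-mismatch matrix,
# uniform composition: 4 match terms and 12 mismatch terms.
import math

def ka_sum(lam, match, mismatch, p=0.25):
    """The Karlin-Altschul normalization sum, uniform composition."""
    return 4 * p * p * math.exp(lam * match) + 12 * p * p * math.exp(lam * mismatch)

lam = math.log(3)                    # solves the +1/-1 case
s_original = ka_sum(lam, 1, -1)      # approximately 1
s_doubled = ka_sum(lam / 2, 2, -2)   # also approximately 1
```

Since lam * s_ij is unchanged term by term, both sums satisfy the defining equation, which is exactly why lambda prime equals lambda over 2.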

All right. So what scoring matrix should we use for DNA? How about this one? So this is now a slight generalization. So we're going to keep 1 for the matches. You don't lose any generality by choosing 1 here for matches, because if you use 2, then lambda is just going to be reduced to compensate.

So 1 for matches. And then we're going to use m for mismatches. And m must be negative in order to satisfy this condition for this theory to work, that the average score has to be negative. Clearly, you have to have some negative scores.

And the question then is, should we use minus 1 like we used before? Or should we use like minus 2 or minus 5, or something else? Any thoughts on this? Or does it matter? Maybe it doesn't matter. Yeah, what's your name?

AUDIENCE: [INAUDIBLE]. Would it make sense to not use [INAUDIBLE], because [INAUDIBLE].

PROFESSOR: Yeah, OK. So you want to use a more complicated scoring system. What particular mismatches would you want to penalize more and less?

AUDIENCE: [INAUDIBLE] I think [INAUDIBLE] needs to be [INAUDIBLE].

PROFESSOR: Yeah, you are correct in your intuition. Maybe one of the biologists wants to offer a suggestion here. Yeah, go ahead.

AUDIENCE: So it's a mismatch between purine and pyrimidine [INAUDIBLE].

PROFESSOR: OK so now we've got purines and pyrimidines. So everyone remember, the purines are A and G. The pyrimidines are C and T. And the idea is that this should be penalized, or this should be penalized less than changing a purine to a pyrimidine. And why does that makes sense?

AUDIENCE: Well, structurally they're--

PROFESSOR: Structurally, purines are more similar to each other than they are to pyrimidines. And? More importantly, I think. In evolution?

AUDIENCE: [INAUDIBLE].

PROFESSOR: I'm sorry, can you speak up?

AUDIENCE: C to T mutations happen spontaneously in [INAUDIBLE] chemistry.

PROFESSOR: Yes. So C to T mutations happen spontaneously. So basically, it's easier because they look more similar structurally. The DNA polymerase is more likely to make a mistake and substitute another purine. The ratio of transitions-- purine to purine or pyrimidine to pyrimidine-- to transversions, which switch the type, is about three to one, or two to one in different systems. So yeah, that's a good idea.

But for simplicity, just to keep the math simple, we're just going to go with one mismatch penalty. But that is a good point. In practice, you might want to do that.
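
For completeness, here is a sketch of the transition/transversion idea raised above. The -1/-2 split is purely illustrative; real matrices would be tuned to observed substitution rates:

```python
# Score transitions (A<->G, C<->T, within a chemical class) less
# harshly than transversions (across classes). Values are illustrative.
PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def ti_tv_score(i, j, match=1, transition=-1, transversion=-2):
    """Mismatch penalty depending on purine/pyrimidine class."""
    if i == j:
        return match
    same_class = ({i, j} <= PURINES) or ({i, j} <= PYRIMIDINES)
    return transition if same_class else transversion
```

The lecture sticks with a single mismatch penalty to keep the math simple, but this is the kind of refinement you might use in practice.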

So now, I'm saying I'm going to limit you to one mismatch penalty. But I'm going to let you choose any value you want. So what value should you choose? Or does it matter? Or maybe different applications? Tim, yeah?

AUDIENCE: I've just got a question. Does it depend on pi and ri? For example, we could use all these numbers. But if the overall wants to be negative, then you couldn't use negative .1.

PROFESSOR: Right, that's a good point. You can't make it too weak. It may depend on what your expected fraction of matches is, which actually depends on pi and ri. So if you have very biased sequences, like very AT rich, your expected fraction of matches is actually higher. When you're searching an AT rich sequence against another AT rich sequence, it's actually higher than a quarter.

So even minus one might not be sufficient there. You might need to go down more negative. So you may need to use a higher negative value just to make sure that the expected value is negative. That's true. And yeah, you may want to adjust it based on the composition.

So let's just do a bit more. So it turns out that the Karlin and Altschul theory, in addition to telling you what the p value is of your match-- the statistical significance-- it also tells you what the matches will look like in terms of what fraction of identity they will have. And this is the so-called target frequency equation.

The theory says that if I search a query with one particular composition, p, against a subject with composition r-- here, I've just assumed they're the same, both p, just for simplicity-- with a scoring matrix sij, which has a corresponding lambda. Then, when I take those very high scoring matches-- the ones that are statistically significant-- and I look at the alignments of those matches, I will get values qij, given by this formula.

So look at the formula. So it's qij. So pi pj e to the lambda sij. So it's basically the expected chance that you would have base i matching base j just by chance. That's pi pj. But then weighted by e to the lambda sij. So we notice for a match, s will be positive, so e to the lambda s will be bigger than 1. And you'll have more matches, and correspondingly fewer mismatches, because the mismatch score is negative. So you get the target frequencies.

And that also tells you that the so-called natural scores are actually determined by the fraction of matches that you want in your high scoring segments. If we want 90% matches, we just set qii to be 0.9, and use this equation here. Solve for sij.
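
Plugging the +1/-1 matrix with lambda = ln 3 into the target frequency equation under uniform composition shows what the q's work out to: 3/16 per matching pair and 1/48 per mismatching pair, i.e. 75% identity in high scoring segments, with everything summing to 1. A sketch with our own helper name:

```python
# Target frequency q_ij = pi * pj * exp(lambda * s_ij): the expected
# composition of statistically significant high-scoring segments.
import math

def target_freq(pi, pj, s, lam):
    return pi * pj * math.exp(lam * s)

lam = math.log(3)
q_match = target_freq(0.25, 0.25, 1, lam)      # 3/16 per matching pair
q_mismatch = target_freq(0.25, 0.25, -1, lam)  # 1/48 per mismatching pair
identity = 4 * q_match                          # expected identity fraction
```

So the +1/-1 matrix is "natural" for finding segments around 75% identical, which is exactly the claim made below.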

For example, if you want to find regions with R% identity. Little r is just the R as a proportion. qii is going to be r over 4. This assumes unbiased base composition-- a quarter of the matches are each of A, C, G, and T. qij, then, is 1 minus r over 12. 1 minus r is the fraction of non-matching positions, and there are 12 different types.

Set sii equal to 1, that's what we said we normally do. And then you do a little bit of algebra here. m is sij. And you sort of plug in this equation twice here. And you get this equation. So it says that m equals log of 4 times (1 minus r) over 3, divided by log of 4r.

And for this to be true, this assumes that both the query and the database have uniform composition of a quarter, and that r is between a quarter and 1. The proportion of matches in your high scoring segment-- you want it to be bigger than a quarter. A quarter is what you would see by chance. There's something wrong with your scoring system if you're considering those to be significant. So it's something above 25%.

And so it's just simple algebra-- you can check my work at home-- to solve for m here. And then this equation then tells you that if I want to find 75% identical matches in a nucleotide search, I should use a mismatch penalty of minus 1.
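
The formula for m can be evaluated at the identity levels mentioned. This sketch assumes uniform composition and sii = 1, as in the derivation:

```python
# Natural mismatch penalty m = log(4*(1-r)/3) / log(4*r) for a target
# identity fraction r, with match score fixed at +1 and uniform
# composition; valid for r between 1/4 and 1.
import math

def mismatch_penalty(r):
    """Mismatch score targeting identity fraction r in top segments."""
    assert 0.25 < r < 1.0
    return math.log(4 * (1 - r) / 3) / math.log(4 * r)

m75 = mismatch_penalty(0.75)   # close to -1
m99 = mismatch_penalty(0.99)   # about -3.1
```

Which reproduces the two cases quoted next: 75% identity goes with a penalty of minus 1, and 99% identity with roughly minus 3.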

And if I want 99% identical matches, I should use a penalty of minus 3. Not minus 5, but minus 3. And I want you to think about, does that make sense? Does that not make sense? Because I'm going to ask you at the beginning of class on Tuesday to explain and comment on this particular phenomenon of how when you want higher percent identities, you want a more negative mismatch score. Any last questions? Comments?