Description: This lecture is guided by the question "Where is missing heritability found?" Prof. David Gifford discusses computational models that can predict phenotype from genotype. He then discusses how to discover and model quantitative trait loci.
Instructor: Prof. David Gifford
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PROFESSOR: So as you recall last time we talked about chromatin structure and chromatin regulation. And now we're going to move on to genetic analysis. But before we did that, I want us to touch on two points that we talked about briefly last time.
One was 5C analysis. Who was it that brought up-- who was the 5C expert here? Anybody? No? Nobody wants to own 5C. OK.
But as you recall, we talked about ChIA-PET as one way of analyzing any to any interactions in the way that the genome folds up and enhancers talk to promoters. And 5C is a very similar technique.
I just wanted to show you the flow chart for how the protocol goes. There is a cross linking. A digestion with a restriction enzyme step, followed by a proximity ligation step, which gives you molecules that had been brought together by an enhancer, promoter complex, or any other kind of distal protein-protein interaction.
And then, what happens is that you design specific timers to detect those ligation events. And you sequence the result of what is known as ligation mediated amplification. So those primers are only going to ligate if they're brought together at a particular junction, which is defined by the restriction sites lining up.
So, 5C is a method of looking at which regions of the genome interact and can produce these sorts of results, showing which parts of the genome interact with one another.
The key difference, I think, between chIA-PET and 5C is that you actually have to have these primers designed and pick the particular locations you want to query.
So the primers that you design represent query locations and you can then either apply the results to a microarray, or to high throughput sequencing to detect these interactions.
But the essential idea is the same. Where you do proximity based ligation to form molecules that contain components of two different pieces of the genome that have been brought together for some functional reason.
The next thing I want to touch upon was this idea of the CpG dinucleotides that are connected by a phosphate bond. And you recall that I talked about the idea that they were symmetric.
So you could have methyl groups on the cytosines in such a way that, because they could mirror one another, they could be transferred from one strand of DNA to the other strand of DNA, during cell replication by DNA methyltransferase.
So it forms a more stable kind of mark and as you recall, DNA methylation where something occurred in lowly expressed genes and typically in regions of the genome that are methylated. Other histone marks are not present and the genes are turned off.
OK. So those were the points I wanted to touch upon from last lecture. Now we're going to embark upon an adventure, looking for the answer to, wear is missing heritability found?
So it's a big open question now in genetics. In human genetics, which is that we really can't find all the heritability. And as a point of introduction, the narrative arc for today's lecture is that, generally speaking, you're more like your relatives than random people on the planet.
And why is this? Well obviously you contain components of your mom and dad's genomes. And they are providing you with components of your traits. And the heritability of a trait is defined by the fraction of phenotypic variance that can be explained by genetics.
And we're going to talk today about computational models that can predict phenotype from genotype. And this is very important, obviously, for understanding the sources of various traits and phenotypes. As well as fields such as pharmacogenomics that try and predict the best therapy for a disease based upon your genetic makeup.
So, individual loci in the genome that contribute to quantitative traits are called quantitative trait locis, or QTLs. So we're going to talked about how to discover them and how to build models of quantitative traits using QTLs.
And finally, as I said at the outset, our models are insufficient today. They really can't find all of the heritability. So we're going to go searching for this missing heritability and see where it might be found.
Computationally, we're going to apply a variety of techniques to these problems. A preview is, we're going to build linear models of phenotype and we're going to use stepwise regression to learn these models using a forward feature selection.
And I'll talk about what that is when we get to that point of the lecture. We're going to derive test statistics for discovering which QTLs are significant and which QTLs are not, to include in our model.
And finally, we're going to talk about how to measure narrow sense heritability and broad sense heritability in environmental variance.
OK. So, one great resource for traits that are fairly simple. That primarily are the result of a single gene mutation, or where a single gene mutation plays a dominant role, is something called Online Mendelian Inheritance in Man.
And it's a resource. It has about 21,000 genes in it right now. And it's a great way to explore what human genes function is in various diseases. And you could query by disease. You can query by gene. And it is a very carefully annotated and maintained collection that is worthy of study, if you're interested in particular disease genes.
We're going to be looking at more complex analyses today. The analyses we're going to look at are where there are many genes that influence a particular trait. And we would like to come up with general methods for discovering how we can de novo from experimental data-- discover all the different genes that participate.
Now just as a quick review of statistics, I think that we've talked before about means in class and variances. We're also going to talk a little bit about covariances today. But these are terms that you should be familiar with as we're looking today at some of our metrics for understanding heritability.
Are there any question about any of the statistical metrics that are up here? OK.
So, a broad overview of genotype to phenotype. So, we're primarily going to be working with complete genome sequences today, which will reveal all of the variance that are present in the genome.
And it's also the case that you can subsample a genome and only observe certain variance. Typically that's done with microarrays that have probes that are specific to particular markers.
The way those arrays are manufactured is that whole genome sequencing is done at the outset, and then high prevalence variance, at least common variance, which typically are at a frequency of at least 5% in the population are queried by using a microarray. But today we'll talk about complete genome sequence.
An individual's phenotype, we'll say is defined by one or more traits. And a non-quantitative trait is something perhaps as simple as whether or not something is dead or alive. Or whether or not it can survive in a particular condition. Or its ability to produce a particular substance.
A quantitative trait, on the other hand, is a continuous variable. Height, for example, of an individual is a quantitative trait. As is growth rate, expression of a particular gene, and so forth.
So we'll be focusing today on estimating quantitative traits. And as I said, a quantitative trait or loci, is a marker that's associated with a quantitative trait and could be used to predict it. And you can sometimes hear about eQTLs, which are expression quantitative trait loci. And they're loci that are related to gene expression.
So, let's begin then, with a very simple genetic model. It's going to be haploid, which means, of course, there's only one copy of each chromosome. Yeast is the model organism we're going to be talking about today. It's a haploid organism.
And we have mom and dad up there. Mom on the left, dad on the right in two different colors. And you can see that mom and dad in this particular example, have n different genes. They're going to contribute to the F1 generation, to junior.
And the relative color is white for mom, black for dad, are going to be used to describe the alleles, or the allelic variance that are inherited by the child, the F1 generation.
And as I said, a specific phenotype might be alive or dead in a specific environment. And note that I have drawn the chromosomes to be disconnected. Which means that each one of those genes is going to be independently inherited.
So the probability in the F1 generation that you're going to get one of those from mom or dad is going to be a coin flip. We're going to assume that they're far enough away that the probability of crossing over during meiosis is 0.5. And so we get a random assortment of alleles from mom and dad. OK?
So let us say that you go off and do an experiment. And you have 32 individuals that you produce out of a cross. And you test them, OK. And two of them are resistant to a particular substance.
How many genes do you think are involved in that resistance? Let's assume that mom is resistant and dad is not. OK. If you had two that were resistant out of 32, how many different genes do you think were involved? How do you estimate that? Any ideas? Yes?
AUDIENCE: If you had 32 individuals and say half of them got it?
PROFESSOR: Two, let's say. One out of 16 is resistant. And mom is resistant.
AUDIENCE: Because I was thinking that if it was half of them were resistant, then you would maybe guess one gene, or something like that.
PROFESSOR: Very good.
AUDIENCE: So then if only eight were resistant you might guess two genes, or something like that?
PROFESSOR: Yeah. What you say is, that if mom's resistant, then we're going to assume that you need to get the right number of genes from mom to be resistant. Right? And so, let's say that you had to get four genes from mom. What's the chance of getting four genes from mom?
AUDIENCE: Half to the power of four.
PROFESSOR: Yeah, which is one out of 16, right? So, if you, for example had two that were resistant out of 32, the chances are one in 16. Right? So you would naively think, and properly so, that you had to give four genes from mom to be resistant.
So the way to think about these sorts of non-quantitative traits is that you can estimate the number of genes involved. The simply is log base 2 over the number of F1s tested over the number of the F1s with the phenotype.
It tells you roughly how many genes are involved in providing a particular trait, assuming that the genes are unlinked. It's a coin flip, whether you get them or not.
Does everybody see that? Yes? Any questions at all about that? About the details? OK.
Let's talk now about quantitative traits then. We'll go back to our model and imagine that we have the same set-- actually it's going to a different set of n genes. We're going to have a coin flip as to whether or not you're getting a mom gene or a dad gene. OK. And each gene in dad has an effect size of 1 over n. Yes?
AUDIENCE: I just wanted to check. We're assuming that the parents are homozygous for the trait? Is that correct?
PROFESSOR: Remember these are haploid.
AUDIENCE: Oh, these are haploid.
PROFESSOR: Right. So they only have one copy of all these genes. All right. Yes?
AUDIENCE: [INAUDIBLE] resistant and they're [INAUDIBLE]. That could still mean that dad has three of the four genes in principle.
PROFESSOR: The previous slide? Is that where what you're talking about?
AUDIENCE: [INAUDIBLE] knew about it. So really what you mean is that dad does not have any of the genes that are involved with resistance.
PROFESSOR: The correct. I was saying that dad has to have all of gene-- that the child has to have all of the genes that are operative to create resistance. We're going to assume an AND model. He must have all the genes from mom. They're involved in the resistance pathway.
And since only one out of a 16 progeny has all those genes from mom, right, it appears that given the chance of inheriting something from mom is 1/2, that it's four genes you have to inherit from mom. Because the chance of inheriting all four is one out of 16.
AUDIENCE: [INAUDIBLE] in which case--
PROFESSOR: No, I'm assuming the dad doesn't have any of those. But here we're asking, what is the difference in the number of genes between mom and dad?
So you're right, that the number we're computing is the relative number of genes different between mom and dad you require. And so it might be that dad's a reference and we're asking how many additional genes mom brought to the table to provide with that resistance. But that's a good point. OK. OK.
So, now let's look at this quantitative model. Let's assume that mom has a bunch of genes that contribute zero to an effect size and dad-- each gene that dad has produces an effect of 1 over n. So the total effect size here for dad is 1.
So the effect of mom on this particular quantitative trait might be zero. It might be the amount of ethanol produced or some other quantitative value. And dad, on the other hand, since he has n genes, is going to produce one, because each gene contributes a little bit to this quantitative phenotype. Is everybody clear on that?
So, the child is going to inherit genes to our coin flip between mom and dad, right. So the first fundamental question is, how many different levels are there in our quantitative phenotype in our trait? How many different levels can you have?
AUDIENCE: N + 1?
PROFESSOR: N + 1, right, because you can either inherit zero, or up to n genes from dad. And it gets you n plus 1 different levels. OK.
So, what's the probability then-- well, I'll ask a different question. What's the expected value of the quantitative phenotype of a child? Just looking at this.
If dad's one and mom's zero, and you have a collection of genes and you do a coin flip each time, you're going to get half your genes from mom and half your genes from dad. Right.
And so the expected trait value is 0.5. So for these added traits, you're going be at the midpoint between mom and dad. Right. And what is the probability that you inherit x copies of dad's genes?
Well, that's n choose x, times 1 minus .5 n to the minus x times 0.5 to the x. A simple binomial. Right. So if you look at this, the probability of the distribution for the children is going to look something like this, where this is the mean, 0.5.
And the number of distinct values is going to be n plus 1. Right. So the expected value of x is 0.5 and turns out that the expected value, or the variance of x minus 0.5, which is the mean squared, is going to be 0.25 over n.
So I can show you this on the next slide. So you can see, this could be ethanol production, it could be growth rate, what have you. And you can see that the number of genes that you're going to get from dad follows this binomial distribution and gives you a spread of different phenotypes in the child's generation, depending upon how many copies of dad's genes that you inherit.
But does this make sense to everybody? Now would be a great time to ask any questions about the details of this. Yes?
AUDIENCE: Can you clarify what x is? Is x the fraction of genes inherited--
PROFESSOR: The number of genes you inherit from dad. The number of genes. So it would zero, one, two, up to n.
AUDIENCE: Shouldn't the expectation of n [INAUDIBLE] x be n/2?
PROFESSOR: I'm sorry. It is supposed to be n/2. But the last two expectations are some of the number of genes you've inherited from dad. Right, that's correct. Yeah, this slide's wrong. Any other questions? OK.
So this is a very simple model but it tells us a couple of things, right. Which is that as n gets to be very large, the effect of each gene gets to be quite small.
So something could be completely heritable, but if it's spread over, say 1,000 genes, then it will be very difficult to detect, because the effect of each gene would be quite small. And furthermore, the variance that you see in the offspring will be quite small as well, right, in terms of the phenotype. Because it's going to be 0.25/n in terms of the expected value.
So as n gets larger, the number genes that contribute to that phenotype increase, the variance is going to go down linearly. OK. So we should just keep this in mind as we're looking at discovering these sort of traits and the underlying QTLs that can be used to predict them.
And finally, I'd like to point out one other detail which is that, if genes are linked, that is, if they're in close proximity to one another in the genome and it makes it very unlikely there's going to be crossing over between them, then they're going to act as a unit. And if they act as a unit, then we'll get marker correlation. And you can also see, effectively, that the effect size of those two genes is going to be larger.
And in more complicated models, we obviously wouldn't have the same effect size for each gene. The effect size might be quite large for some genes, might be quite small for some genes. And we'll see the effects of marker correlation in a little bit.
So the way we're going to model this is we're going to-- this is a definition of the variables that we're going to be talking about today. And the essential idea is quite simple.
So the phenotype of an individual-- so p sub i is the phenotype of an individual, is going to be equal to some function of their genotype plus an environmental component. This function is the critical thing that we want to discover.
This function, f, is mapping from the genotype of an individual to its phenotype. And the environmental component could be how well something is fed, how much sunlight it gets, things that can greatly influence things like growth but they're not described by genetics.
But this function is going to encapsulate what we know about how the genetics of a particular individual influences a trait. And thus, if we consider a population of individuals, the phenotypic variance is going to be equal to the genotypic variance plus the environmental variance plus two times the covariance between the genotype in the environment.
And we're going to assume, as most studies do, that there is no correlation between genotype and environment. So this term disappears. So what we're left with is that the observed phenotypic variance is equal to the genotypic variance plus the environmental variance.
And what we would like to do is to come up with a function f, that best predicts the genotypic component of this equation. There's nothing we can do about environmental variance. Right. But we can measure it. Does anybody have any ideas how we could measure environmental variance? Yes?
AUDIENCE: Study populations in which there's some kind of controlled environment. So you study populations that one population is one with a homogeneous. And another one was a completely different one.
PROFESSOR: Right. So what we could do is we could use controls. So typically what we could do is we could study in environments where we try and control the environment exactly to eliminate this as much as we possibly can, for example. As we'll see that we also can do things like study clones, where individuals have exactly the same genotype. And then, all of the variance that we observe-- if this term vanishes because the genotypes are identical, it is due to the environment.
So typically, if you're doing things like studying humans, since cloning humans isn't really a good idea to actually measure environmental variance, right, what you could do is you can look at identical twins. And identical twins give you a way to get at the question of how much environment variance there is for a particular phenotype.
So in sum, this is replicates what I have here on the left-hand side of the board. And note that today we'll be talking about the idea of discovering this function, f, and how well we can discover f, which is really important, right. It's fundamental to be able to predict phenotype from genotype. It's an extraordinarily central question in genetics. And when we do the prediction, there are two kinds of-- oh, there's a question?
AUDIENCE: Could you please explain again why the co-variance drops out or it goes away.
PROFESSOR: Yeah, the co-variance drops out because we're going to assume that genotype and environment are independent. Now if they're not independent, it won't drop out. But making that assumption-- and of course, for human studies you can't really make that assumption completely, right?
And one of the problems in doing these sorts of studies is that it's very, very easy to get confounded. Because when you're trying to decompose the observed variance and height, for example.
You know, there's what mom and dad provided to an individual in terms of their height, and there's also how much junior ate, right. And whether he went to McDonald's a lot, or you know, was going to Whole Foods a lot. You know, who knows, right?
But this component and this component, it's easy to get confounded between them and sometimes you can imagine that genotype is related to place of origin in the world. And that has a lot to do with environment. And so this term wouldn't necessarily disappear.
OK. So there are two kinds of heritability I'd like to touch upon today. And it's important that you remember there are two kinds and one is extraordinarily difficult to recover and the other one is in some sense, a more constrained problem, because we're much better at building models for that kind of heritability estimate.
The first is broad-sense heritability, which describes the upper bound for phenotypic prediction given an arbitrary model. So it's the total contribution to phenotypic variance from genetic causes. And we can estimate that, right. And we'll see how we can estimate it in a moment.
And narrow-sense heritability is defined as, how much of the heritability can we describe when we restrict f to be a linear model. So when f is simply linear, as the sum of terms, that describes the maximum narrow-sense heritability we can recover in terms of the fraction of phenotypic variance we can capture in f.
And it's very useful because it turns out that we can compute both broad-sense and narrow-sense heritability from first principles-- I mean from experiment. And the difference between them is part of our quest today.
Our quest is, to answer the question, where is the missing heritability? Why can't we build an Oracle f that perfectly predicts phenotype from genotype?
So on that line-- I just want to give you some caveats. One is that we're always talking about populations when we're talking about heritability because it's how we're going to estimate it.
And when you hear people talk about heritability, oftentimes they won't qualify it in terms of whether it's broad-sense or narrow-sense. And so you should ask them if you're engaged in a scientific discussion with them.
And as we've already discussed, sometimes estimation is difficult because of matching environment and eliminating this term, the environmental term can be a challenge when you're out of the laboratory. Like when you're dealing with humans.
So, let's talk about broad-sense heritability. Imagine that we measure environmental variants simply by looking at environmental twins or clones, right.
Because if we, for example, take a bunch of yeast that are genotypically identical. And we grow them up separately, and we measure a trait like how well they respond to a particular chemical or their growth rate, then the variance we see from each individual to individual is simply environmental, because they're genetically identical. So
we can, in that particular case, exactly quantify the environmental variance given that every individual is genetically identical. We simply measure all the growth rates and we compute the variance. And that's the environmental variance. OK?
As I said for humans, the best we can do is identical twins. Monozygotic twins. You can go out and for pairs of twins that are identical, you can measure height or any other trait that you like and compute the variance. And then that is an estimate of the environmental component of that, because they should be genetically identical.
And big H squared-- broad-sense is always capital H squared and narrow-sense is always little h squared. Big H squared, which is broad-sense heritability is very simple then.
It's the phenotypic variance, minus the environmental variance, over the phenotypic variance. So it's the fraction of phenotypic experience that can be explained from genetic causes. Is that clear to everybody? Any questions at all about this? OK.
So, for example, on the right-hand hand side here, those three purplish squares have three different populations, which are genotypically identical. They have two genes, a little a, a little a, big A, a little A, and big A, big A. And each one is a variance of 1.0. out So since there are genetically identical, we know that the environmental variance has to be 1.0.
On the left-hand side, you see the genotypic variance. And that reminds us of where we started today. It depends on the number of alleles you get of big A, as to what the value is.
And when you put all of that together, you get a total variance of 3. And so big H squared is simply the genotypic variance, which is 2, over the total phenotypic variance, which is 3. So big H squared is 2/3. And so that is a way of computing broad-sense heritability.
Now, if we think about our models, we can see that narrow-sense heritability has some very nice properties. Right. That is, if we build and add a model of phenotype, to get at narrow-sense heritability.
So if we were to constraint f here to be linear, it's simply going to be a very simple linear model. For each particular QTL that we discover, we assign an effect size beta to it, or a coefficient that describes its deviation from the mean for that particular trait. And we have an offset, beta zero.
So our simple linear model is going to take all the discovery QTLs that we have-- take each QTL and discover which allelic form it's in. Typically it's considered either in zero or one form. And then add a beta j, where j is the particular QTL deviation from mean value. Add them all together to compute the phenotype. OK.
So, this is a very simple additive model and a consequence of this model is that if you think about an F1 or a child of two parents, as we said earlier, a child is going to inherit roughly half of the alleles from mom and half of the alleles from dad.
And so for additive models like this, the expected value of the child's trait value is going to be the midpoint of mom and dad. And that can be derived directly from the equation above, because you're getting half of the QTLs from mom and half of the QTLs from dad.
So this was observed a long time ago, right, because if you did studies and you looked at the deviation from the midpoint of parents for human height.
You can see that the children fall pretty close to mid-parent line, where the y-axis here is the height in inches and that suggests that much of human height can be modeled by a narrow-sense based heritability model.
Now, once again, narrow-sense heritability is the fraction of phenotypic variance explained by an additive model. And we've talked before about the model itself. And little h squared is simply going to be the amount of variance explained by the additive model over the total phenotypic variance.
And the additive variance is shown on the right-hand side. That equation boils down to, you take the phenotypic variance and you subtract off the variance that's environmental and that cannot be explained by the additive variance, and what you're left with is the additive variance.
And once again, coming back to the question of missing heritability, if we observe that what we can estimate for little h squared is below what we expect, that gap has to be explained somehow. Some typical values for theoretical h squared.
So this is not measured h squared in terms of building a model and testing it like this. But what we can do is we can theoretically estimate what h squared should be, by looking at the fraction of identity between individuals.
Morphological traits tend to have higher h squared for the fitness traits. So human height has a little h square of about 0.8. And for those ranchers out there in the audience, you'll be happy to know that cattle yearly weight has heritability of about 0.35.
Now, things like life history which are fitness traits are less heritable. Which would suggest that looking at how long your parents lived and trying to estimate how long you're going to live is not as productive as looking at how tall you are compared to your parents. And there's a complete table that I've included in the slides for you to look at, but it's too small to read on the screen.
OK, so now we're going to turn to computational models and how we can discover a model that figures out where the QTLs are, and then assigns that function f to them so we can predict phenotype from genotype. And we're going to be taking our example from this paper by Bloom, et al, which I posted on the Stellar site. And it came out last year and it's wonderful study in QTL analysis.
And the setup for this study is quite simple. What they did was, is they took two different strains of yeast, RM and BY, and they crossed them and produced roughly 1,000 F1s. And RM and BY are very similar. They are about, I think it's about 35,000 snips between them.
Only about 0.5% of their genomes are different. So they're really close. Just for point of reference, you know, the distance between me and you is something like one base for every thousand? Something like that. And then they assayed all those F1s. They genotyped them all.
So to genotype them, what you do is you know what the parental genotypes are because they sequence both parents. The mom and dad, so to speak, at 50x coverage. So they knew the genome sequence is completely for both mom and dad.
And then for each one of the 1,000 F1s they put them on a microarray and what is shown on the very bottom left is a result of genotype in an individual where they can see each chromosome and whether it came from mom or from dad.
And you can't see it here, but there are 16 different chromosomes and the alternating purple and yellow colors show whether that particular part of the genome came from mom or from dad. So they know for each individual, its source. From the left or the right strain. OK.
And they have a thousand different genetic makeups. And then they asked, for each one of those individuals, how well could they grow in 46 different conditions? So they exposed them to different sugars, to different unfavorable environments and so forth.
And they measured growth rate as shown on the right-hand side. Or right in the middle, that little thing that looks like a bunch of little dots of various sizes. By measuring colony size, they could measure how well the yeast were growing. And so they had two different things, right.
They had the exact genotype of each individual, and they also had how well it was growing in a particular condition. And so for each condition, they wanted to associate the genotype of the individual to how well it was growing. To its phenotype.
Now, one fair question is, of these different conditions, how many of them were really independent? And so to analyze that, they looked at the correlation between growth rates across conditions to try and figure out whether or not they actually had 46 different traits they were measuring.
So this is a correlation matrix that is too small to read on the screen. The colors are somewhat visible, where the blue colors are perfect correlation and the red colors are perfect anti-correlation.
And you can see that in certain areas of this grid, things are more correlated, like what sugars the yeast liked to eat. But suffice to say, they had a large collection of traits they wanted to estimate.
So, now we want to build a computational model. So our next step is figuring out how to find those places in the genome that allows us to predict, how well, given a trait, the yeast would grow. The actual growth rate.
So the key idea is this-- you have genetic markers, which are snips down the genome and you're going to test a particular marker. And if this is a particular trait, one possibility is that-- let's say that this marker could be either 0 or 1. Without loss of generality, it could be that here are all the individuals where the marker is zero.
And here are all the markers where the marker is 1. And really, fundamentally, whether an individual has a 0 or a 1 marker, it doesn't really change its growth rate very much. OK? It's more or less identical. It's also possible that this is best modeled by two different means for a given trait.
That when the marker is 1, you're growing-- actually this is going to be the growth rate on the x-axis. The y-axis is the density. That you're growing much better when you have a 1 in that marker position than a zero.
And so we need to distinguish between these two cases when the marker is predictive of growth rate and when the marker is not predictive of growth rate.
And we've talked about lod likelihood tests before and you can see one on the very top. And you can see there's an additional degree of freedom that we have in the top prediction versus the bottom because we're using two different means that are conditioned upon the genotypic value at a particular marker.
So we have a lot of different markers indeed. So we have-- let's see here, the exact number. I think it's about 13,000 markers they had in this study. No. 11,623 different unique markers they found. That they could discover, that weren't linked together. We talked about linkage earlier on.
So you've got over 11,000 markers. You're going to do a lod likelihood test to compute this lod odds score. Do we have to worry about multiple hypothesis correction here? Because you're testing over 11,000 markers to see whether or not they're significant for one trait. Right.
So one thing that we could do is imagine that what we did was we scrambled the association between phenotypes and individuals. So we just randomized it and we did that a thousand times. And each time we did it, we computed the distribution of these lod scores.
Because we have broken the association between phenotype and genotype, the lod scores which we should be seeing if we did this randomization, should correspond to essentially noise. But we would see it random. So it's a null distribution we can look at. And so what we'll see is a distribution of lod scores.
This is the lod. This is the probability from a null, a permutation test. And since we actually have done the randomization over all 11,000 markers, we can directly draw a line and ask what are the chances that a lod score would be greater than or equal to a particular value at random?
And we can pick an area inside this tail, let's say 0.05, because that's what the authors of this particular paper used and ask what value of a lod score would be very unlikely to have by chance? It turns out in their first iteration, it was 2.63. That a lod score over 2.63 had a 0.05 chance or less of occurring in randomly permuted data.
And since a permuted data contained all of the markers, we don't have to do any multiple hypothesis correction. So you can directly compare the statistic that you compute against a threshold and accept any marker or QTL that has a lod score greater, in this case then 2.63 and put it in your model. And everything else you can reject.
And so you start by building a model out of all of the markers that are significant at this particular level. You then assemble the model and you can now predict phenotype from genotype. But of course, you're going to make errors, right. For each individual, there's going to be an error.
You're going to have a residual for each individual that is going to be the phenotype minus the genotype of the individual. So this is the error that you're making.
So what these folks did was that you first look at predicting the phenotype directly, and you pick all the QTLs that are significant at that level. And then you compute the residuals and you try and predict the residuals.
And you try and find additional QTLs that are significant after you have picked the original ones. OK.
So why might this produce more QTLs then the original pass? What do you think? Why is it that trying to predict the residuals is a good idea after you've tried to predict the phenotype directly? Any ideas about that?
Well, what this is telling us, is that these QTLs we're going to predict now were not significant enough in the original pass, but when we're looking at what's left over, after we subtract off the effect of all the other QTLs, other things might pop up. But in some sense, we're obscured by the original QTLs. Once we subtract off their influence, we can see things that we didn't see before.
And we start gathering up these additional QTLs to predict the residual components. And so they do this three times. So they predict the original set of QTLs and then they iterate three time on the residuals to find and fit a linear model that predicts a given trait from a collection of QTLs that they discover. Yes?
AUDIENCE: Sorry. I'm still confused. The second round? [INAUDIBLE] done three additional times? Is that right? So the--
AUDIENCE: Is it done on the remainder of QTL or on the original list of every--
PROFESSOR: Each time you expand your model to include all the QTLs you've discovered up to that point. So initially, you discover a set of QTLs, call that set one. You then compute a model using set one and you discover the residuals.
PROFESSOR: Correct. Well, residual [INAUDIBLE] so you use set one to build a model, a phenotype. So set one is used here to compute this, right. And so set one is used. And then you compute what's left over after you've discovered the first set of QTLs.
Now you say, we still have this left to go. Let's discover some more QTLs. And now you discover set two of QTLs. OK. And that set two then is used to build a model that has set one and set two in it. Right.
And that residual is used to discover set three and so forth. So each time you're expanding the set of QTLs by what you've discovered in the residuals. Sort of in the trash bin so to speak. Yes?
AUDIENCE: Each time you're doing this randomization to determine lod cutoff?
PROFESSOR: That's correct. Each time you have to redo the randomization and get to the lod cutoff.
AUDIENCE: But does that method actually work the way you expect it on the second pass, given that you have some false positives from the pass that you've now subtracted from your data?
PROFESSOR: I'm not sure I understand the question.
AUDIENCE: So the second time you do this randomization, and you again come up with a threshold, you say, oh, above here there are 5% false positives.
AUDIENCE: But could it be that that estimate is actually significantly wrong based the fact that you've subtracted off false positives before you do that process?
PROFESSOR: I mean, in some sense, what's your definition of a false positive? Right. I mean it gets down to that because we've discovered there's an association between that QTL and predicting phenotype. And in this particular world it's useful for doing that.
So it's hard to call something a false positive in that sense, right. But you're right, you actually have to reset your threshold every time that you go through this iteration. Good question. Other questions? OK.
So, let's see what happens when you do this. What happens is that if you look down the genome, you discover a collection. For example, this is growth in E6 berbamine.
And you can see the significant locations in the genome, the numbers 1 through 16 of the chromosomes and the little red asterisks above the peaks indicate that that was a significant lod score. The y-axis is a lod score.
And you can see the locations in the genome where we have found places that were associated with growth rate in that particular chemical. OK.
Now, why is it, do you think, that in many of those places you see sort of a rise and fall that is somewhat gentle as opposed to having an impulse function right at that particular spot?
AUDIENCE: Nearby snips are linked?
PROFESSOR: Yeah, nearby snips are linked. That as you come up to a place that is causal, you get a lot of other things are linked to that. And the closer you get, the higher the correlation is.
So that is for 1,000 segregants in the top. And what was discovered for that particular trait, was 15 different loci that explained 78% of the phenotypic variance. And in the bottom, the same procedure was used, but was only used on 100 segregants.
And what you can see is that, in this particular case, only two loci were discovered that explain 21% of the variance. So the bottom study was grossly under powered.
Remember we talked about the problem of finding QTLs that had small effect sizes. And if you don't have enough individuals you're going to be under-powered and you can't actually identify all of the QTLs.
So this is a comparison of this. And of course, one of the things that you don't know is the environmental variance that you're fighting against. Because the number of individuals you need, depends both on the number of potential loci that you have.
The more loci you have, the more individuals you need to fight against the multiple hypotheses problem, which is taken care of by this permutation implicitly. And the more QTLs that contribute to a particular trait, the smaller they might be. And there you need more individuals to provide adequate power for your test.
And out of this model, however, if you look at for all the different traits, the predictive insight versus the observed phenotype, you can see that the model does a reasonably good job.
So the interesting things that came out of the study were that, first of all, it was possible to look at the effect sizes of each QTL. Now, the effect size in terms of fraction of variance explained of a particular marker, is the square of its coefficient. It's the beta squared.
So you can see here the histogram of effect sizes, and you can see that most QTLs have very small effects on phenotype where phenotype is scaled between 0 and 1 for this study.
So, most traits as described here have between 5 and 29 different QTL loci in the genome. They're used to describe them with a median of 12.
Now, the question the authors asked, was if they looked at the theoretical h squared that they computed for the F1s, how well did their model do? And you can see that their model does very well. That, in terms of looking at narrow sense heritability, they can recover almost all of it, all the time.
However, the problem comes here. Remember we talked about how to compute broad-sense heritability by looking at clones and computing environmental variance directly.
And so they were able to compute broad-sense heritability and compare that the narrow-sense heritability that they were able to actually achieve in the study. And you can see there are substantial gaps. So what could be making up those gaps? Why is it that this additive model can't explain growth rate in a particular condition?
So, the next thing that we're going to discover are some of the sources of this so-called missing heritability. But before I give you some of the stock answers that people in the field give, since this is part of our quest today to actually look into missing heritability, I'll put it to you, my panel of experts.
What could be causing this heritability to go missing? Why can't this additive model predict growth rate accurately, given it knows the genotype exactly? Yes.
AUDIENCE: [INAUDIBLE] that you wouldn't detect from looking at the DNA sequence.
PROFESSOR: So epidemic factors-- are you talking about protein factors or are you talking about epigenetic effects?
AUDIENCE: More of the epigenetic marks.
PROFESSOR: Epigenetic marks, OK. So it might be now, yeast doesn't have DNA methylation. It does have chromatin modifications in the form of histone marks. So it might be that there's some histone marks that are copied from generation to generation that are not counted for in our model. right? OK, that's one possibility. Great. Yes.
AUDIENCE: There could be more complex effects so two separate genes may come out, other than just adding. One could turn the other off. So it one's on, it could [INAUDIBLE].
PROFESSOR: Right. So those are called epistatic effects, or they're non-linear effects. They're gene-gene interaction effects. That's actually thought to be one of the major issues in missing heritability. What else could there be? Yes.
PROFESSOR: Right. So you're saying that there could be inherent noise that would cause there to be fluctuations in colony size that are unrelated to the genotype. And, in fact, that's a good point. And that's something that we're going to take care of with the environmental variance.
So we're going to measure how well individuals grow with exactly the same genotype in a given condition. And so that kind of fluctuation would appear in that variance term. And we're going to get rid of that.
But that's a good thought and I think it's important and not appreciated that there can be random fluctuations in that term. Any other ideas? So we have epistasis. We have epigenetics. We've got two E's so far. Anything else?
How about if there are a lot of different loci that are influencing a particular trait, but the effect sizes are very small. That we've captured, sort of the cream. We've skimmed off the cream.
So we get 70% of the variance explained, but the rest of the QTLs are small, right, and we can't see them. We can't see them because we don't have enough individuals. We're underpowered, right. We just-- more individuals more sequencing, right.
And that would be the only way to break through this and be able to see these very small effects. Because if the effects are small, in some sense, we're hosed. Right?
You just can't see them through the noise. All those effects are going to show up down here and we're going to reject them. Anything else, people can think about? Yes?
AUDIENCE: Could you content maybe the sum of some areas that are-- sorry, the addition sum of those guys that have low effects. Or is that not detectable by any [INAUDIBLE]?
PROFESSOR: Well, that's certainly what we're trying to do with residuals, right? This multi-round round thing is that we take all the things we can detect that have an effect with a conservative cut off and we get rid of them.
And then we say, oh, is there anything left? You know, that's hiding, sort of behind that forest, right. If we cut through the first line of trees, can we get to another collection of informative QTLs? Yeah.
AUDIENCE: I was wondering if this could be an overestimate also. Like, for example, if, when you throw out the variance for environmental conditions, the environmental conditions aren't as exact as we thought they were between two yeast growing in the same set, setup.
AUDIENCE: Then maybe you would inappropriately assign a variance to the environmental condition whereas some that could be, in fact-- something that wouldn't be explained by.
PROFESSOR: And probably the other way around. The other way around would be that you thought you had the conditions exactly duplicated, right. But when you actually did something else, they weren't exactly duplicated so you see bigger variance in another experiment. And it appears to be heritable in some sense. But, in fact, it would just be that you misestimated the environmental component.
So, there are a variety of things that we can think about, right. Incorrect heritability estimates. We can think about rare variance. Now in this particular study we're looking at everything, right. Nothing is hiding. We've got 50x sequencing. There are no variants hiding behind the bushes. They are all there for us to look at.
Structural variants-- well in this particular case, we know structural variants aren't present, but as you know, many kinds of mammalian cells exhibit structural variance and other kinds of bizarre behaviors with their chromosomes. Many common variants of low effect. We just talked about that. And epistasis was brought up. And this does not include epigenetics, I'll have to add that to the listen. It's a good point. OK.
And then we talked about this idea that epistasis is the case where we have nonlinear effects. So a very simple example of this is when you have little a and big B, and big A and big B together, they both had an effect. But little a, little b, have no effect. And big A and big B have no effect by themselves. So you have a pairwise interaction between these terms. Right.
So this is sort of the exclusive OR of two terms and that non-linear effect can never be captured when you're looking at terms one at a time. OK. Because looking one at a time looks like it has no effect whatsoever. And these effects, of course, could be more than pairwise, if you have a complicated network or pathway.
Now, what the authors did to examine this, is they looked at pairwise effects. So they considered all pairs of markers and asked whether or not, taken two at a time now, they could predict a difference in trait need. But what's the problem with this? How many markers did I say there were? 13,000, something like that.
All pairs of markers is a lot of pairs of markers. Right. And what happens to your statistical power when you get to that many markers? You have a serious problem. It goes right through the floor. So you really are very under-powered to detect these interactions.
The other thing they did was to try to get things a little bit better as they said, how about this. If we know that a given QTL is always important for a trait because we discovered it in our additive model. Well consider its pairwise interaction with all the other possible variants.
So instead of now 13,000 squared, it's only going to be like 22 different QTLs for a given trait times 13,000 to reduce the space of search. Obviously I got this explanation not completely clear. So let me try one more time. OK.
The naive way to go at looking at pairwise interactions is consider all pairs and ask whether or not all pairs have an influence on a particular trait value. Right. We've got that much? OK.
Now let's suppose we don't want to look at all pairs. How could we pick one element of the pair to be interesting, but smaller in number? Right. So what we'll do is, for a given trait, we already know which QTLs are important for it because we've built our model already.
So let's just say, for purpose of discussion, there are 20 QTLs that are important for this trait. We'll take each one of those 20 QTLs and we'll examine whether or not it has a pairwise interaction with all of the other variance. And that will reduce our search base. Is that better? OK, good.
So, when they did that, they did find some pairwise interactions. In 24 of their 46 traits had pairwise interactions and here is an example. And you can see the dot plot, or the upper right-hand part of this slide, how when you BYBY. You have a lower phenotypic value then when you have just any RM component on the right-hand side.
So those were two different snips on chromosome 7 and chromosome 11 and showing how they interact with one another in a non-linear way. If they were linear, then as you added either a chromosome at 7 or a chromosome 11 contribution it would go up a little bit.
Here, as soon as you add either contribution from RM, it goes all way up to have a mean of zero or higher. In this particular case, 71% of the gap between broad-sense and narrow-sense was explained by this one pair interaction.
So it is the case that pairwise interactions can explain some of the missing heritability. Can anybody think of anything else they can explain missing heritability? OK.
What's inherited? Let's make a list of everything that's inherited from the parental line to the F1s. OK. Yes.
AUDIENCE: I mean, because there's a lot more things inherited. The protein levels are inherited.
AUDIENCE: [INAUDIBLE] are inherited as well.
PROFESSOR: Good. I like this line of thinking.
PROFESSOR: There are a lot of things that are inherited, right? So what's inherited? Some proteins are probably inherited, right? What is replicable through generation to generation as a genetic material that's inherited?
Let's just talk about that for a moment. Proteins are interesting, don't get me wrong. I mean, prions and other things are very interesting. But what else is inherited? OK, yes?
PROFESSOR: So there are other genetic molecules. Let's just take a really simple one-- mitochondria. OK. Mitochondria are inherited. And it turns out that these two strains have can have different mitochondria. What else can be inherited?
Well, we were doing these experiments with our colleagues over at the Whitehead and for a long time we couldn't figure out what was going on. Because we would do the experiments on day one and they come out a particular way and on day two they come out a different way. Right.
And we're doing some very controlled conditions. Until we figured out that everybody uses S288C which is the genetic nomenclature for the lab trained yeast, right. It's lab train because it's very well behaved. It's a very nice yeast. It grows very well. It's been selected for that, right.
And people always do genetic studies by taking S288C, which is the lab yeast, which has being completely sequenced and so you want to use it because you can download the genome with a wild strain. And wild strains come from the wild, right.
And they come either off of people who have yeast infections. I mean, human beings, or they come off of grape vines or God knows where, right. But they are not well behaved. And why are they not well behaved?
What makes these yeast particularly rude? Well, the thing that makes them particularly rude is that they have things like viruses in them. Oh, no. OK. Because what happens is that when you take a yeast that has a virus in it, and you cross it with a lab yeast, right. All of the kids got the virus. Yuck. OK.
And it turns out that the so-called killer virus in yeast interacts with various chromosomal changes. And so now you have interactions-- genetic interactions between a viral element and the chromosome.
And so the phenotype you get out of particular deletions in the yeast genome has to do with whether or not it's infected with a particular virus. It also has to do with which mitochondrial content it has.
And people didn't appreciate this until recently because most of the past yeast studies for QTLs were busy crossing lab strains with wild strains and whether it was ethanol tolerance or growth and heat, a lot of the strains came up with a gene as a significant QTL, which was MKT1.
And people couldn't understand why MKT1 was so popular, right. MKT1, maintenance of killer toxin one. Yeah. That's the viral thing that enables-- the chromosomal thing that enables a viral competence.
So, it turns out that if you look at this-- in this particular case, we're looking at yeast that don't have the virus in the bottom little photograph there. You can see they're all sort of, you know, they're growing similarly.
And the yeast with the same genotype above-- those are all in tetrads. Two out of the four are growing, the other two are not, because the other two have a particular deletion.
And if you look at the model-- a deletion only model, the deletion only, only looks at the chromosomal compliment doesn't predict the variance very well. And if you look at the deletion and whether or not you have the virus, you do better.
But you do even better, if you allow for there to be a nonlinear interaction between the chromosomal modification and whether or not you have a virus. And then you recover almost all of missing heritability.
So I'll leave you with this thought, which is that genetics is complicated and QTLs are great, but don't forget that there are all sorts of genetic elements.
And on that note, next time we'll talk about human genetics. Have a great weekend until then. We'll see you. Take care.