Description: This lecture by Prof. Ernest Fraenkel is on protein interaction networks. He covers network models, including their structure and an analysis. He asks, "can we use networks to predict function?" He ends with a data integration example.
Instructor: Prof. Ernest Fraenkel
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PROFESSOR: Going to finish up a little bit from last time on gene regulatory networks and see how the different methods that we looked at compared, and then we'll dive into protein interaction networks. Were there any questions from last time?
OK. Very good. So recall that we started off with this DREAM challenge, in which they provided unlabeled gene expression data, either for a completely synthetic case, in silico data, or for three different actual experiments-- one in E. coli, one in S. cerevisiae, and one in S. aureus. For some of those, it was straight expression data under different conditions. In other cases, there were actual knock-down experiments or other kinds of perturbations. And then they gave that data out to the community and asked people to use whatever methods they wanted to try to rediscover automatically the gene regulatory networks.
So with some preliminary analysis, we saw that there were a couple of main clusters of kinds of analyses that all had similar properties across these data sets. There were the Bayesian networks, that we've discussed now in two separate contexts. And then we looked at regression-based techniques and mutual information based techniques. And there were a bunch of other kinds of approaches. And some of them actually combine multiple predictors from different kinds of algorithms together. And then they evaluated how well each of these did on all the different data sets.
So first the results on the in silico data, and they're showing this as an area under the precision-recall curve. Obviously, higher numbers are going to be better here. So in this first group over here are the regression-based techniques, then mutual information, correlation, Bayesian networks. And then there are things that didn't fall into any of those particular categories.
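As a rough illustration of the score on these plots, here is a minimal sketch of one common way to approximate the area under a precision-recall curve (average precision). The function name and the toy rankings are assumptions for illustration; the lecture does not specify how the challenge organizers computed their scores.

```python
def average_precision(ranked_labels):
    """Approximate the area under the precision-recall curve.

    ranked_labels: 0/1 flags for predicted regulatory edges, ordered from
    the most confident prediction to the least (1 = edge is in the
    reference network). Hypothetical input format.
    """
    total_true = sum(ranked_labels)
    if total_true == 0:
        return 0.0
    hits = 0
    area = 0.0
    for rank, is_true in enumerate(ranked_labels, start=1):
        if is_true:
            hits += 1
            area += hits / rank  # precision at the point where recall increases
    return area / total_true


perfect = average_precision([1, 1, 0, 0])  # true edges ranked first
mixed = average_precision([1, 0, 1, 0])    # one false edge ranked above a true one
```

A ranking that places every true edge first scores 1.0; pushing true edges further down the list lowers the score, which is why higher numbers are better.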
Meta were techniques that use more than one class of prediction and then develop their own prediction based on those individual techniques. Then they defined something that they call the community prediction, in which they combine predictions from many of the different techniques, using their own algorithms, to come up with what they call the "wisdom of the crowds." And then R represents a random collection of other predictions.
And you can see that on these in silico data, the performances don't dramatically differ one from the other. Within each class, if you look at the best performer in each class, they're all sort of in the same league. Obviously, some of the classes do better consistently.
Now their point in their analysis is about the wisdom of the crowds, that taking all these data together, even including some of the bad ones, is beneficial. That's not the main thing that I wanted to get out of these data for our purposes. So these E. coli data-- notice, though, that the area under the curve is about 30-something percent. Now this is-- oh, sorry, this is in silico data.
Now this is the first real experimental data we'll look at, so this is E. coli data. And notice the change of scale, that the best performer is only achieving less than 10% of the optimal result. So you can see that the real data are much, much harder than the in silico data.
And here the performance varies quite a lot. You can see that the Bayesian networks are struggling, compared to some of the other techniques. The best of those doesn't really get close to the best of some of these other approaches.
So what they did next was they took some of the predictions from their community predictions that were built off of all these other data, and they went and actually tested some of these. So they built regulatory networks for E. coli and for S. aureus. And then they actually did some experiments to test them.
I think the results overall are kind of encouraging, in the sense that if you focus on the top pie chart here, of all the things that they tested, about half of them, they could get some support. In some cases, it was very strong support. In other cases, it wasn't quite as good. So the glass is half empty or half full.
But also, one of the interesting things is that the data are quite variable over the different predictions that they make. So each one of these circles represents a regulator, and the things that they claim are targets of that regulator. And things that are in blue are things that were confirmed by their experiments. The things with black outlines and blue are the controls. So they knew that these would be right.
So you could see that for PurR, they do very well. For some of these others, they do a mediocre job. But there are some, which they're honest enough to admit, they do very poorly on. So they didn't get any of their predictions right for this regulator. And this probably reflects the kind of data that they had, in terms of what conditions were being tested.
So, so far, things look reasonable. I think the real shocker of this paper does not appear in the abstract or the title. But it is in one of the main figures, if you pay attention.
So these were the results for in silico data. Everything looked pretty good. Change of scale to E. coli, there's some variation. But you can make arguments.
These are the results for Saccharomyces cerevisiae. So this is the organism, yeast, on which most of the gene regulatory algorithms were originally developed. And people actually built careers off of saying how great their algorithms were in reconstructing these regulatory networks.
And we look at these completely blinded data, where people don't know what they're looking for. You could see that the actual results are rather terrible. So the area under the curve is in the single digits of percentage. And it doesn't seem to matter what algorithm they're using. They're all doing very badly. And the community predictions are no better-- in some cases, worse-- than the individual ones.
So this is really a stunning result. It's there in the data. And if you dig into the supplement, they actually explain what's going on, I think, pretty clearly.
Remember that all of these predictions are being made by looking for a transcriptional regulator that increases in its own expression or decreases in its own expression. And that change in its own expression is predictive of its targets. So the hypothesis is when you have more of an activator, you'll have more of its targets coming on. If you have less of an activator, you'll have less of the targets. And you look through all the data, whether it's by Bayesian networks or regression, to find those kinds of relationships.
Now what if those relationships don't actually exist in the data? And that's what this chart shows. So the green are genes that have no relationship with each other. And they're measuring here the correlation, across all the data sets, between pairs of genes that have no known regulatory relationship. The purple are ones that are targets of the same transcription factor. And the orange are ones where one is the activator or repressor of the other.
And in the in silico data, they give a very nice spread between the green, the orange, and the purple. So the co-regulated genes are very highly correlated with each other. The ones that are parent-child relationships-- a regulator and its target-- have a pretty good correlation, much, much different from the distribution that you see for the things that are not interacting. And on these data, the algorithms do their best.
Then you look at the E. coli data, and you can see that in E. coli, the curves are much closer to each other, but there's still some spread. But when you look at yeast-- again, this is where a lot of these algorithms were developed-- you could see there's almost no difference between the correlation between the things that have no relationship to each other, things that are co-regulated by the same regulatory protein, or those parent-child relationships. They're all quite similar.
And it doesn't matter whether you use correlation analysis or mutual information. Over here and in this right-hand panel, they've blown up the bottom part of this curve, and you can see how similar these are. So again, this is a mutual information spread for in silico data for E. coli and then for yeast.
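To make the two pairwise scores concrete, here is a minimal sketch of a Pearson correlation and a mutual information between the expression profiles of two genes. The function names, the toy profiles, and the use of already-discretized labels for mutual information are assumptions for illustration, not the procedure from the paper.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mutual_information(x, y):
    """Mutual information in bits between two sequences of discrete labels."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x, count_y = Counter(x), Counter(y)
    return sum(
        (c / n) * math.log2((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in joint.items()
    )


# Hypothetical regulator and target measured across four conditions.
regulator = [1.0, 2.0, 3.0, 4.0]
target = [2.0, 4.0, 6.0, 8.0]
r = pearson(regulator, target)

# Hypothetical discretized profiles: identical versus unrelated.
mi_same = mutual_information(["lo", "lo", "hi", "hi"], ["lo", "lo", "hi", "hi"])
mi_unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

If regulator-target pairs score no higher than unrelated pairs on either measure, as in the yeast panels, no algorithm built on those scores has a signal to work with.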
OK. So what I think we can say about the expression analysis is that expression data are very, very powerful for some things and are going to be rather poor for some other applications. So they're very powerful for classification and clustering. We saw that earlier.
Now what those clusters mean, that's this inference problem they're trying to solve now. And the expression data are not sufficient to figure out what the regulatory proteins are that are causing those sets of genes to be co-expressed-- at least not in yeast. And I think there's every expectation that if you did the same thing in humans, you would have the same result.
So the critical question then is if you do want to build models of how regulation is taking place in organisms, what do you do? And the answer is that you need some other kind of data. So one thing you might think, if we go back to this core analysis, like what's wrong? Why is it that these gene expression levels cannot be used to predict the regulatory networks?
And it comes down to whether RNA levels are predictive of protein levels. And a couple of groups have looked into this. One of the earlier studies was this one, from 2009, where they used microarray data and looked at mRNA expression levels versus protein levels.
And what do you see in this? You see that there is a trend. Right there, R squared is around 0.2, but that there's a huge spread. So that for any position on the x-axis, a particular level of mRNA, you can have 1,000-fold variation in the protein levels.
So a lot of people saw this and said, well, we know there are problems with microarrays. They're not really great at predicting mRNA levels or low in protein levels. So maybe this will all get better if we use mRNA-Seq. Now that turns out not to be the case.
So there was a very careful study published in 2012, where the group used microarray data, RNA-Seq data, and a number of different ways of calling the proteomics data. So you might say, well, maybe some of the problem is that you're not doing a very good job of inferring protein levels from mass spec data. And so they try a whole bunch of these different ways of calling the mass spec data. And then you should focus on the numbers in these columns for the average and the best correlations between the RNA data in these columns and the proteomic data in the rows. And you can see that in the best case scenario, you can get these up to 0.54 correlation, still pretty weak.
So what's going on? What we've been focusing on now is the idea that the RNA levels are going to be very well correlated with protein levels. And I think a lot of literature is based on hypotheses that are almost identical. But in reality, of course, there are a lot of processes involved.
There's the process of translation, which has a rate associated with it. It has regulatory steps associated with it. And then there are degradatory pathways.
So the RNA gets degraded at some rate, and the protein gets degraded at some rate. And sometimes those rates are regulated, sometimes they're not. Sometimes it depends on the sequence.
So what would happen if you actually measured what's going on? And that was done recently in this paper in 2011, where the group used a labeling technique for proteins to [INAUDIBLE] and measure steady state levels of proteins and then label the proteins at specific times and see how much newly synthesized protein there was at various times. And similarly, for RNA, they used a technology that allowed them to separate newly synthesized transcripts from the bulk RNA. And once you have those data, then you can find out what the spread is in the half lives of proteins and the abundance of proteins.
So if you focus on the left-hand side, these are the determined half lives for various RNAs in blue and proteins in red. If you look at the spread in the red ones, you've got at least three orders of magnitude of range in half lives for proteins. So that's really at the heart of why RNA levels are very poorly predictive of protein levels, because there's such a range in the stability of proteins. And the RNAs also spread over probably about one or two orders of magnitude in stability. And then here are the abundances. So you can see that the range of abundance for average copies per cell of proteins is extremely large, from 100 to 10 to the eighth copies per cell.
Now if you look at the degradation rates for protein half lives and RNA half lives, you can see there's no correlation. So these are completely independent processes that determine whether an RNA is degraded or a protein is degraded. So then when you try to figure out what the relationship is between RNA levels and protein levels, you really have to resort to a set of differential equations to map out what all the rates are. And if you know all those rates, then you can estimate what the relationships will be.
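A minimal sketch of the kind of rate equations this refers to, assuming the simplest two-step model with constant rates. The symbols R (mRNA level), P (protein level), v_sr (transcription rate), k_sp (translation rate), and k_dr and k_dp (RNA and protein degradation rates) are illustrative names, not notation from the lecture or the paper.

```latex
\begin{aligned}
\frac{dR}{dt} &= v_{sr} - k_{dr}\,R, \\
\frac{dP}{dt} &= k_{sp}\,R - k_{dp}\,P, \\
P_{\mathrm{ss}} &= \frac{k_{sp}}{k_{dp}}\,R_{\mathrm{ss}},
\qquad t_{1/2} = \frac{\ln 2}{k_{d}} .
\end{aligned}
```

Under this simplified model, two genes with the same steady-state mRNA level but protein half lives that differ 1,000-fold would differ 1,000-fold in protein level, which is consistent with the spread seen in the scatter plots.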
And so they did exactly that. And these charts show what they inferred to be the contribution of each of these components to protein levels. So on the left-hand side, these are from cells which had the most data. And they build a model on the same cells from which they collected the data. And in these cells, the RNA levels account for about 40% of the protein levels, the variance.
And the biggest thing that affects the abundance of proteins is rates of translation. And then they took the data built from one set of cells and tried to use it to predict outcomes in another set of cells in replicate. And the results are kind of similar. They also did it for an entirely different kind of cell types.
In all of these cases, the precise amounts are going to vary. But you can see that the red bars, which represent the amount of information content in the RNA, is less than about half of what you can get from other sources. So this gets back to why it's so hard to infer regulatory networks solely from RNA levels.
So this is the plot that they get when they compare protein levels and RNA levels at the experimental level. And again, you see that big spread and R squared at about 0.4, which at the time, they were very proud of. They write several times in the article, this is the best anyone has seen to date.
But if you incorporate all these other pieces of information about RNA stability and protein stability, you can actually get a very, very good correlation. So once you know the variation in the protein stability and the RNA stability for each and every protein and RNA, then you can do a good job of predicting protein levels from RNA levels. But without all that data, you can't. Any questions on this?
So what are we going to do then? So we really have two primary things that we can do. We can try to explicitly model all of these regulatory steps and include those in our predictive models and try to build up gene regulatory networks, protein models that actually include all those different kinds of data. And we'll see that in just a minute.
And the other thing we can try to do is actually, rather than try to focus on what's downstream of RNA synthesis, the protein levels, we can try to focus on what's upstream of RNA synthesis and look at what the production of RNAs-- which RNAs are getting turned on and off-- tell us about the signaling pathways and the transcription factors. And that's going to be a topic of one of the upcoming lectures in which Professor Gifford will look at variations in epigenomic data and using those variations in epigenomic data to identify sequences that represent which regulatory proteins are bound under certain conditions and not others. Questions? Yeah?
AUDIENCE: In a typical experiment, the rate constants for how many mRNAs or proteins can be estimated?
PROFESSOR: So the question was how many rate constants can you estimate in a typical experiment? So I should say, first of all, they're not typical experiments. Very few people do this kind of analysis. It's actually very time consuming, very expensive.
So I think in this one, it was-- I'll get the numbers roughly wrong-- but it was thousands. It was some decent fraction of the proteome, but not the entire one. But most of the data set's papers you'll read do not include any analysis of stability rates, degradation rates. They only look at the bulk abundance of the RNAs. Other questions?
OK. So this is an upcoming lecture where we're going to actually try to go backwards. We're going to say, we see these changes in RNA. What does that tell us about what regulatory regions of the genome were active or not? And then you could go upstream from that and try to figure out the signaling pathways.
So if I know changes in RNA, I'll deduce, as we'll see in that upcoming lecture-- the sequences-- the identity of the DNA binding proteins. And then I could try to figure out what the signaling pathways were that drove those changes in gene expression.
Now later in this lecture, we'll talk about the network modeling problem. Assuming you knew these transcription factors, what could you do to infer this network? But before we go to that, I'd like to talk about an interesting modeling approach that tries to take into account all these degradatory pathways and look specifically at each kind of regulation as an explicit step in the model and see how that copes with some of these issues.
So this is work from Josh Stuart. And one of the first papers is here. We'll look at some later ones as well. And the idea here is to explicitly, as I said, deal with many, many different steps in regulation and try to be quite specific about what kinds of data are informing about what step in the process.
So we measure the things in the bottom here-- arrays that tell us how many copies of a gene there are in the genome, especially in cancer. And you can get big changes of what are called copy number, amplifications, or deletions of large chunks of chromosomes. You need to take that into account.
All the RNA-Seq and microarrays that we were talking about in measuring transcription levels-- what do they actually tell us? Well, they give us some information about what they're directly connected to. So the transcriptomic data tells something about the expression state. But notice they have explicitly separated the expression state of the RNA from the protein level. And they separated the protein level from the protein activity.
And they have these little black boxes in here that represent the different kinds of regulations. So however many copies of a gene you have in the genome, there's some regulatory event, transcriptional regulation, that determines how much expression you get at the mRNA level. There's another regulatory event here that determines at what rate those RNAs are turned into proteins. And there are other regulatory steps here that have to do with signaling pathways, for example, that determine whether those proteins are active or not. So we're going to treat each of those as separate variables in our model that are going to be connected by these black boxes.
So they call their algorithm "PARADIGM," and they developed it in the context of looking at cancer data. In cancer data, the two primary kinds of information they had were the RNA levels from either microarray or RNA-Seq and then these copy number variations, again, representing amplifications or deletions of chunks of the genome. And what they're trying to infer from that is how active different components are of known signaling pathways.
Now the approach that they used that involved all of those little black boxes is something called a factor graph. And factor graphs can be thought of in the same context as Bayesian networks. In fact, Bayesian networks are a type of factor graph. So if I have a Bayesian network that represents these three variables, where they're directly connected by edges, in a factor graph, there would be this extra kind of node-- this black box or red box-- that's the factor that's going to connect them.
So what do these things do? Well, again, they're bipartite graphs. They always have these two different kinds of nodes-- the random variables and the factors. And the reason they're called factor graphs is they describe how the global function-- in our case, it's going to be the global probability distribution-- can be broken down into factorable components, which can be combined in a product to give the global probability function.
So if I have some global function over all the variables, you can think of this again, specifically, as the probability function-- the joint probability for all the variables in my system-- I want to be able to divide it into a product of individual terms, where I don't have all the variables in each of these f's. They're just some subset of variables. And each of these represents one of these terms in that global product. The only things that are in this function, are things to which it's directly connected. So these edges exist solely between a factor and the variables that are terms in that equation. Is that clear?
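Written out, the factorization just described takes the following general form. The symbols g, f_j, and X_j are notation assumed here for illustration: g is the global function, each f_j is one factor, and X_j is the subset of variables connected to that factor by an edge.

```latex
g(x_1, \dots, x_n) \;=\; \prod_{j} f_j(X_j),
\qquad X_j \subseteq \{x_1, \dots, x_n\}
```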
So in this context, the variables are going to be nodes. And their allowed values are going to be whether they're activated or not activated. The factors are going to describe the relationships among those variables. We previously saw those as being cases of regulation. Is the RNA turned into protein? Is the protein activated?
And what we'd like to be able to do is compute marginal probabilities. So we've got some big network that represents our understanding of all the signaling pathways and all the transcriptional regulatory networks in a cancer cell. And we want to ask about a particular pathway or a particular protein, what's the probability that this protein or this pathway is activated, marginalized over all the other variables?
So that's our goal. Our goal is to find a way to compute these marginal probabilities efficiently. And how do you compute a marginal? Well, obviously you need to sum over all the configurations of all the variables that have your particular variable at its value.
So if I want to know if MYC and MAX are active, I set MYC and MAX equal to active. And then I sum over all the configurations that are consistent with that. And in general, that would be hard to do. But the factor graph gives us an efficient way of figuring out how to do that. I'll show you in a second.
So I have some global function. In this case, this little factor graph over here, this is the global function. Now remember, these represent the factors, and they only have edges to things that are terms in their equations. So fE, over here, is a function of x3 and x5. And so it has edges to x3 and x5, and so on for all of them.
And if I want to explicitly compute the marginal with respect to a particular variable, so the marginal with respect to x1 set equal to a, so I'd have this function with x1 equal to a times the sum over all possible states of x2, the sum over all possible states of x3, x4, and x5. Is that clear? That's just the definition of a marginal.
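The definition can be spelled out as a brute-force sum. This is a minimal sketch under stated assumptions: the variables are binary, the factorization is the five-variable form fA(x1) fB(x2) fC(x1,x2,x3) fD(x3,x4) fE(x3,x5) (the transcript confirms only which variables fD and fE touch), and all the numbers in the factor tables are made up for illustration.

```python
from itertools import product

STATES = (0, 1)  # assumption: every variable is binary


# Hypothetical factor tables; only the variable sets of f_d and f_e
# are stated in the lecture, the rest is assumed for illustration.
def f_a(x1):
    return 0.7 if x1 else 0.3


def f_b(x2):
    return 0.6 if x2 else 0.4


def f_c(x1, x2, x3):
    return 0.9 if x3 == x1 * x2 else 0.1


def f_d(x3, x4):
    return 0.8 if x3 == x4 else 0.2


def f_e(x3, x5):
    return 0.75 if x3 == x5 else 0.25


def g(x1, x2, x3, x4, x5):
    """Global function: the product of all the factors."""
    return f_a(x1) * f_b(x2) * f_c(x1, x2, x3) * f_d(x3, x4) * f_e(x3, x5)


def brute_force_marginal(a):
    """Marginal for x1 = a: sum g over every state of x2, x3, x4, x5."""
    return sum(g(a, x2, x3, x4, x5)
               for x2, x3, x4, x5 in product(STATES, repeat=4))


marginal_x1_active = brute_force_marginal(1)
```

With n binary variables this sum has 2 to the (n - 1) terms, which is why a more efficient scheme is needed for networks of realistic size.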
They introduced a notation in factor graphs that's called a "not-sum." It's rather terrible, but the not-sum or summary. So I like this term, summary, better. The summary over all the variables. So if I want to figure out the summary for x1, that's the sum over all the other variables of all their possible states when I set x1 equal to a, in this case.
So it's purely a definition. So then I can rewrite-- and you can work this through by hand after class-- but I can rewrite this, which is this intuitive way of thinking of the marginal, in terms of these not-sums, where each one of these is over all the other variables that are not the one that's in the brackets. So that's just the definition.
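For reference, here is how the rewritten marginal looks in not-sum notation, assuming the five-variable factorization fA(x1) fB(x2) fC(x1,x2,x3) fD(x3,x4) fE(x3,x5). The transcript confirms only which variables fD and fE touch; the other factors are an assumption for illustration. The symbol \sum_{\sim\{x\}} denotes the sum over all variables except x.

```latex
g_1(x_1)
  \;=\; \sum_{\sim\{x_1\}} g(x_1, \dots, x_5)
  \;=\; f_A(x_1) \sum_{\sim\{x_1\}} \Bigg( f_B(x_2)\, f_C(x_1, x_2, x_3)
        \Big( \sum_{\sim\{x_3\}} f_D(x_3, x_4) \Big)
        \Big( \sum_{\sim\{x_3\}} f_E(x_3, x_5) \Big) \Bigg)
```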
OK, this hasn't really helped us very much, if we don't have some efficient way of computing these marginals. And that's what the factor graph does. So we've got some factor graph. We have this representation, either in terms of graph or equation, of how the global function can be partitioned.
Now if I take any one of these factor graphs, and I want to compute a marginal over a node, I can re-draw the factor graph so that variable of interest is the root node. Right? Everyone see that these two representations are completely equivalent? I've just yanked x1 up to the top. So now this is a tree structure.
So this is that factor graph that we just saw drawn as a tree. And this is what's called an expression tree, which is going to tell us how to compute the marginal over the structure of the graph. So this is just copied from the previous picture. And now we're going to come up with a program for computing these marginals, using this tree structure.
So first I'm going to compute that summary function-- the sum over all sets of the other variables for everything below this point, starting with the lowest point in the graph. And we can compute the summary function there. And that's this term, the summary for x3 of just this fE. I do the same thing for fD, the summary for it.
And then I go up a level in the tree, and I multiply the summary for everything below it. So I'm going to compute the product of the summary functions. And I always compute the summary with respect to the parent. So here the parent was x3, for both of these. So these are summaries with respect to x3.
Here who's the parent? x1. And so the summary is to x1. Yes?
AUDIENCE: Are there directed edges? In the sense that in f, in the example on the right, is fD just relating how x4 relates to x3?
PROFESSOR: That's exactly right. So the edges represent which factor you're related to. So that's why I can redraw it in any way. I'm always going to go from the leaves up. I don't have to worry about any directed edges in the graph. Other questions.
So what this does is it gives us a way to efficiently, over a complicated graph structure, compute marginals. And they're typically thought of in terms of messages that are being sent from the bottom of the graph up to the top. And you can have a rule for computing these marginals. And the rule is as follows.
Each vertex waits for the messages from all of its children, until it gets its-- the messages are accumulating their way up the graph. And every node is waiting until it hears from all of its progeny about what's going on. And then it sends the signal up above it to its parent, based on the following rules.
A variable node just takes the product of the children. And a factor node-- one of those little black boxes-- computes the summary for the children and sends that up to the parent. And it's the summary with respect to the parent, just like in the examples before. So this is a formula for computing single marginals.
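The two rules can be sketched as a pair of mutually recursive functions on a tree-shaped factor graph. This is a minimal sketch under stated assumptions, not the code of any published tool: the variables are binary, the factor graph is the assumed five-variable example fA(x1) fB(x2) fC(x1,x2,x3) fD(x3,x4) fE(x3,x5), and the factor tables are made-up numbers.

```python
from itertools import product

STATES = (0, 1)  # assumption: every variable is binary

# factor name -> (variables it touches, function); tables are hypothetical
FACTORS = {
    "fA": (("x1",), lambda x1: 0.7 if x1 else 0.3),
    "fB": (("x2",), lambda x2: 0.6 if x2 else 0.4),
    "fC": (("x1", "x2", "x3"),
           lambda x1, x2, x3: 0.9 if x3 == x1 * x2 else 0.1),
    "fD": (("x3", "x4"), lambda x3, x4: 0.8 if x3 == x4 else 0.2),
    "fE": (("x3", "x5"), lambda x3, x5: 0.75 if x3 == x5 else 0.25),
}


def variable_message(var, value, parent_factor):
    """Variable node rule: product of the messages from its child factors."""
    message = 1.0
    for name, (scope, _) in FACTORS.items():
        if var in scope and name != parent_factor:
            message *= factor_message(name, var, value)
    return message


def factor_message(name, parent_var, value):
    """Factor node rule: summary with respect to the parent variable.

    Multiply the factor by the messages from its child variables, then
    sum over every state of those children.
    """
    scope, func = FACTORS[name]
    children = [v for v in scope if v != parent_var]
    total = 0.0
    for assignment in product(STATES, repeat=len(children)):
        setting = dict(zip(children, assignment))
        setting[parent_var] = value
        term = func(*(setting[v] for v in scope))
        for child in children:
            term *= variable_message(child, setting[child], name)
        total += term
    return total


def marginal(var, value):
    """Treat var as the root of the tree and collect messages from below."""
    return variable_message(var, value, parent_factor=None)


p_x1_active = marginal("x1", 1)
p_x3_active = marginal("x3", 1)
```

The recursion terminates only because the graph is a tree; each call sums over the children of one factor rather than over every variable at once.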
Now it turns out-- I'm not going to go into details of this. It's kind of complicated. But you actually can, based on this core idea, come up with an efficient way of computing all of the marginals without having to do this separately for every single one. And that's called a message passing algorithm. And if you're really interested, you can look into the citation for how that's done.
So the core idea is that we can take a representation of our belief of how this global function-- in our case, it's going to be the joint probability-- factors in terms of particular biological processes. We can encode what we know about the regulation in that factor graph, the structure of the graph. And then we could have an efficient way of computing the marginals, which will tell us, given the data, what's the probability that this particular pathway is active?
So in this particular case, in this PARADIGM model, the variables can take three states-- activated, deactivated, or unchanged. And this is, in a tumor setting, for example, you might say the tumor is just like the wild type cell, or the tumor has activation with respect to the wild type, or it has a repression with respect to the wild type.
Again, this is the structure of the factor graph that they're using and the different kinds of information that they have. The primary experimental data are just these arrays that tell us about SNPs and copy number variation and then arrays or RNA-Seq to tell us about the transcript levels.
But now they can encode all sorts of rather complicated biological functions in the graph structure itself. So transcription regulation is shown here. Why is the edge from activity to here?
Because we don't want to just infer that if there's more of the protein, there's more activity. So we're actually, explicitly computing the activity of each protein. So if an RNA gets transcribed, it's because some transcription factor was active. And the transcription factor might not be active, even if the levels of the transcription factor are high. That's one of the pieces that's not encoded in all of those things that were in the DREAM challenge, that are really critical for representing the regulatory structure.
Similarly, protein activation-- I can have protein that goes from being present to being active. So think of a kinase, that itself needs to be phosphorylated to be active. So that would be that transition. Some other kinase comes in. And if that other kinase1 is active, then it can phosphorylate kinase2 and make that one active. And so it's pretty straightforward.
You can also represent the formation of a complex. So the fact that all the proteins are in the cell doesn't necessarily mean they're forming an active complex. So the next step then can be here. Only when I have all of them, would I have activity of the complex. We'll talk about how AND-like connections are formed.
And then they also can incorporate OR. So what does that mean? So if I know that all members of the gene family can do something, I might want to explicitly represent that gene family as an element of the graph-- a variable. Is any member of this family active? And so that would be done this way, where if you have an OR-like function here, then this factor would make this gene active if any of the parents are active.
So there, they give a toy example, where they're trying to figure out if the P53 pathway is active. So MDM2 is an inhibitor of P53. P53 can be an activator of apoptosis. And so separately, for MDM2 and for P53, they have the factor graphs that show the relationship between copy number variation and transcript level and protein level and activity. And those relate to each other. And then those relate to the apoptotic pathway.
So what they want to do then is take the data that they have, in terms of these pathways, and they want to compute the likelihood ratios. What's the probability of observing the data, given a hypothesis that this pathway is active and all my other settings of the parameters? And compare that to the probability of the data, given that that pathway is not active. So these are the kinds of likelihood ratios we've been seeing now in a couple of lectures.
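As a sketch, the ratio just described can be written as follows. The symbols are assumptions for illustration: D is the observed data, x_i is the variable for the pathway or protein of interest, a is the state being tested (for example, activated), and \Phi stands for all the other parameter settings.

```latex
\mathrm{LR}(i, a) \;=\;
\frac{P(D \mid x_i = a,\; \Phi)}{P(D \mid x_i \neq a,\; \Phi)}
```

A ratio well above 1 says the data are better explained by the pathway being in state a than by it not being in that state.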
So now it gets into the details of how you actually do this. So there's a lot of manual steps involved here. So if I want to encode a regulatory pathway as a factor graph, it's currently done in a manual way or semi-manual way.
You convert what's in the databases into the structure of a factor graph. And you make a series of decisions about exactly how you want to do that. You can argue with the particular decisions they made, but they're reasonable ones. People could do things differently.
So they convert the regulatory networks into graphs. And then they have to define some of the functions on this graph. So they define the expected state of a variable, based on the state of its parents. And they take a majority vote of the parents.
So a parent that's connected by a positive edge, meaning it's an activator, if the parent is active, then it contributes a plus 1 to the child. If it's connected by a repressive edge, then the parent being active would make a vote of minus 1 for the child. And you take the majority vote of all those votes. So that's what this says.
But the nice thing is that you can also incorporate logic. So for example, when we said, is any member of this pathway active? And you have a family member node. So that can be done with an OR function.
And there, it's these same factors that will determine-- so some of these edges are going to get labeled "maximum" or "minimum," that tell you what's the expected value of the child, based on the parent. So if it's an OR, then if any of the parents are active, then the child is active. And if it's AND, you need all of them.
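The voting and logic rules just described can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `expected_state`, the `rule` argument, and the encoding of states as -1 (repressed), 0 (unchanged), +1 (active) are assumptions made for the example.

```python
def expected_state(parent_states, edge_signs, rule="vote"):
    """Expected child state from its parents.

    parent_states: list of -1 (repressed), 0 (unchanged), +1 (active); assumed encoding
    edge_signs:    list of +1 (activating edge) or -1 (repressing edge)
    rule:          "vote" (majority vote), "or" (maximum), "and" (minimum)
    """
    # An active parent on an activating edge votes +1; on a repressing edge it votes -1.
    votes = [state * sign for state, sign in zip(parent_states, edge_signs)]
    if rule == "or":
        return max(votes)   # any active parent is enough
    if rule == "and":
        return min(votes)   # every parent must be active
    total = sum(votes)
    return (total > 0) - (total < 0)   # sign of the vote total


print(expected_state([1, 1, -1], [1, 1, 1]))        # majority vote of three activators
print(expected_state([0, 1, 0], [1, 1, 1], "or"))   # family node: any member active
print(expected_state([1, 1, 0], [1, 1, 1], "and"))  # complex: all members required
```

A tied vote returns 0 here, which is one possible convention; the lecture does not say how ties are broken.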
So you could have described all of these networks by Bayesian networks. But the advantage of a factor graph is that you're explicitly able to include all these steps to describe this regulation in an intuitive way. So you can go back to your models and understand what you've done, and change it in an obvious way.
Now critically, we're not trying to learn the structure of the graph from the data. We're imposing the structure of the graph. We still need to learn a lot of variables, and that's done using expectation maximization, as we saw in the Bayesian networks. And then, again, it's a factor graph, which primarily means that we can factor the global function into all of these factor nodes. So the total probability is normalized, but it's the product of these factors which have to do with just the variables that are connected to that factor node in the graph.
And this notation that you'll see if you look through this, this notation means the setting of all the variables consistent with something. So let's see that-- here we go. So this here, this is the setting of all the variables X, consistent with the data that we have-- so the data being the arrays, the RNA-Seq, if you had it.
And so we want to compute the marginal probability of some particular variable being at a particular setting, given the fully specified factor graph. And we just take the product of all of these marginals. Is that clear? Consistent with all the settings where that variable is set to x equals a. Questions? OK. And we can compute the likelihood function in the same way.
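For reference, the factorization, the marginal, and the likelihood ratio mentioned earlier can be written in standard factor-graph notation. The symbols here are textbook conventions assumed for this sketch rather than copied from the slide: \phi_j is a factor, X_j is the set of variables attached to it, Z is the normalizer, and D is the observed data.

```latex
P(X) = \frac{1}{Z} \prod_{j=1}^{m} \phi_j(X_j)
\qquad
P(x_i = a) = \frac{1}{Z} \sum_{X \,:\, x_i = a} \; \prod_{j=1}^{m} \phi_j(X_j)
\qquad
\mathrm{LR} = \frac{P(D \mid x_i = a)}{P(D \mid x_i \neq a)}
```

The sum in the middle expression runs over every setting of the variables that is consistent with x_i = a, which is the "consistent with" notation described above.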
So then what actually happens when you try to do this? So they give an example here in this more recent paper, where it's basically a toy example. But they're modeling all of these different states in the cells. So G are the number of genomic copies, T, the level of transcripts. Those are connected by a factor to what you actually measure.
So there is some true change in the number of copies in the cell. And then there's what appears in your array. There's some true number of copies of RNA in the cell. And then there's what you get out of your RNA-Seq.
So that's what these factors represent-- and then these are regulatory terms. So how much transcript you get depends on these two variables, the epigenetic state of the promoter and the regulatory proteins that interact with it. How much transcript gets turned into protein depends on regulatory proteins. And those are determined by upstream signaling events. And how much protein becomes active, again, is determined by the upstream signaling events. And then those can have effects on downstream pathways as well.
So then in this toy example, they're looking at MYC/MAX. They're trying to figure out whether it's active or not. So we've got this pathway. PAK2 represses MYC/MAX. MYC/MAX activates these two genes and represses this one.
And so if these were the data that we had coming from copy number variation, DNA methylation, and RNA expression, then I'd see that the following states of the downstream genes-- this one's active. This one's repressed. This one's active. This one's repressed.
They infer that MYC/MAX is active. Oh, but what about the fact that this one should also be activated? That can be explained away by the fact that there's a difference in the epigenetic state between ENO1 and the other two.
And then the belief propagation allows us to transfer that information upward through the graph to figure out, OK, so now we've decided that MYC/MAX is active. And that gives us information about the state of the proteins upstream of it and the activity then of PAK2, which is a repressor of MYC/MAX. Questions on the factor graphs specifically or anything that's come up until now?
So this has all been reasoning on known pathways. One of the big promises of these systematic approaches is the hope that we can discover new pathways. Can we discover things we don't already know about? And for this, we're going to look at interactome graphs, so graphs that are built primarily from high throughput protein-protein interaction data, but could also be built, as we'll see, from other kinds of large-scale connections.
And we're going to look at what the underlying structure of these networks could be. And so they could arise from a graph where you put an edge between two nodes if they're co-expressed, if they have high mutual information. That's what we saw in say, ARACNE, which we talked about a lecture ago. Or, if say, two-hybrid and affinity capture mass spec indicated a direct physical interaction or say a high throughput genetic screen indicated a genetic interaction.
These are going to be very, very large graphs. And we're going to look at some of the algorithmic problems that we have dealing with huge graphs and how to compress the information down so we get some piece of the network that's quite interpretable. And we'll look at various kinds of ways of analyzing these graphs that are listed here.
So one of the advantages of dealing with data in the graph formulation is that we can leverage the fact that computer science has dealt with large graphs for quite a while now, often in the context of telecommunications. Now big data, Facebook, Google-- they're always dealing with things in a graph formulation. So there are a lot of algorithms that we can take advantage of.
We're going to look at how to use quick distance calculations on graphs. And we'll look at that specifically in an example of how to find the kinase target relationships. Then we'll look at how to cluster large graphs to find subgraphs that either represent an interesting topological feature of the inherent structure of the graph or perhaps represent active pieces of the network. And then we'll look at other kinds of optimization techniques to help us find the part of the network that's most relevant to our particular experimental setting.
So let's start with ostensibly a simple problem. I have a lot of protein phosphorylation data. I'd like to figure out what kinase it was that phosphorylated a particular protein.
So let's say I have this protein that's involved in cancer signaling, Rad50. And I know it's phosphorylated at these two sites. And I have the sequences of those sites. So what kinds of tools do we have at our disposal if I have a set of sequences that I believe are phosphorylated, that would help me try to figure out what kinase did the phosphorylation? Any ideas?
So if I know the specificity of the kinases, what could I do? I could look for a sequence match between the specificity of the kinase and the sequence of the protein, right? In the same way that we can look for a match between the specificity of a transcription factor and the region of the genome to which it binds.
So if I have a library of specificity motifs for different kinases, where every position here represents a piece of the recognition element, and the height of the letters represents the information content, I can scan those. And I can see what family of kinases is most likely to be responsible for phosphorylating these sites.
But again, those are families of kinases. There are many individual members of each of those families. So how do I find the specific member of that family that's most likely to carry out the regulation?
So here's what happens in this paper. It's called [? "Network." ?] And they say, well, let's use the graph properties. Let's try to figure out which proteins are physically linked relatively closely in the network to the target.
So in this case, they've got Rad50 over here. And they're trying to figure out which kinase is regulating it. So here are two kinases that have similar specificity. But this one's directly connected in the interaction network, so it's more likely to be responsible.
And here's the member of the kinase family that seems to be consistent with the sequence being phosphorylated over here. It's not directly connected, but it's relatively close. And so that's also a highly probable member, compared to one that's more distantly related. So in general, if I've got a set of kinases that are all equally good sequence matches to the target sequence, represented by these dashed lines, but one of them is physically linked as well, perhaps directly and perhaps indirectly, I have higher confidence in this kinase because of its physical links than I do in these.
So that's fine if you want to look at things one by one. But if you want to look at this at a global scale, we need very efficient algorithms for figuring out what the distance is in this interaction network between any kinase and any target. So how do you go about efficiently computing distances? Well that's where converting things into a graph structure is helpful.
So when we talk about graphs here, we mean sets of vertices and the edges that connect them. The vertices, in our case, are going to be proteins. The edges are going to perhaps represent physical interactions or some of these other kinds of graphs we talked about.
These graphs can be directed, or they can be undirected. Undirected would be what? For example, say two hybrid.
I don't know which one's doing what to which. I just know that two proteins can come together. Whereas a directed edge might be this kinase phosphorylates this target. And so it's a directed edge.
I can have weights associated with these edges. We'll see in a second how we can use that to encode our confidence that the edge represents a true physical interaction. We can also talk about the degree, the number of edges that come into a node or leave a node.
And for our point, it's rather important to talk about the path, the set of vertices that can get me from one node to another node, without ever retracing my steps. And we're going to talk about path length, so if my graph is unweighted, that's just the number of edges along the path. But if my graph has edge weights, it's going to be the sum of the edge weights along that path. Is that clear?
And then we're going to use an adjacency matrix to represent the graphs. So I have two completely equivalent formulations of the graph. One is the picture on the left-hand side, and the other one is the matrix on the right-hand side, where a 1 between any row and column represents the presence of an edge.
So the only edge connecting node 1 goes to node 2. Whereas, node 2 is connected both to node 1 and to node 3. Hopefully, that agrees. OK. Is that clear? And if I have a weighted graph, then instead of putting zeros or ones in the matrix, I'll put the actual edge weights in the matrix.
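As a small sketch of the two equivalent representations, the snippet below builds the adjacency matrix from an edge list. The three-node graph (1 linked to 2, 2 linked to 3) is an assumption chosen to match the description of the slide, and `adjacency_matrix` is a hypothetical helper name.

```python
def adjacency_matrix(n_nodes, edges):
    """Symmetric adjacency matrix (list of lists) for an undirected graph.

    edges: iterable of (u, v) or (u, v, weight), with node labels 1..n_nodes.
    Unweighted edges are stored as 1; weighted edges store the weight itself.
    """
    A = [[0] * n_nodes for _ in range(n_nodes)]
    for edge in edges:
        u, v = edge[0], edge[1]
        w = edge[2] if len(edge) == 3 else 1
        A[u - 1][v - 1] = w
        A[v - 1][u - 1] = w   # undirected: mirror the entry
    return A


A = adjacency_matrix(3, [(1, 2), (2, 3)])
for row in A:
    print(row)

# The degree of a node in an unweighted graph is its row sum.
degree_of_node_2 = sum(A[1])
```

For a directed graph, such as kinase-to-target edges, the mirrored assignment would simply be dropped.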
So there are algorithms that exist for efficiently finding shortest paths in large graphs. So we can very rapidly, for example, compute the shortest path between any two nodes, based solely on that adjacency matrix. Now why are we going to look at weighted graphs? Because that gives us the way to encode our confidence in the underlying data.
So because the total distance in network is the sum of the edge weights, if I set my edge weights to be negative log of a probability, then if I sum all the edge weights, I'm taking the product of all those probabilities. And so the shortest path is going to be the most probable path as well, because it's going to be the minimum of the sum of the negative log. So it's going to be the maximum of the joint probability. Is that clear? OK. Very good.
So by encoding our network as a weighted graph, where the edge weights are minus log of the probability, then when I use these standard algorithms for finding the shortest path between any two nodes, I'm also getting the most probable path between these two proteins. So where do these edge weights come from? So if my network consists say of affinity capture mass spec and two hybrid interactions, how would I compute the edge weights for that network?
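To make the minus-log trick concrete, here is a minimal sketch using Dijkstra's algorithm, one standard shortest-path method; the lecture does not name a specific algorithm. The edge probabilities and node names are invented for illustration, and `most_probable_path` is a hypothetical function name.

```python
import heapq
import math


def most_probable_path(edge_probs, source, target):
    """Dijkstra on weights -log(p); returns (path, joint probability)."""
    graph = {}
    for (u, v), p in edge_probs.items():
        w = -math.log(p)                      # high confidence -> short edge
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    # Summed -log weights convert back to a product of probabilities.
    return path, math.exp(-dist[target])


# Hypothetical confidences: the two-step route (0.9 * 0.8) beats the direct edge (0.5).
edge_probs = {
    ("kinase", "A"): 0.9,
    ("A", "target"): 0.8,
    ("kinase", "target"): 0.5,
}
path, prob = most_probable_path(edge_probs, "kinase", "target")
print(path, round(prob, 2))
```

Note that the direct edge loses here even though it has fewer hops, because path length is measured in confidence rather than in edge count.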
We actually explicitly talked about this just a lecture or two ago. So I have all this affinity capture mass spec, two hybrid data. And I want to assign a probability to every edge that tells me how confident I am that it's real. So we already saw that in the context of this paper where we use Bayesian networks and gold standards to compute the probability for every single edge in the interactome.
OK. So that works pretty well if you can define the gold standards. It turns out that that has not been the most popular way of dealing with mammalian data. It works pretty well for yeast, but it's not what's used primarily in mammalian data.
So in mammalian data, the databases are much larger. The number of gold standards are fewer. People rely on more ad hoc methods.
One of the big advances, technically, for the field was the development of a common way for all these databases of protein-protein interactions to report their data, to be able to interchange them. There's something called PSICQUIC and PSISCORE, that allow a client to pull information from all the different databases of protein-protein interactions. And because you can get all the data in a common format where it's traceable back to the underlying experiment, then you can start computing confidence scores based on these properties, what we know about where the data came from in a high throughput way.
Different people have different approaches to computing those scores. So there's a common format for that as well, which is this PSISCORE where you can build your interaction database from whichever one of these underlying databases you want, filter it however you want. And then send your database to one of these scoring servers. And they'll send you back the scores according to their algorithm.
One that I kind of like is this MIscore algorithm. It digs down into the underlying data of what kind of experiments were done and how many experiments were done. Again, they make all sorts of arbitrary decisions in how they do that. But the arbitrary decisions seem reasonable in the absence of any other data.
So their scores are based on these three kinds of terms-- how many publications there are associated with any interaction, what experimental method was used, and then also, if there's an annotation in the database saying that we know that this is a genetic interaction, or we know that it's a physical interaction. And then they put weights on all of these things.
So people can argue about what the best way of approaching this is. The fundamental point is that we can now have a very, very large database of known interactions that is weighted. So by last count, there are about 250,000 protein-protein interactions for humans in these databases. So you have that giant interactome. It's got all these scores associated with it.
And now we can dive into that and say, these data are somewhat largely unbiased by our prior notions about what's important. They're built up from high throughput data. So unlike the carefully curated pathways that are what everybody's been studying for decades, there might be information here about pathways no one knows about. Can we find those pathways in different contexts? What can we learn from that?
So one early thing people can do is just try to find pieces of the network that seem to be modular, where there are more interactions among the components of that module than there are to other pieces of the network. And you can find those modules in two different ways. One is just based on the underlying network. And one is based on the network, plus some external data you have.
So one would be to say, are there proteins that fundamentally interact with each other under all possible settings? And then we would say, in my particular patient sample or my disease or my microorganism, which proteins seem to be functioning in this particular condition? So one is the topological model. That's just the network itself. And one is the functional model, where I layer on the information that the dark nodes are active in my particular condition.
So an early use of this kind of approach was to try to annotate nodes-- there's a large fraction of even well-studied genomes where we don't know the function of any of those genes. So what if I use the structure of the network to infer that if some protein is close to another protein in this interaction network, it is likely to have similar function? And statistically, that's definitely true. So this graph shows, for things where we know the function, the semantic similarity on the y-axis, the distance in the network on the x-axis. Things that are close to each other in the network of interactions are also more likely to be similar in terms of function.
So how do we go about doing that? So let's say we have got this graph. We've got some unknown node labeled u. And we've got two known nodes in black. And we want to systematically deduce for every example like this, every u, what its annotation should be.
So I could just look at its neighbors, and depending on how I set the window around it, do I look at the immediate neighbors? Do I go two out? Do I go three out? I could get different answers.
So if I set K equal to 1, I've got the unknown node, but all the neighbors are also unknown. If I go two steps out, then I pick up two knowns. Now there's a fundamental assumption going on here that the node has the same function as its neighbors, which is fine when the neighbors are homogeneous. But what do you do when the neighbors are heterogeneous?
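The effect of the window size K can be sketched with a breadth-first search that collects annotated nodes within K steps and takes the most common label. The toy graph, the label "kinase", and the function name `annotate_by_neighborhood` are assumptions for illustration only.

```python
from collections import Counter, deque


def annotate_by_neighborhood(graph, labels, node, k):
    """Most common label among annotated nodes within k steps of `node`.

    Returns None when no annotated node lies within k steps.
    """
    seen = {node}
    frontier = deque([(node, 0)])
    counts = Counter()
    while frontier:
        current, depth = frontier.popleft()
        if depth == k:
            continue   # do not expand beyond the window
        for nb in graph[current]:
            if nb not in seen:
                seen.add(nb)
                if nb in labels:
                    counts[labels[nb]] += 1
                frontier.append((nb, depth + 1))
    return counts.most_common(1)[0][0] if counts else None


# Unknown node u; its immediate neighbors a and b are also unknown,
# and the two annotated nodes x and y sit two steps away.
graph = {
    "u": ["a", "b"],
    "a": ["u", "x"],
    "b": ["u", "y"],
    "x": ["a"],
    "y": ["b"],
}
labels = {"x": "kinase", "y": "kinase"}
print(annotate_by_neighborhood(graph, labels, "u", 1))  # nothing annotated in range
print(annotate_by_neighborhood(graph, labels, "u", 2))  # picks up the two knowns
```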
So in this case, I've got two unknowns u and v. And if I just were to take the K nearest neighbors, they would have the same neighborhood, right? But I might have a prior expectation that u is more like the black nodes, and v is more like the grey nodes.
So how do you choose the best annotation? The K nearest neighbors is OK, but it's not the optimal. So here's one approach, which says the following. I'm going to go through for every function, every annotation in my database, separately. And for each annotation, I'll set all the nodes that have that annotation to plus 1 and every node that doesn't have that annotation, either it's unknown or it's got some other annotation, to minus 1.
And then for every unknown, I'm going to try to find the setting which is going to maximize the sum of products. So we're going to take the sum of the products of u and all of its neighbors. So in this setting, if I set u to plus 1, then I do better than if I set it to minus 1, right? Because I'll get plus 1 plus 1 minus 1. So that will be better than setting it to minus 1. Yes.
AUDIENCE: Are we ignoring all the edge weights?
PROFESSOR: In this case, we're ignoring the edge weights. We'll come back to using the edge weights later. But this was done with an unweighted graph.
AUDIENCE: [INAUDIBLE] [? nearest neighborhood ?] they're using it then?
PROFESSOR: So here they're using the nearest neighbors. That's right, with no cutoff, right? So any interaction.
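The local update described before the questions can be sketched as follows, for a single annotation on an unweighted graph. The function names and the tiny example (one unknown node with neighbors at +1, +1, and -1) are assumptions; the rule of flipping only on a strict improvement is one possible choice.

```python
def local_score(graph, state, node, value):
    """Sum of products of `value` with the current states of the node's neighbors."""
    return sum(value * state[nb] for nb in graph[node])


def local_update(graph, state, unknown):
    """Flip unknown nodes one at a time while a flip strictly raises the local score."""
    state = dict(state)
    changed = True
    while changed:
        changed = False
        for node in unknown:
            current = local_score(graph, state, node, state[node])
            flipped = local_score(graph, state, node, -state[node])
            if flipped > current:
                state[node] = -state[node]
                changed = True
    return state


# u starts at -1; its neighbors vote +1, +1, -1, so +1 scores 1 and -1 scores -1.
graph = {"u": ["a", "b", "c"], "a": ["u"], "b": ["u"], "c": ["u"]}
state = {"u": -1, "a": +1, "b": +1, "c": -1}
result = local_update(graph, state, unknown=["u"])
print(result["u"])
```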
So then we could iterate this to convergence. That's one problem with this. But maybe a more fundamental problem is that you're never going to get the best overall solution by this local optimization procedure. So consider a setting like this.
Remember, I'm trying to maximize the sum of the product of the settings for neighbors. So how could I ever-- it seems plausible that all A, B, and C here, should have the red annotation, right? But if I set C to red, that doesn't help me. If I set A to red, that doesn't help me. If I set B to red, it makes things worse. So no local change is going to get me where I want to go.
So let's think for a second. What algorithms have we already seen that could help us get to the right answer? We can't get here by local optimization. We need to find the global optimum, not the local optimum. So what algorithms have we seen that help us find that global optimum?
Yeah, so simulated annealing. So the simulated annealing version in this setting is as follows. I initialize the graph. I pick a neighboring node, v, that I'm going to add. Say we'll turn one of these red.
I check the value of that sum of the products for this new one. And if it's improving things, I keep it. But the critical thing is, if it doesn't improve, if it makes things worse, I still keep it with some probability. It's based on how bad things have gotten. And by doing this, we can climb the hill and get over to some global optimum.
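A minimal simulated annealing sketch for this labeling problem is shown below. The graph is a made-up stand-in for the A, B, C situation: no single flip raises the score, yet turning all three nodes "red" (+1) is the best answer. The cooling schedule, step count, and names such as `anneal` are assumptions, not the published procedure.

```python
import math
import random


def total_score(graph, state):
    """Sum of state[u] * state[v] over every undirected edge, counted once."""
    return sum(state[u] * state[v] for u in graph for v in graph[u] if u < v)


def anneal(graph, state, unknown, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    rng = random.Random(seed)
    state = dict(state)
    score = total_score(graph, state)
    best_state, best_score = dict(state), score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        node = rng.choice(unknown)
        # Flipping a node negates the product on each of its edges.
        delta = -2 * state[node] * sum(state[nb] for nb in graph[node])
        # Keep improvements; keep worsening moves with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            state[node] = -state[node]
            score += delta
            if score > best_score:
                best_state, best_score = dict(state), score
    return best_state, best_score


graph = {
    "A": ["B", "R1"],
    "B": ["A", "C"],
    "C": ["B", "R2"],
    "R1": ["A"],
    "R2": ["C"],
}
start = {"A": -1, "B": -1, "C": -1, "R1": +1, "R2": +1}   # R1, R2 are known "red"
start_score = total_score(graph, start)
best_state, best_score = anneal(graph, start, unknown=["A", "B", "C"])
print(start_score, best_score)
```

Tracking the best state seen so far is a common convenience; it means the returned score can never be lower than the starting score, whatever the random draws.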
So we saw simulated annealing before. In what context? In the side chain placement problem. Here we're seeing it again. It's quite broad.
Any time you've got a local optimization that doesn't get you where you need to go, you need global optimization. You can think simulated annealing. It's quite often the plausible way to go.
All right. So this is one approach for annotation.
We also wanted to see whether we could discover inherent structure in these graphs. So often, we'll be interested in trying to find clusters in a graph. Some graphs have obvious structures in them. Other graphs, it's a little less obvious.
What algorithms exist for trying to do this? We're going to look at two relatively straightforward ways. One is called edge betweenness clustering and the other one is a Markov process.
Edge betweenness, I think, is the most intuitive. So I look at each edge, and I ask for all pairs of nodes in the graph, does the shortest path between those nodes pass through this edge? So if I look at this edge, very few shortest paths go through this edge, right? Just the shortest path for those two nodes. But if I look at this edge, all of the shortest paths between any node on this side and any node on this side have to pass through there. So that has a high betweenness.
So if I want a cluster, I can go through my graph. I can compute betweenness. I take the edge that has the highest betweenness, and I remove it from my graph. And then I repeat. And I'll be slowly breaking my graph down into chunks that are relatively more connected internally than they are to things in other pieces.
Any questions? So that's the entire edge betweenness clustering algorithm. Pretty straightforward.
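The remove-and-repeat procedure can be sketched as below for an unweighted, undirected graph. Betweenness is computed with a breadth-first accumulation in the style of Brandes' algorithm, which is an implementation choice made here, not something stated in the lecture. The example graph (two triangles joined by one bridge) and all function names are assumptions.

```python
from collections import deque


def edge_betweenness(graph):
    """Number of shortest paths through each edge (split evenly among ties)."""
    score = {}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in graph}
        sigma[s] = 1.0
        preds = {v: [] for v in graph}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path counts onto edges.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = tuple(sorted((v, w)))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    # Every unordered pair of nodes was visited from both ends.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(graph):
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(graph[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def split_by_betweenness(graph, n_clusters):
    """Remove the highest-betweenness edge until n_clusters pieces exist."""
    graph = {v: set(nbs) for v, nbs in graph.items()}
    while len(components(graph)) < n_clusters:
        scores = edge_betweenness(graph)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        graph[u].discard(v)
        graph[v].discard(u)
    return components(graph)


graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
bridge_score = edge_betweenness(graph)[("c", "d")]
clusters = sorted(sorted(c) for c in split_by_betweenness(graph, 2))
print(bridge_score, clusters)
```

Betweenness is recomputed after every removal, as in the lecture's description, which is what makes the method slow on very large graphs.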
Now an alternative is a Markov clustering method. And the Markov clustering method is based on the idea of random walks in the graph. So again, let's try to develop some intuition here. If I start at some node over here, and I randomly wander across this graph, I'm more likely to stay on the left-hand side than I am to move all the way across to the right-hand side, correct?
So can I formalize that and actually come up with a measure of how often any node will visit any other and then use that to cluster the graph? So remember our adjacency matrix, which just represented which nodes were connected to which. And what happens if I multiply the adjacency matrix by itself? So I raise it to some power. Well, if I multiply the adjacency matrix by itself just once, the squared adjacency matrix has the property that it tells me how many paths of length 2 exist between any two nodes.
So the adjacency matrix told me how many paths of length 1 exist. Right? You're directly connected. If I squared the adjacency matrix, it tells me how many paths of length 2 exist. N-th power tells me how many paths of length N exist.
So let's see if that works. This claims that there are exactly two paths that connect node 2 to node 2. What are those two paths? Connect node 2 to node 2. I go here, and I go back. That's the path of length 2, and this is the path of length 2.
And there are zero paths of length 2 that connect node 2 to node 3, because 1, 2. I'm not back at 3. So in general, an element of A to the N equals m if there exist exactly m paths of length N between those two nodes.
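The claim is easy to check numerically. The three-node graph below (1 linked to 2, 2 linked to 3) is an assumption consistent with the counts quoted above, and `matmul` is a small helper written for this sketch. Note that these "paths" are allowed to revisit nodes, as in going from node 2 to a neighbor and back.

```python
def matmul(X, Y):
    """Multiply two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

A2 = matmul(A, A)    # entry (i, j): number of length-2 routes from i to j
A3 = matmul(A2, A)   # entry (i, j): number of length-3 routes from i to j

for row in A2:
    print(row)
# Row/column index 1 is node 2: two routes of length 2 from node 2 back to
# itself, and none of length 2 from node 2 to node 3.
```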
So how does this help me? Well, you take that idea of the N-th power of the adjacency matrix and convert it to a transition probability matrix, simply by normalizing. So if I were to do a random walk in this graph, what's the probability that I'll move from node i to node j in a certain number of steps? That's what I want to compute.
So I need to have a stochastic matrix, where the sum of the probabilities for any transition is 1. I have to end up somewhere. I either end up back at myself, or I end up at some other node. I'm just going to take that adjacency matrix and normalize the columns.
And then that gives me the stochastic matrix. And then I can exponentiate the stochastic matrix to figure out my probability of moving from any node to any other in a certain number of steps. Any questions on that? OK.
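A sketch of the normalization step, using the same assumed three-node graph. With columns normalized, entry (i, j) is read as the probability of stepping from node j to node i; that reading, and the helper names, are conventions chosen for this example.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def column_normalize(M):
    """Divide each column by its sum (assumes every node has at least one edge)."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]


A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

T = column_normalize(A)   # one-step transition probabilities
T2 = matmul(T, T)         # two-step transition probabilities

for row in T2:
    print(row)
```

Here the middle node returns to itself with probability 1 after two steps, because both of its neighbors can only step back to it.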
So if we simply keep multiplying this stochastic matrix, we'll get the probability of increasing numbers of moves. But it doesn't give us sharp partitions of the matrix. So to do a Markov clustering, we do an exponentiation of this matrix with what's called an inflation operator, which is the following.
This inflation operator takes the r-th power of each element of the matrix and puts in the denominator the sum of the r-th powers of the transitions. So here's an example. Let's say I've got two probabilities-- 0.9 and 0.1. When I inflate it, I square the numerator, and I square each element of the denominator. Now I've gone from 0.9 to about 0.99 and 0.1 to about 0.01.
So this inflation operator exaggerates all my probabilities and makes the higher probabilities more probable and makes the lower probabilities even less probable. So I take this adjacency matrix that represents the number of steps in my matrix, and I exaggerate it with the inflation operator. And that takes the basic clustering, and it makes it more compact.
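The arithmetic of the 0.9 and 0.1 example can be reproduced directly. `inflate_column` is a hypothetical helper that applies the operator to one column of transition probabilities.

```python
def inflate_column(probs, r=2.0):
    """Raise each probability to the power r, then rescale so they sum to 1."""
    powered = [p ** r for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


inflated = inflate_column([0.9, 0.1])
print([round(p, 4) for p in inflated])   # 0.81 / 0.82 and 0.01 / 0.82
```

Larger values of r exaggerate the differences more strongly, which in practice tends to give more, smaller clusters.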
So the algorithm for this Markov clustering is as follows. I start with a graph. I add loops to the graph. Why do I add loops? Because I need some probability that I stay in the same place, right?
And in a normal adjacency matrix, you can't stay in the same place. You have to go somewhere. So I add a loop. So there's always a self loop.
Then I set the inflation parameter to some value. M_1 is the matrix of random walks in the original graph. I multiply that. I inflate it. And then I find the difference. And I do that until the difference in this matrix gets below some value. And what I end up with then are relatively sharp partitions of the overall structure.
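Putting the steps together, a compact sketch of the loop might look like this. The tolerance, the iteration cap, the rule for reading clusters off the final matrix, and the example graph (two triangles joined by a single edge) are all assumptions made for illustration.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def column_normalize(M):
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, tol=1e-6, max_iter=100):
    n = len(adjacency)
    # Add a self loop to every node, then convert to column probabilities.
    M = [[1.0 if i == j else float(adjacency[i][j]) for j in range(n)]
         for i in range(n)]
    M = column_normalize(M)
    for _ in range(max_iter):
        expanded = matmul(M, M)                        # take two random-walk steps
        inflated = column_normalize(
            [[x ** inflation for x in row] for row in expanded])
        diff = max(abs(inflated[i][j] - M[i][j])
                   for i in range(n) for j in range(n))
        M = inflated
        if diff < tol:                                 # matrix stopped changing
            break
    # Each row that still holds probability mass marks one cluster.
    clusters = []
    for i in range(n):
        members = [j for j in range(n) if M[i][j] > 1e-3]
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Nodes 0-2 form one triangle, nodes 3-5 another; the edge 2-3 joins them.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = markov_cluster(adjacency)
print(clusters)
```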
So I'll show you an example of how that works. So in this case, the authors were using a matrix where the nodes represented proteins. The edges represented BLAST hits.
And what they wanted to do was find families of proteins that had similar sequence similarity to each other. But they didn't want it to be entirely dominated by domains. So they figured that this graph structure would be helpful, because you'd get-- for any protein, there'd be edges, not just things that had similar common domains, but also things that had edges connecting it to other proteins as well.
So in the original graph, the edges are these BLAST values. They come up with the transition matrix. They convert it into the Markov matrix, and they carry out that exponentiation. And what they end up with are clusters where any individual domain can appear in multiple clusters. The domains are dominated not just by the highest BLAST hit, but by the whole network property of what other proteins they're connected to.
And it's also been done with a network, where the underlying network represents gene expression, and edges between two genes represent the degree of correlation of the expression across a very large data set for 61 mouse tissues. And once again, you take the overall graph, and you can break it down into clusters, where you can find functional annotations for specific clusters. Any questions then on the Markov clustering?
So these are two separate ways of looking at the underlying structure of a graph. We had the edge betweenness clustering and the Markov clustering. Now when you do this, you have to make some decision: I've found this cluster. Now how do I decide what it's doing? So you need to do some sort of annotation.
So once I have a cluster, how am I going to assign a function to that cluster? So one thing I could do would be to look at things that already have an annotation. So I got some cluster. Maybe two members of this cluster have an annotation and two members of this one. And that's fine.
But what do I do when a cluster has a whole bunch of different annotations? So I could be arbitrary. I could just take the one that's the most common. But a nice way to do it is by the hypergeometric distribution that you saw in the earlier part of the semester.
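As a reminder of how that test works, the sketch below computes the upper-tail hypergeometric probability of seeing at least a given number of annotated genes in a cluster by chance. The counts in the example are invented, and `enrichment_p_value` is a hypothetical name.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, hits):
    """P(at least `hits` annotated genes) when `cluster_size` genes are drawn
    without replacement from `population` genes, `annotated` of which carry
    the annotation."""
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes overall, 3 annotated, a cluster of 4 containing 2 of them.
p = enrichment_p_value(population=10, annotated=3, cluster_size=4, hits=2)
print(round(p, 4))
```

In practice the test would be repeated for every annotation in the cluster, so some correction for multiple testing is usually applied as well.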
So these are all ways of clustering the underlying graph without any reference to specific data for a particular condition that you're interested in. A slightly harder problem is when I do have those specific data, and I'd like to find a piece of the network that's most relevant to those specific data. So it could be different in different settings. Maybe the part of the network that's relevant in the cancer setting is not the part of the network that's relevant in the diabetes setting.
So one way to think about this is that I have the network, and I paint onto it my expression data or my proteomic data. And then I want to find chunks of the network that are enriched in activity. So this is sometimes called the active subgraph problem. And how do we find the active subgraph?
Well, it's not that different from the problem that we just looked at. So if I want to figure out a piece of the network that's active, I could just take the things that are immediately connected to each other. That doesn't give me the global picture.
So instead why don't I try to find larger chunks of the network where I can include some nodes for which I do not have specific data? And one way that's been done for that is, again, the simulated annealing approach. So you can try to find pieces of the network that maximize the probability that all the things in the subnetwork are active.
Another formulation of this problem is something that's called the Steiner tree problem. And in the Steiner tree, I want to find trees in the network that consist of all the nodes that are active, plus some nodes that are not, for which I have no data. And those nodes for which I have no data are called Steiner nodes.
And this was a problem that was looked at extensively in telecommunications. So if I want to wire up a bunch of buildings-- back when people used wires-- say to give telephone service, so I need to figure out what the minimum cost is for wiring them all up. And sometimes, that involves sticking a pole in the ground, then having everybody communicate to that pole.
So if I've got paying customers over here, and I want to wire them to each other, I could run wires between everybody. But I don't have to. If I stick a pole over here, then I don't need this wire, and I don't need this wire, and I don't need this wire. So this is what's called a Steiner node.
And so in graph theory, there are pretty efficient algorithms for finding a Steiner graph-- the Steiner tree-- the smallest tree that connects all of the nodes. Now the problem in our setting is that we don't necessarily want to connect every node, because we're going to have in our data some things that are false positives. And if we connect too many things in our graph, we end up with what are lovingly called "hairballs."
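The telephone-pole picture can be reproduced on a toy graph. The sketch below finds the cheapest tree by trying every subset of optional nodes and building a minimum spanning tree on each; this enumeration is exponential in the number of optional nodes, so it is only meant for tiny examples, not for an interactome. Wire costs, node names, and function names are assumptions.

```python
from itertools import combinations


def mst_cost(nodes, edges):
    """Kruskal's algorithm on the subgraph induced by `nodes`.

    Returns the total cost, or None if those nodes cannot all be connected.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    cost, used = 0.0, 0
    candidates = sorted((w, u, v) for (u, v), w in edges.items()
                        if u in parent and v in parent)
    for w, u, v in candidates:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            cost += w
            used += 1
    return cost if used == len(parent) - 1 else None


def steiner_tree(terminals, optional, edges):
    """Cheapest tree spanning all terminals, allowed to pass through optional nodes."""
    best = None
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            cost = mst_cost(set(terminals) | set(extra), edges)
            if cost is not None and (best is None or cost < best[0]):
                best = (cost, set(extra))
    return best


# Three customers; direct wires cost 2 each, wires to the pole cost 1 each.
edges = {
    ("A", "B"): 2, ("B", "C"): 2, ("A", "C"): 2,
    ("A", "pole"): 1, ("B", "pole"): 1, ("C", "pole"): 1,
}
without_pole = mst_cost({"A", "B", "C"}, edges)
cost, steiner_nodes = steiner_tree(["A", "B", "C"], ["pole"], edges)
print(without_pole, cost, steiner_nodes)
```

Adding the pole lowers the total cost from 4 to 3, which is exactly the situation where a node with no data of its own earns its place in the tree.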
So I'll give you a specific example of that. Here's some data that we were working with. We had a relatively small number of experimental hits that were detected as changing in a cancer setting and the interactome graph. And if you simply look for the shortest path, I should say, between the experimental hits across the interactome, you end up with something that looks very similar to the interactome.
So you start off with a relatively small set of nodes, and you try to find the subnetwork that includes everything. And you get a giant graph. And it's very hard to figure out what to do with a graph that's this big. I mean, there may be some information here, but you've taken a relatively simple problem to try to understand the relationship among these hits. And you've turned it into a problem that now involves hundreds and hundreds of nodes.
So these kinds of problems arise, as I said, in part, because of noise in the data. So some of these hits are not real. And incorporating those, obviously, makes me take very long paths in the interactome, but also arises because of the noise in the interactome-- both false positives and false negatives.
So I have two proteins that I'm trying to connect, and there's a false positive in the interactome. It's going to draw a line between them. If there's a false negative in the interactome, maybe these things really do interact, but there's no edge. If I force the algorithm to find a connection, it probably can, because most of the interactome is one giant connected component. But it could be a very, very long path that goes through many other proteins.
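A small sketch of why forcing every hit to be connected inflates the network. The toy graph and node names are hypothetical, and the union of pairwise shortest paths is just one simple way to connect hits, assumed here for illustration.

```python
from collections import deque
from itertools import combinations

def build_adjacency(edge_list):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def shortest_path(adj, src, dst):
    """Breadth-first search; returns one shortest path as a list, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in adj[node]:
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None

def union_of_shortest_paths(adj, hits):
    """Connect every pair of hits and collect all nodes that get pulled in."""
    nodes = set(hits)
    for a, b in combinations(hits, 2):
        path = shortest_path(adj, a, b)
        if path:
            nodes.update(path)
    return nodes

# Two real hits (h1, h2) sit close together; "fp" is a false-positive hit
# that is only reachable through a long chain of unrelated proteins.
adj = build_adjacency([
    ("h1", "x"), ("x", "h2"),
    ("h2", "p1"), ("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "fp"),
])
clean = union_of_shortest_paths(adj, ["h1", "h2"])
noisy = union_of_shortest_paths(adj, ["h1", "h2", "fp"])
print(len(clean), len(noisy))
```

One spurious hit is enough to drag in every protein along the path to it, which is how a short hit list can grow into a hairball on a real interactome.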
And so in the process of trying to connect all my data, I can get extremely large graphs. So to avoid having giant networks-- so on this projector, unfortunately, you can't see this very well. But there are a lot of edges among all the nodes here. Most of you have your computers. You can look at it there.
So in a Steiner tree approach, if my data are the ones that are yellow, they're called terminals. And for the grey ones, I have no data. If I ask the algorithm to solve the Steiner tree problem, it's going to have to find a way to connect this node up to the rest of the network. But if this one's a false positive, that's not the desired outcome.
So there are optimization techniques that actually allow me to tell the algorithm that it's OK to leave out some of the data to get a more compact network. So one of those approaches is called a prize collecting Steiner tree problem. And the idea here is the following.
For every node for which I have experimental data, I associate with that node a prize. The prize is larger, the more confident I am that that node is relevant in the experiment. And for every edge, I take the edge away, and I convert it into a cost. If I have a high confidence edge, there's a low cost. It's cheap. Low confidence edges are going to be very expensive.
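One way this conversion could look in code. The specific formulas here (prize as the negative log of a p-value, cost as one minus an edge confidence score) are assumptions chosen for illustration, not the scoring used in the work described in the lecture.

```python
import math

def node_prize(p_value):
    """Hypothetical prize: more significant experimental hits get larger prizes."""
    return -math.log10(p_value)

def edge_cost(confidence):
    """Hypothetical cost: high-confidence interactions (near 1.0) are cheap."""
    return 1.0 - confidence

# Made-up example values.
print(node_prize(0.001), node_prize(0.05))   # strong hit vs. marginal hit
print(edge_cost(0.9), edge_cost(0.2))        # reliable edge vs. dubious edge
```

Any monotone mapping would serve the same purpose: prizes should rise with confidence in a node, and costs should fall with confidence in an edge.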
And now I ask the algorithm to try to connect up all the things it can. Every time it includes a node for which there's data, it keeps the prize, but it had to add an edge, so it pays the cost. So there's a trade-off for every node.
So if the algorithm wants to include this node, then it's going to pay the price for all the edges, but it gets to keep the node. So the optimization function is the following. For every vertex that's not in the tree, there's a penalty. And for every edge in the tree, there's a cost.
And you want to minimize the sum of these two terms. You want to minimize the number of edge costs you pay for. And you want to minimize the number of prizes you leave behind. Is that clear?
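Written out, the two terms just described could be expressed as follows. The symbols are assumptions introduced here: T is the candidate tree with node set V_T and edge set E_T, p(v) is the prize on node v, and c(e) is the cost of edge e.

```latex
\min_{T = (V_T, E_T)} \; \sum_{v \notin V_T} p(v) \;+\; \sum_{e \in E_T} c(e)
```

The first sum is the total of the prizes left behind, and the second is the total of the edge costs paid.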
So the algorithm can then, depending on the optimization terms, figure out whether it's more of a benefit to include this node, keep the prize, and pay all the edge costs, or to do the opposite and throw it out. You don't get to keep the prize, but you don't have to pay the edge costs. And so that turns these very, very large networks into relatively compact ones.
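A minimal sketch of that trade-off, scoring a few candidate trees by hand rather than solving the optimization. The node names, prizes, and costs are made up, and `pcst_objective` is a hypothetical helper, not part of any published solver.

```python
def pcst_objective(tree_nodes, tree_edges, prizes, costs):
    """Prizes of the nodes left out, plus costs of the edges kept."""
    missed = sum(p for node, p in prizes.items() if node not in tree_nodes)
    paid = sum(costs[e] for e in tree_edges)
    return missed + paid

# Two confident hits (h1, h2) and one weak hit (fp) that is only reachable
# through low-confidence, expensive edges. Node x and p1 have no data.
prizes = {"h1": 5.0, "h2": 4.0, "fp": 0.5}
costs = {
    ("h1", "x"): 0.2, ("x", "h2"): 0.2,
    ("h2", "p1"): 0.6, ("p1", "fp"): 0.6,
}

# Candidate 1: connect only the confident hits, leave fp behind.
compact = pcst_objective({"h1", "x", "h2"},
                         [("h1", "x"), ("x", "h2")], prizes, costs)
# Candidate 2: connect everything, including fp.
full = pcst_objective({"h1", "x", "h2", "p1", "fp"},
                      list(costs), prizes, costs)
# Candidate 3: include nothing and forfeit every prize.
empty = pcst_objective(set(), [], prizes, costs)

print(compact, full, empty)
```

With these numbers the compact tree scores best: giving up the small prize on `fp` is cheaper than paying for the edges needed to reach it.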
Now solving this problem is actually rather computationally challenging. You can do it with integer linear programming. It takes a huge amount of memory. There's also a message passing approach. If you're interested in the underlying algorithms, you can look at some of these papers.
So what happens when you actually do this? So that hairball that I showed you before consisted of a very small initial data set. If you do a shortest path search across the network, you get thousands of edges shown here.
But the prize collecting Steiner tree solution to this problem is actually extremely compact, and it consists of subnetworks. You can cluster it automatically. This was clustered by hand, but you get more or less the same results. It's just not quite as pretty.
If you cluster by hand or by say, edge betweenness, then you get subnetworks that are enriched in various reasonable cellular processes. This was a network built from cancer data. And you can see things that are highly relevant to cancer-- DNA damage, cell cycle, and so on.
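For readers unfamiliar with edge betweenness, here is a small sketch of how it can be computed, using a Brandes-style accumulation over breadth-first searches. The toy graph (two triangles joined by a single bridge) is an assumption for illustration, and this is not claimed to be the clustering procedure used for the figure.

```python
from collections import deque

def edge_betweenness(adj):
    """Count, for each edge, the shortest paths between node pairs that use it."""
    score = {}
    for s in adj:
        dist, sigma = {s: 0}, {s: 1.0}
        preds = {n: [] for n in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:            # first time we reach w
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):            # accumulate from the far end back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                key = tuple(sorted((v, w)))
                score[key] = score.get(key, 0.0) + share
                delta[v] += share
    # Each unordered pair was counted from both endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}

# Two tight triangles joined by one bridge edge (c, d).
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
scores = edge_betweenness(adj)
top_edge = max(scores, key=scores.get)
print(top_edge, scores[top_edge])
```

Repeatedly removing the highest-scoring edge and recomputing splits the graph into clusters; here, cutting the bridge would separate the two triangles.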
And the really nice thing about this then is it gives you a very focused way to then go and do experiments. So you can take the networks that come out of it. And now you're not operating on a network that consists of tens of thousands of edges. You're working on a network that consists of very small sets of proteins.
So in this particular case, we actually were able to go in and test a number of the nodes that were not detected by the experimental data but were inferred by the algorithm, so the Steiner nodes, which had no direct experimental data. We would test whether blocking the activities of these nodes had any effect on the growth of these tumor cells. We were able to show that nodes that were very central to the network, and that were included in the prize collecting Steiner tree solution, had a high probability of being cancer targets, whereas the ones that were just slightly more removed were much lower in probability.
So one of the advantages of these large interaction graphs is they give us a natural way to integrate many different kinds of data. So we already saw that the protein levels and the mRNA levels agreed very poorly with each other. And we talked about the fact that one thing you could do with those data would be to try to find the connections between not the RNAs and the proteins, but the connections between the RNAs and the things that drove the expression of the RNA.
And so as I said, we'll see in one of Professor Gifford's lectures, precisely how to do that. But once you are able to do that, you take epigenetic data, look at the regions that are regulatory around the sites of genes that are changing in transcription. You can infer DNA binding proteins. And then you can pile all those data onto an interaction graph, where you've got different kinds of edges.
So you've got RNA nodes that represent the transcript levels. You've got the transcription factors that you infer from the epigenetic data. And then you've got the protein-protein interaction data that came from the two hybrid, the affinity capture mass spec.
And now you can put all those different kinds of data in the same graph. And even though there's no correlation between what happens in an RNA and what happens in the protein level-- or very low correlation-- there's this physical process that links that RNA up to the signaling pathways that are above it. And by using the prize collecting Steiner tree approaches, you can rediscover it.
And these kinds of networks can be very valuable for other kinds of data that don't agree. So it's not unique to transcript data and proteome data. It turns out there are many different kinds of omic data that, when looked at individually, give you very different views of what's going on in a cell.
So if you take knockout data, so which genes, when knocked out, affect the phenotype? And which genes, in the same condition, change in expression? Those give you two completely different answers about which genes are important in a particular setting.
So here we're looking at which genes are differentially expressed when you put cells under a whole bunch of these different conditions. And which genes when knocked out, affect viability in that condition. And then the right-hand column shows the overlap in the number of genes. And you can see the overlap is small. In fact, it's less than you would expect by chance for most of these.
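To put "less than you would expect by chance" in concrete terms, one standard approach is a hypergeometric calculation on the two gene lists. The sketch below uses made-up counts (a genome of about 6,000 yeast genes, 300 differentially expressed genes, 200 genetic hits, 5 genes in common); none of these numbers come from the table in the lecture.

```python
from math import comb

def overlap_stats(n_genome, n_expressed, n_genetic, n_overlap):
    """Expected overlap of two random gene sets, and the probability of
    seeing an overlap this small or smaller under the hypergeometric model."""
    expected = n_expressed * n_genetic / n_genome

    def pmf(k):
        # Probability that exactly k of the genetic hits are also expressed hits.
        return (comb(n_expressed, k)
                * comb(n_genome - n_expressed, n_genetic - k)
                / comb(n_genome, n_genetic))

    p_at_most = sum(pmf(k) for k in range(n_overlap + 1))
    return expected, p_at_most

# Hypothetical counts for one condition.
expected, p_at_most = overlap_stats(6000, 300, 200, 5)
print(expected, p_at_most)
```

Here two unrelated gene lists of these sizes would share about 10 genes on average, so an observed overlap of 5 sits below the chance expectation.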
So just to drill that home, say I do two separate experiments on exactly the same experimental system, say yeast responding to DNA damage. In one case, I read out which genes are important by looking at RNA levels. In the other one, I read out which genes are important by knocking every gene out and seeing whether it affects viability. I'll get two completely different sets of genes. And I'll also get two completely different sets of gene ontology categories.
But there is some underlying biological process that gives rise to that, right? And one of the reasons for this is different assays are measuring different things. So it turns out, if you look-- at least in yeast-- over 156 different experiments, for which there's both transcriptional data and genetic data, the things that come out in genetic screens seem to be master regulators, things that, when knocked out, have a big effect on phenotype. Whereas the things that change in expression tend to be effector molecules.
And so in, say, the DNA damage case, the proteins that, when knocked out, have a big effect on phenotype are ones that detect DNA damage and signal to the nucleus that there's been DNA damage, which then goes on and blocks the cell cycle and initiates the DNA repair response. Those things show up as genetic hits, but they don't show up as differentially expressed.
The things that do show up as differentially expressed are the repair enzymes. Those, when you knock them out, don't have a big effect on phenotype, because they're highly redundant. But there are these underlying pathways. And so the idea is, well, you could reconstruct these by, again, using the epigenetic data, the stuff Professor Gifford will talk about in upcoming lectures, to infer the transcription factors, and then the network properties, to try to build up a full network of how those relate to upstream signaling pathways that would then include some of the genetic hits.
I think I'll skip to the punchline here. So we've looked at a number of different modeling approaches for these large interactomes. We've also looked at ways of identifying transcriptional regulatory networks using mutual information, regression, Bayesian networks. And how do all these things fit together? And when would you want to use one of these techniques, and when would you want to use another?
So I like to think about the problem along these two axes. On one dimension, we're thinking about whether we have systems of known components or unknown components. And the other one is whether we want to identify physical relationships or statistical relationships.
So clustering, regression, mutual information-- those are very, very powerful for looking at the entire genome, the entire proteome. What they give you are statistical relationships. There's no guarantee of a functional link, right?
We saw that in the prediction that postprandial laughter predicts breast cancer outcome, that there's no causal link between those. Ultimately, you can find some reason why it's not totally random. But it's not as if that's going to lead you to new drug targets. But those can be done in a completely hypothesis-free way, with no external data.
Bayesian networks are somewhat more causal. But depending on how much data you have, they may not be perfectly causal. You need a lot of intervention data. We also saw that they did not perform particularly well in discovering gene regulatory networks in the dream challenge.
These interactome models that we've just been talking about work very well across giant omic data sets. And they require this external data. They need the interactome. So it works well in organisms for which you have all that interactome data. It's not going to work in an organism for which you don't.
What they give you at the end, though, is a graph that tells you relationships among the proteins. But it doesn't tell you what's going to happen if you start to perturb those networks. So if I give you the active subgraph that has all the proteins and genes that are changing expression in my tumor sample, now the question is, OK, should you inhibit the nodes in that graph? Or should you activate the nodes in that graph?
And the interactome model doesn't tell you the answer to that. And so what you're going to hear about in the next lecture from Professor Lauffenburger are models that live up in this space. Once you've defined a relatively small piece of the network, you can use other kinds of approaches-- logic based models, differential equation based models, decision trees, and other techniques that will actually make very quantitative predictions. What happens if I inhibit a particular node? Does it activate the process, or does it repress the process?
And so what you could think about then is going from a completely unbiased view of what's going on in a cell, collect all the various kinds of omic data, and go through these kinds of modeling approaches to identify a subnetwork that's of interest. And then use the techniques that we'll be hearing about in the next lecture to figure out quantitatively what would happen if I were to inhibit individual nodes or inhibit combinations of nodes or activate, and so on. Any questions on anything we've talked about so far? Yes.
AUDIENCE: Can you say again the fundamental difference between why you get those two different results if you're just reading out the gene expression versus the proteins?
PROFESSOR: Oh, sure. Right. So we talked about the fact that if you look at genetic hits, and you look at differential expression, you get two completely different views of what's going on in cells. So why is that?
So the genetic hits tend to hit master regulators, things that when you knock out a single gene, you have a global effect on the response. So in the case of DNA damage, those are things that detect the DNA damage. Those genes tend often not to be changing very much in expression.
So transcription factors are very low abundance. They usually don't change very much. A lot of signaling proteins are kept at a constant level, and they're regulated post-transcriptionally. So those don't show up in the differential expression.
The things that are changing in expression-- say the response regulators, the DNA damage response-- those often are redundant. So one good analogy is to think about a smoke detector. A smoke detector is on all the time. You don't wait until the fire. So that's not going to be changing in expression, if you will. But if you knock it out, you've got a big problem.
The effectors, say the sprinklers-- the sprinklers only come on when there's a fire. So that's like the response genes. They come on only in certain circumstances, but they're highly redundant. Any room will have multiple sprinklers, so if one gets damaged or is blocked, you still get a response.
So that's why you get this discrepancy between the two different kinds of data. But again, in both cases, there's an underlying physical process that gives rise to both. And if you do this properly, you can detect that on these interactome models. Other questions? OK. Very good.