Final Project Guidelines
For the final project, you should write a brief paper (two to four pages in Physical Review format) about a topic of your choice at the interface of Statistical Physics and Biology. It should be formatted as a regular article with title, abstract, and bibliography. The main text should contain introductory and concluding paragraphs (whether or not they appear as subsections is not important). The ideal project will involve a combination of literature review, discussion of an analytical or computational model, and application/analysis of biological data.
Students can collaborate in groups provided that the respective contributions of the authors of the joint paper are clearly specified in a footnote. (The length of the paper may be proportionately longer in such collaborations.) Clearly, the initial hurdle is coming up with an interesting problem that is doable in a short time. We would thus like you to think about potential projects, and consult with Prof. Kardar or Prof. Mirny regarding their suitability (preferably as soon as possible, but no later than one day after Ses #16).
Here are some ideas for the final project; they still need to be made more concrete.
- Polymorphism in the Human Genome: There is a growing amount of data on variations of the genetic material in the human population, accumulated for purposes of disease prediction and DNA identification. Polymorphism can be at the level of single nucleotides (SNPs), or repeated DNA sequences. Can we use this data to probe for interesting correlations or probability distributions? Possible questions include: Are SNPs distributed uniformly over a gene, or do they cluster in some regions? What is the distribution of repeat lengths, and can the distribution be used to infer a model for growth and extinction of repeats?
- "Surface Evolver" as a Tool for Biological Shapes: "Surface Evolver" is a public domain software package developed for use in Materials Science and Mechanical Engineering. It finds the shape that minimizes the elastic energy of a sheet subject to various constraints. Can this software be used for problems in a biological context? Possible questions include: Can the program be used to produce shapes resembling those of red blood cells (or other cell types), where the membrane is 'fluid' in character? Can it reproduce icosahedral and other shapes observed in viral particles?
- DNA Binding Sites: The preferred binding sites of a variety of transcription factors (TFs) on DNA are well known, and it is possible to construct a simple energy function to estimate the binding energy of a TF to a short sequence of DNA. Use such an energy function to test whether candidate binding sites are over- or under-represented in the whole genome, in the genes, or in the upstream or downstream regions.
- Interactions between Transcription Factors: The above problem involved the binding of a single TF. In some cases several TFs are needed to recruit the RNA polymerase. Can you come up with a simple (Ising-like) model of interacting TFs, and quantify the extent to which the interaction between the TFs enhances the occupation probability of the complex that recruits the polymerase?
- Dynamics of a Population of Variable Size: In lectures we discussed the dynamics of mutations in a population of fixed size N. Can you extend the description to a population in which the total size N is itself allowed to change? How do the probability of fixation of a mutation, and the time to fixation, behave in such a case?
- Epistatic Interactions between Mutations: In class we described a model for the dynamics of a single mutation (or independently evolving mutations). Generalize the description to two mutations in which the fitness of the double mutant is not simply the sum of the fitnesses for single mutants.
- 'Finite Temperature' Alignments: The standard tests of sequence similarity (such as BLAST) result in a score which can be interpreted as the lowest energy of a directed polymer. One can use the transfer matrix to evaluate the partition function of the corresponding directed polymer at any temperature. The free energy would then converge to the standard similarity score in the limit of zero temperature. Is it possible to get a better sense of similarity of sequences by using the finite temperature free energy? One could test this by evaluating the free energies for alignment of proteins that are known to have structural similarity, but no obvious sequence similarity.
- Ejection of a Viral RNA from its Capsid: There are easily accessible programs for Monte Carlo or Molecular Dynamics simulations of polymers. Can one use these programs to study and visualize how the RNA (or DNA) is ejected from an initially compressed state in the viral capsid into the cell? Is this an easy process, or does one have to worry about jams and entanglements?
- 'Hubs' and Diameter of a Network: The 'diameter' of a network is a global measure of the effectiveness of a network, and is modified as various nodes in the network are removed. How sensitive is the diameter to removal of highly connected nodes (hubs)?
- Role of 'Topology' of the Native Structure in Protein Folding: Simulate folding of a 27-mer on a cubic lattice. Use a Go-model (i.e. all native interactions are attractive and all non-native interactions are zero) and measure the MFPT (mean first passage time, i.e. mean folding time). Try folding 27-mer "proteins" with different native structures. Try several of the possible native structures and find structures that fold fast and those that fold slowly. How significant is the difference in MFPT? Can you explain why some structures fold fast and others fold slowly? Study the correlation between MFPT and the number of local (e.g. i, i+3 or i, i+5) interactions. Study the correlation between MFPT and the number of native-like folds in the space of structures.
- Role of Specific Interactions in Protein Folding: It is widely believed that all native interactions are important to fold a protein, but it may turn out that only a few are sufficient to fold it. The aim is to test the plausibility of folding a protein with a minimal number of interactions, and to estimate the minimal number of interactions needed.
Simulate folding of a 27-mer or 36-mer on a cubic lattice. First, use a Go-model (i.e. all native interactions are attractive and all non-native interactions are zero) and measure the MFPT (mean folding time). Next, make all non-native interactions the same and weakly attractive (hydrophobic attraction), Bnn = -0.1. Make most of the native interactions the same, Bn = -0.1, while keeping very few (3-10) native interactions strongly attractive, Bn = -1. Can you fold this protein using only a few native interactions? What is the minimal number of interactions required to fold a 27-mer? A 36-mer? Are there any specific preferred locations for these key interactions in the "protein" structure? Can you find a rationale for the number and location of key interactions based on the native structure and other compact structures of the 27-mer?
- Checking Some Studies of Biological Networks: The field of 'biological networks' has been quite active in the last several years. There are a number of studies with interesting claims which bear more careful examination. Prof. Mirny can suggest a number of such papers to look at, and how to further test some of the claims.
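As a starting point for the 'Finite Temperature' Alignments idea above, the transfer matrix recursion can be sketched in a few lines. The sketch below is only illustrative and rests on several assumptions that are not part of the project description: it uses global (Needleman-Wunsch style) alignment rather than the local alignment used by BLAST, a hypothetical scoring scheme (match +1, mismatch -1, gap -1), and the function names `alignment_scores` and `_soft_max` are made up for this example. It treats the energy of an alignment as minus its score, so that T ln Z plays the role of a finite temperature score.

```python
import math


def _soft_max(values, temperature):
    """Return T * log(sum(exp(v / T))), computed without overflow."""
    top = max(values)
    return top + temperature * math.log(
        sum(math.exp((v - top) / temperature) for v in values)
    )


def alignment_scores(seq_a, seq_b, temperature,
                     match=1.0, mismatch=-1.0, gap=-1.0):
    """Return (zero-temperature score, finite-temperature score T ln Z).

    Both are computed by the same recursion over the alignment grid;
    the second replaces max() by a Boltzmann-weighted sum over paths.
    The scoring parameters are illustrative assumptions.
    """
    n, m = len(seq_a), len(seq_b)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    soft = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Along the edges there is a single path made only of gaps.
    for i in range(1, n + 1):
        best[i][0] = soft[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = soft[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            best[i][j] = max(best[i - 1][j - 1] + s,
                             best[i - 1][j] + gap,
                             best[i][j - 1] + gap)
            soft[i][j] = _soft_max([soft[i - 1][j - 1] + s,
                                    soft[i - 1][j] + gap,
                                    soft[i][j - 1] + gap],
                                   temperature)
    return best[n][m], soft[n][m]


if __name__ == "__main__":
    for temperature in (2.0, 0.5, 0.001):
        print(temperature, alignment_scores("GATTACA", "GCATGCU", temperature))
```

In this sketch the finite temperature score is never smaller than the optimal score, and approaches it as the temperature is lowered. Note that the recursion counts every path on the grid separately, so different orderings of adjacent gaps contribute as distinct configurations; whether that is the desired ensemble is one of the modeling choices the project would have to address.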
Suggested Projects from 2003
Here are some ideas for the final project; they still need to be made more concrete.
- Quantify the Relevance of Hydrophobicity in Protein Structure: The general expectation is that hydrophobic aminoacids are in the core of proteins, while polar aminoacids are on the surface. Using databases of protein structures, it is possible to construct histograms of hydrophobicity as a function of the distance from the center of the protein. Do such plots show universal properties? Are there characteristics that can distinguish between "super-families" or folds? Do membrane proteins exhibit a different character?
- Quantify the Similarity between Protein Structures: Given two protein structures, can we determine how similar or different they are? To answer this question, you need to construct an algorithm that finds an optimal superposition of two structures. One potential method is to minimize a "distance"
$$D = \sum_{i=1}^{I} \sum_{j=1}^{J} f\left(\left|\vec{r}_i - \vec{r}_j\right|\right),$$
where the sum runs over all pairs of aminoacids, one from each of the two proteins (of lengths I and J, respectively). The vector $\vec{r}_i$ is a three dimensional 'location' assigned to the ith aminoacid in the structure, and f(r) is a short-ranged function that rapidly decays for separations larger than some distance R. Study the feasibility of this scheme, and the optimal choice of f(r). As a minimum requirement, the algorithm should be able to align and superpose identical structures.
- Sliding Double Strands: The Poland-Scheraga model treats all bases of DNA as equivalent, yet assumes that the ith base on one strand can only bind to the ith base on the complementary strand. However, if all bases are equivalent, it should be possible to 'slide' the two strands with respect to each other, creating single stranded segments at the two ends. It should also be possible for the bubbles to have unequal numbers of monomers from each strand. Generalize the Poland-Scheraga model to allow for these possibilities.
- Energetics of Mutations: It is possible to mutate a single amino acid in a protein, and quantify its effect on the stability of the protein by experimentally measuring the change in Gibbs free energy, ΔΔG.
Use a simple pairwise energy function $E = \sum_{ij} \delta_{ij} U(a_i, a_j)$ to approximate the free energy of a protein, where $\delta_{ij} = 1$ if amino acids $i$ and $j$ are closer than a certain cutoff (and 0 otherwise), $a_i$ is the amino acid in position $i$ of the sequence, and $U(x, y)$ is the energy of interaction between amino acid types $x$ and $y$. The interaction potential $U(x, y)$ is generally unknown.
Using this formalism you can compute the ΔE of a mutation. Can you derive an interaction potential $U(x, y)$ that provides the best fit between the theoretical ΔE and the experimentally measured ΔΔG of mutations?
What is the best correlation between theory and experiment that you can get using this model?
- Correlated Mutations in Proteins: Take a protein structure, and a matching HSSP file with multiple alignments. (To get an HSSP file (multiple alignments) corresponding to a PDB file (protein structure), go to the protein data bank, and enter the PDB-ID of a protein (e.g. 1ten). After you get the protein's Web page, click on Other Sources and you'll get a link to the corresponding HSSP file.) By examining the differences between the sequences, search for correlated mutations. Are such correlations primarily between interacting aminoacids (which are in close proximity in the chosen structure), or are there "induced" correlations between non-interacting ones? To improve statistics, you probably need to simplify your alphabet, for example by grouping the aminoacids according to hydrophobicity, size, or charge. Try to find a representation that provides maximal discrimination between correlations of interacting and non-interacting aminoacids.
- Pair Correlations in the Genome: In problem 3 of Assignment 3, you examined the dependence of pair correlations on base pair separation in the genome of E. coli. Expand and refine this problem as follows: Use mutual information as the measure of correlation between bases at a distance n. Compare the n dependence of this quantity for coding and non-coding regions of the E. coli genome. Check if the same correlations are observed in coding and non-coding regions of the human genome.
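For the Pair Correlations project, the mutual information between bases at separation n can be estimated directly from pair counts. The following is a minimal sketch under stated assumptions: the sequence is a plain string of base letters, probabilities are estimated by simple frequency counts, the result is reported in bits, and the function name `mutual_information` is chosen here for illustration only.

```python
import math
from collections import Counter


def mutual_information(sequence, separation):
    """Estimate the mutual information (in bits) between the bases at
    positions k and k + separation, using frequency counts."""
    pairs = [(sequence[k], sequence[k + separation])
             for k in range(len(sequence) - separation)]
    total = len(pairs)
    if total == 0:
        raise ValueError("sequence is too short for this separation")
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)    # marginal counts of first base
    right = Counter(b for _, b in pairs)   # marginal counts of second base
    info = 0.0
    for (a, b), count in joint.items():
        # p_ab / (p_a * p_b) written in terms of raw counts
        ratio = count * total / (left[a] * right[b])
        info += (count / total) * math.log2(ratio)
    return info


if __name__ == "__main__":
    toy = "ACGT" * 50  # perfectly periodic toy sequence, period 4
    for n in range(1, 9):
        print(n, round(mutual_information(toy, n), 4))
```

For real genomic data the counts at each separation are finite, so frequency-based estimates of this kind are biased upward; comparing against a shuffled version of the same sequence is one simple way to gauge the size of that effect.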