## Final Project Guidelines

Please choose one of the following two for a final project:

- Research Project: For the final project, you should write a brief paper (two to four pages in Physical Review format) about a topic of your choice at the interface of Statistical Physics and Biology. It should be formatted as a regular article with title, abstract, and bibliography. The main text should contain introductory and concluding paragraphs (whether or not they appear as subsections is not important). The ideal project will involve a combination of literature review, discussion of an analytical or computational model, and application/analysis of biological data.
- Teaching site: Design a website that can be used to teach a topic at the interface of Statistical Physics and Biology to non-specialists. For example, imagine that a high school teacher would use a one-hour class to teach the material to an honors science class using your web page. As such, you should include introductory materials and references that interested students can pursue on their own. The presentation must also be colorful and dynamic (e.g. by including figures, animations, applets, etc.) to engage and maintain the interest of a diverse non-specialist audience.

Students can collaborate in groups provided that the respective contributions of each author of the joint paper are clearly specified in a footnote. (The length of the paper may be proportionately longer in such collaborations.) Clearly, the initial hurdle is coming up with an interesting problem that is doable in a short time. We would thus like you to think about potential projects, and consult with Prof. Kardar or Prof. Mirny regarding their suitability (preferably as soon as possible, but no later than one day after Ses #16).

## Suggestions for the Final Project

*Epistatic interactions between mutations*: In class we described a model for the dynamics of a single mutation (or independently evolving mutations). Generalize the description to two mutations in which the fitness of the double mutant is not simply the sum of the fitnesses of the single mutants.

*'Finite temperature' alignments*: The standard tests of sequence similarity (such as BLAST) result in a score which can be interpreted as the lowest energy of a directed polymer. One can use the transfer matrix to evaluate the partition function of the corresponding directed polymer at any temperature. The free energy would then converge to the standard similarity score in the limit of zero temperature. Is it possible to get a better sense of the similarity of sequences by using the finite temperature free energy? One could test this by evaluating the free energies for alignment of proteins that are known to have structural similarity, but no obvious sequence similarity.

*Role of specific interactions in protein folding*: It is widely believed that all native interactions are important to fold a protein, but it may turn out that only a few are sufficient to fold it. The aim is to test the plausibility of folding a protein with a minimal number of interactions, and to estimate the minimal number of interactions needed. Simulate folding of a 27-mer or 36-mer on a cubic lattice. First, use a Go-model (i.e. all native interactions are attractive and non-native ones are zero) and measure the MFPT (mean first passage time, i.e. mean folding time). Next, make all non-native interactions the same and weakly attractive (hydrophobic attraction), B_{nn} = −0.1. Make most of the native interactions the same, B_{n} = −0.1, while keeping very few (3–10) native interactions strongly attractive, B_{n} = −1. Can you fold this protein using only a few native interactions? What is the minimal number of interactions required to fold a 27-mer? A 36-mer? Are there any specific preferred locations for these key interactions in the "protein" structure? Can you find a rationale for the number and location of key interactions based on the native structure and other compact structures of the 27-mer?

*Role of 'topology' of the native structure in protein folding*: Simulate folding of a 27-mer on a cubic lattice. Use a Go-model (i.e. all native interactions are attractive and non-native ones are zero) and measure the MFPT (mean first passage time, i.e. mean folding time). Try folding 27-mer "proteins" with different native structures. Try some of the possible native structures and find structures that fold fast and those that fold slowly. How significant is the difference in MFPT? Can you explain why some structures fold fast and others fold slowly? Study the correlation between MFPT and the number of local (e.g. i, i+3 or i, i+5) interactions. Study the correlation between MFPT and the number of native-like folds in the space of structures.

*Interactions between transcription factors*: The above problem involved the binding of a single TF. In some cases several TFs are needed to recruit the RNA polymerase. Can you come up with a simple (Ising-like) model of interacting TFs, and quantify the extent to which the interaction between the TFs enhances the occupation probability of the complex that recruits the polymerase?

*Kinetics of assembly of multiple transcription factors*: Transcription factors (TFs) bind DNA at specific sites. To activate a gene, several TFs need to bind their sites close by. While the kinetics of a single TF binding its site is understood, the generalization of this problem to N TFs is an interesting problem.
Consider a stretch of DNA where N TFs need to bind simultaneously. Calculate the mean and the distribution of the time until all N of them are bound to their sites at the same time. Consider the case of non-interacting TFs and an extension to weakly interacting TFs.

*DNA binding sites*: The preferred binding sites of a variety of transcription factors (TFs) are well known, and it is possible to construct a simple energy function to estimate the binding energy of a TF to a short sequence of DNA. Use such an energy function to test whether candidate binding sites are over- or under-represented in the whole genome, in the genes, or in the upstream or downstream regions. Test whether sites have a tendency to cluster together.

*Correlated mutations*: While evolving, protein sequences accumulate mutations. The effect of one mutation can be compensated by another mutation. This leads to correlations between mutations. Revealing and understanding patterns of correlation can help to predict protein structure, as amino acids that interact are more likely to exhibit correlated mutations. The system is analogous to a spin system at low temperature, where spins become correlated. Inferring the coupling constants of the spin system is then analogous to the inference of correlated mutations. Consider this "inverse statistical mechanics" problem, where the goal is to reconstruct the couplings of the spin system, given a series of observations of the states of the spins.

*Protein folding and evolution*: "Foldable" proteins constitute a tiny fraction of all possible protein sequences. Little is known about their organization in sequence space. Understanding their distribution in sequence space can help to answer several important questions. For example, what is the minimal number of mutations in a random (non-foldable) protein that can make it foldable? What is the minimal number of mutations that can make a foldable protein fold into a different structure?
Using available code for folding lattice proteins, one can try to answer these questions.

*Information theory of the immune system*: The goal of the immune system is to distinguish foreign proteins from host ones. Such recognition is accomplished by receptor (MHC) molecules that recognize short peptides (of 8–10 amino acids) by binding them. Receptors that bind host peptides are eliminated during the training process. Each receptor can be described by the consensus sequence that it binds most strongly and by its specificity, i.e. the number of changes in the consensus sequence that it tolerates. Consider a repertoire of N receptors: what are the optimal length of the peptide and the optimal specificity, which provide the best coverage of all possible peptide sequences while minimizing the probability of erroneously recognizing a host protein?

*Self-assembly of the spindle*: In cell biology, the spindle is the structure that separates the chromosomes into the daughter cells during cell division. The spindle is made of chromosomes and microtubules. Recent data suggest that the spindle consists of thousands of short microtubules. Neighboring microtubules tend to align with each other, while the density of microtubules decays with the distance from the chromosome. Such a system can be modeled by a modified XY model, developed in statistical mechanics for spin systems. Each spin corresponds to a microtubule, the coupling favors alignment of microtubules, and the external field models the effect of the chromosomes. Can such a modified XY model explain the structure and the self-organization of the spindle?

*'Hubs' and diameter of a network*: The 'diameter' of a network is a global measure of the effectiveness of the network, and changes if nodes are removed from the network. It has been proposed that the change in the network diameter depends on the type of the network (i.e. its degree distribution).
Consider two types of networks: a random one, and a network with a power-law degree distribution that resembles some biological networks and the Internet. How sensitive is the diameter to the removal of nodes in these two classes of networks? You can try different strategies of removal: picking a node for removal randomly (uniformly), or picking more connected nodes (hubs) with a higher probability.
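
As a possible starting point for the network project, the sketch below builds the two kinds of networks and compares the two removal strategies. It is only an illustration under several assumptions that are not part of the guidelines: the random network is taken to have a fixed number of uniformly chosen edges, the power-law network is grown by preferential attachment, the "diameter" is reported both as the longest and as the mean shortest-path length over connected pairs, and all function names and parameter values (`n`, `k`, the 5% removal fraction) are hypothetical choices.

```python
import random
from collections import deque


def random_graph(n, m, rng):
    """Assumed 'random' network: n nodes and m distinct edges chosen uniformly."""
    adj = {i: set() for i in range(n)}
    edges = 0
    while edges < m:
        a, b = rng.sample(range(n), 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            edges += 1
    return adj


def preferential_attachment_graph(n, k, rng):
    """Assumed power-law network: each new node links to k existing nodes,
    chosen with probability proportional to their current degree."""
    adj = {i: set() for i in range(n)}
    for a in range(k + 1):  # fully connected core of k + 1 nodes
        for b in range(a + 1, k + 1):
            adj[a].add(b)
            adj[b].add(a)
    # Each node appears in `stubs` once per unit of degree.
    stubs = [a for a in range(k + 1) for _ in range(k)]
    for new in range(k + 1, n):
        targets = set()
        while len(targets) < k:
            targets.add(rng.choice(stubs))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend([new, t])
    return adj


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_stats(adj):
    """Return (longest, mean) shortest-path length over all connected pairs."""
    longest, total, pairs = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, (total / pairs if pairs else 0.0)


def remove_nodes(adj, count, rng, by_degree):
    """Return a copy of adj with `count` nodes removed, either uniformly or
    with weight degree + 1 (the +1 keeps weights positive for isolated nodes)."""
    adj = {u: set(vs) for u, vs in adj.items()}
    for _ in range(count):
        nodes = list(adj)
        weights = [len(adj[u]) + 1 if by_degree else 1 for u in nodes]
        victim = rng.choices(nodes, weights=weights, k=1)[0]
        for v in adj.pop(victim):
            adj[v].discard(victim)
    return adj


if __name__ == "__main__":
    rng = random.Random(0)
    n, k = 300, 2  # hypothetical sizes, chosen only to keep the run short
    networks = {
        "random": random_graph(n, k * n, rng),
        "power-law": preferential_attachment_graph(n, k, rng),
    }
    for name, graph in networks.items():
        for by_degree in (False, True):
            damaged = remove_nodes(graph, n // 20, rng, by_degree)
            label = "hub-biased" if by_degree else "uniform"
            print(name, label, path_stats(damaged))
```

A single run like this is noisy, so a real project would presumably average over many network realizations and scan the fraction of removed nodes. Note also that the path statistics here are taken over all pairs that remain connected, which is one of several reasonable conventions once the network starts to fragment.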