Center for Theoretical Biological Physics, Rice University, Houston, TX 77005, Department of Chemistry, Rice University, Houston, TX 77005, Department of Physics and Astronomy, Rice University, Houston, TX 77005, Department of Biosciences, Rice University, Houston, TX 77005.
The human genome resides in the cell nucleus and is composed of 46 DNA molecules. These molecules, called chromosomes, have a combined length of about 2 meters. Chromosome structures vary among cell types. Their three-dimensional architecture plays a central role in transcriptional regulation, and its disruption may be associated with disease. In this manuscript we describe a theoretical approach for determining the physical mechanism governing genome architecture. This is an interesting challenge: since the DNA contained in every cell of an organism is the same, the information that distinguishes one cell type's architecture from another's has to be located somewhere else.
Toward answering this challenge, we have demonstrated that the architecture of interphase chromosomes is encoded mostly by their epigenetic marking. This one-dimensional sequence information is sufficient to determine their three-dimensional structure. The epigenetic information can be rewritten during cell differentiation, thereby determining, for each cell type, both the three-dimensional structure and gene expression.1,2
In vivo, regions of chromatin characterized by different epigenetic markings go through a process similar to phase separation during their three-dimensional spatial organization. This process forms liquid droplets that rearrange dynamically by breaking and fusing and that characterize the chromosomal architecture.
Our theoretical approach, combined with the respective computational tools, is able to predict the spatial conformation of genomes with unprecedented accuracy and specificity and, at the same time, reveals the physical mechanisms governing this organizational process. This physical understanding and modelling provide the initial tools for understanding the functional aspects of genome architecture.
Exploring the energy landscape of chromatin and establishing the connection between genome conformation and phenotypes.
Chromatin is not composed of DNA alone. Its structure and dynamics also depend on hundreds of structural and regulatory proteins that interact with the genetic material. These interactions provide what is called epigenetic information. As described above, chromatin in the cell has partially organized structures, which are essential in controlling the transcription of genes,1,2 and disrupting them often leads to disease.3-6
In the recent past, new experimental techniques have been developed which combine DNA proximity ligation with high-throughput sequencing. These techniques are making it possible to learn much about chromatin organization.7 These chromosome conformation capture7,8 experiments provide rich data sets that have been used to develop structural models of chromosomal organization.9,10 The most advanced of these approaches, Hi-C, provides a high-resolution contact map of the genome. These experiments show that chromatin structure gives rise to segregation of different types of chromatin in compartments (see discussion below). This information is very coarse-grained, with a representation of the chromosomes as a necklace of beads of 50 kilobases in size. Our goal is to determine the physical mechanism by which local interactions lead to the 3D fold observed in chromatin.
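The bead-necklace representation and the notion of a contact map can be illustrated with a minimal Python sketch. The 50-kilobase bead size comes from the text. The function names, the hard distance cutoff, and the toy coordinates are assumptions made only for illustration; they are not the procedure used to process Hi-C data or to score simulated structures.

```python
import math

BEAD_SIZE_BP = 50_000  # bead resolution quoted in the text (50 kilobases)


def bead_index(position_bp, bead_size_bp=BEAD_SIZE_BP):
    """Return the 0-based index of the bead containing a genomic position."""
    return position_bp // bead_size_bp


def contact_frequency(structures, cutoff):
    """Fraction of structures in which beads i and j lie within `cutoff`.

    `structures` is a list of conformations; each conformation is a list of
    (x, y, z) bead coordinates. The hard cutoff is an illustrative assumption.
    """
    n = len(structures[0])
    counts = [[0] * n for _ in range(n)]
    for coords in structures:
        for i in range(n):
            for j in range(i, n):
                if math.dist(coords[i], coords[j]) <= cutoff:
                    counts[i][j] += 1
                    counts[j][i] = counts[i][j]  # keep the map symmetric
    total = len(structures)
    return [[c / total for c in row] for row in counts]


# Toy ensemble: two conformations of a three-bead chain (arbitrary units).
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
]
cmap = contact_frequency(ensemble, cutoff=1.5)
```

Averaging over an ensemble rather than a single structure mirrors the fact that a Hi-C map reports contact frequencies across a population of cells.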
In earlier work, we have shown that energy landscape theory and the funnel concept are powerful tools for understanding protein folding.11-13 A similar approach can be developed to investigate chromosome structure and function.10,14 Any landscape theory is based on the fact that biomolecules do not have a single folded structure but live in a dynamical ensemble of a large number of structures. Proteins have a funneled landscape, i.e., convergent kinetic pathways that guide folding to a unique, stable, native conformation. There is a strong correlation between protein order and free energy stabilization. This landscape, therefore, is not only responsible for folding the protein but also governs its function, since many of the protein's functional states lie higher in free energy in the funnel. This correlation between order and energetic stabilization allows for a connection between kinetic pathways and free energy minimization. Inspired by this framework, we have built an energy landscape approach for chromatin folding, which is described in the following section. Utilizing the Hi-C data, we have been able to construct this coarse-grained energy landscape with the help of a maximum entropy Bayesian approach. Despite their limitations, these landscapes reproduce the experimental information well, in particular the contacts detected by Hi-C between different genomic segments. They also reproduce independent experimental results, such as contacts inferred from FISH microscopy, even though this information was never used in building the model.
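As a generic illustration of how a maximum entropy construction ties an energy landscape to contact data, the standard textbook form can be written as follows. The notation is assumed here for illustration and is not taken from the cited papers: U_0 is a reference polymer potential, f(r_ij) is a function that indicates contact between loci i and j, f_ij^exp is the experimentally derived contact frequency, and the lambda_ij are Lagrange multipliers adjusted until the constraint is met.

```latex
P(\mathbf{r}) \;\propto\; \exp\!\left[-\beta\Big(U_0(\mathbf{r})
    + \sum_{i<j} \lambda_{ij}\, f(r_{ij})\Big)\right],
\qquad
\big\langle f(r_{ij}) \big\rangle_{P} \;=\; f_{ij}^{\mathrm{exp}} .
```

Among all distributions consistent with the measured contact frequencies, this is the one that adds the least extra information beyond the reference model, so the multipliers play the role of effective interaction energies.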
Predicting Genome Structures
In 2016 we proposed a physical model (the Minimal Chromatin Model (MiChroM)) for chromatin folding that has shown unprecedented accuracy in predicting the structure of chromosomes.15 MiChroM is based on the assumption that chromatin can be subdivided into a handful of structural types based on its biochemical properties. We proposed that selective binding of nuclear proteins to these few distinct types of chromatin is the driving phenomenon behind chromatin folding. Again, chromatin types, which are distinct from DNA sequence, are at least partially epigenetically controlled and change during cell differentiation, thus constituting an intriguing link among epigenetics, chromosomal organization, and cell development. Using molecular dynamics simulations, we showed that micro-phase separation of chromatin structural types is capable of generating 3D configurations that are strikingly similar to chromosome conformations found in vivo by the DNA ligation assays such as Hi-C. We also showed that differential binding of proteins generates an equilibrium ensemble of unknotted structures, a finding that stands in contrast to the long-held belief that these structures must be the result of a non-equilibrium process. We now present a summary of this model (details can be found in reference 15).
MiChroM is based on several physical assumptions informed by experiments. It is known that chromosome interactions are mediated by a cloud of proteins, each of which interacts with DNA with a different selectivity. Our first assumption is that these biochemical interactions can be represented by assigning each chromatin segment to one of a few “types” of chromatin. These types are determined by different histone modifications and other protein-mediated interactions. In its most current version, which uses high-resolution Hi-C data, the model has six distinct interaction patterns, corresponding to six sub-compartments (A1, A2, B1, B2, B3, and B4). A and B types tend to phase separate into different compartments, as described above. The second assumption is that some specific pairs of loci interact strongly to form particularly frequent contacts, or loops. Most of these loops are intra-chromosomal and require the presence of a binding factor (CTCF), which recognizes the CCCTC DNA motif, and the cohesin protein complex. The final assumption is called the ideal chromosome potential. It assumes that there is an additional free energy that depends only on the genomic distance between two loci. This term is responsible for the local chromatin structure. Figure 1 shows a schematic representation of MiChroM.
A real challenge is to determine these structural types utilizing only epigenetic information, without any previous structural knowledge. We therefore developed a method (Maximum Entropy Genomic Annotations from Biomarkers Associated with Structural Ensembles (MEGABASE)) to determine the structural type of a segment of chromatin using its epigenetic marking patterns.16 A neural network was used to convert chromatin immunoprecipitation sequencing (ChIP-Seq) data into chromatin type annotations. These annotations became an input to MiChroM to produce structural ensembles for chromosomes of a lymphoblastoid cell line (GM12878) for which high-resolution Hi-C data are available. These predictions were extensively validated using experimental data from Hi-C and FISH microscopy. Details of this approach can be found in reference 16, and a schematic representation is shown in figure 2.
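The three ingredients (type-to-type interactions, loops, and the ideal chromosome term) can be combined in a toy energy function. The Python sketch below is a minimal illustration under stated assumptions: the smooth contact function, its parameters `mu` and `r_c`, the function names, and all numerical values are placeholders chosen for the example, not the published MiChroM parameters (see reference 15 for the actual model).

```python
import math


def contact_function(r, mu=3.0, r_c=1.5):
    """Smooth switch from 1 (close) to 0 (far); mu and r_c are placeholders."""
    return 0.5 * (1.0 + math.tanh(mu * (r_c - r)))


def michrom_like_energy(coords, types, alpha, loops, chi, gamma):
    """Toy energy with the three terms named in the text.

    coords : list of (x, y, z) bead positions
    types  : chromatin type label of each bead, e.g. "A1" or "B2"
    alpha  : dict mapping a sorted pair of types to an interaction energy
    loops  : set of (i, j) bead pairs with i < j that form loops
    chi    : energy assigned to each loop contact
    gamma  : function of genomic separation d = j - i (ideal chromosome term)
    """
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            f = contact_function(math.dist(coords[i], coords[j]))
            pair = alpha[tuple(sorted((types[i], types[j])))]
            loop = chi if (i, j) in loops else 0.0
            ideal = gamma(j - i)
            energy += (pair + loop + ideal) * f
    return energy


# Two beads exactly at the contact distance r_c, so the contact function is 0.5.
example_energy = michrom_like_energy(
    coords=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    types=["A1", "B1"],
    alpha={("A1", "B1"): -0.2},   # placeholder value
    loops={(0, 1)},
    chi=-1.0,                     # placeholder value
    gamma=lambda d: -0.1,         # placeholder value
)
```

Weighting every term by the same contact function keeps the energy a smooth function of the coordinates, which is convenient for molecular dynamics. A full polymer model would also need bonded and excluded-volume terms, which are omitted here.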
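The annotation step, which turns per-locus epigenetic signal into a chromatin type, can be pictured with a deliberately simplified Python sketch. A single linear layer followed by a softmax stands in for the trained model here. The two-feature input, the weights, and the function names are invented for illustration and bear no relation to the trained MEGABASE parameters (see reference 16 for the actual method).

```python
import math

# The six sub-compartment labels named in the text.
TYPES = ["A1", "A2", "B1", "B2", "B3", "B4"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def annotate(signals, weights, biases):
    """Assign the most probable type to each locus.

    signals : one feature vector per locus (e.g. binned ChIP-Seq signal)
    weights : one weight row per chromatin type (hypothetical values)
    biases  : one bias per chromatin type (hypothetical values)
    """
    annotations = []
    for x in signals:
        scores = [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)
        ]
        probs = softmax(scores)
        annotations.append(TYPES[probs.index(max(probs))])
    return annotations


# Hypothetical weights for two made-up marks (one "active", one "repressive").
W = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.5], [-1.0, -1.0], [-0.5, -0.5]]
b = [0.0] * 6
labels = annotate([[2.0, 0.0], [0.0, 2.0]], W, b)
```

The resulting list of labels, one per 50-kilobase locus, is the kind of one-dimensional annotation that a structural model takes as input.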
Figure 1. (left) Naked DNA is decorated by proteins and epigenetic markings that differ between cell types. Our results indicate that these markings carry enough information to determine the 3D architecture of the genome, which is also cell-type specific. (right) Validity of MiChroM is checked by direct comparison to Hi-C maps. It is particularly important to check predictions for contacts between loci at large genomic distances since they are vital to determine the success of the model.
Figure 2. Schematic representation of the MEGABASE+MiChroM computational pipeline. Initially, we utilize chromatin Immuno-precipitation Sequencing (ChIP-Seq) assays which contain the epigenetic markings of a cell type. This information is publicly available from the NIH ENCODE Project database. Utilizing these data, our machine-learning MEGABASE method generates annotations for the chromosomal types. Finally, molecular dynamics simulations are utilized to create the ensemble of 3D chromosome structures.
The MEGABASE+MiChroM computational pipeline allows simulating the conformation of entire genomes using only the sequence of epigenetic marking patterns without the need for any additional experimental information. Using a combined approach of machine learning and physical modeling, we were able to achieve a degree of quantitative accuracy that makes our model not only able to predict 3D conformations but also a viable explanation of the mechanism behind chromatin folding.
The MEGABASE+MiChroM computational pipeline makes it possible to extensively investigate the ensemble of structures of genomes in many different cell lines. Experimental data from ligation assays show that the structure of chromosomes changes according to the phenotype,17,18 as observed, for example, during cell differentiation. Structural changes have also been detected in cancer cells.19 Using this new theoretical approach, the 3D genome conformations for a large number of human cell lines can now be generated and the results validated by comparison with ligation assay data, whenever such data are available. A systematic study of the relationship between structural ensembles and the expression patterns in each phenotype is now becoming possible.
It is important to keep in mind that this theoretical model is based on the three physical assumptions described above: compartment formation by type segregation, loop formation, and the ideal chromosome. Although most of the current experimental information supports them, much more work is needed to determine their validity. Also, most of the parameters used in this model have been learned from experimental data. Additional work is still needed to link them to detailed physical descriptions. The quality of the current predictions, however, provides us with sufficient confidence that the basic physical mechanisms governing genome folding are well described by this theoretical framework.
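One simple way to compare a simulated contact map with an experimental one is to correlate them separately at each genomic separation, so that the strong overall decay of contacts with distance does not dominate the score. The Python sketch below illustrates this idea. The function names and toy matrices are assumptions for the example, and this metric is offered as one reasonable choice rather than as the validation protocol of the cited work.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def diagonal(matrix, d):
    """Entries of a square matrix at genomic separation d (in beads)."""
    return [matrix[i][i + d] for i in range(len(matrix) - d)]


def correlation_by_distance(simulated, experimental, max_d):
    """Pearson correlation between two maps at each separation 1..max_d."""
    return {
        d: pearson(diagonal(simulated, d), diagonal(experimental, d))
        for d in range(1, max_d + 1)
    }


# Toy maps: the "experimental" map is the simulated one rescaled by 2.
sim = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.6, 0.2],
    [0.3, 0.6, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
exp_map = [[2.0 * v for v in row] for row in sim]
scores = correlation_by_distance(sim, exp_map, max_d=2)
```

Each separation needs at least two entries with some variation for the correlation to be defined, so `max_d` should stay well below the matrix size.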
1. Cremer T., Cremer C., Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nat Rev Genet 2, 292-301 (2001).
2. Bickmore W.A., The Spatial Organization of the Human Genome. Annu Rev Genom Hum G 14, 67-84 (2013).
3. Fullwood, M.J. et al., An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 462, 58-64 (2009).
4. Gondor, A., Dynamic chromatin loops bridge health and disease in the nuclear landscape. Semin Cancer Biol 23, 90-98 (2013).
5. Krijger, P.H., de Laat, W., Regulation of disease-associated gene expression in the 3D genome. Nat Rev Mol Cell Biol 17, 771-782 (2016).
6. Montefiori, L. et al., Extremely Long-Range Chromatin Loops Link Topological Domains to Facilitate a Diverse Antibody Repertoire. Cell Rep 14, 896-906 (2016).
7. Lieberman-Aiden, E. et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science 326, 289-293 (2009).
8. Dekker J., Rippe K., Dekker M., Kleckner N., Capturing chromosome conformation. Science 295, 1306-1311 (2002).
9. Jost D., Carrivain P., Cavalli G., Vaillant C., Modeling epigenome folding: formation and dynamics of topologically associated chromatin domains. Nucleic Acids Res 42, 9553-9561 (2014).
10. Zhang B., Wolynes P.G., Topology, structures, and energy landscapes of human chromosomes. P Natl Acad Sci USA 112, 6062-6067 (2015).
11. Bryngelson J.D., Onuchic J.N., Socci N.D., Wolynes P.G., Funnels, Pathways, and the Energy Landscape of Protein-Folding – a Synthesis. Proteins-Structure Function and Genetics 21, 167-195 (1995).
12. Wolynes P.G., Evolution, energy landscapes and the paradoxes of protein folding. Biochimie 119, 218-230 (2015).
13. Leopold P.E., Montal M., Onuchic J.N., Protein folding funnels: a kinetic approach to the sequence-structure relationship. P Natl Acad Sci USA 89, 8721-8725 (1992).
14. Zhang B., Wolynes P.G., Shape Transitions and Chiral Symmetry Breaking in the Energy Landscape of the Mitotic Chromosome. Phys Rev Lett 116, 248101 (2016).
15. Di Pierro, M., Zhang, B., Aiden, E.L., Wolynes, P.G., Onuchic, J.N. Transferable model for chromosome architecture. P Natl Acad Sci USA 113, 12168-12173 (2016).
16. Di Pierro, M., Cheng, R.R., Aiden, E.L., Wolynes, P.G., Onuchic, J.N. De novo prediction of human chromosome structures: Epigenetic marking patterns encode genome architecture. P Natl Acad Sci USA 114, 12126-12131 (2017).
17. Krijger, P.H. et al., Cell-of-Origin-Specific 3D Genome Structure Acquired during Somatic Cell Reprogramming. Cell Stem Cell 18, 597-610 (2016).
18. Rao, S.S.P. et al., A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell 159, 1665-1680 (2014).
19. Akdemir, K.C. et al. Spatial Genome Organization as a Framework for Somatic Alterations in Human Cancer. bioRxiv, doi:10.1101/179176 (2017).