Abstract
Predicting the impact of noncoding genetic variation requires interpreting it in the context of three-dimensional genome architecture. We have developed deepC, a transfer-learning-based deep neural network that accurately predicts genome folding from megabase-scale DNA sequence. DeepC predicts domain boundaries at high resolution, learns the sequence determinants of genome folding and predicts the impact of both large-scale structural and single base-pair variations.
Data availability
Hi-C data from Rao et al. is available under GSE63525. Chromatin feature data from ENCODE, Roadmap and other publicly available data are listed in detail with accession numbers in Supplementary Table 1. Additional ENCODE data used for chromatin segmentation and visualization are listed with accession numbers in Supplementary Table 5. Tiled-C and NG Capture-C validation data are available under the Gene Expression Omnibus superseries GSE137437. Source data are provided with this paper.
Code availability
All code for training and employing deepC networks, as well as trained models, is available at https://github.com/rschwess/deepC; all code for training and employing chromatin feature networks is available at https://github.com/rschwess/deepHaem.
References
Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).
Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
Schreiber, J., Libbrecht, M., Bilmes, J. & Noble, W. S. Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture. Preprint at bioRxiv https://doi.org/10.1101/103614 (2017).
Whalen, S., Truty, R. M. & Pollard, K. S. Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat. Genet. 48, 488–496 (2016).
Li, W., Wong, W. H. & Jiang, R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 47, e60 (2019).
Qi, Y. & Zhang, B. Predicting three-dimensional genome organization with chromatin states. PLoS Comput. Biol. 15, e1007024 (2019).
Belokopytova, P. S., Nuriddinov, M. A., Mozheiko, E. A., Fishman, D. & Fishman, V. Quantitative prediction of enhancer–promoter interactions. Genome Res. 30, 72–84 (2020).
Zhang, S., Chasman, D., Knaack, S. & Roy, S. In silico prediction of high-resolution Hi-C interaction matrices. Nat. Commun. 10, 5449 (2019).
Buckle, A., Brackley, C. A., Boyle, S., Marenduzzo, D. & Gilbert, N. Polymer simulations of heteromorphic chromatin predict the 3D folding of complex genomic loci. Mol. Cell 72, 786–797.e11 (2018).
Bianco, S. et al. Polymer physics predicts the effects of structural variants on chromatin architecture. Nat. Genet. 50, 662–667 (2018).
Hnisz, D., Day, D. S. & Young, R. A. Insulated neighborhoods: structural and functional units of mammalian gene control. Cell 167, 1188–1200 (2016).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Yu, F. & Koltun, V. Multi-scale context aggregation by dilated convolutions. Preprint at http://arxiv.org/abs/1511.07122 (2015).
van den Oord, A. et al. WaveNet: a generative model for raw audio. Preprint at https://arxiv.org/abs/1609.03499 (2016).
Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? Adv. Neural Inf. Process. Syst. 4, 3320–3328 (2014).
Bernstein, B. E. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Bonev, B. et al. Multiscale 3D genome rewiring during mouse neural development. Cell 171, 557–572.e24 (2017).
Zhang, Y. et al. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat. Commun. 9, 750 (2018).
Liu, Q., Lv, H. & Jiang, R. hicGAN infers super resolution Hi-C data with generative adversarial networks. Bioinformatics 35, i99–i107 (2019).
Davies, J. O. J. et al. Multiplexed analysis of chromosome conformation at vastly improved sensitivity. Nat. Methods 13, 74–80 (2016).
Fudenberg, G. et al. Formation of chromosomal domains by loop extrusion. Cell Rep. 15, 2038–2049 (2016).
Crane, E. et al. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature 523, 240–244 (2015).
Shin, H. et al. TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res. 44, e70 (2015).
Zufferey, M., Tavernari, D., Oricchio, E. & Ciriello, G. Comparison of computational methods for the identification of topologically associating domains. Genome Biol. 19, 217 (2018).
Oudelaar, A. M. et al. Dissection of the 4D chromatin structure of the α-globin locus through in vivo erythroid differentiation with extreme spatial and temporal resolution. Preprint at bioRxiv https://doi.org/10.1101/763763 (2019).
Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. In Proc. 2nd International Conference on Learning Representations (ICLR 2014) Workshop Track (2013).
Hnisz, D. et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 351, 1454–1458 (2016).
Schmiedel, B. J. et al. 17q21 asthma-risk variants switch CTCF binding and regulate IL-2 production by T cells. Nat. Commun. 7, 13426 (2016).
Robson, M. I., Ringel, A. R. & Mundlos, S. Regulatory landscaping: how enhancer-promoter communication is sculpted in 3D. Mol. Cell 74, 1110–1122 (2019).
Dixon, J. R., Gorkin, D. U. & Ren, B. Chromatin domains: the unit of chromosome organization. Mol. Cell 62, 668–680 (2016).
Marti-Renom, M. A. et al. Challenges and guidelines toward 4D nucleome data and model standards. Nat. Genet. 50, 1352–1358 (2018).
Spielmann, M., Lupiáñez, D. G. & Mundlos, S. Structural variation in the 3D genome. Nat. Rev. Genet. 19, 453–467 (2018).
Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods https://doi.org/10.1038/s41592-020-0958-x (2020).
Corces, M. R. et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat. Genet. 48, 1193–1203 (2016).
Schwessinger, R. et al. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Res. 27, 1730–1742 (2017).
Downes, D. J. et al. An integrated platform to systematically identify causal variants and genes for polygenic human traits. Preprint at bioRxiv https://doi.org/10.1101/813618 (2019).
Telenius, J., Consortium, T. W. & Hughes, J. R. NGseqBasic—a single-command UNIX tool for ATAC-seq, DNaseI-seq, Cut-and-Run, and ChIP–seq data mapping, high-resolution visualisation, and quality control. Preprint at bioRxiv https://doi.org/10.1101/393413 (2018).
Zhang, Y. et al. Model-based analysis of ChIP–Seq (MACS). Genome Biol. 9, R137 (2008).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Multimed. Tools Appl. 77, 10437–10453 (2015).
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259 (2015).
Telenius, J. M. et al. CaptureCompendium: a comprehensive toolkit for 3C analysis. Preprint at bioRxiv http://doi.org/10.1101/2020.02.17.952572 (2020).
Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42, W187–W191 (2014).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Zacher, B. et al. Accurate promoter and enhancer identification in 127 ENCODE and roadmap epigenomics cell types and tissues by GenoSTAN. PLoS ONE 12, e0169249 (2017).
Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16) 265–284 (2016).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2009).
Lawrence, M., Gentleman, R. & Carey, V. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics 25, 1841–1842 (2009).
Zeileis, A. & Grothendieck, G. Zoo: S3 infrastructure for regular and irregular time series. J. Stat. Softw. 14, 1–27 (2005).
van der Walt, S., Colbert, S. C. & Varoquaux, G. The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30 (2011).
Acknowledgements
We thank R. Beagrie for help in refining the manuscript. This work was supported by the MRC (grant no. MC_UU_00016/14 to J.R.H.), the Wellcome Trust via a Strategic Award (no. 106130/Z/14/Z to J.R.H.) and the Institutional Strategic Support Fund (reference no. 105605/Z/14/Z to J.R.H.), the Wellcome Trust Genomic Medicine and Statistics PhD Program (grant nos. 203728/Z/16/Z to R.S. and 203141/Z/16/Z to R.C.B.) and the Stevenson Junior Research Fellowship at University College, Oxford (to A.M.O.). G.L. is supported by the Wellcome Trust supporting award (no. 090532/Z/09/Z). Y.W.T. is supported by the European Research Council under the European Union’s Seventh Framework Program (grant no. FP7/2007-2013) ERC grant agreement no. 617071. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by the NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS.
Author information
Contributions
R.S., G.L. and J.R.H. conceived the project. R.S., R.C.B., Y.W.T. and G.L. designed the neural network architectures. R.S. optimized and trained the neural networks and performed downstream analysis. R.S., M.G., D.D., A.M.O. and J.R.H. designed and evaluated the validation strategy. M.G. performed NG Capture-C experiments. D.D. performed Tiled-C experiments. R.S., A.M.O. and J.T. performed bioinformatic analysis of NG Capture-C and Tiled-C. R.S. performed integrative analysis and prepared the figures. R.S., G.L. and J.R.H. wrote the manuscript with input from all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Percentile normalizing Hi-C data for deep learning.
The Hi-C interactions are percentile-binned in a distance-stratified manner. For every genomic distance, in steps equal to the bin size, the Hi-C signal is split into unequal percentiles ranging from the bottom 20% to the top 5%. The percentiles are assigned the values 1 to 10, yielding the Hi-C skeleton. The unequal percentile sizes ensure a finer distinction of differences in the high Hi-C interaction value range, while minor differences in the low interaction value range are compressed. Effectively, this procedure reduces the proximity signal and enhances domains and domain boundaries.
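The distance-stratified binning described in this caption can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the deepC implementation: the function name `percentile_skeleton` is hypothetical, and the class widths in `ASSUMED_WIDTHS` are invented for the example. Only the 20% bottom class and the 5% top class come from the caption.

```python
import numpy as np

# ASSUMPTION: illustrative class widths in percent, bottom class first.
# Only the 20% bottom and 5% top widths are stated in the caption.
ASSUMED_WIDTHS = np.array([20, 15, 15, 10, 10, 10, 5, 5, 5, 5], dtype=float)


def percentile_skeleton(contacts, widths=ASSUMED_WIDTHS):
    """Map a square contact matrix to classes 1..len(widths), one diagonal at a time.

    Each diagonal holds all bin pairs at the same genomic distance, so
    thresholds are computed separately per distance.
    """
    contacts = np.asarray(contacts, dtype=float)
    n = contacts.shape[0]
    skeleton = np.zeros((n, n), dtype=int)
    # Interior cut points in percent, e.g. 20, 35, 50, ..., 95.
    cuts = np.cumsum(widths)[:-1]
    for d in range(n):
        diag = np.diagonal(contacts, offset=d)
        thresholds = np.percentile(diag, cuts)
        # Values at or below the first threshold fall into class 1.
        classes = np.searchsorted(thresholds, diag, side="left") + 1
        idx = np.arange(n - d)
        skeleton[idx, idx + d] = classes
        skeleton[idx + d, idx] = classes  # keep the matrix symmetric
    return skeleton
```

Calling `percentile_skeleton(matrix)` on a symmetric contact matrix returns an integer matrix of the same shape. Because thresholds are recomputed for every diagonal, the strong near-diagonal signal no longer dominates, which is the effect the caption attributes to the procedure.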
Extended Data Fig. 2 Comparison of deepC training with and without transfer learning.
Training a deepC model with the same architecture but without pre-seeding the lower convolutional layers with the chromatin feature model weights results in the emergence of triangular structures. Their positioning, however, does not match the Hi-C structures. In contrast, with pre-seeding, the predicted domains overlap well with the Hi-C skeleton.
Extended Data Fig. 3 Tissue-specific deepC predictions.
Shown is a region on chromosome 2 around the MEIS1 locus. DeepC predicts a small domain with insulation to the upstream regions (black arrow) in a tissue-specific manner. The domain is only visible in K562 Hi-C data and matches tissue-specific CTCF binding. Tiled-C confirms the tissue-specific domain. For contrast, Tiled-C data were bounded between the 5th and 95th percentiles.
Extended Data Fig. 4 NG Capture-C validation of deepC predictions.
a) Example region with overlap of GM12878: Hi-C, skeleton and deepC prediction; NG Capture-C tracks, distance-normalized NG Capture-C tracks and CTCF ChIP-seq track (red). Shown are a CTCF viewpoint (purple triangle) and an intra-domain viewpoint (blue triangle) not overlapping with any active elements. Dashed lines in the distance-normalized NG Capture-C tracks indicate the expected interaction value. Dotted black lines highlight deepC prediction details that correspond to boundaries in the NG Capture-C tracks. b) K562 data of the same region.
Extended Data Fig. 5 Mapping important features for genome folding.
Shown are GM12878 deepC predictions over the IKZF2 locus (a) on chromosome 2 and focused on the IKZF2 promoter (b). Aligned are DHS as well as ChIP-seq tracks for CTCF and histone modifications. Shown in green is the saliency score, which is a proxy for the importance of each base in predicting the chromatin interactions of that region. The saliency score shows sharp peaks overlapping CTCF binding sites and broader peaks overlapping active gene promoters. Resolving the saliency score at base-pair resolution (b) highlights CTCF and general transcription factor binding motifs.
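To make the idea of a per-base saliency score concrete, the sketch below computes the gradient of a model output with respect to a one-hot encoded sequence and reads it off at the observed base. Everything here is an assumption for illustration: `toy_model` is a hypothetical stand-in for a trained network (it simply rewards matches to a fixed motif), and the gradient is approximated by central finite differences rather than the backpropagation a real deep learning framework would use.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, ALPHABET.index(base)] = 1.0
    return x


def toy_model(x, motif):
    """HYPOTHETICAL predictor: sum of squared motif match scores over all windows."""
    k = motif.shape[0]
    scores = np.array(
        [np.sum(x[i:i + k] * motif) for i in range(x.shape[0] - k + 1)]
    )
    return np.sum(scores ** 2)


def saliency(x, predict, eps=1e-4):
    """Per-base saliency: |d predict / d x| evaluated at the observed base.

    The gradient is estimated by central differences, one input entry at a time.
    """
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (predict(x + step) - predict(x - step)) / (2 * eps)
    # x is one-hot, so multiplying selects the gradient of the observed base.
    return np.abs(np.sum(grad * x, axis=1))
```

For a sequence such as `"TTTTCCGGTTTT"` and the motif `one_hot("CCGG")`, `saliency(one_hot(seq), lambda z: toy_model(z, motif))` is non-zero only over the four motif bases, which mirrors on a small scale how sharp saliency peaks can mark a binding site.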
Supplementary information
Supplementary Information
Supplementary Figs. 1–21 and Tables 3–5.
Supplementary Tables
Supplementary Table 1. Source of human and mouse chromatin features used. Supplementary Table 2. NG Capture-C validation probes.
Source data
Source Data Fig. 1
Distance-stratified Pearson correlation between prediction and Hi-C skeleton across chromosomes in cross-validation scheme. Smoothed and unsmoothed (M0–M2).
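A distance-stratified Pearson correlation of this kind can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the function name `distance_stratified_pearson` and the `max_offset` parameter are hypothetical, and both inputs are assumed to be square matrices on the same binning.

```python
import numpy as np


def distance_stratified_pearson(pred, target, max_offset=None):
    """Pearson correlation between two contact maps, computed per diagonal.

    Returns a dict mapping the diagonal offset (distance in bins) to the
    correlation, or NaN where it is undefined.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n = pred.shape[0]
    if max_offset is None:
        max_offset = n - 1
    result = {}
    for d in range(1, max_offset + 1):
        x = np.diagonal(pred, offset=d)
        y = np.diagonal(target, offset=d)
        # Correlation needs several points and non-constant inputs.
        if x.size < 3 or x.std() == 0 or y.std() == 0:
            result[d] = float("nan")
        else:
            result[d] = float(np.corrcoef(x, y)[0, 1])
    return result
```

Stratifying by distance keeps the strong decay of contact frequency with genomic separation from inflating the correlation, so each value reflects agreement among bin pairs at one distance only.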
Source Data Fig. 2
a, Pearson correlation between NG Capture-C profiles and virtual4C profiles from the Hi-C skeleton and predictions respectively. b1, Average distance-normalized NG Capture-C signal over insulation-score called boundaries in GM12878. b2, Average distance-normalized NG Capture-C signal over insulation-score called boundaries in GM12878.
Source Data Fig. 3
a, Meta-profiles of saliency over stratified genomic elements. b, Predicted effects of genomic element deletion screen. c, Predicted effect of 1,000 randomly sampled SNPs within CTCF sites.
About this article
Cite this article
Schwessinger, R., Gosden, M., Downes, D. et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat Methods 17, 1118–1124 (2020). https://doi.org/10.1038/s41592-020-0960-3
DOI: https://doi.org/10.1038/s41592-020-0960-3