DeepC: predicting 3D genome folding using megabase-scale transfer learning

Schwessinger, Ron; Gosden, Matthew; Downes, Damien; Brown, Richard C.; Oudelaar, A. Marieke; Telenius, Jelena; Teh, Yee Whye; Lunter, Gerton; Hughes, Jim R.

doi:10.1038/s41592-020-0960-3

Article
Published: 12 October 2020

Nature Methods volume 17, pages 1118–1124 (2020)

Abstract

Predicting the impact of noncoding genetic variation requires interpreting it in the context of three-dimensional genome architecture. We have developed deepC, a transfer-learning-based deep neural network that accurately predicts genome folding from megabase-scale DNA sequence. DeepC predicts domain boundaries at high resolution, learns the sequence determinants of genome folding and predicts the impact of both large-scale structural and single base-pair variations.

**Fig. 1: Predicting Hi-C interactions from DNA sequence.**

**Fig. 2: Validation of deepC predictions.**

**Fig. 3: DeepC for dissecting the determinants of genome folding and predicting the impact of variation.**

Data availability

Hi-C data from Rao et al. are available under GSE63525. Chromatin feature data from ENCODE, Roadmap and other publicly available sources are listed in detail with accession numbers in Supplementary Table 1. Additional ENCODE data used for chromatin segmentation and visualization are listed with accession numbers in Supplementary Table 5. Tiled-C and NG Capture-C validation data are available under the Gene Expression Omnibus superseries GSE137437. Source data are provided with this paper.

Code availability

All code for training and employing deepC networks, as well as the trained models, is available at https://github.com/rschwess/deepC; all code for training and employing chromatin feature networks is available at https://github.com/rschwess/deepHaem.

References

Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).
Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
Schreiber, J., Libbrecht, M., Bilmes, J. & Noble, W. S. Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture. Preprint at bioRxiv https://doi.org/10.1101/103614 (2017).
Whalen, S., Truty, R. M. & Pollard, K. S. Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat. Genet. 48, 488–496 (2016).
Li, W., Wong, W. H. & Jiang, R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 47, e60 (2019).
Qi, Y. & Zhang, B. Predicting three-dimensional genome organization with chromatin states. PLoS Comput. Biol. 15, e1007024 (2019).
Belokopytova, P. S., Nuriddinov, M. A., Mozheiko, E. A., Fishman, D. & Fishman, V. Quantitative prediction of enhancer–promoter interactions. Genome Res. 30, 72–84 (2020).
Zhang, S., Chasman, D., Knaack, S. & Roy, S. In silico prediction of high-resolution Hi-C interaction matrices. Nat. Commun. 10, 5449 (2019).
Buckle, A., Brackley, C. A., Boyle, S., Marenduzzo, D. & Gilbert, N. Polymer simulations of heteromorphic chromatin predict the 3D folding of complex genomic loci. Mol. Cell 72, 786–797.e11 (2018).
Bianco, S. et al. Polymer physics predicts the effects of structural variants on chromatin architecture. Nat. Genet. 50, 662–667 (2018).
Hnisz, D., Day, D. S. & Young, R. A. Insulated neighborhoods: structural and functional units of mammalian gene control. Cell 167, 1188–1200 (2016).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Yu, F. & Koltun, V. Multi-scale context aggregation by dilated convolutions. Preprint at http://arxiv.org/abs/1511.07122 (2015).
van den Oord, A. et al. WaveNet: a generative model for raw audio. Preprint at https://arxiv.org/abs/1609.03499 (2016).
Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? Adv. Neural Inf. Process. Syst. 4, 3320–3328 (2014).
Bernstein, B. E. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Bonev, B. et al. Multiscale 3D genome rewiring during mouse neural development. Cell 171, 557–572.e24 (2017).
Zhang, Y. et al. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat. Commun. 9, 750 (2018).
Liu, Q., Lv, H. & Jiang, R. hicGAN infers super resolution Hi-C data with generative adversarial networks. Bioinformatics 35, i99–i107 (2019).
Davies, J. O. J. et al. Multiplexed analysis of chromosome conformation at vastly improved sensitivity. Nat. Methods 13, 74–80 (2016).
Fudenberg, G. et al. Formation of chromosomal domains by loop extrusion. Cell Rep. 15, 2038–2049 (2016).
Crane, E. et al. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature 523, 240–244 (2015).
Shin, H. et al. TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res. 44, e70 (2015).
Zufferey, M., Tavernari, D., Oricchio, E. & Ciriello, G. Comparison of computational methods for the identification of topologically associating domains. Genome Biol. 19, 217 (2018).
Oudelaar, A. M. et al. Dissection of the 4D chromatin structure of the α-globin locus through in vivo erythroid differentiation with extreme spatial and temporal resolution. Preprint at bioRxiv https://doi.org/10.1101/763763 (2019).
Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. In Proc. 2nd International Conference on Learning Representations (ICLR 2014) Workshop Track (2013).
Hnisz, D. et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 351, 1454–1458 (2016).
Schmiedel, B. J. et al. 17q21 asthma-risk variants switch CTCF binding and regulate IL-2 production by T cells. Nat. Commun. 7, 13426 (2016).
Robson, M. I., Ringel, A. R. & Mundlos, S. Regulatory landscaping: how enhancer-promoter communication is sculpted in 3D. Mol. Cell 74, 1110–1122 (2019).
Dixon, J. R., Gorkin, D. U. & Ren, B. Chromatin domains: the unit of chromosome organization. Mol. Cell 62, 668–680 (2016).
Marti-Renom, M. A. et al. Challenges and guidelines toward 4D nucleome data and model standards. Nat. Genet. 50, 1352–1358 (2018).
Spielmann, M., Lupiáñez, D. G. & Mundlos, S. Structural variation in the 3D genome. Nat. Rev. Genet. 19, 453–467 (2018).
Fudenberg, G., Kelley, D. R. & Pollard, K. S. Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods https://doi.org/10.1038/s41592-020-0958-x (2020).
Corces, M. R. et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat. Genet. 48, 1193–1203 (2016).
Schwessinger, R. et al. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Res. 27, 1730–1742 (2017).
Downes, D. J. et al. An integrated platform to systematically identify causal variants and genes for polygenic human traits. Preprint at bioRxiv https://doi.org/10.1101/813618 (2019).
Telenius, J., Consortium, T. W. & Hughes, J. R. NGseqBasic—a single-command UNIX tool for ATAC-seq, DNaseI-seq, Cut-and-Run, and ChIP–seq data mapping, high-resolution visualisation, and quality control. Preprint at bioRxiv https://doi.org/10.1101/393413 (2018).
Zhang, Y. et al. Model-based analysis of ChIP–Seq (MACS). Genome Biol. 9, R137 (2008).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Multimed. Tools Appl. 77, 10437–10453 (2015).
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259 (2015).
Telenius, J. M. et al. CaptureCompendium: a comprehensive toolkit for 3C analysis. Preprint at bioRxiv http://doi.org/10.1101/2020.02.17.952572 (2020).
Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42, W187–W191 (2014).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Zacher, B. et al. Accurate promoter and enhancer identification in 127 ENCODE and roadmap epigenomics cell types and tissues by GenoSTAN. PLoS ONE 12, e0169249 (2017).
Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16) 265–284 (2016).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2009).
Lawrence, M., Gentleman, R. & Carey, V. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics 25, 1841–1842 (2009).
Zeileis, A. & Grothendieck, G. Zoo: S3 infrastructure for regular and irregular time series. J. Stat. Softw. 14, 1–27 (2005).
van der Walt, S., Colbert, S. C. & Varoquaux, G. The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30 (2011).

Acknowledgements

We thank R. Beagrie for help in refining the manuscript. This work was supported by the MRC (grant no. MC_UU_00016/14 to J.R.H.); the Wellcome Trust via a Strategic Award (no. 106130/Z/14/Z to J.R.H.) and the Institutional Strategic Support Fund (reference no. 105605/Z/14/Z to J.R.H.); the Wellcome Trust Genomic Medicine and Statistics PhD Program (grant nos. 203728/Z/16/Z to R.S. and 203141/Z/16/Z to R.C.B.); and the Stevenson Junior Research Fellowship at University College, Oxford (to A.M.O.). G.L. is supported by a Wellcome Trust supporting award (no. 090532/Z/09/Z). Y.W.T. is supported by the European Research Council under the European Union’s Seventh Framework Program (grant no. FP7/2007-2013), ERC grant agreement no. 617071. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by the NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS.

Author information

Authors and Affiliations

MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
Ron Schwessinger, Matthew Gosden, Damien Downes, A. Marieke Oudelaar & Jim R. Hughes
MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
Ron Schwessinger, A. Marieke Oudelaar, Jelena Telenius, Gerton Lunter & Jim R. Hughes
Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
Ron Schwessinger, Richard C. Brown & Gerton Lunter
Department of Statistics, University of Oxford, Oxford, UK
Yee Whye Teh

Authors

Ron Schwessinger
Matthew Gosden
Damien Downes
Richard C. Brown
A. Marieke Oudelaar
Jelena Telenius
Yee Whye Teh
Gerton Lunter
Jim R. Hughes

Contributions

R.S., G.L. and J.R.H. conceived the project. R.S., R.C.B., Y.W.T. and G.L. designed the neural network architectures. R.S. optimized and trained the neural networks and performed downstream analysis. R.S., M.G., D.D., A.M.O. and J.R.H. designed and evaluated the validation strategy. M.G. performed NG Capture-C experiments. D.D. performed Tiled-C experiments. R.S., A.M.O. and J.T. performed bioinformatic analysis of NG Capture-C and Tiled-C. R.S. performed integrative analysis and prepared the figures. R.S., G.L. and J.R.H. wrote the manuscript with input from all authors.

Corresponding authors

Correspondence to Gerton Lunter or Jim R. Hughes.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Percentile normalizing Hi-C data for deep learning.

The Hi-C interactions are percentile-binned in a distance-stratified manner. For every genomic distance, in steps equal to the bin size, the Hi-C signal is split into unequal percentiles ranging from the bottom 20% to the top 5%. The percentiles are assigned the values 1 to 10, yielding the Hi-C skeleton. The unequal percentile sizes ensure a finer distinction of differences in the high Hi-C interaction value range, while minor differences in the low interaction value range are compressed. Effectively, this procedure reduces the proximity signal and enhances domains and domain boundaries.
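The normalization described in this caption can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the deepC implementation: the function name `percentile_skeleton` and the intermediate bin widths in `BIN_WIDTHS` are assumptions, since the caption specifies only that there are ten classes running from a 20% bottom bin to a 5% top bin.

```python
import numpy as np

# Hypothetical bin widths in percent, ordered from lowest to highest signal.
# Only the 20% bottom bin and the 5% top bin come from the caption;
# the widths in between are assumed for illustration.
BIN_WIDTHS = [20, 15, 15, 10, 10, 10, 5, 5, 5, 5]


def percentile_skeleton(contacts, widths=BIN_WIDTHS):
    """Map each contact to a class 1..len(widths) within its own diagonal.

    `contacts` is a square, symmetric matrix of binned Hi-C signal.
    Each diagonal holds all bin pairs at one genomic distance, so
    ranking within a diagonal is the distance-stratified step.
    """
    n = contacts.shape[0]
    skeleton = np.zeros((n, n), dtype=int)
    # Upper percentile edge of every bin except the last (20, 35, ..., 95).
    edges_pct = np.cumsum(widths)[:-1]
    for d in range(n):
        idx = np.arange(n - d)
        values = contacts[idx, idx + d]
        thresholds = np.percentile(values, edges_pct)
        # Number of thresholds strictly below each value, shifted to 1-based.
        classes = np.searchsorted(thresholds, values, side="left") + 1
        skeleton[idx, idx + d] = classes
        skeleton[idx + d, idx] = classes
    return skeleton
```

Because every diagonal is ranked separately, the strong decay of signal with genomic distance no longer dominates the output, which is consistent with the caption's remark that the proximity signal is reduced while domains and boundaries are enhanced.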

Extended Data Fig. 2 Comparison of deepC training with and without transfer learning.

Training a deepC model with the same architecture but without pre-seeding the lower convolutional layers with the chromatin feature model weights results in the emergence of triangular structures. Their positions, however, do not match the Hi-C structures. In contrast, with pre-seeding, the predicted domains overlap well with the Hi-C skeleton.

Extended Data Fig. 3 Tissue-specific deepC predictions.

Shown is a region on chromosome 2 around the MEIS1 locus. DeepC predicts a small domain with insulation to the upstream regions (black arrow) in a tissue-specific manner. The domain is only visible in K562 Hi-C data and matches tissue-specific CTCF binding. Tiled-C confirms the tissue-specific domain. For contrast, Tiled-C data were bounded between the 5th and 95th percentiles.

Extended Data Fig. 4 NG Capture-C validation of deepC predictions.

a) Example region with overlap of GM12878: Hi-C, skeleton and deepC prediction; NG Capture-C tracks, distance-normalized NG Capture-C tracks and CTCF ChIP-seq track (red). Shown is a CTCF viewpoint (purple triangle) and an intra domain viewpoint (blue triangle) not overlapping with any active elements. Dashed lines in the distance-normalized NG Capture-C tracks indicate the expected interaction value. Dotted black lines highlight deepC prediction details that correspond to boundaries in the NG Capture-C tracks. b) K562 data of the same region.

Extended Data Fig. 5 Mapping important features for genome folding.

Shown are GM12878 deepC predictions over the IKZF2 locus on chromosome 2 (a) and focused on the IKZF2 promoter (b). Aligned are DHS as well as ChIP-seq tracks for CTCF and histone modifications. Shown in green is the saliency score, which is a proxy for the importance of each base in predicting the chromatin interactions of that region. The saliency score shows sharp peaks overlapping CTCF binding sites and broader peaks overlapping active gene promoters. Resolving the saliency score at base-pair resolution (b) highlights CTCF and general transcription factor binding motifs.
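To make the idea of a per-base saliency score concrete, the following toy sketch computes a gradient-based saliency in the spirit of Simonyan et al. for a deliberately tiny model: a single motif filter followed by a ReLU and a sum. Everything here is an assumption for illustration (the model, the names `toy_score` and `toy_saliency`, and the choice to reduce the gradient to one value per base by keeping the entry of the observed nucleotide); the deepC saliency is computed through the trained network, which is not reproduced here.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [BASES.index(b) for b in seq]] = 1.0
    return x


def toy_score(x, w):
    """Toy model output: sum over windows of ReLU(filter response)."""
    k = w.shape[0]
    acts = np.array([np.sum(x[i:i + k] * w) for i in range(len(x) - k + 1)])
    return np.maximum(acts, 0.0).sum()


def toy_saliency(x, w):
    """Analytic gradient of toy_score w.r.t. x, one value per base."""
    k = w.shape[0]
    grad = np.zeros_like(x)
    for i in range(len(x) - k + 1):
        # The ReLU passes gradient only through windows with positive response.
        if np.sum(x[i:i + k] * w) > 0:
            grad[i:i + k] += w
    # Keep the gradient entry of the observed base and take its magnitude.
    return np.abs((grad * x).sum(axis=1))
```

With a filter that rewards a short motif and penalizes mismatches, for example `w = 2.0 * one_hot("CCCTC") - 1.0`, the saliency of a sequence containing one exact copy of the motif in a poly-A background is non-zero only over the motif positions, mirroring the sharp peaks over binding sites that the caption describes.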

Supplementary information

Supplementary Information

Supplementary Figs. 1–21 and Tables 3–5.

Reporting Summary

Supplementary Tables

Supplementary Table 1. Source of human and mouse chromatin features used. Supplementary Table 2. NG Capture-C validation probes.

Source data

Source Data Fig. 1

Distance-stratified Pearson correlation between prediction and Hi-C skeleton across chromosomes in cross-validation scheme. Smoothed and unsmoothed (M0–M2).
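A distance-stratified Pearson correlation of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `distance_stratified_pearson` is hypothetical, both inputs are assumed to be square matrices on the same bin grid, and the smoothing mentioned above is not included.

```python
import numpy as np


def distance_stratified_pearson(pred, target, max_distance=None):
    """Pearson correlation between two contact maps, one value per distance.

    Distance d (in bins) corresponds to the d-th diagonal of each matrix.
    Returns a dict mapping d to the correlation on that diagonal, or NaN
    when either diagonal is constant.
    """
    n = pred.shape[0]
    # At least two bin pairs are needed for a correlation, so d <= n - 2.
    limit = n - 2 if max_distance is None else min(max_distance, n - 2)
    result = {}
    for d in range(1, limit + 1):
        idx = np.arange(n - d)
        p = pred[idx, idx + d]
        t = target[idx, idx + d]
        if p.std() == 0 or t.std() == 0:
            result[d] = float("nan")
        else:
            result[d] = float(np.corrcoef(p, t)[0, 1])
    return result
```

Stratifying by distance prevents the distance decay shared by all contact maps from inflating the correlation, so each value reflects agreement in structure at one genomic separation.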

Source Data Fig. 2

a, Pearson correlation between NG Capture-C profiles and virtual4C profiles from the Hi-C skeleton and predictions, respectively. b1, Average distance-normalized NG Capture-C signal over insulation-score called boundaries in GM12878. b2, Average distance-normalized NG Capture-C signal over insulation-score called boundaries in GM12878.

Source Data Fig. 3

a, Meta-profiles of saliency over stratified genomic elements. b, Predicted effects of genomic element deletion screen. c, Predicted effect of 1,000 randomly sampled SNPs within CTCF sites.

About this article

Cite this article

Schwessinger, R., Gosden, M., Downes, D. et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat Methods 17, 1118–1124 (2020). https://doi.org/10.1038/s41592-020-0960-3

Received: 28 January 2020
Accepted: 20 August 2020
Published: 12 October 2020
Issue Date: November 2020
DOI: https://doi.org/10.1038/s41592-020-0960-3
