Abstract
Deep learning methods have recently become the state of the art in a variety of regulatory genomic tasks (refs. 1–6), including the prediction of gene expression from genomic DNA. As such, these methods promise to serve as important tools in interpreting the full spectrum of genetic variation observed in personal genomes. Previous evaluation strategies have assessed their predictions of gene expression across genomic regions; however, systematic benchmarking is lacking to assess their predictions across individuals, which would directly evaluate their utility as personal DNA interpreters. We used paired whole genome sequencing and gene expression from 839 individuals in the ROSMAP study (ref. 7) to evaluate the ability of current methods to predict gene expression variation across individuals at varied loci. Our approach identifies a limitation of current methods in correctly predicting the direction of variant effects. We show that this limitation stems from insufficiently learned sequence motif grammar and suggest new model training strategies to improve performance.
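The across-individual evaluation described in the abstract can be illustrated with a minimal sketch: for each gene, correlate observed and predicted expression over individuals, then flag genes where the correlation is negative (the prediction points in the wrong direction). The gene names, the toy values and the helper names `pearson_r` and `per_gene_correlation` are illustrative assumptions, not the authors' code or data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / sqrt(var_x * var_y)


def per_gene_correlation(observed, predicted):
    """Map each gene to Pearson's r computed across individuals."""
    return {gene: pearson_r(observed[gene], predicted[gene]) for gene in observed}


# Hypothetical toy data: one value per individual for each gene.
observed = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [1.0, 2.0, 3.0, 4.0]}
predicted = {"GENE_A": [0.9, 2.1, 2.9, 4.2], "GENE_B": [4.0, 3.0, 2.0, 1.0]}

r_by_gene = per_gene_correlation(observed, predicted)
# Genes whose predicted variation runs opposite to the observed variation.
wrong_direction = [gene for gene, r in r_by_gene.items() if r < 0]
```

In this toy case the prediction for `GENE_B` has the right magnitude of variation but the opposite sign, which is the kind of failure the abstract refers to as an incorrectly predicted direction of effect.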
Data availability
Genotype and RNA-seq data for the Religious Orders Study and Rush Memory and Aging Project (ROSMAP) samples are available from the Synapse AMP-AD Data Portal (accession code syn2580853) as well as the RADC Research Resource Sharing Hub at www.radc.rush.edu.
Code availability
Scripts for running the analyses presented, as well as intermediate results, are available from https://github.com/mostafavilabuw/EnformerAssessment (ref. 25).
References
Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
Zhou, J. et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 51, 973–980 (2019).
Zhou, J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat. Genet. 54, 725–734 (2022).
Park, C. Y. et al. Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk. Nat. Genet. 53, 166–173 (2021).
De Jager, P. L. et al. A multi-omic atlas of the human frontal cortex for aging and Alzheimer’s disease research. Sci. Data 5, 180142 (2018).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods https://doi.org/10.1038/s41592-022-01562-8 (2022).
Maslova, A. et al. Deep learning of immune cell differentiation. Proc. Natl Acad. Sci. USA 117, 25655–25666 (2020).
Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet. 54, 940–949 (2022).
Kim, D. S. et al. The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat. Genet. https://doi.org/10.1038/s41588-021-00947-3 (2021).
Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).
Novakovsky, G., Dexter, N., Libbrecht, M. W., Wasserman, W. W. & Mostafavi, S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat. Rev. Genet. https://doi.org/10.1038/s41576-022-00532-2 (2022).
Wang, Q. S. et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat. Commun. 12, 3394 (2021).
Karollus, A., Mauermeier, T. & Gagneur, J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 24, 56 (2023).
Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
Reshef, Y. A. et al. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. Nat. Genet. 50, 1483–1493 (2018).
Bennett, D. A. et al. Religious Orders Study and Rush Memory and Aging Project. J. Alzheimers Dis. 64, S161–S189 (2018).
Mostafavi, S. et al. A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer’s disease. Nat. Neurosci. 21, 811–819 (2018).
Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 24, 14–24 (2014).
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning (eds. Precup, D. & Teh, Y. W.) Vol. 70 3319–3328 (PMLR, 2017).
Sasse, A., Ng, B. & Spiro, A. E. mostafavilabuw/EnformerAssessment: EnformerEvaluationV1. Zenodo https://doi.org/10.5281/zenodo.8274879 (2023).
Acknowledgements
We thank D. R. Kelley for helpful comments on this manuscript. We thank the participants of ROS and MAP for their essential contributions and gifts to this project. This work has been supported by many different NIH grants, including P30AG10161 (to D.A.B.), P30AG72975 (to D.A.B.), R01AG15819 (to D.A.B.), R01AG17917 (to D.A.B.), U01AG46152 (to D.A.B. and P.L.D.), U01AG61356 (to D.A.B. and P.L.D.), R01AG057911 (to C.G.), R01AG06179 (to C.G.) and R01AG036836 (to P.L.D.), as well as a CIFAR research fellowship and an NSERC Discovery Grant (to S.M.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
Conceived the study: S.M. and M.C. Study design: S.M., A.S. and M.C. Data generation and quality control analyses: B.N., A.E.S., C.G., P.L.D., S.T. and D.A.B. Analyses and interpretation: A.S., A.E.S., B.N., S.M. and M.C. Wrote the initial draft: S.M., A.S. and B.N. Read and provided comments on the manuscript: M.C., B.N., A.E.S., P.L.D., C.G., S.T. and D.A.B. Supervised the project: S.M. and M.C.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks Kaur Alasoo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data
Extended Data Fig. 1 Sensitivity analysis for Enformer Predictions.
(a) Density plot in which each dot represents a gene (n = 13,397). The x-axis shows Pearson’s r coefficients for Enformer predictions for the single most relevant track (‘CAGE,brain,adult’) and the y-axis shows those for the fine-tuned cortex model from all human tracks. Color depicts local density. (b) Pearson’s r coefficients across 839 individuals between observed expression and the predicted CAGE track from a single forward-stranded input sequence centered at the TSS (x-axis) versus the average over forward-stranded sequences shifted by −3, −2, −1, 0, 1, 2 and 3 bp and a reverse-stranded input sequence centered at the TSS (y-axis). Data are shown for a random subset of loci (n = 30). Orange line: the diagonal, where x-axis and y-axis values are equal. The correlation coefficient between the values on the x-axis and y-axis is R = 0.94. (c) Absolute Pearson’s r coefficients between Enformer predictions and observed gene expression for genes with one causal SNP versus all other genes. Causal genes were determined by the Susie algorithm (‘Susie-Causal’). Edges of the box indicate the 25th and 75th percentiles, and the central mark indicates the median (N1 = 183 genes fine-mapped with Susie, N2 = 6,625 genes without fine-mapped variants; two-sided Wilcoxon rank-sum test; for each gene, Pearson’s r coefficient was computed using n = 839 individuals).
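The averaging scheme in panel (b) can be sketched as follows: predictions from seven forward-strand windows shifted by −3 to +3 bp and from one reverse-complemented window centered at the TSS are averaged into a single value. This is a minimal sketch under stated assumptions: `toy_predict` is a hypothetical stand-in for a sequence model that returns one scalar per input, and the sequence, TSS position and window length are invented for illustration. A real model with positional output would additionally need its reverse-strand output re-oriented before averaging.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def window(sequence, center, length, shift=0):
    """Extract a window of the given length centered at center + shift."""
    start = center + shift - length // 2
    if start < 0 or start + length > len(sequence):
        raise ValueError("window falls outside the sequence")
    return sequence[start:start + length]


def augmented_prediction(predict, sequence, tss, length, shifts=range(-3, 4)):
    """Average predictions over shifted forward windows plus one
    reverse-complemented window centered at the TSS."""
    inputs = [window(sequence, tss, length, s) for s in shifts]
    inputs.append(reverse_complement(window(sequence, tss, length)))
    predictions = [predict(seq) for seq in inputs]
    return sum(predictions) / len(predictions)


def toy_predict(seq):
    """Hypothetical stand-in model: GC fraction of the input window."""
    return (seq.count("G") + seq.count("C")) / len(seq)


sequence = "ACGT" * 10  # invented 40-bp sequence
value = augmented_prediction(toy_predict, sequence, tss=20, length=8)
```

Comparing `value` with the single centered forward-strand prediction, gene by gene, is one way to check how sensitive a model is to small input perturbations, which is what the panel reports.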
Extended Data Fig. 2 Performance of the shallow CNN model.
(a) Density plot of observed population-average expression of test set genes (n = 3,401 genes) in cerebral cortex versus the simple CNN’s predicted gene expression from the reference sequences. The plot displays only genes that could be assigned to Enformer’s test set. Colors depict local density. (b) The y-axis shows Pearson’s r correlation coefficients between observed expression values and the simple CNN’s predicted values per individual. The x-axis shows the negative log10 P value computed with a gene-specific null model (one-sided t-test, n = 50 independent samples per gene; Supplementary Methods). The color represents the predicted mean expression. The red dashed line indicates FDR_BH = 0.05.
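The FDR threshold in panel (b) can be illustrated with a short sketch of a multiple-testing adjustment. It assumes that the BH subscript denotes the Benjamini–Hochberg procedure, which the caption does not spell out; the P values below are invented, and the function is a generic textbook implementation rather than the authors' code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene P values from a gene-specific null model.
pvals = [0.001, 0.04, 0.03, 0.5]
qvals = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in qvals]
```

Genes whose adjusted value falls at or below 0.05 would lie to the right of the dashed line in a plot of negative log10 P values.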
Supplementary information
Supplementary Information
Supplementary Methods and Supplementary Figs. S1–S11.
Supplementary Tables
Sheet 1: Supplementary Table 1. This table provides information on genes whose expression prediction was evaluated. Sheet 2: Header descriptions for Supplementary Table 1. Sheet 3: Supplementary Table 2. This table provides information about the driver gene analysis. Sheet 4: Header descriptions for Supplementary Table 2. Sheet 5: Supplementary Table 3. This table provides ISM values for the set of tested variants. Sheet 6: Header descriptions for Supplementary Table 3.
About this article
Cite this article
Sasse, A., Ng, B., Spiro, A.E. et al. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. Nat Genet 55, 2060–2064 (2023). https://doi.org/10.1038/s41588-023-01524-6
DOI: https://doi.org/10.1038/s41588-023-01524-6