Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings

Sasse, Alexander; Ng, Bernard; Spiro, Anna E.; Tasaki, Shinya; Bennett, David A.; Gaiteri, Christopher; De Jager, Philip L.; Chikina, Maria; Mostafavi, Sara

doi:10.1038/s41588-023-01524-6

Letter
Published: 30 November 2023

Nature Genetics volume 55, pages 2060–2064 (2023)

Abstract

Deep learning methods have recently become the state of the art in a variety of regulatory genomic tasks¹⁻⁶, including the prediction of gene expression from genomic DNA. As such, these methods promise to serve as important tools in interpreting the full spectrum of genetic variation observed in personal genomes. Previous evaluation strategies have assessed their predictions of gene expression across genomic regions; however, systematic benchmarking is lacking to assess their predictions across individuals, which would directly evaluate their utility as personal DNA interpreters. We used paired whole genome sequencing and gene expression from 839 individuals in the ROSMAP study⁷ to evaluate the ability of current methods to predict gene expression variation across individuals at varied loci. Our approach identifies a limitation of current methods to correctly predict the direction of variant effects. We show that this limitation stems from insufficiently learned sequence motif grammar and suggest new model training strategies to improve performance.
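The cross-individual evaluation described in the abstract can be illustrated with a minimal sketch: for each gene, correlate predicted and observed expression across individuals, so that a negative coefficient flags a gene whose predicted direction of effect is wrong. The function names, gene labels and toy values below are assumptions made for illustration and are not taken from the authors' code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)

def per_gene_correlation(observed, predicted):
    """Map each gene to its Pearson r across individuals.

    Both arguments are dicts of gene -> list of values, one value per
    individual, with individuals in the same order (hypothetical layout).
    """
    return {g: pearson_r(observed[g], predicted[g])
            for g in observed if g in predicted}

# Invented toy data: four individuals, two hypothetical genes.
observed = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [1.0, 2.0, 3.0, 4.0]}
predicted = {"GENE_A": [0.9, 2.1, 2.9, 4.2], "GENE_B": [4.0, 3.0, 2.0, 1.0]}

r_by_gene = per_gene_correlation(observed, predicted)
# Genes with r < 0 have the predicted direction of variation reversed.
wrong_direction = sorted(g for g, r in r_by_gene.items() if r < 0)
```

In this toy example only the hypothetical GENE_B ends up in `wrong_direction`, because its predictions are perfectly anti-correlated with the observed values.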

**Fig. 1: Evaluation of Enformer across genomic regions and select loci.**

**Fig. 2: Evaluation of Enformer on prediction of gene expression across individuals.**

Data availability

Genotype and RNA-seq data for the Religious Orders Study and Rush Memory and Aging Project (ROSMAP) samples are available from the Synapse AMP-AD Data Portal (accession code syn2580853) as well as the RADC Research Resource Sharing Hub at www.radc.rush.edu.

Code availability

Scripts for running the analyses presented, as well as intermediate results, are available from https://github.com/mostafavilabuw/EnformerAssessment (ref. ²⁵).

References

1. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
2. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
3. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
4. Zhou, J. et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 51, 973–980 (2019).
5. Zhou, J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat. Genet. 54, 725–734 (2022).
6. Park, C. Y. et al. Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk. Nat. Genet. 53, 166–173 (2021).
7. De Jager, P. L. et al. A multi-omic atlas of the human frontal cortex for aging and Alzheimer’s disease research. Sci. Data 5, 180142 (2018).
8. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
9. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
10. Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods https://doi.org/10.1038/s41592-022-01562-8 (2022).
11. Maslova, A. et al. Deep learning of immune cell differentiation. Proc. Natl Acad. Sci. USA 117, 25655–25666 (2020).
12. Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet. 54, 940–949 (2022).
13. Kim, D. S. et al. The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat. Genet. https://doi.org/10.1038/s41588-021-00947-3 (2021).
14. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).
15. Novakovsky, G., Dexter, N., Libbrecht, M. W., Wasserman, W. W. & Mostafavi, S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat. Rev. Genet. https://doi.org/10.1038/s41576-022-00532-2 (2022).
16. Wang, Q. S. et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat. Commun. 12, 3394 (2021).
17. Karollus, A., Mauermeier, T. & Gagneur, J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 24, 56 (2023).
18. Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
19. Reshef, Y. A. et al. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. Nat. Genet. 50, 1483–1493 (2018).
20. Bennett, D. A. et al. Religious Orders Study and Rush Memory and Aging Project. J. Alzheimers Dis. 64, S161–S189 (2018).
21. Mostafavi, S. et al. A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer’s disease. Nat. Neurosci. 21, 811–819 (2018).
22. Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 24, 14–24 (2014).
23. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
24. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning (eds. Precup, D. & Teh, Y. W.) Vol. 70 3319–3328 (PMLR, 2017).
25. Sasse, A., Ng, B. & Spiro, A. E. mostafavilabuw/EnformerAssessment: EnformerEvaluationV1. Zenodo https://doi.org/10.5281/zenodo.8274879 (2023).

Acknowledgements

We thank D. R. Kelley for helpful comments on this manuscript. We thank the participants of ROS and MAP for their essential contributions and gifts to this project. This work has been supported by many different NIH grants, including P30AG10161 (to D.A.B.), P30AG72975 (to D.A.B.), R01AG15819 (to D.A.B.), R01AG17917 (to D.A.B.), U01AG46152 (to D.A.B. and P.L.D.), U01AG61356 (to D.A.B. and P.L.D.), R01AG057911 (to C.G.), R01AG06179 (to C.G.) and R01AG036836 (to P.L.D.), as well as a CIFAR research fellowship and an NSERC Discovery Grant (to S.M.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

These authors contributed equally: Alexander Sasse, Bernard Ng, Anna E. Spiro.

Authors and Affiliations

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
Alexander Sasse, Anna E. Spiro & Sara Mostafavi
Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago, IL, USA
Bernard Ng, Shinya Tasaki, David A. Bennett & Christopher Gaiteri
Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA
Christopher Gaiteri
Center for Translational & Computational Neuroimmunology, Department of Neurology, and the Taub Institute for the Study of Alzheimer’s Disease and the Aging Brain, Columbia University Irving Medical Center, New York, NY, USA
Philip L. De Jager
Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
Maria Chikina
Canadian Institute for Advanced Research, Toronto, Ontario, Canada
Sara Mostafavi

Contributions

Conceived the study: S.M. and M.C. Study design: S.M., A.S. and M.C. Data generation and quality control analyses: B.N., A.E.S., C.G., P.L.D., S.T. and D.A.B. Analyses and interpretation: A.S., A.E.S., B.N., S.M. and M.C. Wrote the initial draft: S.M., A.S. and B.N. Read and provided comments on the manuscript: M.C., B.N., A.E.S., P.L.D., C.G., S.T. and D.A.B. Supervised the project: S.M. and M.C.

Corresponding authors

Correspondence to Maria Chikina or Sara Mostafavi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Kaur Alasoo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Sensitivity analysis for Enformer Predictions.

(a) Density plot, where each dot represents a gene (n = 13,397). X-axis shows Pearson’s r coefficients for Enformer predictions for the single most relevant track (‘CAGE,brain,adult’) and y-axis shows the fine-tuned cortex model from all human tracks. Color depicts local density. (b) Pearson’s r coefficients across 839 individuals between observed expression and the predicted CAGE track from a single forward-stranded input sequence centered at the TSS (x-axis) versus the average over forward-stranded sequences which were shifted by −3, −2, −1, 0, 1, 2, 3 bp, and a reverse-stranded input sequence centered at the TSS (y-axis). Data shown for a random subset of loci (n = 30). Orange line: diagonal line where x and y-axis have the same value. The correlation coefficient between values on x-axis and y-axis is R = 0.94. (c) Absolute Pearson’s r coefficients between Enformer predictions and observed gene expression for sets of genes with one causal SNP and all others. Causal genes determined by the Susie algorithm (‘Susie-Causal’). Edges of the box indicate the 25th and 75th percentiles, and the central mark indicates the median (N1 = 183 genes fine-mapped with Susie, N2 = 6,625 genes without fine-mapped variants, two-sided Wilcoxon rank-sum test, for each gene R coefficient computed using n = 839 individuals).
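The input averaging described in panel (b) can be sketched as follows, assuming equal weighting of the seven shifted forward-strand windows and the one reverse-strand window (the caption does not state the weighting). Here `predict` stands in for any sequence-to-scalar model; it, the helper names and the toy GC-content scorer are hypothetical and are not Enformer's API.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def window(genome, center, length, shift=0):
    """Extract a window of `length` bp around `center`, offset by `shift` bp."""
    start = center + shift - length // 2
    if start < 0 or start + length > len(genome):
        raise ValueError("window falls outside the sequence")
    return genome[start:start + length]

def averaged_prediction(predict, genome, center, length,
                        shifts=(-3, -2, -1, 0, 1, 2, 3)):
    """Average a model's output over shifted forward-strand windows plus
    one reverse-complemented, unshifted window (equal weights assumed)."""
    inputs = [window(genome, center, length, s) for s in shifts]
    inputs.append(reverse_complement(window(genome, center, length)))
    preds = [predict(seq) for seq in inputs]
    return sum(preds) / len(preds)

def gc_fraction(seq):
    """Toy stand-in for a trained model: fraction of G/C bases."""
    return sum(base in "GC" for base in seq) / len(seq)

toy_genome = "ACGT" * 10  # invented 40 bp sequence
avg = averaged_prediction(gc_fraction, toy_genome, center=20, length=8)
```

With a real model, `predict` would also need to map a reverse-strand input back to the same output track and bin; that bookkeeping is omitted here.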

Extended Data Fig. 2 Performance of the shallow CNN model.

(a) Density plot of observed population-average expression of test set genes (n = 3,401 genes) in cerebral cortex versus simple CNN’s predicted gene expression from the Reference sequences. This plot only displays genes which could be assigned to Enformer’s test set. Colors depict local density. (b) Y-axis shows Pearson’s r correlation coefficients between observed expression values and a simple CNN’s predicted values per individual. X-axis shows the negative log10 p-value computed with a gene-specific Null model (one-sided T-test, n = 50 independent samples per gene; Supplementary Methods). The color represents the predicted mean expression. Red dashed line indicates FDR_BH = 0.05.
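The FDR threshold in panel (b) is read here as a Benjamini-Hochberg adjustment of the per-gene p-values, which is an interpretation of the 'BH' label rather than something the caption spells out. A minimal standard-library sketch of that adjustment, with invented p-values, might look like this:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values for four hypothetical genes.
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in qvals]
```

Genes whose adjusted value falls at or below 0.05 would sit to the right of the dashed line in a plot of this kind.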

Supplementary information

Supplementary Information

Supplementary Methods and Supplementary Figs. S1–S11.

Reporting Summary

Supplementary Tables

Sheet 1: Supplementary Table 1. This table provides information on genes whose expression prediction was evaluated. Sheet 2: Header descriptions for Supplementary Table 1. Sheet 3: Supplementary Table 2. This table provides information about the driver gene analysis. Sheet 4: Header descriptions for Supplementary Table 2. Sheet 5: Supplementary Table 3. This table provides ISM values for the set of tested variants. Sheet 6: Header descriptions for Supplementary Table 3.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sasse, A., Ng, B., Spiro, A.E. et al. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. Nat Genet 55, 2060–2064 (2023). https://doi.org/10.1038/s41588-023-01524-6

Received: 13 March 2023
Accepted: 08 September 2023
Published: 30 November 2023
Issue Date: December 2023
DOI: https://doi.org/10.1038/s41588-023-01524-6
