Rare coding variant analysis for human diseases across biobanks and ancestries

Abstract

Large-scale sequencing has enabled unparalleled opportunities to investigate the role of rare coding variation in human phenotypic variability. Here, we present a pan-ancestry analysis of sequencing data from three large biobanks, including the All of Us research program. Using mixed-effects models, we performed gene-based rare variant testing for 601 diseases across 748,879 individuals, including 155,236 with ancestry dissimilar to European. We identified 363 significant associations, which highlighted core genes for the human disease phenome and identified potential novel associations, including UBR3 for cardiometabolic disease and YLPM1 for psychiatric disease. Pan-ancestry burden testing represented an inclusive and useful approach for discovery in diverse datasets, although we also highlight the importance of ancestry-specific sensitivity analyses in this setting. Finally, we found that effect sizes for rare protein-disrupting variants were concordant between samples similar to European ancestry and other genetic ancestries (βDeming = 0.7–1.0). Our results have implications for multi-ancestry and cross-biobank approaches in sequencing association studies for human disease.

Fig. 1: Study overview for rare variant discovery across human disease.
Fig. 2: Multi-ancestry meta-analysis of rare genetic variation across three sequenced biobanks in over 750,000 individuals identifies 363 rare variant associations.
Fig. 3: Assessment of bias from inclusion of non-EUR samples among the significant associations.
Fig. 4: Large genetic effect sizes and pleiotropic associations identify core genes for the human disease phenome.
Fig. 5: Effect sizes of rare coding variants for disease correlate between genetic EUR and other genetic ancestries.

Data availability

Results from our gene-based association analyses are available for browsing and download through our online portal (https://hugeamp.org:8000/research.html?pageid=600_traits_app_home). Bulk download of summary statistics is possible via the Cardiovascular Disease Knowledge Portal (https://cvd.hugeamp.org/downloads.html). Access to individual-level UKB data, both phenotypic and genetic, is available to bona fide researchers through application on the UKB website (https://www.ukbiobank.ac.uk). The final release of the exome sequencing dataset of UKB is available only through the DNAnexus Research Analysis Platform (https://www.ukbiobank.ac.uk/enable-your-research/research-analysis-platform). Additional information about registration for access to the data is available at http://www.ukbiobank.ac.uk/register-apply/. Use of UKB data was performed under application number 17488. Access to individual phenotypic and genetic data from AoU is currently available to bona fide researchers within the USA through the AoU Researcher Workbench, a cloud-based computing platform (https://www.researchallofus.org/register/). A publicly available data browser is provided by the research program (https://databrowser.researchallofus.org/). Access to individual-level data for participants from the MGB is currently not publicly available. Other datasets used in this manuscript include: the dbNSFP database v.4.2a and v.4.3a (https://sites.google.com/site/jpopgen/dbNSFP); gnomAD exomes v.2.1 (https://gnomad.broadinstitute.org/downloads); the OMIM database (omim.org) accessed on 25 August 2022; Ensembl release 105 (https://www.ensembl.org/info/data/index.html); and the ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar/) accessed in December 2022.

Code availability

QC of individual-level data was performed using Hail v.0.2 (https://hail.is) as well as PLINK v.2.0.a (https://www.cog-genomics.org/plink/2.0/). Variant annotation was performed using VEP v.105 (https://github.com/Ensembl/ensembl-vep). Main rare variant association analyses were performed using an adaptation of the R package GENESIS v.2.18 (https://rdrr.io/bioc/GENESIS/man/GENESIS-package.html), which we previously made available through the GitHub repository https://github.com/seanjosephjurgens/UKBB_200KWES_CVD (v.1.2; https://doi.org/10.5281/zenodo.11638262). Meta-analyses were performed using custom code available in the same repository and using METAL (2017-12-21 release). All R-based analyses were run in R v.4 (https://www.r-project.org).

Acknowledgements

We gratefully thank all participants of UKB, AoU and MGB Biobank, as this study would not have been possible without their contributions. We also thank the NIH’s AoU Research Program, the UKB resource (under application number 17488) and the MGB team, for making available the participant data examined in this study. P.T.E. was supported by funding from the NIH (1RO1HL092577, 1R01HL157635), by a grant from the American Heart Association (18SFRN34110082, 961045) and from the European Union (MAESTRIA 965286). This work was also supported by an American Heart Association Strategically Focused Research Networks (SFRN) postdoctoral fellowship (18SFRN34110082) to L.-C.W. This work was supported by the John S. LaDue Memorial Fellowship for Cardiovascular Research, a Sarnoff Scholar award from the Sarnoff Cardiovascular Research Foundation and by a NIH grant (K08HL159346) to J.P.P. This work was further supported by a grant from the NIH (1K08HL153937) and a grant from the American Heart Association (862032) to K.G.A. This work was supported by a Sigrid Jusélius Fellowship to J.T.R. This work was also supported by an Amsterdam UMC doctoral fellowship and the Junior Clinical Scientist Fellowship (03-007-2022-0035) from the Dutch Heart Foundation, to S.J.J. This work was supported by the BioData Ecosystem fellowship to S.H.C. This work was also supported by a grant from the NIH (R01DK125490) to J.F.

Author information

Contributions

S.J.J. and P.T.E. conceived and designed the study. S.J.J., X.W., S.H.C., L.-C.W., S. Koyama and J.P.P. performed data curation and data processing. S.J.J. and X.W. performed the main statistical and bioinformatic analyses, with S.H.C. providing important bioinformatic support. M.C., R.W., C.R., K.J.B., S. Kany, A.L.E., L.F.J.M.W. and J.T.R. contributed critically to the analysis plan. P.N., K.G.A., C.R.B., S.A.L., K.L.L., and P.T.E. supervised the study. T.N., P.S. and D.J. created the online web portal on the Human Disease Knowledge Portal. J.F. and N.P.B. supervised the creation of the online web portal. S.J.J., X.W. and P.T.E. wrote the manuscript. All authors critically revised and approved the manuscript.

Corresponding author

Correspondence to Patrick T. Ellinor.

Ethics declarations

Competing interests

P.T.E. has received sponsored research support from Bayer AG, Bristol Myers Squibb, Pfizer and Novo Nordisk. S.A.L. is an employee of Novartis as of July 2022. S.A.L. previously received sponsored research support from Bristol Myers Squibb, Pfizer, Boehringer Ingelheim, Fitbit, Medtronic, Premier and IBM, and has consulted for Bristol Myers Squibb, Pfizer, Blackstone Life Sciences and Invitae. P.N. reports research grants from Allelica, Amgen, Apple, Boston Scientific, Genentech/Roche and Novartis, personal fees from Allelica, Apple, AstraZeneca, Blackstone Life Sciences, Creative Education Concepts, CRISPR Therapeutics, Eli Lilly & Co, Esperion Therapeutics, Foresite Labs, Genentech/Roche, GV, HeartFlow, Magnet Biomedicine, Merck, Novartis, TenSixteen Bio and Tourmaline Bio, equity in Bolt, Candela, Mercury, MyOme, Parameter Health, Preciseli and TenSixteen Bio, and spousal employment at Vertex Pharmaceuticals, all unrelated to the present work. B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. S. Kany is supported by the Walter Benjamin Fellowship from the Deutsche Forschungsgemeinschaft (521832260). The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Benjamin Sun, Seunggeun Lee and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Prevalence of rare likely pathogenic and pathogenic variants in cardiomyopathy genes across UKB and MGB datasets.

a,b, Bar graphs reflect the percentage of biobank participants found to carry pathogenic or likely pathogenic variants for dilated cardiomyopathy (a) or hypertrophic cardiomyopathy (b) with 95% binomial confidence intervals, where light blue represents Massachusetts Biobank (MGB) and dark green represents UK Biobank (UKB). The absolute number of carriers identified in a given grouping is added above each bar. The total number of participants was n = 51,815 in MGB and n = 454,162 in UKB. Pathogenic or likely pathogenic variants reported in ClinVar and submitted by clinical testing labs from 2015 onwards were included, as well as high-confidence LOF variants affecting canonical transcripts for select genes (where truncation is considered pathogenic for the disease); for TTN, only variants affecting the cardiac exons were included. Variants were filtered to MAF < 0.1%. The combined prevalences (all genes combined) are shown on the far right of the panels. Overall, rare disease-causing variants for both disorders were more frequent in MGB than in UKB (non-overlapping 95% binomial confidence intervals).
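As an illustration of the comparison described in this legend, the following is a minimal sketch of a 95% binomial confidence interval for a carrier prevalence and of an overlap check between two cohorts. The legend does not state which interval method was used; the Wilson score interval below is an assumption, and the carrier counts in the usage lines are hypothetical placeholders rather than values from the study (only the cohort sizes are taken from the legend).

```python
from __future__ import annotations
import math


def wilson_interval(carriers: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for the default z)."""
    if total <= 0 or not 0 <= carriers <= total:
        raise ValueError("need total > 0 and 0 <= carriers <= total")
    p_hat = carriers / total
    denom = 1.0 + z ** 2 / total
    centre = (p_hat + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Cohort sizes from the legend; carrier counts are HYPOTHETICAL examples.
mgb_ci = wilson_interval(carriers=300, total=51_815)
ukb_ci = wilson_interval(carriers=1_500, total=454_162)
print("MGB 95% CI:", mgb_ci)
print("UKB 95% CI:", ukb_ci)
print("intervals overlap:", intervals_overlap(mgb_ci, ukb_ci))
```

Non-overlapping intervals are a conservative indication of a difference in prevalence; a formal two-sample test could reach significance even when the intervals overlap slightly.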

Extended Data Fig. 2 Evidence of age-related somaticism and phenotype associations for potentially somatic gene variants.

a, Volcano plot with results from linear regression models predicting age from rare variant carrier status (in a meta-analysis of UKB, AoU and MGB), with the -log10(PCauchy) on the y-axis and the estimated effect size per year for the most significant mask on the x-axis (βage). The same pipeline was used as for our primary analysis. The horizontal dotted line shows a suggestive significance cutoff of P < 1 × 10−6, while the vertical lines highlight βage = −0.1 and βage = 0.1, respectively. Significant genes (P < 1 × 10−6) are annotated with their gene names; all genes that were significantly associated with any outcome in our primary analysis (Supplementary Table 16) and with age are also annotated with their gene names. Gene masks reaching P < 1 × 10−6 and βage > 0.1 can be considered suggestively affected by age and therefore raise suspicion that they are affected by age-related somatic variants. Indeed, many of these genes are known clonal hematopoiesis of indeterminate potential (CHIP) genes (Supplementary Note). b, Heatmap with Phecodes on the x-axis and genes on the y-axis. The heatmap shows results for genes that reached significance in our PheWAS for leukemic/hematological outcomes and/or genes associated with age. These genes are plotted against a range of representative phenotypes suggestively associated at P < 1 × 10−5 with any of the genes. The color in each cell represents the odds ratio (OR) for the most significant mask (lowest nominal P-value) with red indicating increased disease risk (OR > 1) and blue indicating decreased disease risk (OR < 1). Significance levels are shown in each cell using circles and boxes, with a small dot representing nominal PCauchy < 0.05, a larger dot representing PCauchy < 0.001, a black box representing PCauchy = 1 × 10−5 and a black box with smaller white box representing QCauchy < 0.01 in our primary analysis.
ORs were estimated using inverse-variance weighted meta-analysis of two-sided Firth’s logistic regression results. The reported P-values are Cauchy P-values that represent the omnibus P-value of all masks for a gene-phecode pair after combining them using the Cauchy distribution (unadjusted for multiple testing). Q-values represent the Benjamini-Hochberg FDR adjustments of these P-values. P-values for mask-phecode pairs (prior to the Cauchy combination) were derived from a Z score-based meta-analysis of score tests from logistic mixed-effects models with saddle-point-approximation. All statistical tests and P-values are two-sided. LOF, loss-of-function; OR, odds ratio.
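The two steps named in this legend, combining per-mask P-values with the Cauchy distribution and then applying Benjamini-Hochberg FDR adjustment, can be sketched as follows. This is a minimal illustration under stated assumptions: equal mask weights, P-values strictly between 0 and 1, and hypothetical input values. It is not the authors' pipeline; the function names are invented for this example.

```python
from __future__ import annotations
import math


def cauchy_combination(p_values: list[float], weights: list[float] | None = None) -> float:
    """Omnibus P-value from the Cauchy combination of per-mask P-values."""
    if not p_values or any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("need at least one P-value, each strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(p_values)  # ASSUMPTION: equal weight per mask
    total_w = sum(weights)
    stat = 0.0
    for p, w in zip(p_values, weights):
        if p < 1e-15:
            # tan((0.5 - p) * pi) is approximately 1 / (p * pi) for tiny p.
            stat += (w / total_w) / (p * math.pi)
        else:
            stat += (w / total_w) * math.tan((0.5 - p) * math.pi)
    if stat > 1e15:
        # Same approximation applied to the upper tail of the Cauchy distribution.
        return 1.0 / (stat * math.pi)
    return 0.5 - math.atan(stat) / math.pi


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted values (Q-values), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q


# HYPOTHETICAL per-mask P-values for three gene-phecode pairs.
pairs = {"geneA": [1e-7, 3e-4, 0.02], "geneB": [0.4, 0.6, 0.08], "geneC": [0.01, 0.03, 0.5]}
omnibus = {gene: cauchy_combination(ps) for gene, ps in pairs.items()}
q_values = dict(zip(omnibus, benjamini_hochberg(list(omnibus.values()))))
print(omnibus)
print(q_values)
```

The small-P branch avoids loss of precision when a mask P-value is too close to zero for the tangent to be evaluated accurately.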

Extended Data Fig. 3 No apparent discovery benefit in rare variant burden testing from ancestral diversity at current case numbers in AoU.

a,b, Grouped bar charts with the number of significant signals identified from rare variant burden testing on the y-axis, comparing results from two different sub-samples of datasets. Blue bars represent results for LOF variants only (MAFpopulation-max < 0.1%), while dark red bars show results for the Cauchy combination of 6 masks and light red bars represent a Cauchy combination of 9 masks (including 3 low-frequency variant masks at MAFpopulation-max < 1%). Plot in a shows results for the subset of AoU consisting of individuals genetically similar to European ancestry (n = 106,057 samples with complete EHR linkage; EUR) on the left side of each comparison, while the right side shows results for an ancestrally diverse sub-sample of AoU of equal size (n = 106,057 samples with complete EHR linkage; Mixed). Results are restricted to 584 phecodes that were testable in both subsamples. Plot in b shows those same sub-samples of AoU in a meta-analysis with UKB, restricting to 530 phecodes that were testable across AoU subsamples and in the UKB dataset. c, Violin plot showing prevalence ratios for all 601 phecodes in AoU, where the prevalence ratios represent the ratio between prevalence among EUR samples and within individuals genetically dissimilar to European ancestry (non-EUR). Prevalence ratios are presented on the log2-scale, where one unit difference represents a doubling/halving of the relative prevalence. The black line marks a log2 prevalence ratio of 0 (no difference), while the dotted lines mark log2 prevalence ratios of 1 and −1. Select phenotypes enriched on either side are annotated. Many phecodes are relatively enriched in EUR as compared to non-EUR, which might contribute to the slightly diminished discovery yield within the ancestrally diverse subsample of AoU as compared to the EUR subsample.
The Cauchy P-values represent the omnibus P-value of all relevant masks for a gene-phecode pair after combining them using the Cauchy distribution (unadjusted for multiple testing). P-values for mask-phecode pairs (prior to the Cauchy combination) in AoU were derived from saddle-point-approximation score tests from logistic mixed-effects models, while meta-analysis P-values were derived from Z score-based meta-analysis of such score tests. All statistical tests and P-values are two-sided.
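A Z score-based meta-analysis of per-cohort test results, as mentioned in this legend, can be sketched as below. The weighting by the square root of each cohort's sample size is an assumption (it matches the sample-size scheme commonly used for Z score meta-analysis); the legend does not specify the weights, and this sketch includes neither the saddle-point approximation nor any correction for sample overlap. Input Z scores and sample sizes are hypothetical.

```python
from __future__ import annotations
import math


def z_score_meta(z_scores: list[float], sample_sizes: list[float]) -> tuple[float, float]:
    """Combine signed per-cohort Z scores; return (meta Z, two-sided P-value)."""
    if not z_scores or len(z_scores) != len(sample_sizes):
        raise ValueError("need equally long, non-empty inputs")
    if any(n <= 0 for n in sample_sizes):
        raise ValueError("sample sizes must be positive")
    # ASSUMPTION: weight each cohort by the square root of its sample size.
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    z_meta = numerator / denominator
    # Two-sided P-value from the standard normal distribution.
    p_two_sided = math.erfc(abs(z_meta) / math.sqrt(2.0))
    return z_meta, p_two_sided


# HYPOTHETICAL Z scores and sample sizes for three cohorts.
z_meta, p_meta = z_score_meta([3.1, 2.4, 1.0], [450_000, 250_000, 50_000])
print(f"meta Z = {z_meta:.3f}, two-sided P = {p_meta:.3g}")
```

Because the Z scores are signed, cohorts with opposite directions of effect partly cancel, which is the intended behavior for a directional burden test.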

Extended Data Fig. 4 Broad Human Disease Knowledge Portal showing phenome-wide results for YLPM1.

Output from a search for the gene YLPM1 on the Broad Human Disease Knowledge Portal, which showcases the results from our primary meta-analysis of UKB, AoU and MGB. The top of the figure shows a dot plot with each dot representing a different phecode tested for association with YLPM1, where the y-axis shows the -log10(Cauchy P-value) and the x-axis represents different phenotypes grouped by broad phecode category. The arrows represent directionality, with an upwards arrow indicating that rare variants in YLPM1 are associated with increased risk of the given phecode, and downwards arrows representing decreased risk; directionality is based on the ‘Best Mask’, which is the mask that yielded the lowest nominal P-value in burden testing. The dotted line represents the significance level used for phenome-wide testing of a single gene on the portal (α = 5 × 10−5). Phenotypes reaching this level of significance are highlighted in black text. The bottom of the figure shows the associated results table as presented on the portal, including details on the most strongly associated phecodes, the Cauchy P-values for burden testing of rare variant masks, the Cauchy P-values for burden testing of rare and low-frequency masks, and the beta coefficient of the ‘Best Mask’ (that is, the mask that reached the lowest nominal P-value in burden testing). Results can be queried through the following link: https://hugeamp.org:8000/research.html?ancestry=mixed&cohort=UKB_450k_AoU_250k_MGB_53k_META_overlapcorrected&file=600Traits.csv&gene=YLPM1&pageid=600_traits_app. Betas, which represent log(odds ratios), were estimated using inverse-variance weighted meta-analysis of two-sided Firth’s logistic regression results. The reported P-values are Cauchy P-values that represent the omnibus P-value of all masks for a gene-phecode pair after combining them using the Cauchy distribution (unadjusted for multiple testing).
P-values for mask-phecode pairs (prior to the Cauchy combination) were derived from Z-score-based meta-analysis of score tests from logistic mixed-effects models. All statistical tests and P-values are two-sided. LOF, loss-of-function; OR, odds ratio.
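
The inverse-variance weighted meta-analysis of betas mentioned in this caption can be sketched as follows. This is a minimal fixed-effect illustration only: the function name `ivw_meta` and the per-cohort numbers are hypothetical, and the sketch ignores any sample-overlap correction that the portal results may include.

```python
import math


def ivw_meta(betas, standard_errors):
    """Fixed-effect inverse-variance weighted meta-analysis.

    Each cohort's beta (a log odds ratio) is weighted by 1 / SE^2.
    Returns the pooled beta and its standard error.
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per beta")
    weights = [1.0 / se ** 2 for se in standard_errors]
    weight_sum = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / weight_sum
    pooled_se = math.sqrt(1.0 / weight_sum)
    return pooled_beta, pooled_se


# Hypothetical per-cohort estimates for one mask (not taken from the study)
cohort_betas = [0.9, 1.2, 0.7]
cohort_ses = [0.30, 0.45, 0.80]
beta, se = ivw_meta(cohort_betas, cohort_ses)
print(f"pooled beta = {beta:.3f}, SE = {se:.3f}, OR = {math.exp(beta):.2f}")
```

Exponentiating the pooled beta gives the odds ratio, since the betas are reported on the log(odds ratio) scale.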

Extended Data Fig. 5 Broad Human Disease Knowledge Portal showing phenome-wide results for UBR3.

Output from a search for the gene UBR3 on the Broad Human Disease Knowledge Portal, which showcases the results from our primary meta-analysis of UKB, AoU and MGB. The top of the figure shows a dot plot with each dot representing a different phecode, where the y-axis shows the -log10(Cauchy P-value) and the x-axis represents different phenotypes grouped by broad phecode category. The arrows represent directionality, with an upwards arrow indicating that rare variants in UBR3 are associated with increased risk of the given phecode, and downwards arrows representing decreased risk; directionality is based on the ‘Best Mask’, which is the mask that yielded the lowest nominal P-value in burden testing. The dotted line represents the significance level used for phenome-wide testing of a single gene on the portal (α = 5 × 10−5). Phenotypes reaching this level of significance are highlighted in black text. The bottom of the figure shows the associated results table as presented on the portal, including details on the most strongly associated phecodes, the Cauchy P-values for burden testing of rare variant masks, the Cauchy P-values for burden testing of rare and low-frequency masks, and the beta coefficient of the ‘Best Mask’ (that is, the mask that reached the lowest nominal P-value in burden testing). Results can be queried through the following link: https://hugeamp.org:8000/research.html?ancestry=mixed&cohort=UKB_450k_AoU_250k_MGB_53k_META_overlapcorrected&file=600Traits.csv&gene=UBR3&pageid=600_traits_app. Betas, which represent log(odds ratios), were estimated using inverse-variance weighted meta-analysis of two-sided Firth’s logistic regression results. The reported P-values are Cauchy P-values that represent the omnibus P-value of all masks for a gene-phecode pair after combining them using the Cauchy distribution (unadjusted for multiple testing).
P-values for mask-phecode pairs (prior to the Cauchy combination) were derived from Z-score-based meta-analysis of score tests from logistic mixed-effects models. All statistical tests and P-values are two-sided. LOF, loss-of-function; OR, odds ratio.
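
For readers unfamiliar with Z-score-based meta-analysis of score tests, the sketch below shows one common form, a weighted Stouffer combination. The choice of weights (here, the square root of a per-cohort sample size) and the assumption of independent cohorts are illustrative assumptions, not details taken from this caption; the helper names and example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def z_from_two_sided_p(p_value, direction):
    """Convert a two-sided P-value and an effect direction (+1 or -1) to a Z-score."""
    magnitude = -NormalDist().inv_cdf(p_value / 2.0)
    return math.copysign(magnitude, direction)


def z_score_meta(z_scores, weights):
    """Weighted Z-score (Stouffer) meta-analysis, assuming independent cohorts.

    Returns the combined Z-score and its two-sided P-value.
    """
    if len(z_scores) != len(weights) or not z_scores:
        raise ValueError("need one weight per Z-score")
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    combined_z = numerator / denominator
    two_sided_p = math.erfc(abs(combined_z) / math.sqrt(2.0))
    return combined_z, two_sided_p


# Hypothetical per-cohort results: (two-sided P, direction, sample size)
cohorts = [(0.002, +1, 400_000), (0.04, +1, 200_000), (0.30, -1, 50_000)]
z_scores = [z_from_two_sided_p(p, d) for p, d, _ in cohorts]
weights = [math.sqrt(n) for _, _, n in cohorts]
z, p = z_score_meta(z_scores, weights)
print(f"combined Z = {z:.3f}, two-sided P = {p:.3g}")
```

Signed Z-scores let cohorts with opposite effect directions partially cancel, which a combination of unsigned P-values would not capture.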

Extended Data Fig. 6 Broad Human Disease Knowledge Portal showing exome-wide results for the phecode ‘Cardiomyopathy’.

Output from a search for the phecode Cardiomyopathy on the Broad Human Disease Knowledge Portal, which showcases the results from our primary meta-analysis of UKB, AoU and MGB. A Manhattan plot is shown in the top left, with each dot representing a different gene tested for association with Cardiomyopathy, where the y-axis shows the -log10(Cauchy P-value) and the x-axis represents genomic coordinates. In this figure, results are restricted to ‘rare variant’ masks only (MAF < 0.1%). The dotted line represents the significance threshold used for a single phenotype on the portal (α = 1 × 10−6). A quantile-quantile plot in the top right shows the observed genome-wide test statistics on the y-axis, against the expected test statistics under the null hypothesis on the x-axis; the red line represents the x = y line. The bottom of the figure shows the associated results table as presented on the portal, including details on the most strongly associated genes, the Cauchy P-values for burden testing of rare variant masks, the Cauchy P-values for burden testing of rare and low-frequency masks, the P-value and beta coefficient for the ‘Best Mask’ (that is, the mask with the lowest nominal P-value in burden testing), and information on case/control numbers. The table here is restricted to 9 genes with at least suggestive evidence (P < 3 × 10−6). Results can be queried through the following link: https://hugeamp.org:8000/research.html?ancestry=mixed&cohort=UKB_450k_AoU_250k_MGB_53k_META_overlapcorrected&file=600Traits.csv&pageid=600_traits_app&phenotype=phecode_425.0. Betas, which represent log(odds ratios), were estimated using inverse-variance weighted meta-analysis of two-sided Firth’s logistic regression results. The reported P-values for ‘rare’ and ‘low-freq’ represent omnibus P-values of all relevant masks for a gene-phecode pair after combining them using the Cauchy distribution (unadjusted for multiple testing).
P-values for mask-phecode pairs (prior to the Cauchy combination) were derived from Z-score-based meta-analysis of score tests from logistic mixed-effects models with saddle-point-approximation. All statistical tests and P-values are two-sided. LOF, loss-of-function; OR, odds ratio.
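
The quantile-quantile plot described in this caption compares observed results with their expectation under the null hypothesis. The sketch below computes such points on the -log10(P) scale, assuming uniformly distributed null P-values and the plotting position rank/(n + 1). Both the scale and the plotting position are assumptions made for illustration, and the function name `qq_points` and the input values are hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs for a quantile-quantile plot.

    Assumption: under the null, the k-th smallest of n P-values is expected
    at roughly k / (n + 1).
    """
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = -math.log10(rank / (n + 1))
        observed = -math.log10(p)
        points.append((expected, observed))
    return points


# Hypothetical gene-level P-values for one phecode
example_p_values = [0.8, 0.03, 0.5, 1e-7, 0.2]
for expected, observed in qq_points(example_p_values):
    print(f"expected {expected:.3f}  observed {observed:.3f}")
```

Points lying well above the x = y line at the right-hand end of such a plot correspond to genes whose P-values are smaller than expected by chance.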

Extended Data Fig. 7 Broad Human Disease Knowledge Portal showing exome-wide results for the phecode ‘Diabetes Mellitus’.

Output from a search for the phecode Diabetes Mellitus on the Broad Human Disease Knowledge Portal, which showcases the results from our primary meta-analysis of UKB, AoU and MGB. A Manhattan plot is shown in the top left, with each dot representing a different gene tested for association with Diabetes Mellitus, where the y-axis shows the -log10(Cauchy P-value) and the x-axis represents genomic coordinates. In this figure, results include both ‘rare variant’ and ‘low-frequency’ masks (MAF < 1%). The dotted line represents the significance threshold used for a single phenotype on the portal (α = 1 × 10−6). A quantile-quantile plot in the top right shows the observed genome-wide test statistics on the y-axis, against the expected test statistics under the null hypothesis on the x-axis; the red line represents the x = y line. The bottom of the figure shows the associated results table as presented on the portal, including details on the most strongly associated genes, the Cauchy P-values for burden testing of rare variant masks, the Cauchy P-values for burden testing of rare and low-frequency masks, the P-value and beta coefficient for the ‘Best Mask’ (that is, the mask with the lowest nominal P-value in burden testing), and information on case/control numbers. The table here is restricted to 10 genes with at least suggestive evidence (P < 3 × 10−6). Results can be queried through the following link: https://hugeamp.org:8000/research.html?ancestry=mixed&cohort=UKB_450k_AoU_250k_MGB_53k_META_overlapcorrected&file=600Traits.csv&pageid=600_traits_app&phenotype=phecode_250.0. Betas, which represent log(odds ratios), were estimated using inverse-variance weighted meta-analysis of two-sided Firth’s logistic regression results. The reported P-values for ‘rare’ and ‘low-freq’ represent omnibus P-values of all relevant masks for a gene-phecode pair after combining them using the Cauchy distribution (unadjusted for multiple testing).
P-values for mask-phecode pairs (prior to the Cauchy combination) were derived from Z-score-based meta-analysis of score tests from logistic mixed-effects models with saddle-point-approximation. All statistical tests and P-values are two-sided. LOF, loss-of-function; OR, odds ratio.

Supplementary information

Supplementary Information

Supplementary Note and Figs. 1–19.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–19.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jurgens, S.J., Wang, X., Choi, S.H. et al. Rare coding variant analysis for human diseases across biobanks and ancestries. Nat Genet (2024). https://doi.org/10.1038/s41588-024-01894-5

  • DOI: https://doi.org/10.1038/s41588-024-01894-5
