A new method for multiancestry polygenic prediction improves performance across diverse populations

Zhang, Haoyu; Zhan, Jianan; Jin, Jin; Zhang, Jingning; Lu, Wenxuan; Zhao, Ruzhang; Ahearn, Thomas U.; Yu, Zhi; O’Connell, Jared; Jiang, Yunxuan; Chen, Tony; Okuhara, Dayne; Garcia-Closas, Montserrat; Lin, Xihong; Koelsch, Bertram L.; Chatterjee, Nilanjan

doi:10.1038/s41588-023-01501-z

Technical Report
Published: 25 September 2023

Nature Genetics volume 55, pages 1757–1768 (2023)

Abstract

Polygenic risk scores (PRSs) increasingly predict complex traits; however, suboptimal performance in non-European populations raises concerns about clinical applications and health inequities. We developed CT-SLEB, a powerful and scalable method to calculate PRSs, using ancestry-specific genome-wide association study summary statistics from multiancestry training samples, integrating clumping and thresholding, empirical Bayes and superlearning. We evaluated CT-SLEB and nine alternative methods with large-scale simulated genome-wide association studies (~19 million common variants) and datasets from 23andMe, Inc., the Global Lipids Genetics Consortium, All of Us and UK Biobank, involving 5.1 million individuals of diverse ancestry, with 1.18 million individuals from four non-European populations across 13 complex traits. Results demonstrated that CT-SLEB significantly improves PRS performance in non-European populations compared with simple alternatives, with comparable or superior performance to a recent, computationally intensive method. Moreover, our simulation studies offered insights into sample size requirements and SNP density effects on multiancestry risk prediction.
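As a rough illustration of the empirical Bayes component named in the abstract, the sketch below applies textbook normal-normal shrinkage to per-SNP effect estimates from several populations. It is not taken from the CT-SLEB software: the function names `eb_posterior_mean` and `moment_prior_cov`, the zero-mean normal prior and the assumption that estimation errors are independent across populations are all illustrative assumptions.

```python
import numpy as np


def eb_posterior_mean(beta_hat, se, prior_cov):
    """Shrink per-SNP effect estimates toward a shared prior (hypothetical helper).

    beta_hat  : (M, K) marginal effect estimates for M SNPs in K populations
    se        : (M, K) standard errors; errors are ASSUMED independent across
                populations (non-overlapping GWAS samples)
    prior_cov : (K, K) covariance of the true effects across populations
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    prior_cov = np.asarray(prior_cov, dtype=float)
    post = np.empty_like(beta_hat)
    for j in range(beta_hat.shape[0]):
        noise_cov = np.diag(se[j] ** 2)
        # Normal-normal posterior mean: Sigma (Sigma + V)^{-1} beta_hat
        shrink = prior_cov @ np.linalg.inv(prior_cov + noise_cov)
        post[j] = shrink @ beta_hat[j]
    return post


def moment_prior_cov(beta_hat, se):
    """Method-of-moments prior covariance, using E[b_hat b_hat^T] = Sigma + V."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    second_moment = beta_hat.T @ beta_hat / beta_hat.shape[0]
    return second_moment - np.diag((se ** 2).mean(axis=0))
```

With a strong cross-population correlation in `prior_cov`, a precise estimate in one population pulls the noisier estimate in another toward it, which is the intuition behind borrowing information across ancestries. The moment estimate is not guaranteed to be positive semidefinite in small samples, so a practical version would need a safeguard for that case.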

**Fig. 2: Simulation results of various PRS methods in multiancestry settings.**

**Fig. 3: Comparison of CT-SLEB PRSs across different ancestries with single-ancestry EUR PRSs in the EUR population.**

**Fig. 4: Prediction performance of CT-SLEB PRS under varying SNP densities.**

**Fig. 5: Prediction accuracy of PRSs for heart metabolic disease burden and height in 23andMe, Inc. datasets.**

**Fig. 6: Prediction accuracy of five binary traits in 23andMe, Inc. datasets.**

**Fig. 7: Prediction accuracy of four blood lipid traits from the GLGC.**

**Fig. 8: Prediction accuracy of two traits from the AoU dataset.**

Data availability

Simulated genotype data for 600,000 subjects from 5 ancestries are at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/COXHAP. GWAS summary level statistics for five ancestries from GLGC are at: http://csg.sph.umich.edu/willer/public/glgc-lipids2021/results/ancestry_specific. GWAS summary statistics for three ancestries are from AoU at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FAWEQK. The PRSs developed for six traits for GLGC and AoU have been released through the PGS Catalog (https://www.pgscatalog.org) with publication ID PGP000489 and score IDs PGS003767–PGS003848. The 23andMe GWAS summary statistics for the top 10,000 genetic markers associated with 3 traits (height, morning person and SBMN) across 5 diverse ancestries have been made available as Supplementary Data and are also available at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3NBNCV. The full GWAS summary statistics and the final PRSs for these three traits (height, morning person and SBMN) are available through 23andMe, Inc. to qualified researchers under an agreement with 23andMe, Inc. that protects the privacy of the 23andMe participants. Please visit research.23andme.com/dataset-access for more information and to apply for access to the data. The summary statistics for the four other traits used in the paper (any CVD, heart metabolic disease burden, depression and migraine) will not be made available because of 23andMe’s business requirements. Participants provided informed consent and participated in the research online, under a protocol approved by the external AAHRPP-accredited institutional review board, Ethical & Independent Review Services.

Code availability

Simulation and data analysis code is available at GitHub (https://github.com/andrewhaoyu/multi_ethnic; ref. ⁶⁶). Software implementing CT-SLEB is available at GitHub (https://github.com/andrewhaoyu/CTSLEB; ref. ⁶⁷). The P + T method was implemented using R version 4.0.0 in conjunction with PLINK 1.9, available at https://www.cog-genomics.org/plink/1.9. Other methods and their corresponding repositories include: SCT and LDpred2 at https://github.com/privefl/bigsnpr, XPASS at https://github.com/YangLabHKUST/XPASS, PolyPred-S+ at https://github.com/omerwe/polyfun, PRS-CSx at https://github.com/getian107/PRScsx and LDSC at https://github.com/bulik/ldsc. Most of our statistical analyses were performed using the following R packages: ggplot2 v.3.3.3, dplyr v.1.0.4, data.table v.1.13.6, bigsnpr v.1.6.1, SuperLearner v.2.0.26, caret v.6.0.86, ranger v.0.12.1, glmnet v.4.1, RISCA v.1.01, XPASS v.0.1.0, xgboost v.1.7.5.1 and randomForest.

References

Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Chatterjee, N., Shi, J. & García-Closas, M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 17, 392–406 (2016).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
Mavaddat, N. et al. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am. J. Hum. Genet. 104, 21–34 (2019).
Jia, G. et al. Evaluating the utility of polygenic risk scores in identifying high-risk individuals for eight common cancers. JNCI Cancer Spectr. 4, pkaa021 (2020).
Zhang, H. et al. Genome-wide association study identifies 32 novel breast cancer susceptibility loci from overall and subtype-specific analyses. Nat. Genet. 52, 572–581 (2020).
Graff, R. E. et al. Cross-cancer evaluation of polygenic risk scores for 16 cancer types in two large cohorts. Nat. Commun. 12, 970 (2021).
Fatumo, S. et al. A roadmap to increase diversity in genomic studies. Nat. Med. 28, 243–250 (2022).
Duncan, L. et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun. 10, 3328 (2019).
Liu, C. et al. Generalizability of polygenic risk scores for breast cancer among women with European, African, and Latinx ancestry. JAMA Netw. Open 4, e2119084 (2021).
Du, Z. et al. Evaluating polygenic risk scores for breast cancer in women of African ancestry. J. Natl Cancer Inst. 113, 1168–1176 (2021).
Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).
Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 100, 635–649 (2017).
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Wang, Y. et al. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun. 11, 3865 (2020).
Kullo, I. J. et al. Polygenic scores in biomedical research. Nat. Rev. Genet. 23, 524–532 (2022).
Wray, N. R., Goddard, M. E. & Visscher, P. M. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 17, 1520–1528 (2007).
Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
Privé, F., Vilhjálmsson, B. J., Aschard, H. & Blum, M. G. B. Making the most of clumping and thresholding for polygenic scores. Am. J. Hum. Genet. 105, 1213–1221 (2019).
Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 10, 5086 (2019).
Newcombe, P. J., Nelson, C. P., Samani, N. J. & Dudbridge, F. A flexible and parallelizable approach to genome-wide polygenic risk scores. Genet. Epidemiol. 43, 730–741 (2019).
Ge, T., Chen, C. Y., Ni, Y., Feng, Y. C. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776 (2019).
Song, S., Jiang, W., Hou, L. & Zhao, H. Leveraging effect size distributions to improve polygenic risk scores derived from summary statistics of genome-wide association studies. PLoS Comput. Biol. 16, e1007565 (2020).
Zhou, G. & Zhao, H. A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics. PLoS Genet. 17, e1009697 (2021).
Privé, F., Arbel, J. & Vilhjálmsson, B. J. LDpred2: better, faster, stronger. Bioinformatics 36, 5424–5431 (2021).
Koyama, S. et al. Population-specific and trans-ancestry genome-wide analyses identify distinct and shared genetic risk loci for coronary artery disease. Nat. Genet. 52, 1169–1177 (2020).
Sakaue, S. et al. Trans-biobank analysis with 676,000 individuals elucidates the association of polygenic risk scores of complex traits with human lifespan. Nat. Med. 26, 542–548 (2020).
Agbaedeng, T. A. et al. Polygenic risk score and coronary artery disease: a meta-analysis of 979,286 participant data. Atherosclerosis 333, 48–55 (2021).
Ruan, Y. et al. Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 54, 573–580 (2022).
Tian, P. et al. Multiethnic polygenic risk prediction in diverse populations through transfer learning. Front. Genet. 13, 1854 (2022).
Márquez-Luna, C. et al. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet. Epidemiol. 41, 811–823 (2017).
Xiao, J. et al. XPXP: improving polygenic prediction by cross-population and cross-phenotype analysis. Bioinformatics 38, 1947–1955 (2022).
Cai, M. et al. A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. Am. J. Hum. Genet. 108, 632–655 (2021).
Dudbridge, F. & Wray, N. R. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348 (2013).
Chatterjee, N. et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat. Genet. 45, 400–405 (2013).
Graham, S. E. et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (2021).
Brown, B. C., Ye, C. J., Price, A. L. & Zaitlen, N. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 99, 76–88 (2016).
Shi, H. et al. Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat. Commun. 12, 1098 (2021).
van der Laan, M. J., Polley, E. C. & Hubbard, A. E. Super learner. Stat. Appl. Genet. Mol. Biol. 6, 25 (2007).
Polley, E. & van der Laan, M. J. Super learner in prediction. UC Berkeley Division of Biostatistics Working Paper Series (2010); http://biostats.bepress.com/ucbbiostat/paper266
Ledell, E., Petersen, M. & Van Der Laan, M. J. Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Electron J. Stat. 9, 1583–1607 (2015).
Polley, E., LeDell, E., Kennedy, C. & van der Laan, M. J. SuperLearner: Super learner prediction. R version 2.0-26 (2019).
Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288 (1996).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Ripley, B. D. Pattern Recognition and Neural Networks (Cambridge Univ. Press, 2007).
Weissbrod, O. et al. Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores. Nat. Genet. 54, 450–458 (2022).
Weissbrod, O. et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52, 1355–1363 (2020).
The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52 (2010).
Bien, S. A. et al. Strategies for enriching variant coverage in candidate disease loci on a multiethnic genotyping array. PLoS ONE 11, e0167758 (2016).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Zhang, Y., Qi, G., Park, J. H. & Chatterjee, N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat. Genet. 50, 1318–1326 (2018).
Zhang, Y. D. et al. Assessment of polygenic architecture and risk prediction based on common variants across fourteen cancers. Nat. Commun. 11, 3353 (2020).
Márquez-Luna, C. et al. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat. Commun. 12, 6052 (2021).
Ge, T., Chen, C. Y., Neale, B. M., Sabuncu, M. R. & Smoller, J. W. Phenome-wide heritability analysis of the UK Biobank. PLoS Genet. 13, e1006711 (2017).
Yengo, L. et al. Meta-analysis of genome-wide association studies for height and body mass index in ~700000 individuals of European ancestry. Hum. Mol. Genet. 27, 3641–3649 (2018).
Ding, Y. et al. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature 618, 774–781 (2023).
Song, L. et al. SummaryAUC: a tool for evaluating the performance of polygenic risk prediction models in validation datasets with only summary level statistics. Bioinformatics 35, 4038–4044 (2019).
Zhao, Z. et al. PUMAS: fine-tuning polygenic risk scores with GWAS summary statistics. Genome Biol. 22, 257 (2021).
Pritchard, J. K. & Przeworski, M. Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69, 1–14 (2001).
van der Laan, M. J. & Rose, S. Targeted Learning: Causal inference for observational and experimental data, Vol. 4 (Springer New York, 2011).
Su, Z., Marchini, J. & Donnelly, P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 27, 2304–2305 (2011).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Foucher, Y. et al. RISCA: Causal inference and prediction in cohort-based analyses. R version 1.01 https://cran.r-project.org/package=RISCA (2020).
Zhang, H., Jin, J. & Zhang, J. Multi-ancestry PRS development. Zenodo https://doi.org/10.5281/zenodo.8033882 (2023).
Zhang, H. & Okuhara, D. CT-SLEB software. Zenodo https://doi.org/10.5281/zenodo.8033795 (2023).

Acknowledgements

We thank the research participants and employees of 23andMe, Inc. for making this work possible. We thank L. Noblin, M. J. Francis and E. Voeglein for helping with the research collaboration agreement with the Harvard T.H. Chan School of Public Health, Johns Hopkins Bloomberg School of Public Health and 23andMe, Inc. The analysis utilized the high-performance computation Biowulf cluster at the National Institutes of Health (NIH), USA, Faculty of Arts and Sciences Research Computing Cluster at Harvard University and the Joint High Performance Computing Exchange at Johns Hopkins Bloomberg School of Public Health. The UKBB data were obtained under UKBB resource application no. 17712. This work was funded by NIH grants: nos. K99 CA256513 to H.Z., R00 HG012223 to J.J., NHLBI 5T32HL007604-37 to Z.Y., R35-CA197449, U19-CA203654, R01-HL163560, U01-HG009088 and U01-HG012064 to X.L., R01 HG010480-01 to N.C. and U01HG011724 to N.C. The AoU Research Program is supported by the NIH, Office of the Director: Regional Medical Centers: 1 OT2 OD026549; 1 OT2 OD026554; 1 OT2 OD026557; 1 OT2 OD026556; 1 OT2 OD026550; 1 OT2 OD 026552; 1 OT2 OD026553; 1 OT2 OD026548; 1 OT2 OD026551; 1 OT2 OD026555; IAA no.: AOD 16037; Federally Qualified Health Centers: HHSN 263201600085U; Data and Research Center: 5 U2C OD023196; Biobank: 1 U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: 1 U24 OD023163; Communications and Engagement: 3 OT2 OD023205; 3 OT2 OD023206; and Community Partners: 1 OT2 OD025277; 3 OT2 OD025315; 1 OT2 OD025337; 1 OT2 OD025276. In addition, the AoU Research Program would not be possible without the partnership of its participants.

Author information

Authors and Affiliations

Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
Haoyu Zhang, Thomas U. Ahearn & Montserrat Garcia-Closas
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Haoyu Zhang, Tony Chen & Xihong Lin
23andMe, Inc., Sunnyvale, CA, USA
Jianan Zhan, Jared O’Connell, Yunxuan Jiang, Stella Aslibekyan, Adam Auton, Elizabeth Babalola, Robert K. Bell, Jessica Bielenberg, Katarzyna Bryc, Emily Bullis, Daniella Coker, Gabriel Cuellar Partida, Devika Dhamija, Sayantan Das, Sarah L. Elson, Nicholas Eriksson, Teresa Filshtein, Alison Fitch, Kipper Fletez-Brant, Pierre Fontanillas, Will Freyman, Julie M. Granka, Karl Heilbron, Alejandro Hernandez, Barry Hicks, David A. Hinds, Ethan M. Jewett, Katelyn Kukar, Alan Kwong, Keng-Han Lin, Bianca A. Llamas, Maya Lowe, Jey C. McCreight, Matthew H. McIntyre, Steven J. Micheletti, Meghan E. Moreno, Priyanka Nandakumar, Dominique T. Nguyen, Elizabeth S. Noblin, Aaron A. Petrakovitz, G. David Poznik, Alexandra Reynoso, Morgan Schumacher, Anjali J. Shastri, Janie F. Shelton, Jingchunzi Shi, Suyash Shringarpure, Qiaojuan Jane Su, Susana A. Tat, Christophe Toukam Tchakouté, Vinh Tran, Joyce Y. Tung, Xin Wang, Wei Wang, Catherine H. Weldon, Peter Wilton, Corinna D. Wong & Bertram L. Koelsch
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
Jin Jin, Jingning Zhang, Ruzhang Zhao & Nilanjan Chatterjee
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
Jin Jin
Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
Wenxuan Lu
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Zhi Yu & Xihong Lin
Booz Allen Hamilton Inc., McLean, VA, USA
Dayne Okuhara
Division of Genetics and Epidemiology, Institute of Cancer Research, London, UK
Montserrat Garcia-Closas
Department of Statistics, Harvard University, Cambridge, MA, USA
Xihong Lin
Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
Nilanjan Chatterjee

Authors

Haoyu Zhang
Jianan Zhan
Jin Jin
Jingning Zhang
Wenxuan Lu
Ruzhang Zhao
Thomas U. Ahearn
Zhi Yu
Jared O’Connell
Yunxuan Jiang
Tony Chen
Dayne Okuhara
Montserrat Garcia-Closas
Xihong Lin
Bertram L. Koelsch
Nilanjan Chatterjee

Consortia

23andMe Research Team

Stella Aslibekyan, Adam Auton, Elizabeth Babalola, Robert K. Bell, Jessica Bielenberg, Katarzyna Bryc, Emily Bullis, Daniella Coker, Gabriel Cuellar Partida, Devika Dhamija, Sayantan Das, Sarah L. Elson, Nicholas Eriksson, Teresa Filshtein, Alison Fitch, Kipper Fletez-Brant, Pierre Fontanillas, Will Freyman, Julie M. Granka, Karl Heilbron, Alejandro Hernandez, Barry Hicks, David A. Hinds, Ethan M. Jewett, Yunxuan Jiang, Katelyn Kukar, Alan Kwong, Keng-Han Lin, Bianca A. Llamas, Maya Lowe, Jey C. McCreight, Matthew H. McIntyre, Steven J. Micheletti, Meghan E. Moreno, Priyanka Nandakumar, Dominique T. Nguyen, Elizabeth S. Noblin, Jared O’Connell, Aaron A. Petrakovitz, G. David Poznik, Alexandra Reynoso, Morgan Schumacher, Anjali J. Shastri, Janie F. Shelton, Jingchunzi Shi, Suyash Shringarpure, Qiaojuan Jane Su, Susana A. Tat, Christophe Toukam Tchakouté, Vinh Tran, Joyce Y. Tung, Xin Wang, Wei Wang, Catherine H. Weldon, Peter Wilton & Corinna D. Wong

Contributions

H.Z. and N.C. conceived the project. H.Z., J. Zhan, J.J., J. Zhang, W.L. and R.Z. carried out all data analyses with supervision from N.C. J.Z., J.O.C. and Y.J. ran GWASs for training data from 23andMe Inc. with supervision from B.L.K. R.Z. ran GWASs for training data from AoU with supervision from N.C. and H.Z. H.Z., T.C. and D.O. developed the software and online resources for data sharing. H.Z., J. Zhan, J.J., J. Zhang, W.L., R.Z. and N.C. drafted the manuscript. X.L., M.G.C. and T.U.A. provided comments. All authors reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Haoyu Zhang or Nilanjan Chatterjee.

Ethics declarations

Competing interests

J.Z., J.O., Y.J., S.A., A.A., E.B., R.K.B., J.B., K.B., E.B., D.C., G.C.P., D.D., S.D., S.L.E., N.E., T.F., A.F., K.F.B., P.F., W.F., J.M.G., K.H., A.H., B.H., D.A.H., E.M.J., K.K., A.K., K.H.L., B.A.L., M.L., J.C.M., M.H.M., S.J.M., M.E.M., P.N., D.T.N., E.S.N., A.A.P., G.D.P., A.R., M.S., A.J.S., J.F.S., J.S., S.S., Q.J.S., S.A.T., C.T.T., V.T., J.Y.T., X.W., W.W., C.H.W., P.W., C.D.W. and B.L.K. are employed by and hold stock or stock options in 23andMe, Inc. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Shing Wan Choi, Bjarni Vilhjálmsson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 CT-SLEB detailed flowchart.

The method contains three major steps: 1. Two-dimensional clumping and thresholding; 2. Empirical-Bayes procedure for utilizing genetic correlations of effect sizes across populations; 3. Super-learning model for combining PRSs under different tuning parameters. The tuning dataset is used to train the super learning model. The final prediction performance is evaluated based on an independent validation dataset. For continuous traits, the prediction is evaluated using R² obtained from the linear regression between outcome and PRS after adjusting for covariates (Methods). For binary traits, the prediction is evaluated using the area under the ROC curve (AUC).
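The two evaluation metrics described in this legend can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that "adjusting for covariates" means regressing the outcome on the covariates first and then correlating the residuals with the PRS, and the function names `adjusted_r2` and `auc` are hypothetical.

```python
import numpy as np


def adjusted_r2(y, prs, covariates):
    """R^2 between the PRS and the covariate-adjusted outcome.

    ASSUMPTION: the outcome is residualized on an intercept plus covariates,
    then the residuals are regressed on the PRS. For a one-predictor linear
    regression with intercept, R^2 equals the squared Pearson correlation.
    """
    y = np.asarray(y, dtype=float)
    prs = np.asarray(prs, dtype=float)
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r = np.corrcoef(resid, prs)[0, 1]
    return r ** 2


def auc(labels, scores):
    """Area under the ROC curve as the probability that a case outranks a control.

    Ties count one half. Uses all case-control pairs, so memory grows with
    n_cases * n_controls; fine for a sketch, not for biobank-scale data.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    cases, controls = scores[labels], scores[~labels]
    diff = cases[:, None] - controls[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(cases) * len(controls))
```

In the workflow the legend describes, both functions would be applied to the independent validation dataset only, after the super learner has been trained on the tuning dataset.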

Extended Data Fig. 2 Performance of CT-SLEB with different tuning and validation sample sizes.

The total tuning and validation sample size is set as 2,000, 5,000, 100,000 and 200,000, with half for tuning and half for validation. Analyses are conducted in the multiancestry setting under a strong negative selection model. The training sample size for the AFR population is 15,000. The training sample size for EUR is 100,000. The sample size for the tuning dataset and validation for each population is fixed at 10,000, respectively. Common SNP heritability is assumed to be 0.4 across all populations and effect-size correlation is assumed to be 0.8 across populations. The causal SNP proportion is varied across 0.01 (top panel), 0.001 (middle panel) or 5×10⁻⁴ (bottom panel). The final prediction R² is reported as the average of ten independent simulation replicates.
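To make the simulation parameters in this legend concrete, the sketch below draws true effect sizes for two populations with a given heritability, cross-population effect-size correlation and causal SNP proportion. It is a simplified illustration under stated assumptions, not the simulation code used in the paper: it assumes standardized, independent SNPs, causal SNPs shared across populations, and effect sizes that do not depend on allele frequency (so it does not implement the strong negative selection model). The function name `simulate_effects` is hypothetical.

```python
import numpy as np


def simulate_effects(n_snps, causal_prop, h2, rho, n_pops=2, seed=0):
    """Draw correlated true effects for `n_pops` populations (hypothetical helper).

    Returns an (n_snps, n_pops) array that is zero for non-causal SNPs.
    ASSUMPTIONS: standardized genotypes, no LD, shared causal SNPs, and no
    allele-frequency dependence of effect sizes.
    """
    rng = np.random.default_rng(seed)
    n_causal = max(1, int(round(n_snps * causal_prop)))
    causal_idx = rng.choice(n_snps, size=n_causal, replace=False)

    # Each causal SNP explains h2 / n_causal of the variance in expectation.
    per_snp_var = h2 / n_causal
    # Equal variances on the diagonal, correlation rho off the diagonal.
    corr = rho * np.ones((n_pops, n_pops)) + (1.0 - rho) * np.eye(n_pops)
    cov = per_snp_var * corr

    beta = np.zeros((n_snps, n_pops))
    beta[causal_idx] = rng.multivariate_normal(np.zeros(n_pops), cov, size=n_causal)
    return beta
```

For example, `simulate_effects(100_000, 0.01, 0.4, 0.8)` mirrors the top-panel setting (causal proportion 0.01, heritability 0.4, correlation 0.8) on a much smaller SNP set than the roughly 19 million variants mentioned in the abstract.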

Supplementary information

Supplementary Figs. 1–22 and Note.

Reporting Summary

Peer Review File

Supplementary Table 1

Supplementary Tables 1–11.

Supplementary Data 1

The 23andMe GWAS summary statistics for the top 10,000 genetic markers associated with three traits (height, morning person and SBMN) across five diverse ancestries.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, H., Zhan, J., Jin, J. et al. A new method for multiancestry polygenic prediction improves performance across diverse populations. Nat Genet 55, 1757–1768 (2023). https://doi.org/10.1038/s41588-023-01501-z

Received: 31 March 2022
Accepted: 16 August 2023
Published: 25 September 2023
Issue Date: October 2023
DOI: https://doi.org/10.1038/s41588-023-01501-z
