Abstract
Polygenic risk scores (PRSs) increasingly predict complex traits; however, suboptimal performance in non-European populations raises concerns about clinical applications and health inequities. We developed CT-SLEB, a powerful and scalable method to calculate PRSs, using ancestry-specific genome-wide association study summary statistics from multiancestry training samples, integrating clumping and thresholding, empirical Bayes and superlearning. We evaluated CT-SLEB and nine alternative methods with large-scale simulated genome-wide association studies (~19 million common variants) and datasets from 23andMe, Inc., the Global Lipids Genetics Consortium, All of Us and UK Biobank, involving 5.1 million individuals of diverse ancestry, with 1.18 million individuals from four non-European populations across 13 complex traits. Results demonstrated that CT-SLEB significantly improves PRS performance in non-European populations compared with simple alternatives, with comparable or superior performance to a recent, computationally intensive method. Moreover, our simulation studies offered insights into sample size requirements and SNP density effects on multiancestry risk prediction.
Data availability
Simulated genotype data for 600,000 subjects from 5 ancestries are at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/COXHAP. GWAS summary level statistics for five ancestries from GLGC are at: http://csg.sph.umich.edu/willer/public/glgc-lipids2021/results/ancestry_specific. GWAS summary statistics for three ancestries are from AoU at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FAWEQK. The PRSs developed for six traits for GLGC and AoU have been released through the PGS Catalog (https://www.pgscatalog.org) with publication ID PGP000489 and score IDs PGS003767–PGS003848. The 23andMe GWAS summary statistics for the top 10,000 genetic markers associated with 3 traits (height, morning person and SBMN) across 5 diverse ancestries have been made available as Supplementary Data and are also available at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3NBNCV. The full GWAS summary statistics and the final PRSs for these three traits (height, morning person and SBMN) are available through 23andMe, Inc. to qualified researchers under an agreement with 23andMe, Inc. that protects the privacy of the 23andMe participants. Please visit research.23andme.com/dataset-access for more information and to apply for access to the data. The summary statistics for the four other traits used in the paper (any CVD, heart metabolic disease burden, depression and migraine) will not be made available because of 23andMe’s business requirements. Participants provided informed consent and participated in the research online, under a protocol approved by the external AAHRPP-accredited institutional review board, Ethical & Independent Review Services.
Code availability
Simulation and data analysis code is available at GitHub (https://github.com/andrewhaoyu/multi_ethnic (ref. 66)). Software implementing CT-SLEB is available at GitHub (https://github.com/andrewhaoyu/CTSLEB (ref. 67)). The P + T method was implemented using R version 4.0.0 in conjunction with PLINK 1.9, available at https://www.cog-genomics.org/plink/1.9. Other methods and their corresponding repositories include: SCT and LDpred2 at https://github.com/privefl/bigsnpr, XPASS at https://github.com/YangLabHKUST/XPASS, PolyPred-S+ at https://github.com/omerwe/polyfun, PRS-CSx at https://github.com/getian107/PRScsx, and LDSC at https://github.com/bulik/ldsc. Most of our statistical analyses were performed using the following R packages: ggplot2 v.3.3.3, dplyr v.1.0.4, data.table v.1.13.6, bigsnpr v.1.6.1, SuperLearner v.2.0.26, caret v.6.0.86, ranger v.0.12.1, glmnet v.4.1, RISCA v.1.01, XPASS v.0.1.0, xgboost v.1.7.5.1 and randomForest.
References
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Chatterjee, N., Shi, J. & García-Closas, M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 17, 392–406 (2016).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
Mavaddat, N. et al. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am. J. Hum. Genet. 104, 21–34 (2019).
Jia, G. et al. Evaluating the utility of polygenic risk scores in identifying high-risk individuals for eight common cancers. JNCI Cancer Spectr. 4, pkaa021 (2020).
Zhang, H. et al. Genome-wide association study identifies 32 novel breast cancer susceptibility loci from overall and subtype-specific analyses. Nat. Genet. 52, 572–581 (2020).
Graff, R. E. et al. Cross-cancer evaluation of polygenic risk scores for 16 cancer types in two large cohorts. Nat. Commun. 12, 970 (2021).
Fatumo, S. et al. A roadmap to increase diversity in genomic studies. Nat. Med. 28, 243–250 (2022).
Duncan, L. et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun. 10, 3328 (2019).
Liu, C. et al. Generalizability of polygenic risk scores for breast cancer among women with European, African, and Latinx ancestry. JAMA Netw. Open 4, e2119084 (2021).
Du, Z. et al. Evaluating polygenic risk scores for breast cancer in women of African ancestry. J. Natl Cancer Inst. 113, 1168–1176 (2021).
Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).
Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 100, 635–649 (2017).
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Wang, Y. et al. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun. 11, 3865 (2020).
Kullo, I. J. et al. Polygenic scores in biomedical research. Nat. Rev. Genet. 23, 524–532 (2022).
Wray, N. R., Goddard, M. E. & Visscher, P. M. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 17, 1520–1528 (2007).
Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
Privé, F., Vilhjálmsson, B. J., Aschard, H. & Blum, M. G. B. Making the most of clumping and thresholding for polygenic scores. Am. J. Hum. Genet. 105, 1213–1221 (2019).
Lloyd-Jones, L. R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 10, 5086 (2019).
Newcombe, P. J., Nelson, C. P., Samani, N. J. & Dudbridge, F. A flexible and parallelizable approach to genome-wide polygenic risk scores. Genet. Epidemiol. 43, 730–741 (2019).
Ge, T., Chen, C. Y., Ni, Y., Feng, Y. C. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776 (2019).
Song, S., Jiang, W., Hou, L. & Zhao, H. Leveraging effect size distributions to improve polygenic risk scores derived from summary statistics of genome-wide association studies. PLoS Comput. Biol. 16, e1007565 (2020).
Zhou, G. & Zhao, H. A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics. PLoS Genet. 17, e1009697 (2021).
Privé, F., Arbel, J. & Vilhjálmsson, B. J. LDpred2: better, faster, stronger. Bioinformatics 36, 5424–5431 (2021).
Koyama, S. et al. Population-specific and trans-ancestry genome-wide analyses identify distinct and shared genetic risk loci for coronary artery disease. Nat. Genet. 52, 1169–1177 (2020).
Sakaue, S. et al. Trans-biobank analysis with 676,000 individuals elucidates the association of polygenic risk scores of complex traits with human lifespan. Nat. Med. 26, 542–548 (2020).
Agbaedeng, T. A. et al. Polygenic risk score and coronary artery disease: a meta-analysis of 979,286 participant data. Atherosclerosis 333, 48–55 (2021).
Ruan, Y. et al. Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 54, 573–580 (2022).
Tian, P. et al. Multiethnic polygenic risk prediction in diverse populations through transfer learning. Front. Genet. 13, 1854 (2022).
Márquez-Luna, C. et al. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet. Epidemiol. 41, 811–823 (2017).
Xiao, J. et al. XPXP: improving polygenic prediction by cross-population and cross-phenotype analysis. Bioinformatics 38, 1947–1955 (2022).
Cai, M. et al. A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. Am. J. Hum. Genet. 108, 632–655 (2021).
Dudbridge, F. & Wray, N. R. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348 (2013).
Chatterjee, N. et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat. Genet. 45, 400–405 (2013).
Graham, S. E. et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (2021).
Brown, B. C., Ye, C. J., Price, A. L. & Zaitlen, N. Transethnic genetic-correlation estimates from summary statistics. Am. J. Hum. Genet. 99, 76–88 (2016).
Shi, H. et al. Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat. Commun. 12, 1098 (2021).
van der Laan, M. J., Polley, E. C. & Hubbard, A. E. Super learner. Stat. Appl. Genet. Mol. Biol. 6, 25 (2007).
Polley, E. & van der Laan, M. J. Super learner in prediction. UC Berkeley Division of Biostatistics Working Paper Series (2010); http://biostats.bepress.com/ucbbiostat/paper266
LeDell, E., Petersen, M. & van der Laan, M. J. Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Electron. J. Stat. 9, 1583–1607 (2015).
Polley, E., LeDell, E., Kennedy, C. & van der Laan, M. J. SuperLearner: Super learner prediction. R version 2.0-26 (2019).
Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288 (1996).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Ripley, B. D. Pattern Recognition and Neural Networks (Cambridge Univ. Press, 2007).
Weissbrod, O. et al. Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores. Nat. Genet. 54, 450–458 (2022).
Weissbrod, O. et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52, 1355–1363 (2020).
The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52 (2010).
Bien, S. A. et al. Strategies for enriching variant coverage in candidate disease loci on a multiethnic genotyping array. PLoS ONE 11, e0167758 (2016).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Zhang, Y., Qi, G., Park, J. H. & Chatterjee, N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat. Genet. 50, 1318–1326 (2018).
Zhang, Y. D. et al. Assessment of polygenic architecture and risk prediction based on common variants across fourteen cancers. Nat. Commun. 11, 3353 (2020).
Márquez-Luna, C. et al. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat. Commun. 12, 6052 (2021).
Ge, T., Chen, C. Y., Neale, B. M., Sabuncu, M. R. & Smoller, J. W. Phenome-wide heritability analysis of the UK Biobank. PLoS Genet. 13, e1006711 (2017).
Yengo, L. et al. Meta-analysis of genome-wide association studies for height and body mass index in ~700000 individuals of European ancestry. Hum. Mol. Genet. 27, 3641–3649 (2018).
Ding, Y. et al. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature 618, 774–781 (2023).
Song, L. et al. SummaryAUC: a tool for evaluating the performance of polygenic risk prediction models in validation datasets with only summary level statistics. Bioinformatics 35, 4038–4044 (2019).
Zhao, Z. et al. PUMAS: fine-tuning polygenic risk scores with GWAS summary statistics. Genome Biol. 22, 257 (2021).
Pritchard, J. K. & Przeworski, M. Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69, 1–14 (2001).
van der Laan, M. J. & Rose, S. Targeted Learning: Causal inference for observational and experimental data, Vol. 4 (Springer New York, 2011).
Su, Z., Marchini, J. & Donnelly, P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 27, 2304–2305 (2011).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Foucher, Y. et al. RISCA: Causal inference and prediction in cohort-based analyses. R version 1.01 https://cran.r-project.org/package=RISCA (2020).
Zhang, H., Jin, J. & Zhang, J. Multi-ancestry PRS development. Zenodo https://doi.org/10.5281/zenodo.8033882 (2023).
Zhang, H. & Okuhara, D. CT-SLEB software. Zenodo https://doi.org/10.5281/zenodo.8033795 (2023).
Acknowledgements
We thank the research participants and employees of 23andMe, Inc. for making this work possible. We thank L. Noblin, M. J. Francis and E. Voeglein for helping with the research collaboration agreement with the Harvard T.H. Chan School of Public Health, Johns Hopkins Bloomberg School of Public Health and 23andMe, Inc. The analysis utilized the high-performance computation Biowulf cluster at the National Institutes of Health (NIH), USA, Faculty of Arts and Sciences Research Computing Cluster at Harvard University and the Joint High Performance Computing Exchange at Johns Hopkins Bloomberg School of Public Health. The UKBB data were obtained under UKBB resource application no. 17712. This work was funded by NIH grants: nos. K99 CA256513 to H.Z., R00 HG012223 to J.J., NHLBI 5T32HL007604-37 to Z.Y., R35-CA197449, U19-CA203654, R01-HL163560, U01-HG009088 and U01-HG012064 to X.L., R01 HG010480-01 to N.C. and U01HG011724 to N.C. The AoU Research Program is supported by the NIH, Office of the Director: Regional Medical Centers: 1 OT2 OD026549; 1 OT2 OD026554; 1 OT2 OD026557; 1 OT2 OD026556; 1 OT2 OD026550; 1 OT2 OD 026552; 1 OT2 OD026553; 1 OT2 OD026548; 1 OT2 OD026551; 1 OT2 OD026555; IAA no.: AOD 16037; Federally Qualified Health Centers: HHSN 263201600085U; Data and Research Center: 5 U2C OD023196; Biobank: 1 U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: 1 U24 OD023163; Communications and Engagement: 3 OT2 OD023205; 3 OT2 OD023206; and Community Partners: 1 OT2 OD025277; 3 OT2 OD025315; 1 OT2 OD025337; 1 OT2 OD025276. In addition, the AoU Research Program would not be possible without the partnership of its participants.
Author information
Contributions
H.Z. and N.C. conceived the project. H.Z., J. Zhan, J.J., J. Zhang, W.L. and R.Z. carried out all data analyses with supervision from N.C. J.Z., J.O.C. and Y.J. ran GWASs for training data from 23andMe Inc. with supervision from B.L.K. R.Z. ran GWASs for training data from AoU with supervision from N.C. and H.Z. H.Z., T.C. and D.O. developed the software and online resources for data sharing. H.Z., J. Zhan, J.J., J. Zhang, W.L., R.Z. and N.C. drafted the manuscript. X.L., M.G.C. and T.U.A. provided comments. All authors reviewed and approved the final version of the manuscript.
Ethics declarations
Competing interests
J.Z., J.O., Y.J., S.A., A.A., E.B., R.K.B., J.B., K.B., E.B., D.C., G.C.P., D.D., S.D., S.L.E., N.E., T.F., A.F., K.F.B., P.F., W.F., J.M.G., K.H., A.H., B.H., D.A.H., E.M.J., K.K., A.K., K.H.L., B.A.L., M.L., J.C.M., M.H.M., S.J.M., M.E.M., P.N., D.T.N., E.S.N., A.A.P., G.D.P., A.R., M.S., A.J.S., J.F.S., J.S., S.S., Q.J.S., S.A.T., C.T.T., V.T., J.Y.T., X.W., W.W., C.H.W., P.W., C.D.W. and B.L.K. are employed by and hold stock or stock options in 23andMe, Inc. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks Shing Wan Choi, Bjarni Vilhjálmsson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 CT-SLEB detailed flowchart.
The method contains three major steps: 1. Two-dimensional clumping and thresholding; 2. Empirical-Bayes procedure for utilizing genetic correlations of effect sizes across populations; 3. Super-learning model for combining PRSs under different tuning parameters. The tuning dataset is used to train the super-learning model. The final prediction performance is evaluated based on an independent validation dataset. For continuous traits, the prediction is evaluated using R² obtained from the linear regression between outcome and PRS after adjusting for covariates (Methods). For binary traits, the prediction is evaluated using the area under the ROC curve (AUC).
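The two evaluation metrics named in this caption can be illustrated with a minimal Python sketch. It is not taken from the CT-SLEB software: the function names `adjusted_r2` and `auc` are hypothetical, and the covariate adjustment is assumed here to mean regressing the outcome on the covariates first and then correlating the residual with the PRS; the exact procedure is described in the Methods.

```python
import numpy as np

def adjusted_r2(y, prs, covariates):
    """R^2 between PRS and the outcome after removing covariate effects.

    Assumption: 'adjusting for covariates' is implemented by residualizing
    the outcome on an intercept plus the covariates, then squaring the
    correlation between that residual and the PRS.
    """
    y = np.asarray(y, dtype=float)
    prs = np.asarray(prs, dtype=float)
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return np.corrcoef(residual, prs)[0, 1] ** 2

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    # Average 1-based ranks, so tied scores share the same rank.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Toy data: the PRS is orthogonal to the intercept and the covariate,
# so it explains all of the covariate-adjusted variance.
covariate = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
prs = np.array([1.0, -1.0, 0.0, 0.0, -1.0, 1.0])
outcome = 3.0 + 2.0 * covariate + prs
r2_toy = adjusted_r2(outcome, prs, covariate)
auc_toy = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In practice both metrics would be computed on the independent validation dataset only, after the super-learning model has been trained on the tuning dataset.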
Extended Data Fig. 2 Performance of CT-SLEB with different tuning and validation sample sizes.
The total tuning and validation sample size is set as 2,000, 5,000, 100,000 and 200,000 with half for tuning and half for validation. Analyses are conducted in the multiancestry setting under a strong negative selection model. The training sample size for the AFR population is 15,000. The training sample size for EUR is 100,000. The sample size for the tuning dataset and validation for each population is fixed at 10,000, respectively. Common SNP heritability is assumed to be 0.4 across all populations and effect-size correlation is assumed to be 0.8 across populations. The causal SNPs proportion is varied across 0.01 (top panel), 0.001 (middle panel), or 5 × 10⁻⁴ (bottom panel). The final prediction R² is reported as the average of ten independent simulation replicates.
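To make the simulation parameters in this caption concrete, the following Python sketch draws effect sizes for two populations with a given causal proportion, heritability and cross-population effect-size correlation. It is a simplified illustration, not the authors' simulation code: the function `simulate_effects` is hypothetical, causal SNPs are assumed to be shared across populations, genotypes are assumed standardized and independent (so heritability equals the sum of squared effects), and the dependence of effect sizes on allele frequency implied by the strong negative selection model is deliberately ignored.

```python
import numpy as np

def simulate_effects(n_snps, causal_prop, h2, rho, seed=0):
    """Draw per-SNP effects for two populations (columns 0 and 1).

    Simplifying assumptions (not from the paper): shared causal SNPs,
    standardized and independent genotypes, no frequency dependence.
    """
    rng = np.random.default_rng(seed)
    n_causal = max(1, int(round(n_snps * causal_prop)))
    causal_idx = rng.choice(n_snps, size=n_causal, replace=False)
    # Each causal SNP gets variance h2 / n_causal so effects sum to h2 in expectation.
    per_snp_var = h2 / n_causal
    cov = per_snp_var * np.array([[1.0, rho], [rho, 1.0]])
    beta = np.zeros((n_snps, 2))
    beta[causal_idx] = rng.multivariate_normal(np.zeros(2), cov, size=n_causal)
    return beta

# Settings from the caption: heritability 0.4, correlation 0.8, causal proportion 0.01.
beta = simulate_effects(n_snps=100_000, causal_prop=0.01, h2=0.4, rho=0.8)
causal = beta[:, 0] != 0
realized_h2 = float((beta[:, 0] ** 2).sum())
realized_rho = float(np.corrcoef(beta[causal, 0], beta[causal, 1])[0, 1])
```

Lowering `causal_prop` to 0.001 or 5e-4 concentrates the same heritability on fewer SNPs, which corresponds to the middle and bottom panels of the figure.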
Supplementary information
Supplementary Figs. 1–22 and Note.
Supplementary Table 1
Supplementary Tables 1–11.
Supplementary Data 1
The 23andMe GWAS summary statistics for the top 10,000 genetic markers associated with three traits (height, morning person and SBMN) across five diverse ancestries.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, H., Zhan, J., Jin, J. et al. A new method for multiancestry polygenic prediction improves performance across diverse populations. Nat Genet 55, 1757–1768 (2023). https://doi.org/10.1038/s41588-023-01501-z
DOI: https://doi.org/10.1038/s41588-023-01501-z
This article is cited by
- Recent advances in polygenic scores: translation, equitability, methods and FAIR tools. Genome Medicine (2024)
- An ensemble penalized regression method for multi-ancestry polygenic risk prediction. Nature Communications (2024)