  • Brief Communication

Adjusting for common variant polygenic scores improves yield in rare variant association analyses

Abstract

With the emergence of large-scale sequencing data, methods for improving power in rare variant association tests are needed. Here we show that adjusting for common variant polygenic scores improves yield in gene-based rare variant association tests across 65 quantitative traits in the UK Biobank (up to 20% increase at α = 2.6 × 10⁻⁶), without marked increases in false-positive rates or genomic inflation. Benefits were seen for various models, with the largest improvements seen for efficient sparse mixed-effects models. Our results illustrate how polygenic score adjustment can efficiently improve power in rare variant association discovery.
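The intuition behind the abstract can be illustrated with a toy simulation. The sketch below is not the authors' pipeline (which uses gene-based tests in mixed-model software); it assumes a plain linear model, a simulated trait in which a hypothetical PGS explains about 30% of the variance, and a binary rare-variant burden with a made-up effect size. All function names and parameters are illustrative. Adjusting for the PGS removes trait variance unrelated to the burden, so the standard error of the burden effect should shrink.

```python
import math
import random


def residualize(values, covariate):
    """Residuals from a simple OLS fit of `values` on `covariate` plus an intercept."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    sxy = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    slope = sxy / sxx
    return [(v - mean_v) - slope * (c - mean_c) for v, c in zip(values, covariate)]


def burden_test(y, burden, pgs=None):
    """Return (beta, standard error) for the burden term of a linear model.

    Without `pgs` the model is y ~ burden; with `pgs` it is y ~ burden + pgs,
    fitted by residualizing both y and burden on the PGS (Frisch-Waugh-Lovell).
    """
    n = len(y)
    if pgs is None:
        mean_y = sum(y) / n
        mean_b = sum(burden) / n
        y_res = [v - mean_y for v in y]
        b_res = [b - mean_b for b in burden]
        n_params = 2  # intercept + burden
    else:
        y_res = residualize(y, pgs)
        b_res = residualize(burden, pgs)
        n_params = 3  # intercept + burden + PGS
    sxx = sum(b * b for b in b_res)
    beta = sum(b * v for b, v in zip(b_res, y_res)) / sxx
    rss = sum((v - beta * b) ** 2 for b, v in zip(b_res, y_res))
    se = math.sqrt(rss / (n - n_params) / sxx)
    return beta, se


def simulate(n=20000, seed=0):
    """Toy data: PGS explains ~30% of variance; ~1% of samples carry a rare variant."""
    rng = random.Random(seed)
    pgs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    burden = [1.0 if rng.random() < 0.01 else 0.0 for _ in range(n)]
    y = [
        math.sqrt(0.3) * g + 0.5 * b + math.sqrt(0.7) * rng.gauss(0.0, 1.0)
        for g, b in zip(pgs, burden)
    ]
    return y, burden, pgs


y, burden, pgs = simulate()
beta_u, se_u = burden_test(y, burden)
beta_a, se_a = burden_test(y, burden, pgs)
print(f"unadjusted:   beta={beta_u:.3f}, se={se_u:.4f}, z={beta_u / se_u:.2f}")
print(f"PGS-adjusted: beta={beta_a:.3f}, se={se_a:.4f}, z={beta_a / se_a:.2f}")
```

In this toy setting the standard error falls by roughly the square root of the residual variance ratio; how this translates to real gene-based tests depends on the model and trait.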

Fig. 1: Analysis flowchart.
Fig. 2: PGS adjustment improves discovery yield in analysis of rare deleterious variants.
Fig. 3: PGS adjustment does not increase false-positive rates or genomic inflation in the analysis of rare synonymous variants.

Data availability

Summary statistics from the common variant association analyses, the rare variant association analyses, as well as the common variant weights used for polygenic score construction, have been made available for download through the Cardiovascular Disease Knowledge Portal (https://cvd.hugeamp.org/downloads.html). To download the GWAS summary statistics: https://personal.broadinstitute.org/ryank/Jurgens_Pirruccello_2022_GWAS_Sumstats.zip. To download the PGS weights: https://personal.broadinstitute.org/ryank/Jurgens_Pirruccello_2022_PGS_Weights.zip. To download the RVAT summary statistics: https://personal.broadinstitute.org/ryank/Jurgens_Pirruccello_2022_RVAT_Sumstats.zip. Summary statistics for tests of the statistical properties of different RVAT models are included in the supplementary tables. Access to individual-level UK Biobank data, both phenotypic and genetic, is available to bona fide researchers through application on the UK Biobank website (https://www.ukbiobank.ac.uk). The exome sequencing data can be found in the UK Biobank showcase portal: https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=170. Additional information about registration for access to the data is available at http://www.ukbiobank.ac.uk/register-apply/. Use of UK Biobank data was performed under application number 17488. Other datasets utilized in this manuscript include: the dbNSFP database v.4.1a (https://sites.google.com/site/jpopgen/dbNSFP) and gnomAD exomes v.2.1 (https://gnomad.broadinstitute.org/downloads).

Code availability

Example scripts of our approach for the UK Biobank Research Analysis Platform (implementations of PGS adjustment in SAIGE-GENE+ and BOLT-LMM) have been made available through the GitHub repository: https://github.com/seanjosephjurgens/RVAT_PGSadjust. Quality control of individual-level data was performed using Hail v.0.2 (https://hail.is) as well as PLINK v.2.0.a (https://www.cog-genomics.org/plink/2.0/). Variant annotation was performed using VEP v.95 (https://github.com/Ensembl/ensembl-vep). Main common variant association analyses (GWAS) were performed using REGENIE v.2.0.2 (https://github.com/rgcgithub/regenie). Genome-wide polygenic scores were computed using PRS-CS (https://github.com/getian107/PRScs; githash: 43128be7fc9ca16ad8b85d8754c538bcfb7ec7b4). Main rare variant association analyses were performed using an adaptation of the R package GENESIS v.2.18 (https://rdrr.io/bioc/GENESIS/man/GENESIS-package.html), which has previously been made available by us through the GitHub repository: https://github.com/seanjosephjurgens/UKBB_200KWES_CVD. Analyses were run within R v.4.0 (https://www.r-project.org). Other RVAT software used in the present study includes fastGWA implemented in GCTA v.1.94.0 (https://yanglab.westlake.edu.cn/software/gcta/#fastGWA), BOLT-LMM v.2.4 (https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html), SAIGE-GENE+ v.1.0.9 (https://saigegit.github.io/SAIGE-doc/) and REGENIE v.2.0.2 (https://github.com/rgcgithub/regenie).

References

  1. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).

  2. Szustakowski, J. D. et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet. 53, 942–948 (2021).

  3. Cirulli, E. T. et al. Genome-wide rare variant analysis for thousands of phenotypes in over 70,000 exomes from two cohorts. Nat. Commun. 11, 542 (2020).

  4. Jurgens, S. J. et al. Analysis of rare genetic variation underlying cardiometabolic diseases and traits among 200,000 individuals in the UK Biobank. Nat. Genet. 54, 240–250 (2022).

  5. Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021).

  6. Ge, T., Chen, C. Y., Neale, B. M., Sabuncu, M. R. & Smoller, J. W. Phenome-wide heritability analysis of the UK Biobank. PLoS Genet. 13, e1006711 (2017).

  7. Pirinen, M., Donnelly, P. & Spencer, C. C. Including known covariates can reduce power to detect genetic effects in case-control studies. Nat. Genet. 44, 848–851 (2012).

  8. Robinson, L. D. & Jewell, N. P. Some surprising results about covariate adjustment in logistic regression models. Int. Stat. Rev. 58, 227–240 (1991).

  9. Bennett, D., O’Shea, D., Ferguson, J., Morris, D. & Seoighe, C. Controlling for background genetic effects using polygenic scores improves the power of genome-wide association studies. Sci. Rep. 11, 19571 (2021).

  10. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

  11. Ge, T., Chen, C. Y., Ni, Y., Feng, Y. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776 (2019).

  12. Kasar, S. et al. Whole-genome sequencing reveals activation-induced cytidine deaminase signatures during indolent chronic lymphocytic leukaemia evolution. Nat. Commun. 6, 8866 (2015).

  13. Jiang, L. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51, 1749–1755 (2019).

  14. Jiang, L., Zheng, Z., Fang, H. & Yang, J. A generalized linear mixed model association tool for biobank-scale data. Nat. Genet. 53, 1616–1621 (2021).

  15. Loh, P. R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).

  16. Zhou, W. et al. SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests. Nat. Genet. 54, 1466–1469 (2022).

  17. Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).

  18. Li, Z. et al. A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat. Methods 19, 1599–1611 (2022).

  19. Li, X. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 52, 969–983 (2020).

  20. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).

  21. Yun, T. et al. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. Bioinformatics 36, 5582–5589 (2021).

  22. Choi, S. W., Mak, T. S. & O’Reilly, P. F. Tutorial: a guide to performing polygenic risk score analyses. Nat. Protoc. 15, 2759–2772 (2020).

  23. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  24. Gogarten, S. M. et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics 35, 5346–5348 (2019).

  25. Zhou, W. et al. Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat. Genet. 52, 634–639 (2020).

  26. Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).

  27. Povysil, G. et al. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat. Rev. Genet. 20, 747–759 (2019).

Acknowledgements

We thank all UK Biobank participants, as this study would not have been possible without their contributions. P.T.E. is supported by the National Institutes of Health (grant nos. 1R01HL092577, 5R01HL139731 and 1R01HL157635), the American Heart Association (grant no. 18SFRN34110082) and the European Union (grant no. MAESTRIA 965286). S.A.L. is supported by the National Institutes of Health (grant no. 1R01HL139731) and by the American Heart Association (grant no. 18SFRN34250007). S.J.J. is supported by an Amsterdam UMC Doctoral Fellowship, and by the Junior Clinical Scientist Fellowship from the Dutch Heart Foundation (grant no. 03-007-2022-0035). J.P.P. is supported by the John S. LaDue Memorial Fellowship for Cardiovascular Research and by the National Institutes of Health (grant no. 5K08HL159346). S.H.C. is supported by the National Institutes of Health (grant nos. R01HL149352, R01HL111024, 2R01HL127564-05A1, 1U01AG058589-01A1, 1U01AG068221-01A1 and 1R01HL164824-01), and was supported by the BioData Ecosystem Fellowship program.

Author information

Contributions

S.J.J., J.P.P. and P.T.E. conceived and designed the study. S.J.J., J.P.P., S.H.C. and V.N.M. performed data curation and data processing. S.J.J. and J.P.P. performed statistical and bioinformatic analyses. S.A.L., K.L.L. and P.T.E. supervised the overall study. S.J.J., J.P.P. and P.T.E. wrote the manuscript. M.C. contributed critically to the analysis plan. All authors critically revised and approved the manuscript.

Corresponding author

Correspondence to Patrick T. Ellinor.

Ethics declarations

Competing interests

P.T.E. has received sponsored research support from Bayer AG, IBM Health, Bristol Myers Squibb and Pfizer; he has consulted for Bayer AG, Novartis and MyoKardia. S.A.L. is a full-time employee of Novartis Institutes of BioMedical Research as of 18 July 2022. S.A.L. previously received sponsored research support from Bristol Myers Squibb, Pfizer, Boehringer Ingelheim, Fitbit, Medtronic, Premier and IBM, and has consulted for Bristol Myers Squibb, Pfizer, Blackstone Life Sciences and Invitae. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Number of significant lead variants from common variant GWAS and variance explained by subsequently derived PGS across the 65 traits.

a, Violin plots for the number of significant independent lead variants from common variant GWAS across 65 phenotypes. Results from out-of-sample GWAS (230k, red), in-sample GWAS (190k, blue) and total GWAS (460k, purple) in the UK Biobank are shown. b, Violin plots for the phenotypic variance explained (R²) by six types of PGS across the 65 phenotypes. Red shows two PGS derived from out-of-sample GWAS data, blue shows two PGS derived from completely in-sample GWAS data, and purple shows results for PGS derived from total GWAS data. All types of PGS explained variance for their respective traits, although we urge caution in interpreting the magnitude of the R² values for the in-sample and total PGS, as discovery samples were also included in PGS testing. Boxplots: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.

Extended Data Fig. 2 Regression of δ -log10(P) values after PGS-adjustment over the unadjusted -log10(P) values for positive control associations.

The y-axis represents the δ between PGS-adjusted -log10(P) and unadjusted -log10(P) values for positive control associations identified from Backman et al. (ref. 5; Supplementary Note), while the x-axis represents the unadjusted -log10(P) values. a, Results for out-of-sample derived PGS. b, Results for in-sample PGS. c, Results for the ‘total’ cohort derived PGS. Regression slopes and P-values from standard linear regression, together with the regression trend line, are shown in each plot. For all models, there is a trend towards a positive association between unadjusted -log10(P) and the subsequent improvement in RVAT power. The trend reached P < 0.0083 (= 0.05/6) for all PGS models except PGS_CS (in-sample). β, regression coefficient; δ, difference.
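The regression summarized in this legend can be sketched as follows. This is a minimal illustration with made-up P-values, not the authors' data or code: it computes δ as the adjusted minus the unadjusted -log10(P), fits an ordinary least-squares line of δ on the unadjusted -log10(P), and reproduces the Bonferroni threshold 0.05/6 quoted above. The function name and toy inputs are hypothetical.

```python
import math


def delta_regression(p_unadjusted, p_adjusted):
    """OLS slope and intercept of delta(-log10 P) on the unadjusted -log10 P."""
    x = [-math.log10(p) for p in p_unadjusted]
    delta = [-math.log10(pa) - xi for pa, xi in zip(p_adjusted, x)]
    n = len(x)
    mean_x = sum(x) / n
    mean_d = sum(delta) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (di - mean_d) for xi, di in zip(x, delta))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_x
    return slope, intercept


# Bonferroni threshold for six PGS models, as stated in the legend.
BONFERRONI_ALPHA = 0.05 / 6

# Hypothetical positive-control P-values: adjustment makes each -log10(P) 10% larger.
p_unadj = [1e-8, 1e-12, 1e-20, 1e-30]
p_adj = [p ** 1.1 for p in p_unadj]

slope, intercept = delta_regression(p_unadj, p_adj)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, alpha={BONFERRONI_ALPHA:.4f}")
```

A positive slope means that associations that were already strong gain more from adjustment, which is the trend the legend describes.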

Extended Data Fig. 3 Correlation between SNP-heritability and the change in the number of significant rare variant associations after PGS adjustment across the 65 traits.

In each plot, the x-axis represents trait SNP-heritability (h²_SNP) estimated using Linkage Disequilibrium Score Regression. The y-axis represents the change in the number of RVAT associations reaching exome-wide significance (α = 2.6 × 10⁻⁶) after adjusting for PGS, across the studied traits (n = 65). RVAT yield change (defined as the difference in the number of significant associations after PGS adjustment compared to models without PGS) is regressed on h²_SNP using ordinary linear regression; the regression trend line is added in each plot. a, Results for out-of-sample derived PGS. b, Results for in-sample PGS. c, Results for the ‘total’ cohort derived PGS. For all models, there is a trend towards a positive association between trait h²_SNP and change in RVAT yield (P < 0.05 and β > 0). The trend reached P < 0.0083 (= 0.05/6) for all PGS models except PGS_CS (in-sample). β, regression coefficient; α, significance cutoff.
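The yield-change quantity defined in this legend is simple to compute once per-test P-values are available. The sketch below assumes two parallel lists of gene-based P-values for one trait (hypothetical inputs) and uses the exome-wide cutoff stated above; the function names are illustrative only.

```python
# Exome-wide significance cutoff quoted in the legend.
EXOME_WIDE_ALPHA = 2.6e-6


def count_significant(p_values, alpha=EXOME_WIDE_ALPHA):
    """Number of tests with P below the significance cutoff."""
    return sum(1 for p in p_values if p < alpha)


def yield_change(p_unadjusted, p_adjusted, alpha=EXOME_WIDE_ALPHA):
    """Significant associations with PGS adjustment minus those without it."""
    return count_significant(p_adjusted, alpha) - count_significant(p_unadjusted, alpha)


# Hypothetical P-values for three gene-trait pairs.
unadjusted = [1e-7, 1e-5, 3e-6]
adjusted = [1e-8, 1e-6, 2e-6]
print(yield_change(unadjusted, adjusted))
```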

Extended Data Fig. 4 Results for gene-based testing of LOF and missense variants at MAF < 0.1%.

Data are presented in violin plots with overlaid boxplots. The first column shows results restricted to gene-based associations reaching Bonferroni-corrected significance, while the second column shows results for gene-based associations reaching conventional exome-wide significance. a, Results for all qualifying gene-based associations. The n gene-trait pairs for distributions in the left panel equal 206, 217, 206, 213, 207 and 218 (from left to right), while the n values equal 321, 327, 310, 318, 320 and 335 (from left to right) in the right panel. b, Results restricted to associations that were identified using MAF < 0.1% but that were not identified in the initial analysis where MAC < 40 was applied. The n gene-trait pairs for distributions in the left panel equal 25, 33, 28, 33, 28 and 31 (from left to right), while the n values equal 57, 62, 57, 64, 56 and 58 (from left to right) in the right panel. P-values from Wilcoxon signed rank tests and estimated paired group differences from paired t-tests (after removing outliers) are added above each violin. P-values are two-sided and unadjusted for multiple testing. Boxplots: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. δ, difference; α, significance cutoff.
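The boxplot conventions and the outlier-trimmed paired difference mentioned in this legend can be sketched in a few lines. The exact outlier rule used by the authors is not stated here, so this example assumes the boxplot rule (points beyond 1.5x the interquartile range from the quartiles) and the default quartile method of Python's `statistics.quantiles`. The function name and the toy -log10(P) values are hypothetical.

```python
import statistics


def paired_difference_summary(neglog_p_unadjusted, neglog_p_adjusted):
    """Summarize paired differences in -log10(P), trimming boxplot-style outliers."""
    deltas = [a - u for u, a in zip(neglog_p_unadjusted, neglog_p_adjusted)]
    q1, median, q3 = statistics.quantiles(deltas, n=4)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Assumed outlier rule: drop points outside the whisker range.
    kept = [d for d in deltas if lower <= d <= upper]
    return {
        "median": median,
        "n_outliers": len(deltas) - len(kept),
        "mean_difference": statistics.fmean(kept),
    }


# Hypothetical paired -log10(P) values for eight gene-trait pairs.
unadjusted = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
adjusted = [6.2, 7.3, 8.4, 9.5, 10.6, 11.7, 12.8, 22.0]
print(paired_difference_summary(unadjusted, adjusted))
```

The formal tests in the legend (paired t-test, Wilcoxon signed rank test) would then be applied to the same paired values.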

Extended Data Fig. 5 Comparison of P-values between full-PGS-adjusted and LOCO-PGS-adjusted models from height and LDL.

In these scatter plots, the y-axis shows -log10(P) values from gene-based testing with adjustment for the full out-of-sample PGS_CS, while the x-axis shows the -log10(P) values for the leave-one-chromosome-out (LOCO) PGS_CS. a, Results for height. b, Results for LDL cholesterol. The left panels show all gene-trait pair results, while the right panels show all exome-wide significant signals (and are capped at y = 30 and x = 30 for clarity).
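One way to picture the LOCO comparison is through per-chromosome partial scores. The sketch below assumes, purely for illustration, that the PGS is an additive sum of per-chromosome contributions, so that the LOCO score used when testing a gene excludes the contribution of that gene's chromosome. The data layout and function names are hypothetical and do not describe the authors' implementation.

```python
def full_pgs(partial_scores):
    """Sum per-chromosome partial scores into one score per individual.

    `partial_scores` maps a chromosome label to a list of per-individual scores.
    """
    n_individuals = len(next(iter(partial_scores.values())))
    return [
        sum(scores[i] for scores in partial_scores.values())
        for i in range(n_individuals)
    ]


def loco_pgs(partial_scores, left_out_chrom):
    """Score built from every chromosome except `left_out_chrom`."""
    others = {
        chrom: scores
        for chrom, scores in partial_scores.items()
        if chrom != left_out_chrom
    }
    return full_pgs(others)


# Hypothetical partial scores for two individuals on three chromosomes.
partial = {"1": [0.5, -0.2], "2": [0.1, 0.3], "3": [0.0, 1.0]}
print(full_pgs(partial))       # full score per individual
print(loco_pgs(partial, "1"))  # score to use when testing genes on chromosome 1
```

Excluding the tested gene's chromosome guards against common variants in linkage disequilibrium with the rare variants absorbing part of the signal.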

Extended Data Fig. 6 Comparison of P-values between PGS-adjusted and unadjusted models within fastGWA.

The violin plots (with overlaid boxplots) show the distributions of differences in -log10(P) values between unadjusted and PGS-adjusted models. The left panel results are restricted to associations reaching Bonferroni-corrected significance in either analysis (PGS adjusted or unadjusted), while the right panel is restricted to associations reaching conventional exome-wide significance in either analysis. Estimated paired group differences (from paired t-tests) and P-values (from paired Wilcoxon signed rank tests) are added above each violin. In all fastGWA runs, a sparsity cutoff of 0.05 was used, while 239,686 high-quality pruned common variants were used for computation of the relatedness matrix. In the left panel, the n gene-trait pairs equal 173, 177, 176 and 175 (from left to right), while in the right panel n values equal 257, 266, 258 and 261 (from left to right). P-values are two-sided and unadjusted for multiple testing. Boxplots: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. δ, difference; α, significance cutoff.

Extended Data Fig. 7 Comparison of P-values between PGS-adjusted and unadjusted models within BOLT-LMM.

The violin plots (with overlaid boxplots) show the distributions of differences in -log10(P) values between unadjusted and PGS-adjusted models. a, Results for adjustment for out-of-sample PGS, where red indicates results for BOLT-LMM-Inf models and gold shows results for BOLT-LMM models. In the left panel, the n gene-trait pairs equal 176, 180, 175 and 182 (from left to right), while in the right panel the n values equal 262, 267, 271 and 282 (from left to right). b, Results for adjustment for in-sample PGS, where blue indicates BOLT-LMM-Inf and gold indicates BOLT-LMM models. In the left panel, the n gene-trait pairs equal 175, 176, 177 and 174 (from left to right), while in the right panel the n values equal 256, 257, 269 and 268 (from left to right). In both a and b, the left panel results are restricted to associations reaching Bonferroni-corrected significance in either analysis (PGS adjusted or unadjusted), while the right panel is restricted to associations reaching conventional exome-wide significance in either analysis. Estimated paired group differences (from paired t-tests) and P-values (from paired Wilcoxon signed rank tests) are added above each violin. In all BOLT runs, 240,699 high-quality pruned common variants were used for computation of the genetic relatedness matrix. P-values are two-sided and unadjusted for multiple testing. Boxplots: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. δ, difference; α, significance cutoff.

Extended Data Fig. 8 Comparison of P-values between PGS-adjusted and unadjusted models for SKAT-O tests within SAIGE-GENE+.

The violin plots (with overlaid boxplots) show the distributions of differences in -log10(P) values between unadjusted and PGS-adjusted models. The left panel results are restricted to associations reaching Bonferroni-corrected significance in either analysis (PGS adjusted or unadjusted), while the right panel is restricted to associations reaching conventional exome-wide significance in either analysis. Estimated paired group differences (from paired t-tests) and P-values (from paired Wilcoxon signed rank tests) are added above each violin. In all SAIGE-GENE+ runs, the computationally efficient sparse matrix option was used with a 0.05 cutoff, while ~240k high-quality pruned common variants (numbers differed slightly per trait) were used for computation of the relatedness matrix. In the left panel, the n gene-trait pairs equal 185, 186, 186 and 182 (from left to right), while in the right panel the n values equal 257, 266, 258 and 261 (from left to right). P-values are two-sided and unadjusted for multiple testing. Boxplots: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. δ, difference; α, significance cutoff.

Extended Data Fig. 9 Comparison of P-values between PGS-adjusted and unadjusted models from REGENIE.

The y-axis of this scatter plot shows the -log10(P) values from gene-based burden testing using REGENIE with adjustment for out-of-sample PGS, while the x-axis shows the unadjusted -log10(P) values from REGENIE. Tests are restricted to Bonferroni-significant associations. a, Results for PGS_lead-SNP. b, Results for PGS_CS. The left panels show all qualifying results, while the right panels are capped at x = 100 and y = 100 for clarity. Test statistics were very similar between adjusted and unadjusted models.

Supplementary information

Supplementary Information

Supplementary Note and Figs. 1–6.

Reporting Summary

Peer Review File

Supplementary Table

Supplementary Tables 1–16.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jurgens, S.J., Pirruccello, J.P., Choi, S.H. et al. Adjusting for common variant polygenic scores improves yield in rare variant association analyses. Nat Genet 55, 544–548 (2023). https://doi.org/10.1038/s41588-023-01342-w

  • DOI: https://doi.org/10.1038/s41588-023-01342-w
