arising from H. Gu et al. Nature Communications https://doi.org/10.1038/s41467-023-37468-y (2023)
With the recent onset of the SARS-CoV-2 pandemic, there has been great interest in interpreting the within-patient evolutionary dynamics of this virus. Indeed, the accurate identification of genomic regions experiencing positive selection, and the quantification of these selective effects, is of crucial importance for both evolutionary and clinical interpretation. With this goal, Gu et al.1 recently collected 2820 respiratory samples to investigate observed levels of within-patient synonymous relative to non-synonymous variation, and relied upon this comparison to classify genomic regions as evolving under purifying selection, neutrality, or positive selection. Specifically, they interpreted \({\pi }_{N}-{\pi }_{S}\) > 0 as being indicative of positive selection, ~0 as being indicative of neutrality, and <0 as being indicative of purifying selection (e.g., see Fig. 2 of Gu et al.). Using this criterion when performing sliding window analyses, the authors claimed that multiple genomic regions are experiencing positive selection. Crucially, the authors relied upon their selection inference derived from these \(\pi\)-based comparisons to support conclusions regarding infection dynamics in vaccinated vs. unvaccinated patients, a focal point of their publication.
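For concreteness, the statistic under discussion can be sketched as follows. This is a minimal illustration only, not the pipeline used by Gu et al.: the function names, the biallelic per-site heterozygosity \(2p(1-p)\), and the site counts in the usage example are all assumptions introduced here.

```python
def site_diversity(freq, depth=None):
    """Per-site diversity 2p(1-p) for a biallelic variant at frequency p.

    If a sampling depth is given, apply the usual n/(n-1) finite-sample
    correction (an assumption of this sketch, not taken from Gu et al.).
    """
    h = 2.0 * freq * (1.0 - freq)
    if depth is not None and depth > 1:
        h *= depth / (depth - 1)
    return h


def pi_n_minus_pi_s(variants, n_sites_nonsyn, n_sites_syn):
    """Return pi_N - pi_S for one sample or window.

    variants: iterable of (frequency, is_nonsynonymous) pairs.
    n_sites_nonsyn, n_sites_syn: number of non-synonymous and synonymous
    sites in the region (hypothetical inputs supplied by the caller).
    """
    pi_n = sum(site_diversity(f) for f, ns in variants if ns) / n_sites_nonsyn
    pi_s = sum(site_diversity(f) for f, ns in variants if not ns) / n_sites_syn
    return pi_n - pi_s


# Hypothetical window: two non-synonymous SNPs and one synonymous SNP.
example = [(0.05, True), (0.10, True), (0.30, False)]
difference = pi_n_minus_pi_s(example, n_sites_nonsyn=700, n_sites_syn=300)
print(difference)  # negative here, because the single synonymous SNP dominates
```

With so few variants per window, a single SNP can determine the sign of the statistic, which is the sensitivity that the remainder of this article examines.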
There is a long history in the field of population genetics of comparing non-synonymous and synonymous divergence in this regard (i.e., dN/dS), as well as in jointly interpreting non-synonymous to synonymous divergence relative to polymorphism (e.g., as implemented in the McDonald-Kreitman test2, as well as numerous other related implementations; see refs. 3,4). In this framework, assuming that synonymous sites are evolving neutrally, the neutral divergence at these sites under genetic drift alone will be equal to the neutral mutation rate5, and thus non-synonymous divergence may be interpreted as being depressed by purifying selection or accelerated by positive selection relative to this synonymous/neutral standard.
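As a reminder of how the divergence-and-polymorphism framework is typically applied, the sketch below computes the neutrality index and the derived proportion of adaptive substitutions from a McDonald-Kreitman 2×2 table, together with a two-sided Fisher's exact test. The counts in the usage example are hypothetical, and the function names are introduced here purely for illustration.

```python
from math import comb


def mk_summary(dn, ds, pn, ps):
    """Neutrality index NI = (Pn/Ps)/(Dn/Ds) and alpha = 1 - NI."""
    ni = (pn / ps) / (dn / ds)
    return ni, 1.0 - ni


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts: fixed differences (Dn, Ds) and polymorphisms (Pn, Ps).
ni, alpha = mk_summary(dn=20, ds=20, pn=10, ps=40)
p_value = fisher_exact_two_sided(20, 20, 10, 40)
print(ni, alpha, p_value)
```

A neutrality index below 1 (alpha above 0) indicates an excess of non-synonymous divergence relative to polymorphism; note that this comparison uses divergence as well as polymorphism, unlike a comparison of \({\pi }_{N}\) and \({\pi }_{S}\) alone.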
However, this divergence-based interpretation does not correctly extend to a comparison of \({\pi }_{N}\) and \({\pi }_{S}\) as utilized by Gu et al. As one example, the effects of selection at linked sites (see review of ref. 6) render this polymorphism-level interpretation problematic. Namely, even if mutations at synonymous sites are themselves neutral (and see ref. 7), their observed frequency in the population may be shaped by the episodic genetic hitchhiking effects associated with positive selection (i.e., selective sweeps8), and will be shaped by the constantly occurring genetic hitchhiking effects associated with purifying selection (i.e., background selection9). Importantly, these genetic hitchhiking effects will not impact divergence-based comparisons such as dN/dS (ref. 10; though there are nonetheless important considerations, see refs. 11,12), but they will strongly impact polymorphism-based comparisons such as the \({\pi }_{N}-{\pi }_{S}\) of Gu et al.
For these reasons, one must account for the myriad of evolutionary forces shaping observed levels of within-patient nucleotide variation when performing population genomic inference of this sort13,14. In SARS-CoV-2 specifically, this evolutionary baseline model will necessarily include the underlying mutation and recombination rates, the history of population size change associated with infection, as well as the constant purging of deleterious mutations and the resulting effects on linked sites15,16. Only by accounting for these certain-to-be-operating evolutionary processes may one determine if episodic or hypothesized processes (such as positive or balancing selection) need to be invoked to explain observed levels and patterns of variation17,18,19,20.
Thus, to investigate the claims of Gu et al., we simulated this SARS-CoV-2 baseline model in both the presence and absence of positive selection to better interpret the behavior of \({\pi }_{N}-{\pi }_{S}\). As shown in Figs. 1 and 2, these simulations reveal multiple reasons to question their interpretations. Firstly, because of the small number of variable sites observed in the SARS-CoV-2 genome in any given patient sample, particularly after their filtering for SNPs segregating at greater than 2.5% frequency in a folded site frequency spectrum (i.e., resulting in a median of ~5 SNPs/sampled genome in the patient data), there is an extremely large variance associated with \({\pi }_{N}\) and \({\pi }_{S}\), which is only exacerbated by further reducing the scale of inference to specific genomic windows. For example, as shown in Fig. 1, in the complete absence of positive selection, it is naturally the case that purifying selection will on average reduce the frequencies of non-synonymous relative to synonymous variants (though the latter will be experiencing background selection effects); however, it is also the case that the variance is such that there is an appreciable probability of observing \({\pi }_{N}\) values that are larger than \({\pi }_{S}\) (i.e., their criterion for identifying positive selection), particularly on a sliding-window scale.
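The sampling-variance argument can be illustrated with a deliberately simple Monte Carlo sketch. This is not the forward-in-time baseline model simulated for Figs. 1 and 2; it is a toy under stated assumptions: a fixed count of 5 SNPs per replicate, a hypothetical 75% chance that a SNP is non-synonymous, hypothetical site counts, folded frequencies drawn log-uniformly between 2.5% and 50%, and non-synonymous frequencies skewed downward (by taking the smaller of two draws) as a crude stand-in for purifying selection. No positive selection is included.

```python
import random


def simulate_sign_fraction(n_reps=2000, n_snps=5, p_nonsyn=0.75,
                           sites_nonsyn=750, sites_syn=250, seed=1):
    """Fraction of replicates in which pi_N > pi_S with no positive selection.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)

    def draw_freq():
        # Log-uniform on [0.025, 0.5]: density proportional to 1/x above
        # the 2.5% frequency filter, in a folded spectrum.
        return 0.025 * (0.5 / 0.025) ** rng.random()

    positive = 0
    for _rep in range(n_reps):
        pi_n = 0.0
        pi_s = 0.0
        for _snp in range(n_snps):
            if rng.random() < p_nonsyn:
                # Toy purifying selection: keep the smaller of two draws.
                f = min(draw_freq(), draw_freq())
                pi_n += 2.0 * f * (1.0 - f) / sites_nonsyn
            else:
                f = draw_freq()
                pi_s += 2.0 * f * (1.0 - f) / sites_syn
        if pi_n > pi_s:
            positive += 1
    return positive / n_reps


fraction_positive = simulate_sign_fraction()
print(fraction_positive)
```

Under these assumptions, any replicate in which all five SNPs happen to be non-synonymous yields \({\pi }_{N}\) > \({\pi }_{S}\) by construction, so a non-trivial share of replicates would meet the criterion for positive selection even though none is present in the toy model.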
Secondly, even in the presence of positive selection (Fig. 2), the implemented expectation of \({\pi }_{N}-{\pi }_{S}\) > 0 by Gu et al. would not successfully identify this evolutionary process. As shown for both a partial selective sweep (i.e., a beneficial mutation having reached 50% frequency in the patient population) and a complete selective sweep (i.e., a beneficial mutation having reached fixation in the patient population immediately prior to sampling), respectively, the expectation of \({\pi }_{N}-{\pi }_{S}\) remains negative. This observation partly owes to the fact that linked synonymous variants will be increased in frequency via genetic hitchhiking more readily than other linked non-synonymous variants which are likely deleterious; as such, synonymous variation in the hitchhiked region of the genome may be augmented more than non-synonymous variation. In addition, these models are similarly characterized by a large variance.
We additionally extended this model to consider recurrent beneficial mutations. Specifically, we evaluated scenarios in which 1% of new mutations are beneficial and in which 10% of new mutations are beneficial, occurring on the strongly or weakly deleterious DFE backgrounds given in Figs. 1 and 2, or occurring on the DFE background recently estimated for SARS-CoV-2 experimentally21. As shown in Supplementary Fig. 1, genomic windows were observed in all scenarios in which \({\pi }_{N}-{\pi }_{S}\) is both greater than and less than 0, and even genome-wide there is no significant differentiation in these distributions. It is worth emphasizing that while an extreme scenario in which 10% of all newly arising mutations are strongly beneficial and simultaneously segregating in the population may indeed elevate \({\pi }_{N}\) relative to \({\pi }_{S}\), even this unrealistic parameter space does not reliably produce this pattern. Furthermore, given that elevated \({\pi }_{N}\) may also be readily generated by models lacking positive selection entirely as shown, this \(\pi\)-based approach of Gu et al. remains inappropriate owing to issues of identifiability.
In summary, \({\pi }_{N}-{\pi }_{S}\) is not a reliable indicator of selective effects and dynamics. As shown in the specific case of SARS-CoV-2, the large variance associated with relatively few genomic SNPs renders the interpretation highly tenuous, leading to a situation in which values greater than 0 and less than 0 are both associated with appreciable probabilities in the presence of purifying selection alone. Furthermore, even with the addition of positive selection, the observation of \({\pi }_{N}\) > \({\pi }_{S}\) is unreliable owing partly to the effects of genetic hitchhiking. For these reasons, statistical inference procedures which directly account for multiple competing evolutionary processes (see refs. 22,23), and which utilize more sophisticated expectations associated with patterns of variation in the site frequency spectrum and linkage disequilibrium associated with positive selection (as reviewed by ref. 24, and see ref. 25), would be required to evaluate the claims of Gu et al.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Datasets generated and/or analyzed during the current study are available in the paper. Source data are provided with this paper.
Code availability
All scripts and data underlying the simulations, analyses, and Figures may be found at: https://github.com/vivaksoni/Gu_etal_2023_response.
References
1. Gu, H. et al. Within-host genetic diversity of SARS-CoV-2 lineages in unvaccinated and vaccinated individuals. Nat. Commun. 14, 1793 (2023).
2. McDonald, J. H. & Kreitman, M. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351, 652–654 (1991).
3. Charlesworth, B. & Charlesworth, D. Elements of Evolutionary Genetics. (W. H. Freeman and Company, New York, 2010).
4. Walsh, B. & Lynch, M. Evolution and Selection of Quantitative Traits. (Oxford University Press, Oxford, 2018).
5. Kimura, M. The Neutral Theory of Molecular Evolution. (Cambridge University Press, Cambridge, 1983).
6. Charlesworth, B. & Jensen, J. D. Effects of selection at linked sites on patterns of genetic variability. Annu. Rev. Ecol. Evol. Syst. 52, 177–197 (2021).
7. Wang, H., Pipes, L. & Nielsen, R. Synonymous mutations and the molecular evolution of SARS-CoV-2 origins. Virus Evol. 7, 1–11 (2021).
8. Maynard Smith, J. & Haigh, J. The hitch-hiking effect of a favourable gene. Genet. Res. 23, 23–35 (1974).
9. Charlesworth, B., Morgan, M. T. & Charlesworth, D. The effect of deleterious mutations on neutral molecular variation. Genetics 134, 1289–1303 (1993).
10. Birky, C. W. & Walsh, J. B. Effects of linkage on rates of molecular evolution. Proc. Natl Acad. Sci. 85, 6414–6418 (1988).
11. Eyre-Walker, A. Changing effective population size and the McDonald-Kreitman test. Genetics 162, 2017–2024 (2002).
12. Kryazhimskiy, S. & Plotkin, J. B. The population genetics of dN/dS. PLoS Genet. 4, e1000304 (2008).
13. Johri, P., Eyre-Walker, A., Gutenkunst, R. N., Lohmueller, K. E. & Jensen, J. D. On the prospect of achieving accurate joint estimation of selection with population history. Genome Biol. Evol. 14, evac088 (2022).
14. Johri, P. et al. Recommendations for improving statistical inference in population genomics. PLOS Biol. 20, e3001669 (2022).
15. Terbot, J. W. et al. Developing an appropriate evolutionary baseline model for the study of SARS-CoV-2 patient samples. PLOS Pathog. 19, e1011265 (2023).
16. Terbot, J. W. et al. A simulation framework for modeling the within-patient evolutionary dynamics of SARS-CoV-2. Genome Biol. Evol. 15, evad204 (2023).
17. Irwin, K. K. et al. On the importance of skewed offspring distributions and background selection in virus population genetics. Heredity 117, 393–399 (2016).
18. Jensen, J. D. & Kowalik, T. F. A consideration of within-host human cytomegalovirus genetic variation. Proc. Natl Acad. Sci. 117, 816–817 (2020).
19. Jensen, J. D. Studying population genetic processes in viruses: from drug-resistance evolution to patient infection dynamics. In: Bamford, D. H. and Zuckerman, M. (eds.) Encyclopedia of Virology, 4th edition 5, 227–232 (2021).
20. Johri, P., Stephan, W. & Jensen, J. D. Soft selective sweeps: addressing new definitions, evaluating competing models, and interpreting empirical outliers. PLOS Genet. 18, e1010022 (2022).
21. Flynn, J. A. et al. Comprehensive fitness landscape of SARS-CoV-2 Mpro reveals insights into viral resistance mechanisms. Elife 11, e77433 (2022).
22. Johri, P., Charlesworth, B. & Jensen, J. D. Toward an evolutionarily appropriate null model: jointly inferring demography and purifying selection. Genetics 215, 173–192 (2020).
23. Howell, A. A. et al. Developing an appropriate evolutionary baseline model for the study of human cytomegalovirus. Genome Biol. Evol. 15, evad059 (2023).
24. Stephan, W. Selective sweeps. Genetics 211, 5–13 (2019).
25. Soni, V., Johri, P. & Jensen, J. D. Evaluating power to detect recurrent selective sweeps under increasingly realistic evolutionary null models. Evolution 77, 2113–2127 (2023).
26. Haller, B. C. & Messer, P. W. SLiM 4: Multispecies eco-evolutionary modeling. Am. Nat. 201, E127–E139 (2023).
Acknowledgements
This work was supported by the National Institutes of Health grant R35GM139383 to J.D.J.
Author information
Contributions
VS, JWT and JDJ conceived the project; VS performed simulations with input from JWT and JDJ; VS, JWT and JDJ wrote the manuscript; JDJ provided funding for the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Soni, V., Terbot, J.W. & Jensen, J.D. Population genetic considerations regarding the interpretation of within-patient SARS-CoV-2 polymorphism data. Nat Commun 15, 3240 (2024). https://doi.org/10.1038/s41467-024-46261-4
DOI: https://doi.org/10.1038/s41467-024-46261-4