Pandemic-scale phylogenomics reveals the SARS-CoV-2 recombination landscape

Turakhia, Yatish; Thornlow, Bryan; Hinrichs, Angie; McBroome, Jakob; Ayala, Nicolas; Ye, Cheng; Smith, Kyle; De Maio, Nicola; Haussler, David; Lanfear, Robert; Corbett-Detig, Russell

doi:10.1038/s41586-022-05189-9

Article
Open access
Published: 11 August 2022

Nature volume 609, pages 994–997 (2022)

Abstract

Accurate and timely detection of recombinant lineages is crucial for interpreting genetic variation, reconstructing epidemic spread, identifying selection and variants of interest, and accurately performing phylogenetic analyses^1,2,3,4. During the SARS-CoV-2 pandemic, genomic data generation has exceeded the capacities of existing analysis platforms, thereby crippling real-time analysis of viral evolution⁵. Here, we use a new phylogenomic method to search a nearly comprehensive SARS-CoV-2 phylogeny for recombinant lineages. In a 1.6 million sample tree from May 2021, we identify 589 recombination events, which indicate that around 2.7% of sequenced SARS-CoV-2 genomes have detectable recombinant ancestry. Recombination breakpoints are inferred to occur disproportionately in the 3' portion of the genome that contains the spike protein. Our results highlight the need for timely analyses of recombination for pinpointing the emergence of recombinant lineages with the potential to increase transmissibility or virulence of the virus. We anticipate that this approach will empower comprehensive real-time tracking of viral recombination during the SARS-CoV-2 pandemic and beyond.

Main

Recombination is a primary contributor of new genetic variation in many prevalent pathogens, including betacoronaviruses⁶, the clade that includes SARS-CoV-2. By mixing genetic material from diverse genomes, recombination can produce new combinations of mutations that have potentially important phenotypic effects⁷. For example, recombination is thought to have played an important role in the recent evolutionary histories of Middle East respiratory syndrome⁸ and severe acute respiratory syndrome coronavirus (SARS-CoV)^9,10,11,12. Recombination might also have the potential to generate viruses with zoonotic potential in the future¹³. Therefore, accurate and timely characterization of recombination is foundational for understanding the evolutionary biology and infectious potential of established and emerging pathogens in human, agricultural and natural populations.

Now that substantial genetic diversity is present across SARS-CoV-2 populations¹⁴ and co-infection with different SARS-CoV-2 variants has been known to sometimes occur¹⁵, recombination is expected to be an important source of new genetic variation during the pandemic. Whether or not there is a detectable signal for recombination events in the SARS-CoV-2 genomes has been fiercely debated since the early days of the pandemic¹³. Nonetheless, several apparently genuine recombinant lineages have been identified using ad hoc approaches¹⁶ and semi-automated methods that cope with vast SARS-CoV-2 datasets by reducing the search space for possible pairs of recombinant ancestors^16,17. Because of the importance of timely and accurate surveillance of viral genetic variation during the continuing SARS-CoV-2 pandemic, new approaches for detecting and characterizing recombinant haplotypes are needed to evaluate new variant genome sequences as quickly as they become available. Such rapid turnaround is essential for driving an informed and coordinated public health response to new SARS-CoV-2 variants.

We developed a new method for detecting recombination in pandemic-scale phylogenies, Recombination Inference using Phylogenetic PLacEmentS (RIPPLES, Fig. 1). Because recombination violates the central assumption of many phylogenetic methods, that is, that a single evolutionary history is shared across the genome, recombinant lineages arising from diverse genomes will often be found on ‘long branches’, which result from accommodating the divergent evolutionary histories of the two parental haplotypes (Fig. 1). Note that as long as recombination is relatively uncommon, phylogenetic inference is expected to remain accurate even when branch lengths are artifactually expanded¹⁸. RIPPLES exploits that signal by first identifying long branches on a comprehensive SARS-CoV-2 mutation-annotated tree^19,20. RIPPLES then exhaustively breaks the potential recombinant sequence into distinct segments and places each onto a global phylogeny using maximum parsimony. RIPPLES reports the two parental nodes (hereafter termed donor and acceptor) that result in the highest parsimony score improvement relative to the original placement on the global phylogeny (Supplementary Text 1). Our approach therefore leverages phylogenetic signals for each parental lineage and the spatial correlation of markers along the genome. We establish significance using a null model conditioned on the inferred site-specific rates of de novo mutation (Supplementary Texts 2 and 3).
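
The segment-wise search described above can be illustrated with a deliberately simplified sketch. It is not the RIPPLES implementation: instead of placing segments on a mutation-annotated tree, it compares a query against a small set of candidate parent sequences and uses the number of mismatching sites as a stand-in for the parsimony score. The function names, the candidate sequences and the restriction to a single breakpoint are all assumptions made for illustration.

```python
from itertools import permutations


def mismatches(a, b, start, end):
    """Count sites in [start, end) where two aligned sequences differ."""
    return sum(1 for i in range(start, end) if a[i] != b[i])


def best_single_breakpoint(query, candidates):
    """Toy search for one breakpoint and an ordered (left, right) parent pair.

    Returns (improvement, breakpoint, left_parent, right_parent), where
    improvement is the drop in mismatches relative to the best single parent.
    """
    n = len(query)
    # Baseline: best placement of the whole sequence on a single candidate.
    baseline = min(mismatches(query, seq, 0, n) for seq in candidates.values())
    best = (0, None, None, None)
    for left, right in permutations(candidates, 2):
        for bp in range(1, n):
            # Score the left segment against one parent, the right against the other.
            cost = (mismatches(query, candidates[left], 0, bp)
                    + mismatches(query, candidates[right], bp, n))
            improvement = baseline - cost
            if improvement > best[0]:
                best = (improvement, bp, left, right)
    return best


# Hypothetical aligned sequences, far shorter than a real genome.
candidates = {
    "parent_a": "AAAAAAAAAA",
    "parent_b": "CCCCCCCCCC",
    "parent_c": "AAAAACCGGG",
}
query = "AAAACCCCCC"  # left part resembles parent_a, right part parent_b
result = best_single_breakpoint(query, candidates)
```

In this toy case the best single parent leaves four mismatches, whereas splitting the query at position 4 between `parent_a` and `parent_b` leaves none, so the reported improvement is 4.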

**Fig. 1: RIPPLES exhaustively searches for optimal parsimony improvements using partial interval placements.**

Substantial testing via simulation indicates that RIPPLES is efficient, sensitive and can confidently identify recombinant lineages (Supplementary Texts 4–6). As expected²¹, when recombination occurs towards the edges of the genome or between genetically similar sequences, it is harder to detect using RIPPLES (Extended Data Figs. 1 and 2). Nonetheless, RIPPLES detects simulated recombinants with 75.8% sensitivity. Among the simulated samples detected as recombinants, RIPPLES accurately identifies 90% of simulated breakpoints (Extended Data Table 1 and Supplementary Text 6). Furthermore, RIPPLES is able to detect all highly confident recombinants identified in a previous analysis¹⁶ (Supplementary Text 6). Recombination analysis using RIPPLES on a global phylogeny of about 1.6 million SARS-CoV-2 genomes shows that a fraction of the sequenced SARS-CoV-2 genomes belongs to detectable recombinant lineages. To mitigate the impacts of sequencing and assembly errors, we excluded all nodes with only a single descendant, applied conservative filters to remove potentially spurious samples from the recombinant sets flagged by RIPPLES, and manually confirmed mutations in a subset of putative recombinant samples using raw sequence read data (Supplementary Texts 7 and 8, Extended Data Table 2 and Extended Data Fig. 3). After this, we retained 589 unique recombination events, which have a combined total of 43,104 descendant samples (Supplementary Table 1). This means that around 2.7% of total sampled SARS-CoV-2 genomes are inferred to belong to detectable recombinant lineages. Post hoc statistical analysis yields an empirical false discovery rate estimate of 11% for our statistical thresholds (Supplementary Text 9 and Extended Data Table 3). Additionally, excess similarity of geographic location and date metadata among the descendants of donor and acceptor nodes supports the notion that many ancestors of recombinant genomes co-circulated within human populations (Supplementary Texts 10 and 11 and Extended Data Figs. 4 and 5). Because recombination events that occur between genetically similar viral lineages are challenging to detect (Extended Data Fig. 2), the RIPPLES estimate is probably conservative, and potentially a large underestimate, of the global frequency of recombination in the SARS-CoV-2 population.

RIPPLES uncovered a strikingly non-uniform distribution of recombination breakpoint positions across the SARS-CoV-2 genome, consistent with previous analyses in betacoronaviruses^11,22. In particular, among putative recombination events there is an excess of recombination breakpoints towards the 3' end of the SARS-CoV-2 genome relative to expectations on the basis of random breakpoint positions (P < 1 × 10⁻⁷; permutation test; Supplementary Text 12). Notably, no such bias is apparent when we simulate recombination breakpoints following a uniform distribution (Supplementary Text 13 and Extended Data Fig. 1). Change-point analysis identifies an increase in the frequency of recombination breakpoints immediately 5' of the spike protein region (20,875 base pairs; Supplementary Text 14), and this pattern is consistent when restricting ourselves to putative nodes with the largest numbers of descendants and among diverse data sources, further suggesting that it is not artefactual (Supplementary Text 15 and Extended Data Table 4). The rate of putative recombination breakpoints is about three times higher towards the 3' of the change point than the 5' interval (Fig. 2), which is similar to the relative recombination rates in the genomes of other human coronaviruses¹¹.
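
The comparison of breakpoint rates on either side of the change point can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis code used in the study: the function name is hypothetical, the rate is taken as breakpoints per base within each interval, and the coordinates in the example are made up on a toy 100-base genome.

```python
def breakpoint_rate_fold_change(midpoints, change_point, genome_length):
    """Ratio of per-base breakpoint rates 3' versus 5' of a change point."""
    n_5prime = sum(1 for p in midpoints if p < change_point)
    n_3prime = sum(1 for p in midpoints if p >= change_point)
    if n_5prime == 0:
        raise ValueError("no breakpoints 5' of the change point")
    # Normalize each count by the length of the interval it falls in.
    rate_5prime = n_5prime / change_point
    rate_3prime = n_3prime / (genome_length - change_point)
    return rate_3prime / rate_5prime


# Made-up interval midpoints: two fall 5' of position 50 and six fall 3' of it.
toy_midpoints = [10, 30, 55, 60, 70, 80, 90, 95]
fold = breakpoint_rate_fold_change(toy_midpoints, change_point=50, genome_length=100)
```

Because the two toy intervals have equal length, the fold change here is simply the ratio of the counts, roughly 3.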

**Fig. 2: RIPPLES detects an excess of recombination in the spike protein region.**

Several lines of evidence suggest that the skewed distribution of recombination breakpoint positions is not a consequence of positive selection at the level of between-host transmission dynamics. First, many of these recombinant clades have existed for a relatively short period of time, and might already be extinct. The mean timespan between the earliest and latest dates of observed descendants of detected recombinant nodes is just 37 days. Second, of the subset of recombination events that we inferred to occur between variants of concern (VOC; lineages B.1.1.7, B.1.351, B.1.617.2 and P.1 (ref. ²³)) and other lineages, VOCs contribute slightly fewer spike protein mutations than non-VOC lineages on average (60 out of 125 VOC/non-VOC recombinants, P = 0.48, sign test). Third, recombinant clade sizes do not differ greatly from the remaining clade sizes, whereas a difference would be expected if recombinant lineages experienced strong selection (P = 0.8470, permutation test). Therefore, although natural selection on between-host transmission dynamics of recombinant lineages could also impact the observed distribution of recombinant breakpoint positions¹¹, our data indicate that other biases shape the distribution of recombination events across the SARS-CoV-2 genome. These could include a neutral mechanistic bias affecting the distribution of recombination breakpoints.

Although not yet widespread among circulating SARS-CoV-2 genomes, recombination has measurably contributed to the genetic diversity in SARS-CoV-2 lineages. The ratio of variable positions contributed by recombination (R) versus those resulting from de novo mutation (M), R/M, is commonly used to summarize the relative impacts of these two sources of variation²². Using our dataset of putative recombination events, we estimate that R/M = 0.00264 in SARS-CoV-2 (Supplementary Text 16). This is low for a coronavirus population (for example, for Middle East respiratory syndrome, R/M is estimated to be 0.25–0.31 (ref. ²²)), which presumably reflects the extremely low genetic diversity among possible recombinant ancestors during the earliest phases of the pandemic and the conservative nature of our approach. As SARS-CoV-2 populations accumulate genetic diversity and co-infect hosts with other species of viruses, recombination will play an increasingly large role in generating functional genetic diversity and this ratio could increase²⁴. RIPPLES is therefore poised to play a primary role in detecting new recombinant lineages and quantifying their impacts on viral genomic diversity as the pandemic progresses.

Our extensively optimized implementation of RIPPLES allows it to search the entire phylogenetic tree and detect recombination both within and between SARS-CoV-2 lineages without a priori defining a set of lineages or clade-defining mutations. This is a key advantage of our approach relative to other methods that cope with the scale of SARS-CoV-2 datasets by reducing the search space for possible recombination events (for example, refs. ^16,17,25). RIPPLES discovers 223 recombination events within branches of the same Pango lineages. Our results also include 366 interlineage recombination events (Supplementary Table 1). Additionally, we find evidence that recombination has influenced the Pango SARS-CoV-2 nomenclature system²³. Specifically, we discover that the root of the B.1.355 lineage might have resulted from a recombination event between nodes belonging to the B.1.595 and B.1.371 lineages (Fig. 3 and Supplementary Table 1). These diverse recombination events highlight the versatility and strengths of the approach taken in RIPPLES.

**Fig. 3: RIPPLES uncovered evidence that the B.1.355 lineage might have resulted from a recombination event between lineages of B.1.595 and B.1.371.**

The detection of increased recombination rates in the 3' portion of the SARS-CoV-2 genome, which contains the spike protein, highlights the utility of continuing surveillance. The spike protein is a primary location of functional novelty for viral lineages as they adapt to transmission within and among human hosts. Our discovery of both the excess of recombination events specifically around the spike protein and the relatively high levels of recombinants in circulation at present underline the importance of monitoring the evolution of new viral lineages that arise through mutation or recombination through real-time analyses of viral genomes. Our work also emphasizes the impact that explicitly considering phylogenetic networks will have for accurate interpretation of SARS-CoV-2 sequences¹¹.

Beyond SARS-CoV-2, recombination is a major evolutionary force driving viral and microbial adaptation. It can drive the spread of antibiotic resistance⁷, drug resistance¹, and immunity and vaccine escape². Identification of recombination is an essential component of pathogen evolutionary analysis pipelines, as recombination can affect the quality of phylogenetic, transmission and phylodynamic inference³. For these reasons, computational tools to detect microbial recombination have become popular and important in recent years⁴. The SARS-CoV-2 pandemic has driven an unprecedented surge of pathogen genome sequencing and data sharing, which has in turn highlighted some of the limitations of current software in investigating large genomic datasets⁵. RIPPLES was built for pandemic-scale datasets and is sufficiently optimized to exhaustively search for recombination in one of the largest phylogenies ever inferred in 40 min (Supplementary Text 17). We expect RIPPLES to perform best on densely sampled genomic datasets, which will probably become the norm for many globally distributed pathogens, but we caution that it has not yet been validated on other species. To facilitate real-time analysis of recombination among tens of thousands of new SARS-CoV-2 sequences being generated by diverse research groups worldwide each day^26,27,28, RIPPLES provides an option to evaluate evidence for recombinant ancestry in any user-supplied samples within minutes (Supplementary Text 17). RIPPLES therefore opens the door for rapid analysis of recombination in heavily sampled and rapidly evolving pathogen populations, and provides a tool for real-time investigation of recombinants during a pandemic.

Methods

RIPPLES uses the space-efficient data structure of mutation-annotated trees (MATs)²⁰, in which the branches of the phylogenetic tree are annotated with mutations that have been inferred to have occurred on them, to identify recombination events. Figure 1 illustrates the underlying algorithm. RIPPLES identifies putative recombinant nodes containing at least the number of mutations specified by the user and, for each such node, infers the set of mutations that have occurred on its corresponding sequence by accounting for all mutations annotated on the branches on its path from the root. RIPPLES then adds one or two breakpoints on mutation sites and assesses parsimony score improvement using partial placements compared to the starting parsimony. For more details, see Supplementary Text 1. To determine whether putative recombinants were significant, we developed a null model by selecting nodes at random and adding k additional mutations drawn from the actual mutation spectra in our global tree. We then placed these samples on the tree and used RIPPLES to determine their parsimony score improvements (Supplementary Text 2). For each putative recombinant in our global tree, we compared its parsimony score improvement to the distribution of null parsimony score improvements for the same initial parsimony score (Supplementary Text 3). We developed our starting tree by first taking the 28 May 2021 public tree^19,20, masking all problematic sites²⁹, and pruning samples with fewer than 28,000 non-N nucleotides and those with two or more non-[ACGTN-] nucleotides (Supplementary Text 5). After this, we optimized this tree by running matOptimize (Supplementary Text 4) twice, with a subtree pruning and regrafting (SPR) radius of first 10 and then 40 in subsequent rounds and with the masked Variant Call Format (VCF) file as an input. Instructions for using RIPPLES are available at https://usher-wiki.readthedocs.io/en/latest/tutorials.html. We ran RIPPLES on the n2d-highcpu-224 Google Cloud Platform instance containing 224 virtual central processing units (vCPUs) (Supplementary Text 18).
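
The comparison of an observed parsimony score improvement against the null distribution for the same initial parsimony score can be sketched as an empirical P value. This is a minimal illustration, not the procedure of Supplementary Text 3: the function name, the definition of the P value as the fraction of null improvements at least as large as the observed one, and the null values themselves are assumptions.

```python
def empirical_p_value(observed_improvement, null_improvements):
    """Fraction of null improvements at least as large as the observed one."""
    if not null_improvements:
        raise ValueError("null distribution is empty")
    hits = sum(1 for x in null_improvements if x >= observed_improvement)
    return hits / len(null_improvements)


# Hypothetical null improvements, keyed by the initial parsimony score of the
# randomly chosen nodes to which extra mutations were added.
null_by_initial_score = {
    6: [0, 0, 1, 0, 2, 1, 0, 0, 3, 0],
}

# A putative recombinant with initial parsimony score 6 and improvement 3.
p = empirical_p_value(3, null_by_initial_score[6])
```

Keying the null distribution by initial parsimony score mirrors the conditioning described above, so that a node is only compared with null samples that started from the same score.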

To test the sensitivity of RIPPLES, we simulated recombinant samples by choosing two random internal nodes from our phylogeny with at least ten descendants and choosing breakpoints at random across the genome. We generated 1,000 simulations each for one and two breakpoint recombinants with no, one, two and three additional mutations added to the sequence after the recombination event, using scripts available at https://github.com/bpt26/recombination/. These combinations yielded 8,000 total simulated recombinant lineages. We then measured the ability of RIPPLES to detect breakpoints as a function of the position of the breakpoint and the minimum genetic distance from the recombinant node to either parent (Supplementary Text 6; genetic distance is estimated on the basis of the number of mutations inferred to separate the focal samples, lineages or nodes). We also evaluated the sensitivity of RIPPLES by ensuring that it detected each of the high-confidence recombinant SARS-CoV-2 clusters of Jackson et al.¹⁶.
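
The construction of a simulated recombinant from two parents can be sketched as below. This is an illustrative stand-in for the scripts in the repository above, not a copy of them: it works on aligned strings rather than tree nodes, the function name is hypothetical, and the choice of which parent contributes which segment is an assumption. Adding mutations after the recombination event is omitted.

```python
import random


def simulate_recombinant(donor, acceptor, n_breakpoints, rng):
    """Build a mosaic of two equal-length aligned sequences.

    With one breakpoint the result is acceptor followed by donor; with two,
    the donor contributes only the segment between the breakpoints.
    """
    if len(donor) != len(acceptor):
        raise ValueError("sequences must be aligned to the same length")
    if n_breakpoints not in (1, 2):
        raise ValueError("only one or two breakpoints are supported")
    # Draw distinct breakpoint positions uniformly at random along the sequence.
    breakpoints = sorted(rng.sample(range(1, len(donor)), n_breakpoints))
    if n_breakpoints == 1:
        (b,) = breakpoints
        return acceptor[:b] + donor[b:], breakpoints
    b1, b2 = breakpoints
    return acceptor[:b1] + donor[b1:b2] + acceptor[b2:], breakpoints


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mosaic, bps = simulate_recombinant("A" * 20, "C" * 20, n_breakpoints=2, rng=rng)
```

Passing the random generator explicitly keeps each simulated lineage reproducible, which is convenient when detection results are later compared with the true breakpoints.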

We applied several post hoc filters to remove putative recombinant nodes that may be false positives resulting from several possible sources of error. For each internal node from each trio (putative recombinant, donor and acceptor nodes) that comprised a recombinant event, we downloaded the consensus genome sequence for the nearest descendants of each node from COG-UK, GenBank, GISAID and the China National Center for Bioinformation. We then aligned the sequences of all descendants for each trio using MAFFT³⁰, focusing specifically on recombination-informative sites, that is, where the allele of the recombinant node matched one parent node but not the other. We removed trios in which recombination-informative mutations were near to indels or missing bases, or in which the entire basis for recombination was a single cluster of mutations in a 20-nucleotide span (Supplementary Text 7). We also confirmed sequence quality by manually examining raw reads for ten samples in which we could confidently link the raw sequence read data to a given consensus genome (Supplementary Text 8). To estimate the false discovery rate (FDR) associated with our specific approach and statistical threshold selected, we computed a post hoc empirical FDR. We obtained the number of internal nodes that we tested and that were associated with a given parsimony score. Then, for each initial parsimony score and parsimony score improvement, we obtained the expected number of internal nodes that would show that parsimony score improvement under the null model. Our FDR (Extended Data Table 3) is the ratio of expected nodes for a given initial and final parsimony score to the number of detected recombinant nodes with the same initial and final parsimony score (Supplementary Text 9).
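
The empirical FDR for one combination of initial parsimony score and improvement can be sketched as a simple ratio. This is a hedged illustration, not the calculation in Supplementary Text 9: the function name is hypothetical, the numbers are made up, and the expected count under the null is assumed to be the number of tested nodes multiplied by the null probability of that improvement.

```python
def empirical_fdr(n_tested, null_probability, n_detected):
    """Expected null nodes divided by detected nodes for one score category."""
    if n_detected <= 0:
        raise ValueError("no detected nodes in this category")
    # Assumed form of the expectation: tested nodes times null probability.
    expected_false = n_tested * null_probability
    return expected_false / n_detected


# Made-up category: 2,000 tested nodes, a null probability of 0.001 for this
# improvement, and 20 nodes actually detected with it.
fdr = empirical_fdr(n_tested=2000, null_probability=0.001, n_detected=20)
```

With these made-up inputs about two of the 20 detections would be expected under the null, giving an FDR of roughly 0.1 for that category.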

We also performed post hoc analysis using sample metadata to determine whether the ancestors of the recombinant nodes had higher spatial or temporal overlap than expected by chance. We computed geographic overlap as the joint probability of choosing a sample from the same country from the descendants of the donor and the acceptor nodes. For temporal overlap, we recorded intervals from the earliest to the most recent sample descended from the donor and acceptor, respectively, and calculated the minimum number of days separating the two intervals (with 0 for overlapping intervals). We generated a null distribution for both categories by selecting, for each detected trio, two random internal nodes from the tree with a number of descendants equal to the real donor and acceptor respectively. We then calculated geographic and temporal overlap in the same way for this random set (Extended Data Fig. 4 and Supplementary Text 10).
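
The two overlap statistics can be sketched directly from their definitions. The function names and the example metadata are hypothetical, and the sketch assumes that the two descendants are drawn independently and uniformly from each node's samples.

```python
from collections import Counter
from datetime import date


def geographic_overlap(donor_countries, acceptor_countries):
    """Probability that one random donor descendant and one random acceptor
    descendant were sampled in the same country."""
    donor_counts = Counter(donor_countries)
    acceptor_counts = Counter(acceptor_countries)
    n_d, n_a = len(donor_countries), len(acceptor_countries)
    shared = donor_counts.keys() & acceptor_counts.keys()
    return sum((donor_counts[c] / n_d) * (acceptor_counts[c] / n_a) for c in shared)


def temporal_gap_days(donor_dates, acceptor_dates):
    """Minimum number of days separating the two sampling intervals
    (0 when the intervals overlap)."""
    d_start, d_end = min(donor_dates), max(donor_dates)
    a_start, a_end = min(acceptor_dates), max(acceptor_dates)
    gap = max(a_start - d_end, d_start - a_end).days
    return max(gap, 0)


# Made-up metadata for the descendants of one donor and one acceptor node.
overlap = geographic_overlap(["UK", "UK", "USA", "India"], ["UK", "USA", "USA", "USA"])
gap = temporal_gap_days(
    [date(2021, 1, 1), date(2021, 1, 20)],
    [date(2021, 1, 25), date(2021, 2, 10)],
)
```

Applying the same two functions to randomly chosen node pairs with matching descendant counts would give the null distribution described above.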

To determine whether identified recombination breakpoints are significantly shifted towards the 3' end of the genome, we performed a permutation test comparing the difference between the mean of the distribution of uniformly simulated breakpoints and the mean of the detected breakpoint position distribution in the true set (Supplementary Text 12). We also conducted a change-point analysis using the changepoint R package³¹ and fit a Poisson model to the count of recombination prediction interval midpoints. We then computed the mean rate of recombination breakpoints within the intervals on either side of the identified change point to estimate the fold increase in recombination rate in the 3' portion of the genome (Supplementary Text 13). To estimate R/M, we found the decrease in parsimony score associated with each detected recombination event as an estimate of R. We then calculated M by taking this value and subtracting it from the total number of mutations observed across our entire phylogeny (Supplementary Text 16). R/M is the ratio of these values.
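
The R/M calculation can be sketched as follows. The function name and the numbers are hypothetical, and summing the parsimony score decreases over all detected events to obtain R is an assumption about how the per-event values are combined.

```python
def r_over_m(parsimony_decreases, total_mutations):
    """Ratio of variation attributed to recombination (R) to that left to mutation (M)."""
    r = sum(parsimony_decreases)  # assumed: R is the summed parsimony decrease
    m = total_mutations - r       # M is what remains of the observed mutations
    if m <= 0:
        raise ValueError("R must be smaller than the total number of mutations")
    return r / m


# Made-up values: three events with decreases of 3, 4 and 5 in a tree
# carrying 1,212 mutations in total, so R = 12 and M = 1,200.
ratio = r_over_m([3, 4, 5], total_mutations=1212)
```

Because M is defined by subtraction from the total, any event missed by the detection step is counted as mutation, which is one reason the resulting ratio is conservative.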

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

All data is available in the manuscript or the supplementary materials. Dataset 1 (containing the phylogeny analysed for recombination in this study in Newick format) and dataset 2 (containing a list of descendant samples of recombinant nodes identified through RIPPLES) are available at https://doi.org/10.5281/zenodo.6717378³².

Code availability

RIPPLES software is available under the MIT license as part of the UShER package at https://github.com/yatisht/usher. We provide a reproducible Google Cloud Platform workflow for RIPPLES under https://github.com/yatisht/usher/tree/master/scripts/recombination. An archived version of the specific code and workflow used in this study is available from https://doi.org/10.5281/zenodo.6709991 (ref. ³³). We distribute RIPPLES with UShER because it uses the same underlying data objects and UShER is required to infer the input MAT. Documentation for RIPPLES and associated utilities can be found at https://usher-wiki.readthedocs.io/en/latest/.

References

Moutouh, L., Corbeil, J. & Richman, D. D. Recombination leads to the rapid emergence of HIV-1 dually resistant mutants under selective drug pressure. Proc. Natl Acad. Sci. USA 93, 6106–6111 (1996).
Golubchik, T. et al. Pneumococcal genome sequencing tracks a vaccine escape variant formed through a multi-fragment recombination event. Nat. Genet. 44, 352–355 (2012).
Schierup, M. H. & Hein, J. Consequences of recombination on traditional phylogenetic analysis. Genetics 156, 879–891 (2000).
Didelot, X. & Wilson, D. J. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput. Biol. 11, e1004041 (2015).
Hodcroft, E. B. et al. Want to track pandemic variants faster? Fix the bioinformatics bottleneck. Nature 591, 30–33 (2021).
Forni, D., Cagliani, R. & Sironi, M. Recombination and positive selection differentially shaped the diversity of betacoronavirus subgenera. Viruses 12, 1313 (2020).
Didelot, X. & Maiden, M. C. J. Impact of recombination on bacterial evolution. Trends Microbiol. 18, 315–322 (2010).
Dudas, G. & Rambaut, A. MERS-CoV recombination: implications about the reservoir and potential for adaptation. Virus Evol. 2, vev023 (2016).
Lau, S. K. P. et al. Severe acute respiratory syndrome (SARS) coronavirus ORF8 protein is acquired from SARS-related coronavirus from greater horseshoe bats through recombination. J. Virol. 89, 10532–10547 (2015).
Holmes, E. C. & Rambaut, A. Viral evolution and the emergence of SARS coronavirus. Philos. Trans. R. Soc. Lond. B Biol. Sci. 359, 1059–1065 (2004).
Müller, N. F., Kistler, K. E. & Bedford, T. Recombination patterns in coronaviruses. Preprint at bioRxiv https://doi.org/10.1101/2021.04.28.441806 (2021).
Bobay, L.-M., O’Donnell, A. C. & Ochman, H. Recombination events are concentrated in the spike protein region of betacoronaviruses. PLoS Genet. 16, e1009272 (2020).
Li, X. et al. Emergence of SARS-CoV-2 through recombination and strong purifying selection. Sci. Adv. 6, eabb9153 (2020).
De Maio, N. et al. Mutation rates and selection on synonymous mutations in SARS-CoV-2. Genome Biol. Evol. 13, evab087 (2021).
Taghizadeh, P. et al. Study on SARS-CoV-2 strains in Iran reveals potential contribution of co-infection with and recombination between different strains to the emergence of new strains. Virology 562, 63–73 (2021).
Jackson, B. et al. Generation and transmission of inter-lineage recombinants in the SARS-CoV-2 pandemic. Cell 184, 5179–5188 (2021).
VanInsberghe, D., Neish, A. S., Lowen, A. C. & Koelle, K. Recombinant SARS-CoV-2 genomes circulated at low levels over the first year of the pandemic. Virus Evol. 7, veab059 (2021).
Hedge, J. & Wilson, D. J. Bacterial phylogenetic reconstruction from whole genomes is robust to recombination but demographic inference is not. MBio 5, e02158 (2014).
Turakhia, Y. et al. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat. Genet. 53, 809–816 (2021).
McBroome, J. et al. A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees. Mol. Biol. Evol. 38, 5819–5824 (2021).
Stephens, J. C. On the frequency of undetectable recombination events. Genetics 112, 923–926 (1986).
Patiño-Galindo, J. Á., Filip, I. & Rabadan, R. Global patterns of recombination across human viruses. Mol. Biol. Evol. 38, 2520–2531 (2021).
Rambaut, A. et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat. Microbiol. 5, 1403–1407 (2020).
Kim, D., Quinn, J., Pinsky, B., Shah, N. H. & Brown, I. Rates of co-infection between SARS-CoV-2 and other respiratory pathogens. JAMA 323, 2085–2086 (2020).
Varabyou, A., Pockrandt, C., Salzberg, S. L. & Pertea, M. Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie. Genetics 218, iyab074 (2021).
Shu, Y. & McCauley, J. GISAID: global initiative on sharing all influenza data – from vision to reality. Eurosurveillance 22, 30494 (2017).
Sayers, E. W. et al. GenBank. Nucleic Acids Res. 49, D92–D96 (2021).
COVID-19 Genomics UK (COG-UK) Consortium. An integrated national scale SARS-CoV-2 genomic surveillance network. Lancet Microbe 1, e99–e100 (2020).
Turakhia, Y. et al. Stability of SARS-CoV-2 phylogenies. PLoS Genet. 16, e1009175 (2020).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Killick, R. & Eckley, I. changepoint: an R package for changepoint analysis. J. Stat. Softw. 58, 19 (2014).
Turakhia, Y. et al. Supplement to pandemic-scale phylogenomics reveals a landscape of SARS-CoV2 recombination. Zenodo https://doi.org/10.5281/zenodo.6717378 (2022).
Turakhia, Y. et al. yatisht/usher v0.5.6. Zenodo https://doi.org/10.5281/zenodo.6709991 (2022).

Acknowledgements

We gratefully acknowledge the authors and the laboratories responsible for obtaining the specimens and the submitting laboratories where the genome data were generated and shared via GISAID (Supplementary Table 2)²⁶, China National Center for Bioinformation (Supplementary Table 3), COVID-19 Genomics UK (COG-UK)²⁸ (Supplementary Table 4) and the National Center for Biotechnology Information database²⁷ (Supplementary Table 5), on which this research is based. We thank S. Mollenkamp for assisting with the code development. B.T., J.M. and R.C.-D. were funded by National Institutes of Health grant no. R35GM128932. R.C.-D. was funded by an Alfred P. Sloan Foundation fellowship and the University of California Office of the President Emergency COVID-19 Research Seed Funding Grant no. R00RG2456. B.T. and J.M. were funded by the National Institutes of Health grant no. T32HG008345. B.T. was funded by the National Institutes of Health grant no. F31HG010584. N.D.M. was funded by the European Molecular Biology Laboratory. R.L. was funded by Australian Research Council grant no. DP200103151 and a Chan-Zuckerberg Initiative grant. Additional funding for this project was provided by Eric and Wendy Schmidt by recommendation of the Schmidt Futures programme.

Author information

These authors contributed equally: Yatish Turakhia, Bryan Thornlow

Authors and Affiliations

Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
Yatish Turakhia, Bryan Thornlow, Jakob McBroome, Nicolas Ayala, David Haussler & Russell Corbett-Detig
Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
Yatish Turakhia, Bryan Thornlow, Angie Hinrichs, Jakob McBroome, Nicolas Ayala, David Haussler & Russell Corbett-Detig
Department of Electrical and Computer Engineering, University of California, San Diego, San Diego, CA, USA
Yatish Turakhia & Cheng Ye
Department of Biological Sciences, University of California, San Diego, San Diego, CA, USA
Kyle Smith
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
Nicola De Maio
Howard Hughes Medical Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
David Haussler
Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
Robert Lanfear

Contributions

R.C.-D. and Y.T. developed the approach and wrote the manuscript. R.C.-D., Y.T., B.T. and R.L. designed experiments. Y.T., B.T., A.H. and N.D.M. conducted experiments. Y.T., B.T., A.H., J.M., N.A., K.S. and C.Y. developed code. R.C.-D. and D.H. supervised the group. Y.T., B.T., A.H., J.M., N.A., C.Y., N.D.M., D.H., R.L. and R.C.-D. edited the manuscript.

Corresponding authors

Correspondence to Yatish Turakhia or Russell Corbett-Detig.

Ethics declarations

Competing interests

R.L. works as an advisor to GISAID. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Histogram of inferred and simulated recombination breakpoint positions.

A) True simulated breakpoints (red) are shown with all detected recombination interval midpoints (blue). Where blue bars exceed the height of red, it implies an excess rate of detection relative to the true rate of breakpoint positions. Likewise, where red bars exceed the height of blue, it implies a deficit. B) True simulated breakpoints (red) are shown with detected recombination interval midpoints for the 20% of the most closely related donor-acceptor pairs (blue). In both comparisons, we broke ties between equivalently improved partial phylogenetic placement parsimony scores by selecting the largest recombination intervals.

Extended Data Fig. 2 RIPPLES more easily detects breakpoints causing large changes in parsimony score.

The distribution of simulated breakpoints detected for each simulated sample is shown by A) initial parsimony score and B) minimum genetic distance from simulated sample to parent. Initial parsimony (A) depends on the initial placement of the recombinant node in the tree and refers to the genetic distance, in mutations, between the recombinant node and its direct parent in the phylogeny. Minimum genetic distance from sample to parent (B) refers to the number of mutations relevant to recombination that separate the recombinant node from either the donor or the acceptor, and does not depend on the initial phylogeny. Similarly, among the simulated samples detected by RIPPLES, the detected and undetected breakpoints are shown by C) initial parsimony score and D) minimum genetic distance to parent. Detected samples and breakpoints are shown in black and undetected samples and breakpoints are shown in red. We condition on locating the true breakpoints and observing a significant parsimony score according to our phylogenetic null model. Therefore, we exclude recombination events with minimum starting parsimony scores and genetic distances of less than 3, as these are not significant under our null model.

Extended Data Fig. 3 Examples of detected trios filtered out due to sequence quality concerns.

A) Partial alignment of consensus sequences from a filtered recombinant trio of nodes 77695, 169585, and 77690, centred on site 28225, has consensus sequences of mostly 'N' spanning several sites meant to be informative of a recombination event. This can occur when many descendant samples have missing data. Mismatches between the three consensus sequences immediately flanking this region may be the result of poor sequencing quality as well. B) Partial alignment of consensus sequences from a filtered recombinant trio of nodes 173213, 173209, and 173274, centred on site 16846, has 7 recombination-informative mutations in an 8-nucleotide window that are unlikely to be true mutation events, but rather an alignment artifact or a complex indel event. C) Partial alignment of consensus sequences from a filtered recombinant trio of nodes 293461, 293460, and 211841, centred on site 29769, has 3 mismatches in a 5-nucleotide window, immediately flanked by a large gap in the alignment and are unlikely to be true mutations.
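The clustering pattern described in panels B and C (many recombination-informative sites packed into a window of only a few nucleotides) can be illustrated with a short sketch. This is not the filtering code used in the study: the function names, the default thresholds (7 sites in an 8-nucleotide window, echoing panel B) and the toy positions are all assumptions chosen for illustration.

```python
from typing import Sequence


def densest_window_count(positions: Sequence[int], window: int) -> int:
    """Largest number of positions that fall within any span of `window` consecutive sites."""
    pts = sorted(positions)
    best = 0
    left = 0
    for right, pos in enumerate(pts):
        # Advance the left edge until pts[left]..pos spans at most `window` sites.
        while pos - pts[left] + 1 > window:
            left += 1
        best = max(best, right - left + 1)
    return best


def looks_like_artifact(positions: Sequence[int], window: int = 8, min_count: int = 7) -> bool:
    """Flag a trio whose informative sites are suspiciously clustered (hypothetical thresholds)."""
    return densest_window_count(positions, window) >= min_count


# Toy coordinates, not taken from the paper.
clustered = [100, 101, 102, 103, 104, 105, 106, 20000]
spread_out = [100, 5000, 12000]
print(looks_like_artifact(clustered))   # seven sites inside one 8-nt window
print(looks_like_artifact(spread_out))  # no two sites share a window
```

A sliding window over sorted positions keeps the check linear after sorting, which matters only if it is applied to very many candidate trios.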

Extended Data Fig. 4 Recombinant ancestors exhibit increased spatial and temporal overlap.

A) Spatial and B) temporal overlap for our recombinant trios (in blue) and the null distribution (in gray), with Mann-Whitney Ranked-Sum p-values for the statistical increase in overlap for the recombinant ancestors shown on the top.
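For readers unfamiliar with the test named in this legend, the following is a minimal sketch of a one-sided Mann-Whitney comparison, asking whether values in one group tend to exceed those in another. It is not the authors' analysis code: the function name and the toy inputs are hypothetical, and the p-value uses a plain normal approximation with no tie or continuity correction, so it is only rough for small or heavily tied samples.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_greater(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U, p) for the one-sided alternative that x tends to be larger than y."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # U counts pairs where the x value wins; ties contribute one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Upper-tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, p


# Toy overlap scores, not data from the study.
observed = [3, 4, 5]
null = [1, 2, 3]
u_stat, p_value = mann_whitney_greater(observed, null)
print(u_stat, p_value)
```

In practice a library routine with exact or tie-corrected p-values would be preferable; the hand-rolled version is shown only to make the statistic explicit.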

Extended Data Fig. 5 Ancestors of recombinants are genetically similar.

A) The initial parsimony scores for placements of putative (red) and simulated (blue) recombinant samples. B) The genetic distance between inferred (red) and simulated (blue) ancestor-donor pairs that gave rise to putative or simulated recombinants.

Extended Data Table 1 Summary of simulated breakpoint detection

Extended Data Table 2 Raw sequence read datasets used to confirm recombination informative positions in selected recombinant samples

Extended Data Table 3 False discovery rate estimation for each parsimony score improvement observed in our dataset

Extended Data Table 4 Increased rate of breakpoint interval midpoint in the 3' portion of the genome when the recombinants are subdivided by the country of origin

Supplementary information

Supplementary Information

Supplementary Text 1–18 referenced in the main text, legends for Tables 1–4 and References.

Reporting Summary

Peer Review File

Supplementary Table 1

Summary of detected recombinant nodes.

Supplementary Table 2

This file contains acknowledgements recognizing originating laboratories responsible for obtaining the specimens, as well as the submitting laboratories where the genome data were generated and shared via GISAID.

Supplementary Table 3

This file contains acknowledgements recognizing originating and submitting laboratories for data from the China National Center for Bioinformation.

Supplementary Table 4

This file contains acknowledgements recognizing originating and submitting laboratories for data from the COVID-19 Genomics UK (COG-UK) Consortium.

Supplementary Table 5

This file contains acknowledgements recognizing originating and submitting laboratories for data from the National Center for Biotechnology Information database.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Turakhia, Y., Thornlow, B., Hinrichs, A. et al. Pandemic-scale phylogenomics reveals the SARS-CoV-2 recombination landscape. Nature 609, 994–997 (2022). https://doi.org/10.1038/s41586-022-05189-9

Received: 13 December 2021
Accepted: 03 August 2022
Published: 11 August 2022
Issue Date: 29 September 2022
DOI: https://doi.org/10.1038/s41586-022-05189-9

This article is cited by

The ongoing evolution of UShER during the SARS-CoV-2 pandemic
- Angie Hinrichs
- Cheng Ye
- Russell Corbett-Detig
Nature Genetics (2024)
SARS-CoV-2 biology and host interactions
- Silvio Steiner
- Annika Kratzel
- Volker Thiel
Nature Reviews Microbiology (2024)
Data-driven recombination detection in viral genomes
- Tommaso Alfonsi
- Anna Bernasconi
- Stefano Ceri
Nature Communications (2024)
A framework for automated scalable designation of viral pathogen lineages from genomic data
- Jakob McBroome
- Adriano de Bernardi Schneider
- Russell Corbett-Detig
Nature Microbiology (2024)
Ranked Subtree Prune and Regraft
- Lena Collienne
- Chris Whidden
- Alex Gavryushkin
Bulletin of Mathematical Biology (2024)
