Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps

Sharma, Sudip; Kumar, Sudhir

doi:10.1038/s43588-021-00129-5

Brief Communication
Published: 22 September 2021

Nature Computational Science volume 1, pages 573–577 (2021)

A preprint version of the article is available at bioRxiv.

Abstract

Felsenstein’s bootstrap approach is widely used to assess confidence in species relationships inferred from multiple sequence alignments. It resamples sites randomly with replacement to build alignment replicates of the same size as the original alignment and infers a phylogeny from each replicate dataset. The proportion of phylogenies recovering the same grouping of species is its bootstrap confidence limit. However, standard bootstrap imposes a high computational burden in applications involving long sequence alignments. Here, we introduce the bag of little bootstraps approach to phylogenetics, bootstrapping only a few little samples, each containing a small subset of sites. We report that the median-bagging of bootstrap confidence limits from little samples produces confidence in inferred species relationships similar to standard bootstrap but in a fraction of the computational time and memory. Therefore, the little bootstraps approach can potentially enhance the rigor, efficiency and parallelization of big data phylogenomic analyses.
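The two quantities named in the abstract can be illustrated with a minimal sketch. It assumes each replicate phylogeny is already summarized as a set of clades (each clade a frozenset of taxon names) and that the per-sample confidence limits are combined with a plain median; the function names and the toy numbers are hypothetical and are not taken from the authors' R code.

```python
from collections import Counter
from statistics import median


def bootstrap_confidence(replicate_trees):
    """BCL per clade: the proportion of replicate trees that contain it.

    Each tree is assumed to be a set of clades, so a clade is counted
    at most once per tree.
    """
    counts = Counter(clade for tree in replicate_trees for clade in tree)
    n = len(replicate_trees)
    return {clade: k / n for clade, k in counts.items()}


def median_bag(bcl_per_sample):
    """Median of the BCLs obtained from several little samples, per clade.

    Assumption: a clade never recovered in a little sample has BCL 0 there.
    """
    clades = set().union(*(s.keys() for s in bcl_per_sample))
    return {c: median(s.get(c, 0.0) for s in bcl_per_sample) for c in clades}


# Toy replicate trees for one little sample (hypothetical clades).
ab, cd = frozenset({"A", "B"}), frozenset({"C", "D"})
sample_1 = bootstrap_confidence([{ab, cd}, {ab}, {ab, cd}, {ab}])

# Hypothetical BCLs from two further little samples.
sample_2 = {ab: 0.95, cd: 0.70}
sample_3 = {ab: 0.90, cd: 0.40}

bagged = median_bag([sample_1, sample_2, sample_3])
```

Here `sample_1` gives a BCL of 1.0 for the clade {A, B} and 0.5 for {C, D}; the median across the three little samples is then taken clade by clade.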

**Fig. 1: The little BS approach and analyses of simulated and empirical phylogenomic datasets.**

Data availability

All simulated DNA sequence alignments containing 446 taxa were obtained from published research articles²³,²⁴. Ten empirical datasets from a variety of species have been analyzed. These DNA sequence alignments consisted of sequences from Eutherian mammals¹⁴, butterflies⁷, plants (A⁶ and B¹⁰), insects (A¹¹, B¹² and C⁵), spiders (A⁹ and B⁸) and birds¹³. All empirical and simulated datasets analyzed in this paper are available in an online repository²⁸. Source data are provided with this paper.

Code availability

R code is available from https://github.com/ssharma2712/Little-Bootstraps. A capsule containing source code and datasets for our analyses is available on the CodeOcean service²⁹. Users can replicate the little bootstraps sampling and bagging steps in this capsule.

References

1. Felsenstein, J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791 (1985).
2. Kumar, S. & Filipski, A. Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 17, 127–135 (2007).
3. Kumar, S., Filipski, A. J., Battistuzzi, F. U., Kosakovsky Pond, S. L. & Tamura, K. Statistics and truth in phylogenomics. Mol. Biol. Evol. 29, 457–472 (2012).
4. Kapli, P., Yang, Z. & Telford, M. J. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 21, 428–444 (2020).
5. Johnson, K. P. et al. Phylogenomics and the evolution of hemipteroid insects. Proc. Natl Acad. Sci. USA 115, 12775–12780 (2018).
6. Ran, J. H., Shen, T. T., Wu, H., Gong, X. & Wang, X. Q. Phylogeny and evolutionary history of Pinaceae updated by transcriptomic analysis. Mol. Phylogenet. Evol. 129, 106–116 (2018).
7. Allio, R. et al. Whole genome shotgun phylogenomics resolves the pattern and timing of swallowtail butterfly evolution. Syst. Biol. 69, 38–60 (2020).
8. Hedin, M., Derkarabetian, S., Alfaro, A., Ramírez, M. J. & Bond, J. E. Phylogenomic analysis and revised classification of atypoid mygalomorph spiders (Araneae, Mygalomorphae), with notes on arachnid ultraconserved element loci. PeerJ 7, e6864 (2019).
9. Kuntner, M. et al. Golden orbweavers ignore biological rules: phylogenomic and comparative analyses unravel a complex evolution of sexual size dimorphism. Syst. Biol. 68, 555–572 (2019).
10. Pessoa-Filho, M., Martins, A. M. & Ferreira, M. E. Molecular dating of phylogenetic divergence between Urochloa species based on complete chloroplast genomes. BMC Genomics 18, 516 (2017).
11. Peters, R. S. et al. Evolutionary history of the Hymenoptera. Curr. Biol. 27, 1013–1018 (2017).
12. Peters, R. S. et al. Transcriptome sequence-based phylogeny of chalcidoid wasps (Hymenoptera: Chalcidoidea) reveals a history of rapid radiations, convergence and evolutionary success. Mol. Phylogenet. Evol. 120, 286–296 (2018).
13. Yonezawa, T. et al. Phylogenomics and morphology of extinct paleognaths reveal the origin and evolution of the ratites. Curr. Biol. 27, 68–77 (2017).
14. Song, S., Liu, L., Edwards, S. V. & Wu, S. Resolving conflict in Eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc. Natl Acad. Sci. USA 109, 14942–14947 (2012).
15. Stamatakis, A., Hoover, P. & Rougemont, J. A rapid bootstrap algorithm for the RAxML web servers. Syst. Biol. 57, 758–771 (2008).
16. Minh, B. Q., Nguyen, M. A. T. & Von Haeseler, A. Ultrafast approximation for phylogenetic bootstrap. Mol. Biol. Evol. 30, 1188–1195 (2013).
17. Kleiner, A., Talwalkar, A., Sarkar, P. & Jordan, M. I. A scalable bootstrap for massive data. J. R. Stat. Soc. B Stat. Methodol. 76, 795–816 (2014).
18. Seo, T.-K. Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol. Biol. Evol. 25, 960–971 (2008).
19. Pattengale, N. D., Alipour, M., Bininda-Emonds, O. R. P., Moret, B. M. E. & Stamatakis, A. How many bootstrap replicates are necessary? J. Comput. Biol. 17, 337–354 (2010).
20. Leys, C., Ley, C., Klein, O., Bernard, P. & Licata, L. Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median. J. Exp. Soc. Psychol. 49, 764–766 (2013).
21. Nguyen, L. T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
22. Lemoine, F. et al. Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature 556, 452–456 (2018).
23. Rosenberg, M. S. & Kumar, S. Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference. Mol. Biol. Evol. 20, 610–621 (2003).
24. Tamura, K. et al. Estimating divergence times in large molecular phylogenies. Proc. Natl Acad. Sci. USA 109, 19333–19338 (2012).
25. R Core Team. R: a language and environment for statistical computing (R Foundation for Statistical Computing, 2020).
26. Pagès, H., Aboyoun, P., Gentleman, R. & DebRoy, S. Biostrings: efficient manipulation of biological strings. R Package Version 2.46.0 (Bioconductor, 2017); https://doi.org/10.18129/B9.bioc.Biostrings
27. Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).
28. Sharma, S. & Kumar, S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. figshare https://doi.org/10.6084/m9.figshare.14130494
29. Sharma, S. & Kumar, S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. CodeOcean https://doi.org/10.24433/CO.6432188.v1
30. Efron, B. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc. 78, 316–331 (1983).

Acknowledgements

We thank S. Vahdatshoar and J. Davis for their help with computational analysis. We thank J. Craig, Q. Tao, M. Caraballo-Ortiz, A. Chroni, C. Palacios, S. L. K. Pond and S. Blair Hedges for providing critical comments on the manuscript. This research was supported by a grant from the US National Institutes of Health to S.K. (GM139540-01).

Author information

Authors and Affiliations

Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
Sudip Sharma & Sudhir Kumar
Department of Biology, Temple University, Philadelphia, PA, USA
Sudip Sharma & Sudhir Kumar
Center of Excellence in Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia
Sudhir Kumar

Authors

Sudip Sharma
Sudhir Kumar

Contributions

S.K. initially conceived all the methods, designed many analyses, developed visualizations and wrote the manuscript. S.S. refined methods, designed and conducted analyses, refined visualizations and contributed to writing the manuscript.

Corresponding author

Correspondence to Sudhir Kumar.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Computational Science thanks Alexandros Stamatakis and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A comparison of the standard and little bootstrap approaches.

Steps of (a) the standard phylogeny bootstrap and (b) the little bootstraps (BS) approach. Shaded boxes represent sequence alignments, with width representing sequence length. In standard BS, L sites are randomly sampled with replacement from the original dataset containing L sites. In this resampling process, ~63.2% of the data points¹⁷,³⁰ are expected to be represented in a bootstrap replicate dataset. Each replicate dataset is compressed into weighted resamples that contain only distinct site configurations and a vector of their counts (represented by stacks of dots). An ML tree is inferred from each replicate dataset, and the BCL for a species group is the proportion of bootstrap replicate phylogenies in which that group appears. In little BS, L sites are randomly sampled with replacement from a little dataset consisting of only l = L^g sites to produce each bootstrap replicate dataset. Because \(l \ll L\), each site is represented many times in the little bootstrap replicate datasets; we refer to this as upsampling, which changes the frequency of unique site configurations. Because of upsampling, the stacks of dots are much higher for little BS than for standard BS, which involves only resampling. The number of distinct site configurations in the upsampled dataset is smaller than in the standard bootstrap replicate dataset, again because \(l \ll L\).
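The two sampling schemes in this caption can be sketched as weight vectors over site indices. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on individual sites rather than compressed site configurations, it assumes the little dataset is drawn from the alignment without replacement, and the function names `standard_replicate` and `little_replicate` are hypothetical.

```python
import random
from collections import Counter


def standard_replicate(L, rng):
    """Standard BS: draw L sites with replacement from all L sites.

    Returns a Counter mapping site index -> number of times it was drawn.
    """
    return Counter(rng.choices(range(L), k=L))


def little_replicate(L, g, rng):
    """Little BS: draw L sites with replacement from a little dataset.

    Assumption: the little dataset holds floor(L**g) sites picked
    without replacement from the original alignment.
    """
    l = int(L ** g)
    little_sites = rng.sample(range(L), l)
    return Counter(rng.choices(little_sites, k=L))


rng = random.Random(1)  # fixed seed so the sketch is repeatable
L, g = 100_000, 0.7

std = standard_replicate(L, rng)
lit = little_replicate(L, g, rng)

frac_std = len(std) / L               # close to 1 - 1/e, about 0.632
frac_lit = len(lit) / L               # at most L**g / L
mean_weight_std = L / len(std)        # average stack height, standard BS
mean_weight_lit = L / len(lit)        # much taller stacks after upsampling
```

Both replicates carry a total weight of L, but the little replicate spreads that weight over far fewer distinct sites, which is the upsampling effect the caption describes.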

Extended Data Fig. 2 The number of sites used in little and standard bootstrap replicates.

The proportion of sites included in the little bootstrap replicates for little datasets with l = L^0.7 (open circles) and standard bootstrap (closed circles). The choice of l = L^0.7 offers increasingly greater computational savings for longer sequences because the proportion of sites included in the little samples decreases as L grows. For example, the standard bootstrap replicates always contain approximately 63%³⁰ of the site configurations from the full datasets. In contrast, the little dataset size is ~3.1% of the original alignment for L = 100,000 bases and decreases to ~1.6% when L increases 10-fold (1,000,000 bases).
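The arithmetic behind these percentages follows from l/L = L^(g−1). A short check, with `little_fraction` as a hypothetical helper name:

```python
def little_fraction(L, g=0.7):
    """Fraction of the alignment held in a little dataset of L**g sites."""
    return L ** g / L


for L in (100_000, 1_000_000):
    print(f"L = {L:>9,}: l = {L ** 0.7:,.0f} sites, "
          f"{little_fraction(L):.2%} of the alignment")
```

The fractions come out at 3.16% and 1.58%, consistent with the approximate values quoted in the caption.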

Extended Data Fig. 3 Patterns of unique site configurations per sequence and little sample size.

The relationship between the number of unique site configurations per sequence (C/S, log-transformed) and the little sample size selected (power factor, g) (R² = 0.76).

Extended Data Fig. 4 Precision of little bootstrap confidence limits.

The relationship between little BS \(\widehat{BCL}\) values and their precision (standard errors) for the selected little BS parameters. The standard errors are inversely related to little bootstrap confidence limits (R² = 0.59).

Source Data Fig. 1

Phylogenetic trees and analysis log files for Fig. 1.

Source Data Extended Data Fig. 2

Source code (R script) that produces source data for Extended Data Fig. 2.

Source Data Extended Data Fig. 3

Statistical source data for Extended Data Fig. 3.

Source Data Extended Data Fig. 4

Phylogenetic tree files for Extended Data Fig. 4.

About this article

Cite this article

Sharma, S., Kumar, S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. Nat Comput Sci 1, 573–577 (2021). https://doi.org/10.1038/s43588-021-00129-5

Received: 10 February 2021
Accepted: 13 August 2021
Published: 22 September 2021
Issue Date: September 2021
DOI: https://doi.org/10.1038/s43588-021-00129-5

This article is cited by

Incongruence in the phylogenomics era
- Jacob L. Steenwyk
- Yuanning Li
- Antonis Rokas
Nature Reviews Genetics (2023)