  • Brief Communication
Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps

A preprint version of the article is available at bioRxiv.

Abstract

Felsenstein’s bootstrap approach is widely used to assess confidence in species relationships inferred from multiple sequence alignments. It resamples sites randomly with replacement to build alignment replicates of the same size as the original alignment and infers a phylogeny from each replicate dataset. The proportion of phylogenies recovering the same grouping of species is its bootstrap confidence limit. However, standard bootstrap imposes a high computational burden in applications involving long sequence alignments. Here, we introduce the bag of little bootstraps approach to phylogenetics, bootstrapping only a few little samples, each containing a small subset of sites. We report that the median-bagging of bootstrap confidence limits from little samples produces confidence in inferred species relationships similar to standard bootstrap but in a fraction of the computational time and memory. Therefore, the little bootstraps approach can potentially enhance the rigor, efficiency and parallelization of big data phylogenomic analyses.
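The median-bagging step described above can be sketched as follows. This is a minimal illustration, not the authors' R implementation: it assumes each replicate tree has already been reduced to a set of clades (each clade a frozenset of taxon names), and the helper names `bcl_from_replicates` and `median_bagged_bcl` are hypothetical.

```python
from statistics import median

def bcl_from_replicates(replicate_clades, clade):
    """Proportion of replicate trees (given as sets of clades) that contain `clade`."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return hits / len(replicate_clades)

def median_bagged_bcl(little_samples, clade):
    """Median of the per-little-sample bootstrap confidence limits for `clade`."""
    return median(bcl_from_replicates(reps, clade) for reps in little_samples)

# Toy input (made up for illustration): three little samples, four replicate trees each.
AB = frozenset({"A", "B"})
CD = frozenset({"C", "D"})
little_samples = [
    [{AB, CD}, {AB}, {AB, CD}, {CD}],
    [{AB}, {AB}, {CD}, {CD}],
    [{AB, CD}, {AB, CD}, {AB}, {AB}],
]
support_ab = median_bagged_bcl(little_samples, AB)
print(support_ab)
```

Taking the median rather than the mean keeps a single atypical little sample from dominating the combined support value.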

Fig. 1: The little BS approach and analyses of simulated and empirical phylogenomic datasets.

Data availability

All simulated DNA sequence alignments containing 446 taxa were obtained from published research articles (refs. 23,24). Ten empirical datasets from a variety of species were analyzed. These DNA sequence alignments consisted of sequences from Eutherian mammals (ref. 14), butterflies (ref. 7), plants (A, ref. 6; B, ref. 10), insects (A, ref. 11; B, ref. 12; C, ref. 5), spiders (A, ref. 9; B, ref. 8) and birds (ref. 13). All empirical and simulated datasets analyzed in this paper are available in an online repository (ref. 28). Source data are provided with this paper.

Code availability

R code is available from https://github.com/ssharma2712/Little-Bootstraps. A capsule containing source code and datasets for our analyses is available on the CodeOcean service (ref. 29). Users can replicate the little bootstraps sampling and bagging steps in this capsule.

References

  1. Felsenstein, J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791 (1985).

  2. Kumar, S. & Filipski, A. Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 17, 127–135 (2007).

  3. Kumar, S., Filipski, A. J., Battistuzzi, F. U., Kosakovsky Pond, S. L. & Tamura, K. Statistics and truth in phylogenomics. Mol. Biol. Evol. 29, 457–472 (2012).

  4. Kapli, P., Yang, Z. & Telford, M. J. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 21, 428–444 (2020).

  5. Johnson, K. P. et al. Phylogenomics and the evolution of hemipteroid insects. Proc. Natl Acad. Sci. USA 115, 12775–12780 (2018).

  6. Ran, J. H., Shen, T. T., Wu, H., Gong, X. & Wang, X. Q. Phylogeny and evolutionary history of Pinaceae updated by transcriptomic analysis. Mol. Phylogenet. Evol. 129, 106–116 (2018).

  7. Allio, R. et al. Whole genome shotgun phylogenomics resolves the pattern and timing of swallowtail butterfly evolution. Syst. Biol. 69, 38–60 (2020).

  8. Hedin, M., Derkarabetian, S., Alfaro, A., Ramírez, M. J. & Bond, J. E. Phylogenomic analysis and revised classification of atypoid mygalomorph spiders (Araneae, Mygalomorphae), with notes on arachnid ultraconserved element loci. PeerJ 7, e6864 (2019).

  9. Kuntner, M. et al. Golden orbweavers ignore biological rules: phylogenomic and comparative analyses unravel a complex evolution of sexual size dimorphism. Syst. Biol. 68, 555–572 (2019).

  10. Pessoa-Filho, M., Martins, A. M. & Ferreira, M. E. Molecular dating of phylogenetic divergence between Urochloa species based on complete chloroplast genomes. BMC Genomics 18, 516 (2017).

  11. Peters, R. S. et al. Evolutionary history of the Hymenoptera. Curr. Biol. 27, 1013–1018 (2017).

  12. Peters, R. S. et al. Transcriptome sequence-based phylogeny of chalcidoid wasps (Hymenoptera: Chalcidoidea) reveals a history of rapid radiations, convergence and evolutionary success. Mol. Phylogenet. Evol. 120, 286–296 (2018).

  13. Yonezawa, T. et al. Phylogenomics and morphology of extinct paleognaths reveal the origin and evolution of the ratites. Curr. Biol. 27, 68–77 (2017).

  14. Song, S., Liu, L., Edwards, S. V. & Wu, S. Resolving conflict in Eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc. Natl Acad. Sci. USA 109, 14942–14947 (2012).

  15. Stamatakis, A., Hoover, P. & Rougemont, J. A rapid bootstrap algorithm for the RAxML web servers. Syst. Biol. 57, 758–771 (2008).

  16. Minh, B. Q., Nguyen, M. A. T. & Von Haeseler, A. Ultrafast approximation for phylogenetic bootstrap. Mol. Biol. Evol. 30, 1188–1195 (2013).

  17. Kleiner, A., Talwalkar, A., Sarkar, P. & Jordan, M. I. A scalable bootstrap for massive data. J. R. Stat. Soc. B Stat. Methodol. 76, 795–816 (2014).

  18. Seo, T.-K. Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol. Biol. Evol. 25, 960–971 (2008).

  19. Pattengale, N. D., Alipour, M., Bininda-Emonds, O. R. P., Moret, B. M. E. & Stamatakis, A. How many bootstrap replicates are necessary? J. Comput. Biol. 17, 337–354 (2010).

  20. Leys, C., Ley, C., Klein, O., Bernard, P. & Licata, L. Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median. J. Exp. Soc. Psychol. 49, 764–766 (2013).

  21. Nguyen, L. T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).

  22. Lemoine, F. et al. Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature 556, 452–456 (2018).

  23. Rosenberg, M. S. & Kumar, S. Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference. Mol. Biol. Evol. 20, 610–621 (2003).

  24. Tamura, K. et al. Estimating divergence times in large molecular phylogenies. Proc. Natl Acad. Sci. USA 109, 19333–19338 (2012).

  25. R Core Team. R: a language and environment for statistical computing (R Foundation for Statistical Computing, 2020).

  26. Pagès, H., Aboyoun, P., Gentleman, R. & DebRoy, S. Biostrings: efficient manipulation of biological strings. R Package Version 2.46.0 (Bioconductor, 2017); https://doi.org/10.18129/B9.bioc.Biostrings

  27. Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).

  28. Sharma, S. & Kumar, S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. figshare https://doi.org/10.6084/m9.figshare.14130494

  29. Sharma, S. & Kumar, S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. CodeOcean https://doi.org/10.24433/CO.6432188.v1

  30. Efron, B. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc. 78, 316–331 (1983).

Acknowledgements

We thank S. Vahdatshoar and J. Davis for their help with computational analysis. We thank J. Craig, Q. Tao, M. Caraballo-Ortiz, A. Chroni, C. Palacios, S. L. K. Pond and S. Blair Hedges for providing critical comments on the manuscript. This research was supported by a grant from the US National Institutes of Health to S.K. (GM139540-01).

Author information

Contributions

S.K. initially conceived all the methods, designed many analyses, developed visualizations and wrote the manuscript. S.S. refined methods, designed and conducted analyses, refined visualizations and contributed to writing the manuscript.

Corresponding author

Correspondence to Sudhir Kumar.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Computational Science thanks Alexandros Stamatakis and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A comparison of the standard and little bootstrap approaches.

Steps of (a) the standard phylogeny bootstrap and (b) the little bootstraps (BS) approach. Shaded boxes represent sequence alignments, with width representing sequence length. In standard BS, L sites are randomly sampled with replacement from the original dataset containing L sites. In this resampling process, ~63.2% of the data points (refs. 17,30) are expected to be represented in a bootstrap replicate dataset. Each replicate dataset is compressed into weighted resamples that contain only distinct site configurations and a vector of their counts (represented by stacks of dots). An ML tree is inferred from each replicate dataset, and the BCL for a species group is the proportion of bootstrap replicate phylogenies in which that group appears. In little BS, L sites are randomly sampled with replacement from a little dataset consisting of only \(l = L^{g}\) sites to produce each bootstrap replicate dataset. Because \(l \ll L\), each site is represented many times in the little bootstrap replicate datasets; we refer to this as upsampling, which changes the frequency of unique site configurations. Because of upsampling, the stacks of dots are much higher for little BS than for standard BS, which involves only resampling. The number of distinct site configurations in the upsampled dataset is smaller than in the standard bootstrap replicate dataset because \(l \ll L\).
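The resampling and upsampling steps in this caption can be sketched in a few lines. This is only an illustration under stated assumptions: the little dataset is drawn here without replacement and its size is taken as the floor of \(L^{g}\), neither of which the caption specifies; distinct site indices stand in for distinct site configurations; and the function names `little_dataset` and `upsampled_replicate` are hypothetical.

```python
import random
from collections import Counter

def little_dataset(num_sites, g, rng):
    """Pick l = floor(L**g) distinct site indices (assumption: sampled without replacement)."""
    l = max(1, int(num_sites ** g))
    return rng.sample(range(num_sites), l)

def upsampled_replicate(little_sites, num_sites, rng):
    """Draw L sites with replacement from the little dataset; return site index -> count."""
    return Counter(rng.choices(little_sites, k=num_sites))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
L = 10_000               # alignment length (illustrative value)
sites = little_dataset(L, 0.7, rng)
weights = upsampled_replicate(sites, L, rng)

# A standard bootstrap replicate for comparison: L draws from all L sites.
standard = Counter(rng.choices(range(L), k=L))

print(len(sites), len(weights), len(standard))
```

Storing each replicate as a count per site mirrors the weighted-resample compression described in the caption: the little BS replicate has far fewer distinct entries, each with a much larger count.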

Extended Data Fig. 2 The number of sites used in little and standard bootstrap replicates.

The proportion of sites included in the little bootstrap replicates for little datasets with \(l = L^{0.7}\) (open circles) and standard bootstrap (closed circles). The choice of \(l = L^{0.7}\) offers increasingly greater computational savings for longer sequences because the proportion of sites included in the little samples decreases as L grows. For example, the standard bootstrap replicates always contain approximately 63% (ref. 30) of the site configurations from the full datasets. In contrast, the little dataset size is ~3.1% of the original alignment for L = 100,000 bases and decreases to ~1.6% when L increases 10-fold (1,000,000 bases).
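The proportions quoted in this caption follow from two short formulas, sketched below. The sketch assumes the little dataset fraction is \(L^{g}/L = L^{g-1}\) with g = 0.7, and uses the standard result that L draws with replacement from L items are expected to cover a fraction \(1 - (1 - 1/L)^{L} \approx 1 - e^{-1}\) of them; the function names are hypothetical.

```python
import math

def standard_distinct_fraction(num_sites):
    """Expected fraction of original sites present in a standard bootstrap replicate."""
    return 1.0 - (1.0 - 1.0 / num_sites) ** num_sites

def little_fraction(num_sites, g=0.7):
    """Fraction of the alignment held in a little dataset of size L**g."""
    return num_sites ** (g - 1.0)

for L in (100_000, 1_000_000):
    print(f"L={L}: standard {standard_distinct_fraction(L):.3f}, "
          f"little {little_fraction(L):.4f}")
```

For L = 100,000 and L = 1,000,000 the little fractions evaluate to roughly 3.16% and 1.58%, in line with the ~3.1% and ~1.6% quoted in the caption, while the standard bootstrap fraction stays near 63.2% regardless of L.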

Extended Data Fig. 3 Patterns of unique site configurations per sequence and little sample size.

The relationship between the number of unique site configurations per sequence (C/S, log-transformed) and the little sample size selected (power factor, g) (\(R^{2} = 0.76\)).

Extended Data Fig. 4 Precision of little bootstrap confidence limits.

The relationship between little BS \(\widehat{BCL}\)s and their precision (standard errors) for the selected little BS parameters. The standard errors are inversely related to little bootstrap confidence limits (\(R^{2} = 0.59\)).

Source data

Source Data Fig. 1

Phylogenetic trees and analysis log files for Fig. 1.

Source Data Extended Data Fig. 2

Source code (R script) that produces source data for Extended Data Fig. 2.

Source Data Extended Data Fig. 3

Statistical source data for Extended Data Fig. 3.

Source Data Extended Data Fig. 4

Phylogenetic tree files for Extended Data Fig. 4.

About this article

Cite this article

Sharma, S., Kumar, S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. Nat Comput Sci 1, 573–577 (2021). https://doi.org/10.1038/s43588-021-00129-5

  • DOI: https://doi.org/10.1038/s43588-021-00129-5
