  • Brief Communication
Flash entropy search to query all mass spectral libraries in real time

Abstract

Public repositories of metabolomics mass spectra encompass more than 1 billion entries. With open search, dot product or entropy similarity, comparing a single tandem mass spectrometry (MS/MS) spectrum against all of these entries takes more than 8 h. Flash entropy search speeds up calculations more than 10,000 times to query 1 billion spectra in less than 2 s, without loss in accuracy. It gains further speed from multiple threads and GPU calculations. This algorithm can fully exploit large spectral libraries with little memory overhead for any mass spectrometry laboratory.
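The abstract does not spell out where the speed-up comes from. The sketch below illustrates one way such a search can avoid comparing the query with every library spectrum one by one: all library fragment ions are stored in a single array sorted by m/z, so each query ion finds its matches by binary search. This is an illustration under stated assumptions, not the authors' implementation. The function names (`build_index`, `indexed_search`), the 0.02 Da tolerance and the toy spectra are hypothetical, intensities are assumed to be positive, and peaks within one spectrum are assumed to be farther apart than twice the tolerance.

```python
import numpy as np


def build_index(library):
    """Flatten a library (list of spectra, each a list of (mz, intensity))
    into m/z-sorted arrays. Intensities are scaled to sum to 0.5 per spectrum."""
    mz, inten, owner = [], [], []
    for spec_id, peaks in enumerate(library):
        total = sum(i for _, i in peaks)
        for m, i in peaks:
            mz.append(m)
            inten.append(0.5 * i / total)
            owner.append(spec_id)
    order = np.argsort(mz, kind="stable")
    return (np.asarray(mz, dtype=float)[order],
            np.asarray(inten, dtype=float)[order],
            np.asarray(owner, dtype=int)[order],
            len(library))


def indexed_search(query, index, tol=0.02):
    """Return one entropy-similarity score per library spectrum.
    tol is a hypothetical fragment m/z tolerance in Da."""
    mz, inten, owner, n_spectra = index
    scores = np.zeros(n_spectra)
    total = sum(i for _, i in query)
    for m, i in query:
        a = 0.5 * i / total
        # Binary search for the slice of library ions within the tolerance.
        lo = np.searchsorted(mz, m - tol, side="left")
        hi = np.searchsorted(mz, m + tol, side="right")
        b = inten[lo:hi]
        # Contribution of each matched ion pair to the similarity score.
        contrib = (a + b) * np.log2(a + b) - a * np.log2(a) - b * np.log2(b)
        # Accumulate per library spectrum (unbuffered, so repeats are summed).
        np.add.at(scores, owner[lo:hi], contrib)
    return scores


toy_library = [
    [(100.0, 1.0), (200.0, 1.0)],  # identical to the query
    [(100.0, 1.0), (300.0, 1.0)],  # shares one ion with the query
    [(400.0, 1.0)],                # shares nothing
]
toy_scores = indexed_search([(100.0, 1.0), (200.0, 1.0)], build_index(toy_library))
```

With this layout the work per query depends on the number of matching ions rather than on the number of library spectra, which is the property that makes very large libraries tractable.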

Fig. 1: Overview of Flash entropy searches.
Fig. 2: Benchmarking Flash entropy searches for speed and accuracy.

Data availability

All spectra from MassBank.us (https://massbank.us/) and GNPS (https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.mgf) were downloaded on 3 March 2023. Additional MS/MS spectra from public repositories were downloaded from MassIVE/GNPS (https://gnps.ucsd.edu/ProteoSAFe/datasets.jsp#%7B%22query%22%3A%7B%7D%2C%22title_input%22%3A%22GNPS%22%7D), MetabolomicsWorkbench.org (https://www.metabolomicsworkbench.org/) and MetaboLights (https://www.ebi.ac.uk/metabolights/) in December 2022. In total, more than 939 million spectra were available (237,185,147 negative ESI and 701,996,947 positive ESI MS/MS spectra), and all of them were used in this study. Source data are provided with this paper.

Code availability

The original source code and benchmark data for the Flash entropy search are available under the Apache License 2.0 on GitHub (https://github.com/YuanyueLi/FlashEntropySearch) and Zenodo (https://doi.org/10.5281/zenodo.7972082), as well as on CodeOcean (https://doi.org/10.24433/CO.8809500.v1). The GUI can be downloaded from the GitHub repository: https://github.com/YuanyueLi/EntropySearch. Flash entropy search is also integrated into the ‘MSEntropy’ package, available for download from https://github.com/YuanyueLi/MSEntropy. Comprehensive documentation for the ‘MSEntropy’ package can be found at https://msentropy.readthedocs.io.

Acknowledgements

This study was funded by National Institutes of Health grants U2C ES030158 and R03 OD034497 (to O.F.).

Author information

Contributions

Y.L. and O.F. conceptualized the study. Y.L. designed the algorithm and performed the benchmarking. O.F. supervised the project. Y.L. and O.F. wrote the manuscript.

Corresponding author

Correspondence to Oliver Fiehn.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Xusheng Wang and Jianguo Xia for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Examples for calculating Flash entropy similarity.

(a) Example in which all ions match between the query spectrum (top) and the library spectrum (bottom). (b) Example in which only one pair of ions matches between the query and library spectra. Note that the ion intensities in each spectrum are normalized to sum to 0.5 (see Supplementary Note 1 for equations). Hence, mismatched ions do not contribute directly to the similarity score but are still accounted for during normalization.
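The two cases in this caption can be reproduced with a short sketch. It assumes the score is the unweighted entropy similarity rewritten for spectra whose intensities sum to 0.5, so that only matched ion pairs add to the sum; the exact equations are in Supplementary Note 1, which is not reproduced here. The function name, the 0.02 Da tolerance and the toy spectra are hypothetical, and any intensity weighting or spectrum cleaning is left out.

```python
import math


def _xlog2x(x):
    """x * log2(x), with the usual convention that 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def half_normalized_similarity(query, library, tol=0.02):
    """Entropy similarity of two spectra given as lists of (mz, intensity).
    tol is a hypothetical m/z matching tolerance in Da."""
    def scale(peaks):
        total = sum(i for _, i in peaks)
        return sorted((mz, 0.5 * i / total) for mz, i in peaks)

    q, lib = scale(query), scale(library)
    score, i, j = 0.0, 0, 0
    # Two-pointer walk over both m/z-sorted peak lists.
    while i < len(q) and j < len(lib):
        diff = q[i][0] - lib[j][0]
        if diff < -tol:
            i += 1            # query ion has no partner: adds nothing
        elif diff > tol:
            j += 1            # library ion has no partner: adds nothing
        else:
            a, b = q[i][1], lib[j][1]
            score += _xlog2x(a + b) - _xlog2x(a) - _xlog2x(b)
            i += 1
            j += 1
    return score


# Case (a): every ion matches.
all_match = half_normalized_similarity([(100.0, 1.0), (200.0, 1.0)],
                                       [(100.0, 1.0), (200.0, 1.0)])
# Case (b): only the ion at m/z 100 matches.
one_match = half_normalized_similarity([(100.0, 1.0), (200.0, 1.0)],
                                       [(100.0, 1.0), (300.0, 1.0)])
```

In case (b) the unmatched ions at m/z 200 and 300 never enter the sum, yet they halve the normalized intensity of the matched ion, which lowers the score from 1.0 to 0.5.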

Extended Data Fig. 2 Distributions of spectral entropies when sampling spectra from different MS/MS repositories for benchmarking studies.

(a) MassBank.us, (b) GNPS for annotated compounds (library), (c) all combined experimental public MS/MS repositories including MassIVE/GNPS, MetaboLights, MetabolomicsWorkbench and West Coast Metabolomics Center.

Extended Data Fig. 3 Computation time required to perform ‘open search’ queries using entropy similarity for 100 positive ESI and 100 negative ESI mass spectra against spectral libraries of different sizes.

MS/MS spectra were sampled from (a) GNPS and (b) public repositories. Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 200 independent MS/MS spectra randomly sampled from (a) GNPS and (b) public repositories.
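The box plot convention described in this and the following captions can be written out as a small sketch. It assumes quartiles computed by linear interpolation (the NumPy default), which may differ slightly from the plotting software used for the figures; the function name and the sample timings are hypothetical.

```python
import numpy as np


def box_plot_stats(values):
    """Median, quartiles and whisker ends using the 1.5 x IQR rule."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers stop at the most extreme data points inside the fences.
    inside = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]
    return {
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
    }


# Hypothetical query times in seconds; 100.0 lies beyond the upper fence.
stats = box_plot_stats([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
```

Points outside the whiskers, such as the 100.0 value here, would be drawn as individual outliers rather than stretching the whisker.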

Source data

Extended Data Fig. 4 Computation time required to perform ‘open search’ queries using dot product similarity for 100 positive ESI and 100 negative ESI mass spectra against spectral libraries of different sizes.

MS/MS spectra were sampled from (a) MassBank.us, (b) GNPS, (c) public repositories. Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 200 independent MS/MS spectra randomly sampled from (a) MassBank.us, (b) GNPS, (c) public repositories.

Source data

Extended Data Fig. 5 Computation time required to perform ‘neutral loss’ searches with entropy similarity for 100 positive ESI and 100 negative ESI mass spectra against spectral libraries of different sizes.

MS/MS spectra were sampled from (a) MassBank.us, (b) GNPS, (c) public repositories. Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 200 independent MS/MS spectra randomly sampled from (a) MassBank.us, (b) GNPS, (c) public repositories.

Source data

Extended Data Fig. 6 Computation time required to perform ‘hybrid searches’ with entropy similarity for 100 positive ESI and 100 negative ESI mass spectra against spectral libraries of different sizes.

MS/MS spectra were sampled from (a) MassBank.us, (b) GNPS, (c) public repositories. Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 200 independent MS/MS spectra randomly sampled from (a) MassBank.us, (b) GNPS, (c) public repositories.

Source data

Extended Data Fig. 7 Calculation time to open search 100 positive ESI and 100 negative ESI MS/MS spectra at different spectral entropy levels against randomly picked samples from the MassBank.us library.

Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 100 independent MS/MS spectra randomly sampled from MassBank.us.

Extended Data Fig. 8 Comparison of the accuracy of similarity query results between Flash entropy search and BLINK.

Each dot shows the maximum similarity difference between the fast algorithms and their classic algorithm counterparts. 100 positive ESI and 100 negative ESI MS/MS spectra were sampled from (a) MassBank.us, (b) GNPS, (c) public repositories.

Supplementary information

Supplementary Information

Supplementary Notes 1 and 2 and Tables 1 and 2.

Reporting Summary

Peer Review File

Source data

Source Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 4

Statistical source data.

Source Data Extended Data Fig. 5

Statistical source data.

Source Data Extended Data Fig. 6

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Li, Y., Fiehn, O. Flash entropy search to query all mass spectral libraries in real time. Nat Methods 20, 1475–1478 (2023). https://doi.org/10.1038/s41592-023-02012-9

  • DOI: https://doi.org/10.1038/s41592-023-02012-9
