Critical decisions in metaproteomics: achieving high confidence protein annotations in a sea of unknowns

Timmins-Schiffman, Emma; May, Damon H; Mikan, Molly; Riffle, Michael; Frazar, Chris; Harvey, H R; Noble, William S; Nunn, Brook L

doi:10.1038/ismej.2016.132

Commentary
Open access
Published: 08 November 2016

Emma Timmins-Schiffman¹,
Damon H May¹,
Molly Mikan²,
Michael Riffle³,
Chris Frazar¹,
H R Harvey²,
William S Noble¹,⁴ (ORCID: orcid.org/0000-0001-7283-4715) &
…
Brook L Nunn¹

The ISME Journal volume 11, pages 309–314 (2017)

Environmental meta-omics is rapidly expanding as sequencing capabilities improve, computing technologies become more accessible, and associated costs are reduced. The in situ snapshots of marine microbial life afforded by these data provide a growing knowledge of the functional roles of communities in ecosystem processes. Metaproteomics allows for the characterization of the dynamic proteome of a complex microbial community. It has the potential to reveal impacts of microbial metabolism on biogeochemical transport, storage and cycling (for example, Hawley et al., 2014), while additionally clarifying which taxonomic groups perform these roles. Previous work illuminated many of the important functions and interactions within marine microbial communities (for example, Morris et al., 2010), but a review of ocean metaproteomics literature revealed little standardization in bioinformatics pipelines for detecting peptides and inferring and annotating proteins. As prevalence of these data sets grows, there is a critical need to develop standardized approaches for mass spectrometry (MS) proteomic spectrum identification and annotation to maximize the scientific value of the data obtained. Here, we demonstrate that bioinformatics decisions made throughout the peptide identification process are as important for data interpretation as choices of sampling protocol and bacterial community manipulation experimental design. Our analysis offers a best practices guide for environmental metaproteomics.

MS-based metaproteomics is now practical due to advances in duty cycle and increased mass accuracy for both precursor and fragment masses. These improvements allow for the detection of over 10⁴ tandem mass spectra from a single data-dependent acquisition MS analysis of a mixed microbial sample. These spectra must then be associated with peptides from thousands of proteins from diverse taxonomic groups. The most common approach is database searching: scoring observed tandem mass spectra against theoretical peptide spectra generated in silico from a protein or peptide database (Eng et al., 1994). However, the approach to database selection, or construction, can vary dramatically. In an ocean metaproteomics experiment, the two main approaches for creating a protein identification database are to (1) leverage vast quantities of public sequence data or (2) sequence and assemble a metagenome. Further, when exploring and assembling possible public databases, a wide range of databases and sequence selection methods are used. As the field of environmental proteomics grows, the integrity of metaproteomics data sets and our ability to directly compare them across time and space depends on the adoption of a standardized procedure for peptide identification and annotation. Here, we reveal how highly influential the protein database selection is to the biological interpretations of a metaproteomics experiment.
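The database-search idea described above can be illustrated with a minimal sketch. It generates singly charged b- and y-ion m/z values for candidate peptides and ranks the candidates by a simple shared-peak count. This is a deliberate simplification and not the scoring function used by Comet; the function names, the 0.02 m/z tolerance, and the restriction to unmodified peptides are assumptions made for illustration.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O (Da)
PROTON = 1.007276   # mass of a proton (Da)


def fragment_mz(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)           # prefix fragment b_i
        y_ions.append(sum(masses[i:]) + WATER + PROTON)   # suffix fragment y_(n-i)
    return b_ions, y_ions


def shared_peak_count(observed_mz, peptide, tol=0.02):
    """Count theoretical fragments that have an observed peak within tol (m/z units)."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in b_ions + y_ions
    )


def best_match(observed_mz, candidates, tol=0.02):
    """Return the candidate peptide with the highest shared-peak count."""
    return max(candidates, key=lambda pep: shared_peak_count(observed_mz, pep, tol))
```

Every peptide in the database that falls within the precursor mass window is a candidate for each spectrum, which is why the choice and size of the database directly shape which peptides can be reported at all.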

We applied four database selection techniques in order to perform peptide detection, protein inference, and taxonomic and functional assignments from MS-based, oceanic, microbial community metaproteomics (Figure 1). The metaproteome in question represents a diverse and relatively under-sequenced area of the ocean, the Pacific Arctic. Our results from this study offer a path forward as well as a caution for investigators that the biological conclusions drawn from metaproteomics data are highly database specific.

Our study followed traditional procedures currently employed in ocean metaproteomics (details in Supplementary Information 1). Water samples were collected and selectively filtered from the Bering Strait as described in May et al. (2016) and incubated shipboard over 10 days (T0=day 0, T10=day 10). Bacterial community proteomes from the incubations were analyzed on a Q-Exactive-HF (Thermo Fisher Scientific, Waltham, MA, USA) and resulting data were searched against four different peptide identification databases (Supplementary Information 2): (1) site/time-specific metagenome collected concurrently with the incubated water; (2) NCBI’s env_NR database; (3) Arctic-bacterial database of NCBI protein sequences from known polar taxonomic groups (Supplementary Information 3); and (4) North Pacific database derived from a subset of the Ocean Microbiome sequencing project (Sunagawa et al., 2015; Supplementary Information 4). Peptides were identified and proteins were inferred using Comet v. 2015.01 rev. 2 (Eng et al., 2012, 2015), followed by peptide and protein match scoring (Pedrioli, 2010; Deutsch et al., 2015) at a false discovery rate threshold of 0.01 (Supplementary Information 5). Proteins from all databases were annotated using BLASTp (Altschul et al., 1990; Camacho et al., 2009) against the UniProtKB TrEMBL database (downloaded April 28, 2015) with an e-value cutoff of 1E-10 (Supplementary Information 6). Shifts in community biological functions over the 10-day incubation were quantified using a Gene Ontology (GO) analysis where peptide spectrum matches were associated with GO terms. Additionally, database-driven peptide score sensitivity as a function of database size was investigated by searching the site/time-specific metagenome database with increasing numbers of decoy peptides.
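To make the false discovery rate threshold of 0.01 concrete, the following is a minimal sketch of a generic target-decoy estimate. It is not the probability model of the Trans-Proteomic Pipeline used in this study. The input format (a score paired with a decoy flag) and the function name are assumptions, and tied scores are not handled specially.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Find the lowest score cutoff whose estimated FDR stays at or below max_fdr.

    psms: iterable of (score, is_decoy) pairs, one per peptide spectrum match.
    The FDR at a cutoff is estimated as (#decoys) / (#targets) among all PSMs
    scoring at or above that cutoff. Returns None if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best_cutoff = score  # keep extending the accepted list while FDR allows
    return best_cutoff


def accepted_targets(psms, cutoff):
    """Return the target PSM scores at or above the cutoff."""
    return [score for score, is_decoy in psms if not is_decoy and score >= cutoff]
```

The cutoff is chosen from the data themselves, so the same spectra can pass or fail depending on how many decoy (and irrelevant target) sequences compete for high scores.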

The number of experimental spectra that yielded peptide spectrum matches differed markedly among databases. The highest number of confidently scored unique peptide matches and protein inferences resulted from the search against the site/time-specific metagenome database. This number of peptide matches increased 1.5-fold when the same data were searched against unassembled reads. This ‘metapeptide’ approach (May et al., 2016) avoids sequence loss and potential noise introduced by read assembly (for example, Cantarel et al., 2011). The peptides identified by the four assembled databases overlapped relatively little, suggesting that the different databases cover different parts of the acquired metaproteome (May et al., 2016). In a direct comparison of the unassembled metagenome peptides and env_NR, the metagenome contained more peptides from the metaproteome (May et al., 2016). Additionally, database size, especially in the cases of env_NR and North Pacific, had a substantial impact on search sensitivity, making statistically confident detection of peptides difficult (Supplementary Information 7; May et al., 2016). In agreement with others, we found large database searches suffer from a loss of statistical power from multiple hypothesis testing against the vast number of sequences unrepresented in the expressed metaproteome (Nesvizhskii, 2010; Jagtap et al., 2013; Tanca et al., 2013). This paradox of too many sequences resulting in too few identifications will become increasingly problematic with the availability of more sequence data. Our results point to the success obtained by searching a metaproteome-specific database that excludes non-specific sequences, while balancing the need to retain a sufficient amount of sequence variation.
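The loss of statistical power with database size can be illustrated with a toy calculation. The sketch assumes, purely for illustration, that each spectrum is compared with N candidate peptides and that scores of incorrect matches are independent and exponentially distributed with unit rate. Under that assumption the best random score grows roughly with the logarithm of N, so the score a true match must exceed rises as irrelevant sequences are added. Real search engines only score candidates within the precursor mass window and have different score distributions, so the numbers are not comparable to Comet output.

```python
import math


def null_score_cutoff(n_candidates, alpha=0.01):
    """Score cutoff at which the best of n random matches exceeds it with probability alpha.

    Assumes i.i.d. unit-rate exponential null scores, so
    P(max <= t) = (1 - exp(-t)) ** n. Solving P(max <= t) = 1 - alpha gives
    t = -ln(1 - (1 - alpha) ** (1 / n)), computed here in a numerically stable way.
    """
    log_keep = math.log1p(-alpha) / n_candidates   # ln((1 - alpha) ** (1 / n))
    return -math.log(-math.expm1(log_keep))        # -ln(1 - exp(log_keep))


if __name__ == "__main__":
    # Hypothetical candidate counts spanning several orders of magnitude.
    for n in (1, 1_000, 1_000_000, 100_000_000):
        print(n, round(null_score_cutoff(n), 2))
```

A true match with a fixed score that passes the cutoff in a small, sample-specific database can therefore fail in a much larger one, which is consistent with the sensitivity loss described above.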

Taxonomic and functional interpretations resulting from searches of the same metaproteome against different databases were divergent, suggesting that each database would yield a different biological conclusion. The four resulting community taxonomy profiles diverged even at the phylum level, and these differences were amplified at finer taxonomic levels (Figure 2). The metagenome also yielded a greater variety of taxa at ranks more specific than class compared to env_NR (May et al., 2016). In addition to taxonomic discrepancies, the functional response to the 10-day incubation differed depending on the database used, a pattern that has been noted by others (Rooijers et al., 2011; Tanca et al., 2013). In our Arctic microbiome, there was little agreement among database searches in the ten GO terms that changed the most between the beginning and end of the incubation experiment (Table 1, Supplementary Information 8). These GO terms would be considered the most significant contributors to changes in community function in this particular experiment, and would lead to substantially different interpretations depending on the database selected. These differences in functional assignments among search results can direct downstream analyses and interpretations, and are of critical importance when inferring and reporting community function. Our results and those of others (for example, Rooijers et al., 2011) stress the importance of database choice for metaproteomics functional assignments and inferred community biological processes, especially in the case of a previously uncharacterized, complex community.
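The GO comparison can be sketched as follows, using the labeling rules given for Supplementary Information 8 (log₂ fold change above 1 is ‘positive’, below -1 is ‘negative’, otherwise ‘none’, and ‘X’ when a term does not exceed 50 PSMs). Several details are assumptions rather than the authors' stated procedure: the 50-PSM filter is applied to the sum of the two time points, a pseudocount of 1 avoids division by zero, the sign is taken as T10 relative to T0, and no normalization for total PSMs per sample is applied.

```python
import math


def classify_go_changes(psm_t0, psm_t10, min_total_psm=50, pseudocount=1.0):
    """Label each GO term by the direction of its log2 fold change from T0 to T10.

    psm_t0, psm_t10: dicts mapping a GO term to its PSM count at each time point.
    Returns a dict mapping each term to 'positive', 'negative', 'none' or 'X'.
    """
    labels = {}
    for term in set(psm_t0) | set(psm_t10):
        count_t0 = psm_t0.get(term, 0)
        count_t10 = psm_t10.get(term, 0)
        if count_t0 + count_t10 <= min_total_psm:
            labels[term] = "X"  # too few PSMs to call a direction
            continue
        lfc = math.log2((count_t10 + pseudocount) / (count_t0 + pseudocount))
        if lfc > 1:
            labels[term] = "positive"
        elif lfc < -1:
            labels[term] = "negative"
        else:
            labels[term] = "none"
    return labels
```

Running the same function on PSM counts derived from each database search, and comparing the resulting labels term by term, gives a direct view of how often the databases disagree on the direction of change.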

Table 1 Ten GO terms with the biological aspect ‘biological process’ with the greatest log fold change from each database search; five that changed the most to have higher abundance at T10 (light gray) and five that have higher abundance at T0 (dark gray)

In addition to differences in peptide search results, the true complexity in annotating detected proteins was obscured by the standard approach that uses only the top BLAST hit as the defined protein annotation. The BLAST algorithm returns a list of possible hits with associated Expect values (e-values) when a sequence is searched; to better understand the downstream effects of this approach, we included up to 500 BLAST results per protein. On average, 403 protein matches per metagenome sequence were returned that passed the e-value cutoff of 1E-10. Disagreements in functional and taxonomic assignment among the BLAST hits for a single protein are very common, even when the results all have very low e-values (Supplementary Information 9). This casts doubt on the ‘top’ BLAST hit as the correct annotation for the protein of interest, even though this is common practice in ‘omics’ literature. Inaccuracy or lack of precision of protein annotation via BLAST methodology would further obscure an accurate interpretation of metaproteomics data when combined with an uninformed database choice.
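One way to look beyond the top BLAST hit is to ask which taxon all sufficiently good hits agree on. The sketch below computes the shared lineage prefix (a lowest common taxon) across hits that pass an e-value cutoff and a minimum percent identity. The hit representation (dictionaries with `evalue`, `identity` and `lineage` keys) is hypothetical and would need to be built from BLAST tabular output plus a taxonomy lookup; it is not a format produced by BLAST itself.

```python
def lowest_common_taxon(hits, max_evalue=1e-10, min_identity=0.0):
    """Return the lineage shared by all hits that pass the filters.

    hits: list of dicts with keys
        'evalue'   - BLAST Expect value of the hit
        'identity' - percent sequence identity to the query
        'lineage'  - list of taxon names ordered from superkingdom downward
    An empty list means no hits passed or the hits span multiple superkingdoms.
    """
    lineages = [
        hit["lineage"]
        for hit in hits
        if hit["evalue"] <= max_evalue and hit["identity"] >= min_identity
    ]
    if not lineages:
        return []
    shared = []
    for taxa_at_rank in zip(*lineages):   # walk down the ranks in parallel
        if len(set(taxa_at_rank)) != 1:   # first rank at which the hits disagree
            break
        shared.append(taxa_at_rank[0])
    return shared
```

Raising `min_identity` typically leaves fewer, more similar hits and so a more specific shared taxon, which is the trade-off summarized in Supplementary Information 9.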

The selection of a protein database for peptide identifications is one of the most critical bioinformatics decisions for accurate biological and ecological interpretation of in situ community functions. Although more time and money are required to complete a site/time-specific metagenome, we have demonstrated that these investments lay the groundwork for more complete metaproteome interpretation (Tanca et al., 2013; May et al., 2016). Whether or not a metagenome is assembled, data interpretation must proceed with care. Based on current and previous work, we propose a general best practices guide (Figure 3) to identifying peptides and inferring biological function and taxonomic distributions of natural microbial assemblages: (1) for previously uncharacterized communities, construct as accurate and efficient a database as possible by (a) using the metapeptide approach (May et al., 2016), (b) sequencing the metagenome and utilizing gene prediction software (for example, Hyatt et al., 2012) or (c) constructing the most accurate database possible to avoid loss of sensitivity due to large search space when metagenome sequencing is not possible; (2) when annotating proteins, go beyond the top BLAST hit to base the annotation for taxonomy and function on an agreement among BLAST hits above a specific e-value threshold (Supplementary Information 9); (3) to increase peptide identifications, leverage publicly available sequences via the more statistically robust multi-step or iterative searches (for example, Jagtap et al., 2013; Kertesz-Farkas et al., 2015). As researchers begin to explore these different search methods with a variety of metaproteomics data sets, this approach will provide the most robust search methods and most reliable taxonomic and functional inference for environmental metaproteomics.

Supplementary Information is available at ISME Journal’s website.

1: Detailed methods for metagenome sequence, metaproteomics MS, database searching and biological interpretation of data.

2: Minimum, maximum and mean protein lengths for each protein identification database used in this study.

3: Taxonomic groups used to create the Arctic-bacterial database. The first and second columns list the group name and taxonomic level for the protein sequences that were downloaded from NCBI, followed by the complete taxonomic tree. For each taxonomic group, citations are given from peer-reviewed literature that were used to infer this group’s presence near our study site. The second tab in the workbook has the full citations listed.

4: Summary of total unique protein and peptide sequences in each database. Three different scenarios are given for peptide sequence generation to construct a database: (1) 3 missed cleavages and oxidized methionine; (2) 0 missed cleavages, no oxidation; (3) 3 missed cleavages, no oxidation.

5: Comet parameter file used to run all database searches.

6: Query protein, top UniProt BLAST hit and corresponding e-value are provided for all proteins detected with high confidence. Protein lists for the different database searches can be found in the different sheets of the Excel workbook.

7: Adding large numbers of random decoy peptides to an 11-million-peptide metagenome-derived database depressed peptide detection sensitivity. Horizontal axis is the number of peptides in each search database (11 million metagenome peptides, with increasing numbers of random decoy peptides). The vertical axis is the number of metagenome peptides detected at a false discovery rate of 0.01 as determined by forward–reverse database search for five different sample files. False discovery rate was calculated from Trans Proteomic Pipeline probabilities.

8: Direction of log₂ fold change for GO terms detected at total PSM count >50 in T0 vs T10, T0’ vs T10’, T0 vs T0’, and T10 vs T10’ (’ denotes a technical replicate). A log₂ fold change >1 is ‘positive’, <-1 is ‘negative’, between -1 and 1 is ‘none’, and if a GO term was not detected at above 50 PSM in a database there is an ‘X’. Results for each database (site/time-specific metagenome, env_NR, Arctic-bacterial and North Pacific) are listed in separate columns for each comparison.

9: A heatmap representing the granularity of taxa returned from a BLAST search (e-value⩽1E-10) as a function of percent identity threshold. Each colored bin represents the number of protein hits at a given least common taxonomic unit level for up to 500 protein hits. Horizontal axis: minimum percent sequence identity between query protein and BLAST hits. Vertical axis: rank of the lowest common taxonomic unit representing all BLAST hits above the threshold. Color indicates the natural log of the number of query proteins that fall into each bin, according to the scale at right. ‘None’ indicates hits that were assigned to multiple superkingdoms.
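The peptide-generation scenarios listed under item 4 can be sketched with a simple in silico digest. The sketch assumes trypsin-like cleavage after every K or R (the enzyme is not named in this section, and the common exception before proline is ignored), and it counts methionine oxidation as a variable modification in which each M may or may not be oxidized. Both choices are assumptions for illustration, not the settings documented in the Comet parameter file.

```python
def tryptic_peptides(protein, missed_cleavages=0):
    """Return the set of peptides from cleaving after K/R with up to N missed cleavages."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":                      # assumed cleavage rule: after K or R
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):                # C-terminal piece without K/R
        pieces.append(protein[start:])

    peptides = set()
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):        # join 1 .. missed_cleavages + 1 pieces
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def search_space_size(peptides, oxidized_methionine=False):
    """Count peptide forms; with variable oxidation each M doubles a peptide's forms."""
    if not oxidized_methionine:
        return len(peptides)
    return sum(2 ** peptide.count("M") for peptide in peptides)
```

Comparing `search_space_size` across the three scenarios for the same protein set shows how quickly missed cleavages and variable modifications enlarge the effective search space, independent of the number of proteins in the database.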

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990). Basic local alignment search tool. J Mol Biol 215: 403–410.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K et al. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10: 421–430.
Cantarel BL, Erickson AR, VerBerkmoes NC, Erickson BK, Carey PA, Pan C et al. (2011). Strategies for metagenomic-guided whole-community proteomics of complex microbial environments. PLoS One 6: e27173.
Deutsch EW, Mendoza L, Shteynberg D, Slagel J, Sun Z, Moritz RL. (2015). Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteomics Clin Appl 9: 745–754.
Eng JK, Hoopmann MR, Jahan TA, Egertson JD, Noble WS, MacCoss MJ. (2015). A deeper look into Comet – implementation and features. J Am Soc Mass Spectrom 26: 1865–1874.
Eng JK, Jahan TA, Hoopmann MR. (2012). Comet: an open source tandem mass spectrometry sequence database search tool. Proteomics 13: 22–24.
Eng JK, McCormack AL, Yates JR. (1994). An approach to correlate tandem mass-spectral data of peptides with amino-acid-sequences in a protein database. J Am Soc Mass Spectrom 5: 976–989.
Hawley AK, Brewer HM, Norbeck AD, Pasa-Tolic L, Hallam SJ. (2014). Metaproteomics reveals differential modes of metabolic coupling among ubiquitous oxygen minimum zone microbes. PNAS 111: 11395–11400.
Hyatt D, LoCascio PF, Hauser LJ, Uberbacher EC. (2012). Gene and translation initiation site prediction in metagenome sequences. Bioinformatics 28: 2223–2230.
Jagtap P, Goslinga J, Kooren JA, McGowen T, Wroblewski MS, Seymour SL et al. (2013). A two-step database search method improves sensitivity in peptide sequence matches for metaproteomic and proteogenomic studies. Proteomics 13: 1352–1357.
Kertesz-Farkas A, Keich U, Noble WS. (2015). Tandem mass spectrum identification via cascaded search. J Proteome Res 14: 3027–3038.
May DH, Timmins-Schiffman E, Mikan MP, Harvey HR, Borenstein E, Nunn BL, Noble WS. (2016). Metaproteomics characterization of microbiome samples by translating shotgun metagenomic reads. J Proteome Res 15: 2697–2705.
Morris RM, Nunn BL, Frazar C, Goodlett DR, Ting YS, Rocap G. (2010). Comparative metaproteomics reveals ocean-scale shifts in microbial nutrient utilization and energy transduction. ISME J 4: 673–685.
Nesvizhskii AI. (2010). A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 73: 2092–2193.
Pedrioli PGA. (2010). Trans-Proteomic Pipeline: A pipeline for proteomic analysis. Proteome Bioinformatics 604: 213–238.
Rooijers KK, Kolmeder C, Juste C, Doré J, de Been M, Boeren S et al. (2011). An iterative workflow for mining the human intestinal metaproteome. BMC Genomics 12: 6.
Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G et al. (2015). Structure and function of the global ocean microbiome. Science 348: 1261359.
Tanca A, Palomba A, Deligios M, Cubeddu T, Fraumene C, Biosa G et al. (2013). Evaluating the impact of different sequence databases on metaproteome analysis: insights from a lab-assembled microbial mixture. PLoS One 8: e82981.

Acknowledgements

This work was supported and funded by a grant from the National Science Foundation (NSF-OCE 1233014) for ETS, BLN, DHM and MPM as well as a Training Grant from the National Institutes of Health for ETS (T32 HG00035). DHM and WSN were supported by the National Institute of General Medical Sciences of the NIH under award number P41 GM103533. Microbial community sampling was supported through the Bureau of Ocean Energy Management (BOEM, Hanna Shoal Ecosystem Study) to HRH. This work is supported in part by the University of Washington’s Proteomics Resource (UWPR95794). We thank Jimmy Eng for aiding with database searching and bioinformatics; Jody Wright for advice on DNA extraction; Marcos Perez and Marsha Wheeler for assistance with metagenome sequencing; Ohad Manor for his help with annotations; Brian Searle for writing the code to download sequences from NCBI; Jarrett Egertson and the UW Genome Sciences Information Technology team for their assistance with data analysis; and Luis Pedro Coelho for his advice and help with subsetting the Ocean Microbiome data set. BLN and ETS would like to thank TAN and IJE for their ongoing inspiration.

Author information

Authors and Affiliations

Department of Genome Sciences, University of Washington, Seattle, WA, USA
Emma Timmins-Schiffman, Damon H May, Chris Frazar, William S Noble & Brook L Nunn
Department of Ocean, Earth, and Atmospheric Sciences, Old Dominion University, Norfolk, VA, USA
Molly Mikan & H R Harvey
Department of Biochemistry, University of Washington, Seattle, WA, USA
Michael Riffle
Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
William S Noble

Corresponding authors

Correspondence to Emma Timmins-Schiffman or Brook L Nunn.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies this paper on The ISME Journal website

Supplementary information

Supplementary Information 1 (DOCX 24 kb)

Supplementary Information 2 (XLSX 27 kb)

Supplementary Information 3 (XLSX 62 kb)

Supplementary Information 4 (XLSX 32 kb)

Supplementary Information 5 (TXT 8 kb)

Supplementary Information 6 (XLSX 1272 kb)

Supplementary Information 7 (PDF 10 kb)

Supplementary Information 8 (XLSX 250 kb)

Supplementary Information 9 (PDF 55 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/

About this article

Cite this article

Timmins-Schiffman, E., May, D., Mikan, M. et al. Critical decisions in metaproteomics: achieving high confidence protein annotations in a sea of unknowns. ISME J 11, 309–314 (2017). https://doi.org/10.1038/ismej.2016.132

Issue Date: February 2017
DOI: https://doi.org/10.1038/ismej.2016.132
