Abstract
With the increasing role of computational tools in the analysis of sequenced genomes, there is an urgent need to maintain high accuracy of functional annotations. Misannotations can be easily generated and propagated through databases by functional transfer based on sequence homology. We developed and optimized an automatic policing method to detect biochemical misannotations using context genomic correlations. The method works by finding genes with unusually weak genomic correlations in their assigned network positions. We demonstrate the accuracy of the method using a cross-validated approach. In addition, we show that the method identifies a significant number of potential misannotations in Bacillus subtilis, including metabolic assignments already shown to be incorrect experimentally. The experimental analysis of the mispredicted genes forming the leucine degradation pathway in B. subtilis demonstrates that computational policing tools can generate important biological hypotheses.
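The core idea in the abstract, flagging genes whose genomic correlations with their network neighbors are unusually weak, can be illustrated with a minimal sketch. This is not the authors' implementation: the single pairwise correlation score, the mean-over-neighbors statistic, the median-based cutoff, and all names and toy values below (`assignment`, `neighbors`, `corr`, `flag_weak_genes`) are assumptions made purely for illustration.

```python
from statistics import mean, median

def neighbor_score(gene, assignment, neighbors, corr):
    """Mean correlation between `gene` and genes assigned to adjacent network positions."""
    position = assignment[gene]
    partners = [g for g, p in assignment.items()
                if g != gene and p in neighbors.get(position, ())]
    if not partners:
        return None  # no annotated neighbors, so nothing to compare against
    # Missing pairs are treated as zero correlation (an assumption of this sketch).
    return mean(corr.get(frozenset((gene, g)), 0.0) for g in partners)

def flag_weak_genes(assignment, neighbors, corr, cutoff=0.5):
    """Return genes whose neighbor score falls below `cutoff` times the median score."""
    scores = {}
    for gene in assignment:
        score = neighbor_score(gene, assignment, neighbors, corr)
        if score is not None:
            scores[gene] = score
    typical = median(scores.values())
    return sorted(g for g, s in scores.items() if s < cutoff * typical)

# Hypothetical toy data: a linear pathway r1-r2-r3-r4-r5 with one gene per reaction.
assignment = {"gA": "r1", "gB": "r2", "gC": "r3", "gD": "r4", "gE": "r5"}
neighbors = {"r1": {"r2"}, "r2": {"r1", "r3"}, "r3": {"r2", "r4"},
             "r4": {"r3", "r5"}, "r5": {"r4"}}
# Hypothetical pairwise context correlations (for example, phylogenetic profile similarity).
corr = {frozenset(("gA", "gB")): 0.8, frozenset(("gB", "gC")): 0.1,
        frozenset(("gC", "gD")): 0.1, frozenset(("gD", "gE")): 0.9}

print(flag_weak_genes(assignment, neighbors, corr))  # ['gC']
```

In this toy pathway only `gC` correlates weakly with both of its neighbors, so it is the only gene flagged as a candidate misannotation. A real policing method would presumably combine several types of context evidence and calibrate its threshold by cross-validation, as the abstract indicates.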
Acknowledgements
We thank I. Feldman, A. Rzhetsky, M. de Hoon, S. Gilman and C. Weinreb for comments on the manuscript and valuable discussions. This work was supported in part by US National Institutes of Health grant GM079759 to D.V. and National Centers for Biomedical Computing (MAGNet) grant U54CA121852 to Columbia University.
Author information
Contributions
T.-L.H., L.C. and D.V. performed computational research and data analysis. D.V. conceived and directed computational research. O.R. performed experimental research and analysis. U.S. conceived and directed experimental research. L.C., T.-L.H. and D.V. cowrote the paper. All authors read and edited the manuscript.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–4, Supplementary Tables 1 and 2 and Supplementary Methods (PDF 431 kb)
About this article
Cite this article
Hsiao, TL., Revelles, O., Chen, L. et al. Automatic policing of biochemical annotations using genomic correlations. Nat Chem Biol 6, 34–40 (2010). https://doi.org/10.1038/nchembio.266
DOI: https://doi.org/10.1038/nchembio.266