Data mining usually depends on, and is limited by, a biologist's experience and knowledge, and it demands sustained concentration and time. We have developed an original hierarchical protein family classification, based primarily on protein sequence motifs, functional domains and cellular localization. The classification schema was generated by classifying the entries in InterPro 1.2, an integrated resource of protein families, domains and sites [1]. With this new classification, one can effectively and rapidly mine data on proteins involved in oncogenesis. The classification is especially tailored toward the biopharmaceutical industry and drug discovery efforts, and it has been used extensively in Hyseq's internal data-mining processes. The hierarchy comprises 9 main classes and 56 subclasses, covering 3,052 InterPro entries. These entries represent 574 domains, 2,418 families, 46 repeats and 14 post-translational modification sites, derived by clustering PRINTS, PROSITE, ProDom, SWISS-PROT, TrEMBL and Pfam data. Pfam models are built from multiple protein sequence alignments and are expressed mathematically as hidden Markov models. To demonstrate the utility of our classification schema, we searched the SWISS-PROT and TrEMBL databases with Pfam models. Of the 31% of protein sequences that had significant Pfam hits, 4,520 sequences fell in the cancer subclass. At present this subclass contains 59 InterPro entries, including breast cancer susceptibility proteins, retinoblastoma protein domains, p53 tumor antigen, xeroderma pigmentosum proteins and the Burkitt's lymphoma receptor.
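
As an illustration only, the mining step described above (assigning sequences with significant hits to a subclass of the hierarchy) might be sketched as follows. This is a minimal sketch under stated assumptions, not the schema or pipeline used at Hyseq: the class name "disease-related", the subclass "other", the entry identifiers and the sequence identifiers are all hypothetical placeholders. Only the four entry-type counts are taken from the text.

```python
from collections import defaultdict

# Entry-type counts quoted in the abstract; they should sum to the 3,052 total.
ENTRY_COUNTS = {"domain": 574, "family": 2418, "repeat": 46, "ptm_site": 14}

# Hypothetical fragment of the hierarchy: class -> subclass -> entry IDs.
# Class names, subclass names and entry IDs are placeholders, not the real schema.
HIERARCHY = {
    "disease-related": {
        "cancer": ["ENTRY_A", "ENTRY_B"],
        "other": ["ENTRY_C"],
    },
    "signaling": {
        "kinase": ["ENTRY_D"],
    },
}


def build_lookup(hierarchy):
    """Flatten the hierarchy into a map: entry ID -> (class, subclass)."""
    lookup = {}
    for cls, subclasses in hierarchy.items():
        for subclass, entries in subclasses.items():
            for entry_id in entries:
                lookup[entry_id] = (cls, subclass)
    return lookup


def sequences_by_subclass(hits, lookup):
    """Group sequence IDs by the (class, subclass) of the entries they hit.

    `hits` is an iterable of (sequence_id, entry_id) pairs that are assumed
    to have already passed a significance threshold. Hits to entries that
    are not in the hierarchy are ignored.
    """
    grouped = defaultdict(set)
    for seq_id, entry_id in hits:
        if entry_id in lookup:
            grouped[lookup[entry_id]].add(seq_id)
    return grouped


# Hypothetical search results: (sequence ID, entry ID) pairs.
hits = [
    ("seq1", "ENTRY_A"),
    ("seq2", "ENTRY_B"),
    ("seq2", "ENTRY_A"),  # a sequence is counted once per subclass
    ("seq3", "ENTRY_D"),
    ("seq4", "ENTRY_X"),  # entry not classified, so the hit is skipped
]

lookup = build_lookup(HIERARCHY)
grouped = sequences_by_subclass(hits, lookup)
cancer_sequences = sorted(grouped[("disease-related", "cancer")])

print(sum(ENTRY_COUNTS.values()))  # 3052
print(cancer_sequences)            # ['seq1', 'seq2']
```

Storing sequence IDs in sets means a sequence that matches several entries of the same subclass is counted once, which is the sense in which a figure such as "4,520 sequences in the cancer subclass" would be computed.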