Introduction

The set of known proteins that directly regulate apoptosis has grown rapidly over the last 15 years. This growth will continue until all the proteins directly involved in the cell death signaling process are known. This assortment of proteins with wide ranging biochemical functions is linked together conceptually in the minds of apoptosis researchers. Assembling an up-to-date view of this conceptual collection of proteins within the context of apoptosis requires a considerable effort, or more specifically, complete immersion into the field. One principal goal of an apoptosis review article is to assemble such a collection of protein annotations as an educational and research resource. The apoptosis database described here is designed to fulfil the same goal, but to immediately allow the user to dig deeper using local and remote information and to always remain current with respect to the proteins known to be involved in apoptosis.

The foundation of the database is a set of proteins and their distinctive structural domains that are often, if not exclusively, involved in the apoptosis signaling pathway. We refer to these domains as apoptotic domains. The goal is to use these proteins and their associated domains as a framework for functional annotation that is generated automatically or added by apoptosis researchers through the database curator. Examples of annotations present at the various levels of organization are: (a) broad functional information on structural domains conserved across protein families; (b) groups of homologous proteins that contain a recognized domain yet may or may not have an apoptotic function; (c) proteins that have the same set of domains but which differ markedly in their roles in apoptosis.

Additional information such as protein–protein interactions, domain folds from structural classification of proteins (SCOP), protein modifications, genomic information, and literature references are not presently within the database, but are linked via external resources. Some of these features will be incorporated into the database at later stages of development, so they can be queried directly.

The apoptosis database provides a depth of information and a perspective that is not presently available in more general molecular biology resources, which are broad but shallow, containing limited information on a large number of proteins. Examples of general purpose curated databases are the National Center for Biotechnology Information's RefSeq,1 the Weizmann Institute's GeneCards,2 the Swiss Institute of Bioinformatics' SwissProt,3 and Washington University's Pfam.4 Consider a specific example of the limitation to be found in a general purpose database. NF-κB had a well-established role before its role in apoptosis was discovered. The GeneCards and SwissProt databases indicate the initial role of NF-κB in inflammation response, but neglect to mention its role in regulating expression of apoptosis genes. The apoptosis database highlights the apoptotic role, which would otherwise go unobserved.

The same broad but shallow argument can be made at the level of protein domains. The Pfam database is a powerful and popular tool for finding domains in proteins based on HMM profiles. However, useful functional categorization of proteins requires that the domains be assembled. Other online resources like simple modular architecture research tool (SMART)5 and domain architecture retrieval tool (DART)6 rely on domain architecture (the presence and order of domains) to differentiate between homologs containing a domain. We have found it necessary to go beyond domain architecture to provide the most useful annotation. The annotation is based upon (see Materials and Methods and Table 1): (a) the structure of individual domains as found in SCOP; (b) sequence profile-based homology based on family pairwise search; (c) homology to identified orthologs; (d) specific proteins.

Table 1 Example of the levels of functional annotation within the apoptosis database

Ideally, we would manually annotate each protein. However, considering the number of proteins involved in apoptosis and the constantly expanding and changing functional annotation, this approach, by itself, is impractical. At the other end of the spectrum are fully automated resources that rely on the detection of remote homology, but which suffer from high error rates and propagation errors and fail to have a good measure of selectivity versus sensitivity.

Here, we use an intermediate approach for the apoptotic pathway: expert knowledge is used to define model proteins within the pathway, and the functional annotation is then extended by inference to other proteins using bioinformatics approaches. This approach works because of the evolutionary constraints on apoptotic domains. In practice, often the human and mouse orthologs with published functional annotation will be manually selected to represent one group via their apoptotic domains. If any sequence (from a vertebrate) has a domain most similar to that in this group, it will be associated with the family's functional annotation. Our analysis thus creates families consisting of orthologs, and possibly paralogs, which have differentiated since approximately the time of vertebrate radiation (500 million years ago) based on their apoptotic domains. We have observed that paralogs having differentiated after vertebrate radiation still have a related function in the same branch of the pathway.

Results

The apoptotic domains maintained in the current version of the database are listed in Table 2. Of the 13 domains, 12 have representative structures in the PDB.6 The first three – death effector domain (DED), caspase recruitment domain (CARD), and death domain (DD) – share the same SCOP superfamily and fold.7 Although the domains may be in several proteins involved in apoptosis, the mere presence of the domain does not guarantee an apoptotic function. Nine of the 13 domains are also found in proteins that do not have direct involvement in apoptosis, as indicated by their Swiss-Prot annotation. Thus, the presence of a domain cannot be used to automatically define a protein's involvement in apoptosis. We have compared several alternative means of categorizing proteins with respect to their function in apoptosis. These have been combined as described in the Materials and Methods section.

Table 2 Protein domains described in the apoptosis database.

One potential means to differentiate functionally between homologs that share at least one putative apoptotic domain is to use domain architecture – the set and order of domains. Examples of caspase domain architectures are shown in Figure 1. As illustrated, the function of a protein is not distinct between domain architectures. Although caspases are essential to apoptosis, some caspases are involved in cytokine activation rather than apoptosis. For example, human caspase-1 is involved in the inflammation response pathway by virtue of its ability to process proinflammatory (pro-IL) cytokines, pro-IL-1beta and pro-IL-18.8 Mammalian caspase-1 proteins contain three domains: the CARD domain, involved in protein–protein binding and thus regulation; the caspase large subunit; and the caspase small subunit (both subunits, together, form the catalytic site). Mammalian caspase-2 and caspase-9 proteins also share this combination of domains, yet they are both involved in different parts of the apoptosis signaling pathway. Thus, in this instance the domain architecture does not help in differentiating biological function. The specific domains in each protein must be explored in detail.

Figure 1

Human caspase proteins in the apoptosis database. Group I caspases are cytokine activators, group II proteins are initiators in the apoptotic pathway and group III proteins are apoptosis executioners. This is an example of a diagram maintained in the database with its text and caption subject to text search. Other bioinformatics tools that display the same or similar domain architectures (SMART,5 DART) are linked to the resource

Consider further our example of the caspases. Caspases have been categorized and renamed by a committee.9 This standard nomenclature based on individual genes is very useful in both conceptualizing and discussing these proteins and is captured in the database. Phylogenetic analysis of the individual domains shared among caspases leads to categories that closely follow the nomenclature. Figure 2 and Figure 3 show dendrograms based on a ClustalW multiple sequence analysis of the caspase small subunit and the CARD domain, respectively. In both cases, the caspase-1 proteins are distinct from caspase-2 and caspase-9. This evolutionary separation of domains is the principal criterion that we use to automate the functional categorization of these proteins. The phylogeny diagram created from the ClustalW alignment is only used as a visual tool to illustrate the separation. In characterizing proteins within the database we use the family pairwise search (FPS) algorithm10 (see Materials and Methods).

Figure 2

Phylogeny dendrogram for the caspase small subunit of several caspases based on a ClustalW alignment. Only the proteins representing each family are shown. Branch lines of this unrooted tree that are not exclusive to a family are shown in bold. A simplified version of this diagram is maintained in the database

Figure 3

Phylogeny dendrogram for caspase recruitment domains (CARDs) of several caspases based on a ClustalW alignment. Only the proteins representing each family are shown. Branch lines of this unrooted tree that are not exclusive to a family are shown in bold. A simplified version of this diagram is maintained in the database

Caspase-1, caspase-2, and caspase-9 are designated as separate families. The apoptotic domains (CARD, caspase small subunit and caspase large subunit) of the representative orthologs define the families for use in the FPS algorithm. Table 3 details the FPS-based classification for a few caspases. Human and mouse caspase-1 were used to establish the caspase-1 family. The FPS-based categorization of each of their domains unambiguously places them into the caspase-1 family. The caspase-9 family was founded by the human and mouse caspase-9 orthologs. The Xenopus protein labeled as caspase-9 is clearly categorized by FPS as a caspase-9 based on each of the three domains. Human caspase-2 does not fall into either caspase-1 or caspase-9 families based on rank and relative expectation value over each of its domains. Two additional Xenopus homologs of caspase also have the same combination of domains (CARD, caspase small subunit and caspase large subunit), but their role in apoptosis is not yet known. Xenopus ICE-A/B proteins have closer homology to caspase-1, but the assignment is problematic since, although the rank is consistent across domains, some expectation values (E > 10^-8) are inconclusive. This methodology has, by default, placed the Xenopus paralogs of mammalian caspase-1 proteins into the caspase-1 category (Xenopus/mammalian divergence occurred approximately 365 million years ago).11 This includes several other mammalian caspases which seem to have emerged from a common caspase-1 ancestral gene during vertebrate radiation approximately 500 million years ago. However, as more functional information is published about these proteins, a choice can be made to leave them in the caspase-1 family or to create a new family.

Table 3 Rank and expectation values of some caspases (rows) to the caspase-1 and caspase-9 FPS families (columns)

The above indicates the challenge in providing a clear separation of apoptotic proteins. In summary, candidate proteins for inclusion in the database are defined based on the presence of homology to a known apoptotic domain. These homologs are then clustered based on the closest homology to a set of annotated orthologs. The groups created by these representative orthologs are called families. This homology to a family is based only on each apoptotic domain. For vertebrate proteins, these independent measures of family were consistent over each of the protein's apoptotic domains, when accepting homology with expectation values (E-values) better than 10^-8.

This is superior to a resource that is reliant on Pfam or Pfscan for domain definitions. For example, the C-terminal motif of AIF proteins (or PCD-8, protein cell death-8) is an apoptotic domain whose apoptotic function is masked in the Pfam and Pfscan databases by another well-known functional role. AIF proteins have an additional well-recognized pyridine nucleotide-disulfide oxidoreductase domain, which is found in a large number of nonapoptotic proteins. The structure of the whole AIF has recently been solved, confirming a C-terminal structural domain only present in these apoptotic homologs, which is similar to bacterial ferredoxin reductase.12 This domain was used to define a number of apoptotic homologs, although it is not yet characterized as such in the Pfam or Prosite domain databases.

Attempts to use single domains that are repeated in a given protein to establish families can be problematic. Consider the baculoviral IAP repeat (BIR) domains of inhibitor of apoptosis (IAP) proteins. IAP proteins have between one and three BIR domains. In the human cellular IAP-1 protein, the first repeat has 43–49% sequence identity to the second and third repeats, respectively. However, the first repeat of cellular IAP-1 is 49% identical to the first BIR repeat of human XIAP proteins. That is, percent sequence identities and expectation values do not distinguish between the two groups, c-IAP-1 and X-chromosome-linked inhibitors of apoptosis (XIAP), if single domains are compared. By combining each domain into an FPS family, each functionally distinct group was easily separated, allowing homologs from other species, variants, and alternatively spliced forms to automatically group together. Similarly, repeated DED domains in caspase-8, caspase-10 and FLIP are also distinguishable by this approach.

A recent application of the database was to provide a reference set of proteins for use in the annotation of apoptotic proteins in the RIKEN mouse cDNA collection. The database was used in the first phase of the work, cataloging which human and other vertebrate proteins should have orthologs in mouse. A starting set of 187 apoptotic domains out of a final derived set of 294, including alternative splicing forms, was taken from the database and used to query the RIKEN database. While a number of assignments are putative, clearly some domains and motifs must be added to the database in the future, for example, the BH3-only motif, TNF receptor extracellular cysteine-rich repeats, and the PAAD/PYRIN domain, to provide a more extended coverage of apoptotic domains.

Database access

The database can be accessed by protein sequence using sequence homology searches with basic local alignment search tool (BLAST), by text string searches, by database ID searches, or by browsing lists of domains, families, or homologs. The lists of homologs are the primary means of accessing individual proteins, and several types of lists are found with alternative organization and filtering.

The list of homologs for an apoptotic domain is the most commonly used list. It shows all the homologs separated into families, including proteins with weak homology to the selected domain (Figure 4). The level of homology for the domain based on FPS, Pfam and Pfscan is symbolized by: !! (strong), + (reliable), w (weak), ? (unreliably weak) or an empty area for undetected homology. The expectation value for each of these is viewable upon ‘mousing over’ the symbol. The lists are ordered by decreasing homology to each of the families. This order gives important clues as to the reliability of the functional annotation for that homolog. Any two proteins can be selected for a pairwise alignment and several members on the list can be selected for multiple sequence alignment using ClustalW. Individual or multiple sequences can be output in FASTA format. All sequence alignments can be performed on the domain or the full-length protein sequence. This list of homologs can also be filtered to remove alternative splice forms or minor variants.

Figure 4

Example of a Homolog Listing. The image shows part of the larger listing of all proteins (including minor variants or alternative splicing variants) with significant homology to the Bcl-2 domain. Only the proteins most similar to the bcl-2-related ovarian killer (BOK) family are listed. These include the three proteins used to represent the family (labeled on ID column), several orthologs from other species and possible paralogs

An alternative list of homologs containing an apoptotic domain is organized by taxonomy. In this case, mammalian proteins are shown first, then other vertebrates, then other eukaryotes, then noneukaryotes. This list can also be viewed with either the graphical representation of domain architecture or textual annotation.
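The taxonomy-ordered listing described above amounts to a sort on a fixed group rank. The following is a minimal sketch under stated assumptions: the group labels, the `(name, group)` tuple layout, and the alphabetical tie-break within a group are illustrative choices, not details taken from the database.

```python
# Rank of each taxonomic group, in the display order given in the text.
# The group labels themselves are hypothetical.
TAXON_RANK = {"mammal": 0, "vertebrate": 1, "eukaryote": 2, "noneukaryote": 3}


def order_by_taxonomy(homologs):
    """Sort (name, group) pairs: mammals first, then other vertebrates,
    other eukaryotes, and noneukaryotes. Ties are broken by name (an assumption)."""
    return sorted(homologs, key=lambda h: (TAXON_RANK[h[1]], h[0]))


# Placeholder entries, not real database records.
homologs = [
    ("protein C", "eukaryote"),
    ("protein B", "vertebrate"),
    ("protein D", "noneukaryote"),
    ("protein A", "mammal"),
]
ordered = order_by_taxonomy(homologs)
```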

Each family has functional annotation describing its role in apoptosis. Also associated with each family is the representative protein and the domains used to establish the family, literature references associated with the family, and functional annotation of each of the separate domains, if known.

Computed details of each protein record are also available:

  • Homology for domains using Pfam, Pfscan, FPS, and PSI-BLAST.

  • Details of the variable PSI-BLAST profile/query used to retrieve the protein from the GenBank non-redundant (nr) database.

  • Possible transmembrane helices (TNF receptors, Bcl-2 family members, and others) calculated by trans-membrane hidden Markov model (TMHMM).

  • Possible signal cleavage sites calculated using SigCleave and as a peripheral output of TMHMM.

  • Predicted alternative splice forms or minor variant relations among proteins.

Discussion

The field of apoptosis has become one of the fastest growing areas of biomedical research, as tracked by numbers of publications devoted to the topic, according to the ISI.13 Consequently, we have established a database consisting of a set of proteins and their distinctive domains that are often, if not exclusively, involved in the apoptosis signaling pathway. The database provides a domain structure classification based on SCOP, a classification of domains based on sequence and a classification of protein families based on combinations of domains. Each classification has its own set of functional annotation. Starting with manually selected and curated apoptotic proteins, additional proteins are automatically added using bioinformatics techniques.

Future work includes the addition of new domains, for example, the PAAD domain14 if sufficient evidence amasses to justify their inclusion as true apoptotic domains. Further, features other than structural domains will be used to identify and categorize apoptotic proteins. For example, second mitochondria-derived activator of caspase (Smac/DIABLO) and its functional counterparts in Drosophila share only a short N-terminal motif that is required for promoting apoptosis by binding IAPs. In Smac, the N-terminal four amino acids (after loss of a targeting domain) are required for the activity of the full-length protein and seven amino acids are sufficient for activity as a peptide.15 Methods such as MEME16 search and recognize biologically relevant short motifs of this nature and will expand the characterization methodology used in a future version of the database.

Domain structure requires further annotation within the database. First, domains for which structures exist and have apparent structure homology are not indicated. Moreover, many sequences for which structures do not exist could be modeled by comparative (homology) modeling, or fold recognition and structural annotation provided. Such structure annotation could be automated.

Currently, the detailed clustering based on orthologous sets of proteins is fully supervised as far as choosing clusters. In the future, we will use general trends observed for each domain or family to automatically suggest changes to the representative proteins of the families as the sequence database is updated. For instance, when a clear closest homolog is present between a single set of mammalian orthologs and Xenopus, the mammalian proteins could be tested for forming a family. The divergence rates of caspase orthologs (or most distinctively similar homologs) among vertebrate species have been studied (data not shown). The results showed that each orthologous group had independent divergence rates for vertebrate caspase large subunits (the ‘P20’ domain). Some sets of orthologs had similar divergence rates; however, those ortholog sets did not share domain architecture, apoptotic functional roles, or group specificity. Thus, as orthologs are found between distant species, they can be used to predict divergence rates, but only on a per-family, and not per-domain basis.

Materials and Methods

Figure 5 outlines the process of gathering, annotating, and classifying apoptotic proteins into families.

Figure 5

Process diagram for the creation and update of the apoptosis database. Rectangles are processes; rectangles with additional vertical borders are processes developed elsewhere and incorporated into the database with only parametric or configuration changes. Parallelograms denote data that are incorporated into the database. Stripes denote data over whole length proteins while gray filling denotes data on single domains. Asymmetric shapes (trapezoids) are manual curation processes

Gathering proteins into the database

The selection of proteins for inclusion is based on homology to apoptotic domains (Table 2). PSI-BLAST17 version 2.1.3 running against the NCBI nr database is used to create dynamic profiles. The PSI-BLAST profiles are robust and dynamic enough to gather all known domains when several (3–17) profiles are used to represent each domain. The initial round of iterative profile building always starts with a single seed sequence. This seed comes from interaction with experts in the field, literature, standard curated profile analysis from Pfam or Prosite, or prior PSI-BLAST results. Each PSI-BLAST run is independent – allowing for cases when the sequence homology between seeds for the same domain is very poor. In cases where many closely related homologs can dominate a profile, a high-expectation filter for inclusion into the profile combined with appropriate choice of seeds allows for rarer homologs to maintain a signal in the profile. These parameters are manually set and fixed for all the PSI-BLAST iterations. Most often, PSI-BLAST iterations are taken to convergence – no changes occur in the sequences defining the profile between iterations.
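The convergence criterion described above (no change in the sequences defining the profile between iterations) can be sketched as a simple loop. This is not the PSI-BLAST implementation: `search` is a hypothetical callable standing in for one profile search round, and the toy neighbor table and `max_rounds` limit are assumptions made only so the example runs on its own.

```python
def iterate_to_convergence(seed, search, max_rounds=20):
    """Repeat profile searches until the member set stops changing.

    `search` is a hypothetical stand-in for one search round: given the
    current profile members, it returns the sequence IDs that pass the
    inclusion threshold. Returns (members, rounds_run, converged).
    """
    members = {seed}
    for round_no in range(1, max_rounds + 1):
        hits = set(search(members)) | {seed}  # the seed always stays in the profile
        if hits == members:                   # no change between iterations
            return members, round_no, True
        members = hits
    return members, max_rounds, False


# Toy similarity table: each ID lists the IDs it can recruit directly.
NEIGHBORS = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}


def toy_search(members):
    found = set(members)
    for m in members:
        found |= NEIGHBORS[m]
    return found


members, rounds, converged = iterate_to_convergence("A", toy_search)
```

Starting from seed `A`, the toy profile recruits `B` and then `C`, and a final round confirms that nothing further changes; `D` is never reached, which mirrors the need for several independent seeds per domain.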

The profiles, parameter sets, and seed sequences are stored in the database. Figure 6 illustrates part of the database schema detailing the relation between profiles, alignments and associated domains, proteins, and families. Updates of the GenBank nr protein sequence database are periodically used to update the apoptosis database. The stored profiles are used to determine which new homologs to add, with resultant incremental changes to the profiles. Parameter changes and new seeds determined by experts are also incorporated as needed during updates. The current set of seeds and PSI-BLAST parameters used to populate the database are included in the Appendix.

Figure 6

Relational database schema for profile- and FPS-based domain assignments. The schema is shown in standard ‘crow's foot’ notation. Each rectangle represents a table (relation). The relation between records in two tables is represented by the symbols near each pair. Double crossing lines denote that one and only one record must be present in that table. A single crossing line with three forking lines denotes that one or more corresponding records must be present. A circle with three forking lines denotes that zero or more corresponding records are allowed. Profile-based assignments (gray) are made per domain onto full-length protein sequences (PROTEIN). An apoptotic domain (DOMAIN) can have zero or more Pfam and Pfscan profiles representing it, but must have at least one PSI-BLAST and FPS family representation. The FPS families are defined over individual domains of the representative full-length sequences (FAMILY)

Domain annotation

Proteins with apoptotic domains found using PSI-BLAST are retrieved from GenBank using Perl scripts that include the Boulder::Genbank module. The protein full-length sequence, organism (taxonomy) and other information are retrieved from the GenBank record at NCBI. These protein sequences are then annotated for the presence of domains, transmembrane helices and possible signal peptide cleavage sites (Figure 5).

HMMsearch against the Pfam database (>3000 domains) and Pfscan (Not published. Available at URL: http://www.isrec.isb-sib.ch/software/PFSCANform.html) against the Prosite profiles database (161 domains) are used to assign domains. In most cases, the apoptotic domain is also profiled in Pfam and/or Prosite profiles database. These static, curated profiles help confirm assignments made using PSI-BLAST, but the PSI-BLAST profiles cover additional (often newer) homologs that are not yet represented by these standard profiles. In cases where Pfam or Prosite finds additional homologs for an apoptotic domain, new PSI-BLAST runs are created to cover this range of new homologs. The hits resulting from all domains in Pfam and Prosite profiles are stored in the database along with version information for the Pfam and Prosite databases.

The definition of structural superfamilies is taken from SCOP. Currently, any one sequence per domain that corresponds to a protein structure recorded in the PDB (as reported by NCBI GenBank sequence records) is checked for representation in SCOP.

Protein features

Transmembrane helices are detected using the TMHMM program.18 The output is parsed and stored into the database with individual sequence ranges labeled as intracellular, extracellular, or trans-membrane.
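The parsing step can be sketched as follows. The sketch assumes a compact topology string of the kind TMHMM can report (for example `i7-29o44-66i`, where `i` and `o` mark the inside and outside of the membrane and each number pair marks a predicted helix); the function name and the tuple layout are illustrative and are not taken from the database code.

```python
import re

# Mapping of topology letters to the labels used in the text (assumed mapping).
SIDE = {"i": "intracellular", "o": "extracellular"}


def topology_to_ranges(topology, seq_len):
    """Convert a topology string such as 'i7-29o44-66i' into labeled,
    1-based inclusive ranges covering the whole sequence."""
    ranges = []
    pos = 1
    # Each match is: side before the helix, helix start, helix end.
    for side, start, end in re.findall(r"([io])(\d+)-(\d+)", topology):
        start, end = int(start), int(end)
        if start > pos:
            ranges.append((pos, start - 1, SIDE[side]))
        ranges.append((start, end, "trans-membrane"))
        pos = end + 1
    if pos <= seq_len:
        # The final letter gives the side of the C-terminal segment.
        ranges.append((pos, seq_len, SIDE[topology[-1]]))
    return ranges


example = topology_to_ranges("i7-29o44-66i", 80)
```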

UniGene records for the proteins in the database do not come directly from the retrieved GenBank entries. The entire UniGene database from NCBI is downloaded, parsed, and analyzed for relations to current or former GenBank proteins in the nr database. Matches that cannot be made directly via GI number are performed by matching both the sequence and source organism exactly.
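The two-stage matching rule (GI number first, then exact sequence plus source organism) can be illustrated with two lookup tables. This is a sketch only: the dictionary layout, field names, and cluster identifiers are placeholders invented for the example.

```python
def match_unigene(protein, by_gi, by_seq_org):
    """Return the UniGene cluster for a protein, or None if unmatched.

    by_gi:      {gi_number: cluster_id}, for direct matches
    by_seq_org: {(sequence, organism): cluster_id}, for the exact-match fallback
    """
    cluster = by_gi.get(protein["gi"])
    if cluster is not None:
        return cluster
    # Fallback: both the sequence and the source organism must match exactly.
    return by_seq_org.get((protein["sequence"], protein["organism"]))


# Placeholder data; the IDs and sequences are not real records.
by_gi = {1001: "cluster-1"}
by_seq_org = {("MKTAY", "Homo sapiens"): "cluster-2"}

direct = match_unigene(
    {"gi": 1001, "sequence": "MAAAA", "organism": "Homo sapiens"}, by_gi, by_seq_org)
fallback = match_unigene(
    {"gi": 2002, "sequence": "MKTAY", "organism": "Homo sapiens"}, by_gi, by_seq_org)
wrong_species = match_unigene(
    {"gi": 2003, "sequence": "MKTAY", "organism": "Mus musculus"}, by_gi, by_seq_org)
```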

Alternative splice forms and minor protein variants from the same species are established using the BLAST algorithm (Figure 5). All the proteins in the database are separated into species-specific databases. Each protein is tested for BLAST hits against its species with greater than 96% identity and allowing for mismatches at the ends of the alignments. The representative for a set of alternative splice forms/variants is the longest sequence taken from all SwissProt, UniGene, and RefSeq sequences.
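One way to picture this grouping step is as single-linkage clustering over BLAST hits above the identity threshold, followed by choosing the longest curated sequence as the representative. The sketch below makes several assumptions that are not stated in the text: hits are supplied as precomputed `(id_a, id_b, percent_identity)` triples within one species, linkage is transitive, and a group with no SwissProt, UniGene, or RefSeq member falls back to its longest sequence.

```python
from collections import defaultdict

CURATED = {"SwissProt", "UniGene", "RefSeq"}


def cluster_variants(proteins, hits, min_identity=96.0):
    """Group splice forms/variants and pick a representative per group.

    proteins: {protein_id: (length, source)}
    hits:     iterable of (id_a, id_b, percent_identity) within one species
    Returns {representative_id: sorted list of member ids}.
    """
    parent = {pid: pid for pid in proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity in hits:
        if identity > min_identity:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for pid in proteins:
        groups[find(pid)].append(pid)

    result = {}
    for members in groups.values():
        curated = [m for m in members if proteins[m][1] in CURATED]
        pool = curated or members  # fallback to all members is an assumption
        rep = max(pool, key=lambda m: (proteins[m][0], m))
        result[rep] = sorted(members)
    return result


# Placeholder records: (length, source).
proteins = {
    "p1": (300, "GenBank"),
    "p2": (280, "RefSeq"),
    "p3": (250, "SwissProt"),
    "p4": (400, "GenBank"),
}
hits = [("p1", "p2", 98.5), ("p2", "p3", 97.0), ("p3", "p4", 80.0)]
clusters = cluster_variants(proteins, hits)
```

In this toy input, `p2` represents the first group even though `p1` is longer, because only curated sources are considered when one is available.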

FPS clustering

The FPS algorithm calculates the expectation value of a query sequence against a set of FPS families. Each FPS family consists of one or more sequences and its normalization weight. In our case, the FPS families are the apoptotic domains of manually selected full-length representative sequences (usually the human and mouse orthologs, if available). The expectation value of a query sequence to a family is the weighted product of P-values from alignments (BLAST) to each sequence defining the FPS family. Normalization of P-values must be performed, since the sequences defining FPS families are related (thus, not independent measurements). The normalization weights are estimated by Smith–Waterman alignments against a random library of 995 sequences from the PDB database.
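As a rough illustration of the weighted product, one reading is that each per-sequence P-value is raised to its normalization weight and the results are multiplied. The sketch below computes that quantity in log space for numerical stability. The exact form used by the published FPS algorithm may differ; the function name, the weights, and the P-values shown are assumptions for illustration only.

```python
import math


def family_score(p_values, weights):
    """Weighted product of P-values, prod(p_i ** w_i), computed in log space.

    p_values: P-values of the query against each sequence defining the family
    weights:  normalization weights for those sequences (assumed given)
    """
    if len(p_values) != len(weights):
        raise ValueError("need one weight per P-value")
    if any(p <= 0.0 for p in p_values):
        raise ValueError("P-values must be positive")
    log_score = sum(w * math.log(p) for p, w in zip(p_values, weights))
    return math.exp(log_score)


# Two family sequences with equal weights of 0.5: the score is the
# geometric mean of the two P-values (illustrative numbers).
score = family_score([1e-10, 1e-12], [0.5, 0.5])
```

With weights that sum to one, correlated family members do not inflate the significance the way an unweighted product of dependent P-values would, which is the motivation for normalization given in the text.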

All proteins retrieved by PSI-BLAST searches for apoptotic domains are categorized using this FPS algorithm. Although the best family for each domain (with an expectation value cutoff of 10^-8) is sufficient for categorizing each protein, the expectation values to all families are stored in the database. This richer representation is useful for representing cases where classification into a single family is ambiguous. These cases are later evaluated for the possible establishment of new FPS families.
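The per-domain assignment and the check for ambiguous cases can be sketched as follows. The data layout and function names are hypothetical, and the E-values are made-up numbers rather than entries from Table 3; only the cutoff of 10^-8 comes from the text.

```python
CUTOFF = 1e-8  # expectation value cutoff stated in the text


def best_family(evalues, cutoff=CUTOFF):
    """evalues: {family: E-value} for one domain.
    Return the best-scoring family, or None if nothing beats the cutoff."""
    if not evalues:
        return None
    family, e = min(evalues.items(), key=lambda kv: kv[1])
    return family if e < cutoff else None


def classify_protein(domain_evalues):
    """domain_evalues: {domain: {family: E-value}}.
    Return (family, per_domain_calls); family is None when the domains
    disagree or any domain lacks a confident call (an ambiguous case)."""
    calls = {d: best_family(ev) for d, ev in domain_evalues.items()}
    distinct = set(calls.values())
    if len(distinct) == 1 and None not in distinct:
        return distinct.pop(), calls
    return None, calls


# Illustrative E-values only.
clear = {
    "CARD": {"caspase-1": 1e-30, "caspase-9": 1e-5},
    "small subunit": {"caspase-1": 1e-25, "caspase-9": 1e-9},
}
ambiguous = {
    "CARD": {"caspase-1": 1e-6, "caspase-9": 1e-4},
    "small subunit": {"caspase-1": 1e-20, "caspase-9": 1e-9},
}
clear_family, clear_calls = classify_protein(clear)
ambiguous_family, ambiguous_calls = classify_protein(ambiguous)
```

Keeping the full `{family: E-value}` table for every domain, rather than only the winning call, is what allows ambiguous proteins to be revisited later when new families are considered.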

References are related to several entities in the database: families, FPS families, domains, proteins, multiple sequence alignments, and diagrams. Several types of static diagrams (such as Figure 1 and phylogeny dendrograms similar to Figure 2 and Figure 3) are linked to the domains that they describe. The readable text in each image is stored alongside it in the database, which allows relevant diagrams to be matched by text searches over the database.

The system is implemented using the MySQL (MySQL AB) relational database management system, with users accessing both static and dynamic web pages created by Perl scripts and a Perl module (library) specific for the database. Other Perl modules used include XML::Twig (Michel Rodriguez, xmltwig.com) for parsing XML data from PSI-BLAST; DBI:: and DBD::mysql (Tim Bunce, Jochen Wiedmann) for interfacing to the database; Boulder:: (Lincoln Stein) and Bio:: (bioperl.org) for sequence and literature references retrieval and parsing; and LWP:: and HTML::TokeParser (by Gisle Aas) for accessing and retrieving information on external web sites.

Two versions of the database, a development version and a production version, are maintained at all times. The production version is static and accessible to all users of the Internet at the URL http://www.apoptosis-db.org. The development database is dynamic, with software upgrades and data added on a regular basis. Periodically, the development version is copied to create a prerelease version. Apoptosis experts review the prerelease version with emphasis on those proteins gathered automatically. When the experts are satisfied, the prerelease version becomes the public production database.