Introduction

Machine learning methods can predict many useful properties of proteins directly from their sequences1,2,3. However, these methods require labeled data, mapping proteins to the property of interest. High-quality labeled data is expensive to acquire, and large quantities are usually required to train a good predictor.

In the domain of natural language processing (NLP), the issue of missing or scarce labels is often solved by pre-training a model on unlabeled, general-domain data. This results in representations of the data that capture high-level semantics. These can then be used in downstream tasks: specific prediction tasks for which only a limited amount of labeled data is available. Usually, this is achieved by fine-tuning the pre-trained model on the labeled data4,5. Training on the downstream tasks is therefore also called the fine-tuning step. Recently, the transformer architecture has emerged as a firm favorite for this kind of approach6. Large language models such as BERT7 and GPT-28 have shown a remarkable ability to generalize across domains. It is a reasonable question to ask whether this approach carries over to the domain of proteins: can we successfully pre-train a model on unlabeled data, and fine-tune it for a variety of tasks requiring labeled data? If so, does the pre-training allow us to perform better than if we had trained on the labeled data alone? We evaluate this question here using transformer models, which are currently the most popular approach; however, any sequence-to-sequence model can be evaluated on such a benchmark.

Several recent studies have already investigated protein representation models, among them models based on the transformer architecture9,10,11,12. In general, these representation models use a large set of protein sequences to train an NLP-based model whose objective is to learn embeddings that represent the proteins. We provide a comprehensive overview of protein representation models in the Related work section. Most of these models use different protein sequence databases for pre-training, widely different numbers of model parameters, and different downstream tasks for evaluation. Most importantly, these models are generally evaluated only on one or two downstream tasks, giving a poor indication of how well the pre-trained representations generalize. We suggest that the rapid progress of pre-trained transformer models in the domain of NLP is not just due to the power of the models, but also to the wealth of benchmarks already available for a variety of tasks. Without standardized and varied benchmark suites like GLUE13, we believe that development would have progressed much more slowly.

In this paper, we present a benchmark set for the domain of protein prediction, comprising a variety of structural protein prediction tasks on which the generalization of pre-trained representations can be evaluated: the Protein General Language (of life) representation Evaluation (ProteinGLUE) benchmark. This benchmark consists of seven downstream tasks, with data formatted for and tested in large transformer models. The following tasks are included:

  • Secondary structure The secondary structure describes the local structure of a protein, defined by patterns of backbone hydrogen bonds in the protein14. Commonly, three types are distinguished: \(\alpha\)-helix, \(\beta\)-strand, and coil (anything else); a further subdivision into eight types can be made15. Secondary structure prediction methods aim to classify the type per amino acid2.

  • Solvent accessibility (ASA) For every amino acid in a protein, the solvent accessibility indicates the amount of surface area of the amino acid that is accessible to the surrounding solvent16. Relative solvent accessibility classifies residues into a buried or non-buried state, reducing the difficulty of the prediction task. We model the absolute solvent accessibility as a regression task over every amino acid and the relative solvent accessibility as a classification task over every amino acid.

  • Protein-protein interaction (PPI) The interactions between proteins are arguably the most important property for the functioning of a protein17. Almost all processes occurring in a cell are in some way dependent on protein-protein interactions; these include DNA replication, protein transport, and signal transduction16,18. The protein-protein interaction interface determines which residues are involved in the interaction, and may be predicted as a classification task over every amino acid.

  • Epitope region A specific kind of PPI is the binding between antigens and antibodies. The antigen region that is recognised by the antibody is a set of amino acids on the protein surface, and is known as the epitope19. These may be predicted as a classification task over every amino acid20.

  • Hydrophobic patch prediction A cluster of spatially adjacent hydrophobic residues on the protein surface is called a hydrophobic patch. Hydrophobic residues on the surface of the protein may be important for the interaction between proteins, or for membrane interactions, and have been implicated as a driving factor in protein aggregation21. Protein aggregation, in turn, is thought to be a major causative factor in the development of diseases such as Alzheimer's and Parkinson's22. Hydrophobic patch prediction may be modeled as a regression task over every amino acid by identifying the rank of the size of the hydrophobic patch to which the amino acid belongs.

These tasks are related to protein structural properties, mainly because protein structures are the richest source of per-amino-acid property annotations, but they are also closely linked to protein function. In particular, PPI, epitopes, and hydrophobic patches describe a protein's relations with other molecules in its environment, which together define protein function, as inspired by Bork et al.17.

General function-related annotations, such as GO labels or enzyme classifications, may also be relevant, but these would yield one label per protein. This severely reduces the number of labels, which means that in such tasks a very large test set is required (on the order of \(10\;000\) proteins) to accurately estimate performance. Moreover, the small number of labels in the training data may mean that the task becomes impossible rather than challenging. For these reasons, we focus here on tasks that contain one label per residue, and leave per-protein tasks to future work.

To estimate performance on these tasks, we pre-trained two large transformer models. This allows us to test our benchmark suite, provide reference code for the training and development of prediction methods, and to give a baseline performance for each task, showing what performance can be expected from a modest sized model. Commonly, pre-training is performed on a large set of unlabeled general domain data. In the protein domain this translates to unlabeled protein sequences. We have chosen to use the protein sequences from the protein domain family database Pfam, which is a widely used database for the classification of protein sequences23.

Our pre-trained models are based on the BERT transformer architecture for natural language processing7. We trained two models of different sizes: the medium and base model architectures. The base model was first described by Devlin et al.7 and contains 12 hidden layers, 12 self-attention heads, a hidden size of 768, and 110 million parameters. The smaller medium model can be used to overcome the time and memory limits that are associated with base-sized models24,25. This medium version contains 8 hidden layers, 8 attention heads, a hidden size of 512, and 42 million parameters.

Our contributions are as follows:

  • A set of generally usable downstream tasks for evaluating pre-trained protein models.

  • A repository of reference code showing how to pre-train a large transformer model on unlabeled data from the Pfam database, and how to evaluate it on our benchmarks.

  • Two such pre-trained models, broadly similar to the BERT-medium and BERT-base models.

The rest of the manuscript is organised as follows. We first give an overview of related work, followed by an outline of the ProteinGLUE benchmark suite presented here. We then describe the methods and results of the pre-trained models, followed by the methods and results of the fine-tuned models. All datasets, code, and models are publicly available at https://github.com/ibivu/protein-glue. All code is MIT licensed, the models are public domain (that is, Creative Commons CC-0), and the datasets are each released under the most permissive licence allowed by the source data.

Related work

Many learning architectures have been used for a variety of protein property prediction tasks. As background to the multi-task benchmark suite we present here, we therefore first include a rather in-depth overview of relevant natural language modelling architectures. We then proceed to review state-of-the-art machine learning approaches to protein modelling, from which we have gleaned relevant and interesting prediction tasks, sources of reference data, as well as some inspiration for our learning approach. Note that this section provides background information and is not required for understanding the main results of this study.

Natural language modeling

The idea of combining labeled and unlabeled data has a long history in machine learning, going back to approaches such as self-training and co-training26,27,28. One of the first deep learning models to combine unsupervised pre-training with supervised fine-tuning in multiple domains was the RNN-based ULMFit29. This was followed by ELMo30, which used bidirectional RNNs and was the first to show state-of-the-art performance across many downstream tasks. We have recently investigated a number of different neural network architectures for their ability to predict protein interfaces31. Transformer models are architectures which rely primarily on the self-attention operation. Self-attention is a sequence-to-sequence operation in which the output vector for each token i in the sequence is a weighted average of all input vectors in the sequence, with the weights determined dynamically by the contents of the input vectors themselves. Most commonly, the weight for input j is based on the dot product of input vectors i and j. A key property of transformers is that self-attention is the only operation that mixes information between tokens in the sequence. All other operations in the model are applied to each token in isolation.
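As an illustration of the description above, here is a minimal NumPy sketch of dot-product self-attention; it is a didactic example under our own naming, and it omits the learned query/key/value projections and the multiple heads used in actual transformer blocks.

```python
import numpy as np

def self_attention(x):
    """Minimal dot-product self-attention over a sequence of token vectors.

    x: array of shape (seq_len, dim), one input vector per token.
    Returns an array of the same shape: each output vector is a weighted
    average of all input vectors, with weights derived from dot products.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])           # dot product of vectors i and j
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions j
    return weights @ x                                # weighted average of all inputs

# Example: a toy "protein" of 5 tokens embedded in 8 dimensions.
tokens = np.random.randn(5, 8)
print(self_attention(tokens).shape)  # (5, 8)
```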

The Transformer was introduced by Vaswani et al.6. This model was an encoder/decoder architecture designed specifically for machine translation. Devlin et al.7 simplified the model to a single stack of transformer blocks, and adopted the pre-training and fine-tuning approach from ULMFit and ELMo. The result, called BERT (Bidirectional Encoder Representations from Transformers), is what we base our reference models on. BERT is a bidirectional sequence-to-sequence model: both its input and its output are sequences of vectors of the same length. For the computation of each of the output vectors, all vectors of the input may be used (to the left or the right). By contrast, autoregressive models, like those in the GPT family8,32, are unidirectional, which means that the output vector for one element in the sequence is computed using only the input vectors of preceding elements. Both bidirectional and autoregressive models have been shown to be capable of learning strong representations, both of the tokens in the sequence and of the sequence as a whole. In NLP, text sequences are most commonly broken up into tokens larger than individual characters but smaller than individual words. In the protein setting, individual amino acids are usually taken as tokens.

Protein modeling

In recent years, attempts have been made to automatically generate enriched embeddings of protein sequences through machine learning. These embeddings generally capture information that is not explicitly encoded in the protein sequence, such as information about the structure or dynamics. This enriched representation can be used in place of the original sequence to improve performance on a variety of tasks, including protein database searching, regression, and classification tasks.

Enriched protein representation models can be categorized on three axes. Firstly, there is the architecture of the model used to generate the representation, which can broadly be categorized as either Word2Vec-based33,34,35, LSTM-based36,37,38,39,40, or Transformer-based9,10,11,12.

Although not always specified, a second categorization can be made in terms of model size, i.e. the number of parameters a model contains. This is, to some degree, correlated with the model type, but not in an absolute sense. Comparisons between types of models are therefore complicated if, for example, recent Transformer models with many parameters are benchmarked against smaller LSTM models. The LSTM-based representation method UniRep36 contains 18 million parameters, the TAPE Transformer-based model contains 38 million parameters9, and the LSTM-based model DeepSeqVec contains 93 million parameters37. Recently developed methods have mostly been based on Transformer models, and a clear trend of increasing size can be observed. For example, ESM-1b has 650 million parameters12, and Prot-TR-XL-UniRef has 3 billion11. It has been shown that even these large models, such as ProGen with 1.2 billion parameters, still do not overfit the training data10. Heinzinger et al.37 note that the risk of overfitting is generally very small, given that the number of tokens in the training set (hundreds of billions) is much higher than the number of parameters; however, for protein data one must take into account the redundancy due to many similar (homologous) sequences. For (large) protein sequence alignments this is commonly done by evaluating the so-called effective number of sequences \(N_{eff}\)41, essentially counting clusters of sequences at a set threshold of similarity (e.g. 62% or 80%). However, we do not attempt to translate this approach to sequence databases here.

Thirdly, the size of the datasets used for pre-training differs significantly across methods. To give an idea of the range: UDSMProt38 is pre-trained on 560 thousand Swiss-Prot sequences42, UniRep is trained on 24 million UniRef50 sequences, ESM-1b is pre-trained on 250 million UniParc sequences, and Elnaggar et al.11 even train their models on the BFD dataset containing over 2 billion sequences43,44, and on the UniRef100 database with 216 million sequences. We find that UniRef5042 and Pfam are the most popular pre-training databases. By selecting protein (domain) sequences from each cluster or family in Pfam, it can be ensured that the pre-training dataset contains a wide variety of protein sequences and, to a limited degree, is non-redundant.

Protein representation models may be evaluated internally, by analysing their embeddings, or externally, by benchmarking predictions based on the representations. Several studies have shown correlation between the embeddings and physicochemical properties11,12,36,37, the source organism proteome11,12,36,37, the secondary structure11,36,37, and the subcellular location11,37. The quality of these correlations generally improves with fine-tuning. Additionally, Vig et al.45 inspect the intermediate outputs of the TAPE Transformer model, and observe that, in the first layers, its attention heads specialize on amino acid type. Then, deeper layers focus on more complex features, such as binding sites and intra-chain protein contacts.

We observe a wide variety in the kinds of benchmarks used to evaluate these protein representation models. Even if the type of benchmark is similar, methods will frequently use different datasets, use different training and validation methodologies, and differ in how the datasets are preprocessed.

A standard prediction task is the prediction of the secondary structure (\(\alpha\)-helix, \(\beta\)-strand, or coil)9,11,12,37,40,45. Two commonly used datasets for this task are the NetsurfP-2.0 dataset1 and the SPOT dataset46. These datasets consist of training sequences with assigned secondary structure, derived from PDB structures. Another frequently performed prediction task is contact map prediction9,12,39,45,47, in which residues are identified as being in close contact with each other based on the three-dimensional structure of the protein. Consequently, datasets containing contact map annotations are also based on PDB structures.

Next to per-amino-acid predictions, several studies perform per-protein prediction tasks such as remote homology detection9,12,48. Remote homology detection is a classification task to identify homologous sequences, i.e. sequences descended from a common ancestor during evolution. While close homologs will be very similar in sequence, remote homologs can have conserved function (and structure) but diverse sequences, making this a difficult task48. The Pfam database could be used for this prediction task, but the SCOP49 and CATH50 databases are more commonly used9,12,48. These two databases use structural information more explicitly, allowing them to classify sequences into broader superfamilies than Pfam does, and are thus more suitable for remote homology, which is more difficult to detect and predict.

Alley et al.36 and Rives et al.12 include variant effect prediction as a benchmark, as does the variational autoencoder-based approach of Riesselman et al.51. In this task, the quantitative impact of a mutation is predicted for a specific set of protein functions, such as ligase activity and substrate binding. Two commonly used datasets are the dataset used by Gray et al.52, including 21 026 variant effect measurements of eight proteins from nine experimental datasets, and the dataset by Riesselman et al.51, including 712 218 mutations on 34 proteins.

Protein localization classification is also a protein-level labelling task. Protein function may depend on subcellular localization, and abnormal localization can lead to dysfunction, which in turn can contribute to disease. Min et al.40, Heinzinger et al.37, and Elnaggar et al.11 use the DeepLoc dataset53, consisting of 13 858 proteins. Proteins are classified as being present in one of ten cellular locations, based on UniProt annotations. For transmembrane prediction, Bepler and Berger39, as well as Min et al.40, use the TOPCONS dataset54, containing 6 856 proteins. In this task, the model needs to predict for every amino acid whether it is membrane-spanning.

Notably, the TAPE repository9 provides a set of five benchmark tasks covering both biological property and protein engineering prediction: secondary structure, residue contacts, and remote homology for the former, and fluorescence landscape and stability landscape for the latter. The biological property tasks are also generally used for evaluating representation models. Compared to TAPE, ProteinGLUE focuses on the biological property prediction tasks, and expands the set of tasks with solvent accessibility, protein-protein interaction, epitope, and hydrophobic patch prediction, but omits contact map prediction and remote homology detection. By adding these properties, which are related to the protein molecule’s interactions with its environment, the ProteinGLUE benchmark set includes tasks that are more closely related to protein function17. The ProteinGLUE benchmark set contains only per-amino-acid prediction tasks; per-protein tasks are out of scope for this study. Thus, ProteinGLUE may be seen as complementary to the TAPE benchmark tasks.

Table 1 Overview from literature of the best performing protein representation models on the secondary structure prediction task for 8 classes (SS8), using the CB513 dataset.

Notable factors influencing protein representation model performance are the size of the training set, the model size, and pre-training using multiple (orthogonal) objectives. Although the CB513 dataset55 has been called redundant and outdated11, it is commonly used to evaluate model quality on the prediction of secondary structure in eight classes, as it is sufficiently difficult to avoid ceiling effects. Table 1 summarizes some of the best performing models on this dataset. The state-of-the-art model NetSurfP-2.01 achieves 72.3% accuracy on this set. Of the protein representation models, only the MSA Transformer from Rao et al.47 achieves a higher accuracy: 72.9%. This is an improvement of more than 10 percentage points over their earlier TAPE Transformer model, which achieves an accuracy of 59%9. Large pre-trained Transformers come close to state-of-the-art performance: ESM-1b achieves 71.6%, ProtTrans-T511 achieves 71.4%, and ProtTrans-Bert11 achieves 70% on SS8 prediction. This is in accordance with a general trend that we observe: most protein representation models do not outperform the state of the art, which is often set by methods laboriously hand-tuned for optimum performance. However, large transformer-based models often come so close, on a wide variety of downstream tasks, that it is debatable whether a statistically significant performance difference actually exists. Because the architectures are not specialized for a specific task (most are actually fairly straightforward conversions of natural language processing architectures), there is already significant value in being able to get to near state-of-the-art performance.

Recently, AlphaFold2, a neural-network-based algorithm from Google’s DeepMind that predicts protein structure from sequence, has gained a lot of attention for its prediction results at CASP-143. This may raise the question whether the type of tasks presented here is still relevant. Notwithstanding the enormous progress that AlphaFold2 represents, reliable structural information remains unavailable for many proteins56,57. Moreover, the usefulness of predicted structures for direct derivation of structural features may be limited58,59. This means that sequence-based prediction of protein structural features is still a relevant task. Finally, AlphaFold2 requires a high-quality multiple sequence alignment as input. Conceivably, advancements in protein sequence modeling could improve the quality and breadth of AlphaFold2 predictions as well.

ProteinGLUE Benchmark tasks

The ProteinGLUE benchmark suite described in this work consists of the following seven benchmark tasks, which are all structural features that are labelled per amino acid in the protein sequence.

  • Secondary structure (SS3 and SS8) The dataset used for the secondary structure classification into three classes (\(\alpha\)-helix, \(\beta\)-sheet, and coil) and into eight classes (coil, high-curvature, \(\beta\)-turn, \(\alpha\)-helix, 310-helix, \(\pi\)-helix, \(\beta\)-strand, and \(\beta\)-bridge) was created by Hanson et al.46. This dataset is used in multiple prediction methods2,46,60. Proteins in the set were obtained using the PISCES server61, and were filtered using a resolution cutoff of \(<2.5\) Å and a sequence identity (seq.ID) cutoff of 25% according to BlastClust41. We split this dataset into 8 803 sequences for training, 1 102 sequences for validation and 1 102 sequences for testing. For the secondary structure prediction in three (SS3) and eight classes (SS8) these sets include the same proteins. Accuracy (ACC) is used for measuring model performance on these classification tasks. In order to compare the SS8 prediction to other models (see Related work) we determine the performance on the commonly used CB513 dataset. We used CDhit62 to exclude from this test set all proteins with more than 40% sequence identity to our training set, resulting in a dataset size of 390 protein sequences.

  • Solvent accessibility (ASA and BUR) The dataset used for the solvent accessibility prediction is based on the same dataset used for SS3 and SS8. Training, validation and test sets were sampled independently from the secondary structure prediction sets, but include the same number of sequences for each. The absolute solvent accessibility (ASA) values, as given in the source data, were used to identify buried residues. Residues were determined as being buried (BUR)63 if the relative solvent accessible area, that is, the solvent accessible area divided by the maximal solvent accessible area for an amino acid type, was less than 7%63; a minimal sketch of this computation is given after this list. Accuracy (ACC) is used for measuring model performance on the BUR classification task, and the Pearson correlation coefficient (PCC) for the ASA regression task.

  • Protein-protein interaction (PPI) The dataset used for the PPI interface prediction task was created by Hou et al.16 for their random forest based PPI interface prediction model SeRenDIP16,18, and includes both homodimer and heterodimer interfaces. The homomeric dataset was based on the Test_set 1 dataset of earlier work by Hou et al.64. In short, this dataset was created by filtering on 30% seq.ID, and the remaining proteins were filtered at 25% seq.ID against the heterodimer dataset and against the training sets of NetSurfP. The heteromeric dataset was created from Dset_186 and Dset_72 created by Murakami and Mizuguchi65, and was filtered at 25% seq.ID against the NetSurfP and DynaMine training sets, and against the homomeric dataset16. We used these four datasets, retaining 287 homomeric training, 93 homomeric test, 118 heteromeric training and 44 heteromeric test proteins (the last stored protein for each was omitted). We then selected 20% of the homomeric and heteromeric training proteins individually in order to create a validation set. The homomeric and heteromeric training, validation, and test sets were concatenated into one training, validation, and test set for the PPI task, containing both types of interfaces. The area under the receiver operating characteristic curve (AUC ROC) is used to evaluate model performance on the PPI prediction task.

  • Epitope region (EPI) The epitope dataset was obtained from Hou et al.16. This dataset is based on the structural antibody database of the Oxford Protein Information Group (SAbDab)66. The PDB structures of the antibody-antigen complexes were selected and antigen sequences were filtered at 25% seq.ID. We used the training, validation, and test sets of the first fold of the original 5-fold data split, which contains 179 training, 45 validation, and 56 test sequences. Area under the receiver operating characteristic curve (AUC ROC) is used for measuring model performance on the EPI prediction task.

  • Hydrophobic patch prediction (HPR) The hydrophobic patch dataset contains the structure-based assignment of hydrophobic patches, as created by Van Gils et al.22. PISCES was used to collect PDB structures with a resolution \(\le\) 3.0 Å, an R-factor \(\le\) 0.3, and a sequence length between 40 and 10 000 residues. Only X-ray determined and non-C\(\alpha\)-only structures were selected. Subsequently, the selected proteins were filtered at 25% seq.ID. Finally, only monomers were selected, and transmembrane proteins were excluded. The resulting set contains 4 917 proteins. Because the model is unable to predict the total hydrophobic surface area, MolPatch22 was used to generate, per protein, the rank of each hydrophobic patch. For amino acids belonging to multiple patches the rank of the largest patch was assigned. The dataset was split into 60% training sequences, 15% validation sequences, and 25% test sequences. The Pearson correlation coefficient (PCC) is used for measuring model performance on the HPR regression task.
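To make the buried-residue (BUR) definition in the solvent accessibility item above concrete, the following is a minimal sketch of the computation, assuming absolute ASA values per residue; the maximal-accessibility values shown are approximate examples only, not the exact reference table used for the benchmark.

```python
# Illustrative maximal accessible surface areas (in square Angstrom) per residue
# type; these are approximate example values, not the reference table.
MAX_ASA = {"A": 129.0, "G": 104.0, "L": 201.0, "W": 285.0}

def buried_labels(sequence, asa, threshold=0.07):
    """Label residues as buried (1) or exposed (0) from absolute ASA values.

    A residue is buried if its relative solvent accessibility, i.e. the
    absolute ASA divided by the maximal ASA for that amino acid type,
    is below the 7% threshold used in the benchmark.
    """
    labels = []
    for aa, a in zip(sequence, asa):
        rsa = a / MAX_ASA[aa]
        labels.append(1 if rsa < threshold else 0)
    return labels

print(buried_labels("AGLW", [5.0, 80.0, 10.0, 200.0]))  # [1, 0, 1, 0]
```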

Based on previous studies on protein structural properties using a variety of prediction models, we expect SS3 and BUR to be easy prediction tasks, followed by SS8 and ASA. The PPI and EPI prediction tasks are expected to be of ‘medium’ difficulty18,20. The hydrophobic patch prediction is expected to be a hard prediction task22. The SS3 and SS8 tasks are naturally related, as are BUR and ASA; we also expect some correlation between these pairs, as, for example, loops tend to be exposed1.

All the datasets of the ProteinGLUE benchmark suite are provided in the TensorFlow and CSV format, making the set easily reusable by the community. We refer to the section TensorFlow format of datasets in the supplement for detailed information about this data format.

Pre-trained models

To set a challenging baseline for our datasets, we trained two transformer models based on the BERT architecture7. We provide a BERT medium model, with 8 hidden layers, 8 attention heads and a hidden size of 512, and a BERT base model with 12 hidden layers, 12 attention heads and a hidden size of 768. The “hidden size” refers to the dimensionality of the vectors representing the tokens in the hidden layers (between transformer blocks).

Pre-training data

Our baseline models were pre-trained on the Pfam dataset, more specifically the sequences from the PfamA 33.1 dataset23, filtered on 90% sequence identity. Similar to the BERT training process, we divide the amino acid sequences into regular sequences, which were at most 128 tokens long, and big ones, which were at most 512 tokens long7. We discarded sequences longer than 512 amino acids, but as single domains longer than 512 amino acids are exceedingly rare, this does not exclude a significant fraction of the data. For both big and regular sequences, a test and a validation set were split off from the training data, each containing 10% of the total number of protein sequences. The resulting pre-training dataset consists of a training set of 13 065 370 regular and 14 687 695 big sequences, a validation set of 1 469 855 regular and 1 835 963 big sequences, and a test set of 1 469 855 regular and 1 835 962 big sequences.

The downstream tasks may be used with models which have used other pre-training datasets than PfamA. In fact, in the natural language domain, progress in pre-training has often been the consequence of better-curated data, in addition to model improvements (in terms of size and architecture)8. We do, however, urge caution in assuming the source of performance improvements. If a new model and a different pre-training dataset are used, then, where possible, an ablation study should be performed67. For this purpose we provide the precise, canonical subset of Pfam used for training our baseline models.

Unless mentioned otherwise, all aspects of the model were taken from the BERT model7.

Pre-processing Following the previously mentioned separation of sequences into regular (at most 128 tokens) and big (at most 512 tokens) sequences, we tokenize sequences into amino acids, giving us a base vocabulary of 20 tokens. We also reserve 20 special tokens, used to annotate the sequence. Four of these, named PAD, CLS, MSK, and SEP, are used in pre-training as explained below. The remaining 16 tokens are reserved for potential use in downstream tasks. These are not used for any of our downstream tasks, but they may be useful for others.

While Devlin et al.7 slice fixed-length contiguous sub-sequences out of the corpus, we always train on full-length proteins. This means that our input sequences are variable length, so we pad each batch using PAD tokens so that all sequences within each batch have the same length.
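As a minimal illustration of this pre-processing, the sketch below tokenizes protein sequences into per-amino-acid token ids over the 20-letter base vocabulary plus the special tokens, and pads each batch to the length of its longest sequence. The specific id assignment, helper names, and the placement of the CLS token are our own assumptions, not the reference implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                                      # 20-token base vocabulary
SPECIAL = ["PAD", "CLS", "MSK", "SEP"] + [f"RES{i}" for i in range(16)]   # 20 reserved special tokens

# Ids for the special tokens first, then the amino acids (assignment is illustrative).
VOCAB = {tok: i for i, tok in enumerate(SPECIAL)}
VOCAB.update({aa: len(SPECIAL) + i for i, aa in enumerate(AMINO_ACIDS)})

def tokenize(sequence):
    """Map a protein sequence to token ids, one token per amino acid."""
    return [VOCAB["CLS"]] + [VOCAB[aa] for aa in sequence]

def pad_batch(batch):
    """Pad all tokenized sequences in a batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [VOCAB["PAD"]] * (max_len - len(seq)) for seq in batch]

batch = pad_batch([tokenize("MKTAYIAK"), tokenize("GAVLIW")])
print([len(seq) for seq in batch])  # all sequences in the batch now share one length
```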

Pre-training methods

Following the structure of the original BERT models, with some slight deviations, we define two pre-training tasks:

Masked token prediction (MTP) We change a small percentage of tokens in the input sequences; the task is to reproduce the original tokens in the sequence. Of the changed tokens, some are masked (replaced by the masking token), others are corrupted by replacing them with randomly chosen, but different, amino acids, and the rest remain unchanged. Replacing with a different amino acid is a change from the BERT setup because, with a vocabulary of only 20 amino acids, the chance of randomising into the same amino acid would otherwise give a non-negligible performance boost. The model receives no indication of which tokens are changed, and the loss is only computed over changed tokens. We use a 15% chance for a token to be changed; 80% of these are masked, 10% are corrupted, and 10% remain unchanged (but do contribute to the loss). In these proportions we follow Devlin et al.7 precisely.
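The corruption scheme above can be sketched as follows; the probabilities follow the text, while the token-id layout and function name are illustrative assumptions.

```python
import random

AMINO_ACID_IDS = list(range(20, 40))   # illustrative ids for the 20 amino acids
MSK_ID = 2                             # illustrative id for the MSK token

def corrupt(tokens, p_select=0.15):
    """Apply the masked-token-prediction corruption scheme to a list of token ids.

    Returns the corrupted sequence and a per-position flag marking which
    tokens were selected; the loss is only computed over selected positions.
    Special tokens (PAD, CLS, SEP) are never selected.
    """
    corrupted, selected = [], []
    for tok in tokens:
        if tok in AMINO_ACID_IDS and random.random() < p_select:
            selected.append(True)
            r = random.random()
            if r < 0.80:      # 80%: replace by the masking token
                corrupted.append(MSK_ID)
            elif r < 0.90:    # 10%: replace by a randomly chosen *different* amino acid
                corrupted.append(random.choice([a for a in AMINO_ACID_IDS if a != tok]))
            else:             # 10%: keep unchanged, but still contribute to the loss
                corrupted.append(tok)
        else:
            selected.append(False)
            corrupted.append(tok)
    return corrupted, selected
```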

Next sentence prediction (NSP) To stimulate the model to learn a representation of the whole sequence, a sequence-level task is added. That is, one label should be predicted for the whole sequence in addition to the ones for individual tokens. Devlin et al.7 created this task by either concatenating two half-length sequences from the corpus or using one whole-length sequence, and having the model predict which is the case. A special CLS token was prepended to every training example, and the representation of this token was used to predict the target label.

We adapt the NSP task for application to protein sequence data by selecting either whole sequences from the data or concatenating the first part of one sequence and the second part of another sequence chosen at random. Sequences are cut or left unchanged with equal probability. To make the NSP task more challenging, we cut each sequence at a random point between 40 and 60% of its length, to avoid having the cut in the exact same relative position every time. We place a SEP token between the two separate chunks of protein sequence, or in the middle of the complete sequence, using the same 40-60% selection to simulate a cutting point. Another SEP token is placed at the end, before the padding tokens. Following Devlin et al.7, we also add a learned segment embedding to every token which indicates whether the token belongs to the first or the second part of the sequence. Summed with the token and position embeddings, these form the input embeddings.
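A sketch of how such an example could be constructed is given below; the special-token ids and the labelling convention (1 for a spliced pair, 0 for a whole sequence) are illustrative assumptions.

```python
import random

CLS_ID, SEP_ID = 1, 3   # illustrative special-token ids

def make_nsp_example(seq_a, seq_b):
    """Build one adapted next-sentence-prediction example from two token-id lists.

    With probability 0.5 the first 40-60% of seq_a is concatenated with the
    last part of seq_b (label 1, spliced); otherwise seq_a is kept whole
    (label 0), with the SEP token placed at a simulated 40-60% cut point.
    """
    cut_a = random.randint(int(0.4 * len(seq_a)), int(0.6 * len(seq_a)))
    if random.random() < 0.5:
        cut_b = random.randint(int(0.4 * len(seq_b)), int(0.6 * len(seq_b)))
        first, second, label = seq_a[:cut_a], seq_b[cut_b:], 1
    else:
        first, second, label = seq_a[:cut_a], seq_a[cut_a:], 0
    tokens = [CLS_ID] + first + [SEP_ID] + second + [SEP_ID]
    # Segment ids (0 for the first part, 1 for the second) are summed with the
    # token and position embeddings inside the model.
    segments = [0] * (1 + len(first) + 1) + [1] * (len(second) + 1)
    return tokens, segments, label
```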

Batching For efficiency reasons, each batch either contains only regular sequences or only large sequences. We schedule the proportion of regular versus large sequence batches in each epoch. Training starts with 0% large sequence batches and linearly increases to 50% by the end of the run. Our assumption is that at the start of training, the model is not yet learning long dependencies, so it is more efficient to train on short sequences. Near the end of training, the model is hopefully learning longer dependencies, and the longer sequences become a valuable input.
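For concreteness, the scheduled proportion of large-sequence batches described above can be written as a small function of the training step (a sketch of the linear 0-50% ramp from the text; the function name is ours).

```python
def large_batch_fraction(step, total_steps):
    """Fraction of batches drawn from the large-sequence set at a given step.

    Ramps linearly from 0% at the start of training to 50% at the end.
    """
    return 0.5 * min(step / total_steps, 1.0)

print(large_batch_fraction(0, 2_000_000))          # 0.0
print(large_batch_fraction(1_000_000, 2_000_000))  # 0.25
print(large_batch_fraction(2_000_000, 2_000_000))  # 0.5
```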

For each batch, we perform both tasks: the batch is made up of either original sequences or concatenated cut sequences, and in both cases, masking is applied as described above. We then compute losses for both the MTP and NSP tasks, using categorical cross-entropy, and add them together to produce the total loss.

Pre-training results

Both the BERT medium and base models were pre-trained using the Layer-wise Adaptive Moments (LAMB) optimizer68. We used a learning rate of 0.00025 and batch sizes of 512 and 128 sequences for regular and big batches, respectively. Both runs took approximately two weeks of wall-clock time of continuous training on four parallel TitanRTX GPUs using mixed precision. The medium model was trained for 2 000 000 steps while the base one was trained for 1 000 000 steps.

The BERT medium model converged on roughly 38% accuracy for the regular sequences in the masked token prediction task, and was still improving on the large sequences at the end of this run. With longer training times, we would expect the accuracy for big sequences to also converge at around 38%, but the latest average training accuracy after the full 2 000 000 steps at the end of this run was approximately 35%. The average validation and testing accuracy for the masked token prediction were also both around 35% at the end. The NSP accuracy converged much faster than the masked token prediction and ended at an average training, validation, and testing accuracy of around 96% (see Supplementary Fig. 1A).

Figure 1

Improved performance on downstream tasks with pre-trained models. Prediction performances on the benchmark test set, as introduced in section ProteinGLUE Benchmark tasks, for a medium model without the pre-training step (pink), a medium model including pre-training until step 2 000 000 (dark red), a base model without the pre-training step (grey), and a base model including pre-training until step 1 000 000 (dark blue). The yellow lines indicate the performance of a random or majority-class baseline. All models are trained ten times on their selected set of hyperparameters, after which the mean performance and standard error are determined.

The BERT base model showed a similar training progression to the medium model after only half the number of steps. This faster progression is likely due to its added complexity. The base model also performed better than the medium one overall, converging on roughly 41% accuracy for the regular sequences in the masked token prediction task, with the performance on big sequences again still improving, resulting in an average training accuracy of 38% at the end of the 1 000 000 steps. The average validation and testing accuracies at the end of this run were both approximately 38% as well. The NSP accuracy again converged considerably faster and ended at an average training, validation, and testing accuracy of around 97% (see Supplementary Fig. 1B).

Fine-tuned models

To provide baseline performance estimates, and as a proof-of-concept for pre-training performed as described in the previous section, we fine-tune both our BERT-derived models for all downstream tasks included in the ProteinGLUE benchmark dataset.

Fine-tuning methods

Architecture and parameters The fine-tuning models for the downstream tasks consist of the transformer model and a classifier. The size of a fine-tuning model is therefore dependent on the size of the pre-trained transformer. The classifiers for the classification tasks (SS3, SS8, BUR, PPI, and EPI) consist of a non-linear layer, a dropout layer, a classification layer, and a probability layer. The activation function in the non-linear layer is the Gaussian Error Linear Unit (GELU)69, which was also used in the pre-training layers. The dimension of the output space of the classification layer is set to the number of classes plus one, to account for the padding of sequences. The activation function applied to the output in order to generate the probabilities is the softmax function. The classifier for the regression tasks (ASA and HPR) consists of a linear layer, a dropout layer, and a regression layer. The activation function of the linear layer is also the GELU activation function. The dimensionality of the output space in the regression layer is set to 1.
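As an illustration, a token-level classification head matching this description could look like the following Keras sketch (assuming a recent TensorFlow version); it is an approximation of the setup described above, not the reference code.

```python
import tensorflow as tf

def classification_head(hidden_size, num_classes, dropout_rate):
    """Per-token classification head applied on top of the transformer outputs.

    The output dimension is num_classes + 1 because padded positions get
    their own class. For the regression tasks, the final two layers would be
    replaced by a single linear output unit.
    """
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_size, activation="gelu"),  # non-linear layer
        tf.keras.layers.Dropout(dropout_rate),                  # dropout layer
        tf.keras.layers.Dense(num_classes + 1),                 # classification layer
        tf.keras.layers.Softmax(axis=-1),                       # class probabilities
    ])

# Example: an SS8 head on top of base-model embeddings (hidden size 768).
head = classification_head(hidden_size=768, num_classes=8, dropout_rate=0.1)
probs = head(tf.random.normal([4, 512, 768]))   # shape (batch, seq_len, 9)
```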

We tuned the hyperparameters batch size, learning rate, and dropout rate for all downstream tasks by training models on the training sets and comparing performances on the validation set. We performed an exhaustive grid search on the pre-trained medium and base models separately. The dropout rate is the rate used in the dropout layer of the fine-tuning classifier. For the medium model we considered batch sizes 8, 16, and 32, learning rates 6.25e-5, 1.25e-4, and 2.5e-4, and dropout rates 0.0, 0.05, 0.1, and 0.2. When a high performance was attained for the largest learning rate, the learning rate was further increased to 5.0e-4 or 1.0e-3. For the base model we considered the same values, except for the batch size, where only batch sizes of 4 and 8 were included due to memory limitations. After the exhaustive grid search, the 4 to 8 best performing hyperparameter sets were selected for each prediction task. The models were trained 4 times on each set, after which the best hyperparameters were selected based on the mean performance (see Supplementary Table 1).

We decided to set the maximum length of the considered sequences to 512 and keep this value constant over all downstream prediction tasks. As for the pre-training, the LAMB optimiser was used. We consider the fine-tuning models converged when they have been trained for 15 epochs or 2 000 steps.

For the protein interface and epitope prediction tasks, we had to deal with the class imbalance of the datasets. The PPI interface training set consists of 16 605 residues labelled as interface residues and 66 180 residues labelled as non-interacting. The epitope training set consists of 4 503 and 41 938 residues labelled as interacting and non-interacting, respectively. We included the ratio of the number of non-interface residues over the number of interface residues as a weight in the loss function. This weight was therefore set to 3.99 for the PPI interface prediction and 9.31 for the epitope prediction.
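A minimal sketch of how such a weight can be derived and applied is given below; the weighted loss shown is a generic illustration, not the exact loss used in the reference code.

```python
import numpy as np

def interface_class_weight(n_interface, n_non_interface):
    """Weight applied to the positive (interface) class in the loss."""
    return n_non_interface / n_interface

def weighted_binary_cross_entropy(y_true, p_pred, pos_weight):
    """Binary cross-entropy with the positive class up-weighted."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-7, 1 - 1e-7)
    loss = -(pos_weight * y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return loss.mean()

print(round(interface_class_weight(16_605, 66_180), 2))  # ~3.99 for PPI
print(round(interface_class_weight(4_503, 41_938), 2))   # ~9.31 for EPI
```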

All downstream tasks were trained and validated on a single compute node with a single TitanX GPU with 12 GB of GPU memory, with a (wall-clock) run time of about 4 hours.

Training and evaluation During training, the loss was determined by categorical cross-entropy for classification tasks, and by the mean absolute error for regression tasks. Model performance was assessed by the Pearson correlation coefficient (PCC) for the regression tasks (ASA and HPR), accuracy (ACC) for the classification of the structural components (SS3 and SS8) and the identification of buried residues (BUR), and the area under the receiver operating characteristic curve (AUC ROC) for the interface predictions (PPI and EPI). Note that the range of the PCC is between −1 and 1, whereas the range of the ACC and AUC is between 0 and 1. We determined random prediction performances for all downstream prediction tasks in the benchmark set. For the SS3, SS8, and BUR prediction tasks this random performance is set to the fraction of majority-class residues over all residues. For ASA and HPR we compare the labels with a sequence of the same size sampled from a standard normal distribution; the expected random performance, determined in this way, is close to zero. For the two interface prediction tasks we set the baseline to the random performance of a ROC curve, which is 0.5.
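The metrics and random baselines described here can be computed as in the sketch below (using scikit-learn for the AUC; the data is synthetic and only serves to show the calls).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Accuracy and majority-class baseline for a classification task (e.g. SS3).
labels = rng.integers(0, 3, size=1000)
predictions = rng.integers(0, 3, size=1000)
accuracy = (predictions == labels).mean()
majority_baseline = np.bincount(labels).max() / labels.size

# Pearson correlation and its random baseline for a regression task (e.g. ASA):
# compare the labels against a standard-normal sample of the same size.
targets = rng.normal(size=1000)
random_pcc = np.corrcoef(targets, rng.normal(size=1000))[0, 1]   # expected near zero

# AUC ROC for an interface prediction task (random baseline is 0.5).
interface = rng.integers(0, 2, size=1000)
scores = rng.random(1000)
auc = roc_auc_score(interface, scores)
```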

The previously described pre-trained medium and base models were used for the hyperparameter tuning, and the performance of the medium and base models was compared per downstream task. For comparison, we also trained the downstream tasks on both models without the pre-training step. During training of the pre-training models, two intermediate checkpoints were stored for each model: the medium model at steps 500 000 and 1 600 000, and the base model at steps 350 000 and 700 000. The predictive performance on the different downstream tasks was evaluated over these checkpoints to check for overfitting. All fine-tuning models were trained ten times, after which the mean performance and standard error on the validation set were determined.

Table 2 Overview of the number of protein sequences in the training, validation and test set of the seven downstream protein structural prediction tasks.

Fine-tuning results

In this study, we selected seven protein structural prediction tasks to evaluate pre-trained transformer models in a varied way. Our pre-trained models include a BERT medium and a BERT base model trained on protein domain sequences from the PfamA database. The datasets for these tasks were based on previous state-of-the-art prediction studies on these tasks. Each selected dataset was divided into a training, validation, and test set, see Table 2.

Hyperparameter tuning Similar to the study of Devlin et al.7, most hyperparameters were set to the values used for the pre-training tasks. However, the downstream tasks are considered to have converged after reaching 2 000 steps, or 15 epochs. We tuned the batch size, learning rate, and dropout rate for each task specifically, on both the converged medium model and the converged base model (see Supplementary Table 2).

Model performance The pre-trained models were used as a basis to train and evaluate models for the seven fully supervised downstream tasks. It is commonly assumed that, by allowing the model to learn the structure of the input data from a large amount of unlabeled sequences, the performance on specific tasks can be improved without requiring large amounts of labeled sequences, which are often difficult or expensive to acquire. Figure 1 tests this assumption using our data and models. We compare the performance of the medium and base models on the downstream tasks, with and without pre-training. The results show a clear improvement from pre-training for 6 out of the 7 downstream tasks. For the epitope prediction, both the medium and base pre-trained models perform slightly worse than the non-pre-trained models. On the validation set, the improvement from pre-training is shown for all tasks (see Supplementary Fig. 2). For the hydrophobic patch regression (HPR), however, the results are less clear. There is minimal improvement for the medium model. The base model does show improvement, but that is because the base model without pre-training far underperforms compared to the medium version. This appears to be the most challenging task in our set of benchmarks, and the one for which models behave the most counter-intuitively.

Figure 2

Performance of downstream tasks monitored during pre-training. Prediction performances of the benchmark test set, as introduced in section ProteinGLUE Benchmark tasks, (a) for a medium sized model, and (b) for a base sized model, trained without the pre-training step (pink/grey), and on a pre-trained model for 350 000 (light red/blue), 700 000 (red/blue) and 1 000 000 steps (dark red/blue). Further details as in Fig. 1.

For the medium model, an accuracy of 59% with pre-training and 39% without pre-training is obtained on the SS8 prediction using the CB513 test set. For the base model, an accuracy of 58% with pre-training and 34% without pre-training is obtained on this test set.

Furthermore, we compare the downstream task performances against their random performance. The random performance of the SS3, SS8, and BUR predictions was set to the majority-class fraction, which is 39%, 32%, and 70%, respectively. The random performance of the ASA prediction was estimated to be –8.9e–4, and that of the HPR prediction to be –6.0e–4, i.e. both very close to zero. The random performance for the AUC ROC performance measure was set to 0.5. We conclude that in all cases, except for the non-pre-trained base model on the SS8 and BUR prediction tasks, the model performances, including standard error bars, exceed the random performance.

To check for convergence of the pre-trained models, we also monitored the performance on the downstream tasks during pre-training, which is shown in Fig. 2. Results on the validation set are shown in Supplementary Fig. 3. Our main observation here is that for most tasks, pre-training confers a strong advantage. For the medium model (Fig. 2a), the performance increases with the amount of pre-training, aside from some small fluctuations. For the larger base model (Fig. 2b), we note more uneven progress over the number of training steps, with greater standard error.

Figure 3

Performance on downstream task is generally stable between test and validation sets. Prediction performances of the benchmark validation and test set for both the converged medium and base models. Further details are given in Fig. 1.

Figure 3 shows the difference between the validation and test performance. For some tasks the validation performance is substantially higher than the test performance. To some degree this is to be expected, but it may also indicate over-tuning of hyperparameters. Overall, however, model performance is stable over the test and validation sets.

Discussion

We present the ProteinGLUE benchmark set: a collection of classification and regression tasks in the protein sequence domain. Our main aim with this resource is to provide a standardized way to evaluate pre-trained protein models, and to provide clearer and more informative comparison between such models. We have also provided initial results of baseline models and checkpoints of these models for further development. All code used to train and evaluate our models is available online at https://github.com/ibivu/protein-glue.

ProteinGLUE is not the first publicly available benchmark dataset in the field of protein modeling. The TAPE benchmark has been used by multiple groups and has thereby served as a useful tool to compare model performances. In comparison with TAPE, our dataset covers more protein-function-related tasks, which have also not yet been covered by other groups.

We pre-trained two transformer models inspired by the BERT medium and base architectures. Pre-training was performed with two objectives commonly used in NLP: masked symbol prediction, and next sentence prediction, which we adapted to predict matching halves of a protein sequence. Given the fast convergence on the NSP objective, with a final accuracy close to 100% (Fig. 1), future work could investigate how this task could be made more difficult and how a harder pre-training task may help improve downstream performance.

We evaluated prediction performances on the ProteinGLUE benchmark set for secondary structure in three (SS3) and eight (SS8) classes, buried residues (BUR) and absolute solvent accessibility (ASA), protein-protein interaction interfaces (PPI), epitope interfaces (EPI), and hydrophobic patches (HPR), using the medium and base pre-trained BERT models. We compared results against performances of these models without pre-training (Fig. 1). Except for the epitope interface (EPI) prediction, all other benchmark tasks (SS3, SS8, BUR, ASA, PPI, HPR) achieved better performance with the pre-trained models. We also see that the secondary structure tasks (SS3 and SS8) show similar trends across models as the solvent accessibility tasks (BUR and ASA), indicating that these tasks rely on similar patterns1. As expected, the hydrophobic patch (HPR) prediction is the most challenging task22. Results of the SS8 prediction on the CB513 test set indicate similar performance for the medium and base models, 59% and 58% respectively. Compared to other models (see Table 1), a similar accuracy is obtained as for the TAPE Transformer (59%, the same as our medium model), whereas accuracies approximately 10 percentage points lower are obtained compared to the other protein representation models, which achieve 70-73%. This could partially be due to the smaller models used here, and due to the use of multiple sequence alignments as input, which we do not include.

In order to make strong claims about zero-shot learning, such as those in7 and32, a rigorous effort should be made to remove any overlap between the pre-training set and the downstream training sets. Here, because of evolutionary relations between protein sequences, this is not a trivial task. In particular, Pfam is built from seed alignments based on proteins with known structure from the PDB, and the downstream task annotations are also based on the PDB. Nevertheless, we may still draw firm conclusions on the usefulness of pre-training for protein structure-related property prediction.

The larger base model does not always outperform the medium model (Fig. 1). However, the differences are small and not statistically significant, and are not indicative of a lack of convergence for the base model. Additionally, for some datasets, performance does not increase monotonically with the number of training steps (Fig. 2). While these instances are rare, this suggests there may be some benefit to early stopping in pre-training, or to added regularization to allow for more uniform convergence. We leave this as a matter for future work.

While it is not our aim to outperform state-of-the-art structural prediction methods, which typically use feature sets that were handcrafted and tuned over many years of painstaking research (e.g. Table 1 for SS8), we compare the ProteinGLUE benchmark results against these methods in order to provide a general understanding of the performances. Note that for most comparisons the test sets are different, so the observed differences in performance should only be taken as a rough approximation. The OPUS-TASS method2 outperforms earlier studies on secondary structure prediction based on the dataset created by Hanson et al.46, which is also used here. OPUS-TASS reaches 89% and 79% on SS3 and SS8 prediction, respectively, on one of their test sets. We reach 75% accuracy with the pre-trained base model for SS3 and 62% accuracy with the pre-trained medium model for SS8. The SeRenDIP method16,18, trained on the combined dataset of homodimer and heterodimer protein-protein interactions, resulted in an AUC ROC of 0.72 on the homodimer test set and 0.64 on the heterodimer test set. We reach an AUC ROC for PPI of 0.62 on the combined test set. The SeRenDIP-CE method20 for epitope interface prediction reaches an average AUC ROC of 0.69 over its 5 folds. We reached an average AUC ROC for EPI of 0.67 over ten training runs on the first fold. Clearly, even though we could show the added value of pre-training the transformer models, in their current form these models are not yet competitive with the established state of the art for the various prediction tasks.

There are many research questions about pre-trained protein models that we hope these tasks can help to investigate, for instance the question of how best to fine-tune a pre-trained model for a specific task. For simplicity, we used the original BERT approach of fine-tuning all weights in our baselines, but there are indications that freezing weights may help for some models, and that layer-wise decay is preferable for others70,71. In general, there are many ways to fine-tune a model for a downstream task (including distillation72 and prompt tuning73), all of which may be evaluated with these datasets.

One limitation of large models in general is that the state of the art eventually falls to models so large that even fine-tuning itself becomes a non-trivial task, requiring a large amount of computational resources. This can be mitigated in various ways, for example by fine-tuning with some or all of the lower layers of the model frozen, so that their activations can be precomputed. Which techniques are available, and what their impact on performance is, we leave as a topic for future research.

Compared to the standardized benchmarks available in the NLP domain, a set of seven tasks is a modest start; we hope that further benchmark sets will follow ours. All tasks included in our ProteinGLUE benchmark set currently label individual amino acids: no tasks were included for which the label applies to the whole sequence, or to a region such as a protein domain. Including tasks like fold or function prediction may provide such opportunities. Here, we evaluated the added value of pre-training using transformer models, as these are the most widely used at the moment; however, our benchmark set is equally useful for evaluating any sequence-to-sequence prediction method. We expect our ProteinGLUE benchmark set will prove to be useful in itself, and that, combined with the baseline models and performance comparison presented here, it will provide a starting point for further improvement of deep learning approaches, and transformer-based models in particular, in the exciting field of protein structural and functional property prediction.