The prowess of protein language models (PLMs) has been demonstrated across a variety of tasks, such as protein structure prediction, function analysis and engineering, and novel protein design. Transformers, a deep learning architecture that excels at learning relationships in sequence data, are commonly employed as the backbone of PLMs: they are first pretrained on huge datasets of protein sequences to become versed in the ‘language’ of the protein universe and then adapted for multiple downstream tasks. However, their remarkable performance comes at the cost of a high computational burden, which limits the length of the protein sequences they can digest. Curious to know whether transformers were the only architecture that would work for PLMs, Kevin Yang and colleagues at Microsoft Research New England explored the potential of another architecture for building PLMs.

The team experimented with convolutional neural networks (CNNs), which were developed earlier than transformers in deep learning research and have also been widely applied to biological data analysis. One of CNNs’ major appeals is that their compute scales linearly with sequence length, compared with the quadratic scaling of transformers. Yang and colleagues built a series of CNN-based protein language models called CARP (convolutional autoencoding representations of proteins) using the same pretraining strategy and dataset as the popular transformer-based PLM ESM. When the team compared the models on both the pretraining task and a number of downstream tasks (for example, prediction of protein structure, mutation effects, fitness, fluorescence and stability), they found, to their surprise, that the overall performance of CARP was on par with, and in some cases even better than, that of ESM. Furthermore, “We were surprised that, for both architectures, downstream performance did not necessarily improve for bigger models with better pretrain performance,” says Yang.
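For readers who want a concrete picture of this setup, the sketch below (PyTorch) illustrates masked-token pretraining with a dilated 1D CNN encoder over protein sequences. It is a minimal illustration only: the vocabulary, layer sizes and masking rate are assumed placeholders, not the actual CARP or ESM hyperparameters.

```python
# Minimal sketch (PyTorch) of masked-language-model pretraining with a dilated
# 1D CNN encoder over protein sequences. Vocabulary, layer sizes and masking
# rate are illustrative placeholders, not CARP's actual hyperparameters.
import torch
import torch.nn as nn

VOCAB = 21          # 20 amino acids + a mask token (illustrative)
MASK_ID = 20
D_MODEL = 128

class DilatedCNNEncoder(nn.Module):
    """Stack of residual 1D convolutions with growing dilation: the receptive
    field grows exponentially with depth while compute stays linear in length."""
    def __init__(self, vocab=VOCAB, d_model=D_MODEL, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size=3,
                      padding=2 ** i, dilation=2 ** i)
            for i in range(n_layers)
        )
        self.to_logits = nn.Linear(d_model, vocab)

    def forward(self, tokens):                    # tokens: (batch, length)
        x = self.embed(tokens).transpose(1, 2)    # (batch, d_model, length)
        for conv in self.layers:
            x = x + torch.relu(conv(x))           # residual dilated conv block
        return self.to_logits(x.transpose(1, 2))  # (batch, length, vocab)

def masked_lm_step(model, tokens, mask_rate=0.15):
    """One masked-token reconstruction step: corrupt ~15% of positions with the
    mask token and train the model to recover the original residues there."""
    mask = torch.rand(tokens.shape) < mask_rate
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

model = DilatedCNNEncoder()
seqs = torch.randint(0, 20, (4, 512))             # toy batch of 4 sequences
loss = masked_lm_step(model, seqs)
loss.backward()
```

Because every layer is a convolution, the cost of a forward pass grows proportionally with sequence length rather than with its square.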

As expected, the scalability advantage of CARP over ESM was also borne out by run-time and memory comparisons. As a result, CARP can handle protein sequences longer than 4,000 residues, whereas the input length of the original ESM-1b model is limited to about 1,000 residues.
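A back-of-the-envelope calculation makes the memory gap intuitive: self-attention materializes an L × L score matrix per head, so its footprint grows quadratically with sequence length L, whereas a convolutional layer only stores activations proportional to L. The head and channel counts below are assumed, illustrative values rather than measured figures from the study.

```python
# Illustrative memory estimate (not measured values): quadratic attention
# scores versus linear convolution activations, per layer, in float32.
def attention_scores_gb(length, n_heads=20, bytes_per_val=4):
    return n_heads * length * length * bytes_per_val / 1e9

def conv_activations_gb(length, channels=1280, bytes_per_val=4):
    return channels * length * bytes_per_val / 1e9

for L in (1000, 4000, 16000):
    print(f"L={L:>6}: attention scores ~{attention_scores_gb(L):7.2f} GB, "
          f"conv activations ~{conv_activations_gb(L):5.3f} GB per layer")
```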

The boom in PLMs in fact poses challenges for evaluation efforts like this one. “The field is currently moving so fast that new methods come out faster than we can benchmark them, and because different groups have different priorities, there’s a lot of fragmentation in the downstream tasks that people care about,” says Yang. Pretraining, another important pillar of modern PLMs, also leaves room for improvement. “While PLMs have been very successful at bioinformatics and especially at structure prediction tasks, the sequence reconstruction pretraining task limits the types of generalization they can perform when predicting structure and limits their scalability when predicting more complex function,” notes Yang. These and many other open directions exemplify the vast opportunities in mapping out the next frontiers of PLMs.

Original reference: Cell Syst. https://doi.org/10.1016/j.cels.2024.01.008 (2024)