Main

The ‘unreasonable effectiveness’ of deep learning1 in domains such as computer vision and natural language processing (NLP) relies on the ability of deep neural networks to leverage ever-increasing amounts of compute, data and model capacity. Large-scale models, including Bidirectional Encoder Representations from Transformers (BERT)2 and DALL-E3, have been so successful at synthesizing information from large datasets via self-supervised pre-training and performing a variety of downstream tasks with little to no fine-tuning that most state-of-the-art models in NLP and computer vision are adapted from a small set of large, pre-trained models4. Naturally, we might expect that massive model and dataset scaling will be a prerequisite to achieving out-sized success for deep learning in science. Recent work such as AlphaFold5, the Open Catalyst Project6,7 and ChemBERTa8 indicates that larger datasets and models, pre-training and self-supervised learning—all key ingredients in computer vision and NLP—unlock new capabilities for deep learning in chemistry. However, unlike in computer vision and NLP, the path to scaling deep chemical networks and the potential benefits are unclear. Chemical deep learning can incorporate physics-based priors that may ameliorate the steep resource requirements seen in other fields9,10,11,12. Moreover, because of the heterogeneity and complexity of chemical space13 and molecular machine learning tasks14,15, training general and robust models that perform well on a wide variety of downstream tasks remains a pressing challenge8,16,17. The enormity of chemical space and heterogeneity of these tasks motivates investigations of large-scale models in chemistry, because such models are well suited to unlabelled, multi-modal datasets3,4. Recently, neural-scaling laws18,19 have emerged as a way to characterize the striking trends of improved model performance over many orders of magnitude with respect to model size, dataset size and compute; however, these experiments require immense computational resources and rely on well-known, domain-specific model training procedures that do not apply outside of traditional deep learning application areas.

With the inordinate costs of developing and deploying large models20, it is difficult to investigate neural-scaling behaviours of scientific deep learning models, which require expensive hyperparameter optimization (HPO) and experimentation. Architectures and hyperparameters that work well for small models and small datasets do not transfer to larger scales21. This presents a risk that scientific deep learning will become increasingly inaccessible as resource demands increase. Techniques for accelerating neural architecture search and hyperparameter transfer such as training speed estimation (TSE)22 and μTransfer21 could accelerate the development of large-scale scientific deep learning models, where rapid advances in architecture design and complex data manifolds prevent the easy transfer of parameters and settings used in computer vision and NLP. To investigate the capabilities of deep chemical models across resource scales, practical and principled approaches are needed to accelerate hyperparameter transfer and characterize neural scaling.

In this Article, we develop strategies for scaling deep chemical models and investigate neural-scaling behaviour in large language models (LLMs) for generative chemical modelling and graph neural networks (GNNs) for machine-learned interatomic potentials. We introduce ChemGPT, a generative pre-trained transformer for autoregressive language modelling of small molecules. We train ChemGPT models with over 1 billion parameters, using datasets of up to 10 million unique molecules. We also examine large, invariant and equivariant GNNs trained on trajectories from molecular dynamics and investigate how physics-based priors affect scaling behaviour. To overcome the challenges of hyperparameter tuning at scale in new domains, we extend techniques for accelerating neural architecture search to reduce total time and compute budgets by up to 90% during HPO and neural architecture selection. We identify trends in chemical model scaling with respect to model capacity and dataset size, and show the pre-training loss improvements achieved with increasing scale. Work concurrent with and following the original appearance of this paper has shown a wide range of performance on molecular property prediction tasks14 using pre-trained chemical language models23,24, from state-of-the-art to negligible or even negative performance. New research directions involve understanding the limitations of pre-trained representations25 from models including ChemGPT. Similarly, following the original appearance of our work, scaling in GNNs has shown immense success for chemical and biological systems26. Our core contribution is the discovery of neural-scaling laws across extremely diverse domains of chemical deep learning: language models and neural interatomic potentials. Our results provide motivation and practical guidance for scaling studies in scientific deep learning, as well as many fruitful new research directions at the intersection of massive scale and physics-informed deep learning.

Results

In this section, we describe our main results and the workflow developed in this paper, summarized graphically in Fig. 1.

Fig. 1: Discovery of neural-scaling relations for deep chemical models.
figure 1

a,b, Over a domain of model candidates (a), final, converged model loss is predicted from only a few initial epochs of training for large-scale models (b). c, Non-optimal model architectures and hyperparameter configurations are identified early in training, allowing for efficient selection of the ideal architecture and hyperparameters. The model with the best hyperparameters is then trained with varying model and dataset sizes to discover neural-scaling relations.

Accelerated hyperparameter optimization

To conduct extensive scaling experiments, we first need to find reasonable hyperparameters and training settings. Unlike for NLP and computer vision, there are no default model architectures, datasets, tasks, hyperparameter settings or training settings for large-scale chemical deep learning. Simply transferring empirical results from other deep learning domains or smaller-scale experiments will lead to suboptimal results21. Whereas large models and datasets are standard in traditional deep learning application areas, to investigate scaling in deep chemical models we must lay the groundwork for large-scale experiments. To this end, we first tackle the problem of accelerating HPO in general settings, for new model architectures, heterogeneous datasets and at scales that have not been previously investigated.

Figure 2 shows the results of training performance estimation (TPE) for ChemGPT models trained on 2 million molecules from the Molecular Sets (MOSES)27 dataset. MOSES is smaller than PubChem and is representative of datasets on which chemical generative models are typically trained27,28. Here we use MOSES to demonstrate how optimal settings for a chemical LLM such as ChemGPT are quickly discovered using TPE. To enable scaling experiments, we are mainly concerned with settings related to the learning dynamics (for example, batch size and learning rate) that will impact large-scale training and fluctuate depending on the type of model and the characteristics of the dataset. To demonstrate the effectiveness of TPE, we initialize ChemGPT with the default learning rate and batch size for causal language modelling in HuggingFace. We then vary the learning rate and batch size and train models with different hyperparameters for 50 epochs. Figure 2 shows the true loss after 50 epochs versus the predicted loss using TPE after only 10 epochs. R² = 0.98 for the linear regression (equation (8)), and Spearman’s rank correlation ρ = 1.0. With only 20% of the total training budget, we are able to identify model configurations that outperform the default settings from HuggingFace. The procedure is easily repeatable for new datasets and enables accelerated HPO.
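
As a minimal illustration of how these goodness-of-fit metrics can be computed, the sketch below compares TPE-predicted losses against converged losses for a handful of configurations; the numerical values are placeholders, not results from this work.

from scipy.stats import spearmanr
from sklearn.metrics import r2_score

# Hypothetical converged losses and TPE predictions (made from ~20% of the
# training budget) for five model configurations; values are illustrative only.
true_loss = [0.52, 0.47, 0.44, 0.41, 0.39]
pred_loss = [0.53, 0.48, 0.43, 0.42, 0.38]

print("R^2 =", r2_score(true_loss, pred_loss))
print("Spearman rho =", spearmanr(true_loss, pred_loss).correlation)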

Fig. 2: Optimal models are identified early in training with TPE and training of non-optimal models is stopped to save 80%+ total compute consumption.
figure 2

ChemGPT final validation loss (cross-entropy for causal language modelling) predicted from 20% of training budget using TPE. Model configurations are determined through a grid search of different batch sizes and learning rates. Models are trained on two million molecules from MOSES.

While training procedures for LLMs such as ChemGPT are well established, scaling neural force fields (NFFs) to larger datasets and more expressive models requires new, scalable training procedures17. Large-batch training through data parallelism is one method for accelerating training, but there are known limitations and the appropriate batch size varies widely across domains29. This problem is particularly acute for NFFs, because each datapoint contains 3N + 1 labels (one energy and 3N force components), where N is the number of atoms, creating a large effective batch size with large variance within each mini-batch. Hence, it has been observed that small batch sizes (even mini-batches of 1) work well across different NFF architectures9,30. TPE provides a method for quickly evaluating the speed–accuracy trade-off for different combinations of batch size and learning rate, which are interdependent and must be varied together to enable large-batch training.

TPE performs equally well for GNNs. We repeat the TPE procedure, varying the learning rate and batch size, for SchNet, Polarizable Atom Interaction Neural Network (PaiNN) and SpookyNet, training on 10,000 frames (1,000 frames per molecule) from the revised MD-17 dataset32 of 10 small organic molecules (100,000 structures taken from the original MD-17 dataset of ref. 31, with energies and forces recalculated at the PBE/def2-SVP level of theory). Using only 20% of the total training budget, we achieve excellent predictive power (Fig. 3) with TPE for SchNet and PaiNN. The variance in converged loss across configurations trained with the entire training budget is substantial, underscoring the importance of proper HPO.

Fig. 3: Optimal models are identified early in training with TPE and training of non-optimal models is stopped to save 80%+ total compute consumption.
figure 3

a–c, NFF (SchNet (a), PaiNN (b) and SpookyNet (c)) model performance—measured via equation (5)—predicted from ≤20% of the training budget using TPE. Model configurations are determined through a grid search of different batch sizes and learning rates. Models are trained on 10,000 frames from the revised MD-17 dataset.

Because SpookyNet is a complex architecture that includes non-local interactions and empirical corrections, it shows slow convergence and the training speed is less correlated with the converged model loss compared with SchNet and PaiNN. However, the rank ordering of model configurations for SpookyNet from TPE is still robust (Spearman’s ρ = 0.92), which allows for discarding non-optimal model configurations early in training, representing notable computational savings. The goodness-of-fit metrics for linear regressions using TPE are given in Table 1.

Table 1 Goodness-of-fit metrics for GNN model performance using linear regression from TPE

Neural scaling quantifies the improvements in loss

Next, with a strategy in place to efficiently scale up experiments using TPE, we investigate neural scaling in ChemGPT and NFFs. For each model, we perform TPE to identify good hyperparameter choices that are predicted to perform well over a range of model and dataset sizes. Then, we systematically vary the dataset size (d) and model size (m) and perform exhaustive experiments to determine the converged loss, L(m, d). For efficiency and to isolate scaling behaviour, we fix hyperparameters from TPE as m and d are varied, but strictly speaking the optimal hyperparameters will change as m and d vary21. Due to computational resource limitations, we train ChemGPT models for a fixed number of epochs (ten) to determine the loss.

Figure 4 shows the pre-training loss as a function of model and dataset size over many orders of magnitude. Models are trained in a self-supervised, causal language modelling setting and evaluated on next-token prediction for a fixed validation set. Surprisingly, no limitations in loss improvement are seen with increasing scale. The pre-training loss monotonically improves with increasing dataset size up to nearly 10 million molecules. Furthermore, for a fixed data budget, increasing model size provides monotonic improvements to the pre-training loss until the model reaches 1 billion+ non-embedding parameters. This indicates that even for small datasets, much larger models than were previously considered for deep generative modelling28,33 may be useful for pre-training. For the largest dataset considered here, diminishing returns to loss improvements are seen for models above 100 million non-embedding parameters. Interestingly, greater loss improvements are seen with increasing model sizes for smaller datasets than for larger ones. For the largest dataset considered, model loss saturates quickly beyond 100 million parameters. However, for the smallest dataset considered, the loss plateaus for model sizes between 10⁵ and 10⁷ parameters and then improves considerably. This indicates that for a fixed, small pre-training data budget, notable improvements in the pre-training loss are possible simply by scaling up the model size. Irrespective of model size, increasing dataset size provides continuous improvements to the loss with no evidence of diminishing returns for the dataset sizes considered here.

Fig. 4: Neural scaling of ChemGPT model performance (validation loss) as a function of model (number of non-embedding parameters) and dataset (number of tokens) size.
figure 4

ChemGPT is pre-trained on up to 10 million molecules (300 million tokens) from PubChem. Performance improvements are seen for models up to 1 billion non-embedding parameters and continuous improvements are observed with increasing pre-training dataset size.

Depending on the dataset size, regimes of power-law-like scaling behaviour are seen for different ranges of model sizes. Power-law scaling is graphically identifiable as an approximately straight line fit of loss versus model size on a log–log plot. For larger datasets, power-law scaling is observed for smaller model sizes. For example, the largest dataset shows approximate power-law scaling for models between 10⁵ and 10⁷ non-embedding parameters (Supplementary Fig. 1). Conversely, for smaller datasets, power-law scaling is observed for larger models and over a more limited range of model sizes. The smallest dataset shows approximate power-law scaling for models between 10⁷ and 10⁸ non-embedding parameters (not shown).

The breakdown in power-law scaling is indicative of ‘resolution limited’ neural scaling34, where the model is sufficiently large but the dataset is not, or vice versa. Identifying these resolution-limited regimes from the neural-scaling relations allows us to understand in general terms whether model loss improvements are limited by data availability or model capacity. The scaling exponent β is equal to 0.17 ± 0.01 for the largest dataset (Supplementary Fig. 1), after discarding the three largest models from the power-law fit. β = 0.30 ± 0.01 for the next largest dataset (Supplementary Fig. 2). The scaling exponent quantifies the loss improvements due to increasing model size, for a fixed data budget. A larger value of β corresponds to a steeper slope and better performance with increasing data/model size. The breakdown in power-law scaling is reflective of so-called broken neural-scaling laws35, which indicate that smoothly broken power-law functional forms are more general descriptions of neural-scaling behaviour.

GNNs exhibit robust neural-scaling behaviour

The potential benefits of large-scale GNNs are less clear than for LLMs, as are the relevant parameters to vary, due to the inequivalence of depth and width for GNNs36 and additional parameters beyond notions of model size that impact performance, for example, the nearest-neighbour cut-off in graph construction. To simplify GNN scaling experiments, here we vary GNN capacity (depth × width) by systematically changing network width and the number of convolutions (depth). We train GNNs to predict atomic forces from the ANI-1x dataset (5 million density functional theory calculations of small organic molecules)37, the largest publicly available dataset of energies and forces for small molecules. NFF models are trained with a learning rate scheduler that reduces the learning rate every 50 epochs without improvement in the validation loss, until the learning rate reaches 10⁻⁷. The loss is an L1 loss (equation (5)), shown in Fig. 5 over four orders of magnitude of dataset size.

Fig. 5: Neural scaling of PaiNN model performance (validation loss) as a function of model capacity (depth × width) and dataset size (number of geometries).
figure 5

PaiNN is trained to predict atomic forces from density functional theory calculations on small organic molecules from the ANI-1x dataset. Improvements to the loss are seen for models with greater capacity and continuous improvements are observed with increasing dataset size.

The neural-scaling results for the equivariant GNN, PaiNN (Fig. 5), show monotonic improvements to the loss with increasing dataset size. For a fixed dataset size, the converged loss is strongly correlated with the total training time (compute) and model capacity. Other than for 10³ datapoints (for which some small models reach convergence quickly), the converged loss has a Spearman correlation coefficient ρ ≥ 0.88 with the model capacity and ρ ≥ 0.75 with the total training time. This means that the best models are those with optimal capacity that are able to train the longest without the validation loss plateauing. The optimal capacity and the balance of depth versus width change with the dataset size; that is, the ideal GNN capacity is dataset-size dependent, and these choices can impact the converged loss. These effects may also be artefacts of random initialization that would diminish with repeated trials. Interestingly, there is a stark change at 10⁴ datapoints—the converged loss is then nearly perfectly rank correlated with model capacity (Spearman’s ρ ≥ 0.93). This might indicate that substantial overlap exists between the training and validation sets, such that higher-capacity models are merely exhibiting better memorization than lower-capacity models. In these experiments, the validation set is constructed from unseen geometries and seen species (chemical species are the same in the training and validation sets). Repeating these experiments with a hold-out set of unseen chemical species will reveal whether the same trend holds, which would indicate that rather than memorizing, the network is achieving generalization to new chemistries.

We observe similar trends in neural scaling for the invariant GNN, SchNet (Supplementary Fig. 3), although the equivariant GNNs, PaiNN and Allegro (Supplementary Fig. 4), show better scaling efficiency. A comparison of neural scaling between SchNet, PaiNN and Allegro for models with fixed capacity (equation (6)), c = 64 (4 layers, width 16), is shown in Supplementary Fig. 5. Over many orders of magnitude of dataset size, PaiNN and Allegro show greater sample efficiency, quantified by the calculated scaling exponents (Supplementary Table 1). That is, not only do the equivariant GNNs achieve better performance for a given data budget, but they also exhibit larger values of the scaling exponent β, meaning that the loss improves more quickly with increasing amounts of training data. This is due to the models’ equivariance, which is known to produce greater sample efficiency9,10,38, but it is interesting to note that this trend persists to much larger and more chemically diverse datasets than were previously considered, which typically include only 10²–10³ molecular geometries from a single molecular species. We observe the same trends for calculated scaling exponents when the smallest (10²) and largest (10⁵) datasets are excluded from the power-law fits (Supplementary Table 1). Our results and recent work39 on hierarchical learning in equivariant GNNs suggest that the tensor order of features has an important role in the sample efficiency of these models. Future theoretical and empirical work is needed to untangle the competition between equivariance that is enforced via architectures and features and ‘learned’ equivariance40 achieved through data augmentation and training data.

Neural scaling enables substantial improvements to loss

Next, we briefly highlight the practical outcomes and usages of TPE and neural scaling as enabling technologies for scalable scientific deep learning. On the basis of the results presented above, TPE can be used in conjunction with any HPO routine to enable aggressive early stopping and accelerate HPO without sacrificing improvements to the loss. Clearly, the benefits of this approach become more pronounced in chemical and biological applications, where new network architectures must be continuously retrained, optimized and evaluated on heterogeneous datasets.

Similarly, neural scaling provides practical ways to improve model pre-training loss and efficiency. Given an unlimited data and computation budget, the model corresponding to the minimum loss in the neural-scaling plot can be used. For example, the 300 million parameter ChemGPT model trained on 300 million tokens minimizes the loss in Fig. 4. Likewise, the PaiNN model with capacity ~1,000 trained on 10⁵ frames minimizes the loss in Fig. 5. This may be valuable for pre-trained models that are designed to be reused and fine-tuned, where the training cost is amortized over many downstream applications. However, for many scientific applications, greedily optimizing for the minimum loss is not practical or even necessary. From the neural-scaling results, identifying regions with the steepest slope allows for optimal and efficient allocation of resources. For example, for large chemical language models, the greatest loss improvements (Fig. 4) are seen for large data budgets when scaling up small models (10⁵ parameters). For small data budgets, more rapid loss improvements are seen when scaling up medium-sized models (10⁷ parameters). For NFFs, there are diminishing returns with increasing dataset sizes for low-capacity models, while high-capacity models show rapid improvements with increasing dataset size (Fig. 5). The benefits from scaling model and dataset sizes should therefore be balanced against the increased computational costs to find the most computationally and data-efficient opportunities for improvement. Beyond optimizing resource allocation, the grand challenge for large pre-trained models is to achieve new capabilities and superior performance on downstream tasks.

Discussion

In this paper, we developed and applied strategies for scaling large chemical language models and GNN interatomic potentials. To enable the efficient scaling of deep chemical models under computational resource constraints, we introduced TPE, a generalization of TSE that reduces the computational costs of HPO and model selection for chemical language models and GNN interatomic potentials. The use of TPE enabled large-scale experiments, training GPT-style chemical models with over 1 billion non-embedding parameters on nearly 10 million molecules. It also made training tractable for invariant and equivariant GNNs with a wide range of model capacities on up to 100,000 three-dimensional molecular geometries (~4.5 million force labels). We discovered empirical power-law ‘neural scaling’ behaviour that quantifies how model loss depends on the scale of model and dataset size over many orders of magnitude. These results enable optimal allocation of computational and data budgets for maximally efficient model loss improvements, and make scalable scientific deep learning more accessible to a broader community of researchers. A key finding in our work is that for both large chemical language models and NFFs, we have not saturated model loss with respect to model size, dataset size or compute. Much further work remains to be done in investigating the limitations of scaling for chemistry. Finally, the effects of physics-based priors on scaling behaviour give a rich description of how the incorporation of physics, known empirical relationships and other forms of knowledge into machine learning frameworks impacts both learning quality and efficiency. Future work in this area is well poised to yield fundamental advances in scientific machine learning.

Methods

In this section, we report details of the models considered and the settings for the experiments performed in this paper. We define neural scaling and describe the model architectures considered here, which are chosen specifically for their likelihood to exhibit interesting scaling behaviour. Then we introduce strategies to enable scaling large chemical models and investigations of scaling behaviour.

Neural scaling

For large language and computer vision models trained to convergence with sufficient model parameters and/or data, performance is characterized by empirical scaling laws where the loss scales as a power law18 of the form

$$L(R)=\alpha {R}^{-\beta }$$
(1)

for coefficient α, scaling exponent β and resource R. R is the number of model parameters, dataset size or compute. β measures the slope of the power law and indicates the scaling efficiency of the model with respect to a scaling factor, R. The power-law trends break down in ‘resolution limited’ regimes34, indicating that the model (dataset) size is insufficient for the given amount of data (model parameters).
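
As a minimal sketch, the scaling exponent β in equation (1) can be estimated by linear regression in log–log space, assuming the measurements lie in a clean power-law regime; the arrays below are illustrative, not data from this work.

import numpy as np

def fit_power_law(R, L):
    # log10(L) = log10(alpha) - beta * log10(R): a straight line on a log-log plot
    slope, intercept = np.polyfit(np.log10(R), np.log10(L), deg=1)
    return 10.0 ** intercept, -slope   # (alpha, beta)

R = np.array([1e5, 1e6, 1e7])     # resource, e.g. number of non-embedding parameters
L = np.array([0.80, 0.55, 0.38])  # converged validation loss (illustrative)
alpha, beta = fit_power_law(R, L)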

Neural scaling presents a best-case scenario for model pre-training loss improvements with increasing resources, and allows for optimal allocation of fixed budgets, for example, to decide whether longer training, more data or larger models will be most efficient for improving pre-training loss. Comparing neural-scaling exponents also provides a fundamental metric for measuring resource efficiency across model architectures. Investigations into neural scaling in the NLP domain have revealed general conclusions about overfitting, sensitivity to architectural choices, transfer learning and sample efficiency18. These factors are equally or more important in scientific deep learning applications, where rapid advances are being made in specialized architecture development, and it is often unclear how architectures will perform beyond the small benchmark datasets that are commonly available in scientific settings.

Large chemical language models

Strings are a simple representation for molecular graphs41, thereby making sequence-based machine learning models a natural choice for working with chemical data. Following the demonstrated pre-training loss improvements of transformer-based models with increasing model and dataset sizes8,18,34, we designed a large generative language model for chemistry called ChemGPT to investigate the impact of dataset and model size on pre-training loss. ChemGPT is a generative pre-trained transformer 3 (GPT-3)-style model42,43 based on GPT-Neo44,45 with a tokenizer for self-referencing embedded strings (SELFIES)41,46 representations of molecules. SELFIES enforce chemical validity and are straightforward to tokenize, but ChemGPT can easily be used with simplified molecular-input line-entry system (SMILES) strings as well28. For chemical language modelling, a set of molecules (x₁, x₂, …, xₙ) is represented by writing each molecule as a sequence of symbols (s₁, s₂, …, sₙ). The probability of a sequence, p(x), is factorized as the product of conditional probabilities47:

$$p(x)=\mathop{\prod }\limits_{i=1}^{n}p\Big({s}_{i}| {s}_{1},\ldots ,{s}_{i-1}\Big).$$
(2)

ChemGPT uses the transformer48 architecture with a self-attention mechanism to compute conditional probabilities, estimate p(x), and sample from it to generate new molecules. ChemGPT is pre-trained on molecules from PubChem49 with a causal language modelling task, where the model must predict the next token in a sequence, given the previous tokens. ChemGPT models of up to 1 billion non-embedding parameters are trained on up to 10 million molecules, whereas typical chemical generative models have fewer than 1 million parameters and are trained on fewer than 1 million samples28,33.
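
The sketch below illustrates equation (2) for a trained causal language model: the log-probability of a tokenized molecule is the sum of per-token conditional log-probabilities. The model interface (a HuggingFace-style causal language model returning logits) and the input tensor are assumptions for illustration.

import torch
import torch.nn.functional as F

def sequence_log_prob(model, token_ids):
    # token_ids: 1D LongTensor holding one tokenized SELFIES sequence
    with torch.no_grad():
        logits = model(token_ids.unsqueeze(0)).logits   # shape (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)   # distributions over the next token
    targets = token_ids[1:].unsqueeze(1)                # labels shifted by one position
    return log_probs.gather(1, targets).sum().item()    # sum of conditional log-probabilities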

GNN force fields

For many tasks in chemistry, molecular geometry and three-dimensional structure are essential and string-based representations of the chemical graph are not sufficient. NFFs are GNNs that take molecular geometries as inputs, described by a set of atomic numbers \(({Z}_{1},\ldots ,{Z}_{n}| {Z}_{i}\in {\mathbb{N}})\) and Cartesian coordinates \(({\bf{r}}_{1},\ldots ,{\bf{r}}_{n}| {\bf{r}}_{i}\in {{\mathbb{R}}}^{3})\). The NFF with parameters θ, fθ, predicts a real-valued energy \(\hat{E}={f}_{\theta }(X)\) for an atomistic configuration X. The NFF produces energy-conserving atomic forces by differentiating the energies with respect to the atomic coordinates

$${\hat{F}}_{ij}=-\frac{\partial \hat{E}}{\partial {r}_{ij}}$$
(3)

for atom i and Cartesian coordinate j. Typically, the network is trained by minimizing the loss \({{{\mathcal{L}}}}\) computed from the average mean squared error for a mini-batch of size N

$${{{\mathcal{L}}}}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\alpha }_{E}\parallel {E}_{i}-{\hat{E}}_{i}{\parallel }^{2}+{\alpha }_{F}\parallel {{{{\bf{F}}}}}_{i}-{\hat{{{{\bf{F}}}}}}_{i}{\parallel }^{2}\right]$$
(4)

where αE and αF are coefficients that determine the relative weighting of energy and force predictions during training50. For scaling experiments we use the L1 loss or mean absolute error

$${{{\mathcal{L}}}}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\alpha }_{E}\parallel {E}_{i}-{\hat{E}}_{i}\parallel +{\alpha }_{F}\parallel {{{{\bf{F}}}}}_{i}-{\hat{{{{\bf{F}}}}}}_{i}\parallel \right]$$
(5)

which we empirically find to show more robust convergence behaviour.
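
The sketch below shows how energy-conserving forces (equation (3)) and the L1 loss (equation (5)) can be obtained with automatic differentiation in PyTorch; the energy model's call signature is an assumption for illustration, not the interface of any specific NFF implementation.

import torch

def energy_and_forces(energy_model, Z, R):
    # equation (3): forces are the negative gradient of the predicted energy
    R = R.clone().requires_grad_(True)
    E = energy_model(Z, R)                                     # assumed signature
    F = -torch.autograd.grad(E.sum(), R, create_graph=True)[0]
    return E, F

def l1_loss(E, E_hat, F, F_hat, alpha_E=0.0, alpha_F=1.0):
    # equation (5); force-only training corresponds to alpha_E = 0
    return alpha_E * (E - E_hat).abs().mean() + alpha_F * (F - F_hat).abs().mean()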

In this work, we consider four flavours of NFFs: SchNet51, PaiNN52, Allegro10 and SpookyNet30. This series of models represents increasingly physics-informed model architectures, ranging from models whose internal layers manipulate only E(3) invariant quantities (SchNet), to models that use E(3) equivariant quantities (PaiNN, Allegro, SpookyNet), strictly local models with learned many-body functions and no message passing (Allegro), and models that are further informed by empirical physical corrections (SpookyNet). The power and expressivity of these GNNs can be defined in terms of their capacity36

$$c=d\times w$$
(6)

where d is depth (number of layers or convolutions51) and w is width (the embedding dimension or number of basis functions employed by each convolution). Capacity is a simple parameter to vary during neural-scaling experiments, because model size is not a strictly useful scaling parameter for GNNs36. Typical evaluations of NFFs consider training dataset sizes of less than 1,000 three-dimensional geometries of a single chemical species, which leads to insensitivity to model capacity because of the simplicity of the learning task17. Here, we consider up to 100,000 training geometries (corresponding to 4.5 million force labels) and GNNs with millions of trainable parameters.
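
As a concrete example, the capacities swept in this work follow directly from equation (6) and the depths and widths reported in the experimental settings below; this short sketch simply enumerates them.

# capacity c = d * w (equation (6)) for the depths and widths used in this work
depths, widths = [2, 3, 4], [16, 64, 256]
capacities = sorted({d * w for d in depths for w in widths})
print(capacities)  # [32, 48, 64, 128, 192, 256, 512, 768, 1024]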

Accelerating HPO with TPE

Because model hyperparameters, including learning rates and batch sizes, are essential for achieving optimal losses and are non-transferable between different domains and model/dataset sizes21, we need efficient strategies for scalable HPO in deep chemical models. We adapt TSE22, a simple technique for ranking computer vision architectures during neural architecture searches, to accelerate HPO and model selection for ChemGPT and GNNs. We call this method TPE, as it uses training speed to more generally enable performance estimation across a wide range of applications. TPE generalizes TSE to HPO for new deep learning domains (LLMs, GNNs) and can be used to directly predict converged loss, in addition to rank ordering different architectures. While not the main contribution of this work, TPE is an effective strategy for accelerating scaling studies under resource constraints. TPE is used for rapid experimentation and to discover which hyperparameters are most important in new domain applications, as well as what hyperparameter regimes to investigate. Similar methods, including Hyperband53, accelerate HPO by automating early stopping during training. The technical details of TPE are provided in the ‘Training performance estimation’ section below.

Experimental settings

All experiments described in this paper were conducted on NVIDIA Volta V100 graphics processing units (GPUs) with 32 GB of memory per node and 2 GPUs per node. All models were implemented in PyTorch54 and trained with the distributed data parallel accelerator55, the NVIDIA Collective Communication Library, PyTorch Lightning56 and LitMatter57 for multi-GPU, multi-node training.

Large language models

The ChemGPT model architecture is based on the GPT-Neo44,45 transformer implementation in HuggingFace58. The model has 24 layers with variable width w, where w ∈ {16, 32, 64, 128, 256, 512, 1,024, 2,048} and w determines the model size. Model sizes range from 77,600 to 1,208,455,168 non-embedding parameters. The model is trained via stochastic gradient descent with the AdamW59 optimizer, using a learning rate of 2 × 10⁻⁵, a per-GPU batch size of 8 and a constant learning rate schedule with 100 warm-up steps for scaling experiments. Models were trained for 10 epochs in a self-supervised manner, with a cross-entropy loss for causal language modelling. The number of epochs for training was chosen due to computational limitations, but importantly it is large enough to clearly distinguish differences in model performance from the empirical scaling results. Since the initial publication of this work, new ‘compute optimal’ scaling laws60 have been discovered for general LLMs. Our results and this recent work clearly suggest that with increased compute and engineering time, larger chemical models could be trained.
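
A minimal sketch of how a ChemGPT-style model of a given width might be instantiated with the HuggingFace GPT-Neo implementation; the head-count heuristic and the way non-embedding parameters are counted are our own illustrative choices, not the exact configuration used in this work.

from transformers import GPTNeoConfig, GPTNeoForCausalLM

def make_chemgpt(width, n_layers=24, vocab_size=10_000, max_len=512):
    config = GPTNeoConfig(
        vocab_size=vocab_size,
        hidden_size=width,                                  # width is the scaling knob
        num_layers=n_layers,
        num_heads=max(1, width // 64),                      # assumption: ~64-dimensional heads
        attention_types=[[["global", "local"], n_layers // 2]],
        max_position_embeddings=max_len,
    )
    return GPTNeoForCausalLM(config)

model = make_chemgpt(width=256)
n_non_embedding = (sum(p.numel() for p in model.parameters())
                   - model.get_input_embeddings().weight.numel())
print(f"approximate non-embedding parameters: {n_non_embedding:,}")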

The training dataset for scaling experiments is PubChem10M8, a set of 10 million SMILES strings. Five percent of the data was randomly sampled and held out as a fixed validation set of 500,000 molecules. Training datasets of variable size 10ⁿ, where n ∈ {2, 3, 4, 5, 6}, were used. The largest training dataset includes all molecules in PubChem10M, excluding the validation set. The maximum vocabulary size was 10,000 and the maximum sequence length was 512 tokens. SMILES strings were converted to SELFIES using version 1.0.4 of the SELFIES library46. SELFIES were tokenized by splitting individual strings into minimally semantically meaningful tokens denoted by brackets, including start-of-string, end-of-string and padding tokens. Dataset sizes range from 51,200 to 304,656,384 tokens.
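
The sketch below illustrates the SMILES-to-SELFIES conversion and bracket-based tokenization described above, using the selfies library (a version 1.x API is assumed); the example molecule and special-token names are illustrative.

import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"              # aspirin, an illustrative input molecule
selfies_str = sf.encoder(smiles)              # SMILES -> SELFIES
tokens = list(sf.split_selfies(selfies_str))  # split into bracket-delimited tokens
sequence = ["[BOS]"] + tokens + ["[EOS]"]     # hypothetical start/end-of-string tokens
print(sequence[:6])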

Graph neural networks

We train GNNs to predict the forces of molecular geometries. Force-only training (αE = 0 in equation (5)) was used for neural-scaling experiments to improve convergence and avoid issues with systematic drift in predicted energies, which we identified during the course of this work and plan to address in future work. We use the SchNet61, PaiNN52, Allegro10 and SpookyNet30 models. Model implementations are from the NeuralForceField repository50,62,63 and the Allegro repository10. Model widths (w in equation (6)) were varied over 16, 64 and 256, while the number of layers/convolutions (d in equation (6)) was chosen to be 2, 3 or 4. A 5 Å nearest-neighbour cut-off was used. All other model hyperparameters were set to default values from the original implementations. GNN models were trained with stochastic gradient descent using the Adam64 optimizer. For Allegro, l = 1 internal features were used.

A learning rate scheduler reduced the learning rate by 0.5× after 30 epochs without improvement in the validation loss, with a minimum learning rate of 10⁻⁷. Early stopping was applied after 50 epochs without improvement in the validation loss, and training was capped at 1,000 epochs. Initial learning rates of 10⁻³, 10⁻⁴ and 10⁻⁵, and per-GPU batch sizes of 4, 8, 16, 32 and 64 were used during HPO experiments, while keeping the network architecture hyperparameters fixed. Models were trained for 50 epochs during HPO to approximate a full training budget, with a limited percentage of the total training budget used to calculate TSE.
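
A minimal PyTorch sketch of this schedule (halve the learning rate after 30 stale epochs, floor at 10⁻⁷, early stop after 50 stale epochs, cap at 1,000 epochs); the linear model and dummy validation loop are placeholders for the actual GNN training setup.

import torch

model = torch.nn.Linear(3, 1)                  # placeholder for the GNN force field
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=30, min_lr=1e-7
)

def validate(epoch):
    return 1.0 / (1 + epoch)                   # dummy validation loss for illustration

best, stale = float("inf"), 0
for epoch in range(1000):                      # training capped at 1,000 epochs
    val_loss = validate(epoch)
    scheduler.step(val_loss)                   # reduces the learning rate on plateaus
    if val_loss < best:
        best, stale = val_loss, 0
    else:
        stale += 1
    if stale >= 50:                            # early stopping criterion
        break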

The training dataset was assembled from ANI-1x37,65, which contains energies and forces from 5 million density functional theory calculations for small molecules. A fixed validation dataset of 50,000 frames was held out by random sampling. Different training splits were taken with sizes 10ⁿ, where n ∈ {2, 3, 4, 5, 6}. Training datasets for TPE were assembled by randomly sampling 1,000 structures from molecular dynamics trajectories for each of the 10 molecules available in the revised MD-17 dataset32, for a total of 10,000 training samples. A validation dataset of equal size was constructed from the remaining geometries. Revised MD-17 is an updated version of the MD-17 dataset31, recomputed at the PBE/def2-SVP level of theory with strict convergence criteria to remove noise found in the original MD-17 dataset.

Training performance estimation

HPO typically involves training tens or hundreds of networks and using random search and/or Bayesian optimization to identify optimal hyperparameters. For optimal performance, the process must be repeated when considering new datasets or distribution shift.

By calculating the ‘training speed’ from only the first few epochs of training, the converged model performance is predicted and optimal hyperparameters are identified using only a small fraction of the total training budget. For example, networks that require 100 epochs to train to convergence are trained for only 10–20 epochs, and the final performance is predicted using TPE to identify the best performing networks, thereby saving 80–90% of the total training budget.

Training speed is estimated by summing the training losses of each mini-batch during the first T epochs of training. After training the network for T epochs with B training steps per epoch, TSE is defined as

$${\mathrm{TSE}}=\mathop{\sum }\limits_{t=1}^{T}\left(\frac{1}{B}\mathop{\sum }\limits_{i=1}^{B}{{{\mathcal{L}}}}(\;{f}_{\theta (t,i)}({{{{\bf{X}}}}}_{i}),{{{{\bf{y}}}}}_{i})\right),$$
(7)

for a loss function \({{{\mathcal{L}}}}\) and a neural network fθ(t,i), with parameters θ at epoch t and mini-batch i. (Xi, yi) is a tuple of inputs and labels in the ith mini-batch. TSE is correlated with the converged performance of the network and can be used to rank networks early in training to yield substantial compute savings. Given a sufficient number of networks (5–10) that are trained to convergence, a linear regression of the form

$$L=m \times {\mathrm{TSE}}+b$$
(8)

is fit, with parameters m and b, to the calculated TSE values to predict the converged loss, L. This allows predictions of converged network loss for partially trained networks evaluated during HPO based on their TSE values. Optimal hyperparameters are chosen to minimize TSE. In our experiments, we noted that L is monotonic in TSE, meaning that equation (8) is not needed to simply choose the best hyperparameters. The TSE values computed after a small number of epochs are sufficient for ranking model configurations and finding the optimal ones. Although leveraging equation (8) requires training some small number of networks to convergence to fit the parameters, it provides the benefit of being able to predict the expected performance of new hyperparameter choices. In particular, this may provide guidance if a particular target loss value is desired, as equation (8) can be used to predict the performance gains potentially accessible through HPO. We find that TPE is robust over multiple orders of magnitude of learning rate for the networks and training regimes considered here.
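
A compact sketch of equations (7) and (8): the TSE of a configuration is accumulated from its mini-batch training losses over the first T epochs, and a linear regression fitted on a few configurations trained to convergence predicts the converged loss of the rest. All numerical values below are placeholders.

import numpy as np

def training_speed_estimate(epoch_batch_losses):
    # equation (7): sum over the first T epochs of the mean mini-batch training loss
    return sum(float(np.mean(losses)) for losses in epoch_batch_losses)

def fit_tpe_regression(tse_values, converged_losses):
    # equation (8): L = m * TSE + b, fitted on configurations trained to convergence
    m, b = np.polyfit(tse_values, converged_losses, deg=1)
    return lambda tse: m * tse + b

# three configurations trained to convergence (hypothetical TSE and loss values) ...
predict_loss = fit_tpe_regression([4.1, 3.3, 2.8], [0.52, 0.44, 0.40])
# ... used to predict the converged loss of a partially trained configuration
print(predict_loss(3.0))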