Main

Mass spectrometry1,2 (MS) is an analytical technique used for identifying and quantifying chemicals in a mixture. Molecules from the sample are ionized and then detected by a mass analyser, which records information about the mass-to-charge ratio (m/z) of each ion in the form of a mass spectrum. Tandem mass spectrometry (MS/MS) is a variant of MS that includes a fragmentation step to isolate and break down charged molecules (called precursors) into smaller fragments. These ions appear as peaks in the fragment spectrum, and their m/z positions and relative abundances can be used to make inferences about the molecular structure of the original precursor. When coupled with online liquid chromatography (LC), a technique for chemical separation, the combined LC–MS/MS workflow is a powerful tool for analysing aqueous solutions. Because of its versatility, LC–MS/MS is commonly employed in a variety of domains, including proteomics3, metabolomics4,5, forensics6,7 and environmental chemistry8.

For most molecules, it is not possible to accurately simulate the fragmentation that occurs in a mass spectrometer. In principle, theoretical physics provides the tools to understand this process; however, existing first principles simulations are too slow to be used in a high-throughput manner and rely on approximations that limit their accuracy. This presents a fundamental problem for the MS field. Improving spectrum simulation is essential for deepening our knowledge of MS and gleaning valuable insights from experimental data.

Compound identification from MS/MS data is an important application of mass spectrometry, particularly in metabolomics. To identify a spectrum, practitioners often rely on database searches with large reference libraries, using spectrum similarity functions9,10,11,12 and domain expertise to identify matches. However, these databases have relatively poor coverage, containing on the order of 10^6 spectra, which represent approximately 10^4 unique compounds. This fails to cover even the relatively small set of human metabolites, which contains at least on the order of 10^5 compounds, according to the Human Metabolome Database13 (HMDB). One strategy to overcome the incompleteness of spectral libraries is to augment them with in silico mass spectra for compounds from a chemical structure database13,14,15. This can dramatically increase coverage of the reference data and improve the chance of finding a match.

The increasing availability of public13,16,17,18,19 and commercial20,21,22,23 MS datasets makes data-driven solutions to spectrum prediction appealing. One of the most popular spectrum prediction methods, competitive fragmentation modelling24,25,26 (CFM), combines combinatorial fragmentation and data-driven probabilistic modelling to predict spectra. CFM has been shown to be effective at predicting and annotating spectra and can be used for compound identification. However, due to its combinatorial bond-breaking algorithm, CFM is slow and struggles with modelling larger compounds (particularly those with multiple rings).

More recently, deep learning approaches have been applied to spectrum prediction. Current models use fully connected neural networks based on molecular fingerprints27 or graph neural networks with molecular graph representation28,29,30. However, both strategies rely on local chemical structures and do not easily model global interactions between atoms distant from each other in the molecule. Molecular fingerprints are typically restricted to capturing subgraphs of a fixed size and cannot represent global structures. Graph neural networks, while more flexible, only model local interactions in a single layer and require increased depth for long-range interactions31. However, excessive depth can often result in over-smoothing32,33, presenting challenges for larger molecules. While fragmentation events may appear to be local, typically involving individual bond breakages or minor rearrangements, global properties of the molecule can affect their likelihood of occurrence. Thus, each event must be understood in the context of the entire molecule.

In this work we adapt a state-of-the-art neural network architecture, the graph transformer34, to tackle MS/MS spectrum prediction. Graph transformers model pairwise interactions between all nodes in the graph and use degree and shortest path information to capture topological properties. Our experiments demonstrate how adapting a pretrained graph transformer model to spectrum prediction can result in state-of-the-art performance. Through rigorous comparisons with strong baseline methods, we demonstrate that our model can more accurately predict spectra for held-out compounds on two different spectrum datasets. We further validate the quality of our model’s predictions by investigating the effects of collision energy on the simulated spectra and interpret MassFormer’s predictions with gradient-based attribution methods. Finally, we demonstrate a realistic application by applying the model to the spectrum identification problem.

Concurrently with this work, a number of new deep learning spectrum predictors have been proposed. Some of these models take into account three-dimensional structural information35, or frame the learning task as formula36,37 or subgraph38,39 prediction. A detailed discussion of these approaches and their advantages and limitations can be found in Supplementary Information Section 2.2.

Results

Overview

MassFormer uses a graph transformer architecture (‘MassFormer architecture’) to predict spectra from an input molecule, represented as a molecular graph. A visual summary of our method is presented in Fig. 1. The input graph is first preprocessed into node and edge embeddings (‘Input featurization’). The node information encodes chemical properties about each atom, such as the element, and centrality properties like degree. The edge information captures topological relationships between atoms in the molecule, such as the shortest path length and the bonds traversed along the path. After preprocessing, the embeddings are passed to a graph transformer, which iteratively applies multihead self-attention (MHA) and multilayer perceptrons (MLPs) to manipulate the data. The learned self-attention weights capture associations globally between all pairs of nodes and are directly influenced by edge embeddings at each iteration. After several rounds of processing, the final embeddings are summarized into a single embedding that represents the entire molecule. This chemical representation is combined with spectrum metadata and passed to an MLP that makes a prediction in the form of a sparse positive vector. Each dimension of the vector represents a binned peak location, with the magnitude corresponding to the peak’s intensity. The metadata describe important information about the precursor (such as the adduct formed during ionization, termed the precursor adduct) and the instrument (such as the collision energy), both of which influence the fragmentation process and resulting spectrum1. The learnable input embeddings and graph transformer parameters are initialized from a Graphormer model34 that is pretrained on a large chemical dataset (‘Pretraining and fine-tuning’), then jointly tuned with the spectrum predictor MLP on the MS/MS dataset.
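As an illustrative sketch only (not the actual implementation), the overall pipeline described above can be written as follows; module names, dimensions and the pooling strategy are assumptions:

```python
# Simplified sketch of the prediction pipeline: graph encoder -> pooled chemical
# embedding -> concatenation with spectral metadata -> MLP -> binned spectrum.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn

class SpectrumPredictor(nn.Module):
    def __init__(self, graph_encoder, embed_dim, meta_dim, num_bins):
        super().__init__()
        self.graph_encoder = graph_encoder  # e.g. a pretrained graph transformer
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + meta_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_bins),
        )

    def forward(self, graph_batch, metadata):
        chem_embed = self.graph_encoder(graph_batch)       # [batch, embed_dim] readout embedding
        joint = torch.cat([chem_embed, metadata], dim=-1)  # append adduct/collision-energy features
        return torch.relu(self.mlp(joint))                 # non-negative binned spectrum
```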

Fig. 1: Overview of the method.
figure 1

Extraction of node and edge embeddings from the molecular graph, application of the graph transformer, extraction of the chemical embedding from the readout node, addition of spectral metadata (such as collision energy) and prediction of the binned spectrum. The parameters for the input embeddings and the graph transformer layers are initialized from a pretrained model, and the entire model is fine-tuned on the spectrum prediction task.

Predicting spectra for unseen compounds

To quantitatively evaluate our model’s generalization performance, we measure average cosine similarity on different types of held-out spectrum data. MassFormer is compared with two deep learning methods, a fingerprint (FP) neural network model (adapted from ref. 27) and a Weisfeiler–Lehman (WLN) graph neural network model40 (adapted from ref. 28), as well as with CFM24, a widely used probabilistic method for spectrum prediction that does not use deep neural networks (see the section ‘Baseline models’ for more details). The three deep learning models are trained on a portion of the National Institute of Standards and Technology 2020 Tandem Mass Spectrometry dataset21 (NIST). However, to benchmark CFM, we use the most recent publicly available pretrained model26, since retraining CFM on our dataset would be non-trivial due to its design. Differences in modelling assumptions and training data introduce some ambiguity in comparisons between CFM and other models.

Model evaluation is performed on a held-out portion of the NIST test set and an additional dataset from MassBank of North America19 (MoNA). We consider two strategies for splitting the data: a simple random split by compound, based on the International Chemical Identifier Key (InChIKey), and a more challenging split that stratifies compounds by their Murcko scaffold41,42 before splitting. For more details on the datasets and training splits, refer to the section ‘Datasets and training splits’. To allow for a fairer comparison with CFM, we focus on [M + H]+ spectra and remove compounds from the test set that overlap with CFM’s training set. In Extended Data Fig. 1, we demonstrate results for the deep learning models on the full range of supported precursor adducts and confirm that they are consistent with results on the [M + H]+ subset.

MassFormer reliably outperforms other models across different datasets and splitting criteria (Fig. 2c). The performance of the three deep learning models is correlated across data splits, reflecting the underlying differences in split difficulty. NIST-InChIKey is the easiest split, resulting in the best performance; NIST-Scaffold exhibits split-induced distributional shift; MoNA-InChIKey exhibits library-induced distributional shift; MoNA-Scaffold exhibits both of the aforementioned challenges. Ground truth spectra for compounds that exist in both NIST and MoNA tend to be highly similar: ostensibly identical spectra from NIST and MoNA (same compound and metadata) have an average cosine similarity of approximately 0.97, suggesting that batch effect is not a driver of library-induced distributional shift. Thus, the decrease in performance on MoNA splits is most likely due to differences in compound and metadata coverage. Altogether, the experiments demonstrate that model performance strongly depends on the data splitting technique, and that MassFormer is consistently superior across all splits.

Fig. 2: Spectrum similarity experiments.
figure 2

MassFormer can accurately predict spectra for held-out compounds. a, Predicted and real [M + H]+ spectra for Azlocillin (InChIKey-14: JTWOMNBEOCYFNV) with cosine similarity of 0.59. b, Predicted and real [M + H − NH3]+ spectra for Flamprop-isopropyl (InChIKey-14: IKVXBIIHQGXQRQ) with cosine similarity of 0.63. c, Average cosine similarity of four models (CFM, FP, WLN and MF) on four different data splits, using a bin resolution of 1 Da. The number of molecules in the training set (N) is provided for each split. To allow for direct comparison with CFM in this experiment, training and evaluation is limited to spectra with an [M + H]+ precursor adduct. MassFormer is consistently the best-performing deep learning model. All models perform worse on spectra from MoNA than spectra from NIST. Scaffold splitting also results in a more challenging evaluation. Averages and standard deviations from ten independently trained models are reported (except for CFM, which is pretrained). Statistical significance is determined by one-sided Welch’s t-test96 with Šidák correction97.

We also perform a more granular investigation of MassFormer’s performance by comparing average similarity across the top ten most frequent chemical classes in the NIST dataset (Extended Data Fig. 2), as indicated by ClassyFire43, an automatic chemical ontology tool. While most classes have similar performance, the model seems to perform exceptionally well on ‘lipids and lipid-like molecules’. This is perhaps unsurprising, as some types of lipids have predictable fragmentation patterns, facilitating the success of rule-based fragmenters in lipid spectrum prediction25,44. Conversely, simulating spectra for lipid-like compounds can be challenging for combinatorial methods like CFM due to their large size25,26, so MassFormer’s success in this area is encouraging. MassFormer predictions for individual held-out compounds are presented in Fig. 2a,b, with additional examples included in Extended Data Fig. 6.

Modelling the effect of collision energy

Collision energy is a key experimental parameter that can strongly influence the observed spectrum. Typically, increasing the collision energy results in more intense fragmentation, producing spectra with a higher proportion of smaller fragments. Accurately modelling collision energy is a key desideratum of any spectrum predictor. The relationship between collision energy and fragmentation is well represented in the NIST dataset, since each precursor is measured at ~11 different collision energies on average (see Supplementary Table 1 for breakdown by precursor adduct). Figure 3a illustrates how collision energy typically affects fragmentation. The four spectra (and associated spectrum predictions) all correspond to the same molecule at varying normalized collision energies (NCEs), which are expressions of collision energy relative to precursor mass (equation (14)). As collision energy increases, the peak intensities in the real mass spectra shift to the left, with the model’s predictions closely following this pattern. Figure 3c shows this relationship more generally, by plotting the distribution of mean peak m/z in the spectrum, where each peak is weighted by its relative intensity. Qualitatively, the real and predicted distributions for the held-out spectra are difficult to distinguish.

Fig. 3: Collision energy experiments.
figure 3

Increasing collision energy influences the mass spectrum. a, Four spectra for benzhydrylpiperazine (InChIKey-14: NWVNXDKZIQLBNM) collected at different NCEs. Notice how the mass distribution shifts to smaller fragments as NCE increases. MassFormer is able to predict the spectrum well at each of the four collision energies and correctly models the shift in intensity. b, The structure of benzhydrylpiperazine, which is a piperazine derivative. c, Density plots of average mass-to-charge ratio and collision energy for both real and predicted spectra from held-out NIST data. Besides confirming the negative correlation between collision energy and average peak location, these plots demonstrate that MassFormer accurately models collision energy.

Explaining peak predictions with gradient-based attribution

While deep learning models are often viewed as black boxes, gradient-based attribution methods can provide some understanding of model behaviour. Broadly speaking, these methods work by taking the gradient of the model’s output with respect to the input features to identify parts of the data that the model is most sensitive to. Gradient × Input (GI) attribution45,46 is one of the simplest methods for gradient-based attribution and can be useful for analysing transformer models47. GI works by measuring the dot product between the input vector and its gradient: this can give information about the effects (positive or negative) of changing that feature on the model’s output (‘Gradient-based feature attribution’).

Each peak in a mass spectrum corresponds to a fragment of the original precursor. In organic compounds, non-oxygen hetero-atoms are much less common than the core carbon/hydrogen/oxygen building blocks, so fragments that contain such hetero-atoms are in some sense related. It is therefore reasonable to expect the input dependencies for these fragments and their corresponding peaks to be correlated.

By applying GI attribution methods, we demonstrate that MassFormer can distinguish peaks based on hetero-atom composition. For each peak, a GI map is calculated using the gradient of the model’s output at the peak location with respect to the input atom embeddings. Because of their high complexity (and redundancy), the GI maps are projected to two dimensions using principal component analysis (PCA) to facilitate interpretation. The SIRIUS formula annotation tool48,49 is applied to identify which peaks contain a hetero-atom of interest. The linear separability of the hetero-atom labels in the projected GI space is used to quantify the peak differences. For each spectrum, we fit a logistic regression model in the two-dimensional projection space (using the hetero-atom peak labels as binary targets) and report its accuracy as a measurement of linear GI map separability.
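A minimal sketch of this separability analysis, assuming the GI maps are already computed as an array of per-peak attributions with binary hetero-atom labels from formula annotation, could look as follows:

```python
# Project per-peak GI maps to 2D with PCA, then fit a logistic regression on the
# hetero-atom labels; its accuracy quantifies linear separability. Inputs are assumed
# precomputed: gi_maps has shape (n_peaks, n_features), has_hetero is 0/1 per peak.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def gi_separability(gi_maps: np.ndarray, has_hetero: np.ndarray) -> float:
    proj = PCA(n_components=2).fit_transform(gi_maps)
    clf = LogisticRegression().fit(proj, has_hetero)
    return clf.score(proj, has_hetero)  # 1.0 means perfectly linearly separable
```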

The entire process is illustrated in Fig. 4. The GI maps for a particular spectrum (Fig. 4a) are computed, projected and plotted in two dimensions (Fig. 4b). In this example, it is clear that the nitrogen and non-nitrogen peaks are perfectly linearly separable. By repeating this analysis for many spectra from the dataset (Fig. 4c), we show that the nitrogen labelling in the projected GI map space results in markedly higher linear separability than random labelling. In Extended Data Fig. 3 we show that this pattern also holds for four other hetero-atoms (chlorine, phosphorus, sulfur and fluorine). Altogether, these data suggest that MassFormer learns implicit compositional relationships between predicted peaks in the spectrum and that these relationships can be revealed using gradient-based attribution methods.

Fig. 4: Explainability using gradient attributions.
figure 4

a, A propranolol (InChIKey-14: AQHHHDLHHXJYJD) spectrum with peak formula annotations from the SIRIUS49 toolkit: green corresponds to nitrogen-containing formulae, purple corresponds to formulae without nitrogen. b, GI attribution maps for each peak in the propranolol spectrum, visualized as heat maps (around the margins) and projected to two dimensions with PCA (centre plot). Each GI map is normalized, with red values corresponding to positive attributions and blue values corresponding to negative attributions. The projected GI maps for peaks that contain nitrogen (in green) are linearly separable from those that do not (in purple), suggesting that these maps contain interpretable information about the peaks. The dashed line represents a linear boundary in the projected GI map space that optimally separates the peaks that contain nitrogen from those that do not. Norm., normalized. c, Differences in GI maps reflect differences in peak formula composition. The distribution of optimal linear classification accuracy induced by the nitrogen labelling strategy is markedly different from the random labelling distribution (higher accuracy indicates improved separability of the peaks). Furthermore, 48% of the peaks are perfectly linearly separable when using the nitrogen labelling, compared to only 18% with the random baseline approach (P < 10^−74 by one-sided Welch’s t-test).

Identifying spectra by ranking candidates

Spectrum identification is a major application for spectrum prediction models. MS-based compound identification tools (Supplementary Information Section 2.1) are often benchmarked on the Critical Assessment of Small Molecule Identification (CASMI) Contest50,51. This contest works by acquiring experimental spectra for a small set of compounds (called queries) that are not represented in existing public libraries and then scoring competing methods based on their ability to identify the correct structure. We compare MassFormer with the three baseline models (FP, WLN, CFM) on 124 [M + H]+ spectra from the CASMI 2016 Contest. Each of the three deep learning models is trained on a 90% partition of the NIST dataset, with the remaining 10% used for validation. Additional filtering is applied to prevent leakage of query compounds into the training or validation data. Each query spectrum is associated with a list of candidate compounds, which the models rank based on the cosine similarity of their predicted spectra with the query. The ranking metrics are summarized in Table 1; additional metrics focusing on different subsets of the candidate sets can be found in Table 2, while plots of the rank distributions can be found in Extended Data Fig. 4. Overall, the deep learning models seem to consistently outperform CFM in all metrics. MassFormer is generally superior to other methods, except for CASMI 2016 top-1, where it is outperformed by the FP model.

Table 1 Ranking results
Table 2 Additional ranking results

The CASMI 2016 Contest has a number of drawbacks as a benchmark dataset. Because of the limited coverage of precursor adducts (only [M + H]+ for positive mode) and collision energies (20, 35 and 50 NCE), it is unclear if the results will generalize to other common experimental configurations. Additionally, many of the compounds in the CASMI 2016 query set have a high similarity to existing compounds in modern spectral libraries. To address these concerns, we also evaluate the models on the more recent CASMI 2022 Contest52 and a novel CASMI-inspired spectrum identification task called NIST20 Outlier, which uses structural outliers from the NIST dataset as queries. Full details about the setup for these tasks can be found in the section ‘Spectrum identification task setup’, and information about the size and diversity of the candidate sets can be found in Extended Data Fig. 5. Overall, CASMI 2022 results are the poorest across the board, at least in part because of the larger number of candidates per query. In general, differences between CASMI 2022 and NIST20 Outlier are less pronounced when comparing normalized metrics, which account for variations in candidate counts. MassFormer outperforms other models across almost every experiment and metric. CASMI 2022 and NIST20 Outlier are particularly challenging for CFM because of the addition of spectra with collision energies and precursor adducts that the model does not support natively.

Altogether, these experiments demonstrate how gains in spectrum prediction will often, but not always, translate to improvements in spectrum identification29,37,39. MassFormer is the superior spectrum predictor, as indicated by its higher query spectrum similarity; this is generally reflected in its strong ranking metrics. However, the FP model performs surprisingly well in terms of top-k retrieval, even eclipsing MassFormer’s performance in top-1 CASMI 2016, despite its relatively poor accuracy as a spectrum predictor. Spectrum identification requires making predictions for a large number of candidate structures, many of which are out-of-distribution with respect to the training set. Performance on an independent and identically distributed held-out test set does not fully describe a model’s behaviour under distributional shift, which is generally more difficult to estimate. In fact, optimized models that perform similarly on an in-distribution evaluation may perform differently on out-of-distribution data53,54,55.

Discussion

In this work, we introduce MassFormer, a new method for predicting MS/MS spectra for small molecules using a graph transformer architecture. We validate our model’s performance on two independent MS datasets, NIST and MoNA, and show that it can produce realistic spectra. We verify that the model captures prior knowledge about the fragmentation process by investigating the effect of collision energy on spectrum predictions. Using gradient-based attributions, we explain model predictions by identifying peaks with similar element composition. We benchmark the model on three different spectrum identification tasks and show that it can be useful for inferring molecular structure. Our work represents one of the first transformer-based spectrum predictors for MS/MS data, with extensive benchmarking and implementations of other models from the literature.

Our method has a number of limitations. The current model is restricted to positive mode electrospray ionization Orbitrap spectra with specific precursor adducts. Extending the model to accommodate additional ionization methods (such as electron ionization and negative mode electrospray ionization), adduct types (such as [M − H]− and [M + K]+) and instrument types (such as quadrupole time-of-flight) would broaden MassFormer’s impact. By supporting additional mass spectrum modalities, the model would be able to leverage new sources of MS data and potentially transfer knowledge across modalities. Prediction resolution is another limitation: most experiments in this work use spectra binned to 1 Da, but many modern instruments provide much higher resolution. This can be useful for identification, as the increased peak resolution can distinguish ions with similar masses. Finally, while our model is explainable to some degree (through gradient-based attribution), it does not provide true peak annotations like some other methods26,36,37,39,49. Formula and fragment annotations are often useful for practitioners: they can improve confidence in the model’s predictions by allowing experienced users to manually validate predicted peak patterns against prior knowledge about fragmentation mechanisms. Additionally, when predicting spectra for identification, peak annotations can be helpful for inferring higher level properties of the compound, even when the spectrum on the whole contains considerable noise.

MassFormer has a number of exciting applications, largely focused on MS-based compound identification. We have already demonstrated how MassFormer can be used to identify a spectrum given a list of candidate structures. SIRIUS49 is a popular tool for identification that does not rely on spectrum prediction. It may be possible to combine MassFormer with SIRIUS (or another existing tool56) to improve structure identifications. For example, SIRIUS uses information in the query spectrum to predict chemical features (such as precursor formulae and fingerprints) that help with identification. These features could be used to refine a set of candidate compounds that would subsequently be ranked by predicted spectrum similarity. Alternatively, MassFormer could improve performance of a spectrum-to-structure generative model57,58,59 by predicting spectra for unlabelled compounds, similar to how CFM has been used to augment training data59. MassFormer could also be incorporated into a reinforcement learning framework, providing a spectral similarity signal that could help guide the structure prediction agent to the correct compound during training and inference60. In targeted metabolomics analysis, spectrum predictors like MassFormer could help identify collision energies that are most useful for distinguishing compounds of similar mass61 by measuring the pairwise similarities of their predicted spectra under different settings. Finally, MassFormer has potential for use in decoy generation62,63, which plays an important role in false discovery rate calibration in untargeted metabolomics experiments. Target-decoy methods work by introducing a number of pseudo-random data points (decoys) that are similar in distribution to real data, allowing for empirical estimation and tuning of confidence thresholds to meet a false discovery rate criterion. Applying MassFormer to predict noisy or adversarial spectra for compounds that are not present in the sample might be a viable strategy for generating realistic decoys and could reduce the chance of incorrect compound identifications.

Methods

Problem formulation

Spectrum prediction can be viewed as a supervised learning problem, with a dataset \({\{({{\bf{x}}}^{i},{{\bf{z}}}^{i},{{\bf{y}}}^{i})\}}_{i = 1}^{n}\), where \({{\bf{x}}}^{i}\) is a molecule and \({{\bf{y}}}^{i}\) is its spectrum under experimental conditions \({{\bf{z}}}^{i}\). The goal is to learn the parameters θ of the prediction function \({f}_{\bf{\uptheta} }:{{{\mathcal{X}}}}\times {{{\mathcal{Z}}}}\to {{{\mathcal{Y}}}}\) that maps chemicals in \({{{\mathcal{X}}}}\) to spectra in \({{{\mathcal{Y}}}}\), conditioned on metadata in \({{{\mathcal{Z}}}}\). Mass spectra can be represented as a set of peaks, each of which has an m/z location and an intensity. By discretizing the peak locations into m fixed-width bins (similar to refs. 27,28,29,35), a mass spectrum can be represented as an m-dimensional sparse vector, where each peak at location j has intensity \({y}_{j}\ge 0\). The problem of spectrum prediction can thus be formulated as vector regression, with \({{{\mathcal{Y}}}}={{\mathbb{R}}}_{\succcurlyeq 0}^{m}\). Following ref. 28, the spectral metadata \({{\bf{z}}}\in {{{\mathcal{Z}}}}\) can be provided as side information to the input molecule \({{\bf{x}}}\in {{{\mathcal{X}}}}\). These metadata consist of instrument settings and other covariates that might affect the predicted spectrum.
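As a concrete illustration of this binning, a peak list can be converted to a fixed-length vector as follows (bin width, mass range and normalization are assumptions for the sketch, not the exact preprocessing used in this work):

```python
# Illustrative binning of a peak list (m/z, intensity) into a fixed-width vector.
import numpy as np

def bin_spectrum(mzs, intensities, max_mz=1000.0, bin_width=1.0):
    num_bins = int(max_mz / bin_width)
    binned = np.zeros(num_bins)
    for mz, inten in zip(mzs, intensities):
        idx = int(mz / bin_width)
        if idx < num_bins:
            binned[idx] += inten          # sum intensities falling in the same bin
    if binned.max() > 0:
        binned /= binned.max()            # peak intensities are relative
    return binned
```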

Input featurization

The featurization of the input molecule x is critical, as it influences the structure of the spectrum prediction function fθ and can have an impact on downstream performance. Molecular fingerprints (also called molecular descriptors) represent molecules using expert-designed feature engineering. Common feature choices include presence of predefined substructures (used in the molecular access system (MACCS) fingerprint64) and hashed local substructure counts (used in the extended connectivity (Morgan) fingerprint65). Molecular graph representations capture the structure of a molecule by explicitly representing atoms as nodes and bonds as edges. The node features can encode various chemical properties associated with the atom (that is, element, formal charge, number of bonded hydrogens), while the edge features can encode bond information (that is, bond type, aromaticity). Such representations naturally lend themselves to graph neural networks and graph transformers (‘MassFormer architecture’) and can be more expressive than fingerprints. Our datasets do not include any conformer or three-dimensional coordinate information, although these properties can be estimated to varying degrees of accuracy42,66 and might be helpful for spectrum prediction35.

The features used for MassFormer are summarized in Supplementary Table 4. The node and edge features were chosen to be identical to the pretrained Graphormer model (‘Pretraining and fine-tuning’), to maintain compatibility. However, some of these features used in the pretrained model (formal charge, radical state and stereochemical information) were not applicable to our data and have been omitted for clarity.
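The sketch below illustrates how such node and edge features can be extracted with RDKit; the feature set is a loose approximation of Supplementary Table 4, not the exact featurization of the pretrained model:

```python
# Minimal molecular graph featurization: per-atom (element, degree, bonded hydrogens)
# and per-bond (endpoints, bond order) features.
from rdkit import Chem

def featurize(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    nodes = [(atom.GetAtomicNum(), atom.GetDegree(), atom.GetTotalNumHs())
             for atom in mol.GetAtoms()]
    edges = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondTypeAsDouble())
             for bond in mol.GetBonds()]
    return nodes, edges

nodes, edges = featurize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example input
```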

MassFormer architecture

Transformers67 are a general family of neural networks that model relationships between elements of a set. Originally developed for neural machine translation68, transformer models have proven useful in a variety of domains69,70. A number of graph transformers have been proposed34,71,72,73,74,75, motivated by the ability to model pairwise global interactions between all nodes in the graph. Many graph neural networks, such as graph attention network76, are similar in structure to graph transformers but can only model local relations in a single layer and require large depth to model interactions over longer distances31.

Our approach adapts the Graphormer34 architecture, a recent graph transformer model that boasts impressive results on chemical property prediction tasks34,77. The distinguishing characteristic of the Graphormer architecture is its unique positional encoding scheme. It uses shortest path information between nodes and associated edge embeddings along that path as a form of relative positional encoding. The shortest path information is computed as a preprocessing step for each graph, using the Floyd–Warshall algorithm78.

Like most transformers, MassFormer processes the input iteratively using MHA and MLP. Assume the model has L layers, each with M attention heads. Let \({{\bf{h}}}_{i}^{(l)}\in {{\mathbb{R}}}^{d\times 1}\) be the representation of node i at layer l, with d being the node embedding size and \({{{\bf{h}}}_{i}^{(0)}}\) consisting of the node features described in Supplementary Table 4. The MHA operation for attention head m in layer l is described by equation (1), where \({{{W}}}_{V}^{\,(m,l)}\in {{\mathbb{R}}}^{d/M\times d}\) is a learnable projection matrix. The MLP operation is described by equation (2), where \({f}_{\mathbf{\upphi} }^{\;(l)}\) is the multilayer perceptron at layer l of the transformer.

$${{{\bf{h}}}_{i}^{\;(m,l)}=\mathop{\sum }\limits_{j=1}^{N}{a}_{i,j}^{\;(m,l)}\left({{{W}}}_{V}^{\;(m,l)}{{\bf{h}}}_{j}^{\;(l)}\right)}$$
(1)
$${{{\bf{h}}}_{i}^{\;(l+1)}={f}_{\bf{\upphi}}^{\;(l)}\left({{\bf{h}}}_{i}^{\;(l)}+{\parallel }_{m = 1}^{M}{{\bf{h}}}_{i}^{\;(m,l)}\right)+{\parallel }_{m = 1}^{M}{{\bf{h}}}_{i}^{\;(m,l)}}$$
(2)

Note that the intermediate representations for the attention heads \({{\bf{h}}}_{i}^{\;(m,l\,)}\in {{\mathbb{R}}}^{d/M\times 1}\) are concatenated along the head dimension before being processed by the MLP \({f}_{\bf{\upphi} }^{\;(l)}\). For simplicity, dropout79 and layer normalization80 operations have been omitted.

The attention mechanism \({a}_{i,\;j}\) is described in equations (3) and (4), without layer and attention head indices (to improve clarity). \({{{W}}}_{K},{{{W}}}_{Q}\in {{\mathbb{R}}}^{d/M\times d}\) are the standard learnable key and query projection matrices, \({n}_{i,\;j}\in {\mathbb{N}}\) is the shortest path distance between nodes i and j, and \(b({n}_{i,\;j})\in {\mathbb{R}}\) is a learnable scalar indexed by \({n}_{i,\;j}\). The last variable \({c}_{i,\;j}\in {\mathbb{R}}\) is the edge embedding term described by equation (4), where \({{{\bf{e}}}_{i,\;j,p}\in {{\mathbb{R}}}^{d/M\times 1}}\) is the embedding corresponding to the pth edge in the shortest path between i and j, and \({{{\bf{w}}}_{p}\in {{\mathbb{R}}}^{d/M\times 1}}\) is a learnable weight for that position.

$${a}_{i,j}={{\mathtt{softmax}}}_{j}\left(\frac{{({{{W}}}_{Q}{{\bf{h}}}_{i})}^{T}({{{W}}}_{K}{{\bf{h}}}_{j})}{\sqrt{d}}+b({n}_{i,j})+{c}_{i,j}\right)$$
(3)
$${c}_{i,j}=\frac{1}{{n}_{i,j}}\mathop{\sum}\limits_{p}{ {\bf{{w}}}_{p}}^{T}{{\bf{e}}}_{i,j,p}$$
(4)
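A single-head sketch of this biased attention (equations (3) and (4)) is shown below; tensor shapes, the maximum path distance and the precomputation of the edge-path term are assumptions, and the real model uses multiple heads within Graphormer's full implementation:

```python
# Scaled dot-product attention with a learned shortest-path bias b(n_ij) and a
# precomputed edge-path bias c_ij, as in equations (3) and (4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedAttention(nn.Module):
    def __init__(self, d, max_dist=20):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)
        self.k = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)
        self.spatial_bias = nn.Embedding(max_dist + 1, 1)  # one learnable scalar per distance

    def forward(self, h, dist, edge_bias):
        # h: [N, d] node embeddings; dist: [N, N] integer shortest-path lengths;
        # edge_bias: [N, N] precomputed c_ij terms from edge embeddings along each path
        d = h.size(-1)
        scores = (self.q(h) @ self.k(h).T) / d ** 0.5        # standard attention term
        scores = scores + self.spatial_bias(dist).squeeze(-1) + edge_bias
        attn = F.softmax(scores, dim=-1)                     # a_ij
        return attn @ self.v(h)
```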

Before transformer processing, a readout node (with index N + 1) is initialized with a unique embedding and connected to all other nodes in the graph with a special edge type. Similar to the CLS token in language transformers81, the readout node’s final embedding \({{{\bf{h}}}_{N+1}^{L}\in {{\mathbb{R}}}^{d\times 1}}\) is interpreted as a summarized representation of the input molecule. This chemical embedding is concatenated with the metadata embedding \({{\bf{z}}}\in {{\mathbb{R}}}^{{d}^{{\prime} }\times 1}\) and passed to a large MLP fψ (equation (5)), which outputs a vector \({{\bf{s}}}\in {{\mathbb{R}}}^{{d}^{{\prime}{\prime}}\times 1}\), where \({d}^{{\prime}{\prime}}\) is the hidden dimension.

$${{\bf{s}}}={f}_{\uppsi }\left({{\bf{h}}}_{N+1}^{L}\parallel {{\bf{z}}}\right)$$
(5)

Following ref. 27, we apply bidirectional prediction to get the resulting spectrum \({{{\hat{\bf{y}}}}\in {{\mathbb{R}}}^{q\times 1}}\), where q is the number of mass bins. This process works by calculating a forward spectrum \({{\hat{\bf{y}}}}_{\rm{F}}({{\bf{s}}})\) (equation (6)) and reverse spectrum \({{\hat{\bf{y}}}}_{\rm{R}}({{\bf{s}}})\) (equation (7)), then averaging them bin-wise using a gating mechanism \({{\hat{\bf{y}}}}_{\rm{G}}({{\bf{s}}})\) (equations (8) and (9)). The forward, reverse and gating functions are linear transformations of s parametrized by weight matrices \({{{W}}}_{\rm{F}},{{{W}}}_{\rm{R}},{{{W}}}_{\rm{G}}\in {{\mathbb{R}}}^{q\times {d}^{{\prime}{\prime}}}\) and bias vectors \({{\bf{b}}}_{\rm{F}},{{\bf{b}}}_{\rm{R}},{{\bf{b}}}_{\rm{G}}\in {{\mathbb{R}}}^{q\times 1}\) respectively.

$${{\hat{y}}}_{\rm{F}}{({{\bf{s}}})}_{i}={\left({{{W}}}_{\rm{F}}{{\bf{s}}}+{{\bf{b}}}_{\rm{F}}\right)}_{i}$$
(6)
$${{\hat{y}}}_{\rm{R}}{({{\bf{s}}})}_{{m}_{p}+\tau -i}={\left({{{W}}}_{\rm{R}}{{\bf{s}}}+{{\bf{b}}}_{\rm{R}}\right)}_{i}$$
(7)
$${{\hat{y}}}_{\rm{G}}{({{\bf{s}}})}_{i}={\mathtt{sigmoid}}{\left[{{{W}}}_{\rm{G}}{{\bf{s}}}+{{\bf{b}}}_{\rm{G}}\right]}_{i}$$
(8)
$${{{\hat{y}}}_{\rm{FR}}{({{\bf{s}}})}_{i}={{\hat{y}}}_{\rm{G}}{({{\bf{s}}})}_{i}{{\hat{y}}}_{\rm{F}}{({{\bf{s}}})}_{i}+(1-{{\hat{y}}}_{\rm{G}}{({{\bf{s}}})}_{i}){{\hat{y}}}_{\rm{R}}{({{\bf{s}}})}_{i}}$$
(9)
$${{{\hat{y}}}{({{\bf{s}}})}_{i}={\mathtt{relu}}\left[{\mathbb{I}}\left[i\le {m}_{p}+\tau \right]{{\hat{y}}}_{\rm{FR}}{({{\bf{s}}})}_{i}\right]}$$
(10)

Note that \({\mathbb{I}}[x]\in \{0,1\}\) is the indicator function, \({m}_{p}\) is the index of the bin corresponding to the precursor mass and \(\tau \in {\mathbb{N}}\) is a tolerance parameter. The final spectrum \({{\hat{\bf{y}}}}\) (equation (10)) is constrained to be non-negative with a rectified linear (ReLU) activation82 and contains zeros for all bin indices \(i > {m}_{p}+\tau\), preventing prediction of peaks that are substantially larger than the precursor. This spectrum can be normalized appropriately depending on the context (L2 for training, L1 for inference).
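A sketch of this bidirectional prediction head (equations (6)–(10)) for a single spectrum embedding is shown below; the exact indexing, batching and masking details of the real implementation may differ:

```python
# Bidirectional prediction head: forward spectrum, reverse spectrum indexed from the
# precursor bin, sigmoid gating and a ReLU-masked output, following equations (6)-(10).
import torch
import torch.nn as nn

class BidirectionalHead(nn.Module):
    def __init__(self, d_in, num_bins):
        super().__init__()
        self.fwd = nn.Linear(d_in, num_bins)
        self.rev = nn.Linear(d_in, num_bins)
        self.gate = nn.Linear(d_in, num_bins)

    def forward(self, s, prec_bin, tau=3):
        # s: [d_in] embedding for one spectrum; prec_bin: precursor mass bin index (int)
        q = self.fwd.out_features
        y_f = self.fwd(s)                                    # equation (6)
        r = self.rev(s)
        idx = prec_bin + tau - torch.arange(q)               # equation (7): r_i lands at bin m_p + tau - i
        valid = (idx >= 0) & (idx < q)
        y_r = torch.zeros(q)
        y_r[idx[valid]] = r[valid]
        g = torch.sigmoid(self.gate(s))                      # equation (8)
        y = g * y_f + (1 - g) * y_r                          # equation (9)
        mask = (torch.arange(q) <= prec_bin + tau).float()
        return torch.relu(y * mask)                          # equation (10)
```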

Pretraining and fine-tuning

Fine-tuning a pretrained model can offer improved performance over training a randomly initialized model from scratch, particularly when data scarcity is a concern. We initialize the parameters of our graph transformer module and trainable input node and edge embeddings with the corresponding parameters of a pretrained Graphormer model. This model was originally trained on the PCQM4Mv2 dataset83,84, a large dataset of approximately 4 million molecular graphs and associated density functional theory (DFT)-simulated quantum chemical properties. The pretraining task is a supervised graph-level regression problem of predicting the energy gap between the highest occupied molecular orbital and the lowest unoccupied molecular orbital (dubbed the HOMO-LUMO gap). While this task is not directly related to MS, the roughly 100-fold larger compound coverage of the PCQM4Mv2 dataset provides an opportunity for the model to learn general chemical representations that transfer to the spectrometry task. In Supplementary Table 7, we perform model ablations to determine the relative contributions of different aspects of the fine-tuning process. Our experiments demonstrate that using pretrained weights is necessary to scale up the number of model parameters while maintaining training stability and performance. We also find random re-initialization of the layer normalization80 statistics to be helpful. The other key component of MassFormer, the spectrum prediction MLP, is always initialized randomly. Both modules are fine-tuned jointly for 20 epochs using a linearly decaying learning rate. For full details on the training procedure, please refer to the code repository (‘Code availability’).

Loss and similarity calculations

Since MS peak intensities are relative, it is advisable to use loss functions that are invariant to scaling. We choose cosine distance (equation (11)) as the loss function, where \({{\hat{\bf{y}}}}={f}_{\mathbf{\uptheta} }(\bf{x},{{\bf{z}}})\) is the predicted spectrum and y is the real spectrum. Cosine similarity is commonly used to compare spectra, so minimizing the cosine distance (thus maximizing similarity) is a natural choice and has been shown to work well in other prediction models27,28,29. During model training, a log transformation is applied to the intensities (similar to refs. 28,29) before calculating the loss, to increase the importance of low-intensity peaks in the objective function.

$${\rm{CD}}({{\bf{y}}},{{\hat{\bf{y}}}})=1-\frac{{{\bf{y}}}^{T}{{\hat{\bf{y}}}}}{{\parallel {{\bf{y}}}\parallel }_{2}{\parallel {{\hat{\bf{y}}}}\parallel }_{2}}=1-\frac{\mathop{\sum }\nolimits_{i = 1}^{m}{{y}}_{i}{{{\hat{y}}}}_{i}}{\sqrt{\mathop{\sum }\nolimits_{j = 1}^{m}{{y}}_{j}^{2}\mathop{\sum }\nolimits_{k = 1}^{m}{{{\hat{y}}}}_{k}^{2}}}$$
(11)
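A minimal implementation of this training loss might look as follows; the log1p transform here is an assumption standing in for the log transformation described above:

```python
# Cosine distance loss (equation (11)) on log-transformed intensities.
import torch
import torch.nn.functional as F

def cosine_distance_loss(y_true, y_pred, eps=1e-8):
    y_true = torch.log1p(y_true)                 # boost the influence of low-intensity peaks
    y_pred = torch.log1p(torch.relu(y_pred))
    return 1.0 - F.cosine_similarity(y_true, y_pred, dim=-1, eps=eps).mean()
```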

As there are often multiple spectra with different collision energies that correspond to the same precursor, it can be useful to ‘merge’ target and predicted spectra across collision energies by averaging their intensities in binned spectrum space. Variations of collision energy merging are commonplace in the field49,51,56. Merging helps prevent distorted similarity scores resulting from uninformative spectra with very few peaks, which tends to happen when the collision energy is either too high or too low.

Another important consideration is the method of similarity score aggregation. The simplest approach is to average scores over spectra, denoted as ‘spectrum aggregation’. An alternate approach is to first average scores per molecule (that is, across precursor adducts), then average again across molecules. This approach, denoted as ‘molecule aggregation’, ensures that molecules with many precursor adducts are not over-represented in the final score.
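Both operations reduce to simple averages, as sketched below (the data structures are assumptions):

```python
# Collision-energy merging and molecule-level score aggregation.
import numpy as np

def merge_spectra(binned_spectra):
    """Average binned spectra recorded at different collision energies."""
    return np.mean(np.stack(binned_spectra, axis=0), axis=0)

def molecule_aggregate(scores_by_molecule):
    """Average scores within each molecule first, then across molecules."""
    return float(np.mean([np.mean(s) for s in scores_by_molecule.values()]))
```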

Throughout the manuscript (Fig. 2 and Extended Data Figs. 1 and 2), spectral similarity is reported as average cosine similarity on untransformed merged spectra with molecule aggregation. The impact of spectrum intensity transformations (log transform, square root transform and precursor peak removal), similarity functions (cosine, Jensen–Shannon and Jaccard), spectrum merging and score aggregation methods are explored in detail in Supplementary Tables 9 and 10. While individual model scores vary depending on the exact configuration of the similarity calculation (Supplementary Table 10), the order of the models remains largely consistent, with MassFormer being ranked best for the vast majority of configurations (Supplementary Table 9).

Gradient-based feature attribution

GI45,46 is an attribution method that assigns importance scores to input variables based on the sensitivity of the model’s predictions to changes in those variables, estimated using gradients. It satisfies the conservation axiom, an underlying assumption for many explainable AI approaches, which posits that “scores assigned to input variables and forming the explanation must sum to the output of the network”47. If a model is sensitive to changes in certain parts of the input, these features are taken to be more important for making a correct prediction. GI methods compute the attribution score as the dot product of the gradient vector with the input vector. Inputs with scores close to 0 are unimportant, while those with large positive or negative scores are interpreted as contributing positively or negatively (respectively) to the prediction.

More formally, let \({{\bf{x}}}\in {{\mathbb{R}}}^{D}\) be an input vector with D dimensions, let \({{\bf{y}}}\in {{\mathbb{R}}}^{K}\) be an output value associated with x and \({f}_{\bf{\uptheta} }({{\bf{x}}}):{{\mathbb{R}}}^{D}\to {{\mathbb{R}}}^{K}\) be a neural network. Let \({{{\mathcal{L}}}}({{\bf{y}}},{{\hat{\bf{y}}}}):{{\mathbb{R}}}^{K}\times {{\mathbb{R}}}^{K}\to {\mathbb{R}}\) be a scalar loss function (for example, cosine distance). The GI score for the model on this input vector is defined precisely by equation (12):

$${{{\rm{GI}}}}({{\bf{x}}},{{\bf{y}}})={{\bf{x}}}\cdot {\nabla }_{{{\bf{x}}}}{{{\mathcal{L}}}}({{\bf{y}}},{f}_{\uptheta }({{\bf{x}}}))$$
(12)

While GI score calculation is generally applicable to any type of fully differentiable neural network, there are certain aspects of the transformer architecture (the self-attention mechanism67 and layer normalization module80) that violate the conservation axiom and, in practice, reduce the quality of model explanations47. To address this problem, we make slight modifications to these modules when calculating GI scores, as recommended in ref. 47.

In our experiments, the neural network fθ in equation (12) is MassFormer, which maps input molecules to K-dimensional binned spectrum vectors (typically K = 1,000). To compute the GI scores for a peak in bin k, we use a loss function that zeros out all other peaks in the spectrum. More precisely, we define the loss \({{{{\mathcal{L}}}}}_{k}\) for bin k using equation (13):

$${{{{\mathcal{L}}}}}_{k}({{\bf{x}}})={{\bf{e}}}_{k}\cdot {f}_{\uptheta }({{\bf{x}}})$$
(13)

In the above formulation, \({{\bf{e}}}_{k}\) is the kth standard basis vector for \({{\mathbb{R}}}^{K}\) (in other words, a one-hot vector where the kth entry is 1).
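A sketch of this per-peak GI computation using automatic differentiation is shown below; the model interface is an assumption, and the modified attention and normalization rules from ref. 47 are omitted:

```python
# Gradient x Input for one peak bin k (equations (12) and (13)): take the gradient of
# the predicted intensity in bin k with respect to the atom embeddings, then multiply
# element-wise by the embeddings themselves.
import torch

def gi_map_for_peak(model, graph_batch, atom_embeds, k):
    atom_embeds = atom_embeds.detach().requires_grad_(True)
    spectrum = model(graph_batch, atom_embeds)   # assumed interface: returns [num_bins]
    loss_k = spectrum[k]                         # equation (13): select bin k
    (grad,) = torch.autograd.grad(loss_k, atom_embeds)
    return atom_embeds.detach() * grad           # unreduced GI map, one row per atom
```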

Note that MassFormer takes a molecular graph as input, which is subsequently preprocessed into different types of embeddings (‘Input featurization’). We compute attribution scores only with respect to the element embeddings, as our experiments in the section ‘Explaining peak predictions with gradient-based attribution’ involve discriminating peaks by element composition. To visualize the GI attribution maps (that is, Fig. 4b), we estimate GI scores for each atom and then L2 normalize across atoms, to remove information about gradient magnitude and focus only on direction. We employ a slight variation of the aforementioned GI computation (equation (12)) for the linear projections: instead of summing over the embedding dimension to produce a single scalar score per atom, we perform PCA on the unreduced GI maps. Summing over the dimensions results in more interpretable scores, but might remove variation that could inform the PCA projection.

Our GI analysis is limited to spectra from the NIST-Scaffold test set that satisfy the following criteria:

  • Contain 10 or more peaks

  • Contain at least two hetero-atom peaks and two non-hetero-atom peaks

  • Can be accurately predicted by MassFormer (≥0.8 cosine similarity)

  • Can be accurately annotated by SIRIUS (≥0.8 CSI:FingerID tree score48,49)

We apply these criteria to remove instances where the peaks are trivially separable and focus analysis on examples where the model’s predictions are reliable, since inaccurate predictions can be difficult to interpret.

Datasets and training splits

We use NIST for both training and evaluation; it is a commercial dataset notable for its large coverage (over one million tandem spectra in total), standardized spectrum acquisition protocol and a high degree of manual validation. We also use the publicly available MoNA as a held-out evaluation set. MoNA is an important dataset for the MS community, as it is one of the largest freely accessible sources of small molecule spectra. The repository contains data from a variety of other public sources, including GNPS17, HMDB13 and ReSpect18. For simplicity, we only consider spectra from Orbitrap instruments that use higher-energy collisional dissociation1, as this corresponds to the largest subset of data in NIST. Furthermore, we restrict the dataset to only include positive mode spectra corresponding to one of six highly occurring precursor adducts ([M + H]+, [M + H − H2O]+, [M + H − 2H2O]+, [M + 2H]2+, [M + H − NH3]+ and [M + Na]+). The dataset statistics are summarized in Supplementary Table 1. After filtering, NIST is a much larger dataset than MoNA, both in terms of number of spectra and compounds. For this reason, we rely on NIST data to train the models and use MoNA exclusively for evaluation.

Two kinds of data splitting techniques are employed in the section ‘Predicting spectra for unseen compounds’. Both methods involve splitting spectra based on compound identity, to avoid leakage of spectra that differ only in metadata (such as collision energy or precursor type) but not in structure. The ‘InChIKey’ split uses non-stereochemical InChIKey strings85, which are hashed chemical string representations that are commonly used as molecular identifiers. This approach is, in essence, a simple random split based on compound identity. In contrast, the ‘Scaffold’ split uses Murcko Scaffolds41 to coarsely cluster compounds before splitting in a manner such that all the compounds (and associated spectra) from one cluster end up in the same partition. Scaffold splitting introduces a distributional shift between training and test data and is commonly used to evaluate deep learning models in small molecule applications86. A breakdown of molecule counts for each splitting technique is included in Supplementary Table 2.
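A minimal scaffold split can be sketched with RDKit as follows; the split fraction and cluster ordering are assumptions, and the actual splitting procedure may differ:

```python
# Murcko scaffold split: compounds sharing a scaffold always land in the same partition.
import random
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_frac=0.1, seed=0):
    groups = defaultdict(list)
    for smi in smiles_list:
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smi)  # '' for acyclic molecules
        groups[scaffold].append(smi)
    clusters = list(groups.values())
    random.Random(seed).shuffle(clusters)
    n_test = int(test_frac * len(smiles_list))
    test, train = [], []
    for cluster in clusters:
        (test if len(test) < n_test else train).extend(cluster)
    return train, test
```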

Baseline models

We compare our method with two related deep learning models, based on existing approaches from the literature. The fingerprint model combines extended connectivity (Morgan) fingerprints65, MACCS64 and RDKit topological fingerprints (all implemented in the RDKit library42) and uses those directly as the chemical embedding. The WLN model uses a molecular graph representation in combination with a Weisfeiler–Lehman Network40, which is a particular kind of graph neural network, to produce the chemical embedding. These models are based on two previously published MS prediction models: refs. 27 and 28, respectively. However, both models require re-implementation for direct comparison: the former27 is designed for a different type of MS data (electron ionization spectra), while the latter28 does not have publicly available code. We communicated with the authors of ref. 28 to ensure that our WLN model is similar to their best configuration. The parameter counts of each of the deep learning models are summarized in Supplementary Table 3.
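A sketch of the concatenated fingerprint input used by the FP baseline is shown below; the fingerprint radius and bit sizes are assumptions:

```python
# Concatenated Morgan, MACCS and RDKit topological fingerprints as a fixed-length input.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

def fingerprint_embedding(smiles: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    morgan = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)  # extended connectivity
    maccs = MACCSkeys.GenMACCSKeys(mol)                                 # MACCS keys
    topo = Chem.RDKFingerprint(mol, fpSize=2048)                        # RDKit topological fingerprint
    return np.concatenate([np.array(list(fp)) for fp in (morgan, maccs, topo)])
```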

In addition, we compare our model to CFM24,26, a well-known tool for small molecule spectrum prediction. CFM is not a deep learning approach: it uses combinatorial fragmentation to determine the set of possible molecule fragments, then fits a probabilistic model to predict the likelihood of the fragments. CFM is designed to predict quadrupole time-of-flight spectra at specific collision energies (10, 20 and 40 eV) and with a limited set of precursor adducts (for positive mode, only [M + H]+). It is not straightforward to retrain CFM on the NIST dataset, which contains Orbitrap spectra covering a wide range of precursor adducts and collision energies. Extending CFM to support these spectra would require non-trivial algorithmic modifications. Additionally, scaling CFM’s training procedure to a dataset as large as NIST (three to four times more training molecules, depending on the split) would be computationally challenging. For these reasons, we use the most recent pretrained version of CFM26 for all experiments. This version was trained on a 3,885-molecule subset of the Metlin dataset22; see Supplementary Table 5 for information about the overlap of CFM’s training set with the other datasets. In our experiments with CFM, we map each NCE to the nearest CFM-supported absolute collision energy. To convert from normalized to absolute collision energy (ACE), we use equation (14), where p is the precursor, m(p) is the precursor mass and c(p) is the charge factor (1.0 for singly charged precursors, 0.9 for doubly charged precursors).

$${{{\rm{ACE}}}}(\,p)=\frac{m(\,p)\times c(\,p)\times {{{\rm{NCE}}}}(\,p)}{500}$$
(14)
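Equation (14) translates directly into code; the handling of charge states beyond doubly charged precursors is not specified in the text and is omitted here:

```python
# NCE to absolute collision energy (ACE), following equation (14).
def nce_to_ace(nce: float, precursor_mass: float, charge: int = 1) -> float:
    charge_factor = 1.0 if charge == 1 else 0.9  # 1.0 for singly, 0.9 for doubly charged precursors
    return precursor_mass * charge_factor * nce / 500.0
```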

CFM also contains a rule-based module for lipid spectrum prediction25. We employ this module to make predictions instead of the probabilistic approach wherever applicable.

Spectrum identification task setup

The CASMI 2016 (ref. 51) query set contains Orbitrap spectra for 188 unique compounds, with 124 remaining after filtering for supported precursor adducts ([M + H]+) and removing charged compounds. Each spectrum is merged over three NCEs: 20, 35 and 50. Additionally, each query has an associated candidate list assembled by searching the ChemSpider database87 for compounds with similar mass to the precursor (see ref. 51 for full details). After preprocessing, there are on average 1,251 candidate compounds per spectrum.

The CASMI 2022 (ref. 52) query set contains Orbitrap spectra for 500 unique compounds, with 228 remaining after filtering for supported precursor adducts ([M + H]+ or [M + Na]+) and removing compounds with unsupported elements (Supplementary Table 4). The query spectra are extracted from a raw MSP file dump, based on the approach used in ref. 59. Subsequently, each spectrum is merged over three NCEs: 35, 45 and 65. Unlike CASMI 2016, candidate lists are not provided by contest organizers.

To construct the NIST20 Outlier query set, we select compounds from the NIST dataset with unique Murcko Scaffolds. We sample 400 such compounds in total, stratified by molecular weight: 75 with weight <200 Da, 75 with weight in [200, 300) Da, 75 with weight in [300, 400) Da and 75 with weight ≥400 Da. The motivation is to select a diverse group of outlier compounds to use as queries in the identification task. For each compound, all [M + H]+ spectra (of any collision energy) from NIST are merged.

The candidate sets for NIST20 Outlier and CASMI 2022 are selected by sampling compounds from PubChem with molecular mass within a certain tolerance of the true query mass (0.5 ppm for NIST20 Outlier, 10 ppm for CASMI 2022), up to a maximum of 10,000 per query. We filter the candidate lists to remove unsupported elements and multimolecular compounds. Stereochemical information is also removed, with stereoisomeric candidates being deduplicated. After preprocessing, there are on average 2,201 candidates per spectrum for NIST20 Outlier and 4,849 per spectrum for CASMI 2022. While it would be possible to improve scores of all methods with a more refined candidate structure selection26,49, our approach reflects the scenario where there is little a priori information about the query compound other than its approximate mass and precursor adduct.

The models are scored using an array of different unnormalized and normalized ranking metrics (Tables 1 and 2 and Extended Data Fig. 4) inspired by those typically employed in CASMI contests51,52. Rank corresponds to the average rank of the true candidate compound, with 1 being the best score. Its normalized counterpart, normalized rank, corresponds to the average rank of the true candidate expressed as a fraction of the total number of candidates, with 0 being the best score and 1 being the worst. Top-k accuracy represents the frequency with which the true candidate is ranked in the top k candidates and ranges from 0 to 1. Top-k% accuracy is the normalized equivalent, measuring how often the correct candidate appeared in the top k% of candidates. In contrast to the average rank metrics, top-k and top-k% metrics do not strongly penalize ranking the correct candidate extremely poorly (any rank outside of the top k or k% is equally bad). Orthogonally, the normalized metrics are less sensitive to differences caused by variation in the number of candidates per query, which can be informative.
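Given the rank of the true candidate and the candidate-list size for each query, these metrics can be computed as sketched below; the exact normalization of the normalized rank is an assumption:

```python
# Ranking metrics from 1-indexed ranks of the true candidate and candidate-list sizes.
import numpy as np

def ranking_metrics(true_ranks, num_candidates, k=1, k_pct=1.0):
    true_ranks = np.asarray(true_ranks, dtype=float)
    num_candidates = np.asarray(num_candidates, dtype=float)
    return {
        "rank": true_ranks.mean(),
        # normalized so that rank 1 maps to 0 and the last rank maps to 1 (assumed convention)
        "norm_rank": ((true_ranks - 1) / (num_candidates - 1)).mean(),
        f"top_{k}": (true_ranks <= k).mean(),
        f"top_{k_pct}pct": (true_ranks <= np.ceil(num_candidates * k_pct / 100)).mean(),
    }
```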

Following ref. 29, we investigate the effects of candidate-query similarity on retrieval performance. To estimate chemical similarity between molecules, we compute Tanimoto similarity of their MACCS fingerprints. Then, we select the 20% of candidates most and least similar to the query and record ranking metrics on both candidate subsets (Table 2).
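The candidate–query similarity can be computed with RDKit in a few lines, as sketched below:

```python
# Tanimoto similarity of MACCS key fingerprints between two molecules.
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

def maccs_tanimoto(smiles_a: str, smiles_b: str) -> float:
    fp_a = MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(smiles_a))
    fp_b = MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(smiles_b))
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)
```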

Our ranking tasks bear some resemblance to the evaluation in ref. 29, but there are a number of key differences. The authors of ref. 29 use a large set of query spectra, but for each query, they subsample a small number of candidates from the original PubChem set (either randomly, or intentionally choosing the top M candidates most and least similar to the query, where M ≤ 1,000). Our evaluations use a smaller set of query compounds, but for each query sample a large number of candidates (up to 10,000 per query). Their approach also assumes knowledge of the query’s mass formula, further constraining candidate set composition. In contrast, we only assume knowledge of the precursor m/z (common in real-world MS identification problems) and precursor adduct (consistent with CASMI 2016, but uncommon in real-world problems). Although the task in ref. 29 uses unmerged query spectra, we apply collision energy merging (‘Loss and similarity calculations’) to remain consistent with CASMI 2016 and 2022. Our approach increases the information content of the query spectrum, avoiding situations where the query is not specific enough to identify a compound (that is, only one or two high-intensity peaks). However, the unmerged approach is also valuable, since such uninformative spectra sometimes appear in real-world untargeted MS experiments (which may only cover a single collision energy), and evaluating model behaviour in these settings could be useful.

Implementation details

All models are implemented in PyTorch88. Our model, MassFormer, uses a modified version of the Graphormer v2 implementation34, which relies on PyTorch Geometric89. The WLN model is implemented using Deep Graph Library90,91. We also adapt some code from ref. 28 for spectrum preprocessing. Before benchmarking the FP and WLN baseline models, we ran a hyperparameter sweep using Weights & Biases92, with a budget of 100 initializations, to find the best-performing configuration on the NIST validation set. The hyperparameters that we optimize are learning rate, weight decay, dropout, minibatch size and architecture specific parameters (such as hidden dimension and number of layers). For full model details and hyperparameter configurations, please refer to the code repository (‘Code availability’).